Parallel Processing

We have seen one example of parallelizing an algorithm to use multiple CPUs: the animated quick sort example. The algorithm requires shared memory since the array is sorted ``in place'' and needs to be accessed by all the worker threads. A bag of tasks is used to distribute the work.

We can categorize parallel algorithms along several lines, using the abbreviations shown in parentheses:

  1. coarse (CG) versus fine grain (FG), determined by the frequency of thread synchronization or communication relative to the amount of computation being done;

  2. shared memory (SM) multiprocessor versus distributed memory (DM) CPUs, determined by the presence of shared data;

  3. message passing (MP) versus semaphores, monitors, barriers, or join() (SY), where the latter set requires shared memory but the former does not imply distributed memory, since message passing can be used on shared memory as well;

  4. worker crew (WC) with a bag of tasks and a fixed number of workers, perhaps based on the number of CPUs, versus data parallelism (DP), where the amount of data or work to do determines the number of worker threads spawned and CPUs needed.

In some cases the need for shared memory can be relaxed. If the shared data is read-only, it can be replicated or broadcast to each distributed memory and a copy stored there. In other situations, it might be possible to pass the shared data around as a message, to be updated by whichever thread currently owns it. For example, it might be possible to replace

    /* shared */ int N;
    /* shared */ Object mutex = new Object();
                         // ...
    synchronized (mutex) { N = N + 1; }

with

    /* local */ int N;
                         // ...
    N = receive(channel);
    N = N + 1;
    send(channel, N);
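The circulating-counter idea above can be sketched as a runnable program. Here a java.util.concurrent.BlockingQueue stands in for the send/receive channel (an assumption; the text's channel API is not shown), and the counter N travels as a message that each thread receives, increments, and sends on.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch: the counter N circulates as a message; whichever thread
// holds it may update it, then passes it on.  No shared variable and
// no mutex are needed.
public class TokenCounter {
    static int circulate(int start, int workers) throws InterruptedException {
        BlockingQueue<Integer> channel = new ArrayBlockingQueue<>(1);
        channel.put(start);                       // inject the token
        Thread[] t = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            t[i] = new Thread(() -> {
                try {
                    int n = channel.take();       // N = receive(channel)
                    channel.put(n + 1);           // send(channel, N + 1)
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t[i].start();
        }
        for (Thread th : t) th.join();
        return channel.take();                    // final value of N
    }

    public static void main(String[] args) throws Exception {
        System.out.println(circulate(0, 4));      // four increments of 0
    }
}
```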

There is too much overhead and inefficiency with FG unless the communication and synchronization tools are highly optimized, perhaps with hardware support. DM can be a network of workstations (NOW) or a specialized parallel architecture in which each CPU has its own memory and there is some kind of fast interconnect or switch connecting them.

Synchronization Package Classes

On a shared memory platform, a barrier can be used for thread synchronization. It is an extension of the semaphore idea. No thread can continue past the barrier until all threads have arrived at it; in other words, each thread arriving at the barrier blocks until all the other threads have arrived. There are two constructors and two versions of the gate() method: one for a one-dimensional and one for a two-dimensional arrangement of the data and threads.

    A Barrier for Thread Synchronization.
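The barrier idea can be sketched with the standard library's CyclicBarrier, which is analogous to (but not identical with) the package's Barrier class and its gate() method. Each worker computes a partial result and waits at the barrier; the main thread, also a party at the barrier, sums the results only after everyone has arrived.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// Sketch of the barrier idea: no thread gets past await() until all
// N+1 parties (N workers plus the main thread) have arrived.
public class BarrierDemo {
    static final int N = 4;
    static final int[] partial = new int[N];
    static volatile int total = 0;

    public static int run() throws Exception {
        CyclicBarrier barrier = new CyclicBarrier(N + 1);
        for (int i = 0; i < N; i++) {
            final int id = i;
            new Thread(() -> {
                partial[id] = id + 1;             // each worker computes its piece
                try {
                    barrier.await();              // block until everyone arrives
                } catch (InterruptedException | BrokenBarrierException e) { }
            }).start();
        }
        barrier.await();                          // main waits at the barrier too
        for (int p : partial) total += p;         // safe: all writes happened before
        return total;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run());                // 1+2+3+4 = 10
    }
}
```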

Example Programs

These examples are categorized according to CG or FG, SM or DM, MP or SY, WC or DP.

Calculate the first n prime numbers. Start up n filter threads. FG, SM or DM, MP, DP.

    Prime Number Sieve.
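A minimal sketch of the pipeline sieve, using BlockingQueue channels between filter threads (an assumption; the original example's message-passing classes are not shown). A generator feeds 2, 3, 4, ... into the pipeline; the first number each filter receives is a prime, and the filter forwards only the non-multiples downstream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of the pipeline sieve of Eratosthenes: n filter threads,
// one per prime found, connected by bounded channels.
public class Sieve {
    static List<Integer> firstPrimes(int n) throws InterruptedException {
        List<Integer> primes = new ArrayList<>();
        List<Thread> threads = new ArrayList<>();
        BlockingQueue<Integer> in = new ArrayBlockingQueue<>(64);
        Thread gen = new Thread(() -> {           // generator: candidates 2, 3, 4, ...
            try { for (int i = 2; ; i++) in.put(i); }
            catch (InterruptedException e) { }    // stopped when the sieve is done
        });
        gen.start();
        threads.add(gen);
        BlockingQueue<Integer> cur = in;
        for (int k = 0; k < n; k++) {
            int p = cur.take();                   // head of the stream is prime
            primes.add(p);
            final BlockingQueue<Integer> src = cur;
            final BlockingQueue<Integer> next = new ArrayBlockingQueue<>(64);
            Thread filter = new Thread(() -> {    // pass along non-multiples of p
                try {
                    while (true) { int x = src.take(); if (x % p != 0) next.put(x); }
                } catch (InterruptedException e) { }
            });
            filter.start();
            threads.add(filter);
            cur = next;
        }
        for (Thread t : threads) t.interrupt();   // tear the pipeline down
        return primes;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(firstPrimes(5));       // [2, 3, 5, 7, 11]
    }
}
```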

Sort an array of length n by creating a pipeline of length n. FG, SM or DM, MP, DP.

    Pipeline Sort.
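A sketch of the pipeline sort idea, again with BlockingQueue channels as an assumed stand-in for the message-passing classes. Stage i receives a stream of values, keeps the smallest it has seen, and forwards the rest; when the stream drains, stage i holds the i-th smallest element.

```java
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of the pipeline sort: n stages, each keeping its minimum
// and forwarding everything larger to the next stage.
public class PipeSort {
    static int[] sort(int[] a) throws InterruptedException {
        int n = a.length;
        int[] result = new int[n];
        Thread[] stages = new Thread[n];
        BlockingQueue<Integer> in = new ArrayBlockingQueue<>(n);
        for (int v : a) in.put(v);                    // feed all values in
        BlockingQueue<Integer> cur = in;
        for (int i = 0; i < n; i++) {
            final int slot = i, count = n - i;        // stage i sees n-i values
            final BlockingQueue<Integer> src = cur;
            final BlockingQueue<Integer> out = new ArrayBlockingQueue<>(n);
            stages[i] = new Thread(() -> {
                try {
                    int min = src.take();             // first value seen
                    for (int k = 1; k < count; k++) {
                        int x = src.take();
                        if (x < min) { out.put(min); min = x; }
                        else out.put(x);              // forward the larger value
                    }
                    result[slot] = min;               // stage's final value
                } catch (InterruptedException e) { }
            });
            stages[i].start();
            cur = out;
        }
        for (Thread t : stages) t.join();             // join() publishes result[]
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(Arrays.toString(sort(new int[]{3, 1, 4, 1, 5})));
    }
}
```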

Sort an array of length n by creating two sorting threads and a merge thread. Send half the array to each sorting thread. FG, SM or DM, MP, WC. Can you finish the code?

    Merge Sort Skeleton.

Sort an array of length n by creating n/2 worker threads. FG, SM or DM, MP, DP. Can you finish the code?

    Compare-Exchange Sort Skeleton.

For an N-by-N chess board, start up a thread for each row in column one in which a queen can be placed. The size of the board could be broadcast to each thread for DM. CG, SM or DM, MP, DP.

    Data Parallel N-Queens.

For each CPU, start up a worker thread that reads a row number for the column-one queen from the bag of tasks. CG, SM or DM, MP, WC.

    Master/Worker N-Queens.
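The worker-crew pattern can be sketched as follows; a ConcurrentLinkedQueue serves as the bag of tasks (an assumption, since the example's bag class is not shown). Each task is a row for the column-one queen; a fixed crew of workers repeatedly grabs a task and counts the solutions that begin with that placement.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of master/worker N-queens: a fixed crew of worker threads
// draws column-one placements from a bag of tasks until it is empty.
public class WorkerQueens {
    // true if a queen in (column placed, row) attacks no earlier queen
    static boolean safe(int[] col, int placed, int row) {
        for (int c = 0; c < placed; c++)
            if (col[c] == row || Math.abs(col[c] - row) == placed - c) return false;
        return true;
    }

    // count completions of a partial placement of `placed` queens
    static int count(int n, int[] col, int placed) {
        if (placed == n) return 1;
        int total = 0;
        for (int r = 0; r < n; r++)
            if (safe(col, placed, r)) {
                col[placed] = r;
                total += count(n, col, placed + 1);
            }
        return total;
    }

    static int solve(int n, int workers) throws InterruptedException {
        ConcurrentLinkedQueue<Integer> bag = new ConcurrentLinkedQueue<>();
        for (int r = 0; r < n; r++) bag.add(r);   // one task per column-one row
        AtomicInteger solutions = new AtomicInteger();
        Thread[] crew = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            crew[w] = new Thread(() -> {
                Integer row;
                while ((row = bag.poll()) != null) {   // grab more work, if any
                    int[] col = new int[n];
                    col[0] = row;
                    solutions.addAndGet(count(n, col, 1));
                }
            });
            crew[w].start();
        }
        for (Thread t : crew) t.join();
        return solutions.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(solve(6, 3));          // 6-queens has 4 solutions
    }
}
```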

Modify the previous example so each worker executes on a networked workstation. The rendezvous technique is used: each worker is like a client that asks the server for more work to do. The program can also be run entirely locally with the -w command line option. This tests the three kinds of EstablishRendezvous objects that can be constructed. CG, DM, MP, WC.

    Multi-Machine Master/Worker N-Queens.

Sort an array of length n. In this example, every thread needs to communicate with every other thread, which is possible with SM, networked workstations, or a specialized DM architecture. FG, SM or DM, MP, DP.

    Radix Sort.


    Matrix Multiply Threads Using join().
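A minimal sketch of the join() version: one thread per row of the result matrix, with Thread.join() as the only synchronization. The join calls both wait for the workers and make their writes to the result visible to the main thread.

```java
// Sketch of data-parallel matrix multiply: each thread fills one row
// of the product; the main thread joins them all before reading it.
public class MatMul {
    static int[][] multiply(int[][] a, int[][] b) throws InterruptedException {
        int n = a.length, m = b[0].length, k = b.length;
        int[][] c = new int[n][m];
        Thread[] t = new Thread[n];
        for (int i = 0; i < n; i++) {
            final int row = i;
            t[i] = new Thread(() -> {             // compute row `row` of c
                for (int j = 0; j < m; j++)
                    for (int x = 0; x < k; x++)
                        c[row][j] += a[row][x] * b[x][j];
            });
            t[i].start();
        }
        for (Thread th : t) th.join();            // wait for every row
        return c;
    }

    public static void main(String[] args) throws Exception {
        int[][] c = multiply(new int[][]{{1, 2}, {3, 4}},
                             new int[][]{{5, 6}, {7, 8}});
        System.out.println(c[0][0] + " " + c[0][1] + " "
                         + c[1][0] + " " + c[1][1]);   // 19 22 43 50
    }
}
```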


    Matrix Multiply Threads Using a Semaphore.


    Laplace Grid Using a Barrier.

Create a new worker thread each time a section of the array is partitioned. CG, SM, SY, DP.

    Data Parallel Quick Sort.
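The spawn-per-partition idea can be sketched like this: each partition step hands one half of the range to a freshly created thread and recurses on the other half itself, joining before returning so the caller always sees a fully sorted range. (A Lomuto partition is used here for brevity; the original example's partition scheme may differ.)

```java
import java.util.Arrays;

// Sketch of data-parallel quick sort: a new worker thread is spawned
// for one side of every partition, so the thread count grows with the
// amount of work rather than the number of CPUs.
public class DPQuick {
    static void sort(int[] a, int lo, int hi) {
        if (lo >= hi) return;
        int pivot = a[hi], i = lo;                // simple Lomuto partition
        for (int j = lo; j < hi; j++)
            if (a[j] < pivot) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; }
        int t = a[i]; a[i] = a[hi]; a[hi] = t;
        final int p = i;
        Thread left = new Thread(() -> sort(a, lo, p - 1));  // new worker
        left.start();
        sort(a, p + 1, hi);                       // this thread takes the rest
        try { left.join(); } catch (InterruptedException e) { }
    }

    public static void main(String[] args) {
        int[] a = {5, 3, 8, 1, 9, 2};
        sort(a, 0, a.length - 1);
        System.out.println(Arrays.toString(a));   // [1, 2, 3, 5, 8, 9]
    }
}
```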

FG, SM or DM, MP, DP.

    Systolic Array Matrix Multiply.

Replace the shared grid array and the barrier with message passing. FG, SM or DM, MP, DP.

    Laplace Grid Using Message Passing.

Laboratory Exercises

  1. Implement one-way merge sort in Java using message passing.
  2. Implement multi-way (binary tree) merge sort using message passing.
  3. Write a pipeline sieve of Eratosthenes using message passing.
  4. Complete the compare-exchange sort skeleton using message passing.
  5. Convert the sequential ``Game of Life'' simulation program into a multithreaded one where each cell has its own thread. Use semaphores, a barrier, or message passing for synchronization.
  6. Animate one or more of the above programs.

Last modified 10 September 1997.

© 1997 Stephen J. Hartley