Parallelization divides a program into threads. A thread is
a single flow of control within a process. It can be a unique flow
of control that performs a specific function, or one of several
instances of a flow of control, each of which is operating on a
unique data set.
On a V-Class server, parallel shared-memory programs run as
a collection of threads on multiple processors. When a program starts,
a separate execution thread is created on each system processor
on which the program is running. All but one of these threads are
then idle. The nonidle thread is known as thread 1, and this thread
runs all of the serial code in the program.
Spawn thread IDs are assigned only to nonidle threads when
they are spawned. This occurs when thread 1 encounters parallelism
and “wakes up” other idle threads to execute the
parallel code. Spawn thread IDs are consecutive, ranging from 0
to N-1, where N is the number of threads spawned as a result of the
spawn operation. This operation defines the current spawn context.
The spawn context is the loop, task list, or region that initiates
the spawning of the threads. Spawn thread IDs are valid only within
a given spawn context.
This means that the idle threads are not assigned spawn thread
IDs at the time of their creation. When thread 1 encounters a parallel
loop, task, or region, it spawns the other threads, signaling them
to begin execution. The threads then become active, acquire spawn
thread IDs, run until their portion of the parallel code is finished,
and go idle once again, as shown in Figure 6-1 “One-dimensional
parallelism in threads”.
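The life cycle above can be modeled in a few lines. This is a minimal sketch in Python, not the V-Class runtime: the function names `run_spawn_context` and `describe` are hypothetical, and a standard-library thread pool stands in for the idle threads. It shows only that spawn thread IDs are consecutive from 0 to N-1 and exist only for the duration of one spawn context.

```python
from concurrent.futures import ThreadPoolExecutor


def run_spawn_context(num_threads, work):
    """Model one spawn context (hypothetical helper, not an HP API).

    Submits one task per spawn thread ID 0..num_threads-1 and returns
    the results ordered by spawn thread ID.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # The IDs are created here and discarded when the call returns,
        # mirroring the rule that they are valid only within one context.
        futures = [pool.submit(work, spawn_id) for spawn_id in range(num_threads)]
        # Tasks may complete in any order; collecting by ID hides that.
        return [future.result() for future in futures]


def describe(spawn_id):
    """Stand-in for one thread's portion of the parallel code."""
    return f"spawn thread {spawn_id} ran its portion"


if __name__ == "__main__":
    for line in run_spawn_context(4, describe):
        print(line)
```

Once `run_spawn_context` returns, the pool is shut down and the IDs no longer refer to anything, which loosely corresponds to the threads going idle again.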
NOTE: Machine loading does not affect the number of threads spawned,
but it may affect the order in which the threads in a given spawn
context complete.
Loop transformations
Figure 6-1 “One-dimensional
parallelism in threads” above shows that
various loop transformations can affect the manner in which a loop
is parallelized.
To parallelize a loop, the compiler transforms it in a manner
similar to strip mining. However, unlike in strip mining, the outer
loop is conceptual. Because the strips execute on different processors,
there is no processor to run an outer loop like the one created
in traditional strip mining.
Instead, each thread runs its own copy of the loop. The starting and
stopping iteration values are variables that are determined at runtime
based on how many threads are available and which thread is running the
strip in question.
Example 6-2 Loop transformations
Consider the previous Fortran example written for an unspecified number
of iterations:
      DO I = 1, N
        A(I) = B(I) + C(I)
      ENDDO
The code shown in Figure 6-2 “Conceptual
strip mine for parallelization” is
a conceptual representation of the transformation the compiler performs
on this example when it is compiled for parallelization, assuming
that N >= NumThreads.
For N < NumThreads, the compiler uses N threads, assuming there is enough work in the
loop to justify the overhead of parallelizing it. If NumThreads is not an integral divisor of N, some threads perform fewer iterations than others.
NumThreads is the number of available threads. ThrdID is the ID number of the thread this particular
loop runs on, which is between 0 and NumThreads-1. A unique ThrdID is assigned to each thread, and the ThrdIDs are consecutive. So, for NumThreads = 8, as in Figure 6-1 “One-dimensional
parallelism in threads”,
8 loops would be spawned, with ThrdIDs = 0 through 7. These 8 loops are illustrated in Figure 6-3 “Parallelized
loop”.
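The runtime computation of each strip's bounds can be sketched as follows. This is an illustration in Python, not the compiler's generated code: the helper names `strip_bounds` and `parallel_add` are hypothetical, and the rule for distributing leftover iterations (the lowest-numbered threads each take one extra) is an assumption, since the text says only that some threads perform fewer iterations than others. The strips are run one after another here; on the server each would run on its own thread.

```python
def strip_bounds(n, num_threads, thrd_id):
    """Return (first, last), the inclusive 1-based iterations for thrd_id.

    Returns None when the thread receives no iterations.
    Assumption: leftover iterations go to the lowest-numbered threads.
    """
    used = min(n, num_threads)       # for N < NumThreads, only N threads run
    if thrd_id >= used:
        return None
    base, extra = divmod(n, used)    # 'extra' threads run one more iteration
    first = thrd_id * base + min(thrd_id, extra) + 1
    last = first + base - 1 + (1 if thrd_id < extra else 0)
    return first, last


def parallel_add(b, c, num_threads):
    """Compute A(I) = B(I) + C(I) strip by strip."""
    n = len(b)
    a = [None] * n
    for thrd_id in range(num_threads):   # the conceptual outer loop
        bounds = strip_bounds(n, num_threads, thrd_id)
        if bounds is None:
            continue
        first, last = bounds
        for i in range(first, last + 1):  # the loop each thread runs
            a[i - 1] = b[i - 1] + c[i - 1]
    return a


if __name__ == "__main__":
    for thrd_id in range(8):
        print(thrd_id, strip_bounds(10, 8, thrd_id))
```

With N = 10 and NumThreads = 8 under this assumed rule, threads 0 and 1 each run two iterations and the remaining six run one each, so the strips are contiguous and together cover every iteration exactly once.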
NOTE: The strip-based parallelism described here is the default. Stride-based
parallelism is possible through use of the prefer_parallel and loop_parallel compiler directives and pragmas.
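For contrast with the contiguous strips of the default scheme, a stride-based assignment can be pictured as dealing iterations out round-robin. The following is a minimal sketch in Python under that assumption. The function name `stride_iterations` is hypothetical, and the sketch does not show the directive syntax or claim to reproduce exactly what the compiler generates.

```python
def stride_iterations(n, num_threads, thrd_id):
    """Return the 1-based iterations thrd_id runs under a round-robin deal.

    Thread t starts at iteration t + 1 and steps by num_threads
    (assumed model of stride-based parallelism).
    """
    return list(range(thrd_id + 1, n + 1, num_threads))


if __name__ == "__main__":
    for thrd_id in range(4):
        print(thrd_id, stride_iterations(10, 4, thrd_id))
```

In this model, with N = 10 and four threads, thread 0 runs iterations 1, 5, and 9, while thread 3 runs iterations 4 and 8. Each thread touches elements spread across the whole array rather than one contiguous block.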
In these examples, the data being manipulated within the loop
is disjoint so that no two threads attempt to write the same data
item. If two parallel threads attempt to update the same storage
location, their actions must be synchronized. This is discussed
further in Chapter 13 “Parallel
synchronization”.
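To illustrate the non-disjoint case, the sketch below has several threads add into one shared total. It is written in Python with the standard `threading` module, used here only as a stand-in. The function name `synchronized_sum` is hypothetical, and the lock is a generic mechanism rather than the specific synchronization facilities the guide covers in Chapter 13.

```python
import threading


def synchronized_sum(values, num_threads):
    """Sum values with num_threads threads that all update one total."""
    total = 0
    lock = threading.Lock()

    def worker(thrd_id):
        nonlocal total
        # Each thread takes every num_threads-th element, so the inputs
        # are disjoint even though the output location is shared.
        for value in values[thrd_id::num_threads]:
            with lock:           # one thread at a time updates total
                total += value

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return total


if __name__ == "__main__":
    print(synchronized_sum(list(range(1, 101)), 8))
```

Without the lock, the read-modify-write on `total` could interleave between threads and lose updates. The array-add loop in the earlier example needs no such protection because each thread writes only its own elements.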