Parallelization divides a program into threads. A thread is a single
flow of control within a process. It can be a unique flow of control
that performs a specific function, or one of several instances of
a flow of control, each of which is operating on a unique data set.
On a V-Class server, parallel shared-memory programs run as
a collection of threads on multiple processors. When a program starts,
a separate execution thread is created on each system processor
on which the program is running. All but one of these threads are
then idle. The nonidle thread is known as thread 1, and this thread
runs all of the serial code in the program.
Spawn thread IDs are assigned only to nonidle threads when
they are spawned. This occurs when thread 1 encounters parallelism
and "wakes up" other idle threads to execute the
parallel code. Spawn thread IDs are consecutive, ranging from 0
to N-1, where N is
the number of threads spawned as a result of the spawn operation.
This operation defines the current spawn context. The spawn context
is the loop, task list, or region that initiates the spawning of
the threads. Spawn thread IDs are valid only within a given spawn
context.
This means that the idle threads are not assigned spawn thread
IDs at the time of their creation. When thread 1 encounters a parallel
loop, task, or region, it spawns the other threads, signaling them
to begin execution. The threads then become active, acquire spawn
thread IDs, run until their portion of the parallel code is finished,
and go idle once again, as shown in Figure 6-1 “One-dimensional parallelism in threads”.
NOTE: Machine loading does not affect the number of threads
spawned, but it may affect the order in which the threads in a given
spawn context complete.
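The spawn model described above can be sketched in C. This is only an illustration under stated assumptions, not the V-Class runtime: it uses POSIX threads, it creates new threads for each spawn context instead of waking idle ones, and the names run_spawn_context, spawn_arg_t, parallel_body, and MAX_SPAWN are hypothetical. What it does show is that the spawn thread IDs are consecutive from 0 to N-1 and are meaningful only inside one spawn context.

```c
#include <pthread.h>

#define MAX_SPAWN 64   /* hypothetical upper bound chosen for this sketch */

/* Per-thread record for one spawn context (hypothetical type). */
typedef struct {
    int spawn_id;    /* consecutive ID, 0 .. N-1, valid only in this context */
    int n_spawned;   /* N: number of threads spawned in this context */
    int *visited;    /* visited[k] is set to 1 by the thread with spawn ID k */
} spawn_arg_t;

/* The "parallel code": each thread touches only its own array element. */
static void *parallel_body(void *p)
{
    spawn_arg_t *arg = p;
    arg->visited[arg->spawn_id] = 1;
    return NULL;
}

/* Runs one spawn context with n threads.
 * Returns 0 on success, -1 on a bad argument or a thread-creation failure.
 * Assumption: threads are created here rather than woken from an idle pool. */
int run_spawn_context(int n, int *visited)
{
    pthread_t tid[MAX_SPAWN];
    spawn_arg_t arg[MAX_SPAWN];
    int started = 0;
    int rc = 0;

    if (n < 1 || n > MAX_SPAWN)
        return -1;

    for (int id = 0; id < n; id++) {
        arg[id].spawn_id = id;
        arg[id].n_spawned = n;
        arg[id].visited = visited;
        if (pthread_create(&tid[id], NULL, parallel_body, &arg[id]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }

    /* Wait for every started thread; the order in which they finish
     * does not matter, only that all of them have finished. */
    for (int id = 0; id < started; id++)
        pthread_join(tid[id], NULL);

    return rc;
}
```

Calling run_spawn_context(8, visited) with a zeroed eight-element array leaves every element set to 1 regardless of the order in which the threads complete.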
Loop transformations
Figure 6-1 “One-dimensional parallelism in threads” above shows that
various loop transformations can affect the manner in which a loop
is parallelized.
To parallelize a loop, the compiler transforms it in a manner
similar to strip mining. Unlike in strip mining, however, the outer
loop is only conceptual: because the strips execute on different processors,
no single processor runs an outer loop like the one created
in traditional strip mining.
Instead, each thread runs one strip, whose starting and stopping
iteration values are variables determined at runtime based
on how many threads are available and which thread is running the
strip in question.
Loop transformations
Consider the previous Fortran example written for an unspecified
number of iterations:
      DO I = 1, N
        A(I) = B(I) + C(I)
      ENDDO
The code shown in Figure 6-2 “Conceptual strip mine for parallelization”
is a conceptual representation of the transformation the compiler
performs on this example when it is compiled for parallelization,
assuming that N >= NumThreads.
For N < NumThreads,
the compiler uses N threads, assuming
there is enough work in the loop to justify the overhead of parallelizing
it. If NumThreads is not an integral
divisor of N, some threads perform
fewer iterations than others.
NumThreads is the number
of available threads. ThrdID is
the ID number of the thread this particular loop runs on, which
is between 0 and NumThreads-1.
A unique ThrdID is assigned to
each thread, and the ThrdIDs are
consecutive. So, for NumThreads = 8,
as in Figure 6-1 “One-dimensional parallelism in threads”, 8 loops would
be spawned, with ThrdIDs =
0 through 7. These 8 loops are illustrated in Figure 6-3 “Parallelized loop”.
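The runtime computation of each strip's starting and stopping values can be sketched in C. This is a minimal sketch under assumptions, not the compiler's actual transformation: the names strip_t, strip_bounds, and add_strip are hypothetical, and the split shown (the remainder iterations go to the lowest-numbered threads) is one plausible choice that the text does not confirm. Iteration numbers are kept 1-based to match the Fortran loop.

```c
/* Inclusive, 1-based iteration range of one strip (hypothetical type).
 * last < first means the thread has no iterations to run. */
typedef struct {
    int first;
    int last;
} strip_t;

/* Computes the strip for thread thrd_id (0 .. num_threads-1) of a loop
 * with n iterations.
 * Assumption: when num_threads does not divide n evenly, the first
 * (n % used) threads each take one extra iteration. */
strip_t strip_bounds(int n, int num_threads, int thrd_id)
{
    strip_t s = { 1, 0 };   /* empty range by default */

    /* For n < num_threads, only n threads are used. */
    int used = (n < num_threads) ? n : num_threads;

    if (n <= 0 || thrd_id < 0 || thrd_id >= used)
        return s;

    int base = n / used;    /* iterations every used thread gets */
    int rem = n % used;     /* leftover iterations to hand out */
    int extra = (thrd_id < rem) ? thrd_id : rem;

    s.first = thrd_id * base + extra + 1;
    s.last = s.first + base + ((thrd_id < rem) ? 1 : 0) - 1;
    return s;
}

/* The loop body A(I) = B(I) + C(I), restricted to one thread's strip.
 * C arrays are 0-based, so iteration i maps to element i - 1. */
void add_strip(double *a, const double *b, const double *c,
               int n, int num_threads, int thrd_id)
{
    strip_t s = strip_bounds(n, num_threads, thrd_id);
    for (int i = s.first; i <= s.last; i++)
        a[i - 1] = b[i - 1] + c[i - 1];
}
```

With N = 10 and NumThreads = 4, this split yields the strips 1-3, 4-6, 7-8, and 9-10, so two threads perform fewer iterations than the others. Calling add_strip once for each ThrdID covers every iteration exactly once.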
NOTE: The
strip-based parallelism described here is the default. Stride-based
parallelism is possible through use of the prefer_parallel
and loop_parallel compiler directives
and pragmas.
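For contrast with the strip-based default, a stride-based distribution can be sketched in C. This rests on an assumption the note does not spell out: that "stride-based" means iterations are dealt to the threads round-robin, one iteration at a time, so thread ThrdID runs iterations ThrdID+1, ThrdID+1+NumThreads, and so on. The names stride_owner and add_stride are hypothetical, and the directive syntax that selects this behavior is not shown.

```c
/* Returns the ID of the thread that runs 1-based iteration i under the
 * assumed round-robin distribution with a chunk of one iteration. */
int stride_owner(int i, int num_threads)
{
    return (i - 1) % num_threads;
}

/* The loop body A(I) = B(I) + C(I), restricted to the iterations that
 * one thread owns. The thread steps through the loop with a stride of
 * num_threads instead of walking a contiguous strip. */
void add_stride(double *a, const double *b, const double *c,
                int n, int num_threads, int thrd_id)
{
    if (num_threads < 1 || thrd_id < 0 || thrd_id >= num_threads)
        return;   /* nothing to do for an invalid thread ID */

    for (int i = thrd_id + 1; i <= n; i += num_threads)
        a[i - 1] = b[i - 1] + c[i - 1];   /* 0-based C indexing */
}
```

Under this assumption, with NumThreads = 4 the thread with ThrdID 0 runs iterations 1, 5, 9, and so on, while a strip-based split would give it one contiguous block.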
In these examples, the data being manipulated within the loop
is disjoint so that no two threads attempt to write the same data
item. If two parallel threads attempt to update the same storage
location, their actions must be synchronized. This is discussed
further in Chapter 12 “Parallel synchronization”.
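As an illustration of why a shared storage location needs synchronization, the following C sketch sums an array in parallel. It is an assumption-laden example, not the mechanism the compiler or Chapter 12 uses: it relies on POSIX threads and a mutex, and the names parallel_sum, sum_strip, sum_arg_t, and MAX_THREADS are hypothetical. Each thread reads a disjoint strip, but all threads update the same total, so that update is protected by the lock.

```c
#include <pthread.h>

#define MAX_THREADS 16   /* hypothetical upper bound chosen for this sketch */

/* Work description for one thread (hypothetical type). */
typedef struct {
    const double *data;
    int first;               /* 0-based, inclusive */
    int last;                /* 0-based, inclusive */
    double *total;           /* shared by all threads */
    pthread_mutex_t *lock;   /* protects *total */
} sum_arg_t;

static void *sum_strip(void *p)
{
    sum_arg_t *a = p;
    double local = 0.0;

    /* Reading a disjoint strip needs no synchronization. */
    for (int i = a->first; i <= a->last; i++)
        local += a->data[i];

    /* Every thread writes the same location, so serialize the update. */
    pthread_mutex_lock(a->lock);
    *a->total += local;
    pthread_mutex_unlock(a->lock);
    return NULL;
}

/* Sums data[0 .. n-1] using up to num_threads threads. */
double parallel_sum(const double *data, int n, int num_threads)
{
    pthread_t tid[MAX_THREADS];
    sum_arg_t arg[MAX_THREADS];
    int is_thread[MAX_THREADS];
    pthread_mutex_t lock;
    double total = 0.0;

    if (n <= 0)
        return 0.0;
    if (num_threads < 1) num_threads = 1;
    if (num_threads > MAX_THREADS) num_threads = MAX_THREADS;
    if (num_threads > n) num_threads = n;

    pthread_mutex_init(&lock, NULL);

    int base = n / num_threads, rem = n % num_threads, next = 0;
    for (int t = 0; t < num_threads; t++) {
        int count = base + ((t < rem) ? 1 : 0);
        arg[t].data = data;
        arg[t].first = next;
        arg[t].last = next + count - 1;
        arg[t].total = &total;
        arg[t].lock = &lock;
        next += count;

        /* If a thread cannot be created, run its strip in the caller. */
        is_thread[t] = (pthread_create(&tid[t], NULL, sum_strip, &arg[t]) == 0);
        if (!is_thread[t])
            sum_strip(&arg[t]);
    }

    for (int t = 0; t < num_threads; t++)
        if (is_thread[t])
            pthread_join(tid[t], NULL);

    pthread_mutex_destroy(&lock);
    return total;
}
```

Accumulating into a thread-local variable first keeps the locked region to a single addition per thread. Without the lock, two threads could read the same old value of the total and one contribution would be lost.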