Parallel Programming Guide for HP-UX Systems: K-Class and V-Class Servers, Chapter 6: Parallel optimization features

Threads

Parallelization divides a program into threads. A thread is a single flow of control within a process. It can be a unique flow of control that performs a specific function, or one of several instances of a flow of control, each of which is operating on a unique data set.

On a V-Class server, parallel shared-memory programs run as a collection of threads on multiple processors. When a program starts, a separate execution thread is created on each system processor on which the program is running. All but one of these threads is then idle. The nonidle thread is known as thread 1, and this thread runs all of the serial code in the program.

Spawn thread IDs are assigned only to nonidle threads when they are spawned. This occurs when thread 1 encounters parallelism and "wakes up" other idle threads to execute the parallel code. Spawn thread IDs are consecutive, ranging from 0 to N-1, where N is the number of threads spawned as a result of the spawn operation. This operation defines the current spawn context. The spawn context is the loop, task list, or region that initiates the spawning of the threads. Spawn thread IDs are valid only within a given spawn context.

This means that the idle threads are not assigned spawn thread IDs at the time of their creation. When thread 1 encounters a parallel loop, task, or region, it spawns the other threads, signaling them to begin execution. The threads then become active, acquire spawn thread IDs, run until their portion of the parallel code is finished, and go idle once again, as shown in Figure 6-1 “One-dimensional parallelism in threads”.
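The spawn model can be approximated with POSIX threads. This is a minimal sketch under stated assumptions, not the HP runtime: the V-Class runtime keeps idle threads waiting on each processor, whereas here pthread_create and pthread_join stand in for waking the threads and letting them go idle again. The names spawn_context, worker, spawn_arg, and MAX_SPAWN are hypothetical. Each call to spawn_context models one spawn context: its workers receive spawn thread IDs 0 through n-1, and those IDs have no meaning once the call returns.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_SPAWN 64  /* hypothetical upper bound on threads per spawn context */

/* Per-worker argument: the spawn thread ID plus a place to record that it ran. */
struct spawn_arg {
    int spawn_id;     /* 0 .. n-1, valid only inside this spawn context */
    int *ran_flags;   /* shared array, one slot per spawn ID (disjoint writes) */
};

/* Body executed by each spawned thread: its "portion of the parallel code". */
static void *worker(void *p) {
    struct spawn_arg *arg = p;
    arg->ran_flags[arg->spawn_id] += 1;  /* each thread writes only its own slot */
    return NULL;
}

/* One spawn context: the calling thread (thread 1) spawns n workers with
   spawn IDs 0..n-1, then waits until all of them finish ("go idle") before
   serial execution resumes. Returns 0 on success, -1 if a thread could not
   be created. */
int spawn_context(int n, int *ran_flags) {
    pthread_t tids[MAX_SPAWN];
    struct spawn_arg args[MAX_SPAWN];
    assert(n > 0 && n <= MAX_SPAWN);

    int created = 0;
    for (int id = 0; id < n; id++) {
        args[id].spawn_id = id;
        args[id].ran_flags = ran_flags;
        if (pthread_create(&tids[id], NULL, worker, &args[id]) != 0)
            break;
        created++;
    }
    /* Joining in ID order says nothing about the order in which they finished. */
    for (int id = 0; id < created; id++)
        pthread_join(tids[id], NULL);
    return created == n ? 0 : -1;
}
```

Calling spawn_context(4, flags) and then spawn_context(2, other_flags) produces two separate spawn contexts; IDs 0 and 1 appear again in the second one because spawn thread IDs are valid only within a given spawn context.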

NOTE: Machine loading does not affect the number of threads spawned, but it may affect the order in which the threads in a given spawn context complete.

Figure 6-1 One-dimensional parallelism in threads

Loop transformations

Figure 6-1 “One-dimensional parallelism in threads” above shows that various loop transformations can affect the manner in which a loop is parallelized.

To implement this, the compiler transforms the loop in a manner similar to strip mining. However, unlike in strip mining, the outer loop is conceptual. Because the strips execute on different processors, there is no processor to run an outer loop like the one created in traditional strip mining.
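For contrast, traditional strip mining on a single processor keeps an explicit outer loop that walks the strips one after another. The following C sketch strip mines the array addition A(I) = B(I) + C(I) in that traditional way; the function name add_strip_mined and the strip length STRIP are hypothetical, and C arrays are 0-based rather than 1-based.

```c
#include <assert.h>

#define STRIP 4  /* hypothetical strip length */

/* Traditional (serial) strip mining: the outer loop visits each strip in
   turn on one processor, and the inner loop runs the iterations inside it. */
void add_strip_mined(int n, double *a, const double *b, const double *c) {
    assert(n >= 0);
    for (int lo = 0; lo < n; lo += STRIP) {            /* outer loop over strips */
        int hi = (lo + STRIP < n) ? lo + STRIP : n;    /* last strip may be short */
        for (int i = lo; i < hi; i++)                  /* iterations within a strip */
            a[i] = b[i] + c[i];
    }
}
```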

Instead, the loop is transformed so that the starting and stopping iteration values of each strip are variables, determined at runtime from the number of available threads and the ID of the thread running that strip.
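One possible way to compute those runtime bounds is sketched below in C. The exact formula the HP compiler uses is not given here; this sketch assumes the iterations are divided as evenly as possible, with the first N mod NumThreads threads taking one extra iteration. The function name strip_bounds is hypothetical, and the bounds are 1-based and inclusive to match DO I = 1, N.

```c
#include <assert.h>

/* Compute the 1-based, inclusive range [*start, *stop] of the strip run by
   thread thrd_id (0 .. num_threads-1) for a loop DO I = 1, N.
   Assumed scheme: split as evenly as possible, the first n % num_threads
   threads get one extra iteration. An empty strip has *stop < *start. */
void strip_bounds(int n, int num_threads, int thrd_id, int *start, int *stop) {
    assert(n >= 0 && num_threads > 0);
    assert(thrd_id >= 0 && thrd_id < num_threads);
    int base  = n / num_threads;   /* iterations every thread receives */
    int extra = n % num_threads;   /* leftover iterations to hand out */
    /* Iterations owned by all lower-numbered threads. */
    int before = thrd_id * base + (thrd_id < extra ? thrd_id : extra);
    *start = before + 1;
    *stop  = *start + base + (thrd_id < extra ? 1 : 0) - 1;
}
```

For N = 10 and NumThreads = 4, this scheme gives the strips 1-3, 4-6, 7-8, and 9-10.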

Consider the previous Fortran example written for an unspecified number of iterations:

DO I = 1, N
A(I) = B(I) + C(I)
ENDDO

The code shown in Figure 6-2 “Conceptual strip mine for parallelization” is a conceptual representation of the transformation the compiler performs on this example when it is compiled for parallelization, assuming that N >= NumThreads.
For N < NumThreads, the compiler uses N threads, assuming there is enough work in the loop to justify the overhead of parallelizing it. If NumThreads is not an integral divisor of N, some threads perform fewer iterations than others.
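The two cases above can be expressed as a pair of small helper functions. This is an illustrative C sketch with hypothetical names (threads_used, iterations_for); it leaves out the compiler's judgment about whether a loop has enough work to be parallelized at all, and it assumes an even split in which the first N mod NumThreads threads take one extra iteration.

```c
#include <assert.h>

/* Threads used for a loop of n iterations when num_threads are available:
   never more threads than iterations. The profitability check that the
   compiler also applies is deliberately omitted. */
int threads_used(int n, int num_threads) {
    assert(n >= 0 && num_threads > 0);
    return n < num_threads ? n : num_threads;
}

/* Iterations performed by thread thrd_id when n iterations are split as
   evenly as possible over `used` threads (assumed scheme). */
int iterations_for(int n, int used, int thrd_id) {
    assert(used > 0 && thrd_id >= 0 && thrd_id < used);
    return n / used + (thrd_id < n % used ? 1 : 0);
}
```

With N = 3 and NumThreads = 8, only three threads are used, one iteration each; with N = 10 and NumThreads = 4, the threads perform 3, 3, 2, and 2 iterations.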

Figure 6-2 Conceptual strip mine for parallelization

NumThreads is the number of available threads. ThrdID is the ID number of the thread this particular loop runs on, which is between 0 and NumThreads-1. A unique ThrdID is assigned to each thread, and the ThrdIDs are consecutive. So, for NumThreads = 8, as in Figure 6-1 “One-dimensional parallelism in threads”, 8 loops would be spawned, with ThrdIDs = 0 through 7. These 8 loops are illustrated in Figure 6-3 “Parallelized loop”.
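The eight spawned loops of the NumThreads = 8 case can be imitated with POSIX threads. In this sketch (hypothetical names parallel_add, run_strip, and strip_arg; an assumed even split of the iterations; 0-based C indexing), each thread receives its ThrdID, computes its own strip bounds at run time, and executes only that strip of A(I) = B(I) + C(I). Because the strips are disjoint, no synchronization is needed.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_THREADS 8   /* as in the NumThreads = 8 example */

struct strip_arg {
    int thrd_id;              /* 0 .. NUM_THREADS-1 */
    int n;                    /* total iterations */
    double *a;
    const double *b, *c;
};

/* One of the eight spawned loops: runs only the iterations of its own strip. */
static void *run_strip(void *p) {
    struct strip_arg *s = p;
    int base = s->n / NUM_THREADS, extra = s->n % NUM_THREADS;
    int lo = s->thrd_id * base + (s->thrd_id < extra ? s->thrd_id : extra); /* first index */
    int hi = lo + base + (s->thrd_id < extra ? 1 : 0);                      /* one past last */
    for (int i = lo; i < hi; i++)
        s->a[i] = s->b[i] + s->c[i];   /* disjoint strips: no two threads write one a[i] */
    return NULL;
}

/* Spawn eight strips with ThrdID 0..7 and wait for all of them.
   Assumes n >= NUM_THREADS. Returns 0 on success, -1 on creation failure. */
int parallel_add(int n, double *a, const double *b, const double *c) {
    pthread_t tids[NUM_THREADS];
    struct strip_arg args[NUM_THREADS];
    int created = 0;
    assert(n >= NUM_THREADS);
    for (int id = 0; id < NUM_THREADS; id++) {
        args[id] = (struct strip_arg){ id, n, a, b, c };
        if (pthread_create(&tids[id], NULL, run_strip, &args[id]) != 0)
            break;
        created++;
    }
    for (int id = 0; id < created; id++)
        pthread_join(tids[id], NULL);
    return created == NUM_THREADS ? 0 : -1;
}
```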

Figure 6-3 Parallelized loop

NOTE: The strip-based parallelism described here is the default. Stride-based parallelism is possible through use of the prefer_parallel and loop_parallel compiler directives and pragmas.
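The difference between the two distributions shows up in which thread owns a given iteration. In the C sketch below, owner_strip and owner_stride are hypothetical helpers: the strip version assumes an even split into contiguous blocks, and the stride version assumes a chunk size of one, so iterations are dealt out round-robin. The actual assignment produced by prefer_parallel or loop_parallel may differ.

```c
#include <assert.h>

/* Which thread runs 1-based iteration i of DO I = 1, N under each scheme.
   Both are illustrations; the compiler's real assignment may differ. */

/* Strip-based (the default): contiguous blocks, split as evenly as possible,
   with the first n % num_threads threads holding one extra iteration. */
int owner_strip(int i, int n, int num_threads) {
    assert(i >= 1 && i <= n && num_threads > 0);
    int base = n / num_threads, extra = n % num_threads;
    int big = extra * (base + 1);   /* iterations held by the larger strips */
    if (i <= big)
        return (i - 1) / (base + 1);
    return extra + (i - 1 - big) / base;
}

/* Stride-based with an assumed chunk size of 1: round-robin assignment. */
int owner_stride(int i, int num_threads) {
    assert(i >= 1 && num_threads > 0);
    return (i - 1) % num_threads;
}
```

For N = 10 and four threads, strip-based ownership of iterations 1 through 10 is 0,0,0,1,1,1,2,2,3,3, while stride-based ownership is 0,1,2,3,0,1,2,3,0,1.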

In these examples, the data being manipulated within the loop is disjoint so that no two threads attempt to write the same data item. If two parallel threads attempt to update the same storage location, their actions must be synchronized. This is discussed further in Chapter 12 “Parallel synchronization”.
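As a generic illustration of such synchronization (plain POSIX threads, not the HP directives and pragmas covered in Chapter 12), the sketch below has every thread add into one shared total. Each thread first accumulates a private partial sum and then updates the shared variable under a mutex. The names parallel_sum, partial_sum, total, and NT are hypothetical, and the work is distributed round-robin purely for brevity.

```c
#include <assert.h>
#include <pthread.h>

#define NT 4   /* hypothetical number of threads */

static double total = 0.0;   /* shared: every thread updates this location */
static pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;

struct sum_arg { int thrd_id, n; const double *b; };

static void *partial_sum(void *p) {
    struct sum_arg *s = p;
    double local = 0.0;                      /* private to this thread: no sync needed */
    for (int i = s->thrd_id; i < s->n; i += NT)
        local += s->b[i];                    /* round-robin share of the iterations */
    pthread_mutex_lock(&total_lock);         /* serialize the shared update */
    total += local;
    pthread_mutex_unlock(&total_lock);
    return NULL;
}

/* Sum b[0..n-1] with NT threads; aborts (via assert) if a thread cannot start. */
double parallel_sum(int n, const double *b) {
    pthread_t tids[NT];
    struct sum_arg args[NT];
    int created = 0;
    total = 0.0;                             /* reset before spawning: no threads running */
    for (int id = 0; id < NT; id++) {
        args[id] = (struct sum_arg){ id, n, b };
        if (pthread_create(&tids[id], NULL, partial_sum, &args[id]) != 0)
            break;
        created++;
    }
    for (int id = 0; id < created; id++)
        pthread_join(tids[id], NULL);
    assert(created == NT);
    return total;
}
```

Without the mutex, two threads could read the same old value of total and one of the additions would be lost.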

© Hewlett-Packard Development Company, L.P.