Parallel Programming Guide for HP-UX Systems > Chapter 6 Parallel optimization features

Parallel optimizations

Simple loops can be parallelized without extensive transformations. Most loop transformations, however, improve parallelization. For instance, loop interchange orders loops so that the innermost loop best exploits the processor data cache and the outermost loop is the most efficient loop to parallelize. Loop blocking similarly aids parallelization by maximizing cache data reuse on each of the processors that the loop runs on. It also ensures that each processor works on nonoverlapping array data.

The compiler has no way of determining how many processors will be available to run the compiled code. Therefore, it sometimes generates both serial and parallel code for loops that are parallelized. Replicating the loop in this manner is called cloning, and the resulting versions of the loop are called clones. Cloning is also performed when the loop-iteration count is unknown at compile time.

It is not always profitable, however, to run the parallel clone when multiple processors are available. Executing parallel code involves some overhead: the time it takes to spawn parallel threads, to privatize any variables used in the loop that must be privatized, and to join the parallel threads when they complete their work.

HP compilers use a powerful form of dynamic selection known as workload-based dynamic selection. When a loop’s iteration count is available at compile time, workload-based dynamic selection determines whether parallelizing the loop is profitable, and the compiler writes a parallel version to the executable only if it is. Omitting a parallel version that will not be needed further enhances performance, because it eliminates the runtime decision as to which version to use.

The power of dynamic selection becomes more apparent when the loop’s iteration count is unknown at compile time. In this case, the compiler generates code that, at runtime, compares the amount of work performed in the loop nest (given the actual iteration counts) to the parallelization overhead for the available number of processors. It then runs the parallel version of the loop only if it is profitable to do so.

Workload-based dynamic selection is enabled by default when you compile with +Oparallel at +O3. Specifying +Onodynsel disables dynamic selection, so the compiler parallelizes loops without testing profitability. When dynamic selection is disabled, the compiler assumes that it is profitable to parallelize all parallelizable loops and generates both serial and parallel clones for them. In this case the parallel version is run if there are multiple processors at runtime, regardless of the profitability of doing so.

The dynsel directive or pragma enables dynamic selection for specific loops in programs compiled with the +Onodynsel option, or provides trip count information for specific loops in programs compiled with dynamic selection enabled. To disable dynamic selection for specific loops in programs compiled with dynamic selection enabled, use the no_dynsel directive or pragma. The form of these directives and pragmas is shown in Table 6-2 “Form of dynsel directive and pragma”.

Table 6-2 Form of dynsel directive and pragma
where
As with all optimizations that replicate loops, the number
of new loops created when the compiler performs dynamic selection
is limited by default to ensure reasonable code sizes. To increase
the replication limit (and possibly increase your compile time and
code size), specify the +Onosize and +Onolimit compiler options. These are described in
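
The runtime decision that workload-based dynamic selection makes for a cloned loop can be pictured with a small hand-written C sketch. This is only a conceptual model, not the code an HP compiler actually emits: the `cost_model` structure, its constants, and all function names are hypothetical, and the "parallel" clone simply walks nonoverlapping chunks one after another where a real runtime would hand each chunk to a thread.

```c
#include <stddef.h>

/* Hypothetical cost model. All fields are illustrative assumptions,
   not values used by any HP compiler. */
typedef struct {
    double cost_per_iteration;   /* estimated work in one loop iteration   */
    double spawn_join_overhead;  /* fixed cost to spawn and join threads   */
    double per_thread_overhead;  /* per-thread cost, e.g. privatization    */
} cost_model;

/* Compare the estimated serial cost with the estimated parallel cost
   (largest chunk of work plus overhead). Returns 1 if parallel wins. */
int parallel_is_profitable(size_t n, int nprocs, const cost_model *m)
{
    if (nprocs < 2 || n == 0)
        return 0;
    size_t p = (size_t)nprocs;
    double serial = (double)n * m->cost_per_iteration;
    double chunk = (double)((n + p - 1) / p);   /* iterations per thread */
    double parallel = chunk * m->cost_per_iteration
                    + m->spawn_join_overhead
                    + (double)nprocs * m->per_thread_overhead;
    return parallel < serial;
}

/* Serial clone of the loop. */
void scale_serial(double *a, size_t n, double k)
{
    for (size_t i = 0; i < n; i++)
        a[i] *= k;
}

/* Stand-in for the parallel clone: the iteration space is split into
   nonoverlapping chunks, one per processor. The chunks run sequentially
   here; a real runtime would execute them on separate threads. */
void scale_parallel(double *a, size_t n, double k, int nprocs)
{
    size_t p = (size_t)nprocs;
    size_t chunk = (n + p - 1) / p;
    for (size_t t = 0; t < p; t++) {
        size_t lo = t * chunk;
        size_t hi = (lo + chunk < n) ? lo + chunk : n;
        for (size_t i = lo; i < hi; i++)
            a[i] *= k;
    }
}

/* Dynamic selection: pick a clone at runtime from the actual iteration
   count and processor count. Returns 1 if the parallel clone ran. */
int scale_dispatch(double *a, size_t n, double k, int nprocs,
                   const cost_model *m)
{
    if (parallel_is_profitable(n, nprocs, m)) {
        scale_parallel(a, n, k, nprocs);
        return 1;
    }
    scale_serial(a, n, k);
    return 0;
}
```

With an assumed model such as `{1.0, 100.0, 10.0}` and four processors, a call with only a handful of iterations falls through to the serial clone, because the overhead outweighs the work. A call with many thousands of iterations selects the parallel clone. Disabling dynamic selection corresponds to skipping the profitability test and choosing the parallel clone whenever more than one processor is available.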