Parallel Programming Guide for HP-UX Systems: K-Class and V-Class Servers > Chapter 6 Parallel optimization features

Parallel optimizations


Simple loops can be parallelized without extensive transformations. However, most loop transformations improve parallelization. For instance, loop interchange orders the loops in a nest so that the innermost loop best exploits the processor data cache and the outermost loop is the most efficient one to parallelize.
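As an illustration of loop interchange, the following is a minimal sketch and not actual compiler output. The function names and the flat row-major array layout are assumptions made for this example. The first function walks down columns in its inner loop, so consecutive accesses are cols elements apart; the second is the interchanged form, in which the inner loop is unit-stride and the outer loop over rows has independent iterations.

```c
#include <assert.h>
#include <stddef.h>

/* Before interchange (hypothetical example): the inner loop strides
 * through memory by `cols` elements, which uses the data cache poorly. */
void scale_inner_rows(double *a, size_t rows, size_t cols, double s)
{
    size_t i, j;
    assert(a != NULL);
    for (j = 0; j < cols; j++)
        for (i = 0; i < rows; i++)
            a[i * cols + j] *= s;
}

/* After interchange: the inner loop touches consecutive elements, and
 * each iteration of the outer loop works on its own row, so the outer
 * loop is the natural candidate for parallelization. */
void scale_inner_cols(double *a, size_t rows, size_t cols, double s)
{
    size_t i, j;
    assert(a != NULL);
    for (i = 0; i < rows; i++)
        for (j = 0; j < cols; j++)
            a[i * cols + j] *= s;
}
```

Both functions compute the same result; only the order in which the elements are visited differs.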

Loop blocking similarly aids parallelization by maximizing cache data reuse on each of the processors that the loop runs on. It also ensures that each processor is working on nonoverlapping array data.
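Loop blocking can be sketched by hand as follows. This is an illustrative example only: the BLOCK tile size and the function name are assumptions, and the tile size an optimizer would pick depends on the cache of the target processor. Each tile of the destination array is written by exactly one iteration of the two outer loops, so tiles handed to different processors would not overlap.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 16 /* hypothetical tile size, chosen only for illustration */

/* Transpose an n-by-n row-major matrix one BLOCK-by-BLOCK tile at a time. */
void transpose_blocked(const double *src, double *dst, size_t n)
{
    size_t ii, jj, i, j, i_end, j_end;
    assert(src != NULL && dst != NULL && src != dst);

    for (ii = 0; ii < n; ii += BLOCK) {
        i_end = (ii + BLOCK < n) ? ii + BLOCK : n;
        for (jj = 0; jj < n; jj += BLOCK) {
            j_end = (jj + BLOCK < n) ? jj + BLOCK : n;
            /* Work inside one tile: the data touched here is small
             * enough to stay in cache while it is being reused. */
            for (i = ii; i < i_end; i++)
                for (j = jj; j < j_end; j++)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```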

Dynamic selection

The compiler has no way of determining how many processors are available to run compiled code. Therefore, it sometimes generates both serial and parallel code for loops that are parallelized. Replicating the loop in this manner is called cloning, and the resulting versions of the loop are called clones. Cloning is also performed when the loop-iteration count is unknown at compile-time.

It is not always profitable, however, to run the parallel clone when multiple processors are available. Some overhead is involved in executing parallel code. This overhead includes the time it takes to spawn parallel threads, to privatize any variables used in the loop that must be privatized, and to join the parallel threads when they complete their work.

Workload-based dynamic selection

HP compilers use a powerful form of dynamic selection known as workload-based dynamic selection. When a loop's iteration count is available at compile time, workload-based dynamic selection determines whether parallelizing the loop is profitable. The compiler writes a parallel version to the executable only if it is profitable to do so.

If the parallel version will not be needed, the compiler can omit it from the executable to further enhance performance. This eliminates the runtime decision as to which version to use.

The power of dynamic selection becomes more apparent when the loop's iteration count is unknown at compile time. In this case, the compiler generates code that, at runtime, compares the amount of work performed in the loop nest (given the actual iteration counts) to the parallelization overhead for the available number of processors. It then runs the parallel version of the loop only if it is profitable to do so.
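The runtime comparison described above can be pictured with the following sketch. It is not the code the HP compilers generate. The loop_cost_model structure, its two fields, and the function names are hypothetical, and the parallel clone is simulated here by running each thread's chunk one after another in a single thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cost model; a real compiler derives its own estimates. */
typedef struct {
    double cost_per_iteration; /* estimated work in one loop iteration */
    double parallel_overhead;  /* spawn, privatize, and join, as one sum */
} loop_cost_model;

typedef enum { RAN_SERIAL, RAN_PARALLEL } clone_kind;

/* Returns 1 if the estimated parallel time beats the serial time. */
int parallel_is_profitable(const loop_cost_model *m, size_t iterations,
                           unsigned processors)
{
    double serial_time, parallel_time;
    assert(m != NULL);
    if (processors < 2)
        return 0;
    serial_time = m->cost_per_iteration * (double)iterations;
    parallel_time = serial_time / (double)processors + m->parallel_overhead;
    return parallel_time < serial_time;
}

/* Chooses between the two clones of y[i] += a * x[i] at run time. */
clone_kind run_axpy(double *y, const double *x, double a, size_t n,
                    unsigned processors, const loop_cost_model *m)
{
    size_t i, t, chunk, start, end;
    assert(y != NULL && x != NULL && m != NULL);

    if (!parallel_is_profitable(m, n, processors)) {
        /* Serial clone. */
        for (i = 0; i < n; i++)
            y[i] += a * x[i];
        return RAN_SERIAL;
    }

    /* Stand-in for the parallel clone: one contiguous chunk per
     * processor, executed sequentially in this sketch. */
    chunk = (n + processors - 1) / processors;
    for (t = 0; t < processors; t++) {
        start = t * chunk;
        if (start >= n)
            break;
        end = (start + chunk < n) ? start + chunk : n;
        for (i = start; i < end; i++)
            y[i] += a * x[i];
    }
    return RAN_PARALLEL;
}
```

With an assumed overhead of 1000 cost units and one unit of work per iteration, a 100-iteration loop stays serial on four processors, while a 10000-iteration loop takes the parallel path.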

Workload-based dynamic selection is enabled by default when you compile with +Oparallel at +O3. Specifying +Onodynsel disables it. When dynamic selection is disabled, the compiler assumes that it is profitable to parallelize all parallelizable loops and generates both serial and parallel clones for them. In this case, the parallel version runs whenever multiple processors are available at runtime, regardless of whether doing so is profitable.

dynsel, no_dynsel

The dynsel directive or pragma enables dynamic selection for specific loops in programs compiled with the +Onodynsel option. It can also provide trip count information for specific loops in programs compiled with dynamic selection enabled.

To disable dynamic selection for selected loops, use the no_dynsel compiler directive or pragma. It applies to specific loops in programs compiled with dynamic selection enabled.

The forms of these directives and pragmas are shown in Table 6-2 “Form of dynsel directive and pragma”.

Table 6-2 Form of dynsel directive and pragma

Language    Form

Fortran     C$DIR DYNSEL [(THREAD_TRIP_COUNT = n)]
            C$DIR NO_DYNSEL

C           #pragma _CNX dynsel [(thread_trip_count = n)]
            #pragma _CNX no_dynsel

where

thread_trip_count is an optional attribute used to specify threshold iteration counts.

When thread_trip_count = n is specified, the serial version of the loop is run if the iteration count is less than n. Otherwise, the thread-parallel version is run.

If a trip count is not specified for a dynsel directive or pragma, the compiler uses a heuristic to estimate the actual execution costs. This estimate is then used to determine if it is profitable to execute the loop in parallel.
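The following sketch shows how the C pragmas from Table 6-2 might be applied. The function names and the threshold of 1000 are hypothetical, and placing each pragma immediately before the loop it governs is an assumption of this example. Compilers that do not recognize _CNX pragmas are expected to ignore them, so the code still builds as ordinary serial C.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: run the serial version when n is below 1000
 * iterations, and the thread-parallel version otherwise. */
void add_vectors(double *c, const double *a, const double *b, size_t n)
{
    size_t i;
    assert(c != NULL && a != NULL && b != NULL);

#pragma _CNX dynsel(thread_trip_count = 1000)
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Hypothetical example: turn dynamic selection off for this one loop
 * in a program otherwise compiled with dynamic selection enabled. */
void halve(double *c, size_t n)
{
    size_t i;
    assert(c != NULL);

#pragma _CNX no_dynsel
    for (i = 0; i < n; i++)
        c[i] *= 0.5;
}
```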

As with all optimizations that replicate loops, the number of new loops created when the compiler performs dynamic selection is limited by default to ensure reasonable code sizes. To increase the replication limit (and possibly increase your compile time and code size), specify the +Onosize +Onolimit compiler options. These are described in Chapter 7 “Controlling optimization”.

© Hewlett-Packard Development Company, L.P.