| United States-English |
|
|
|
![]() |
Parallel Programming Guide for HP-UX Systems: K-Class and V-Class Servers > Chapter 4 Standard optimization featuresRoutine level optimizations (+O2) |
|
At optimization level +O2, the compiler performs optimizations on a routine level. The compiler continues to perform the optimizations performed at +O1, with the following additions: Constant folding computes the value of a constant expression at compile time. Constant propagation is the automatic compile-time replacement of variable references with a constant value previously assigned to that variable. Advanced constant folding and propagation The following C/C++ code example describes an advanced constant folding and propagation:
Once a is assigned, its value is propagated to the statement where b is assigned so that the assignment reads:
The expression 10 + 5 can then be folded. Now that b has been assigned a constant, the value of b is propagated to the statement where c is assigned. After all the folding and propagation, the original code is replaced by:
Common subexpression elimination optimization identifies expressions that appear more than once and have the same result. It then computes the result and substitutes the result for each occurrence of the expression. Subexpression types include instructions that load values from memory, as well as arithmetic evaluation. Common subexpression elimination In Fortran, for example, the code first looks like this:
After this form of optimization, it becomes:
Scalar variables can often be stored in registers, eliminating the need for costly memory accesses. Global register allocation (GRA) attempts to store commonly referenced scalar variables in registers throughout the code in which they are most frequently accessed. The compiler automatically determines which scalar variables are the best candidates for GRA and allocates registers accordingly. GRA can sometimes cause problems when parallel threads attempt to update a shared variable that has been allocated a register. In this case, each parallel thread allocates a register for the shared variable; it is then unlikely that the copy in memory is updated correctly as each thread executes. Parallel assignments to the same shared variables from multiple threads make sense only if the assignments are contained inside critical or ordered sections, or are executed conditionally based on the thread ID. GRA does not allocate registers for shared variables that are assigned within critical or ordered sections, as long as the sections are implemented using compiler directives or sync_routine-defined functions (refer to Chapter 12 “Parallel synchronization” for a discussion of sync_routine). However, for conditional assignments based on the thread ID, GRA may allocate registers that may cause wrong answers when stored. In such cases, GRA is disabled only for shared variables that are visible to multiple threads by specifying +Onosharedgra. A description of this option is located in “+O[no]sharedgra”. In procedures with large numbers of loops, GRA can contribute to long compile times. Therefore, GRA is only performed if the number of loops in the procedure is below a predetermined limit. You can remove this limit (and possibly increase compile time) by specifying +O[no]limit. A description of this option is located in “+O[no]limit”. This optimization is also known as coloring register allocation because of the similarity to map-coloring algorithms in graph theory. In C and C++, you can help the optimizer understand when certain variables are heavily used within a function by declaring these variables with the register qualifier. GRA may override your choices and promote a variable not declared register to a register over a variable that is declared register, based on estimated speed improvements. The loop-invariant code motion optimization recognizes instructions inside a loop whose results do not change and then moves the instructions outside the loop. This optimization ensures that the invariant code is only executed once. Loop-invariant code motion This example begins with following C/C++ code:
After loop-invariant code motion, it becomes:
Loop unrolling increases a loop's step value and replicates the loop body. Each replication is appropriately offset from the induction variable so that all iterations are performed, given the new step. Unrolling is total or partial. Total unrolling involves eliminating the loop structure completely by replicating the loop body a number of times equal to the iteration count and replacing the iteration variable with constants. This makes sense only for loops with small iteration counts. Loop unrolling and the unroll factor are controlled using the +O[no]loop_unroll[=unroll factor]. This option is described “+O[no]loop_unroll[=unroll factor]”. Some loop transformations cause loops to be fully or partially replicated. Because unlimited loop replication can significantly increase compile times, loop replication is limited by default. You can increase this limit (and possibly increase your program's compile time and code size) by specifying the +Onosize and +Onolimit compiler options. Loop unrolling Consider the following Fortran example:
The loop nest is completely unrolled as shown below:
Partial unrolling is performed on loops with larger or unknown iteration counts. This form of unrolling retains the loop structure, but replicates the body a number of times equal to the unroll factor and adjusts references to the iteration variable accordingly. Loop unrolling This example begins with the following Fortran example:
It is unrolled to a depth of four as shown below:
Each iteration of the loop now computes four values of A instead of one value. The compiler also generates 'clean-up' code for the case where the range is not evenly divisible by the unroll factor. Array references often require one or more instructions to
compute the virtual memory address of the array element specified
by the subscript expression. The register reassociation optimization
implemented in Within loops, the virtual memory address expression is rearranged and separated into a loop-variant term and a loop-invariant term.
The register reassociation optimization dedicates a register to track the value of the virtual memory address expression for one or more array references in a loop and updates the register appropriately in each iteration of a loop. The register is initialized outside the loop to the loop-invariant portion of the virtual memory address expression. The register is incremented or decremented within the loop by the loop-variant portion of the virtual memory address expression. The net result is that array references in loops are converted into equivalent, but more efficient, pointer dereferences. Register reassociation can often enable another loop optimization. After performing the register reassociation optimization, the loop variable may be needed only to control the iteration count of the loop. If this is the case, the original loop variable is eliminated altogether by using the PA-RISC ADDIB and ADDB machine instructions to control the loop iteration count. You can enable or disable register reassociation using the +O[no]regreassoc command-line option at +O2 and above. The default is +Oregreassoc. See “+O[no]regreassoc” for more information. Register allocation This example begins with the following C/C++ code:
After register reassociation is applied, the innermost loop becomes:
As you can see, the compiler-generated temporary register variable, p, strides through the array a in the innermost loop. This register pointer variable is initialized outside the innermost loop and auto-incremented within the innermost loop as a side-effect of the pointer dereference. Software pipelining transforms code in order to optimize program loops. It achieves this by rearranging the order in which instructions are executed in a loop. Software pipelining generates code that overlaps operations from different loop iterations. It is particularly useful for loops that contain arithmetic operations on REAL*4 and REAL*8 data in Fortran or on float and double data in C or C++. The goal of this optimization is to avoid processor stalls due to memory or hardware pipeline latencies. The software pipelining transformation partially unrolls a loop and adds code before and after the loop to achieve a high degree of optimization within the loop. You can enable or disable software pipelining using the +O[no]pipeline command-line option at +O2 and above. The default is +Opipeline. Use +Onopipeline if a smaller program size and faster compile time are more important than faster execution speed. See “+O[no]pipeline” for more information. Software pipelining is attempted on a loop that meets the following criteria:
This optimization produces slightly larger program files and increases compile time. It is most beneficial in programs containing loops that are executed many times. Software pipelining The following C/C++ example shows a loop before and after the software pipelining optimization:
Four significant things happen in this example:
When this loop is compiled with software pipelining, the optimization is expressed as follows:
This transformation stores intermediate results of the division instructions in unique registers (noted as n and m). These registers are not referenced until several instructions after the division operations. This decreases the possibility that the long latency period of the division instructions will stall the instruction pipeline and cause processing delays. This optimization removes expressions that are linear functions of a loop counter and replaces each of them with a variable that contains the value of the function. Variables of the same linear function are computed only once. This optimization also replaces multiplication instructions with addition instructions wherever possible. Strength reduction of induction variables and constants This example begins with the following C/C++ code:
After this optimization, it looks like this:
Where possible, the store and copy optimization substitutes registers for memory locations, by replacing store instructions with copy instructions and deleting load instructions. The unused definition elimination optimization removes unused memory location and register definitions. These definitions are often a result of transformations made by other optimizations. Unused definition elimination This example begins with the following C/C++ code:
After unused definition elimination, it looks like this:
The assignment a = 1 is removed because a is not used after it is defined. Due to another +O2 optimization (constant propagation), the c = x * b statement becomes c = x * 2. The assignment b = 2 is then removed as well. |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||