Jump to content United States-English
HP.com Home Products and Services Support and Drivers Solutions How to Buy
» Contact HP
More options
HP.com home
Parallel Programming Guide for HP-UX Systems > Chapter 4 Standard optimization features

Routine level optimizations (+O2)

» 

Technical documentation

Complete book in PDF
» Feedback
Content starts here

 » Table of Contents

 » Glossary

 » Index

At optimization level +O2, the compiler performs optimizations on a routine level. The compiler continues to perform the optimizations performed at +O1, with the following additions:

Advanced constant folding and propagation

Constant folding computes the value of a constant expression at compile time. Constant propagation is the automatic compile-time replacement of variable references with a constant value previously assigned to that variable.

Example 4-3 Advanced constant folding and propagation

The following C/C++ code example describes an advanced constant folding and propagation:

a = 10;
b = a + 5;
c = 4 * b;

Once a is assigned, its value is propagated to the statement where b is assigned so that the assignment reads:

b = 10 + 5;

The expression 10 + 5 can then be folded. Now that b has been assigned a constant, the value of b is propagated to the statement where c is assigned. After all the folding and propagation, the original code is replaced by:

a = 10;
b = 15;
c = 60;

Common subexpression elimination

Common subexpression elimination optimization identifies expressions that appear more than once and have the same result. It then computes the result and substitutes the result for each occurrence of the expression. Subexpression types include instructions that load values from memory, as well as arithmetic evaluation.

Example 4-4 Common subexpression elimination

In Fortran, for example, the code first looks like this:

A = X + Y + Z
B = X + Y + W

After this form of optimization, it becomes:

T1 = X + Y
A = T1 + Z
B = T1 + W

Global register allocation (GRA)

Scalar variables can often be stored in registers, eliminating the need for costly memory accesses. Global register allocation (GRA) attempts to store commonly referenced scalar variables in registers throughout the code in which they are most frequently accessed.

The compiler automatically determines which scalar variables are the best candidates for GRA and allocates registers accordingly.

GRA can sometimes cause problems when parallel threads attempt to update a shared variable that has been allocated a register. In this case, each parallel thread allocates a register for the shared variable; it is then unlikely that the copy in memory is updated correctly as each thread executes.

Parallel assignments to the same shared variables from multiple threads make sense only if the assignments are contained inside critical or ordered sections, or are executed conditionally based on the thread ID. GRA does not allocate registers for shared variables that are assigned within critical or ordered sections, as long as the sections are implemented using compiler directives or sync_routine-defined functions (refer to Chapter 13 “Parallel synchronization” for a discussion of sync_routine). However, for conditional assignments based on the thread ID, GRA may allocate registers that may cause wrong answers when stored.

In such cases, GRA is disabled only for shared variables that are visible to multiple threads by specifying +Onosharedgra. A description of this option is located in +O[no]sharedgra.

In procedures with large numbers of loops, GRA can contribute to long compile times. Therefore, GRA is only performed if the number of loops in the procedure is below a predetermined limit. You can remove this limit (and possibly increase compile time) by specifying +O[no]limit. A description of this option is located on +O[no]limit.

This optimization is also known as coloring register allocation because of the similarity to map-coloring algorithms in graph theory.

Register allocation in C and C++

In C and C++, you can help the optimizer understand when certain variables are heavily used within a function by declaring these variables with the register qualifier.

GRA may override your choices and promote a variable not declared register to a register over a variable that is declared register, based on estimated speed improvements.

Loop-invariant code motion

The loop-invariant code motion optimization recognizes instructions inside a loop whose results do not change and then moves the instructions outside the loop. This optimization ensures that the invariant code is only executed once.

Example 4-5 Loop-invariant code motion

This example begins with following C/C++ code:

x = z;
for(i=0; i<10; i++)
a[i] = 4 * x + i;

After loop-invariant code motion, it becomes:

x = z;
t1 = 4 * x;
for(i=0; i<10; i++)
a[i] = t1 + i;

Loop unrolling

Loop unrolling increases a loop’s step value and replicates the loop body. Each replication is appropriately offset from the induction variable so that all iterations are performed, given the new step.

Unrolling is total or partial. Total unrolling involves eliminating the loop structure completely by replicating the loop body a number of times equal to the iteration count and replacing the iteration variable with constants. This makes sense only for loops with small iteration counts.

Loop unrolling and the unroll factor are controlled using the +O[no]loop_unroll[=unroll factor]. This option is described on +O[no]loop_unroll[=unroll factor].

Some loop transformations cause loops to be fully or partially replicated. Because unlimited loop replication can significantly increase compile times, loop replication is limited by default. You can increase this limit (and possibly increase your program’s compile time and code size) by specifying the +Onosize and +Onolimit compiler options.

Example 4-6 Loop unrolling

Consider the following Fortran example:

SUBROUTINE FOO(A,B)
REAL A(10,10), B(10,10)
DO J=1, 4
DO I=1, 4
A(I,J) = B(I,J)
ENDDO
ENDDO
END

The loop nest is completely unrolled as shown below:

A(1,1) = B(1,1)
A(2,1) = B(2,1)
A(3,1) = B(3,1)
A(4,1) = B(4,1)

A(1,2) = B(1,2)
A(2,2) = B(2,2)
A(3,2) = B(3,2)
A(4,2) = B(4,2)

A(1,3) = B(1,3)
A(2,3) = B(2,3)
A(3,3) = B(3,3)
A(4,3) = B(4,3)

A(1,4) = B(1,4)
A(2,4) = B(2,4)
A(3,4) = B(3,4)
A(4,4) = B(4,4)

Partial unrolling is performed on loops with larger or unknown iteration counts. This form of unrolling retains the loop structure, but replicates the body a number of times equal to the unroll factor and adjusts references to the iteration variable accordingly.

Example 4-7 Loop unrolling

This example begins with the following Fortran example:

DO I = 1, 100
A(I) = B(I) + C(I)
ENDDO

It is unrolled to a depth of four as shown below:

DO I = 1, 100, 4
A(I) = B(I) + C(I)
A(I+1) = B(I+1) + C(I+1)
A(I+2) = B(I+2) + C(I+2)
A(I+3) = B(I+3) + C(I+3)
ENDDO

Each iteration of the loop now computes four values of A instead of one value. The compiler also generates ‘clean-up’ code for the case where the range is not evenly divisible by the unroll factor.

Register reassociation

Array references often require one or more instructions to compute the virtual memory address of the array element specified by the subscript expression. The register reassociation optimization implemented in
PA-RISC compilers tries to reduce the cost of computing the virtual memory address expression for array references found in loops.

Within loops, the virtual memory address expression is rearranged and separated into a loop-variant term and a loop-invariant term.

  • Loop-variant terms are those items whose values may change from one iteration of the loop to another.

  • Loop-invariant terms are those items whose values are constant throughout all iterations of the loop. The loop-variant term corresponds to the difference in the virtual memory address associated with a particular array reference from one iteration of the loop to the next.

The register reassociation optimization dedicates a register to track the value of the virtual memory address expression for one or more array references in a loop and updates the register appropriately in each iteration of a loop.

The register is initialized outside the loop to the loop-invariant portion of the virtual memory address expression. The register is incremented or decremented within the loop by the loop-variant portion of the virtual memory address expression. The net result is that array references in loops are converted into equivalent, but more efficient, pointer dereferences.

Register reassociation can often enable another loop optimization. After performing the register reassociation optimization, the loop variable may be needed only to control the iteration count of the loop. If this is the case, the original loop variable is eliminated altogether by using the PA-RISC ADDIB and ADDB machine instructions to control the loop iteration count.

You can enable or disable register reassociation using the +O[no]regreassoc command-line option at +O2 and above. The default is +Oregreassoc. See +O[no]regreassoc for more information.

Example 4-8 Register allocation

This example begins with the following C/C++ code:

int a[10][20][30];

void example (void)
{
int i, j, k;

for (k = 0; k < 10; k++)
for (j = 0; j < 10;j++)
for (i = 0; i < 10; i++)
a[i][j][k] = 1;
}

After register reassociation is applied, the innermost loop becomes:

int a[10][20][30];

void example (void)
{
int i, j, k;
register int (*p)[20][30];

for (k = 0; k < 10; k++)
for (j = 0; j < 10; j++)
for (p = (int (*)[20][30]) &a[0][j][k], i = 0; i < 10; i++)
*(p++[0][0]) = 1;
}

As you can see, the compiler-generated temporary register variable, p, strides through the array a in the innermost loop. This register pointer variable is initialized outside the innermost loop and auto-incremented within the innermost loop as a side-effect of the pointer dereference.

Software pipelining

Software pipelining transforms code in order to optimize program loops. It achieves this by rearranging the order in which instructions are executed in a loop. Software pipelining generates code that overlaps operations from different loop iterations. It is particularly useful for loops that contain arithmetic operations on REAL*4 and REAL*8 data in Fortran or on float and double data in C or C++.

The goal of this optimization is to avoid processor stalls due to memory or hardware pipeline latencies. The software pipelining transformation partially unrolls a loop and adds code before and after the loop to achieve a high degree of optimization within the loop.

You can enable or disable software pipelining using the +O[no]pipeline command-line option at +O2 and above. The default is +Opipeline. Use +Onopipeline if a smaller program size and faster compile time are more important than faster execution speed. See +O[no]pipeline for more information.

Prerequisites of pipelining

Software pipelining is attempted on a loop that meets the following criteria:

  • It is the innermost loop

  • There are no branches or function calls within the loop

  • The loop is of moderate size

This optimization produces slightly larger program files and increases compile time. It is most beneficial in programs containing loops that are executed many times.

Example 4-9 Software pipelining

The following C/C++ example shows a loop before and after the software pipelining optimization:

#define SIZ 10000
float x[SIZ], y[SIZ];
int i;
init();
for (i = 0;i<= SIZ;i++)
x[i] = x[i] / y[i] + 4.00;

Four significant things happen in this example:

  • A portion of the first iteration of the loop is performed before the loop.

  • A portion of the last iteration of the loop is performed after the loop.

  • The loop is unrolled twice.

  • Operations from different loop iterations are interleaved with each other.

When this loop is compiled with software pipelining, the optimization is expressed as follows:

R1 = 0;Initialize array index
R2 = 4.00;Load constant value
R3 = X[0];Load first X value
R4 = Y[0];Load first Y value
R5 = R3 / R4;Perform division on first element: n = X[0]/Y[0]
  do {Begin loop
       R6 = R1;Save current array index
       R1++;Increment array index
       R7 = X[R1];Load current X value
       R8 = Y[R1];Load current Y value
       R9 = R5 + R2;Perform addition on prior row: X[i] = n + 4.00
       R10 = R7 / R8;Perform division on current row: m = X[i+1]/Y[i+1]
       X[R6] = R9;Save result of operations on prior row
       R6 = R1;Save current array index
       R1++;Increment array index
       R3 = X[R1];Load next X value
       R4 = Y[R1];Load next Y value
       R11 = R10 + R2;Perform addition on current row: X[i+1] = m + 4.00
       R5 = R3 / R4;Perform division on next row: n = X[i+2]/Y[i+2]
       X[R6] = R11;Save result of operations on current row
} while (R1 <= 100);End loop
R9 = R5 + R2;Perform addition on last row: X[i+2] = n + 4.00
X[R6] = R9;Save result of operations on last row

This transformation stores intermediate results of the division instructions in unique registers (noted as n and m). These registers are not referenced until several instructions after the division operations. This decreases the possibility that the long latency period of the division instructions will stall the instruction pipeline and cause processing delays.

Strength reduction of induction variables
and constants

This optimization removes expressions that are linear functions of a loop counter and replaces each of them with a variable that contains the value of the function. Variables of the same linear function are computed only once. This optimization also replaces multiplication instructions with addition instructions wherever possible.

Example 4-10 Strength reduction of induction variables and constants

This example begins with the following C/C++ code:

for (i=0; i<25; i++) {
r[i] = i * k;
}

After this optimization, it looks like this:

t1 = 0;
for (i=0; i<25; i++) {
r[i] = t1;
t1 += k;
}

Store and copy optimization

Where possible, the store and copy optimization substitutes registers for memory locations, by replacing store instructions with copy instructions and deleting load instructions.

Unused definition elimination

The unused definition elimination optimization removes unused memory location and register definitions. These definitions are often a result of transformations made by other optimizations.

Example 4-11 Unused definition elimination

This example begins with the following C/C++ code:

f(int x){
int a,b,c;

a = 1;
b = 2;
c = x * b;
return c;
}

After unused definition elimination, it looks like this:

f(int x) {
int a,b,c;

c = x * 2;
return c;
}

The assignment a = 1 is removed because a is not used after it is defined. Due to another +O2 optimization (constant propagation), the c = x * b statement becomes c = x * 2. The assignment b = 2 is then removed as well.

Printable version
Privacy statement Using this site means you accept its terms Feedback to webmaster
© Hewlett-Packard Development Company, L.P.