Loop fusion involves creating one loop out of two or more neighboring
loops that have identical loop bounds and trip counts. This reduces
loop overhead, memory accesses, and increases
register usage. It can also lead to other optimizations. By potentially
reducing the number of parallelizable loops in a program and increasing
the amount of work in each of those loops, loop fusion can greatly
reduce parallelization overhead. Because fewer spawns and joins
are necessary.
Loop
fusion takes place at +O3 and above and is enabled by default. Specifying
+Onoloop_transform
disables loop fusion, as well as loop distribution, loop
interchange, loop blocking, loop unroll, and loop unroll
and jam.
Occasionally, loops that do not appear to be fusible become
fusible as a result of compiler transformations that precede fusion.
For instance, interchanging a loop may make it suitable for fusing
with another loop.
Loop fusion is especially beneficial when applied to Fortran 90
array assignments. The compiler translates these statements into
loops; when such loops do not contain code that inhibit fusion,
they are fused.
Loop fusion
This example begins with the following Fortran code:
DO I = 1, N A(I) = B(I) + C(I) ENDDO DO J = 1, N IF(A(J) .LT. 0) A(J) = B(J)*B(J) ENDDO |
The two loops shown above are fused into the following loop
using loop fusion:
DO I = 1, N A(I) = B(I) + C(I) IF(A(I) .LT. 0) A(I) = B(I)*B(I) ENDDO |
Loop fusion
This example begins with the following Fortran code:
REAL A(100,100), B(100,100), C(100,100) . . . C = 2.0 * B A = A + B |
The compiler first transforms these Fortran array assignments
into loops, generating code similar to that shown below.
DO TEMP1 = 1, 100 DO TEMP2 = 1, 100 C(TEMP2, TEMP1) = 2.0 * B(TEMP2, TEMP1) ENDDO ENDDO DO TEMP3 = 1, 100 DO TEMP4 = 1, 100 A(TEMP4,TEMP3)=A(TEMP4,TEMP3)+B(TEMP4,TEMP3) ENDDO ENDDO |
These two loops would then be fused as shown in the following
loop nest:
DO TEMP1 = 1, 100 DO TEMP2 = 1, 100 C(TEMP2,TEMP1) = 2.0 * B(TEMP2, TEMP1) A(TEMP2,TEMP1)=A(TEMP2,TEMP1)+B(TEMP2,TEMP1) ENDDO ENDDO |
Further optimizations could be applied to this new nest as
appropriate.
Loop peeling
When trip counts of adjacent loops differ by only a single
iteration (+1 or -1), the compiler may peel an
iteration from one of the two loops so that the loops may then be
fused. The peeled iteration is performed separately from the original
loop.
The following Fortran example shows how this is implemented:
DO I = 1, N-1 A(I) = I ENDDO DO J = 1, N A(J) = A(J) + 1 ENDDO |
As you can see, the Nth iteration
of the J loop is peeled, resulting
in a trip count of N - 1. The Nth
iteration is performed outside the J
loop. As a result, the code is changed to the following:
DO I = 1, N-1 A(I) = I ENDDO DO J = 1, N-1 A(J) = A(J) + 1 ENDDO A(N) = A(N) + 1 |
The I and J
loops now have the same trip count and are fused, as shown below:
DO I = 1, N-1 A(I) = I A(I) = A(I) + 1 ENDDO A(N) = A(N) + 1 |