Loop fusion involves creating one loop out of two or more
neighboring loops that have identical loop bounds and trip counts.
This reduces loop overhead, memory accesses, and increases register
usage. It can also lead to other optimizations. By potentially reducing
the number of parallelizable loops in a program and increasing the
amount of work in each of those loops, loop fusion can greatly reduce
parallelization overhead. Because fewer spawns and joins are necessary.
Loop fusion takes place at +O3 and above and is enabled by default. Specifying +Onoloop_transform disables loop fusion, as well as loop distribution,
loop interchange, loop blocking, loop unroll, and loop unroll
and jam.
Occasionally, loops that do not appear to be fusible become
fusible as a result of compiler transformations that precede fusion.
For instance, interchanging a loop may make it suitable for fusing
with another loop.
Loop fusion is especially beneficial when applied to Fortran
array assignments. The compiler translates these statements into
loops; when such loops do not contain code that inhibit fusion,
they are fused.
Example 5-13 Loop
fusion
This example begins with the following Fortran code:
DO I = 1, N A(I) = B(I) + C(I) ENDDO DO J = 1, N IF(A(J) .LT. 0) A(J) = B(J)*B(J) ENDDO |
The two loops shown above are fused into the following loop
using loop fusion:
DO I = 1, N A(I) = B(I) + C(I) IF(A(I) .LT. 0) A(I) = B(I)*B(I) ENDDO |
Example 5-14 Loop
fusion
This example begins with the following Fortran code:
REAL A(100,100), B(100,100), C(100,100) . . . C = 2.0 * B A = A + B |
The compiler first transforms these Fortran array assignments
into loops, generating code similar to that shown below.
DO TEMP1 = 1, 100 DO TEMP2 = 1, 100 C(TEMP2, TEMP1) = 2.0 * B(TEMP2, TEMP1) ENDDO ENDDO DO TEMP3 = 1, 100 DO TEMP4 = 1, 100 A(TEMP4,TEMP3)=A(TEMP4,TEMP3)+B(TEMP4,TEMP3) ENDDO ENDDO |
These two loops would then be fused as shown in the following
loop nest:
DO TEMP1 = 1, 100 DO TEMP2 = 1, 100 C(TEMP2,TEMP1) = 2.0 * B(TEMP2, TEMP1) A(TEMP2,TEMP1)=A(TEMP2,TEMP1)+B(TEMP2,TEMP1) ENDDO ENDDO |
Further optimizations could be applied to this new nest as
appropriate.
Example 5-15 Loop
peeling
When trip counts of adjacent loops differ by only a single
iteration (+1 or -1), the compiler may peel an iteration
from one of the two loops so that the loops may then be fused. The
peeled iteration is performed separately from the original loop.
The following Fortran example shows how this is implemented:
DO I = 1, N-1 A(I) = I ENDDO DO J = 1, N A(J) = A(J) + 1 ENDDO |
As you can see, the Nth iteration of the J loop is peeled, resulting in a trip count of N - 1. The Nth iteration is performed outside the J loop. As a result, the code is changed to the
following:
DO I = 1, N-1 A(I) = I ENDDO DO J = 1, N-1 A(J) = A(J) + 1 ENDDO A(N) = A(N) + 1 |
The I and J loops now have the same trip count and are fused,
as shown below:
DO I = 1, N-1 A(I) = I A(I) = A(I) + 1 ENDDO A(N) = A(N) + 1 |