| United States-English |
|
|
|
![]() |
Parallel Programming Guide for HP-UX Systems > Chapter 5 Loop
and cross-module optimization featuresLoop blocking |
|
Loop blocking is a combination of strip mining and interchange that maximizes data localization. It is provided primarily to deal with nested loops that manipulate arrays that are too large to fit into the cache. Under certain circumstances, loop blocking allows reuse of these arrays by transforming the loops that manipulate them so that they manipulate strips of the arrays that fit into the cache. Effectively, a blocked loop accesses array elements in sections that are optimally sized to fit in the cache. The loop-blocking optimization is only available at +O3 (and above) in the HP compilers; it is disabled by default. To enable loop blocking, use the +Oloop_block option. Specifying +Onoloop_block (the default) disables both automatic and directive-specified loop blocking. Specifying +Onoloop_transform also disables loop blocking, as well as loop distribution, loop interchange, loop fusion, loop unroll, and loop unroll and jam.Loop blocking can also be enabled for specific loops using the block_loop directive and pragma. The block_loop and no_block_loop directives and pragmas affect the immediately following loop. You can also instruct the compiler to use a specific block factor using block_loop. The no_block_loop directive and pragma disables loop blocking for a particular loop. The forms of these directives and pragmas is shown in Table 5-5 “Forms of block_loop, no_block_loop directives and pragmas”. Table 5-5 Forms of block_loop, no_block_loop directives and pragmas
where
In the absence of the block_factor argument, this directive is useful for indicating which loop in a nest to block. In this case, the compiler uses a heuristic to determine the block factor. Data reuse is important to understand when discussing blocking. There are two types of data reuse associated with loop blocking:
Spatial reuse uses data that was encached as a result of fetching another piece of data from memory; data is fetched by cache lines. 32 bytes of data is encached on every fetch on V2250 servers. Cache line sizes may be different on other HP SMPs. On the initial fetch of array data from memory within a stride-one loop, the requested item is located anywhere in the 32 bytes. The exception is if array is aligned on cache line boundaries. Refer to Chapter 4 “Standard optimization features”, for a description of data alignment. Starting with the cache-aligned memory fetch, the requested data is located at the beginning of the cache line, and the rest of the cache line contains subsequent array elements. For a REAL*4 array, this means the requested element and the seven following elements are encached on each fetch after the first. If any of these seven elements could then be used on any subsequent iterations of the loop, the loop would be exploiting spatial reuse. For loops with strides greater than one, spatial reuse can still occur. However, the cache lines contain fewer usable elements. Temporal reuse uses the same data item on more than one iteration of the loop. An array element whose subscript does not change as a function of the iterations of a surrounding loop exhibits temporal reuse in the context of the loop. Loops that stride through arrays are candidates for blocking when there is also an outermost loop carrying spatial or temporal reuse. Blocking the innermost loop allows data referenced by the outermore loop to remain in the cache across multiple iterations. Blocking exploits spatial reuse by ensuring that once fetched, cache lines are not overwritten until their spatial reuse is exhausted. Temporal reuse is similarly exploited. Example 5-9 Simple loop blocking In order to exploit reuse in more realistic examples that manipulate arrays that do not all fit in the cache, the compiler can apply a blocking transformation. The following Fortran example demonstrates this:
Here the array elements occupy nearly 16 Mbytes of memory. Thus, blocking becomes profitable. First the compiler strip mines the I loop:
IBLOCK is the block factor (also referred to as the strip mine length) the compiler chooses based on the size of the arrays and size of the cache. Note that this example assumes the chosen IBLOCK divides 1000 evenly. Next, the compiler moves the outer strip loop (IOUT) outward as far as possible.
This new nest accesses IBLOCK rows of A and IBLOCK columns of B for every iteration of J. At every iteration of IOUT, the nest accesses 1000 IBLOCK-length columns of A (or an IBLOCK ¥ 1000 chunk of A) and 1000 IBLOCK-width rows of B are accessed. This is illustrated in Figure 5-3 “Blocked array access”. Fetches of A encache the needed element and the three elements that are used in the three subsequent iterations, giving spatial reuse on A. Because the I loop traverses columns of B, fetches of B encache extra elements that are not spatially reused until J increments. IBLOCK is chosen by the compiler to efficiently exploit spatial reuse of both A and B. Figure 5-4 “Spatial reuse of A and B” illustrates how cache lines of each array are fetched. A and B both start on cache line boundaries because they are in COMMON. The shaded area represents the initial cache line fetched.
Typically, IBLOCK elements of C remain in the cache for several iterations of J before being overwritten, giving temporal reuse on C for those iterations. By the time any of the arrays are overwritten, all spatial reuse has been exhausted. The load of D is removed from the I loop so that it remains in a register for all iterations of I. Example 5-10 Matrix multiply blocking The more complicated matrix multiply algorithm, which follows, is a prime candidate for blocking:
This loop is blocked as shown below:
As a result, the following occurs:
An analogous C and C++ example follows with a different resulting interchange:
The HP C and aC++ compilers interchange and block the loop in this example to provide optimal access efficiency for the row-major C and C++ arrays. The blocked loop is shown below:
As you can see, the interchange was done differently because of C and C++’s different array storage strategies. This code yields:
Blocking is inhibited when loop interchange is inhibited. If a candidate loop nest contains loops that cannot be interchanged, blocking is not performed. Example 5-11 Loop blocking The following example shows the affect of the block_loop directive on the code shown earlier in Example 5-10 “Matrix multiply blocking”:
The original example involving this code showed that the compiler
blocks the I and K loops. In this example, the BLOCK_LOOP directive instructs the compiler to use a block
factor of 112 for the K loop. This is an efficient blocking factor for
this example because 112 × 8 bytes = 896 bytes, |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||