| United States-English |
|
|
|
![]() |
Parallel Programming Guide for HP-UX Systems: K-Class and V-Class Servers > Chapter 2 Architecture overviewSystem architectures |
|
PA-RISC processors communicate with each other, with memory, and with peripherals through various bus configuration. The difference between the K-Class and V-Class servers are presented by the manner in which they access memory. The K-Class maintains a bus-based configuration, shown in Figure 2-1 “K-Class bus configuration”. On a V-Class, processors communicate with each other, memory, and peripherals through a nonblocking crossbar. The V-Class implementation is achieved through the Hyperplane Interconnect, shown in Figure 2-2 “V2250 Hyperplane Interconnect view”. The HP V2250 server has one to 16 PA-8200 processors and 256 Mbytes to 16 Gbytes of physical memory. Two CPUs and a PCI bus share a single CPU agent. The CPUs communicate with the rest of the machine through the CPU agent. The Memory Access Controllers (MACs) provide the interface between the memory banks and the rest of the machine. CPUs communicate directly with their own instruction and data caches, which are accessed by the processor in one clock (assuming a full pipeline). V2250 servers use 2-Mbyte off-chip instruction caches and data caches. HP systems use cache to enhance performance. Cache sizes, as well as cache line sizes, vary with the processor used. Data is moved between the cache and memory using cache lines. A cache line describes the size of a chunk of contiguous data that must be copied into or out of a cache in one operation. When a processor experiences a cache miss—requests data that is not already encached—the cache line containing the address of the requested data is moved to the cache. This cache line also contains a number of other data objects that were not specifically requested. One reason cache lines are employed is to allow for data reuse. Data in a cache line is subject to reuse if, while the line is encached, any of the data elements contained in the line besides the originally requested element are referenced by the program, or if the originally requested element is referenced more than once. Because data can only be moved to and from memory as part of a cache line, both load and store operations cause their operands to be encached. Cache-coherency hardware, as found on a V2250, invalidates cache lines in other processors when they are stored to by a particular processor. This indicates to other processors that they must load the cache line from memory the next time they reference its data. Aligning data addresses on cache line boundaries allows for
efficient data reuse in loops (refer to “Data reuse”). The linker automatically aligns data
objects larger than 32 bytes in size on Only the first item in a list of data objects appearing in
any of these statements is aligned on a cache line boundary. To
make the most efficient use of available memory, the total size,
in bytes, of any array appearing in one of these statements should
be an integral multiple Sizing your arrays this way prevents data following the first array from becoming misaligned. Scalar variables should be listed after arrays and ordered from longest data type to shortest. For example, REAL*8 scalars should precede REAL*4 scalars. You can align data on 64-byte boundaries by doing the following. These apply only to parallel executables:
Cache thrashing occurs when two or more data items that are frequently needed by the program both map to the same cache address. Each time one of the items is encached, it overwrites another needed item, causing cache misses and impairing data reuse. This section explains how thrashing happens on the V-Class. A type of thrashing known as false cache line sharing is discussed in the section “False cache line sharing”. Cache thrashing The following Fortran example provides an example of cache thrashing:
In this example, the arrays ORIG
and DISP overwrite each other in
When the addition in the body of the loop executes, the current elements of both ORIG and DISP must be fetched from memory into the cache. Because these elements have identical cache addresses, whichever is fetched last overwrites the first. Processor cache data is fetched 32 bytes at a time. To efficiently execute a loop such as this, the unused elements in the fetched cache line (three extra REAL*8 elements are fetched in this case) must remain encached until they are used in subsequent iterations of the loop. Because ORIG and DISP thrash each other, this reuse is never possible. Every cache line of ORIG that is fetched is overwritten by the cache line of DISP that is subsequently fetched, and vice versa. The cache line is overwritten on every iteration. Typically, in a loop like this, it would not be overwritten until all of its elements were used. Memory accesses take substantially longer than cache accesses, which severely degrades performance. Even if the overwriting involved the NEW array, which is stored rather than loaded on each iteration, thrashing would occur, because stores overwrite entire cache lines the same way loads do. The problem is easily fixed by increasing the distance between the arrays. You can accomplish this by either increasing the array sizes or inserting a padding array. Cache padding The following Fortran example illustrates cache padding:
In this example, the array P(4) moves DISP 32 bytes further from ORIG in memory. No two elements of the same index share a cache address. This postpones cache overwriting for the given loop until the entire current cache line is completely exploited. The alternate approach involves increasing the size of ORIG or NEW by 4 elements (32 bytes), as shown in the following example:
Here, NEW has been increased by 4 elements, providing the padding necessary to prevent ORIG from sharing cache addresses with DISP. Figure 2-4 “Array layouts—non-thrashing” shows how both solutions prevent thrashing. It is important to note that this is a highly simplified, worst-case example. Loop blocking optimization (described in “Loop blocking”) eliminates thrashing from certain nested loops, but not from all loops. Declaring arrays with dimensions that are not powers of two can help, but it does not completely eliminate the problem. Using COMMON blocks in Fortran can also help because it allows you to accurately measure distances between data items, making thrashing problems easier to spot before they happen. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||