Parallel Programming Guide for HP-UX Systems: K-Class and V-Class Servers > Chapter 2 Architecture overview

System architectures


PA-RISC processors communicate with each other, with memory, and with peripherals through various bus configurations. The K-Class and V-Class servers differ in the manner in which they access memory. The K-Class maintains a bus-based configuration, shown in Figure 2-1 “K-Class bus configuration”.

Figure 2-1 K-Class bus configuration

On a V-Class, processors communicate with each other, memory, and peripherals through a nonblocking crossbar. The V-Class implementation is achieved through the Hyperplane Interconnect, shown in Figure 2-2 “V2250 Hyperplane Interconnect view”.

The HP V2250 server has one to 16 PA-8200 processors and 256 Mbytes to 16 Gbytes of physical memory. Two CPUs and a PCI bus share a single CPU agent. The CPUs communicate with the rest of the machine through the CPU agent. The Memory Access Controllers (MACs) provide the interface between the memory banks and the rest of the machine.

CPUs communicate directly with their own instruction and data caches, which are accessed by the processor in one clock (assuming a full pipeline). V2250 servers use 2-Mbyte off-chip instruction caches and data caches.

Figure 2-2 V2250 Hyperplane Interconnect view

Data caches

HP systems use cache to enhance performance. Cache sizes, as well as cache line sizes, vary with the processor used. Data is moved between the cache and memory using cache lines. A cache line describes the size of a chunk of contiguous data that must be copied into or out of a cache in one operation.

When a processor experiences a cache miss—requests data that is not already encached—the cache line containing the address of the requested data is moved to the cache. This cache line also contains a number of other data objects that were not specifically requested.

One reason cache lines are employed is to allow for data reuse. Data in a cache line is subject to reuse if, while the line is encached, any of the data elements contained in the line besides the originally requested element are referenced by the program, or if the originally requested element is referenced more than once.

Because data can only be moved to and from memory as part of a cache line, both load and store operations cause their operands to be encached. When one processor stores to a cache line, cache-coherency hardware, as found on a V2250, invalidates any copies of that line held by other processors. This indicates to the other processors that they must load the cache line from memory the next time they reference its data.
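
The relationship between an address and its cache line can be sketched in C. This is only an illustration, not code from HP: the function names are hypothetical, and the 32-byte line size used in the usage note is the fetch size this chapter gives later for the processor cache.

```c
#include <assert.h>
#include <stddef.h>

/* Byte address of the first byte of the cache line that contains addr.
   Assumption: line_size is a power of two. */
size_t line_base(size_t addr, size_t line_size)
{
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
    return addr & ~(line_size - 1);
}

/* Number of elements of elem_size bytes that share one cache line. */
size_t elems_per_line(size_t line_size, size_t elem_size)
{
    assert(elem_size != 0 && line_size % elem_size == 0);
    return line_size / elem_size;
}
```

With 32-byte lines, a request for the 8-byte element at byte address 40 brings in the line starting at address 32, which holds four such elements; the three that were not requested are the candidates for reuse.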

Data alignment

Aligning data addresses on cache line boundaries allows for efficient data reuse in loops (refer to “Data reuse”). The linker automatically aligns data objects larger than 32 bytes in size on a 32-byte boundary. It also aligns data greater than a page size on a 64-byte boundary.

Only the first item in a list of data objects appearing in any of these statements is aligned on a cache line boundary. To make the most efficient use of available memory, the total size, in bytes, of any array appearing in one of these statements should be an integral multiple of 32.

Sizing your arrays this way prevents data following the first array from becoming misaligned. Scalar variables should be listed after arrays and ordered from longest data type to shortest. For example, REAL*8 scalars should precede REAL*4 scalars.
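
The sizing rule above amounts to rounding an element count up until the array's byte size is a multiple of 32. A minimal C sketch of that arithmetic follows; the function name padded_elems is hypothetical and the code is an illustration, not part of any HP tool.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest element count >= n_elems whose total size in bytes
   (count * elem_size) is a multiple of boundary.
   Assumption: boundary is itself a multiple of elem_size. */
size_t padded_elems(size_t n_elems, size_t elem_size, size_t boundary)
{
    size_t per_boundary;

    assert(elem_size != 0 && boundary % elem_size == 0);
    per_boundary = boundary / elem_size;
    return (n_elems + per_boundary - 1) / per_boundary * per_boundary;
}
```

For example, an array of 1001 REAL*8 elements occupies 8008 bytes, which is not a multiple of 32; padded_elems(1001, 8, 32) gives 1004 elements (8032 bytes), so whatever follows the array stays on a 32-byte boundary.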

You can align data on 64-byte boundaries in either of the following ways. Both apply only to parallel executables:

  • Using Fortran ALLOCATE statements

  • Using the C functions malloc or memory_class_malloc

NOTE: Aliases can inhibit data alignment. Be careful when equivalencing arrays in Fortran.
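
To confirm that a particular pointer really does fall on a 64-byte boundary, a check along the following lines can be used. This is a generic C sketch with hypothetical helper names; it assumes that converting a pointer to uintptr_t yields its byte address, and it makes no claim about what malloc returns on any given system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero if p lies on a multiple of boundary.
   Assumption: boundary is a power of two. */
int is_aligned(const void *p, size_t boundary)
{
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    return ((uintptr_t)p & (boundary - 1)) == 0;
}

/* Bytes from p up to the next multiple of boundary
   (0 if p is already aligned). */
size_t bytes_to_boundary(const void *p, size_t boundary)
{
    size_t past;

    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    past = (size_t)((uintptr_t)p & (boundary - 1));
    return (boundary - past) & (boundary - 1);
}
```

Calling is_aligned(ptr, 64) on the result of an allocation shows whether the data starts on a 64-byte boundary; bytes_to_boundary reports how far off it is when it does not.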

Cache thrashing

Cache thrashing occurs when two or more data items that are frequently needed by the program map to the same cache address. Each time one of the items is encached, it overwrites another needed item, causing cache misses and impairing data reuse. This section explains how thrashing happens on the V-Class.

A type of thrashing known as false cache line sharing is discussed in the section “False cache line sharing”.

Cache thrashing

The following Fortran code provides an example of cache thrashing:

REAL*8 ORIG(131072), NEW(131072), DISP(131072)
COMMON /BLK1/ ORIG, NEW, DISP
.
.
.
DO I = 1, N
  NEW(I) = ORIG(I) + DISP(I)
ENDDO

In this example, the arrays ORIG and DISP overwrite each other in a 2-Mbyte cache. Because the arrays are in a COMMON block, they are allocated in contiguous memory in the order shown. Each array element occupies 8 bytes, so each array occupies one Mbyte (8 × 131072 = 1048576 bytes). Therefore, arrays ORIG and DISP are exactly 2 Mbytes apart in memory, and all their elements have identical cache addresses. The layout of the arrays in memory and in the data cache is shown in Figure 2-3 “Array layouts—cache-thrashing”.
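
The arithmetic behind this collision can be checked with a short C sketch. It is an illustration only: the function names are hypothetical, and the code assumes a direct-mapped cache in which the cache address is simply the byte offset modulo the 2-Mbyte cache size. Offsets are measured from the start of the COMMON block; the comparison between ORIG and DISP does not depend on where the block itself begins.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_BYTES (2UL * 1024 * 1024)  /* 2-Mbyte data cache */
#define ELEM_BYTES  8UL                  /* one REAL*8 element */
#define N_ELEMS     131072UL             /* elements per array */

/* Byte offset, from the start of COMMON /BLK1/, of element i (1-based,
   as in Fortran) of the k-th array in the block
   (k = 0 for ORIG, 1 for NEW, 2 for DISP). */
unsigned long blk1_offset(unsigned long k, unsigned long i)
{
    assert(k <= 2 && i >= 1 && i <= N_ELEMS);
    return k * N_ELEMS * ELEM_BYTES + (i - 1) * ELEM_BYTES;
}

/* Cache address under the direct-mapped assumption stated above. */
unsigned long cache_address(unsigned long offset)
{
    return offset % CACHE_BYTES;
}
```

For any index I, cache_address(blk1_offset(0, I)) equals cache_address(blk1_offset(2, I)), so ORIG(I) and DISP(I) compete for the same place in the cache, while NEW(I) sits one Mbyte away from both.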

Figure 2-3 Array layouts—cache-thrashing

When the addition in the body of the loop executes, the current elements of both ORIG and DISP must be fetched from memory into the cache. Because these elements have identical cache addresses, whichever is fetched last overwrites the first. Processor cache data is fetched 32 bytes at a time.

To efficiently execute a loop such as this, the unused elements in the fetched cache line (three extra REAL*8 elements are fetched in this case) must remain encached until they are used in subsequent iterations of the loop. Because ORIG and DISP thrash each other, this reuse is never possible. Every cache line of ORIG that is fetched is overwritten by the cache line of DISP that is subsequently fetched, and vice versa. The cache line is overwritten on every iteration. Typically, in a loop like this, it would not be overwritten until all of its elements were used.

Because memory accesses take substantially longer than cache accesses, this thrashing severely degrades performance. Even if the overwriting involved the NEW array, which is stored rather than loaded on each iteration, thrashing would occur, because stores overwrite entire cache lines the same way loads do.

The problem is easily fixed by increasing the distance between the arrays. You can accomplish this by either increasing the array sizes or inserting a padding array.

Cache padding

The following Fortran example illustrates cache padding:

REAL*8 ORIG(131072), NEW(131072), P(4), DISP(131072)
COMMON /BLK1/ ORIG, NEW, P, DISP
.
.
.

In this example, the array P(4) moves DISP 32 bytes further from ORIG in memory. No two elements of the same index share a cache address. This postpones cache overwriting for the given loop until the entire current cache line is completely exploited.

The alternate approach involves increasing the size of ORIG or NEW by 4 elements (32 bytes), as shown in the following example:

REAL*8 ORIG(131072), NEW(131076), DISP(131072)
COMMON /BLK1/ ORIG, NEW, DISP
.
.
.

Here, NEW has been increased by 4 elements, providing the padding necessary to prevent ORIG from sharing cache addresses with DISP. Figure 2-4 “Array layouts—non-thrashing” shows how both solutions prevent thrashing.
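
The effect of the padding can be estimated with a small simulation, sketched here in C. It is a simplified model, not a description of the actual hardware: it assumes a direct-mapped 2-Mbyte cache with 32-byte lines, assumes that stores allocate a line just as loads do, and ignores every other memory reference in the program. The names touch and count_misses are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CACHE_BYTES (2UL * 1024 * 1024)        /* 2-Mbyte data cache */
#define LINE_BYTES  32UL                       /* bytes per cache line */
#define N_LINES     (CACHE_BYTES / LINE_BYTES) /* slots in the cache */
#define ELEM_BYTES  8UL                        /* one REAL*8 element */
#define N_ELEMS     131072UL                   /* loop trip count */

/* Reference one byte offset. tags[slot] holds (memory line number + 1)
   for the line currently in that slot, or 0 if the slot is empty.
   Returns 1 on a miss (and loads the line), 0 on a hit. */
static int touch(unsigned long *tags, unsigned long offset)
{
    unsigned long line = offset / LINE_BYTES;
    unsigned long slot = line % N_LINES;

    if (tags[slot] == line + 1)
        return 0;
    tags[slot] = line + 1;
    return 1;
}

/* Count misses for NEW(I) = ORIG(I) + DISP(I), I = 1..N_ELEMS, when
   pad_elems extra REAL*8 elements separate NEW from DISP (either a
   padding array or a larger NEW; both give the same offsets). */
unsigned long count_misses(unsigned long pad_elems)
{
    unsigned long *tags = calloc(N_LINES, sizeof *tags);
    unsigned long orig = 0;
    unsigned long new_ = N_ELEMS * ELEM_BYTES;
    unsigned long disp = new_ + (N_ELEMS + pad_elems) * ELEM_BYTES;
    unsigned long misses = 0, i;

    assert(tags != NULL);
    for (i = 0; i < N_ELEMS; i++) {
        misses += touch(tags, orig + i * ELEM_BYTES);  /* load ORIG(I) */
        misses += touch(tags, disp + i * ELEM_BYTES);  /* load DISP(I) */
        misses += touch(tags, new_ + i * ELEM_BYTES);  /* store NEW(I) */
    }
    free(tags);
    return misses;
}
```

Under these assumptions, count_misses(0) reports 294912 misses (every ORIG and DISP reference misses, and NEW misses once per line), while count_misses(4) reports 98304, one miss per four elements for each of the three arrays.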

Figure 2-4 Array layouts—non-thrashing

Note that this is a highly simplified, worst-case example.

Loop blocking optimization (described in “Loop blocking”) eliminates thrashing from certain nested loops, but not from all loops. Declaring arrays with dimensions that are not powers of two can help, but it does not completely eliminate the problem.

Using COMMON blocks in Fortran can also help because it allows you to accurately measure distances between data items, making thrashing problems easier to spot before they happen.
