Parallel Programming Guide for HP-UX Systems: K-Class and V-Class Servers > Chapter 2 Architecture overview

Memory Systems
HP's K-Class and V-Class servers maintain a single level of memory latency. Memory functions and interleaving work similarly on both servers, as described in the following sections.

Multiple, independently accessible memory banks are available on both the K-Class and V-Class servers. In 16-processor V2250 servers, for example, each node has up to 32 memory banks. This memory is typically partitioned (by the system administrator) into system-global memory and buffer cache. It is also interleaved, as described in “Interleaving”. The K-Class architecture supports up to four memory banks.

System-global memory is accessible by all processors in a given system. The buffer cache is a file system cache and is used to encache items that have been read from disk and items that are to be written to disk.

Memory interleaving is used to improve performance. For an explanation, see the section “Interleaving”.

Each process running on a V-Class or K-Class server under
The memory stack size is configurable. Refer to the section “Setting thread default stack size” for more information. Both servers share data among all threads unless a variable is declared to be thread private. Memory class definitions describing data disposition across hypernodes have been retained for the V-Class. This is primarily for potential use when porting to multinode machines.
Memory classes are discussed more fully in Chapter 11 “Memory classes”.

Processes cannot access each other's virtual address spaces. This virtual memory maps to the physical memory of the system on which the process is running.

Physical pages are interleaved across the memory banks on a cache-line basis. There are up to 32 banks in the V2250 servers; there are up to four on a K-Class. Contiguous cache lines are assigned in round-robin fashion, first to the even banks, then to the odd, as shown in Figure 2-5 “V2250 interleaving” for V2250 servers.

Interleaving speeds memory accesses by allowing several processors to access contiguous data simultaneously. It also eliminates busy bank and board waits for unit stride accesses. This is beneficial when a loop that manipulates arrays is split among many processors. In the best case, threads access data in patterns with no bank contention. Even in the worst case, in which each thread initially needs the same data from the same bank, after the initial contention delay, the accesses are spread out among the banks.

Interleaving

The following Fortran example illustrates a nested loop that accesses memory with very little contention. This example is greatly simplified for illustrative purposes, but the concepts apply to arrays of any size.
Assume that arrays A and B are stored contiguously in memory, with A starting in bank 0, processor cache line 0 for V2250 servers, as shown in Figure 2-6 “V2250 interleaving of arrays A and B”. You may assume that the HP Fortran 90 compiler parallelizes the J loop to run on as many processors as are available in the system (up to N). Assuming N=12 and there are four processors available when the program is run, the J loop could be divided into four new loops, each with 3 iterations. Each new loop would run to completion on a separate processor. These four processors are identified as CPU0 through CPU3.
In order to execute the body of the I loop, A and B must be fetched from memory and encached. Each of the four processors running the J loop attempts to simultaneously fetch its portion of the arrays. This means CPU0 will attempt to read arrays A and B starting at elements (1,1), CPU1 will attempt to start at elements (1,4), and so on. Because of the number of memory banks in the V2250 architecture, interleaving removes the contention at the beginning of the loop in the example, as shown in Figure 2-6 “V2250 interleaving of arrays A and B”.
The data from the V2250 example above is spread out on different memory banks as described below:
Because of interleaving, no contention exists between the processors when trying to read their respective portions of the arrays. Contention may surface occasionally as the processors make their way through the data, but the resulting delays are minimal compared to what could be expected without interleaving.

Variable-sized pages are used to reduce Translation Lookaside Buffer (TLB) misses, improving performance. A TLB is a hardware entity that holds virtual-to-physical address translations. With variable-sized pages, each TLB entry used can map a larger portion of an application's virtual address space. Thus, applications with large data sets are mapped using fewer TLB entries, resulting in fewer TLB misses.

Using a different page size does not help if an application is not experiencing performance degradation due to TLB misses. Additionally, if an application uses too large a page size, fewer pages are available to other applications on the system. This potentially results in increased paging activity and performance degradation.

Valid page sizes on the PA-8200 processors include 4K, 16K, 64K, and 256K.

The following chatr utility command options allow you to specify information regarding page sizes.
The following configurable kernel parameters allow you to specify information regarding page sizes.
For more information on the chatr utility, see the chatr(1) man page.