Parallel Programming Guide for HP-UX Systems: K-Class and V-Class Servers > Chapter 2 Architecture overview

Memory Systems

HP's K-Class and V-Class servers maintain a single level of memory latency. Memory functions and interleaving work similarly on both servers, as described in the following sections.

Physical memory

Multiple, independently accessible memory banks are available on both the K-Class and V-Class servers. In 16-processor V2250 servers, for example, each node has up to 32 memory banks. The K-Class architecture supports up to four memory banks. This memory is typically partitioned by the system administrator into system-global memory and buffer cache. It is also interleaved, as described in “Interleaving”.

System-global memory is accessible by all processors in a given system. The buffer cache is a file system cache and is used to encache items that have been read from disk and items that are to be written to disk.

Memory interleaving is used to improve performance. For an explanation, see the section “Interleaving”.

Virtual memory

Each process running on a V-Class or K-Class server under HP-UX accesses its own 16-Tbyte virtual address space. Almost all of this space is available to hold program text, data, and the stack. The space used by the operating system is negligible.

The memory stack size is configurable. Refer to the section “Setting thread default stack size” for more information.

Both servers share data among all threads unless a variable is declared to be thread private. Memory class definitions describing data disposition across hypernodes have been retained for the V-Class. This is primarily for potential use when porting to multinode machines.

thread_private

This memory is private to each thread of a process. A thread_private data object has a unique virtual address for each thread. These addresses map to unique physical addresses in hypernode-local physical memory.

node_private

This memory is shared among the threads of a process running on a single node. Since the V-Class and K-Class servers are single-node machines, node_private actually serves as one common shared memory class.

Memory classes are discussed more fully in Chapter 11 “Memory classes”.

Processes cannot access each other's virtual address spaces. This virtual memory maps to the physical memory of the system on which the process is running.

Interleaving

Physical pages are interleaved across the memory banks on a cache-line basis. There are up to 32 banks in the V2250 servers; there are up to four on a K-Class. Contiguous cache lines are assigned in round-robin fashion, first to the even banks, then to the odd, as shown in Figure 2-5 “V2250 interleaving” for V2250 servers.

Interleaving speeds memory accesses by allowing several processors to access contiguous data simultaneously. It also eliminates busy bank and board waits for unit stride accesses. This is beneficial when a loop that manipulates arrays is split among many processors. In the best case, threads access data in patterns with no bank contention. Even in the worst case, in which each thread initially needs the same data from the same bank, after the initial contention delay, the accesses are spread out among the banks.
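The effect of interleaving on unit-stride accesses can be sketched with a small model. This is only an illustration under stated assumptions: the function names are hypothetical, and the model uses plain round-robin assignment (cache line modulo bank count) rather than the even-then-odd ordering described above. Because that ordering only permutes the banks within each round, the per-bank access counts come out the same.

```python
from collections import Counter

NUM_BANKS = 32  # upper limit quoted for a V2250; a K-Class would use 4


def bank_of(cache_line, num_banks=NUM_BANKS):
    """Assumed simple round-robin: consecutive cache lines go to consecutive banks."""
    return cache_line % num_banks


def busiest_bank_load(first_line, stride, count, num_banks=NUM_BANKS):
    """Return the largest number of accesses that land on any single bank."""
    hits = Counter(
        bank_of(first_line + i * stride, num_banks) for i in range(count)
    )
    return max(hits.values())


# 32 consecutive cache lines: every access goes to a different bank.
unit_stride = busiest_bank_load(first_line=0, stride=1, count=32)

# 32 cache lines spaced 32 lines apart: every access goes to the same bank.
bank_stride = busiest_bank_load(first_line=0, stride=32, count=32)

print("busiest bank, unit stride:", unit_stride)
print("busiest bank, stride of 32 lines:", bank_stride)
```

In this model the unit-stride pattern never sends two accesses to the same bank within one round, while a stride equal to the bank count sends all of them to one bank.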

Figure 2-5 V2250 interleaving

The following Fortran example illustrates a nested loop that accesses memory with very little contention. This example is greatly simplified for illustrative purposes, but the concepts apply to arrays of any size.

      REAL*8 A(12,12), B(12,12)
      ...
      DO J = 1, N
        DO I = 1, N
          A(I,J) = B(I,J)
        ENDDO
      ENDDO

Assume that arrays A and B are stored contiguously in memory, with A starting in bank 0, processor cache line 0 for V2250 servers, as shown in Figure 2-6 “V2250 interleaving of arrays A and B”.

You may assume that the HP Fortran 90 compiler parallelizes the J loop to run on as many processors as are available in the system (up to N). Assuming N=12 and there are four processors available when the program is run, the J loop could be divided into four new loops, each with 3 iterations. Each new loop would run to completion on a separate processor. These four processors are identified as CPU0 through CPU3.

NOTE: This example is designed to simplify illustration. In reality, the dynamic selection optimization (discussed in "Dynamic selection" on page 102) would, given the iteration count and available number of processors described, cause this loop to run serially. The overhead of going parallel would outweigh the benefits.

To execute the body of the I loop, A and B must be fetched from memory and encached. Each of the four processors running the J loop attempts to fetch its portion of the arrays at the same time.

This means CPU0 will attempt to read arrays A and B starting at elements (1,1), CPU1 will attempt to start at elements (1,4) and so on.
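The division of the J loop described above can be sketched as follows. This is a minimal illustration, assuming the iterations are handed out as contiguous, near-equal chunks; the source does not specify how the compiler actually schedules iterations, and split_iterations is a hypothetical helper.

```python
def split_iterations(n, num_cpus):
    """Divide iterations 1..n into contiguous chunks, one per CPU.

    Returns a list of (first, last) pairs, 1-based and inclusive,
    matching Fortran loop bounds. When n is not a multiple of
    num_cpus, the leading CPUs each take one extra iteration.
    """
    base, extra = divmod(n, num_cpus)
    chunks = []
    start = 1
    for cpu in range(num_cpus):
        size = base + (1 if cpu < extra else 0)
        chunks.append((start, start + size - 1))
        start += size
    return chunks


# N = 12 iterations of the J loop on four processors, as in the text.
chunks = split_iterations(12, 4)
for cpu, (first, last) in enumerate(chunks):
    print(f"CPU{cpu}: J = {first}..{last}, starts at element (1,{first})")
```

With N=12 and four processors, the starting columns are 1, 4, 7, and 10, which agrees with the starting elements named above.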

Because of the number of memory banks in the V2250 architecture, interleaving removes the contention at the beginning of the loop in this example, as shown in Figure 2-6 “V2250 interleaving of arrays A and B”. Each processor needs the following portions of the arrays:

  • CPU0 needs A(1:12,1:3) and B(1:12,1:3)

  • CPU1 needs A(1:12,4:6) and B(1:12,4:6)

  • CPU2 needs A(1:12,7:9) and B(1:12,7:9)

  • CPU3 needs A(1:12,10:12) and B(1:12,10:12)

The data from the V2250 example above is spread out on different memory banks as described below:

  • A(1,1), the first element of the chunk needed by CPU0, is on cache line 0 in bank 0 on board 0

  • A(1,4), the first element needed by CPU1, is on cache line 9 in bank 1 on board 1

  • A(1,7), the first element needed by CPU2, is on cache line 18 in bank 2 on board 2

  • A(1,10), the first element needed by CPU3, is on cache line 27 in bank 3 on board 3
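The cache-line numbers in the list above follow from the array layout, and a short calculation reproduces them. The sketch below rests on assumptions that the text does not state: a 32-byte cache line (inferred from the listed line numbers), and eight boards with four banks each, with boards cycling fastest. That mapping was chosen only because it reproduces the four placements listed; the actual assignment is the one shown in Figure 2-6.

```python
ELEMENT_BYTES = 8      # REAL*8
ROWS = 12              # leading dimension of A(12,12)
CACHE_LINE_BYTES = 32  # assumption, inferred from the listed cache-line numbers
NUM_BOARDS = 8         # assumption
BANKS_PER_BOARD = 4    # assumption: 8 boards x 4 banks = 32 banks


def cache_line_of(i, j):
    """Cache line holding A(i,j), with column-major storage and
    A(1,1) at byte 0 of cache line 0."""
    byte_offset = ((j - 1) * ROWS + (i - 1)) * ELEMENT_BYTES
    return byte_offset // CACHE_LINE_BYTES


def placement(line):
    """Assumed mapping of a cache line to (bank, board):
    boards cycle fastest, then banks within a board."""
    board = line % NUM_BOARDS
    bank = (line // NUM_BOARDS) % BANKS_PER_BOARD
    return bank, board


# First element fetched by CPU0..CPU3 is A(1,1), A(1,4), A(1,7), A(1,10).
for cpu, j in enumerate((1, 4, 7, 10)):
    line = cache_line_of(1, j)
    bank, board = placement(line)
    print(f"CPU{cpu}: A(1,{j}) -> cache line {line}, bank {bank}, board {board}")
```

Three columns of twelve 8-byte elements occupy 288 bytes, or nine 32-byte cache lines, which is why the starting lines are nine apart.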

Because of interleaving, no contention exists between the processors when trying to read their respective portions of the arrays. Contention may surface occasionally as the processors make their way through the data, but the resulting delays are minimal compared to what could be expected without interleaving.

Figure 2-6 V2250 interleaving of arrays A and B

Variable-sized pages on HP-UX

Variable-sized pages are used to reduce Translation Lookaside Buffer (TLB) misses, improving performance. A TLB is a hardware structure that holds virtual-to-physical address translations. With variable-sized pages, each TLB entry can map a larger portion of an application's virtual address space. Thus, applications with large data sets are mapped using fewer TLB entries, resulting in fewer TLB misses.
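The relationship between page size and the number of translations needed can be put in numbers. The following sketch is illustrative only: the 64-Mbyte data set is a hypothetical figure, and tlb_entries_needed is not an HP-UX interface. It simply counts how many pages, and therefore how many translations, are needed to cover a data set.

```python
KB = 1024
MB = 1024 * KB


def tlb_entries_needed(data_bytes, page_bytes):
    """Number of pages (one translation each) needed to cover the data set."""
    return -(-data_bytes // page_bytes)  # ceiling division


data_set = 64 * MB  # hypothetical application data set

for label, page in (("4K", 4 * KB), ("64K", 64 * KB),
                    ("1 Mbyte", 1 * MB), ("4 Mbytes", 4 * MB)):
    entries = tlb_entries_needed(data_set, page)
    print(f"page size {label}: {entries} translations")
```

Under these assumptions, moving from 4K pages to 4-Mbyte pages reduces the number of translations for the same data set from 16384 to 16.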

Using a different page size does not help if an application is not experiencing performance degradation due to TLB misses. Additionally, if an application uses too large a page size, fewer pages are available to other applications on the system. This potentially results in increased paging activity and performance degradation.

Valid page sizes on the PA-8200 processors are 4K, 16K, 64K, 256K, 1 Mbyte, 4 Mbytes, 16 Mbytes, 64 Mbytes, and 256 Mbytes. The default configurable page size is 4K. Methods for specifying a page size are described below. Note that the user-specified page size only requests a specific size. The operating system takes various factors into account when selecting the page size.

Specifying a page size

The following chatr utility command options allow you to specify information regarding page sizes.

  • +pi affects the page size for the application's text segment

  • +pd affects the page size for the application's data segment

The following configurable kernel parameters allow you to specify information regarding page sizes.

  • vps_pagesize represents the default or minimum page size (in kilobytes) if the user has not used chatr to specify a value. The default is 4 Kbytes.

  • vps_ceiling represents the maximum page size (in kilobytes) if the user has not used chatr to specify a value. The default is 16 Kbytes.

  • vps_chatr_ceiling places a restriction on the largest value (in kilobytes) a user can specify using chatr. The default is 64 Mbytes.
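The constraints listed above can be summarized in a small check. This is a sketch of the stated limits only, not of the kernel's page-size selection, which takes other factors into account. The function name is hypothetical, and the text does not say whether an oversized request is rejected or reduced, so the sketch only reports whether a request falls within the limits.

```python
# Valid PA-8200 page sizes from the text, expressed in kilobytes.
VALID_PAGE_SIZES_KB = (4, 16, 64, 256, 1024, 4096, 16384, 65536, 262144)

DEFAULT_VPS_CHATR_CEILING_KB = 65536  # 64 Mbytes, the stated default


def chatr_request_within_limits(size_kb,
                                vps_chatr_ceiling_kb=DEFAULT_VPS_CHATR_CEILING_KB):
    """True if size_kb is a valid page size no larger than vps_chatr_ceiling."""
    return size_kb in VALID_PAGE_SIZES_KB and size_kb <= vps_chatr_ceiling_kb


print(chatr_request_within_limits(4096))            # 4 Mbytes, default ceiling
print(chatr_request_within_limits(262144))          # 256 Mbytes, default ceiling
print(chatr_request_within_limits(262144, 262144))  # ceiling raised to 256 Mbytes
```

With the default ceiling, a 4-Mbyte request is within limits and a 256-Mbyte request is not, unless vps_chatr_ceiling is raised.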

For more information on the chatr utility, see the chatr(1) man page.

© Hewlett-Packard Development Company, L.P.