Chapter 6. Run-time Tuning

This chapter discusses ways in which the user can tune the run-time environment to improve the performance of an MPI message passing application on SGI computers. None of these ways involve application code changes.

Reducing Run-time Variability

One of the most common problems with optimizing message passing codes on large shared memory computers is achieving reproducible timings from run to run. To reduce run-time variability, you can take the following precautions:

  • Do not oversubscribe the system. In other words, do not request more CPUs than are available and do not request more memory than is available. Oversubscribing causes the system to wait unnecessarily for resources to become available and leads to variations in the results and less than optimal performance.

  • Avoid interference from other system activity. Both the Linux and IRIX kernels use more memory on node 0 than on other nodes; node 0 is called the kernel node in the following discussion. If your application uses almost all of the available memory per processor, the memory for processes assigned to the kernel node can unintentionally spill over to nonlocal memory. By keeping user applications off the kernel node, you can avoid this effect. Additionally, by restricting system daemons to run on the kernel node, you can deliver an additional percentage of each application CPU to the user. One solution IRIX provides for this problem is the boot_cpuset(4) capability, which allows creation of a cpuset that contains the init process and all of its descendants, effectively preventing system functions from interfering with batch jobs running on the rest of the machine.

  • Avoid interference with other applications. You can use cpusets (on IRIX) or cpumemsets (on Linux) to effectively partition a large, distributed memory host in a fashion that minimizes interactions between jobs running concurrently on the system.

  • On a quiet, dedicated system, you can use dplace or the MPI_DSM_CPULIST shell variable to improve run-time performance repeatability. These approaches are not as suitable for shared, nondedicated systems.

  • Use a batch scheduler; for example, LSF from Platform Computing or PBSpro from Veridian. These batch schedulers use cpusets to avoid oversubscribing the system and to prevent possible interference between applications.

Tuning MPI Buffer Resources

By default, the SGI MPI implementation buffers messages whose lengths exceed 64 bytes. These longer messages are buffered in a shared memory region to allow for exchange of data between MPI processes. In the SGI MPI implementation, these buffers are divided into two basic pools. For messages exchanged between MPI processes within the same host, buffers from the “per process” pool (called the “per proc” pool) are used. Each MPI process is allocated a fixed portion of this pool when the application is launched, and each of these portions is logically partitioned into 16-KB buffers. For MPI jobs running across multiple hosts, a second pool of shared memory is available. Messages exchanged between MPI processes on different hosts use this pool of shared memory, called the “per host” pool. The structure of this pool is somewhat more complex than that of the “per proc” pool.

For an MPI job running on a single host, messages that exceed 64 bytes are handled as follows. For messages with a length of 16 KB or less, the sender MPI process buffers the entire message. It then delivers a message header (also called a control message) to a mailbox, which is polled by the MPI receiver when an MPI call is made. Upon finding a matching receive request for the sender's control message, the receiver copies the data out of the shared memory buffer into the application buffer indicated in the receive request. The receiver then sends a message header back to the sender process, indicating that the shared memory buffer is available for reuse. Messages whose length exceeds 16 KB are broken down into 16-KB chunks, allowing the sender and receiver to overlap the copying of data to and from shared memory in a pipeline fashion.

Because the number of these shared memory buffers is finite, they can constrain overall application performance for certain communication patterns. You can use the MPI_BUFS_PER_PROC shell variable to adjust the number of buffers available for the “per proc” pool. Similarly, you can use the MPI_BUFS_PER_HOST shell variable to adjust the “per host” pool. You can use the MPI statistics counters to determine whether retries for these shared memory buffers are occurring. For information on the use of these counters, see “MPI Internal Statistics” in Chapter 5. In general, you can avoid excessive numbers of retries for buffers by increasing the number of buffers in the “per proc” or “per host” pool. Keep in mind, however, that increasing the number of buffers consumes more memory. Also, increasing the number of “per proc” buffers potentially increases the probability of cache pollution (that is, the excessive filling of the cache with message buffers). Cache pollution can result in degraded performance during the compute phase of a message passing application.
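
For example, the following C shell commands (Bourne-style shells use export instead of setenv) increase both buffer pools before launching a job. This is a minimal sketch; the values shown are illustrative only and should be chosen based on the buffer retry counts reported by the MPI statistics counters.

    setenv MPI_BUFS_PER_PROC 64    # more "per proc" buffers for intrahost messages
    setenv MPI_BUFS_PER_HOST 96    # more "per host" buffers for interhost messages
    mpirun -np 16 a.out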

There are additional buffering considerations to take into account when running an MPI job across multiple hosts. For further discussion of multihost runs, see “Tuning for Running Applications Across Multiple Hosts”.

For further discussion on programming implications concerning message buffering, see “Buffering” in Chapter 3.

Avoiding Message Buffering - Enabling Single Copy

For message transfers between MPI processes within the same host or transfers between partitions, it is possible under certain conditions to avoid the need to buffer messages. Because many MPI applications are written assuming infinite buffering, the use of this unbuffered approach is not enabled by default. This section describes how to activate this mechanism.

Using Global Memory for Single Copy Optimization

On IRIX systems, when global memory is used for single copy optimization, the sender's message data must reside in globally accessible memory. Globally accessible memory includes common block or static memory and memory allocated with the Fortran 90 allocate statement or MPI_Alloc_mem (with the SMA_GLOBAL_ALLOC environment variable set). In addition, applications linked against the SHMEM library can also access the LIBSMA symmetric heap via the shpalloc or shmalloc functions. Consequently, use of this feature might require changes to the application. Additional restrictions are described in “Avoiding Message Buffering -- Single Copy Methods” in Chapter 3.

The threshold for message lengths beyond which MPI attempts to use this single copy method is specified by the MPI_BUFFER_MAX shell variable. Its value should be set to the message length in bytes beyond which the single copy method should be tried. In general, a value of 2048 or higher is beneficial for most applications running on a single host.
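
As a minimal C shell sketch, the following commands try the single copy method for messages longer than 2048 bytes, the illustrative threshold suggested above:

    setenv MPI_BUFFER_MAX 2048    # attempt single copy for messages longer than 2048 bytes
    mpirun -np 8 a.out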

Using the XPMEM Driver for Single Copy Optimization

MPI can take advantage of the XPMEM driver, a special cross-partition device driver available on both IRIX and Linux systems that allows the operating system to copy data between two processes within the same host or across partitions.

On systems running IRIX, this feature requires IRIX 6.5.13 or greater and is available only on Origin 300 and 3000 series servers. This option is not available on servers running Trusted IRIX. The MPI library uses the XPMEM driver to enhance single copy optimization (within a host), eliminating some of the restrictions at only a slight (less than 5 percent) performance cost compared with the more restrictive single copy optimization that uses globally accessible memory. You can enable this optimization by setting the MPI_XPMEM_ON and MPI_BUFFER_MAX environment variables. Note that if the sender data resides in globally accessible memory, the data is copied using a bcopy process; otherwise, the XPMEM driver is used to transfer the data. The XPMEM form of single copy is less restrictive in that the sender's data is not required to be globally accessible. It is available for ABI N32 as well as ABI 64. This optimization also can be used to transfer data between two different executable files on the same host or two different executable files across IRIX partitions.


Note: Use of the XPMEM driver disables the ability to checkpoint/restart an MPI job.


On IRIX systems, under certain conditions, the XPMEM driver can take advantage of the block transfer engine (BTE) to provide increased bandwidth. In addition to having MPI_BUFFER_MAX and MPI_XPMEM_ON set, the send and receive buffers must be cache-aligned and the amount of data to transfer must be greater than or equal to MPI_XPMEM_THRESHOLD. The default value for MPI_XPMEM_THRESHOLD is 8192.
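
The following C shell sketch enables the XPMEM single copy path and raises the BTE threshold. The threshold value is illustrative, and it is assumed that the library only requires MPI_XPMEM_ON to be set (its value is not significant):

    setenv MPI_XPMEM_ON 1              # enable XPMEM single copy; value itself is not significant
    setenv MPI_BUFFER_MAX 2048         # single copy threshold, in bytes
    setenv MPI_XPMEM_THRESHOLD 16384   # minimum transfer size for the BTE (default is 8192)
    mpirun -np 8 a.out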

On systems running Linux, use of the XPMEM driver is required to support single-copy message transfers between two processes within the same host or across partitions. On Linux systems, during job startup, MPI uses the XPMEM driver (via the xpmem kernel module) to map memory from one MPI process onto another. The mapped areas include the static region, private heap, and stack region of each process. Memory mapping allows each process to directly access memory from the address space of another process. This technique allows MPI to support single copy transfers for contiguous data types from any of these mapped regions. For these transfers, whether between processes residing on the same host or across partitions, the data is copied using a bcopy process. A bcopy process is also used to transfer data between two different executable files on the same host or two different executable files across partitions. For data residing outside of a mapped region (a /dev/zero region, for example), MPI uses the XPMEM driver to copy the data.

Memory mapping is enabled by default on Linux. To disable it, set the MPI_MEMMAP_OFF environment variable. Memory mapping must be enabled to allow single-copy transfers, MPI-2 one-sided communication, and certain collective optimizations.

Memory Placement and Policies

The MPI library takes advantage of NUMA placement functions that are available on IRIX and Linux systems. Usually, the default placement is adequate. Under certain circumstances, however, you might want to modify this default behavior. The easiest way to do this is by setting one or more MPI placement shell variables. Several of the most commonly used of these variables are described in the following sections. For a complete listing of memory placement related shell variables, see the MPI man page.

MPI_DSM_CPULIST

The MPI_DSM_CPULIST shell variable allows you to manually select processors to use for an MPI application. At times, specifying a list of processors on which to run a job can be the best means to ensure highly reproducible timings, particularly when running on a dedicated system.

This setting is treated as a comma- and/or hyphen-delimited ordered list, specifying a mapping of MPI processes to CPUs. If running across multiple hosts, the per-host components of the CPU list are delimited by colons.


Note: This feature will not be compatible with job migration features available in future IRIX releases. In addition, this feature should not be used with MPI applications that use either of the MPI-2 spawn related functions.


Examples of settings are as follows:

Value         CPU Assignment

8,16,32       Place three MPI processes on CPUs 8, 16, and 32.

32,16,8       Place MPI process 0 on CPU 32, process 1 on CPU 16, and process 2 on CPU 8.

8-15,32-39    Place MPI processes 0 through 7 on CPUs 8 to 15 and processes 8 through 15 on CPUs 32 to 39.

39-32,8-15    Place MPI processes 0 through 7 on CPUs 39 to 32 and processes 8 through 15 on CPUs 8 to 15.

8-15:16-23    Place MPI processes 0 through 7 on CPUs 8 through 15 of the first host and processes 8 through 15 on CPUs 16 to 23 of the second host.

Note that the process rank is the MPI_COMM_WORLD rank. The interpretation of the CPU values specified in MPI_DSM_CPULIST depends on whether the MPI job is being run within a cpuset. If the job is run outside of a cpuset, the CPU values specify cpunum values given in the hardware graph (hwgraph(4)). When running within a cpuset, the default behavior is to interpret the CPU values as relative processor numbers within the cpuset. To specify cpunum values instead, you can use the MPI_DSM_CPULIST_TYPE shell variable (see the MPI(1) man page).

The number of processors specified should equal the number of MPI processes that will be used to run the application. The number of colon-delimited parts of the list must equal the number of hosts used for the MPI job. If an error occurs in processing the CPU list, the default placement policy is used. To ensure binding of the MPI processes to the designated processors, you should also set the MPI_DSM_MUSTRUN shell variable.
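
For example, the following C shell commands pin an 8-process job to CPUs 8 through 15 on a dedicated host. The CPU numbers are illustrative, and it is assumed that setting MPI_DSM_MUSTRUN to any value is sufficient:

    setenv MPI_DSM_CPULIST 8-15    # ranks 0 through 7 on CPUs 8 through 15
    setenv MPI_DSM_MUSTRUN 1       # require the processes to run where they are placed
    mpirun -np 8 a.out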

MPI_DSM_MUSTRUN

Use the MPI_DSM_MUSTRUN shell variable to ensure that each MPI process gets a physical CPU and memory on the node to which it was assigned. Using this shell variable has been observed to improve performance, especially on IRIX systems running version 6.5.7 and earlier.

On Linux systems, if this environment variable is used without specifying an MPI_DSM_CPULIST variable, it will cause MPI to assign MPI ranks starting at logical CPU 0 and incrementing until all ranks have been placed. On Linux systems, therefore, it is recommended that this variable be used only if running within a cpumemset or on a dedicated system.

MPI_DSM_PPM

The MPI_DSM_PPM shell variable allows you to specify the number of MPI processes to be placed on a node. Memory bandwidth intensive applications can benefit from placing fewer MPI processes on each node of a distributed memory host. On Origin 200 and Origin 2000 series servers, the default is to place two MPI processes on each node. On Origin 300 and Origin 3000 series servers, the default is four MPI processes per node. You can use the MPI_DSM_PPM shell variable to change these values. On Origin 300 and Origin 3000 series servers, setting MPI_DSM_PPM to 2 places one MPI process on each memory bus.
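
For example, to spread a 16-process job across more nodes of an Origin 3000 series server by placing two MPI processes per node (one per memory bus), a C shell launch might look like the following sketch:

    setenv MPI_DSM_PPM 2     # two MPI processes per node instead of the default four
    mpirun -np 16 a.out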


Note: Currently, this shell variable is not available on Linux systems.


MPI_DSM_VERBOSE

Setting the MPI_DSM_VERBOSE shell variable directs MPI to display a synopsis of the NUMA placement options being used at run time.
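
For example (C shell syntax; setting the variable to any value is assumed to be sufficient):

    setenv MPI_DSM_VERBOSE 1    # print a synopsis of the NUMA placement at job startup
    mpirun -np 4 a.out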

PAGESIZE_DATA and PAGESIZE_STACK

You can use the PAGESIZE_DATA and PAGESIZE_STACK variables to request nondefault page sizes (in kilobytes). Setting these variables can be helpful for applications that experience frequent TLB misses. You can ascertain this condition by using the ssrun or perfex profiling tools. However, these variables should be used with caution. Generally, system administrators do not configure the system to have many large pages per node. If very large page sizes are requested, you might lose good memory locality if the operating system is able to satisfy the large page request only with remote memory.
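
As a sketch, the following C shell commands request 64-KB data and stack pages. The value 64 is illustrative only and must correspond to a page size that the system is configured to provide:

    setenv PAGESIZE_DATA 64     # data segment page size, in kilobytes
    setenv PAGESIZE_STACK 64    # stack segment page size, in kilobytes
    mpirun -np 8 a.out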


Note: Because these variables are associated with NUMA placement, disabling NUMA placement via the MPI_DSM_OFF shell variable disables the use of these page size shell variables.



Note: Currently, these shell variables are not available on Linux systems.


Using dplace for Memory Placement

The dplace tool offers another means of specifying the placement of MPI processes within a distributed memory host. This tool is available on both Linux and IRIX systems. Starting with IRIX 6.5.13, dplace and MPI interoperate to allow MPI to better manage placement of certain shared memory data structures when dplace is used to place the MPI job. If for some reason this interoperability feature is undesirable, you can set the MPI_DPLACE_INTEROP_OFF shell variable. For instructions on how to use dplace with MPI, see the dplace man page.
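
As a sketch only, assuming a dplace that supports the -s1 option to skip placement of the shepherd process created by mpirun (check the dplace man page for the options your system actually supports), a placed launch might look like:

    mpirun -np 4 dplace -s1 a.out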

Tuning MPI/OpenMP Hybrid Codes

Hybrid MPI/OpenMP applications might require special memory placement features to operate efficiently on ccNUMA Origin servers. This section describes a preliminary method for achieving this memory placement. The basic idea is to space out the MPI processes to accommodate the OpenMP threads associated with each MPI process. In addition, assuming a particular ordering of library init code (see the DSO man page), this method employs procedures to ensure that the OpenMP threads remain close to the parent MPI process. This type of placement has been found to improve the performance of some hybrid applications significantly.

To take partial advantage of this placement option, the following requirements must be met:

  • When running the application, you must set the MPI_OPENMP_INTEROP shell variable.

  • To compile the application, you must use a MIPSpro compiler and the -mp compiler option. This hybrid model placement option is not available with other compilers.

  • The application must run on an Origin 300 or Origin 3000 series server.

To take full advantage of this placement option, you must be able to link the application such that the libmpi.so init code is run before the libmp.so init code. For instructions on how to link the hybrid application, see “Compiling and Linking IRIX MPI Programs” in Chapter 2. This linkage issue will be removed in the MIPSpro 7.4 compilers.

You can use an additional memory placement feature for hybrid MPI/OpenMP applications by using the MPI_DSM_PLACEMENT shell variable. Specification of a “threadroundrobin” policy results in the parent MPI process stack, data, and heap memory segments being spread across the nodes on which the child OpenMP threads are running.

MPI reserves nodes for this hybrid placement model based on the number of MPI processes and the number of OpenMP threads per process, rounded up to the nearest multiple of 4. For example, if 6 OpenMP threads per MPI process are going to be used for a 4 MPI process job, MPI will request a placement for 32 (4 X 8) CPUs on the host machine. You should take this into account when requesting resources in a batch environment or when using cpusets. In this implementation, it is assumed that all MPI processes start with the same number of OpenMP threads, as specified by the OMP_NUM_THREADS or equivalent shell variable at job startup.
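
Continuing the example above, a C shell launch of a 4-process job with 6 OpenMP threads per process might look like the following sketch. It is assumed that setting MPI_OPENMP_INTEROP to any value is sufficient, and the MPI_DSM_PLACEMENT setting is optional:

    setenv MPI_OPENMP_INTEROP 1                 # space out MPI processes for their OpenMP threads
    setenv OMP_NUM_THREADS 6                    # threads per MPI process; MPI reserves 4 x 8 = 32 CPUs
    setenv MPI_DSM_PLACEMENT threadroundrobin   # optional: spread parent memory across the threads' nodes
    mpirun -np 4 a.out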

This placement is not recommended if you set _DSM_PPM to a non-default value (for more information, see pe_environ). Also, it is suggested that the mustrun shell variables (MPI_DSM_MUSTRUN and _DSM_MUSTRUN) not be set when using this placement model.


Note: Currently, this feature is not available on Linux systems.


Tuning for Running Applications Across Multiple Hosts

When you are running an MPI application across a cluster of hosts, there are additional run-time environment settings and configurations that you can consider when trying to improve application performance.

IRIX hosts can be clustered using a variety of high performance interconnects. You can use the XPMEM interconnect to cluster Origin 300 and Origin 3000 series servers as partitioned systems. Other high performance interconnects include GSN and Myrinet. The SGI MPI implementation also supports the older HIPPI 800 interconnect technology. If none of these interconnects is available, MPI relies on TCP/IP to handle MPI traffic between hosts.

Systems running Linux can use the XPMEM interconnect to cluster hosts as partitioned systems, or rely on TCP/IP as the multihost interconnect.

When launched as a distributed application, MPI probes for these interconnects at job startup. For details of launching a distributed application, see “Launching a Distributed Application” in Chapter 2. When a high performance interconnect is detected, MPI attempts to use this interconnect if it is available on every host being used by the MPI job. If the interconnect is not available for use on every host, the library attempts to use the next slower interconnect until this connectivity requirement is met. Table 6-1 specifies the order in which MPI probes for available interconnects.

Table 6-1. Inquiry Order for Available Interconnects

Interconnect    Default Order    Environment Variable    Environment Variable for
                of Selection     to Require Use          Specifying Device Selection

XPMEM           1                MPI_USE_XPMEM           NA
GSN             2                MPI_USE_GSN             MPI_GSN_DEVS
Myrinet (GM)    3                MPI_USE_GM              MPI_GM_DEVS
HIPPI 800       4                MPI_USE_HIPPI           MPI_BYPASS_DEVS
TCP/IP          5                MPI_USE_TCP             NA


The third column of Table 6-1 also indicates the environment variable you can set to pick a particular interconnect other than the default. For example, suppose you want to run an MPI job on a cluster supporting both GSN and HIPPI interconnects. By default, the MPI job would try to run over the GSN interconnect. If for some reason you wanted to use the HIPPI 800 interconnect instead, you would set the MPI_USE_HIPPI shell variable before launching the job. This causes the MPI library to attempt to run the job using the HIPPI interconnect. If the HIPPI interconnect cannot be used, the job will fail.

The XPMEM interconnect is an exception in that it does not require all hosts in the MPI job to be reachable via the XPMEM device. Message traffic between hosts not reachable via XPMEM goes over the next fastest interconnect. Also, when you specify a particular interconnect to use, you can set the MPI_USE_XPMEM variable in addition to one of the other four choices.
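
For example, to force the HIPPI 800 scenario described above, you could set MPI_USE_HIPPI before launching the distributed job. This is a sketch only: the host names and process counts are hypothetical, it is assumed that setting the variable to any value is sufficient, and the multihost mpirun syntax is assumed to be the one described in “Launching a Distributed Application” in Chapter 2:

    setenv MPI_USE_HIPPI 1               # require HIPPI 800; the job fails if it cannot be used
    mpirun hosta 16, hostb 16 a.out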

In general, to ensure the best performance of the application, you should allow MPI to pick the fastest available interconnect.

When running in cluster mode, be careful about setting the MPI_BUFFER_MAX value too low. Setting it below 16384 bytes could lead to a significant increase in the number of small control messages sent over the interconnect, possibly leading to performance degradation.

In addition to the choice of interconnect, you should know that multihost jobs use different buffers from those used by jobs run on a single host. In the SGI implementation of MPI, all of the previously mentioned interconnects rely on the “per host” buffers to deliver long messages. The default setting for the number of buffers per host might be too low for many applications. You can determine whether this setting is too low by using the MPI statistics described earlier in this section. In particular, you should examine the metric for retries allocating MPI per host buffers. High retry counts usually indicate that the MPI_BUFS_PER_HOST shell variable should be increased. Table 6-2 provides an example of application performance as a function of the number of “per host” message buffers. Here, the Fourier Transform (FT) class C benchmark was run on a cluster of four Origin 300 servers (32 CPUs each) using Myrinet. Note that the performance improves by almost a factor of three by increasing the MPI_BUFS_PER_HOST from the default of 32 buffers to 128 buffers per host.

Table 6-2. NPB FT Class C Running on 128 CPUs

MPI_BUFS_PER_HOST Setting    Execution Time (secs)

32 (default)                 280
64                           144
128                          108
256                          104


When considering these MPI statistics, GSN and HIPPI 800 users should also examine the counter for retries allocating MPI per host message headers. In cases in which this metric indicates high numbers of retries, it might be necessary to increase the MPI_MSGS_PER_HOST shell variable. Myrinet (GM) does not use this resource.
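
A C shell sketch of a multihost run that raises both of these limits follows. The buffer counts, message header count, and host names are illustrative only, and the multihost mpirun syntax is assumed to match “Launching a Distributed Application” in Chapter 2:

    setenv MPI_BUFS_PER_HOST 128     # more "per host" buffers (default is 32)
    setenv MPI_MSGS_PER_HOST 1024    # more per host message headers (illustrative value)
    mpirun hosta 32, hostb 32, hostc 32, hostd 32 a.out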

When using GSN, Myrinet, or HIPPI 800 high performance networks, MPI attempts to use all adapters (cards) available on each host in the job. You can modify this behavior by specifying specific adapter(s) to use. The fourth column of Table 6-1 indicates the shell variable to use for a given network. For details on syntax, see the MPI man page. Users of HIPPI 800 networks have additional control over the way in which MPI uses multiple adapters via the MPI_BYPASS_DEV_SELECTION variable. For details on the use of this environment variable, see the MPI man page.

When using the TCP/IP interconnect, unless specified otherwise, MPI uses the default IP adapter for each host. To use a nondefault adapter, enter the adapter-specific host name on the mpirun command line.
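
For example, if each host has an additional network interface known by an adapter-specific host name (the -fast suffix below is hypothetical), listing those names on the mpirun command line directs the MPI TCP/IP traffic over that adapter. The multihost mpirun syntax is assumed to match “Launching a Distributed Application” in Chapter 2:

    setenv MPI_USE_TCP 1                        # optional: require the TCP/IP interconnect
    mpirun hosta-fast 8, hostb-fast 8 a.out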