Chapter 3. Programming with SGI MPI

Portability is one of the main advantages MPI has over vendor-specific message passing software. Nonetheless, the MPI Standard allows for considerable variation among vendor implementations. In addition, there are often vendor-specific programming recommendations for optimal use of the MPI library. This chapter addresses topics of interest to those developing or porting MPI applications to SGI systems.

Job Termination and Error Handling

This section describes the behavior of the SGI MPI implementation upon normal job termination. Error handling and characteristics of abnormal job termination are also described.

MPI_Abort

In the SGI MPI implementation, a call to MPI_Abort causes the termination of the entire MPI job, regardless of the communicator argument used. The error code value is returned as the exit status of the mpirun command.

Error Handling

Section 7.2 of the MPI Standard describes MPI error handling. Although almost all MPI functions return an error status, when an error occurs an error handler is invoked before the function returns. If the function has an associated communicator, the error handler associated with that communicator is invoked. Otherwise, the error handler associated with MPI_COMM_WORLD is invoked.

The SGI MPI implementation provides the following predefined error handlers:

  • MPI_ERRORS_ARE_FATAL. The handler, when called, causes the program to abort on all executing processes. This has the same effect as if MPI_Abort were called by the process that invoked the handler.

  • MPI_ERRORS_RETURN. The handler has no effect; the error code is simply returned to the caller.

By default, the MPI_ERRORS_ARE_FATAL error handler is associated with MPI_COMM_WORLD and any communicators derived from it. Hence, to handle the error statuses returned from MPI calls, it is necessary to associate either the MPI_ERRORS_RETURN handler or another user-defined handler with MPI_COMM_WORLD near the beginning of the application.
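
For example, a minimal C sketch of this setup might look like the following (the checked call and the recovery action are placeholders for whatever the application actually does):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int  rc, rank, len;
        char errmsg[MPI_MAX_ERROR_STRING];

        MPI_Init(&argc, &argv);

        /* Return error codes to the caller instead of aborting the job. */
        MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        rc = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rc != MPI_SUCCESS) {
            MPI_Error_string(rc, errmsg, &len);
            fprintf(stderr, "MPI_Comm_rank failed: %s\n", errmsg);
            MPI_Abort(MPI_COMM_WORLD, rc);  /* recovery is up to the application */
        }

        /* ... application communication, checking the status of each call ... */

        MPI_Finalize();
        return 0;
    }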

MPI_Finalize and Connect Processes

In the SGI implementation of MPI, all pending communications involving an MPI process must be complete before the process calls MPI_Finalize. If there are any pending send or receive requests that are unmatched or not completed, the application will hang in MPI_Finalize. For more details, see section 7.5 of the MPI Standard.

If the application uses the MPI-2 spawn functionality described in chapter 5 of the MPI-2 Standard, there are additional considerations. In the SGI implementation, all MPI processes are connected. Section 5.5.4 of the MPI-2 Standard defines what is meant by connected processes. When the MPI-2 spawn functionality is used, MPI_Finalize is collective over all connected processes. Thus all MPI processes, whether launched from the command line or subsequently spawned, synchronize in MPI_Finalize.

Signals

In the SGI implementation, MPI processes are UNIX processes. As such, the general rules for signal handling apply to them as they do to ordinary UNIX processes.

In addition, the SIGURG and SIGUSR1 signals can be propagated from the mpirun process to the other processes in the MPI job, whether they belong to the same process group on a single host, or are running across multiple hosts in a cluster. To make use of this feature, the MPI program must have a signal handler that catches SIGURG or SIGUSR1. When the SIGURG or SIGUSR1 signals are sent to the mpirun process ID, the mpirun process catches the signal and propagates it to all MPI processes.
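
A minimal sketch of a program that catches the propagated signal might look like this (the handler action and the polling point are illustrative only):

    #include <signal.h>
    #include <stdio.h>
    #include <mpi.h>

    static volatile sig_atomic_t got_usr1 = 0;

    /* Keep the handler minimal; just record that the signal arrived. */
    static void usr1_handler(int sig)
    {
        got_usr1 = 1;
    }

    int main(int argc, char *argv[])
    {
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        signal(SIGUSR1, usr1_handler);    /* catch SIGUSR1 propagated by mpirun */

        /* ... main computation; poll got_usr1 at convenient points ... */
        if (got_usr1)
            printf("rank %d received SIGUSR1\n", rank);

        MPI_Finalize();
        return 0;
    }

Sending SIGUSR1 to the mpirun process ID (for example, with kill -USR1) then causes each MPI process to run the handler.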

There are additional concerns when using signals with multithreaded MPI applications. These are discussed in “Thread Safety”.

Buffering

Most MPI implementations use buffering for overall performance reasons, and some programs depend on it. However, you should not assume that there is any message buffering between processes, because the MPI Standard does not mandate a buffering strategy. Table 3-1 illustrates a simple sequence of MPI operations that cannot work unless messages are buffered. If sent messages were not buffered, each process would hang in the initial call, waiting for an MPI_Recv call to take the message. Because most MPI implementations do buffer messages to some degree, a program like this does not usually hang. The MPI_Send calls return after putting the messages into buffer space, and the MPI_Recv calls get the messages. Nevertheless, program logic like this is not valid by the MPI Standard. Programs that require this sequence of MPI calls should employ one of the buffered MPI send calls, MPI_Bsend or MPI_Ibsend, as sketched below.

Table 3-1. Outline of Improper Dependence on Buffering

    Process 1            Process 2
    MPI_Send(2,....)     MPI_Send(1,....)
    MPI_Recv(2,....)     MPI_Recv(1,....)


By default, the SGI implementation of MPI uses buffering under most circumstances. Short messages of 64 or fewer bytes are always buffered. Longer messages are also buffered, although under certain circumstances, buffering can be avoided. For performance reasons, it is sometimes desirable to avoid buffering. For further information on unbuffered message delivery, see “Programming Optimizations”.
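
For a program that genuinely depends on the ordering shown in Table 3-1, a sketch of the safe, explicitly buffered form might look like the following (the message length and neighbor computation are placeholders, and exactly two ranks are assumed):

    #include <stdlib.h>
    #include <mpi.h>

    #define N 1024

    int main(int argc, char *argv[])
    {
        int    i, rank, other, bufsize;
        double sendbuf[N], recvbuf[N];
        void  *attach;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        other = 1 - rank;                 /* assumes exactly two ranks, 0 and 1 */

        for (i = 0; i < N; i++)
            sendbuf[i] = (double)rank;

        /* Attach explicit buffer space so the send never depends on
           implementation-provided buffering. */
        bufsize = N * sizeof(double) + MPI_BSEND_OVERHEAD;
        attach  = malloc(bufsize);
        MPI_Buffer_attach(attach, bufsize);

        MPI_Bsend(sendbuf, N, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
        MPI_Recv(recvbuf, N, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &status);

        MPI_Buffer_detach(&attach, &bufsize);
        free(attach);
        MPI_Finalize();
        return 0;
    }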

Multithreaded Programming

SGI MPI supports hybrid programming models, in which MPI is used to handle one level of parallelism in an application, while POSIX threads or OpenMP processes are used to handle another level. When mixing OpenMP with MPI, for performance reasons consider invoking MPI functions only outside parallel regions or only from within master regions. When used in this manner, it is not necessary to initialize MPI for thread safety, and you can use MPI_Init to initialize MPI. However, to safely invoke MPI functions from any OpenMP process, or when using POSIX threads, MPI must be initialized with MPI_Init_thread.


Note: Currently, multithreaded programming models are supported on IRIX systems only.
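
As an illustration of the funneled style recommended above, the following sketch keeps all MPI calls outside the OpenMP parallel region (the array size and the work in the loop are placeholders):

    #include <mpi.h>

    #define N 100000

    static double a[N], b[N];

    int main(int argc, char *argv[])
    {
        int i;

        MPI_Init(&argc, &argv);           /* no thread-safe initialization needed */

        /* Threaded computation: no MPI calls inside the parallel region. */
        #pragma omp parallel for
        for (i = 0; i < N; i++)
            b[i] = 2.0 * a[i];

        /* MPI communication takes place only in the serial region. */
        MPI_Allreduce(b, a, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }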


There are further considerations when using MPI with threads. These are described in the following sections.

Thread Safety

The SGI implementation of MPI on IRIX systems assumes the use of POSIX threads or processes (created with pthread_create or, in the case of OpenMP, sprocs, respectively). Other threading packages might or might not work with this MPI implementation.

Each thread associated with a process can issue MPI calls. However, the rank ID in send or receive calls identifies the process, not the thread; a thread acts on behalf of its MPI process. Therefore, any thread associated with a process can receive a message sent to that process.

It is the user's responsibility to prevent races when threads within the same application post conflicting communication calls. By using distinct communicators for each thread, the user can ensure that two threads in the same process do not issue conflicting communication calls.
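
One way to arrange this, sketched below under the assumption of a fixed, known number of threads (NTHREADS is illustrative), is to duplicate the communicator once per thread before any thread begins communicating:

    #include <mpi.h>

    #define NTHREADS 4                    /* illustrative thread count */

    static MPI_Comm thread_comm[NTHREADS];

    /* Called once, by the main thread, after MPI_Init_thread. */
    void make_thread_comms(void)
    {
        int t;
        for (t = 0; t < NTHREADS; t++)
            MPI_Comm_dup(MPI_COMM_WORLD, &thread_comm[t]);
    }

    /* Thread t then restricts its communication to thread_comm[t], so its
       sends and receives cannot match those posted by another thread. */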

All MPI calls on IRIX 6.5 or later systems are thread-safe. This means that two concurrently running threads can make MPI calls and the outcome will be as if the calls executed in some order, even if their execution is interleaved.

A blocking MPI call blocks only the calling thread, allowing another thread to execute, if available. The calling thread is blocked until the event on which it waits occurs. Once the blocked communication is enabled and can proceed, the call completes and the thread is marked runnable within a finite time. A blocked thread does not prevent progress of other runnable threads on the same process, and does not prevent them from executing MPI calls.

Initialization

To initialize MPI for a program that will run in a multithreaded environment, the user must call the MPI-2 function, MPI_Init_thread(). In addition to initializing MPI in the same way as MPI_Init does, MPI_Init_thread() also initializes the thread environment.

You can create threads before MPI is initialized, but until MPI_Init_thread() is called, the only MPI call these threads can execute is MPI_Initialized.

Only one thread can call MPI_Init_thread(). This thread becomes the main thread. Since only one thread calls MPI_Init_thread(), threads must be able to inherit initialization. With the SGI implementation of thread-safe MPI, for proper MPI initialization of the thread environment, a thread library must be loaded before the call to MPI_Init_thread(). This means that dlopen cannot be used to open a thread library after the call to MPI_Init_thread().
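
A minimal sketch of thread-safe initialization, which also checks the level of thread support actually granted:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int provided;

        /* Request full thread support; 'provided' reports what was granted. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE)
            fprintf(stderr, "warning: thread support level is only %d\n", provided);

        /* ... create threads and perform MPI communication ... */

        MPI_Finalize();
        return 0;
    }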

Query Functions

The MPI-2 query function, MPI_Query_thread(), is available to query the current level of thread support. The MPI-2 function, MPI_Is_thread_main(), can be used to determine whether a thread is the main thread. The main thread is the thread that called MPI_Init_thread().

Requests

Two or more threads must not work on the same request. A program in which two threads block waiting on the same request is erroneous. Similarly, the same request cannot appear in the array of requests of two concurrent MPI_Wait{any|some|all} calls. In MPI, a request can be completed only once. Any combination of wait or test that violates this rule is erroneous.

Probes

A receive call that uses source and tag values returned by a preceding call to MPI_Probe or MPI_Iprobe will receive the message matched by the probe call only if there was no other matching receive call after the probe and before that receive. In a multithreaded environment, it is the user's responsibility to use suitable mutual exclusion logic to enforce this condition. You can enforce this condition by making sure that each communicator is used by only one thread on each process.
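
The usual probe idiom is sketched below; the helper function name is hypothetical, and in a multithreaded program the communicator passed to it should be used by only one thread (or the probe/receive pair protected by a mutex):

    #include <stdlib.h>
    #include <mpi.h>

    /* Hypothetical helper: receive a message of unknown length on comm. */
    double *recv_unknown_length(MPI_Comm comm, int *count)
    {
        MPI_Status status;
        double *buf;

        MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &status);
        MPI_Get_count(&status, MPI_DOUBLE, count);

        buf = malloc(*count * sizeof(double));

        /* Receive exactly the message that was probed by using the
           source and tag returned in the status. */
        MPI_Recv(buf, *count, MPI_DOUBLE, status.MPI_SOURCE, status.MPI_TAG,
                 comm, &status);
        return buf;
    }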

Collectives

Matching of collective calls on a communicator, window, or file handle is performed according to the order in which the calls are issued in each process. If concurrent threads issue such calls on the same communicator, window, or file handle, it is the user's responsibility to use interthread synchronization to ensure that the calls are correctly ordered.

Exception Handlers

An exception handler does not necessarily execute in the context of the thread that made the exception-raising MPI call. The exception handler can be executed by a thread that is distinct from the thread that will return the error code.

Cancellation

If a thread that executes an MPI call is canceled by another thread, or if a thread catches a signal while executing an MPI call, the outcome is undefined. When not executing MPI calls, a thread associated with an MPI process can terminate and can catch signals or be canceled by another thread.

Internal Statistics

The SGI internal statistics diagnostics are not thread-safe. MPI statistics are discussed in Chapter 5, “Profiling MPI Applications”.

Finalization

The call to MPI_Finalize should occur on the same thread that initialized MPI (also known as the main thread). It is the user's responsibility to ensure that the call occurs only after all of the process's threads have completed their MPI calls and have no pending communications or I/O operations.

Interoperability with SHMEM

You can mix SHMEM and MPI message passing in the same program. The application must be linked with both the SHMEM and MPI libraries. Start with an MPI program that calls MPI_Init and MPI_Finalize. When you add SHMEM calls, the PE numbers are equal to the MPI rank numbers in MPI_COMM_WORLD. Do not call start_pes() in a mixed MPI and SHMEM program. When the application runs across a cluster, not all MPI processes might be accessible via SHMEM functions. You can use the shmem_pe_accessible function to determine whether a SHMEM call can be used to access data residing in another MPI process. Because SHMEM operates only with respect to MPI_COMM_WORLD, SHMEM functions cannot be used to exchange data between MPI processes connected via MPI intercommunicators returned from the MPI-2 spawn related functions.

SHMEM functions should not be considered thread safe.

For more information on SHMEM, see the intro_shmem man page.
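
The following sketch illustrates the mixed model under the assumptions above (the header path and put variant follow the SGI SHMEM man pages; the data movement itself is a placeholder):

    #include <mpp/shmem.h>                /* SGI SHMEM header */
    #include <mpi.h>

    #define N 8

    /* Static (symmetric) data is globally accessible to SHMEM. */
    static double target[N], source[N];

    int main(int argc, char *argv[])
    {
        int rank, nranks, partner;

        MPI_Init(&argc, &argv);           /* do not call start_pes() */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        partner = (rank + 1) % nranks;    /* PE number equals MPI rank */

        if (shmem_pe_accessible(partner)) {
            shmem_double_put(target, source, N, partner);
            shmem_quiet();                /* complete the put */
        } else {
            /* partner not reachable via SHMEM; fall back to MPI
               point-to-point (matching receive needed on the partner) */
        }

        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }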

Miscellaneous Features of SGI MPI

This section describes other characteristics of the SGI MPI implementation that might be of interest to application developers.

stdin/stdout/stderr

In this implementation, stdin is enabled only for the MPI process with rank 0 in the first MPI_COMM_WORLD (which does not need to be located on the same host as mpirun). stdout and stderr are enabled for all MPI processes in the job, whether launched via mpirun or via one of the MPI-2 spawn functions.

MPI_Get_processor_name

In this release of SGI MPI, the MPI_Get_processor_name function returns the Internet host name of the computer on which the MPI process invoking this subroutine is running.
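
For example, a small sketch that prints the host name for each rank:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        char name[MPI_MAX_PROCESSOR_NAME];
        int  rank, len;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Get_processor_name(name, &len);   /* host name, 'len' characters */
        printf("rank %d is running on %s\n", rank, name);

        MPI_Finalize();
        return 0;
    }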

Programming Optimizations

This section describes ways in which the MPI application developer can best make use of optimized features of SGI's MPI implementation. Following recommendations in this section might require modifications to your MPI application.

Using MPI Point-to-Point Communication Routines

MPI provides a number of different routines for point-to-point communication. The most efficient ones in terms of latency and bandwidth are the blocking and nonblocking send/receive functions (MPI_Send, MPI_Isend, MPI_Recv, and MPI_Irecv). Unless required for application semantics, the synchronous send calls (MPI_Ssend and MPI_Issend) should be avoided. The buffered send calls (MPI_Bsend and MPI_Ibsend) should also usually be avoided, because they double the amount of memory copying on the sender side. The ready send routines (MPI_Rsend and MPI_Irsend) are treated as standard MPI_Send and MPI_Isend in this implementation. Persistent requests do not offer any performance advantage over standard requests in this implementation.
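
A sketch of the recommended nonblocking pattern, here for a simple ring exchange (the neighbor choice and message length are placeholders):

    #include <mpi.h>

    #define N 4096

    int main(int argc, char *argv[])
    {
        double      sendbuf[N], recvbuf[N];
        int         rank, nranks, left, right;
        MPI_Request req[2];
        MPI_Status  stat[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        left  = (rank + nranks - 1) % nranks;     /* placeholder neighbors */
        right = (rank + 1) % nranks;

        /* ... fill sendbuf ... */

        /* Post the receive first, then the send, then wait on both. */
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, stat);

        MPI_Finalize();
        return 0;
    }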

Using MPI Collective Communication Routines

The MPI collective calls are frequently layered on top of the point-to-point primitive calls. For small process counts, this can be reasonably effective. However, for higher process counts (32 processes or more) or for clusters, this approach can become less efficient. For this reason, a number of the MPI library collective operations have been optimized to use more complex algorithms. Some collectives have been optimized for use with clusters. In these cases, steps are taken to reduce the number of messages using the relatively slower interconnect between hosts. Two of the collective operations have been optimized for use with shared memory. The barrier operation has also been optimized to use hardware fetch operations (fetchops) on platforms on which these are available. The MPI_Alltoall routines also use special techniques to avoid message buffering when using shared memory. For more details, see “Avoiding Message Buffering -- Single Copy Methods”. Table 3-2 lists the MPI collective routines optimized in this implementation.

Table 3-2. Optimized MPI Collectives

    Routine          Optimized for Clusters    Optimized for Shared Memory
    MPI_Alltoall     Yes                       Yes
    MPI_Barrier      Yes                       Yes
    MPI_Allreduce    Yes                       No
    MPI_Bcast        Yes                       No


Using MPI_Pack/MPI_Unpack

While MPI_Pack and MPI_Unpack are useful for porting PVM codes to MPI, they essentially double the amount of data to be copied by both the sender and receiver. It is generally best to avoid these functions by either restructuring your data or using derived data types. Note, however, that use of derived data types can decrease performance in certain cases.

Avoiding Derived Data Types

In general, you should avoid derived data types when possible. In the SGI implementation, use of derived data types does not generally lead to performance gains, and it might disable certain types of optimizations (for example, unbuffered or single-copy data transfer).

Avoiding Wild Cards

The use of wild cards (MPI_ANY_SOURCE, MPI_ANY_TAG) involves searching multiple queues for messages. While this is not significant for small process counts, for large process counts the cost increases quickly.

Avoiding Message Buffering -- Single Copy Methods

One of the most significant optimizations for bandwidth-sensitive applications in the MPI library is the single-copy optimization, which avoids the use of shared memory buffers. Table 3-3 indicates the relative improvement in bandwidth for a simple ping/pong test using various message sizes. However, as discussed in “Buffering”, some incorrectly coded applications might hang because of buffering assumptions. For this reason, this optimization is not enabled by default, but it can be turned on by the user at run time. The application developer can take the following steps to increase the opportunities for use of this unbuffered pathway:

  • The MPI application should be built as a 64-bit executable file, or linked explicitly with the SHMEM library.

  • The MPI data type on the send side must be a contiguous type.

  • The sender and receiver MPI processes must reside on the same host.

  • The sender data must be globally accessible. Globally accessible memory includes common-block and static memory. Depending on the run-time environment, memory allocated via the Fortran 90 allocate statement might also be globally accessible. Memory allocated with the MPI_Alloc_mem function is globally accessible as well, as is the SHMEM symmetric heap allocated with the shpalloc or shmalloc functions (see the sketch after this list).
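
For instance, a send buffer that satisfies the globally accessible requirement can be obtained with MPI_Alloc_mem, as in the following sketch (the buffer length is arbitrary, and at least two ranks are assumed):

    #include <mpi.h>

    #define N 1024

    int main(int argc, char *argv[])
    {
        double    *buf;
        int        rank;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Memory from MPI_Alloc_mem is globally accessible, so a send from
           this buffer is a candidate for the single-copy path (provided the
           appropriate environment variables are also set). */
        MPI_Alloc_mem(N * sizeof(double), MPI_INFO_NULL, &buf);

        if (rank == 0)
            MPI_Send(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);

        MPI_Free_mem(buf);
        MPI_Finalize();
        return 0;
    }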

Certain run-time environment variables must be set to enable the unbuffered, single copy method. In addition, on certain platforms, hardware is available to facilitate the single copy method without requiring a message buffer to be globally accessible. For more details on how to set the run-time environment, see “Avoiding Message Buffering - Enabling Single Copy” in Chapter 6.

Table 3-3. MPI_Send/MPI_Recv Bandwidth Speedup for Unbuffered vs. Buffered Methods

    Message Length    O2000    O3000
    8KB               1.2      1.1
    1MB               1.2      1.2
    10MB              1.8      1.6


Managing Memory Placement

Many multiprocessor SGI systems have a ccNUMA memory architecture. For single-process and small multiprocess applications, this architecture behaves similarly to a flat memory architecture. For more highly parallel applications, memory placement becomes important. MPI takes placement into consideration when laying out shared memory data structures and the individual MPI processes' address spaces. In general, it is not recommended that the application programmer try to manage memory placement explicitly. There are, however, a number of ways to control the placement of the application at run time. For more information, see Chapter 6, “Run-time Tuning”.

Additional Programming Model Considerations

A number of additional programming options might be worth considering when developing MPI applications for SGI systems. For example, SHMEM can provide a means to improve the performance of latency-sensitive sections of an application. Usually, this requires replacing MPI send/recv calls with shmem_put/shmem_get and shmem_barrier calls. SHMEM can deliver significantly lower latencies for short messages than traditional MPI calls. As an alternative to shmem_get/shmem_put calls, you might consider the MPI-2 MPI_Put/MPI_Get functions. These provide almost the same performance as the SHMEM calls while providing a greater degree of portability.
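
A sketch of the MPI-2 one-sided alternative using fence synchronization (the window length and neighbor choice are placeholders):

    #include <mpi.h>

    #define N 1024

    static double local[N], window[N];

    int main(int argc, char *argv[])
    {
        int     rank, nranks, target;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);
        target = (rank + 1) % nranks;     /* placeholder neighbor */

        /* Expose 'window' on every rank for one-sided access. */
        MPI_Win_create(window, N * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        MPI_Put(local, N, MPI_DOUBLE, target, 0, N, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);            /* completes the put on all ranks */

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }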

Alternatively, you might consider exploiting the shared memory architecture of SGI systems by handling one or more levels of parallelism with OpenMP, with the coarser-grained levels of parallelism being handled by MPI. If you plan to call MPI in a manner requiring thread safety, see “Thread Safety”. Also, there are special ccNUMA placement considerations to be aware of when running hybrid MPI/OpenMP applications. For further information, see Chapter 6, “Run-time Tuning”.