Chapter 14. Message-Passing Parallelism

In a message-passing model, your parallel application consists of multiple independent processes, each with its own address space. Each process shares data and coordinates with the others by passing messages through a formal interface. The formal interface makes the program independent of the medium over which the messages travel: the processes of the program can run in a single computer, with messages moving by memory-to-memory copy, or the program can be distributed across different machines, with messages passing over a network.

The Message-Passing Toolkit package supports three libraries on which you can build a message-passing application. The Cray Shared-Memory (SHMEM) library supports message passing within a single system. The Message-Passing Interface (MPI) and Parallel Virtual Machine (PVM) libraries also support programs distributed across multiple systems. High-level overviews of these models are given under “Message-Passing Models”.

Choosing a Message-Passing Model

There are five considerations in choosing among the three message-passing models: compatibility, portability, scope, latency, and bandwidth.

Compatibility

If you are starting with an existing program that uses one of the three models, or if you want to reuse code from such a program, or if you personally are highly familiar with one of the three, you will likely choose that model in order to minimize development time.

Portability

The SHMEM library is portable among all Silicon Graphics/Cray systems, including both IRIX and UNICOS/MK; however, it is not supported on systems of other types. Both MPI and PVM are industry-standard libraries that are widely available in public-domain implementations.

Scope

The SHMEM library can be used only within a single multiprocessor such as a Cray T3E or an Origin2000. You can use MPI or PVM to distribute a program across all nodes in a Silicon Graphics Array, or across a heterogeneous network.

Latency

Latency, the start-up delay inherent in sending any one message of any size, is the shortest in SHMEM. MPI within a single system is a close second (both use memory-to-memory copy).

MPI latency across an Array using the Silicon Graphics-proprietary HIPPI Bypass is an order of magnitude greater. MPI or PVM latency using ordinary HIPPI or TCP/IP is greater still.

Bandwidth

Bandwidth, the rate at which the bits of a message are transferred, is highest for SHMEM and for MPI within a single system. MPI bandwidth over a HIPPI link is next, followed by PVM.

If you require the highest performance within a single Cray or Silicon Graphics system, use SHMEM. For the highest performance in an Array system linked with HIPPI, use MPI. Use PVM only when compatibility or portability is an overriding consideration.
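The following C sketch suggests what the SHMEM one-sided model looks like in practice. It is only an illustration, written against the traditional Cray-style entry points start_pes(), _my_pe(), _num_pes(), shmem_long_put(), and shmem_barrier_all(); spellings and header locations vary by implementation (newer OpenSHMEM libraries use shmem_init() and shmem_my_pe(), for example).

    #include <mpp/shmem.h>      /* <shmem.h> on some systems */

    long value;                 /* symmetric: same address on every PE */

    int main(void)
    {
        long mine;
        int  me, npes;

        start_pes(0);           /* join the SHMEM program */
        me   = _my_pe();
        npes = _num_pes();

        mine = (long)me;
        /* One-sided, memory-to-memory copy: this PE writes directly into
           the "value" variable of the next PE; no receive call is needed. */
        shmem_long_put(&value, &mine, 1, (me + 1) % npes);

        shmem_barrier_all();    /* make all transfers visible everywhere */
        return 0;
    }

Because the put is a direct copy into the target's memory, no handshake with the remote process is required, which is consistent with SHMEM having the lowest start-up latency of the three models.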

Choosing Between MPI and PVM

When your program must be able to use the resources of multiple systems, you choose between MPI and PVM. In many ways, MPI and PVM are similar:

  • Each is designed, specified, and implemented by third parties that have no direct interest in selling hardware.

  • Support for each is available over the Internet at low or no cost.

  • Each defines portable, high-level functions that are used by a group of processes to make contact and exchange data without having to be aware of the communication medium (a minimal sketch of this pattern appears after this list).

  • Each supports C and Fortran 77.

  • Each provides for automatic conversion between different representations of the same kind of data so that processes can be distributed over a heterogeneous computer network.
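As a sketch of the contact-and-exchange pattern mentioned above, the following minimal C program (error handling omitted, message contents arbitrary) has process 0 send one integer to process 1; whether the message moves by memory-to-memory copy or over a network is decided by the MPI library, not by the program.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, value;
        MPI_Status status;

        MPI_Init(&argc, &argv);                  /* enroll in the computation */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which process is this?    */

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("process 1 received %d\n", value);
        }

        MPI_Finalize();                          /* leave the computation */
        return 0;
    }

An equivalent PVM program would use pvm_initsend(), pvm_pkint(), pvm_send(), and pvm_recv() with pvm_upkint() in place of the MPI calls.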

One difference between MPI and PVM is in the support for the “topology” (the interconnect pattern: grid, torus, or tree) of the communicating processes. In MPI, the group size and topology are fixed when the group is created, which permits low-overhead group operations. The lack of run-time flexibility is not usually a problem, because the topology is normally inherent in the algorithmic design. In PVM, group composition is dynamic, which requires the use of a “group server” process and adds overhead to common group-related operations.
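For example, an MPI program typically fixes a grid or torus topology once, at communicator creation. A hedged C sketch (error handling omitted):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int nprocs, rank, coords[2];
        int dims[2]    = { 0, 0 };     /* let MPI factor the process count */
        int periods[2] = { 1, 1 };     /* wrap both dimensions: a torus    */
        MPI_Comm grid;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        MPI_Dims_create(nprocs, 2, dims);
        /* The size and shape of "grid" are fixed here for its lifetime. */
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);

        MPI_Comm_rank(grid, &rank);
        MPI_Cart_coords(grid, rank, 2, coords);   /* this process's (row, column) */

        MPI_Comm_free(&grid);
        MPI_Finalize();
        return 0;
    }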

Other differences are found in the design details of the two interfaces. MPI, for example, supports asynchronous and multiple message traffic, so that a process can wait for any of a list of message-receive calls to complete and can initiate concurrent sending and receiving. MPI provides for a “context” qualifier as part of the “envelope” of each message. This permits you to build encapsulated libraries that exchange data independently of the data exchanged by the client modules. MPI also provides several elegant data-exchange functions for use by a program that is emulating an SPMD parallel architecture.
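The sketch below suggests how two of these features combine (the routine names and message tags are invented for illustration): a library duplicates MPI_COMM_WORLD so that its messages travel in a private context, and nonblocking receives plus MPI_Waitany() let a process take whichever of several messages arrives first.

    #include <mpi.h>

    static MPI_Comm lib_comm;    /* the library's private communication context */

    void lib_init(void)
    {
        /* Collective call: every process duplicates the communicator, so
           library traffic cannot be confused with the client's messages. */
        MPI_Comm_dup(MPI_COMM_WORLD, &lib_comm);
    }

    /* Called on the root process: wait for whichever of two workers
       (ranks 1 and 2) reports a result first. */
    int lib_wait_for_first(int *result)
    {
        MPI_Request req[2];
        MPI_Status  status;
        int answers[2], which;

        MPI_Irecv(&answers[0], 1, MPI_INT, 1, 0, lib_comm, &req[0]);
        MPI_Irecv(&answers[1], 1, MPI_INT, 2, 0, lib_comm, &req[1]);

        MPI_Waitany(2, req, &which, &status);    /* blocks until one completes */
        *result = answers[which];

        MPI_Cancel(&req[1 - which]);             /* withdraw the other receive */
        MPI_Wait(&req[1 - which], &status);
        return which;
    }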

PVM is possibly more suitable for distributing a program across a heterogeneous network that includes uniprocessors and multiprocessors from multiple vendors. When the application runs in the environment of a Silicon Graphics Array system, MPI is the recommended interface.

Differences Between PVM and MPI

This section discusses the main differences between PVM and MPI from the programmer's perspective, focusing mainly on PVM functions that are not available in MPI.

Although to a large extent the library calls of MPI and PVM provide similar functionality, some PVM calls do not have a counterpart in MPI, and vice versa. Additionally, the semantics of some of the equivalent calls are inherently different for the two libraries (owing, for example, to the concept of dynamic groups in PVM). Hence, the process of converting a PVM program into an MPI program can be straightforward or complicated, depending on the particular PVM calls in the program and how they are used. For many PVM programs, conversion is straightforward.

In addition to a message-passing library, PVM also provides the concept of a parallel virtual machine session. A user starts this session before invoking any PVM programs; in other words, PVM provides the means to establish a parallel environment from which a user invokes a parallel program.

Additionally, PVM includes a console, which is useful for monitoring and controlling the states of the machines in the virtual machine and the state of execution of a PVM job. Most PVM console commands have corresponding library calls.
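For example, the console commands conf and add correspond to the library calls pvm_config() and pvm_addhosts(). The hedged C sketch below (hypothetical host name, error handling omitted) shows a task performing those operations itself.

    #include <stdio.h>
    #include <pvm3.h>

    int main(void)
    {
        struct pvmhostinfo *hosts;
        int  nhost, narch, i, info;
        char *newhost[] = { "nodeB" };     /* hypothetical host name */

        pvm_mytid();                       /* the first PVM call enrolls this task */

        /* Library equivalent of the console "conf" command. */
        pvm_config(&nhost, &narch, &hosts);
        for (i = 0; i < nhost; i++)
            printf("host %s, architecture %s\n",
                   hosts[i].hi_name, hosts[i].hi_arch);

        /* Library equivalent of the console "add" command. */
        pvm_addhosts(newhost, 1, &info);

        pvm_exit();                        /* leave the PVM session */
        return 0;
    }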

The MPI standard does not provide mechanisms for specifying the initial allocation of processes to an MPI computation and their binding to physical processors. Mechanisms to do so at load time or run time are left to individual vendor implementations. However, this difference between the two paradigms is not, by itself, significant for most programs, and should not affect the port from PVM to MPI.

The chief differences between the current versions of PVM and MPI libraries are as follows:

  • PVM supports dynamic spawning of tasks, whereas MPI does not (see the sketch following this list).

  • PVM supports dynamic process groups; that is, groups whose membership can change dynamically at any time during a computation. MPI does not support dynamic process groups.

    MPI does not provide a mechanism to build a group from scratch, but only from other groups that have been defined previously. Closely related to groups in MPI are communicators, which specify the communication context for a communication operation and an ordered process group that shares this communication context. The chief difference between PVM groups and MPI communicators is that any PVM task can join/leave a group independently, whereas in MPI all communicator operations are collective.

  • A PVM task can add or delete a host from the virtual machine, thereby dynamically changing the number of machines a program runs on. This is not available in MPI.

  • A PVM program (or any of its tasks) can request various kinds of information from the PVM library about the collection of hosts on which it is running, the tasks that make up the program, and a task's parent. The MPI library does not provide such calls.

  • Some of the collective communication calls in PVM (for instance, pvm_reduce()) are nonblocking. The MPI collective communication routines are blocking, although they are permitted to return as soon as the calling process's participation in the collective operation is complete.

  • PVM provides two methods of signaling other PVM tasks: sending a UNIX signal to another task, and notifying a task about an event (from a set of predefined events) by sending it a message with a user-specified tag that the application can check. A PVM call is also provided through which a task can kill another PVM task. These functions are not available in MPI.

  • A task can leave/unenroll from a PVM session as many times as it wants, whereas an MPI task must initialize/finalize exactly once.

  • A PVM task need not explicitly enroll: the first PVM call enrolls the calling task into a PVM session. An MPI task must call MPI_Init() before calling any other MPI routine and it must call this routine only once.

  • A PVM task can be registered by another task as responsible for adding new PVM hosts, or as a PVM resource manager, or as responsible for starting new PVM tasks. These features are not available in MPI.

  • A PVM task can multicast data to a set of tasks. As opposed to a broadcast, this multicast does not require the participating tasks to be members of a group. MPI does not have a routine to do multicasts.

  • PVM tasks can be started in debug mode (that is, under the control of a debugger of the user's choice). This capability is not specified in the MPI standard, although it can be provided on top of MPI in some cases.

  • In PVM, a user can use the pvm_catchout() routine to specify collection of task outputs in various ways. The MPI standard does not specify any means to do this.

  • PVM includes a receive routine with a timeout capability, which allows the user to block on a receive for a user-specified amount of time. MPI does not have a corresponding call.

  • PVM includes a routine that allows users to define their own receive contexts to be used by subsequent PVM receive routines. Communicators in MPI provide this type of functionality to a limited extent.
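The hedged C sketch below illustrates the dynamic-spawning and implicit-enrollment points from the list above ("worker" is a hypothetical executable name; error handling omitted). The first PVM call enrolls the task, pvm_spawn() creates new tasks in the middle of the run, and pvm_exit() unenrolls the task; none of these has a direct counterpart in the MPI standard described here.

    #include <pvm3.h>

    #define NWORKERS 4

    int main(void)
    {
        int tids[NWORKERS], parent, data = 7, i;

        pvm_mytid();              /* implicit enrollment: no separate init call */
        parent = pvm_parent();    /* PvmNoParent if started from the console    */

        if (parent == PvmNoParent) {
            /* Dynamically add tasks during the run; "worker" is a
               hypothetical executable built from this same source. */
            pvm_spawn("worker", (char **)0, PvmTaskDefault, "", NWORKERS, tids);

            for (i = 0; i < NWORKERS; i++) {
                pvm_initsend(PvmDataDefault);   /* default encoding handles
                                                   heterogeneous data formats */
                pvm_pkint(&data, 1, 1);
                pvm_send(tids[i], 1);
            }
        } else {
            pvm_recv(parent, 1);                /* message from the spawning task */
            pvm_upkint(&data, 1, 1);
        }

        pvm_exit();               /* unenroll; the process itself keeps running */
        return 0;
    }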

On the other hand, MPI provides several features that are not available in PVM, including a variety of communication modes, communicators, derived data types, additional group management facilities, and virtual process topologies, as well as a larger set of collective communication calls.
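As a final hedged illustration of the group management facilities (and of the earlier point that MPI groups are built only from existing groups), the following C sketch derives a two-process group from the group underlying MPI_COMM_WORLD and creates a communicator over it; the communicator creation is collective, unlike a PVM task joining a group on its own. It assumes at least three processes, and error handling is omitted.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Group world_group, sub_group;
        MPI_Comm  sub_comm;
        int ranks[2] = { 0, 2 };    /* arbitrary choice: keep ranks 0 and 2 */

        MPI_Init(&argc, &argv);

        /* A new group is derived from an existing one ... */
        MPI_Comm_group(MPI_COMM_WORLD, &world_group);
        MPI_Group_incl(world_group, 2, ranks, &sub_group);

        /* ... and the communicator is created collectively: every process
           in MPI_COMM_WORLD makes this call.  Processes outside the group
           receive MPI_COMM_NULL. */
        MPI_Comm_create(MPI_COMM_WORLD, sub_group, &sub_comm);

        if (sub_comm != MPI_COMM_NULL)
            MPI_Comm_free(&sub_comm);
        MPI_Group_free(&sub_group);
        MPI_Group_free(&world_group);
        MPI_Finalize();
        return 0;
    }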