Chapter 7. Frequently Asked Questions

This chapter provides answers to frequently asked questions about MPI.

What are some things I can try to figure out why mpirun is failing?

Here are some things to investigate:

  • Look at the last few lines in /var/adm/SYSLOG for any suspicious errors or warnings. For example, if your application tries to pull in a library that it cannot find, a message should appear here.

  • Be sure that you did not misspell the name of your application.

  • To find rld/dynamic link errors, try to run your program without mpirun. You will get the "mpirun must be used to launch all MPI applications" message, along with any rld link errors that might not be displayed when the program is started with mpirun.

  • Be sure that you are setting your remote directory properly. By default, mpirun attempts to place your processes on all machines into the directory that has the same name as $PWD. This should be the common case, but sometimes different functionality is required. For more information, see the section on $MPI_DIR and/or the -dir option in the mpirun man page.

  • If you are using a relative pathname for your application, be sure that it appears in $PATH. In particular, mpirun will not look in '.' for your application unless '.' appears in $PATH.

  • Run /usr/etc/ascheck to verify that your array is configured correctly.

  • Be sure that you can execute rsh (or arshell) to all of the hosts that you are trying to use without entering a password. This means that either /etc/hosts.equiv or ~/.rhosts must be modified to include the names of every host in the MPI job. Note that using the -np syntax (i.e. no hostnames) is equivalent to typing localhost, so a localhost entry will also be needed in one of the above two files.

  • If you are using an mpt module to load MPI, try loading it directly from within your .cshrc file instead of from the shell. If you are also loading a MIPSpro module, be sure to load it after the mpt module.

  • Use the -verbose option to verify that you are running the version of MPI that you think you are running.

  • Be very careful when setting MPI environment variables from within your .cshrc or .login files, because these will override any settings that you might later set from within your shell (MPI creates the equivalent of a fresh login session for every job). The safe way to set things up is to test for the existence of $MPI_ENVIRONMENT in your scripts and set the other MPI environment variables only if it is undefined.

  • If you are running under a Kerberos environment, you may be in for a wild ride because mpirun is currently unable to pass tokens. For example, in some cases, if you use telnet to connect to a host and then try to run mpirun on that host, it fails. But if you instead use rsh to connect to the host, mpirun succeeds. (This might be because telnet is kerberized but rsh is not.) At any rate, if you are running under such conditions, you will definitely want to talk to the local administrators about the proper way to launch MPI jobs.

How do I combine MPI with insert favorite tool here?

In general, the rule to follow is to run mpirun on your tool and then the tool on your application. Do not try to run the tool on mpirun. Also, because of the way that mpirun sets up stdio, seeing the output from your tool might require a bit of effort. The ideal case is when the tool directly supports an option to redirect its output to a file; in general, this is the recommended way to mix tools with mpirun. Of course, not all tools (for example, dplace) support such an option. However, it is usually possible to make it work by wrapping a shell script around the tool and having the script do the redirection, as in the following example:

> cat myscript
     #!/bin/sh
     MPI_DSM_OFF=1
     export MPI_DSM_OFF
     dplace -verbose a.out 2> outfile
> mpirun -np 4 myscript
     hello world from process 0
     hello world from process 1
     hello world from process 2
     hello world from process 3
> cat outfile
     there are now 1 threads
     Setting up policies and initial thread.
     Migration is off.
     Data placement policy is PlacementDefault.
     Creating data PM.
     Data pagesize is 16k.
     Setting data PM.
     Creating stack PM.
     Stack pagesize is 16k.
     Stack placement policy is PlacementDefault.
     Setting stack PM.
     there are now 2 threads
     there are now 3 threads
     there are now 4 threads
     there are now 5 threads 

MPI with dplace

setenv MPI_DSM_OFF
mpirun -np 4 dplace -place file a.out

Starting with IRIX 6.5.13, MPI interoperates with dplace so that MPI cc-NUMA functionality is not actually turned off. This might change the performance characteristics of applications that used dplace with previous releases of IRIX. To disable this interaction, set the MPI_DPLACE_INTEROP_OFF shell variable.

MPI with perfex

mpirun -np 4 perfex -mp -o file a.out 

The -o option to perfex is available only in IRIX 6.5 and later, so on earlier systems you must use a shell script as described previously. However, a shell script allows you to view only the summary for the entire job; individual statistics for each process are available only via the -o option.

MPI with rld

setenv _RLDN32_PATH /usr/lib32/rld.debug
setenv _RLD_ARGS "-log outfile -trace"
mpirun -np 4 a.out

Depending on whether you are running out of your home directory and whether you use a relative pathname for the log file, this can create more than one output file. The first file is created in the directory from which you run your application and contains the information that applies to your job. The second file is created in your home directory and contains (uninteresting) information about the login shell that mpirun created to run your job. If both directories are the same, the entries from both are merged into a single file.

MPI with Totalview

totalview mpirun -a -np 4 a.out 

In this special case, you must run the tool on mpirun and not the other way around. Note also that the -a option is required by Totalview, and it must always appear as the first option of the mpirun command.

Note that Totalview is not expected to operate with MPI processes started via the MPI_Comm_spawn or MPI_Comm_spawn_multiple functions.

MPI with SHMEM

It is easy to mix SHMEM and MPI message passing in the same program. Start with an MPI program that calls MPI_Init and MPI_Finalize. When you add SHMEM calls, the PE numbers are equal to the MPI rank numbers in MPI_COMM_WORLD. Do not call start_pes() in a mixed MPI and SHMEM program. For more information, see the shmem(3) man page.
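As an illustration only, here is a minimal sketch of such a mixed program (the file layout, variable names, and the <mpp/shmem.h> header path are assumptions; check intro_shmem(3) for the exact interface on your system). MPI_Init() and MPI_Finalize() frame the program, start_pes() is never called, and the PE number passed to shmem_long_get() is simply the MPI rank in MPI_COMM_WORLD:

     #include <stdio.h>
     #include <mpi.h>
     #include <mpp/shmem.h>     /* assumed SHMEM header location */

     static long value;         /* static data is symmetric, so it can be a SHMEM target */

     int main(int argc, char **argv)
     {
         int  rank, nprocs, next;
         long neighbor;

         MPI_Init(&argc, &argv);                    /* no start_pes() call */
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
         MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

         value = (long) rank;
         shmem_barrier_all();                       /* PE number == MPI rank */

         /* Read the value stored on the next PE. */
         next = (rank + 1) % nprocs;
         shmem_long_get(&neighbor, &value, 1, next);
         printf("rank %d read %ld from PE %d\n", rank, neighbor, next);

         shmem_barrier_all();
         MPI_Finalize();
         return 0;
     }

A program like this is linked against both libmpi and libsma (for example, with compiler options along the lines of -lmpi -lsma) and launched with mpirun in the usual way.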

I am unable to malloc() more than 700-1000 MB when I link with libmpi.

On IRIX systems released before IRIX 6.5, there are no so_locations entries for the MPI libraries. The way to fix this is to requickstart all versions of libmpi as follows:

cd /usr/lib32/mips3
rqs32 -force_requickstart -load_address 0x2000000 ./libmpi.so
cd /usr/lib32/mips4
rqs32 -force_requickstart -load_address 0x2000000 ./libmpi.so
cd /usr/lib64/mips3
rqs64 -force_requickstart -load_address 0x2000000 ./libmpi.so
cd /usr/lib64/mips4
rqs64 -force_requickstart -load_address 0x2000000 ./libmpi.so

Note that these commands require root access.

My code runs correctly until it reaches MPI_Finalize() and then it hangs.

This is almost always caused by send or recv requests that are either unmatched or not completed. An unmatched request is any blocking send for which a corresponding recv is never posted. An incomplete request is any nonblocking send or recv request that was never freed by a call to MPI_Test(), MPI_Wait(), or MPI_Request_free().

Common examples are applications that call MPI_Isend() and then use internal means to determine when it is safe to reuse the send buffer; these applications never call MPI_Wait(). You can fix such codes easily by inserting a call to MPI_Request_free() immediately after all such MPI_Isend() calls, or by adding a call to MPI_Wait() at a later place in the code, before the point at which the send buffer must be reused, as in the sketch below.
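As an illustration, the following hypothetical program shows the second approach: the request returned by MPI_Isend() is completed with MPI_Wait() before the send buffer is reused and before MPI_Finalize() is reached. Replacing the MPI_Wait() call with an MPI_Request_free() immediately after the MPI_Isend() would implement the first approach, at the cost of having no direct way to know when the buffer is safe to reuse:

     #include <stdio.h>
     #include <mpi.h>

     int main(int argc, char **argv)
     {
         int         rank, nprocs, next, prev;
         double      sendbuf[4] = { 1.0, 2.0, 3.0, 4.0 };
         double      recvbuf[4];
         MPI_Request req;
         MPI_Status  status;

         MPI_Init(&argc, &argv);
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
         MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
         next = (rank + 1) % nprocs;
         prev = (rank + nprocs - 1) % nprocs;

         /* Nonblocking send to the next rank ... */
         MPI_Isend(sendbuf, 4, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &req);

         /* ... matched by a receive from the previous rank, so the send
            is not left unmatched ... */
         MPI_Recv(recvbuf, 4, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &status);

         /* ... and completed before the buffer is reused or MPI_Finalize()
            is reached.  Calling MPI_Request_free(&req) right after the
            MPI_Isend() would also prevent the hang. */
         MPI_Wait(&req, &status);

         sendbuf[0] = 0.0;        /* now safe to reuse the send buffer */
         printf("rank %d received %g from rank %d\n", rank, recvbuf[0], prev);

         MPI_Finalize();
         return 0;
     }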

I keep getting error messages about MPI_REQUEST_MAX being too small, no matter how large I set it.

The MPI library reports an error concerning MPI_REQUEST_MAX in two different situations, and the text of the error message distinguishes them.

If the error message states

MPI has run out of unexpected request entries; the current allocation level is: XXXXXX 

the program is sending so many unexpected large messages (greater than 64 bytes) to a process that internal limits in the MPI library have been exceeded. The options here are to increase the number of allowable requests via the MPI_REQUEST_MAX shell variable, or to modify the application.

If the error message states

*** MPI has run out of request entries
*** The current allocation level is:
***     MPI_REQUEST_MAX = XXXXX

you might have an application problem. You almost certainly are calling MPI_Isend() or MPI_Irecv() and not completing or freeing your request objects. You need to use MPI_Request_free(), as described in the previous section.

I am not seeing stdout and/or stderr output from my MPI application.

Beginning with our MPT 1.2/MPI 3.1 release, all stdout and stderr is line-buffered, which means that mpirun does not print any partial lines of output. This sometimes causes problems for codes that prompt the user for input parameters but do not end their prompts with a newline character. The only solution for this is to append a newline character to each prompt.
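For example, in a hypothetical code that prompts for an iteration count, ending the prompt with a newline lets the complete line reach the user before the process blocks waiting for input:

     #include <stdio.h>

     int main(void)
     {
         int n;

         /* The trailing '\n' completes the line, so the line-buffered
            stdio handling passes the prompt through before scanf() blocks. */
         printf("Enter the number of iterations:\n");
         if (scanf("%d", &n) == 1)
             printf("running %d iterations\n", n);
         return 0;
     }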

Beginning with MPT 1.5.2, you can set the MPI_UNBUFFERED_STDIO environment variable to disable line-buffering. For more information, see the MPI(1) and mpirun(1) man pages.

How can I get the MPT software to install on my machine?

Message-Passing Toolkit software releases can be obtained at the SGI Software Download page at

http://www.sgi.com/products/evaluation/ 

Where can I find more information about SHMEM?

See the intro_shmem(3) man page.

The ps(1) command says my memory use (SIZE) is higher than expected.

At MPI job start-up, MPI calls libsma to cross-map all user static memory on all MPI processes to provide optimization opportunities. The result is large virtual memory usage. The ps(1) command's SIZE statistic reports the amount of virtual address space being used, not the amount of memory being consumed. Most of the virtual address regions point to multiply-mapped (shared) data regions, so even if all of the pages that you could reference were faulted in, actual per-process memory usage would be far lower than SIZE indicates.

What does MPI: could not run executable mean?

This message means that something happened while mpirun was trying to launch your application, which caused it to fail before all of the MPI processes were able to handshake with it.

With Array Services 3.2 or later and MPT 1.3 or later, many of the scenarios that generate this error now produce more descriptive messages.

Prior to Array Services 3.2, no diagnostic information was directly available. This was due to the highly decoupled interface between mpirun and arrayd.

mpirun directs arrayd to launch a master process on each host and listens on a socket for those masters to connect back to it. Since the masters are children of arrayd, arrayd traps SIGCHLD and passes that signal back to mpirun whenever one of the masters terminates. If mpirun receives a signal before it has established connections with every host in the job, it knows that something has gone wrong.

I have other MPI questions. Where can I read more about MPI?

The MPI(1) and mpirun(1) man pages are good places to start. Also see the MPI standards at http://www.mpi-forum.org/docs/docs.html.