Appendix B. Debugging and Profiling Multiprocessed Programs

This appendix describes some aspects of debugging multiprocessed Fortran source code. The recommended debugger for use with the MIPSpro 7 Fortran 90 compiler is dbx(1). The dbx(1) debugger supports the following Fortran language features: allocatable arrays, pointer-based variables, nonstandard stride arrays, modules, and derived types. For more information on this debugger, see the dbx(1) man page.

Setting Up Your Environment

When debugging a multiprocessed program with dbx(1), enter the following command at the dbx prompt:

(dbx) ignore TERM

This command allows a multiprocessed program to terminate gracefully after execution is complete.

Profiling a Parallel Fortran Program

It is easiest to debug a program for execution on multiple processors in a single-processor environment. After your program executes successfully on a single processor, you can compile it for multiprocessing by using the -mp option on the f90(1) command line.

After converting a single-processor program to one that can be multiprocessed, you should examine execution profiles to judge the effectiveness of the transformation. Good profiles of the program are crucial to help you focus on the loops that use the most time. You can use SpeedShop to obtain these profiles. For more information on SpeedShop, see the SpeedShop User's Guide or the ssrun(1) man page.

If your job uses multiple threads, SpeedShop can create multiple profile data files, one for each thread. Use the prof(1) standard profile analyzer to examine this output. You can also use timex(1) to time each run; comparing the times shows whether the parallelized version performed better overall than the serial version.

The profile of a Fortran parallel job is different from a standard profile. To produce a parallel program, the compiler pulls the parallel DO loops out into separate subroutines, one routine for each loop. Each of these loops is shown as a separate procedure in the profile. You can compare the amount of time spent in each loop by the various threads to determine how well the workload is balanced.
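The following is only a conceptual sketch of what "pulling a loop out into a separate subroutine" means. It is not the code the compiler actually generates: the routine name LOOP_BODY, its argument list, and the chunking scheme are hypothetical, and the chunks here run one after another instead of on separate threads. It is written as free-form source.

```fortran
! Conceptual sketch only. LOOP_BODY and the chunking below are
! hypothetical; real compiler-generated routines have internal names.
PROGRAM OUTLINE_SKETCH
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 8, NCHUNK = 2   ! N is a multiple of NCHUNK
  REAL :: A(N), B(N)
  INTEGER :: K, LO, HI

  B = 1.0

  ! Original loop:   DO I = 1, N
  !                     A(I) = 2.0*B(I)
  !                  END DO
  ! Outlined form: each call executes one contiguous chunk of iterations.
  DO K = 1, NCHUNK
     LO = (K-1)*(N/NCHUNK) + 1
     HI = K*(N/NCHUNK)
     CALL LOOP_BODY(LO, HI, A, B, N)
  END DO

  PRINT *, A

CONTAINS

  ! The loop body as a separate procedure; a profiler would report the
  ! time spent here as belonging to this routine, not to the caller.
  SUBROUTINE LOOP_BODY(LO, HI, A, B, N)
    INTEGER, INTENT(IN) :: LO, HI, N
    REAL, INTENT(INOUT) :: A(N)
    REAL, INTENT(IN) :: B(N)
    INTEGER :: I
    DO I = LO, HI
       A(I) = 2.0*B(I)
    END DO
  END SUBROUTINE LOOP_BODY

END PROGRAM OUTLINE_SKETCH
```

Because the loop body lives in its own procedure, the time each thread spends in it can be read directly from that thread's profile.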

The par(1) utility is a process activity reporter. You can use it to trace the activity of a single process, a related group of processes, or the system as a whole. For more information, see the par(1) man page.

In addition to the loops, the profile returned by the prof(1) command shows the special routines that actually do the multiprocessing. The __mp_parallel_do routine is the synchronizer and controller. Slave threads wait for work in the routine __mp_slave_wait_for_work. The less time they spend waiting in this routine, the more time they spend working, so the time charged to it gives a rough estimate of the extent of parallelism in a program. For more information on these routines, see the mp(3f) man page.

Debugging Parallel Fortran

After you have isolated program bugs to one or two loops, you can begin to debug. To determine if a loop can be multiprocessed, change the order of the iterations of the parallel DO loop in a single-processor version. If the loop can be multiprocessed, the iterations can execute in any order and produce the same answer. If the loop cannot be multiprocessed, changing the order usually causes the single-processor version to produce a different answer. You can then use single-process debugging techniques to determine the problem.

Example. Erroneous !$OMP PARALLEL DO. In this example, two references to A have the indexes in reverse order. If the indexes were in the same order (if both were A(I,J) or both were A(J,I)), the loop could be multiprocessed. As written, there is a data dependency, so the !$OMP PARALLEL DO is an error.

!$OMP PARALLEL DO PRIVATE(I,J)
        DO I = 1, N
           DO J = 1, N
              A(I,J) = A(J,I) + X*B(I)
           END DO
        END DO

Because a (correct) multiprocessed loop can execute its iterations in any order, you could rewrite this as:

!$OMP PARALLEL DO PRIVATE(I,J)
        DO I = N, 1, -1
           DO J = 1, N
              A(I,J) = A(J,I) + X*B(I)
           END DO
        END DO

This loop no longer gives the same answer as the original even when compiled without the -mp option. This reduces the problem to a normal debugging problem.
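The comparison described above can be automated with a small single-processor test program. The following is a minimal sketch, assuming free-form source; the program name ORDER_CHECK, the array size, and the initial data are made up for illustration. It runs the loop from the example in both iteration orders on copies of the same data and compares the results.

```fortran
! Hypothetical test harness: program name, N, and initial values are
! illustrative only. Compile without multiprocessing.
PROGRAM ORDER_CHECK
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 4
  REAL :: A0(N,N), AF(N,N), AR(N,N), B(N), X
  INTEGER :: I, J

  ! Build a nonsymmetric starting matrix so that A(I,J) /= A(J,I).
  X = 2.0
  DO I = 1, N
     B(I) = REAL(I)
     DO J = 1, N
        A0(I,J) = REAL(10*I + J)
     END DO
  END DO

  ! Original iteration order.
  AF = A0
  DO I = 1, N
     DO J = 1, N
        AF(I,J) = AF(J,I) + X*B(I)
     END DO
  END DO

  ! Reversed order of the outer (parallel) loop.
  AR = A0
  DO I = N, 1, -1
     DO J = 1, N
        AR(I,J) = AR(J,I) + X*B(I)
     END DO
  END DO

  IF (ANY(AF /= AR)) THEN
     PRINT *, 'Results differ: the loop depends on iteration order'
  ELSE
     PRINT *, 'Results match for these two orders'
  END IF
END PROGRAM ORDER_CHECK
```

A mismatch shows that the loop cannot be multiprocessed as written. A match for two particular orders is weaker evidence: it does not by itself prove that every order gives the same answer.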

Other Debugging Tips for Multiprocessed Loops

If a multiprocessed loop produces the wrong answer, use the following checklist to determine the cause. Each item to investigate is followed by the reasons for checking it.

PRIVATE variables 

Check the PRIVATE variables when the code runs correctly as a single process but fails when multiprocessed. Check any scalar variables that appear on the left-hand side of an assignment statement in the loop to be sure they are all declared as PRIVATE. Be sure to include the DO variable of any loop nested inside the parallel loop.
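As an illustration of this check, the following sketch (free-form source; the program and variable names are hypothetical) has one scalar, T, that is assigned inside the loop, and one nested DO variable, J. Both are listed in the PRIVATE clause along with the parallel loop's own DO variable.

```fortran
! Hypothetical example: T is a scalar assigned inside the parallel
! loop, and J is the DO variable of a nested loop.
PROGRAM PRIVATE_CHECK
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 100
  REAL :: A(N,N), C(N,N), T
  INTEGER :: I, J

  DO J = 1, N
     DO I = 1, N
        A(I,J) = REAL(I + J)
     END DO
  END DO

  ! Every scalar on the left-hand side of an assignment (T) and every
  ! nested DO variable (J) is declared PRIVATE.
!$OMP PARALLEL DO PRIVATE(I,J,T)
  DO I = 1, N
     DO J = 1, N
        T = 2.0*A(I,J)
        C(I,J) = T + 1.0
     END DO
  END DO
!$OMP END PARALLEL DO

  PRINT *, C(1,1), C(N,N)
END PROGRAM PRIVATE_CHECK
```

If T were left out of the PRIVATE clause, all threads would share one copy of it, and one thread could overwrite T between another thread's assignment and use.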

LASTPRIVATE 

A problem occurs when you need the final value of a variable but the variable is declared PRIVATE rather than LASTPRIVATE. If the use of the final value happens several hundred lines farther down, or if the variable is in a common block and the final value is used in a completely separate routine, a variable can look as if it is PRIVATE when in fact it should be LASTPRIVATE. To combat this problem, simply declare all the PRIVATE variables LASTPRIVATE when debugging a loop.
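A minimal sketch of the situation, assuming free-form source and hypothetical names: the scalar T is assigned in every iteration and then read after the loop, so it is declared LASTPRIVATE instead of PRIVATE.

```fortran
! Hypothetical example: the final value of T is needed after the loop.
PROGRAM LASTPRIVATE_CHECK
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 10
  REAL :: A(N), T
  INTEGER :: I

  T = 0.0

  ! LASTPRIVATE copies out the value of T from the sequentially last
  ! iteration (I = N). With PRIVATE(T), T would be undefined afterward.
!$OMP PARALLEL DO PRIVATE(I) LASTPRIVATE(T)
  DO I = 1, N
     T = 0.5*REAL(I)
     A(I) = T
  END DO
!$OMP END PARALLEL DO

  ! This use of T is what makes LASTPRIVATE necessary.
  PRINT *, 'T after loop =', T
END PROGRAM LASTPRIVATE_CHECK
```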

EQUIVALENCE 

Check for EQUIVALENCE problems. Two variables of different names may in fact refer to the same storage location if they are associated through an EQUIVALENCE.

EQUIVALENCE statements affect storage of local variables and can cause data dependencies when parallelizing code. EQUIVALENCE statements with local variables cause the storage location to be initialized to zero and saved between calls to the subroutine.
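The following single-processor sketch (free-form source; the names WORK and SCRATCH are hypothetical) shows how an EQUIVALENCE can hide a dependency. The loop appears to read one array and write another, but because WORK(1) shares storage with SCRATCH(2), each iteration reads the value written by the previous one.

```fortran
! Hypothetical example: WORK(I) and SCRATCH(I+1) are the same storage.
PROGRAM EQUIV_CHECK
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 8
  REAL :: WORK(N), SCRATCH(N)
  INTEGER :: I
  EQUIVALENCE (WORK(1), SCRATCH(2))

  SCRATCH = 0.0

  ! Looks independent, but is really
  !    SCRATCH(I+1) = SCRATCH(I) + 1.0
  ! which is a recurrence and cannot be multiprocessed.
  DO I = 1, N-1
     WORK(I) = SCRATCH(I) + 1.0
  END DO

  ! A running count in SCRATCH reveals the hidden recurrence.
  PRINT *, SCRATCH
END PROGRAM EQUIV_CHECK
```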

Uninitialized variables 

Some programs assume uninitialized variables are set to 0. This works with the -static option on the f90(1) command, but without it, uninitialized variables take whatever value remains on the stack. When a program is compiled with the -mp option on the f90(1) command, it executes differently and the stack contents are different. You should suspect this type of problem when a program compiled with -mp and run on a single processor produces a different result from the same program compiled without -mp.

To discover this type of problem, compile suspected routines with the -static option. If an uninitialized variable is the problem, you should initialize the variable rather than compile the program with the -static option.
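The preferred fix is a one-line change, sketched here with hypothetical names in free-form source: the accumulator S is set explicitly before it is used, so the routine no longer depends on what the stack happens to contain or on the -static option.

```fortran
! Hypothetical example: S is a local accumulator in ACCUM.
PROGRAM INIT_CHECK
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 5
  REAL :: V(N), TOTAL

  V = (/ 1.0, 2.0, 3.0, 4.0, 5.0 /)
  CALL ACCUM(V, N, TOTAL)
  PRINT *, 'TOTAL =', TOTAL

CONTAINS

  SUBROUTINE ACCUM(V, N, TOTAL)
    INTEGER, INTENT(IN) :: N
    REAL, INTENT(IN) :: V(N)
    REAL, INTENT(OUT) :: TOTAL
    REAL :: S
    INTEGER :: I

    ! Explicit initialization; without this line the starting value
    ! of S is whatever was left in its storage.
    S = 0.0
    DO I = 1, N
       S = S + V(I)
    END DO
    TOTAL = S
  END SUBROUTINE ACCUM

END PROGRAM INIT_CHECK
```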

Ranges on arrays 

Perform array bounds checking analysis by compiling with the -C option on the f90(1) command. If arrays are indexed out of bounds, a memory location may be referenced in unexpected ways. This is particularly true of adjacent arrays in a common block.

Errors in choosing which arrays are SHARED can be detected only when running on multiple processors. When you step through the code in the debugger, the program executes correctly, so the debugger does not reveal this kind of error.

The most likely candidates for this error are arrays with complicated subscripts. If the array subscripts are simply the variables of a DO loop, the analysis is probably correct. If the subscripts are more involved, examine those subscripts first.

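If you suspect this type of error, print out all the values of all the subscripts on each iteration through the loop. Then sort the output with sort(1) and use the uniq(1) command to look for duplicates; uniq(1) compares only adjacent lines, so unsorted output can hide repeated values. If duplicates are found, there is a data dependency.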
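The same duplicate check can also be done inside a test program instead of with external commands. The following is a minimal sketch in free-form source, with hypothetical names and made-up subscript values: IDX stands for the subscript computed on each iteration, and HITS counts how many iterations write each array element.

```fortran
! Hypothetical example: IDX(I) is the subscript written on iteration I.
PROGRAM SUBSCRIPT_CHECK
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 6     ! number of loop iterations
  INTEGER, PARAMETER :: M = 10    ! size of the array being written
  INTEGER :: IDX(N), HITS(M), I

  ! Illustrative subscript values; element 7 appears twice.
  IDX = (/ 3, 7, 1, 7, 9, 2 /)

  ! Count how many iterations touch each element.
  HITS = 0
  DO I = 1, N
     HITS(IDX(I)) = HITS(IDX(I)) + 1
  END DO

  ! Any element written more than once indicates a data dependency.
  DO I = 1, M
     IF (HITS(I) > 1) PRINT *, 'Element', I, 'is written', HITS(I), 'times'
  END DO
END PROGRAM SUBSCRIPT_CHECK
```

In a real program, the subscript expression from the suspect loop would replace the literal values assigned to IDX.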