Chapter 9. Multiple Process Debugging

WorkShop supports performance analysis and debugging of multiprocess applications, including processes spawned either with fork or sproc. You can perform process control operations on a single process or on all members of a process group. You can attach WorkShop automatically to child processes. You can also specify spawned processes to inherit traps. The Trap Manager provides special trap commands to facilitate debugging multiple processes simultaneously.

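This chapter discusses the details of multiprocess debugging in WorkShop and includes the following topics:

  • Debugging With Multiprocess View

  • Controlling Execution and Setting Traps in a Multiprocess Program

  • Debugging a Multiprocess Fortran Program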

Debugging With Multiprocess View

Multiprocess View operates on a process group. By default, a process group includes the parent process and all descendants spawned by sproc. Through a preferences option, processes spawned with fork during the session can be added to the process group automatically when they are created. Note that a child that performs an exec of a program with the setuid (set user ID) bit enabled does not become part of the process group. You can also add any process to which you have read/write access. All sproc'd processes must be in the same process group, since they share information.

Each process in the session can have a standard Main View session associated with it. All processes in a process group share a single Multiprocess View. Selecting "Multiprocess View..." from the Admin menu in Main View for any process in the group brings up the Multiprocess View window. If the Multiprocess View window already exists, it is raised to the front; otherwise, a new window is created.

Currently, Multiprocess View handles these multiple process situations:

  • True multiprocess program, which refers to a tightly integrated system of sproc'd processes, generated by POWER/Fortran or POWER/C.

  • Auto-fork application, which is a process that spawns a child process and then runs in the background.

  • Locally distributed application, which is an application that involves two different executables running in different processes on the same host, coordinated by a rendezvous mechanism. To use the Performance Analyzer, you must have a Main View for each process and enable data collection accordingly.

  • Fork application, which is a process that spawns child processes and can interact with them. The WorkShop Performance Analyzer supports applications that fork but not those that exec.

Multiprocess View does not support remotely distributed applications.

Displaying the Multiprocess View

The first step in debugging multiple processes simultaneously is to invoke the Debugger with the parent process. Then select "Multiprocess View..." from the Admin menu to bring up Multiprocess View. Main View remains attached to the parent process. Figure 9-1 shows a typical Multiprocess View with the Config and Process menus displayed.

Figure 9-1. Multiprocess View With Config and Process Menus Displayed


To open a Main View (or other debugging views) for another process, double-click the desired process in Multiprocess View. A separate Main View window displays the selected process, and you can select any debugging views desired. If a set of views exists for that process, the views are raised to the foreground. To reuse views already displayed, select "Switch Process..." from the Admin menu in Main View. (If a process is currently highlighted in Multiprocess View, its ID is entered automatically in the Process ID: field in the Switch Process dialog box.)

Viewing Process Status

When Multiprocess View comes up, it lists the status of all processes in the process group. This information includes:

PID: 

shows the process identifier (ID).

PPID: 

shows the parent process ID. Notice in Figure 9-1 that the first process (PID 7748) is the parent of the second.

State: 

represents the state of the process: stopped, running, or created (a transient state that appears just before the process starts running). Terminated processes are not displayed.

Name: 

identifies the process by filename.

Function/PC: 

indicates the current function and program counter (PC) for any stopped processes.
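The PID and PPID columns correspond to the values a process obtains from getpid and getppid. The following C sketch assumes a POSIX system and uses a hypothetical helper named child_sees_parent_pid. It forks one child and checks that the child's PPID equals the parent's PID, which is the relationship shown in Figure 9-1. It uses fork rather than sproc so that it is not tied to IRIX.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not a WorkShop call): fork one child and report
 * whether the child's PPID equals the parent's PID.
 * Returns 1 on a match, 0 on a mismatch, -1 if fork or waitpid fails. */
int child_sees_parent_pid(void)
{
    pid_t parent = getpid();      /* what the PID column shows for the parent */
    pid_t child = fork();
    int status = 0;

    if (child == -1)
        return -1;                /* fork failed */

    if (child == 0) {
        /* Child: getppid() is what the PPID column shows for this process. */
        _exit(getppid() == parent ? 0 : 1);
    }

    /* Only the parent reaches this point. It blocks in waitpid, so it is
     * still alive when the child calls getppid(). */
    assert(child > 0);
    if (waitpid(child, &status, 0) != child)
        return -1;

    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 1 : 0;
}
```

Calling child_sees_parent_pid() from a small test program should return 1 as long as the parent stays alive until the child has run.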

Multiprocess Control Buttons

Multiprocess View uses the same control buttons as Main View, with two exceptions: the buttons apply to all processes as a group, and there is no separate Run button. Using a control button in Multiprocess View has the same effect as clicking that button in each process's Main View window. The buttons are:

Continue 

resumes program execution after a halt and continues until a stop trap or other event stops execution.

Stop 

stops execution of all processes. When program execution stops, the current source line of each process is highlighted in its Main View, if one is active, and annotated with an arrow indicating the PC.

Step Into 

steps to the next source line and into function calls. To step a specific number of lines, hold down the right mouse button over the Step Into button. A pop-up menu appears that lets you select one of the fixed values or specify a number of steps.

Step Over 

steps to the next source line and over function calls. To step a specific number of lines, hold down the right mouse button over the Step Over button. A pop-up menu appears that lets you select one of the fixed values or specify a number of steps.

Return 

executes the remaining instructions in the current function. Program execution stops upon return from that function.

Sample 

collects performance data for each process (if performance data collection is enabled).

Kill 

terminates all processes in the group.

Multiprocess Traps

As discussed in Chapter 4, "Setting Traps," the trap qualifiers [all] and [pgrp] are used in multiprocess analysis. The [all] qualifier stops or samples all processes when a trap fires. The [pgrp] qualifier sets the trap in all processes within the process group that contains the trap location. Both qualifiers can be applied by default through the "Group Trap Default" and "Stop All Default" selections in the Traps menu in Trap Manager.

Note that the Sample button always samples all processes.

Adding and Removing Processes

Figure 9-2. Process Menu in Multiprocess View


The Process menu lets you manually add or remove a process from the process group (see Figure 9-2).

To remove a process, click the process and select "Remove" from the Process menu. Note that a process in a sproc share group cannot be removed from the process group.

To add a process, select "Add...". The dialog box shown in Figure 9-3 appears. Enter the ID of the process to add and click OK.

Figure 9-3. Add Process Dialog Box


Multiprocess Preferences

The "Preferences..." option in the Config menu brings up the Preferences dialog box. It lets you control when processes are added to the group, and it specifies their behavior (see Figure 9-4).

Figure 9-4. Multiprocess View Preferences Dialog Box


The Multiprocess View preference options are:

Attach to forked processes 


attaches new processes spawned by the fork call to the group automatically. (Note that processes spawned by sproc are always attached.)

Copy traps to forked processes 


copies traps you have set in the parent process to new forked processes automatically. If you create parent traps with Trap Manager and specify pgrp, then the children inherit these traps automatically, regardless of the state of this flag.

Copy traps to sproc'd processes 


copies traps you have set in the parent process to new sproc'd processes automatically. As in the previous option, if you create parent traps with the Trap Manager and specify pgrp, the children inherit these traps automatically, whether this flag is set or not.

Resume parent after fork 


restarts the parent process automatically when a child is forked.

Resume child after attach on fork 


restarts the new forked process automatically when it is attached. If this option is left off, a new process will stop as soon as it is attached.

Resume parent after sproc 


restarts the parent process automatically when a child is sproc'd.

Resume child after attach on sproc 


restarts the new sproc'd process automatically when it is attached. If this option is left off, a new process will stop as soon as it is attached.

Controlling Execution and Setting Traps in a Multiprocess Program

This section uses a C program that generates numbers in the Fibonacci sequence to demonstrate some of the tasks you'll perform most often when using cvd to debug multiprocess code. The tasks demonstrated are:

  • stopping a child process on a sproc

  • using the Multiprocess View buttons to control all processes

  • setting traps in the parent process only

  • setting group traps

The program fibo uses sproc to split off a child process, which in turn uses sproc to split off a grandchild process. All three processes churn out Fibonacci numbers until stopped. If you installed the demo programs, you can find the source for fibo.c in the directory /usr/demos/WorkShop/mp.

A listing of fibo.c follows:

#include <stdio.h>
#include <stdlib.h>      /* exit() */
#include <errno.h>       /* errno */
#include <sys/types.h>
#include <sys/prctl.h>   /* sproc(), PR_SADDR */

int NumberToCompute = 100;
int fibonacci();
void run(),run1();

int fibonacci(int n)
{
int f, f_minus_1, f_plus_1;
int i;

    f = 1;
    f_minus_1 = 0;
    i = 0;

    for (; ;) {
        if (i++ == n) return f;
        f_plus_1 = f + f_minus_1;
        f_minus_1 = f;
        f = f_plus_1;
    }
}

void run()
{
int fibon;
    for (; ;) {
        NumberToCompute = (NumberToCompute + 1) % 10;
        fibon = fibonacci(NumberToCompute);
        printf("%d'th fibonacci number is %d\n", 
             NumberToCompute, fibon);
    }
}

void run1()
{
int grandChild;

    errno = 0;
    grandChild = sproc(run,PR_SADDR);

    if (grandChild == -1) {
        perror("SPROC GRANDCHILD");
    }
    else
        printf("grandchild is %d\n", grandChild);
    run();
}

int main(void)
{
int second;

    second = sproc(run1,PR_SADDR);
    if (second == -1)
        perror("SPROC CHILD");
    else
        printf("child is %d\n", second);

    run();
    exit(0);
}
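When you stop in fibonacci or run during the tutorial, it helps to know which values to expect. The sketch below restates the recurrence and the index update from fibo.c in portable C, without sproc, so it can be compiled anywhere. The name next_index is hypothetical and does not appear in fibo.c. Because the processes are created with PR_SADDR, they share one address space, so in the real program all three update the same NumberToCompute.

```c
#include <assert.h>

/* Same recurrence as fibonacci() in fibo.c: yields 1, 1, 2, 3, 5, 8, ... */
int fibonacci(int n)
{
    int f = 1, f_minus_1 = 0, f_plus_1, i;

    assert(n >= 0);               /* the loop in fibo.c does not end cleanly for n < 0 */
    for (i = 0; i < n; i++) {
        f_plus_1 = f + f_minus_1;
        f_minus_1 = f;
        f = f_plus_1;
    }
    return f;
}

/* Hypothetical helper: the index update performed in run().
 * The index cycles through 0..9 regardless of its starting value. */
int next_index(int current)
{
    return (current + 1) % 10;
}
```

Starting from the initial NumberToCompute of 100, the first index computed is next_index(100), which is 1, and the largest value any process prints is fibonacci(9), which is 55.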

To get started, compile the program and run the Debugger.

  1. Compile fibo.c.

    cc -g fibo.c -o fibo

  2. Invoke the Debugger on fibo.

    cvd fibo &

  3. Bring up the multiprocess view by selecting "Multiprocess View..." from the Admin menu.

In the next section, you'll set options to control how the process executes.

Using the Multiprocess View to Control Execution

To examine each process as it appears, you need to stop child processes as they are created with sproc. You can control the Debugger's behavior on sproc by setting Multiprocess preferences.

  1. Select "Preferences..." from the Config menu in Multiprocess View.

  2. Deactivate Resume child after attach on sproc.

    At the same time, you can turn off trap inheritance, so you can experiment with trap setting later.

  3. Click OK to accept the change.

    Now you're ready to run the process.

  4. In the Main View, click Run.

    If you watch Multiprocess View, you see the main process appear and spawn a child process. The child process stops as soon as it appears, since you turned off the Resume child after attach on sproc option. You can now use Multiprocess View to open a new Main View for the child process.

  5. Double-click the child process in the Multiprocess View window.

    You see a dialog box like the one in Figure 9-5, and the Debugger creates a new window.

    Figure 9-5. Launching a Debug Session Dialog Box


    You can use the buttons in Multiprocess View to control all the processes simultaneously, or use the buttons in each of the Main Views to control each process separately.


    Note: You'll probably get a warning that sproc.s is missing. The warning refers to assembly source code and can be ignored.


  6. To send the child process on its way, click Continue in the Multiprocess View window.

    The first child now spawns a grandchild process. The grandchild stops in sproc, as shown in Figure 9-6:

    Figure 9-6. Using the Multiprocess View to Examine Process State


Using the Trap Manager to Control Trap Inheritance

This section shows you how to use the Trap Manager to set traps that affect one or all of the fibo process group. For complete information on using the Trap Manager, refer to Chapter 4, "Setting Traps."

  1. In the Main View for the parent process, select "Trap Manager" from the Views menu.

    Right now, traps set using the Traps menu in any of the Main View windows will affect only the process controlled by that Main View. For example, see what happens if you set a stop trap in the first executable line of run(), which is line 32:

    32 NumberToCompute = (NumberToCompute + 1) % 10;

  2. Using the Traps menu of the parent process, set a stop trap at line 32 of fibo.c.

    Only the parent process halts. The child processes continue running, as a glance at Multiprocess View will confirm.

    You can use the Trap Manager to edit the trap so that it affects the whole process group.

  3. Insert the word pgrp after the word Stop.

    The trap should read Stop pgrp at ..., as shown in Figure 9-7.

    Figure 9-7. Modifying a Trap to Affect a Process Group


  4. Click Modify to accept your change to the trap.

    The trap affects the two child processes as well. Watch the Multiprocess View to see the whole process group stop at the trap on line 32.

    You can set an option to make all traps affect the process group by default for those traps set using the Trap Manager.

  5. Select "Group Trap Default" from the Traps menu (see Figure 9-8).

    Figure 9-8. Setting the Group Trap Default


  6. In the Main View of the parent process, place the cursor in any executable line in the function fibonacci and select "At Source Line" from the Traps menu of the Trap Manager.

    The trap you've just set includes the modifier pgrp. It automatically affects both child processes.

    You have now learned the basics of controlling the execution of multiple processes and setting traps.

  7. Select "Exit" from the Admin menu in each Main View to end this tutorial.

    Note that the Multiprocess View window must be closed explicitly. It does not close when the Main View windows do.

Debugging a Multiprocess Fortran Program

The first part of this section presents a few standard techniques to assist you in debugging a parallel program. The second part shows you how to use the WorkShop Debugger to debug the sample program from Chapter 6 of the Fortran 77 Programmer's Guide.

General Fortran Debugging Hints

Debugging a multiprocessed program is more difficult than debugging a single-processor program; therefore, debug as much as possible on the single-processor version.

Try to isolate the problem as much as possible. If you can, reduce the problem to a single C$DOACROSS loop.

Once you've isolated the problem to a specific DO loop, try changing the order of its iterations in a single-processor version. If the loop can be multiprocessed, then the iterations can execute in any order and produce the same answer. If the loop cannot be multiprocessed, changing the order frequently causes the single-processor version to fail. If it fails, you can use standard single-process debugging techniques to find the problem.
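The reordering test can be illustrated outside Fortran. The following C sketch is an illustration only: the function names scale, running_sum, and order_matters are hypothetical and do not come from the sample program. It runs one loop whose iterations are independent and one with a loop-carried dependence, each forward and reversed, and reports whether the results differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define N 8

/* Independent iterations: element i depends only on element i. */
void scale(int *v, int reversed)
{
    int k;

    for (k = 0; k < N; k++) {
        int i = reversed ? N - 1 - k : k;
        v[i] = 2 * v[i];
    }
}

/* Loop-carried dependence: in forward order, iteration i reads the
 * value that iteration i-1 just wrote. */
void running_sum(int *v, int reversed)
{
    int k;

    for (k = 1; k < N; k++) {
        int i = reversed ? N - k : k;
        v[i] = v[i] + v[i - 1];
    }
}

/* Run a loop forward and reversed on identical input.
 * Returns 1 if the two results differ, 0 if they match. */
int order_matters(void (*loop)(int *, int))
{
    int forward[N], backward[N], i;

    assert(loop != NULL);
    for (i = 0; i < N; i++)
        forward[i] = backward[i] = i + 1;

    loop(forward, 0);
    loop(backward, 1);
    return memcmp(forward, backward, sizeof forward) != 0;
}
```

Here order_matters(scale) returns 0 and order_matters(running_sum) returns 1, so only the first loop is a candidate for multiprocessing. A loop can pass this test and still be unsafe, which is why the next step is to debug the multiprocessed version.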

If this technique fails, you need to debug the multiprocessed version. Compile your code with the flags -g and -mp_keep. The -mp_keep flag saves the file containing the multiprocessed DO loop Fortran code. The compiler saves the code in a file named

$TMPDIR/P<user_subroutine_name><machine_name><pid>

where user_subroutine_name is the name of the subroutine containing the DOACROSS, machine_name is your machine name, and pid is the process ID number of the compilation.

If you have not set the environment variable TMPDIR, /tmp is used.

Multiprocess Debugging Session

This section walks you through the process of using the Debugger to debug a small segment of incorrectly multiprocessed code. The example used in this section is also treated in Chapter 6 of the Fortran 77 Programmer's Guide with dbx. You can use cvd to perform the same tasks with less effort.

If you installed the demo programs, you can find the source for the code you will be debugging, total.f, in the directory /usr/demos/WorkShop/mp. A listing follows:

      program driver
      implicit none
      integer iold(100,10), inew(100,10),i,j
      double precision aggregate(100, 10),result
      common /work/ aggregate
      call total(100, 10, iold, inew)
      do 20 j=1,10
        do 10 i=1,100
          result=result+aggregate(i,j)
10      continue
20    continue
      write(6,*)' result=',result
      stop
      end

      subroutine total(n, m, iold, inew)
      implicit none
      integer n, m
      integer iold(n,m), inew(n,m)
      double precision aggregate(100, 100)
      common /work/ aggregate
      integer i, j, num, ii, jj
      double precision tmp

C$DOACROSS LOCAL(i,ii,j,jj,num)
      do j = 2, m-1
        do i = 2, n-1
          num = 1
          if (iold(i,j) .eq. 0) then
            inew(i,j) = 1
          else
            num = iold(i-1,j) +iold(i,j-1) + iold(i-1,j-1) +
     &        iold(i+1,j) + iold(i,j+1) + iold(i+1,j+1)
            if (num .ge. 2) then
              inew(i,j) = iold(i,j) + 1
            else
              inew(i,j) = max(iold(i,j)-1, 0)
            end if
          end if
          ii = i/10 + 1
          jj = j/10 + 1
          aggregate(ii,jj) = aggregate(ii,jj) + inew(i,j)
        end do
      end do
      end

In the program, the local variables are properly declared. The array inew always appears with j as its second index, so it can be a shared variable when multiprocessing the j loop. The variables iold, m, and n are only read (not written), so they are safe. The problem is with aggregate. The person analyzing this code reasoned that because j is always different in each iteration, j/10 will also be different. Unfortunately, since j/10 uses integer division, it often gives the same result for different values of j.

While this is a fairly simple error, it is not easy to see. When run on a single processor, the program always gets the right answer. Sometimes it gets the right answer when multiprocessing. The error occurs only when different processes attempt to load from and/or store into the same location in the aggregate array at exactly the same time.
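The collision is easy to confirm with a few lines of arithmetic. The following C sketch mirrors the expression jj = j/10 + 1 from total (C integer division truncates the same way for non-negative values) and counts how many distinct jj values a range of j produces. The helper names jj_index and count_distinct_jj are hypothetical.

```c
#include <assert.h>

/* Mirrors "jj = j/10 + 1" from subroutine total. */
int jj_index(int j)
{
    return j / 10 + 1;
}

/* Count the distinct jj values produced for j = lo..hi.
 * jj_index is nondecreasing in j, so counting changes is sufficient. */
int count_distinct_jj(int lo, int hi)
{
    int count = 0;
    int prev = 0;                 /* jj_index is >= 1 for j >= 0, so 0 means "none yet" */
    int j;

    assert(lo >= 0 && lo <= hi);
    for (j = lo; j <= hi; j++) {
        int jj = jj_index(j);
        if (jj != prev) {
            count++;
            prev = jj;
        }
    }
    return count;
}
```

With m = 10, as in the driver program, the loop runs j from 2 to 9, and count_distinct_jj(2, 9) is 1: every iteration of the j loop maps to the same jj, so every process updates the same column of aggregate.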

Here are the steps in this exercise:

  1. First try reversing the order of the iterations. Replace

    do j = 2, m-1

    with

    do j = m-1, 2, -1

    This still gives the right answer when running with one process but the wrong answer when running with multiple processes. The local variables look right, there are no equivalence statements, and inew uses only simple indexing. The likely item to check is aggregate. Your next step is to look at aggregate with the Debugger.

  2. Compile the program with the -g, -mp, and -mp_keep options:

    % f77 -g -mp -mp_keep total.f -o total

    If your debugging session is not running on a multiprocessor machine, you can force the creation of two threads for example purposes by setting an environment variable.

  3. If you use the C shell, type

    % setenv MP_SET_NUMTHREADS 2

  4. Start the Debugger:

    % cvd total &

    The Debugger Main View window displays.

  5. Choose "Go To Line..." from the Source menu and select line 43.

    This takes you to line 43:

    aggregate(ii,jj) = aggregate(ii,jj) + inew(i,j) 

    The subroutine touches aggregate in only one place, line 43. You want to set a stop trap at this line, so you can see what each thread is doing with aggregate, ii, and jj. You also want this trap to affect all threads of the process group. One way to do this is to turn on trap inheritance using the Multiprocess View Preferences dialog box. Another way is to use the Trap Manager to specify group traps, as follows.

  6. From the Views menu, select Trap Manager.

  7. In the Trap Manager window, pull down the Traps menu. Select the "Group Trap Default" option from the menu.

    This sets the group default.

  8. Place the cursor in line 43 in the Main View window.

    This selects the line.

  9. From the Traps menu in Trap Manager, select "At Source Line."

    This sets the stop trap, which should read something like this:

    Stop pgrp in file /usr/demos/WorkShop/mp/total.f line 43

  10. Bring up the Multiprocess View to keep tabs on the status of the two processes.

    Now you're ready to run the program.

  11. Click Run in the Main View window.

    As you watch the Multiprocess View, you'll see the two processes appear, run, and stop in the function _total_25_aaaa. The Main View window now shows the master process.

  12. Double-click the slave process listed in the Multiprocess View window, as in Figure 9-9.

    This invokes a Main View debugging session on the slave process.

    Figure 9-9. Launching a New Debugging Session From Multiprocess View


    Now you can invoke the Variable Browser on each process. Look at ii and jj in Figure 9-10.

    Figure 9-10. Comparing Variable Values From Two Processes


    They have the same values in each process; therefore, both processes may attempt to write to the same element of the array aggregate at the same time. So aggregate should not be treated as a shared variable. You've found the bug in your parallel Fortran program.
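One common way to remove this kind of conflict, offered here as an assumption rather than as the fix prescribed by this guide, is to give each process a private copy of the accumulator and merge the copies after the loop. The C sketch below runs the two "workers" one after the other, so it shows only the data layout, not real parallel execution. Each element contributes 1 instead of inew(i,j), and all function names are hypothetical.

```c
#include <assert.h>

#define N 100
#define M 10
#define BLOCKS 11                 /* i/10 + 1 ranges over 1..10; index 0 is unused */

/* Accumulate columns j_lo..j_hi into agg, mirroring the loop body of
 * total(). Each element adds 1.0 here; the real program adds inew(i,j). */
void accumulate(double agg[BLOCKS][BLOCKS], int j_lo, int j_hi)
{
    int i, j;

    assert(j_lo >= 2 && j_hi <= M - 1);
    for (j = j_lo; j <= j_hi; j++)
        for (i = 2; i <= N - 1; i++)
            agg[i / 10 + 1][j / 10 + 1] += 1.0;
}

double sum_all(double agg[BLOCKS][BLOCKS])
{
    double sum = 0.0;
    int a, b;

    for (a = 0; a < BLOCKS; a++)
        for (b = 0; b < BLOCKS; b++)
            sum += agg[a][b];
    return sum;
}

/* Reference result: one worker handles the whole j range. */
double serial_total(void)
{
    double agg[BLOCKS][BLOCKS] = {{0}};

    accumulate(agg, 2, M - 1);
    return sum_all(agg);
}

/* Two workers, each writing only to its own private array. */
double partitioned_total(void)
{
    double part0[BLOCKS][BLOCKS] = {{0}};
    double part1[BLOCKS][BLOCKS] = {{0}};
    double merged[BLOCKS][BLOCKS] = {{0}};
    int mid = (2 + M - 1) / 2;    /* split the j range in half */
    int a, b;

    accumulate(part0, 2, mid);            /* worker 0 */
    accumulate(part1, mid + 1, M - 1);    /* worker 1 */

    /* Merge step: the only code that writes the combined array. */
    for (a = 0; a < BLOCKS; a++)
        for (b = 0; b < BLOCKS; b++)
            merged[a][b] = part0[a][b] + part1[a][b];
    return sum_all(merged);
}
```

Both serial_total() and partitioned_total() return 784 (98 values of i times 8 values of j). Because no worker writes to another worker's array, the result does not depend on how the iterations interleave.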