Chapter 6. Setting Up and Running Experiments: ssrun

This chapter explains how to set up and run performance analysis experiments using the ssrun command.

Building Your Executable

The ssrun command is designed to be used with normally built executables and default environment settings. However, there are some cases where you need to change the way you build your executable or set certain environment variables.

This section explains when to change the way you build your executable program. For information on setting environment variables, see “Using Run-Time Environment Variables”.

  • If you have used the ssrt_caliper_point(3) function provided in the SpeedShop libraries, you must explicitly link in the SpeedShop library libss.so. For more information on setting caliper points, see “Using Calipers”.

  • If you are planning to build your executable using the -o32 option to the cc command, and you want to run the usertime experiment, you must add -lexc to the link line. For more information on cc -o32, see the cc(1) man page.

  • If you have built a stripped executable, you need to rebuild a non-stripped version to use with SpeedShop. For example, if you are using ld to link your C program, do not use the -s option. Using the -s option strips debugging information from the program object and makes the program unusable for performance analysis.

  • If you have used compiler optimization level 3 (-O3) and you are performing experiments that report function-level information, inlining can result in extremely misleading profiles. The time spent in the inlined procedure will show up in the profile as time spent in the procedure into which it was inlined. It is generally better to use compiler optimization level 2 (-O2) or less when gathering an execution profile.

Special Information for MP Fortran Programs

If you are compiling MP Fortran programs, you may encounter anomalies in the displayed data:

  • For all f90(1), f77(1), and fort77(1) MP compilations, parallel loops within the program are represented as subroutines with names relating to the source routine in which they are embedded. The naming conventions for these subroutines are different for 32-bit and 64-bit compilations.

    For example, in the linpack example program, most of the time is spent in the routine DAXPY, which can be parallelized. The name differences are as follows:

    • In an n32 or 64-bit MP version, the routine has the name DAXPY, but most of that work is done in the MP routine named DAXPY.PREGION1.

    • In an o32 MP version, the DAXPY routine is named daxpy_, and the MP routine is _daxpy_519_aaab_.

  • If you perform a bbcounts experiment, the source annotations for 32-bit and 64-bit compilations with the -g option differ and are not correct in most cases.

    • In 64-bit source annotations, the exclusive time is correctly shown for each line, but the inclusive time for the first line of the loop (do statement) includes the time spent in the loop body. This same time appears on the lines comprising the loop's body, in effect representing a double-counting.

    • In 32-bit source annotations, the exclusive time is incorrectly shown for the line comprising the loop's body. The line-level data for the loop-body routine (_daxpy_519_aaab_) does not refer to proper lines. If the program was compiled with the -mp_keep flag, the line-level data should refer to the temporary files that are saved from the compilation. But the temporary files do not contain that information, so no source or disassembly data can be shown. The disassembly data for the main routine does not show the times for the loop body.

    • If the 32-bit program was compiled without the -mp_keep flag, the line-level data for the loop-body routine is incorrect.

      Most lines refer to line 0 of the file and the rest to other lines at seemingly random places in the file. Consequently, false annotations will appear on some lines. Disassembly correctly shows the instructions and their data, but the line numbers are wrong. This reflects essentially the same double-counting problem as seen in 64-bit compilations, but the extra counts go to other places in the file, rather than to the first line of the loop.

Setting Up Output Directories and Files

When you run an experiment, performance data files are written to the current working directory by default. They are named using the following convention:

executable_name.exp_type.id   

The id consists of one or two letters (designating the process type) and the process ID number. The following list describes the letter codes:

  • m: master process created by ssrun.

  • p: process created by a call to sproc().

  • f: process created by a call to fork().

  • s: process created by a call to system().

  • e: process created by a call to exec().

  • fe: process created by a call to fork() and exec().

  • Rn: rank number of the MPI process that generated the experiment file.

  • Tn: OpenMP thread that generated the experiment file.

The following are examples of data file names:

stat.bbcounts.m10966
engines.pcsamp.m14493

In a single-process application, ssrun generates a single performance data file. In a multiprocess application, ssrun generates a performance data file for each process.
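
As a rough illustration of the naming convention, the following Bourne shell sketch splits a data file name of the basic form executable_name.exp_type.id into its parts. The function name parse_ss_name is hypothetical and is not part of SpeedShop. The sketch assumes the basic three-part form only, so names that carry an additional field (such as an MPI rank) are not handled.

```sh
# parse_ss_name is a hypothetical helper, not a SpeedShop command.
# It assumes the basic form executable_name.exp_type.id.
parse_ss_name() {
    name=$1
    id=${name##*.}           # last field, for example m10966
    rest=${name%.*}          # name with the id removed
    exp_type=${rest##*.}     # for example bbcounts
    executable=${rest%.*}    # for example stat
    pid=${id##*[!0-9]}       # trailing digits: the process ID
    code=${id%"$pid"}        # leading letters: the process type
    echo "$executable $exp_type $code $pid"
}

parse_ss_name stat.bbcounts.m10966
parse_ss_name engines.pcsamp.m14493
```

Splitting from the right means an executable name that itself contains a dot (a.out, for example) is still kept whole.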

You can change the default file name or directory for performance data files using environment variables.

Using Run-Time Environment Variables

Several environment variables have been defined for use specifically with SpeedShop to provide additional information to SpeedShop commands or SpeedShop library routines at run time. This section describes the available environment variables, grouped by functionality.

User Environment Variables

The following list describes a number of environment variables that are normally used to control the operation of SpeedShop.

  • _SPEEDSHOP_CALIPER_POINT_SIG sig_num: causes the specified signal number to be used for recording a caliper point in the experiment.

  • _SPEEDSHOP_HWC_COUNTER_NUMBER num: specifies the counter to be used for prof_hwc experiments. Counters are numbered between 0 and 31, and are described in the MIPS R10000 User's Guide.

  • _SPEEDSHOP_HWC_COUNTER_PROF_NUMBER num: specifies the counter that will be profiled for prof_hwctime experiments.

  • _SPEEDSHOP_HWC_COUNTER_OVERFLOW num: specifies the overflow value for the counter to be used in prof_hwc experiments. The value for num must be 0 < num <= 2147483647. Some choices may produce data that is not statistically random but reflects a correlation between the overflow interval and a cyclic behavior in the application. Users may want to do two or more runs with different overflow values.

  • _SPEEDSHOP_INSTR_ARGS: defines additional instrumentation arguments.

  • _SPEEDSHOP_OUTPUT_DIRECTORY dir: causes the output data files to be placed in the specified directory rather than the current working directory.

  • _SPEEDSHOP_OUTPUT_FILENAME filename: causes the output file to be saved under the specified name. If _SPEEDSHOP_OUTPUT_FILENAME is set to myfile, the experiment file is named myfile.suffix (for example, myfile.m12345). If _SPEEDSHOP_OUTPUT_DIRECTORY is also specified, the directory is prepended to the file name you specify.

  • _SPEEDSHOP_OUTPUT_NOCOMPRESS: disables the compression of performance data.

  • _SPEEDSHOP_POLLPOINT_CALIPER_POINT timer_type, timer_interval: used to add caliper points at regular time intervals into your experiment file (during program execution). Caliper points set with this variable are recorded in the performance data file generated by ssrun.

  • _SPEEDSHOP_REUSE_FILE_DESCRIPTORS: opens and closes the file descriptors for the output files every time performance data is to be written.

  • _SPEEDSHOP_RLD: defines the full path name to rld, and enables rld profiling (for pcsamp and _hwc experiments only). If the path name does not lead to rld, SpeedShop determines the correct path name automatically. For example, if you set _SPEEDSHOP_RLD to 1, SpeedShop will locate rld.

  • _SPEEDSHOP_SBRK_BUFFER_ADDR address: defines the preferred starting address to be used for the internal malloc arena. This option has to be used with extreme care since it might result in memory region overlap.

  • _SPEEDSHOP_SBRK_BUFFER_LENGTH: defines the segment grow size for the internal malloc arena used. This arena is completely separate from the user's arena, and it usually grows in default segments of the size 0x100000.

  • _SPEEDSHOP_VERBOSE or _SPEEDSHOP_VERBOSE non_empty_string: causes a log of each program's operation to be written to stderr. If this variable is set to an empty string, only major events are logged; if it is set to a non-empty string, more detailed events are logged.

  • _SPEEDSHOP_SILENT: suppresses all SpeedShop output other than fatal error messages. If both _SPEEDSHOP_VERBOSE and _SPEEDSHOP_SILENT are set, _SPEEDSHOP_VERBOSE is ignored.

  • _SSMALLOC_NO_BUFFERING: if this environment variable is set, the experiment file for each process will contain only its own heap trace data. Otherwise, the experiment file for each process will contain data from all processes.

To set an environment variable that requires no arguments (for example, _SPEEDSHOP_SILENT), use the following:

% setenv _SPEEDSHOP_SILENT

To set an environment variable that requires a number between 0 and 31 (for example, _SPEEDSHOP_HWC_COUNTER_NUMBER), use the following:

% setenv _SPEEDSHOP_HWC_COUNTER_NUMBER 15
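
The examples above use csh syntax. For Bourne-compatible shells, a minimal sketch of the equivalent settings follows. The helper name set_hwc_counter is hypothetical and not part of SpeedShop; it only checks the 0 to 31 range stated in the list above before exporting the variable. The value 15 is reused from the csh example.

```sh
# A variable that takes no argument: set it to an empty string and export it.
_SPEEDSHOP_SILENT=
export _SPEEDSHOP_SILENT

# set_hwc_counter is a hypothetical helper, not part of SpeedShop.
# It exports _SPEEDSHOP_HWC_COUNTER_NUMBER only for values 0 through 31.
set_hwc_counter() {
    case $1 in
        ''|*[!0-9]*)
            echo "counter must be a non-negative number" >&2
            return 1
            ;;
    esac
    if [ "$1" -gt 31 ]; then
        echo "counter must be between 0 and 31" >&2
        return 1
    fi
    _SPEEDSHOP_HWC_COUNTER_NUMBER=$1
    export _SPEEDSHOP_HWC_COUNTER_NUMBER
}

set_hwc_counter 15
```

Checking the range before exporting keeps an out-of-range value from reaching the experiment; a rejected value leaves any earlier setting unchanged.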

Process Tracking Environment Variables

A number of environment variables may be used for controlling the treatment of processes spawned from the original target, as shown in the following list:

  • _SPEEDSHOP_TRACE_FORK [True|False]: if True, specifies that processes spawned by calls to fork() will be monitored if they do not call exec(). If they do call exec() and _SPEEDSHOP_TRACE_FORK_TO_EXEC is not set to True, the data covering the time between the fork() and exec() will be discarded. It is True by default.

  • _SPEEDSHOP_TRACE_FORK_TO_EXEC [True|False]: if True, specifies that a process spawned by a call to fork() will be monitored, even if it also calls exec(). It is False by default.

  • _SPEEDSHOP_TRACE_EXEC [True|False]: if True, specifies that a process spawned by calls to any of the various flavors of exec() will be monitored. It is True by default.

  • _SPEEDSHOP_TRACE_SPROC [True|False]: if True, specifies that a process spawned by calls to sproc() will be monitored. It is True by default.

  • _SPEEDSHOP_TRACE_SYSTEM [True|False]: if True, specifies that system() calls will be monitored. It is False by default.

  • _SPEEDSHOP_TRACE_MPI_RANKS [True|False]: if True, specifies that performance data should only be collected for the MPI ranks. It is False by default.
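
To see at a glance which settings would apply to a run, a sketch such as the following can help. The function show_trace_settings is hypothetical and not part of SpeedShop. It assumes that an unset variable takes the default given in the list above, and it does not query SpeedShop itself.

```sh
# show_trace_settings is a hypothetical helper, not part of SpeedShop.
# For each process tracking variable it prints the current value, or the
# default from the list above when the variable is unset (an assumption).
show_trace_settings() {
    echo "_SPEEDSHOP_TRACE_FORK=${_SPEEDSHOP_TRACE_FORK-True}"
    echo "_SPEEDSHOP_TRACE_FORK_TO_EXEC=${_SPEEDSHOP_TRACE_FORK_TO_EXEC-False}"
    echo "_SPEEDSHOP_TRACE_EXEC=${_SPEEDSHOP_TRACE_EXEC-True}"
    echo "_SPEEDSHOP_TRACE_SPROC=${_SPEEDSHOP_TRACE_SPROC-True}"
    echo "_SPEEDSHOP_TRACE_SYSTEM=${_SPEEDSHOP_TRACE_SYSTEM-False}"
    echo "_SPEEDSHOP_TRACE_MPI_RANKS=${_SPEEDSHOP_TRACE_MPI_RANKS-False}"
}

show_trace_settings
```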

Expert-Mode Environment Variables

A number of variables may be used for debugging and finer control of the operation of SpeedShop, as shown in the following list:

  • _SPEEDSHOP_SAMPLING_MODE or _SPEEDSHOP_SAMPLING_MODE num: used for PC sampling and hardware counter profiling. If set to 1, generates data for the base executable only. If not set or set to a value other than 1, data is generated for the executable and all the DSOs it uses.

  • _SPEEDSHOP_INIT_DEFERRED_SIG sig_num: if specified, initialization of the experiment is not performed when the target process starts. Initialization is delayed until the specified signal is sent to the process. A handler for the given signal is installed when the process starts. It is the user's responsibility to ensure that it is not overridden by the target code.

  • _SPEEDSHOP_SHUTDOWN_SIG sig_num: if specified, termination of the experiment is not performed when the target process exits. Termination happens when the specified signal is sent to the process. A handler for the given signal is installed when the process starts, and it is the user's responsibility to ensure that it is not overridden by the target code.

  • _SPEEDSHOP_EXPERIMENT_TYPE exp_type: passes the experiment type to the run-time DSO. The ssrun command's -exp_type option, which usually specifies the experiment type, overrides this variable. Values for exp_type can be found in Table 4-1.

  • _SPEEDSHOP_EXTRA_MARCHING_ORDERS mo_syntax: adds marching orders to a predefined experiment. For more information, see “Using Marching Orders”.

  • _SPEEDSHOP_MARCHING_ORDERS mo_syntax: passes the marching orders of the experiment to the run-time DSO. The ssrun command's -mo (marching orders) option overrides this environment variable. If this variable is specified, it overrides _SPEEDSHOP_EXPERIMENT_TYPE, as well as the ssrun command's -exp_type option. The mo_syntax is discussed in “Using Marching Orders”.

  • _SPEEDSHOP_SBRK_BUFFER_LENGTH size: defines the maximum size of the internal malloc (memory allocation) area used. This area is completely separate from the user's area and has a default size of 0x100000.

  • _SPEEDSHOP_FILE_BUFFER_LENGTH size: defines the size of the buffer used for writing the experiment files. The default length is 8 KB. The buffer is used only for writing small records to the file; large records are written directly to avoid the buffering overhead.

  • _SPEEDSHOP_DEBUG_NO_SIG_TRAPS: disables the normal setting of signal handlers for all fatal and exit signals.

  • _SPEEDSHOP_DEBUG_NO_STACK_UNWIND: suppresses the stack unwind, as in usertime experiments and at caliper samples, for all experiments. The option is used as a workaround for various unwind bugs in libexc.

Using Marching Orders

Marching orders are another way of specifying the experiment type you want to run. One of their benefits is that they let you customize experiments. Any explicit specification of marching orders overrides the environment variable _SPEEDSHOP_EXPERIMENT_TYPE and the -exp_type option on the ssrun command, because these experiment type specifications are themselves translated into marching orders by the ssrun command.

Each experiment type corresponds to a marching orders specification. You can use marching orders in any of the following ways:

  • The _SPEEDSHOP_MARCHING_ORDERS environment variable. The following example selects the usertime experiment:

    % setenv _SPEEDSHOP_MARCHING_ORDERS ut:cu

  • The -mo option on the ssrun command line. The following example selects the pcsamp experiment:

    % ssrun -mo pc,2,10000,0:cu a.out

  • Adding marching orders to a predefined experiment by using the _SPEEDSHOP_EXTRA_MARCHING_ORDERS environment variable. The following example generates a useful resource usage graph when viewed with the cvperf(1) command:

    % setenv _SPEEDSHOP_EXTRA_MARCHING_ORDERS hb
    % ssrun -pcsamp a.out

If the marching orders on the command line differ from those specified with the environment variable, the command-line version takes precedence.
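
The precedence rules stated in this chapter can be summarized in a short sketch. The function effective_experiment is hypothetical; it is not part of ssrun and only mirrors the documented order: the -mo option first, then _SPEEDSHOP_MARCHING_ORDERS, then the experiment type on the command line, then _SPEEDSHOP_EXPERIMENT_TYPE.

```sh
# effective_experiment is a hypothetical helper, not part of ssrun.
# Arguments: $1 = value given to -mo (empty if not used)
#            $2 = experiment type given on the command line (empty if none)
# It prints which specification would take effect under the documented
# precedence rules.
effective_experiment() {
    if [ -n "$1" ]; then
        echo "marching orders from -mo: $1"
    elif [ -n "${_SPEEDSHOP_MARCHING_ORDERS:-}" ]; then
        echo "marching orders from environment: $_SPEEDSHOP_MARCHING_ORDERS"
    elif [ -n "$2" ]; then
        echo "experiment type from command line: $2"
    elif [ -n "${_SPEEDSHOP_EXPERIMENT_TYPE:-}" ]; then
        echo "experiment type from environment: $_SPEEDSHOP_EXPERIMENT_TYPE"
    else
        echo "no experiment specified"
    fi
}

# The environment selects usertime, but -mo on the command line wins.
_SPEEDSHOP_MARCHING_ORDERS=ut:cu
effective_experiment "pc,2,10000,0:cu" ""
```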

The number and meaning of the arguments for each marching order depend on the specific marching order. The following specifies PC sampling, using 16-bit bins, sampling every 10 milliseconds (10,000 microseconds), and sampling both the executable and all of its DSOs:

pc,2,10000,0

The following specifies call stack sampling every 10 milliseconds (10,000 microseconds), based on process virtual time plus system time spent on behalf of the process:

ut,10000,2

Defining the Base Experiment

The experiment specifier, with which a marching order begins, takes one of the following values:

  • ut: a time experiment that returns real time, virtual time, or user time. The default arguments are 30000,2. The first argument is the interval between call stack samples in microseconds; it should be specified in multiples of 10,000. The second argument is the timer type used to measure the intervals; the supported values are 0, 1, and 2, with the same meanings as for the second argument of hb (described later). The argument value -1 is not valid for ut.

  • pc: a 16-bit or 32-bit PC sampling (pcsamp) experiment. The default arguments are 2,10000,0. The first argument is the size of the sample count bins in bytes. The supported values are 2 (16 bits) and 4 (32 bits). The second argument is the sampling interval in microseconds. Supported values are 10000 (10-millisecond sample interval) and 1000 (1-millisecond sample interval). The third argument is the sampling mode:

    • 0: selects the user executable and all its dynamic shared objects

    • 1: selects only the user executable (without any dynamic shared objects)

  • it: a 32-bit bbcounts experiment. Only 4-byte (32-bit) counters are supported. No additional arguments are needed.

  • mf: a memory allocation and deallocation experiment that traces calls to malloc, realloc, free, memalign, and valloc routines. There are no arguments to this marching order. The arguments to these routines and bad calls are recorded. Bad calls include malloc calls of 0 bytes, freeing invalid memory blocks, reallocating invalid memory pointers, and calling memalign with invalid arguments. (For descriptions of these routines, see the malloc(3) man page.)

  • fpe: a floating-point exceptions (fpe) experiment. There are no arguments. The call stack is sampled whenever a floating-point exception occurs.

  • io: an I/O trace experiment. There are no arguments. The start time and end time for each of the following I/O system calls are recorded: creat(2), open(2), read(2), pread(2), write(2), pwrite(2), close(2), pipe(2), dup(2), lseek(2), readv(2), and writev(2).

  • mpit: MPI experiment. There are no arguments. The beginning time, ending time, return value, and arguments are recorded. For a list of the routines traced, see “Generating MPI Tracing Experiments”.


    Note: The output from this experiment can only be displayed by using the cvperf(1) user interface; it cannot be displayed through prof.


  • hwct: a hardware counter call stack profiling experiment (_hwctime). The default arguments are xx,xxx,0,SIGPROF. The first argument is the hardware counter number of the counter to be profiled. The second argument is the overflow interval for the counter (a prime number should be specified). The third argument is the hardware counter number of the counter whose overflow will trigger the sampling.

  • hwc: a hardware counter PC profiling experiment (_hwc). The default arguments are xx,xxx. The first argument is the hardware counter number. The second argument is the overflow interval for the counter.

  • hb: heart beat data collection. System-wide, per-process, and MPI resource usage data is collected at regular time intervals. If the program creates multiple processes, data is collected for each process. If the process is using the MPI library, MPI library statistics are also recorded.

    The default arguments are 1000000,2. The first argument is the interval in microseconds between samples. The second argument is the time type to use, as follows:

    • 0: real (wall-clock) time.

    • 1: virtual time. The timer runs while the user program is executing.

    • 2: user time. The timer runs while the user program is executing or the system is processing system calls made by the program.

  • cu: caliper point usage data collection. It usually appears at the end of a marching order, and there are no arguments. Usage data is recorded at caliper points. As with the hb marching order, system-wide, per-process, and MPI resource usage data can be collected at these points. However, the hb marching order collects data at regular time intervals, whereas the cu marching order collects it at caliper points, which you can set anywhere in your source code. For more information on setting caliper points, see “Using Calipers”.

  • mpi: traces calls to MPI functions and collects data such as the time taken by the call and which thread made the call.

  • nm: used to profile an application's memory access patterns on ccNUMA architectures. The profiler periodically interrupts the running application, and during each interrupt, the application's memory accesses are examined.
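
As an illustration of how a marching order is assembled from its arguments, the following sketch builds a pc specification followed by cu. The function pc_marching_orders is hypothetical and not part of SpeedShop. It accepts only the values listed for pc above and falls back to the documented defaults 2,10000,0 when arguments are omitted.

```sh
# pc_marching_orders is a hypothetical helper, not part of SpeedShop.
# Arguments: bin size in bytes, sampling interval in microseconds,
# sampling mode. Omitted arguments take the documented defaults 2,10000,0.
pc_marching_orders() {
    bin=${1:-2}
    interval=${2:-10000}
    mode=${3:-0}
    case $bin in
        2|4) ;;
        *) echo "bin size must be 2 or 4" >&2; return 1 ;;
    esac
    case $interval in
        1000|10000) ;;
        *) echo "interval must be 1000 or 10000" >&2; return 1 ;;
    esac
    case $mode in
        0|1) ;;
        *) echo "mode must be 0 or 1" >&2; return 1 ;;
    esac
    echo "pc,$bin,$interval,$mode:cu"
}

pc_marching_orders            # default 16-bit bins
pc_marching_orders 4 10000 0  # 32-bit bins
```

The resulting string could then be passed to the -mo option or assigned to _SPEEDSHOP_MARCHING_ORDERS.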

Running Experiments

This section describes how to use ssrun to perform experiments.

ssrun Syntax

The ssrun command takes the following form:

ssrun ssrun_options exp_type executable_name executable_args 

The arguments are as follows:

  • ssrun_options: zero or more of the options described in the following list. These options control the data collection and the treatment of descendant processes or programs, and they specify how the data is to be externalized.

  • exp_type | exp exp_type: the experiment type. Experiments are described in detail in Chapter 4, “Experiment Types”.

  • executable_name: the name of the program on which you want to run an experiment.

  • executable_args: arguments to your program, if any.

The ssrun command generates a performance data file that is named as described in “Setting Up Output Directories and Files”.

The following list describes the options to the ssrun command:

  • -hang: specifies that the process should be left waiting just before executing its first instruction. This allows you to attach the process to a debugger.

  • -mo marching_orders: allows you to specify marching orders. If this option is used, the environment variable _SPEEDSHOP_MARCHING_ORDERS is not examined. If both -exp_type and -mo are specified, the -mo option will override the value given by -exp_type.

  • -name argv0-value: specifies that the executable, or its appropriately instrumented version, should be run with argv[0] set to argv0-value. Normally, both instrumented and uninstrumented executables are run with argv[0] set to the original executable_name. argv0-value is also used in the executable_name portion of the name of the performance data file.

  • -port hostname portno: specifies that the process is to be left waiting, and notifications of status are to be sent to the socket on the host named by hostname and the port specified by portno. When the process is ready, a message of the form "running pid host" will be sent to inform the requester of the PID of the executing process and the host, which may be remote. A debugger can then attach to it and take control of its execution.

  • -quiet: suppresses all output other than error messages. If -quiet is specified, the _SPEEDSHOP_SILENT environment variable is also set for the duration of the ssrun command.

  • -ranks mpi-ranks: specifies that performance data should be collected only for the MPI ranks in the comma-separated list mpi-ranks.

  • -v: prints a log of the operation of ssrun to stderr. The same behavior occurs if the environment variable _SPEEDSHOP_VERBOSE is set to a null string.

  • -V: prints a detailed log of the operation of ssrun to stderr. The same behavior occurs if the environment variable _SPEEDSHOP_VERBOSE is set to a nonzero-length string. This option can be used to see how to set the various environment variables, and how to invoke instrumentation when necessary.

  • -workshop: specifies special instrumentation so that the experiment files can be read by WorkShop's cvperf analyzer.

  • -x display-id window-id: specifies that the process is to be left waiting and that the window of the WorkShop debugger requesting the creation (as specified by the display-id and window-id arguments on the command line) be informed of the PID of the target process. A debugger can then attach to it and take control of its execution.

ssrun Examples

This section provides examples of using ssrun with options and experiment types. For additional examples, see Chapter 2, “Tutorial for C Users”, or Chapter 3, “Tutorial for Fortran Users”.

Example Using the pcsampx Experiment

The pcsampx experiment collects data to estimate the actual CPU time for each source code line, machine instruction, and function in your program. The optional x suffix causes a 32-bit bin size to be used, allowing a larger number of counts to be recorded. For a more detailed description of the pcsamp experiment, see “PC Sampling Experiment (pcsamp)” in Chapter 4.

The following example performs a pcsampx experiment on the generic executable:

% ssrun -pcsampx generic

To see the performance data that has been generated, run prof on the performance data file, generic.pcsampx.m12185, as shown in the following example:

% prof generic.pcsampx.m12185

The report is printed to stdout. (The layout of this report has been altered slightly to accommodate presentation needs.) For more information on prof and the reports generated by prof, see Chapter 7, “Analyzing Experiment Results”.

-------------------------------------------------------------------------
SpeedShop profile listing generated Mon Feb  2 15:08:14 1998
   prof generic.pcsampx.m12185
                 generic (n32): Target program
                       pcsampx: Experiment name
               pc,4,10000,0:cu: Marching orders
                 R4400 / R4000: CPU / FPU
                             1: Number of CPUs
                           175: Clock frequency (MHz.)
  Experiment notes--
          From file generic.pcsampx.m12185:
        Caliper point 0 at target begin, PID 12185
                        /usr/demos/SpeedShop/linpack.demos/c/generic
        Caliper point 1 at exit(0)
-------------------------------------------------------------------------
Summary of statistical PC sampling data (pcsampx)--
                          2729: Total samples
                        27.290: Accumulated time (secs.)
                          10.0: Time per sample (msecs.)
                             4: Sample bin width (bytes)
-------------------------------------------------------------------------
Function list, in descending order by time
-------------------------------------------------------------------------
 [index]      secs    %    cum.%   samples  function (dso: file, line)

     [1]    25.470  93.3%  93.3%      2547  anneal (generic: generic.c, 1573)
     [2]     1.100   4.0%  97.4%       110  slaveusrtime (dlslave.so: dlslave.c, 22)
     [3]     0.310   1.1%  98.5%        31  __read (libc.so.1: read.s, 20)
     [4]     0.240   0.9%  99.4%        24  cvttrap (generic: generic.c, 317)
     [5]     0.150   0.5%  99.9%        15  _xstat (libc.so.1: xstat.s, 12)
     [6]     0.010   0.0% 100.0%         1  __write (libc.so.1: write.s, 20)
     [7]     0.010   0.0% 100.0%         1  _morecore (libc.so.1: malloc.c, 632)

            27.290 100.0% 100.0%      2729  TOTAL
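
The times in this report follow directly from the sample counts: each sample represents one sampling interval, so 2729 samples at 10.0 milliseconds per sample give 27.290 seconds. The following sketch repeats that arithmetic for checking a report by hand. The function samples_to_secs is hypothetical and not part of prof; it assumes a whole number of milliseconds per sample.

```sh
# samples_to_secs is a hypothetical helper for checking a report by hand.
# Arguments: sample count, time per sample in whole milliseconds.
samples_to_secs() {
    ms=$(( $1 * $2 ))
    # Print seconds with three decimal places using integer arithmetic.
    printf '%d.%03d\n' $(( ms / 1000 )) $(( ms % 1000 ))
}

samples_to_secs 2729 10   # all samples: prints 27.290
samples_to_secs 2547 10   # anneal: prints 25.470
```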

Example Displaying Data in WorkShop

To use the WorkShop graphical user interface to display the information gathered by ssrun, include the -workshop option on the ssrun command line, as shown in the following example:

% ssrun -workshop -pcsampx generic

The result is a file viewable through the cvperf WorkShop command:

% cvperf generic.pcsampx.m44800

Example Using the -v Option

To get information about how a SpeedShop experiment is set up and performed, you can supply the -v option to ssrun.

The following example performs another pcsampx experiment on the generic executable:

% ssrun -v -pcsampx generic

The ssrun command writes the following output to stderr. It displays information as the command line is parsed and shows the environment variables that ssrun sets.

fraser 75% ssrun -v -pcsampx generic

ssrun: target PID 12345
ssrun: setenv _SPEEDSHOP_MARCHING_ORDERS pc,4,10000,0:cu
ssrun: setenv _SPEEDSHOP_EXPERIMENT_TYPE pcsampx
ssrun: setenv _SPEEDSHOP_TARGET_FILE generic
ssrun: setenv _RLD_LIST libss.so:libssrt.so:DEFAULT
...

The _RLD32_LIST environment variable is used with programs compiled with the -n32 compiler option. The _RLD64_LIST environment variable is used with programs compiled with the -64 compiler option. If neither is set, the value of _RLD_LIST is the default. See the rld(1) man page for more information.

Using ssrun with a Debugger

To use the ssrun command in conjunction with a debugger such as dbx or the WorkShop debugger, you need to call ssrun with the -hang option and the name of your program.

Follow these steps to run the floating-point exceptions trace experiment on generic, and then run generic in a debugger.

  1. Call ssrun as follows:

    % ssrun -hang -fpe generic

    The ssrun command parses the command line, sets up the environment for the experiment, calls the target process using exec, and halts the target process on exiting from the call to exec.

  2. Note the process ID returned by ssrun.

  3. In another window, start your debugging session as follows:

    % cvd -pid process_id_number

  4. Attach the process to the debugger.

  5. Run the process from the debugger.

You can also invoke ssrun from within a debugger. In this case, ssrun leaves the target halted on exiting the call to exec and informs the debugger of that fact.

You can also use a debugger to set calipers for the purpose of recording performance data for a part of your program. See “Using Calipers” for more information on setting calipers.

Running Experiments on MPI Programs

The Message Passing Interface (MPI) is a library specification for message passing, proposed as a standard by a committee of vendors, implementors, and users. It allows processes to communicate by passing data messages to other processes, even those running on distant computers.

SpeedShop offers two types of experiments for MPI programs, both described in the sections that follow:

  • MPI tracing experiments: trace the use of MPI send, receive, and synchronization routines and a few other routines.

  • Other SpeedShop experiments: run other SpeedShop experiments, such as usertime and pcsamp, on MPI programs.

See the MPI Programmer's Manual for details about MPI use.

Generating MPI Tracing Experiments

Two different MPI experiments are available to help you trace calls to MPI routines. The main difference between the two is how the results can be viewed:

  • MPI_trace experiments tell you how many times, and at what locations within the application, various routines from the MPI library are called. This experiment is run by using the -mpi_trace option to ssrun, and it produces a file that is viewable in the Performance Analyzer (cvperf(1)) window.

  • MPI experiments trace calls to MPI functions and collect data such as the time used by the call, which thread and MPI rank made the call, and so on. This type of experiment is generated using the -mpi option to ssrun. The generated data can then be analyzed using prof(1).

The -ranks option to ssrun specifies that performance data should be collected only for the MPI ranks in the comma-separated list supplied with the option. See the ssrun(1) man page for a list of the functions traced by each option and for more information about the -ranks option.

The following example demonstrates the -mpi_trace option. You can use either of the following versions of the ssrun command on an executable named a.out:

% mpirun -np 4 ssrun -mpi_trace a.out 
% mpirun -np 4 ssrun -mo mpit:cu a.out

If you are running the application on four processors, you will see five output files: one for each processor and one for the master process. The identifier portions of the file names will start either with m for the master process or f (forked) for a process running on one of the processors. If the first version of the ssrun command, illustrated above, is used with an executable named myprog, file names similar to the following will be assigned to the output:

myprog.mpi.m12345
myprog.mpi.f12346
myprog.mpi.R0.f12346
myprog.mpi.R1.f12347
myprog.mpi.R2.f12348
myprog.mpi.R3.f12349

The Rx identifier does not correspond to a processor number but it does correspond to the MPI rank of the process for which the file was generated.
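
When sorting through the files from a multirank run, the rank can be read back out of the file name. The following sketch does this for names shaped like those listed above. The function ss_mpi_rank is hypothetical and not part of SpeedShop; it assumes the rank appears as the next-to-last dot-separated field in the form Rn.

```sh
# ss_mpi_rank is a hypothetical helper, not part of SpeedShop.
# It prints the MPI rank from a data file name whose next-to-last
# dot-separated field has the form Rn, and returns nonzero otherwise.
ss_mpi_rank() {
    field=${1%.*}          # drop the trailing id, for example .f12348
    field=${field##*.}     # keep only the next-to-last field
    case $field in
        R[0-9]*)
            echo "${field#R}"
            ;;
        *)
            return 1
            ;;
    esac
}

ss_mpi_rank myprog.mpi.R2.f12348
```

For files without a rank field, such as myprog.mpi.m12345, the function prints nothing and returns a nonzero status.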

Depending on which option is used, output from the ssrun command can be viewed in the WorkShop Performance Analyzer window or by using the prof(1) command. You can bring up the Performance Analyzer with the cvperf(1) command. You can view the information in either graphical or numerical format. Graphs that do not contain data are not displayed. For an example of a portion of a numerical display, see Figure 6-1.


Note: The MPI tracing experiment does not track down communicators, and it does not trace all collective operations.


Figure 6-1. MPI Numerical Format


For a description of the prof command and for examples of the use of ssrun and prof, see “Running Experiments”.

Generating Other Experiments for Programs Using MPI

If your program uses MPI, you must set up SpeedShop experiments that will be displayed in prof a little differently. There are two ways to accomplish this. The first method takes two steps:

  1. Set up a shell script that contains the call to ssrun and the experiment you want to run.

    For example, if you have an executable called testit and you want to run the pcsampx experiment with a script named exp_script, the process might look like the following:

    #!/bin/sh
    ssrun -pcsampx testit

  2. Call mpirun with the script name using one of the following commands:

    % mpirun -np 6 exp_script
    % mpirun host1 2, host2 2 exp_script

The second method is to use one of the following:

% mpirun -np 6 ssrun -pcsampx testit
% mpirun host1 2, host2 2 ssrun -pcsampx testit

Depending on the MPI version, the master experiment file created on each MPI host might contain performance data not from the application but from a master program that spawns the members of an application group. You can choose to exclude that file from performance analysis.
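One way to exclude the master file is to filter the list of experiment files before passing it to prof. The sketch below assumes that the master file's identifier starts with m followed by the process ID, as in the MPI tracing example earlier in this section; non_master_files is a hypothetical helper, and the file names are made up for illustration.

```shell
#!/bin/sh
# non_master_files: read experiment file names on standard input and
# drop any whose final component is "m" followed by digits, which is
# assumed here to mark the master process.
non_master_files() {
    grep -v '\.m[0-9][0-9]*$'
}

# Example with made-up names for the testit pcsampx experiment.
printf '%s\n' testit.pcsampx.m20001 testit.pcsampx.f20002 testit.pcsampx.f20003 |
    non_master_files
```

On a system with SpeedShop installed, the filtered list could then be handed to prof, for example with prof $(ls testit.pcsampx.* | non_master_files).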

When using ssrun -bbcounts or ssrun -purify, you should take care that the code for each separate host executes out of a different physical directory, not out of the same directory mounted by the network file system (NFS). During process creation, instrumentation is performed, and since different hosts may have different versions of the same named library (libc.so.1, for example), conflicts may occur. You may also need to use the -d option with mpirun to specify the directory on each host.

Running Experiments on Programs Using Pthreads

Pthreads is the multithreading model defined by the POSIX operating system standard (IEEE 1003.1c-1995). This standard contains a set of interfaces and semantics for creating and managing threads within the POSIX operating system definition. The basic SGI threads implementation consists of a library and a header file.

Applications using pthreads are specifically identified by SpeedShop. Performance data collection is done on a per-program basis, rather than on a per-pthread basis. Under IRIX 6.2, 6.3, and 6.4, SpeedShop creates as many experiment files as the number of sproc(2) system calls used by the pthreads library to create and manage the pthreads. In addition, cm_usage data is not supported, and SIGTERM is reserved to be used to terminate the application normally. You should analyze all the experiment files together via prof to get a valid profile for the code. Under IRIX 6.5, SpeedShop creates only one experiment file. For usertime and fpe experiments, however, you can specify the -pthreads option with prof to get the specified pthread's performance reports.

Running Experiments on Programs That Use OpenMP Directives

The OpenMP Fortran API and the OpenMP C/C++ API specify a collection of compiler directives, library functions, and environment variables that can be used to specify shared memory parallelism in Fortran, C, or C++ programs. The -mp compiler option causes OpenMP directives to be used in creating an executable that may be run using one or more processors.

Performance data collection is done on a per-processor basis. If an executable named test1 is run under the ssrun command using n processors for a usertime experiment, then files similar to the following are created for the performance data:

test1.usertime.m109327
test1.usertime.T0.p109331
test1.usertime.T1.p109345
test1.usertime.T2.p109353

The Tx identifier is the number of the OpenMP thread that generated the file. The number of processors may be specified internally in the program using a call to the OpenMP routine omp_set_num_threads, or externally via the environment variable OMP_NUM_THREADS. The experiment output may be examined via prof using the file for each process, or ssaggregate may be used to create an aggregated file from all of the experiment files so that the results for the entire experiment can be analyzed at once.
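The external method can be sketched as a short script that sets OMP_NUM_THREADS, runs the experiment, and produces one prof report per thread file. This is an illustration only: thread_files is a hypothetical helper, the .prof output names are arbitrary, and the file-name pattern is assumed from the listing above.

```shell
#!/bin/sh
# Request three OpenMP threads from outside the program.
OMP_NUM_THREADS=3
export OMP_NUM_THREADS

# thread_files PREFIX: list the per-thread experiment files for PREFIX,
# assuming the PREFIX.T<x>.p<pid> naming shown above.
thread_files() {
    ls "$1".T*.p* 2>/dev/null
}

# Run only where SpeedShop is installed; test1 is the example executable.
if command -v ssrun >/dev/null 2>&1; then
    ssrun -usertime test1
    for f in $(thread_files test1.usertime); do
        prof "$f" > "$f.prof"    # one report per OpenMP thread
    done
fi
```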

Using Calipers

In some cases, you may want to generate performance data reports for only a part of your program. You can do this by selecting caliper points to identify the area of your program or the time interval during execution for which you want to see performance data. When you run prof, you can specify a region for which to generate a report by supplying the -calipers option and the appropriate caliper numbers. For more information on prof -calipers, see “Using the -calipers Option” in Chapter 7.

Table 6-1 shows the different ways you can set caliper points.

Table 6-1. Setting Caliper Points

| Use This Approach... | For These Benefits... |
| --- | --- |
| Explicitly link with the SpeedShop run-time and call ssrt_caliper_point to set a caliper sample. | Lets you set a caliper point at a specific location in the source program. |
| Set pollpoint caliper points at specified time intervals during program execution using the _SPEEDSHOP_POLLPOINT_CALIPER_POINT environment variable. | Lets you set caliper points at time intervals rather than at places in the code. |
| Define a signal to be used to set a caliper sample by specifying a signal as a value to the environment variable _SPEEDSHOP_CALIPER_POINT_SIG and then sending the target the given signal. | Useful if you want to be able to set a caliper point as your program is running. |
| Set a caliper sample trap in dbx or the WorkShop debugger. Setting a trap involves setting a breakpoint and evaluating the expression libss_caliper_point(1) when the process stops. | Useful if you are working with a debugger in conjunction with SpeedShop. |

An implicit caliper point is always present at the start of execution of the process. A final caliper point is set when the process calls _exit. The implicit caliper point at the beginning of the program is numbered 0, the first caliper point recorded is numbered 1, and any additional caliper points are numbered sequentially.

In addition, caliper points are automatically set under the following circumstances to ensure that at least one valid set of data is recorded:

  • When a fatal signal is received, such as SIGQUIT, SIGILL, SIGTRAP, SIGABRT, SIGEMT, SIGFPE, SIGBUS, SIGSEGV, SIGSYS, SIGXCPU, or SIGXFSZ. Note that this list does not and cannot include SIGKILL.

  • When the program calls an exec function, such as execve() or execvp().

  • When an exit signal is received, such as SIGHUP, SIGINT, SIGPIPE, SIGALRM, SIGTERM, SIGUSR1, SIGUSR2, SIGPOLL, SIGIO, SIGRTMIN, or SIGRTMAX.

Setting Calipers with the ssrt_caliper_point Function

To set caliper points using the ssrt_caliper_point(3) function, follow these steps:

  1. Insert calls to ssrt_caliper_point in your source code. Call the function with the argument 1 (meaning True) and a string that helps identify the caliper point in the experiment file later on.

    Example for C:

    ...
    ssrt_caliper_point(1,"bgn_calc");
    ...

    Example for Fortran:

    . . .
    INTEGER SSRT_CALIPER_POINT
    . . .
    i = SSRT_CALIPER_POINT (1, 'bgn_calc')
    . . .

    You can insert one or more calls at any point in your code.

  2. Link the SpeedShop library libss.so into your application. Place the -lss option at the end of your compile or link command so that the library is the last to be referenced.

  3. Run your program with ssrun and the desired experiment type. For example, if you want to run the bbcounts experiment on generic:

    % ssrun -bbcounts generic

    The caliper points you have set in the source file are recorded in the performance data file that is generated by ssrun.

Setting Time-Oriented Calipers

To add caliper points at a regular time interval into your experiment file, set the _SPEEDSHOP_POLLPOINT_CALIPER_POINT environment variable before you generate an experiment. It takes the following form:

_SPEEDSHOP_POLLPOINT_CALIPER_POINT timer_type,timer_interval

The arguments are as follows:

  • timer_type: can have one of the following values:

    • 0: Real time. This is the total time a program spent while executing. It includes both time spent when a program is swapped out waiting for a CPU and the time the operating system is in control, performing some task for the program such as I/O or executing a system call.

    • 1: Process virtual time. This is the time spent when the program is actually running. This does not include either the time spent when a program is swapped out waiting for a CPU or the time the operating system is in control, performing some task for the program such as I/O or executing a system call.

    • 2: CPU time. This is process virtual time plus the time the system is running on behalf of the process. The system time could include performing I/O or executing other system calls.

  • timer_interval: the integer interval, in seconds, at which a new caliper will be set.

The caliper points you have set with the _SPEEDSHOP_POLLPOINT_CALIPER_POINT environment variable are recorded in the performance data file that is generated by ssrun. For the usertime experiment, timer_type must be 2.
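As an illustration, the following sketch builds the timer_type,timer_interval value, checks it against the rules described above, and exports the variable before running an experiment. The pollpoint_value helper is hypothetical, and generic is a placeholder executable name.

```shell
#!/bin/sh
# pollpoint_value TYPE INTERVAL: print "TYPE,INTERVAL" after checking
# that TYPE is 0, 1, or 2 and INTERVAL is a positive whole number of
# seconds. Hypothetical helper, not part of SpeedShop.
pollpoint_value() {
    case "$1" in
        0|1|2) ;;
        *) echo "timer_type must be 0, 1, or 2" >&2; return 1 ;;
    esac
    case "$2" in
        ''|0|*[!0-9]*) echo "timer_interval must be a positive integer" >&2; return 1 ;;
    esac
    printf '%s,%s\n' "$1" "$2"
}

# CPU time (2), with a caliper point every 5 seconds.
_SPEEDSHOP_POLLPOINT_CALIPER_POINT=$(pollpoint_value 2 5) || exit 1
export _SPEEDSHOP_POLLPOINT_CALIPER_POINT
echo "_SPEEDSHOP_POLLPOINT_CALIPER_POINT=$_SPEEDSHOP_POLLPOINT_CALIPER_POINT"

# Run only where SpeedShop is installed.
if command -v ssrun >/dev/null 2>&1; then
    ssrun -pcsampx generic
fi
```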

Setting Calipers with Signals

To set calipers with signals, follow these steps:

  1. Set the _SPEEDSHOP_CALIPER_POINT_SIG variable to the signal number you want to use.

    Choose a signal that does not terminate the program. The signal should also not be caught by the target program; doing so would interfere with its triggering a caliper point.

    The following signals are good choices because they do not have system-defined semantics already associated with them:

    SIGUSR1 16      /* user defined signal 1 */
    SIGUSR2 17      /* user defined signal 2 */

  2. Execute your program with ssrun.

  3. In another window, enter a command such as ps or top to determine the process ID of ssrun. This is also the process ID of the program you are working on.

  4. In this window, send the signal you used in step 1 to the process using the kill command:

    % kill -sig_num pid

    Caliper point data is recorded at the point in the program where the signal sent by the kill command interrupts the executing ssrun process.
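The steps above can also be combined in one script instead of two windows. The following is a minimal sketch, assuming signal 16 (SIGUSR1 in the list above), a placeholder executable named generic, and an arbitrary 10-second delay; mark_caliper and the KILL override are hypothetical conveniences for illustration.

```shell
#!/bin/sh
# Signal used for caliper points; 16 is SIGUSR1 in the list above.
_SPEEDSHOP_CALIPER_POINT_SIG=16
export _SPEEDSHOP_CALIPER_POINT_SIG

# mark_caliper PID: send the caliper signal to PID.
# KILL is a hypothetical override so the command can be printed
# (KILL=echo) instead of actually sent.
mark_caliper() {
    ${KILL:-kill} -"$_SPEEDSHOP_CALIPER_POINT_SIG" "$1"
}

# Run only where SpeedShop is installed.
if command -v ssrun >/dev/null 2>&1; then
    ssrun -pcsampx generic &
    pid=$!            # PID of ssrun, which is also the target's PID
    sleep 10          # let the program reach the phase of interest
    mark_caliper "$pid"
    wait "$pid"
fi
```

Capturing the process ID with $! replaces the ps or top lookup in step 3.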

Setting Calipers with a Debugger

From either dbx or the WorkShop debugger, you can set a caliper point anywhere it is possible to set a breakpoint: at a function entry or exit, a line number, an execution address, a watchpoint, or a pollpoint (timer-based). You can also attach conditions and/or cycle counts.

Use the following procedure:

  1. Set a breakpoint in your program where you want a caliper point.

  2. When the process stops, evaluate the expression libss_caliper_point(1). The evaluation of the expression always returns zero, but a side effect of the evaluation is the recording of the appropriate data.

  3. Resume execution of the process.

Effects of ssrun

When you call ssrun, the system performs the following operations for all experiments:

  • Sets various environment variables like _SPEEDSHOP_MARCHING_ORDERS and _SPEEDSHOP_EXPERIMENT_TYPE.

    For more information on these environment variables, see “Using Run-Time Environment Variables”.

  • Inserts the SpeedShop libraries libss.so and libssrt.so as part of your executable using the environment variable _RLD_LIST.

  • Invokes the file executable_name by calling exec().

  • Writes the appropriate experiment data to the output file. This step is carried out by the SpeedShop run-time library.