Chapter 1. Introduction to Performance Analysis

This chapter provides a brief introduction to performance analysis techniques for SGI systems and describes how to use them with SpeedShop to solve performance problems. It includes the following sections:

  • "Sources of Performance Problems"

  • "Fixing Performance Problems"

  • "SpeedShop Tools"

  • "Using SpeedShop Tools for Performance Analysis"

Sources of Performance Problems

To tune a program's performance, you must first determine where machine resources are being used. At any point in a process, there is one limiting resource controlling the speed of execution. Processes can be slowed down by:

  • CPU speed and availability: a CPU-bound process spends its time executing in the CPU and is limited by CPU speed and availability. To improve the performance of CPU-bound processes, you may need to streamline your code. This can entail modifying algorithms, reordering code to avoid interlocks, removing nonessential steps, blocking to keep data in cache and registers, or using alternative algorithms.

  • I/O processing: an I/O-bound process has to wait for input/output (I/O) to complete. I/O may be limited by disk access speeds or memory caching. To improve the performance of I/O-bound processes, you can try one of the following techniques:

    • Improve overlap of I/O with computation

    • Optimize data usage to minimize disk access

    • Use data compression

  • Memory size and availability: a program that continuously needs to swap pages of memory out to disk is called memory-bound. Page thrashing is often caused by accessing virtual memory haphazardly rather than strategically, and cache misses result. Insufficient memory bandwidth can also be the problem.

    To fix a memory-bound process, you can try to improve the memory reference patterns or, if possible, decrease the memory used by the program.

  • Bugs: you may find that a bug is causing the performance problem. For example, you may find that you are reading in the same file twice in different parts of the program, that floating-point exceptions are slowing down your program, that old code has not been completely removed, or that you are leaking memory (making malloc calls without the corresponding calls to free).

  • Performance phases: because programs exhibit different behavior during different phases of operation, you need to identify the limiting resource during each phase. A program can be I/O-bound while it reads in data, CPU-bound while it performs computation, and I/O-bound again in its final stage while it writes out data. Once you've identified the limiting resource in a phase, you can perform an in-depth analysis to find the problem. And after you have solved that problem, you can check for other problems within the phase. Performance analysis is an iterative process.

  • Cache thrashing: if an application does not access CPU caches efficiently, it runs more slowly while the CPU and operating system reload cache entries.

Fixing Performance Problems

The SpeedShop tools described in this manual can help you identify specific performance problems such as those described above, but these techniques are only one part of performance tuning. You can also tune graphics, I/O, the kernel, system parameters, memory, and real-time system calls. For a complete guide to all performance tools and the documentation about those tools, see the Guide to SGI Compilers and Compiling Tools.

Although it may be possible to obtain short-term speed increases by relying on unsupported or undocumented quirks of the compiler, it is a bad idea to do so. Any such “features” may break in future compiler releases. The best way to produce efficient code that will remain efficient is to follow good programming practices. In particular, choose good algorithms and leave the details to the compiler.

SpeedShop Tools

The SpeedShop tools allow you to run experiments and generate reports that track down the sources of performance problems. SpeedShop consists of a set of commands that can be run in a shell, an application programming interface (API) to provide some control over data collection, and a number of libraries to support the commands.

This section provides an overview of the tools by first discussing the main commands, then providing more detail on additional commands, experiment types, libraries, the SpeedShop API, and supported programs and languages.

Commands

SpeedShop provides the following commands to help you analyze your programs:

  • ssusage: Collects information about your program's use of machine resources. Output from ssusage can be used to determine where most resources are being spent.

  • ssrun: Allows you to run experiments on a program to collect performance data. It establishes the environment to capture performance data for an executable, creates a process from the executable (or from an instrumented version of the executable) and runs it. Input to ssrun consists of an experiment type, control flags, the name of the target, and the arguments to be used in executing the target.

  • prof: Analyzes the performance data you have recorded using ssrun and provides formatted reports. prof detects the type of experiment you have run and generates a report specific to the experiment type. You can also use the cvperf command to display the data through the WorkShop graphical user interface.

  • sscompare: Analyzes the performance data in one or more experiment files generated by SpeedShop and produces comparison reports.
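
For example, a minimal session with these commands might look like the following (the program name generic matches the sample program used later in this chapter; the process ID suffix m12345 in the experiment file name is hypothetical and will differ on your system):

    % ssusage generic
    % ssrun -pcsamp generic
    % prof generic.pcsamp.m12345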

SpeedShop provides the following additional commands:

  • squeeze: Allocates a region of virtual memory and locks the virtual memory down into real memory, making it unavailable to other processes.

  • thrash: Allows you to allocate a block of memory, then access the allocated memory to explore system paging behavior.
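
For example, to observe how a program behaves when less physical memory is available, you might lock down a block of memory and then rerun an experiment. This sketch assumes a size argument of the form 32m; see the squeeze(1) man page for the exact argument syntax and for whether squeeze must be left running:

    % squeeze 32m
    % ssrun -pcsamp generic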

Experiment Types

The following are the most commonly used experiments run with the ssrun command. (For the complete list of experiments, see the ssrun(1) man page.)

  • pcsamp experiments provide information on a program's CPU usage using statistical program counter sampling.

    Data is measured by periodically sampling the program counter of the target executable when it is executing in the CPU. The program counter shows the address of the currently executing instruction in the program. The data that is obtained from the samples is translated to a time that can be displayed at the function, source line, and machine instruction levels. The actual CPU time is calculated by multiplying the number of times a specific address is found in the PC by the amount of time between samples. (For a definition of CPU time, wall-clock time, and process virtual time, see the glossary.)
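
    As a worked example, suppose the sampling interval is 10 milliseconds (check the ssrun(1) man page for the default intervals of the four pcsamp variants; the numbers here are purely illustrative). If 250 samples fall within one function's addresses, prof reports approximately 250 × 0.01 s = 2.5 seconds of CPU time for that function.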

  • hwc experiments display information from a variety of hardware counters using statistical sampling.

    Hardware counter experiments are available on R10000, R12000, R14000, and R16000 systems that have built-in hardware counters. Data is measured by counting each time the specified hardware counter exceeds its maximum value, or overflows. You can specify the hardware counter and the overflow interval you want to use. For more information on the hardware counter experiments, see “Hardware Counter Experiments (*_hwc, *_hwctime)” in Chapter 4.
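
    For example, the prof_hwc experiment lets you select the counter and overflow interval yourself. The following sketch assumes the _SPEEDSHOP_HWC_COUNTER_NUMBER and _SPEEDSHOP_HWC_COUNTER_OVERFLOW environment variables described in the ssrun(1) man page; the counter number and overflow value shown are placeholders, and valid counter numbers depend on the CPU:

    % setenv _SPEEDSHOP_HWC_COUNTER_NUMBER 25
    % setenv _SPEEDSHOP_HWC_COUNTER_OVERFLOW 10007
    % ssrun -prof_hwc generic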

  • usertime experiments display a program's CPU time by statistical call-stack profiling.

    Data is measured by periodically sampling the call stack. The program's call stack data is used to attribute exclusive user time to the function at the bottom of each call stack (that is, the function being executed at the time of the sample), and to attribute inclusive user time to all the functions above the one currently being executed. Exclusive time is the execution time of a given function but not any functions that function calls, while inclusive time is the execution time both of a given function and of any functions called by that function.
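
    As an illustrative example: if main() calls solve(), and a sample is taken while solve() is executing, that sample adds exclusive time to solve() and inclusive time to both solve() and main(). A function that mostly sits on the call stack while its callees do the work therefore accumulates inclusive time but little exclusive time.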

  • The totaltime experiment returns wall-clock time in a manner identical to that of the usertime experiment. It uses statistical call-stack profiling based on wall-clock time, with a time sample interval of 30 milliseconds.

  • bbcounts experiments display an estimated time based on linear basic blocks counting.

    Data is measured by counting the number of executions of each basic block and calculating an estimated time for each function. This involves instrumenting the program to divide the code into basic blocks, which are consecutive sequences of instructions with a single entry point, a single exit point, and no branches into or out of the sequence. Instrumentation also records a count of all dynamic (function-pointer) calls.

    Because an exact count of every instruction in your program is recorded, you can also use the bbcounts experiment to determine the efficiency of your algorithm and identify any code that is not executed.

  • fpe experiments trace floating-point exceptions.

    A floating-point exception trace collects each floating-point exception, including the exception type and the call stack, at the time of the exception. prof(1) generates a report showing inclusive and exclusive floating-point exception counts.
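
    For example, to trace floating-point exceptions in a program and then view the exception counts (the program name and the process ID suffix in the experiment file name are hypothetical):

    % ssrun -fpe generic
    % prof generic.fpe.m12345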

SpeedShop Libraries

Versions of the SpeedShop libraries libss.so and libssrt.so are available to support applications that are built using shared libraries (called dynamic shared objects, or DSOs) only, and that use the old 32-bit, new 32-bit, or 64-bit application binary interface (ABI).

The following list describes the different SpeedShop libraries.

  • libss.so: A shared library (DSO) that supports libssrt.so. The libss.so data normally appears in experiment results generated with prof.

  • libssrt.so: A shared library (DSO) that is linked in to the program you specify when you run an experiment. All performance data collection in the SpeedShop system is done within the target processes by exercising various pieces of functionality in libssrt.so. Data from libssrt.so does not normally appear in performance data reports generated with prof.

  • libfpe_ss.so: Supplements the standard libfpe.so for the purposes of collecting floating-point exception data. See the fpe_ss(3) man page for more information.

  • libmalloc_ss.so: Inserts versions of malloc routines from libc.so.1 that allow tracing all calls to malloc, free, realloc, memalign, and valloc. See the malloc_ss(3) man page for more information.

  • libpixrt.so: A shared library (DSO) used by programs that have been instrumented for basic block counting.

API

The SpeedShop application programming interface (API) allows you to use the ssrt_caliper_point function to set caliper points in your source code. See “Using Calipers” in Chapter 6, for information on using caliper points. For information on other API functions, see the ssapi(3) man page.
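
As a minimal sketch of how caliper points might look in C source, assuming the two-argument ssrt_caliper_point() interface and the ssapi.h header described in the ssapi(3) man page (verify the exact prototype and required library on your system; the function name compute_phase() and the caliper labels are hypothetical):

    /* Bracket a program phase with caliper points so that
       prof -calipers can report on just that region. */
    #include <ssapi.h>

    void compute_phase(void)
    {
        ssrt_caliper_point(1, "begin compute");   /* record a caliper point */

        /* ... the code to be measured goes here ... */

        ssrt_caliper_point(1, "end compute");     /* record a second point */
    }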

Supported Programming Models and Languages

The SpeedShop tools support programs with the following characteristics:

  • Shared libraries (DSOs).

  • Unstripped executables.

  • Executables that call fork(2), sproc(2), system(3F), or exec(2).

  • Executables using supported techniques for opening, closing, and delay-loading DSOs.

  • C, C++, Fortran (Fortran 77 and Fortran 90), or Ada (1.4.2 and older versions) source code.

  • Power Fortran and Power C source code. prof understands the syntax and semantics of the multiprocessing run time and displays the data accordingly.

  • pthreads, supported with data on a per-program basis.

  • Message Passing Interface (MPI) or other message-passing paradigms, currently supported by providing data on the behavior of each process. The behavior of the MPI library itself is monitored just like any other user-level code. See the MPI Programmer's Manual for details about the MPI library.

  • The OpenMP collection of compiler directives, library routines, and environment variables that can be used to specify shared memory parallelism.

Using SpeedShop Tools for Performance Analysis

Performance tuning typically consists of:

  1. Examining machine resource usage

  2. Breaking down the process into phases

  3. Identifying the resource bottleneck within each phase

  4. Correcting the cause of the bottleneck

Generally, you run the first experiment to break your program down into phases and run subsequent experiments to examine each phase individually. After you have solved a problem in a phase, you should re-examine machine resource usage to see if there is further opportunity for performance improvement.

The general steps for a performance analysis cycle are as follows:

  1. Build the application.

  2. Run experiments on the application to collect performance data.

  3. Examine the performance data.

  4. Generate an improved version of the program.

  5. Compare the performance of the improved version of the program against that of the previous version. To do this, use the sscompare command on the experiment files from the two versions to verify that improvements are being made.

  6. Repeat steps 1 through 5 as needed.

To accomplish this using SpeedShop tools, do the following:

  • Use the ssusage command to capture information on your program's use of machine resources.

  • Use the ssrun command to capture different types of performance data over either your entire program or parts of the program. ssrun can be used in conjunction with dbx(1) or cvd(1), the WorkShop debugger.

  • Use the prof command to analyze the data and generate reports.

Using ssusage to Evaluate Machine Resource Use

To determine overall resource usage by your program, run the program with ssusage. The results of this command allow you to identify high user CPU time, high system CPU time, high I/O time, and a high degree of paging. The ssusage(1) command has the following format:

ssusage executable_name executable_args
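
For example, using the sample program generic referred to later in this chapter, the command is:

ssusage generic

This runs generic to completion and then reports its overall resource usage.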

From the ssusage output, you can decide which experiments to run to collect data for further study. For more information on ssusage, see Chapter 5, “Collecting Data on Machine Resource Usage”, or see the ssusage(1) man page.

Gathering and Analyzing Performance Data

This section describes the steps involved in a performance analysis cycle when using the line-based interface to the SpeedShop tools: the ssrun and prof commands.

To perform a performance analysis, follow these general steps:

  1. Build the executable.

    You can usually build the executable as you would normally. See “Building Your Executable” in Chapter 6, for information on how to build the executable.

  2. Specify caliper points if you want to analyze data for only a portion of your program.

  3. To collect performance data, issue the ssrun command with the following parameters:

    % ssrun ssrun_options -exp_type executable_name executable_args

    The following options are available with the ssrun command:

    • ssrun_options: zero or more valid options. For a complete list of options, see the ssrun(1) man page.

    • exp_type: experiment name.

    • executable_name: executable name.

    • executable_args: arguments to the executable.

    Use the information in the following list to determine which experiments to run. Each performance problem is followed by one or more experiment types:

    • High user CPU time: usertime, pcsamp (four variants), *_hwc or *_hwctime (hardware counter experiments), or bbcounts.

    • High system CPU time: if floating-point exceptions are suspected, run an fpe trace.

    • High I/O time: bbcounts, then examine counts of I/O routines.

    • High paging rates: bbcounts, then prof -cordfb and cord to rearrange procedures.

    For each process of the executable, the experiment data is stored in a file with a name in the following form:

    executable_name.exp_type.id

    The experiment ID consists of one or two letters designating the process type, followed by the process ID number. An example of a name is:

    generic.pixbb.m10966

    See the following table for letter codes and descriptions.

    Table 1-1. Letter Codes in Process Experiment ID Numbers

    Letter Code    Description
    m              Master process created by ssrun
    p              Process created by a call to sproc()
    f              Process created by a call to fork()
    s              Process created by a call to system()
    e              Process created by a call to exec()
    fe             Process created by calls to fork() and exec()

    For more information on the ssrun command, see Chapter 6, “Setting Up and Running Experiments: ssrun”, or see the ssrun(1) man page.

  4. To generate a report from the experiment, issue prof with the following parameters:

    % prof options data_file

    • options: one or more valid options. For a complete list of options, see the prof(1) man page or “prof Options” in Chapter 7.

    • data_file: the name of the file in which the experiment data was recorded.
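
    For example, to report on the pcsamp experiment file created earlier, listing the most heavily used source lines first (this sketch assumes the -heavy option described in the prof(1) man page; the process ID suffix is hypothetical):

    % prof -heavy generic.pcsamp.m12345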

  5. The sscompare command can be used to analyze the performance data in experiment files generated by SpeedShop tools such as ssrun, and to produce a comparison report. When comparing application performance, be sure to keep a copy of the original binary and of the original experiment file so that you can compare the original experiment results with the newer (and, ideally, improved) results.

    The following are some useful comparisons:

    • application performance before and after optimization

    • multiple ranks in an MPI application

    • multiple threads in an OpenMP application

    • different experiments for the same application

    The comparison report produced by sscompare contains a legend and a table of performance data. Each input file and the type of performance data it contains is listed in the legend with a numeric column key. The table contains multiple columns of data; the type of data is dependent on the options used to generate the report.

    sscompare can be used with the following SpeedShop experiment types:

    • usertime

    • pcsamp

    • bbcounts

    See the sscompare(1) man page or “Comparing Experiment Results” in Chapter 7, for more details.
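
    For example, a before-and-after comparison might look like the following sketch, which assumes that sscompare accepts the experiment files to compare as arguments (the file names and process IDs are hypothetical; see the sscompare(1) man page for the exact invocation and report options):

    % sscompare generic.pcsamp.m10966 generic.pcsamp.m10982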

Collecting Data for Part of a Program

If you have a performance problem in only one part of your program, consider collecting performance data for just that part. You can do this by setting caliper points around the problem area when running an experiment. Then use the prof -calipers option to generate a report for the problem area, or use the calipers time line in the cvperf(1) window of WorkShop to view the area through a graphical user interface.

You can record caliper points using one of the following methods:

  • Direct calls to the SpeedShop API.

  • The caliper signal environment.

  • A debugger such as the ProDev WorkShop debugger.

  • Periodic (pollpoint) caliper points recorded at regular time intervals.

For more information on using calipers, see “Using Calipers” in Chapter 6.