Chapter 1. Introduction to Performance Analysis

This chapter provides a brief introduction to performance analysis techniques for Silicon Graphics® systems and describes how to use them to solve performance problems. It includes the following sections:

Sources of Performance Problems

To tune a program's performance, you need to determine its consumption of machine resources. At any point (or phase) in a process, there is one limiting resource controlling the speed of execution. Processes can be slowed down by any of the following:

  • CPU speed and availability

  • I/O processing

  • memory size and availability

Performance problems may span the entire run of a process, or they may occur in just a small portion of the program. For example, a function that performs a lot of I/O processing might be called regularly as the program runs, or a particularly CPU-intensive calculation might occur in just one portion of the program. When there are performance problems in a small portion of the program, collect data for just that part of the program.

Because programs exhibit different behavior during different phases of operation, you need to identify the limiting resource for each phase. A program can be I/O-bound while it reads in data, CPU-bound while it performs computation, and I/O-bound again in its final stage while it writes out data. Once you've identified the limiting resource in a phase, you can perform an in-depth analysis to find the problem. After you have solved that problem, you can check for other problems within the same or other phases—performance analysis is an iterative process.

CPU-Bound Processes

A CPU-bound process spends its time in the CPU and is limited by CPU speed and availability. To improve performance on CPU-bound processes, streamline your code using one or more of the following techniques:

  • modifying algorithms

  • reordering code to avoid interlocks

  • removing nonessential steps

  • blocking to keep data in cache and registers

  • using alternative algorithms
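
As an illustration of blocking, the sketch below reorders a matrix multiply so that small tiles of the operands are reused before the loops move on. It is a minimal sketch, not taken from SpeedShop: the function names and the block size of 2 are hypothetical, and Python is used only to show the loop structure. The cache benefit itself appears when the same reordering is applied in a compiled language such as C or Fortran.

```python
def matmul_naive(a, b):
    """Reference triple loop: walks a full column of b for every output element."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, block=2):
    """Same arithmetic, reordered so each block-by-block tile is reused
    while it is still likely to be resident in cache (block is hypothetical)."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, block):              # tile over rows of a
        for kk in range(0, m, block):          # tile over the shared dimension
            for jj in range(0, p, block):      # tile over columns of b
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, m)):
                        aik = a[i][k]          # reused across the inner loop
                        for j in range(jj, min(jj + block, p)):
                            c[i][j] += aik * b[k][j]
    return c
```

Both functions perform the same multiplications and additions; only the order of the memory references changes, which is the point of the technique.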

I/O-Bound Processes

An I/O-bound process has to wait for I/O to complete and may be limited by disk access speeds or memory caching. To improve the performance of I/O-bound processes, try one or more of the following techniques:

  • improving overlap of I/O with computation

  • optimizing data usage to minimize disk access

  • using data compression
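
One way to overlap I/O with computation is to start reading the next chunk of input before processing the current one. The following is a minimal sketch of that idea under stated assumptions: the function name, the chunk size, and the use of a single background thread are all choices made for illustration, not part of SpeedShop.

```python
import io
from concurrent.futures import ThreadPoolExecutor


def process_overlapped(stream, chunk_size, work):
    """Apply work() to each chunk while the next chunk is read in the background."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(stream.read, chunk_size)      # start the first read
        while True:
            chunk = pending.result()                        # wait for the outstanding read
            if not chunk:                                   # empty result means end of input
                break
            pending = pool.submit(stream.read, chunk_size)  # prefetch the next chunk
            results.append(work(chunk))                     # compute while that read runs
    return results


if __name__ == "__main__":
    data = io.BytesIO(b"abcdefghij")
    print(process_overlapped(data, 4, len))  # prints [4, 4, 2]
```

Because only one read is ever outstanding, the chunks are still consumed in order; the gain comes from the read and the computation proceeding at the same time.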

Memory-Bound Processes

A memory-bound program continuously swaps out pages of memory. Page thrashing often results from accessing virtual memory in a haphazard rather than a strategic pattern. When paging goes to a local disk, one telltale sign of page thrashing is noise during disk accesses. To fix a memory-bound process, try to improve the memory reference patterns or, if possible, decrease the memory used by the program.

Bugs

Certain bugs can cause performance problems. Examples include:

  • Different parts of the program unnecessarily read the same file twice.

  • Floating point exceptions are slowing down the program.

  • Old code has not been completely removed.

  • The program is leaking memory (making malloc() calls without the corresponding calls to free()).

Fixing Performance Problems

The SpeedShop performance tools described in this manual can help you to identify specific performance problems described later in this chapter. However, the techniques described in this manual are only a part of performance tuning. Other areas that you can tune, but that are outside the scope of this document, include graphics, I/O, the kernel, system parameters, memory, and real-time system calls.

Although it may be possible to obtain short-term speed increases by relying on unsupported or undocumented quirks of the compiler, it's a bad idea to do so. Any such "features" may break in future compiler releases. The best way to produce efficient code that can be trusted to remain efficient is to follow good programming practices. In particular, choose good algorithms and leave the details to the compiler.

SpeedShop Tools

The SpeedShop tools allow you to run experiments and generate reports to track down the sources of performance problems. SpeedShop consists of a set of commands that can be run in a shell, an API, and a number of libraries to support the commands.

This section provides an overview of the tools by first discussing the main commands, then providing more detail on additional commands, experiment types, libraries, and supported programs and languages.

Main Commands

SpeedShop provides the commands listed in Table 1-1.

Table 1-1. SpeedShop Main Commands

Command     Description

ssusage     Collects information about your program's use of machine resources. Output from ssusage can be used to determine where most resources are being spent.

ssrun       Allows you to run experiments on a program to collect performance data. It establishes the environment to capture performance data for an executable, creates a process from the executable (or from an instrumented version of the executable) and runs it. Input to ssrun consists of an experiment type, control flags, the name of the target, and the arguments to be used in executing the target.

prof        Analyzes the performance data you have recorded using ssrun and provides formatted reports. prof detects the type of experiment you have run, and generates a report specific to the experiment type.


Additional Commands

SpeedShop provides the additional commands shown in Table 1-2.

Table 1-2. SpeedShop Additional Commands

Command     Description

pixie       Instruments an executable to enable basic block counting experiments to be performed. If you use ssrun, you will not normally need to call this program directly.

fbdump      Prints out the formatted contents of compiler feedback files generated by prof.

squeeze     Allocates a region of virtual memory and locks the virtual memory down into real memory, making it unavailable to other processes.

thrash      Allows you to allocate a block of memory, then access the allocated memory to explore paging behavior.

ssdump      Prints out formatted performance data that was collected while running ssrun. This program is included for SpeedShop debugging purposes. You don't normally need to use it.


Experiment Types

You can conduct the following types of experiments using the ssrun command:

  • Statistical PC sampling with pcsamp experiments.

    Data is measured by periodically sampling the Program Counter (PC) of the target executable when it is executing in the CPU. The PC shows the address of the currently executing instruction in the program. The data that is obtained from the samples is translated to a time that can be displayed at the function, source line, and machine instruction levels. The actual CPU time is calculated by multiplying the number of times a specific address is found in the PC by the amount of time between samples.

  • Statistical hardware counter sampling with *_hwc experiments.

    Hardware counter experiments are available on R10000 systems that have built-in hardware counters. Data is measured by collecting information each time the specified hardware counter overflows. You can specify the hardware counter and the overflow interval you want to use.

  • Statistical call stack profiling with usertime.

    Data is measured by periodically sampling the call stack. The program's call stack data is used to attribute exclusive user time to the function at the bottom of each call stack (that is, the function being executed at the time of the sample), and to attribute inclusive user time to all the functions above the one currently being executed.

  • Basic block counting with ideal.

    Data is measured by counting basic blocks and calculating an ideal CPU time for each function. This involves instrumenting the program to divide the code into basic blocks, which are sets of instructions with a single entry point, a single exit point, and no branches into or out of the set. Instrumentation also permits a count of all dynamic (function-pointer) calls to be recorded.

  • Floating point exception trace with fpe.

    A floating point exception trace collects each floating point exception with the exception type and the call stack at the time of the exception. prof generates a report showing inclusive and exclusive floating point exception counts.
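
The sampling arithmetic described above for pcsamp and usertime can be sketched as follows. This is a simplified model, not the prof implementation: the function name, the sample data, and the interval are hypothetical, and the sketch assumes the usual convention that a function's inclusive time also counts samples in which that function itself is executing.

```python
from collections import Counter


def attribute_samples(stacks, interval):
    """Turn call-stack samples into exclusive and inclusive times.

    stacks   -- list of samples, each a list of function names ordered from
                the outermost caller to the function executing at sample time
    interval -- time between samples (the results use the same unit)
    """
    exclusive = Counter()
    inclusive = Counter()
    for stack in stacks:
        exclusive[stack[-1]] += 1      # only the function executing at the sample
        for func in set(stack):        # set(): a recursive function counts once per sample
            inclusive[func] += 1
    # Time is the number of samples multiplied by the time between samples.
    exclusive_time = {f: n * interval for f, n in exclusive.items()}
    inclusive_time = {f: n * interval for f, n in inclusive.items()}
    return exclusive_time, inclusive_time


if __name__ == "__main__":
    samples = [
        ["main", "solve", "dot"],
        ["main", "solve"],
        ["main", "solve", "dot"],
        ["main", "read_input"],
    ]
    excl, incl = attribute_samples(samples, interval=10)  # 10 time units per sample
    print(excl["dot"], incl["solve"], incl["main"])       # prints 20 30 40
```

A pcsamp experiment corresponds to the exclusive figures alone, because it records only the program counter; a usertime experiment records the whole call stack and so yields both figures.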

SpeedShop Libraries

Versions of the SpeedShop libraries libss.so and libssrt.so are available for applications that are built using shared libraries (DSOs) only and that use the old 32-bit, new 32-bit, or 64-bit application binary interface (ABI).

Table 1-3 provides information about the different SpeedShop libraries.

Table 1-3. SpeedShop Libraries

Library            Description

libss.so           A shared library (DSO) that supports libssrt.so. libss.so data normally appears in experiment results generated with prof.

libssrt.so         A shared library (DSO) that is linked into the program you specify when you run an experiment. All the performance data collection with the SpeedShop system is done within the target process(es), by exercising various pieces of functionality using libssrt.so. Data from libssrt.so does not normally appear in performance data reports generated with prof.

libfpe_ss.so       Supplements the standard libfpe.so for the purposes of collecting floating point exception data. See the fpe_ss reference page for more information.

libmalloc_ss.so    Inserts versions of malloc routines from libc.so.1 that allow tracing all calls to malloc, free, realloc, memalign, and valloc. See the malloc_ss reference page for more information.

libpixrt.so        A shared library (DSO) used by pixified programs.


API

The SpeedShop API is primarily available to allow you to use ssrt_caliper_point to set caliper points in your source code. See "Using Calipers" in Chapter 6 for information on using caliper points. For information on other API functions, see the ssapi reference page.

Supported Programming Models and Languages

The SpeedShop tools support programs with the following characteristics:

  • Shared libraries (DSOs).

  • Non-stripped executables.

  • Executables that call fork(), sproc(), system(), or exec().

  • Executables using supported techniques for opening, closing, and/or delay-loading DSOs.

  • C, C++, Fortran (Fortran-77, Fortran-90, and High-Performance Fortran), or Ada® 95 source code.

  • Power Fortran and Power C source code.

    prof understands the syntax and semantics of the multi-processing runtime and displays the data accordingly.

  • pthreads, supported with data on a per-program basis.

  • Message Passing Interface (MPI) or other message-passing paradigms. These are currently supported by providing data on the behavior of each process. The behavior of the MPI library itself is monitored just like any other user-level code.

Using SpeedShop Tools for Performance Analysis

Performance tuning typically consists of the following activities:

  • examining machine resource usage

  • breaking down the process into phases

  • identifying the resource bottleneck within each phase

  • correcting the cause of the bottleneck

Generally, you run the first experiment to break your program down into phases and run subsequent experiments to examine each phase individually. After you have solved a problem in a phase, you should re-examine machine resource usage to see if there is further opportunity for performance improvement.

The general steps for a performance analysis cycle are:

  1. Build the application.

  2. Run experiments on the application to collect performance data.

  3. Examine the performance data.

  4. Generate an improved version of the program.

  5. Repeat as needed.

To accomplish this using SpeedShop tools:

  • Use ssusage to capture information on your program's use of machine resources.

  • Use ssrun to capture different types of performance data over either your entire program or parts of the program. ssrun can be used in conjunction with dbx or WorkShop debuggers.

  • Use prof to analyze the data and generate reports.

Using ssusage to Evaluate Machine Resource Use

To determine overall resource usage by your program, run the program with ssusage. The results of this command allow you to identify high user CPU time, high system CPU time, high I/O time, and a high degree of paging.

ssusage prog_name prog_args 

From the ssusage output, you can decide which experiments to run to collect data for further study. For more information on ssusage, see Chapter 5, "Collecting Data on Machine Resource Usage," or see the ssusage reference page.
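
The mapping from an observed problem to a follow-up experiment (Table 1-4) can be expressed as a simple lookup. The sketch below is only a transcription of that table for illustration: the dictionary keys and the function name are hypothetical, and the sketch does not parse ssusage output, whose format is described in Chapter 5.

```python
# Transcription of Table 1-4. The keys are hypothetical labels chosen for this
# sketch; the experiment names and follow-up notes come from the table.
TABLE_1_4 = {
    "high user CPU time": (["usertime", "pcsamp", "*_hwc", "ideal"], ""),
    "high system CPU time": (["fpe"], "only if floating point exceptions are suspected"),
    "high I/O time": (["ideal"], "then examine counts of I/O routines"),
    "high paging": (["ideal"], "then use prof -feedback and cord to rearrange procedures"),
}


def suggest_experiments(problems):
    """Return (experiment, note) pairs for the observed problems, without duplicates."""
    seen = set()
    plan = []
    for problem in problems:
        experiments, note = TABLE_1_4[problem]
        for exp in experiments:
            if (exp, note) not in seen:   # keep first occurrence, preserve order
                seen.add((exp, note))
                plan.append((exp, note))
    return plan


if __name__ == "__main__":
    for exp, note in suggest_experiments(["high I/O time", "high paging"]):
        print(exp, "-", note)
```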

Using ssrun and prof to Gather and Analyze Performance Data

This section describes the steps involved in a performance analysis cycle when using the main interface to the SpeedShop tools: the ssrun command.

You can also call the commands individually. For example, if you plan to perform basic block counting experiments, which involve instrumenting the executable, you can do so either by calling ssrun with the appropriate experiment type or by setting up your environment and calling pixie directly to instrument your executable. Information on setting up your environment and running pixie directly can be found in Chapter 8, "Using SpeedShop in Expert Mode: pixie."

To perform a performance analysis, follow these general steps:

  1. Build the executable.

    You can usually build the executable as you would normally. See "Building Your Executable" in Chapter 6 for information on how to build the executable.

  2. Specify caliper points if you want to collect data for only a portion of your program.

    See "Collecting Data for Part of a Program" for more information.

  3. To collect performance data, call ssrun with the parameters below. Use the information in Table 1-4 to determine which experiments to run:

    ssrun flags exp_type prog_name prog_args 
     

    flags        One or more valid flags. For a complete list of flags, see the ssrun reference page.

    exp_type     Experiment name.

    prog_name    Executable name.

    prog_args    Arguments to the executable.

    Table 1-4. Choosing an Experiment Type

    Performance Problem     Experiment(s) to Run

    High user CPU time      usertime; pcsamp (four variants); *_hwc experiments; ideal

    High system CPU time    If floating point exceptions are suspected: fpe

    High I/O time           ideal, then examine counts of I/O routines

    High paging (majf)      ideal, then prof -feedback and cord to rearrange procedures. If inefficient heap usage is suspected, use WorkShop's Performance Analyzer to gather information.


    For each process of the executable, the experiment data is stored in a file with a name of the format prog_name.exp_type.id.

    The experiment ID, id, consists of one or two letters (designating the process type) and the process ID number. See Table 1-5 for letter codes and descriptions.

    Table 1-5. Letter Codes in Process Experiment ID Numbers

    Letter Code    Description

    m              Master process created by ssrun

    p              Process created by a call to sproc()

    f              Process created by a call to fork()

    s              Process created by a call to system()

    e              Process created by a call to exec()

    fe             Process created by a call to fork() and exec()

    For more information on the ssrun command, see Chapter 6, "Setting Up and Running Experiments: ssrun," or view the ssrun reference page.

  4. To generate a report of the experiment, call prof with the following parameters:

    prof flags data_file

    flags        One or more valid flags. For a complete list of flags, see the prof reference page.

    data_file    The name of the file in which the experiment data was recorded.

    For more information on using prof, see Chapter 7, "Analyzing Experiment Results: prof," or see the prof reference page.
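
When an experiment produces several data files, it can help to split each file name into the parts described in step 3. The following sketch assumes the name is exactly prog_name.exp_type.id with the letter code immediately followed by the process ID; the function name and the sample file names are hypothetical and only illustrate the format.

```python
import re

# Letter codes from Table 1-5.
PROCESS_CODES = {
    "m": "master process created by ssrun",
    "p": "process created by a call to sproc()",
    "f": "process created by a call to fork()",
    "s": "process created by a call to system()",
    "e": "process created by a call to exec()",
    "fe": "process created by a call to fork() and exec()",
}

# "fe" is listed before the single letters so that it is tried first.
_NAME = re.compile(
    r"^(?P<prog>.+)\.(?P<exp>[^.]+)\.(?P<code>fe|[mpfse])(?P<pid>\d+)$"
)


def parse_experiment_file(name):
    """Split an experiment data file name into its documented components."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not an experiment data file name: {name!r}")
    return {
        "prog_name": match.group("prog"),
        "exp_type": match.group("exp"),
        "process": PROCESS_CODES[match.group("code")],
        "pid": int(match.group("pid")),
    }


if __name__ == "__main__":
    # Hypothetical file names, used only to exercise the parser.
    print(parse_experiment_file("generic.usertime.m12345"))
    print(parse_experiment_file("a.out.pcsamp.fe77"))
```

The program name is matched greedily, so an executable whose name contains a period (such as a.out) is still separated correctly from the experiment type.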

Collecting Data for Part of a Program

If you have a performance problem in only one part of your program, consider collecting performance data for just that part. You can do this by setting caliper points around the problem area when running an experiment, then using the prof -calipers option to generate a report for the problem area.

You can set caliper points using one of the following:

  • the SpeedShop API

  • the caliper signal environment

  • a debugger such as the ProDev WorkShop debugger

For more information on using calipers, see "Using Calipers" in Chapter 6.