Chapter 4. Experiment Types

This chapter provides detailed information on each experiment type available within SpeedShop. It contains the following sections:

  • "Selecting an Experiment"

  • "usertime Experiment"

  • "pcsamp Experiment"

  • "ideal Experiment"

  • "Hardware Counter Experiments"

  • "fpe Trace"

For information on how to run the experiments described in this chapter, see Chapter 6, "Setting Up and Running Experiments: ssrun."

Selecting an Experiment

Table 4-1 shows the possible experiments you can perform using the SpeedShop tools and the reasons why you might want to choose a specific experiment. The Clues column shows when you might use an experiment. The Data Collected column indicates performance data collected by the experiment. For detailed information on the experiments listed, see the sections listed in Table 4-1.

Table 4-1. Summary of Experiments

Experiment: "usertime Experiment"
Clues: Slow program, nothing else known. Not CPU-bound.
Data Collected: Inclusive and exclusive user time for each function, by sampling the callstack at 30-millisecond intervals.

Experiment: "pcsamp Experiment"
Clues: High user CPU time.
Data Collected: Actual CPU time at the source line, machine instruction, and function levels, by sampling the program counter at 10- or 1-millisecond intervals.

Experiment: "ideal Experiment"
Clues: CPU-bound.
Data Collected: Ideal CPU time at the function, source line, and machine instruction levels, using instrumentation for basic block counting.

Experiment: "Hardware Counter Experiments"
Clues: High user CPU time.
Data Collected: On R10000 class machines, exclusive counts at the source line, machine instruction, and function levels for overflows of the following counters: clock cycle, graduated instructions, primary instruction-cache misses, secondary instruction-cache misses, primary data-cache misses, secondary data-cache misses, TLB misses, graduated floating-point instructions.

Experiment: "fpe Trace"
Clues: High system time. Presence of floating point operations.
Data Collected: All floating point exceptions, with the exception type and the callstack at the time of the exception.


usertime Experiment

The usertime experiment uses statistical call stack profiling, based on wall clock time, to measure inclusive and exclusive user time spent in each function while your program runs. This experiment uses an interval of 30 milliseconds.

Data is measured by periodically sampling the callstack. The program's callstack data is used to

  • attribute exclusive user time to the function at the bottom of each callstack (that is, the function being executed at the time of the sample)

  • attribute inclusive user time to all the functions above the one currently being executed

The time spent in a procedure is determined by multiplying the number of times an instruction for that procedure appears in a sampled callstack by the average time interval between samples. Callstacks are gathered whether the program was running or blocked; hence, the computed time represents total elapsed time, both within and outside the CPU. If the target process was blocked for a long time as a result of an instruction, that instruction shows up as having a high time.
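This attribution rule can be sketched as follows. The sampled stacks below are invented for illustration; SpeedShop's actual experiment files are binary and are read with prof.

```python
# Sketch of usertime's attribution of callstack samples to functions.
SAMPLE_INTERVAL = 0.030  # seconds; the 30-millisecond usertime interval

# Each sample lists a (hypothetical) callstack from outermost caller to
# the function executing at sample time (the "bottom" of the stack in
# this chapter's terms).
samples = [
    ["__start", "main", "foo", "sin"],
    ["__start", "main", "foo", "sin"],
    ["__start", "main", "bar"],
]

exclusive = {}  # time charged only to the executing function
inclusive = {}  # time charged to every function on the stack
for stack in samples:
    leaf = stack[-1]
    exclusive[leaf] = exclusive.get(leaf, 0.0) + SAMPLE_INTERVAL
    for fn in set(stack):  # set() avoids double-charging recursion
        inclusive[fn] = inclusive.get(fn, 0.0) + SAMPLE_INTERVAL

print(round(exclusive["sin"], 3))   # 0.06
print(round(inclusive["main"], 3))  # 0.09
```

Note how foo() accrues inclusive time from the sin() samples but no exclusive time, since it was never the executing function when a sample was taken.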

User time runs should incur a program execution slowdown of no more than 15%. Data from a usertime experiment is statistical in nature and shows some variance from run to run.


Note: For this experiment, o32 executables must explicitly link with -lexc.


pcsamp Experiment

The pcsamp experiment uses statistical PC sampling to estimate actual CPU time for each source line, machine instruction, and function in your program. The prof listing of this experiment shows exclusive PC-sampling time. This experiment is a lightweight, high-speed operation done with kernel support. The actual CPU time is calculated by multiplying the number of samples in which an instruction's address appeared in the PC by the interval specified for the experiment (either 1 or 10 milliseconds).

To collect the data, the kernel regularly stops the process if it is in the CPU, increments a counter for the current value of the PC, and resumes the process. The default sample interval is 10 milliseconds. If you specify the optional f prefix to the experiment, a sample interval of 1 millisecond is used.

By default, the experiment uses 16-bit bins, based on user and system time. If the optional x suffix is used, 32-bit bins are used instead. A 32-bit bin provides more accurate information, but requires additional disk space.

  • 16-bit bins allow a maximum of 65,535 counts.

  • 32-bit bins allow a maximum of approximately 4 billion counts.
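The practical difference is bin capacity. A minimal sketch of the saturating behavior (the counts and clamping shown here are illustrative arithmetic, not SpeedShop's actual data format):

```python
# A 16-bit bin cannot count past 2**16 - 1 = 65,535, so a hot source
# line in a long run can saturate; a 32-bit bin raises the ceiling to
# 2**32 - 1 (about 4 billion).
BIN16_MAX = 2**16 - 1
BIN32_MAX = 2**32 - 1

def add_tick(bin_value, bin_max):
    """Increment a PC-sample bin, clamping at its capacity."""
    return min(bin_value + 1, bin_max)

ticks16 = 0
for _ in range(100_000):  # 100,000 samples land on one hot line
    ticks16 = add_tick(ticks16, BIN16_MAX)
print(ticks16)  # 65535: the 16-bit bin saturated and undercounts
```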

PC-sampling time runs should incur a slowdown of execution of the program of no more than 5%. The measurements are statistical in nature, and exhibit variance inversely proportional to the running time.

ideal Experiment

The ideal experiment instruments the executables and any DSOs to permit basic block counting and counting of all dynamic (function-pointer) calls.

How SpeedShop Prepares Files

To permit block counting, SpeedShop

  • divides the code into basic blocks, which are sets of instructions with a single entry point, a single exit point, and no branches into or out of the set

  • inserts counter code at the beginning of each basic block to increment a counter each time that basic block is executed
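The bookkeeping these two steps produce can be pictured as follows. SpeedShop rewrites the machine code itself; this Python version only simulates the inserted counter code, with made-up block names.

```python
# Conceptual sketch of basic block counting: a counter per block,
# bumped each time control enters the block.
counters = {}

def enter_block(block_id):
    """Stand-in for the counter code inserted at a block's entry."""
    counters[block_id] = counters.get(block_id, 0) + 1

# Hypothetical control flow: entry block B1 runs once, then a loop
# body block B2 runs ten times.
enter_block("B1")
for _ in range(10):
    enter_block("B2")

print(counters)  # {'B1': 1, 'B2': 10}
```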

The target executable and all the DSOs it uses are instrumented, including libc.so.1, libexc.so, libm.so, libss.so, and libssrt.so. Instrumented files, with an extension .pix*, where * depends on the ABI, are written to the current working directory.

After the transformations are complete, the program's symbol table and translation table are updated so that debuggers can map between transformed addresses and the original program's addresses, and reference the measured performance data to the untransformed code.

After instrumentation, ssrun executes the instrumented program. Data is generated provided the process either exits normally or receives a fatal signal that the program does not handle.

How SpeedShop Calculates CPU Time

prof uses a machine model to convert the block execution counts into an idealized exclusive time at the function, source line, or machine instruction levels. By default, the machine model corresponds to the machine on which the target was run; the user can specify a different machine model for the analysis.

SpeedShop calculates ideal CPU time by using the TDT models. Potential floating-point interlocks are taken into account inside the same basic blocks, but ignored across basic block boundaries. Memory latency time (cache misses and memory bus contention) is ignored. The computed ideal time is therefore always less than the real time that any run would take. See Table 4-2 for a comparison of running a pcsamp experiment, which generates estimated actual CPU time, and running an ideal experiment.
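The count-to-time conversion can be pictured as follows. The per-block cycle costs and the clock rate below are invented numbers, not values from any real machine model:

```python
# Ideal time = (sum over blocks of count x modeled cycles) / clock rate.
# All numbers here are hypothetical.
CLOCK_HZ = 195_000_000  # an assumed 195 MHz clock

block_counts = {"B1": 1, "B2": 1_000_000}  # from instrumentation
block_cycles = {"B1": 12, "B2": 7}         # from the machine model

total_cycles = sum(block_counts[b] * block_cycles[b] for b in block_counts)
ideal_seconds = total_cycles / CLOCK_HZ
print(total_cycles)  # 7000012
```

Because the model ignores cache misses and memory contention, a real run of the same code would take at least this long, and usually longer.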

Note that the execution time of an instrumented program is three to six times longer than an uninstrumented one. This timing change may alter the behavior of a program that deals with a graphical user interface (GUI), or depends on events such as SIGALRM that are based on an external clock. Also, during analysis, the instrumented executable might appear to be CPU-bound, whereas the original executable was I/O-bound.

Basic block counts are translated to ideal CPU time displayed at the function, source line, and machine instruction levels.

Inclusive Basic Block Counting

The basic block counting explained in the previous section allows you to measure ideal time spent in each procedure, but doesn't propagate the time up to the caller of that procedure. For example, basic block counting may tell you that procedure sin(x) took the most time, but significant performance improvement can only be obtained by optimizing the callers of sin(x). Inclusive basic block counting solves this problem.

Inclusive basic block counting calculates cycles just as regular basic block counting does, and then propagates each procedure's cycles proportionately to its callers. The cycles of a procedure obtained using regular basic block counting (called exclusive cycles) are divided up among its callers in proportion to the number of times they called this procedure. For example, if sin(x) takes 1000 cycles, and its callers, procedures foo() and bar(), call sin(x) 25 and 75 times respectively, 250 cycles are attributed to foo() and 750 to bar(). By propagating cycles this way, __start() ends up with all the cycles counted in the program.

This propagation assumes that every call to a procedure consumes about the same number of cycles. When calls do widely varying amounts of work, the assumption can be very misleading. If foo() calls matmult() 99 times for 2X2 matrices, while bar() calls it once for 100X100 matrices, the inclusive time report will attribute 99% of matmult()'s time to foo(), but actually almost all the time derives from bar()'s one call.
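The proportional split described above, using the sin()/foo()/bar() numbers from this section, can be sketched as:

```python
# Divide sin()'s 1000 exclusive cycles among its callers in proportion
# to how many times each called it.
sin_cycles = 1000
calls_to_sin = {"foo": 25, "bar": 75}

total_calls = sum(calls_to_sin.values())
attributed = {caller: sin_cycles * n // total_calls
              for caller, n in calls_to_sin.items()}
print(attributed)  # {'foo': 250, 'bar': 750}
```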

To generate a report that shows inclusive time, specify the -gprof flag to prof.

Using pcsamp and ideal Together

The ideal experiment can be used together with the pcsamp experiment to compare actual and ideal times spent in the CPU. A major discrepancy between pcsamp CPU time and ideal CPU time indicates

  • cache misses and floating point interlocks in a single process application

  • secondary cache invalidations in an application with multiple processes that is run on a multiprocessor

A comparison between basic block counts (ideal experiment) and PC profile counts (pcsamp experiment) is shown in Table 4-2.

Table 4-2. Basic Block Counts and PC Profile Counts Compared

Basic Block Counts                           PC Profile Counts

Used to compute ideal CPU time               Used to estimate actual CPU time

Data collection done by instrumenting        Data collection done with the kernel

Slows program down by a factor of three      Has minimal impact on program speed

Generates an exact count                     Generates statistical counts


Hardware Counter Experiments

The experiments described in this section are available for systems that have hardware counters (R10000 class machines). Hardware counters allow you to count various types of events, such as cache misses and counts of issued and graduated instructions.

A hardware counter works as follows: on each processor clock cycle, each counter is incremented by the number of corresponding events that occurred during that cycle. For example, when a floating point instruction graduates in a cycle, the graduated floating-point instruction counter is incremented by 1.

Two Tools for Hardware Counter Experiments

There are two tools that allow you to access hardware counter data:

  • perfex is a command-line interface that provides program-level event information. For more information on perfex, and on hardware counters, see the perfex or r10k_counters reference pages.

  • SpeedShop allows you to perform the hardware counter experiments described in the next section ("SpeedShop Hardware Counter Experiments").

SpeedShop Hardware Counter Experiments

In SpeedShop hardware counter experiments, overflows of a particular hardware counter are recorded. Each hardware counter is configured to count from zero to a number designated as the overflow value. When the counter reaches the overflow value, the system resets it to zero and increments the number of overflows at the present program instruction address. Each experiment provides two possible overflow values; the values are prime numbers, so any profiles that seem the same for both overflow values should be statistically valid.

The hardware counter experiments show where the overflows are being triggered in the program, at the function, source-line, and individual instruction level. When you run prof on the data collected during the experiment, the overflow counts are multiplied by the overflow value to compute the total number of events. These numbers are statistical. The generated reports show exclusive hardware counts, that is, information about where the program counter was, not the callstack to get there.
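The overflow-to-event conversion is a single multiplication. A sketch using the gi_hwc default overflow interval from this chapter, with hypothetical per-line overflow counts:

```python
# prof multiplies recorded overflow counts by the overflow value to
# estimate total events. The per-line overflow counts are invented.
OVERFLOW_VALUE = 32771  # default gi_hwc overflow interval

overflows = {"line 42": 6, "line 97": 2}
events = {line: n * OVERFLOW_VALUE for line, n in overflows.items()}
print(events["line 42"])  # 196626
```

Because only every OVERFLOW_VALUE-th event is recorded, the per-line totals are statistical estimates, as the text notes.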

Hardware counter overflow profiling experiments should incur a slowdown of execution of the program of no more than 5%. Count data is kept as 32-bit integers only.

The available hardware experiments are [f]gi_hwc, [f]cy_hwc, [f]ic_hwc, [f]isc_hwc, [f]dc_hwc, [f]dsc_hwc, [f]tlb_hwc, [f]gfp_hwc, and prof_hwc.

[f]gi_hwc

The [f]gi_hwc experiment counts overflows of the graduated instruction counter. The graduated instruction counter is incremented by the number of instructions that were graduated on the previous cycle. The experiment uses statistical PC sampling based on overflows of the counter at an overflow interval of 32771. If the optional f prefix is used, the overflow interval is 6553.

[f]cy_hwc

The [f]cy_hwc experiment counts overflows of the cycle counter. The cycle counter is incremented on each clock cycle. The experiment uses statistical PC sampling based on overflows of the counter, at an overflow interval of 16411. If the optional f prefix is used, the overflow interval is 3779.

[f]ic_hwc

The [f]ic_hwc experiment counts overflows of the primary instruction-cache miss counter. The primary instruction-cache miss counter is incremented one cycle after an instruction fetch request is entered into the miss-handling table. The experiment uses statistical PC sampling based on the overflow of the counter at an overflow interval of 2053. If the optional f prefix is used, the overflow interval is 419.

[f]isc_hwc

The [f]isc_hwc experiment counts overflows of the secondary instruction-cache miss counter. The secondary instruction-cache miss counter is incremented after the last 16-byte block of a 64-byte primary instruction cache line is written into the instruction cache. The experiment uses statistical PC sampling based on the overflow of the counter at an overflow interval of 131. If the optional f prefix is used, the overflow interval is 29.

[f]dc_hwc

The [f]dc_hwc experiment counts overflows of the primary data-cache miss counter. The primary data-cache miss counter is incremented on the cycle after a primary cache data refill is begun. The experiment uses statistical PC sampling based on the overflow of the counter at an overflow interval of 2053. If the optional f prefix is used, the overflow interval is 419.

[f]dsc_hwc

The [f]dsc_hwc experiment counts overflows of the secondary data-cache miss counter. The secondary data-cache miss counter is incremented on the cycle after the second 16-byte block of a primary data cache line is written into the data cache. The experiment uses statistical PC sampling, based on the overflow of the counter at an overflow interval of 131. If the optional f prefix is used, the overflow interval is 29.

[f]tlb_hwc

The [f]tlb_hwc experiment counts overflows of the TLB (translation lookaside buffer) counter. The TLB counter is incremented on the cycle after the TLB miss handler is invoked. The experiment uses statistical PC sampling based on the overflow of the counter at an overflow interval of 257. If the optional f prefix is used, the overflow interval is 53.

[f]gfp_hwc

The [f]gfp_hwc experiment counts overflows of the graduated floating-point instruction counter. The graduated floating-point instruction counter is incremented by the number of floating point instructions which graduated on the previous cycle. The experiment uses statistical PC sampling based on overflows of the counter, at an overflow interval of 32771. If the optional f prefix is used, the overflow interval is 6553.

prof_hwc

The prof_hwc experiment allows you to set a hardware counter to use in the experiment, and to set a counter overflow interval using the following environment variables:

_SPEEDSHOP_HWC_COUNTER_NUMBER  


The value of this variable may be any number between 0 and 31. Hardware counters are described in the MIPS R10000 Microprocessor User's Manual, Chapter 14, and on the r10k_counters reference page. The hardware counter numbers are provided in Table 4-3.

_SPEEDSHOP_HWC_COUNTER_OVERFLOW 


The value of this variable may be any number greater than 0. Some numbers may produce data that is not statistically random, but rather reflects a correlation between the overflow interval and a cyclic behavior in the application. You may want to do two or more runs with different overflow values.

The default counter is the primary instruction-cache miss counter; the default overflow interval is 2053.

The experiment uses statistical PC sampling based on the overflow of the specified counter, at the specified interval. Note that these environment variables cannot be used for other hardware counter experiments. They are examined only when the prof_hwc experiment is specified.
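A sketch of selecting the counter and interval from a wrapper script. The environment variable names come from this chapter; the commented-out ssrun invocation is illustrative and would only work on a system with SpeedShop installed:

```python
import os

# Counter 25 is the primary data-cache miss counter (Table 4-3); the
# overflow value is an arbitrary prime, per this chapter's advice to
# avoid correlating with cyclic behavior in the application.
os.environ["_SPEEDSHOP_HWC_COUNTER_NUMBER"] = "25"
os.environ["_SPEEDSHOP_HWC_COUNTER_OVERFLOW"] = "1999"
# os.system("ssrun -prof_hwc a.out")  # would launch the experiment
```

Doing a second run with a different prime overflow value, as suggested above, is a cheap check that the profile is not an artifact of the sampling interval.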

Hardware Counter Numbers

The possible numeric values for the _SPEEDSHOP_HWC_COUNTER_NUMBER variable are shown in Table 4-3.

Table 4-3. Hardware Counter Numbers

Counter Number  Event Counted

 0  Cycles
 1  Issued instructions
 2  Issued loads
 3  Issued stores
 4  Issued store conditionals
 5  Failed store conditionals
 6  Decoded branches
 7  Quadwords written back from secondary cache
 8  Correctable secondary cache data array ECC errors
 9  Primary instruction-cache misses
10  Secondary instruction-cache misses
11  Instruction misprediction from secondary cache way prediction table
12  External interventions
13  External invalidations
14  Virtual coherency conditions (or functional unit completions, depending on hardware version)
15  Graduated instructions
16  Cycles
17  Graduated instructions
18  Graduated loads
19  Graduated stores
20  Graduated store conditionals
21  Graduated floating point instructions
22  Quadwords written back from primary data cache
23  TLB misses
24  Mispredicted branches
25  Primary data-cache misses
26  Secondary data-cache misses
27  Data misprediction from secondary cache way prediction table
28  External intervention hits in secondary cache
29  External invalidation hits in secondary cache
30  Store/prefetch exclusive to clean block in secondary cache
31  Store/prefetch exclusive to shared block in secondary cache


fpe Trace

A floating point exception trace collects each floating point exception with the exception type and the callstack at the time of the exception. Floating-point exception tracing experiments should incur a slowdown of execution of the program of no more than 15%. These measurements are exact, not statistical.

prof generates a report that shows inclusive and exclusive floating-point exception counts.