Chapter 4. Experiment Types

This chapter provides detailed information on each experiment type available within SpeedShop. It contains the following sections:

  • “Selecting an Experiment”

  • “Floating-Point Exception Trace Experiment (fpe)”

  • “Heap Trace Experiment (heap)”

  • “Hardware Counter Experiments (*_hwc, *_hwctime)”

  • “Basic Block Counting Experiment (bbcounts)”

  • “I/O Trace Experiment (io)”

  • “MPI Call Tracing Experiment (mpi/mpi_trace)”

  • “NUMA Profiling Experiment (numa)”

  • “PC Sampling Experiment (pcsamp)”

  • “Call Stack Profiling Experiment (usertime/totaltime)”

For information on how to run the experiments described in this chapter, see Chapter 6, “Setting Up and Running Experiments: ssrun”.

Selecting an Experiment

Table 4-1 shows the possible experiments you can perform using the SpeedShop tools and the reasons why you might want to choose a specific experiment. The Clues column shows when you might use an experiment. The Data Collected column indicates performance data collected by the experiment. For detailed information on the experiments, see the relevant section in the remainder of this chapter.

Table 4-1. Summary of Experiments

Experiment: fpe
Clues: High system time. Presence of floating-point operations.
Data collected: All floating-point exceptions, with the exception type and the call stack at the time of the exception. See “Floating-Point Exception Trace Experiment (fpe)”.

Experiment: heap
Clues: Memory utilization.
Data collected: Heap trace data from each processor in a multiprocessor system. See “Heap Trace Experiment (heap)”.

Experiment: *_hwc
Clues: High user CPU time.
Data collected: On R10000, R12000, R14000, and R16000 class machines, counts at the source line, machine instruction, and function levels of various hardware events, including: clock cycles, graduated instructions, primary instruction cache misses, secondary instruction cache misses, primary data cache misses, secondary data cache misses, translation lookaside buffer (TLB) misses, and graduated floating-point instructions. PC sampling is used. See “Hardware Counter Experiments (*_hwc, *_hwctime)”.

Experiment: *_hwctime
Clues: High user CPU time.
Data collected: Similar to the *_hwc experiments, except that call stack sampling is used. See “Hardware Counter Experiments (*_hwc, *_hwctime)”.

Experiment: bbcounts
Clues: CPU-bound.
Data collected: CPU time at the function, source line, and machine instruction levels using basic block counting. See “Basic Block Counting Experiment (bbcounts)”.

Experiment: io
Clues: I/O-bound.
Data collected: Traces the following I/O system calls: read, readv, write, writev, open, close, dup, pipe, creat. See “I/O Trace Experiment (io)”.

Experiment: mpi
Clues: MPI performance is poor.
Data collected: Traces and times calls to various MPI routines. See “MPI Call Tracing Experiment (mpi/mpi_trace)”.

Experiment: mpi_trace
Clues: MPI performance is poor.
Data collected: Traces calls to various MPI routines and generates a file viewable in the cvperf performance analyzer window. This is a deprecated experiment and will be removed in a future release. See “MPI Call Tracing Experiment (mpi/mpi_trace)”.

Experiment: numa
Clues: Slow shared-memory program.
Data collected: On ccNUMA architecture machines, randomly samples memory accesses and reports: the instruction performing the access, the memory being accessed, the ccNUMA node where the access originates, the ccNUMA node containing the memory, and the ccNUMA routing distance between these two nodes. See “NUMA Profiling Experiment (numa)”.

Experiment: pcsamp
Clues: High user CPU time.
Data collected: Actual CPU time at the source line, machine instruction, and function levels, obtained by sampling the program counter at 10-millisecond or 1-millisecond intervals. See “PC Sampling Experiment (pcsamp)”.

Experiment: totaltime
Clues: Slow program, nothing else known. Not CPU-bound.
Data collected: Inclusive and exclusive real time for each function, obtained by sampling the call stack at 30-millisecond intervals. See “Call Stack Profiling Experiment (usertime/totaltime)”.

Experiment: usertime
Clues: Slow program, nothing else known. Not CPU-bound.
Data collected: Inclusive and exclusive CPU time for each function, obtained by sampling the call stack at 30-millisecond intervals. See “Call Stack Profiling Experiment (usertime/totaltime)”.


Floating-Point Exception Trace Experiment (fpe)

A floating-point exception trace records each floating-point exception, along with the exception type and the call stack at the time of the exception. A floating-point exception trace experiment should slow program execution by no more than 15%. These measurements are exact, not statistical.

The prof command generates a report that shows inclusive and exclusive floating-point exception counts.
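For example, assuming the generic executable used elsewhere in this guide, you might collect and report the data as follows. The experiment file name is illustrative; use the name of the file that ssrun actually writes.

% ssrun -fpe generic
% prof generic.fpe.m12345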

Heap Trace Experiment (heap)

If you are running a heap trace experiment (heap) on a multiprocessor application, you will get an experiment file for each process and an additional experiment file for the master process. Each process experiment file can contain either data from the whole application or only its own data, as follows:

  • By default, the experiment file for each process will contain data from all processes.

  • If you set the _SSMALLOC_NO_BUFFERING environment variable before executing ssrun, the experiment file for each process will contain only its own heap trace data.
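For example, to collect only per-process heap trace data, you might set the variable before starting the run. Setting it to 1 is an assumption; the requirement is only that the variable be set.

% setenv _SSMALLOC_NO_BUFFERING 1
% ssrun -heap generic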

Hardware Counter Experiments (*_hwc, *_hwctime)

In the SpeedShop hardware counter experiments, overflows of a particular hardware counter are recorded. Each hardware counter is configured to count from zero to a number designated as the overflow value. When the counter reaches the overflow value, the system resets it to zero and increments the number of overflows recorded at the current program instruction address. Each experiment provides a choice of overflow values; the values are prime numbers, so any profiles that look the same for different overflow values should be statistically valid.

The experiments described in this section are available for systems that have hardware counters (R10000, R12000, R14000, and R16000 class machines). Hardware counters allow you to count various types of events, such as cache misses and counts of issued and graduated instructions.

A hardware counter works as follows: for each event, the appropriate hardware counter is incremented on the processor clock cycle. For example, when a floating-point instruction is graduated in a cycle, the graduated floating-point instruction counter is incremented by 1.

These experiments are detailed by nature: they return information gathered at the hardware level. You will probably want to run a higher-level experiment first. Once you have narrowed the scope, you can use the hardware counter experiments to pinpoint the area to be tuned.

Two Tools for Hardware Counter Experiments

There are two tools that allow you to access hardware counter data:

  • perfex(1) is a command-line interface that provides program-level event information. For more information on perfex, see the perfex(1) man page. For more information on hardware counters, see the r10k_counters(1) man page.

  • SpeedShop allows you to perform the hardware counter experiments described in the next sections (“_hwc Hardware Counter Experiments” and “_hwctime Hardware Counter Experiments”).

_hwc Hardware Counter Experiments

The _hwc hardware counter experiments show where the overflows are being triggered in the program: at the function, source-line, or individual-instruction level. When you run prof on the data collected during the experiment, the overflow counts are multiplied by the overflow value to compute the total number of events. These numbers are statistical, meaning they are not precise. The generated reports show exclusive hardware counts: that is, information about where the program counter was, not the call stack used to get there.

Hardware counter overflow profiling experiments should incur a slowdown of execution of the program of no more than 5%. Count data is kept as 32-bit integers only.

By preceding the experiment name with an f (for example, cy_hwc becomes fcy_hwc), you reduce the overflow value to approximately one-fifth of the original value. By preceding the experiment name with an s, you increase the overflow value well above the original value.
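For example, following the command form used for the io experiment later in this chapter, you might run the cycle counter experiment first at the default overflow interval and then at the reduced (finer-grained) interval:

% ssrun -cy_hwc generic
% ssrun -fcy_hwc generic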

The following list describes the hardware counter experiments:

  • [f|s]gi_hwc experiment: this experiment counts overflows of the graduated instruction counter. The graduated instruction counter is incremented by the number of instructions that were graduated on the previous cycle. The experiment uses statistical PC sampling based on an overflow interval of 32,771. If the optional f prefix is used, the overflow interval is 6,553. If the s prefix is used, the overflow interval is 3,999,971.

  • [f|s]cy_hwc experiment: this experiment counts overflows of the cycle counter. The cycle counter is incremented on each clock cycle. The experiment uses statistical PC sampling based on an overflow interval of 16,411. If the optional f prefix is used, the overflow interval is 3,779. If the optional s prefix is used, the overflow interval is 1,999,993.

  • [f|s]ic_hwc experiment: this experiment counts overflows of the primary instruction cache miss counter. The counter is incremented one cycle after an instruction fetch request is entered into the miss handling table. The experiment uses statistical PC sampling based on an overflow interval of 2,053. If the optional f prefix is used, the overflow interval is 419. If the optional s prefix is used, the overflow interval is 524,309.

  • [f|s]isc_hwc experiment: this experiment counts overflows of the secondary instruction cache miss counter. The secondary instruction cache miss counter is incremented after the last 16-byte block of a 64-byte primary instruction cache line is written into the instruction cache. The experiment uses statistical PC sampling based on an overflow interval of 131. If the optional f prefix is used, the overflow interval is 29. If the optional s prefix is used, the overflow interval is 65,537.

  • [f|s]dc_hwc experiment: this experiment counts overflows of the primary data cache miss counter. The primary data cache miss counter is incremented on the cycle after a primary cache data refill is begun. The experiment uses statistical PC sampling based on an overflow interval of 2,053. If the optional f prefix is used, the overflow interval is 419. If the optional s prefix is used, the overflow interval is 524,309.

  • [f|s]dsc_hwc experiment: this experiment counts overflows of the secondary data cache miss counter. The secondary data cache miss counter is incremented on the cycle after the second 16-byte block of a primary data cache line is written into the data cache. The experiment uses statistical PC sampling, based on an overflow interval of 131. If the optional f prefix is used, the overflow interval is 29. If the optional s prefix is used, the overflow interval is 65,537.

  • [f|s]tlb_hwc experiment: this experiment counts overflows of the translation lookaside buffer (TLB) counter. The TLB counter is incremented on the cycle after the TLB miss handler is invoked. The experiment uses statistical PC sampling based on an overflow interval of 257. If the optional f prefix is used, the overflow interval is 53. If the optional s prefix is used, the overflow interval is 19,997.

  • [f|s]gfp_hwc experiment: this experiment counts overflows of the graduated floating-point instruction counter. The graduated floating-point instruction counter is incremented by the number of floating-point instructions that graduated on the previous cycle. If the optional f prefix is used, the overflow interval is 6,553. If the optional s prefix is used, the overflow interval is 3,999,971.

  • [f|s]fsc_hwc experiment: this experiment uses statistical PC sampling based on overflows of the failed store conditionals counter. If the optional f prefix is used, the overflow interval is 401. If the optional s prefix is used, the overflow interval is 19,997.

  • prof_hwc experiment: for any hardware counter not otherwise covered in “_hwc Hardware Counter Experiments”, or to choose different overflow intervals for those hardware counters, the prof_hwc experiment allows you to set a hardware counter to use in the experiment and to set a counter overflow interval using the following environment variables:

    • _SPEEDSHOP_HWC_COUNTER_NUMBER: the value of this variable can be between 0 and 31. Hardware counters are described in the MIPS R10000 User's Guide and on the r10k_counters(1) man page. The hardware counter numbers are provided in the tables in “Hardware Counter Numbers”.

    • _SPEEDSHOP_HWC_COUNTER_OVERFLOW: The value of this variable can be any number greater than 0. Some numbers may produce data that is not statistically random, but rather reflects a correlation between the overflow interval and a cyclic behavior in the application. You may want to do two or more runs with different overflow values.

    The default counter is the primary instruction-cache miss counter; the default overflow interval is 2,053.

    The experiment uses statistical PC sampling based on the overflow of the specified counter, at the specified interval. Note that these environment variables cannot be used for other hardware counter experiments. They are examined only when the prof_hwc or prof_hwctime experiment is specified.
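As an illustration only (the counter number and the prime overflow interval below are arbitrary choices, not recommendations), the following settings select the secondary data cache miss counter (counter 26) before running the prof_hwc experiment:

% setenv _SPEEDSHOP_HWC_COUNTER_NUMBER 26
% setenv _SPEEDSHOP_HWC_COUNTER_OVERFLOW 9973
% ssrun -prof_hwc generic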

_hwctime Hardware Counter Experiments

The following sections describe _hwctime hardware counter experiments, which run on R10000, R12000, R14000 and R16000 machines only. The _hwctime hardware counter experiments also show where the overflows are being triggered in the program. These experiments are similar to the _hwc experiments, but record the callstack information rather than showing where the program counter was when the overflow occurred.

See the perfex(1) and r10k_counters(5) man pages for other methods of returning hardware-level information.

The following list describes these experiments:

  • gi_hwctime experiment: hwct,17,1000003,0,SIGPROF:cu. Profiles the cycle counter using statistical call-stack sampling, based on overflows of the graduated-instruction counter, at an overflow interval of 1000003.

  • cy_hwctime experiment: hwct,0,10000019,0,SIGPROF:cu. Profiles the cycle counter using statistical call-stack sampling based on overflows of the cycle counter, at an overflow interval of 10000019.

  • ic_hwctime experiment: hwct,9,8009,0,SIGPROF:cu. Profiles the cycle counter using statistical call-stack sampling, based on overflows of the primary instruction-cache-miss counter, at an overflow interval of 8009.

  • isc_hwctime experiment: hwct,10,2003,0,SIGPROF:cu. Profiles the cycle counter using statistical call-stack sampling, based on overflows of the secondary instruction-cache-miss counter, at an overflow interval of 2003.

  • dc_hwctime experiment: hwct,25,8009,0,SIGPROF:cu. Profiles the cycle counter using statistical call-stack sampling, based on overflows of the primary data-cache-miss counter, at an overflow interval of 8009.

  • dsc_hwctime experiment: hwct,26,2003,0,SIGPROF:cu. Profiles the cycle counter using statistical call-stack sampling, based on overflows of the secondary data-cache-miss counter, at an overflow interval of 2003.

  • tlb_hwctime experiment: hwct,23,2521,0,SIGPROF:cu. Profiles the cycle counter using statistical call-stack sampling, based on overflows of the TLB miss counter, at an overflow interval of 2521.

  • gfp_hwctime experiment: hwct,21,10007,0,SIGPROF:cu. Profiles the cycle counter using statistical call-stack sampling, based on overflows of the graduated floating-point instruction counter, at an overflow interval of 10007.

  • fsc_hwctime experiment: hwct,5,5003,0,SIGPROF:cu. Profiles the cycle counter using statistical call-stack sampling, based on overflows of the failed store conditionals counter, at an overflow interval of 5003.

  • prof_hwctime experiment: For any hardware counter not otherwise covered in “_hwc Hardware Counter Experiments”, or to choose a different sampling counter or overflow interval for any hardware counter time experiment, the prof_hwctime experiment is available. Here, profiling is done for the counter specified by the environment variable _SPEEDSHOP_HWC_COUNTER_PROF_NUMBER, using statistical call-stack sampling based on overflows of the counter specified by the environment variable _SPEEDSHOP_HWC_COUNTER_NUMBER, at an interval given by the environment variable _SPEEDSHOP_HWC_COUNTER_OVERFLOW.


    Note: These environment variables cannot be used to override the counter numbers or interval for other defined experiments. They are examined only when the prof_hwctime or prof_hwc experiment is specified.


    The default overflow and profiling counter is the cycle counter; the default overflow interval is 10000019.
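    As an illustration only, the following hypothetical settings profile the graduated floating-point instruction counter (counter 21) based on overflows of the cycle counter (counter 0) at the default interval:

    % setenv _SPEEDSHOP_HWC_COUNTER_PROF_NUMBER 21
    % setenv _SPEEDSHOP_HWC_COUNTER_NUMBER 0
    % setenv _SPEEDSHOP_HWC_COUNTER_OVERFLOW 10000019
    % ssrun -prof_hwctime generic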

Hardware Counter Numbers

The possible numeric values for the _SPEEDSHOP_HWC_COUNTER_NUMBER and _SPEEDSHOP_HWC_COUNTER_PROF_NUMBER variables are shown in the following tables. Table 4-2 gives the hardware counter numbers for systems with R10000 processors, and Table 4-3 gives them for systems with R12000/R14000/R16000 processors. For the R10000 processors, if two counter numbers need to be specified, hardware restrictions require that one counter number be chosen from the group comprising numbers 0-15 and the other from the group comprising numbers 16-31. See the r10k_counters(5) man page for further details.

Table 4-2. R10000 Hardware Counter Numbers

Number   Indication
0        Cycles
1        Issued instructions
2        Issued loads
3        Issued stores
4        Issued store conditionals
5        Failed store conditionals
6        Decoded branches (rev 2.x processors) or resolved branches (rev 3.x processors)
7        Quadwords written back from secondary cache
8        Correctable secondary cache data array ECC errors
9        Primary instruction-cache misses
10       Secondary instruction-cache misses
11       Instruction misprediction from secondary cache way prediction table
12       External interventions
13       External invalidations
14       Virtual coherency conditions (or functional unit completions, depending on hardware version)
15       Graduated instructions
16       Cycles
17       Graduated instructions
18       Graduated loads
19       Graduated stores
20       Graduated store conditionals
21       Graduated floating-point instructions
22       Quadwords written back from primary data cache
23       TLB misses
24       Mispredicted branches
25       Primary data cache misses
26       Secondary data cache misses
27       Data misprediction from secondary cache way prediction table
28       External intervention hits in secondary cache
29       External invalidation hits in secondary cache
30       Store/prefetch exclusive to clean block in secondary cache
31       Store/prefetch exclusive to shared block in secondary cache


Table 4-3. R12000, R14000, R16000 Hardware Counter Numbers

Number   Indication
0        Cycles
1        Decoded instructions
2        Decoded loads
3        Decoded stores
4        Miss Handling Table occupancy
5        Failed store conditionals
6        Resolved conditional branches
7        Quadwords written back from secondary cache
8        Correctable secondary cache data array ECC errors
9        Primary instruction-cache misses
10       Secondary instruction-cache misses
11       Instruction misprediction from secondary cache way prediction table
12       External interventions
13       External invalidations
14       Not implemented
15       Graduated instructions
16       Executed prefetch instructions
17       Prefetch primary data cache misses
18       Graduated loads
19       Graduated stores
20       Graduated store conditionals
21       Graduated floating-point instructions
22       Quadwords written back from primary data cache
23       TLB misses
24       Mispredicted branches
25       Primary data cache misses
26       Secondary data cache misses
27       Data misprediction from secondary cache way prediction table
28       State of intervention hits in secondary cache
29       State of invalidation hits in secondary cache
30       Miss Handling Table (MHT) entries accessing memory
31       Store/prefetch exclusive to shared block in secondary cache


Basic Block Counting Experiment (bbcounts)

The bbcounts experiment displays an estimated time based on linear basic block counting.

Data is measured by counting the number of executions of each basic block and calculating an estimated time for each function. This involves instrumenting the program to divide the code into basic blocks, which are consecutive sequences of instructions with a single entry point, a single exit point, and no branches into or out of the sequence. Instrumentation also records a count of all dynamic (function-pointer) calls.

Because an exact count of every instruction in your program is recorded, you can also use the bbcounts experiment to determine the efficiency of your algorithm and identify any code that is not executed.

How SpeedShop Prepares Files

To permit block counting, SpeedShop does the following:

  • Divides the code into basic blocks, which are sets of instructions with a single entry point, a single exit point, and no branches into or out of the set.

  • Inserts counter code at the beginning of each basic block to increment a counter each time that basic block is executed.

The target executable, rld, and all the DSOs are instrumented. Instrumented files with an extension .pix*, where * depends on the ABI, are written to the current working directory or to the directory specified by the _SPEEDSHOP_OUTPUT_DIRECTORY environment variable, if set.
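For example, to have the instrumented .pix* files written to a scratch directory rather than the current working directory, you might set the variable before running the experiment. The directory path is illustrative.

% setenv _SPEEDSHOP_OUTPUT_DIRECTORY /usr/tmp/ssdata
% ssrun -bbcounts generic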

After instrumentation, ssrun executes the instrumented program. Data is generated as long as the process exits normally or receives a fatal signal that the program does not handle.

How SpeedShop Calculates CPU Time for bbcounts Experiments

The prof command uses a machine model to convert the block execution counts into an estimated, exclusive CPU time at the function, source line, or machine instruction levels. By default, the machine model corresponds to the machine on which the target was run; the user can specify a different machine model (CPU processor model and clock speed) for the analysis.

Note that the execution time of an instrumented program is three to six times longer than that of an uninstrumented one. This timing change may alter the behavior of a program that deals with a graphical user interface (GUI) or depends on events such as SIGALRM that are based on an external clock. Also, during analysis the instrumented executable might appear to be CPU-bound, whereas the original executable was I/O-bound.

Basic block counts are translated to an estimated CPU time displayed at the function, source line, and assembly instruction levels.

Inclusive Basic Block Counting

The basic block counting explained in the previous section allows you to measure ideal time spent in each procedure, but it does not propagate the time up to the caller of that procedure. For example, basic block counting may tell you that procedure sin(x) took the most time, but significant performance improvement can only be obtained by optimizing the callers of sin(x). Inclusive basic block counting solves this problem.

Inclusive basic block counting calculates cycles just as regular basic block counting does, and then propagates them up to callers. The cycles attributed to a procedure by regular basic block counting (called exclusive cycles) are divided among its callers in proportion to the number of times they called the procedure. For example, if sin(x) takes 1000 cycles, and its callers, procedures foo() and bar(), call sin(x) 25 and 75 times respectively, 250 cycles are attributed to foo() and 750 to bar(). By propagating cycles this way, __start() usually ends up with all the cycles counted in the program. (It is possible to write code that makes determining the complete call graph impossible, in which case parts of the call graph may end up disconnected.)

This proportional assumption can be very misleading. If foo() calls matmult() 99 times for 2-by-2 matrices, while bar() calls it once for 100-by-100 matrices, the inclusive time report will attribute 99% of matmult()'s time to foo(), when in fact almost all of that time could derive from the single call made by bar().

To generate a report that shows inclusive time, specify the -gprof option to the prof command.
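For example, assuming the generic executable, you might run the experiment and then request the inclusive report. The experiment file name is illustrative; use the name of the file that ssrun actually writes.

% ssrun -bbcounts generic
% prof -gprof generic.bbcounts.m12345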

Using pcsamp and bbcounts Together

The bbcounts experiment can be used together with the pcsamp experiment to compare actual and ideal times spent in the CPU. A major discrepancy between pcsamp CPU time and bbcounts CPU time indicates one or more of the following situations:

  • Cache misses and floating-point interlocks in a single process application

  • Secondary cache invalidations in an application with multiple processes that is run on a multiprocessor

A comparison between basic block counts (bbcounts experiment) and PC profile counts (pcsamp experiment) is shown in Table 4-4.

Table 4-4. Basic Block Counts and PC Profile Counts Compared

Basic Block Counts                                PC Profile Counts
Used to compute bbcounts CPU time.                Used to estimate actual CPU time.
Data collection by instrumentation.               Data collection by the kernel.
Slows program down by factor of three or more.    Has minimal impact on program speed.
Generates an exact count of every instruction.    Generates statistical, inexact counts.
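To make such a comparison in practice, you can run both experiments on the same executable and then generate a prof report from each resulting experiment file; for example:

% ssrun -pcsamp generic
% ssrun -bbcounts generic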


I/O Trace Experiment (io)

The I/O trace experiment shows you the level of I/O activity in your program by tracing various I/O system calls, for example read(2) and write(2).

The prof output of an I/O trace experiment yields the following information:

  • The number of I/O system calls executed.

  • The number of calls with an incomplete traceback.

  • The [index] column assigns a reference number to each function.

  • The number of I/O-related system calls from each function in the program.

  • The percentage of I/O-related system calls from each function in the program.

  • The percentage of I/O-related system calls encountered so far in the list of functions.

  • The number of I/O-related system calls made by a given function and by all the functions ultimately called by that given function. For example, the main function will probably include all of the program's I/O calls with complete tracebacks.

  • The percentage of I/O-related system calls made by a given function and by all the functions ultimately called by that given function.

  • The DSO, file name, and line number for each function.

The following ssrun command creates an I/O trace experiment file from the executable file generic:

% ssrun -io generic
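You can then generate the report with prof. The experiment file name shown here is illustrative; use the name of the file that ssrun actually writes.

% prof generic.io.m12345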

MPI Call Tracing Experiment (mpi/mpi_trace)

The mpi experiment traces calls to various MPI routines and generates a file that is viewable in prof. For a list of the routines that are traced, see the ssrun man page.

The mpi_trace experiment traces calls to various MPI routines and generates a file that is viewable in the cvperf Performance Analyzer window. For a list of the routines that are traced, see the ssrun man page (this experiment will be removed in a future release).

For more details about MPI experiments, see “Running Experiments on MPI Programs” in Chapter 6.
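Because MPI programs are normally launched with mpirun, the experiment is typically started through it as well. The following command is a sketch only, assuming a four-process job and an executable named a.out; see Chapter 6 for the supported invocations.

% mpirun -np 4 ssrun -mpi a.out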

NUMA Profiling Experiment (numa)

The NUMA profiler operates on a statistical basis by periodically interrupting the running application. During each interrupt the application's memory accesses are examined. Interrupts are triggered periodically by waiting for a particular number of CPU hardware performance counter events to occur. For example, the default setting is to interrupt the running application after approximately 100 secondary data cache misses have occurred.

During each interrupt, the profiler begins at the interrupted program counter location and finds the nearest memory access for which it can accurately calculate a target address. This address is then used to determine the ccNUMA node that contains the memory being accessed. The profiler also determines which ccNUMA node is executing the interrupted application thread.

Each time a sample is taken, the following information is stored in the experiment's data file:

  • ID of the thread performing the memory access

  • Program counter for the instruction performing the memory access

  • Memory address being accessed

  • ccNUMA node executing the memory access

  • ccNUMA node containing the memory being accessed

  • ccNUMA routing distance (in "hops") between these two ccNUMA nodes

This sampling process is repeated continuously until the application terminates.

The generated data can then be used to locate those lines of application code that generate the largest number of remote memory accesses ('remote' refers to the situation in which a node is performing an access to memory that does not lie on that node). By using facilities already present in IRIX (namely CPU sets and memory locality domains) the application engineer can attempt to minimize these remote accesses.

Applications running on a ccNUMA system do not see the same memory latency for every memory access, even after accounting for all cache effects. Accessing memory located on the same node as the one on which your code runs is faster than accessing memory on other nodes. Reducing the number of remote memory accesses therefore enhances the application's performance.
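The experiment itself is started like the others in this chapter; for example (the prof file name is illustrative; use the name of the file that ssrun actually writes):

% ssrun -numa generic
% prof generic.numa.m12345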

PC Sampling Experiment (pcsamp)

The pcsamp experiment estimates the actual CPU time for each source code line, machine code line, and function in your program. The prof listing of this experiment shows exclusive PC sampling time. This experiment is a lightweight, high-speed operation that makes use of the operating system.

CPU time is calculated by multiplying the number of times an instruction or function appears in the PC samples by the interval specified for the experiment (either 1 or 10 milliseconds).

To collect the data, the operating system regularly stops the process, increments a counter corresponding to the current value of the PC, and resumes the process. The default sample interval is 10 milliseconds. If you specify the optional f prefix to the experiment, a sample interval of 1 millisecond is used. (See “A pcsamp Experiment” in Chapter 2, for an example.)

By default, the experiment uses 16-bit counters. If the optional x suffix is used, a 32-bit counter size will be used. Using a 32-bit bin provides more accurate information, but requires additional memory and disk space. See “Example Using the pcsampx Experiment” in Chapter 6, for an example.

  • 16-bit bins allow a maximum of 65,536 counts.

  • 32-bit bins allow over 4 billion counts.

PC sampling runs should slow program execution by no more than 5 percent. The measurements are statistical in nature, meaning they exhibit variance inversely proportional to the running time.
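For example, following the experiment-naming convention described above, the default, fine-grained (1-millisecond), and 32-bit-counter variants of the experiment are requested as follows:

% ssrun -pcsamp generic
% ssrun -fpcsamp generic
% ssrun -pcsampx generic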

Call Stack Profiling Experiment (usertime/totaltime)

The usertime and totaltime experiments are good experiments with which to start your performance analysis. The usertime experiment returns CPU time for each function while your program runs, and the totaltime experiment returns real time for each function.

These experiments use statistical call stack profiling to measure inclusive and exclusive time. Data is measured by sampling the callstack every 30 milliseconds. The program's callstack data is used to do the following:

  • Attribute exclusive user time to the function at the bottom of each callstack (that is, the function being executed at the time of the sample).

  • Attribute inclusive user time to all the functions above the one currently being executed (those involved in the chain of calls that led to the function at the bottom of the callstack executing).

The time spent in a procedure is determined by multiplying the number of times an instruction for that procedure appears in the stack by the interval between call stack samples. Call stacks are gathered while the program is running; hence, the time computed represents user time, not time spent when the program is waiting for a CPU. User time includes both the time the program itself is executing and the time the operating system is performing services for the program, such as I/O.

The usertime experiment should incur a program execution slowdown of no more than 15%. Data from a usertime experiment is statistical in nature and shows some variance from run to run.
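For example, to profile with usertime and then generate the function-level report (the experiment file name is illustrative; use the name of the file that ssrun actually writes):

% ssrun -usertime generic
% prof generic.usertime.m12345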


Note: For this experiment, o32 executables must explicitly link with -lexc.