Chapter 7. Analyzing Experiment Results

This chapter provides information on how to view and analyze experiment results by using the prof(1) report generator. This chapter has the following sections:

Using prof to Generate Performance Reports

Performance data is examined using prof, a text-based report generator that prints to stdout.

Use either of the following syntaxes to generate a report from performance data gathered during experiments recorded by ssrun(1):

prof [options][speedshop_data_file]... 

or

prof [options] executable_name [speedshop_data_file]...

prof Arguments

The arguments for prof when used with data files from ssrun are as follows:

  • options: zero or more of the options described in Table 7-1.

  • executable_name: the name of the executable file (including its path) created by the compiler. This argument is needed if prof is unable to locate the executable relative to the location of the data files being analyzed because the data or the executable were moved after the files were created.

  • speedshop_data_file: one or more names of performance data files generated by ssrun. The file names may differ only in the ID portion of their names. The exp_type portion of the names must be identical.

prof Options

The following table lists prof options that are current for this release. For more information and for a list of any newly added options since this printing, see the prof(1) man page.

Table 7-1. Options for prof

Name

Result

-archinfo

Reports the number of times each register was used as a destination, base (integer registers only), or source; how many times each instruction opcode was used; and some detailed statistics concerning branches jumps, and how many delay slots were filled with no-op instructions. Works only with bbcounts experiments.

-basicblocks

Prints a list of all the basic blocks executed, ordered by the number of cycles spent in each basic block. Works only with bbcounts experiments.

-b[utterfly]

Causes prof to print a report showing the callers and callees of each function, with inclusive time attributed to each. For bbcounts experiments, the attribution is based on a heuristic. For the various callstack sampling and tracing experiments, the attribution is precise. The usertime, totaltime, and some _hwctime experiments are statistical in nature and so are not exact. This option is ignored for experiments in which the data does not support inclusive calculations. It delivers the same display as -gprof.

-calipers [n1,]n2

Restricts analysis to a segment of program execution. This option works only for SpeedShop experiments.

 

Causes prof to compute the data between caliper points n1 and n2, rather than for the entire experiment.

If n1 >= n2, an error is reported.

 

If n2 is greater than the maximum number of caliper points recorded, it is set to the maximum.

 

If n1 is omitted, zero (the beginning of the program) is assumed.

-calls

Sorts the function list by the number of procedure calls rather than by time. This option can only be used when generating reports for bbcounts experiments.

-clock n

Sets the CPU clock speed to (n), expressed in megahertz. This option is useful when generating reports for bbcounts experiments. The default is the clock speed of the machine on which the performance data was collected.

-[no]cordfb

Enables or disables (-nocordfb) cord feedback file generation for the executable only. Cord feedback is used to arrange procedures in the binary in an optimal ordering. This improves both paging and instruction cache performance. Users can use cord(1) or ld(1) to actually do the procedure ordering.

-cordfball

Enables cord feedback for the executable and all DSOs.

-cycle n

Sets the cycle time to n nanoseconds. This parameter may be used as another way of setting the clock speed. See also the description for -clock n.

-debug:dbg_flags

Sets dbg_flags. dbg_flag should be specified as a hexadecimal value made by adding up combinations of the hexadecimal values listed below (Example: -debug:0x00000102):

GPROF_FLAG           0x00000001
COUNTS_FLAG          0x00000002
SAMPLE_FLAG          0x00000004
MISS_FLAG            0x00000008
FEEDBACK_FLAG        0x00000010
CORD_FLAG            0x00000020
USERPC_FLAG          0x00000040
MDEBUG_FLAG          0x00000080
BEAD_FLAG            0x00000100
LIBSSRT_FLAG         0x00000200

-dis[assemble]

Disassembles and annotates the analyzed object code with cycle times if you have run an bbcounts experiment, collected data using pixie, or have run apcsamp or _hwc/_hwctime experiment.

-dislimit n

Disassembles only those basic blocks with a frequency >= n%. This option applies to the same experiments as the -disassemble option.

-dso dsoname

Generates a report only for the named DSO. Only the base name, not the full path name, of the DSO needs to be specified; the .so suffix is required. Multiple instances of the -dso flag can be given.

-dsolist

List all the DSOs in the program and their start and end text addresses.

-e[xclude] procedure_name

If you use one or more -e[xclude] options, the profiler omits the specified procedure from the listing. If any option uses an upper-case E for -E[xclude], prof also omits that procedure from the base upon which it calculates percentages.

-feedback

Produces files with information that can be used to arrange procedures in the binary in an optimal ordering using cord. The cord feedback files are named program.fb or libso.fb. Procedures are normally ordered by their measured invocation counts; if -gprof is also specified, procedures are ordered using call graph counts, rather than invocation counts.

-fpe_counts

For FPE experiments, this option restricts the output to only show data from FPE events of type n. The default is to display the combined results from all FPE events.

-fpe_type n

Used with FPE experiments. Restricts the output to only show data from FPE events of type n. The default is to display the combined results from all FPE events.

-gprof

(See -b[utterfly].)

-h[eavy]

Lists the most heavily used lines of source code in descending order of use, sorting lines by their execution time. This option can be used when generating reports for bbcounts, pcsamp, or _hwc experiments.

-inclusive

Sorts function list by inclusive data rather than by exclusive data. This option can only be used when generating reports for those experiments that have inclusive data; it is ignored for others.

-l[ines]

Lists the most heavily used lines of source code in descending order of use, but lists lines grouped by procedure, sorted by cycles executed per procedure. This option can be used when generating reports for bbcounts, pcsamp, or _hwc experiments.

-nh

Suppresses various header blocks from the output.

-o[nly] procedure_name

If you use one or more -o[nly] options, the profile listing includes only the named procedures, rather than the entire program. If any option uses an uppercase -O[nly], prof uses only the named procedures, rather than the entire program, as the base upon which it calculates percentages.

-overhead

Generates overhead data for a parallel program. Overhead data includes how much time was spent when the program had no parallel work to do, how much time was lost when work was not spread evenly among the processors, and so on.

-pthreads pthread_id

Analyzes data only for the specified pthread identifier (for usertime, totaltime, _hwctime, io, and fpe experiments on applications that use pthreads on IRIX 6.5 or later systems). pthread_id may be a list of pthread identifiers separated by commas.

-q[uit] n

Condenses output listings by truncating -h[eavy], -l[ines], and -gprof listings. You can specify n in three ways:

 

n, an integer, truncates everything after n functions are listed;

 

n%, an integer followed by a percent sign, truncates the listing after the first entry that represents less than n percent of the total;

 

ncum%, an integer followed by cum%, truncates the listing after enough entries have been printed to account for n percent of the cumulative total. If -b[utterfly] is also specified, it behaves the same as -q n%.

 

For example, -q 15 truncates each part of the report after 15 lines of text, -q 15% truncates each part after the first entry that represents less than 15 percent of the whole, and -q 15cum% truncates each part after the entry that brought the cumulative percentage above 15%.

-rel[ative]

Shows percentage attribution in a butterfly report relative to the central function. The default is to show percentages as absolute percentages over the whole run.

-repository directory

Uses the SpeedShop DSO information from directory instead of processing the DSO in memory. This reduces the time it takes to retrieve source file information, procedure name, and address from the DSO. If you have multiple SpeedShop DSO repositories, you can use this option multiple times to define all the needed repositories. It can also be used to save the experiments because saving the DSO information will allow prof to not use the original DSO (which may have been modified). The directory is used in Read-Only mode.

-r16000|r14000|-r12000|-r10000

Overrides the default processor scheduling model that prof uses to generate a report. If this option is not specified, prof uses the scheduling model for the processor on which the experiment is being run. These options are only meaningful for an bbcounts time experiment or pixie count data.

-showss

Enables the display of functions from the SpeedShop run-time DSO. Usually those functions are suppressed from the reports and computations. In addition, some statistics for the prof command's own memory usage will be printed.

-S (-source)

Disassembles and annotates the analyzed object code with cycle times, or PC samples, and interleaves and lists the source code, if you have run a bbcounts, pcsamp, or _hwc experiment.

-update_repository directory

Processes the DSO and stores the information in directory. This option can only be specified once, and is required to save the DSO information. SpeedShop DSO information files have a _ssInfoabi extension, depending on the abi (32, n32, or 64). The directory is used in Read-Write mode.

-u[sage]

Prints a report on system statistics and timers.

-ws

Generates, for the executable only, a working-set file for the current caliper setting. This option is only meaningful for a bbcounts time experiment or pixie count data. The file suffix is .ws.

-wsall

Generates, for the executable and all the non-ignored DSOs, a working-set file for the current caliper setting. This option is only meaningful for a bbcounts time experiment. The file suffix is .ws.

-xdso dso_name

Excludes the named DSO from any reports. Only the base name, not the full path name, of the DSO need be specified; the .so suffix is required. Multiple instances of the -xdso flag can be specified.


prof Output

The prof command generates a performance report that is printed to stdout. Warning and fatal errors are printed to stderr.


Note: Fortran alternate entry point times are attributed to the main function or subroutine, since there is no general way for prof to separate the times for the alternate entries.


Using prof with ssrun

When you call prof with one or more SpeedShop performance data files, it collects the data from all the output files and produces a listing. The prof command is able to detect which experiment was run and generate an appropriate report. The command can identify all of the experiment types used with the ssrun command.

In cases where prof accepts more than one data file as input, it sums up the results. The multiple input data files must be generated from the same executable, using the same experiment type.

The prof command may report times for procedures named with a prefix of *DF*, for example *DF*_hello.init_2. DF stands for Dummy Function and indicates cycles spent in parts of text which are not in any function: init, fini, and MIPS.stubs sections, for example.

The most frequently used reports that prof generates are described in the following sections:

usertime Experiment Reports

For usertime experiments, prof generates CPU times for individual routines and shows how those times compare with the rest of the program. The column heading are as follows:

  • The index column provides an index number for reference.

  • The excl.secs column shows how much time, in seconds, was spent in the function itself (exclusive time). For example, less than one hundredth of a second was spent in __start(), but 0.03 of a second was spent in fread.

  • The excl.% column shows the percentage of a program's total time that was spent in the function.

  • The cum.% column shows the percentage of the complete program time that has been spent in the functions that have been listed so far.

  • The incl.secs column shows how much time, in seconds, was spent in the function and descendents of the function.

  • The incl.% column shows the cumulative percentage of inclusive time spent in each function and its descendents.

  • The samples column provides the number of samples of the function and all of its descendants.

  • The function (dso:file,line) column lists the function name, its DSO name, its file name, and its line number.

The following example is an abbreviated version of the full report. For a complete report, see “Generating a Report” in Chapter 2.

-------------------------------------------------------------------------
SpeedShop profile listing generated Mon Feb  2 11:07:15 1998
   prof generic.usertime.m10981
                 generic (n32): Target program
                      usertime: Experiment name
                         ut:cu: Marching orders
                 R4400 / R4000: CPU / FPU
                             1: Number of CPUs
                           175: Clock frequency (MHz.)
  Experiment notes--
          From file generic.usertime.m10981:
        Caliper point 0 at target begin, PID 10981
                        /usr/demos/SpeedShop/linpack.demos/c/generic
        Caliper point 1 at exit(0)
-------------------------------------------------------------------------
Summary of statistical callstack sampling data (usertime)--
                           809: Total Samples
                             0: Samples with incomplete traceback
                        24.270: Accumulated Time (secs.)
                          30.0: Sample interval (msecs.)
-------------------------------------------------------------------------
Function list, in descending order by exclusive time
-------------------------------------------------------------------------
 [index]  excl.secs excl.%   cum.%  incl.secs incl.%    samples  function (dso: file, line)

     [4]     22.770  93.8%   93.8%     22.770  93.8%        759  anneal (generic: generic.c, 1573)

pcsamp Experiment Reports

For [f]pcsamp[x] experiments, prof generates a function list annotated with the number of samples taken for the function and the estimated time spent in the function. The column headings are as follows:

  • The [index] column assigns a reference number to each function.

  • The secs column shows the amount of CPU time that was spent in the function.

  • The % column shows the percentage of the total program time that was spent in the function.

  • The cum.% column shows the percentage of the complete program time that has been spent in the functions that have been listed so far.

  • The samples column shows how many samples were taken when the process was executing in the function.

  • The function (dso:file, line) column lists the function, its DSO name, its file name, and its line number.

The following is output from an fpcsamp experiment:

-------------------------------------------------------------------------
SpeedShop profile listing generated Mon Feb  2 11:01:36 1998
   prof generic.fpcsamp.m11140
                 generic (n32): Target program
                       fpcsamp: Experiment name
                pc,2,1000,0:cu: Marching orders
                 R4400 / R4000: CPU / FPU
                             1: Number of CPUs
                           175: Clock frequency (MHz.)
  Experiment notes--
          From file generic.fpcsamp.m11140:
        Caliper point 0 at target begin, PID 11140
                        /usr/demos/SpeedShop/linpack.demos/c/generic
        Caliper point 1 at exit(0)
-------------------------------------------------------------------------
Summary of statistical PC sampling data (fpcsamp)--
                         23828: Total samples
                        23.828: Accumulated time (secs.)
                           1.0: Time per sample (msecs.)
                             2: Sample bin width (bytes)
-------------------------------------------------------------------------
Function list, in descending order by time
-------------------------------------------------------------------------
 [index]      secs    %    cum.%   samples  function (dso: file, line)

     [1]    22.279  93.5%  93.5%     22279  anneal (generic: generic.c,1573)

Hardware Counter Experiment Reports

For the various hwc experiments, prof generates a function list annotated with the number of overflows of hardware counters generated by the function. The column headings are as follows:

  • The [index] column assigns a reference number to each function.

  • The counts column shows the extrapolated event count based on the number of samples and the overflow value for the particular counter.

  • The % column shows the percentage of the program's overflows that occurred in the function.

  • The cum.% column shows the percentage of the program's overflows that occurred in the functions that have been listed so far.

  • The samples column shows the number of times the program counter was sampled during execution of the function.

  • The function (dso:file, line) column lists the name, the DSO, the file name, and line number of the function.

The following is output from a dsc_hwc hardware counter experiment:

-------------------------------------------------------------------------
SpeedShop profile listing generated Mon Feb  2 11:11:44 1998
   prof generic.dsc_hwc.m294398
                 generic (n32): Target program
                       dsc_hwc: Experiment name
                 hwc,26,131:cu: Marching orders
               R10000 / R10010: CPU / FPU
                            16: Number of CPUs
                           195: Clock frequency (MHz.)
  Experiment notes--
          From file generic.dsc_hwc.m294398:
        Caliper point 0 at target begin, PID 294398
                        /usr/demos/SpeedShop/linpack.demos/c/generic
        Caliper point 1 at exit(0)
-------------------------------------------------------------------------
Summary of R10K perf. counter overflow PC sampling data (dsc_hwc)--
                             6: Total samples
       Sec cache D misses (26): Counter name (number)
                           131: Counter overflow value
                           786: Total counts
-------------------------------------------------------------------------
Function list, in descending order by counts
-------------------------------------------------------------------------
 [index]        counts     %   cum.%   samples  function (dso: file, line)

     [1]           131  16.7%  16.7%         1  init2da (generic: generic.c, 1430)
     [2]           131  16.7%  33.3%         1  genLog (generic: generic.c, 1686)
     [3]           131  16.7%  50.0%         1  _write (libc.so.1: writeSCI.c, 27)
                   393  50.0% 100.0%         3  **OTHER** (includes excluded DSOs, rld, etc.)

                   786 100.0% 100.0%         6  TOTAL

bbcounts Experiment Reports

For bbcounts experiments, prof generates a function list annotated with the number of cycles and instructions attributed to the function and the estimated time spent in the function.

The prof command does not take into account interactions between basic blocks. Within a single basic block, prof computes cycles for one execution and multiplies it with the number of times that basic block is executed.

If any of the object files linked into the application have been stripped of line number information (with ld -x, for example), prof warns about the affected procedures. The instruction counts for such procedures are shown as a procedure total, not on a per-basic-block basis. Where a line number would normally appear in a report on a function without line numbers, question marks appear instead. The column headings are as follows:

  • The [index] column assigns a reference number to each function.

  • The excl.secs column shows the minimum number of seconds that might be spent in the function under bbcounts conditions.

  • The excl.% column represents how much of the program's total time was spent in the function.

  • The cum.% column shows the cumulative percentage of time spent in the functions that have been listed so far.

  • The cycles column reports the number of machine cycles used by the function.

  • The instructions column shows the number of instructions executed by a function.

  • The calls column reports the number of calls to the function.

  • The function (dso:file, line) column lists the function, its DSO name, its file name, and the line number.

The following is output from a bbcounts experiment.

-------------------------------------------------------------------------
SpeedShop profile listing generated Mon Aug 14 13:51:00 2000
   prof -butterfly generic.bbcounts.m46372
                 generic (n32): Target program
                         bbcounts: Experiment name
                         it:cu: Marching orders
               R12000 / R12010: CPU / FPU
                           127: Number of CPUs
                           400: Clock frequency (MHz.)
  Experiment notes--
          From file generic.bbcounts.m46372:
        Caliper point 0 at target begin, PID 46372
                        generic
        Caliper point 1 at exit(0)
-------------------------------------------------------------------------
Summary of bbcounts time data (bbcounts)--
                    2048835049: Total number of instructions executed
                    2552056463: Total computed cycles
                         6.380: Total computed execution time (secs.)
                         1.246: Average cycles / instruction
-------------------------------------------------------------------------
Function list, in descending order by exclusive bbcounts time
-------------------------------------------------------------------------
 [index]   excl.secs   excl.%     cum.%        cycles  instructions   incl.secs   incl.%
    calls  function  (dso: file, line)

     [5]       6.088    95.4%     95.4%    2435240026    1956780024       6.088    95.4%
        1  anneal (generic: generic.c, 1559)

If the -butterfly flag is added to prof, a list of callers and callees of each function is provided:

-------------------------------------------------------------------------
Butterfly function list, in descending order by inclusive bbcounts time
-------------------------------------------------------------------------
         attrib.% attrib.time(#calls)                         incl.time  caller (callsite) [index]
 [index]   incl.%   incl.time    self%   self-time       procedure [index]
                              attrib.% attrib.time(#calls)    incl.time  callee (callsite) [index]
-------------------------------------------------------------------------
     [1]    99.9%       6.376     0.0%       0.000         __start [1]
                                 99.9%       6.376(0000001)        6.376  main [2]
                                  0.0%       0.000(0000001)        0.000  __readenv_sigfpe [314]
                                  0.0%       0.000(0000001)        0.000  __istart [315]
-------------------------------------------------------------------------
            99.9%       6.376(0000001)                             6.376  __start [1]
     [2]    99.9%       6.376     0.0%       0.000         main [2]
                                 99.9%       6.376(0000001)        6.376  Scriptstring [3]
-------------------------------------------------------------------------
            99.9%       6.376(0000001)                             6.376  main [2]
     [3]    99.9%       6.376     0.0%       0.000         Scriptstring [3]
                                 95.4%       6.088(0000001)        6.088  usrtime [4]
                                  3.7%       0.238(0000001)        0.238  libdso [6]
                                  0.8%       0.050(0000001)        0.050  cvttrap [9]
                                  0.0%       0.000(0000001)        0.000  iofile [31]
                                  0.0%       0.000(0000002)        0.000  genLog [36]
                                  0.0%       0.000(0000001)        0.000  dirstat [56]
                                  0.0%       0.000(0000001)        0.000  linklist [63]
                                  0.0%       0.000(0000001)        0.000  fpetraps [65]
                                  0.0%       0.000(0000002)        0.000  fprintf [54]
                                  0.0%       0.000(0000002)        0.000  sprintf [49]
                                  0.0%       0.000(0000061)        0.000  strcmp [47]
-------------------------------------------------------------------------
            95.4%       6.088(0000001)                             6.376  Scriptstring [3]
     [4]    95.4%       6.088     0.0%       0.000         usrtime [4]
                                 95.4%       6.088(0000001)        6.088  anneal [5]
                                  0.0%       0.000(0000001)        0.000  genLog [36]
                                  0.0%       0.000(0000001)        0.000  fprintf [54]
-------------------------------------------------------------------------

fpe Trace Reports

The fpe trace report shows information for each function. The function name is shown in the right column of the report. The remaining columns are described below.

  • The [index] column assigns a reference number to each function.

  • The excl.FPEs column shows how many floating point exceptions were found in the function.

  • The excl.% column shows the percentage of the total number of floating-point exceptions that were found in the function.

  • The cum.% column shows the percentage of floating-point exceptions in the program that have been encountered so far in the list.

  • The incl.FPEs column shows how many floating-point exceptions were attributed to the function and descendents of the function.

  • The incl.% column shows the cumulative percentage of floating-point exceptions attributed to the function and its descendents.

  • The function (dso:file, line) column lists the function name, its DSO name, its file name, and its line number.

-------------------------------------------------------------------------
SpeedShop profile listing generated Mon Feb  2 13:26:33 1998
   prof generic.fpe.m12213
                 generic (n32): Target program
                           fpe: Experiment name
                        fpe:cu: Marching orders
                 R4400 / R4000: CPU / FPU
                             1: Number of CPUs
                           175: Clock frequency (MHz.)
  Experiment notes--
          From file generic.fpe.m12213:
        Caliper point 0 at target begin, PID 12213
                        /usr/demos/SpeedShop/linpack.demos/c/generic
        Caliper point 1 at exit(0)
-------------------------------------------------------------------------
Summary of FPE callstack tracing data (fpe)--
                             4: Total FPEs
                             0: Samples with incomplete traceback
-------------------------------------------------------------------------
Function list, in descending order by exclusive FPEs
-------------------------------------------------------------------------
 [index]   excl.FPEs excl.%   cum.%   incl.FPEs incl.%  function  (dso: file, line)

     [1]           4 100.0%  100.0%           4 100.0%  fpetraps (generic: generic.c, 405)

Using prof Options

This section shows the output from calling prof with some of the options available for prof.

Using the -dis Option

For pcsamp and bbcounts experiments, the -dis option to prof can be used to obtain machine instruction information. prof provides the standard report and then appends the machine instruction information to the end of the report. The following example shows partial output from prof for a pcsamp experiment.

-------------------------------------------------------------------------
SpeedShop profile listing generated Tue Feb  3 10:48:59 1998
   prof -dis generic.pcsamp.m14493
                 generic (n32): Target program
                        pcsamp: Experiment name
               pc,2,10000,0:cu: Marching orders
                 R4400 / R4000: CPU / FPU
                             1: Number of CPUs
                           175: Clock frequency (MHz.)
  Experiment notes--
	  From file generic.pcsamp.m14493:
	Caliper point 0 at target begin, PID 14493
			/usr/demos/SpeedShop/c/generic
	Caliper point 1 at exit(0)
-------------------------------------------------------------------------
Summary of statistical PC sampling data (pcsamp)--
                          2707: Total samples
                        27.070: Accumulated time (secs.)
                          10.0: Time per sample (msecs.)
                             2: Sample bin width (bytes)
-------------------------------------------------------------------------
Function list, in descending order by time
-------------------------------------------------------------------------
 [index]      secs    %    cum.%   samples  function (dso: file, line)
     [1]    25.240  93.2%  93.2%      2524  anneal (generic: generic.c, 1573)
     [2]     1.090   4.0%  97.3%       109  slaveusrtime (dlslave.so: dlslave.c, 22)
     [3]     0.390   1.4%  98.7%        39  __read (libc.so.1: read.s, 20)
     [4]     0.230   0.8%  99.6%        23  cvttrap (generic: generic.c, 317)
     [5]     0.090   0.3%  99.9%         9  _xstat (libc.so.1: xstat.s, 12)
     [6]     0.010   0.0%  99.9%         1  __write (libc.so.1: write.s, 20)
     [7]     0.010   0.0% 100.0%         1  _ngetdents (libc.so.1: ngetdents.s, 16)
     [8]     0.010   0.0% 100.0%         1  _doprnt (libc.so.1: doprnt.c, 285)

            27.070 100.0% 100.0%      2707  TOTAL

-------------------------------------------------------------------------
Disassembly listing, annotated with PC sampling overflows
-------------------------------------------------------------------------
.
.
.
/usr/demos/SpeedShop/linpack.demos/c/generic.c
anneal: <0x10006830-0x10006b3c>    2524 total samples(93.24%)
  [1573] 0x10006830	0x27bdffd0	addiu	sp,sp,-48 	# 1
  [1573] 0x10006834	0xffbc0020	sd	gp,32(sp) 	# 2
  [1573] 0x10006838	0xffbf0018	sd	ra,24(sp) 	# 3
  [1573] 0x1000683c	0x3c030002	lui	v1,0x2 	# 4
  [1573] 0x10006840	0x246397e8	addiu	v1,v1,-26648 	# 5
  [1573] 0x10006844	0x0323e021	addu	gp,t9,v1 	# 6
  [1575] 0x10006848	0xd7808370	ldc1	$f0,-31888(gp) 	# 7
		<2 cycle stall for following instruction>
  [1575] 0x1000684c	0xf7a00000	sdc1	$f0,0(sp) 	# 10
  [1577] 0x10006850	0x24010001	li	at,1 	# 11
  [1577] 0x10006854	0x8f82816c	lw	v0,-32404(gp) 	# 12
		<2 cycle stall for following instruction>
  [1577] 0x10006858	0xac410000	sw	at,0(v0) 	# 15
  [1578] 0x1000685c	0x8f998148	lw	t9,-32440(gp) 	# 16
  [1578] 0x10006860	0x0c00171b	jal	0x10005c6c 	# 17
  [1578] 0x10006864	0000000000	nop 	# 18
		<2 cycle stall for following instruction>
  [1586] 0x10006868	0xafa00008	sw	zero,8(sp) 	# 21
  [1586] 0x1000686c	0x8fa40008	lw	a0,8(sp) 	# 22
		<2 cycle stall for following instruction>
  [1586] 0x10006870	0x28842710	slti	a0,a0,10000 	# 25
  [1586] 0x10006874	0x108000ac	beq	a0,zero,0x10006b28 	# 26
  [1586] 0x10006878	0000000000	nop 	# 27
		<2 cycle stall for following instruction>
  [1588] 0x1000687c	0x24070001	li	a3,1 	# 30
  [1588] 0x10006880	0xafa7000c	sw	a3,12(sp) 	# 31
  [1588] 0x10006884	0x8f868164	lw	a2,-32412(gp) 	# 32
		<2 cycle stall for following instruction>
  [1588] 0x10006888	0x8cc60000	lw	a2,0(a2) 	# 35
		<2 cycle stall for following instruction>
  [1588] 0x1000688c	0x24c6ffff	addiu	a2,a2,-1 	# 38
  [1588] 0x10006890	0x8fa5000c	lw	a1,12(sp) 	# 39
		<2 cycle stall for following instruction>
  [1588] 0x10006894	0x00a6282a	slt	a1,a1,a2 	# 42
  [1588] 0x10006898	0x10a0009c	beq	a1,zero,0x10006b0c 	# 43
  [1588] 0x1000689c	0000000000	nop 	# 44
		<2 cycle stall for following instruction>
  [1589] 0x100068a0	0x240a0001	li	t2,1 	# 47
  ^------    1 samples(0.04%)------^
  [1589] 0x100068a4	0xafaa0010	sw	t2,16(sp) 	# 48
  ^------    1 samples(0.04%)------^
  [1589] 0x100068a8	0x8f898164	lw	t1,-32412(gp) 	# 49
		<2 cycle stall for following instruction>
  [1589] 0x100068ac	0x8d290000	lw	t1,0(t1) 	# 52
		<2 cycle stall for following instruction>
  [1589] 0x100068b0	0x2529ffff	addiu	t1,t1,-1 	# 55
  [1589] 0x100068b4	0x8fa80010	lw	t0,16(sp) 	# 56
		<2 cycle stall for following instruction>
  [1589] 0x100068b8	0x0109402a	slt	t0,t0,t1 	# 59
  [1589] 0x100068bc	0x11000089	beq	t0,zero,0x10006ae4 	# 60
  [1589] 0x100068c0	0000000000	nop 	# 61
		<2 cycle stall for following instruction>
  [1590] 0x100068c4	0x8faf000c	lw	t7,12(sp) 	# 64
  ^------   27 samples(1.00%)------^
		<2 cycle stall for following instruction>
  [1590] 0x100068c8	0x25ef0001	addiu	t7,t7,1 	# 67
  ^------    7 samples(0.26%)------^
  [1590] 0x100068cc	0x000f7080	sll	t6,t7,2 	# 68
  ^------   30 samples(1.11%)------^
  [1590] 0x100068d0	0x01cf7021	addu	t6,t6,t7 	# 69
  ^------    8 samples(0.30%)------^
  [1590] 0x100068d4	0x000e70c0	sll	t6,t6,3 	# 70
  ^------    5 samples(0.18%)------^
  [1590] 0x100068d8	0x8faf0010	lw	t7,16(sp) 	# 71
  ^------    8 samples(0.30%)------^
		<2 cycle stall for following instruction>
  [1590] 0x100068dc	0x01cf7021	addu	t6,t6,t7 	# 74
  ^------    9 samples(0.33%)------^
  [1590] 0x100068e0	0x000e70c0	sll	t6,t6,3 	# 75
  ^------   27 samples(1.00%)------^
  [1590] 0x100068e4	0x8f8f817c	lw	t7,-32388(gp) 	# 76
  ^------   14 samples(0.52%)------^
		<2 cycle stall for following instruction>
  [1590] 0x100068e8	0x01cf7021	addu	t6,t6,t7 	# 79
  ^------    9 samples(0.33%)------^
  [1590] 0x100068ec	0x25ce0008	addiu	t6,t6,8 	# 80
  ^------   28 samples(1.03%)------^
  [1590] 0x100068f0	0xd5c10000	ldc1	$f1,0(t6) 	# 81
  ^------    7 samples(0.26%)------^
  [1590] 0x100068f4	0x8fad000c	lw	t5,12(sp) 	# 82
  ^------   10 samples(0.37%)------^
		<2 cycle stall for following instruction>
  [1590] 0x100068f8	0x25ad0001	addiu	t5,t5,1 	# 85
  ^------   21 samples(0.78%)------^
  [1590] 0x100068fc	0x000d6080	sll	t4,t5,2 	# 86
  ^------   19 samples(0.70%)------^
  [1590] 0x10006900	0x018d6021	addu	t4,t4,t5 	# 87
  ^------    9 samples(0.33%)------^
  [1590] 0x10006904	0x000c60c0	sll	t4,t4,3 	# 88
  ^------   14 samples(0.52%)------^
  [1590] 0x10006908	0x8fad0010	lw	t5,16(sp) 	# 89
  ^------    8 samples(0.30%)------^
		<2 cycle stall for following instruction>
  [1590] 0x1000690c	0x018d6021	addu	t4,t4,t5 	# 92
  ^------    8 samples(0.30%)------^
  [1590] 0x10006910	0x000c60c0	sll	t4,t4,3 	# 93
  ^------   30 samples(1.11%)------^
  [1590] 0x10006914	0x8f8d817c	lw	t5,-32388(gp) 	# 94
  ^------   10 samples(0.37%)------^
		<2 cycle stall for following instruction>
  [1590] 0x10006918	0x018d6021	addu	t4,t4,t5 	# 97
  ^------    8 samples(0.30%)------^
  [1590] 0x1000691c	0xd5820000	ldc1	$f2,0(t4) 	# 98
  ^------   28 samples(1.03%)------^
  [1590] 0x10006920	0x8fab000c	lw	t3,12(sp) 	# 99
  ^------    9 samples(0.33%)------^
		<2 cycle stall for following instruction>
  [1590] 0x10006924	0x256b0001	addiu	t3,t3,1 	# 102
  ^------   11 samples(0.41%)------^
  [1590] 0x10006928	0x000b5080	sll	t2,t3,2 	# 103
  ^------   25 samples(0.92%)------^
  [1590] 0x1000692c	0x014b5021	addu	t2,t2,t3 	# 104
  ^------   11 samples(0.41%)------^
  [1590] 0x10006930	0x000a50c0	sll	t2,t2,3 	# 105
  ^------    8 samples(0.30%)------^
  [1590] 0x10006934	0x8fab0010	lw	t3,16(sp) 	# 106
  ^------   11 samples(0.41%)------^
		<2 cycle stall for following instruction>
  [1590] 0x10006938	0x014b5021	addu	t2,t2,t3 	# 109
  ^------    7 samples(0.26%)------^
  [1590] 0x1000693c	0x000a50c0	sll	t2,t2,3 	# 110
  ^------   26 samples(0.96%)------^
  [1590] 0x10006940	0x8f8b817c	lw	t3,-32388(gp) 	# 111
  ^------   13 samples(0.48%)------^
		<2 cycle stall for following instruction>
  [1590] 0x10006944	0x014b5021	addu	t2,t2,t3 	# 114
  ^------    9 samples(0.33%)------^
  [1590] 0x10006948	0x254afff8	addiu	t2,t2,-8 	# 115
  ^------   26 samples(0.96%)------^
  [1590] 0x1000694c	0xd5430000	ldc1	$f3,0(t2) 	# 116
  ^------   11 samples(0.41%)------^
  [1590] 0x10006950	0x8fa9000c	lw	t1,12(sp) 	# 117
  ^------   10 samples(0.37%)------^
		<2 cycle stall for following instruction>
  [1590] 0x10006954	0x00094080	sll	t0,t1,2 	# 120
  ^------   11 samples(0.41%)------^
.
.
.

The listing shows statistics about the procedure anneal() in the file generic.c and lists the beginning and ending addresses of anneal(): <0x10006830-0x10006b3c>. The five columns display the following information:

Column

Displays

1

Line number of the instruction: [1573].

2

Beginning address of the instruction: 0x10006830.

3

Instruction in hexadecimal: 0x27bdffd0.

4

Assembler form (mnemonic) of the instruction: addiu sp,sp,-48.

5

Cycle in which the instruction executed: # 1.

Other information includes:

  • The number of times the immediately preceding branch was executed and taken (bbcounts only).

  • The total number of cycles in a basic block and the percentage of the total cycles for that basic block; the number of times the branch terminating that basic block was executed; and the number of cycles for one execution of that basic block (bbcounts only).

  • The total number of samples at an instruction (pcsamp only).

  • Any cycle stalls, that is, cycles that were wasted.

Using the -S Option

For bbcounts experiments, the -S option to prof can be used to obtain source line information. prof provides the standard report and then appends the source line information to the end of the report.

This example shows output from calling prof for a bbcounts experiment. Note that some lines are wrapped here to accommodate page width:

-------------------------------------------------------------------------
SpeedShop profile listing generated Mon Jul 17 14:45:28 2000
   prof -S generic.bbcounts.m190404
                 generic (n32): Target program
                         bbcounts: Experiment name
                         it:cu: Marching orders
               R12000 / R12010: CPU / FPU
                           128: Number of CPUs
                           400: Clock frequency (MHz.)
  Experiment notes--
          From file generic.bbcounts.m190404:
        Caliper point 0 at target begin, PID 190404
                        generic
        Caliper point 1 bgn_calc
        Caliper point 2 at exit(0)
-------------------------------------------------------------------------
Summary of bbcounts time data (bbcounts)--
                    2048886059: Total number of instructions executed
                    2552098900: Total computed cycles
                         6.380: Total computed execution time (secs.)
                         1.246: Average cycles / instruction
-------------------------------------------------------------------------
Function list, in descending order by exclusive bbcounts time
-------------------------------------------------------------------------
 [index]   excl.secs   excl.%     cum.%        cycles  instructions    calls  
function  (dso: file, line)

     [1]       6.088    95.4%     95.4%    2435240026    1956780024        1  
anneal (generic: generic.c, 1560)
     [2]       0.238     3.7%     99.1%      95000839      75000732        1  
slaveusrtime (dlslave.so: dlslave.c, 22)
     [3]       0.050     0.8%     99.9%      20000056      15000054        1  
cvttrap (generic: generic.c, 317)
     [4]       0.001     0.0%     99.9%        503138        559313     5212  
resolve_relocations (rld: rld.c, 2636)
     [5]       0.001     0.0%    100.0%        274847        282220     1255  
general_find_symbol (rld: rld.c, 2038)
     [6]       0.000     0.0%    100.0%        116756        120371        3  
fix_all_defineds (rld: rld.c, 3419)
     [7]       0.000     0.0%    100.0%        115819        145585     1270  
elfhash (rld: obj.c, 1184)
     [8]       0.000     0.0%    100.0%        102496        146324     6406  
obj_dynsym_got (rld: objfcn.c, 46)
     [9]       0.000     0.0%    100.0%         89123        116619      948  
fread (libc.so.1: fread.c, 27)
    [10]       0.000     0.0%    100.0%         74339         58123        1  
init2da (generic: generic.c, 1417)
     .
     .
     .
-----------------------
disassembly listing    
-----------------------

*DF*_generic.MIPS.stubs_1
*DF*_generic.MIPS.stubs_1@0x10001fd4-0x100023d8: <0x10001fd4-0x100023d8>
     .
     .
     .
/usr/people/n4733/demos/SpeedShop/generic/generic.c
main: <0x10002500-0x10002640>
    31 total cycles(0.00%) invoked 1 times, average 31 cycles/invocation
File '/usr/people/n4733/demos/SpeedShop/generic/generic.c':
Skipping source listing to line 91
92: void        sproctestgrandchild(void *);    /* sproc grandchild code */
93:
94: static      struct timeval  starttime;      /* starting time - first timestamp */
95: static      struct timeval  ttime;          /* last-recorded timestamp */
96: static      struct timeval  deltatime;
97:
98: int pagesize;
99:
100: main(unsigned argc, char **argv)
101: {
  [101] 0x10002500      0x27bdffd0      addiu   sp,sp,-48       # 1
  [101] 0x10002504      0xffbc0010      sd      gp,16(sp)       # 2
  [101] 0x10002508      0xffbf0008      sd      ra,8(sp)        # 3
  [101] 0x1000250c      0x3c010002      lui     at,0x2  # 3
  [101] 0x10002510      0x2421eb28      addiu   at,at,-5336     # 4
  [101] 0x10002514      0x0321e021      addu    gp,t9,at        # 5
  [101] 0x10002518      0xafa40024      sw      a0,36(sp)       # 5
  [101] 0x1000251c      0xafa5002c      sw      a1,44(sp)       # 6
102:    int     i;
103:
104:    /* initialize the timestamp */
105:    (void) gettimeofday(&starttime, NULL);
  [105] 0x10002520      0x27848198      addiu   a0,gp,-32360    # 6
  [105] 0x10002524      0x00002825      move    a1,zero         # 6
  [105] 0x10002528      0x8f998064      lw      t9,-32668(gp)   # 7
                <2 cycle stall for following instruction>
  [105] 0x1000252c      0x0320f809      jalr    ra,t9   # 8
  [105] 0x10002530      0000000000      nop     # 9
  ^---     11 total cycles(0.00%) executed     1 times, average 11 cycles.---^
106:
107:    /* set up to reap any children */
108:    (void) sigset(SIGCHLD, (SIG_PF)reapSig);
  [108] 0x10002534      0x24040012      li      a0,18   # 1
  [108] 0x10002538      0x8f858130      lw      a1,-32464(gp)   # 2
  [108] 0x1000253c      0x8f998068      lw      t9,-32664(gp)   # 3
                <2 cycle stall for following instruction>
  [108] 0x10002540      0x0320f809      jalr    ra,t9   # 4
  [108] 0x10002544      0000000000      nop     # 5
  ^---      6 total cycles(0.00%) executed     1 times, average  6 cycles.---^
109:
110:    if(argc == 1) {
  [110] 0x10002548      0x8fa20024      lw      v0,36(sp)       # 1
  [110] 0x1000254c      0x24030001      li      v1,1    # 2
                <2 cycle stall for following instruction>
  [110] 0x10002550      0x1443000c      bne     v0,v1,0x10002584        # 3
  [110] 0x10002554      0000000000      nop     # 4
         Preceding branch executed 1 times, taken 0 times.
  ^---      5 total cycles(0.00%) executed     1 times, average  5 cycles.---^
111:            Scriptstring(DEFAULT_SCRIPT);
  [111] 0x10002558      0x8f848034      lw      a0,-32716(gp)   # 1
                <1 cycle stall for following instruction>
  [111] 0x1000255c      0x24848008      addiu   a0,a0,-32760    # 2
  [111] 0x10002560      0x8f998138      lw      t9,-32456(gp)   # 2
  [111] 0x10002564      0x0c000a14      jal     0x10002850      # 2
  [111] 0x10002568      0000000000      nop     # 3
  ^---      4 total cycles(0.00%) executed     1 times, average  4 cycles.---^
112:            exit(0);
  [112] 0x1000256c      0x00002025      move    a0,zero         # 1
  [112] 0x10002570      0x8f99805c      lw      t9,-32676(gp)   # 2
                <2 cycle stall for following instruction>
  [112] 0x10002574      0x0320f809      jalr    ra,t9   # 3
  [112] 0x10002578      0000000000      nop     # 4
  ^---      5 total cycles(0.00%) executed     1 times, average  5 cycles.---^
  [112] 0x1000257c      0x10000020      b       0x10002600      # 1
  [112] 0x10002580      0000000000      nop     # 2
  ^---      0 total cycles(0.00%) executed     0 times, average  2 cycles.---^
113:    } else {
114:
115:            i = argc;
  [115] 0x10002584      0x8fa60024      lw      a2,36(sp)       # 1
  [115] 0x10002588      0xafa60000      sw      a2,0(sp)        # 2
     .
     .
     .

Using the -calipers Option

When you run prof on the output of an experiment in which you have recorded caliper points, you can use the -calipers option to specify the area of the program for which you want to generate a performance report. For example, if you record just one caliper point in the middle of your program, prof can provide a report from the beginning of the program up to the first caliper point using the following command:

% prof -calipers 0,1 executable_name.exp_type.id

The prof command can also provide a report from the caliper point to the end of the program using the following command:

% prof -calipers 1,2 executable_name.exp_type.id

If you record two caliper points (0, 1, 2, 3), prof can generate a report from the second to the third caliper point:

% prof -calipers 1,2 executable_name.exp_type.id

Using the -butterfly Option

For bbcounts, usertime, and fpe experiments, the -butterfly option to prof can be used to obtain inclusive metric information. prof provides the standard report and then appends the inclusive function counts information to the end of the report. The following example is partial output from prof, showing just the inclusive function counts report.

With inclusive cycle counting, prof prints a list of functions at the end, which are called but not defined. It also includes functions from libss; they are instrumented, but their data is normally excluded.

The prof report does not list the cycles of a procedure in the inclusive listing for the following reasons:

  • init, fini, and MIPS.stubs sections are not part of any procedure.

  • Calls to procedures that do not use a “jump and link” are not recognized as procedure calls. (This is not true for bbcounts experiments.)

  • When global procedures with the same name are executed in different DSOs, only one of them is listed.

These exceptions are listed at the end of the report.

This example shows output from calling prof for a usertime experiment:

-------------------------------------------------------------------------
SpeedShop profile listing generated Thu Feb 12 13:52:09 1998
   prof -butterfly generic.usertime.m10981
                 generic (n32): Target program
                      usertime: Experiment name
                         ut:cu: Marching orders
                 R4400 / R4000: CPU / FPU
                             1: Number of CPUs
                           175: Clock frequency (MHz.)
  Experiment notes--
          From file generic.usertime.m10981:
        Caliper point 0 at target begin, PID 10981
                        /usr/demos/SpeedShop/linpack.demos/c/generic
        Caliper point 1 at exit(0)
-------------------------------------------------------------------------
Summary of statistical callstack sampling data (usertime)--
                           809: Total Samples
                             0: Samples with incomplete traceback
                        24.270: Accumulated Time (secs.)
                          30.0: Sample interval (msecs.)
-------------------------------------------------------------------------
Function list, in descending order by exclusive time
-------------------------------------------------------------------------
 [index]  excl.secs excl.%   cum.%  incl.secs incl.%    samples  function  
(dso: file, line)

     [4]     22.770  93.8%   93.8%     22.770  93.8%        759  anneal
(generic: generic.c, 1573)
     [6]      1.020   4.2%   98.0%      1.020   4.2%         34  slaveusrtime
(dlslave.so: dlslave.c, 22)
     [9]      0.210   0.9%   98.9%      0.210   0.9%          7  cvttrap 
(generic: generic.c, 317)
    [12]      0.120   0.5%   99.4%      0.120   0.5%          4  _pm_create_special 
(libc.so.1: pm.c, 191)
    [14]      0.090   0.4%   99.8%      0.090   0.4%          3  _migr_policy_args_init 
(libc.so.1: pm.c, 398)
    [10]      0.030   0.1%   99.9%      0.180   0.7%          6  iofile
(generic: generic.c, 464)
    [11]      0.030   0.1%  100.0%      0.150   0.6%          5  _doscan_f 
(libc.so.1: inline_doscan.c, 615)
     [1]      0.000   0.0%  100.0%     24.270 100.0%        809  __start 
(generic: crt1text.s, 101)
     [2]      0.000   0.0%  100.0%     24.270 100.0%        809  main 
(generic: generic.c, 101)
     [3]      0.000   0.0%  100.0%     24.270 100.0%        809  Scriptstring 
(generic: generic.c, 184)
     [5]      0.000   0.0%  100.0%     22.770  93.8%        759  usrtime 
(generic: generic.c, 1377)
    [15]      0.000   0.0%  100.0%      0.090   0.4%          3  dirstat 
(generic: generic.c, 348)
    [16]      0.000   0.0%  100.0%      0.090   0.4%          3  _pread
(libc.so.1: preadSCI.c, 33)
    [13]      0.000   0.0%  100.0%      0.120   0.5%          4  _fullocale 
(libc.so.1: _locale.c, 77)
     [7]      0.000   0.0%  100.0%      1.020   4.2%         34  libdso
(generic: generic.c, 619)
     [8]      0.000   0.0%  100.0%      1.020   4.2%         34  dlslave_routine 
(dlslave.so: dlslave.c, 7)

-------------------------------------------------------------------------
Butterfly function list, in descending order by inclusive time
-------------------------------------------------------------------------
         attrib.% attrib.time                         incl.time  caller
(callsite) [index]
 [index]   incl.%   incl.time    self%   self-time       function [index]
                              attrib.% attrib.time    incl.time  callee
(callsite) [index]
-------------------------------------------------------------------------
     [1]   100.0%      24.270     0.0%       0.000         __start [1]
                                100.0%      24.270       24.270  main 
(@0x10001fac; generic: crt1text.s, 177) [2]
-------------------------------------------------------------------------
           100.0%      24.270                            24.270  __start 
(@0x10001fac; generic: crt1text.s, 177) [1]
     [2]   100.0%      24.270     0.0%       0.000         main [2]
                                100.0%      24.270       24.270  Scriptstring 
(@0x10002040; generic: generic.c, 111) [3]
-------------------------------------------------------------------------
           100.0%      24.270                            24.270  main 
(@0x10002040; generic: generic.c, 111) [2]
     [3]   100.0%      24.270     0.0%       0.000         Scriptstring
[3]
                                 93.8%      22.770       22.770  usrtime 
(@0x10002460; generic: generic.c, 214) [5]
                                  4.2%       1.020        1.020  libdso
(@0x10002460; generic: generic.c, 214) [7]
                                  0.9%       0.210        0.210  cvttrap 
(@0x10002460; generic: generic.c, 214) [9]
                                  0.7%       0.180        0.180  iofile
(@0x10002460; generic: generic.c, 214) [10]
                                  0.4%       0.090        0.090  dirstat 
(@0x10002460; generic: generic.c, 214) [15]
-------------------------------------------------------------------------
            93.8%      22.770                            22.770  usrtime 
(@0x10005c30; generic: generic.c, 1393) [5]
     [4]    93.8%      22.770    93.8%      22.770         anneal [4]
-------------------------------------------------------------------------
            93.8%      22.770                            24.270  Scriptstring 
(@0x10002460; generic: generic.c, 214) [3]
     [5]    93.8%      22.770     0.0%       0.000         usrtime [5]
                                 93.8%      22.770       22.770  anneal
(@0x10005c30; generic: generic.c, 1393) [4]
-------------------------------------------------------------------------
             4.2%       1.020                             1.020  dlslave_routine 
(@0x5ffe0690; dlslave.so: dlslave.c, 9) [8]
     [6]     4.2%       1.020     4.2%       1.020         slaveusrtime
[6]
-------------------------------------------------------------------------
             4.2%       1.020                            24.270  Scriptstring 
(@0x10002460; generic: generic.c, 214) [3]
     [7]     4.2%       1.020     0.0%       0.000         libdso [7]
                                  4.2%       1.020        1.020  dlslave_routine 
(@0x100032a0; generic: generic.c, 650) [8]
-------------------------------------------------------------------------
             4.2%       1.020                             1.020  libdso
(@0x100032a0; generic: generic.c, 650) [7]
     [8]     4.2%       1.020     0.0%       0.000         dlslave_routine [8]
                                  4.2%       1.020        1.020  slaveusrtime 
(@0x5ffe0690; dlslave.so: dlslave.c, 9) [6]
-------------------------------------------------------------------------
             0.9%       0.210                            24.270  Scriptstring 
(@0x10002460; generic: generic.c, 214) [3]
     [9]     0.9%       0.210     0.9%       0.210         cvttrap [9]
-------------------------------------------------------------------------
             0.7%       0.180                            24.270  Scriptstring 
(@0x10002460; generic: generic.c, 214) [3]
    [10]     0.7%       0.180     0.1%       0.030         iofile [10]
                                  0.6%       0.150        0.150  _doscan_f 
(@0x10002d48; generic: generic.c, 483) [11]
-------------------------------------------------------------------------
             0.6%       0.150                             0.180  iofile
(@0x10002d48; generic: generic.c, 483) [10]
    [11]     0.6%       0.150     0.1%       0.030         _doscan_f [11]
                                  0.5%       0.120        0.120  _fullocale 
(@0x0fadebe4; libc.so.1: inline_doscan.c, 808) [13]
-------------------------------------------------------------------------
             0.5%       0.120                             0.120  _fullocale 
(@0x0fb0b1b8; libc.so.1: _locale.c, 84) [13]
    [12]     0.5%       0.120     0.5%       0.120         _pm_create_special [12]
-------------------------------------------------------------------------
             0.5%       0.120                             0.150  _doscan_f 
(@0x0fadebe4; libc.so.1: inline_doscan.c, 808) [11]
    [13]     0.5%       0.120     0.0%       0.000         _fullocale [13]
                                  0.5%       0.120        0.120  _pm_create_special 
(@0x0fb0b1b8; libc.so.1: _locale.c, 84) [12]
-------------------------------------------------------------------------
             0.4%       0.090                             0.090  _pread
(@0x0fb05928; libc.so.1: preadSCI.c, 33) [16]
    [14]     0.4%       0.090     0.4%       0.090         _migr_policy_args_init [14]
-------------------------------------------------------------------------
             0.4%       0.090                            24.270  Scriptstring 
(@0x10002460; generic: generic.c, 214) [3]
    [15]     0.4%       0.090     0.0%       0.000         dirstat [15]
                                  0.4%       0.090        0.090  _pread
(@0x10002a5c; generic: generic.c, 381) [16]
-------------------------------------------------------------------------
             0.4%       0.090                             0.090  dirstat 
(@0x10002a5c; generic: generic.c, 381) [15]
    [16]     0.4%       0.090     0.0%       0.000         _pread [16]
                                  0.4%       0.090        0.090  _migr_policy_args_init 
(@0x0fb05928; libc.so.1: preadSCI.c, 33) [14]
-------------------------------------------------------------------------

Generating Reports for Different Machine Types

If you need to generate a report for a machine model that is different from the one on which the experiment was performed, you can use several of the prof options to specify a machine model.

For example, if you record a bbcounts experiment on an R4000 processor with a clock frequency of 100 megahertz, but you want to generate a report for an R10000 processor with a clock frequency of 196 megahertz, the prof command would be the following:

% prof -r10000 -clock 196 generic.bbcounts.m4561

Generating Reports for Multiprocessed Executables

You can gather data from executables that use the sproc(2) and sprocsp(2) system calls, such as those executables generated by POWER Fortran and POWER C. Prepare and run the job using the same method as for uniprocessed executables. For multiprocessed executables, each thread of execution writes its own separate data file. View these data files with prof.

The only difference between multiprocessed and regular executables is how the data files are named. The data files are named prog_name.experiment_type.id.

The experiment ID, id, consists of one or two letters (designating the process type) and the process ID number. See Table 1-1 for the letter codes and their meanings. This naming convention avoids the potential conflict of multiple threads attempting to write simultaneously to the same file.

Determining Program Overhead

You can determine the overhead of a parallel program by including the -overhead argument on the prof command line. The overhead information will be included at the end of the usual report.

You can get the overhead report for any experiment, but it may be the most useful for the following experiments:

  • pcsamp

  • usertime

  • bbcounts

  • The hardware counter experiments (*_hwc and *_hwctime)

Use the following steps to generate an overhead report on a system with multiple processors.

  1. Run the bbcounts experiment on the executable:

    % ssrun -bbcounts linpackd

    The ssrun command generates the following files, each from a different processor, for an 8-processor program:

    linpackd.bbcounts.m422744
    linpackd.bbcounts.p421778
    linpackd.bbcounts.p422191
    linpackd.bbcounts.p422252
    linpackd.bbcounts.p422313
    linpackd.bbcounts.p422620
    linpackd.bbcounts.p422704
    linpackd.bbcounts.p422731

  2. Combine the experiment files into one experiment file using the ssaggregate(1) command.

    % ssaggregate -e linpackd.bbcounts* -o bbcounts.8ps

  3. Execute the prof command with the -overhead option to create the data file.

    % prof -overhead bbcounts.8ps >result

The result file contains the usual bbcounts experiment output with the overhead information at the end. Note that some lines are wrapped here to accommodate page width:

-------------------------------------------------------------------------
SpeedShop profile listing generated Fri Jun 25 09:21:27 1999
   prof -overhead bbcounts.8ps
                linpackd (n32): Target program
                         bbcounts: Experiment name
                         it:cu: Marching orders
               R10000 / R10010: CPU / FPU
                            16: Number of CPUs
                           195: Clock frequency (MHz.)
  Experiment notes--
          From file bbcounts.8ps:
        Caliper point 0 at target begin, PID 422744
                        linpackd
        Caliper point 0 at target begin, PID 422620
                        linpackd
        Caliper point 0 at target begin, PID 422731
                        linpackd
        Caliper point 0 at target begin, PID 422704
                        linpackd
        Caliper point 0 at target begin, PID 422252
                        linpackd
        Caliper point 0 at target begin, PID 421778
                        linpackd
        Caliper point 0 at target begin, PID 422191
                        linpackd
        Caliper point 0 at target begin, PID 422313
                        linpackd
        Caliper point 1 at exit(0)
-------------------------------------------------------------------------
Summary of bbcounts time data (bbcounts)--
                   29877509668: Total number of instructions executed
                   20592366537: Total computed cycles
                       105.602: Total computed execution time (secs.)
                         0.689: Average cycles / instruction
-------------------------------------------------------------------------
Function list, in descending order by exclusive bbcounts time
-------------------------------------------------------------------------
 [index]   excl.secs   excl.%     cum.%        cycles  instructions    calls  
function  (dso: file, line)

     [1]      72.955    69.1%     69.1%   14226219437   24895879414      140  
__mp_slave_wait_for_work (libmp.so: mp_parallel_do.s, 593)
     [2]      30.344    28.7%     97.8%    5917081268    4669997342   772633  
daxpy (linpackd: linpack.f, 495)

.
.
.
-------------------------------------------------------------------------
OpenMP Report
-------------------------------------------------------------------------
                Parallelization Overhead: 00.000%
                          Load Imbalance: 00.076%
                Insufficient Parallelism: 69.085%
                            Barrier Loss: 00.002%
                    Synchronization Loss: 00.000%
           Other Model-specific Overhead: 00.000%

The parallel model used in the program was OpenMP, although other parallel models (such as MPI and pthreads) are supported. The categories for which information is returned vary depending on the model. The OpenMP categories have the following meanings:

  • Parallelization Overhead: the percentage of the program's time spent doing work necessary to a parallel program, such as distributing loop iterations and data among the processors. The percentage is negligible for this program.

  • Load Imbalance: the percentage of a program's time lost because work was not spread evenly across the processors. This number would be 0 if each processor had exactly the same amount of work.

  • Insufficient Parallelism: the percentage of a program's time in which the processors were not working in parallel. The number returned for this program tells us that about two-thirds of the program time was not parallelized.

  • Barrier Loss: the percentage of the program's time used by the barrier process. This is not the time processors spent waiting at barriers.

  • Synchronization Loss: the percentage of the program's time used by the other synchronization processes.

  • Other Model-specific Overhead: the percentage of a program's time spent in other OpenMP (in this case) processes.

The same aggregated experiment file created above can be used by the cvperf(1) command to display overhead information in its own window. For an example, see the ProDev WorkShop: Performance Analyzer User's Guide.

Generating Compiler Feedback Files

If you run a bbcounts experiment, run prof with the -feedback option to generate a feedback file that can be used to arrange procedures more efficiently on the next compilation. You can rearrange procedures using the -fb option on compiler command lines.

To reorder code regions for the cord(1) command, use the -cordfb or -cordfball option to prof.

For more information, see your compiler man page or the cord(1) man page.

Comparing Experiment Results

After running experiments, you can compare experiment results by using the sscompare command. This command can be used to analyze performance data generated by ssrun and produce a comparison report. This can be particularly useful when comparing the effects of different optimization techniques, for example, or when comparing different experiments for the same application.

The sscompare command has the following syntax:

sscompare [-by type] [-individual] [-left caliper] [-output format]
[-path pathname] [-percentages] [-precision digits] [rough]
[-right caliper] [statistics] [-ut_exclusive] [-ut_inclusive]
[-width characters]

The comparison report contains a legend and a table of performance data. Each input file and the type of performance data it contains is listed in the legend with a numeric column key. The table contains multiple columns of data. There is one column for each experiment file (if the individual option is used) or one for each statistic (if the statistics option was used).

The following example demonstrates this command's use. In this example, a SpeedShop PC sampling experiment is run on the OpenMP implementation of the NAS Conjugate Gradient Parallel Benchmark using four threads. The resulting experiment files are then compared side-by-side using sscompare:

% setenv OMP_NUM_THREADS 4
% ssrun -fpcsampx cg.A

 NAS Parallel Benchmarks 2.3 OpenMP C version - CG Benchmark
 Size:      14000
 Iterations:    15

           ...
           ...
           ...

% sscompare -by function -individual -percentages cg.A.fpcsampx.*

   1: Exclusive `PC sampling' time for cg.A.fpcsampx.T0.m2675229
   2: Exclusive `PC sampling' time for cg.A.fpcsampx.T1.p2676575
   3: Exclusive `PC sampling' time for cg.A.fpcsampx.T2.p2678981
   4: Exclusive `PC sampling' time for cg.A.fpcsampx.T3.p2681324

      1         2         3         4

  63.973%   74.819%   67.227%   63.820%    conj_grad (cg.A: cg.c, 356)
   0.000%   22.881%   23.754%   24.255%    __mp_slave_wait_for_work (libmp.so: mp_parallel_do.s, 593)
  10.758%    1.608%    7.913%   10.943%    __mp_barrier_nthreads (libmp.so: mp_barrier.c, 271)
  16.161%    0.000%    0.000%    0.000%    sparse (cg.A: cg.c, 709)

                  ...
                  ...
                  ...
                  ...

sscompare supports the following SpeedShop experiment types:

  • usertime

  • totaltime

  • pcsamp

  • bbcounts