Chapter 8. Using SpeedShop in Expert Mode: pixie

This chapter provides information on how to run pixie and prof without invoking ssrun. By calling pixie directly, you can generate the following performance data:

For more information on basic block counting and inclusive basic block counting, see Chapter 7, "Analyzing Experiment Results: prof."

This chapter contains the following sections:

Using pixie

Use pixie to measure the frequency of code execution. pixie reads an executable program, partitions it into basic blocks, and writes (instruments) an equivalent program containing additional code that counts the execution of each basic block.

Note that the execution time of an instrumented program is two to five times longer than that of an uninstrumented one. This timing change may alter the behavior of a program that deals with a graphical user interface (GUI), or depends on events such as SIGALRM that are based on an external clock.

pixie Syntax

The syntax for pixie is

pixie prog_name [options]

prog_name 

Name of the input program.

options 

Zero or more of the keywords listed in Table 8-1.

pixie Options

Table 8-1 lists pixie options. For a complete list of options, view the pixie reference page.

Table 8-1. Options for pixie

Name

Result

-addlibslib1.s0:...libN.so

Adds lib1.s0:...libN.so to the library list of the executable. No libraries are added by default.

-copy

Produces a copy of the target with function list (map) and arc list (graph) sections but does not instrument the target.

-counts_file file

Specifies the name to be used for the output .Counts file. By default, .Counts is appended to the original program name.

-dso

Treats executable as an o32 DSO. Performs a search of standard o32 library directories. A .pix32 extension is used.

-dso32

Treats executable as an n32 DSO. Performs a search of standard n32 library directories. A .pixn32 extension is used.

-dso64

Treats executable as an n64 DSO. Performs a search of standard n64 library directories. A .pix64 extension is used.

-directory dir_name

Writes output files to dir_name. Files are written to the current directory by default.

-fcncounts

Produces an instrumented executable that counts function calls and arc calls, but not basic-block or branch counts.

-idtrace_file number

Specifies a UNIX® file descriptor number for the trace output file. Default is 19.

[no]autopixieo32

Permits or prevents a recursive instrumenting of all dynamic shared libraries used by the input file during run time. pixie keeps the timestamp and checksum from the original executable. Thus, before instrumenting a shared library, pixie checks any pixified files that it finds matching the lib it is to instrument. If the fields match, they are not instrumented. pixie cannot detect shared libraries opened with dlopen() (and hence does not instrument them). All used DSOs need to be instrumented for the pixifiedexecutable to work. The default behavior with shared libraries is –noautopixie. The default behavior with an executable is –autopixie.

-[no]idtrace

[Disables] or enables tracing of both instruction and data memory. Default is -noidtrace.

-[no]ittrace

[Disables] or enables tracing of instruction memory references. Default is -noittrace.

-[no]longbranch

During instrumentation, some transformations can push a branch offset beyond its legal range and pixie generates warnings about branch offsets being out of range. This option causes pixie to transform these instructions into jumps. The default is -nolongbranch.

[no]verbose

Prints or suppresses messages summarizing the binary-to-binary translation process. The default is –noverbose.

-pixie_filename

Specify the name of the pixiefied executable.

-suffix.suffix

Appends .suffix to the pixified executable and DSOs. The default suffix is .pixie.


pixie Output

The pixie command generates a set of files with a .pixie extension. These files are essentially copies of your original executable and any DSOs you specified in the call to pixie with code inserted to enable the collection of performance data when the .pixie version of your program is run.

If you use the -verbose flag with pixie, it reports the size of the old and new code. The new code size is the size of the code pixie will actually execute. It does not count read-only data (including a copy of the original text and another data block the same size as the original text) put into the text section. Calling size on the .pixie file reports a much larger text size than pixie -verbose, because size also counts everything in the text segment.

When you run the .pixie version of your program, one or more .Counts files are generated. The name of an output .Counts file is that of the original program with any leading directory names removed and .Counts appended. If the program executes calls to sproc(), sprocsp() or fork(), multiple .Counts files are generated—one for each process in the share group. In this case, each file will have the process ID appended to its name.

Obtaining Basic Block Counts

Use this procedure to obtain basic block counts. Also refer to Figure 8-1, which illustrates how basic block counting works.

  1. Compile and link your program. The following example uses the input file myprog.c:

    % cc -o myprog myprog.c

    The cc compiler compiles myprog.c into an executable called myprog.

  2. Run pixie to generate the equivalent program containing basic-block-counting code.

    % pixie myprog

    pixie takes myprog and writes an equivalent program, myprog.pixie, containing additional code that counts the execution of each basic block. pixie also writes an equivalent program for each shared object used by the program (in the form: libname.so.pix*), containing additional code that counts the execution of each basic block. For example, if myprog uses libc.so.1, pixie generates libc.so.1.pix*. (The value of * depends on the ABI).

  3. Set the path for your .pixie files. pixie uses the rld search path for libraries (see rld(1) for the default paths). If the .pixie files are in your local directory, set the path as

    % setenv LD_LIBRARY_PATH .

  4. Execute the file(s) generated by pixie (myprog.pixie) in the same way you executed the original program:

    % myprog.pixie

    This program generates a list of basic block counts in files named myprog.Counts. If the program executes fork() or sproc(), a process ID is appended to the end of the filename (for example, myprog.Counts.345) for each process.


    Note: Your program may not run as you expect when you invoke it with a .pixie extension. Some programs, uncompress and vi for example, treat their arguments differently when the name of the program changes. You may need to rename the .pixie version of your program back to its original name.

    To generate a valid .Counts file, your program must terminate normally or with a call to exit(). If it terminates with a signal such as SIGINT, the program must use a signal handler and leave the program through exit().

  5. Run the profile formatting program prof specifying the name of the original program and the .Counts file for the program:

    % prof myprog myprog.Counts

    prof extracts information from myprog.Counts and prints it in an easily readable format. If multiple .Counts files exist, you can use the wildcard character (*) to specify all of the files.

    % prof myprog myprog.Counts*

You can run the program several times, altering the input data, to create multiple profile data files.

The time computation assumes a "best case" execution; actual execution takes longer. This is because the time includes predicted stalls within a basic block, but not actual stalls that may occur entering a basic block. It also assumes that all instructions and data are in cache, that is, it excludes the delays due to cache misses and memory fetches and stores.

The complete output of the –pixie option is often extremely large. Use the –quit option with prof to restrict the size of the report. Refer to Chapter 7, "Analyzing Experiment Results: prof," for details about prof options.

Figure 8-1. How Basic Block Counting Works

Figure 8-1 How Basic Block Counting Works

Examples of Basic Block Counting

The examples in this section illustrate how to use prof to obtain basic block counting information from a C program, generic.

Example Using prof invocations

The partial listing below illustrates the report generated for basic block counts in generic. prof first provides a standard report of basic block counts, then provides a report reflecting any options provided to prof.

% prof -i generic generic.Counts

Prof run at: Fri May 17 12:39:22 1996
 Command line: prof -i generic generic.Counts 
 
         2662778530: Total number of cycles
          17.75186s: Total execution time
         1875323864: Total number of instructions executed
              1.420: Ratio of cycles / instruction
                150: Clock rate in MHz
              R4000: Target processor modelled
 
---------------------------------------------------------
Procedures sorted in descending order of cycles executed.
Unexecuted procedures are not listed. Procedures
beginning with *DF* are dummy functions and represent
init, fini and stub sections.
---------------------------------------------------------
 
cycles(%)          cum %   secs   instrns    calls     procedure(dso:file)
 
2524610038(94.81)  94.81   16.83  1797940023   1 anneal(generic:/usr/demos/
          SpeedShop/generic/generic.c)
135001332( 5.07)   99.88    0.90  75000822     1 slaveusrtime(./dlslave.so:/usr/demos/
          SpeedShop/generic/dlslave.c)
1593518( 0.06)     99.94    0.01  1378788      4385 memcpy(/usr/lib32/libc.so.1:
          /work/irix/lib/libc/libc_n32_M3/strings/bcopy.s)
735797( 0.03)      99.97    0.00  506627       4123 fread(/usr/lib32/libc.so.1:
          /work/irix/lib/libc/libc_n32_M3/stdio/fread.c)
187200( 0.01)      99.98    0.00  124800       1600 next(/usr/lib32/libc.so.1:
          /work/irix/lib/libc/libc_n32_M3/math/drand48.c)
136116( 0.01)      99.98    0.00  82498        1 iofile(generic:
          /usr/demos/SpeedShop/generic/generic.c)
91200( 0.00)      99.98     0.00  62400  1600 _drand48(/usr/lib32/libc.so.1:
          /work/irix/lib/libc/libc_n32_M3/math/drand48.c)
...

  • The cycles(%) column reports the number and percentage of machine cycles used for the procedure. For example, 2524610038 cycles, or 94.81% of cycles were spent in the anneal() procedure.

  • The cum% column shows the cumulative percentage of calls. For example, 99.88% of all calls were spent between the top two functions in the listing: anneal() and slaveusrtime().

  • The secs column shows the number of seconds spent in each procedure. For example, 16.83 seconds were spent in the anneal() procedure. The time represents an idealized computation based on modelling the machine. It ignores potential floating point interlocks and memory latency time (cache misses and memory bus contention).

  • The instrns column shows the number of instructions executed for a procedure. For example, there were 1797940023 instructions devoted to the anneal() procedure.

  • The calls column reports the number of calls to each procedure. For example, there was just one call to the anneal() procedure.

  • The procedure (dso:file) column lists the procedure, its DSO name and filename. For example, the first line reports statistics for the procedure anneal() in the file generic.c in the generic executable.

The partial listing below illustrates the use of the –i[nvocations] option. For each procedure, prof reports the number of times it was invoked from each of its possible callers and lists the procedure(s) that called it.

 
---------------------------------------------------------
Procedures sorted in descending order of times invoked.  
Unexecuted procedures are not listed.
---------------------------------------------------------
 
Total number of procedure invocations: 15114
calls(%)      cum%   size(bytes) procedure (dso:file)
 
4385(29.01)   29.01  3416        memcpy (/usr/lib32/libc.so.1:
          /work/irix/lib/libc/libc_n32_M3/strings/bcopy.s)

4123(27.28)   56.29  1304        fread (/usr/lib32/libc.so.1:
          /work/irix/lib/libc/libc_n32_M3/stdio/fread.c)
1600(10.59)   66.88  312         next (/usr/lib32/libc.so.1:
          /work/irix/lib/libc/libc_n32_M3/math/drand48.c)
1600(10.59)   77.46  180         _drand48 (/usr/lib32/libc.so.1:
          /work/irix/lib/libc/libc_n32_M3/math/drand48.c)
628( 4.16)    81.62  368         __sinf (/usr/lib32/libm.so:
          /work/cmplrs/libm/fsin.c)
259( 1.71)    83.33  524         __filbuf (/usr/lib32/libc.so.1:
          /work/irix/lib/libc/libc_n32_M3/stdio/_filbuf.c)

The above listing shows the total procedure invocations (calls) during the run: 12113082.

  • The calls(%) column reports the number of calls (and the percentage of total calls) per procedure. For example, there were 4385 calls (or 29.01% of the total) spent in the procedure memcpy().

  • The cum% column shows the cumulative percentage of calls. For example, 56.29% of all calls were spent between memcpy() and fread().

  • The size(bytes) column shows the total byte size of a procedure. For example, the procedure memcpy() is 3416 bytes.

  • The procedure (dso:file) column lists the procedure, its DSO name and its filename. For example, the first line reports statistics for the procedure memcpy() in the file bcopy.s in libc.so.

Example Using prof heavy

The following partial listing shows the source code lines responsible for the largest portion of execution time produced with the –heavy option.

% prof -heavy generic generic.Counts

The partial listing below shows basic block counts sorted in descending order of cycles used. The fields in the report are described in section "ideal Experiment Reports" section in Chapter 7, "Analyzing Experiment Results: prof."

---------------------------------------------------------
Lines listed in descending order of cycle counts.        
---------------------------------------------------------
cycles(%)           cum %  times     line procedure (dso:file)
2309934120(86.75%)  86.75% 14440000  1465  anneal (generic:/usr/demos/
          SpeedShop/generic/generic.c)
207945880( 7.81%)   94.56% 14440000  1464  anneal (generic:/usr/demos/
          SpeedShop/generic/generic.c)
81000506( 3.04%)  97.60% 5000000       29  slaveusrtime (dlslave.so:/usr/demos/
          SpeedShop/generic/dlslave.c)
54000000( 2.03%)  99.63% 5000000       30  slaveusrtime (dlslave.so:/usr/demos/
          SpeedShop/generic/dlslave.c)
6600000( 0.25%)  99.88% 380000       1463  anneal (generic:/usr/demos/
          SpeedShop/generic/generic.c)
418380( 0.02%)   99.89%  32981        493  memcpy (/usr/lib32/libc.so.1:
          /work/irix/lib/libc/libc_n32_M3/strings/bcopy.s)
418380( 0.02%)  99.91%   32981        494  memcpy (/usr/lib32/libc.so.1:
          /work/irix/lib/libc/libc_n32_M3/strings/bcopy.s)
139482( 0.01%)  99.91%   32981        496  memcpy (/usr/lib32/libc.so.1:
          /work/irix/lib/libc/libc_n32_M3/strings/bcopy.s)
139460( 0.01%)  99.92%   32981        495  memcpy (/usr/lib32/libc.so.1:
          /work/irix/lib/libc/libc_n32_M3/strings/bcopy.s)
130009( 0.00%)  99.92%   10000       1461  anneal (generic:/usr/demos/
          SpeedShop/generic/generic.c)

Example Using prof quit

You can limit the output of prof to collect information on only the most time-consuming parts of the program by specifying the –quit option. You can instruct prof to quit after a particular number of lines of output, after listing the elements consuming more than a certain percentage of the total, or after the portion of each listing whose cumulative use is a certain amount.

Consider the following sample listing:

% prof -quit 4 generic generic.Counts

 Prof run at: Fri May 17 14:09:12 1996
 Command line: prof -quit 4 generic generic.Counts 
 
         2662778530: Total number of cycles
          17.75186s: Total execution time
         1875323864: Total number of instructions executed
              1.420: Ratio of cycles / instruction
                150: Clock rate in MHz
              R4000: Target processor modelled
---------------------------------------------------------
Procedures sorted in descending order of cycles executed.
Unexecuted procedures are not listed. Procedures
beginning with *DF* are dummy functions and represent
init, fini and stub sections.
---------------------------------------------------------
cycles(%)          cum %     secs    instrns        calls     procedure(dso:file)
 
2524610038(94.81)  94.81     16.83   1797940023     1 anneal(generic:/usr/demos/
          SpeedShop/generic/generic.c)
135001332( 5.07)   99.88     0.90    75000822       1 slaveusrtime(./dlslave.so:
          /usr/demos/SpeedShop/generic/dlslave.c)   
1593518( 0.06)     99.94     0.01    1378788        4385 memcpy(/usr/lib32/
          libc.so.1:/work/irix/lib/libc/libc_n32_M3/strings/bcopy.s)
735797( 0.03)      99.97     0.00     506627        4123 fread(/usr/lib32/
          libc.so.1:/work/irix/lib/libc/libc_n32_M3/stdio/fread.c)

Obtaining Inclusive Basic Block Counts

Inclusive basic block counting counts basic blocks and generates a call graph. By propagating regular counts to callers of a routine, prof provides inclusive basic block counting. For more information on inclusive basic block counting, see the "ideal Experiment" section in Chapter 4, "Experiment Types."

To see inclusive data, run the profile formatting program prof specifying the name of the original program, the -gprof flag, and the .Counts file for the program.

% prof -gprof myprog myprog.Counts

prof extracts information from myprog.Counts and prints it in an easily readable format. If multiple .Counts files exist, you can use the wildcard character (*) to specify all of the files.

% prof -gprof myprog myprog.Counts*

Example of prof –gprof

This section contains part of a sample output obtained by using the –gprof option. The fields in the report are explained in detail in the report, but are not provided in this example. For more information on the -gprof option, see Chapter 7, "Analyzing Experiment Results: prof." (The format of the output has been adjusted slightly.)

% prof -gprof generic generic.Counts

...
Prof run at: Fri May 17 14:42:25 1996
 Command line: prof -gprof generic generic.Counts 
...
                          self           kids        called/total    parents
index           cycles(%) self(%)        kids(%)  called+self   name     index
                          self           kids        called/total    children
 
[1] 2662767961 (100.00%)  71( 0.00%) 2662767890(100.00%) 0  __start [1]
                          44         2662767837       1/1      main [2]
                          5                  0        1/1      __istart[111]
                          4                  0        1/1   __readenv_sigfpe[112]
--------------------------------------------------------------------------------

                          44         2662767837         1/1  __start [1]
[2] 2662767881(100.00%)   44( 0.00%) 2662767837(100.00%) 1      main [2]
                          2152       2662764245         1/1       Scriptstring[3]
                          67         926                1/1       exit [58]
                          96         309                1/1       _sigset [67]
                          32         10                 1/9     _gettimeofday[68]
-----------------------------------------------------------------------------
...