Chapter 3. Tutorial for Fortran Users

This chapter presents a tutorial for using the SpeedShop tools to gather and analyze performance data in a Fortran program, and covers these topics:


Note: Because of inherent differences between systems and also due to concurrent processes that may be running on your system, your experiment will produce different results from the one in this tutorial. However, the basic form of the results should be the same.


Tutorial Overview

This tutorial is based on a standard benchmark program called linpackup. There are three versions of the program: the linpack directory contains files for the n32-bit ABI, the linpack64 directory contains files for the 64-bit ABI and the linpacko32 directory contains files for the 32-bit ABI. Each linpack directory contains versions of the program for a single processor (linpackup) and for multiple processors (linpackd). When you work with the tutorial, choose the version of the program that is most appropriate for your system. A good guideline is to choose whichever version corresponds to the way you expect to develop your programs.

This tutorial was written and tested using the single-processor version of the program (linpackup) in the linpack directory.

The linpack program is a standard benchmark designed to measure CPU performance in solving dense linear equations. The program focuses primarily on floating point performance.

Output From the linpackup Program

Output from the linpackup program looks like the following:

...
norm. resid      resid           machep         x(1)          x(n)
  5.35882395E+00  7.13873405E-13  2.22044605E-16  1.00000000E+00  1.00000000E+00
 
times are reported for matrices of order   300
      dgefa      dgesl      total     mflops       unit      ratio
 times for array with leading dimension of 301
  1.180E+00  1.000E-02  1.190E+00  1.528E+01  1.309E-01  2.125E+01
  1.180E+00  1.000E-02  1.190E+00  1.528E+01  1.309E-01  2.125E+01
  1.180E+00  1.000E-02  1.190E+00  1.528E+01  1.309E-01  2.125E+01
  1.180E+00  1.000E-02  1.190E+00  1.528E+01  1.309E-01  2.125E+01
 
 times for array with leading dimension of 300
  1.180E+00  1.000E-02  1.190E+00  1.528E+01  1.309E-01  2.125E+01
  1.180E+00  2.000E-02  1.200E+00  1.515E+01  1.320E-01  2.143E+01
  1.180E+00  1.000E-02  1.190E+00  1.528E+01  1.309E-01  2.125E+01
  1.181E+00  1.200E-02  1.193E+00  1.524E+01  1.312E-01  2.130E+01

Experiments Performed in This Tutorial

The tutorial shows you how to perform the following experiments using ssrun, and how to interpret experiment-specific reports generated by prof:

  • usertime

  • pcsamp

  • dsc_hwc

  • ideal

Tutorial Setup

Copy the program to a directory where you have write permission and compile it so that you can use it in the tutorial.

  1. Change to the /usr/demos/SpeedShop directory.

  2. Copy the appropriate linpack directory and its contents to a directory where you have write permission:

    cp -r linpack_version your_dir

  3. Change to the directory you just created:

    cd your_dir/linpack_version

  4. Compile the program by entering

    make all

    This provides an executable for the experiment.

Analyzing Performance Data

This section provides steps on how to run the following experiments on the linpackup program and generate and interpret the results:

You can follow the tutorial from start to finish, or you can follow steps for just the experiment(s) you want.

usertime Experiment

This section takes you through the steps to perform a usertime experiment. The usertime experiment allows you to gather data on the amount of user time spent in each function in your program. For more information on usertime, see the "usertime Experiment" section in Chapter 4, "Experiment Types."

Performing a usertime Experiment

From the command line, enter

ssrun -v -usertime linpackup

This starts the experiment. The -v flag tells ssrun to print a log to stderr.

Output from linpackup and from ssrun is printed to stdout as shown in the example below. A data file is also generated. The name consists of the process name (linpackup), the experiment type, usertime, and the experiment ID. In this example, the filename is linpackup.usertime.m17566.

ssrun: target PID 17566
ssrun: setenv _SPEEDSHOP_MARCHING_ORDERS ut:cu
ssrun: setenv _SPEEDSHOP_EXPERIMENT_TYPE usertime
ssrun: setenv _SPEEDSHOP_TARGET_FILE linpackup
ssrun: setenv _RLD_LIST libss.so:libssrt.so:DEFAULT
 Please send the results of this run to:
 
 Jack J. Dongarra
 Mathematics and Computer Science Division
 Argonne National Laboratory
 Argonne, Illinois 60439
 
 Telephone: 312-972-7246
 
 ARPAnet: DONGARRA@ANL-MCS
 
     norm. resid      resid           machep         x(1)          x(n)
  5.35882395E+00  7.13873405E-13  2.22044605E-16  1.00000000E+00  1.00000000E+00
 
    times are reported for matrices of order   300
      dgefa      dgesl      total     mflops       unit      ratio
 times for array with leading dimension of 301
  3.050E+00  3.000E-02  3.080E+00  5.903E+00  3.388E-01  5.500E+01
  3.030E+00  3.000E-02  3.060E+00  5.941E+00  3.366E-01  5.464E+01
  3.030E+00  3.000E-02  3.060E+00  5.941E+00  3.366E-01  5.464E+01
  3.030E+00  3.000E-02  3.060E+00  5.941E+00  3.366E-01  5.464E+01
 
 times for array with leading dimension of 300
  3.030E+00  3.000E-02  3.060E+00  5.941E+00  3.366E-01  5.464E+01
  3.040E+00  3.000E-02  3.070E+00  5.922E+00  3.377E-01  5.482E+01
  3.040E+00  3.000E-02  3.070E+00  5.922E+00  3.377E-01  5.482E+01
  3.034E+00  3.000E-02  3.064E+00  5.933E+00  3.371E-01  5.471E+01

Generating a Report

To generate a report on the data collected, enter the following at the command line:

prof your_output_file_name > usertime.results

prof interprets the type of experiment you have performed and prints results to stdout. The report below shows partial prof output:

-----------------------------------------------------------------------
Profile listing generated Mon Nov 18 11:39:36 1996
    with:  prof linpackup.usertime.m17566 
-----------------------------------------------------------------------
 
Total Time (secs)     : 115.11
Total Samples         : 3837
Stack backtrace failed: 2
Sample interval (ms)  : 30
CPU                   : R4600
FPU                   : R4600
Clock                 : 100.0MHz
Number of CPUs        : 1
 
-----------------------------------------------------------------------
 
index  %Samples  self descendents  total           name
[1]     99.9%    0.00      115.05    3835          __start
[2]     99.9%    0.00      115.05    3835          main
[3]     99.9%    0.00      115.05    3835          linp
[4]     94.9%    3.12      106.17    3643          dgefa
[5]     92.7%  106.71        0.00    3557          daxpy
[6]      3.8%    4.35        0.00     145          matgen
[7]      0.9%    0.00        1.08      36          dgesl
[8]      0.2%    0.27        0.00       9          dscal
[9]      0.2%    0.27        0.00       9          idamax
[10]     0.2%    0.00        0.18       6          s_wsfe64
[11]     0.2%    0.00        0.18       6          s_wsfe_com
[12]     0.2%    0.00        0.18       6          wsfe
[13]     0.1%    0.03        0.09       4          f_init
[14]     0.1%    0.00        0.06       2          f77canseek
[15]     0.1%    0.03        0.03       2          _isatty
[16]     0.1%    0.06        0.00       2          dmxpy
[17]     0.1%    0.03        0.03       2          s_stop
[18]     0.0%    0.03        0.00       1          _mips2_test_and_set
[19]     0.0%    0.00        0.03       1          _ftell64
[20]     0.0%    0.00        0.03       1          memset
[21]     0.0%    0.03        0.00       1          _blk_init
[22]     0.0%    0.03        0.00       1          __oserror
[23]     0.0%    0.03        0.00       1          c_sfe
[24]     0.0%    0.00        0.03       1          do_ud
[25]     0.0%    0.00        0.03       1          check_buflen
[26]     0.0%    0.00        0.03       1          _malloc
[27]     0.0%    0.00        0.03       1          __malloc
[28]     0.0%    0.03        0.00       1          _morecore
[29]     0.0%    0.00        0.03       1          pars_f
[30]     0.0%    0.00        0.03       1          f_s
[31]     0.0%    0.00        0.03       1          f_list
[32]     0.0%    0.00        0.03       1          i_tem
[33]     0.0%    0.00        0.03       1          ne_d
[34]     0.0%    0.03        0.00       1          op_gen
[35]     0.0%    0.00        0.03       1          do_fioxr8v
[36]     0.0%    0.00        0.03       1          do_fio64_mp
[37]     0.0%    0.03        0.00       1          w_ed
[38]     0.0%    0.03        0.00       1          f_exit

Analyzing the Report

The report shows information for each function. The function names are show in the right-hand column of the report. The remaining columns are described below:

  • The index column provides an index number for reference.

  • The %time column shows the cumulative percentage of inclusive time spent in each function and its descendents. For example, in the third row, 99.9% of the time was spent in linp and all functions listed below it.

  • The self column shows how much time, in seconds, was spent in the function itself (exclusive time). For example, less than one hundredth of a second was spent in linp, but 3.12 seconds were spent in dgefa.

  • The descendents columns shows how much time, in seconds, was spent in callees of the function. For example, in the third row, 115.05 seconds were spent in callees of linp.

  • The total column provides information on the number of cycles out of the total spent on the function.


    Note: Many functions shown here have only one or two "hits." The data for those functions is not statistically significant.


pcsamp Experiment

This section takes you through the steps to perform a pcsamp experiment. The pcsamp experiment allows you to gather information on actual CPU time for each source code line, machine line, and function in your program. For more information on pcsamp, see the "pcsamp Experiment" section in Chapter 4, "Experiment Types."

Performing a pcsamp Experiment

From the command line, enter

ssrun -pcsamp linpackup

This starts the experiment.

Output from linpackup and from ssrun is printed to stdout as shown in the example below. A data file is also generated. The name consists of the process name (linpackup), the experiment type, pcsamp, and the experiment ID. In this example, the filename is linpackup.pcsamp.m17576.

  ...
     norm. resid      resid           machep         x(1)          x(n)
  5.35882395E+00  7.13873405E-13  2.22044605E-16  1.00000000E+00  1.00000000E+00
  ...

Generating a Report

To generate a report on the data collected, enter the following at the command line:

prof your_output_file_name > pcsamp.results

-------------------------------------------------------------------------------
Profile listing generated Sun May 19 18:38:50 1996
    with:       prof linpackup.pcsamp.m17576 
-------------------------------------------------------------------------------
samples   time    CPU    FPU   Clock   N-cpu  S-interval Countsize
   5421    54s  R8000  R8010  75.0MHz   1     10.0ms     2(bytes)
 
Each sample covers 4 bytes for every 10.0ms ( 0.02% of 54.2100s)
-------------------------------------------------------------------------------
  -p[rocedures] using pc-sampling.
  Sorted in descending order by the number of samples in each procedure.
  Unexecuted procedures are excluded.
-------------------------------------------------------------------------------
samples   time(%)      cum time(%)      procedure (dso:file)
 
   5064    51s( 93.4)   51s( 93.4)          daxpy (linpackup:linpackup.f)
    240   2.4s(  4.4)   53s( 97.8)         matgen (linpackup:linpackup.f)
     76  0.76s(  1.4)   54s( 99.2)          dgefa (linpackup:linpackup.f)
     19  0.19s(  0.4)   54s( 99.6)          dscal (linpackup:linpackup.f)
     17  0.17s(  0.3)   54s( 99.9)         idamax (linpackup:linpackup.f)
      4  0.04s(  0.1)   54s(100.0)          dmxpy (linpackup:linpackup.f)
      1  0.01s(  0.0)   54s(100.0)         _ioctl (/usr/lib32/libc.so.1:
          /work/irix/lib/libc/libc_n32_M4/sys/ioctl.s)
 
   5421    54s(100.0)   54s(100.0)          TOTAL
 

Analyzing the Report

The report has the following columns:

  • The samples column shows how many samples were taken when the process was executing in the function.

  • The time(%) column shows the amount of time, and the percentage of that time over the total time that was spent in the function.

  • The cum time(%) column shows how much time has been spent in the function up to and including the procedure in the list.

  • The procedure (dso:file) column lists the procedure, its DSO name and filename. For example, the first line reports statistics for the procedure daxpy in the file linpackup.f in the linpackup executable.

Hardware Counter Experiment


Note: This experiment can be performed only on systems that have built-in hardware counters (the R10000 class of machines).

This section takes you through the steps to perform a hardware counter experiment. There are a number of hardware counter experiments, but this tutorial describes the steps involved in performing the dsc_hwc experiment. This experiment allows you to capture information about secondary data cache misses. For more information on hardware counter experiments, see the"ideal Experiment" section in Chapter 4, "Experiment Types."

Performing a hardware counter Experiment

From the command line, enter

ssrun -dsc_hwc linpackup

This starts the experiment. Output from linpackup and from ssrun will be printed to stdout. A data file is also generated. The name consists of the process name (linpackup), the experiment type, dsc_hwc, and the experiment ID. In this example, the filename is linpackup.dsc_hwc.m6180.

Generating a Report

To generate a report on the data collected, enter the following at the command line:

prof your_output_file_name > dsc_hwc.results

Output similar to the following is generated:

-------------------------------------------------------------------------------
Profile listing generated Sun May 19 18:20:14 1996
    with:       prof linpackup.dsc_hwc.m6180 
-------------------------------------------------------------------------------
        Counter               : Sec cache D misses
        Counter overflow value: 131
        Total numer of ovfls  : 2737
        CPU                   : R10000
        FPU                   : R10010
        Clock                 : 196.0MHz
        Number of CPUs        : 1
-------------------------------------------------------------------------------
  -p[rocedures] using counter overflow.
  Sorted in descending order by the number of overflows in each procedure.
  Unexecuted procedures are excluded.
-------------------------------------------------------------------------------
  overflows(%)  cum overflows(%)      procedure (dso:file)
 
   2133( 77.9)       2133( 77.9)          DAXPY (linpackup:/usr/demos/
          SpeedShop/linpack64/linpackup.f)
    307( 11.2)       2440( 89.1)         MATGEN (linpackup:/usr/demos/
          SpeedShop/linpack64/linpackup.f)
    275( 10.0)       2715( 99.2)          DGEFA (linpackup:/usr/demos/
          SpeedShop/linpack64/linpackup.f)
     11(  0.4)       2726( 99.6)         IDAMAX (linpackup:/usr/demos/
          SpeedShop/linpack64/linpackup.f)
      3(  0.1)       2729( 99.7)          DMXPY (linpackup:/usr/demos/
          SpeedShop/linpack64/linpackup.f)
      3(  0.1)       2732( 99.8)          DGESL (linpackup:/usr/demos/
          SpeedShop/linpack64/linpackup.f)
      1(  0.0)       2733( 99.9)         memset (/usr/lib64/libc.so.1:
          /work/irix/lib/libc/libc_64_M4/strings/bzero.s)
      1(  0.0)       2734( 99.9)         fflush (/usr/lib64/libc.so.1:
          /work/irix/lib/libc/libc_64_M4/stdio/flush.c)
      1(  0.0)       2735( 99.9)    _mixed_dtoa (/usr/lib64/libc.so.1:
          /work/irix/lib/libc/libc_64_M4/math/mixed_dtoa.c)
      1(  0.0)       2736(100.0)           wsfe (/usr/lib64/libftn.so:
          /work/cmplrs/libI77/wsfe.c)
      1(  0.0)       2737(100.0)         f_exit (/usr/lib64/libftn.so:
          /work/cmplrs/libI77/close.c)
 
   2737(100.0)                      TOTAL

Analyzing the Report

The report has the following columns:

  • The overflows(%) column shows the number of overflows caused by the function, and the percentage of that number over the total number of overflows in the program.

  • The cum overflows(%) column shows a cumulative number and percentage of overflows. For example, the MATGEN function shows 307 overflows, but the cumulative number of overflows is 2440.

  • The procedure (dso:file) column shows the procedure name and the DSO and filename that contain the procedure.

ideal Experiment

This section takes you through the steps to perform an ideal experiment. For more information on collecting ideal-time data, and basic block counting, see the "ideal Experiment" section in Chapter 4, "Experiment Types."

Performing an ideal Experiment

From the command line, enter

ssrun -ideal linpackup

This starts the experiment. First the executable and libraries are instrumented using pixie. This entails making copies of the libraries and executables, giving them an extension that depends on the ABI, and inserting information into the copies. The extension is .pixie for the executable, .pix32 for all 32 libraries, .pixn32 for all n32 libraries, and .pix64 for all 64 libraries.

Output from linpackup and from ssrun is printed to stdout as shown in the example below. A data file is also generated. The name consists of the process name (linpackup), the experiment type, ideal, and the experiment ID. In this example, the filename is linpackup.ideal.n17580.

Beginning libraries
        /usr/lib32/libssrt.so
        /usr/lib32/libss.so
        /usr/lib32/libftn.so
        /usr/lib32/libm.so
        /usr/lib32/libc.so.1
Ending libraries, beginning "linpackup"
...

Generating a Report

To generate a report on the data collected, enter the following at the command line:

prof your_output_file_name > ideal.results

This command redirects output to a file called ideal.results. The file should contain results that look something like the following:

Prof run at: Sun May 19 18:46:10 1996
 Command line: prof linpackup.ideal.m17580 
 
         5722510379: Total number of cycles
          76.30014s: Total execution time
         4906763725: Total number of instructions executed
              1.166: Ratio of cycles / instruction
                 75: Clock rate in MHz
              R8000: Target processor modelled
---------------------------------------------------------
Procedures sorted in descending order of cycles executed.
Unexecuted procedures are not listed. Procedures
beginning with *DF* are dummy functions and represent
init, fini and stub sections.
---------------------------------------------------------
        cycles(%)  cum %     secs    instrns    calls procedure(dso:file)
 
5404032607(94.43)  94.43    72.05 4639092022 772633 daxpy(linpackup.pixie:
          linpackup.f)
 207582228( 3.63)  98.06     2.77  157405518    18 matgen(linpackup.pixie:
          linpackup.f)
  67844858( 1.19)  99.25     0.90   72325769    17 dgefa(linpackup.pixie:
          linpackup.f)
  19920277( 0.35)  99.60     0.27   17658342  5083 dscal(linpackup.pixie:
          linpackup.f)
  18115251( 0.32)  99.91     0.24   15675343  5083 idamax(linpackup.pixie:
          linpackup.f)
   4053920( 0.07)  99.98     0.05    3605124     1 dmxpy(linpackup.pixie:
          linpackup.f)
    786709( 0.01) 100.00     0.01     695776    17 dgesl(linpackup.pixie:
          linpackup.f)
     41357( 0.00) 100.00     0.00      83826  1116 __flsbuf(./libc.so.1.pixn32:
          /work/irix/lib/libc/libc_n32_M4/stdio/_flsbuf.c)
     30294( 0.00) 100.00     0.00      29094     1 linp(linpackup.pixie:
          linpackup.f)
     17330( 0.00) 100.00     0.00      39823   867 x_putc(./libftn.so.pixn32:
          /lv7/mtibuild/nodebug/workarea/mongoose/libI77/wsfe.c)
     12294( 0.00) 100.00     0.00      25795    28 x_wEND(./libftn.so.pixn32:
          /lv7/mtibuild/nodebug/workarea/mongoose/libI77/wsfe.c)
     10620( 0.00) 100.00     0.00      14340    53 wrt_E(./libftn.so.pixn32:
          /lv7/mtibuild/nodebug/workarea/mongoose/libI77/wrtfmt.c)
      9617( 0.00) 100.00     0.00      14889    71 do_fio64_mp
          (./libftn.so.pixn32:/lv7/mtibuild/nodebug/workarea/mongoose/libI77/
          fmt.c)
      4940( 0.00) 100.00     0.00       7917   380 

Analyzing the Report

The report has the following columns:

  • The cycles(%) column reports the number and percentage of machine cycles used for the procedure. For example, 5404032607 cycles, or 94.43% of cycles were spent in the daxpy procedure.

  • The cum% column shows the cumulative percentage of cycles. For example, 98.06% of all cycles were spent between the top two functions in the listing: daxpy and matgen.

  • The secs column shows the number of seconds spent in the procedure. For example, 72.05 seconds were spent in the daxpy procedure. The time represents an idealized computation based on modelling the machine. It ignores potential floating point interlocks and memory latency time (cache misses and memory bus contention.)

  • The instrns column shows the number of instructions executed for a procedure. For example, there were 157405518 instructions devoted to the matgen procedure.

  • The calls column reports the number of calls to the procedure. For example, there were 18 calls to the matgen procedure.

  • The procedure (dso:file) column lists the procedure, its DSO name and filename. For example, the first line reports statistics for the procedure daxpy in the file linpackup.f in the linpackup executable.

This concludes the tutorial.