Chapter 6. Performance Tuning for the R8000 and R10000

This chapter outlines techniques for tuning the performance of your R8000 and R10000 applications. It contains five sections:

  • Architectural Overview

  • Software Pipelining

  • Matrix Multiply – A Tuning Case Study

  • Use of the IVDEP Directive

  • Vector Intrinsic Functions

Architectural Overview

Table 6-1 illustrates the main architectural features of the R8000 and R10000 microprocessors. Both can execute the MIPS IV instruction set and both can issue up to four instructions per cycle, of which two can be integer instructions and two can be floating point instructions. The R8000 can execute up to two madd instructions per cycle, while the R10000 can only execute one per cycle. The R8000 can issue two memory instructions per cycle, while the R10000 can issue one per cycle. On the other hand, the R10000 operates at a much faster clock frequency and is capable of out-of-order execution. This makes it a better candidate for executing programs that were not explicitly tuned for it.

Table 6-1. Architectural Features of the R8000 and R10000

Feature                                           R8000     R10000
ISA                                               MIPS IV   MIPS IV
Frequency                                         90 MHz    200 MHz
Peak MFLOPS                                       360       400
Total Number of Instructions per cycle            4         4
Number of Integer Instructions per cycle          2         2
Number of Floating Point Instructions per cycle   2         2
Number of Multiply-Add Instructions per cycle     2         1
Number of Load/Store Instructions per cycle       2         1
Out-of-order Instruction execution                No        Yes


Software Pipelining

The purpose of this section is to give you an overview of software pipelining and to answer these questions:

  • Why software pipelining delivers better performance

  • How software pipelined code looks

  • How to diagnose what happened when things go wrong

Why Software Pipelining Delivers Better Performance

To introduce the topic of software pipelining, let's consider the simple DAXPY loop (double precision a times x plus y) shown below.

DO i = 1, n
    v(i) = v(i) + X * w(i)
ENDDO
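For reference, the same kernel can be sketched in C (the function name is illustrative, not part of any MIPSpro library):

```c
#include <stddef.h>

/* DAXPY: v(i) = v(i) + X * w(i) in double precision.
   Each iteration performs two loads, one multiply-add, and one store. */
void daxpy(size_t n, double X, double *v, const double *w)
{
    for (size_t i = 0; i < n; i++)
        v[i] = v[i] + X * w[i];
}
```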

On the MIPS IV architecture, this can be coded as two load instructions followed by a madd instruction and a store. See Figure 6-1.

Figure 6-1. A Simple DAXPY Implementation

This simple schedule achieves the desired result after five cycles. Since the R8000 architecture allows up to two memory operations and two floating point operations in the same cycle, this simple example uses only one tenth of the R8000's peak megaflops. There is also a delay of three cycles before the result of the madd can be stored.

0: ldc1 ldc1 madd
1:
2:
3:
4: sdc1

Unrolling the loop by four improves the performance to one quarter of the R8000's peak megaflops.

0: ldc1    ldc1    madd
1: ldc1    ldc1    madd
2: ldc1    ldc1    madd
3: ldc1    ldc1    madd
4: sdc1
5: sdc1
6: sdc1
7: sdc1

But this schedule does not take advantage of the R8000's ability to do two stores in one cycle. The best schedule that could be achieved would look like the following:

0: ldc1    ldc1
1: ldc1    ldc1    madd    madd
2: sdc1    sdc1

It uses 1/3 of the R8000's peak megaflops. But there is still a problem with the madd-sdc1 interlock delay. Software pipelining addresses this problem.

Software pipelining allows you to mix operations from different loop iterations in each iteration of the hardware loop. Thus the store instructions (which would interlock) in the above example could be used to store the results of different iterations. This can look like the following:

L1:
0: t1 = ldc1    t2 = ldc1   t3 = madd t1 X t2
1: t4 = ldc1    t5 = ldc1   t6 = madd t4 X t5
2: sdc1 t7      sdc1 t8     beq DONE

3: t1 = ldc1    t2 = ldc1   t7 = madd t1 X t2
4: t4 = ldc1    t5 = ldc1   t8 = madd t4 X t5
5: sdc1 t3      sdc1 t6     bne L1

DONE:

The stores in this loop are storing the madd results from previous iterations. But, in general, you could mix any operations from any number of different iterations. Also, note that every loop replication completes two loop iterations in 3 cycles.
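The register rotation above can be mimicked in C. The sketch below is illustrative only (a compiler does this at the instruction level): the madd result of each iteration is held in a temporary and stored one iteration later, which is how the pipelined schedule hides the madd-to-store delay.

```c
#include <stddef.h>

/* Software-pipelined DAXPY sketch: the multiply-add result of iteration i
   is kept in a temporary ("register") and stored during iteration i+1,
   so the store never waits on the madd that produced it.  The code before
   the loop is the windup; the store after it is the winddown. */
void daxpy_swp(size_t n, double X, double *v, const double *w)
{
    if (n == 0)
        return;
    double pending = v[0] + X * w[0];      /* windup: madd for iteration 0 */
    for (size_t i = 1; i < n; i++) {
        double next = v[i] + X * w[i];     /* madd for iteration i */
        v[i - 1] = pending;                /* store for iteration i-1 */
        pending = next;
    }
    v[n - 1] = pending;                    /* winddown: final store */
}
```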

In order to properly prepare for entry into such a loop, a windup section of code is added. The windup section sets up registers for the first stores in the main loop. In order to exit the loop properly, a winddown section is added. The winddown section performs the final stores. Any preparation of registers needed for the winddown is done in the compensation section. The winddown section also prevents speculative operations.

windup:
0: t1 = ldc1    t2 = ldc1   t7 = madd t1 X t2
1: t4 = ldc1    t5 = ldc1   t8 = madd t4 X t5
L1:
0: t1 = ldc1    t2 = ldc1   t3 = madd t1 X t2
1: t4 = ldc1    t5 = ldc1   t6 = madd t4 X t5
2: sdc1 t7      sdc1 t8     beq compensation1

3: t1 = ldc1    t2 = ldc1   t7 = madd t1 X t2
4: t4 = ldc1    t5 = ldc1   t8 = madd t4 X t5
5: sdc1 t3      sdc1 t6     bne L1

winddown:
0: sdc1 t7      sdc1 t8     br ALLDONE

compensation1:
0: t7 = t3      t8 = t6
1: br winddown

ALLDONE:

Our example loop always does loads from at least 4 iterations, so if speculative operations are to be avoided, the loop must not be entered when the trip count is less than 4. The following generalizes our example into a map of a software pipelined loop:

/* Precondition for unroll by 2 */
   do i = 1, n mod 2
      original loop
   enddo

   if ( n-i < 4 ) goto simple_loop
windup:
      ...
/* fp - fence post */
      fp = fp - peel_amount 
      ...

swp replication 0:
      ...
      if ( i == fp ) goto compensation 0   
      ...
swp replication n - 1:
      ...
      if ( i != fp ) goto swp replication 0

compensation n-1:

winddown:
   ...
   goto done

compensation 0:
/* Move registers to set up winddown */
   rx = ry
   ...
   goto winddown
   ...


compensation n - 2: 
   ...
/* Move registers to set up winddown */
   rx = ry
   goto winddown

simple_loop:

   do i = i,n
      original_loop
   enddo

done:
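The control flow of this map can be sketched in C for the DAXPY example, assuming unroll by 2 and a short-trip threshold of 4 (both taken from the example above):

```c
#include <stddef.h>

/* Control-flow sketch of the software-pipelined loop map for DAXPY. */
void daxpy_mapped(size_t n, double X, double *v, const double *w)
{
    size_t i = 0;
    if (n % 2 != 0) {                  /* precondition for unroll by 2 */
        v[0] = v[0] + X * w[0];
        i = 1;
    }
    if (n - i < 4) {                   /* simple_loop: trip count too low */
        for (; i < n; i++)
            v[i] = v[i] + X * w[i];
        return;
    }
    for (; i < n; i += 2) {            /* replicated, pipelined body */
        v[i]     = v[i]     + X * w[i];
        v[i + 1] = v[i + 1] + X * w[i + 1];
    }
}
```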

In practice, software pipelining can make a huge difference. Sample benchmarks show more than 100% improvement when compiled with software pipelining. This makes it clear that in order to get the best performance on the R8000 (-mips4), you should compile your application with software pipelining (-O3 -r8000).

Software Pipelining on the R10000

Scheduling for the R10000 (-r10000) is somewhat different. First of all, since the R10000 can execute instructions out-of-order, static scheduling techniques are not as critical to its performance. On the other hand, the R10000 supports the prefetch (pref) instruction. This instruction is used to load data into the caches before it is needed, reducing memory delays for programs that access memory predictably. In the schedule below, the compiler generates prefetch instructions.
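As a rough illustration of the idea, the sketch below uses GCC's __builtin_prefetch hint to request data a fixed distance ahead of the current index. This is only an analogy: MIPSpro inserts the MIPS IV pref instruction itself under -r10000, and the prefetch distance of 16 elements chosen here is arbitrary.

```c
#include <stddef.h>

/* DAXPY with software prefetch hints.  __builtin_prefetch is a GCC/Clang
   extension; its second argument is 1 for prefetch-for-write and 0 for
   prefetch-for-read.  The distance is an illustrative choice. */
void daxpy_pref(size_t n, double X, double *v, const double *w)
{
    const size_t dist = 16;            /* prefetch distance in elements */
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n) {
            __builtin_prefetch(&v[i + dist], 1);
            __builtin_prefetch(&w[i + dist], 0);
        }
        v[i] = v[i] + X * w[i];
    }
}
```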

Since the R10000 can execute only one memory instruction and only one madd instruction per cycle, there are two open slots available for integer instructions. The R10000 has a delay of three cycles before the results of the madd can be stored. It also has delays of three cycles and one cycle before the madd can use the result of a load for multiplication and addition, respectively.

The following schedule shows four replications from a daxpy inner loop scheduled for the R10000. Notice how the operands of the madd instruction are loaded well ahead of the actual execution of the instruction itself. Notice also that the final replication contains two prefetch instructions. Use of the prefetch instruction is enabled by default with the -r10000 flag. The other replications each have a hole where the first prefetch instruction is placed. Had prefetch been turned off through the -LNO:prefetch=0 option, each replication could have been scheduled in three cycles.

L1:
0: t0 = ldc1
1: 
2: t7 = ldc1
3: sdc1 t2
4: t2 = madd t4 X t5    beq compensation0
0: t4 = ldc1
1: 
2: t3 = ldc1
3: sdc1 t2
4: t2 = madd t0 X t1    beq compensation1
0: t0 = ldc1
1: 
2: t5 = ldc1
3: sdc1 t2
4: t2 = madd t4 X t7    beq compensation2
0: t4 = ldc1
1: pref
2: t1 = ldc1
3: sdc1 t2
4: t2 = madd t0 X t3    bne L1   pref

Looking at the Code Produced by Software Pipelining

The proper way to look at the assembly code generated by software pipelining is to use the -S compiler switch. This is vastly superior to using the disassembler (dis), because the -S switch annotates the assembly code with the names of the sections described above.

The annotations also provide useful statistics about the software pipelining process as well as reasons why certain code did not pipeline. To get a summary of these annotations do the following:

%f77 -64 -S -O3 -mips4 foo.f

This creates an annotated .s file.

%grep '#<swp' foo.s

#<swpf> is printed for loops that failed to software pipeline. #<swps> is printed for statistics and other information about the loops that did software pipeline.

Example 1: Output from Using the -S Compiler Switch

%cat test.f
        program test
        real*8 a, x(100000), y(100000)
        do i = 1, 2000
           call daxpy(3.7, x, y, 100000)
        enddo
        stop
        end

        subroutine daxpy(a, x, y, nn)
        real*8 a, x(*), y(*)
        do i = 1, nn, 1
           y(i) = y(i) + a * x(i)
        enddo
        return
        end
%f77 -64 -r8000 -mips4 -O3 -S test.f
%grep swps test.s
#<swps>
#<swps> Pipelined loop line 12 steady state
#<swps>
#<swps> 25 estimated iterations before pipelining
#<swps>  4 unrollings before pipelining
#<swps>  6 cycles per 4 iterations
#<swps>  8 flops       ( 33% of peak) (madds count as 2)
#<swps>  4 flops       ( 33% of peak) (madds count as 1)
#<swps>  4 madds       ( 33% of peak)
#<swps> 12 mem refs    (100% of peak)
#<swps>  3 integer ops ( 25% of peak)
#<swps> 19 instructions( 79% of peak)
#<swps>  2 short trip threshold
#<swps>  7 integer registers used
#<swps> 14 float registers used
#<swps>

This example was compiled with scheduling for the R8000. It shows that the inner loop starting at line 12 was software pipelined. The loop was unrolled four times before pipelining. It used 6 cycles for every four loop iterations and calculated the statistics as follows:

  • If each madd counts as two floating point operations, the R8000 can do four floating point operations per cycle (two madds), so its peak for this loop is 24. Eight floating point operations are 8/24 or 33% of peak. The figure for madds is calculated likewise.

  • If each madd counts as one floating point operation, the R8000 can do two floating point operations per cycle, so its peak for this loop is 12. Four floating point operations are 4/12 or 33% of peak.

  • The R8000 can do two memory operations per cycle, so its peak for this loop is 12. Twelve memory references are 12/12 or 100% of peak.

  • The R8000 can do two integer operations per cycle, so its peak for this loop is 12. Three integer operations are 3/12 or 25% of peak.

  • The R8000 can do four instructions per cycle, so its peak for this loop is 24. Nineteen instructions are 19/24 or 79% of peak. The statistics also point out that loops of fewer than 2 iterations do not go through the software pipeline replication area but are executed in the simple_loop section shown above, and that a total of seven integer and fourteen floating point registers were used in generating the code.
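The percent-of-peak arithmetic used throughout these statistics reduces to one formula, sketched here so the numbers above can be checked:

```c
/* percent of peak = 100 * count / (units-per-cycle * cycles-in-loop),
   truncated to an integer as in the compiler's report. */
int pct_of_peak(int count, int per_cycle, int cycles)
{
    return 100 * count / (per_cycle * cycles);
}
```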

If the example had been compiled with scheduling for the R10000, the following results would have been obtained:

%f77 -64 -r10000 -mips4 -O3 -S test.f
%grep swps test.s
#<swps>
#<swps> Pipelined loop line 12 steady state
#<swps>
#<swps>    25 estimated iterations before pipelining
#<swps>     4 unrollings before pipelining
#<swps>    14 cycles per 4 iterations
#<swps>     8 flops       ( 28% of peak) (madds count as 2)
#<swps>     4 flops       ( 14% of peak) (madds count as 1)
#<swps>     4 madds       ( 28% of peak)
#<swps>    12 mem refs    ( 85% of peak)
#<swps>     3 integer ops ( 10% of peak)
#<swps>    19 instructions( 33% of peak)
#<swps>     2 short trip threshold
#<swps>     7 integer registers used
#<swps>    15 float registers used
#<swps>

The statistics are tailored to the R10000 architectural characteristics. They show that the inner loop starting at line 12 was unrolled four times before being pipelined. It used 14 cycles for every four loop iterations and the percentages were calculated as follows:

  • The R10000 can do two floating point operations per cycle (one multiply and one add), so its floating point peak for this loop is 28. If each madd instruction counts as a multiply and an add, the eight floating point operations in this loop are 8/28 or 28% of peak.

  • If each madd counts as one floating point instruction, the R10000 can do two floating point operations per cycle (one multiply and one add), so its peak for this loop is 28. Four floating point operations (four madds) are 4/28 or 14% of peak.

  • The R10000 can do one madd operation per cycle, so its peak for this loop is 14. Four madd operations are 4/14 or 28% of peak.

  • The R10000 can do one memory operation per cycle, so its peak for this loop is 14. Twelve memory references are 12/14 or 85% of peak.


Note: prefetch operations are sometimes not needed in every replication. The statistics will miss them if they are not in replication 0 and they will understate the number of memory references per cycle while overstating the number of cycles per iteration.


  • The R10000 can do two integer operations per cycle, so its peak for this loop is 28. Three integer operations are 3/28 or 10% of peak.

  • The R10000 can do four instructions per cycle, so its peak for this loop is 56. Nineteen instructions are 19/56 or 33% of peak. The statistics also point out that loops of fewer than 2 iterations do not go through the software pipeline replication area but are executed in the simple_loop section shown above, and that a total of seven integer and fifteen floating point registers were used in generating the code.

How to Diagnose What Went Wrong

When you don't get much improvement in your application's performance after compiling with software pipelining, you should ask the following questions and consider some possible answers:

  1. Did it software pipeline at all?

    Software pipelining works only on inner loops. What is more, inner loops with subroutine calls or complicated conditional branches do not software pipeline.

  2. How well did it pipeline?

    Look at statistics in the .s file.

  3. What can go wrong in code that was software pipelined?

    Your generated code may not have the operations you expected.

  4. Think about how you would hand code it.

    What operations did it need?

    Look at the loop in the .s file.

    Is it very different? Why?

    Sometimes this is human error. (Improper code, or typo.)

    Sometimes this is a compiler error.

Perhaps the compiler didn't schedule tightly. This can happen because there are unhandled recurrence divides (divides, in general, are a problem) or because there are register allocation problems (running out of registers).

Matrix Multiply – A Tuning Case Study

Matrix multiplication illustrates some of the issues in compiling for the R8000 and R10000. Consider this simple implementation:

do j = 1,n
   do i = 1,m
      do k = 1 , p
         c(i,j) = c(i,j) - a(i,k)*b(k,j)
      enddo
   enddo
enddo

As mentioned before, the R8000 is capable of issuing two madds and two memory references per cycle. This simple version of matrix multiply requires 2 loads in the inner loop and one madd. Thus at best, it can run at half of peak speed. Note, though, that the same locations are loaded on multiple iterations. By unrolling the loops, we can eliminate some of these redundant loads. Consider for example, unrolling the outer loop by 2.

do j = 1,n,2
   do i = 1,m
      do k = 1 , p
         c(i,j) = c(i,j) - a(i,k)*b(k,j)
         c(i,j+1) = c(i,j+1) - a(i,k)*b(k,j+1)
      enddo
   enddo
enddo

We now have 3 loads for two madds. Further unrolling can bring the ratio down further. On the other hand, heavily unrolled loops require many registers. Unrolling too much can lead to register spills. The loop nest optimizer (LNO) tries to balance these trade-offs.
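The effect of the unrolling can also be seen in a C sketch (illustrative only: row-major C arrays stand in for the column-major Fortran ones, and n is assumed even):

```c
#include <stddef.h>

/* Unroll-by-2 of the outer j loop: the element a(i,k) is loaded once per
   k and reused by both multiply-adds, giving 3 loads for 2 madds instead
   of 4.  c is m x n, a is m x p, b is p x n, all row-major. */
void mmul_unroll2(size_t n, size_t m, size_t p,
                  double *c, const double *a, const double *b)
{
    for (size_t j = 0; j + 1 < n; j += 2)
        for (size_t i = 0; i < m; i++)
            for (size_t k = 0; k < p; k++) {
                double aik = a[i * p + k];          /* shared load */
                c[i * n + j]     -= aik * b[k * n + j];
                c[i * n + j + 1] -= aik * b[k * n + j + 1];
            }
}
```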

Below is a good compromise for the matrix multiply example on the R8000. It is generated by LNO using the -O3 and -r8000 flags. The listing file is generated using the -FLIST:=ON option.

%f77 -64 -O3 -r8000 -FLIST:=ON mmul.f
%cat mmul.w2f.f
C ***********************************************************
C Fortran file translated from WHIRL Mon Mar 22 09:31:25 1999
C ***********************************************************

        PROGRAM MAIN
        IMPLICIT NONE
C
C       **** Variables and functions ****
C
        REAL*8 a(100_8, 100_8)
        REAL*8 b(100_8, 100_8)
        REAL*8 c(100_8, 100_8)
        INTEGER*4 j
        INTEGER*4 i
        INTEGER*4 k
C
C       **** Temporary variables ****
C
        REAL*8 mi0
        REAL*8 mi1
        REAL*8 mi2
        REAL*8 mi3
        REAL*8 mi4
        REAL*8 mi5
        REAL*8 mi6
        REAL*8 mi7
        REAL*8 mi8
        REAL*8 mi9
        REAL*8 mi10
        REAL*8 mi11
        REAL*8 c_1
        REAL*8 c_
        REAL*8 c_0
        REAL*8 c_2
        REAL*8 c_3
        REAL*8 c_4
        REAL*8 c_5
        REAL*8 c_6
        REAL*8 c_7
        REAL*8 c_8
        REAL*8 c_9
        REAL*8 c_10
        INTEGER*4 i0
        REAL*8 mi12
        REAL*8 mi13
        REAL*8 mi14
        REAL*8 mi15
        INTEGER*4 k0
        REAL*8 c_11
        REAL*8 c_12
        REAL*8 c_13
        REAL*8 c_14
C
C       **** statements ****
C
        DO j = 1, 8, 3
          DO i = 1, 20, 4
            mi0 = c(i, j)
            mi1 = c((i + 3), (j + 2))
            mi2 = c((i + 3), (j + 1))
            mi3 = c(i, (j + 1))
            mi4 = c((i + 3), j)
            mi5 = c((i + 2), (j + 2))
            mi6 = c((i + 2), (j + 1))
            mi7 = c(i, (j + 2))
            mi8 = c((i + 2), j)
            mi9 = c((i + 1), (j + 2))
            mi10 = c((i + 1), (j + 1))
            mi11 = c((i + 1), j)
            DO k = 1, 20, 1
              c_1 = mi0
              mi0 = (c_1 -(a(i, k) * b(k, j)))
              c_ = mi3
              mi3 = (c_ -(a(i, k) * b(k, (j + 1))))
              c_0 = mi7
              mi7 = (c_0 -(a(i, k) * b(k, (j + 2))))
              c_2 = mi11
              mi11 = (c_2 -(a((i + 1), k) * b(k, j)))
              c_3 = mi10
              mi10 = (c_3 -(a((i + 1), k) * b(k, (j + 1))))
              c_4 = mi9
              mi9 = (c_4 -(a((i + 1), k) * b(k, (j + 2))))
              c_5 = mi8
              mi8 = (c_5 -(a((i + 2), k) * b(k, j)))
              c_6 = mi6
              mi6 = (c_6 -(a((i + 2), k) * b(k, (j + 1))))
              c_7 = mi5
              mi5 = (c_7 -(a((i + 2), k) * b(k, (j + 2))))
              c_8 = mi4
              mi4 = (c_8 -(a((i + 3), k) * b(k, j)))
              c_9 = mi2
              mi2 = (c_9 -(a((i + 3), k) * b(k, (j + 1))))
              c_10 = mi1
              mi1 = (c_10 -(a((i + 3), k) * b(k, (j + 2))))
            END DO
            c((i + 1), j) = mi11
            c((i + 1), (j + 1)) = mi10
            c((i + 1), (j + 2)) = mi9
            c((i + 2), j) = mi8
            c(i, (j + 2)) = mi7
            c((i + 2), (j + 1)) = mi6
            c((i + 2), (j + 2)) = mi5
            c((i + 3), j) = mi4
            c(i, (j + 1)) = mi3
            c((i + 3), (j + 1)) = mi2
            c((i + 3), (j + 2)) = mi1
            c(i, j) = mi0
          END DO
        END DO
        DO i0 = 1, 20, 4
          mi12 = c(i0, 10_8)
          mi13 = c((i0 + 1), 10_8)
          mi14 = c((i0 + 3), 10_8)
          mi15 = c((i0 + 2), 10_8)
          DO k0 = 1, 20, 1
            c_11 = mi12
            mi12 = (c_11 -(a(i0, k0) * b(k0, 10_8)))
            c_12 = mi13
            mi13 = (c_12 -(a((i0 + 1), k0) * b(k0, 10_8)))
            c_13 = mi15
            mi15 = (c_13 -(a((i0 + 2), k0) * b(k0, 10_8)))
            c_14 = mi14
            mi14 = (c_14 -(a((i0 + 3), k0) * b(k0, 10_8)))
          END DO
          c((i0 + 2), 10_8) = mi15
          c((i0 + 3), 10_8) = mi14
          c((i0 + 1), 10_8) = mi13
          c(i0, 10_8) = mi12
        END DO
        WRITE(6, '(F18.10)') c(9_8, 8_8)
        STOP
        END ! MAIN

The outermost loop is unrolled by three as suggested above and the second loop is unrolled by a factor of four. Note that we have not unrolled the inner loop. The code generation phase of the compiler back end can effectively unroll inner loops automatically to eliminate redundant loads.

In optimizing for the R10000, LNO also unrolls the outermost loop by three, but unrolls the second loop by 5. LNO can also automatically tile the code to improve cache behavior. You can use the -LNO: option group flags to describe your cache characteristics to LNO. For more information, please consult the MIPSpro Compiling, Debugging and Performance Tuning Guide.

%f77 -64 -O3 -r10000 -FLIST:=ON mmul.f
%cat mmul.w2f.f

C ***********************************************************
C Fortran file translated from WHIRL Mon Mar 22 09:41:39 1999
C ***********************************************************

        PROGRAM MAIN
        IMPLICIT NONE
C
C       **** Variables and functions ****
C
        REAL*8 a(100_8, 100_8)
        REAL*8 b(100_8, 100_8)
        REAL*8 c(100_8, 100_8)
        INTEGER*4 j
        INTEGER*4 i
        INTEGER*4 k
C
C       **** Temporary variables ****
C
        REAL*8 mi0
        REAL*8 mi1
        REAL*8 mi2
        REAL*8 mi3
        REAL*8 mi4
        REAL*8 mi5
        REAL*8 mi6
        REAL*8 mi7
        REAL*8 mi8
        REAL*8 mi9
        REAL*8 mi10
        REAL*8 mi11
        REAL*8 mi12
        REAL*8 mi13
        REAL*8 mi14
        INTEGER*4 i0
        REAL*8 mi15
        REAL*8 mi16
        REAL*8 mi17
        REAL*8 mi18
        REAL*8 mi19
        INTEGER*4 k0
        REAL*8 mi20
        REAL*8 mi21
        REAL*8 mi22
        REAL*8 mi23
        REAL*8 mi24
C
C       **** statements ****
C
        DO j = 1, 8, 3
          DO i = 1, 20, 5
            mi0 = c(i, j)
            mi1 = c((i + 4), (j + 2))
            mi2 = c((i + 4), (j + 1))
            mi3 = c(i, (j + 1))
            mi4 = c((i + 4), j)
            mi5 = c((i + 3), (j + 2))
            mi6 = c((i + 3), (j + 1))
            mi7 = c(i, (j + 2))
            mi8 = c((i + 3), j)
            mi9 = c((i + 2), (j + 2))
            mi10 = c((i + 2), (j + 1))
            mi11 = c((i + 1), j)
            mi12 = c((i + 2), j)
            mi13 = c((i + 1), (j + 2))
            mi14 = c((i + 1), (j + 1))
            DO k = 1, 20, 1
              mi0 = (mi0 -(a(i, k) * b(k, j)))
              mi3 = (mi3 -(a(i, k) * b(k, (j + 1))))
              mi7 = (mi7 -(a(i, k) * b(k, (j + 2))))
              mi11 = (mi11 -(a((i + 1), k) * b(k, j)))
              mi14 = (mi14 -(a((i + 1), k) * b(k, (j + 1))))
              mi13 = (mi13 -(a((i + 1), k) * b(k, (j + 2))))
              mi12 = (mi12 -(a((i + 2), k) * b(k, j)))
              mi10 = (mi10 -(a((i + 2), k) * b(k, (j + 1))))
              mi9 = (mi9 -(a((i + 2), k) * b(k, (j + 2))))
              mi8 = (mi8 -(a((i + 3), k) * b(k, j)))
              mi6 = (mi6 -(a((i + 3), k) * b(k, (j + 1))))
              mi5 = (mi5 -(a((i + 3), k) * b(k, (j + 2))))
              mi4 = (mi4 -(a((i + 4), k) * b(k, j)))
              mi2 = (mi2 -(a((i + 4), k) * b(k, (j + 1))))
              mi1 = (mi1 -(a((i + 4), k) * b(k, (j + 2))))
            END DO
            c((i + 1), (j + 1)) = mi14
            c((i + 1), (j + 2)) = mi13
            c((i + 2), j) = mi12
            c((i + 1), j) = mi11
            c((i + 2), (j + 1)) = mi10
            c((i + 2), (j + 2)) = mi9
            c((i + 3), j) = mi8
            c(i, (j + 2)) = mi7
            c((i + 3), (j + 1)) = mi6
            c((i + 3), (j + 2)) = mi5
            c((i + 4), j) = mi4
            c(i, (j + 1)) = mi3
            c((i + 4), (j + 1)) = mi2
            c((i + 4), (j + 2)) = mi1
            c(i, j) = mi0
          END DO
        END DO
        DO i0 = 1, 20, 5
          IF(IAND(i0, 31) .EQ. 16) THEN
            mi15 = c(i0, 10_8)
            mi16 = c((i0 + 1), 10_8)
            mi17 = c((i0 + 4), 10_8)
            mi18 = c((i0 + 2), 10_8)
            mi19 = c((i0 + 3), 10_8)
            DO k0 = 1, 20, 1
              mi15 = (mi15 -(a(i0, k0) * b(k0, 10_8)))
              mi16 = (mi16 -(a((i0 + 1), k0) * b(k0, 10_8)))
              mi18 = (mi18 -(a((i0 + 2), k0) * b(k0, 10_8)))
              mi19 = (mi19 -(a((i0 + 3), k0) * b(k0, 10_8)))
              mi17 = (mi17 -(a((i0 + 4), k0) * b(k0, 10_8)))
            END DO
            c((i0 + 3), 10_8) = mi19
            c((i0 + 2), 10_8) = mi18
            c((i0 + 4), 10_8) = mi17
            c((i0 + 1), 10_8) = mi16
            c(i0, 10_8) = mi15
          ELSE
            mi20 = c(i0, 10_8)
            mi21 = c((i0 + 1), 10_8)
            mi22 = c((i0 + 4), 10_8)
            mi23 = c((i0 + 2), 10_8)
            mi24 = c((i0 + 3), 10_8)
            DO k0 = 1, 20, 1
              mi20 = (mi20 -(a(i0, k0) * b(k0, 10_8)))
              mi21 = (mi21 -(a((i0 + 1), k0) * b(k0, 10_8)))
              mi23 = (mi23 -(a((i0 + 2), k0) * b(k0, 10_8)))
              mi24 = (mi24 -(a((i0 + 3), k0) * b(k0, 10_8)))
              mi22 = (mi22 -(a((i0 + 4), k0) * b(k0, 10_8)))
            END DO
            c((i0 + 3), 10_8) = mi24
            c((i0 + 2), 10_8) = mi23
            c((i0 + 4), 10_8) = mi22
            c((i0 + 1), 10_8) = mi21
            c(i0, 10_8) = mi20
          ENDIF
        END DO
        WRITE(6, '(F18.10)') c(9_8, 8_8)
        STOP
        END ! MAIN

Use of the IVDEP Directive

The IVDEP (Ignore Vector Dependencies) directive originated in Cray Fortran. It is a Fortran or C pragma that tells the compiler to be less strict when it is deciding whether it can get parallelism between loop iterations. By default, the compilers do the safe thing: they try to prove that there is no possible conflict between two memory references. If they can prove this, then it is safe for one of the references to pass the other.

In particular, you need to be able to perform the load from iteration i+1 before the store from iteration i if you want to overlap the calculations of two consecutive iterations.

Now suppose you have a loop like:

do i = 1, n
   a(l(i)) = a(l(i)) + ...
enddo

The compiler has no way to know that

&a(l(i)) != &a(l(i+1))

without knowing something about the vector l. For example, if every element of l is 5, then

&a(l(i)) == &a(l(i+1))

for all values of i.

But you sometimes know something the compiler doesn't. Perhaps in the example above, l is a permutation vector and all its elements are unique. You'd like a way to tell the compiler to be less conservative. The IVDEP directive is a way to accomplish this.

Placed above a loop, the statement:

cdir$ ivdep

tells the compiler to assume there are no dependencies in the code.
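The safety condition is easy to state in C. In the sketch below (the function names are illustrative), the scatter-add loop carries a dependence between iterations exactly when the index vector repeats a value, so IVDEP is safe for a permutation vector and wrong otherwise:

```c
#include <stddef.h>

/* Returns 1 if the index vector l contains a repeated value, i.e. if
   a scatter-add through l has a loop-carried dependence. */
int has_repeats(const int *l, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (l[i] == l[j])
                return 1;
    return 0;
}

/* C equivalent of: do i = 1,n  a(l(i)) = a(l(i)) + x(i)  enddo */
void scatter_add(double *a, const int *l, const double *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[l[i]] = a[l[i]] + x[i];
}
```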

The MIPSpro v7.x compilers provide support for three different interpretations of the IVDEP directive, because there is no clearly established standard. Under the default interpretation, given two memory references where at least one is loop variant, the compiler will ignore any loop-carried dependences between the two references. Some examples:

do i = 1,n
  b(k) = b(k) + a(i)
enddo

Use of IVDEP will not break the dependence since b(k) is not loop variant.

do i=1,n
  a(i) = a(i-1) + 3.
enddo

Use of IVDEP does break the dependence but the compiler warns the user that it's breaking an obvious dependence.

do i=1,n
  a(b(i)) = a(b(i)) + 3.
enddo

Use of IVDEP does break the dependence.

do i = 1,n
  a(i) = b(i)
  c(i) = a(i) + 3.
enddo

Use of IVDEP does not break the dependence on a(i) since it is within an iteration.

The second interpretation of IVDEP is used if you use the -OPT:cray_ivdep=TRUE command line option. Under this interpretation, the compiler uses Cray semantics. It breaks all lexically backwards dependences. Some examples:

do i=1,n
  a(i) = a(i-1) + 3.
enddo

Use of IVDEP does break the dependence but the compiler warns the user that it's breaking an obvious dependence.

do i=1,n
  a(i) = a(i+1) + 3.
enddo

Use of IVDEP does not break the dependence since the dependence is from the load to the store, and the load comes lexically before the store.

The third interpretation of IVDEP is used if you use the -OPT:liberal_ivdep=TRUE command line option. Under this interpretation, the compiler will break all dependences.


Note: IVDEP IS DANGEROUS! If your code really isn't free of vector dependences, you may be telling the compiler to perform an illegal transformation which will cause your program to get wrong answers. But, IVDEP is also powerful and you may very well find yourself in a position where you need to use it. You just have to be very careful when you do this.


Vector Intrinsic Functions

The MIPSpro 64-bit compilers support both single and double precision versions of the following vector intrinsic functions: asin(), acos(), atan(), cos(), exp(), log(), sin(), tan(), sqrt(). In C they are declared as follows:

/* single precision vector routines */
extern  __vacosf( float *x, float *y, int count, int stridex, int stridey );
extern  __vasinf( float *x, float *y, int count, int stridex, int stridey );
extern  __vatanf( float *x, float *y, int count, int stridex, int stridey );
extern  __vcosf(  float *x, float *y, int count, int stridex, int stridey );
extern  __vexpf(  float *x, float *y, int count, int stridex, int stridey );
extern  __vlogf(  float *x, float *y, int count, int stridex, int stridey );
extern  __vsinf(  float *x, float *y, int count, int stridex, int stridey );
extern  __vtanf(  float *x, float *y, int count, int stridex, int stridey );
/* double precision vector routines */
extern  __vacos( double *x, double *y, int count, int stridex, int stridey );
extern  __vasin( double *x, double *y, int count, int stridex, int stridey );
extern  __vatan( double *x, double *y, int count, int stridex, int stridey );
extern  __vcos(  double *x, double *y, int count, int stridex, int stridey );
extern  __vexp(  double *x, double *y, int count, int stridex, int stridey );
extern  __vlog(  double *x, double *y, int count, int stridex, int stridey );
extern  __vsin(  double *x, double *y, int count, int stridex, int stridey );
extern  __vtan(  double *x, double *y, int count, int stridex, int stridey );

The variables x and y are assumed to be pointers to non-overlapping arrays. Each routine is functionally equivalent to the following pseudo-code fragment:

    do i = 0, count-1
        y[i*stridey] = func(x[i*stridex])
    enddo

where func() is the scalar version of the vector intrinsic.

Performance and Accuracy

The vector intrinsics are optimized and software pipelined to take advantage of the R8000's performance features. Throughput is several times greater than that of repeatedly calling the corresponding scalar function though the result may not necessarily agree to the last bit. For further information about accuracy and restrictions of the vector intrinsics please refer to your IRIX Compiler_dev Release Notes.

Manual vs. Automatic Invocation

All of the vector intrinsics can be called explicitly from your program. In Fortran the following example invokes the vector intrinsic vexpf().

    real*4 x(N), y(N)
    integer*8 N, i
    i = 1
    call vexpf$( %val(%loc(x(1))),
    &            %val(%loc(y(1))),
    &            %val(N),%val(i),%val(i))
    end

The compiler at optimization level -O3 also recognizes the use of scalar versions of these intrinsics on array elements inside loops and turns them into calls to the vector versions. If you need to turn this feature off to improve the accuracy of your results, add -LNO:vint=OFF to your compilation command line or switch to a lower optimization level.