Appendix A. Bentley's Rules Updated

In Writing Efficient Programs, a classic book that is now out of print, the computer scientist Jon Bentley provided a unified, pragmatic treatment of program efficiency, independent of language and host platform.

For ease of presentation, Bentley codified his methods as a set of terse rules—a set of rules that are still useful reminders. This appendix is the list of Bentley's rules, paraphrased from Appendix C of Writing Efficient Programs and revised to reflect the needs and opportunities of the SN0 architecture, SGI compilers, and the MIPS instruction set. In some cases the rules are only stated as a reminder that the compiler now performs these techniques automatically.

Space-for-Time Rules

Data Structure Augmentation

The time required for common operations on data can often be reduced by augmenting the structure with additional information, or by changing the information within the structure so it can be accessed more easily. 

Examples:

  • Add a reference counter to facilitate garbage collection.

  • Incorporate a “hint” containing a likely, but possibly incorrect, answer.

When the added data can help you avoid a disk access or a memory reference to a different data structure, it is worthwhile. However, when the added hint can at best avoid some inline processing, keep in mind the 100:1 ratio between instruction time and memory-fetch time. It is far more important to minimize secondary cache misses than to minimize instruction counts. If the simple data structure fits in a cache line and the augmented one does not, the time to access the augmented data can be greater than the instructions it saves.

Special applications of this rule include:

  • Reorganize data structures to make sure that the most-used fields fit in a single 128-byte cache line, and that less-used fields fall into separate cache lines. (This of course presumes that each structure is 128-byte aligned in memory.)

    In one example cited in a paper given at Supercomputing '96, most of the cache misses of one program were caused by “a search loop that accesses a four-byte integer field in each element of an array of structures...By storing the integer items in a separate, shadow array, the secondary cache miss rate of the entire application is reduced.”

  • Reorganize structures so that variables that are updated by different threads are located in different cache lines. The alternative—when variables updated by different threads fall in the same cache line—creates false sharing, in which unrelated memory updates force many CPUs to refresh their cache lines.
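
The preceding point can be made concrete with a small C sketch. The type and field names are invented for the example; the idea is simply that each thread's counter is padded out to its own 128-byte line, so that updates by different threads never contend for the same cache line.

#include <stdio.h>

#define CACHE_LINE 128          /* SN0 secondary cache line size in bytes */

/* One counter per thread, padded so that no two counters share a line. */
typedef struct per_thread_s {
    long count;                              /* the hot, frequently updated field */
    char pad[CACHE_LINE - sizeof(long)];     /* filler out to the line size       */
} per_thread_t;

/* The array should also start on a line boundary; a compiler-specific
   alignment directive would do that, and is omitted to keep the sketch
   portable. */
static per_thread_t counters[8];

int main(void)
{
    counters[0].count++;        /* a thread updating slot 0 touches only line 0 */
    counters[1].count++;        /* a thread updating slot 1 touches only line 1 */
    printf("element size = %lu bytes\n", (unsigned long)sizeof(per_thread_t));
    return 0;
}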

Store Precomputed Results

The cost of recomputing an expensive function can be reduced by computing the function only once and storing the results. Subsequent requests for the function are handled by table lookup.

Precomputation is almost certainly a win when the replaced computation involves disk access, or an IRIX system call, or even a substantial libc function. Precomputation is also a win when it reduces the amount of data the program must reference.

However, keep in mind that if retrieval of the precomputed value leads to an extra cache miss, the replaced computation must have cost at least 100 instructions to break even.
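
A minimal sketch of the rule, with invented names: the table is filled once, and later requests are satisfied by an indexed load instead of a call into libm. Whether this pays off is governed by the cache caution above.

#include <math.h>
#include <stdio.h>

#define STEPS 360
static double sin_table[STEPS];    /* sin() precomputed at one-degree steps */

static void fill_sin_table(void)
{
    const double pi = 3.14159265358979323846;
    int d;
    for (d = 0; d < STEPS; ++d)
        sin_table[d] = sin(d * (pi / 180.0));
}

/* A table lookup replaces the libm call; valid when whole degrees suffice. */
static double sin_deg(int degrees)
{
    return sin_table[((degrees % STEPS) + STEPS) % STEPS];
}

int main(void)
{
    fill_sin_table();
    printf("sin(30 deg) = %f\n", sin_deg(30));    /* 0.500000 */
    return 0;
}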

Caching

Data that is accessed most often should be the cheapest to access. However, caching can backfire when locality is not in fact present. Examples:

  • The move-to-front heuristic for sequential list management is the classic example.

  • In implementing a dictionary, keep the most common words in memory.

  • Database systems cache the most-recently used tuples as a matter of course.

The MIPS R10000, SN0 node, and IRIX kernel already cache at a number of levels:

  • Primary (on-chip) caches for instructions and data.

  • Secondary (on-board) cache for recently-used memory lines of 128 bytes.

  • Virtual memory system for most-recently used pages of 16KB and up.

  • Filesystems (XFS and NFS independently) for recently-used disk sectors.

(For more detail, see “Understanding the Levels of the Memory Hierarchy” in Chapter 6.) However, all the standard caches are shared by multiple processes. From the standpoint of your program, the other processes are “diluting” the cache with their data, displacing your data. Also, system cache management knows nothing about your algorithms. Thus, application-dependent caching can yield huge payoffs. For a practical example, see “Grouping Data Used at the Same Time” in Chapter 6.

Lazy Evaluation

Never evaluate an item until it is needed. Examples:  

  • In building a table of Fibonacci numbers (or any other function), populate the table with only the numbers that are actually requested, as they are requested (a sketch follows this list).

  • Brian Kernighan reduced the run time of [the original troff] by twenty percent by calculating the width of the current line when needed, rather than after each character.
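
The Fibonacci case from the first bullet might look like the following C sketch (the names are invented, and callers are assumed to keep n within the table): entries are computed only when first demanded, and remembered afterward.

#include <stdio.h>

#define MAXFIB 90                  /* fib(90) still fits in a 64-bit integer */
static long long fib_table[MAXFIB + 1];   /* zero marks "not yet computed"   */

/* Return fib(n) for 1 <= n <= MAXFIB, filling table entries only on demand. */
static long long fib(int n)
{
    if (n <= 2)
        return 1;
    if (fib_table[n] == 0)                    /* not evaluated yet */
        fib_table[n] = fib(n - 1) + fib(n - 2);
    return fib_table[n];
}

int main(void)
{
    printf("fib(10) = %lld\n", fib(10));      /* 55          */
    printf("fib(50) = %lld\n", fib(50));      /* 12586269025 */
    return 0;
}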

Time-for-Space Rules

Packing

 Dense storage representations can decrease storage costs by increasing the time to store and retrieve data. Typical examples include packing multiple binary integers into 32- or 64-bit words.

The memory available in a typical SN0 server is large enough that you might think such bit-squeezing techniques ought to be relics of the past. However, is file access time a significant factor in your program's speed? (Use par to find out.) If the file format can be squeezed 50%, sequential access time is cut by that much, and direct access time is reduced due to shorter seek distances. Also, if an important array does not fit into cache, and by packing a data structure you can get all of it into cache, the extra instruction cycles needed to unpack each item are repaid many times over.
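
As a hedged sketch of the trade (the field widths are invented for the example), three small values are packed into one 64-bit word. Every access pays a shift and a mask, but an array of these words occupies a third of the space that three separate 64-bit fields would, which can be the difference between fitting in cache and not.

#include <stdio.h>

/* Pack a 20-bit id, a 12-bit code, and a 32-bit count into one 64-bit word. */
typedef unsigned long long packed_t;

static packed_t pack(unsigned id, unsigned code, unsigned count)
{
    return ((packed_t)(id   & 0xFFFFFu) << 44) |
           ((packed_t)(code & 0xFFFu)   << 32) |
            (packed_t)count;
}

static unsigned unpack_id(packed_t p)    { return (unsigned)(p >> 44) & 0xFFFFFu; }
static unsigned unpack_code(packed_t p)  { return (unsigned)(p >> 32) & 0xFFFu;   }
static unsigned unpack_count(packed_t p) { return (unsigned)(p & 0xFFFFFFFFu);    }

int main(void)
{
    packed_t p = pack(1234, 56, 789);
    printf("%u %u %u\n", unpack_id(p), unpack_code(p), unpack_count(p));  /* 1234 56 789 */
    return 0;
}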

Interpreters

The space required to represent a program can often be decreased by the use of interpreters in which common sequences of operations are represented compactly. A typical example is the use of a finite-state machine to encode a complex protocol or lexical format into a small table.

Finite-state machines, decision tables, and token interpreters can all yield elegant, maintainable algorithm designs. However, keep in mind that the coding techniques often used to implement an interpreter, in particular computed gotos and vector tables, tend to defeat the compiler's code manipulations. Don't expect the compiler to optimize or parallelize the interpreter's central loop.
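
As a hedged illustration of that point (the scanner, its states, and its character classes are all invented for the example), the following table-driven finite-state machine strips C-style comments from a string. The whole control structure lives in the next[][] table; a compiler looking at the central loop sees only indexed loads and a choice of action, which is why it cannot unroll, parallelize, or otherwise reshape the "program" encoded in the table.

#include <stdio.h>

/* States and input classes for a tiny scanner that strips C comments. */
enum state { CODE, SLASH, CMNT, STAR, NSTATE };
enum cls   { C_SLASH, C_STAR, C_OTHER, NCLS };

/* next[state][class]: the protocol is data, not code. */
static const unsigned char next[NSTATE][NCLS] = {
    /*            '/'    '*'    other */
    /* CODE  */ { SLASH, CODE,  CODE  },
    /* SLASH */ { SLASH, CMNT,  CODE  },
    /* CMNT  */ { CMNT,  STAR,  CMNT  },
    /* STAR  */ { CODE,  STAR,  CMNT  },
};

static enum cls classify(int c)
{
    if (c == '/') return C_SLASH;
    if (c == '*') return C_STAR;
    return C_OTHER;
}

/* Copy src to dst with comments removed (sketch: ignores strings and "//"). */
static void strip_comments(const char *src, char *dst)
{
    enum state s = CODE;
    for (; *src; ++src) {
        enum state t = next[s][classify(*src)];
        if (s == CODE  && t == CODE)  *dst++ = *src;
        if (s == SLASH && t == SLASH) *dst++ = '/';
        if (s == SLASH && t == CODE)  { *dst++ = '/'; *dst++ = *src; }
        s = t;
    }
    if (s == SLASH) *dst++ = '/';   /* a trailing '/' that opened nothing */
    *dst = '\0';
}

int main(void)
{
    char out[64];
    strip_comments("a = b; /* set a */ c = d;", out);
    printf("%s\n", out);            /* prints: a = b;  c = d; */
    return 0;
}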

Loop Rules

Many of these loop-transformation rules have been incorporated into the current compilers. This makes it easier to apply them in controlled ways using compiler options.

Code Motion Out of Loops

An invariant expression (one whose value does not depend on the loop variable) should be calculated once, outside the loop, rather than iteratively. But keep in mind:

  1. The compiler is good at recognizing invariant expressions that are evident in the code. Place expressions where it is most natural to read or write them, and let the compiler move them for you. Example:

    for (i=0;i<Z;++i) { if (x[i]<(fact/div))...; }
    

    At nonzero optimization levels, the compiler will recognize fact/div as invariant, and move it to a generated temporary in the loop prologue. You should not modify the code to move the expression out of the loop; leave the code in this naive but readable form, and let the compiler do the work.

  2. The compiler cannot recognize when a call to a user function is invariant.

    for (i=0;i<Z;++i) { if (x[i]<func(fact,div))...; }
    

    When func(fact,div) does not depend on i, move the call out of the loop:

    for (i=0,t=func(fact,div);i<Z;++i) { if (x[i]<t)...; }
    

    An exception is when you request function inlining. After the compiler inlines the function body, it may then recognize that some or all of the inlined code is invariant, and move those parts out of the loop.

Combining Tests

An efficient inner loop should contain as few tests as possible, and preferably only one. Try to combine exit conditions. Examples:

  • Add a sentinel value after the last item of an unsorted vector being searched, so as to avoid a separate test of the index variable (a sketch appears below).

  • As a special case, use a sentinel byte value to signal the end of a string of bytes, avoiding the test for last-byte-index.

These optimizations can be especially helpful with the R10000 CPU, which can pipeline a single conditional branch, executing speculatively along the most likely arm. However, two conditional branches in succession can interfere with the pipeline and cause the CPU to skip cycles until the branch condition values are available. Thus if you condition an inner loop on a single test, you are likely to get faster execution.
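
The sentinel idea from the first bullet, sketched in C (the function name is invented): the key itself is stored just past the last live element, so the inner loop terminates on a single test instead of two.

#include <stdio.h>

/* Find key in v[0..n-1]; the caller must leave room for one extra element
   at v[n].  Returns the index of the first match, or -1 if absent. */
static int find_with_sentinel(int *v, int n, int key)
{
    int i = 0;
    v[n] = key;                 /* sentinel: the loop is guaranteed to stop */
    while (v[i] != key)         /* one test per iteration                   */
        ++i;
    return (i < n) ? i : -1;
}

int main(void)
{
    int v[8 + 1] = { 4, 9, 2, 7, 5, 1, 8, 3 };    /* one spare slot for v[8] */
    printf("%d\n", find_with_sentinel(v, 8, 7));  /*  3 */
    printf("%d\n", find_with_sentinel(v, 8, 6));  /* -1 */
    return 0;
}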

Loop Unrolling

A large cost of some short loops is in modifying loop indexes. That cost can often be reduced by unrolling the loop. 

This optimization can bring large benefits, especially in a superscalar CPU like the R10000 CPU. However, loop unrolling done manually is an error-prone operation, and the resulting code can be difficult to maintain. Fortunately, current compilers have extensive support for loop unrolling. You can control the amount of unrolling either with a compiler option or loop by loop with directives. The compiler takes care of the details and the bookkeeping involved in handling end-of-loop, while the source code remains readable.

A benefit of compiler-controlled loop unrolling is that the compiler applies optimizations like common subexpression elimination, constant propagation, code motion out of loops, and function call inlining both before and after it unrolls the loop. The optimizations work together for a synergistic effect. The software-pipelining optimization (rearranging the order of the generated machine instructions in order to maximize use of the R10000 execution units) also becomes more effective when it has a longer loop body to work on.

There are still cases in which the compiler does not recognize the opportunity to unroll loops. One is the case where the code could take advantage of the ability of the SN0 node board to maintain two or more cache miss requests simultaneously. Consider the following loop:

      m= 1
      DO 24  k= 2,n
         IF( X(k).LT.X(m))  m= k
   24 CONTINUE

This is a normal, stride-1 scan of an array. The compiler inserts prefetch instructions to overlap the fetch of one cache line with the processing of another. However, a substantial speed increase can be had by making the hardware keep two cache fetches going concurrently at all times. This is done by unrolling the loop one time, and accessing the halves of the array in two, concurrent scans, as follows:

      m= 1
      m2= n/2 + 1
      do k= 2, n/2
         if( x(k) .lt. x(m) )      m= k
         if( x(k+n/2) .lt. x(m2) ) m2= k+n/2
      enddo
      if( x(m2) .lt. x(m) ) m= m2
      if( mod(n,2).ne.0 .and. x(n).lt.x(m) ) m= n

Transfer-Driven Loop Unrolling

When a large cost of an inner loop is devoted to trivial assignments (other than maintenance of loop indices), the assignments can be reduced by duplicating the loop body and renaming the variables in the second copy. 

As with simple loop unrolling, this important optimization is well-handled by the compiler. Leave your code simple and readable and let the compiler do this.

Unconditional Branch Removal

A fast loop should have no unconditional branches. An unconditional branch at the end of a loop can be handled by “rotating” the loop to put a conditional branch at the bottom. 

This optimization might be needed in assembly-language code, but no modern compiler should generate code for a loop with the test at the top and an unconditional branch at the bottom.

Loop Fusion

When two nearby loops operate on the same set of elements, combine their operational parts and use only one set of loop-control operations. Example: A pair of similar loops such as the following:

for (i=0;i<n;i++)
   for (j=0;j<n;j++)
      b[i][j] = a[i][j];
for (i=0;i<n;i++)
   for (j=0;j<n;j++)
      c[i][j] = b[i][j];

Such a pair can be fused into a single loop such as this:

for (i=0;i<n;i++)
   for (j=0;j<n;j++) {
      b[i][j] = a[i][j];
      c[i][j] = b[i][j];
   }

This technique can produce important benefits in the SN0 architecture because, first, the contents of array b are only passed through the cache once, and second, the instructions to load each element of b (from the second loop) are eliminated. However, it is not necessary for you to perform this optimization manually. The compilers perform loop fusion automatically at certain optimization levels. The compilers can also perform the following optimizations:

  • Loop interchange—exchanging the induction variables of inner and outer nested loops so as to pass over memory consecutively (illustrated in the sketch below).

  • Cache blocking—breaking a single loop nest into a sequence of smaller nests that operate on blocks of data that each fit in secondary cache.

  • Loop fission—when two inner loops are controlled by a single outer loop, duplicating the outer loop and creating two simple loop nests that can then be interchanged, blocked, or unrolled.

Each of these changes, if done manually, complicates the source code with many temporary variables and insanely complicated loop control. In general, write loop code in the simplest, most portable and maintainable fashion, and then let the compiler tangle it behind the scenes.
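
To make the first of these transformations concrete, here is a hedged before-and-after sketch of loop interchange in C (the arrays and sizes are invented). The compiler makes this change itself at the appropriate optimization levels; the rewritten form is shown only to make the difference in memory-access pattern visible.

#include <stdio.h>

#define N 512
static double a[N][N];          /* zero-initialized; the arithmetic is not the point */
static double colsum[N];

int main(void)
{
    int i, j;

    /* Natural statement: the inner loop walks down a column of a C
       (row-major) array, touching a new cache line on nearly every
       reference. */
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            colsum[j] += a[i][j];

    /* Interchanged: the inner loop now moves along a row, so successive
       references fall in the same cache line. */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            colsum[j] += a[i][j];

    printf("%f\n", colsum[0]);
    return 0;
}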

Logic Rules

Exploit Algebraic Identities

In a conditional expression, replace a costly expression with an algebraically equivalent expression that is cheaper to evaluate. Examples:

  • Not (sqr(X)>0) but (X!=0). If this also allows you to avoid calculating sqr(X) unless it is needed, it is an instance of Lazy Evaluation.

  • Not (not(A) .and. not(B)) but not(A .or. B) (DeMorgan's Law).

When such replacements require insight, they are worth doing. Do not spend time replacing simple manifest expressions with constants; the compiler is good at that, and good at recognizing common subexpressions.

Short-Circuit Monotone Functions

When a monotone function of several variables is to be tested for a threshold value, break off the test as soon as the result is known. Short-circuit evaluation of a Boolean expression is a well-known case (handled automatically by the compiler). Bentley gave a more interesting example. Example A-1 shows a function to find the point in a list that is nearest to a given point.

Example A-1. Naive Function to Find Nearest Point

#include <math.h>                 /* for MAXFLOAT */
#define sqr(d) ((d)*(d))          /* square; not a libm function */
typedef struct point_s { double X, Y; } point_t;
int nearest_XY(int n, point_t given, point_t points[])
{
   int j, close_n=-1;
   double thisDist, closeDist=MAXFLOAT;
   for (j=0; j<n; ++j)
   {
      thisDist = sqr(points[j].X-given.X) + sqr(points[j].Y-given.Y);
      if (thisDist < closeDist)
      {
         closeDist = thisDist;
         close_n = j;
      }
   }
   return close_n;
}

Example A-1 as it stands short-circuits the distance computation (for purposes of comparing two distances, it is sufficient to compare the sum of squares and skip doing the square root). However, in many cases the X-distance alone is enough to eliminate a point, which avoids the need for the second multiply and an addition. Example A-2 applies this insight.

Example A-2. Nearest-Point Function with Short-Circuit Test

int nearest_XZ(int n, point_t given, point_t points[])
{
   int j, close_n=-1;
   double thisDist, closeDist=MAXFLOAT;
   for (j=0; j<n; ++j)
   {
      thisDist = sqr(points[j].X-given.X);
      if (thisDist < closeDist) /* possible */
      {
         thisDist += sqr(points[j].Y-given.Y);
         if (thisDist < closeDist)
         {
            closeDist = thisDist;
            close_n = j;
         }
      }
   }
   return close_n;
}


Reorder Tests

Logical tests should be arranged such that cheap tests precede expensive ones, and so that ones most likely to succeed precede ones more likely to fail.

In the absence of other information, the SGI compilers assume that the then-leg of an if is more likely than the else, so it is good practice to put the most likely code in the then-leg. You can pass exact instruction counts in a compiler feedback file (see “Passing a Feedback File” in Chapter 5). The compiler takes branch execution counts from the feedback file, and creates an instruction schedule that reflects the exact probabilities of each branch.

This is a convenient feature that can have good results on a poorly written program. However, the compiler does not take into account the cost of a test that involves a function call, or a test that can touch an unused cache line or page. Your insight is still needed to order tests so as to avoid expensive operations when possible.

Precompute Logical Functions

A logical function over a small, finite domain can be replaced by a table lookup. Examples include using a table to classify characters into lexical classes, and implementing an interpreter using a branch table indexed by instruction code.

When the table is small and frequently used, this technique is almost always helpful. However, when the table is sparse, or much larger than the executable code that it replaces, the effect of the table on the cache could swamp the benefits of its use.
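
A hedged sketch of the character-classification case (the table and macro names are invented): one 256-byte table answers "may this byte appear in an identifier?" with a single indexed load in place of a chain of range comparisons.

#include <stdio.h>

static unsigned char is_ident[256];    /* 1 if the byte may appear in a name */

static void init_ident_table(void)
{
    int c;
    for (c = 'a'; c <= 'z'; ++c) is_ident[c] = 1;
    for (c = 'A'; c <= 'Z'; ++c) is_ident[c] = 1;
    for (c = '0'; c <= '9'; ++c) is_ident[c] = 1;
    is_ident['_'] = 1;
}

/* One load and no branches. */
#define IS_IDENT(c) (is_ident[(unsigned char)(c)])

int main(void)
{
    init_ident_table();
    printf("%d %d %d\n", IS_IDENT('x'), IS_IDENT('_'), IS_IDENT('-'));   /* 1 1 0 */
    return 0;
}

The table is 256 bytes, exactly two cache lines—small and frequently used, so it falls on the favorable side of the trade described above.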

Replace Control Variables with Control Statements

Assignment to a Boolean variable and its later test can be replaced by an if statement on the Boolean expression. Generalizing, assignment to any control variable over a small, finite domain can be replaced by a case statement on the value that would have been assigned. 

Application of this rule saves three things: declaration of a variable to hold a control value; assignment into the variable; and fetching from the variable when the value is tested. The rule says that instead, you should modify the code so the control expression is used to direct execution as soon as its value is known. In some cases you have to duplicate code in order to apply the rule.

The compilers are good at reusing the stack space of local variables after their last use, and also good at saving calculated values in machine registers between their assignment and a nearby use. When the sole use of a local variable is to hold a value between the expression that computes it and a following test, the compiler (at -O2 and higher) is quite capable of eliminating the variable and using only a register temp.

Hence the only time you should distort your program's logic to apply this rule is when the assignment is separated from its use by many statements, or by a non-inlined procedure call. (A procedure call usually makes the compiler spill register values to memory.)

Procedure Design Rules

Collapse Procedure Hierarchies

The run time of a set of procedures that (nonrecursively) call one another can often be reduced by rewriting the procedures inline and binding the variables that were passed as arguments.

This technique is extremely useful, but difficult to apply to an existing program and yet leave the program readable and maintainable. Fortunately, current compilers contain extensive support for interprocedural analysis (IPA) and automatic inlining. Given that, it is best to write the program in a simple, well-structured style, not fretting about frequent calls to small procedures, and to let the compiler inline the small procedures.

Although inlining eliminates the instruction overhead of a function call, it inflates the size of the object code. At a certain level of inlining, the effect of the inflated, low-density code on cache and virtual memory wastes as much time as it saves.

Keep in mind the maintenance implications of inlining procedures in code that you distribute to other users. A library procedure that has been inlined even once can never be field-replaced. You can send out a revised version of the library, but programs that inlined the function will never call the library version.

Exploit Common Cases

A procedure should be organized to handle the most common cases with the greatest efficiency, and all remaining cases correctly (if more slowly). When designing the application, attempt to arrange the input conditions so that the efficient cases are in fact the most common. 

The Caching rule (“Caching”) can be seen as an extension of this rule, in which “the commonest case” is “whatever case was most recently requested.” Bentley cites a Julian-date conversion function in which it was observed that 90% of all calls presented the same date as the previous call. Saving the last result, and returning it without recomputing when appropriate, saved time.

One way to apply the rule is to code a function so it tests for the common cases and returns their results quickly, falling through to the more involved code of the hard cases. However, a function that contains the code to handle all cases is likely to be a bulky one to inline. In order to exploit common cases and also support function inlining, write two functions. One, a public function, is short, and efficiently handles only the most common cases. Whenever it is called with a case it does not handle, it calls a longer, private function to deal with all remaining cases. The shorter function can be inlined (at multiple places, if necessary) without greatly inflating the code. The longer one should not be inlined.
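
A hedged sketch of the two-function pattern, loosely echoing the Julian-date anecdote above (the names and the deliberately simplified day-of-year arithmetic are invented for the example): the short public routine answers the repeated-date case from a one-entry cache and is cheap to inline; the longer private worker handles the general computation.

#include <stdio.h>

/* The general (and more expensive) case, computed from scratch.
   Deliberately minimal: day of year for the Gregorian calendar. */
static int day_of_year_slow(int year, int month, int day)
{
    static const int days_before[12] =
        { 0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334 };
    int leap = (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
    return days_before[month - 1] + day + (leap && month > 2);
}

/* Short public routine: handles the most common case -- the same date as
   the previous call -- from a one-entry cache, and hands everything else
   to the worker above.  Small enough to inline wherever it is called. */
static int day_of_year(int year, int month, int day)
{
    static int last_y, last_m, last_d, last_result;
    if (year == last_y && month == last_m && day == last_d)
        return last_result;
    last_y = year; last_m = month; last_d = day;
    return last_result = day_of_year_slow(year, month, day);
}

int main(void)
{
    printf("%d\n", day_of_year(1997, 3, 1));   /* 60: computed                */
    printf("%d\n", day_of_year(1997, 3, 1));   /* 60: answered from the cache */
    return 0;
}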

Use Coroutines

A multiphase algorithm in which the phases are linked by temporary files (or arrays) can be reduced to a single-pass algorithm using coroutines.

Coroutines are defined and described in detail in Knuth (Volume I) and most other modern books on algorithmics. Under IRIX, you can write coroutines in a natural way using any one of three models:

  • The UNIX model of forked processes that communicate using pipes (sketched in the example after this list).

  • The POSIX threads model using POSIX message queues to communicate.

  • The MPI (or PVM) model of cooperating processes exchanging messages.
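
The first model might be sketched as follows; the example is minimal and invented, with error handling reduced to perror. The producer phase writes integers into a pipe as it generates them, and the consumer phase processes them as they arrive, with no temporary file or array between the two.

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fd[2];
    pid_t pid;

    if (pipe(fd) == -1) { perror("pipe"); return 1; }
    pid = fork();
    if (pid == -1)      { perror("fork"); return 1; }

    if (pid == 0) {                         /* producer phase */
        int i;
        close(fd[0]);
        for (i = 1; i <= 5; ++i)
            write(fd[1], &i, sizeof i);
        close(fd[1]);
        _exit(0);
    } else {                                /* consumer phase */
        int v, sum = 0;
        close(fd[1]);
        while (read(fd[0], &v, sizeof v) == (ssize_t)sizeof v)
            sum += v;
        close(fd[0]);
        waitpid(pid, NULL, 0);
        printf("sum = %d\n", sum);          /* 15 */
    }
    return 0;
}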

Transform Recursive Procedures

A simple, readable solution expressed as a recursion can be transformed into a more efficient iterative solution. You can:

  • Code the recursion explicitly, using a stack data structure that you manage yourself.

  • When the final action of a procedure is to call itself (“tail recursion”), replace the call with a goto to the top of the procedure. (Then rewrite the iteration as a loop.)

  • Cut off the recursion early for small arguments (for example using a table lookup), instead of recurring down to the case of size 0.

The compilers recognize simple cases of tail recursion and compile a goto to the top of the function, avoiding stack use. More complicated recursion defeats most of the compiler's ability to modify the source: recursive procedures are not inlined; a loop containing a recursion is not unrolled; and a loop with a recursion is not executed in parallel. Hence there is good reason to remove recursion, after it has served its purpose as a design tool.
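
As one hedged sketch of the tail-recursion case (factorial is chosen only for brevity, and the names are invented): once the recursive call is the last action, it can be replaced by a loop, and the loop is then open to unrolling and software pipelining.

#include <stdio.h>

/* Tail-recursive form: the recursive call is the final action. */
static long long fact_rec(int n, long long acc)
{
    if (n <= 1)
        return acc;
    return fact_rec(n - 1, acc * n);
}

/* The same computation with the tail call rewritten as a loop:
   no stack growth, and a loop body the compiler can transform. */
static long long fact_iter(int n)
{
    long long acc = 1;
    while (n > 1) {
        acc *= n;
        --n;
    }
    return acc;
}

int main(void)
{
    /* Both print 2432902008176640000. */
    printf("%lld\n", fact_rec(20, 1LL));
    printf("%lld\n", fact_iter(20));
    return 0;
}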

Use Parallelism

Structure the program to exploit the parallelism available in the underlying hardware.

This rule is certainly a no-brainer for the SN0 systems; parallel execution is what they do best. However, you can apply parallelism using a variety of programming models and at a wide range of scales from small to large. Two examples among many:

  • Using the POSIX-compliant asynchronous I/O library, you can schedule one file access, or an entire list of them, to be performed in a parallel thread, with all the file-synchronous waits being taken by a different process.

  • The native integer in the MIPS IV ISA has 64 bits (under either the -n32 or -64 compiler option). You can pack many smaller units into long long (in Fortran, INTEGER*8) variables and operate on them in groups.

Parallel algorithm design is a very active area in computer science research.

Expression Rules

Initialize Data Before Runtime

As many variables as possible (including arrays and tables) should be initialized before program execution. 

This rule encompasses the use of compile-time initialization of variables—in C, by use of initializing expressions in declarations; in Fortran, by use of PARAMETER and DATA statements—and also the preparation of large arrays and tables.

Compile-time initialization with manifest constants allows the compiler greater scope. The compiler precalculates subexpressions that have only constant arguments. When IPA is used, the compiler recognizes when a procedure argument is always a given constant, and propagates the constant into the procedure body. When functions are inlined, constants are propagated into the inlined code, and this produces still more constant expressions for the compiler to eliminate.

Some programs spend significant time initializing large arrays, tables, and other data structures. There are two common cases: clearing to zero, and processing initial data from one or more files.

Initializing to Zero

A for-loop that does nothing but store zero into an array is a waste of CPU cycles. Worse, in IRIX on SN0 systems, all memory is by default allocated to the node from which it is first touched, which means that the zero array will be allocated in the node where the for-loop runs. So at the least, make sure that each parallel thread of the program zeros the array elements that it uses, and no others, so that memory is allocated where it is used.

In a C program, it is much more efficient to use calloc (see the calloc(3) man page) to allocate memory filled with zero. The page frames are defined, but the actual pages are not created and not filled at this time. Instead, zero-filled pages are created as the program faults them in. This ensures that no time is spent clearing pages that are never used, and that pages are allocated to the node that actually uses them.

A similar alternative is to use mmap (see the mmap(2) man page) to map an extent of the pseudo-device /dev/zero. The mapped extent is added to the program's address space, but again, zero-filled pages are created and allocated as they are faulted.
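
Both alternatives might look like the following hedged C sketch (the sizes are arbitrary and error handling is minimal):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define NWORDS (1024 * 1024)

int main(void)
{
    double *a, *b;
    int fd;

    /* calloc: the address range is reserved, but zero-filled pages are
       created only as they are first touched, on the node that touches
       them. */
    a = calloc(NWORDS, sizeof(double));
    if (a == NULL) { perror("calloc"); return 1; }

    /* mmap of /dev/zero: the same demand-zero behavior, expressed as a
       private mapping. */
    fd = open("/dev/zero", O_RDWR);
    if (fd == -1) { perror("open /dev/zero"); return 1; }
    b = mmap(NULL, NWORDS * sizeof(double),
             PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (b == MAP_FAILED) { perror("mmap"); return 1; }

    a[0] = 1.0;                 /* first touch creates this page ...      */
    b[0] = 2.0;                 /* ... and this one, on the touching node */
    printf("%f %f\n", a[0], b[0]);

    munmap(b, NWORDS * sizeof(double));
    close(fd);
    free(a);
    return 0;
}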

In a Fortran program, it is somewhat more efficient to call the C library function bzero() than to code a DO loop storing zero throughout an array. However, this does touch the cleared pages.

Initializing from Files

The time to read and process file data is essentially wasted because it does not contribute to the solution. Bentley recommended writing a separate program that reads the input data and processes it into some intermediate form that can be read and stored more quickly. For example, if the initializing data is in ASCII form, a side program can preprocess it to binary. When the side program uses the same data structures and array dimensions as the main program, the main program can read the processed file directly into the array space.

IRIX permits you to extend this idea. You can use mmap (see the mmap(2) man page) to map any file into memory. You can take all the code that reads files and initializes memory arrays and structures, and isolate it in a separate program. This program maps a large block of memory to a new disk file. It reads the input file data, processes it in any way required, and builds a complete set of data structures and arrays of binary data in the mapped memory. Then the program unmaps the initialized memory (or simply terminates). This writes an image of the prepared data into the file.

When the main program starts up, it maps the initialized file into memory, and its initialization is instantly complete. Pages of memory are faulted in from the disk file as the program refers to them, complete with the binary content as it was prepared by the initializing program. This approach is especially handy for programs that need large in-memory databases, like the scenery files used in a visual simulator.

Exploit Algebraic Identities

Replace the evaluation of a costly expression by one that is algebraically equivalent but less costly. For example: not ln(A)+ln(B) but ln(A*B) (consider that A*B might overflow).

Bentley mentioned that the frequent need to test whether an integer J is contained in the inclusive range (I,K), naively coded ((I<=J)&&(J<=K)), can be met with a single comparison, ((unsigned)(J-I) <= (unsigned)(K-I)), provided the differences are compared as unsigned values. The value (K-I) can be precomputed. (When I and K are constants or are propagated from constant expressions, the compiler precomputes it automatically.)

When you compile with the IEEE_arithmetic=3 and roundoff=3 options, the compilers automatically perform a number of mathematically valid transformations to make arithmetic faster.

Algebraic identities are at the heart of strength reduction in loops. The compiler can recognize most opportunities for strength reduction in an expression that includes the loop induction variable as an operand. However, strength reduction is a general principle, and the compiler cannot recognize every possibility. Examine each expression in a loop to see whether its value could be calculated more cheaply by an incremental change to the value it had on the previous iteration.
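
A hedged sketch of the principle, with invented names: the second loop produces the same values as the first, replacing the multiply by the induction variable with an incremental add carried from the previous iteration. The compiler performs this particular reduction itself; the form is shown only to suggest what to look for in expressions it cannot see through.

#include <stdio.h>

#define N 8

int main(void)
{
    double x[N], y[N];
    double dt = 0.25, t;
    int i;

    /* Multiply by the induction variable on every iteration. */
    for (i = 0; i < N; i++)
        x[i] = i * dt;

    /* Strength-reduced: the same values by incremental addition. */
    t = 0.0;
    for (i = 0; i < N; i++) {
        y[i] = t;
        t += dt;
    }

    for (i = 0; i < N; i++)
        printf("%g %g\n", x[i], y[i]);    /* the two columns agree */
    return 0;
}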

Eliminate Common Subexpressions

When the same expression is evaluated twice with no assignments to its variable operands, save the first evaluation and reuse it. 

This rule, application of which is almost second nature to experienced programmers, is in fact applied well by the compilers. The compilers look for common subexpressions both before and after constant propagation and inlining. Often they find subexpressions that are not superficially apparent in the source code. The expression value is held for its second use in a machine register if possible, or the compiler may create a temporary stack variable to hold it.

Combine Paired Computation

When similar expressions are evaluated together, consider making a new function that combines the evaluation of both. 

Bentley cites Knuth as reporting that both sine and cosine of an angle can be computed as a pair for about 1.5 times the cost of computing one of them alone. The maximum and minimum values in a vector can be found in one pass for about 1.5 times the cost of finding either. (The current compilers can perform this particular optimization automatically. See the opt(5) man page, the cis= parameter.)

When it is not practical for the function to return both values, it can at least cache the paired answer (see “Caching”) to be returned on request. In the case of sine and cosine, each of the functions can test to see if its argument is the same as the previous angle passed to either function. When it is not, call the function that calculates and caches both sine and cosine. Return the requested result.
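
The caching variant just described might be sketched as follows (the wrapper names are invented): each wrapper checks whether its argument is the angle most recently seen by either wrapper, and recomputes the pair only when it is not.

#include <math.h>
#include <stdio.h>

static double last_angle = HUGE_VAL;   /* no angle seen yet          */
static double last_sin, last_cos;      /* cached pair for last_angle */

static void compute_pair(double a)
{
    if (a != last_angle) {
        last_sin = sin(a);             /* a compiler with the cis=    */
        last_cos = cos(a);             /* option can pair these calls */
        last_angle = a;
    }
}

static double sin_cached(double a) { compute_pair(a); return last_sin; }
static double cos_cached(double a) { compute_pair(a); return last_cos; }

int main(void)
{
    double a = 0.5;
    printf("%f %f\n", sin_cached(a), cos_cached(a));   /* cos comes from the cache */
    return 0;
}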

In this regard, note that the compilers do recognize the dual nature of integer division and remainder, so that the sequence {div=a/b;rem=a%b;} compiles to a single integer divide.

Exploit Word Parallelism

Use the full word width of the underlying computer architecture to evaluate binary values in parallel.

This is a special case exploiting parallelism. The native integer in all current MIPS CPUs is 64 bits. The standard Fortran Boolean functions (IOR, IAND, and so on) handle INTEGER*8 (that is, 64-bit) values.
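
A hedged C sketch of the idea, using the well-known zero-byte test: one 64-bit operation examines eight bytes at a time, which is the kernel of word-at-a-time implementations of routines such as strlen.

#include <stdio.h>

/* Nonzero exactly when some byte of w is zero: a zero byte borrows in the
   subtraction and so acquires a high bit, while the ~w mask removes bytes
   whose high bit was already set in w. */
static unsigned long long has_zero_byte(unsigned long long w)
{
    return (w - 0x0101010101010101ULL) & ~w & 0x8080808080808080ULL;
}

int main(void)
{
    unsigned long long all_set = 0x4142434445464748ULL;   /* bytes "ABCDEFGH"  */
    unsigned long long one_nul = 0x4142434400464748ULL;   /* contains a 0 byte */
    printf("%d %d\n", has_zero_byte(all_set) != 0,         /* 0 */
                      has_zero_byte(one_nul) != 0);        /* 1 */
    return 0;
}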