Chapter 4. OpenMP Fortran API Multiprocessing Directives

This chapter describes the multiprocessing directives that the MIPSpro 7 Fortran 90 compiler supports. These directives are based on the OpenMP Fortran application program interface (API) standard. Programs that use these directives are portable and can be compiled by other compilers that support the OpenMP standard.

To enable recognition of the OpenMP directives, specify -mp on the f90(1) command line. The -mp option must be specified in order for the compiler to honor any -MP:... options that may also be specified on the command line. The -MP:open_mp=ON option is on by default and must be in effect during compilation.

The following example command line can compile program ompprg.f, which contains OpenMP Fortran API directives:

f90 -mp ompprg.f

In addition to directives, the OpenMP Fortran API describes several library routines and environment variables. Information on these other utilities can be found in the following locations:

  • Command line information: For information on the -mp option, see “-mp” in Chapter 2. For information on the -MP: option, see “-MP:open_mp=setting” in Chapter 2.

  • Library routines: omp_lock(3), omp_nested(3), and omp_threads(3) man pages

  • Environment variables: pe_environ(5) man page


Note: If individual loops in your program contain both OpenMP directives and extensions (prefixed with !$OMP or !$SGI) and any of the outmoded multiprocessing directives described in Appendix D, “Multiprocessing Directives (Outmoded)”, and Chapter 5, “Parallel Processing on Origin Series Systems”, (prefixed with !$ or !$PAR), you must specify the set of directives that the compiler should use. To direct the compiler to ignore the OpenMP directives, compile with -MP:open_mp=OFF. To direct the compiler to ignore the outmoded multiprocessing directives, compile with -MP:old_mp=OFF. To direct the compiler to ignore the outmoded Origin series distributed shared memory directives, specify -MP:dsm=OFF. For more information on the -mp option, see “-mp” in Chapter 2. For more information on the -MP: option, see “-MP:open_mp=setting” in Chapter 2.

The sections in this chapter are as follows:

  • “Using Directives”

  • “Conditional Compilation”

  • “Parallel Region Constructs (PARALLEL and END PARALLEL Directives)”

  • “Work-sharing Constructs”

  • “Combined Parallel Work-sharing Constructs”

  • “Synchronization Constructs”

  • “Data Environment Constructs”

  • “Directive Binding”

  • “Directive Nesting”

  • “Analyzing Data Dependencies for Multiprocessing”


Note: The Silicon Graphics multiprocessing directives, including the Origin series distributed shared memory directives, are outmoded. Their preferred alternatives are the OpenMP Fortran API directives described in this chapter.


Using Directives

All multiprocessing directives are case-insensitive and are of the following form:

prefix directive [clause[[,] clause]...]
prefix

Each directive begins with a prefix, and the prefixes you can use depend on your source form, as follows:

  • If you are using fixed source form, the following prefixes can be used: !$OMP, C$OMP, or *$OMP.

    Prefixes must start in column one and appear as a single word with no intervening white space. Fortran fixed form line length, case sensitivity, white space, continuation, and column rules apply to the directive line.

  • If you are using free source form, the following prefix can be used: !$OMP.

    A prefix can appear in any column as long as it is preceded only by white space. It must appear as a single word with no intervening white space. Fortran free form line length, case sensitivity, white space, and continuation rules apply to the directive line.

directive

The name of the directive.

clause

One or more directive clauses. Clauses can appear in any order after the directive name and can be repeated as needed, subject to the restrictions listed in the description of each clause.

Directives cannot be embedded within continued statements, and statements cannot be embedded within directives. Comments cannot appear on the same line as a directive.

In fixed source form, initial directive lines must have a space or zero in column six, and continuation directive lines must have a character other than a space or a zero in column six.

In free source form, initial directive lines must have a space after the prefix. Continued directive lines must have an ampersand as the last nonblank character on the line. Continuation directive lines can have an ampersand after the directive prefix with optional white space before and after the ampersand.

Example 1 (fixed source form). The following formats for specifying directives are equivalent (the first line represents the position of the first 9 columns):

C23456789
!$OMP PARALLEL DO SHARED(A,B,C)

C$OMP PARALLEL DO
C$OMP+SHARED(A,B,C)

C$OMP PARALLELDOSHARED(A,B,C)

Example 2 (free source form). The following formats for specifying directives are equivalent (the first line represents the position of the first 9 columns):

!23456789
       !$OMP PARALLEL DO &
                 !$OMP SHARED(A,B,C)

!$OMP PARALLEL &
       !$OMP&DO SHARED(A,B,C)

      !$OMP PARALLEL DO SHARED(A,B,C)


Note: In order to simplify the presentation, the remainder of this chapter uses the !$OMP prefix in all syntax descriptions and examples.


Conditional Compilation

Fortran statements can be compiled conditionally as long as they are preceded by one of the following conditional compilation prefixes: !$, C$, or *$. The prefix must be followed by a Fortran statement on the same line. During compilation, the prefix is replaced by two spaces, and the rest of the line is treated as a normal Fortran statement.

Your program must be compiled with the -mp option in order for the compiler to honor statements preceded by conditional compilation prefixes; without the -mp command line option, statements preceded by conditional compilation prefixes are treated as comments. For more information on the -mp option, see “-mp” in Chapter 2.

The !$ prefix is accepted when compiling either fixed source form files or free source form files. The C$ and *$ prefixes are accepted only when compiling fixed source form. The source form you are using also dictates the following:

  • In fixed source form, the prefixes must start in column one and appear as a single word with no intervening white space. Fortran fixed form line length, case sensitivity, white space, continuation, and column rules apply to the line. Initial lines must have a space or zero in column six, and continuation lines must have a character other than a space or zero in column six.

    Example. The following forms for specifying conditional compilation are equivalent:

    C23456789
    !$ 10 IAM = OMP_GET_THREAD_NUM() +
    !$   &          INDEX
    
    #ifdef _OPENMP
       10 IAM = OMP_GET_THREAD_NUM() +
         &          INDEX
    #endif

  • In free source form, the !$ prefix can appear in any column as long as it is preceded only by white space. It must appear as a single word with no intervening white space. Fortran free source form line length, case sensitivity, white space, and continuation rules apply to the line. Initial lines must have a space after the prefix. Continued lines must have an ampersand as the last nonblank character on the line. Continuation lines can have an ampersand after the prefix, with optional white space before and after the ampersand.

In addition to the conditional compilation prefixes, a preprocessor macro, _OPENMP, can be used for conditional compilation. For more information on source preprocessing and conditional compilation, see Chapter 7, “Source Preprocessing”.

Example. The following example illustrates the use of the conditional compilation prefix. Assuming Fortran fixed source form, the following statement is invalid when using OpenMP constructs:

C234567890
!$  X(I) = X(I) + XLOCAL

With OpenMP compilation, the conditional compilation prefix !$ is treated as two spaces. As a result, the statement infringes on the statement label field. To be valid, the statement should begin after column six, like any other fixed source form statement:

C234567890
!$    X(I) = X(I) + XLOCAL

In other words, conditionally compiled statements need to meet all applicable language rules when the prefix is replaced with two spaces.

Parallel Region Constructs (PARALLEL and END PARALLEL Directives)

The PARALLEL and END PARALLEL directives define a parallel region. A parallel region is a block of code that is to be executed by multiple threads in parallel. This is the fundamental OpenMP parallel construct that starts parallel execution. These directives have the following format:

!$OMP PARALLEL [clause[[,] clause]...]
block
!$OMP END PARALLEL
clause

clause can be one or more of the following:

  • PRIVATE(var[, var] ...)

  • SHARED(var[, var] ...)

  • DEFAULT(PRIVATE | SHARED | NONE)

  • FIRSTPRIVATE(var[, var] ...)

  • REDUCTION ({operator|intrinsic}:var[, var] ...)

  • IF(scalar_logical_expression)

  • COPYIN(var[, var] ...)

The IF clause is described in this section. For information on the PRIVATE, SHARED, DEFAULT, FIRSTPRIVATE, REDUCTION, and COPYIN clauses, see “Data Scope Attribute Clauses”.

block

block denotes a structured block of Fortran statements. You cannot branch into or out of the block. The code contained within the dynamic extent of the parallel region is executed on each thread.

The END PARALLEL directive denotes the end of the parallel region. There is an implied barrier at this point. Only the master thread of the team continues execution past the end of a parallel region.

When a thread encounters a parallel region, it creates a team of threads, and it becomes the master of the team. The master thread is a member of the team and it has a thread number of 0 within the team. The number of threads in the team is controlled by environment variables and/or library calls.

The number of physical processors actually hosting the threads at any given time depends on the number of CPUs available and the system load. Once created, the number of threads in the team remains constant for the duration of that parallel region, but it can be changed either explicitly by the user or automatically by the run-time system from one parallel region to another. The OMP_SET_DYNAMIC(3) library routine and the OMP_DYNAMIC environment variable can be used to enable and disable the automatic adjustment of the number of threads. For more information on environment variables that affect OpenMP directives, see the pe_environ(5) man page.
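For example, the following is a minimal sketch of fixing the team size from within the program (the subroutine WORK is a placeholder; OMP_SET_NUM_THREADS and OMP_SET_DYNAMIC are OpenMP library routines):

      CALL OMP_SET_DYNAMIC(.FALSE.)
!     WITH DYNAMIC ADJUSTMENT DISABLED, THE REQUESTED TEAM SIZE IS
!     USED FOR SUBSEQUENT PARALLEL REGIONS
      CALL OMP_SET_NUM_THREADS(4)
!$OMP PARALLEL
      CALL WORK()
!$OMP END PARALLEL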


Note: The OpenMP Fortran API does not specify the number of physical processors that can host the threads at any given time.

If a thread in a team executing a parallel region encounters another parallel region, it creates a new team, and it becomes the master of that new team. By default, nested parallel regions are serialized; that is, they are executed by a team composed of one thread. This default behavior can be changed by using either the OMP_SET_NESTED(3) library routine or the OMP_NESTED environment variable. For more information on environment variables that affect OpenMP directives, see the pe_environ(5) man page.

If an IF clause is present, the enclosed code region is executed in parallel only if the scalar_logical_expression evaluates to .TRUE.. Otherwise, the parallel region is serialized. The expression must be a scalar Fortran logical expression.
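For example, the following sketch (the threshold value and the subroutine WORK are illustrative) executes the region in parallel only when the problem is large enough to justify the overhead of creating a team:

!$OMP PARALLEL IF(N .GT. 1000) DEFAULT(SHARED)
      CALL WORK(X, N)
!$OMP END PARALLEL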

The following restrictions apply to parallel regions:

  • The PARALLEL/END PARALLEL directive pair must appear in the same routine in the executable section of the code.

  • The code contained by these two directives must be a structured block. You cannot branch into or out of a parallel region.

  • Only a single IF clause can appear on the directive.

Example. The PARALLEL directive can be used for exploiting coarse-grained parallelism. In the following example, each thread in the parallel region decides what part of the global array X to work on based on the thread number:

!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(X,NPOINTS)
      IAM = OMP_GET_THREAD_NUM()
      NP =  OMP_GET_NUM_THREADS()
      IPOINTS = NPOINTS/NP
      CALL SUBDOMAIN(X,IAM,IPOINTS)
!$OMP END PARALLEL


Note: ALLOCATABLE or POINTER arrays can be privatized by using a PRIVATE clause on a PARALLEL directive or by using a worksharing construct (a DO, SINGLE, or SECTIONS directive).

However, ALLOCATABLE and POINTER arrays are not allowed within FIRSTPRIVATE or LASTPRIVATE clauses. Assumed-size and assumed-shape arrays are not allowed within PRIVATE, FIRSTPRIVATE, or LASTPRIVATE clauses.


Work-sharing Constructs

A work-sharing construct divides the execution of the enclosed code region among the members of the team that encounter it. A work-sharing construct must be enclosed within a parallel region in order for the directive to execute in parallel. The work-sharing directives do not launch new threads, and there is no implied barrier on entry to a work-sharing construct.

The following restrictions apply to the work-sharing directives:

  • Work-sharing constructs and BARRIER directives must be encountered by all threads in a team or by none at all.

  • Work-sharing constructs and BARRIER directives must be encountered in the same order by all threads in a team.

The following sections describe the work-sharing directives:

Specify Parallel Execution: DO and END DO Directives

The DO directive specifies that the iterations of the immediately following DO loop must be divided among the threads in the parallel region. If there is no enclosing parallel region, the DO loop is executed serially.

The loop that follows a DO directive cannot be a DO WHILE or a DO loop without loop control.

The format of this directive is as follows:

!$OMP DO [clause[[,] clause]...]
do_loop
[!$OMP END DO [NOWAIT]]
clause

clause can be one of the following:

  • PRIVATE(var[, var] ...)

  • FIRSTPRIVATE(var[, var] ...)

  • LASTPRIVATE(var[, var] ...)

  • REDUCTION({operator|intrinsic}:var[, var] ...)

  • SCHEDULE(type[,chunk])

  • ORDERED

The SCHEDULE and ORDERED clauses are described in this section. The PRIVATE, FIRSTPRIVATE, LASTPRIVATE, and REDUCTION clauses are described in “Data Scope Attribute Clauses”.

do_loop

A DO loop.

If ordered sections are contained in the dynamic extent of the DO directive, the ORDERED clause must be present. The code enclosed within an ordered section is executed in the order in which it would be executed in a sequential execution of the loop. For more information on ordered sections, see the ORDERED directive in “Request Sequential Ordering: ORDERED and END ORDERED Directives”.

The SCHEDULE clause specifies how iterations of the DO loop are divided among the threads of the team. Within the SCHEDULE(type[,chunk]) clause syntax, type can be one of the following:

STATIC

When SCHEDULE(STATIC,chunk) is specified, iterations are divided into pieces of a size specified by chunk. The pieces are statically assigned to threads in the team in a round-robin fashion in the order of the thread number. chunk must be a scalar integer expression.

When no chunk is specified, the iterations are divided among threads in contiguous pieces, and one piece is assigned to each thread. This is the default.

DYNAMIC

When SCHEDULE(DYNAMIC,chunk) is specified, the iterations are broken into pieces of a size specified by chunk. As each thread finishes its iterations, it dynamically obtains the next set of iterations.

When no chunk is specified, it defaults to 1.

GUIDED

When SCHEDULE(GUIDED,chunk) is specified, the iterations are handed out in pieces of exponentially decreasing size. chunk specifies the minimum number of iterations to dispatch each time, except when fewer than chunk iterations remain, at which point the rest are dispatched.

When no chunk is specified, it defaults to 1.

RUNTIME

When SCHEDULE(RUNTIME) is specified, the scheduling decision is deferred until run time, and a chunk cannot be specified.

The schedule type and chunk size can be chosen at run time by setting the OMP_SCHEDULE environment variable. If this environment variable is not set, the resulting schedule is STATIC.

For more information on the OMP_SCHEDULE environment variable, see the pe_environ(5) man page.
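As an illustration, the following sketch (the loop body is a placeholder, and the DO directive is assumed to be inside an enclosing parallel region) requests dynamic scheduling with a chunk size of 4, so each thread obtains the next four iterations as it finishes its current set:

!$OMP DO SCHEDULE(DYNAMIC,4)
      DO I = 1, N
        CALL WORK(I)
      END DO
!$OMP END DO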


Note: The OpenMP Fortran API does not define a default scheduling mechanism. You should not rely on a particular implementation of a schedule type for correct execution because it is possible to have variations in the implementations of the same schedule type across different compilers.

If an END DO directive is not specified, it is assumed at the end of the DO loop. If NOWAIT is specified on the END DO directive, threads do not synchronize at the end of the parallel loop. Threads that finish early proceed straight to the instructions following the loop without waiting for the other members of the team to finish the DO directive.

Example. If there are multiple independent loops within a parallel region, you can use the NOWAIT clause to avoid the implied BARRIER at the end of the DO directive, as follows:

!$OMP PARALLEL
!$OMP DO
      DO I=2,N
        B(I) = (A(I) + A(I-1)) / 2.0
      ENDDO
!$OMP END DO NOWAIT
!$OMP DO
      DO I=1,M
        Y(I) = SQRT(Z(I))
      ENDDO
!$OMP END DO NOWAIT
!$OMP END PARALLEL

Parallel DO loop control variables are block-level entities within the DO loop. If the loop control variable also appears in the LASTPRIVATE variable list of the parallel DO, it is copied out to a variable of the same name in the enclosing PARALLEL region. The variable in the enclosing PARALLEL region must be SHARED if it is specified on the LASTPRIVATE variable list of a DO directive.
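As an illustration of this rule, in the following sketch I is SHARED in the enclosing parallel region and is copied out by the LASTPRIVATE clause, so the shared I holds the value N+1 after the parallel region:

!$OMP PARALLEL SHARED(A,B,N,I)
!$OMP DO LASTPRIVATE(I)
      DO I = 1, N
        A(I) = B(I)
      END DO
!$OMP END PARALLEL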

The following restrictions apply to the DO directives:

  • You cannot branch out of a DO loop associated with a DO directive.

  • The values of the loop control parameters of the DO loop associated with a DO directive must be the same for all the threads in the team.

  • The DO loop iteration variable must be of type integer.

  • If used, the END DO directive must appear immediately after the end of the loop.

  • Only a single SCHEDULE clause can appear on a DO directive.

  • Only a single ORDERED clause can appear on a DO directive.

Mark Code for Specific Threads: SECTION, SECTIONS and END SECTIONS Directives

The SECTIONS directive specifies that the enclosed sections of code are to be divided among threads in the team. It is a noniterative work-sharing construct. Each section is executed once by a thread in the team.

The format of this directive is as follows:

!$OMP SECTIONS [clause[[,] clause]...]
[!$OMP SECTION]
block
[!$OMP SECTION
block]
. . .
!$OMP END SECTIONS [NOWAIT]
clause

The clause can be one of the following:

  • PRIVATE(var[, var] ...)

  • FIRSTPRIVATE(var[, var] ...)

  • LASTPRIVATE(var[, var] ...)

  • REDUCTION({ operator|intrinsic}:var[, var] ...)

The PRIVATE, FIRSTPRIVATE, LASTPRIVATE, and REDUCTION clauses are described in “Data Scope Attribute Clauses”.

block

Denotes a structured block of Fortran statements. You cannot branch into or out of the block.

Each section must be preceded by a SECTION directive, though the SECTION directive is optional for the first section. The SECTION directives must appear within the lexical extent of the SECTIONS/END SECTIONS directive pair. The last section ends at the END SECTIONS directive. Threads that complete execution of their sections wait at a barrier at the END SECTIONS directive unless a NOWAIT is specified.

The following restrictions apply to the SECTIONS directive:

  • The code enclosed in a SECTIONS/END SECTIONS directive pair must be a structured block. In addition, each constituent section must also be a structured block. You cannot branch into or out of the constituent section blocks.

  • You cannot have a SECTION directive outside the lexical extent of the SECTIONS/END SECTIONS directive pair.
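As an illustrative sketch (XAXIS and YAXIS are placeholder subroutines), the following fragment lets two independent calls execute concurrently on different threads of the enclosing parallel region:

!$OMP PARALLEL
!$OMP SECTIONS
!$OMP SECTION
      CALL XAXIS
!$OMP SECTION
      CALL YAXIS
!$OMP END SECTIONS
!$OMP END PARALLEL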

Request Single-thread Execution: SINGLE and END SINGLE Directives

The SINGLE directive specifies that the enclosed code is to be executed by only one thread in the team. Threads in the team that are not executing the SINGLE directive wait at the END SINGLE directive unless NOWAIT is specified.

The format of this directive is as follows:

!$OMP SINGLE [clause[[,] clause]...]
 block
!$OMP END SINGLE [NOWAIT]
clause

The clause can be one of the following:

  • PRIVATE(var[, var] ...)

  • FIRSTPRIVATE(var[, var] ...)

The PRIVATE and FIRSTPRIVATE clauses are described in “Data Scope Attribute Clauses”.

block

Denotes a structured block of Fortran statements. You cannot branch into or out of the block.

Example. In the following code fragment, the first thread that encounters the SINGLE directive executes subroutines OUTPUT and INPUT. You must not make any assumptions as to which thread will execute the SINGLE section. All other threads will skip the SINGLE section and stop at the barrier at the END SINGLE construct. If other threads can proceed without waiting for the thread executing the SINGLE section, a NOWAIT clause can be specified on the END SINGLE directive.

!$OMP PARALLEL DEFAULT(SHARED)
      CALL WORK(X)
!$OMP BARRIER
!$OMP SINGLE
      CALL OUTPUT(X)
      CALL INPUT(Y)
!$OMP END SINGLE
      CALL WORK(Y)
!$OMP END PARALLEL

Combined Parallel Work-sharing Constructs

The combined parallel work-sharing constructs are shortcuts for specifying a parallel region that contains only one work-sharing construct. The semantics of these directives are identical to those of explicitly specifying a PARALLEL directive followed by a single work-sharing construct.

The following sections describe the combined parallel work-sharing directives:

Declare a Parallel Region: PARALLEL DO and END PARALLEL DO Directives

The PARALLEL DO directive provides a shortcut form for specifying a parallel region that contains a single DO directive.

The format of this directive is as follows:

!$OMP PARALLEL DO [clause[[,]clause]...]
do_loop
[!$OMP END PARALLEL DO]
clause

clause can be one or more of the clauses accepted by the PARALLEL directive or the DO directive. These clauses are as follows:

  • PRIVATE(var[, var] ...)

  • FIRSTPRIVATE(var[, var] ...)

  • LASTPRIVATE(var[, var] ...)

  • REDUCTION({operator|intrinsic}:var[, var] ...)

  • SCHEDULE(type[,chunk])

  • ORDERED

  • SHARED(var[, var] ...)

  • DEFAULT(PRIVATE | SHARED | NONE)

  • IF(scalar_logical_expression)

  • COPYIN(var[, var] ...)

The SCHEDULE and ORDERED clauses are described in “Specify Parallel Execution: DO and END DO Directives”. The IF clause is described in “Parallel Region Constructs (PARALLEL and END PARALLEL Directives)”. The SHARED, DEFAULT, COPYIN, PRIVATE, FIRSTPRIVATE, LASTPRIVATE, and REDUCTION clauses are described in “Data Scope Attribute Clauses”.

For information on the PARALLEL directive, see “Parallel Region Constructs (PARALLEL and END PARALLEL Directives)”. For information on the DO directive, see “Specify Parallel Execution: DO and END DO Directives”.

do_loop

A DO loop.

If the END PARALLEL DO directive is not specified, the PARALLEL DO is assumed to end with the DO loop that immediately follows the PARALLEL DO directive. If used, the END PARALLEL DO directive must appear immediately after the end of the DO loop.

The semantics are identical to explicitly specifying a PARALLEL directive immediately followed by a DO directive.

Example. The following example shows how to parallelize a simple loop:

!$OMP PARALLEL DO
      DO I=1,N
        B(I) = (A(I) + A(I-1)) / 2.0
      ENDDO
!$OMP END PARALLEL DO

In the preceding code, the loop iteration variable is private by default, so it is not necessary to declare it explicitly. The END PARALLEL DO directive is optional.

Declare Sections within a Parallel Region: PARALLEL SECTIONS and END PARALLEL SECTIONS Directives

The PARALLEL SECTIONS directive provides a shortcut form for specifying a parallel region that contains a single SECTIONS directive. The semantics are identical to explicitly specifying a PARALLEL directive immediately followed by a SECTIONS directive.

The format of this directive is as follows:

!$OMP PARALLEL SECTIONS [clause[[,] clause]...]
[!$OMP SECTION ]
block
[!$OMP SECTION
block]
. . .
!$OMP END PARALLEL SECTIONS
clause

clause can be one or more of the clauses accepted by the PARALLEL directive or the SECTIONS directive. These clauses are as follows:

  • PRIVATE(var[, var] ...)

  • FIRSTPRIVATE(var[, var] ...)

  • LASTPRIVATE(var[, var] ...)

  • REDUCTION({ operator|intrinsic}:var[, var] ...)

  • SHARED(var[, var] ...)

  • DEFAULT(PRIVATE | SHARED | NONE)

  • IF(scalar_logical_expression)

  • COPYIN(var[, var] ...)

The IF clause is described in “Parallel Region Constructs (PARALLEL and END PARALLEL Directives)”. The SHARED, DEFAULT, COPYIN, PRIVATE, FIRSTPRIVATE, LASTPRIVATE, and REDUCTION clauses are described in “Data Scope Attribute Clauses”.

For more information on the PARALLEL directive, see “Parallel Region Constructs (PARALLEL and END PARALLEL Directives)”. For more information on the SECTIONS directive, see “Mark Code for Specific Threads: SECTION, SECTIONS and END SECTIONS Directives”.

block

Denotes a structured block of Fortran statements. You cannot branch into or out of the block.

The last section ends at the END PARALLEL SECTIONS directive.

Example. In the following code fragment, subroutines XAXIS, YAXIS, and ZAXIS can be executed concurrently. The first SECTION directive is optional. All the SECTION directives need to appear in the lexical extent of the PARALLEL SECTIONS/END PARALLEL SECTIONS construct.

!$OMP PARALLEL SECTIONS
!$OMP SECTION
      CALL XAXIS
!$OMP SECTION
      CALL YAXIS
!$OMP SECTION
      CALL ZAXIS
!$OMP END PARALLEL SECTIONS

Synchronization Constructs

The following sections describe the synchronization constructs:

Request Execution by the Master Thread: MASTER and END MASTER Directives

The code enclosed within MASTER and END MASTER directives is executed by the master thread.

These directives have the following format:

!$OMP MASTER
block
!$OMP END MASTER
block

Denotes a structured block of Fortran statements. You cannot branch into or out of the block.

The other threads in the team skip the enclosed section of code and continue execution. There is no implied barrier either on entry to or exit from the master section.
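As an illustrative sketch (WORK and REPORT are placeholder subroutines), the following fragment has all threads compute and then lets only the master thread report the result; the explicit BARRIER ensures that all computation is finished before the report:

!$OMP PARALLEL DEFAULT(SHARED)
      CALL WORK(X)
!$OMP BARRIER
!$OMP MASTER
      CALL REPORT(X)
!$OMP END MASTER
!$OMP END PARALLEL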

Request Execution by a Single Thread: CRITICAL and END CRITICAL Directives

The CRITICAL and END CRITICAL directives restrict access to the enclosed code to one thread at a time.

These directives have the following format:

!$OMP CRITICAL [(name)]
block
!$OMP END CRITICAL [(name)]
name

Identifies the critical section.

If a name is specified on a CRITICAL directive, the same name must also be specified on the END CRITICAL directive. If no name appears on the CRITICAL directive, no name can appear on the END CRITICAL directive.

block

Denotes a structured block of Fortran statements. You cannot branch into or out of the block.

A thread waits at the beginning of a critical section until no other thread in the team is executing a critical section with the same name. All unnamed CRITICAL directives map to the same name. Critical section names are global entities of the program. If a name conflicts with any other entity, the behavior of the program is undefined.

Example. The following code fragment includes several CRITICAL directives. The example illustrates a queuing model in which a task is dequeued and worked on. To guard against multiple threads dequeuing the same task, the dequeuing operation must be in a critical section. Because there are two independent queues in this example, each queue is protected by CRITICAL directives with different names, XAXIS and YAXIS, respectively.

!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(X,Y)
!$OMP CRITICAL(XAXIS)
      CALL DEQUEUE(IX_NEXT, X)
!$OMP END CRITICAL(XAXIS)
      CALL WORK(IX_NEXT, X)
!$OMP CRITICAL(YAXIS)
      CALL DEQUEUE(IY_NEXT,Y)
!$OMP END CRITICAL(YAXIS)
      CALL WORK(IY_NEXT, Y)
!$OMP END PARALLEL

Synchronize All Threads in a Team: BARRIER Directive

The BARRIER directive synchronizes all the threads in a team. When it encounters a barrier, a thread waits until all other threads in that team have reached the same point.

This directive has the following format:

!$OMP BARRIER
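A minimal sketch (SETUP and COMPUTE are placeholder subroutines): no thread starts COMPUTE until every thread has finished SETUP, which matters when COMPUTE reads shared data written during SETUP:

!$OMP PARALLEL DEFAULT(SHARED)
      CALL SETUP(X)
!$OMP BARRIER
      CALL COMPUTE(X)
!$OMP END PARALLEL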

Protect a Location from Multiple Updates: ATOMIC Directive

The ATOMIC directive ensures that a specific memory location is updated atomically, rather than exposing it to the possibility of multiple, simultaneous writing threads.

This directive has the following format:

!$OMP ATOMIC

This directive applies only to the immediately following statement, which must have one of the following forms:

x = x operator expr
x = expr operator x
x = intrinsic (x, expr)
x = intrinsic (expr, x)

In the preceding statements:

  • x is a scalar variable of intrinsic type. All references to storage location x must have the same type and type parameters.

  • expr is a scalar expression that does not reference x.

  • intrinsic is one of MAX, MIN, IAND, IOR, or IEOR.

  • operator is one of +, *, -, /, .AND., .OR., .EQV., or .NEQV.

Only the load and store of x are atomic; the evaluation of expr is not atomic. To avoid race conditions, all updates of the location in parallel must be protected with the ATOMIC directive, except those that are known to be free of race conditions.

Example 1. The following code fragment uses the ATOMIC directive:

!$OMP ATOMIC
      X(INDEX(I)) = Y(INDEX(I)) + B

Example 2. The following code fragment avoids race conditions by protecting all simultaneous updates of the location, by multiple threads, with the ATOMIC directive:

!$OMP PARALLEL DO DEFAULT(PRIVATE) SHARED(X,Y,INDEX,N)
      DO I=1,N
        CALL WORK(XLOCAL, YLOCAL)
!$OMP ATOMIC
        X(INDEX(I)) = X(INDEX(I)) + XLOCAL
        Y(I) = Y(I) + YLOCAL
      ENDDO

Note that the ATOMIC directive applies only to the Fortran statement that immediately follows it. As a result, Y is not updated atomically in the preceding code.

Read and Write Variables to Memory: FLUSH Directive

The FLUSH directive identifies synchronization points at which thread-visible variables are written back to memory. This directive must appear at the precise point in the code at which the synchronization is required.

Thread-visible variables include the following data items:

  • Globally visible variables (common blocks and modules)

  • Local variables that do not have the SAVE attribute but have had their address taken and saved or have had their address passed to another subprogram

  • Local variables that do not have the SAVE attribute that are declared shared in a parallel region within the subprogram

  • Dummy arguments

  • All pointer dereferences

This directive has the following format:

!$OMP FLUSH [(var[, var] ...)]
var

Variables to be flushed.

An implicit FLUSH directive is assumed for the following directives:

  • BARRIER

  • CRITICAL and END CRITICAL

  • END DO

  • END PARALLEL

  • END SECTIONS

  • END SINGLE

  • ORDERED and END ORDERED

The directive is not implied if a NOWAIT clause is present.

Example. The following example uses the FLUSH directive for point-to-point synchronization between pairs of threads:

!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(ISYNC)
      IAM = OMP_GET_THREAD_NUM()
      ISYNC(IAM) = 0
!$OMP BARRIER
      CALL WORK()
!
!I AM DONE WITH MY WORK, SYNCHRONIZE WITH MY NEIGHBOR
!
      ISYNC(IAM) = 1
!$OMP FLUSH(ISYNC)
!
!WAIT TILL NEIGHBOR IS DONE
!
      DO WHILE (ISYNC(NEIGH) .EQ. 0)
!$OMP FLUSH(ISYNC)
      END DO
!$OMP END PARALLEL

Request Sequential Ordering: ORDERED and END ORDERED Directives

The code enclosed within ORDERED and END ORDERED directives is executed in the order in which it would be executed in a sequential execution of an enclosing parallel loop.

These directives have the following format:

!$OMP ORDERED
block
!$OMP END ORDERED
block

Denotes a structured block of Fortran statements. You cannot branch into or out of the block.

An ORDERED directive can appear only in the dynamic extent of a DO or PARALLEL DO directive. This DO directive must have the ORDERED clause specified. For more information on the DO directive, see “Specify Parallel Execution: DO and END DO Directives”. For information on directive binding, see “Directive Binding”.

Only one thread is allowed in an ordered section at a time. Threads are allowed to enter in the order of the loop iterations. No thread can enter an ordered section until it is guaranteed that all previous iterations have completed or will never execute an ordered section. This sequentializes and orders code within ordered sections while allowing code outside the section to run in parallel. ORDERED sections that bind to different DO directives are independent of each other.

The following restrictions apply to the ORDERED directive:

  • An ORDERED directive cannot bind to a DO directive that does not have the ORDERED clause specified.

  • An iteration of a loop with a DO directive must not execute the same ORDERED directive more than once, and it must not execute more than one ORDERED directive.

Example. Ordered sections are useful for sequentially ordering the output from work that is done in parallel. Assuming that a reentrant I/O library exists, the following program prints out the indexes in sequential order:

!$OMP DO ORDERED SCHEDULE(DYNAMIC)
      DO I=LB,UB,ST
        CALL WORK(I)
      END DO

      SUBROUTINE WORK(K)
!$OMP ORDERED
      WRITE(*,*) K
!$OMP END ORDERED
      END

Data Environment Constructs

The following subsections present constructs for controlling the data environment during the execution of parallel constructs. “Declare Common Blocks Private to a Thread: THREADPRIVATE Directive”, describes the THREADPRIVATE directive, which makes common blocks local to a thread. “Data Scope Attribute Clauses”, describes directive clauses that affect the data environment.

Declare Common Blocks Private to a Thread: THREADPRIVATE Directive

The THREADPRIVATE directive makes named common blocks private to a thread but global within the thread. In other words, each thread executing a THREADPRIVATE directive receives its own private copy of the named common blocks, which are then available to it in any routine within the scope of an application.

This directive must appear in the declaration section of the routine after the declaration of the listed common blocks. Each thread gets its own copy of the common block, so data written to the common block by one thread is not directly visible to other threads. During serial portions and MASTER sections of the program, accesses are to the master thread's copy of the common block.

On entry to the first parallel region, data in the THREADPRIVATE common blocks should be assumed to be undefined unless a COPYIN clause is specified on the PARALLEL directive. When a common block that is initialized using DATA statements appears in a THREADPRIVATE directive, each thread's copy is initialized once prior to its first use. For subsequent parallel regions, the data in the THREADPRIVATE common blocks are guaranteed to persist only if the dynamic threads mechanism has been disabled and if the number of threads is the same for all the parallel regions.

For more information on dynamic threads, see the OMP_SET_DYNAMIC(3) library routine and the OMP_DYNAMIC environment variable on the pe_environ(5) man page.

The format of this directive is as follows:

!$OMP THREADPRIVATE(/cb/[,/cb/]...)
cb

The name of the common block to be made private to a thread. Only named common blocks can be made thread private.

The following restrictions apply to the THREADPRIVATE directive:

  • The THREADPRIVATE directive must appear after every declaration of a thread private common block.

  • You cannot use a THREADPRIVATE common block or its constituent variables in any clause other than a COPYIN clause. As a result, they are not permitted in a PRIVATE, FIRSTPRIVATE, LASTPRIVATE, SHARED, or REDUCTION clause. They are not affected by the DEFAULT clause.

You can use the mp_shmem library routines for communicating between threads. For information on these routines, see the mp(3f) man page.
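The following sketch (illustrative; the common block /CNT/ and the variable ICOUNT are hypothetical) shows the required placement of the directive and the use of a COPYIN clause to initialize each thread's copy from the master thread's copy:

      PROGRAM EXAMPLE
      INTEGER OMP_GET_THREAD_NUM
      COMMON /CNT/ ICOUNT
!$OMP THREADPRIVATE(/CNT/)
      ICOUNT = 100
!$OMP PARALLEL COPYIN(/CNT/)
!     EACH THREAD'S PRIVATE COPY OF /CNT/ STARTS AT 100
      ICOUNT = ICOUNT + OMP_GET_THREAD_NUM()
!$OMP END PARALLEL
      END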

Data Scope Attribute Clauses

Several directives accept clauses that allow a user to control the scope attributes of variables for the duration of the construct. Not all of the clauses in this section are allowed on all directives, but the clauses that are valid on a particular directive are included with the description of the directive. Usually, if no data scope clauses are specified for a directive, the default scope for variables affected by the directive is SHARED. Exceptions to this are described in “Data Environment Rules”.

The following sections describe the data scope attribute clauses:

PRIVATE Clause

The PRIVATE clause declares variables to be private to each thread in a team.

This clause has the following format:

PRIVATE(var[, var] ...)
var

A named variable or named common block that is accessible in the scoping unit. Subobjects cannot be specified. If a named common block is specified, its name must appear between slashes.

The behavior of a variable declared in a PRIVATE clause is as follows:

  • A new object of the same type is declared once for each thread in the team. The new object is no longer storage associated with the storage location of the original object.

  • All references to the original object in the lexical extent of the directive construct are replaced with references to the private object.

  • Variables defined as PRIVATE are undefined for each thread on entering the construct, and the corresponding shared variable is undefined on exit from a parallel construct.

  • Contents, allocation state, and association status of variables defined as PRIVATE are undefined when they are referenced outside the lexical extent (but inside the dynamic extent) of the construct, unless they are passed as actual arguments to called routines.

Example. The following example shows how to scope variables with the PRIVATE clause:

      INTEGER I,J
      I = 1
      J = 2
!$OMP PARALLEL PRIVATE(I) FIRSTPRIVATE(J)
      I = 3
      J = J+ 2
!$OMP END PARALLEL
      PRINT *, I, J

In the preceding code, the values of I and J are undefined on exit from the parallel region.

SHARED Clause

The SHARED clause makes variables shared among all the threads in a team. All threads within a team access the same storage area for SHARED data.

This clause has the following format:

SHARED(var[, var] ...)
var

A named variable or named common block that is accessible in the scoping unit. Subobjects cannot be specified. If a named common block is specified, its name must appear between slashes.
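A minimal sketch: all threads operate on the same arrays A and B, and each iteration writes a distinct element of A, so no two threads update the same location:

!$OMP PARALLEL DO SHARED(A,B,N) PRIVATE(I)
      DO I = 1, N
        A(I) = 2.0 * B(I)
      END DO
!$OMP END PARALLEL DO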

DEFAULT Clause

The DEFAULT clause allows the user to specify a PRIVATE, SHARED, or NONE default scope attribute for all variables in the lexical extent of any parallel region. Variables in THREADPRIVATE common blocks are not affected by this clause.

This clause has the following format:

DEFAULT(PRIVATE | SHARED| NONE)

The PRIVATE, SHARED, and NONE specifications have the following effects:

  • Specifying DEFAULT(PRIVATE) makes all named objects in the lexical extent of the parallel region, including common block variables but excluding THREADPRIVATE variables, private to a thread as if each variable were listed explicitly in a PRIVATE clause.

  • Specifying DEFAULT(SHARED) makes all named objects in the lexical extent of the parallel region shared among the threads in a team, as if each variable were listed explicitly in a SHARED clause. In the absence of an explicit DEFAULT clause, the default behavior is the same as if DEFAULT(SHARED) were specified.

  • Specifying DEFAULT(NONE) declares that there is no implicit default as to whether variables are PRIVATE or SHARED. In this case, the PRIVATE, SHARED, FIRSTPRIVATE, LASTPRIVATE, or REDUCTION attribute of each variable used in the lexical extent of the parallel region must be specified.

Only one DEFAULT clause can be specified on a PARALLEL directive.

Variables can be exempted from a defined default using the PRIVATE, SHARED, FIRSTPRIVATE, LASTPRIVATE, and REDUCTION clauses. As a result, the following example is valid:

!$OMP PARALLEL DO DEFAULT(PRIVATE), FIRSTPRIVATE(I),SHARED(X),
!$OMP& SHARED(R) LASTPRIVATE(I)

FIRSTPRIVATE Clause

The FIRSTPRIVATE clause provides a superset of the functionality provided by the PRIVATE clause.

This clause has the following format:

FIRSTPRIVATE(var[, var] ...)
var

A named variable or named common block that is accessible in the scoping unit. Subobjects cannot be specified. If a named common block is specified, its name must appear between slashes.

Variables specified are subject to PRIVATE clause semantics described in “PRIVATE Clause”. In addition, private copies of the variables are initialized from the original object existing before the construct.

LASTPRIVATE Clause

The LASTPRIVATE clause provides a superset of the functionality provided by the PRIVATE clause.

When the LASTPRIVATE clause appears on a DO directive, the thread that executes the sequentially last iteration updates the version of the object it had before the construct. When the LASTPRIVATE clause appears in a SECTIONS directive, the thread that executes the lexically last SECTION updates the version of the object it had before the construct. Subobjects that are not assigned a value by the last iteration of the DO or the lexically last SECTION of the SECTIONS directive are undefined after the construct.

This clause has the following format:

LASTPRIVATE(var[, var] ...)
var

A named variable or named common block that is accessible in the scoping unit. Subobjects cannot be specified. If a named common block is specified, its name must appear between slashes.

Each var is subject to the PRIVATE clause semantics described in “PRIVATE Clause”.

Example. Correct execution sometimes depends on the value that the last iteration of a loop assigns to a variable. Such programs must list all such variables as arguments to a LASTPRIVATE clause so that the values of the variables are the same as when the loop is executed sequentially.

!$OMP PARALLEL
!$OMP DO LASTPRIVATE(I)
      DO I=1,N
        A(I) = B(I) + C(I)
      ENDDO
!$OMP END PARALLEL
      CALL REVERSE(I)

In the preceding code fragment, the value of I at the end of the parallel region will equal N+1, as in the sequential case.

REDUCTION Clause

This clause performs a reduction on the variables specified, with the operator or the intrinsic specified.

This clause has the following format:

REDUCTION({operator|intrinsic}:var[, var] ...)
operator

Specify one of the following: +, *, -, .AND., .OR., .EQV., or .NEQV.

intrinsic

Specify one of the following: MAX, MIN, IAND, IOR, or IEOR.

var

A named scalar variable of intrinsic type that is accessible in the scoping unit. Subobjects cannot be specified.

Variables that appear in a REDUCTION clause must be SHARED in the enclosing context. A private copy of each var is created for each thread as if the PRIVATE clause had been used. The private copy is initialized according to the operator. For more information, see Table 4-1.

At the end of the REDUCTION, the shared variable is updated to reflect the result of combining the original value of the (shared) reduction variable with the final value of each of the private copies using the operator specified. The reduction operators are all associative (except for subtraction), and the compiler can freely reassociate the computation of the final value (the partial results of a subtraction reduction are added to form the final value).

The value of the shared variable becomes undefined when the first thread reaches the containing clause, and it remains so until the reduction computation is complete. Normally, the computation is complete at the end of the REDUCTION construct; however, if the REDUCTION clause is used on a construct to which NOWAIT is also applied, the shared variable remains undefined until a barrier synchronization has been performed to ensure that all the threads have completed the REDUCTION clause.

The REDUCTION clause is intended to be used on a region or work-sharing construct in which the reduction variable is used only in reduction statements with one of the following forms:

x = x operator expr
x = expr operator x  (except for subtraction)
x = intrinsic (x, expr)
x = intrinsic (expr, x)

Some reductions can be expressed in other forms. For instance, a MAX reduction might be expressed as follows:

IF (x .LT. expr) x = expr

Alternatively, the reduction might be hidden inside a subroutine call. You must ensure that the operator specified in the REDUCTION clause matches the reduction operation.

The following table lists the operators and intrinsics that are valid and their canonical initialization values. The actual initialization value will be consistent with the data type of the reduction variable.

Table 4-1. Initialization values

Operator/Intrinsic    Initialization
+                     0
*                     1
-                     0
.AND.                 .TRUE.
.OR.                  .FALSE.
.EQV.                 .TRUE.
.NEQV.                .FALSE.
MAX                   Smallest representable number
MIN                   Largest representable number
IAND                  All bits on
IOR                   0
IEOR                  0

Any number of reduction clauses can be specified on the directive, but a variable can appear only once in a REDUCTION clause for that directive.

Example 1. The following directive line shows use of the REDUCTION clause:

!$OMP DO REDUCTION(+: A, Y) REDUCTION(.OR.: AM)

Example 2. The following code fragment shows how to use the REDUCTION clause:

!$OMP PARALLEL DO DEFAULT(PRIVATE) REDUCTION(+: A,B)
      DO I=1,N
        CALL WORK(ALOCAL,BLOCAL)
        A = A + ALOCAL
        B = B + BLOCAL
      ENDDO
!$OMP END PARALLEL DO

COPYIN Clause

The COPYIN clause applies only to common blocks that are declared THREADPRIVATE. A COPYIN clause on a parallel region specifies that the data in the master thread of the team be copied to the thread private copies of the common block at the beginning of the parallel region.

This clause has the following format:

COPYIN(var[, var] ...)
var

A named variable or named common block that is accessible in the scoping unit. Subobjects cannot be specified. If a named common block is specified, its name must appear between slashes.

It is not necessary to specify a whole common block to be copied in.

Example. In the following example, the common blocks BLK1 and FIELDS are specified as thread private, but only one of the variables in common block FIELDS is specified to be copied in:

      COMMON /BLK1/ SCRATCH
      COMMON /FIELDS/ XFIELD, YFIELD, ZFIELD
!$OMP THREADPRIVATE(/BLK1/, /FIELDS/)
!$OMP PARALLEL DEFAULT(PRIVATE) COPYIN(/BLK1/,ZFIELD)

Data Environment Rules

The following rules and restrictions apply with respect to data scope:

  1. Sequential DO loop control variables in the lexical extent of a PARALLEL region that would otherwise be SHARED based on default rules are automatically made private on the PARALLEL directive. Sequential DO loop control variables with no enclosing PARALLEL region are not classified automatically. You must guarantee that these indexes are private if the containing procedures are called from a PARALLEL region.

    All implied DO loop control variables are automatically made private at the enclosing implied DO construct.

  2. Variables that are made private in a parallel region cannot be made private again on an enclosed work-sharing directive. As a result, variables that appear in the PRIVATE, FIRSTPRIVATE, LASTPRIVATE, and REDUCTION clauses on a work-sharing directive have shared scope in the enclosing parallel region.

  3. A variable that appears in a PRIVATE, FIRSTPRIVATE, LASTPRIVATE, or REDUCTION clause must be definable.

  4. Assumed-size and assumed-shape arrays cannot be specified as PRIVATE, FIRSTPRIVATE, or LASTPRIVATE. Array dummy arguments that are explicitly shaped (including variably dimensioned) can be declared in any scoping clause.

  5. Fortran pointers and allocatable arrays can be declared as PRIVATE or SHARED but not as FIRSTPRIVATE or LASTPRIVATE.

    Within a parallel region, the initial status of a private pointer is undefined. Private pointers that become allocated during the execution of a parallel region should be explicitly deallocated by the program prior to the end of the parallel region to avoid memory leaks.

    The association status of a SHARED pointer becomes undefined upon entry to and on exit from the parallel construct if it is associated with a target or a subobject of a target that is PRIVATE, FIRSTPRIVATE, LASTPRIVATE, or REDUCTION inside the parallel construct. An allocatable array declared PRIVATE has an allocation status of not currently allocated on entry to and on exit from the construct.

  6. PRIVATE or SHARED attributes can be declared for a Cray pointer but not for the pointee. The scope attribute for the pointee is determined at the point of pointer definition. You cannot declare a scope attribute for a pointee. Cray pointers cannot be specified in FIRSTPRIVATE or LASTPRIVATE clauses.

  7. Scope clauses apply only to variables in the static extent of the directive on which the clause appears, with the exception of variables passed as actual arguments. Local variables in called routines that do not have the SAVE attribute are PRIVATE. Common blocks and modules in called routines in the dynamic extent of a parallel region always have an implicit SHARED attribute, unless they are THREADPRIVATE common blocks.

  8. When a named common block is declared as PRIVATE, FIRSTPRIVATE, or LASTPRIVATE, none of its constituent elements may be declared in another scope attribute. When individual members of a common block are privatized, the storage of the specified variables is no longer associated with the storage of the common block itself.

  9. Variables that are not allowed in the PRIVATE and SHARED clauses are not affected by DEFAULT(PRIVATE) or DEFAULT(SHARED) clauses, respectively.

  10. Clauses can be repeated as needed, but each variable can appear explicitly in only one clause per directive, with the following exceptions:

    • A variable can be specified as both FIRSTPRIVATE and LASTPRIVATE.

    • Variables affected by the DEFAULT clause can be listed explicitly in a clause to override the default specification.

Directive Binding

Some directives are bound to other directives. A binding specifies the way in which one directive is related to another. For instance, a directive is bound to a second directive if it can appear in the dynamic extent of that second directive. The following rules apply with respect to the dynamic binding of directives:

  • The DO, SECTIONS, SINGLE, MASTER, and BARRIER directives bind to the dynamically enclosing PARALLEL directive, if one exists.

  • The ORDERED directive binds to the dynamically enclosing DO directive.

  • The ATOMIC directive enforces exclusive access with respect to ATOMIC directives in all threads, not just the current team.

  • The CRITICAL directive enforces exclusive access with respect to CRITICAL directives in all threads, not just the current team.

  • A directive can never bind to any directive outside the closest enclosing PARALLEL.

Example 1. The directive binding rules call for a BARRIER directive to bind to the closest enclosing PARALLEL directive.

In the following example, the call from MAIN to SUB2 is valid because the BARRIER (in SUB3) binds to the PARALLEL region in SUB2. The call from MAIN to SUB1 is valid because the BARRIER binds to the PARALLEL region in subroutine SUB2.

      PROGRAM MAIN
      CALL SUB1(2)
      CALL SUB2(2)
      END

      SUBROUTINE SUB1(N)
!$OMP PARALLEL PRIVATE(I) SHARED(N)
!$OMP DO
      DO I = 1, N
      CALL SUB2(I)
      END DO
!$OMP END PARALLEL
      END

      SUBROUTINE SUB2(K)
!$OMP PARALLEL SHARED(K)
      CALL SUB3(K)
!$OMP END PARALLEL
      END

      SUBROUTINE SUB3(N)
      CALL WORK(N)
!$OMP BARRIER
      CALL WORK(N)
      END

Example 2. The following program shows inner and outer DO directives that bind to different PARALLEL regions:

!$OMP PARALLEL DEFAULT(SHARED)
!$OMP DO
      DO I = 1, N
!$OMP PARALLEL SHARED(I,N)
!$OMP DO
        DO J = 1, N
          CALL WORK(I,J)
        END DO
!$OMP END PARALLEL
      END DO
!$OMP END PARALLEL

The following variation of the preceding example also shows correct binding:

!$OMP PARALLEL DEFAULT(SHARED)
!$OMP DO
      DO I = 1, N
        CALL SOME_WORK(I,N)
      END DO
!$OMP END PARALLEL

      SUBROUTINE SOME_WORK(I,N)
!$OMP PARALLEL DEFAULT(SHARED)
!$OMP DO
      DO J = 1, N
        CALL WORK(I,J)
      END DO
!$OMP END PARALLEL
      RETURN
      END

Directive Nesting

The following rules apply to the dynamic nesting of directives:

  • A PARALLEL directive dynamically inside another PARALLEL directive logically establishes a new team, which is composed of only the current thread unless nested parallelism is enabled.

  • DO, SECTIONS, and SINGLE directives that bind to the same PARALLEL directive cannot be nested one inside the other.

  • DO, SECTIONS, and SINGLE directives are not permitted in the dynamic extent of CRITICAL and MASTER directives.

  • BARRIER directives are not permitted in the dynamic extent of DO, SECTIONS, SINGLE, MASTER, and CRITICAL directives.

  • MASTER directives are not permitted in the dynamic extent of DO, SECTIONS, and SINGLE directives.

  • ORDERED sections are not allowed in the dynamic extent of CRITICAL sections.

  • Any directive set that is legal when executed dynamically inside a PARALLEL region is also legal when executed outside a parallel region. When executed dynamically outside a user-specified parallel region, the directive is executed with respect to a team composed of only the master thread.

Example 1. The following example is incorrect because the inner and outer DO directives are nested and bind to the same PARALLEL directive:

      PROGRAM WRONG1
!$OMP PARALLEL DEFAULT(SHARED)
!$OMP DO
      DO I = 1, N
!$OMP DO
        DO J = 1, N
          CALL WORK(I,J)
        END DO
      END DO
!$OMP END PARALLEL
      END

The following dynamically nested version of the preceding code is also incorrect:

      PROGRAM WRONG2
!$OMP PARALLEL DEFAULT(SHARED)
!$OMP DO
      DO I = 1, N
        CALL SOME_WORK(I,N)
      END DO
!$OMP END PARALLEL

      SUBROUTINE SOME_WORK(I,N)
!$OMP DO
      DO J = 1, N
        CALL WORK(I,J)
      END DO
      RETURN
      END

Example 2. The following example is incorrect because the DO and SINGLE directives are nested, and they bind to the same PARALLEL region:

      PROGRAM WRONG3
!$OMP PARALLEL DEFAULT(SHARED)
!$OMP DO
      DO I = 1, N
!$OMP SINGLE
      CALL WORK(I)
!$OMP END SINGLE
      END DO
!$OMP END PARALLEL
      END

Example 3. The following example is incorrect because a BARRIER directive inside a SINGLE or a DO directive can result in deadlock:

      PROGRAM WRONG4
!$OMP PARALLEL DEFAULT(SHARED)
!$OMP DO
      DO I = 1, N
        CALL WORK(I)
!$OMP BARRIER
        CALL MORE_WORK(I)
      END DO
!$OMP END PARALLEL
      END

Example 4. The following example is incorrect; the BARRIER results in deadlock because only one thread at a time can enter the critical section:

      PROGRAM WRONG5
!$OMP PARALLEL DEFAULT(SHARED)
!$OMP CRITICAL
      CALL WORK(N,1)
!$OMP BARRIER
      CALL MORE_WORK(N,2)
!$OMP END CRITICAL
!$OMP END PARALLEL
      END

Example 5. The following example is incorrect; the BARRIER results in deadlock because only one thread executes the SINGLE section:

      PROGRAM WRONG6
!$OMP PARALLEL DEFAULT(SHARED)
      CALL SETUP(N)
!$OMP SINGLE
      CALL WORK(N,1)
!$OMP BARRIER
      CALL MORE_WORK(N,2)
!$OMP END SINGLE
      CALL FINISH(N)
!$OMP END PARALLEL
      END

Analyzing Data Dependencies for Multiprocessing

The essential condition required to parallelize a loop correctly is that each iteration of the loop must be independent of all other iterations. If a loop meets this condition, then the order in which the iterations of the loop execute is not important. They can be executed backward or at the same time, and the answer is still the same. This property is captured by the notion of data independence.

For a loop to be data independent, no iterations of the loop can write a value into a memory location that is read or written by any other iteration of that loop. It is all right if the same iteration reads and/or writes a memory location repeatedly as long as no others do; it is all right if many iterations read the same location as long as none of them write to it.

In a Fortran program, memory locations are represented by variable names. So, to determine if a particular loop can be run in parallel, examine the way variables are used in the loop. Because data dependence occurs only when memory locations are modified, pay particular attention to variables that appear on the left-hand side of assignment statements. If a variable is neither modified nor passed to a function or subroutine, there is no data dependence associated with it.

The Fortran compiler supports four kinds of variable usage within a parallel loop: SHARED, PRIVATE, LASTPRIVATE, and REDUCTION. If a variable is declared as SHARED, all iterations of the loop use the same copy. If a variable is declared as PRIVATE, each iteration is given its own uninitialized copy. A variable is declared SHARED if it is only read (not written) within the loop or if it is an array where each iteration of the loop uses a different element of the array. A variable can be PRIVATE if its value does not depend on any other iteration and if its value is used only within a single iteration. The PRIVATE variable is essentially temporary; a new copy can be created in each loop iteration without changing the final answer. As a special case, if only the last value of a variable computed on the last iteration is used outside the loop (but would otherwise qualify as a PRIVATE variable), the loop can be multiprocessed by declaring the variable to be LASTPRIVATE.
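
For example, in the following sketch (illustrative code, not taken from the examples later in this chapter), X qualifies as LASTPRIVATE: each iteration computes its own X, and only the value from the final iteration is used after the loop.

!$OMP PARALLEL DO SHARED(A, B), LASTPRIVATE(X)
      DO I = 1, N
         X = A(I) + B(I)
         A(I) = 2.0*X
      END DO
!     AFTER THE LOOP, X HOLDS THE VALUE COMPUTED WHEN I = N
      PRINT *, X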

It is often difficult to analyze loops for data dependence information. Each use of each variable must be examined to determine if it fulfills the criteria for PRIVATE, LASTPRIVATE, SHARED, or REDUCTION. If all of the uses conform, the loop can be parallelized. If not, the loop cannot be parallelized as written, but can possibly be rewritten into an equivalent parallel form.

An alternative to manually analyzing variable usage is to use the MIPSpro Auto-Parallelizing Option (APO). This optional software package analyzes loops for data dependence. If the APO software determines that a loop is data-independent, it automatically inserts the required compiler directives. If it cannot determine if the loop is independent, it produces a listing file detailing where the problems lie. For more information on APO, see Chapter 9, “The Auto-Parallelizing Option (APO)”.

Dependency Analysis Examples

This section contains examples that show dependency analysis.

Example 1. Simple independence. In this example, each iteration writes to a different location in A, and none of the variables appearing on the right-hand side are ever written to; they are only read from. This loop can be correctly run in parallel. All the variables are SHARED except for I, which is either PRIVATE or LASTPRIVATE, depending on whether the last value of I is used later in the code.

      DO I = 1, N
         A(I) = X + B(I)*C(I)
      END DO

Example 2. Data dependence. The following code fragment contains A(I) on the left-hand side and A(I-1) on the right. This means that one iteration of the loop writes to a location in A and the next iteration reads from that same location. Because different iterations of the loop read and write the same memory location, this loop cannot be run in parallel.

      DO I = 2, N
         A(I) = B(I) - A(I-1)
      END DO

Example 3. Stride not 1. This example is similar to the previous example. The difference is that the stride of the DO loop is now 2 rather than 1. A(I) now references every other element of A, and A(I-1) references exactly those elements of A that are not referenced by A(I). None of the data locations on the right-hand side is ever the same as any of the data locations written to on the left-hand side. The data are disjoint, so there is no dependence. The loop can be run in parallel. Arrays A and B can be declared SHARED, while variable I should be declared PRIVATE or LASTPRIVATE.

      DO I = 2,N,2
         A(I) = B(I) - A(I-1)
      END DO

Example 4. Local variable. In the following loop, each iteration of the loop reads and writes the variable X. However, no loop iteration ever needs the value of X from any other iteration. X is used as a temporary variable; its value does not survive from one iteration to the next.

This loop can be parallelized by declaring X to be a PRIVATE variable within the loop. Note that B(I) is both read and written by the loop. This is not a problem because each iteration has a different value for I, so each iteration uses a different B(I). The same B(I) is allowed to be read and written as long as it is done by the same iteration of the loop. The loop can be run in parallel. Arrays A and B can be declared SHARED, while variable I should be declared PRIVATE or LASTPRIVATE.

      DO I = 1, N
         X = A(I)*A(I) + B(I)
         B(I) = X + B(I)*X
      END DO

Example 5. Function call. The value of X in any iteration of the following loop is independent of the value of X in any other iteration, so X can be made a PRIVATE variable. The loop can be run in parallel. Arrays A, B, C, and D can be declared SHARED, while variable I should be declared PRIVATE or LASTPRIVATE.

      DO I = 1, N
         X = SQRT(A(I))
         B(I) = X*C(I) + X*D(I)
      END DO

This loop invokes an intrinsic function, SQRT. It is possible to use functions and/or subroutines (intrinsic or user defined) within a parallel loop. However, verify that the parallel invocations of the routine do not interfere with one another. In particular, SQRT returns a value that depends only on its input argument, does not modify global data, and does not use static storage (it has no side effects).

The Fortran intrinsic functions have no side effects and can be used safely within a parallel loop. The intrinsic subroutines, however, can have side effects, and most Fortran library functions cannot safely be included in a parallel loop. In particular, rand is not safe for multiprocessing. For user-written routines, it is your responsibility to ensure that the routines can be correctly multiprocessed.
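
For example, a routine of the following form (a hypothetical illustration, not from the original text) is not safe to call from a parallel loop, because its SAVEd variable is static storage that all of the threads read and update concurrently:

      SUBROUTINE UNSAFE(I, IRESULT)
      INTEGER I, IRESULT
      INTEGER ICOUNT
      SAVE ICOUNT
      DATA ICOUNT /0/
!     ICOUNT IS SHARED STATIC STORAGE; CONCURRENT CALLS FROM
!     DIFFERENT THREADS UPDATE IT SIMULTANEOUSLY (A DATA RACE)
      ICOUNT = ICOUNT + 1
      IRESULT = I + ICOUNT
      RETURN
      END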


Caution: Do not use the -static option on the f90(1) command line when compiling routines called within a parallel loop.

Example 6. Rewritable data dependence. Here, the value of INDX survives the loop iteration and is carried into the next iteration. This loop cannot be parallelized as it is written. Making INDX a PRIVATE variable does not work; you need the value of INDX computed in the previous iteration. It is possible to rewrite this loop to make it parallel. See “Rewriting Data Dependencies”, for an example.

      INDX = 0
      DO I = 1, N
         INDX = INDX + I
         A(I) = B(I) + C(INDX)
      END DO

Example 7. Exit branch. The following loop contains an exit branch; that is, under certain conditions the flow of control suddenly exits the loop. The compiler cannot parallelize loops containing exit branches.

       DO I = 1, N
          IF (A(I) .LT. EPSILON) EXIT
          A(I) = A(I) * B(I)
       END DO

Example 8. Complicated independence. Initially, it appears that the following loop cannot be run in parallel because it uses both W(I) and W(I-K). However, because I varies from K+1 to 2*K, I-K varies from 1 to K. This means that the W(I-K) term varies from W(1) to W(K), while the W(I) term varies from W(K+1) to W(2*K). Therefore, W(I-K) in any iteration of the loop is never the same memory location as W(I) in any other iteration. Because there is no data overlap, there are no data dependencies, and this loop can be run in parallel. Arrays W and B and the variable K can be declared SHARED, but variable I should be declared PRIVATE or LASTPRIVATE.

      DO I = K+1, 2*K
         W(I) = W(I) + B(I,K) * W(I-K)
      END DO

The preceding code illustrates a general rule: the more complex the expression used to index an array, the harder it is to analyze. If the arrays in a loop are indexed only by the loop index variable, the analysis is usually straightforward.

Example 9. Inconsequential data dependence. The following loop contains a data dependence because it is possible that at some point I will be the same as INDEX, so a single data location is read by some iterations and written by another. In this special case, you can simply ignore the dependence: when I and INDEX are equal, the value written into A(I) is exactly the same as the value that is already there. The fact that some iterations of the loop read the value before it is written and some read it after is not important, because they all get the same value. Therefore, this loop can be parallelized. Array A can be declared SHARED, but variable I should be declared PRIVATE or LASTPRIVATE.

      INDEX = SELECT(N)
      DO I = 1, N
         A(I) = A(INDEX)
      END DO

Example 10. Local array. In the following code fragment, each iteration of the loop uses the same locations in array D. However, closer inspection reveals that array D is being used as a temporary, so the loop can be multiprocessed by declaring D to be PRIVATE. The Fortran compiler allows arrays (even multidimensional arrays) to be PRIVATE variables, with the following restrictions: the size of the array must be either a constant or an expression; the dimension bounds must be specified; the PRIVATE array cannot have been declared using a variable or the asterisk (*) syntax; and assumed-shape, deferred-shape, and pointer arrays are not permitted.

      DO I = 1, N
         D(1) = A(I,1) - A(J,1)
         D(2) = A(I,2) - A(J,2)
         D(3) = A(I,3) - A(J,3)
         TOTAL_DISTANCE(I,J) = SQRT(D(1)**2 + D(2)**2 + D(3)**2)
      END DO

The preceding loop can be parallelized. Arrays TOTAL_DISTANCE and A can be declared SHARED, array D must be declared PRIVATE, and variable I can be declared PRIVATE or LASTPRIVATE.
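
For example, assuming D is declared with constant bounds (such as REAL D(3)) and that J is defined outside the fragment, a directive of the following form could be used; this sketch is not part of the original listing:

!$OMP PARALLEL DO SHARED(A, TOTAL_DISTANCE), PRIVATE(I, D)
      DO I = 1, N
         D(1) = A(I,1) - A(J,1)
         D(2) = A(I,2) - A(J,2)
         D(3) = A(I,3) - A(J,3)
         TOTAL_DISTANCE(I,J) = SQRT(D(1)**2 + D(2)**2 + D(3)**2)
      END DO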

Rewriting Data Dependencies

Many loops that have data dependencies can be rewritten so that some or all of the loop can be run in parallel. First locate the statements in the loop that cannot be made parallel and try to find another way to express them that does not depend on any other iteration of the loop. If this fails, try to pull the statements out of the loop and into a separate loop, allowing the remainder of the original loop to be run in parallel.

After you identify data dependencies, you can use various techniques to rewrite the code to break the dependence. Sometimes the dependencies in a loop cannot be broken, and you must either accept the serial execution rate or try to find a new parallel method of solving the problem. The following examples show how to deal with commonly occurring situations. These are by no means exhaustive but cover many situations that happen in practice.

Example 1. Loop-carried value. The following code segment is the same as the rewritable data dependence example in the previous section. INDX has its value carried from iteration to iteration. However, you can compute the appropriate value for INDX without making reference to any previous value.

      INDX = 0
      DO I = 1, N
         INDX = INDX + I
         A(I) = B(I) + C(INDX)
      END DO

The loop can be rewritten as follows to compute INDX directly from I:

!$OMP PARALLEL DO PRIVATE (I, INDX)
      DO I = 1, N
         INDX = (I*(I+1))/2
         A(I) = B(I) + C(INDX)
      END DO

In this loop, the value of INDX is computed without using any values computed on any other iteration. INDX can correctly be made a PRIVATE variable, and the loop can now be multiprocessed.

Example 2. Indirect indexing. Consider the following code:

     DO I = 1, N
        IX = INDEXX(I)
        IY = INDEXY(I)
        XFORCE(I) = XFORCE(I) + NEWXFORCE(IX)
        YFORCE(I) = YFORCE(I) + NEWYFORCE(IY)
        IXX = IXOFFSET(IX)
        IYY = IYOFFSET(IY)
        TOTAL(IXX, IYY) = TOTAL(IXX, IYY) + EPSILON
      END DO

It is the final statement that causes problems. The indexes IXX and IYY are computed in a complex way and depend on the values from the IXOFFSET and IYOFFSET arrays. It is not known if TOTAL(IXX,IYY) in one iteration of the loop will always be different from TOTAL(IXX,IYY) in every other iteration of the loop.

You can pull the statement out into its own separate loop by expanding IXX and IYY into arrays to hold intermediate values, as follows:

!$OMP PARALLEL DO PRIVATE(IX, IY, I)
      DO I  = 1, N
         IX = INDEXX(I)
         IY = INDEXY(I)
         XFORCE(I) = XFORCE(I) + NEWXFORCE(IX)
         YFORCE(I) = YFORCE(I) + NEWYFORCE(IY)
         IXX(I) = IXOFFSET(IX)
         IYY(I) = IYOFFSET(IY)
      END DO
      DO I = 1, N
         TOTAL(IXX(I),IYY(I)) = TOTAL(IXX(I), IYY(I)) + EPSILON
      END DO

Here, IXX and IYY have been turned into arrays to hold all the values computed by the first loop. The first loop (containing most of the work) can now be run in parallel. Only the second loop must still be run serially; it could also be parallelized if IXOFFSET or IYOFFSET is known to be a permutation vector, so that no two iterations of the loop update the same element of TOTAL.

If you were certain that the value for IXX was always different in every iteration of the loop, then the original loop could be run in parallel. It could also be run in parallel if IYY was always different. If IXX (or IYY) is always different in every iteration, then TOTAL(IXX,IYY) is never the same location in any iteration of the loop, and so there is no data conflict.

This sort of knowledge is program-specific and should always be used with great care. It may be true for a particular data set, but to run the original code in parallel as it stands, you need to be sure it will always be true for all possible input data sets.

Example 3. Recurrence. The following example shows a recurrence, which exists when a value computed in one iteration is immediately used by another iteration. There is no good way of running this loop in parallel. If this type of construct appears in a critical loop, try pulling the statement(s) out of the loop as in the previous example. Sometimes another loop encloses the recurrence; in that case, try to parallelize the outer loop.

      DO I = 2, N
         X(I) = X(I-1) + Y(I)
      END DO
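
For example, in the following sketch (hypothetical two-dimensional arrays X and Y, not from the original text), the recurrence runs down the J dimension, but each value of I is independent of the others, so the outer loop can be multiprocessed:

!$OMP PARALLEL DO PRIVATE(I, J)
      DO I = 1, M
         DO J = 2, N
            X(J,I) = X(J-1,I) + Y(J,I)
         END DO
      END DO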

Example 4. Sum reduction. The following example shows an operation known as a reduction. Reductions occur when an array of values is combined and reduced into a single value.

      SUM = 0.0
      DO I = 1, N
         SUM = SUM + A(I)
      END DO

This example is a sum reduction because the combining operation is addition. Here, the value of SUM is carried from one loop iteration to the next, so this loop cannot be multiprocessed. However, because this loop simply sums the elements of A(I), you can rewrite the loop to accumulate multiple, independent subtotals and do much of the work in parallel, as follows:

      NUM_THREADS = OMP_GET_MAX_THREADS()
!
!  IPIECE_SIZE = N/NUM_THREADS ROUNDED UP
!
      IPIECE_SIZE = (N + (NUM_THREADS-1)) / NUM_THREADS
      DO K = 1, NUM_THREADS
        PARTIAL_SUM(K) = 0.0
!
!  THE FIRST THREAD DOES 1 THROUGH IPIECE_SIZE, THE
!  SECOND DOES IPIECE_SIZE + 1 THROUGH 2*IPIECE_SIZE,
!  ETC. IF N IS NOT EVENLY DIVISIBLE BY NUM_THREADS,
!  THE LAST PIECE NEEDS TO TAKE THIS INTO ACCOUNT,
!  HENCE THE "MIN" EXPRESSION.
!
        DO I = K*IPIECE_SIZE - IPIECE_SIZE + 1, MIN(K*IPIECE_SIZE,N)
           PARTIAL_SUM(K) = PARTIAL_SUM(K) + A(I)
        END DO
      END DO
!
!  NOW ADD UP THE PARTIAL SUMS
      SUM = 0.0
      DO I = 1, NUM_THREADS
         SUM = SUM + PARTIAL_SUM(I)
      END DO

The outer loop K can be run in parallel. In this method, the array pieces for the partial sums are contiguous, resulting in good cache utilization and performance.
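
For example, the K loop above could be run in parallel with a directive of the following form (a sketch; the directive is not part of the original listing):

!$OMP PARALLEL DO PRIVATE(K, I)
      DO K = 1, NUM_THREADS
        PARTIAL_SUM(K) = 0.0
        DO I = K*IPIECE_SIZE - IPIECE_SIZE + 1, MIN(K*IPIECE_SIZE,N)
           PARTIAL_SUM(K) = PARTIAL_SUM(K) + A(I)
        END DO
      END DO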

Because this is an important and common transformation, automatic support is provided by the REDUCTION clause:

      SUM = 0.0
!$OMP PARALLEL DO PRIVATE (I), REDUCTION (+:SUM)
      DO 10 I = 1, N
         SUM = SUM + A(I)
   10 CONTINUE

The previous code has essentially the same meaning as the much longer and more confusing code above. Adding an extra dimension to an array to permit parallel computation and then combining the partial results is an important technique for trying to break data dependencies. This technique is often useful.

Reduction transformations such as this do not produce exactly the same results as the original code. Because computer arithmetic has limited precision, summing the values in a different order, as was done here, causes the round-off errors to accumulate slightly differently, and the final answer is likely to differ slightly from that of the original loop. Both answers are equally correct. The difference is usually irrelevant, but sometimes it can be significant; if the difference is significant, neither answer is really trustworthy.

This example is a sum reduction because the operator is plus (+). The Fortran compiler supports the reduction operations described in Table 4-1.

For example,

!$OMP PARALLEL DO PRIVATE (I), REDUCTION(+:BG_SUM),
!$OMP+REDUCTION(*:BG_PROD), REDUCTION(MIN:BG_MIN), REDUCTION(MAX:BG_MAX)
          DO I = 1,N
             BG_SUM  = BG_SUM + A(I)
             BG_PROD = BG_PROD * A(I)
             BG_MIN  = MIN(BG_MIN, A(I))
             BG_MAX  = MAX(BG_MAX, A(I))
          END DO

The following is another example of a reduction transformation:

        DO I = 1, N
           TOTAL = 0.0
           DO J = 1, M
              TOTAL = TOTAL + A(J)
           END DO
           B(I) = C(I) * TOTAL
        END DO

Initially, it might look as if the inner loop should be parallelized with a REDUCTION clause. However, consider the outer I loop. Although TOTAL cannot be made a PRIVATE variable in the inner loop, it fulfills the criteria for a PRIVATE variable in the outer loop: the value of TOTAL in each iteration of the outer loop does not depend on the value of TOTAL in any other iteration of the outer loop. Thus, you do not have to rewrite the loop as a reduction; you can parallelize the outer I loop, declaring TOTAL and J to be PRIVATE variables.
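
A sketch of that outer-loop form follows (the directive is not part of the original listing):

!$OMP PARALLEL DO PRIVATE(I, J, TOTAL)
      DO I = 1, N
         TOTAL = 0.0
         DO J = 1, M
            TOTAL = TOTAL + A(J)
         END DO
         B(I) = C(I) * TOTAL
      END DO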

Work Quantum

A certain amount of overhead is associated with multiprocessing a loop. If the work occurring in the loop is small, the loop can actually run slower by multiprocessing than by single processing. To avoid this, make the amount of work inside the multiprocessed region as large as possible, as is shown in the following examples.

Example 1. Loop interchange. Consider the following code:

      DO K = 1, N
         DO I = 1, N
            DO J = 1, N
               A(I,J) = A(I,J) + B(I,K) * C(K,J)
            END DO
         END DO
      END DO

For the preceding code fragment, you can parallelize the J loop or the I loop, but you cannot parallelize the K loop, because different iterations of the K loop read and write the same values of A(I,J). Try to parallelize the outermost DO loop if possible, because it encloses the most work; in this example, however, the outermost loop is the K loop, which cannot be parallelized. Instead, use the technique called loop interchange: although the parallelizable loops are not the outermost ones, you can reorder the loops to make one of them outermost.

Thus, loop interchange would produce the following code fragment:

!$OMP PARALLEL DO PRIVATE(I, J, K)
        DO J = 1, N
           DO K = 1, N
              DO I = 1, N
                 A(I,J) = A(I,J) + B(I,K) * C(K,J)
              END DO
           END DO
        END DO

Now the parallelizable loop encloses more work and shows better performance. In practice, relatively few loops can be reordered in this way. However, it does occasionally happen that several loops in a nest of loops are candidates for parallelization. In such a case, it is usually best to parallelize the outermost one.

Example 2. Conditional parallelism. The following loop is worth parallelizing only if N is sufficiently large. To overcome the parallel loop overhead, N needs to be around 1000, depending on the specific hardware and the context of the program. The optimized version uses an IF clause on the PARALLEL DO directive:

!$OMP PARALLEL DO IF (N .GE. 1000), PRIVATE(I)
        DO I = 1, N
           A(I) = A(I) + X*B(I)
        END DO

Cache Effects and Optimization

It is best to try to write loops that take the cache into account, with or without parallelism. The technique for attaining the best cache performance is quite simple: make the loop step through the array in the same way that the array is laid out in memory. For Fortran, this means stepping through the array without any gaps and with the leftmost subscript varying the fastest. This does not depend on multiprocessing, nor is it required in order for multiprocessing to work correctly. However, multiprocessing can affect how the cache is used.

Performing a Matrix Multiply

Consider the following code segment:

      DO I = 1, N
         DO K = 1, N
            DO J = 1, N
               A(I,J) = A(I,J) + B(I,K) * C(K,J)
            END DO
         END DO
      END DO

To get the best cache performance, the I loop should be innermost. At the same time, to get the best multiprocessing performance, the outermost loop should be parallelized.

For this example, you can interchange the I and J loops, and get the best of both optimizations:

!$OMP PARALLEL DO PRIVATE(I, J, K)
        DO J = 1, N
           DO K = 1, N
              DO I = 1, N
                 A(I,J) = A(I,J) + B(I,K) * C(K,J)
              END DO
           END DO
        END DO

Optimization Costs

Sometimes you must choose between the possible optimizations and their costs. Look at the following code segment:

      DO J = 1, N
         DO I = 1, M
            A(I) = A(I) + B(J)*C(I,J)
         END DO
      END DO

This loop can be parallelized on I but not on J. You could interchange the loops to put I on the outside, thus getting a bigger work quantum.

!$OMP PARALLEL DO PRIVATE(I,J)
      DO I = 1, M
         DO J = 1, N
            A(I) = A(I) + B(J)*C(I,J)
         END DO
      END DO

However, putting J on the inside means that you will step through the C array in the wrong direction; the leftmost subscript should be the one that varies the fastest. It is possible to parallelize the I loop where it stands:

      DO J = 1, N
!$OMP PARALLEL DO PRIVATE(I)
         DO I = 1, M
            A(I) = A(I) + B(J)*C(I,J)
         END DO
      END DO

However, M needs to be large for the work quantum to show any improvement. In this example, A(I) is used to do a sum reduction, and it is possible to use reduction techniques to rewrite this in a parallel form. However, that involves converting array A from a one-dimensional array to a two-dimensional array to hold the partial sums; this is analogous to the way the scalar summation variable was converted into an array of partial sums.

If A is large, however, the conversion can take too much memory. It can also take extra time to initialize the expanded array and increase the memory bandwidth requirements.

!$OMP PARALLEL SHARED (NUM)
!$OMP SINGLE
      NUM = OMP_GET_NUM_THREADS()
!$OMP END SINGLE NOWAIT
!$OMP END PARALLEL
      IPIECE = (N + (NUM-1)) / NUM
!
!  CLEAR THE PARTIAL SUMS BEFORE ACCUMULATING INTO THEM
!
      PARTIAL_A(1:M, 1:NUM) = 0.0
!$OMP PARALLEL DO PRIVATE(K,J,I)
      DO K = 1, NUM
        DO J = K*IPIECE - IPIECE + 1, MIN(N, K*IPIECE)
           DO I = 1, M
              PARTIAL_A(I,K) = PARTIAL_A(I,K) + B(J)*C(I,J)
           END DO
        END DO
      END DO
!$OMP PARALLEL DO PRIVATE (I,K)
      DO I = 1, M
        DO K = 1, NUM
           A(I) = A(I) + PARTIAL_A(I,K)
        END DO
      END DO

You must analyze the various possible optimizations to find the combination that is right for the particular job.

Load Balancing

When the Fortran compiler divides a loop into pieces, by default it uses the simple method of separating the iterations into contiguous blocks of equal size for each process. It can happen that some iterations take significantly longer to complete than other iterations. At the end of a parallel region, the program waits for all processes to complete their tasks. If the work is not divided evenly, time is wasted waiting for the slowest process to finish.

Consider the following code:

      DO I = 1, N
         DO J = 1, I
            A(J, I) = A(J, I) + B(J)*C(I)
         END DO
      END DO

The previous code segment can be parallelized on the I loop. Because the inner loop goes from 1 to I, the first block of iterations of the outer loop will end long before the last block of iterations of the outer loop.

In this example, the imbalance is easy to see and is predictable, so you can change the program:

!$OMP PARALLEL SHARED (NUM_THREADS)
!$OMP SINGLE
      NUM_THREADS = OMP_GET_NUM_THREADS()
!$OMP END SINGLE NOWAIT
!$OMP END PARALLEL
!$OMP PARALLEL DO PRIVATE(I, J, K)
      DO K = 1, NUM_THREADS
        DO I = K, N, NUM_THREADS
           DO J = 1, I
              A(J, I) = A(J, I) + B(J)*C(I)
           END DO
        END DO
      END DO

In this rewritten version, instead of breaking up the I loop into contiguous blocks, break it into interleaved blocks. Thus, each execution thread receives some small values of I and some large values of I, giving a better balance of work between the threads. Interleaving usually, but not always, cures a load balancing problem.

You can use the SCHEDULE clause to automatically perform this desirable transformation, as in this example:

!$OMP PARALLEL DO PRIVATE(I,J), SCHEDULE(STATIC,1)
      DO I = 1, N
         DO J = 1, I
            A(J,I) = A(J,I) + B(J)*C(I)
         END DO
      END DO

The previous code has the same meaning as the rewritten form above.

Interleaving can cause poor cache performance because the array is no longer stepped through at stride 1. You can improve performance somewhat by using a chunk size larger than 1; usually 4 or 8 is a good value for int_expr, the chunk-size argument of the SCHEDULE clause. Each small chunk is stepped through at stride 1, which improves cache performance, while the chunks are interleaved to improve load balancing.
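
For example, the following variant of the previous loop (a sketch based on that example) interleaves chunks of four consecutive iterations of I:

!$OMP PARALLEL DO PRIVATE(I,J), SCHEDULE(STATIC,4)
      DO I = 1, N
         DO J = 1, I
            A(J,I) = A(J,I) + B(J)*C(I)
         END DO
      END DO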

The way that iterations are assigned to processes is known as scheduling. Interleaving is one possible schedule. Both interleaving and the simple scheduling methods are examples of fixed schedules; the iterations are assigned to processes by a single decision made when the loop is entered. For more complex loops, it may be desirable to use DYNAMIC or GUIDED schedules.
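
For example, the previous sketch can be given a dynamic schedule by changing only the SCHEDULE clause; each thread then claims the next available chunk of four iterations whenever it finishes its current chunk:

!$OMP PARALLEL DO PRIVATE(I,J), SCHEDULE(DYNAMIC,4)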

Comparing the output from SpeedShop allows you to see how well the load is being balanced so you can compare the different methods of dividing the load. For more information on SpeedShop, see the ssrun(1) man page.

Even when the load is perfectly balanced, iterations may still take varying amounts of time to finish because of random factors. One process may take a page fault, another may be interrupted to let a different program run, and so on. Because of these unpredictable events, the time spent waiting for all processes to complete can be several hundred cycles, even with near perfect balance.