Appendix D. Multiprocessing Directives (Outmoded)

The MIPSpro 7 Fortran 90 multiprocessing directives let you optimize your code by helping you to split your program into concurrently executing pieces. This appendix describes techniques for analyzing your code and preparing it for execution on multiple CPUs.


Note: The directives in this appendix are outmoded. They are supported for older codes that require this functionality. “Migrating to OpenMP” lists the equivalent new directives that you can use in place of the outmoded directives.

Silicon Graphics and Cray Research encourage you to write new codes using the OpenMP directives described in Chapter 4, “OpenMP Fortran API Multiprocessing Directives”.

This appendix describes two sets of directives to use for multiprocessing. The first set consists of the loop-level multiprocessing directives. The second set consists of directives based on the work of the Parallel Computing Forum (PCF). The PCF directives allow you to specify multiprocessing based on the model of a parallel region. The following sections describe the multiprocessing directives and how to use them.

The -mp option must be specified on the f90(1) command line in order for the compiler to honor the directives in this appendix. For more information on multiprocessing, see the mp(3f) and sync(3f) man pages.
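
For example, the following command line compiles a program that contains these directives with multiprocessing enabled (the file name is illustrative):

% f90 -mp prog.f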

Migrating to OpenMP

The functionality of the directives in this appendix has been preserved in the OpenMP Fortran API directives. The following list indicates equivalent directives:

MIPSpro Directive             OpenMP Directive
!$DOACROSS                    !$OMP PARALLEL DO
!$CHUNK                       Optional second argument to the SCHEDULE(type[,chunk]) clause
!$MP_SCHEDTYPE                First argument to the SCHEDULE(type[,chunk]) clause
!$ statement                  !$ or _OPENMP preprocessor macro
!$PAR BARRIER                 !$OMP BARRIER
!$PAR CRITICALSECTION         !$OMP CRITICAL [(name)]
!$PAR END CRITICALSECTION     !$OMP END CRITICAL
!$PAR PARALLEL                !$OMP PARALLEL
!$PAR END PARALLEL            !$OMP END PARALLEL
!$PAR PARALLELDO              !$OMP PARALLEL DO
!$PAR PDO                     !$OMP DO
!$PAR ENDPDO                  !$OMP END DO
!$PAR PSECTION                !$OMP SECTIONS
!$PAR SECTION                 !$OMP SECTION
!$PAR END PSECTION            !$OMP END SECTIONS
!$PAR SINGLEPROCESS           !$OMP SINGLE
!$PAR END SINGLEPROCESS       !$OMP END SINGLE
!$DISTRIBUTE                  !$SGI DISTRIBUTE
!$DISTRIBUTE_RESHAPE          !$SGI DISTRIBUTE_RESHAPE
!$REDISTRIBUTE                !$SGI REDISTRIBUTE
!$DYNAMIC                     !$SGI DYNAMIC
!$PAGE_PLACE                  !$SGI PAGE_PLACE

The following list indicates equivalent clauses:

MIPSpro Clause      OpenMP Clause
AFFINITY            !$SGI& AFFINITY
BLOCKED             STATIC
CHUNK               Second argument to the SCHEDULE(type[,chunk]) clause
IF                  IF
LASTLOCAL           LASTPRIVATE
LOCAL               PRIVATE
MP_SCHEDTYPE        First argument to the SCHEDULE(type[,chunk]) clause
NEST                !$SGI& NEST
PRIVATE             PRIVATE
REDUCTION           REDUCTION
SHARED              SHARED
NOWAIT              NOWAIT

The following list indicates equivalent modes for !$MP_SCHEDTYPE:

MIPSpro mode        OpenMP mode
DYNAMIC             DYNAMIC
GSS                 GUIDED
INTERLEAVE          SCHEDULE(STATIC, 1)
RUNTIME             RUNTIME
SIMPLE              STATIC

Using Directives

Certain multiprocessing features are available to you either through the command line or through directives. For command line options and directives that accept either ON or OFF as arguments, the compiler turns the feature OFF when conflicting settings are present. If a feature accepts a numeric setting as an argument, the compiler compares the command line setting and the directive setting and uses the minimum setting.

The following sections contain general information that applies to both the loop-level and the PCF directives.

Directive Range

Directives placed in a file prior to program code are called global directives. The compiler interprets them as if they appeared at the top of each program unit in the file.

Directives appearing anywhere else in the file apply only until the end of the current program unit. The compiler resets the value of the directive to the global value at the start of the next program unit.

Directive Continuation

To continue the loop-level multiprocessing directives onto another line, use !$& as the first characters in the continued line(s). For example:

!$DOACROSS share(ALPHA, BETA, GAMMA, DELTA,
!$&  EPSILON, OMEGA), LASTLOCAL(I, J, K, L, M, N),
!$&  LOCAL(XXX1, XXX2, XXX3, XXX4, XXX5, XXX6, XXX7,
!$&  XXX8, XXX9)

To continue the PCF directives onto another line, begin the continued line with the characters !$PAR&.
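
For example, a PCF directive could be continued as follows (the variable names are illustrative):

!$PAR PARALLEL LOCAL(I, J, K)
!$PAR&         SHARED(ALPHA, BETA, GAMMA)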

Loop-level Multiprocessing Directives

It is possible for the compiler to execute different iterations of a DO loop on multiple processors. For example, suppose a DO loop consisting of 200 iterations will run on a machine with four processors using the simplest scheduling method. The first 50 iterations run on one processor, the next 50 on another, and so on.

A multiprocessing code adjusts itself at run time to the number of processors actually available to it on the machine. By default, the multiprocessing code does not use more than eight processors. If you want to use more processors, set the MP_SET_NUMTHREADS environment variable to a different value. If the 200-iteration loop were moved to a machine with only two processors, it would be divided into two blocks of 100 iterations each, without any need to recompile or reload. In fact, multiprocessing code can be run on single-processor machines; on such systems the iterations are divided into one block of 200 iterations. This allows code to be developed on a single-processor system and later run on a multiprocessor.
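
For example, with the C shell you might allow a program to use up to 16 CPUs before running it (the value shown is illustrative):

% setenv MP_SET_NUMTHREADS 16
% a.out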

The processes that participate in the parallel execution of a task are arranged in a master/slave organization. The original process is the master. It creates zero or more slaves to assist. When a parallel DO loop is encountered, the master contacts the slaves for help. When the loop is complete, the slaves wait for the master, and the master resumes normal execution. The master process and each of the slave processes are called a thread of execution or simply a thread. By default, the number of threads is set to the number of processors on the machine or is set to 8, whichever is smaller. You can override the default and explicitly control the number of threads of execution used by a parallel job.

For multiprocessing to work correctly, the iterations of the loop must not depend on each other; each iteration must stand alone and produce the same answer regardless of when any other iteration of the loop is executed. Not all DO loops have this property, and loops without it cannot be correctly executed in parallel. However, many of the loops encountered in practice fit this model. Further, many loops that cannot be run in parallel in their original form can be rewritten to run wholly or partially in parallel. For information about determining data dependencies in loops, see “Analyzing Data Dependencies for Multiprocessing” in Chapter 4.

The loop-level multiprocessing directives are as follows:

  • DOACROSS

  • CHUNK

  • MP_SCHEDTYPE

The following sections describe the loop-level multiprocessing directives.


Note: Localized ALLOCATABLE or POINTER arrays are not supported on the DOACROSS directive. They cannot be specified in a LOCAL clause. Also, Cray Pointees are not supported in a LOCAL clause.


DOACROSS Directive

The basis for the loop-level multiprocessing directives is the DOACROSS directive. This directive indicates to the compiler that it should run iterations of the subsequent DO loop in parallel. This directive must appear directly before the loop that is to be operated on, and it remains in effect for that loop only.

The format of this directive is as follows:

!$DOACROSS [clause[, clause] ...]
clause

This directive accepts one or more of the following clauses:

  • AFFINITY

  • BLOCKED

  • CHUNK

  • IF

  • LASTLOCAL

  • LOCAL

  • MP_SCHEDTYPE

  • NEST

  • PRIVATE

  • REDUCTION

  • SHARED

The sections that follow describe the DOACROSS directive clauses.

Appendix B, “Debugging and Profiling Multiprocessed Programs”, contains information on debugging when DOACROSS directives are used.


Note: The Fortran compiler does not support direct nesting of DOACROSS loops.

For example, the following is illegal and generates a compilation error:
!$DOACROSS LOCAL(I)
        DO I = 1, N
!$DOACROSS LOCAL(J)
           DO J = 1, N
              A(I,J) = B(I,J)
           END DO
        END DO



However, to simplify separate compilation, a different form of nesting is allowed. A routine that uses !$DOACROSS can be called from within a multiprocessed region. This can be useful if a single routine is called from several different places: sometimes from within a multiprocessed region, sometimes not. Nesting does not increase the parallelism. When the first !$DOACROSS loop is encountered, that loop is run in parallel. While in the parallel loop, if a call is made to a routine that itself has a !$DOACROSS, the subsequent loop is executed serially.


AFFINITY Clause

Affinity scheduling allows you to map parallel loop iterations onto underlying threads. This clause is used most often on Origin series systems.

For more information on using this DOACROSS clause, see “AFFINITY Clause” in Chapter 5.
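
As an illustration only, and assuming the data-affinity form described in that section, a loop whose iterations should run on the threads that own the corresponding elements of a distributed array might look like the following sketch (the array A, its distribution, and the loop bound N are hypothetical):

!$DISTRIBUTE A(BLOCK)
!$DOACROSS AFFINITY(I) = DATA(A(I)), LOCAL(I)
      DO I = 1, N
         A(I) = A(I) + 1.0
      END DO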

BLOCKED and CHUNK Clauses

These clauses affect work scheduling among the participating tasks in a loop. They break up the work into pieces specified by int_expr. These clauses are valid only when the MP_SCHEDTYPE=DYNAMIC or MP_SCHEDTYPE=INTERLEAVE clauses have also been specified.

The BLOCKED and CHUNK clauses have the following formats:

BLOCKED (int_expr)
CHUNK = int_expr
int_expr 

Specify an integer expression that represents the size of the chunk (that is, the number of iterations per chunk).

If CHUNK or BLOCKED is specified and MP_SCHEDTYPE is not, MP_SCHEDTYPE defaults to DYNAMIC. For more information on how these clauses interact with the MP_SCHEDTYPE clause, see “MP_SCHEDTYPE Clause”.

The CHUNK directive also affects the division of work. For more information on the CHUNK directive, see “CHUNK Directive”.
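
For example, the following directive hands out the loop iterations dynamically in pieces of 10 iterations each (the loop body and variable names are illustrative):

!$DOACROSS MP_SCHEDTYPE=DYNAMIC, CHUNK=10, LOCAL(I)
      DO I = 1, N
         A(I) = B(I) * C(I)
      END DO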

IF Clause

The IF clause determines whether the loop is actually executed in parallel. This clause has the following format:

IF (logical_expr)
logical_expr 

Specify a logical expression. If logical_expr evaluates to TRUE, the loop is executed in parallel. If logical_expr evaluates to FALSE, the loop is executed serially.
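
For example, the following directive multiprocesses the loop only when there is enough work to offset the parallel startup overhead (the threshold and variable names are illustrative):

!$DOACROSS IF(N .GT. 1000), LOCAL(I)
      DO I = 1, N
         A(I) = A(I) + B(I)
      END DO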

LASTLOCAL, LOCAL, PRIVATE and SHARED Clauses

The LASTLOCAL, LOCAL, and SHARED clauses specify lists of variables used within parallel loops. A variable can appear in only one of these lists. The effect of these clauses is as follows:

  • The LASTLOCAL clause specifies variables that are local to each process. Unlike variables in the LOCAL clause, the value from the logically last iteration of the loop is saved when the loop exits. The name LASTLOCAL is preferred over LAST LOCAL.

  • The LOCAL clause specifies variables that are local to each process. If a variable is declared as LOCAL, each iteration of the loop is given its own uninitialized copy of the variable. You can declare a variable as LOCAL if its value does not depend on any other iteration of the loop and if its value is used only within a single iteration. In effect, the LOCAL variable is just temporary; a new copy can be created in each loop iteration without changing the final answer. The name LOCAL is preferred over PRIVATE.


    Note: Localized ALLOCATABLE or POINTER arrays cannot be specified in a LOCAL clause. Also, Cray Pointees are not supported in a LOCAL clause.


  • The SHARED clause specifies variables that are shared across all processes. If a variable is declared as SHARED, all iterations of the loop use the same copy of the variable. You can declare a variable as SHARED if it is only read (not written) within the loop or if it is an array in which each iteration of the loop uses a different element of the array. The name SHARED is preferred over SHARE.

By default, the DO variable is LASTLOCAL and all other variables are SHARED.

These clauses have the following formats:

LASTLOCAL var[ , var ... ]
LOCAL var[ , var ... ]
SHARED var[ , var ... ]
var

Specify the name of a variable. If any var is an array, it is listed without any subscripts.

Common blocks, allocatable arrays, and Fortran 90 pointers cannot appear as var arguments in a LOCAL list.

LOCAL is a little faster than LASTLOCAL, so if you do not need the final value, it is good practice to put the DO index variable into the LOCAL list, although this is not required.

MP_SCHEDTYPE Clause

The MP_SCHEDTYPE clause affects the way the compiler schedules work among the participating tasks in a loop.

This clause has the following format:

MP_SCHEDTYPE = mode
mode

Specify one of the following for mode:

  • DYNAMIC. Specifying MP_SCHEDTYPE=DYNAMIC breaks the iterations into pieces the size of which is specified with the CHUNK clause. As each process finishes a piece, it enters a critical section to grab the next available piece. This gives good load balancing at the price of higher overhead. The CHUNK clause is valid with this mode.

  • GSS. Specifying MP_SCHEDTYPE=GSS results in a variation of the guided self-scheduling algorithm. The piece size is varied depending on the number of iterations remaining. By parceling out relatively large pieces to start with and relatively small pieces toward the end, the system can achieve good load balancing while reducing the number of entries into the critical section. Specifying GUIDED for mode performs the same function as specifying GSS, but GSS is preferred.

  • INTERLEAVE. Specifying MP_SCHEDTYPE=INTERLEAVE breaks the iterations into pieces of the size specified by the CHUNK clause, and execution of those pieces is interleaved among the processes. For example, if there are four processes and CHUNK=2, the first process executes iterations 1-2, 9-10, 17-18, ...; the second process executes iterations 3-4, 11-12, 19-20,...; and so on. Although this is more complex than the simple method, it is still a fixed schedule with only a single scheduling decision. The CHUNK clause is valid with this mode. Specifying INTERLEAVED for mode performs the same function as specifying INTERLEAVE, but INTERLEAVE is preferred.

  • RUNTIME. Specifying MP_SCHEDTYPE=RUNTIME directs the scheduling routine to examine environment variables to select a mode. For the list of valid environment variables, see the pe_environ(5) man page.

  • SIMPLE. Specifying MP_SCHEDTYPE=SIMPLE divides the iterations among processes by dividing them into contiguous pieces and assigning one piece to each process. Specifying STATIC for mode performs the same function as specifying SIMPLE, but SIMPLE is preferred. SIMPLE is the default.

The MP_SCHEDTYPE clause interacts with the CHUNK clause as follows:

  • If both the MP_SCHEDTYPE and CHUNK clauses are omitted, SIMPLE scheduling is assumed.

  • If MP_SCHEDTYPE=INTERLEAVE or MP_SCHEDTYPE=DYNAMIC and the CHUNK clause is omitted, CHUNK=1 is assumed.

  • If MP_SCHEDTYPE is set to one of the other values, CHUNK is ignored.

  • If the MP_SCHEDTYPE clause is omitted, but CHUNK is set, MP_SCHEDTYPE=DYNAMIC is assumed.
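
For example, the following directive requests guided self-scheduling for a single loop (the loop body and the routine WORK are illustrative):

!$DOACROSS MP_SCHEDTYPE=GSS, LOCAL(I)
      DO I = 1, N
         CALL WORK(I)
      END DO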

NEST Clause

The NEST clause allows you to exploit nested concurrency. This DOACROSS clause is used most often on Origin series systems. For more information on this clause, see “Specifying a Parallel Region: !$OMP PARALLEL DO” in Chapter 5.
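
As an illustration only, and assuming the NEST(index, index) form described in the referenced section, the NEST clause names the indices of perfectly nested loops that are to be parallelized together (the array and loop bounds are hypothetical):

!$DOACROSS NEST(I, J)
      DO I = 1, N
         DO J = 1, M
            A(I,J) = 0.0
         END DO
      END DO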

REDUCTION Clause

The REDUCTION clause specifies variables involved in a reduction operation. In a reduction operation, the compiler keeps local copies of the variables and combines them when it exits the loop.

This clause has the following format:

REDUCTION var[ , var]...
var

Specify one or more variable names for var. Each var must be an individual scalar variable; it cannot be an entire array, but it can be an array element (for example, REDUCTION(A(I,J))).

One element of an array can be used in a reduction operation while other elements of the array are used in other ways. To allow for this, if an element of an array appears in the REDUCTION list, the entire array can also appear in the SHARED list.

The four types of reductions supported are sum(+), product(*), min(), and max(). Note that min and max reductions must use the MIN(3i) and MAX(3i) intrinsic functions to be recognized correctly.

The compiler confirms that the reduction expression is legal by making some simple checks. The compiler does not, however, check all statements in the DO loop for illegal reductions. You must ensure that the reduction variable is used correctly in a reduction operation.

Example:

!$DOACROSS LOCAL(I), REDUCTION(A(1))
      DO I = 2,N
         A(1) = A(1) + A(I)
      END DO
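
Because min and max reductions must be written with the MIN and MAX intrinsic functions, a max reduction might look like the following sketch (the variable names are illustrative, and BG is assumed to be initialized before the loop, for example to A(1)):

!$DOACROSS LOCAL(I), REDUCTION(BG)
      DO I = 1, N
         BG = MAX(BG, A(I))
      END DO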

CHUNK Directive

The CHUNK directive breaks work up into pieces. Like the MP_SCHEDTYPE directive, the CHUNK directive acts as an implicit clause, in this case a CHUNK clause, for all DOACROSS directives in the scope. The CHUNK directive is in effect from the place it occurs in the source until another corresponding directive is encountered or the end of the procedure is reached.

The format of this directive is as follows:

!$CHUNK=int_expr
int_expr

Specify an integer expression that represents the size of the chunk (that is, the number of iterations per chunk).

The CHUNK clause to the DOACROSS directive also divides work. For more information, see “BLOCKED and CHUNK Clauses”.
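
For example, the following sketch sets a chunk size of 100 for the DOACROSS loops that follow it in the program unit; because no MP_SCHEDTYPE is given, DYNAMIC scheduling is assumed (the loop itself is illustrative):

!$CHUNK=100
!$DOACROSS LOCAL(I)
      DO I = 1, N
         A(I) = B(I) + C(I)
      END DO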

MP_SCHEDTYPE Directive

The MP_SCHEDTYPE directive affects the way the compiler schedules work among the participating tasks in a loop. Like the CHUNK directive, the MP_SCHEDTYPE directive acts as an implicit clause, in this case an MP_SCHEDTYPE clause, for all DOACROSS directives in the scope. The MP_SCHEDTYPE directive is in effect from the place it occurs in the source until another corresponding directive is encountered or the end of the procedure is reached.

The MP_SCHEDTYPE directive specifies the scheduling type to be used for subsequent !$DOACROSS directives that are specified without an explicit scheduling type.

The format of this directive is as follows:

!$MP_SCHEDTYPE mode
mode

This directive accepts a mode argument as described in “MP_SCHEDTYPE Clause”.

The MP_SCHEDTYPE clause to the DOACROSS directive also controls scheduling. For more information, see “MP_SCHEDTYPE Clause”.
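
For example, the following sketch, written in the format shown above, applies interleaved scheduling with a chunk size of 2 to both of the DOACROSS loops that follow it (the loops are illustrative):

!$MP_SCHEDTYPE INTERLEAVE
!$CHUNK=2
!$DOACROSS LOCAL(I)
      DO I = 1, N
         A(I) = B(I)
      END DO
!$DOACROSS LOCAL(I)
      DO I = 1, N
         C(I) = D(I)
      END DO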

!$ Directive

The !$ directive, which is really only a prefix, precedes code that should be recognized only when multiprocessing is enabled. Multiprocessing is enabled when either -pfa or -mp is specified on the f90(1) command line.

These directive lines are considered comment lines except when multiprocessing. A line beginning with !$ is treated as a conditionally compiled Fortran statement.

The format of this directive is as follows:

!$ statement
statement

For statement, specify a standard Fortran statement. This feature can be used to insert debugging statements or other arbitrary code.

If statement is a Fortran 90 statement, the statement can be continued to a subsequent line in fixed source form by placing an ampersand (&) in column 6 of the continued line.

If statement is a directive, continue it using the rules for directive continuation described in “Directive Continuation”.

The following code demonstrates the use of the !$ directive:

!$    PRINT 10
!$ 10 FORMAT('BEGIN MULTIPROCESSED LOOP')
!$DOACROSS LOCAL(I), SHARED(A,B)
        DO I = 1, 100
           CALL COMPUTE(A, B, I)
        END DO

DOACROSS Directive Examples

This section contains examples of DOACROSS directives.

Example 1. Simple DOACROSS directive. Consider the following code fragment:

     DO 10 I = 1, 100
        A(I) = B(I)
10   CONTINUE

By inserting a directive, it can be multiprocessed:

!$DOACROSS LOCAL(I), SHARED(A, B)
     DO 10 I = 1, 100
        A(I) = B(I)
10   CONTINUE

Here, the defaults are sufficient provided that A and B are mentioned in a nonparallel region or in another SHARED list. The following code will then work:

!$DOACROSS
     DO 10 I = 1, 100
        A(I) = B(I)
10   CONTINUE

Example 2. A DOACROSS directive with a LOCAL clause. Consider the following code fragment:

     DO 10 I = 1, N
        X = SQRT(A(I))
        B(I) = X*C(I) + X*D(I)
10   CONTINUE

The following code shows this fragment rewritten for multiprocessing using explicit clauses:

!$DOACROSS LOCAL(I, X), SHARED(A, B, C, D, N)
     DO 10 I = 1, N
        X = SQRT(A(I))
        B(I) = X*C(I) + X*D(I)
10   CONTINUE

The following code shows the fragment rewritten for multiprocessing using the default settings:

!$DOACROSS LOCAL(X)
     DO 10 I = 1, N
        X = SQRT(A(I))
        B(I) = X*C(I) + X*D(I)
10   CONTINUE

Example 3. A DOACROSS directive with a LASTLOCAL clause. Consider the following code fragment:

     DO 10 I = M, K, N
        X = D(I)**2
        Y = X + X
        DO 20 J = I, ITOP
           A(I,J) = A(I,J) + B(I,J) * C(I,J) * X + Y
20   CONTINUE
10   CONTINUE
     PRINT*, I, X

In this example, the final values of I and X are needed after the loop completes. A correct directive is shown in the following:

!$DOACROSS LOCAL(Y,J), LASTLOCAL(I,X),
!$& SHARED(M,K,N,ITOP,A,B,C,D)
     DO 10 I = M, K, N
         X = D(I)**2
         Y = X + X
         DO 20 J = I, ITOP
             A(I,J) = A(I,J) + B(I,J) * C(I,J) *X + Y
20       CONTINUE
10   CONTINUE
     PRINT*, I, X

You can also use the defaults:

!$DOACROSS LOCAL(Y,J), LASTLOCAL(X)
     DO 10 I = M, K, N
         X = D(I)**2
         Y = X + X
          DO 20 J = I, ITOP
             A(I,J) = A(I,J) + B(I,J) * C(I,J) *X + Y
20       CONTINUE
10   CONTINUE
     PRINT*, I, X

In the preceding code example, I is the loop index variable for the DOACROSS loop, so it is LASTLOCAL by default. Even though J is a loop index variable, it is not the index of the loop being multiprocessed, so it has no special status; if it were not declared LOCAL, it would default to SHARED, which would produce an incorrect answer.

Local Common Blocks

The -Xlocal option to the ld(1) command allows named common blocks to be local to a process. Each process in the parallel job gets its own private copy of the common block. This can be helpful in converting certain types of Fortran programs into a parallel form.

The common block must be a named common block (blank common cannot be made local), and it must not be initialized by DATA statements.

To create a local common block, use the special loader option -Xlocal followed by a list of common block names. The external name of a common block known to the loader has a trailing underscore and is not surrounded by slashes. For example, the following command makes the common block /foo/ a local common block in the resulting a.out file. You can specify multiple -Xlocal options if necessary.

% f90 -mp a.o -Wl,-Xlocal,foo_

You can use the !$COPYIN directive to copy values from the master thread's version of the common block into each slave thread's version. This directive has the following format:

!$COPYIN item[, item] ...
item

Specify one or more members of a local common block. Each item can be a variable, an array, an individual element of an array, or the entire common block.


Note: The !$COPYIN directive cannot be executed from inside a parallel region.

The following example propagates the values for x and y, all the values in the common block foo, and the Ith element of array A:

!$COPYIN X,Y, /FOO/, A(I)

These items must be either common blocks or members of common blocks. The directive is translated into executable code, so in this example, I is evaluated at the time this statement is executed.
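
Assuming the common block /FOO/ has been made local with the -Xlocal loader option shown above, the following sketch (the names are illustrative) initializes it on the master thread and propagates the values to the slave threads before a multiprocessed loop; as noted above, the !$COPYIN directive appears outside any parallel region:

      COMMON /FOO/ X, Y
      X = 1.0
      Y = 2.0
!$COPYIN /FOO/
!$DOACROSS LOCAL(I), SHARED(A, B, N)
      DO I = 1, N
         A(I) = X*B(I) + Y
      END DO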

PCF Directives

In addition to the simple loop-level parallelism offered by the DOACROSS directive, the compiler supports a set of directives that allows you to specify a more general model of parallelism. This model is based on the work done by the Parallel Computing Forum (PCF), which itself formed the basis for the proposed ANSI-X3H5 standard.

The main concept in this model is the parallel region, which can be any arbitrary section of code (not just a DO loop). Within the parallel region, there are special work-sharing constructs that can be used to divide the work among separate processes or threads. All master and slave threads synchronize at the bottom of a work-sharing construct. None of the threads continue past the end of a construct until they all have completed execution within that construct.

The parallel region can also contain a critical section construct, within which only one thread executes at a time. Threads do not synchronize at the bottom of a critical section.

The master thread executes the user program until it reaches a parallel region. It then spawns one or more slave threads that begin executing code at the beginning of a parallel region. Each thread executes all the code in the region until a work sharing construct is encountered. Each thread then executes some portion of the work sharing construct, and then resumes executing the parallel region code. At the end of the parallel region, all the threads synchronize, and the master thread continues execution of the user program.

For information on interthread communication with library routines, see Appendix A, “Libraries”.

The compiler recognizes the PCF directives when multiprocessing is enabled with either the -mp or the -pfa option to the f90(1) command. The PCF directives are as follows:

  • BARRIER

  • CRITICAL SECTION, END CRITICAL SECTION

  • PARALLEL, END PARALLEL

  • PARALLEL DO

  • PDO, END PDO

  • PSECTION[S], SECTION, and END PSECTION[S]

  • SINGLEPROCESS, END SINGLEPROCESS

The following sections describe the syntax of the PCF directives.


Note: Generated code from the PCF directives is sometimes slower than the generated code from the special case parallelism offered by the DOACROSS directive. PCF directive code is slower because of the extra synchronization required. When a DOACROSS loop executes, there is a synchronization point at entry and another at exit. When a parallel region executes, there is a synchronization point at entry to the region, another at each entry to a work-sharing construct, another at each exit from a work-sharing construct, and one at exit from the region. Thus, several separate DOACROSS loops typically execute faster than a single parallel region with several PDO directives. Limit your use of the parallel region construct to those few cases that actually need it.


BARRIER Directive

The BARRIER directive ensures that each process waits until all processes reach the barrier before proceeding.

This directive has the following format:

!$PAR BARRIER
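
The following sketch shows a barrier used inside a parallel region so that no thread reads the shared table until every thread has written its own entry. The names TABLE and USE_TABLE are hypothetical, TABLE is assumed to have at least as many elements as there are threads, and MP_MY_THREADNUM is assumed to return a thread number starting at zero, as in the examples later in this appendix:

!$PAR PARALLEL LOCAL(ME) SHARED(TABLE)
      ME = MP_MY_THREADNUM()
      TABLE(ME+1) = ME
!$PAR BARRIER
      CALL USE_TABLE(TABLE)
!$PAR END PARALLEL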

CRITICAL SECTION and END CRITICAL SECTION Directives

The CRITICAL SECTION and END CRITICAL SECTION directives ensure that the enclosed block of code is executed by only one process (thread) at a time. Another process attempting to gain entry to the critical section must wait until the previous process has exited. Threads do not synchronize at the bottom of a critical section.

The critical section construct can appear anywhere in a program, including inside and outside a parallel region and within a DOACROSS loop.

These directives have the following format:

!$PAR CRITICAL SECTION [ (lock_variable) ]
!$PAR END CRITICAL SECTION
lock_variable

Specify an integer variable that is initialized to zero. The parentheses are required. If you do not specify lock_variable, the compiler automatically supplies a global lock. Multiple critical section constructs inside the same parallel region are considered to be independent of each other unless they use the same explicit lock_variable.
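
The following sketch (the variable names are illustrative) uses an explicit lock variable to guard updates to a shared running sum. In practice, a REDUCTION clause on a DOACROSS loop would usually be a better choice for this particular computation; the sketch only shows the syntax:

      INTEGER LOCKVAR
      LOCKVAR = 0
      TOTAL = 0.0
!$PAR PARALLEL LOCAL(I) SHARED(A, N, TOTAL, LOCKVAR)
!$PAR PDO
      DO I = 1, N
!$PAR CRITICAL SECTION (LOCKVAR)
         TOTAL = TOTAL + A(I)
!$PAR END CRITICAL SECTION
      END DO
!$PAR END PARALLEL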

PARALLEL and END PARALLEL Directives

The PARALLEL and END PARALLEL directives enclose a parallel region, which can include work-sharing constructs and critical sections. The parallel region marks the boundary within which slave threads execute. A user program can contain any number of parallel regions.

These directives have the following format:

!$PAR PARALLEL [clause[,clause]...]
!$PAR END PARALLEL
clause

Specify one of the following clauses:

  • IF (logical_expression)

  • LOCAL var[, var] ...

  • SHARED var[, var] ...

The IF, LOCAL, and SHARED clauses have the same meaning as for the DOACROSS directive. Also as with the DOACROSS directive, the keyword LOCAL is preferred to PRIVATE and the keyword SHARED is preferred to SHARE. For more information on these clauses and their syntax, see “DOACROSS Directive”.

The preferred form of the directive has no commas between the clauses.

In the following code, all threads enter the parallel region and call routine FOO:

          SUBROUTINE EX1(INDEX)
          INTEGER I
!$PAR PARALLEL LOCAL(I)
          I = MP_MY_THREADNUM()
          CALL FOO(I)
!$PAR END PARALLEL
          END

PARALLEL DO Directive

The PARALLEL DO directive indicates that the iterations of the subsequent DO loop should be executed by different processes. This directive produces the same effect as the DOACROSS directive, and it is conceptually the same as a parallel region containing exactly one PDO construct and no other code. Each thread executes separate iterations of the loop within the parallel DO construct. This directive must not appear within a parallel region.

This directive has the following format:

!$PAR PARALLELDO [clause[,clause] ...]
clause

For clause, enter one or more of the DOACROSS clauses described in “DOACROSS Directive”.
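
For example (a minimal sketch; the array names and loop bound are illustrative):

!$PAR PARALLELDO LOCAL(I), SHARED(A, B, N)
      DO I = 1, N
         A(I) = B(I)
      END DO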

PDO and END PDO Directives

The PDO and END PDO directives surround a loop and indicate that the iterations of the enclosed loop should be executed by different processes. These directives must be enclosed within a parallel region delimited by PARALLEL and END PARALLEL directives.

Within a parallel region, each thread inside the region executes a separate iteration of a loop within a PDO construct.

These directives have the following format:

!$PAR PDO [clause[, clause]...]
[!$PAR END PDO [NOWAIT]]
clause

Specify one of the following clauses:

  • AFFINITY

  • CHUNK=int_expr

  • LASTLOCAL var

  • LOCAL var[, var] ...

  • MP_SCHEDTYPE=mode

  • (ORDERED). Specifying the (ORDERED) clause is equivalent to specifying MP_SCHEDTYPE=DYNAMIC and CHUNK=1. The parentheses are required.

Each clause has the same meaning as for the DOACROSS directive. Also as with the DOACROSS directive, the keyword LASTLOCAL is preferred to LAST LOCAL and the keyword LOCAL is preferred to PRIVATE.

The (ORDERED) clause is not a supported DOACROSS clause.

For more information on the AFFINITY clause and its syntax, see “AFFINITY Clause” in Chapter 5. For more information on the other clauses and their syntax, see “DOACROSS Directive”.

It is legal to declare a data item as LOCAL in a PDO directive even if it was declared as SHARED in the enclosing parallel region.

The END PDO directive is optional. If specified, this directive must appear immediately after the end of the DO loop. The optional NOWAIT clause specifies that each process should proceed directly to the code immediately following the directive. If you do not specify NOWAIT, the processes wait until all have reached the directive before proceeding.


Note: Localized ALLOCATABLE or POINTER arrays are not supported on the PDO directive.

The code in the following example is equivalent to a DOACROSS loop. In fact, the compiler recognizes this as a special case and generates the same (more efficient) code as for a DOACROSS directive.

          SUBROUTINE EX2(A,N)
          REAL A(N)
!$PAR PARALLEL LOCAL(I) SHARED(A)
!$PAR PDO
          DO I = 1, N
            A(I) = A(I) + 1.0
          END DO
!$PAR END PARALLEL
          END

PSECTION[S], SECTION, and END PSECTION[S] Directives

The PSECTION[S] and END PSECTION[S] directives delimit a parallel section construct and distribute code blocks to processes. These directives have an effect that is similar to the Fortran 90 SELECT construct. Each block of code is parceled out in turn to a separate thread.

The SECTION directive indicates a starting line for an individual section within a parallel section.

These directives must be enclosed within a parallel region delimited by PARALLEL and END PARALLEL directives.

These directives have the following format:

!$PAR PSECTION[S] [LOCAL var[, var] ...]
[!$PAR SECTION]
!$PAR END PSECTION[S] [NOWAIT]
var

Specify a variable name for var. The LOCAL keyword has the same meaning as it does on the DOACROSS directive. The LOCAL keyword is preferred to PRIVATE. For more information on LOCAL, see “DOACROSS Directive”.

It is legal to declare a data item as LOCAL in a parallel sections construct even if it was declared as SHARED in the enclosing parallel region.

The optional NOWAIT clause specifies that each process should proceed directly to the code immediately following the directive. If you do not specify NOWAIT, the processes wait until all have reached the END PSECTION directive before proceeding.

Parallel sections can contain critical section constructs, but they cannot contain any of the following types of constructs:

  • A DO loop that is preceded by a PDO directive

  • A DO loop that is preceded by a PARALLEL DO or a DOACROSS directive

  • Code delimited by SINGLEPROCESS and END SINGLEPROCESS directives

Each code block is executed in parallel (depending on the number of processes available). The code blocks are assigned to threads one at a time, in the order specified. Each code block is executed by only one thread.

For example, consider the following code:

          SUBROUTINE EX3(A,N1,B,N2,C,N3)
          REAL A(N1), B(N2), C(N3)
!$PAR PARALLEL LOCAL(I) SHARED(A,B,C)
!$PAR PSECTIONS
!$PAR SECTION
          DO I = 1, N1
            A(I) = 0.0
          END DO
!$PAR SECTION
          DO I = 1, N2
            B(I) = 0.5
          END DO
!$PAR SECTION
          CALL NORMALIZE(C,N3)
          DO I = 1, N3
            C(I) = C(I) + 1.0
          END DO
!$PAR END PSECTION
!$PAR END PARALLEL
          END

The first thread to enter the parallel section construct executes the first block, the second thread executes the second block, and so on. This example has only three sections, so if more than three threads are in the parallel region, the fourth and higher threads wait at the !$PAR END PSECTION directive until all threads are finished. If the parallel region is being executed by only two threads, whichever thread finishes its block first continues and executes the remaining block.

This example uses DO loops, but a parallel section can be any arbitrary block of code. Parallel constructs have significant overhead. Make sure the amount of work performed is enough to outweigh the extra overhead.

The sections within a parallel section construct are assigned to threads one at a time, from the top down. There is no other implied ordering to the operations within the sections. In particular, a later section cannot depend on the results of an earlier section, unless some form of explicit synchronization is used. If there is such explicit synchronization, you must be sure that the lexical ordering of the blocks is a legal order of execution.

SINGLEPROCESS and END SINGLEPROCESS Directives

The SINGLEPROCESS and END SINGLEPROCESS directives enclose a block of code that should be executed by only one process. These directives must be enclosed within a parallel region delimited by PARALLEL and END PARALLEL directives.

These directives have the following format:

!$PAR SINGLEPROCESS  [LOCAL var[, var] ...]
!$PAR END SINGLEPROCESS [NOWAIT]
var

Specify a variable name for var. The LOCAL keyword has the same meaning as it does on the DOACROSS directive. The LOCAL keyword is preferred to PRIVATE. For more information on LOCAL, see “DOACROSS Directive”.

It is legal to declare a data item as LOCAL in a single process construct even if it was declared as SHARED in the enclosing parallel region.

The optional NOWAIT clause specifies that each process should proceed directly to the code immediately following the directive. If you do not specify NOWAIT, the processes wait until all have reached the END SINGLEPROCESS directive before proceeding.

This construct is semantically equivalent to a parallel section construct with only one section. The single process construct provides a more descriptive syntax.

The first thread to reach a single process section executes the code in that block. All other threads wait at the end of the block until the code has been executed.
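
The remainder of this section discusses a technique used in a larger example that is not reproduced in this appendix: a parallel region that searches an array for its largest element and records the element and its indices in shared variables. As a sketch only (the variables CUR_MAX, INDEX_X, INDEX_Y, the array A, and the bounds N and M are hypothetical), such a region might combine the single process, PDO, and critical section constructs as follows:

!$PAR PARALLEL LOCAL(I,J) SHARED(A, N, M, CUR_MAX, INDEX_X, INDEX_Y)
!$PAR SINGLEPROCESS
      CUR_MAX = A(1,1)
      INDEX_X = 1
      INDEX_Y = 1
!$PAR END SINGLEPROCESS
!$PAR PDO
      DO J = 1, M
        DO I = 1, N
          IF (A(I,J) .GT. CUR_MAX) THEN
!$PAR CRITICAL SECTION
            IF (A(I,J) .GT. CUR_MAX) THEN
              INDEX_X = I
              INDEX_Y = J
              CUR_MAX = A(I,J)
            ENDIF
!$PAR END CRITICAL SECTION
          ENDIF
        ENDDO
      ENDDO
!$PAR END PARALLEL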

Notice the repetition of the IF test in the parallel loop:

             IF (A(I,J) .GT. CUR_MAX) THEN
!$PAR CRITICAL SECTION
               IF (A(I,J) .GT. CUR_MAX) THEN

This practice is called test&test&set. It is a multiprocessing optimization. The following straightforward code segment is incorrect:

          DO I = 1, N
            IF (A(I,J) .GT. CUR_MAX) THEN
!$PAR CRITICAL SECTION
                  INDEX_X = I
                  INDEX_Y = J
                  CUR_MAX = A(I,J)
!$PAR END CRITICAL SECTION
            ENDIF
          ENDDO

Because many threads execute the loop in parallel, there is no guarantee that once inside the critical section, CUR_MAX still has the same value it did in the IF test outside the critical section (some other thread may have updated it). In particular, CUR_MAX may now have a value that is larger than A(I,J). Therefore, the critical section must be locked before testing the value of CUR_MAX. Changing the previous code into the following code works correctly, but suffers from a serious performance penalty: the critical section lock must be acquired and released (an expensive operation) for each element of the array:

            DO I = 1, N
!$PAR CRITICAL SECTION
                IF (A(I,J) .GT. CUR_MAX) THEN
                  INDEX_X = I
                  INDEX_Y = J
                  CUR_MAX = A(I,J)
                ENDIF
!$PAR END CRITICAL SECTION
         ENDDO

Because the values are rarely updated, this process involves a lot of wasted effort. It is almost certainly slower than just executing the loop serially.

Combining the two methods, as in the original example, produces code that is both fast and correct. If the IF test outside of the critical section fails, you can be certain that the values will not be updated and can proceed. You can expect that the outside IF test will account for the majority of cases. If the outer IF test passes, then the values might be updated, but you cannot always be certain. To ensure correctness, you must perform the test again after acquiring the critical section lock.

You can prefix one of the two identical IF tests with !$ to reduce overhead in the non-multiprocessed case.

Lastly, note the difference between the single process and critical section constructs. If several processes arrive at a critical section construct, they execute the code one at a time. However, they will all execute the code. If several processes arrive at a single process construct, only one process executes the code. The other processes bypass the code and wait at the end of the construct for the chosen process to finish.

Restrictions on the PCF Directives

The three work-sharing constructs, PDO, PSECTION, and SINGLEPROCESS, must be executed by all the threads executing in the parallel region or by none of the threads. The following is illegal:

      ...
!$PAR PARALLEL
          IF (MP_MY_THREADNUM() .GT. 5) THEN
!$PAR SINGLE PROCESS
              MANY_PROCESSES = .TRUE.
!$PAR END SINGLE PROCESS
          ENDIF
           ...

The preceding code cannot run successfully when more than six processors are used. One or more processes will be stuck at the !$PAR END SINGLE PROCESS directive, waiting for all the threads to arrive. Because some of the threads never took the appropriate branch, they will never encounter the construct. However, the following kind of simple looping is supported:

       ...
!$PAR PARALLEL LOCAL(I,J) SHARED(A)
          DO I= 1,N
!$PAR PDO
            DO J = 2,N
       ...

The distinction here is that all of the threads encounter the work-sharing construct. They all complete it, and they all loop around and encounter it again.

This restriction does not apply to the critical section construct, which operates on one thread at a time without regard to any other threads.

Parallel regions cannot be nested inside of other parallel regions, nor can work-sharing constructs be nested. However, as an aid to writing library code, you can call an external routine that contains a parallel region even from within a parallel region. In this case, only the first region is actually run in parallel. Therefore, you can create a parallelized routine without accounting for whether it will be called from within an already parallelized routine.