Chapter 5. Parallel Processing on Origin Series Systems

This chapter describes directives that may be useful to you when developing programs for parallel processing on Origin Series systems. The techniques described in this chapter use directives from the OpenMP Fortran API standard and directives that are Silicon Graphics extensions to the standard.


Note: The directives and clauses that are part of the OpenMP Fortran API have the !$OMP prefix. The extension directives have the !$SGI prefix.

The multiprocessing features described in this chapter require support from the MP run-time library. IRIX operating system versions 6.3 and later include this library. If you need to access these features on a machine running a different IRIX version, contact your sales representative.

For information on environment variables that can control run-time features, see the pe_environ(5) man page.

Performance Tuning on Origin Series Systems

Origin series systems provide cache-coherent, shared memory in the hardware. Memory is physically distributed across processors. Processors read data only from the primary cache; if the required data is not present in the primary cache, a cache miss is said to have occurred. Because memory is physically distributed, references to locations in the remote memory of another processor take substantially longer to complete than references to locations in local memory, so cache misses adversely affect program performance.

Figure 5-1 shows a simplified version of the Origin series memory hierarchy.

Figure 5-1. Origin series memory hierarchy


Improving Program Performance

To obtain good performance in parallel programs, it is important to schedule computation and to distribute data across the underlying processors and memory modules so that most cache misses are satisfied from local rather than from remote memory. The primary goal of the programming support described here is to give you control over data placement and over computation scheduling.

Cache behavior is the largest single factor affecting performance, and programs with infrequent cache misses usually have little need for explicit data placement. These programs write data to memory and reuse it as many times as possible before overwriting it. You can use perfex(1) to find information on your program's cache misses.

In programs with many cache misses, if the misses correspond to true data communication between processors, data placement is unlikely to help. In these cases, it may be necessary to redesign your program to reduce interprocessor communication. When redesigning your program to reduce interprocessor communication, keep the following in mind:

  • Make sure that the data needed by a processor is, at a minimum, in that processor's local memory (if it cannot be kept in cache).

  • Make sure that each processor is working independently and not relying on the changing data of other processors.

  • Minimize cache misses.

If the misses are to data that is referenced primarily by a single processor, then data placement may be able to convert remote references to local references, thereby reducing the latency of the miss. The possible methods for data placement are automatic page migration or explicit data distribution, either regular or reshaped, described in detail in “Regular Data Distribution”, and “Data Distribution with Reshaping”. The differences between these methods are shown in Figure 5-2. Some criteria for choosing between these methods are discussed in “Choosing a Tuning Method”.

Automatic page migration requires no user intervention and is based on the run-time cache miss behavior of the program. It can, therefore, adjust to dynamic changes in the reference patterns. However, page migration is very conservative, and the system may be slow to react to changes in the reference patterns. It is also limited to performing page-level data allocation.


Note: On most systems, page migration is disabled by default. When enabled, page migration can affect other codes running on the system. To determine whether page migration is enabled, contact your system administrator or examine the output from the sn -v command. For more information on this command, see the sn(1) man page.

Regular data distribution (performing only page-level placement of the array) is also limited to page-level allocation, but is useful when the page migration heuristics are slow and the desired distribution is known to the programmer.

Finally, reshaped data distribution changes the layout of the array. This overcomes the page-level allocation constraints, but it is useful only if a data structure has the same (static) distribution for the duration of the program. Given these differences, it may be necessary to use each of these methods for different data structures in the same program.

Figure 5-2. Cache behavior and solutions


Choosing a Tuning Method

For a given data structure in the program, you can choose between the automatic page migration method and the data distribution method. Base your choice on the following criteria:

  • If the program repeatedly references the data structure and benefits from reuse in the cache, data placement is not needed.

  • If the program incurs a large number of cache misses on the data structure, then you should identify the desired distribution in the array dimensions (such as BLOCK or CYCLIC) based on the desired parallelism in the program.

    The following example suggests an A(BLOCK, *) distribution:

    !$OMP PARALLEL DO
          DO I = 2, N
            DO J = 2, N
              A(I,J) = 3*I + 4*J + A(I, J-1)
            END DO
          END DO

    However, the following example suggests an A(*, BLOCK) distribution:

          DO I = 2, N
    !$OMP PARALLEL DO
            DO J = 2, N
              A(I,J) = 3*I + 4*J + A(I-1, J)
            END DO
          END DO

After identifying the desired distribution, you can select either regular or reshaped distribution based on the size of an individual processor's portion of the distributed array. Regular distribution is useful only if each processor's portion is substantially larger than the page size in the underlying system (16 KB on the Origin series systems). Otherwise, regular distribution is probably not useful, and you should use the !$SGI DISTRIBUTE_RESHAPE directive, which changes the layout of the array to overcome page-level constraints.

For example, consider the following code:

      REAL(KIND=8) A(M, N)
!$SGI DISTRIBUTE A(BLOCK, *)

In the preceding example, the size of each processor's portion is approximately m/P elements (8 × (m/P) bytes), where P is the number of processors. If m is 1,000,000, each processor's portion is likely to exceed a page and regular distribution is sufficient. However, if m is 10,000, the !$SGI DISTRIBUTE_RESHAPE directive is required to obtain the desired distribution.

In contrast, consider the following distribution:

!$SGI DISTRIBUTE A(*, BLOCK)

In the preceding example, the size of each processor's portion is approximately (m×n)/P elements (8 × (m×n)/P bytes). Therefore, if n is 100, for example, regular distribution may be sufficient even if m is only 10,000.

Distributing the outer dimensions of an array increases the size of an individual processor's portion (favoring regular distribution), but distributing the inner dimensions is more likely to require reshaped distribution.

The IRIX operating system on Origin series systems follows a default first-touch page-allocation policy. This means that each page is allocated from the local memory of the processor that incurs a page-fault on that page. Therefore, in programs where the array is initialized and is consequently first referenced in parallel, even a regular distribution directive may not be necessary, because the underlying pages are allocated from the desired memory location automatically due to the first-touch policy.
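
For example, the following sketch (an illustrative fragment) initializes an array inside a parallel loop; under the first-touch policy, each thread touches its own block of iterations first, so the pages it uses are allocated from its own local memory without any distribution directive:

      REAL(KIND=8) A(1000000)
      INTEGER I

! EACH THREAD TOUCHES ITS OWN BLOCK OF A FIRST, SO UNDER THE
! FIRST-TOUCH POLICY THOSE PAGES ARE ALLOCATED FROM ITS LOCAL MEMORY
!$OMP PARALLEL DO
      DO I = 1, 1000000
         A(I) = 0.0
      END DO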


Note: The OpenMP Fortran API does not describe the BLOCK or CYCLIC data distributions. These are Silicon Graphics extensions.


Directives for Performance Tuning

The MIPSpro 7 Fortran 90 compiler supports directives for performance tuning on Origin series systems. These directives are extensions to the OpenMP Fortran API. You must be licensed for the MIPSpro Automatic Parallelization Option in order for these directives to be recognized. In addition, the -mp or -pfa options must be in effect during compilation.

The directives supported are as follows:

  • !$SGI DISTRIBUTE

  • !$SGI DISTRIBUTE_RESHAPE

  • !$OMP PARALLEL DO

  • !$SGI DYNAMIC

  • !$SGI PAGE_PLACE

  • !$SGI REDISTRIBUTE


Note: The functionality of the preceding directives is the same as that provided in MIPSpro 7 Fortran 90 releases 7.2 and earlier; only the prefix has changed. Beginning with MIPSpro 7 Fortran 90 release 7.2.1, the older directive prefix is outmoded in favor of the !$SGI prefix.

The MIPSpro 7 Fortran 90 compiler supports several clauses to the preceding directives that are extensions to the OpenMP Fortran API. These clauses can be used with the preceding directives and with the standard directives described by OpenMP. To preserve portability, the clauses must be preceded by a !$SGI+ prefix and must appear on a separate line, as follows:

directive
!$SGI+clause
. . .
directive

Specify any OpenMP Fortran API directive or any Silicon Graphics parallel processing directive. The OpenMP directives are described in Chapter 4, “OpenMP Fortran API Multiprocessing Directives”, and the Silicon Graphics parallel processing directives are described in this chapter.

clause

Specify any of the clauses described in this chapter. There cannot be any intervening spaces between the plus sign (+) and the name of the clause.

The following code uses the Silicon Graphics NEST clause with the OpenMP Fortran API DO directive:

!$OMP PARALLEL
!$OMP DO
!$SGI+NEST (I,J)

      DO I = 1,10
         DO J = 1,10
            ! ... LOOP BODY ...
         ENDDO
      ENDDO

!$OMP ENDDO
!$OMP END PARALLEL
      END

The !$OMP PARALLEL DO directive is described in “Declare a Parallel Region: PARALLEL DO and END PARALLEL DO Directives” in Chapter 4. The following sections describe the syntax of the Silicon Graphics directives and clauses that are extensions to the OpenMP Fortran API.

Determining the Data Distribution for an Array: !$SGI DISTRIBUTE, !$SGI DISTRIBUTE_RESHAPE, and !$SGI REDISTRIBUTE

The !$SGI DISTRIBUTE directive determines the data distribution for an array. The !$SGI REDISTRIBUTE directive dynamically redistributes an array. The !$SGI DISTRIBUTE_RESHAPE directive performs data distribution with reshaping.

The formats of these directives are as follows:

!$SGI DISTRIBUTE array (dist1,dist2)
[ONTO (target1, target2[, targetN] ...)]
!$SGI DISTRIBUTE_RESHAPE array (dist1,dist2)
[ONTO (target1, target2[, targetN] ...)]
!$SGI REDISTRIBUTE array (dist1,dist2)
[ONTO (target1, target2[, targetN] ...)]
array

Specify the name of an array.

dist

Specify the type of distribution for each dimension of the named array. The number of dist arguments specified must be equal to the number of array dimensions. dist can be one of the following:

  • BLOCK. Indicates that BLOCK distribution should be used.

  • CYCLIC [(expr)]. If expr is not specified, a chunk size of 1 is assumed.

    For performance reasons, use a literal constant rather than a variable expression for expr whenever the chunk size is known at compile time.

  • An asterisk (*). Indicates that the dimension is not distributed.

target

Specify the target processor topology. This argument to the ONTO clause specifies how to partition the processors across the distributed dimensions. There must be one target argument specified for each BLOCK and CYCLIC distribution specified.

The Silicon Graphics data distribution directives and the !$OMP PARALLEL DO directive have an optional ONTO clause. The ONTO clause allows you to specify the processor topology when two (or more) dimensions of processors are required.

The following example array is distributed in two dimensions, so you can use the ONTO clause to specify how to partition the processors across the distributed dimensions:

! ASSIGN PROCESSORS IN THE RATIO 1:2 TO THE TWO
! DIMENSIONS OF ARRAY A
      REAL(KIND=8) A(100, 200)
!$SGI DISTRIBUTE A (BLOCK, BLOCK) ONTO (1, 2)

You can supply a !$SGI DISTRIBUTE directive on a dummy argument, thereby specifying the distribution of the incoming actual argument. If different calls to the subroutine have arguments with different distributions, you can omit the !$SGI DISTRIBUTE directive on the dummy argument. Data affinity loops in that subroutine are automatically implemented through a run-time lookup of the distribution. This is allowed only for regular data distribution. For reshaped arrays, the distribution must be fully specified on the dummy argument.

For more information on using the data distribution directives, see “Using the Data Distribution Directives”.


Note: The OpenMP Fortran API does not describe the BLOCK, *, or CYCLIC distribution; or the ONTO clause.


Specifying a Parallel Region: !$OMP PARALLEL DO

The !$OMP PARALLEL DO directive is part of the OpenMP Fortran API. It accepts the Silicon Graphics AFFINITY and NEST clauses as extensions, however.

The following sections describe the AFFINITY and NEST clauses. For information on the !$OMP PARALLEL DO directive, see “Declare a Parallel Region: PARALLEL DO and END PARALLEL DO Directives” in Chapter 4.

AFFINITY Clause

Affinity scheduling controls the mapping of iterations of a parallel loop for execution onto the underlying threads. The !$OMP PARALLEL DO directive with the AFFINITY clause must immediately precede the loop to which it applies, and it is in effect only for that loop.

An AFFINITY clause, if supplied, overrides an OpenMP SCHEDULE clause.

There are two types of affinity scheduling: data affinity and thread affinity.

An AFFINITY clause on an !$OMP PARALLEL DO directive has the following format:

!$OMP PARALLEL DO
!$SGI+AFFINITY(do_variable) = DATA(array_element)
!$OMP PARALLEL DO
!$SGI+AFFINITY(do_variable) = THREAD(expr)
do_variable

Specify one or more DO loop identifiers, separated by commas.

array_element

Enter an array element.

expr

Specify an expression that evaluates to a thread number. Each iteration of the do_variable loop is executed on that thread number, modulo the number of threads.

Because the threads may need to evaluate expr in each iteration of the loop, the variables used in the expr (other than the do_variable) must be declared SHARED and must not be modified during the execution of the loop. Violating these rules can lead to incorrect results. For information on declaring shared variables, see “SHARED Clause” in Chapter 4.

If the expr does not depend on the DO variable, all iterations execute on the same thread and do not benefit from parallel execution.

When -O3 is in effect, loops that reference reshaped arrays default to data affinity scheduling for the most frequently accessed reshaped array in the loop (chosen by the compiler). To override this behavior, you can explicitly specify the SCHEDULE clause on the !$OMP PARALLEL DO directive.

Data affinity for loops with nonunit stride can sometimes result in nonlinear affinity expressions. In such situations the compiler issues a warning, ignores the affinity clause, and defaults to STATIC scheduling.

Example 1. The following code shows an example of data affinity:

!$SGI DISTRIBUTE A(BLOCK)
!$OMP PARALLEL DO
!$SGI+AFFINITY(I) = DATA(A(X*I+Y))
      DO I = 1, N
        A(X*I+Y) = 0
      END DO

The multiplier X and the constant term Y in the affinity expression must both be literal constants, with X greater than zero.

This example distributes the iterations of the parallel loop to match the data distribution specified for array A, such that iteration I is executed on the processor that owns element A(X*I+Y) based on the distribution for A. The iterations are scheduled based on the specified distribution, and are not affected by the actual underlying data distribution, which may, for example, differ at page boundaries.

Example 2. In the case of a multidimensional array, affinity is provided for the dimension that contains the loop index variable. The loop index variable cannot appear in more than one dimension in an AFFINITY clause. In the following example, the loop is scheduled based on the block distribution of the first dimension:

      REAL(KIND=8) A(N+3, N)
!$SGI DISTRIBUTE A (BLOCK, CYCLIC(1))
!$OMP PARALLEL DO
!$SGI+AFFINITY(I) = DATA(A(I+3, J))
      DO I = 1, N
        DO J = 2, N
          A(I+3, J) = A(I+3, J-1)
        END DO
      END DO

Example 3. The following directive executes iteration I on the thread number given by the user-supplied expression (modulo the number of threads):

!$OMP PARALLEL DO
!$SGI+AFFINITY(I) = THREAD(expr)
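
For example, the following sketch (an illustrative fragment; K is assumed to be a SHARED variable that is not modified inside the loop) executes the iterations in contiguous chunks of K on successive threads:

!$OMP PARALLEL DO
!$SGI+AFFINITY(I) = THREAD((I-1)/K)
      DO I = 1, N
        A(I) = 0
      END DO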


Note: The OpenMP Fortran API does not describe the AFFINITY clause.


NEST Clause

The NEST clause on the !$OMP PARALLEL DO directive allows you to exploit nested concurrency in a limited manner. Although true nested parallelism is not supported, you can exploit parallelism across iterations of a perfectly nested loop nest.

The NEST clause to the !$OMP PARALLEL DO directive has the following format:

!$OMP PARALLEL DO
!$SGI+NEST (do_variable, do_variable[, do_variable] ...)
[ONTO (target1, target2[, targetN] ...)]
do_variable

Specify a do_variable name that identifies a subsequent loop. At least two do_variable names must be specified. The loops identified must be perfectly nested.

target

Specify the target processor topology. The ONTO clause allows you to specify the processor topology when two (or more) dimensions of processors are required. This argument specifies how to partition the processors across the distributed dimensions. target can be either an integer expression or an asterisk (*).

Example 1. In a nested !$OMP PARALLEL DO with two or more nested loops, you can use the ONTO clause to specify the partitioning of processors across the multiple parallel loops, as follows:

! USE 2 PROCESSORS IN THE OUTER LOOP,
! AND THE REMAINING IN THE INNER LOOP
!$OMP PARALLEL DO
!$SGI+NEST(I, J) ONTO(2, *)
      DO I = 1, N
        DO J = 1, M
          A(J,I) = ...
        END DO
      END DO

Example 2. The following directive specifies that the entire set of iterations across both loops can be executed concurrently:

!$OMP PARALLEL DO
!$SGI+NEST(I, J)
      DO I = 1, N
        DO J = 1, M
          A(I,J) = 0
        END DO
      END DO

It is restricted, however, in that loops I and J must be perfectly nested. No code is allowed between either the DO I ... and DO J ... statements or between the END DO statements.

You can combine a nested !$OMP PARALLEL DO directive with an AFFINITY clause or with a SCHEDULE clause specified as STATIC; STATIC scheduling is the default except when accessing reshaped arrays. DYNAMIC, RUNTIME, and GUIDED scheduling are not supported.

For more information on the AFFINITY clause, see “AFFINITY Clause”. For more information on the SCHEDULE clause see “Specify Parallel Execution: DO and END DO Directives” in Chapter 4.
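
For example, the following sketch (an illustrative fragment; the SCHEDULE clause is written on the standard OpenMP directive line) explicitly requests STATIC scheduling for a nested loop:

!$OMP PARALLEL DO SCHEDULE(STATIC)
!$SGI+NEST(I, J)
      DO I = 1, N
        DO J = 1, M
          A(I,J) = I + J
        END DO
      END DO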

The following code uses an AFFINITY clause:

!$OMP PARALLEL DO
!$SGI+NEST(I, J) AFFINITY(I,J) = DATA(A(I,J))
      DO I = 2, N-1
        DO J = 2, M-1
          A(I,J) = A(I,J) + I*J
        END DO
      END DO


Note: The OpenMP Fortran API does not describe the NEST clause.


Requesting Dynamic Distribution for an Array: !$SGI DYNAMIC

The !$SGI DYNAMIC directive informs the compiler that a particular array can be dynamically redistributed. This directive is required in any procedure that contains an !$OMP PARALLEL DO loop with data affinity for an array that may have been redistributed.

By default, the compiler assumes that a distributed array is not dynamically redistributed, and it directly schedules a parallel loop for the specified data affinity. In contrast, a redistributed array can have multiple possible distributions, and data affinity for a redistributed array must be implemented in the run-time system based on the particular distribution.

However, the compiler cannot always determine whether an array is redistributed, because the redistribution may occur in another procedure or in another file. Therefore, you must explicitly supply the !$SGI DYNAMIC declaration for a redistributed array; this tells the compiler that the array can be dynamically redistributed, and data affinity for the array is then implemented through a run-time lookup rather than at compile time. The directive is needed only in those procedures that contain an !$OMP PARALLEL DO loop with data affinity for that array. If you know that an array has a single specified distribution for the duration of a procedure, you can omit the !$SGI DYNAMIC directive in that procedure; the result is more efficient compile-time affinity scheduling.

The format of this directive is as follows:

!$SGI DYNAMIC (array)
array

Specify the name of an array.
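
The following sketch (an illustrative fragment) shows the typical usage pattern: the array is given an initial distribution, is declared dynamic because it may be redistributed elsewhere in the program, and data affinity for it is then resolved at run time:

      REAL(KIND=8) A(1000, 1000)
!$SGI DISTRIBUTE A (BLOCK, *)
!$SGI DYNAMIC A
! DATA AFFINITY FOR A IS IMPLEMENTED THROUGH A RUN-TIME LOOKUP
!$OMP PARALLEL DO
!$SGI+AFFINITY(I) = DATA(A(I, J))
      DO I = 1, 1000
        DO J = 1, 1000
          A(I, J) = 0.0
        END DO
      END DO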

The run-time lookup incurs some extra overhead compared to a direct compile-time implementation. Because the compiler assumes that a distributed array is not redistributed at run time, the distribution is known at compile time, and data affinity for the array can be implemented directly by the compiler. In contrast, because a redistributed array can have multiple possible distributions at run time, data affinity for a redistributed array is implemented in the run-time system based on the distribution at run time, incurring extra run-time overhead.

You can avoid this overhead when a procedure contains data affinity for a redistributed array and the distribution of the array for the entire duration of that procedure is known. In this situation, you can supply the !$SGI DISTRIBUTE directive with the particular distribution and omit the !$SGI DYNAMIC directive.

Because reshaped arrays cannot be dynamically redistributed, this is an issue only for regular data distribution.

Designating Memory: !$SGI PAGE_PLACE

The !$SGI PAGE_PLACE directive allows you to explicitly place data structures in the physical memory of a particular processor. This directive is useful when dealing with irregular data structures such as pointers and sparse-matrix arrays.

The format of this directive is as follows:

!$SGI PAGE_PLACE (object, size, threadnum)
object 

Specify the name of the object.

size 

Specify the size of object, in bytes.

threadnum 

Specify the processor number upon which object is to be placed.

This directive causes all the pages spanned by the virtual address range (address to address+size) to be allocated from the local memory of processor number threadnum. It is an executable statement; therefore, you can use it to place either statically or dynamically allocated data. This directive is only a performance hint; it does not allocate memory, and it has no effect on the virtual address space of the program.

An example of this directive is as follows:

      REAL(KIND=8) A(100)
!$SGI PAGE_PLACE (A, 800, 3)
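
Because the directive is an executable statement, it can also be applied after a dynamic allocation. The following sketch (illustrative sizes and thread number) places a newly allocated array in the local memory of thread 2:

      REAL(KIND=8), ALLOCATABLE :: B(:)

      ALLOCATE(B(5000))
! PLACE ALL 5000 ELEMENTS (8 BYTES EACH, 40000 BYTES TOTAL) ON THREAD 2
!$SGI PAGE_PLACE (B, 40000, 2)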

Using the Data Distribution Directives

The data distribution directives, !$SGI DISTRIBUTE, !$SGI REDISTRIBUTE, and !$SGI DISTRIBUTE_RESHAPE, allow you to specify distributions for array data structures. For irregular data structures, you can instead use the !$SGI PAGE_PLACE directive to place data explicitly in the memory of a specific processor.

The !$SGI DISTRIBUTE, !$SGI DYNAMIC, and !$SGI DISTRIBUTE_RESHAPE directives are declarations that must be specified in the declaration part of the program, along with the array declaration. The !$SGI REDISTRIBUTE directive is an executable statement and can appear in any executable portion of the program.

You can specify a data distribution directive for any local, global, or common block array. Each dimension of a multidimensional array can be independently distributed. The possible distribution types for an array dimension are BLOCK, CYCLIC[(expr)], and *, as follows:

  • As shown in Figure 5-3, a BLOCK distribution is one that partitions the elements of the dimension of size N into P blocks (one per processor), with each block of size B = ceiling (N/P).

    Figure 5-3. Block distribution


  • A CYCLIC distribution can include an expr to indicate the chunk size. A chunk size that is either greater than 1 or is determined at run time is sometimes also called BLOCK-CYCLIC.

  • The * distribution indicates that the array is not distributed.

As shown in Figure 5-4, a CYCLIC[(expr)] distribution partitions the elements of the dimension into pieces of size expr each and distributes them sequentially across the processors:

Figure 5-4. Cyclic distribution


A distributed array is distributed across all of the processors being used in that particular execution of the program, as determined by the OMP_NUM_THREADS environment variable. If a distributed array is distributed in more than one dimension, then by default the processors are apportioned as equally as possible across each distributed dimension. For example, if an array has two distributed dimensions, then an execution with 16 processors assigns 4 processors to each dimension (4 x 4=16), whereas an execution with 8 processors assigns 4 processors to the first dimension and 2 processors to the second dimension. You can override this default and explicitly control the number of processors in each dimension using the ONTO clause with a data distribution directive.
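
For example, the following sketch (an illustrative declaration) requests a 1:2 ratio of processors between the two distributed dimensions, reversing the default apportioning; on 8 processors this assigns 2 processors to the first dimension and 4 to the second:

      REAL(KIND=8) A(800, 800)
! ASSIGN PROCESSORS IN THE RATIO 1:2 TO THE TWO DIMENSIONS OF A
!$SGI DISTRIBUTE A (BLOCK, BLOCK) ONTO (1, 2)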

Regular Data Distribution

The DISTRIBUTE and REDISTRIBUTE data distribution directives achieve the desired distribution by influencing the mapping of virtual addresses to physical pages without affecting the layout of the data structure. Because the granularity of data allocation is a physical page (at least 16 KB), the achieved distribution is limited by the underlying page granularity. However, the advantages are that regular data distribution directives can be added to an existing program without any restrictions, and they can be used for affinity scheduling.

For example, the following directive dynamically redistributes array A:

!$SGI REDISTRIBUTE A (BLOCK, CYCLIC(K))

The !$SGI REDISTRIBUTE directive is an executable statement that changes the distribution permanently (or until another !$SGI REDISTRIBUTE statement). It also affects subsequent affinity scheduling.

The !$SGI DYNAMIC directive specifies that the named array is redistributed in the program, and is useful in controlling affinity scheduling for dynamically redistributed arrays.

For more information on the !$SGI REDISTRIBUTE and !$SGI DYNAMIC directives, see “Determining the Data Distribution for an Array: !$SGI DISTRIBUTE, !$SGI DISTRIBUTE_RESHAPE, and !$SGI REDISTRIBUTE”, and “Requesting Dynamic Distribution for an Array: !$SGI DYNAMIC”.

Data Distribution with Reshaping

Similar to regular data distribution, the !$SGI DISTRIBUTE_RESHAPE directive specifies the desired distribution of an array. In addition, however, it declares that the program makes no assumptions about the storage layout of that array. The compiler performs aggressive optimizations for reshaped arrays that may violate standard Fortran layout assumptions, but it guarantees the desired data distribution for that array.

As shown in the following example, the !$SGI DISTRIBUTE_RESHAPE directive accepts the same distributions as the regular data distribution directive:

!$SGI DISTRIBUTE_RESHAPE A (BLOCK, CYCLIC(1))

Restrictions on Reshaped Arrays

Because the !$SGI DISTRIBUTE_RESHAPE directive specifies that the program does not depend on the storage layout of the reshaped array, restrictions on the arrays that can be reshaped include the following:

  • Deferred-shape and assumed-shape arrays (pointers, allocatable arrays, and assumed-shape dummy arguments) cannot be reshaped.

  • The distribution of a reshaped array cannot be changed dynamically (that is, there is no REDISTRIBUTE_RESHAPE directive).

  • Initialized data cannot be reshaped.

  • Arrays that are explicitly allocated through the alloca(3c) or MALLOC(3f) routines and accessed through Cray pointers cannot be reshaped.

  • An array that is equivalenced to another array cannot be reshaped.

  • I/O for a reshaped array cannot be mixed with namelist I/O or a function call in the same I/O statement.

  • A common block that contains a reshaped array cannot be declared THREADPRIVATE. For more information on the THREADPRIVATE directive, see “Declare Common Blocks Private to a Thread: THREADPRIVATE Directive” in Chapter 4.


Caution: A common block containing a reshaped array cannot be loaded with the -Wl,-Xlocal option. This user error is not detected by the compiler or loader.

There are two possible situations when a reshaped array is passed as an actual argument to a subroutine:

  • The array is passed in its entirety; that is, CALL FUNC(A) passes the entire array A, whereas CALL FUNC(A(I,J)) passes a portion of A. The compiler automatically clones a copy of the called subroutine and compiles it for the incoming distribution. The actual arguments and dummy arguments must match in the number of dimensions and the size of each dimension.

    You can restrict a subroutine to accept a particular reshaped distribution on a parameter by specifying a !$SGI DISTRIBUTE_RESHAPE directive on the dummy argument within the subroutine. All calls to this subroutine with a mismatched distribution will lead to compile-time or load-time errors.

  • A portion of the array can be passed as an actual argument, but the callee must access only a single processor's portion. If the callee exceeds a single processor's portion, the results are undefined. You can use the intrinsics described on the MP(3f) man page under the heading Query Intrinsics for Distributed Arrays to find details about the array distribution.

Error Detection for Reshaped Arrays

Most errors in accessing reshaped arrays are detected either at compile time or at load time. These errors include:

  • Inconsistencies in reshaped arrays across common blocks (including across files)

  • Using the EQUIVALENCE statement to declare a reshaped array as equivalent to another array

  • Inconsistencies in reshaped distributions on actual and dummy arguments

  • Other errors such as disallowed I/O statements involving reshaped arrays, reshaping initialized data, or reshaping dynamically allocated data

Errors such as a mismatch in the declared size of an array dimension typically can be caught only at run time. You can use the -MP:CHECK_RESHAPE=ON option on the f90(1) command to perform these tests at run time. These run-time checks are not generated by default because they incur overhead, but they are useful during program development.

The types of run-time checks performed can detect the following:

  • Inconsistencies in array bounds declarations on each actual and dummy argument

  • Inconsistencies in declared bounds of a dummy argument that corresponds to a portion of a reshaped actual argument

Implementation of Reshaped Arrays

The compiler transforms a reshaped array into a pointer to a processor array. The processor array has one element per processor, with the element pointing to the portion of the array local to the corresponding processor.

Figure 5-5 shows the effect of a !$SGI DISTRIBUTE_RESHAPE directive with a BLOCK distribution on a one-dimensional array, as follows:

      REAL A(N)
!$SGI DISTRIBUTE_RESHAPE A(BLOCK)

N is the size of the array dimension, P is the number of processors, and B is the block-size on each processor, CEILING(N/P).

Figure 5-5. Implementation of the !$SGI DISTRIBUTE_RESHAPE A(BLOCK) distribution directive


With this implementation, the array reference A(I) is transformed into the two-dimensional reference A[I/B][I%B] (in C syntax with C dimension order), where B is the size of each block, given by CEILING(N/P). Thus A[I/B] points to a processor's local portion of the array, and A[I/B][I%B] refers to a specific element within that local portion.
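
As an illustration of this mapping, suppose N = 100 and P = 4, so that B = CEILING(100/4) = 25; treating the index as zero-based in the C-syntax form, element I = 60 maps to A[60/25][60%25] = A[2][10], that is, offset 10 within the portion owned by processor 2.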

A CYCLIC distribution with a chunk size of 1 is implemented as shown in Figure 5-6.

Figure 5-6. Implementation of the !$SGI DISTRIBUTE_RESHAPE A(CYCLIC(1)) distribution directive


An array reference, A(I), is transformed to A[I%P][I/P], where P is the number of threads in that distributed dimension.

Finally, a CYCLIC distribution with a chunk size that is either a constant greater than 1 or a run-time value (also called BLOCK-CYCLIC) is implemented as shown in Figure 5-7.

Figure 5-7. Implementation of the !$SGI DISTRIBUTE_RESHAPE A(CYCLIC(K)) directive (a BLOCK-CYCLIC Distribution)


An array reference, A(I), is transformed to the three-dimensional reference A[(I/K)%P][I/(PK)][I%K], where P is the total number of threads in that dimension and K is the chunk size.
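
As an illustration, with P = 4 threads and a chunk size of K = 10, the zero-based element I = 57 maps to A[(57/10)%4][57/40][57%10] = A[1][1][7]: chunk number 5 belongs to thread 1, it is the second chunk assigned to that thread, and the element lies at offset 7 within that chunk.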

The compiler tries to optimize these divide/modulo operations out of inner loops through aggressive loop transformations such as blocking and peeling.

Regular versus Reshaped Data Distribution

Regular distributions have an advantage in that they do not impose any restrictions on the distributed arrays and can be freely applied in existing codes. Furthermore, they work well for distributions where page granularity is not a problem. For example, consider a BLOCK distribution of the columns of a two-dimensional Fortran array of size A(R,C) (column-major layout) and distribution (*, BLOCK). If the size of each processor's portion, CEILING(C/P) × R × element_size, is significantly greater than the page size (16 KB on Origin2000 systems), then regular data distribution should be effective in placing the data in the desired fashion.
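
For instance, with illustrative sizes R = 1000 and C = 400 on P = 8 processors and 8-byte elements, each processor's portion is CEILING(400/8) × 1000 × 8 = 400,000 bytes, far larger than a 16 KB page.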

However, regular data distribution is limited by page-granularity considerations. For instance, consider a (BLOCK,BLOCK) distribution of a two-dimensional array in which the size of a column is much smaller than a page. Each physical page is likely to contain data belonging to multiple processors, making the data distribution quite ineffective. However, data distribution may still be useful from the standpoint of affinity scheduling considerations.

Reshaped data distribution addresses the problems of regular distributions by changing the layout of the array in memory to guarantee the desired distribution. However, because the array no longer conforms to standard Fortran storage layout, there are restrictions on the usage of reshaped arrays.

Given both types of data distribution, you can choose between the two based on the characteristics of the particular array in an application.

Examples

The following sections provide several examples of data distribution and affinity scheduling.

Distributing Columns of a Matrix

Example 1. This example distributes the columns of a matrix in a round-robin fashion across threads. Such a distribution places data effectively only if the size of an individual column exceeds that of a page.

      REAL(KIND=8) A(N, N)
! DISTRIBUTE COLUMNS IN CYCLIC FASHION
!$SGI DISTRIBUTE A (*, CYCLIC(1))

! PERFORM GAUSSIAN ELIMINATION ACROSS COLUMNS
! THE AFFINITY CLAUSE DISTRIBUTES THE LOOP ITERATIONS BASED
! ON THE COLUMN DISTRIBUTION OF A
      DO I = 1, N
!$OMP PARALLEL DO
!$SGI+AFFINITY(J) = DATA(A(I,J))
        DO J = I+1, N
!         ... REDUCE COLUMN J BY COLUMN I ...
        END DO
      END DO

If the columns are smaller than a page, it may be beneficial to reshape the array. This is specified by using a !$SGI DISTRIBUTE_RESHAPE directive in place of the !$SGI DISTRIBUTE directive.
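
That is, the distribution directive in the preceding example would become:

!$SGI DISTRIBUTE_RESHAPE A (*, CYCLIC(1))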

In addition to overcoming size constraints as shown in the preceding example, the !$SGI DISTRIBUTE_RESHAPE directive is useful when the desired distribution is contrary to the layout of the array.

Example 2. This example uses the !$SGI DISTRIBUTE_RESHAPE directive to distribute the rows of a two-dimensional matrix. It shows how to overcome the storage layout constraints to provide the desired distribution.

      REAL(KIND=8) A(N, N)
! DISTRIBUTE ROWS IN BLOCK FASHION
!$SGI DISTRIBUTE_RESHAPE A (BLOCK, *)
      REAL(KIND=8) SUM(N)
!$SGI DISTRIBUTE SUM(BLOCK)

! PERFORM SUM-REDUCTION ON THE ELEMENTS OF EACH ROW
!$OMP PARALLEL DO PRIVATE (J)
!$SGI+AFFINITY(I) = DATA(A(I,J))
      DO I = 1,N
        DO J = 1,N
          SUM(I) = SUM(I) + A(I,J)
        ENDDO
      ENDDO

Using Data Distribution and Data Affinity Scheduling

The following example demonstrates regular data distribution and data affinity. This example, run on a 4-processor Origin2000 server, uses simple block scheduling. Processor 0 calculates the values of the first 250,000 elements of A, processor 1 calculates the second 250,000 values of A, and so on. Arrays B and C are initialized using one processor. Therefore, all of the memory pages are touched by the master processor (processor 0) and are placed in processor 0's local memory.

Using data distribution changes the placement of memory pages for arrays A, B, and C to match the data reference pattern.

Without data distribution:

      REAL(KIND=8) A(1000000), B(1000000)
      REAL(KIND=8) C(1000000)
      INTEGER I

!$OMP PARALLEL SHARED(A, B, C) PRIVATE(I)
!$OMP DO
      DO I = 1, 1000000
         A(I) = B(I) + C(I)
      END DO
!$OMP END PARALLEL

With data distribution:

      REAL(KIND=8) A(1000000), B(1000000)
      REAL(KIND=8) C(1000000)
      INTEGER I
!$SGI DISTRIBUTE A(BLOCK), B(BLOCK), C(BLOCK)

!$OMP PARALLEL SHARED(A, B, C) PRIVATE(I)
!$OMP DO
!$SGI+AFFINITY(I) = DATA(A(I))
      DO I = 1, 1000000
         A(I) = B(I) + C(I)
      END DO
!$OMP END PARALLEL

Argument Passing

The following code shows how a distributed array can be passed as an argument to a subroutine that has a matching declaration for the dummy argument:

      REAL(KIND=8) A(M, N)
!$SGI DISTRIBUTE_RESHAPE A (BLOCK, *)
      CALL FOO(A, M, N)
      END

      SUBROUTINE FOO(A, P, Q)
      INTEGER P, Q
      REAL(KIND=8) A(P, Q)
!$SGI DISTRIBUTE_RESHAPE A (BLOCK, *)
!$OMP PARALLEL DO
!$SGI+AFFINITY(I) = DATA(A(I, J))
        DO I = 1, P
!         ... OPERATE ON ROW I OF A ...
        END DO
      END

Because the array is reshaped, the !$SGI DISTRIBUTE_RESHAPE directive in the caller and the callee must match exactly. Furthermore, all calls to subroutine FOO must pass in an array with the exact same distribution.

If the array was only distributed (but not reshaped) in the preceding example, then subroutine FOO could be called from different places with different incoming distributions. In that case, you could omit the distribution directive on the dummy argument, thereby ensuring that any data affinity within the loop is based on the distribution (at run time) of the incoming actual argument, as shown in this example:

      REAL(KIND=8) A(M, N), B(P, Q)
!$SGI DISTRIBUTE A (BLOCK, *)
!$SGI DISTRIBUTE B (CYCLIC(1), *)
      CALL FOO(A, M, N)
      CALL FOO(B, P, Q)
! ---------------------------------------------------------
      SUBROUTINE FOO(X, S, T)
      REAL(KIND=8) X(S, T)

!$OMP PARALLEL DO
!$SGI+AFFINITY(I) = DATA(X(I+2, J))
      DO I =
        ...
      END DO

Redistributed Arrays

Example 1. The following example shows how an array is redistributed at run time:

      SUBROUTINE BAR(X, N)
      REAL(KIND=8) X(N, N)
      ...
!$SGI REDISTRIBUTE X (*, CYCLIC(expr))
      ...
      END
!---------------------------------------------------------
      SUBROUTINE FOO
      REAL(KIND=8) LOCALARRAY(1000, 1000)
!$SGI DISTRIBUTE LOCALARRAY (*, BLOCK)
! THE CALL TO SUBROUTINE BAR MAY REDISTRIBUTE LOCALARRAY
!$SGI DYNAMIC LOCALARRAY
      ...
      CALL BAR(LOCALARRAY, 100)
! THE DISTRIBUTION FOR THE FOLLOWING DOACROSS
! IS NOT KNOWN STATICALLY
!$OMP PARALLEL DO
!$SGI+AFFINITY(I) = DATA(LOCALARRAY(I, J))
      ...
      END

Example 2. The following example illustrates a situation in which the !$SGI DYNAMIC directive can be optimized away. The main routine contains local array A that is both distributed and dynamically redistributed. This array is passed as an argument to FOO before being redistributed and to BAR after being (possibly) redistributed. The incoming distribution for FOO is statically known; you can specify a !$SGI DISTRIBUTE directive on the dummy argument, thereby obtaining more efficient static scheduling for the !$OMP PARALLEL DO directive with data affinity. The subroutine BAR, however, can be called with multiple distributions, requiring run-time scheduling of the !$OMP PARALLEL DO loop.

      PROGRAM MAIN
!$SGI DISTRIBUTE A (BLOCK, *)
!$SGI DYNAMIC A
      CALL FOO(A)      
      IF (X .NE. 17) THEN
!$SGI REDISTRIBUTE A (CYCLIC(X), *)
      END IF
      CALL BAR(A)
      END

      SUBROUTINE FOO (A)
!Incoming distribution is known to the user
!$SGI DISTRIBUTE A(BLOCK, *)
!$OMP PARALLEL DO
!$SGI+AFFINITY(I) = DATA(A(I, J))
      ...
      END

      SUBROUTINE BAR(A)
!Incoming distribution is not known statically
!$SGI DYNAMIC A
!$OMP PARALLEL DO
!$SGI+AFFINITY(I) = DATA(A(I, J))
      ...
      END

Irregular Distributions and Thread Affinity

This example consists of a large array that is conceptually partitioned into unequal portions, one for each processor. This array is indexed through index array IDX, which stores the starting index value and the size of each processor's portion.

      REAL(KIND=8) A(N)
! IDX ---> INDEX ARRAY CONTAINING START INDEX INTO A (IDX(P, 0))
! AND SIZE (IDX(P, 1)) FOR EACH PROCESSOR
      INTEGER IDX(0:P-1, 0:1)
!$SGI PAGE_PLACE (A(IDX(0, 0)), IDX(0, 1)*8, 0)
!$SGI PAGE_PLACE (A(IDX(1, 0)), IDX(1, 1)*8, 1)
!$SGI PAGE_PLACE (A(IDX(2, 0)), IDX(2, 1)*8, 2)
      ...
!$OMP PARALLEL DO
!$SGI+AFFINITY(I) = THREAD(I)
      DO I = 0, P-1
!     ... PROCESS ELEMENTS ON PROCESSOR I
!     ... A(IDX(I, 0)) TO A(IDX(I,0)+IDX(I,1))
      END DO