INDEX ARRAY FLATTENING THROUGH PROGRAM TRANSFORMATION

Raja Das, Joel Saltz, University of Maryland, College Park; Reinhard van Hanxleden, Ken Kennedy, Chuck Koelbel, Rice University

This group has developed compiler techniques to transform loops with array accesses involving complex indirection patterns into loops where array references have at most one level of indirection. Such code transformations are likely to improve performance on architectures where hardware support exists for vectorized gather/scatter methods. Further, in architectures where it is profitable to prefetch data between levels of memory hierarchy, such transformation techniques will be extremely profitable.

An increasing number of parallel applications make heavy use of indirection arrays for indexing data arrays, which make it difficult for a compiler to generate efficient parallel code. The problems with existing techniques addressing this problem is that they are only applicable when programmers restrict themselves to a very limited set of programming idioms (e.g., indexing of the form x(ia(i)), where i is the loop index). However, many codes using sparse data structures access their data through multiple levels of indirection. The group has developed methods for transforming programs using multiple levels of indirection, particularly by performing program slicing on the subscript expressions of the indirect array accesses. These slices individually peel off the levels of indirection and create opportunities for aggregated data movement between processor memories.

The group has implemented these transformations as a part of the Fortran 77D compiler, making the compiler more robust and able to handle more irregular application codes. A number of kernels derived from various irregular applications have been successfully parallelized. The applications are defined as irregular because they cannot be parallelized using existing compilation techniques without a severe degradation in performance.

Using the CHAOS runtime support system, timing results from two kernels were compared between versions that were parallelized using the transformation techniques and those where CHAOS library calls were manually inserted. (The group is confident that the hand kernels were reasonably optimized.) One of the kernels was extracted from an unstructured grid Euler solver, EUL3D. In this case, for both the hand and the compiler parallelized version, the data was partitioned using coordinate bisection. Iteration space was partitioned using the on- clause directive. The input to the kernel was an unstructured mesh (9428 points). The other kernel was extracted from a molecular dynamics code called CHARMM. The compiler-parallelized version was compared against an optimized hand-parallelized code. In the hand-parallelized code, both the iteration and data was block partitioned. For the compiler-parallelized version for data partitioning, both block and binary dissection partitioning and the on-clause directive were used to override the default iteration space partition. The input data for the program was small (648 atoms). Hence, after 16 processors, there is no reduction in computation time with an increase in the number of processors.

Performance results demonstrated the feasibility of the transformation techniques and showed that in the cases that have been studied, the performance obtained by using the transformations is only modestly less effective than the performance of a good hand-parallelized version. For both the kernels, the hand-parallelized codes performed better because it did less data communication and also the number of messages exchanged between processors was less compared to the compiler generated code. This effort is part of the Maryland, Rice, and Syracuse joint project to develop compilers and runtime support for irregular applications.

Table of Contents