


Linear Algebra

Jack Dongarra (director), Herb Keller, Roldan Pozo, Danny Sorensen, and Eric Van de Velde

Several areas in linear algebra must be addressed to allow wide use of parallel computing. These areas include the development of LAPACK for distributed-memory machines, dense nonsymmetric eigenvalue problems, parallel algorithms for large-scale eigenvalue problems, sparse linear least squares, multigrid algorithms, sparse linear systems, and linear algebra for signal processing.

Jack Dongarra holds a joint appointment with the University of Tennessee (UT) and Oak Ridge National Laboratory (ORNL) under the UT/ORNL Science Alliance Program. He specializes in numerical algorithms in linear algebra, parallel computing, use of advanced computers, programming methodology, and tools for parallel computers. Current research also involves the development, testing, and documentation of high-quality mathematical software. He was involved in the design and implementation of the software packages EISPACK, LINPACK, the BLAS, LAPACK, and PVM, and is currently involved in the design of algorithms and techniques for high-performance computer architectures. He is also co-editor-in-chief of the International Journal of Supercomputer Applications and an editor for Parallel Computing, the Journal of Distributed and Parallel Computing, the Journal of Supercomputing, IMPACT of Computing in Scientific Applications, Communications of the ACM, Research Monographs on Parallel and Distributed Computing, and the Journal of Numerical Linear Algebra with Applications.

Framework for Library Development on Advanced Architectures

One of the important areas for software development is the design and implementation of a mathematical software library that is scalable on parallel computers. Researchers are developing a prototype scalable library for solving the major problems of numerical linear algebra.
They are initially looking at dense matrix algorithms and will later focus on sparse structures.

LAPACK for Distributed-memory Machines

The module approach of LAPACK for shared-memory MIMD architectures is being adapted for distributed-memory MIMD machines. It would be beneficial to have the same software superstructure for the solution of linear equations on both classes of parallel machines. The Linear Algebra group plans to use algorithms based upon blocking techniques that take advantage of surface-to-volume effects and allow effective reuse of data in local memory. This approach can lead to optimal algorithms on supercomputers with modest amounts of parallelism, as well as on distributed-memory machines where the levels of parallelism are much greater (e.g., the hypercube). Portability will be achieved, without sacrificing performance, through a collection of routines that spans a wide range of architectures.

Previous efforts have focused on the development of a set of linear algebra communication routines for the specification and manipulation of distributed data structures in algorithms. The result has been a proposed standard for the implementation of a scalable communication library, which is the first step in seeking a community consensus on such standards.

Message-passing Standards

Researchers are involved in the formation of a standard for performing message passing between pairs of processes in a MIMD distributed-memory computing system. A small set of typed message-passing routines forms the core of the standard and is augmented by support for features such as noncontiguous messages, communication contexts, and process groups.

The project provides parallel hardware vendors with a clearly defined base set of routines that can be efficiently implemented. As a result, vendors can build upon this collection of standard low-level routines to create higher-level routines for the distributed-memory communication environments supplied with their parallel machines.
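The flavor of such a typed point-to-point layer can be sketched in a few lines. The sketch below uses Python threads and queues to stand in for processes; the names (Comm, send, recv) and the (source, tag, payload) message triple are illustrative only, not the proposed standard's actual interface.

```python
import threading
import queue

class Comm:
    """A toy 'communicator': a private communication context holding
    one mailbox per rank, so messages in different contexts never mix."""
    def __init__(self, nranks):
        self.boxes = [queue.Queue() for _ in range(nranks)]

    def send(self, payload, dest, tag, source):
        # Typed message: the tag lets a receiver distinguish message kinds.
        self.boxes[dest].put((source, tag, payload))

    def recv(self, rank):
        # Blocking receive; returns (source, tag, payload).
        return self.boxes[rank].get()

def worker(comm, rank, out):
    if rank == 0:
        comm.send([1.0, 2.0, 3.0], dest=1, tag=7, source=0)
    else:
        src, tag, data = comm.recv(rank=1)
        out.append((src, tag, sum(data)))

comm = Comm(nranks=2)
out = []
threads = [threading.Thread(target=worker, args=(comm, r, out)) for r in (0, 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(out)  # [(0, 7, 6.0)]
```

Higher-level routines (e.g., broadcasts or the linear algebra communication routines mentioned above) would be layered on top of such a base set.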
Researchers are developing a simple-to-use interface that is powerful enough to allow programmers to use the high-performance message-passing operations available on advanced machines. In an effort to create a "true" standard for message passing, researchers incorporated into the message-passing interface the most useful features of several existing systems. Rather than adopting one system as the standard, they drew on features from systems by IBM, Intel, and nCUBE, as well as Express, p4, and PARMACS. The message-passing paradigm is attractive because it is widely portable: it can be used for communication on distributed-memory and shared-memory multiprocessors, on networks of workstations, and on any combination of these elements. The paradigm will not be made obsolete by increases in network speeds or by architectures combining shared- and distributed-memory components.

Large-scale Eigenvalue Problems

Iterative methods for solving very large-scale algebraic eigenvalue problems and linear systems are essential to the numerical solution of many very large-scale applications. Basic research on algorithms for the iterative solution of eigenproblems and linear systems is still needed; nonsymmetric problems clearly require the most attention. During the first three years after the establishment of the CRPC, an approach to the numerical solution of large-scale eigenproblems was developed based upon a new variant of Arnoldi's method. This approach has provided a unified computational framework for the development of numerical software for the iterative solution of large-scale eigenproblems.

The Arnoldi process is a well-known technique for approximating a few eigenvalues and corresponding eigenvectors of a general square matrix. Researchers have developed ARPACK, a software package based upon an implicit restarting scheme that may be viewed as a truncation of the standard implicitly shifted QR iteration for dense problems.
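The basic k-step Arnoldi process underlying this scheme can be sketched as follows. This is a plain pure-Python illustration of the idea (orthonormal Krylov basis Q plus a small upper Hessenberg matrix H whose eigenvalues approximate those of A), not ARPACK's implementation, and it omits the implicit restarting that distinguishes ARPACK.

```python
import math

def matvec(A, v):
    # Dense matrix-vector product; in practice this is the user's sparse matvec.
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def arnoldi(A, v, k):
    """k steps of Arnoldi: Q spans the Krylov space {v, Av, ..., A^(k-1)v}."""
    nrm = math.sqrt(dot(v, v))
    Q = [[x / nrm for x in v]]                 # q_1 = v / ||v||
    H = [[0.0] * k for _ in range(k + 1)]      # (k+1) x k upper Hessenberg
    for j in range(k):
        w = matvec(A, Q[j])
        for i in range(j + 1):                 # modified Gram-Schmidt
            H[i][j] = dot(Q[i], w)
            w = [x - H[i][j] * q for x, q in zip(w, Q[i])]
        H[j + 1][j] = math.sqrt(dot(w, w))
        if H[j + 1][j] < 1e-12:
            break                              # invariant subspace found
        Q.append([x / H[j + 1][j] for x in w])
    return Q, H

A = [[2.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 4.0]]
Q, H = arnoldi(A, [1.0, 1.0, 1.0], k=2)
# Q now holds orthonormal basis vectors; the eigenvalues of the leading
# k x k part of H (the Ritz values) approximate eigenvalues of A.
```

Note that the matrix A enters only through matvec, which is why the work in such schemes is dominated by matrix-vector products.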
Numerical difficulties and storage problems normally associated with the Arnoldi and Lanczos processes are avoided. The algorithm is capable of computing a few (k) eigenvalues with user-specified features, such as largest real part or largest magnitude, using only 2nk + O(k^2) storage; no auxiliary storage is required. A set of Schur basis vectors for the k-dimensional eigenspace is computed that is numerically orthogonal to working precision. ARPACK, which is based upon an implementation of this algorithm, has been designed to be efficient on a variety of high-performance computers. Parallelism within the scheme is obtained primarily through the matrix-vector operations that comprise the majority of the work in the algorithm. The software is capable of solving large-scale symmetric, nonsymmetric, and generalized eigenproblems from significant application areas. ARPACK has been used to solve problems from quantum chemistry, reactive scattering, stability of fluid flow, design of dielectric waveguides, and structural analysis, with some problems as large as 250,000 degrees of freedom.

Continuing research activities involve adapting these techniques to the solution of large (sparse) nonsymmetric and symmetric-indefinite linear systems. Researchers are also developing preconditioning techniques for eigenvalue calculations. The use of preconditioners has proven highly effective in accelerating iterative methods for sparse linear systems; however, it has not received the attention it deserves in the context of large-scale nonsymmetric eigenvalue problems. Polynomial preconditioning has been one popular alternative for eigenproblems, but recent work has indicated that other kinds of preconditioners, such as relaxation, incomplete factorization, or multigrid, might also be very useful in solving some nonsymmetric eigenvalue problems.
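As a rough illustration of the polynomial preconditioning idea, the sketch below applies a Chebyshev polynomial filter: A is replaced by T_m(L), where L maps an interval [a, b] enclosing the unwanted eigenvalues onto [-1, 1], so the filter damps that part of the spectrum and amplifies eigenvalues outside it. The matrix, interval endpoints, and polynomial degree are illustrative choices, and this is only a sketch of the technique, not any package's actual preconditioner.

```python
import math

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def cheb_apply(A, v, m, a, b):
    """y = T_m(L) v with L = (2A - (a+b)I) / (b-a), by the three-term
    Chebyshev recurrence T_{j+1}(x) = 2x T_j(x) - T_{j-1}(x)."""
    c, h = (a + b) / 2.0, (b - a) / 2.0
    Lmul = lambda x: [(y - c * xi) / h for y, xi in zip(matvec(A, x), x)]
    t_prev, t_cur = v, Lmul(v)                     # T_0 v and T_1 v
    for _ in range(m - 1):
        t_prev, t_cur = t_cur, [2.0 * y - x for y, x in zip(Lmul(t_cur), t_prev)]
    return t_cur

A = [[2.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 4.0]]           # eigenvalues: 3 - sqrt(3), 3, 3 + sqrt(3)
v = [1.0, 1.0, 1.0]
for _ in range(2):              # two filtered power steps
    v = cheb_apply(A, v, m=6, a=1.0, b=3.5)       # damp [1.0, 3.5]
    nrm = math.sqrt(sum(x * x for x in v))
    v = [x / nrm for x in v]
rayleigh = sum(x * y for x, y in zip(v, matvec(A, v)))
print(rayleigh)                 # close to 3 + sqrt(3), the largest eigenvalue
```

Because the filter grows rapidly outside [a, b], a couple of filtered steps isolate the wanted eigenvalue far faster than plain power iteration would, using only matrix-vector products.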
ScaLAPACK++

ScaLAPACK++ is an object-oriented extension designed to support distributed dense, banded, and sparse matrix operations for symmetric, positive-definite, and nonsymmetric cases. Researchers have initially focused on the most common factorizations for dense systems: LU, QR, and Cholesky. The intent is that for large-scale problems, these ScaLAPACK++ routines should effectively exploit the computational hardware of medium-grain-sized multicomputers with up to a few thousand processors, such as the Intel Paragon and Thinking Machines Corporation's CM-5.

Among the important design goals of ScaLAPACK++ are scalability, portability, flexibility, and ease of use. Such goals present serious challenges, as the layout of an application's data within the hierarchical memory of a concurrent computer is critical in determining the performance and scalability of the parallel code. To enhance the programmability of the library, details of the parallel implementation will be hidden as much as possible, while still providing the user with the capability to control the data distribution.

The ScaLAPACK++ release will include a general two-dimensional matrix decomposition that supports the most common block/scattered schemes currently in use. ScaLAPACK++ can be extended to support arbitrary matrix decompositions by providing the specific parallel BLAS library to operate on such matrices. Decoupling the matrix operations from the details of the decomposition not only simplifies the encoding of an algorithm but also makes it possible to postpone the decomposition until runtime. Often, the optimal matrix decomposition depends on how the matrix is used in other parts of the driving application code. Furthermore, it may be necessary to alter the matrix decomposition dynamically at runtime to accommodate special routines.

The currently supported decomposition scheme defines global matrix objects that are distributed across a logical grid of processors.
Matrices are mapped to processors using a block scattered class of decompositions that allows a wide variety of matrix mappings while enhancing scalability and maintaining good load balance for various dense factorization algorithms. At each node of the multicomputer there is a sequential LAPACK++ library that provides the object-oriented framework to describe block algorithms on conventional matrices in each individual processor.
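The index arithmetic behind a two-dimensional block scattered (block-cyclic) mapping can be sketched as below: consecutive blocks of the global matrix are dealt out cyclically across a Pr x Pc logical process grid, which is what gives the good load balance as a factorization shrinks the active part of the matrix. The routine names and parameters here are a sketch, not ScaLAPACK++'s actual interface.

```python
def owner(i, j, bs, Pr, Pc):
    """Process grid coordinates (pr, pc) owning global entry (i, j),
    for square bs x bs blocks dealt cyclically over a Pr x Pc grid."""
    return ((i // bs) % Pr, (j // bs) % Pc)

def local_index(i, j, bs, Pr, Pc):
    """Local (row, col) of global entry (i, j) within its owning process:
    blocks a process owns are packed contiguously in its local storage."""
    li = (i // (bs * Pr)) * bs + i % bs
    lj = (j // (bs * Pc)) * bs + j % bs
    return (li, lj)

# Example: an 8x8 matrix in 2x2 blocks, scattered over a 2x2 process grid.
bs, Pr, Pc = 2, 2, 2
print(owner(0, 0, bs, Pr, Pc))        # (0, 0)
print(owner(2, 5, bs, Pr, Pc))        # block row 1, block col 2 -> (1, 0)
print(local_index(4, 6, bs, Pr, Pc))  # (2, 2)
```

Setting bs = 1 recovers a purely scattered (cyclic) mapping, while a large bs recovers a pure block mapping, which is why this one scheme covers the common block/scattered choices.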
Parallelism is exploited through the use of distributed-memory versions of
the Basic Linear Algebra Subprograms (BLAS) that perform the basic
computational units of the block algorithms. Thus, at a higher level, the
block algorithms used look the same for the parallel or sequential
versions, and only one version of each needs to be maintained.
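This separation can be sketched as follows: the block algorithm below is written entirely in terms of a BLAS-like kernel (gemm on blocks), so the driver would be unchanged whether that kernel is the sequential stand-in shown here or a distributed-memory version operating on blocks owned by different processes. The routine names are illustrative, not the library's actual interface.

```python
def gemm(C, A, B):
    """C += A * B for small dense blocks (stand-in for a BLAS kernel);
    swapping in a distributed version leaves the driver below untouched."""
    for i in range(len(A)):
        for k in range(len(B)):
            aik = A[i][k]
            for j in range(len(B[0])):
                C[i][j] += aik * B[k][j]

def block(M, bi, bj, bs):
    """Copy out the (bi, bj) block of size bs x bs."""
    return [row[bj * bs:(bj + 1) * bs] for row in M[bi * bs:(bi + 1) * bs]]

def blocked_matmul(A, B, n, bs):
    """C = A * B over bs x bs blocks; all arithmetic happens inside gemm."""
    nb = n // bs
    C = [[0.0] * n for _ in range(n)]
    for i in range(nb):
        for j in range(nb):
            Cij = block(C, i, j, bs)
            for k in range(nb):
                gemm(Cij, block(A, i, k, bs), block(B, k, j, bs))
            for r in range(bs):          # write the finished block back
                C[i * bs + r][j * bs:(j + 1) * bs] = Cij[r]
    return C

n, bs = 4, 2
A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # identity
B = [[float(i * n + j) for j in range(n)] for i in range(n)]
print(blocked_matmul(A, B, n, bs) == B)  # True: I * B == B
```

Working block by block is also what exploits the surface-to-volume effect mentioned earlier: each bs x bs block is reused across O(bs) arithmetic operations while resident in fast local memory.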
