Object-Oriented Library Efforts: POOMA, Tulip, and PAWS

John Reynders, Los Alamos National Laboratory

In 1995, the National Science Foundation awarded a contract to a group of CRPC researchers and colleagues to study Problem Solving Environments (PSEs). PSEs are integrated collections of software tools that scientists can use to facilitate solving computationally intensive problems. (See Winter 1996 Parallel Computing Research, page 11.) Led by K. Mani Chandy of Caltech, the group includes collaborators from Los Alamos National Laboratory, Indiana University, Drexel University, and New Mexico State University. The project spans four related areas, one of which is the development of object-oriented libraries of parallel program templates to support PSEs. Following are descriptions of three efforts in this area that are currently underway at the Los Alamos National Laboratory (LANL) Advanced Computing Laboratory.

POOMA

The Parallel Object-Oriented Methods and Applications (POOMA) FrameWork effort is an application-driven software infrastructure of layered class libraries designed to increase simulation lifetime, portability, and agility across rapidly evolving high-performance computing architectures. The POOMA FrameWork achieves these goals through:

Code portability across serial, distributed, and parallel architectures with no change to source code
The development of reusable, cross-problem-domain components to enable rapid application development
Code efficiency for kernels and components relevant to scientific simulation
Driving the FrameWork design and development with applications from a diverse set of scientific problem domains
Shortening the time from problem inception to working parallel simulations

The FrameWork provides an integrated, layered system of objects in which each object higher in the FrameWork is composed of or uses objects that are lower in the FrameWork. In the POOMA FrameWork, the higher layers provide objects that capture the main abstractions of scientific problem domains (particles, fields, matrices) in a representation that preserves mathematical expressions in the application source code. The interfaces to these high-level objects are mostly data-parallel and array-syntax structures that provide the application developer with a familiar programming environment. Many new users of the FrameWork are able to build simple working POOMA codes in a single day with no knowledge of C++.

The objects that are lower in the FrameWork focus more on capturing the abstractions relevant to parallelism and efficient node-level simulation, such as communications, domain decomposition, threading, and load balancing. Although a high-level interface may be data-parallel, this does not preclude the low-level constituent objects from participating in a task-parallel manner. This combination provides the ease of a data-parallel interface with the efficiency of a MIMD implementation.
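The idea of a data-parallel interface over a task-parallel implementation can be sketched in a few lines of standard C++. This is only an illustration under stated assumptions: the function name apply_all and its tasks parameter are hypothetical and are not part of POOMA, and std::async stands in for whatever threading layer a real framework would use.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical helper, not part of POOMA. The caller sees one data-parallel
// operation ("apply f to every element"); internally the index range is
// split into independent chunks that run as separate tasks.
template <typename F>
void apply_all(std::vector<double>& data, F f, std::size_t tasks = 4) {
    assert(tasks > 0);
    const std::size_t n = data.size();
    const std::size_t chunk = (n + tasks - 1) / tasks;  // ceiling division
    std::vector<std::future<void>> pending;
    for (std::size_t begin = 0; begin < n; begin += chunk) {
        const std::size_t end = std::min(begin + chunk, n);
        // Each task touches a disjoint slice, so no locking is needed.
        pending.push_back(std::async(std::launch::async, [&data, f, begin, end] {
            for (std::size_t i = begin; i < end; ++i) data[i] = f(data[i]);
        }));
    }
    // Wait for every chunk before returning to the caller.
    for (auto& p : pending) p.get();
}
```

A caller would write something like apply_all(v, [](double x) { return x * x + 1.0; }) and never see the chunking or the tasks.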

This layered approach provides a natural breakdown of responsibility in application development. Computer science specialists can focus on the lower realms of the FrameWork optimizing kernels and message-passing techniques without having to know the details of the application being constructed. The physicists, on the other hand, can construct algorithms, components, and simulations with objects in the higher levels of the FrameWork without having to know the specific implementation details of the objects that compose the application. This design allows for efficient team code development with no single member of the team having to know all the details of the computer and physics problem domains. With the POOMA FrameWork, it is possible for a scientist to build an application that runs on a variety of distributed memory architectures without any knowledge or understanding of parallel simulation.

The advent of optimizing C++ compilers with comprehensive template implementations has provided several advantages at all levels of the FrameWork. At the user level, templates allow for a form of compile-time polymorphism that provides the application developer with a generality of interface superior to that provided by non-parameterized class libraries. At the implementation level, templates enable the use of the Standard Template Library (STL) and the creation of a family of STL-like parallel containers and iterators for scientific computation. This use of a container/iterator/algorithm abstraction has made it easier to generalize parallel operations within the FrameWork while reducing the lines of source code by a factor of six. Finally, at the kernel level, expression template methods have enabled kernels written with high-level objects from the POOMA FrameWork to run at nearly the same speed as hand-coded C and Fortran equivalents.
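To make the expression template technique concrete, here is a minimal sketch in standard C++. It is not POOMA code: the names Expr, Sum, Prod, and Field are hypothetical, and a real framework would add many more operators, scalar operands, and parallel evaluation. The sketch only shows the core idea, which is that an array expression builds a lightweight compile-time tree and is evaluated in one fused loop at assignment, without temporary arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names throughout; this is not the POOMA API.

// CRTP base class so that the operators below match only expression types.
template <typename E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Node for the element-wise sum of two expressions. Holds references only.
template <typename L, typename R>
class Sum : public Expr<Sum<L, R>> {
public:
    Sum(const L& l, const R& r) : l_(l), r_(r) {}
    double operator[](std::size_t i) const { return l_[i] + r_[i]; }
    std::size_t size() const { return l_.size(); }
private:
    const L& l_;
    const R& r_;
};

// Node for the element-wise product of two expressions.
template <typename L, typename R>
class Prod : public Expr<Prod<L, R>> {
public:
    Prod(const L& l, const R& r) : l_(l), r_(r) {}
    double operator[](std::size_t i) const { return l_[i] * r_[i]; }
    std::size_t size() const { return l_.size(); }
private:
    const L& l_;
    const R& r_;
};

// A one-dimensional field of doubles that can be assigned from any expression.
class Field : public Expr<Field> {
public:
    explicit Field(std::size_t n, double v = 0.0) : data_(n, v) {}
    double operator[](std::size_t i) const { return data_[i]; }
    double& operator[](std::size_t i) { return data_[i]; }
    std::size_t size() const { return data_.size(); }

    // The whole expression tree is evaluated here, element by element,
    // in a single loop with no intermediate arrays.
    template <typename E>
    Field& operator=(const Expr<E>& e) {
        const E& expr = e.self();
        assert(expr.size() == size());
        for (std::size_t i = 0; i < size(); ++i) data_[i] = expr[i];
        return *this;
    }
private:
    std::vector<double> data_;
};

template <typename L, typename R>
Sum<L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
    return Sum<L, R>(l.self(), r.self());
}

template <typename L, typename R>
Prod<L, R> operator*(const Expr<L>& l, const Expr<R>& r) {
    return Prod<L, R>(l.self(), r.self());
}
```

With these definitions, a statement such as a = b + c * d on four Field objects compiles to a loop equivalent to the hand-written a[i] = b[i] + c[i] * d[i]. The expression nodes hold references, so in this sketch an expression should be assigned in the same statement that builds it.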

As the complexity of multi-physics simulations increases, a software framework that encapsulates parallelism and enables portability becomes essential to achieving high-performance computing objectives. The interspersion of message-passing and physics kernels in typical high-performance computing simulations makes both the parallelism and the physics impenetrable and the resultant code unmaintainable. Removing this interleaving through explicit encapsulation of parallelism allows the physicist to concentrate on physics and the computer scientist to concentrate on computer science, resulting in a faster turnaround in the problem-solving cycle.

The POOMA FrameWork is being used in multiple Accelerated Strategic Computing Initiative (ASCI) and Energy Research Grand Challenge applications. (See "ASCI Center of Excellence Caltech to Construct Virtual Shock Physics Facility," this issue.)

Tulip

Tulip is a parallel run-time system class library for C++ frameworks. Portability and component reuse depend on well-designed abstractions and interfaces. The purpose of a portable, parallel run-time system is to efficiently implement a standard interface that is powerful and easy to use, yet sufficiently abstract to give the underlying implementation machinery the freedom to optimize for particular architectures and for the changing status of the hardware. That interface must include support for allocating data structures from a shared region, copying shared data between unshared (distributed) regions of memory (contexts), basic synchronization primitives, remote method invocation, and basic process control. Tracing and profiling should also be built into the run-time system from the ground up.

Principles

There are several design principles used to guide the construction of the interface:

1. Communication and synchronization should be separate. Communication simply moves data structures (regions of memory) from one context to another. Internally, this action may involve translation between binary formats, message blocking, negotiation for fast links, and even cache invalidation protocols. Quite distinct from these issues are the constraints, explicitly built into the user's code, that describe when data is ready for use by a computation. By separating communication and synchronization, the user gains access to more powerful abstractions that clearly describe the data dependencies of the program, while the run-time system keeps the flexibility to reorder communication within the constraints of the synchronization model and so improve performance.

2. For this layer of the run-time system, communication should be specified at as high a level as possible, and not translated to "move these bytes from there to here." Using high-level abstractions for describing data movement creates an easy-to-use interface for the framework programmer, and ensures that the lower layers have the widest latitude possible for implementation.

3. The familiar loop execution pattern--schedule communication, execute the communication schedule, compute on local data--is not the most efficient technique on every type of computer hardware. Therefore, that model should not be forced upon the run-time system; rather, the run-time system may choose to use it under certain circumstances. For cc-NUMA distributed shared memory machines, the communicate/compute paradigm is actually one of the worst possible choices when seeking high performance. On a machine such as a 32-processor SGI/CRAY Origin 2000, where a single processor can use all the bandwidth that must be shared by about four processors, it is critical to minimize contention in the backplane. Communicate/compute models maximize contention, and therefore force stalls in the computation. The distributed-data layout manager (PADRE, for example) should provide the run-time system with the information describing which blocks of data live in which contexts. The run-time system may choose to do global communication and then begin computation, or it may use the information provided to interleave computation and communication where appropriate.

4. Sequences of parallel tasks that need to be run should be described to the run-time system without overspecification. The run-time system must be free to use a variety of work scheduling and sharing techniques to complete the tasks. Therefore, the interface for specifying collective operations and iteration sequences should remain as general as possible, allowing the implementation to balance system resources and optimize execution of the tasks.

5. The interface should provide as much functionality to the user as possible without trying to protect the user from gaining access to powerful constructs that could be used incorrectly.
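The first and third principles can be illustrated with a small sketch in standard C++. It is not Tulip code: the names Block, start_copy, wait_ready, and overlapped_sum are hypothetical, a "context" is modeled as an ordinary vector inside one process, and std::async stands in for the run-time system's transport. The point is only that starting a transfer and declaring when its data must be ready are two separate calls, which leaves room to overlap the transfer with independent computation.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical model: a block of data living in some context.
using Block = std::vector<double>;

// Communication only: begins copying src into dst and returns a handle.
// It says nothing about when the consumer is allowed to read dst.
std::future<void> start_copy(const Block& src, Block& dst) {
    return std::async(std::launch::async, [&src, &dst] { dst = src; });
}

// Synchronization only: states the data dependency by blocking until
// the transfer behind the handle has completed.
void wait_ready(std::future<void>& transfer) {
    assert(transfer.valid());
    transfer.get();
}

double sum(const Block& b) {
    return std::accumulate(b.begin(), b.end(), 0.0);
}

// Computes sum(local) + sum(remote). The copy of the remote block
// overlaps with the work on local data instead of preceding it.
double overlapped_sum(const Block& local, const Block& remote) {
    Block halo;
    std::future<void> transfer = start_copy(remote, halo);  // communication
    const double local_part = sum(local);                   // independent work
    wait_ready(transfer);                                   // synchronization
    return local_part + sum(halo);
}
```

Because the dependency is expressed only at wait_ready, an implementation behind start_copy would be free to batch, reorder, or delay transfers as long as each one completes before its wait returns.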

The Tulip effort grew out of HPC++ work at CRPC Affiliated Site Indiana University and is now being deployed into multiple Department of Defense and Department of Energy research applications.

PAWS

The goal of the Parallel Applications Workspace (PAWS) project is to create a parallel, scientific PSE. Using PAWS, scientists will be able to specify the problem's initial conditions, then run parallel applications on selected platforms. As the application is executing, the scientist can monitor component interactions, stop computation, and perform minor computational steering as allowed by the parallel application. Support for coupling two or more parallel programs will also be provided. Parallel computations will have the ability to communicate in parallel, rather than aggregating the data stream into a single point of communication, followed by a scatter operation when the data reaches its destination. This is critical for the ASCI Blue SMP platforms, where several types of node-to-node communication are supported, including 100BT Ethernet, HIPPI, and shared memory. Parallel coupling is especially important for connecting parallel programs with parallel visualization simulations.
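One way to picture parallel coupling without a single aggregation point is to compute, for two programs that each hold a block-distributed one-dimensional array, which pieces each sending process must deliver directly to each receiving process. The sketch below does that in standard C++. It is an assumption-laden illustration, not the PAWS design: the Transfer type, the block_begin and plan functions, and the even block distribution are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types; the text does not describe PAWS's actual interfaces.
struct Transfer {
    std::size_t from;   // sending rank
    std::size_t to;     // receiving rank
    std::size_t begin;  // first global index in the transfer
    std::size_t end;    // one past the last global index
};

// First global index owned by `rank` when n elements are split into
// `parts` nearly equal contiguous blocks (assumed distribution).
std::size_t block_begin(std::size_t n, std::size_t parts, std::size_t rank) {
    return rank * n / parts;
}

// Lists the direct sender-to-receiver transfers needed to move an array
// of n elements between two block distributions. No rank ever holds the
// whole array, so there is no gather followed by a scatter.
std::vector<Transfer> plan(std::size_t n, std::size_t senders,
                           std::size_t receivers) {
    assert(senders > 0 && receivers > 0);
    std::vector<Transfer> out;
    for (std::size_t s = 0; s < senders; ++s) {
        for (std::size_t r = 0; r < receivers; ++r) {
            // Intersection of the sender's block with the receiver's block.
            const std::size_t lo = std::max(block_begin(n, senders, s),
                                            block_begin(n, receivers, r));
            const std::size_t hi = std::min(block_begin(n, senders, s + 1),
                                            block_begin(n, receivers, r + 1));
            if (lo < hi) out.push_back({s, r, lo, hi});
        }
    }
    return out;
}
```

For 12 elements moving from three senders to two receivers, this plan contains four independent transfers, each of which could proceed over its own link.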

Initially, interaction with PAWS will be through a scripting language interface and, eventually, a graphical user interface. Scientists will be able to develop rapid prototype applications using a selected framework. With the help of the graphical user interface, the scientist will be able to choose framework components that will be linked to create a parallel application, which will then be executed and analyzed.

PAWS is now being deployed into ASCI simulations and the Numerical Tokamak Grand Challenge codes.

For more information about these Los Alamos projects, see
http://www.acl.lanl.org,
http://www.acl.lanl.gov/Pooma,
http://www.acl.lanl.gov/PAWS, and
http://www.acl.lanl.gov/SciTL/slides/Tulip