Active Data Repository Accelerates Access to Large Data Sets
FIGURE 1: CHESAPEAKE BAY SIMULATION
This simulation of the Chesapeake Bay region by the University of Texas Bays and Estuaries Simulator (UTBEST) is based on an unstructured Godunov-type finite-volume method. Colors denote the distribution across four processors of the domain elements, devised to minimize the costs of communications among processors.
The ADR project, led by Saltz, integrates data retrieval with processing, because many scientific applications have similar processing requirements, at least for formatting data returned from a database.
Integrated processing has three main advantages. First, because the ADR runs on high-performance parallel machines, the processing step can use the high-end hardware. Second, by overlapping processing and retrieval, ADR can mask query latencies and improve efficiency. Without parallelism, operations can take up to twice as long. Finally, ADR minimizes the amount of data transmitted from server to client because the processed data is often much smaller than the amount of data accessed.
"The constraints are that we get to choose the order of processing," said Alan Sussman, research scientist, Maryland's Computer Science Department and UMIACS, who is working on the ADR project with Saltz and doctoral candidate Chialin Chang. "The processing operations can't depend on the order data is retrieved"
The constraints, however, still permit many common data manipulations: transforming input items, mapping from one grid to another, aggregating items by averaging, and computing subsamples and other statistical winnowing methods. The ADR can be customized for various applications, while the system provides a uniform interface to the data and processing. An application developer need only provide a data description and the processing methods.
Typically, the processing for an ADR query happens in three steps. First, individual data objects are pre-processed with a transform function. Then the objects are mapped from the input attribute set to the output attribute set. Finally, an aggregation function computes an output item Tom the set of input items that map to it.
The ADR currently runs on a 16-node IBM SP at Maryland with the data sets stored on up to 64 disks. Plans call for moving the ADR server to NPACI's 128-node IBM SP at SDSC and extending the ADR to use tape archives, such as the High Performance Storage System (HPSS) at SDSC. Applications will require both the larger machine and larger storage system to manage full-scale data sets.
SATELLITES AND MICROSCOPES
The ADR has been or is being applied to projects in geography, medicine, and environmental science to demonstrate its generality for accessing large, multidimensional data sets. "We've done the work to make sure everything happens in parallel in a high-performance environment," Sussman said.
The first ADR application was implemented as part of a Grand Challenge project on land cover classification led by Larry Davis, professor in the Department of Computer Science and UMIACS, and geographers at Maryland. With a database of overlapping images from the Advanced Very High Resolution Radiometer (AVHRR) satellite, Joseph JÓJÓ, director of UMIACS, and colleagues used the ADR to analyze global land cover. "A long-term scenario is for a user to search, cross-correlate, and mine distributed Earth system data, collected from multiple sensors and archived automatically,' JÓJÓ said.
A typical query in this application might ask for the land cover of a particular country or region, specifying a time frame and resolution. From the query, the ADR develops a plan of action. First, the spatial query is mapped to segments of the relevant AVHRR images. Then the data is processed with methods for AVHRR correction and land cover analysis before being returned.
A second collaboration, with researchers from Johns Hopkins University, created a "virtual microscope" (Figure 2). High-resolution images were digitized from biological slides with a high-powered microscope; a single slide could require 50 gigabytes to store. A client application then uses the ADR to view each slide at low magnification (by subsampling the high-resolution image) and to zoom in on specific regions at higher magnifications. Such an application would allow physicians to compare specimens with earlier cases from their own institutions or a global collection of digitized cases.
FIGURE 2: A VIRTUAL MICROSCOPE
Researchers at Maryland and Johns Hopkins used the capabilities of the Active Data Repository to create a microscope-like interface for a database of digitized specimen slides.
Accurate simulations of the transport and reaction of pollutants in surface water must include not only chemical reaction information but also water flow information. But combining both flow and chemical calculations in the same simulation comes with a heavy computational price. The basic fluid flow in a bay or estuary does not change, even though pollution may enter in many ways, perhaps as non-point fertilizer runoff or as an oil spill. In a combined simulation, hydrodynamic data must be recomputed for every new pollution scenario.
By separating the two aspects, Wheeler's group can save time by computing and storing water depth, flow velocities, and turbulence information of the 3-D bay or estuary model only once. In fact, they are using two different programs, ADCIRC and UT-BEST, to perform the flow simulations. The pollution simulation, CE-QUAL-ICM, looks up the hydrodynamic flow in the ADR.
"Coupling the two simulators to form a complete system is not a straightforward process for several reasons," Wheeler said. Primarily, the coupling must account for differences in time scale and grid resolution. The chemical transport model can examine changes over days or over hundreds of years, using time steps that are out of sync with the hydrodynamics simulation or need the average of several flow steps. A separate code, UT-PROJ, must map the hydrodynamics grid to the chemical transport grid.
Under NPACI, the Texas group is parallelizing the simulation codes and working with the ADR group to optimize the storage of hydrodynamic simulator output. The Maryland group is working with Wheeler to define a data dictionary for such data, create the customized processing components, and incorporate them into the ADR.
"With the bay and estuary simulation, we are gaining experience in determining how much help we have to provide developers," Sussman said. "The goal is eventually to have the developers take advantage of the ADR in their own applications and write all their own components." -DH
Sites & Affiliations | Leadership | Research & Applications | Major Accomplishments | FAQ | Search | Knowledge & Technology Transfer | Calendar of Events | Education & Outreach | Media Resources | Technical Reports & Publications | Parallel Computing Research Quarterly Newsletter | News Archives | Contact Information
|Hipersoft | CRPC||
© 2003 Rice University