
Architectural Convergence
Ken Kennedy, Director, CRPC


I recently attended a meeting in Arlington, VA, to discuss research agendas for high performance computing. One of the most interesting talks at this meeting was given by John Hennessy of Stanford, one of the most widely respected computer architecture researchers in the world. In his remarks, John discussed the implications of what he sees as architectural convergence in the parallel computing industry. Most experts agree that, for reasons of cost and performance, future scalable parallel machines will be built with memory that is physically distributed with the processors. It is fairly simple to package a processor, or even a cluster of processors, with memory and to interconnect these packages via a high-performance scalable network. Since the cost of accessing data in local memory is much lower than the cost of accessing remote memory, the distributed design permits the programmer or compiler to take advantage of locality of reference. In other words, if the computation can be rearranged so that each processor makes most of its accesses to data in local memory, this design yields significant performance benefits. Most of the early scalable machines, such as the Intel iPSC/860 and Paragon, the Thinking Machines CM-5, and the IBM SP1 and SP2, have been designed this way.
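To make "locality of reference" concrete, here is a minimal sketch in C with MPI, under the assumption that a global array has been block-distributed so that each processor holds a contiguous piece in its own local memory; the array name, size, and distribution are illustrative assumptions rather than details of any particular machine.

    /* Minimal sketch: owner-computes summation over a block-distributed
     * array.  Assumes each processor owns NLOCAL contiguous elements of
     * the (conceptual) global array in its own local memory.             */
    #include <mpi.h>
    #include <stdio.h>

    #define NLOCAL 4096                  /* elements owned per processor  */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Each processor initializes and sums only the block it owns,
         * so every reference in these loops is a local-memory access.    */
        double a_local[NLOCAL], local_sum = 0.0, global_sum;
        for (int i = 0; i < NLOCAL; i++)
            a_local[i] = (double)(rank * NLOCAL + i);
        for (int i = 0; i < NLOCAL; i++)
            local_sum += a_local[i];

        /* The only remote traffic is one small reduction at the end.     */
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                   0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("global sum = %f\n", global_sum);

        MPI_Finalize();
        return 0;
    }

The same computation written so that every processor swept the entire global array would turn most of those references into expensive remote accesses, which is exactly the rearrangement the distributed design rewards.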

On the other hand, most experts also agree that programmers prefer a shared-memory programming model, in which every global object can be referenced by any processor without the need for complex send/receive communications. Since most of the new parallel computing architectures will be constructed from commodity microprocessors and since those microprocessors are all evolving to 64-bit addressing, it seems likely that new machines will be able to address all of the memory of a parallel computer. Once that is the case, why not permit it? In other words, why not support loads from any location in a parallel machine on any processor? It would certainly make it easier to support the desirable shared-memory programming model. Hennessy argues that these trends will make the case for hardware-shared memory compelling and that all new machines will be built this way. In fact, he is currently working on a design for architectural and software support that will make it possible to have hardware-shared memory on workstations communicating over a network.
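As a hedged illustration of the difference, the sketch below shows what it takes for processor 0 to read one array element that lives in another processor's memory. Under a shared-memory model the read is just a load, shown here only as a comment; under message passing, written against MPI, the owner and the reader must pair an explicit send with a receive. The array, its size, and the block distribution are assumptions made for the example.

    /* Sketch: processor 0 reads global element a[N-1], which is owned by
     * the last processor.  Assumes the number of processes divides N.    */
    #include <mpi.h>
    #include <stdio.h>

    #define N 8                          /* global array length (assumed) */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int nlocal = N / nprocs;         /* block distribution (assumed)  */
        double a_local[N];
        for (int j = 0; j < nlocal; j++)
            a_local[j] = (double)(rank * nlocal + j);

        /* Shared-memory model: any processor could simply write
         *     x = a[N-1];
         * and let the hardware fetch the value, local or remote.
         * Message-passing model: an explicit send/receive pair.          */
        int i = N - 1, owner = i / nlocal;
        double x = a_local[i % nlocal];  /* right value only if 0 owns it */

        if (owner != 0) {
            if (rank == owner)
                MPI_Send(&a_local[i % nlocal], 1, MPI_DOUBLE, 0, 0,
                         MPI_COMM_WORLD);
            else if (rank == 0)
                MPI_Recv(&x, 1, MPI_DOUBLE, owner, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }
        if (rank == 0)
            printf("a[%d] = %f\n", i, x);

        MPI_Finalize();
        return 0;
    }

The message-passing version also requires both sides to know who owns what and when the exchange happens, which is precisely the bookkeeping that makes the shared-memory model attractive to programmers.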

I believe Hennessy is basically correct--most of the scalable parallel machines built over the next half decade will have hardware-shared memory. Already three new designs--the Kendall Square KSR series, Cray T3D, and the Convex Exemplar MPP--are following this path. So does this mean that all our troubles are over and that, if we wait for convergence, all our software problems on parallel machines will be solved?

Unfortunately, the answer is "no," because the underlying performance problems of physically distributed memory will remain. To achieve high performance on the convergence machines, you will still need to deal with the high cost of access to remote memories. In fact, almost all of the effort to program these machines will be concerned with reducing the impact of latencies in the memory hierarchy. Even accesses to local memory are likely to cost more than 50 processor cycles, while accesses to remote memories will take hundreds or even thousands of cycles.

To reduce the performance degradation due to memory latencies, most future machines will use one or more caches between each processor and the memory system. However, maintaining "coherence," or consistency of shared, writable cache blocks across the processors of such a machine, is a difficult challenge. Many hardware solutions to the coherence problem have been proposed, but all of them impose cost or performance penalties. This is probably why Cray omitted global coherence hardware from the T3D. As a result, I expect software strategies to play a significant role in overcoming the coherence and latency problems in distributed-memory architectures.

One strategy certain to increase in importance as we move toward convergence is "software prefetching." In this approach, the compiler or a runtime preprocessing phase analyzes the program and generates instructions that prefetch data items from remote memories into cache or local memory before they are needed. Although it presents a few new problems, efficient use of memory and cache through prefetching builds largely on the substantial system, compiler, and runtime analysis research already done for message-passing systems, because placing prefetch instructions properly is closely related to placing send/receive pairs properly in a message-passing program. Furthermore, avoiding data movement entirely will still be preferable to prefetching, so strategies that increase locality of reference will retain their importance.
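A rough sketch of the idea, assuming a GCC- or Clang-style compiler that provides the __builtin_prefetch intrinsic: the loop below issues a prefetch for the element it will need a fixed number of iterations later, so the data can be moving toward the processor while the current iteration's work proceeds. The prefetch distance is an arbitrary illustrative constant here; a real compiler or runtime system would derive it from the machine's memory latency.

    /* Sketch of software prefetching in a simple streaming loop.         */
    #include <stdio.h>

    #define N        (1 << 20)
    #define PF_DIST  64                  /* iterations ahead to prefetch  */

    static double a[N], b[N];

    int main(void)
    {
        for (int i = 0; i < N; i++)
            b[i] = (double)i;

        for (int i = 0; i < N; i++) {
            /* Start fetching the operand needed PF_DIST iterations from
             * now, so it is in cache (or local memory) before it is used. */
            if (i + PF_DIST < N)
                __builtin_prefetch(&b[i + PF_DIST], 0 /* read */, 1);
            a[i] = 2.0 * b[i];           /* the "real" work of the loop   */
        }

        printf("a[%d] = %f\n", N - 1, a[N - 1]);
        return 0;
    }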

The bottom line is that, although substantial architectural convergence is likely, the parallel computing community will still face most of the software and algorithm development problems that exist on today's distributed-memory systems. Languages like High Performance Fortran, in which the programmer specifies data locality, will be very useful on convergence machines, and the parallel algorithms currently being developed will still be needed to achieve high performance. Clearly, this is good news for software and algorithm researchers. But it is not all bad news for users, either. Even though they will still have to wait for software and algorithms to make scalable machines usable, the approaches being followed today will be effective on the convergence machines as well, so users will at least avoid the pain of another complete paradigm shift.

