Architectural Convergence
Ken Kennedy, Director, CRPC
I recently attended a meeting in Arlington, VA, to discuss research
agendas for high performance computing. One of the most interesting
talks at the meeting was given by John Hennessy of Stanford, among the
most widely respected computer architecture researchers in the world. In
his remarks, John discussed the implications of what he sees as
architectural convergence in the parallel computing industry. Most
experts agree that, for reasons of cost and performance, future scalable
parallel machines will be built with memory that is physically
distributed with the processors. It is fairly simple to package a
processor, or even a cluster of processors, with memory and interconnect
these packages via a high-performance scalable network. Since the cost
of accessing data in local memory is much lower than the cost of
accessing remote memory, the distributed design permits the programmer
or compiler to take advantage of locality of reference. In other words,
if the computation can be rearranged so that each processor makes most
of its accesses to data in local memory, this design will yield
significant performance benefits. Most of the early scalable machines,
such as the Intel iPSC/860 and Paragon, the Thinking Machines CM-5, and
the IBM SP1 and SP2, have been designed this way.
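To make the locality argument concrete, here is a minimal sketch in C of
the "owner computes" style such rearrangement produces, assuming a block
distribution of an array across processors. The routines
my_processor_id() and num_processors() are hypothetical stand-ins for
whatever a machine's runtime provides, not a real API.

    #include <stddef.h>

    #define N 1000000           /* global array length (illustrative) */

    /* Hypothetical runtime queries; real systems have equivalents. */
    extern int my_processor_id(void);
    extern int num_processors(void);

    /* Scale the block of a[] that lives in this processor's memory.
       Because the loop bounds follow the data distribution, every
       access is local and no remote traffic is generated. */
    void scale_local_block(double *a, double s)
    {
        size_t p      = (size_t)my_processor_id();
        size_t nprocs = (size_t)num_processors();
        size_t chunk  = N / nprocs;
        size_t lo     = p * chunk;
        size_t hi     = (p == nprocs - 1) ? N : lo + chunk;

        for (size_t i = lo; i < hi; i++)
            a[i] *= s;
    }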
On the other hand, most experts also agree that programmers prefer a
shared-memory programming model, in which every global object can be
referenced by any processor without the need for complex send/receive
communications. Since most of the new parallel computing architectures
will be constructed from commodity microprocessors and since those
microprocessors are all evolving to 64-bit addressing, it seems likely
that new machines will be able to address all of the memory of a
parallel computer. Once that is the case, why not exploit it? In other
words, why not let any processor issue loads from any location in the
machine? It would certainly make it easier to support the desirable
shared-memory programming model. Hennessy argues that these
trends will make the case for hardware-shared memory compelling and that
all new machines will be built this way. In fact, he is currently
working on a design for architectural and software support that will
make it possible to have hardware-shared memory on workstations
communicating over a network.
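The difference between the two models shows up even in reading a single
remote array element. In the C sketch below (my illustration, not
Hennessy's design), the shared-memory version is a plain load, while
the message-passing version needs a matched send/receive pair, shown
with standard MPI calls; owner_of() and the index bookkeeping are
hypothetical.

    #include <mpi.h>

    extern int owner_of(int j); /* hypothetical: rank owning x[j] */

    /* Shared-memory model: any processor simply issues the load and
       the hardware fetches the value, local or remote. */
    double read_shared(const double *x, int j)
    {
        return x[j];
    }

    /* Message-passing model: the owner must send and the reader must
       receive (assumes owner != reader; my_lo is the first global
       index stored in this rank's x_local). */
    double read_message_passing(double *x_local, int my_lo,
                                int j, int my_rank, int reader)
    {
        double v = 0.0;
        int owner = owner_of(j);

        if (my_rank == owner)
            MPI_Send(&x_local[j - my_lo], 1, MPI_DOUBLE,
                     reader, 0, MPI_COMM_WORLD);
        else if (my_rank == reader) {
            MPI_Status status;
            MPI_Recv(&v, 1, MPI_DOUBLE, owner, 0,
                     MPI_COMM_WORLD, &status);
        }
        return v;  /* meaningful only on the reading processor */
    }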
I believe Hennessy is basically correct--most of the scalable parallel
machines built over the next half decade will have hardware-shared
memory. Already three new designs--the Kendall Square KSR series, the
Cray T3D, and the Convex Exemplar MPP--are following this path. So does
this
mean that all our troubles are over and that, if we wait for
convergence, all our software problems on parallel machines will be
solved?
Unfortunately, the answer is "no," because the underlying performance
problems of physically distributed memory will remain. To achieve high
performance on the convergence machines, you will still need to deal
with the high cost of access to remote memories. In fact, almost all of
the effort to program these machines will be concerned with reducing the
impact of latencies in the memory hierarchy. Even accesses to local
memory are likely to take more than 50 processor cycles, while accesses
to remote memories will take hundreds or even thousands of cycles.
To reduce the performance degradation due to memory latencies, most
future machines will use one or more caches between each processor and
the memory system. However, maintaining "coherence," or consistency
between shared, writable cache blocks across the processors of such a
machine, is a difficult challenge. Many hardware solutions to the
coherence problem have been proposed, but all of them impose cost or
performance penalties. This is probably the reason why Cray omitted
global coherence hardware in the T3D. As a result, I expect software
strategies to play a significant role in overcoming the coherence and
latency problems in distributed-memory architectures.
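The cost of coherence traffic is easy to observe even on a small
shared-memory machine. The sketch below, my illustration using POSIX
threads rather than any particular parallel machine, has two threads
update adjacent counters that fall in the same cache block (assuming a
typical block size): each store invalidates the other processor's
cached copy, so the block ping-pongs between caches, and simply padding
the counters onto separate blocks removes the slowdown.

    #include <pthread.h>
    #include <stdio.h>

    #define ITERS 10000000L

    /* Adjacent longs: on most machines both land in one cache block. */
    long counters[2];

    void *bump(void *arg)
    {
        long idx = (long)arg;
        for (long i = 0; i < ITERS; i++)
            counters[idx]++;   /* each store invalidates the copy
                                  cached by the other processor */
        return NULL;
    }

    int main(void)
    {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, bump, (void *)0L);
        pthread_create(&t1, NULL, bump, (void *)1L);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        printf("%ld %ld\n", counters[0], counters[1]);
        return 0;
    }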
One strategy certain to increase in importance as we move toward
convergence is "software prefetching." In this approach, the compiler or
a runtime preprocessing phase analyzes the program and generates
instructions that will prefetch data items from remote memories to cache
or local memory before they are needed. Although it presents a few new
problems, efficient use of memory and cache through prefetching
generally builds on the substantial system, compiler, and runtime
analysis research already done for message-passing systems: the problem
of properly placing prefetch instructions is closely related to that of
placing send/receive pairs in a message-passing program. Furthermore,
entirely avoiding the need for data
movement will still be preferable to prefetching, so strategies to
increase locality of reference will retain their importance.
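As a concrete sketch of what prefetch insertion produces, the C loop
below issues prefetches a fixed number of iterations ahead of each use.
__builtin_prefetch is the GCC-style intrinsic, and the distance of 16
iterations is an assumed placeholder that a compiler would instead
derive from the machine's memory latency.

    #define DIST 16  /* assumed prefetch distance, tuned per machine */

    /* Dot product with software prefetching: data for iteration
       i + DIST is requested while iteration i computes, hiding the
       memory latency behind useful work. */
    double dot(const double *a, const double *b, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++) {
            if (i + DIST < n) {
                __builtin_prefetch(&a[i + DIST]);
                __builtin_prefetch(&b[i + DIST]);
            }
            s += a[i] * b[i];
        }
        return s;
    }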
The bottom line is that, although substantial architectural convergence
is likely, the parallel computing community will still be faced with
most of the software and algorithm development problems that exist on
today's distributed-memory systems. Languages like High Performance
Fortran, in which the programmer specifies data locality, will be very
useful on convergence machines, and the parallel algorithms currently
being developed will still be needed to achieve high performance.
Clearly, this is good news for software and algorithm researchers.
However, it is not all bad news for users. Even though they will still
have to wait for software and algorithms to make scalable machines
usable, the approaches being followed today will be effective on the
convergence machines as well, so users will at least avoid the pain of
another complete paradigm shift.