LIFE AFTER HP/CONVEX: AN INTERVIEW WITH STEVE WALLACH, PART II
10.10.97 by Steve Fisher, managing editor, HPCwire

HPCwire: What's your take on the battle between UNIX and NT?

WALLACH: I actually spoke about this once for the NCSA site review which was held at Rice. They called me down because most of the people knew me. The question was, what is the role of NT going to be in supercomputing? I think that's something that will be debated again and again over the next several years until the market determines it.

The reason this is a hot topic, at least it was a hot topic when I did this in January or February, is... UNIX is fine, so this is not a technical issue; this is not a debate about which operating system is better. Unix grew because Unix captured the desktop, that is, Sun and so on. Eventually, after it owned the desktop, it migrated to the server. People don't give enough credit to, for example, Cray Research: when it was just Cray and they committed to putting Unix on the Cray, that basically solidified Unix on both the desktop and the server. All the start-ups at that time had no choice but to use Unix, because you could license it for $40K or whatever the license fee was.

So now the question is, is this deja vu? That is, the desktop is being consumed by NT. More and more companies are moving to NT. You can read about all the electronic design stuff; the finite element codes are all moving to NT. Pentium chips are getting faster, and then when Merced comes out, those chips will be equal to, if not faster than, any of the RISC chips. So when you get low cost and you run NT... there are all these surveys that show that at the desktop NT will take over, and I think even Sun has somewhat conceded that.

So now the question is, and these are things for debate over the next several years, if NT is on the desktop, what prevents it from consuming the server the same way Unix did? What are the things that could prevent it? It could be that Microsoft is not interested in the scalability of NT beyond four or eight CPUs. I have no idea what their plans are, so I'm just giving you some of the things that have to be considered. Perhaps it isn't as mature as Unix; that's certainly true. But then the question is, if NT takes even half the server market, what happens to the people that only have Unix servers? And I think that's why you see SGI making that defensive move, at least for the desktop.

Now the server market is different, because in many cases that's more of an application. That is, you don't care what the operating system is. If I just want an Oracle box, I may not really care whether it's NT or Unix underneath. I think that will keep the Unix market going and in fact growing. It won't grow as fast as NT, but it will still be a substantial market that has to be sold into. And in fact, with the machines being designed at HP around the Merced processor, you put in one CD-ROM and you get Unix; you put in another and you get NT. In that case, the customer chooses. It's the same set of hardware, in essence, that can go to two different operating systems.

In the high performance market, I think at the low to medium end it will be NT, and at the high end it will be Unix. I don't think you are going to see NT on a 200-CPU system; it isn't designed for that. Now you may see people doing clustering with NT systems, but how that plays in and how people use it remains to be seen. I think you're going to start seeing a lot of that over the next couple of years, but the very high end will still be Unix for the foreseeable future.

HPCwire: What are your views on the current state of the HPC industry and its future?

WALLACH: I think it's moving to the point where certainly all the new stuff is RISC-based. The ASCI program is... at least all the machines out in the field so far are RISC-based. That's there. I think the interesting thing is, we still need the people like Tera. I'm a firm supporter of what Burton (Smith) is doing. I think we still need the injection of new technology. As a technology it's certainly very interesting, and we'll see what effect it has. I think we need efforts like that so we don't become too inbred in our thought processes. So that's one aspect of the high performance industry that's very important.

Another aspect of the high performance industry as it exists today is: have we really seen the last of the vector machines in terms of new designs? There will be some new systems coming out of Japan from NEC and Fujitsu, and there may be some incremental upgrades coming out of Cray, but will we really see a whole new generation of vector machines? If we have this discussion in three years, will we have seen any new vector machines, any new designs? I think that's questionable. And there are a lot of reasons. The major reason is cost. It costs hundreds of millions of dollars to develop them. If the market isn't there, you can't make money. So that brings up the issue: if there is a need, how will this be funded?

I've had some dealings with some of the Japanese companies, and their government is helping them out, but a lot of people in Japan are looking at the ASCI program and other things, which are all RISC-based, and saying, well, maybe it's time to get off the vector bandwagon and get on the RISC bandwagon.

A lot of people would say it's already over. I'll say I'm not totally ready to throw in the towel, but as the RISC chips get faster and faster and cheaper and cheaper, it's difficult to see how the conclusion could be anything other than all RISC-based systems. I think the major debate today is not so much whether it's RISC-based; it's how scalable it is, whether there is a system-wide virtual address space, whether it's cache coherent. We were the first production cache-coherent system, and people said, 'Oh, MPI, your message passing interface is sufficient,' things like that. We said no, it's not. And over time the market has proven our thought process right, mainly because it's a simple programming model and so on.

So now, in addition to the Convex Exemplar, you certainly have the SGI Origin; and in their presentations IBM had said they are moving to a global model, and if you have a global model you can still support message passing. I think from that the trend is clear. To me the biggest unknown in the high performance industry is how many CPUs, what the scalability is, what the unit of cache coherence is.
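
To make the programming-model contrast concrete, here is a minimal C/MPI sketch (not from the interview; the problem being solved and its size are invented for illustration). With explicit message passing, the data decomposition and the exchange of partial results show up in the source code; on a cache-coherent machine with a global address space, the same reduction can be an ordinary loop over one shared array, which is the "simpler programming model" argument.

    /* Minimal sketch: a global sum done with explicit message passing.
     * Build and run with an MPI toolchain, for example:
     *     mpicc sum.c -o sum && mpirun -np 4 ./sum                          */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Each process owns a private slice of the work. */
        double local_sum = 0.0;
        for (int i = rank; i < 1000000; i += nprocs)
            local_sum += 1.0 / (i + 1.0);

        /* Message passing: partial results must be communicated explicitly. */
        double global_sum = 0.0;
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                   0, MPI_COMM_WORLD);

        /* On a cache-coherent, globally addressed machine the same reduction
         * can be a plain loop over one shared array (plus a lock or atomic
         * add): no sends, receives, or explicit decomposition in the code.  */
        if (rank == 0)
            printf("sum = %.6f\n", global_sum);

        MPI_Finalize();
        return 0;
    }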

The other issue everyone is asking about is, will you see Java in high performance computing? The way I deal with that is that basically numerically intensive programming is still LAPACK and things like that, with a Fortran wrapper. Then as you got to Unix, you used C as a wrapper to have a better interface to the operating system; it's a lot easier to do Unix-level stuff with C than it is with Fortran. As we get to Web-centric supercomputing, you'll probably see Java wrappers around the C or Fortran numerically intensive stuff, because it's a lot easier to do Web-centric supercomputing with Java than with C++ or Fortran. There are a lot of efforts going on. I'm the chairman of the external advisory committee for the CRPC down at Rice, and the last time we had a meeting there was a lot of material on Java and how it is being used in these Web-centric things.
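
As a concrete illustration of that layering, here is a minimal C sketch, assuming a Fortran LAPACK library is available to link against (the wrapper's name and the tiny test problem are invented). The numerics stay in Fortran (LAPACK's DGESV), a thin C wrapper hides the Fortran calling conventions, and a Java front end for Web-centric use could in turn call a wrapper like this through a native interface such as JNI.

    /* Minimal sketch: a C wrapper around the Fortran LAPACK routine DGESV.
     * Assumes a reference-style LAPACK/BLAS is linked in, for example:
     *     cc solve.c -llapack -lblas -o solve                               */
    #include <stdio.h>

    /* Fortran LAPACK: solve A*X = B for a general N x N matrix A. */
    extern void dgesv_(int *n, int *nrhs, double *a, int *lda,
                       int *ipiv, double *b, int *ldb, int *info);

    /* C wrapper: hides the Fortran conventions (arguments passed by
     * reference, column-major storage) behind an ordinary C function.
     * The name solve_linear_system is invented for this example.            */
    int solve_linear_system(int n, double *a_colmajor, double *b)
    {
        int nrhs = 1, lda = n, ldb = n, info = 0;
        int ipiv[64];                /* pivot indices; assumes n <= 64 here  */
        dgesv_(&n, &nrhs, a_colmajor, &lda, ipiv, b, &ldb, &info);
        return info;                 /* 0 means success                      */
    }

    int main(void)
    {
        /* 2x2 test system [2 1; 1 3] x = [3; 5], solution x = (0.8, 1.4).  */
        double a[4] = { 2.0, 1.0, 1.0, 3.0 };   /* column-major             */
        double b[2] = { 3.0, 5.0 };
        int info = solve_linear_system(2, a, b);
        printf("info = %d, x = (%.3f, %.3f)\n", info, b[0], b[1]);
        return 0;
    }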

What's going to happen? I bet in another year, if I take this Top 500 list, what you'll see is that 20-30% of the installations will list their application area as industry database. That's the first thing. If I look at the Linpack report right now... I know it's a strange way of making an observation, but the first page is all vector machines, because it's sorted by the 100-by-100 results, and the second page is pretty much vector machines, except the September 16th report has the 440 MHz Alpha and the IBM RS/6000 595. Well, I would expect over the next two or three years you're going to find the RISC chips beginning to move onto the first page of Table 1, where chips will begin showing 400-500 megaflops on the 100-by-100 problem. When that happens, that will be another reason, perhaps, to move away from vector machines.

The real issue here is the programming model and data set size. RISC machines are great if the stuff fits in the cache. If it doesn't fit in the cache, the performance drops by a factor of 10-20, and for vector machines that doesn't necessarily happen. That's why a lot of people still want vector machines. They're right; it's just a question of whether we can afford it, or whether it's time to reprogram the codes to be more cache-aware. This is a major debate that goes on over and over again. But ultimately time will tell.
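
For readers who have not seen it, here is a minimal sketch of what that cache-oriented reprogramming looks like (plain C; the matrix and tile sizes are invented): the same matrix multiply written first as the long streaming loops a vector machine digests well, then blocked into tiles small enough to be reused out of a RISC processor's cache.

    /* Minimal sketch of cache-aware reprogramming: one matrix multiply, two
     * loop structures.  N and BS are invented; BS would be tuned so that a
     * few BS x BS tiles fit in the target processor's cache.                */
    #include <stdio.h>

    #define N  512       /* matrix dimension (assumed)       */
    #define BS 64        /* tile size (assumed to fit cache) */

    static double A[N][N], B[N][N], C[N][N];

    /* Straightforward triple loop: long sweeps through memory, fine for a
     * vector unit, but once the matrices exceed the cache, each element of B
     * is refetched from main memory N times on a cache-based machine.       */
    void matmul_naive(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double s = 0.0;
                for (int k = 0; k < N; k++)
                    s += A[i][k] * B[k][j];
                C[i][j] = s;
            }
    }

    /* Blocked version: identical arithmetic, but the work is done on
     * BS x BS tiles so the operands are reused out of cache.                */
    void matmul_blocked(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                C[i][j] = 0.0;

        for (int ii = 0; ii < N; ii += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int jj = 0; jj < N; jj += BS)
                    for (int i = ii; i < ii + BS; i++)
                        for (int k = kk; k < kk + BS; k++) {
                            double a = A[i][k];
                            for (int j = jj; j < jj + BS; j++)
                                C[i][j] += a * B[k][j];
                        }
    }

    int main(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; }
        matmul_blocked();
        printf("C[0][0] = %.1f (expect %.1f)\n", C[0][0], 2.0 * N);
        return 0;
    }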

HPCwire: What is your function on the Presidential advisory committee for HPCC? How are things going? What direction is it taking?

WALLACH: Well, first of all, I'm one of approximately 25 people, and we have had two meetings so far. The first meeting was really more introductory in nature, and at the second meeting we started to do some substantive work. There are two sub-committees: one is high performance computing and the other is next generation Internet. I'm on the high performance computing sub-committee, as you can imagine. We're addressing a lot of issues. This is serious stuff. We're not there just to smile and look good. We all take it very seriously, and we're there to advise the government. It would be inappropriate to tell you what some of the projects are at this point, but people have said, hey guys (using that in a gender neutral sense), there are some thorny issues here and we need some advice. So we're diving into them, and at Supercomputing 97 there's going to be a Town Meeting for the community to come and express their views. We want to hear what people have to say with respect to government policy. We would like their input as to what they think the U.S. Government policy should be in high performance computing. I assume it's going to be structured so people will register with someone and be given ten or fifteen minutes to do their spiel. We listen and ask questions, and depending on what they say, if it's appropriate, we'll take it into consideration. So we're there, we have a charter, we have a mission, and we fully intend to do it. These are some very high powered people, and when they take their time out to do this, they want to accomplish something. It's not make-work. We're all very busy, and if it were make-work we'd all be out of there in a second. So we hope it will have a major impact on the government's next generation view of HPC and so on. There are various people in Washington who are looking to this committee for its advice.

Now ultimately the efforts in high performance computing and next generation Internet have to be merged, because as we move to gigabit-oriented networks, not locally but across the nation, more and more in supercomputing... you're going to need a supercomputer to control the network. In fact, that's one of the things I used to point out within HP: in the future, a system may have more network traffic than disk traffic, with multiple gigabit transmission lines coming in, and the aggregate bandwidth of those will dwarf the disk bandwidth. Then the question is: we have this Internet, which it is very important to keep moving along in terms of order-of-magnitude increases in performance, reliability, and quality of service. We have supercomputing, and we know supercomputing will live in this environment. What does that mean? How do we integrate it? How will we do collaborations? More and more you hear that... what's the collaborative environment of the future with respect to supercomputing and Internet networking? Those are some very intellectually challenging issues, and I enjoy it. A lot of us know each other, so it's not like we don't know who we are. There's a lot of e-mail back and forth, things like that. We take it very seriously. It's a good mix of people, that is, industry, academia, etc. It's very enjoyable.

HPCwire: Is there anything else you'd like our readers to know? Anything you'd like to add?

WALLACH: I think the main thing for the readership is that, with the future of high performance computing, things are obviously constantly changing, and more and more I think people have to take a long-term view and think in terms of a strategic direction as opposed to a particular chip family or something, and the reason is that every time you change vendors or change major pieces of software, it's very traumatic. You have to pick a strategic direction, and you have to pick someone who wants to reinvent themselves, because technology is changing so rapidly that companies always have to reinvent themselves.

The other thing, and this is probably the most important thing, and this is a bandwagon I've been on for the past fifteen years: you're going to have to rewrite your software. There's no way around it. Way back when, when we first started vectorization, we had great compilers, but you still had to program a certain way. You say, well, if people just programmed for performance they could double the performance of stuff in the United States, and someone challenges you and says, no, you're wrong, they could triple it, quadruple it. When we go through benchmarking efforts as part of the sales cycle, what we're basically doing is cleaning up the user code. They don't know how to write it correctly; I'm not saying everyone, but a lot. Here's the way the argument goes: you go into a big procurement and they benchmark three or four applications, and in many cases the rules are you can't make any modifications, or maybe only minimal modifications, and you say, 'If I can get another two days, I can make it run 30% faster,' and what they say is no. You're right, but you're not permitted to do that, because even though we're only benchmarking three applications, we really have a thousand applications that we want to run. There's not enough manpower or time or money to modify all thousand applications. And I take that at face value.
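
The "cleanup" being described is often as mundane as the following sketch (invented code, plain C): the arithmetic is untouched, only the loop order changes so the inner loop walks memory with unit stride, which a cache-based machine rewards; on arrays that do not fit in cache, this alone is frequently worth an integer factor in speed.

    /* Minimal sketch of a typical benchmark "cleanup": same result, different
     * loop order.  The array size is invented; the point is only the memory
     * access pattern (C stores arrays row-major).                           */
    #include <stdio.h>

    #define N 2048
    static float x[N][N];

    /* As often written: the inner loop strides down a column, so nearly
     * every access touches a different cache line.                          */
    void scale_columnwise(float s)
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                x[i][j] *= s;
    }

    /* After cleanup: identical result, unit-stride inner loop.              */
    void scale_rowwise(float s)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                x[i][j] *= s;
    }

    int main(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                x[i][j] = 1.0f;
        scale_columnwise(2.0f);
        scale_rowwise(0.5f);
        printf("x[0][0] = %.1f\n", x[0][0]);   /* back to 1.0 */
        return 0;
    }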

Now, a lot of performance is hindered by this lack of reprogramming. I'm not saying it's easy; in many cases it could take you years to rewrite things. But I remember at Los Alamos when they first got, I think it was their CM-2, they took this application off of a Cray, programmed it on the CM-2, and it ran four or five times faster. Some very impressive ratio. And since I know a lot of the people, I asked them to tell me more about it. And they said, we took this code, we picked out these artifacts of the Cray architecture and had to reprogram them to use the CM-2 architecture, and then as long as we were doing that we used a different algorithmic approach, because some better mathematics had been developed in the last five years, and on and on until we got this great number. So I said, 'Did you ever take the same code and put it back on the Cray?' And they go, 'Yeah, and it ran twice as fast also.' So what they were doing was comparing the ratio of the new code versus the old code. And when I mention it to various people they all say, 'We know that one.'

We only reprogram when we have to move to a new platform. And when we do the comparison, we do it on the program on the new platform relative to the application on the old platform, because that's the reason to buy it. You always have to buy new platforms; that's not the issue here. But it just shows you the sensitivity of the performance to the programming style. I always say that when you benchmark a machine, you're benchmarking the analyst as much as you're benchmarking the machine. I don't think we do enough of it, and every time I've brought it up in the various government panels I'm on, or before customers, they all say, you're right, we don't dispute what you said, but it's easier for us to spend another $30M for a piece of hardware that's performance compatible than it is to spend $100M reprogramming.

You see, the problem is it's not just the Fortran or the C or whatever language, it's also the algorithm. That is, the algorithm you use for a vector machine may in many cases be very different from one you'd use for a RISC-based machine. So the Holy Grail is to take the vector algorithm and move it onto a RISC-based machine without any loss of performance or efficiency. No one has figured out how to do that yet, in a general way. I personally believe this is becoming a bottleneck in the industry. One of the major bottlenecks is this issue of legacy code. That's why you hear people say that they need more vector cycles, and it's valid, don't get me wrong. The reason they're saying it is not because people aren't smart, quite the contrary; it's that it's cheaper to buy a new piece of vector hardware than it is to reprogram. But you get into a trap now... you'll never move to new generations. That's really a tough call, but if I look ahead five or ten years, I don't think there's much choice.

In fact, that's another thing I hear all the time: 'Well, we know we're going to have to do this in five years, but we don't want to do it right now.' If you can make that statement, that I'm going to have to do it eventually, isn't it more expensive to do it in three years than it is to do it now? They're putting off the inevitable. Don't get me wrong, there are a lot of other issues here. There are many other issues that go into making that decision. But for me, if I see what's going to happen in five years, I'm going to start now, not in four years. That's me. But everyone's different: there are innovators, there are early adopters, and there are people who are only interested in mature technology. That's like my example of technical and commercial systems. We're all right. Nobody's wrong. It's just a question of, as you do this and when you move, where do you want to be? Where do you want to spend your money? To me that's a personal decision. By definition you're right. It's just a question of what it will cost you to be right.

I would say in the last year, year and a half, and even now, people call me up for my opinion on things. I'm doing some pro bono work for some people. The questions are the same: 'When do I get off one technology and get on another? With my scarce resources, both people and funds, what's the best way to use them to go forward?' You have to examine each case individually. I mean, I can say definitively that this is going to be the hardware that is there in three years. And they say, well, do I have three applications or do I have 3,000? Everyone's different. It depends on your mission. Can you accomplish your mission with minimal recoding, or do you really have to recode and use an MPP? That's a very fine distinction and very user specific. When I was at HP, I did my best to say, we'll give you a very general purpose machine, with the tools, and go at it. That's where the industry has moved to.

The majority of the machines that are sold today are really general purpose machines. The vector machines are, in a sense, general purpose for technical computing, but not general purpose overall. As we come to this realization, I'm sure we'll see movement. But when I look at the data, and I know what's being done, I was involved in next-generation designs, to me it's a no-brainer. That doesn't mean it's a no-brainer to someone who has hundreds of millions of dollars invested in hardware, software, and people, who can't make that change so quickly. We can only tell them that this is what you have to face, and you can figure out how to run the operation.

----------

Steve Fisher is HPCwire managing editor. Email: steve@tgc.com