I'm here with Dr. Bill Dally, who is Chief Scientist and Senior Vice President of Research at NVIDIA. Welcome, Bill. >> Well, thanks, John. It's a pleasure to be here. I'm glad to have an opportunity to chat with you about important trends in computing. >> Super. Tell us about your job at NVIDIA. What do you do? >> I really have two jobs. As Chief Scientist, I get to look into a lot of neat technologies that we're developing, try to set technical directions in certain strategic areas, and work with key customers and partners on technology initiatives. And then as Senior Vice President of Research, I build and lead NVIDIA Research, which is a world-class research lab doing research in a lot of interesting areas, from graphics, to parallel computing, to circuits, architecture, and programming systems, all relevant to the future of NVIDIA.
So, tell us what you did before you came to NVIDIA. >> So, for 26 years I was a professor, first at MIT for 11 years, then at Stanford for 15 years. I taught courses on computer architecture and parallel computing, and led a research group that built a lot of interesting parallel computers and developed many interesting programming systems for them, including some early object-oriented parallel programming systems, Concurrent Smalltalk and Concurrent Aggregates, and some stream programming systems: StreamC, KernelC, Brook, and Sequoia. That sequence of programming systems ultimately led to CUDA, which was done by the same person who had done Brook, a stream programming language we did at Stanford. During that period of time, I was also a consultant for industry, working for numerous companies, in particular for Cray. I worked for Cray from the late 1980s into the middle of the 2000s, helping them develop interesting parallel computer systems.
So one of the cool recent announcements we've seen is the announcement of the Titan supercomputer, which is now the fastest supercomputer in the world, and this has NVIDIA processors at its core. Can you talk a little bit about that process, how NVIDIA got to be involved, and why that's such an exciting thing for GPU computing? >> Yeah. Well, first of all, Titan is an awesome machine. It's 18,688 Kepler K20 GPUs, and it is the fastest computer in the world at running High-Performance Linpack. There's an interesting story there. The story of Titan actually starts with a meeting that I had with Steve Scott, who at the time was CTO of Cray, at the Salishan Conference up on the Oregon Coast. We talked about how, together, we really should get NVIDIA GPUs into a Cray supercomputer, because we have better compute per dollar and compute per watt, which are the two things that matter in high-performance computing, than anybody in the world. And they were actually going through a problem, because they had bet on a different vendor who cancelled a project on them, and it left a hole in what they wanted to bid for this solicitation that was out from Oak Ridge to build what they call their leadership-class computing facility, which ultimately turned into Titan. It was a nice juxtaposition in time that I was having this conversation with him right at the point where there was a hole to be filled. And it turns out Kepler filled that hole wonderfully. There were a lot of challenges along the way that, I think, really had to do with getting the people in the National Labs to embrace the model of parallelism that CUDA presents.
I think that once they embraced it, they found it was actually easier to write their programs that way, and the programs actually ran better across the board once they were reorganized into that style of parallelism, launching CTAs and organizing things in that style. But they had a very large chunk of legacy code, mostly in Fortran. It was coded as Fortran running on a single node, with MPI used to communicate between the nodes, and it was a nontrivial exercise to bring that software over and get it to run well on a GPU-accelerated system. And I think beyond the Linpack number, which is actually a relatively easy number to get because it's one program, what really is the success of Titan is the very large number of basic energy science and defense codes that have been ported over very successfully and get just tremendous performance on the K20s.
So, NVIDIA is not only building high-end, more expensive processors, like the Keplers and like the GPUs you might buy for gaming, but it also works very hard in the mobile arena. And so, how is GPU computing going to influence that? >> Well, it's not in our current shipping products, but the strategy, which will be available in the relatively near future, is to have the same GPU from the top to the bottom of the line. And so eventually you will have a Kepler GPU in the Titan supercomputer and you'll have a Kepler GPU in your cell phone, and you can apply the same programming models across that line and do GPU computing in your cell phone to the extent that it makes sense to do so. >> So where do you see the most interesting applications going for things that you carry in your pocket? >> You know, I think that the really compelling applications for mobile devices are in computational photography and computer vision. I think that photography is at this cusp of being completely transformed from something that you do primarily with lenses and optics to something you do primarily with computing, and a GPU is just a perfectly matched tool to do the type of image processing and signal processing you need: basically, after you've collected a bunch of photons, to process them in a way that produces really compelling images. >> So that sounds fairly abstract, and I think this is a fascinating field. Can you give an example of something that a GPU would help you do in this computational photography realm? >> You know, even little low-level things. Like, if you want to do high dynamic range images, you'll actually acquire a set of images. Then there is a bunch of problems in composing them together. The camera may have moved slightly between the images, so they have to be registered.
There may be objects moving in the images, so even if you registered the images, you have to back that object motion out to get the equivalent of a still photograph that was taken at one point in time. You may have what's called a rolling shutter, where, because of the way the CCDs work, you expose one line of the image at a time, so that if there's any motion, that will produce a wavy effect in the photo. You, again, have to remove that. Then you want to do some processing of those images to remove noise and to enhance the image. And then finally you have to decide both, when you acquire, what exposure times you want in order to get an optimal high dynamic range image, and then how to combine those images into one final image that, given the limited gamut you have to play with on your display device, gives the human who's viewing it the appearance of the many orders of magnitude of dynamic range from the original collected images. >> So bright areas of the image look bright, but not that bright. >> Not that bright, though. You don't have that much dynamic range to display that.
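The final combining step described above can be illustrated with a small sketch. It is not the pipeline used in any NVIDIA product; it assumes the exposures are already registered, that the sensor response is linear, and that pixel values are floats in [0, 1]. The function names (`hat_weight`, `merge_exposures`, `tone_map`), the triangle weighting, and the Reinhard-style tone curve are illustrative choices, not something stated in the interview.

```python
def hat_weight(v):
    """Trust mid-range pixels most; fully dark or clipped pixels get zero weight."""
    return 1.0 - abs(2.0 * v - 1.0)

def merge_exposures(exposures):
    """Estimate relative scene radiance per pixel.

    exposures: list of (pixels, exposure_time), pixels being floats in [0, 1].
    Assumes the images are already registered and the sensor is linear.
    """
    n = len(exposures[0][0])
    radiance = []
    for i in range(n):
        num = den = 0.0
        for pixels, t in exposures:
            w = hat_weight(pixels[i])
            num += w * pixels[i] / t   # pixel / time is a radiance estimate
            den += w
        if den > 0.0:
            radiance.append(num / den)
        else:
            # Every exposure is clipped here: fall back to the shortest one.
            pixels, t = min(exposures, key=lambda e: e[1])
            radiance.append(pixels[i] / t)
    return radiance

def tone_map(radiance):
    """Reinhard-style global curve: squeezes any radiance into [0, 1)."""
    return [r / (1.0 + r) for r in radiance]

# Hypothetical scene spanning roughly two orders of magnitude.
scene = [0.5, 4.0, 40.0]
times = [1.0, 0.1, 0.01]
# Simulate a sensor that clips at 1.0 for each exposure time.
shots = [([min(1.0, r * t) for r in scene], t) for t in times]

merged = merge_exposures(shots)
display = tone_map(merged)
print(merged)
print(display)
```

Each pixel is independent of the others, which is why this kind of work maps naturally onto a throughput processor: on a GPU, the loop over `i` would become one thread per pixel.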
So let's turn to the technology side a little bit. Ten years ago, if I wanted a faster computer, what I would do is wait six months, and then the clock speed of the CPUs I could buy would be 30% higher than it was, and I'd go down to the local store and buy a new, faster processor. And that was largely determined by just a faster clock: I'd go from 1 to 2 to 3 gigahertz. That's not happening anymore. How come? >> Yeah, so it's interesting. A lot of people refer to that scaling in CPU performance as Moore's Law, which is a little bit of a misnomer. Moore's Law really predicts a growth rate in the number of devices you can fabricate on a chip, and Moore's Law is actually alive and well. We can still, every generation of technology, fabricate more devices on the same size chip, and that's still growing exponentially. What stopped around 2005 is what's called Dennard scaling, which was scaling the voltages that we operate our chips at as we scaled the dimensions that we make the transistors at. And this stopped because of leakage current in the devices. Without going too technically into it, what that meant is this. When we get a new generation of chip, let's say we reduce the line width from, say, one unit to 0.7 of whatever that unit was. A recent jump is from 28 nanometers to 20 nanometers. When we do that, we get twice as many devices in the same amount of area. In the old days we would also scale the voltage by 0.7, and since the power goes as CV squared (I should say energy goes as CV squared), C would go by 0.7 and the voltage squared would go by 0.7 squared. We'd wind up with the energy of switching the device being about a third of what it was before, one over the square root of 8, and so we'd get this 3x improvement in perf per watt of a basic unit, and then you could take that and use it in various ways. One of the ways we used it was to crank the clock rate up.
So now that's not happening anymore, now that we're holding voltage constant. We're getting a little bit of energy gain from the 0.7 in capacitance. But even that, we're not really getting the whole 0.7; we're getting maybe half of that. So we get perhaps, in a generation of technology, a 15% improvement in the basic underlying technology. That's why you can't take the same old serial processor and just have it go faster anymore. You're not going to get any of the frequency improvement. Frequencies have largely flattened out. You get more parallelism, but it's no longer that factor of 3 each generation that you used to get. If you just relied on process, that 15% or so is all you would get. >> Even though each one of those transistors is a little smaller and sucks up a little less power, that's not nearly enough to change the fact that you're making these bigger and more parallel every generation. >> That's right. The better way to think about it is: suppose you got the whole 0.7 on capacitance in that scaling. That means you're now at 30% less energy, so you can put 30% more units on. You would only grow 30% more parallelism, in a power-constrained environment, on that same die. You could take that 30% either in clock rate or in more units, although it's harder to take in clock rate because that scaling is less linear. Now, you don't get a whole lot to play with each generation until you start innovating, and that's why it's actually kind of fun to be a computer architect these days, because that's where the value is. It used to be largely in process, and so companies that had a proprietary process had kind of an advantage. They still have a little bit of an advantage, but that advantage has really shrunk, since process matters less these days.
And what you do with the process, meaning the architecture, the circuits, and the programming system, matters a lot more.
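The scaling arithmetic in this answer can be checked with a few lines of Python. This is only a sketch of the numbers quoted in the interview: switching energy is taken as proportional to C times V squared, the 0.7 shrink per generation is the figure Dally gives, and the function name `energy_ratio` is made up for illustration.

```python
import math

SHRINK = 0.7  # linear scale factor per process generation, as quoted above

def energy_ratio(cap_scale, volt_scale):
    """Relative switching energy after one generation, with E ~ C * V^2."""
    return cap_scale * volt_scale ** 2

# Dennard scaling: capacitance and voltage both scale by 0.7.
dennard = energy_ratio(SHRINK, SHRINK)

# Voltage held constant, full 0.7 benefit on capacitance.
constant_voltage = energy_ratio(SHRINK, 1.0)

# Voltage held constant, only about half of the capacitance benefit realized.
half_benefit = energy_ratio(1.0 - (1.0 - SHRINK) / 2.0, 1.0)

print(f"Dennard:          {dennard:.3f}  (1/sqrt(8) = {1 / math.sqrt(8):.3f})")
print(f"Constant voltage: {constant_voltage:.3f}")
print(f"Half the benefit: {half_benefit:.3f}")
print(f"Perf per watt under Dennard: {1 / dennard:.2f}x per generation")
```

Under Dennard scaling the energy per switch falls to about 0.34 of its previous value, close to the "one over the square root of 8" figure, while with constant voltage and only half the capacitance benefit it falls only to about 0.85, the 15% improvement mentioned above.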
So, when you meet with your architects and you talk about designing the next-generation processor, what are really the design constraints that you think are at the top of the list? Certainly, power's going to be number one, for sure. >> Yeah. So, energy efficiency is always the number one design constraint. I think very closely behind it, though, is programmability and range of applications. And again, support for divergence is a good example of that. We want to have both a control structure and a memory system that will run divergent codes well, because that'll give us a broader range of applications. And then we'll look at programmability, and it's not just being able to write the program and get it to function, but to get a large fraction of the absolute performance you can get out of the machine without undue effort. A lot of this is removing performance surprises from the architecture, so that if you write the code in the very straightforward way, you'll actually get pretty good performance. It won't be awful, such that you then have to sit there trying to deeply understand some obscure fact about the memory system to figure out why you've fallen off a performance cliff. >> And that's something that's changed really an enormous amount in the last five years. The cliffs for GPUs have historically been steeper than for CPUs, but they're getting much flatter as a result of the work you guys are doing.
So one of the questions I ask the students in the classes that I teach is, what are you going to do with 100 times more compute? And sometimes that's a really hard question for them. There's a lot of head scratching. So, both in terms of what we can do with a supercomputer that's 100 times more powerful, and what you can do with something on your desk or in your pocket, where do you see us going in this direction? >> Yeah, well, I have an insatiable appetite for FLOPS. And so I would have no trouble using 100 or even 1,000 or even 10,000 times more compute. You know, a lot of what I do is designing computers, and a lot of that involves prototyping and simulating new computer designs. And I'm always frustrated about how slowly those simulations run. So if I could run RTL simulations of a new computer 100 times faster, it would enable me to be much more productive in trying out new ideas for computer design. Same for circuit simulations. I spend a lot of time waiting for circuit simulations to converge, and if I could run them 100 times faster, I could not just run one simulation, but run whole parameter sweeps at once and do optimization at the same time I'm simulating. Another thing is the computers in your car. Our Tegra processors are actually designed into lots of different automobiles, including the Tesla Model S, the Motor Trend Car of the Year, but also Audis and BMWs, and all sorts of Fords have Tegras in them. And the applications people are starting to use these mobile processors for in cars involve having lots of computer vision to look at what people inside the car are doing and at what people outside of the car are doing. In many ways it makes your car much safer by having the car aware of what's going on around it. In many ways it compensates for the driver not being completely alert, or perhaps texting or doing something they shouldn't be doing.
And in mobile devices, I think there are a lot of compelling applications in both computational photography and augmented reality. If your mobile device is constantly aware of what's around you, it can be informing you: "Oh, I think you're hungry. Here's a place that has gyros that I know you like, because I have your profile of likes and dislikes. Maybe you should stop for lunch." Or, "A block away is this guy who you really don't like. Maybe you should turn right at this corner and avoid running into him." In many ways, I think it will evolve to having your computing devices become your personal assistant. It's like Jarvis in the Iron Man movies. I would like to have a device I can talk to like that, one that is aware of the environment around me and can basically be a brain amplifier for me. It can remember things that I forget, tell me about things in my environment, and basically assist me in going through my day, on both a professional and a personal basis. >> So one of the goals of the supercomputer industry is to get up to what they call exascale: they'd like to do ten to the eighteenth floating-point operations per second. And certainly NVIDIA is going to be interested in being in those computers. What are we going to use that for? >> Well, I think first of all, there's nothing magical about exascale. When we first made petascale machines, which was just a few years ago, it wasn't like breaking the sound barrier. Nothing really changed qualitatively, but it enabled better science. Look at the fidelity of simulations we're able to do today, say, to simulate a more efficient engine for automobiles to improve gas mileage. We're making lots of approximations to fit them on the supercomputers we have today.
As we get to higher fidelity by resolving grids finer and modeling a bunch of effects like turbulence more directly, rather than using macro models to model them, we'll get more accurate simulations. And that will enable a better understanding of combustion, of some of the biotech applications like how proteins fold, of various others, you know, climate... >> Climate modeling. >> Sure, how climate evolves. And basically, as we get better computing capacity, it's not that you reach a magic exascale and wonderful things happen. Every step along the way, we get better science and we are able to design better products. Computing is a big driver of both scientific understanding and economic progress across the board. I think it's very important that we maintain that steady march forward, and exascale is just one milestone along that march. >> And my understanding is that power is an enormously crucial thing for them to get right to be able to enable exascale, because we don't want machines that are going to cost $2 million a month just to plug in. >> Right. It's really an economic argument. If you really wanted an exascale machine today, you could build one. You'd just have to write a really big check and locate it right next to a nuclear power plant, the entire output of which it would consume. If there was some application that was so compelling that people were willing to write the multi-billion dollar check required to do that, you would do it. I think the real question of exascale is an economical exascale, because of the total cost of ownership, the power bill is a tremendous fraction.
So it's not actually an economical exascale machine unless you can do it at a reasonable power level, and the number that's been thrown out is 20 megawatts. >> So that's $20 million a year. >> Yeah, a $20 million a year power bill, if you're paying roughly 10 cents a kilowatt-hour. And in fact, the bill usually winds up being a little bit higher than that, because the cost of provisioning the energy, amortized over, say, a 30-year lifetime of the facility, is usually about equal to the annual bill for the energy. There is also something called the PUE, which is basically the efficiency of providing the energy. Even for a very good installation today it's maybe on the order of 1.1 to 1.2, so you pay another, say, 20% to run the air conditioners and fans and things like that in the facility; basically, energy you're consuming that isn't being consumed by the computer. But it's a big challenge for us to get there from, say, Sandy Bridge today, which is 1.5 nanojoules per instruction. Suppose you wanted to do an exa-instruction per second. To do an exaflop you might have to do more than an exa-instruction per second, but even if you take that as the target, at 20 megawatts that's 20 picojoules per instruction. And that's not just the processor, that's everything. That's the memory system, that's the network, that's the I/O and storage system. It's the whole ball of wax. So you may only get 10 picojoules per instruction to actually use in the processor. >> And so even NVIDIA is not quite close enough to that. >> Yeah. Well, compared to Sandy Bridge, that's a factor of 150, and process isn't going to help you much. So that's why conventional CPUs are not going to get there. It's going to require a hybrid multi-core approach, with most of the work being done in a GPU-like throughput processor, to get there. But even we have a ways to go.
You know, we're close to an order of magnitude away, and we might get a factor of three from process. So we need to be very clever to come up with the other factor of 3 or 4 that we need. >> You know, Titan does have CPUs in it, yes? >> That's correct. >> So is there a vision where that won't even be the case? >> No. I think there are always pieces of the code where you have a critical path. You have a piece of single-threaded code that you need to run very quickly, and so you always need a latency-optimized processor around to do that. But most of the work is done elsewhere. It's kind of like a cache memory, where most of your accesses are to this little memory that runs really fast, but you still need the capacity of the big memory sitting behind it. It's the same thing with throughput versus latency. Most of your work is done in the throughput processors, but when you do have a latency-critical thing, you run it on the latency-optimized processors. And so you wind up getting the critical-path performance of the CPU with the bulk energy consumption of the GPU. >> And the bulk of the flops in Titan are certainly going to the GPUs. >> Right. The bulk of the flops will be in the GPUs.
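The energy budget discussed in this exchange is simple arithmetic, sketched below in Python using only the figures quoted in the interview. The constant names are invented for illustration, and the even split of the budget between the processor and everything else is an assumption read off the "10 picojoules per instruction" remark, not a stated design rule.

```python
POWER_BUDGET_W = 20e6     # 20 megawatts, the target quoted above
OPS_PER_SECOND = 1e18     # exascale: ten to the eighteenth operations per second
PRICE_PER_KWH = 0.10      # dollars, "roughly 10 cents a kilowatt-hour"
CPU_ENERGY_J = 1.5e-9     # 1.5 nanojoules per instruction (Sandy Bridge figure)
PROCESSOR_SHARE = 0.5     # assumption: half the budget is left for the processor

# Energy available per operation for the whole system.
joules_per_op = POWER_BUDGET_W / OPS_PER_SECOND

# Portion of that budget the processor itself gets to spend.
processor_budget = joules_per_op * PROCESSOR_SHARE

# How far the quoted CPU figure is from the processor budget.
gap = CPU_ENERGY_J / processor_budget

# Yearly electricity bill, before provisioning costs and PUE overhead.
annual_bill = POWER_BUDGET_W / 1000 * 24 * 365 * PRICE_PER_KWH

print(f"System budget:    {joules_per_op * 1e12:.0f} pJ per operation")
print(f"Processor budget: {processor_budget * 1e12:.0f} pJ per operation")
print(f"Gap vs. 1.5 nJ:   {gap:.0f}x")
print(f"Annual bill:      ${annual_bill / 1e6:.1f} million")
```

At exactly 10 cents per kilowatt-hour the bill comes to about $17.5 million a year, consistent with the rounded $20 million figure in the conversation; a PUE of 1.1 to 1.2 would push it closer to that number.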
So, to close, do you have any thoughts for students who are taking this class? I mean, it's very exciting to get to teach thousands of students about parallel computing. What should they be thinking about going forward? >> You know, I think my big advice for students is to get your feet wet. Go out there and write a bunch of parallel programs. The future of the world is parallel. Serial programs are going to be like COBOL programs: people used to do those, but they don't do them anymore. And don't just write the program, but be a critic of your own code. Go and profile it, understand why it performs the way it does, and understand that memory is not flat. If you want to get good performance out of a program, you have to understand the hierarchy of the memory system. And find a compelling application. I always find that if you really want to make progress on improving your programming skills, especially parallel programming skills, it's motivating to say, "Okay, I've got this really cool computational photography application. I'm going to make cool pictures." That's actually one reason why graphics is so fun: you can see your results. >> Super. Thanks so much for coming, Bill. I really appreciate your time. >> Oh, well, thank you, John. It's really been a pleasure.