I'm here with Stephen Jones, who's a senior software engineer at Nvidia. Stephen, tell us about your job. I'm the CUDA lead at Nvidia, which means I'm responsible for the programming model and for the engineering that goes behind it. So when we look at the set of things we could do for CUDA, the list is long, and I sit and evaluate what we can do, what's useful to do, and what the right direction is. I'm not the only person who makes these decisions, but I'm the engineering end of that: I sit there and figure out what the resources are going to need to be, and which different groups we have to pull together to make these things happen.
And so you've been at Nvidia for four years? Yeah, just over four years. And you've had kind of an interesting path to get there, so tell us a little bit about your background. My background is actually not computing; I've never had a computing class in my life, which is not necessarily a claim to fame. My background is actually fluid mechanics and plasma physics. From my engineering days it turned out that you can't do physics without computing. When I graduated in '96 the whole world was already moving toward computing and science being connected. But, you know, I was young, and it seemed a lot more interesting to go and write computer games, so I went and wrote computer games for a while. Then I worked for the military in high-performance computing, went to work for a bunch of start-ups, and somehow meandered my way around to Nvidia.
So, one of the really exciting new things that you guys have recently put into CUDA is this idea of dynamic parallelism. And so we've had some material on that in the course. So before we talk about what it actually is, can you talk about what problem you're trying to solve?
The thing we really want to do, at least the way that I think about it, the thing that I want to be able to do with the GPU, is to make it easier to program, and to broaden the number of problems you can do on the GPU. So the GPU starts out being really, really good at very broad bulk parallelism: you can throw a lot of threads at a problem and it handles them very efficiently. But for less regular problems, for more diverse or fine-grained parallelism, it's hard to express that in CUDA today. So the dynamic parallelism effort was aimed at an easier way to extract more parallelism from your problem. We wanted to solve things like recursion and task parallelism, both types of problems that are difficult to express in a single big grid of threads. So it's a piece we added to the paradigm to enable new types of programming problems to be solved.
So, tell us about what dynamic parallelism adds to the CUDA programming model. Quite simply, it lets you launch kernels directly from inside another kernel. And that sounds relatively simple. If you like, the analogy on the CPU side is being able to create threads inside a process: instead of having to go back to the operating system to initiate a new process, I can spawn a pthread and do something asynchronously within the single process that I have, and manage it all from within my own process. I can now do that on the GPU as well. I can create work, and not just single threads: I can create whole grids from inside my GPU. So if my program is working on something and I suddenly need to invert a matrix or perform a Fourier transform, I can just call out into a kernel which does that for me, returns the data, and I continue. So I can embed parallel work exactly where I need it, with the data that I have available inside my program. A corollary to that is the idea of being able to take data that you are dynamically working on, a value that you have mathematically or algorithmically generated, and use that value to make decisions about the work you're going to do. If my value is one, do this; if it's two, do that. Or maybe I'm partitioning a problem, building a tree, spatially partitioning something. If I have a certain number of things in one place and a larger number in another place, I can dynamically launch the correct number of threads to handle that. And the ability to do this on the fly, dynamically, is really the power of dynamic parallelism.
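To make that concrete, here is a minimal sketch of a device-side, data-dependent launch. It assumes CUDA dynamic parallelism support (compute capability 3.5+, compiled with relocatable device code, e.g. `nvcc -arch=sm_35 -rdc=true`); the kernel names and the `counts` array are hypothetical, purely to illustrate sizing a child grid from a value computed on the GPU:

```cuda
// Hypothetical child kernel: processes one batch of data.
__global__ void childKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // placeholder work
}

// Parent kernel: decides at run time how much work to launch.
__global__ void parentKernel(float *data, int *counts, int numRegions) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < numRegions) {
        int n = counts[r];              // work discovered dynamically
        int blocks = (n + 255) / 256;   // size the child grid to fit
        if (n > 0)
            childKernel<<<blocks, 256>>>(data, n);  // launched from the device
    }
}
```

The key point is the `childKernel<<<blocks, 256>>>` line: the grid size comes from data the parent kernel discovered at run time, with no round trip to the CPU.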
So, part of the motivation is ease of programming. How, specifically, is it easier for someone to write a program that has these irregular, complex control structures and data structures, compared to not having dynamic parallelism?
In the past, that is to say, before the dynamic parallelism support on Kepler, which is a new GPU, whenever you needed to make a new batch of work, you had to return to the CPU, which was the master controller, to go and launch that work for you. So if ever in my program I reached a point where I needed that matrix inversion or that FFT to be done, I had to halt my program, return to the CPU, have the CPU then launch this work for me; that would complete, return back to the CPU, and the CPU would then have to restart me. In fact, I would have to split my program in two around this moment where I needed the extra parallel work to be done. Suddenly, instead of having one smooth program, I have two fragments of program, I have state that I have to save across the two, and then I have my CPU having to get involved to manage and marshal this work. With dynamic parallelism, I can just do this all compactly, on the fly. If you like, the system does all that for you: it will save off your old program, it will run the new FFT for you, it will return the result to you, and it will continue where you left off. So from the programmer's perspective, I'm no longer programming in two places at once, I'm no longer having the GPU and the CPU both tightly bound over my execution, and I no longer have to manage the portions of my program around where I need to launch this new work. I can just inline it, effectively, and it makes for a much simpler and more straightforward program. That's fantastic. And what about the performance implications? There's always a performance overhead bouncing backwards and forwards between the CPU and GPU. You've got the latencies of the PCI bus, which is the communication link. You've got the overheads of shutting down the first portion of your program, starting up the next portion, and resuming right where you left off. So those overheads get amortized.
You save, potentially, data transfer across the buses. And in a way, something I feel is actually more important is that with the GPU, you're always trying to get as much work onto that GPU as possible. You can much more easily overlap the new work that you're doing with other stuff that's still going on in the GPU. I don't have to shut down completely and fire up an FFT; I can do all of these things inline while something else useful is going on at the same time. And remember, you've got thousands of threads on the GPU: they can all be doing this, modulo resources, at the same time. So you can get a much easier overlap between the different pieces of work you're doing, and it's therefore much easier to keep the GPU busy, and that gives you a lot of potential for more performance.
So what are the kinds of problems you looked at as you were designing this, where dynamic parallelism really makes a difference, either in terms of usability or performance? In terms of usability, obviously it's simpler to program when you don't have to keep going backwards and forwards to the CPU. So, any kind of problem which dynamically discovers the work as it goes, iteratively working its way through something. Imagine you're constructing a tree, partitioning something into an octree, which is a common 3D spatial problem. You don't necessarily know how many objects are going into which part of your tree until you reach that point in the tree. So typically the approach would be to do this level by level by level, which is not necessarily the most balanced way of doing things. So as you discover the work that you need to do, the ability to simply launch that work is much, much simpler. We were very motivated by irregular parallelism, if you like: problems which don't have nice, well-balanced structure. A similar category of problem is task parallelism, where I might not want just one thing that fills my whole GPU, which is often difficult now that GPUs these days are a teraflop of performance. It might be much easier to have half a dozen or a dozen things running on my GPU at a time. And if each of these can autonomously make forward progress, it's much easier to manage them if they're managing themselves, instead of having my CPU now juggle twelve different things instead of just one. And finally, there's a new type of algorithm that you can approach: recursive algorithms.
Things in the category generally called divide-and-conquer algorithms, where you take all of the work that you need to do and you conquer it by subdividing and subdividing and subdividing repeatedly. A typical example would be quicksort, a well-known problem where you take the data that you want to sort and recursively progress through it until you end up with a final sorted data set. So one of the demos you guys showed when you launched this was an n-body simulation of stars: a bunch of stars moving around, being attracted to each other by gravity. And with dynamic parallelism, you were able to write that in a way that you hadn't written it before. So how did that come about? What was the cool thing you could do with it? I mentioned octree spatial partitioning just now, and that is a key component of an n-body simulation. The way that you approach an n-body simulation with a very large number of bodies is, instead of doing an all-to-all comparison where you just calculate the gravity between all the bodies, which for n bodies is an n-squared problem, you can cut the complexity down to an n log n or order-n problem by partitioning things in space: doing a local interaction of gravity between your close neighbors, and doing an approximation to a center of mass for more distant neighbors. To do this, you build an octree, partitioning your bodies up into small octants, little cubes. For everyone inside your cube, you do the n-squared problem, and for all of the other cubes you then only have an order-n expansion. So what we did with dynamic parallelism in this n-body problem was optimize the tree build, which is about half of the time the whole simulation takes. We optimized the tree build by using this recursive property, by using this irregular-parallelism ability.
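As a sketch of how the divide-and-conquer pattern he describes maps onto device-side launches, here is a hedged, simplified quicksort in the spirit of Nvidia's dynamic-parallelism samples. The names are illustrative, and a real version would cap the recursion depth and fall back to an in-thread sort for small spans:

```cuda
// Illustrative partition step (Lomuto scheme), running in a single thread.
__device__ int partition(int *data, int lo, int hi) {
    int pivot = data[hi], i = lo - 1;
    for (int j = lo; j < hi; ++j) {
        if (data[j] < pivot) {
            ++i;
            int t = data[i]; data[i] = data[j]; data[j] = t;
        }
    }
    int t = data[i + 1]; data[i + 1] = data[hi]; data[hi] = t;
    return i + 1;
}

__global__ void quicksortKernel(int *data, int lo, int hi, int depth) {
    if (lo >= hi) return;
    // A real implementation would check `depth` against the device runtime's
    // nesting limit and switch to an in-thread sort (omitted here).
    int p = partition(data, lo, hi);
    // Each half becomes a new child grid, launched from the device.
    // Device-side streams must be created with the non-blocking flag.
    cudaStream_t s1, s2;
    cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);
    quicksortKernel<<<1, 1, 0, s1>>>(data, lo, p - 1, depth + 1);
    quicksortKernel<<<1, 1, 0, s2>>>(data, p + 1, hi, depth + 1);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```

The two halves run in separate streams so they can proceed concurrently, which is exactly the "work discovered as you go" shape he describes: neither half's size is known until the partition step computes it.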
The tree, in the case of a galaxy of stars for example, might be very, very dense in some regions and very, very sparse in others, and this let us build it much more efficiently. Instead of building the tree level by level by level, wasting work on areas where the bodies are sparse, you focus the compute performance on the areas where you need it. Could you have done this without dynamic parallelism? Yes, I guess you could, but the overhead of moving backwards and forwards between the CPU and GPU would probably negate the performance gain that we got.
Stephen, what kinds of problems would be a good match for this dynamic parallelism capability? As programmers are thinking about what techniques to use in designing the next generation of algorithms, where is dynamic parallelism really going to make a difference for them? I guess when I think about the features that I want to add to CUDA, I really want to make it easier to write the programs you want to write. And dynamic parallelism, I think, gives you much more flexibility in how you write your code. As well as things like recursive algorithms, which are extremely difficult without it, sometimes it's just easier to write a program wholly on the GPU, for example. If I know that my program is a small iterative loop launching lots of parallel work, it might simply be easier to write the whole thing on the GPU and have a GPU thread control the whole thing from scratch. So if I'm in a situation where marshalling memory backwards and forwards between CPU and GPU would be complex or difficult, then if I put my whole program on the GPU, all my memory is in one place, and it's much easier to write. So the first kind of problem I would look at is one which would just be made easier by keeping all of your code in one place. You may not necessarily go any faster; remember, serial execution on the GPU is slower than serial execution on the CPU. But when you're developing code, even though you might be able to get something working faster with a CPU-and-GPU combination, typically, and as I say, I came from a science background, you don't have six months to tune your code; you've got four weeks before your paper deadline. So the question is not, how good can I get it? It's, how good can I get it in the four weeks before my paper is due? And if I can give you something like dynamic parallelism, which makes it easier to get as good as you can in four weeks, you might only get to 75% of performance in those four weeks.
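A rough sketch of that "whole program on the GPU" style follows, assuming the original dynamic-parallelism model in which a kernel could call `cudaDeviceSynchronize()` to wait on its child grids (later CUDA releases changed how child-grid completion is expressed). `stepKernel` and the update it performs are placeholders:

```cuda
// Parallel worker: one placeholder update per element.
__global__ void stepKernel(float *state, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) state[i] += 1.0f;   // placeholder parallel work
}

// A single controller kernel, launched with <<<1, 1>>> from the host,
// drives the whole iterative computation from the GPU.
__global__ void controllerKernel(float *state, int n, int numSteps) {
    for (int step = 0; step < numSteps; ++step) {
        int blocks = (n + 255) / 256;
        stepKernel<<<blocks, 256>>>(state, n);  // parallel work each iteration
        cudaDeviceSynchronize();                // wait for the child grid
    }
    // All state stays in GPU memory; the CPU is never involved mid-loop.
}
```

The controller thread itself runs serially, which, as he notes, is slower than a CPU thread; the payoff is that the data never has to be marshalled back and forth, and the whole loop lives in one place.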
But if, without it, you'd only get to 50% of performance, you can do that much more science in the time you have available. So, beyond the new types of algorithms that it lets you do, and those are very interesting because they let you approach things on the GPU you simply couldn't before, I think the first question to ask is: does this just make my life easier? Because if it makes your life easier, you can spend your time focusing on the science or the calculation you need to do, rather than wondering how on Earth you program this thing in the first place.
So, broadly, as you and your team are thinking about what to put into the next version of CUDA, what sort of characteristics are you thinking about to pick the design features that are going to move CUDA forward?
I'm really focused right now on the heterogeneity problem. That is, I've got a system with a processor which is good at one thing, serial processing, and a processor which is good at another thing, parallel processing. They're two separate processors, and they live with their own separate execution spaces and memory and hardware and all that kind of thing. Bridging that gap, to make it easier for the programmer to reason about what he wants to do, and to express what he wants without having to fight the system: that's really where I think the biggest advances need to be made. So I'm spending a lot of time thinking about memory models, for example. Making life easier for people who might not always know what memory they need to move, or might find it inconvenient to figure out how to move data while computing on something else: all of those types of things that you have to think about in a heterogeneous system. Anything I can do there, and I'm really not sure what that is yet, I'm still working on it, but whatever we can do to make that easier will make the system easier to program. And I think inevitably in the future there is a place where you've got specialist processors working on the tasks for which they are ideally suited, because that gives you the best performance for the power that you've got, the best performance for the silicon that you've put in there, and it probably will solve your problem faster. You know, if you've got a massively parallel processor, do your parallel work on it. But it now means you've got this space where the programmer is no longer thinking about one type of program; he has to think about two or three or four, or however many disparate things he's got. And, you know, I've not yet seen a great solution to this, but we're working towards trying to find one.
And so let me conclude with another question: what advice do you have for students who are taking this course and learning parallel programming, maybe for the first time? I think parallel programming requires you to think about your program in a different way. I spend a lot of my time talking to people who are trying to use not just GPUs but clustered computer systems and multi-core systems; a lot of my job is finding out the problems that people are facing. And a lot of those problems boil down not to MPI is hard, or clusters are hard, or networking is hard; they boil down to parallel programming is hard. It's hard because a lot of people are thinking: I know how to write a serial program, and I assume parallel programming must be the same. But often you need to think about your problem in a different way. You need to think about how to break your problem up into independent pieces, because the way you get the parallel problem solved is to have lots of independent pieces which only come together either in small subsets or infrequently, because [INAUDIBLE] always kill you in the end. So you've got to find a way around that, and if you start thinking of your problem in parallel terms instead of in serial terms, you will end up with a much better solution to your problem. There's an analogy with trying to solve a problem in C++, though maybe this doesn't actually apply to the [LAUGH] [INAUDIBLE], I'm not sure. If you're trying to solve your problem in C++, don't start by thinking about it in C, because although you can express C in C++, your solution in C++ will have a very different design from how you would have solved it in C. The same goes for parallel versus serial programming. So this is not just CUDA advice; this is parallel programming advice in general. Yeah, yeah.
Sit down and think about your problem in terms of: where can I extract the independent work? Rather than: how do I do the work, and how do I multiply it? Stephen, thanks so much for coming in today. We appreciate your time. It's a pleasure.