Hi, my name is Steven Jones and I'm one of the senior engineers in the CUDA group at NVIDIA. My job is to work on the CUDA programming model. I have to figure out what we should and shouldn't add to it, how it should look once we actually do decide to add something, and of course get the actual engineering work done under the covers to support the new stuff that we add. One of the things I've worked on most recently is what I'm here to tell you about right now. It's something we call Dynamic Parallelism. In essence, it's really simple. Everything you've learned so far about running programs on the GPU has followed a sort of co-processor model, where the GPU is fully under the control of the CPU. If you think about it, the CPU can create threads for itself, and it can launch work on the GPU, like this. Or it can synchronize to wait for the GPU work to finish. With Dynamic Parallelism, the GPU can now launch work for itself as well. It sounds pretty simple, but this extra little arrow allows all sorts of new types of problems to be solved on the GPU, and I'm going to tell you about some of those.
So you're used to the idea of bulk parallelism. This is where you break down a problem, an array like this one for example, into a series of threads, each operating independently and simultaneously on the data. So imagine I've got a series of numbers and I want to scale everything by a factor of a quarter. Each thread would operate on one of these numbers independently, and I would end up with my result. This is bulk parallelism, because everything is operating at the same time without any dependencies.
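As a minimal sketch of that scale-by-a-quarter example (the kernel name and launch configuration here are my own, not from the lecture), every thread touches exactly one element, with no dependencies between threads:

```cuda
// Hypothetical bulk-parallel kernel: scale every element by 0.25.
// Each thread handles one array element independently.
__global__ void scaleByQuarter(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard threads past the end of the array
        data[i] *= 0.25f;
}

// Launched from the host with enough threads to cover the array:
//   scaleByQuarter<<<(n + 255) / 256, 256>>>(d_data, n);
```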
Here's a list of some algorithms, and you need to identify which of these is a bulk parallel algorithm. Finding the largest value in a set is if I have a collection of numbers and I want to find which one is biggest. Summing elements of an array is if I have an array of values and I add them all together to get a final total. Adding two strings together is if I have a string, for example "hello everybody", and another string, "how are you today?", and I combine them to make the single string "hello everybody, how are you today?". So identify in these boxes which of these algorithms would be a bulk parallel algorithm.
Okay, so, finding the largest value in a set is something which uses what's called a reduction. It's a parallel algorithm, and it is in fact a bulk parallel algorithm, so that is a yes. In this case, you're comparing repeatedly between threads to find who has the greatest element, and reducing that down to a single element, which is the largest of them all. This is actually very similar to summing the elements of an array. That's called a parallel reduction, and it's something you've probably seen before; it is also a bulk parallel algorithm that works along the same lines. Adding two strings together, however, is not a bulk parallel algorithm, because you have to work your way to the end of the first string and then add, character by character, the elements of the second string. This is a serial operation, and so it would not be something you could do with a large number of threads together. So it is not a bulk parallel algorithm.
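To make the reduction idea concrete, here's one common way to sketch a block-level parallel max reduction (this is an illustrative pattern, not code from the lecture; it assumes a block size of 256, a power of two):

```cuda
#include <cfloat>   // for FLT_MAX

// Each step halves the number of active threads, comparing pairs of
// elements until a single maximum per block remains.
__global__ void blockMax(const float *in, float *out, int n)
{
    __shared__ float s[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < n) ? in[i] : -FLT_MAX;   // pad with a minimum
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            s[threadIdx.x] = fmaxf(s[threadIdx.x], s[threadIdx.x + stride]);
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = s[0];    // one partial maximum per block
}
```

The same tree-shaped pattern with `+` in place of `fmaxf` gives the array-sum reduction.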
So with Dynamic Parallelism, programs can use other parallel patterns as well, and that's where it gets really interesting. There's nested parallelism, where I can call a parallel function from inside my parallel program. That's taking a parallel program and having it create other parallel programs. So if I have a sequence of kernels A, B, and C inside a CUDA stream, and kernel B wants to do its own sequence of steps X, Y and Z, I can just launch that work inline from B, exactly where I need it. Without nested parallelism I might have to work X, Y, and Z into B's code and make a huge program, or do some gymnastics with the CPU to launch the work. For example, I would split B in two, come out to run X, Y, and Z, and then come back for the second half of B. That's all very complicated to write and very complicated to manage. It's much, much simpler if I can simply launch my work directly out of B. So here's an example. Suppose you want to adjust the volume of an audio stream. We've got an incoming sound wave, this green wave right here. A lot of sound processing requires us to have a certain maximum volume, which means constraining the wave to fit within these two dashed lines right here. We could write a parallel program to deal with this by creating a kernel to process this wave, which would then perform a sequence of parallel operations to rescale the waveform. So I'd feed this wave into a kernel, and the kernel would break it down into a series of blocks like a normal CUDA kernel would do. The first step would be to find the maximum peak value in parallel, say 1.8 volts. Having found the maximum peak voltage, my kernel would then launch a second kernel, which would rescale everything by the 1.8 volts to end up with normalized one volt peak-to-peak audio. We're combining several bulk parallel algorithms into a single program operating on the whole wave at once.
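A sketch of what that nested structure might look like in code (the `findPeak` kernel and all names here are my own assumptions, not the lecture's code; the parent is assumed to be launched as `normalize<<<1, 1>>>` so only one thread issues the child launches):

```cuda
// Child kernel: divide every sample by the peak, bringing it to 1.0.
__global__ void rescale(float *wave, int n, const float *peak)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        wave[i] /= *peak;
}

// Parent kernel, launched from the host as normalize<<<1, 1>>>(...).
// findPeak is an assumed max-reduction kernel writing its result to *peak.
__global__ void normalize(float *wave, int n, float *peak)
{
    // Both children go into the same device-side stream, so they run
    // in order: the rescale sees the peak that findPeak wrote.
    findPeak<<<(n + 255) / 256, 256>>>(wave, n, peak);   // step 1
    rescale <<<(n + 255) / 256, 256>>>(wave, n, peak);   // step 2
}
```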
So now I can create a complex composition of parallel work, nested with more parallelism: taking in my audio stream, putting it through a sequence of operations, and coming out with a re-normalized stream like this. It opens the door to something called task parallelism. Now, task parallelism is where you run multiple independent tasks, more or less like separate programs, each working on its own data. You've probably noticed that the nested problem that we looked at before, the volume normalizer, could be launched from the CPU. But the GPU is really good at running lots of things at once. So we can take our volume normalization algorithm and run it on dozens of audio streams all at the same time. We've got one program with multiple instances handling multiple audio streams, all at the same time on a single GPU, without any complicated CPU-to-GPU communication. This is task parallelism, and it's one way to get a lot of power out of the GPU all at once.
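One way this might look from the host, assuming a `normalize(float *wave, int n, float *peak)` kernel along the lines the lecture describes (the wrapper function and names are mine): each independent task goes into its own CUDA stream, so all the normalizers run concurrently.

```cuda
// Hypothetical sketch: run the normalizer on many independent audio
// streams at once -- task parallelism. One CUDA stream per task.
void normalizeAll(float **waves, float **peaks, int n, int numTasks)
{
    for (int t = 0; t < numTasks; t++) {
        cudaStream_t s;
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
        normalize<<<1, 1, 0, s>>>(waves[t], n, peaks[t]);
        cudaStreamDestroy(s);    // safe: destruction is deferred until
                                 // the work in the stream completes
    }
    cudaDeviceSynchronize();     // wait for every task to finish
}
```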
So, the final new parallel pattern is something called recursive parallelism, and this is really interesting because it lets you do stuff you literally could not easily do before. A recursive algorithm would be, for example, a divide-and-conquer kind of algorithm. A classic example of this is QuickSort, which I think you covered in an earlier lecture. It's much, much simpler and faster when you use recursion for this. Recursion is where you subdivide a problem and, using the same operating kernel, apply it repeatedly to smaller and smaller subsets of data until you've solved your problem. We'll go into more detail on recursive algorithms later on. But first, let's have a look at how you program dynamic parallelism.
Okay, so in this quiz, I'm asking you to identify which of the four types of parallelism we've looked at is represented by each of the algorithms that I've written down here. There's matrix multiplication, which is multiplying two matrices together. Finding faces in a photograph album, where I might have hundreds of photographs and I'm trying to find faces in all of them. Binary search, which is finding a number in a list of numbers. And calling a parallel library from inside of a kernel.
So matrix multiplication is one of the classic examples of bulk parallelism. Different threads can take different elements of the matrices, multiply them all together and combine them. Finding faces in photographs is task parallelism, because I am performing a similar operation on a large number of different input data sets. Each photograph effectively amounts to a task, and I'm doing all of these tasks in parallel. Binary search is a recursive algorithm: I repeatedly look through increasingly small subsets of my data to find the value that I'm looking for. And parallel library calls from inside of a kernel amount to creating new parallel work inside my existing parallel work, so that is nested parallelism.
So let's go over how you actually program this stuff. From the programming perspective it's pretty simple. You just use all of the CUDA you've learned so far, except inside a kernel. Kernel launches, streams, events, synchronization, all that kind of stuff works as normal from inside a kernel. So your CUDA in a GPU program looks pretty much exactly like your CUDA in a CPU program. What I've got here is a typical Hello World in CUDA. In this version of Hello World, I'm launching a kernel to print "Hello" from the GPU, and then I'm following it up with a print of "World" from the CPU. Note how I need the synchronization here after the launch to ensure that the "Hello" happens first, that it's flushed out to the screen, and then the "World" will print. And here's the equivalent code for Dynamic Parallelism. You can see it looks pretty much exactly the same. I have moved the code that was on the CPU directly to the GPU, and the same rules apply. I'm calling "Hello" to be printed from another kernel, I'm synchronizing so that the output completes, and then I'm printing "World" from the GPU kernel. I would end up launching this trivially, as a single kernel launch of the hello-world kernel from the CPU, to make all of this work. So as you can see, the basic idea is that all the CUDA that you already know how to use from the CPU, you just use directly from the GPU in exactly the same way you've learned.
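The slide's code isn't in the transcript, so here's a sketch reconstructed from the description above (names are my own; note that device-side `cudaDeviceSynchronize` matches the lecture-era API but is deprecated in later CUDA versions):

```cuda
#include <cstdio>

__global__ void hello() { printf("Hello "); }

// CPU-launched version: the host does the synchronization and the
// second print.
//
//   int main() {
//       hello<<<1, 1>>>();
//       cudaDeviceSynchronize();   // make sure "Hello " prints first
//       printf("World\n");
//   }

// Dynamic Parallelism version: the same logic, moved onto the GPU.
__global__ void helloWorld()
{
    hello<<<1, 1>>>();           // child kernel launch, from the GPU
    cudaDeviceSynchronize();     // wait for the child's output
    printf("World\n");
}

int main()
{
    helloWorld<<<1, 1>>>();      // one trivial launch from the CPU
    cudaDeviceSynchronize();
    return 0;
}
```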
The rules of the programming model are just as simple. They are based on something called composability. Let me tell you a little bit about that. If you recall how CUDA works, kernels launched into a given stream are dependent on each other. That means that in this case, for example, B cannot run before A has finished, and C cannot run before B has finished. This is how CUDA streams work, which I think you probably learned about in a previous lecture. So if B creates its own work X, Y, Z, we call this the child work of B, and we say B is the parent of X, Y, and Z. Now, composability says that whatever B does is entirely private to B. That means C has no idea, and quite honestly doesn't care, what B is doing. B is not considered to have finished until all of its children, X, Y and Z in this case, have also finished. So A, B and C still execute in sequence as expected, and B follows the rules of composability such that whatever B does has no effect on C's ability to execute. In effect, B actually looks like a single kernel, even though internally it is doing all of these other steps. And it doesn't just stop there: child kernels can start their own children, P, Q and R in this case, for example. So you've got grandchildren of B, and you can go on to generations of great-great-great-grandchildren if you like, until you run out of memory. The point being that it all composes and stacks back into looking like a single kernel B, and that's composability.
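The composability picture above can be sketched as follows (kernel bodies and names are mine, purely illustrative):

```cuda
// Children of B; their bodies don't matter for the composability rule.
__global__ void X() { /* ... */ }
__global__ void Y() { /* ... */ }
__global__ void Z() { /* ... */ }

__global__ void B()
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {  // one thread launches
        X<<<1, 128>>>();
        Y<<<1, 128>>>();
        Z<<<1, 128>>>();
    }
    // B is not considered finished until X, Y and Z also finish,
    // so from the stream's point of view B is still a single kernel.
}

// Host side: A, B, C launched into one stream still run strictly in
// order; C starts only after B *and* all of B's children complete.
//   A<<<grid, block, 0, s>>>();
//   B<<<grid, block, 0, s>>>();
//   C<<<grid, block, 0, s>>>();
```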
So, so far so good. It looks like CUDA, and it acts like you would expect, but there are some things you need to watch out for. The first of these is that every thread executes the same program. This is standard CUDA: when you write your kernel code, every thread executes all the lines of that kernel code. But that means that if you have a kernel launch inside your program, every thread will make that launch. And if you don't want multiple parallel launches, and you only want one, you're going to have to explicitly select just one thread. Otherwise, you get as many launches as you have threads.
Okay, so here's a quiz. Like I said, if you have every thread in a CUDA program executing the same program, you can end up with a lot of parallel launches unless you do something about it. So, what do we need to add to this program in order to make sure only the first thread launches this kernel? I've given you a hint at the bottom, see if you can figure it out.
So the first thread is thread ID 0 in CUDA. Inside the if statement, you want to check whether the thread ID is 0, which means comparing threadIdx.x to 0. In that case, only thread 0 will launch the kernel, and you get just one kernel launch, no matter how many threads are in the block.
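The quiz answer as code (the `child` kernel and its launch configuration are assumed for illustration):

```cuda
// Guard the child launch so only thread 0 issues it: one launch,
// regardless of how many threads the parent block has.
__global__ void parent(float *data, int n)
{
    if (threadIdx.x == 0)                          // first thread only
        child<<<(n + 255) / 256, 256>>>(data, n);  // child is assumed
}
```

If the parent grid has more than one block, you would typically also check `blockIdx.x == 0`, since each block would otherwise make its own launch.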
So the second and most important thing to understand about the programming model is that each CUDA block is considered to be its own independent program. That means CUDA objects are private to a block. Streams and events that you create inside a block can be used by any thread in that block, but they can't be used by, or exchanged with, any other block, including your children. So that means you can synchronize on work launched within your block, but you cannot synchronize on work launched by another block. You can't pass streams or events to your children and have them synchronize on them, or you risk a deadlock. This is all part of composability, which we looked at a minute ago. And finally, data that's private to a block, like shared memory for example, is private to that block. That means you cannot pass shared memory to your child kernels. So if I have per-block data that my block has accumulated in shared memory, which is a pretty common thing to do in CUDA, I would have to write it out to global memory in order for my child kernel to see it. Remember, your child kernel is another grid with its own set of blocks. And the first and most important rule is that all blocks are considered independent programs, so there's no passing around of private data.
Okay. So, now, time for a quiz. I've written a program here, where I'm launching a kernel, just with "launch" down here, and I'm passing it parameters x, y and z. Select x, y or z to indicate which of the parameters is not allowed to be passed to the child kernel.
Okay, so in this case, the answer is that y is not allowed to be passed to the child kernel, because y is in shared memory, and shared memory is private to a block. That means the launched kernel, consisting of different blocks, will not be permitted access to that shared memory. Note that in this case, when I created z, I allocated it from the heap. A heap allocation goes in global memory and is therefore permitted to be passed in. So x and z are both in global memory: x is declared at global scope, and z is heap-allocated. y is in shared memory, and would need to be copied out before it's passed in.
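Sketched from the quiz's description (the `child` kernel and the exact declarations are my assumptions), the rule looks like this in code:

```cuda
__device__ float x[256];     // global scope: fine to pass to a child

__global__ void parent()
{
    __shared__ float y[256];           // private to this block
    y[threadIdx.x] = threadIdx.x;      // block accumulates data in y
    __syncthreads();

    if (threadIdx.x == 0) {
        // z is heap-allocated, so it lives in global memory: fine to pass.
        float *z = (float *)malloc(256 * sizeof(float));
        for (int i = 0; i < 256; i++)
            z[i] = y[i];               // copy the shared data out first

        // child<<<1, 256>>>(x, y);    // NOT allowed: y is shared memory
        child<<<1, 256>>>(x, z);       // both arguments in global memory
    }
}
```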
Alright. So, let's pull all this together and look at an example of something that was really hard to do before and is now shockingly easy with dynamic parallelism. If you remember from one of your earlier lectures, QuickSort is what's called a divide-and-conquer algorithm. It works by partitioning an array of data into two pieces, partitioning based on what's less than or greater than a pivot value. You then call quicksort on each sub-partition, which is why it's a recursive algorithm. These sub-partitions then get re-partitioned over and over again, recursively, until you end up with single values, and you're done. Two important things to notice: first, that the sub-partitions aren't usually the same size, and second, that some branches go deeper than others. This means that the number of elements to be sorted is different each time, and so the decision on whether I need to sub-sort can only be made after I've done the partitioning. That's what makes it hard to implement on a GPU, because after each step, I need to communicate back to the host CPU all the information about what I want to launch next. That's a lot of communication at each level of the algorithm, and it gets pretty complicated to manage. But even worse, I end up launching each stage of the sort as a wave of kernels. As these things partition down further and further, the number of kernels that I have to launch gets larger and larger, and the only way to manage them is to launch them wave after wave. What that means is that I end up waiting as long as the longest operation in a wave, even if my shorter ones have already finished, before I can move on to the next stage.
Okay. So, here's where dynamic parallelism steps in. It solves both of these problems really neatly. What I've got here is an example of how quicksort would sort a series of numbers. At the top level, a single kernel partitions the numbers into two groups, then launches two quicksort kernels, one on each group, and so on down. It already has all the information it needs to decide whether to launch, and how many threads to use, because it's just done the partitioning. So it doesn't have to communicate this back to the CPU. That's the first problem taken care of. It solves the second problem because, as these kernels run, they each launch as soon as they finish partitioning. That means my sort progresses asynchronously. For example, this one on the left will launch its children while the one on the right is still working, because there's more work to do on the right-hand side here. Each sort will be running independently of any others. I'm not waiting around for the slowest sort anymore, and my GPU is kept busy. So here's what the code looks like. I'm not going to get into detail about the partitioning function, because we're looking at dynamic launches here and you already covered partitioning in a previous lecture. The important thing to notice is that there is no host-side management of data or launches needed. Everything happens inside the code itself. Here are my launches, inside my kernel, of my next quicksort kernels. So the code's nice and simple, and the kernel launches its two children into separate streams. Remember that CUDA streams run simultaneously, which means my two sub-sorts will execute in parallel. Without these streams, everything would run sequentially, because both of these launches would end up in the null stream, and that would defeat the purpose of the parallel sort, to have my program running sequentially. So let's have a quick look back at the sort diagram, to show you what I mean.
If each launch runs sequentially, then I would do the left side first, then its children, left side first here, then the right-hand side. Then I would come back and do the right-hand side next, and so on. And so what we would see is that each step gets run sequentially instead of in parallel, which defeats the purpose of the parallel sort. So by launching the sorts into different streams, we run both the left and right sorts in parallel. I would run both of these together, and then each of their children, and so on, and I'd get the parallel performance I'm looking for. I've combined dynamic parallelism with CUDA streams to get the parallel execution that I just would not be able to get any other way.
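The quicksort kernel described above might be sketched like this (the `partition` device function is assumed from the earlier lecture, and all names and launch configurations are my own, not the slide's code):

```cuda
// Dynamic-parallelism quicksort sketch. partition() is assumed to
// rearrange data[left..right] around a pivot and return the pivot's
// final index.
__global__ void quicksort(int *data, int left, int right)
{
    if (left >= right)
        return;                          // nothing left to sort

    int p = partition(data, left, right);

    // Launch the two sub-sorts into separate non-blocking streams so
    // they run in parallel instead of serializing in the null stream.
    cudaStream_t s1, s2;
    cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);

    quicksort<<<1, 1, 0, s1>>>(data, left, p - 1);
    quicksort<<<1, 1, 0, s2>>>(data, p + 1, right);

    cudaStreamDestroy(s1);   // safe: the child work keeps running
    cudaStreamDestroy(s2);
}
```

Notice there is no host-side management at all: each kernel decides for itself whether and what to launch next, immediately after partitioning.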
Okay, final quiz. What I've got here is a list of potential reasons why the dynamic parallel version of QuickSort ends up more efficient than the CPU-controlled one. We've got more efficient partitioning of the data inside the sort. We've got launching on the fly from the GPU, rather than returning to the CPU to do the launch. Simpler code, because the code is a lot shorter and easier to write. And greater GPU utilization. Remember that GPU utilization means making sure that as much of the GPU as possible is busy doing useful work at any given time. So, as long as you've got enough threads going at one time to keep the GPU busy, you have high utilization. Check all the boxes that apply.
Okay, so, more efficient partitioning: that is actually not true. We haven't touched the partition function; it doesn't have anything to do with the dynamic launches that give me the recursive parallelism. Launching on the fly, however, yes. That does substantially contribute, because I don't have to keep returning to the CPU to perform my launches. That means I'm communicating less data, and it means that my launches occur immediately when I need them, instead of waiting around until that particular wave of launches finishes. Simpler code, while convenient, and while I can probably maintain it more easily, is not a reason why it actually runs any faster. And finally, greater GPU utilization is probably the cause of the greatest speedup. By launching on the fly, I'm making sure my GPU is always busy. When one partial sort finishes, it creates two more immediately, keeping my GPU fully stacked up and busy with work. There's more work for my GPU at any one time, and my sort ends up faster end to end. In fact, when I've written this program in dynamic parallel form and then in host-launched form, I see pretty much exactly a factor of two speedup between the two.
Alright, and that is it. Thank you, everybody, for taking this class. Following on from this, we have an interview between myself and John Owens, where I discuss some more about dynamic parallelism, how we put it all together, and where we're going from there.