Okay. Welcome to unit 2. It's good to see you again. So in the last unit you learned about the fundamentals of the GPU programming model and the basics of writing a simple program using CUDA. In this unit we're going to build off of that. We'll learn about important parallel communication patterns like scatter, and gather, and stencil. And we'll dive a little deeper into the GPU hardware and learn about things like global memory and shared memory, and we'll put these together to learn how to write efficient GPU programs.
Let's recap what we've learned so far. Parallel computing is all about many threads solving a problem by working together. And the key is this working together. Any book on business practices or teamwork will tell you that working together is really all about communication. In CUDA, this communication takes place through memory. For example, threads may need to read from the same input location. Threads may need to write to the same output location. Sometimes threads may need to exchange partial results.
So let's talk about the different kinds of communication, the different patterns of communication you'll see in parallel computing. And as you'll see, this is really all about how to map tasks and memory together. How to map tasks, which are threads in CUDA, and the memory that they're communicating through. So the communication pattern you've already seen is called map. Now with map, you've got many data elements, such as elements of an array, or entries in a matrix, or pixels in an image. And you're going to do the same function, or computational task, on each piece of data. This means each task is going to read from and write to a specific place in memory. There's a one-to-one correspondence between input and output. So, map is very efficient on GPUs. And it's easily expressed in an efficient way in CUDA by simply having one thread do each task. But this isn't a very flexible framework. There are many things you can't do with a simple map operation. Now suppose that you want each thread to compute and store the average across a range of data elements. Say maybe we want to average each set of 3 elements together. In this case each thread is going to read the values from 3 locations in memory and write the result into a single place, and so on. Or suppose you want to blur an image by setting each pixel to the average of its neighboring pixels. So that this pixel would average together the values of all five of these pixels. And then let's take this pixel next to it, which would average together the values of all of these pixels, and so on. We'll do exactly this kind of blurring operation in the homework assignment that's coming up at the end of this lecture. This operation is called a gather, because each calculation gathers input data elements together from different places to compute an output result.
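To make the gather concrete, here's a minimal sketch of that three-element averaging as a CUDA kernel. The names and layout are mine, and I'm skipping the boundary elements for brevity:

```cuda
// Hypothetical gather kernel: each thread averages 3 adjacent input
// elements into one output element. Boundary handling is omitted.
__global__ void average3(float *out, const float *in, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= 1 && i < n - 1) {
        // gather: read 3 input locations, write 1 output location
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
    }
}
```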
And the correct answer is that it's a scatter operation. Each thread is computing where to write its result.
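As an illustration of the pattern (this is my own sketch, not the quiz code), a scatter kernel might look like this, with each thread computing its own destination:

```cuda
// Hypothetical scatter kernel: each thread computes where to write.
// Nothing prevents two threads from choosing the same destination,
// which is exactly the hazard that scatter introduces.
__global__ void scatter(float *out, const float *in, const int *dest)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    out[dest[i]] = in[i];
}
```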
So the image blurring example that we've been using actually illustrates another important kind of communication pattern called stencil. Stencil codes update each element in an array using neighboring array elements in a fixed pattern called the stencil. This is the stencil pattern we saw before. It's technically known as a 2D von Neumann stencil. Let me reiterate how this worked, and this time use color coding a little differently to show you what's going on. So here, I've color coded the threads to show you which one is going to be working on which output element. So, I'll choose the blue one to be writing into this value. Here's where the red one's going to write its output value. Here's where the green thread will write its output value. And if you look at what's going to happen, each of these threads is going to read from several locations in memory surrounding the corresponding input value, and those locations are defined by the stencil. So, the blue thread will do a gather from these locations, and then the red thread will do a gather from the overlapping neighborhood. And then the green thread will do a gather from this neighborhood, and so on. Eventually, there'll be some other thread who's responsible for, say, writing to this value, and that thread is going to go and access these values. So, something you're going to notice right away is that there is a lot of data reuse going on. Many threads are accessing and computing from the same data. And exploiting that data reuse is something we're going to use later on when you're working on your homework assignment. We're going to try to exploit that reuse to speed up your homework assignment. Now, there are other common stencil patterns. For example, you might read from all of the neighboring elements, including the diagonal elements, and that would be called a 2D Moore pattern. And there are also 3D analogs of these, so for my next trick, I'm going to attempt to draw in three dimensions. Hopefully, you can see that from my drawing. So speaking of data reuse, here's a quick quiz. Can you figure out how many times a given input element in the array will be read when applying each of these stencils?
The answer, of course, is simply the number of elements in the stencil. Right, so every element in that array is going to be read five times by the 2D von Neumann stencil, because there are five entries in the neighborhood. So, all elements will be read five times by the 2D von Neumann stencil, nine times by the 2D Moore stencil, and seven times by the 3D von Neumann stencil.
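Here's a minimal sketch of what a 2D von Neumann stencil kernel might look like, assuming a row-major w-by-h array and handling interior points only; the names are mine:

```cuda
// Hypothetical 2D von Neumann stencil: each output element averages
// itself and its 4 axis-aligned neighbors. Interior points only.
__global__ void stencil2D(float *out, const float *in, int w, int h)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (x > 0 && x < w - 1 && y > 0 && y < h - 1) {
        out[y * w + x] = (in[y * w + x] +
                          in[y * w + (x - 1)] + in[y * w + (x + 1)] +
                          in[(y - 1) * w + x] + in[(y + 1) * w + x]) / 5.0f;
    }
}
```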
Another parallel communication pattern worth mentioning is called transpose. For example, you might have a 2D array, such as an image, laid out in row-major order. This means that the elements of the array, or the pixels of the image, are laid out one row at a time. And I've color-coded the rows here just to show you more clearly what I'm doing. But you might want to do some processing on the columns of the same image, and so you'd want to lay it out in column-major order, like this. This means you need to do an operation to reorder the elements. As you can see, I've drawn this as a scatter operation. So, each thread is reading from an adjacent element in the array, but is writing to someplace scattered in memory, according to the stride of this row-to-column transpose. I could also have expressed this as a gather operation, like so.
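Expressed as a gather, a transpose kernel might look roughly like this sketch (my own names, assuming a row-major w-by-h input):

```cuda
// Hypothetical transpose written as a gather: each thread reads the
// strided location and writes the adjacent one.
__global__ void transpose(float *out, const float *in, int w, int h)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;  // column in input
    int y = threadIdx.y + blockIdx.y * blockDim.y;  // row in input
    if (x < w && y < h) {
        out[x * h + y] = in[y * w + x];  // row-major in, column-major out
    }
}
```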
So you can see where a transpose might come up when you're doing array operations, matrix operations, image operations. But the concept is generally applicable to all kinds of data structures. Let me give an example. So, here's some sort of structure you might have, right? It's a perfectly reasonable structure foo. It's got a float field and an integer field, and say that you have a thousand of these. Well, what does that look like in memory? You're going to have the floats and the integers interleaved throughout memory. And as we will talk about later, if you're going to do a lot of processing on the floats, it can be more efficient to access all of the floats contiguously. So you're going to want some operation that lets you take what's called an array-of-structures representation and turn it into a structure of arrays. And that operation is, again, a transpose. By the way, these two terms are so common that array of structures is often abbreviated AoS, and structure of arrays is often abbreviated SoA. You'll see these terms come up frequently in parallel computing. So, to summarize, the transpose operation is where tasks reorder data elements in memory.
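In code, the two layouts being contrasted look roughly like this, using the fields from the example:

```cuda
// Array of structures (AoS): floats and ints interleaved in memory.
struct foo {
    float f;
    int   i;
};
struct foo aos[1000];

// Structure of arrays (SoA): all floats contiguous, all ints contiguous.
struct foo_soa {
    float f[1000];
    int   i[1000];
};
struct foo_soa soa;
```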
Okay, let's have a quiz on communication patterns. I'm going to give you a bunch of code snippets and I'm going to ask you to label them according to the parallel communication pattern that they embody. For each code snippet, you should indicate whether it is a map operation, a gather operation, a scatter operation, a stencil operation, or a transpose operation. Here's the code, and this is really sort of pseudocode. I'm not explaining where these variables came from, or showing you how many threads I'm launching, or anything like that. But this is kernel code, and as you can see, I have two variables, out and in. These are arrays of floating-point numbers. And just for brevity, I've created two variables, i and j, to represent threadIdx.x and threadIdx.y. Just to have something to do, I'm going to multiply a bunch of numbers by pi, so I define pi here, and here are our code snippets. First: out[i] = pi * in[i]. Second: out[i + j*128] = in[j + i*128]. And then you see these next two I have guarded with an if statement, so only the odd threads execute them: out[i-1] += pi * in[i], and out[i+1] += pi * in[i]. Finally: out[i] = (in[i] + in[i-1] + in[i+1]) * pi / 3. So, for each of these statements, each of these little code snippets, indicate whether it's a map, a gather, a scatter, a stencil, or a transpose.
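Putting the spoken snippets together, here's roughly what the on-screen kernel looks like; treat the details as my reconstruction:

```cuda
// Rough reconstruction of the quiz kernel from the narration.
__global__ void quiz(float *out, float *in)
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    const float pi = 3.1415f;

    out[i] = pi * in[i];                     // snippet 1

    out[i + j*128] = in[j + i*128];          // snippet 2

    if (i % 2) {                             // only odd threads run these
        out[i-1] += pi * in[i];              // snippet 3
        out[i+1] += pi * in[i];

        out[i] = (in[i] + in[i-1] + in[i+1]) * pi / 3.0f;  // snippet 4
    }
}
```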
So, this first one is pretty easy, right? There is a one-to-one correspondence between the output and the input, so that's clearly a map operation. And this next one is also a one-to-one operation. One value gets written in the output array corresponding to one value read from the input array, and you can see that we're writing into an array which is represented in i-major order here in the output, and in j-major order in the input. So this is a transpose operation. Now this next code, as I said, I put a guard around. Only the odd-numbered threads are going to execute this. So that rules out a map; it's not one-to-one. And that also rules out a transpose operation, which is also one-to-one. And you really couldn't call it a stencil operation either, because a stencil operation should generate a result for every element in the output array, and this doesn't do that. Now, if you look at the first of these, the thread is taking the input at a given location, multiplying it by pi, and placing that into a couple of different places in the output array. In fact, it's incrementing a couple of different places in the output array. So, this would be a scatter operation; the thread is computing for itself where it needs to write its result. And this final line would be a gather. You can see that every thread is writing a single location in the output array, and it's reading from multiple places in the input array, locations that it computes. So this would be a gather. And again, this looks very much like a stencil operation, since it's reading from a local neighborhood, doing some averaging, and writing the result, but I wouldn't call it a stencil because it's not writing into every location, because of this guard here. So that's why I refer to this as a gather rather than a stencil.
Conceptually, map and transpose are one-to-one; each input maps to a single unique output. You can think of a gather operation as many-to-one: many possible inputs can be chosen to compute an output. In this terminology, scatter is one-to-many, so each thread chooses from many possible output destinations. Stencil can be seen as a specialized gather that pulls input from a select few inputs in a given neighborhood of the output, so you might term that several-to-one. In the next lecture John will tell you about two more really fundamental patterns. Reduce could be termed all-to-one; for example, if you're adding up all the numbers in an array. Finally, scan and sort can be considered all-to-all, because all of the input can affect the destination of the resulting output. You'll learn more about these in the next lectures.
Okay, so now we've talked about parallel communication patterns. Check that off. Now that we've discussed the general forms of communication that we'll see among the cooperating threads in parallel programs, let's talk about how those threads can most efficiently access memory in concert. An important subtopic of this is going to be how to exploit the data reuse that we saw during all those gather and stencil codes; there were a lot of threads often accessing the same data at the same time, and how can we exploit that to reduce the total amount of time spent on memory? The next big topic is how threads can communicate partial results by sharing memory. A real problem then becomes, once threads are sharing memory, and reading and writing from the same memory, how do you keep them from stomping on each other's memory accesses? How can you do this safely? So, in order to better understand all of these topics, we need to dive a bit deeper and learn more about the underlying GPU hardware.
In this lecture we're going to talk about GPU hardware and how that affects the programming model. So let's summarize what we've learned so far about the GPU programming model and start adding to it. In our last lecture, we learned about the programmer's view of the GPU. And I'm going to summarize that and then start adding some more details to it. So, your job as the programmer is to divide up your program into smaller computations called kernels. So a kernel, in our case, is a C or C++ function. And the key idea of a kernel is that it's going to be performed by many threads. So a thread is a path of execution, a thread of execution through the program code. And I've drawn them as wiggly lines here because they're not necessarily all going to take the same path through that code. There might be branches, like if statements or switch statements. There might be, you know, for loops, while loops. So you don't know in advance what path you're going to take. This is not straight-line code. And in fact, different threads might take different paths. So there might be a thread which takes a completely different path. The key thing about threads is that they come in thread blocks. A thread block is a group of threads that cooperate to solve a sub-problem. A GPU program launches many threads to run one kernel, like foo, and then they all run to completion and exit. And then the program launches many threads to run the next kernel, like bar. And another thing to notice about the way I drew this is that, you notice, I have a different number of thread blocks with a different number of threads. That's actually something you can pick for each kernel. And we'll talk later about why you would choose different numbers of threads and different numbers of thread blocks, and how you would organize these. So that's the programmer's view of the GPU.
Let's talk about thread blocks and why they're there. Hopefully, you have some questions about how the GPU works, since we've just sort of been telling you about threads and thread blocks without giving you much motivation. So, you might be wondering: why do we divide the problem into blocks? When do blocks get run? And, if you have a whole bunch of thread blocks, in what order do they run? And I've told you that thread blocks are about letting groups of threads cooperate, but I haven't told you how those threads cooperate, or with what limitations. To answer this, we're going to have to dive in to learn a little bit more about the GPU hardware. At a high level, a CUDA GPU is a bunch of these. We call them streaming multiprocessors, or SMs for short. Now, different GPUs have a different number of SMs. So, a small GPU might only have one SM, whereas a really big GPU might have 16 SMs, for example. And an SM, in turn, has many simple processors that can run a bunch of parallel threads. It also has some other things, like some memory, that we'll talk more about in a moment. So, when you've got a CUDA program with a bunch of threads organized into thread blocks, the important thing to understand is that the GPU is responsible for allocating blocks to SMs. Let me say that again because it's really important: the GPU is responsible for allocating the blocks to the SMs. So, as a programmer, all you have to worry about is giving the GPU a big pile of thread blocks, and the GPU will take care of assigning them to run on the hardware SMs. All the SMs run in parallel and independently.
Check all the true statements. A thread block contains many threads. An SM might run more than one thread block. A thread block may run on more than one SM. All the threads in a thread block might cooperate to solve a subproblem. All the threads that run on a given SM may cooperate to solve a subproblem.
So, yes, a thread block contains many threads. And it's also true that an SM might run more than one block. On the other hand, a thread block may not run on more than one SM. By definition, a thread block is run on a single SM. Now, all the threads in a thread block may cooperate to solve a subproblem; that's why we have thread blocks. So, this last one is false, and it's a bit subtle, so let me talk it through. The key is that you might have multiple thread blocks on an SM, and by definition, threads in different thread blocks should not attempt to cooperate with each other. So it's not true that all the threads that run on a given SM may cooperate to solve a subproblem, and that's why this statement is not the same as the one before it.
So as a review, the programmer, or the GPU, is responsible for defining thread blocks in software. And the programmer, or the GPU, is responsible for allocating those thread blocks to hardware streaming multiprocessors, or SMs.
And of course the correct answer is that the programmer is the one writing software, and the programmer's job is to define those thread blocks. And the GPU is running the hardware, and the GPU is completely responsible for allocating those thread blocks to run on hardware SMs.
One more quiz to try this out. If we have a single kernel that's launched on many thread blocks, including X and Y, the programmer can specify: that block X will run at the same time as block Y; that block X will run after block Y; that block X will run on SM Z.
The answer of course is that all of these are false. There are no such guarantees.
So in short, CUDA makes few guarantees about when and where thread blocks will run. And this is actually a huge advantage for CUDA. This is one of the big reasons why GPU programs can go so fast. Why is that? Among the advantages that the GPU gains from this programming model is that the hardware can run things really efficiently, because it has so much flexibility. For example, if one thread block completes quickly, the SM can immediately schedule another thread block without waiting for any others to complete. But the biggest advantage is scalability. Because you've made no guarantees about where the thread blocks will run or how many thread blocks might be running at a time, that means that you can scale all the way down to a GPU running with a single SM, something that you might find in a tablet or cell phone, all the way up to the massive GPUs that are used in supercomputers. Importantly, the scalability also applies to future GPUs. You can be sure that GPUs will get more and more SMs as Moore's Law gives us more and more transistors on a chip. And by writing the code in such a way that it can run on an arbitrary number of SMs, and doesn't depend on a certain number of SMs or a certain number of thread blocks being resident at a time, you can be sure that your CUDA code will scale forward to larger and larger GPUs. So, this scalability applies from cell phones to supercomputers, from current to future GPUs. And that's a really huge advantage. And there are also consequences to this programming model, which are the things we've been talking about. You can make no assumptions about which blocks will run on which SM. And you can't have any explicit communication between blocks. For example, if block X is waiting for block Y to give it some result before it can proceed, but block Y has already run to completion and exited, then you're going to be in a bad place. That, by the way, is an example of something called deadlock in parallel computing. This really means that threads and blocks must complete. You can't simply have a thread that hangs around forever, processing input or doing something, because that thread must complete in order for the block that it's in to complete, so that other threads and blocks can get scheduled onto that SM.
Okay, here's a simple CUDA program that'll illustrate these ideas. So, we're going to launch 16 blocks and each one's going to have a single thread running. And it's going to run a trivial kernel that just prints hello world along with which block the thread is in. And then, you know, in main, we're going to do nothing more than launch that kernel. You need this call to cudaDeviceSynchronize to make sure that these printf's flush, and then we'll print "That's all!". So, before we run this, here's a quick quiz. How many different outputs do you think different runs of this program can produce? Is it 1, 16 possible different outputs, 2 to the 16th possible different outputs, or 21 trillion possible outputs?
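Reconstructed from that description (the exact print strings are my guess), the program looks roughly like this:

```cuda
#include <stdio.h>

#define NUM_BLOCKS 16
#define BLOCK_WIDTH 1

__global__ void hello()
{
    printf("Hello world! I'm a thread in block %d\n", blockIdx.x);
}

int main(int argc, char **argv)
{
    // launch the kernel: 16 blocks of 1 thread each
    hello<<<NUM_BLOCKS, BLOCK_WIDTH>>>();

    // force the printf()s to flush
    cudaDeviceSynchronize();

    printf("That's all!\n");
    return 0;
}
```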
Well, as we'll see, the correct answer is actually 21 trillion, which is 16 factorial. Let's see how that works. So here I'm just going to run this program in my terminal. And as you can see, the blocks complete in some basically random order: block 7 completes, then block 6, then 1, then 0, and so forth. If I run it again I get a different order, and a different order, and a different order, and so forth. You'll find that the blocks are executed in a different order every time; that's where that number 16 factorial comes from.
Since we spent so long talking about what CUDA doesn't guarantee, you may be wondering what CUDA does guarantee. There are two main things. First, all the threads in a block are guaranteed to run on the same SM at the same time. Second, all blocks in a kernel finish before any blocks from the next kernel are launched. This goes back to our programming model diagram where, remember, we had some threads running one kernel, say foo, and then some threads running another kernel, say bar. So what CUDA is promising is that any threads running foo will finish in their entirety before any threads running bar are launched.
Now we're going to talk about the memory model. Let's go back to our diagram and draw the threads, and blocks, and kernels. So, every thread has access to something called local memory, and this is memory that's private to that thread, things like its local variables. So, the thread can read and write from local memory. The threads in a thread block also have access to something called shared memory. So, all the threads in this thread block can read and write to the per-block shared memory. It's important to understand that shared memory is shared among the threads in a block. This is a small amount of memory that sits on the SM directly. And finally, there's global memory: every thread in the entire system, at any time, can read and write to global memory. So, threads in one kernel can read and write from it, and threads in a later kernel can also read and write from it. So, to recap, every thread has access to its own local memory, has access to shared memory that's also accessible to the other threads in its thread block, and to global memory, which is accessible to all of the threads everywhere. Of course, everything we've been talking about is on the GPU, but there's also a CPU. The CPU thread launches the work on the GPU and, of course, the CPU has access to its own memory, which in CUDA we call host memory. Usually, data is copied to and from this host memory into the GPU's fast global memory before launching a kernel to work on that data. We'll talk later about how CUDA threads can access host memory directly.
Let's have a quiz. Go ahead and check all the statements that are true. All threads from a block can access the same variable in that block's shared memory. Threads from two different blocks can access the same variable in global memory. Threads from different blocks have their own copy of local variables in local memory. Threads from the same block have their own copy of local variables in local memory.
So it turns out, these are all true. All threads from a block can access the same variable in that block's shared memory. That's what shared memory is: it's a chunk of memory that stores variables that are accessible from all threads in a given block. And threads from two different blocks can access the same variable in global memory. Well, all threads, from any blocks, at any time, can access a variable or piece of data that's sitting in global memory. So this is also true. Threads from different blocks have their own copy of local variables in local memory. Yes, this is true. Every thread has its own copy of whatever local variables, private to that thread, are sitting in local memory. And threads from the same block have their own copy of local variables in local memory. Right. Just because they're from the same block doesn't mean that they share local variables. What they share is the per-block shared memory; they each have their own local memory. So all four of these are true.
So now we know that threads can access each other's results, shared in global memory. This means they can work together on a computation, but there's a problem. What if a thread tries to read a result before another thread has had a chance to compute or write it? This means we need synchronization. Threads need to synchronize with each other to avoid this situation. This need for synchronization is really one of the most fundamental problems in parallel computing. Now, the simplest form of synchronization is called a barrier. A barrier is a point in the program where all the threads stop and wait. When all the threads have reached the barrier, they can proceed on to the rest of the code. Let's illustrate this. Here are some threads, and they're all proceeding along through the code. I'll draw them in different colors, and I'm also drawing them different lengths, but you get the idea that they're at different places in the code. They're at different points in their execution of the program. The idea is that when they reach the barrier, they're going to stop and wait for all the others to catch up. So in my drawing, the red one reaches the barrier first and stops. In the meantime, the blue one is proceeding along, and the green one is proceeding along, and eventually the blue one arrives at the barrier and stops, and the green one is the last one to arrive at the barrier and stops. And now that all of the threads, say in my example I only have three threads, now that all of the threads have arrived at the barrier, they're all free to go again. And so they'll all proceed again, and we don't actually know which one's going to go first. It might be that the blue one is the first out of the gate, maybe green is next, maybe red is last. So let's look at some code to illustrate this.
Here's an example of the need for barriers. So we've got an array in memory with a bunch of elements, 1, 2, 3, 4, 5, 6, and we want to shift each of these left by one element, so this element goes here, and so on. So here's a little code snippet that should do this. We first initialize the elements of the array to the thread index, and you'll see that I've hard-coded this to 128 elements just to be lazy. So every thread is going to set its corresponding array element to its own index. This should initialize the array to 0, 1, 2, 3, and so on. And then, avoiding stepping off the end of the array with this if statement, every thread will set its corresponding array element, at its index, equal to the value of the array element at the thread's index plus 1. So thread 1 will set array[1] to whatever is in array[2], thread 2 will set array[2] equal to whatever is in array[3], and so forth. So here's a quick quiz. How many barriers does this code need?
So the answer is 3. Why's that? Well, here's the code. Let's take it step by step. First of all, every thread is setting a local variable equal to its own thread index. Then, in this line, we're declaring a shared variable which is going to be shared by all the threads in the thread block, and I just hard-coded in 128 elements here. And now we're actually going to do something. We're going to write: every thread is going to write into a single element of the array, corresponding to its thread index. And the thing that it's going to write into that element of the array happens to be its own thread index. Okay, so this is going to load up that array with 0, 1, 2, 3, and so on. And that's just to put something in the array that we can play with. This operation is a write, so we need a barrier after this write completes, in order to make sure that nobody tries to read from the array until after all of the threads have finished the write operation. So, let's insert a barrier here. And now, all of the threads that pass this if statement are going to do this shift operation, right? They're going to set the array element at their index equal to the array element at their index plus one. But, if you think about it, this is really a read and a write operation. Here was our first write operation; that was number one. And now, we're doing a read operation; that's number two. Okay, we're going to read from the array at index plus 1. And then finally we're going to write to the array at index, and that's operation three. And the tricky thing about this is that all of these reads should complete before we allow any of these writes to finish. So really we need to restructure this code. We'll make this read operation explicit, put it in a local temporary variable. We'll call syncthreads. We'll perform the write operation explicitly. And to be absolutely safe, before anybody can use these written values, we need another syncthreads, to ensure that all these write operations have completed before anybody attempts to do anything with this array value afterwards.
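Here's one way the restructured kernel might look, reconstructed from the narration. Note that I've hoisted the barriers out of the if statement, since __syncthreads() must be reached by every thread in the block:

```cuda
// Reconstructed shift kernel with the three barriers described above.
__global__ void shift()
{
    int idx = threadIdx.x;
    __shared__ int array[128];

    array[idx] = threadIdx.x;      // write #1: fill the array
    __syncthreads();               // barrier 1: all writes complete

    int temp = 0;
    if (idx < 127) {
        temp = array[idx + 1];     // read #2, made explicit
    }
    __syncthreads();               // barrier 2: all reads complete

    if (idx < 127) {
        array[idx] = temp;         // write #3: the shift itself
    }
    __syncthreads();               // barrier 3: safe to use array now
}
```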
So let's recreate our programming model diagram. We've got threads, and we've got thread blocks. And really, what is happening with these barriers is that the syncthreads call is creating a barrier within a thread block. So all the threads within this thread block are going to get to this syncthreads call, stop, and wait until they've all arrived, and then proceed. And these thread blocks are organized into kernels; every kernel has a bunch of thread blocks. We haven't talked about this really, but there's an implicit barrier between kernels. So if I launch two kernels, one after another, then by default kernel A will finish before kernel B is allowed to start. So all of these threads will complete before any of these threads get launched. So when you add in the memory model that we've talked about, where every thread has access to its own local memory, to its block's shared memory, and to the global memory shared by all threads and all kernels in the system, then what you've got is CUDA. At its heart, CUDA is a hierarchy of computation, that's the threads, thread blocks, and kernels, with the corresponding hierarchy of memory spaces, local, shared, and global, and synchronization primitives, syncthreads barriers, and the implicit barrier between two successive kernels.
So let's have a quiz. Are the following code snippets correct? Here's some function, some kernel. And it's going to declare an array of ints in shared memory, 1024 ints. And then for convenience we'll define i to be equal to the thread index. And then we've got a bunch of statements. And I want you to check the ones that are going to run correctly without additional syncthreads. And let's go ahead and put a syncthreads before and after each of these code snippets, because it's really the lines here themselves that I'm asking you about, and we can assume there is a syncthreads ahead of this and a syncthreads after.
Okay, so this first example is pretty much the same thing as we already saw, right? Every thread is going to be reading from s[i-1] and then writing the result into s[i], for its own value of i. And there's nothing in this statement to guarantee that all of these reads complete before any of these writes complete. And since one thread's s[i-1] is another thread's s[i], there's absolutely the potential for a collision here. So the correct way to write this, as before, would be something like this, where we go ahead and insert a syncthreads, and a temporary variable to separate the read and the write. So what about this second example? Here, what this if statement does is basically ensure that only the odd threads are going to try to read from s[i-1] and write to s[i]. So if you think about this, thread 1 is going to be reading from location 0 and then writing to location 1. Thread 2 does nothing. Thread 3 reads from location 2 and writes to location 3. Thread 4 does nothing. Thread 5 reads from location 4 and writes to location 5, and so on. So, in this case there are no conflicts, and this code is actually correct even without any additional internal syncthreads. And our final example is, again, not correct. It's similar to the first one. Every thread is going to read from locations i-1, i, and i+1. It's going to do a little math, and then it's going to store the result in location i. So clearly, to make sure that all of these reads happen before any of these writes, we're going to need a syncthreads. So to make this code correct, we'd have to do something like this. We'd have to use a temporary variable, do all of the reads, sync, do the write, sync, and then do the printf. The writes have to complete before this printf is done, to make sure that the printf prints the correct value.
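For reference, here's a sketch of the corrected third snippet, following the read, sync, write, sync, printf structure just described (bounds checks at the ends are omitted, as in the quiz):

```cuda
#include <stdio.h>

__global__ void quiz_fixed()
{
    __shared__ float s[1024];
    int i = threadIdx.x;

    float temp = (s[i-1] + s[i] + s[i+1]) / 3.0f;  // do all the reads
    __syncthreads();                               // reads complete
    s[i] = temp;                                   // do the write
    __syncthreads();                               // writes complete
    printf("s[%d] = %f\n", i, s[i]);               // now safe to print
}
```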
In the next homework, we're going to implement some pretty cool image blurring techniques. And you now know enough that you can go and implement that program on a massively parallel GPU, and you'd get a correct answer, and it would be pretty fast. But we can do better. Now we have all the ingredients to start talking about writing efficient parallel programs in CUDA. For now I'm only going to talk about high-level strategies. We're going to have a whole unit later on about detailed optimization approaches to help you really squeeze the maximum performance out of the GPU. So think of this as a preview that covers some of the really important high-level things that you have to keep in mind when you are writing a GPU program. So the first thing to keep in mind is that GPUs have really incredible computational horsepower. Okay, a high-end GPU can do over 3 trillion math operations per second. You'll sometimes see this written down as TFLOPS. FLOPS stands for floating-point operations per second, so TFLOPS is teraFLOPS, and modern GPUs can do over 3 trillion of these every second at the high end. But all that power is wasted if the arithmetic units that are doing the math need to spend their time waiting while the system fetches the operands from memory. So the first strategy we need to keep in mind is to maximize arithmetic intensity. Arithmetic intensity is basically the amount of math we do per amount of memory that we access. So we can increase arithmetic intensity by making the numerator bigger or by making the denominator smaller. This corresponds to maximizing the work we do per thread, or minimizing the memory we access per thread. And let's be more exact about what we mean here. Really what we're talking about is maximizing the number of useful compute operations per thread, and really what we care about here is minimizing the time spent on memory per thread. And I phrase this carefully, because it is not the total number of memory operations that we care about, and it's not the total amount of memory that comes and goes in the course of the thread executing its program. It's how long it takes us to do that. So there are a lot of ways to spend less time on memory accesses, and that's what we're going to talk about now.
So, let's talk about ways to minimize the time spent on memory accesses. The first strategy to think about is: move frequently accessed data to fast memory. We've talked about the memory spaces available to the GPU. There's local memory, which holds a given thread's private variables, local variables, parameters, things like that. There's shared memory, shared by a thread block, and there's global memory, shared by all the threads. And more or less, it's true to say that local memory is faster than shared memory, which in turn is faster than global memory. In fact, shared memory and local memory are usually much faster than global memory. I should mention that there are subtleties here, for those of you who know something about computer organization. The reason why I'm labeling local memory as so fast is that it tends to live either in registers or in L1 cache, and those are both quite fast. This isn't a hard and fast rule; there are some subtleties here. But in general, data that is kept local to a thread is going to be about as fast as possible, data that is in a thread block's shared memory is going to be very fast, and data that is way out in global memory is going to be a lot slower, although this is still much faster than CPU memory, also known as host memory. So, let's see an example of using local, shared, and global memory. So here's a kernel. I know that it's a kernel because it starts with either __device__ or __global__. It's called use_local_memory_GPU, it has one parameter called in, and it's got one local variable called f. Because this is a local variable, it's in local memory; it's private to this thread. Every thread will have its own copy of a variable named f. And parameters are also local memory, so every thread will have its own copy of a parameter called in. And, you know, real code would presumably do something with these variables, but since this is just an example of how to use local memory, I don't need to do that. How would you call a kernel that shows off using local memory? Well, you would have to launch it, meaning tell the GPU how many thread blocks to run, with how many threads each, and pass in any parameters. So in this case, 2.0. Pretty simple.
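Reconstructed from that description, the kernel and its launch look roughly like this:

```cuda
// Reconstructed sketch: the parameter "in" and the variable "f" both
// live in local memory, private to each thread.
__global__ void use_local_memory_GPU(float in)
{
    float f;    // "f" is in local memory, private to each thread
    f = in;     // the parameter "in" is also in local memory
    // real code would presumably do something with f here
}

// launching it: one block of 128 threads, passing 2.0 as the parameter
// use_local_memory_GPU<<<1, 128>>>(2.0f);
```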
So, once again we've got a kernel, and we know it's a kernel because it's been tagged with __global__, so it's going to run on the GPU but can be called from the host. And once again we're going to pass in a local variable, a parameter called array. The trick is that this parameter is a pointer, and so this is actually pointing to some global memory that we've allocated elsewhere; I'll show you how to do that in a moment. Once you've got a pointer to global memory, you can manipulate the contents of that memory just as you would manipulate any other chunk of memory. So in this case, I'm going to take my array and set one of its elements, the one at this thread's index, to some number, which happens to be 2.0 times this thread's index. Again, not a very useful function, but it illustrates what's happening. So the point really is that since all the parameters to a function are local variables, private to that thread, if you want to manipulate global memory, you're going to have to pass in a pointer to that memory. And of course that means you're going to have to allocate that memory, so let's look at how that works. Here's the code to show off how we use global memory. The first thing I'm going to do is declare some host memory. Once again, I'm using the convention that starting a variable with the prefix h_ indicates that this is memory that lives on the host. So here's an array of 128 floats. And I'm also going to declare a pointer that I'm going to use to point to the global memory that I allocate on the device; once again, the d_ convention indicates that this variable is on the device. Now I want to allocate some global memory on the device, so I'm going to use the function cudaMalloc. What's happening here is that I'm passing it a pointer to this variable, which is itself a pointer, right? And cudaMalloc is going to allocate some memory, in this case room for 128 floats, and stuff a pointer to that memory into the pointer d_array. If you're allocating memory, you'll probably want to initialize it to something, so we use cudaMemcpy for that operation. In this case we pass in a pointer to the destination memory, which is this d_array that we've allocated, and a pointer to the source memory, which is this h_array variable, and then the number of bytes to copy, and then we indicate whether we're copying from the host to the device, or vice versa. Oops, this is a bug. So now we've got a chunk of global memory, we've put something in it, and now we're ready to launch the kernel that's going to operate on that global memory. So here's the kernel that we saw earlier. Again, we're going to launch a single thread block consisting of 128 threads, and I'm going to pass in this pointer to the memory I've allocated and initialized. So after this runs, presumably this code will do something to that memory that I pass in, and now I'll need to copy it back to the host. If I want to use the results of this kernel back on the host, then I need to copy the memory back into host memory. And so here's that operation: once again, cudaMemcpy. This time, the destination is h_array, the source is d_array, the same number of bytes, and now we're copying from device to host. Okay. So that's how you use global memory. The trick is that, since you can only pass local variables in to a kernel, you have to allocate and initialize global memory outside the kernel, and then pass in a pointer.
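Putting that walkthrough together, here's roughly what the complete global-memory example looks like (reconstructed from the narration, with the copy direction flags written the way they should be):

```cuda
#include <stdio.h>

__global__ void use_global_memory_GPU(float *array)
{
    // "array" points into global memory on the device
    array[threadIdx.x] = 2.0f * (float) threadIdx.x;
}

int main(int argc, char **argv)
{
    float h_array[128];   // h_ convention: this memory lives on the host
    float *d_array;       // d_ convention: this will point to device memory

    for (int i = 0; i < 128; i++) { h_array[i] = (float) i; }

    // allocate global memory on the device; put a pointer to it in d_array
    cudaMalloc((void **) &d_array, sizeof(float) * 128);

    // initialize it by copying the host array onto the device
    cudaMemcpy((void *) d_array, (void *) h_array,
               sizeof(float) * 128, cudaMemcpyHostToDevice);

    // launch a single block of 128 threads on that global memory
    use_global_memory_GPU<<<1, 128>>>(d_array);

    // copy the results back into host memory
    cudaMemcpy((void *) h_array, (void *) d_array,
               sizeof(float) * 128, cudaMemcpyDeviceToHost);

    cudaFree(d_array);
    return 0;
}
```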
Finally, let's look at how you would use shared memory.
This example's a little more complicated. For clarity, remember that I'm just hard-coding the idea that there are 128 threads, and therefore we're going to operate on 128 elements of the array, all right? And I'm going to skip all the sort of out-of-bounds checks and assertions that you would normally use to make sure that you're not trying to access a piece of memory that's not there. So once again, we have a function, use_shared_memory_GPU. This function is a kernel, and we're going to pass in a local variable, which is a pointer to a bunch of floats. And this local variable is a pointer to global memory that's been allocated in advance. I wanted to come up with some code that actually did something using shared memory, and that's why this function is a little more complicated than the examples you've seen. So, we're going to declare a few local variables that are private to each thread: a couple of indices, and we'll declare this variable index just to be shorthand for threadIdx.x, just to save some typing. And we're going to declare a floating-point variable called average, and another one called sum, and we'll initialize sum to 0.0. Here's the big example. Now we're going to declare a variable that lives in shared memory. In this case, it's an array. Again, I've hard-coded that array to contain 128 elements. It's an array of floats, and I tag it as being in shared memory with this __shared__ declspec. And remember, the whole point of shared memory is that this variable is visible to all the threads in the thread block, and it only has the lifetime of the thread block, so it doesn't exist outside of this thread block. Every thread block is going to have its own single copy of this array that all of the threads in that thread block can see. I call it sh_array. The first thing we're going to do is put some data in this array. Remember, I passed in this array that's in global memory, called array, and I'm basically going to initialize the shared memory array to contain exactly what's in the global memory array. And notice how I do it. I'm copying data from this array in global memory to this array in shared memory, and I'm doing it with a single operation. How does this work? Every thread, if you look at this code, is figuring out what its own index is, okay? Which thread it is. And it's copying the element at array[index] into the element at sh_array[index]. Okay, so since all of the threads in the thread block will be doing exactly this operation, and since they will all have different values for index, when this single line has completed in all the threads, then we will have copied all of the information in this global memory array into our shared memory array. And the trick is that that operation is going to take some time. Multiple threads are running; they're not running all at the same time, so it won't happen instantaneously. We have to make sure that all of these writes to shared memory have completed before we go on and do anything with that array in shared memory. And that's what the syncthreads operation is about. You've seen this before. Okay. So we call our barrier, to make sure that all the writes to shared memory have completed. And now, after that barrier, we're guaranteed that this shared memory is fully populated, so every element has been initialized.
And just to be doing something, I said, well, let's just find the average of all the elements previous to this thread's index. So what we're going to do, with this for loop, is set i equal to 0 and go up to index, which again is the number of this thread. And we're going to accumulate all the values in the shared memory array up to this thread's index. And after we're done, we'll compute the average, which is simply equal to the sum divided by the number of elements that we added up. And then, once again, we need to do something with that average. What I decided to do is to have every thread look at the average that it just computed of all of the elements in the array to its left, if you will. If the value in the array at this thread's index is greater than the average that it just computed of all of the elements to the left of this thread's index, then we're going to set the array at this thread's index equal to that average. In other words, we're going to replace any element that is greater than the average of all the elements to its left with that average. Notice that I'm operating on array, and not sh_array. So I used the shared memory array to do my averaging, and that's a good idea, because remember that shared memory is fast; it's much faster to access than global memory. And here, every thread is accessing a bunch of elements in the array, so it makes sense to move this array into faster shared memory first. We'll talk about this later. But now I'm making this change back in the global memory array. And that means that when the kernel completes, this change is going to be seen by the host, and it would also potentially be seen by other threads in other thread blocks, if there were any. And then, just to sort of make a point, here's a piece of code afterwards that has no effect at all, because I'm going to set an element of the shared memory array to 3.14, but then the kernel completes. Nothing ever gets done with that value that's sitting in shared memory. And since the shared memory has the lifetime of the thread block, this memory evaporates as soon as this thread block completes. So this code has no effect, and in fact can probably be removed by the compiler. Calling a kernel that uses shared memory is no different than calling a kernel that uses global memory, right? Since all you can do is pass in a local variable that points to global memory, if you've so allocated it, what that kernel does with its shared memory is completely up to it, and not visible to the host code at all. So the code for calling this kernel is identical to the code we saw up here.
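For reference, here's the shared-memory kernel roughly as walked through above, reconstructed from the narration with the names as narrated:

```cuda
__global__ void use_shared_memory_GPU(float *array)
{
    // local variables, private to each thread
    int i, index = threadIdx.x;
    float average, sum = 0.0f;

    // one copy per thread block, visible to all its threads,
    // with the lifetime of the thread block
    __shared__ float sh_array[128];

    // each thread copies one element from global to shared memory
    sh_array[index] = array[index];
    __syncthreads();    // ensure all writes to shared memory are done

    // average all the elements to the left of this thread's index
    for (i = 0; i < index; i++) { sum += sh_array[i]; }
    average = sum / (index + 1.0f);

    // if this element is above that average, replace it with the
    // average; note this writes back to the *global* array, so the
    // host (and any other blocks) would see the change
    if (array[index] > average) { array[index] = average; }

    // this last write has NO effect: sh_array evaporates when the
    // block exits, and nothing ever reads this value
    sh_array[index] = 3.14f;
}
```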
So our question for the quiz is, which of these operations would we expect to be fastest, and which would we expect to be slowest?
So we would expect the assignment of local variables to be the very fastest. And shared memory is also very fast. So a and b are both shared memory variables, and you would expect that operation to be fast. And global variables are all the way out in global memory, so they're going to be the slowest. We would expect that this assignment of the contents of y to the contents of z is probably the slowest operation, and this one, which is moving a piece of data from global memory into shared memory, is probably the second slowest. And by the way, if you know anything about compilers, you realize that this is an oversimplification, right? It's quite possible that many of these values will be promoted into registers, and an optimizing compiler might rearrange accesses, and so forth. But the point is simply to get across the relative speeds of memories.
So, by just naively having 10,000 threads incrementing 10 array elements, we got the wrong answer. One solution might be to sprinkle barriers throughout the code that I just showed you. But another solution is something called atomic memory operations. Atomic memory operations, or atomics for short, are special instructions that the GPU implements specifically for this problem. And there's a bunch of atomics. You can get the list in the programming guide, but some of the examples that you might use are atomicAdd, atomicMin, atomicXor, and so forth. A particularly interesting one is called atomicCAS, which stands for compare-and-swap. Let's look at how these work. Okay, so here's our code again. And this time, instead of calling it increment_naive, let's call a new function increment_atomic. This kernel is almost exactly the same. Every thread computes which thread it is, and then it mods that by the size of the array to figure out where it's going to write. But the difference is that where before we were just doing a standard C operation, g[i] = g[i] + 1, which is a read, a modify, and a write, here we're going to use the atomic add operation. This atomicAdd function is a CUDA built-in that takes a pointer, which should point to somewhere in device memory, shared or global, and a value to add. So this has the same effect: every thread is going to go and add one to the value at g[i]. But the difference is that this uses special hardware that the GPU has built in to perform atomic operations. So now, we're guaranteed that only one thread at a time can do this read-modify-write operation. So now we'll scroll down, change our code to call increment_atomic, that's the only change we'll make, compile the code again and run it. So this looks more like it. Now we're running a million total threads, in a thousand blocks, writing into 10 array elements. And this time, as we expect, every array element contains the number 100,000 when all is done. So the atomic operation prevented those collisions where multiple threads were trying to read and write from the same memory location at the same time, and instead serialized them, making sure that only one thread at a time has access during the course of that read-modify-write, our add operation, and we get the right result. Notice that it took longer: before it took about 3 milliseconds, now we're taking about 3.7 milliseconds. So atomic operations come with a cost, and that's something I'm going to go into. So we'll run this a few more times to make sure that we get the same result. That time it took about 4 milliseconds; closer to four again; down around three; and so forth.
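Reconstructed from that description, the atomic version looks roughly like this (sizes as narrated; d_array is the device array allocated as in the earlier global-memory example):

```cuda
#define NUM_THREADS 1000000
#define ARRAY_SIZE  10
#define BLOCK_WIDTH 1000

__global__ void increment_atomic(int *g)
{
    // which thread is this, and which array element should it bump?
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    i = i % ARRAY_SIZE;

    // atomicAdd performs the read-modify-write as one indivisible step
    atomicAdd(&g[i], 1);
}

// launch: increment_atomic<<<NUM_THREADS / BLOCK_WIDTH, BLOCK_WIDTH>>>(d_array);
```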
So, atomic memory operations have a number of limitations that you need to know about. First of all, only certain operations and data types are supported. So there are things like add and subtract, and min, and XOR, and so forth, but you can't do every operation. There's no atomic mod, for example; there's no atomic exponentiate. And only certain data types are supported, mostly integer types. In particular, atomicAdd and atomicExch are about the only ones that actually support floating-point operands. There's a workaround for this: it turns out that you can implement any atomic operation using the atomic compare-and-swap operation. I'm not going to go into the details; this gets into mutexes and critical sections. There's a short example in the programming guide that shows how to implement, for example, 64-bit operations using atomicCAS. You also need to be aware that there are still no ordering constraints, right? The different threads in a thread block, and the different thread blocks themselves, will be running in some unspecified order, different every time. And so the operations that you're performing on memory using atomics are still going to happen in an unspecified order. This is important, because floating-point arithmetic is actually non-associative. In other words, the quantity (a + b) + c is not the same as a + (b + c). It's real easy to convince yourself of this if you look at numbers like, say, a equal to 1, b equal to 10 to the 99th, and c equal to minus 10 to the 99th. You can just plug this into your C compiler, or your calculator for that matter, and you'll discover that you don't get the same number depending on what order you do these operations in. And the final thing to be really aware of with atomics is that there's no magic happening under the hood. The GPU is still forcing each thread to take turns gaining access to that memory. It's serializing the accesses to memory among the different threads, and this can make atomic operations very slow if you're not careful. Let's look into that.
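You can see the non-associativity with a couple of lines of C, which compile with any C compiler or with nvcc:

```cuda
#include <stdio.h>

int main(void)
{
    double a = 1.0, b = 1e99, c = -1e99;
    printf("(a + b) + c = %f\n", (a + b) + c);  // 0.000000: the 1 is lost
    printf("a + (b + c) = %f\n", a + (b + c));  // 1.000000
    return 0;
}
```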
So let's have a quiz and a programming exercise on this. Here, we've given you the code that you just saw. Now, what I want you to do is modify the code to time several different scenarios. Okay? So try running a million threads, each incrementing one of a million elements, so every thread is uniquely incrementing a single element. A million threads atomically incrementing a million elements. A million threads incrementing a hundred elements. A million threads atomically incrementing a hundred elements. Or finally, 10 million threads atomically incrementing a hundred elements. And for each of these choices, I want you to tell me two things. First, does it give the correct answer? Put a check mark by those that give the correct answer. And second, rank them from fastest to slowest, so the fastest will be 1 and the slowest will be 5.
So, taking these one at a time. A million threads incrementing a million elements does give the correct answer because, in this case, there's a unique element for every thread, so there's no conflict. So even though we didn't make these atomic increments, we're still safe. A million threads atomically incrementing a million elements is, of course, also safe, so you'll also get the correct answer. A million threads incrementing a hundred elements is the same example we saw before, and as we saw, that will give the wrong answer unless we use atomics. So the third one is not correct; the fourth one is correct. And finally, ten million threads atomically incrementing a hundred elements will still give the correct answer. So all but one of these give you the correct answer. Okay, so the more interesting question is: how long does each of these options take? The fastest, perhaps counter-intuitively, is going to be option three: a million threads writing into a hundred elements. And the next fastest would be option one: a million threads writing into a million elements. On my laptop, these two operations take around 3.2 milliseconds and 3.4 milliseconds, respectively. Of course, option three isn't a very useful option, since it doesn't give the correct answer, but it's still interesting to look at what's going on. The reason why it is slightly faster is that you have your million threads all trying to write to the same 100 elements. Well, those 100 elements occupy a very small fraction of memory, and they're all going to fit into a cache line or two on this system, whereas a million elements is large enough that it's not going to fit in cache; you're actually going to have to touch more of the DRAM, where global memory is stored. For the same reason, a million threads writing into 100 elements atomically is going to be slightly faster than a million threads atomically writing into a million elements. So the next fastest is option 4, and the next fastest after that is option 2; on my system these took 3.6 and 3.8 milliseconds. Which means the slowest of all the options is the one where 10 million threads write into a hundred elements. This actually takes 36 milliseconds, so approximately 10 times as long to complete. Not surprisingly, there's about ten times as much going on. So you might play around with this code for a little bit. For example, see what happens to the time as you go to even more threads writing into even fewer elements. The big lesson here is that atomic memory operations come with a cost.
Okay, let's recap where we are. We're talking about strategies for efficient CUDA programming. The first thing we've talked about is having high arithmetic intensity: trying to get your ratio of math operations to time spent accessing memory as high as possible. And so far we've been talking about the denominator; the goal has been to minimize the time spent on memory. Part of that is simply moving data to faster memory if you're going to access it a lot, keeping in mind that the fastest memory of all is local variables, followed by shared memory, followed by global memory. Another thing you can do is use coalesced global memory accesses. When you are accessing global memory, try to do it quickly, and the trick there is to have adjacent threads access a contiguous chunk of memory, usually adjacent memory locations. So what else do we need to think about when writing efficient CUDA programs? In addition to striving for high arithmetic intensity, we also want to avoid thread divergence. Let me explain what that means. When parallel threads, like the threads in our trusty thread block, do something different, we say they diverge. Say we had some code like this in the kernel: you're doing some stuff and you reach an if statement; if the condition is true, you execute some code, else you execute some other code, and then you proceed. Now think about a whole bunch of threads in a kernel executing this code. All of these threads are going to get down here and hit this condition; they're all going to evaluate this condition. And then some of those threads are going to take the if branch, and some of them are going to take the else branch. So this thread proceeds, hits the condition, and, let's say, takes the if branch. Maybe this thread proceeds, hits the condition, and takes the else branch. Perhaps this one takes the else branch as well, and maybe this thread takes the if branch. Okay, so these threads have diverged. Of course, in this particular case, afterwards they're all going to proceed together again. So this thread keeps going and finishes the if code; this thread keeps going and finishes the if code; these two threads each keep proceeding and do the else code; and then it all converges again. And you might notice that I tried to draw them so that they reassemble into the same order they were in. In fact, their thread IDs never changed: this is still thread zero, this is still thread three, and these are still thread one and thread two. The thread indexes don't change; it's just that the path they're executing through the program is different. So that's what divergence means. This is thread divergence: threads that do different things.
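As a concrete illustration (a made-up kernel, not code from the course), here's a branch where adjacent threads take different paths, so the hardware has to run the two sides one after the other:

```
__global__ void divergent_branch(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Pre-branch code: every thread executes this together.
    float x = (float)i;
    if (i % 2 == 0) {
        out[i] = sinf(x);   // even-numbered threads take the if branch
    } else {
        out[i] = cosf(x);   // odd-numbered threads take the else branch
    }
    // After the branch, the threads reconverge and proceed together.
}
```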
You can also encounter thread divergence due to loops. Here's a somewhat contrived example. We have some pre-loop code in our kernel, which all the threads are going to execute, and then they're going to reach this for loop. The way I've constructed it, each thread goes through the loop a number of times determined by its thread index: thread 0 will execute the loop body once, thread 1 will execute it twice, thread 2 will execute it three times, and so on. And then eventually they're all going to exit the loop and proceed to do some post-loop stuff. So what does this look like? Here's a bunch of threads, all in the same thread block; I've color coded them so you can see what they do more easily. They're all going to execute this pre-loop code, and then they're going to reach the loop. Thread 0 executes the loop code once and keeps going. Thread 1 executes the loop code, then executes it again, and keeps going. Thread 2 executes the loop code three times and keeps going. And thread 3 executes the loop code four times. Now let's think about these threads a little differently, in terms of what they're doing over time. The first thread executes the pre-loop code, then goes ahead and executes the loop code, and then it really just sits around; it doesn't have anything to do for a while. Because in the meantime, thread 1 has executed the pre-loop code, then the loop code, then the loop code again. The third thread executes the pre-loop code, then the loop code three times. And the final thread executes the pre-loop code and then the loop code four times. And finally, all the threads can go ahead and proceed with the post-loop code. When you draw the diagram like this, it gives you a sense of why loop divergence is a bad thing, why it slows you down. It turns out that the hardware loves to run these threads together, and as long as they're doing the same thing, executing the same code, it has the ability to do that. But in this case, the blue thread proceeds for a while and then, because it's not going to do the loop again, it just ends up waiting around while the other threads do. Then the red thread waits for a while, and the green thread waits a little bit; only the purple thread was executing at full efficiency the whole time. So you can imagine that if the hardware gets some efficiency out of running all four of these threads at the same time, then that efficiency has been lost during this portion of the loop.
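Here's roughly what that contrived loop might look like in code (a sketch; the kernel and variable names are mine). Each thread's trip count depends on its thread index, so threads running together finish at different times and the early finishers sit idle:

```
__global__ void divergent_loop(float *out)
{
    // Pre-loop code: every thread executes this together.
    float acc = 0.0f;

    // Thread 0 runs the body once, thread 1 twice, thread 2 three
    // times, and so on -- so the threads finish at different times,
    // and the early finishers idle until the last thread exits.
    for (int j = 0; j <= threadIdx.x; ++j) {
        acc += 1.0f;        // loop code
    }

    // Post-loop code: the threads have reconverged by this point.
    out[threadIdx.x] = acc;
}
```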
So, let's wrap up and summarize some of the things we've learned. We've learned about parallel communication patterns, going beyond the simple map operation that you saw in Unit 1 to encompass other important patterns like gather, scatter, stencil, and transpose. We've learned more about the GPU programming model and the underlying hardware: how thread blocks run on streaming multiprocessors, or SMs, what assumptions you can make about ordering, and how threads and thread blocks can synchronize to safely share data through memory. We've learned about the GPU memory model, topics like local, global, and shared memory, and how atomic operations can simplify memory writes by concurrent threads. Finally, we got a quick preview of strategies for efficient GPU programming. The first principle is to minimize the time spent on memory accesses, for example by coalescing global memory accesses. We saw that the high-bandwidth global memory on the GPU operates best when adjacent threads access contiguous chunks of it, and this is called a coalesced memory access. We also learned to move frequently accessed data to faster memory, for example by promoting data to shared memory. And we learned that atomic memory operations have a cost. They're useful, and often the cost is negligible, so you shouldn't necessarily freak out about it, but it's something to be aware of. Along the same lines, we learned about avoiding the thread divergence that comes with branches and loops; once again, divergence comes at a cost that you should be aware of, but it isn't something to be freaked out about. We're going to revisit these topics and talk much more about optimizing GPU programs in Unit 5. So, that's it.
Okay, congratulations on finishing Unit 2. Let's recap what we've learned. You learned how threads communicate with each other through memory, and how they can access that memory efficiently when operating in concert. Along the way we learned a few things about the GPU hardware, like its memory model, and what assumptions we can and cannot make about when threads will run. Now you'll have a chance to put those concepts into practice: you'll implement a simple image blurring operation. Jen Hahn will tell you more.