cs344 ยป

Welcome to Unit 2

Okay. Welcome to unit 2. It's good to see you again. So in the last unit you learned about the fundamentals of the GPU programming model and the basics of writing a simple program using CUDA. In this unit we're going to build off of that. We'll learn about important parallel communication patterns like scatter, and gather, and stencil. And we'll dive a little deeper into the GPU hardware and learn about things like global memory and shared memory, and we'll put these together to learn how to write efficient GPU programs.

Communication Patterns

Let's recap what we've learned so far. Parallel computing is all about many threads solving a problem by working together. And the key is this working together any books on business practices or teamwork will tell you that working together is really all about communication. In CUDA, this communication takes place or memory. For example threads may need to read from the same input location. Threads may need to write to the same output location. Sometimes threads may need to exchange partial results.

Scatter Quiz - Solution

And the correct answer is that it's a scatter operation. Each thread is computing where to write its result.

Stencil-Solution

The answer, of course is simply the number of elements in the stencil. Right, so every element in that array is going to be read five times by the 2D von Neumann stencil, because there are five entries in the neighborhood. So, all elements will be read five times by the 2D von Neumann stencil, nine times by the 2D Moore stencil, and seven times by the 3D von Neumann stencil.

Transpose Part 1

Another parallel communication pattern worth mentioning is called transpose. For example, you might have a 2D array, such as an image, laid out in row-major order. This means that the elements of the array, or the pixels of the image, are laid out one row at a time. And, I've color-code the rows here just to show you more clearly what I'm doing. But you might want to do some processing on the columns of the same edge. And so you'd want to lay out like this. This means you need to do an operation to reorder the elements. As you can see, I've drawn this as a scatter operation. So, each thread is reading from, an adjacent element in the array. But is writing to someplace scattered in memory, according to the stride of this row column, transpose. I could also have expressed this as a gather operation. Like so.

Transpose Part 2

So you can see where a transpose might come up when you're doing array operations, matrix operations, image operations. But the concept is generally applicable to all kinds of data structures. Let me, let me give an example. So, here's some sort of structure you might have, right? It's a perfectly reasonably structure foo. It's got a float field and an integer field and say, that you have a thousand of these. Well, what does that look like in memory? You're going to have the floats and the integers disbursed throughout memory. And as we will talk about later, it can be more efficient to access, if you're going to do a lot of processing on the floats, it can be more efficient to access all of the floats contiguously. You're going to want some operation that lets you take your, what's called an array of structures representation, and turn it into a structure of arrays. And that operation is, again, a transpose. By the way, these two terms are so common that array of structures is often abbreviated AOS. And structure of arrays is often abbreviated SOA. You'll see these terms come up frequently in parallel computing. So, to summarize, the transpose operation is where tasks reorder data elements in memory.

What Kind of Communication Pattern - Quiz

Okay, let's have a quiz on communication patterns. I'm going to give you a bunch of code snippets and I'm going to ask you to label them according to the parallel communication pattern that they embody. For each code snippet, you should indicate whether it is a map operation, a gather operation, a scatter operation, a stencil operation, or a transpose operation. Here's the code, and this is really sort of pseudo code. I'm not explaining where these variables came from, or showing you that how many threads I'm watching, or anything like that. But this is kernel code, and as you can see, I have two variables, out and in. These are arrays, the floating point numbers. And just for berevity, I've created two variables, i and j, to represent threadIdx.x, and threadIdx.y. Just to have something to do, I'm going to multiply a bunch of numbers by pi, so I define pi here, and here are our code snippets. Out i equals pi times in. Out i plus j times 128 equals in j times, j plus i times 128. And then, you see these two I have guarded with an if statement but only the odd thread get executed. Out i minus 1 plus equals pi times in i, out i plus 1 plus equals pi times in i. Finally, out i equals in i plus in i minus 1 plus in i plus 1 times pi divided by 3. So, for each of these statements, each of these little code snippets indicate whether it's a map, a gather, a scatter, a stencil, or a transpose.

What Kind of Communication Pattern - Solution

So, this first one is pretty easy, right. There is one-to-one correspondence between the output and the input so that's clearly an app operation. And this next one is also is one-to-one operation. One value gets written in the output array corresponding to[UNKNOWN] get right from the input array and you can see that while writing into an array which is represented in I major order here in the output, and in j major order in the input. So this is a transpose operation. Now this next code, as I said, I put a guard around. Only odd the numbered threads are going to execute this. So that rules out a map, it's not one to one. And that also rules out a transpose operation, which is also one to one. And you really couldn't call it a stencil operation either because a stencil operation should generate a result for every element in the output array. And this doesn't do that. Now, if you look the first one, the thread is taking the input at a given location and multiplying it by pi and placing that into a couple of different places in the output array. In fact, it's incrementing a couple different places in the output array. So, this would be a scatter operation, the thread is computing for itself where it needs to write its result. And this final line would be a gather. You can see that every thread is writing a single location in the output array and it's reading from multiple places in the input array, locations that it computes. So this would be a gather. And again, this looks very much like a stencil operation since it's. Reading from a, a local neighborhood, and doing some averaging and writing the result, but I wouldn't call it a stencil because it's not writing into every location because of this guard here. So that's why I refer to this as a gather rather than a stencil.

Parallel Communication Patterns Recap

Conceptually map and transpose are one to one, each input maps to a single unique output. You can think of a gather operation as many to one. Many possible inputs can be chosen to compute an output. In this terminology, scatter is one to many, so each thread chooses from many possible output destinations. Stencil can be seen as a specialized gather that pulls output from a select few outputs in a given neighborhood of the output So you might turn that several-to-one. In the next lecture John will tell you about 2 more really fundamental pattern. Reduce could be turned all-to-one. For example if you're adding up all the numbers in an array. Finally scan and sort can be considered all-to-all because all of the input can affect the destination of the resulting output. You'll learn more about these in the next lectures.

Programmer View of the GPU

Check all the true statements. A thread block contains many threads. An SM might run more than one thread block. A thread block may run on more than one SM. All the threads in a thread block might cooperate to solve a subproblem. All the threads that run on a given SM may cooperate to solve a subproblem.

Another Quiz On Thread and Blocks - Quiz

So as a review, the programmer, or the GPU, is responsible for defining thread blocks in software. And the programmer, or the GPU, is responsible for allocating those thread blocks to hardware streaming multiprocessors, or SMs.

Another Quiz On Thread and Blocks - Solution

And of course the correct answer is that the programmer is the one writing software, and programmer's job is to define those thread blocks. And the GPU is running the hardware, and the GPU is completely responsible for allocating those thread blocks to run on hardware SMs.

What Can The Programmer Specify - Quiz

One more quiz to try this on. If we have a single kernel that's launched on many thread blocks, including x and y, the programmer can specify that block x will run at the same time an block y. That block x will run after block y. That block x will run on sm z.

What Can The Programmer Specify - Solution

The answer of course is that all of these are false. There are no such guarantees.

Okay, here's a simple CUDA program that'll illustrate these ideas. So, we're going to launch 16 blocks and each one's going to have a single thread running. And it's going to run a trivial kernel that just prints hello world on the thread in block. And then, you know, in the main, we're going to do nothing more than launch that kernel. You need this call cudaDeviceSynchronize to make sure that these printf's flash and then we'll print. That's all. So, before we run this, here's a quick quiz. How many different outputs do you think different runs of this program can produce? Is it 1, 16 possible different outputs, 2 to the 16th possible different outputs, or 21 trillion possible outputs?

A Thread Block Programming Example - Solution

Well as we'll see the correct answer is actually 21 trillion, which is 16 factorial. Let's see how that works. So here I'm just going to run this program in my terminal. And as you can see the blocks complete in some basically random order, block 7 completes, then block 6, then 1, then 0, and so forth. If I run it again I get a different order and a different order and a different order and so forth. You'll find that the blocks are executed in a different order every time is where that number 16 factorial comes from.

What Does CUDA Guarantee

So we spent so long talking about what CUDA doesn't guarantee, you may be wondering what CUDA does guarantee. There's two main things. Again, recall that all the threads in the block are guaranteed to run on the same SM at the same time. Second, all blocks in the kernel finish before any blocks from the next kernel are launched. This goes back to our programming model diagram where, remember, we had some threads running one kernel, say foo And then some threads running another kernel on the side bar. So what CUDA is promising as anything, any threads running foo will finish in their entirety before any threads running bar are launched.

Quiz About GPU Memory - Quiz

Let's have a quiz. Go ahead and check all the statements that are true. All threads from a block can access the same variable in that block's shared memory. Threads from two different blocks can access the same variable in global memory. Threads from different blocks have their own copy of local variables in local memory. Threads from the same block have their own copy of local variables in local memory.

Quiz About GPU Memory - Solution

So it turns out, these are all true. All threads from a block can access the same variable in that block's shared memory. That's what the shared memory is, it's a chunk of memory that shares, stores variables that are accessible from all threads in this given block. And threads from two different blocks can access the same variable in global memory. Well all threads, from any blocks, at any time can access a variable or piece of data that's sitting in global memory. So this is also true. Threads from different blocks have their own copy of variables in local memory. Yes, this is true. So, every thread has its own copy of whatever local variables, you know. Private, variables to that thread, are sitting in local memory. So this is true. And threads from the same block have their own copy of local variables, and local memory. Right. So, just because they're from the same block doesn't mean that they share local variables. They share shared memory. Per block shared memory, but they all share local memory. So all four of these are true.

The Need For Barriers - Quiz

Here's an example of the need for barriers. So we've got an array memory for the bunch of elements 1, 2, 3, 4, 5, 6, and we want to shift each of these left by element go here and so on. So here's a little code snippet, I should do this. We first initialize the elements of array to the thread index and you'll see that if hard code of this to 128 just to be lazy. So every thread is going to set its corresponding array elements to its own index. So this should initialize things to 1, 2, 3, 4, 5, 6 and so on. And then avoiding stepping off the end of the array with the step statement. Every thread will set its corresponding array element at its index equal to the value of the array element at the thread's index plus 1. So thread 1 will set its value to whatever is written in array, 2 sub 2, thread 2 will set array sub 2 equal to whatever is in array sub 3 and so forth. So here's a quick quiz. How many barriers does this code need?

A Quiz On Synchronization - Quiz

So let's have a quiz. Are the following code snippets correct? Here's some function, some kernel. And it's going to declare an array of ints and shared memory, 1024 ints. And then for convenience we'll define i to be equal to thread index. And then we've got a bunch of functions. And I want you to check the ones that are going to run correctly without additional sync threads. And lets go ahead and put a sync thread before and after each of these code snippets because it's really the lines here themselves that I'm asking you about and we can assume there is a sync thread ahead of this and a sync thread after.

Global Memory

So, once again we've got a kernel. And, we know it's a kernel because it's been tagged with global, so it's going to run on the GPU but can be called from the host. And once again we're going to pass in a local variable, a parameter called array. And the trick is that this parameter is a pointer, and so this is actually pointing to some global memory that we've allocated elsewhere. And I'll show you how to do that in a moment. Once you've got a pointer to global memory, you can manipulate, or you can manipulate the contents of that memory just as you would manipulate any other chunk of memory. So in this case, I'm going to take my array And I'm going to set one of its elements which happens to be equal to the index to this thread to some number which happens to be 2.0 times the index to this thread. Again not a very useful function but it illustrates what's happening. So the point really is that since all the parameters to a function are, our local variables are private to that thread. If you want to manipulate global memory you're going to have to pass an appointer to that memory. And, of course that means you're going to have to allocate a pointer so, let's look at how that works. Here's the code to show off how we use global memory. The first thing I'm going to do is to clear some host memory okay. And, once again I'm using a convention that's starting a variable with a prefix H underscore indicates that this is memory that lives on the host. So here's an array of 128 floats. And I'm also going to declare a pointer that I'm going to use to point to the global memory that I allocate on the device. And once again the d underscore, the d underscore can mention indicates that this variable is on the device. Now I want to allocate some global memory on the device. So I'm going to use the function cudaMalloc. What's happening here is that I'm passing it a pointer to this variable. Which is itself a pointer. Right? And, cudaMalloc is going to allocate some memory in this case room for 128 floats, and stuff a pointer to that memory into the pointer d array. If you're allocating memory you'll probably initialize to something. So we use cudaMemcpy for that operation. And in this case we pass in a pointer to the destination memory, which is this d array that we've allocated. And a pointer to the source memory, which is this h array variable. And then the number of bytes allocate, and then we indicate, whether we're copying from the hosted device, or vice versa. Oops, this is a bug. So now we've got a chunk of global memory, we've put something in it, and now we're ready to launch the kernel that's going to operate on that global memory. So here's the kernel that we saw earlier. Again, we're going to launch a single, thread block, consisting of 128 threads. I'm going to pa, pass in this pointer where I've allocated an, an initialized memory. So after this runs, presumably this code will do something to that memory that I pass in and now I'll need to copy it back onto the host. If I want to use the results of this kernel back on the host, then I need to copy the memory back, into host memory. And so, here's that operation. Once again, cudaMemcpy. This time, the destination is h array. The source is d array. This same number of bytes. And now, we're copying from device to host. Okay. So that's how you use global memory. Alright? The trick is that, since you can only pass in local variables to a kernel. You have to allocate and initialize global memory outside the kernel, and then pass in a pointer. Finally, let's look at how you would use shared memory.

Quiz On Memory Access - Quiz

So our question for the quiz is, which of these operations would we expect to be fastest, and which would we expect to be slowest?

Quiz On Memory Access - Solution

So we would expect the assignment of local variables to be the very fastest. And shared memory is also very fast. So a and b are both shared memory variables, so you would expect that to be fast. And global variables are all the way out in global memory, so they're going to be the slowest. We would expect that this assignment of, of the contents of y to the contents of z. There's probably the slowest operation. And this one, which is moving a piece of data from global memory into shared memory, is probably the second slowest. And by the way, if you know anything about compilers. You realize that this is an oversimplification. Right? It's quite possible that many of these values will be promoted into registers. An optimizing compiler might rearrange accesses and so forth. But, the point is simply to get across the relative speeds of memories.

A Quiz on Coalescing Memory Access

Okay, let's have a quick quiz. So, which of these statements has a coalesced access pattern? Here's a simple kernel foo. It takes a pointer to global memory g, and as a shortcut, I'm going to define a as 3.14 and i as thread index dot x. So now, each of these statements either reads or writes g or both. And I'd like you to tell me, in each case, whether the accesses to g follow a coalesced access pattern.

Let Us time Some Code - Quiz

So let's have a quiz and a programming exercise on this. Here, we've given you the code that you just saw. Now, what I want you do to is modify the code to time several different scenarios. Okay? So try running a million threads each incrementing one of a million elements, so every thread is uniquely incrementing a single element. A million threads atomically incrementing a million elements. A million threads incrementing a hundred elements or a million threads atomically incrementing a hundred elements, or finally, 10 million threads atomically incrementing a hundred elements. And for each of these choices, I'm going to want you to tell me two things. First of all, does it give the correct answer, so put a check mark by those that give the correct answer. And second of all, rank them from fastest to slowest. So the fastest will be 1, the slowest will be 5.