

Optimizing GPU Programs

Today we're going to talk about optimizing GPU programs. Now, the whole reason we want to use a parallel processor like the GPU is to solve problems faster. And in turn, the reason we might want to solve problems faster could be simply that we want to solve the same problems in less time, or, more often, it's because we want to solve bigger problems or solve more problems. So the good news is that it's often the case that the first initial port of a problem gets a speedup, assuming you had a parallel problem to begin with. And in my experience, this is actually when a lot of GPU programmers get hooked. Over the weekend, they go home and, out of curiosity, try porting some piece of their existing CPU code to the GPU, and they get a nice speedup, a 5x or 8x speedup. And that's what gets them hooked and makes them realize they could put some more effort into this and get a bigger speedup. So if you have a naturally parallel problem, it's often the case that that first initial CUDA port will get you a good speedup, and that's cool. But, you know, by definition GPU programmers care about performance; that's why they're using the GPU. That means they often want to spend additional effort to maximize the speedup beyond that first initial try. So in this unit we're going to talk about how to optimize GPU programs. Cast your mind back to Unit 2, when we talked a little bit about some basic principles of efficient GPU programming. Check which of these principles accurately correspond to things we talked about in Unit 2 about efficient GPU programming. Do we want to decrease arithmetic intensity? Do we want to decrease the time spent on memory operations per thread? Do we want to coalesce global memory accesses? Do we want to do fewer memory operations per thread? Do we want to avoid thread divergence? Do we want to move all data to shared memory?

Optimizing GPU Programs

Well, arithmetic intensity is basically defined as the amount of math per memory operation. Or, to be a little more precise, the amount of useful work, which is essentially math operations, that we do per unit of time spent doing memory accesses. So, we want to maximize arithmetic intensity, not minimize it, and spending less time on memory operations clearly helps. The single most important way that you can do that is to coalesce your global memory accesses. Now, doing fewer memory operations per thread may or may not help. For example, we might want to move data into the fast shared memory, do some operations on it there, and then move it back. That would be more total memory operations, but the overall time spent accessing memory would go down, because the frequent accesses are happening within shared memory. So this one's not correct, necessarily. By the same token, we don't necessarily want to move all data to shared memory, right? Because that could be an unnecessary memory operation. What we really care about is moving frequently accessed data to shared memory. So these three are not correct. Finally, the GPU does run most efficiently when adjacent threads take the same path through the code, which means it's good to avoid thread divergence. In this unit we'll give a lot more explanation, examples, and specifics for all three of these general guidelines.
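To make that shared-memory point concrete, here's a small sketch, not from the lecture: a 1D three-point averaging stencil in which each input value is read by three different threads. Staging a block's worth of input in shared memory adds extra copy operations, but the repeated reads then hit fast shared memory instead of global memory. The kernel name and launch configuration are just illustrative assumptions.

    // Each output element averages its neighborhood; each input element is read
    // by three threads, so we stage a block's worth of input in shared memory.
    __global__ void blur3_shared(const float *in, float *out, int n)
    {
        extern __shared__ float s[];                       // blockDim.x + 2 floats, sized at launch
        int gid = blockIdx.x * blockDim.x + threadIdx.x;   // global element index
        int lid = threadIdx.x + 1;                         // local index, leaving room for halo cells

        if (gid < n) s[lid] = in[gid];                     // one coalesced global load per thread
        if (threadIdx.x == 0 && gid > 0)                    s[0] = in[gid - 1];       // left halo
        if (threadIdx.x == blockDim.x - 1 && gid + 1 < n)   s[lid + 1] = in[gid + 1]; // right halo
        __syncthreads();

        if (gid > 0 && gid + 1 < n)                        // interior elements only, for brevity
            out[gid] = (s[lid - 1] + s[lid] + s[lid + 1]) / 3.0f;  // repeated reads hit shared memory
    }

    // Example launch, with dynamic shared memory for blockDim.x + 2 floats:
    // blur3_shared<<<(n + 255) / 256, 256, (256 + 2) * sizeof(float)>>>(d_in, d_out, n);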

Levels of Optimization Part 1

Now, just as on a CPU, there are different levels of optimization for the GPU. There's picking good algorithms in the first place. There are some basic principles for writing efficient code. There are architecture-specific detailed optimizations. And there's the sort of bit-twiddling micro-optimization at the instruction level. So, let's use a quiz to explore a few CPU examples. There's the use of vector registers such as SSE and AVX. There's the use of something like mergesort, which runs in order n log n time, versus insertion sort, which runs in order n squared time. Next is writing cache-aware code, meaning code that is likely to make efficient use of the cache. An example would be that it's generally faster to traverse across the rows of a large 2D array than down the columns; assuming that array is laid out in row order, you'll get better cache performance that way. Or you can do explicit cache blocking, for example, for the L1 cache in a CPU core. Or finally, you could approximate the inverse square root of a floating point number by reinterpreting its bits as an integer, shifting it right one bit, and subtracting it from the magic constant 0x5f3759df.

Levels of Optimization Part 1

All right, let's go through these. Picking good algorithms: obviously this is the most important level of optimization. When sorting a large random list, an order n log n algorithm like mergesort is just intrinsically going to be faster than an order n squared algorithm like insertion sort, and clearly this is a case of choosing the right algorithm. And much of high-performance CPU programming relies on making efficient use of the cache. This is sort of the number one thing to keep in mind when you're trying to write efficient CPU code: there is this big cache hierarchy, and you want to write code that's going to make good use of it. So I would consider this an example of optimization level 2; it's a basic principle of efficiency on a CPU. On the other hand, blocking for the L1 cache, which means carefully sizing your working set to fit exactly in the per-core cache on the CPU, is a detailed optimization. It's going to depend on the exact CPU model you're using, because every CPU core has a different size L1 cache. I would also put the use of vector registers like SSE or AVX intrinsics into this category; not every CPU has SSE, some have AVX, and some don't have either. So the use of vector registers and the choice of blocking for the L1 cache, cache blocking, I would consider architecture-specific detailed optimizations. These kinds of architecture-specific optimizations tend to be the domain of what I call ninja programmers, and we'll also touch on a few ninja-style GPU topics like shared memory bank conflicts. I'll try to highlight these ninja topics with this little icon here. The idea is that these are really not needed, or necessarily accessible, to every programmer; they tend to come in when you're chasing the last few percent of speedup. I think this is one of the big differences between CPU and GPU programming. On a CPU, these sorts of ninja-level optimizations can make a really big difference. For example, if you ignore the existence of SSE registers, or AVX registers on really modern processors, then you're only getting a fourth or an eighth of the power of each core on your CPU, so you're taking a huge hit in potential performance. On the GPU, for the most part, as a rule of thumb, the speedups to be gained by these sorts of ninja optimizations are usually comparatively minor. We're talking more like a 30% or an 80% speedup from the techniques we'll mark with this little ninja icon. Versus that, the speedup you might hope to get from just picking the right algorithm in the first place, or just obeying the basic principles of efficiency on a GPU, such as coalescing your global memory accesses, can often make a difference of three times, ten times, sometimes more. So just doing a good job at the higher-level principles of optimization matters more on the GPU. The ninja-level optimizations definitely help, sometimes you can get a speedup there, but it's not a vital step to extracting maximum performance the way it is on a CPU, where, if you don't start doing something to make sure you're using your vector registers, you're going to take a huge hit in terms of the efficiency you get out of your modern CPU. And finally, this last one I really just threw in there for fun.
Clearly, taking a floating point number, reinterpreting its bits as an integer, shifting it right one bit, and subtracting it from this magic constant is firmly in the category of micro-optimization at the bit-twiddling instruction level. This is an infamous hack that's been used in many places, most famously in the video game Quake 3, and if you're curious, go to Wikipedia and look up fast inverse square root. Of course, tricks like this are firmly in the ninja category, and we won't be talking about them at all today or in the rest of the course.
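For the curious, here's a minimal sketch of that hack in plain C++. It follows the well-known Quake 3 version, including its single Newton-Raphson refinement step, and is shown only as an illustration of bit-twiddling micro-optimization:

    #include <cstdint>
    #include <cstring>

    float fast_inv_sqrt(float x)
    {
        float half = 0.5f * x;
        std::uint32_t i;
        std::memcpy(&i, &x, sizeof i);     // reinterpret the float's bits as an integer
        i = 0x5f3759df - (i >> 1);         // shift right one bit, subtract from the magic constant
        std::memcpy(&x, &i, sizeof x);     // reinterpret the bits back as a float
        return x * (1.5f - half * x * x);  // one Newton-Raphson step to sharpen the estimate
    }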

Levels of Optimization Part 2

Let me give a few examples on the GPU. On the GPU, picking good algorithms really means picking algorithms that are fundamentally parallel. My favorite example of this is related to the example I gave for the CPU, where I compared mergesort, which is order n log n, with insertion sort, which is order n squared. Now, in fact, most CPU algorithms texts would probably tell you that heapsort, which is also order n log n, is a better choice than either of these; it tends to run a little better on the CPU. But it turns out that heapsort is all about updating a shared data structure called the heap, and that's a very hard thing to do in parallel. And so, in practice, mergesort is a much better choice on a parallel machine like the GPU. In terms of basic principles for efficiency, we've already discussed some of these. A big one is coalescing global memory accesses, and another is using shared memory, which is faster, for frequently accessed data within a thread block. We'll touch on just a few GPU ninja topics, like optimizing bank conflicts in shared memory and optimizing the use of registers. And I've certainly seen my share of GPU bit-twiddling micro-optimization. One example that comes to mind is the use of floating-point denormals in order to abuse the floating-point arithmetic units for additional integer math performance. This kind of stuff is just firmly in the ninja category. It's almost never worth the time or the code obfuscation that results, and we won't be talking about it at all today.

APOD Part 1

So, you can find entire books and entire courses on how to go about analyzing and optimizing code. The real point is simply that you should have a systematic optimization process. So, I'm going to introduce one systematic optimization process called APOD. APOD stands for Analyze, Parallelize, Optimize, and Deploy, and the real point of this process is that it's a cycle. There's nothing really new here; this is just good software engineering discipline. But the point is that we want to break down the process of optimization into several steps, and to remember not to short-circuit a really important one, which is to actually deploy the optimizations that you've made into the real world, get some feedback, and see how they're performing on real-world datasets. It's too easy to spend all your time circling around these two steps, optimizing in a vacuum, forgetting that you might be able to get a bigger speedup by doing a better job of thinking about what you're trying to accomplish, or by taking a partial optimization out into the real world, seeing how it performs, and getting guidance from that. So, this is really nothing more than good software engineering discipline. Let's take these steps one at a time.

APOD Part 2

So the analyze stage is really all about profiling the whole application, looking not just at the kernels that you intend to parallelize but at the whole thing, and trying to figure out where this application can benefit from parallelization, right? And how much can you expect to benefit? Once you've decided that there's a region of the code that you need to parallelize, of course there are different approaches to doing this. Certainly there's no better code than the code you don't have to write at all. So if you find that there's a parallel library for the GPU that already does exactly what you want, then great, you're done; you can simply call out to that library. Sometimes you have a lot of code and you want to do a minimal amount of work to get a little bit of parallelization, and there's an approach for this called directives. The most well-known directive standard on a CPU is OpenMP. There's another cross-platform standard called OpenACC that has emerged, where ACC stands for accelerator. This is sort of an extension of OpenMP to encompass the ideas of accelerators like the GPU, and if you're looking at GPU programming, OpenACC is a really lightweight way to experiment with it (there's a small sketch of what it looks like at the end of this section). But of course, sometimes what you want to do is actually go in and write a parallel routine, and naturally that's what we've been focusing on throughout this course using CUDA C++, so that's where we're going to focus here. Assuming that you're going to be coding something up from scratch for this purpose, the next choice is how to pick an algorithm. And this is a huge deal. This is your real chance to make a huge improvement: pick the right algorithm. All the bit-twiddling optimization in the world won't gain you nearly as much as choosing the right, fundamentally parallel-friendly algorithm. I'll give a couple of simple examples in this unit, and you'll see more examples in Unit 6. There's no recipe for picking the right algorithm; what you need to do is think deeply about what the parallelism in your problem is. But we'll try to give you a few examples so that you get a little practice. Once you've decided how to parallelize your algorithm, then you want to optimize the code, and there will be a cycle between these. As you'll see in the example, you try a parallelization, you study how well it does, and the results might suggest changes, perhaps a different way to approach the parallelization. So we're really talking about profile-driven optimization, by which I mean: measure it. Measure, measure, measure how fast things are going and base your decisions on that. Don't just take a guess at what's going to work well and what doesn't. And finally, the deploy step. Look, in this class we do these little homeworks, and they're fairly self-contained so you can get them done in a reasonable amount of time, so the process of actually deploying a GPU-accelerated code into real use is not going to come up a lot in this class. Consider this just some free software engineering advice. When you're working on real code, it's a really, really bad idea to optimize in a vacuum. What I mean by this is that you can easily spend tons of time adding tons of unnecessary complexity to your code, speeding up a kernel way past the point where it stops being a bottleneck.
It's just fundamentally a good idea to push any improvements through to the end code; you're making it real by doing that. And making it real as soon as possible ensures that you're running and profiling and parallelizing and optimizing real workloads. The other thing to remember is that even small improvements are useful. If you make your code 50% faster, or two times faster, or four times faster, that's useful, and you can push it out to the users of your code; you can start employing it and making it real, even if you think there's a factor of 20x speedup still waiting in the wings. So really, the advice here again is: be disciplined. Analyze what you're doing, decide how you're going to parallelize it, decide how you're going to optimize it by studying that code and measuring it, using profile-driven optimization. And finally, be sure to deploy frequently; deploy early and often. Remember that this whole thing is intended to be a cycle.
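Here's that minimal sketch of the directives approach mentioned above. The pragma is standard OpenACC; the SAXPY loop itself is just an illustrative example, not code from the lecture:

    // Annotate an existing serial loop and let an OpenACC compiler offload it
    // to an accelerator such as a GPU.
    void saxpy(int n, float a, const float *x, float *y)
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i) {
            y[i] = a * x[i] + y[i];
        }
    }

The appeal is that the serial code is untouched apart from the directive, which makes this a lightweight way to experiment before committing to a hand-written CUDA kernel.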

Weak Vs Strong Scaling

It's also important to understand what it is you're trying to accomplish by parallelizing the code. Say you have a serial program that solves a certain problem size P in time T. Maybe you're simulating how a certain protein molecule folds, and that simulation takes an hour to run. Weak scaling is what we refer to when we're talking about using parallel computing to run a larger problem, or perhaps to run more problems. So if you wanted to fold a bigger protein in your simulation and still take an hour, or you wanted to fold many small proteins in that same hour, that would be an example of weak scaling. Strong scaling means using parallel computing to run a fixed problem size faster. So, if you wanted to fold the same protein that you were folding in an hour, but fold it in a minute, that would be an example of strong scaling. More formally, we could say that weak scaling describes how the solution time varies as you add processors with a fixed problem size per processor, while strong scaling describes how the solution time varies as you add processors to tackle a fixed total problem size.


Now, there are really two components to the analyze stage: understanding where your application spends its time, and understanding what you want to do with the additional parallelism. Understanding hotspots is pretty straightforward. Often the programmer will have some intuition for this; they'll have some idea where the time is being spent in their code. But intuition can be wrong, and there's no substitute for data, so don't rely on intuition. Run your favorite profiler, whether that's gprof or VTune, or the brilliantly named Very Sleepy profiler, and look at how much time the various functions, or lines of code, take. You're going to get some output back that tells you how much time each function takes: this function might take 10% of the time, this function might take 20% of the total time of the program. And in this case, the program is spending 50% of its time in this one function, so there's a clear hotspot here. Maybe your intuition would've told you that this is where the time is going. Often you do know that first hotspot, but sometimes you're surprised by the second and third hotspots; sometimes they're not what you think. And this gives you a starting point for where to begin parallelizing the code, and an indication as well of how much speedup you can expect from doing so. Let's drive that home with a quiz. In this example that we've shown here, the question is, what is the maximum speedup possible that you could hope for if you parallelized this top function, the function which takes the most time? Okay, that's this one. And then I'll ask, well, what if you parallelize the top three functions? So what if you parallelize all of these?


The top function takes 50% of the run time, and that means even if you made it take zero time, the program would only go twice as fast. So, you get a 2x speedup. The top three functions between them take 50, 20, and 10%, or 80%, of the total time. So, if you were to make those all drop to zero, the program would be taking only the remaining 20% of its original time, which means the maximum speedup is 5x.

Understanding Hotspots

So, this is an example of a really key principle in parallel programming and, indeed, in all of computer science. We call that principle Amdahl's Law. The total speedup you get is limited by the portion of the time you spend doing something that can be parallelized. In the limit of enough parallelism, doing a good enough job with parallelism and having a parallel enough machine, your maximum speedup approaches 1/(1 - P), where P is defined as the percentage, or portion, of time spent on the parallelizable activity. So, it's really important that you know these limits. For example, if your application spends 50% of its time on IO, reading and writing to the disk, that implies that you can't possibly do better than a 2x speedup. Unless, that is, you can come up with a way to refactor the computation. For example, maybe you can read and write larger batches of data at once, so the GPU has more work to do on a single batch. And I want to say that this situation comes up pretty often in practice, right? GPUs are so fast that often porting the hotspot to CUDA makes that hotspot so much faster that it's no longer the bottleneck. And that's why I made the point earlier that if you have a bunch of functions, you might need to do more than port a single hotspot to the GPU. You might find that after you've ported that first hotspot to the GPU, it's gotten so much faster that now you need to focus somewhere else. And remember, once you've crossed this point, there's little point in continuing to optimize it; the total time of your application is not going to go down very much if you continue to work on this hotspot. You need to shift your attention to the new bottlenecks.
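Written out, this is just the standard statement of the law, with P the parallelizable fraction of the run time and S the speedup achieved on that portion:

    \text{speedup} = \frac{1}{(1 - P) + P/S}, \qquad \lim_{S \to \infty} \text{speedup} = \frac{1}{1 - P}

Plugging in P = 0.8 from the quiz above gives a maximum speedup of 1/0.2 = 5x, and P = 0.5 for the IO example gives the 2x limit.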


Okay, enough about analyze; let's talk about parallelize. We're going to use a running example to illustrate the concepts we'll go through over the rest of this unit, and that example is matrix transpose. It's a very simple algorithm that still has enough meat in it to illustrate a lot of the points we want to bring up. Matrix transpose just takes some matrix and exchanges the rows and columns. So, a particular element in this row, for example, might have coordinates (a,b): its i coordinate is a and its j coordinate is b. Its new i coordinate in the transposed matrix will be b, and its new j coordinate will be a. Pretty simple. To make our code even simpler, we're actually going to restrict ourselves to square matrices. And you'll remember that this came up all the way back in Unit 2 as a really common parallel communication pattern. Transposing a matrix, or doing some equivalent operation on an array, an image, or a structure of arrays, comes up literally all the time. So even though this is a simple example, as you'll see, there's a lot of opportunity to optimize it, and it's a really important one.

Transpose Code Example Part 1

Okay, let's work through some code. In this series of coding exercises, I'm going to be transposing a matrix, and the coordinates of this matrix are going to be given as i and j: i refers to the column of an element, j refers to the row of an element. Now, this matrix is going to be laid out in row-major order, meaning that in memory, all the elements of a given row are adjacent, followed by the next row, and then the next row, and so on. And remember that our goal is, for every element (i,j), to swap it with element (j,i). So, that's what the code is going to do. Let me slide this over here as a reference and bring up the code window. So, here's how you would code transpose on a serial machine like a CPU, really simple code. I pass in two matrices, in and out, and I walk through a for loop that loops over the rows, j, and within that, loops over the columns, i. This is where you can tell that the matrix is laid out in row-major order: element (i,j) in the input matrix maps to (j,i) in the output matrix. Okay, really simple. And here's how you would call that code. I allocate a few matrices, an input matrix and an output matrix, and I call a little utility routine called fill_matrix, which just puts a bunch of consecutive numbers inside the matrix. I call the transpose routine and put the result in the matrix I call gold, meaning that this is our golden reference. We know that this code is correct, and we're going to compare the various GPU versions that we write against this golden reference. Then I print it. Okay. So, let's compile and run this code.
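Since the code itself isn't reproduced in this transcript, here's a minimal sketch of what that serial reference looks like; the function names and fill pattern follow the walkthrough above, but treat the details as a reconstruction rather than the exact course file:

    const int N = 1024;  // matrix dimension (the lecture starts with a tiny 8 by 8 matrix)

    // Serial CPU transpose of a row-major N x N matrix: out(j,i) = in(i,j).
    void transpose_CPU(const float in[], float out[])
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                out[j + i * N] = in[i + j * N];
    }

    // Fill the matrix with consecutive numbers so the transpose is easy to check by eye.
    void fill_matrix(float mat[])
    {
        for (int k = 0; k < N * N; k++)
            mat[k] = (float)k;
    }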

Transpose Code Example Part 2

We compile it and run it. I made a really small matrix to begin with so we can see that it's acting correctly. As you see, our input matrix goes from 0 to 63, and the transposed version has the same numbers running down the columns instead of the rows. So we've written correct code, as we expected. We're just going to store the result of this routine in the gold matrix and use that for comparison, so we don't have to keep printing out these matrices. For the rest of the lecture I'll keep a smaller window, so that we can see the results without actually printing out the matrices. Okay, so that's how you code transpose on a serial machine like a CPU, and that's how you call it. So let's do our first CUDA version. We'll copy this, paste it, make it a kernel, and rename it. So now we've got a kernel that will run on the GPU, and we call it. We need to do a few things: we need to allocate space for the input and output matrices on the device, so we allocate those, and we need to copy the input matrix onto the device. Then comes the CUDA kernel launch. What you see here is that I've taken that same serial code, made it a kernel, and called it with a single thread running in a single block, passing in d_in and d_out. Then I copy back my result and we're done. See, that was quick. GPU programming is easy. I'm actually making an important point here: this code is going to run correctly, and it was really easy to write. The problem, of course, is that it won't perform very well, right? It's using a single thread, which is a really small fraction of the horsepower of a GPU, especially a high-end GPU that can run tens of thousands of threads concurrently. So let's go ahead and time this. Okay, so I've added some timer code. We start the timer and measure the kernel, and I've also added a helper function called compare_matrices, which just compares the output to the golden matrix that we calculated, to make sure that we did the transpose correctly. Let's compile again, run again. Great, so we performed the transpose successfully and it took some very small fraction of a millisecond. But this was a really small problem. Let's make the problem bigger, say a 1024 by 1024 matrix. So now we've got roughly a million elements to touch, and we're going to do it in a single thread. Okay, now this is taking almost half a second; that's a long time.
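Again as a reconstruction of what's described above (not the exact course file), the first CUDA version looks something like this:

    const int N = 1024;  // matrix dimension

    // The same serial loop nest, now run as a kernel by one thread in one block.
    // Correct and easy to write, but very slow.
    __global__ void transpose_serial(const float in[], float out[])
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                out[j + i * N] = in[i + j * N];
    }

    // Host-side launch, after cudaMalloc'ing d_in and d_out and copying the input over:
    // transpose_serial<<<1, 1>>>(d_in, d_out);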

Transpose Code Example Part 3

So, let's keep track of what we're doing. Version 1 is serial in a single thread, and it took about 466 milliseconds. Now, 466 milliseconds is pretty slow, but sometimes that's okay. For code that's only going to be executed once, for code that's not performance critical at all, or for code that's going to run on a really small dataset, like that 8 by 8 matrix we started with, it's just not worthwhile to optimize the heck out of it. So, even though this simple serial kernel may seem very naive, sometimes it's really the right thing to do. Keep this in mind when you're optimizing: think about what you need to optimize and whether it's important. Now, let's assume that performance really is critical in this section; that's why we're optimizing it. Let's go back to the code and see what we can do. An easy step would be to launch one thread for each row of the output, okay? So here's the code that does that. In this code, we launch one thread per row of the output matrix, so the value of i is fixed by the thread ID, and every thread executes just the outer loop of the code we saw before; the inner loop we're essentially handing off to be run across many different threads. So these two codes are almost identical, and the only difference is that we're launching threads instead of looping over values of i. Let's time this. Okay, so here's the code for calling our new function. We launch the function, transpose_parallel_per_row, as a kernel, in a single thread block consisting of N threads. Remember, N is the size of our matrix, currently 1024. We pass in the input matrix, get back the output matrix, copy it back, and then print out the timing and verify the result. Let's compile and run this code. Okay. Transpose serial ran in 484 milliseconds, roughly what we saw before. Transpose parallel per row is running in 4.7 milliseconds. So, obviously, we're making a huge improvement just by parallelizing this across the threads of a single thread block. So, let's note that down.
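A sketch of that per-row version, reconstructed from the description rather than copied from the course file:

    const int N = 1024;  // matrix dimension

    // One thread per row of the output: i is fixed by the thread index and each
    // thread loops over j.
    __global__ void transpose_parallel_per_row(const float in[], float out[])
    {
        int i = threadIdx.x;                 // which output row (input column) this thread handles
        for (int j = 0; j < N; j++)
            out[j + i * N] = in[i + j * N];
    }

    // Host-side launch: a single thread block of N threads.
    // transpose_parallel_per_row<<<1, N>>>(d_in, d_out);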

Transpose Code Example Part 4

So, for our next version, let's parallelize this per element. We essentially replaced the inner loop with a thread launch; now let's replace both the outer and inner loops with the thread launch. And I'm going to leave this as a programming exercise for you to do. We'll give you the same code that we've been using here, with a little bit more instrumentation that should be pretty obvious. Go ahead and run this code and verify the timings; they might be different on whatever GPU and system you're running on. Then add a new kernel which performs the transpose on a per-element basis, and see how it performs. Now, one thing to be aware of when you're doing this programming exercise is that you can only launch up to 1024 threads in a thread block. So, I suggest you organize your code to use multiple thread blocks, each of dimension K by K, and use a value like 16 for K for now. We're going to play around with this value a little bit later.

Transpose Code Example Part 4

Okay, so here's my solution to that. I started by adding a constant K, set to 16, and launching K by K thread blocks, each of which is responsible for a tile of the matrix. Let's draw that: we'll be processing this matrix in blocks called tiles, each tile corresponding to a single thread block. Here's the new kernel I wrote. It's very similar to the previous one; the difference is that now I need to compute both the i and j values, and that computation involves using the block index as well as the thread index. So the block index .x times K, plus the thread index .x, gives my i value, and the same way I get my j value from the y components. So now the threads in my grid have an overall index in x and y, as well as an index within their thread block, which is the thread index. I compute that x and y, and it corresponds to the i and j of the output matrix; that's the calculation that's going on here. And finally, I do the same final calculation that I've done all along: I go and grab an element from the input matrix at coordinates (i,j), and I write it into the output matrix at the location corresponding to element (j,i). And here's how I'm going to call it. Now I need to define the number of blocks, which is simply N/K by N/K, and the number of threads per block, K by K. I'm being a little lazy here, you can tell: I'm assuming that N is a multiple of K. And now when I launch, instead of launching a single block I launch N/K by N/K blocks, and instead of launching N threads I launch K by K threads per block. Let's go ahead and compile and run this. Okay, now we're talking. So now we're down to about 0.67 milliseconds. Let's make a note. Think about what we've done here: we've now extracted the maximum amount of parallelism from the application, right? We started with a single serial code, one thread doing all the work. We switched to a code that used one thread per row, and now we're down to something that uses one thread per element. There's really no more parallelism that we can extract here. And you can see that increasing the parallelism has really helped us.
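Reconstructed from that walkthrough (again a sketch rather than the exact course file):

    const int N = 1024;  // matrix dimension
    const int K = 16;    // thread-block (tile) dimension; assumes N is a multiple of K

    // One thread per element: both loops are replaced by the thread launch.
    __global__ void transpose_parallel_per_element(const float in[], float out[])
    {
        int i = blockIdx.x * K + threadIdx.x;  // overall x index
        int j = blockIdx.y * K + threadIdx.y;  // overall y index
        out[j + i * N] = in[i + j * N];        // out(j,i) = in(i,j)
    }

    // Host-side launch: an N/K-by-N/K grid of K-by-K thread blocks.
    // dim3 blocks(N / K, N / K);
    // dim3 threads(K, K);
    // transpose_parallel_per_element<<<blocks, threads>>>(d_in, d_out);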

Maximum Parallelism Is Not Always Best For Performance

I want to make a point here that's a bit of a ninja topic. It turns out that exploiting every last bit of parallelism isn't always the very best performing approach; sometimes it helps to do more work per thread. This leads to an advanced optimization technique called granularity coarsening that we'll talk about later. That said, the first priority is almost always to find enough parallelism. So keeping that in mind, are we done? Is this 0.67 milliseconds the fastest that we can transpose this matrix on this GPU? Let's reason that out. Two things can limit your performance on any code: time spent fetching and storing data from and to memory, or time spent performing compute operations on that data. Now, the transpose code has almost no computation at all; it's entirely about moving data around. So let's ignore compute for the moment and focus on memory. My question is, are we moving that data around efficiently? How can we tell? There's a handy utility called deviceQuery that's included in the CUDA SDK. Let's run it. deviceQuery spits out an enormous amount of information, most of which you really don't need to know right now. But buried in here are a few things that I want to point out. The GPU clock rate is how fast the actual processors in the GPU are running. The memory clock rate shows you how fast the memory in the GPU is operating. And the memory bus width describes how many bits of memory are actually being transferred on each of those clock cycles. So from this we can figure out the maximum speed of the memory, the maximum bandwidth, the maximum amount of data that we can transfer in a second.

Theoretical Peak Bandwidth

So, here are the two important things from that whole deviceQuery output: the memory clock rate and the memory bus width. Given those, what is the theoretical peak bandwidth? And I'd like the answer in gigabytes per second, because that's how we'll usually hear it described.

Theoretical Peak Bandwidth

So, 2508 megahertz is 2508 times 10 to the 6th clocks per second. Multiply that by the memory bus width, in bytes transferred per clock, and you get the maximum theoretical peak bandwidth of the memory system, which comes out to just over 40 gigabytes per second on this GPU. Now, as a rule of thumb: if a kernel is achieving something like 40 to 60% of the peak memory bandwidth, we'd say, well, that's doing okay. It's not great, there's probably room for improvement. If you can get into the 60 to 75% range, that's doing pretty well; you might not be able to improve on that. And any time you get over 75%, we would consider that excellent, okay? You'll never achieve this theoretical peak bandwidth on any real, substantial code. This is literally just what you get from multiplying out the clock rate and the memory bus width, and real code is going to have additional overhead. So, if you can get over 75% of that, you're doing really well, and you probably don't need to optimize the memory further. So, how well are we doing in this code?
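As a worked equation (the 128-bit bus width here is an assumption about the GPU used in the lecture, chosen to be consistent with the roughly 40 GB/s figure above and the percentages quoted later):

    \text{peak bandwidth} = 2508 \times 10^{6}\ \tfrac{\text{clocks}}{\text{s}} \times \frac{128\ \text{bits}}{8\ \text{bits per byte}} \approx 4.0 \times 10^{10}\ \tfrac{\text{bytes}}{\text{s}} \approx 40\ \text{GB/s}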

What Bandwidth Does Our Kernel Achieve

Let's make that another quiz. So, what bandwidth does our kernel achieve? Remember, the last version I showed you ran with a value of N equal to 1024, and I told you it took 0.67 milliseconds. And once again, I'd like the answer in gigabytes per second.

What Bandwidth Does Our Kernel Achieve

Our transpose code is reading 1024 by 1024 elements. Each element is 4 bytes, and we are both reading and writing them, so we have to transfer those values across the bus twice; that's the total amount of memory traffic. It's taking 0.67 milliseconds, which works out to about 1.25 times 10 to the 10 bytes per second, or, scaled to gigabytes, about 12.5 gigabytes per second. So, we're not really coming that close to the theoretical peak bandwidth; we can probably do better.
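As an equation, that calculation is:

    \text{achieved bandwidth} = \frac{1024 \times 1024 \times 4\ \text{bytes} \times 2\ \text{(read + write)}}{0.67 \times 10^{-3}\ \text{s}} \approx 1.25 \times 10^{10}\ \tfrac{\text{bytes}}{\text{s}} \approx 12.5\ \text{GB/s}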

Measuring Memory Usage of Our Transpose Code

So, if we go through this analysis for all of these kernels, we'll see that our parallel per-element version of the code achieves 12.5 gigabytes per second. Our parallel per-row version of the code gets about 1.8 gigabytes per second. And our serial version of the code gets an abysmal 0.018 gigabytes per second; this is roughly the speed of a carrier pigeon. A better way to think about this, perhaps, is not in absolute numbers, but as a percentage of what the particular GPU we're using can achieve. If we work out the percentages, we're achieving something like 31% of theoretical peak bandwidth with our highest performing kernel, 4.5% of peak bandwidth with our per-row kernel, and less than a tenth of a percent with our serial kernel. So, back to the question: why is this number so low? Well, we can take a pretty shrewd guess: whenever you see really low DRAM utilization, really low percentage bandwidth, your first guess is always coalescing. A way to think about coalescing is that the GPU is always accessing global memory, accessing the DRAM, in pretty large chunks, 32 or 128 bytes at a time. This means that we need the fewest total memory transactions when the threads in a warp access contiguous, adjacent memory locations. So, this is an example of good coalescing: every thread is either reading or writing an adjacent memory location. And clearly, if the threads in a warp are reading and writing completely random locations in memory, then you're going to get poor coalescing, right? If these accesses are spread out all over the memory, then the total number of chunks of memory that we have to read could be as large as the number of threads in the warp. So, a random access pattern clearly leads to bad coalescing. A much more common access pattern is what's called strided, and this is where threads access a location in memory that's a function of their thread ID times some stride. So for example, thread 0 might access location 0, thread 1 location 2, thread 2 location 4, thread 3 location 6, and so on. In that case, that would be a stride of 2, because there are two elements between thread accesses. And strided accesses range from okay, like in this case, where a stride of two elements really only doubles the number of memory transactions, all the way to really, really bad. You can imagine that if the stride between elements is large enough, then every thread in the warp is accessing a completely different 32- or 128-byte chunk of memory, and that behaves just like the random access pattern: you're guaranteed to be maximizing the number of memory transactions you have to do. So, let's look at the code for our kernels. Here's where we're reading from the input matrix, and this actually works out pretty well. Every thread is reading a value in memory at a location equal to some large offset, j times N, plus i. And if you look at i, i is really the thread index plus some offset. So adjacent threads, threads with adjacent thread indices in x, are reading adjacent values of the input matrix. That's exactly what we want. So, this is good coalescing. On the other hand, when we write the output matrix, adjacent threads, threads with adjacent values of i, are writing to places separated in memory by N, right? And N was 1024. So, adjacent threads are writing memory locations that are 1024 elements away from each other. This is clearly bad, this is bad coalescing. Bad, bad, bad, bad. This is, in fact, the root of our problem.
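As a small illustration (not code from the lecture), here are two copy kernels that differ only in their access pattern; the kernel names and the stride parameter are just for this example:

    // Adjacent threads touch adjacent addresses: a warp's 32 accesses fall into
    // just a few 32- or 128-byte transactions.
    __global__ void copy_coalesced(const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];
    }

    // Adjacent threads are 'stride' elements apart: with a large enough stride,
    // every thread in the warp lands in a different chunk of DRAM.
    __global__ void copy_strided(const float *in, float *out, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        out[i] = in[i];
    }

The per-element transpose kernel's write, out[j + i*N], is exactly the strided case with a stride of N.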

Most GPU Programs Are Memory Limited

This is a really important optimization pattern. So, let me emphasize it. In practice, most well-tuned GPU codes are memory limited. I'll repeat that. Most, not all, but most GPU codes are memory limited. So, always start by measuring your achieved bandwidth to see if you're using memory efficiently. And if not, ask yourself, why not?

Tools to Measure Bandwidth Utilization

It's important to be able to reason about this the way that I just walked you through, right? We figured out what kind of bandwidth we were getting and what percentage of theoretical peak that was; we saw that it was really quite low; and we asked, why would we be getting low bandwidth to global memory? Well, the first thing you always look at there is coalescing. And then we inspected the code and convinced ourselves that, yes, there's bad coalescing happening when we write to the output matrix. But I also want to make the point that you don't have to do this from scratch every time. Doing all these calculations by hand is a little bit like rubbing two sticks together to start a fire: it's good to know how, but there are tools to help you do this. The tool that we're going to be using is called Nsight. This is an NVIDIA product; there are also third-party products, and maybe I'll give some links to those in the supplementary material. If you're using Linux or a Mac, like I am, then you'll be using the Nsight Eclipse Edition; if you were using Windows, you'd be using the Nsight Visual Studio Edition. These are integrated debuggers and profilers, full-blown development environments. The part that we're going to use is called the NVIDIA Visual Profiler, or NVVP. Let's fire that up now.

Using NVVP Part 1

So, this is NVVP. It's a very powerful tool, it has a lot of functionality, and we're only going to touch on it today. My real point is simply to show you that these tools exist, and that even though in this class we're going to focus on understanding the concepts and being able to reason about these things from scratch, I do want you to understand that you don't have to always do this from scratch, and that you should use the tools that are available. So, the first thing we're going to do is pick an executable to run. This is where we tell it what executable to run; we're going to run our transpose code. Before we do that, I'm actually going to go into the transpose code and comment out the section of it that runs the transpose serial kernel, that first kernel that we wrote. That one is so slow that I don't really want to spend any more time on it, and as you'll see, the profiler takes a little time to run the code several times and acquire statistics about it. So, I comment that out, go back, and recompile my code. And now I'm ready to go back to NVVP, tell it what executable we're going to run, and here we go. It runs the program, it measures all of the calls to the CUDA API, things like malloc and memcpy, and it measures the time taken by every kernel. Here's the transpose parallel per-row kernel. Let me zoom in on that. You can see that the parallel per-row kernel started about 110 milliseconds into the execution of the program and finished at 119 milliseconds. It ran with a grid size of 1 by 1 by 1, so it launched a single block, and that block had 1024 threads. You can get a bunch of other statistics from this. If we scroll to the right, we can see the much shorter time taken by our transpose per-element kernel. I'll zoom in even further. And this one, as you see, ran in a larger grid of 64 by 64 blocks, each of which had 16 by 16 threads. Now, I'll make the point that we measured a shorter time than that earlier; the times that you see in the profiler are not going to match up exactly with the timings we measured a little bit ago, when we were outside the profiler, and that's to be expected.

Using NVVP Part 2

But the interesting thing that we can do in something like NVVP is go in and analyze the program. This runs the program many times, collecting a bunch of statistics about it and averaging them together. And now that we've done this analysis, we have a lot more information about these kernels. So if I click on this, you see I have more statistics over here. The one that I want to highlight for you is the global load efficiency, which tells us how efficient our global loads were. In other words, of all of the bytes that we fetched with each memory transaction, how many of them were actually useful? 100%, looks pretty good. That's what we would expect from fully coalesced accesses. The global store efficiency, so now our stores to global memory, which in our case means writing the output matrix, achieved 12.5%, and that's pretty wretched. Right, and that's again what we would expect from having inspected the code. Our total DRAM utilization is down at 7.6%; again, remember, this is not going to exactly match what we calculated outside the profiler, but we can tell there's a problem. That was the parallel per-row kernel, which only had a single thread block. We can see that the parallel per-element kernel doesn't do much better. It has slightly higher global store efficiency and very slightly higher DRAM utilization. So we've still got this problem: clearly our ability to write to the output matrix is hampered, and we're not achieving the bandwidth that we ought to. As I said before, I'm not intending to go through all the many, many things that you can analyze in NVVP. I'll pull it up once or twice again in the course of this lecture, just to illustrate that there are tools to help you figure out what's going on. You can see, in fact, that down here it's analyzing the program for you and giving you some suggestions. It's saying, look, the multiprocessors in your program are mostly idle, you're not getting a lot of work done for the total amount of time this program runs, and your total compute to memcpy efficiency is low; in other words, you're not doing a lot of computation given the amount of time you spend doing memcpys. Here are the memcpys in our timeline. You can see that we spent, let's see, 2.6 milliseconds copying information in, then about 8 milliseconds processing it, and then another 2.6 milliseconds copying it out. So it warns us, hey, the total compute to memcpy efficiency is low. So these are really useful tools; you should use them, you should know about them. But we're not going to rely on them too much in this class, because our point is to teach you how to reason about things from first principles. So our problem is bad coalescing on the write to the output matrix. What can we do about that?


So, our problem is that we're achieving coalesced reads but scattered writes, and our goal is to achieve coalesced reads and coalesced writes. Clearly, swapping the order of the reads and writes wouldn't help, because then we'd simply have scattered reads and coalesced writes. So, how do we achieve both of these? The solution is going to use a really important general strategy, so I'll spend a little time on it. The idea is that we're going to take a tile of the input matrix at a time, and we're going to transpose it and copy it into its transposed location in the output matrix. This is going to be the job of a single thread block, so the threads in this thread block are going to work cooperatively together to perform this copy and transpose of a tile of elements at a time. The threads in the thread block copy the tile into shared memory, the shared memory belonging to that thread block, and they perform the transpose there. So now the transpose of the elements in this tile is happening in shared memory, where you don't pay that tremendous cost that you see in global memory for doing a scattered write or read. And finally, the threads in the thread block work together to copy the elements out. The key is this: if our tile is large enough, say it's a K by K tile where K is 32, then each warp will copy a chunk of 32 elements at a time into shared memory. Because all 32 threads in that warp are reading and writing adjacent locations in memory, you'll get good coalescing, and then, when you copy the transposed tile to its new location in global memory, you can once again do it in a coalesced fashion. So, let's try this as a programming exercise. We'll give you the start of the tiled transpose kernel and the code that calls it. You should modify the kernel code to declare an array in shared memory and copy the appropriate elements in and out of that shared memory, so that the final elements are written out in transposed fashion. And don't forget to add any syncthreads barriers that you need to make sure you get the correct result.


So here is my code for this. I'll start by setting K equal to 32, and here is the actual code. I begin by figuring out the locations of the tile corners. This tells me where I need to start writing in the output and start reading from the input, so it's just a little bookkeeping, giving things variable names that mean something to me. But as you can see, the i value where I start reading is a function of which block we're in times the width of the tile, because each thread block is responsible for one tile. The j value is the same but in y instead of x, and the output simply swaps y and x. Okay, so now that I know where I need to read and write my tile, I want to know which element of the tile this thread reads and writes. Just as a shorthand to make the code a little more readable, I'm going to set x to threadIdx.x and y to threadIdx.y. So now, the code itself is really simple. I declare my floating point array in shared memory, a K by K tile, and I read from global memory and write the result into shared memory. So here's my read from global memory; it's a function of where the tile starts plus which element this thread is responsible for within the tile. To avoid an extra syncthreads, I'm going to go ahead and write this into shared memory in transposed fashion, so it's not tile[x][y], it's tile[y][x]. That saves one of these syncthreads barriers. Now I've got the transposed tile sitting in shared memory. It's already been transposed, and I want to write it out to global memory, and I want to write it out in coalesced fashion, so that adjacent threads write adjacent locations in memory. In other words, adjacent threads vary in x the way I've set this up. So, here's my write to global memory, after my read from shared memory. You could have done this with two syncthreads: you could have read the tile into shared memory, performed the transpose, and written it out to global memory, and you would have needed a syncthreads after reading it into shared memory and again after performing the transpose. If you did that, I encourage you to go back and convert it to this single-syncthreads version and see how much faster it goes. Let's go ahead and run this on my laptop. Okay, there are two interesting things to note here. One is that the amount of time taken by the parallel per-element code, the kernel that we had before, actually went up. It's almost twice as slow now as it was before, and if you think about it, this code didn't change at all, except that we changed the value of K; we changed the size of the thread block that's being used in this code. We're going to come back to that; it's going to give us a hint as to a further optimization. In the meantime, transpose_parallel_per_element_tiled, our new version, is a little bit faster, but not a lot faster, and that's kind of disturbing. We should have gone to perfectly coalesced loads and stores, which should have made a difference. Let's go ahead and fire up NVVP again and see what happened.
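Putting that description together, here's a sketch of what the tiled kernel looks like, reconstructed from the walkthrough above rather than copied from the course files; it assumes N is a multiple of K and that both are compile-time constants:

    const int N = 1024;  // matrix dimension
    const int K = 32;    // tile and thread-block dimension

    __global__ void transpose_parallel_per_element_tiled(const float in[], float out[])
    {
        // (i,j) corners of this block's tile in the input and output matrices
        int in_corner_i  = blockIdx.x * K, in_corner_j  = blockIdx.y * K;
        int out_corner_i = blockIdx.y * K, out_corner_j = blockIdx.x * K;

        int x = threadIdx.x, y = threadIdx.y;

        __shared__ float tile[K][K];

        // coalesced read from global memory; store into shared memory with the
        // indices swapped, so only one barrier is needed
        tile[y][x] = in[(in_corner_i + x) + (in_corner_j + y) * N];
        __syncthreads();
        // read back out of shared memory and do a coalesced write to global memory
        out[(out_corner_i + x) + (out_corner_j + y) * N] = tile[x][y];
    }

    // Host-side launch: an N/K-by-N/K grid of K-by-K thread blocks.
    // dim3 blocks(N / K, N / K);
    // dim3 threads(K, K);
    // transpose_parallel_per_element_tiled<<<blocks, threads>>>(d_in, d_out);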

Profiling the Tiling Code

Okay, so here we've run NVVP on the application again. I'll zoom in on these later kernels. Here's our transpose per-element tiled kernel, the one we just wrote. As you can see, it's running with a 32 by 32 grid of thread blocks, each of which has 32 by 32 threads, and sure enough, we're achieving 100% global load efficiency and 100% global store efficiency. And yet, our DRAM utilization has actually gone down slightly. So what's going on? Why is our achieved bandwidth still so low? The answer is going to come down to this statistic here, shared memory replay overhead. But before we get into the details of shared memory replay overhead, what it means and what to do about it, I want to back up for a little bit and talk about a general principle of what makes the GPU fast.

Little's Law

Let's step back for a bit and remind ourselves why we're focusing on memory the way we are. Our overarching goal, of course, is just to make code fast, so we say, great, GPUs are fast, let's use those. But why are they fast? GPUs are fast, first, because they are massively parallel, with hundreds or thousands of processors on a single chip working to solve your problem, but also because they have an extremely high bandwidth memory system to feed those massively parallel processors. So, if the memory system can't deliver data to all of those processors and store results from all of those processors, then we're not going to get the full speed out of our GPU. And that's why, on a memory limited kernel like transpose, our subgoal is really to utilize all the available memory bandwidth. Hence our focus on global memory coalescing, DRAM utilization, and so on. Now, I want to ask the question a little bit more rigorously: what do we mean by utilizing all the available memory bandwidth? This brings us to a very important, very simple principle called Little's Law. Let's have the talented Kim Dilla illustrate this for us. Now, John Little is an MIT professor who studies marketing. He formulated his eponymous law when writing about queuing theory in business processes, and Little's Law is usually used to reason about things like optimizing the number of customers in a line at Starbucks, or maybe the size of queues in a factory. But Little's Law is really very general and can be applied to many things, including memory systems in computers. In that context, Little's Law states that the number of bytes delivered equals the average latency of each memory transaction times the bandwidth. Let's be a little more precise and emphasize that we care about the useful bytes delivered; the problem with uncoalesced global memory accesses is that not all of the bytes in every memory transaction are actually being used. That's why coalescing global memory accesses helps ensure that every byte delivered in a memory transaction will be used. So, given this definition, what can we do to improve our bandwidth? Check all that apply. We can increase the number of bytes delivered, we can increase the latency, meaning the time between memory transactions, we can decrease the number of bytes delivered, or we can decrease the latency, or time between transactions.

Little's Law

That's right, we can increase the number of bytes delivered per transaction or we can decrease the latency, the time between transactions. To increase the number of bytes delivered, the Starbucks analogy might be to have many servers simultaneously serving up multiple cups of coffee. Whereas, decreasing the latency means having your servers work faster, taking less time to take an order, create a coffee, and deliver it back.
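Rearranging the law the way the lecture states it makes those two levers explicit:

    \text{useful bytes delivered} = \text{average latency} \times \text{bandwidth}
    \qquad\Longleftrightarrow\qquad
    \text{bandwidth} = \frac{\text{useful bytes delivered}}{\text{average latency}}

So to raise bandwidth you either raise the numerator (more useful bytes in flight, from more transactions or wider ones) or shrink the denominator (less time between transactions).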

Little's Law for GPUs

So, let's look at Little's Law for GPUs. To recap, Little's Law states that the number of useful bytes delivered is equal to the average latency of a memory transaction times the bandwidth. Now, what are some implications of this? First of all, there's a minimum latency to take a signal or piece of data all the way from an SM to somewhere on the DRAM, or to take information from the DRAM and pull it into an SM. You can find the details for your particular GPU online, but in general, any DRAM transaction is going to take hundreds of clock cycles. And by the way, this isn't a GPU thing; this is true of all modern processors. A clock cycle on a modern chip takes half a nanosecond, for example, on a 2 gigahertz chip. And even the speed of light, you know, light doesn't go very far in half a nanosecond, and electricity is even slower, especially on the tiny wires you find in computer chips. So, to go from somewhere inside the GPU off the chip, over a wire somewhere on the board, into the DRAM, get a result, and go all the way back takes hundreds and hundreds of clock cycles, many, many nanoseconds. This means that a thread that's trying to read or write global memory is going to have to wait hundreds of clocks, time that it could otherwise be spending doing actual computation. And this, in turn, is why we have so many threads in flight. We deal with this high latency of hundreds of clocks between memory accesses by having many, many threads able to run at any one time, so that after one thread requests a piece of data from global memory, or initiates a store to global memory, another thread can step in and do some computation.

Useful Bytes Delivered

I find it helpful to think of the memory system as a pipe. Threads issuing requests stuff memory transactions into that pipe; for example, these could be load instructions to read a certain address in memory. And the result of that transaction eventually falls out the bottom of the pipe to head back to those threads. Now, the pipe is really deep, it takes 200 or 300 clock cycles for a transaction to move through it, and the pipe is also really thick and wide. It's designed to be filled with many transactions at the same time, from lots of SMs running lots of threads. So, when you only have a few threads issuing transactions, the pipe is mostly empty. This could happen, for example, if you don't have all of your SMs actively filled with threads that are issuing transactions, or if the latency between the transactions coming from each thread is too high. Right, so if we only have a few threads issuing transactions, the pipe is mostly empty and not many bytes are being delivered, and Little's Law tells us that if not many bytes are being delivered, then our total bandwidth is going to suffer. So you can make this better by having more threads issuing memory transactions, or by having more memory in flight per thread. Here's one of those strategies. This is measured data from taking a GPU and copying a bunch of data from one matrix to another, just like we're doing, and the important thing to notice is that we're doing it in three different styles: with a 4-byte word, which is the equivalent of a single precision floating point number and is what our code is doing; with an 8-byte word, which is what we'd be moving if we were moving double precision floating point around; or with a 16-byte word. And there exist CUDA data types in the 4-, 8-, and 16-byte variety: for example, float, which is what we've been using so far; float2, which is 2 adjacent floating point numbers; and float4, which is 4 adjacent floating point numbers. So one option would be to restructure our code around the idea of having a single thread pull in four floating point numbers at a time and do its transpose on those. That would move us up from this line, in terms of percent of achieved bandwidth, up to this line, and that's a healthy increase. On the other hand, torquing the code around to do something that's less natural, like manipulating four single precision floating point values at a time, has a cost. In this case, it's not a particularly natural thing to do: what we have coming in is a matrix of floating point numbers, and to read them four at a time, in little bitty rows, and transpose those into little bitty columns, we can certainly do it, but I think it borders on a ninja topic. While this would help, it's the kind of ninja optimization that can prove useful, but I don't think it's vital. Let's talk about other things we can do. So if we can't make all of our memory transactions wider very easily, then we can try to have more transactions. Let's have a quiz. Which of our transpose kernels so far probably suffer from too few transactions? Our first version had a single thread for the entire matrix. Our next version had a single thread for every row of the matrix. Our third version had a single thread for every element, and our fourth version was tiled. Check those which probably suffered from too few transactions.
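For reference, the wider-word idea mentioned above looks something like this in code; a sketch, not from the lecture, applied to a plain copy rather than the transpose:

    // Each thread moves one 16-byte float4 instead of one 4-byte float, so the
    // same number of threads puts four times as many bytes in flight. Assumes
    // the arrays are float4-aligned and count is the number of float4 elements.
    __global__ void copy_float4(const float4 *in, float4 *out, int count)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < count)
            out[i] = in[i];   // one 16-byte load and one 16-byte store per thread
    }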

Useful Bytes Delivered

Well, having only one thread issuing memory transactions certainly limits the number of bytes delivered, and thus our total achieved bandwidth. Version 2, with a single thread per row, still only had a single thread block running on a single SM issuing transactions, and again, that is not going to be enough to fill this big fat pipe. Version 3 had plenty of parallelism and plenty of transactions in flight; the problem there was that the uncoalesced write operations limited our useful bytes delivered. And our current version of the code, which uses shared memory to achieve better coalescing among the threads and get higher numbers of useful bytes delivered for every transaction, still uses one thread per element and therefore is issuing plenty of transactions to fill the pipeline. So, where does that leave us? We're still achieving lower bandwidth than we would have expected, and we know that we're delivering plenty of useful bytes per transaction. So, the problem really must be the average latency. There must be something that's keeping the time between transactions higher than it should be to fill this pipeline.

Reducing Average Wait Time Per Thread

Let's look at the code again. So, this is our tile per element kernel. And this syncthreads is the problem. Remember, we have a thread block of K by K threads, and at the moment, K is set to 32. So, we have a thread block of 1024 threads, and if you think about it, each thread is doing very little work. It's reading a single word into shared memory. And then, it's waiting around at the barrier until all the other threads get done with their job, and then it's writing a single word to memory. So, most of the threads in this block are spending most of their time waiting on the syncthreads call, waiting on this barrier for everybody else to get there. And the more threads in the block, the more time on average they spend waiting. So, for a simple kernel like this, most threads spend a large chunk of their lifetime simply waiting at this barrier until the last few threads complete their read operations. So, what can we do to reduce the average wait time per thread? Should we eliminate the syncthreads call? Should we reduce the number of threads per thread block? Or should we increase the number of threads per thread block, or perhaps increase the number of thread blocks per SM?
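The transcript refers to code shown on screen; here is a sketch of what a tiled transpose kernel along these lines typically looks like, so you can see where the __syncthreads() barrier sits. The constant K, the variable names, and the assumption of an N-by-N matrix with N divisible by K are mine.

```cuda
#define K 32  // tile (and thread block) is K x K

__global__ void transpose_tiled(float *out, const float *in, int n)
{
    __shared__ float tile[K][K];

    // Global coordinates of the element this thread reads.
    int x = blockIdx.x * K + threadIdx.x;
    int y = blockIdx.y * K + threadIdx.y;

    // Coalesced read of one tile of the input into shared memory.
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];

    // Every thread waits here until the whole tile has been loaded.
    __syncthreads();

    // Global coordinates of the element this thread writes (tile transposed).
    x = blockIdx.y * K + threadIdx.x;
    y = blockIdx.x * K + threadIdx.y;

    // Coalesced write: adjacent threads write adjacent addresses,
    // pulling the transposed element out of shared memory.
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}
```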

Reducing Average Wait Time Per Thread

Well, we need the syncthreads call for correctness. This is what ensures that all the data in the tile has been placed in shared memory before we attempt to copy it to its transposed location in global memory. So, we can't really eliminate the syncthreads call. Now, the more threads in the block, the more time on average they spend waiting. So, reducing the number of threads that are actually waiting at the barrier, by reducing the number of threads in the block, will work, whereas increasing the number of threads per block will make the problem worse. And this final one is a little more subtle and again verges on a ninja-level optimization. Every thread block runs on a single SM, but every SM can potentially hold more than one thread block. In that case, the threads in one thread block can still be making progress while the threads in the other thread block are gathering at a syncthreads barrier. So this final strategy of increasing the number of blocks per SM can actually be a good one as well.

Occupancy Part1

So, what limits the number of thread blocks that an SM can run? Let me digress a bit to answer this question, which leads into the related topic of occupancy. You'll hear this term a lot if you pay attention to the CUDA forums, or if you watch presentations on optimizing CUDA code. Each SM has a limited number of resources. There's a maximum number of thread blocks allowed on an SM, which turns out to be 8 on current GPUs. There's a maximum number of threads that a single SM can run across all of the thread blocks on it; this ranges on modern GPUs from about 1536 threads on, for example, the Fermi-based GPUs that you use on Amazon, up to 2048 threads on the Kepler GPU in my laptop. Every thread running a given kernel takes a certain number of registers, and there's a total number of registers for all of the threads on the SM, equal to 64K on most GPUs. And finally, there is a limited number of bytes of shared memory, either 16K or 48K on modern GPUs. As for the maximum of 8 thread blocks, usually one of these other limits gets in the way first: the total number of threads that you want to run across all the thread blocks, the total number of registers that each thread is going to take, or the total number of bytes of shared memory that each thread block wants to use. So for example, if I am running on a GPU with 48 kilobytes of shared memory and a single thread block of my kernel requires 16 kilobytes of shared memory, then I can run at most 3 blocks on that SM. Now, if that same GPU has a maximum of 1536 threads per SM, but my kernel uses 1024 threads per block, then I can only run one block per SM. So, in this case, even though my kernel uses little enough shared memory that I'd be able to get 3 blocks per SM, the sheer number of threads in my kernel is preventing me from running more than 1 block per SM.
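If you'd rather query these limits from code than look them up, cudaGetDeviceProperties reports most of them. A minimal sketch; the field names come from the CUDA runtime's cudaDeviceProp struct, and the actual values depend on your GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    printf("SMs:                     %d\n", prop.multiProcessorCount);
    printf("Max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max threads per block:   %d\n", prop.maxThreadsPerBlock);
    printf("Registers per block:     %d\n", prop.regsPerBlock);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```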

Occupancy Part2

And if I can only run one block on that SM, then I'm getting a maximum of 1024 threads on that SM. So my occupancy is not as high as the SM could handle. You can get these vital statistics for the particular GPU you're programming to by looking them up in the CUDA reference manual or by making various CUDA calls on the GPU. And the CUDA SDK ships with a sample called deviceQuery, which we saw earlier, that conveniently collects all of this information into a single useful utility. So earlier we did a printout of the result of deviceQuery on my GPU. Let's look at that again. It prints out a lot of information. Earlier we were using this to estimate the total peak bandwidth by looking at the memory clock rate and the memory bus width. Now I want to draw your attention to the total amount of shared memory per block, the total number of registers per block, the maximum number of threads per multiprocessor or SM, and the maximum number of threads per block. And here's the register and shared memory usage for our tiled transpose kernel, going back to NVVP. We can see that we're launching a grid of 32 x 32 blocks with a block size of 32 x 32 threads, we're using seven registers per thread, and we're requesting 4 kilobytes of shared memory per block. So given the register and shared memory usage for the kernel we just saw, and given the GPU statistics from deviceQuery running on my laptop that we just saw, how many thread blocks per SM can we run? And how many threads per SM can we run? And which resources are preventing us from running more? Are we limited by the maximum number of threads per SM, or the maximum number of registers per SM, or the maximum amount of shared memory per SM? Or by the maximum number of thread blocks per SM?

Occupancy Part2

Well, the kernel as we've written it, uses 32 by 32 threads, or 1,024 threads, and the limit on my machine is 2,048 threads. That's keeping us down to no more than two thread blocks per SM. The kernel requires seven registers for each of those threads, so that's a total of 7,168 registers. The limit is 65,536 registers per SM. So, this is limiting us to no more than 9 thread blocks per SM, this is not the limit. The kernel is requesting 4 kilobytes of shared memory to store the tile, the limit is 48k. So, this is restricting us to 12 blocks per SM, that's not our limiting factor. And finally, there's a limit of eight thread blocks per SM total. And that's clearly not the limiting factor. So, in the end, the thing that's keeping us from running more thread blocks per SM and therefore keeping us from running more threads per SM is this first limit, the maximum number of threads per SM. We're limited to running two thread blocks per SM, and that means that we're going to be running a total of 2,048 threads on each SM.

Occupancy on Fermi GPUs

Let's go through this exercise one more time, this time using the Fermi GPUs that are installed in the Amazon clusters that Udacity is running on. I will put the deviceQuery information in the supplementary material, and you should look through that and answer the questions again.

Occupancy on Fermi GPUs

Well, this time you're going to get similar answers. The big difference is that the maximum number of threads per SM is now 1536. All the other numbers are the same: the maximum number of registers per SM is still 64K, the maximum amount of shared memory per SM is still 48K, and the maximum number of thread blocks per SM is 8. So, we have the same limiting factor, which is the maximum number of threads per SM, but now the answer is a little different. Now, we can only run a single thread block per SM, and that means our total number of threads per SM will be 1,024.

Occupancy on Dave's Laptop and Udacity Servers

So let's look at those results. On my laptop, we had a maximum number of threads running of 2,048 out of a maximum possible of 2,048. This means that as long as we have enough thread blocks to fill the machine, enough thread blocks to run on all the SMs, we should get 100% occupancy. On the other hand, on the Fermi GPUs that Amazon uses and that Udacity runs on, we're limited to a maximum number of threads running of 1,024 out of a maximum possible of 1,536, and this means that we're going to achieve at best 66% occupancy. Okay, so occupancy refers to the percentage of threads that are actually running versus the number of threads that could be running. There's a useful tool, a spreadsheet in the CUDA Toolkit installation called the CUDA Occupancy Calculator, that lets you just plug in your numbers and see how your kernels are going to perform, in terms of occupancy, on all the various GPUs that have existed over time.

How To Affect Occupancy

So, the examples that we're looking at get pretty good occupancy. In general, how do you affect the occupancy of a kernel? Well, it's usually fairly easy to control the amount of shared memory needed by a thread block; for example, in our transpose kernel that would be the tile size. And, of course, you can change the number of threads and thread blocks that you use when you launch your kernel. You can also work with the compiler to control the number of registers a kernel uses, but that qualifies as a ninja-level optimization and isn't usually worth the trouble. So, increasing occupancy is usually, but not always, a good thing, and it only helps up to a point. It exposes more parallelism to the GPU and allows more memory transactions in flight, but it may force the GPU to run less efficiently when taken too far. This is always a tradeoff. For example, decreasing the tile size in the transpose code will let more thread blocks run, and it decreases the amount of time spent at barriers. But if we go to too small a tile, we'll lose the whole point of tiling, which was to coalesce the global memory accesses. So, let's go back to that transpose code. Here's our kernel, and you'll recall that we concluded that the syncthreads was really the problem. We need it there, but we've got a large thread block, 32 by 32 threads, sitting at this barrier, waiting for the rest of the threads to reach the barrier before they can proceed. So, we concluded that maybe we can make this faster by simply going up and changing to 16 by 16 blocks. It's a very small change that should decrease the amount of time we spend waiting at the barrier and therefore decrease the latency between memory operations, which, by Little's Law, we would expect to increase the total bandwidth. So now, we'll compile and run, and sure enough, this time, we took just over 0.53 milliseconds. So, let's update our running tally of results. For the tiled version of our code, we managed to get down to about half a millisecond by adjusting the tile size. So, here's a programming exercise. Go back to the transpose code, try a few different tile sizes, and see which ones work best. Try a range of tile sizes, including one large enough that a tile has more elements than you can put threads in a thread block, so you'll have to think about how to pack that work into a single thread block. And I should point out that we are looking for the relative ranking of these on the Udacity servers, so using the Fermi-based GPUs, as opposed to, for example, your own machine that you might be developing on at home.
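For reference, the change being described, going from 32-by-32 to 16-by-16 tiles, amounts to shrinking the tile constant and letting the grid dimensions grow to match. A sketch, assuming a K-by-K tiled kernel like the one sketched earlier and an N-by-N matrix with N divisible by K; the names are mine.

```cuda
// Was: #define K 32   -> 1024 threads per block, long waits at __syncthreads()
#define K 16            // now 256 threads per block

// Launch: one K x K thread block per K x K tile of the N x N matrix.
dim3 blocks(N / K, N / K);
dim3 threads(K, K);
transpose_tiled<<<blocks, threads>>>(d_out, d_in, N);
```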

Transpose Code Recap

Okay, so let's quickly recap. We had a serial version of the code that did everything in a single thread. That code was trivial to write, but it had zero parallelism and pretty terrible performance. Then we had a parallel-per-row version of the code that was also pretty simple to write, using one thread per row in a single thread block. Again, we did not get great performance, although it was a huge improvement, a factor of about a hundred. Then, in the third version, we extracted pretty much all the parallelism the problem has to offer and assigned one thread per element of the matrix. So we did the transpose operation in a massively parallel way, and now we got quite a bit better performance. Remember that all of this is happening in the context of APOD: Analyze, Parallelize, Optimize, and Deploy. And this actually might be fast enough that you're ready to deploy it. Right, so we started by analyzing the code and deciding that this matrix transpose function needed to be sped up. We started exploring ways to parallelize it, and this might be fast enough. This is already a place where you ought to look at it and ask: is this the bottleneck anymore? Is speeding this up going to make a big difference to my application? If not, then deploy. If so, then we can start thinking about optimizing it. And that was the next thing we did. We looked at the performance, saw that we were getting pretty poor DRAM utilization, and diagnosed that the problem must be that our global stores were getting bad coalescing. In other words, when we're writing the output matrix, we're getting bad coalescing in those accesses to global memory. So to fix that, we came up with the tiled version of our code, where each thread block is responsible for reading a tile of the input matrix, transposing it, and writing that transposed tile to its location in the output matrix, and it can perform these reads and writes in a coalesced fashion. We did this with 32 by 32 tiles and got excellent coalescing, and this improved our speed a little bit, but we were still not getting great DRAM utilization. So we looked into why. We thought a little bit more about why we weren't seeing a bigger improvement in bandwidth, and we concluded we probably had too many threads waiting at barriers. So we improved that by shrinking the tile size. Experimenting with different tile sizes led us to conclude that a 16 by 16 thread block, writing into a 16 by 16 tile in shared memory, struck a good balance between achieving coalesced reads and writes to memory and not spending too much time waiting at barriers.

Transpose Code Memory Utilization

Okay, so before we proceed I want to fill in these numbers again and rerun this program to get values that correspond to what you'll see on the Udacity IDE. Until now, as I've mentioned, I've been using my laptop, but since we're going to let you play around with this transpose code, I want you to see on the screen numbers that are more comparable to what you'll get on the much bigger GPUs that Udacity is using, versus the one in my laptop. Okay, so here's the code we've looked at before, this time running in the Udacity IDE. What I want to do now is a test run, and here are the results we get. So you can see that transpose serial now took 82.6 milliseconds, all the way down to transpose parallel per element tiled, which took about 0.13 milliseconds. So here I've written those results down again, and now we can go in and fill in the DRAM utilization. You can see that we've got the same pattern. The serial version of the code, which runs in a single thread, takes tens of milliseconds, and the parallel versions take a few milliseconds and then sub-millisecond times. By the time we get here, we're running quite well at 0.35 milliseconds. And sure enough, by tiling that code we improve on this a little bit, but we're still spending a lot of time waiting at barriers. And by going to a smaller tile size, we can get that all the way down to 0.13 milliseconds. Now at this point we're getting about 43.5% of the theoretical peak bandwidth, and that's actually doing pretty well. There's one last optimization that we're not going to talk about in this lecture, but you're welcome, in fact encouraged, to go look it up in the CUDA C Best Practices Guide, or in many presentations that you can find online, if you're motivated and curious. I decided that it's a fairly subtle effect and not too important. And that's shared memory bank conflicts. Essentially, shared memory is organized into banks, and depending on how the accesses that your array of threads makes line up with the banks backing the tile, you can end up spending a lot of time replaying the same accesses over and over. You saw that shared memory replays statistic much earlier when we were looking at the NVVP output, and that's what this is about. The fix is relatively simple, and as I said, we're not going to go into it here, but there's a version of this code that you can download and play with on the wiki. The fix is actually simply to reserve a slightly larger chunk of shared memory than you're actually going to use; this has the effect of padding the shared memory and striping it across the banks, which can improve our run time even a bit more.

Optimizing Compute Performance

So, remember that our goal is to maximize the amount of useful computation done per second. We've been talking a lot about optimizing memory accesses because that's usually the bottleneck for GPU codes, but some codes, or some portions of codes, are limited by the actual computation. So, let's talk about optimizing compute performance. This topic boils down to two simple guidelines: minimize time spent waiting at barriers, and minimize thread divergence. Now, we've already seen an example of the first optimization. Too much time spent waiting at barriers prevented us, in our tiled transpose code, from making memory accesses fast enough to saturate global memory bandwidth. By shrinking the number of threads in the thread block, we were able to minimize the number of them that were actually waiting at the barrier, and get closer to fully utilizing the global memory.

Minimize Thread Divergence

In CUDA this discussion focuses on the concept of a warp. Now remember that a warp is a set of threads that operate in lock step, all executing the same instruction at the same time on whatever data they happen to be processing. The general technique is called SIMD, which stands for Single Instruction Multiple Data. This is a term you'll hear a lot in parallel computing; computer architects have been building SIMD processors for decades. It saves a lot of transistors and a lot of power if you can amortize the work of fetching and decoding a single instruction across multiple pieces of data at once. All modern CPUs use vector instructions, and the most typical ones on Intel CPUs are the SSE and AVX instruction sets. These are ways that the CPU can execute the same instruction on multiple pieces of data at once. And going back to a point I made at the very beginning of this lecture: on the CPU, a single SSE instruction affects four pieces of data in one clock cycle, and with AVX you could be affecting eight pieces of data. These vector registers represent a huge amount of the computational horsepower, and if you're not using them on the CPU, then you're missing most of the power of your processor. Another term you'll see a lot is SIMT. This is a term coined by NVIDIA to stand for Single Instruction Multiple Threads, and it's a subtle distinction that has to do with what happens when there is thread divergence. Let me explain what that means. It all comes down to what happens at a branch in the code. Here's some code with an if-else statement, and I'll color the instruction stream black for instructions that all the threads are going to execute, red for instructions that only threads that take the if branch will execute, blue for instructions that only threads that take the else branch will execute, and then black again.
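As a concrete, made-up example of the coloring just described, here is a tiny kernel where threads in the same warp take different paths:

```cuda
__global__ void diverge(float *data)
{
    int i = threadIdx.x;            // all threads: "black" instructions
    if (i % 2 == 0) {
        data[i] = data[i] * 2.0f;   // "red" path: even lanes only
    } else {
        data[i] = data[i] + 1.0f;   // "blue" path: odd lanes only
    }
    data[i] += 0.5f;                // "black" again: the warp re-converges here
}
```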

Thread Divergence Penalty

So, here's a bunch of threads executing instructions over time. Now, these threads all belong to the same warp, and on all NVIDIA GPUs today, a warp has 32 threads. If every thread in the warp takes the same path, there's no problem with this: all of the threads execute all of their instructions in lock step. But here's a warp where some threads take the if branch and some take the else branch. Because the threads in a warp can only execute a single instruction at a time, the hardware automatically deactivates some of the threads, executes the red instructions, then flips which threads are activated and executes the blue instructions. And as you can see, if this axis is time, the code is going to take longer to execute overall. Some of our threads are wasting time sitting idle while their warp mates execute. We say that the threads have diverged during this section of the code, and with two paths like this we would refer to it as 2-way divergence. So here's a quiz: what is the maximum branch divergence penalty that a CUDA thread block with 1024 threads could experience when executing a kernel? Is it a 2x slowdown, a 4x slowdown, and so forth?

Thread Divergence Penalty

The answer is a 32x slowdown. Okay, so since warps only have 32 threads, you can't have worse than 32-way branch divergence. At worst, you'll have to deactivate all but one thread, execute some code, then activate another thread and deactivate the one you just ran, repeating this for each of the 32 threads in the warp. So, even if you have 1,024 threads in a thread block, you won't experience worse than a 32x slowdown.

Switch Statements and Thread Divergence Part1

It's actually easy to construct such a kernel. For example, we could have a switch statement with 32 or more cases. If each thread in a warp switches to a different case, and each case takes the same amount of time to execute, this segment of the kernel will run 32 times slower than if all the threads in a warp switch to the same case. Okay, so let's explore this with a quiz. I'm going to give you a bunch of switch statements, and we're going to pretend that they're in a kernel and that all of the cases of those switch statements take an equal amount of time. I'll do the first example to explain the format that I'm looking for. Here, we've got a switch statement in kernel code. It's switching on threadIdx.x mod 32, which means you're going to get a number from 0 to 31. And I've used a shortcut notation to indicate that I've got cases 0 through 31; I want you to assume that all these cases take the same amount of time. Then there's some more code, and later on I'm actually going to launch this kernel in a configuration like this: a single thread block with 1,024 threads. And so the question that I'm asking you is, what is the slowdown? Is it 1x, meaning no slowdown at all? Is it 32x, meaning we run 32 times slower through this section of the code due to branch divergence? What's going on? And so I'll give you a series of these, using this shorthand notation, and I want you to think about the slowdown you're going to get due to divergence among the different threads in a warp. So the answer to this first one is the example we've been discussing. Every thread in the warp will have a different value for this thread index; those values will be 0 through 31, so every single thread will take a different path. And that means that the warp is going to have to activate each thread in turn, run through that particular case, deactivate that thread, activate another thread, run through that case, and so forth. So this will lead to the maximum slowdown of 32x. Here is our next example, and what I'm going to try to do for each of these examples is highlight in black the parts of the code that are different from the first example I gave. So in this case we are similarly launching a kernel with 1,024 threads, but now we've got cases 0 through 63 and we're taking the thread index and modding it by 64 instead of by 32. And your job is to figure out the slowdown. What about a two-dimensional kernel? Here I'm using some shorthand to indicate that I'm launching a thread block which has 64 threads in the x direction and 16 threads in the y direction, again a single thread block, and I'm going to switch on the y index of the thread. And next, another example where I'm going to switch on the y index of the thread, as before, except this time I'm launching a thread block with 16 by 16 threads.

Switch Statements and Thread Divergence Part1

So, the thing to understand about this first example is that the maximum slowdown is 32. There are only 32 threads in a warp, so no matter how many cases there are, no matter how many ways you can branch, you're not going to branch more than 32 ways in a single warp, because there are only 32 threads in that warp. So again, the maximum here is going to be a 32x slowdown, for just this segment of code, for this switch statement in particular. What about this case, where we switch on the y index of the thread? The thing you need to understand is that CUDA assigns thread IDs to warps in such a way that the x ID varies fastest, the y ID varies slower, and the z ID varies slowest. This is much clearer with a concrete example. In my quiz question, I have a 64 by 16 thread block, so let's just draw that out. Here's the thread with x ID 0, x ID 1, x ID 2, and so forth, and these are the y IDs. This first row has y ID 0, the second row has y ID 1, all the way down to y ID 15. Now, CUDA packs these threads into warps: a first warp with y ID 0 comprising x IDs 0 through 31, then another warp, again with 32 threads, containing x IDs 32 through 63, then yet another warp with y ID equal to 1 and x IDs varying from 0 to 31, and so forth. So now we know the answer to our question. All of the threads in a given warp, in this case, have the same y ID, and that means we'll get no slowdown at all, because all of the threads will be branching to the same place. What about this next case, where we've got a smaller thread block with 16 by 16 threads? In this case, there are only 16 x IDs that can be assigned to a warp before it has to change to the next y ID and start assigning the next set of x IDs. So, all of these threads will be placed into a single warp, the first 16 in the first half of the warp, the next 16 in the second half of the warp. Then all of these threads will be placed in the next warp, and so on. And so now, as you can see, there are two different y IDs for the threads in a single warp. So, we'll end up having to execute this code twice for this warp, once with one y ID and once with the second y ID, and we end up with a total of a 2x slowdown.
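The mapping described above can be written down directly: CUDA linearizes thread IDs within a block with x varying fastest, and consecutive groups of 32 linear IDs form a warp. A sketch of that arithmetic, with names of my own choosing:

```cuda
// Linear thread ID within the block: x varies fastest, then y, then z.
int linear = threadIdx.x
           + threadIdx.y * blockDim.x
           + threadIdx.z * blockDim.x * blockDim.y;

int warp = linear / 32;   // which warp this thread belongs to
int lane = linear % 32;   // position within that warp

// With a 64 x 16 block, a whole warp shares one threadIdx.y, so switching
// on threadIdx.y causes no divergence. With a 16 x 16 block, each warp
// spans two y values, so the same switch causes 2-way divergence.
```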

Switch Statements and Thread Divergence Part2

Let's do three more examples. In this example, we're switching on threadIdx.x mod 2, and we're launching a single one-dimensional thread block with 1024 threads. In this example, we're switching on threadIdx.x divided by 32, an integer divide; otherwise, it's the same 1024 threads, one-dimensional. And finally, we'll switch on threadIdx.x divided by 8. Now our cases range from 0 to 63, and again we're going to launch 1024 threads in a single one-dimensional block. So, what is the slowdown for these three situations?

Switch Statements and Thread Divergence Part2

Okay, well in this case, we're really just picking whether the thread is even or odd. Each warp will have some even threads and some odd threads, so it's a total slowdown of 2. Here, we're doing an integer divide. So, threads 0 through 31, when divided by 32, evaluate to 0 and take case 0, while for threads 32 through 63 this expression evaluates to 1, so they take case 1, and so forth. So here we're getting no slowdown at all; all the threads in a 32-thread warp still end up taking the same case. And finally, threadIdx.x divided by 8 evaluates to 0, 1, 2, or 3 for the 32 threads in a warp. So we'll end up evaluating four different cases within each warp, and this time we'll take a 4x slowdown.

Loops and Thread Divergence

So, we've talked about if-then-else and switch statements; that's one kind of branch. Loops are another kind of branch. So, here is a snippet of CUDA kernel code with two loops. For both of these loops, assume that the function bar is what takes most of the time, that all these calls to bar are going to dominate the total time spent in foo. And the only difference between the loops is that, in this case, we repeat a number of times equal to threadIdx.x mod 32, and here we repeat a number of times equal to the thread index divided by 32, an integer divide. So, as a quiz, see if you can figure out which of these is faster, loop 1 or loop 2; in other words, across the whole kernel, in which loop do we spend more total time? And remember, we're assuming that bar is a really expensive function that dominates the overhead of actually running the loop. And roughly how much faster is it? Give me just an integer answer.
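The transcript refers to code shown on screen; here is a sketch of the two loops as described, with bar() standing in for the expensive function and a launch of one block of 1,024 threads assumed.

```cuda
__device__ void bar() { /* stand-in for an expensive function */ }

__global__ void foo()
{
    // Loop 1: within any warp, threadIdx.x % 32 takes every value 0..31,
    // so the warp as a whole iterates 31 times.
    for (int i = 0; i < threadIdx.x % 32; i++) {
        bar();
    }

    // Loop 2: threadIdx.x / 32 is the same for all 32 threads of a warp,
    // so warp 0 iterates 0 times, warp 1 once, ..., warp 31 iterates 31 times.
    // Averaged over the block, that's 15.5 iterations per warp.
    for (int i = 0; i < threadIdx.x / 32; i++) {
        bar();
    }
}
```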

Loops and Thread Divergence

So to answer this, we want to look at these for loops and decide how many times each warp is going to have to execute the loop body, and that means how many times at least one thread in the warp has to execute it. So looking at this expression here, for a single warp, these values will vary from 0 to 31. So there will be at least one thread in that warp for whom this modulo expression evaluates to 31, and that means the entire warp is going to go through the motions of executing this bar function 31 times. Now, on some of those iterations, some of the threads will be deactivated. The very first time, thread 0 will not execute the bar function; it will be deactivated because i will not be less than 0. The next time, threads 0 and 1 will be deactivated, and so forth. Ultimately, the total amount of time that the warp spends in this loop depends on the largest number of iterations that any one thread has to perform, so each warp will execute this loop 31 times. This next loop, though, is different. In this case, the integer divide means that for threads 0 through 31, this expression will evaluate to 0, and therefore they're not going to execute the bar function at all. For threads 32 through 63 it evaluates to 1, and so forth. So there'll be one warp which executes the loop 0 times, one warp which executes it 1 time, one that executes it 2 times, and so forth, all the way up to a single warp which executes it 31 times. So the average number of times that the warps execute this loop is 15.5. So now we know what we need to know to answer the question. Clearly, the second loop will execute faster, and it'll be about twice as fast, because on average the expensive bar function gets evaluated half as many times as it does during the first loop.

Thread Divergence in the Real World Part1

Let's look at a real example of branch divergence, the kind of thing that might come up in the real world. Suppose that I'm doing some sort of special operation on pixels that are on the boundary of an image; for example, if I'm doing a blur, I don't want to look off the edge of the image. So, here's some simple kernel code that illustrates this idea, a per-pixel kernel. I'm going to check whether the thread index in either x or y puts this pixel on the border of the image (index 0 or 1,023 for a 1,024 by 1,024 image). If so, I'm going to call some code that deals with the boundary conditions; otherwise, I'll run the code that does whatever per-pixel operation we're actually interested in. Okay, you can totally imagine that this is a common kind of operation; you've probably written code like this already in this class. So as a quiz, what is the maximum branch divergence of any warp in this code? Is it one-way, two-way, three-way, 32-way, and so forth?
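A sketch of the kind of per-pixel kernel being described, assuming a width-by-height image with one thread per pixel; the helper functions are placeholders of my own naming.

```cuda
__device__ void handle_boundary(float *img, int x, int y, int w, int h) { /* edge case */ }
__device__ void handle_interior(float *img, int x, int y, int w, int h) { /* main work */ }

__global__ void per_pixel_kernel(float *image, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x == 0 || x == width - 1 || y == 0 || y == height - 1) {
        handle_boundary(image, x, y, width, height);  // border pixels
    } else {
        handle_interior(image, x, y, width, height);  // everything else
    }
}
```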

Thread Divergence in the Real World Part1

Well, this is my image and these are the pixels on the boundary. I've assigned one thread per element, or one thread per pixel. A warp somewhere in the center of the image is not going to have any divergence. And similarly, a warp along the top or bottom rows is also not going to have any divergence, because in that case all of the threads responsible for the pixels along the row call the special boundary-handling code. So it's really only warps like these, at the beginning or end of an interior row, that have any divergence. In this case, one thread will execute the boundary code, the one responsible for the pixel on the boundary, and the rest of the threads will execute the mainline code, the interior code. So in all these cases you have 2-way divergence, and that's the answer to our question.

Thread Divergence in the Real World Part2

So, how many total threads are in diverged warps? Warps with at least two-way divergence.

Thread Divergence in the Real World Part2

Well, there's one row of pixels which is not diverged along the top, and one row of pixels which is not diverged along the bottom. So, there are 1,022 rows of pixels that do have diverged warps, one at each end of the row, one on each side. So, the answer is 1,022 times 2 warps, which is 2,044 warps, or 65,408 threads. And we launched about a million threads to deal with the million pixels in this image. So, that means that just over 6% of the threads in this entire kernel are running at perhaps half efficiency for part of the kernel, which is probably not worth trying to optimize.

Thread Divergence in the Real World Part3

This drives home an important point: you need to be aware of branch divergence. It's one of the fundamental things about GPU architectures, or any massively parallel architecture that uses a SIMD approach. But don't freak out just because your code has an if statement or a for loop. Reason it out, profile the code, and figure out whether this is a real problem worth optimizing. There's no recipe for reducing branch divergence; what to do depends on the algorithm, and sometimes the right answer is an entirely new algorithm. There are a couple of general principles, though. Try to avoid branchy code. If you have a lot of if or switch statements in your code, consider whether adjacent threads are likely to take different branches, and if so, try to restructure. The second general principle is to beware of large imbalances in thread workloads. Sometimes one thread can take much, much longer than the average thread to complete a task, and when this happens, the long-running thread can hold up the rest of the warp or even the rest of the thread block. We saw this in the example earlier where some threads executed a loop many more times than other threads in the same warp. So, look carefully at loops and recursive calls to decide whether this kind of work imbalance might be hurting you. In the next unit, you'll see another example, breadth-first search in a graph, where each thread is assigned a work item of random size. And if one thread gets unlucky and picks up a work item that takes a thousand times longer than the rest of the threads, you're wasting a lot of computational horsepower as all the rest of those threads sit idle waiting for that one long-running thread to finish. There are some simple strategies that can mitigate this imbalance. You could restructure your code entirely, or you might, for example, coarsely sort the work items by size beforehand. And John will dive deeper into parallel strategies for breadth-first search, and a lot of other interesting problems, in the next unit.

Assorted Math Optimizations

It's also worth noting that different math instructions take different amounts of time, and this topic rates maybe half a ninja. You can go really deep into understanding the latencies of different math operations, but there are a few general principles that probably everybody should keep in mind. The first is to use double precision only when you really mean it. Floating point literals like the 2.5 here are interpreted as fp64 unless you add the f suffix. Therefore, this statement on the left will take longer to execute than this one on the right. It's a subtle distinction, and clearly sometimes you need double precision, but if you're concerned about performance and you're trying to squeeze out the last few percent from your kernels, only use it when you absolutely intend to. A second math-oriented optimization is to use intrinsics whenever possible for common operations. CUDA supports special versions of many common math operations, like sine, cosine, and exponential, that are called intrinsics. These built-in functions achieve 2-3 bits less precision than their counterparts in math.h, but they are much faster. There are also compiler flags for fast square root, fast division, flushing denormals to zero, and so forth; see the programming guide for more detail.
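Two small examples of the points above. The f suffix keeps the arithmetic in single precision, and the double-underscore intrinsics trade a couple of bits of precision for speed; __sinf and __expf are real CUDA intrinsics, while the surrounding kernel is just illustrative.

```cuda
__global__ void math_examples(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = in[i];

    // The literal 2.5 is a double, so this multiply would be promoted to fp64:
    // float slow = x * 2.5;
    // The f suffix keeps everything in fp32:
    float fast = x * 2.5f;

    // sinf()/expf() from math.h are accurate but slower; the __ intrinsics
    // map to special-function hardware and give up 2-3 bits of precision.
    out[i] = __sinf(fast) + __expf(fast);
}
```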

Host-GPU Interaction

Finally, let's talk about optimizations at the whole-system level, including the interactions that happen between the host and the GPU. On most systems, the CPU and the GPU are on different chips, and they communicate through something called the PCI Express bus, or PCIe for short. Today, for example, most systems are PCI Express Gen2, which in practice can get about 6 gigabytes per second maximum in either direction. Now, PCIe can only transfer host memory that has been page-locked, or pinned, so CUDA keeps a special chunk of pinned host memory set aside for this purpose. So when you have a chunk of memory that you want to copy over to the device memory on the GPU, CUDA first has to copy it into that staging area before it can transfer it over the PCI Express bus onto the GPU. This extra copy increases the total time taken by the memory transfer. Instead, you can use cudaMallocHost to allocate pinned host memory in the first place, so that the memory you've got sitting on the host is already ready to copy directly over without this additional staging step. Or you can use cudaHostRegister to take some data you've already allocated and pin it, again avoiding the extra copy on the way into GPU memory. So that's why you care: memory transfers from the host to the device go faster if you pin the memory first. And if you're transferring pinned memory, you can use the asynchronous memory transfer, cudaMemcpyAsync, which lets the CPU keep working while the memory transfer completes. How does that work? Well, with cudaMemcpy, which you have been using until now, you make a call to cudaMemcpy, you pass it some information, like the pointers, the size in bytes, and whether this is a transfer from host to device or device to host, and then there's a semicolon at the end of that statement. With cudaMemcpy, the CPU blocks until the transfer completes. With cudaMemcpyAsync, when it hits the semicolon, it just keeps going. This kicks off an asynchronous memory transfer that continues while control returns to the CPU and the CUDA program continues executing. In other words, the CPU can keep getting work done while this memory transfer proceeds. So to control that asynchronous interaction, we have to introduce the next topic in host-GPU interaction, and that's streams.
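A minimal sketch of the two pinning options just mentioned (cudaMallocHost to allocate pinned memory up front, cudaHostRegister to pin memory you already have) and an asynchronous copy; the sizes and names are mine.

```cuda
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 1 << 20;
    float *h_pinned = nullptr, *d_data = nullptr;

    // Option 1: allocate pinned (page-locked) host memory directly.
    cudaMallocHost(&h_pinned, bytes);

    // Option 2: pin an allocation you already have.
    // cudaHostRegister(h_existing, bytes, cudaHostRegisterDefault);

    cudaMalloc(&d_data, bytes);

    // Because the source is pinned, this copy is truly asynchronous:
    // control returns to the CPU immediately while the DMA engine works.
    cudaMemcpyAsync(d_data, h_pinned, bytes, cudaMemcpyHostToDevice);

    // ... the CPU is free to do other work here ...

    cudaDeviceSynchronize();  // wait for the transfer to finish
    cudaFree(d_data);
    cudaFreeHost(h_pinned);
    return 0;
}
```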

Streams Part1

So, in CUDA, a stream is a sequence of operations that execute in order on the GPU, and the particular operations we care about are memory transfers and kernel launches. If I perform a cudaMemcpyAsync, launch kernel A, perform a second cudaMemcpyAsync, and then launch kernel B, each operation completes before the next starts: the first memcpy happens, then kernel A runs to completion, then the second memcpy happens, and then the second kernel runs. Since we didn't specify a stream for any of these, they are all in the default stream, which happens to be stream 0. But if I do those same operations and this time put them each in a different stream, by adding a parameter which is the stream, then these operations can potentially run concurrently, completing in much less time. So, this operation is in stream 1, this operation is in stream 2, this operation is in stream 3, and this operation is in stream 4. A couple of notes. A CUDA stream object is of type cudaStream_t; for example, here we declare a stream called s1. You create streams using the cudaStreamCreate call, passing in a pointer to the stream you want to create, and you destroy them by calling cudaStreamDestroy. And remember, to get asynchronous memory transfers you need to be using cudaMemcpyAsync, which in turn must be called on pinned host memory.
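A sketch of that stream boilerplate in context; someKernel is a placeholder, and the host array is pinned so the copy can actually be asynchronous.

```cuda
#include <cuda_runtime.h>

__global__ void someKernel(float *d) { d[threadIdx.x] *= 2.0f; }  // placeholder

int main()
{
    const int n = 256;
    const size_t bytes = n * sizeof(float);
    float *h_arr, *d_arr;
    cudaMallocHost(&h_arr, bytes);   // pinned host memory
    cudaMalloc(&d_arr, bytes);

    cudaStream_t s1;
    cudaStreamCreate(&s1);                                             // create a stream

    cudaMemcpyAsync(d_arr, h_arr, bytes, cudaMemcpyHostToDevice, s1);  // queued in s1
    someKernel<<<1, n, 0, s1>>>(d_arr);                                // 4th launch param = stream
    cudaMemcpyAsync(h_arr, d_arr, bytes, cudaMemcpyDeviceToHost, s1);

    cudaStreamSynchronize(s1);       // wait for everything queued in s1
    cudaStreamDestroy(s1);
    cudaFree(d_arr);
    cudaFreeHost(h_arr);
    return 0;
}
```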

Streams Part2

Let's have a quiz on this. If all memcpys and kernel launch operations in the code snippets below take exactly one second to complete, what is the minimum time before the results are available on the host? So, here are the code snippets. We start by declaring a couple of streams, s1 and s2, and by using the cudaStreamCreate call to create streams with those stream objects. Then there are a bunch of code snippets, and I want you to tell me how long each one is going to take, in seconds. I've taken a couple of shortcuts here. I haven't shown you d_arr, h_arr, d_arr1, h_arr1, and so on getting allocated. You can assume that d_arr has been allocated with cudaMalloc and is a pointer to device memory, and that h_arr has been allocated as pinned host memory using cudaMallocHost. And as a shorthand, I've abbreviated cudaMemcpyHostToDevice as H2D and cudaMemcpyDeviceToHost as D2H. Notice that I'm launching a single block with a fairly small number of threads in each of these kernel launches. So, if all of these operations take one second, what's the minimum time that each snippet can take to complete? What's the minimum time before the final results are ready for the host?

Streams Part2

Okay, well, the first code snippet doesn't assign streams at all. None of these operations specify a stream, so they're all in the default stream. And since they are in the same stream, each one must complete before the next one in that stream is allowed to run. So cudaMemcpy will run for one second, kernel A will launch and run for one second, and cudaMemcpy will run for another second: we copy the memory down, do some operation on it, and copy it back up. After three seconds, the results are ready for the host. The next code snippet does declare a stream, and it puts all the operations in s1. Since all three operations are in the same stream, each must complete before the next can start. So once again, this takes a second, this takes a second, this takes a second, and the final result is ready in 3 seconds. This one's a bit more complicated. Here I'm doing two sets of operations on different chunks of memory: I'm copying h_arr1 into d_arr1, calling kernel A on d_arr1, and copying d_arr1 back to h_arr1; then I'm copying h_arr2 to d_arr2, doing some operation on d_arr2 by launching kernel B, and copying d_arr2 back to h_arr2. The first three operations are in stream 1. So cudaMemcpyAsync will be called, it'll start a memory copy on the GPU, and then CPU execution continues to the next statement; we don't stop and wait for this memcpy to finish before we execute the next statement, which is the launch of kernel A. Kernel launches are always asynchronous, so in effect this queues up the launch of kernel A, which will run on the GPU after the memcpy has finished. The CPU proceeds to the next statement, which is another cudaMemcpyAsync to copy the results of kernel A back onto the host. All three of these are in stream 1, so they will all wait for each other on the GPU, but the host just issues a command, issues a command, issues a command; it doesn't stop, and it proceeds to the cudaMemcpyAsync in s2. So now it starts another asynchronous memory transfer. This one, because it's in s2, doesn't have to wait for any of the operations in s1 to complete, so it gets started right away and completes in one second. The host then calls kernel B in one block of 192 threads, also in s2, and then calls cudaMemcpyAsync, again in s2, to copy the results back. So, the first memory copy will happen, the next two operations in s1 will get queued up, and the memory copy in s2 will start essentially immediately. So in the first second, this memcpy and this memcpy are running. In the next second, kernel A and kernel B are running. And in the final second, the return memcpy in s1 and the return memcpy in s2 are running. So the minimum amount of time this can take, if everything that can run concurrently does run concurrently, is again three seconds. But in those three seconds, we've actually gotten twice as much work done, because we've overlapped the execution. And I phrased this carefully when I talked about the minimum time this will take. It's worth noting that there are a lot of reasons why the GPU might not be able to overlap all of these operations. Some older GPUs can't do this many asynchronous memcpys at the same time. And I was pretty careful to launch a single block, so these kernels probably will not fill the GPU.
On the other hand, if kernel A had run tens of thousands of blocks, it might well fill up the GPU with those blocks before kernel B got a chance to start filling the GPU with any of its blocks. And so kernel B might end up waiting on kernel A simply because the resources are not available to run it. The final example is the same operations, just reorganized. In this case we issue both memcpys in streams 1 and 2, then launch both kernels in streams 1 and 2, then issue both memcpys back in streams 1 and 2. Rearranging these things doesn't make a difference; the minimum time this could take is still three seconds.

Streams Part3

Here's two more examples. What's the minimum amount of time that these code segments take before their results are complete?

Streams Part3

Well, in this operation, we're copying some memory into d_arr1 in stream 1, and the kernel is launched in stream 2, so it can proceed immediately and operate on d_arr2. Both of these operations will complete in one second, so everything's done in one second. And this one is a bit of a trick question. Once again, we have two different operations and they're in different streams, so it's correct to say that the final result will be complete in one second. But hopefully you noticed that we're copying some information into d_arr1 and, at the same time, because we didn't wait before launching the kernel, we're operating on d_arr1. We're doing something with the contents of d_arr1 even as it's being overwritten by cudaMemcpyAsync. And so, while it's true that we'll complete in one second, we're going to get the wrong answer.

Streams Are Useful For

Now, you know what this is really useful for? Sometimes you need to perform a computation on a really big hunk of data, maybe so big that it won't even fit into the GPU's memory. No problem, you say. I'll break up the data into chunks that will fit in GPU memory, copy each chunk over, do my processing, and then copy that chunk back. Then I'll do the same thing with the next chunk, and so on. And this will work fine, and it'll get the job done. But the problem is that if this is a high-end GPU, it can easily have several gigabytes of memory. And so these copy operations, where you copy several gigabytes of data, enough to fill the GPU memory, do your processing, and then copy several gigabytes back, can end up taking a significant fraction of your total computation time. You see the problem: if this is the timeline, we spend some time copying, some time processing, some time copying back, and the process repeats. So, here's where asynchronous copies come to the rescue. Instead, we can break up our data into chunks that are half the size and do an asynchronous copy, filling half the GPU's memory. Then we can launch some kernels on the GPU to start processing that data, immediately do another asynchronous copy to fill the rest of the GPU's memory, and immediately start processing that data too. When the first chunk completes, we copy it back and start copying the next half-size chunk of data over, all while the other chunk is still being processed. So, you get the idea: we copy over half the data at a time and immediately start processing it while we copy the next half of the data. Now the timeline looks like this, and the important point is that we're managing to overlap data transfers and computation.
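A sketch of that chunked pattern with two streams. The process kernel, the two device buffers, and the chunk sizes are placeholders of my own; h_data is assumed to be pinned and the streams already created.

```cuda
__global__ void process(float *d, int n);  // placeholder for the real work

// Assumes: h_data is pinned host memory of numChunks * chunkElems floats,
// d_buf[0] and d_buf[1] are device buffers of chunkElems floats each,
// and streams[0], streams[1] were created with cudaStreamCreate.
void process_in_chunks(float *h_data, float *d_buf[2], cudaStream_t streams[2],
                       int numChunks, int chunkElems)
{
    for (int c = 0; c < numChunks; c++) {
        cudaStream_t s = streams[c % 2];            // ping-pong between 2 streams
        float *d = d_buf[c % 2];
        size_t offset = (size_t)c * chunkElems;
        size_t bytes  = chunkElems * sizeof(float);

        // Copy chunk c up, process it, and copy it back, all queued in stream s,
        // so the copies of one chunk overlap the kernel of the other.
        cudaMemcpyAsync(d, h_data + offset, bytes, cudaMemcpyHostToDevice, s);
        process<<<(chunkElems + 255) / 256, 256, 0, s>>>(d, chunkElems);
        cudaMemcpyAsync(h_data + offset, d, bytes, cudaMemcpyDeviceToHost, s);
    }
    cudaDeviceSynchronize();  // wait for every chunk in both streams
}
```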

Advantage of Streams

So, this is the advantage of streams. You can overlap data transfer and computation. And there's a subtler issue: the ability to launch multiple kernels asynchronously at the same time on the GPU can help you fill the GPU with smaller kernels. Sometimes you have a problem with limited parallelism, or sometimes you're doing a computation, such as a reduction, where there's a narrow phase in which the amount of work gets less and less, and in those narrow phases you just don't have enough work to fill up the GPU. Well, maybe you have something else that you can run on the GPU at the same time to help fill up all those SMs. Now, there are some details to be aware of, so check out streams and events in the CUDA programming guide if you're interested in learning more.


Alright, it's time to wrap up. Here's what I hope you've taken away from this unit. Remember APOD: Analyze, Parallelize, Optimize, and Deploy. The important points here are to do profile-guided optimization at every step and to deploy early and often, rather than optimizing forever in a vacuum. I can't emphasize this enough. Optimization takes effort and often complicates code, so optimize only where and when you need it, and go around this cycle multiple times. Now, most codes are limited by memory bandwidth, so compare your performance to the theoretical peak bandwidth, and if it's lacking, see what you can do about it. Things that will help, in order from most to least important, are these. Ensure sufficient occupancy: make sure you have enough threads to keep the machine busy. This doesn't mean having as many threads as you can possibly fit on the machine, but you do need enough that the machine is basically busy. Coalesce global memory accesses: really strive to find a way to cast your algorithm so that you achieve perfect coalescing. And if you can't, consider whether you can do a transpose operation, or something similar, that gets poor coalescing once but then puts the data into a memory layout where all your subsequent accesses will be coalesced. Remember Little's Law: in order to achieve the maximum bandwidth, you may need to reduce the latency between your memory accesses. For example, we saw that in one case we spent too much time waiting at barriers; by reducing the number of threads in a block, we were able to reduce the average time spent waiting at the barrier and help saturate the global memory bandwidth. We've talked about minimizing the branch divergence experienced by threads. Remember that this really applies to threads that diverge within a warp. If all the threads within a warp take the same branch and go down the same code path, then any divergence between different warps comes for free; there's no additional penalty for threads in different warps diverging. It's only when threads within a warp diverge that you have to execute both sides of the branch. As a rule, you should generally try to avoid branchy code, code with lots of if statements, switch statements, and so on. And you should generally be thinking about avoiding thread workload imbalance; in other words, if you have loops in your kernels that might execute a very different number of times between threads, then the one thread that's taking much longer than the average can end up holding the rest of the threads hostage. All that said, don't let a little bit of thread divergence freak you out. Remember, we analyzed a real-world example of dealing with boundary conditions at the edge of an image and figured out that the if statements guarding the edge of the image weren't really costing us very much; only a few warps ended up being divergent. If you're limited by the actual computational performance of your kernel, rather than the time it takes to get the data to and from your kernel, then consider using fast math operations. This includes things like the intrinsics for sine and cosine and so forth. They go quite a bit faster than their math.h counterparts, at the cost of a few bits of precision. And remember that when you use double precision it should be on purpose. Just typing the literal 3.14 gives you a 64-bit double precision number, and the compiler will treat it as such.
Whereas typing 3.14f tells the compiler, hey, this is a single precision value; you don't have to promote everything I multiply it by or add it to into double precision. Finally, if you're limited by host-device memory transfer time, consider using streams and asynchronous memcpys to overlap computation and memory transfers. And that's it. Now go forth and optimize your codes.

Problem Set 5

In Problem Set Number Five, you will be designing and implementing a fast parallel algorithm to compute histograms. You already implemented a kernel for computing histograms in Problem Set Number Three that used atomic operations. However, it turns out that we can get much better performance if we avoid atomic operations altogether, and that is our goal in Problem Set Number Five. There are a lot of different ways to efficiently compute histograms. We strongly encourage you to experiment and test out different ideas, but one basic strategy that works well is as follows. First, sort the input data into coarse bins. Two, use each thread block to compute a local histogram for the data that falls into a single coarse bin. Three, concatenate the coarse histograms together. This strategy is advantageous because each local histogram will have a small number of bins, so it will fit into shared memory, and this is true regardless of how many bins the final histogram has. Let's look at a simple example to see how this strategy works. Suppose our input data is in the range of 0 to 999, and suppose we want to bin it into 100 equally sized histogram bins. Here's our data; as you can see, it ranges from 0 to 999. Now, let's compute the bin ID for each input element as well as the coarse bin ID for each input element. Let's suppose we have ten equally sized coarse bins. This is the result of calculating the bin ID and the coarse bin ID for each of the input elements. Now let's sort the bin IDs by coarse bin ID and see what we have. This is the result after we sort the bin IDs by their coarse bin IDs. Now, let's assign a thread block to each coarse bin. So thread block 0 is responsible for computing the histogram of coarse bin 0. Thread block 1 won't have anything to do, because in the sorted coarse bin IDs there is no bin with an ID of 1, so it will compute an empty histogram. Thread block 2 will compute the histogram of coarse bin 2, and so on. And to get our final histogram, we simply concatenate all of those local histograms together end to end, combining the histogram computed for each coarse bin, in order. And that's it. Good luck.
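For the concrete example above (values 0 to 999, 100 bins, 10 coarse bins), the bin bookkeeping is just integer arithmetic. A sketch, with names of my own choosing:

```cuda
// For a value in [0, 1000) with 100 bins and 10 coarse bins:
__device__ int binId(int val)       { return val / 10; }   // fine bin, 0..99
__device__ int coarseBinId(int val) { return val / 100; }  // coarse bin, 0..9

// Equivalently, coarse bin = binId / 10: each coarse bin groups 10 fine bins,
// so after sorting by coarse bin, one thread block only ever needs a small
// (10-bin) local histogram, which fits comfortably in shared memory.
```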