Hi, everybody. Welcome to the final unit of the course. We're delighted that you stuck with us and hopefully learned a whole lot about parallel programming with CUDA. Now, this unit is going to be a little bit different. Instead of discussing one topic in depth, we're going to go wide and discuss many topics briefly. The goal is to give you some pointers to things that you can follow up according to your own interest and needs. Let's get started.
So today, we're going to cover the following broad topics. We'll start by returning to the topic of optimizing GPU programs. Now, in unit five, we gave some pretty specific, detailed advice, and in unit six we explored some examples of how to think parallel. Here we're going to back off and talk more generally about strategies that parallel programmers use to optimize their programs. Now, the best kind of programming, as we mentioned briefly in unit five, is the kind of programming that you don't do, because somebody else has already programmed it for you and packaged it into a library you can use. So we'll talk about some important and useful CUDA libraries that are out there. Some of these libraries are less about packaging up code to solve a particular problem and more about providing what I call power tools to help you code up your own solutions. Examples familiar to C++ programmers would be the Standard Template Library, or STL, or the Boost library. We'll discuss a few such power tools for CUDA C++ programmers. Now, in this class we focused on the CUDA C++ language, but there are many other platforms for parallel programming on GPUs. We'll talk about CUDA platforms that support other languages, from Fortran to Python to MATLAB. And we'll also discuss cross-platform accelerator solutions like OpenCL, OpenACC, and OpenGL Compute. Now, GPU computing is a young field, and part of what makes it exciting is that the hardware and software are improving each year, not just getting incrementally faster, but also adding fundamentally new capabilities. So we're going to finish the unit, and the course, with a fantastic guest lecture, inviting Dr. Stephen Jones from NVIDIA to come and teach us about the latest advance in CUDA, called dynamic parallelism.
So, let's dive in and talk about parallel optimization patterns. Now, in the supplementary material, you'll find pointers to a fantastic couple of papers by John Stratton and colleagues at the University of Illinois. One is titled "Optimization and Architecture Effects on GPU Computing and Workload Performance," and it was published in the Innovative Parallel Computing conference in 2012. A reworked version of that same basic paper was published later that year in IEEE Computer; the Computer paper is a little shorter and more readable. Now, what they did is survey a whole bunch of highly optimized parallel codes, pay attention to the patterns that emerged, and extract seven fundamental techniques that come up over and over again. And as you'll see, several of these techniques encapsulate or formalize lessons that we've already learned. So this will also be a good review of material that we've covered in unit 2 and unit 5, and along the way in the homeworks. Let's walk through the taxonomy of seven techniques that they've come up with.
They call their first technique data layout transformation. By now, you're very familiar with the concept of global memory coalescing, where we achieve the best bandwidth to memory when adjacent threads access adjacent locations in memory. As a quick review quiz: is global memory coalescing important because modern DRAM systems transfer large chunks of data per transaction? Because modern DRAM systems transfer data in many small chunks, one per thread request? Or because modern GPUs have no on-chip caches in which to prefetch data, meaning, to hold data likely to be used by a thread but not yet actually requested by it?
And the answer, of course, is that modern DRAM systems, such as the ones used by high-performance GPUs, transfer large chunks of data per transaction. And therefore, if you can coalesce the memory requests made by many threads at the same time into a single transaction, then you'll get many more useful bytes out of that transaction. The second statement, that DRAM systems transfer data in many small chunks, is simply false. And the third statement, that GPUs have no on-chip caches in which to prefetch data, is also false: modern GPUs do have on-chip caches, although it is the case that prefetching data is not a particularly effective strategy for GPUs. The caches are quite small considering the number of threads that are trying to use them, so you can't really afford to speculatively load data that some thread might use when you have tens of thousands of threads actively trying to use those caches to hold data they've actually requested. So prefetching is not an effective strategy, and this is also not a correct answer.
So this burst utilization, as Stratton and company call it, is usually the most critical factor to optimize for global memory performance, and that's where the example of transforming an array of structures into a structure of arrays comes in. But there are also subtler issues, like partition camping, that can occasionally rear their heads. So as before, I'll use the ninja icon to indicate topics that the average GPU programmer probably won't need to concern themselves with. Now, a more sophisticated example of the data layout transformation technique is something called the array of structures of tiled arrays, or ASTA. This technique can address issues like partition camping, and can potentially achieve even better bandwidth utilization on a wider range of architectures. You'll find a reference in the supplementary materials.
So the idea of the data layout transformation is to reorganize the way data is laid out in memory to facilitate better memory performance. A really common example of this is something we've seen before, where we take data which is laid out as an array of structures, like so, and transform it into a structure of arrays. Here's what those layouts would look like in memory. In the first case, the different fields of each structure are laid out one after another, with the structures adjacent to each other. In the second case, all the a fields are laid out adjacently, then all the b fields, and so on. So as a quick quiz: which of these two layouts, the array of structures, or AoS, or the structure of arrays, SoA, will perform better on the little code snippets I've given here? So you set i equal to the thread index, and then we access field a for a given thread, and then fields b, c, and d. And over here I've written the same code, but transformed it to use the other data layout. So check off which of these two data layouts you think will perform better.
The answer, of course, is the second one, structure of arrays, because you're going to have many threads executing this code snippet at the same time. In the first case, each thread will be accessing element a from a different place in memory, followed by element b, element c, and so forth. Whereas in the second case, all these threads running at the same time will access adjacent elements. And remember why this matters: because the DRAM that implements the global memory accesses data in large chunks, or bursts. So you'll get much better utilization of one of these bursts, a fixed-size memory transaction, when you have many threads that are simultaneously trying to access memory that will be contained within that transaction.
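To make the quiz concrete, here's a minimal CUDA sketch of the two layouts. The struct, field, and kernel names are illustrative, not the exact snippets from the video:

```cuda
// Array of structures: thread i's four fields are packed together, so
// adjacent threads read addresses 16 bytes apart -- poor coalescing.
struct Point { float a, b, c, d; };

__global__ void readAoS(const Point *in, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i].a + in[i].b + in[i].c + in[i].d;
}

// Structure of arrays: adjacent threads read adjacent floats from each
// array, so every access coalesces into a few wide DRAM bursts.
struct Points { float *a, *b, *c, *d; };

__global__ void readSoA(Points in, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in.a[i] + in.b[i] + in.c[i] + in.d[i];
}
```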
So the next technique in Stratton's optimization taxonomy is called the scatter-to-gather transformation. Now, most interesting computations involve a step where many input elements contribute to the computation of an output element. So in this example, the elements labeled blue in the input array are contributing to the blue element in the output array, the elements labeled green in the input array are contributing to the green element in the output array, and so forth. Examples from your homework included blurring an image and computing a histogram. And sometimes, of course, the locations of the input elements that will contribute to a particular output element aren't known ahead of time, like in the histogram example. Now, we've talked about the difference between scatter and gather. In scatter, the threads are assigned to the input elements, and each one decides where it needs to write. In gather, the threads are assigned to the output elements, and each one decides where it needs to go read from. So here's a quick quiz, which should be pure review. Which code snippet represents a gather, and which one will run more efficiently? Here's code snippet 1, and here's code snippet 2. In each of these cases, you can assume that out and in are arrays of floating point numbers, and that i represents the thread index.
And of course, code snippet 2 represents a gather, since each thread is going to access three elements of the in array, do some math, and store the result into a single place in the out array. And that gather operation is going to run much faster, right? You probably noticed that, in fact, the first operation is accessing more memory, since it has to read the three output locations in order to increment them. And it won't even run correctly unless we insert some synchronization barrier in here, or use atomic operations, to guarantee that one thread doesn't read a value while another one is in the process of updating it. Either of those would further reduce performance. And this is typical: a gather pattern results in many overlapping reads, which the hardware can typically handle more efficiently than it can handle the many potentially conflicting writes you tend to get from the scatter pattern. Now, if the pattern of reads can't be determined ahead of time, you may need to combine this with Stratton's technique five, binning. We'll come back to that.
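Here's a hedged sketch of the two patterns as CUDA kernels, using a simple three-point averaging stencil as the stand-in computation; the kernel names are mine, and boundary handling is omitted for brevity:

```cuda
// Scatter: thread i reads one input and adds its contribution into three
// outputs. The conflicting read-modify-writes force atomics, which
// serialize access and slow things down.
__global__ void stencilScatter(float *out, const float *in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    atomicAdd(&out[i - 1], in[i] / 3.0f);   // assumes interior i only
    atomicAdd(&out[i],     in[i] / 3.0f);
    atomicAdd(&out[i + 1], in[i] / 3.0f);
}

// Gather: thread i reads three inputs and writes one output exactly once.
// The overlapping reads are cheap; no synchronization is needed.
__global__ void stencilGather(float *out, const float *in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;   // assumes interior i only
}
```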
Stratton calls the third technique tiling. Now, often multiple threads, or more generally, multiple computational tasks, will need to access overlapping parts of a data structure. Stratton uses the term tiling to refer to buffering data into fast on-chip storage where it will be repeatedly accessed. Now, this is an old, old idea. On the CPU, it generally means sizing the data accessed by a program so that it will fit naturally into the CPU core's on-chip cache; there's an implicit copy going on between the locations in DRAM and the fast on-chip storage, in this case, the cache. Another way to do it is to explicitly allocate some of this fast memory to be used as a scratchpad, and on the GPU, in CUDA, this is what we call shared memory. And then we can have a set of threads that do the explicit copy, pulling data in from DRAM and explicitly laying it out in this scratchpad memory, the shared memory, just the way we want it, where they will operate on it. It's worth emphasizing that the GPU runs so many threads that the amount of cache per thread is limited. So this style of cache blocking, which is what you'd call this technique on a CPU, isn't so great for the GPU. A better approach is to explicitly copy data that will be reused by the threads in a thread block into the fast shared memory that's set aside for that block. Let's have a quiz. Which of the following two code snippets would benefit from tiling? The first snippet averages values from five arrays; the second averages five values from one array.
These kernels look similar: both of them read five values from memory, perform the same arithmetic operations, and write a single value to memory. And you'll get good coalescing in both cases. But the second kernel has a great deal of reuse: two adjacent threads will be reading four of the same five values. So tiling the in array, by copying it into shared memory before performing the averaging operation, will capture that reuse and make the code run faster. In the first kernel, each value from the a, b, c, d, and e arrays gets used only once, so tiling will just add an extra step without enabling any reuse. So the second code snippet will benefit from tiling, and the first won't; in fact, the first will probably slow down.
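A tiled version of the second, five-point-average kernel might look like the sketch below: each 128-thread block cooperatively stages the window of input it needs into shared memory once, then every thread averages from there. The kernel name, block size, and interior-only indexing are my assumptions, not the exact homework code:

```cuda
#define BLOCK 128

__global__ void averageTiled(float *out, const float *in) {
    // This block's 128 elements plus a 2-element halo on each side.
    __shared__ float tile[BLOCK + 4];

    int i  = blockIdx.x * blockDim.x + threadIdx.x;
    int lo = blockIdx.x * blockDim.x - 2;       // first element this block needs

    // Cooperatively copy BLOCK + 4 values from global to shared memory.
    // (Assumes we only launch over interior elements, so in[lo] is valid.)
    for (int j = threadIdx.x; j < BLOCK + 4; j += blockDim.x)
        tile[j] = in[lo + j];
    __syncthreads();                            // wait until the tile is loaded

    int t = threadIdx.x + 2;                    // this thread's center in the tile
    out[i] = (tile[t - 2] + tile[t - 1] + tile[t] +
              tile[t + 1] + tile[t + 2]) / 5.0f;
}
```

Each input value is now read from DRAM once per block instead of up to five times, which is exactly the reuse tiling is meant to capture.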
So, as a programming exercise, I want you to take this code and tile both foo and bar, using thread blocks that have 128 threads and a tile size of 128. I've structured it so that it'll be safe for you to ignore the boundary effects at the beginning and end of the input and output arrays; we're only calling this on a portion of the array in the middle. But of course, you can't ignore the boundary effects that happen at the boundaries between thread blocks. So I'm looking for the percentage of the time of the non-tiled version of the code that your tiled version takes. For example, if your code ran exactly as fast as the non-tiled version, then it would be 100% of the time, and if your code ran twice as fast as the non-tiled version, then it would take 50% of the time. Give me the percentage of time for both foo and bar after you've done the tiling.
The next optimization technique in Stratton's taxonomy is called privatization. So where tiling exploited the fact that multiple threads can read from the same memory locations, sharing input among threads is good, privatization helps avoid having multiple threads write to the same output locations. So privatization takes some output data shared by multiple threads and duplicates it, so that different parallel threads can operate on their own private copy. The private copies are eventually merged together to recreate the result that would have been produced in the shared data used by all the threads before privatization. Histogram makes a good example. As you've learned from our previous discussions and problem sets on histograms, there are a bunch of ways you can approach the problem. In particular, you can program the kernel to have each thread keep its own local histogram as it looks at a bunch of pixels, and then use atomics or a global reduce operation to combine the buckets from the histograms from all the threads into one set of shared results. You can also apply this idea hierarchically. For example, you can have each thread block compute a combined histogram in shared memory, have the other thread blocks do the same, and then accumulate the results into a single global histogram. So in this example, we would have per-thread privatized histograms and per-block privatized histograms. Now, in practice, the GPU has so many threads that you don't want each one to have a huge private data structure. So this approach might work well for a small histogram, say five or ten buckets, but you would probably skip the per-thread privatized copies for a histogram with hundreds of buckets, and just do the private version per block.
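The per-block level of that hierarchy can be sketched as a CUDA kernel like the one below, assuming a small bucket count and a block size of at least NUM_BINS threads; the names are illustrative:

```cuda
#define NUM_BINS 10   // small histogram: fits easily in shared memory

__global__ void histPrivatized(unsigned int *globalHist,
                               const unsigned char *values, int n) {
    // This block's private copy of the histogram.
    __shared__ unsigned int localHist[NUM_BINS];
    if (threadIdx.x < NUM_BINS) localHist[threadIdx.x] = 0;
    __syncthreads();

    // Grid-stride loop: atomics here only contend with threads in this
    // block, and they go to fast shared memory, not DRAM.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&localHist[values[i]], 1);
    __syncthreads();

    // Merge this block's private copy into the shared global result.
    if (threadIdx.x < NUM_BINS)
        atomicAdd(&globalHist[threadIdx.x], localHist[threadIdx.x]);
}
```

Only the final merge touches global memory with atomics, so contention on the shared output drops from once per pixel to once per bin per block.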
Okay, let's turn to Stratton's fifth technique, which he refers to as binning, or spatial data structures. So, say you've got an algorithm where certain input elements will contribute to certain output elements. In technique 2, we talked about transforming this from a scatter operation, where you assign threads to the input elements to perform the computation and they figure out where to write, to a gather operation, where you assign threads to the output elements and they figure out where to go read from. Now, sometimes it can be hard to orchestrate this gather operation, right? It can be hard to figure out which inputs matter for a given output. One thing you can always do is have every thread assigned to an output element check every possible input element to see if that input contributes to its output. But obviously, that's going to be ruinously expensive if you take it too far. So what we want to do is optimize away most of the redundant work that all of these threads would be doing if they were checking all of these inputs to see if they contribute. And we're going to call this binning. So, binning is when you build a data structure that maps output locations to the relevant input data. And a point that I want to make is that this will work best when you're mapping output locations to a relatively small subset of the input data. Stratton and company call the creation of this data structure binning because it often reflects a sorting of the input into bins, each representing a region of space containing those input elements that are relevant. So, let's look at an example.
Let's suppose that you're trying to figure out the best places to open a chain of stores in the United States. So let's say you have a list of cities in the United States. Here I traced a map of the United States on the tablet, I can't draw this well, and I dotted in, I think it was, the 50 largest cities in the U.S. And let's say that you're trying to open a chain of stores across the United States, and you want to know the accessible population of each city, meaning, how many people could easily get to a store in this city? So let's define the accessible population as all the people that live in cities within 300 kilometers of this city. These are people who could, you know, go for a long drive and get to your store. How are we going to find the accessible population of each city? You could launch a thread for each city and have it look at every other city. You get the idea. It could compute the distance to all those other cities, and if one is under 300 kilometers away, add the population of that other city to a running total. In practice, most of these would be more than 300 kilometers away, and you'd just discard them. That would be really inefficient, right? You can imagine that having every city check every other city, I've only drawn 50 cities here, but if I'd drawn the 2,000, or 3,000, or 10,000 cities that actually exist in the United States, this would quickly become way too much work. So instead, we could create a spatial data structure by overlaying, let's say, a grid. So this cell will be empty, this cell will have three cities, this one will have four cities, I think that one has three, this one has one, and so on. Right, so we could make a list of all the cities that fall into each grid cell. And now, if you want to find all cities within 300 kilometers of a given city, say Denver here, you only need to look at the grid cell that that city falls in and the neighboring cells.
Right, so I need to check these three here, there are two cities so close together you can't even tell, that's Denver and Boulder, and I need to check these cities. So I still have several checks to make, but this is vastly fewer checks than I would have to make if I were checking all of the cities against all of the other cities. So by creating these grid cells, these bins that contain the cities, and therefore being able to restrict which bins I need to look in, I've drastically reduced the amount of work I need to do to find this accessible population that I'm trying to figure out. So this is a great example of binning. So let's have a quiz. Which of the following operations would probably benefit from the geographic binning stage described above? Suppose that for all US cities in the list, we wanted to compute a histogram of city distances from a central warehouse in Denver, Colorado. What if you had a list of roads, each of which was represented as a list of line segments, and you wanted to compute the population of cities within 20 kilometers of each road, would it help to use the geographic binning that we described? Finally, most US cities are in geographic administrative divisions called counties. To give you a sense of what a county is, if you're not from the US, I've pulled out the Wikipedia map of US counties and zoomed in so you can see some of the counties in a few states. As you can see, there are thousands of counties, about 3,000 across the US. So given a list of counties, and a list of cities, each of which contains an index to its containing county, what if you needed to compute the city-dwelling population of all counties? Would this geographic binning operation help?
To compute the histogram of city distances from Denver, for every city you would compute its distance from Denver, and you would build a histogram using the techniques that you've already studied. And so, a histogram is essentially a binning operation, right? In this case, it's binning on the one-dimensional quantity of distance. So although you would use a binning step in the process of creating that histogram, it really wouldn't be the geographic binning that we discussed, the sort of two-dimensional geographic binning that we used to accelerate our computation of which cities were within 300 kilometers. So it's not useful to use that geographic binning to compute a histogram of city distances, even though this is still a geographic problem and a binning problem. On the other hand, if you have a list of roads, and each road is a list of line segments, you could easily imagine launching a thread per road, or a thread per line segment, and using that geographic binning operation to compute which bins, which of the grid cells, that line segment intersects, and then computing the population of all cities within those grid cells or nearby grid cells. So yes, a geographic binning operation like we described would accelerate this kind of operation. And finally, here's a case where the geographic binning is simply unnecessary work. Right, so if we have a list of counties, and we have a list of cities that each already have a pointer to the county they're in, then computing the city-dwelling population of all the counties is simply a matter of having every city update its county. One of the points I wanted to make here is that, in fact, every city is in at most one county, and so you already have a binning in some sense; the counties provide a binning. And so even though this isn't a geographic or spatial data structure, this index of counties, and the pointers from cities to counties, gives you a binning. It would be redundant to apply an additional binning step in this case. So once again, this one is not really going to benefit from the geographic binning operation that we described.
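One common way to build the grid bins on the GPU is in phases: first count how many cities land in each cell, then (after an exclusive scan of the counts, not shown) scatter each city's index into its cell's slice. Here's a hedged sketch of the counting phase; the City struct, names, and parameters are all hypothetical:

```cuda
struct City { float x, y; };    // position in, say, kilometers

// Phase 1: each thread computes its city's grid cell and bumps a counter.
// With 300 km cells, checking a cell and its 8 neighbors covers the
// whole 300 km search radius around any city in the center cell.
__global__ void countPerCell(unsigned int *cellCount, const City *cities,
                             int nCities, float cellSize, int gridW) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nCities) return;
    int cx = (int)(cities[i].x / cellSize);
    int cy = (int)(cities[i].y / cellSize);
    atomicAdd(&cellCount[cy * gridW + cx], 1);
}
```

A full implementation would follow with an exclusive scan of cellCount to get each cell's starting offset, and a second pass scattering city indices into place, giving every output thread a small, known set of inputs to gather from.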
So the Stratton taxonomy's sixth optimization technique is called compaction. Now, John has described compaction a couple of times: he described it in unit four and revisited it in unit six, where it comes up in sparse matrix-vector multiply and in graph traversal. So I'm not going to go to great lengths describing what compaction is. Just as a quick review, compaction is useful when you have a computation to perform on a set of elements, but the exact elements that require computation are not known in advance. If most of the elements require computation, then it's simplest just to assign threads to every element and have every thread check whether it should process or not. The threads that are active will do whatever they're supposed to do; the threads that are not will simply return. In that case, you'll end up storing some valid output in most cells, and the elements in the output where the thread wasn't active, because they didn't need to be processed, you'll end up leaving blank. Now, if relatively few of the elements require computation, then you're wasting a lot of storage on the output, and you're doing less efficient computation, because a lot of your threads are sitting idle instead of doing some sort of useful work. Compacting the elements means creating a dense array containing only the active elements, so that you can then do your processing on this dense array, producing a dense output array directly. You're not wasting storage on all these invalid, null output elements, and you're not wasting computation on all these threads that are just sitting around doing nothing while the relatively few threads in their thread block that have active elements have anything useful to do. So as a quiz, suppose that only every eighth element will be processed.
So here I've drawn the elements that will be processed, and all the ones I've left blank will not be processed. The idea is that every eighth element requires processing. And assume that we're compute-limited, in other words, that whatever operation we're going to do on these every-eighth elements is expensive, and that's the limiting factor, rather than the memory bandwidth to fetch elements. So assume that we are limited by computation and not memory bandwidth. So the question is, about how much less time will the computation take if a thread is launched to run on every element of the compacted array, rather than if a thread is launched to run on every element of the original array? In other words, how much faster is it going to be to run on the compacted array? I also want to ask: what if only every thirty-second element will be processed, and what if only every 128th element must be processed? So for the answer, I'm looking for an integer, like four, or eight, or sixteen, or thirty-two, or one. And you're telling me whether it's four times faster, or eight times faster, or one times faster, and so on.
In unit five, remember, we learned that the GPU hardware processes groups of threads called warps, and that every thread in a warp performs the same instruction at the same time. This means that if some threads are not actively doing computation, they're still going to have to wait while the active threads in the warp do their thing. Now, today's GPUs, in fact all GPUs that NVIDIA has ever made, have 32 threads per warp. In the first case, only four out of every 32 threads are actually doing something, and the rest of the threads are just along for the ride. In the compacted case, if we can compact this down to a dense array, where all four of these elements are next to each other, and the remaining active elements from elsewhere in the array are all compacted into one dense array, then all 32 threads in the warp are going to be doing active work, and so you'll finish the total work eight times faster. So that's our answer: we're going to go eight times faster if every eighth element is active in the original array and we compact it and operate only on the compacted array. And by the same reasoning, if only one of the threads in a warp operating on the original array has anything useful to do, then you'll go 32 times faster, which is a big deal. You can see why compaction can lead to big speedups. But the other subtlety to remember about warps is that if all the threads in a warp take the same branch, so in the code snippet I showed before, if all 32 threads in the warp check to see if they're active, decide they aren't, and then exit, then you pay almost no penalty for that. That warp is going to fire up, all its threads are going to see they have nothing to do, and it's going to disappear. So the warps that are entirely empty will exit immediately. And so the case where only one element in 128 is active will still only be about 32 times faster, not 128 times faster, because in every group of four warps there's one warp that's going to have some work to do.
And 31 of the threads in that warp are going to be sitting around waiting for the single thread that has some useful work to do. But all of the threads in the entirely empty warps, that have nothing to do, exit quite quickly. So you pay a very slight penalty for having launched those warps and having them immediately exit, but on balance, it's still going to be close to 32 times faster than operating on the original array.
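A minimal compaction sketch, using a single global atomic counter to hand out output slots; the predicate and names are illustrative. A production version would typically use an exclusive scan over predicate flags instead, which preserves element order and avoids contention on one counter:

```cuda
// Pass 1: build a dense array of the indices of active elements.
__global__ void compactIndices(int *outIdx, unsigned int *outCount,
                               const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] > 0.0f) {                          // "active" predicate (illustrative)
        unsigned int slot = atomicAdd(outCount, 1);
        outIdx[slot] = i;                        // note: order is not preserved
    }
}
```

A follow-up kernel then launches one thread per entry of outIdx, so every warp is full of active work, which is where the 8x or 32x speedup from the quiz comes from.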
Stratton and colleagues call their final technique regularization. Regularization is all about load balancing, a problem that's plagued parallel programming since its inception. Now, to illustrate the idea of load imbalance, let's go back to our example from technique 5, where we were binning the cities of the United States. If you look at this map, you'll notice that the cities aren't spread out evenly. Some grid cells are going to have a lot of cities, like up in this region, and some grid cells are going to have relatively few cities. So any threads that are adding up the accessible population for cities in these crowded cells have a lot of work to do, right? They've got a lot of cities in the neighboring cells to look at and compute distances to. Meanwhile, other threads in the same warp or the same block might be responsible for a city in one of these sparse grid cells and have a lot less work to do, and so they'll complete quickly and sit around waiting for the long-running threads. This is a classic example of load imbalance. So, regularization is all about reorganizing the input data to reduce load imbalance; in other words, you're taking irregular parallelism and turning it into regular parallelism. For example, in the US cities example we've been looking at, some bins might contain more cities than the average bin. So you can imagine provisioning each bin to hold a fixed number of cities, say 5, and then having a special way of handling the relatively few overflow cities. For example, you might handle the cities or grid cells that overflow on the CPU, or you might use a different kernel that implements a different algorithm, such as sorting the cities by longitude or latitude.
There are many ways you could do the computation we were trying to do, figuring out which cities were within 300 kilometers of each other. And now the storage and the kernel code get a lot simpler, right? Every bin has 5 slots. Each thread loops over exactly 5 cities. And the fact that you'll occasionally waste a bit of compute, while one thread sits idle for a slot because it only had 3 or 4 cities in its bin, is more than made up for by the fact that you never have an entire warp, or an entire thread block, or an entire kernel, waiting on a single thread that has far more work to do than the rest.
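The fixed-slot idea can be sketched like this; the names, layout, and the convention of using -1 as an empty-slot sentinel are my assumptions. Overflow cities are assumed to be handled elsewhere, for example on the CPU:

```cuda
#define SLOTS 5   // every bin holds exactly SLOTS city slots

// binCity[b * SLOTS + s] holds the index of the s-th city in bin b,
// or -1 if that slot is empty.
__global__ void sumBinPopulation(unsigned int *out, const int *binCity,
                                 const unsigned int *population, int nBins) {
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= nBins) return;
    unsigned int total = 0;
    for (int s = 0; s < SLOTS; ++s) {            // fixed trip count: regular work
        int city = binCity[b * SLOTS + s];
        if (city >= 0) total += population[city];
    }
    out[b] = total;
}
```

Every thread does the same fixed amount of work, so no warp stalls on a neighbor whose bin happened to be oversized.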
Let's have a quiz on regularization. So check the statements that are true. Regularization will likely have the most impact when the load is relatively evenly distributed among tasks, but with some significant outliers. So for example, most U.S. counties contain at most one city, but there are a few that have, you know, dozens of cities. Or is it the case that regularization will have the most impact when the load varies wildly from task to task? And by wildly, I really mean there's no real average case. An example of this would be popularity: some people have claimed that celebrity popularity, if you measure something like the number of followers somebody has on Twitter, follows a power law distribution, and a power law looks something like this. If true, then this would mean that at any level of popularity, a roughly fixed fraction of people will be twice as popular. So in this case, you might say that 20% of the people have twice as many followers as the remaining 80% of the people below the line, and that stays true no matter where you draw the line. That would be a power law. And in a case like this, there's no real average case; a few very popular people have as many followers as the rest of the world combined. There's an interesting discussion of this that I found online; I'll put it into the instructor's notes. Finally, is it true to say that regularization will likely have the most impact when the application can predict load imbalance at run time? In other words, when it can predict where and how load imbalance will occur. An example here is that it might be a lot easier to add up the number of cities in a grid cell and notice that there's going to be load imbalance, than to actually perform some heavyweight computation on each city.
So this first example is a good case for regularization. The load is distributed so that there's some regular sort of average case that you can shoot for, and that's relatively easy to predict. And then occasionally you'll have outliers, and you do something about them, such as dealing with them separately in another algorithm or on another processor, as we discussed before. So I would consider this a good case for regularization; it's likely that we can come up with a regularization strategy here that'll work. On the other hand, when the load varies wildly, it's not really easy to predict what the average case is. You have no idea when you start a task how long it's going to take, or how many items you'll have to process. So I would say this is not an easy case for regularization, which is really just restated in the final statement. The whole goal of regularization is to extract regular parallelism from irregular parallelism, so you have to be able to know what is regular. If you can predict where and how load imbalance is going to take place at run time, then you can structure the application to deal with the load imbalance on the spot. So when the load is relatively evenly distributed, it's almost regular, and dealing separately with occasional irregularities is likely to work. If the load varies wildly, then it's harder to know what to regularize to. And it's much easier to regularize the workload if you can predict what that workload is going to be.
Okay, and that's Stratton's taxonomy. Data layout transformation. Transforming scatter patterns to gather patterns. Tiling, so that we can share input among multiple threads. Privatization, so that we can avoid sharing output as long as possible. Binning and spatial data structures, so that we can reduce the amount of input that's necessary to look at for a given output. Compaction, so that we can avoid processing inactive inputs and process a sparse set of inputs more efficiently. And regularization, where we extract regular parallelism from irregular parallelism, often in combination with strategies like binning and spatial data structures, or the scatter-to-gather transformation. Now, you've encountered most of these ideas already in the class, but I do hope this overview helps you think consciously about the strategies you can employ to optimize parallel programs. I definitely recommend that you read the original papers if you're interested in learning more. They give examples from many different benchmarks and describe case studies for some of these techniques, and I find them useful in helping me be principled in my thinking about parallel optimization. I'll include links to those papers in the instructor notes and on the wiki.
Let's shift gears now and talk about ways to avoid programming in the first place. There are many libraries available for CUDA, and more come out every day. There are some libraries developed by NVIDIA, some by third parties; there are open source libraries and commercial products. And of course, there's no way I can cover all of the available libraries out there, but I do want to hit a few highlights and talk about some of the more popular and useful ones. The libraries I'm going to talk about are, for the most part, fairly mature. They're designed and optimized for performance by experts, and they get tuned up and re-released every time a new GPU architecture comes out. So if you can use these libraries, you should. The first one I want to highlight is called cuBLAS, and cuBLAS is an implementation of BLAS, the Basic Linear Algebra Subroutines. This is a venerable library that's been around for a long time. It's used from Fortran and C, and scientists and engineers everywhere use it as one of their go-to workhorses for dealing with dense linear algebra. Next is cuFFT. FFT stands for Fast Fourier Transform, and so, not surprisingly, this is the CUDA Fast Fourier Transform. This includes various batched transforms, as well as support for real-to-complex, complex-to-complex FFTs, and so forth. It has an interface similar in many ways to FFTW, the popular Fastest Fourier Transform in the West library, so it's a familiar interface to anybody who uses FFTs as part of their bread-and-butter toolbox. cuSPARSE provides BLAS-like routines for doing linear algebra on sparse matrix formats. Sparse matrices are, as you know, matrices that are mostly zero and are therefore stored in some sort of compressed format. cuSPARSE supports a variety of formats and includes higher-level routines like incomplete LU factorization.
cuRAND is a bunch of pseudo- and quasi-random number generation routines. This includes device-side functions, as well as host interfaces for quickly filling arrays with numbers drawn from particular distributions; basically, various high-quality random number generators. NPP stands for NVIDIA Performance Primitives, and this is, for the most part, highly optimized low-level primitives for image processing. Magma is developed by the same group that wrote the original LAPACK library. LAPACK is another one of those tools in the toolbox of many scientists and engineers who use it for linear algebra all the time. Magma provides GPU and multicore CPU implementations of many LAPACK routines, so it's sort of a modern parallel rethinking and implementation of LAPACK-style linear algebra. Another linear algebra tool is CULA, which provides implementations of eigensolvers, matrix factorizations, and matrix solvers for dense matrices, with an API similar to LAPACK's, and there's a sparse CULA package as well. And ArrayFire is a framework for data parallel manipulation of different types of array data, including a wide variety of built-in operations from numerical linear algebra to signal processing to financial analysis. So ArrayFire sits somewhere between a domain-specific library like these others and a programming power tool, which is the next category we'll talk about.
Assuming that you're already using a library, say the equivalent of one of these libraries on the CPU, then porting to use one of the GPU libraries is a simple three-step process. First, you substitute library calls with their equivalent CUDA library calls. So if you're using BLAS and you call the function SAXPY, which stands for single-precision AX plus Y, then you simply change that to a call to the cuBLAS SAXPY. Next, you need to manage data locality: if you're going to be running on the GPU, you want the data to already be on the GPU. So, for example, you might need to call cudaMalloc and cudaMemcpy to move the data onto the GPU where the library can use it. cuBLAS has cublasAlloc and cublasSetVector, and cublasGetVector for getting the result back. Some of the libraries manage this entirely transparently for you; with others, you do it explicitly. And finally, step three, you rebuild and link with the CUDA-accelerated library, for example by calling the compiler with the -lcublas flag.
Here's an example of some simple BLAS code. First we initialize N to 2 to the 20th. The S in SAXPY stands for single precision, and AXPY stands for AX plus Y. So we're going to set y equal to ax plus y by calling SAXPY, and we're going to do this on a million elements, or more exactly N elements. So here we're going to say y is equal to a, which is 2.0, times x, plus y. This is the original BLAS code. If we want to run this on the GPU, then the first thing we're going to do is change the comment at the top. So the first thing we're going to do after we change the comment at the top is add a cuBLAS prefix to the call and change these to be device variables. Next we need to initialize cuBLAS, and shut it down again when we're done. And now we need to allocate those device vectors, and free them when we're done. Finally, we need to set the vectors, copying the data over to the GPU, and get the result back. Okay, so by the addition of just a few lines, we've accelerated this using cuBLAS. And I'll point out that, of course, if you were doing more than just this one simple SAXPY operation, you'd probably leave the intermediate results on the GPU. You wouldn't keep setting and getting your vectors back and forth to the CPU; you'd do some operation, then use the results to do the next operation, and so forth. So a lot of this overhead really only appears once, at the beginning and end of your calculation.
Now, some libraries are less about solving a particular domain of problems and more about enabling programmers to design their own solutions. I call these programming power tools, and the first example I'm going to highlight is called Thrust. C++ programmers will know about the Standard Template Library, or STL. The STL provides a set of incredibly useful abstractions, organized into containers and iterators. Thrust provides a data parallel analog to the STL for CUDA, and also borrows niceties from the popular Boost library. Thrust allows host code that runs on the CPU to very easily create and manipulate data on the GPU. The principal container abstraction in Thrust is the device_vector. This is analogous to the vector container in the STL, but as the name implies, it lives on the device, on the GPU. Like the STL vector container, Thrust device_vectors are generic containers, which means they're able to hold any data type, and they can be resized dynamically. Once your data is in a device_vector, you can do a bunch of things trivially. You can sort the device_vector, perform a scan operation on it, or do a reduction operation. Or a reduce-by-key operation, where you take a vector and reduce it according to a key that's in another vector. So, for example, if you have a device_vector x, in this case with three elements of type float, then I can do a reduction on those three elements by just saying thrust::reduce and passing in x.begin() and x.end() to indicate that I'm going to reduce over the entire range of the vector, and it will give me the result. Or you can do a different sort of reduction: rather than simply doing a sum reduce, we can pass in an operator or function to use. In this case we'll pass thrust::maximum, so this reduction will give us the maximum number in the vector.
You can also do more general transformations; you can transform one or more input vectors into an output vector. For instance, you could do a vector addition operation, taking two vectors and adding them into a third vector using the built-in thrust::plus operator. Or you can apply a user-defined transformation using a functor, which is a chunk of code that you give Thrust and tell it to run on every element, or on the collection of input and output elements that you're interested in. And you can interoperate with CUDA code; you can mix up Thrust code, which runs on the host and does things on the device, with your own kernels. If you need to do something that is not built into Thrust, or is difficult to do with Thrust, and you've got some raw CUDA kernels that you want to run, then you can simply get a raw pointer to the data in a device_vector, or hand a pointer to Thrust to wrap up as a device_vector. In the end, using Thrust is a good idea because it helps you avoid writing a lot of error-prone, cut-and-paste boilerplate code. And you can exploit high-performance implementations of data parallel primitives like sort, scan, reduce, and reduce-by-key. This has all been implemented by CUDA ninjas for you, so that you can just reuse that code. So Thrust can be very handy. We're going to give you Thrust code that sorts 10 to the 6th floats, and we're going to ask you to time it. Then we're going to ask you to also time sorting 10 to the 5th and 10 to the 7th floats. Enter here how long it took to sort 10 to the 5th, 10 to the 6th, and 10 to the 7th floats. Give your answer in milliseconds, and we'll check it to within a significant digit or so.
So Thrust gives a host-side interface that can replace writing CUDA kernels in many cases. But for those times when you are writing CUDA kernels, you'd still like to achieve software reuse. You'd like to identify the building blocks in your kernel, use somebody else's highly optimized implementations of those building blocks, and stitch them together with any custom code of your own. But software reuse in CUDA kernel code faces some special challenges. Suppose, for example, that you're calling a library implementation of, say, radix sort, somebody else's implementation that you're going to call from inside your kernel. The question is, how does that implementation know how many threads it's running in? How much shared memory is it allowed to use? All of the shared memory, or do you reserve some for other purposes? Should it use a work-efficient or a step-efficient algorithm for this piece of the problem? The answer probably depends on what you're sorting, how much you're sorting, and exactly what's important to you right now. The CUB library, by Duane Merrill, provides a neat solution to these challenges of software reuse for CUDA kernel code at high performance. CUB stands for CUDA Unbound, and it's called that because the implementations in CUB leave decisions like the number of threads and the amount of shared memory unbound; you, the programmer, bind the necessary parameters when you write your program, using templates and a novel binding approach that's embedded in the C++ type system. The building blocks that CUB provides are really high-performance implementations, so by stitching them together, you can create a high-performance CUDA kernel. And finally, because CUB doesn't dictate things like thread block size or shared memory usage, it allows you to experiment with these values as you optimize or auto-tune your code.
And one thing you'll find, if you get into optimizing CUDA code, is that this kind of auto-tuning, where you experiment with, or even automatically sweep, the parameters of your code, things like the block size and the shared memory size, is a really powerful tool for extracting the maximum amount of performance from your code without investing a whole lot of personal effort.
Now let me focus on a specific aspect of writing high-performance kernel code. At a high level, GPU programming looks like this: there's a bunch of data sitting in global memory, and you have an algorithm that you want to run on that data. That algorithm will get executed by threads running on SMs, and for various reasons, you might want to stage the data through shared memory. And you've seen for yourself that loading or storing global memory into shared memory, or into local variables of the threads, can be complicated if you are striving for really high performance. Think about problem set two, where you loaded an image tile into shared memory and also had to load a halo of extra cells around the image tile in order to account for the width of the blur. This can be tricky because the number of threads that want to perform a computation on the pixels in that tile is naturally a different number than the number of threads you would launch if your goal was simply to load the pixels in the tile as well as the pixels outside the tile. Or think about the tile transpose example in unit five, where we staged through shared memory in order to pay careful attention to coalescing our global memory accesses. Think about our discussion of Little's Law, and the trade-offs we went over between latency, bandwidth, occupancy, the number of threads, and the transaction size per thread. Remember that we saw a graph that looked sort of like this: the bandwidth we achieved as we increased the number of threads was higher when each thread was able to access 4 floating-point values in a single load, versus 2 floating-point values in a single load, versus a single floating-point value per load. Finally, there are ninja-level optimizations that we haven't even talked about in this class, like using the Kepler LDG intrinsic. In short, CUDA's explicit use of user-managed shared memory enables predictable, high performance.
In contrast, the hardware-managed caches that you find on CPUs can result in unpredictable performance, especially when you have multiple threads sharing the same cache and interfering with each other. So that's an advantage of explicit shared memory. But that advantage comes at the cost of an additional burden on the programmer, who has to explicitly manage the movement of data in and out of global memory. And this is a big part of where CUB comes in. CUB puts an abstraction around the algorithm and its memory access pattern, and deals opaquely with the movement of data from global memory, possibly through shared memory, into the actual local variables of the threads. I want to mention another programming power tool, called CudaDMA, that focuses specifically on the movement of data from global into shared memory.
So CudaDMA is a template library designed to make it easier to use shared memory while achieving high performance. To use CudaDMA, programmers declare CudaDMA objects for each shared memory buffer that needs to be loaded or stored. And the cool thing is that CudaDMA lets you explicitly describe the transfer pattern for that data. So, for example, you might be transferring one long sequential chunk of memory. You might be transferring strided chunks of memory. Or you might be doing indirect access to memory, such as you would find in a sparse matrix representation. As with CUB, decoupling the transfer pattern from the actual processing each thread does to achieve that transfer has several benefits. It improves programmability, because the code is now simpler: you've packaged all of the transfer logic away from the actual compute in your kernel. It improves portability, because the CudaDMA ninjas or the CUB ninjas can develop the very best implementations of these various transfer patterns for your situation and package that all up in a library for you. And because those CUDA ninjas are good at what they do, you get high performance: you're going to achieve high DRAM memory bandwidth, you're going to hide the global memory latency for kernels that don't have a lot of occupancy, and hopefully this will lead to better compiler-generated code. As I said, these benefits really accrue to both CUB, which tackles the whole cycle of bringing data in from global memory and doing the computation on it, and CudaDMA, which tackles just the top part of that cycle, where you bring data into shared memory and then do your own operations on it. So these benefits really accrue to both approaches.
So let's have a programming exercise with the CUB library. Here's a simple example of the CUB BlockScan primitive. This is doing a prefix sum in a single block. You've seen scan a lot by now; you know a lot of the intricacies of how it can be implemented, and here's an example of how easy it is to implement with high performance in the IDE. There are a couple of things different about this from some of the examples you've seen. Again, we're focusing on a single block, because the whole point of CUB is to give you intra-block primitives: code for writing your own thread-block-level code within CUDA kernels. So we're just going to time a single block. And we're going to measure something we haven't quite seen before: we're actually going to measure the scan throughput, and we're measuring that in clocks per element scanned. These are SM clocks: how many actual clock cycles the SM took, averaged across the total number of elements that were scanned by this single thread block. You'll see the built-in clock() function used in this kernel, and the way it's used is obvious. What I want you to do is fill out this matrix. This is a performance matrix, where we have the number of threads per block, which is represented in the code with BLOCK_THREADS, and the number of items processed per thread, which is represented in the code with ITEMS_PER_THREAD. I want you to go ahead and look at the matrix here, experiment with different values of BLOCK_THREADS and ITEMS_PER_THREAD, and try to figure out how these two parameters affect the scan throughput. It would be too much work to fill out this whole matrix, so let me just guide you to a few sweet spots. Much of the interesting action is going to happen along this diagonal. So, for example, 1,024 threads, each in charge of scanning a single item per thread, is the kind of thing you might code up naturally; it might be the first thing you try, since it's the simplest to code up.
And then try to analyze what happens when you do 512 threads by two items, 256 threads by four items, 128 threads by eight items, and then let's go ahead and complete the diagonal; it won't take long at all. All the way down to 32 threads each responsible for 32 items, and 16 threads responsible for 64 items. So what I'm going to have you do is fill in this diagonal, and from the choices along this diagonal, check off the option that gives you the best performance. In other words, the highest scan throughput, which corresponds to the fewest clocks per element scanned.
So when I run all of these, this one is slightly the fastest of all of these on the diagonal: 64 threads per block operating on 16 items each. And the thing to notice is that throughput is generally increasing as the granularity per thread increases, right? So having threads do more and more serial work is generally a good thing. On the other hand, if you've got a fixed tile size, in other words, the tile here is the total number of items that we're scanning over, there are diminishing returns. The improvements get smaller and smaller as you approach this sweet spot, because you're trading off increased granularity per thread for decreased parallelism. And then it starts to go up again; in my measurements on Fermi, which is the same GPU that you'll be using in the Udacity IDE, I get 32 by 32 being slightly slower than 64 threads with 16 items each. At some point, you can no longer fit the problem size in a single thread's registers, and then performance falls off a cliff. For me, on Fermi, 64 items per thread running on only 16 threads really starts to get pretty bad performance again. I encourage you to play around and experiment a little bit more to see what the rest of this matrix looks like. And if you're interested, go check out the CUB homepage and see what else you can do. There are different varieties of BlockScan, for example; there are work-efficient and step-efficient varieties, so experiment a little bit and see how fast you can get this scan throughput, how few clocks you can spend per element scanned. What I found is that, with the right balance, you can have a computational overhead of less than one clock cycle per item scanned. That is pretty remarkable: think about running this code on a CPU; you couldn't achieve less than one clock per item scanned.
It's important to remember that CUDA isn't just a specific language or a set of extensions to the C/C++ language. CUDA is the whole GPU computing architecture. So let's talk about some other platforms that support CUDA. First, we'll talk about other languages that support CUDA; then we'll talk about cross-platform solutions. The lightweight version of targeting CUDA is simply to wrap CUDA C++ in the other language, so that you can transfer data to the GPU and call a CUDA C++ kernel from within the other language. A great example of this is PyCUDA, which allows Python programs to call CUDA C++ kernels, and even to construct them programmatically from within Python. At a deeper level, an increasing number of languages are directly targeting the CUDA architecture. Fans of Python will also want to check out Copperhead. Copperhead is a data parallel subset of Python, using a library of data parallel functions such as map, reduce, filter, scan, and sort. When a Copperhead function is called, the runtime generates Thrust code and calls it from the Python interpreter. Memory is managed by the Python garbage collector and lazily transferred to and from the GPU. A cool thing about Copperhead programs is that they interoperate with Python programs using packages like NumPy or matplotlib; this makes it easier to write entire applications and not just kernels. CUDA Fortran is a product from the Portland Group that does exactly what it sounds like: it integrates CUDA constructs such as thread blocks and shared memory directly into the Fortran language. More on the research front, Halide is a brand-new language specifically designed for image processing. This makes it a DSL, or Domain Specific Language, and it's a really cool example of one. Halide targets GPUs as well as multicore CPUs, on both desktops and mobile devices. And many of the parallel optimization patterns we've discussed, such as tiling, are particularly easy to express and optimize in Halide.
And that makes it possible to get very high-performance image processing code. Finally, the widely used Matlab platform supports CUDA, both for wrapping CUDA kernels and for directly targeting CUDA with Matlab code in its core functions. So let's talk about Matlab a little further.
So one of the GPU computing platforms I'm most excited about is Matlab. Many of you probably already know about Matlab, especially if you're a scientist or an engineer. Matlab is a high-level programming language and development environment that's designed for scientific, numeric, and mathematical computations. Scientists and engineers use Matlab all the time for algorithm development, data analysis, visualization, and mathematical modeling. Matlab supports GPU computing in a couple of ways: by allowing Matlab code to be executed on a GPU, and by providing a way to interface with custom CUDA code. We're going to explore these features using an application that fits the theme of this class, an image processing application called white balancing. The white balance operation is used to adjust the tint of a photograph. So you can see here, if this is the before image, then after a white balance operation it might look like this; you can see that we've removed the sort of reddish tint from the image. This is a simple example of the kind of operation that you might implement in Matlab. It's easy to implement, as you'll see, and it's also easy to accelerate on the GPU as well.
So let's look at a demo. This is what the example white balance routine looks like in Matlab. As you can see, it's quite short. Matlab is a high-level programming language; it's got a simple syntax and a whole lot of built-in math and engineering functions. So this whole routine can be written in just six lines of code. Now, the way this is written, currently this function runs entirely on the CPU. So next, what we're going to do is use the Matlab profiler to identify sections of this code that might be candidates to run on the GPU to improve performance. We run the profiler, and it indicates that this last line is by far the most time-consuming. This line of code is applying an operation to each color channel across every pixel of the image. Such an operation could be sped up when executed on a GPU; this is clearly a good candidate for parallel computing. So before using the GPU, let's first measure how long it takes to execute the white balance function on the CPU, using the tic and toc commands. Notice that it takes approximately 0.124 seconds to execute the code entirely on the CPU. Now, to utilize the GPU for executing that last line of code, there are a couple of options. One thing that we could do is simply transfer the image data to the GPU memory using the gpuArray command, and that's what we're doing here on line 14 of this slightly modified white balance routine. Whenever a gpuArray is passed as input to most of the common core Matlab functions, as in this last line of code, those computations will automatically be executed on the GPU. So here, by adding just that single line of code, notice that the time it takes to execute the white balance function went down from 0.124 seconds to 0.026 seconds. That's a big speedup, especially for a single line of code. The second approach to using the GPU for computing within Matlab is to use the Matlab interface to invoke custom CUDA kernels.
Here's a CUDA kernel that performs the same operation that was causing the original bottleneck in the white balance routine. In this Matlab function, that CUDA kernel is being used to speed up the execution time instead of using gpuArray. The first 12 lines of code are the same as in the original white balance routine. Then, however, Matlab loads the kernel and sets up the threading parameters, creates data on the GPU, and finally invokes the kernel using feval. And now notice that the white balance routine executes in 0.022 seconds using the CUDA kernel. So the application executes faster, but we should still verify that our CUDA kernel does what it's supposed to do. We need to ask the question: do the results from the GPU match the results that we were getting on the CPU? You can check that the outputs match using this simple script, which computes the norm of the difference between the outputs. A norm equal to zero signifies that the two outputs are the same. After running the verification script, we can see that the two outputs are indeed the same, and there we go. So, in summary, Matlab is a nice high-level language. It is used heavily by scientists and engineers all over the world for all kinds of applications, and it's a great way to quickly design and test algorithms that perform computations on a GPU, either by using the gpuArray command or by using an interface directly to CUDA kernels. Furthermore, existing CUDA kernels can be quickly and easily tested inside Matlab, which makes it a great harness for developing, exploring, and testing CUDA kernels. And that's Matlab.
Now, we've focused on the NVIDIA CUDA architecture for this course, but there are also some cross-platform solutions that compile not only to CUDA, but to GPUs and multicore CPUs from AMD, Intel, or ARM. Some of the tools and libraries we've already mentioned, like Halide and Copperhead, specifically target CPU backends. Three other cross-platform solutions worth mentioning are OpenCL, OpenGL Compute, and OpenACC. You'll find that the first two are really similar to CUDA in their overall shape. They share concepts such as thread blocks and shared memory, even though the names and syntax vary; for example, OpenCL refers to work groups rather than thread blocks. But the basic ideas are isomorphic to CUDA, and that's one reason why we focused on CUDA: what you learn here is really applicable in many places. OpenGL Compute, as the name implies, is tightly integrated with the extremely popular OpenGL graphics library, and this makes it a good choice for developers doing graphics or image processing work who need to support multiple GPU vendors. Now, the third option, OpenACC, is a little different. This is a directives-based approach. Directives are annotations that the programmer puts into his or her serial code that help the compiler figure out how to automatically parallelize some of the loops. For example, with the appropriate directives, an OpenACC compiler can transform a nested for loop into a thread launch on CUDA, or a multi-threaded SIMD routine on a multicore CPU. So often, adding just a few lines to an existing code base can get dramatic speedups, and this is what makes OpenACC a great choice for programmers who have a large legacy code base that they want to parallelize. OpenACC makes it easy and incremental to add parallelism to existing code, and it'll be very familiar to programmers who are used to using OpenMP.
The ACC stands for accelerators, and OpenACC is essentially an updating of OpenMP to reflect the evolution of accelerators like the GPU. OpenACC compilers exist for C, C++, and Fortran.
We're just about ready to wrap up the course. We've covered Stratton's taxonomy of parallel optimization patterns, and we've tied those back to the lessons and assignments that you've had throughout the course. We've discussed the wide array of libraries that let you easily exploit the computational horsepower of the GPU. We've reviewed a few programming power tools to help you write high-performance applications with less work. And finally, we've talked about other platforms, including CUDA support for other languages like Python, MATLAB, and Fortran, and the cross-platform solutions for CUDA-style massively parallel programming. So to close the lecture and the course, we're going to hear about one of the more exciting new features in the latest CUDA GPUs: dynamic parallelism. We've invited the real expert on this topic, Dr. Stephen Jones from NVIDIA, to give us a guest lecture on dynamic parallelism. We've also recorded an interview with Stephen for those of you interested in diving a bit deeper afterwards. So let's turn it over to Stephen, and then we'll wrap up the course.
Congratulations, you finished the final unit. We hope you are as excited about GPUs as we are, and are ready to put your learning into practice. It's been great having you in the class.
Ben wants to implement a smoothing operation on a 4096-element one-dimensional array. This array is indexed by i, and the value at each index is V(i). The smoothing operation on element i computes Vnew(i) = 0.25 * V(i-1) + 0.5 * V(i) + 0.25 * V(i+1). And a quick note: at the boundary, assume that all values beyond the boundary, like V(-1) or V(4096), are equal to the value at that boundary, so V(0) or V(4095). The new values Vnew(i) at each step can be computed in parallel; then Vnew becomes V, and we continue. Ben wants to implement this in CUDA, launching enough threads to process the array in parallel, and here is Ben's code. Ben says his code works, but it is running much slower than he would like. So rewrite Ben's code to increase its speed. In particular, can you reduce the number of global memory transfers?
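Since Ben's CUDA code isn't reproduced here, the following is a serial C reference for one smoothing step, under the stated boundary rule (out-of-range indices clamp to the boundary). A fast CUDA version would compute one output element per thread and stage the input tile in shared memory so each global value is read roughly once instead of three times.

```c
/* Clamp an index into [lo, hi]. */
#define CLAMP(i, lo, hi) ((i) < (lo) ? (lo) : ((i) > (hi) ? (hi) : (i)))

/* One smoothing step: Vnew(i) = 0.25*V(i-1) + 0.5*V(i) + 0.25*V(i+1),
   with V(-1) treated as V(0) and V(n) treated as V(n-1).  This is a
   CPU reference sketch, not Ben's kernel; a tiled CUDA kernel would
   produce the same values with far less global memory traffic. */
void smooth_step(const float *v, float *v_new, int n) {
    for (int i = 0; i < n; ++i) {
        float left  = v[CLAMP(i - 1, 0, n - 1)];
        float right = v[CLAMP(i + 1, 0, n - 1)];
        v_new[i] = 0.25f * left + 0.5f * v[i] + 0.25f * right;
    }
}
```

After each step, the roles of `v` and `v_new` swap (double buffering) and the process repeats.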
So, what is the ratio between the amount of global memory traffic incurred by Ben's implementation and the traffic incurred by your optimized implementation? Please fill in the blank here. Your answer does not have to be exact, but it should be within 10 percent of the actual ratio.
Line C in the following graph represents how performance scales for a particular CUDA application on a particular GPU, as a function of the number of blocks launched. Assume this application is strongly compute bound; that is, computation is the performance bottleneck for this problem. If we run this application on a different GPU with twice as many SMs, which curve best represents the performance you might see? Please fill in the box with either A, B, C, D, or E.
Assume this application is strongly compute bound; that is, computation is the performance bottleneck for this problem. If we run the same application on a different GPU with twice the memory bandwidth, which curve best represents the performance that you might see?
Assume this application is strongly memory bandwidth bound; that is, memory bandwidth is the performance bottleneck for this problem. If we run the same application on a different GPU with half as many SMs, which curve best represents the performance that you might see?
Assume this application is totally memory bandwidth bound; that is, memory bandwidth is the performance bottleneck for this problem. If we run the same application on a different GPU with half as much memory bandwidth and twice as many SMs, which curve best represents the performance that you might see?
Assume a CUDA compute capability 3.5 machine. If I want to run the maximum number of threads possible that are all resident on an SM at the same time, each thread must use no more than how many registers? Please fill in the number here.
If I want to run the maximum number of threads that are all resident on the SM at the same time, each thread must use no more than how many bytes of shared memory? Please fill in the blank here.
If we want to maximize the number of blocks resident on the same SM, what is the maximum number of threads we can have per block? Please fill in that number here.
Next, we'll actually implement a fast compact primitive that operates on a single block of data. Recall that a compact inputs a true-or-false flag for every thread and a value for every thread, and outputs only those values for which the input flag is true. So, in this problem, the strategy we'll use for this compact is as follows. In step 1, within each warp, we'll sum up the number of true flags. In step 2, we'll do an exclusive sum scan of the per-warp results. In step 3, we'll do an exclusive sum scan of the flags within each warp, with the starting value for each warp equal to the corresponding output of step 2. In step 4, for each thread whose flag is true, we'll use the resulting address as a scatter address into the output array, and we'll scatter the value into the output array. As an example, consider the following, where we use a warp size of 4 to make the example easy to understand. Here's our input of 4 warps of 4 flags each. Step 1 outputs 2, 2, 4, 1. Step 2 outputs 0, 2, 4, 8. Step 3 outputs 0, 1, 1, 2; 2, 2, 3, 3; 4, 5, 6, 7; 8, 9, 9, 9. And step 4 scatters the flagged elements to those addresses in the output array. And that's that, and we'll follow up with programming questions.
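The four steps above can be sketched serially in C. This is a CPU simulation of the block-level compact, using the toy warp size of 4 from the example (real warps are 32 threads wide); in a CUDA kernel, steps 1 and 3 would be warp-level operations and step 2 a small scan of the per-warp totals.

```c
#define WARP 4    /* toy warp size from the example; real warps are 32 */
#define N    16   /* 4 warps of 4 threads */

/* Compact `values` by `flags` into `out`, returning the number of
   surviving values.  Implements the four steps described above. */
int compact(const int *flags, const int *values, int n, int *out) {
    int warp_sums[N / WARP];  /* step 1: count of true flags per warp  */
    int warp_base[N / WARP];  /* step 2: exclusive scan of those sums  */
    int addr[N];              /* step 3: per-thread scatter address    */
    int nwarps = n / WARP;

    for (int w = 0; w < nwarps; ++w) {        /* step 1 */
        warp_sums[w] = 0;
        for (int i = 0; i < WARP; ++i)
            warp_sums[w] += flags[w * WARP + i];
    }
    int running = 0;
    for (int w = 0; w < nwarps; ++w) {        /* step 2 */
        warp_base[w] = running;
        running += warp_sums[w];
    }
    for (int w = 0; w < nwarps; ++w) {        /* step 3: exclusive scan
                                                 of flags, seeded with
                                                 the warp's base       */
        int acc = warp_base[w];
        for (int i = 0; i < WARP; ++i) {
            addr[w * WARP + i] = acc;
            acc += flags[w * WARP + i];
        }
    }
    for (int i = 0; i < n; ++i)               /* step 4: scatter       */
        if (flags[i])
            out[addr[i]] = values[i];
    return running;
}
```

With the flags `1,0,1,0, 0,1,0,1, 1,1,1,1, 1,0,0,0` (which reproduce the lecture's example), step 1 yields 2, 2, 4, 1; step 2 yields 0, 2, 4, 8; and step 3 yields the scatter addresses 0,1,1,2, 2,2,3,3, 4,5,6,7, 8,9,9,9.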