cs344 ยป

Contents

- 1 Intro to Unit 4
- 2 Scan Recap
- 3 Scan Recap
- 4 What Is Compact
- 5 When To Use Compact
- 6 When To Use Compact
- 7 Core Algorithm to Compact
- 8 Core Algorithm to Compact
- 9 Steps to Compact
- 10 Steps to Compact
- 11 Allocate
- 12 Allocate
- 13 Possible Allocate Strategy
- 14 Segmented Scan
- 15 Segmented Scan
- 16 SpMv
- 17 SpMv
- 18 Sparse Matrices
- 19 Generate CSR
- 20 Generate CSR
- 21 Actually Doing the Matrix Multiplication
- 22 Sort
- 23 Sort
- 24 Merge Sort
- 25 Parallel Merge
- 26 Parallel Merge
- 27 Bad To Have One Big Merge
- 28 Bad To Have One Big Merge
- 29 Split Tasks Across SMs
- 30 Sorting Networks
- 31 Sorting Networks
- 32 Sorting Networks Part 2
- 33 Sorting Networks Part 2
- 34 Radix Sort
- 35 Radix Sort
- 36 Radix Sort Part 2
- 37 Radix Sort Part 2
- 38 Radix Sort Part 3
- 39 Quick Sort
- 40 Key Value Sort
- 41 Summary
- 42 Congratulations
- 43 Problem Set 4

Good to see you back for unit 4! I'm sure you asked yourself, after the last unit, just what do we do with this scan primitive? Well, today we're going to find out. We're going to look at applications of scan, and then we're going to learn how to sort on the GPU. So let's get started.

Alright, welcome to Unit four. Today we're going to talk about fundamental GPU algorithms, part two. We're going to start with scan, which we talked about last week, and look at specific applications of scan, and then we're going to turn to looking at how to sort on a GPU. So in the last unit we learned about fundamental GPU algorithms and the last one, and the most important one for this lecture was scan. Scan is an efficient way to paralyze a certain class of parallel problems that otherwise seem difficult to paralyze. We usually refer to these as recurrent relationships. One of the use cases for scan is for problems that keep some sort of running total, such as a running sum. We can also use other binary associative operators like Max, Min and Most Logical operations. So, as a quiz, let's recall two of the important properties of scan. So we're looking at a scan of n elements. And so, in the best G P U scan implementations, what is the amount of work to do the scan and what is the number of steps to do the scan? Your choices are proportional to log n, proportional to n, proportional to n log n. And proportional to n squared. So we'd like you to check the box that corresponds to what the work complexity of the algorithm is and what the step complexity of the algorithm is.

So what we learned last week was that scan can be computed efficiently on n elements with run time proportion to n. And we also learned that it can be completed with a number of steps proportional to log n. This is something we can implement very efficiently on the GPU. And because we can implement it efficiently, what we'll find today is that it's the core of a significant interesting number of interesting parallel parameters. Aand we're going to start with one called compact.

We're going to start with what is compact and so let's consider a common problem in computing. We have a set of things like cards, and we want to filter that set to get a subset that we care about. So this comes up all the time. We might decide that we only want to do computation on the diamonds from this set of cards. And take all the other cards and throw them away. We might select students from the roster of this class who have perfect scores on the programming assignments. My background is in computer graphics for instance. So an example there we might have a lot of objects to draw in a scene, but we only want to keep those that actually intersect the screen. We call this operation compact, because we're compacting the large input set of potentially many items, into something smaller. Another word for this is filter. Now, why do we want to do this? Well, if we only care about a subset of the input, and we want to run some computation on that input, it makes more sense to throw away the items that we don't care about, and only compute on the objects that we do care about. We can rightfully assume this is both cheaper to compute and takes less space. We have a set of objects S0, S1, S2, S3, S4 and so on. And we have what we call a predicate and that's this line here. Predicate is a function that inputs an S object and returns true or false for that object. So, for instance this predicate we're looking at here is, is my index even? So for S0, S2 and S4, the predicate returns true. For S1 and S3, the predicate returns false and so on. We want to calculate S filtered by this predicate. So we only keep those objects in S for which the predicate is true. So what form do we want the output of this filter, of this compact operation to be? We have a choice of how to do it. We can either have a sparse output, or a dense output. For the sparse output, each element tracks to its same position in the output and if an input has a false predicate we just put some sort of null element In there. Alternatively, we could have a dense output, where all the elements for which the predicate is true are then packed together into the output so they're all sitting right next to each other. There aren't any gaps in the output. In general, we want the output of a compact operation to be dense and let me tell you why. So let's go back to one of the examples I cited earlier. Selecting the 13 diamonds from a deck of 52 playing cards and running a procedure called computecard() on each of the diamonds. We can structure this code in one of two ways. We could either wrap our computation in a big if clause. So what we're doing here is on each card we check if it's a diamond, and if it is a diamond then we run computecard(). Otherwise, we do nothing. This is the sparse approach. Or, we could run a compact on the deck of 52 cards, to get back 13 diamonds, by running compact on the cards. Using the is diamond predicate. And then run a map on the compacted cards, calling compute card only on the diamonds. So note that the sparse computation on the left side, is going to launch 52 threads, one per card, and 39 of those threads will be idle. The dense computation on the other hand incurs the cost of the compact, but then in the map step it only launches 13 threads. If the compute card routine is at all expensive, then the sparse approach loses, because we have to launch four times as many threads. Three quarters of those threads are going to be idle, while the other quarter of the threads are actually doing useful work. But because both idle and non-idle are running in lock-step, we're still paying the run time costs of having four times as many threads. Therefore, generally, you prefer the dense approach especially when this computation is expensive.

So let's take a quick quiz on when we want to be using compact. So, in general, compact is going to be most useful in scenarios where we compact away, where we delete, a small or large number of elements. And the computation we need to run on each surviving element is cheap or expensive.

So the answers are, large, and expensive. Because we are saving the most computation by culling away a large number of elements, each of which would otherwise incur an expensive computation.

So now we know how to compact in parallel but only if we figure out how to compute the scatter addresses efficiently and in parallel. So let's take a closer look at the computation that we need to do here. We're going to have a set predicates here, true, false, false, true, true, false True, false, and what we need to do is compute the scatter addresses that will result from this string of predicates. Since we don't care about the scatter addresses for the false values, we'll just put a dash for don't care. Our algorithm is going to generate something there, it just doesn't matter what it generates. Again, our goal is to input this predicate string. And output this string of addresses. And now we'll do a simple substitution. Whenever we see a true here, we'll substitute the number one. Whenever we see a false here, we'll substitute the number zero. And rather than don't cares I'm actually going to fill in some values here. So we're going to have the strings 0, 1, 1, 1, 2, 3, 3, 4. So here's our input, a bunch of trues and falses. We're translating it into 1s and 0s, and this is our output, our set of addresses here. And now perhaps the shortest quiz of the course. If this is the input, and this is the output. What is the operation we need to do to get from this input, to this output? Hint. Your answer is one word, of 4 letters.

And the answer is, scan. Actually if you want to be a little bit more precise, it is exclusive sum scan and we already know how to do a scan efficiently. So if this predicate is our input, this black line here is the output of our scan and note that it corresponds precisely to the addresses we need, to scatter the input to the output.

So let's summarize how to compact. Conceptually, there are four steps though an implementation might combine these into fewer. First, we're going to run a predicate on each element of the input. Then we'll create a scan-in array equal in size to the input. For each element in the input, if the predicate is true, put a one into the scan-in array. If it's false, put a zero. Then we exclusive sum scan the scan in array. The output of this is scatter addresses for compacted array. Now, for the final step, for each element of the input, if the predicate is true, then scatter the input element into the output array. At the address in scan out. Let's try a little quiz to get some intuition about the run time of compact. We're going to compare two compact operations and both are going to compact the list of numbers from 1 to 1 million. Compact operation A, let's say is divisible by 17, is only going to keep a very few of the input items. Whereas compact operation B is not divisible by 31 is going to keep most of the input items. So for each of the predicate, scan and scatter faces of the compact operation, will a run faster, b run faster or will they take the same amount of time?

For the predicate operation, both will take the same amount of time. Both will run their predicate on each of 1 million items. Scan will also take the identical amount of time. Scan is going to run on an array of a million items. Where they're going to differ is scatter. A will run faster than B for scatter, because there's fewer memory operations. Since fewer elements need to be copied into the output array.

We can generalize compaction in the following way. Compact is an operation that allocates exactly one item in the output for each true input and zero items in the output for each false input. This is useful of course in compact as a really common operation. But more generally we can do a compact like operation where the number of output items can be computed dynamically for each input item. Let me offer a graphics example to show why this might be useful. The input to most real time computer graphics pipelines is a set of triangles. Some of these triangles might appear on the screen and some might not. And so we generally test to see if each triangle is visible or not before we pass it down the pipeline for later processing. This is a compact operation. We compact an input stream of triangles, some of which are visible, and some are not, into a smaller output stream where all triangles in the output stream are visible. And we know how to do this with compact. Now, here's the more complicated problem. What if a triangle intersects the boundary of the screen or window? For example, this triangle here, or This triangle here. In this case we apply an operation called clipping, where we cut the triangle with the boundary, and then triangulate the resulting shape. So for this triangle, for instance, we're going to convert it to this triangle right here. This triangle has left behind a trapezoid, and so we only want to deal with triangles. So instead we triangulate this trapezoid and send 2 triangles down the pipe. Now we have a situation where we input a set of triangles and each triangle can product 0, 1 or possibly many output triangles. And we want to output the resulting triangles in a dense array. Here's a good geometry quiz question for you. That has really nothing to do with the course material, but it's a pretty cool question nevertheless. What is the maximum number of triangles that can result from clipping a triangle with a rectangle?

So, here's the worst case. We have a green triangle clipped by a blue rectangle. How many sides is the resulting polygon? So, 1, 2, 3, 4, 5, 6, 7. And so how many triangles will this result in? 1, 2, 3, 4, and 5. So our answer here is 5.

So one way you could do this operation is to allocate the maximum space per element. In this case, we would create an output array with 5 output triangles per input triangle. And we're going to put these in an intermediate array, and then compact the result. However, this has 2 signifigant disadvantages. The first is that it's really wasteful in space. We've now allocated an intermediate array that has the maximum amount of space per element. And that's going to lead to a very large array overall. The second disadvantage is that, now, we have to scan this large intermediate array. And that's not going to be as efficient as we would like either. So how can we do this more efficiently. So let's go back to what we'd really like to do. What we'd like to do is have a list of allocation requests. 1 per input element. And that get back a location to write your requests. As an example, let's say we have 8 input triangles here. And those 8 triangles want to generate, in order. 1, 0, 1, 2, 1, 0, 3, and 0 triangles. So what's that going to look like? The first input element is going to want to write its 1 triangle here into the output array. The second triangle isn't going to have any output elements to write. The third input triangle is wanting to write into this element here. The next triangle has 2 outputs. That, we'll write them into the next 2 slots of the array. The next triangle wants to write its output here. The next triangle wants to write nothing. The next triangle wants to write its output into these 3 output slots. So what we want to generate here is the output addresses for these. He wants to write his output into address zero. He wants to write his output into address 1. He wants to write his output into address 2 and 3. And this guy wants to write his output into addresses 5, 6, and 7. And so on. And look. This is exclusive scan as well. It generalizes what we did with compact. In compact, we allocated one slot in the output array for each true element and zero for each false element. Instead, now, we've generalized this so that we simply indicate. In the input array, the exact number that we need. This is simpler, and it's more efficient. Instead of scanning an array that has the maximum possible allocation for all elements, in the case of clipping, 5 times the size of the input array, now we just have to scan something the size of the input array. So it's more efficient to be able to do it this way. So that's a couple of interesting applications of scan. If you're doing any sort of GPU programming with irregular workloads, you'll find that scan comes up all the time. Recent work has used scan in diverse application domains, such as graft reversal, data compression, and collision detection, among many others. And we'll see later in this lecture, that scan is the core of the fastest GPU sort as well.

Now, in some applications you might have the need for many small scans instead of one large scan. We've said before that when you launch a kernel on a GPU, you want to have lots of work to do in that kernel. So it doesn't make a lot of sense to launch a scan kernel separately. On each one of these small scans. Instead, we're going to pack these scans together as segments inside one big array. Then have a special scan operator that scans each of the segments independently. Typically, to indicate where segments begin in an array, We use a second array that has 1's for segment heads and 0's for non-segment heads. Let's do an example. So if you'll recall, exclusive sum scan on an array simply takes the running sum of all elements that come before the current element. So the sum of all elements before eight, for instance, is 28. Now what's different about a segmented scan is that we have a normal array, but what we do is we're marking with these large lines here boundaries between these segments. So when we call a segmented exclusive sum scan on this array, what it's going to do is take separate scans of each of these 3 segments. So the result is the scan of 1, 2. The scan of 3, 4, 5. And the scan of 6, 7, 8. And so, the way that we're going to represent these segments is with segment heads. So that we have a second array, the same length as the input array, that marks where segments begin. So a segment begins here with the number 1, a segment begins with the number 3, and a segment begins with the number 6. So to make sure we're on the same page, we're going to take the same array we did last time where we computed an exclusive segmented sum scan. And instead, you're going to fill in the result of an inclusive segmented Sum scan on this input array. If you'll recall, an inclusive sum scan is going to be the sum of all elements that came before in the segment as well as the current element. So why don't you fill in these eight boxes?

Okay, so this is pretty straightforward. The output at any particular element is the sum of that element and all elements that come before it in the current segment. But we're starting it over at every segment boundary. So 12, for instance, is the sum of 3, 4 and 5 but nothing from the previous segment. We're not going to go through how this is implemented, but it's a terrific exercise to work out on your own. Whether you do it with helicon steel scan or the Blaylock scan. It has the same complexity as the unsegmented scan, but runs a little slower because you have to keep track of the segments along the way. So it requires more memory traffic as well as more complex computation. We like to put a little more information about the implementation of segmented scan in the supplementary materials. But again, this is a really great problem for you to work out on your own.

You might be wondering why this might be useful. Never fear. We'll have 2 good examples in this unit. And cover the first 1 here, and the second a little later. So let's consider the problem of sparse matrix, dense vector multiplication. Which we abbreviate Spmv. Sparse matrix vector. Many interesting matrices have lots of zeroes. So we call those matrices sparse. And we want to find a data structure to represent them. It squeezes out the zeroes. >> As well as to find a more efficient way to multiply them by vectors. Afterall, if there's lots of zeroes, we'll do lots of multiplications that have no effect. So sparse matrices are incredibly common in computational problems in mini domains. For instance Pagerank is the world's largest matrix computation. Pagerank is based on a giant sparse matrix that represents every web page in the world. How big is this matrix? Well if there n web pages in the world, this matrix is n by n and each entry at row R in column C is non-zero only if web page R Links to web page C. So the page rate computation on this sparse matrix is how Google computes the importance of web pages. Let's briefly refresh how we multiply a matrix by a vector. Students who already know this are welcome to skip to the next segment. So we're going to multiply this three by three matrix here by this three by one matrix here. So how do we do this? For each output in the output vector, we multiply the row of the input matrix by the column of the input vector. So we do this in a pair-wise way. A times x plus b times y. Plus c times z and that will create the entry here a x plus b y plus c z. For each additional row in the output matrix we simply do another one of these dot products. So, for instance, to compute this value here we dot product this vector with again, our input vector, to get g times x plus h times y plus i times z. So let's do a matrix times a vector as a quiz. We're going to have this sample 3 by 3 matrix times this sample 3 by 3 vector. And I'd like you to write the vector answer in these 3 blanks here.

Okay, how do we go about doing this? Well first, we're going to dot this row times this column, so that's 1 times 1, plus 0 times 2, plus 2 times 3. Okay then we're going to dot this row, with this column, so that's 2 times 1 plus 1 times 2 Plus 0 times 3. And then this row with this vector, 0 times 1, plus 1 times 2, plus 3 times 3. So let's sum these up, 1 plus 6 gives us 7, 2 plus 2 gives us 4, 2 plus 9 gives us 11, so our answers are 7, 4, 11.

So the traditional way to represent a sparse matrix is what we call compressed sparse row. So here's a small matrix of 9 elements. 3 of them are 0s, and so we want some sort of representation that's going to squeeze out those 0s and only represent the values that are non-0. And it seems a little silly on a small matrix like this. But trust me. As you get to very large matrices with lots and lots of zeroes. This representation is going to save you a lot of space, and save you a lot of computation. So, in CSR format, we require 3 vectors that, together, are going to represent this sparse matrix. So the first one is what we call the value vector. And it is simply going to represent all the non zero data. So, here, we're simply going to list all the data that are not zero as one long array. The second array that we need is recording which column each of these data came from. So, for instance, a is in column 0. B is in column 2. C is in column 0, and so on. And finally, we have to indicate at which element each one of these three rows begin. So, the 3 rows begin with values A, and values C and value F. So what we're going to write in the row pointer is that value A is at index 0. Value C is at index 2. And value F is at index 5. And now we can reconstruct this sparse matrix with these three arrays.

So let's take the matrix that we saw earlier, and I'd like you to generate the CSR representation of this particular matrix, please fill in your answers in the value index and row pointer arrays.

So if you recall the value array, this is simply the non 0 elements of this input array, which we'll just fill in here. The index array here, is which column each of these array elements are in. So, the 1 is in column 0, the 2 is in column 2, the 2 here is in column 0 The 1 here is in column 1. This one's also in column 1. And this final 3 is in column 2. Finally, what are the indices of each one of these elements that actually begins a row. So, we should look for the index of this one, the index of this 2. And the index of this 1. Because they are what begin each one of these rows. That's this element, this element, and this element. This element is the zeroth element. This element is the second element, and this element here is the fourth element. So, this is the CSR representation of this sparse matrix here.

So now we're going to start actually doing the matrix multiplication and so we're going to show the first three steps of this. The first thing we're going to do, is take the value vector and the row pointer vector and together, we're going to create a segmented representation of the value vector. Note that each segment in this resulting vector, represents one row of the sparse matrix and the row pointer shows where each segment head begins. The next thing that we're going to do, is gather these vector values using these column indices to create a corresponding list, from which we can multiply each of these individual matrix entries. So, we see that value a, is located in column 0, which means it needs to be multiplied by the vector element in row 0. So what we're going to do, is use these column indices, to gather from this vector, and that's going to give us the following array. Okay, now we actually need to do the parewise multiplications. So we simply use a map operation, to multiply this vector not caring about the segmentation, by this vector here and it's going to give us yet another vector of the same size. And the final step, is that we need to add up the partial products, within each segment, to be able to get the final output vector, and that's where we can use segmented scan. But now we have to add up the partial products on each row, to get the output vector, specifically, we need to add up the partial products within each segment, and that's where out segmented scan comes in. We need to do an exclusive segmented sum scan, to sum up each segment of partial products. It's actually a little bit more convenient to do a backwards exclusive segmented sum scan, so all the sums, instead, end up at the head of each segment. Since then the row pointer array can be used to gather those per row sums into a dense vector. Now, here, we're actually using a segment in scan to perform what's really a segmented reduce. And if you want a good challenge, think about how you might want to build a segmented reduce, that's a little bit more efficient than a segmented scan. Just to go back to our original matrix, make sure we're actually going to get the same answer. So, here's our original matrix here, and our original vector. And we note that, we dot product this vector with this vector, to get this answer. This vector, dotted with this vector, to get this answer. And then this vector, dotted with this vector to get this answer. And you'll see if you look at each one of these rows, they're going to correspond to the sums, within each one of these segments. But the nice part about multiplying using this sparse matrix representation, is that we're never multiplying by any zeroes at all. So here, we've actually had to multiply by zeroes, and then we just throw away the answers, because they're going to be zero. On the other hand, when we're actually implementing this using a true sparse matrix representation, we're not going to have any zeroes present at all, so this is going to be substantially more efficient, for any real matrix that has a significant amount of zeroes.

Okay, so the second half of this unit concentrates on sort. Sorting is a fundamental operation. And understanding how we might approach this problem in the parallel world gives a lot of insight as to what works well and what doesn't work well on GPUs. There's a lot of neat algorithms in sorting. And I hope the rest of the unit gives you some new ideas about how to approach the design of efficient parallel algorithms. Now, sort is a challenging problem on GPUs for several reasons. The first is that most sort algorithms are serial algorithms, or at least, usually expressed in a serial fashion, particularly those you might have learned in an algorithm class. So all those nice algorithms that you learned in school are not necessarily applicable here and we'll see this in a bit. The second is that for performance reasons, we prefer to look at algorithms. Parallel algorithms, that coalesce memory axises, and have little branch divergents among threads, and particularly that are able to keep the hardware busy by keeping lots of threads busy at the same time. So the sort of algorithms that you might have learned in an algorithm course tend to be moving around little bits of memory at a time and they have very branchy code. And they're not often very parallel. So, we'd like to take a look at algorithms that have the other characteristics, that can keep the hardware busy. That do limit branch divergence, that do prefer coalesced memory accesses. And so, what we're going to do is look at some classic sorting algorithms and discuss how they might map onto a parallel world. We'll start with one of the simplest algorithms and one that maps nicely to a parallel implementation. Odd-even sort, also known as brick sort. If you're familiar with the serial algorithm called bubble sort. This is the parallel version of bubble sort. So, we're going to start by putting all of our elements to sort in a row. And then we're going to mark all the even elements as black and all the odd elements as red. In an odd-even sort, in the first phase, each red element looks down the line to the left, toward the left end and pairs up with the black elements it's facing. Now if that pair is out of order, they swap places and colors as well. Otherwise, they stay in the same places. Now, every red element turns around, faces to its right, and pairs up with the black element in the other direction. Again, they swap if they're out of order. So we continue until the sequence is sorted. So let's try an example. So just to show this with some real numbers, we'll try a sample with these five numbers. We start by pairing them up, starting at the left, and we swap each pair if they're out of order. Then we pair them with the opposite polarity and continue the process. We return to pairing them the way we did in the first step, and continue to pair them one way then the other way, then one way then the other way, until we finally conclude with a sorted sequence. So it's very important that we understand how the measure, the step and work complexity of these algorithms, because that's often the dominant factor in their run time. So, for this algorithm, what is the step complexity of this algorithm? Order of amount of work that we need to do with the same choices? Please check your choice.

We'll look at step complexity first. The worst case is that an element has to move all the way from one side of the array to the other side of the array. So the example here is the number five which starts at the far left and then has to travel all the way to the right. So how quickly does one item move? Its maximum speed is moving 1 position per step. Since the best they can do is swap with its neighbor and move only 1 position away. So it takes on the order of n steps to get from 1 side to the other and how much work does it do per step. Well, on every step if we've n items then we're going to do n over 2 comparisons. So in total we do order n steps and order n per step, totaling order of n squared. Overall, this is not a particularly efficient sort. We'd like to be able to do better than order of n squared steps. That being said, it's kind of a neat parallel algorithm because we can say that, within a step, each one of these comparisons can proceed completely in parallel. So at least, even though this isn't the most efficient algorithm, it's at least one that exploits a lot of parallelism in it's underlying structure.

More interesting from the GPU context, is a merge sort. And it's really interesting to discuss how to map this efficiently to a GPU. Because it's a great instance of a divide and conquer approach to the problem. So here's a tree that represents our merge sort. And what's particularly interesting about mapping this to a GPU. Is that the problem starts off with tons of little, tiny, parallel problems and then as the algorithm proceeds, we eventually end up with only one very large problem to solve, and that's the final merge. This is more challenging many of the algorithms we've discussed, where we have the luxury of being able to solve lots of little parallel problems independently throughout the whole computation. So what we do at each stage of this tree is the same. The only operation that we need is merging two sorted lists together into one sorted list. We begin with n items, which we treat as n sorted one element lists. Then we run n over 2 merges to create n over 2, sorted 2 element lists. Then we run n over 4 merges to create n over 4, sorted 4 element lists, and so on. Overall this is log n stages and in each stage we merge a total of n elements. So the overall number of operations that were complexity is order of nlog n. This algorithmic exposes a lot of parallelism within each step >> Each individual merge. This merge, this merge, this merge, this merge and so on, can proceed in parallel. Now the hard part about mapping this to the GPU is that the number and size of merge as we do on each step differs greatly between the first step and the last. So I'm going to talk about a gpu implementation that has 3 stages. The first stage, this blue stage here. Is merging a very large number of very short sequences. Because we have many, many tasks and each is very small. We might choose to assign one merge to one thread. Which can perform each individual merge using a serial algorithm on that thread. We might get better memory coalescing performance if we use shared memories as staging area to read many input elements or store many output elements at the same time. I'd say it's more common, though, that the start of a large merge sort. To just sort a block of elements, say, 1,024 within shared memory. And then start the merge sort with sorted chunks. Of size 1024. In other words, if we're actually doing this in practice, we're probably going to use a different algorithm to do this blue stage here with lots of little tiny tasks and then use merge sort to do the last two stages here. And we're going to see a good algorithm for an in block sort after we finish the discussion on merge sort. Now we move on to stage two. Now we have lots of small sorted blocks and we need to merge these small sorted blocks together. On the GPU for these intermediate merges we would usually assign one merge to one thread block. Now, the obvious way to merge two sorted sequences is a serial algorithm. So let's take a little bit of a closer look at the algorithm that we choose to use here. And we'll come back to this diagram a little bit later. The obvious way to merge two sorted sequences is a serial algorithm and here's our little serial processor here. The input to this algorithm is two sorted sequences and the output is one sorted sequence. And so the algorithm is that the serial processor looks at the head of each one of the sorted sequences, chooses whichever element is smaller, outputs it on to the tail of the output sequence and then advances the input sequence from which he took the previous element. However, this would be a poor match for the GPU because this is a serial algorithm and it wouldn't keep all of our hardware busy. So it's instructed to look another way, a better way, a more parallel way to merge two sorted sequences.

If you recall from earlier in this unit, the way that we compacted an array of input elements was to compute the address for each of the input elements to be written into the output array. And then we would scatter these input elements using these scatter addresses into the output array. We've computed, for instance, that element 21 would be scattered by address 5, into output location different algorithm. And our algorithm is going to rely on both of the input arrays being sorted. So let's take a little bit of a closer look. So, mentally, we're going to think about assigning 1 thread per input element each of the 2 sorted lists. So in this example, we're going to merge two sorted sequences of four elements each. So we would launch eight threads. Each of which would be responsible for one input element. So the goal of each thread is, just like the compact example, to calculate the position of its element in the final list. And then scatter that element there. So, to start with, let's figure out what we are trying to do. What we're going to ask you to do is fill in the scatter addresses for these two sorts. This is a very useful and common thing that you need to be able to do as you're building a parallel algorithm. Understanding the scatter pattern that will let you get to the final answer. Since you're going to generate a dense sorted list of eight elements the scatter addresses in these eight boxes are going to be the numbers from zero to seven.

So how do we do this? Well, we find the smallest element, and we know that it must move to the smallest output location. So, in that case, that's going to be a zero here. This is the smallest element out of all of these. It's going to be scattered to output location zero. The next smallest element is here, and then here, and then here, and then here, here, here and here. So the key here is, how do we get these numbers? That's the important part of the merge algorithm here. So we're going to put ourselves in the mental position of one of these elements. In the first list. So we're going to pick this particular element right here. And we're going to say. Hey, what position am I in my own list? Well, I'm a position 2. Because there are 2 elements in front of me in my list. So, he would be at position zero. He would be at position 1. And I'm at position 2. Now, here's the cool part. This guy has to ask, where would I'd be in the other list? Well, if I look at this list I would need to be right here. So if I was in that list I would also be a position number 2. Because there would be 2 elements in front of me. This guy would be a position 0, this guy would be a position 1 and I would be a position 2. So in the overall list I know I'm behind two elements here and I know I'm behind two elements here so I can add those things together and discover that in the final sorted list I would be at position number four. And that is the scatter address that I need to put him in the right position in the output list. So how do we know our position in our own list? Well, that's very simple. We launched these threads as a contiguous block with one element per thread. So >> I'm going to be exactly at my thread ID. He's thread ID 1, he's thread ID 1, and he's thread ID 2. So how do we know our position in the other list? That's a little bit more complex. What we need to do to make this work is to do a binary search in the other list. So, this element will look in the other list, move down as a binary search until it finds out where it's going to belong in the other list. Every thread does an independent binary search in parallel in the other list. So for a sorted list of length n, that will take log n steps per element and all of those elements will do the search in parallel. And that's a very fast operation if we're doing the out of shared memory. Now, for sorting more elements that can fit in shared memory, then what we generally do is read in only a chunk of each sorted list, compute the first part of the output, write it back into main memory, refill the input arrays with more elements and repeat until done.

So in the intermediate stages of the merge, we have all the SMs working on independent merges, one SM per merge. At some point though, when we get to the very top of the merge tree, we only have a very small number of merges left or eventually, just one. We have a single task, but it's a very big task. Now, why is this bad? Why is it problematic for efficiency that we only have one merging task remaining. So why is this bad? Why is it problematic for efficiency that we have only one merging task remaining? Well, could be a number of things. We could have lots of idle threads per SM. We could just have lots of SMs that are idle. Or we might suffer from very high branch divergence. So please check the one you think is the best answer.

The best answer is that we have lots of SMs that are idle. We know how to keep threads busy within one SM as we did in our, in our immediate merge but we don't know how to keep lots of SM's busy if we have only one task to do, and it's not at all efficient to have most of our SM's sitting idle.

So now we need to turn to an algorithm that allows us to have many sm's working on a single merge. Our strategy here is going to be to try to break up this one huge merge task that we need to do, and do a bunch of smaller merges. Each of which can then be processed independently, and in parallel by a different S-M. So the algorithm for doing this is pretty cool. The goal is that when we're breaking up these two long lists, this blue list and this green list, into shorter sub lists, and then merging the results, that we don't have any sub tasks that are too big to do on one S-M. It's okay for us to have some smaller and some bigger subtasks, but we don't want a subtask that's too big for one s m. So here's what we're going to do. We have a big, long list with tons of elements in it, and that's too big for us to be able to deal with, so what we're going to do instead is take a subset of each of these lists. In fact what we're going to do is take every nth element of each of these lists. Let's say every going to call these elements that we select, splitters. And they are 256 elements apart. And our list here is going to be a, b, c, d, and so on. And our list here is going to be E, F, G, H. And so on. So then what we're going to do is we're going to merge these splitters to get a single sorted list, say E, A, B, F, C, G, D, H. Now, just as with the merge algorithm we defined a few minutes ago, we calculate the position of each splitter in the other list. So it's very straightforward for us to use the merge algorithm we already know. To merge these two small lists of splitters. Note that the sorted splitter list already tells us in which sub-list our splitter will appear, so just as before we can do binary search for the exact position. Now, here's the cool bit. What I will submit to you, is that we now have a sorted list of splitters and the elements that fall between any two splitters in this list are an independent set that we can send to an S-M and independently sort. So, the elements between E and A Can go to one S M, the elements between a and b can go to another S M , and so on. So let's look at this in a little bit more detail with the elements between f and c. We can calculate the position of f in c's list, we can calculate the position of c In Fs list. And so the work that we need to do for the pair defined by F and C, is to merge this list with this list. And one of our goals was to make sure that none of these sub lists would be too big. And we can guarantee that because we know that there are no more than 256 elements between F and G. And we know there are no more than 256 elements between b and c, so we can guarantee that there are no more than 512 elements between f and c. So by choosing the space in between the splitters, we can guarantee a maximum size on any of the independent chunks. That we need to independently merge in the last stage of this algorithm. So the challenge we have addressed here. We take one big problem and divide it into a large number of small problems, each of which can be processed in parallel. So to sum up what we just did we look at 3 different phases of the merge sort, each of which we attack with a different strategy. First we used a one thread block to solve many merging problems in parallel. Here the number of problems was much, much greater than the number of SM's. Then we used one thread block to solve one larger merging problem. Here, the number of problems was on par with the number of sms. Finally, we cooperated between all thread blocks to solve a single problem. Now we have fewer problems than sms, so to keep all the hardware busy, we needed to find a way to divide the single problem into multiple, independent problems, so that we could keep all the sms busy. And that concludes the merge sort.

Now we're going to take a completely different approach to sorting. Generally most sorting algorithms are data dependent. Based on the values of the data, we choose to do different things. And if we sort two different sequences we'd probably take a different path through the code. Instead, now, we're going to consider a set of sorting algorithms that we call oblivious. No matter what input we choose, the sorting algorithm proceeds the exact same way. In fact, the only data dependence we'll have in the whole algorithm is a swap operation that inputs 2 elements and outputs them in the correct order. Let me take the brief digression why is a oblivious algorithm a good algorithm for a parallel processor like a GPU. So when I talk about an oblivious algorithm what I mean is that it's behavior is independent of some particular aspect of the problem in this case we're talking about a sorting algorithm that always does the exact same steps no matter what the. Oblivious. Good CPU sorting algorithms are generally move clever. They have more complex control flow and a lot of data independent decisions to run fast. Gpus aren't so great at complex control flow. Instead, they're created simple control flow and massive parallelism, and oblivious algorithms are generally a good match for massively parallel approaches to The problems. Okay. Let's return to sorting. Clearly, an example will help show what I mean. This structure that I'm showing here is called a sorting network. The input is placed on the lines of the left. So, this will input four elements, 1, 2, 3, 4. Each time two values are on either side of a shared vertical line, these values are swapped if they are in the Wrong order. So let's put some numbers on there and give it a shot. So we're going to start with the input sequence four, two, one, three. And so each time two elements are connected by one green line. We will swap them if they are out of order. So first, we'll swap two and four. But we won't swap one and three because they're in the right order. Now, we will look at 2 and 3 and we don't have to swap them but we do have to swap 1 and 4. And finally, we have two more swaps to do. 1 and so we will swap them and, voila, now we've moved from an unsorted sequence to a sorted sequence. Fortunately, this bitonic sorting network scales in straightforward. And easily programmable way. So we had a little sorting network that would sort 4 items. It's fairly straightforward to expand it. So that now it can sort A. So let me try to give you a little intuition about how that works. So a bitonic sequence is a sequence that only changes direction a maximum of once. So if we look at this sequence here, we're going up for a little while and then down for a little while though we only change direction right here. So this is bitonic. How about this sequence here? Well, we're going down and then up. 753 goes down then we change direction and go back up. So sort of the trace of a sequence look like that and that is also bitonic. But we might have a sequence that looks like this. 1, 2, 3, 4, where we go where we go up and down. And then up again. This however is not by time. Now why do we care? It turns out that it's particularly easy to sort a bitonic sequence, and let me tell you how. So let's say we have this bitonic sequence or alternatively two monitonic sequences that we can turn into one bitonic sequence. Here is what we are going to do. We're going to take the first half of this bitonic sequence and we're going to lay it on top of the second half of this bitonic sequence. Then what we're going to do is do a pairwise comparison of each element here. And we're going to take the larger element and we're going to put it in this set here, and we're going to take the smaller element and we're going to put it in this size here. So what we've done is we've partitioned This bitonic sequence into 2 bitonic sequences. 1 of which is bigger and one of which is smaller. And so then we can recurse and the bitonic sort on each one of these subsequences. And continue until we have a completely sorted sequence. The overall algorithm is generally to sort 2 sequences. We reverse 1. We append it to the other to create a bi-tonic sequence. We split that bi-tonic sequence into the larger and smaller halves. And then we sort each half. So if we look at this big picture here. These two boxes here, sort to input sequences of size 4. 4, this segment here splits two sorted sequences into a larger half and a smaller half and then these two boxes here will sort two bitonic sequences that result for the smaller half and the bigger half. So if we actually did the analysis here, we would find that for an input set of size n it requires something proportional to log n stages overall. If we actually looked here, we would find that's 1, 2, 3, 4 5, 6. And the first stage compares and swaps all elements once. The next stage does 2 swaps, and so on. And so, the total complexity here is that we have n log squared n swaps overall. Well, that's all nice theory. But here's the best part. What should immediately draw your eye, is that, within any set of swaps, every swap is completely in parallel. So if you look at this stage here for instance we have four swabs but each of them can proceed in parallel. We have four swabs in this particular stage. Each of those can proceed in parallel. Here's a different permutation of swabs. And again these could all be pursued in parallel. And this obviously makes a GPU programmer Very, very happy. Now that we know how this sorting network works, lets think about its performance when given different inputs. For each of the following possibilities with the same number of elements, what are the comparative runtimes in units of time? So a completely sorted sequence, how long does it take? An almost sorted sequence, just about sorted, maybe a couple elements off. A completely reversed sequence, or a completely random sequence. So here, this would indicate that A would be the fastest, then B, then D, Then C, so our choices are A, B, D, C. C is fastest, then D, B, A. A is fastest, then B, C, D. Or they'll all take the same amount of time.

The answer. And this is pretty particular to sorting networks but with a sorting network all will take the same amount of time. It's an important consequence of an oblivious sorting algorithm. Any input will require the same amount of time to sort.

So how do we implement this on the GPU? Well, its very simple. We're going to assign one thread per input element. In this case, eight input elements. For each comparison, each thread simply needs to know if its keeping the smaller or larger value. So we're actually doing each comparison twice. Once on either side of the comparison. And the only difference is that one side of the comparison will keep the smaller element and one side will keep the larger element. Now, we do need to synchronize after each comparison. So after we're done with this particular set of comparisons, we synchronize and then begin the next stage. And then synchronize and begin the next stage and so on. So just a couple notes on the computation itself. One, if the input set is small enough to fit into shared memory, then a sorting network is actually a very efficient way to sort that input set. So in fact, many sorting algorithms that start by sorting small blocks of data, like merge sort, use a sorting network to do their sorts within a block. And it helps that a bitonic sort is probably the simplest sort to actually implement. Two, bitonic sort is not the only kind of sorting network. The odd-even merge sort is a different kind of sorting network that generally runs a little faster but it's a little more complicated to explain or program but the basic ideas are the same. So as a quiz, what I'd like you to do is fill in each of the black boxes with the correct numbers to make sure that these four input elements are sorted by an odd-even merge sort.

So we're going to go about this the same way that we did on the other sort, every time we see a pair of lines that are connected by a green line, we will compare the two elements and put the smaller one at the top and the larger one at the bottom. So we'll compare 4 and 2 and switch their order. We'll compare 1 and 3, and they're already in the right order. Now we're comparing 2 and 1. So 2 will go here and 1 will go here. Now we'll compare 4 and 3 and we say they're in the wrong order as well and note that 2 and 3 have one more comparison to go. But note that we come out, again, with a sorted sequence.

So we've seen some really interesting algorithms so far, but the GPU performance leader is none of the ones that we've discussed today. Instead, typically on the GPU for the highest performing sorts, we use a different sorting algorithm called, Radix sort. Now, all of the previous sorting algorithm were comparisons sorts, meaning the only operation we did on an item, was compare it to another one. Radix sort relies on a number representation that uses positional notation. In this case, bits are more significant as we move further left in the word, and it's most easily explained using integers. So the algorithm for Radix sort is as follows; start with the least-significant bit of the integer, split the input into two sets, those that have a zero, with this particular bit location, and those that have a one, otherwise, maintain the order. Then proceed to the next least-significant bit, and repeat until we run out of bits. So as usual we're going to do an example that'll make more sense, and we're going to use unsigned integers. So what we're going to sort, is this column of numbers to the left, and here is the binary representation of those numbers. And so we're going to start here, with the least significant bit. So, the way that we're going to do this, is take all the elements that have a zero, as the least significant bit, and we're going to otherwise maintain their order, but we're going to put them up top. Then, we're going to take all the rest of the elements, those that have a one as the least significant bit, and we're going to again keep them in order and append them to the list that we've just created. So what this creates, is a list where all the least significant bits are zero and then a list where all the least significant bits is one. Now, we move to the next least significant bit, okay, so the bit in the middle, and we're going to do the same thing. We're going to take all the zeros and put them up top, and then we're going to take all the ones and put them below. And here the dotted lines are just showing the data movement that we're looking at, the green lines are the ones where the middle bits are zero and the blue line is where the middle bits are one. Now, we move on to the next most significant bit, in this case, the very most signifigant bit, and we do the same operation again. Zeroes in the most signifigant bit move up top, ones move to the bottom, otherwise, we maintain the order, and now we have a sorted sequence. Pretty cool, huh? Now, there's two big reasons this code runs great on GPUs, the first is its work complexity. The best comparison base sorts are o of n log n. This algorithm on the other hand, is o of kn, meaning the run time is linear in two different things. First, it's linear in the number of bits in the representation, so this particular integer has three bits in its representation and it took three stages for us to be able to sort the input. Second, it's linear in the number of items to sort. So, we have eight items in the representation here, and so the amount of work is proportional to eight. Generally K is constant, say, a 32 bit word, or a 64 bit word, for any reasonable applications. And so in general, the work complexity of this is mostly proportional to the number of items that we need to sort. And so, that's a superior work complexity to any of the sorts that we've talked about to date, and so that's one reason why this looks so good. The second, is that the underlying operations that we need to do this split of the input at each step, are ones that are actually very efficient. And in fact, they're efficient operations that you already know. Let's take a closer look at what we're doing. We're only going to look at the first stage of the Radix sort algorithm, where we're only considering the value of the least significant bit, and we're only going to look at the output for which the least significant bit is zero. Now, what are we actually doing here? We've already learned an algorithm that does this operation today. So, what is the name of the algorithm that takes this input, and creates that as the output?

That algorithm is compact.

What we're doing, is compacting the items that have a zero as the least significant bit. So if you recall, when we compact, what we need to do, is have a predicate that are going to tell us, which of these input elements to keep. So, if we call each of these elements an unsigned integer, and that unsigned integer is named i, what is the predicate, on each one of these integers that is going to allow us to return true, only for this set of integers. So write this in this box, in the C programming language please.

So the answer here is that what we're going to want to do is look at the least significant bit. That mans we'll take the bit representaton of i and end it with the value 1, which is only going to leave us with the value of the least significant bit. And then we'll going to test whether that bit is 0 or 1. And we only want to keep the ones that are 0. We only want this to return true if that least significant bit is 0.

So what we're really doing here, is simply running a scan over the input. So, the input to the scan is a 1, for each 0 bit, and a 0 for each 1 bit, and that will give us the scatter addresses for the zero half of the split. So we're going to scatter this element to an output zero, this element to 1, this element to 2, and this element to 3. Notice that the last element of the scan, with a little bit of extra math, because it ends with a zero element here, tells us how many zero bits there are total in the input, in this case there are four, 1, 2, scatter addresses for the other half of the split, for the one bits. There are a number of interesting ways to make this faster, and the most common one, is to reduce the total number of passes by taking multiple bits per pass. 4 Bits per pass and a resulting 16 ways split, instead of our 2 ways split here, appears to be fairly common. Overall, rating sort is a fairly brute force algorithm, but it's both simple and fast, recent GPUs can run rate exort on 32 bit keys, at a rate over a billion keys sorted per second.

The final sort algorithm we'll discuss is one of the most efficient ones for serial processors, this is Quick sort. It is a more complex algorithm on GPUs because of the control complexity of the algorithm, so let's recap what the Quick sort algorithm does. First, it chooses a pivot element, one particular element from within its input, then it compares all of the elements in its input to the pivot and it uses that comparison to divide the input into 3 sub-arrays. Those that are less than the pivot, those that are equal to the pivot and those that are greater than the pivot, and then it calles Quick sort on each of these sub-arrays and continues until the entire input is sorted. As an example, let's look at a particular array, and choose that the pivot is equal to 3. So what we're going to do here, is compare each one of these elements to the pivot, and we're going to decide if they're equal to, greater than or less than the pivot. Then we'll divide it into 3 arrays, those that are less than the pivot, those that are equal to the pivot, and those that are greater than the pivot, and we'll call Quick sort on each of these arrays, and do the same thing. So when we have 2 and 1, let's say we choose 2 as the pivot, then we'll divide this into smaller than the pivot and equal to the pivot. This doesn't need to be recursed because it only has a single element. And let's say we choose 4 as the pivot here, this is greater than the pivot, this is equal to the pivot, and now we have a completely sorted array. So this is a very challenging algorithm to implement on the GPU. The other algorithms that we've looked at are pretty simple to describe and they don't recurse. This one is more complex, and until very recently, GPUs didn't support recursion at all. Indeed, the GPUs we use in this class, don't support recursion currently. So how can we take this seemingly recursive only algorithm and map it to the GPU using the primitives that we've learned? So I'm bringing up this example for two reasons. The first is that you have already learned all the pieces that you need to implement Quick sort on the GPU. And the second, is to motivate the benefits of new GPU capabilities that do, natively support recursion. So we can implement Quick sort without recursion by using the idea of segments. Recall that segmented operations like scans, only operate within a single segment, operations on one segment don't affect other segments. That sounds a little bit like recursion, and in fact, it maps in a similar way. For Quick sort, when we begin to process the initial array, we're going to use distributes, maps, and compacts to eventually divide it into three segments. We can used segmented scans to do all the necessary operations, that we need to make this work, including distributing a pivot across a segment, for comparisons, and splitting a segment, which is similar to the way that we split a on a particular bit in Ridic sort. Quick Sort is interesting, because it shows you how useful segments can be, that they can let you mirror the same approach you use in recursion, without actually using recursion. But, and I gotta be perfectly honest here, it's really a challenge to implement and equally challenging to make it fast. So it's very exciting that the newest in video GPUs support a more flexible programming model, where kernels can actually call, can launch other kernels, which makes Quick sorts recursive calls much more straightforward. We're not using this new capability in the class. The Amazon instances where our code is running, don't have these new GPUs that support this capability at this time. But it's really exciting, and so when we get to unit 7, we'll see exactly what it looks like and how it applies to Quick sort.

Final note on sort. All the algorithms we discussed were what we call key sorts, where we only sort a set of keys. Many interesting applications however require that you sort not just a key, but a value associated with that key. For instance in unit two, Dave described sorting a set of NBA players by height. In this case, the key might be the height, and the value would be the player's name, the player's team, the player's history and so on. Dealing with values as well as keys is pretty straight forward. First, if you have a value that's large in terms of bytes, it's usually smarter to make the value instead, a pointer to the actual data, and then just sort the value and it's pointer. For instance, if you had a large NBA player data structure, don't sort the whole data structure as the value. Instead, just use a pointer to the data structure as the value instead. Second, our sort algorithms generally move around data keys. Whether that be a sorting network, or a radix sort, or a merge. Instead of just moving the key If you have to deal with a key value sort, move the key and the value together as one unit. Finally, since today's GPUs can often natively handle to input 64-bit values instead.

So if you can sort and scan, and more importantly understand the kinds of algorithms that we study in using sort and scan and why they're good and bad for the GPU, you are in fine shape. The ideas you've learned will carry over to many other algorithms. You'll be using these techniques in the assignment this week where you'll do automatic red-eye removal using template matching. This algorithm relies on sorting a vector by key which indirectly relies on scan. So go apply your new found knowledge to the assignment.

Congratulations on finishing unit 4. We've done a lot in this unit, so well done for getting through this. Now we're going to talk to Ian Buck from Nvidia about Cuda's past, present and future. And then in the assignment, what we're going to do is learn how to remove red eyes from photographs, so that your photographs can look just a little bit better.

In problem set number four you will implementing a parallel algorithm for removing the red eye effect that commonly occurs in pictures of human faces. Here is an example of the effect that we are talking about. You will be implementing a simple algorithm for red eye removal. That factors nicely into 3 different parallel operations. The first operation is a stencil computation over the image. The second is a sort. And the third is map. You have been exposed to map and stencil operations in the previous homework. So, in this homework, you will focus on sorting. We start our red eye removal algorithm by computing a score for each pixel that estimates the likelihood of that pixel belonging to a red eye. And here's what the score looks like. The score is known as normalized cross correlation. And it is expressed naturally as a stencil operation. We have computed this normalized cross-correlation square for you. But if you're interested we will provide extra details about Normalized cross-correlation in the instructor comments. Our next step is to sort all of the pixels in the image according to their score. Note that the pixels with the high scores are the pixels that are most likely belong to a red-eye, as you can see here. This sorting step is what you will be implementing in problem set number 4. In case, if you are interested, we will discuss several different types of optimization strategies, for sort in a shutter comments. Our final step is to reduce the redness. Of the high scoring pixels. This computation is expressed naturally as a map operation. Once again we've performed the step for you so you can concentrate your efforts on sorting. That is all, good luck and I hope you will enjoy problem set number 4. And like the previous problem sets I want to thank Eric and Mike for writing the code and the script to this problem set. So thank you.