cs344 »

Contents

- 1 Fundamental GPU Algorithms
- 2 Digging Holes Again
- 3 Steps and Work
- 4 A Quiz on Step And Work
- 5 A Quiz on Step And Work
- 6 Reduce Part 1
- 7 Reduce Part 2
- 8 Binary and Associative Operators
- 9 Binary and Associative Operators
- 10 Serial Implementation of Reduce
- 11 Serial Implementation of Reduce
- 12 Parallel Reduce
- 13 Parallel Reduce
- 14 Step Complexity of Parallel Reduce
- 15 Step Complexity of Parallel Reduce
- 16 Step Complexity of Parallel Reduce
- 17 Step Complexity of Parallel Reduce
- 18 Scan
- 19 Balancing Checkbook With Scan
- 20 Inputs to Scan
- 21 Inputs to Scan
- 22 What Scan Actually Does
- 23 What Scan Actually Does
- 24 Why Do We Care About Parallel Scan
- 25 Inclusive Vs. Exclusive Scan
- 26 Serial Implementation of Scan
- 27 Serial Implementation of Scan
- 28 Inclusive Scan Revisited
- 29 Inclusive Scan Revisited
- 30 Hillis Steele vs Blelloch Scan
- 31 Hillis Steele Scan
- 32 Hillis Steele Scan
- 33 Blelloch Scan
- 34 Blelloch Scan
- 35 Analyzing The Blelloch Downsweep
- 36 Which Scan Should You Use
- 37 Which Scan Should You Use
- 38 Histogram
- 39 Histogram
- 40 Serial Implementation of Histogram
- 41 Serial Implementation of Histogram
- 42 Parallel Implementation of Histogram
- 43 Implementing Histogram Using Atomics
- 44 Implementing Histogram Using Atomics
- 45 Implementing Histogram Using Atomics
- 46 Implementing Histogram Using Atomics
- 47 Calculating Global Histogram Using Reduction
- 48 Sort Then Reduce By Key
- 49 What Operation Should We Use
- 50 What Operation Should We Use
- 51 Final Thoughts on Histogram
- 52 Final Thoughts on Histogram
- 53 Description of Problem Set #3

Alright, today we're going to cover Lecture 3, and this lecture's going to be on fundamental GPU algorithms. We're going to cover three different algorithms today: reduce, scan, and histogram. So, GPU computing is based on exploiting concurrency in a workload. Expressing that concurrency in a language like CUDA allows parallel implementations on the GPU that can take advantage of hundreds of simultaneous computations. Now, the GPU is well suited to communication patterns that are completely independent, like the map operation. So, in the map operation, we're going to have a number of computational elements, and each one of them will compute its output in parallel and completely separately, without any communication whatsoever. It's also great at stencil, or more generally gather, operations like we saw in the last lecture. So, we might have an item we want to compute here, and we're going to go out and fetch several items from memory and do our computation. But the computation of the next item is again completely separate from the computation of the first one. However, not all computations fall into these categories. Some of them have more complex communication patterns, such as all-to-one or all-to-many communication patterns. So, for instance, this element here depends on these three elements here, and the computation of this element here will depend on these three elements. And notice that there's some overlap here. So, it's a little bit more complicated. So, we'll cover these more complex communication patterns and three primitives that can implement them in today's lecture. These primitives are reduce, scan, and histogram. And we'll also use them in this week's homework.

So, let's go back to our dig-a-hole example. If we're going to dig a hole, we're really interested in two things. One is how long it's going to take to dig that hole, and two, how much work, how many total hours, it will take to dig. Now, if you're the only one digging this hole, these two things are the same. So, we have one worker, and perhaps it's going to take you 8 hours to finish, and so that means overall you're going to do 8 hours of work. But if you bring all your friends over to dig the hole with you, the amount of time to finish and the total amount of work might be different. So now, we're going to have 4 workers, and perhaps they'll do 8 hours of work total and thus be able to finish in 2 hours. Thus, they've finished digging this hole four times as fast. We'd actually call this ideal scaling, since adding four times as many workers allowed us to finish four times as fast. Alternatively, we might have four diggers, each of whom gets in the others' way, so we're not going to be quite as efficient. So, for instance, we might have four workers, and together it's going to take them four hours to finish, meaning that together they've done 16 hours of work. Now, you might reasonably ask yourself why your instructor spends so much time talking about digging holes, and that's a fair point. But there's a couple of important points of terminology that I want to make here. The first is the concept of a step: how long it takes to complete a particular computation, operation, or hole digging. The second is the total amount of work that it took to perform that computation over all the active workers. And so, we're going to go into a little more depth in terms of the way that we define these.

As we talk about algorithms, we're going to talk about 2 costs. The first is step complexity and the second is work complexity. So, as an example here, we have 8 elements that we'd like to combine using this tree-style structure. And so, we're going to try to characterize the number of steps that it's going to take us to do this computation, as well as the total amount of work. So first, we're going to look at the number of steps. We see that it's going to take us 3 steps to finish. In this first step here, we'll do 4 operations; the second step can be done in parallel with 2 operations; and then the third step is a final operation to get a final result. But we can also count the total amount of work we have done here. We've done 1, 2, 3, 4, 5, 6, 7 operations. So, we'd say the step complexity is 3, and the work complexity is 7. So, we'll compare the step and work complexity of the parallel implementations that we develop against the step and work complexity for a serial implementation. And one more piece of terminology: we will say that a parallel algorithm is work-efficient if its work complexity is asymptotically the same, so within a constant factor of, the work complexity of the sequential algorithm. Now, if we can reduce the step complexity in our parallel implementation compared to the serial implementation, while still having a reasonable work complexity that isn't too expensive, we expect that this will lead to faster runtime overall. So, the real key here is to formulate the most efficient parallel algorithm. And that's the focus of this lecture and the next lecture.

Okay. So, quick quiz, just to test a little understanding about number of steps and total amount of work. We have a more complicated tree right here, and so you can think of every circle here as one operation. Some operations take 2 inputs, some operations take one. So, what are the total number of steps in going from the inputs up here to the output down here? And what is the total amount of work measured in number of operations to get from the eight inputs up here to the one output down here?

So to compute the number of steps, what we're going to do is start at the top, and count how many steps it takes to get from the top to the bottom. So here's the first step; it has eight operations. Second step, third step, fourth step, fifth step, sixth step, and then we've computed the final result. We'd say there are 6 steps here. What's the total amount of work? Well, to do that, we count the number of operations. That's 8 plus 4 plus 4 plus 2 plus 2 plus 1, giving us 21 operations to compute the final answer here. When we discuss algorithms, we'll often discuss these metrics not so much in terms of raw numbers, the number of steps or the total amount of work, but instead as functions of the size of the input. So, we might have a totally different problem, and we might say the amount of work in that problem is proportional to the size of the input squared, or the number of steps in this particular problem is proportional to the size of the input.

So, the first algorithm that we're going to cover today is called reduce. An example of reducing a set of numbers is adding up all the numbers in the set to get a total sum. So reduce is an interesting algorithm because it requires cooperation between processors. Here we have a tree that we can use to compute the sum of a large number of integers. And it's interesting because we now have dependencies between the operations. First, we add 1 and 2; we have to take the result and then add it to 3; and when we have that sum, we can take the result and add it to 4, and so on. So, we now have dependencies between the operations that we need to do, and we haven't looked at any algorithms that have this behavior before. Reduce is also interesting because a parallel implementation, like the one that we're about to describe, will take a smaller, better number of steps than a serial implementation.

So, we're going to start with the mathematical definition of reduce. Reduce has two inputs. The first input is a set of elements, and we're going to assume they're in an array. The second input is a reduction operator that will operate on the elements in this array. For example, we have a list of numbers here, and we have a reduction operator of plus, and the reduction operator will sum them all up. Let's be a little bit more precise in terms of what operators we're going to support with our parallel implementation. Our operators need to have the following two characteristics. The first is that the operators are binary: the operator must operate on two inputs and create one output; plus is a binary operator, for instance. Two, associative. Associativity means that if we have two or more operations in a row, a op b op c, the order of operations doesn't matter as long as the order of the operands isn't changed. To put it simply, a op b, take the result and op it with c, needs to give you the same answer as b op c, and then op a with the result: (a op b) op c equals a op (b op c). You should convince yourself that plus is an associative operator but minus is not. Now, it's not immediately clear why we need this property, but we're going to see why in a few minutes.

Now it's time for a quick quiz. In the list below, check which operators are both binary and associative. And the list is, multiply, minimum, factorial, logical or, bitwise and, exponentiation, and division.

Four of these operators are both binary and associative: multiply, minimum, logical or, and bitwise and. Factorial isn't binary; it only has one argument, and it needs two arguments to be binary. And both exponentiation and division are not associative. So, if you want to prove that to yourself, note that, say, (8 divided by 4) divided by 2 equals 1, which is not equal to 8 divided by (4 divided by 2), which equals 4; and similarly, 4 to the 3rd power, all raised to the 2nd power, is not equal to 4 raised to the power (3 to the 2nd).

So today, we're going to discuss how you implement reduce in a serial manner, sort of the traditional way that we all know and love. And so, the structure of this looks a little bit like map. In both reduce and map, we loop through all of our elements. But it's different in an important way. In map, each loop iteration is independent; we can run these simultaneously, and in any order we choose. In reduce, on the other hand, each iteration is dependent on the previous iteration. And here's the serial code to sum a series of elements, and this is relatively straightforward. We have a serial variable named sum, and we initialize it to zero. We then loop through our set of elements and, on each iteration, add the current element to the previous sum. And so, when we're done, we can return this sum variable, and we're done. So, in the first iteration, we do this first add here, and we take the result; on the second iteration, we do a second add here; on the third iteration, we do a third add, and so on. So, what's different about this than the map example is that this add operation is dependent on the previous one, whereas in map, all these things could happen in parallel and at the same time. So now, let's take a little bit of a closer look at the serial reduction. We're going to take five pieces of data here, a, b, c, d, e, and reduce them using the plus operator. First, how many operations does it take to perform this serial reduction? So, we can just count the operations: 1, 2, 3, 4. It takes four operations. Next, we're going to count the number of steps it takes. So that would be 1, 2, 3, 4. So it's also four steps. So, a quiz: which of these four statements are true about a serial reduction code running on an input of size n, arbitrary size? First, it takes n operations; second, it takes n minus 1 operations; third, its work complexity is order of n, proportional to the size of the input; or fourth, its step complexity is order of 1, independent of the size of the input. So, please check which ones are true.
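The serial loop just described can be sketched in plain C++ (the course's actual code is CUDA host code; this is only the same pattern, with a hypothetical function name):

```cpp
#include <vector>

// Serial reduce: one running variable, one loop, and each iteration
// depends on the result of the previous one.
float serial_sum_reduce(const std::vector<float>& elements) {
    float sum = 0.0f;          // identity for addition
    for (float x : elements)
        sum = sum + x;         // iteration i depends on iteration i-1
    return sum;
}
```

Note that with `sum` seeded to zero the loop performs n additions; seeding it with the first element instead gives the n minus 1 additions needed to combine n elements.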

So, the first two are talking about the amount of work. And the correct answer is that it takes n minus 1 operations; adding up n elements takes n minus 1 adds. Next, is its work complexity order of n? Yes, it is. So the amount of work that we do is linear with respect to the size of the input. If we double the size of the input, we're going to double the number of additions that we do. Is its step complexity order of 1, meaning independent of the size of the input? No, its step complexity is also linear in the size of the input. As we double the size of the input, we'll double the number of steps it takes to reduce all these items. So, to sum up, our serial reduction algorithm has a work complexity of order of n. What that means is that the amount of work is linear in the size of the input. If we double the size of the input, we double the number of operations that we're doing. Our step complexity is also order of n, meaning again that if we double the size of the input, we double the number of steps it takes to compute the output.

So now, we're going to turn to parallel reduce. So how do we formulate an algorithm, a procedure, to be able to speed this up by running in parallel? So, let's take a look at this reduce picture here. At first glance, this seems very difficult to parallelize; note that every operation is dependent on the result of the previous operation. So, let's write these operations explicitly. When we formulate the reduction in this way, we have no parallelism. So, can you figure out how to rewrite this to expose more parallelism, so that we can do more than one thing at the same time? And the hint is: use the associative property of the operator. How do you rewrite a plus b, take the result, add it to c, take the result, add it to d, to allow parallel execution? In your answer, please use parentheses to show grouping.

So what we're going to try to do is expose a little bit of parallelism, and we're going to do that by first computing a plus b and, perhaps at the same time, computing c plus d, and then adding the results together. Now, what's that going to look like in terms of our tree structure? So now we've got serial reduce and parallel reduce side by side. Here's serial reduce and the equation. Here's parallel reduce and the equation. Now, what we've done to do parallel reduce is regroup these operations in a different order, and this exposes more concurrency. We now have the ability to run multiple operations in parallel at the same time. Now you see why associativity was a necessary quality of the reduction operator: we can reassociate operations to expose more concurrency, and then we can run concurrent operations in parallel to reduce the time to solution. Let's see how this works with our four-element reduction. Let's compare our serial reduction and our parallel reduction. Both of these have three additions; both of them have three units of work. But whereas the serial reduction takes one, two, three steps to complete, the parallel reduction only takes one, two steps to complete. So potentially, we can run this with parallel hardware, and it will complete faster. If we extend this problem to reduce a set of elements of size n, we're very interested in the work and the step complexity as a function of n. So what's the complexity of these four metrics? Well, three of them are pretty straightforward. The serial implementation here has both order of n, linear, work and linear steps. And the parallel reduction also has linear work, so it's also order of n. But this last one, the parallel step complexity, is a little bit more complicated to compute. So let's dive in and take a look at this.

So, let's look at the trend of the step complexity of parallel reduction. So first, we're going to look at what happens when we add 2 elements to create one output. So, if we have two elements in our input sequence, it's going to take one step to finish. If we have four elements in our input sequence, it's now going to take 1, 2 steps to finish. And if we have 8 elements in our input sequence, 1, 2, 3, 4, 5, 6, 7, 8, it's now going to take 1, 2, 3 steps to finish. So, the quiz is, what is this relationship? Is the number of steps as a function of n: square root of n, log base 2 of n, n, or n log base 2 of n?

And the answer is log base 2 of n. Note that 2 to the power of the number of steps equals the number of elements. We would say that the step complexity of this parallel algorithm is order of log n. So, if we reduced 1,024 elements, we would see that it would take 10 steps, compared to 1,023 steps in the serial case. This is 2 orders of magnitude fewer steps. And now you're starting to get an idea of why a parallel implementation might get significant speedups. However, note that it can only get these speedups if it has enough processors to handle all these operations at the same time. If we're adding 1,024 elements in parallel, our first step requires performing 512 additions at the same time. Now, a modern GPU can actually handle this. But if we're adding a million items in parallel, our math would tell us we could finish in 20 steps, but it would also tell us we'd need to perform 500,000 additions in the first step, which is not too likely given the desktop hardware we have at our disposal today. That's okay, though. Even if we're only keeping 500 processors doing additions throughout the computation, rather than 500,000, that's still an enormous speedup over having only one processor doing additions. It's a good exercise to work out how many steps it takes to run this algorithm for an input of size n if you have only p processors. This is called Brent's Theorem, which the interested student will want to look up and perhaps discuss on the discussion forum.
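As a rough way to play with the p-processor question, here's a back-of-the-envelope counting model (a sketch in the spirit of Brent's theorem, not a statement of the theorem itself; `steps_with_p_processors` is a hypothetical name, and it assumes n is a power of two):

```cpp
// Counts the steps a tree reduction takes when each step can perform
// at most p simultaneous operations: each level of the tree needs
// ceil(ops / p) sub-steps if only p processors are available.
long steps_with_p_processors(long n, long p) {
    long steps = 0;
    while (n > 1) {
        long ops = n / 2;             // additions needed at this level
        steps += (ops + p - 1) / p;   // ceil(ops / p) sub-steps
        n -= ops;                     // n/2 partial sums remain
    }
    return steps;
}
```

With unlimited processors this reproduces the log base 2 of n step counts above, for example 10 steps for 1,024 elements, and with p equal to 1 it degenerates to the serial 1,023 steps.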

So now, we've done all the preliminaries, so let's turn to some actual code. We're going to implement this twice with a similar strategy each time. In both, we're going to implement a sum of a million elements, well, actually 2 to the 20th elements, and we're going to do this in two stages. In the first stage, we're going to launch 1,024 blocks, each one of which will use 1,024 threads to reduce 1,024 elements. Each of those blocks will produce one single value. Then, in the second stage, we're going to reduce the resulting 1,024 elements into one single element. So, I'll post all the code, of course, but the CPU-side code is straightforward; instead, we're just going to take a look at the kernel. Let's see how that works. So, each block is going to be responsible for a 1,024-element chunk of floats, and we're going to run this loop within the kernel. On each iteration of this loop, we're going to divide the active region in half. So on the first iteration, where we start with 1,024 elements, we're going to have two 512-element regions. Then, each of 512 threads will add its element in the second half to its element in the first half, writing back to the first half. Now, we're going to synchronize all threads, with this syncthreads call right here, to make sure everyone is done. We've got 512 elements remaining, and so we're going to loop again on this resulting 512-element region to sum these 256 items with those 256 items. And we're going to continue this loop, cutting the region in half every time, until we have one element remaining at the very end of ten iterations. And then, we'll write that back out to global memory. So, this works. We can run it on a computer in our lab. So, we're doing that now, and we notice that it finishes in 0.65 milliseconds. Less than a millisecond, that's pretty great, but it's not as efficient as we might like. Specifically, if we take a good look at the code again, we're going to global memory more often than we'd like.
On each iteration of the loop, we read n items from global memory and we write back n over 2 items. Then, we read those n over 2 items back from global memory, and so on. In an ideal world, we'd do one original read, where we read all 1,024 items into the thread block, do all the reduction internally, and then write back the final value. And this should be faster, because we would incur less memory traffic overall. The CUDA feature we use to do this is called shared memory, and we'll store all intermediate values in shared memory, where all threads can access them. Shared memory is considerably faster than global memory. Let's take a look at the kernel. It's going to look very similar. And in this kernel, we're going to have the exact same loop structure. What's going to be different, though, is this little part right here. We have to first copy all the values from global memory into shared memory, and that's done with this little block. And then all the further accesses here are from shared memory, this sdata, as opposed to from global memory, which we did last time. And then, when we're done, we have to write this final value back to global memory again. The only other interesting part of this code is how we declare the amount of shared memory we need. And we do that here: we're declaring that we're going to have an externally defined amount of shared data. Now, we haven't actually said how much we need, so to do that, we're going to have to go down to where we actually call the kernel. So, when we're calling the reduce kernel using shared memory, we call it with now three arguments inside the triple chevrons, the normal blocks and threads, but then we say how many bytes we need allocated in shared memory. In this case, every thread is going to ask for one float stored in shared memory. So, the advantage of the shared memory version is that it saves global memory bandwidth. It's a good exercise to figure out how much memory bandwidth you'll save.
So, I'll ask that as a quiz, the global memory version uses how many times as much memory bandwidth, as the shared memory version. Round to the nearest integer.

So the answer is 3. And so, here I'm showing the memory traffic required for the global version; here I'm showing the memory traffic required for the shared version. I'm showing all the reads you need to do in red, and all the writes you need to do in blue. And if you sum this series, you'll find that for reducing n values in shared memory, you'll do n reads and one write, but for global memory, you do 2n reads and n writes, so you would expect that it would run faster, which it will, and let's show that now. So now I'm going to run the shared memory version, and we see that it does in fact run faster; now I'm down to 464 microseconds as opposed to 651. But it doesn't run 3 times faster. How come? Well, the detailed reason is that we're not saturating the memory system, and there are numerous advanced techniques you would need to use to totally max out the performance of reduce. We're not doing that today, but if you're really interested in micro-optimization of this kind, this application is a really great place to start. You want to look, in particular, at processing multiple items per thread instead of just one. You also want to perform the first step of the reduction right when you read the items from global memory into shared memory. And you want to take advantage of the fact that warps are synchronous when you're doing the last steps of the reduction. But these are all advanced techniques; we can talk about them on the forums if you all are interested.
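You can check the factor-of-3 answer by just counting item reads and writes per iteration (a sketch with hypothetical helper names, counting items rather than bytes):

```cpp
// Global-memory version: each iteration reads the m active items and
// writes back m/2 partial sums, so the totals approach 2n reads + n writes.
long global_traffic(long n) {
    long reads = 0, writes = 0;
    for (long m = n; m > 1; m /= 2) {
        reads += m;        // read the m active items
        writes += m / 2;   // write back m/2 partial sums
    }
    return reads + writes;
}

// Shared-memory version: n reads into shared memory, 1 final write.
long shared_traffic(long n) {
    return n + 1;
}
```

For n = 1,024, the global version moves 3,069 items versus 1,025 for the shared version, a ratio just under 3.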

We're going to introduce one of the most important parallel primitives, scan. Let me give you a very short example of a scan operation. The input to scan is a list of numbers, such as 1, 2, 3, 4, and an operation, such as add, and the output is the running sum of those numbers. So, each output is the sum of all the numbers in the input up to that given point; 6 is the sum of 1, 2, and 3. Now, scan is important because it allows us to address a set of problems that seem difficult to parallelize. At first glance, it might seem difficult to compute this output from this input in parallel, because each element in the output depends on the previous element. So first, we compute this 1, then we add 2 and get 3, then add 3 and get 6, then add 4 and get 10. But this style of operation turns out to be incredibly useful. It's also interesting because scan is just not a very useful operation in the serial world. It's really only useful when you're doing parallel computation. But once you know it and use it, you'll wonder what you ever did without it. Now, there are lots of uses for scan, with compaction and allocation being two of the most popular. Later in this lecture, we'll discuss histogram, which uses scan. And our research group has used scan for quicksort, for sparse matrix computation, and for data compression, among others. It's a very useful parallel primitive. But for this part of the lecture, I'm only going to concentrate on what scan does and how it works, rather than how it's useful. We'll learn about some more general scan applications in the next unit.

Because scan isn't an interesting serial primitive, it's a little bit harder to find a real-life example. The best example I know of is balancing your checkbook, which was useful back in the paleolithic days when people still wrote checks to each other. So, let's take a look at an example. When you manually balanced your checkbook, you had a list of transactions, and you usually had a couple of columns in your checkbook. The first column has your transactions, your deposits and withdrawals. The second column has your account balance, the cumulative sum of all the transactions so far. So, let's do this as an example. We start off with nothing in our bank, and our first deposit is $20, so our bank balance is now $20. We add another $5, then we make three withdrawals: withdraw $11, withdraw $9 more, withdraw $3 more, leaving us with $2 in our bank account; add in another $15, and so on. Now, the input to scan is this column of transactions, and the output of scan is like your bank balance. What you can see is that at any given time, your balance is the cumulative sum of all the transactions that you've seen so far. Scan here is calculating a running sum of its input. That's a perfectly cromulent way to think about scan. So now, let's turn to the mathematical description.
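The checkbook column is exactly an inclusive running sum; a small sketch (deposits positive, withdrawals negative; the function name is just illustrative):

```cpp
#include <vector>

// Running balance: each output is the cumulative sum of all
// transactions seen so far (an inclusive sum scan).
std::vector<int> running_balance(const std::vector<int>& transactions) {
    std::vector<int> balance;
    int sum = 0;
    for (int t : transactions) {
        sum += t;
        balance.push_back(sum);
    }
    return balance;
}
```

Feeding in the transactions above, {20, 5, -11, -9, -3, 15}, produces the balances {20, 25, 14, 5, 2, 17}.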

Like reduce, scan takes an input array and a binary associative operator as inputs. You may recall that associativity was useful in reduce to allow us to reorder operations to expose concurrency, and you'd be right in thinking that we're going to do the same thing here. For completeness, I should note that all the operators we use in this class are also commutative: x op y gives you the same result as y op x. Implementations turn out to be a little more complex if you don't use that assumption, but we're not going to cover that here. Now, scan also requires a third input: an identity value associated with the binary operator. So, for any operator op, I op a, where I is the identity element and a is any value, always gives you back a. For the binary associative operator addition, the identity element is 0, because if you add 0 to any value, you get back that value. For the binary associative operator minimum, the identity element is the maximum representable value. Consider an unsigned char data type, for instance. The identity element there is 0xFF, because the min of 0xFF and any unsigned char value a will always give you back a. So, a little quiz to make sure we understand binary associative operators and their identity elements. What are the identity elements for multiplication, for the logical or operator, and for the logical and operator?

The identity element for multiply is 1, because a times 1 always gives you back a. The identity element for the logical or operator is false, because a or false always gives you back a. And, the identity element for logical and is true, because a and true will always give you back a.

Now, let's define what scan actually does. What is the scan operation? So, there are three inputs to scan. The first is an array of length n. The second is an operator, a binary associative operator. For the purpose of explaining this, we're going to assume that operator is plus. And finally, we have an identity element that's associated with this particular operator. So in this case, the identity element is 0. But let's keep in mind that this discussion will apply to any binary associative operator. So, our input is an array of length n. We start off by designating the first element of the input as a0; the second element is a1, the third is a2, and so on. And so, the scan operation will produce an output given these inputs. So, what does that output actually look like? The output at any given position is the sum of all the elements that precede it in the input. So, let's see how that actually works. The output at position 0 is the sum of all the elements that come before it. Well, there are no elements before it, so the output is simply the identity element. The output at the first position is the sum of all the elements that come before it. In this case, there's one element that comes before it, so that's copied over to the output. The output at the third element here is the sum of all the elements that come before it, in this case, the sum of a0 and a1. The output for the fourth element is simply the sum of all the elements that come before it, in this case, the sum of a0, a1, and a2, and so on. And finally, we reach the final element, and it is the sum of everything in the array except for the very last element. That's the definition of scan. And keep in mind again that this will apply to any binary associative operator, whether that be multiply, max, min, logical operations like and and or, and so on. Let's do a simple example. Let's take a sum scan of this input array 3, 1, 4, 1, 5, 9. So, how do we compute the output?
First, we always start with the identity element, 0. Then, the next output is the first element, 3. Then 3 plus 1 gives us 4, then 3 plus 1 plus 4 gives us 8, then adding the next 1 gives us 9, and adding 5 gives us 14. So for a sum scan of the input 3, 1, 4, 1, 5, 9, the output is 0, 3, 4, 8, 9, 14. So, as a little quiz, let's make sure we understand. Let's try a max scan on an array of unsigned integers, the same array, 3, 1, 4, 1, 5, 9. First, what is the identity element if we do a max scan on unsigned ints? And second, what's going to be the output of this max scan operation?

The identity element for unsigned int is the element that will give us back the same output for any given input, and for max that's going to be the smallest representable value, which is a big, fat zero. Any element max zero is going to give us back that element. Now we're going to start computing the output here. So remember, we always start with the identity element here, and then for each output element we take the max of all the elements that precede it. So here, that's the max of 3, so it's simply 3. Next, the max of 3 and 1, that's going to give us 3. Next, the max of 3, 1, and 4 gives us 4; the max of 3, 1, 4, 1 gives us 4; and the max of 3, 1, 4, 1, 5 gives us 5. So the output is 0, 3, 3, 4, 4, 5: the max scan of this 6-element input vector of unsigned ints. You might note that the input element 9 is not used in this calculation, and that's correct; that's simply the nature of this formulation of scan.
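Both the sum-scan and max-scan examples fall out of one serial exclusive-scan routine (a generic sketch; `exclusive_scan` here is our own illustrative function, not the C++17 standard algorithm of the same name):

```cpp
#include <vector>
#include <algorithm>
#include <functional>

// Serial exclusive scan: output[i] is the reduction of all inputs
// strictly before position i, starting from the identity element.
std::vector<unsigned> exclusive_scan(const std::vector<unsigned>& in,
                                     std::function<unsigned(unsigned, unsigned)> op,
                                     unsigned identity) {
    std::vector<unsigned> out;
    unsigned acc = identity;
    for (unsigned x : in) {
        out.push_back(acc);   // emit BEFORE applying: current element excluded
        acc = op(acc, x);
    }
    return out;
}
```

An exclusive sum scan of {3, 1, 4, 1, 5, 9} with identity 0 yields {0, 3, 4, 8, 9, 14}, and an exclusive max scan yields {0, 3, 3, 4, 4, 5}; in both cases the last input, 9, never reaches the output.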

Let's analyze this algorithm. What we want to know is, how much work is it going to take for us to complete this algorithm? How many operations are we going to do? And how many steps is that going to take? So, the analysis is straightforward. For either flavor, inclusive or exclusive, for each of the n iterations of the loop, we take one step and we do one operation. Thus, we require n operations to complete the algorithm, and we also require n steps. But the parallelization of this algorithm is more complex than with reduce. With reduce, it was really clear that the different pieces of the computation tree could be computed in parallel. With scan, this is much less clear. So, the questions you should be asking yourself now are, a, how do we compute scan in parallel at all, and b, how can we reduce the work and step complexity of a parallel implementation as much as possible? But even before we get to those 2 questions, let's answer a larger question: why do I care about the parallelization of scan in the first place? So, here's a high-level, intuitive explanation. Many computations have the following form. We want to process a list of inputs, which I'm showing here with blue circles, and create a list of outputs, which I'm showing here with red circles, and we begin by computing the first output from the first input. Then, we use that first output and the second input to compute the second output. We use that second output and the third input to create the third output, and so on. For instance, in our checkbook example, we used the previous balance and the next check amount to compute the current balance. This creates a dependency structure that looks like what I've drawn in green. When we structure our computation like this, we see that it has no apparent concurrency. We're only doing one operation at a time. First, we compute the first output, and then we use the first output as an input to get the second output.
We use this output to get the third output, and so on. Now, this is a very serial structure, and if we implement it in this way, it would be a poor fit for a GPU indeed. Computations like this can often be expressed as a scan. And if we can characterize our computation in terms of scan, that's great. Because, let me tell you a little secret: we can parallelize scan and make it run fast on a parallel processor like a GPU. Now, not all computations with this sort of dependency structure will fit into the scan model, but many will. And for those computations, leveraging an efficient parallel scan implementation turns the problem from something that would be a poor fit for a GPU into something that is now a much better fit. So, how do we parallelize a scan computation?

Now, there are actually two different flavors of scan. The one I just presented is exclusive scan: the output at each location is a function of all the elements that came before, not including the current element. So let's see what the output would be if we do a sum scan on this eight element input array. This is straightforward to compute; just keep in mind that each output here is the sum of all elements that come before it, but not including it. Inclusive scan is defined slightly differently. Whereas exclusive scan summed up all the elements that came before and didn't include the current element, inclusive scan instead sums up all the elements before and including the current element. So let's see what the answer is here. In an inclusive scan, each output element is the sum of all the elements up to and including the current element. Now, we're not really covering applications of the scan primitives in this lecture. We'll do these in the next lecture, so it's not immediately obvious why you might need both flavors. But what you'll find is that some algorithms are best expressed using an inclusive scan; others prefer an exclusive scan. It's likely that if you use a library that contains scan primitives, you'll have access to both flavors. Fortunately, it's fairly straightforward to convert between them if you need to.

So let's turn to what implementations of scan look like, and we're going to start off with the straightforward, serial implementation. So, here's a serial implementation of inclusive scan. Let me tell you how this works. We start off by initializing an accumulator variable, which is going to hold our partial results, to the identity element. And then we're going to loop through all the elements in our input array, one at a time. At each step, we're going to do two things. The first thing we're going to do is take the current value of the accumulator and apply the operation to it with the current element. This could be any binary associative operator: plus, times, max, min, and so on. And we're going to store that value back in the accumulator. Then we're going to set the output element at this particular position equal to the accumulator, and move on to the next element. Now, what we've just defined is an inclusive scan. What we'd like you to do as a quiz is to convert this code to exclusive scan. And this is fairly straightforward, so why don't you give it a shot.

So it turns out doing this conversion is actually very straightforward. All we're going to do is flip these two lines around, okay? We interchange these two lines and we've turned this scan from inclusive to exclusive. And why is this the case? So if we remember, an inclusive scan sums up all the elements, including the current element. And then, writes its output. Instead, we want to convert this to an exclusive scan, which is going to sum up all of the elements except for the current element, and then set the output. So, what we do in the inclusive code that we see here, is first we apply the current element, and then output the result. But for an exclusive scan, we will output the result and then apply the current element. So, it's a very simple transformation to go from one to the next.

So let's revisit this inclusive scan example. We have, again, a six element input sequence, and we're going to scan it with the plus operator to generate this output sequence. Let's think about it this way. What is this output element as a function of all these input elements? Well, really, it's just a reduction of the first input element. The second output element is simply a reduction of the first two input elements. The third output element is a reduction of the first three inputs, and so on. In general, the nth output element is the reduction of the first n input elements. So, if we want to go ahead and compute this output as a function of the input, we can do that with the tools that we already have at our disposal. To compute all n outputs, we simply run n separate reductions, and we can run all n of these reductions in parallel. So let's take a quiz on this. As a function of n, how many steps is it going to take to scan in this way, with n reductions, and how much work is it going to take overall? Our choices are: a constant number of steps or amount of work, an amount proportional to the logarithm of the size of the input, a linear relationship between the size of the input and the number of steps or the amount of work, and a quadratic relationship. So what I'd like you to do is check the right box in the steps row and the right box in the work row, corresponding to the relationship between the size of the input and the number of steps or the amount of work.

First, we're going to look at the number of steps, the step complexity. And that should be the same as reduction, since we're running a whole bunch of parallel reductions. The reductions all have different sizes, so to get the overall step complexity, we need to look at the largest reduction, which is the reduction at the end here, of all n elements. So, the number of steps to reduce n elements, and thus the number of steps to perform this entire computation, is order of log n. To compute the total amount of work, we notice that we have n reductions to perform. The first reduction requires zero additions. The second reduction requires a single addition. The third reduction requires two additions, and in general, the nth output will require n minus 1 additions. And so, what we're going to have to do is sum up that whole series, 0 plus 1 plus 2, up to n minus 1. And if we add up that series, we find that the result is roughly n squared over 2 additions. So, the overall amount of work is proportional to the square of the number of elements. We would say that the number of additions overall is order of n squared. So, while we like having a smaller number of steps than the serial version (remember, that took all of n steps), this quadratic work complexity makes this formulation of scan ridiculously inefficient. So, what we're going to do next is look at two other algorithms that have a more reasonable cost.

So the two algorithms that we're going to look at are by Hillis and Steele, and by Blelloch. Both of them are going to be more efficient than the implementation we just saw, but they have different performance characteristics. What we'll learn is that Hillis and Steele's formulation of scan is more step-efficient than Blelloch's, but Blelloch's formulation is more work-efficient than Hillis and Steele's. They're both quite relevant today in parallel computing, and I hope you find their implementations as interesting as I do.

Our next implementation is an inclusive scan that was popularized in a 1986 paper by Danny Hillis and Guy Steele, who are two really interesting people. Danny Hillis founded a company, Thinking Machines, that built a massively parallel computer called the Connection Machine, which had scan as one of its core primitives, and awesome-looking blinky lights. He designed it for AI applications, and said that his goal was to build a computer that was proud of him. At Thinking Machines he worked with Guy Steele, who developed the Scheme programming language at MIT and later was one of the core developers of Java. Anyway, back to scan and this particular algorithm. The easiest way to show it is simply to draw it. So we're going to start with an input array that's simply the numbers 1 to 8. The first stage of this algorithm involves adding yourself to your neighbor one away. So for instance, 1 plus 2 is going to give you 3, 2 plus 3 is going to give you 5, and so on. Then we come back to the beginning, and anything that doesn't have a left neighbor, we just copy its value down. Stage two is going to involve adding yourself to your neighbor 2 to the left, so let's see how that's going to work. 1 plus 5 gives you 6, 3 plus 7 gives you 10, and so on. And again, if you don't have a neighbor 2 to the left, you just copy your value down. Final stage: now you're going to add yourself to your neighbor 4 to the left. So we add 1 and 14, giving us 15; add 3 and 18, giving us 21; and so on. And again, if you don't have a neighbor 4 to the left, you just copy your own value down. Now, take a look at what we got. This bottom row here is the inclusive scan of the top row here. So what's the algorithm that we used here? Starting with step zero, here are the steps. On step i, your job at each location is to add yourself to your neighbor to the left, 2 to the i hops away: so here one hop, here two hops, and here four hops away.
And if you don't have a neighbor that far to the left, you just copy yourself from the previous iteration. So as a quiz, let's analyze the complexity of this computation. As a function of n, the input size, how much work are we going to do in this algorithm, and how many steps is it going to take?

So we'll start with the steps. Intuitively, we're doubling the size of the communication on every step until we reach the size of the input. This goes 1 hop, this goes 2 hops, this goes 4 hops, and so on. So if that sounds to you like some sort of log, you're right: there are exactly log n steps to compute this scan. So our step complexity is log n. Now we turn to computing the amount of work that we're going to do. And the way we're going to do that is to think about this matrix that we drew as a big rectangle and say that at every point in this rectangle we're doing a computation. Some of those computations are adds; some are copies. It turns out that if you consider just the adds, the analysis ends up exactly the same. But we're going to say the total amount of work we do, the total number of computations, is simply the area of this rectangle. So, how big is this rectangle? Well, in the x direction, the rectangle is n items; we're scanning an array of n items. We already computed what the y size of this rectangle is: that's log n, the number of steps that we have. So the product of those two is going to give us the work complexity. And so the work complexity here is order of n log n.

So the Hillis and Steele scan is fairly efficient. It was actually the first scan to be implemented on GPUs. However, Guy Blelloch popularized another formulation of scan in 1990 that is even more work-efficient. It's a little more complex as an algorithm, though, so let's take a look at another example. This algorithm has two stages: first a reduce, then a downsweep. And let's note that this is an exclusive scan. So we're going to do a sum scan, that is exclusive, on the input array 1 through 8. Again, it's going to have two stages. The first is going to be a reduction, and it looks similar to the reductions that we've seen already. The second stage is going to be new: it's a downsweep, and it's going to use a different operator than we've seen before. Let's start with the reduction. The reduce step actually looks like a fairly standard reduction. The first step adds neighbors one hop away. The next step adds neighbors two hops away. And the third step adds neighbors four hops away. However, unlike in reduce, we're going to keep the intermediate results around. By intermediate results I mean this 1, this 3, this 3, this 10, 5, 11, and 7, because we're going to use them during the downsweep step. Now, step two, the downsweep. We're going to start by resetting the rightmost piece of data to the identity element. For sum, that identity element is zero. And now we're going to use a communication pattern that is exactly the same as what we saw in the reduction step, except in mirror image. So we see that we have this triangle kind of structure here; we're going to do the same thing upside down when we do the downsweep. But what's going to be different about the downsweep are two things. The first is that we're going to use a different operator.
So each operator is going to take two inputs, just as in reduction, a left input and a right input, but it's going to produce two outputs, not one. The left output is simply the right input copied to the left. The right output is equal to the sum, because we're doing a sum scan, of the two inputs. The second thing that's a little different is that we're going to be using those intermediate results. As we do a downsweep between two elements, some of the downsweep operands might already be written down here, because we derived them during the downsweep; but if we're missing a piece of data, we simply drag down the piece of data that we need from the intermediate results that we stored up here during the reduce. So let's get started. We're going to begin the downsweep by operating on this element and this element. That's, again, the mirror image of the communication pattern that we had before. Since we don't have an element written here, we'll just copy it down from the reduce step. And now we'll apply the downsweep operator to this pair of elements. So the first thing we're going to do is take this 0 and copy it over here. Then we'll take this 10, add it to this 0, and get 10. Now, we're going to operate on elements two to our left. So the 10 is going to be paired with the 11; the 10 will be copied to the left, and 11 plus 10 is 21. Now, for the final step, each of these elements will be paired with the item one hop to its left. So again, we're going to have to drag down intermediate values we kept around: we drag down the 7, drag down this 5, drag down this 3, drag down this 1, and then, for the final time, apply the downsweep operator. 0; 0 plus 1 is 1; copy the 3; 3 and 3 make 6; copy the 10; 10 and 5 make 15; copy the 21; 7 and 21 make 28. So now our output sequence is 0, 1, 3, 6, 10, 15, 21, 28, which is the exclusive sum scan of the input vector from 1 to 8.
What we see, for instance, is that every output, such as this 21 here, is the sum of all the elements that came before: 21 is the sum of 1 through 6, and so on. As a quiz, I'd like you to compute the max scan of this input sequence 2, 1, 4, 3 using this reduce-then-downsweep method. You're going to fill in these values from the reduction, then fill in each of these values from the downsweep, and when you finish, you should have the exclusive scan of these elements given these inputs. So give it a shot.

Okay, let's see how we do this. The operator that we're going to use during the reduce is max. So, 2 max 1: which one's bigger? 2. 4 max 3: which one's bigger? 4. And 2 max 4 gives us 4, completing the reduce. Now, the downsweep. Remember, we start with the identity element here. And then we're going to apply the downsweep operator. In this case, we're doing a downsweep with max. So, we will copy the right input to the left, just as in the sum example, and we will apply max to each of the pairs of elements to create the output to its right. So, right now we're running our downsweep operator on this 2 and this 0. We copy the 0 over, and now we run max on 2 and 0 and get 2. Now, we'll drag down these intermediate values here. So, we'll copy this to the left, and then 2 max 0 gives us 2. We'll copy to the left, and then 2 max 4 gives us 4, and then we're done. Note that the output at any particular element is equal to the maximum value we've seen up to, but not including, that element. The max value we see here is 4, which is the max of every element we've seen up to this point.

So now, let's analyze this algorithm. We know the complexity of the first phase, the reduce phase, already. We know that it has log n steps, as a function of the input count n, and order of n operations. In fact, if we have n elements, we're going to have n minus 1 additions. And the analysis for the downsweep phase is exactly the same as the analysis for the reduce phase, because the communication pattern here is exactly the same, just mirror-imaged. So, we know the downsweep also has log n steps and a linear amount of work. This is great. We've now reached the same work efficiency as the serial implementation. Note, however, that it has more steps than the Hillis and Steele formulation. Recall that the Hillis and Steele formulation had log n steps, and the Blelloch formulation, this reduce-then-downsweep method, has two log n steps. But the advantage of the Blelloch implementation is less work overall. Recall again that the Hillis and Steele scan had order of n log n work, whereas the Blelloch scan has a linear amount of work with respect to the input. Now, which is faster? That's a good question. To some extent, it depends on the particular GPU, the problem size, and the particular optimizations you chose during the implementation. But let me give you a little bit of intuition about how you might think about the problem of choosing between two different formulations of the same algorithm: one with superior work efficiency, like the Blelloch scan, and one with superior step efficiency, like the Hillis and Steele scan.

Consider the case where you have way more data, way more work, than processors. In this case, your implementation speed is limited by the number of processors, because your processors are busy all the time. So, you'd prefer the implementation that is more work-efficient. Conversely, let's say you have more processors than work. In this case, you have plenty of processors, so you're limited instead by the number of steps in the algorithm, and you'd be willing to do more work to get fewer steps; you've got more than enough processors to handle the extra work. So, you'd pick the implementation that is more step-efficient. There are a number of interesting parallel algorithms that have a pattern that looks like this: you start off with a lot of work, you narrow down to not very much work, and then you widen back out to a lot of work. An example of this is the Blelloch scan that we just looked at. You start reducing with a large number of items and get down to a small number of items; then, during the downsweep, you widen back out to a large number of items. In this case, an advanced but good strategy is to start off with a work-efficient algorithm when there's lots of work; once you get down to not very much work, switch to a step-efficient algorithm, when you have more processors than work; and then, as you start to widen back out, switch back to a work-efficient algorithm when you have enough work to fill the machine again. So now, as a quiz, we're going to look at three different scenarios of things we want to scan, given a particular hardware configuration. Your task is to figure out which algorithm is best suited: the serial algorithm, the step-efficient Hillis and Steele algorithm, or the work-efficient Blelloch algorithm. The scenarios are: a 512 element vector on a machine like a GPU that has 512 processors, a 1 million element input vector on 512 processors, or a 128K element vector with one processor.

Okay. So, let's look at these scenarios and figure out what the right choice would be. So, for the first, we're looking at a small input vector and you've got plenty of processors. So, you're not really worried so much about work efficiency, you have more than enough processors to do the work that you need to do. Thus, you're probably going to be concerned with the step efficiency of the algorithm that you pick and the one with the greatest step efficiency, is Hillis and Steele. Now conversely, when you have an enormous amount of work to do, and not nearly enough processors to do it, you're going to be looking for the algorithm that is going to have the best work complexity. And so, for this, if you have parallel processors, you're absolutely going to want to run the work-efficient algorithm, Blelloch's algorithm. Now, if you only have one processor to work with, you're stuck with a serial algorithm no matter what.

The third algorithm we'll cover today is called a histogram. Now, what's a histogram? It's a means of categorizing, and usually displaying, data that varies along a particular measurement. In practice, to create a histogram, we take a measurement for each item in a population, and then place those measurements into a set of bins that we define, where each bin covers a certain range of measurements. This is much easier to explain with an example. I'm recording this lecture at Udacity headquarters in Palo Alto, California. So, let's say I walk out on the street in Palo Alto and record the height of each of the next 100 adults that happen to walk by. What that gets me is a list of 100 measurements. Here's what that list might look like. But it's difficult to draw conclusions from such a list. Instead, what I'll do is create a histogram. I'm going to specify a number of bins. For instance, four bins: shorter than 150 centimeters, between 150 and 165 centimeters, between 165 and 180 centimeters, and taller than 180 centimeters. Then, I'll place each measurement into its bin. So, every time someone walks by, I place another tick mark into the appropriate bin. Here, for instance, we might end up with the counts 12, 34, 38, and 16 out of 100 people. Often in statistics, you might want to plot this histogram, which might give you a curve that tells you something interesting about the data. Let me introduce you to one specific operation you might want to do on a histogram of data, because you'll be doing this in your assignment. For the histogram we just showed here, we might want to compute the answer to the question: if I'm this tall, how many people are shorter than me, for all bins in the histogram? This computation is called the cumulative distribution function of the histogram; it's also known as a definite integral. We want to compute the cumulative distribution function on this histogram. The input is the histogram's array of values, for instance, 12, 34, 38, 16.
In one word or two words, to be a little more precise, what is the operation we need to do to this input to get the cumulative distribution function?

The answer: scan, or more precisely, exclusive scan. With the exclusive scan of this input, we can answer the question: if I'm 170 cm tall, how many people are definitely shorter than me? Just look at the answer here. We know that 46 people are definitely shorter than me, out of the 100 people that we interviewed in front of Udacity headquarters.

Our focus is instead to compute the histogram itself. Looking ahead, in this week's assignment you'll compute the histogram of pixel intensity values in an image. Histograms are not a difficult problem if you think about them serially, but they get more interesting in parallel. So, let's first take a look at the serial histogram algorithm. Here's our code. We're going to have two for loops. In the first phase, we loop over all the histogram bins and initialize the bin counts to zero. This is so simple we're not going to mention it again. Much more interesting is the second phase: we loop through all the data measurements. For each data measurement, we first determine to which bin it belongs; that's the compute bin call, and I'll show an example in a second. Then we fetch the current value of that bin, increment it, and store the new value back into the bin. For instance, in the previous example, let's say the input to compute bin is 175 centimeters. So we take a measurement; it's 175 centimeters. Compute bin then decides which of these bins is associated with 175 centimeters. In this case, compute bin would return this bin, because it stores all of the measurements between 165 and 180. So, we take 175, it's associated with this bin, and now we increment the value of this bin. So now let's consider these 4 bins, and we're going to trace the program with the 4 measurements 155, 150, 175, and 170 centimeters. The first loop initializes each of these bins to 0 items, and then we walk through the four measurements, using compute bin to decide which bin they're in and incrementing the values in those bins. First, we'll consider 155: compute bin tells us which bin it's in, and we increment that bin's value to 1. Then we look at 150; compute bin tells us it's in this bin again, and we increment its value to 2. Next we consider 175, which is associated with this bin, and we increment its value to 1. And finally, 170 falls into the same bin, so we store 2.
So, a couple of quick questions to make sure you understand the histogram. Let's say we have n measurements and b bins. As a function of n and b, what is the maximum number of measurements we would possibly ever see per bin, and what is the average number of measurements we would see per bin?

Well the maximum value that you'd ever see per bin is if all n values went in the same bin, so the answer is n. In any event, you have n items spread over b bins so on average you'd have n over b items per bin.

So now we're going to look at three different formulations of a workable parallel histogram implementation. Now, you might be looking at this serial code and thinking, why can't we just unroll the serial loop n times and launch a thread for each iteration of the loop? Note that if there are n measurements taken, we can consider launching n threads, each of which increments one bin count. It turns out this doesn't work, and it's important to understand why. Dave already covered this topic in the last unit, so feel free to zip through this part if you totally understand it already. Anyway, we're going to take a look at this particular kernel, the naive histo kernel. In this kernel, each thread will be responsible for classifying one element and incrementing the histogram bin corresponding to that element. So let's look at this very simple kernel. In this code, the first thing we do is compute our global ID; then we fetch our item from global memory; then we calculate which bin our item is associated with, and in this case, we're just using a very simple mod operator to do it. And finally, we increment the bin with which our item is associated. If we scroll down to the main routine here, we see that we have 65,536 elements that we're going to classify into 16 bins, so we would expect 4,096 items per bin. So what happens when we run this? I'm going to run this kernel on a workstation that I'm connected to in our lab. So, let's run this histo executable. What we're going to see are the bin counts for each bin. We expected 4,096 items per bin, and we're not getting anywhere close to that. In fact, if we run it again, you'll note that these bin counts even change from run to run. So what's going on here? Let's go back and look at the most important instruction in the kernel, the one where we increment the value in the bin.
So let's look at what's actually going on here. What does each thread actually do if we implement the simple serial algorithm in parallel? It does three separate operations. The first is a global memory read to fetch the bin value into a register. The second is incrementing the bin value within the register. And the third is storing the incremented value back to global memory. Let's illustrate how this could go wrong. Consider two threads running in parallel, one the black thread and one the red thread, both wanting to increment the same bin. This blue bin here starts off with the value 5. Both threads happen to be running at the same time, so the first reads the value of the bin into its local register, then the second reads the value of the bin into its local register. Both increment the value in their own local registers, and the first writes its value back to global memory, storing a 6, and the second does the same, also a 6. And now we've got a problem, since we'd really like the answer 7. The fundamental issue here is called a race condition. The problem is that incrementing the value in global memory takes multiple steps, and it's possible, as we've seen here, for two processors running simultaneously to interleave those steps. Note that this cannot happen in the serial code, because in the serial code each iteration of the loop runs separately, and there's no danger of one thread's code running at the same time as another thread's code. So the simple solution doesn't work. Now let's look at three different ways that we might implement this in parallel that will work. All three of these methods are good parallel methods; none is obviously better or worse, and we'll talk about their pros and cons as we describe them.

First, we'll use what's probably the simplest method. Recall that the reason the trivial parallelization didn't work was that the read-modify-write operation we needed to increment a bin was actually three separate operations: a global memory read, a local increment in a register, and a global memory write. What if we could make this read-modify-write operation be one operation instead of three? If it were only one operation, it could not be interrupted by another thread. We call such an operation atomic, and as we learned last unit, modern GPUs support atomic operations to memory. What the GPU does, in essence, is lock down a particular memory location during the read-modify-write process, so that no other thread can access it. This is a simple change in the code; we're only going to change one line. Here's our naive histogram, and here's the new histogram we're writing; notice we're only changing the last line, from a simple increment to an atomic increment. It's a simple change, but let's see the effect. Now we run the histogram program using simple histo instead of naive histo, and we see that when we do, we get the expected 4,096 elements per bin. Great news. Now, the disadvantage of this technique is that it serializes access to any particular bin during the atomic operation. Many threads might want to update that bin, but only one thread at a time actually gets to do so, and the others have to wait. This contention for the atomic will likely be the performance limiter in any histogram code. So, let's say you have a million measurements with which you'd like to calculate a histogram, and you can choose to have 10 bins, 100 bins, or 1,000 bins. If you use the atomic method, which will most likely run fastest?

The answer: 1,000 bins will probably be fastest. Since you're likely going to be limited by atomic performance, the histogram that offers the most bins is probably going to have the highest throughput. With a 1,000 bin histogram, 1,000 threads can be updating the histogram at any given time, whereas with a 10 bin or a 100 bin histogram, only 10 or 100 threads can actually be updating the histogram at any time.

In general, algorithms that rely on atomics have limited scalability, because the atomics limit the amount of parallelism. If we have 100 bins and use atomics to access them, then no matter how awesome a GPU we might have, we're still only doing 100 useful things at a time. Thus, there's interest in algorithms that might be a little more complex, but might scale better. So, we'll look at a reduce-based method next. Let's say we have a large number of items to place into our histogram, but relatively few bins. We're going to consider a pretty small example here, for the sake of discussion, just so we can draw everything that we need to. We'll say we have 128 items to place into bins, we have 8 threads that we've launched to take care of those items, and we only have 3 bins. So we're going to divide those 128 items among the 8 threads, and each thread will be responsible for 16 items. Now, rather than having only one single set of three bins in memory, let's give every thread its own set of bins, which we'll call its local histogram. Since we don't have very many bins, we can store each thread's bins in that thread's registers, which are fast. We have 16 items per thread, and each thread can independently accumulate its own local histogram in its own registers. So, we get eight separate local histograms, each of which has accumulated 16 items, which isn't particularly useful by itself. What we actually need is one global histogram rather than 8 local histograms. Quick quiz: do we need to use atomics to manage access to these local per-thread histograms, yes or no?

The answer: no. Each local histogram is only accessed by one thread, serially, so there's no shared access, and thus we don't need any atomics.

We can use reduction to combine these eight local histograms into a single global histogram. How do we do this? We simply reduce the first bin of each of the eight local histograms; that's an eight-item reduction, and we can do it with a reduction tree. Then we reduce the second bin, and then the third bin. Recall that reduction requires a binary associative operator. Note that adding two histograms together, two entire histograms of three bins each, is itself both binary and associative. So, it might be more efficient to figure out how to add two entire histograms together during the reduction rather than reducing each bin separately.

The final algorithm we'll discuss, which we're going to call sort, then reduce by key, is one that I'll just sketch out, primarily because we haven't discussed the underlying algorithms yet; we'll get to them in the next lecture. However, it'll be a great mental exercise for you to think about how you might implement this approach. Let's think about our entries. Here we have eight entries as key-value pairs. The key is the bin number, and in this example we're going to have three bins: 0, 1, and 2. The value is 1, which we're going to add to each bin. This algorithm has two steps: the first is a sort, the second is a reduce. So first, let's look at the sort. We're going to sort the key-value pairs by key. So here are our keys, and notice it's the same key-value pairs that we have up here, but now they're in ascending order by key. The second step is to reduce the result by key; naturally enough, this operation is called reduce by key. What I mean here is to add all the elements that have the same key together. So, we'd like to add these four ones up to create the reduction value of four, or, more generally, to reduce these values using a reduction operator. Note that after the sort, all elements with the same key are contiguous, which is the key to making an efficient implementation. Now again, this is likely not something you can do with the algorithms that you know so far, but libraries of parallel primitives may have implementations of both of these operations. For instance, the Thrust library that accompanies a CUDA distribution implements both sort and reduce-by-key methods. Thinking about how you might implement these two operations is worth your time over the next week.

Now, with one word we learned today: how can we combine these 8 local histograms, each of which summarizes the 16 items assigned to its thread, into one global histogram?

And the answer is reduce. We're going to take these 8 local histograms, and we're going to reduce them into one global histogram. Let's see how that works.

Final thoughts on histogram: it's certainly possible, even desirable, to combine these methods. For instance, the disadvantage of the atomic method is that a large number of threads are trying to get simultaneous access to the bins, so most threads end up waiting around. If we could reduce the number of threads trying to simultaneously access an atomic, that would relieve the bottleneck. So, we might choose to use the reduction technique within a block to combine all the individual histogram operations from each thread in that block into a single histogram, and then use a single thread within that block to add that per-block histogram into the global histogram. For instance, if we have 256 threads in our block and eight bins in our histogram, we might choose to do our implementation one of two ways. One is that each individual thread within our block atomically increments a global bin. The other is that we take the 256 threads that we have, each one holding a single measurement, reduce them within the block locally to produce one 8-element histogram, and then use that 8-element histogram to update the global histogram. For each of these, how many atomic adds are we going to need to do?

Well, for the atomic-only technique, we're going to have 256 threads, each of which has to increment one bin, so we know we're going to need 256 atomic increments. However, if we do all of our reduction within the block, with no atomics necessary, then once we've reduced to a single 8-element histogram, only those eight elements must update the global histogram with an atomic add to complete the histogram calculation. So, this is potentially a large savings in atomics, which is in turn going to help your performance.

We learned three algorithms today, and we'll use all three in this week's homework, when we implement a procedure called tone mapping on an image. Generally, tone mapping is the process of converting one set of colors into another set of colors. The reason we want to do this is typically to take a raw image that accurately measures light intensity and map it onto some physical medium, which might be a piece of paper, or a television, or your computer monitor. The problem is that a real-world scene can have brightnesses that span many orders of magnitude in intensity; a piece of white paper in bright sunlight is literally a billion times brighter than the same paper on a dark night. But an output device, like a CRT, can only represent about two orders of magnitude between its brightest and least bright pixel. So, how do we do this mapping? If we do a bad job, then we'll have images that are either washed out and overly bright, or instead overly dark. Let's take a quick look at some pictures to see why this is important. Here's a couple of lovely images of Old Saint Paul's in Wellington, New Zealand, licensed by the photographer Dean Pemberton under a Creative Commons license. Thanks, Dean. What we see here is two different exposures of the interior of the building. If you look at the darker image, you'll see that most of it is quite dark and so shows little detail. If we look at the brighter image, we see a lot more detail in some areas, but bright parts of the image, like the stained glass window, are completely washed out. We can characterize this in terms of histograms: in the dark image, most of the pixels are at the low end of an intensity histogram, and in the light image, they're at the high end. What tone mapping does is remap those colors so they do a better job of spanning the entire color range. The intensity spectrum, then, will look closer to this.
In the homework, you'll implement a classic and excellent tone mapping algorithm developed by Greg Ward. This algorithm uses all three of the primitives you've learned today, reduce, scan, and histogram, to implement such a tone mapping. Good luck, and we'll see you in the next unit.