cs344 ยป

## Welcome to Unit #3

Hi. Welcome back to Unit 3. Today you're going to learn some new techniques for your GPU toolbox. First, how to analyse the speed and efficiency of GPU algorithms. And second, 3 new fundamental algorithms, reduce, scan, and histogram. Really excited about this material, so let's get started.

## Fundamental GPU Algorithms

Alright, today we're going to cover Lecture 3, and this lecture's going to be on fundamental GPU algorithms. We're going to cover the three different algorithms today, reduce, scan and histogram. So, GPU computing is based on exploiting concurrency in a workload, expressing that concurrence in a language like CUDA, allows parallel implementations on the GPU that can take advantage of hundreds of simultaneous computations. Now, the GPU is well suited to communication patterns that are completely independent, like the map operation. So, in the map operation, we're going to have a number of computational elements. And each one of them will compute it's output in parallel and completely separately without any communication whatsoever. It's also great at stencil, or more generally gathered operation like we saw in the last lecture. So, we might have an item we want to compute here and we're going to go out and fetch several items from memory and do our computation. But, the computation of the next item is again completely separate from the computation of the first one. However, not all communications fall into these categories. Some of them have more complex computation patterns, such as, all to one or all to many communications patterns. So, for instance, this element here depends on these three elements here and the computation of this element here will depend on these three elements. And notice that there's some overlap here. So, it's a little bit more complicated. So, we'll cover these more comp, complex computation patterns and three primitives that can implement them in today's lecture. These primitives are reduce, scan and histogram. And we'll also use them in this week's homework.

## Digging Holes Again

So, let's go back to our dig a hole example. If we're going to dig a hole, we're really interested in two things. One is how long it's going to take to dig that hole, and two, how much work, how many total hours it will take to dig. Now, if you're the only one digging this hole, these two things are the same. So, we have one worker, and perhaps it's going to take you 8 hours to finish, and so that means over all you're going to do 8 hours of work. But if you bring all your friends over to dig the hole with you, the amount of time to finish and the total amount of work, might be different. So now, we're going to have 4 workers, and, perhaps they'll work together for 8 hours total and thus be able to finish in 2 hours. Thus, they've finished digging this hole four times as fast. We'd actually call this ideal scaling. Since adding four times as many workers allowed us to finish four times as fast. Alternatively, we might have four diggers, each of which gets in each others way, so we're not going to be quite as efficient. So, for instance, we might have four workers, and together it's going to take them four hours to finish. Thus, meaning that, together they've done 16 hours of work. Now, you might reasonably ask yourself, why your instructor spends so much time talking about digging holes, and that's a fair point. But there's a couple of important points of terminology, that I want to make here. The first is the concept of a step. It's how long it takes, to complete a particular computation, operation, hole digging. The second, is the total amount of work that it took to perform that computation over all the active workers. And, so we're going to go into a little more depth in terms of the way that we define these.

## Steps and Work

As we talk about algorithms, we're going to talk about 2 costs. The first is step complexity and the second is work complexity. So, as an example here, we have 8 elements that we'd like to combine using this tree style structure. And so, we're going to try to characterize the number of steps that it's going to take us to do this computation, as well as the total amount of work. So first, we're going to look at the number of steps. We see that it's going to take us 3 steps to finish. This first step here, we'll do 4 operations, the second step can be done in parallel with 2 operations, and then the third step is a final operation to get a final result. But we can also count the total amount of work we have done here. We've done 1, 2, 3, 4, 5, 6, 7 operations. So, we'd say the step complexity is 3, and the work complexity is 7. So, we'll compare the step and work complexity of the parallel implementations that we develop against the step and work complexity for a serial implementation. And one more piece of terminology, we will say that a parallel algorithm is work-efficient, if it's work complexity, it's asymptotically the same. So, within a constant factor as the work complexity of the sequential algorithm. Now, if we can reduce the step complexity in our parallel implication compared to the serial implementation while still having a reasonable work complexity that isn't too expensive, we expect that this will lead to faster runtime overall. So, the real key here is to formulate the most efficient parallel algorithm. And that's the focus of this lecture and the next lecture.

## A Quiz on Step And Work

Okay. So, quick quiz, just to test a little understanding about number of steps and total amount of work. We have a more complicated tree right here, and so you can think of every circle here as one operation. Some operations take 2 inputs, some operations take one. So, what are the total number of steps in going from the inputs up here to the output down here? And what is the total amount of work measured in number of operations to get from the eight inputs up here to the one output down here?

## A Quiz on Step And Work

So to compute the number of steps, what we're going to do is start at the top, and count how many steps it gets from the top to the bottom. So here's the first step, it has eight operations. Second step, third step, fourth step, fifth step, sixth step, and then we computed the final results. We'd say there are 6 steps here. What's the total amount of work? Well, to do that, we count the number of operations. That's 8 plus 4 plus 4 plus 2 plus 2 plus 1. Gives us 21 operations to compute the final answer here. When we discuss algorithms, we'll often discuss these metrics not so much in terms of numbers, the number of steps, or the total amount of work but instead as functions of the size of the input. So, we might have a totally different problem and we might say the amount of work in that problem is proportional with the size of the input squared, or the number of steps in this particular problem is proportional with the size of the input.

## Reduce Part 1

So, the first algorithm that we're going to cover today is called reduce. So, an example of reducing a set of numbers, for example, is adding up all the numbers in this set to get a total sum. So reduce is an interesting algorithm, because it requires cooperating between processors. So here we have a tree that we can use to compute the sum of a large number of integers. And so it's interesting because we now have dependencies between the operations. First, we add 1 and 2, and we have to take the result and then add it to 3, and then when we have that sum. We can take the result and add it to 4, and so on. So, we now have dependencies between operations that we need to do. And we haven't looked at any algorithms that have this behavior before. Reduce is also interesting because a parallel implementation, like the one that we're about to describe, will have a smaller and a better number of steps than a serial implementation.

## Reduce Part 2

S,o we're going to start with the mathematical definition of reduce. Reduce has two inputs. The first input is a set of elements, and we're going to assume they're in an array. Second input is a reduction operator that will operate on the elements in this array. For example, we have a list of numbers here, and we have a reduction operator of plus, and the reduction operator will sum them all up. Let's be a little bit more precise in terms of what operators we're going to support with our parallel implementation. Our operators need to have the following two characteristics. The first is that the operators are binary, the operator must operate on two inputs and create one output, plus is a binary operator for instance. Two, associative. Associativity means that if we have two or more operations in a row, a op b op c, the order of operations doesn't matter as long as the order of the operands aren't changed. To put it simply, a op b take the result and op it with c, needs to give you the same answer as b op c, and then you op the result with a. You should convince yourself that plus is an associative operator but minus is not. Now, it's not immediately clear why we need this property but we're going to see why in a few minutes.

## Binary and Associative Operators

Now it's time for a quick quiz. In the list below, check which operators are both binary and associative. And the list is, multiply, minimum, factorial, logical or, bitwise and, exponentiation, and division.

## Binary and Associative Operators

Four of these operators are both binary and associative, multiply, minimum, logical or, bitwise and. Factorial isn't binary, it only has one argument and it needs two arguments to be binary. And both exponentiation and division are not associative. So, if you want to prove that to yourself, note that say, 4 to the by 2 is not equal to 8 divided by 4 divided by 2.

## Serial Implementation of Reduce

So, the first two we're talking about the amount of work. And the correct answer is it takes n minus 1 operations to add up n elements takes n minus 1 adds. Next, is it's work complexity order of n? Yes, it is. So the amount of work that we do is linear with respect to the size of the input. If we double the size of the input, we're going to double the number of additions that we do. Is, is step complexity order one, meaning independent of the size of input? No, its step complexity is also linear in the size of the input. As we double the size of the input, we'll double the number of steps to take to reduce all these items. So, to sum up, our serial reduction algorithm has a work complexity of order of n. What that means is that the amount of work is linear in the size of the input. If we double the size of the input, we double the number of operations that we're doing. Our step complexity is also order of n, meaning again, that if we double the size of the input, we double the number of steps that takes to compute the output.

## Parallel Reduce

So now, we're going to turn to the parallel reduce. So how do we formulate an algorithm, a procedure to be able to speed this up by running in parallel? So, let's take a look at this reduce picture here. At first glance, this seems very difficult to parallelise. Note that every operation is dependent on the result of the previous operation. So, let's write these operations explicitly. When we formulate the reduction in this way, we have no parallelism. So, can you figure out how to rewrite this to expose more parallelism so that we can do more than one thing at the same time? And the hint is, use the associative property of the operator. How do you rewrite a plus b, take the result add it to c, take the result add it to d, to allow parallel execution? In your answer, please use parentheses to show grouping.

## Parallel Reduce

So what we're going to try to do is expose a little bit of parallelism, and we're going to do that by first computing a plus b, and perhaps at the same time, computing c plus d and then add the results together. Now, what's that going to look like in terms of our tree structure? So now we got serial reduce and parallel reduce side by side. Here's serial reduce and the equation. Here's parallel reduce and the equation. Now, what we've done to do parallel reduce, is regrouping these operations in a different order, and this exposes more concurrency. We now have the ability to run multiple operations in parallel at the same time. Now you see why associativity was a necessary quality of the reduction operator. We can reassociate operations to expose more concurrency, then we can run concurrent operations in parallel to reduce the time to solution. Let's see how this works with our four element reduction. Let's compare our serial reduction and our parallel reduction. Both of these have three additions. Both of them have three units of work. But whereas, the serial reduction has one, two, three steps to complete, the parallel reduction only has one, two steps to complete. So potentially, we can run this with parallel hardware, and it will complete faster. If we extend this problem to reduce a set of elements of size n, we're very interested in the work and the step complexity as a function of n. So what's the complexity of these four metrics? So three of them are pretty straightforward. The serial implementation here has both order of n, linear work and linear steps. And the parallel reduction also has linear work, so it's also o of n. But this one's a little bit more complicated to compute. So let's dive in and take a look at this.

## Step Complexity of Parallel Reduce

So, let's look at the trend of the step complexity of parallel reduction. So first, we're going to look at what happens when we add 1, 2 elements to create one output. So, if we have two elements in our input sequence it's going to take one step to finish. If we have four elements in our input sequence, it's now going to take 1, 2 steps to finish. And if we have 8 elements in our input sequence, 1, 2, 3, 4, 5, 6, 7, 8, we know that it's now going to take 1, 2, 3 steps to finish. So, the quiz is, what is this relationship? Is the number of steps as a function of n, square root of n, log base 2 of n, n, or n log base 2 of n?

## Step Complexity of Parallel Reduce

And the answer is log base 2 of n. Note that 2 to the power of the number of steps equals the number of elements. We would say that the step complexity of this parallel algorithm is o of log n. So, if we reduced a 1024, we would see that it would take 10 steps, compared to a 1023 in the serial case. This is 2 orders of magnitude fewer steps. And now, you're starting to get an idea of why parallel implementation might get significant speedups. However, note that it can only get these speedups if it has enough processors to handle all these operations at the same time. If we're adding a 1024 elements in parallel, our first step requires performing 512 additions at the same time. Now, modern GPU can actually handle this. But if we're adding a million items in parallel, our math would tell us, we could finish in 20 steps but would also tell us we'd need to perform 500,000 additions at the first step at the first time, which is not too likely given the desktop hardware we have at our disposal today. That's okay though. Even if we're only keeping 500 processors doing additions throughout the computation rather than 500,000, that's still an enormous speed up over having only one processor doing additions. It's a good exercise to work out how many steps it takes to run this algorithm for an input of size n if you have only p processors. This is called Brent's Theorem, which the interested student will want to look up and perhaps discuss on the discussion forum.

## Scan

We're going to introduce one of the most important parallel primitives, scan. Let me give you a very short example of a scan operation. The input to scan is a list of numbers, such as 1, 2, 3, 4, and an operation, such as add, and the output is the running sum of those numbers. So, each output is the sum of all the numbers in the input up to that given point. 6 is the sum of 1, 2, and 3. Now, scan is important because it allows us to address a set of problems that seem difficult to parallelize. At first glance, it might seem difficult to compute this output from this input in parallel because each element in output depends on the previous element. So first, we compute this add 2 and get 3, add But this style of operation turns out to be incredibly useful. It's also interesting because scan is just not a very useful operation in the serial world. It's really only useful when you're doing parallel computation. But once you know it and use it, you'll wonder what you ever did without it. Now, there's lots of uses for scan, with compaction and allocation being two of the most popular. Later in this lecture, we'll discuss histogram, which uses scan. And our research group has used scan for quick sort, for sparse matrix computation, and for data compression, among others. It's a very useful parallel primitive. But for this part of the lecture, I'm only going to concentrate on what scan does and how it works rather than how it's useful. We'll learn about some more general scan applications in the next unit.

## Inputs to Scan

Like reduce, scan takes an input array and a binary associative operator as inputs. You may recall that associativity was useful in reduce to allow us to reorder operations to expose concurrency. And you'd be right in thinking that we're going to do the same thing here. For completeness, I should note that all the operators we use in this class are also communicative. X op y gives you the same result as y op x. Implementations turn out to be a little more complex if you don't use that assumption. But we're not going to cover that here. Now, scan also requires a third input. An identity value associated with the binary operator. So, for any operator op, I op a, where I is the identity element and a is any value, always give you back a. For the binary associative operator addition, then the identity element is 0 because if you add 0 to any value, you get back that value. For the binary associative operator minimum, the identity element is the maximum representative value. Consider an unsigned char data type, for instance. The identity element there is 0xFF, because min of unsigned chars between 0xFF and any value a, will always give you back a. So, a little quiz to make sure we understand binary associative operator and their identity elements. What are the identity elements for multiplication, for the logical or operator, and for the logical and operator.

## Inputs to Scan

The identity element for multiply is 1, because a times 1 always gives you back a. The identity element for the logical or operator is false, because a or false always gives you back a. And, the identity element for logical and is true, because a and true will always give you back a.

## What Scan Actually Does

Now, let's define what scan actually does. What is the scan operation? So, there's three inputs to scan. The first is an array of lenght n. The second is a operator, a binary associated operator. For the purpose of explaining this, we're going to assume that operator is plus. And finally, we have an identity element that's associated with this particular operator. So in this case, the identity element is 0. But let's keep in mind that this discussion will apply to any binary associative operator. So, our input is an array of length n. We start off by designating the first element of the input as a0, the second element is a1, the third is a2 and so on. And so, the scan operation will produce and output given these inputs. So, what does that output actually look like? So, the output at any given position is the sum of all the elements that precede it in the input. So, let's see that actually works. So, the output position 0 is the sum of all the elements that come before it. Well, there's no elements before it so the output is simply the identity element. The output at the first position is the sum of all the elements that come before it. In this case, there's one element that comes before it, so that's copied over to the output. The output in the third element here is the sum of all the elements that come before it. In this case, the sum of a0 and a1. The output for the fourth element is simply the sum of all the elements that come before it. In this case, the sum of a0, a1, and a2, and so on. And so, finally, you reach the final element, and it is the sum of everything in the array, except for the very last element. That's the definition of scan. And keep in mind again that this will apply to any binary associative operator, whether that be multiply, max/min, logical operations like and, and or, and so on. Let's do a simple example. Let's take a sum scan of this input array 3, 1, 4, 1, 5, 9. So, how do we compute the output? First, we always start with the identity element. Then, we start with the first element 3. 3 and our input here to scan to a sum scan 3, 1, 4, 1, 5, 9, the output is 0, 3, 4, 8, operator. So, as a little quiz, let's make sure we understand. Let's try a max scan on an array of unsigned integers, the same array of unsigned integers, 3, element if we do a max scan on unsigned ints? And second, what's going to be the output of this max scan operation?

## What Scan Actually Does

The identity element for unsigned Int is the element that will give us back the same output for any given input, and for us that's going to be the smallest representable value, and that's going to be a big, fat zero. Any element max zero is going to give us back that element. Now we're going to start computing the output here. So remember, we always start with the identity element here and then for each output element we take the max of all the elements that preceded. So here, that's the max of 3, so it's simply 3, next the max of 3 and 1, that's going to give us 3. Next max of 3, 1, and 4, max of 3, 1, 4, 1, max of 3, 1, 4, unsigned Ints of this 6 element input vector. You might note, that the input element 9 is not used in this calculation, and that's correct, that's simply the nature of this formulation of scan.

## Why Do We Care About Parallel Scan

Let's analyze this algorithm. What we want to know is, how much work is it going to take for us to complete this algorithm? How many operations are we going to do? And how many steps is that going to take? So, the analysis is straightforward. For either flavor, inclusive or exclusive, for each of the n iterations of the loop, we take one step and we do one operation. Thus, we require n operations to complete the algorithm, and we also require n steps. But the parallelization of this algorithm is more complex than with reduce. With reduce, it was really clear that the different pieces of the computation tree could be computed in parallel. With scan, this is much less clear. So, the questions you should be asking yourself now are a, how do we compute scan and parallel at all, and b, how can we reduce the work and step complexity of a parallel implementation as much as possible? But even before we get to those 2 questions, let's answer a larger question. Why do I care about the parallelization of scan in the first place? So, here's a high level intuitive explanation. Many computations have the following form. If we want to process a list of inputs, which I'm showing here with blue circles, and create a list of outputs, which I'm showing here with red circles, and we begin by computing this first output from this first input. Then, we use that first output and the second input to compute the second output. Use that second output and the third input to create a third output, and so on. For instance, in our checkbook example, we used the previous balance and the next check amount to compute the current balance. This creates a dependency structure that looks like what I've drawn in green. When we structure our computation like this, we see that it has no apparent concurrency. We're only doing one operation at a time. First, we compute the first output, and then we use the first output as an input to get the second output. We use this output to get the third output, and so on. Now, this is a very serial structure, and if we implement it in this way, it would be a poor fit for a GPU indeed. Computations like this can often be expressed as a scan. And if we can characterize our computation in terms of scan, that's great. Because, let me tell you a little secret, we can parallelize scan and make it run fast on a parallel processor like a GPU. Now, not all computations with this sort of dependency structure will fit into the scan model, but many will. And for those computations, leveraging and efficient parallel scan implementation turns the problem from something that would be a poor fit for a GPU into something that is now a much better fit. So, how do we parallelize a scan computation?

## Inclusive Vs. Exclusive Scan

Now there's actually two different flavors of scan. The one I just presented is exclusive scan. The output at each location is a function of all the elements that came before, not including the current element. So let's see what the output would be if we do a sum scan on this eight element input array. So this is straightforward compute and let's just keep in mind the output here is the sum of all elements that come before it but not including it. Inclusive scan is defined slightly differently, whereas exclusive scan summed up all the elements that came before and didn't include the current element Inclusive scan instead sums up all the elements before and including the current element. So let's see what the answer is here. And so an inclusive scan, this element is the sum of all the elements up to and including the current element. Now, we're not really covering applications of the scan primitives in this lecture. We'll do these in the next lecture, so it's not immediately obvious why you might need both flavors. But what you'll find is that some algorithms are best expressed using an inclusive scan. Others prefer using an exclusive scan. It's likely that if you use a library that contains scan primitives, you'll have access to both flavors. Fortunately it's fairly straight-forward to convert between them, if you need to.

## Serial Implementation of Scan

So let's turn to what implementations of scan look like, and we're going to start off with the straightforward, serial implementation. So, here's a serial implementation of inclusive scan. So let me tell you how this works. We start off by initializing an accumulator variable, that's going to sum our partial results, to the identity element. And then we're going to loop through all the elements in our input array. One at a time. So at each step we're going to do two things. The first thing we're going to do is we're going to take the current value of the accumulator and we're going to apply the operation to it with the current element. So this could be any binary associative operator plus times max min and so on. And we're going to store that value back in the accumulator. Then we're going to set the output element at this particular position equal to the accumulator, then move on to the next element. Now what we've just defined is an inclusive scan. What we'd like you to do as a quiz, is to convert this code to exclusive scan. And this is fairly straight forward, so why don't you give it a shot.

## Serial Implementation of Scan

So it turns out doing this conversion is actually very straightforward. All we're going to do is flip these two lines around, okay? We interchange these two lines and we've turned this scan from inclusive to exclusive. And why is this the case? So if we remember, an inclusive scan sums up all the elements, including the current element. And then, writes its output. Instead, we want to convert this to an exclusive scan, which is going to sum up all of the elements except for the current element, and then set the output. So, what we do in the inclusive code that we see here, is first we apply the current element, and then output the result. But for an exclusive scan, we will output the result and then apply the current element. So, it's a very simple transformation to go from one to the next.

## Inclusive Scan Revisited

So let's revisit this inclusive scan example. We have again, a six element input sequence and we're going to reduce it with the plus operator to generate this output sequence. Let's think about it this way. What is this output element as a function of all these input elements? Well really what it is, is just a reduction of the first element. The second output element is simply a reduction of the first two input elements. The third output element is a reduction of the first three inputs, and so on. In general, the nth output element is the reduction of the first n input elements. So, if we want to go ahead and compute this output as a function of the input, we can do that with the tools that we already have at our disposal. To compute all n outputs, we simply run n separate reductions, and then we can run all n of these reductions in parallel. So let's take a quiz on this. As a function of n, how many steps is it going to take to scan in this way, with n reductions, and how much work is it going to take overall? And our choices are, constant amount of steps or work, amount of steps or work proportional to the logarithm of the size of the input, a linear relationship between the size of the input and the number of steps or the amount of work, and a quadratic relationship between the size of the input and the number of steps, and the number of work. So what I'd like you to do is check the right box in the steps row, and check the right box in the work row that corresponds to the relationship between the size of the input and the number of steps for the amount of work.

## Inclusive Scan Revisited

First, we're going to look at the number of steps, the step complexity. And that should be the same as reduction, since all are running as a whole bunch of parallel reductions. The reductions all have different sizes. So, to get the overall step complexity, we need to look at the largest reduction, which is a reduction at the end here, of all n elements. So, the number of steps to reduce n elements, and thus the number of steps to perform this entire computation, is order of log n. To compute the total amount of work, we notice that we're going to have n reductions to perform. The first reduction requires zero additions. The second reduction requires a single addition. The third reduction requires two additions and in general, the nth output will require n minus 1 additions. And so, what we're going to have to do is sum up that whole series, 0 plus 1 plus 2 on to n minus 1. And if we do add up that series, we're going to find that the result is roughly n squared over 2 additions. So, the overall amount of work is proportional to the square of the number of elements. We would say that the number of additions overall is order of n squared. So, while we like having a smaller number of steps than the serial version, remember that took all of n steps, this quadratic work complexity makes this formulation of scan ridiculously inefficient. So, what we're going to do next is look at two other algorithms that have a more reasonable cost.

## Hillis Steele vs Blelloch Scan

So the two algorithms that we're going to look at are by Hillis and Steele, and by Blelloch. Both of them are going to be more efficient than the implementation we just saw, but they have different performance characteristics. What we'll learn is that Hillis and Steele's formulation of scan is more step-efficient than Blelloch's, but Blelloch's formulation is more work-efficient than Hillis and Steele's. They're both quite relevant today in parallel computing, and I hope you find their implementations as interesting as I do.

## Hillis Steele Scan

So we'll start with the steps. Intuitively, we're doubling the size of the communication on every step until we reach the size of the input. This goes 1 hop, this goes 2 hops, this goes 4 hops over and so on. So if that sounds to you like some sort of log, you're right. They're exactly log n steps to compute this scan. So our step complexity is log n. So now turn to computing the amount of work that we're going to do. And the way we're going to do that is we're going to think about this matrix that we drew as a big rectangle and we're going to say, well, at every point in this rectangle we're doing a computation. Some of those computations are adds. Some of those computations are copies. It turns out that if you just consider the adds in the analysis ends up being the exact same. But we're going to say the total amount of work we do, the total number of computation is simply the area of this rectangle. So, how big is this rectangle? Well, in the x direction, the rectangle is n items, we're scanning an array of n items. We already computed what the y size of this rectangle is, that's log of n, the number of steps that we have. So the product of those 2 is going to give us the work complexity. And so the work complexity here, is order of n log in.

## Blelloch Scan

So Hillis and Steele scan is fairly efficient. It's actually the first scan to be implemented on GPUs. Guy Blelloch popularized another formulation of scan in 1990, however, that is even better at work efficiency. It's a little bit more complex in terms of the algorithm though, so let's take a look at another example. This algorithm has two stages, first to reduce, then to downsweep. And let's note that this is an exclusive scan. So we're going to do a sum scan, that is exclusive, on the input array one through eight. And so it's going to again have two stages. The first is going to be a reduction, and it looks similar to the reductions that we've seen already, and then the second stage is going to be new. It's going to be a downsweep, and so that's going to take a different operator than we've seen before. Let's start with the reduction. The reduce step actually looks like a fairly standard reduction. The first step adds neighbors one hop away. The next step adds neighbors two hops away. And the third step is going to add neighbors four hops away. However, unlike in reduce, we're going to keep the intermediate results around. By intermediate results I mean this one, this three, this three, this ten, five, 11 and seven, because we're going to use them during the downsweep step. Now step two, the downsweep, we're going to start by resetting the rightmost piece of data to the identify operator. For sum, that identify operator is zero and now we're going to use a communication pattern that is exactly The same as what we saw in the reduction step except in a mirror image. So we see that we have this triangle kind of structure here. We're going to do the same thing upside down when we do downsweep. But what's going to be different about downsweep are two things. The first one is that we're going to use a different operator. So each operator is going to take two inputs, just as we did in reduction, a left input and a right input, but it's going to produce two outputs, not one. So the left most output is simply the right output copied to the left. The right output is equal to the sum, because we're doing sum scan of the two inputs. The second thing that's a little bit different is that we're going to be using these intermediate results. As we need to do a down-sweep between two elements, some of the down-sweep operators might be written down here, because we've derived them during the down-sweep, but if we're missing a piece of data, then we just simply drag down the piece of data that we need from the intermediate results that we stored up here during the reduce. So let's get started. We're going to being the down sweep by operating on this element and this element. That's a again, the mirror image of the communication pattern that we had before. So since we don't have an element written here, we'll just copy it down from the element that we have, right. And so now we'll apply this down sweep operator to these, this pair of elements. So the first we're going to do, is we're going to take this 0, copy it over here. And then we'll take this 10, add it to this 0, and get 10. Now, we're going to operate on elements two to our left. So, the 10 is going to be paired with this will be copied to the left and 11 plus 10 is 21. Now for the final step each of these elements will be paired with the item to its left. So again we're going to have to drag down intermediate values we kept around. We're going to drag down the 7. Drag down this 5, drag down this 3, drag down this 1, and then for the final time apply your downsweep operator. 0, 0 plus 1 is 1, copy the 3, 3 and 3 make 6. Copy the 10, 10 and 5 makes 15. Copy the 21. 7 and 21 make 28. So now our output sequence is 0, 1, 3, 6, 10, 15, 21, 28, which is the exclusive sum scan of the input of the vector from 1 to 8. What we see for instance is that every output such as this 21 here, 21 is the sum of all the elements that came before. So 21 is the sum of 1 to 6 and so on. As a quiz I'd like you to compute the max scan of this input sequence 2 1 4 3 using this reduced down sweep method. So you're going to fill in these values from reduction and then fill in each of these values from a down sweep and when you finish you should have the nextscan, the nice exclusive scan of these elements given these inputs. So give it a shot.

## Blelloch Scan

Okay, let's see how we do this. The operator that we're going to do during the reduce is max. So, 2 max 1. Which one's bigger? 2. 4 max 3. Which one's bigger? the down sweep. Remember, we start with the identity element here. And then, we're going to apply the down sweep operator. In this case, we're doing a down sweep with a max. So, we will copy to the left, just as in the sum example. And we will apply max to each, each of the pairs of elements to create the output to its right. So, right now we're running our down sweep operator on this two and this zero, okay? We copy the zero over. And now we run max on 2 and 0 and get two, okay? Now, we'll drag down these intermediate values here. So, we'll copy this to the left, and then 2, max 0, gives us 2. We'll copy to the left, and then 2 max 4 gives us 4, and then we'll complete. Note that the output on any particular element is equal to the maximum value we've seen up to, but not including this element. The max value we see here is 4, which is the max value of each element we've seen up to this point.

## Analyzing The Blelloch Downsweep

So now, let's analyze this algorithm. We know the complexity of the first phase, this reduced phase, already. We know that it has log n steps, as a function of the input count n, and order of n operations. So actually, we have, if we have n elements, we're going to have n minus 1 additions. And the analysis for the downsweep phase, is exactly the same as the analysis for the reduce phase, because the communication pattern here is exactly the same, except mirror imaged as the communication pattern here. So, we know the downsweep also has log n steps and a linear amount of work. This is great. We've now reached the same work efficiency as the serial implementation. Note, however, that it has more steps than the Hillis and Steele formulation. Recall that the Hillis and Steele formulation had log n steps and the Blelloch formulation, this reduced downsweep method, has two log n steps. But the advantage of this Blelloch implementation is less work over all. Recall again that the Hillis and Steele scan had o of n log n work. Whereas, the Blelloch scan has linear amount of work with respect to the input. Now, which is faster? That's a good question. To some extent, it's dependent on the particular GPU, the problem size, the particular optimizations that you chose during the implementation. But to give you a little bit of intuition as to how you might think about the problem of choosing between two different formulations of the same algorithm, one of which has superior work efficiency, like Blelloch scan, one of which has superior step efficiency, like Hillis and Steele scan.

## Which Scan Should You Use

Consider the case where you have way more data, way more work than processors. In this case, your implementation speed is limited by the number of processors, because your processors are busy all the time. So, you'd prefer the implementation that is more work efficient. Firstly, let's say you have more processors than work. In this case, you have plenty of processors, so you're limited instead by the number of steps in the algorithm and you'd be willing to do more work to get fewer steps. You've got more than enough processors to handle the extra work. So, you'd pick the implementation that is more step efficient. There are a number of interesting parallel algorithms that have a pattern that looks like this. You start off with a lot of work, you narrow down to not very much work. And then, you widen back out to a lot of work. An example of this is the Blelloch scan that we just looked at. You start reducing with a large number of items, you get down to a small number of items. And then during down-sweep, you widen out to a large number of items again. In this case, an advance but good strategy is to start off with a work efficient algorithm when there's lots of work. Once you get down to not very much work, you switch to a step efficient algorithm, when you have more processors than work. And then, as you start to widen back out, you switch back to a work efficient algorithm when you have enough work to fill the machine again. So now, as a quiz, we're going to look at three different scenarios of things that we want to scan and given a particular hardware configuration. Your task is to figure out which algorithm is best suited, the serial algorithm, the step efficient Hillis and Steele algorithm, or the work efficient Blelloch algorithm. And the scenarios are, you have a 512 element vector and a machine like a GPU that has 512 processors, you have a 1 million element input vector in 512 processors, or a 128K element vector with one processor.

## Which Scan Should You Use

Okay. So, let's look at these scenarios and figure out what the right choice would be. So, for the first, we're looking at a small input vector and you've got plenty of processors. So, you're not really worried so much about work efficiency, you have more than enough processors to do the work that you need to do. Thus, you're probably going to be concerned with the step efficiency of the algorithm that you pick and the one with the greatest step efficiency, is Hillis and Steele. Now conversely, when you have an enormous amount of work to do, and not nearly enough processors to do it, you're going to be looking for the algorithm that is going to have the best work complexity. And so, for this, if you have parallel processors, you're absolutely going to want to run the work-efficient algorithm, Blelloch's algorithm. Now, if you only have one processor to work with, you're stuck with a serial algorithm no matter what.

## Histogram

The third algorithm we'll cover today is called a histogram. Now, what's a histogram? It's a means of categorizing and usually displaying data that varies with a particular measurement. In practice, to create a histogram, we take a measurement for each item in a population. And then, place those measurements into a set of bins that we define, where each bin covers a certain range of measurements. This is much easier to explain with an example. I'm recording this lecture at Udacity headquarters in Palo Alto, California. So, let's say, I walk out on the street in Palo Alto, and record the height of each of the next 100 adults, that happen to walk by. What that gets me is a list of 100 measurements. Here's what that list might look like. But it's difficult to draw out conclusions from such a list. Instead, what I'll do is create a histogram. I'm going to specify a number of bins. For instance, four bins, shorter than 150 centimeters, between 150 and 165 centimeters, between 165 and 180 centimeters, and taller than 180 centimeters. Then, I'll place each measurement into its bin. So, every time someone walks by, I place another tick mark into the bin. Here, for instance, we might end up with the measurements 12, 34, 38, 60 out of a 100 people. Often in statistics, you might want to plot this histogram, which might give you a curve that tells you something interesting about the data. Let me introduce you to one specific operation you might want to do on a histogram of data, because you'll be doing this in your assignment. For the histogram we just showed here, we might want to compute the answer to the question, if I'm this tall, how many people are shorter than me, for all bins in the histogram? This computation is called the cumulative distribution function on a histogram, it's also known as a definite integral. We want to compute the cumulative distribution function on this histogram. The input of the histogram is an array of values, for instance, 12, 34, 38, 16. In one word or two words, to be a little more precise, what is the operation we need to do to this input to get the cumulative distribution function?

## Histogram

Enter, scan, or more precisely, exclusive scan. Exclusive scan of this input is we can answer the question, if I'm 170 cm tall, how many people are definitely shorter than me? And, just look at the answer here. We know that 46 people are definitely shorter than me out of the 100 people that we interviewed in front of Udacity headquarters.

## Serial Implementation of Histogram

Our focus is instead to compute the histogram itself. Look forward in this week's assignment you'll compute the histogram of pixel intensity values in an image. Histograms are not a difficult problem if you think about them serially, but they get more interesting and parallel. So, let's first take a look at the serial histogram algorithm. So, here's our code here. We're going to have 2 for loops where we loop over all the histogram bins. So, the first phase, we initialize all the Bins to 0. The counts of the bins to zero. This is so simple we're not going to mention it again. Much more interesting is the second phase. We're going to loop through all the data measurements. For each data measurement we first determine to which bin it belongs. That's the compute bin call. And I'll show an example in a second. Then we fetch the current value of that bin, increment it, and store the new value back into the bin. For instance, in the previous example, let's say the input to compute bin is 175 centimeters. So we take a measurement. It's 175 centimeters. Compute bin then decides which of these bins is associated with 175 centimeters. In this case, compute bin would return this bin, because it stores all of the measurements between 165 and 180. So, we take 175. It's then associated with this bin, and now we increment the value of this bin. So now let's consider these 4 bins, and we're going to trace a program with the 4 measurements 155, 150, 175, and 170 centimeters. So, the first loop here will initialize each of these bins to 0 items, and then we'll walk through these four measurements using compute bin to decide which bin they're in, and incrementing the values in those bins. So, first we'll consider value to 1. Then we look at 150. Compute bin tell us its value is in this bin again and we'll increment its value to 2. Next we'll consider 175, which is associated with this bin. And now we've incremented its value. And finally 170 will allow us to store two. So a couple of quick questions to make sure you understand the histogram. Let's say we have n measurements and b bins. As a function of n and b, what is the maximum number of measurements we would possibly ever see per bin, and what is the average number of measurement you would see per bin?

## Serial Implementation of Histogram

Well the maximum value that you'd ever see per bin is if all n values went in the same bin, so the answer is n. In any event, you have n items spread over b bins so on average you'd have n over b items per bin.

## Implementing Histogram Using Atomics

The answer, 1,000 bins will probably be fastest. Since you're likely going to be limited by atomic performance, the histogram that offers the most bins is probably going to have the highest throughput with a 1,000 bin histogram. 1,000 threads can be updating the histogram at any given time. Whereas, with a 10 bin or a 100 bin histogram only 10 or 100 bins can actually update the histogram at any time.

## Implementing Histogram Using Atomics

Answer, no. Each local histogram is only accessed by one thread, serially. So there's no shared access. And thus, we don't need any atomics.

## Calculating Global Histogram Using Reduction

We can use reduction to combine these eight local histograms into a single global histogram. So, how do we do this? We simply reduce the first bin of each of the eight local histograms, that's an eight item reduction, and we can do that with a reduction trait. Then, we reduce the second bin. And then, the third bin. Recall that reduction requires a binary associative operator. Note that adding two histograms together, two entire histograms of three bins each, is both binary and associative. So, it might be more efficient to figure out how to add two histograms together during the reduction implementation rather than adding each bin separately.

## Sort Then Reduce By Key

The final algorithm we'll discuss, which we're going to call sort, then reduce by key, is one that I'll just sketch out, primarily because we haven't discussed the underlying algorithms yet. We'll get to them in the next lecture. However, it'll be a great mental exercise for you to think about how you might implement this approach. Let's think about our entries. Here we have eight entries, as key-value pairs. The key is the bin number, and in this example, we're going to have three bins, 0, 1 and 2. And the value is 1, which we're going to add to each bin. This algorithm has two steps, the first is a sort, the second is a reduce. So first, let's look at the sort. We're going to sort the key value pairs by key. So here's our keys. And notice it's the same key value pairs that we have up here, but now they're in ascending order by key. And the second step is to reduce the result by key. Naturally enough, this algorithm is called reduce by key. What I mean here is to add all the elements that have the same key together. So, we'd like to add these four ones up to create the reduction value of four. Or, more generally, to reduce these values using reduction operator. Note that all elements with the same key are contiguous, which is the key to making an efficient implementation. Now again, this is likely not something you can do with the algorithms that you know so far. But libraries of parallel primitives may have implementations for both of these operations. For instance, the thrust library, that accompanies acuda distribution, implements both sort and reduce, by key methods. Thinking about how you might implement these two operations is worth your time over the next week.

## What Operation Should We Use

Now, with one word we learned today, how can we combine these 8 local histograms, each of which summarizes the histogram for 16 items assigned to that thread, into one global histogram?

## What Operation Should We Use

And the answer is reduce. We're going to take these 8 local histograms, and we're going to reduce them into one global histogram. Let's see how that works.

## Final Thoughts on Histogram

Well, for the atomic only technique, we're going to have 256 threads, each of which has to increment one bin. So, we know we're going to need 256 increments. However, if we do all of our reduction within the block, with no atomics necessary, and only after we've reduced to a single 8-element histogram, only each one of those eight elements must update the global histogram with an atomic to be able to complete the histogram calculation. So, this is potentially a large savings in atomics, which is in turn going to help your performance.

## Description of Problem Set #3

We learned three algorithms today. We'll use all three in this week's homework, when we implement a procedure called tone mapping on an image. Generally, tone mapping is the process of converting one set of colors into another set of colors. The reason we want to do this is typically to take a raw image that accurately measures light intensity, and map it onto some physical medium, which might be a piece of the paper or a television, or your computer monitor. The problem is that a real world scene can have brightnesses that span many orders of magnitude and intensity. A piece of white paper in bright sunlight is literally a billion times brighter than on a dark night. And an output device, like a CRT, can only represent about two orders of magnitude between it's brightest and least bright pixel. So, how do we do this mapping here? If we do a bad job, then we'll have images that are either too washed out, overly bright, or instead, overly dark. Let's take a quick look at some pictures to see why this is important. Here's a couple lovely images of old Saint Paul's in Wellington, New Zealand, licensed by the photographer Dean Pemberton, under the Creative Commons license. Thanks, Dean. What we see here is two different exposures of the interior of this building. If you look at the darker image here, you'll see that most of it is quite dark and so it shows little detail. If we look at the brighter image, we see a lot more detail in some areas but bright parts of the image, like the stained glass window, are completely washed out. We can classify this in terms of the histograms. In the dark image, most of the pixels are at the low end of an intensity histogram. And in the light image, they're at the high end. What tone mapping does is it remaps those colors, so they do a better job spanning the entire color range. The intensity spectrum, thus, will look closer to this. In the homework, you'll implement a classic and excellent tone mapping algorithm developed by Greg Ward. This algorithm uses all three of the algorithms you've learned today, reduce scan and histogram to implement such a tone mapping. Good luck, and we'll see you in the next unit.

## Congratulations

Great job on unit three. Understanding these fundamental parallel primitives opens the door to mapping many interesting tasks to the GPU. And it also gives you the foundation for the next unit, where we're going to look at sort and applications of scan. You'll use today's topics in the problem set where you will implement a high dynamic range, HDR, imaging algorithm that you can use to better represent the wide range of intensities in real scenes.