cs344 ยป

Contents

## Intro to Unit 4

Good to see you back for unit 4! I'm sure you asked yourself, after the last unit, just what do we do with this scan primitive? Well, today we're going to find out. We're going to look at applications of scan, and then we're going to learn how to sort on the GPU. So let's get started.

## Scan Recap

Alright, welcome to Unit four. Today we're going to talk about fundamental GPU algorithms, part two. We're going to start with scan, which we talked about last week, and look at specific applications of scan, and then we're going to turn to looking at how to sort on a GPU. So in the last unit we learned about fundamental GPU algorithms and the last one, and the most important one for this lecture was scan. Scan is an efficient way to paralyze a certain class of parallel problems that otherwise seem difficult to paralyze. We usually refer to these as recurrent relationships. One of the use cases for scan is for problems that keep some sort of running total, such as a running sum. We can also use other binary associative operators like Max, Min and Most Logical operations. So, as a quiz, let's recall two of the important properties of scan. So we're looking at a scan of n elements. And so, in the best G P U scan implementations, what is the amount of work to do the scan and what is the number of steps to do the scan? Your choices are proportional to log n, proportional to n, proportional to n log n. And proportional to n squared. So we'd like you to check the box that corresponds to what the work complexity of the algorithm is and what the step complexity of the algorithm is.

## Scan Recap

So what we learned last week was that scan can be computed efficiently on n elements with run time proportion to n. And we also learned that it can be completed with a number of steps proportional to log n. This is something we can implement very efficiently on the GPU. And because we can implement it efficiently, what we'll find today is that it's the core of a significant interesting number of interesting parallel parameters. Aand we're going to start with one called compact.

## When To Use Compact

So let's take a quick quiz on when we want to be using compact. So, in general, compact is going to be most useful in scenarios where we compact away, where we delete, a small or large number of elements. And the computation we need to run on each surviving element is cheap or expensive.

## When To Use Compact

So the answers are, large, and expensive. Because we are saving the most computation by culling away a large number of elements, each of which would otherwise incur an expensive computation.

## Core Algorithm to Compact

So now we know how to compact in parallel but only if we figure out how to compute the scatter addresses efficiently and in parallel. So let's take a closer look at the computation that we need to do here. We're going to have a set predicates here, true, false, false, true, true, false True, false, and what we need to do is compute the scatter addresses that will result from this string of predicates. Since we don't care about the scatter addresses for the false values, we'll just put a dash for don't care. Our algorithm is going to generate something there, it just doesn't matter what it generates. Again, our goal is to input this predicate string. And output this string of addresses. And now we'll do a simple substitution. Whenever we see a true here, we'll substitute the number one. Whenever we see a false here, we'll substitute the number zero. And rather than don't cares I'm actually going to fill in some values here. So we're going to have the strings 0, 1, 1, 1, 2, 3, 3, 4. So here's our input, a bunch of trues and falses. We're translating it into 1s and 0s, and this is our output, our set of addresses here. And now perhaps the shortest quiz of the course. If this is the input, and this is the output. What is the operation we need to do to get from this input, to this output? Hint. Your answer is one word, of 4 letters.

## Core Algorithm to Compact

And the answer is, scan. Actually if you want to be a little bit more precise, it is exclusive sum scan and we already know how to do a scan efficiently. So if this predicate is our input, this black line here is the output of our scan and note that it corresponds precisely to the addresses we need, to scatter the input to the output.

## Steps to Compact

So let's summarize how to compact. Conceptually, there are four steps though an implementation might combine these into fewer. First, we're going to run a predicate on each element of the input. Then we'll create a scan-in array equal in size to the input. For each element in the input, if the predicate is true, put a one into the scan-in array. If it's false, put a zero. Then we exclusive sum scan the scan in array. The output of this is scatter addresses for compacted array. Now, for the final step, for each element of the input, if the predicate is true, then scatter the input element into the output array. At the address in scan out. Let's try a little quiz to get some intuition about the run time of compact. We're going to compare two compact operations and both are going to compact the list of numbers from 1 to 1 million. Compact operation A, let's say is divisible by 17, is only going to keep a very few of the input items. Whereas compact operation B is not divisible by 31 is going to keep most of the input items. So for each of the predicate, scan and scatter faces of the compact operation, will a run faster, b run faster or will they take the same amount of time?

## Steps to Compact

For the predicate operation, both will take the same amount of time. Both will run their predicate on each of 1 million items. Scan will also take the identical amount of time. Scan is going to run on an array of a million items. Where they're going to differ is scatter. A will run faster than B for scatter, because there's fewer memory operations. Since fewer elements need to be copied into the output array.

## Allocate

We can generalize compaction in the following way. Compact is an operation that allocates exactly one item in the output for each true input and zero items in the output for each false input. This is useful of course in compact as a really common operation. But more generally we can do a compact like operation where the number of output items can be computed dynamically for each input item. Let me offer a graphics example to show why this might be useful. The input to most real time computer graphics pipelines is a set of triangles. Some of these triangles might appear on the screen and some might not. And so we generally test to see if each triangle is visible or not before we pass it down the pipeline for later processing. This is a compact operation. We compact an input stream of triangles, some of which are visible, and some are not, into a smaller output stream where all triangles in the output stream are visible. And we know how to do this with compact. Now, here's the more complicated problem. What if a triangle intersects the boundary of the screen or window? For example, this triangle here, or This triangle here. In this case we apply an operation called clipping, where we cut the triangle with the boundary, and then triangulate the resulting shape. So for this triangle, for instance, we're going to convert it to this triangle right here. This triangle has left behind a trapezoid, and so we only want to deal with triangles. So instead we triangulate this trapezoid and send 2 triangles down the pipe. Now we have a situation where we input a set of triangles and each triangle can product 0, 1 or possibly many output triangles. And we want to output the resulting triangles in a dense array. Here's a good geometry quiz question for you. That has really nothing to do with the course material, but it's a pretty cool question nevertheless. What is the maximum number of triangles that can result from clipping a triangle with a rectangle?

## Allocate

So, here's the worst case. We have a green triangle clipped by a blue rectangle. How many sides is the resulting polygon? So, 1, 2, 3, 4, 5, 6, 7. And so how many triangles will this result in? 1, 2, 3, 4, and 5. So our answer here is 5.

## Segmented Scan

Now, in some applications you might have the need for many small scans instead of one large scan. We've said before that when you launch a kernel on a GPU, you want to have lots of work to do in that kernel. So it doesn't make a lot of sense to launch a scan kernel separately. On each one of these small scans. Instead, we're going to pack these scans together as segments inside one big array. Then have a special scan operator that scans each of the segments independently. Typically, to indicate where segments begin in an array, We use a second array that has 1's for segment heads and 0's for non-segment heads. Let's do an example. So if you'll recall, exclusive sum scan on an array simply takes the running sum of all elements that come before the current element. So the sum of all elements before eight, for instance, is 28. Now what's different about a segmented scan is that we have a normal array, but what we do is we're marking with these large lines here boundaries between these segments. So when we call a segmented exclusive sum scan on this array, what it's going to do is take separate scans of each of these 3 segments. So the result is the scan of 1, 2. The scan of 3, 4, 5. And the scan of 6, 7, 8. And so, the way that we're going to represent these segments is with segment heads. So that we have a second array, the same length as the input array, that marks where segments begin. So a segment begins here with the number 1, a segment begins with the number 3, and a segment begins with the number 6. So to make sure we're on the same page, we're going to take the same array we did last time where we computed an exclusive segmented sum scan. And instead, you're going to fill in the result of an inclusive segmented Sum scan on this input array. If you'll recall, an inclusive sum scan is going to be the sum of all elements that came before in the segment as well as the current element. So why don't you fill in these eight boxes?

## Segmented Scan

Okay, so this is pretty straightforward. The output at any particular element is the sum of that element and all elements that come before it in the current segment. But we're starting it over at every segment boundary. So 12, for instance, is the sum of 3, 4 and 5 but nothing from the previous segment. We're not going to go through how this is implemented, but it's a terrific exercise to work out on your own. Whether you do it with helicon steel scan or the Blaylock scan. It has the same complexity as the unsegmented scan, but runs a little slower because you have to keep track of the segments along the way. So it requires more memory traffic as well as more complex computation. We like to put a little more information about the implementation of segmented scan in the supplementary materials. But again, this is a really great problem for you to work out on your own.

## SpMv

You might be wondering why this might be useful. Never fear. We'll have 2 good examples in this unit. And cover the first 1 here, and the second a little later. So let's consider the problem of sparse matrix, dense vector multiplication. Which we abbreviate Spmv. Sparse matrix vector. Many interesting matrices have lots of zeroes. So we call those matrices sparse. And we want to find a data structure to represent them. It squeezes out the zeroes. >> As well as to find a more efficient way to multiply them by vectors. Afterall, if there's lots of zeroes, we'll do lots of multiplications that have no effect. So sparse matrices are incredibly common in computational problems in mini domains. For instance Pagerank is the world's largest matrix computation. Pagerank is based on a giant sparse matrix that represents every web page in the world. How big is this matrix? Well if there n web pages in the world, this matrix is n by n and each entry at row R in column C is non-zero only if web page R Links to web page C. So the page rate computation on this sparse matrix is how Google computes the importance of web pages. Let's briefly refresh how we multiply a matrix by a vector. Students who already know this are welcome to skip to the next segment. So we're going to multiply this three by three matrix here by this three by one matrix here. So how do we do this? For each output in the output vector, we multiply the row of the input matrix by the column of the input vector. So we do this in a pair-wise way. A times x plus b times y. Plus c times z and that will create the entry here a x plus b y plus c z. For each additional row in the output matrix we simply do another one of these dot products. So, for instance, to compute this value here we dot product this vector with again, our input vector, to get g times x plus h times y plus i times z. So let's do a matrix times a vector as a quiz. We're going to have this sample 3 by 3 matrix times this sample 3 by 3 vector. And I'd like you to write the vector answer in these 3 blanks here.

## SpMv

Okay, how do we go about doing this? Well first, we're going to dot this row times this column, so that's 1 times 1, plus 0 times 2, plus 2 times 3. Okay then we're going to dot this row, with this column, so that's 2 times 1 plus 1 times 2 Plus 0 times 3. And then this row with this vector, 0 times 1, plus 1 times 2, plus 3 times 3. So let's sum these up, 1 plus 6 gives us 7, 2 plus 2 gives us 4, 2 plus 9 gives us 11, so our answers are 7, 4, 11.

## Sparse Matrices

So the traditional way to represent a sparse matrix is what we call compressed sparse row. So here's a small matrix of 9 elements. 3 of them are 0s, and so we want some sort of representation that's going to squeeze out those 0s and only represent the values that are non-0. And it seems a little silly on a small matrix like this. But trust me. As you get to very large matrices with lots and lots of zeroes. This representation is going to save you a lot of space, and save you a lot of computation. So, in CSR format, we require 3 vectors that, together, are going to represent this sparse matrix. So the first one is what we call the value vector. And it is simply going to represent all the non zero data. So, here, we're simply going to list all the data that are not zero as one long array. The second array that we need is recording which column each of these data came from. So, for instance, a is in column 0. B is in column 2. C is in column 0, and so on. And finally, we have to indicate at which element each one of these three rows begin. So, the 3 rows begin with values A, and values C and value F. So what we're going to write in the row pointer is that value A is at index 0. Value C is at index 2. And value F is at index 5. And now we can reconstruct this sparse matrix with these three arrays.

## Generate CSR

So let's take the matrix that we saw earlier, and I'd like you to generate the CSR representation of this particular matrix, please fill in your answers in the value index and row pointer arrays.

## Generate CSR

So if you recall the value array, this is simply the non 0 elements of this input array, which we'll just fill in here. The index array here, is which column each of these array elements are in. So, the 1 is in column 0, the 2 is in column 2, the 2 here is in column 0 The 1 here is in column 1. This one's also in column 1. And this final 3 is in column 2. Finally, what are the indices of each one of these elements that actually begins a row. So, we should look for the index of this one, the index of this 2. And the index of this 1. Because they are what begin each one of these rows. That's this element, this element, and this element. This element is the zeroth element. This element is the second element, and this element here is the fourth element. So, this is the CSR representation of this sparse matrix here.

## Sort

Okay, so the second half of this unit concentrates on sort. Sorting is a fundamental operation. And understanding how we might approach this problem in the parallel world gives a lot of insight as to what works well and what doesn't work well on GPUs. There's a lot of neat algorithms in sorting. And I hope the rest of the unit gives you some new ideas about how to approach the design of efficient parallel algorithms. Now, sort is a challenging problem on GPUs for several reasons. The first is that most sort algorithms are serial algorithms, or at least, usually expressed in a serial fashion, particularly those you might have learned in an algorithm class. So all those nice algorithms that you learned in school are not necessarily applicable here and we'll see this in a bit. The second is that for performance reasons, we prefer to look at algorithms. Parallel algorithms, that coalesce memory axises, and have little branch divergents among threads, and particularly that are able to keep the hardware busy by keeping lots of threads busy at the same time. So the sort of algorithms that you might have learned in an algorithm course tend to be moving around little bits of memory at a time and they have very branchy code. And they're not often very parallel. So, we'd like to take a look at algorithms that have the other characteristics, that can keep the hardware busy. That do limit branch divergence, that do prefer coalesced memory accesses. And so, what we're going to do is look at some classic sorting algorithms and discuss how they might map onto a parallel world. We'll start with one of the simplest algorithms and one that maps nicely to a parallel implementation. Odd-even sort, also known as brick sort. If you're familiar with the serial algorithm called bubble sort. This is the parallel version of bubble sort. So, we're going to start by putting all of our elements to sort in a row. And then we're going to mark all the even elements as black and all the odd elements as red. In an odd-even sort, in the first phase, each red element looks down the line to the left, toward the left end and pairs up with the black elements it's facing. Now if that pair is out of order, they swap places and colors as well. Otherwise, they stay in the same places. Now, every red element turns around, faces to its right, and pairs up with the black element in the other direction. Again, they swap if they're out of order. So we continue until the sequence is sorted. So let's try an example. So just to show this with some real numbers, we'll try a sample with these five numbers. We start by pairing them up, starting at the left, and we swap each pair if they're out of order. Then we pair them with the opposite polarity and continue the process. We return to pairing them the way we did in the first step, and continue to pair them one way then the other way, then one way then the other way, until we finally conclude with a sorted sequence. So it's very important that we understand how the measure, the step and work complexity of these algorithms, because that's often the dominant factor in their run time. So, for this algorithm, what is the step complexity of this algorithm? Order of amount of work that we need to do with the same choices? Please check your choice.

## Sort

We'll look at step complexity first. The worst case is that an element has to move all the way from one side of the array to the other side of the array. So the example here is the number five which starts at the far left and then has to travel all the way to the right. So how quickly does one item move? Its maximum speed is moving 1 position per step. Since the best they can do is swap with its neighbor and move only 1 position away. So it takes on the order of n steps to get from 1 side to the other and how much work does it do per step. Well, on every step if we've n items then we're going to do n over 2 comparisons. So in total we do order n steps and order n per step, totaling order of n squared. Overall, this is not a particularly efficient sort. We'd like to be able to do better than order of n squared steps. That being said, it's kind of a neat parallel algorithm because we can say that, within a step, each one of these comparisons can proceed completely in parallel. So at least, even though this isn't the most efficient algorithm, it's at least one that exploits a lot of parallelism in it's underlying structure.

## Bad To Have One Big Merge

So in the intermediate stages of the merge, we have all the SMs working on independent merges, one SM per merge. At some point though, when we get to the very top of the merge tree, we only have a very small number of merges left or eventually, just one. We have a single task, but it's a very big task. Now, why is this bad? Why is it problematic for efficiency that we only have one merging task remaining. So why is this bad? Why is it problematic for efficiency that we have only one merging task remaining? Well, could be a number of things. We could have lots of idle threads per SM. We could just have lots of SMs that are idle. Or we might suffer from very high branch divergence. So please check the one you think is the best answer.

## Bad To Have One Big Merge

The best answer is that we have lots of SMs that are idle. We know how to keep threads busy within one SM as we did in our, in our immediate merge but we don't know how to keep lots of SM's busy if we have only one task to do, and it's not at all efficient to have most of our SM's sitting idle.

## Sorting Networks

The answer. And this is pretty particular to sorting networks but with a sorting network all will take the same amount of time. It's an important consequence of an oblivious sorting algorithm. Any input will require the same amount of time to sort.

## Sorting Networks Part 2

So how do we implement this on the GPU? Well, its very simple. We're going to assign one thread per input element. In this case, eight input elements. For each comparison, each thread simply needs to know if its keeping the smaller or larger value. So we're actually doing each comparison twice. Once on either side of the comparison. And the only difference is that one side of the comparison will keep the smaller element and one side will keep the larger element. Now, we do need to synchronize after each comparison. So after we're done with this particular set of comparisons, we synchronize and then begin the next stage. And then synchronize and begin the next stage and so on. So just a couple notes on the computation itself. One, if the input set is small enough to fit into shared memory, then a sorting network is actually a very efficient way to sort that input set. So in fact, many sorting algorithms that start by sorting small blocks of data, like merge sort, use a sorting network to do their sorts within a block. And it helps that a bitonic sort is probably the simplest sort to actually implement. Two, bitonic sort is not the only kind of sorting network. The odd-even merge sort is a different kind of sorting network that generally runs a little faster but it's a little more complicated to explain or program but the basic ideas are the same. So as a quiz, what I'd like you to do is fill in each of the black boxes with the correct numbers to make sure that these four input elements are sorted by an odd-even merge sort.

## Sorting Networks Part 2

So we're going to go about this the same way that we did on the other sort, every time we see a pair of lines that are connected by a green line, we will compare the two elements and put the smaller one at the top and the larger one at the bottom. So we'll compare 4 and 2 and switch their order. We'll compare 1 and 3, and they're already in the right order. Now we're comparing 2 and 1. So 2 will go here and 1 will go here. Now we'll compare 4 and 3 and we say they're in the wrong order as well and note that 2 and 3 have one more comparison to go. But note that we come out, again, with a sorted sequence.

That algorithm is compact.

What we're doing, is compacting the items that have a zero as the least significant bit. So if you recall, when we compact, what we need to do, is have a predicate that are going to tell us, which of these input elements to keep. So, if we call each of these elements an unsigned integer, and that unsigned integer is named i, what is the predicate, on each one of these integers that is going to allow us to return true, only for this set of integers. So write this in this box, in the C programming language please.

So the answer here is that what we're going to want to do is look at the least significant bit. That mans we'll take the bit representaton of i and end it with the value 1, which is only going to leave us with the value of the least significant bit. And then we'll going to test whether that bit is 0 or 1. And we only want to keep the ones that are 0. We only want this to return true if that least significant bit is 0.

So what we're really doing here, is simply running a scan over the input. So, the input to the scan is a 1, for each 0 bit, and a 0 for each 1 bit, and that will give us the scatter addresses for the zero half of the split. So we're going to scatter this element to an output zero, this element to 1, this element to 2, and this element to 3. Notice that the last element of the scan, with a little bit of extra math, because it ends with a zero element here, tells us how many zero bits there are total in the input, in this case there are four, 1, 2, scatter addresses for the other half of the split, for the one bits. There are a number of interesting ways to make this faster, and the most common one, is to reduce the total number of passes by taking multiple bits per pass. 4 Bits per pass and a resulting 16 ways split, instead of our 2 ways split here, appears to be fairly common. Overall, rating sort is a fairly brute force algorithm, but it's both simple and fast, recent GPUs can run rate exort on 32 bit keys, at a rate over a billion keys sorted per second.

## Quick Sort

The final sort algorithm we'll discuss is one of the most efficient ones for serial processors, this is Quick sort. It is a more complex algorithm on GPUs because of the control complexity of the algorithm, so let's recap what the Quick sort algorithm does. First, it chooses a pivot element, one particular element from within its input, then it compares all of the elements in its input to the pivot and it uses that comparison to divide the input into 3 sub-arrays. Those that are less than the pivot, those that are equal to the pivot and those that are greater than the pivot, and then it calles Quick sort on each of these sub-arrays and continues until the entire input is sorted. As an example, let's look at a particular array, and choose that the pivot is equal to 3. So what we're going to do here, is compare each one of these elements to the pivot, and we're going to decide if they're equal to, greater than or less than the pivot. Then we'll divide it into 3 arrays, those that are less than the pivot, those that are equal to the pivot, and those that are greater than the pivot, and we'll call Quick sort on each of these arrays, and do the same thing. So when we have 2 and 1, let's say we choose 2 as the pivot, then we'll divide this into smaller than the pivot and equal to the pivot. This doesn't need to be recursed because it only has a single element. And let's say we choose 4 as the pivot here, this is greater than the pivot, this is equal to the pivot, and now we have a completely sorted array. So this is a very challenging algorithm to implement on the GPU. The other algorithms that we've looked at are pretty simple to describe and they don't recurse. This one is more complex, and until very recently, GPUs didn't support recursion at all. Indeed, the GPUs we use in this class, don't support recursion currently. So how can we take this seemingly recursive only algorithm and map it to the GPU using the primitives that we've learned? So I'm bringing up this example for two reasons. The first is that you have already learned all the pieces that you need to implement Quick sort on the GPU. And the second, is to motivate the benefits of new GPU capabilities that do, natively support recursion. So we can implement Quick sort without recursion by using the idea of segments. Recall that segmented operations like scans, only operate within a single segment, operations on one segment don't affect other segments. That sounds a little bit like recursion, and in fact, it maps in a similar way. For Quick sort, when we begin to process the initial array, we're going to use distributes, maps, and compacts to eventually divide it into three segments. We can used segmented scans to do all the necessary operations, that we need to make this work, including distributing a pivot across a segment, for comparisons, and splitting a segment, which is similar to the way that we split a on a particular bit in Ridic sort. Quick Sort is interesting, because it shows you how useful segments can be, that they can let you mirror the same approach you use in recursion, without actually using recursion. But, and I gotta be perfectly honest here, it's really a challenge to implement and equally challenging to make it fast. So it's very exciting that the newest in video GPUs support a more flexible programming model, where kernels can actually call, can launch other kernels, which makes Quick sorts recursive calls much more straightforward. We're not using this new capability in the class. The Amazon instances where our code is running, don't have these new GPUs that support this capability at this time. But it's really exciting, and so when we get to unit 7, we'll see exactly what it looks like and how it applies to Quick sort.

## Key Value Sort

Final note on sort. All the algorithms we discussed were what we call key sorts, where we only sort a set of keys. Many interesting applications however require that you sort not just a key, but a value associated with that key. For instance in unit two, Dave described sorting a set of NBA players by height. In this case, the key might be the height, and the value would be the player's name, the player's team, the player's history and so on. Dealing with values as well as keys is pretty straight forward. First, if you have a value that's large in terms of bytes, it's usually smarter to make the value instead, a pointer to the actual data, and then just sort the value and it's pointer. For instance, if you had a large NBA player data structure, don't sort the whole data structure as the value. Instead, just use a pointer to the data structure as the value instead. Second, our sort algorithms generally move around data keys. Whether that be a sorting network, or a radix sort, or a merge. Instead of just moving the key If you have to deal with a key value sort, move the key and the value together as one unit. Finally, since today's GPUs can often natively handle to input 64-bit values instead.

## Summary

So if you can sort and scan, and more importantly understand the kinds of algorithms that we study in using sort and scan and why they're good and bad for the GPU, you are in fine shape. The ideas you've learned will carry over to many other algorithms. You'll be using these techniques in the assignment this week where you'll do automatic red-eye removal using template matching. This algorithm relies on sorting a vector by key which indirectly relies on scan. So go apply your new found knowledge to the assignment.

## Congratulations

Congratulations on finishing unit 4. We've done a lot in this unit, so well done for getting through this. Now we're going to talk to Ian Buck from Nvidia about Cuda's past, present and future. And then in the assignment, what we're going to do is learn how to remove red eyes from photographs, so that your photographs can look just a little bit better.

## Problem Set 4

In problem set number four you will implementing a parallel algorithm for removing the red eye effect that commonly occurs in pictures of human faces. Here is an example of the effect that we are talking about. You will be implementing a simple algorithm for red eye removal. That factors nicely into 3 different parallel operations. The first operation is a stencil computation over the image. The second is a sort. And the third is map. You have been exposed to map and stencil operations in the previous homework. So, in this homework, you will focus on sorting. We start our red eye removal algorithm by computing a score for each pixel that estimates the likelihood of that pixel belonging to a red eye. And here's what the score looks like. The score is known as normalized cross correlation. And it is expressed naturally as a stencil operation. We have computed this normalized cross-correlation square for you. But if you're interested we will provide extra details about Normalized cross-correlation in the instructor comments. Our next step is to sort all of the pixels in the image according to their score. Note that the pixels with the high scores are the pixels that are most likely belong to a red-eye, as you can see here. This sorting step is what you will be implementing in problem set number 4. In case, if you are interested, we will discuss several different types of optimization strategies, for sort in a shutter comments. Our final step is to reduce the redness. Of the high scoring pixels. This computation is expressed naturally as a map operation. Once again we've performed the step for you so you can concentrate your efforts on sorting. That is all, good luck and I hope you will enjoy problem set number 4. And like the previous problem sets I want to thank Eric and Mike for writing the code and the script to this problem set. So thank you.