cs344 ยป

Contents

- 1 Introduction
- 2 Adding to Your GPU Computing Toolbox
- 3 All Pairs N-Body
- 4 All Pairs N-Body
- 5 How To Implement Dense N-Body as Simply As Possible
- 6 How To Implement Dense N-Body as Simply As Possible
- 7 Dividing N by N Matrix into Tiles
- 8 Dividing N by N Matrix into Tiles
- 9 Downside to Using Tiles
- 10 Downside to Using Tiles
- 11 Using on P Threads
- 12 Why Is This a Good Idea
- 13 Why Is This a Good Idea
- 14 How Big Should We Make a Tile
- 15 How Big Should We Make a Tile
- 16 What If P Is Too Big Or Small
- 17 What If P Is Too Big Or Small
- 18 Determining the Size of the Tiles
- 19 All Pairs N-Body Summary
- 20 SpMv
- 21 Role of thread in SpMv
- 22 Role of thread in SpMv
- 23 Thread Per Row Approach
- 24 Performance Analysis
- 25 Performance Analysis
- 26 Threads Per Element Approach
- 27 Performance analysis
- 28 Performance analysis
- 29 Thread Per Row Vs Thread Per Element
- 30 Thread Per Row Vs Thread Per Element
- 31 Hybrid Approach
- 32 A Quiz on the Hybrid Approach
- 33 A Quiz on the Hybrid Approach
- 34 SpMv - The Big Picture
- 35 Traversal of Graphs
- 36 Calculating Distance From the Root
- 37 Minimum And Maximum Possible Depth From Rooth
- 38 Minimum And Maximum Possible Depth From Rooth
- 39 BFS try #1
- 40 BFS Try #1 - An example
- 41 What is the work complexity
- 42 What is the work complexity
- 43 Lets look at the BFS Code Part1
- 44 Lets look at the BFS Code Part2
- 45 Lets look at the BFS Code Part3
- 46 Do We Have to Worry About Race Condition
- 47 BFS try #1 summary
- 48 Problem Set 6

We're getting pretty close to the end of this class. So, congratulations on making it this far. Now, one of the most challenging skills in all of GPU Computing is trying to paralyze a problem that you've never seen before. So, what we're going to do today is look at several examples of interesting applications where we've done that. Hopefully this is going to give you a few more tools in your GPU Computing tool box. So let's get started.

In previous units, you've learned about the building blocks of GPU computing, the fundamentals of parallelism, mechanics of writing programs, and techniques for making them fast. However, one of the challenging things about learning a new programming model is what happens when you run into a kind of problem you've never seen before. So what we're going to do in this unit is look at a variety of interesting problems to solve, and work through how we might map these problems efficiently onto the GPU. There's no magic pill that will allow you to instantly think in parallel. But what I've found helpful for me, is to see a variety of problems and how they map to parallel algorithms and data structures. It helps me have a little broader tool box of techniques and lessons that, that I can use in attacking other problems that I may not have seen before, and I hope the same will be true for you.

Let's take a look at a commonly used kernel and scientific computation. This is called N-body computation. In an N-body computation you have N objects, each of which exerts a force on each other object. You'll use N-body computations to say, model the gravitational interaction between stars and a galaxy or the electrostatic forces between a number of charged particles. It's the way we compute molecular dynamics, which in turn can model protein folding and misfolding. And is an important component of our understanding of disease and drug discovery. And body techniques are used in folding at home, for instance. The specific formulation of N-body computation that we'll discuss today is called all-pairs N-body. And I'm basing this discussion on an excellent article by Lars Nyland, Mark Harris and Jan Prins in 2007, and GPU Gems 3. All-pairs N-Body is the simplest N-body computation. It's known as a brute-force method and it involves separately calculating the force between each pair of elements, and then, adding up the resulting forces on each element. A simulation using N-body, we usually have two steps. First, compute the forces on each element, then, move each element a little bit based on its force and then repeat. So we first want to understand the complexity of this approach. So we have N objects, each of which computes the forces on it from each other object. For N objects, what is the work complexity of computing all-pairs N-body? Is the amount of work proportional to N, N log N, N squared, or N cubed?

All-Pairs N-Body is an N squared algorithm. Each object must compute its interaction with other object. So each of N objects has to compute N minus 1 forces. N objects times N minus 1 force computations per object yields work proportional to N squared. Now, there's a way you can approximate N-Body computations with high accuracy with superior work efficiency. There are tree based algorithms, for example. One's called Barnes Hut, that for example. One's called Barnes Hut, that have N log N complexity, and a method called fast multi-pole that's actually linear order of N complexity. They basically work by lumping distant bodies together and treating them as a single body during the calculation. These are neat algorithms, fast multipole was actually named one of the top ten algorithms of the 20th century. But they're also considerably more complex. All pairs, on the other hand, is very simple. And often, the way that production N-Body algorithms compute forces among close together objects. And with it, we're going to show how to use the memory hierarchy in an efficient way to make this computation as fast as possible.

So, let's start by looking at how we might implement this as simply as possible. And we're going to begin by thinking about this computation as working on an n by n matrix. The n bodies on the x-axis here are the source of each force, and the n bodies on the y-axis here or the destination of each force. The source and destination objects are the same objects, of course. This is just how we're organizing the computation. So, the matrix element at the position x comma y, represents the force that source element x exerts on destination element y. And what we want to do is calculate the total amount of force from all source objects on each destination object. This formulation of the problem exhibits a lot of parallelism. Each of the n squared forces can be computed in parallel. We just load the parameters of the source and destination thread to do one particular computation, and compute the force, say, using gravitational attraction. This requires specifying parameters for each object. For gravity, this would be, say the mass and the position, x, y, z. Then each object, say object y, must add up all n forces that act on it. Then, we have to calculate n reductions of n forces also in parallel. Since we want to run this computation on tens to hundreds of thousands, maybe even millions of elements, we can see that we have a lot of parallelism to exploit. In this case, the computation would be very straightforward. We will now see that if we run it in this way, we would be making poor use of our memory bandwidth. So, if we assume that all of our n squared force computations running in parallel must go to global memory, how many times must we fetch each element as a function of n?

We know that there are going to be N source elements, we're going to have the same N destination elements. So, for any particular source element, say x here, we must fetch that source element for each individual computation for every destination element. And for a particular destination element, we know that we have to fetch the parameters for each individual source element. So overall, we have to fetch each element N times as the source and N times as the destination, or if we assume that an object does not exert any force on itself, then we fetch each element N minus 1 for each of the source or destination. The answer should be either 2N or 2N minus 2 times. This is quite expensive. We'd really like to organize the computation in such a way that we consume less global memory bandwidth. And we're going to try to do this by storing and reusing some of these memory fetches. And in particular, we'd like to be able to take advantage of the GPU's usability to share data between neighboring threads.

So let's take another look at this n by n matrix. What we're going to do, is to divide it into what we're going to call tiles, they're square chunks of the matrix. What we'll do is calculate each tile in parallel. What we'll do is calculate each tile in parallel, which would then calculate all total forces on the elements within that tile. Then we'll add the resulting forces across horizontal tiles, to get the total force for each destination object. We're going to say a tile is P by P elements, so we'll make n squared over P squared tiles overall. We're not going to talk about how big a tile is yet, we'll cover that later. But for now, when we run this particular tile, we're going to run it within a CUDA thread block, and we can potentially share data between threads in the block. Recall that the x-axis here is the destination. We'd really like to only fetch each object once per tile and reuse it among all the threads in the tile, instead of fetching each item P times per tile. Thus instead of doing 2P squared memory fetches per tile, P times P and two fetches for each item. We will instead do, how many fetches.

The answer is 2P. We'd like to fetch these P items once each, we'd like to fetch these P items once each, and never have to fetch anything again.

Let's start by thinking about how we'll put tiles together after we compute them. Each individual tile will compute partial results for these P destination threads. The partial results will reflect the contributions of the forces produced by these P objects. Then, for each of these tiles, we'll add the partial results from all the tiles in a row to get the complete set of forces that act on the P threads in this band right here. So, let's say, we run this tile with P squared threads. That's going to allow us to compute all P squared individual forces at the same time. This is good, very parallel. But it's got a couple of not too major, but still significant, downsides. First is that we would have to share both source and destination parameters for each object across multiple threads. Remember, we're only fetching each item once. How many threads must get the information of parameters for each source object and for each destination object?

So remember that we only fetch each source object once. But we have to distribute that information across all P threads. And for each destination object. Again, we're going to fetch that information once. But then we have to distribute that information across all P source threads. It would be nicer if we didn't have to do this. Because either we can store that information in shared memory once. But then, we have to read out of shared memory, which is slower or we have to replicate that information in each thread's local storage which wastes space. And limits the number of threads we can have resident in a single thread block. The second downside, is that we have to communicate between threads to sum up the forces. So we've individually computed each separate force on separate threads. And then we have to communicate between all these threads to sum up the force for this particular tile. We'd prefer not to have to do that either. So instead, we'll look at another way to compute a tile.

Instead of having P squared threads to compute P squared force interactions and then summing them up between threads, we'll only use Pthreads total. Each of these Pthreads will then be responsible for P interactions. Another way to think about it is that one thread, this horizontal thread here, is responsible for applying the force of all P source objects to one destination object. So the code for this is really simple and we're showing it here. Our kernel code simply loops n over p times over this particular routine. It accumulates P by P interactions onto Pthreads in each iteration. So what's happening in this code? Well, the input to the kernel is, my parameters, those associated with my destination object. Those would be say, your position and your mass or your charge. And the accumulated force that has acted on my object so far. And the output will be the accumulated force after I add in P more interactions, one for each of the source objects with which I'm interacting. So, we assume that we store parameters for the P source objects in shared memory, sourceParams here. And then we just loop P times over those P objects. We compute the interaction, and this might be computation of gravitational force, it might be electrostatic force, and the Params will differ depending on what kind of force it is. And then, we'll compute the interaction and add resulting force into the force accumulator variable and return it.

Why is this transformation a good idea? We still have to share the information for a source object up here across all these threads. That doesn't change from our previous implementation. But now, we don't have to share any destination object information at all because one thread is completely responsible for one destination object. So, we can load that objects confirmation directly into that thread's local storage. And we don't have to communicate between threads to sum up individual forces. So, it used to be we had P different threads here and we had to communicate between those P threads to add up all these forces. Now, because we only have one thread that they're responsible for all these forces, we don't have to do any communication between threads at all. That thread can just accumulate all of the partial results in its local storage. So, the result is a faster implementation overall. Dave discussed this technique generically in the last unit, Unit 5, and he used the term privatization. So, in making this transformation, is the amount of parallelism now increased, decreased, or kept constant from the previous implementation?

This transformation has actually reduced the amount of parallelism. Previously, we had p squared threads per tile and p squared way parallelism within a tile. Now, we only have p threads per tile, so we have less parallelism than we did before. Well, does that matter? Well, if we have a big, big n body problem to solve, it probably doesn't matter for a big, big problem. There's probably so much parallel work, that will easily fill the machine, even with our new version. But if we have a more modest-sized problem, the lesser amount of parallelism might mean this method isn't the best because we might reduce the amount of parallelism to the point where we can't keep the whole BGU busy. The big picture by what we've done here is that we've chosen an implementation that has fewer threads and does more work per thread. You'll often see this technique when you can convert communication between threads, which is correspondingly more expensive to communication within a thread, because communication between thread local storage is faster than communication between threads within shared memory.

Finally, how big should we make a tile? Let's think about this. If we choose a larger P, does that mean more or less total main memory bandwidth?

The answer here is less. We load each source object's parameters once per tile. Larger tiles means fewer tiles, and thus fewer loads of each source object's parameters.

Now, if P is too big or small, then we can't use shared memory to share a source objects parameter across all destination object threads.

The answer, too big. If P has more threads than a thread block is allowed to have, then we can't use shared memory to share data among all P threads, because we have to distribute that tile across multiple thread blocks. Another consideration is making sure that we have at least as many thread blocks as SMs or else SMs will set IO.

In this problem you need to divide your work up into chunks, in this case tiles. We have a continuum between tiny tiles, lots of them, and fewer tiles where each tile is sized to the maximum that can fit in a single thread block. And in this particular problem bigger tiles, means less memory bandwidth, this is good. Generally, then you want to make your tiles as big as can fit into a single thread block, because that minimizes overall memory bandwidth. But not the following to caveats, one you need to have at least as many thread blocks as you have SMs in your GPU, because otherwise you'll have SMs sitting because otherwise you'll have SMs sitting idle. Definitely, you want to make sure you fill the machine with enough work to keep all the SMs busy, even if you have to move a little bit this way on the continuum and size your tiles just a little bit smaller. Two, if you're sitting at the right end of this continuum, it's best for overall memory bandwidth. But often it turns out that you would actually prefer to just maybe one tic to the left. This allows a small number, say, two blocks to both be resident at a time. And that potentially gives better latency hiding characteristics because you have more warps at the same time from slightly different pieces of the program. It's certainly something that you would want to tune carefully if you needed the fastest possible implementation.

So let's summarize the big ideas of all-pairs N-body include. This might not be the most efficient algorithm. It's not. But it's both simple and high performance. So, if you implement this, you'll find that you're actually able to sustain a significant fraction of the peak theoretical value of the GPU. The real advantage here is that we have a very large amount of parallelism in this problem that we're able to exploit well in our implementation. The interesting part of this implementation is how to structure the ample parallelism that is available in this problem. And the big idea we discussed is the tradeoff between exploiting the maximum amount of parallelism and doing more work per thread. And there's many problems where increasing the amount of work per thread and reducing the amount of parallelism is really the right thing to do, because of the speed of communicated results within a single thread is faster than communicating between threads.

In Unit 4, we looked at Sparse Matrix Vector Multiply, SPMV, SPMV. Noting that this is one of the most important operations in all of computing. What we're going to look at today is how to implement this efficiently. So firs,t a quick recap on what SPMV is. We wish to multiply a matrix by a vector. If this matrix is dense, where every entry in the entire matrix is represented in the matrix data structure, GPUs can get excellent performance. But many interesting matrices are what we call sparse. Many, if not most of the entries in these matrices are zero. So, we've prefer to represent only the non-zeros in this matrix which gives us two advantages. One, less storage since we don't have to represent a bunch of zeros, and two less computation, since we're not doing a bunch of multiplications and additions of zeros. If you need a refresher on how to multiply a sparse matrix by a dense factor, please take a short trip back to Unit 4. Because what we're going to talk about today is a strategy to build a more efficient SPMV. The broad question that we want to answer in thinking about how to map this problem efficiently is, what is the role of a thread? Let's look at this sparse matrix to answer this question. Here, we're using x's to represent non-zeros and blanks to represent zeros. We will consider two alternatives, the first is that we will assign on thread per row of the sparse matrix. Here, a thread is responsible for computing one entry in the output vector and we do this by taking the dot product of this row with this column. Second, we will assign one thread per element in the sparse matrix. For instance, this element here. Here, a thread is responsible for doing one multiplication, in this case, times it's corresponding element in the input vector, what we call a partial product. And then, cooperating with other threads, this thread for instance, in summing up the partial products. So, which is better? The answer is, it depends. And we'll look at how to do both of them.

First though, a little quiz to make sure we think about the implications of the two approaches, which we're calling again, thread per row and thread per element. It's not immediately clear which is better. So, for the same matrix, which approach will answer these three questions for, please choose either thread per row or thread per element. And the three questions are, which approach will launch more threads? Which approach will require communication between threads? And which approach will do more work per thread?

So, which approach will launch more thread, thread per row, thread per element. The answer is thread per element. Thread per element will launch a thread for every x in this matrix. But thread per row, on the other hand, will only launch a thread for every row in this matrix. So, thread per element will certainly have at least as many, if not many more threads than thread per row. Which approach will require communication between threads? Well, that's also going to be thread per element. What's going to happen there is we compute a partial product for each element in a particular row and then we need to add up those partial products across those elements to create the final value. And if you remember we use the segmented scan operation to be able to do that in Unit 4. The final question, which approach will do more work per thread? Well that's going to be thread per row. This row has to do all the partial products and add them up, whereas, thread per element only does one partial product per element and then combines the work across multiple threads to be able to get a final result.

So, we'll start with the thread per row approach. Let's start with the data structure. We're going to use the CSR, the compressed sparse row format here, just as we did in Unit 4. Recall that value contains the non-zero elements in the matrix, index gives the column of each entry, and row pointer contains the index of the beginning of each row. So, each blue dot here corresponds to the element that begins each row, which is element 0, 2, 3, and 5. So, let's just walk through some code. Note this code, like many spin v routines, calculates y plus equals mx. So, it multiplies m by vector x and then adds it to the element y and resets the result as y. It adds the matrix vector product to the destination vector y. We're going to start by this line here, computing the global index for each thread. The thread with this index i will calculate the result for row i. Next, we're going to have an if statement, if row less than the number of rows. Why do we have this if statement? We're going to launch many blocks of many threads and it might be that the number of rows is not a perfect multiple of blocks and threads. This if statement is a common one in CUDA programs. Inside the if is the meat of the routine. Recall that row pointer contains the indices of the starts of each row. So, for instance, the value 3 here says that the third element d here is the beginning of a particular row that then contains d and e. So, we're going to start with d. We're going to start at the beginning of a row and we're going to go up to, but not including the first element of the next row, so, that's this loop right here. And at every iteration of that loop, we will multiply two things. One is the value of that element, so in this case d. And the second is we check which column d is in, in this case d is in column zero so we're going to look up the vector element at position 0 and multiply d by that vector element. So that's this value times that vector element, and then add that to dot. And when we're finally done, we take our destination value y, add it to dot and put it back into y.

Now, let's do a little performance analysis here. Let's say that we have 32 adjacent threads in a warp that are each processing a separate row in the matrix. These rows may be different lengths. Will the overall run time of this set of threads be dependent on the shortest row, the average row, or the longest row?

And the answer is longest and that's really important. So, let's think about how these threads would run. We'll start with all the threads doing useful work. They'll all process their first element at the same time, then their second element, and so on. But eventually, the threads assigned to short rows will run out of work to do, so they'll be idle while the longer rows continue to do work. This code is really efficient if all the rows have the same length. But it's a lot less efficient if the lengths of the rows differ, because then the threads with shorter rows are idle, while threads with longer rows are still working.

So let's try to think about a method that might perform better for matrices with varying numbers of non-zero elements per row. If you recall, the approach that we've pursued in unit four was to assign a thread to each element, do one multiplication per element, then use a backwards inclusive segmented, sum scan to add up the partial products and generator result per row. Let's look at this approach step by step so that we can understand it, so we're going to start with a CSR representation of matrix. So step one, we are going to use the row point or data structure to generate an array with ones for segment heads and zeros for non-segment heads. Then were going to launch a thread for each element in the matrix, were going to multiply it by the corresponding vector element which we fetch using the index data structure. For thread n, the thread in that code will look like value of n times x of index of n. Then were going to form our backwards inclusive segmented sum scan, which will sum up all of the partial products in the matrix row. And then finally, the output of that sum scan will be a sparse vector, which we might want to pack back into a dense vector. And we can use the row pointer addresses to gather that sparse vector back into the dense vector. Now let's go answer the same question we asked about this thread per row approach.

Now, let's go answer the same question we asked about the thread per row approach. Let's say we have 32 adjacent threads in a warp that are each processing a separate element in the matrix. The rows in this matrix may be of different lengths. Will the over all run time of these threads be proportional to the longest row of any of any of those elements, proportional to the shortest row of any of those elements, or completely insensitive to the, the distribution of rows and elements in the matrix?

Well, the answer, this an important answer, is insensitive. It doesn't matter, the structure of the matrix. It's insensitive to the structure of the matrix. It doesn't matter if we have short rows, long rows, all rows the same length, rows of wildly different lengths. The performance of the multiplications and the performance of the segmented scan, don't depend on the distribution of elements under rows at all. Instead the run time is dependent only on the number of elements.

So, we have two approaches here. Thread per row, and thread per element which is better. So, we might have different performance on matricies that have a similar number of elements per row. And we might have differing performance if we have a varying and a wildly varying number of elements per row. So which of these is comparatively better on each of these kinds of matrices? So I'd like you to put a couple check boxes in.

So if we start by looking at a varying number of elements per row. What we've just seen is that the thread by element approach is insensitive to the structure of the matrix. Whereas, if we have a varying number of elements per row, thread per road suffers from load imbalance, and it has a run time depending on the longest row. So, of these two methods, thread per element is probably a better choice. It's completely insensitive to the structure of the matrix. But on the other hand, if our matrices have roughly the same number of elements in each row, which is better? Well, now there's really no load imbalance issues and that's the primary disadvantage with the thread per row approach. And in this case, where we have a similar number of elements per row, it turns out that thread per row is roughly three times faster than thread per element. How come? Well, in thread per row, all of the partial products are accumulated within a single thread, so they're going to communicate through registers and there's no communication between threads at all. In thread per element, on the other hand, the segmented scan operation requires communicating between threads to sum up the partial products. That communication is more expensive. So, because we have different performance characteristics here, it's kind of a head-scratcher problem. Which data structure do we chose, given that we probably don't know what the structure of the matrix is before we need to make this call?

So, one elegant answer to this problem was proposed by Nathan Bell and Michael Garland at the Supercomputing Conference in 2009. And their answer was, use both approaches. And here's how we're going to do that. We're going to divide the matrix into two pieces. One piece, the more regular piece, is going Is going to use the thread per row approach. This right side here, the more irregular approach is instead going to use the thread per element approach. And then, we compute each of these separately and add the results back together at the very end. So, the final question is, where do we draw this line? What happens if we draw it too far to the left, or draw it too far to the right?

So let's turn this into a quiz. If we draw this line too far to the left, then we'll suffer from poor peak performance or load imbalance. And if we draw it too far to the right, will we again get poor peak performance, or load imbalance?

Well, if we go too far to the left, it means that we're going to have poor peak performance. Because we'll be using the slower thread per element approach on data that would instead be a better fit for the faster thread per row approach. But, if we go too far to the right, then we'll be using thread per row on a regular row lengths, and that's going to lead to poor load balance. Bell and Garland noted that thread per row is about three times faster than thread per element for matrices with balanced rows. So, their rule of thumb for drawing the line was when 1 3rd of the rows have non-zero elements. Simple and sensible. You'll find their approach called hybrid, an NVIDIA cuSPARSE, sparse matrix library.

So, what's the big picture here? I'll highlight two lessons here from this discussion. The first is the importance of keeping all your threads busy. Fine grain load imbalance is a performance killer because your thread hardware is sitting idle. Choosing algorithms that reduce imbalance is crucial. Second, managing communication is equally important. Communicating through registers is faster than communicating through shared memory, as we see here, which is in turn faster than communicating with the host.

Now, we'll turn to another interesting problem where the choice of approach is crucial for good performance. This is the problem of graph reversal, specifically, a breadth-first reversal of the graph. So let's define the terms that were using here. A graph consists of a set of vertices and edges that connect to these vertices. So, in this picture vertices are blue circles, edges are green lines. Some graphs might be sparse with few edges per vertex, some might be dense and have lots of edges per vertex. Large graphs, very large graphs, are an interesting recent topic of study. The World Wide Web for instance, can be considered a graph, with pages as vertices and links between pages as edges. Or, the social network of Facebook, Google, or Twitter is a graph where people are vertices and friendships are links. These companies would like to use this social graph to say, suggest friends you might know based on the friends you already have. And making these suggestions depends on analyzing the graph. The traversal involves visiting all the nodes in the graph. Web crawlers must traverse the web to catalog it for searching. How do they make sure they visit each web page once and only once? A traversal of the web graph will allow them to do that. Now, these two approaches to traversing a graph called depth-first traversal and breadth-first traversal. And so, this picture is going to help us show the differences. In a depth-first traversal, we begin at a particular node, so we're going to pick this one in the middle and we're going to name, label it 0. We're going to pick a neighbor we haven't visited yet, let's say this one here. And then we're going to do a depth first reversal from that node. So we're going to continue down the chain, label this one, and so on. If we run out of unvisited neighbors, such as here, we pop back to a previously visited node and continue. Well, we don't have any neighbors here either, so we might pop back here. And now, we're going to continue with our other unvisited neighbors. So we might come here next. We don't have anymore. We pop back up. We might come here next, 4, 5, 6. Pop all the way back up here. Now, we might come down to this one here. Move here, pop back up, and then, go 9 and 10. Eventually, come back up to our original route node. We decide that there are no more unvisited neighbors and then we complete. In a breadth-first traversal, we're going to begin with a node just as we did as in a DFS, and we're going to pick the same starting node. But now, we're going to have a different algorithm. So, once we're at a particular node and we're running a breadth-first traversal, what we're going to do is immediately visit all the neighbors that are one hop away. Before we do any descent at all. So, first we're going to visit all the neighbors that are one hop away, only then will we start to descend farther past any of these neighbors. So now we'll take the first one of these and we'll visit all of its unvisited numbers that are one hop away. Well, now we're complete here, so now we're going to go to number 2. We're going to do a traversal here, we're complete there. We come back up to number 3. We continue to traverse here. So we go one hop away, and then pop back up. Now, we go to number 4. What are the ones that are one hop away from number 4? 'Kay. Now, we're at 8 and so on. One thing you can notice here is that we have a structure that we call a frontier that forms the boundary between all the nodes that we've already visited and all the nodes that we haven't visited. And we see that as we continue our traversal that's just going to expand the frontier out and out and out again. So we'll complete this by going to number 9, and then eventually, to number 10. Now, which one of these is better? Well, it depends on the problem. The structure of the graph, what you're looking for, if you're able to make good decisions about which way to go next. Generally, people will say that DFS requires less state, less temporary storage to run. But DFS on the other hand, exposes more parallelism during the traversal, and specifically, because of this parallelism quality. Today, we're going to look at how to do BFS on the GPU.

So, let's state the specific problem that we want to solve today. And then, we'll look at how we might solve it. We want to start at a specific node and we'll call it s. And then, visit every node in the graph and record it's distance from s in terms of graph hops. We call this quantity, its depth. So any neighbor of s, has depth 1. Any neighbor of those neighbors that we haven't visited yet, has depth 2. And in general, a neighbor of a node with depth d that hasn't already been visited has depth d plus 1. I'm not sure how familiar our international audience is with the actor Kevin Bacon. But the Bacon number of a particular actor is the number of degrees of separation from Kevin Bacon. So, if I had acted in a movie a movie with say, John Belushi, my Bacon number would be two because John Belushi's Bacon number is one, because her appeared with Kevin Bacon in the outstanding film, Animal House. We could generate the Bacon numbers for all movie actors from a graph that records who appeared in movies together and we could do this with a breadth first treversal that begins at Kevin Bacon. And calculates depth for each vertex for actor in the graph.

So, let's make sure we understand BFS and the problem that we're going to solve. So, if we have a graph, an arbitrary graph with n nodes, what is the minimum possible maximum depth of that graph, and what is the maximum possible maximum depth of that graph? And by that, I mean, we can draw any graph we want, okay? And we would like to draw it to minimize the biggest number that we write in here. Or, we'd like to draw a different graph with N nodes and we'd like to maximize the biggest number that we write in here. So, I'd like you to express these as functions of N.

So the minimum possible maximum depth is 1. So if we start with a node here that we're going to number 0, okay. And then we're going to have a graph where every single node is attached to that root node, okay. Well the depth of each one of these is going to be 1. On the other hand, the maximum possible max depth is going to be N minus 1. So what's our graph look like there. We're going to start with the root node and then we're going to have a linear chain of graph nodes all the way out to N minus 1. Now, it's important for you to realize these are two very different graphs from the point of view of parallelization. This linear chain here, very hard to parallelize if we're doing a BFS. It's going to take us order of n steps to get from the start node all the way to the end node. Not going to get a lot of parallelization out of that. And clever students are going to realize, this is the worst case step complexity of the BFS. But if we have a graph that looks like this, a star shaped graph. Then parallelization is a lot more straightforward. We can process each one of those leaves independently, and in parallel.

So let's think of a good parallel algorithm to solve this problem. So here's some of the characteristics that we want out of our algorithm. We want lots of parallelism, coalesced memory access, minimal execution divergence and something that's easy to implement. And I want to show you an algorithm that gives you all of those characteristics, and then I'm going to tell you why it's a crummy algorithm. So, as we're going along, I want you to think about why this algorithm isn't going to be good enough. So, here's what we're going to do. First I'm going to describe the algorithm at a high level, then we're going to do an example, then we're going look at code. So at a high level, we're going to operate over our list of edges. And we're going to iterate n times, where n is the maximum depth of the graph. So on each iteration, n parallel, we can look at each edge independently in n parallel. And if one end of that edge has a depth d, but we haven't visited the other end of the edge yet, your vertex on the other side, then we set that vertex to depth d plus 1. So the first iteration will set all the vertices that have depth 1. The next iteration will set all the vertices with depth 2 and so on. So let's do an example.

So here's a simple graph. We're going to start here with node number 2, and we're going to set the depth there equal to 0. And our goal is to find these depths for each of these vertices. Originally, none of these are set. We're just going to say that's the value for hasn't visited yet. And we're going to begin by setting vertex number 2 to the value 0. So, we note that the furthest vertex away from node 2 is two hops away. So, the maximum depth here is 2 and we should expect that we should iterate twice. So, we're going to begin the algorithm by looking at all of these edges in parallel. And what we're looking for is a pattern where one of the vertices of these edges is going to have a depth value set, and the other one does not. And we're going to find that three edges have this particular quality. So, one of them is going to be the edge between 1 and visited yet. So, we'll set it equal to 0 plus 1, which is 1. We'll do the same thing for this edge here between 2 and 3, okay? We visited 2, its value was 0. We haven't visited 3, so we set its value to be 1. And this edge here, 2 and 6. Now, we complete the first iteration. We visited every single edge and run 6 threads in parallel to make this particular determination. Now we begin our second iteration. So now, again, what we're looking for is an edge where the vertex value of one end of the edge has been set and the other one has not yet been visited. So, we're going to find that's true for the other three edges. So, the edge between 0 and 1, we find that the depth at vertex number 1 is 1. We find that vertex 0 has not yet been visited. So, we set it's depth to 1 plus 1 equals 2. And the same is going to be true for the edges between 3 and 4, and the edge between 5 and 6, okay? Now we filled in a value for all of our vertices here, and we're thus complete.

So let's analyze the word complexity here. So the word complexity is possibly a function of both the number of vertices v, and the number of edges e. So, for the worst case, what is the portion of the word complexity that is a function of v and what is a portion that's a function of e? So your choices are as a function of v, constant with order of v, v log v, or v squared. Similarly for e, constant order of one, order of log e, order of e, order of e log e, and quadratic and e, order of e squared.

So, let's think about the maximum number of iterations that we might have. What graph is going to give us the most iterations and that graph is going to be this awful-looking linear chain. How many iterations is that going to have? Well, it's going to have an order of the iterations because we are going to have the vertices here. And how much work we're going to do in each one of those iterations? Well, each one of these iterations, we know that we will visit all E edges. So, if we take the amount of work here, the product between V and E, we know that our final answer is going to be on the order of V times E. So, we know that it's at least going to be on the order of V and we know that it's at least going to be in the order of E. Probably, in any reasonable graph, this is going to be more edges than vertices. Otherwise, it's not a very interesting graph and if that's the case, we know that the overall work complexity is going to be at least order of V squared. It's quadratic in the number of vertices and that's bad, that is the very bad property. And so, I told you in the beginning of the discussion here that this was a crummy algorithm and this is why we really do not want an algorithm that is quadratic in the number of vertices .

Okay. Now, let's look at some code. It's going to take us two kernels to be able to run this. The first one is going to be an initialization kernel that we run once, and then we're going to show our iteration kernel that we're going to run on every iteration of the loop. Now, the initialization kernel is extremely simple. It's just a map over all the vertices, and it's simply going to initialize the vertex array. So, we get the vertex array as an argument, we know how long it is, and we know which, which vertex is the staring vertex. All we're going to do is calculate a global index for our particular thread, and then if our thread is the starting vertex, then we're going to set its depth to 0. Otherwise, we're going to set it to negative 1, and here, we're going to designate negative 1 as our not visited value. Okay. This is very simple. Easy map operation. The more interesting one is the one we'll show next, which is the kernel that we're going to run on every iteration.

Okay, so here's the more interesting kernel. This is a kernel that we're going to run on our iterations, each iteration is going to launch this kernel over e threads. Each of those threads is associated with one edge and can process that edge completely in parallel. So again, this is a map operation. It's something that maps very nicely to the GPU. So, the first thing we're going to do is we're going to calculate a global index for our particular edge so that our thread is associated with one and only one edge. Then, we're going to compute a few quantities. So, the first thing is, what is the vertex at one end of my edge and what is the depth of that vertex? What is the vertex at the second end of my edge, and what is the depth of that second vertex? Then, we're going to do the logic that we care about. So, if we have visited the first vertex in our edge, but we haven't visited the second vertex in our edge, okay? So, we check the current depth against the first depth and make sure that we haven't visited the second vertex yet. If that's the case, then we set the depth of the second vertex equal to the depth of the first vertex plus one.

So, the converse cases is fairly straight forward. Now, instead of checking the first against the current depth, we're checking the second and we know the first has not been visited yet. If that's the case, then we're going to set the depth of the first to the depth of the second plus 1. Now, there's two other cases, and one of those cases is if we visited both vertices. And the other case is if we visited neither vertex. So, you should convince yourself that we could ignore these cases, that we have of no operations to do if either of these cases is true.

Just a couple of implementation notes, and first a quick mental exercise. What happens if two different paths reach the same vertex at the same time? So, lets consider this graph here. We're going to start at node s here and we're going to set its depth to zero. On the first iteration of our algorithm, then the depth of a will be set to one and the depth of B will be set to one, but we know, in the next iteration of the algorithm, this edge and this edge will both try to set C's depth. Is this a race condition? Does the result depend on which thread happens to run first? Well, fortunately no, because the algorithm is structured to always write only a singe depth value on itera-, on any iteration. So if 2 threads are both trying to write to the same vertex within a single iteration, they'll always both be writing the same depth. In this case, they'll both be trying to write the value 2. So it doesn't really matter who goes first or second, because the end result, the depth of c, is going to be the same. Second, how do we know when we're done? So typically the way we're going to write this code is as follows. We're going to check a variable to see if we're done, and we're going to check that variable on the host. That variable's going to be set on the device. The kernel needs to set that variable. Then we'll copy it at the end of ev, every iteration to the host. The host will check if we're done. If we're not done, we're going to iterate again. If we are done, then we'll fall through the loop and, and move on to the next code. So how do we actually set this variable? So, in this case, what we're going to do is we're going to first initialize a device variable called done to true. The way we're going to initialize it is we're going to have a variable on the host we know is true. And just copy it over to the device, so it's set as true at the beginning. Then we're going to run our iteration. Run this particular kernel, and if the kernel knows that it's not done yet, it will take this done variable and set it to false. Then, whatever the state of the done variable is, we'll copy it back to the host and so the host will then know if we're complete or not and make a decision accordingly. So how are we going to actually do this on a GPU? What's inside the kernel? So, this is an enhancement to the code that we just saw on the previous slide. What we're going to do is every time we set any vertex depth on the GPU, we also alongside, set the global variable done to false. So it's initialized to true if we make no changes at all in this bfs code, then it will stay true and we'll be done. But if we make any changes at all, then we'll set that global variable to false, then we will copy that false variable back to the host and it will give us another iteration. Note that more than one thread can set done to false, but that's fine. It doesn't matter in what order these writes occur since they're all going to be the same value.

So let's summarize our breadth first search try number 1. So, this algorithm has some great properties. It's completely parallel, and there's no direct communication or synchronization between threads at all. There's very little thread divergence. The memory behavior isn't ideal since it involves gathers and scatters, but it's not bad. Unfortunately, the host has to control the iterations, which isn't super attractive. Though, we'll look at a new feature in CUDA 5 in the next unit called Dynamic Parallelism that allows the GPU to control the iterations directly, but the real problem is the quadratic complexity. We're going to have perfectly data parallel work, but we're going to do way too much of it. This approach will scale poorly as we process larger and larger graphs, and we are not cool with that at all.

In Problem Set number 6, you will be implementing a fast parallel algorithm for seamless image cloning. Seamless image cloning is a modern algorithm for pasting one image into another in a very subtle way and here is an example of seamless image cloning. So, for example, you have a bear, a swimming pool and with seamless image cloning, you now have a polar bear in a swimming pool. Our algorithm for seamless image cloning uses It's a technique called double buffering to iteratively compute a solution. The main idea behind double buffering in the context of this homework is that, in store, in approximate solution to a problem in one buffer, then we use that approximate solution to compute a better solution which what we'll store in another buffer. Then, we'll go in the other direction, computing an even better solution and store that solution in the first buffer. After going back and forth, many times, we will arrive at a ideal solution to our problem. This style of computation arises in many parallel algorithms which is why we think it will make a really interesting homework assignment. Before we implement seamless image cloning, let's take a closer look at our inputs. We are given as inputs, a source image, a target image, and a mask. We want to copy the mask value from the source image into the target image in a subtle way. To do this, when [unknown] the values on the inside border of the masked region to be the same as the target image, then what we'll do is to solve for the unknown pixel values in the strict interior of the masked region. So, how do we find these unknown pixel values in the strict interior portion of the masked image? We can do that as follows. First step, we will make a guess for the unknown pixel values, and that's called our initial guess, I0. I0 doesn't have to be a very good guess, so we can set I0 to be all zeros in the strict interior portions of our masked image. But note that 0 will be a very bad guess, because it would not be very subtle. And it also would not conform to the source image very well. Now, suppose we are given a guess solution, IK to our problem, we can compute a better guess IK plus 1 as follows. For some pixel, small i K plus 1, in big I k plus 1, in the strict interior region of the mask, we can compute the value of a small i of K plus 1, as A plus B plus C divided by D. And let's go through each of A and B and C and D one at a time. A is the sum of iK's neighbors, when don't include iK in the sum, and we only include iK's neighbors if they are in the strict interior portion of the masked region. B is the sum of t's neighbors, where t is the corresponding pixel value in the target image, and the target image in our case is the swimming pool. And we only include t's neighbors if they are in the inside border of the masked region. And c is the sum of the difference between S and s's neighbors, where s in this case is the corresponding pixel value in the source image and the source image in our case is the polar bear. And finally, d is the number of iK's neighbors, and iK's neighbors can be less than 4 if iK is on the border of the image.