cs344

## Introduction

We're getting pretty close to the end of this class. So, congratulations on making it this far. Now, one of the most challenging skills in all of GPU computing is trying to parallelize a problem that you've never seen before. So, what we're going to do today is look at several examples of interesting applications where we've done that. Hopefully this is going to give you a few more tools in your GPU computing toolbox. So let's get started.

## Adding to Your GPU Computing Toolbox

In previous units, you've learned about the building blocks of GPU computing, the fundamentals of parallelism, the mechanics of writing programs, and techniques for making them fast. However, one of the challenging things about learning a new programming model is what happens when you run into a kind of problem you've never seen before. So what we're going to do in this unit is look at a variety of interesting problems to solve, and work through how we might map these problems efficiently onto the GPU. There's no magic pill that will allow you to instantly think in parallel. But what I've found helpful is to see a variety of problems and how they map to parallel algorithms and data structures. It helps me have a broader toolbox of techniques and lessons that I can use in attacking other problems that I may not have seen before, and I hope the same will be true for you.

## All Pairs N-Body

Let's take a look at a commonly used kernel in scientific computation. This is called N-body computation. In an N-body computation you have N objects, each of which exerts a force on each other object. You'd use N-body computations to, say, model the gravitational interaction between stars in a galaxy, or the electrostatic forces between a number of charged particles. It's the way we compute molecular dynamics, which in turn can model protein folding and misfolding, and is an important component of our understanding of disease and drug discovery. N-body techniques are used in Folding@home, for instance. The specific formulation of N-body computation that we'll discuss today is called all-pairs N-body, and I'm basing this discussion on an excellent article by Lars Nyland, Mark Harris, and Jan Prins in GPU Gems 3 (2007). All-pairs N-body is the simplest N-body computation. It's known as a brute-force method, and it involves separately calculating the force between each pair of elements and then adding up the resulting forces on each element. In a simulation using N-body, we usually have two steps: first, compute the forces on each element; then, move each element a little bit based on its force; and then repeat. So we first want to understand the complexity of this approach. We have N objects, each of which computes the forces on it from each other object. For N objects, what is the work complexity of computing all-pairs N-body? Is the amount of work proportional to N, N log N, N squared, or N cubed?

## All Pairs N-Body

All-pairs N-body is an N squared algorithm. Each object must compute its interaction with every other object. So each of N objects has to compute N minus 1 forces. N objects times N minus 1 force computations per object yields work proportional to N squared. Now, there are ways you can approximate N-body computations with high accuracy and superior work efficiency. There are tree-based algorithms, for example, such as one called Barnes-Hut, that have N log N complexity, and a method called the fast multipole method that's actually linear, order of N, complexity. They basically work by lumping distant bodies together and treating them as a single body during the calculation. These are neat algorithms; the fast multipole method was actually named one of the top ten algorithms of the 20th century. But they're also considerably more complex. All-pairs, on the other hand, is very simple, and it's often the way that production N-body algorithms compute forces among close-together objects. And with it, we're going to show how to use the memory hierarchy in an efficient way to make this computation as fast as possible.

## How To Implement Dense N-Body as Simply As Possible

So, let's start by looking at how we might implement this as simply as possible. And we're going to begin by thinking about this computation as working on an n by n matrix. The n bodies on the x-axis here are the source of each force, and the n bodies on the y-axis here are the destination of each force. The source and destination objects are the same objects, of course; this is just how we're organizing the computation. So, the matrix element at the position x comma y represents the force that source element x exerts on destination element y. And what we want to do is calculate the total amount of force from all source objects on each destination object. This formulation of the problem exhibits a lot of parallelism. Each of the n squared forces can be computed in parallel. We just load the parameters of the source and destination objects in a thread to do one particular computation, and compute the force, say, using gravitational attraction. This requires specifying parameters for each object. For gravity, this would be, say, the mass and the position: x, y, z. Then each object, say object y, must add up all n forces that act on it. So, we have to calculate n reductions of n forces, also in parallel. Since we want to run this computation on tens to hundreds of thousands, maybe even millions of elements, we can see that we have a lot of parallelism to exploit. In this case, the computation would be very straightforward. But we will now see that if we run it in this way, we would be making poor use of our memory bandwidth. So, if we assume that all of our n squared force computations running in parallel must go to global memory, how many times must we fetch each element as a function of n?

## How To Implement Dense N-Body as Simply As Possible

We know that there are going to be N source elements, and we're going to have the same N destination elements. So, for any particular source element, say x here, we must fetch that source element for each individual computation, for every destination element. And for a particular destination element, we know that we have to fetch the parameters for each individual source element. So overall, we have to fetch each element N times as the source and N times as the destination, or, if we assume that an object does not exert any force on itself, then we fetch each element N minus 1 times as each of the source and destination. So the answer is either 2N or 2N minus 2 times. This is quite expensive. We'd really like to organize the computation in such a way that we consume less global memory bandwidth. And we're going to try to do this by storing and reusing some of these memory fetches. In particular, we'd like to be able to take advantage of the GPU's ability to share data between neighboring threads.

## Dividing N by N Matrix into Tiles

So let's take another look at this n by n matrix. What we're going to do is divide it into what we're going to call tiles: square chunks of the matrix. What we'll do is calculate each tile in parallel, which will then calculate all the total forces on the elements within that tile. Then we'll add the resulting forces across horizontal tiles to get the total force for each destination object. We're going to say a tile is P by P elements, so we'll have n squared over P squared tiles overall. We're not going to talk about how big a tile is yet; we'll cover that later. But for now, when we run this particular tile, we're going to run it within a CUDA thread block, and we can potentially share data between threads in the block. Recall that the x-axis here is the source and the y-axis is the destination. We'd really like to only fetch each object once per tile and reuse it among all the threads in the tile, instead of fetching each item P times per tile. Thus, instead of doing 2P squared memory fetches per tile, that is, P times P items and two fetches for each item, how many fetches will we do instead?

## Dividing N by N Matrix into Tiles

The answer is 2P. We'd like to fetch the P source items once each and the P destination items once each, and never have to fetch anything again.
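The fetch-sharing idea can be sketched in plain C. On the GPU the staged arrays below would live in shared memory and the two inner loops would be spread across the block's threads; here a sequential stand-in counts global-memory fetches so we can check the 2P claim. The "force" is a toy product of two scalar parameters, purely for illustration.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Process one P-by-P tile: stage each source and destination parameter once,
   then compute all P*P interactions from the staged copies. Returns how many
   global-memory fetches were made. Assumes P <= 256. */
size_t process_tile(const float *src_params, const float *dst_params,
                    float *partial_force, size_t P) {
    size_t fetches = 0;
    float staged_src[256];   /* stand-in for shared memory */
    float staged_dst[256];
    for (size_t i = 0; i < P; ++i) { staged_src[i] = src_params[i]; ++fetches; }
    for (size_t i = 0; i < P; ++i) { staged_dst[i] = dst_params[i]; ++fetches; }
    /* All P*P interactions now read only the staged copies.
       Toy "force": the product of the two parameters. */
    for (size_t d = 0; d < P; ++d)
        for (size_t s = 0; s < P; ++s)
            partial_force[d] += staged_src[s] * staged_dst[d];
    return fetches;          /* 2P, not 2 * P * P */
}
```

The fetch count comes out to 2P per tile even though P squared interactions are computed, which is exactly the reuse the tiling buys us.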

## Downside to Using Tiles

Let's start by thinking about how we'll put tiles together after we compute them. Each individual tile will compute partial results for these P destination threads. The partial results will reflect the contributions of the forces produced by these P source objects. Then, for each of these tiles, we'll add the partial results from all the tiles in a row to get the complete set of forces that act on the P threads in this band right here. So, let's say we run this tile with P squared threads. That's going to allow us to compute all P squared individual forces at the same time. This is good, very parallel. But it's got a couple of not too major, but still significant, downsides. The first is that we would have to share both source and destination parameters for each object across multiple threads. Remember, we're only fetching each item once. How many threads must get the parameter information for each source object and for each destination object?

## Downside to Using Tiles

So remember that we only fetch each source object once, but we have to distribute that information across all P threads that use it. And for each destination object, again, we're going to fetch that information once, but then we have to distribute that information across all P source threads. It would be nicer if we didn't have to do this. Either we store that information in shared memory once, but then we have to read out of shared memory, which is slower; or we have to replicate that information in each thread's local storage, which wastes space and limits the number of threads we can have resident in a single thread block. The second downside is that we have to communicate between threads to sum up the forces. We've individually computed each separate force on separate threads, and then we have to communicate between all those threads to sum up the force for this particular tile. We'd prefer not to have to do that either. So instead, we'll look at another way to compute a tile.

## Why Is This a Good Idea

This transformation has actually reduced the amount of parallelism. Previously, we had P squared threads per tile and P squared-way parallelism within a tile. Now, we only have P threads per tile, so we have less parallelism than we did before. Does that matter? Well, if we have a big, big N-body problem to solve, it probably doesn't: for a big problem, there's probably so much parallel work that we will easily fill the machine, even with our new version. But if we have a more modest-sized problem, the lesser amount of parallelism might mean this method isn't the best, because we might reduce the amount of parallelism to the point where we can't keep the whole GPU busy. The big picture of what we've done here is that we've chosen an implementation that has fewer threads and does more work per thread. You'll often see this technique used to convert communication between threads, which is comparatively more expensive, into communication within a thread, because accessing a thread's local storage is faster than communicating between threads through shared memory.
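The fewer-threads, more-work-per-thread version of a tile can be sketched like this: one loop iteration below stands in for one thread, which owns a single destination and serially accumulates all P source contributions in a local variable (a register on the GPU), so no inter-thread reduction is needed. The "force" is again a toy product, purely illustrative.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* What one thread does in the revised tile: loop over the tile's P sources,
   accumulating the partial force for its own destination in thread-local
   storage. No communication with other threads is required. */
float tile_row_force(const float *src_params, size_t P, float dst_param) {
    float sum = 0.0f;                 /* lives in a register on the GPU */
    for (size_t s = 0; s < P; ++s)    /* serial loop replaces P-way parallelism */
        sum += src_params[s] * dst_param;
    return sum;
}
```

Compared with the P-squared-thread version, the summation happens entirely inside one thread, trading parallelism for cheaper communication.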

## How Big Should We Make a Tile

Finally, how big should we make a tile? Let's think about this. If we choose a larger P, does that mean more or less total main memory bandwidth?

## How Big Should We Make a Tile

The answer here is less. We load each source object's parameters once per tile. Larger tiles means fewer tiles, and thus fewer loads of each source object's parameters.

## What If P Is Too Big Or Small

Now, a quiz: if P is too big or too small, then we can't use shared memory to share a source object's parameters across all destination object threads. Which is it?

## What If P Is Too Big Or Small

The answer: too big. If P has more threads than a thread block is allowed to have, then we can't use shared memory to share data among all P threads, because we would have to distribute that tile across multiple thread blocks. Another consideration is making sure that we have at least as many thread blocks as SMs, or else SMs will sit idle.

## Determining the Size of the Tiles

In this problem you need to divide your work up into chunks, in this case tiles. We have a continuum between tiny tiles, lots of them, and fewer tiles where each tile is sized to the maximum that can fit in a single thread block. And in this particular problem, bigger tiles mean less memory bandwidth; this is good. Generally, then, you want to make your tiles as big as can fit into a single thread block, because that minimizes overall memory bandwidth. But note the following two caveats. One, you need to have at least as many thread blocks as you have SMs in your GPU, because otherwise you'll have SMs sitting idle. Definitely, you want to make sure you fill the machine with enough work to keep all the SMs busy, even if you have to move a little bit this way on the continuum and size your tiles just a little bit smaller. Two, if you're sitting at the right end of this continuum, it's best for overall memory bandwidth, but often it turns out that you would actually prefer to sit maybe one tick to the left. This allows a small number of blocks, say two, to be resident at a time, and that potentially gives better latency-hiding characteristics, because you have more warps at the same time from slightly different pieces of the program. It's certainly something that you would want to tune carefully if you needed the fastest possible implementation.

## All Pairs N-Body Summary

So let's summarize the big ideas of all-pairs N-body. This might not be the most work-efficient algorithm; it's not. But it's both simple and high performance. If you implement this, you'll find that you're actually able to sustain a significant fraction of the theoretical peak throughput of the GPU. The real advantage here is that we have a very large amount of parallelism in this problem that we're able to exploit well in our implementation. The interesting part of this implementation is how to structure the ample parallelism that is available in this problem. And the big idea we discussed is the tradeoff between exploiting the maximum amount of parallelism and doing more work per thread. There are many problems where increasing the amount of work per thread and reducing the amount of parallelism is really the right thing to do, because communicating results within a single thread is faster than communicating between threads.

## Role of thread in SpMv

First, though, a little quiz to make sure we think about the implications of the two approaches, which we're calling thread per row and thread per element. It's not immediately clear which is better. So, for the same matrix, please answer each of these three questions by choosing either thread per row or thread per element. Which approach will launch more threads? Which approach will require communication between threads? And which approach will do more work per thread?
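To make the two strategies concrete, here is a sequential C sketch of each, using the standard CSR (compressed sparse row) layout. This is illustrative code, not the lecture's: in `spmv_row` the whole function body is what one thread runs; in `spmv_element` the first loop is the one-thread-per-element map, and the second loop stands in for the segmented reduction that sums each row on the GPU.

```c
#include <assert.h>
#include <math.h>

/* Thread per row: compute row r of y = A*x from CSR arrays.
   This loop is the work of a single thread. */
float spmv_row(const float *val, const int *col, const int *row_ptr,
               const float *x, int r) {
    float sum = 0.0f;
    for (int k = row_ptr[r]; k < row_ptr[r+1]; ++k)
        sum += val[k] * x[col[k]];
    return sum;
}

/* Thread per element: step 1 is a perfectly parallel map over all nnz
   elements; step 2 stands in for the segmented scan that sums each row. */
void spmv_element(const float *val, const int *col, const int *row_ptr,
                  const float *x, float *y, int rows, int nnz,
                  float *products /* scratch, length nnz */) {
    for (int k = 0; k < nnz; ++k)            /* one thread per element */
        products[k] = val[k] * x[col[k]];
    for (int r = 0; r < rows; ++r) {         /* segmented reduction by row */
        y[r] = 0.0f;
        for (int k = row_ptr[r]; k < row_ptr[r+1]; ++k)
            y[r] += products[k];
    }
}
```

Both compute the same result; they differ in how many threads are launched, how much work each does, and whether threads must communicate.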

## Performance Analysis

Now, let's do a little performance analysis here. Let's say that we have 32 adjacent threads in a warp that are each processing a separate row in the matrix. These rows may be different lengths. Will the overall run time of this set of threads be dependent on the shortest row, the average row, or the longest row?

## Performance Analysis

The answer is the longest row, and that's really important. So, let's think about how these threads would run. We'll start with all the threads doing useful work. They'll all process their first element at the same time, then their second element, and so on. But eventually, the threads assigned to short rows will run out of work to do, so they'll be idle while the longer rows continue to do work. This code is really efficient if all the rows have the same length. But it's a lot less efficient if the lengths of the rows differ, because then the threads with shorter rows are idle while threads with longer rows are still working.

## Performance analysis

Now, let's go answer the same question we asked about the thread per row approach. Let's say we have 32 adjacent threads in a warp that are each processing a separate element in the matrix. The rows in this matrix may be of different lengths. Will the overall run time of these threads be proportional to the longest row containing any of those elements, proportional to the shortest row containing any of those elements, or completely insensitive to the distribution of rows and elements in the matrix?

## Performance analysis

Well, the answer, and this is an important answer, is insensitive. The structure of the matrix doesn't matter. It doesn't matter if we have short rows, long rows, all rows the same length, or rows of wildly different lengths. The performance of the multiplications and the performance of the segmented scan don't depend on the distribution of elements among rows at all. Instead, the run time depends only on the number of elements.

## Thread Per Row Vs Thread Per Element

So, we have two approaches here, thread per row and thread per element. Which is better? We might have different performance on matrices that have a similar number of elements per row, and we might have different performance on matrices with a wildly varying number of elements per row. So which of these approaches is comparatively better on each of these kinds of matrices? I'd like you to fill in a couple of check boxes.

## Hybrid Approach

So, one elegant answer to this problem was proposed by Nathan Bell and Michael Garland at the Supercomputing Conference in 2009. And their answer was: use both approaches. And here's how we're going to do that. We're going to divide the matrix into two pieces. One piece, the more regular piece, is going to use the thread per row approach. The other piece, on the right side here, is more irregular and is instead going to use the thread per element approach. Then we compute each of these separately and add the results back together at the very end. So, the final question is, where do we draw this line? What happens if we draw it too far to the left, or too far to the right?

## A Quiz on the Hybrid Approach

So let's turn this into a quiz. If we draw this line too far to the left, then we'll suffer from poor peak performance or load imbalance. And if we draw it too far to the right, will we again get poor peak performance, or load imbalance?

## A Quiz on the Hybrid Approach

Well, if we go too far to the left, it means that we're going to have poor peak performance, because we'll be using the slower thread per element approach on data that would instead be a better fit for the faster thread per row approach. But if we go too far to the right, then we'll be using thread per row on irregular row lengths, and that's going to lead to poor load balance. Bell and Garland noted that thread per row is about three times faster than thread per element for matrices with balanced rows. So, their rule of thumb was to draw the line at the point where a third of the rows still have non-zero elements. Simple and sensible. You'll find their approach, called hybrid, in NVIDIA's cuSPARSE sparse matrix library.
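The split itself is easy to sketch. The helper below, an illustrative sketch rather than Bell and Garland's actual code, takes a threshold K (the column where the line is drawn) and counts how many non-zero elements land in the regular, thread-per-row part versus the irregular, thread-per-element part.

```c
#include <assert.h>

/* For each row, the first K non-zeros (or all of them, if the row is shorter
   than K) go to the regular part; whatever spills past K goes to the
   irregular part. K is the tuning knob: "where we draw the line". */
void hybrid_split_counts(const int *row_len, int rows, int K,
                         int *regular_nnz, int *irregular_nnz) {
    *regular_nnz = 0;
    *irregular_nnz = 0;
    for (int r = 0; r < rows; ++r) {
        int in_regular = row_len[r] < K ? row_len[r] : K;
        *regular_nnz   += in_regular;
        *irregular_nnz += row_len[r] - in_regular;
    }
}
```

Moving K right grows the regular part and shrinks the irregular one; the rule of thumb above is one way to pick it.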

## SpMv - The Big Picture

So, what's the big picture here? I'll highlight two lessons here from this discussion. The first is the importance of keeping all your threads busy. Fine grain load imbalance is a performance killer because your thread hardware is sitting idle. Choosing algorithms that reduce imbalance is crucial. Second, managing communication is equally important. Communicating through registers is faster than communicating through shared memory, as we see here, which is in turn faster than communicating with the host.

## Calculating Distance From the Root

So, let's state the specific problem that we want to solve today, and then we'll look at how we might solve it. We want to start at a specific node, and we'll call it s. Then we'll visit every node in the graph and record its distance from s in terms of graph hops. We call this quantity its depth. So any neighbor of s has depth 1. Any neighbor of those neighbors that we haven't visited yet has depth 2. And in general, a neighbor of a node with depth d that hasn't already been visited has depth d plus 1. I'm not sure how familiar our international audience is with the actor Kevin Bacon, but the Bacon number of a particular actor is the number of degrees of separation from Kevin Bacon. So, if I had acted in a movie with, say, John Belushi, my Bacon number would be two, because John Belushi's Bacon number is one, because he appeared with Kevin Bacon in the outstanding film Animal House. We could generate the Bacon numbers for all movie actors from a graph that records who appeared in movies together, and we could do this with a breadth-first traversal that begins at Kevin Bacon and calculates the depth for each vertex, or actor, in the graph.

## Minimum And Maximum Possible Depth From Root

So, let's make sure we understand BFS and the problem that we're going to solve. So, if we have a graph, an arbitrary graph with n nodes, what is the minimum possible maximum depth of that graph, and what is the maximum possible maximum depth of that graph? And by that, I mean, we can draw any graph we want, okay? And we would like to draw it to minimize the biggest number that we write in here. Or, we'd like to draw a different graph with N nodes and we'd like to maximize the biggest number that we write in here. So, I'd like you to express these as functions of N.

## Minimum And Maximum Possible Depth From Root

So the minimum possible maximum depth is 1. Say we start with a node here that we're going to number 0, and then we have a graph where every single node is attached to that root node. Well, the depth of each one of these is going to be 1. On the other hand, the maximum possible maximum depth is going to be N minus 1. So what does our graph look like there? We start with the root node, and then we have a linear chain of graph nodes all the way out to depth N minus 1. Now, it's important for you to realize these are two very different graphs from the point of view of parallelization. This linear chain here is very hard to parallelize if we're doing a BFS. It's going to take us order of N steps to get from the start node all the way to the end node; we're not going to get a lot of parallelism out of that. And clever students are going to realize this is the worst-case step complexity of the BFS. But if we have a graph that looks like this, a star-shaped graph, then parallelization is a lot more straightforward. We can process each one of those leaves independently, and in parallel.

## BFS try #1

So let's think of a good parallel algorithm to solve this problem. Here are some of the characteristics that we want out of our algorithm. We want lots of parallelism, coalesced memory access, minimal execution divergence, and something that's easy to implement. And I want to show you an algorithm that gives you all of those characteristics, and then I'm going to tell you why it's a crummy algorithm. So, as we're going along, I want you to think about why this algorithm isn't going to be good enough. Here's what we're going to do. First I'm going to describe the algorithm at a high level, then we're going to do an example, then we're going to look at code. At a high level, we're going to operate over our list of edges, and we're going to iterate n times, where n is the maximum depth of the graph. On each iteration, in parallel, we can look at each edge independently. And if one end of that edge has a depth d, but we haven't visited the vertex on the other side of the edge yet, then we set that vertex to depth d plus 1. So the first iteration will set all the vertices that have depth 1, the next iteration will set all the vertices with depth 2, and so on. So let's do an example.

## BFS Try #1 - An example

So here's a simple graph. We're going to start here with node number 2, and we're going to set the depth there equal to 0. And our goal is to find these depths for each of these vertices. Originally, none of these are set; we're just going to say that's the value for "hasn't been visited yet". And we're going to begin by setting vertex number 2 to the value 0. We note that the furthest vertex away from node 2 is two hops away, so the maximum depth here is 2, and we should expect to iterate twice. So, we're going to begin the algorithm by looking at all of these edges in parallel. And what we're looking for is a pattern where one of the vertices of an edge has a depth value set, and the other one does not. And we're going to find that three edges have this particular quality. One of them is the edge between 1 and 2: we've visited 2, whose value is 0, but we haven't visited 1 yet, so we'll set 1's value equal to 0 plus 1, which is 1. We'll do the same thing for this edge here between 2 and 3, okay? We visited 2, its value was 0, we haven't visited 3, so we set its value to be 1. And the same for this edge here between 2 and 6. Now we've completed the first iteration: we visited every single edge, running six threads in parallel to make this particular determination. Now we begin our second iteration. Again, what we're looking for is an edge where the vertex value of one end of the edge has been set and the other one has not yet been visited. We're going to find that's true for the other three edges. For the edge between 0 and 1, we find that the depth at vertex number 1 is 1, and we find that vertex 0 has not yet been visited. So we set its depth to 1 plus 1, which equals 2. And the same is going to be true for the edge between 3 and 4, and the edge between 5 and 6, okay? Now we've filled in a value for all of our vertices here, and we're thus complete.

## What is the work complexity

So let's analyze the work complexity here. The work complexity is possibly a function of both the number of vertices V and the number of edges E. So, for the worst case, what is the portion of the work complexity that is a function of V, and what is the portion that's a function of E? Your choices as a function of V are constant, order of V, order of V log V, or order of V squared. Similarly for E: constant, order of 1; order of log E; order of E; order of E log E; or quadratic in E, order of E squared.

## What is the work complexity

So, let's think about the maximum number of iterations that we might have. What graph is going to give us the most iterations? That graph is going to be this awful-looking linear chain. How many iterations is that going to have? Well, it's going to have order of V iterations, because we have V vertices here. And how much work are we going to do in each one of those iterations? Well, in each one of these iterations, we know that we will visit all E edges. So, if we take the amount of work here, the product between V and E, we know that our final answer is going to be on the order of V times E. So, we know that it's at least going to be on the order of V, and we know that it's at least going to be on the order of E. In any reasonable graph, there are going to be more edges than vertices; otherwise, it's not a very interesting graph. And if that's the case, we know that the overall work complexity is going to be at least order of V squared. It's quadratic in the number of vertices, and that's bad; that is a very bad property. So, I told you at the beginning of the discussion that this was a crummy algorithm, and this is why: we really do not want an algorithm that is quadratic in the number of vertices.

## Let's Look at the BFS Code Part 1

Okay. Now, let's look at some code. It's going to take us two kernels to be able to run this. The first one is going to be an initialization kernel that we run once, and then we're going to show our iteration kernel that we're going to run on every iteration of the loop. Now, the initialization kernel is extremely simple. It's just a map over all the vertices, and it's simply going to initialize the vertex array. We get the vertex array as an argument, we know how long it is, and we know which vertex is the starting vertex. All we're going to do is calculate a global index for our particular thread, and then, if our thread is the starting vertex, we're going to set its depth to 0. Otherwise, we're going to set it to negative 1, and here, we're going to designate negative 1 as our not-visited value. Okay. This is very simple, an easy map operation. The more interesting one is the one we'll show next, which is the kernel that we're going to run on every iteration.
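Here is a sequential C sketch of that initialization kernel. The loop body is what one CUDA thread would do for one vertex; this is illustrative code standing in for the slide, not the course's exact kernel.

```c
#include <assert.h>

/* Map over all vertices: the starting vertex gets depth 0, every other
   vertex gets -1, our designated "not visited" value. On the GPU, each
   loop iteration would be one thread, with v computed from the thread's
   global index. */
void bfs_init(int *depth, int num_vertices, int start_vertex) {
    for (int v = 0; v < num_vertices; ++v)
        depth[v] = (v == start_vertex) ? 0 : -1;
}
```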

## Let's Look at the BFS Code Part 2

Okay, so here's the more interesting kernel. This is the kernel that we're going to run on our iterations; each iteration is going to launch this kernel over E threads. Each of those threads is associated with one edge and can process that edge completely in parallel. So again, this is a map operation; it's something that maps very nicely to the GPU. The first thing we're going to do is calculate a global index for our particular edge, so that our thread is associated with one and only one edge. Then, we're going to compute a few quantities: what is the vertex at one end of my edge, and what is the depth of that vertex? What is the vertex at the second end of my edge, and what is the depth of that second vertex? Then, we're going to do the logic that we care about. If we have visited the first vertex in our edge, but we haven't visited the second vertex in our edge (we check the first depth against the current depth, and make sure that we haven't visited the second vertex yet), then we set the depth of the second vertex equal to the depth of the first vertex plus one.

## Let's Look at the BFS Code Part 3

So, the converse case is fairly straightforward. Now, instead of checking the first vertex against the current depth, we're checking the second, and we know the first has not been visited yet. If that's the case, then we're going to set the depth of the first to the depth of the second plus 1. Now, there are two other cases: one is if we've visited both vertices, and the other is if we've visited neither vertex. You should convince yourself that we can ignore these cases; we have no operations to do if either of them is true.

## Do We Have to Worry About Race Conditions

Just a couple of implementation notes, and first a quick mental exercise. What happens if two different paths reach the same vertex at the same time? Let's consider this graph here. We're going to start at node s here and set its depth to zero. On the first iteration of our algorithm, the depth of a will be set to one and the depth of b will be set to one. But we know that in the next iteration of the algorithm, this edge and this edge will both try to set c's depth. Is this a race condition? Does the result depend on which thread happens to run first? Well, fortunately no, because the algorithm is structured to always write only a single depth value on any iteration. So if two threads are both trying to write to the same vertex within a single iteration, they'll always both be writing the same depth. In this case, they'll both be trying to write the value 2. So it doesn't really matter who goes first or second, because the end result, the depth of c, is going to be the same. Second, how do we know when we're done? Typically, the way we're going to write this code is as follows. We're going to check a variable to see if we're done, and we're going to check that variable on the host. That variable is going to be set on the device; the kernel needs to set that variable. Then we'll copy it at the end of every iteration to the host. The host will check if we're done. If we're not done, we're going to iterate again. If we are done, then we'll fall through the loop and move on to the next code. So how do we actually set this variable? In this case, what we're going to do is first initialize a device variable called done to true. The way we're going to initialize it is to have a variable on the host we know is true and just copy it over to the device, so it's set as true at the beginning. Then we're going to run our iteration.
We run this particular kernel, and if the kernel knows that it's not done yet, it sets this done variable to false. Then, whatever the state of the done variable is, we copy it back to the host, so the host knows whether we're complete and can make a decision accordingly. So how are we going to actually do this on a GPU? What's inside the kernel? This is an enhancement to the code that we just saw on the previous slide: every time we set any vertex depth on the GPU, we also set the global variable done to false. It's initialized to true, so if we make no changes at all in this BFS code, it stays true and we're done. But if we make any changes at all, we set that global variable to false, copy that false value back to the host, and get another iteration. Note that more than one thread can set done to false, but that's fine; it doesn't matter in what order these writes occur, since they're all writing the same value.
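The whole handshake can be sketched serially like this (a CPU stand-in for the GPU control flow; on the GPU, `done` lives in device memory, the host copies true in before the kernel launch, and copies the flag back out afterward):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Serial sketch of the host-controlled loop described above. On the GPU,
// `done` would be a device variable: the host copies `true` over before
// launching the kernel, any thread that sets a depth also stores
// `done = false` (a benign race, since every writer stores the same value),
// and the host copies the flag back to decide whether to iterate again.
void bfs(const std::vector<std::pair<int, int>>& edges,
         std::vector<int>& depth)
{
    bool done = false;
    while (!done) {
        done = true;                       // host: assume this pass is the last
        std::vector<int> next = depth;     // threads read pre-iteration depths
        for (const auto& [u, v] : edges) { // "kernel": one thread per edge
            if (depth[u] != -1 && depth[v] == -1) {
                next[v] = depth[u] + 1;
                done = false;              // kernel: we changed something
            } else if (depth[v] != -1 && depth[u] == -1) {
                next[u] = depth[v] + 1;
                done = false;
            }
        }
        depth.swap(next);                  // host: check the copied-back flag
    }
}
```

When a full pass over the edges changes nothing, the flag stays true and the host falls out of the loop.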

## BFS Try #1 Summary

So let's summarize our breadth-first search try number 1. This algorithm has some great properties. It's completely parallel, and there's no direct communication or synchronization between threads at all. There's very little thread divergence. The memory behavior isn't ideal, since it involves gathers and scatters, but it's not bad. Unfortunately, the host has to control the iterations, which isn't super attractive, though we'll look at a new feature in CUDA 5 in the next unit, called dynamic parallelism, that allows the GPU to control the iterations directly. But the real problem is the quadratic complexity. We have perfectly data-parallel work, but we're doing way too much of it. This approach will scale poorly as we process larger and larger graphs, and we are not cool with that at all.

## Problem Set 6

In Problem Set 6, you will be implementing a fast parallel algorithm for seamless image cloning. Seamless image cloning is a modern algorithm for pasting one image into another in a very subtle way, and here is an example: you have a bear and a swimming pool, and with seamless image cloning, you now have a polar bear in a swimming pool.

Our algorithm for seamless image cloning uses a technique called double buffering to iteratively compute a solution. The main idea behind double buffering, in the context of this homework, is that we store an approximate solution to a problem in one buffer, then use that approximate solution to compute a better solution, which we store in another buffer. Then we go in the other direction, computing an even better solution and storing it back in the first buffer. After going back and forth many times, we arrive at an ideal solution to our problem. This style of computation arises in many parallel algorithms, which is why we think it will make a really interesting homework assignment.

Before we implement seamless image cloning, let's take a closer look at our inputs. We are given as inputs a source image, a target image, and a mask. We want to copy the masked values from the source image into the target image in a subtle way. To do this, we'll fix the values on the inside border of the masked region to be the same as the target image; then we'll solve for the unknown pixel values in the strict interior of the masked region.

So how do we find these unknown pixel values in the strict interior of the masked region? We can do that as follows. First, we make a guess for the unknown pixel values; that's called our initial guess, I_0. I_0 doesn't have to be a very good guess, so we can set I_0 to be all zeros in the strict interior portion of our masked region.
But note that all zeros would be a very bad final answer: it would not be very subtle, and it also would not conform to the source image very well. Now, suppose we are given a guessed solution I_k to our problem. We can compute a better guess, I_(k+1), as follows. For some pixel i_(k+1) in I_(k+1), in the strict interior region of the mask, we compute the value of i_(k+1) as (a + b + c) / d. Let's go through a, b, c, and d one at a time. a is the sum of i_k's neighbors, where we don't include i_k itself in the sum, and we only include i_k's neighbors if they are in the strict interior portion of the masked region. b is the sum of t's neighbors, where t is the corresponding pixel value in the target image (the target image in our case is the swimming pool), and we only include t's neighbors if they are on the inside border of the masked region. c is the sum of the differences between s and s's neighbors, where s is the corresponding pixel value in the source image (the source image in our case is the polar bear). And finally, d is the number of i_k's neighbors; i_k can have fewer than 4 neighbors if it is on the border of the image.
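As a rough serial sketch of this update for a single pixel (single channel; the function and array names here are our own, and the problem set's actual interface will differ):

```cpp
#include <cassert>
#include <vector>

// Serial sketch of one Jacobi update for a single strict-interior pixel,
// following the (a + b + c) / d recipe above. `interior[p]` is true for
// strict-interior pixels of the mask, `border[p]` for inside-border pixels,
// which stay fixed at their target-image values.
float jacobi_pixel(int x, int y, int w, int h,
                   const std::vector<float>& guess,   // I_k
                   const std::vector<float>& source,  // s (the polar bear)
                   const std::vector<float>& target,  // t (the swimming pool)
                   const std::vector<bool>& interior,
                   const std::vector<bool>& border)
{
    const int dx[4] = {-1, 1, 0, 0};
    const int dy[4] = {0, 0, -1, 1};
    const int p = y * w + x;
    float a = 0.f, b = 0.f, c = 0.f;
    int d = 0;  // neighbors inside the image: fewer than 4 at image edges
    for (int k = 0; k < 4; ++k) {
        int nx = x + dx[k], ny = y + dy[k];
        if (nx < 0 || nx >= w || ny < 0 || ny >= h) continue;
        int n = ny * w + nx;
        ++d;
        if (interior[n])    a += guess[n];   // a: I_k at interior neighbors
        else if (border[n]) b += target[n];  // b: target at border neighbors
        c += source[p] - source[n];          // c: source gradient term
    }
    return (a + b + c) / d;
}
```

On the GPU, one thread would compute each interior pixel, reading I_k from one buffer and writing i_(k+1) into the other; the two buffers are then swapped between iterations, which is the double buffering described above.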