cs344

Contents

- 1 Quadratic GPU vs Serial CPU Implementation
- 2 How to Represent Graph Data Structure
- 3 How Does This Algorithm Work Part 1
- 4 How Does This Algorithm Work Part 2
- 5 How Does This Algorithm Work Part 3
- 6 How Does This Algorithm Work Part 4
- 7 How Does This Algorithm Work Part 5
- 8 Recap of the Algorithm
- 9 Merrill's Linear-Complexity BFS on GPUs Part 1
- 10 Merrill's Linear-Complexity BFS on GPUs Part 2
- 11 Merrill's Linear-Complexity BFS on GPUs Part 3
- 12 List Ranking Part 1
- 13 List Ranking Part 2
- 14 List Ranking Part 3
- 15 List Ranking Part 4
- 16 List Ranking Part 5
- 17 Hash Table on CPUs
- 18 Ideal Keys Per Bucket
- 19 Why Chaining Is Bad
- 20 Chaining Is Bad for Hash Table Construction
- 21 Cuckoo Hashing Part 1
- 22 Cuckoo Hashing Part 2
- 23 Cuckoo Hashing Part 3
- 24 Conclusion
- 25 Congratulations

So instead, we're going to look at another algorithm, one that is far better in terms of work complexity. This implementation was written by Duane Merrill and his colleagues and published in 2012. What's inefficient about the previous algorithm is that we visit the same edge over and over again on each iteration, but we only set its depth once. If we could instead visit an edge only when it's ready to be processed, and never visit it again, we'd be much more efficient. So, let's go back to the serial algorithm and think about why it's an efficient algorithm. It's efficient because it tries to minimize the number of visits to nodes, and it does so by maintaining a frontier that marks the boundary between visited and unvisited nodes. At any given time, only the nodes that border the frontier are subject to computation. This is different from our brute-force n-squared algorithm, where every node is touched on every iteration. So how do we implement something like the frontier on the GPU?

So, I'm just going to sketch out at a high level how Merrill chooses to do this. First we're going to look at the data structure for a graph. We're going to store the graph with a CSR-like (compressed sparse row) structure, similar to how we'd store a sparse matrix. So, let's look at this graph here. Notice it has nine nodes, and it has links that are either unidirectional or bidirectional, and we're going to consider beginning our breadth-first traversal with this particular node right here, node 0. We're going to store two arrays to represent the graph. The first, called C, tells us the neighbors for each node. Node 0 has neighbors 1 and 3, so we see 1 and 3. Node 1 has neighbors 0, 2, and 4, so we see 0, 2, and 4, and so on. The second array, called R, tells us the starting location of each vertex's neighbors. Node 0's neighbors start at offset 0, node 1's neighbors start at offset 2, node 2's neighbors start at offset 5, and so on. As a function of v, the number of vertices, and e, the number of edges in the graph, how long are C and R, in terms of the number of elements in each array?

C has one entry per edge, so it is e elements long. R, on the other hand, has one entry per vertex, so it's v long, although we're going to put one additional element at the end, and we'll see why in a minute. So either v or v+1 is an accurate answer. We're also going to store a depth array, D, initialized to -1, that indicates the depth for each node in the graph.
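As a concrete sketch, here is the CSR-style layout described above in plain Python. The lecture's full 9-node graph isn't reproduced here, so this uses a truncated 5-node stand-in whose offsets match the values quoted in the lecture (node 1's neighbors start at offset 2, node 2's at offset 5); the rest of the adjacency lists are illustrative assumptions.

```python
# CSR-style graph layout: C holds concatenated neighbor lists (length e),
# R holds each vertex's starting offset plus one extra entry (length v + 1).
adjacency = {
    0: [1, 3],       # node 0's neighbors, as in the lecture
    1: [0, 2, 4],    # node 1's neighbors, as in the lecture
    2: [],           # illustrative: empty so node 3's list starts at 5
    3: [4],          # illustrative single neighbor
    4: [],
}

num_vertices = len(adjacency)

# C: all neighbor lists concatenated -> one entry per edge.
C = [w for v in range(num_vertices) for w in adjacency[v]]

# R: starting offset of each vertex's neighbor list, with one extra
# element at the end so R[v + 1] is always defined.
R = [0]
for v in range(num_vertices):
    R.append(R[-1] + len(adjacency[v]))

# D: depth array, initialized to -1 (unvisited).
D = [-1] * num_vertices

print(C)  # [1, 3, 0, 2, 4, 4]
print(R)  # [0, 2, 5, 5, 6, 6]
```

Note that R really is v + 1 elements long; the extra trailing entry is what makes the per-vertex degree computation in the next step work without a special case for the last vertex.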

To show how this algorithm works, let's assume that we have a list of nodes that are on the current frontier. What we want to find are all nodes that are one hop away from the frontier. So, let's use this example here, and we're going to assume that we're one step into the breadth-first search, so the frontier is nodes 1 and 3. Now we're going to walk through the algorithm. Step 1: in parallel, for each node in the frontier, find the starting point of its neighbors. This is pretty simple. For vertex v, that's just R of v. In our example, node 1's neighbors start at offset 2 and node 3's neighbors start at offset 5.

Step 2 is slightly more complex, but still straightforward. You only need the array R. How do you compute the number of neighbors per vertex using array R and vertex number v?

So, we simply take the difference between the next vertex's entry in the R array and this vertex's entry, and that tells us how many neighbors this particular vertex has. So, it's just R of v plus 1, minus R of v. Earlier, we said the R array was one element longer than we needed, and this is the reason why: what's the starting point of the next vertex? Subtract my starting point. So, in our example, node 1 has three neighbors, 5 minus 2, and node 3 has one neighbor.
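The degree formula above is a one-liner. The R values below are a truncated stand-in consistent with the offsets the lecture quotes (node 1's neighbors start at 2, node 3 has one neighbor); the remaining entries are assumptions for illustration.

```python
# Number of neighbors of vertex v from the R array alone: R[v+1] - R[v].
R = [0, 2, 5, 5, 6, 6]

def num_neighbors(R, v):
    # Start of the next vertex's list minus the start of v's list.
    return R[v + 1] - R[v]

print(num_neighbors(R, 1))  # 3, matching node 1's three neighbors (5 - 2)
print(num_neighbors(R, 3))  # 1, matching node 3's single neighbor
```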

Step 3: allocate space to store the new frontier. So we have an empty array where we can copy the new frontier, but each node needs to know where in that array it should copy its edge list. We covered this operation in Unit 4. It's called

Allocate. And remember, this is based on scan: we just scan the number of neighbors. In our example, we begin with the input array 3, 1, because vertex 1 has three neighbors and vertex 3 has one neighbor. We scan this array with an exclusive sum scan, and we get the resulting array 0, 3. Node 1 knows to start writing its edge list at offset 0; node 3 knows to start writing its edge list at offset 3.
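The allocate step above can be sketched with a plain exclusive sum scan (a serial stand-in for the parallel scan primitive from Unit 4):

```python
# Exclusive sum scan: each frontier node's write offset is the sum of
# the neighbor counts of all nodes before it.
def exclusive_scan(xs):
    out, running = [], 0
    for x in xs:
        out.append(running)
        running += x
    return out

# Lecture example: vertex 1 has 3 neighbors, vertex 3 has 1.
neighbor_counts = [3, 1]
offsets = exclusive_scan(neighbor_counts)
print(offsets)  # [0, 3]: node 1 writes at offset 0, node 3 at offset 3
```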

Step 4: copy each active node's edge list to this array. So, vertex 1 copies its three-element neighbor list into the output array, starting at offset 0, and vertex 3 copies its only neighbor to offset 3. So, the potential new frontier here is nodes 0, 2, 4, and 4.

But what if some of these nodes have already been visited? Well, we're going to look in the D array, the depth array, for each element, and we can mark vertex 0 as already visited because it has a depth of 0. All the rest we're going to keep, because they have depth -1. So what we need is an operation where we take a vector of elements and a set of trues and falses, and emit only the elements that are associated with trues. What do we call this operation?

And that's compact. Now, the sharp-eyed viewer might note that we're going to see vertex 4 twice in the next frontier. The algorithm is going to work correctly with duplicates; it's just wasteful. Removing them turns out to be a tricky problem to solve in parallel, and it's beyond the scope of this talk, but you might consult Merrill's paper for a couple of interesting solutions.

Finally, we initialize the algorithm by setting the starting node's depth to 0, and the initial frontier to that node's neighbor list. So, let's recap the big idea here. The basic step is a sparse copy. We start with a frontier, we look up the adjacency lists for the vertices in that frontier, and then we copy them into a contiguous block to make the next frontier, and then repeat. The result is a really fast breadth-first search, one that runs in linear work and achieves on the order of 3.3 billion traversed edges per second on a GPU. This is conservatively 4 times faster than an optimized CPU implementation.
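The whole loop just recapped can be sketched serially. This is not Merrill's CUDA implementation, only a CPU simulation of the logic: gather each frontier node's edge list (steps 1-4), then compact out already-visited nodes; a real GPU version does each step with parallel map, scan, and compact primitives. The tiny graph at the bottom is a made-up example.

```python
# Serial sketch of frontier-based BFS over a CSR graph (R, C).
def bfs_csr(R, C, source):
    n = len(R) - 1
    D = [-1] * n                # depth array, -1 = unvisited
    D[source] = 0
    frontier = C[R[source]:R[source + 1]]  # source's neighbor list
    depth = 1
    while frontier:
        next_frontier = []
        for v in frontier:
            if D[v] == -1:      # "compact": drop visited nodes (and dups)
                D[v] = depth
                # copy v's edge list into the candidate next frontier
                next_frontier.extend(C[R[v]:R[v + 1]])
        frontier = next_frontier
        depth += 1
    return D

# Hypothetical graph: 0 -> 1,2; 1 -> 3; 2 -> 3; 3 -> (nothing)
R = [0, 2, 3, 4, 4]
C = [1, 2, 3, 3]
print(bfs_csr(R, C, 0))  # [0, 1, 1, 2]
```

Note how node 3 appears twice in one candidate frontier, exactly the duplicate situation described above: wasteful, but harmless for correctness.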

So in the discussion on graphs, I made the following statement. Let me quote myself: if we had a graph that was just a linear chain, the last node would be node n minus 1. The linear chain is very hard to parallelize. If we're doing a BFS here, it's going to take us order of n steps to get to the end of the graph, end quote. Let me state this problem another way. I have n nodes in a linear chain, and each node knows the ID of the next node in the chain. This is just a simple linked list. What is the algorithm for every node to find the end of the list? And of course we can solve this in n steps. Let's say that each node has a next pointer, and I've shown those in blue, that points to the next node in the chain, and the last node has next equals null. Now, we don't want to change the next pointers at all, or else we'll lose the structure of our list, so we're going to assume they're read-only. We're also going to store a second pointer per node that we can change. For historical purposes, we'll call this pointer chum, and we're going to designate it in red. At the end of the algorithm, we want each node's chum pointer to point to the last node in the chain. So the straightforward algorithm is, on each iteration, on each node, set chum to chum's next, until we reach a node where next is null. We'll start off by making all chum pointers point to their own nodes; that's how we're going to initialize them. And again, on each iteration, we will set chum to the next pointer. So on the first iteration it's going to look like this. Now we're going to do another iteration: for any particular node, we follow chum, and then next. So, on the second iteration, it's going to look like this, and so on and so on. On each iteration, the length of the chum pointer is going to be one more than it was on the iteration before. So the important question is, can we do better? And it turns out the answer is yes.
This algorithm, described by Danny Hillis and Guy Steele in 1986 (not discovered by them, but described very nicely), is so cool that it's one of the reasons I decided to do parallel computing in the first place. So let's analyze the complexity of the algorithm we just described. Clearly, a serial processor can do this computation in n steps and order of n work. How about the parallel processor? We know that it takes n steps. How much work? Your choices are order of n, n log n, n square root of n, or n squared.

And the answer is, order of n squared work. We have all n processors working and it's going to take n iterations to complete, so that's n squared work overall.

Way back in Unit 1, when we talked about step and work complexity, we said that we would love to find an algorithm that has the same work complexity as the serial algorithm but a smaller step complexity. Reduce would be a good example here, or scan. But we also said that if we can't do that, sometimes we're willing to accept more work if it gets us fewer steps. And that's what we are going to do here. But how? This is not intuitive, or at least it isn't for most people, and it certainly wasn't for me when I learned this material. Hillis and Steele first expressed skepticism that they could improve on quadratic work, but then concluded, and I am quoting from their paper, essentially we overlooked the power of having many processors working on the problem at once. So at a high level, here's what we are going to do. Every node starts by knowing the node that's one hop away; that's the next pointer, in blue. So on the next iteration, we can visit our next pointer's next pointer, and get two hops away. And then, on the next iteration, we can get to three hops away, and so on. That's what I am showing here when I say straightforward approach: as we increase the number of iterations, we are also increasing the number of hops away. But that's the wrong way to think about it. If we just did it that way, we'd be repeating a lot of work that the nodes down the chain are doing, and we would have quadratic work complexity. Instead, after the first iteration, we have each node knowing the node that is two hops away. So, let's say we're interested in this node, and we know that we are pointing to node x here. Normally, what we'd do is take the red pointer here, and then change our red pointer to be red pointer plus blue pointer. So, we'd be moving from knowing the node two hops away to knowing the node three hops away. But look: x also knows the node that is two hops away. So, I know the node that's two hops away, and x knows the node that's two hops away.
So, on the second iteration, we can leverage the work that x already did on its first iteration to get a pointer that is now four hops away. We can go red, then red, to set our new red pointer here to be the chum of the chum. Now we have a pointer to a node that is four hops away, which itself has a pointer to a node that is four hops away, and so the next iteration will give us a pointer that is eight hops away, and so on. So now, instead of going 1, 2, 3, 4, we're going 1, 2, 4, 8. So, for n nodes, as a function of n, how many steps is this going to take?

And the answer is log n. How cool is that? We're now doing more work, n operations on each of the log n steps, so O(n log n) overall. But we'll finish in log n steps instead of n steps, and I find this pretty amazing. What a beautiful algorithm.

So now, let's write the code. You've gotta put up with my coding handwriting here. What we're going to do is initialize k to our global thread index. We're going to initialize chum to our next pointer, and then, we're going to loop. We're going to run this routine until we run into the end. And the question is, what do we do at each stage? What do we set the new chum of k to?

And the answer is, you set it to the chum of the chum of k.
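The kernel just sketched can be simulated serially. Each round applies chum[k] = chum[chum[k]] to every node at once (reading the old array and writing a new one, to mimic the synchronous parallel update); None stands in for the null next pointer of the last node.

```python
# Hillis/Steele-style pointer jumping: every node's chum pointer doubles
# its reach each round, so all nodes find the end of the chain in
# O(log n) rounds instead of n.
def find_list_end(next_ptr):
    n = len(next_ptr)
    chum = list(next_ptr)            # initialize chum to the next pointer
    for _ in range(n.bit_length()):  # ceil(log2 n) rounds are enough
        # Synchronous update: chum[k] = chum[chum[k]], unless we'd run
        # off the end of the chain, in which case chum stays put.
        chum = [chum[c] if c is not None and chum[c] is not None else c
                for c in chum]
    return chum

# Chain 0 -> 1 -> 2 -> 3 -> 4 -> null.
print(find_list_end([1, 2, 3, 4, None]))  # [4, 4, 4, 4, None]
```

After the loop, every node's chum points at the last node (the last node keeps None), matching the goal stated above.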

So let's try a slightly harder algorithm called list ranking. In list ranking, each node has the index of its successor in the list, and we know the first element in the output. What we want to do is put the nodes in order. This has a number of uses; in our lab, we've used it to decompress data that's been compressed with bzip2. So let's take a look at an example. We've got ten nodes here. We know that the output is going to begin with node 0, and the successor to node 0, one farther in the chain, is node 5. The successor to node 5 is node 2, the successor to node 2 is node 7, and so on. So this is our input, the array 5, 6, 7, 8, 9, 2, 3, 4, 0, 1. What's our output going to be? It's going to be the chain 0, 5, 2, 7, 4, 9, 1, 6, 3, 8. So here's our input and here's the output. Now, you'll note that the list is actually circular, so it's necessary to designate the starting point, and in this case we're saying that node 0 is the starting point. Of course, a serial processor could do this in n steps. The question is, how can we make it work with parallel hardware in a smaller number of steps?

We're going to sketch out a solution that uses a similar logarithmic structure to the previous problem. I'm going to write it in a different way, though, using a table. So, recall, in the previous algorithm, we began with each node knowing its neighbor one hop away. In the first step, each node learned the node that was two hops away; then, in the next step, four hops away, and so on. We're going to use a similar approach here, and the algorithm's going to proceed in two phases. The first phase is filling in this table, so that every node knows the nodes that are 1, 2, 4, and 8 hops away. This is functionally equivalent to the find-the-last-element-in-the-linked-list problem that we just saw, except that now we're treating the list as circular, so we wrap around. With each step of this phase, we begin by knowing the node that's k hops away, for instance k equals 1, and then we compute, per node, the node that is 2k hops away. So, let's begin: every node starts by knowing the ID of the node that's exactly one hop away.

So let's see how to compute the node that's two hops away. We simply look at our node that's one hop away and take its node that's one hop away. And we'll continue as long as the number of hops away isn't greater than the number of nodes. So let's do an example here. We know that the successor, the node that's one hop away from node 0, is node 5. If we want to compute the node that's two hops away from node 1, we see that the successor to 1 is 6, and the successor to 6 is node 3, so we'll fill in a 3 here. The successor to node 2 is 7, and the successor to node 7 is node 4, so we'll fill in a 4 here. And notice this is a perfectly data-parallel operation: each one of these vertices can do this computation completely in parallel.

Okay. So, the next step of the algorithm is to compute the nodes that are 4 hops away, and we can do this in a similar way. Say we're starting with node 0. We know the node that is 2 hops away from node 0 is node 2, and the node that is 2 hops away from node 2 is node 4, so we'll fill in a 4 here. To find the node that is 4 hops away from node 1, we know the node that's 2 hops away is node 3, and the node that's 2 hops away from it is node 0, and so on. Then we'll do the same for the last line here and find the nodes that are 8 hops away from each of these starting nodes. And we'll continue this process as long as the number of hops away isn't greater than the number of nodes. In our example here, for instance, we have 10 nodes, so we will compute the nodes that are 1, 2, 4, and 8 hops away, but if we computed more, we would be going all the way around the list and beyond. How much work does it take to compute this entire table for n nodes: proportional to n, n log n, n squared, or n cubed?

Well, let's take a look at the dimensions of this table. Every node participates in every step, so each step takes order of n work. And we're doubling the hop count on each step, thus there are log n steps. So, it takes n log n work to construct the entire table. Note, this is more expensive than the serial algorithm, which takes linear work, order of n work. However, the serial algorithm also takes n steps, whereas we're finishing in log n steps here.
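Phase 1, the table construction, can be sketched like this, using the lecture's 10-node example. Each new row is built from the previous one by the data-parallel rule "2k hops away = k hops past the node k hops away"; here the per-node updates are simulated with a list comprehension.

```python
# Build the table of nodes 1, 2, 4, 8, ... hops away for a circular list.
# table[j][i] = the node 2**j hops past node i.
def build_hop_table(succ):
    n = len(succ)
    table = [list(succ)]                 # first row: one hop away
    hop = 1
    while hop * 2 <= n:                  # stop before wrapping past n hops
        prev = table[-1]
        # Each node jumps through the node `hop` away (data parallel).
        table.append([prev[prev[i]] for i in range(n)])
        hop *= 2
    return table

succ = [5, 6, 7, 8, 9, 2, 3, 4, 0, 1]    # the lecture's input array
table = build_hop_table(succ)
print(table[1])  # two hops away: [2, 3, 4, 0, 1, 7, 8, 9, 5, 6]
```

You can check this against the walkthrough above: two hops from node 1 is 3, two hops from node 2 is 4, and four hops from node 0 is 4.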

Now the second phase of this algorithm uses this table to construct the output order. What we're looking to generate is the list of nodes in order, and we computed this before; we see it over here: 0, 5, 2, 7, 4, 9, 1, 6, 3, 8. Remember, we can get that by using the serial algorithm, just to convince ourselves that it's correct: we start at node 0, its successor is 5, the successor to 5 is 2, the successor to 2 is 7, and so on. To generate this answer, each node is going to compute its position in the output list, and we're going to call that position outpos; then, when we're done, we can use that position to scatter the node into its proper location in the output list. So, just to understand what outpos is going to look like: node 0 ends up at location 0 in the output list, so its outpos had better be 0. Node 5 ends up at the next location in the output list, so it's going to get outpos equals 1. Node 2 was next, and then node 7, and so on. So this is what we're trying to compute here: outpos. When we start, all we know is that node 0 is in position 0, so we can fill in that outpos right away. Now that node 0 has calculated where it will end up in the output, we're going to consider it awake, but all the other nodes are asleep. A node is awake if it has filled in its value for outpos; otherwise it's asleep. So now we're going to do several iterations of an algorithm that will eventually wake up all the nodes and calculate their output positions. On each iteration, we launch n threads, one per element here, but if a thread is asleep, we immediately return from that thread. Only awake threads do any work at all. So for the first iteration, node 0 wants to wake up its successor. How does it find the successor? Well, it can look in this plus-1 array to find its immediate successor, and so we can immediately see that's node 5. So where will node 5 write its output?
That's what we need to fill in here for outpos. It knows that it's coming after node 0, right here, and it knows that it's coming one hop past node 0. So we can add these two values together to see that node 5 will write its output to position 1 in the output array. And now we have two nodes that are awake: node 0 and node 5. Now, both awake nodes are going to wake up another node and help those nodes calculate their output positions. This time, instead of looking at the immediate successors, plus 1, we're going to look at the next line, plus 2, and use it to find the nodes two hops away from each of our awake nodes. So node 0 is going to wake up node 2. And where does node 2 write its output? Well, it'll be the sum of where node 0 is, at output location 0, plus the two hops away, so we know that node 2 will be written to position 2. Similarly, node 5 is going to wake up the node that's two hops away from it, in this case node 7. Where will node 7 write its output? It starts from node 5 at position 1, adds 2 hops, and gets position 3. On the next iteration, we use the next line of the array, the plus-4 array, and each of the 4 awake nodes, at output positions 0, 1, 2, and 3, wakes up the node that's 4 hops away. This allows us to fill in 4 more values in the outpos array: node 4 gets position 4, node 9 gets position 5, node 1 gets position 6, and node 6 gets position 7. And the final iteration, using the plus-8 array, lets us fill in node 3 at outpos 8 and node 8 at outpos 9. So the final step is to use outpos to scatter the nodes to their destinations. Let's see how that works. We start by looking at the first node, and we know that node 0 is going to end up in output position 0.
So we will scatter this node 0 to output location 0 and write the 0. The next node, node 1, is going to end up in output position 6, and so on. When we do all of these scatter operations, we come up with the nodes in the correct order. So, as a function of n, how many steps does this take: square root of n, log n, n, or n squared?

Well, the answer again is log n. Just like in the previous phase, we're doubling the hop count on each step, so it will take log n steps to fill in all the output positions. And if you count both awake and asleep threads as doing one unit of work each iteration, we do n log n work overall. The big idea here is that this entire algorithm is a good example of trading off work for steps: we do more work than the serial version, n log n versus n, but we finish in fewer steps, log n versus n.
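Both phases can be put together in one serial sketch. The rounds of waking and the final scatter follow the walkthrough above; the `hop` guard keeps a wake-up from wrapping around the circular list past position n.

```python
# List ranking: phase 1 builds the hop table, phase 2 wakes nodes up to
# fill outpos, then scatters. A serial simulation of the parallel rounds.
def list_rank(succ, start=0):
    n = len(succ)
    # Phase 1: table of nodes 1, 2, 4, ... hops away.
    table = [list(succ)]
    hop = 1
    while hop * 2 <= n:
        prev = table[-1]
        table.append([prev[prev[i]] for i in range(n)])
        hop *= 2
    # Phase 2: only the start node is awake at first.
    outpos = [None] * n
    outpos[start] = 0
    hop = 1
    for row in table:
        awake = [i for i in range(n) if outpos[i] is not None]
        for i in awake:
            j = row[i]                    # the node `hop` further down
            if outpos[j] is None and outpos[i] + hop < n:
                outpos[j] = outpos[i] + hop   # wake it up
        hop *= 2
    # Final step: scatter each node to its output position.
    out = [None] * n
    for i in range(n):
        out[outpos[i]] = i
    return out

succ = [5, 6, 7, 8, 9, 2, 3, 4, 0, 1]
print(list_rank(succ))  # [0, 5, 2, 7, 4, 9, 1, 6, 3, 8]
```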

The most common way to construct a hash table on a CPU works as follows. We have a bunch of buckets here, and we have a hash function h. This hash function takes a key and maps that key into one of those buckets. So, if the hash function of the key k returns 0, h of k is 0, then that key is associated with bucket 0; if h of k is 1, then that key is associated with bucket 1, and so on. Within a bucket, we store the items as a linked list, and this is called chaining. So, we might have multiple items in this bucket, multiple keys: say key 12, key 29, and key 123 all hash to 1, so they're placed in bucket 1, and we store them as this chained linked list. Then, when we want to look up a key, we run it through the hash function, and that refers us to a particular bucket. So we have this key, and the hash function says, oh, it's in bucket 1. Then we look through all the chained items to find the key we're looking for.
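Here's a minimal sketch of such a chained table. The modulus hash and the bucket count are illustrative assumptions (the lecture's h is abstract, so these particular keys won't necessarily share a bucket here).

```python
# Minimal CPU-style chained hash table: b buckets, each a Python list;
# h(k) picks the bucket, and lookups walk the bucket's chain.
class ChainedHashTable:
    def __init__(self, num_buckets):
        self.buckets = [[] for _ in range(num_buckets)]

    def h(self, key):
        return key % len(self.buckets)   # illustrative hash function

    def insert(self, key, value):
        self.buckets[self.h(key)].append((key, value))

    def lookup(self, key):
        # Walk the chain in key's bucket until we find it.
        for k, v in self.buckets[self.h(key)]:
            if k == key:
                return v
        return None                      # not present

t = ChainedHashTable(17)
for k in (12, 29, 123):                  # the lecture's example keys
    t.insert(k, k * 10)
print(t.lookup(29))   # 290
```

The chain walk in `lookup` is exactly the part that will cause trouble in the parallel setting discussed next.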

So, let's say, we have n items to hash, n keys, and we have b buckets. So, what I'd like you to do as a function of n and b, what's the ideal number of keys per bucket?

So, the ideal number of keys per bucket is simply n over b. If every bucket has n over b items, then the items are all evenly spread between buckets. And this is largely a function of the hash function: did we do a good job choosing one that evenly distributes all the keys among the buckets? If we pick a bad hash function, maybe we end up with all the items in one bucket, which is bad because any lookup into that bucket might have to look at all n items; or we could end up with no items in a bucket, and that's a waste of a bucket. Ideally, a hash function distributes all input keys evenly across buckets, so every bucket ends up with roughly the same number of items.

So, let's think about how chaining would behave in a parallel setting. For construction, that means we have many items to put into the hash table, one per thread. For lookup, that means we have many keys to look up in the hash table, again one per thread. And chaining has two main disadvantages. Number one: let's say we're looking up many items, one per thread. To look up one item, we calculate the hash function for that item. That parallelizes nicely; it's just a map operation. More problematic, however, is searching the linked list in the bucket. We have a number of threads here, and each thread ends up looking in a different bucket. This particular bucket has three items, the bucket for thread 2 has two items, but the bucket for thread 1 has many, many items. Some threads, like thread 2, might find their item right away. Some threads, like thread 1, might have to visit many or even all of the items in a lengthy linked list before finding their item. Because threads within a warp run in lockstep, the runtime of a warp is completely dependent on the slowest lookup within the warp. The other threads in the warp have to wait until the slowest item is found, and this behavior, as you might imagine, is bad.

So, let's say we have 32 threads, each of which has a different item to look up in a hash table, and let's say all those threads map to the same bucket, so they all walk the same chain. All 32 threads will loop through the chain until all 32 threads have found their items. If we consider the fundamental unit of work here to be the thread-iteration, what fraction of thread-iterations actually do useful work?

The answer is that roughly half of them do useful work; the actual fraction ends up being 0.516, and I'll show how I got that. We've got 32 threads here, which I'm putting on the x-axis, and on the y-axis is the iteration number. Say the first thread finds its answer on the first iteration, the second thread finds its answer on the second iteration, and so on, until eventually the last thread takes 32 iterations to find its answer, which means it has spent 32 iterations walking through this linked list. We have 32 threads times 32 iterations, so the total work available is the product of the number of threads and the number of iterations: the whole square. The red area in that square is useful: those threads are walking through the linked list until they find their item, right here, and stop. But all the blue area, on the other hand, is wasted work: those threads have gone to sleep because they've already found their items, and now they're not doing anything useful; they're just waiting for the last thread to finish. If you do the math in detail, you'll find that the red area compared to the entire square is slightly more than half. But basically, what we're looking at here is that roughly half the time, if we pick any random thread, it's going to be asleep, waiting for some other thread to finish so it can progress to the next computation.
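The 0.516 figure works out as follows: thread i does useful work for i iterations (i = 1..32), out of the 32 x 32 iterations the lock-stepped warp must spend in total.

```python
# Useful thread-iterations: 1 + 2 + ... + 32 = 528, out of 32 * 32 = 1024.
threads = 32
useful = sum(range(1, threads + 1))   # 528
total = threads * threads             # 1024
fraction = useful / total
print(useful, total, fraction)        # 528 1024 0.515625
```

0.515625 rounds to the 0.516 quoted above: slightly more than half the square is useful work.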

The second disadvantage is contention in the construction process. We might have two different threads, each of which wants to place one item into the hash table. Let's say both of them compute the same hash value for their particular items, so they both want to add an item to hash-table bucket number 12; that means they both need to manipulate the linked list in the same bucket. To do that, they're going to have to serialize and synchronize: only one of them can update the bucket at any given time, and the other must wait its turn. Any serialization like this within a parallel algorithm is definitely undesirable as well. So the conclusion here is that chaining is a suboptimal strategy if we're dealing with parallel hash tables, and we're going to turn to a different method.

The approach we're going to take instead is based on a different, non-chaining hashing algorithm called cuckoo hashing. This is a cuckoo bird. Thanks, Wikipedia. It's termed a brood parasite, but let's put that into more understandable English: this bird is one of the biggest jerks in the animal kingdom. Rather than taking care of its own eggs and chicks, it lays its eggs in another bird's nest, throwing out the other bird's eggs to make room, and lets the other bird raise its chicks. See if you can detect why the algorithm I'm about to describe is called cuckoo hashing as I describe it. The key to this method is having multiple hash tables instead of just one, with multiple hash functions, one per hash table. And those hash tables allow only one item in each of their buckets; there is no chaining in this algorithm at all. In the example we'll show, we're going to have two hash tables and two hash functions, but the method generalizes to more than two. So, here's what we're going to do at a high level. First, for all the items that we want to hash, we use the first hash function and try to hash into the first hash table. Some of them will collide and fail. By that, I mean we're going to try to write multiple items into the same bucket, but we're only allowing one item per bucket. That's okay: if n items try to write into one hash bucket, we only require that one of them succeeds, and it doesn't really matter which one. Then we take all the ones that are left over, that failed, and use the second hash function to try to hash them into the second hash table, and so on. Now, the cool part is what happens next. We're going to walk through an example, and then we'll see exactly how this cuckoo hashing works.

We're going to have four items that we're going to try to hash into this cuckoo hash table, which has two subtables, t1 and t2. Each of those two subtables has two slots, corresponding to h1 equals 0, h1 equals 1, h2 equals 0, and h2 equals 1. Now, we have four items that we're going to try to hash into this big cuckoo hash table, and I've already computed their hash functions. Item A, for instance, has h sub 1 of its particular key equal to 1, and h sub 2 equal to 1, and so on; you can see all these hash values here. So, let's do round 1. What do we do on round 1? We try to place all of the alive items, and we start off with all the items being alive, into t1 using its hash function. Now, we see that items A and C both would like to map into this bucket, and items B and D, because of their values of h1, would like to map into this bucket. We can only fit one item per bucket, so we're going to arbitrarily pick a couple of items to win. Okay, I'll arbitrarily pick that C and D happen to win here: D is written into hash bucket number 0 and C is written into hash bucket number 1. Cool. So A and B are still alive, and now we'll move to the next iteration of the algorithm and use h sub 2 to try to map A and B into table 2. And we see that both of these guys would like to map into the same bucket. So again, they're both going to try, and only one of them is going to succeed; for the purposes of argument, I'm going to say that's B. Okay. So now, we still have element A left after both rounds. So, here's the neat part. We now go back to table 1. And we know that if we hash item A, it's going to collide with an item already stored in the hash table; we know that because it didn't succeed the first time. So, we're going to try to hash it again.
But now, when it collides with an item that's already in the hash table, here's the different part, we're going to take A, and place it into the hash table and we're going to kick out the item that's already there, and continue. And now, we continue on the next iteration of the algorithm again, we see that the only item that's left alive is now C. We see that its value of h2 is 0, so we can safely place C into the hash table here, and we're done. So, the big picture here is that by kicking out things that are already in the hash table, we have a new set of items to try to continue to place in the hash table and maybe they'll fit better than the old set and we continue to apply this procedure. On each iteration, we always try to place our outstanding items into the next subtable of the hash table and if they succeed, they kick out the items already there until we have no more items left. Now, will this always succeed, yes or no?
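The construction procedure just described can be sketched serially. This is a minimal Python sketch, not the course's GPU implementation: the subtable size, the two toy hash functions, and the round limit are all made up for illustration. Each round tries to place the still-alive items into the next subtable, kicking out whatever already occupies the target slot.

```python
NUM_SLOTS = 2    # slots per subtable, as in the worked example
MAX_ROUNDS = 20  # give up after this many rounds (then rehash)

# Two toy hash functions, one per subtable (illustrative only).
def h1(key):
    return key % NUM_SLOTS

def h2(key):
    return (key // NUM_SLOTS) % NUM_SLOTS

def cuckoo_insert(items):
    """Try to place every item; return the two subtables, or None on failure."""
    tables = [[None] * NUM_SLOTS, [None] * NUM_SLOTS]
    hashes = [h1, h2]
    alive = list(items)
    for round_num in range(MAX_ROUNDS):
        if not alive:
            return tables          # everything placed
        which = round_num % 2      # alternate between the subtables
        evicted = []
        for key in alive:
            slot = hashes[which](key)
            old = tables[which][slot]
            tables[which][slot] = key  # cuckoo step: kick out the occupant
            if old is not None:
                evicted.append(old)
        alive = evicted            # kicked-out items try again next round
    return None  # too many rounds: pick new hash functions and start over
```

With the integer keys 3, 2, 1, and 0 standing in for A, B, C, and D, these toy hash functions reproduce the hash values in the example above, and all four items end up placed after a few rounds of eviction.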

Now, it definitely will not always succeed. There are some nice probabilistic guarantees about how often it will succeed, depending on the size and number of the hash tables. But the easy counterexample is to say that, well, if we had 3 items where h1 and h2 were both 0, there's no possible way that we can fit them into the hash table, because we only have 2 slots where any hash function is equal to 0. So in practice we choose a certain number of iterations and we continue to iterate, trying to fill up this hash table, until we decide that we've done too many iterations. If that's the case, then we just stop, choose new hash functions, and start over. And again, there are very nice probabilistic guarantees about how often this is going to finish. In the research that inspired this work, we chose parameters so that construction was guaranteed to fail less than one out of every million times. Once we construct the hash table, the look-up procedure is really simple. We calculate all the hash functions for the item that we want to look up. So, for instance, if I want to look up item B, I calculate item B's hash functions. Here, hash function 1 is equal to 0 and hash function 2 is equal to 1. So what I'm going to do is look in all the tables, using the particular hash functions, until I find what I'm looking for. First, I'm going to look in table one, and I know that I'm going to look in slot zero. I look in slot zero and I say, wait a second, that's not B. So then I have to go to table two. I see that its hash value there is equal to one, so I'll look in slot one of table two, and there's B. I've now found the value that I'm looking for. If we don't find it in any of these locations, it's just not in the hash table. Now, the nice part here is that this is a constant-time look-up. It just requires t look-ups, and t is a constant. It might be 2.
It might be 3, and so on. This is different than chaining. Chaining has a variable-time look-up: it depends on how many items are in the bucket, and if there are many items in the bucket, we may have to look all the way to the end, which can potentially take a very long time. Here, by contrast, we can guarantee exactly how much work it takes, and it's a constant amount of work to look up any item in these hash tables.
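The look-up side is even simpler to sketch. In this Python sketch the hash functions and subtable contents are invented for illustration (they mirror the worked example, with integers standing in for the lettered keys); the point is that look-up probes at most t fixed slots, one per subtable.

```python
NUM_SLOTS = 2

# The same illustrative hash functions that built the table.
def h1(key):
    return key % NUM_SLOTS

def h2(key):
    return (key // NUM_SLOTS) % NUM_SLOTS

def cuckoo_lookup(tables, key):
    """Probe each subtable once with its own hash function: t probes total."""
    for table, h in zip(tables, (h1, h2)):
        if table[h(key)] == key:
            return True
    return False  # not in any of its t possible slots, so not in the table

# Subtables as they might look after a successful construction.
tables = [[0, 3], [1, 2]]
```

Every thread doing a look-up executes the same fixed number of probes, which is exactly why this structure avoids the divergence problem that chaining causes on the GPU.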

So, a few notes on implementation. The real benefits of this particular formulation of hashing are that our look-ups are again constant time, and that's terrific for parallelism on a machine like the GPU, because we keep all of our threads busy. There's no thread divergence at all, and the construction algorithm is actually pretty simple and pretty fast. It's a very efficient hash table construction when it's done in shared memory with a small set of items to hash. But it turns out that it actually works fairly well with large data sets, when the hash table needs to be constructed in main memory. The really important thing to keep in mind there is that the operation of "write my item into the hash table and kick the other item out" needs to be an atomic operation. In our implementation, we used atomic exchange to make sure that two competing threads have no possibility of stomping on top of each other. And a final note: there's more than one way we might solve this general problem of checking if an element is in a set. Instead of using hash tables, we could choose a fairly brute-force way to do this. We could sort all the elements in the set as a construction step, and then, for the look-up step, do binary search into the set to see if that particular element is present. So, for instance, we might have keys here in a set. To construct this particular data structure, we sort these in order of the key. And to do look-up, we just do binary search within this set. For instance, if we're looking up value k6 with one thread, we'd start at k5, see that our key is bigger than that, then move this way, and so on until we get to k6. We might have another thread that's looking up k3. It would start in the middle at k5 and then do binary search until it finds k3. Now, sorting is quite fast on GPUs. And even though hash tables are faster for this particular computation, they're not so much faster that you should never consider a sort to be a good idea.
On the GPU, sometimes a brute-force approach, like sort and binary search, might be your best option.
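The sort-plus-binary-search alternative is easy to sketch serially; the keys below are made up. Construction is one sort, and each look-up is a logarithmic binary search rather than cuckoo hashing's constant number of probes.

```python
import bisect

def build(keys):
    """Construction step: sort the set once."""
    return sorted(keys)

def contains(sorted_keys, key):
    """Look-up step: binary search, O(log n) probes per query."""
    i = bisect.bisect_left(sorted_keys, key)
    return i < len(sorted_keys) and sorted_keys[i] == key
```

On the GPU each thread would run its own independent binary search over the shared sorted array, so queries are still fully parallel, just with log n memory reads per query instead of t.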

So, let's recap some of the important lessons from this unit. If you've made it this far, good for you. There's a lot of material in this unit, advanced material that's pushing the frontiers of the field, but that's why you took this class, right? So, here are some of the lessons we learned along the way. Dense N-body: we learned two main lessons here. One is the importance of minimizing global memory bandwidth, and the second is that reducing parallelism by increasing the amount of work per thread might reduce your overall communication costs, and hence your overall run time. Sparse matrix-vector multiply: the right data structure can make a big difference. The keys to high performance in our implementation were reducing load imbalance between threads, keeping all of our threads busy, and optimizing to use the most efficient communication possible. Breadth-first traversal: choosing the most efficient parallel algorithm is perhaps the most important thing you can do. For large problems, a superior algorithm in terms of asymptotic work complexity will nearly always beat an optimized but asymptotically more expensive algorithm. The real challenge in our optimized algorithm was handling the irregular expansion and compaction at each step; scan is the natural primitive to handle these operations. List ranking: this is a good example of a problem that is not inherently parallel. To solve it well on a GPU, we used the important technique of trading more work for fewer steps. And once again, we see the power of scan to address a problem that at first glance seems difficult to parallelize. And the hash table: the key insight in the hash table implementation was recognizing that a serial data structure was the wrong fit for this problem. Instead, we used cuckoo hashing, which was much more parallel-friendly.

Great job on getting through this unit. You've learned a lot, and there's only one more unit left in this course. Pat yourselves on the back for making it this far. Now, the next homework assignment is your last homework assignment, and it's the most challenging one, since you're going to be writing most of it yourself. But don't worry, we're there to help. If you've got any issues, please post on the forums.