cs344 ยป

## Quadratic GPU vs Serial CPU implementation

So instead, we're going to look at another algorithm. One which is far better in terms of work complexity. This implementation was written by Duane Merrill and his colleagues and published in 2012. What's inefficient about the previous algorithm is that we visit the same edge over and over again on each iteration, but we only set its depth once. If we could, instead, only visit an edge when it's ready to be processed and never visit it again, we'd be much more efficient. So, let's go back to the serial algorithm and think about why it's an efficient algorithm. It's efficient because it tries to minimize the number of visits to nodes, and it does so by maintaining a frontier that marks the boundary between visited nodes and unvisited nodes. At any given time, only the nodes that border the frontier are subject to computation. This is different than our brute force N squared algorithm where every node is touched on every iteration. How do we implement something like the frontier on the GPU?

## How to Represent Graph Data Structure

So, I'm just going to sketch out at a high level, how Merle chooses to do this. So, first we're going to look at the data structure for a graph. We're going to store the graph with a CSR like structure, similar to how we'd store a sparse matrix. So, let's look at this graph here, notice it has nine nodes and it, it has links that are either unidirectional or bi-directional and we're going to consider beginning our breadth first traversal with this particular node, right here, node 0. We're going to store two arrays to represent the graph. The first, called C, tells us the neighbors for each node. Node 0 has neighbors 1 and 3, so we see 1 and 3. Node 1 has neighbors 0, 2 and 4, 0, 2 and 4 and so on. The second array, called R, tells us the starting location for each vertex neighbors. Node 0s neighbors started off at 0. Node 1s neighbors, started off set 2. Node 2s neighbors started off set 5 and so on. As a function of v, the number of vertices, and e, the number of edges in the graph, how long is C and how long is R, in terms of the number of elements in the array?

## How to Represent Graph Data Structure

C has one entry per edge, so it is e long. R, on the other hand, has one entry per vertex, so it's v long. Although we're going to put one additional element at the end and we'll see why in a minute. Either v or v+1 is accurate. We're also going to store a depth array, d initialized to negative 1, that indicates the depth for each node in the graph.

## How Does This Algorithm Work Part1

To show how this algorithm works, let's assume that we have a list of nodes that are on the current frontier. What we want to find are all nodes that are one hop away from the frontier. So, let's use this example here. And we're going to assume that we're one step into the breadth-first search. So, the frontier is nodes 1 and 3. So, now we're going to walk through the algortihm. Step 1, in parallel, for each node in the frontier, find the starting point of its neighbors. This is pretty simple. For vertex v, that's just r of v. In our example, node 1's neighbors start at offset 2 and node 3's neighbors start at offset 5.

## how does this algorithm work part2

slightly more complex, but still straightforward. You only need the array R. How do you compute the number of neighbors per vertex using array R and vertex number V?

## how does this algorithm work part2

So, we simply take the difference between the vertex's entry in the R array and its neighbors entry in the R array, and that's going to show how many neighbors this particular entry has. So, it's just R of v plus 1 minus R of v. Earlier, we said the R array was one longer than we needed, and this is the reason why. So, what's the starting point of the next vertex? Subtract my starting point. So, in our example, node 1 has three neighbors, 5 minus 2, and node 3 has one neighbor, it.

## how does this algorithm work part3

Step 3, allocate space to store the new frontier. So we have an empty array where we can copy the new frontier, but each node needs to know where in that array it might copy its edge list. We covered this operation in unit 4 It's called

## how does this algorithm work part3

Allocate, and remember, this is based on scan, we just scan the number of neighbors. In our example, we begin with the input array of 3 and 1, because vertex 1 has three neighbors and vertex 3 has one neighbor. And we scan this array with an exclusive sum scan. So, we get the resulting array 0, 3. Node 1 knows to start writing its edge list at offset 0. Node 3 knows to start writing its edge list at offset 3.

## How Does This Algorithm Work Part4

Step 4, copy each act of nodes edge list to this array. So, vertex 1 copies its three element neighbor list into the output array, starting at offset zero. And vertex 3 copies its only neighbor to offset 3. So, the potential new frontier here are nodes 0, 2, 4, and 4.

## How Does This Algorithm Work Part5

if they've been visited? Well we're going to look in the D array, the depth array for each element, and we can mark the vertex 0 as already visited because it has a depth of 0. All the rest we're going to keep because they have depth minus 1. So what we need is an operation where we take a vector of elements, and a set of trues and falses, and omit only the elements who are associated with trues. What do we call this operation?

## How Does This Algorithm Work Part5

And that's compact. Now, the sharp viewer might note that we're going to see vertex four twice in the next frontier. Now the algorithm's going to work okay with duplicates, it's going to work correctly, it's just wasteful. And this turns out to be a tricky problem to solve in parallel and it's beyond the scope of this talk, but you might consult Merrill's paper for a couple of interesting solutions.

## Recap of the Algorithm

Finally, we initialize the algorithm by setting the starting nodes depth to zero, and the initial frontier to that node's neighbor list. So, let's recap the big idea here. The basic step is a sparse copy. We start with a frontier, we look up the adjacency lists for the vertices in that frontier, and then we copy them into a continuous block to make the next frontier, and then repeat. The result is a really fast breadth for search, one that runs in linear work and achieves on the order of 3.3 billion vertices per second on a GPU. This is conservatively 4 times faster than a optimized CPU implementation.

## Merrills Linear-Complexity BFS on GPUs Part1

So in the discussion on graphs, I made the following statement. Let me quote myself. If we had a graph that was just a linear chain, the last node would be node n minus 1. The linear chain is very hard to paralyze. If we're doing a BFS here, it's going to take us order of n steps to get to the end of the graph, end quote. Let me state this problem another way. I have n nodes in a linear chain, and each node knows the ID of the next node in the chain. This is just a simple link list. What is the algorithm for each node, every node finding the end of the list? And so of course we can solve this in N steps. Let's say that each node has a next pointer, and I've shown those in blue, that points to the next node in the chain, and the last node has next equals null. Now we don't want to change the next pointers at all, or else we'll lose the structure of our lists. So we're going to assume they're read only. So we're also going to store a second pointer per node that we can change. For historical purposes, we'll call this pointer chum, and we're going to designate it in red. And at the end of the algorithm, we want each node's chum pointer to point to the last node in the chain. So the straightforward algorithm is, on each iteration, on each node. Set chum to chum to next until we reach a node where next is null. So we'll start off by making all chum pointers point to their own nodes. That's how we are going to initialize them, and again, on each iteration, we will set chum to the next pointer. So on the first iteration it's going to look like this. So now we're going to do another iteration. So for any particular node, we look for chum, and then next. So, on the second iteration, it's going to look like this, and so on and so on. So, on each iteration, the length of the chum pointer is going to be 1 more than it was on the iteration before. So the question is, the important question is, can we do better? And it turns out the answer is yes. In this algorithm described by Danny Hillis and Guy Steele in 1986, not discovered by them but described very nicely, is so cool that it's one of the reasons I decided to do parallel computing in the first place. So let's analyze the complexity of the algorithm we just described. Clearly, a serial processor can do this computation in n steps in order of n work. How about the parallel processor? We know that it takes n steps. How much work? Your choices are order of n, n log n, n square root of n, or n squared.

## Merrills Linear-Complexity BFS on GPUs Part1

And the answer is, order of n squared work. We have all n processors working and it's going to take n iterations to complete, so that's n squared work overall.

## Merrills Linear-Complexity BFS on GPUs Part2

Way back in Unit 1, when we talked about stepping work complexity, we said that we would love to find algorithm that has the same work complexity as the serial algorithm but have a smaller step complexity. Reduce would be a good example here, or scan. But if we said we can't do that, sometimes we would be willing to accept more work if it gets us fewer steps. And that's what we are going to do here. But how? This is not intuitive, or more at least it isn't for most people, and it certainly wasn't for me when I learned this material. Hills and Steele first expressed skepticism they could improve on quadratic work. But then concluded, and I am quoting from their paper, essentially we overlooked the power of having many processors working on the problem at once. So at a high level, here's what we are going to do. On every node, we start by knowing the node that's one hop away. That's the next pointer, in blue. So on the next iteration, we can visit our next pointer's next pointer, and get two hops away. And then, on the next iteration, we can get to three hops away and so on. So, that's what I am showing here when I say straight forward approach. As we increase the number of iterations, we are also increasing the number of hops away. But that's the wrong way to think about it. If we just did it that way, we'd be repeating a lot of work that the nodes down the chain are doing, and we would have quadratic complexity work. Instead, after the first iteration, we have each node knowing the node that is two hops away. So, let's say we're interested in this node and we know that we are pointing to node x here. Well, normally what we do is we take the red pointer here, and then change our red pointer to be red pointer plus blue pointer. So, we'd be moving from knowing the node two hops away to the node knowing three hops away. But look, x also knows the node that is two hops away. So, I know the node that's two hops away, x knows the node that's two hops away. So, on the second iteration we can leverage the work that x already did on it's first iteration to get a pointer that is now four hops away. So, we can go red, then red to set our new red pointer here, to be the chum of the chum. Now, we are going to have a pointer to a node that is four hops away, which also has a pointer to a node that is four hops away. And so, the next iteration will have a pointer that is 8 hops away and so on. So now, instead of going 1, 2, 3, 4, we're now going 1, 2, 4, 8. So, for N iterations, as a function of N, how many steps is this going to take?

## Merrills Linear-Complexity BFS on GPUs Part2

And the answer is log n. How cool is that? We're now doing more work, in operations on each of the log n steps. So O of n log n overall. But we'll finish in log n steps instead of n steps, and I find this pretty amazing. What a beautiful algorithm.

## Merrills Linear-Complexity BFS on GPUs Part3

So now, let's write the code. You've gotta put up with my coding handwriting here. What we're going to do is initialize k to our global thread index. We're going to initialize chum to our next pointer, and then, we're going to loop. We're going to run this routine until we run into the end. And the question is, what do we do at each stage? What do we set the new chum of k to?

## Merrills Linear-Complexity BFS on GPUs Part3

And the answer is, you set it to the chum of chum of K.

## List Ranking Part1

So let's try a slightly harder algorithm called list ranking. In list ranking, each node has the index of its successor in the list. And we know the first element in the output. What we want to be able to do is put the nodes in order. This has a number of uses, and in our lab, we've used it to decompress data that's been compressed with B sub 2. So let's take a look at an example. As an example we've got ten nodes here. We know that the output is going to begin with node zero, and the successor to node zero, so one farther in the chain is node five. The successor to node five is node two. The successor to node two is node seven and so on. So this is our input, this array of five, six, seven, eight, nine, two, three, four, zero, one. What's our output going to be? It's going to be the chain 0, 5, 2, 7, 4, 9, 1, 6, 3, 8. So here's our input and here's the output. Now, you can note that the array is actually circular, so it's necessary to actually designate the starting point. And in this case we're saying that node number 0 is the starting point. Of course, a serial processor could do this in n steps. The question is, how can we make it work with parallel hardware with a smaller number of steps?

## List Ranking Part2

We're going to sketch out a solution that uses a similar logarithmic structure to the previous problem. I'm going to write it in a different way though and I'm going to use a table here. So, recall, in the previous algorithm, we began with each node knowing its neighbor one hop away. In the first step, each node learned the node that was two hops away. Then, in the next step, four hops away and so on. So, we're going to try to use a similar approach here. And the algorithm's going to proceed in two phases. The first phase is filling in this table, so that every node knows the nodes that are 1, 2, 4, 8 hops away. And this if functionally equivalent to find the last element in the link list problem that we just saw, except that now we're treating a list of circulars so we wrap around. So, with each step of this phase, we're going to begin by knowing the node that's k hops away, for instance, k equals one. And then, we're going to compute per node, the node that is two k hops away. So, let's begin, every node is going to start by knowing the idea of a node that's exactly one hop away. .

## List Ranking Part3

So let's see how to compute the node that's two hops away. We simply look at our node that's one hop away and take it's node that's one hop away. And we'll continue as long as the number of hops away isn't greater than number of nodes. So let's do an example here. We know that the successor, the node that's one hop away from node 0 is node 5. So, if we want to compute the node that's two hops away from 0, we'll look to see that the successor to 0 is 5 and the successor to to 6 is node 3, so, we'll fill in a 3 here. The successor to node 2 is 7, the successor to node 7 is node 4, so we'll fill in a 4 here. And notice this is a perfectly data parallel operation. Each one of these vertices can do this computation completely in parallel.

## List Ranking Part4

Okay. So, the next step of the algorithm, is the compute the nodes that are 4 hops away and we're going to be able to do this in a similar way. Say, we are starting with node 0 and we want to note, the node that is 4 hops away from node from node 2 is node 4. We'll fill in a 4 here. To find the node that is 4 hops away from node 1, we know the node that's 2 hops away the node that's 2 hops away from it, is node 0 and so on. And then, we'll do the same for the last line here and find the nodes that are 8 hops away from each of these starting nodes. And we'll continue this progress as long as the number of hops away isn't greater than the number of nodes. In our example here, for instance, we have 10 nodes, so we will compute the nodes that are 1, 2, 4, and 8 hops away, but if we computed more, we would be going all the way around the list and beyond. How much work does it take to compute this entire table for n nodes proportional to n, n log n, n squared, or n cubed?

## List Ranking Part4

Well, let's take a look at the dimensions of this table. Every node participates in every step, so each step takes order of end work. And we're doubling the hop count on each step, thus there's log of n steps. So, it takes n log n work to construct the entire table. Note, this is more expensive than the serial algorithm which takes linear work, order of n work. However, the serial algorithm also takes n steps, whereas, we're finishing n log n steps here.

## List Ranking Part5

Well, the answer again is log n. Just like in the previous step, we're doubling the hop count on each step so it will take log n steps to fill in all output positions. And if you count both awake and asleep threads as doing one unit of work each iteration, we do n log n work overall. The big idea here is that this entire algorithm is a good example of trading off work for steps. We do more work than the serial version, n log n versus n, but we finish in fewer steps, log n versus n.

## Hash Table on CPUs

The most common way to construct a hash table on a CPU works as follows. So we have a bunch of buckets here and we have a hash function h. And this hash function takes a key and maps that key into one of those buckets. So, if the hash key, the hash function of the key k returns 0, h of k is 0, then that key is associated with bucket 0. If h of k is 1, then that key is associated with bucket 1 and so on. Within a bucket, we store a bunch of items as a linked list. And this is called chaining. So, we might have multiple items in this bucket, multiple keys. So, key 12, key 29, key 123 all have a hash function that's equal to 1. So, they're placed in bucket 1, and we store them as this chained, linked list. Then, when we want to look up a key, we take that key. We run it through the hash function to get a particular value out of the hash function. That's going to refer us to a particular bucket. So we have this key, it's going to return, oh, he's in bucket 1. Then we will look through all these chained items to find the key that we're looking for.

## Ideals Keys Per Bucket

So, let's say, we have n items to hash, n keys, and we have b buckets. So, what I'd like you to do as a function of n and b, what's the ideal number of keys per bucket?

## Ideals Keys Per Bucket

So, the ideal number of keys per bucket is simply n over b. If every bucket has n over b items, then the items are all evenly spread between buckets. So, this is largely a function of this hash function. Did we do a good job choosing a hash function that will evenly distribute all the keys among the buckets? If we pick a bad hash function, maybe we end up with all the items in one bucket. And this is bad because any look-ups into that bucket might have to look at all n items, or we could end up with no items in a bucket. And that's a waste of a bucket. Ideally, a hash function distributes all input keys evenly across buckets. So, every bucket ends up with roughly the same number of items.

So, let's say we have 32 threads, each of which has a different item to look up in a hash table. Let's say all those threads map to the same bucket, and there's ups. So, all 32 threads will loop through the chain until all 32 threads have found their item. If we consider the fundamental unit of work here, the thread iterations, what fraction of thread iterations here actually do useful work?

## Chaining is Bad for Hash Table Construction

The second disadvantage is in construction, particularly contention in the construction process. What we might have is two different items each of which want's to place one item into the hash table. Let's say both of these items decide that they have the same hash function for their particular item and so they both want to add an item to hash table bucket number 12 manipulate the link list in the same bucket. To do that they're going to have to serialize and synchronize. Only one of them can update the bucket at any given time and the other must wait it's turn. So any serialization like this within a parallel algorithm is definitely undesirable as well. So the conclusion here is that chaining is a sub-optimal strategy if were dealing with parallel hash tables. So we're going to turn to a different method.

## Cuckoo Hashing Part1

The approach we're instead going to take is based on a different hashing algorithm, non-chaining. And this hashing algorithm is called cuckoo hashing. This is a cuckoo bird. Thanks, Wikipedia. And it's termed a brood parasite. But let's put it into more understandable English. This bird is one of the biggest jerks in the animal kingdom. Rather than taking care of its own eggs and chicks, it instead, lays its eggs in another birds nest, throwing out the other birds eggs to make room and lets the other bird raise its chicks. See if you can detect why the algorithm I'm about to describe is called cuckoo hashing, as I describe it. The key to this method is having multiple hash tables instead of just one. Multiple hash functions, one per hash table. And those hash tables only allow one item in each one of their buckets. There is no chaining in this algorithm at all. In the example we'll show, we're going to have two hash tables and two hash functions. But the method generalizes to more than two. So, here's where we're going to do at a high level. First, all the items that we want to hash, we use the first hash function and try to hash into the first hash table. Some of them will collide and fail. And by that, I mean, we're going to try write multiple items into the same bucket but we're only allowing one item per bucket. That's okay. If n items try to write into one hash bucket, we only require that one of them succeeds, and it doesn't really matter which one. So then, those that fail, we're going to take all the ones that are leftover that failed, and we're going to use the second hash function to try to hash into the second hash table, and so on. Now, the cool part is what happens next. But we're going to go and take an example here and I'm going to show you how it how it works, and then we'll see exactly how this cuckoo hashing works.

## Cuckoo Hashing Part2

We're going to have four items that we're going to try to hash into this giant cuckoo hash table. And the cuckoo hash table has two subtables, t1 and t2. And so, each one of those two subtables has two slots corresponding to h1 equals 0, h1 equals 1, h2 equals 0, h2 equals 1. Now, we have four items that we are going to try to hash into this big cuckoo hash table. And so, I've already computed their hash functions. So, item A, for instance, it's h sub 1 of its particular key, is 1, and h sub 2 is equal to 1, and so on. So, you can see all these hash functions here. So, let's do round 1. What do we do on round 1? What we try to do is place all of the alive items, and we start off with all the items being alive, and we're trying to place them all into t1 using this hash function. So now, we see that items A and C both would like to map into this bucket. And items B and D, because of their values of h1, would like to map into this bucket. Now, we can only fit one item per bucket, so we're going to arbitrarily pick a couple of items to win. Okay. So, I'll arbitrarily pick that C and D happen to win here. D is written into hash bucket number 0 and C is written into hash bucket number 1. Cool. And so, A and B are still alive. And now, we'll move to the next iteration of the algorithm, and we'll use h sub 2 to try to map A and B into table 2. And we see, well, both of these guys would like to map into this same bucket. So, again, they're both going to try. Only one of them is going to succeed. For the purposes of argument, I'm going to say that's going to be B. Okay. So now, we still have the element A left after both round. So, here's the neat part. We now go back to table 1. And we know that if we hash item A, it's going to collide with an item already stored in the hash table. We know that because it didn't succeed the first time. So, we're going to try to hash it again. But now, when it collides with an item that's already in the hash table, here's the different part, we're going to take A, and place it into the hash table and we're going to kick out the item that's already there, and continue. And now, we continue on the next iteration of the algorithm again, we see that the only item that's left alive is now C. We see that its value of h2 is 0, so we can safely place C into the hash table here, and we're done. So, the big picture here is that by kicking out things that are already in the hash table, we have a new set of items to try to continue to place in the hash table and maybe they'll fit better than the old set and we continue to apply this procedure. On each iteration, we always try to place our outstanding items into the next subtable of the hash table and if they succeed, they kick out the items already there until we have no more items left. Now, will this always succeed, yes or no?

## Cuckoo Hashing Part2

Now, it definitely will not always succeed. There are some nice probabilistic guarantees about how often it will succeed, depending on the size and number of the hash tables. But the easy counter example is to say that, well, here we have instance, if we had 3 items where H1 and H2 were both 0, there's no possible way that we can fit them into the hash table. Because we only have 2 slots where any hash function is equal to 0. So in practice we choose a certain number of iterations and we continue to iterate, trying to fill up this hash table until we decide that we've done too many iterations. And so if that's the case then we just stop, we choose new hash functions and we start over. And again, there's very nice probabilistic guarantees about how often this is going to finish. So in the research that inspired this work, the guarantee that we tried to use was that we could guarantee that it was going to fail less than one out of every million times. So once we construct the hash table, the look up procedure is really simple. We're going to calculate all the hash functions for the item that we want to look up. So, for instance, if I want to look up item B, I know, I calculate item B's hash functions. Here hash function 1 is equal to 0 and hash function 2 is equal to 1. So what I'm going to do is I'm going to look in all the tables using the particular hash functions until I find what I'm looking for. So first, I'm going to look in table one and I know that I'm going to look in slot zero. Here, I look in slot zero and I say wait a second, that's not B. So then I have to go to hash two, to table two. Look and see that it's hash value is equal to one, so I'll look in slot one, table two and I'll see, there's B. I've now found the value that I'm looking for. If we don't find it in any of these locations, it's just not in the hash table. Now the nice part here is that this is a constant time look up. It just requires t look ups and t is a constant. It might be 2. It might be 3 and so on. This is different than chaining. Chaining has a variable time look up. It depends on how many items are in the bucket. And if we have many items in the bucket and we have to look all the way to the end. It can potentially take a very long time, whereas we can guarantee exactly how much work and it's a constant amount of work to look up any item in these hash tables.

## Cuckoo Hashing Part3

So, a few notes on implementation. The real benefits of this particular formulation of hashing are that our look-ups are again constant time. And that's terrific for parallelism for a machine like the GPU because we keep all of our threads busy. There's no thread divergence at all, and the construction algorithm is actually pretty simple and pretty fast. It's actually a very efficient hash table construction when it's done in shared memory with a small set of IMs to hash. But it turns out that it actually works fairly well with large data sets when the hash table needs to be constructed in main memory. The really important thing to keep in mind there is that the operation of write my item into the hash table and kick the other item out needs to be atomic operation. In our implementation we used atomic exchange to make sure the two competing threads don't have any possibility of stomping on top of each other. And a final note, there's more than one way we might solve this general problem of check if an element is in a set. Instead of using hash tables, we could choose a fairly brute force way to do this. We could choose to sort all the elements in the set as a construction step. And then, for the look-up step, we could do binary search into the set to see if that particular element is present. So for instance, we might have keys here in a set. To construct this particular data structure, we sort these in order of the key. And to do look-up we just do binary search within this set. So, for instanc,e if we're looking up value k6 with one thread, we'd go see k5 my keys bigger than that. I'd then pop this way and so on until I get to k6. I might have another thread that's looking up k3. It would start in the middle at k5 and then do binary search until it finds k3. Now, sorting is quite fast on GPUs. And even though hash tables are faster for this particular computation, they're not so much faster that should never consider a sort to be a good idea. Often on the GPU, sometimes a brute force approach, like sort and binary search, might be your best option.