I'm John Owens. I'm an Associate Professor of Electrical and Computer Engineering at UC Davis, where I lead a group of amazing graduate students in exploring the frontiers of GPU computing. And I'm Dave Luebke. I'm a Senior Director of Research at Nvidia where I study 3D graphics, and I've been working with John and others for many years on expanding the reach of GPUs. This class focuses on parallel computing on the GPU. You'll learn about both the hardware and the software on the modern GPU, and get to work with high end GPUs as you develop a set of image processing tools during the assignments. Such as those that you might find in cool applications such as Photoshop or Instagram. We think that projects will be both interesting and fun. But most important, you'll learn how to think about computing in a parallel way. Thnking in parrallel is a crucial skill. Maybe even the crucial skill as you approach challenging problems in all areas of computing. All right. So, let's get started. ,
Welcome to Unit 1, we're really excited to have you join us in this course. I'll be teaching Unit 1 and Dave will follow with Unit 2. Now, what we're going to cover today is the fundamentals of the GPU programming model and how to write your first Cuda program. And then in the rest of the course, we're going to build from this foundation to cover the most interesting and exciting topics in all the parallel computing. So, let's get started.
Good morning, afternoon or evening, wherever you happen to be. This is John, and I'll be presenting Unit 1. So let's get started. Like a lot of American kids, when I was probably six years old, I tried to dig a hole to China. I know I'm not the only kid who tried this. So in general, you can probably conclude that American kids are a, ambitious, and b, not particularly good at geography, at least when they're six years old. Needless to say I was not successful. Why? Besides having the attention span of a gnat, I just couldn't dig a hole fast enough. So the question I'd like to pose today is how could you dig a hole faster? I also meant there's 3 ways I could have gone faster. One, dig faster. Instead of say, removing one shovelful of dirt every two seconds maybe with a lot of effort I could have removed one shovelful every second. This certainly would help but, I think we can all agree that there's a limit to this speeding up. It's not likely I could go ten times as fast no matter how hard I work. My shovel would probably explode. 2) Buy a more productive shovel. Let's say a shovel with two or even three blades instead of one. An interesting approach. Perhaps a better shovel would make me more productive. But again, I think we can agree there's diminishing returns in making shovels better. It's not likely I could use a 10- or 20-bladed shovel. And number 3, hire more diggers. I have a younger sister and a brother, perhaps I could of persuaded them to help me dig faster. This is probably the approach that would have been most productive overall toward my whole digging project. Of course adding more diggers comes with it's own problems. How do I manage them? Will they get in each other's way? Will more diggers help me dig my hole deeper instead of just wider. Well, as you probably figured out I am not just talking about digging holes. Each of these 3 methods for digging holes faster has a parallel in terms of building a faster processor. When we talk about digging faster I am actually t alking about asking our processor should run with the faster clock to spend a shorter amount of time on each step of computation. However, in a modern processor turning up clock speed also increases power consumption and where the limit as far as power consumption on a chip. When I talk about buying a more productive shovel I'm actually talking about asking our processor to do more work on each step, on each clock cycle. But like the super shovel, a single processor has also reached diminishing returns on how much work it can do per cycle. To be a little more technical, the most advanced CP user at a limit, as far as how much instruction-level parallelism they can extract per clock cycle. And when I talk about hire more diggers, I'm referring to parallel computing in a way that we hope to teach you in this class. Instead of having one fast digger with an awesome shovel, we're going to have many diggers with many shovels. Instead of having one with just a few very powerful processors, we going to have many, many weaker, less powerful processors.
So, our first quiz. What are three traditional ways that hardware designers make computers run faster? Please check the three that are true. Our choices are faster clocks, longer clock period, more work per clock cycle, a larger hard disk, adding more processors, and reducing the amount of memory.
And the three correct answers are faster clocks, this corresponds to our more productive digger. More work per clock cycle, this corresponds to our super shuttle. And more processors, this corresponds to hiring more diggers. And this is the approach that we'll be taking in this class.
You might have watched the intro video for this class, where I quoted the American super computer designer, Seymour Cray. If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens? Well, I love the chickens, and Dave loves the chickens, and I hope by the end of this class you'll love the chickens too. Now, the biggest idea in this course is parallelism. We can solve large problems by breaking them up into smaller pieces. Then we can run these smaller pieces at the same time. Parallel computing used to be a niche technology used by exotic supercomputers. Today, the world has gone parallel. Modern computing products are like the chickens. They have hundreds of processors that can each run a piece of your problem in parallel. A high end GPU contains over 3,000 arithmetic units, ALUs, that can simultaneously run 3,000 arithmetic operations. Gpus can have tens of thousands of parallel pieces of work all active at the same time. We call each of those pieces of work a thread, and a modern GPU may be running up to 65,000 concurrent threads. Together, all this computing power has the potential to help you solve your problems faster than you ever could before. But harnessing all that computing power at the same time, in parallel, requires a different way of thinking than programming a single scalar processor, where you're only doing one thing at a time. In this class we'll teach you how to program the GPU, allowing you to practice what we call GPU computing or GPGPU, standing for general purpose programability on the graphics processing unit. More importantly, we'll teach you how to think about programming through a parallel lens. The homework assignments in this class will center around image processing applications. Frankly, they're pretty cool assignments, and with them, you'll be able to do some interesting visual effects at blazing fast speeds.
To understand why the GPU is such an interesting processor today, we'll start with technology trends. Why the world has gone parallel. And first, we'll start with some good news. Modern processors are made from transistors. And each year, those transistors get smaller and smaller. This graph is from Stanford CPUDB project, thanks guys. What it shows is, the feature size of processors over time, or the feature size is the minimum size of a transistor or wire on a chip. So, what we're seeing is, this is time going in this direction. So, that's a long time ago and that's today. And this is the feature size, so how big transistors are. And notice that it's getting smaller and smaller every generation. So, when you hear talk about feature size. So, we see that it's consistently going down over time. As the feature size decreases, transistors get smaller, run faster, use less power, and we can put more of them on a chip. And the consequence is that we have more and more resources for computation every single year. However, we've also got some bad news. Historically, as transistors improved, processor designers would then increase the clock rates of processors, running them faster and faster every year. Let's take a look at this diagram of clock speeds over the years. So again, we have time going on this axis, okay? So, a long time ago, today. And here, we have clock frequency. How fast we're clocking these transistors. Historically, one of the primary drivers of clock performance has been clock speed increases. So, we see over many years, we see clock speeds continue to go up. However, over the last decade, we see that clock speeds have essentially remained constant.
So, a quick quiz to make sure we're all on the same page. At this point in time, our processor is getting faster every generation because we're running our transistors faster or instead, because we have more transistors available for computation.
And the answer is, we have more transistors available for computation. We're not clocking things any faster. Clock rates are remaining constant. The reason we're getting faster is that we get more and more transistors available on a processor die with every generation.
The reason that we're not increasing clock speed is not that the transistors have stopped getting smaller and faster. Even though transistors are continuing to get smaller and faster and consume less energy per transistor, the problem is, running a billion transistors generates an awful lot of heat and we can't keep all these processors cool. Power has emerged as a primary driving factor, possibly the most important factor in modern processor design at all scales, from a mobile phone that you keep in your pocket, all the way to the very largest super computers. The consequence is we can't keep building processors in the way we always have by making a single processor faster and faster. When we do that, we end up with processors that we simply can't keep cool. Instead, processor designers have turned to building smaller, more efficient processors in terms of power, and they then spend their extra resources to build a larger number of efficient processors, rather than faster less-efficient processors.
So what kind of processors are we going to build? Let's assume that the major design constraint is power. Why are traditional CPU like processors not the most energy efficient processors? Well, traditional CPUs have very complicated control hardware. This allows flexibility in performance, but as control hardware gets more complicated, it's increasingly expensive in terms of power and design complexity. So if we want the most bang for the buck in terms of computation for a fixed amount of power, we might instead choose to build simpler control structures, and instead take those transistors and devote them to supporting more computation in the data path. So if we want the most bang for the buck in terms of computation for a fixed amount of power, you might choose to build simpler control structures, and instead devote those transistors to supporting more computation to the data path. And the way that we're going to build that data path in the GPU is by building a large number of parallel compute units. Individually, those compute units are small, simple and power efficient. These are the chickens. But we can put a very large number of them on a single chip, and the challenge to us in this class, and more generally in the computing industry, is to program them in such a way that they can all work together to solve complex problems.
So let's wrap up with a quiz. What techniques are computer designers today using to build more power-efficient chips? Having fewer, more complex processors, having more less complex processors, maximizing the speed of the processor clock or increasing the complexity of the control hardware? Check the ones you think are correct.
And the answer is primarily more simpler processors and that's the philosophy of the GPU.
Now the next question, when we build a high performance processor which, of course, is going to be power-efficient, what are we optimizing for? One choice is minimizing latency. Latency is the amount of time to complete a task. We measure latency in units of time, like seconds. The other choice is throughput. Throughput is tasks completed per unit time. And we measure throughput in units as stuff per time, like jobs completed per hour. Unfortunately, these two goals are not necessarily aligned. In America, if you have a driver's license or a car, you've had the unfortunate opportunity to visit a government office called The Department of Motor Vehicles. If you're not from America, you've probably visited something like it. So in your head, when I say DMV, substitute your favorite government office. When you visit the DMV, it's a very frustrating experience. You wait in lines a lot. This is not necessarily the fault of the DMV, though. The reason this happens is because your goals are not aligned with the DMV's goals. Your goal is to optimize for latency. You want to spend as little time in the DMV as possible. Instead, however, the DMV optimizes for throughput. Specifically, the number of customers they serve per day. Consequently, these two people sitting behind the desk right here that work for DMV want long lines. Long lines mean they're hard working employees are always busy because there's never a time they don't have a customer waiting. Traditional CPU's optimize for latency. They try to minimize the time elapsed of one particular task. GPU's instead chose to optimize for throughput. This is a fundamentally different approach and one that is aligned with technical trends in the computer industry. I'll refer you to a 2004 article by David Patterson called Latency Lags Bandwith. There are many, many applications where optimizing for throughput is the right approach. In computer graphics, for instance, we care more about pixels per second than the latency of any particular pixe l. We're willing to make the processing time of one pixel take twice as long if it means we get more pixel throughput. This class's homework focuses on image processing applications. Here, we also care more about throughput, which is more about pixels produced per second than the time for one individual pixel. Imaging processing applications are a perfect match for the GPU, which is why we're so excited about using them as a driving example in this course.
To understand latency versus bandwidth with a real world example, let's say, my research group and I were going to drive from California to Virginia to visit Dave. This is 4,500 kilometers. Now, we could either take a sports car, which holds two of us, and speed across the country at 200 kilometers an hour. Or we could take a bus that holds 40 of us but only travels at 50 kilometers per hour. What is the latency of each measured in hours, and the throughput measured in people transported per hour?
So, if we take the car, we travel 4,500 kilometers at 200 kilometers per hour, that's going to take 22 and a half hours. The throughput is 2 people divided by traveling 4,500 kilometers at 50 kilometers per hour and that's going to take us And that's going to give us a much better throughput of 0.45 people per hour. And these two trends aren't necessarily opposed. Improved latency often leads to improved throughput. And sometimes improved throughput also leads to improved latency. But the GP designers are really prioritizing throughput. That's really the focus of their efforts.
So, given these technical trends, let's characterize the decisions that GPU designers have made. One, GPUs have lots of simple compute units that together can perform a large amount of computation. In general, the GPU is willing to trade off control for compute and choose simpler control complexity and more compute power. The consequence of this decision is that the GPU programming model. The programmer's view of the machine is perhaps more restrictive than the CPUs. Two, GPUs have an explicitly parallel programming model. When we write programs for this machine, we know that we have lots of processors, and we have to program it in that way. We don't pretend that it has one processor and rely on a magic compiler chain to figure out how to map work onto many processors. This programming model is a main focus of the class. We'll talk a lot more about this. And three, GPUs optimize for throughput, not latency. They are willing to accept increased latency of any single individual computation in exchange for more computation being performed per second. Consequently, they're well suited for application domains where throughput is the most important metric.
So now, that we have some background on why GPUs look the way they do, we'll now discuss what a GPU looks like from the point of view of a software developer. One of the themes of this class is the importance of programming in parallel. This is a crucial skill, not just for GPUs, but also for CPUs even. If you buy an 8 core Intel Ivy Bridge processor, we see that it has 8 cores, each core has threads, multiply those together and you get 128 way parallelism. On this processor, if you run a completely serial, C program with no parallelism at all, you're going to use less than 1% of the capabilities of this machine. There's no doubt that parallel programming is harder than serial programming, no matter what your target. But I hope by the end of the class, we'll convince you that this additional complexity is worth while for the potential performance gains. And throughout the course of the class, we'll try to lay out the benefits and the limitations of this programming model. One of the really exciting things about GPU computing is that it moves quickly, and some of today's limitations may disappear in the next generation of GPU.
The computers we're using in this class are termed heterogeneous. They have two different processors in them, the CPU and the GPU. Now, if you write a plain C program, your code will only allow you to use the CPU to run your program. So, how do we write code that will run on the GPU? That's where CUDA comes in. The CUDA programming model allows us to program both processors with one program so that we can use the power of the GPU in our programs. CUDA supports numerous languages,. But in this class, we're using C. Now, part of your CUDA program is plain c and it will run on your CPU. CUDA calls this the host. The other part of your problem will run on the GPU in parallel. It's also written in C but with some extensions that we use to express parallelism. The CUDA term for your GPU is the device. Then, the CUDA compiler will compile your program, split it into pieces that will run on the CPU and the GPU, and generate code for each. CUDA assumes that the device, the GPU, is a co-processor to the host, the CPU. It also assumes that both the host and the device have their own separate memories where they store data. In the systems we use in this class, both the CPU and the GPU have their own physical dedicated memory in the form of DRAM, with the GPU's memory typically being a very high performance block of memory. Now, in this relationship between CPU and GPU, the CPU is in charge. It runs the main program, and it sends directions to the GPU to tell it what to do. It's the part of the system that's responsible for the following. One, moving data from the CPU's memory to the GPU's memory. Two, moving data from the GPU back to the CPU. Now, in the C programming language, moving data from one place to another is called Memcpy. So, it makes sense that in CUDA, this command, either moving data from the CPU to the GPU or moving data from the GPU to the CPU is called cudaMemcpy. Three, allocating memory on the GPU, and c this command is Malloc, so in CUDA, it's cudaMalloc. And four, invoking pr ograms on the GPU that compute things in parallel. These programs are called kernels. And, here's a lot of jargon in one phrase. We say that the host launches kernels on the device.
Okay, so let's recap with a little quiz. GPU can do the following. So mark each one that's true. One, initiate a send of data from the GPU to the CPU. Two, respond to a CPU request to send data from GPU to CPU. Three, initiate a receive of data sent from CPU to GPU. Four, respond to a CPU request to receive data from CPU to GPU. Five compute a kernel launched by the CPU. Six, and there's no box here, so don't check and don't not check, compute a kernel launched by the GPU.
Now, out of the five boxes here, the GPU can do three of these things. It can respond to CPU requests for sends and receives, but it cannot initiate request on its own. Again, the CPU is the boss in this relationship. It can also compute a kernel that is launched by the CPU. Now, I don't have a box next this final one because this is something that's currently changing. So, the, the issue is can a GPU launch its own work. Now, later on, on the course, we'll cover some advanced Cuda topics that go beyond this basic CPU, GPU model. Among other things, you'll learn how newer GPUs can launch their own kernels and pull directly from the CPU's memory but for now we're going to stick with this basic CPU, GPU model. Now, let's move on and see what a typical program looks like.
So, the typical program looks like this. First, the CPU allocates storage on the GPU. Then, the CPU copies some input data from the CPU to the GPU. Next, the CPU calls some kernels watching these kernels on the GPU that process this data. And finally, the CPU copies the results back to the CPU from the GPU. Now, two of these steps require moving data back and forth between the CPU and the GPU. Is this expensive? Well, in general, you'd like to minimize data transfer between the CPU and the GPU as much as you can. If you're going to move a lot of data and do only a little bit of computation on that date, Cuda or GPU computing probably isn't a great fit for your problem. Generally, we've found that the most successful GPU computing applications do a lot of computation and have a high ratio of computation to communication. They send their data to the GPU. They do a lot of work and only then, they bring it back.
Now allocating and transferring memory is pretty straight forward. The interesting part of GPU computing is defining the computation that actually happens on the GPU. We structure that computation as a series of one or more kernels. Now as we said earlier, the GPU has lots of parallel computation units. So, when you write kernels, those kernels need to take advantage of that hardware parallelism. So, how do they do that? And here is the big idea. It is one of the very core concepts of CUDA. As a programmer When you write a kernel, you write what looks like a serial program. You write this program as if it runs on a single thread. Then when you call the kernel from the CPU, you tell it how many threads to launch, and each of those threads will run this kernel. It is perfectly okay to write a kernel and then tell the GPU. Okay. When you start running this kernel, launch 100,000. Launch a million. Launch 10 million threads, each of which will run this kernel code. You can and will kick off an enormous amount of computation each time you launch a kernel. Now, we haven't covered this yet, but you might be able to make a good guess. Given this programming model, what is the GPU good at? Check all that apply. Launching a small number of threads efficiently. Launching a large number of threads efficiently. Running one thread very quickly, running one thread that does lots of work in parallel, or running a large number of threads in parallel.
Well, the two answers here are launching a large number of threads efficiently and running those threads in parallel. So, let's go into a little more detail.
So, we've presented at a high level what the programming model looks like. Now, what you need to know is what's the GPU good at? How's it going to be good at running programs I write in this programming model? So, what is the GPU good at? For now, let me just tell you two things. Keep these in mind as you're planning your program. Thing number one. It is really good at efficiently launching lots of threads. You may be used to other programming environments where launching threads is an expensive process. That is not the case here. In fact, as we'll discuss later, if you're not launching lots of threads, you're not using the GPU effectively. Dave, my co-instructor likes to say that the GPU doesn't even get out of bed in the morning for fewer than a 1000 threads. The second thing that a GPU is good at is actually running lots of those threads in parallel all at the same time, okay? So now, we're going to consider a simple example. We're going to take an input array of 64 floating point numbers, numbered 0 to 63. And we're going to square each number in the array. So, the output will be 0, 1, 4, 9, and so on. We're going to do this in three steps. We're going to start by looking at how we'd run this code on the CPU. Then, we'll talk about how we'd run this on the GPU without looking at code, instead just discussing what our code would do. Then, we'll dive into what the code actually looks like.
For the CPU code, we'll actually just look at three lines of C source code, without worrying about details like allocating memory, or initializing the array. So here's our code. The first line is setting up the loop. We're going to loop from i equals 0 until i equals 63, incrementing i each time we walk through the loop. What are we going to do on each itteration of the loop? We're going to fetch the input value at array location, i, multiply it times itself and store it into the output array. There's two interesting things to note about this code. One, we have only one thread of execution, and that thread explicitely loops over all of its input. We define thread here as one independent path of execution through the code. This definition is also going to apply to our GPU code. Two, note this code has no explicit parallelism. This is serial code. There's only one thread and it loops 64 times, doing one computation per iteration.
Quick quiz to make sure we're all on the same page. Two quick questions. One, how many multiplications will this CPU code perform? And two, assuming that each multiplication operation takes 2 nanoseconds, and everything else is free, how long will it take to compute the entire computation?
Well, for the first question, we know that we're going to loop through this code going to take 64 multiplications. And for the second question, we know that each one of these iterations is going to proceed serially, one after another after another. So, for each of these 64 iterations, it's going to take us to nanoseconds. If we multiply those together, we'll see that the answer is 128 nanoseconds.
So the GPU code will have two parts, one of which runs on the GPU, one of which runs on the CPU. So we're going to start by saying what do we have to express in the GPU part of this program. And we only need to express something very simple, the idea that out equals n times n Now this kernel that we write for the GPU doesn't say anything about the level of parallelism. If you remember, kernels look like serial programs. So the idea of the fact we're going to do this 64 times, not expressed in the GPU program at all. What does the CPU do? Well, it has to do the allocation of the memory, copying the data to and from the GPU, but the important part in terms of the computation is that the CPU launches the kernel. This is where the parallelism of threads is expressed. The CPU program will look something like okay, let's launch a square kernel, we're going to launch 64 threads, and the arguments to those threads are an output array and an input array. Just to get our terminology straight, what we're doing here is launching a kernel called square kernel on 64 threads and each of those 64 instances of the kernel will perform one of the 64 square operations that we need to do. Now, one question that may be puzzling you is, what good is it to launch 64 instances of the exact same program? So here's how we answer that question. Each thread that we launch knows which thread it is. We're going to call that the thread index. So if you launched 64 threads, one of them knows it's thread 0, one of them knows it's thread 1, and so on. Then, you can assign thread number n to work on the nth element of the array. So let's summarize here. You write your kernel so that it will run on one thread at a time Then you will launch minithreads, each of which will run that kernel independently.
So, in the last quiz, we talked about the performance characteristics of the serial program. Now, we're going to look at the performance characteristics of the parallel program. So, the same two questions. How many multiplications will this code perform? And assuming that each multiplication operation takes 10 nanoseconds, and we have enough GPU hardware resources to run all the multiplications in parallel, and everything else is free, how long will it take to compute the entire computation?
Okay. Well, just like in the serial code, we're going to have 64 multiplications to perform. But now, we can perform all of these multiplications in parallel. So, it's going to take ten nanoseconds. So, this is a good example of the throughput versus latency debate. The latency of each operation is certainly longer than what we saw in the CPU case. But because we can do them in parallel, we both complete faster and have a higher throughput.
Now, we're going to look at actual code. We're showing this in the class IDE. We're going to walk through it very quickly, and then run it in a shell window. And then, we'll come back and look at specific bits of the code, so you're sure you know what all of these calls do. Then, when we're done with that, we're going to ask you to make some changes and run it yourself. So, this is the actual class IDE. I'm scrolling up and down. Here is our main routine, that's what we're going to run. Let me briefly go through what we're looking at and then I'll switch over to a Show window and show you how it actually runs. So, we're going to start off by running code here. It's going to generate the input array on the host. Then, we will declare the GPU memory pointers, allocate the memory for those pointers, copy our host array over to the device. Here's where the magic happens. This is where we actually launch the kernel. We then take the results here and copy them back to the CPU and then we print out the resulting array, free the memory, and return 0. And if we go way back up to the top here, here's the actual kernel that we're going to run. Notice it's very simple, recall that we write a serial program here, and we know that all that program is going to do is square its input and return it into output. Now, let's actually run this code. What we see here is me connected to a Cuda capable computer, actually in our lab at UC Davis. We're going to first look at the code itself, just so you know I'm compiling the real thing. Okay, here's our program. Now, we're going to compile it and here is the command to compile it. The, if, you don't have to worry about this in the class IDE, but I'm just showing you what this looks like if you're running things on your own computer. what we see here is instead of running the regular C compiler we're running nvcc, the Nvidia C Compiler. The output is going to go an executable called square and our input file is square cu.. cu. is the convention for how we name Cuda files when we're saving them in a file system, okay? So, it completes pretty quickly. And then, I'm going to run the program square. So, what's square done? It's taken our 64 element input array of the numbers 0 to 63 and it squared each one of those and printed out the output. So, we see 0, 1, 4, 9, 16, and so on, all the way up to examples that we do in this class will show how you do it in the class IDE. But just for now, I'm showing you the results of this program done in a Shell window. So, let's go back to our code.
So, let's take a closer look at the code, line by line, so we can all be sure we know what each call does. We're going to walk through the CPU code first. The first thing we're going to do is declare the size of the array and determine how many bytes it uses. We then fill it up in this loop with floating point numbers, where array element i is simply set to i. All of this is standard C, nothing GPU specific so far. One thing to note, though, is a common Cuda convention. Data on the CPU, the host, starts with h<u>. Data on the GPU, the device, starts with d<u>.</u></u> This is just a convention. You can name your variables anything you want. But naming variables in this way helps you avoid the single most common beginner error in Cuda where you try to access a piece of data on from the CPU from the GPU or vice versa. If you're accessing data through a point or on the CPU, your pointer better point to something in CPU memory, or you're going to have a bad time. Same thing for the GPU. You'll find lots of Cuda code that you see uses this convention. So, let's scroll up just a little bit. And the first interesting thing that you see is how to declare a pointer on the GPU. It looks just like a pointer declared on the CPU. It's just a float star. Now, to tell Cuda that your data is actually on the GPU, not the CPU, look at the next two lines. We're using cudaMalloc with two arguments, the pointer and the number of bytes to allocate. cudaMalloc means allocate the data on the GPU whereas, a plain Malloc would mean allocate the data on a CPU. The next thing we do is actually copy the data from the CPU the array h<u>in on to the GPU, the array</u> d<u>in. This call is cudamMemcpy. It's just like a regular Memcpy, but it takes</u> four arguments instead of three. The first three arguments are the same as regular C Memcpy, the destination, the source, and the number of bytes. The fourth argument says the direction of the transfer. The three choices are Cuda memory host to device, Cuda memory device to host, and Cud a memory device to device.
Okay. So, here's a code snippet that specifically shows the data movement calls, these cudaMemcpy culls. What I'd like you to do is fill in the question marks here. And your choices are going to be cudaMemcpy host to device, cudaMemcpy device to host, or cudaMemcpy device to device.
Number 1 here is moving the host array, h sub in to the GPU array, d sub in. So, we need to go host to device. This argument is going to be cudaMemcpy host to device. This transfer goes the other way, from an array on the GPU, d<u>out to an</u> array on the CPU, h<u>out.</u> So, we're going to go device to host. cudaMemcpy device to device simply moves a piece of memory on the GPU from one place to another.
Okay. Let's scroll up a little more to look at the second half of the problem. Now, that we've got all the preliminaries out of the way, how do we actually launch a kernel on the GPU? So, here's a new piece of syntax in Cuda, the Cuda launch operator. So, the Cuda launch operator is indicated by these three less than signs and these three greater than signs, with some parameters in the middle. So, this line says, launch the kernel name square on one block of 64 elements. Then, the arguments to the kernel are two pointers, d<u>out and d<u>in.</u></u> This code tells the CPU to launch on the GPU 64 copies of the kernel on 64 threads. Note that we can only call the kernel on GPU data, not CPU data. Since we named our GPU data to start with d<u>, we can visually see that we've done the</u> right thing. Then when we're done with the kernel, the results are in d<u>out on</u> the GPU. And this cudaMemcpy call will move memory from device to host, and place it in h<u>out.</u> The next thing we do is print it out, okay? We're just walking through the h<u>out</u> array, we're printing four things per line, so we're putting tabs in and then a new line after four, and then we free the memory that we allocated on the GPU and return 0. So, that's all the CPU code. There's a fair amount in boilerplate in here, it looks fairly similar for most programs. Most programs are going to have you create some data on the CPU, allocate some data on the GPU, copy memory from CPU to GPU, launch some kernels that will run on the GPU, copy the result back to the CPU and then, continue to process them, print them, and so on.
Now let's look at the kernel itself. Recall that this will look like a serial program that will run on one thread. And the CPU is responsible for launching that program on many parallel threads. This kernel indeed looks exactly like a serial program. So here's our kernel right here, this global void square. And so here's the interesting things about this short kernel program. So, first global. This is underscore, underscore, global, underscore, underscore. That's a se-language construct called a quote, declaration specifier, end quote. Or for short, deckle speck. Don't worry about the name. Just know that this is the way that cuda knows this code is a kernel as opposed to CPU code. Next we have void. Void just means the kernel doesn't return a value. Instead it writes the output into the pointer specified in its argument list. This kernel takes two arguments. These are pointers to the output and the input arrays. Recall that both these pointers need to be allocated on the GPU or else your program will crash spectacularly. Now, note that I name them with d<u>out and d<u>in.</u></u> That's certainly not foolproof, but if I called it d<u>, and I'm consistent about</u> the way I allocate my variables, I know that d variables are allocated on the GPU. Let's walk through the body of the kernel. So the first line of the body here. Remember how I said that each thread know its own index? Here's how we get that index. CUDA has a built in variable called thread index, threadIDX, and that's going to tell each thread its index within a block. threadIDX is actually a c struct with 3 members. .x, .y, and .z. the c struct is called a dim 3. We're just using .x in this code, but we'll explain the others a little bit later. Now, we'll launch 64 threads. So for the first instance of those threads, threadIDX.x will return zero, for the second instance, 1. And so on, up to 63 for the last element. Everything else in this kernel just looks like straightforward C. It looks just like a serial program. So what are we actually doing in this kernel? Well, for each thread, we're going to first read the array element corresponding to this thread index from global memory. We're going to store it in this float variable f. We're then going to square f, and we're going to write that value back to global memory, in the output array element that corresponds to our thread index.
So now lets try it. The first thing you should do, is copy this code into your own, ID and run it, and confirm that it works. And then, as a quiz, what I'd like you to do is change this program in two ways. First, instead of processing an array of 64 elements, lets process an array of 96 elements. Don't forget to change the allocation if necessary. Second, instead of computing the square of each element in the array, let's compute its cube.
Okay, so let's see ho we'd solve this problem. The first thing we're going to do is to change the square routine into a cubed routine and change the math accordingly. So now, we're going to change square to cube. And instead of computing f times f, now we simply compute f times f times f. Now, since our new name is cube here, we are going to have to scroll all the way back down here. And instead of launching square, we've now launched q. The next thing we have to change is the allocation. Now, instead of running this on 64 elements, we want to run this on 96 elements. And, we've conveniently written this code such that everything is done in terms of array size, including the allocations. So, that should be all that we need to do. Now, let's go over to my shell window and demonstrate that this works, okay? Now we're going to do it in our own terminal window. First, I'll make sure to show you that I actually made the changes in this file. So, notice that we now have a cube routine, f times f times f, and now 96. Now, we're going to compile it into a cube executable. And then, we're going to run it. And notice that instead of printing out squares now it prints
Let me come back to one piece of this code that needs a little more explanation, and that's this kernel call. Here's the name of the kernel call, square, and we call it with these launch parameters with these arguments. There's a lot going on in this call so I need to explain details that we didn't need in our example that are necessary as we move forward. What I told you is that we were launching one block of 64 threads, and that's absolutely true. But, let me explain what's happening under the hood in a little more detail. When you launch a kernel, you specify both the number of blocks and the number of threads per block. In our example, we only had one block of 64 threads. But we want to run bigger problems than this. What you need to know about the hardware right now is two things. One, it is capable of running many blocks at the same time. Two, each block has a maximum number of threads that it can support. Newer GPUs can support 1024 threads per block, older GPUs can only support 512. So, when you have lots of work to do, you'll divide that work into any number of blocks. Each of which had no more than 512 or possibly 1,024 threads. So, if we wanted to launch 128 threads and square the values in each of them instead of 64 threads, we could change this call to square of 1,128. If we wanted to launch 1280 threads instead we could call square of 10,128. Launching ten blocks of 128 threads each. Or square 5,256, we're launching five blocks of 256 threads each. But, we can't call square 1,1280 because that's too many threads per block. You should pick the breakdown of threads and blocks that makes the most sense for your problem. As we saw before, each thread that we launch knows its index within that block. And you won't be surprised to hear that each thread also knows the index of its block as well. We'll see how we access this information in a moment. Now, how do these kernels actually map into threads and blocks? Well, when we square 10,128, we're going to launch 10 thread blocks of 128 threads each. When we square diagrams are one dimensional, they only progress in one dimension, the x-dimension. That's fine if your problem is one dimensional. But, many problems are 2 or 3 dimensional. You'll be doing image processing in the homeworks, for instance, and that's very definitely a two dimensional problem. So, it makes sense that CUDA would natively support not only one dimensional layouts of blocks and threads, like we're showing here, but also 2 and 3 dimension as well. For instance, perhaps we like to process this 128 by 128 pixel image, image. We'd like to launch one thread per pixel. We might choose, for instance, to launch these 128 by 128 threads as 128 blocks in the y-dimension where each one of those blocks is a 128 by 1 block of threads in the x-dimension. Or, we might instead choose to watch, to launch an 8 by 8 grid of blocks where each block is blocks. We can also arrange thread blocks into 1, 2, or 3 dimensional grids. So now, let's return to how we launch kernels. We put two parameters in the triple chevrons. The first is the dimensionality of the grid of blocks, and the second is the dimensionality of the threads within a block. We can specify up to three dimensions for each. But, if we don't specify a dimension, it defaults to one. More generally, you can specify a three dimensional configuration for grid of blocks or block of threads with this dim3 struct which you initialize as dim3 x,y,z. And recall that, again, if we don't specify y or z, they default to 1. So, when we say dim3 of w, that means the same thing as dim3 of w,1,1, and you can actually abbreviate this with simply the integer w. Soon specified square of
So, the most general kernel launch we can do looks like this, square of 3 parameters. The first is the dimensionality of the grid of blocks. That has bx times by times bz blocks. Each one of those blocks is specified by this parameter, the, the block of threads that has tx times ty times tz threads in it, and recall that this has a maximum size. By the way, there's a third argument that defaults to zero if you don't use it, and we're not going to cover it specifically today. it's the shared amount memory and bytes allocated per thread block. So, with this one kernel call, you can launch an enormous number of threads. And let's all remember, with great power comes great responsibility. So, launch your kernels wisely. One more important thing about blocks and threads. Recall from our square kernel, that each thread knows its thread ID within a block. It actually knows many things. First, is threaded x, as we've seen. Which thread it is within the block. So here we have a block, each thread, say this thread here, knows its index in each of the x, y, and z dimensions. And we can access those as thread idx.x, thread idx.y, and dot z. We also know block Dim, the size of a block. So, how many threads are there in this block along the x dimension, the y dimension, and potentially the z dimension? And so, we know those two things for a block, we know the analogous things for a grid. So block index for instance is which block am I in within the grid? Again dot x, dot y, and dot z. And grid Dim will tell us the size of the grid, how many blocks there are in the x dimension, the y dimension, and the z dimension. What I want you to take home from this little discussion is only the following. It's convenient to have multi-dimensional grids and blocks when your problem has multiple dimensions. CUDA implements this natively and efficiently. When you call thread at x.x, or block down dot y, that's a very efficient thing within CUDA. Since we're doing image processing in this course, you should be coun ting on to finding a lot of two dimensional grids and blocks. So, let's wrap up with a little quiz. Let's say I launch the following kernel. Kernel with 2 parameters dim 3 8,4,2, 2 and dim 3 16,16, how many blocks will this call launch? How many threads per block, and how many total threads?
Well, we can determine the number of blocks simply by looking at the dimensionality of the grid. We are launching this kernel with the grid of blocks and the dimensionality of the grid is 8 in the x-direction, 4 in the y-direction, and 2 in the z-direction. So, we have 8 times 4 times 2 blocks, which is 64 blocks. How many threads per block? Well, we can look only at the configuration per block of threads. Here, we have a 16 in x, 16 in the y-dimension of each block of threads. So, that's 256 threads per block. The total is 64 blocks of 256 threads per block. That's 16,324 threads that we launched with this kernel call.
So, we're starting to get to the wrap-up phase of this lecture. And so, I want to sum up what we've learned about the programming model so far. So, I want to try to sum this up in three points. First, we know that when we write a kernel program, that program looks like it's only going to run on one threads. However, when we launch that program and we launch the kernel from CPU code, we specify the launch bounds of that such that, that program can be launched to run on any number of threads that we define. Finally, within the kernel program, each one of those threads knows its own index and its own thread block and then the grid of thread blocks, so that it can properly pick up the right piece of work for it to compute. Now, what we're going to do is take a slightly broader view of the kind of operations that we just did in our sample program. And what we're going to have is two ingredients that we've learned that are going to lead to an interesting abstraction that we're going to call map.
Let's take a broader view of what we just did. We have two ingredients here that are going to lead to an interesting and crucial abstraction called map. First, we have a set of elements that we'd like to process. In our example, this set is an array of 64 floats. Second, we have the ability to write an arbitrary function that runs on one element. In our example, our function is squared, each of its input elements producing an output element. Generically, what we did was apply our function to each element in the set. This is a powerful parallel operation. So to be precise when we talk about this, we say that we have an abstraction named map. It takes two arguments, a set of elements and a function that will be applied to each of those elements. Map is a key building block of GPU computing. Many of the problems you will solve as a GPU programmer, and certainly many of the ones in this class, use map operations. GPUs are good at map for two main reasons. One, GPUs have many parallel processors. The GPU is effecient at delgating the computation of individual elements to those processors. We're going to find out more about that soon. Second, GPUs optimize for throughput rather than latency. So, as a programmer, you're probably more interested in optimizing the rate at which entire map operations can complete, instead of the time it takes for any individual element to complete, and the GPU agrees with you. Let's also take a look at the communication pattern for a map operation. Maps model is straightforward. One element in, one element out. With the computation of output element number n, only dependent on input element number n. This is a very simple and straightforward communication pattern. Other computations might require a more complex communication pattern. And we'll learn about those patterns over the next few units. So, to make sure we understand map, let's do a little quiz. Check the problems that can be solved using a map operation on an input array. Sort, add one to each element in the array, sum up all the elements in the array or compute the average of the elements in the input array.
Only one of these can be solved using map, and that's adding one to each element in the input array. Why can we do that? Each individual element in this input array, is processed in parallel. We can't sort an input array using map, because the output at any position is dependant on all the input data. Similarly, to sum up all the elements in an input array requires that we look at all of the seperate elements to compute a single output. The, the individual elements are not parallel. Finally, to compute the average of an input array also requires this all to one style of communication. We don't have in parallel things that happen independently without any communication. Now, all these other operations are very, very interesting in GPU computing, and we will absolutely be covering them in the next few units.
To wrap up. Today we learned about not only that the world is becoming parallel, but why. The GPU is a processor that takes advantage of some of the most important technical trends in computing. We looked at our first CUDA program, learned how it worked, and changed the program to do something new. Now for our first project, we're going to use CUDA to input a color image, and now put that image in greyscale. This will involve a map operation, over every pixel in the image, where the operation on each pixel is converting the input color in red, green, and blue, to a single intensity value. When you're done, you'll have the chance to show your grandparents what your new photos would've looked like back in the day, or at least that's what I'm going to do. Enjoy the assignment, and I hope you enjoy an excellent presentation from Dave for Unit 2.
Congratulations and good job. You've made it to the end of Unit 1. You've learned a lot about parallel computing on the GP already. Both the fundamentals of the GPU programming model, and how to write your first GPU program. Now, we're going to talk to Video Chief Scientist and Senior Vice President of Research Bill Dally. Bill is one of the pioneers of parallel computing so I hope you enjoy his perspective as much as I do. After the interview, we can dive into problem set one. Where you'll write a program that will transform a color image into a gray scale image. And we'll provide a lot of the surrounding code for this assignment, but you're going to write the most important part. The part that actually does the parallel processing on the GPU.
Welcome to problem set number one. I'm Chin Hun and I'm your TA for the class. Today I will be walking you through problem set number one. And for problem set number one, our goal is convert an image from color to black and white. So to convert a image from color to black and white We first have to understand how color digital images are represented. The most common format is to specify how much red, green, and blue is presented at each pixel. Each color is called a channel. Zero means that a color is fully absent. Whereas, 255 means that color is fully saturated. Therefore, if at a pixel, all three channels red, green, and blue are zero, means that pixel is black, like such. Or if all three channels red, green, and blue are 255. That pixel is fully white, just like the background of this slide. So how are pixels represented in CUDA? So, in CUDA code, each pixel is represented as an unsigned char4 struct. This structure has four unsigned chart components named x, y, z, and w. X is your red channel. Y is your green channel, Z is your blue channel, and W is your alpha channel. The fourth component, char W here, is reserved for the Alpha channel, which carries the transparency information. We will ignore this component through our homework, as this does not pertain to transforming the image from color to greyscale. So how do we convert a color image to a black and white image? So, one possibility was simply to take the average of all of the three color channels. So for example, add R to G to B, the channels R stands for red, G stands for green, and B stands for blue and simply divide it by three. But it turns out that our eyes don't respond to all colors equally. We're more sensitive to the color green than red, and more sensitive to the color red than the color blue. So taking into account our varying sensitivities to the different color channels. We are going to use this following formula instead. So we are going to multiply the color channel red by point 299, we are going to multiply th e color channel green by point 587, and we are going to multiply the color channel blue by point 114. So let's look at problem set number one, and let's look at the code that you will actually have to write for this assignment. So for problem set number one, you have two jobs that you have to do. So for the first part. Your job is to fill in this RGBA to greyscale function. We will pass you in an array of uchar4 which represents the pixel in the color image. And your job is to convert each picture in this 2D array into an intensity value that you will write back to this 1D array called great image. And you will follow the formula that we talked about in the previous slide, as seen here. In part two, the kernel's already filled in for you. But with incorrect dimensions, and spot is empty. So your job for part two is to fill in the body of the kernal and to correct the dimension of block size and grid size. We have included reference code that performs the same calculation of siri on the CPU for your reference, and if you have any question, or if you run into any problem feel free to post on the forums, and I'll be there to help you. So lastly, I would like to thank Eric Elson of Royal Caliber and Mike Roberts for their fantastic job with writing the homeworks. They have also written the script this video today, as well, so thank you Eric and Mike.