# Syllabus

• Lesson 1: Introduction and the GPU Programming Model (6-8 hours)
37 clips (48:46), 13 quizzes, 1 problem set, 1 interview

• Technology Trend: CPU programming, Latency, Bandwidth
• GPU Programming: Design Goals, Kernel, Map
• CPU vs GPU: Squaring a number
• Problem set 1: Converting Photos from Color to Greyscale (for that classy touch)
• Bill Dally (NVIDIA) Interview

• Lesson 2: GPU Hardware and Parallel Communication Patterns (8-12 hours)
43 clips (1:16:52), 15 quizzes, 1 problem set

• Communication Patterns: Map, Gather, Scatter, Stencil, Transpose
• GPU Memory Model: Synchronization, Barrier, Memory access, Coalesce, Atomics
• Strategies for efficient CUDA programming
• Problem set 2: Gaussian Filter for Smooth Blur (miracle product for removing wrinkles)

• Lesson 3: Fundamental Parallel Algorithms 1 (8-12 hours)
37 clips (1:25:13), 19 quizzes, 1 problem set

• Step complexity, Work complexity
• Reduce: Serial vs Parallel Implementation, Global and Shared Memory Bandwidth
• Scan: Serial vs Parallel Implementation, Inclusive vs Exclusive Scan, Hillis-Steele vs Blelloch Scan
• Histogram: Serial vs Parallel Implementation, Atomics, Local Memory, Reduction
• Problem set 3: HDR Tonemapping (because your TV doesn’t really have a 10,000:1 contrast ratio)

• Lesson 4: Fundamental Parallel Algorithms 2 (6-10 hours)
28 clips (1:03:32), 15 quizzes, 1 problem set, 1 interview

• Compact: Core Algorithm, Procedure
• Allocate: Strategy
• Segmented Scan: SpMV, CSR
• Sort: Brick Sort, Merge Sort, Sorting Networks, Radix Sort, Quicksort
• Problem set 4: Red Eye Removal using Template Matching (soothing relief for those bright red eyes)
• Ian Buck (NVIDIA) Interview

• Lesson 5: Optimizing GPU Programs (10-14 hours)
52 clips (1:50:18), 21 quizzes, 1 problem set

• Levels of Optimization: APOD
• Analyze: Hotspots, Amdahl’s Law
• Parallelize: Matrix Transpose, Bandwidth, Tiling, NVVP, Little’s Law, Occupancy
• Thread Divergence: Warp, SIMD, SIMT, Switch
• CPU-GPU Interaction: Streams
• Problem set 5: Accelerating Histograms (when fast isn’t fast enough)

• Lesson 6: Parallel Computing Patterns (8-12 hours)
Part 1: 34 clips (46:34), 14 quizzes; Part 2: 26 clips (35:26), 12 quizzes; 1 problem set

• Traversal of Graph: Depth-first vs Breadth-first search
• Graph Data Structure
• List Ranking: Merrill’s Linear Complexity
• Hash Table
• Problem set 6: Seamless Image Compositing using Poisson Blending (or, who put the polar bear in the swimming pool?)

• Lesson 7.1: The Frontiers and Future of GPU Computing (8-12 hours)
Part 1: 47 clips (1:02:22), 9 quizzes; Part 2: 17 clips (19:44), 5 quizzes; 1 interview

• Parallel Optimization Patterns: Data Layout Transformation, Scatter, Tiling, Privatization, Binning, Compaction, Regularization
• Libraries: cuBLAS
• CUDA C++ Programming Power Tools: Thrust, CUB, cudaDMA
• Other Languages: PyCUDA, MATLAB
• Dynamic Parallelism: Bulk, Nested, Task, Recursive, Quicksort
• Stephen Jones (NVIDIA) Interview

• Final Exam (4 hours)

• Optional Final Project
We encourage you to apply these lessons to problems that interest you. You can share your project in our forums.