CS344


  • Lesson 1: Introduction and the GPU Programming Model (6-8 Hours)
    37 clips (48:46), 13 Quizzes, 1 Problem set, 1 Interview

    • Technology Trend: CPU programming, Latency, Bandwidth
    • GPU Programming: Design Goals, Kernel, Map
    • CPU vs GPU: Squaring a number
      Problem set 1: Converting Photos from Color to Greyscale (for that classy touch)
    • Bill Dally (NVIDIA) Interview
  • Lesson 2: GPU Hardware and Parallel Communication Patterns (8-12 Hours)
    43 clips (1:16:52), 15 Quizzes, 1 Problem set

    • Communication Patterns: Map, Gather, Scatter, Stencil, Transpose
    • GPU Hardware: Streaming Multiprocessors, Kernel, Thread Blocks, Threads
    • GPU Memory Model: Synchronization, Barrier, Memory access, Coalesce, Atomics
    • Strategies for efficient CUDA programming
      Problem set 2: Gaussian filter for smooth blur (miracle product for removing wrinkles)
  • Lesson 3: Fundamental Parallel Algorithms 1 (8-12 Hours)
    37 Nodes (1:25:13), 19 Quizzes, 1 Problem set

    • Step complexity, Work complexity
    • Reduce: Serial vs Parallel Implementation, Global and Shared Memory Bandwidth
    • Scan: Serial vs Parallel Implementation, Inclusive vs Exclusive Scan, Hillis-Steele vs Blelloch Scan
    • Histogram: Serial vs Parallel Implementation, Atomics, Local Memory, Reduction
      Problem set 3: HDR Tonemapping (because your TV doesn’t really have a 10,000:1 contrast ratio)
  • Lesson 4: Fundamental Parallel Algorithms 2 (6-10 Hours)
    28 clips (1:03:32), 15 Quizzes, 1 Problem set, 1 Interview

    • Compact: Core Algorithm, Procedure
    • Allocate: Strategy
    • Segmented Scan: SpMV, CSR
    • Sort: Brick Sort, Merge Sort, Sorting Networks, Radix Sort, Quicksort
      Problem set 4: Red Eye Removal using Template Matching (soothing relief for those bright red eyes)
    • Ian Buck (NVIDIA) Interview
  • Lesson 5: Optimizing GPU Programs (10-14 Hours)
    52 clips (1:50:18), 21 Quizzes, 1 Problem set

    • Levels of Optimization: APOD
    • Analyze: Hotspots, Amdahl’s Law
    • Parallelize: Matrix Transpose, Bandwidth, Tiling, NVVP, Little’s Law, Occupancy
    • Thread Divergence: Warp, SIMD, SIMT, Switch
    • CPU-GPU Interaction: Streams
      Problem set 5: Accelerating Histograms (when fast isn’t fast enough)
  • Lesson 6: Parallel Computing Patterns (8-12 Hours)
    34 clips (46:34) / 14 Quizzes, 26 clips (35:26) / 12 Quizzes, 1 Problem set

    • Dense N-Body: Tiling, SpMV, Thread per Row / Thread per Element
    • Traversal of Graph: Depth-first vs Breadth-first search
    • Graph Data Structure
    • List Ranking: Merrill's Linear Complexity
    • Hash Table
      Problem set 6: Seamless Image Compositing using Poisson Blending (or, who put the polar bear in the swimming pool?)
  • Lesson 7.1: The Frontiers and Future of GPU Computing (8-12 Hours)
    47 clips (1:02:22) / 9 Quizzes, 17 clips (19:44) / 5 Quizzes, 1 Interview

    • Parallel Optimization Patterns: Data Layout transformation, Scatter, Tiling, Privatization, Binning, Compaction, Regularization
    • Libraries: cuBLAS
    • CUDA C++ Programming Power Tools: Thrust, CUB, cudaDMA
    • Other Languages: PyCUDA, MATLAB
    • Dynamic Parallelism: Bulk, Nested, Task, Recursive, Quicksort

    • Stephen Jones (NVIDIA) Interview

  • Final Exam (4 hours)

  • Optional Final Project
    We encourage you to apply the lessons to a problem that interests you. You can share your project in our Forums.

  • Grading Policy

    1. Accuracy: Your program must produce the right answer.
    2. Computation Time: Once your answer is correct, pay attention to computation time and aim for an efficient implementation.