About this Course

The goal of this course is to give you solid foundations for developing, analyzing, and implementing parallel and locality-efficient algorithms. This course focuses on theoretical underpinnings. To give a practical feeling for how algorithms map to and behave on real systems, we will supplement algorithmic theory with hands-on exercises on modern HPC systems, such as Cilk Plus or OpenMP on shared memory nodes, CUDA for graphics co-processors (GPUs), and MPI and PGAS models for distributed memory systems.

This course is a graduate-level introduction to scalable parallel algorithms. “Scale” really refers to two things: staying efficient as the problem size grows, and staying efficient as the system size (measured in numbers of cores or compute nodes) grows. To really scale your algorithm in both of these senses, you need to be smart about reducing asymptotic complexity, the way you’ve done for sequential algorithms since CS 101, but you also need to think about reducing communication and data movement. This course is about the basic algorithmic techniques you’ll need to do so.

The techniques you’ll encounter cover the main algorithm design and analysis ideas for three major classes of machines: multicore and manycore shared memory machines, via the work-span model; distributed memory machines like clusters and supercomputers, via network models; and sequential or parallel machines with deep memory hierarchies (e.g., caches). You will see these techniques applied to fundamental problems such as sorting, search on trees and graphs, and linear algebra, among others. The practical aspect of this course is implementing the algorithms and techniques you learn and running them on real parallel and distributed systems, so you can check whether what appears to work well in theory also translates into practice. (Programming models you’ll use include Cilk Plus, OpenMP, and MPI, and possibly others.)

Approx. 4 months
Included in Course
  • Rich Learning Content
  • Interactive Quizzes
  • Taught by Industry Pros
  • Self-Paced Learning
  • Student Support Community

Free Course

High Performance Computing

by Georgia Institute of Technology

Enhance your skill set and boost your hirability through innovative, independent learning.

Course Leads

  • Rich Vuduc
  • Catherine Gamboa

What You Will Learn

The course topics are centered on three different ideas or extensions to the usual serial RAM model you encounter in CS 101. Recall that a serial RAM assumes a sequential or serial processor connected to a main memory.

  • Unit 1: The work-span or dynamic multithreading model

In this model, the idea is that there are multiple processors connected to the main memory. Since they can all “see” the same memory, the processors can coordinate and communicate via reads and writes to that “shared” memory.

Sub-topics include:

  • Intro to the basic algorithmic model
  • Intro to OpenMP, a practical programming model
  • Comparison-based sorting algorithms
  • Scans and linked list algorithms
  • Tree algorithms
  • Graph algorithms, e.g., breadth-first search
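
To make the work-span idea concrete, here is a minimal sketch in C, not course material. It sums an array by divide and conquer and, alongside the sum, tallies the two quantities the model cares about: work (total operations) and span (the longest chain of dependent operations). The names Cost and fork_join_sum are hypothetical, and counting one unit per leaf and per combine step is an assumed cost convention. The code itself runs serially; it only computes what a parallel execution would cost.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record: the result plus its cost in the work-span model. */
typedef struct {
    long sum;   /* value being computed */
    long work;  /* total unit operations over all branches */
    long span;  /* longest chain of dependent unit operations */
} Cost;

/* Sum a[lo..hi) by divide and conquer. The two recursive calls touch
 * disjoint halves, so a runtime (e.g., OpenMP tasks) could run them
 * concurrently. Here they run one after the other; only the cost
 * accounting treats them as parallel. */
Cost fork_join_sum(const long *a, size_t lo, size_t hi)
{
    assert(lo < hi);                    /* range must be non-empty */
    if (hi - lo == 1) {
        Cost leaf = { a[lo], 1, 1 };    /* one unit of work, depth 1 */
        return leaf;
    }
    size_t mid = lo + (hi - lo) / 2;
    Cost left  = fork_join_sum(a, lo, mid);
    Cost right = fork_join_sum(a, mid, hi);

    Cost out;
    out.sum  = left.sum + right.sum;
    out.work = left.work + right.work + 1;   /* work adds across branches */
    out.span = (left.span > right.span ? left.span : right.span) + 1;
                                             /* span takes the slower branch */
    return out;
}
```

Calling fork_join_sum(a, 0, 8) on an 8-element array reports work 15 and span 4 under this convention: work grows linearly with the input, while span grows only with the depth of the recursion tree.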

  • Unit 2: Distributed memory or network models

In this model, the idea is that there is not one serial RAM, but many serial RAMs connected by a network. Each serial RAM’s memory is private, meaning the other RAMs cannot read or write it directly; consequently, the processors must coordinate and communicate by sending and receiving messages.

Sub-topics include:

  • The basic algorithmic model
  • Intro to the Message Passing Interface, a practical programming model
  • Reasoning about the effects of network topology
  • Dense linear algebra
  • Sorting
  • Sparse graph algorithms
  • Graph partitioning
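
As a rough illustration of why message counts and network shape matter in this model, the sketch below simulates a tree-shaped sum reduction across p ranks inside a single process. It is not MPI code: each array slot stands in for one rank's private memory, and each addition from one slot into another stands in for one message. The names ReduceStats and tree_reduce are hypothetical. In a real MPI program the same pattern is usually obtained from the MPI_Reduce collective.

```c
#include <assert.h>

/* Hypothetical record of what the simulated reduction cost. */
typedef struct {
    long result;    /* final sum, held by rank 0 */
    int  rounds;    /* communication rounds (messages in a round overlap) */
    int  messages;  /* total point-to-point messages */
} ReduceStats;

/* local[r] plays the role of rank r's private value. In each round,
 * rank r + stride "sends" its partial sum to rank r, which adds it in.
 * The array is overwritten with partial sums. */
ReduceStats tree_reduce(long *local, int p)
{
    assert(p > 0);
    ReduceStats s = { 0, 0, 0 };
    for (int stride = 1; stride < p; stride *= 2) {
        for (int r = 0; r + stride < p; r += 2 * stride) {
            local[r] += local[r + stride];   /* one simulated message */
            s.messages++;
        }
        s.rounds++;   /* all messages of this round could be in flight at once */
    }
    s.result = local[0];
    return s;
}
```

With p = 8 the simulation uses 7 messages in 3 rounds. The message total is the same p - 1 that a simple "everyone sends to rank 0" scheme would need, but the tree finishes in about log2(p) rounds instead of p - 1 sequential receives at a single rank.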

  • Unit 3: Two-level memory or I/O models

In this model, we return to a serial RAM, but instead of having only a processor connected to a main memory, there is a smaller but faster scratchpad memory in between the two. The algorithmic question here is how to use the scratchpad effectively, in order to minimize costly data transfers from main memory.

Sub-topics include:

  • Basic models
  • Efficiency metrics, including “emerging” metrics like energy and power
  • I/O-aware algorithms
  • Cache-oblivious algorithms
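
One common example of an I/O-aware algorithm is tiled (blocked) matrix multiplication, sketched below in C under stated assumptions. The function name tiled_matmul and the tile-size parameter bs are hypothetical. The idea is to pick bs so that three bs-by-bs tiles fit in the fast memory at once; each tile is then reused many times before it is evicted, instead of being refetched from main memory for every inner product. In the classical analysis, with a fast memory of Z words (Z is a symbol introduced here), this cuts slow-memory transfers from roughly n^3 to roughly n^3 / sqrt(Z).

```c
#include <assert.h>
#include <stddef.h>

/* Smaller of two sizes; used to clip tiles at the matrix edge. */
static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* Computes C += A * B for n-by-n row-major matrices, one bs-by-bs tile
 * at a time. The caller must initialize C (for example, to zero). */
void tiled_matmul(size_t n, size_t bs,
                  const double *A, const double *B, double *C)
{
    assert(bs > 0);
    for (size_t ii = 0; ii < n; ii += bs) {
        for (size_t kk = 0; kk < n; kk += bs) {
            for (size_t jj = 0; jj < n; jj += bs) {
                size_t i_end = min_size(ii + bs, n);
                size_t k_end = min_size(kk + bs, n);
                size_t j_end = min_size(jj + bs, n);
                /* Within this block only one tile each of A, B, and C
                 * is touched, so all three can stay in fast memory. */
                for (size_t i = ii; i < i_end; ++i) {
                    for (size_t k = kk; k < k_end; ++k) {
                        double a_ik = A[i * n + k];
                        for (size_t j = jj; j < j_end; ++j)
                            C[i * n + j] += a_ik * B[k * n + j];
                    }
                }
            }
        }
    }
}
```

The result does not depend on bs; only the memory access pattern does. A cache-oblivious variant would aim for the same reuse by recursive subdivision, without taking bs as a parameter.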

Prerequisites and Requirements

A “second course” in algorithms and data structures, a la Georgia Tech’s CS 3510-B or Udacity’s Intro to Algorithms

For the programming assignments, programming experience in a “low-level” “high-level” language like C or C++

Experience using command line interfaces in *nix environments (e.g., Unix, Linux)

Course readiness survey. You should feel comfortable answering questions like those found in the Readiness Survey Course, HPC-0

See the Technology Requirements for using Udacity.

Why Take This Course

What do I get?
  • Instructor videos
  • Learn-by-doing exercises
  • Taught by industry professionals
