Introduction to Hadoop and MapReduce

Thank you for signing up for the course! We look forward to working with you and hearing your feedback in our forums.


Need help getting started?


Course Resources

Alternative: download and unzip the data sets instead of using the Virtual Machine.

Additional Reading

  1. Tom White's Essential Text, Hadoop: The Definitive Guide

Virtual Machine

Setting up the VM & datasets

  • Instructions for downloading and setting up the VM.

  • If you need root access to your virtual machine, the root password is "training".

  • Additional dataset for Lesson 4 and the new project: forum_data.tar.gz. Download it to your VM, place it in the data directory, and run this command:

    tar zxvf forum_data.tar.gz

  • Known issues with unzipping: see the original forum post. In short, you may get a message saying the file is corrupt if you unzip it with 7-Zip or some other tools. On Linux, use the gunzip, tar zxvf <filename>, or unzip <filename> command instead. On Windows, download Git Bash and use the unzip command in Git Bash.
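
If none of the command-line tools above is available, the same extraction can be done from Python. This is a minimal sketch and not part of the course materials: it assumes Python 3 is installed, and the function name extract_archive and its dest_dir parameter are made up for illustration. It relies only on the standard library tarfile module.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Extract a gzip-compressed tar archive and return its member names.

    extract_archive and dest_dir are hypothetical names for this sketch.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r:gz" decompresses and untars in one pass, like `tar zxvf`.
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest)
    return names


# Example call, assuming forum_data.tar.gz sits in the current directory:
# for name in extract_archive("forum_data.tar.gz", "data"):
#     print(name)
```

Only extract archives from a source you trust, such as the course download, because tar members can carry arbitrary paths.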

Transferring files back and forth to the VM

  • Mac Users: Instructions for transferring files back and forth to the VM, and for copying and pasting between the two.

  • Windows Users: Instructions for transferring files back and forth to the VM, and for copying and pasting between the two.

  • For Windows users, a better document with pictures is coming soon.

Running a MapReduce job with the VM alias

  • hs {mapper script} {reducer script} {input_directory} {output_directory}

  • Note that input_directory must already exist inside HDFS (and contain data files such as purchases.txt), but output_directory must not exist. Hadoop will create it automatically and place it in HDFS.
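
To make the roles of the mapper and reducer scripts more concrete, here is a rough sketch of the pair in Python 3. It rests on several assumptions that this page does not confirm: that the scripts read lines from standard input and write tab-separated key/value lines to standard output (the usual Hadoop Streaming convention), and that the input is tab-separated with the key and value at the hypothetical column positions named in the constants. Adjust those constants to the real layout of your data.

```python
from itertools import groupby

# ASSUMPTION: input lines are tab-separated. These column positions are
# hypothetical placeholders, not the documented layout of purchases.txt.
KEY_COLUMN = 2
VALUE_COLUMN = 4
EXPECTED_COLUMNS = 6


def mapper(lines):
    """Yield one "key<TAB>value" line per well-formed input record."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != EXPECTED_COLUMNS:
            continue  # skip malformed records instead of failing the job
        yield f"{fields[KEY_COLUMN]}\t{fields[VALUE_COLUMN]}"


def reducer(sorted_lines):
    """Sum the values per key; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(float(value) for _, value in group)
        yield f"{key}\t{total:.2f}"


def run_locally(lines):
    """Simulate map -> sort -> reduce on a small sample, without Hadoop."""
    return list(reducer(sorted(mapper(lines))))
```

In a real job each function would live in its own script that loops over sys.stdin and prints each result. Running run_locally on a few sample lines first is a cheap way to catch parsing mistakes before submitting the job.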


Downloadable Materials

You can download Supplemental Materials, Lesson Videos and Transcripts from Downloadables (bottom right corner of the Classroom) or from the Dashboard (first option on the navigation bar on the left-hand side).

Video Downloads

Downloads can also be found here: Videos for local viewing

Course Syllabus

Lesson 1

What is "Big Data"? The dimensions of Big Data. Scaling problems.  HDFS and the Hadoop ecosystem.

Lesson 2

The basics of HDFS, MapReduce and the Hadoop cluster.

Lesson 3

Writing MapReduce programs to answer questions about data.

Lesson 4

MapReduce design patterns.

Final Project

The final project is an opportunity for you to practice what you learned in the course. In this project you will answer questions about big sales data and analyze large website logs. You are welcome to post your final project in the forums to receive feedback from your fellow students.

Here are instructions for how to run your code for the Final Project on your local machine, download a smaller test dataset to work with, and see the output expected when using this dataset.

Acknowledgements

Udacity would like to thank Cloudera, especially the instructors, Sarah Sproehnle and Ian Wrigley. And thanks to our students for your interest in Hadoop and MapReduce. We hope you enjoy the course!