We have covered many models (decision trees, neural networks, and so on) and algorithms (ID3, gradient descent, etc.) in this course. Often, when a dataset is given, the goal is to find the model that fits it best; that is, the model that generalizes best to data we haven't seen yet. Identifying that model is an important skill for a data scientist. This project will help you analyze a dataset and teach you how to choose the model that generalizes best for that dataset.
If you were a carpenter, this project would be comparable to building a desk or a tree house. A carpenter does not build the tools required for the construction, but knows which tools to use to get the job done. Similarly, it is more important for a good data scientist to know which tools (that is, which machine learning models) to use to get the job done than it is to find the best model.
The project includes an analysis of a popular dataset. In the analysis, we perform a number of experiments on the dataset and show you the output from each experiment. Your assignment is to ask and answer interesting questions by looking at the experiments and analyzing the output.
We have included questions within the analysis to get you started, but you will earn bonus points if you go beyond what is given, ask interesting questions about the dataset, and answer them. Before you start, look at the rubric to understand what we expect. We expect a concise argument in answer to each question.
The sections below will help you understand the logistics of the final project. These documents are intended for students with a Udacity Coach who enrolled in the full course experience. If you are previewing the courseware, you are welcome to look at these documents as well, but understand that you will not submit your project to Udacity.
Optionally, you can implement these experiments yourself and provide us with an additional analysis of a dataset of your choice. You can use this project to conduct similar experiments on any dataset you choose, ask interesting questions, and then answer them in your analysis.
There are no additional points for coding, but some of you may be curious and want to analyze the dataset on your own. To help you get started, here is an example of using scikit-learn to analyze a dataset. Feel free to use and modify this code. If you think you can improve it, you can email your Python program, written using concepts from the course, to your personal Udacity Coach. If you don't know your Coach's email address, send it to firstname.lastname@example.org.
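For a rough idea of what such a scikit-learn analysis might look like, the following is a minimal sketch and not the course's own example code. The Iris dataset, the two candidate models, their hyperparameters, and the use of 5-fold cross-validation are all assumptions chosen for illustration, and the function name compare_models is hypothetical. It estimates how well each model generalizes by averaging accuracy over held-out folds.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def compare_models(X, y, cv=5):
    """Return the mean cross-validated accuracy of each candidate model.

    The candidates and their hyperparameters are illustrative assumptions,
    not values prescribed by the project.
    """
    candidates = {
        "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
        # Neural networks usually train better on standardized features,
        # so the scaler is fitted inside each cross-validation fold.
        "neural_network": make_pipeline(
            StandardScaler(),
            MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
        ),
    }
    scores = {}
    for name, model in candidates.items():
        # cross_val_score fits on cv-1 folds and scores on the held-out fold.
        scores[name] = cross_val_score(model, X, y, cv=cv).mean()
    return scores


if __name__ == "__main__":
    # Iris is an assumed stand-in for "a popular dataset".
    X, y = load_iris(return_X_y=True)
    for name, score in compare_models(X, y).items():
        print(f"{name}: mean accuracy = {score:.3f}")
```

To try this on a dataset of your choice, replace the load_iris call with your own feature matrix X and label vector y, then compare the reported scores. A higher mean accuracy on held-out folds suggests, but does not prove, better generalization.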