
Getting your data projects online to get hired

How do you prepare for a data analyst interview?

At Udacity, we strive to be as responsive as possible to student queries of all kinds, and virtually every member of every team gets the opportunity to speak directly with students at one time or another. One question that has come up a great deal lately is how to prepare for a data analyst interview. To address it, our own Mat Leonard, a Udacity course developer, is here to share some thoughts and experience to help you nail your data analyst interview. His first tip? Prepare a data analyst portfolio by getting your projects online for all to see.

First, a bit of “official” background on Mat:

Mat Leonard earned a PhD in Physics from UC Berkeley, where he wrote his dissertation on neural activity related to short-term memory. When it came time to make sense of his data, he turned to Python and its scientific stack, including NumPy, scikit-learn, and pandas. He created his personal blog, Matatat.org, to publish small data projects online. For example, he explored linear regression models for predicting body fat percentage and a Bayesian approach to A/B testing.

And with all that said, here is Mat on our subject for today!

Putting Your Small Data Projects Online

At our recent Intersect summit, a student asked me how to gain the data analysis experience needed to land a job. My suggestion was to work on small data analysis projects and then put them online as a data analyst portfolio, as I did with my blog. Small projects let you deepen your understanding of analysis methods or learn new techniques. Publishing them online builds a portfolio of your work, showing potential employers that you can successfully answer questions with data.

There are a few bits of technology that make getting your projects online a simple process. Firstly, Jupyter notebooks are an excellent tool for combining text, code, and images. Notebooks can be converted to Markdown files for use with static site generators such as Pelican. Finally, you can host your blog for free on GitHub. For the information to follow, you’ll need to know basic usage of git and GitHub: how to stage, commit, and push changes. You can learn about git and GitHub in our excellent course on version control. You’ll also need to be comfortable working from the command line.

Building Your Blog

To build the blog itself, you can use Pelican, a static site generator written in Python. Pelican uses the Markdown files as blog posts and automatically creates an archive, categories, and tags. There are multiple themes to use with Pelican or you can make your own for a unique and personal touch.

(Image: Mat's sample blog)

Also, since Pelican creates a static site, you can host it on GitHub for free.

I’ll lead you through setting up your own blog and give you some suggestions for interesting projects. Here, I’m assuming you have experience with Git and GitHub, as well as Python and shell commands.

Firstly, you’ll want to install Jupyter and Pelican; just follow the installation instructions for both packages. I suggest installing these in a virtualenv or Anaconda environment. In your Pelican site folder, files in the content folder will be used as blog posts, so this is where you’ll place the notebook Markdown files. To create the Markdown file, run this in your terminal:

$ jupyter nbconvert --to markdown /path/to/notebook.ipynb --output-dir ~/projects/yoursite/content

This will create a file notebook.md in the content folder. If there are images in the notebook, a folder called notebook_files will be created containing the image files. These files need to be moved to the images folder, where Pelican expects to find image files. The image links in notebook.md also need to be changed to the appropriate location. To do this, I wrote a short script which you can find in this gist. Simply copy the script to the content folder and run it, passing a file as an argument:

$ ./process_notebook.sh notebook.md
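
If you would rather do this step in Python, the two tasks described above (moving the image files and updating the links) can be sketched roughly as follows. This is not the script from the gist; the function name relocate_notebook_images and the "/images" link prefix are assumptions made for illustration, so adjust the prefix to whatever your Pelican configuration expects.

```python
import re
import shutil
from pathlib import Path


def relocate_notebook_images(markdown_path, images_dir, link_prefix="/images"):
    """Move nbconvert's <name>_files folder into images_dir and fix the links.

    Hypothetical helper: link_prefix is an assumed URL prefix for images,
    not something Pelican or nbconvert defines for you.
    """
    markdown_path = Path(markdown_path)
    images_dir = Path(images_dir)
    # nbconvert writes images next to notebook.md, in notebook_files/
    files_dir = markdown_path.with_name(markdown_path.stem + "_files")

    if not files_dir.is_dir():
        return []  # the notebook had no images, nothing to do

    images_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for image in sorted(files_dir.iterdir()):
        shutil.move(str(image), str(images_dir / image.name))
        moved.append(image.name)
    files_dir.rmdir()  # the folder is empty now

    # Rewrite links such as ![png](notebook_files/plot.png)
    text = markdown_path.read_text(encoding="utf-8")
    pattern = re.compile(r"\(" + re.escape(files_dir.name) + "/")
    text = pattern.sub(lambda match: "(" + link_prefix + "/", text)
    markdown_path.write_text(text, encoding="utf-8")
    return moved
```

Calling relocate_notebook_images("content/notebook.md", "content/images") from your site folder would move the files and return the list of image names it moved.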

You’ll also want to edit notebook.md and add metadata to the beginning of the file, something like:

Title: My First Project
Date: 2016-02-03 10:20
Category: Regression
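
If you convert notebooks often, you may not want to type these lines by hand each time. Here is a minimal sketch that prepends the same three fields to a post; the function name add_pelican_metadata is made up for this example, and it assumes the file does not already start with metadata.

```python
from pathlib import Path


def add_pelican_metadata(markdown_path, title, date, category):
    """Prepend Title/Date/Category lines to a Markdown post.

    Hypothetical helper: it does not check for existing metadata.
    """
    path = Path(markdown_path)
    body = path.read_text(encoding="utf-8")
    header = f"Title: {title}\nDate: {date}\nCategory: {category}\n\n"
    # Drop leading blank lines so the post starts right after the header
    path.write_text(header + body.lstrip("\n"), encoding="utf-8")
    return header
```

For the example above, you would call add_pelican_metadata("notebook.md", "My First Project", "2016-02-03 10:20", "Regression").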

You can learn more about file metadata in the Pelican documentation. Now that you have the notebook file in the content folder, and everything is in its right place, you can create the site files with:

$ pelican content

The site files are written to the output directory; all the HTML and CSS files needed to view your site are located there. Time to get your site out there for the world to see!

Hosting the site on GitHub is quite simple with GitHub Pages. Create a repository named username.github.io, where username is simply your GitHub username. After creation, clone the repository to your computer; your projects folder is a good place to keep this. Now you need to copy the files from the output directory to the username.github.io repository folder. You can do this with:

$ cp -r ~/projects/yoursite/output/* ~/projects/username.github.io

Then, stage and commit all the files in ~/projects/username.github.io. To publish your website, push the repository to GitHub. Check out your new blog at http://username.github.io! (And make sure to include this link on your resume before you start looking for that first data analyst interview.)
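
If you republish often, the copy step above can also be scripted. The following is only a sketch using the standard library: publish_site is a hypothetical name, and it needs Python 3.8 or later for the dirs_exist_ok argument. It copies the generated files into the existing clone and leaves staging, committing, and pushing to you.

```python
import shutil
from pathlib import Path


def publish_site(output_dir, repo_dir):
    """Copy Pelican's generated files into the GitHub Pages repository folder.

    Hypothetical helper: it only copies files and does not run any git commands.
    """
    output_dir = Path(output_dir).expanduser()
    repo_dir = Path(repo_dir).expanduser()
    if not output_dir.is_dir():
        raise FileNotFoundError(f"No Pelican output found at {output_dir}")

    # dirs_exist_ok lets us copy into the clone that already exists,
    # overwriting older site files and leaving everything else untouched
    shutil.copytree(output_dir, repo_dir, dirs_exist_ok=True)

    # Report what was copied, as paths relative to the output folder
    return sorted(
        p.relative_to(output_dir).as_posix()
        for p in output_dir.rglob("*")
        if p.is_file()
    )
```

With the folders used in this post, the call would be publish_site("~/projects/yoursite/output", "~/projects/username.github.io").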

Your Data Projects: Where To Start

To flesh out your new website with content, you’ll want to do some small data projects. Kaggle hosts data science competitions that are a great place to start. The data is typically already cleaned and formatted, so you can focus on building the model. Kaggle also hosts a bunch of data sets not associated with competitions for you to explore, such as Hillary Clinton’s emails. Other resources include the datasets subreddit, Data.gov, and many more that you can find in this Quora thread. Many cities also have open data sets, such as San Francisco and New York City. You can also go web scraping to collect data, like I did with Yelp.

The most important thing is to find data you are interested in that raises questions you want to answer. Take this time to learn new techniques and dive deeper into methods you already know. Working on small data projects and putting them online can really support your career objectives, especially since it shows employers that you love working with data and want to continue improving yourself. You’ll have lots of interesting items to speak about when you land your data analyst interview. Hopefully, with my guidance above, you can create your own online portfolio and get your excellent work out there for the world to see!

Thanks to Mat for taking the time to offer his insights on this subject, and thanks especially to our students for raising all the right questions, all the time. Keep ’em coming!



Mat Leonard
Mat Leonard is Product Lead for Udacity's School of Artificial Intelligence. He is a former physicist, research neuroscientist, and data scientist. He did his PhD and Postdoctoral Fellowship at the University of California, Berkeley.