A favorite among data scientists, the R programming language offers ease of use and versatility across a wide spectrum of projects.
Working on a simple project in R is a great introduction to the language and its programming environment. It will not only give you a sense of the libraries at your disposal but also valuable exposure to real data science work.
In this article, we’ll introduce R and data science generally, before suggesting a few projects for R novices.
What Is R?
R is a programming language with an emphasis on data and statistics. Unlike some programming languages, R includes an interactive environment, the R console, where you can try out commands and see the results immediately.
The language focuses on statistical programming and includes many libraries and functions that make statistics and data analysis a breeze.
R’s console is available in the command line and as a standalone app, and there are tools like RStudio that build on top of the base R environment, providing a nicer interface and more functionality.
One of R’s key features is its set of graphing tools. Even when you use R from the command line, it’s easy to create charts and graphs and display them as images. This ability to visualize data helps data scientists gain insights, such as identifying trends and understanding how different datasets relate to each other.
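As a minimal sketch of this, the snippet below uses only base R to draw a chart from the built-in pressure dataset and save it as a PNG image. The output file name and the chart labels are our own choices for illustration.

```r
# Built-in dataset: vapor pressure of mercury at different temperatures
data(pressure)

# Write the chart to a PNG file in a temporary folder (file name is illustrative)
out_file <- file.path(tempdir(), "pressure.png")
png(out_file, width = 600, height = 400)

# Points connected by lines, with axis labels and a title
plot(pressure$temperature, pressure$pressure,
     type = "b",
     xlab = "Temperature (deg C)",
     ylab = "Pressure (mm Hg)",
     main = "Vapor pressure of mercury")

# Close the graphics device so the file is written to disk
invisible(dev.off())
```

In an interactive session you can skip the png() and dev.off() calls, and the chart opens in a plot window instead.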
What Is Data Science and How Does It Relate to Machine Learning?
Before we dive into the specifics of using R for data science, it’s important to touch on what data science is and how it relates to machine learning.
The goal of data science is to extract insights and learnings from data. To achieve this goal, data scientists use methods from fields like statistics, mathematics, computer science, and data visualization.
A data science project can involve a combination of techniques from multiple fields. For example, you might use simple arithmetic to bring all data points onto the same scale, and then use descriptive statistics to understand what the data looks like at a high level.
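Here is a small sketch of that two-step idea in base R. The city figures are made up for illustration: raw counts are first converted to a rate per 1,000 inhabitants so that they become comparable, and then summarized.

```r
# Hypothetical data: raw counts are not comparable across cities of different size
cities <- data.frame(
  city       = c("A", "B", "C", "D"),
  cases      = c(120, 45, 300, 80),
  population = c(60000, 15000, 200000, 20000)
)

# Simple arithmetic: express every count per 1,000 inhabitants
cities$rate_per_1000 <- cities$cases / cities$population * 1000

# Descriptive statistics give a high-level picture of the rescaled values
summary(cities$rate_per_1000)
mean(cities$rate_per_1000)
sd(cities$rate_per_1000)
```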
Techniques from computer science like machine learning enable data scientists to extract additional learnings from data, in a way that’s not as easily achievable by mathematical or statistical methods alone.
For example, clustering is a machine-learning technique that is particularly helpful for automatically grouping similar data points into categories. If, as a data scientist, you’re looking to categorize data points at scale, you’re much better off using a clustering-based solution than assigning categories manually.
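To make this concrete, here is a minimal sketch using k-means, one common clustering method, from base R on the built-in iris dataset. The choice of three clusters is an assumption we make for this example; in a real project you would have to justify it.

```r
# Fix the random seed so the random starting centers are reproducible
set.seed(42)

# Use the four numeric measurements of the built-in iris data; drop the species label
measurements <- iris[, 1:4]

# Ask k-means for three groups (k = 3 is an assumption of this sketch)
fit <- kmeans(measurements, centers = 3, nstart = 20)

# Cluster assigned to the first few flowers
head(fit$cluster)

# How the automatically found clusters line up with the known species
table(cluster = fit$cluster, species = iris$Species)
```

The algorithm never sees the species column, so the final table shows how well the automatic grouping recovers categories that a person would otherwise assign by hand.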
An advantage of R for data science is that the language facilitates your use of computer science techniques like machine learning. Let’s take a look at how exactly you could use R for data science.
Getting Started With R for Data Science
R provides a rare combination of ease of use and power: The R environment is easy to install and configure, and the built-in documentation is comprehensive. R’s ecosystem includes many useful libraries, like the visualization package Plotly and the state-of-the-art classification package XGBoost, which together help you gain valuable insights into your data.
Let’s say you’ve found a dataset that interests you (check out this list for a few suggestions) and want to get started with your first project. Understanding which steps to take and in which order can seem a bit overwhelming at first. Here, we have some suggestions for you on how to proceed through your project.
Consider taking on these projects as learning tasks — by the end you will have experimented with R and learned more about how the R environment works.
Data Inspection and Analysis Exercises
Just about any data science project will require you to get a high-level picture of your dataset’s structure. Thankfully, R makes this easy as it includes a number of built-in functions for data inspection and manipulation.
Here are a few data explorations you can do as a first R exercise:
- create a data frame and inspect its size (number of rows and columns);
- check whether any rows or columns have any missing or clearly incorrect values (such as unrealistically high or low values, or values that don’t fit the format expected for this column, e.g. a string instead of an integer);
- perform statistical operations like calculating means, medians, maximums, minimums, and standard deviations;
- clean your data: this includes deciding on whether you want to delete rows or columns with lots of missing or corrupted values, or whether you prefer to replace invalid entries by using an imputation technique (such as calculating the median or mean value);
- filter and sort the data by one or more measures.
Check out this blog post for some great tips on quickly summarizing your data.
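The exercises above could look roughly like this in base R. We use the built-in airquality dataset because it already contains missing values; the temperature threshold used for filtering is an arbitrary choice for this sketch.

```r
# Built-in dataset with daily air quality measurements (contains missing values)
df <- airquality

# Size: number of rows and columns
dim(df)

# Number of missing values in each column
colSums(is.na(df))

# Summary statistics for one column, ignoring missing values
mean(df$Ozone, na.rm = TRUE)
median(df$Ozone, na.rm = TRUE)
range(df$Ozone, na.rm = TRUE)
sd(df$Ozone, na.rm = TRUE)

# Cleaning option 1: drop every row that has a missing value
complete <- na.omit(df)

# Cleaning option 2: impute missing Ozone readings with the column median
imputed <- df
imputed$Ozone[is.na(imputed$Ozone)] <- median(df$Ozone, na.rm = TRUE)

# Filter to hot days (threshold is arbitrary), then sort by ozone, highest first
hot <- imputed[imputed$Temp > 85, ]
hot <- hot[order(-hot$Ozone), ]
head(hot)
```

Which cleaning option is appropriate depends on your data: dropping rows is simple but throws information away, while imputation keeps every row at the cost of introducing estimated values.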
Working With Time Series
Frequently, data science projects include analyzing how data points change over a period of time. Perhaps you want to understand weather trends in your area over the past couple of years, or maybe you want to compare vinyl record prices online.
Or perhaps you have access to data on the lifespans of the kings of England dating back to the 11th century and want to look at certain trends. Did these monarchs live longer, on average, as civilization developed?
R is a great fit for time series analysis projects. For example, with R it is straightforward to read common time-series formats (like CSV files), interpret date and time formats, plot how data points change over time, and extract seasonal components from the time-series data through decomposition. The “Using R for Time Series Analysis” tutorial by Avril Coghlan is a great resource for getting started with time series in R.
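As a rough sketch of those steps in base R, the snippet below parses dates from CSV data, plots a series, and decomposes it. The inline CSV text is made up and stands in for a real file, and the built-in AirPassengers series is our choice of example data, not the dataset used in the tutorial mentioned above.

```r
# Parse a small CSV with a date column (inline text stands in for a real file)
csv_text <- "date,value
2023-01-01,10
2023-02-01,12
2023-03-01,9"
readings <- read.csv(text = csv_text)

# Convert the text column into proper Date values
readings$date <- as.Date(readings$date, format = "%Y-%m-%d")
str(readings)

# Built-in monthly time series: airline passengers, 1949 to 1960
plot(AirPassengers, ylab = "Passengers (thousands)")

# Split the series into trend, seasonal, and random components
parts <- decompose(AirPassengers, type = "multiplicative")
plot(parts)

# The seasonal pattern: one factor per month of the year
round(parts$figure, 2)
```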
Continuing on the note of trends, linear regression can be a helpful way to show how two variables are related. With linear regression, you can investigate whether one variable clearly responds to changes in another, and you can use the resulting regression model for simple predictions.
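For instance, a minimal linear regression in base R might look like the sketch below. It uses the built-in cars dataset, and the speed used for the prediction is an arbitrary value we picked for illustration.

```r
# Built-in dataset: speed of cars and the distance they took to stop
model <- lm(dist ~ speed, data = cars)

# Coefficients, p-values, and R-squared describe the relationship
summary(model)

# Use the fitted model for a simple prediction at a new speed (arbitrary value)
prediction <- predict(model, newdata = data.frame(speed = 21))
prediction
```

A positive coefficient for speed indicates that stopping distance tends to grow as speed increases.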
Simple Neural Networks
You cannot necessarily predict everything using a simple linear regression model: the world, for better or worse, is not generally based on a clearly linear relationship between variables. For more complex predictions and for working with unstructured data like natural language, a neural network can be more effective, and implementing a simple neural network is a straightforward task with R.
That being said, neural networks are definitely on the more advanced end of machine learning techniques and require some expertise to set up. Due to their black-box character, they can be hard to interpret even for advanced machine-learning engineers.
Check out this R neural network walkthrough on GitHub. The repository includes the end-to-end R code and the output, allowing you to compare your own results if you get stuck.
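If you would like a smaller starting point first, here is a minimal sketch of our own; it is not the code from that walkthrough. It assumes the nnet package is available (it is distributed with R as a recommended package) and trains a network with a single hidden layer on the built-in iris dataset. The hidden layer size, iteration limit, and train/test split are arbitrary choices.

```r
# nnet is distributed with R as a recommended package
library(nnet)

# Fix the seed so the split and the initial weights are reproducible
set.seed(1)

# Hold out 30 of the 150 iris flowers for testing
test_idx <- sample(nrow(iris), 30)
train <- iris[-test_idx, ]
test  <- iris[test_idx, ]

# One hidden layer with 4 units; size and maxit are arbitrary for this sketch
net <- nnet(Species ~ ., data = train, size = 4, maxit = 200, trace = FALSE)

# Predict the species of the held-out flowers
predicted <- predict(net, newdata = test, type = "class")

# Compare predictions with the true labels
table(predicted, actual = test$Species)
mean(predicted == test$Species)
```

The last line reports the share of held-out flowers that were classified correctly.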
Moving into more advanced tasks, R lets you easily analyze sentiment in snippets of natural language. The R package ecosystem includes a number of NLP packages that abstract away some of the tedious tasks and let the data scientist focus on extracting learnings from the dataset at hand.
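To show the basic idea without any extra packages, here is a toy sketch in base R that scores snippets by counting positive and negative words. The word lists and example sentences are made up for illustration, and the function score_sentiment is our own hypothetical helper. Real NLP packages rely on much larger lexicons or trained models.

```r
# Toy lexicon: made-up word lists, not a real sentiment dictionary
positive_words <- c("good", "great", "love", "excellent", "happy")
negative_words <- c("bad", "terrible", "hate", "poor", "sad")

# Score = number of positive words minus number of negative words
score_sentiment <- function(text) {
  # Lowercase the text and split it on anything that is not a letter or apostrophe
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  sum(words %in% positive_words) - sum(words %in% negative_words)
}

snippets <- c(
  "I love this product, it is great",
  "Terrible service and poor quality",
  "It arrived on Tuesday"
)

# Score every snippet and show the results side by side
scores <- vapply(snippets, score_sentiment, integer(1), USE.NAMES = FALSE)
data.frame(snippet = snippets, score = scores)
```

A score above zero suggests positive sentiment, below zero negative, and zero neutral or unknown. This simple counting approach ignores negation and context, which is one reason dedicated packages exist.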
Become a Data Scientist
We hope that was enough to get you excited about getting started with R!
By now you’re well aware that R is a great fit for data science projects, and that you can learn a lot by playing around with all the available functions.
If you want to go deeper into R, our specialized Data Science Nanodegree is your best bet. In this expert-guided course, you’ll run data pipelines, build recommendation systems, and finish by developing your own open-ended data science project.
Enroll in Udacity’s Data Science Nanodegree today!