R is incredibly popular within the data science industry, as reflected by its rapidly growing community. With its rich repository of packages numbering in the thousands, R has a package for virtually any data-related task. In this article, we’ll introduce you to some of the best R packages for data science.
R for Data Science
Whether you’re a seasoned programmer or a novice looking to get into data science, choosing the right tool for the job can be confusing. And, although software recommendations often come down to personal preference, R is the lingua franca of data science and for good reason.
R was not built as a general-purpose programming language. Rather, it was created by statisticians and geared specifically toward data analysis and number-crunching. Because R is the dominant language of statistical research, most cutting-edge statistical procedures first become available as R packages. This has resulted in a rich and varied data science ecosystem that’s constantly growing: There are currently over 17,000 R packages available on CRAN (the Comprehensive R Archive Network).
But despite its origins in academic research, R is also gaining ground in industry: A recent survey on the popularity of programming languages saw R climb 9 positions compared to the previous year. This surge in interest is likely due to an increased demand for data-related jobs and the data science industry’s subsequent adoption of R.
Learn more about what makes R a great choice for data science, as we provide an overview of some of the best R data science packages.
ggplot2 for Data Visualization
Data visualization is the visual summary of our data or findings, usually in the form of a graph or a chart. It’s a crucial step in the data scientist’s workflow since visualization often reveals trends and insights that would otherwise remain hidden. Additionally, effective visualization allows both technical and non-technical audiences to grasp our findings intuitively and clearly.
Effective visual communication is so central to ggplot2 that its creator, Hadley Wickham, based the package on Leland Wilkinson’s book “The Grammar of Graphics” (1999), which formally describes graphics and provides rules for effective data visualization. Built according to this grammar, ggplot2 is the standard R package for data visualization.
ggplot2 has a steep learning curve, but mastering it is well worth the effort: ggplot2 offers flexibility and visualization capabilities far superior to most other visualization tools.
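To give a feel for the grammar, here is a minimal sketch that builds a scatter plot layer by layer. It assumes ggplot2 is installed and uses the mtcars dataset that ships with R. The choice of variables and labels is ours, purely for illustration.

```r
library(ggplot2)

# mtcars ships with R: wt is weight in 1000 lbs, mpg is miles per gallon.
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +                                     # one point per car
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # linear trend per group
  labs(
    title = "Fuel economy vs. weight",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    colour = "Cylinders"
  )

# Printing the plot object renders it on the active graphics device.
print(p)
```

Each `+` adds one component (data mapping, geometric layer, labels), so a plot can usually be refined by adding or swapping a single line.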
dplyr and dbplyr for Data Wrangling
The data we receive is rarely ready for visualization and analysis right away. Typically, the data needs to undergo various transformations such as filtering, aggregation and summarization. dplyr (evocative of “data pliers”) defines data transformation through five simple functions, often called “verbs”:
- select(), for choosing the columns that you want to work with
- filter(), for selecting rows
- arrange(), for reordering rows
- mutate(), for creating new variables out of existing ones
- summarize(), for summarizing variables into a single value
Although base R is capable of performing these operations, dplyr’s code is simpler and more understandable. However, the real power of dplyr lies in its ability to chain the verbs in a logical, human-readable sequence. When used in concert, the verbs are powerful enough to cover the majority of data transformation tasks.
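As a rough illustration of chaining, the sketch below runs all five verbs over the built-in mtcars dataset. It assumes dplyr is installed. The column names `kpl`, `n_cars` and `mean_kpl` are made up for this example, and `group_by()` is an additional dplyr helper (not one of the five verbs) that makes `summarize()` work per group.

```r
library(dplyr)

heavy_car_summary <- mtcars %>%
  select(mpg, cyl, wt) %>%             # keep only the columns we need
  filter(wt > 3) %>%                   # keep cars heavier than 3000 lbs
  mutate(kpl = mpg * 0.4251) %>%       # new variable: approximate km per litre
  group_by(cyl) %>%                    # summarize separately per cylinder count
  summarize(
    n_cars = n(),
    mean_kpl = mean(kpl)
  ) %>%
  arrange(desc(mean_kpl))              # most economical group first

print(heavy_car_summary)
```

Read top to bottom, the pipeline states what happens to the data at each step, which is the readability benefit described above.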
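Additionally, you can use dbplyr (dplyr’s database backend) to work with data stored in a database. dbplyr translates dplyr’s familiar verbs into SQL, so you can query a database without writing the SQL by hand. If you do work with the generated SQL directly, a linter such as sqlfluff can help keep it readable and catch errors.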
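The sketch below shows what dbplyr's SQL translation can look like without needing a live database. It assumes dplyr and dbplyr are installed. The table and its columns (`customer_id`, `amount`, `status`) are hypothetical, and `lazy_frame()` with `simulate_postgres()` stands in for a real connection, which in practice you would typically create through the DBI package.

```r
library(dplyr)
library(dbplyr)

# A stand-in for a remote table: no database is contacted.
# simulate_postgres() only tells dbplyr which SQL dialect to generate.
orders <- lazy_frame(
  customer_id = 1L,
  amount = 10,
  status = "paid",
  con = simulate_postgres()
)

query <- orders %>%
  filter(status == "paid") %>%
  group_by(customer_id) %>%
  summarize(total = sum(amount, na.rm = TRUE)) %>%
  arrange(desc(total))

# Print the SQL that dbplyr would send to the database.
show_query(query)

# The same SQL as a plain character string, e.g. for logging or linting.
sql_text <- as.character(sql_render(query))
```

Nothing is executed until results are requested, so the same dplyr code can be inspected as SQL first and run against the database later.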
mlr3 and caret for Machine Learning
mlr3 (short for machine learning in R) is another essential R data science package. It provides a framework for machine learning: a common interface to many of the machine learning packages available on CRAN, extended with tools for evaluating trained models, cross-validation, hyperparameter tuning and more. A number of extension packages build on mlr3 as well.
caret (short for classification and regression training) is another popular choice for machine learning tasks. Machine learning tools used to be scattered across many different packages that were not always compatible. caret and mlr3 both offer a unified approach, covering most of what a machine learning practitioner might need. The two are similar in purpose, but mlr3 offers greater functionality than caret at the cost of a steeper learning curve.
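For a sense of the unified interface, here is a minimal mlr3 sketch that cross-validates a decision tree on the built-in iris task. It assumes mlr3 and rpart are installed. The learner, the number of folds and the seed are arbitrary choices made for illustration.

```r
library(mlr3)

set.seed(42)  # make the fold assignment reproducible

task <- tsk("iris")                  # built-in classification task
learner <- lrn("classif.rpart")      # decision tree from the rpart package
resampling <- rsmp("cv", folds = 5)  # 5-fold cross-validation

# Train and evaluate the learner on every fold.
rr <- resample(task, learner, resampling)

# Average classification accuracy across the folds.
accuracy <- rr$aggregate(msr("classif.acc"))
print(accuracy)
```

Swapping in a different learner or resampling strategy generally means changing only the corresponding `lrn()` or `rsmp()` call, while the rest of the code stays the same.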
knitr for Generating Reports
knitr (perhaps a play on “neater”) is a package that automates report generation. Typically, you’d write a statistical report by copying and pasting the relevant results of your analysis into a word processor and typing in additional information about these results. This process is error-prone due to the amount of manual work and the multitasking between analysis and writing.
knitr dynamically generates reports from R code by combining chunks of code with their corresponding descriptions. This way, whenever you compile a document, any changes made to the R code are reflected in the corresponding section of the report. knitr is especially useful if you’re a researcher or need to generate reports of your data analyses. It can weave your R code into formats such as Markdown and LaTeX, and it can produce reports in HTML.
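The idea can be sketched with a tiny report defined directly in R. This example assumes knitr is installed. The report text is invented for illustration, and in practice the source would normally live in its own .Rmd file rather than in a character vector.

```r
library(knitr)

# Build the chunk delimiter programmatically (three backticks).
fence <- strrep("`", 3)

# A miniature R Markdown document: prose, one code chunk, one inline result.
report_source <- c(
  "# Fuel economy report",
  "",
  paste0(fence, "{r}"),
  "avg_mpg <- mean(mtcars$mpg)",
  fence,
  "",
  "The average fuel economy is `r round(avg_mpg, 1)` miles per gallon."
)

# knit() runs the chunk and replaces the inline expression with its value,
# returning the finished Markdown as a character string.
report_md <- knit(text = report_source, quiet = TRUE)
cat(report_md)
```

Because the number in the final sentence is computed when the document is compiled, changing the data or the calculation updates the report text automatically.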
tidyverse for General Data Science Tasks
According to its authors, the tidyverse is “an opinionated collection of R packages designed for data science.” The packages are widely regarded as a reflection of best practices thanks to their shared grammar, design philosophy and data structures. In designing the tidyverse, the founding team was guided by principles such as those expressed in The Zen of Python and the Unix philosophy, both of which emphasize consistency, simplicity, brevity and modularity. The tidyverse also favors composability, providing packages that can be easily extended and repurposed.
The aforementioned ggplot2 and dplyr are part of the tidyverse. Other notable packages include:
- stringr, which extends R with efficient string manipulation functions
- readr, which simplifies reading and processing tabular data formats such as CSV and TSV
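As a small illustration of these two packages working together, the sketch below parses CSV text with readr and tidies the strings with stringr. It assumes readr (version 2.0 or later, for `I()` and `show_col_types`) and stringr are installed. The sample data is made up.

```r
library(readr)
library(stringr)

# Inline CSV text; I() tells readr to treat it as literal data, not a path.
raw_csv <- "name,city\nalice,Berlin\nBOB,paris\ncarol,New York\n"
people <- read_csv(I(raw_csv), show_col_types = FALSE)

# Normalize inconsistent capitalization.
people$name <- str_to_title(people$name)
people$city <- str_to_title(people$city)

# Flag city names that consist of more than one word.
has_space <- str_detect(people$city, " ")

print(people)
print(has_space)
```

Reading from a file works the same way: pass the file path to `read_csv()` instead of the `I()`-wrapped string.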
We trust you enjoyed our overview of some of the best R packages for data science. For a deeper understanding of R programming fundamentals, check out our Data Science with R Nanodegree. We’ll provide the training you’ll need to land that dream job in data science!