In 2020, businesses, engineers, and scientists across the world collected a daily average of 2.5 million terabytes of data. Programmers need tools to sift through and analyze all of that data, and Python’s data science libraries are among the best tools for the job.

Working with huge datasets has always been a challenge: operations that work well on dozens of objects crash and fail when scaled up to millions. Python data science libraries don’t just enable programmers to solve problems in the era of big data; they make the process easy.

What Are Python Data Science Libraries?

A data science library is a collection of classes, functions, and types created to work with large datasets. There are libraries to handle data aggregation, sorting, transformation, and presentation. In this article, we’ll focus on three of the most popular libraries for working with big data:

  • NumPy implements data types and structures in Python that rival FORTRAN and C;
  • Pandas excels at manipulating huge datasets as easily as you might sort a spreadsheet;
  • Matplotlib can turn millions of data points into a concise report.  

Let’s take a closer look at how Python data science libraries can facilitate your work with big data.

Scientific Computing With NumPy

NumPy defines objects and data types that are useful for general mathematics. It’s a core data processing library in Python, and many other data science libraries rely on its features. NumPy implements data types and collections that take up less memory than Python’s built-in equivalents, making calculations faster.
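To get a feel for the difference, here’s a minimal sketch (the million-element list is an arbitrary example) comparing the memory footprint of a Python list with that of the equivalent NumPy array:

import sys
import numpy

numbers = list(range(1_000_000))
array = numpy.array(numbers)

# A Python list stores references to full integer objects;
# a NumPy array stores the raw values in one contiguous block
list_size = sys.getsizeof(numbers) + sum(sys.getsizeof(n) for n in numbers)
print(f"Python list: about {list_size} bytes")
print(f"NumPy array: {array.nbytes} bytes")

On a typical 64-bit system, the list weighs in at several times the size of the array.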

NumPy provides a lot of functionality, such as:

  • Fast, memory-efficient multidimensional arrays;
  • Broadcasting, which applies an operation to every element of an array at once;
  • Mathematical routines for linear algebra, statistics, and random number generation.

Install NumPy with this command:

pip install numpy

After NumPy finishes installing, we can test it out in a simple project.

Showing Off NumPy 

As an example of NumPy’s capabilities, let’s create a one-dimensional NumPy array, scale it using broadcasting, and then transform it into a multidimensional array.

Here’s a snippet of example code:

import numpy

data = [2.5, 5.0, 10.0, 20.0]

# Create a new array
unscaled_array = numpy.array(data)
print(f"Unscaled array: {unscaled_array}")

# Set a scaling factor
scale = 2
print(f"Scale factor: {scale}")

# Use array broadcasting to scale the array by our scale factor
scaled_array = unscaled_array * scale
print(f"Scaled array: {scaled_array}")

When run, we receive the following output:
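Unscaled array: [ 2.5  5.  10.  20. ]
Scale factor: 2
Scaled array: [ 5. 10. 20. 40. ]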

As you can see, we used array broadcasting in NumPy to scale our data in a single operation. Next, let’s use NumPy to transform our array with numpy.reshape():

import numpy

data = [2.5, 5.0, 10.0, 20.0]

# Create a new array
original_array = numpy.array(data)
print(f"Original array: {original_array}")

# Reshape our array
reshaped_array = numpy.reshape(original_array, (2, 2))
print(f"Reshaped array:\n{reshaped_array}")

NumPy allows us to reshape our array by passing in the original array and a tuple describing the shape of the new array. When we run the code above, we receive this output:
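Original array: [ 2.5  5.  10.  20. ]
Reshaped array:
[[ 2.5  5. ]
 [10.  20. ]]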

NumPy has successfully reshaped our one-dimensional array into a two-dimensional array. For more information on NumPy, check out the documentation.

Data Processing With Pandas

In pandas, we perform operations on DataFrames. You can think of a DataFrame as a two-dimensional array of columns and rows, like a database table or a spreadsheet. In fact, pandas excels at reading data from CSV files, Excel spreadsheets, and other sources of formatted data.
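Reading a file into a DataFrame takes a single call. Here’s a minimal sketch, assuming a hypothetical file named sales.csv in the working directory:

import pandas

# Load a CSV file into a DataFrame ("sales.csv" is a hypothetical example file)
data_frame = pandas.read_csv("sales.csv")

# Preview the first five rows
print(data_frame.head())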

Python pandas is excellent at:

  • Reading and writing data in formats like CSV, JSON, and Excel;
  • Sorting, filtering, and grouping rows;
  • Handling missing values and computing statistics like the mean and median.

Install pandas with this command:

pip install pandas

Once pandas finishes installing, we can use it to process and analyze data.

Using Pandas To Process Data

As an example, let’s create a DataFrame in pandas and sort it:

import pandas

data = [
    [2, 4, 6],
    [1, 3, 5],
    [0, 9, 3],
    [5, 2, 1]
]

# Create a DataFrame from the sample data
data_frame = pandas.DataFrame(data)
print(f"Unsorted data:\n{data_frame}\n")

# Sort the rows by the values in column 0
sorted_data = data_frame.sort_values(0)
print(f"Sorted data:\n{sorted_data}\n")

In the code above, we sort the data based on the values in the first column, which is identified by its index. After running the sample code, we receive this output:
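Unsorted data:
   0  1  2
0  2  4  6
1  1  3  5
2  0  9  3
3  5  2  1

Sorted data:
   0  1  2
2  0  9  3
1  1  3  5
0  2  4  6
3  5  2  1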

As you can see, pandas is able to read our data array and generate a spreadsheet-like report. Our DataFrame has numbered row and column labels: the rows run from 0 to 3 and the columns from 0 to 2.

Now let’s use pandas to calculate the median and mean values in a column. Here’s an example building on our previous code:

# Calculate the mean and median of the values in column 0
mean_of_column_0 = data_frame[0].mean()
median_of_column_0 = data_frame[0].median()

print(f"The mean value of column 0 is: {mean_of_column_0}")
print(f"The median value of column 0 is: {median_of_column_0}")

When run, we get the output below:
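The mean value of column 0 is: 2.0
The median value of column 0 is: 1.5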

We find the mean value for column 0 by summing its values (8) and dividing by the number of rows (4). Pandas correctly returns 2.0 as the mean. We also receive the correct median value, 1.5, which is the average of the two middle values (1 and 2).

Pandas is also very popular for preparing data for machine learning, and it’s an industry standard for sorting, grouping, and otherwise processing large datasets. It’s a powerful Python data science library, more than capable of performing typical spreadsheet tasks in a fraction of the time.
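As a hint of what that looks like, here’s a minimal sketch (the regions and revenue figures are made up) that groups a small DataFrame and totals each group:

import pandas

# A small, made-up dataset: revenue per region
sales = pandas.DataFrame({
    "region": ["North", "South", "North", "South"],
    "revenue": [100, 80, 120, 90],
})

# Group the rows by region and sum the revenue within each group
print(sales.groupby("region")["revenue"].sum())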

Visualization With Matplotlib

When working with data, you’ll often want to visualize your progress or present the results in a report. Matplotlib generates charts and graphs based on your data. Other data visualization libraries like Seaborn and Plotly build on top of matplotlib, but it also stands well on its own.

Matplotlib is best at:

  • Plotting line graphs, bar charts, scatter plots, and histograms;
  • Labeling axes and adding titles and legends;
  • Saving finished figures to common image formats.

Install matplotlib with this command:

pip install matplotlib

Once it’s installed, we can import it into our project and start visualizing data.

Generating a Simple Matplotlib Chart

Let’s take a look at creating a simple line graph with matplotlib. We’ll plot the average high and low temperatures in Seattle, WA over the course of a year. Here’s how we created our graph with matplotlib:

import matplotlib.pyplot as plt

# Average temperatures in Fahrenheit per month
average_high = [47, 49, 52, 57, 63, 66, 72, 72, 67, 59, 51, 46]
average_low = [39, 40, 42, 45, 50, 54, 57, 57, 55, 50, 43, 39]
month_names = ["Jan", "Feb", "Mar", "Apr", "May", "June",
              "July", "Aug", "Sept", "Oct", "Nov", "Dec"]

plt.plot(month_names, average_high, average_low)
plt.xlabel("Month")
plt.ylabel("Degrees Fahrenheit")
plt.title("Average yearly temperature in Seattle, WA")
plt.show()

The code above shows you how easy it is to visualize data with matplotlib. We passed in two arrays of data, along with the months we wanted to display on the X axis, and matplotlib did the rest. The library parsed our input and generated a Y axis that fits our data well. We needed only to add a few lines of code for the labels.

When we run the code above, matplotlib displays the plot: two temperature curves that climb through the summer months, peak in July and August, and fall back toward winter.

Of course, matplotlib is capable of much more. It can generate and place legends and additional labels, draw bar charts, and plot individual data points in a scatter graph. Matplotlib is a feature-rich data visualization library, and we recommend studying its full documentation.
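As a small taste, here’s a minimal sketch (with made-up data points) of a scatter plot with a labeled legend:

import matplotlib.pyplot as plt

# A handful of made-up sample points
x_values = [1, 2, 3, 4, 5]
y_values = [2, 3, 5, 4, 6]

# Plot each point individually and label the series for the legend
plt.scatter(x_values, y_values, label="Samples")
plt.legend()
plt.title("A simple scatter plot")
plt.show()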

Preparing for a Data Science Career 

Whether you’re looking to become a data scientist or analyst, or you’re tired of slow spreadsheet operations, you should learn the top Python data science libraries. NumPy, pandas, and matplotlib are useful on their own, but they’re also very common dependencies for advanced data processing in Python. Mastering them will make your life easier as you build up to machine learning and other advanced applications.

Looking to learn Python to prepare for a career in data science? 

Enroll in Udacity’s “Programming for Data Science with Python” Nanodegree program today!