In 2020, businesses, engineers, and scientists across the world collected a daily average of 2.5 million terabytes of data. Programmers need tools to sift through and analyze all of that data, and Python’s data science libraries are among the best tools for the job.

Working with huge datasets has always been a challenge: operations that work well on dozens of objects crash and fail when scaled up to millions. Python data science libraries don’t just enable programmers to solve problems in the era of big data; they make the process easy.

What Are Python Data Science Libraries?

A data science library is a collection of classes, functions, and types created to work with large datasets. There are libraries to handle data aggregation, sorting, transformation, and presentation. In this article, we’ll focus on three of the most popular libraries for working with big data:

  • NumPy implements data types and structures in Python that rival FORTRAN and C;
  • Pandas excels at manipulating huge datasets as easily as you might sort a spreadsheet;
  • Matplotlib can turn millions of data points into a concise report.  

Let’s take a closer look at how Python data science libraries can facilitate your work with big data.

Scientific Computing With NumPy

NumPy defines objects and data types that are useful for general mathematics. It’s a core data processing library in Python, and many other data science libraries rely on its features. NumPy implements data types and collections that take up less memory than Python’s built-in equivalents, making calculations faster.
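To get a feel for the difference, here’s a minimal sketch (the million-element list is an arbitrary example) comparing the memory footprint of a Python list with that of the equivalent NumPy array:

import sys
import numpy

numbers = list(range(1_000_000))
array = numpy.array(numbers)

# A Python list stores references to full integer objects;
# a NumPy array stores the raw values in one contiguous block
list_size = sys.getsizeof(numbers) + sum(sys.getsizeof(n) for n in numbers)
print(f"Python list: about {list_size} bytes")
print(f"NumPy array: {array.nbytes} bytes")

On a typical 64-bit system, the list weighs in at several times the size of the array.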

NumPy provides a lot of functionality, such as:

  • Fast, memory-efficient multidimensional arrays;
  • Broadcasting, which applies an operation to every element of an array at once;
  • Mathematical routines for linear algebra, statistics, and random number generation.

Install NumPy with this command:

pip install numpy

After NumPy finishes installing, we can test it out in a simple project.

Showing Off NumPy 

As an example of NumPy’s capabilities, let’s create a one-dimensional NumPy array, scale it using broadcasting, and then transform it into a multidimensional array.

Here’s a snippet of example code:

import numpy

data = [2.5, 5.0, 10.0, 20.0]

# Create a new array
unscaled_array = numpy.array(data)
print(f"Unscaled array: {unscaled_array}")

# Set a scaling factor
scale = 2
print(f"Scale factor: {scale}")

# Use array broadcasting to scale the array by our scale factor
scaled_array = unscaled_array * scale
print(f"Scaled array: {scaled_array}")

When run, we receive the following output:
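Unscaled array: [ 2.5  5.  10.  20. ]
Scale factor: 2
Scaled array: [ 5. 10. 20. 40. ]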

As you can see, we used array broadcasting in NumPy to scale our data in a single operation. Next, let’s use NumPy to transform our array with numpy.reshape():

import numpy

data = [2.5, 5.0, 10.0, 20.0]

# Create a new array
original_array = numpy.array(data)
print(f"Original array: {original_array}")

# Reshape our array
reshaped_array = numpy.reshape(original_array, (2, 2))
print(f"Reshaped array:\n{reshaped_array}")

NumPy allows us to reshape our array by passing in the original array and a tuple describing the shape of the new array. When we run the code above, we receive this output:
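Original array: [ 2.5  5.  10.  20. ]
Reshaped array:
[[ 2.5  5. ]
 [10.  20. ]]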

NumPy has successfully reshaped our one-dimensional array into a two-dimensional array. For more information on NumPy, check out the documentation.

Data Processing With Pandas

In pandas, we perform operations on DataFrames. You can think of a DataFrame as a two-dimensional array of columns and rows, like a database table or a spreadsheet. In fact, pandas excels at reading data from CSV files, Excel spreadsheets, and other sources of formatted data.
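Reading a file into a DataFrame takes a single call. Here’s a minimal sketch, assuming a hypothetical file named sales.csv in the working directory:

import pandas

# Load a CSV file into a DataFrame ("sales.csv" is a hypothetical example file)
data_frame = pandas.read_csv("sales.csv")

# Preview the first five rows
print(data_frame.head())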

Python pandas is excellent at:

  • Reading and writing data in formats like CSV, JSON, and Excel;
  • Sorting, filtering, and grouping rows;
  • Handling missing values and computing statistics like the mean and median.

Install pandas with this command:

pip install pandas

Once pandas finishes installing, we can use it to process and analyze data.

Using Pandas To Process Data

As an example, let’s create a DataFrame in pandas and sort it:

import pandas

data = [
    [2, 4, 6],
    [1, 3, 5],
    [0, 9, 3],
    [5, 2, 1]
]

# Create a DataFrame from the sample data
data_frame = pandas.DataFrame(data)
print(f"Unsorted data:\n{data_frame}\n")

# Sort the rows by the values in column 0
sorted_data = data_frame.sort_values(0)
print(f"Sorted data:\n{sorted_data}\n")

In the code above, we sort the data based on the values in the first column, which is identified by its index. After running the sample code, we receive this output:
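Unsorted data:
   0  1  2
0  2  4  6
1  1  3  5
2  0  9  3
3  5  2  1

Sorted data:
   0  1  2
2  0  9  3
1  1  3  5
0  2  4  6
3  5  2  1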

As you can see, pandas is able to read our data array and generate a spreadsheet-like report. Our DataFrame has numbered row and column labels: the rows run from 0 to 3 and the columns from 0 to 2.

Now let’s use pandas to calculate the median and mean values in a column. Here’s an example building on our previous code:

# Calculate the mean and median of the values in column 0
mean_of_column_0 = data_frame[0].mean()
median_of_column_0 = data_frame[0].median()

print(f"The mean value of column 0 is: {mean_of_column_0}")
print(f"The median value of column 0 is: {median_of_column_0}")

When run, we get the output below:
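The mean value of column 0 is: 2.0
The median value of column 0 is: 1.5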

We find the mean value for column 0 by summing its values (8) and dividing by the number of rows (4). Pandas correctly returns 2.0 as the mean. We also receive the correct median value, 1.5, which is the average of the two middle values (1 and 2).

Pandas is also very popular for preparing data for machine learning, and it’s an industry standard for sorting, grouping, and otherwise processing large datasets. It’s a powerful Python data science library, more than capable of performing typical spreadsheet tasks in a fraction of the time.
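As a hint of what that looks like, here’s a minimal sketch (the regions and revenue figures are made up) that groups a small DataFrame and totals each group:

import pandas

# A small, made-up dataset: revenue per region
sales = pandas.DataFrame({
    "region": ["North", "South", "North", "South"],
    "revenue": [100, 80, 120, 90],
})

# Group the rows by region and sum the revenue within each group
print(sales.groupby("region")["revenue"].sum())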

Visualization With Matplotlib

When working with data, you’ll often want to visualize your progress or present the results in a report. Matplotlib generates charts and graphs based on your data. Other data visualization libraries like Seaborn and Plotly build on top of matplotlib, but it also stands well on its own.

Matplotlib is best at:

  • Plotting line graphs, bar charts, scatter plots, and histograms;
  • Labeling axes and adding titles and legends;
  • Saving finished figures to common image formats.

Install matplotlib with this command:

pip install matplotlib

Once it’s installed, we can import it into our project and start visualizing data.

Generating a Simple Matplotlib Chart

Let’s take a look at creating a simple line graph with matplotlib. We’ll plot the average high and low temperatures in Seattle, WA over the course of a year. Here’s how we created our graph with matplotlib:

import matplotlib.pyplot as plt

# Average temperatures in Fahrenheit per month
average_high = [47, 49, 52, 57, 63, 66, 72, 72, 67, 59, 51, 46]
average_low = [39, 40, 42, 45, 50, 54, 57, 57, 55, 50, 43, 39]
month_names = ["Jan", "Feb", "Mar", "Apr", "May", "June",
              "July", "Aug", "Sept", "Oct", "Nov", "Dec"]

plt.plot(month_names, average_high, average_low)
plt.xlabel("Month")
plt.ylabel("Degrees Fahrenheit")
plt.title("Average yearly temperature in Seattle, WA")
plt.show()

The code above shows you how easy it is to visualize data with matplotlib. We passed in two arrays of data, along with the months we wanted to display on the X axis, and matplotlib did the rest. The library parsed our input and generated a Y axis that fits our data well. We needed only to add a few lines of code for the labels.

When we run the code above, matplotlib displays the plot: two temperature curves that climb through the summer months, peak in July and August, and fall back toward winter.

Of course, matplotlib is capable of much more. It can generate and place legends and additional labels, draw bar charts, and plot individual data points in a scatter graph. Matplotlib is a feature-rich data visualization library, and we recommend studying its full documentation.
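As a small taste, here’s a minimal sketch (with made-up data points) of a scatter plot with a labeled legend:

import matplotlib.pyplot as plt

# A handful of made-up sample points
x_values = [1, 2, 3, 4, 5]
y_values = [2, 3, 5, 4, 6]

# Plot each point individually and label the series for the legend
plt.scatter(x_values, y_values, label="Samples")
plt.legend()
plt.title("A simple scatter plot")
plt.show()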

Preparing for a Data Science Career 

Whether you’re looking to become a data scientist or analyst, or you’re tired of slow spreadsheet operations, you should learn the top Python data science libraries. NumPy, pandas, and matplotlib are useful on their own, but they’re also very common dependencies for advanced data processing in Python. Mastering them will make your life easier as you build up to machine learning and other advanced applications.

Looking to learn Python to prepare for a career in data science? 

Enroll in Udacity’s “Programming for Data Science with Python” Nanodegree program today!