
Based on a recent Stack Overflow developer survey, more data scientists are using Python than ever. It’s become the most popular language for data analysis. Python’s ecosystem includes many data science libraries, and some folks that are learning need help navigating the space and choosing the right tools.  

With that in mind, let’s explore some common situations you’ll face when using Python for data science. We’ll look at exception handling, converting objects between NumPy and pandas, working with datetime types, and more. Let’s dive in! 

What Is Data Science?

Data science is a collection of processes and methods used by data scientists to analyze data. Modern data scientists often have to work with huge datasets, gigabytes in size or larger. These datasets are far too large to work with in a spreadsheet. Python data science libraries are built to handle datasets of this size efficiently.

Pandas and NumPy are two popular choices for working with big data. NumPy provides fast N-dimensional arrays and the numerical functions that operate on them. Pandas implements spreadsheet-like DataFrames, and the functions needed to modify and analyze them.

NumPy Cheat Sheet

Here are some common situations you might face when using NumPy:

  • Your data is in a pandas DataFrame and you need to convert it to a NumPy array to save memory or use it with another library.
  • You have to work with a very large dataset and a standard NumPy array isn’t performing very well.
  • NumPy isn’t throwing exceptions when you’d like it to, and you’re overflowing variables or experiencing odd behavior as a result.

Let’s take these problems one at a time and find the solutions.

Convert a Pandas DataFrame to NumPy Array

Pandas has a function to convert a pandas DataFrame to a two-dimensional NumPy array. Here’s an example:

import numpy
import pandas
 
two_dimensional_data = [[1, 2, 3],[4, 5, 6],[7, 8, 9],[10, 11, 12],[13, 14, 15]]
pandas_dataframe = pandas.DataFrame(two_dimensional_data)
 
# Create a NumPy array from a pandas DataFrame
numpy_array = pandas_dataframe.to_numpy()
print(f"NumPy 2D Array:\n{numpy_array}\n")

Our code generates this output:

As you can see, it’s easy to convert a pandas DataFrame to a two-dimensional NumPy array with a single function call. 
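If memory is the reason for converting, it may help to check which dtype the resulting array uses. The sketch below assumes a small made-up DataFrame (the column names are hypothetical) and uses the dtype parameter of to_numpy() to request a smaller type:

```python
import numpy
import pandas

# Hypothetical sample data: one integer column and one float column
mixed_dataframe = pandas.DataFrame({"count": [1, 2, 3], "weight": [0.5, 1.5, 2.5]})

# With mixed numeric columns, the array gets one dtype that can hold every value
default_array = mixed_dataframe.to_numpy()
print(default_array.dtype)

# Request a smaller dtype explicitly to reduce memory use
compact_array = mixed_dataframe.to_numpy(dtype="float32")
print(compact_array.dtype, compact_array.nbytes)
```

Keep in mind that a smaller dtype trades precision for memory, so check that it suits your data before relying on it.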

Working With Large Arrays

You’ll often need to work with very large arrays in data science. numpy.memmap() lets you work with a small portion of a large array without having to load the entire file into memory. 

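Imagine you have data stored in a 4GB CSV file. Here’s how you’d map it into a memmap array. Note that numpy.memmap() maps the file’s raw bytes rather than parsing the CSV text, so with no dtype specified each element is a single byte (uint8):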

import numpy
from hurry.filesize import size
 
big_data = numpy.memmap("./big_data.csv")
size_of_numpy_array = size(big_data.size * big_data.itemsize)
 
print(f"Loaded {size_of_numpy_array} of data into a NumPy array")

In the example code, we print the total size of the array:

As expected, our file is mapped into a memmap array. We can now slice our large dataset like any other NumPy array. The file in this example holds random values, but let’s print a slice of it to confirm it’s usable:

import numpy
big_data = numpy.memmap("./big_data.csv")
first_10_values = big_data[:10]
 
print(f"Our memmap array:\n{first_10_values}")

We get the expected output:

Keep in mind that numpy.memmap() returns a memmap object, a subclass of the NumPy array that is backed by a file on disk rather than by memory. Although it can be accessed, reshaped, and sliced like an array, there are some key differences. Any memmap array you alter needs to be flushed in order to save it to disk, and arrays that you extend will be padded with zeros by default.

Read the documentation for more on memmap arrays and numpy.memmap().
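Memmap arrays are most useful with binary files whose layout you know. Here is a minimal sketch, assuming a hypothetical file of raw float32 values (the file name, shape, and dtype are made up for illustration), that writes to a memmap, flushes it, and reopens it read-only:

```python
import os
import tempfile

import numpy

# Hypothetical file of raw float32 values (not a CSV), created just for this sketch
file_path = os.path.join(tempfile.mkdtemp(), "big_data.dat")

# mode="w+" creates the file; dtype and shape describe how its bytes are laid out
writable = numpy.memmap(file_path, dtype="float32", mode="w+", shape=(1000, 4))
writable[0] = [1.0, 2.0, 3.0, 4.0]
writable.flush()  # write pending changes to disk
del writable

# Reopen read-only; data is paged in on demand as we slice
read_only = numpy.memmap(file_path, dtype="float32", mode="r", shape=(1000, 4))
first_row = numpy.array(read_only[0])
print(first_row)
```

Passing the dtype and shape explicitly matters here, because the file itself stores no information about either.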

Handling Exceptions in NumPy

By default, NumPy issues warnings for floating-point errors instead of raising exceptions. After displaying a warning, NumPy replaces the affected values in our arrays with special values such as nan or inf and keeps going. But these unexpected values can interfere with analysis.

First, let’s take a look at NumPy’s default behavior:

import numpy
 
# Create a list of data and load it into a NumPy array
data = [0, 1, 10, 100]
numpy_array = numpy.array(data)
print(f"Original array:\n{numpy_array}")
 
#Divide by zero and print the results
numpy_array = numpy_array / 0
print(f"Invalid array:\n{numpy_array}")

When we run the code, we receive a warning but execution continues as below:

In the output above, NumPy replaces the invalid values in our array. When we divide zero by zero, the return value is not a number (nan). When we divide any other value by zero, NumPy returns infinity (inf).

If you’d rather NumPy raise an exception so you could handle it, you’d need to add the following to your code:

import numpy
 
# Tell NumPy to raise an exception instead of printing a warning
numpy.seterr(all="raise")
 
numpy_array = numpy.array([0, 1, 10, 100])
 
# Try to divide by zero and catch the exception
try:
  numpy_array = numpy_array / 0
except FloatingPointError as e:
  print(f"Exception caught: {e}")

At the top of the code, we tell NumPy to raise exceptions for all floating-point errors. Then we surround our divide by zero in a try...except block. With this setting, NumPy raises these errors as a FloatingPointError, so that’s what we’ll have to catch.

The output from the code above looks like this:

Instead of replacing the values in our array, NumPy halts processing and preserves our original data. In production code, you could go a step further and attempt to repair the problem and continue processing data.
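Because numpy.seterr() changes the setting for the rest of the program, a scoped alternative can be handy. The sketch below uses the numpy.errstate context manager so that exceptions are raised only inside the with block; safe_divide is a hypothetical helper written for this illustration, not part of NumPy:

```python
import numpy

numpy_array = numpy.array([0, 1, 10, 100])

def safe_divide(values, divisor):
    """Hypothetical helper: return None instead of an array containing inf or nan."""
    # errstate changes error handling only inside the with block
    with numpy.errstate(divide="raise", invalid="raise"):
        try:
            return values / divisor
        except FloatingPointError:
            return None

failed_result = safe_divide(numpy_array, 0)
valid_result = safe_divide(numpy_array, 2)
print(failed_result, valid_result)
```

Returning None is just one possible recovery strategy; you might prefer to log the error or substitute a default value.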

Pandas Cheat Sheet

Here are some common issues you might face when working with pandas:

  • You need to load a spreadsheet into a pandas DataFrame to work with it more efficiently.
  • You have a NumPy array and you’d like to convert it to a DataFrame in order to use functions specific to pandas. 
  • In order to visualize relationships between your data more readily, you’d like to create a pivot table using your pandas DataFrame.
  • Your data has date or time columns and pandas isn’t recognizing them, preventing you from using datetime functions. 

Let’s go through these one by one and find the solutions.

Loading CSV Files and Excel Spreadsheets

Pandas has built-in functions for loading CSV files and Excel spreadsheets into DataFrames. Let’s load a spreadsheet containing soil data into a pandas DataFrame:

We’ll save the sample data as both CSV and Excel *.xlsx files. Here’s how to convert them into DataFrames with pandas:

import pandas
 
data_from_csv = pandas.read_csv("./soil_data.csv")
data_from_xlsx = pandas.read_excel("./soil_data.xlsx")
 
print(f"CSV loaded into DataFrame:\n{data_from_csv}")
print(f"XLSX loaded into DataFrame:\n{data_from_xlsx}")

As you can see, the code is straightforward. Pandas provides functions to read data from many common file types.

When we run our code, this is the output we get:

All of the columns appear to be formatted correctly, but pandas will try to infer the data type of each column. It may make mistakes or use dtypes with a much larger memory footprint than you need. You can be explicit about which dtype you’d like pandas to use for each column. Here’s an example:

import pandas
 
data_from_csv = pandas.read_csv("./soil_data.csv", dtype={
  'Group': 'int8',
  'Contour': 'object',
  'Depth': 'object',
  'Gp': 'object',
  'Block': 'int8',
  'pH': 'float16',
  'N': 'float16',
  'Dens': 'float16'
})
 
print(data_from_csv.dtypes)

In the code above, we pass a dictionary to the read_csv method containing column names and the desired data type. We can check the dtypes for each DataFrame by printing DataFrame.dtypes. Here’s the output:

We can see the inferred data types on the left, and contrast them with the data types we specified on the right. 
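To see what explicit dtypes can save, you can compare memory usage directly. This is a rough sketch that assumes a tiny inline CSV as a stand-in for the soil data (the rows are made up) and uses DataFrame.memory_usage() to count bytes:

```python
import io

import pandas

# Hypothetical stand-in for a few rows of the soil data CSV
csv_text = "Group,Block,pH\n1,1,5.4\n1,2,5.65\n2,1,5.14\n"

inferred = pandas.read_csv(io.StringIO(csv_text))
explicit = pandas.read_csv(
    io.StringIO(csv_text),
    dtype={"Group": "int8", "Block": "int8", "pH": "float32"},
)

# memory_usage reports bytes per column; index=False leaves out the row index
inferred_bytes = inferred.memory_usage(index=False).sum()
explicit_bytes = explicit.memory_usage(index=False).sum()
print(inferred_bytes, explicit_bytes)
```

On a three-row sample the difference is trivial, but the same ratio applies to files with millions of rows.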

Convert a NumPy Array to Pandas DataFrame

You can also convert NumPy arrays into pandas DataFrames. Since all DataFrames are two-dimensional, remember to reshape your NumPy array before attempting to convert it.

Here’s a code snippet showing a simple conversion:

import numpy
import pandas
 
data = [["Red", 255, 0, 0],
      ["Blue", 0, 0, 255],
      ["Yellow", 255, 255, 0]]
 
numpy_array = numpy.array(data)
 
pandas_dataframe = pandas.DataFrame(numpy_array)
 
print(f"Our NumPy array:\n{numpy_array}")
print(f"Our pandas DataFrame:\n{pandas_dataframe}")

When run, we see this output:

Building on the previous code, we can add an index and headers to our DataFrame when creating it:

pandas_dataframe = pandas.DataFrame(numpy_array,
  index=["index1", "index2", "index3" ],
  columns=["Color", "R", "G", "B"])
 
print(f"Our DataFrame with an index and headers:\n{pandas_dataframe}")

That code returns this output, with column names and an index as expected:

A NumPy array holds a single dtype, so the mixed list above is stored entirely as strings, and every column in our DataFrame starts out as text. Just like when loading from a spreadsheet, we can specify a dtype for each column, like so:

pandas_dataframe = pandas.DataFrame(numpy_array,
  index=["index1", "index2", "index3" ],
  columns=["Color", "R", "G", "B"])
 
pandas_dataframe["Color"] = pandas_dataframe["Color"].astype("object")
pandas_dataframe["R"] = pandas_dataframe["R"].astype("int16")
pandas_dataframe["G"] = pandas_dataframe["G"].astype("int16")
pandas_dataframe["B"] = pandas_dataframe["B"].astype("int16")
 
print(f"Our pandas DataFrame dtypes:\n{pandas_dataframe.dtypes}")

Now we can be sure that our converted DataFrame has the types we’d like:

As you can see, pandas has many options for converting NumPy arrays to DataFrames. For more on this, read the DataFrame documentation.
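The reshaping step mentioned earlier can be sketched as follows, assuming a hypothetical flat array in which every two consecutive values belong to one sample (the values and column names are made up):

```python
import numpy
import pandas

# Hypothetical flat array of six readings: three samples with two measurements each
flat_array = numpy.array([5.4, 0.19, 5.65, 0.17, 5.14, 0.26])

# reshape(-1, 2) produces rows of two values; -1 lets NumPy work out the row count
two_dimensional = flat_array.reshape(-1, 2)

readings = pandas.DataFrame(two_dimensional, columns=["pH", "N"])
print(readings)
```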

Creating Pivot Tables

If you’ve spent much time using spreadsheets, there’s a chance you’re already familiar with pivot tables. Pivot tables enable users to look at subsets of data based on indexes and values. Values are grouped by index, and presented to the user.

Here are some sales metrics we’ll use as sample data. Let’s use a pivot table to generate a sales report based on our data.

Let’s load in the data from CSV and create a pivot table. In this example, we’ll look at sales performance broken down by manager and employee. 

We’ll set the Manager and Rep columns as our indexes, keeping in mind that pandas will group them in the order they’re specified. We’ll want to see each sales rep’s gross and net sales, so those will be our values.

Here’s the code to create our pivot table:

import pandas
 
sales_data = pandas.read_csv("./sales_data.csv")
 
sales_report = pandas.pivot_table(sales_data,
  index=["Manager", "Rep"],
  values=["Gross", "Net"],
  aggfunc="sum"
)
 
print(sales_report)

Note the aggregation function aggfunc in the snippet above. By default, pandas averages all of the values. In this example, we’d like to see the sum of sales, so we specify sum. For more on pivot tables, take a look at the relevant documentation.

When we run the code, we see the pivot table printed out:

In the output above, the indexes are grouped together in the order we specified. The values are summed as expected.  
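To make the effect of aggfunc concrete, here is a small self-contained sketch. The names and figures are invented stand-ins for the sales data, built inline so the snippet runs without a CSV file:

```python
import pandas

# Hypothetical sales figures standing in for sales_data.csv
sales_data = pandas.DataFrame({
    "Manager": ["Ana", "Ana", "Ana", "Ben"],
    "Rep": ["Cy", "Cy", "Di", "Ed"],
    "Gross": [100, 200, 50, 300],
    "Net": [80, 150, 40, 250],
})

# Default aggregation: the mean of each group
mean_report = pandas.pivot_table(
    sales_data, index=["Manager", "Rep"], values=["Gross", "Net"]
)

# Explicit aggregation: the total for each group
sum_report = pandas.pivot_table(
    sales_data, index=["Manager", "Rep"], values=["Gross", "Net"], aggfunc="sum"
)

print(mean_report)
print(sum_report)
```

Rep Cy has two rows, so the two reports differ for that rep: the first shows the average sale and the second shows the total.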

Working With Datetime Fields

Pandas has excellent support for working with dates and times. But to use datetime functions, pandas has to recognize that your column contains datetime values. When reading a CSV file, pandas loads date and time columns as strings by default.

Let’s look at some sample data with datetime fields, and demonstrate how to load them into a DataFrame.

Let’s focus on the three datetime columns in the CSV above: “Start Time,” “End Time,” and “Date.” We’ll tell pandas that these are datetime fields when we load the DataFrame from CSV:

import pandas
 
date_columns = ["Start Time", "End Time", "Date"]
 
data_frame = pandas.read_csv("./synth_data.csv", parse_dates=date_columns)
 
print(data_frame.dtypes)

We can confirm that the columns were parsed correctly by printing out the dtypes for our DataFrame, as in the last line of code. Here’s the result we get:

Now that we know the datetime columns have been parsed, let’s use a function from the .dt accessor, which pandas provides on datetime columns. These functions are created to work with datetime values, and won’t work unless your data is typed properly.

Let’s access the “Date” column and use pandas to tell us which day of the week our data was collected on. Here’s a snippet to return the corresponding day for a given date:

days = data_frame["Date"].dt.day_name()
print(days)

When the code executes, this is printed:

As you can see, we’ve printed a list of row indexes and their corresponding day of the week. 
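If a DataFrame is already loaded and its date column is still text, pandas.to_datetime() can convert it after the fact. The sketch below assumes a made-up "Date" column containing one invalid value, and uses errors="coerce" so that the bad value becomes NaT instead of stopping the conversion:

```python
import pandas

# Hypothetical column of date strings, including one value that is not a date
data_frame = pandas.DataFrame({"Date": ["2024-01-01", "2024-01-06", "not a date"]})

# errors="coerce" turns unparseable values into NaT instead of raising an error
data_frame["Date"] = pandas.to_datetime(data_frame["Date"], errors="coerce")

days = data_frame["Date"].dt.day_name()
print(data_frame.dtypes)
print(days)
```

Coercing is convenient for exploration, but it can hide data quality problems, so it may be worth counting the NaT values afterward.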

Learn Python for Data Science and More!

We’ve looked at some examples of common situations a data scientist might face. You should understand how to read data from spreadsheets and very large CSV files, convert arrays between NumPy and pandas, work with datetime fields, and handle errors. 

Eventually you’ll get to the point where you can handle most operations in NumPy and pandas without a cheat sheet. Both data science libraries have a bit of a learning curve, but most programmers should be able to master them with a little practice. 

Want to continue learning Python for Data Science but don’t know how to code? At Udacity, we’ve got you covered! Check out our Introduction to Programming nanodegree, where you’ll learn the fundamentals through HTML, CSS, and Python.