Time series analysis is an essential aspect of data science, with applications in industries like finance, healthcare, and environmental science. It helps uncover patterns in data collected over time, enabling us to make informed predictions, detect anomalies, and understand trends.
In this article, we’ll explore key concepts in time series analysis and demonstrate their practical applications using the Airline Passengers dataset. This dataset tracks monthly airline passenger numbers from 1949 to 1960, making it a perfect example for analyzing time-dependent patterns.
What is Time Series Data?
Time series data is all about observations collected at regular time intervals. Unlike other types of data, the order of these observations matters. Whether it’s daily stock prices, hourly website traffic, or weekly sales numbers, the temporal structure holds valuable information.
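To make this concrete, here is a tiny, made-up example of what a time series looks like in pandas: a set of values whose index is a sequence of evenly spaced timestamps. The dates and visit counts below are purely illustrative.
import pandas as pd
# A hypothetical week of daily website visits; the DatetimeIndex carries the temporal structure
dates = pd.date_range(start="2024-01-01", periods=7, freq="D")
visits = pd.Series([120, 135, 128, 150, 160, 155, 170], index=dates)
print(visits)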
Why is this important? Time series analysis allows businesses and researchers to uncover trends, detect seasonal variations, forecast future values, and even spot anomalies. But before diving into the technicalities, let’s start by visualizing our dataset to see what we’re dealing with.
1. Visualization
When working with time series data, the first step is to visualize it. A simple plot can reveal trends, seasonal patterns, and even anomalies.
Here’s the plot of the Airline Passengers dataset, showing monthly passenger numbers:
import pandas as pd
import matplotlib.pyplot as plt
# Load the dataset
data = pd.read_csv("https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv")
data['Month'] = pd.to_datetime(data['Month'])
data.set_index('Month', inplace=True)
data = data.asfreq('MS')  # make the monthly (month-start) frequency explicit for later modeling
# Plot the time series
plt.figure(figsize=(12, 6))
plt.plot(data['Passengers'], label='Airline Passengers')
plt.title('Monthly Airline Passengers (1949–1960)')
plt.xlabel('Year')
plt.ylabel('Number of Passengers')
plt.legend()
plt.grid(True)
plt.show()
From the graph, it’s evident that the number of passengers has steadily increased over time. There’s also a clear seasonal pattern, where passenger numbers rise and fall in a predictable cycle. These initial observations set the stage for deeper analysis, helping us decide what to focus on next.
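One optional way to back up these visual impressions with numbers is to overlay rolling statistics; the 12-month window below is an assumption chosen to match the apparent yearly cycle.
# Rolling mean and standard deviation over a 12-month window (window size assumed from the yearly cycle)
rolling_mean = data['Passengers'].rolling(window=12).mean()
rolling_std = data['Passengers'].rolling(window=12).std()
plt.figure(figsize=(12, 6))
plt.plot(data['Passengers'], label='Passengers')
plt.plot(rolling_mean, label='12-month rolling mean')
plt.plot(rolling_std, label='12-month rolling std')
plt.legend()
plt.grid(True)
plt.show()
A rising rolling mean confirms the trend, while a growing rolling standard deviation suggests that the seasonal swings get larger over time.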
2. Decomposition
To make sense of these patterns, we can decompose the time series into three components: trend, seasonality, and residual (the irregular part). This process breaks down the series into its building blocks, making it easier to analyze each aspect separately:
- Trend: The long-term progression of the series, indicating overall growth or decline over time.
- Seasonality: Regular, repeating patterns or cycles within a fixed time period, such as yearly or monthly fluctuations.
- Residual: The irregular component, capturing noise or random variations that cannot be explained by the trend or seasonality.
By isolating these components, we can focus on specific aspects of the data, which helps in tasks like forecasting or anomaly detection.
Here’s what the decomposition of our dataset looks like:
from statsmodels.tsa.seasonal import seasonal_decompose
# Decompose the series with a multiplicative model, since the seasonal swings grow with the trend
decomposition = seasonal_decompose(data['Passengers'], model='multiplicative')
decomposition.plot()
plt.show()
The decomposition reveals three key insights:
- The trend shows a steady upward growth in passenger numbers over the years.
- The seasonality highlights a repeating yearly pattern, likely influenced by travel seasons.
- The residual captures random noise that doesn’t follow any clear pattern.
Separating these components also points to practical adjustments, such as removing seasonality to improve forecasting accuracy, as sketched below.
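As a quick sketch of that idea, the series can be seasonally adjusted by dividing it by the seasonal component; division is the appropriate operation here because the decomposition above is multiplicative.
# Seasonal adjustment for a multiplicative decomposition: divide out the seasonal component
seasonally_adjusted = data['Passengers'] / decomposition.seasonal
plt.figure(figsize=(12, 6))
plt.plot(seasonally_adjusted, label='Seasonally adjusted passengers')
plt.title('Airline Passengers with Seasonality Removed')
plt.legend()
plt.show()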
3. Stationarity
Now that we’ve explored the structure of the data, let’s address an important property for time series analysis: stationarity. A stationary series has consistent statistical properties over time—its mean, variance, and autocorrelation remain constant. Many forecasting models, including ARIMA, rely on this assumption.
If the series isn't stationary, many models produce unreliable forecasts: shifting means and variances make the underlying patterns hard to capture consistently, which leads to errors in the predictions.
Testing for stationarity ensures that the data is suitable for modeling, which is a critical step in achieving accurate forecasts. One common method to test stationarity is the Augmented Dickey-Fuller (ADF) test, which evaluates whether a series has a unit root—a key indicator of non-stationarity. Here’s how it works in Python:
from statsmodels.tsa.stattools import adfuller
# Perform ADF test
result = adfuller(data['Passengers'])
print('ADF Statistic:', result[0])
print('p-value:', result[1])
Output:
ADF Statistic: 0.8153688792060547
p-value: 0.9918802434376411
For our dataset, the ADF test produces a p-value of 0.99, which is far above the commonly used significance threshold of 0.05. This high p-value indicates that we fail to reject the null hypothesis, meaning the data is non-stationary. This result aligns with our earlier observation of clear trends and seasonal patterns in the dataset.
To make the series stationary, transformations such as differencing (subtracting each observation from the previous one) or applying a logarithmic transformation can be used. These steps help remove trends or stabilize variance, ensuring the data is suitable for modeling.
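As a rough sketch, here is one common combination, a log transform followed by first differencing, with the ADF test re-run on the result; other transformations may work just as well.
import numpy as np
# Log transform stabilizes the growing variance; first differencing removes the trend
log_diff = np.log(data['Passengers']).diff().dropna()
# Re-run the ADF test on the transformed series
result = adfuller(log_diff)
print('ADF Statistic:', result[0])
print('p-value:', result[1])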
Understanding whether your data is stationary, and knowing how to address non-stationarity when it isn't, is essential for building accurate and reliable forecasting models.
4. Forecasting Models
Forecasting is one of the most exciting aspects of time series analysis. It allows us to predict future values based on past trends. Here, we’ll explore two popular forecasting models: ARIMA and Prophet.
ARIMA
ARIMA (AutoRegressive Integrated Moving Average) is a classic time series model that relies on past values and past errors to make predictions. It works best on stationary data, with the "Integrated" (differencing) component handling simple trends. Here's how it works for our dataset:
from statsmodels.tsa.arima.model import ARIMA
# The differencing term (d=1) in the order handles the non-stationarity we detected above
model = ARIMA(data['Passengers'], order=(5, 1, 0))
# Fitting the ARIMA model to the time series data
model_fit = model.fit()
print(model_fit.summary())
# Forecasting
forecast = model_fit.forecast(steps=12)
plt.figure(figsize=(12, 6))
plt.plot(data['Passengers'], label='Historical')
plt.plot(forecast.index, forecast, label='Forecast', color='red')
plt.title('ARIMA Model Forecast')
plt.xlabel('Year')
plt.ylabel('Number of Passengers')
plt.legend()
plt.show()
The ARIMA model produces a reasonable forecast by capturing the upward trend in passenger numbers. However, as shown in the plot, the forecast doesn’t align perfectly with historical patterns, particularly with the seasonal variations. This highlights the need for further adjustments to improve accuracy, especially in capturing complex seasonal trends.
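One natural next step, sketched below rather than tuned, is a seasonal ARIMA via statsmodels' SARIMAX. The (1, 1, 1) x (1, 1, 1, 12) order is only an illustrative starting point, with the final 12 reflecting the yearly cycle in monthly data.
from statsmodels.tsa.statespace.sarimax import SARIMAX
# Seasonal ARIMA: seasonal_order adds yearly (period 12) terms on top of the non-seasonal ones
sarima_model = SARIMAX(data['Passengers'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
sarima_fit = sarima_model.fit(disp=False)
# Forecast the next 12 months and plot alongside the history
sarima_forecast = sarima_fit.forecast(steps=12)
plt.figure(figsize=(12, 6))
plt.plot(data['Passengers'], label='Historical')
plt.plot(sarima_forecast.index, sarima_forecast, label='SARIMA Forecast', color='green')
plt.title('Seasonal ARIMA (SARIMAX) Forecast')
plt.legend()
plt.show()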
Prophet
For datasets with strong seasonality or irregular trends, Prophet, developed by Meta, is a more intuitive choice. It handles missing data well and automatically identifies trends and seasonal patterns.
from prophet import Prophet
# Prepare data for Prophet
prophet_data = data.reset_index()
prophet_data.rename(columns={'Month': 'ds', 'Passengers': 'y'}, inplace=True)
# Initialize and fit the model
model = Prophet()
model.fit(prophet_data)
# Create future dates for forecasting
future = model.make_future_dataframe(periods=12, freq='MS')
forecast = model.predict(future)
# Plot the forecast
model.plot(forecast)
plt.title('Prophet Model Forecast')
plt.xlabel('Year')
plt.ylabel('Number of Passengers')
plt.show()
Prophet's forecast aligns closely with the historical data, capturing both the overall trend and seasonal variations. It also provides uncertainty intervals, which can be incredibly helpful for making business decisions. Prophet can also plot the individual components it learned, which makes the trend and seasonality easy to inspect:
# Plot the components
model.plot_components(forecast)
plt.show()
With tools like ARIMA and Prophet, forecasting becomes a powerful way to anticipate future behavior and make informed decisions.
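If you want a rough sense of how a model performs before trusting its forecasts, one simple option is to hold out the last year of data and measure the error. The 12-month split and the mean absolute error metric below are illustrative choices, not a full cross-validation.
from sklearn.metrics import mean_absolute_error
# Hold out the final 12 months as a simple test set
train, test = data['Passengers'][:-12], data['Passengers'][-12:]
# Fit the same ARIMA specification on the training portion only and forecast the held-out year
holdout_fit = ARIMA(train, order=(5, 1, 0)).fit()
holdout_forecast = holdout_fit.forecast(steps=12)
print('ARIMA MAE on the last 12 months:', mean_absolute_error(test, holdout_forecast))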
5. Anomaly Detection
Time series data isn’t always smooth and predictable. Sometimes, unusual spikes or dips—called anomalies—indicate critical events, like unexpected demand surges or system failures. Identifying these anomalies can help businesses act quickly.
Using Isolation Forest, a machine learning model, we can detect anomalies in the Airline Passengers dataset:
from sklearn.ensemble import IsolationForest
# Prepare data
data_for_anomaly = data.copy()
data_for_anomaly.reset_index(inplace=True)
data_for_anomaly['Time_Index'] = data_for_anomaly.index
# Fit the model using 'Time_Index' and 'Passengers'
isolation_forest = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
features = data_for_anomaly[['Time_Index', 'Passengers']]
data_for_anomaly['Anomaly'] = isolation_forest.fit_predict(features)
# Extract anomalies
anomalies = data_for_anomaly[data_for_anomaly['Anomaly'] == -1]
# Plot the anomalies
plt.figure(figsize=(12, 6))
plt.plot(data_for_anomaly['Month'], data_for_anomaly['Passengers'], label='Data')
plt.scatter(anomalies['Month'], anomalies['Passengers'], color='red', label='Anomalies')
plt.title('Anomaly Detection with Isolation Forest')
plt.xlabel('Year')
plt.ylabel('Number of Passengers')
plt.legend()
plt.show()
The red points in the graph highlight anomalies—instances where the number of passengers deviated significantly from the expected pattern. These insights can be invaluable for diagnosing issues or capitalizing on opportunities.
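It can also help to list the flagged observations directly so they can be cross-checked against known events; a small follow-up:
# Print the flagged months and their passenger counts for manual review
print(anomalies[['Month', 'Passengers']])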
Putting It All Together
Time series analysis is a versatile tool that enables us to uncover trends, understand seasonality, forecast future values, and detect anomalies. From visualizing data to applying sophisticated models like ARIMA and Prophet, the journey through time series analysis is as rewarding as it is insightful.
I hope this article has provided a clear and practical introduction to time series analysis. Whether you’re a data scientist or a software developer exploring this field, there’s immense potential to unlock value from time-dependent data.
If you’d like to learn more about data science, be sure to check out Udacity’s School of Data Science!