Introduction

Python has established itself as the primary programming language for data analysis, largely due to its clear syntax, extensive library ecosystem, and strong community support. Whether you’re analysing large datasets for research projects or deriving business insights for decision-making, Python for data analysis equips you with flexible, scalable tools that apply across academic and professional contexts.

Essential Python Libraries

Library    | Primary Purpose                                     | Why It Matters
Pandas     | Tabular data manipulation and time-series analysis  | Provides DataFrame structures and high-level functions for filtering, joining, and aggregating data
NumPy      | Fast numerical computing on n-dimensional arrays    | Underpins many other libraries with vectorized math operations that run close to C speed
Matplotlib | Creating static, animated, and interactive plots    | Fundamental for data visualization, serving as the foundation for libraries like Seaborn
Seaborn    | Statistical data visualization built on Matplotlib  | Allows rapid creation of complex, aesthetically pleasing visualizations with minimal code

Tip: Gaining proficiency in these four libraries will prepare you for most day-to-day data analysis tasks.

Data Wrangling Techniques

Data in the real world is rarely analysis-ready. Data wrangling, the process of cleaning and transforming raw data, is crucial: models and algorithms rely on clean, well-structured inputs to produce reliable results. Imagine an age column containing “twenty-five” instead of 25; without cleaning, your analysis would treat that value as missing or invalid, skewing every downstream insight. Consider these techniques with practical examples:
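As a quick illustration of that pitfall (toy data, invented column values), pandas can surface such entries during type coercion:

```python
import pandas as pd

# A toy frame where one age was entered as text
df = pd.DataFrame({'age': ['25', 'twenty-five', '31']})

# errors='coerce' turns unparseable entries into NaN instead of raising
df['age'] = pd.to_numeric(df['age'], errors='coerce')

print(df['age'].isna().sum())  # the bad entry is now visibly missing
```

Coercing first and then counting NaNs makes data-entry problems explicit instead of silently corrupting later calculations.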

Data Cleaning

1. Handling Missing Values

Missing data can bias your analysis and lead to incorrect conclusions. Handling missing values means deciding whether to fill in missing data with substitutions or remove incomplete rows entirely.

For example, use df.fillna() to impute or df.dropna() to remove incomplete rows/columns:

df['age'] = df['age'].fillna(df['age'].median())  # assignment avoids the chained inplace pitfall
df.dropna(subset=['price', 'quantity'], inplace=True)
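A self-contained sketch (values invented) comparing two common fill strategies side by side:

```python
import pandas as pd
import numpy as np

s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

median_fill = s.fillna(s.median())  # replace gaps with a central value
forward_fill = s.ffill()            # carry the last observation forward

print(median_fill.tolist())  # [10.0, 30.0, 30.0, 30.0, 50.0]
print(forward_fill.tolist())  # [10.0, 10.0, 30.0, 30.0, 50.0]
```

Median filling suits roughly symmetric numeric columns; forward filling suits ordered data such as time series, where the last known value is a sensible stand-in.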

2. Removing Duplicate Values

Duplicate records distort results by giving undue weight to repeated data points. Removing duplicates ensures that each observation is unique and prevents skewed analysis results.

# Check for duplicate rows
duplicate_rows = df.duplicated()

# Drop duplicate rows
df.drop_duplicates(inplace=True)
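By default, drop_duplicates() compares entire rows. A short sketch (invented columns and values) showing the subset and keep parameters for finer control:

```python
import pandas as pd

df = pd.DataFrame({
    'order_id': [1, 1, 2],
    'amount':   [100, 100, 250],
})

# Full-row duplicates: the second row repeats the first exactly
deduped = df.drop_duplicates()
print(len(deduped))  # 2

# Deduplicate on a subset of columns, keeping the last occurrence
by_id = df.drop_duplicates(subset=['order_id'], keep='last')
print(by_id['order_id'].tolist())  # [1, 2]
```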

Data Transformation & Feature Engineering

1. Type Conversion

Data often comes in incorrect formats, like dates stored as plain text or numeric fields stored as strings. Type conversion transforms columns into appropriate data types so operations like sorting or mathematical calculations work correctly.

Convert columns with pd.to_datetime() or astype() to ensure correct data types.

df['event_date'] = pd.to_datetime(df['event_date'])
df['category_id'] = df['category_id'].astype('category')

2. Feature Engineering

This involves creating new variables from existing ones to highlight patterns and relationships within the data. Thoughtful feature creation often improves model performance and insights. Generally, new variables (e.g., ratio, lag, rolling mean) are created to capture domain insights.

df['revenue_per_visit'] = df['revenue'] / df['visits']
df['sales_7d_avg'] = df['sales'].rolling(window=7).mean()
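The lag features mentioned above can be built with shift(); a minimal self-contained sketch with invented values:

```python
import pandas as pd

df = pd.DataFrame({'sales': [100, 120, 90, 150]})

# Previous period's sales as a new column; the first row has no predecessor
df['sales_lag_1'] = df['sales'].shift(1)

# Period-over-period change derived from the lag
df['sales_change'] = df['sales'] - df['sales_lag_1']

print(df['sales_lag_1'].tolist())  # [nan, 100.0, 120.0, 90.0]
```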

3. Merging & Reshaping

Combine datasets with merge() / concat() and pivot with melt() / pivot_table().

sales = pd.merge(orders, customers, on='customer_id', how='left')

tidy = df.melt(id_vars=['date'], value_vars=['sales', 'profit'],
               var_name='metric', value_name='value')
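Because the snippets above assume pre-existing orders, customers, and df frames, here is a self-contained sketch (all values invented) that also demonstrates pivot_table(), which is mentioned but not shown:

```python
import pandas as pd

orders = pd.DataFrame({'customer_id': [1, 2, 1], 'amount': [50, 80, 70]})
customers = pd.DataFrame({'customer_id': [1, 2], 'region': ['North', 'South']})

# Left join keeps every order, attaching customer attributes
sales = pd.merge(orders, customers, on='customer_id', how='left')

# Aggregate amounts per region into a wide summary
summary = sales.pivot_table(index='region', values='amount', aggfunc='sum')
print(summary)
```

A left join is the safe default here: even orders with no matching customer record survive, with NaN in the joined columns.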

Exploratory Data Analysis (EDA)

EDA enables you to explore your dataset systematically and uncover underlying patterns that might be invisible through summary statistics alone.

  • Univariate Plots: Histograms and boxplots highlight data distribution and outliers.
  • Bivariate Analysis: Scatterplots show relationships between pairs of variables.
  • Correlation Heatmaps: Visualize feature correlations to inform feature selection.
  • Interactive Dashboards: Create dynamic visuals using tools like Plotly and Streamlit.

Example Commands

sns.histplot(df['price'], kde=True)
sns.scatterplot(x='age', y='income', data=df)
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')

Why EDA Matters: A Practical Example

Data exploration in the well-known Seaborn tips dataset led to discoveries that simple averages would obscure. This analysis of a restaurant’s tipping data revealed patterns in customer behaviour that would be hard to discern from summary statistics alone. The goal was to understand the general tipping habits of customers.

The Unexpected Discovery: Through a series of visualisations, several unexpected trends emerged:

  • Tipping Patterns: Histograms of tip amounts showed distinct peaks at whole-dollar and half-dollar amounts, indicating a strong psychological tendency for customers to round their tips to convenient numbers rather than calculating a precise percentage.
  • Demographic Insights: When the data was segmented by the gender of the person paying the bill, it was found that while both genders tipped similarly on average, males tended to have a wider variance in their tipping, including some very high and very low tips.
  • Behavioural Clues: A surprising correlation was found between parties seated in the smoking section and higher tipping variability.

Let us perform data exploration on the same tips dataset to gather some insights:

1. Load the Dataset

import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset('tips')

2. Rounding Behaviour

We will plot a histogram that shows how tip amounts are distributed and highlights common tipping values.

plt.figure(figsize=(10, 5))
sns.histplot(data=tips, x='tip', bins=40, kde=True, edgecolor='black')

plt.title('Tip Amounts Histogram: Rounding Behavior')
plt.xlabel('Tip Amount ($)')
plt.ylabel('Frequency')
plt.xticks(range(0, int(tips['tip'].max()) + 2))
plt.grid(True, axis='y', linestyle='--', alpha=0.7)
plt.show()

3. Gender and Tipping

Now, let us use a violin plot that reveals the distribution of tips given by males and females.

plt.figure(figsize=(10, 5))
sns.violinplot(data=tips, x='sex', y='tip', inner='quartile', palette='pastel')
# sns.boxplot(data=tips, x='sex', y='tip', palette='pastel')

plt.title('Gender vs. Tip Amount: Distribution & Variance')
plt.xlabel('Gender')
plt.ylabel('Tip Amount ($)')
plt.grid(True, axis='y', linestyle='--', alpha=0.5)
plt.show()

Why Violin Plot?

It shows the density and spread, highlighting variance more effectively than boxplots alone.

Alternative: Use sns.boxplot() for a simpler representation, but violin plots better capture distribution tails.

4. The Smoker’s Premium

Similar to the above, we now use a boxplot to compare tips given by smokers vs. non-smokers.

plt.figure(figsize=(10, 5))
sns.boxplot(data=tips, x='smoker', y='tip', palette='coolwarm')

plt.title("Smoker's Premium: Tip Amount Variability")
plt.xlabel('Smoker Status')
plt.ylabel('Tip Amount ($)')
plt.grid(True, axis='y', linestyle='--', alpha=0.5)
plt.show()

Why Boxplot?

Directly shows variability (IQR, whiskers, and outliers). If smokers show longer whiskers or more outliers, it confirms your variability observation.

Alternative: Use sns.violinplot() if you want to expose distribution details, similar to the gender plot.

These insights showcase how EDA goes beyond surface-level metrics to uncover subtle, actionable patterns within data. While this particular analysis might not have led to a major business overhaul, it powerfully illustrates how EDA can uncover nuanced social and psychological patterns within a dataset. For a business, understanding these subtle customer behaviours can inform everything from marketing to customer service strategies.

Statistical Analysis and Modeling

Once patterns are identified through exploratory data analysis (EDA), formal statistical modeling provides a structured framework to validate these insights and make predictions:

  • Descriptive Statistics: These include measures like mean, median, mode, variance, and standard deviation that help summarise and describe the main features of a dataset. Understanding the central tendency and spread of data is crucial for interpreting its distribution and variability.
  • Inferential Tests: Techniques such as t-tests and chi-square tests assess whether observed differences between groups or associations between variables are statistically significant. This step helps distinguish genuine trends from random variations in the data.
  • Regression Analysis: Linear regression models help quantify the relationship between continuous variables, while logistic regression is used for binary or categorical outcomes. These models not only explain relationships but also enable future predictions based on independent variables.
  • Time-Series Modeling: Time-series techniques, including ARIMA models and Facebook Prophet, are employed to analyse data points collected or indexed in time order. These models are essential for forecasting trends, seasonality, and future values in domains like sales forecasting and financial analysis.

Example Commands

from scipy import stats
from sklearn.linear_model import LinearRegression
from statsmodels.tsa.arima.model import ARIMA

# T-test: are two groups' mean scores significantly different?
stats.ttest_ind(group_a['score'], group_b['score'])

# Linear regression
model = LinearRegression().fit(X, y)

# Time-series forecasting
arima_model = ARIMA(df['sales'], order=(1, 1, 1)).fit()
forecast = arima_model.predict(start=len(df), end=len(df) + 12)
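The descriptive statistics and chi-square tests listed above have no example yet; a minimal sketch (the scores and contingency counts are invented):

```python
import pandas as pd
from scipy import stats

scores = pd.Series([72, 85, 90, 68, 77, 85])

# Descriptive statistics: central tendency and spread
print(scores.mean(), scores.median(), scores.std())

# Chi-square test of independence on a 2x2 contingency table
# (e.g., group membership vs. pass/fail counts)
table = [[30, 10], [20, 40]]
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f'chi2={chi2:.2f}, p={p:.4f}, dof={dof}')
```

A small p-value here would indicate the two categorical variables are likely associated rather than independent.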

Real-World Use Cases

Domain           | Problem                          | Python Solution
Finance          | Predicting stock price movements | Feature engineering with Pandas, modeling with scikit-learn, visualization with Matplotlib
Healthcare       | Detecting anomalies in patient vitals | Signal processing with NumPy, statistical thresholding with SciPy, real-time dashboards via Streamlit
E-Commerce       | Customer segmentation            | Data manipulation with Pandas, K-means clustering with scikit-learn, visualized using Seaborn
Sports Analytics | Forecasting player performance   | Data wrangling with Pandas, modeling with XGBoost, interactive charts using Plotly

My Perspective on EDA

When working with any new dataset, my first step is always to perform quick EDA before jumping into solution design. Over the years, I’ve learned that without understanding the data’s underlying patterns, distributions, and anomalies, any advanced analysis or modeling can easily lead to incorrect assumptions. A few simple visualizations and summary statistics often reveal hidden problems or opportunities that shape the rest of the analysis pipeline.

To streamline the EDA process, I also use automated tools like ydata-profiling (formerly pandas-profiling). With a single line of code, ydata-profiling generates a detailed, interactive report summarizing distributions, missing values, correlations, and potential data quality issues. This accelerates my understanding of the dataset and highlights key features or anomalies without the need for manual plotting initially.

import pandas as pd
from ydata_profiling import ProfileReport

df = pd.read_csv('data.csv')
profile = ProfileReport(df, title='Profiling Report')
profile.to_file('report.html')  # writes the interactive HTML report

Conclusion

Mastering Python for data analysis involves more than just learning syntax—it requires developing a strategic understanding of how to clean, explore, model, and visualize data effectively.

Next Steps for Students:

  1. Download and analyse a public dataset (e.g., from Kaggle) to practice real-world data cleaning, exploration, and modeling.
  2. Build and share an interactive dashboard to effectively communicate your findings to peers or stakeholders using tools like Streamlit.

With consistent practice, you’ll learn to transform raw data into actionable insights using Python and its ecosystem of tools.

Check out the data catalog and the AI catalog to upskill in this space.

Mayur Madnani
Mayur is an engineer with deep expertise in software, data, and AI. With experience at SAP, Walmart, Intuit, and JioHotstar, and an MS in ML & AI from LJMU, UK, he is a published researcher, patent holder, and the Udacity course author of "Building Image and Vision Generative AI Solutions on Azure." Mayur has also been an active Udacity mentor since 2020, completing 2,100+ project reviews across various Nanodegree programs. Connect with him on LinkedIn at www.linkedin.com/in/mayurmadnani/