Tabular Data Wrangling with Pandas
In this notebook, we’ll explore how to manipulate and analyze tabular data with pandas, Python’s library for labeled, structured data. We’ll create a DataFrame, select and filter its contents, sort it, add a column, compute grouped statistics, and fill in missing values.
First, let’s import the pandas library and create a sample DataFrame to work with.
import pandas as pd
import numpy as np
# Create a sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'San Francisco', 'London', 'Paris'],
    'Salary': [50000, 75000, 80000, 65000]
})
print(df)
      Name  Age           City  Salary
0    Alice   25       New York   50000
1      Bob   30  San Francisco   75000
2  Charlie   35         London   80000
3    David   28          Paris   65000
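Before transforming a DataFrame, it usually helps to check its size and column types. The following is a minimal sketch that rebuilds the sample data so it runs on its own; the exact dtype names printed can vary with the pandas version.

```python
import pandas as pd

# Rebuild the sample DataFrame so this cell runs on its own
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'San Francisco', 'London', 'Paris'],
    'Salary': [50000, 75000, 80000, 65000]
})

# (number of rows, number of columns)
print(df.shape)

# Data type of each column
print(df.dtypes)

# Summary statistics for the numeric columns (Age and Salary)
print(df.describe())
```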
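Now, let’s explore some basic operations on our DataFrame, such as selecting columns and filtering rows. Passing a list of column names inside the brackets returns a DataFrame with just those columns, and indexing with a boolean condition keeps only the rows where the condition is true.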
# Select specific columns
print(df[['Name', 'Age']])
# Filter rows based on a condition
print(df[df['Age'] > 30])
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3    David   28
      Name  Age    City  Salary
2  Charlie   35  London   80000
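Filters can also combine several conditions. The sketch below rebuilds the sample DataFrame so it runs on its own, then uses `.loc` with a boolean mask to filter rows and pick columns in one step. The thresholds and the names `mask` and `result` are illustrative choices, not part of the walkthrough above.

```python
import pandas as pd

# Rebuild the sample DataFrame so this cell runs on its own
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'San Francisco', 'London', 'Paris'],
    'Salary': [50000, 75000, 80000, 65000]
})

# Combine conditions with & (and) or | (or);
# each condition needs its own parentheses
mask = (df['Age'] >= 28) & (df['Salary'] < 80000)

# .loc takes the row mask first, then the list of columns to keep
result = df.loc[mask, ['Name', 'Age', 'Salary']]
print(result)
```

With the sample data, only Bob and David satisfy both conditions.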
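Next, let’s look at sorting and adding new columns. Note that sort_values returns a new, sorted DataFrame and leaves df unchanged, while assigning to df['Bonus'] modifies df itself. That is why the second printout is still in the original row order.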
# Sort the DataFrame by Age
print(df.sort_values('Age'))
# Add a new column
df['Bonus'] = df['Salary'] * 0.1
print(df)
      Name  Age           City  Salary
0    Alice   25       New York   50000
3    David   28          Paris   65000
1      Bob   30  San Francisco   75000
2  Charlie   35         London   80000
      Name  Age           City  Salary   Bonus
0    Alice   25       New York   50000  5000.0
1      Bob   30  San Francisco   75000  7500.0
2  Charlie   35         London   80000  8000.0
3    David   28          Paris   65000  6500.0
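Two common variations are sorting in descending order and adding a column without modifying the original DataFrame. The following is a small sketch under the same sample data; the names `by_salary` and `with_bonus` are made up for this example.

```python
import pandas as pd

# Rebuild the sample DataFrame so this cell runs on its own
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'San Francisco', 'London', 'Paris'],
    'Salary': [50000, 75000, 80000, 65000]
})

# Sort by Salary, highest first; df itself keeps its original order
by_salary = df.sort_values('Salary', ascending=False)
print(by_salary)

# assign() returns a copy with the extra column instead of changing df
with_bonus = df.assign(Bonus=df['Salary'] * 0.1)
print(with_bonus)
```

Using `assign` keeps the original `df` intact, which can make it easier to rerun notebook cells in any order.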
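Group operations compute statistics per category. Let’s group our data by City and calculate the mean Age and Salary. Each city appears only once in this sample, so every group mean simply equals that person’s own value.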
# Group by City and calculate mean Age and Salary
city_stats = df.groupby('City').agg({
    'Age': 'mean',
    'Salary': 'mean'
})
print(city_stats)
                Age   Salary
City
London         35.0  80000.0
New York       25.0  50000.0
Paris          28.0  65000.0
San Francisco  30.0  75000.0
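Grouping becomes more informative when a group holds more than one row. The sketch below adds a hypothetical `Department` column (invented for illustration, not part of the sample data above) and uses named aggregation, where each output column is written as `name=(input_column, function)`.

```python
import pandas as pd

# Same people as above, plus a hypothetical Department column
# that is NOT part of the original sample data
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'Salary': [50000, 75000, 80000, 65000],
    'Department': ['Sales', 'Engineering', 'Engineering', 'Sales']
})

# Named aggregation: output_column=(input_column, function)
dept_stats = df.groupby('Department').agg(
    mean_age=('Age', 'mean'),
    mean_salary=('Salary', 'mean'),
    headcount=('Name', 'count')
)
print(dept_stats)
```

Here each department contains two people, so the means are true averages rather than copies of a single row.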
Finally, let’s demonstrate how to handle missing data, which is common in real-world datasets.
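# Cast to float first: NaN is a float value, and recent pandas versions
# warn about (and may reject) assigning it into an integer column
df = df.astype({'Age': 'float64', 'Salary': 'float64'})
# Introduce some missing values
df.loc[1, 'Salary'] = np.nan
df.loc[3, 'Age'] = np.nan
print(df)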
# Fill missing values
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
df['Age'] = df['Age'].fillna(df['Age'].median())
print(df)
      Name   Age           City   Salary   Bonus
0    Alice  25.0       New York  50000.0  5000.0
1      Bob  30.0  San Francisco      NaN  7500.0
2  Charlie  35.0         London  80000.0  8000.0
3    David   NaN          Paris  65000.0  6500.0
      Name   Age           City   Salary   Bonus
0    Alice  25.0       New York  50000.0  5000.0
1      Bob  30.0  San Francisco  65000.0  7500.0
2  Charlie  35.0         London  80000.0  8000.0
3    David  30.0          Paris  65000.0  6500.0
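Before choosing a fill strategy, it helps to see how much data is actually missing, and dropping incomplete rows is sometimes a reasonable alternative to filling. The following sketch builds a small frame with the same gaps directly; the names `missing_counts`, `complete_rows`, and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

# Build the frame with gaps directly; columns containing NaN become float
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, np.nan],
    'Salary': [50000, np.nan, 80000, 65000]
})

# Count the missing values in each column
missing_counts = df.isna().sum()
print(missing_counts)

# Drop every row that has at least one missing value
complete_rows = df.dropna()
print(complete_rows)

# Fill several columns at once with a dict of column -> fill value
filled = df.fillna({
    'Age': df['Age'].median(),
    'Salary': df['Salary'].mean()
})
print(filled)
```

Dropping rows discards information, while filling with the mean or median keeps every row at the cost of inventing plausible values; which trade-off is acceptable depends on the analysis.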
This notebook has covered some fundamental operations in pandas for tabular data wrangling. Practice these techniques to become proficient in data manipulation with pandas!