Tabular Data Wrangling with Pandas

In this notebook, we’ll manipulate and analyze tabular data with pandas, a Python library built around the DataFrame, a table with labeled rows and columns. We’ll select and filter data, sort it, add columns, compute grouped statistics, and fill in missing values.

First, let’s import the pandas library and create a sample DataFrame to work with.

import pandas as pd
import numpy as np

# Create a sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'San Francisco', 'London', 'Paris'],
    'Salary': [50000, 75000, 80000, 65000]
})

print(df)
      Name  Age           City  Salary
0    Alice   25       New York   50000
1      Bob   30  San Francisco   75000
2  Charlie   35         London   80000
3    David   28          Paris   65000
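
Before working with a DataFrame, it often helps to check its size and column types. The following is a minimal sketch of that inspection step; it rebuilds the sample data under the hypothetical name demo so that it runs on its own and does not overwrite df.

```python
import pandas as pd

# Same sample data as above, under a separate (hypothetical) name
demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'San Francisco', 'London', 'Paris'],
    'Salary': [50000, 75000, 80000, 65000]
})

# Number of rows and columns, as a (rows, columns) tuple
print(demo.shape)

# Data type of each column
print(demo.dtypes)

# Summary statistics for the numeric columns
print(demo.describe())
```

Here shape is (4, 4), and describe() summarizes only the numeric columns, Age and Salary.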

Now, let’s explore some basic operations on our DataFrame, such as selecting columns and filtering rows.

# Select specific columns
print(df[['Name', 'Age']])

# Filter rows based on a condition
print(df[df['Age'] > 30])
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3    David   28
      Name  Age    City  Salary
2  Charlie   35  London   80000
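
Conditions can also be combined, and rows and columns can be selected in a single step. The sketch below shows one way to do this with & and .loc; the name demo and the age and salary thresholds are arbitrary choices made for illustration.

```python
import pandas as pd

# Same sample data as above, under a separate (hypothetical) name
demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'San Francisco', 'London', 'Paris'],
    'Salary': [50000, 75000, 80000, 65000]
})

# Combine two conditions with & (each condition needs its own parentheses)
mask = (demo['Age'] >= 28) & (demo['Salary'] < 80000)

# .loc takes a row selector and a column selector together
subset = demo.loc[mask, ['Name', 'City']]
print(subset)
```

Only Bob and David satisfy both conditions. Use & and | rather than Python’s and and or, which do not operate element-wise on a Series.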

Next, let’s sort the rows and add a new column. Note that sort_values returns a new, sorted DataFrame and leaves df itself unchanged, so the second printout is back in the original row order.

# Sort the DataFrame by Age
print(df.sort_values('Age'))

# Add a Bonus column equal to 10% of Salary
df['Bonus'] = df['Salary'] * 0.1
print(df)
      Name  Age           City  Salary
0    Alice   25       New York   50000
3    David   28          Paris   65000
1      Bob   30  San Francisco   75000
2  Charlie   35         London   80000
      Name  Age           City  Salary   Bonus
0    Alice   25       New York   50000  5000.0
1      Bob   30  San Francisco   75000  7500.0
2  Charlie   35         London   80000  8000.0
3    David   28          Paris   65000  6500.0
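
Sorting can also run in descending order, and a column can be added without modifying the original DataFrame. The following sketch illustrates both with sort_values(ascending=False) and assign(); the variable names demo, by_salary, and with_bonus are hypothetical.

```python
import pandas as pd

# Same sample data as above, under a separate (hypothetical) name
demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'San Francisco', 'London', 'Paris'],
    'Salary': [50000, 75000, 80000, 65000]
})

# Sort by Salary, highest first; the result is a new DataFrame
by_salary = demo.sort_values('Salary', ascending=False)
print(by_salary)

# assign() returns a copy with the extra column and leaves demo untouched
with_bonus = demo.assign(Bonus=demo['Salary'] * 0.1)
print(with_bonus)

# demo itself still has no Bonus column
print('Bonus' in demo.columns)
```

Using assign() is a reasonable choice when the original table should stay as it is, for example when trying out several derived columns.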

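Grouping splits the rows by the values of a key column and computes statistics within each group. Let’s group our data by City and calculate the mean Age and Salary. Each city appears only once in this sample, so each group mean is simply that row’s value.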

# Group by City and calculate mean Age and Salary
city_stats = df.groupby('City').agg({
    'Age': 'mean',
    'Salary': 'mean'
})
print(city_stats)
                Age   Salary
City                        
London         35.0  80000.0
New York       25.0  50000.0
Paris          28.0  65000.0
San Francisco  30.0  75000.0
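
Grouping becomes more informative when a group holds several rows. The sketch below uses a small hypothetical table (the staff name and all figures are invented for illustration) with two rows per city, and names each output column through named aggregation.

```python
import pandas as pd

# Hypothetical data: two rows per city, values invented for illustration
staff = pd.DataFrame({
    'City': ['London', 'London', 'Paris', 'Paris'],
    'Age': [35, 45, 28, 32],
    'Salary': [80000, 90000, 65000, 55000]
})

# Named aggregation: output_column=(input_column, function)
summary = staff.groupby('City').agg(
    mean_age=('Age', 'mean'),
    max_salary=('Salary', 'max'),
    headcount=('Age', 'count')
)
print(summary)
```

With this data, London has a mean age of 40.0 and Paris a mean age of 30.0, each computed over two rows.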

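Finally, let’s handle missing data, which is common in real-world datasets. We’ll blank out two values, then fill Salary with the column mean and Age with the column median; both statistics skip missing values by default. NaN is a floating-point value, so Age and Salary are shown as floats (25.0, 50000.0) once they contain missing entries.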

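# Introduce some missing values.
# NaN is a float, so cast the integer columns first; newer pandas versions
# may warn when a NaN is written directly into an integer column.
df = df.astype({'Age': 'float64', 'Salary': 'float64'})
df.loc[1, 'Salary'] = np.nan
df.loc[3, 'Age'] = np.nan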
print(df)

# Fill missing values
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
df['Age'] = df['Age'].fillna(df['Age'].median())
print(df)
      Name   Age           City   Salary   Bonus
0    Alice  25.0       New York  50000.0  5000.0
1      Bob  30.0  San Francisco      NaN  7500.0
2  Charlie  35.0         London  80000.0  8000.0
3    David   NaN          Paris  65000.0  6500.0
      Name   Age           City   Salary   Bonus
0    Alice  25.0       New York  50000.0  5000.0
1      Bob  30.0  San Francisco  65000.0  7500.0
2  Charlie  35.0         London  80000.0  8000.0
3    David  30.0          Paris  65000.0  6500.0
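
Filling is not the only option. As a rough sketch, the example below counts the missing values per column with isna().sum() and then drops incomplete rows with dropna(); the table is a reduced, hypothetical copy of the sample data with the same two gaps.

```python
import numpy as np
import pandas as pd

# Reduced copy of the sample data (hypothetical name), with the same two gaps
demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25.0, 30.0, 35.0, np.nan],
    'Salary': [50000.0, np.nan, 80000.0, 65000.0]
})

# Count missing values per column
missing = demo.isna().sum()
print(missing)

# Drop every row that has at least one missing value
complete = demo.dropna()
print(complete)
```

Dropping rows discards information (here, half the table), so whether to fill or drop depends on how much data is missing and how the result will be used.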

This notebook has covered fundamental pandas operations for tabular data wrangling: selecting and filtering, sorting, adding columns, grouping, and filling missing values. Practice these techniques on your own datasets to become proficient with pandas!