Introduction to Pandas DataFrames
Pandas is a widely used Python library for data manipulation and analysis. Its core data structure is the DataFrame, a two-dimensional, labeled table that resembles a spreadsheet or a SQL table. Each column can hold a different data type, which makes DataFrames a convenient way to work with structured data.
This tutorial covers creating, manipulating, and indexing DataFrames, along with common operations, and illustrates each concept with examples.
Table of Contents
- Creating DataFrames
  - From dictionaries
  - From lists of lists
  - From CSV files
- Basic DataFrame Operations
  - Viewing and inspecting data
  - Selecting and filtering data
  - Adding and removing columns
- Indexing and Slicing
  - Indexing using labels and positions
  - Conditional selection
- Data Manipulation
  - Applying functions to columns
  - Grouping and aggregation
- Merging and Joining DataFrames
  - Concatenating DataFrames
  - Merging on columns
- Example 1: Analyzing Sales Data
- Example 2: Exploring Student Performance
1. Creating DataFrames
From dictionaries
One of the most common ways to create a DataFrame is from a dictionary. Each key in the dictionary becomes a column, and the corresponding values become the data in that column.
import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
    'City': ['New York', 'San Francisco', 'Los Angeles']
}
df = pd.DataFrame(data)
print(df)
From lists of lists
You can also create a DataFrame from a list of lists. Each inner list represents a row in the DataFrame.
data = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'San Francisco'],
    ['Charlie', 22, 'Los Angeles']
]
columns = ['Name', 'Age', 'City']
df = pd.DataFrame(data, columns=columns)
print(df)
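As a quick check that the two construction styles above describe the same table, the following sketch builds both and compares them. The variable names `from_dict` and `from_rows` are chosen here for illustration only.

```python
import pandas as pd

# Column-oriented input: each key is a column
from_dict = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
    'City': ['New York', 'San Francisco', 'Los Angeles'],
})

# Row-oriented input: each inner list is a row
from_rows = pd.DataFrame(
    [
        ['Alice', 25, 'New York'],
        ['Bob', 30, 'San Francisco'],
        ['Charlie', 22, 'Los Angeles'],
    ],
    columns=['Name', 'Age', 'City'],
)

# equals() checks that both frames have the same shape, values, and column dtypes
print(from_dict.equals(from_rows))
```

The dictionary form tends to read better when you already have data per column, and the list-of-lists form when you have data per record.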
From CSV files
Pandas makes it easy to read data from CSV files and create DataFrames.
df = pd.read_csv('data.csv')
print(df)
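If you do not have a file at hand, a minimal sketch like the one below lets you try `read_csv` anyway. It assumes hypothetical CSV content held in a string and wrapped in `io.StringIO`, which `read_csv` accepts in place of a file path.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for 'data.csv'
csv_text = """Name,Age,City
Alice,25,New York
Bob,30,San Francisco
Charlie,22,Los Angeles
"""

# read_csv accepts a path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv.shape)   # number of rows and columns
print(df_csv.dtypes)  # Age is parsed as an integer column
```

The first line of the CSV is used as the header row by default, so the column names come straight from the file.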
2. Basic DataFrame Operations
Viewing and inspecting data
You can use various methods to get an overview of your DataFrame.
# Display the first few rows
print(df.head())
# Display the last few rows
print(df.tail())
# Get summary statistics
print(df.describe())
# Check the data types of each column
print(df.dtypes)
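The inspection methods above can be tried on a small, self-contained DataFrame. The sketch below rebuilds the three-person example under the name `df_view` (a name chosen here for illustration) and adds `shape` and `info()`, which are often useful alongside `head()` and `describe()`.

```python
import pandas as pd

df_view = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
    'City': ['New York', 'San Francisco', 'Los Angeles'],
})

print(df_view.shape)     # (rows, columns)
print(df_view.head(2))   # head() and tail() take an optional row count
df_view.info()           # column names, non-null counts, and dtypes

# describe() summarizes numeric columns by default;
# include='all' also summarizes non-numeric columns
print(df_view.describe(include='all'))
```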
Selecting and filtering data
Pandas allows you to select specific columns or rows based on conditions.
# Select a single column
names = df['Name']
# Select multiple columns
subset = df[['Name', 'Age']]
# Filter rows based on a condition
young_people = df[df['Age'] < 30]
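Filters can also be combined. The sketch below, using a small `people` DataFrame defined here for illustration, shows the element-wise operators `&` (and), `|` (or), and `~` (not), plus `isin()` for membership tests.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
    'City': ['New York', 'San Francisco', 'Los Angeles'],
})

# Each condition needs its own parentheses because & and |
# bind more tightly than comparison operators
young_in_ny = people[(people['Age'] < 30) & (people['City'] == 'New York')]

# isin() keeps rows whose value appears in the given list
west_coast = people[people['City'].isin(['San Francisco', 'Los Angeles'])]

# ~ negates a boolean mask
not_ny = people[~(people['City'] == 'New York')]

print(young_in_ny)
print(west_coast)
print(not_ny)
```

Note that Python's `and`, `or`, and `not` keywords do not work on boolean Series; the operator forms above are required.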
Adding and removing columns
You can add new columns to your DataFrame or remove existing ones.
# Add a new column (the list needs one value per row; this assumes a three-row DataFrame)
df['Gender'] = ['Female', 'Male', 'Male']
# Remove a column
df.drop('Gender', axis=1, inplace=True)
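If you prefer not to modify a DataFrame in place, one alternative is to use methods that return a new DataFrame. The following is a minimal sketch of that style using `assign()` and `drop(columns=...)`; the `people` DataFrame is defined here for illustration.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
    'City': ['New York', 'San Francisco', 'Los Angeles'],
})

# assign() returns a new DataFrame and leaves the original untouched
with_gender = people.assign(Gender=['Female', 'Male', 'Male'])

# drop(columns=...) is equivalent to drop(..., axis=1) and also returns a new DataFrame
without_gender = with_gender.drop(columns='Gender')

print(list(people.columns))          # original is unchanged
print(list(with_gender.columns))
print(list(without_gender.columns))
```

Returning new objects makes it easier to chain steps and to keep intermediate results around for comparison.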
3. Indexing and Slicing
Indexing using labels and positions
Pandas supports both label-based indexing (loc, at) and position-based indexing (iloc, iat).
# Access a column by label
ages = df['Age']
# Access a row by position
row_0 = df.iloc[0]
# Access a single value by row label and column label
value = df.at[1, 'Name']
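The difference between labels and positions is easier to see when the index is not the default 0, 1, 2. The sketch below assumes a hypothetical DataFrame indexed by lowercase names and contrasts `loc`/`at` (labels) with `iloc`/`iat` (positions).

```python
import pandas as pd

people = pd.DataFrame(
    {'Age': [25, 30, 22], 'City': ['New York', 'San Francisco', 'Los Angeles']},
    index=['alice', 'bob', 'charlie'],  # string labels instead of 0, 1, 2
)

by_label = people.loc['bob']      # row selected by its index label
by_position = people.iloc[1]      # the same row selected by integer position

value_by_label = people.at['bob', 'Age']   # single value: row label, column label
value_by_position = people.iat[1, 0]       # single value: row position, column position

# Label slices include the end label; position slices exclude the end position
label_slice = people.loc['alice':'bob']
position_slice = people.iloc[0:2]

print(value_by_label, value_by_position)
print(label_slice)
print(position_slice)
```

With the default integer index, labels and positions happen to coincide, which is why `df.at[1, 'Name']` works on the earlier example.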
Conditional selection
You can perform conditional selection on your DataFrame.
# Select rows where Age is greater than 25
selected_rows = df[df['Age'] > 25]
# Select rows where City is 'New York'
ny_residents = df[df['City'] == 'New York']
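Conditional selection can be combined with column selection in a single step. The sketch below shows two ways to do that, `loc` with a boolean mask and `query()` with a string expression, on a `people` DataFrame defined here for illustration.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
    'City': ['New York', 'San Francisco', 'Los Angeles'],
})

# loc takes a row mask and a list of column labels together
names_over_25 = people.loc[people['Age'] > 25, ['Name', 'City']]

# query() expresses the same row condition as a string
same_with_query = people.query('Age > 25')[['Name', 'City']]

print(names_over_25)
print(same_with_query)
```

Which form to use is largely a matter of taste; `loc` is explicit about both axes, while `query()` can be easier to read for longer conditions.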
4. Data Manipulation
Applying functions to columns
You can apply functions to columns using the apply() method.
# Convert ages to a new category
def categorize_age(age):
    if age < 18:
        return 'Underage'
    elif age < 65:
        return 'Adult'
    else:
        return 'Senior'
df['Age_Category'] = df['Age'].apply(categorize_age)
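For binning numeric values like this, a vectorized alternative to `apply()` is `pd.cut`. The sketch below is one possible equivalent of `categorize_age`, assuming the same thresholds (18 and 65); the sample ages are made up for illustration.

```python
import pandas as pd

ages = pd.DataFrame({'Age': [12, 25, 64, 65, 80]})

# right=False makes each bin include its left edge and exclude its right edge,
# matching the "age < 18" and "age < 65" checks in categorize_age
ages['Age_Category'] = pd.cut(
    ages['Age'],
    bins=[-float('inf'), 18, 65, float('inf')],
    labels=['Underage', 'Adult', 'Senior'],
    right=False,
)

print(ages)
```

The result is a categorical column rather than plain strings, which can save memory and preserves the order of the categories.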
Grouping and aggregation
Pandas allows you to group data and perform aggregation operations.
# Group data by Age_Category and calculate mean Age for each group
age_group_means = df.groupby('Age_Category')['Age'].mean()
# Calculate multiple aggregations
agg_results = df.groupby('City').agg({'Age': 'mean', 'Name': 'count'})
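When you compute several aggregations, named aggregation gives the result columns descriptive names. The following sketch assumes a small `people` DataFrame with a hypothetical fourth person added so that each city has two residents.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],  # 'Dana' is a made-up extra row
    'Age': [25, 30, 22, 35],
    'City': ['New York', 'San Francisco', 'New York', 'San Francisco'],
})

# Each keyword is an output column: (input column, aggregation function)
city_summary = people.groupby('City').agg(
    mean_age=('Age', 'mean'),
    residents=('Name', 'count'),
)

print(city_summary)
```

Compared with the dictionary form, the output columns here are called `mean_age` and `residents` instead of reusing the input column names.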
5. Merging and Joining DataFrames
Concatenating DataFrames
You can concatenate DataFrames along rows or columns.
# Concatenate along rows (df1 and df2 are any two existing DataFrames)
df_concat = pd.concat([df1, df2])
# Concatenate along columns
df_concat = pd.concat([df1, df2], axis=1)
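To make the two directions concrete, here is a minimal sketch with two small DataFrames. The contents of `df1` and `df2` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical inputs with the same columns
df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [3, 4], 'Name': ['Charlie', 'Dana']})

# Stack rows; ignore_index=True renumbers the result 0..n-1
# instead of keeping each input's original index
stacked = pd.concat([df1, df2], ignore_index=True)

# Place the frames side by side, aligned on the row index
side_by_side = pd.concat([df1, df2], axis=1)

print(stacked)
print(side_by_side)
```

Without `ignore_index=True`, the stacked result would keep the index values 0, 1, 0, 1. Concatenating along columns keeps duplicate column names, so `side_by_side` ends up with two `ID` and two `Name` columns.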
Merging on columns
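Merging combines rows from two DataFrames that share values in one or more common columns, much like a SQL join.
# Merge based on a common column (both DataFrames are assumed to have an 'ID' column)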
merged_df = pd.merge(df1, df2, on='ID')
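The `how` parameter controls which rows survive a merge. The sketch below uses two hypothetical tables, `customers` and `orders`, that share an `ID` column, and compares inner, left, and outer merges.

```python
import pandas as pd

# Hypothetical tables sharing an 'ID' column
customers = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
orders = pd.DataFrame({'ID': [1, 1, 3, 4], 'Amount': [50, 20, 70, 10]})

# Default how='inner': keep only IDs present in both tables
inner = pd.merge(customers, orders, on='ID')

# how='left': keep every customer; customers without orders get NaN in Amount
left = pd.merge(customers, orders, on='ID', how='left')

# how='outer': keep all IDs from both sides; indicator=True adds a
# '_merge' column saying where each row came from
outer = pd.merge(customers, orders, on='ID', how='outer', indicator=True)

print(inner)
print(left)
print(outer)
```

Because customer 1 has two orders, that customer appears twice in every result: a merge produces one row per matching pair.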
6. Example 1: Analyzing Sales Data
Let’s consider an example of sales data analysis using Pandas DataFrames. Imagine you have a dataset with the columns Product, Price, Quantity, and Date.
# Read data from CSV file
sales_data = pd.read_csv('sales_data.csv')
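# Calculate total revenue for each product
# (assumes Price is the per-unit price, so revenue = Price * Quantity)
sales_data['Revenue'] = sales_data['Price'] * sales_data['Quantity']
product_revenue = sales_data.groupby('Product')['Revenue'].sum()
# Find the most sold product
most_sold_product = sales_data.groupby('Product')['Quantity'].sum().idxmax()
# Calculate monthly revenue
sales_data['Date'] = pd.to_datetime(sales_data['Date'])
sales_data.set_index('Date', inplace=True)
# 'M' is the month-end frequency alias; pandas 2.2+ prefers 'ME'
monthly_revenue = sales_data.resample('M')['Revenue'].sum()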
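To run this analysis without a CSV file, the sketch below uses a few made-up rows in place of `sales_data.csv`. It assumes Price is a per-unit price, so revenue is computed as Price times Quantity, and it groups by calendar month with `dt.to_period('M')` rather than `resample`, which is a substitute technique that avoids setting the index.

```python
import pandas as pd

# Made-up rows standing in for sales_data.csv
sales = pd.DataFrame({
    'Product': ['Widget', 'Gadget', 'Widget', 'Gadget'],
    'Price': [10.0, 25.0, 10.0, 25.0],      # assumed to be per-unit prices
    'Quantity': [3, 1, 2, 5],
    'Date': ['2023-01-05', '2023-01-20', '2023-02-03', '2023-02-14'],
})

# Revenue per row, then total revenue per product
sales['Revenue'] = sales['Price'] * sales['Quantity']
product_revenue = sales.groupby('Product')['Revenue'].sum()

# Product with the highest total quantity sold
most_sold = sales.groupby('Product')['Quantity'].sum().idxmax()

# Monthly revenue: convert dates, then group by calendar month
sales['Date'] = pd.to_datetime(sales['Date'])
monthly_revenue = sales.groupby(sales['Date'].dt.to_period('M'))['Revenue'].sum()

print(product_revenue)
print(most_sold)
print(monthly_revenue)
```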
7. Example 2: Exploring Student Performance
Consider a scenario where you have a dataset containing student information and exam scores. The code below assumes the file has columns named Subject, Score, and Hours (study hours).
# Read data from CSV file
student_data = pd.read_csv('student_scores.csv')
# Calculate average scores by subject
average_scores = student_data.groupby('Subject')['Score'].mean()
# Find students with scores above 90
top_students = student_data[student_data['Score'] > 90]
# Calculate correlation between study hours and scores
correlation = student_data['Hours'].corr(student_data['Score'])
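The same steps can be tried on inline data. The rows below are invented for illustration and stand in for `student_scores.csv`; the `Student` column is an extra assumption so that the top scorers can be identified by name.

```python
import pandas as pd

# Invented rows standing in for student_scores.csv
students = pd.DataFrame({
    'Student': ['Ana', 'Ben', 'Cara', 'Dev'],
    'Subject': ['Math', 'Math', 'Physics', 'Physics'],
    'Hours': [2, 8, 4, 10],
    'Score': [70, 95, 80, 92],
})

# Mean score per subject
average_scores = students.groupby('Subject')['Score'].mean()

# Rows where the score exceeds 90
top_students = students[students['Score'] > 90]

# Pearson correlation (the default for Series.corr) between hours and scores
correlation = students['Hours'].corr(students['Score'])

print(average_scores)
print(top_students['Student'].tolist())
print(round(correlation, 3))
```

A correlation close to 1 in this toy data only reflects how the rows were made up; with real data, a correlation does not by itself show that more study hours cause higher scores.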
Conclusion
Pandas DataFrames are a powerful tool for data manipulation, analysis, and exploration in Python. This tutorial has covered the basics: creating DataFrames, inspecting and filtering them, indexing and slicing, applying functions and aggregations, and concatenating and merging. With these fundamentals, you are equipped to move on to more advanced topics and real-world data analysis projects. Practice and experiment with different scenarios to get a full grasp of what DataFrames can do.