Pandas is a popular data manipulation library in Python that provides data structures and functions for efficiently working with structured data. One of the fundamental objects in Pandas is the DataFrame, which is a two-dimensional labeled data structure that resembles a table in a relational database or an Excel spreadsheet. In this tutorial, we will explore how to create and manipulate Pandas DataFrames, along with practical examples to illustrate each concept.
Table of Contents
- Introduction to Pandas DataFrames
- Creating DataFrames
  - From dictionaries
  - From lists of lists
- Exploring DataFrames
  - Viewing data
  - Basic statistics
- Manipulating DataFrames
  - Adding and deleting columns
  - Filtering and selecting data
  - Applying functions to columns
- Example 1: Analyzing Sales Data
- Example 2: Examining Student Performance
- Conclusion
1. Introduction to Pandas DataFrames
A Pandas DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It is widely used for data analysis, cleaning, and manipulation tasks. DataFrames provide a versatile and efficient way to work with structured data, making them an essential tool for data scientists, analysts, and researchers.
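To make "columns of potentially different types" concrete, here is a minimal sketch. The column names and values (including the Height_m column) are illustrative assumptions, not data from a real source.

```python
import pandas as pd

# Illustrative data: a text column, an integer column, and a float column
people = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30],
    'Height_m': [1.65, 1.80],
})

# Each column carries its own dtype
print(people.dtypes)

# shape is (number of rows, number of columns)
print(people.shape)
```

Inspecting dtypes early is a cheap way to catch columns that were read as text when numbers were expected.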
2. Creating DataFrames
From Dictionaries
One of the most common ways to create a DataFrame is by using a Python dictionary. Each key in the dictionary corresponds to a column name, and the values associated with each key form the data in that column. Let’s see how to create a DataFrame from a dictionary:
import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 22, 28],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df = pd.DataFrame(data)
print(df)
Output:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   22      Chicago
3    David   28      Houston
From Lists of Lists
Another way to create a DataFrame is by using a list of lists. Each inner list represents a row of data, and the outer list contains all the rows. Column names can be specified separately using the columns parameter. Here’s an example:
data = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 22, 'Chicago'],
    ['David', 28, 'Houston']
]
columns = ['Name', 'Age', 'City']
df = pd.DataFrame(data, columns=columns)
print(df)
Output:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   22      Chicago
3    David   28      Houston
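The two construction routes should yield the same table. The sketch below checks this with DataFrame.equals on a shortened two-row version of the data; the variable names from_dict and from_rows are illustrative.

```python
import pandas as pd

# Column-oriented input: keys are column names
from_dict = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30],
})

# Row-oriented input: each inner list is one row
from_rows = pd.DataFrame([['Alice', 25], ['Bob', 30]], columns=['Name', 'Age'])

# equals() compares shape, labels, values, and dtypes
same = from_dict.equals(from_rows)
print(same)
```

Which form to use mostly depends on how the data arrives: dictionaries suit column-wise data, lists of lists suit records read row by row.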
3. Exploring DataFrames
Viewing Data
Pandas provides several methods to quickly view the data in a DataFrame. The head() method displays the first few rows of the DataFrame, while the tail() method shows the last few rows. By default, both methods display five rows, but you can specify the number of rows to display as an argument.
# Display the first 3 rows
print(df.head(3))
# Display the last 2 rows
print(df.tail(2))
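Because the example DataFrame has only four rows, the five-row default is easier to see on a longer table. A small sketch, assuming a throwaway ten-row DataFrame named numbers:

```python
import pandas as pd

# Ten rows holding the values 0..9
numbers = pd.DataFrame({'n': range(10)})

# No argument: the first five rows
first_default = numbers.head()

# Explicit argument: the last three rows
last_three = numbers.tail(3)

print(first_default)
print(last_three)
```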
Basic Statistics
You can use the describe() method to get basic statistics for numeric columns in the DataFrame, such as mean, standard deviation, minimum, and maximum values.
print(df.describe())
Output:
             Age
count   4.000000
mean   26.250000
std     3.500000
min    22.000000
25%    24.250000
50%    26.500000
75%    28.500000
max    30.000000
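The individual statistics can also be computed directly, which is a handy cross-check on what describe() reports. A minimal sketch using the same four ages; note that std() is the sample standard deviation (divides by n - 1) and that quantiles use linear interpolation by default.

```python
import pandas as pd

ages = pd.Series([25, 30, 22, 28], name='Age')

# describe() on a Series returns the same summary rows
summary = ages.describe()

print(ages.mean())          # arithmetic mean
print(ages.std())           # sample standard deviation (ddof=1)
print(ages.quantile(0.25))  # first quartile, linearly interpolated
print(ages.quantile(0.75))  # third quartile, linearly interpolated
```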
4. Manipulating DataFrames
Adding and Deleting Columns
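You can add a new column to a DataFrame by assigning values to a new column name, and delete a column using the drop() method. By default, drop() returns a new DataFrame and leaves the original unchanged, which keeps the Age column available for the examples that follow.
# Adding a new column
df['Salary'] = [60000, 75000, 50000, 80000]
# Deleting a column: drop() returns a copy without 'Age'; df itself still has it
df_without_age = df.drop(columns='Age')
print(df_without_age)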
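New columns are often derived from existing ones rather than typed in by hand. The sketch below adds a derived column and then drops a column without mutating the original; the staff DataFrame and the Age_Next_Year column are illustrative assumptions.

```python
import pandas as pd

staff = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Derive a new column from an existing one (element-wise arithmetic)
staff['Age_Next_Year'] = staff['Age'] + 1

# drop() without inplace=True returns a new DataFrame
trimmed = staff.drop(columns=['Age'])

print(list(staff.columns))    # original still has 'Age'
print(list(trimmed.columns))  # copy does not
```

Keeping the original intact and binding the result to a new name makes it harder to lose a column that later steps still need.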
Filtering and Selecting Data
You can filter rows based on specific conditions using boolean indexing. For example, to select rows where the age is greater than 25:
filtered_df = df[df['Age'] > 25]
print(filtered_df)
To select specific columns, you can use their column names:
selected_columns = df[['Name', 'City']]
print(selected_columns)
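Row filtering and column selection can be combined in one step with .loc, and several conditions can be joined with & (and) or | (or). A minimal sketch on the same sample data; each condition needs its own parentheses because & binds more tightly than comparison operators.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 22, 28],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
})

# Boolean mask: older than 25 AND not based in Houston
mask = (people['Age'] > 25) & (people['City'] != 'Houston')

# .loc[rows, columns] filters rows and picks columns together
result = people.loc[mask, ['Name', 'City']]
print(result)
```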
Applying Functions to Columns
You can apply functions to entire columns using the apply() method. This is particularly useful when you want to transform or calculate values for a column based on a function.
# Define a function to categorize ages
def categorize_age(age):
    if age < 25:
        return 'Young'
    elif age < 40:
        return 'Adult'
    else:
        return 'Senior'

# Apply the function to each value in the 'Age' column
df['Age_Category'] = df['Age'].apply(categorize_age)
print(df)
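For simple range-based labels like these, pd.cut is a vectorized alternative to apply(). This is a different technique from the one above, shown as a sketch. The bin edges mirror the thresholds in categorize_age, and the lower edge of 0 is an assumption (negative ages would become NaN).

```python
import pandas as pd

ages = pd.Series([25, 30, 22, 28, 45])

# right=False makes bins closed on the left: [0, 25), [25, 40), [40, inf)
categories = pd.cut(
    ages,
    bins=[0, 25, 40, float('inf')],
    right=False,
    labels=['Young', 'Adult', 'Senior'],
)

print(categories.tolist())
```

apply() remains the more flexible choice when the logic does not reduce to numeric ranges.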
5. Example 1: Analyzing Sales Data
Let’s consider a scenario where you have sales data for different products. The data includes columns such as Product, Price, and Units Sold. We can create a DataFrame and perform basic analysis on the data.
sales_data = {
    'Product': ['Widget A', 'Widget B', 'Widget A', 'Widget C', 'Widget B'],
    'Price': [10.99, 15.99, 10.99, 25.99, 15.99],
    'Units Sold': [100, 75, 120, 50, 90]
}
sales_df = pd.DataFrame(sales_data)
# Calculate total revenue for each row (one sales record per row)
sales_df['Total Revenue'] = sales_df['Price'] * sales_df['Units Sold']
# Display the DataFrame
print(sales_df)
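Because Widget A and Widget B each appear in more than one row, a per-product total needs an aggregation step. A minimal sketch using groupby on the same data; the name revenue_by_product is illustrative.

```python
import pandas as pd

sales_df = pd.DataFrame({
    'Product': ['Widget A', 'Widget B', 'Widget A', 'Widget C', 'Widget B'],
    'Price': [10.99, 15.99, 10.99, 25.99, 15.99],
    'Units Sold': [100, 75, 120, 50, 90],
})

# Revenue per row
sales_df['Total Revenue'] = sales_df['Price'] * sales_df['Units Sold']

# Sum the row-level revenue within each product
revenue_by_product = sales_df.groupby('Product')['Total Revenue'].sum()
print(revenue_by_product)
```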
6. Example 2: Examining Student Performance
Let’s explore a scenario involving student performance data. The data includes columns such as Student Name, Math Score, and English Score. We will create a DataFrame and analyze the performance using descriptive statistics.
student_data = {
    'Student Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Math Score': [85, 90, 78, 92, 70],
    'English Score': [92, 85, 88, 78, 82]
}
student_df = pd.DataFrame(student_data)
# Calculate average scores
student_df['Average Score'] = (student_df['Math Score'] + student_df['English Score']) / 2
# Display the DataFrame
print(student_df)
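To carry the analysis one step further, the sketch below summarizes both score columns with describe() and looks up the student with the highest average via idxmax(). It rebuilds the same data so it can run on its own; top_row is an illustrative name.

```python
import pandas as pd

student_df = pd.DataFrame({
    'Student Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Math Score': [85, 90, 78, 92, 70],
    'English Score': [92, 85, 88, 78, 82],
})

# Row-wise mean across the two score columns
student_df['Average Score'] = student_df[['Math Score', 'English Score']].mean(axis=1)

# Descriptive statistics for each subject
print(student_df[['Math Score', 'English Score']].describe())

# idxmax() returns the index label of the largest value
top_row = student_df.loc[student_df['Average Score'].idxmax()]
print(top_row['Student Name'], top_row['Average Score'])
```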
7. Conclusion
In this tutorial, we explored the creation and manipulation of Pandas DataFrames in Python. We learned how to create DataFrames from dictionaries and lists of lists, view and analyze data within DataFrames, and manipulate DataFrames by adding and deleting columns, filtering data, and applying functions to columns. Two practical examples demonstrated how DataFrames can be used for analyzing sales data and student performance. Pandas DataFrames are a powerful tool for data manipulation and analysis, and mastering their usage is essential for anyone working with structured data in Python.