Tutorial: Creating and Manipulating Pandas DataFrames in Python

Pandas is a popular data manipulation library in Python that provides data structures and functions for efficiently working with structured data. One of the fundamental objects in Pandas is the DataFrame, which is a two-dimensional labeled data structure that resembles a table in a relational database or an Excel spreadsheet. In this tutorial, we will explore how to create and manipulate Pandas DataFrames, along with practical examples to illustrate each concept.

Table of Contents

1. Introduction to Pandas DataFrames
2. Creating DataFrames
    From Dictionaries
    From Lists of Lists
3. Exploring DataFrames
    Viewing Data
    Basic Statistics
4. Manipulating DataFrames
    Adding and Deleting Columns
    Filtering and Selecting Data
    Applying Functions to Columns
5. Example 1: Analyzing Sales Data
6. Example 2: Examining Student Performance
7. Conclusion

1. Introduction to Pandas DataFrames

A Pandas DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It is widely used for data analysis, cleaning, and manipulation tasks. DataFrames provide a versatile and efficient way to work with structured data, making them an essential tool for data scientists, analysts, and researchers.
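
As a minimal sketch of what "labeled" and "columns of potentially different types" mean in practice, the snippet below builds a tiny DataFrame and inspects its shape, labels, and per-column types. The variable name and values are illustrative only.

```python
import pandas as pd

# Illustrative data: one text column and one integer column
people = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30],
})

print(people.shape)          # (number of rows, number of columns)
print(list(people.columns))  # column labels
print(list(people.index))    # row labels (0, 1, ... by default)
print(people.dtypes)         # each column keeps its own data type
```

The row labels default to consecutive integers starting at 0, which is why the printed DataFrames in this tutorial show 0, 1, 2, ... on the left.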

2. Creating DataFrames

From Dictionaries

One of the most common ways to create a DataFrame is by using a Python dictionary. Each key in the dictionary corresponds to a column name, and the values associated with each key form the data in that column. Let’s see how to create a DataFrame from a dictionary:

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 22, 28],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}

df = pd.DataFrame(data)
print(df)

Output:

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   22      Chicago
3    David   28      Houston
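
One caveat with this approach: when the dictionary values are lists, they all need to have the same length, otherwise pandas raises a ValueError. The sketch below uses made-up data with one value missing to show what happens.

```python
import pandas as pd

# Hypothetical data with mismatched column lengths
bad_data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30],  # one value short
}

try:
    pd.DataFrame(bad_data)
    created = True
except ValueError as err:
    created = False
    print(f"Could not build the DataFrame: {err}")
```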

From Lists of Lists

Another way to create a DataFrame is by using a list of lists. Each inner list represents a row of data, and the outer list contains all the rows. Column names can be specified separately using the columns parameter. Here’s an example:

data = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 22, 'Chicago'],
    ['David', 28, 'Houston']
]

columns = ['Name', 'Age', 'City']

df = pd.DataFrame(data, columns=columns)
print(df)

Output:

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   22      Chicago
3    David   28      Houston
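
The two construction styles are interchangeable for data like this. As a quick check, the sketch below builds the same small table both ways and compares the results with DataFrame.equals(); the variable names are illustrative.

```python
import pandas as pd

# Column-oriented: one dictionary key per column
from_dict = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30],
})

# Row-oriented: one inner list per row, column names given separately
from_rows = pd.DataFrame(
    [['Alice', 25], ['Bob', 30]],
    columns=['Name', 'Age'],
)

# equals() compares labels, values, and column types
same = from_dict.equals(from_rows)
print(same)
```

Which form to use is mostly a matter of how the data arrives: column-wise data fits the dictionary form, while records read one at a time fit the list-of-lists form.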

3. Exploring DataFrames

Viewing Data

Pandas provides several methods to quickly view the data in a DataFrame. The head() method displays the first few rows of the DataFrame, while the tail() method shows the last few rows. By default, both methods display five rows, but you can specify the number of rows to display as an argument.

# Display the first 3 rows
print(df.head(3))

# Display the last 2 rows
print(df.tail(2))
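
The four-row DataFrame above is too short to show the default of five rows, so here is a small sketch with a hypothetical eight-row DataFrame.

```python
import pandas as pd

# Hypothetical DataFrame with 8 rows so the defaults are visible
numbers = pd.DataFrame({'n': range(1, 9)})

first_rows = numbers.head()   # no argument: first 5 rows
last_rows = numbers.tail()    # no argument: last 5 rows
top_two = numbers.head(2)     # explicit count: first 2 rows

print(len(first_rows), len(last_rows), len(top_two))
print(last_rows['n'].tolist())
```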

Basic Statistics

You can use the describe() method to get summary statistics for the numeric columns in a DataFrame: the count, mean, standard deviation, minimum, quartiles, and maximum.

print(df.describe())

Output:

             Age
count   4.000000
mean   26.250000
std     3.500000
min    22.000000
25%    24.250000
50%    26.500000
75%    28.500000
max    30.000000
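
The figures reported by describe() can also be computed one at a time, which helps when you only need a single statistic. The sketch below uses the same four ages as a standalone Series; note that std() is the sample standard deviation and that quantile() interpolates linearly between values by default.

```python
import pandas as pd

ages = pd.Series([25, 30, 22, 28], name='Age')

# Full summary in one call
summary = ages.describe()

# The same figures computed individually
mean_age = ages.mean()
std_age = ages.std()                           # sample standard deviation (ddof=1)
quartiles = ages.quantile([0.25, 0.5, 0.75])   # linear interpolation by default

print(summary)
print(mean_age, std_age)
print(quartiles)
```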

4. Manipulating DataFrames

Adding and Deleting Columns

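You can add a new column to a DataFrame by assigning values to a new column name. To delete a column, use the drop() method. By default, drop() returns a new DataFrame and leaves the original unchanged, which is what we want here because the following sections still use the Age column.

# Adding a new column
df['Salary'] = [60000, 75000, 50000, 80000]

# Deleting a column: drop() returns a copy without 'Age'; df itself is unchanged
df_without_age = df.drop('Age', axis=1)

print(df_without_age)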
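
If you prefer not to modify a DataFrame in place at all, both steps can be written so that each returns a new DataFrame. This is a sketch of that style using assign(), an alternative to direct column assignment, with a hypothetical two-row DataFrame.

```python
import pandas as pd

staff = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30],
})

# assign() returns a new DataFrame with the extra column
with_salary = staff.assign(Salary=[60000, 75000])

# drop() also returns a new DataFrame unless inplace=True is passed
without_age = with_salary.drop(columns=['Age'])

print(list(staff.columns))
print(list(with_salary.columns))
print(list(without_age.columns))
```

The original staff DataFrame keeps its two columns throughout, so earlier results stay available for later steps.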

Filtering and Selecting Data

You can filter rows based on specific conditions using boolean indexing. For example, to select rows where the age is greater than 25:

filtered_df = df[df['Age'] > 25]
print(filtered_df)

To select specific columns, you can use their column names:

selected_columns = df[['Name', 'City']]
print(selected_columns)
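
Row filtering and column selection can also be combined. The sketch below rebuilds the tutorial's data, joins two conditions with &, and uses .loc to pick rows and columns in one step; the second condition (excluding Houston) is only an illustration.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 22, 28],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
mask = (people['Age'] > 25) & (people['City'] != 'Houston')

# .loc takes the row filter and the column selection together
result = people.loc[mask, ['Name', 'City']]
print(result)
```

Python's plain `and` and `or` do not work element-wise on a Series, which is why the & and | operators are used here.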

Applying Functions to Columns

You can apply functions to entire columns using the apply() method. This is particularly useful when you want to transform or calculate values for a column based on a function.

# Define a function to categorize ages
def categorize_age(age):
    if age < 25:
        return 'Young'
    elif age < 40:  # reached only when age >= 25
        return 'Adult'
    else:
        return 'Senior'

# Apply the function to the 'Age' column
df['Age_Category'] = df['Age'].apply(categorize_age)

print(df)
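
For simple binning like this, pandas also offers pd.cut() as an alternative to apply(). The sketch below is not the method used above; it shows how the same thresholds could be expressed as bin edges. The extra age of 45 is added only to exercise the 'Senior' label.

```python
import numpy as np
import pandas as pd

# Tutorial ages plus one illustrative value (45)
ages = pd.Series([25, 30, 22, 28, 45], name='Age')

# Bin edges mirror the thresholds in categorize_age.
# right=False makes each bin include its left edge: [-inf, 25), [25, 40), [40, inf)
categories = pd.cut(
    ages,
    bins=[-np.inf, 25, 40, np.inf],
    labels=['Young', 'Adult', 'Senior'],
    right=False,
)

print(categories.tolist())
```

The result is a categorical Series, so it can be assigned to a new column in the same way as the output of apply().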

5. Example 1: Analyzing Sales Data

Let’s consider a scenario where you have sales data for different products. The data includes columns such as Product, Price, and Units Sold. We can create a DataFrame and perform basic analysis on the data.

sales_data = {
    'Product': ['Widget A', 'Widget B', 'Widget A', 'Widget C', 'Widget B'],
    'Price': [10.99, 15.99, 10.99, 25.99, 15.99],
    'Units Sold': [100, 75, 120, 50, 90]
}

sales_df = pd.DataFrame(sales_data)

# Calculate the revenue for each row (price times units sold)
sales_df['Total Revenue'] = sales_df['Price'] * sales_df['Units Sold']

# Display the DataFrame
print(sales_df)
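
Because Widget A and Widget B each appear in more than one row, the Total Revenue column holds one figure per row rather than one per product. A possible next step, sketched here with groupby(), is to sum the rows that share a product name.

```python
import pandas as pd

sales_df = pd.DataFrame({
    'Product': ['Widget A', 'Widget B', 'Widget A', 'Widget C', 'Widget B'],
    'Price': [10.99, 15.99, 10.99, 25.99, 15.99],
    'Units Sold': [100, 75, 120, 50, 90],
})
sales_df['Total Revenue'] = sales_df['Price'] * sales_df['Units Sold']

# Sum units and revenue across all rows that share a product name
per_product = sales_df.groupby('Product')[['Units Sold', 'Total Revenue']].sum()

# Round for display only; the stored values keep full precision
print(per_product.round(2))
```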

6. Example 2: Examining Student Performance

Let’s explore a scenario involving student performance data. The data includes columns such as Student Name, Math Score, and English Score. We will create a DataFrame and analyze the performance using descriptive statistics.

student_data = {
    'Student Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Math Score': [85, 90, 78, 92, 70],
    'English Score': [92, 85, 88, 78, 82]
}

student_df = pd.DataFrame(student_data)

# Calculate average scores
student_df['Average Score'] = (student_df['Math Score'] + student_df['English Score']) / 2

# Display the DataFrame
print(student_df)
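
To go one step further with descriptive statistics, the sketch below summarizes each subject with describe() and looks up the student with the highest average. It rebuilds the same DataFrame so that it runs on its own; the names score_stats, top_row, and top_student are illustrative.

```python
import pandas as pd

student_df = pd.DataFrame({
    'Student Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Math Score': [85, 90, 78, 92, 70],
    'English Score': [92, 85, 88, 78, 82],
})
student_df['Average Score'] = (
    student_df['Math Score'] + student_df['English Score']
) / 2

# Per-subject summary: count, mean, std, min, quartiles, max
score_stats = student_df[['Math Score', 'English Score']].describe()
print(score_stats)

# idxmax() gives the row label of the highest average; use it to look up the name
top_row = student_df['Average Score'].idxmax()
top_student = student_df.loc[top_row, 'Student Name']
print(top_student)
```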

7. Conclusion

In this tutorial, we explored the creation and manipulation of Pandas DataFrames in Python. We learned how to create DataFrames from dictionaries and lists of lists, view and analyze data within DataFrames, and manipulate DataFrames by adding and deleting columns, filtering data, and applying functions to columns. Two practical examples demonstrated how DataFrames can be used for analyzing sales data and student performance. Pandas DataFrames are a powerful tool for data manipulation and analysis, and mastering their usage is essential for anyone working with structured data in Python.
