Introduction to the describe() Function
The describe() function in the Python library Pandas is a powerful tool for quickly summarizing the statistics of a dataset. It provides a concise overview of the central tendencies, spread, and distribution of the data, making it an essential step in the exploratory data analysis (EDA) process. This tutorial covers the details of the describe() function, explaining its parameters and providing practical examples of its usage.
Table of Contents
- What is the describe() Function?
- Parameters of the describe() Function
- Output of the describe() Function
- Example 1: Analyzing a Numerical Dataset
- Example 2: Examining a Mixed-Type Dataset
- Conclusion
1. What is the describe() Function?
The describe() function is a convenient tool in Pandas that generates descriptive statistics for the columns of a DataFrame. By default it summarizes only the numerical columns, calculating the count, mean, standard deviation, minimum, quartiles, and maximum. Categorical (object) columns can be summarized as well when requested through the include parameter. This function offers a quick way to understand the distribution and characteristics of data without having to write extensive code.
The basic syntax of the describe() function is as follows:
DataFrame.describe(percentiles=None, include=None, exclude=None)
2. Parameters of the describe() Function
The describe() function can be customized using the following parameters:
- percentiles: A list of percentiles to compute for numerical data, given as fractions between 0 and 1 (for example, 0.9 for the 90th percentile). By default, it calculates the 25th, 50th (median), and 75th percentiles. A list passed here replaces the default 25th and 75th percentiles.
- include: A list of data types to be included in the summary, or the string 'all' to summarize every column. By default, only numeric columns are included. You can use this parameter to include other data types like 'object', 'datetime', etc.
- exclude: A list of data types to be excluded from the summary. This parameter is useful when you want to leave specific data types out of the summary.
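The three parameters can be tried side by side on a small DataFrame. The following is a minimal sketch, not part of the original tutorial: the column names and values are made up for illustration, and the region column is given an explicit object dtype so that include=['object'] selects it regardless of how a given Pandas version infers string columns.

```python
import pandas as pd

# Hypothetical data, invented for this illustration
df = pd.DataFrame({
    "price": [10.0, 12.5, 9.0, 20.0, 15.5],
    "units": [3, 5, 2, 8, 6],
    # Explicit object dtype so include=["object"] matches this column
    "region": pd.Series(["north", "south", "north", "east", "south"], dtype="object"),
})

# percentiles: fractions between 0 and 1, replacing the default 25%/75% rows
tails = df.describe(percentiles=[0.1, 0.9])

# include: restrict the summary to object (text) columns
text_only = df.describe(include=["object"])

# exclude: summarize every column except the floating-point ones
no_floats = df.describe(exclude=["float64"])

print(tails)
print(text_only)
print(no_floats)
```

Note that include and exclude select columns by data type, not by name. To summarize specific columns by name, select them first, for example df[["price"]].describe().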
3. Output of the describe() Function
The describe() function generates a DataFrame with summary statistics for each numerical column in the input DataFrame. The resulting DataFrame contains the following statistics for each column:
- count: The number of non-null values.
- mean: The mean (average) of the data.
- std: The standard deviation, which measures the spread of the data.
- min: The minimum value.
- 25%: The first quartile (25th percentile).
- 50%: The median (50th percentile).
- 75%: The third quartile (75th percentile).
- max: The maximum value.
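Each of these rows corresponds to an ordinary Pandas method, which can be checked by computing the same statistics by hand. The sketch below uses a made-up column named x with one missing value; the data is an assumption chosen so the numbers are easy to verify. It also shows that count ignores missing values and that std is the sample standard deviation (ddof=1), which is the default of Series.std().

```python
import pandas as pd

# Hypothetical column with one missing value (None becomes NaN)
df = pd.DataFrame({"x": [2.0, 4.0, 4.0, 5.0, 10.0, None]})
col = df["x"]

summary = df.describe()["x"]

# The same statistics computed one at a time
manual = pd.Series({
    "count": col.count(),        # non-null values only
    "mean": col.mean(),
    "std": col.std(),            # sample standard deviation (ddof=1)
    "min": col.min(),
    "25%": col.quantile(0.25),
    "50%": col.quantile(0.50),
    "75%": col.quantile(0.75),
    "max": col.max(),
}, name="x")

# Show both versions next to each other
print(pd.concat([summary, manual], axis=1, keys=["describe", "manual"]))
```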
4. Example 1: Analyzing a Numerical Dataset
Let’s begin by working with a synthetic numerical dataset to demonstrate the usage of the describe() function.
import pandas as pd
import numpy as np
# Create a DataFrame with random numerical data
data = {
    'A': np.random.randn(1000),
    'B': np.random.randint(1, 100, 1000),
    'C': np.random.uniform(0, 1, 1000)
}
df = pd.DataFrame(data)
Now, let’s apply the describe() function to our DataFrame:
summary = df.describe()
print(summary)
The output will look similar to this (the exact values differ from run to run because the data is random):
                 A            B            C
count  1000.000000  1000.000000  1000.000000
mean     -0.006832    49.617000     0.498975
std       1.007611    28.753597     0.288235
min      -3.361839     1.000000     0.000996
25%      -0.706895    25.000000     0.246854
50%      -0.024112    49.000000     0.493872
75%       0.683935    75.000000     0.744900
max       3.631837    99.000000     0.999631
In this example, we created a DataFrame with three numerical columns (‘A’, ‘B’, ‘C’) and used the describe() function to compute summary statistics for each column. The output provides insights into the distribution of the data, including the mean, standard deviation, quartiles, and other key statistics.
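Because the result of describe() is itself a DataFrame, individual statistics can be looked up by row and column label. The following is a minimal sketch under a few assumptions: it generates similar data with NumPy's default_rng generator and an arbitrary seed of 42 so that reruns give the same numbers, which means its values will not match the output printed above.

```python
import numpy as np
import pandas as pd

# Seeded generator for repeatable results (the seed 42 is an arbitrary choice)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "A": rng.standard_normal(1000),
    "B": rng.integers(1, 100, 1000),   # integers from 1 to 99
    "C": rng.uniform(0, 1, 1000),
})

summary = df.describe()

# A single statistic: row label first, then column label
mean_a = summary.loc["mean", "A"]

# Interquartile range of every column, derived from two summary rows
iqr = summary.loc["75%"] - summary.loc["25%"]

print(mean_a)
print(iqr)

# Transposing gives one row per column, which is easier to read for wide data
print(summary.T)
```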
5. Example 2: Examining a Mixed-Type Dataset
The describe() function can also handle mixed-type datasets containing both numerical and categorical columns. Let’s work with an example to illustrate this.
# Create a mixed-type DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
    'Age': [25, 32, 28, 19, 45],
    'Salary': [60000, 75000, 50000, 22000, 90000],
    'Department': ['HR', 'Engineering', 'Finance', 'Engineering', 'Marketing']
}
mixed_df = pd.DataFrame(data)
Now, let’s apply the describe() function to our mixed-type DataFrame:
mixed_summary = mixed_df.describe(include='all')
print(mixed_summary)
The output will look like this:
         Name        Age        Salary   Department
count       5   5.000000      5.000000            5
unique      5        NaN           NaN            4
top     Emily        NaN           NaN  Engineering
freq        1        NaN           NaN            2
mean      NaN  29.800000  59400.000000          NaN
std       NaN   9.731393  25822.470835          NaN
min       NaN  19.000000  22000.000000          NaN
25%       NaN  25.000000  50000.000000          NaN
50%       NaN  28.000000  60000.000000          NaN
75%       NaN  32.000000  75000.000000          NaN
max       NaN  45.000000  90000.000000          NaN
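In this example, we created a mixed-type DataFrame with both numerical and categorical columns. By setting the include parameter to 'all', we instructed the describe() function to generate summary statistics for all columns, regardless of their data type. In addition to the numerical statistics, the output provides statistics specific to categorical columns: the number of distinct values (unique), the most frequent category (top), and its frequency (freq). Statistics that do not apply to a column's type are shown as NaN. Every value in Name occurs exactly once, so the name reported as top is chosen arbitrarily among the tied values and may differ on your machine.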
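When only the categorical columns are of interest, the numeric ones can be left out so the summary contains no NaN padding. The sketch below rebuilds the same mixed-type DataFrame and uses exclude=['number'], which is one way to do this; pairing it with value_counts() is a suggestion added here, not something the original example does, and it shows the full frequency table behind top and freq.

```python
import pandas as pd

# Same data as the mixed-type example above
mixed_df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Emily"],
    "Age": [25, 32, 28, 19, 45],
    "Salary": [60000, 75000, 50000, 22000, 90000],
    "Department": ["HR", "Engineering", "Finance", "Engineering", "Marketing"],
})

# Drop numeric columns from the summary, leaving only Name and Department
cat_summary = mixed_df.describe(exclude=["number"])
print(cat_summary)

# Full frequency table for one column; its first row matches top and freq
dept_counts = mixed_df["Department"].value_counts()
print(dept_counts)
```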
6. Conclusion
The describe() function in Pandas is an indispensable tool for quickly gaining insights into the statistics of your dataset. It provides a comprehensive overview of numerical and categorical data, helping you identify patterns, outliers, and trends. By using the various parameters of the describe() function, you can tailor the summary statistics to meet your specific needs. Incorporate the describe() function into your exploratory data analysis workflow to efficiently understand your data’s characteristics and make informed decisions in data-driven projects.