Categorical features are variables that can take on a limited, fixed number of values or categories. These features are commonly encountered in datasets and can present challenges when working with machine learning algorithms, as many algorithms require numerical input. Pandas, a popular data manipulation library in Python, offers various techniques for encoding categorical features into numerical representations. In this tutorial, we will explore different methods of encoding categorical features using Pandas, along with illustrative examples.
Table of Contents
- Introduction to Categorical Encoding
- Label Encoding
- One-Hot Encoding
- Ordinal Encoding
- Binary Encoding
- Count Encoding
- Target Encoding (Mean Encoding)
- Conclusion
1. Introduction to Categorical Encoding
Categorical encoding is the process of converting categorical variables into numerical formats that machine learning algorithms can work with. This step is essential because many machine learning algorithms, such as regression and neural networks, rely on numerical data for processing. Pandas covers several of these encodings on its own, and companion libraries such as scikit-learn and category_encoders cover the rest. Each method has its own advantages and use cases.
In this tutorial, we will cover the following categorical encoding methods:
- Label Encoding
- One-Hot Encoding
- Ordinal Encoding
- Binary Encoding
- Count Encoding
- Target Encoding (Mean Encoding)
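Before we dive into these methods, let’s set up our environment by importing Pandas. The label encoding and binary encoding sections additionally import scikit-learn and category_encoders, which must be installed separately: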
import pandas as pd
2. Label Encoding
Label encoding is a simple method of assigning unique numerical values to each category present in a categorical feature. Each category is mapped to an integer, starting from 0. While this method is straightforward, it can lead to issues where the algorithm might interpret the encoded values as ordinal when they are not.
Let’s consider an example using a sample dataset of animal types:
data = {'Animal': ['Cat', 'Dog', 'Dog', 'Fish', 'Cat', 'Fish']}
df = pd.DataFrame(data)
print(df)
Output:
Animal
0 Cat
1 Dog
2 Dog
3 Fish
4 Cat
5 Fish
To label encode the ‘Animal’ column, we can use the LabelEncoder class from the sklearn.preprocessing module:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df['Animal_LabelEncoded'] = encoder.fit_transform(df['Animal'])
print(df)
Output:
Animal Animal_LabelEncoded
0 Cat 0
1 Dog 1
2 Dog 1
3 Fish 2
4 Cat 0
5 Fish 2
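In this example, the ‘Animal’ column has been label encoded into numerical values. LabelEncoder assigns the integers in sorted order of the category names, which is why Cat, Dog, and Fish become 0, 1, and 2. The scikit-learn documentation describes LabelEncoder as a tool for encoding target labels rather than input features, and label encoding may not be suitable for a feature when the encoded values imply an ordinal relationship that doesn’t exist.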
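The same integer codes can also be produced with Pandas alone. The sketch below shows two options, assuming the same ‘Animal’ data; the column names Animal_CatCodes and Animal_Factorized are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Animal': ['Cat', 'Dog', 'Dog', 'Fish', 'Cat', 'Fish']})

# Option 1: category codes follow the sorted order of the unique values
df['Animal_CatCodes'] = df['Animal'].astype('category').cat.codes

# Option 2: factorize assigns codes in order of first appearance
codes, uniques = pd.factorize(df['Animal'])
df['Animal_Factorized'] = codes

print(df)
print(list(uniques))  # the category that each code stands for
```

Both options give the same codes here only because the categories happen to first appear in alphabetical order; on other data the two columns can differ.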
3. One-Hot Encoding
One-hot encoding is a technique that creates one binary column for each category in the original feature. Each binary column represents the presence or absence of a particular category. This method eliminates the potential ordinal relationship and prevents algorithms from assigning unintended importance to categories, at the cost of adding one column per category, which can become unwieldy for features with many distinct values.
Let’s continue with the ‘Animal’ example and perform one-hot encoding:
# dtype=int keeps the 0/1 output shown below; pandas 2.0+ returns True/False by default
one_hot_encoded = pd.get_dummies(df['Animal'], prefix='Animal', dtype=int)
df = pd.concat([df, one_hot_encoded], axis=1)
print(df)
Output:
Animal Animal_LabelEncoded Animal_Cat Animal_Dog Animal_Fish
0 Cat 0 1 0 0
1 Dog 1 0 1 0
2 Dog 1 0 1 0
3 Fish 2 0 0 1
4 Cat 0 1 0 0
5 Fish 2 0 0 1
Here, the ‘Animal’ column has been one-hot encoded into three binary columns: ‘Animal_Cat’, ‘Animal_Dog’, and ‘Animal_Fish’. Each binary column indicates the presence of a specific animal type.
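With a full set of one-hot columns, any one column can be inferred from the others, which can cause trouble for linear models with an intercept. A minimal sketch of one common remedy, assuming the same ‘Animal’ data, is to drop the first category with the drop_first parameter of pd.get_dummies:

```python
import pandas as pd

df = pd.DataFrame({'Animal': ['Cat', 'Dog', 'Dog', 'Fish', 'Cat', 'Fish']})

# drop_first=True removes the first category's column (here 'Animal_Cat');
# a row with zeros in every remaining column then stands for 'Cat'
reduced = pd.get_dummies(df['Animal'], prefix='Animal', drop_first=True, dtype=int)
print(reduced)
```

Tree-based models generally do not need this step, so whether to drop a column is a judgment call that depends on the algorithm.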
4. Ordinal Encoding
Ordinal encoding is suitable when the categorical variable has an inherent order or rank among its categories. In this method, each category is assigned a unique integer based on its order.
Let’s consider an example where we have a dataset of education levels:
data = {'Education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor']}
df = pd.DataFrame(data)
print(df)
Output:
Education
0 High School
1 Bachelor
2 Master
3 PhD
4 Bachelor
To perform ordinal encoding, we first define the order of the categories and then map them to integers:
education_order = ['High School', 'Bachelor', 'Master', 'PhD']
df['Education_OrdinalEncoded'] = df['Education'].apply(lambda x: education_order.index(x))
print(df)
Output:
Education Education_OrdinalEncoded
0 High School 0
1 Bachelor 1
2 Master 2
3 PhD 3
4 Bachelor 1
In this example, we have used ordinal encoding to represent education levels with their respective order.
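An alternative sketch uses an ordered pd.Categorical, which stores the order alongside the data and flags unexpected values instead of raising an error. The value 'Diploma' below is a hypothetical unseen category added only to show that behavior.

```python
import pandas as pd

df = pd.DataFrame({'Education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor']})
education_order = ['High School', 'Bachelor', 'Master', 'PhD']

# An ordered Categorical assigns codes by position in education_order
ordered = pd.Categorical(df['Education'], categories=education_order, ordered=True)
df['Education_OrdinalEncoded'] = ordered.codes
print(df)

# Values missing from education_order receive the code -1
unseen = pd.Categorical(['Diploma'], categories=education_order, ordered=True)
print(unseen.codes)
```

Checking for -1 after encoding is a cheap way to catch typos or new levels in incoming data.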
5. Binary Encoding
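Binary encoding is a hybrid method that combines aspects of both label encoding and one-hot encoding. It converts categories into binary code and then splits the binary digits into separate columns, so a feature with n categories needs roughly log2(n) columns instead of the n columns that one-hot encoding would create.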
Consider a dataset of car manufacturers:
data = {'Manufacturer': ['Toyota', 'Ford', 'Toyota', 'Chevrolet', 'Ford']}
df = pd.DataFrame(data)
print(df)
Output:
Manufacturer
0 Toyota
1 Ford
2 Toyota
3 Chevrolet
4 Ford
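To perform binary encoding, we need to convert the categories to numerical values and then represent those values in binary format. The BinaryEncoder class from the category_encoders package, which is installed separately from Pandas, handles both steps: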
import category_encoders as ce
encoder = ce.BinaryEncoder(cols=['Manufacturer'])
df_binary = encoder.fit_transform(df)
print(df_binary)
Output:
Manufacturer_0 Manufacturer_1 Manufacturer_2
0 0 0 1
1 0 1 0
2 0 0 1
3 0 1 1
4 0 1 0
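Here, the ‘Manufacturer’ column has been binary encoded, and the binary digits are split into three separate columns. Read across a row, the digits spell out the integer assigned to that category: Toyota is 001 (1), Ford is 010 (2), and Chevrolet is 011 (3). The exact number of columns can differ between versions of category_encoders, so treat the output above as illustrative.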
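To make the two steps concrete, here is a minimal Pandas-only sketch of the same idea. It is not how category_encoders is implemented; the column names Manufacturer_bit0 and Manufacturer_bit1 and the choice to start the codes at 1 are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Manufacturer': ['Toyota', 'Ford', 'Toyota', 'Chevrolet', 'Ford']})

# Step 1: integer codes in order of first appearance, starting at 1
codes = pd.factorize(df['Manufacturer'])[0] + 1  # Toyota=1, Ford=2, Chevrolet=3

# Step 2: number of binary digits needed for the largest code
n_bits = int(codes.max()).bit_length()

# Step 3: one column per binary digit, most significant digit first
for i in range(n_bits):
    shift = n_bits - 1 - i
    df[f'Manufacturer_bit{i}'] = (codes >> shift) & 1

print(df)
```

Three categories fit in two binary digits, so this sketch produces two columns.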
6. Count Encoding
Count encoding is a technique where each category is replaced with the count of its occurrences in the dataset. This method can be helpful when how often a category appears is itself informative about the target variable, and it keeps the feature in a single column regardless of the number of categories.
Let’s work with a dataset of cities:
data = {'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago']}
df = pd.DataFrame(data)
print(df)
Output:
City
0 New York
1 Los Angeles
2 Chicago
3 New York
4 Chicago
To perform count encoding, we create a mapping of category counts and then replace the categories with their respective counts:
count_map = df['City'].value_counts().to_dict()
df['City_CountEncoded'] = df['City'].map(count_map)
print(df)
Output:
City City_CountEncoded
0 New York 2
1 Los Angeles 1
2 Chicago 2
3 New York 2
4 Chicago 2
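In this example, the ‘City’ column has been count encoded using the frequency of each city’s occurrence. Note that New York and Chicago both appear twice, so they receive the same encoded value and the model can no longer tell them apart.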
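When the encoding is applied to data that was not used to build the counts, some categories may be missing from the mapping. The sketch below assumes a hypothetical new_data frame containing an unseen city ('Boston') and fills such cases with 0; it also shows the relative-frequency variant obtained with value_counts(normalize=True).

```python
import pandas as pd

train = pd.DataFrame({'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago']})
new_data = pd.DataFrame({'City': ['Chicago', 'Boston']})  # hypothetical unseen rows

# Learn counts and relative frequencies from the training data only
count_map = train['City'].value_counts()
freq_map = train['City'].value_counts(normalize=True)

# Categories absent from the training data map to NaN, replaced here with 0
new_data['City_CountEncoded'] = new_data['City'].map(count_map).fillna(0).astype(int)
new_data['City_FreqEncoded'] = new_data['City'].map(freq_map).fillna(0.0)
print(new_data)
```

Filling with 0 is only one possible convention; another is to keep the missing value and let the model or a later imputation step handle it.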
7. Target Encoding (Mean Encoding)
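Target encoding, also known as mean encoding, involves replacing each category with the mean of the target variable for that category. This technique can be useful when there is a clear relationship between the categorical feature and the target variable. Because the encoding is derived from the target itself, it should be computed on the training data only; otherwise information about the target leaks into the feature and the model can overfit, especially for rare categories.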
Suppose we have a dataset of car types:
data = {'CarType': ['Sedan', 'SUV', 'SUV', 'Convertible', 'Sedan']}
target = [25000, 30000, 32000, 28000, 27000]
df = pd.DataFrame({'CarType': data['CarType'], 'Price': target})
print(df)
Output:
CarType Price
0 Sedan 25000
1 SUV 30000
2 SUV 32000
3 Convertible 28000
4 Sedan 27000
To perform target encoding, we group the data by the categorical column, calculate the mean of the target variable for each category, and then replace the categories with their respective mean values:
target_mean = df.groupby('CarType')['Price'].mean().to_dict()
df['CarType_TargetEncoded'] = df['CarType'].map(target_mean)
print(df)
Output:
CarType Price CarType_TargetEncoded
0 Sedan 25000 26000.0
1 SUV 30000 31000.0
2 SUV 32000 31000.0
3 Convertible 28000 28000.0
4 Sedan 27000 26000.0
In this example, the ‘CarType’ column has been target encoded using the mean prices of each car type.
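Category means based on very few rows are noisy; ‘Convertible’ above is encoded from a single price. One common remedy is to shrink each category mean toward the overall mean. The sketch below is an illustration under assumptions, not part of the original example: the smoothing strength m = 2 is an arbitrary choice and CarType_SmoothedTarget is a made-up column name.

```python
import pandas as pd

df = pd.DataFrame({
    'CarType': ['Sedan', 'SUV', 'SUV', 'Convertible', 'Sedan'],
    'Price': [25000, 30000, 32000, 28000, 27000],
})

m = 2  # assumed smoothing strength: larger values pull harder toward the global mean
global_mean = df['Price'].mean()

# Per-category mean and row count
stats = df.groupby('CarType')['Price'].agg(['mean', 'count'])

# Weighted average of the category mean and the global mean
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

df['CarType_SmoothedTarget'] = df['CarType'].map(smoothed)
print(df)
```

In a real workflow the statistics would be computed on the training split, or within cross-validation folds, and then mapped onto validation and test rows.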
8. Conclusion
Categorical encoding is a crucial preprocessing step when working with machine learning algorithms that require numerical input. In this tutorial, we covered several categorical encoding methods using Pandas together with scikit-learn and category_encoders: label encoding, one-hot encoding, ordinal encoding, binary encoding, count encoding, and target encoding.
Remember that the choice of encoding method depends on the nature of the categorical variable, the problem at hand, and the algorithms you plan to use. It’s important to understand the characteristics of each encoding technique and choose the one that best suits your data and objectives. Always assess the impact of your chosen encoding method on the model’s performance to ensure optimal results.