Categorical Encoding Cheat Sheet (DRAFT) by [deleted]

This cheat sheet gives a quick overview of why and how to encode categorical variables in Python, using the category_encoders and scikit-learn libraries. It covers three types of encoding: One-Hot Encoding, Ordinal Encoding and Label Encoding.

This is a draft cheat sheet. It is a work in progress and is not finished yet.

Why do we Encode?

- Most models accept only numeric values.
- We cannot afford to lose important features because of their data types.
- Encoding is required for the model to train correctly and perform well.

Types of Encoding

- Ordinal Encoding
- One Hot Encoding
- Label Encoding

Ordinal Encoding

- Used for encoding ordinal variables (categories with a natural order).
- A number is assigned to each category according to its position in the variable's order.
- Any numbers can be assigned, as long as the original order is preserved.

Code:

!pip install category_encoders

import category_encoders as ce

encoder = ce.OrdinalEncoder(mapping=[{'col': 'feedback', 'mapping': {'bad': 1, 'okay': 2, 'good': 3}}])

encoder.fit(X)
X = encoder.transform(X)
X['feedback']

Output:

feedback
1
2
3
2
3
.
.
Documentation: https://contrib.scikit-learn.org/category_encoders/ordinal.html
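To make the mechanics concrete, here is a minimal, dependency-free sketch of what an ordinal mapping does. It is not the category_encoders implementation; the sample list and the helper name ordinal_encode are hypothetical and exist only for illustration.

```python
# Hypothetical sample data; the values mirror the 'feedback' mapping used in this section.
feedback = ["bad", "okay", "good", "okay", "good"]

# Explicit mapping that preserves the natural order bad < okay < good.
order = {"bad": 1, "okay": 2, "good": 3}


def ordinal_encode(values, mapping):
    """Replace each category with its assigned number; fail on unmapped categories."""
    unknown = set(values) - set(mapping)
    if unknown:
        raise ValueError(f"unmapped categories: {sorted(unknown)}")
    return [mapping[v] for v in values]


encoded = ordinal_encode(feedback, order)
print(encoded)  # [1, 2, 3, 2, 3]
```

Any increasing set of numbers (for example 10, 20, 30) would work equally well here, since only the relative order carries meaning.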
 

One-Hot Encoding

- Best used when the variable has few categories (roughly 3 or 4 at most). More categories add many columns, which increases the size of the dataset and can hurt model performance.
- Assigns 0 or 1 in each new column depending on whether that category is present in the row.
- Creates one new column for each category in the original column.
For example, if the column Shipping has 3 categories (Standard, One Day, Two Day), 3 new columns are created in place of the original column, one per category, and each row gets a 1 in the column matching its value and 0 in the others.

Usage:

import category_encoders as ce

encoder = ce.OneHotEncoder(cols=['Shipping'])
encoder.fit(df)
df = encoder.transform(df)
# the original 'Shipping' column is replaced by one 0/1 column per category
df
Documentation: https://contrib.scikit-learn.org/category_encoders/onehot.html
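The Shipping example can be sketched in plain Python to show how the 0/1 columns are built. This is an illustrative sketch only, not how category_encoders works internally; the sample list and the helper name one_hot_encode are hypothetical.

```python
# Hypothetical sample data using the three Shipping categories from this section.
shipping = ["Standard", "One Day", "Two Day", "One Day"]


def one_hot_encode(values):
    """Return (categories, rows); each row holds a single 1 marking its category."""
    categories = list(dict.fromkeys(values))  # unique values, in first-seen order
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


categories, rows = one_hot_encode(shipping)
print(categories)  # ['Standard', 'One Day', 'Two Day']
for row in rows:
    print(row)
# [1, 0, 0]
# [0, 1, 0]
# [0, 0, 1]
# [0, 1, 0]
```

Each row sums to 1, and the number of new columns equals the number of distinct categories, which is why this approach gets expensive for columns with many categories.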


Label Encoding

- Converts each category in a column directly to a number.
- Can be used for non-numerical values, as long as they are relevant to predicting the target variable.
- Different methods can be applied depending on your requirements.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['Column Name_Cat'] = le.fit_transform(df['Column Name'])
df
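As a rough illustration of the numbering, scikit-learn's LabelEncoder sorts the distinct classes and numbers them from 0. The sketch below imitates that behavior in plain Python; it is not the library code, and the sample list and the helper name label_encode are hypothetical.

```python
# Hypothetical sample data, reusing the feedback categories from the ordinal section.
feedback = ["bad", "okay", "good", "okay", "good"]


def label_encode(values):
    """Number the sorted distinct categories from 0 and map each value to its number."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return classes, [index[v] for v in values]


classes, codes = label_encode(feedback)
print(classes)  # ['bad', 'good', 'okay']
print(codes)    # [0, 2, 1, 2, 1]
```

Note that the alphabetical numbering puts 'good' (1) below 'okay' (2), so the codes do not follow the order bad < okay < good. When the order matters, an explicit mapping as in Ordinal Encoding is the safer choice.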
