One Hot Encoding Categorical Variables for Multivariate Linear Regression

2 min read · Jun 2, 2020

Introduction

Looking at a data-set, it is sometimes difficult to tell whether its variables are categorical, continuous, or a combination of both. Let’s take a look at the effect of not applying one-hot encoding to the categorical variables. We will use this data-set to demonstrate:

House Sales in King County, USA

Predict house price using regression

www.kaggle.com

Data Exploration

The data-set linked above was used without any feature transformations. Only some pre-processing was performed, including filling in NaN values. Let’s take a look at the data-set.
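A first look at the data can be sketched as below. The rows are hypothetical stand-ins for the Kaggle file, and filling NaN values with 0 is only one possible choice, assumed here because the fill strategy is not stated.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows; the real data would be read from the Kaggle CSV.
df = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0, 604000.0],
    "bedrooms": [3, 3, 2, 4],
    "waterfront": [0.0, np.nan, 0.0, 1.0],
    "yr_renovated": [0.0, 1991.0, np.nan, 0.0],
})

# Inspect the first rows and the column types.
print(df.head())
print(df.dtypes)

# Count missing values per column before deciding how to fill them.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Assumption: replace every NaN with 0 (other strategies are equally possible).
filled = df.fillna(0)
```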

Data Cleaning

A few columns need to be cleaned. Some columns, like ‘id’ and ‘view’, do not provide any insight for the model. Let’s apply some data cleaning to our data-set.
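Dropping the uninformative columns might look like the following minimal sketch. The small frame is hypothetical; only the column names ‘id’ and ‘view’ come from the text above.

```python
import pandas as pd

# Hypothetical rows that only mimic the shape of the real data.
df = pd.DataFrame({
    "id": [101, 102, 103],
    "price": [221900.0, 538000.0, 180000.0],
    "view": [0, 0, 2],
    "bedrooms": [3, 3, 2],
})

# Remove the columns that will not be used as predictors.
cleaned = df.drop(columns=["id", "view"])
print(cleaned.columns.tolist())
```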

One column that deserves some thought is ‘yr_renovated’. It can be changed into a binary categorical variable: if a renovation year is shown, we input 1; otherwise we input 0 for no renovation. Fitting the regression after these transformations gives an adjusted R-squared value of 0.696.
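The renovation flag described above can be sketched as follows. The sample values are hypothetical, and treating both 0 and NaN as "no renovation" is an assumption about how the column is coded.

```python
import numpy as np
import pandas as pd

# Hypothetical values: a year means renovated, 0 or NaN means never renovated.
df = pd.DataFrame({"yr_renovated": [0.0, 1991.0, np.nan, 2005.0]})

# 1 if a renovation year is present, else 0.
df["renovated"] = (df["yr_renovated"].fillna(0) > 0).astype(int)

# The original year column is no longer needed once the flag exists.
df = df.drop(columns=["yr_renovated"])
print(df)
```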

The adjusted R-squared value is fairly high. However, we can do better, because more of the variables are categorical. We transform these categorical variables using one hot encoding with the Pandas function get_dummies.
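A minimal sketch of get_dummies on a single column is shown here. The data is hypothetical, and drop_first=True is a choice added in this sketch (not stated above) so that the dummy columns are not perfectly collinear with the regression intercept.

```python
import pandas as pd

# Hypothetical rows; 'condition' is an integer code that is really a category.
df = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0, 604000.0],
    "condition": [3, 5, 3, 4],
})

# One column per category level; the first level (3) is dropped as the baseline.
encoded = pd.get_dummies(df, columns=["condition"], drop_first=True, dtype=int)
print(encoded)
```

Each remaining dummy column then measures the price difference relative to the dropped baseline level.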

To find the categorical variables, we use a function that flags every column with fewer than 30 unique values.
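Such a function could be written along these lines. The name find_categorical_columns, its parameters, and the sample frame are hypothetical; only the threshold of 30 unique values comes from the text.

```python
import pandas as pd

def find_categorical_columns(df, max_unique=30, exclude=()):
    """Return the columns with fewer than max_unique distinct values."""
    return [
        col for col in df.columns
        if col not in exclude and df[col].nunique() < max_unique
    ]

# Hypothetical data: two many-valued columns and two few-valued columns.
df = pd.DataFrame({
    "price": [100000.0 + 1500.0 * i for i in range(40)],
    "sqft_living": [800 + 25 * i for i in range(40)],
    "condition": [1 + i % 5 for i in range(40)],
    "renovated": [i % 2 for i in range(40)],
})

# The target is excluded explicitly so it can never be encoded by accident.
categorical_columns = find_categorical_columns(df, exclude=("price",))
print(categorical_columns)
```

The returned list can be passed to the columns argument of get_dummies. A unique-value threshold is only a heuristic, so the flagged columns are worth checking by eye.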

The result of the one hot encoding is:

One hot encoding the categorical variables makes a significant difference. The adjusted R-squared value is now 0.807, up from the previous value of 0.696. That is an absolute gain of 0.111, or a relative increase of about 16%.
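The comparison of the two scores can be checked with a few lines. The function below uses the standard definition of adjusted R-squared, 1 - (1 - R²)(n - 1)/(n - p - 1). The sample size and predictor count in the example call are made-up numbers for illustration, not values from this analysis.

```python
def adjusted_r_squared(r_squared, n_samples, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    return 1 - (1 - r_squared) * (n_samples - 1) / (n_samples - n_predictors - 1)

# The two adjusted R-squared values reported in the article.
before, after = 0.696, 0.807

absolute_gain = after - before
relative_gain = absolute_gain / before
print(f"absolute gain: {absolute_gain:.3f}")
print(f"relative gain: {relative_gain:.1%}")

# Illustrative call with hypothetical n and p, showing the penalty at work.
example = adjusted_r_squared(0.5, n_samples=101, n_predictors=10)
print(f"example adjusted R-squared: {example:.4f}")
```

Because one hot encoding adds many predictors, adjusted R-squared is the fairer score to compare here: it only rises when the new columns explain more variance than their number alone would.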

Conclusion

Applying one hot encoding is a necessary step in correctly transforming categorical variables. Some variables may not be obviously categorical. The number of unique values a variable has can give insight into whether the variable is continuous or categorical.

Written by Jeffery Rosario
