One Hot Encoding Categorical Variables for Multivariate Linear Regression
Introduction
It is sometimes difficult to tell whether a data-set contains categorical variables, continuous variables, or a combination of both. Let’s take a look at the effect of not applying one-hot encoding to the categorical variables in a data-set. We will use this data-set to demonstrate:
Data Exploration
No transformations have been applied to the data-set linked above, apart from some pre-processing such as filling in NaNs. Let’s take a look at the data-set.
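A first look at a data-set of this kind could be sketched as follows. The small DataFrame is a hypothetical stand-in for the real data, and filling missing values with 0 is an assumption made for illustration, not necessarily the pre-processing that was actually applied.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; names and values are illustrative only.
df = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0, 604000.0],
    "sqft_living": [1180, 2570, 770, 1960],
    "waterfront": [0.0, np.nan, 0.0, 1.0],
    "yr_renovated": [0.0, 1991.0, np.nan, 0.0],
})

print(df.head())            # first rows of the data
print(df.dtypes)            # type of each column
missing = df.isna().sum()   # number of NaNs per column
print(missing)

# One possible fill strategy (an assumption): treat a missing flag or year as 0.
df_filled = df.fillna({"waterfront": 0, "yr_renovated": 0})
```

Checking the NaN counts before and after filling is a quick way to confirm that no missing values are left for the regression to trip over.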
Data Cleaning
A few columns need to be cleaned, and some, such as ‘id’ and ‘view’, do not provide any insight into our data. Let’s apply some data cleaning to our data-set.
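Dropping the uninformative columns might look like the sketch below. Only ‘id’ and ‘view’ come from the text; the other columns and all values are made up for the example.

```python
import pandas as pd

# Hypothetical rows; only the 'id' and 'view' column names come from the article.
df = pd.DataFrame({
    "id": [7129300520, 6414100192, 5631500400],
    "view": [0, 0, 2],
    "price": [221900.0, 538000.0, 180000.0],
    "bedrooms": [3, 3, 2],
})

# Remove columns that carry no useful signal for the regression.
df_clean = df.drop(columns=["id", "view"])
print(df_clean.columns.tolist())
```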
One column that deserves some thought is ‘yr_renovated’. It can be changed into a binary categorical variable: if a renovation year is shown we input 1, otherwise we input 0 for no renovations. Fitting the regression after these transformations gives an adjusted R-squared value of 0.696.
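The ‘yr_renovated’ recoding described above can be sketched as follows. It assumes that a house with no renovation is recorded as 0 or NaN, which may not match every version of the data.

```python
import numpy as np
import pandas as pd

# Hypothetical values: 0 or NaN is assumed to mean "never renovated".
df = pd.DataFrame({"yr_renovated": [0.0, 1991.0, np.nan, 2005.0]})

# 1 if a renovation year is present, 0 otherwise.
df["yr_renovated"] = (df["yr_renovated"].fillna(0) > 0).astype(int)
print(df["yr_renovated"].tolist())
```

Collapsing the year to a flag discards how recent the renovation was. That is a modelling choice rather than a requirement.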
The adjusted R-squared value is fairly high, but we can do better. Several other variables are also categorical, and we can transform them with one-hot encoding using the pandas function get_dummies.
To find the categorical variables, we use a function that selects the columns with fewer than 30 unique values.
The result of the one-hot encoding is:
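A minimal get_dummies sketch is shown below. The ‘condition’ column and its values are hypothetical. Passing drop_first=True is a common choice for linear regression because it avoids perfectly collinear dummy columns, but whether the original analysis used it is not stated.

```python
import pandas as pd

# Hypothetical categorical column with three levels.
df = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0, 604000.0],
    "condition": [3, 5, 4, 3],
})

# Replace 'condition' with 0/1 indicator columns; the first level (3)
# becomes the baseline and gets no column of its own.
encoded = pd.get_dummies(df, columns=["condition"], drop_first=True, dtype=int)
print(encoded)
```

Each remaining indicator column gets its own coefficient, so the model no longer assumes that moving from condition 3 to 4 has the same effect as moving from 4 to 5.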
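Such a function could be written as in the sketch below. The name find_categorical_columns, the max_unique parameter, and the toy columns are all hypothetical. Only the threshold of 30 comes from the text.

```python
import pandas as pd

def find_categorical_columns(df, max_unique=30):
    """Return the names of columns with fewer than max_unique distinct values."""
    return [col for col in df.columns if df[col].nunique() < max_unique]

# Toy data with 40 rows: 'price' is continuous, the others take few values.
df = pd.DataFrame({
    "price": [100000.0 + 2500.0 * i for i in range(40)],
    "floors": [1 + (i % 3) for i in range(40)],
    "grade": [5 + (i % 7) for i in range(40)],
})

categorical_cols = find_categorical_columns(df)
print(categorical_cols)
```

The threshold is only a heuristic. A count such as the number of bedrooms also has few unique values, so the selected columns are worth reviewing by hand before encoding them.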
One-hot encoding the categorical variables makes a significant difference. The adjusted R-squared value is now 0.807, up from the previous value of 0.696. That is an increase of 0.111, or about 16% in relative terms.
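To see why the encoding can matter this much, here is a small sketch on synthetic data. It uses the standard adjusted R-squared formula, 1 - (1 - R²)(n - 1)/(n - p - 1), and fits ordinary least squares with NumPy. The ‘zone’ variable and the prices are invented, and the helper adjusted_r2 is hypothetical. The last line only re-checks the relative increase between the two values reported in the article.

```python
import numpy as np
import pandas as pd

def adjusted_r2(X, y):
    """Fit least squares with an intercept and return adjusted R-squared."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])   # add intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Synthetic data: 'zone' is a label, not a quantity. Zone 1 is cheap,
# zones 0 and 2 are expensive, so price is not linear in the zone number.
zone = np.repeat([0, 1, 2], 4)
price = np.array([20, 21, 19, 20, 10, 11, 9, 10, 20, 21, 19, 20], dtype=float)

# Treat the label as a number versus one-hot encode it.
X_int = zone.reshape(-1, 1).astype(float)
X_onehot = pd.get_dummies(pd.Series(zone), drop_first=True, dtype=float).to_numpy()

adj_int = adjusted_r2(X_int, price)
adj_onehot = adjusted_r2(X_onehot, price)
print(adj_int, adj_onehot)

# Relative increase between the article's two reported values.
relative_gain = (0.807 - 0.696) / 0.696
print(round(relative_gain, 3))
```

With the label treated as a number, the model can only fit a single straight line through the zones and explains almost nothing. With indicator columns, each zone gets its own level.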
Conclusion
One-hot encoding is a necessary step in correctly transforming categorical variables. Some variables are not obviously categorical, and the number of unique values a variable has can give insight into whether it is continuous or categorical.