
Should One Hot Encoding or Dummy Variables Be Used With Ridge Regression?


For a regression problem in which the predictor is a single categorical variable with $q$ categories, the Ridge regression estimate can be viewed as the Best Linear Unbiased Predictor (BLUP) for the mixed model

$$ \mathbf{y} = \mathbf{X} \beta +\mathbf{Zu}+ \boldsymbol{\epsilon} $$

In this case, $\mathbf{X}$ is just a column of 1s and $\beta$ is the intercept. $\mathbf{Z}$ is the design matrix that encodes the $q$ random effects. In this situation, Ridge regression yields the BLUP if we set the ridge parameter to $\lambda = \sigma_{\epsilon}^2 / \sigma^2$. Here, $\sigma_{\epsilon}^2$ is the variance of $\boldsymbol{\epsilon}$, and $\sigma^2$ is the variance of $\mathbf{u}$ (the random effects are assumed to be isotropic). I've been told this is equivalent to fitting a Ridge model in which the feature matrix is a one hot encoding of the categories (that is, all $q$ categories appear as columns).
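
As a quick numeric check of that equivalence, here is a minimal sketch (using the same toy data and penalty $\lambda = 3$ as the example below): solving Henderson's mixed model equations directly and comparing against a one hot Ridge fit with an unpenalized intercept.

import numpy as np
from sklearn.linear_model import Ridge

# same toy data as the example below: 3 groups of 5 observations
y = np.arange(1.0, 16.0)
Z = np.kron(np.eye(3), np.ones((5, 1)))  # one hot encoding of the 3 groups
X = np.ones((15, 1))                     # fixed effects: intercept only

lam = 3.0  # plays the role of sigma_eps^2 / sigma^2

# Henderson's mixed model equations for (beta, u):
# [ X'X   X'Z         ] [beta]   [X'y]
# [ Z'X   Z'Z + lam*I ] [ u  ] = [Z'y]
A = np.block([[X.T @ X, X.T @ Z],
              [Z.T @ X, Z.T @ Z + lam * np.eye(3)]])
b = np.concatenate([X.T @ y, Z.T @ y])
beta, u = np.split(np.linalg.solve(A, b), [1])
print(beta + u)  # BLUP group predictions: [4.875, 8., 11.125]

# Ridge with all q indicators and an unpenalized intercept solves the same system
R = Ridge(alpha=lam, fit_intercept=True).fit(Z, y)
print(R.intercept_ + R.coef_)  # identical group predictions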

Some convincing arguments have been made, which I will summarize here in Python code. They are themselves summaries from the paper "That BLUP is a Good Thing: The Estimation of Random Effects" (Robinson, 1991). Here, 3 groups are simulated, and 4 Ridge models are fit: 3 in which each group takes a turn as the reference group, and one in which all groups appear in the feature matrix. A plot of the predictions is shown below.

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

D = pd.DataFrame({'group': [1,1,1,1,1,2,2,2,2,2,3,3,3,3,3],
                  'y':     [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]})

# Now let's dummy code the group in four different ways
X  = pd.get_dummies(D['group'], drop_first=False).values  # full dummy code with a dummy variable for each group
X1 = X[:, 1:3]    # drop the first group
X2 = X[:, [0, 2]] # drop the second group
X3 = X[:, :2]     # drop the last group

# Prediction matrices: one row per group, under each coding scheme
Xpred  = np.eye(3)
Xpred1 = Xpred[:, 1:3]
Xpred2 = Xpred[:, [0, 2]]
Xpred3 = Xpred[:, :2]

# Now let's use the different dummy coding schemes for Ridge regression, using a Ridge coefficient of 1
# First, dropping the first group:
R1 = Ridge(alpha=1)
R1.fit(X1, D['y'])
ypred_R1 = R1.predict(Xpred1)
ypred_R1
>>> array([ 4.875     ,  7.47916667, 11.64583333])

# Then dropping the middle group
R2 = Ridge(alpha=1)
R2.fit(X2, D['y'])
ypred_R2 = R2.predict(Xpred2)
ypred_R2
>>> array([ 3.83333333,  8.        , 12.16666667])

# And finally dropping the third group
R3 = Ridge(alpha=1)
R3.fit(X3, D['y'])
ypred_R3 = R3.predict(Xpred3)
ypred_R3
>>> array([ 4.35416667,  8.52083333, 11.125     ])

# Now we have 3 penalized regressors instead of 2. To achieve a similar amount
# of shrinkage, we therefore increase the Ridge coefficient a bit.
R = Ridge(alpha=3, fit_intercept=True)
R.fit(X, D['y'])
ypred_R = R.predict(Xpred)
ypred_R
>>> array([ 4.875,  8.   , 11.125])

[Figure: predictions from the four Ridge models]

It turns out that making one group the reference imposes implicit assumptions about the covariance of $\mathbf{u}$. Here is a plot of the implied covariance structure when we drop one of the categories to use as a reference:

[Figure: implied covariance structure when one category is used as the reference]

Compare this to the true covariance structure assumed by the model with all 3 predictors in the feature matrix:

[Figure: isotropic covariance structure assumed when all 3 indicators are included]
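
One way to make the asymmetry concrete (a sketch under my parameterization, in which the penalty corresponds to an i.i.d. Gaussian prior with some variance $\tau^2$ on the penalized coefficients; $\tau^2$ and the matrix names here are my own notation):

import numpy as np

tau2 = 1.0  # prior variance implied by the penalty (proportional to 1/lambda)

# One hot coding: u_1, u_2, u_3 are penalized identically and independently,
# so the implied prior covariance of u is isotropic, as the mixed model assumes.
cov_onehot = tau2 * np.eye(3)

# Reference coding (first group dropped): the group effects are u = L @ delta,
# where delta = (delta_2, delta_3) are the penalized coefficients and the
# reference group is pinned to the intercept with zero prior variance.
L = np.array([[0., 0.],
              [1., 0.],
              [0., 1.]])
cov_reference = tau2 * (L @ L.T)
print(cov_reference)  # diag(0, 1, 1): the reference group is treated as known exactly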

Question

I've never before seen the recommendation to keep all categories in a Ridge regression. Indeed, if we are cross-validating over the penalty, the benefit of the implied covariance structure does not seem worth the instability induced by the collinearity.

Does a ridge model in which all categories appear as binary indicators perform better (i.e., achieve lower loss) than a ridge model that absorbs one category into the intercept term? If so, why? Some (not very rigorous) experiments seem to hint that the answer is "no", but I'm interested in hearing the perspectives of other data scientists and statisticians.
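
For concreteness, here is a minimal version of the kind of experiment I have in mind (the simulation settings, the group count, the variances, and the alpha grid are all my own choices): simulate from the mixed model above, tune the penalty by cross validation under both encodings, and compare held-out squared error.

import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# simulate from the mixed model: one intercept plus q isotropic group effects
q, n_per = 10, 20
u = rng.normal(0.0, 2.0, size=q)                     # random effects u ~ N(0, sigma^2)
g = np.repeat(np.arange(q), n_per)
y = 5.0 + u[g] + rng.normal(0.0, 1.0, size=len(g))   # beta + Zu + eps

X_onehot = pd.get_dummies(g).values.astype(float)    # all q indicator columns
X_ref = X_onehot[:, 1:]                              # first group absorbed into the intercept

alphas = np.logspace(-3, 3, 25)
for name, X in [('one hot', X_onehot), ('reference', X_ref)]:
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)
    model = RidgeCV(alphas=alphas).fit(Xtr, ytr)     # cross validate over the penalty
    print(name, mean_squared_error(yte, model.predict(Xte)))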

Additionally, if the goal is prediction, what are the general consequences of imposing the wrong covariance structure on the data by making one group a reference category?

Should we include all categories when fitting a ridge model?

