Ridge vs Lasso

Overview

  • Lasso: L1 regularization, penalty term $\text{loss} + \lambda \sum_{j=1}^{p} |\beta_j|$

  • Ridge: L2 regularization, penalty term $\text{loss} + \lambda \sum_{j=1}^{p} \beta_j^2$
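
As a minimal sketch of how these penalties enter the objective (the function and variable names, and the use of plain NumPy with a squared-error loss, are illustrative assumptions, not part of the original):

```python
import numpy as np

def ridge_objective(X, y, beta, lam):
    """Squared-error loss plus the L2 penalty: loss + lam * sum(beta_j ** 2)."""
    loss = np.sum((y - X @ beta) ** 2)
    return loss + lam * np.sum(beta ** 2)

def lasso_objective(X, y, beta, lam):
    """Squared-error loss plus the L1 penalty: loss + lam * sum(|beta_j|)."""
    loss = np.sum((y - X @ beta) ** 2)
    return loss + lam * np.sum(np.abs(beta))
```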

About the equations

Both regressions can be thought of as solving a constrained optimization problem: each minimizes the loss subject to the sum of its regularized coefficients being less than or equal to a constant s, where such an s exists for each value of the shrinkage factor λ (the constrained forms are sketched after the list below).

  • Ridge: $\beta_1^2 + \beta_2^2 \le s$. The ridge coefficients are the ones with the smallest RSS among all points that lie within the circle defined by this inequality.

  • Lasso: $|\beta_1| + |\beta_2| \le s$. The lasso coefficients are the ones with the smallest RSS among all points that lie within the diamond defined by this inequality.
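
Written out explicitly, the two constrained problems look like this (a sketch of the standard two-predictor formulation; RSS denotes the residual sum of squares):

```latex
% Ridge: smallest RSS over the circular (L2) constraint region
\min_{\beta_1,\beta_2}\ \mathrm{RSS}(\beta_1,\beta_2)
\quad \text{subject to} \quad \beta_1^2 + \beta_2^2 \le s

% Lasso: smallest RSS over the diamond-shaped (L1) constraint region
\min_{\beta_1,\beta_2}\ \mathrm{RSS}(\beta_1,\beta_2)
\quad \text{subject to} \quad |\beta_1| + |\beta_2| \le s
```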

Conclusions

The key difference between these techniques is that lasso shrinks the coefficients of the less important features all the way to zero, removing some features altogether. It therefore works well for feature selection when we have a huge number of features.

This highlights the main disadvantage of ridge regression: model interpretability. Ridge shrinks the coefficients of the least important predictors very close to zero, but it never makes them exactly zero, so the final model always includes all predictors. With the lasso, however, the L1 penalty forces some of the coefficient estimates to be exactly zero when the tuning parameter λ is sufficiently large. The lasso therefore also performs variable selection and is said to yield sparse models.
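
To see the sparsity difference concretely, here is a small sketch (assuming scikit-learn is available; the synthetic dataset, alpha values, and variable names are illustrative choices, not from the original):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 100 samples, 10 features, only the first 3 truly matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_beta = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
y = X @ true_beta + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Ridge shrinks the unimportant coefficients toward zero but keeps them nonzero;
# Lasso sets many of them exactly to zero, performing variable selection.
print("ridge coefs:", np.round(ridge.coef_, 3))
print("lasso coefs:", np.round(lasso.coef_, 3))
print("lasso coefficients set exactly to zero:", int(np.sum(lasso.coef_ == 0.0)))
```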
