Regularization


Any term added to the loss function that penalizes the complexity of the model, rather than its fit to the training data.

Overview

Avoiding overfitting is one of the major aspects of training a machine learning model. Overfitting happens when the model adjusts to the noise in the training data, so the resulting model won't generalize well to new instances. Regularization discourages learning a more complex or flexible model, so as to avoid the risk of overfitting.

For further info about overfitting and how to avoid it, check out the Bias-Variance Tradeoff section.

How does overfitting happen?

Let's imagine we want to build a simple model:

  • Given our training set $X$, we try applying a linear regression $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$, where each $\beta_i$ represents the coefficient estimate for the corresponding variable $x_i$.

  • We measure the fit with a loss function known as the residual sum of squares, or RSS. The coefficients are chosen such that they minimize this loss function:

$$RSS = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2$$

  • However, this model might be too simple. Then, we could think about adding some new features: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + ... + \beta_p x_p$. Now the model has become more complex, but it might tend to overfit (see the sketch after this list).

  • If there is noise in the training data, then the estimated coefficients won't generalize well to future data. This is where regularization comes in: it shrinks, or regularizes, these learned estimates towards zero.
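Below is a minimal sketch of this overfitting behaviour, assuming NumPy and scikit-learn are available; the data, feature degrees, and train/test split are purely illustrative. Adding many polynomial features drives the training error down while the test error grows.

```python
# Minimal overfitting sketch (assumes NumPy and scikit-learn are installed).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = 0.5 * X[:, 0] + rng.normal(scale=1.0, size=30)  # noisy linear signal

X_train, y_train = X[:20], y[:20]
X_test, y_test = X[20:], y[20:]

for degree in (1, 15):  # a simple model vs. one with many added features
    poly = PolynomialFeatures(degree)
    model = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    train_mse = mean_squared_error(y_train, model.predict(poly.transform(X_train)))
    test_mse = mean_squared_error(y_test, model.predict(poly.transform(X_test)))
    print(f"degree={degree}: train MSE={train_mse:.2f}, test MSE={test_mse:.2f}")
```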

Techniques

Lasso (L1)

Lasso penalizes the loss function by adding a shrinkage quantity, so the coefficients are estimated by minimizing the whole penalized quantity. The penalty applies the absolute value $|\beta_j|$ to every coefficient. This is also called L1 regularization, or the L1 norm.

$$RSS + \lambda \sum_{j=1}^{p} |\beta_j| = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

Here, $\lambda$ is the regularization parameter that decides how much we want to penalize the flexibility of our model. An increase in the flexibility of a model shows up as an increase in its coefficients, so if we want to minimize the above function, these coefficients need to stay small.

When $\lambda = 0$, the penalty term has no effect, and the estimates are equal to the ordinary least squares ones. However, as $\lambda \to \infty$, the impact of the shrinkage penalty grows.
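As a rough illustration, scikit-learn's Lasso (assumed available) exposes this penalty through its alpha parameter, which plays the role of $\lambda$; the synthetic dataset below is purely illustrative.

```python
# Minimal lasso sketch (assumes scikit-learn is installed).
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

print("least squares:", LinearRegression().fit(X, y).coef_.round(2))
for alpha in (0.1, 1.0, 10.0):
    # Larger alpha shrinks the estimates harder; some become exactly zero.
    print(f"lasso alpha={alpha}:", Lasso(alpha=alpha).fit(X, y).coef_.round(2))
```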

Ridge (L2)

Like lasso, ridge penalizes the loss function by adding a shrinkage quantity, and it is also parametrized by the regularization parameter $\lambda$. However, ridge applies the square $\beta_j^2$ to every coefficient. It's also called L2 regularization, or the L2 norm.

$$RSS + \lambda \sum_{j=1}^{p} \beta_j^2 = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$
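Analogously, a minimal sketch with scikit-learn's Ridge (assumed available), where alpha again stands in for $\lambda$ and the data is synthetic.

```python
# Minimal ridge sketch (assumes scikit-learn is installed).
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

for alpha in (0.1, 10.0, 1000.0):
    # Coefficients shrink towards zero as alpha grows, but stay non-zero.
    print(f"ridge alpha={alpha}:", Ridge(alpha=alpha).fit(X, y).coef_.round(2))
```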

Others

  • Elastic net (L1 + L2)

  • Max norm regularization

  • Dropout

  • Fancier: batch normalization, stochastic depth

L1 vs L2 norm

The image below shows differences between L1 and L2:

  • Points on the same ellipse share the same value of RSS.

  • Points in the green constraint region share the same value of the regularization penalty.

  • So, the regression coefficient estimates are given by the first point at which an ellipse contacts the constraint region.

What does it mean?

  • Given L2 has a circular constraint region (from the $\beta_j^2$ terms), this intersection will not generally occur on an axis.

    • Ridge (L2) coefficient estimates will therefore typically all be non-zero.

  • The L1 constraint region has corners on the axes, so the RSS ellipses will often contact it at an axis.

    • When this occurs, one of the coefficients will equal zero.

  • Ridge (L2) will shrink coefficients very close to zero, but doesn't ignore any feature.

  • Lasso (L1) can drive some coefficients exactly to zero, so it also performs variable selection (see the sketch below).
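To make that sparsity difference concrete, here is a small sketch (scikit-learn assumed available; alpha values and data are illustrative) that counts how many coefficients each method drives exactly to zero.

```python
# Minimal lasso vs. ridge sparsity sketch (assumes NumPy and scikit-learn).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lasso_coef = Lasso(alpha=1.0).fit(X, y).coef_
ridge_coef = Ridge(alpha=1.0).fit(X, y).coef_

print("lasso coefficients exactly zero:", int(np.sum(lasso_coef == 0)))  # typically several
print("ridge coefficients exactly zero:", int(np.sum(ridge_coef == 0)))  # typically none
```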

Other alternatives

We can use other techniques to reduce variance (which is also the goal of L1 and L2 regularization):

  • Data augmentation

  • Adding noise

When dealing with neural networks, we can use these techniques as well:

  • Early stopping

  • Dropout
