Regularization
Any technique that penalizes the complexity of the model, rather than rewarding a closer fit to the training data.
Avoiding overfitting is one of the major aspects of training a machine learning model. Overfitting happens when the model adjusts to the noise in the training data; the resulting model fits that noise rather than the underlying pattern and fails to generalize to new instances. Regularization discourages learning an overly complex or flexible model, so as to avoid this risk.
For further info about overfitting and how to avoid it, check out the Bias-Variance Tradeoff section.
Let's imagine we want to build a simple model:
Given our training set $\{(x_i, y_i)\}_{i=1}^{n}$, we try applying a linear regression $y \approx \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$, where each $\beta_j$ represents the coefficient estimate for a different variable.
We measure the quality of the fit with a loss metric known as the residual sum of squares, or RSS. The coefficients are chosen such that they minimize this loss function.
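For $n$ training points and $p$ predictors, RSS takes the usual form:

$$\mathrm{RSS} = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2$$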
However, this model might be too simple. We could then think about adding new features, for example higher-order terms such as $x^2$ or $x^3$. Now the model has become more complex, but it might tend to overfit!
If there is noise in the training data, the estimated coefficients won't generalize well to future data. This is where regularization comes in: it shrinks, or regularizes, the learned estimates towards zero.
It penalizes the loss function by adding a shrinkage quantity, so that the coefficients are estimated by minimizing the whole penalized quantity. Lasso applies the modulus (absolute value) to every coefficient; this is also called L1 regularization or the L1 norm.
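Concretely, lasso estimates the coefficients by minimizing:

$$\mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|$$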
Here, λ is the regularization parameter that decides how much we want to penalize the flexibility of our model. An increase in the flexibility of a model is reflected in an increase in the magnitude of its coefficients, so if we want to minimize the above function, these coefficients need to be small.
When λ = 0, the penalty term has no effect, and the estimates are equal to the ordinary least squares ones. However, as λ → ∞, the impact of the shrinkage penalty grows.
Like lasso, ridge penalizes the loss function by adding a shrinkage quantity, and it is also parametrized by the regularization parameter λ. However, ridge applies the square to every coefficient. This is also called L2 regularization or the L2 norm.
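The corresponding quantity minimized by ridge is:

$$\mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2$$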
Other regularization techniques include:
Elastic net (L1 + L2), shown in the sketch after this list
Max norm regularization
Dropout
Fancier: batch normalization, stochastic depth
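As a minimal sketch of how these penalties are used in practice, here is how lasso, ridge, and elastic net could be fit with scikit-learn. The alpha argument plays the role of λ, and the toy data is made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

# Toy data: 100 samples, 10 features, only the first 3 actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(scale=0.1, size=100)

# alpha plays the role of the regularization parameter λ.
lasso = Lasso(alpha=0.1).fit(X, y)                    # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)                    # L2 penalty
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2

print(lasso.coef_)  # several coefficients are exactly zero
print(ridge.coef_)  # all coefficients are shrunk but non-zero
print(enet.coef_)   # behaviour in between
```

Note how lasso zeroes out the irrelevant coefficients while ridge only shrinks them; this is exactly the geometric difference discussed next.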
The image below shows the differences between L1 and L2:
Points on the same ellipse share the same value of RSS.
Points in the green area share the same value of the regularization term (the constraint region).
So, the regression coefficient estimates are given by the first point at which an ellipse contacts the constraint region.
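Formally, the constraint region comes from writing the penalized problems in an equivalent constrained form, for some budget $s$ that corresponds to a given λ:

$$\text{lasso:}\;\; \min_{\beta}\,\mathrm{RSS} \;\;\text{s.t.}\;\; \sum_{j=1}^{p} |\beta_j| \le s \qquad\qquad \text{ridge:}\;\; \min_{\beta}\,\mathrm{RSS} \;\;\text{s.t.}\;\; \sum_{j=1}^{p} \beta_j^2 \le s$$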
What does it mean?
Given that L2 has a circular constraint ($\beta_1^2 + \beta_2^2 \le s$ in two dimensions), this intersection will not generally occur on an axis.
So the L2 (ridge) coefficient estimates will typically all be non-zero.
The L1 constraint ($|\beta_1| + |\beta_2| \le s$) has corners on the axes, so the RSS ellipse will often first touch the constraint region at an axis.
When this occurs, one of the coefficients will equal zero.
Ridge (L2) will shrink coefficients very close to zero, but doesn't ignore any feature.
Lasso (L1) can set coefficients exactly to zero, so it also performs variable selection.
We can use other techniques to reduce variance (which is the goal of L1 and L2 regularization), for example:
Data augmentation
Adding noise
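As a minimal sketch of the noise idea, assuming the training inputs are a NumPy array, we can perturb them with small Gaussian noise so that each training pass sees a slightly different dataset (the helper name and noise scale below are arbitrary choices):

```python
import numpy as np

def add_input_noise(X, scale=0.1, rng=None):
    """Return a copy of X with zero-mean Gaussian noise added to every feature."""
    rng = rng if rng is not None else np.random.default_rng()
    return X + rng.normal(scale=scale, size=X.shape)

# Each pass over the data sees a slightly perturbed version of the inputs,
# which acts as a regularizer (and is a crude form of data augmentation).
X_train = np.random.default_rng(0).normal(size=(100, 10))
X_noisy = add_input_noise(X_train, scale=0.05)
```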
When dealing with neural networks, we can use these techniques as well (a short sketch of both follows the list):
Early stopping
Dropout
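A minimal sketch of both ideas using Keras, assuming a simple feed-forward binary classifier; the layer sizes, dropout rate, and patience value are arbitrary:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Dropout randomly zeroes a fraction of activations during training,
# which discourages units from co-adapting.
model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Early stopping halts training once the validation loss stops improving
# and restores the best weights seen so far.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                           restore_best_weights=True)

# model.fit(X_train, y_train, validation_split=0.2,
#           epochs=100, callbacks=[early_stop])
```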