Regularization

Sources:

  • Regularization in Machine Learning (Prashant Gupta)
  • What is regularization in machine learning? (Quora)
  • Differences between L1 and L2 as Loss Function and Regularization

Related sources:

  • Loss functions
  • Ridge vs Lasso
  • Bias-Variance Tradeoff

Regularization is any technique that penalizes the complexity of the model rather than how closely it fits the training data.

Overview

Avoiding overfitting is one of the major concerns when training a machine learning model. Overfitting happens when the model fits the noise in the training data, and the resulting model then fails to generalize to new instances. Regularization discourages learning an overly complex or flexible model, so as to reduce the risk of overfitting.

For further info about overfitting and how to avoid it, check out the Bias-Variance Tradeoff section.

How does overfitting happen?

Let's imagine we want to build a simple model:

  • Given our training set $X$, we try fitting a linear regression $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$, where each $\beta_i$ represents the coefficient estimate for the variable $x_i$.

  • We measure the error with a loss metric known as the residual sum of squares, or RSS. The coefficients are chosen such that they minimize this loss function (a minimal code sketch follows this list).

$$RSS = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2$$
  • However, this model might be too simple. Then, we could think about adding some new features: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_p x_p$. Now the model has become more complex, but it might tend to overfit!

  • If there is noise in the training data, the estimated coefficients won't generalize well to future data. This is where regularization comes in: it shrinks, or regularizes, these learned estimates towards zero.
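
To make this setup concrete, here is a minimal NumPy sketch (the toy dataset is an assumption, made up for illustration) that fits the coefficients by ordinary least squares and evaluates the RSS being minimized:

```python
import numpy as np

# Toy dataset (assumed, for illustration only): n samples, p features
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
true_beta = np.array([2.0, -1.0, 0.5])
y = 1.0 + X @ true_beta + rng.normal(scale=0.3, size=n)

# Ordinary least squares: prepend an intercept column and solve for the betas
X_design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Residual sum of squares for the fitted coefficients
rss = np.sum((y - X_design @ beta_hat) ** 2)
print(beta_hat, rss)
```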

Techniques

Lasso (L1)

It penalizes the loss function by adding a shrinkage term, so the coefficients are estimated by minimizing the whole penalized quantity. Lasso applies the absolute value (modulus) $|\beta_j|$ to every coefficient. This is also called L1 regularization, or the L1 norm.

$$RSS + \lambda \sum_{j=1}^{p} |\beta_j| = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

Here, λ is the regularization parameter: it decides how much we want to penalize the flexibility of our model. Extra flexibility shows up as larger coefficients, so if we want to minimize the function above, these coefficients need to stay small.

When λ = 0, the penalty term has no effect, and the estimates are the ordinary least-squares ones. However, as λ → ∞, the impact of the shrinkage penalty grows and the coefficients are pushed towards zero.
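
To get a feel for the λ knob, here is a small sketch with scikit-learn's Lasso, reusing the toy X and y from the sketch above. Note that scikit-learn calls the parameter alpha and scales the RSS term by 1/(2n), so alpha is not numerically identical to the λ in the formula:

```python
from sklearn.linear_model import Lasso

# alpha is scikit-learn's name for the regularization strength
for alpha in [0.001, 0.1, 1.0]:
    lasso = Lasso(alpha=alpha).fit(X, y)   # X, y from the sketch above
    print(alpha, lasso.coef_)              # larger alpha -> smaller coefficients, some may be exactly zero
```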

Ridge (L2)

As lasso does, ridge penalizes the loss function by adding a shrinkage term, and it is likewise parametrized by the regularization parameter λ. However, ridge applies the square $\beta_j^2$ to every coefficient. This is also called L2 regularization, or the L2 norm.

$$RSS + \lambda \sum_{j=1}^{p} \beta_j^2 = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$
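
The two penalized objectives can also be written out directly from the formulas above; a minimal sketch, reusing X_design, y and beta_hat from the earlier NumPy example (the value of lam is an arbitrary assumption):

```python
lam = 0.5  # regularization strength lambda (arbitrary value for the sketch)

def lasso_objective(beta, X, y, lam):
    # RSS + lambda * sum(|beta_j|); the intercept beta_0 is conventionally not penalized
    rss = np.sum((y - X @ beta) ** 2)
    return rss + lam * np.sum(np.abs(beta[1:]))

def ridge_objective(beta, X, y, lam):
    # RSS + lambda * sum(beta_j^2)
    rss = np.sum((y - X @ beta) ** 2)
    return rss + lam * np.sum(beta[1:] ** 2)

print(lasso_objective(beta_hat, X_design, y, lam),
      ridge_objective(beta_hat, X_design, y, lam))
```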

Others

  • Elastic net (L1 + L2), which combines both penalties (see the sketch after this list)

  • Max norm regularization

  • Dropout

  • Fancier: batch normalization, stochastic depth
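
As a quick illustration of elastic net, here is a one-line sketch with scikit-learn, again reusing the toy X and y from above (the alpha and l1_ratio values are arbitrary assumptions; l1_ratio balances the L1 and L2 penalties):

```python
from sklearn.linear_model import ElasticNet

# l1_ratio=0.5 mixes the L1 and L2 penalties evenly; alpha sets the overall strength
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(enet.coef_)
```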

L1 vs L2 norm

The image below shows the differences between L1 and L2:

  • Points on a given ellipse share the same value of RSS.

  • Points in the green constraint region share the same value of the regularization term.

  • So, the regression coefficient estimates are given by the first point at which an ellipse contacts the constraint region.

What does it mean?

  • Since the L2 penalty $\beta_j^2$ gives a circular constraint region, this intersection will not generally occur on an axis.

    • So the L2 (ridge) coefficient estimates will typically all be non-zero.

  • The L1 constraint region has corners on the axes, so the RSS ellipses often first touch it at an axis.

    • When this occurs, one of the coefficients will equal zero.

  • Ridge (L2) will shrink coefficients very close to zero, but doesn't ignore any feature.

  • Lasso (L1), on the other hand, can set coefficients exactly to zero, so it also performs variable selection (see the demo after this list).
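
A quick way to see the variable-selection effect is to fit lasso and ridge on a toy dataset where only a couple of features actually matter (the dataset below is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)  # only the first 2 features matter

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("lasso:", np.round(lasso.coef_, 3))  # irrelevant coefficients driven exactly to zero
print("ridge:", np.round(ridge.coef_, 3))  # irrelevant coefficients small, but generally non-zero
```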

Other alternatives

We can use other techniques to reduce variance (which is the goal of L1 and L2 regularization):

  • Data augmentation

  • Adding noise (see the sketch after this list)
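
As a sketch of the adding-noise idea: one simple option is to perturb the training inputs with small Gaussian noise on each pass, so the model cannot latch onto their exact values (the helper and the noise scale below are assumptions, not a prescribed recipe):

```python
import numpy as np

def add_input_noise(X, scale=0.05, rng=None):
    """Return a copy of X with small Gaussian noise added, acting as a simple regularizer."""
    rng = rng or np.random.default_rng()
    return X + rng.normal(scale=scale, size=X.shape)

# Usage (hypothetical): train each epoch on a freshly noised copy of the features
# X_noisy = add_input_noise(X_train, scale=0.05)
```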

When dealing with neural networks, we can also use these techniques:

  • Early stopping

  • Dropout
