Loss functions

Sources:

  • Understanding Categorical Cross-Entropy Loss, Binary Cross-Entropy Loss, Softmax Loss, Logistic Loss, Focal Loss and all those confusing names (Raúl Gómez blog)

  • What is loss function? (Christophe Pere)

  • Machine Learning Glossary

What are loss functions?

A way to measure whether the algorithm is doing a good job.

This is necessary to determine the distance between the algorithm’s current output and its expected output. The measurement is used as a feedback signal to adjust how the algorithm works. This adjustment step is what we call learning.

François Chollet, Deep learning with Python (2017), Manning, chapter 1 p.6

Loss functions can be categorized into two groups: one for classification (discrete values: 0, 1, 2, …) and the other for regression (continuous values).

Commonly used loss functions:

  • For classification:

    • Cross-entropy

    • Log-Loss

    • Exponential Loss

    • Hinge Loss

    • Kullback-Leibler Divergence Loss

  • For regression:

    • Mean Square Error Loss (L2)

    • Mean Absolute Error Loss (L1)

    • Huber Loss

Cross-entropy

Cross-entropy is a measure of the difference between two probability distributions for a given random variable or set of events.

About entropy and Information Theory

Information quantifies the number of bits required to encode and transmit an event. Lower probability events have more information, higher probability events have less information.

In information theory, we like to describe the “surprise” of an event. An event is more surprising the less likely it is, meaning it contains more information.

  • Low Probability Event (surprising): More information.

  • Higher Probability Event (unsurprising): Less information.

Information h(x) can be calculated for an event x, given the probability of the event P(x), as follows:

h(x) = -\log(P(x))

Entropy is the number of bits required to transmit a randomly selected event from a probability distribution.

A skewed distribution has low entropy, whereas a distribution where events have equal probability has a larger entropy.
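
For reference, the standard information-theoretic definition of entropy (stated in words above but not written out) is:

H(X) = -\sum_{x} P(x) \log(P(x))

For example, a fair coin has an entropy of 1 bit (using log base 2), while a coin that lands heads 99% of the time has an entropy of about 0.08 bits, matching the intuition that skewed distributions have low entropy.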

So, what's cross-entropy?

Cross-entropy builds upon the idea of entropy from information theory and calculates the number of bits required to represent or transmit an average event from one distribution compared to another distribution.

If we consider a target distribution P and an approximation of the target distribution Q, then the cross-entropy of Q from P is the number of additional bits to represent an event using Q instead of P.
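
Written out explicitly (again the standard definition, which the text above states only in words), the cross-entropy of the approximation Q with respect to the target P is:

H(P, Q) = -\sum_{x} P(x) \log(Q(x))

It equals the entropy H(P) plus the extra bits incurred by encoding with Q instead of P, so it is minimized, and equal to H(P), when Q = P.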

The result is a value in [0, ∞):

  • 0.00: Perfect probabilities

  • < 0.02: Great probabilities

  • < 0.20: Great

  • > 0.30: Not great

  • > 2.00: Something is not working
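
As a minimal illustration, here is a numpy sketch of this computation (the function name and numbers are made up for the example; the natural logarithm is used, as in most ML libraries):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy H(P, Q) between two discrete distributions (natural log)."""
    p = np.asarray(p, dtype=float)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)  # avoid log(0)
    return -np.sum(p * np.log(q))

# Target distribution (a one-hot label) and two candidate predictions
p = [0.0, 1.0, 0.0]
good_q = [0.05, 0.90, 0.05]  # confident and correct -> low cross-entropy
bad_q = [0.60, 0.20, 0.20]   # confident and wrong   -> high cross-entropy

print(cross_entropy(p, good_q))  # ~0.105
print(cross_entropy(p, bad_q))   # ~1.609
```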

In binary classification, where the number of classes M equals 2, cross-entropy can be calculated as:

-(y \log(p) + (1 - y) \log(1 - p))

If M > 2 (i.e. multiclass classification), we calculate a separate loss for each class label per observation and sum the result:

-\sum_{c=1}^M y_{o,c} \log(p_{o,c})

Binary cross-entropy

Also called Negative Log-Likelihood, it only applies to binary classification problems.

Loss = \frac{1}{n} \sum_{i=1}^n -y^{i} \log(\sigma(z)) - (1 - y^{i}) \log(1 - \sigma(z))

For a given sample, if the ground truth (GT) is 0, the left term of the formula contributes nothing; if the GT class is 1, the right term contributes nothing.
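
A tiny numpy sketch of that switching behavior (function name and numbers are illustrative, not from the original source):

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy; p_pred are predicted probabilities of class 1."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    # For each sample only one of the two terms is non-zero, depending on y_true.
    return np.mean(-y_true * np.log(p_pred) - (1 - y_true) * np.log(1 - p_pred))

print(binary_cross_entropy([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3]))  # ~0.20
```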

Multi-category cross-entropy loss

  • Computes the cross-entropy for:

    • Multiple training examples (n) and

    • Multiple classes (K)

  • Expects one-hot encoded class labels

    • This means each training sample has exactly one of its K label entries set to 1.

    • It implies the formula below only sums a value for one class per sample:

L = \frac{1}{n} \sum_{i=1}^n \sum_{k=1}^K -y_k^{[i]} \log(a_k^{[i]})
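
A short numpy sketch of the batched computation above (shapes, names, and numbers are illustrative assumptions):

```python
import numpy as np

def categorical_cross_entropy(y_onehot, a_probs, eps=1e-12):
    """Mean multi-class cross-entropy over n samples and K classes.

    y_onehot: (n, K) one-hot ground-truth labels
    a_probs:  (n, K) predicted class probabilities (e.g. softmax outputs)
    """
    a_probs = np.clip(a_probs, eps, 1.0)
    # The one-hot labels zero out every class except the true one for each sample.
    return np.mean(-np.sum(y_onehot * np.log(a_probs), axis=1))

y = np.array([[0, 1, 0],
              [1, 0, 0]])
a = np.array([[0.1, 0.8, 0.1],
              [0.7, 0.2, 0.1]])
print(categorical_cross_entropy(y, a))  # ~0.29
```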

Log-Loss

The Log-Loss is the binary cross-entropy up to a factor 1 / log(2). This loss function is convex and grows linearly for negative values, which makes it less sensitive to outliers. The most common algorithm that uses the Log-Loss is logistic regression.

Exponential Loss

The exponential loss is convex and grows exponentially for negative values, which makes it more sensitive to outliers. The exponential loss is used in the AdaBoost algorithm.

exp\_loss = \frac{1}{m} \sum \exp(-y \cdot f(x))

Hinge Loss

It's a loss function used for “maximum-margin” classification, most notably for support vector machines (SVM).

Hinge = \max(0, 1 - y \cdot f(x))
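
A brief numpy sketch of the hinge loss, assuming labels in {-1, +1} (names and numbers are illustrative):

```python
import numpy as np

def hinge_loss(y_true, scores):
    """Mean hinge loss; y_true in {-1, +1}, scores are raw classifier outputs f(x)."""
    margins = np.asarray(y_true, dtype=float) * np.asarray(scores, dtype=float)
    # Zero penalty once a sample is on the correct side with a margin of at least 1.
    return np.mean(np.maximum(0.0, 1.0 - margins))

print(hinge_loss([1, -1, 1], [2.0, -0.5, 0.3]))  # ~0.4
```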

MSE Loss (L2 regularization)

The squared difference between the current output y_pred and the expected output y_true, averaged over the number of outputs.

It's very sensitive to outliers because squaring the difference gives much more weight to large errors.

The loss is a quadratic curve, which is especially convenient for gradient descent algorithms: the gradient becomes smaller as we approach the minimum. MSE is useful when outliers are important for the problem; if outliers are just noise, bad data, or bad measurements, you should use the MAE loss instead.
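
For reference, the standard MSE formula (not written out explicitly in the original text):

MSE = \frac{1}{n} \sum_{i=1}^n (y_{true}^{i} - y_{pred}^{i})^2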

MAE Loss (L1 regularization)

Unlike the previous loss function, the square is replaced by an absolute value. This change has a big impact on the behavior of the loss function, which now has a “V” shape.

The MAE function is more robust to outliers because it is based on the absolute value rather than the square used by MSE. It behaves like a median: outliers can't really affect it.
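
Likewise, the standard MAE formula:

MAE = \frac{1}{n} \sum_{i=1}^n |y_{true}^{i} - y_{pred}^{i}|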

Huber Loss

It is a combination of MAE and MSE (L1-L2), but it depends on an additional parameter called delta that influences the shape of the loss function. This hyperparameter needs to be fine-tuned for the problem at hand. When the errors are large (far from the minimum), the function behaves like MAE; closer to the minimum, it behaves like MSE. So the delta parameter controls your sensitivity to outliers.
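
A small numpy sketch comparing the three regression losses on data with one outlier (the delta value and numbers are illustrative assumptions):

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    err = np.abs(y_true - y_pred)
    quadratic = 0.5 * err ** 2             # MSE-like below delta
    linear = delta * (err - 0.5 * delta)   # MAE-like above delta
    return np.mean(np.where(err <= delta, quadratic, linear))

y_true = np.array([1.0, 2.0, 3.0, 100.0])  # last value is an outlier
y_pred = np.array([1.1, 1.9, 3.2, 4.0])

print(mse(y_true, y_pred))    # ~2304  -> dominated by the outlier
print(mae(y_true, y_pred))    # ~24.1  -> much less affected
print(huber(y_true, y_pred))  # ~23.9  -> behaves like MAE for large errors
```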
