Loss functions
What are loss functions?
A way to measure whether the algorithm is doing a good job.
This is necessary to determine the distance between the algorithm’s current output and its expected output. The measurement is used as a feedback signal to adjust how the algorithm works. This adjustment step is what we call learning.
François Chollet, Deep learning with Python (2017), Manning, chapter 1 p.6
Loss functions fall into two groups: one for classification (discrete values: 0, 1, 2, …) and one for regression (continuous values).
Commonly used loss functions:
For classification:
Cross-entropy
Log-Loss
Exponential Loss
Hinge Loss
Kullback-Leibler Divergence Loss
For regression:
Mean Square Error Loss (L2)
Mean Absolute Error Loss (L1)
Huber Loss
Cross-entropy
Cross-entropy is a measure of the difference between two probability distributions for a given random variable or set of events.
About entropy and Information Theory
Information quantifies the number of bits required to encode and transmit an event. Lower-probability events carry more information; higher-probability events carry less.
In information theory, we like to describe the “surprise” of an event. An event is more surprising the less likely it is, meaning it contains more information.
Low Probability Event (surprising): More information.
Higher Probability Event (unsurprising): Less information.
The information h(x) of an event x, given the probability of the event P(x), is calculated as follows: $h(x) = -\log_2 P(x)$
Entropy is the average number of bits required to transmit a randomly selected event from a probability distribution: $H(X) = -\sum_x P(x) \log_2 P(x)$
A skewed distribution has low entropy, whereas a distribution where events have equal probability has a larger entropy.
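As a small illustration of information and entropy, the sketch below computes both in bits using only the Python standard library. The function names `information` and `entropy` and the two example distributions are chosen here for illustration and do not come from the original text.

```python
import math

def information(p):
    """Information (in bits) of an event with probability p."""
    return -math.log2(p)

def entropy(dist):
    """Entropy (in bits) of a discrete distribution given as a list of probabilities."""
    # Events with zero probability contribute nothing to the sum.
    return sum(p * information(p) for p in dist if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # all events equally likely
skewed = [0.97, 0.01, 0.01, 0.01]    # one event dominates

print(information(0.5))   # a fair coin flip carries 1 bit
print(entropy(uniform))   # 2 bits for four equally likely events
print(entropy(skewed))    # lower than the uniform case
```

The skewed distribution needs fewer bits on average because its most frequent event is almost never surprising.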
So, what's cross-entropy?
Cross-entropy builds upon the idea of entropy from information theory and calculates the number of bits required to represent or transmit an average event from one distribution compared to another distribution.
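If we consider a target distribution P and an approximation Q of that distribution, then the cross-entropy of Q from P is the average total number of bits needed to represent an event from P using a code optimized for Q: $H(P, Q) = -\sum_x P(x) \log_2 Q(x)$. The number of additional bits compared with using P itself is the Kullback-Leibler divergence, so $H(P, Q) = H(P) + D_{KL}(P \| Q)$.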
The result is a non-negative value. As a rough guide:
0.00: Perfect probabilities
< 0.02: Great probabilities
< 0.20: Fine
> 0.30: Not great
> 2.00: Something is not working
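To make the cross-entropy between a target distribution P and an approximation Q concrete, here is a minimal sketch assuming discrete distributions given as lists of probabilities. It also checks the standard identity that cross-entropy equals the entropy of P plus the Kullback-Leibler divergence from P to Q. The function names and the values of `P` and `Q` are illustrative assumptions.

```python
import math

def entropy(p):
    """Entropy of p in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average bits needed to encode events drawn from p with a code built for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """Extra bits paid for using q instead of p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.10, 0.40, 0.50]   # target distribution (illustrative values)
Q = [0.80, 0.15, 0.05]   # approximation (illustrative values)

print(cross_entropy(P, Q))
print(entropy(P) + kl_divergence(P, Q))   # same value up to rounding
print(cross_entropy(P, P), entropy(P))    # equal when Q matches P exactly
```

When Q matches P, the divergence is zero and the cross-entropy reduces to the entropy of P, which is its lowest possible value.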
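In binary classification, where the number of classes M equals 2, cross-entropy can be calculated as: $-(y \log(p) + (1 - y) \log(1 - p))$, where y is the true label (0 or 1) and p is the predicted probability of class 1.
If M > 2 (i.e. multiclass classification), we calculate a separate loss for each class label per observation and sum the result: $-\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})$, where $y_{o,c}$ is 1 if class c is the correct label for observation o (0 otherwise) and $p_{o,c}$ is the predicted probability that o belongs to class c.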
Binary cross-entropy
Also called negative log-likelihood (of a Bernoulli model) in this setting, it applies only to binary classification problems.
For a given sample, if the ground-truth (GT) label is 0, the left-hand term of the formula, $y \log(p)$, vanishes. If the GT label is 1, the right-hand term, $(1 - y) \log(1 - p)$, vanishes.
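The behavior of the two terms can be sketched in a few lines of Python. This is a minimal version, assuming labels in {0, 1}, predicted probabilities for class 1, the natural logarithm, and averaging over the batch. The clipping constant `eps` is an added safeguard against log(0), not part of the formula.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy (natural log) over a batch."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        # If y == 0 the first term is zero; if y == 1 the second term is zero.
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0]
good = [0.9, 0.1, 0.8, 0.2]   # confident and correct predictions
bad = [0.1, 0.9, 0.2, 0.8]    # confident and wrong predictions

print(binary_cross_entropy(y_true, good))   # small loss
print(binary_cross_entropy(y_true, bad))    # much larger loss
```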
Multi-category cross-entropy loss
Computes the cross-entropy for:
Multiple training examples (N) and
Multiple classes (K)
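Expects one-hot encoded class labels
This means each training sample has exactly one class k whose label is 1; all other labels are 0.
As a result, the formula below only adds one term per sample, the one for its true class: $L = -\sum_{n=1}^{N} \sum_{k=1}^{K} y_{n,k} \log(p_{n,k})$ (often divided by N to report the mean loss).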
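A minimal sketch of this computation for N samples and K classes follows, assuming one-hot labels and predicted probabilities given as nested lists. Dividing by N to get a mean loss is an assumption made here (a common convention). The function name and the sample values are illustrative.

```python
import math

def categorical_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean cross-entropy over N samples with K one-hot encoded classes."""
    total = 0.0
    for labels, probs in zip(y_true, p_pred):
        for y, p in zip(labels, probs):
            # y is 0 for every class except the true one,
            # so only one term per sample is non-zero.
            total += -y * math.log(max(p, eps))
    return total / len(y_true)

# N = 3 samples, K = 3 classes (illustrative values)
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
p_pred = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]

loss = categorical_cross_entropy(y_true, p_pred)
print(loss)

# Same result using only the probability assigned to each true class
print(-(math.log(0.7) + math.log(0.8) + math.log(0.6)) / 3)
```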
Log-Loss
The Log-Loss is the binary cross-entropy up to a factor of 1 / log(2). It is convex and grows linearly for negative margin values (samples on the wrong side of the decision boundary), which makes it less sensitive to outliers than losses that grow faster. Logistic regression is the most common algorithm that uses the Log-Loss.
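One common way to write the Log-Loss is as a function of the margin m = y · f(x), with labels y in {-1, +1} and a raw score f(x). That margin form is an assumption of this sketch, not something stated in the text. Under it, the loss is log(1 + exp(-m)) / log(2), and the code checks both the factor 1 / log(2) relative to the binary cross-entropy and the roughly linear growth for negative margins.

```python
import math

def log_loss(margin):
    """Log-Loss as a function of the margin m = y * f(x), with y in {-1, +1}."""
    return math.log(1 + math.exp(-margin)) / math.log(2)

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Binary cross-entropy of a positive sample (y = +1) with raw score 2.0
score = 2.0
bce = -math.log(sigmoid(score))
print(log_loss(score), bce / math.log(2))   # the two values agree

# For strongly negative margins, each extra unit of margin adds
# about 1 / log(2) to the loss: the growth is linear.
print(log_loss(-20) - log_loss(-19), 1 / math.log(2))
```

Dividing by log(2) normalizes the loss so that it equals 1 at a margin of 0.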
Exponential Loss
The exponential loss is convex and grows exponentially for negative margin values, which makes it more sensitive to outliers. It is used in the AdaBoost algorithm.
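A minimal sketch of the exponential loss, assuming the usual margin form exp(-m) with m = y · f(x) and labels y in {-1, +1}. The margin values in the loop are illustrative.

```python
import math

def exponential_loss(margin):
    """Exponential loss as a function of the margin m = y * f(x)."""
    return math.exp(-margin)

# Correctly classified samples (positive margin) cost little;
# badly misclassified samples (negative margin) dominate the total.
for m in (2, 0, -2, -5):
    print(m, exponential_loss(m))

# Each extra unit of negative margin multiplies the loss by e.
print(exponential_loss(-3) / exponential_loss(-2))
```

A single point with a margin of -5 costs over a hundred times more than a point on the decision boundary, which is why mislabeled or extreme samples can dominate training.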
Hinge Loss
It's a loss function used for “maximum-margin” classification, most notably for support vector machines (SVM).
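The standard form of the hinge loss for a single sample is max(0, 1 - y · f(x)), where y is the label in {-1, +1} and f(x) is the raw classifier score. The sketch below assumes that form; the function name and the example scores are illustrative.

```python
def hinge_loss(y, score):
    """Hinge loss for one sample; y is -1 or +1, score is the raw classifier output."""
    return max(0.0, 1.0 - y * score)

print(hinge_loss(+1, 2.0))   # correct and outside the margin: no loss
print(hinge_loss(+1, 0.5))   # correct but inside the margin: small loss
print(hinge_loss(-1, 0.5))   # wrong side of the boundary: larger loss
```

The loss is zero only when a sample is classified correctly with a margin of at least 1, which is what pushes the classifier toward a maximum-margin solution.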
MSE Loss (L2 loss)
The MSE is the sum of the squared differences between the current output y_pred and the expected output y_true, divided by the number of outputs: $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_{true,i} - y_{pred,i})^2$
It is very sensitive to outliers because squaring the difference gives large errors a disproportionate weight.
The loss is a quadratic curve, which is especially useful for gradient descent algorithms: the gradient gets smaller close to the minimum. MSE is a good choice when outliers matter for the problem. If the outliers come from noisy data or bad measurements, use the MAE loss function instead.
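The sensitivity to outliers is easy to see numerically. The sketch below is a minimal pure-Python MSE; the function name `mse` and the sample values are illustrative.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equally sized sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]             # errors: 1, 0, -2, -1
y_pred_outlier = [2.0, 5.0, 4.0, 17.0]    # last error is now -10

print(mse(y_true, y_pred))           # (1 + 0 + 4 + 1) / 4 = 1.5
print(mse(y_true, y_pred_outlier))   # (1 + 0 + 4 + 100) / 4 = 26.25
```

A single large error accounts for almost the entire loss in the second case.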
MAE Loss (L1 loss)
Unlike the previous loss function, the square is replaced by an absolute value: $\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_{true,i} - y_{pred,i}|$. This difference has a big impact on the behavior of the loss function, which has a “V” shape.
The MAE is more robust to outliers because it is based on the absolute value rather than the square used by the MSE. It behaves like a median: outliers cannot really affect it.
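The link with the median can be checked on a small example: among constant predictions, the median of the data gives a lower MAE than the mean when an outlier is present. This is a minimal sketch; the function name `mae` and the data values are illustrative.

```python
import statistics

def mae(y_true, y_pred):
    """Mean absolute error between two equally sized sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

data = [1.0, 2.0, 3.0, 4.0, 100.0]   # illustrative values with one outlier

median = statistics.median(data)   # 3.0, barely affected by the outlier
mean = statistics.mean(data)       # 22.0, pulled toward the outlier

# Predict the same constant for every point and compare the resulting MAE.
print(mae(data, [median] * len(data)))   # 101 / 5 = 20.2
print(mae(data, [mean] * len(data)))     # 156 / 5 = 31.2
```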
Huber Loss
It is a combination of MAE and MSE (L1-L2) that depends on an additional parameter called delta, which influences the shape of the loss function. Delta is a hyperparameter that needs to be tuned. When the error is large (greater than delta in absolute value, far from the minimum), the function behaves like the MAE; closer to the minimum, it behaves like the MSE. So the delta parameter sets your sensitivity to outliers.
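A minimal sketch of the Huber loss, assuming its usual piecewise definition: 0.5 · e² when |e| ≤ delta, and delta · (|e| - 0.5 · delta) otherwise, where e is the prediction error. The function names and the sample values are illustrative.

```python
def huber(error, delta=1.0):
    """Huber loss for a single error value."""
    abs_error = abs(error)
    if abs_error <= delta:
        return 0.5 * error ** 2                 # quadratic (MSE-like) near zero
    return delta * (abs_error - 0.5 * delta)    # linear (MAE-like) for large errors

def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss between two equally sized sequences."""
    return sum(huber(t - p, delta) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 17.0]   # errors: 1, 0, -2, -10 (one outlier)

print(huber_loss(y_true, y_pred, delta=1.0))   # (0.5 + 0 + 1.5 + 9.5) / 4 = 2.875
print(huber_loss(y_true, y_pred, delta=5.0))   # (0.5 + 0 + 2.0 + 37.5) / 4 = 10.0
```

A smaller delta switches to the linear regime sooner, so the outlier contributes less to the total loss.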