Loss functions


What are loss functions?

A way to measure whether the algorithm is doing a good job.

This is necessary to determine the distance between the algorithm’s current output and its expected output. The measurement is used as a feedback signal to adjust how the algorithm works. This adjustment step is what we call learning.

François Chollet, Deep Learning with Python (2017), Manning, chapter 1, p. 6

Loss functions can be categorized into two groups: one for classification (discrete values: 0, 1, 2, …) and the other for regression (continuous values).

Commonly used loss functions:

  • For classification:

    • Cross-entropy

    • Log-Loss

    • Exponential Loss

    • Hinge Loss

    • Kullback Leibler Divergence Loss

  • For regression:

    • Mean Square Error Loss (L2)

    • Mean Absolute Error Loss (L1)

    • Huber Loss

Cross-entropy

Cross-entropy is a measure of the difference between two probability distributions for a given random variable or set of events.

About entropy and Information Theory

Information quantifies the number of bits required to encode and transmit an event. Lower probability events have more information, higher probability events have less information.

In information theory, we like to describe the “surprise” of an event. An event is more surprising the less likely it is, meaning it contains more information.

  • Low Probability Event (surprising): More information.

  • Higher Probability Event (unsurprising): Less information.

Information h(x) can be calculated for an event x, given the probability of the event P(x) as follows:

h(x) = -\log(P(x))

Entropy is the number of bits required to transmit a randomly selected event from a probability distribution.

A skewed distribution has low entropy, whereas a distribution where events have equal probability has a larger entropy.
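As a quick illustration, here is a minimal NumPy sketch of both ideas, with made-up probabilities and the base-2 logarithm so the results are in bits:

```python
import numpy as np

def information(p):
    """Information (in bits) of an event with probability p: h(x) = -log2(P(x))."""
    return -np.log2(p)

def entropy(dist):
    """Entropy of a discrete distribution: the expected information over all events."""
    dist = np.asarray(dist)
    return -np.sum(dist * np.log2(dist))

# A rare event carries more information than a common one
print(information(0.1))   # ~3.32 bits
print(information(0.9))   # ~0.15 bits

# A uniform distribution has higher entropy than a skewed one
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits
print(entropy([0.9, 0.05, 0.03, 0.02]))   # ~0.62 bits
```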

So, what's cross-entropy?

Cross-entropy builds upon the idea of entropy from information theory and calculates the number of bits required to represent or transmit an average event from one distribution compared to another distribution.

If we consider a target distribution P and an approximation of the target distribution Q, then the cross-entropy of Q from P is the number of additional bits to represent an event using Q instead of P.

The result is a value in the range [0, ∞):

  • 0.00: Perfect probabilities

  • < 0.02: Great probabilities

  • < 0.20: Good

  • > 0.30: Not great

  • > 2.00: Something is not working

In binary classification, where the number of classes M equals 2, cross-entropy can be calculated as:

-(y \log(p) + (1 - y) \log(1 - p))

If M > 2 (i.e. multiclass classification), we calculate a separate loss for each class label per observation and sum the results.

-\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})
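Here is a minimal NumPy sketch of both cases, with made-up labels and predicted probabilities and the natural logarithm; the function names are just illustrative:

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """-(y*log(p) + (1-y)*log(1-p)), averaged over the samples."""
    p_pred = np.clip(p_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

def categorical_cross_entropy(y_true, p_pred, eps=1e-12):
    """-sum_c y_c * log(p_c), summed over classes and averaged over samples."""
    p_pred = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(np.sum(y_true * np.log(p_pred), axis=1))

# Binary case: predictions close to the true labels give a low loss
y = np.array([1, 0])
p = np.array([0.9, 0.2])
print(binary_cross_entropy(y, p))  # ~0.16

# Multiclass case (M = 3): one-hot targets vs predicted probabilities
y_onehot = np.array([[1, 0, 0], [0, 0, 1]])
p_probs  = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
print(categorical_cross_entropy(y_onehot, p_probs))  # ~0.43
```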

Log-Loss

The log-loss is the binary cross-entropy up to a factor of 1 / log(2). This loss function is convex and grows linearly for negative values, which makes it less sensitive to outliers. A common algorithm that uses the log-loss is logistic regression.

Logistic Regression
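To see where the constant factor comes from, here is a small sketch (with made-up labels and probabilities) comparing the same quantity computed with the natural logarithm and with the base-2 logarithm:

```python
import numpy as np

# Made-up labels and predicted probabilities
y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.1, 0.8, 0.6])

# Log-loss, as usually defined with the natural logarithm
log_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# The same quantity expressed in bits (base-2 logarithm)
bce_bits = -np.mean(y * np.log2(p) + (1 - y) * np.log2(1 - p))

# The two differ only by the constant factor 1 / log(2)
print(np.isclose(bce_bits, log_loss / np.log(2)))  # True
```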

Exponential Loss

The exponential loss is convex and grows exponentially for negative values, which makes it more sensitive to outliers. It is used in the AdaBoost algorithm.

\text{exp\_loss} = \frac{1}{m} \sum \exp(-y \cdot f(x))
Adaptive boosting
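A minimal NumPy sketch of the exponential loss, assuming labels in {-1, +1} and raw scores f(x); the values are made up:

```python
import numpy as np

def exponential_loss(y_true, f_pred):
    """exp_loss = 1/m * sum(exp(-y * f(x))), with labels in {-1, +1}."""
    return np.mean(np.exp(-y_true * f_pred))

# Made-up labels and raw model scores
y = np.array([1, -1, 1, -1])
f = np.array([2.0, -1.5, 0.3, 0.8])   # the last score is on the wrong side

print(exponential_loss(y, f))  # the misclassified point is penalized exponentially
```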

Hinge Loss

It's a loss function used for “maximum-margin” classification, most notably for support vector machines (SVM).

\text{Hinge} = \max(0, 1 - y \cdot f(x))
Support Vector Machines
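A minimal sketch of the hinge loss, again assuming labels in {-1, +1} and raw decision scores f(x):

```python
import numpy as np

def hinge_loss(y_true, f_pred):
    """max(0, 1 - y * f(x)), averaged over the samples."""
    return np.mean(np.maximum(0.0, 1.0 - y_true * f_pred))

# Made-up labels and decision scores
y = np.array([1, -1, 1])
f = np.array([2.5, -0.4, -1.0])  # correct with margin, inside the margin, wrong side

print(hinge_loss(y, f))  # only the last two samples contribute to the loss
```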

MSE Loss (L2 regularization)

The squared difference between the current output y_pred and the expected output y_true, averaged over the number of outputs.

It is very sensitive to outliers because squaring the difference gives much more weight to large errors.

The loss is a quadratic curve, which is especially useful for gradient descent algorithms: the gradient becomes smaller close to the minimum. MSE is very useful if outliers are important for the problem; if outliers are just noisy, bad data, or bad measurements, you should use the MAE loss function instead.

Regularization
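A minimal NumPy sketch of the MSE, with a made-up outlier in the predictions to show the sensitivity:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: the average of the squared differences."""
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.8, 5.1, 2.2, 1.0])   # the last prediction is an outlier

print(mse(y_true, y_pred))  # ~9.02, dominated by the single large error
```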

MAE Loss (L1 regularization)

Unlike the previous loss function, the square is replaced by an absolute value. This difference has a big impact on the behavior of the loss function, which has a “V” shape.

The MAE function is more robust to outliers because it is based on the absolute value rather than the square used by the MSE. It behaves like a median: outliers cannot really affect it.

Regularization
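The same sketch with the absolute value instead of the square; the same made-up outlier now has far less influence:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: the average of the absolute differences."""
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.8, 5.1, 2.2, 1.0])   # same outlier as in the MSE example

print(mae(y_true, y_pred))  # ~1.62, instead of ~9.02 with the MSE
```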

Huber Loss

It is a combination of MAE and MSE (L1 and L2), but it depends on an additional parameter called delta that influences the shape of the loss function. This parameter needs to be fine-tuned as a hyperparameter. When the errors are large (far from the minimum), the function behaves like the MAE; closer to the minimum, it behaves like the MSE. The delta parameter therefore controls your sensitivity to outliers.
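A minimal sketch of the Huber loss, where delta marks the switch between the two regimes; the sample values and the default delta are made up:

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    """Quadratic (MSE-like) for small errors, linear (MAE-like) beyond delta."""
    error = y_true - y_pred
    small = np.abs(error) <= delta
    squared = 0.5 * error ** 2                        # MSE-like branch near the minimum
    linear = delta * (np.abs(error) - 0.5 * delta)    # MAE-like branch for large errors
    return np.mean(np.where(small, squared, linear))

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.8, 5.1, 2.2, 1.0])

print(huber(y_true, y_pred, delta=1.0))  # the outlier contributes linearly, not quadratically
```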
