Motivating the cross-entropy loss
Introduction
In machine learning, the cross-entropy loss is frequently introduced without explicitly emphasizing its underlying connection to the likelihood of a categorical distribution. Understanding this link can greatly enhance one’s grasp of the loss and is the topic of this short post.
Prerequisites
Categorical distribution likelihood
Consider an experiment in which we roll a (not necessarily fair) $K$-sided die whose faces appear with probabilities $p = (p_1, \ldots, p_K)$, where $p_k \geq 0$ and $\sum_k p_k = 1$. A single roll is a draw from the categorical distribution with parameter $p$.
Performing this experiment a finite number of times $N$ (with independent rolls) produces a vector of counts $n = (n_1, \ldots, n_K)$,
where $n_k$ is the number of times face $k$ was observed, so that $\sum_k n_k = N$.
The likelihood of $p$ given these observations is (up to a multinomial coefficient that does not depend on $p$)
$$\mathcal{L}(p) = \prod_{k=1}^{K} p_k^{n_k},$$
and hence its log-likelihood is
$$\ell(p) = \sum_{k=1}^{K} n_k \log p_k.$$
Proposition. The MLE for the parameter of the categorical distribution is the empirical probability mass function
$$\hat{p}_k = \frac{n_k}{N}, \qquad k = 1, \ldots, K.$$
Proof. Consider the program
$$\max_{p} \; \sum_{k=1}^{K} n_k \log p_k \quad \text{subject to} \quad \sum_{k=1}^{K} p_k = 1,$$
with Lagrangian $\sum_k n_k \log p_k - \lambda \left( \sum_k p_k - 1 \right)$.
The Karush–Kuhn–Tucker stationarity condition is
$$\frac{n_k}{p_k} - \lambda = 0 \quad \text{for each } k,$$
so that $p_k = n_k / \lambda$. Summing over $k$ and applying the constraint yields $\lambda = N$. Since the objective is concave and the constraint is affine, this stationary point is a maximizer.
In other words, the MLE is $\hat{p}_k = n_k / N$, the empirical probability mass function. $\blacksquare$
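As a sanity check, here is a minimal numerical sketch (assuming NumPy; the 4-sided die, sample size, and probabilities are arbitrary choices for illustration) verifying that the empirical PMF attains a log-likelihood at least as large as that of any other candidate PMF:

```python
import numpy as np

rng = np.random.default_rng(0)

# Roll a (not necessarily fair) 4-sided die N times and record the counts n_k.
p_true = np.array([0.1, 0.2, 0.3, 0.4])
N = 1_000
counts = rng.multinomial(N, p_true)

def log_likelihood(q, counts):
    """Categorical log-likelihood: sum_k n_k * log(q_k)."""
    return np.sum(counts * np.log(q))

# The MLE is the empirical PMF p_hat_k = n_k / N.
p_hat = counts / N

# Any other candidate PMF should achieve a lower (or equal) log-likelihood.
q = rng.dirichlet(np.ones(4))
assert log_likelihood(p_hat, counts) >= log_likelihood(q, counts)
```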
Cross-entropy
The cross-entropy between two PMFs $p$ and $q$ on the same support $\{1, \ldots, K\}$ is
$$H(p, q) = -\sum_{k=1}^{K} p_k \log q_k.$$
The choice of logarithm base yields different units: base $2$ gives bits (shannons), base $e$ gives nats, and base $10$ gives hartleys.
When $p = \hat{p}$ is the empirical PMF from the previous section,
$$N \cdot H(\hat{p}, q) = -\sum_{k=1}^{K} n_k \log q_k,$$
which is exactly the (negation of the) log-likelihood we encountered above, evaluated at $q$.
As such, one can intuit that minimizing the cross-entropy $q \mapsto H(\hat{p}, q)$ is the same as maximizing the likelihood, and that the cross-entropy behaves like a measure of dissimilarity between the distributions $\hat{p}$ and $q$.
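The identity above is easy to verify numerically. Below is a small sketch (again assuming NumPy, with hypothetical counts) checking that $N \cdot H(\hat{p}, q)$ matches the negative log-likelihood:

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k * log(q_k), in nats."""
    return -np.sum(p * np.log(q))

counts = np.array([10, 20, 30, 40])   # hypothetical observed counts n_k
N = counts.sum()
p_hat = counts / N                    # empirical PMF
q = np.array([0.25, 0.25, 0.25, 0.25])

# N * H(p_hat, q) equals the negative log-likelihood -sum_k n_k * log(q_k).
neg_log_lik = -np.sum(counts * np.log(q))
assert np.isclose(N * cross_entropy(p_hat, q), neg_log_lik)
```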
The Kullback–Leibler (KL) divergence is another such measure:
$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_{k=1}^{K} p_k \log \frac{p_k}{q_k} = H(p, q) - H(p, p).$$
Minimizing the KL divergence over $q$ is the same as minimizing the cross-entropy, since the two differ only by the constant $H(p, p)$. However, the KL divergence satisfies some nice properties that one would expect of a measure of dissimilarity. In particular,
$$D_{\mathrm{KL}}(p \,\|\, q) \geq 0 \qquad \text{and} \qquad D_{\mathrm{KL}}(p \,\|\, p) = 0.$$
We proved the first inequality for PMFs (at least for empirical ones) above: by showing that the choice of $q = p$ maximizes $q \mapsto \sum_k n_k \log q_k$, the MLE result implies that $q = p$ minimizes $q \mapsto H(p, q)$, and hence $D_{\mathrm{KL}}(p \,\|\, q) = H(p, q) - H(p, p) \geq 0$. The second identity is immediate from the definition.
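For concreteness, a short sketch (hypothetical PMFs, NumPy assumed) checking the identity $D_{\mathrm{KL}}(p \,\|\, q) = H(p, q) - H(p, p)$ and the two properties above:

```python
import numpy as np

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    """D_KL(p || q) = sum_k p_k * log(p_k / q_k)."""
    return np.sum(p * np.log(p / q))

p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.25, 0.25])

# D_KL(p || q) = H(p, q) - H(p, p); it is nonnegative and vanishes at q = p.
assert np.isclose(kl_divergence(p, q), cross_entropy(p, q) - cross_entropy(p, p))
assert kl_divergence(p, q) >= 0.0
assert np.isclose(kl_divergence(p, p), 0.0)
```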
Cross-entropy loss
Statistical classification is the problem of mapping each input datum $x$ (e.g., an image) to a label $y$ taking values in a finite set of classes $\{1, \ldots, K\}$.
A common parametric estimator for image classification tasks such as CIFAR-10 is a neural network: a differentiable map $f_\theta$ which takes as input a datum $x$ and outputs a PMF $f_\theta(x)$ over the $K$ classes (typically by passing the final layer through a softmax).
Given a set of observations $(x_1, y_1), \ldots, (x_N, y_N)$, the cross-entropy loss is
$$\frac{1}{N} \sum_{n=1}^{N} H\bigl(e_{y_n}, f_\theta(x_n)\bigr) = -\frac{1}{N} \sum_{n=1}^{N} \log f_\theta(x_n)_{y_n},$$
where $e_{y_n}$ denotes the one-hot PMF that places all of its mass on the observed label $y_n$. By the discussion above, minimizing this loss over the parameters $\theta$ is the same as maximizing the likelihood of the observed labels under the model.
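To make the connection to the libraries listed below concrete, here is a small sketch (assuming PyTorch, with random logits and labels standing in for a real network and dataset) comparing the loss written above, computed by hand from log-softmax probabilities, against torch.nn.functional.cross_entropy, which fuses the softmax and the negative log-likelihood:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

N, K = 8, 10                        # batch size and number of classes
logits = torch.randn(N, K)          # raw (pre-softmax) network outputs
labels = torch.randint(0, K, (N,))  # observed class indices y_n

# Cross-entropy loss: average over the batch of -log f_theta(x_n)_{y_n},
# where f_theta(x_n) = softmax(logits_n) is the predicted PMF.
log_probs = F.log_softmax(logits, dim=1)
manual_loss = -log_probs[torch.arange(N), labels].mean()

# PyTorch's cross_entropy applies log_softmax and the negative log-likelihood
# in one call; the two computations should agree.
builtin_loss = F.cross_entropy(logits, labels)
assert torch.isclose(manual_loss, builtin_loss)
```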
See also
- PyTorch CrossEntropyLoss
- Keras CategoricalCrossentropy