Parsiad Azimzadeh

Motivating the cross-entropy loss

Introduction

In machine learning, the cross-entropy loss is frequently introduced without explicitly emphasizing its underlying connection to the likelihood of a categorical distribution. Understanding this link can greatly enhance one’s grasp of the loss and is the topic of this short post.

Prerequisites

Categorical distribution likelihood

Consider an experiment in which we roll a (not necessarily fair) $K$-sided die. The result of this roll is an integer between $1$ and $K$ (inclusive) corresponding to the faces of the die. Let $q(k)$ be the probability of seeing the $k$-th face. What we have described here, in general, is a categorical random variable: a random variable which takes one of a finite number of values. Repeating this experiment multiple times yields IID random variables $X_{1},\ldots,X_{N}\sim\operatorname{Categorical}(q)$.

Performing this experiment a finite number of times $N$ does not reveal $q$ exactly, but it does allow us to estimate it. One way to approximate $q(k)$ is by counting the number of times the die face $k$ was observed and normalizing the result:

\[\begin{equation}\tag{1}\label{eq:empirical_pmf} p(k)=\frac{1}{N}\sum_{n}[X_{n}=k] \end{equation}\]

where $[\cdot]$ is the Iverson bracket. Since $Y_{n}=[X_{n}=k]$ is itself a random variable (an indicator random variable), the law of large numbers tells us that $p(k)$ converges (a.s.) to $\mathbb{E}Y_{1}=\mathbb{P}(X_{1}=k)=q(k)$.
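As a concrete illustration, here is a minimal NumPy sketch that simulates the die and estimates $q$ by the empirical PMF \eqref{eq:empirical_pmf}. The face probabilities below are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 6                                          # number of die faces
q = np.array([0.1, 0.1, 0.2, 0.2, 0.1, 0.3])   # true (unfair) face probabilities
N = 10_000                                     # number of rolls

# Draw X_1, ..., X_N ~ Categorical(q); faces are encoded as 0, ..., K - 1.
X = rng.choice(K, size=N, p=q)

# Empirical PMF: p(k) = (1/N) * sum_n [X_n = k].
p = np.bincount(X, minlength=K) / N

print(p)   # close to q for large N (law of large numbers)
```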

The likelihood of $q$ is

\[\mathcal{L}(q)=\prod_{n}\prod_{k}q(k)^{[X_{n}=k]}=\prod_{k}q(k)^{\sum_{n}[X_{n}=k]}=\prod_{k}q(k)^{Np(k)}\]

and hence its log-likelihood is

\[\ell(q)=\log\mathcal{L}(q)=\sum_{k}Np(k)\log q(k)\propto\sum_{k}p(k)\log q(k).\]

Proposition. The MLE for the parameter of the categorical distribution is the empirical probability mass function \eqref{eq:empirical_pmf}.

Proof. Consider the program

\[\begin{align*} \min_{q} & -\ell(q)\\ \text{subject to} & \sum_{k}q(k)-1=0. \end{align*}\]

The Karush–Kuhn–Tucker stationarity condition is

\[-\frac{p(k)}{q(k)}+\lambda=0\text{ for }k=1,\ldots,K.\]

In other words, the MLE $\hat{q}$ is a multiple of $p$. Since the MLE needs to be a probability vector, $\hat{q} = p$.
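As a numerical sanity check, one can maximize the log-likelihood directly and compare the result with the empirical PMF. The sketch below reuses `X`, `p`, and `K` from the previous snippet and parametrizes $q$ by a softmax over unconstrained logits purely as a convenient way to stay on the probability simplex; this parametrization is an implementation choice, not part of the argument above.

```python
from scipy.optimize import minimize
from scipy.special import softmax

# Negative log-likelihood of the categorical distribution, parametrized by
# unconstrained logits z so that q = softmax(z) is always a probability vector.
def nll(z):
    counts = np.bincount(X, minlength=K)   # sum_n [X_n = k]
    return -np.sum(counts * np.log(softmax(z)))

q_hat = softmax(minimize(nll, x0=np.zeros(K)).x)

print(np.allclose(q_hat, p, atol=1e-3))   # the MLE matches the empirical PMF
```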

Cross-entropy

The cross-entropy of $q$ relative to $p$ is

\[H(p, q) = - \mathbb{E}_{X \sim p} [ \log q(X) ].\]

The choice of logarithm base only changes the units: base 2 yields bits, while the natural logarithm yields nats.

When $p$ and $q$ are probability mass functions (PMFs), the cross-entropy reduces to

\[H(p, q) = - \sum_x p(x) \log q(x)\]

which is, up to a positive multiplicative constant, exactly the negative log-likelihood we encountered above. As such, one can intuit that minimizing the cross-entropy over $q$ yields a distribution that is similar to $p$. In other words, the cross-entropy is an asymmetric measure of the dissimilarity of $q$ from $p$.
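For PMFs the cross-entropy is a one-liner. The sketch below uses natural logarithms (nats) and the `p` and `q` arrays from the earlier snippets, assuming both are strictly positive.

```python
def cross_entropy(p, q):
    """Cross-entropy H(p, q) = -sum_x p(x) log q(x), in nats."""
    return -np.sum(p * np.log(q))

print(cross_entropy(p, q))   # H(p, q)
print(cross_entropy(p, p))   # H(p, p), the entropy of p
```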

The Kullback–Leibler (KL) divergence is another such measure:

\[D_{\mathrm{KL}}(p\Vert q) =\mathbb{E}_{p}\left[\log\frac{p(X)}{q(X)}\right] =H(p,q) - H(p,p).\]

Minimizing the KL divergence over $q$ is the same as minimizing the cross-entropy (since $H(p,p)$ does not depend on $q$), but the KL divergence satisfies some nice properties that one would expect of a measure of dissimilarity. In particular,

  1. $D_{\mathrm{KL}}(p\Vert q) \geq 0$
  2. $D_{\mathrm{KL}}(p\Vert p) = 0$

We proved the first property for PMFs by showing that the choice $q = p$ minimizes the cross-entropy, so that $H(p,q) \geq H(p,p)$. The second property is immediate from the definition.
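Continuing the sketch above, the KL divergence is the gap between the two quantities just printed, and the two properties can be observed directly:

```python
def kl_divergence(p, q):
    """D_KL(p || q) = H(p, q) - H(p, p), in nats."""
    return cross_entropy(p, q) - cross_entropy(p, p)

print(kl_divergence(p, q) >= 0)   # True (property 1)
print(kl_divergence(p, p) == 0)   # True (property 2)
```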

Cross-entropy loss

Statistical classification is the problem of mapping each input datum $x \in \mathcal{X}$ to a class label $y \in \{1, \ldots, K\}$. For example, in the CIFAR-10 classification task, each $x$ is a 32×32 color image and $K = 10$, corresponding to ten distinct classes (e.g., airplanes, cats, trucks).

A common parametric estimator for image classification tasks such as CIFAR-10 is a neural network: a differentiable map $f: \mathcal{X} \rightarrow \mathbb{R}^K$. Note, in particular, that the network outputs a vector of real numbers. These are typically transformed to probabilities by way of the softmax function $\sigma$. In other words, for input $x$, $\hat{y} = \sigma(f(x))$ is a probability vector of size $K$. The $k$-th element of this vector is the “belief” that the network assigns to $x$ being a member of class $k$.
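For concreteness, here is a minimal sketch of the softmax $\sigma$ applied to a made-up vector of logits; the numbers and the $K = 3$ setting are illustrative only, with the logits standing in for the output $f(x)$ of some network.

```python
import numpy as np

def softmax(z):
    """Map a vector of K real logits to a probability vector of size K."""
    z = z - z.max()        # subtracting the max improves numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5])   # stand-in for f(x) with K = 3
y_hat = softmax(logits)               # "beliefs"; nonnegative and sums to 1
print(y_hat, y_hat.sum())
```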

Given a set of observations $\mathcal{D} = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, the cross-entropy loss for this task is

\[L(\mathcal{D}) = \frac{1}{N}\sum_{n}H(p_{n},q_{n})\]

where $q_{n}=\sigma(f(x_{n}))$ and $p_{n}$ is the probability mass function which places all of its mass on $y_{n}$. Expanding this, we obtain the perhaps more familiar form

\[L(\mathcal{D}) = -\frac{1}{N}\sum_{n}[\log\sigma(f(x_{n}))]_{y_{n}}.\]
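The sketch below computes this loss for a small made-up batch of logits and labels (the arrays and their values are illustrative only), following the last formula directly. Deep learning frameworks typically expose an equivalent loss computed directly from logits, e.g., PyTorch's `torch.nn.functional.cross_entropy`.

```python
import numpy as np

# Made-up batch: N = 4 examples, K = 3 classes.
logits = np.array([[ 2.0, -1.0,  0.5],
                   [ 0.1,  0.2,  3.0],
                   [-1.0,  0.0,  1.0],
                   [ 0.3,  0.3,  0.3]])   # f(x_n), one row per example
labels = np.array([0, 2, 1, 0])           # y_n, encoded as 0, ..., K - 1

# Row-wise log-softmax, computed stably: log sigma(f(x_n)).
z = logits - logits.max(axis=1, keepdims=True)
log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))

# L(D) = -(1/N) * sum_n [log sigma(f(x_n))]_{y_n}
loss = -log_probs[np.arange(len(labels)), labels].mean()
print(loss)
```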
