A Bayesian network is a graphical model that captures the conditional dependence of variables via a directed acyclic graph. Consider, for example, the following mixture model in which the outcome of a (possibly unfair) coin toss determines which of two normal distributions to sample from.
Latent (a.k.a. unobservable) and observable variables are distinguished in the graphical model by the use of rectangular and elliptic nodes. Only the final result, $Y$, is observable. The coin toss, for example, is not. How does one estimate the parameters $\pi, \mu_0, \sigma_0, \mu_1, \sigma_1$ of this model?
Using the mixture model as a running example, this expository post considers maximum likelihood estimators and Bayesian estimators (e.g., maximum a posteriori estimators and credible intervals) to answer the above. More often than not, Bayesian networks do not have closed forms for these estimators (or they are unwieldy) and hence iterative algorithms are needed. Specifically, this post introduces classical optimization methods and expectation maximization (EM) for computing an MLE and Markov Chain Monte Carlo (MCMC) for computing Bayesian estimators.
The first thing to note is that some of the parameters are constrained. Namely, $\pi \in (0, 1)$ and $\sigma_i \in (0, \infty)$. Some of the methods described in this post are either incompatible with or perform poorly in the presence of constraints. In light of this, given parameters $\pi, \mu_0, \sigma_0, \mu_1, \sigma_1$, let
\[\theta \equiv (\operatorname{logit} \pi, \mu_0, \log \sigma_0, \mu_1, \log \sigma_1).\]Note, in particular, that $\theta$ takes values in all of $\mathbb{R}^d$ with $d=5$.
def unconstrain(π, μ_0, σ_0, μ_1, σ_1):
    """Transform to unconstrained space."""
    logit_π = np.log(π / (1. - π))
    log_σ_0 = np.log(σ_0)
    log_σ_1 = np.log(σ_1)
    return logit_π, μ_0, log_σ_0, μ_1, log_σ_1
def constrain(θ):
    """Transform to constrained space."""
    logit_π, μ_0, log_σ_0, μ_1, log_σ_1 = θ
    # Invert the logit transform (sigmoid)
    π = 1. / (1. + np.exp(-logit_π))
    σ_0 = np.exp(log_σ_0)
    σ_1 = np.exp(log_σ_1)
    return π, μ_0, σ_0, μ_1, σ_1
Suppose we observe $N$ IID realizations of $Y$, the mixture model. Denote these realizations $\boldsymbol{y} \equiv (y_1, \ldots, y_N)$.
# True parameters
π = 0.5
μ_0 = 1.0
σ_0 = 0.7
μ_1 = 5.0
σ_1 = 1.0
N = 1_000
np.random.seed(0)
Δ = np.random.uniform(size=N) < π
Y_0 = σ_0 * np.random.randn(N) + μ_0
Y_1 = σ_1 * np.random.randn(N) + μ_1
Y = (1. - Δ) * Y_0 + Δ * Y_1
Given an arbitrary choice of parameters $\theta$ (not necessarily equal to the true parameters), the likelihood function gives us a way to quantify their goodness:
\[L(\theta) \equiv f_{\theta}(\textcolor{green}{\boldsymbol{y}}) = f_{\theta}(\textcolor{green}{y_1, \ldots, y_N}) = \prod_n f_{\theta}(\textcolor{green}{y_n})\](we have, with a slight abuse of notation, used $f_\theta$ to denote both a joint and marginal density). For the mixture model,
\[f_\theta(y) = \textcolor{blue}{\phi_0(y) \left(1 - \pi\right)} + \textcolor{red}{\phi_1(y) \pi}\]where $\phi_i$ is a normal PDF with mean $\mu_i$ and scale $\sigma_i$.
Since the likelihood is a product of many small numbers, it is prone to numerical underflow in the presence of a large sample size $N$. As such, we work with the log-likelihood $\ell(\theta) \equiv \log L(\theta)$ instead. A function to compute $-\ell(\theta) / N$ for the mixture model is provided below. This quantity, referred to hereafter as the NLL for short, is the negative of the log-likelihood normalized by the number of samples. While normalization does not materially affect the optimization, it does allow us to compare likelihoods across different sample sizes. The negation is purely a convention inherited from the optimization literature, in which problems are posed as minimizations rather than maximizations.
def nll(θ):
    """Compute the NLL for the mixture model."""
    π, μ_0, σ_0, μ_1, σ_1 = constrain(θ)
    φ_0 = stats.norm.pdf(Y, loc=μ_0, scale=σ_0)
    φ_1 = stats.norm.pdf(Y, loc=μ_1, scale=σ_1)
    return -np.log(φ_0 * (1. - π) + φ_1 * π).mean()
Indeed, if we evaluate the NLL using the same parameters that were used to generate the data, we obtain a small number. This is evidence that the parameters explain the data well:
nll(unconstrain(π, μ_0, σ_0, μ_1, σ_1))
1.8705724767802003
On the other hand, different parameters may result in a much larger NLL. This is evidence that these alternate parameters explain the data poorly:
nll(unconstrain(π, μ_0 + 1., σ_0, μ_1 + 1., σ_1))
2.540302153968016
Since the NLL is simply a function from $\mathbb{R}^{d}$ to $\mathbb{R}$, we can use any one of a number of classical optimization methods to minimize it. For example, Newton’s method applied to the NLL results in the iterates
\[\theta^{[k + 1]} = \theta^{[k]} - [\boldsymbol{H}_\ell]^{-1}(\theta^{[k]}) \nabla \ell(\theta^{[k]})\]where $\nabla \ell$ and $\boldsymbol{H}_\ell$ are the gradient and Hessian of the log-likelihood, respectively.
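For a problem this small, the Newton update can even be implemented directly. The sketch below approximates the gradient and Hessian by finite differences (reusing the nll function above and scipy.optimize.approx_fprime); it is illustrative only and is not the approach used in the remainder of this post.

# A minimal sketch of a single Newton step on the NLL using finite-difference
# derivatives. Without a line search or other safeguards, iterating this step
# may diverge from a poor initial guess.
def newton_step(θ, eps=1e-6):
    θ = np.asarray(θ, dtype=float)
    grad = optimize.approx_fprime(θ, nll, eps)
    # Approximate the Hessian column by column with forward differences of the gradient
    hess = np.stack(
        [(optimize.approx_fprime(θ + eps * e_i, nll, eps) - grad) / eps for e_i in np.eye(len(θ))],
        axis=1,
    )
    return θ - np.linalg.solve(hess, grad)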
In the presence of higher-dimensional parameter spaces, a quasi-Newton method should be preferred, as such methods do not require explicitly computing the Hessian. An example of such a method is BFGS, used below to approximate the MLE:
θ_0 = unconstrain(0.5, Y.min(), Y.std(), Y.max(), Y.std())
result = optimize.minimize(nll, θ_0, method="BFGS")
|       | π        | μ_0      | σ_0      | μ_1     | σ_1      |
|-------|----------|----------|----------|---------|----------|
| Exact | 0.5      | 1        | 0.7      | 5       | 1        |
| BFGS  | 0.523454 | 0.963808 | 0.628632 | 4.90962 | 0.995069 |
As shown in the figure below, the convergence of BFGS is somewhat undesirable:
We cover expectation maximization, which does not suffer these ailments, in the next section.
For brevity, let $Z$ denote the latent variables. As usual, we use boldface notation to refer to multiple realizations (e.g., $\boldsymbol{y}$).
EM is a fixed point iteration on the map
\[T\theta \equiv \operatorname{argsup}_{\theta^\prime} \mathbb{E}_\theta \left[ \log f_{\theta^\prime}(\boldsymbol{Y}, \boldsymbol{Z}) \mid \boldsymbol{Y} = \boldsymbol{y} \right].\]Proposition. The iterates produced by EM have nondecreasing likelihood.
Proof. First, note that, by the properties of the logarithm,
\[T\theta=\operatorname{argsup}_{\theta^{\prime}}\mathbb{E}_{\theta}\left[\log\frac{f_{\theta^{\prime}}(\boldsymbol{Y},\boldsymbol{Z})}{f_{\theta}(\boldsymbol{Y},\boldsymbol{Z})}\middle|\boldsymbol{Y}=\boldsymbol{y}\right].\]Let $\theta$ be an arbitrary parameter. Since $T\theta$ is a maximizer, it follows that
\[\begin{align*} 0 & =\mathbb{E}_{\theta}\left[\log\frac{f_{\theta}(\boldsymbol{Y},\boldsymbol{Z})}{f_{\theta}(\boldsymbol{Y},\boldsymbol{Z})}\middle|\boldsymbol{Y}=\boldsymbol{y}\right]\\ & \leq\mathbb{E}_{\theta}\left[\log\frac{f_{T\theta}(\boldsymbol{Y},\boldsymbol{Z})}{f_{\theta}(\boldsymbol{Y},\boldsymbol{Z})}\middle|\boldsymbol{Y}=\boldsymbol{y}\right]\\ & =\log\frac{f_{T\theta}(\boldsymbol{y})}{f_{\theta}(\boldsymbol{y})}+\underbrace{\mathbb{E}_{\theta}\left[\log\frac{f_{T\theta}(\boldsymbol{Z}\mid\boldsymbol{Y})}{f_{\theta}(\boldsymbol{Z}\mid\boldsymbol{Y})}\middle|\boldsymbol{Y}=\boldsymbol{y}\right]}_{-D_{\mathrm{KL}}}\\ & \leq\log\frac{f_{T\theta}(\boldsymbol{y})}{f_{\theta}(\boldsymbol{y})} \end{align*}\]where the last inequality is a consequence of the nonnegativity of the Kullback-Leibler divergence. Since $L(\theta) \equiv f_\theta(\boldsymbol{y})$ by definition, the desired result follows. $\blacksquare$
Returning to the mixture model, in order to set up the EM iteration, we first have to obtain an expression for $T\theta$. For the mixture model, it will be helpful to consider only the coin flip as latent, and hence we set $Z \equiv \Delta$. We denote realizations of $\Delta$ by the lower case letter $\delta$. The joint density for a single sample $(y,\delta)$ is
\[f_{\theta}(y,\delta)=f_{\theta}(y\mid\delta)f_{\theta}(\delta)=\left[\phi_{0}(y)\left(1-\pi\right)\right]^{1-\delta}\left[\phi_{1}(y)\pi\right]^{\delta}.\]Therefore,
\[\log f_{\theta}(y,\delta)=\left(1-\delta\right)\left(\log\phi_{0}(y)+\log(1-\pi)\right)+\delta\left(\log\phi_{1}(y)+\log\pi\right).\]Moreover, since
\[f_{\theta}(\delta\mid y)=\frac{f_{\theta}(y\mid\delta)f_{\theta}(\delta)}{f_{\theta}(y)}=\frac{\left[\phi_{0}(y)\left(1-\pi\right)\right]^{1-\delta}\left[\phi_{1}(y)\pi\right]^{\delta}}{\phi_{0}(y)\left(1-\pi\right)+\phi_{1}(y)\pi},\]it follows that
\[\mathbb{E}_{\theta}\left[\Delta\mid Y=y\right]=\frac{\phi_{1}(y)\pi}{\phi_{0}(y)\left(1-\pi\right)+\phi_{1}(y)\pi}.\]Putting this all together,
\[T\theta \equiv \operatorname{argsup}_{\theta^\prime} T(\theta, \theta^\prime) =\operatorname{argsup}_{\theta^\prime}\sum_{n}\log\phi(y_{n};\mu_{0}^{\prime},\sigma_{0}^{\prime})+\log(1-\pi^{\prime})+\mathbb{E}_{\theta}\left[\Delta\mid Y=y_{n}\right]\left(\log\frac{\phi(y_{n};\mu_{1}^{\prime},\sigma_{1}^{\prime})}{\phi(y_{n};\mu_{0}^{\prime},\sigma_{0}^{\prime})}+\log\frac{\pi^{\prime}}{1-\pi^{\prime}}\right).\]Next, let
\[\xi_{n}(\theta)\equiv\mathbb{E}_{\theta}\left[\Delta\mid Y=y_{n}\right].\]Note that
\[\begin{align*} \frac{\partial\log\phi(y;\mu,\sigma)}{\partial\mu} & =\frac{y-\mu}{\sigma^{2}}\\ \frac{\partial\log\phi(y;\mu,\sigma)}{\partial\sigma} & =\frac{1}{\sigma}\left(\frac{\left(y-\mu\right)^{2}}{\sigma^{2}}-1\right). \end{align*}\]Therefore, differentiating $T(\theta,\theta^{\prime})$ yields
\[\begin{align*} \frac{\partial T(\theta,\theta^{\prime})}{\partial\pi^{\prime}} & =\frac{1}{\pi^{\prime}\left(1-\pi^{\prime}\right)}\left[-N\pi^{\prime}+\sum_{n}\xi_{n}(\theta)\right]\\ \frac{\partial T(\theta,\theta^{\prime})}{\partial\mu_{0}^{\prime}} & =\frac{1}{\left(\sigma_{0}^{\prime}\right)^{2}}\left[-\mu_{0}^{\prime}\sum_{n}\left(1-\xi_{n}(\theta)\right)+\sum_{n}\left(1-\xi_{n}(\theta)\right)y_{n}\right]\\ \frac{\partial T(\theta,\theta^{\prime})}{\partial\sigma_{0}^{\prime}} & =\frac{1}{\left(\sigma_{0}^{\prime}\right)^{3}}\left[-\left(\sigma_{0}^{\prime}\right)^{2}\sum_{n}\left(1-\xi_{n}(\theta)\right)+\sum_{n}\left(1-\xi_{n}(\theta)\right)\left(y_{n}-\mu_{0}^{\prime}\right)^{2}\right]\\ \frac{\partial T(\theta,\theta^{\prime})}{\partial\mu_{1}^{\prime}} & =\frac{1}{\left(\sigma_{1}^{\prime}\right)^{2}}\left[-\mu_{1}^{\prime}\sum_{n}\xi_{n}(\theta)+\sum_{n}\xi_{n}(\theta)y_{n}\right]\\ \frac{\partial T(\theta,\theta^{\prime})}{\partial\sigma_{1}^{\prime}} & =\frac{1}{\left(\sigma_{1}^{\prime}\right)^{3}}\left[-\left(\sigma_{1}^{\prime}\right)^{2}\sum_{n}\xi_{n}(\theta)+\sum_{n}\xi_{n}(\theta)\left(y_{n}-\mu_{1}^{\prime}\right)^{2}\right] \end{align*}\]Letting $\theta^{[k]}$ denote the $k$-th iterate of EM and $\boldsymbol{\xi}^{[k]}\equiv(\xi_{1}(\theta^{[k]}),\ldots,\xi_{N}(\theta^{[k]}))^{\intercal}$, it follows that the equations for the $(k+1)$-th iterate are
\[\begin{align*} \pi^{[k+1]} & =\frac{1}{N}\boldsymbol{1}^{\intercal}\boldsymbol{\xi}^{[k]}\\ \mu_{0}^{[k+1]} & =\frac{\boldsymbol{y}^{\intercal}\left(\boldsymbol{1}-\boldsymbol{\xi}^{[k]}\right)}{\boldsymbol{1}^{\intercal}\left(\boldsymbol{1}-\boldsymbol{\xi}^{[k]}\right)}\\ \sigma_{0}^{[k+1]} & =\sqrt{\frac{\left(\boldsymbol{y}-\mu_{0}^{[k+1]}\right)^{\intercal}\operatorname{diag}\left(\boldsymbol{1}-\boldsymbol{\xi}^{[k]}\right)\left(\boldsymbol{y}-\mu_{0}^{[k+1]}\right)}{\boldsymbol{1}^{\intercal}\left(\boldsymbol{1}-\boldsymbol{\xi}^{[k]}\right)}}\\ \mu_{1}^{[k+1]} & =\frac{\boldsymbol{y}^{\intercal}\boldsymbol{\xi}^{[k]}}{\boldsymbol{1}^{\intercal}\boldsymbol{\xi}^{[k]}}\\ \sigma_{1}^{[k+1]} & =\sqrt{\frac{\left(\boldsymbol{y}-\mu_{1}^{[k+1]}\right)^{\intercal}\operatorname{diag}\left(\boldsymbol{\xi}^{[k]}\right)\left(\boldsymbol{y}-\mu_{1}^{[k+1]}\right)}{\boldsymbol{1}^{\intercal}\boldsymbol{\xi}^{[k]}}}. \end{align*}\]
def em_update(π, μ_0, σ_0, μ_1, σ_1):
    φ_0 = stats.norm.pdf(Y, loc=μ_0, scale=σ_0)
    φ_1 = stats.norm.pdf(Y, loc=μ_1, scale=σ_1)
    # The code below is optimized for readability, not speed
    ξ = φ_1 * π / (φ_0 * (1. - π) + φ_1 * π)
    π = ξ.mean()
    μ_0 = Y @ (1. - ξ) / (1. - ξ).sum()
    σ_0 = np.sqrt((1. - ξ) @ (Y - μ_0)**2 / (1. - ξ).sum())
    μ_1 = Y @ ξ / ξ.sum()
    σ_1 = np.sqrt(ξ @ (Y - μ_1)**2 / ξ.sum())
    return π, μ_0, σ_0, μ_1, σ_1
# Use the same initial guess as in BFGS
π_k, μ_0_k, σ_0_k, μ_1_k, σ_1_k = constrain(θ_0)
for _ in range(10):
    π_k, μ_0_k, σ_0_k, μ_1_k, σ_1_k = em_update(π_k, μ_0_k, σ_0_k, μ_1_k, σ_1_k)
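To check the monotonicity result numerically, one can also record the NLL at each EM iteration. The snippet below is a small sketch reusing nll, constrain, unconstrain, and em_update from above.

# Track the NLL across EM iterations; per the proposition above it should be
# nonincreasing (up to floating point error).
params = constrain(θ_0)
nlls = []
for _ in range(10):
    params = em_update(*params)
    nlls.append(nll(unconstrain(*params)))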
As shown below, EM converges in just a few iterations. (Near convergence, the NLL may appear to increase very slightly; this is due purely to floating point error.)
tabulate(
    [
        ("Exact", π, μ_0, σ_0, μ_1, σ_1),
        ("BFGS",) + constrain(result.x),
        ("EM", π_k, μ_0_k, σ_0_k, μ_1_k, σ_1_k),
    ],
    headers=("", "π", "μ_0", "σ_0", "μ_1", "σ_1"),
    tablefmt="html",
)
|       | π        | μ_0      | σ_0      | μ_1     | σ_1      |
|-------|----------|----------|----------|---------|----------|
| Exact | 0.5      | 1        | 0.7      | 5       | 1        |
| BFGS  | 0.523454 | 0.963808 | 0.628632 | 4.90962 | 0.995069 |
| EM    | 0.523328 | 0.964187 | 0.629008 | 4.91024 | 0.994388 |
So far, we have treated the true parameter $\theta$ as a constant. Bayesian methods instead treat it as a random variable with prior density $f(\theta)$. The prior expresses our beliefs about the parameter before seeing (i.e., unconditional on) data. After data is observed, we obtain the posterior, reflecting our updated beliefs:
\[f(\theta \mid \boldsymbol{y}) = \frac{f(\boldsymbol{y} \mid \theta) f(\theta)}{f(\boldsymbol{y})} = \frac{f(\boldsymbol{y} \mid \theta) f(\theta)}{\int f(\boldsymbol{y} \mid \theta) f(\theta) \, \mathrm{d}\theta} \propto f(\boldsymbol{y} \mid \theta) f(\theta).\]That is, the posterior is proportional to the likelihood times the prior:
\[f(\theta \mid \boldsymbol{y}) \propto L(\theta) f(\theta).\]One can imagine any number of Bayesian estimators constructed from the posterior. Any one of the mean, median, or mode of the posterior produces a reasonable estimator.
Below is an example using PyMC to compute a highest density interval (HDI), a form of credible interval, for the parameter dimensions. Reasonable priors for each dimension are chosen (e.g., $\pi \sim \operatorname{Uniform}(0, 1)$).
with pymc.Model() as model:
    π_rv = pymc.Uniform("π", lower=0., upper=1.)
    μ_0_rv = pymc.Normal("μ_0", mu=np.quantile(Y, 0.25), sigma=1.)
    σ_0_rv = pymc.HalfNormal("σ_0", sigma=1.)
    μ_1_rv = pymc.Normal("μ_1", mu=np.quantile(Y, 0.75), sigma=1.)
    σ_1_rv = pymc.HalfNormal("σ_1", sigma=1.)
    Y_0_dist = pymc.Normal.dist(mu=μ_0_rv, sigma=σ_0_rv)
    Y_1_dist = pymc.Normal.dist(mu=μ_1_rv, sigma=σ_1_rv)
    # Weight (1 - π) goes with component 0 and π with component 1, matching the model above
    likelihood = pymc.Mixture("Y", w=[1. - π_rv, π_rv], comp_dists=[Y_0_dist, Y_1_dist], observed=Y)
    trace = pymc.sample(2000)
pymc.plot_posterior(trace, var_names=["π", "μ_0", "σ_0", "μ_1", "σ_1"]);
The Fokker-Planck equation is a partial differential equation (PDE) which describes the evolution of the probability density function of an Ito diffusion. Since it is a PDE, it admits solutions in certain special cases and is amenable to numerical methods for PDEs in the general case.
Consider the SDE
\[\mathrm{d}X_{t}=a(t,X_{t})\mathrm{d}t+b(t,X_{t})\mathrm{d}W_{t}\]with bounded coefficients (i.e., $\sup_{t,x}|a(t,x)|+\sup_{t,x}|b(t,x)|<\infty$). This requirement is used below to apply Fubini’s theorem but can be relaxed.
Let $f:\mathbb{R}\rightarrow\mathbb{R}$ be smooth with compact support. By Ito’s lemma,
\[\mathrm{d}f=f_{x}\mathrm{d}X_{t}+\frac{1}{2}f_{xx}\mathrm{d}X_{t}^{2}=\left(af_{x}+\frac{1}{2}b^{2}f_{xx}\right)\mathrm{d}t+bf_{x}\mathrm{d}W_{t}.\]Taking expectations of both sides,
\[\mathbb{E}\left[\mathrm{d}f\right]=\mathbb{E}\left[\left(af_{x}+\frac{1}{2}b^{2}f_{xx}\right)\mathrm{d}t\right].\]The above is formalism for the expression
\[\mathbb{E}f(X_{T})-\mathbb{E}f(X_{t})=\mathbb{E}\left[\int_{t}^{T}a(s,X_{s})f_{x}(X_{s})+\frac{1}{2}b(s,X_{s})^{2}f_{xx}(X_{s})\mathrm{d}s\right].\]By Fubini’s theorem, we can interchange the expectation and integral on the right hand side. Moreover, by the Mean Value Theorem, we can find $\xi$ between $t$ and $T$ such that
\[\frac{\mathbb{E}f(X_{T})-\mathbb{E}f(X_{t})}{T-t}=\mathbb{E}\left[a(\xi,X_{\xi})f_{x}(X_{\xi})+\frac{1}{2}b(\xi,X_{\xi})^{2}f_{xx}(X_{\xi})\right].\]Taking limits as $T\downarrow t$ and applying the Dominated Convergence Theorem,
\[\frac{\partial}{\partial t}\left[\mathbb{E}f(X_{t})\right]=\mathbb{E}\left[a(t,X_{t})f_{x}(X_{t})+\frac{1}{2}b(t,X_{t})^{2}f_{xx}(X_{t})\right].\]Let $p(t,\cdot)$ be the density of $X_{t}$. Then, the above is equivalent to
\[\frac{\partial}{\partial t}\int p(t,x)f(x)\mathrm{d}x=\int p(t,x)\left(a(t,x)f_{x}(x)+\frac{1}{2}b(t,x)^{2}f_{xx}(x)\right)\mathrm{d}x.\]Applying integration by parts to the right hand side,
\[\frac{\partial}{\partial t}\int f(x)p(t,x)\mathrm{d}x=\int f(x)\left(-\frac{\partial}{\partial x}\left[p(t,x)a(t,x)\right]+\frac{1}{2}\frac{\partial^{2}}{\partial x^{2}}\left[p(t,x)b(t,x)^{2}\right]\right)\mathrm{d}x.\]Since this holds for all smooth, compactly supported functions $f$, it follows that
\[\frac{\partial p}{\partial t}(t,x)=-\frac{\partial}{\partial x}\left[p(t,x)a(t,x)\right]+\frac{1}{2}\frac{\partial^{2}}{\partial x^{2}}\left[p(t,x)b(t,x)^{2}\right].\]This is the Fokker-Planck equation in one dimension. The derivation for multiple dimensions is similar.
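As a quick sanity check, consider the special case of standard Brownian motion ($a \equiv 0$ and $b \equiv 1$), for which the Fokker-Planck equation reduces to the heat equation

\[\frac{\partial p}{\partial t}(t,x)=\frac{1}{2}\frac{\partial^{2}p}{\partial x^{2}}(t,x),\]

whose solution started from $X_0 = 0$ is the $\operatorname{Normal}(0, t)$ density $p(t,x)=\left(2\pi t\right)^{-1/2}e^{-x^{2}/(2t)}$, as expected for $W_t$.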
The softmax function $\sigma$ is used to transform a vector in $\mathbb{R}^n$ to a probability vector in a monotonicity-preserving way. Specifically, if $x_i \leq x_j$, then $\sigma(x)_i \leq \sigma(x)_j$.
The softmax is typically parametrized by a “temperature” parameter $T$ to yield $\sigma_T(x) \equiv \sigma(x / T)$, which sharpens the resulting distribution when $T < 1$ and flattens it when $T > 1$.
More details regarding the temperature can be found in a previous blog post.
Algebraically, the softmax is defined as
\[\sigma(x)_i \equiv \frac{\exp(x_i)}{\sum_j \exp(x_j)}.\]This quantity is clearly continuous on $\mathbb{R}^n$ and hence finite there. However, in the presence of floating point computation, computing this quantity naively can result in blow-up:
x = np.array([768, 1024.])
exp_x = np.exp(x)
exp_x / exp_x.sum()
/tmp/ipykernel_117792/4003806838.py:1: RuntimeWarning: overflow encountered in exp
exp_x = np.exp(x)
/tmp/ipykernel_117792/4003806838.py:2: RuntimeWarning: invalid value encountered in divide
exp_x / exp_x.sum()
array([nan, nan])
The LogSumExp trick is a clever way of reformulating this computation so that it is robust to floating point error.
First, let $\bar{x} = \max_i x_i$ and note that
\[\sigma(x)_{i}=\frac{\exp(x_{i}-\bar{x})}{\sum_{j}\exp(x_{j}-\bar{x})}.\]Taking logarithms,
\[\log(\sigma(x)_{i})=x_{i}-\bar{x}-\log\biggl(\sum_{j}\exp(x_{j}-\bar{x})\biggr).\]Exponentiating,
\[\sigma(x)_{i}=\exp\biggl(x_{i}-\bar{x}-\log\biggl(\sum_{j}\exp(x_{j}-\bar{x})\biggr)\biggr).\]In particular, note that $x_j - \bar{x}$ is, by construction, nonpositive and hence has a value of at most one when exponentiated.
def softmax(x: np.ndarray) -> np.ndarray:
    x_max = x.max(axis=-1, keepdims=True)
    delta = x - x_max
    lse = np.log(np.exp(delta).sum(axis=-1, keepdims=True))
    return np.exp(delta - lse)
x = np.array([768, 1024.])
softmax(x)
array([6.61626106e-112, 1.00000000e+000])
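As a sanity check, the result agrees with SciPy's implementation (assuming scipy is available):

from scipy.special import softmax as scipy_softmax

np.allclose(softmax(x), scipy_softmax(x))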
The determinant of a matrix is typically introduced in an undergraduate linear algebra course via either the Leibniz Formula or a recurrence relation arising from the Leibniz Formula. Pedagogically, it is better to introduce the determinant as a mapping which satisfies some desirable properties and only then show that it is equivalent to the Leibniz Formula. This short expository post attempts to do just that.
A determinant function is a mapping $\det$ from the square complex matrices to complex numbers satisfying the following properties:

1. Swapping two rows flips the sign of the determinant.
2. Multiplying a row by a scalar $c$ multiplies the determinant by $c$.
3. Adding a multiple of one row to another row leaves the determinant unchanged.
4. The determinant of the identity matrix is one.
The first three points above correspond to the three elementary row operations.
Proposition. Let $\det$ be a determinant function and $A$ be a square complex matrix whose rows are linearly dependent. Then, $\det A = 0$.
Proof. In this case, we can perform a sequence of elementary row operations (excluding multiplying a row by $c = 0$) that result in a row consisting of only zeros. The result then follows by property (2).
Indeed, by performing elimination to reduce the matrix into either the identity or a matrix with at least one row of zeros, we can unambiguously define a determinant function (note that we have not yet proven that such a function is unique). The code below does just that, proving the existence of a determinant function. For now, we refer to this as the canonical determinant.
Remark. The code below operates on floating point numbers. The definition of the canonical determinant should be understood to be the “algebraic” version of this code that runs without deference to floating point error.
def det(mat: np.ndarray) -> float:
    """Computes a determinant.

    This algorithm works by eliminating the strict lower triangular part of the
    matrix and then eliminating the strict upper triangular part of the matrix.
    This elimination is done using row operations, while keeping track of any
    swaps that may change the sign parity of the determinant.

    If you are already familiar with the determinant, you will note that
    eliminating the strict upper triangular part is not necessary. Even if this
    algorithm was optimized to remove that step, this is still not a performant
    way to compute determinants!

    Parameters
    ----------
    mat
        A matrix

    Returns
    -------
    Determinant
    """
    m, n = mat.shape
    assert m == n
    mat = mat.astype(float)
    sign = 1
    for _ in range(2):
        for j in range(n):
            # Find pivot element
            p = -1
            for i in range(j, n):
                if not np.isclose(mat[i, j], 0.0):
                    p = i
                    break
            if p < 0:
                continue
            # Swap so that the pivot row sits at index j
            if j != p:
                r = mat[p].copy()
                mat[p] = mat[j]
                mat[j] = r
                sign *= -1
            # Eliminate the entries below the pivot
            for i in range(j + 1, n):
                if not np.isclose(mat[i, j], 0.0):
                    mat[i] -= mat[j] * mat[i, j] / mat[j, j]
        # Transpose so that the second pass eliminates the (former) upper triangle
        mat = mat.T
    return float(sign) * np.diag(mat).prod().item()
Notation. For a set $\mathcal{A}$, we write $A \equiv (a_1, \ldots, a_n)$ to denote an element of $\mathcal{A}^n$.
Definition (Alternating multilinear map). Let $\mathcal{A}$ and $\mathcal{B}$ be vector spaces. An alternating multilinear map is a multilinear map $f: \mathcal{A}^n \rightarrow \mathcal{B}$ that satisfies $f(A) = 0$ whenever $a_i = a_{i + 1}$ for some $i < n$.
Notation. Let $\sigma$ be a permutation of {1, …, n}. Since $A$ in $\mathcal{A}^n$ can be thought of as a function from {1, …, n} to $\mathcal{A}$, we write $A \circ \sigma \equiv (a_{\sigma(1)}, \ldots, a_{\sigma(n)})$ to denote a permutation of the elements of $A$.
Proposition (Transposition parity). Let $f$ be an alternating multilinear map. Let $\sigma$ be a transposition (a permutation which swaps two elements). Then, $f(A) = -f(A \circ \sigma)$.
Proof. Let $i < j$ denote the swapped indices in the transposition. Fix $A$ and let
\[g(x, y) \equiv f(a_1, \ldots, a_{i - 1}, x, a_{i + 1}, \ldots, a_{j - 1}, y, a_{j + 1}, \ldots, a_n).\]It follows that
\[g(x, y) + g(y, x) = g(x, y) + g(y, y) + g(y, x) + g(x, x) = g(x + y, y) + g(x + y, x) = g(x + y, x + y) = 0\]and hence $g(x, y) = -g(y, x)$, as desired. $\blacksquare$
Corollary. Let $f$ be an alternating multilinear map. Then $f(A) = 0$ whenever $a_i = a_j$ for some $(i, j)$ with $i < j$.
Proof. Suppose $a_i = a_j$ with $i < j$. If $j = i + 1$, this is just the definition. Otherwise, let $\sigma$ be the transposition which swaps indices $i + 1$ and $j$, so that the equal entries become adjacent in $A \circ \sigma$. Then, $f(A) = -f(A \circ \sigma) = 0$. $\blacksquare$
Corollary. Let $f$ be an alternating multilinear map and $\sigma$ be a permutation. Then, $f(A) = \operatorname{sgn}(\sigma) f(A \circ \sigma)$ where $\operatorname{sgn}(\sigma)$ is the parity of the permutation.
Proof. The result follows from the fact that a permutation can be written as a composition of transpositions. $\blacksquare$
Proposition. A multilinear map $f:\mathcal{A}^{n}\rightarrow\mathcal{B}$ is alternating multilinear if and only if $f(A)=0$ whenever $a_{1},\ldots,a_{n}$ are linearly dependent.
Proof. Suppose the map is alternating multilinear. Let $a_{1},\ldots,a_{n}$ be linearly dependent so that, without loss of generality, $a_{1}=\sum_{i>1}\alpha_{i}a_{i}$. By linearity,
\[f(A)=\sum_{i>1}\alpha_{i}f(a_{i},a_{2},\ldots,a_{n})=0.\]The converse is trivial. $\blacksquare$
Notation. If $\mathcal{A} = \mathbb{C}^n$, then $\mathcal{A}^n$ is isomorphic to the set of $n \times n$ complex matrices. In light of this, an element in $\mathcal{A}^n$ can be considered as a matrix $A \equiv (a_{ij})$ or as a tuple $A \equiv (a_1, \ldots, a_n)$ consisting of the rows of said matrix.
Proposition (Uniqueness). Let $f: (\mathbb{C}^n)^n \rightarrow \mathbb{C}$ be an alternating multilinear map such that $f(I) = 1$. Then,
\[\begin{equation}\label{eq:leibniz_formula} f(A) = \sum_{\sigma \in S_n} \operatorname{sgn}(\sigma) a_{1 \sigma(1)} \cdots a_{n \sigma(n)}.\tag{Leibniz Formula} \end{equation}\]where $S_n$ is the set of all permutations on {1, …, n}.
Proof. First, note that
\[f(A) = f\biggl(\sum_j a_{1j} e_j, \ldots, \sum_j a_{nj} e_j\biggr) \\ = \sum_{1 \leq j_1,\ldots,j_n \leq n} a_{1 j_1} \cdots a_{n j_n} f(e_{j_1}, \ldots, e_{j_n}).\]Since $f$ is alternating multilinear and hence equal to zero whenever any of its two inputs are equal, we can restrict our attention to the permutations:
\[f(A) = \sum_{\sigma \in S_n} a_{1 \sigma(1)} \cdots a_{n \sigma(n)} f(e_{\sigma(1)}, \ldots, e_{\sigma(n)}).\]Since $f$ is alternating multilinear, we can change the order of its inputs so long as we count the number of transpositions and use that to account for a possible sign-change:
\[f(A) = \sum_{\sigma \in S_n} \operatorname{sgn}(\sigma) a_{1 \sigma(1)} \cdots a_{n \sigma(n)} f(I).\]Using the assumption $f(I) = 1$, the desired result follows. $\blacksquare$
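The Leibniz Formula is also easy to check numerically against the canonical determinant defined above and NumPy's implementation. The brute-force sum below is only a sketch (factorial cost; sensible only for tiny matrices).

from itertools import permutations

def leibniz_det(mat: np.ndarray) -> float:
    """Evaluate the Leibniz Formula directly (O(n! n) cost)."""
    n = mat.shape[0]
    total = 0.0
    for perm in permutations(range(n)):
        # Parity of the permutation via counting inversions
        inversions = sum(perm[i] > perm[j] for i in range(n) for j in range(i + 1, n))
        term = (-1.0) ** inversions
        for i in range(n):
            term *= mat[i, perm[i]]
        total += term
    return total

np.random.seed(7)
m = np.random.randn(4, 4)
np.allclose([leibniz_det(m), det(m)], np.linalg.det(m))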
Remark. $\operatorname{sgn}(\sigma)$ is sometimes represented as $\epsilon_{i_1 \ldots i_n}$ where $i_j = \sigma(j)$. This is called the Levi-Civita symbol. Using this symbol and Einstein notation, the \ref{eq:leibniz_formula} becomes
\[\epsilon_{i_1 \ldots i_n} a_{1 i_1} \cdots a_{n i_n}.\]Proposition. A determinant function is multilinear.
Proof. Let $A$ be a square complex matrix and $h$ be a vector. It is sufficient to show that
\[\det A+\det(h,a_{2},\ldots,a_{n})=\det(a_{1}+h,a_{2},\ldots,a_{n}).\]Suppose the rows of $A$ are linearly dependent. Without loss of generality, write $a_{1}=\sum_{i>1}\alpha_{i}a_{i}$ and $h=b+\sum_{i>1}\beta_{i}a_{i}$ where $b$ is orthogonal to the $a_{i}$. Then, $\det A=0$. Moreover,
\[\det(h,a_{2},\ldots,a_{n})=\det\biggl(b+\sum_{i>1}\beta_{i}a_{i},a_{2},\ldots,a_{n}\biggr)=\det(b,a_{2},\ldots,a_{n})\]and
\[\det(a_{1}+h,a_{2},\ldots,a_{n})=\det\biggl(b+\sum_{i>1}\left(\alpha_{i}+\beta_{i}\right)a_{i},a_{2},\ldots,a_{n}\biggr)=\det(b,a_{2},\ldots,a_{n}),\]as desired.
Suppose the rows of $A$ are linearly independent. It follows that we can write $h=\sum_{i}\beta_{i}a_{i}$. Then,
\[\begin{align*} \det A+\det(h,a_{2},\ldots,a_{n}) & =\det A+\det\biggl(\sum_{i}\beta_{i}a_{i},a_{2},\ldots,a_{n}\biggr)\\ & =\det A+\det\biggl(\beta_{1}a_{1},a_{2},\ldots,a_{n}\biggr)\\ & =\det\biggl(\left(1+\beta_{1}\right)a_{1},a_{2},\ldots,a_{n}\biggr)\\ & =\det\biggl(a_{1}+\sum_{i}\beta_{i}a_{i},a_{2},\ldots,a_{n}\biggr)\\ & =\det\biggl(a_{1}+h,a_{2},\ldots,a_{n}\biggr). \end{align*}\]$\blacksquare$
Corollary. A determinant function is an alternating multilinear map.
Corollary. There is only one determinant function and it is given by the \ref{eq:leibniz_formula}.
We can now use the \ref{eq:leibniz_formula} to derive various properties of the determinant. The following results are concerned with complex matrices $A \equiv (a_{ij})$ and $B \equiv (b_{ij})$.
Proposition. $\det A = \det A^\intercal$.
Proof.
\[\det A =\sum_{\sigma}\operatorname{sgn}(\sigma)\prod_{i}a_{i\sigma(i)} =\sum_{\sigma}\operatorname{sgn}(\sigma)\prod_{i}a_{\sigma^{-1}(i)\sigma(\sigma^{-1}(i))} =\sum_{\sigma}\operatorname{sgn}(\sigma^{-1})\prod_{i}a_{\sigma^{-1}(i)i} =\det A^{\intercal}.\]$\blacksquare$
Notation. For a matrix $A$, let $A^{(i, j)}$ be the same matrix after the simultaneous removal of its $i$-th row and $j$-th column.
Lemma.
\[\det A = \sum_j \left( -1 \right)^{j - 1} a_{1j} \det A^{(1, j)}\]Proof. We demonstrate the idea for a $3\times3$ matrix; the generalization is straightforward.
Using multilinearity,
\[\begin{align*} \det\begin{pmatrix}a_{11} & a_{12} & a_{13}\\ a_{21} & a_{22} & a_{23}\\ a_{31} & a_{32} & a_{33} \end{pmatrix} & =a_{11}\det\begin{pmatrix}1 & 0 & 0\\ a_{21} & a_{22} & a_{23}\\ a_{31} & a_{32} & a_{33} \end{pmatrix}+a_{12}\det\begin{pmatrix}0 & 1 & 0\\ a_{21} & a_{22} & a_{23}\\ a_{31} & a_{32} & a_{33} \end{pmatrix}+a_{13}\det\begin{pmatrix}0 & 0 & 1\\ a_{21} & a_{22} & a_{23}\\ a_{31} & a_{32} & a_{33} \end{pmatrix}\\ & =a_{11}\det\begin{pmatrix}1 & 0 & 0\\ a_{21} & a_{22} & a_{23}\\ a_{31} & a_{32} & a_{33} \end{pmatrix}-a_{12}\det\begin{pmatrix}1 & 0 & 0\\ a_{22} & a_{21} & a_{23}\\ a_{32} & a_{31} & a_{33} \end{pmatrix}+a_{13}\det\begin{pmatrix}1 & 0 & 0\\ a_{23} & a_{21} & a_{22}\\ a_{33} & a_{31} & a_{32} \end{pmatrix} \end{align*}\]Moreover, by the Leibniz Formula,
\[\det\begin{pmatrix}1 & 0 & 0\\ a_{21} & a_{22} & a_{23}\\ a_{31} & a_{32} & a_{33} \end{pmatrix} =\sum_{\sigma}\operatorname{sgn}(\sigma)a_{1\sigma(1)}a_{2\sigma(2)}a_{3\sigma(3)} =\sum_{\sigma\colon\sigma(1)=1}\operatorname{sgn}(\sigma)a_{2\sigma(2)}a_{3\sigma(3)} =\det\begin{pmatrix}a_{22} & a_{23}\\ a_{32} & a_{33} \end{pmatrix}.\]The remaining terms are handled similarly. $\blacksquare$
Proposition (Cofactor expansion). For any $i$ between $1$ and $n$ (inclusive),
\[\det A = \sum_j \left( -1 \right)^{i + j} a_{ij} \det A^{(i, j)}\]Proof. Recalling that the determinant flips signs when any two rows are swapped, we can perform a sequence of $i - 1$ transpositions to move $a_i$, the $i$-th row of the matrix, to the “top” and apply the previous lemma:
\[\left(-1\right)^{i-1}\det A=\det\begin{pmatrix}a_{i}^{\intercal}\\ a_{1}^{\intercal}\\ a_{2}^{\intercal}\\ \vdots\\ a_{i-1}^{\intercal}\\ a_{i+1}^{\intercal}\\ \vdots\\ a_{n}^{\intercal} \end{pmatrix}.\]Applying the lemma to the right-hand side (deleting the first row of the rearranged matrix leaves the remaining rows in their original order, so the relevant minors are exactly the $A^{(i,j)}$) and noting that $(-1)^{i-1}(-1)^{j-1} = (-1)^{i+j}$ yields the claim. $\blacksquare$
Corollary. If $A$ is either lower or upper triangular, $\det A = \prod_i a_{ii}$.
Proof. First, note that it is sufficient to consider the lower triangular case since the transpose of an upper triangular matrix is lower triangular. The result then follows from performing a cofactor expansion along the first row inductively. $\blacksquare$
Proposition.
\[\det(AB) = \det A \det B\]Proof. If either $A$ or $B$ are singular, the claim is trivial since both sides are zero. Therefore, proceed assuming $A$ and $B$ are nonsingular.
As with the construction of the canonical determinant, we can write
\[I=E_{k}\cdots E_{1}A\]where $E_{1},\ldots,E_{k}$ are a sequence of elementary row operations. It is easy to see that elementary row operations are nonsingular and their inverses are themselves elementary row operations. Therefore, $A$ can be written as a product of elementary row operations. To arrive at the desired result, it is sufficient to show that for any sequence of row operations $E_{1}^{\prime},\ldots,E_{k}^{\prime}$ there exists a constant $\alpha$ such that for any matrix $M$
\[\det(E_{1}^{\prime}\cdots E_{k}^{\prime}M)=\alpha\det M.\]$\blacksquare$
Corollary. The determinant of an $n \times n$ complex matrix is the product of its $n$ (possibly non-unique) eigenvalues.
Proof. Let $A$ be an $n \times n$ complex matrix and denote by $A = P^{-1} J P$ its Jordan normal form. Since the matrix $J$ has the eigenvalues $\lambda_1, \ldots, \lambda_n$ of $A$ on its diagonal and is upper triangular,
\[\det A = \det P^{-1} \det J \det P = \det J = \prod_{i = 1}^n \lambda_i.\]$\blacksquare$
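This, too, is easy to spot check numerically (a sketch; the eigenvalues of a real matrix may be complex, but their product is real up to roundoff):

np.random.seed(0)
m = np.random.randn(4, 4)
np.allclose(np.linalg.det(m), np.prod(np.linalg.eigvals(m)).real)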
Autoregressive (AR) models are a type of time series model in which the value at any point in time is a linear function of the previous values and white noise. AR models have many applications including, but not limited to, mathematical finance.
In this short article, we derive the maximum likelihood estimator (MLE) for the coefficients and variance of an autoregressive model when the white noise is normally distributed. We will do this by appealing to the finite memory of the AR process.
As an example application, we use our findings to fit the parameters of an Ornstein–Uhlenbeck process.
An autoregressive process of order p, or AR(p), is a stochastic process $(X_1, X_2, \ldots)$ satisfying
\[X_n = c_0 + \sum_{i = 1}^p c_i X_{n - i} + \epsilon_n\]where $c_i$ are (non-random) coefficients and $(\epsilon_1, \epsilon_2, \ldots)$ is a white noise process (i.e., the $\epsilon_n$ have zero mean and finite variance and are IID). The $X_{-(p-1)}, \ldots, X_0$ are initial conditions that, for simplicity, we assume to be (non-random) constants.
Remark. Most definitions of AR(p) processes do not include a bias term $c_0$ which we include for greater generality.
It will be useful, for a subsequent section, to define some notation. Let
\[\mathbf{X}_{N,i} = \begin{pmatrix} X_{1 - i} \\ \vdots \\ X_{N - i} \end{pmatrix}\]In other words, $\mathbf{X}_{N, i}$ collects the first $N$ observations after a shift backwards by $i$ steps. Furthermore, let
\[\mathbf{A}_N = \begin{pmatrix} \mathbf{1} & \mathbf{X}_{N, 1} & \ldots & \mathbf{X}_{N, p} \end{pmatrix}\]be the $N \times (p + 1)$ design matrix which contains the ones vector as its first column and the various shift vectors as its subsequent columns.
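For concreteness, here is one way to build $\mathbf{A}_N$ in code. This is only a sketch: the name x and its layout (an array holding $X_{1-p}, \ldots, X_N$) are assumptions made for illustration.

import numpy as np

def design_matrix(x: np.ndarray, p: int) -> np.ndarray:
    """Build A_N from x = (X_{1-p}, ..., X_N) for an AR(p) model."""
    N = len(x) - p
    # Column i holds the observations shifted back by i steps: X_{1-i}, ..., X_{N-i}
    columns = [np.ones(N)] + [x[p - i:p - i + N] for i in range(1, p + 1)]
    return np.stack(columns, axis=1)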
Suppose we have observed the process up to and including time $N$. The corresponding likelihood (assuming absolute continuity) is
\[\mathcal{L} = f(X_1, \ldots, X_N) = f(X_1) f(X_2 \mid X_1) f(X_3 \mid X_1, X_2) \cdots f(X_N \mid X_1, \ldots, X_{N - 1}).\]Since the process at a fixed time depends only on its last $p$ values, we can simplify this further to
\[\mathcal{L} = \prod_{n = 1}^N f(X_n \mid X_{n - 1}, \ldots, X_{n - p}).\]When the white noise is normal, the conditional densities above are normal densities and we can obtain very simple expressions for the MLEs:
Proposition. Consider an AR(p) model under normal white noise with variance $\sigma^2_\epsilon$. MLEs for the coefficients and variance satisfy
\[(\mathbf{A}_N^\intercal \mathbf{A}_N) \hat{\mathbf{c}} = \mathbf{A}_N^\intercal \mathbf{X}_{N, 0}\]and
\[N \hat{\sigma}^2_\epsilon = \mathbf{1}^\intercal ( \mathbf{A}_N \hat{\mathbf{c}} - \mathbf{X}_{N, 0} )^{\circ 2}\]where $\mathbf{M}^{\circ 2}$ is the element-wise square of the matrix $\mathbf{M}$.
Proof. As discussed above, the log-likelihood is
\[\log \mathcal{L} = \sum_{n = 1}^N \log f(X_n \mid X_{n - 1}, \ldots, X_{n - p}) = \mathrm{const.} - \frac{N}{2} \log \sigma^2_\epsilon - \frac{1}{2 \sigma^2_\epsilon} \sum_{n = 1}^N \left( c_0 + c_1 X_{n - 1} + \cdots + c_p X_{n - p} - X_n \right)^2.\]We recognize this as equivalent to the log-likelihood used to derive the normal equations for ordinary least squares (OLS) in the presence of a bias. The predictors, in this case, are the shifted realizations $\mathbf{X}_{N, i}$. From this, the desired result follows.
The Ornstein-Uhlenbeck process $(Y_t)_{t \geq 0}$ is a stochastic process satisfying the stochastic differential equation
\[dY_t = \theta \left(\mu - Y_t\right) dt + \sigma dW_t\]where $\theta > 0$ and $\sigma > 0$ are parameters. Suppose we observe the process at a sequence of (increasing) equally-spaced times $t_1, t_2, \ldots, t_N$. By the above,
\[\Delta Y_n = \theta \left(\mu \Delta t - \int_{t_{n - 1}}^{t_n} Y_s ds\right) + \sigma \Delta W_n \approx \theta \left(\mu - Y_{t_{n - 1}} \right) \Delta t + \sigma \Delta W_n\]where $\Delta Y_n = Y_{t_n} - Y_{t_{n - 1}}$, $\Delta t = t_n - t_{n - 1}$, and $\Delta W_n = W_{t_n} - W_{t_{n - 1}}$. By approximating the integral using its leftmost endpoint, we have arrived at the AR(1) sequence
\[X_n = \underbrace{\theta \mu \Delta t}_{c_0} + \underbrace{\left(1 - \theta \Delta t\right)}_{c_1} X_{n - 1} + \underbrace{\sigma \Delta W_n}_{\epsilon_n}\]where $X_n \approx Y_{t_n}$. Note, in particular, that the conditional increments are normal.
Remark. This approximation is called the Euler-Maruyama method.
Now, we are in a position to estimate the parameters $\theta$, $\mu$, and $\sigma$. However, recall that in deriving the MLEs, we expressed our results in terms of $c$ and $\sigma^2_\epsilon$! Fortunately, due to the equivariance of MLEs, we can rely on the transformations
\[\begin{align*} \hat{\theta} & = ( 1 - \hat{c}_1 ) / \Delta t \\ \hat{\mu} & = \hat{c}_0 / ( \hat{\theta} \Delta t ) \\ \hat{\sigma}^2 & = \hat{\sigma}^2_\epsilon / \Delta t \end{align*}\]to retrieve the parameters of interest.
Let’s try this out with some synthetic data below!
N = 1_000
T = 1.
θ = 10.
μ = 0.5
σ = 0.1
Y0 = 1.
np.random.seed(1)
Y = np.empty((N + 1,))
Y[0] = Y0
Δt = T / N
for n in range(N):
    ΔW = np.sqrt(Δt) * np.random.randn()
    Y[n + 1] = θ * μ * Δt + (1. - θ * Δt) * Y[n] + σ * ΔW
A_N = np.stack([np.ones_like(Y, shape=(N, )), Y[:-1]], axis=1)
X_N0 = Y[1:]
c_hat = np.linalg.solve(A_N.T @ A_N, A_N.T @ X_N0)
σ2_ε_hat = ((A_N @ c_hat - X_N0)**2).sum() / N
θ_hat = (1. - c_hat[1]) / Δt
μ_hat = c_hat[0] / (θ_hat * Δt)
σ2_hat = σ2_ε_hat / Δt
σ_hat = np.sqrt(σ2_hat)
| Parameter | Exact | MLE      |
|-----------|-------|----------|
| θ         | 10    | 10.1671  |
| μ         | 0.5   | 0.513091 |
| σ         | 0.1   | 0.098099 |
In probability and statistics, standardization is a process that scales and centers variables. For a scalar random variable $X$, its standardization is the variable $Z$ defined by
\[\begin{equation} Z = \frac{X - \mathbb{E} X}{\sqrt{\operatorname{Var}(X)}}. \end{equation}\]Standardization allows for easier comparison and analysis of different variables, providing a common ground for meaningful interpretations. A good example of this is the use of standardized coefficients in regression.
If $X$ is instead a random vector, the above formula falls short since $\operatorname{Var}(X)$ is a covariance matrix. In this short exposition, the above formula is generalized to the vector (a.k.a. multivariate) setting.
As an application, we consider a well-known recipe for sampling from multivariate random normal distributions using a standard random normal sampler.
Standardization in multiple dimensions is an immediate consequence of the following result:
Proposition (First and Second Moments of Affine Transform). Let $U$ and $V$ be random vectors, $A$ be a (deterministic) matrix and $b$ be a (deterministic) vector such that
\[U = A V + b.\]Then,
\[\begin{align*} \mathbb{E}U & =A\mathbb{E}V+b\\ \operatorname{Var}(U) & =A\operatorname{Var}(V)A^{\intercal}. \end{align*}\]You can prove the above by direct computation, using linearity of expectation and the identity $\operatorname{Var}(Y) = \mathbb{E}[YY^\intercal] - \mathbb{E}Y\mathbb{E}Y^\intercal$.
Corollary (Standardization). Let $\mu$ be a vector and $\Sigma$ be a positive definite matrix so that it admits a Cholesky factorization $LL^\intercal = \Sigma$. Let $X$ and $Z$ be random vectors satisfying
\[\begin{equation} Z = L^{-1} \left( X - \mu \right). \end{equation}\]Then, $X$ has mean $\mu$ and covariance matrix $\Sigma$ if and only if $Z$ has mean $\mathbf{0}$ and covariance matrix $I$.
Note that in the above, $L$ generalizes $\sqrt{\operatorname{Var}(X)}$ from the scalar case.
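In code, the standardization (often called whitening) is just a triangular solve against the Cholesky factor. The sketch below is illustrative; the function name and its arguments are not from the post.

import numpy as np

def standardize(samples: np.ndarray, mean: np.ndarray, covariance: np.ndarray) -> np.ndarray:
    """Map rows of `samples` to approximately zero-mean, identity-covariance vectors."""
    chol = np.linalg.cholesky(covariance)
    # Solve L z = (x - μ) for each row x of `samples`
    return np.linalg.solve(chol, (samples - mean).T).T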
It is useful, for the sake of reference, to derive closed forms for the Cholesky factorization in the $d = 2$ case. In this case, the covariance matrix takes the form
\[\Sigma=\begin{pmatrix}\sigma_{1}\\ & \sigma_{2} \end{pmatrix}\begin{pmatrix}1 & \rho\\ \rho & 1 \end{pmatrix}\begin{pmatrix}\sigma_{1}\\ & \sigma_{2} \end{pmatrix}.\]It is easy to verify (by matrix multiplication) that the Cholesky factorization is given by
\[L=\begin{pmatrix}\sigma_{1}\\ & \sigma_{2} \end{pmatrix}\begin{pmatrix}1 & 0\\ \rho & \sqrt{1-\rho^{2}} \end{pmatrix}.\]Its inverse is
\[L^{-1}=\begin{pmatrix}1 & 0\\ -\frac{\rho}{\sqrt{1-\rho^{2}}} & \frac{1}{\sqrt{1-\rho^{2}}} \end{pmatrix}\begin{pmatrix}\frac{1}{\sigma_{1}}\\ & \frac{1}{\sigma_{2}} \end{pmatrix}.\]A multivariate random normal variable is defined as any random vector $X$ which can be written in the form
\[X = LZ + \mu\]where $Z$ is a random vector whose coordinates are independent standard normal variables. From our results above, we know that $X$ has mean $\mu$ and covariance $\Sigma = LL^\intercal$.
Remark. This is only one of many equivalent ways to define a multivariate random normal variable.
This gives us a recipe for simulating draws from an arbitrary multivariate random normal distribution given only a standard random normal sampler such as np.random.randn():
def sample_multivariate_normal(
    mean: np.ndarray,
    covariance: np.ndarray,
    n_samples: int = 1,
) -> np.ndarray:
    """Samples a multivariate random normal distribution.

    Parameters
    ----------
    mean
        ``(d, )`` shaped mean
    covariance
        Positive definite ``(d, d)`` shaped covariance matrix
    n_samples
        Number of samples

    Returns
    -------
    ``(n_samples, d)`` shaped array where each row corresponds to a single
    draw from a multivariate normal.
    """
    chol = np.linalg.cholesky(covariance)
    rand = np.random.randn(mean.shape[0], n_samples)
    return (chol @ rand).T + mean
We can verify this is working as desired with a small test and visualization:
mean = np.array([5., 10.])
covariance = np.array([[2., 1.],
                       [1., 4.]])
samples = sample_multivariate_normal(mean=mean,
                                     covariance=covariance,
                                     n_samples=10_000)
x, y = samples.T
plt.scatter(x, y, alpha=0.1)
empirical_covariance = np.cov(samples, rowvar=False)
print(f"""
Empirical covariance:
{empirical_covariance}
True covariance:
{covariance}
""")
Empirical covariance:
[[1.96748266 1.02186592]
[1.02186592 3.98317796]]
True covariance:
[[2. 1.]
[1. 4.]]
In machine learning, the cross-entropy loss is frequently introduced without explicitly emphasizing its underlying connection to the likelihood of a categorical distribution. Understanding this link can greatly enhance one’s grasp of the loss and is the topic of this short post.
Consider an experiment in which we roll a (not necessarily fair) $K$-sided die. The result of this roll is an integer between $1$ and $K$ (inclusive) corresponding to the faces of the die. Let $q(k)$ be the probability of seeing the $k$-th face. What we have described here, in general, is a categorical random variable: a random variable which takes one of a finite number of values. Repeating this experiment multiple times yields IID random variables $X_{1},\ldots,X_{N}\sim\operatorname{Categorical}(q)$.
Performing this experiment a finite number of times $N$ does not allow us to introspect $q$ precisely, but it does allow us to estimate it. One way to approximate $q(k)$ is by counting the number of times the die face $k$ was observed and normalizing the result:
\[\begin{equation}\tag{1}\label{eq:empirical_pmf} p(k)=\frac{1}{N}\sum_{n}[X_{n}=k] \end{equation}\]where $[\cdot]$ is the Iverson bracket. Since $Y_{n}=[X_{n}=k]$ is itself a random variable (an indicator random variable), the law of large numbers tells us that $p(k)$ converges (a.s.) to $\mathbb{E}Y_{1}=\mathbb{P}(X_{n}=k)=q(k)$.
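A quick simulation illustrates this (a sketch; the die probabilities q below are made up):

import numpy as np

rng = np.random.default_rng(0)
q = np.array([0.1, 0.2, 0.3, 0.4])
# Draw N die rolls and form the empirical PMF p of equation (1); p approaches q as N grows
x = rng.choice(len(q), size=100_000, p=q)
p = np.bincount(x, minlength=len(q)) / len(x)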
The likelihood of $q$ is
\[\mathcal{L}(q)=\prod_{n}\prod_{k}q(k)^{[X_{n}=k]}=\prod_{k}q(k)^{\sum_{n}[X_{n}=k]}=\prod_{k}q(k)^{Np(k)}\]and hence its log-likelihood is
\[\ell(q)=\log\mathcal{L}(q)=\sum_{k}Np(k)\log q(k)\propto\sum_{k}p(k)\log q(k).\]Proposition. The MLE for the parameter of the categorical distribution is the empirical probability mass function \eqref{eq:empirical_pmf}.
Proof. Consider the program
\[\begin{align*} \min_{q} & -\ell(q)\\ \text{subject to} & \sum_{k}q(k)-1=0. \end{align*}\]The Karush–Kuhn–Tucker stationarity condition is
\[-\frac{p(k)}{q(k)}+\lambda=0\text{ for }k=1,\ldots,K.\]In other words, the MLE $\hat{q}$ is a multiple of $p$. Since the MLE needs to be a probability vector, $\hat{q} = p$.
The cross-entropy of $q$ relative to $p$ is
\[H(p, q) = - \mathbb{E}_{X \sim p} [ \log q(X) ].\]The choice of logarithm base determines the units: the natural logarithm gives nats, while the base-2 logarithm gives bits.
When $p$ and $q$ are probability mass functions (PMFs), the cross-entropy reduces to
\[H(p, q) = - \sum_x p(x) \log q(x)\]which is exactly the (negation of the) log-likelihood we encountered above. As such, one can intuit that minimizing $q$ in the cross-entropy yields a distribution that is similar to $p$. In other words, the cross-entropy is an asymmetric measure of dissimilarity between $q$ and $p$.
The Kullback–Leibler (KL) divergence is another such measure:
\[D_{\mathrm{KL}}(p\Vert q) =\mathbb{E}_{p}\left[\log\frac{p(X)}{q(X)}\right] =H(p,q) - H(p,p).\]Minimizing the KL divergence is the same as minimizing the cross-entropy, but the KL divergence satisfies some nice properties that one would expect of a measure of dissimilarity. In particular,
$D_{\mathrm{KL}}(p \Vert q) \geq 0$, with equality when $q = p$. We proved the first claim (nonnegativity) for PMFs by showing that the choice of $q = p$ minimizes the cross-entropy; the second claim is trivial.
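The nonnegativity (Gibbs' inequality) is easy to probe numerically; below is a small sketch with a made-up PMF p and randomly drawn PMFs q.

import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.3, 0.5])

def cross_entropy(p, q):
    return -(p * np.log(q)).sum()

# H(p, q) is minimized (over PMFs q) at q = p, so the KL divergence is nonnegative
qs = rng.dirichlet(np.ones_like(p), size=1_000)
assert all(cross_entropy(p, q) >= cross_entropy(p, p) for q in qs)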
Statistical classification is the problem of mapping each input datum $x \in \mathcal{X}$ to a class label $y = 1, \ldots, K$. For example, in the CIFAR-10 classification task, each $x$ is a 32x32 color image and $K = 10$, corresponding to ten distinct classes (e.g., airplanes, cats, trucks).
A common parametric estimator for image classification tasks such as CIFAR-10 is a neural network: a differentiable map $f: \mathcal{X} \rightarrow \mathbb{R}^K$. Note, in particular, that the network outputs a vector of real numbers. These are typically transformed to probabilities by way of the softmax function $\sigma$. In other words, for input $x$, $\hat{y} = \sigma(f(x))$ is a probability vector of size $K$. The $k$-th element of this vector is the “belief” that the network assigns to $x$ being a member of class $k$.
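To make the loss defined below concrete, here is a small NumPy sketch that computes it directly from raw network outputs, reusing the LogSumExp idea from the earlier softmax post (the array names are illustrative):

import numpy as np

def cross_entropy_loss(logits: np.ndarray, labels: np.ndarray) -> float:
    """Mean negative log-probability of the true classes.

    logits: (N, K) raw network outputs; labels: (N,) integer class labels in [0, K).
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()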
Given a set of observations $\mathcal{D} = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, the cross-entropy loss for this task is
\[L(\mathcal{D}) = \frac{1}{N}\sum_{n}H(p_{n},q_{n})\]where $q_{n}=\sigma(f(x_{n}))$ and $p_{n}$ is the probability mass function which places all of its mass on $y_{n}$. Expanding this, we obtain the perhaps more familiar representation
\[L(\mathcal{D}) = -\frac{1}{N}\sum_{n}[\log\sigma(f(x_{n}))]_{y_{n}}.\]

Let $A_n \equiv (a_{ij})$ be an $n \times n$ positive definite matrix with Cholesky decomposition $L_nL_n^*$. Next, consider expanding the size of this matrix (while maintaining positive definiteness):
\[A_{n+1}=\begin{pmatrix}A_{n} & a_{n+1,1:n}^{*}\\ a_{n+1,1:n} & a_{n+1,n+1} \end{pmatrix}.\]The notation $a_{n+1,1:n}$ signifies a row vector with $n$ entries (its conjugate transpose $a_{n+1,1:n}^{*}$ is the corresponding column vector). Suppose the Cholesky decomposition of $A_{n + 1}$ has the following form:
\[L_{n+1}=\begin{pmatrix}L_{n} & 0\\ \ell_{n+1,1:n} & \ell_{n+1,n+1} \end{pmatrix}.\]Simple algebra reveals that
\[L_{n+1}L_{n+1}^{*}=\begin{pmatrix}A_{n} & L_{n}\ell_{n+1,1:n}^{*}\\ \ell_{n+1,1:n}L_{n}^{*} & \ell_{n+1,1:n}\ell_{n+1,1:n}^{*}+\left|\ell_{n+1,n+1}\right|^{2} \end{pmatrix}.\]This reveals that we need to solve the equations
\[L_n \ell_{n+1,1:n}^* = a_{n+1,1:n}^*\]and
\[\ell_{n+1,n+1} = \sqrt{a_{n+1,n+1} - \Vert \ell_{n+1, 1:n}\Vert^2}\]to obtain the updated Cholesky decomposition. Since the former involves a triangular matrix, it can be solved by forward substitution in $O(n^2)$ floating point operations (FLOPs). The latter requires $O(n)$ FLOPs due to the norm.
import numpy as np
import scipy.linalg
def update_chol(chol: np.ndarray, new_vec: np.ndarray) -> np.ndarray:
    """Update the Cholesky factorization of a matrix for real inputs."""
    u = new_vec[:-1]
    α = new_vec[-1]
    v = scipy.linalg.solve_triangular(chol, u, lower=True)
    β = np.sqrt(α - v @ v)
    n = chol.shape[0]
    # WARNING: This is not efficient!
    new_chol = chol.copy()
    new_chol = np.pad(new_chol, [(0, 1), (0, 1)])
    new_chol[:-1, :-1] = chol
    new_chol[-1, :-1] = v
    new_chol[n, n] = β
    return new_chol
np.random.seed(42)
x = np.random.randn(5, 5)
a = x.T @ x
np.linalg.cholesky(a)
array([[ 1.72643986, 0. , 0. , 0. , 0. ],
[ 0.00926244, 1.9510639 , 0. , 0. , 0. ],
[-0.02770041, 0.34669923, 1.02437592, 0. , 0. ],
[ 0.10163684, 0.60454141, -0.41500106, 2.91668584, 0. ],
[ 0.31988585, 1.66212358, -1.17204427, 1.10508656, 0.39447333]])
chol = np.linalg.cholesky(a[:-1,:-1])
update_chol(chol, a[-1])
array([[ 1.72643986, 0. , 0. , 0. , 0. ],
[ 0.00926244, 1.9510639 , 0. , 0. , 0. ],
[-0.02770041, 0.34669923, 1.02437592, 0. , 0. ],
[ 0.10163684, 0.60454141, -0.41500106, 2.91668584, 0. ],
[ 0.31988585, 1.66212358, -1.17204427, 1.10508656, 0.39447333]])
Note that by applying the algorithm iteratively, it can be used to obtain the full Cholesky decomposition of a positive definite matrix $A_N \equiv (a_{ij})$. The base case is $L_1 = (\sqrt{a_{11}})$. Assuming each square root takes $c$ FLOPs, the total cost is
\[c + \sum_{n=1}^{N-1} \left( n^2 + n + 1 + c \right) = \frac{1}{3} N^{3} + \left( c + \frac{2}{3} \right) N - 1.\]In particular, the leading term shows that this algorithm is roughly half the complexity of Gaussian elimination applied to arbitrary (i.e., not necessarily positive definite) matrices.
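Here is a minimal sketch of that iterative construction, reusing update_chol and the matrix a defined above:

def chol_by_updates(a: np.ndarray) -> np.ndarray:
    """Build a full Cholesky factor by appending one row/column at a time."""
    chol = np.array([[np.sqrt(a[0, 0])]])
    for k in range(1, a.shape[0]):
        # Append row k of `a` (its first k + 1 entries) to the factor
        chol = update_chol(chol, a[k, :k + 1])
    return chol

np.allclose(chol_by_updates(a), np.linalg.cholesky(a))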
Let $V_{k}$ be a real $p\times k$ matrix consisting of the first $k$ principal components of $Y$. Just as we can interpret the rows of $Y$ as $p$-dimensional points $y_{1},\ldots,y_{N}$, we can interpret the rows $a_{1},\ldots,a_{N}$ of $YV_{k}$ as $k$-dimensional points in “PCA space”. A property of PCA space is that the coordinates are uncorrelated: \begin{equation} V_{k}^{\intercal}Y^{\intercal}YV_{k}-\frac{1}{N}V_{k}^{\intercal}Y^{\intercal}ee^{\intercal}YV_{k}=\Lambda_{k}-0=\Lambda_{k} \end{equation} where $\Lambda_{k}$ is the diagonal matrix consisting of the first $k$ eigenvalues of $Y^{\intercal}Y$.
Let $W_{k}=DV_{k}$. While $V_{k}$ has orthonormal columns, $W_{k}^{\intercal}W_{k}$ is generally dense. However, similarly to the above, the rows $b_{1},\ldots,b_{N}$ of the matrix $XW_{k}$ have uncorrelated coordinates: \begin{equation} W_{k}^{\intercal}X^{\intercal}XW_{k} =W_{k}^{\intercal}\left(YD+\frac{1}{N}ee^{\intercal}X\right)^{\intercal}\left(YD+\frac{1}{N}ee^{\intercal}X\right)W_{k} =V_{k}^{\intercal}Y^{\intercal}YV_{k}+\frac{1}{N}W_{k}^{\intercal}X^{\intercal}ee^{\intercal}XW_{k} =\Lambda_{k}+\frac{1}{N}W_{k}^{\intercal}X^{\intercal}ee^{\intercal}XW_{k}. \end{equation} Note, in particular, that because $X$ is not demeaned, the second term in the last equation is not necessarily equal to zero.