## Motivation

Consider, for example, that we want to perform ordinary least squares (OLS) to find a line of best fit between the points $x_1, \ldots, x_N$ in $p$ dimensional Euclidean space and labels $y_1, \ldots, y_N$.

When the number of predictors $p$ is large, it is possible to end up with a linear regression model with high variance (and hence higher than desirable prediction error). In addition, the resulting coefficients learned by the model may be harder to interpret than an alternative model with fewer predictors.

Principal component analysis (PCA) is a method for transforming points (such as $x_1, \ldots, x_N$ above) in a high dimensional space to points in a lower dimensional space by performing a sequence of scalar projections such that the resulting points account for as much of the original variance as possible.

Don’t worry if this sounds vague; we’ll make it precise below.

## Prerequisites

To understand this post, you will need to be familiar with the following concepts:

## A review of scalar projections

We use $\Vert \cdot \Vert$ to denote the Euclidean distance. Given vectors $x$ and $v$, the scalar projection of $x$ onto $v$ is $\Vert x \Vert \cos \theta$ where $\theta$ is the angle between the two vectors.

If $v$ is a unit vector (i.e., $\Vert v \Vert = 1$), the scalar projection can also be written $x \cdot v$ using the dot product.

## Principal components

Before we define the principal components, let’s introduce some equivalent representations of the data:

• Let $\mathbf{x}$ be random vector which takes on the values $x_1, \ldots, x_N$ with uniform probability.
• Let $X$ be a matrix with rows $x_1, \ldots, x_N$.

We assume that $\mathbf{x}$ has zero expectation. If this is not true, we can always just center the data by subtracting $\mathbb{E} \mathbf{x}$ from each point.

Moreover, We assume $N > p$ and $\operatorname{rank}(X) = p$. In the context of our motivating example of OLS, this means that there are more samples than there are predictors and that no predictors are redundant.

### First principal component

The first principal component is a unit vector $v_1$ along which the variance of the scalar projection of $\mathbf{x}$ onto $v_1$ is maximized. That is,

In other words, we are looking for the direction along which the data varies the most.

Remark. The first principal component need not be unique: there may be two or more maximizers of the variance. In this case, it is understood that “the” first principal component is picked according to some tie-breaking rule.

But how do we compute the first principal component? First, let’s obtain an equivalent expression for the variance:

Lemma. Let $A$ be a positive semidefinite matrix. Let $w$ be the maximizer of $v \mapsto v^\intercal A v$ over all (real) unit vectors. Then, $w$ is an eigenvector of $A$ whose corresponding eigenvalue is maximal.

Proof. Proceeding by the method of Lagrange multipliers, let $L(v) \equiv v^\intercal A v - \lambda (v^\intercal v - 1)$ where $\lambda$ is an arbitrary constant. Then, $\nabla L(v) \propto A v - \lambda v$. Since $w$ is a critical point of $L$, it follows that $w$ is an eigenvector of $A$. Moreover, denoting by $r$ the eigenvalue corresponding to $w$, since $w^\intercal A w = w^\intercal r w = r$, it follows that $r$ is maximal.

The above suggests a simple way to compute the first principal component: applying power iteration to $A \equiv X^\intercal X$. Power iteration is an algorithm which returns, under reasonable conditions, an eigenvector corresponding to the largest eigenvalue of the input matrix. The details of power iteration are outside the scope of this article.

### Remaining principal components

Given the first principal component $v_1$, we can transform our data so that all contributions in the $v_1$ direction are removed. In particular, for each point $x_i$, we can create a new point

Equivalently, we can represent this transformation in the matrix form

from which it is clear that this transformation is a projection with matrix $P \equiv I - v_1 v_1^\intercal$.

The second principal component $v_2$ of $X$ is defined as the first principal component of $X^{(2)}$. Once it is computed, we can, as above, “project out” its contributions by taking $X^{(3)} \equiv X^{(2)}(I - v_2 v_2^\intercal)$. Continuing in this way, we can define the third, fourth, etc. principal components.

For completeness, we summarize the above procedure. Let $X^{(1)} \equiv X$ and define

where $v_k$ is the first principal component of $X^{(k)}$. We call $v_k$ the $k$-th principal component of $X$.

Since each projection reduces the dimensionality of the space by one, it is guaranteed that $X^{(n+1)} = 0$. That is, it is only meaningful to talk about the first $n$ principal components.

## Basis transformation

### Lossless

Let $V$ denote the matrix whose columns consist of all $p$ principal components of $X$. We can transform our points into “PCA space” by right-multiplying by $V$:

This transformation is lossless: there is no reduction in dimensionality and we can transform from $Z$ back to $X$ by right-multiplying by $V^{-1} = V^\intercal$.

### Lossy

Recalling our motivation for studying PCA, we need a lossy transformation that reduces the dimensionality of the space. First, we pick the target dimension $% $. Since we would like to keep as much of the original variance as possible, we transform our points by right-multiplying by $V_k$, the $p \times k$ matrix whose columns are the first $k$ principal components of $X$: