Consider, for example, that we want to perform ordinary least squares (OLS) to find a line of best fit between the points $x_1, \ldots, x_n$ in $d$-dimensional Euclidean space and labels $y_1, \ldots, y_n \in \mathbb{R}$.
When the number of predictors is large, it is possible to end up with a linear regression model with high variance (and hence higher than desirable prediction error). In addition, the resulting coefficients learned by the model may be harder to interpret than an alternative model with fewer predictors.
Principal component analysis (PCA) is a method for transforming points (such as the $x_i$ above) in a high-dimensional space to points in a lower-dimensional space by performing a sequence of scalar projections such that the resulting points account for as much of the original variance as possible.
Don’t worry if this sounds vague; we’ll make it precise below.
To understand this post, you will need to be familiar with the following concepts:
A review of scalar projections
We use $\lVert \cdot \rVert$ to denote the Euclidean norm. Given vectors $u$ and $v$, the scalar projection of $u$ onto $v$ is $\lVert u \rVert \cos \theta$, where $\theta$ is the angle between the two vectors.
If $v$ is a unit vector (i.e., $\lVert v \rVert = 1$), the scalar projection can also be written using the dot product: $\lVert u \rVert \cos \theta = u \cdot v$.
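As a quick sanity check, the scalar projection is easy to compute numerically. The sketch below uses NumPy with two arbitrary example vectors:

```python
import numpy as np

u = np.array([3.0, 4.0])
v = np.array([1.0, 0.0])  # a unit vector

# General formula: ||u|| cos(theta) = (u . v) / ||v||
proj = np.dot(u, v) / np.linalg.norm(v)

# Since v is a unit vector, this reduces to the plain dot product.
assert np.isclose(proj, np.dot(u, v))
print(proj)  # 3.0
```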
Before we define the principal components, let’s introduce some equivalent representations of the data:
- Let $X$ be a random vector which takes on the values $x_1, \ldots, x_n$ with uniform probability.
- Let $\mathbf{X}$ be the $n \times d$ matrix with rows $x_1^\intercal, \ldots, x_n^\intercal$.
We assume that $X$ has zero expectation. If this is not true, we can always just center the data by subtracting the mean $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ from each point.
Moreover, we assume $n \geq d$ and $\operatorname{rank}(\mathbf{X}) = d$. In the context of our motivating example of OLS, this means that there are more samples than there are predictors and that no predictors are redundant.
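Concretely, the centering step and the matrix representation might look like the following sketch, where the data is randomly generated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3                 # n samples, d predictors, with n >= d
X = rng.normal(size=(n, d))   # rows are the points x_1, ..., x_n

# Center the data so the empirical expectation is zero.
X = X - X.mean(axis=0)

# Covariance matrix of the uniformly distributed random vector X.
cov = (X.T @ X) / n
assert np.allclose(X.mean(axis=0), 0.0)
```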
First principal component
The first principal component $w_1$ is a unit vector along which the variance of the scalar projection of $X$ onto $w_1$ is maximized. That is,

$$w_1 = \operatorname*{argmax}_{\lVert w \rVert = 1} \operatorname{Var}(X \cdot w).$$
In other words, we are looking for the direction along which the data varies the most.
Remark. The first principal component need not be unique: there may be two or more maximizers of the variance. In this case, it is understood that “the” first principal component is picked according to some tie-breaking rule.
But how do we compute the first principal component? First, let's obtain an equivalent expression for the variance: since $\mathbb{E}[X] = 0$,

$$\operatorname{Var}(X \cdot w) = \mathbb{E}[(X \cdot w)^2] = w^\intercal \mathbb{E}[X X^\intercal] w = w^\intercal \Sigma w$$

where $\Sigma = \mathbb{E}[X X^\intercal] = \frac{1}{n} \mathbf{X}^\intercal \mathbf{X}$ is the covariance matrix of $X$.
Lemma. Let $\Sigma$ be a positive semidefinite matrix. Let $w$ be a maximizer of $w^\intercal \Sigma w$ over all (real) unit vectors. Then, $w$ is an eigenvector of $\Sigma$ whose corresponding eigenvalue is maximal.
Proof. Proceeding by the method of Lagrange multipliers, let $L(w, \lambda) = w^\intercal \Sigma w - \lambda (w^\intercal w - 1)$ where $\lambda$ is an arbitrary constant. Then, $\nabla_w L = 2 \Sigma w - 2 \lambda w$. Since $w$ is a critical point of $L$, it follows that $\Sigma w = \lambda w$ and hence that $w$ is an eigenvector of $\Sigma$. Moreover, denoting by $\lambda$ the eigenvalue corresponding to $w$, since $w^\intercal \Sigma w = \lambda w^\intercal w = \lambda$, it follows that $\lambda$ is maximal.
The above suggests a simple way to compute the first principal component: applying power iteration to $\Sigma$. Power iteration is an algorithm which returns, under reasonable conditions, an eigenvector corresponding to the largest eigenvalue of the input matrix. The details of power iteration are outside the scope of this article.
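For the curious, a bare-bones version fits in a few lines. This is only a sketch; the tolerance, iteration cap, and random starting vector are arbitrary choices:

```python
import numpy as np

def power_iteration(A, num_iters=1000, tol=1e-10):
    """Return an approximate unit eigenvector of A for its largest eigenvalue."""
    w = np.random.default_rng(0).normal(size=A.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(num_iters):
        w_next = A @ w
        w_next /= np.linalg.norm(w_next)
        if np.linalg.norm(w_next - w) < tol:
            w = w_next
            break
        w = w_next
    return w

# Example: the dominant eigenvector of a diagonal matrix with distinct
# entries is the standard basis vector for the largest entry.
A = np.diag([3.0, 2.0, 1.0])
w = power_iteration(A)
print(np.abs(w))  # approximately [1, 0, 0]
```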
Remaining principal components
Given the first principal component $w_1$, we can transform our data so that all contributions in the direction $w_1$ are removed. In particular, for each point $x_i$, we can create a new point

$$x_i^{(2)} = x_i - (x_i \cdot w_1) w_1.$$
Equivalently, we can represent this transformation in the matrix form

$$\mathbf{X}^{(2)} = \mathbf{X} (I - w_1 w_1^\intercal)$$

from which it is clear that this transformation is a projection with matrix $I - w_1 w_1^\intercal$.
The second principal component $w_2$ of $\mathbf{X}$ is defined as the first principal component of $\mathbf{X}^{(2)}$. Once it is computed, we can, as above, "project out" its contributions by taking $\mathbf{X}^{(3)} = \mathbf{X}^{(2)} (I - w_2 w_2^\intercal)$. Continuing in this way, we can define the third, fourth, etc. principal components.
For completeness, we summarize the above procedure. Let $\mathbf{X}^{(1)} = \mathbf{X}$ and define

$$\mathbf{X}^{(j+1)} = \mathbf{X}^{(j)} (I - w_j w_j^\intercal)$$

where $w_j$ is the first principal component of $\mathbf{X}^{(j)}$. We call $w_j$ the $j$-th principal component of $\mathbf{X}$.
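This deflation procedure can be sketched as follows. For simplicity the first principal component at each step is computed with `numpy.linalg.eigh` (an eigendecomposition) rather than power iteration; either serves the same role here:

```python
import numpy as np

def first_pc(X):
    """First principal component: top unit eigenvector of X^T X."""
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)
    return eigvecs[:, -1]  # eigh sorts eigenvalues in ascending order

def principal_components(X):
    """All d principal components of (centered) X, computed by deflation."""
    d = X.shape[1]
    W = np.zeros((d, d))
    Xj = X.copy()
    for j in range(d):
        w = first_pc(Xj)
        W[:, j] = w
        # Project out the contribution of w before the next iteration.
        Xj = Xj @ (np.eye(d) - np.outer(w, w))
    return W

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X -= X.mean(axis=0)
W = principal_components(X)

# The principal components form an orthonormal set.
assert np.allclose(W.T @ W, np.eye(3), atol=1e-8)
```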
Since each projection reduces the dimensionality of the space by one, it is guaranteed that $\mathbf{X}^{(d+1)} = 0$. That is, it is only meaningful to talk about the first $d$ principal components.
Let $W$ denote the $d \times d$ matrix whose columns consist of all $d$ principal components of $\mathbf{X}$. We can transform our points into "PCA space" by right-multiplying by $W$:

$$\mathbf{Z} = \mathbf{X} W.$$
This transformation is lossless: there is no reduction in dimensionality and, since the principal components are orthonormal, we can transform from $\mathbf{Z}$ back to $\mathbf{X}$ by right-multiplying by $W^\intercal$.
Recalling our motivation for studying PCA, we need a lossy transformation that reduces the dimensionality of the space. First, we pick the target dimension $k < d$. Since we would like to keep as much of the original variance as possible, we transform our points by right-multiplying by $W_k$, the $d \times k$ matrix whose columns are the first $k$ principal components of $\mathbf{X}$:

$$\mathbf{Z}_k = \mathbf{X} W_k.$$
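Both the lossless and lossy transformations can be sketched as follows. Here the matrix of principal components is obtained directly from an eigendecomposition of the (unnormalized) covariance, with columns reordered by decreasing variance, and the target dimension is an arbitrary choice for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
X -= X.mean(axis=0)

# Columns of W are the principal components, ordered by decreasing variance.
# np.linalg.eigh returns eigenvalues in ascending order, so reverse the columns.
_, eigvecs = np.linalg.eigh(X.T @ X)
W = eigvecs[:, ::-1]

# Lossless: W is orthogonal, so right-multiplying by W.T round-trips exactly.
assert np.allclose((X @ W) @ W.T, X)

# Lossy: keep only the first k principal components.
k = 2
W_k = W[:, :k]    # d x k matrix of the first k principal components
Z_k = X @ W_k     # points in the reduced, k-dimensional "PCA space"
assert Z_k.shape == (50, 2)
```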