Principal component analysis - part two

Introduction

In the first post in this series, we outlined the motivation and theory behind principal component analysis (PCA), which takes points $x_1, \ldots, x_N$ in a high dimensional space to points in a lower dimensional space while preserving as much of the original variance as possible.

In this follow-up post, we apply principal components regression (PCR), an algorithm which includes PCA as a subroutine, to a small dataset to demonstrate the ideas in practice.

Prerequisites

To understand this post, you will need to be familiar with the following concepts:

PCA (see the first post in this series)
ordinary least squares

Ordinary least squares

In ordinary least squares (OLS), we want to find a line of best fit between the points $x_1, \ldots, x_N$ and the labels $y_1, \ldots, y_N$.

Denoting by $X$ the matrix whose rows are the points and $y$ the vector whose entries are the labels, the intercept $\alpha$ and slope (a.k.a. gradient) $\beta$ are obtained by minimizing $\Vert \alpha + X \beta - y \Vert$. Some matrix calculus reveals that the minimum is obtained at the values of $\alpha$ and $\beta$ for which

\[N \alpha = y^\intercal e - \beta^\intercal X^\intercal e\]

and

\[X^\intercal X \beta = X^\intercal y - \alpha X^\intercal e\]

where $e$ is the vector of all ones.

Principal components regression

The idea behind PCR is simple: instead of doing OLS on the high dimensional space, we first map the points to a lower dimensional space obtained by PCA and then do OLS. In more detail, we

pick a positive integer $k < p$,
construct the matrix $V_k$ whose columns are the first $k$ principal components of $X$,
compute $Z_k = X V_k$, a matrix whose rows are the original points transformed to a lower dimensional “PCA space”, and
perform OLS to find a line of best fit between the transformed points and $y$.

By the previous section, we know that the minimum is obtained at the values of the intercept $\alpha_k$ and gradient $\beta_k$ for which

\[N \alpha_k = y^\intercal e - \beta_k^\intercal Z_k^\intercal e\]

and

\[Z_k^\intercal Z_k \beta_k = Z_k^\intercal y - \alpha_k Z_k^\intercal e\]

Once we have solved these equations for $\alpha_k$ and $\beta_k$, we can predict the label $\hat{y}$ corresponding to a new sample $x$ as $\hat{y} = \alpha_k + x^\intercal V_k \beta_k$.

Computational considerations

Due to the result below, the linear system involving $\alpha_k$ and $\beta_k$ is a (permuted) arrowhead matrix. As such, the system can be solved efficiently.

Lemma. $Z_k^\intercal Z_k = \Sigma_k^2$ where $\Sigma_k$ is the $k \times k$ diagonal matrix whose entries are the first $k$ principal components of $X$ in descending order.

Proof. Let $v_j$ denote the $j$-th column of $V_k$. Since $v_j$ is a principal component of $X$, it is also an eigenvector of $X^\intercal X$ with eigenvalue $\sigma_j^2$, the square of the $j$-th singular value. Therefore, the $(i, j)$-th entry of $Z_k^\intercal Z_k$ is

\[(X v_i)^\intercal (X v_j) = v_i^\intercal X^\intercal X v_j = \sigma_j^2 v_i^\intercal v_j = \sigma_j^2 \delta_{ij}.\]

where $\delta_{ij}$ is the Kronecker delta.

The California housing dataset

The California housing dataset from [1] has 20,640 samples and 8 predictors. For each $k \leq p = 8$, we fit using PCR on 80% of the samples (the training set) and report the root mean squared error (RMSE) on both the training set and the remaining 20% samples (the test set).

from sklearn.datasets import fetch_california_housing
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
import numpy as np


np.random.seed(1)
X_all, y_all = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=0.2)


def fit(degree: int = 1):
    """Fits a PCR model to the training data.

    Parameters
    ----------
    degree: Generates all polynomial combinations of the features up to the specified degree.

    Returns
    -------
    An `sklearn.pipeline.Pipeline` object corresponding to a learned model.
    """
    pipeline = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree, include_bias=False),
        PCA(),
        LinearRegression(),
    )
    pipeline.fit(X_train, y_train)
    return pipeline


def rmses(model, X, y):
    """Calculates the RMSE of the model using only the first k PCs for each possible value of k.

    Parameters
    ----------
    model: The `sklearn.pipeline.Pipeline` object returned by the `fit` method.
    X: An (N, p) matrix of features.
    y: The corresponding (N) vector of targets.

    Returns
    -------
    A list of RMSEs (the first entry corresponds to k=1 and the last to k=p).
    """
    pca = model[:-1]
    Z = pca.transform(X)
    regression = model[-1]
    results = []
    for k in range(1, Z.shape[1] + 1):
        pred = regression.intercept_ + Z[:, :k] @ regression.coef_[:k]
        rmse = mean_squared_error(y, pred, squared=False)
        results.append(rmse)
    return results


model = fit()
rmses_train = rmses(model, X_train, y_train)
rmses_test = rmses(model, X_test, y_test)

Rank (k)	Training set RMSE (in $100,000s)	Test set RMSE (in $100,000s)
1	1.15492	1.14608
2	1.14128	1.12491
3	1.13947	1.12457
4	0.853976	0.844071
5	0.853972	0.844123
6	0.810381	0.807466
7	0.729905	0.731179
8	0.723369	0.72742

Both training and test set RMSEs are (roughly) decreasing functions of the rank. This suggests that using all 8 predictors does not cause overfitting.

Deriving predictors

One way to reduce the test set RMSE is to introduce more predictors into the model. Consider, as a toy example, a dataset where each sample $x_i$ has only three predictors: $x_i \equiv (a_i, b_i, c_i)$. We can replace each sample $x_i$ by a new sample $x_i^\prime \equiv (a_i, b_i, c_i, a_i^2, a_i b_i, a_i c_i, b_i^2, b_i c_i, c_i^2)$. In particular, we have added all possible quadratic monomials in $a_i, b_i, c_i$. These new entries are referred to as “derived” predictors. Note that derived predictors need not be quadratic, or even monomials; any function of the original predictors is referred to as a derived predictor.

Returning to the dataset, we add all cubic monomials. It is reasonable to expect that unlike OLS applied to $X$, OLS applied to the derived matrix $X^\prime$ will almost certainly overfit. We plot the results (obtained by setting degree=3 in the previous fitting code) of PCR below, observing the effects of overfitting at large values of the rank.

model = fit(degree=3)
rmses_train = rmses(model, X_train, y_train)
rmses_test = rmses(model, X_test, y_test)

png

Bibliography

[1] Pace, R. Kelley, and Ronald Barry. “Sparse spatial autoregressions.” Statistics & Probability Letters 33.3 (1997): 291-297.

Parsiad Azimzadeh