## 1.

It is easier to work in the multivariate setting for this proof. In light of this, let $$X_{i}$$ be a random $$p$$-dimensional vector. Define $$X_{-0}$$ as the $$n\times p$$ matrix whose rows are $$X_{i}^{\intercal}$$. Augment this matrix to obtain $$X=(e\mid X_{-0})$$ where $$e$$ is the vector of ones, corresponding to a design matrix with a bias column. Let $$Y$$ be the vector whose coordinates are $$Y_{i}$$.

Using the fact that $$\sum_{i}\hat{\epsilon}_{i}^{2}=\Vert Y-X\hat{\beta}\Vert^{2}$$ and matrix calculus, it is straightforward to show that the RSS is minimized when $$\hat{\beta}$$ is chosen to satisfy the linear system

$X^{\intercal}X\hat{\beta}=X^{\intercal}Y.$

Note that

$X^{\intercal}Y=\begin{pmatrix}e^{\intercal}Y\\ X_{-0}^{\intercal}Y \end{pmatrix}=\begin{pmatrix}n\overline{Y}\\ X_{-0}^{\intercal}Y \end{pmatrix}$

and

$X^{\intercal}X=\begin{pmatrix}n & e^{\intercal}X_{-0}\\ X_{-0}^{\intercal}e & X_{-0}^{\intercal}X_{-0} \end{pmatrix}.$

Let $$\hat{\beta}=(\hat{\beta}_{0}\mid\hat{\beta}_{-0})$$ where $$\hat{\beta}_{0}$$ is a scalar. The first row of the linear system yields

$\hat{\beta}_{0}=\overline{Y}-\frac{1}{n}e^{\intercal}X_{-0}\hat{\beta}_{-0}.$

Since $$e^{\intercal}X_{-0}=n\overline{X}$$ when $$p=1$$, the above is equivalent to Eq. (13.6). Substituting the above into the second row of the linear system yields

$\left(X_{-0}^{\intercal}X_{-0}-\frac{1}{n}X_{-0}^{\intercal}ee^{\intercal}X_{-0}\right)\hat{\beta}_{-0}=X_{-0}^{\intercal}Y-X_{-0}^{\intercal}e\overline{Y}.$

If $$p=1$$, the above simplifies to

$\left(\sum_{i}X_{i}^{2}-n\overline{X}^{2}\right)\hat{\beta}_{1}=\sum_{i}X_{i}Y_{i}-n\overline{X}\overline{Y}$

which, with some work, can be shown to be equivalent to Eq. (13.5).
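
As a sanity check, the matrix solution of the normal equations can be compared numerically with the closed-form simple-regression estimates. A minimal sketch using NumPy; the coefficients, noise level, and sample size below are arbitrary choices made only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative univariate (p = 1) data; the true coefficients are made up.
n = 50
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)

# Augmented design matrix X = (e | X_{-0}).
X = np.column_stack([np.ones(n), x])

# Solve the normal equations X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Closed-form simple-regression estimates derived above.
beta1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
beta0 = y.mean() - beta1 * x.mean()

assert np.allclose(beta_hat, [beta0, beta1])
```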

Next, denoting by $$\hat{\epsilon}$$ the vector with coordinates $$\hat{\epsilon}_{i}$$, we have

$\hat{\epsilon}=Y-X\hat{\beta}=MY$

where $$M=I-X(X^{\intercal}X)^{-1}X^{\intercal}$$. Note that $$MX=X-X(X^{\intercal}X)^{-1}X^{\intercal}X=0$$. Denoting by $$\epsilon$$ the vector with coordinates $$\epsilon_{i}$$ and $$\beta$$ the vector of true coefficients,

$\hat{\epsilon}=MY=M(X\beta+\epsilon)=M\epsilon.$

Using the fact that $$M$$ is both symmetric and idempotent,

$\mathrm{RSS}=\sum_{i}\hat{\epsilon}_{i}^{2}=\hat{\epsilon}^{\intercal}\hat{\epsilon}=\epsilon^{\intercal}M^{\intercal}M\epsilon=\epsilon^{\intercal}M\epsilon.$

For brevity, we abuse notation by writing $$\mathbb{E} f$$ to mean $$\mathbb{E}[f\mid X]$$. Then,

$\mathbb{E}\left[\mathrm{RSS}\right]=\mathbb{E}\left[\epsilon^{\intercal}M\epsilon\right]=\operatorname{tr}(M\mathbb{E}\left[\epsilon\epsilon^{\intercal}\right]).$

Assuming that the $$\epsilon_{i}$$ have mean zero and variance $$\sigma^{2}$$, and that $$\epsilon_{i}$$ and $$\epsilon_{j}$$ are independent whenever $$i\neq j$$, we have $$\mathbb{E}[\epsilon\epsilon^{\intercal}]=\sigma^{2}I$$ and hence

$\mathbb{E}\left[\mathrm{RSS}\right]=\sigma^{2}\operatorname{tr}(M).$

Moreover, using the cyclic property of the trace,

$\operatorname{tr}(M)=\operatorname{tr}(I_{n\times n})-\operatorname{tr}(X^{\intercal}X(X^{\intercal}X)^{-1})=\operatorname{tr}(I_{n\times n})-\operatorname{tr}(I_{(p+1)\times (p+1)})=n-\left(p+1\right),$

establishing that (13.7) is an unbiased estimator of the noise variance.
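
This unbiasedness claim is easy to check by simulation: averaging the RSS over repeated draws of the noise (with the design held fixed) and dividing by $$n-p-1$$ should recover $$\sigma^{2}$$. A minimal sketch, with an arbitrary design and arbitrary true parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed, arbitrary design: p = 3 covariates plus a bias column.
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = rng.normal(size=p + 1)
sigma = 1.5

def rss_once():
    eps = rng.normal(scale=sigma, size=n)
    y = X @ beta + eps
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    return resid @ resid

rss_mean = np.mean([rss_once() for _ in range(20000)])
print(rss_mean / (n - p - 1), sigma**2)  # the two numbers should nearly agree
```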

## 2.

We continue to use the notation established in the answer to the first exercise. First, note that

$\mathbb{E}Y=\mathbb{E}\left[X\beta+\epsilon\right]=X\beta$

and

$\mathbb{E}\left[YY^{\intercal}\right]=\mathbb{E}\left[\left(X\beta+\epsilon\right)\left(X\beta+\epsilon\right)^{\intercal}\right]=\mathbb{E}\left[X\beta\beta^{\intercal}X^{\intercal}+X\beta\epsilon^{\intercal}+\epsilon\beta^{\intercal}X^{\intercal}+\epsilon\epsilon^{\intercal}\right]=X\beta\beta^{\intercal}X^{\intercal}+\sigma^{2}I.$

Therefore,

$\mathbb{E}\hat{\beta}=\left(X^{\intercal}X\right)^{-1}X^{\intercal}\mathbb{E}\left[Y\right]=\beta$

and

$\begin{multline*} \mathbb{E}\left[\hat{\beta}\hat{\beta}^{\intercal}\right]=\mathbb{E}\left[\left(X^{\intercal}X\right)^{-1}X^{\intercal}YY^{\intercal}X\left(X^{\intercal}X\right)^{-1}\right]\\ =\left(X^{\intercal}X\right)^{-1}X^{\intercal}\mathbb{E}\left[YY^{\intercal}\right]X\left(X^{\intercal}X\right)^{-1}=\beta\beta^{\intercal}+\sigma^{2}\left(X^{\intercal}X\right)^{-1}. \end{multline*}$

Combining the above yields

$\mathbb{V}(\hat{\beta})=\mathbb{E}\left[\hat{\beta}\hat{\beta}^{\intercal}\right]-\mathbb{E}\left[\hat{\beta}\right]\mathbb{E}\left[\hat{\beta}\right]^{\intercal}=\sigma^{2}\left(X^{\intercal}X\right)^{-1}.$

In the univariate case, the form

$X^{\intercal}X=\begin{pmatrix}n & n\overline{X}\\ n\overline{X} & \sum_{i}X_{i}^{2} \end{pmatrix}$

can be used to derive a closed form expression for the inverse which in turn yields (13.11) as desired.
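
The identity $$\mathbb{V}(\hat{\beta})=\sigma^{2}(X^{\intercal}X)^{-1}$$ can also be verified by Monte Carlo, comparing the empirical covariance of $$\hat{\beta}$$ over repeated noise draws (design held fixed) with the theoretical matrix. A sketch with made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(2)

# Arbitrary fixed design; we condition on X throughout.
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, -2.0])
sigma = 0.7

XtX_inv = np.linalg.inv(X.T @ X)

draws = []
for _ in range(50000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    draws.append(XtX_inv @ X.T @ y)   # least squares estimate for this draw
draws = np.array(draws)

print(np.cov(draws, rowvar=False))    # empirical covariance of beta-hat
print(sigma**2 * XtX_inv)             # theoretical sigma^2 (X^T X)^{-1}
```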

## 3.

A univariate regression through the origin is a special case of the multivariate regression seen in Exercise 1. It has least squares coefficient

$\frac{\sum_i X_i Y_i}{\sum_i X_i^2}.$

This is well-defined whenever at least one of the $$X_i$$ is nonzero.

The variance of this coefficient is also a special case of the variance for the multivariate case seen in Exercise 2. It is

$\frac{\sigma^2}{\sum_i X_i^2}.$

Its standard error is the square root of this quantity, with $$\sigma$$ replaced by $$\hat{\sigma}$$. The least squares estimate coincides with the MLE under normal errors; more directly, since it is unbiased with variance $$\sigma^{2}/\sum_{i}X_{i}^{2}$$, Chebyshev's inequality shows that it is consistent provided $$\sum_{i}X_{i}^{2}\rightarrow\infty$$ as $$n\rightarrow\infty$$.
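
A quick numerical illustration of consistency, with an arbitrary slope and standard normal covariates (so $$\sum_{i}X_{i}^{2}\rightarrow\infty$$ almost surely):

```python
import numpy as np

rng = np.random.default_rng(3)
beta, sigma = 1.7, 1.0  # arbitrary true slope and noise level

for n in (10, 100, 10000, 1000000):
    x = rng.normal(size=n)             # E[X^2] = 1 > 0, so sum X_i^2 -> infinity
    y = beta * x + rng.normal(scale=sigma, size=n)
    beta_hat = np.sum(x * y) / np.sum(x**2)
    print(n, beta_hat)                 # estimates approach beta = 1.7
```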

## 4.

Using the fact that $$Y_{i}^{*}$$ has the same distribution as $$Y_{i}$$ but is independent of $$\hat{Y}_{i}(S)$$,

\begin{align*} \mathbb{E}\left[\hat{R}_{\mathrm{tr}}(S)\right]-R(S) & =\sum_{i}\mathbb{E}\left[\left(\hat{Y}_{i}(S)-Y_{i}\right)^{2}-\left(\hat{Y}_{i}(S)-Y_{i}^{*}\right)^{2}\right]\\ & =\sum_{i}\mathbb{E}\left[\hat{Y}_{i}(S)^{2}-2\hat{Y}_{i}(S)Y_{i}+Y_{i}^{2}-\hat{Y}_{i}(S)^{2}+2\hat{Y}_{i}(S)Y_{i}^{*}-\left(Y_{i}^{*}\right)^{2}\right]\\ & =\sum_{i}\left(-2\mathbb{E}\left[\hat{Y}_{i}(S)Y_{i}\right]+\mathbb{E}\left[Y_{i}^{2}\right]+2\mathbb{E}\left[\hat{Y}_{i}(S)Y_{i}^{*}\right]-\mathbb{E}\left[\left(Y_{i}^{*}\right)^{2}\right]\right)\\ & =-2\sum_{i}\left(\mathbb{E}\left[\hat{Y}_{i}(S)Y_{i}\right]-\mathbb{E}\left[\hat{Y}_{i}(S)\right]\mathbb{E}\left[Y_{i}\right]\right)\\ & =-2\sum_{i}\operatorname{Cov}(\hat{Y}_{i}(S),Y_{i}). \end{align*}
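
When $$\hat{Y}(S)$$ is a least squares fit at a fixed design, $$\hat{Y}=HY$$ with $$H$$ the hat matrix, so the right-hand side equals $$-2\sigma^{2}\operatorname{tr}(H)$$; this special case makes the identity easy to check by simulation. A sketch (the least squares setting and all numeric choices are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

# Fixed design; for least squares, Yhat = H Y with H the hat matrix, so
# -2 sum_i Cov(Yhat_i, Y_i) = -2 sigma^2 tr(H) = -2 sigma^2 (p + 1).
n, p, sigma = 25, 2, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = rng.normal(size=p + 1)
H = X @ np.linalg.inv(X.T @ X) @ X.T

diffs = []
for _ in range(20000):
    y = X @ beta + rng.normal(scale=sigma, size=n)       # training responses
    y_star = X @ beta + rng.normal(scale=sigma, size=n)  # fresh responses at the same X
    y_hat = H @ y
    diffs.append(np.sum((y_hat - y)**2) - np.sum((y_hat - y_star)**2))

print(np.mean(diffs))               # Monte Carlo estimate of E[R_tr(S)] - R(S)
print(-2 * sigma**2 * np.trace(H))  # -2 sum_i Cov(Yhat_i, Y_i)
```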

## 5.

Let $$\hat{\delta}=\hat{\beta}_{1}-17\hat{\beta}_{0}$$. By Theorem 13.8,

$\mathbb{V}(\hat{\delta})=\mathbb{V}(\hat{\beta}_{1})+17^{2}\mathbb{V}(\hat{\beta}_{0})-2\cdot17\operatorname{Cov}(\hat{\beta}_{0},\hat{\beta}_{1})=\frac{\sigma^{2}}{ns_{X}^{2}}\left(1+2\cdot17\,\overline{X}+\frac{17^{2}}{n}\sum_{i}X_{i}^{2}\right).$

Replacing $$\sigma$$ by $$\hat{\sigma}$$ and taking square roots yields $$\hat{\operatorname{se}}(\hat{\delta})$$. The Wald statistic is $$W=\hat{\delta}/\hat{\operatorname{se}}(\hat{\delta})$$.
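
A short numerical illustration: simulating data under the null $$\beta_{1}=17\beta_{0}$$ and computing $$W$$ from the formula above should give values that look standard normal. All data-generating choices below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(5)

# Made-up data satisfying the null beta_1 = 17 * beta_0 (beta_0 = 0.5, beta_1 = 8.5).
n = 200
x = rng.normal(loc=1.0, size=n)
y = 0.5 + 8.5 * x + rng.normal(size=n)

xbar, ybar = x.mean(), y.mean()
s2 = np.mean((x - xbar)**2)                          # s_X^2
b1 = np.sum((x - xbar) * (y - ybar)) / (n * s2)
b0 = ybar - b1 * xbar
sigma2_hat = np.sum((y - b0 - b1 * x)**2) / (n - 2)  # estimate of sigma^2

delta = b1 - 17 * b0
var_delta = sigma2_hat / (n * s2) * (1 + 2 * 17 * xbar + 17**2 * np.sum(x**2) / n)
W = delta / np.sqrt(var_delta)
print(W)  # approximately standard normal under the null
```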

## 6.

TODO (Computer experiment).

## 7.

TODO (Computer experiment).

## 8.

Maximizing $$\mathrm{AIC}$$ is equivalent to minimizing $$-2\sigma^{2}\,\mathrm{AIC}$$ (since $$\sigma^{2}>0$$). This in turn is equivalent to minimizing Mallows' $$C_{p}$$ statistic since

\begin{align*} -2\sigma^{2}\mathrm{AIC} & =-2\sigma^{2}\ell_{S}+2\left|S\right|\sigma^{2}\\ & =-2\sigma^{2}\left\{ -\frac{n}{2}\log(2\pi)-n\log\sigma-\frac{1}{2\sigma^{2}}\sum_{i}\left(\hat{Y}_{i}(S)-Y_{i}\right)^{2}\right\} +2\left|S\right|\sigma^{2}\\ & =\text{const.}+\sum_{i}\left(\hat{Y}_{i}(S)-Y_{i}\right)^{2}+2\left|S\right|\sigma^{2}\\ & =\text{const.}+C_{p}. \end{align*}
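
A quick check that the two criteria order candidate models identically (here $$\sigma$$ is treated as known, as in the derivation above; the data and the nested candidate models are made up):

```python
import numpy as np

rng = np.random.default_rng(6)

# Made-up data with two irrelevant covariates; sigma is treated as known.
n, sigma = 60, 1.0
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X_full[:, :2] @ np.array([1.0, 2.0]) + rng.normal(scale=sigma, size=n)

def rss(cols):
    Xs = X_full[:, cols]
    b, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return np.sum((y - Xs @ b)**2)

models = [[0, 1], [0, 1, 2], [0, 1, 2, 3]]
for cols in models:
    r = rss(cols)
    loglik = -n/2*np.log(2*np.pi) - n*np.log(sigma) - r/(2*sigma**2)
    aic = loglik - len(cols)
    cp = r + 2*len(cols)*sigma**2
    print(cols, round(aic, 3), round(cp, 3))  # highest AIC coincides with lowest Cp
```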

## 9.

Choosing the model with the highest AIC is equivalent to choosing the model with the lowest Mallows' $$C_{p}$$ statistic. The two models have Mallows' statistics $$C_{p}^{0}=\sum_{i}X_{i}^{2}$$ and $$C_{p}^{1}=[\sum_{i}(X_{i}-\hat{\theta})^{2}]+2$$ with $$\hat{\theta}=\overline{X}$$. Note that

$C_{p}^{0}-C_{p}^{1}=\sum_{i}X_{i}^{2}-\sum_{i}\left(X_{i}-\hat{\theta}\right)^{2}-2=n\hat{\theta}^{2}-2.$

Therefore, $$\mathcal{M}_{0}$$ is picked if and only if $$\hat{\theta}^2 < 2/n$$.

### a)

First, note that $$\hat{\theta} \sim N(\theta,1/n)$$. If $$\theta = 0$$, then

$\mathbb{P}(J_{n}=0) =\mathbb{P}(|\hat{\theta}|<\sqrt{2}n^{-1/2}) =\mathbb{P}(\left|Z\right|<\sqrt{2}) =2\Phi(\sqrt{2})-1\approx0.8427.$

If $$\theta\neq0$$, then

$\begin{multline*} \mathbb{P}(J_{n}=0) =\mathbb{P}(|\hat{\theta}|<\sqrt{2}n^{-1/2}) =\mathbb{P}(|Zn^{-1/2}+\theta|<\sqrt{2}n^{-1/2})\\ =\mathbb{P}(-\sqrt{2}-\theta\sqrt{n}<Z<\sqrt{2}-\theta\sqrt{n}) =\Phi(\sqrt{2}-\theta\sqrt{n})-\Phi(-\sqrt{2}-\theta\sqrt{n}) \rightarrow 0. \end{multline*}$
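
Both cases are easy to confirm numerically, either by evaluating $$2\Phi(\sqrt{2})-1$$ directly or by Monte Carlo over $$\hat{\theta}\sim N(\theta,1/n)$$. A sketch using SciPy, with an arbitrary nonzero $$\theta$$ for the second case:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

# theta = 0: the probability is 2*Phi(sqrt(2)) - 1 for every n.
print(2 * norm.cdf(np.sqrt(2)) - 1)          # ~0.8427

# Monte Carlo check of P(J_n = 0) for theta = 0 and for theta = 0.5 (arbitrary).
for theta in (0.0, 0.5):
    for n in (10, 100, 1000):
        theta_hat = theta + rng.normal(size=100000) / np.sqrt(n)
        print(theta, n, np.mean(theta_hat**2 < 2 / n))
```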

### b)

Let $$\mu=\hat{\theta}I_{\{J_{n}=1\}}$$ so that

$\hat{f}_{n}(x)=\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{\left(x-\mu\right)^{2}}{2}\right).$

Let $$Z\sim N(0,1)$$. The KL distance between $$\phi_{0}$$ and $$\hat{f}_{n}$$ is

\begin{align*} D(\phi_{0},\hat{f}_{n}) & =\int\phi_{0}(z)\left(\log\phi_{0}(z)-\log\hat{f}_{n}(z)\right)dz\\ & =\mathbb{E}\left[\log\phi_{0}(Z)-\log\hat{f}_{n}(Z)\right]\\ & =\frac{1}{2}\mathbb{E}\left[-Z^{2}+\left(Z-\mu\right)^{2}\right]\\ & =\frac{1}{2}\mathbb{E}\left[-2\mu Z+\mu^{2}\right]=\frac{1}{2}\mu^{2}. \end{align*}

If $$\theta=0$$, this quantity converges to zero in probability since

$\mathbb{P}(\mu^{2}>\epsilon)=\mathbb{P}(\hat{\theta}^{2}I_{\{J_{n}=1\}}>\epsilon)\leq\mathbb{P}(\hat{\theta}^{2}>\epsilon)=\mathbb{P}(|Z|>\sqrt{n\epsilon})\rightarrow0.$

Next, the KL distance between $$\phi_{\hat{\theta}}$$ and $$\hat{f}_{n}$$ is

\begin{align*} D(\phi_{\hat{\theta}},\hat{f}_{n}) & =\int\phi_{\hat{\theta}}(x)\left(\log\phi_{\hat{\theta}}(x)-\log\hat{f}_{n}(x)\right)dx\\ & =\int\phi_{0}(z)\left(\log\phi_{0}(z)-\log\hat{f}_{n}(z+\hat{\theta})\right)dz\\ & =\mathbb{E}\left[\log\phi_{0}(Z)-\log\hat{f}_{n}(Z+\hat{\theta})\right]\\ & =\frac{1}{2}\mathbb{E}\left[-Z^{2}+\left(Z+\hat{\theta}-\mu\right)^{2}\right]\\ & =\frac{1}{2}\mathbb{E}\left[2\left(\hat{\theta}-\mu\right)Z+\hat{\theta}^{2}-2\hat{\theta}\mu+\mu^{2}\right]\\ & =\frac{1}{2}\left(\hat{\theta}^{2}-2\hat{\theta}\mu+\mu^{2}\right)=\frac{1}{2}\left(\hat{\theta}-\mu\right)^{2}. \end{align*}

By the LLN, $$\hat{\theta}$$ converges to $$\theta$$ in probability. Suppose that $$\theta\neq0$$. Our findings in Part (a) imply that $$I_{\{J_{n}=1\}}$$ converges to one in probability. Therefore, by Theorem 5.5, $$\mu$$ converges to $$\theta$$ in probability and hence $$D(\phi_{\hat{\theta}},\hat{f}_{n})$$ converges to zero in probability.
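
The two closed-form KL expressions can be confirmed by numerical integration, plugging arbitrary values for $$\hat{\theta}$$ and $$\mu$$ into densities evaluated with SciPy:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def kl(p, q):
    """KL divergence between two scipy.stats frozen densities via quadrature."""
    integrand = lambda x: p.pdf(x) * (p.logpdf(x) - q.logpdf(x))
    return quad(integrand, -np.inf, np.inf)[0]

# Arbitrary values of theta-hat and mu for the check.
theta_hat, mu = 0.8, 0.3
f_hat = norm(loc=mu, scale=1)

print(kl(norm(0, 1), f_hat), 0.5 * mu**2)                        # ~ equal
print(kl(norm(theta_hat, 1), f_hat), 0.5 * (theta_hat - mu)**2)  # ~ equal
```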

### c)

Noting that the only difference between the AIC and BIC criteria is replacing the penalty of $$2$$ by $$\log n$$, we can conclude that if $$\theta=0$$, then

$\mathbb{P}(J_{n}=0)=2\Phi(\sqrt{\log n})-1\rightarrow1.$

Recall that even in the limit, the corresponding quantity for AIC was not one. Similarly, if $$\theta\neq0$$, then

$\mathbb{P}(J_{n}=0)=\Phi(\sqrt{\log n}-\theta\sqrt{n})-\Phi(-\sqrt{\log n}-\theta\sqrt{n})\rightarrow0.$

The limiting KL distances are also as before.
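
Evaluating $$2\Phi(\sqrt{\log n})-1$$ for a few values of $$n$$ illustrates the (slow) convergence to one:

```python
import numpy as np
from scipy.stats import norm

# theta = 0: under the BIC penalty, P(J_n = 0) = 2*Phi(sqrt(log n)) - 1 -> 1.
for n in (10, 100, 10_000, 10**8):
    print(n, 2 * norm.cdf(np.sqrt(np.log(n))) - 1)
```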

## 10.

### a)

Suppose $$\epsilon\sim N(0,\sigma^{2})$$. Since $$\epsilon$$ is independent of $$\hat{\theta}$$ (recall that $$X_{*}$$ corresponds to a new observation that was not used to fit the model),

$\frac{Y_{*}-\hat{Y}_{*}}{s}=-\frac{\hat{\theta}-\theta}{s}+\frac{\epsilon}{s}\approx N\biggl(0,1+\frac{\sigma^{2}}{s^2}\biggr).$

### b)

Similarly to Part (a),

$\begin{multline*} \frac{Y_{*}-\hat{Y}_{*}}{\xi_{n}}=-\frac{\hat{\theta}-\theta}{\xi_{n}}+\frac{\epsilon}{\xi_{n}}=-\frac{\hat{\theta}-\theta}{s}\frac{s}{\sqrt{s^{2}+\sigma^{2}}}+\frac{\epsilon}{\sqrt{s^{2}+\sigma^{2}}}\\ \approx N\biggl(0,\frac{s^{2}}{s^{2}+\sigma^{2}}\biggr)+N\biggl(0,\frac{\sigma^{2}}{s^{2}+\sigma^{2}}\biggr)=N(0,1). \end{multline*}$
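
A coverage simulation illustrates the point of this exercise. The sketch below specializes to the simplest instance, a scalar mean $$\theta$$ estimated by the sample mean with $$s$$ its estimated standard error; this specialization (and every numeric choice) is an assumption of the sketch, not part of the exercise statement:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)

# Simplest instance: Y = theta + eps, theta-hat the sample mean, s its
# estimated standard error.  Model, n, sigma, theta chosen only for illustration.
n, theta, sigma = 20, 1.0, 2.0
z = norm.ppf(0.975)

naive, corrected = 0, 0
reps = 20000
for _ in range(reps):
    y = theta + rng.normal(scale=sigma, size=n)
    y_star = theta + rng.normal(scale=sigma)          # a new, unseen observation
    theta_hat = y.mean()
    sigma_hat = y.std(ddof=1)
    s = sigma_hat / np.sqrt(n)                        # se of theta-hat
    xi = np.sqrt(s**2 + sigma_hat**2)                 # corrected scale from Part (b)
    naive += abs(y_star - theta_hat) < z * s
    corrected += abs(y_star - theta_hat) < z * xi

print(naive / reps)      # far below 0.95: the naive interval ignores epsilon
print(corrected / reps)  # close to 0.95
```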

## 11.

TODO (Computer experiment).