Jekyll2020-01-24T03:30:50+00:00http://parsiad.ca/feed.xmlParsiad AzimzadehParsiad Azimzadeh's personal webpage.All of Statistics - Chapter 8 Solutions2020-01-23T20:00:00+00:002020-01-23T20:00:00+00:00http://parsiad.ca/blog/2020/all-of-statistics-chapter-08<h2 id="1">1.</h2> <p>TODO</p> <h2 id="2">2.</h2> <p>TODO</p> <h2 id="3">3.</h2> <p>TODO</p> <h2 id="4">4.</h2> <p>This is a <a href="https://en.wikipedia.org/wiki/Stars_and_bars_(combinatorics)">stars and bars</a> problem (or, equivalently, an “indistinguishable balls in distinct buckets” problem). For example, the configuration <code class="language-plaintext highlighter-rouge">★|★★★||★</code> corresponds to sampling <script type="math/tex">X_1</script> once, sampling <script type="math/tex">X_2</script> three times, sampling <script type="math/tex">X_3</script> zero times, and sampling <script type="math/tex">X_4</script> once. In general, there are <script type="math/tex">n</script> stars and <script type="math/tex">n-1</script> bars, and hence the total number of configurations is <script type="math/tex">(2n - 1)!/(n!(n-1)!)</script>.</p> <h2 id="5">5.</h2> <p>First, note that</p> <script type="math/tex; mode=display">\begin{equation} \mathbb{E}\left[\overline{X}_{n}^{*}\mid X_{1},\ldots,X_{n}\right] =\mathbb{E}\left[X_{1}^{*}\mid X_{1},\ldots,X_{n}\right]=\overline{X}_{n}. \end{equation}</script> <p>Therefore, by the tower property, <script type="math/tex">\mathbb{E}[\overline{X}_{n}^{*}]=\mathbb{E}[X_{1}]</script>. Next, note that</p> <script type="math/tex; mode=display">\begin{equation} \mathbb{V}(\overline{X}_{n}^{*}\mid X_{1},\ldots,X_{n})=\frac{1}{n}\mathbb{V}(X_{1}^{*}\mid X_{1},\ldots,X_{n})=\frac{1}{n^{2}}\sum_{i}\left(X_{i}-\overline{X}_{n}\right)^{2}. \end{equation}</script> <p>The above can also be expressed as <script type="math/tex">S_{n}(n-1)/n^{2}</script> where <script type="math/tex">S_{n}</script> is the unbiased sample variance of <script type="math/tex">(X_{1},\ldots,X_{n})</script>. 
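</p> <p>As a quick Monte Carlo sanity check (a sketch, not part of the original solution; the sample, seed, and tolerances are arbitrary), we can fix a sample and compare the variance of many bootstrap means against the conditional-variance formula above:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)  # a fixed sample X_1, ..., X_n

# Conditional on the sample, each bootstrap mean is the mean of n draws
# (with replacement) from the empirical distribution of the sample.
boot_means = rng.choice(x, size=(100_000, n), replace=True).mean(axis=1)

# Compare against (1/n^2) * sum_i (X_i - X_bar)^2.
predicted = np.sum((x - x.mean()) ** 2) / n ** 2
assert abs(boot_means.var() - predicted) / predicted < 0.05
assert abs(boot_means.mean() - x.mean()) < 0.01  # E[X_bar* | X] = X_bar
```

<p>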
Next, note that</p> <script type="math/tex; mode=display">\begin{equation} \mathbb{E}\left[\left(\overline{X}_{n}\right)^{2}\right]=\frac{1}{n^{2}}\mathbb{E}\left[\sum_{i}X_{i}^{2}+\sum_{i\neq j}X_{i}X_{j}\right]=\frac{1}{n}\left(\sigma^{2}+\mu^{2}\right)+\frac{n-1}{n}\mu^{2}=\frac{\sigma^{2}}{n}+\mu^{2} \end{equation}</script> <p>where <script type="math/tex">\mu = \mathbb{E}[X_{1}]</script> and <script type="math/tex">\sigma^{2} = \mathbb{V}(X_{1})</script>. Now, recall that for any random variable <script type="math/tex">Y</script>,</p> <script type="math/tex; mode=display">\begin{equation} \mathbb{V}(Y\mid\mathcal{H})=\mathbb{E}\left[Y^{2}\mid\mathcal{H}\right]-\mathbb{E}\left[Y\mid\mathcal{H}\right]^{2}. \end{equation}</script> <p>Therefore, by the tower property,</p> <script type="math/tex; mode=display">\begin{equation} \mathbb{E}\left[Y^{2}\right]=\mathbb{E}\left[\mathbb{V}(Y\mid\mathcal{H})+\mathbb{E}\left[Y\mid\mathcal{H}\right]^{2}\right]. \end{equation}</script> <p>Applying this to our setting,</p> <script type="math/tex; mode=display">\begin{equation} \mathbb{E}\left[\left(\overline{X}_{n}^{*}\right)^{2}\right]=\mathbb{E}\left[\frac{n-1}{n^{2}}S_{n}+\left(\overline{X}_{n}\right)^{2}\right]=\frac{2n-1}{n^{2}}\sigma^{2}+\mu^{2}. \end{equation}</script> <p>As such, subtracting <script type="math/tex">\mathbb{E}[\overline{X}_{n}^{*}]^{2}=\mu^{2}</script>, we can conclude that</p> <script type="math/tex; mode=display">\begin{equation} \mathbb{V}(\overline{X}_{n}^{*}) = \mathbb{E}\left[\left(\overline{X}_{n}^{*}\right)^{2}\right] - \mu^{2} = \frac{2n-1}{n^{2}} \sigma^{2} = \frac{2n-1}{n} \mathbb{V}(\overline{X}_n) \sim 2\mathbb{V}(\overline{X}_{n}) \end{equation}</script> <p>where the asymptotic is in the limit of large <script type="math/tex">n</script>.</p> <h2 id="6">6.</h2> <p>TODO</p> <h2 id="7">7.</h2> <h3 id="a">a)</h3> <p>The distribution of <script type="math/tex">\hat{\theta}</script> is given in the solution of Question 2 of Chapter 6.</p> <p>TODO</p> <h3 id="b">b)</h3> <p>Let <script type="math/tex">\hat{\theta}^*</script> be a bootstrap resample. 
Then,</p> <script type="math/tex; mode=display">\begin{equation} \mathbb{P}(\hat{\theta}^* = \hat{\theta} \mid \hat{\theta}) = 1 - \mathbb{P}(\hat{\theta}^* \neq \hat{\theta} \mid \hat{\theta}) = 1 - \left( 1 - 1/n \right)^n \rightarrow 1 - \exp(-1) \approx 0.632. \end{equation}</script> <h2 id="8">8.</h2> <p>TODO</p>1.All of Statistics - Chapter 6 Solutions2020-01-20T20:00:00+00:002020-01-20T20:00:00+00:00http://parsiad.ca/blog/2020/all-of-statistics-chapter-06<h2 id="1">1.</h2> <p>Since <script type="math/tex">\mathbb{E}_\lambda[\hat{\lambda}] = \mathbb{E}_\lambda[X_1]</script>, the estimator is unbiased. Moreover, <script type="math/tex">\operatorname{se}(\hat{\lambda})^2 = \mathbb{V}_\lambda(X_1) / n = \lambda / n</script>. By the bias-variance decomposition, the MSE is equal to <script type="math/tex">\operatorname{se}(\hat{\lambda})^2</script>.</p> <h2 id="2">2.</h2> <p>If <script type="math/tex">y</script> is between <script type="math/tex">0</script> and <script type="math/tex">\theta</script>,</p> <script type="math/tex; mode=display">\begin{equation} \mathbb{P}_\theta(\hat{\theta} \leq y) = \mathbb{P}_\theta(X_1 \leq y)^n = (y/\theta)^n. \end{equation}</script> <p>Differentiating yields the PDF of <script type="math/tex">\hat{\theta}</script> between <script type="math/tex">0</script> and <script type="math/tex">\theta</script> as <script type="math/tex">y \mapsto n(y/\theta)^n / y</script>. Therefore,</p> <script type="math/tex; mode=display">\begin{equation} \mathbb{E}_\theta[\hat{\theta}] = \int_0^\theta n(y/\theta)^n dy = \theta n / (n + 1). \end{equation}</script> <p>It follows that the bias of this estimator is <script type="math/tex">-\theta/(n+1)</script>. Moreover,</p> <script type="math/tex; mode=display">\begin{equation} \operatorname{se}(\hat{\theta})^2 = \int_0^\theta ny(y/\theta)^n dy - \mathbb{E}_\theta[\hat{\theta}]^2 = \theta^2 n / (n+2) - \mathbb{E}_\theta[\hat{\theta}]^2. 
\end{equation}</script> <p>By the bias-variance decomposition, the MSE is <script type="math/tex">\theta^2 n / (n+2) - \theta^2 (n^2 - 1) / (n+1)^2</script>.</p> <p><em>Remark</em>. <script type="math/tex">\hat{\theta} (n+1)/n</script> is an unbiased estimator.</p> <h2 id="3">3.</h2> <p>Since <script type="math/tex">\mathbb{E}_\theta[\hat{\theta}] = 2 \mathbb{E}_\theta[X_1] = \theta</script>, the estimator is unbiased. Moreover,</p> <script type="math/tex; mode=display">\begin{equation} \operatorname{se}(\hat{\theta})^2 = 4 \mathbb{V}_\theta(X_1) / n = \theta^2 / (3n). \end{equation}</script> <p>By the bias-variance decomposition, the MSE is equal to <script type="math/tex">\operatorname{se}(\hat{\theta})^2</script>.</p>1.All of Statistics - Chapter 7 Solutions2020-01-20T20:00:00+00:002020-01-20T20:00:00+00:00http://parsiad.ca/blog/2020/all-of-statistics-chapter-07<h2 id="1">1.</h2> <p>Note that</p> <script type="math/tex; mode=display">\begin{equation} \mathbb{E}[\hat{F}_n(x)] = \mathbb{E}[I(X_1 \leq x)] = \mathbb{P}(X_1 \leq x) = F(x). \end{equation}</script> <p>Moreover,</p> <script type="math/tex; mode=display">\begin{equation} \mathbb{V}(\hat{F}_n(x)) = \mathbb{V}(I(X_1 \leq x)) / n = F(x) (1 - F(x)) / n. \end{equation}</script> <p>By the bias-variance decomposition, the MSE converges to zero. Equivalently, we can say that <script type="math/tex">\hat{F}_n(x)</script> converges to <script type="math/tex">F(x)</script> in the L2 norm. Since Lp convergence implies convergence in probability, we are done.</p> <p><em>Remark</em>. For each <script type="math/tex">x</script>, <script type="math/tex">\hat{F}_n(x)</script> is a random variable. The above proves only that each random variable <script type="math/tex">\hat{F}_n(x)</script> converges in probability to the true value of the CDF <script type="math/tex">F(x)</script>. 
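</p> <p>As an aside (a simulation sketch with arbitrary choices: uniform data, a fixed evaluation point, and loose Monte Carlo tolerances), the mean and variance computed above can be checked numerically:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, x = 500, 5_000, 0.3  # X_i ~ Uniform(0, 1), so F(x) = x

# Each row yields one realization of the empirical CDF evaluated at x.
F_hat = (rng.random((trials, n)) <= x).mean(axis=1)

assert abs(F_hat.mean() - x) < 2e-3               # E[F_hat_n(x)] = F(x)
assert abs(F_hat.var() - x * (1 - x) / n) < 5e-5  # V = F(x)(1 - F(x)) / n
```

<p>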
The Glivenko-Cantelli Theorem yields a much stronger result; it states that <script type="math/tex">\Vert \hat{F}_n - F \Vert_\infty</script> converges almost surely (and hence in probability) to zero.</p> <h2 id="2">2.</h2> <p><em>Assumption</em>. The Bernoulli random variables in the statement of the question are pairwise independent.</p> <p>The plug-in estimator is <script type="math/tex">\hat{p} = \overline{X}_n</script>. The standard error is <script type="math/tex">\operatorname{se}(\hat{p})^2 = \mathbb{V}(X_1) / n = p (1 - p) / n</script>. We can estimate the standard error by <script type="math/tex">\hat{\operatorname{se}}(\hat{p})^2 = \hat{p}(1 - \hat{p}) / n</script>. By the CLT,</p> <script type="math/tex; mode=display">\begin{equation} \hat{p} \approx N(p, \operatorname{se}(\hat{p})^2) \approx N(\hat{p}, \hat{\operatorname{se}}(\hat{p})^2) \end{equation}</script> <p>and hence an approximate 90% confidence interval is <script type="math/tex">\hat{p} \pm 1.64 \cdot \hat{\operatorname{se}}(\hat{p})</script>. The second part of this question is handled similarly.</p> <h2 id="3">3.</h2> <p>TODO</p> <h2 id="4">4.</h2> <p>By the CLT</p> <script type="math/tex; mode=display">\begin{equation} \sqrt{n} \left( \frac{\sum_i I(X_i \leq x)}{n} - \mathbb{E} \left[ I(X_1 \leq x) \right] \right) \rightsquigarrow N(0, \mathbb{V}(I(X_1 \leq x))). \end{equation}</script> <p>Equivalently,</p> <script type="math/tex; mode=display">\begin{equation} \sqrt{n} \left( \hat{F}_n(x) - F(x) \right) \rightsquigarrow N(0, F(x) \left( 1 - F(x) \right)). \end{equation}</script> <p>Or, more conveniently,</p> <script type="math/tex; mode=display">\begin{equation} \hat{F}_n(x) \approx N \left( F(x), \frac{F(x) \left( 1 - F(x) \right)}{n} \right). \end{equation}</script> <p><em>Remark</em>. 
The closer (respectively, further) <script type="math/tex">F(x)</script> is to 0.5, the more (respectively, less) variance there is in the empirical distribution evaluated at <script type="math/tex">x</script>.</p> <h2 id="5">5.</h2> <p>Without loss of generality, assume <script type="math/tex">% <![CDATA[ x < y %]]></script>. Then,</p> <script type="math/tex; mode=display">\begin{multline} \operatorname{Cov}(\hat{F}_n(x), \hat{F}_n(y)) = \frac{1}{n^2} \operatorname{Cov}(\sum_i I(X_i \leq x), \sum_i I(X_i \leq y)) \\ = \frac{1}{n^2} \sum_i \operatorname{Cov}(I(X_i \leq x), I(X_i \leq y)) = \frac{1}{n} \operatorname{Cov}(I(X_1 \leq x), I(X_1 \leq y)) \\ = \frac{1}{n} \left( F(x) - F(x)F(y) \right) = \frac{1}{n} F(x) \left(1 - F(y) \right). \end{multline}</script> <h2 id="6">6.</h2> <p>By the results of the previous question,</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} n \cdot \operatorname{se}(\hat{\theta})^2 & = n \mathbb{V}(\hat{F}_n(b) - \hat{F}_n(a)) \\ & = n \mathbb{V}(\hat{F}_n(b)) + n \mathbb{V}(\hat{F}_n(a)) - 2 n \operatorname{Cov}(\hat{F}_n(b), \hat{F}_n(a)) \\ & = F(b) \left( 1 - F(b) \right) + F(a) \left( 1 - F(a) \right) - 2 F(a) \left( 1 - F(b) \right) \\ & = \left( F(b) - F(a) \right) \left[ 1 - \left( F(b) - F(a) \right) \right]. \end{align} %]]></script> <p>We can use the estimator</p> <script type="math/tex; mode=display">\begin{equation} \hat{\operatorname{se}}(\hat{\theta})^2 = \frac{1}{n} \left( \hat{F}_n(b) - \hat{F}_n(a) \right) \left[ 1 - \left( \hat{F}_n(b) - \hat{F}_n(a) \right) \right]. \end{equation}</script> <p>An approximate <script type="math/tex">1 - \alpha</script> confidence interval is <script type="math/tex">\hat{\theta} \pm z_{\alpha / 2} \cdot \hat{\operatorname{se}}(\hat{\theta})</script>.</p> <p><em>Remark</em>. 
The closer <script type="math/tex">F(b) - F(a)</script> is to zero or one, the smaller the standard error.</p> <h2 id="7">7.</h2> <p>TODO</p> <h2 id="8">8.</h2> <p>TODO</p> <h2 id="9">9.</h2> <p>This is an application of our findings in Question 2. In particular, we use the estimate <script type="math/tex">(90 - 85) / 100 = 0.05</script>. A <script type="math/tex">1 - \alpha</script> confidence interval for this estimate is <script type="math/tex">0.05 \pm z_{\alpha / 2} \cdot \hat{\operatorname{se}}</script> where</p> <script type="math/tex; mode=display">\begin{equation} \hat{\operatorname{se}} = \sqrt{ 0.9 \left( 1 - 0.9 \right) / 100 + 0.85 \left( 1 - 0.85 \right) / 100 } \approx 0.047. \end{equation}</script> <p>The z-scores corresponding to 80% and 95% intervals are approximately 1.28 and 1.96.</p> <h2 id="10">10.</h2> <p>TODO</p>1.Principal component analysis - part two2019-12-25T20:00:00+00:002019-12-25T20:00:00+00:00http://parsiad.ca/blog/2019/principal-component-analysis-part-two<h2 id="introduction">Introduction</h2> <p>In <a href="/blog/2019/principal-component-analysis-part-one/">the first post in this series</a>, we outlined the motivation and theory behind principal component analysis (PCA), which takes points <script type="math/tex">x_1, \ldots, x_N</script> in a high dimensional space to points in a lower dimensional space while preserving as much of the original variance as possible.</p> <p>In this follow-up post, we apply principal components regression (PCR), an algorithm which includes PCA as a subroutine, to a small dataset to demonstrate the ideas in practice.</p> <h2 id="prerequisites">Prerequisites</h2> <p>To understand this post, you will need to be familiar with the following concepts:</p> <ul> <li>PCA (see <a href="/blog/2019/principal-component-analysis-part-one/">the first post in this series</a>)</li> <li><a href="https://en.wikipedia.org/wiki/Ordinary_least_squares">ordinary least squares</a></li> </ul> <h2 
id="ordinary-least-squares">Ordinary least squares</h2> <p>In ordinary least squares (OLS), we want to find a line of best fit between the points <script type="math/tex">x_1, \ldots, x_N</script> and the labels <script type="math/tex">y_1, \ldots, y_N</script>.</p> <p>Denoting by <script type="math/tex">X</script> the matrix whose rows are the points and <script type="math/tex">y</script> the vector whose entries are the labels, the intercept <script type="math/tex">\alpha</script> and slope (a.k.a. gradient) <script type="math/tex">\beta</script> are obtained by minimizing <script type="math/tex">\Vert \alpha + X \beta - y \Vert</script>. Some <a href="https://en.wikipedia.org/wiki/Matrix_calculus">matrix calculus</a> reveals that the minimum is obtained at the values of <script type="math/tex">\alpha</script> and <script type="math/tex">\beta</script> for which</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} N \alpha & = y^\intercal e - \beta^\intercal X^\intercal e \\ X^\intercal X \beta & = X^\intercal y - \alpha X^\intercal e \end{align*} %]]></script> <p>where <script type="math/tex">e</script> is the vector of all ones.</p> <h2 id="principal-components-regression">Principal components regression</h2> <p>The idea behind PCR is simple: instead of doing OLS on the high dimensional space, we first map the points to a lower dimensional space obtained by PCA and <em>then</em> do OLS. 
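</p> <p>The normal equations above can be verified numerically (a sketch on synthetic data; the particular coefficients and noise level are arbitrary). Centering the data eliminates the intercept from the second equation:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 3
X = rng.normal(size=(N, p))
y = 1.5 + X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=N)
e = np.ones(N)

# Substituting alpha = (y^T e - beta^T X^T e) / N into the second normal
# equation shows that beta solves the centered system below.
Xc = X - X.mean(axis=0)
beta = np.linalg.solve(Xc.T @ Xc, Xc.T @ (y - y.mean()))
alpha = (y @ e - beta @ X.T @ e) / N

# Cross-check against least squares with an explicit bias column.
coef, *_ = np.linalg.lstsq(np.column_stack([e, X]), y, rcond=None)
assert np.allclose([alpha, *beta], coef)
```

<p>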
In more detail, we</p> <ol> <li>pick a positive integer <script type="math/tex">% <![CDATA[ k < p %]]></script>,</li> <li>construct the matrix <script type="math/tex">V_k</script> whose columns are the first <script type="math/tex">k</script> principal components of <script type="math/tex">X</script>,</li> <li>compute <script type="math/tex">Z_k = X V_k</script>, a matrix whose rows are the original points transformed to a lower dimensional “PCA space”, and</li> <li>perform OLS to find a line of best fit between the transformed points and <script type="math/tex">y</script>.</li> </ol> <p>By the previous section, we know that the minimum is obtained at the values of the intercept <script type="math/tex">\alpha_k</script> and gradient <script type="math/tex">\beta_k</script> for which</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} N \alpha_k & = y^\intercal e - \beta_k^\intercal Z_k^\intercal e \\ Z_k^\intercal Z_k \beta_k & = Z_k^\intercal y - \alpha_k Z_k^\intercal e \end{align*} %]]></script> <p>Once we have solved these equations for <script type="math/tex">\alpha_k</script> and <script type="math/tex">\beta_k</script>, we can predict the label <script type="math/tex">\hat{y}</script> corresponding to a new sample <script type="math/tex">x</script> as <script type="math/tex">\hat{y} = \alpha_k + x^\intercal V_k \beta_k</script>.</p> <h3 id="computational-considerations">Computational considerations</h3> <p>Due to the result below, the linear system involving <script type="math/tex">\alpha_k</script> and <script type="math/tex">\beta_k</script> is a (permuted) <a href="https://en.wikipedia.org/wiki/Arrowhead_matrix">arrowhead matrix</a>. 
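</p> <p>Numerically, the diagonal structure responsible for this can be seen with a short sketch (the shapes and seed are arbitrary): the principal components are right-singular vectors of <script type="math/tex">X</script>, so <script type="math/tex">Z_k^\intercal Z_k</script> is diagonal with squared singular values on its diagonal.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 7))
k = 3

# The principal components are the right-singular vectors of X.
_, s, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt.T[:, :k]
Z_k = X @ V_k

# Z_k^T Z_k is diagonal with the first k squared singular values,
# so appending a bias column yields a (permuted) arrowhead system.
assert np.allclose(Z_k.T @ Z_k, np.diag(s[:k] ** 2))
```

<p>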
As such, the system can be solved efficiently.</p> <p><strong>Lemma.</strong> <script type="math/tex">Z_k^\intercal Z_k = \Sigma_k^2</script> where <script type="math/tex">\Sigma_k</script> is the <script type="math/tex">k \times k</script> diagonal matrix whose entries are the first <script type="math/tex">k</script> singular values of <script type="math/tex">X</script> in descending order.</p> <p><em>Proof</em>. Let <script type="math/tex">v_j</script> denote the <script type="math/tex">j</script>-th column of <script type="math/tex">V_k</script>. Since <script type="math/tex">v_j</script> is a principal component of <script type="math/tex">X</script>, it is also an eigenvector of <script type="math/tex">X^\intercal X</script> with eigenvalue <script type="math/tex">\sigma_j^2</script>, the square of the <script type="math/tex">j</script>-th singular value. Therefore, the <script type="math/tex">(i, j)</script>-th entry of <script type="math/tex">Z_k^\intercal Z_k</script> is</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{equation} (X v_i)^\intercal (X v_j) = v_i^\intercal X^\intercal X v_j = \sigma_j^2 v_i^\intercal v_j = \begin{cases} \sigma_j^2 & \text{if } i = j \\ 0 & \text{if } i \neq j. \end{cases} \end{equation} %]]></script> <h2 id="boston-house-prices-dataset">Boston house prices dataset</h2> <p>The <a href="https://archive.ics.uci.edu/ml/machine-learning-databases/housing">Boston house prices dataset</a> of Harrison Jr. and Rubinfeld (1978) has 506 samples and 13 predictors. 
For each <script type="math/tex">k \leq p = 13</script>, we fit using PCR on the first 405 samples (the training set) and report the <a href="https://en.wikipedia.org/wiki/Root-mean-square_deviation">root mean squared error</a> (RMSE) on both the training set and the set of remaining 101 samples (the test set).</p> <table> <thead> <tr> <th>Rank (k)</th> <th>Training set RMSE (in $1000s)</th> <th>Test set RMSE (in $1000s)</th> </tr> </thead> <tbody> <tr> <td>1</td> <td>7.16734</td> <td>7.57061</td> </tr> <tr> <td>2</td> <td>6.7612</td> <td>6.91805</td> </tr> <tr> <td>3</td> <td>5.61098</td> <td>5.80307</td> </tr> <tr> <td>4</td> <td>5.42897</td> <td>6.07821</td> </tr> <tr> <td>5</td> <td>4.89393</td> <td>5.78428</td> </tr> <tr> <td>6</td> <td>4.88918</td> <td>5.76014</td> </tr> <tr> <td>7</td> <td>4.86875</td> <td>5.78133</td> </tr> <tr> <td>8</td> <td>4.82526</td> <td>5.71379</td> </tr> <tr> <td>9</td> <td>4.818</td> <td>5.74823</td> </tr> <tr> <td>10</td> <td>4.78993</td> <td>5.73366</td> </tr> <tr> <td>11</td> <td>4.75929</td> <td>5.67803</td> </tr> <tr> <td>12</td> <td>4.6241</td> <td>5.38402</td> </tr> <tr> <td>13</td> <td>4.54322</td> <td>5.32823</td> </tr> </tbody> </table> <p>Both training and test set RMSEs are (roughly) decreasing functions of the rank. This suggests that using all 13 predictors does not cause overfitting.</p> <p>Code used to generate the table above is given in the appendix.</p> <h3 id="deriving-predictors">Deriving predictors</h3> <p>One way to reduce the test set RMSE is to introduce more predictors into the model. Consider, as a toy example, a dataset where each sample <script type="math/tex">x_i</script> has only three predictors: <script type="math/tex">x_i \equiv (a_i, b_i, c_i)</script>. We can replace each sample <script type="math/tex">x_i</script> by a new sample <script type="math/tex">x_i^\prime \equiv (a_i, b_i, c_i, a_i^2, a_i b_i, a_i c_i, b_i^2, b_i c_i, c_i^2)</script>. 
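</p> <p>A minimal sketch of constructing such derived predictors (the shapes here are arbitrary):</p>

```python
from itertools import combinations_with_replacement

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # columns play the roles of a, b, c

# Append all quadratic monomials: a^2, ab, ac, b^2, bc, c^2.
quad = np.column_stack(
    [X[:, i] * X[:, j]
     for i, j in combinations_with_replacement(range(3), 2)])
X_prime = np.column_stack([X, quad])

assert X_prime.shape == (5, 9)
assert np.allclose(X_prime[:, 3], X[:, 0] ** 2)  # first monomial is a^2
```

<p>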
In particular, we have added all possible quadratic monomials in <script type="math/tex">a_i, b_i, c_i</script>. These new entries are referred to as “derived” predictors. Note that derived predictors need not be quadratic, or even monomials; any function of the original predictors is referred to as a derived predictor.</p> <p>Returning to the Boston house prices dataset, of all possible derived cubic monomial predictors, we randomly choose roughly 100 to add to our dataset. Since we have approximately 400 training samples, it is reasonable to expect that, unlike OLS applied to <script type="math/tex">X</script>, OLS applied to the derived matrix <script type="math/tex">X^\prime</script> will almost certainly overfit. We plot the results of PCR below, observing the effects of overfitting for ranks greater than approximately 80.</p> <p><img src="/assets/img/principal-component-analysis-part-two/train_test_plot.png" alt="" /></p> <h2 id="bibliography">Bibliography</h2> <p> Harrison Jr, D., &amp; Rubinfeld, D. L. (1978). Hedonic housing prices and the demand for clean air. 
<em>Journal of environmental economics and management</em>, 5(1), 81-102.</p> <h2 id="appendix-code">Appendix: code</h2> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span> <span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="kn">import</span> <span class="n">load_boston</span> <span class="kn">from</span> <span class="nn">tabulate</span> <span class="kn">import</span> <span class="n">tabulate</span> <span class="n">TRAIN_TEST_SPLIT_FRACTION</span> <span class="o">=</span> <span class="mf">0.2</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">load_boston</span><span class="p">(</span><span class="n">return_X_y</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="n">N</span><span class="p">,</span> <span class="n">p</span> <span class="o">=</span> <span class="n">X</span><span class="o">.</span><span class="n">shape</span> <span class="c1"># Train test split. 
</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">123</span><span class="p">)</span> <span class="n">perm</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">permutation</span><span class="p">(</span><span class="n">N</span><span class="p">)</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="n">perm</span><span class="p">],</span> <span class="n">y</span><span class="p">[</span><span class="n">perm</span><span class="p">]</span> <span class="n">N_test</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">TRAIN_TEST_SPLIT_FRACTION</span> <span class="o">*</span> <span class="n">N</span><span class="p">)</span> <span class="n">N_train</span> <span class="o">=</span> <span class="n">N</span> <span class="o">-</span> <span class="n">N_test</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">X_train</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="p">[</span><span class="n">N_test</span><span class="p">])</span> <span class="n">y_test</span><span class="p">,</span> <span class="n">y_train</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="p">[</span><span class="n">N_test</span><span class="p">])</span> <span class="c1"># Normalize data. 
</span><span class="n">X_mean</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="n">X_std</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">X_train</span> <span class="o">=</span> <span class="p">[(</span><span class="n">X_sub</span> <span class="o">-</span> <span class="n">X_mean</span><span class="p">)</span> <span class="o">/</span> <span class="n">X_std</span> <span class="k">for</span> <span class="n">X_sub</span> <span class="ow">in</span> <span class="p">[</span><span class="n">X_test</span><span class="p">,</span> <span class="n">X_train</span><span class="p">]]</span> <span class="n">_</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">V_T</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">svd</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span> <span class="n">V</span> <span class="o">=</span> <span class="n">V_T</span><span class="o">.</span><span class="n">T</span> <span class="n">rows</span> <span class="o">=</span> <span class="p">[]</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">p</span> <span class="o">+</span> <span class="mi">1</span><span class="p">):</span> <span 
class="n">V_k</span> <span class="o">=</span> <span class="n">V</span><span class="p">[:,</span> <span class="p">:</span><span class="n">k</span><span class="p">]</span> <span class="n">Z_k</span> <span class="o">=</span> <span class="n">X_train</span> <span class="o">@</span> <span class="n">V_k</span> <span class="c1"># Solve for alpha_k and beta_k by adding a bias column to Z_k. </span> <span class="c1"># This is not efficient (see "Computational considerations" above). </span> <span class="n">Z_k_bias</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">concatenate</span><span class="p">([</span><span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">([</span><span class="n">N_train</span><span class="p">,</span> <span class="mi">1</span><span class="p">]),</span> <span class="n">Z_k</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="n">solution</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">solve</span><span class="p">(</span><span class="n">Z_k_bias</span><span class="o">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">Z_k_bias</span><span class="p">,</span> <span class="n">Z_k_bias</span><span class="o">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">y_train</span><span class="p">)</span> <span class="n">alpha_k</span> <span class="o">=</span> <span class="n">solution</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="n">beta_k</span> <span class="o">=</span> <span class="n">solution</span><span class="p">[</span><span class="mi">1</span><span class="p">:]</span> <span class="n">V_k_beta_k</span> <span class="o">=</span> <span class="n">V_k</span> <span class="o">@</span> <span 
class="n">beta_k</span> <span class="n">row</span> <span class="o">=</span> <span class="p">[</span><span class="n">k</span><span class="p">]</span> <span class="k">for</span> <span class="n">X_sub</span><span class="p">,</span> <span class="n">y_sub</span> <span class="ow">in</span> <span class="p">[(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">),</span> <span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)]:</span> <span class="n">y_hat</span> <span class="o">=</span> <span class="n">alpha_k</span> <span class="o">+</span> <span class="n">X_sub</span> <span class="o">@</span> <span class="n">V_k_beta_k</span> <span class="n">error</span> <span class="o">=</span> <span class="n">y_hat</span> <span class="o">-</span> <span class="n">y_sub</span> <span class="n">rmse</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">error</span> <span class="o">**</span> <span class="mi">2</span><span class="p">))</span> <span class="n">row</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">rmse</span><span class="p">)</span> <span class="n">rows</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">row</span><span class="p">)</span> <span class="n">table</span> <span class="o">=</span> <span class="n">tabulate</span><span class="p">(</span><span class="n">rows</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="p">[</span><span class="s">'Rank (k)'</span><span class="p">,</span> <span class="s">'Training set RMSE (in $1000s)'</span><span class="p">,</span> <span class="s">'Test set RMSE (in$1000s)'</span><span 
class="p">],</span> <span class="n">tablefmt</span><span class="o">=</span><span class="s">'github'</span><span class="p">)</span> <span class="k">print</span><span class="p">(</span><span class="n">table</span><span class="p">)</span> </code></pre></div></div>IntroductionMotivating the Poisson distribution2019-12-01T20:00:00+00:002019-12-01T20:00:00+00:00http://parsiad.ca/blog/2019/motivating-the-poisson-distribution<h2 id="introduction">Introduction</h2> <p>The Poisson distribution is usually introduced by its probability mass function (PMF). It is then tied back to the binomial distribution by showing that a particular parameterization of the binomial distribution converges to the Poisson distribution. This approach is somewhat unsatisfying: it does not give much insight into the Poisson distribution and, namely, why it is used to model certain phenomena.</p> <p>In this short post, we avoid motivating the Poisson distribution by its PMF and instead <em>construct</em> it directly from the binomial distribution.</p> <h2 id="prerequisites">Prerequisites</h2> <p>To understand this post, you will need to be familiar with the following concepts:</p> <ul> <li><a href="https://en.wikipedia.org/wiki/Probability_mass_function">probability mass function</a> (PMF)</li> <li><a href="https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables">independent and identically distributed</a> (IID)</li> <li><a href="https://en.wikipedia.org/wiki/Bernoulli_distribution">Bernoulli distribution</a></li> <li><a href="https://en.wikipedia.org/wiki/Binomial_distribution">binomial distribution</a></li> </ul> <h2 id="construction-by-partitions">Construction by partitions</h2> <p>Let’s start with a motivating example: we would like to model the number of emails you receive during a particular work day (e.g., between 9am and 5pm). 
For convenience, we can normalize this interval of time by relabeling the start of the day 0 and the end 1.</p> <p>While an email may arrive at any time between 0 and 1, it will be easier to start with a model in which emails arrive only at one of finitely many times between 0 and 1. We do this by subdividing our day into <script type="math/tex">n</script> uniformly sized partitions, each having length <script type="math/tex">h = 1 / n</script>:</p> <p><img src="/assets/img/motivating-the-poisson-distribution/ticks.svg" alt="" /></p> <p>We assume that inside each of these partitions, you receive at most one email. Let <script type="math/tex">X_k = 1</script> if you received an email between <script type="math/tex">(k-1)h</script> and <script type="math/tex">kh</script>. Otherwise, let <script type="math/tex">X_k = 0</script>.</p> <p>Moreover, we assume that emails are independent (while this might be a lofty assumption, we leave it to the reader to create more complicated models) and that the probability of receiving an email in one partition is identical to that of receiving an email in another partition. These assumptions can be neatly summarized in one sentence: the random variables <script type="math/tex">X_1, \ldots, X_n</script> are IID.</p> <p>The sum <script type="math/tex">S_n = X_1 + \cdots + X_n</script> counts the total number of emails received in the work day. Being a sum of IID Bernoulli random variables, <script type="math/tex">S_n</script> has binomial distribution with <script type="math/tex">n</script> trials and success probability <script type="math/tex">p = \mathbb{P}(X_1 = 1)</script>. That is, <script type="math/tex">S_n</script> has PMF</p> <script type="math/tex; mode=display">\begin{equation} f_{n, p}(s) = \binom{n}{s} p^s \left( 1 - p \right)^{n - s}. \end{equation}</script> <p>The expected number of emails received under this model is <script type="math/tex">\mathbb{E} S_n = np</script>. 
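</p> <p>As a quick check (a simulation sketch; the choices of <script type="math/tex">n</script>, <script type="math/tex">p</script>, and the tolerances are arbitrary), we can simulate the partitioned day and compare the empirical distribution of <script type="math/tex">S_n</script> with the binomial PMF:</p>

```python
import math

import numpy as np

rng = np.random.default_rng(0)
n, p, trials = 20, 0.1, 200_000

# X_1, ..., X_n are IID Bernoulli(p); S_n is their sum.
S = (rng.random((trials, n)) < p).sum(axis=1)

# Compare the empirical frequencies with C(n, s) p^s (1 - p)^(n - s).
for s in range(5):
    pmf = math.comb(n, s) * p ** s * (1 - p) ** (n - s)
    assert abs((S == s).mean() - pmf) < 5e-3
```

<p>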
We would like to pick <script type="math/tex">p</script> such that the expected number of emails does not depend on the number of partitions <script type="math/tex">n</script>. Our model would be a bit weird if the expected number of emails changed as a function of the number of partitions! The only way to prevent this from happening is to pick <script type="math/tex">p = \lambda / n</script> for some positive constant <script type="math/tex">\lambda</script> (technically, it’s possible that <script type="math/tex">p = \lambda / n > 1</script>, but we can always pick <script type="math/tex">n</script> sufficiently large to obtain a valid probability). Under this choice, the PMF becomes</p> <script type="math/tex; mode=display">\begin{equation} f_{n,\lambda/n}(s) = \binom{n}{s}\left(\frac{\lambda}{n}\right)^{s}\left(1-\frac{\lambda}{n}\right)^{n-s} = \frac{\lambda^{s}}{s!}\frac{1}{n^{n}}\frac{n!}{\left(n-s\right)!}\left(n-\lambda\right)^{n-s}. \end{equation}</script> <p>Next, note that</p> <script type="math/tex; mode=display">\begin{multline*} \frac{1}{n^{n}}\frac{n!}{\left(n-s\right)!}\left(n-\lambda\right)^{n-s} = \frac{1}{n^{n}}\frac{n!}{\left(n-s\right)!}\sum_{k=0}^{n-s}\binom{n-s}{k}\left(-\lambda\right)^{k}n^{n-s-k}\\ = \sum_{k=0}^{n-s}\frac{\left(-\lambda\right)^{k}}{k!}\frac{n\left(n-1\right)\cdots\left(n-s-k+1\right)}{n^{s+k}}. \end{multline*}</script> <p>As we increase the number of partitions, our model becomes more and more realistic. By taking a limit as the number of partitions goes to infinity, we obtain a model in which emails can be received at any point in time. We make a new PMF <script type="math/tex">g_\lambda</script> by taking this limit:</p> <script type="math/tex; mode=display">\begin{equation} g_\lambda(s) = \lim_{n}f_{n,\lambda/n}(s) = \frac{\lambda^s}{s!} \sum_{k \geq 0} \frac{\left(-\lambda\right)^k}{k!} = \frac{\lambda^{s}}{s!}e^{-\lambda}. 
\end{equation}</script> <p>The figure below plots <script type="math/tex">g_\lambda</script> as a function of <script type="math/tex">k</script> (<a href="https://en.wikipedia.org/wiki/File:Poisson_pmf.svg">source</a>). When <script type="math/tex">\lambda</script> is a positive integer, the modes are <script type="math/tex">\lambda</script> and <script type="math/tex">\lambda - 1</script>, as can be seen in the plot.</p> <p><img src="https://upload.wikimedia.org/wikipedia/commons/1/16/Poisson_pmf.svg" alt="" /></p> <p>Our construction above suggests that the PMF <script type="math/tex">g_\lambda</script> models the number of emails received per work day assuming that</p> <ul> <li>the receipt (or non-receipt) of one email does not affect future emails and</li> <li>the expected number of emails received is specified by the parameter <script type="math/tex">\lambda</script>.</li> </ul> <p>Of course, we can replace “email” by any event and “work day” by any finite time horizon we are interested in, so long as the modeling assumptions above are sound.</p> <p>In summary, any random variable with PMF <script type="math/tex">g_\lambda</script> (for some positive value of <script type="math/tex">\lambda</script>) is said to be a <em>Poisson random variable</em> (or, equivalently, have a <em>Poisson distribution</em>). All of our hard work above is not only a construction of the Poisson distribution but also a proof of the following result:</p> <p><strong>Proposition (Binomial to Poisson)</strong>. Let <script type="math/tex">\lambda</script> be a positive number. 
The binomial distribution with <script type="math/tex">n</script> trials and success probability <script type="math/tex">p = \lambda / n</script> <a href="https://en.wikipedia.org/wiki/Convergence_of_random_variables#Convergence_in_distribution">converges</a>, as <script type="math/tex">n \rightarrow \infty</script>, to the Poisson distribution with parameter <script type="math/tex">\lambda</script>.</p> <h2 id="the-usual-rigmarole">The usual rigmarole</h2> <p>For completeness, we also give the approach alluded to in the introduction of this article. In particular, we take the definition of the Poisson distribution involving <script type="math/tex">g_\lambda</script> as given and prove the proposition of the previous section using a more expedient approach.</p> <p><em>Proof (Binomial to Poisson)</em>. We establish the proposition by showing that the <a href="https://en.wikipedia.org/wiki/Characteristic_function_(probability_theory)">characteristic function</a> (CF) of the binomial distribution converges to that of the Poisson distribution and applying <a href="https://en.wikipedia.org/wiki/L%C3%A9vy%27s_continuity_theorem">Lévy’s continuity theorem</a>.</p> <p>Let’s start by computing the CF of the Poisson distribution:</p> <script type="math/tex; mode=display">\begin{equation} \operatorname{CF}_{\mathrm{Poisson}(\lambda)}(t) = e^{-\lambda}\sum_{k\geq0}\frac{1}{k!}\left(\lambda e^{it}\right)^{k} = e^{-\lambda}\exp\left(\lambda e^{it}\right) = \exp\left(\lambda\left(e^{it}-1\right)\right). \end{equation}</script> <p>Let’s now compute the CF of the binomial distribution with <script type="math/tex">n</script> trials and success probability <script type="math/tex">p</script>. 
The CF can be obtained by applying the <a href="https://en.wikipedia.org/wiki/Binomial_theorem">binomial theorem</a>:</p> <script type="math/tex; mode=display">\begin{multline*} \operatorname{CF}_{\mathrm{Binomial}(n, p)}(t) = \sum_{k\geq0}e^{itk}\binom{n}{k}p^{k}\left(1-p\right)^{n-k} = \sum_{k\geq0}\binom{n}{k}\left(pe^{it}\right)^{k}\left(1-p\right)^{n-k}\\ = \left(1-p+pe^{it}\right)^{n} = \left(1+p\left(e^{it}-1\right)\right)^{n}. \end{multline*}</script> <p>Setting <script type="math/tex">p = \lambda / n</script> and taking limits in the above, we obtain the desired result:</p> <script type="math/tex; mode=display">\lim_n \operatorname{CF}_{\mathrm{Binomial}(n, \lambda / n)}(t) = \lim_n \left( \exp \left( \frac{\lambda}{n} \left( e^{it} - 1 \right) \right) + O \left( \frac{1}{n^2} \right) \right)^n = \operatorname{CF}_{\mathrm{Poisson}(\lambda)}(t). \text{ } \square</script>IntroductionPrincipal component analysis - part one2019-11-17T20:00:00+00:002019-11-17T20:00:00+00:00http://parsiad.ca/blog/2019/principal-component-analysis-part-one<h2 id="motivation">Motivation</h2> <p>Consider, for example, that we want to perform <a href="https://en.wikipedia.org/wiki/Ordinary_least_squares">ordinary least squares</a> (OLS) to find a line of best fit between the points <script type="math/tex">x_1, \ldots, x_N</script> in <script type="math/tex">p</script> dimensional Euclidean space and labels <script type="math/tex">y_1, \ldots, y_N</script>.</p> <p><img src="/assets/img/principal-component-analysis-part-one/points.png" alt="" /></p> <p>When the number of predictors <script type="math/tex">p</script> is large, it is possible to end up with a linear regression model with high variance (and hence higher than desirable prediction error). 
In addition, the resulting coefficients learned by the model may be harder to interpret than an alternative model with fewer predictors.</p> <p><em>Principal component analysis</em> (PCA) is a method for transforming points (such as <script type="math/tex">x_1, \ldots, x_N</script> above) in a high dimensional space to points in a lower dimensional space by performing a sequence of scalar projections such that the resulting points account for as much of the original variance as possible.</p> <p>Don’t worry if this sounds vague; we’ll make it precise below.</p> <h2 id="prerequisites">Prerequisites</h2> <p>To understand this post, you will need to be familiar with the following concepts:</p> <ul> <li><a href="https://en.wikipedia.org/wiki/Linear_algebra">linear algebra</a></li> <li><a href="https://en.wikipedia.org/wiki/Variance">variance</a></li> </ul> <h2 id="a-review-of-scalar-projections">A review of scalar projections</h2> <p>We use <script type="math/tex">\Vert \cdot \Vert</script> to denote the <a href="https://en.wikipedia.org/wiki/Euclidean_distance">Euclidean distance</a>. 
Given vectors <script type="math/tex">x</script> and <script type="math/tex">v</script>, the <em>scalar projection</em> of <script type="math/tex">x</script> onto <script type="math/tex">v</script> is <script type="math/tex">\Vert x \Vert \cos \theta</script> where <script type="math/tex">\theta</script> is the angle between the two vectors.</p> <p><img src="/assets/img/principal-component-analysis-part-one/scalar_projection.png" alt="" /></p> <p>If <script type="math/tex">v</script> is a unit vector (i.e., <script type="math/tex">\Vert v \Vert = 1</script>), the scalar projection can also be written <script type="math/tex">x \cdot v</script> using the <a href="https://en.wikipedia.org/wiki/Dot_product">dot product</a>.</p> <h2 id="principal-components">Principal components</h2> <p>Before we define the principal components, let’s introduce some equivalent representations of the data:</p> <ul> <li>Let <script type="math/tex">\mathbf{x}</script> be a random vector which takes on the values <script type="math/tex">x_1, \ldots, x_N</script> with uniform probability.</li> <li>Let <script type="math/tex">X</script> be a matrix with rows <script type="math/tex">x_1, \ldots, x_N</script>.</li> </ul> <p>We assume that <script type="math/tex">\mathbf{x}</script> has zero expectation. If this is not true, we can always just center the data by subtracting <script type="math/tex">\mathbb{E} \mathbf{x}</script> from each point.</p> <p>Moreover, we assume <script type="math/tex">N > p</script> and <script type="math/tex">\operatorname{rank}(X) = p</script>. 
In the context of our motivating example of OLS, this means that there are more samples than there are predictors and that no predictors are redundant.</p> <h3 id="first-principal-component">First principal component</h3> <p>The <em>first principal component</em> is a unit vector <script type="math/tex">v_1</script> along which the variance of the scalar projection of <script type="math/tex">\mathbf{x}</script> onto <script type="math/tex">v_1</script> is maximized. That is,</p> <script type="math/tex; mode=display">\operatorname{Var}(\mathbf{x} \cdot v_1) = \max_{\Vert v \Vert = 1} \operatorname{Var}(\mathbf{x} \cdot v).</script> <p>In other words, we are looking for the direction along which the data varies the most.</p> <p><img src="/assets/img/principal-component-analysis-part-one/first_principal_component.png" alt="" /></p> <p><em>Remark</em>. The first principal component need not be unique: there may be two or more maximizers of the variance. In this case, it is understood that “the” first principal component is picked according to some tie-breaking rule.</p> <p>But how do we compute the first principal component? First, let’s obtain an equivalent expression for the variance:</p> <script type="math/tex; mode=display">N \operatorname{Var}(\mathbf{x} \cdot v) = \sum_{i=1}^N \left( x_i \cdot v \right)^2 = \Vert Xv \Vert^2 = \left(Xv\right)^\intercal \left(Xv\right) = v^\intercal X^\intercal X v.</script> <p><strong>Lemma.</strong> Let <script type="math/tex">A</script> be a <a href="https://en.wikipedia.org/wiki/Definiteness_of_a_matrix">positive semidefinite matrix</a>. Let <script type="math/tex">w</script> be the maximizer of <script type="math/tex">v \mapsto v^\intercal A v</script> over all (real) unit vectors. Then, <script type="math/tex">w</script> is an eigenvector of <script type="math/tex">A</script> whose corresponding eigenvalue is maximal.</p> <p><em>Proof</em>. 
Proceeding by the method of <a href="https://en.wikipedia.org/wiki/Lagrange_multiplier">Lagrange multipliers</a>, let <script type="math/tex">L(v) \equiv v^\intercal A v - \lambda (v^\intercal v - 1)</script> where <script type="math/tex">\lambda</script> is an arbitrary constant. Then, <script type="math/tex">\nabla L(v) \propto A v - \lambda v</script>. Since <script type="math/tex">w</script> is a <a href="https://en.wikipedia.org/wiki/Critical_point_(mathematics)">critical point</a> of <script type="math/tex">L</script>, it follows that <script type="math/tex">w</script> is an eigenvector of <script type="math/tex">A</script>. Moreover, denoting by <script type="math/tex">r</script> the eigenvalue corresponding to <script type="math/tex">w</script>, since <script type="math/tex">w^\intercal A w = w^\intercal r w = r</script>, it follows that <script type="math/tex">r</script> is maximal: if <script type="math/tex">u</script> is any unit eigenvector of <script type="math/tex">A</script> with eigenvalue <script type="math/tex">s</script>, then <script type="math/tex">s = u^\intercal A u \leq w^\intercal A w = r</script> by the maximality of <script type="math/tex">w</script>.</p> <p>The above suggests a simple way to compute the first principal component: applying <a href="https://en.wikipedia.org/wiki/Power_iteration">power iteration</a> to <script type="math/tex">A \equiv X^\intercal X</script>. Power iteration is an algorithm which returns, under reasonable conditions, an eigenvector corresponding to the largest eigenvalue of the input matrix. The details of power iteration are outside the scope of this article.</p> <h3 id="remaining-principal-components">Remaining principal components</h3> <p>Given the first principal component <script type="math/tex">v_1</script>, we can transform our data so that all contributions in the <script type="math/tex">v_1</script> direction are removed. 
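</p>

<p><em>Aside</em>. The following is a rough sketch of mine (the post itself does not include an implementation) combining power iteration on <script type="math/tex">X^\intercal X</script> with this projection step, assuming the data matrix has already been centered. The data and the helper name <code class="language-plaintext highlighter-rouge">first_pc</code> are illustrative.</p>

```python
# Sketch: first principal component via power iteration on A = X^T X,
# then deflation (projecting out v1) to prepare for the next component.
import numpy as np

def first_pc(X, iters=500):
    """Approximate the first principal component of the centered matrix X."""
    A = X.T @ X
    v = np.ones(X.shape[1]) / np.sqrt(X.shape[1])  # arbitrary unit start
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v)  # renormalize to avoid overflow
    return v

rng = np.random.default_rng(0)
# Illustrative data: most variance along the first coordinate axis.
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.1])
X -= X.mean(axis=0)  # PCA assumes zero-mean data

v1 = first_pc(X)
X2 = X @ (np.eye(3) - np.outer(v1, v1))  # X^(2): contributions of v1 removed
```

<p>On this data, <code class="language-plaintext highlighter-rouge">v1</code> is (up to sign) close to the first coordinate axis, and <code class="language-plaintext highlighter-rouge">X2 @ v1</code> vanishes, as the projection formula predicts.</p>

<p>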
In particular, for each point <script type="math/tex">x_i</script>, we can create a new point</p> <script type="math/tex; mode=display">x_i^{(2)} \equiv x_i - (x_i \cdot v_1) v_1.</script> <p><img src="/assets/img/principal-component-analysis-part-one/decomposition.png" alt="" /></p> <p>Equivalently, we can represent this transformation in the matrix form</p> <script type="math/tex; mode=display">X^{(2)} \equiv X - X v_1 v_1^\intercal = X \left( I - v_1 v_1^\intercal \right)</script> <p>from which it is clear that this transformation is a <a href="https://en.wikipedia.org/wiki/Projection_(linear_algebra)">projection</a> with matrix <script type="math/tex">P \equiv I - v_1 v_1^\intercal</script>.</p> <p><img src="/assets/img/principal-component-analysis-part-one/projected_points.png" alt="" /></p> <p>The <strong>second</strong> principal component <script type="math/tex">v_2</script> of <script type="math/tex">X</script> is defined as the <strong>first</strong> principal component of <script type="math/tex">X^{(2)}</script>. Once it is computed, we can, as above, “project out” its contributions by taking <script type="math/tex">X^{(3)} \equiv X^{(2)}(I - v_2 v_2^\intercal)</script>. Continuing in this way, we can define the third, fourth, etc. principal components.</p> <p>For completeness, we summarize the above procedure. Let <script type="math/tex">X^{(1)} \equiv X</script> and define</p> <script type="math/tex; mode=display">X^{(k+1)} \equiv X^{(k)} \left( I - v_k v_k^\intercal \right) \text{ for } k \geq 1</script> <p>where <script type="math/tex">v_k</script> is the <strong>first</strong> principal component of <script type="math/tex">X^{(k)}</script>. We call <script type="math/tex">v_k</script> the <strong><script type="math/tex">k</script>-th</strong> principal component of <script type="math/tex">X</script>.</p> <p>Since each projection reduces the dimensionality of the space by one, it is guaranteed that <script type="math/tex">X^{(p+1)} = 0</script>. 
That is, it is only meaningful to talk about the first <script type="math/tex">p</script> principal components.</p> <h2 id="basis-transformation">Basis transformation</h2> <h3 id="lossless">Lossless</h3> <p>Let <script type="math/tex">V</script> denote the matrix whose columns consist of all <script type="math/tex">p</script> principal components of <script type="math/tex">X</script>. We can transform our points into “PCA space” by right-multiplying by <script type="math/tex">V</script>:</p> <script type="math/tex; mode=display">Z \equiv X V</script> <p>This transformation is <em>lossless</em>: there is no reduction in dimensionality and we can transform from <script type="math/tex">Z</script> back to <script type="math/tex">X</script> by right-multiplying by <script type="math/tex">V^{-1} = V^\intercal</script>.</p> <h3 id="lossy">Lossy</h3> <p>Recalling our motivation for studying PCA, we need a <em>lossy</em> transformation that reduces the dimensionality of the space. First, we pick the target dimension <script type="math/tex">% <![CDATA[ k < p %]]></script>. Since we would like to keep as much of the original variance as possible, we transform our points by right-multiplying by <script type="math/tex">V_k</script>, the <script type="math/tex">p \times k</script> matrix whose columns are the first <script type="math/tex">k</script> principal components of <script type="math/tex">X</script>:</p> <script type="math/tex; mode=display">Z_k \equiv X V_k.</script>MotivationHypothesis testing for mathematicians2019-11-02T20:00:00+00:002019-11-02T20:00:00+00:00http://parsiad.ca/blog/2019/hypothesis-testing-for-mathematicians<p>Most introductions to hypothesis testing are targeted at non-mathematicians. This short post aims to be a precise introduction to the subject for mathematicians.</p> <p><em>Remark</em>. While the presentation may differ, some of the notation in this article is from <a href="https://doi.org/10.1007/978-0-387-21736-9">L. 
Wasserman’s <em>All of Statistics: a Concise Course in Statistical Inference</em></a>.</p> <p>Consider a parametric model with parameter set <script type="math/tex">\Theta</script>. The model generates realizations <script type="math/tex">X_1, \ldots, X_n</script>.</p> <p><strong>Example (Coin Flip).</strong> We are given a coin. The coin has probability <script type="math/tex">\theta</script> in <script type="math/tex">\Theta \equiv [0, 1]</script> of showing heads. We flip the coin <script type="math/tex">n</script> times and record <script type="math/tex">X_i = 1</script> if the <script type="math/tex">i</script>-th flip is heads and <script type="math/tex">0</script> otherwise.</p> <p>Throughout this article, we use the above coin flip model to illustrate the ideas.</p> <p>In hypothesis testing, we start with a <em>hypothesis</em> (also called the <em>null hypothesis</em>). Specifying a null hypothesis is equivalent to picking some nonempty subset <script type="math/tex">\Theta_0</script> of the parameter set <script type="math/tex">\Theta</script>. Precisely, the null hypothesis is the assumption that realizations are being generated by the model parameterized by some <script type="math/tex">\theta</script> in <script type="math/tex">\Theta_0</script>.</p> <p><strong>Example (Coin Flip).</strong> Our hypothesis is <script type="math/tex">\Theta_0 \equiv \{ 1 / 2 \}</script>. That is, we hypothesize that the coin is fair.</p> <p>For brevity, let <script type="math/tex">X \equiv (X_1, \ldots, X_n)</script>. 
To specify when the null hypothesis is rejected, we define a <em>rejection function</em> <script type="math/tex">R</script> such that <script type="math/tex">R(X)</script> is an indicator random variable whose unit value corresponds to rejection.</p> <p><strong>Example (Coin Flip).</strong> Let</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{equation} R(x_1, \ldots, x_n) \equiv \begin{cases} 1, & \text{if } \left| \left(x_1 + \cdots + x_n \right) / n - 1 / 2 \right| \geq \epsilon \\ 0, & \text{otherwise}. \end{cases} \end{equation} %]]></script> <p>This corresponds to rejecting the null hypothesis whenever we see “significantly” more heads than tails (or vice versa). Our notion of significance is controlled by <script type="math/tex">\epsilon</script>.</p> <p>Note that nothing stops us from making a bad test. For example, taking <script type="math/tex">\epsilon = 0</script> in the above example yields a test that always rejects. Conversely, taking <script type="math/tex">\epsilon > 1/2</script> yields a test that never rejects.</p> <p><strong>Definition (Power).</strong> The <em>power</em></p> <script type="math/tex; mode=display">\begin{equation} \operatorname{Power}(\theta, R) \equiv \mathbb{P}_\theta \left\{ R(X) = 1 \right\} \end{equation}</script> <p>gives the probability of rejection assuming that the true model parameter is <script type="math/tex">\theta</script>.</p> <p><strong>Example (Coin Flip).</strong> Let <script type="math/tex">F_\theta</script> denote the CDF of a binomial distribution with <script type="math/tex">n</script> trials and success probability <script type="math/tex">\theta</script>. Let <script type="math/tex">S \equiv X_1 + \cdots + X_n</script>. 
Then, assuming <script type="math/tex">\epsilon</script> is positive,</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} \operatorname{Power}(\theta,R) & = 1 - \mathbb{P}_{\theta} \left\{ \left|S/n-1/2\right| < \epsilon \right\} \\ & = 1 - \mathbb{P}_{\theta} \left\{ n/2-\epsilon n < S < n/2+\epsilon n \right\} \\ & = 1 - F_\theta(\left(n/2+\epsilon n\right)-) + F_\theta(n/2-\epsilon n) \end{align*} %]]></script> <p>where <script type="math/tex">F(x-) = \lim_{y \uparrow x} F(y)</script> is a <a href="https://en.wikipedia.org/wiki/One-sided_limit">left-hand limit</a>.</p> <p><img src="/assets/img/hypothesis-testing-for-mathematicians/power.png" alt="" /></p> <p><strong>Definition (Size).</strong> The <em>size</em> of a test</p> <script type="math/tex; mode=display">\begin{equation} \operatorname{Size}(R) \equiv \sup \left \{ \operatorname{Power}(\theta, R) \colon \theta \in \Theta_0 \right \} \end{equation}</script> <p>gives, assuming that the null hypothesis is true, the “worst-case” probability of rejection.</p> <p>Rejecting the null hypothesis erroneously is called a <em>type I error</em> (see the table below). The size puts an upper bound on the probability of making a type I error.</p> <table> <thead> <tr> <th> </th> <th>Retain Null</th> <th>Reject Null</th> </tr> </thead> <tbody> <tr> <td>Null Hypothesis is True</td> <td>No error</td> <td>Type I error</td> </tr> <tr> <td>Null Hypothesis is False</td> <td>Type II error</td> <td>No error</td> </tr> </tbody> </table> <p><strong>Example (Coin Flip).</strong> Since <script type="math/tex">\Theta_0 = \{ 1 / 2 \}</script> is a singleton, <script type="math/tex">\operatorname{Size}(R) = \operatorname{Power}(1/2, R)</script>.</p> <p><strong>Definition (p-value).</strong> Let <script type="math/tex">(R_\alpha)_\alpha</script> be a collection of rejection functions. 
Define</p> <script type="math/tex; mode=display">\begin{equation} \operatorname{p-value} \equiv \inf \left\{ \operatorname{Size}(R_\alpha) \colon R_\alpha(X) = 1 \right\} \end{equation}</script> <p>as the smallest size for which the null hypothesis is rejected.</p> <p>Unlike the size, the p-value is itself a random variable. The smaller the p-value, the more confident we can be that a rejection is justified. A common threshold for rejection is a p-value smaller than 0.01. A rejection in this case can be understood as follows: had the null hypothesis been true, a rejection this extreme would have occurred with probability less than 0.01.</p> <p><strong>Theorem 1.</strong> Suppose we have a collection of rejection functions <script type="math/tex">(R_{\alpha})_{\alpha}</script> of the form</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{equation} R_{\alpha}(x_1, \ldots, x_n) \equiv \begin{cases} 1, & \text{if } f(x_1, \ldots, x_n) \geq c_{\alpha} \\ 0, & \text{otherwise} \end{cases} \end{equation} %]]></script> <p>where <script type="math/tex">f</script> does not vary with <script type="math/tex">\alpha</script>. Suppose also that for each point <script type="math/tex">y</script> in the range of <script type="math/tex">f</script>, there exists <script type="math/tex">\alpha</script> such that <script type="math/tex">c_{\alpha} = y</script>. Then,</p> <script type="math/tex; mode=display">\begin{equation} \operatorname{p-value}(\omega) \equiv \sup\left\{ \mathbb{P}_{\theta} \left\{ f(X) \geq f(X(\omega)) \right\} \colon \theta \in \Theta_0 \right\}. \end{equation}</script> <p>In other words, the p-value (under the setting of Theorem 1) is the worst-case probability of sampling a value of <script type="math/tex">f(X)</script> at least as large as what was observed, <script type="math/tex">f(X(\omega))</script>. 
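</p>

<p><em>Aside</em>. As a concrete sketch (code of my own, not from the post), the coin-flip p-value of Theorem 1 can be computed exactly by summing the Binomial(<script type="math/tex">n</script>, <script type="math/tex">1/2</script>) PMF over outcomes whose test statistic <script type="math/tex">\left|S/n - 1/2\right|</script> is at least as large as the observed one:</p>

```python
# Sketch: exact two-sided p-value for the coin-flip example, i.e. the
# null probability of observing |S - n/2| at least as large as observed.
from math import comb

def coin_p_value(n, s_obs):
    """p-value for observing s_obs heads in n flips of a (hypothesized) fair coin."""
    k_obs = abs(s_obs - n / 2)
    # Sum the Binomial(n, 1/2) PMF over outcomes at least as extreme.
    extreme = sum(comb(n, s) for s in range(n + 1) if abs(s - n / 2) >= k_obs)
    return extreme / 2 ** n

print(coin_p_value(100, 60))
```

<p>For example, observing 60 heads in 100 flips gives a p-value of roughly 0.057, so the null hypothesis of a fair coin would not be rejected at the 0.01 threshold.</p>

<p>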
Note that in the above, we have used <script type="math/tex">\omega</script> to distinguish between the actual random variable <script type="math/tex">X</script> and <script type="math/tex">X(\omega)</script>, the observation.</p> <p><em>Proof</em>. Note that</p> <script type="math/tex; mode=display">\begin{equation} \operatorname{p-value}(\omega) = \inf \left\{ \sup \left\{ \mathbb{P}_{\theta} \left\{ f(X)\geq c_{\alpha} \right\} \colon \theta \in \Theta_0 \right\} \colon f(X(\omega)) \geq c_{\alpha} \right\}. \end{equation}</script> <p>The result follows from noting that the infimum is achieved at the value of <script type="math/tex">\alpha</script> for which <script type="math/tex">c_{\alpha}=f(X(\omega))</script>. <script type="math/tex">\square</script></p> <p><strong>Example (Coin Flip).</strong> We flip the coin <script type="math/tex">n</script> times and observe <script type="math/tex">S(\omega)</script> heads. By Theorem 1,</p> <script type="math/tex; mode=display">\begin{equation} \operatorname{p-value}(\omega) = \mathbb{P}_{1/2} \left\{ \left|S/n - 1/2\right| \geq \left|S(\omega)/n - 1/2\right| \right\}. \end{equation}</script> <p>Denoting by <script type="math/tex">K(\omega) = S(\omega) - n/2</script>,</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} \operatorname{p-value}(\omega) & = 1 - \mathbb{P}_{1/2} \left\{ n/2 - \left|K(\omega)\right| < S < n/2 + \left|K(\omega)\right| \right\} \\ & = 1 - F_{1/2}(\left(n/2 + \left|K(\omega)\right|\right)-) + F_{1/2}(n/2 - \left|K(\omega)\right|). \end{align*} %]]></script> <p><img src="/assets/img/hypothesis-testing-for-mathematicians/p_value.png" alt="" /></p> <p><strong>Theorem 2.</strong> Suppose the setting of Theorem 1 and that, in addition, <script type="math/tex">\Theta_0 = \{\theta_0\}</script> is a singleton and <script type="math/tex">f(X)</script> has a continuous and strictly increasing CDF under <script type="math/tex">\theta_0</script>. 
Then, the p-value has a uniform distribution on <script type="math/tex">[0,1]</script> under <script type="math/tex">\theta_0</script>.</p> <p>In other words, if the null hypothesis is true, the p-value (under the setting of Theorem 2) is uniformly distributed on <script type="math/tex">[0, 1]</script>.</p> <p><em>Proof</em>. Denote by <script type="math/tex">G</script> the CDF of <script type="math/tex">f(X)</script> under <script type="math/tex">\theta_0</script>. First, note that</p> <script type="math/tex; mode=display">\begin{equation} \mathbb{P}_{\theta_0} \left\{ f(X) \geq f(X(\omega)) \right\} = 1 - G[f(X(\omega))]. \end{equation}</script> <p>Then,</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} \mathbb{P}_{\theta_0} \left\{ \omega \colon \operatorname{p-value}(\omega) \leq u \right\} & = 1 - \mathbb{P}_{\theta_0} \left\{ \omega \colon G[f(X(\omega))] \leq 1 - u \right\} \\ & = 1 - \mathbb{P}_{\theta_0} \left\{ \omega \colon f(X(\omega)) \leq G^{-1}(1 - u) \right\} \\ & = 1 - G(G^{-1}(1 - u)) = u. \square \end{align*} %]]></script>Most introductions to hypothesis testing are targeted at non-mathematicians. 
This short post aims to be a precise introduction to the subject for mathematicians.Generating finite difference coefficients2018-11-02T20:00:00+00:002018-11-02T20:00:00+00:00http://parsiad.ca/blog/2018/generating-finite-difference-coefficients<p>Let <em>u</em> be a real-valued <em>n</em>-times differentiable function of time.</p> <p>You are given evaluations of this function <em>u(t<sub>0</sub>), …, u(t<sub>n</sub>)</em> at distinct points in time and asked to approximate <em>u<sup>(m)</sup>(t)</em>, the <em>m</em>-th derivative of the function evaluated at some given point <em>t</em> (<em>m≤n</em>).</p> <p>If you have studied numerical methods, you are probably already familiar with how to tackle this problem with what is sometimes referred to as the “method of undetermined coefficients” (or, in equivalent language, by using a <a href="https://en.wikipedia.org/wiki/Lagrange_polynomial">Lagrange interpolating polynomial</a>). In this post, after reviewing the method, we implement it in a few lines of code.</p> <p>Consider approximating the derivative <em>u<sup>(m)</sup>(t)</em> by a linear combination of the observations:</p> <script type="math/tex; mode=display">\boldsymbol{w}^\intercal \boldsymbol{u} \equiv (w_0, \ldots, w_n) (u(t_0), \ldots, u(t_n))^\intercal = w_0 u(t_0) + \cdots + w_n u(t_n).</script> <p>Taylor expanding each term around <em>t</em>,</p> <script type="math/tex; mode=display">\boldsymbol{w}^\intercal \boldsymbol{u} = \sum_{k = 0}^n w_k \left( u(t) + u^\prime(t) \left(t_k - t\right) + \cdots + u^{(n)}(t) \frac{\left(t_k - t\right)^{n}}{n!} + O(\left(t_k - t\right)^{n+1}) \right).</script> <p>Rearranging the resulting expression,</p> <script type="math/tex; mode=display">\boldsymbol{w}^\intercal \boldsymbol{u} = O(\max_k \left|t_k - t\right|^{n+1} \Vert \boldsymbol{w} \Vert_\infty) + \sum_{k = 0}^{n} u^{(k)}(t) \left( w_0 \frac{\left(t_0 - t\right)^k}{k!} + \cdots + w_n \frac{\left(t_n - t\right)^k}{k!} \right).</script> <p>The form above makes 
it clear that the original linear combination is nothing more than an error term (represented in Big O notation) along with a weighted sum of the derivatives of <em>u</em> evaluated at <em>t</em>. Since we are interested only in the <em>m</em>-th derivative, we would like to pick the weights <strong><em>w</em></strong> such that the coefficient of <em>u<sup>(k)</sup>(t)</em> is 1 whenever <em>k=m</em> and 0 otherwise. This suggests solving the linear system</p> <p><img src="/assets/img/generating-finite-difference-coefficients/system.png" alt="" /></p> <p>The matrix on the left hand side is a <a href="https://en.wikipedia.org/wiki/Vandermonde_matrix">Vandermonde matrix</a>, and hence this system has a unique solution. Denoting by <strong><em>v</em></strong> the solution of this system, we have</p> <script type="math/tex; mode=display">u^{(m)}(t) = \boldsymbol{v}^\intercal \boldsymbol{u} + O(\max_k \left|t_k - t\right|^{n+1} \Vert \boldsymbol{v} \Vert_\infty).</script> <h2 id="example-application-backward-differentiation-formula-bdf">Example application: backward differentiation formula (BDF)</h2> <p>As an application, consider the case in which we want to compute the first derivative of the function (<em>m=1</em>) and the observations are made at the points <em>t<sub>k</sub>=t-kh</em> where <em>h</em> is some positive constant. 
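</p>

<p><em>Aside</em>. Before specializing, the general recipe of the previous section can be sketched in a few lines (my code, not the post's): build the matrix with entries <em>(t<sub>k</sub>-t)<sup>i</sup>/i!</em> and solve against the right hand side whose only nonzero entry is a 1 in position <em>m</em>.</p>

```python
# Sketch: weights w such that u^(m)(t) ~= w[0]*u(ts[0]) + ... + w[n]*u(ts[n]),
# obtained by solving the (Vandermonde-style) system from the text.
import math
import numpy as np

def fd_weights(m, t, ts):
    """Finite difference weights for the m-th derivative at t from points ts."""
    n = len(ts) - 1
    A = np.array([[(tk - t) ** i / math.factorial(i) for tk in ts]
                  for i in range(n + 1)])
    b = np.zeros(n + 1)
    b[m] = 1.0  # coefficient of u^(m)(t) should be 1; all others 0
    return np.linalg.solve(A, b)

# Sanity check: first derivative at t = 0 from the points {-0.5, 0, 0.5}.
w = fd_weights(1, 0.0, [-0.5, 0.0, 0.5])
print(w)  # the familiar central difference (u(t+h) - u(t-h)) / (2h) with h = 0.5
```

<p>The same routine reproduces the BDF coefficients tabulated below; for instance, <code class="language-plaintext highlighter-rouge">fd_weights(1, 0.0, [0.0, -1.0, -2.0])</code> recovers the <em>n=2</em> column (with <em>h=1</em>).</p>

<p>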
In this case, the linear system simplifies significantly:</p> <p><img src="/assets/img/generating-finite-difference-coefficients/bdf_system.png" alt="" /></p> <p>For each value of <em>n</em>, we can solve the above linear system to obtain the coefficients:</p> <table> <thead> <tr> <th> </th> <th><em>n=1</em></th> <th><em>n=2</em></th> <th><em>n=3</em></th> <th><em>n=4</em></th> <th><em>n=5</em></th> </tr> </thead> <tbody> <tr> <td><em>hw<sub>0</sub></em></td> <td>1</td> <td>3/2</td> <td>11/6</td> <td>25/12</td> <td>137/60</td> </tr> <tr> <td><em>hw<sub>1</sub></em></td> <td>-1</td> <td>-2</td> <td>-3</td> <td>-4</td> <td>-5</td> </tr> <tr> <td><em>hw<sub>2</sub></em></td> <td> </td> <td>1/2</td> <td>3/2</td> <td>3</td> <td>5</td> </tr> <tr> <td><em>hw<sub>3</sub></em></td> <td> </td> <td> </td> <td>-1/3</td> <td>-4/3</td> <td>-10/3</td> </tr> <tr> <td><em>hw<sub>4</sub></em></td> <td> </td> <td> </td> <td> </td> <td>1/4</td> <td>5/4</td> </tr> <tr> <td><em>hw<sub>5</sub></em></td> <td> </td> <td> </td> <td> </td> <td> </td> <td>-1/5</td> </tr> </tbody> </table> <p>As an example of how to read the above table, the third column (<em>n=3</em>) tells us</p> <script type="math/tex; mode=display">h u^\prime(t) = 11/6 \cdot u(t) - 3 \cdot u(t - h) + 3/2 \cdot u(t - 2h) - 1/3 \cdot u(t - 3h) + O(h^4).</script> <p>The table was generated by the following piece of code:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># bdf.py </span> <span class="kn">import</span> <span class="nn">fractions</span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span> <span class="k">def</span> <span class="nf">bdf</span><span class="p">(</span><span class="n">n</span><span class="p">):</span> <span class="s">"""Creates the coefficient vector for a BDF formula of order n. Args: n: A positive integer. Returns: A (one-dimensional) numpy array of coefficients hw. 
Denoting by h a positive step size, the derivative is approximated by (hw[0] * u(t) + hw[1] * u(t-h) + ... + hw[n] * u(t-nh)) / h where u is some real-valued, real-input callable. """</span> <span class="n">A</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">vander</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="o">+</span><span class="mi">1</span><span class="p">),</span> <span class="n">increasing</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span><span class="o">.</span><span class="n">transpose</span><span class="p">()</span> <span class="n">b</span> <span class="o">=</span> <span class="p">[(</span><span class="mi">1</span> <span class="o">-</span> <span class="mi">2</span> <span class="o">*</span> <span class="p">(</span><span class="n">k</span> <span class="o">%</span> <span class="mi">2</span><span class="p">))</span> <span class="o">*</span> <span class="nb">int</span><span class="p">(</span><span class="n">k</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="o">+</span><span class="mi">1</span><span class="p">)]</span> <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">solve</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span> <span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">set_printoptions</span><span class="p">(</span><span class="n">formatter</span><span class="o">=</span><span
class="p">{</span> <span class="s">'all'</span><span class="p">:</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">str</span><span class="p">(</span><span class="n">fractions</span><span class="o">.</span><span class="n">Fraction</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="n">limit_denominator</span><span class="p">())</span> <span class="p">})</span> <span class="k">for</span> <span class="n">n</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">6</span><span class="p">):</span> <span class="k">print</span><span class="p">(</span><span class="s">'BDF {}: {}'</span><span class="o">.</span><span class="nb">format</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">bdf</span><span class="p">(</span><span class="n">n</span><span class="p">)))</span> </code></pre></div></div> <p>As a crude sanity check, we can verify that the <em>n</em>-th BDF formula applied to <em>e<sup>x</sup></em> becomes a better approximation of <em>e<sup>x</sup></em> as <em>n</em> increases (recall that the exponential function is its own derivative):</p> <p><img src="/assets/img/generating-finite-difference-coefficients/bdf_exp.png" alt="" /></p> <p>The code to generate the plot is given below:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span> <span class="kn">from</span> <span class="nn">bdf</span> <span class="kn">import</span> <span class="n">bdf</span> <span class="n">a</span> <span class="o">=</span> <span class="mf">0.</span> <span 
class="n">b</span> <span class="o">=</span> <span class="mf">10.</span> <span class="n">N</span> <span class="o">=</span> <span class="mi">10</span> <span class="n">h</span> <span class="o">=</span> <span class="p">(</span><span class="n">b</span> <span class="o">-</span> <span class="n">a</span><span class="p">)</span> <span class="o">/</span> <span class="n">N</span> <span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">N</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span> <span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="n">plt</span><span class="o">.</span><span class="n">semilogy</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"Exponential function"</span><span class="p">)</span> <span class="k">for</span> <span class="n">n</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">4</span><span class="p">):</span> <span class="n">approx_exp</span> <span class="o">=</span> <span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">convolve</span><span class="p">(</span><span class="n">bdf</span><span class="p">(</span><span class="n">n</span><span class="p">),</span> <span class="n">y</span><span class="p">)</span> <span class="o">/</span> <span class="n">h</span><span class="p">)[</span><span class="n">n</span><span class="p">:</span><span class="o">-</span><span class="n">n</span><span 
class="p">]</span> <span class="n">plt</span><span class="o">.</span><span class="n">semilogy</span><span class="p">(</span> <span class="n">x</span><span class="p">[</span><span class="n">n</span><span class="p">:],</span> <span class="n">approx_exp</span><span class="p">,</span> <span class="s">'-x'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"BDF {} approximation"</span><span class="o">.</span><span class="nb">format</span><span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="p">)</span> <span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span> <span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span> </code></pre></div></div>Let u be a real-valued n-times differentiable function of time.mlinterp - Fast arbitrary dimension linear interpolation in C++2017-06-19T20:00:00+00:002017-06-19T20:00:00+00:00http://parsiad.ca/blog/2017/mlinterp<p>I made a header-only C++ library for arbitrary dimension <a href="https://en.wikipedia.org/wiki/Linear_interpolation">linear interpolation</a> (a.k.a. multilinear interpolation). 
The design philosophy is to push as much to compile-time as possible by <a href="https://en.wikipedia.org/wiki/Template_metaprogramming">template metaprogramming</a>.</p> <p>Instructions for how to include it in your work are on <a href="https://github.com/parsiad/mlinterp">the GitHub project page</a>.</p> <p>Below are some simple examples of its usage.</p> <h2 id="examples">Examples</h2> <h3 id="1d">1d</h3> <p>Let’s interpolate y = sin(x) on the interval [-pi, pi] using 15 evenly-spaced data points.</p> <div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">using</span> <span class="k">namespace</span> <span class="n">mlinterp</span><span class="p">;</span> <span class="c1">// Boundaries of the interval [-pi, pi]</span> <span class="k">constexpr</span> <span class="kt">double</span> <span class="n">b</span> <span class="o">=</span> <span class="mf">3.14159265358979323846</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="o">-</span><span class="n">b</span><span class="p">;</span> <span class="c1">// Subdivide the interval [-pi, pi] using 15 evenly-spaced points and</span> <span class="c1">// evaluate sin(x) at each of those points</span> <span class="k">constexpr</span> <span class="kt">int</span> <span class="n">nxd</span> <span class="o">=</span> <span class="mi">15</span><span class="p">,</span> <span class="n">nd</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span> <span class="n">nxd</span> <span class="p">};</span> <span class="kt">double</span> <span class="n">xd</span><span class="p">[</span><span class="n">nxd</span><span class="p">];</span> <span class="kt">double</span> <span class="n">yd</span><span class="p">[</span><span class="n">nxd</span><span class="p">];</span> <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">0</span><span 
class="p">;</span> <span class="n">n</span> <span class="o">&lt;</span> <span class="n">nxd</span><span class="p">;</span> <span class="o">++</span><span class="n">n</span><span class="p">)</span> <span class="p">{</span> <span class="n">xd</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="o">=</span> <span class="n">a</span> <span class="o">+</span> <span class="p">(</span><span class="n">b</span> <span class="o">-</span> <span class="n">a</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="n">nxd</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="n">n</span><span class="p">;</span> <span class="n">yd</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="o">=</span> <span class="n">sin</span><span class="p">(</span><span class="n">xd</span><span class="p">[</span><span class="n">n</span><span class="p">]);</span> <span class="p">}</span> <span class="c1">// Subdivide the interval [-pi, pi] using 100 evenly-spaced points</span> <span class="c1">// (these are the points at which we interpolate)</span> <span class="k">constexpr</span> <span class="kt">int</span> <span class="n">ni</span> <span class="o">=</span> <span class="mi">100</span><span class="p">;</span> <span class="kt">double</span> <span class="n">xi</span><span class="p">[</span><span class="n">ni</span><span class="p">];</span> <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">n</span> <span class="o">&lt;</span> <span class="n">ni</span><span class="p">;</span> <span class="o">++</span><span class="n">n</span><span class="p">)</span> <span class="p">{</span> <span class="n">xi</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span 
class="o">=</span> <span class="n">a</span> <span class="o">+</span> <span class="p">(</span><span class="n">b</span> <span class="o">-</span> <span class="n">a</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="n">ni</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="n">n</span><span class="p">;</span> <span class="p">}</span> <span class="c1">// Perform the interpolation</span> <span class="kt">double</span> <span class="n">yi</span><span class="p">[</span><span class="n">ni</span><span class="p">];</span> <span class="c1">// Result is stored in this buffer</span> <span class="n">interp</span><span class="p">(</span> <span class="n">nd</span><span class="p">,</span> <span class="n">ni</span><span class="p">,</span> <span class="c1">// Number of points</span> <span class="n">yd</span><span class="p">,</span> <span class="n">yi</span><span class="p">,</span> <span class="c1">// Output axis (y)</span> <span class="n">xd</span><span class="p">,</span> <span class="n">xi</span> <span class="c1">// Input axis (x)</span> <span class="p">);</span> <span class="c1">// Print the interpolated values</span> <span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="n">scientific</span> <span class="o">&lt;&lt;</span> <span class="n">setprecision</span><span class="p">(</span><span class="mi">8</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="n">showpos</span><span class="p">;</span> <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">n</span> <span class="o">&lt;</span> <span class="n">ni</span><span class="p">;</span> <span class="o">++</span><span class="n">n</span><span class="p">)</span> <span class="p">{</span> <span class="n">cout</span> <span class="o">&lt;&lt;</span> <span 
class="n">xi</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="s">"</span><span class="se">\t</span><span class="s">"</span> <span class="o">&lt;&lt;</span> <span class="n">yi</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="n">endl</span><span class="p">;</span> <span class="p">}</span> </code></pre></div></div> <p><img src="https://raw.githubusercontent.com/parsiad/mlinterp/master/examples/1d.png" alt="" /></p> <h3 id="2d">2d</h3> <p>Let’s interpolate z = sin(x)cos(y) on the interval [-pi, pi] X [-pi, pi] using 15 evenly-spaced points along the x axis and 15 evenly-spaced points along the y axis.</p> <div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">using</span> <span class="k">namespace</span> <span class="n">mlinterp</span><span class="p">;</span> <span class="c1">// Boundaries of the interval [-pi, pi]</span> <span class="k">constexpr</span> <span class="kt">double</span> <span class="n">b</span> <span class="o">=</span> <span class="mf">3.14159265358979323846</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="o">-</span><span class="n">b</span><span class="p">;</span> <span class="c1">// Discretize the set [-pi, pi] X [-pi, pi] using 15 evenly-spaced</span> <span class="c1">// points along the x axis and 15 evenly-spaced points along the y axis</span> <span class="c1">// and evaluate sin(x)cos(y) at each of those points</span> <span class="k">constexpr</span> <span class="kt">int</span> <span class="n">nxd</span> <span class="o">=</span> <span class="mi">15</span><span class="p">,</span> <span class="n">nyd</span> <span class="o">=</span> <span class="mi">15</span><span class="p">,</span> <span class="n">nd</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span> <span 
class="n">nxd</span><span class="p">,</span> <span class="n">nyd</span> <span class="p">};</span> <span class="kt">double</span> <span class="n">xd</span><span class="p">[</span><span class="n">nxd</span><span class="p">];</span> <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">nxd</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span> <span class="n">xd</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">a</span> <span class="o">+</span> <span class="p">(</span><span class="n">b</span> <span class="o">-</span> <span class="n">a</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="n">nxd</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="n">i</span><span class="p">;</span> <span class="p">}</span> <span class="kt">double</span> <span class="n">yd</span><span class="p">[</span><span class="n">nyd</span><span class="p">];</span> <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="n">nyd</span><span class="p">;</span> <span class="o">++</span><span class="n">j</span><span class="p">)</span> <span class="p">{</span> <span class="n">yd</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">a</span> <span class="o">+</span> <span class="p">(</span><span class="n">b</span> <span class="o">-</span> <span class="n">a</span><span class="p">)</span> <span class="o">/</span> <span 
class="p">(</span><span class="n">nyd</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="n">j</span><span class="p">;</span> <span class="p">}</span> <span class="kt">double</span> <span class="n">zd</span><span class="p">[</span><span class="n">nxd</span> <span class="o">*</span> <span class="n">nyd</span><span class="p">];</span> <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">nxd</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span> <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="n">nyd</span><span class="p">;</span> <span class="o">++</span><span class="n">j</span><span class="p">)</span> <span class="p">{</span> <span class="k">const</span> <span class="kt">int</span> <span class="n">n</span> <span class="o">=</span> <span class="n">j</span> <span class="o">+</span> <span class="n">i</span> <span class="o">*</span> <span class="n">nyd</span><span class="p">;</span> <span class="n">zd</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="o">=</span> <span class="n">sin</span><span class="p">(</span><span class="n">xd</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="o">*</span> <span class="n">cos</span><span class="p">(</span><span class="n">yd</span><span class="p">[</span><span class="n">j</span><span class="p">]);</span> <span class="p">}</span> <span class="p">}</span> <span class="c1">// Subdivide the set [-pi, pi] X [-pi, pi] using 100 
evenly-spaced</span> <span class="c1">// points along the x axis and 100 evenly-spaced points along the y axis</span> <span class="c1">// (these are the points at which we interpolate)</span> <span class="k">constexpr</span> <span class="kt">int</span> <span class="n">m</span> <span class="o">=</span> <span class="mi">100</span><span class="p">,</span> <span class="n">ni</span> <span class="o">=</span> <span class="n">m</span> <span class="o">*</span> <span class="n">m</span><span class="p">;</span> <span class="kt">double</span> <span class="n">xi</span><span class="p">[</span><span class="n">ni</span><span class="p">];</span> <span class="kt">double</span> <span class="n">yi</span><span class="p">[</span><span class="n">ni</span><span class="p">];</span> <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">m</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span> <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="n">m</span><span class="p">;</span> <span class="o">++</span><span class="n">j</span><span class="p">)</span> <span class="p">{</span> <span class="k">const</span> <span class="kt">int</span> <span class="n">n</span> <span class="o">=</span> <span class="n">j</span> <span class="o">+</span> <span class="n">i</span> <span class="o">*</span> <span class="n">m</span><span class="p">;</span> <span class="n">xi</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="o">=</span> <span class="n">a</span> <span class="o">+</span> <span class="p">(</span><span 
class="n">b</span> <span class="o">-</span> <span class="n">a</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="n">m</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="n">i</span><span class="p">;</span> <span class="n">yi</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="o">=</span> <span class="n">a</span> <span class="o">+</span> <span class="p">(</span><span class="n">b</span> <span class="o">-</span> <span class="n">a</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="n">m</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="n">j</span><span class="p">;</span> <span class="p">}</span> <span class="p">}</span> <span class="c1">// Perform the interpolation</span> <span class="kt">double</span> <span class="n">zi</span><span class="p">[</span><span class="n">ni</span><span class="p">];</span> <span class="c1">// Result is stored in this buffer</span> <span class="n">interp</span><span class="p">(</span> <span class="n">nd</span><span class="p">,</span> <span class="n">ni</span><span class="p">,</span> <span class="c1">// Number of points</span> <span class="n">zd</span><span class="p">,</span> <span class="n">zi</span><span class="p">,</span> <span class="c1">// Output axis (z)</span> <span class="n">xd</span><span class="p">,</span> <span class="n">xi</span><span class="p">,</span> <span class="n">yd</span><span class="p">,</span> <span class="n">yi</span> <span class="c1">// Input axes (x and y)</span> <span class="p">);</span> <span class="c1">// Print the interpolated values</span> <span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="n">scientific</span> <span class="o">&lt;&lt;</span> <span class="n">setprecision</span><span class="p">(</span><span class="mi">8</span><span 
class="p">)</span> <span class="o">&lt;&lt;</span> <span class="n">showpos</span><span class="p">;</span> <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">n</span> <span class="o">&lt;</span> <span class="n">ni</span><span class="p">;</span> <span class="o">++</span><span class="n">n</span><span class="p">)</span> <span class="p">{</span> <span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="n">xi</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="s">"</span><span class="se">\t</span><span class="s">"</span> <span class="o">&lt;&lt;</span> <span class="n">yi</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="s">"</span><span class="se">\t</span><span class="s">"</span> <span class="o">&lt;&lt;</span> <span class="n">zi</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="n">endl</span><span class="p">;</span> <span class="p">}</span> </code></pre></div></div> <p><img src="https://raw.githubusercontent.com/parsiad/mlinterp/master/examples/2d.png" alt="" /></p>I made a header-only C++ library for arbitrary dimension linear interpolation (a.k.a. multilinear interpolation). The design philosophy is to push as much to compile-time as possible by template metaprogramming.Octave Financial 0.5.0 released2016-02-02T20:00:00+00:002016-02-02T20:00:00+00:00http://parsiad.ca/blog/2016/octave-financial-0-5-0-released<p>I am happy to announce the release of the GNU Octave Financial package version 0.5.0. 
This is the <em>first</em> release since I took on the role of maintainer.</p> <p>If you do not already have GNU Octave, you can <a href="https://www.gnu.org/software/octave/download.html">grab a free copy here</a>.</p> <p>To install the package, launch Octave and run the following commands:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pkg install -forge io
pkg install -forge financial
</code></pre></div></div> <p>Perhaps the most exciting addition in this version is the Monte Carlo simulation framework, which is significantly faster than its MATLAB counterpart. A brief tutorial (along with benchmarking information) is available in <a href="/blog/2015/sdes-and-monte-carlo-in-octave-financial/">a previous post</a>. Other additions include Black-Scholes option and Greeks valuation routines, implied volatility calculations, and general bug fixes. Some useful links for GNU Octave Financial are below:</p> <ul> <li><a href="http://octave.sourceforge.net/financial/index.html">Home page</a></li> <li><a href="http://octave.sourceforge.net/financial/overview.html">Documentation</a></li> <li><a href="http://octave.sourceforge.net/financial/NEWS.html">News</a></li> </ul>I am happy to announce the release of the GNU Octave Financial package version 0.5.0. This is the first release since I took on the role of maintainer.