Most introductions to hypothesis testing are targeted at non-mathematicians. This short post aims to be a precise introduction to the subject for mathematicians.

*Remark*. While the presentation may differ, some of the notation in this article is from L. Wasserman’s *All of Statistics: a Concise Course in Statistical Inference*.

Consider a parametric model with parameter set $\Theta$. The model generates realizations $X_1, \ldots, X_n$.

**Example (Coin Flip).**
*We are given a coin.
The coin has probability $\theta$ in $\Theta \equiv [0, 1]$ of showing heads.
We flip the coin $n$ times and record $X_i = 1$ if the $i$-th flip is heads and $0$ otherwise.*

Throughout this article, we use the above coin flip model to illustrate the ideas.

In hypothesis testing, we start with a *hypothesis* (also called the *null hypothesis*).
Specifying a null hypothesis is equivalent to picking some nonempty subset $\Theta_0$ of the parameter set $\Theta$.
Precisely, the null hypothesis is the assumption that realizations are being generated by the model parameterized by some $\theta$ in $\Theta_0$.

**Example (Coin Flip).**
*Our hypothesis is $\Theta_0 \equiv \{ 1 / 2 \}$.
That is, we hypothesize that the coin is fair.*

For brevity, let $X \equiv (X_1, \ldots, X_n)$.
To specify when the null hypothesis is rejected, we define a *rejection function* $R$ such that $R(X)$ is an indicator random variable whose unit value corresponds to rejection.

**Example (Coin Flip).**
*Let
\begin{equation}
R(x_1, \ldots, x_n)
= \left[
\left| \frac{x_1 + \cdots + x_n}{n} - \frac{1}{2} \right| \geq \epsilon
\right]
\end{equation}
where $[\cdot]$ is the Iverson bracket.
This corresponds to rejecting the null hypothesis whenever we see “significantly” more heads than tails (or vice versa). Our notion of significance is controlled by $\epsilon$.*

Note that nothing stops us from making a bad test. For example, taking $\epsilon = 0$ in the above example yields a test that always rejects. Conversely, taking $\epsilon > 1/2$ yields a test that never rejects.

**Definition (Power).**
*The power
\begin{equation}
\operatorname{Power}(\theta, R)
\equiv \mathbb{P}_\theta \{ R(X) = 1 \}
\end{equation}
gives the probability of rejection assuming that the true model parameter is $\theta$.*

**Example (Coin Flip).**
*Let $F_\theta$ denote the CDF of a binomial distribution with $n$ trials and success probability $\theta$.
Let $S \equiv X_1 + \cdots + X_n$.
Then, assuming $\epsilon$ is positive,
\begin{equation}
\operatorname{Power}(\theta,R)
= 1 - \mathbb{P}_{\theta}\{\left|S/n-1/2\right| < \epsilon\}
= 1 - \mathbb{P}_{\theta}\{n/2-\epsilon n < S < n/2+\epsilon n\}
= 1 - F_\theta(\left(n/2+\epsilon n\right)-) + F_\theta(n/2-\epsilon n)
\end{equation}
where $F(x-) = \lim_{y \uparrow x} F(y)$ is a left-hand limit.*

**Definition (Size).**
*The size of a test
\begin{equation}
\operatorname{Size}(R) \equiv \sup_{\theta \in \Theta_0} \operatorname{Power}(\theta, R)
\end{equation}
gives, assuming that the null hypothesis is true, the “worst-case” probability of rejection.*

Rejecting the null hypothesis errenously is called a *type I error* (see the table below).
The size puts an upper bound on making a type I error.

Retain Null | Reject Null | |
---|---|---|

Null Hypothesis is True | No error | Type I error |

Null Hypothesis is False | Type II error | No error |

**Example (Coin Flip).**
*Since $\Theta_0 = \{ 1 / 2 \}$ is a singleton, $\operatorname{Size}(R) = \operatorname{Power}(1/2, R)$.*

**Definition (p-value).**
*Let $(R_\alpha)_\alpha$ be a collection of rejection functions.
Define
\begin{equation}
\operatorname{p-value}
\equiv \inf \left\{ \operatorname{Size}(R_\alpha) \colon R_\alpha(X) = 1 \right\}.
\end{equation}
as the smallest size for which the null-hypothesis is rejected.*

Unlike the size, the p-value is itself a random variable. The smaller the p-value, the more confident we can be that a rejection is justified. A common threshold for rejection is a p-value smaller than 0.01. A rejection in this case can be understood as being at least 99% certain the rejection was not done erroneously.

**Theorem 1.**
*Suppose we have a collection of rejection functions $(R_{\alpha})_{\alpha}$ of the form
\begin{equation}
R_{\alpha}(x_1, \ldots, x_n)
= [f(x_1, \ldots, x_n) \geq c_{\alpha}]
\end{equation}
where $f$ does not vary with $\alpha$.
Suppose also that for each point $y$ in the range of $f$, there exists $\alpha$ such that $c_{\alpha} = y$.
Then,
\begin{equation}
\operatorname{p-value}(\omega)
\equiv
\sup_{\theta \in \Theta_0} \mathbb{P}_{\theta} \{
f(X) \geq f(X(\omega))
\}.
\end{equation}
In other words, the p-value (under the setting of Theorem 1) is the worst-case probability of sampling $f(X)$ larger than what was observed, $f(X(\omega))$.
Note that in the above, we have used $\omega$ to distinguish between the actual random variable $X$ and $X(\omega)$, the observation.*

*Proof*.
Recall that the p-value is an infimum among sizes of tests that reject the observed data.
The infimum is achieved at $c_\alpha = f(X(\omega))$.
This is because a larger value of $c_\alpha$ yields a test that accepts while a smaller value yields a test that rejects at least as many outcomes, thereby potentially having larger size.
The size under the choice of $c_\alpha = f(X(\omega))$ is exactly the expression for the p-value given above. $\square$

**Example (Coin Flip).**
*We flip the coin $n$ times and observe $S(\omega)$ heads.
By Theorem 1,
\begin{equation}
\operatorname{p-value}(\omega)
= \mathbb{P}_{1/2} \left\{
\left|S/n - 1/2\right| \geq
\left|S(\omega)/n - 1/2\right|
\right\}.
\end{equation}
Denoting by $K(\omega) = |S(\omega) - n/2|$,
\begin{equation}
\operatorname{p-value}(\omega)
= 1 - \mathbb{P}_{1/2} \left\{
n/2 - K(\omega) < S < n/2 + K(\omega)
\right\}
= 1 - F_{1/2}(\left(n/2 + K(\omega)\right)-) + F_{1/2}(n/2 - K(\omega)).
\end{equation}*

**Theorem 2.**
*Suppose the setting of Theorem 1 and that, in addition, $\Theta_0 = \{\theta_0\}$ is a singleton and $f(X)$ has a continuous and strictly increasing CDF under $\theta_0$. Then, the p-value has a uniform distribution on $[0,1]$ under $\theta_0$.*

In other words, if the null hypothesis is true, the p-value (under the setting of Theorem 2) is uniformly distributed on $[0, 1]$.

*Proof*.
Denote by $G$ the CDF of $f(X)$ under $\theta_0$.
By Theorem 1,
\begin{equation}
\operatorname{p-value}(\omega) = 1 - G[f(X(\omega))].
\end{equation}
Then (omitting the subscript $\theta_0$ for brevity),
\begin{equation}
\mathbb{P} \{
\omega \colon \operatorname{p-value}(\omega) \leq u
\}
= 1 - \mathbb{P} \{
\omega \colon G[f(X(\omega))] \leq
1 - u
\}
= 1 - \mathbb{P} \{
\omega \colon f(X(\omega)) \leq
G^{-1}(1 - u)
\}
= 1 - G(G^{-1}(1 - u)) = u. \square
\end{equation}