Most introductions to hypothesis testing are targeted at non-mathematicians. This short post aims to be a precise introduction to the subject for mathematicians.
Remark. While the presentation may differ, some of the notation in this article is from L. Wasserman’s All of Statistics: A Concise Course in Statistical Inference.
Consider a parametric model with parameter set $\Theta$. The model generates realizations $X$.
Example (Coin Flip). We are given a coin. The coin has probability $p$ in $[0, 1]$ of showing heads. We flip the coin $n$ times and record $X_i = 1$ if the $i$-th flip is heads and $X_i = 0$ otherwise.
Throughout this article, we use the above coin flip model to illustrate the ideas.
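To make the running example concrete, below is a minimal simulation of the coin flip model (the function name and the use of Python’s `random` module are my own illustrative choices):

```python
import random

def flip_coin(n, p, seed=None):
    """Generate realizations X_1, ..., X_n, where X_i = 1 if the
    i-th flip is heads (which occurs with probability p) and
    X_i = 0 otherwise."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]
```

For example, `flip_coin(10, 0.5)` produces one realization of ten fair-coin flips.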
In hypothesis testing, we start with a hypothesis (also called the null hypothesis). Specifying a null hypothesis is equivalent to picking some nonempty subset $\Theta_0$ of the parameter set $\Theta$. Precisely, the null hypothesis is the assumption that realizations are being generated by the model parameterized by some $\theta$ in $\Theta_0$.
Example (Coin Flip). Our hypothesis is $\Theta_0 = \{1/2\}$. That is, we hypothesize that the coin is fair.
For brevity, let $X = (X_1, \ldots, X_n)$. To specify when the null hypothesis is rejected, we define a rejection function $R$ such that $R(X)$ is an indicator random variable whose unit value corresponds to rejection.
Example (Coin Flip). Let
$$R(X) = \mathbf{1}\left\{\left|\sum_{i=1}^{n} X_i - \frac{n}{2}\right| \geq c\right\}$$
where $c \geq 0$ is a constant. This corresponds to rejecting the null hypothesis whenever we see “significantly” more heads than tails (or vice versa). Our notion of significance is controlled by $c$.
Note that nothing stops us from making a bad test. For example, taking $c = 0$ in the above example yields a test that always rejects. Conversely, taking $c > n/2$ yields a test that never rejects.
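Assuming, as in the example above, that the test statistic is the deviation of the head count from $n/2$, the rejection function can be sketched as:

```python
def rejects(flips, c):
    """Rejection function R(X): reject (return True) when the number
    of heads deviates from n/2 by at least the threshold c."""
    n, s = len(flips), sum(flips)
    return abs(s - n / 2) >= c
```

Note that `rejects(flips, 0)` is always true and `rejects(flips, c)` with `c > n / 2` is always false, matching the degenerate tests described above.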
Definition (Power). The power
$$\beta(\theta) = \mathbb{P}_\theta(R(X) = 1)$$
gives the probability of rejection assuming that the true model parameter is $\theta$.
Example (Coin Flip). Let $F(\cdot\,; n, p)$ denote the CDF of a binomial distribution with $n$ trials and success probability $p$. Let $S = X_1 + \cdots + X_n$. Then, assuming $c$ is positive,
$$\beta(p) = \mathbb{P}_p\left(S \geq \frac{n}{2} + c\right) + \mathbb{P}_p\left(S \leq \frac{n}{2} - c\right) = 1 - F\left(\left(\frac{n}{2} + c\right)^-; n, p\right) + F\left(\frac{n}{2} - c; n, p\right)$$
where $F(x^-) = \lim_{y \uparrow x} F(y)$ is a left-hand limit.
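The power can also be evaluated by summing the binomial pmf directly over the rejection region, which agrees with the CDF expression. A sketch (the function name is my own choice):

```python
from math import comb

def power(p, n, c):
    """beta(p) = P_p(|S - n/2| >= c), where S = X_1 + ... + X_n is
    Binomial(n, p), computed by summing the pmf over the rejection
    region."""
    return sum(comb(n, s) * p**s * (1 - p)**(n - s)
               for s in range(n + 1) if abs(s - n / 2) >= c)
```

As one would hope, the power grows as the true parameter $p$ moves away from $1/2$.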
Definition (Size). The size of a test
$$\alpha = \sup_{\theta \in \Theta_0} \beta(\theta)$$
gives, assuming that the null hypothesis is true, the “worst-case” probability of rejection.
Rejecting the null hypothesis erroneously is called a type I error (see the table below). The size is an upper bound on the probability of making a type I error.
| | Retain Null | Reject Null |
| --- | --- | --- |
| Null Hypothesis is True | No error | Type I error |
| Null Hypothesis is False | Type II error | No error |
Example (Coin Flip). Since $\Theta_0 = \{1/2\}$ is a singleton, $\alpha = \beta(1/2)$.
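Because the size here is just $\beta(1/2)$, it can be used to calibrate the threshold. A sketch (the target size 0.05 and $n = 20$ are arbitrary choices for illustration):

```python
from math import comb

def size(n, c):
    """Size of the coin-flip test: beta(1/2), since Theta_0 = {1/2}."""
    return sum(comb(n, s) for s in range(n + 1) if abs(s - n / 2) >= c) / 2**n

# Smallest integer threshold c whose test has size at most 0.05:
n = 20
c = next(c for c in range(n + 1) if size(n, c) <= 0.05)
```

The size is nonincreasing in `c`, so the first threshold found is the least conservative test meeting the target.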
Definition (p-value). Let $\{R_\alpha\}_{\alpha \in (0, 1)}$ be a collection of rejection functions such that $R_\alpha$ has size $\alpha$. Define
$$p = \inf\{\alpha \colon R_\alpha(X) = 1\}$$
as the smallest size at which the null hypothesis is rejected.
Unlike the size, the p-value is itself a random variable. The smaller the p-value, the more confident we can be that a rejection is justified. A common threshold for rejection is a p-value smaller than 0.01: if the null hypothesis is true, the probability of rejecting it at this threshold is at most 1%.
Theorem 1. Suppose we have a collection of rejection functions of the form
$$R_\alpha(X) = \mathbf{1}\{T(X) \geq c_\alpha\}$$
where the statistic $T$ does not vary with $\alpha$. Suppose also that for each point $t$ in the range of $T$, there exists $\alpha$ such that $c_\alpha = t$. Then,
$$p = \sup_{\theta \in \Theta_0} \mathbb{P}_\theta\bigl(T(\widetilde{X}) \geq T(X)\bigr).$$
In other words, the p-value (under the setting of Theorem 1) is the worst-case probability of sampling larger than what was observed, $T(X)$. Note that in the above, we have used $\widetilde{X}$ to distinguish between the actual random variable and $X$, the observation.
Proof. Note that
$$p = \inf\{\alpha \colon R_\alpha(X) = 1\} = \inf\{\alpha \colon T(X) \geq c_\alpha\}.$$
The result follows from noting that the infimum is achieved at the value of $\alpha$ for which $c_\alpha = T(X)$; since $R_\alpha$ has size $\alpha$, this value is $\sup_{\theta \in \Theta_0} \mathbb{P}_\theta(T(\widetilde{X}) \geq T(X))$.
Example (Coin Flip). We flip the coin $n$ times and observe $s$ heads. By Theorem 1 (with $T(X) = |S - n/2|$),
$$p = \mathbb{P}_{1/2}\left(\left|\widetilde{S} - \frac{n}{2}\right| \geq \left|s - \frac{n}{2}\right|\right).$$
Denoting $|s - n/2|$ by $t$ and assuming $t$ is positive,
$$p = 1 - F\left(\left(\frac{n}{2} + t\right)^-; n, \frac{1}{2}\right) + F\left(\frac{n}{2} - t; n, \frac{1}{2}\right).$$
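Assuming the same test statistic $T(X) = |S - n/2|$, the p-value can be computed exactly under the fair-coin null by summing the binomial pmf (a sketch; the function name is my own):

```python
from math import comb

def p_value(s, n):
    """Two-sided p-value for observing s heads in n flips:
    P_{1/2}(|S' - n/2| >= |s - n/2|) with S' ~ Binomial(n, 1/2)."""
    t = abs(s - n / 2)
    return sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= t) / 2**n
```

For example, observing 8 heads in 10 flips gives a p-value of 0.109375, far too large to reject at the 1% threshold.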
Theorem 2. Suppose the setting of Theorem 1 and that, in addition, $\Theta_0 = \{\theta_0\}$ is a singleton and $T(X)$ has a continuous and strictly increasing CDF under $\theta_0$. Then, the p-value has a uniform distribution on $(0, 1)$ under $\theta_0$.
In other words, if the null hypothesis is true, the p-value (under the setting of Theorem 2) is uniformly distributed on $(0, 1)$.
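The coin-flip statistic is discrete, so Theorem 2 does not apply to it directly, but the conclusion can be checked numerically with a continuous statistic. A sketch, assuming $T \sim \mathrm{Exponential}(1)$ under the null (an arbitrary choice whose CDF is continuous and strictly increasing):

```python
import math
import random

random.seed(0)

# Under the null, T has CDF F(t) = 1 - exp(-t), so the p-value is
# p = P(T' >= T) = 1 - F(T) = exp(-T).
pvals = [math.exp(-random.expovariate(1.0)) for _ in range(100_000)]

# The empirical CDF of the p-values should be close to that of a
# Uniform(0, 1) random variable, i.e. close to the identity on (0, 1).
ecdf = {u: sum(pv <= u for pv in pvals) / len(pvals) for u in (0.1, 0.5, 0.9)}
```

Each empirical frequency in `ecdf` lands within Monte Carlo error of its threshold, as Theorem 2 predicts.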
Proof. Denote by $F$ the CDF of $T(X)$ under $\theta_0$. First, note that by Theorem 1 and the continuity of $F$,
$$p = \mathbb{P}_{\theta_0}\bigl(T(\widetilde{X}) \geq T(X)\bigr) = 1 - F(T(X)).$$