# Hypothesis testing for mathematicians

Most introductions to hypothesis testing are targeted at non-mathematicians. This short post aims to be a precise introduction to the subject for mathematicians.

*Remark*. While the presentation may differ, some of the notation in this article is from L. Wasserman’s *All of Statistics: a Concise Course in Statistical Inference*.

Consider a parametric model with parameter set $\Theta$. The model generates realizations $X_1, \ldots, X_n$.

**Example (Coin Flip).**
We are given a coin.
The coin has probability $p$ in $\Theta = [0, 1]$ of showing heads.
We flip the coin $n$ times and record $X_i = 1$ if the $i$-th flip is heads and $X_i = 0$ otherwise.

Throughout this article, we use the above coin flip model to illustrate the ideas.
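As a computational companion, the coin-flip model can be simulated in a few lines of Python (a minimal sketch; the function name `flip_coin` and its parameters are our own):

```python
import random

def flip_coin(p, n, seed=0):
    """Generate n Bernoulli(p) realizations: 1 for heads, 0 for tails."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]

x = flip_coin(p=0.5, n=10)  # ten flips of a fair coin
```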

In hypothesis testing, we start with a *hypothesis* (also called the *null hypothesis*).
Specifying a null hypothesis is equivalent to picking some nonempty subset $\Theta_0$ of the parameter set $\Theta$.
Precisely, the null hypothesis is the assumption that realizations are being generated by the model parameterized by some $\theta$ in $\Theta_0$.

**Example (Coin Flip).**
Our hypothesis is $\Theta_0 = \{ 1/2 \}$.
That is, we hypothesize that the coin is fair.

For brevity, let $X = (X_1, \ldots, X_n)$.
To specify when the null hypothesis is rejected, we define a *rejection function* $\phi$ such that $\phi(X)$ is an indicator random variable whose unit value corresponds to rejection.

**Example (Coin Flip).**
Let
$$\phi(X) = \mathbf{1} \left\{ \left| \sum_{i=1}^n X_i - \frac{n}{2} \right| \geq c \right\}.$$

This corresponds to rejecting the null hypothesis whenever we see “significantly” more heads than tails (or vice versa). Our notion of significance is controlled by $c$.

Note that nothing stops us from making a bad test. For example, taking $c = 0$ in the above example yields a test that always rejects. Conversely, taking $c > n/2$ yields a test that never rejects.
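A sketch of the coin-flip rejection function (the helper name `phi` is ours), along with the two degenerate choices of $c$ just mentioned:

```python
def phi(x, c):
    """Indicator of rejection: 1 iff |sum(x) - n/2| >= c."""
    n = len(x)
    return 1 if abs(sum(x) - n / 2) >= c else 0

always_rejects = phi([1, 0, 1, 1], c=0)  # c = 0: the deviation is always >= 0
never_rejects = phi([1] * 10, c=6)       # c > n/2: the deviation never exceeds n/2
```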

**Definition (Power).**
The *power*
$$\beta(\theta) = \mathbb{P}_\theta(\phi(X) = 1)$$

gives the probability of rejection assuming that the true model parameter is $\theta$.

**Example (Coin Flip).**
Let $F(\cdot \,; n, p)$ denote the CDF of a binomial distribution with $n$ trials and success probability $p$.
Let $S = \sum_{i=1}^n X_i$.
Then, assuming $c$ is positive,
$$\beta(p) = \mathbb{P}_p \left( S \geq \frac{n}{2} + c \right) + \mathbb{P}_p \left( S \leq \frac{n}{2} - c \right) = 1 - F \left( \left( \frac{n}{2} + c \right)^-; n, p \right) + F \left( \frac{n}{2} - c; n, p \right)$$

where $F(x^-; n, p) = \lim_{y \uparrow x} F(y; n, p)$ is a left-hand limit.
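This formula can be checked numerically with only the standard library (a minimal sketch; the helper names `binom_pmf` and `power` are ours). Summing the binomial PMF directly sidesteps the left-hand limit, since $1 - F((n/2 + c)^-; n, p) = \mathbb{P}_p(S \geq n/2 + c)$:

```python
from math import comb, ceil, floor

def binom_pmf(k, n, p):
    """P(S = k) for S ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def power(p, n, c):
    """beta(p) = P_p(|S - n/2| >= c) for the coin-flip test, assuming c > 0."""
    upper = sum(binom_pmf(k, n, p) for k in range(ceil(n / 2 + c), n + 1))
    lower = sum(binom_pmf(k, n, p) for k in range(0, floor(n / 2 - c) + 1))
    return upper + lower

# Power increases as the true p moves away from 1/2:
beta_fair = power(p=0.5, n=100, c=10)
beta_biased = power(p=0.7, n=100, c=10)
```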

**Definition (Size).**
The *size* of a test
$$\alpha = \sup_{\theta \in \Theta_0} \beta(\theta)$$

gives, assuming that the null hypothesis is true, the “worst-case” probability of rejection.

Rejecting the null hypothesis erroneously is called a *type I error* (see the table below).
The size puts an upper bound on the probability of making a type I error.

| | Retain Null | Reject Null |
| --- | --- | --- |
| Null Hypothesis is True | No error | Type I error |
| Null Hypothesis is False | Type II error | No error |

**Example (Coin Flip).**
Since $\Theta_0 = \{ 1/2 \}$ is a singleton, $\alpha = \beta(1/2)$.
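The size of the coin-flip test can also be estimated by Monte Carlo simulation under $p = 1/2$, requiring no binomial formulas (a minimal sketch; the helper name `estimate_size` and the choices $n = 100$, $c = 10$ are ours):

```python
import random

def estimate_size(n, c, trials=20_000, seed=0):
    """Monte Carlo estimate of alpha = beta(1/2) for the test |S - n/2| >= c."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        s = sum(rng.random() < 0.5 for _ in range(n))  # S ~ Binomial(n, 1/2)
        rejections += abs(s - n / 2) >= c
    return rejections / trials

alpha = estimate_size(n=100, c=10)  # roughly 0.057 for these choices
```

A Monte Carlo estimate converges at rate $O(1/\sqrt{\text{trials}})$, so this serves only as a sanity check against the exact binomial computation.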

**Definition (p-value).**
Let $\{ \phi_\alpha \}$ be a collection of rejection functions such that $\phi_\alpha$ has size $\alpha$.
Define
$$\text{p-value} = \inf \{ \alpha : \phi_\alpha(X) = 1 \}$$

as the smallest size for which the null hypothesis is rejected.

Unlike the size, the p-value is itself a random variable. The smaller the p-value, the stronger the evidence for rejection. A common threshold is to reject when the p-value is smaller than 0.01. A rejection at this threshold can be understood as follows: if the null hypothesis were true, an observation at least this extreme would occur with probability at most 0.01.

**Theorem 1.**
Suppose we have a collection of rejection functions of the form
$$\phi_\alpha(X) = \mathbf{1} \{ T(X) \geq c_\alpha \},$$

where the statistic $T$ does not vary with $\alpha$. Suppose also that for each point $t$ in the range of $T$, there exists $\alpha$ such that $c_\alpha = t$. Then,
$$\text{p-value} = \sup_{\theta \in \Theta_0} \mathbb{P}_\theta \left( T(X) \geq T(x) \right).$$

In other words, the p-value (under the setting of Theorem 1) is the worst-case probability of sampling a value of $T(X)$ at least as large as what was observed, $T(x)$. Note that in the above, we have used $X$ to distinguish the random variable from $x$, the observation.

*Proof*.
Note that
$$\alpha = \sup_{\theta \in \Theta_0} \mathbb{P}_\theta(\phi_\alpha(X) = 1) = \sup_{\theta \in \Theta_0} \mathbb{P}_\theta(T(X) \geq c_\alpha).$$

Since this quantity is nonincreasing in $c_\alpha$, the infimum defining the p-value is achieved at the value of $\alpha$ for which $c_\alpha = T(x)$.

**Example (Coin Flip).**
We flip the coin $n$ times and observe $s$ heads.
By Theorem 1,
$$\text{p-value} = \mathbb{P}_{1/2} \left( \left| S - \frac{n}{2} \right| \geq \left| s - \frac{n}{2} \right| \right).$$

Denoting $|s - n/2|$ by $t$,
$$\text{p-value} = 1 - F \left( \left( \frac{n}{2} + t \right)^-; n, \frac{1}{2} \right) + F \left( \frac{n}{2} - t; n, \frac{1}{2} \right).$$
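This p-value can be evaluated exactly from the binomial PMF (a minimal sketch; all names are ours). One caveat handled below: when $s = n/2$ and $n$ is even, both tails include $k = n/2$, so the sum is clamped at 1:

```python
from math import comb, ceil, floor

def binom_pmf(k, n, p=0.5):
    """P(S = k) for S ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def p_value(s, n):
    """P_{1/2}(|S - n/2| >= t) where t = |s - n/2| is the observed deviation."""
    t = abs(s - n / 2)
    upper = sum(binom_pmf(k, n) for k in range(ceil(n / 2 + t), n + 1))
    lower = sum(binom_pmf(k, n) for k in range(0, floor(n / 2 - t) + 1))
    return min(1.0, upper + lower)  # t = 0 double counts k = n/2; clamp at 1

pv = p_value(s=60, n=100)  # about 0.057: not significant at the 0.01 threshold
```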
**Theorem 2.**
Suppose the setting of Theorem 1 and that, in addition, $\Theta_0 = \{ \theta_0 \}$ is a singleton and $T(X)$ has a continuous and strictly increasing CDF under $\theta_0$. Then, the p-value has a uniform distribution on $[0, 1]$ under $\theta_0$.

In other words, if the null hypothesis is true, the p-value (under the setting of Theorem 2) is uniformly distributed on $[0, 1]$.
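Because the coin-flip statistic is discrete, Theorem 2 does not apply to it exactly (discrete p-values are only *super-uniform*). The sketch below instead uses a continuous statistic: suppose that under the null, $T(X)$ is standard normal, so the p-value is $1 - \Phi(T(X))$; simulated p-values should then look uniform on $[0, 1]$ (all names are ours):

```python
import random
from math import erf, sqrt

def std_normal_cdf(t):
    """Phi(t), the CDF of the standard normal distribution."""
    return 0.5 * (1 + erf(t / sqrt(2)))

rng = random.Random(0)
# Under the null, T(X) ~ N(0, 1) and the p-value is 1 - Phi(T(X)).
p_values = [1 - std_normal_cdf(rng.gauss(0, 1)) for _ in range(100_000)]

mean = sum(p_values) / len(p_values)                          # expect about 0.5
frac_below = sum(p < 0.25 for p in p_values) / len(p_values)  # expect about 0.25
```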

*Proof*.
Denote by $G$ the CDF of $T(X)$ under $\theta_0$. First, note that
$$\text{p-value} = \mathbb{P}_{\theta_0}(T(X) \geq T(x)) = 1 - G(T(x))$$

since $G$ is continuous. Then, for $u$ in $[0, 1]$,
$$\mathbb{P}_{\theta_0} \left( 1 - G(T(X)) \leq u \right) = \mathbb{P}_{\theta_0} \left( T(X) \geq G^{-1}(1 - u) \right) = 1 - G(G^{-1}(1 - u)) = u.$$