Holm's Sequentially Rejective Bonferroni Method

This is a neat variant of the standard, and intuitive, Bonferroni inequality. Suppose that you have a computer program that makes N statistical tests, probably not quite independent, and keeps the smallest tail probabilities. For example, you might have it run 100 tests, and it might find that the smallest tail probability among these is 1/1000. Is this unexpected? In working this out, we consider ourselves to have raised a single false alarm if one or more true null hypotheses, amongst all those considered, are rejected, and call the chance of this the overall false alarm rate (or family-wise error rate). So falsely picking out just one hypothesis as causal amongst many is total failure - but this means that falsely picking out everything as causal is no worse.

The Bonferroni inequality is P(A or B) <= P(A) + P(B), and it follows from P(A or B) = P(A) + P(B and not A) <= P(A) + P(B). This is true whether or not A and B are independent, and it extends to show that P(A1 or A2 or ...) <= SUM P(Ai). For our purposes P(A or B or ...) is the probability of a 'false alarm' - falsely rejecting a null hypothesis. If we want this to be at most 1/100 then, with 100 tests each run at the same threshold P(A), we would want 100*P(A) <= 1/100, or P(A) <= 1/10000. So we should not accept a lowest tail probability of only 1/1000 as strong evidence. In fact, we would only have rejected our null hypothesis if we had decided we were satisfied with the much less stringent false alarm probability of 100*1/1000 = 1/10.
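The arithmetic of the 100-test example can be checked with a few lines of Python. This is only a sketch: the function names bonferroni_threshold and bonferroni_adjusted are made up for illustration, and exact fractions are used so that no rounding creeps in.

```python
from fractions import Fraction


def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the overall false alarm rate <= alpha."""
    return alpha / n_tests


def bonferroni_adjusted(p, n_tests):
    """Overall false alarm rate implied by using p as the per-test threshold."""
    return min(1, n_tests * p)


alpha = Fraction(1, 100)      # desired overall false alarm rate
n_tests = 100                 # number of tests run
smallest = Fraction(1, 1000)  # smallest tail probability observed

per_test = bonferroni_threshold(alpha, n_tests)   # 1/10000
implied = bonferroni_adjusted(smallest, n_tests)  # 1/10
reject = smallest <= per_test                     # False: 1/1000 is not small enough
```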

Put simply, the Bonferroni inequality says that if you do N tests, you should multiply all your tail probabilities by N. One way to play games with this is to assign each test a possibly different threshold for its tail probability ahead of time, so that all the thresholds add up to the desired overall false alarm rate. This allows you to decide that some tests are more important to you than others.
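As a rough sketch of this unequal allocation, assuming the thresholds are fixed before looking at the data: the function name weighted_bonferroni_reject, the small tolerance for rounding, and the numbers in the example are all made up for illustration.

```python
def weighted_bonferroni_reject(p_values, thresholds, alpha):
    """Reject hypothesis k when its tail probability is <= its own threshold.

    The thresholds must be chosen ahead of time and must add up to at most
    alpha, which then bounds the overall false alarm rate.
    """
    if len(p_values) != len(thresholds):
        raise ValueError("need one threshold per test")
    # Small tolerance so that floating-point rounding in the sum is forgiven.
    if sum(thresholds) > alpha + 1e-12:
        raise ValueError("thresholds add up to more than alpha")
    return [p <= t for p, t in zip(p_values, thresholds)]


# The first test matters most to us, so it gets the largest share of alpha.
decisions = weighted_bonferroni_reject(
    p_values=[0.02, 0.02, 0.005],
    thresholds=[0.03, 0.01, 0.01],
    alpha=0.05,
)
```

Here the same tail probability of 0.02 is enough to reject the first hypothesis but not the second, because the first was given a bigger share of the budget.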

The Holm method does slightly better than the Bonferroni when more than one of the hypotheses to be tested are in fact false. The rule is that you go through the N tail probabilities in order of size, starting with the smallest, and numbering them from 1. If you preassign your desired false alarm rate to be Alpha, then you stop as soon as you find a hypothesis i whose tail probability is > Alpha/(N-i+1). Reject all the hypotheses you meet before stopping.
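The rule above can be sketched in Python as follows. The function name holm_reject and the example numbers are illustrative assumptions, not taken from any library.

```python
def holm_reject(p_values, alpha):
    """Holm's step-down rule: returns one True/False decision per hypothesis."""
    n = len(p_values)
    # Indices of the hypotheses, smallest tail probability first.
    order = sorted(range(n), key=lambda k: p_values[k])
    reject = [False] * n
    for i, k in enumerate(order, start=1):
        # Stop at the first tail probability that exceeds Alpha/(N-i+1).
        if p_values[k] > alpha / (n - i + 1):
            break
        reject[k] = True
    return reject


p_values = [0.001, 0.015, 0.04, 0.02]
alpha = 0.05

holm = holm_reject(p_values, alpha)
# Plain Bonferroni for comparison: every test faces the threshold Alpha/N.
plain = [p <= alpha / len(p_values) for p in p_values]
```

With these made-up numbers plain Bonferroni rejects only the first hypothesis, because 0.015, 0.02 and 0.04 all exceed 0.05/4, while Holm rejects all four, because the thresholds relax to 0.05/3, 0.05/2 and 0.05 as it goes.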

If all the hypotheses are in fact true, you fail if you reject any hypothesis, which happens iff the lowest tail probability is <= Alpha/(N-1+1) = Alpha/N, which we know from Bonferroni leads to a false alarm rate of at most Alpha: OK so far.

If m hypotheses are false, consider their tail probabilities as given, and consider the lowest tail probability amongst the N-m true hypotheses. If we reject any of the true hypotheses, we have failed, and this can only happen if we reject the one with the lowest tail probability. It is safe to assume that the lowest m slots are filled by the m false hypotheses, because that gives the true hypotheses as high a target tail probability as possible: if they lead to a false rejection in any other circumstance, they will certainly lead to a false rejection in these circumstances. We will not raise a false alarm unless the lowest true hypothesis (amongst N-m) has a tail probability <= Alpha/(N-m) - but by Bonferroni again that gives us an overall false alarm rate of at most Alpha, so everything is OK.

With the simple Bonferroni method, you can consider yourself to be allocating a small amount of false alarm probability amongst the N hypotheses, and even give them different weights, setting a threshold of p_i for hypothesis i, so that the p_i add up to an upper bound of Sum_i p_i on the false alarm rate. I believe that you can do the same thing with the Holm method. Allocate to each hypothesis i a (positive) weight w_i. After rejecting m null hypotheses, work out for each remaining hypothesis i the adjusted alpha-value Alpha*w_i/Sum_j w_j, where the sum ranges over all hypotheses not yet rejected. If any remaining hypothesis has a p-value less than its adjusted alpha-value, reject the one amongst these with the smallest p-value; otherwise terminate. At each stage Sum_j w_j decreases, so the thresholds get easier to pass. If there is a family-wise error with n false null hypotheses, then amongst the N-n true hypotheses there must be one, say hypothesis i, whose p-value is less than Alpha*w_i/Sum_j w_j, where j ranges over all the true hypotheses. This is because, at the stage where the first true hypothesis is rejected, every true hypothesis is still amongst those not yet rejected, so the sum in force at that stage is at least as large. The probability of this happening is at most Alpha*Sum_i w_i/Sum_j w_j, with both sums over the true hypotheses, which is Alpha.
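Here is a minimal sketch of the weighted procedure as I read the description above. The function name weighted_holm_reject and the example weights are made up, and I have used <= rather than strict < for the comparison, to match the unweighted rule.

```python
def weighted_holm_reject(p_values, weights, alpha):
    """Weighted step-down rule: returns one True/False decision per hypothesis."""
    if len(p_values) != len(weights) or any(w <= 0 for w in weights):
        raise ValueError("need one positive weight per hypothesis")
    remaining = set(range(len(p_values)))
    while remaining:
        # Total weight of the hypotheses not yet rejected.
        total = sum(weights[k] for k in remaining)
        # Hypotheses whose p-value passes their adjusted alpha-value.
        passing = [k for k in remaining
                   if p_values[k] <= alpha * weights[k] / total]
        if not passing:
            break
        # Reject the passing hypothesis with the smallest p-value.
        remaining.remove(min(passing, key=lambda k: p_values[k]))
    return [k not in remaining for k in range(len(p_values))]


# Hypothesis 0 carries three times the weight of hypothesis 1.
weighted = weighted_holm_reject([0.03, 0.06], [3, 1], alpha=0.05)
# With equal weights the thresholds are Alpha/(number remaining), as in plain Holm.
equal = weighted_holm_reject([0.03, 0.06], [1, 1], alpha=0.05)
```

In this example the extra weight lets hypothesis 0 pass its threshold of 0.05*3/4, whereas with equal weights neither p-value gets under 0.05/2 and nothing is rejected.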

Quantile-Based Methods

In some situations, you might expect a lot of fairly significant results, rather than a few very significant results. In this case, you could sort the N tail probabilities into ascending order, and test the i-th smallest against p*i/N (for a single pre-chosen i, counting from 1), which will give you a false alarm rate of at most p. This works because, if all the hypotheses are true, the expected number of tail probabilities <= p*i/N is at most p*i, so the chance that there are i or more of them is at most p. See Early Birds and Sleepy Heads.
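A small sketch of this test in Python, assuming i is fixed before the data are seen. The function name quantile_test and the example numbers are made up for illustration.

```python
def quantile_test(p_values, i, p):
    """True if the i-th smallest tail probability (counting from 1) is <= p*i/N."""
    n = len(p_values)
    if not 1 <= i <= n:
        raise ValueError("i must be between 1 and the number of tests")
    ith_smallest = sorted(p_values)[i - 1]
    return ith_smallest <= p * i / n


# 100 tests, of which ten are fairly (not spectacularly) significant.
many_modest = [0.004] * 10 + [0.5] * 90
# The same, but with only nine modest results.
too_few = [0.004] * 9 + [0.5] * 91

found = quantile_test(many_modest, i=10, p=0.05)   # threshold is 0.05*10/100
missed = quantile_test(too_few, i=10, p=0.05)
```

Note that none of the ten results would pass the plain Bonferroni threshold of 0.05/100, but together they pass the test on the tenth smallest.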
