Flaky Programs

This is inspired by a real life problem; I was testing a program not under my control, and found that it failed now and again. A modified version did not fail after a large number of tries. In practice, I satisfied myself that an improvement had been made by writing a program (a shell script) to test the modified version, and running this overnight to accumulate a very large number of successful runs. But I thought I would consider the statistical problem of deciding whether or not an improvement had indeed been made.

The statistics becomes quite tractable if you decide ahead of time how many test runs you will make with each version of the program. This is quite convenient, because you can decide how much time you want to spend on the problem and then work from there. After you have done your runs you will can fill in a table as follows:

Old VersionNew Version
Successesab
Failurescd (hopefully 0)

There is a statistical test, called Fisher's exact test, which can be used to assess the significance of such tables; that is, to provide the probability of seeing a result at least as encouraging as the data provided, under the assumption that there is in fact no difference between the versions. If that probability is very small, we say that the result is significant, and can start believing that we have done some good.

There is a quibble, though; Fisher's exact test assumes that we decide ahead of time not only the total number of tests made on each version, but the total number of successes and failures. Strictly speaking, this analysis is a mis-application of Fisher's exact test. By claiming that the total successes and failures were known ahead of time, we are certainly ignoring the information contained in them. This can be expected to reduce the sensitivity of our test. Can it lead us astray?

Suppose that we don't do the experiment ourselves, but are forced to rely on a stooge. This stooge decides ahead of time what totals of success and failure they want. They carry out the experiment precisely as ordered, with the specified number of trials of old and new versions, except that if the total number of success and failures don't come out right, they ignore the results and start again from the beginning. They only get back to us when they have achieved a run with the right total of successes and failures. Fisher's exact test applies to the table delivered to us, which means that the probability we work out of receiving an result at least as encouraging as the one we actually get is correct.

We don't have such a stooge, and I don't propose that we employ one, but our analysis of the situation corresponds to the case when we we have a very large number of such stooges, each with their own favoured success and failure totals, and we choose such a stooge at random. Since the probability of receiving a result at least as encouraging as the one we actually get is correct (and the same) no matter which stooge we pick, it is correct overall; we are throwing away some information by pretending that the totals of success and failure are fixed ahead of time, but the probability we work out is not misleading. To illustrate that information is lost, consider the case when no test succeeded with either version; here the totals of success and failure are enough to tell you all you need to know on their own.

The probability of any particular combination of our a, b, c, d counts, given their various sub-totals is (a + b)! (c + d)! (a + c)! (b + d) ! / (a! b! c! d! (a + b + c + d)!). If d = 0 (all tests of the new version succeeded) then the numbers you have are the most encouraging possible, given the sub-totals, and you only have one number to work out; you may be able to do this by hand. Nevertheless, I provide below a JavaScript-based table that works out the probability of receiving a result at least as encouraging as that observed (d = the value actually seen or smaller). Remember that low numbers are good. If you have automated the entire testing process (as you should, if possible), you can probably collect enough data to make p very small, perhaps 1/1000 = 1.0E-3 or 1/1000000 = 1.0E-6 (assuming of course that you actually have improved the program).

Statisticians, and those intending to use this web page for other purposes, should note that this a "one-tailed test", that is, it considers only the possibility that the program has improved, and not the possibility that it has got worse. If you are not sure whether d should be smaller or larger than would be expected by chance, you should analyse the numbers twice, treating old as new and vice versa, and double the lower of the two probabilities found to get the conventional measure of significance. You should also double the probability found if you only decide on which is the improved version after looking at the data; if you don't, you will make wrong decisions twice as often as you are expecting to.

Fisher's Exact Test (One-tailed version)

Old VersionNew Version
Successes
Failures