All-Pairs Testing

This stuff is inspired by the article "Too darned BIG to test", by Keith Stobie (Microsoft) in ACM Queue, Vol 3 No. 1. I concentrate on one observation from this article.

The Problem

Suppose you have, for example, a computer program to test. (Some of this also applies to more general sorts of experiments, but it is best suited to situations where just one example will prove that e.g. a program works or does not work for a given input, and to situations where we might expect that interactions are not very complicated). Suppose that what the program does depends on a number of input variables, each of which can take one of a small number of settings. If you have 10 variables, each with 10 possible settings, then there are 10^10 possible inputs in all, and exhaustive testing is unlikely to be feasible.

With just 10 runs, you can assign to each input variable each of its 10 possible values: for example runs 0000000000, 1111111111,... 9999999999. That will catch some stupid errors, but what about errors that occur, for example, only when the third input is set to 5 and the fifth input is set to 7?

It turns out that, for these parameters, you can construct - by hand if necessary - a set of 121 runs that guarantee coverage of all pairwise combinations. (In fact, if you are alert you will recognise that one of these 121 rows can be discarded, to produce a solution with only 120 rows). The method of construction guarantees a reasonable number of rows and is not too complicated: in fact it is a lot quicker to construct the rows than to check mechanically that they do indeed cover all pairwise combinations, especially for large numbers of variables.

The building-block: a solution for primes

The simplest example of a solution for this can be constructed around any prime p. It produces a table of p^2 rows and p + 1 columns, such that, for any pair of columns, each combination of two values from 0..p-1 appears in exactly one row.

The first column consists of p copies of 0, then p copies of 1, and so on: row i of that column contains the value i / p, rounded down. The second column consists of the sequence 0,1,2,...,p-1 repeated p times, so row i of that column contains the value i mod p.

Each of the following columns can be given a number in the range 1..p-1, which I will call x. For each row I will call the value in the first column a, and the value in the second column b. We set the value in the column assigned number x to a + xb mod p.

Each possible pair appears exactly once in the first two columns. If you are given a pair of values, and told which two columns they were taken from, you can write down two simultaneous equations to solve for a and b, the values in the first two columns of that row. Depending on whether the columns you are given include one of the first two columns, there are a number of possible cases, but all cases lead to exactly one solution, because each number in the range 1..p-1 has a multiplicative inverse mod p. That is, given a number x mod p, x != 0 mod p, I can produce a number y, such that xy = 1 mod p. For example, if p = 5, 1*1 = 1, 2*3 = 6 = 1 mod 5, 4 * 4 = 16 = 1 mod 5.
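
To make the construction concrete, here is a minimal Java sketch of it (the class and method names are my own inventions, not part of the AllPairs program described later):

public class PrimeBlock {
    // Build the p^2 by (p+1) table for a prime p: columns 0 and 1 hold a and b,
    // and the column assigned number x holds (a + x*b) mod p.
    static int[][] build(int p) {
        int[][] t = new int[p * p][p + 1];
        for (int row = 0; row < p * p; row++) {
            int a = row / p, b = row % p;
            t[row][0] = a;
            t[row][1] = b;
            for (int x = 1; x < p; x++) t[row][x + 1] = (a + x * b) % p;
        }
        return t;
    }
    public static void main(String[] args) {
        for (int[] row : build(3)) {             // 9 rows, 4 columns for p = 3
            StringBuilder sb = new StringBuilder();
            for (int v : row) sb.append(v).append(' ');
            System.out.println(sb.toString().trim());
        }
    }
}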

Using the building block

You will need to choose a prime which is at least as large as the largest number of options for any input variable. If this produces values outside the permitted range for some variable, just replace them with arbitrary values inside the permitted range: the guarantee that all pairs will occur still holds, but now some pairs may turn up more than once. If you end up with too many columns you can just ignore the extra ones. After doing this you may find that some rows are totally unnecessary, so you can drop them. Going back to our example with 10 choices for each of 10 variables, we find that 11 is the next prime along, so we can certainly solve the problem with 121 rows. In fact, a computer program can find a way to cut this down to 120 rows. This is clear, because the construction produces one row which is entirely zeros. If we simply map 1-10 to our 10 choices and accept 0 as a value we can map as we choose, then it is clear that we can simply discard the row that is entirely zeros.
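
Here is a sketch of that trimming for the 10-variables, 10-choices example; taking the value mod 10 is just one arbitrary way of replacing the out-of-range value 10, and the class name is invented:

public class AdaptBlock {
    public static void main(String[] args) {
        int p = 11;                              // smallest prime >= 10 choices
        int vars = 10, choices = 10;
        for (int row = 0; row < p * p; row++) {  // 121 rows
            int a = row / p, b = row % p;
            StringBuilder sb = new StringBuilder();
            for (int c = 0; c < vars; c++) {     // keep only the first 10 of the 12 columns
                int v = (c == 0) ? a : (c == 1) ? b : (a + (c - 1) * b) % p;
                sb.append(v % choices).append(' '); // replace the out-of-range value 10 by 0
            }
            System.out.println(sb.toString().trim());
        }
    }
}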

Handling large numbers of columns

Suppose you need more than p+1 columns. There is an easy answer for this. Give each column a number, starting from 0, and write down those numbers base p+1. So if p+1 = 3 and you have nine columns, they will be columns 00, 01, 02, 10, 11, 12, 20, 21, and 22. If you end up needing d-digit numbers, create a design with d*p^2 rows. Number the columns in the basic construction from 0..p and, to construct e.g. column 12 in the larger design, put column 1 from the basic design on top of column 2 from the basic design.

Now consider any two columns in the large design. Because they have different d-digit numbers, their numbers, compared digit by digit, differ in at least one place. For example, 12 differs from 11 in the second digit. So if we look at the rows corresponding to that digit in the large design, we see we have two complete, and different, columns from the smaller design. This means that, within just these rows, all possible combinations of values appear at least once. So within the larger design, all pairwise combinations of values appear at least once.
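
A sketch of this stacking, again with invented names; it builds d bands of the basic block, and handles up to (p+1)^d columns:

public class StackedDesign {
    static int[][] block(int p) {                // the basic p^2 by (p+1) block, as before
        int[][] t = new int[p * p][p + 1];
        for (int r = 0; r < p * p; r++) {
            int a = r / p, b = r % p;
            t[r][0] = a;
            t[r][1] = b;
            for (int x = 1; x < p; x++) t[r][x + 1] = (a + x * b) % p;
        }
        return t;
    }
    public static void main(String[] args) {
        int p = 2, d = 2;                        // d digits base p+1: 8 rows, up to 9 columns
        int[][] blk = block(p);
        int cols = (int) Math.pow(p + 1, d);
        for (int band = 0; band < d; band++)     // one band of p^2 rows per digit position
            for (int r = 0; r < p * p; r++) {
                StringBuilder sb = new StringBuilder();
                for (int c = 0; c < cols; c++) {
                    int digit = (c / (int) Math.pow(p + 1, band)) % (p + 1);
                    sb.append(blk[r][digit]).append(' ');
                }
                System.out.println(sb.toString().trim());
            }
    }
}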

The number of digits required to write down a number grows with the logarithm of that number, so the number of rows required grows very slowly with the number of columns. If we have just two possible settings for each variable, then with 4 rows we can handle only 3 columns, but with 8 rows we can handle up to 9 columns, with 12 up to 27, with 16 up to 81, with 20 up to 243, and so on. Again, if we don't need all of those 81 columns, we can just ignore them, and it is possible that we will then find some of those 16 rows are now redundant.

One quoted example is 40 variables, each with 3 options. Our basic block then has 9 rows and 4 columns, and 4^3 = 64, so we can certainly produce a design with 3 * 9 = 27 rows. A computer program cuts this down to 25 rows, but you can do this by hand too: the first row of our building block is entirely zeroes, so the first cell in every column is zero. This means that when we pile columns 3 high, we produce three rows, all of which are entirely zero, and can get rid of two of them, taking us from 27 to 25 rows. It was convenient, but not surprising, that we had a row of zeros we could take advantage of. If we have a building block that covers all combinations, then if we consistently renumber any column, for example by subtracting a constant using the same modular arithmetic used to construct it, we still have a building block that covers all combinations. So if we didn't have any constant row at all, we could always renumber the columns to produce one. A neater trick would be to have more than one constant row, which doesn't happen with the building block described above (in the case when the first two columns contain the number 1, we would need their sum to be 1 as well).

Powers of primes

If you are familiar with Galois Fields, you can produce a slight generalisation of this construction. Let q = p^n be a prime power, and do your arithmetic over GF(p^n). You will end up with p^(2n) rows and p^n + 1 columns. As before, start off with two columns containing all p^(2n) pairs of values. To create p^n - 1 more columns, take every possible non-zero element of GF(p^n) and assign it to a column. Fill in the values of that column by taking the value for that row in the first column plus its assigned element times the value for that row in the second column.
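
For readers who want to experiment, here is a sketch of the q = 2^2 case, representing elements of GF(4) as bit patterns of polynomials over GF(2) in the usual way; it reproduces the 16-row, 5-column table shown in the AllPairs output below. The class name is mine:

public class Gf4Block {
    static int add(int a, int b) { return a ^ b; }   // addition in GF(4) is XOR
    static int mul(int a, int b) {                   // multiply polynomials over GF(2)...
        int r = 0;
        for (int i = 0; i < 2; i++)
            if (((b >> i) & 1) != 0) r ^= a << i;
        if ((r & 4) != 0) r ^= 7;                    // ...then reduce modulo x^2 + x + 1
        return r;
    }
    public static void main(String[] args) {
        for (int a = 0; a < 4; a++)                  // first column
            for (int b = 0; b < 4; b++) {            // second column
                StringBuilder sb = new StringBuilder();
                sb.append(a).append(' ').append(b);
                for (int x = 1; x < 4; x++)          // one column per non-zero element x
                    sb.append(' ').append(add(a, mul(x, b)));
                System.out.println(sb);
            }
    }
}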

A special case: only two choices per column

The paper "An Improved Test Generation Algorithm For Pair-Wise Testing", by Maity, Nayak, Zaman, Bansal, and Srivastava, contains a beautiful design for all-pairs testing where there are only two choices in each column. Start off by fixing an odd number of rows, 2k+1. Now generate (2k+1 choose k) columns by writing out each possible column containing exact k+1 1s and k 0s. Then add an extra row at the bottom, containing only zeros. Consider any two different columns. Their top 2k+1 rows both contain k+1 1s each, so there is at least one place where both columns contain a 1 - if not, would need 2k+2 rows to hold the 2(k+1) 1s to be placed. The two columns are different, so there is at least one place where one has a 0 and another has a 1. Since they both contain the same number of 1s, there is also a place where the positions are reversed. This covers the combinations 11, 01, and 10. The bottom row of all-zeros covers the 00 case, so this design does all-pairs testing.

The case for k=1 is as follows:

1 1 0
1 0 1
0 1 1
0 0 0

This is also the layout built from the prime p=2, but the next case up is better than I have so far described:

1 1 1 1 1 1 0 0 0 0 
1 1 1 0 0 0 1 1 1 0
1 0 0 1 1 0 1 1 0 1
0 1 0 1 0 1 1 0 1 1
0 0 1 0 1 1 0 1 1 1
0 0 0 0 0 0 0 0 0 0
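
A sketch that generates these designs for any k, representing each column as a bitmask over the top 2k+1 rows (the column order differs from the tables above, which does not matter; the class name is invented):

public class OddDesign {
    public static void main(String[] args) {
        int k = 2, n = 2 * k + 1;                // 5 top rows for the 6-row, 10-column design
        java.util.List<Integer> cols = new java.util.ArrayList<>();
        for (int m = 0; m < (1 << n); m++)       // every bitmask with exactly k+1 set bits is a column
            if (Integer.bitCount(m) == k + 1) cols.add(m);
        for (int r = 0; r <= n; r++) {           // row r == n is the extra all-zeros row
            StringBuilder sb = new StringBuilder();
            for (int m : cols) sb.append(r < n ? ((m >> r) & 1) : 0).append(' ');
            System.out.println(sb.toString().trim());
        }
    }
}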

For 2k+2 rows we get (2k+1 choose k) columns. We have seen 3 columns for 4 rows and 10 columns for 6 rows. The next cases up are 35 columns for 8 rows and 126 columns for 10 rows. If you divide (2k+1 choose k) by (2k-1 choose k-1) you will see that we are multiplying the number of columns by a number that increases from 10/3 > 3 towards 4 (without ever reaching it), so the number of rows required by these designs is also at most logarithmic in the number of columns.

There is a sense in which these rows are as far apart as possible. I will talk about the 10-column case, but my observations are easily generalised. Every row is different from every other row in exactly 6 places (because that is the number of different columns you can make when you set two pre-determined rows to 01 and 10). Consider any other 10-column row, and compare it with each of our 6 rows. Every column of our 6x10 matrix contains three 0s and three 1s, so whatever bit the new row holds in a column, it differs from exactly three of the six entries in that column: there must be a total of 30 differences between our new 10-column row and the 6 rows we have already. With only 30 differences to dole out between 6 comparisons, our new row can be different from the most similar old row in at most 5 places, so it is closer to the old rows than they are to each other. Consider any three 10-column rows. Swap 0s and 1s in columns, if necessary, so that one of the rows is all-zeros. This leaves unchanged the number of differences between the rows. If the two rows that are not all-zeros each differ from it in seven or more places, they each contain seven or more 1s; with at least 14 1s spread over 10 columns, they must both hold a 1 in at least 4 of the same places, so they are distance at most 6 from each other. So there is no arrangement of 6 rows with 10 columns each so that any two are different in 7 or more places.

If there were a pattern which had exactly the maximum possible difference to all rows (different in exactly half of its bits), then we could flip 1s and 0s in the columns where that pattern had 1s to give another table with the all-pairs property, with the same distance between rows as before, and with exactly half of the bits set in all rows. Where the number of columns is odd no such pattern can exist. For at least the case of six rows and ten columns, such a pattern does exist; in fact, a computer search finds twelve of them. Here is one of the resulting tables.

1 1 0 0 1 0 1 1 0 0
1 1 0 1 0 1 0 0 1 0
1 0 1 0 1 1 0 0 0 1
0 1 1 0 0 0 0 1 1 1
0 0 0 1 1 0 1 0 1 1
0 0 1 1 0 1 1 1 0 0

I have been intrigued enough by this construction to write a web page on such tables, and other ways in which we might use designs to explore a situation where we answer yes/no questions. See Exploring The Worlds of Yes/No Questions.

In some cases, the cost of running a test may depend on the number of set bits in it. This would be the case, for instance, if you then had to check each set bit one at a time. In such a circumstance, you might prefer to have fewer set bits per test, even if it came at the cost of accepting more tests. Consider generating the set of all columns with some smaller number of set bits, for example columns with exactly two set bits. The previous proof will show that a design made of these columns will contain all the 00, 01, and 10 combinations at some row, given any pair of columns. We will probably be missing the 11 combination, but we can restore that easily by adding an all-1s test. This amounts to just a single extra 1 per column, and may even be easily checked as a special case.
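
Here is a sketch of that variant for five rows plus the all-1s test, generating the (5 choose 2) = 10 columns with exactly two set bits; five rows is the smallest number for which any two such columns still leave a row of 00s. The class name is invented:

public class SparseColumns {
    public static void main(String[] args) {
        int n = 5;                               // rows before the extra all-1s test
        java.util.List<Integer> cols = new java.util.ArrayList<>();
        for (int m = 0; m < (1 << n); m++)       // every bitmask with exactly two set bits is a column
            if (Integer.bitCount(m) == 2) cols.add(m);
        for (int r = 0; r <= n; r++) {           // row r == n is the extra all-1s test
            StringBuilder sb = new StringBuilder();
            for (int m : cols) sb.append(r < n ? ((m >> r) & 1) : 1).append(' ');
            System.out.println(sb.toString().trim());
        }
    }
}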

Computers and brute force

I have a computer program that not only uses prime-based solutions, but also tries a brute-force approach. For the brute force approach, it builds up an answer row by row. It keeps track of which combinations have been accounted for at each stage, and starts off each partial row with the position-value combination that has the most pairs left to cover. It then fills in that row by choosing, at each stage, a value that most increases the number of combinations accounted for. There is a good deal of choice even given these constraints, because it is common for there to be many top-scoring answers at each stage, all with the same score. It makes these decisions at random. This means that it can repeatedly run this part of the program, making different decisions at each attempt, and producing output when the current attempt happens, at random, to produce a solution with fewer rows than the best answer so far. In fact, the prime or Galois Field based attempt also allows for a number of random choices to be made, when columns are to be discarded, or out of range values replaced. So this program alternates between the two approaches: the best answer from either will be output. If, in any situation, one approach is much worse than the other, only half of the attempts will be wasted.
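
The following is a rough sketch of such a greedy search. It is simplified: it fills positions in a random order and breaks scoring ties at random, but does not seed each row with the neediest position-value combination, and it is not the actual AllPairs source:

import java.util.*;

public class GreedyAllPairs {
    // Pack the pair (position i, value vi, position j, value vj), i < j, into a long.
    // Assumes fewer than 100 positions and 100 values per position.
    static long key(int i, int vi, int j, int vj) {
        return (((long) i * 100 + vi) * 100 + j) * 100 + vj;
    }
    public static void main(String[] args) {
        int[] choices = {2, 2, 2, 2};            // number of options at each position
        int n = choices.length;
        Random rnd = new Random(42);
        Set<Long> todo = new HashSet<>();        // pairs not yet covered
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                for (int vi = 0; vi < choices[i]; vi++)
                    for (int vj = 0; vj < choices[j]; vj++)
                        todo.add(key(i, vi, j, vj));
        while (!todo.isEmpty()) {
            int[] row = new int[n];
            Arrays.fill(row, -1);
            List<Integer> order = new ArrayList<>();
            for (int i = 0; i < n; i++) order.add(i);
            Collections.shuffle(order, rnd);     // fill positions in a random order
            for (int pos : order) {
                List<Integer> best = new ArrayList<>();
                int bestScore = -1;
                for (int v = 0; v < choices[pos]; v++) {
                    int score = 0;               // uncovered pairs this value would complete
                    for (int other = 0; other < n; other++) {
                        if (other == pos || row[other] < 0) continue;
                        int i = Math.min(pos, other), j = Math.max(pos, other);
                        int vi = (i == pos) ? v : row[i];
                        int vj = (j == pos) ? v : row[j];
                        if (todo.contains(key(i, vi, j, vj))) score++;
                    }
                    if (score > bestScore) { bestScore = score; best.clear(); }
                    if (score == bestScore) best.add(v);
                }
                row[pos] = best.get(rnd.nextInt(best.size())); // random tie-break
            }
            for (int i = 0; i < n; i++)          // mark this row's pairs as covered
                for (int j = i + 1; j < n; j++)
                    todo.remove(key(i, row[i], j, row[j]));
            System.out.println(Arrays.toString(row));
        }
    }
}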

The program AllPairs

The source for the Java program uk.co.demon.mcdowella.algorithms.AllPairs is in the zip file mcdowella.zip. The simplest way to run it is to include, as its parameters, a series of numbers giving the number of choices at each point. For example:

E:\web\uk\co\demon\mcdowella\algorithms>java uk.co.demon.mcdowella.algorithms.AllPairs 2 2 2
Working on choices: 2 2 2
MaxGoes 100 seed 42 noShuffle false
New best of 4 rows at go 0
1 1 0
0 1 1
1 0 1
0 0 0
New best was 4 rows at go 0

The parameters seen here can be changed. If you want it to try more random choices, use -goes #. If you want to change the seed used for the random choices, use -seed #. -seed 0 causes it to keep trying indefinitely, though I suspect that after the first few thousand goes, the chance of anything better turning up is pretty slight. If you know that the parameters fit the prime-based method exactly, use -noShuffle to get a pretty non-shuffled answer (the prime-based method is tried first, so it should prevail over brute force). For example, here is a 'pretty' answer, using GF(2^2), in one prime-based go with explicit seed 19 (which in this case is ignored).

E:\web\uk\co\demon\mcdowella\algorithms>java uk.co.demon.mcdowella.algorithms.AllPairs -noShuffle -goes 1 -seed 19 4 4 4 4 4
Working on choices: 4 4 4 4 4
MaxGoes 1 seed 19 noShuffle true
New best of 16 rows at go 0
0 0 0 0 0
0 1 1 2 3
0 2 2 3 1
0 3 3 1 2
1 0 1 1 1
1 1 0 3 2
1 2 3 2 0
1 3 2 0 3
2 0 2 2 2
2 1 3 0 1
2 2 0 1 3
2 3 1 3 0
3 0 3 3 3
3 1 2 1 0
3 2 1 0 2
3 3 0 2 1
New best was 16 rows at go 0

The program will look for a basic construction just large enough to handle the maximum number of choices at any point, checking each number from the maximum number of choices upwards until it finds a prime, and also checking powers of 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, and 37.

You can have input and output represented as names, instead of numbers, if you like. Use the -file XXX option to give it the name of a file describing the choices. In this file # starts a comment that extends to the end of the line, and non-empty lines represent positions. Each consists of a list of words, each of which is a choice. For example, if we put the following into pairs.in

# Test choices for AllPairs
Shuffle NoShuffle
goes100 goes200
seed2 seed3
file parm

You get the following output (note that it gives a 6-row solution before finding a 5-row solution):

E:\web\uk\co\demon\mcdowella\algorithms>java uk.co.demon.mcdowella.algorithms.AllPairs -file pairs.in
Working on choices: 2 2 2 2
MaxGoes 100 seed 42 noShuffle false
New best of 6 rows at go 0
NoShuffle goes100 seed3 file
NoShuffle goes200 seed2 parm
Shuffle goes100 seed3 parm
Shuffle goes200 seed3 parm
Shuffle goes100 seed2 file
NoShuffle goes200 seed2 file
New best was 6 rows at go 0
New best of 5 rows at go 19
NoShuffle goes100 seed3 file
Shuffle goes200 seed2 file
Shuffle goes100 seed3 parm
NoShuffle goes200 seed3 parm
NoShuffle goes100 seed2 parm
New best was 5 rows at go 19

Working with Spreadsheets

It can be useful to store the options for each choice, and the schedule generated by AllPairs, in a format that makes it easy to display input and output in a spreadsheet. Then, for instance, you can include spaces in the names. If you use both the arguments -file <filename> and -readCsv, AllPairs will expect its input file, filename, to be in comma-separated format. If you also add the argument -writeCsv, AllPairs will write its output lines in comma-separated format. Most spreadsheets (including Excel) will read and write comma-separated format, and this allows you to specify almost arbitrary names for choices. Here is the csv version of the above example:

E:\web\uk\co\demon\mcdowella\algorithms>java uk.co.demon.mcdowella.algorithms.AllPairs -file pairs.csv -readCsv -writeCsv
Working on choices: 2 2 2 2
MaxGoes 100 seed 42 noShuffle false
New best of 6 rows at go 0
"NoShuffle","goes100","seed3","file"
"NoShuffle","goes200","seed2","parm"
"Shuffle","goes100","seed3","parm"
"Shuffle","goes200","seed3","parm"
"Shuffle","goes100","seed2","file"
"NoShuffle","goes200","seed2","file"
New best was 6 rows at go 0
New best of 5 rows at go 19
"NoShuffle","goes100","seed3","file"
"Shuffle","goes200","seed2","file"
"Shuffle","goes100","seed3","parm"
"NoShuffle","goes200","seed3","parm"
"NoShuffle","goes100","seed2","parm"
New best was 5 rows at go 19

All-Triplets, and so on

All-Pairs is high payoff for very little work. All-triplets and beyond seem to involve a lot more work, and/or more abstruse mathematics. The following is as far as I have got.

Brute Force again

We can define the number of interacting factors we cater for as the strength, so all-pairs is strength 2, and all-triplets is strength 3. Note that if you have a design of strength n, it will automatically test all interactions of fewer than n factors, so when constructing one, we need only test to see if it covers all combinations of n factors. The simplest way to generate stronger test designs is by brute force, and it turns out that there is a proof that this can be done using a number of rows that grows only with the logarithm of the number of variables: see The AETG System.

The basic idea is to choose test cases one at a time at random, and see how many new interactions they cover. The chance of any given interaction being covered by a randomly chosen test case is p = 1/l^t, where l is the number of choices for each variable and t is the strength, and we know that every interaction does in fact appear in this fraction of all possible test cases. Note that this does not depend on the number of variables in play. If there are N interactions left uncovered, there is at least one test case that covers at least Np of them, because otherwise we couldn't account for the number of times those interactions appear in the complete set of all possible test cases. In practice if you choose a reasonable number of test cases at random, you are quite likely to find one. If you find such good test cases, then each cuts down the number of untested interactions by a factor of at least (1-p), so the number of test cases required grows only logarithmically with the number of interactions, which is (variables choose strength) times l^t; so the growth, for fixed strength, is only logarithmic in the number of variables. If I have applied the correct generalisation to the formula in the link, the upper bound is l^t(ln(k choose t) + t ln(l)), where l is the number of choices for each variable, t is the strength, and k is the number of variables.
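
Taking the quoted bound at face value, this sketch simply evaluates l^t(ln(k choose t) + t ln(l)); I make no claim that it reproduces the exact rounding behind the table below, and the class name is invented:

public class GreedyBound {
    static double lnChoose(int k, int t) {      // ln(k choose t)
        double s = 0;
        for (int i = 0; i < t; i++) s += Math.log(k - i) - Math.log(i + 1);
        return s;
    }
    public static void main(String[] args) {
        int k = 64, l = 2, t = 3;               // variables, choices per variable, strength
        double bound = Math.pow(l, t) * (lnChoose(k, t) + t * Math.log(l));
        System.out.printf("upper bound about %.0f test cases%n", bound);
    }
}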

I have put an implementation of this in the zip file as uk.co.demon.mcdowella.algorithms.AllCombs. The main problem with this implementation is keeping track of which interactions have been covered. When I tried for 64 variables, 10 choices per variable, and strength 4, I found that there were over 2^32 interactions, which means that the program bombs out on a negative array index, and I would need to start using 64-bit arithmetic to address each interaction (and when I did that I found that the time required for the program to run was more than I had patience for).

Building blocks for stronger interactions

One building block for a strength t interaction is to write down all combinations of t variables as t columns, and then to write down a final column, which is the sum, mod the number of choices, of every other column. Given all but one of the first lot of columns and the final column, you can work out the missing column, so every interaction involving only t variables is covered.
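
A sketch of this building block for l = 3 choices and strength t = 2, giving 9 rows and t+1 = 3 columns (the class name is invented):

public class SumBlock {
    public static void main(String[] args) {
        int l = 3, t = 2;                        // choices per variable, strength
        int rows = (int) Math.pow(l, t);         // one row per combination of the first t columns
        for (int r = 0; r < rows; r++) {
            StringBuilder sb = new StringBuilder();
            int n = r, sum = 0;
            for (int c = 0; c < t; c++) {
                int v = n % l;
                n /= l;
                sum += v;
                sb.append(v).append(' ');
            }
            sb.append(sum % l);                  // final column: sum of the others, mod l
            System.out.println(sb);
        }
    }
}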

A second building block requires you to use arithmetic mod p, or arithmetic over a Galois field GF(p^n). Given t points and associated values (within the range 0..p-1 for the mod p case) you can fit a polynomial of degree at most t-1 through those points. Suppose you write down all polynomials of degree t-1 and fewer. For arithmetic mod p, there are p^t of them. Now write down a table with p^t rows and p columns by evaluating those polynomials at all p points. Choose t points and t values for those points. There will be a polynomial of degree at most t-1 through those points, so those values appear at their t locations in one of the rows of your table. This gives you a building block of strength t with p^t rows and p columns, but you can add on an extra column. In this column, write down the coefficient of the (t-1)-th power in each polynomial. If I choose t columns which do not include this extra column, then all possible combinations of t values appear within these columns, by the previous argument. If I include this final column together with t-1 others, and you present me with any combination of t values, I can still find a row at which this combination appears, by solving for the polynomial associated with that row, as follows:

First of all, look at the extra column to find out the coefficient of x^(t-1) in that polynomial. Now, for each other column, evaluate x^(t-1) at the point at which that column evaluates its polynomial, and subtract that, times the newly discovered coefficient, from the value required in that column. t-1 is the highest power used in any polynomial, so you now have a polynomial of degree at most t-2 evaluated at t-1 points. This means that there is exactly one polynomial that fits, and you have a unique solution: so every possible combination again appears. This gives you a solution of strength t with p^t rows (in the mod p case) and p+1 columns. Note, however, that the original construction has a large number of constant rows: these correspond to all the polynomials where only the constant term is non-zero. In the augmented version, there is just one constant row, corresponding to the all-zero polynomial. This means that adding just one more column makes the result a poorer building block, because if we end up stacking columns on top of each other, we will have to preserve very nearly identical rows (constant except in the final column, which is zero). If these rows had been entirely filled with the same value, we could probably have discarded many of them later on.
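
Here is a sketch of the augmented polynomial block in the mod p case; for p = 3 and t = 2 it reduces to a 9-row, 4-column all-pairs block whose extra column holds the coefficient of x. The class name is invented:

public class PolyBlock {
    public static void main(String[] args) {
        int p = 3, t = 2;                        // prime, strength
        int rows = (int) Math.pow(p, t);         // one row per polynomial of degree < t
        for (int r = 0; r < rows; r++) {
            int[] coeff = new int[t];            // coeff[i] is the coefficient of x^i
            int n = r;
            for (int i = 0; i < t; i++) { coeff[i] = n % p; n /= p; }
            StringBuilder sb = new StringBuilder();
            for (int x = 0; x < p; x++) {        // evaluate the polynomial at each of the p points
                int v = 0;
                for (int i = t - 1; i >= 0; i--) v = (v * x + coeff[i]) % p;
                sb.append(v).append(' ');
            }
            sb.append(coeff[t - 1]);             // extra column: coefficient of x^(t-1)
            System.out.println(sb);
        }
    }
}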

Suppose we have a building block of strength t with r rows and c columns. We can now specify r test cases at once, by giving a single row containing numbers from 0..c-1. If we can produce a set of N such rows, such that, for any t columns, there is at least one row where the numbers in those columns are all different, we have a test design with Nr rows of strength t, because if we look at the r tests derived from the row with all numbers different, they will contain all possible combinations at those t points. For t > 2 the most practical way I know of generating such combinations is just to use brute force again, with the same justification: produce rows populated with numbers from 0..c-1 at random (actually we should choose them so that they have nearly equal numbers of each choice), and pick good ones, keeping track of the number of combinations of columns we have accounted for. Now we need to keep track only of the number of combinations of columns we have accounted for, not of individual interactions at that point, so if we have l choices and strength t we have gained a factor of l^t.
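
Checking a candidate set of such rows is straightforward. This sketch verifies, for strength t = 3 and a small invented example, that every triple of columns has at least one row with all entries different:

public class DistinctRows {
    public static void main(String[] args) {
        int[][] rows = {                         // invented candidate rows over block columns 0..3
            {0, 1, 2, 3},
            {1, 0, 3, 2},
            {2, 3, 0, 1},
            {3, 2, 1, 0},
        };
        int vars = 4;                            // columns in the larger design
        boolean ok = true;
        for (int a = 0; a < vars; a++)           // every combination of t = 3 columns
            for (int b = a + 1; b < vars; b++)
                for (int c = b + 1; c < vars; c++) {
                    boolean found = false;
                    for (int[] r : rows)
                        if (r[a] != r[b] && r[b] != r[c] && r[a] != r[c]) found = true;
                    if (!found) { ok = false; System.out.println("uncovered: " + a + "," + b + "," + c); }
                }
        System.out.println(ok ? "every column triple has an all-distinct row" : "not covered");
    }
}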

More Examples

For this table, I have worked out the number of tests required with 64 variables (I chose 64 to make it clear that exhaustive testing of every single combination would be impractical). For the strength 2 cases I have used AllPairs. For strengths 3 and 4, except for strength 4 with 10 choices, I used uk.co.demon.mcdowella.algorithms.AllCombs to provide test cases. For that last case I have used uk.co.demon.mcdowella.algorithms.AllPerms to find a way of using the building block with 11^4 rows and 12 columns, and then multiplied the number of such building-block rows by 11^4 = 14641: since the table shows that the upper bound is less than this, you can do better. One of the problems here is that the building blocks are limited to primes and prime powers: the same construction would work for 11 options per variable, for which the upper bound is 195634, which makes it look good value.

What if you just used a random number generator, for the length of time given by the upper bound in the formula? For each case I have calculated the probability of missing any particular interaction (that is, a single assignment of values to t different variables, where t is the strength) and the expected number of such interactions missed (so this is just the first number times the number of interactions).

Variables  Choices  Strength  Upper bound from formula  Test cases found by program  Miss Prob  Expected Misses
64         2        2         31                        13                           1.6E-4     1.3
64         2        3         86                        50                           1.2E-5     3.9
64         2        4         214                       140                          1.0E-6     10.3
64         10       2         763                       239                          4.7E-4     94
64         10       3         10639                     9490                         2.4E-5     994
64         10       4         133620                    10*11^4=146410               1.6E-6     9993
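
The last two columns of the table can be reproduced along the following lines; small differences from the figures above come from rounding of the number of test cases, and the class name is invented:

public class RandomMiss {
    static double lnChoose(int k, int t) {      // ln(k choose t)
        double s = 0;
        for (int i = 0; i < t; i++) s += Math.log(k - i) - Math.log(i + 1);
        return s;
    }
    public static void main(String[] args) {
        int k = 64, l = 2, t = 2;               // first row of the table
        int n = 31;                             // number of randomly chosen test cases
        double p = Math.pow(l, -t);             // chance one test covers a given interaction
        double missProb = Math.pow(1 - p, n);   // chance a given interaction is never covered
        double interactions = Math.exp(lnChoose(k, t) + t * Math.log(l));
        System.out.printf("miss prob %.2g, expected misses %.2g%n", missProb, interactions * missProb);
    }
}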

A diversion

The extension to handle large numbers of columns in the all-pairs case enticed me to try and find a generalisation for higher strengths. Well, I did find a strategy that works at higher strengths and reduces to the columns-as-digits strategy when considering all-pairs. However, it does not produce competitive answers at higher strengths and requires a minimum number of columns to start off with, so I present it here as a diversion rather than a practically useful solution to any particular problem.

We take an oblique approach. Suppose that I write down all combinations of t+1 digits from 0..b-1, producing b^(t+1) different numbers base b. You are then allowed to choose for me t+1 distinct numbers from these b^(t+1) numbers. I claim that I will be able to find at least one digit position such that you can mask out that position from each of the t+1 numbers chosen without causing any two of them to have the same t-long combination of unmasked digits.

This is clearly the case for the first two numbers you chose: since they are different, there is at least one digit position that distinguishes them. Let us decide that that position won't be the one that is masked out and consider a third number. Since the first two numbers are distinct at the position we have just been considering, the third number can match at most one of them at this position. We need consider only that number, and find a digit position that distinguishes this partially matched number from the new number. If the third number turns out to be distinguishable from the first two at the position we have already chosen, we can choose an arbitrary second position, just to make life simple. Similarly, any fourth number can form a match on the two digit positions chosen so far with at most one of the first three numbers, and we can resolve that match by choosing a third digit position with only the fourth number, and any partial match, in mind. Continuing, we find that we have accounted for t+1 different numbers and can distinguish between them using at most t of the t+1 available digits.

Now we number b^(t+1) columns with base b numbers, and number each of t+1 rows with the number of a digit position which we are about to mask out at that row. At the intersection of each row and column we put a t-digit number which is the number of its column written with one of its t+1 digits masked out. Here is an example with b=2, t+1=3. (So each cell corresponds to a 3-bit number with one bit missed out - in the top row the high order bit, and in the bottom row the low order bit).

0 1 2 3 0 1 2 3
0 1 0 1 2 3 2 3
0 0 1 1 2 2 3 3

The relevance of this table is that, given any set of t+1=3 columns, we can distinguish the 3-bit numbers corresponding to those columns by looking at only 2 of the 3 available bits. This means that in one of the rows of the table - the one that masks out the unnecessary bit - the entries for each of those 3 columns are distinct. Going back to our previous use of similar tables, this gives us a way to take a scheme with 4 distinct columns and strength 3 and produce a scheme with 8 distinct columns, while keeping a strength of 3, at the cost of multiplying the number of rows by three: just number the original columns 0,1,2,3 and place them in the table according to the numbers. Furthermore, the scheme with t+1=2 gives us a way to take a scheme with strength 2 and produce a scheme with twice as many rows and the square of the number of columns, which is the basis of the strength 2 scheme for handling large numbers of columns. Unfortunately, for higher strengths, you go not from b to b^2 but from b^t to b^(t+1), which for larger t is nothing like as useful a jump - I looked at a strength 3 scheme for 64 variables and 2 choices and came up with 190 rows, while the table in the previous section tells me that it can be done in 50.
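
For completeness, here is a sketch that generates these mask tables for general b and t+1 digits (the example above is b = 2, t+1 = 3; the class name is invented):

public class MaskTable {
    public static void main(String[] args) {
        int b = 2, d = 3;                        // base and number of digits; d = t + 1
        int cols = (int) Math.pow(b, d);
        for (int mask = d - 1; mask >= 0; mask--) {  // one row per masked digit, high digit first
            StringBuilder sb = new StringBuilder();
            for (int c = 0; c < cols; c++) {
                int v = 0;                       // column number c with digit 'mask' removed
                for (int digit = d - 1; digit >= 0; digit--) {
                    if (digit == mask) continue;
                    v = v * b + (c / (int) Math.pow(b, digit)) % b;
                }
                sb.append(v).append(' ');
            }
            System.out.println(sb.toString().trim());
        }
    }
}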

Applying all-pairs and related testing schemes

Scandalously few papers on software engineering are backed by experiments. An exception is the paper "Pairwise Testing: A Best Practice That Isn't", by Bach and Schroeder. This is available as I write this at http://www.pnsqc.org/conference04/tuesday.php, inside a much larger PDF. It references a paper I have not read, "Comparing the Fault Detection Effectiveness of N-way and Random Test Suites", by Schroeder, Bolaki, and Gopu. Fault injection was used to produce broken versions of a program, which was then tested both by Pairwise Testing, and by a randomly generated test set of the same size. This was also done using tests of strength 3 and 4. No significant difference in test effectiveness was found.

Unsurprisingly, the full burden of the content of the paper cannot be summed up in just its title. I will quote a full paragraph, instead:

Should you adopt pairwise testing? If you can reduce your expectations of the technique from their currently lofty level the answer is yes. Pairwise testing should be one tool in a toolbox of combinatorial testing techniques. In the best of all possible worlds, a tester selects a testing strategy based on the testing problem at hand, rather than fitting the problem at hand to a pre-existing test strategy.

The study quotes a number of ways that pairwise testing can fail. Here is a paraphrase of the titles, with my comments:

Right values not tested.
One (of many) ways this can happen with pairwise testing is as follows. Suppose you have a continuous range, say 0.0-100.0. With pure random testing, you just pick a random number from this range every time. With pairwise testing, you might decide to confine yourself to 0, 25, 50, 75, and 100, giving yourself 5 options for this attribute. Better would be to have five ranges, 0-19.999.., 20-39.999.., and so on, so that each pairwise test specifies only a range for each such attribute. Then, when you come to do the test, pick a value from that range at random (see the sketch after this list). This should guard against the failing value not being one of the options 0, 25, 50, 75, or 100.
You could produce a bug but not recognise it
A problem common to all automated testing methods is recognising bugs when you find them. How do you know if the answer that comes out is correct? Perhaps you can find simple checks that will spot only some errors, but never produce false alarms. Conservation laws are a good example: accountancy programs might check that they never create or destroy money, but simply move it from one place to another, so that the sum of all monies at the start is the same as the sum of all monies at the end. Checks that values are within a given range can be another, or perhaps there are input values that make it easy to work out what the program should do. Checks within a program can be enormously useful. The most common variety is the assertion statement, which brings the whole thing crashing down when it fails. In some cases, you might prefer to discreetly log an error report and continue as best you can. In either case, you can test for bugs by repeatedly running the program and not even bothering to look at the output: just check for error reports or assertion failures. And then, there is always hand checking. To me, one of the advantages of pairwise testing is that, by selecting a small number of runs that represent a wide variety of program inputs, it allows you to use a limited amount of hand checking as effectively as possible.
Highly probable values get too little attention
Catching this requires a completely different style of test set construction, where the test sets reflect the use of the program in the real world. Note that this technique can require very long tests to get round to areas of the program that are exercised only infrequently in practice. See my comments below on "Not 'or' but 'and'".
More complex interactions are missed
The basic assumption of pairwise testing is that most bugs are produced by one or two-way interactions. The fault injection testing found a total of 93 bugs, split between two programs (77 introduced faults did not produce bugs that were found at all: most of these were one-value failures: see "right values not tested"). Of the 93 bugs found, 59 were found by pairwise interaction tests. To me, this means that pairwise testing can be useful, but it cannot be used alone as evidence of a reasonable level of quality.

Not 'or' but 'and'

We now know about pairwise testing and random testing. In the past we may have used old-fashioned tests produced by going through the requirements. We know there is code review, black-box testing (testing with no knowledge of the underlying code) and white-box testing (testing with knowledge of the underlying code). Yet another idea is to make test runs as similar to real life as possible. Should I use A, or B, or C? No! You should use A and B and C. Taking code review vs black-box vs white-box as an example, studies have shown that each is most efficient at detecting a particular variety of bug, so the most cost-effective way to find bugs is to do a little of each.

A possible application to redundancy

Suppose we build a redundant system out of multiple dual-redundant components. That is, the system consists of n pairs of boxes, and we can choose, for each pair, whether to put box A live or box B live. The system as a whole works exactly when all the live boxes are working. We don't have to be very smart to control this, as long as we can tell whether the system as a whole is working or not; we could just keep a list of configurations to try and work our way down the list until we find one that works. It is a good idea to have a short list of pre-tested configurations, in case there are interactions between the pairs of boxes that mean that some combinations are damaging; with a short list, we can check things out by hand.

The problem of finding a good list of configurations is a thinly disguised problem in software testing. After one or more boxes, and perhaps their interconnections, have been damaged, we want to have a list of configurations that is likely to include at least one working configuration. Instead of looking for configurations that expose a bug that depends on the setting of a few input variables, we are looking for configurations that produce a working system, which happens when the redundancy selections work round a few points of damage.

For example, if all our boxes form dual redundant pairs, we can recover from any single error using just two configurations; one that selects the A box from every pair, and one that selects the B box from every pair. With multiple faults, we have to hope that we don't get two faults in a single pair. Assuming this, here is a design for ten pairs of boxes with six configurations, such that if any two boxes from different pairs fail, at least one of the configurations works round the failure; this is just an all-pairs design with ten columns and two choices for each variable, but renumbered from 0 and 1 to A and B, and with the all-zeros case shifted up to the top.

A A A A A A A A A A
B B B B B B A A A A 
B B B A A A B B B A
B A A B B A B B A B
A B A B A B B A B B
A A B A B B A B B B

References

The AETG paper is at http://www.argreenhouse.com/papers/gcp/AETGieee97.shtml

A paper reporting improvements on this in many cases is "An Improved Test Generation Algorithm For Pair-Wise Testing", by Maity, Nayak, Zaman, Bansal, and Srivastava.

The paper "Software and Hardware Testing using Combinatorial Covering Suites", by Alan Hartman (IBM Haifa Research Laboratories) can be found off http://www.haifa.il.ibm.com/projects/verification/mdt/public.html as paper 8

Testcover.com provide a service that generates test vectors, paying attention to practical problems that may rule out testing particular combinations. Their web site at http://testcover.com contains papers and general information as well as the test service itself. See also the paper by Sherwood, Martirosyan, and Colbourn: "Covering Arrays of Higher Strength from Permutation Vectors" (users with a subscription can access this via http://www3.interscience.wiley.com/cgi-bin/abstract/111079860/ABSTRACT)