The motivation for this is one of distrust of officialdom. I have seen people work hard and gather statistics showing this, that, or the other, and present them for public view. The official response was not "It's a fair cop, guv!", but "This is a small collection of isolated examples. Everybody knows you can't tell anything from small samples". Well, you can. A lot of the time what you can tell is just that there isn't enough data, but you can at least tell that for sure, and not just hear it as a convenient excuse from somebody.
Given a statistic, what you normally want to know to work out if it means anything or not is how likely it is that you would see that statistic, or something at least as extreme, if a null hypothesis was true. The null hypothesis usually amounts to saying that A has no effect on B. If you can't even prove that A has an effect on B from the data you have, you're not going to be believed when you claim you know what that effect is. People are sufficiently good at seeing things that aren't there that this is a useful discipline: without it, you're closer to Applied Superstition than Applied Statistics.
If the observed statistic is sufficiently unlikely under the null hypothesis then, for all practical purposes, you've just disproved it. You want the probability of seeing something at least as extreme, because the reason you're looking at probabilities at all is usually to make sure that your 'false alarm rate' (which I should really call the type I error rate) is below some magic value, and you would normally say you'd found something for every statistic over some threshold that you want to compute. Lots of people put the maximum bearable false alarm rate at 5%, but I'm not happy myself with anything that isn't well below 1%. (Partly because I've seen some dodgy software, and people who do every test imaginable until they get a result.)
The catch is that most of the tables in most statistics books are only for cases when there are a large number of samples, when some asymptotic behaviour is very nearly true. Hence the official get-out clause.
Well, it doesn't have to be that way. There are useful tests when you can compute the exact probability of seeing a statistic at least as extreme as the observed result. Here are some references:
Very often, if you want to summarise a list of numbers, you give the mean (usual average) number in the list. If you are prepared to assume that the numbers are independently normally distributed, you could use the t distribution to test hypotheses about this mean. Another way to summarise a list of numbers is to give the median: the number halfway up the list after it has been sorted. It turns out that you can provide confidence bounds for the median of the underlying distribution, without assuming anything about it at all.
(Confidence bounds are closely related to significance testing. You can think of a procedure for producing confidence bounds as a rule for producing a range of numbers, such that under some set of assumptions the probability that the range includes the true value is at least some agreed value, typically 95%).
The page Quantile Bounds (or why to take at least six samples) describes how, and points to a program that can calculate the bounds.
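As a rough illustration of the idea (a sketch of my own, not the program that page points to), the bounds come straight from the binomial distribution: each sample falls below the true median with probability 1/2, so the chance that the k-th smallest value is already above the median is a Binomial(n, 1/2) tail. The function name `median_interval` and its return convention are assumptions made for this example.

```python
from math import comb

def median_interval(values, confidence=0.95):
    """Distribution-free confidence interval for the median.

    Returns (low, high, coverage) using the narrowest symmetric pair of
    order statistics whose coverage is at least `confidence`, or None if
    even [min, max] does not reach that coverage.
    """
    xs = sorted(values)
    n = len(xs)
    best = None
    # k is the 1-based order statistic used at each end of the interval.
    for k in range(1, n // 2 + 1):
        # The k-th smallest value lies above the median only if fewer than
        # k samples fall below it: a Binomial(n, 1/2) lower tail.
        tail = sum(comb(n, i) for i in range(k)) / 2 ** n
        coverage = 1 - 2 * tail
        if coverage < confidence:
            break
        best = (xs[k - 1], xs[n - k], coverage)
    return best
```

With five samples the widest interval, [min, max], only covers the median with probability 1 - 2/32 = 93.75%, so the function returns None at the 95% level; with six samples [min, max] covers it with probability 1 - 2/64 = 96.875%.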
"Permutation Tests" (see above) is very keen on this. If you can write a computer program to simulate the process you've just observed under the assumption that the null hypothesis is true, then it's easy to see how extreme your observed value is: just get it to run over and over and see how often you get a value as extreme as the real value. Since this typically replaces a few books of high-powered theory that only provides approximations anyway with one page of computer program, and five or ten minutes on today's desktop computers, it's something of a liberating experience.
The title "Permutation Tests" comes from the fact that in a lot of cases the null hypothesis can be that any one of a collection of permutations of the real test results is equally likely. For instance, if you observe sets of pairs $(A_i, B_i)$, the null hypothesis may imply that any set of pairs with the Bs permuted amongst themselves is equally likely (because the null hypothesis says that there is no connection between the As and the Bs). If it turns out that the observed data has a larger value of some statistic $\sum_i t(A_i, B_i)$ than most of the other collections of $(A_i, B_i)$ produced by permuting the Bs amongst themselves, then you have evidence against the null hypothesis.
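For a handful of pairs you can simply try every permutation. The following is a minimal sketch, assuming a one-sided test in which large values of the statistic count as extreme; the function name `exact_permutation_p` and the default choice t(a, b) = a*b are mine, chosen only for illustration.

```python
from itertools import permutations

def exact_permutation_p(a, b, t=lambda x, y: x * y):
    """One-sided exact p-value for paired data.

    Returns the fraction of all permutations of b whose statistic
    sum_i t(a[i], b[i]) is at least as large as the observed one.
    """
    observed = sum(t(x, y) for x, y in zip(a, b))
    total = 0
    at_least = 0
    for perm in permutations(b):
        total += 1
        if sum(t(x, y) for x, y in zip(a, perm)) >= observed:
            at_least += 1
    return at_least / total
```

The number of permutations grows as the factorial of the number of pairs, so this is only practical for small data sets; beyond that, random sampling of permutations is the usual way out.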
There's a nice wrinkle I first heard on usenet that I think I can find described in "Statistical Inference", by Garthwaite, Jolliffe, and Jones (Prentice Hall, ISBN 0-13-847260-2), near the beginning of Chapter 9. Count the observed value as one of your Monte Carlo runs (this often makes programming easier, if anything). Under the null hypothesis, if you have 1 observed value and N true Monte Carlo runs, these all have the same distribution, so, as long as there are no ties, the probability of the observed value being n-th (e.g. 1st) or less out of the statistics found in (N+1) runs in all is exactly n/(N+1). Personally, I think of this as merging the Monte Carlo process with the experiment that generated the observed value.
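Here is a small sketch of that counting rule applied to a permutation test on pairs. The function name `monte_carlo_p`, the default statistic and the default of 999 runs are assumptions for the example. Ties with the observed value are counted against you, which can only make the reported value larger.

```python
import random

def monte_carlo_p(a, b, runs=999, t=lambda x, y: x * y, rng=None):
    """Monte Carlo permutation p-value with the observed value counted as a run."""
    rng = rng or random.Random()
    observed = sum(t(x, y) for x, y in zip(a, b))
    shuffled = list(b)
    count = 1  # the observed arrangement is itself one of the runs + 1 values
    for _ in range(runs):
        rng.shuffle(shuffled)
        if sum(t(x, y) for x, y in zip(a, shuffled)) >= observed:
            count += 1
    return count / (runs + 1)
```

With 999 random runs the smallest value this can return is 1/1000, which is also a reminder of how many runs you need before a result below 1% is even possible.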
Here's an idea I regard as an extension of this, because I'm less sure about it. Sometimes when you do Monte Carlo tests, you can't generate random examples from scratch. Instead you use something like the Hastings-Metropolis Algorithm (also described in Garthwaite/Jolliffe/Jones) to generate a random example from another random example. The long run distribution of the examples generated is then correct, but any two successive random examples are not independent. Now suppose that you have an observation you need to check via Monte Carlo, and you decide that you can afford to generate N such examples. I propose that you randomly divide this into n and N-n examples, where n is uniform in [0..N]. Starting from the observed example, run forwards n times. Then start from the observed example and run backwards N-n times. That is, if the generation of one random example from another is not symmetric, reverse the direction. You now have a string of N+1 random (but correlated) examples, with the observed example at a random position in it. I claim that all the probabilities associated with this under the null hypothesis are the same as if you had N+1 examples all generated at random and picked one out at random. So (despite the fact that the examples are all correlated with each other, and we haven't even allowed the Hastings-Metropolis process to converge towards its guaranteed distribution) the statistic from the observed value has a random rank within all the statistics, and we have a reliable Monte Carlo test.
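To make the procedure concrete, here is a toy sketch on the paired-data problem rather than on anything that really needs it. The chain used is an assumption chosen for simplicity: each step swaps two randomly chosen Bs, which is symmetric and leaves the uniform distribution over permutations unchanged, so running "backwards" is the same as running forwards. The names `swap_step` and `chain_rank_p` are hypothetical.

```python
import random

def swap_step(b, rng):
    """One symmetric step of the chain: swap two randomly chosen positions."""
    i, j = rng.randrange(len(b)), rng.randrange(len(b))
    b = list(b)
    b[i], b[j] = b[j], b[i]
    return b

def chain_rank_p(a, b, total_steps=999, rng=None):
    """Rank the observed statistic within a chain started from the observation."""
    rng = rng or random.Random()

    def stat(bs):
        return sum(x * y for x, y in zip(a, bs))

    observed = stat(b)
    forward = rng.randint(0, total_steps)  # n, uniform on 0..N inclusive
    count = 1  # the observed example is one of the N + 1 in the string
    # Run n steps "forwards" and N - n steps "backwards" from the observation.
    # The step is its own reverse here, so both runs use swap_step.
    for steps in (forward, total_steps - forward):
        state = list(b)
        for _ in range(steps):
            state = swap_step(state, rng)
            if stat(state) >= observed:
                count += 1
    return count / (total_steps + 1)
```

For a chain that is not symmetric, the second run would need the reversed transition rule instead of `swap_step`.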
I tangled with this while running Monte Carlo tests for Contingency tables with structural zeros. Here I don't know how to generate a contingency table with the correct marginals and pattern of structural zeros at random, but I do know how to randomly walk between such tables. I haven't included code for this, because the only version I have is in C, and it would take a while to convert it over and check it. The rest of the code here is rushed enough already without that.
The amusing thing about some exact tests is that they require little more than basic probability. This is especially true if the statistic you are working with is integer-valued. You can represent the distribution of an integer-valued statistic just by giving a probability for each possible value, and you can (for instance) work out the distribution of the sum of two independent integer-valued statistics just by working out what all the possible combinations are, and summing up all the ways to get any particular value. What's more, you get an integer-valued result out of this, so you can repeat the exercise until you've worked out the statistic you really want.
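As a minimal sketch of that bookkeeping, a distribution can be held as a dictionary from value to probability, and the sum of two independent statistics is then a double loop. The name `convolve` and the use of exact fractions are choices made for this example.

```python
from collections import defaultdict
from fractions import Fraction

def convolve(dist_x, dist_y):
    """Distribution of X + Y for independent integer-valued X and Y.

    Each distribution is a dict mapping a value to its probability.
    """
    out = defaultdict(Fraction)  # missing keys start at Fraction(0)
    for x, px in dist_x.items():
        for y, py in dist_y.items():
            out[x + y] += px * py
    return dict(out)

# Usage: the total shown by two, then three, fair dice.
die = {face: Fraction(1, 6) for face in range(1, 7)}
two_dice = convolve(die, die)
three_dice = convolve(two_dice, die)
```

Because the result is again a dictionary of integer values, it can be fed straight back into `convolve`, which is the "repeat the exercise" step.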
If your statistic isn't integer-valued, well - I propose that you pretend that it is! (This is where I may be on even dodgier ground than usual.) First of all, by scaling everything up by a constant factor you can make your integer-valued version of the statistic as close to the real thing as you like (or can find the computer time for). Secondly, you can now compute the exact distribution of a statistic that is almost the statistic you are really interested in. Even if your integer-valued approximation isn't completely accurate, the way you lose is that you won't have quite such a sensitive test as the real thing was. Since you can compute the exact distribution of what you do have, the chance of you coming up with an apparently significant statistic under the null hypothesis is the same as it ever was, at whatever significance level (5%, 1%, 0.1%) you feel happy with.
By a contingency table, I mean a table of counts. Each count is the number of independent observations of some particular type. So a 1x2 table might record the number of heads and tails seen after tossing a coin a number of times. A 2x2 table might record the number of pairs of results seen when tossing two different (and distinguishable) coins at the same time. Typically, with a 1x2 table we'd be interested in testing the theory that the probability of seeing a head was 1/2, so we'd be suspicious if we had 1000 tails and 10 heads. With a 2x2 table we'd be interested in testing the theory that the two coins tossed were behaving independently of each other, so we'd be suspicious if we never saw the higher value coin come up heads at the same time as the lower value coin came up tails.
The market leader for testing contingency tables seems to be the network algorithm, by Mehta and Patel. This isn't restricted to integer (or integer-ised) statistics, and has been made to work for general tables. It's inside StatXact, SAS, and probably other packages (none of which I have access to). What I can offer works with integer-ised statistics, and on 2xN tables and 1xN tables. By a 1xN table I mean a situation where your null hypothesis tells you that the observation is of type A with known probability Pa, type B with known probability Pb,... and you have a count of As, Bs, Cs,.. and so on observed.
The best starting point to explain this is the 1xN case. If you only had one possible type of observation you could say exactly what to expect: everything would be of type A. If you had two types of observation, then the number of As and Bs follow a binomial distribution. The probability of seeing a As and b Bs is $P_a^{a} P_b^{b} (a+b)!/(a!\,b!)$. Now, suppose that the statistic we are interested in is the sum of a number of integer-valued components. Each component depends only on the number of one particular type seen. So if the number of type i is $N_i$, then the statistic is $\sum_i T_i[N_i]$, where the tables $T_i[\cdot]$ can all be different. If we have just two types to keep track of, we could build a set of tables showing, for any particular total number of observations, the probability of our statistic, $T_a[N_a] + T_b[N_b]$. All we have to do is to work out all the possible combinations of $(N_a, N_b)$, the statistic we get from the combination, $T_a[N_a] + T_b[N_b]$, and the probability under the null hypothesis, $P_a^{N_a} P_b^{N_b} (N_a + N_b)!/(N_a!\,N_b!)$.
So far, so obvious. The nice part is that once we have built such a set of tables, we can extend it to cases when there are 3 types without computing all the possible combinations of 3 counts, $(N_a, N_b, N_c)$. We can work out the probability of seeing $N_c$ observations of type C and (total - $N_c$) of types A and B combined, and then refer to the tables we have built for combinations of 2 counts to give us the distribution of the contributions from $T_a$ and $T_b$. This means that as things get more complicated the amount of work tends to grow according to some polynomial, which is usually rather nasty, but nothing near as horrible as the exponential growth you'd see if you tried to track all possible combinations.
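The following sketch shows one way to organise that calculation; it is not a transcription of the Java programs. Instead of storing conditional probabilities for each subtotal, it carries unnormalised weights (the product of p^count / count! over the types handled so far) and multiplies by total! at the end, which gives the same multinomial probabilities. The function name and argument layout are assumptions.

```python
from fractions import Fraction
from math import factorial

def exact_1xn_distribution(probs, scores, total):
    """Exact null distribution of sum_i scores[i][N_i] for a 1xN table.

    probs[i]     null probability of type i (ideally a Fraction)
    scores[i][c] integer contribution of type i when its count is c,
                 for c = 0..total
    Returns a dict mapping each statistic value to its probability.
    """
    # weights[m][s]: summed over all count vectors for the types handled so
    # far with counts adding up to m and statistic s, of
    # prod_i probs[i]**N_i / N_i!
    weights = [dict() for _ in range(total + 1)]
    weights[0][0] = Fraction(1)
    for p, table in zip(probs, scores):
        p = Fraction(p)
        new = [dict() for _ in range(total + 1)]
        for m, partial in enumerate(weights):
            for c in range(total - m + 1):  # count given to the new type
                term = p ** c / factorial(c)
                for s, w in partial.items():
                    key = s + table[c]
                    new[m + c][key] = new[m + c].get(key, 0) + w * term
        weights = new
    return {s: w * factorial(total) for s, w in weights[total].items()}
```

A significance level is then just the sum of the probabilities of all statistic values at least as extreme as the one observed.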
Lots of statistics for contingency tables can be turned into sums of contributions, where each contribution depends on just one count - so you can compute the significance of each of them - you just need to know which of them you want. There is a tradeoff here: the more specific you can make a test, the more sensitive it will be.
The Pearson chi-squared test for a 1xn table containing n counts $N_i$, with total N, is against the null hypothesis that the counts correspond to totals of equally probable independent choices. It is intended to detect any deviation from this. It is $\sum_i (N_i - N/n)^2/(N/n)$, which is a monotonic function of $\sum_i N_i (N_i - 1)$. This is sometimes called the repeat rate. It is already integer-valued, so it is easy for us. If the distribution under the null hypothesis does not assign an equal probability to each cell of the table, the chi-squared test doesn't produce a naturally integer-valued version. If you feed in the probabilities you can still get an accurate significance for $\sum_i N_i (N_i - 1)$, but it is no longer the most powerful test for detecting general deviations from the null hypothesis.
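The step from the Pearson statistic to the repeat rate can be checked in a couple of lines. Writing R for the repeat rate (the symbol R is my own shorthand) and using the fact that the counts add up to N:

```latex
\begin{aligned}
X^2 &= \sum_i \frac{(N_i - N/n)^2}{N/n}
     = \frac{n}{N}\Bigl(\sum_i N_i^2 - \frac{N^2}{n}\Bigr)
     = \frac{n}{N}\sum_i N_i^2 - N, \\
R   &= \sum_i N_i (N_i - 1) = \sum_i N_i^2 - N, \\
X^2 &= \frac{n}{N}\,(R + N) - N = \frac{n}{N}\,R + n - N .
\end{aligned}
```

For fixed N and n this is an increasing linear function of R, so ranking tables by R gives the same test as ranking them by the Pearson statistic.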
There is another test which also offers (asymptotically) the chi-squared distribution. Suppose that the null hypothesis is a special case of a more general hypothesis. It might amount to fixing k parameters of the general case to specified values, for instance. Then the maximum likelihood in the general case must always be at least as great as under the null hypothesis (because it can use the values set by the null hypothesis if it really wants to). If the null hypothesis is true, then twice the logarithm of the ratio produced by dividing the greater likelihood by the lesser is (up to various conditions) distributed asymptotically according to chi-squared, with k degrees of freedom. This is described in Garthwaite/Jolliffe/Jones (4.6.1), but not proved. They refer you to Section 9.3 of "Theoretical Statistics", by Cox and Hinkley, (published by Chapman and Hall, ISBN 0-412-16160-5) for the proof. You'll find proofs of this in a wide variety of books, though, under the title of "Maximum Likelihood Ratio Test" - or some subset of those words.
You can apply the likelihood ratio to the tables we're looking at here. You get the maximum likelihood when the probability of every cell reflects the proportion of the total count it receives. The null hypothesis either explicitly states a probability, or says the probability is the product of row and column probabilities. Lots of times when I say something like chi-squared or likelihood chi-squared in these programs I'm really referring to twice the log of some likelihood ratio. I have no really strong reason for preferring it to the Pearson Chi-Squared, though.
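For the 1xN case with stated probabilities, twice the log likelihood ratio works out as 2 times the sum over cells of N_i ln(N_i / (N p_i)), since the unrestricted maximum puts probability N_i/N on cell i. A short sketch follows; the function name is an assumption, and empty cells are taken to contribute nothing.

```python
from math import log

def likelihood_ratio_statistic(counts, probs):
    """Twice the log likelihood ratio for a 1xN table.

    counts[i] is the observed count in cell i, probs[i] its probability
    under the null hypothesis. Cells with a zero count contribute zero.
    """
    total = sum(counts)
    return 2 * sum(
        c * log(c / (total * p)) for c, p in zip(counts, probs) if c > 0
    )
```

This value is not an integer in general, so to use it with an exact integer-valued method it would first have to be scaled up and rounded.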
Yet another statistic you might look at for contingency tables is the probability of the table given its marginals, under the null hypothesis of independence. It turns out that this is $\prod_i N_{i\cdot}! \prod_j N_{\cdot j}! / (N_{\cdot\cdot}! \prod_{i,j} N_{i,j}!)$. One way to see this is not to think of a table, but of writing down the pair of (row, column) symbols that might be used to record each individual event listed in the table. The marginals tell you the number of each type of symbol. The null hypothesis of independence tells you that you can permute all of the row symbols and/or all of the column symbols at random to produce new random tables with the same marginals as the old. Of the $(N_{\cdot\cdot}!/\prod_i N_{i\cdot}!)(N_{\cdot\cdot}!/\prod_j N_{\cdot j}!)$ different double permutations, $(N_{\cdot\cdot}!/\prod_{i,j} N_{i,j}!)$ will produce tables with the same counts as before. I offer minus the logarithm of this probability as a statistic. It looks as though it's testing about the same things as the two variations on chi-squared though, so its main purpose is probably to demonstrate an easy way of generating random contingency tables with the same marginals as the observed table.
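Both halves of that argument are short enough to sketch: the probability of a table given its marginals, and the generation of a random table by writing out one (row, column) symbol pair per event and shuffling the column symbols. The function names are hypothetical, and a table is taken to be a list of rows of counts.

```python
import random
from fractions import Fraction
from math import factorial, prod

def table_probability(table):
    """Probability of the table given its marginals, under independence."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    numerator = prod(factorial(x) for x in rows) * prod(factorial(x) for x in cols)
    denominator = factorial(n) * prod(factorial(x) for r in table for x in r)
    return Fraction(numerator, denominator)

def random_table_same_marginals(table, rng=None):
    """Random table with the same row and column totals as `table`."""
    rng = rng or random.Random()
    # One row symbol and one column symbol per recorded event.
    row_symbols = [i for i, r in enumerate(table) for _ in range(sum(r))]
    col_symbols = [j for j, c in enumerate(zip(*table)) for _ in range(sum(c))]
    rng.shuffle(col_symbols)  # permuting one set of symbols is enough
    new = [[0] * len(table[0]) for _ in table]
    for i, j in zip(row_symbols, col_symbols):
        new[i][j] += 1
    return new
```

For the table [[2, 0], [0, 2]] the probability is 1/6, which agrees with the hypergeometric calculation C(2,2)C(2,0)/C(4,2).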
The chi-squared test (and variations thereof) for 2xn and larger tables is against the null hypothesis of independence: the counts correspond to independent choices of row and column. As with the test for 1xn tables, it attempts to detect any possible deviation from this. Unfortunately, I know of no way to get an inherently integer-valued statistic out of it, so I scale it and convert to integers.
One way to make a more specific, and more sensitive test, is to look for a linear trend of some sort. In a 1xn table, we might look for a general increase or decrease in counts along the table. In a 2xn table we can look for a general increase or decrease along either of the rows. Since the sum down each column is fixed, this amounts to a decrease or increase along the other row. In either case, I choose (based on Good and Sprent) the statistic $\sum_i N_i (2i - n + 1)$.
This finishes the tables for which these programs provide exact or semi-exact answers, but there is also a Monte-Carlo routine for mxn tables. In this case, there need be no strong connection between the trends for different rows. To detect the case where a trend exists only along the rows, I use a statistic produced by squaring the trend statistic above for each row and adding the result. To detect the case where a trend exists along both rows and columns, I use $\sum_{i,j} N_{ij} (2i - n + 1)(2j - n + 1)$.
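The single-row trend statistic and the sum of squared row trends are simple enough to write down directly. In this sketch I assume the cells of a row are numbered from 0 to n-1, so that the weights 2i - n + 1 are symmetric about zero; that numbering, and the function names, are my assumptions.

```python
def trend_statistic(counts):
    """Linear trend statistic for one row: sum of N_i * (2i - n + 1).

    Cells are numbered i = 0..n-1, giving weights -(n-1), ..., n-1.
    """
    n = len(counts)
    return sum(c * (2 * i - n + 1) for i, c in enumerate(counts))

def row_trend_squares(table):
    """Sum over rows of the squared trend statistic of each row."""
    return sum(trend_statistic(row) ** 2 for row in table)
```

Both are integer-valued, and the single-row version is a sum of contributions that each depend on one count, so it fits the exact method for 1xN tables without any scaling.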
This is for comparing one group of values with another, for example the weights of males against those of females. It works by sorting both sets of values together, and then noting the place of each value in the resulting order. Once it has done that, it replaces the original values with the places, and works with them from now on. This is a bit like giving points for a race based on the place each runner ended up in, and then working out team scores by adding up those points. It makes it possible to work out a significance without knowing anything about how the original values have been distributed. It turns out that most of the time you don't lose very much by this, and sometimes it makes for a more reliable test, since very large or very small values only have a limited effect.
The usual statistic to work on from here is the sum of the orders given to one group, though there are a few apparently different, but actually equivalent, ways of stating this. Then you work out the probability of getting a score as high as this or higher, or as low as this or lower, at random, if there was no difference between the two groups. You can get a program to do this, or use tables; programs have the advantage that they can take account of any ties.
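Here is a sketch of how such a program might go, not a copy of any particular one. Tied values share the average of the places they occupy, and every rank is doubled so that the scores stay integers. The exact distribution then comes from counting, for each possible total, how many ways the first group's places could have been chosen. All names are hypothetical.

```python
from fractions import Fraction
from math import comb

def doubled_midranks(values):
    """Twice the midrank of each value, so tied ranks remain integers."""
    order = sorted(values)
    ranks = {}
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and order[j] == order[i]:
            j += 1
        # Places i+1 .. j are tied; their average is (i + 1 + j) / 2.
        ranks[order[i]] = i + 1 + j
        i = j
    return [ranks[v] for v in values]

def rank_sum_test(group_a, group_b):
    """Exact one-sided tail probabilities for the rank sum of group_a.

    Returns (P(W <= observed), P(W >= observed)) under the hypothesis
    that every choice of places for group_a is equally likely.
    """
    ranks = doubled_midranks(list(group_a) + list(group_b))
    m = len(group_a)
    observed = sum(ranks[:m])
    # ways[k][s]: number of ways to pick k of the ranks seen so far with sum s
    ways = [dict() for _ in range(m + 1)]
    ways[0][0] = 1
    for r in ranks:
        for k in range(m - 1, -1, -1):  # downwards, so each rank is used once
            for s, w in ways[k].items():
                ways[k + 1][s + r] = ways[k + 1].get(s + r, 0) + w
    total = comb(len(ranks), m)
    low = sum(w for s, w in ways[m].items() if s <= observed)
    high = sum(w for s, w in ways[m].items() if s >= observed)
    return Fraction(low, total), Fraction(high, total)
```

With three values in each group and complete separation, the lower tail is 1/20: even the most extreme possible result cannot reach a two-sided 5% level with samples that small.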
If there is a difference, you might be interested to find out what it is. If you believe that the two groups are identically distributed, except for a fixed offset added on to one of the groups, you can get a confidence bound for what that offset is. Even if it isn't really very plausible that that is the only change, it's still one way of summarising the difference between the groups.
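One standard way to get such a bound, sketched here under the assumption of continuous data with no ties, is to sort all the pairwise differences between the groups and cut off the same number from each end, with the number chosen from the exact null distribution of the count of pairs in which the second group's value is larger. The function names are mine, and this is not necessarily how the Java programs do it.

```python
from math import comb

def u_counts(m, n):
    """counts[u]: number of orderings of m + n untied values in which exactly
    u of the m*n cross-group pairs have the first group's value larger."""
    ways = [dict() for _ in range(m + 1)]
    ways[0][0] = 1
    for r in range(1, m + n + 1):  # possible ranks 1..m+n
        for k in range(m - 1, -1, -1):
            for s, w in ways[k].items():
                ways[k + 1][s + r] = ways[k + 1].get(s + r, 0) + w
    base = m * (m + 1) // 2  # smallest possible rank sum for m values
    return {s - base: w for s, w in ways[m].items()}

def shift_interval(x, y, confidence=0.95):
    """Confidence interval for d, where y is distributed like x + d.

    Returns (low, high), or None if the samples are too small for any
    interval at this confidence level.
    """
    diffs = sorted(b - a for a in x for b in y)
    m, n = len(x), len(y)
    counts = u_counts(m, n)
    total = comb(m + n, m)
    alpha_half = (1 - confidence) / 2
    k, cumulative = 0, 0
    for u in range(m * n + 1):
        cumulative += counts.get(u, 0)
        if cumulative / total > alpha_half:
            break
        k = u + 1  # largest k so far with P(U <= k - 1) <= alpha / 2
    if k == 0:
        return None
    return diffs[k - 1], diffs[m * n - k]
```

With four values in each group the interval runs from the smallest to the largest pairwise difference and has coverage 1 - 2/70, about 97%; with three in each group no 95% interval exists.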
Here is a link to a page of descriptions of Java programs, including some related to this topic. A Jar/zip file of source is here.