It is notoriously difficult to distinguish cause from effect without additional information, such as knowledge of the time at which the two candidate effects occurred. Statisticians very often assume that the statistics they have available are part of a single multivariate normal distribution. In this case, all the available information can be held as a mean vector and a variance-covariance matrix, and this information does not allow us to distinguish cause from effect.
We can only hope to distinguish cause from effect, then, if we give up the mathematical convenience of multivariate normality. William Chambers published a series of Usenet articles which he claimed demonstrated a means of distinguishing cause from effect. He made use of the central limit theorem, which states that, under certain conditions, sums of non-normally distributed random variables take on something closer to the normally distributed ideal, becoming normal as the number of variables in the sums grows to infinity. In his model, the more normally distributed variable must be the effect, and the less normally distributed variable the cause. The examples appeared artificial, and it seemed at least possible that there might be some other, also artificial, way of building a model which displayed the same statistics but had cause and effect reversed. A great deal of heat was generated on Usenet.
In this web page I describe two models constructed entirely from everyday statistical components. They have cause and effect running in two different directions, and they are distinguishable from each other, at least in principle. I use computer experiments to show how easy it is to tell which of the two models is correct for various parameters. I then provide example situations where one model or another will be found to fit, but the supposed arrow of cause and effect is at best misleading, if not plain wrong.
Sometimes statisticians are forced to reject the assumption of normality, because it is obviously implausible. One example of this is when we measure a proportion, such as the percentage of time spent by somebody doing something. Nobody, no matter how dedicated they may be, can really spend 150% of their time studying statistics (or spend 150% of their time reading Science Fiction, so leaving them with -50% of their time to study statistics). Therefore a normal distribution, which may possibly attain values anywhere in the range from minus infinity to plus infinity, predicts things which cannot possibly happen in real life. One common fix for this is to apply the logistic function. One way to derive this is to say that the proportion p, if turned into log odds, is normally distributed. We have x = log(p / (1 - p)), or p = 1 / (1 + exp(-x)). For x very negative, 1 + exp(-x) becomes very large, so p becomes very close to 0, but never quite reaches it. For x very positive, 1 + exp(-x) becomes very close to 1, so p becomes very close to 1, but never quite reaches it.
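The pair of formulas above can be sketched in a few lines of Python. This is only an illustration: the function names `logistic` and `logit` are my own labels, and the branch on the sign of x is a standard numerical precaution rather than anything the text requires.

```python
import math


def logistic(x: float) -> float:
    """Turn log odds x into a proportion p = 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x use the algebraically equal form exp(x) / (1 + exp(x)),
    # which avoids overflow in exp(-x) when x is very negative.
    e = math.exp(x)
    return e / (1.0 + e)


def logit(p: float) -> float:
    """Turn a proportion p, strictly between 0 and 1, into log odds."""
    return math.log(p / (1.0 - p))


# Log odds far from zero give proportions close to, but short of, 0 and 1.
for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(x, logistic(x))
```

One caveat: "never quite reaches it" is a statement about the mathematics. In floating point arithmetic, a sufficiently extreme x does round to exactly 0.0 or 1.0, and `logit` will then fail.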
In our first model the proportion is an effect. So we have a normally distributed x (for example an indicator of ability) and this produces a proportion which we observe (for example, people with a native ability to understand statistics find studying it more rewarding, and so spend more time doing so). Suppose that we observe our normally distributed x, and then the result of applying the logistic function to the sum of x and normally distributed noise.
In our second model, the proportion is a cause. Perhaps people are good at statistics because they spend a lot of time studying it. In this case the proportion of time spent studying statistics is the result of applying the logistic function to a normally distributed y, and we measure their statistical ability, x, by observing the sum of this non-normally distributed function of y and normally distributed noise.
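The two models can be written down as data generators. The sketch below is one possible parameterisation, not necessarily the one used for the experiments on this page: `mean` and `sd` describe the normal value fed to the logistic function, `rho` is the intended correlation between cause and effect, and in the second model the noise scale is set from the sample standard deviation of the proportion so that the correlation comes out near `rho`. All of these names and choices are assumptions made for illustration.

```python
import math
import random


def logistic(v: float) -> float:
    """Numerically stable p = 1 / (1 + exp(-v))."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def proportion_as_effect(n, mean, sd, rho, rng):
    """First model: a normal ability x drives the proportion p."""
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        # Value fed to the logistic function: N(mean, sd^2), with
        # correlation rho to x (assumed parameterisation).
        z = mean + sd * (rho * x + math.sqrt(1.0 - rho * rho) * noise)
        data.append((x, logistic(z)))
    return data


def proportion_as_cause(n, mean, sd, rho, rng):
    """Second model: the proportion p drives the measured ability x."""
    ps = [logistic(rng.gauss(mean, sd)) for _ in range(n)]
    p_bar = sum(ps) / n
    sd_p = math.sqrt(sum((p - p_bar) ** 2 for p in ps) / n)
    # Noise scale chosen so that corr(x, p) is roughly rho (assumption;
    # requires 0 < rho <= 1).
    noise_sd = sd_p * math.sqrt(1.0 - rho * rho) / rho
    return [(p + rng.gauss(0.0, noise_sd), p) for p in ps]


rng = random.Random(0)
effect_data = proportion_as_effect(1000, 0.0, 3.0, 0.5, rng)
cause_data = proportion_as_cause(1000, 0.0, 3.0, 0.5, rng)
print(effect_data[:3])
print(cause_data[:3])
```

Both generators return (x, p) pairs, so whatever procedure tries to tell the models apart sees the same two observed columns in either case.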
Should these be distinguishable? Suppose that we are in a regime where the mean of the value fed to the logistic function is 0 (which would produce p = 1/2) but the variance is very high. Therefore most of the values produced by the logistic function will be either very close to 0 or very close to 1. In this world you either live and breathe and talk statistics, or you run away from it wherever and whenever it appears. If the amount of time spent studying is an effect, we therefore observe an entirely normal underlying ability and a logistic time spent proportion, which may appear to be bimodal, since most of the values are very close to 0 or very close to 1. If the amount of time spent studying is a cause, then the resulting distribution of ability may be bimodal as well, since it is the sum of normally distributed noise and a random variable which is, as near as damn-it, either pretty much 0 or pretty much 1. At least in theory, these two situations are indeed distinguishable.
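To put a rough number on "most of the values", we can compute the probability that the proportion lands within some margin of 0 or 1 when the value fed to the logistic function is normal. The sketch below uses the normal cumulative distribution function via `math.erf`. The margin of 0.1 and the function names are arbitrary choices made for illustration.

```python
import math


def normal_cdf(v: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))


def fraction_near_extremes(mean: float, sd: float, margin: float = 0.1) -> float:
    """P(p < margin or p > 1 - margin) where p = logistic(z), z ~ N(mean, sd^2)."""
    # p > 1 - margin exactly when z > logit(1 - margin); by symmetry
    # p < margin exactly when z < -logit(1 - margin).
    cut = math.log((1.0 - margin) / margin)
    below = normal_cdf((-cut - mean) / sd)
    above = 1.0 - normal_cdf((cut - mean) / sd)
    return below + above


# Larger standard deviations push more of the proportions towards 0 or 1.
for sd in (1.0, 2.0, 3.0, 10.0):
    print(sd, fraction_near_extremes(0.0, sd))
```

The fraction grows steadily with the standard deviation, so how bimodal the proportion looks depends directly on how high "very high" variance is.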
A Bayesian would be able to produce a most powerful means of distinguishing one model from another almost as quickly as they could type, after assuming priors for the various parameters involved. I wish to make clear that I am not importing information from the priors, so I will try to distinguish one model from another by finding the maximum likelihood fit to the data for both models, and comparing the two maximised likelihoods. The resulting likelihood ratio plays the role of a Bayes factor.
Under the "proportion as cause" model, we observe a logistic proportion as cause and a noisy version of it as effect. By reversing the logistic function, we can choose to observe its normally distributed input. By performing a linear regression between the observed proportion and the observed ability, we can work out the best model for the ability given the proportion. We can therefore work out the best fitting parameters in this case.
Under the "proportion as effect" model, we have a normally distributed measured ability. Given this, we again reverse the logistic function to find a problem in linear regression, and work out the best fitting parameters.
Both fits involve a problem in one variable linear regression and a single normally distributed value. They therefore have not only the same number of parameters, but basically the same structure; only the relative position of the logistic transformation is different. So we can compare the likelihoods from the two fits without any trouble. The log likelihood produced by the best fit for X is related to Var(X), and the log likelihood produced by the best fit for Y given Z is related to Var(Y) - Cov(Y, Z)^2/Var(Z), so it turns out that all we really need are the variance of the normally distributed observation, the variance of two versions of the proportion, before and after undoing the supposed logistic transform, and the covariances between the normally distributed observation and the two versions of the proportion.
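The comparison just described can be sketched as follows. This is a reconstruction under stated assumptions, not the original program: variances use the maximum likelihood divisor n, a maximised normal log likelihood is taken as -n/2 (log(2 pi v) + 1) for fitted variance v, and the names `loglik_gap` and `residual_var` are hypothetical. The Jacobian of the logit transform appears in both likelihoods and cancels, which is why it is not computed.

```python
import math
import random


def logistic(v):
    return 1.0 / (1.0 + math.exp(-v)) if v >= 0 else math.exp(v) / (1.0 + math.exp(v))


def logit(p, eps=1e-12):
    p = min(max(p, eps), 1.0 - eps)  # guard against p rounding to 0 or 1
    return math.log(p / (1.0 - p))


def cov(a, b):
    """Maximum likelihood (divide by n) covariance; cov(a, a) is the variance."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / n


def residual_var(y, z):
    """Var(Y) - Cov(Y, Z)^2 / Var(Z): variance left after regressing y on z."""
    return cov(y, y) - cov(y, z) ** 2 / cov(z, z)


def loglik_gap(xs, ps):
    """Maximised log likelihood of 'proportion as effect' minus 'proportion as cause'."""
    n = len(xs)
    zs = [logit(p) for p in ps]
    # Effect model: x is normal, and logit(p) is regressed on x.
    effect = math.log(cov(xs, xs)) + math.log(residual_var(zs, xs))
    # Cause model: logit(p) is normal, and x is regressed on p itself.
    cause = math.log(cov(zs, zs)) + math.log(residual_var(xs, ps))
    return -0.5 * n * (effect - cause)


# Illustrative data from each model (mean 0, standard deviation 3, correlation 0.5).
rng = random.Random(1)
n, rho = 5000, 0.5
xs_e = [rng.gauss(0.0, 1.0) for _ in range(n)]
ps_e = [logistic(3.0 * (rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0.0, 1.0))) for x in xs_e]
ps_c = [logistic(rng.gauss(0.0, 3.0)) for _ in range(n)]
noise_sd = math.sqrt(cov(ps_c, ps_c) * (1 - rho ** 2)) / rho
xs_c = [p + rng.gauss(0.0, noise_sd) for p in ps_c]

print(loglik_gap(xs_e, ps_e))  # positive favours "proportion as effect"
print(loglik_gap(xs_c, ps_c))  # negative favours "proportion as cause"
```

Under these assumptions the gap simplifies to -n/2 [log(1 - r_xz^2) - log(1 - r_xp^2)], where r_xz is the sample correlation between x and logit(p) and r_xp is the sample correlation between x and p. The test therefore amounts to asking whether x is more strongly correlated with the proportion before or after the logistic transform is undone.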
I expect the strength of this effect to depend on the mean and variance of the normal distribution just before it is turned into a proportion, and on the correlation between the cause and the effect. Snedecor's book, "Statistical Methods", says (section 7.2, "The sample correlation component") that "Each field of investigation has its own range of coefficients. Inherited characteristics like height ordinarily have coefficients between 0.35 and 0.55...". I have mostly used a correlation of 0.5. I wrote a computer program to repeatedly generate data for both models, and then to see whether it could tell the difference between them. Here are some results.
| Mean | Std. Dev. | Observations | Trials of Each Kind | Proportional Cause Correct | Normal Cause Correct | Correlation |
|---|---|---|---|---|---|---|
| 0.0 | 2.0 | 100 | 10000 | 7703 | 7631 | 0.5 |
| 0.0 | 3.0 | 100 | 10000 | 8380 | 8326 | 0.5 |
| 0.0 | 3.0 | 200 | 10000 | 9216 | 9176 | 0.5 |
| 1.0 | 3.0 | 200 | 10000 | 9355 | 9346 | 0.5 |
| 2.0 | 3.0 | 200 | 10000 | 9650 | 9640 | 0.5 |
| 2.0 | 3.0 | 400 | 10000 | 9966 | 9948 | 0.5 |
| 2.0 | 3.0 | 400 | 10000 | 9515 | 9550 | 0.35 |
| 0.0 | 1.0 | 5000 | 10000 | 9536 | 9512 | 0.35 |
You seem to need hundreds of observations, with quite an extreme bimodal proportion, to use this effect reliably, but not necessarily an extremely strong correlation. If you look at the resulting distributions on their own, the normal observation is not noticeably non-normal when driven by the proportion (that is, its non-normality is noticeable to statistical tests, given enough data, but not to the human eye, even with a Q-Q plot), but the proportional distribution is very obviously bimodal.
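An experiment of this kind can be sketched as a small Monte Carlo harness. What follows is not the program that produced the table and should not be expected to reproduce its counts. It assumes one particular parameterisation (mean and standard deviation of the value fed to the logistic function, with the noise scaled from the sample to hit the target correlation) and uses far fewer trials so that it runs quickly. All function names are illustrative.

```python
import math
import random


def logistic(v):
    return 1.0 / (1.0 + math.exp(-v)) if v >= 0 else math.exp(v) / (1.0 + math.exp(v))


def logit(p, eps=1e-12):
    p = min(max(p, eps), 1.0 - eps)  # guard against p rounding to 0 or 1
    return math.log(p / (1.0 - p))


def cov(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / n


def prefers_effect(xs, ps):
    """True when 'proportion as effect' has the higher maximised likelihood."""
    zs = [logit(p) for p in ps]
    effect = math.log(cov(xs, xs)) + math.log(cov(zs, zs) - cov(zs, xs) ** 2 / cov(xs, xs))
    cause = math.log(cov(zs, zs)) + math.log(cov(xs, xs) - cov(xs, ps) ** 2 / cov(ps, ps))
    return effect < cause  # smaller fitted variances mean higher likelihood


def draw_effect(n, mean, sd, rho, rng):
    """Normal x is the cause; the proportion is the effect."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    k = math.sqrt(1.0 - rho * rho)
    ps = [logistic(mean + sd * (rho * x + k * rng.gauss(0.0, 1.0))) for x in xs]
    return xs, ps


def draw_cause(n, mean, sd, rho, rng):
    """The proportion is the cause; x is the proportion plus normal noise."""
    ps = [logistic(rng.gauss(mean, sd)) for _ in range(n)]
    noise_sd = math.sqrt(cov(ps, ps) * (1.0 - rho * rho)) / rho
    xs = [p + rng.gauss(0.0, noise_sd) for p in ps]
    return xs, ps


def run_trials(trials, n, mean, sd, rho, seed=0):
    """Count correct decisions for each true model."""
    rng = random.Random(seed)
    cause_ok = sum(not prefers_effect(*draw_cause(n, mean, sd, rho, rng)) for _ in range(trials))
    effect_ok = sum(prefers_effect(*draw_effect(n, mean, sd, rho, rng)) for _ in range(trials))
    return cause_ok, effect_ok


cause_ok, effect_ok = run_trials(trials=200, n=200, mean=0.0, sd=3.0, rho=0.5)
print("proportional cause correct:", cause_ok, "of 200")
print("normal cause correct:", effect_ok, "of 200")
```

Raising `trials` to 10000 and varying `n`, `mean`, `sd` and `rho` gives a grid comparable in shape to the table. The counts will depend on the seed and on the assumed parameterisation.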
I have an image in my mind of a large box of dark wood, with various handles and levers protruding from it through small holes cut in the casing. Most of them are moving, in a vaguely co-ordinated pattern. You can note down their movements, and the correlations between them, for as long as you like. You can even rest your hands lightly on the levers, in an attempt to sense subtle but revealing vibrations from the underlying machinery. But until you actually close your hands on them and attempt to move them, you won't know which ones do something, which ones are immovable to human strength, and which ones will come off in your hands.
Our toy situation contains a model in which practice brings forth skill, and one in which underlying talent surfaces as both ability and inclination. A large employer, or a society, might care about the difference if it was deciding whether education should be primarily a matter of providing training, or primarily a mechanism for screening for talent. We can also put ourselves in the place of a medical researcher wondering if some characteristic of lifestyle is a cause or an effect of some characteristic of people who take part in it.
In our first model, we find an entirely normally distributed range of ability, and a strongly bimodal distribution of proportion of time spent studying. The proportion is an effect of the underlying ability, and so our employer would choose to screen for people demonstrating the inborn ability, rather than training employees to foster it. Can we shift the balance in favour of training?
If ability was not normal, we would have a better fit to the second model. But do we really know what the distribution of ability is? The distribution of examination scores is determined partly by the distribution of difficulty in the questions posed, or the way in which people are awarded points when their essay answers achieve particular goals. The resulting scores may also be recoded. Very often these features, and others, are specifically designed to make the examination scores fit a normal distribution. So our employer can't assume that ability is independent of time spent practising. But perhaps this argument doesn't work for our medical researcher, where there might be a natural measure of our characteristic, such as height compared between professional basketball players and other people.
If the proportion of time spent studying was less bimodal and more normal, we would have two normals, and we know that there is no arrow of cause and effect there. The distribution of time spent is fixed, but what about its effects? We know that we can transform between normal and logistic-normal distributions. If we hypothesize that the effects of spending an additional second learning are larger when almost all of the time is spent learning, or almost none of the time is spent learning, then we could transform our bimodal proportion back into a normal distribution. I can believe that the first small amount of time spent on a subject is extremely valuable, because that brings us back to the law of diminishing returns. I am less keen on the idea that complete absorption yields extraordinary benefits.
We think that we have a normally distributed driving factor, but we know that we should be suspicious when the words "normal", "cause", and "effect" are too close together. Perhaps this supposed normally distributed driving factor merely happens to be correlated with the true underlying cause. Perhaps our normally distributed underlying ability is in fact the sum of a large number of components reflecting different sorts of mathematical and scientific training before exposure to statistics. In that case, rather than screening for a normally distributed ability, we could train each of these varied contributing skills.
The 'levers come off in your hand' theory suggests that, even if observed ability is the driving factor for measured ability, screening for ability will not be the best strategy. This might be the case if the ability measured is not identical to the ability that we actually require. Perhaps we are measuring underlying ability with an examination that asks students to solve interesting problems that require flashes of insight, whereas real life situations are, or can be made to be, much more routine. Training addressed to real life situations would then be much more valuable.
In our ability vs training example, we might observe that most people either do a lot of training or a little, and that most people are either very good or very bad (for example, the dialog "Can you play the violin?" / "I don't know; I've never tried" is not intended to be taken seriously). Here we might deduce that time spent training is the cause and ability the effect. We can appeal to our measurement of the resulting ability. Perhaps it is intrinsically bimodal. If the measured score is "percentage of attempts applicant could walk through a standard doorway without stooping or bumping their head," and the time spent training is "number of times participant attempted to walk through a standard doorway without stooping," then we could observe that both measures were strongly bimodal without there being a training effect.
For this direction, the "levers come off in your hand" possibility has a natural interpretation. Under this, training is only valuable when the participant requests it; if the employer mandates it, training unwilling participants may be useless or even counter-productive. We could also consider the possibility that the training given is not effective, or not relevant to the real world tasks. The same things could apply to our medical researcher's proposed treatments, or proposed methods of inducing lifestyle changes.
If the set of plausible models is very restricted, then you may be able to tell cause from effect using observational studies alone, without information on relative timings. If this is the case, then you have probably missed a model or an interpretation of a model. Regardless of that, if what you really want to do is to work out whether a planned action will have the intended effect, there is no substitute for actually trying it and seeing what happens.