Building a statistical model allows you to predict future observations. As well as being useful in its own right, this can be a base for more general statements: perhaps you can look at the model and say that increasing x1 by 10 leads to an increase of 20 in the predicted y, or at least that it tends to move the predicted y in one direction or another, depending on the value of x1 and other variables in the model.
Many models come with their own probability structure, for example that y is a linear function of x plus noise with mean 0 and some constant but unknown variance. In this case, maximum likelihood, or some more model-specific theory, will probably tell you how to fit the model and how to assess the reliability of the fitted parameters.
On the other hand, you may not know the structure of the underlying relationship. You may even believe that even if you knew it it would be so complex, and contain so many unknown parameters, as to be useless to you. The noise term may be unknown, and its variance may change in unknown ways: many non-normal distributions are generated by mechanisms that tend to change both the mean and the variance at the same time. If your model is only an approximation of the real underlying mechanism, its accuracy may be better in some regions than others.
One approach to such intractable situations is to use the Bootstrap, as demonstrated, for instance, in "Regression Modeling Strategies", by Harrell. This reference also shows the power of model-based approaches. I dislike the Bootstrap, because few absolute guarantees are possible. In many cases Monte Carlo investigations show that the confidence regions generated from this are close to the claimed size, but not exact: a claimed 95% confidence region may contain the true value only 90% of the time. Most of theory provides results only for large or infinite amounts of data.
Permutation tests, like the Bootstrap, are a conceptually simple computer-intensive approach. Where applicable, they tend to produce exact calculations of probabilities. An excellent reference for these is "Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses", by Phillip Good. Here I attempt to apply them to modelling.
The following is a particular case of a fairly standard idea which does have a theoretical justification: that of forming a confidence region from a hypothesis test. See, for example, section 7.5 (Estimation of Future Observations) of "Theoretical Statistics", by Cox and Hinkley, and section 2.1 (Confidence regions for future realizations) of "Predictive Inference: An Introduction" by Geisser.
However, you may feel that the assumptions I make are unreasonably strong, or that the resulting method has little practical use. It does, however, provide the basics of predictive modelling: a confidence region to back its predictions, and a measure of goodness of fit. The assumptions I make are not inconsistent with those made by more traditional methods, so if the results from this method are too different from those of your favourite method, you should probably regard this as a warning.
This methods also has one amusing feature, in that it tends to allow only bounded extrapolation: for most models, the prediction regions for values outside the data points grow rapidly, until it effectively outlaws extrapolation by returning confidence intervals of infinite length.
The main assumption can be likened to a game. Your opponent provides vectors of independent x variables, each with a single matching dependent y variable, at n+1 points. The referee picks one point at random and hides the value of its dependent y variable before passing you the data. It is up to you to guess that variable. This assumption is most reasonable when you have gathered the samples with known y by sampling from a larger set at random. You might then use this to predict the y values in the unsampled members of the larger set.
Your approach is as follows: guess a value for the hidden variable and apply your model to provide a fit and n+1 fitted values for the dependent variable, so that each has a residual. If the model predicts different noise terms at different points, you may wish to scale the residuals. Now suppose that you sort the scaled residuals into increasing order. If you happen to have guessed the hidden y value correctly then, since the referee chose the point to hide at random, its position in the vector of sorted scaled residuals is also random, occupying any of the n+1 positions with probability 1/(n+1).
This means that if you take for the prediction confidence region all values of predicted y that lead the scaled residual to occupy the central k of the n+1 possible positions, the chance that your predicted region does not include the true y value is (n + 1 - k)/(n+1). Typically you find the region by an iterative computerised method similar to that for solving non-linear equations in a single variable: here the two equations concerned are (residual at hidden value) = (lth residual in order not counting the one for the hidden value) and (residual at hidden value) = (mth residual in order not counting the one for the hidden value), for pre-chosen values of l and m. In most cases, this is a tractable and reasonably well understood problem: see books such as "Numerical Recipies" for ideas.
This works when the resulting interval of y values is reasonably manageable. If the x vector with the hidden y value is too far outside the range of the other x vectors, then, for many models, the hidden y value will have high leverage, with so much influence on the fitted model that it is impossible for its residual to be extreme, no matter what the y value. I regard this as the method refusing to extrapolate too far outside its base.
I implemented and tested this for the obvious case. The Yi were in fact produced by an affine (linear plus constant) function of the Xi (plus a randomly generated error). I then produced the residuals by finding the affine function which provides the best fit to the observed (Xi, Yi), including the unknown Yi that we were trying to predict. To solve the residual equations, I used a version of the 'method of false position', which amounts to approximating the equation to be solved near the solution by a straight line. I tested my prediction method by comparing it with a more traditional method. This simply predicts the missing Yi by finding the best affine fit to the completely observed pairs and evaluating it at the Xi with missing Yi, producing a confidence interval either from the observed variance so far (assuming normality), or via leave-out-one cross-validation. To produce a single point estimate from our rank-based confidence interval I chose the value of Yi which would make its residual the median observed residual. Three different sources of random error were used: one uniform on [0, 1], one exponential, and one normal.
The resulting point estimate is less accurate than the simple linear prediction, even for non-normal errors. It is, however, the only method true to its confidence limits in all cases. The confidence limits based on an assumption of normality are wrong for non-normal errors. The simple confidence limits I derived from leave-out-one cross-validation are pessimistic, because I didn't use provide many samples to predict from, so leaving out one pair decreases the accuracy, and the real prediction does better than the leave-out-one cross-validation attempts did. Since in my test I did have an independent referee choosing a Yi to withold, the assumptions of this residual-based prediction method are correct, and it gives accurate confidence limits in all cases. Nevertheless, if this simulation is representative, I would prefer to go for the standard linear predictor, with its better accuracy on point estimates, even if its confidence interval is too conservative. (you have been warned!).
The fact that the point estimate producing the median observed residual is not equal to the ordinary fitted value from the model shows that the fitted value can be outside our confidence interval, because the median residual will be the sole occupant of the confidence region when a very narrow confidence interval is requested - perhaps a 1% confidence interval. However, the ordinary fitted value is not likely to be outside the confidence interval in most cases. Most models fit by minimising some function of the residuals, typically the sum of their squares. In this case, the value produced at the hidden point by fitting the model to the other n points alone will turn out to have residual zero when we fit the model to all n+1 points, and we do not expect the median residual to be very far from 0. (We get residual zero when we replace the hidden value with the value predicted based on the other n points because this minimises the total sum of squares when we fit the model to n+1 points: residual 0 on the hidden point, and the minimum possible sum of squares for the other n points).
These prediction intervals reflect both the uncertainty that we have the right model, and the noise present in every observation that would be visible even if we had the exact model. For very large amounts of data, and when the model we are fitting is in fact the correct model, we should find that our prediction intervals reflect only the noise term.
As a check, cosider the trivial case when a single x variable takes values 0 and 1, and suppose that in these two cases the mean of the y variable produced takes the values a or b respectively, with normally distributed noise of mean 0 and standard deviation 1. We fit the model by taking the average of the values observed in the two cases, and for large amounts of data our fitted model will be very nearly the correct model, so that our prediction intervals will reflect the observed empirical distribution of the noise.
Note that this is different from model uncertainty. If the two means are 1 and -1, we will very quickly realise that they are different, but prediction intervals produced for the two possible cases may still overlap at reasonable confidence levels. This reflects the fact that with the noise having standard deviation 1, it is still possible to observe a value of (for instance) -0.5 from a population with mean 1, or +0.5 from a population with mean -1.
Whether this is what you want depends on whether you are really predicting a single value or trying to discern a trend. If the latter, you probably need to make more assumptions about the model, unless you can observe different y values with identical x values. As a counter example, consider cases when a deterministic 'noise' term is derived by taking a cryptographically strong hash of the input x values. This produces an entirely deterministic model indistinguishable from a noisy model unless the model is known exactly, or you can see results from identical x values.
With this method, fit assessment is secondary to the generation of confidence intervals. It might be used to chose between two models, or to tune the parameters of a model outside the main fitting procedure. Note that doing either of these two things invalidates the confidence intervals produced, because the uncertainty in model choice or model tuning is itself a source of error. The only way to get round this effect is to turn model choice or tuning into a precisely defined computerisable procedure, consider it as a 'black box' model fitting procedure, and run it at each iteration as we attempt to solve "residual from black box procedure at hidden value" = mth residual in order ignoring the hidden value for the two values of m that form the bounds of our confidence area. This is likely to prove expensive, unless you can find some clever optimisation, perhaps sacrificing accuracy for speed in the early interations.
Since we already have accurate prediction confidence intervals, we can gauge the fit of the model by the size of those intervals: the smaller the better! Of course, the size of the prediction confidence interval will depend on the point at which the prediction is made. This may be a reflection of the amount of influence that point has in the fit, or it may be a reflection of the amount of noise the model expects at that point. We might therefore gauge model fit by the expected length of the confidence interval, or the median length of the confidence interval, given the distribution of the values we wish to predict.
We can find out the length of the confidence interval at any point we care to examine just by running through the prediction process. In some cases, our knowledge of the distribution of the prediction points may be limited to the sample points we used for training. In this case we can, in sequence, exclude the observed value of each traning point from our data, and note down the length of the prediction interval for it generated from the other n points. Note that we need not check to see if the predicted confidence interval in fact contains the actual value, because we can trust the confidence intervals to be of the specified size.
Using this process we could work out an average confidence interval length, or a median confidence interval length. Can we trust this value? If our modelling process is sensible, we should expect it to be conservative, because we are assessing the performance of the prediction process given n-1 training points, when in fact we have n points complete with the actual value at that point. The prediction precess we have just tested is closer to the following (presumably sub-optimal) modelling strategy: throw away one of the n points at random and predict the unknown value using the other n-1.
If we go back to our referee, and their random choice of one point from n+1 whose y value will be hidden, we see that each test prediction corresponds to us choosing at random to hide a second value. Therefore the expected length of the prediction interval at the point hidden by the referee, if predicted using just the n-1 remaining known y values, is the same as the expected length of the prediction interval for the point we have hidden ourselves, if predicted using those n-1 remaining y values. Therefore our expected prediction interval length is unbiassed for the length of the prediction interval for our (sub-optimal) 'hide a second value at random' prediction strategy. This also holds if you consider not the length of the prediction interval, but some function of the length. This is useful, because some of the lengths may be infinite, so you could use a function that maps those infinite lengths onto something more reasonable, such as a measure of cost, so you don't get infinite expectations.
I would like to say something about a non-parametric measure of the confidence interval lengths, for example provide a measure depending on the median of the lengths of the observed intervals, but there seem to be problems with this, because our adversary has quite a lot of control over what we observe via the model. By choosing a suitable model, they may be able to simultaneously control all the residuals, leaving us and the referee only the freedom to sample from them at random. Consider the situation when there are six points, of which the referee hands us five chosen at random. I will represent an observation as e.g. Abcde, which means that we were given the five points a,b,c,d and e, and calculated the length of the confidence interval for predicting at point a, given the values at b-e. We can write these points out in a table, such that the columns correspond to the test predictions we can make from our five known values and each column goes with a line of values possibly split over two rows, which are the predictions we might make for the point which has been hidden for it. Here is such a table, with a very contrived prediction interval length written down beside the 5-letter combinations, and with one column and its corresponding split row marked with the html header emphasis.
| Abcde 1 | Abcdf 21 | Abcef 9 | Abdef 20 | Acdef 19 | Bcdef 10 |
|---|---|---|---|---|---|
| aBcde 2 | aBcdf 4 | aBcef 8 | aBdef 13 | aCdef 18 | bCdef 11 |
| abCde 3 | abCdf 5 | abCef 7 | abDef 24 | acDef 23 | bcDef 22 |
| abcDe 14 | abcDf 6 | abcEf 26 | abdEf 25 | acdEf 15 | bcdEf 12 |
| abcdE 27 | abcdF 30 | abceF 29 | abdeF 28 | acdeF 17 | bcdeF 16 |
The column marked out has a median prediction interval length of 9, but the split row marked out of predictions you might make using four of the values in that column has a median prediction interval length of 22. In fact, the median test and observed prediction intervals are as follows:
| Median test length | Median true length |
|---|---|
| 3 | 28 |
| 6 | 25 |
| 9 | 22 |
| 24 | 7 |
| 18 | 8 |
| 12 | 19 |
We can take the median of the values 1-30 as 15.5, so we see that 4 of the 6 test prediction interval lengths that we might use to assess the likely length of the next prediction interval are below this median, but their corresponding actual prediction interval lengths - for the sole remaining variable, are above this median.
This example has the merit of being simple enough for hand calculation, while still being based on a real problem. See Columbia Accident Investigation Board, Prediction, Laplace, and the Second Coming, and Posterior Predictive p-values (which shows mathematically that hindsight will almost always rate events that have just occurred as more probable than they were previously taken to be).
Suppose that you have a piece of equipment that is not quite functioning as designed, but which appears unaffected by this discrepancy in 111 tests. With what confidence can we predict success on the 112th test? In other words, we assume that it is known that there are only two possible values: 0 and 1. We have observed 111 0s. Ignoring all other sources of information, with what confidence can we predict a 112th 0? To predict success with confidence, we need to outlaw the possibility that there are 111 0s and a single 1 in the sample of size 112, and that our referee happened to hide the single 1. So we can predict success if our confidence level is at most 1-1/112, or 99.11%: in other words, a string of 111 launches where this effect did not cause a serious problem does not give you enough confidence to avoid looking into the problem more seriously.
One attractive way to answer questions like this is to fit a model that predicts variable Y from variables Xi, and then see if variable T is correlated with the residuals from that model. There is an easy permutation test to see if the vector A is correlated with the vector B: form the dot product, SUMi Ai * Bi, and then permute A (or, equivalently, B) and repeat the exercise. The probability of seeing a result at least as low (high) as that observed is taken to be the fraction of the permutations which preduce a result at least as low (high) as that observed, if the assumptions are satisfied. See Good's book for much more info.
The assumptions, of course, boil down to assuming that at least one of the A or B are exchangeable - that every permutation of the observed value is equally likely under the null hypothesis. The catch is that in general there is absolutely no reason to assume that this is the case: the null hypothesis probably says that Yi = f(Xi) + ei, but it does not necessarily say that the error terms ei are exchangeable and, even if it did, the process of fitting the model probably means that the residuals aren't exchangeable.
However, there is one easy case where the assumptions are satisfied: if the data you are analysing are results from an experiment, you have the opportunity to make T exchangeable when you plan the experiment: just take the T you first think of and randomly permute its elements amongst themselves. Then the assumptions of the permutation test are satisfied, not because the residuals are exchangeable, but because the dependent variable T is. This is a good idea in general, anyway, because it stops some influence you haven't thought of, or can't guard against - such as the relative order of the individual readings, or the weather, or the alertness of the experimenter, being confused with T.
Here is a link to a page of descriptions of Java programs, including some related to this topic. A Jar file of source is here.
Here is a link back up to my home page.