The problem here is that you are using the same data twice: once to fit the model parameters, and then again as input to a significance test using those parameters. There are two advantages. First, the posterior distribution on the model parameters may well be proper, even if the prior distribution on those parameters is not. Second, Monte Carlo techniques for working with Bayesian models tend to produce samples from the posterior distribution, which makes it relatively easy to estimate posterior predictive p-values: draw a simulated observation for each sample from the posterior distribution on the model parameters, and compare these simulated observations with the real observation.
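As a rough illustration of that Monte Carlo recipe, here is a minimal sketch in Python. Everything specific in it is my own assumption rather than anything taken from the references: the data are a 0/1 sequence modelled as independent Bernoulli(t) draws, the prior on t is Beta(1, 1) so that the posterior can be sampled directly, and the test statistic (`count_switches`) is the number of places where the sequence changes value. In a real application the posterior draws would come from whatever sampler you are already running.

```python
import random

def count_switches(seq):
    """Test statistic S: number of adjacent positions where the value changes."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a != b)

def posterior_predictive_p_value(observed, n_draws=5000, seed=0):
    """Estimate P(S(replicated) <= S(observed)) under the posterior predictive."""
    rng = random.Random(seed)
    n = len(observed)
    successes = sum(observed)
    s_obs = count_switches(observed)
    at_least_as_extreme = 0
    for _ in range(n_draws):
        # One draw of t from the posterior Beta(1 + successes, 1 + failures)
        # (assumed conjugate model; a sampler would normally supply this).
        t = rng.betavariate(1 + successes, 1 + n - successes)
        # One simulated observation from the model at that value of t.
        replicated = [1 if rng.random() < t else 0 for _ in range(n)]
        if count_switches(replicated) <= s_obs:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_draws

# Hypothetical data: ten ones followed by ten zeros, so only one switch.
observed = [1] * 10 + [0] * 10
p_value = posterior_predictive_p_value(observed)
print(p_value)
```

With this made-up data the estimated p-value is very small, because independent draws would almost never produce so few switches. The fitted value of t is unremarkable, so it is the test statistic that exposes the misfit.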
The paper "Posterior Predictive p-values", by Xiao-Li Meng, in "The Annals of Statistics", 1994 Vol 22, No 3, 1142-1160, describes and discusses posterior predictive p-values very understandably and clearly, and proves theorems relating the posterior predictive p-values to prior p-values.
The use of posterior predictive p-values is described in the book "Bayesian Data Analysis", by Gelman, Carlin, Stern, and Rubin, in the following succinct paragraph from section 6.3:
Our basic technique for checking the fit of a model to data is to draw simulated values from the posterior predictive distribution of replicated data and compare those samples to the observed data. Any systematic differences between the simulations and the data indicate potential failings in the model...
Posterior predictive p-values are also described as one of several options in "Measures of Surprise in Bayesian Analysis", by Bayarri and Berger (available via Citeseer).
The manual for the software package BUGS describes this strategy, and how it can easily be followed in BUGS (section 9.3.3).
Integral p(x|t)p(t|x) dt ≥ p(x) = Integral p(x|t)p(t) dt
Here t represents the parameters of the model, which we take to be
distributed according to the prior distribution. p(x|t) is then the
probability of x given t, and p(t|x) is the posterior probability of
t given x. This inequality therefore shows that recalculating the
probability of x, using the posterior distribution on t derived from x
instead of the prior distribution on t, cannot decrease the probability
of x. This follows because
Integral p(x|t)p(t|x) dt = Integral p(x|t)p(x|t)p(t)/p(x) dt =
Expectation(p(x|t)^2)/p(x) =
Expectation(p(x|t))^2/p(x) + Variance(p(x|t))/p(x)
≥ Expectation(p(x|t))^2/p(x)
= p(x).
Here the expectation is over the prior distribution of t, so that
Expectation(p(x|t)) = p(x). Note that we have equality iff the variance
of p(x|t), according to our prior distribution on t, is zero, which is
to say that the probability of x given t is constant almost everywhere
that the prior distribution on t is positive. See also the postscript
for an application of this.
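The inequality is easy to check numerically when t takes only a few values. The sketch below uses a made-up three-point prior and made-up likelihood values p(x|t) for one fixed observation x; the function name `hindsight_probability` is mine. It computes p(x), the recalculated probability Integral p(x|t)p(t|x) dt (a sum here), and confirms that the gap between them is Variance(p(x|t))/p(x).

```python
def hindsight_probability(prior, likelihood):
    """Return (p(x), sum_t p(x|t) p(t|x)) for a discrete parameter t."""
    p_x = sum(p_t * l for p_t, l in zip(prior, likelihood))
    posterior = [p_t * l / p_x for p_t, l in zip(prior, likelihood)]
    hindsight = sum(l * q for l, q in zip(likelihood, posterior))
    return p_x, hindsight

# Hypothetical numbers: p(t) for three parameter values, and p(x|t) for one x.
prior = [0.5, 0.3, 0.2]
likelihood = [0.1, 0.4, 0.7]

p_x, hindsight = hindsight_probability(prior, likelihood)
variance = sum(p_t * (l - p_x) ** 2 for p_t, l in zip(prior, likelihood))
print(p_x, hindsight, p_x + variance / p_x)

# When p(x|t) does not depend on t, the variance is zero and we get equality.
flat_p_x, flat_hindsight = hindsight_probability(prior, [0.4, 0.4, 0.4])
print(flat_p_x, flat_hindsight)
```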
p(T'=t') = Integral p(T'=t'|X=x)p(X=x|T=t)p(T=t) dx dt =
Integral p(T'=t'|X=x)p(x) dx =
Integral p(X=x & T=t') dx = p(T=t')
Here T' represents the model parameters produced as samples from the posterior distribution, T represents the model parameters produced directly from the prior distribution, and X represents the observation. So t', t, and x are samples from T', T, and X respectively.
Since the distribution of posterior model parameters is identical to the prior distribution on those parameters, it follows that the posterior predictive distribution for the observations X' (marginalised over observations and model parameters) is the same as the distribution of observations implied by the model (marginalised over model parameters). This implies that, in the long run, the distribution of S(X) is the same as the distribution of S(X') assuming that the model and its prior distribution are correct.
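For a small discrete model this can be verified exactly rather than by simulation. The sketch below assumes a made-up model with three parameter values and three possible observations; the tables `prior` and `likelihood` are illustrative only. It averages the posterior over observations drawn from the model to get the distribution of T', and does the same for the replicated observation X'.

```python
# Hypothetical model: p(t) for three parameter values, and p(x|t) for x = 0, 1, 2.
prior = {"a": 0.5, "b": 0.3, "c": 0.2}
likelihood = {
    "a": [0.7, 0.2, 0.1],
    "b": [0.3, 0.4, 0.3],
    "c": [0.1, 0.2, 0.7],
}
n_x = 3

# p(x), marginalised over the prior on t.
p_x = [sum(prior[t] * likelihood[t][x] for t in prior) for x in range(n_x)]

def posterior(x):
    """p(t|x) by Bayes' theorem."""
    return {t: prior[t] * likelihood[t][x] / p_x[x] for t in prior}

# Distribution of T': draw x from the model, then t' from the posterior given x.
p_t_prime = {t: sum(posterior(x)[t] * p_x[x] for x in range(n_x)) for t in prior}

# Distribution of X': draw x, then t' from the posterior, then x' given t'.
p_x_rep = [
    sum(p_x[x] * sum(posterior(x)[t] * likelihood[t][x2] for t in prior)
        for x in range(n_x))
    for x2 in range(n_x)
]

print(p_t_prime)     # matches prior, up to rounding
print(p_x, p_x_rep)  # the two lists match, up to rounding
```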
If we just look at the distribution of X and X' (where X is an observation and X' is a sample from the posterior predictive distribution given that observation) we note that
p(X'|X) = Integral p(X'|t)p(t|X) dt = Integral p(X'|t)p(X|t)p(t)/p(X) dt
so
p(X',X) = p(X'|X)p(X) = Integral p(X'|t)p(X|t)p(t) dt = p(X,X')
Here t is the parameter controlling the underlying distribution, and we have that the joint distribution of (X,X') is symmetrical. This immediately tells us that p(X' > X) = p(X > X') ≤ 1/2, which tells us that the probability of any given X being higher than half of the X' sampled from it is ≤ 1/2, because if it were higher than this we could produce pairs of (X,X') with X > X' by sampling first X at random and then X' from its joint distribution. Furthermore, every distribution on (X,X') with p(X,X') = p(X',X) can be produced as the result of sampling X and then sampling from p(X'|X): let the parameter t select a box that chooses X with probability 1/2 and X' with probability 1/2, and set p(t) = 2p(X,X') = 2p(X',X).
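The symmetry of the joint distribution can also be checked exactly on a small example. The sketch below builds p(X,X') = Integral p(X'|t)p(X|t)p(t) dt as a table for a made-up discrete model (the numbers and the function name `joint_distribution` are assumptions for illustration), then compares p(X' > X) with p(X > X').

```python
def joint_distribution(prior, likelihood):
    """joint[x][x2] = sum_t p(x2|t) p(x|t) p(t) for a discrete parameter t."""
    n = len(likelihood[0])
    return [
        [sum(p_t * row[x] * row[x2] for p_t, row in zip(prior, likelihood))
         for x2 in range(n)]
        for x in range(n)
    ]

# Hypothetical model: p(t) for three parameter values, and p(x|t) for x = 0, 1, 2.
prior = [0.5, 0.3, 0.2]
likelihood = [
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.1, 0.2, 0.7],
]

joint = joint_distribution(prior, likelihood)
n = len(joint)
p_rep_greater = sum(joint[x][x2] for x in range(n) for x2 in range(n) if x2 > x)
p_obs_greater = sum(joint[x][x2] for x in range(n) for x2 in range(n) if x > x2)
p_tie = sum(joint[x][x] for x in range(n))

# The two strict probabilities agree, and each is at most 1/2 because
# together with the probability of a tie they sum to 1.
print(p_rep_greater, p_obs_greater, p_tie)
```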
This is loosely based on one of the proofs in the paper by Meng. I find that the original goes past very fast, so I've tried to take it step by step.
First suppose that we have an arbitrary function S(t) of some
random variable t. Let h(a) be a convex (U-shaped) function of a,
so that h(ma+(1-m)b) ≤ mh(a)+(1-m)h(b), where we have
0 ≤ m ≤ 1. Then Jensen's inequality (see e.g. Introduction to the
Theory of Statistics, by Mood, Graybill, and Boes, section 4.5) says
that
h(E(S(t))) ≤ E(h(S(t)))
where the expectation is over t chosen at random from some
arbitrary distribution.
If S(t) is in fact a function S(t, x) of both t and x, then the above
holds for any x we like, which means that we can also say
E(h(E(S(t, x)))) ≤ E(E(h(S(t, x))))
where the inner expectation is as before, and the outer expectation
is over x chosen at random from some arbitrary distribution. Since we
have established that our original inequality holds for arbitrary
distributions, we can let the distribution of t depend on x.
Now we let the distribution of t be P(t|x), the posterior distribution
for the parameters given x. We let S(t, x) be the tail probability of
some hypothesis, and we call E(S(t, x)) a' on the left.
We then get
E(h(a')) ≤ E(E(h(S(t, x))))
and it turns out that a', on the left hand side, is the tail
probability according to the posterior predictive distribution. What
about the right hand side? Well, the expectations take our observed x
and then work out p(t|x), so t and x are distributed according to
p(x)p(t|x) = p(t, x) = p(t)p(x|t). So we get the same result as if
we generated a random pair t, x, by generating t according to its
prior distribution and then generated x given t: this is generating
x and t exactly according to our underlying model, and we should
expect any tail probability S(t, x) to be uniform in [0, 1] under
these circumstances. Call this tail probability a. We have
E(h(a')) ≤ E(h(a))
for convex (U-shaped) h.
An example of such an h(a) is (a-0.5)^2, which you could think of as a measure of the distance between the p-value a and an 'ideal' of 0.5.
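To see the inequality in action, here is a small simulation under a model I have chosen purely because everything is available in closed form; it is not taken from Meng's paper. Assume t has a standard normal prior, x given t is normal with mean t and variance 1, and the tail probability is S(t, x) = P(X_rep > x | t). Then a = 1 - Phi(x - t), while the posterior for t given x is normal with mean x/2 and variance 1/2, so the posterior predictive distribution for X_rep is normal with mean x/2 and variance 3/2 and a' = 1 - Phi(x/sqrt(6)).

```python
import math
import random

def normal_cdf(z):
    """Standard normal cumulative distribution function Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def h(a):
    """The convex function from the text: squared distance from 0.5."""
    return (a - 0.5) ** 2

def simulate(n_pairs=20000, seed=1):
    """Estimate E(h(a)) and E(h(a')) with (t, x) drawn from the assumed model."""
    rng = random.Random(seed)
    total_h_a = 0.0
    total_h_a_prime = 0.0
    for _ in range(n_pairs):
        t = rng.gauss(0.0, 1.0)   # t from its prior
        x = rng.gauss(t, 1.0)     # x given t
        a = 1.0 - normal_cdf(x - t)                   # tail probability at the true t
        a_prime = 1.0 - normal_cdf(x / math.sqrt(6))  # posterior predictive p-value
        total_h_a += h(a)
        total_h_a_prime += h(a_prime)
    return total_h_a / n_pairs, total_h_a_prime / n_pairs

mean_h_a, mean_h_a_prime = simulate()
print(mean_h_a, mean_h_a_prime)
```

Since a is uniform in [0, 1], the first number should come out near 1/12. The second should be noticeably smaller, which reflects the posterior predictive p-values bunching up around 0.5.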
Meng refers to this property of a', that E(h(a')) ≤ E(h(a)) for any convex function h(), as a' being stochastically less variable than a. He goes on to prove an inequality involving integrals by referring to a result in a textbook on Stochastic processes. One of the consequences that Meng derives from this second theorem can be proved independently: if the data are derived according to the model, the probability of achieving a p-value of a or less is at most 2a. So we haven't proved that the posterior p-values are conservative, but we can prove that doubling them yields a conservative p-value.
Let h(x) be the piecewise linear function that is 1 at x=0, slopes
straight down to be 0 at x=q, and is then 0 from x=q to x=1. Since
a is uniform in [0,1] we have that E(h(a)) = q/2. h(x) is convex
because dh/dx is monotonically non-decreasing. For any fixed a, choose
q > a. Then h(a) = 1 - a/q. If x ≤ a then h(x) ≥ 1-a/q.
So if p is the probability that a' ≤ a then we have that
E(h(a')) ≥ p(1-a/q). So p(1-a/q) ≤ E(h(a')) ≤ E(h(a)) = q/2.
Hence p ≤ q/(2(1-a/q)) = q^2/(2(q-a)).
Simple calculus tells us that the best q to choose here is q=2a, which
gives us the inequality p ≤ 2a.
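If you would rather not do the calculus, the choice of q can be checked numerically. The sketch below evaluates the bound q^2/(2(q-a)) on a fine grid of q values between a and 1 for one illustrative value, a = 0.05 (the value, the grid spacing, and the function name `bound` are arbitrary choices of mine), and reports where the bound is smallest.

```python
def bound(q, a):
    """Upper bound on p = P(a' <= a) obtained from the piecewise linear h."""
    return q * q / (2.0 * (q - a))

a = 0.05
step = 1e-4
# Candidate values of q strictly greater than a and no larger than 1.
candidates = [a + k * step for k in range(1, int((1.0 - a) / step) + 1)]
best_q = min(candidates, key=lambda q: bound(q, a))
best_bound = bound(best_q, a)

print(best_q, best_bound)  # close to q = 2a and a bound of 2a
```

Note that q has to stay within [0, 1], so the choice q = 2a is only available when a ≤ 1/2; for larger a the bound 2a exceeds 1 and says nothing anyway.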
We have seen above that when you apply Bayes' theorem to work out p(t|X), the posterior distribution of the parameter t given the observation X, the resulting posterior distribution on t makes X at least as probable in hindsight as it was according to the prior, and that you have equality only if almost every parameter given non-zero weight by the prior produces the same probability for X.
One obvious application for this would be to repeatedly reapply Bayes' theorem using only a single observation. This would not produce a valid posterior, but would converge to a distribution on the parameters (locally) maximising the likelihood of the observation. I cannot, however, come up with an application in which this is practical.
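A toy version of this repeated reapplication is easy to write down. The sketch below assumes a parameter taking three values, with made-up prior weights and made-up likelihoods p(x|t) for the single observation x; `bayes_pass` is a name I have invented. Each pass records the probability of x under the current weights and then replaces the weights with the posterior.

```python
def bayes_pass(weights, likelihood):
    """One application of Bayes' theorem; returns (p(x), updated weights)."""
    p_x = sum(w * l for w, l in zip(weights, likelihood))
    updated = [w * l / p_x for w, l in zip(weights, likelihood)]
    return p_x, updated

# Hypothetical numbers: p(t) for three parameter values, and p(x|t) for one x.
weights = [0.5, 0.3, 0.2]
likelihood = [0.1, 0.4, 0.7]

history = []
for _ in range(30):
    p_x, weights = bayes_pass(weights, likelihood)
    history.append(p_x)

print(history[:3])  # the probability of x climbs from pass to pass
print(weights)      # nearly all weight ends up on the value maximising p(x|t)
```

In this example the weights pile up on the third parameter value, which has the largest likelihood, and the probability of x approaches that likelihood from below.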
Another application is more theoretical. Suppose that we wish to prove that a hidden Markov model is not equivalent to a Markov model of any finite order. Consider a sequence of identical observations. If the underlying model is Markov of order k, then the probability that the next observation in the sequence follows the pattern, given the previous n ≥ k observations, is the same for all such n, because by definition the probability distribution of a kth order Markov model depends only on the last k observations.
Now consider a hidden Markov model with a single parameter t. We start off with a prior distribution on t, and re-estimate this to predict the next observation, given the observations so far. Since all the observations are identical, this is the same calculation as repeatedly applying a Bayes pass to a single observation, and the probability that the next observation follows that pattern cannot decrease. In fact, it must increase unless our distribution on the possible parameter values, given the observations so far, produces the same prediction for almost all parameter values given positive weight.
As an example consider a hidden Markov model with a single parameter, initially set to the difference between two random variables, both Poisson with parameter 1. To produce each observation, we look at the value of the parameter. If it is below zero, we increment the value of the parameter by another sample from a Poisson random variable and output 0. If it is above zero, we decrement the parameter by the value of a random sample from the same distribution and output 1. If it is equal to zero we toss a coin and then either output 0 and increment or output 1 and decrement.
The prior distribution on our hidden parameter puts positive weight on every integer. After seeing a 0 or a 1, we cannot absolutely rule out any integer, because of the perturbation introduced by incrementing or decrementing the hidden variable by a Poisson random variable. Furthermore, the range of values in play produces different predictions for the next observation. So if we observe an indefinite stream of zeros, the probability that the next observation is also a zero increases at each step. Therefore the process is not Markov of any finite order: it is irretrievably hidden Markov.