Using a Markov Chain to Construct a Tractable Approximation of an Intractable Probability Distribution

Abstract.  Let π denote an intractable probability distribution that we would like to explore. Suppose that we have a positive recurrent, irreducible Markov chain that satisfies a minorization condition and has π as its invariant measure. We provide a method of using simulations from the Markov chain to construct a statistical estimate of π from which it is straightforward to sample. We show that this estimate is ‘strongly consistent’ in the sense that the total variation distance between the estimate and π converges to 0 almost surely as the number of simulations grows. Moreover, we use some recently developed asymptotic results to provide guidance as to how much simulation is necessary. Draws from the estimate can be used to approximate features of π or as intelligent starting values for the original Markov chain. We illustrate our methods with two examples.


1. Introduction
Let π be a probability distribution on a space X that we would like to explore. For example, we might want to know the value of E_π g := ∫_X g(x) π(dx). Suppose that π is intractable in the sense that numerical integration and classical Monte Carlo methods are not viable options for approximating the features of π. Also, assume that we have at our disposal a Markov transition kernel, P(x, dy), that satisfies the usual regularity conditions (see section 2), has π as its invariant probability measure and is easy to simulate. Write the corresponding Markov chain as X = {X_n}_{n=0}^∞. As is now well known, there are many methods for constructing such kernels (see e.g. Liu, 2001; Robert & Casella, 2004). A Markov chain Monte Carlo (MCMC) solution to the problem is to estimate E_π g using ḡ_n := n^{−1} Σ_{i=0}^{n−1} g(X_i), which converges almost surely to E_π g no matter what the distribution of X_0.
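As a concrete illustration of the ergodic-average estimator described above, here is a minimal sketch (our own toy construction, not from the paper): a random-walk Metropolis chain targeting a standard normal distribution, with the time average of g(x) = x² converging to E_π g = 1 regardless of the starting value.

```python
import numpy as np

def rw_metropolis_avg(g, n, scale=2.4, x0=0.0, seed=0):
    """Run a random-walk Metropolis chain with a N(0, 1) target and return
    the ergodic average g_bar_n = n^{-1} * sum_{i=0}^{n-1} g(X_i)."""
    rng = np.random.default_rng(seed)
    x, total = x0, 0.0
    for _ in range(n):
        total += g(x)
        y = x + scale * rng.standard_normal()
        # accept with probability min(1, pi(y) / pi(x)) for pi = N(0, 1)
        if np.log(rng.random()) < 0.5 * (x * x - y * y):
            x = y
    return total / n

# the average converges almost surely to E_pi[g] = 1 for g(x) = x^2
print(rw_metropolis_avg(lambda x: x * x, n=100_000))
```

Even with a poor starting value the average settles near 1 for moderate chain lengths; the bias and variance issues discussed next concern exactly this kind of estimate.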
As the complexity of π precludes the use of classical Monte Carlo methods, it will also be difficult, if not impossible, to start the Markov chain in stationarity (by drawing X_0 ∼ π). Hence, the estimate ḡ_n will be based on a sequence of random variables that are neither independent nor identically distributed. Two important consequences of this are that ḡ_n is a biased estimate of E_π g and that variance estimation cannot be based on standard techniques for independent and identically distributed (i.i.d.) data. Note that these two problems do not surface when using classical Monte Carlo methods based on i.i.d. draws from π.
Typically, the chain is started by setting X_0 = x, where x is just some point from which it is convenient to start the simulation. Let P^n(x, ·) represent the distribution of X_n given X_0 = x. The basic Markov chain theory underlying MCMC implies that ‖π(·) − P^n(x, ·)‖ ↓ 0, where ‖·‖ denotes the total variation norm. Thus, the marginal distribution of each successive X_n is closer to π than the previous one. Often, in order to reduce the effect of using something other than π as the starting distribution, the first b, say, simulated values are 'thrown out' and only the values of X_b, X_{b+1}, . . . are used to explore π. For example, instead of using ḡ_n to estimate E_π g, we would use n^{−1} Σ_{i=b}^{n+b−1} g(X_i). This practice is known as burn-in. Aside from reducing bias, burn-in may lead to improved performance of variance estimation techniques whose derivations are based on the assumption that the underlying stochastic process is stationary. Ideally, one would like to choose the amount of burn-in by calculating b such that ‖π(·) − P^b(x, ·)‖ < δ, where δ > 0 is some predetermined constant. Unfortunately, the methods that are currently available for doing this require the user to perform some potentially 'difficult theoretical analysis' (Fill et al., 2000) of the Markov chain before they can be applied (Meyn & Tweedie, 1994; Rosenthal, 1995a; Roberts & Tweedie, 1999; Douc et al., 2004; Baxendale, 2005). To be specific, minorization and drift conditions must be established for the Markov chain. See Jones & Hobert (2001) for a simple introduction to these concepts.
Hobert & Robert (2004) presented an alternative solution to the problem described above, which, unfortunately, also requires minorization and drift conditions. These authors show that if P satisfies a minorization condition of the form P(x, ·) ≥ ε I_C(x) m(·), where ε > 0, C ⊆ X and m(·) is a measure on X, then π can be represented as

π(A) = Σ_{t=1}^∞ p_t Q_t(A),   (1)

where each Q_t(·) is a probability measure (on the same space as π) and {p_t}_{t=1}^∞ is a sequence of non-negative numbers that sum to 1. The Q_t's and p_t's, which are formally defined in section 2, are associated with the hitting times on an accessible atom introduced via the splitting construction of Athreya & Ney (1978) and Nummelin (1978).
Representation (1) is appealing from a simulation point of view because it reveals the potential for drawing from π by randomly drawing an element from the set {Q_1, Q_2, Q_3, . . .} according to the probabilities p_1, p_2, p_3, . . . and then making an independent random draw from the chosen Q_t. Of course, the first part of this recipe is equivalent to simulating a discrete random variable, call it T, whose mass function is given by Pr(T = t) = p_t for t = 1, 2, 3, . . .. We shall see later that drawing from Q_t(·) is simple. The challenge is simulating T.
One situation where the distribution of T is simple is when C = X. In this case, the Markov chain X is uniformly ergodic and p_t = ε(1 − ε)^{t−1}; that is, the p_t's are geometric probabilities. It is then easy to use (1) to make i.i.d. draws from π. This fact has been used either directly or indirectly by many authors including Asmussen et al. (1992), Murdoch & Green (1998), Breyer & Roberts (2001) and Wilson (2000). Unfortunately, there is no known method for simulating T outside of the uniformly ergodic case. This is important because most Markov chains underlying practically relevant MCMC algorithms are not uniformly ergodic. [Jones & Hobert (2004) call an MCMC algorithm practically relevant when the stationary distribution is complex enough that i.i.d. sampling is not straightforward.] Suppose now that C is a proper subset of X. Let {p̃_t}_{t=1}^∞ be a second sequence of non-negative numbers that sum to 1 and let T̃ denote the corresponding discrete random variable. Consider an approximation to π of the form π̃(A) = Σ_{t=1}^∞ Q_t(A) p̃_t. If one can simulate T̃, then one can make draws from π̃. Furthermore, it is easy to see that the total variation distance between π and π̃ satisfies

‖π − π̃‖ ≤ Σ_{t=1}^∞ |p_t − p̃_t|.

Hobert & Robert (2004) show how to use a geometric drift condition on the Markov chain X to construct a sequence {p̃_t}_{t=1}^∞ such that Σ_{t=1}^∞ |p_t − p̃_t| is arbitrarily small and simulating T̃ is easy. This yields an alternative solution to the burn-in problem. Indeed, one simply constructs π̃ such that ‖π − π̃‖ < δ and then π̃ is used as the starting distribution; i.e. X_0 ∼ π̃.
In this paper, we consider the more realistic situation where a geometric drift condition is not available. Our main contribution is a method of using simulations of the Markov chain X to build an explicit sequence {p̂_t}_{t=1}^∞ in such a way that Σ_{t=1}^∞ |p_t − p̂_t| (and hence ‖π − π̂‖) is small with high probability. Armed with the numbers {p̂_t}_{t=1}^∞ and the ability to simulate from Q_t, we can make i.i.d. draws from π̂. While we cannot say for certain that π̂ is within δ of π in total variation, taking X_0 ∼ π̂ is a much more rigorous method of doing burn-in than the typical ad hoc method, which involves an essentially subjective choice of b. Of course, if the Markov chain underlying the MCMC algorithm is very complicated, then establishing a viable minorization condition may be difficult.
The rest of the paper is organized as follows. The mixture representation of π is developed in section 2. The approximation, π̂, is described in section 3. In section 4, we consider a toy example where π is a known univariate exponential distribution. This enables us to compare our results with the truth. In section 5, we apply our method to a practically relevant MCMC algorithm. Finally, section 6 contains some discussion of the strengths and limitations of our method.
2. The mixture representation of π

Let X = {X_n}_{n=0}^∞ be a Markov chain on a general state space (X, B(X)) with Markov transition kernel P(x, dy). Let P^n(x, dy) denote the n-step Markov transition kernel corresponding to P; that is, for i ∈ {0, 1, 2, . . .}, x ∈ X and a measurable set B, P^n(x, B) = Pr(X_{n+i} ∈ B | X_i = x). We assume throughout that X is π-irreducible and positive Harris recurrent, where π is the invariant probability measure. Many Markov chains that form the basis of an MCMC algorithm satisfy these basic properties.
Our main additional assumption is that X satisfies a one-step minorization condition; that is, we assume that we have a function s : X → [0, 1] satisfying ∫_X s(x) π(dx) > 0 and a probability measure m on B(X) such that for all x ∈ X and all measurable B,

P(x, B) ≥ s(x) m(B).   (2)

This is a more general minorization condition than the one considered in Hobert & Robert (2004). Indeed, these authors assume that s(x) has the specific form ε I_C(x). There are practical advantages to working with the more general form of minorization (Jones & Hobert, 2001). While π-irreducibility and positive Harris recurrence do not together imply the existence of a one-step minorization condition, they do imply that a k-step minorization holds; that is, they guarantee the existence of a k ∈ N := {1, 2, . . .} such that P^k(x, ·) ≥ s(x) m(·), where s and m are as described above. For a given chain, if it is not possible to establish (2), but P^k(x, ·) ≥ s(x) m(·) can be established for some k ∈ {2, 3, 4, . . .}, then we simply consider the Markov chain corresponding to P^k to be the chain of interest. This is legitimate as π is still invariant for P^k and the k-step chain inherits the basic properties from X. Unfortunately, in most standard MCMC settings, P^k will not be available in closed form and this will make the application of our methods more challenging.
It is often straightforward to establish (2). It is especially simple when X is countable, as we can just fix a point x̃ ∈ X and take s(x) = I(x = x̃) and m(·) = P(x̃, ·). Mykland et al. (1995) describe general methods for establishing (2) in the context of standard MCMC algorithms such as the Gibbs sampler and independence and random walk versions of the Metropolis-Hastings-Green algorithm. We also note that the simulated tempering method of Geyer & Thompson (1995) and Marinari & Parisi (1992) often naturally induces a minorization (see e.g. Brooks et al., 2004; Møller & Nicholls, 2005). See also Brockwell & Kadane (2005).
The minorization allows for the fundamental splitting construction of Nummelin (1978, 1984). Specifically, we can use (2) to write P(x, ·) as a two-component mixture

P(x, dy) = s(x) m(dy) + [1 − s(x)] R(x, dy),   (3)

where R(x, dy) := [P(x, dy) − s(x) m(dy)]/[1 − s(x)]. If X is the basis of an MCMC algorithm, then presumably there is a convenient method of simulating from P(x, ·). The mixture representation (3) provides the following alternative method: given X_n = x, draw d_n ∼ Bernoulli(s(x)); if d_n = 1, draw X_{n+1} ∼ m(·), and otherwise draw X_{n+1} ∼ R(x, ·). In fact, this is a recipe for simulating the split chain, X′ = {(X_n, d_n)}_{n=0}^∞, which lives on the space X × {0, 1} and is such that, marginally, the sequence {X_n}_{n=0}^∞ has the same distribution as the original chain, X. An important property of X′ is that X × {1} is an accessible atom and the (random) times at which X′ enters X × {1} are regeneration times when the chain stochastically restarts; i.e. the next value has distribution m.
[See Nummelin (1984, Section 4.4) for a thorough development of X′, including expressions for its transition kernel and stationary distribution.] As a practical matter, simulating the split chain in the manner described above may be troublesome, as drawing from R(x, dy) can be prohibitively difficult. However, there is a simple method for avoiding this. Specifically, Mykland et al. (1995) suggest simulating from the distribution of X_{i+1} | X_i using the sampler at hand and then 'filling in' d_i by simulating from the distribution of d_i | X_i, X_{i+1} with

Pr(d_i = 1 | X_i, X_{i+1}) = s(X_i) q(X_{i+1}) / k(X_{i+1} | X_i),   (4)

where q(·) and k(·|x) are densities corresponding to m(·) and P(x, ·). We use this approach in our examples. The development of the split chain is now used to derive (1). Define τ to be the first return time to the atom; that is,

τ := min{n ≥ 1 : d_n = 1}.

Also, let Pr*(·) and E*(·) denote probability and expectation conditional on d_0 = 1 (with X_0 chosen arbitrarily); i.e. X_1 ∼ m(·). As X′ is positive recurrent, it follows that E*(τ) < ∞ (Meyn & Tweedie, 1993, Chapter 10). Consequently, we can define a discrete random variable, T, with support N and probabilities given by

p_t = Pr*(τ ≥ t) / E*(τ),  t ∈ N.   (5)

It is important to recognize that, in general, p_t ≠ Pr*(τ = t).
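To make the splitting concrete, here is a minimal sketch with a toy chain of our own (not one of the paper's examples): a two-state chain minorized by s(x) = 0.5·I(x = 0) and m(·) = P(0, ·), with each bell variable d_i filled in via the Mykland et al. conditional probability s(X_i) m(X_{i+1}) / P(X_i, X_{i+1}). Starting from the atom (d_0 = 1, so X_1 ∼ m) and counting iterations to the first regeneration yields i.i.d. copies of τ.

```python
import numpy as np

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
m = P[0]                      # m(.) = P(0, .)
s = np.array([0.5, 0.0])      # s(x) = 0.5 * I(x = 0), so P(x, .) >= s(x) m(.)

def draw_tau(rng):
    """One i.i.d. copy of tau: start X_1 ~ m (i.e. d_0 = 1) and count
    iterations until the first regeneration (first n >= 1 with d_n = 1)."""
    x = rng.choice(2, p=m)    # X_1 ~ m
    t = 1
    while True:
        x_next = rng.choice(2, p=P[x])
        # fill in d_t | X_t, X_{t+1} as in the Mykland et al. recipe
        if rng.random() < s[x] * m[x_next] / P[x, x_next]:
            return t
        x, t = x_next, t + 1

rng = np.random.default_rng(0)
taus = [draw_tau(rng) for _ in range(50_000)]
# for this chain E*(tau) = 1 / (0.5 * pi(0)) = 3.5, since pi = (4/7, 3/7)
print(np.mean(taus))
```

The sample mean of the τ's settles near 3.5, matching the identity E*(τ) = 1/∫ s dπ for this toy chain.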
Remark 1. The random variable T is related to the (discrete) delayed renewal process S_n = Σ_{i=0}^n Y_i, where Y_1, Y_2, . . . are i.i.d. copies of τ and Y_0 is an independent, non-negative discrete random variable. Indeed, if Y_0 =_d T − 1, then S_n is an equilibrium renewal process; i.e. the equilibrium distribution is that of T − 1 (Ross, 1983, p. 76).
Now, for any t ∈ N and any measurable B, we define

Q_t(B) = Pr*(X_t ∈ B | d_1 = · · · = d_{t−1} = 0);   (6)

i.e. Q_t is the conditional distribution of X_t given that (X_0, d_0) ∈ X × {1} and that there are no regenerations in the split chain before time t. We now formally state an extension of Hobert & Robert's (2004) theorem 1.
Theorem 1. Let X be a Markov chain on a general state space (X, B(X)) with Markov transition kernel P.
Assume that X is π-irreducible and positive Harris recurrent, where π is the invariant probability measure. Assume further that (2) holds. Then for any B ∈ B(X), we have

π(B) = Σ_{t=1}^∞ p_t Q_t(B),   (7)

where p_t and Q_t are defined in terms of the split chain at (5) and (6).
Proof. See the proof of Hobert & Robert's (2004) theorem 1, which still goes through with the more general minorization condition.
Remark 2. Asmussen et al. (1992) provide general methods for constructing stationary versions of certain regenerative stochastic processes. Representation (7) can also be obtained by applying their methods to the split chain, which possesses the necessary regenerative properties.
Equation (7) demonstrates the possibility of simulating a random variable from π using a sequential sampling mechanism. That is, a draw from π can be made by first drawing T, call the result t, and then making an independent draw from Q_t. In fact, it is always possible to simulate from Q_t using a simple accept-reject algorithm that we call algorithm I. All that is required is the ability to simulate the split chain. Note that Q_1(·) ≡ m(·), so in the statement of the algorithm it is assumed that t ≥ 2.
Algorithm I:
1. Take (x_0, d_0) ∈ X × {1} and simulate the split chain for t iterations.
2. If d_1 = · · · = d_{t−1} = 0, accept x_t as a draw from Q_t; otherwise, return to step 1.
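A sketch of algorithm I on a toy two-state split chain of our own (s(x) = 0.5·I(x = 0), m(·) = P(0, ·); the same kind of construction as the countable-case minorization discussed in section 2): repeatedly simulate the split chain from the atom and accept X_t as a draw from Q_t only when no regeneration occurs before time t.

```python
import numpy as np

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
m = P[0]                      # m(.) = P(0, .)
s = np.array([0.5, 0.0])      # minorization: P(x, .) >= s(x) m(.)

def draw_from_Q(t, rng):
    """Algorithm I: accept-reject draw from Q_t (assumes t >= 2)."""
    while True:                   # step 1: simulate the split chain for t steps
        x = rng.choice(2, p=m)    # d_0 = 1, so X_1 ~ m
        ok = True
        for _ in range(t - 1):    # fill in d_1, ..., d_{t-1}
            x_next = rng.choice(2, p=P[x])
            if rng.random() < s[x] * m[x_next] / P[x, x_next]:
                ok = False        # a regeneration occurred before time t
                break
            x = x_next
        if ok:                    # step 2: accept X_t as a draw from Q_t
            return x
```

For this chain the sub-probability kernel K(a, b) = P(a, b) − s(a)m(b) gives Q_3(0) = 0.24175/0.4675 ≈ 0.517, and the accepted draws match that frequency.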
Assume now that s(x) in (2) is not bounded away from zero. This will be the case whenever the Markov chain X is not uniformly ergodic. There is no known method for simulating T in this case and hence we focus on using (7) to build an approximation of π from which it is straightforward to sample. Our approximation takes the form

π̂(A) = Σ_{t=1}^∞ p̂_t Q_t(A),   (8)

where {p̂_t}_{t=1}^∞ are non-negative numbers that sum to 1. It is easy to show that ‖π − π̂‖ ≤ Σ_{t=1}^∞ |p_t − p̂_t|; that is, the total variation distance between the distributions π and π̂ is bounded above by twice the total variation distance between the distributions of T and T̂, where T̂ is the discrete random variable on N with probabilities {p̂_t}_{t=1}^∞.
[Note that we are using the version of total variation which does not have the factor of 2; that is, ‖π − π̂‖ is defined to be the supremum over measurable B of |π(B) − π̂(B)|.] In the next section we will show that, given any δ > 0, it is possible to construct a sequence {p̂_t}_{t=1}^∞ such that Σ_{t=1}^∞ |p_t − p̂_t| < δ with high probability.

3. Approximating π
The key to our argument is that making i.i.d. draws from the distribution of τ is trivial; just take X_1 ∼ m(·), run the split chain, and count the number of iterations until the first regeneration. This unlimited supply of i.i.d. copies of τ can be used to construct a statistical estimate of p_t = Pr*(τ ≥ t)/E*(τ). Indeed, let τ_1, . . ., τ_m denote a random sample of size m and let F_m(t) denote the corresponding empirical distribution function. We estimate p_t with

p̂_t = [1 − F_m(t − 1)] / τ̄,

where τ̄ is the sample mean. As Σ_{t=1}^∞ p̂_t = 1, {p̂_t}_{t=1}^∞ is a legitimate mass function on N from which we can sample.
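In code, the estimator p̂_t = [1 − F_m(t − 1)]/τ̄ can be formed directly from a sample of τ's; a small sketch (the function name is ours):

```python
import numpy as np

def phat(taus):
    """Estimated mass function (phat_1, ..., phat_tmax) from i.i.d. copies
    of tau: phat_t = [1 - F_m(t - 1)] / tau_bar.  Terms beyond the observed
    maximum are zero, and the returned vector sums exactly to 1."""
    taus = np.sort(np.asarray(taus))
    t = np.arange(1, taus[-1] + 1)
    # F_m(t - 1) = proportion of the sample <= t - 1
    F_prev = np.searchsorted(taus, t - 1, side="right") / taus.size
    return (1.0 - F_prev) / taus.mean()

p = phat([1, 1, 2, 3])
print(p)   # mass function on {1, 2, 3}
```

A draw of T̂ is then just `rng.choice(np.arange(1, p.size + 1), p=p)`, after which algorithm I supplies the draw from Q_{T̂}.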
We now use asymptotic arguments to show that {p̂_t}_{t=1}^∞ enjoys a type of 'strong consistency' and to get a handle on the error of {p̂_t}_{t=1}^∞. These results allow us to develop a method of choosing an appropriate value for m. For obvious reasons, we use Σ_{t=1}^∞ |p̂_t − p_t| as our measure of error. Let G_1 and G_2 denote two univariate distribution functions. The L_1-Wasserstein distance between the probability distributions corresponding to G_1 and G_2 is defined as (Shorack & Wellner, 1986, Chapter 2)

d_1(G_1, G_2) = ∫_{−∞}^{∞} |G_1(x) − G_2(x)| dx.

The following result shows that (at least asymptotically) {p̂_t}_{t=1}^∞ is a reasonable estimate of the mass function of T: letting F denote the distribution function of τ, one can show that Σ_{t=1}^∞ |p̂_t − p_t| ≤ 2 d_1(F_m, F), and d_1(F_m, F) → 0 almost surely as m → ∞.
Obviously, no matter how large m is, we can never say for certain that d_1(F_m, F) < δ. However, we can use asymptotic results to make statements like Pr[d_1(F_m, F) < δ] ≈ 1 − α. Indeed, Del Barrio et al. (1999) have recently described the first-order asymptotics of the L_1-Wasserstein distance between the empirical and true distribution functions. In particular, their results imply that if

Σ_{t=1}^∞ √(F(t)[1 − F(t)]) < ∞,   (9)

then, as m → ∞,

√m d_1(F_m, F) →_d Σ_{t=1}^∞ |B(F(t))|,   (10)

where B(s), 0 ≤ s ≤ 1, denotes a Brownian bridge process. Condition (9) is very close to a finite second moment condition. Indeed, (9) implies that E*(τ²) < ∞, while if E*(τ^{2+ε}) < ∞ for some ε > 0 then (9) holds.
We now explain how (10) can be used to come up with a reasonable value of m. Suppose that τ_1, . . ., τ_{m_0} is an initial sample of i.i.d. τ's with corresponding empirical distribution function F_{m_0}. Let u_{m_0} denote the number of unique values in this sample. Also, let L denote the random variable Σ_{t=1}^∞ |B(F_{m_0}(t))|, which, if m_0 is large, should have a distribution quite similar to that of Σ_{t=1}^∞ |B(F(t))|. Simulating the random variable L is quite simple. Indeed, all that is required is u_{m_0} values of one realization of standard Brownian motion in (0, 1), which can be obtained sequentially using only univariate normal draws. Hence, it is easy to find ĉ such that Pr[L < ĉ] ≈ 1 − α. Then if we take m = 4ĉ²/δ², we can say that Pr[2 d_1(F_m, F) < δ] ≈ 1 − α and hence that ‖π − π̂‖ < δ with probability approximately equal to 1 − α.
If (9) fails then (10) fails (Del Barrio et al., 1999) and our method for choosing m is not applicable. This situation is analogous to one where we have i.i.d. random variables W_1, W_2, . . . such that E|W_1|^p is finite when p = 1 but is infinite when p = 2. In this case, n^{−1} Σ_{i=1}^n W_i can be used to estimate E(W_1) since the strong law holds, but the central limit theorem (CLT) cannot be used to choose an appropriate value of n. However, condition (9) is closely related to the mixing properties of the Markov chain and is a weak condition. In fact, if (9) were to fail, the Markov chain would probably not mix sufficiently well to be of any practical use anyway.
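The recipe above can be sketched as follows (our own implementation choices; `choose_m` is a hypothetical helper): evaluate one realization of a Brownian bridge B(u) = W(u) − uW(1) at the distinct values of F_{m_0}(t), form L = Σ_t |B(F_{m_0}(t))|, take ĉ as the (1 − α) quantile of the simulated L values, and return m = ⌈4ĉ²/δ²⌉.

```python
import math
import numpy as np

def choose_m(taus, delta=0.1, alpha=0.1, nsim=5000, seed=0):
    """Sample size m such that ||pi - pi_hat|| < delta with approximate
    probability 1 - alpha, based on an initial sample of taus."""
    rng = np.random.default_rng(seed)
    taus = np.sort(np.asarray(taus))
    t = np.arange(1, taus[-1] + 1)
    F = np.searchsorted(taus, t, side="right") / taus.size   # F_{m0}(t)
    u = F[:-1]                            # F at the max is 1 and B(1) = 0
    uniq, counts = np.unique(u, return_counts=True)
    grid = np.append(uniq, 1.0)           # W(1) is needed to form the bridge
    step_sd = np.sqrt(np.diff(np.concatenate(([0.0], grid))))
    # sequential Brownian motion at the grid points, nsim realizations
    W = np.cumsum(rng.standard_normal((nsim, grid.size)) * step_sd, axis=1)
    B = W[:, :-1] - np.outer(W[:, -1], uniq)      # B(u) = W(u) - u W(1)
    L = (np.abs(B) * counts).sum(axis=1)          # L = sum_t |B(F(t))|
    c_hat = np.quantile(L, 1.0 - alpha)
    return math.ceil(4.0 * c_hat ** 2 / delta ** 2)
```

For a degenerate τ sample the simulated L is identically zero and `choose_m` returns 0; for any non-degenerate sample it returns a positive m that grows like δ^{−2}.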
Remark 3. It appears that a weak form of polynomial ergodicity (of X) is enough to guarantee that E*(τ^{2+ε}) < ∞ for some ε > 0 (Jarner & Roberts, 2002), but we are unaware of any clean statements in the literature connecting polynomial ergodicity and the moments of τ. On the other hand, if X is geometrically ergodic, then τ has a moment generating function (see e.g. Hobert et al., 2002).
In the next two sections, we illustrate the construction of π̂ with toy and realistic examples, respectively.

4. A toy example
Suppose that π(x) = e^{−x} I_{R+}(x). This distribution is clearly not intractable in any sense, but using a simple, univariate distribution allows us to evaluate our approximations by comparing them directly with the truth. The Markov chain we consider is the independence Metropolis sampler with an Exp(θ) proposal; that is, the proposal density is q(x) = θ exp(−θx) I_{R+}(x). The chain evolves as follows: given X_n = x, draw y ∼ Exp(θ) and, independently, u ∼ Uniform(0, 1). If u < exp{(x − y)(1 − θ)}, then set X_{n+1} = y; otherwise set X_{n+1} = x. The case θ = 1 is not of interest to us since in this case the algorithm yields i.i.d. draws from the target distribution. Results in Mengersen & Tweedie (1996) can be used to show that when 0 < θ < 1, the chain is uniformly ergodic and hence E*(τ^{2+ε}) < ∞ for any ε > 0. Moreover, when θ > 1 it is easy to verify the conditions of theorem 5.3 in Jarner & Roberts (2002), which shows that this sampler is polynomially ergodic, but apparently this is not sufficient to guarantee E*(τ²) < ∞. Finding a minorization condition is simple. Let w(x) = θ^{−1} exp[x(θ − 1)]. Applying results in Mykland et al. (1995, p. 236) shows that (2) holds for any a > 0. Mykland et al. also give an expression for the probability of regeneration that does not require the normalizing constant for the density of m. We constructed three approximations to π: the first was based on a uniformly ergodic sampler with θ = 0.75; the second was based on a subgeometric sampler with θ = 1.5; and the third used a subgeometric sampler with θ = 2.5. [Actually, we suspect that (9) fails in the θ = 2.5 case.] In all cases, after some trial and error, we chose a = 1.5. For each value of θ, an initial sample of m_0 = 2.5 × 10^5 i.i.d. τ's was drawn. The results are reported in Table 1, which gives the number of unique values observed (u_{m_0}), the maximum value observed (max) and the 99th percentile.
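The toy sampler itself is a few lines of code; a sketch of the chain dynamics only (our own implementation, with a seeded generator; it does not include the regeneration bookkeeping):

```python
import numpy as np

def indep_mh(theta, n, x0=1.0, seed=0):
    """Independence Metropolis chain with Exp(1) target and Exp(theta)
    proposal: accept y with probability min{1, exp[(x - y)(1 - theta)]}."""
    rng = np.random.default_rng(seed)
    x = np.empty(n)
    x[0] = x0
    for i in range(1, n):
        y = rng.exponential(1.0 / theta)   # proposal density theta*exp(-theta*y)
        if rng.random() < np.exp((x[i - 1] - y) * (1.0 - theta)):
            x[i] = y
        else:
            x[i] = x[i - 1]
    return x

chain = indep_mh(theta=0.75, n=200_000)
print(chain.mean())   # the Exp(1) target has mean 1
```

With θ = 0.75 (the uniformly ergodic case) the chain mixes quickly and its time averages are close to the Exp(1) moments; larger θ produces the slower, subgeometric behaviour described above.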
Then, for each value of θ, we simulated 5 × 10^4 values of L and the results are given in Table 2. In particular, Table 2 gives the number (m) of τ's necessary to ensure that π̂ is within δ of the stationary distribution in total variation distance with approximate probability 1 − α. The values of m in Table 2 clearly reflect the fact that the sampler enjoys superior mixing for smaller values of θ. For each value of θ, we constructed π̂ using the values of m given in Table 2. Then, for each value of θ, we compared a density estimate based on a random sample of size 5 × 10^4 from π̂ with the Exp(1) density, and the two curves were essentially coincidental. The results suggest that π̂ is an excellent approximation to π even when θ = 2.5.

5. Hierarchical linear mixed models
Consider the usual frequentist general linear mixed model

Y = Xβ + Zu + e,   (11)

where Y is an n × 1 vector of observations, X is a known n × p matrix, Z is a known n × q matrix, β is a p × 1 vector of parameters, u is a q × 1 vector of random variables and e is an n × 1 vector of residual errors. We assume that X is of full column rank so that X^T X is invertible. A Bayesian version of this model may be expressed as the following conditionally independent hierarchical model with as yet unspecified priors f(R) and f(D). Here β_0 and B^{−1} are assumed to be known. The posterior density of (β, u, R, D) given the data, y, is characterized by

π(β, u, R, D | y) ∝ f(y | β, u, R, D) f(β | u, R, D) f(u | D, R) f(R) f(D).   (12)

We assume that the priors on R and D are such that the resulting posterior (12) is proper. Even if proper conjugate priors are chosen, the integrals required for inference through this posterior cannot be evaluated in closed form. Thus, exploring the posterior in order to make inferences might require MCMC.

5.1 A block Gibbs sampler and a minorization condition
In this section, we consider a block Gibbs sampler with components R, D and ξ = (u^T, β^T)^T. The full conditional densities π(R | ξ, y) and π(D | ξ, y) follow directly from (12), and the density π(ξ | R, D, y) is a (p + q)-variate normal with mean ξ_0 and covariance matrix Ξ^{−1}. Consider the block Gibbs sampler corresponding to the following updating scheme:

(D′, R′, ξ′) → (D, R′, ξ′) → (D, R, ξ′) → (D, R, ξ).

Conditional on ξ, D and R are independent and hence the order in which they are updated is irrelevant. That is, we are effectively dealing with a two-variable Gibbs sampler. Suppressing dependence on the data, the transition density is given by

k(D, R, ξ | D′, R′, ξ′) = π(D | ξ′) π(R | ξ′) π(ξ | R, D).

We now develop a minorization condition of the form (2). Fix sets M_D and M_R and consider the infima

inf_{D ∈ M_D} π(D | ξ′)  and  inf_{R ∈ M_R} π(R | ξ′).   (14)

Then the minorization condition follows by taking s(D′, R′, ξ′) proportional to the product of these infima and m(D, R, ξ) proportional to I_{M_D}(D) I_{M_R}(R) π(ξ | R, D). Let S denote the space in which R lives; that is, the set of points in R^{n(n+1)/2} corresponding to symmetric, positive definite n × n matrices. Note that M_R must be chosen so that M_R ∩ S has positive measure. Otherwise, the second infimum in (14) will be zero and, from a practical standpoint, R will never land in M_R. Similar comments apply to the choice of M_D. Using equation (4), it is easy to see that when R ∈ M_R and D ∈ M_D the probability of regeneration is given by

[inf_{D ∈ M_D} π(D | ξ′)] [inf_{R ∈ M_R} π(R | ξ′)] / [π(D | ξ′) π(R | ξ′)],   (15)

and that it is zero otherwise. Thus we have to calculate the infima in (14) and plug them into (15). Let a_{1ij} ≤ a_{2ij} for i = 1, . . ., q and j = 1, . . ., q be constants and define M_D = {M_{q×q} : a_{1ij} ≤ m_{ij} ≤ a_{2ij}}; the first infimum in (14) can then be evaluated in closed form. Similarly, let b_{1ij} ≤ b_{2ij} for i = 1, . . ., n and j = 1, . . ., n be constants and define M_R = {M_{n×n} : b_{1ij} ≤ m_{ij} ≤ b_{2ij}}; a calculation similar to the one above gives a closed form for inf_{R ∈ M_R} π(R | ξ′), and substituting the two infima into (15) gives the probability of regeneration. We end this section by noting that the minorization condition established above is quite different from the one derived in Jones & Hobert (2004) for the simpler Bayesian hierarchical version of the one-way random effects model.
One major difference is that the one-way model contains only two univariate variance components, whereas the general linear mixed model considered here contains two unstructured covariance matrices. Moreover, Jones & Hobert (2004) established a minorization condition of the form P(x, ·) ≥ ε I_C(x) m(·), while the one we have derived here is of the form P(x, ·) ≥ s(x) m(·) with an s that cannot be expressed as a constant multiple of an indicator.

5.2 A numerical example
In this section, we identify a specific example of the model (11), simulate some data from that model and then use the block Gibbs sampler described above to form an approximation of the resulting intractable posterior density.
Suppose that p = 1, so that X = (x_1, . . ., x_n)^T ∈ R^n, and that q = n with Z = I_n. Fix β_0 = 0 and B^{−1} = 1. Assume that R = λ_R^{−1} I_n and D = λ_D^{−1} I_n, where λ_R^{−1} and λ_D^{−1} are scalar variance components whose reciprocals are assigned the following conjugate priors:

λ_R ∼ Gamma(r_1, r_2)  and  λ_D ∼ Gamma(d_1, d_2).

Set ξ = (u^T, β)^T and λ = (λ_D, λ_R)^T. We simulated data according to this model with n = 6, r_1 = r_2 = d_1 = d_2 = 1 and the vector of covariates drawn from the N(0, I_6) distribution. The resulting (x, y) pairs were: (−0.65201, 3.05577), (0.46053, −0.84096), (−0.39088, −3.21066), (−0.64953, −0.47085), (−0.65276, 2.23286), (0.75399, −0.02815). We will construct π̂ corresponding to the posterior that results from these data and from setting r_1 = d_1 = 1 and r_2 = d_2 = 2. Recall that the block Gibbs sampler from the previous section uses the sampling scheme (λ′, ξ′) → (λ, ξ′) → (λ, ξ). The full conditionals for the precision parameters are gamma densities, and ξ | λ_R, λ_D, y ∼ N_7(ξ_0, Ξ^{−1}). While some work has been done analysing block Gibbs samplers for simpler hierarchical linear models (Rosenthal, 1995b; Hobert & Geyer, 1998; Jones & Hobert, 2004), none of these results apply to our block Gibbs sampler. That is, little is known about the mixing properties of our Markov chain. We simply assume that it satisfies E*(τ^{2+ε}) < ∞ for some ε > 0. To use the minorization condition developed in the previous section, we must fix a point ξ̃ and sets M_D = [a_1, a_2] and M_R = [b_1, b_2], where 0 < a_1 < a_2 and 0 < b_1 < b_2. To this end, we ran the block Gibbs sampler for 1 × 10^4 iterations starting from ξ = ȳ1, where 1 is a vector of ones. Let ũ_1, . . ., ũ_6, β̃, λ̃_D, λ̃_R be the estimated posterior expectations of the associated parameters. We set ξ̃ = (ũ_1, . . ., ũ_6, β̃)^T, [a_1, a_2] = λ̃_D ± w s_{λ_D} and [b_1, b_2] = λ̃_R ± w s_{λ_R}, where w > 0 and s_{λ_D}, s_{λ_R} are the usual sample standard deviations of the sampled λ_D's and λ_R's, respectively. Note that the choice of w controls the trade-off between the size of M_D and M_R and the magnitude of the probability of regeneration. In our example, we used w = 1.5.
We simulated an initial sample of m_0 = 1 × 10^4 i.i.d. τ's. The number of unique values in the sample was 209, the maximum was 368 and the 99th percentile was 155. We then simulated 1 × 10^4 values of L. Using this sample together with the formula from section 3 leads to the conclusion that a random sample of 1,623,331 τ's is necessary to ensure that π̂ is within 0.10 of the stationary distribution in total variation distance with approximate probability 0.90. (The value of ĉ was 63.7.) Using these results we constructed π̂, subsequently simulated 1 × 10^4 i.i.d. draws from it, and estimated the marginal density function of λ_R using the density command available in the R software package (R Development Core Team, 2004). This density estimate is shown in Fig. 1. Unlike the toy example studied in the previous section, here we cannot compare our density estimate with the truth, as we do not have the marginal posterior density at our disposal. Instead, we ran 1 × 10^4 independent Gibbs chains, each started from ξ = ȳ1, for 1 × 10^4 iterations and collected the last state from each chain to form an i.i.d. sample. We then used this i.i.d. sample to estimate the marginal posterior density and this estimate is also shown in Fig. 1. (Our experience analysing similar Markov chains suggests that the block Gibbs sampler is probably quite close to stationarity after 1 × 10^4 iterations.) Note that the two density estimates are nearly coincidental. We obtained similar results (not shown) for λ_D and for β.
Suppose that t is a 'large' integer and consider using algorithm I to get a draw from Q_t. Algorithm I is successful, that is, returns a draw from Q_t, only if d_1 = · · · = d_{t−1} = 0. Thus, we must run the split chain over and over again until we get a realization in which there are no regenerations before time t. It may seem as if this could take an impractically large amount of time. However, if large values of T are observed, this suggests that the mass function of T has a heavy tail, which in turn suggests that the split chain is prone to long stretches without a regeneration. Hence, it is not unlikely for a long consecutive string of d's to be 0, and this means that algorithm I will be viable. As an illustration, consider the example described above. The largest value of T that was observed while simulating the random sample of size 1 × 10^4 from π̂ was 333. We performed an experiment in which we used algorithm I to get 10 draws from Q_{333}; the number of iterations of algorithm I that were required ranged from 656 to 19,995, and the entire experiment took only about 15 min on a slow workstation.

6. Discussion
Let P(x, dy) be a Markov transition kernel with invariant probability measure π that satisfies the minorization condition P(x, ·) ≥ s(x) m(·). We have shown how to use simulations from the corresponding split chain to build a strongly consistent statistical approximation to π from which it is easy to sample. Furthermore, we have shown how to take advantage of asymptotic results (that hold under a weak condition on P) to construct the approximation, π̂, in such a way that ‖π − π̂‖ is small with high (asymptotic) probability.
In our view, an important practical use for π̂ is as a starting distribution for the original Markov chain. Suppose it is necessary to make sure that the chain is close to stationarity before sampling begins. The best way to accomplish this is via the exact methods mentioned in section 1, which yield a b such that ‖P^b(x, ·) − π(·)‖ < δ. Unfortunately, the exact methods cannot even be implemented until both drift and minorization conditions have been established for the underlying Markov chain. Our method is less rigorous than the exact method, but it requires far less analysis of the underlying chain. That said, most users of MCMC would not be willing to develop a viable minorization condition just to get a reasonable starting value. However, as we mention in section 2, general minorization conditions are already available for many standard MCMC algorithms. In addition, it is possible to develop minorization conditions for very general models, as in section 5.1.
While our method is less rigorous than the exact method, it has a much firmer theoretical grounding than most ad hoc convergence diagnostics that involve running the chain and observing the behaviour of some univariate statistics. In fact, in some cases, our approximation can be used to improve convergence diagnostics. For example, the widely used Gelman & Rubin (1992) diagnostic involves running independent (parallel) Markov chains whose starting points are drawn from an 'overdispersed starting distribution'. This starting distribution is based on an approximation of π that is a mixture of multivariate normal distributions whose components are centred at the modes of π. Finding the modes of π requires some potentially tedious and time-consuming numerical analysis and, if the target distribution is complex and high dimensional, there is no guarantee of finding all of the important modes. Our approximation to π is an attractive alternative to Gelman & Rubin's (1992) mixture of normals as it requires no direct numerical analysis of π.

[Fig. 1. Estimates of the marginal posterior density of λ_R. The solid and dashed lines are density estimates constructed using samples from π̂ and the Gibbs sampler, respectively. Each is based on a random sample of size 1 × 10^4. The random sample from the Gibbs sampler was constructed by running 1 × 10^4 independent chains (each started from ξ = ȳ1) for 1 × 10^4 iterations and collecting the last state from each chain. Both density estimates were made using the density command in R with the bandwidth parameter set to 0.12.]
It is important to recognize that our method (and, more generally, burn-in) is not a way to 'fix' a poorly mixing Markov chain. Indeed, such chains are not very useful even when started at stationarity. In particular, for chains with good mixing properties, regardless of the starting distribution, n^{−1} Σ_{i=0}^{n−1} g(X_i) converges almost surely to ∫_X g(x) π(dx) and there is a corresponding CLT that can be used to assess Monte Carlo error (Jones, 2004). Unfortunately, the CLT fails to hold when the chain converges too slowly. Moreover, it is well known that if the CLT holds for any initial distribution then it holds for all initial distributions (Meyn & Tweedie, 1993, proposition 17.1.6). Thus, starting a poorly mixing chain at stationarity cannot help it to enjoy a CLT. Generally speaking, a non-stationary chain with good mixing properties is much more useful than a stationary version of a poorly mixing chain.
Although we have not emphasized it, π̂ can also be used to visualize important features of π. The existing MCMC methods for such visualization (Geyer, 1994, 1996; Henderson & Glynn, 2001; Sköld & Roberts, 2003) have problems that are partly due to the fact that they are based on dependent data. For sufficiently small α and δ, our method will produce i.i.d. samples from a distribution that is close to π, thus circumventing any problems due to the use of dependent data. Of course, using a small α and δ will likely require substantial computational resources.