Perfect samplers for mixtures of distributions

Summary. We consider the construction of perfect samplers for posterior distributions associated with mixtures of exponential families and conjugate priors, starting with a perfect slice sampler in the spirit of Mira and co‐workers. The methods rely on a marginalization akin to Rao–Blackwellization and illustrate the duality principle of Diebolt and Robert. A first approximation embeds the finite support distribution on the latent variables within a continuous support distribution that is easier to simulate by slice sampling, but we later demonstrate that the approximation can be very poor. We conclude by showing that an alternative perfect sampler based on a single backward chain can be constructed. This alternative can handle much larger sample sizes than the slice sampler first proposed.


Introduction
Perfect sampling has been developed in recent years to take advantage of Markov chain Monte Carlo (MCMC) algorithms without suffering from the drawback of MCMC methods, namely that the distribution of interest is only the asymptotic distribution of the Markov chain generated. See Dimakos (2001) for an excellent introduction, as well as Wilson (1998), whose Web site http://dimacs.rutgers.edu/~dbwilson/exact.html is constantly updated, and Green and Murdoch (1999) and Møller and Nicholls (1999) for recent statistical applications.
When considering realistic statistical models like those involving finite mixtures of distributions (Titterington et al., 1985), with densities of the form

  $\sum_{i=1}^{k} p_i\, f(x \mid \theta_i),$  (1)

where the $p_i$'s are all positive and the $f(\cdot \mid \theta_i)$ are density functions, MCMC algorithms are necessary for processing the posterior distribution of the parameters $\xi_i = (p_i, \theta_i)$ (see, for example, Celeux et al. (2000)). It is, however, quite a delicate exercise to produce a perfect sampling version, as shown by Hobert et al. (1999), who could only process a mixture like expression (1) when the parameters $\theta_i$ are known and when $k \le 3$.
The reason for this difficulty is that perfect sampling techniques, although not requiring monotonicity structures in the Markov transition, work better under monotonicity assumptions, i.e. the existence of an ordering $\preceq$ on the state space that is preserved by the Markov transition (see Propp and Wilson (1996)). It requires hard work to establish monotonicity in the mixture model. One of the key features of the solution of Hobert et al. (1999) is to exploit the duality principle established by Diebolt and Robert (1994) for latent variable models. In set-ups where the chain of interest $(\xi^{(t)})$ is generated conditionally on a second chain $(Z^{(t)})$ with finite support, the probabilistic properties of $(\xi^{(t)})$ can be derived from the properties of $(Z^{(t)})$, whose finiteness facilitates theoretical study. For instance, if we assume that the component distributions in mixture (1) are from an exponential family, the latent variable that is associated with mixture model (1) and a sample $(X_1, \ldots, X_n)$ is the allocation vector $Z = (Z_1, \ldots, Z_n)$ such that, for $j = 1, \ldots, n$,

  $(X_j, Z_j) \sim p_z\, f(x \mid \theta_z)\, \mathbb{I}_{\{1,\ldots,k\}}(z).$  (2)
This finite latent structure is not of direct practical relevance for perfect sampling, since the support of $Z$ is of size $k^n$ for $k$-component mixtures with $n$ observations, and thus precludes a direct implementation of the coupling-from-the-past (CFTP) algorithm (Propp and Wilson, 1996). However, monotonicity structures can often be observed on the $(Z^{(t)})$ chain. This paper extends the result of Hobert et al. (1999) to the case of general finite mixtures of exponential family distributions, under conjugate priors, by proposing a different and more generic approach to the problem. The technique used here relies on the facts that, under conjugate priors, the marginal posterior distribution of the latent variables $Z$ is known in closed form, up to a constant, as exhibited and exploited for importance sampling in Casella et al. (1999), and that minimum and maximum points can be found for this distribution. The results of Mira et al. (2001) on perfect slice samplers can then be adapted to this setting. Although its practical implementation is limited to small sample sizes, we show that a coupling strategy of Breyer and Roberts (2001), based on the same principle, can easily be implemented for much larger sample sizes.
In Section 2, we use the marginalization argument of Casella et al. (1999) to show that the marginal posterior of Z is available in closed form and allows for a minimum and a maximum point in common cases. Section 3 then provides a detailed description of the perfect slice sampling techniques, first via a naïve uniform proposal and then through an approximating augmented distribution. Section 4 introduces a generic perfect sampler based on a single backward chain and evaluates the validity of the approximation of Section 3.

Marginalization arguments
Consider a random sample from a $k$-component exponential family mixture, with density

  $X_1, \ldots, X_n \sim \sum_{i=1}^{k} p_i \exp\{x\theta_i - \psi(\theta_i)\}.$  (3)

Assume in addition that the $\theta_i$'s are associated with conjugate priors, i.e.

  $\pi(\theta_i) = K(\lambda_i, x_{0i})^{-1} \exp\{x_{0i}\theta_i - \lambda_i \psi(\theta_i)\},$  (4)

where $(x_{0i}, \lambda_i)$ is the hyperparameter, and that the $p_i$'s, when unknown, are associated with a Dirichlet $\mathcal{D}(\delta_1, \ldots, \delta_k)$ prior. Model (3) can be represented as a latent variable model, i.e. it can be interpreted as the marginal distribution of the joint distribution (2). As previously noted in Casella et al. (1999), the corresponding marginal posterior distribution on the $Z_j$'s is then

  $\pi(z \mid x) \propto \prod_{i=1}^{k} \Gamma(n_i + \delta_i) \int \exp\{(x_{0i} + s_i)\theta_i - (\lambda_i + n_i)\psi(\theta_i)\}\, \mathrm{d}\theta_i,$

where $n_i$ denotes the number of $z_j$'s equal to $i$ and $s_i$ is the sum of the corresponding $x_j$'s. Taking advantage of the conjugate priors, we can then solve the above integrals and obtain

  $\pi(z \mid x) \propto \prod_{i=1}^{k} \Gamma(n_i + \delta_i)\, K(\lambda_i + n_i, x_{0i} + s_i),$  (5)

where $K(\lambda, s)$ is the normalizing constant associated with expression (4). The marginal posterior distribution of the latent variables $(Z_1, \ldots, Z_n)$ is therefore available in closed form. Since this is also true for the parameters of interest $\xi_i$, one may wonder where the advantage of this representation lies. The crucial points are as follows.
(a) Posterior expectations of functions of the $\xi_i$'s can often be approximated from the distribution of the $Z_j$'s by using Rao-Blackwellization.
(b) The $Z_j$'s have a finite support, of size $k^n$, and it is often possible to determine the minimum and maximum of expression (5). Moreover, the distribution factors through the (latent) sufficient statistics $(n_i, s_i)$.
(c) When possible, perfect simulation from expression (5) leads to perfect simulation from the posterior distribution of the $\xi_j$'s by a single call to the conditional distribution $\pi(\xi \mid x, z)$.
The constraint on this closed form representation is obviously that the prior distributions must be conjugate, but this is often the case in the literature (see Diebolt and Robert (1994) or Richardson and Green (1997)).
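To make point (c) concrete, anticipate the two-component exponential mixture of Example 1 below, with Ga(α_j, β_j) conjugate priors on the rates: a perfect draw of the latent sufficient statistics (n_0, s_0) converts into a perfect draw of (λ_0, λ_1) through a single call to the conditional posterior. This is only a sketch, and the hyperparameter defaults are illustrative assumptions rather than values used in the paper.

```python
import random

def draw_rates(n0, s0, n, S, a0=1.0, b0=1.0, a1=1.0, b1=1.0, rng=None):
    """Given an exact draw of (n0, s0), simulate the component rates from
    their conditional posteriors, lambda_j ~ Ga(a_j + n_j, b_j + s_j);
    gammavariate takes (shape, scale), hence the reciprocal of the rate."""
    rng = rng or random.Random(0)
    lam0 = rng.gammavariate(a0 + n0, 1.0 / (b0 + s0))
    lam1 = rng.gammavariate(a1 + (n - n0), 1.0 / (b1 + (S - s0)))
    return lam0, lam1
```

A perfect sample of (n_0, s_0) followed by one call to draw_rates is thus a perfect sample from the posterior of the rates.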
In what follows, two running examples will be presented.

Example 1: exponential mixtures
Consider the case of the two-component mixture

  $p \lambda_0 \exp(-\lambda_0 x) + (1-p) \lambda_1 \exp(-\lambda_1 x), \qquad \lambda_0, \lambda_1 > 0,$  (6)

where $0 < p < 1$ is known. A conjugate prior distribution on $\lambda_i$ is the Ga$(\alpha_i, \beta_i)$ distribution. The marginal posterior distribution on the $Z_j$'s is then

  $\pi(z \mid x) \propto p^{n_0} (1-p)^{n_1}\, \frac{\Gamma(\alpha_0 + n_0)\, \Gamma(\alpha_1 + n_1)}{(\beta_0 + s_0)^{\alpha_0 + n_0}\, (\beta_1 + s_1)^{\alpha_1 + n_1}}.$  (7)

This distribution thus factors through the sufficient statistic $(n_0, s_0)$, since $n_1 = n - n_0$ and $s_1 = S - s_0$, where $S$ is the sum of all the observations. The entire simulation problem therefore reduces to the distribution of the pair $(n_0, s_0)$. (Note that a similar expression can be obtained for a Poisson mixture, with the same factorization.)
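For later reference, the unnormalized log of expression (7) can be evaluated with log-gamma calls; the sketch below fixes illustrative hyperparameter values and is not tied to any data set analysed in the paper.

```python
import math

def log_post(n0, s0, n, S, p=0.5, a0=1.0, b0=1.0, a1=1.0, b1=1.0):
    """Unnormalized log of the marginal posterior (7) of the latent
    sufficient statistics (n0, s0), with n1 = n - n0 and s1 = S - s0."""
    n1, s1 = n - n0, S - s0
    return (n0 * math.log(p) + n1 * math.log(1.0 - p)
            + math.lgamma(a0 + n0) + math.lgamma(a1 + n1)
            - (a0 + n0) * math.log(b0 + s0)
            - (a1 + n1) * math.log(b1 + s1))
```

With symmetric hyperparameters and p = 1/2, swapping the two components leaves (7) invariant, which provides a convenient sanity check of the factorization.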

Example 2: normal mixture
We now consider a two-component normal mixture density with common variance $\sigma^2$ and known weight $p$, under a conjugate prior distribution on the means and the variance. The marginal posterior distribution of $Z$ is then proportional to expression (8), where $T$ denotes the total sum of squares and $\bar{x}_j$ is the average of the observations allocated to component $j$.

Background
We recall that the slice sampler, as developed in Besag and Green (1993) or Damien et al. (1999), consists of simulating from a density $\pi(\xi)$ by introducing an auxiliary variable $\varepsilon \sim U([0, \pi(\xi)])$ and generating a Markov chain according to the following transition kernel.

  Step 1: simulate $\varepsilon^{(t+1)} \sim U([0, \pi(\xi^{(t)})])$.
  Step 2: simulate $\xi^{(t+1)}$ uniformly on the slice $\{\xi : \pi(\xi) \ge \varepsilon^{(t+1)}\}$.
Step 1 of this algorithm is straightforward, but step 2 can be quite complex. For instance, in the mixture problem, generating a point uniformly in the region $\{(\theta, p) : \pi(\theta, p \mid x) \ge \varepsilon\}$, where $\theta = (\theta_1, \ldots, \theta_k)$ and $p = (p_1, \ldots, p_k)$, is impossible in practice, given that this region is usually not connected. As detailed in Section 3.2, it is also difficult to generate a point at random in $\{z : \pi(z) \ge \varepsilon\}$, given the finite support of $Z$.
A very appealing feature of slice samplers is that, as pointed out in Mira et al. (2001), there is a natural stochastic ordering, which is induced by π.·/, and hence monotonicity arguments can be invoked, as we now detail.
First, if $\pi(\xi_1) \le \pi(\xi_2)$, the corresponding slices at a common level $u$, $A_i = \{\xi : \pi(\xi) \ge u\, \pi(\xi_i)\}$, satisfy $A_2 \subseteq A_1$ and, therefore, simulation from a uniform distribution on $A_2$ can proceed by acceptance-rejection of a uniform sampling on $A_1$. From a perfect sampling point of view, this property induces a natural propensity for coupling: if $\xi_1 \sim U(A_1)$ belongs to $A_2$, this realization is also acceptable as a simulation from $U(A_2)$ and both chains couple; if it does not belong to $A_2$, the value $\xi_2$ simulated directly from the uniform distribution on $A_2$ preserves the ordering $\pi(\xi_1) \le \pi(\xi_2)$. Secondly, and most importantly, in the case of expression (5), there are a maximal and a minimal element, $\tilde{1}$ and $\tilde{0}$, for the order induced by $\pi(n_0, s_0)$, since the space of the $(n_0, s_0)$'s is finite. Moreover, these maximal and minimal elements can be identified for standard mixtures. Thus, monotone CFTP (Propp and Wilson, 1996) applies, i.e. it is sufficient to run two chains starting from $\tilde{1}$ and $\tilde{0}$, and to check whether both chains coalesce at time 0. By a monotonicity argument, all the chains in between the extreme chains have coalesced by the time that those two coalesce.

Example 3: exponential mixtures (continued)
Consider distribution (7). For a given value $n_0$ such that $0 \le n_0 \le n$, $s_0$ belongs to the interval $[\underline{s}_0(n_0), \bar{s}_0(n_0)]$, where

  $\underline{s}_0(n_0) = x_{(1)} + \cdots + x_{(n_0)}, \qquad \bar{s}_0(n_0) = x_{(n-n_0+1)} + \cdots + x_{(n)},$

in which the $x_{(i)}$'s denote the order statistics of the sample $x_1, \ldots, x_n$, with $x_{(1)} \le \cdots \le x_{(n)}$. The minimum and maximum of $\pi(n_0, s)$ for a given value of $n_0$ can then be obtained in closed form, given the monotonicity structure in $s$ of $\pi(n_0, s)$: this function is decreasing and then increasing, with a minimum at

  $s_0^*(n_0) = \frac{(\alpha_0 + n_0)(\beta_1 + S) - (\alpha_1 + n - n_0)\beta_0}{\alpha_0 + \alpha_1 + n},$

provided that this value lies in $[\underline{s}_0(n_0), \bar{s}_0(n_0)]$. The maximum is obviously attained at one of the two extremes. The points $\tilde{1}$ and $\tilde{0}$ are then obtained by $n + 1$ comparisons over all the $n_0$'s. Note that $s_0^*(n_0)$, and hence $\tilde{0}$, are not necessarily attainable points, i.e. sums of $n_0$ observations from the sample. This is not a problem for perfect sampling: starting from $\tilde{0}$ instead of the true minimum only slows down coalescence but does not modify the validity of CFTP. (This is the envelope argument of Kendall (1998).)
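The n + 1 comparisons can be sketched as follows; log_post is the unnormalized log of expression (7) under illustrative hyperparameters, and the interior stationary point s_star is obtained by differentiating (7) in s_0.

```python
import math

def log_post(n0, s0, n, S, p=0.5, a0=1.0, b0=1.0, a1=1.0, b1=1.0):
    """Unnormalized log of expression (7); hyperparameters are illustrative."""
    n1, s1 = n - n0, S - s0
    return (n0 * math.log(p) + n1 * math.log(1.0 - p)
            + math.lgamma(a0 + n0) + math.lgamma(a1 + n1)
            - (a0 + n0) * math.log(b0 + s0)
            - (a1 + n1) * math.log(b1 + s1))

def extremes(xs, p=0.5, a0=1.0, b0=1.0, a1=1.0, b1=1.0):
    """Scan n0 = 0, ..., n; in s the log-posterior is decreasing then
    increasing, so the maximum sits at an endpoint of [s_lo, s_hi] and
    the minimum at the stationary point s_star when it falls inside the
    interval (envelope argument: s_star need not be an attainable sum)."""
    n, S = len(xs), sum(xs)
    srt = sorted(xs)
    top = bot = None
    for n0 in range(n + 1):
        s_lo = sum(srt[:n0])              # sum of the n0 smallest observations
        s_hi = S - sum(srt[:n - n0])      # sum of the n0 largest observations
        a, b = a0 + n0, a1 + (n - n0)
        s_star = (a * (b1 + S) - b * b0) / (a + b)
        cands = [s_lo, s_hi]
        if s_lo < s_star < s_hi:
            cands.append(s_star)
        for s in cands:
            v = log_post(n0, s, n, S, p, a0, b0, a1, b1)
            if top is None or v > top[0]:
                top = (v, n0, s)
            if bot is None or v < bot[0]:
                bot = (v, n0, s)
    return top, bot   # analogues of the extreme points 1-tilde and 0-tilde
```

The scan touches at most 3(n + 1) evaluations of (7), so identifying the extreme points is negligible next to the cost of the sampler itself.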
Note again the crucial appeal of running the slice sampler on the latent variable chain rather than on the original parameter chain. It is indeed nearly impossible to find $\tilde{0}$ and $\tilde{1}$ for the latter, since this is equivalent to finding the maximum likelihood estimator (for $\tilde{1}$) and a 'minimum likelihood estimator' (for $\tilde{0}$). In non-compact cases, these extrema may not even exist, as when the likelihood is unbounded.
As noted in Mira et al. (2001), only knowledge of the maximal element $\tilde{1}$ is necessary to run the monotone perfect slice sampler, given that the minimal element $\tilde{0}$ is never really used: for the chain starting from $\tilde{0}$, the next value is selected at random from the entire state space of the $\omega$'s, since $\pi(\xi) \ge u\, \pi(\tilde{0})$ does not impose any constraint on $\xi$.

Exact slice sampling
In principle, it is possible to implement a perfect slice sampler on the discrete chain. For simplicity, we study only the exponential mixture case for k = 2, although the method formally extends to larger values of k.
Consider the two extremes $\tilde{0}$ and $\tilde{1}$ in the $(n_0, s_0)$ state space. Values can then be generated from the uniform distribution $U(\{(n_0, s_0) : \pi(n_0, s_0) \ge \varepsilon\})$ by a crude accept-reject scheme which picks points at random from the set

  $\Big\{(n_0, s_0);\ s_0 = \sum_{j=1}^{n_0} x_{i_j},\ 1 \le i_1 < \cdots < i_{n_0} \le n\Big\}$  (9)

until they satisfy the constraint $\pi(n_0, s_0) \ge \varepsilon$. (Note that this random sampling can be decomposed as the selection of $n_0$ with probability proportional to $\binom{n}{n_0}$ and the subsequent generation of a random permutation of $\{1, \ldots, n\}$.) The first difficulty with this implementation is that the uniform sequence from expression (9) used to generate the points $(n_0^{(t)}, s_0^{(t)})$ must be preserved for the next time that $t$ is visited in the CFTP scheme, to build the deterministic mapping that validates CFTP. (As rightly pointed out by a referee, a practical implementation of this requirement is to store only the initial random seed used at time $t$, given that the infinite sequence of uniform points on expression (9) will then always be the same.) The second problem with this perfect slice sampler is more down to earth but cannot be overcome: since the accept-reject step is based on a uniform proposal on the set of all possible values, acceptance may require very many proposals. The cost can obviously be reduced by eliminating the values of $n_0$ for which the maximum (in $s_0$) of the posterior is below the bound $\varepsilon$, but this only removes a tiny portion of the $k^n$-sized parameter space. The poor match between set (9) and $\{(n_0, s_0) : \pi(n_0, s_0) \ge \varepsilon\}$ is such a serious problem that we failed to achieve a single coalescence after several days of monitoring for a two-component exponential mixture with 60 observations.
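The crude accept-reject step can be sketched as below, with log_post standing for the log of expression (7) under illustrative hyperparameters; drawing a uniform random subset of the sample gives n_0 with probability proportional to the binomial coefficient, matching the decomposition above.

```python
import math
import random

def log_post(n0, s0, n, S, p=0.5, a0=1.0, b0=1.0, a1=1.0, b1=1.0):
    """Unnormalized log of expression (7); hyperparameters are illustrative."""
    n1, s1 = n - n0, S - s0
    return (n0 * math.log(p) + n1 * math.log(1.0 - p)
            + math.lgamma(a0 + n0) + math.lgamma(a1 + n1)
            - (a0 + n0) * math.log(b0 + s0)
            - (a1 + n1) * math.log(b1 + s1))

def slice_draw(xs, log_eps, rng, max_tries=100000):
    """Accept-reject over the set (9): propose uniformly among all 2^n
    subsets and keep the first (n0, s0) whose log-posterior clears the
    slice level log_eps; the cost explodes with n, as noted in the text."""
    n, S = len(xs), sum(xs)
    for _ in range(max_tries):
        sub = [x for x in xs if rng.random() < 0.5]   # uniform random subset
        n0, s0 = len(sub), sum(sub)
        if log_post(n0, s0, n, S) >= log_eps:
            return n0, s0
    raise RuntimeError("no acceptance: slice level too high")
```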
For smaller sample sizes, we could, however, achieve coalescence on a 'human' (rather than geological) timescale. For instance, Fig. 1 represents the 32 steps of a successful CFTP run for 35 observations.

The fundamental difficulty behind the brute force implementation of the slice sampler via the uniform proposal in expression (9) is that the range of the log-posterior values $\log\{\pi(n_0, s_0)\}$ increases with $n$. For instance, in the case of 40 observations, it ranges from $-102$ to $-69$. Therefore, when considering the maximum point $\tilde{1}$, it is increasingly unlikely to find a point at random that satisfies $\pi(n_0, s_0) \ge u\, \pi(\tilde{1})$ if $u$ is quite close to 1. A way to alleviate this partially is to introduce intermediate chains, i.e. new starting-points between $\tilde{0}$ and $\tilde{1}$. The motivation for this multiplication of chains is that, by providing a better coverage of the range $[\pi(\tilde{0}), \pi(\tilde{1})]$, we increase the chances that the upper chain (starting from $\tilde{1}$) collapses into lower log-posterior chains, thus reducing the odds that the algorithm becomes stuck with a large $u$ and almost no $(n_0, s_0)$ satisfying $\pi(n_0, s_0) \ge u\, \pi(\tilde{1})$. However, this modification does not improve the speed of the algorithm, as shown by an experiment with the same data set as that used in Fig. 1 with 99 intermediate chains, since coalescence still took approximately 40 min of computation.

A continuous approximation
Since the difficulty with the above perfect slice sampler resides in the discrete nature of the parameter space, we may be tempted to embed distribution (7) in a continuous distribution, for which slice sampling is much easier. We first show why this is so, before discussing the validity of the approximation.

Example 5: exponential mixtures (continued)
The set of $s_0$'s such that $\pi(n_0, s_0) \ge \varepsilon$ is the entire interval $[\underline{s}_0(n_0), \bar{s}_0(n_0)]$ if the value of $\pi(n_0, \cdot)$ at the minimum $s_0^*$ is larger than $\varepsilon$; otherwise, it is easily derived by dichotomous (bisection) search on each monotone branch.
The joint uniform generation of $(n_0, s_0)$ in expression (10) thus depends on two uniform random variables: one to select $n_0$ with probability proportional to $l(n_0)$, the length of the region $R(n_0)$, and one to select $s_0$ uniformly in $R(n_0)$. We can thus represent the slice sampling procedure as a deterministic mapping and derive from this representation a straightforward CFTP algorithm, which is a true perfect sampling algorithm for the continuous distribution but only an approximation for the discrete distribution (7). Fig. 2 provides an illustration of the paths of the two chains started at $\tilde{0}$ and $\tilde{1}$ for a two-component exponential mixture with $n = 73$ observations, along with the corresponding range of the log-posteriors $\log\{\pi(\omega_0^{(t)})\}$ and $\log\{\pi(\omega_1^{(t)})\}$. Note that, contrary to the exact case of Section 3.2, coalescence is fast, even for large data sets.
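The dichotomous search for the endpoints of R(n_0) only requires a generic bisection on each monotone branch of s → π(n_0, s); the sketch below assumes f is monotone on [lo, hi] and that the level is bracketed there.

```python
def branch_root(f, lo, hi, level, increasing, iters=60):
    """Bisection for f(s) = level on one monotone branch of the
    log-posterior in s. When pi(n0, s*) < eps, the slice
    {s : pi(n0, s) >= eps} is the union of two end intervals, whose
    inner endpoints are the roots on the decreasing branch (left of
    s*) and on the increasing branch (right of s*)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (f(mid) < level) == increasing:
            lo = mid      # the root lies to the right of mid
        else:
            hi = mid      # the root lies to the left of mid
    return 0.5 * (lo + hi)
```

Sixty halvings shrink the bracket by a factor 2^60, so the endpoints of R(n_0), and hence l(n_0), are obtained to machine precision at negligible cost.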
The above theory extends to the case $k > 2$, as long as we can find a maximal starting value $\tilde{1}$, although we are then forced to settle for simple but slow accept-reject methods.

Evaluation of the approximation
At first glance, we may consider that, for large values of $n$, the continuous interval $[\underline{s}_0(n_0), \bar{s}_0(n_0)]$ is a good approximation to the set of possible values of $s_0$. This is not so, however, because, as pointed out to us by R. M. Neal, there will be many more $s_0$'s in the centre of the interval $[\underline{s}_0(n_0), \bar{s}_0(n_0)]$ than close to its boundaries, and the uniform assumption used in the continuous approximation is therefore invalid. The true distribution of the $s_0$'s will thus tend to have a binomial or normal distribution shape. The second reason for a discrepancy is that the first and last values of $s_0$ are well separated, regardless of the sample size, given their dependence on extreme values.
The continuous extension thus leads to an approximation of the genuine distribution of interest. One may then wonder how accurate this approximation can be. The answer is, unfortunately, that it can be very poor. Given that the approach of Section 4 does provide a true sample from the posterior distribution associated with a mixture, we can assess the effect of the continuous approximation. For instance, in the case of a two-component exponential mixture with $n = 81$ and $n_0 = 59$, Fig. 3 shows how disastrous the approximation can be: whereas the density of $s_0$ given by distribution (7) decreases quite rapidly towards 0, as reflected by the histogram on the left-hand side, the main bulk of the possible values of $s_0$ is centred much further to the right, and the reweighting of these values by the probability mass function (7) only brings the histogram slightly to the left, still leaving the supports of both histograms almost disjoint.

Perfect sampling via automatic coupling
As well as slice sampling, there is another approach to perfect sampling, based on Breyer and Roberts's (2001) automatic coupling. Given the value $y_t$ proposed for the first chain $(x_0^{(t)})$, the transition for the second chain $(x_1^{(t)})$ is then based on the same proposed value $y_t$, with the difference that the Metropolis-Hastings scheme is now an independent scheme, and hence the transition mechanism is

  $x_1^{(t+1)} = \begin{cases} y_t & \text{if } u_t \le \rho_1^{(t)}, \\ x_1^{(t)} & \text{otherwise.} \end{cases}$  (11)

We can use the same uniform variable $u_t$ for both transitions, thus ensuring that the chains coalesce once they both accept a proposal $y_t$. The implications of this rudimentary coupling strategy for perfect sampling are quite interesting. Consider an independent proposal distribution $h(y)$ for the first chain $(x_0^{(t)})$. The Metropolis-Hastings acceptance probabilities for both chains are then

  $\rho_j^{(t)} = \min\Big\{1,\ \frac{\pi(y_t)\, h(x_j^{(t)})}{\pi(x_j^{(t)})\, h(y_t)}\Big\}, \qquad j = 0, 1.$

If the two chains are ordered according to the order induced by $\pi/h$, i.e. if

  $\pi(x_0^{(t)})/h(x_0^{(t)}) \le \pi(x_1^{(t)})/h(x_1^{(t)}),$

coupling scheme (11) preserves the ordering. Indeed, since $\rho_0^{(t)} \ge \rho_1^{(t)}$ when the chains are ordered and the same uniform variable $u_t$ is used to update both chains, there are three possible cases:

(a) $y_t$ is rejected for both chains;
(b) $y_t$ is accepted for the first chain but rejected for the second chain;
(c) $y_t$ is accepted for both chains.
That the order is preserved in cases (a) and (c) is trivial. For case (b), the fact that $y_t$ is rejected for the second chain implies that $\rho_1^{(t)} < 1$, and thus that

  $\pi(y_t)/h(y_t) < \pi(x_1^{(t)})/h(x_1^{(t)}),$

so the new state $y_t$ of the first chain remains below $x_1^{(t)}$ in the order induced by $\pi/h$. This property, namely that the coupling proposal preserves the ordering, was first obtained by Corcoran and Tweedie (2000).
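Coupling scheme (11) can be sketched generically for a target known up to a constant; log densities are used for numerical stability, and the toy target in the usage example is an illustrative assumption, not a model from the paper.

```python
import math
import random

def coupled_step(x0, x1, log_pi, log_h, sample_h, rng):
    """One coupled independence Metropolis-Hastings move: both chains
    share the proposal y and the uniform u, so they coalesce at the
    first step where both accept (and stay merged from then on)."""
    y = sample_h(rng)
    log_u = math.log(rng.random())
    def log_rho(x):
        # log acceptance probability, min{0, log[(pi/h)(y)] - log[(pi/h)(x)]}
        return min(0.0, (log_pi(y) - log_h(y)) - (log_pi(x) - log_h(x)))
    return (y if log_u <= log_rho(x0) else x0,
            y if log_u <= log_rho(x1) else x1)
```

For instance, with π(x) ∝ x on (0, 1) and h uniform, the chain started at the maximal element x = 1 of the π/h order only moves at the step where it coalesces with the other chain, exactly the behaviour described for Fig. 4(a).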
Consider a monotone setting where there are a minimal and a maximal element, $\tilde{0}$ and $\tilde{1}$, for the order induced by $\pi/h$. We can then run a perfect sampler based on the two chains starting from $\tilde{0}$ and $\tilde{1}$. Note that, since the choice of $h$ is arbitrary, it can be made so as to allow an easy derivation of $\tilde{0}$ and $\tilde{1}$. None-the-less, this cannot be achieved in every setting, since the existence of $\tilde{1}$ implies that the corresponding MCMC chain is uniformly ergodic, as shown by Mengersen and Tweedie (1996). In the special case where the state space $\mathcal{X}$ is compact, $h$ can be chosen as the uniform distribution on $\mathcal{X}$ and the extremal elements $\tilde{0}$ and $\tilde{1}$ are then those induced by $\pi$.
This principle applies to the mixture problem, since the pair $(n_0, s_0)$ evolves in a compact (finite) state space. To illustrate its implementation in the exponential case, we fixed the value of $n_0$ and simulated an exact sample of $s_0$ from the corresponding distribution (7), starting the CFTP scheme at time $-T$. If coalescence occurred by time 0, this meant that CFTP was successful; otherwise, we increased $T$. Fig. 4(a) illustrates the rudimentary nature of this coupling: for a long time (almost 1000 iterations), coupling is rejected by $(x_1^{(t)})$, which remains equal to the original value $\tilde{1}$, whereas $(x_0^{(t)})$ moves with a good mixing rate. Once coupling has occurred, both chains are identical and keep mixing at the same rate until time 0. So, in practice, this CFTP fundamentally relies on a single chain, the second chain only being used through its reference value $\tilde{1}$ to delay coalescence. In that sense, this perfect sampling algorithm is simply a hidden accept-reject algorithm. Fig. 4(b) provides additional insight into the coupling phenomenon by plotting the corresponding log-posteriors.

A drawback with this approach is obviously that the difference between the two extremes $\tilde{0}$ and $\tilde{1}$ may prevent coupling for a long run of iterations. Fig. 5 presents an illustration of this problem for 81 observations from a mixture of two exponential distributions, through the histogram of the number of iterations until coalescence over 5000 replications of the CFTP scheme. (Note that the horizontal axis is labelled $\log_2(T)$, since we followed Propp and Wilson's (1996) recommendation to double $T$ at each unsuccessful CFTP attempt.) The majority of the coalescence times are thus above $2^{10}$. None-the-less, we could tackle sample sizes that were far beyond those attainable by the technique of Section 3.2, since the computational time was negligible.
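The doubling scheme with stored seeds can be sketched generically; the toy monotone update below (a reflected random walk on {0, ..., 9} with synchronized moves) is an illustrative stand-in for the coupled mixture kernel, chosen only so that the skeleton runs.

```python
import random

def u_at(seed, t):
    """Fixed uniform attached to time index t (t < 0), so that reruns
    from a deeper -T reuse exactly the same randomness at each time."""
    return random.Random(seed * 1000003 + t).random()

def monotone_cftp(update, lo, hi, seed=42, max_T=2 ** 20):
    """Propp-Wilson CFTP: run the two extreme chains from time -T to 0
    with common randomness, doubling T until they have coalesced at
    time 0; by monotonicity, every chain in between has then coalesced."""
    T = 1
    while T <= max_T:
        a, b = lo, hi
        for t in range(-T, 0):
            u = u_at(seed, t)
            a, b = update(a, u), update(b, u)
        if a == b:
            return a
        T *= 2
    raise RuntimeError("no coalescence by max_T")

def walk(x, u):
    # monotone toy kernel: synchronized reflected +/-1 moves on {0,...,9}
    return min(x + 1, 9) if u < 0.5 else max(x - 1, 0)
```

Because u_at ties each uniform to its time index, reruns with doubled T reproduce the randomness already used, which is precisely the requirement that validates CFTP.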

Conclusion
We have studied two exact samplers for mixture posterior distributions, one of which can handle reasonably large sample sizes. We also introduced a continuous approximation which allows us to run a much faster slice sampler but we demonstrated that this approximation can be poor.
Note also that, when the weights are unknown, integration of the weights against a conjugate Dirichlet prior does not modify the complexity of the posterior distribution in $Z$, since it only depends on the $n_i$'s, as shown by expression (5).
We have thus obtained a general sampling method for mixture posterior distributions based on independent and identically distributed data. As pointed out in the discussion of Green and Murdoch (1999), mixture models are a good showcase for establishing that perfect sampling can be achieved for realistic statistical models. Green and Murdoch (1999) 'surmise that it might be possible to couple MCMC methods for mixtures', and the present paper somehow achieves this goal, although more work remains to be done to handle larger sample sizes and more general mixtures.