Missing data in a stochastic Dollo model for binary trait data, and its application to the dating of Proto‐Indo‐European

Summary.  Nicholls and Gray have described a phylogenetic model for trait data. They used their model to estimate branching times on Indo‐European language trees from lexical data. Alekseyenko and co‐workers extended the model and gave applications in genetics. We extend the inference to handle data missing at random. When trait data are gathered, traits are thinned in a way that depends on both the trait and the missing data content. Nicholls and Gray treated missing records as absent traits. Hittite has 12% missing trait records. Its age is poorly predicted in their cross‐validation. Our prediction is consistent with the historical record. Nicholls and Gray dropped seven languages with too much missing data. We fit all 24 languages in the lexical data of Ringe and co‐workers. To model spatiotemporal rate heterogeneity we add a catastrophe process to the model. When a language passes through a catastrophe, many traits change at the same time. We fit the full model in a Bayesian setting, via Markov chain Monte Carlo sampling. We validate our fit by using Bayes factors to test known age constraints. We reject three of 30 historically attested constraints. Our main result is a unimodal posterior distribution for the age of Proto‐Indo‐European centred at 8400 years before Present with 95% highest posterior density interval equal to 7100–9800 years before Present.


Introduction
The Indo-European languages descend from a common ancestor called Proto-Indo-European. Lexical data show the patterns of relatedness among Indo-European languages. These data are 'cognacy classes': a pair of words in the same class descend, through a process of change in sound, from a common ancestor. For example, English sea and German See are cognate to one another, but not to the French mer. Gray and Atkinson (2003) coded data of this kind in a matrix in which rows correspond to languages and columns to distinct cognacy classes, and entries are 0 or 1 as the language possesses or lacks a term in the column class. They analysed these data by using phylogenetic algorithms that are similar to those used for genetic data. Our analysis has the same objectives, but we fit a model that was designed for lexical trait data. We work with data that were compiled by Ringe et al. (2002), recording the distribution of 872 distinct cognacy classes in 24 modern and ancient Indo-European languages. In Section 8, we give estimates for the unknown topology and branching times of the phylogeny of the core vocabulary of these languages. Nicholls and Gray (2008) analysed the same data by using a closely related stochastic Dollo model for binary trait evolution. However, they could not deal with missing trait records. Missing data arise when we cannot answer the question 'does language X have a cognate in cognacy class Y?'. Nicholls and Gray (2008) dropped seven languages which had many missing entries, and treated missing trait records in the remainder as absent traits. This is unsatisfactory. However, it is not straightforward to give a model-based integration of missing data for the trait evolution model of Nicholls and Gray (2008). In this paper we integrate the missing trait data, and this technical advance allows us to fit all 24 languages in the original data. The binary trait model of Nicholls and Gray (2008) has been extended by Alekseyenko et al. 
(2008) to multiple-level traits and is finding applications in biology. A proper treatment of missing data will be of use in other applications.
We are specifically interested in phylogenetic dating. Because we are working with lexical, and not syntactic, data, it is the age of the branching of the core vocabulary of Proto-Indo-European that we estimate. This is a controversial matter. Workers in historical linguistics have evidence from linguistic palaeontology that the most recent common ancestor of all known Indo-European languages branched no earlier than about 6000-6500 years before the present (BP) (Mallory, 1989). For a recent review of the argument from linguistic palaeontology, and a criticism of phylogenetic dating, see Garrett (2006) and McMahon and McMahon (2005). An alternative hypothesis suggests that the spread began around 8500 BP when the Anatolians mastered farming (Renfrew, 1987) in the early Neolithic. Recent efforts to apply quantitative phylogenetic methods to dating Proto-Indo-European give a time depth around 8000-9500 years BP, supporting the link to farming. For instance, Gray and Atkinson (2003) estimated the age of Proto-Indo-European to lie between 7800 and 9800 years BP; Nicholls and Gray (2008) arrived at similar results. In this work we obtain a unimodal posterior distribution for the age of Proto-Indo-European centred at 8400 years BP with 95% highest posterior density interval equal to 7100-9800 years BP.
One point of view is that the argument from linguistic palaeontology is correct, and the phylogenetic dates are incorrect, owing to rate heterogeneity, or some other model failing. In this respect we advance the search that was started in Nicholls and Gray (2008) for a model misspecification which might explain the 20% difference in the central age estimates from our fitting, and the status quo. The difference is sufficiently large that we are hopeful of finding a single coherent error, if any such error exists. Accounting now for missing data, we can publish a much wider cross-validation test (up from 10 to 30 tested calibrations). Also, we fit a model for explicit rate heterogeneity in time and space. In view of the spatiotemporal homogeneity we find here and in other data, the case for the earlier date seems now fairly strong. The most probable alternative seems to be a step change in the rate of lexical diversification acting in a co-ordinated fashion across the Indo-European territory about 3000-5000 years ago.
In Sections 2 and 3.1 we describe the data and specify a subjective prior for the phylogeny of vocabularies. In Sections 3.2 and 4, we give a generative model for the data and the corresponding likelihood. We include, in Section 4, a recursive algorithm which makes the sum over missing data tractable. In Sections 5 and 6 we give the posterior distribution on tree and parameter space, and briefly describe our Markov chain Monte Carlo (MCMC) sampler. In Section 7, we discuss likely model misspecification scenarios, test for robustness by fitting synthetic data simulated under such conditions and cross-validate our predictions. We fit the model to a data set of Indo-European languages in Section 8. This paper has a supplement (Ryder and Nicholls, 2010) giving an analysis of a second data set, collected by Dyen et al. (1997).
Phylogenetic methods have been used to make inference for tree (Ringe et al., 2002) and network structure (McMahon and McMahon, 2005) in historical linguistics. Warnow et al. (2006) wrote down a more realistic model for the diversification of vocabulary, accounting for word borrowing between languages (so that their history need not be tree-like), but to date this model has not been fitted to data.

Description of the data
The data group words from 328 meaning categories and 24 Indo-European languages into 872 homology classes. The data were collected and coded by Ringe et al. (2002). Meaning categories cover the 'core' vocabulary and are assumed relevant to all languages in the study. Two words of closely similar meaning, which are descended from a common ancestor but subject to variation in phonology, are cognate terms. A cognacy class is a homology class of words all belonging to a single meaning category. For example, for the meaning 'head', the Italian testa and the French tête belong to the same cognacy class, whereas the English head and the Swedish huvud belong to another cognacy class. An element of a cognacy class is thus a word in a particular language, a sound-meaning pair. These elements are called cognates. The vocabulary of a single language is represented as a set of distinct cognates. In our analysis a cognate has just two properties: its language and its cognacy class. If there are N distinct cognacy classes in data for L languages, then the ath class M_a ⊆ {1, 2, ..., L} is a list of the indices of languages which have a cognate in that class. The data are coded as a binary matrix D. A row corresponds to a language and a column to a cognacy class, so D_{i,a} = 1 if the ath cognacy class has an instance in the ith language, and D_{i,a} = 0 otherwise. See Table 1 for an example. This coding allows a language to have several words for one meaning (such as Old High German stirbit and touwit for 'he dies', which are an instance of polymorphism), or no word at all. Missing matrix elements arise because the reconstructed vocabularies of some ancient languages are incomplete. If we cannot answer the question 'does language i have a cognate in cognacy class a?' then we set D_{i,a} = ?.

Table 1. (a) The word 'he dies' in six Indo-European languages and (b) the coding of these data as a binary matrix with ?s for missing data. In the notation of Section 2, the first column in (b) is a cognacy class M_1 ∈ Ω_1 with Ω_1 = {{Old English, Old High German}, {Old English, Old High German, Oscan}}.
We need notation for both matrix and set representations with missing data. Denote by B_a column a of an L × N matrix B. For a = 1, 2, ..., N let 𝒟_a be the set of all column vectors d* allowed by the data D_a in column a of D, that is, all binary columns agreeing with D_a in its non-? entries. Denote by Ω_a the set of cognacy classes that are consistent with the data D_a, so that Ω_a = {ω ⊆ {1, ..., L} : ω = {i : d*_i = 1} for some d* ∈ 𝒟_a}. The data D are then equivalently Ω = (Ω_1, Ω_2, ..., Ω_N). The Ω_a-notation generalizes the M_a-notation to handle missing data. It is illustrated in Table 1. Ringe et al. (2002) listed 24 mostly ancient languages. For 11 of these languages (Latin, Modern Latvian, Old Norse, ...), all the matrix entries are recorded. For the rest, the proportion of missing entries varies between 1% (for Old Irish) and 91% (for Lycian, which is an ancient language of Anatolia). Note that data are usually missing in small blocks corresponding to the cognacy classes for a given meaning category, as in Table 1. We do not model this aspect of the missing data. This is related to the model error that Nicholls and Gray (2008) called 'the empty field approximation', under which cognacy classes in the same meaning category are assumed to evolve independently.
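The 𝒟_a and Ω_a constructions can be made concrete in a few lines of code. The sketch below is our own illustration (not part of the original analysis): for a single hypothetical column with one missing entry it enumerates all binary completions d* and the resulting set Ω_a of consistent cognacy classes, mirroring the two-element Ω_1 of Table 1.

```python
from itertools import product

def completions(column):
    """All full binary columns d* consistent with a masked data column
    whose entries are 0, 1 or '?' (missing)."""
    slots = [(0, 1) if x == "?" else (x,) for x in column]
    return [list(d) for d in product(*slots)]

def omega(column):
    """Omega_a: the set of cognacy classes (index sets of languages
    carrying the cognate) consistent with the masked column."""
    return {frozenset(i for i, x in enumerate(d) if x == 1)
            for d in completions(column)}

# Hypothetical column for L = 3 languages: language 0 has the cognate,
# language 1 lacks it, and language 2 has a missing record.
D_a = [1, 0, "?"]
Omega_a = omega(D_a)
```

The number of completions, and hence the size of Ω_a, grows as 2 to the power of the number of ?s in the column, which is why the recursions of Section 4 are needed for the real data.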
For these Indo-European phylogenies, the topology of some subtrees is known from historical records. We have lower and upper bounds for the age of the root node of some subtrees. For example, the Slavic languages are known to form a subtree, and the most recent common ancestor of the Slavic languages in the data is known to be at least 1300 years old. In this way internal nodes of the phylogeny are constrained. We have age bounds for leaf nodes as well, since we are told when the vocabulary of the ancient languages in these data was in use. We combine these calibration constraints with the cognacy class data in Section 3.1. Jumping ahead to our results, Fig. 6 in Section 8 is a sample from the posterior distribution that we find for phylogenies. Calibration constraints are represented by the black bars across nodes in this tree.

Models
We specify a subjective prior for phylogenies, representing a state of knowledge that is of interest to us. The model that we give for the diversification of vocabulary in Section 3.2 is, in contrast, generative. We model the diversification of vocabulary as a branching process of sets of cognates, evolving on the phylogeny. Leaves are labelled by languages and a branch represents the ancestral lineage of a vocabulary.

Prior distribution on trees
The material in this subsection follows Nicholls and Gray (2008). We consider a rooted binary tree g with 2L nodes: L leaves, L − 2 internal nodes, a root node r = 2L − 1 and an Adam node A = 2L, which is linked to r by an edge of infinite length. Each node i = 1, 2, ..., 2L is assigned an age t_i, and t = (t_1, t_2, ..., t_A); the units of age are years before the present; for the Adam node, t_A = ∞. The edge between parent node j and child node i is a directed branch ⟨i, j⟩ of the phylogeny, with the orderings i < j and t_i < t_j. Let E be the set of all edges, including the edge ⟨r, A⟩, let V be the set of all nodes and V_L = {1, 2, ..., L} the set of all leaf nodes. Our base tree space Γ is the set of all rooted directed binary trees g = (E, V, t), with distinguishable leaves and, for i = 1, 2, ..., 2L, node ages t_i > 0 assigned so that the directed path from a leaf i ∈ V_L to node A passes through nodes of strictly increasing age. With this numbering convention, (E, V) is called the ordered history of a rooted directed binary tree.
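As an illustration of this representation, the following sketch encodes a small phylogeny as parent pointers and node ages, and checks the defining property that ages increase strictly along every leaf-to-Adam path. The data structure and names are our own, chosen for clarity, not the authors' code.

```python
import math

# A minimal sketch for L = 3 leaves: nodes are 1..2L, leaves are
# 1..L, the root is r = 2L - 1 = 5 and the Adam node A = 2L = 6 has
# infinite age.  parent[i] = j encodes the directed branch <i, j>.
L = 3
parent = {1: 4, 2: 4, 3: 5, 4: 5, 5: 6}
age = {1: 0.0, 2: 0.0, 3: 0.0, 4: 1.0, 5: 2.0, 6: math.inf}

def is_valid_tree(parent, age, L):
    """Check that every directed path from a leaf to the Adam node
    passes through nodes of strictly increasing age."""
    for leaf in range(1, L + 1):
        i = leaf
        while i in parent:
            j = parent[i]
            if not age[i] < age[j]:
                return False
            i = j
    return True
```

A proposal in the MCMC sampler of Section 6 that perturbs a node age must leave this property intact, or be rejected.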
We allow for C calibration constraints on tree topology and selected node ages. These are described at the end of Section 2. Each such constraint c allows trees in just some subspace Γ^(c) ⊂ Γ of tree space. We add to these constraints an upper bound on the root time at some age T > 0, and we set Γ^(0) = {(E, V, t) ∈ Γ : t_r ≤ T}. The space of calibrated phylogenies with constraints is then Γ_C = ∩_{c=0}^{C} Γ^(c). The root age is a sensitive statistic in this inference. A prior distribution on trees, with the property that the marginal distribution of the root age is uniform over a fixed prior range t_L ≤ t_r ≤ T, is of interest. For node i ∈ V, let t_i^+ = sup_{g∈Γ_C}(t_i) and t_i^− = inf_{g∈Γ_C}(t_i) be the greatest and least admissible ages for node i, and let S = {i ∈ V : t_i^+ = T}, so that S is the set of nodes having ages not bounded above by a calibration (there are 12 such nodes in Fig. 6 in Section 8). Nicholls and Gray (2008) showed that, before calibration (when C = 0) and for tree spaces in which the leaves have fixed equal ages, the prior probability distribution with density f_G(g | T) gives a marginal density for t_r which is exactly uniform in t_L < t_r < T. Nicholls and Gray (2008) did not comment on the distribution that is determined by f_G over tree topologies. The prior f_G has in this case (C = 0) a marginal distribution on ordered histories that is exactly equal to the corresponding distribution for the Yule model (Yule, 1925). This is not uniform but favours balanced leaf-labelled topologies, i.e. topologies in which sibling subtrees have similar numbers of leaves.
For C > 0, on tree spaces in which leaf ages are constrained by prior knowledge only to lie in an interval, and without the catastrophes that we introduce in the next section, Nicholls and Gray (2008) argued from simulation studies that this prior gives a reasonably flat marginal distribution for t_r if in addition T ≫ max_{i∈V\S}(t_i^+) (the greatest upper bound among the calibration constraints is not too close to the upper bound on the root age). We give a sample from the prior in the supplementary material; the prior does not represent any reasonable prior belief before about 4500 BP, but this is immaterial, as the likelihood rules these values out.

Diversification of cognacy classes
In this subsection we extend the stochastic Dollo model of Nicholls and Gray (2008) to incorporate rate heterogeneity in time and space, via a catastrophe process.
Consider the evolution of cognates down the ancestral lineage of the vocabulary of a single language. An example is given in Fig. 1. A new cognacy class is born when its first cognate is born. This new word is not cognate with other words in the modelled process. Loan words from outside the core meaning categories of any language in the study, or from a language outside the study, may account for word births without violating this condition. A cognate dies in a particular vocabulary when it ceases to be used for the meaning that is common to its cognacy class: its meaning may change, or it may fall out of use.
Cognates evolving in a single language (i.e. down a single branch of a language phylogeny) are born independently at rate λ, die independently at per capita rate μ and are subject to point-like catastrophes, which they encounter at rate ρ along a branch. At a catastrophe, each cognate dies independently with probability κ, and a Poisson number of cognates with mean ν are born. The process that we have described is not reversible, and this greatly complicates the analysis. It seems acceptable, from a data-modelling perspective, to impose the condition ν = κλ/μ, which is necessary and sufficient for reversibility (see the supplementary material for a proof). Under this condition, adding a catastrophe to an edge is equivalent to lengthening that edge by T_C(κ, μ) = −log(1 − κ)/μ years: given a branch ⟨i, j⟩ of length t_j − t_i with k_i catastrophes, the distribution of the data at (t_i, i) (conditional on the data at (t_j, j)) is equal to the distribution that would arise at the end of a branch of length t_j − t_i + k_i T_C but with no catastrophes. This follows because the number of cognates that are generated by the anagenic part of the process in an interval of length T_C is Poisson distributed with mean (λ/μ){1 − exp(−μT_C)}, equal to κλ/μ, and the probability that a cognate entering an interval of length T_C dies during that interval is 1 − exp(−μT_C), which equals κ.
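The two identities underpinning this edge-lengthening equivalence are easy to check numerically. In the sketch below (the rates are illustrative, not fitted values), we verify that with T_C = −log(1 − κ)/μ the anagenic process over an interval of length T_C yields a Poisson number of surviving cognates with mean exactly ν = κλ/μ, and kills an entering cognate with probability exactly κ.

```python
import math

# Illustrative rates (per year); kappa is the death probability at a
# catastrophe and T_C the equivalent block of virtual time.
lam, mu, kappa = 8e-4, 2e-5, 0.2
T_C = -math.log(1.0 - kappa) / mu

# Mean number of cognates born in an interval of length T_C that
# survive to its end (the anagenic part of the process) ...
births_surviving = (lam / mu) * (1.0 - math.exp(-mu * T_C))
# ... which should equal nu under the reversibility condition.
nu = kappa * lam / mu

# Probability that a cognate entering the interval dies within it.
death_prob = 1.0 - math.exp(-mu * T_C)
```

Since exp(−μ T_C) = 1 − κ exactly, both identities hold by construction; the check guards against sign and scaling slips when implementing T_C.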
Because a catastrophe simply extends its edge by a block of virtual time, the likelihood depends only on the number of catastrophes on an edge, and not on their location in time. Let k_i be the number of catastrophes on edge ⟨i, j⟩, and k = (k_1, ..., k_{2L−2}) be the catastrophe state vector. We record no catastrophes on the ⟨r, A⟩ edge (its length is already infinite). The tree g = (V, E, t, k) is specified by its topology, node ages and catastrophe state. Calibrated tree space extended for catastrophes is Γ_K^C = {(V, E, t, k) : (V, E, t) ∈ Γ_C, k_i ∈ {0, 1, 2, ...} for each edge ⟨i, j⟩ ∈ E other than ⟨r, A⟩}. We drop the catastrophe process from the calculation in Section 4. It is straightforward to restore it, and we do this in the expression for the posterior distribution in Section 5.

Registration process
Let D* denote a notional full random binary data matrix, representing the outcome of the diversification process of Section 3.2. The number of columns in D* is random and equal to N*. For the realization depicted in Fig. 1, D* is given in Table 2. Note that D* has a column for each cognacy class present at the root node, or born below it, whether or not the cognacy class has any cognates that are represented in any leaf languages. We call the random mapping of the unknown full representation D* to the registered data D, which is a random matrix with N columns, the registration process. There are two stages to this process: the masking of matrix elements of the fully realized data D* with ?s to form an intermediate data matrix D̃, and the selection of columns from D̃ to determine the realized data D.
Let I* be a random L × N* indicator matrix of independent Bernoulli random variables for observed elements. The 0s of I* mark matrix entries in D* which will be unobservable. For i = 1, 2, ..., L and given a = 1, 2, ..., N*, let ξ_i = Pr(I*_{i,a} = 1). The probability ξ_i that we can answer the question 'does language i have a word in the ath cognacy class?' is assumed to be a function of the language index i only. If we obtain an answer, it is assumed correct. Let ξ = (ξ_1, ..., ξ_L) and denote by I* a realization of I*. Denote by D̃ = D̃(D*, I*) the masked version of the full random data matrix: if I*_{i,a} = 1 then D̃_{i,a} = D*_{i,a}, and if I*_{i,a} = 0 then D̃_{i,a} = ?. Matrix columns may be missing also, so that N ≤ N*. We have missing columns even when there are no missing data. The matrix D̃ in Table 2 includes some columns with only 0s and ?s, corresponding to cognacy classes which existed in the past but are observed in none of the leaf languages from which the data were compiled. These cognacy classes are not included in the registered data. Denote by R the registration rule D = R(D̃) mapping the masked data to registered data.
Given a = 1, 2, ..., N*, let Y_a = Y(D̃_a, I*_a) be the number of observed 1s in column a and Q_a = Q(I*_a) the number of ?s. In Appendix A we give an efficient algorithm for computing the likelihood for rules that are formed by compounding the following elementary thinning operations:

(a) R_1(D̃) = (D̃_a : Y_a > 0) (discard classes with no instances at the leaves);
(b) R_2(D̃) = (D̃_a : Y_a > 1) (discard classes, 'singletons', observed at a single leaf);
(c) R_3(D̃) = (D̃_a : Y_a < L) (discard classes which are observed at all leaves);
(d) R_4(D̃) = (D̃_a : Y_a < L − 1) (discard classes which are observed at all leaves or at all leaves but one);
(e) R_5(D̃) = (D̃_a : Y_a + Q_a < L) (discard classes which are potentially present at all leaves);
(f) R_6(D̃) = (D̃_a : Y_a + Q_a < L − 1) (discard classes which are potentially present at all leaves or at all leaves but one).
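The six thinning operations are simple filters on the masked columns. The sketch below is our own illustration (the function and variable names are not from the original implementation): it computes Y_a and Q_a for each column of a small hypothetical D̃ and applies a chosen rule.

```python
def Y(col):
    """Number of observed 1s in a masked column (entries 0, 1 or '?')."""
    return sum(1 for x in col if x == 1)

def Q(col):
    """Number of missing ('?') entries in a masked column."""
    return sum(1 for x in col if x == "?")

def register(columns, rule, L):
    """Apply one elementary thinning rule (a)-(f) to the columns of a
    masked data matrix D-tilde, given L languages."""
    keep = {
        "R1": lambda c: Y(c) > 0,
        "R2": lambda c: Y(c) > 1,
        "R3": lambda c: Y(c) < L,
        "R4": lambda c: Y(c) < L - 1,
        "R5": lambda c: Y(c) + Q(c) < L,
        "R6": lambda c: Y(c) + Q(c) < L - 1,
    }[rule]
    return [c for c in columns if keep(c)]

# Hypothetical masked matrix for L = 3 languages, one list per column.
D_tilde = [[1, 0, "?"], [0, 0, "?"], [1, 1, 1], [1, 1, "?"], [0, 1, 0]]
```

Compound rules are obtained by composing these filters; note that rules (e) and (f) depend on the ?s through Q_a, so two columns with identical observed entries can be treated differently.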
The registration rule is chosen in advance by the linguists collecting the data. We assume that the chosen rule includes condition (a). The rule D = R(D̃) compounding conditions (b) and (d), which retains classes with 1 < Y_a < L − 1, collects 'parsimony informative' cognacy classes. Ronquist et al. (2005) gave the likelihood for the finite sites trait evolution model of Lewis (2001) for registration rules like (a)–(f). In the example in Table 2, and in Sections 7.2 and 8, we fit data registered with R(D̃) = R_1(D̃).
The selection of columns is something over which we have, in general, no control: the column selection rule simply describes what happened at registration. Results in the supplementary material for the Dyen et al. (1997) data use the rule R(D̃) = R_2(D̃), since singleton columns were not included in those data. However, certain column types may include data which are difficult to model well, and so we may choose to thin further by using the other rules. Recursions for the other rules are given in Appendix A.
Column indices a = 1, 2, ..., N* are exchangeable. It is convenient to renumber the columns of D*, I* and D̃ after registration, so that D̃_a = D_a and I*_a = I_a for a = 1, 2, ..., N. The information that is needed to evaluate Y_a and Q_a is available in both the column D_a and set Ω_a representations. We write Y(D_a) = Y(Ω_a) = Y_a and Q(D_a) = Q(Ω_a) = Q_a.

Point process of births for registered cognacy classes
Fix a catastrophe-free phylogeny g ∈ Γ_K^C, with k = (0, 0, ..., 0), and let an edge ⟨i, j⟩ and a time τ ∈ [t_i, t_j) be given. Denote by [g] the set of all points (τ, i) on the phylogeny, including points (τ, r) with τ ≥ t_r in the edge ⟨r, A⟩. The locations z_D = {z_1, z_2, ..., z_N} of the birth events of the N registered cognacy classes are a realization of an inhomogeneous Poisson point process Z_D on [g]. Let Z ∈ [g] be the birth location of a generic (and possibly unregistered) cognacy class M ⊆ {1, 2, ..., L}, corresponding to a column of D̃ with Y observed 1s and Q ?s, and let E_Z be the event that this class generates a column of the registered data. For the ath cognacy class M_a, born at Z_a, this event is E_{Z_a} = {D̃_a = R(D̃_a)}, since R(D̃_a) is empty for D̃_a a column dropped at registration.
Cognacy classes are born at constant rate λ on the branches of g but are thinned by registration. However, conditional on the birth location Z_a = z_a, our modelling assumes that the outcome E_{Z_a} for the ath cognacy class is decided independently of all events in all other cognacy classes. The point process Z_D of birth locations of registered cognacy classes therefore has intensity λ̃(z) = λ Pr(E_Z | g, μ, λ, ξ, Z = z) at z ∈ [g], and probability density proportional to the product over a of λ̃(z_a) with respect to the element of volume dz_D = dz_1 dz_2 ... dz_N in [g]^N, where Λ([g]) = ∫_[g] λ̃(z) dz. The number N of registered cognacy classes is N ∼ Poisson{Λ([g])}.
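The thinning argument can be illustrated on a single branch. The toy simulation below is ours, with illustrative rates: cognacy classes are born at rate λ at heights τ above a node, each survives down to the node with probability exp(−μτ), and the mean count of survivors is compared with the Poisson mean (λ/μ){1 − exp(−μΔ)} for a branch of length Δ.

```python
import math
import random

random.seed(1)

lam, mu, delta = 3.0, 0.7, 2.0   # illustrative rates and branch length
Lambda = (lam / mu) * (1.0 - math.exp(-mu * delta))   # mean of thinned process

def registered_count():
    """One draw of the number of births on a branch of length delta
    that survive down to the node at its lower end; a birth at height
    t above the node survives with probability exp(-mu * t)."""
    n = 0
    t = random.expovariate(lam)   # height of the first birth
    while t < delta:
        if random.random() < math.exp(-mu * t):
            n += 1
        t += random.expovariate(lam)
    return n

reps = 200000
mean_count = sum(registered_count() for _ in range(reps)) / reps
```

By the thinning property of Poisson processes the retained count is exactly Poisson with mean Λ, so the empirical mean should sit well within Monte Carlo error of Λ.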

Likelihood calculations
We give the likelihood for g, μ, λ, κ, ρ and ξ given the data, along with an efficient algorithm to compute the sum over all missing data. The catastrophe process is left out, and reincorporated in the next section.
Since we only ever see registered data, the likelihood for g, λ, μ and ξ is the probability P{D = D | g, μ, λ, ξ, D = R(D̃)} of obtaining data D given the parameters and conditional on the data having passed registration. We restore the birth locations (and so omit λ from the conditioning) and factorize by using the joint independence of the D_a, a = 1, 2, ..., N, under the given conditions; each factor is an integral over the birth location z_a of P(D_a = D_a | g, μ, ξ, Z_a = z_a). The last step follows because P(E_{Z_a} | D_a = D_a, ..., Z_a = z_a) = 1 for D_a a column of registered data: if the outcome of the birth at z_a was the registerable data D_a, then the event E_{Z_a} certainly occurs. The likelihood depends on the awkward condition D = R(D̃) only through Λ([g]), the mean number of registered cognacy classes, whereas P(D_a = D_a | g, μ, ξ, Z_a = z_a) is the probability of realizing the data vector D_a in the unconditioned diversification-missing-element process. The calculation has so far extended Nicholls and Gray (2008) to give the likelihood for a greater variety of column thinning rules. We now add the missing element component of the registration process.
We sum over possible values of the missing matrix elements in the registered data. Since P(D_a = D_a | g, μ, λ, ξ, Z_a = z_a) is not conditioned on the requirement that the column D_a becomes registered, the entries of the corresponding column I_a are determined by the unconditioned Bernoulli process. The resulting likelihood is equation (1), in which we have switched from summing d* ∈ 𝒟_a to the equivalent set representation ω ∈ Ω_a. For the two integrated quantities in equation (1) we have tractable recursive formulae, using a pruning procedure akin to that of Felsenstein (1981). We begin with Λ([g]).
We assume that the registration rule includes at least condition (a). It follows that a cognacy class that is born at Z = (τ, i) in [g] must survive down to the node below, at Z = (t_i, i), to be registered, and so P{E_Z | Z = (τ, i), g, μ, ξ} = P{E_Z | Z = (t_i, i), g, μ, ξ} exp{−μ(τ − t_i)}. We can substitute this into the expression for Λ([g]) and integrate, to obtain equation (3). The quantities u_i^(n) appearing there satisfy u_i^(n) = 0 unless s_i ≥ n, where s_i denotes the number of leaves below node i, so, for example, u_i^(L) is non-zero only at the root node. Since our main application is for data registered under condition (a), we give, in the body of this paper, the recursions for that case. The recursion for the u_i^(n) is evaluated upwards from the leaves i ∈ V_L.

We now give the equivalent recursions for the second integrated quantity in equation (1), λ ∫_[g] Σ_{ω∈Ω_a} P(M = ω | Z = z_a, g, μ, ξ) dz_a. Consider the set m_a = ∩_{ω∈Ω_a} ω of leaves that are known to have a cognate in the ath registered cognacy class. Let E_a be the set of branches on the path from the most recent common ancestor of the leaves in m_a up to the Adam node A above the root. Cognacy class a must have been born on an edge in E_a. For a = 1, 2, ..., N, class M_a is non-empty, and we can shift the birth location to the node below its edge and convert the integral to a sum over the edges in E_a. For each a = 1, 2, ..., N and ω ∈ Ω_a, let ω^(i) = ω ∩ V_L^(i), where V_L^(i) is the set of leaves below node i, and denote by Ω_a^(i) the set of all subsets ω^(i) of the leaves V_L^(i) which are cognacy classes consistent with the data that are available for those leaves. Consider two child branches ⟨c_1, i⟩ and ⟨c_2, i⟩ at node i. Since ω^(i) = ω^(c_1) ∪ ω^(c_2), and events are independent along the two branches, the required probability at node i factorizes over the two children. Having moved the birth event at (t_i, i) to (t_i, c_1) and (t_i, c_2) (off the node and onto its child edges), we then move the birth event at (t_i, c) to (t_c, c) (down an edge). The recursion is evaluated from the leaves, at which it is initialized directly from the data and the observation probabilities ξ. To restore catastrophes to this calculation, given g ∈ Γ_K with k_i catastrophes on edge ⟨i, j⟩ ∈ E, replace t_j − t_i with t_j − t_i + k_i T_C(κ, μ), and replace all conditions on μ with conditions on μ and κ throughout.
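A minimal sketch of a pruning computation of Λ([g]) under registration rule (a) follows. This is our own reconstruction from the description above, not the authors' recursion or notation: for each node i we compute the probability v_i that a cognate present at i is observed at no leaf below it, and accumulate the contributions of births on each branch (and on the infinite Adam edge, via the root).

```python
import math

def detect_fail_prob(node, children, length, xi, mu):
    """v_i: probability that a cognate present at node i is observed
    at no leaf below i.  A leaf i fails to show the cognate with
    probability 1 - xi_i; an internal node fails if, on each child
    branch, the cognate dies before reaching the child or fails
    everywhere below it."""
    kids = children.get(node, [])
    if not kids:
        return 1.0 - xi[node]
    v = 1.0
    for c in kids:
        surv = math.exp(-mu * length[c])   # survival along branch <c, node>
        v *= 1.0 - surv * (1.0 - detect_fail_prob(c, children, length, xi, mu))
    return v

def expected_registered(root, children, length, xi, mu, lam):
    """Mean number Lambda([g]) of cognacy classes registered under
    rule (a): births on the infinite Adam edge contribute lam/mu times
    the detection probability at the root; births on a finite branch
    contribute via the node at its lower end."""
    total = (lam / mu) * (1.0 - detect_fail_prob(root, children, length, xi, mu))
    stack = list(children.get(root, []))
    while stack:
        i = stack.pop()
        total += (lam / mu) * (1.0 - math.exp(-mu * length[i])) * \
                 (1.0 - detect_fail_prob(i, children, length, xi, mu))
        stack.extend(children.get(i, []))
    return total

# A two-leaf example: root "r" at age 2, leaves "a" and "b" at age 0,
# with leaf "b" only 70% observable.
children = {"r": ["a", "b"]}
length = {"a": 2.0, "b": 2.0}
xi = {"a": 1.0, "b": 0.7}
Lambda_g = expected_registered("r", children, length, xi, mu=0.5, lam=1.0)
```

In practice the v_i would be memoized in a single post-order pass rather than recomputed at each branch, but the logic is the same.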

Posterior distribution
Our prior on the birth rate λ, death rate μ and catastrophe rate ρ is p(λ, μ, ρ) ∝ 1/(λμρ), and we take a uniform prior over [0, 1] for the probability of death at a catastrophe, κ, and for each missing data parameter ξ_i. Substituting equations (2) and (3) into equation (1) and multiplying by the prior f_G(g | T) p(λ, μ, ρ), we obtain the posterior distribution p(g, μ, λ, κ, ρ, ξ | D) for parameters μ, λ, κ, ρ > 0, 0 < ξ_i < 1 and trees g ∈ Γ_K^(C). The posterior is improper without bounds on ρ, since k_T = 0 is allowed. We place very conservative bounds on ρ and μ. Results are not sensitive to this choice; see the supplementary material for details.

Markov chain Monte Carlo sampling
We use MCMC methods to sample the posterior distribution and to estimate summary statistics. If the prior on the cognacy class birth rate parameter λ has the conjugate form λ^u exp(−vλ), then the conditional distribution of λ in the posterior distribution above has the form λ^{N+u} exp{−λ(X + v)}. We took the improper prior u = −1 and v = 0 for λ and integrated. The MCMC state is then x = ((E, V, t, k), μ, κ, ρ, ξ), and the target distribution p(x | D) is the density that is obtained by integrating the density in equation (8) over λ.
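The integration over λ is a gamma integral: with u = −1 and v = 0 the conditional is proportional to λ^{N−1} exp(−λX), which integrates in closed form. The sketch below (with illustrative N and X of our choosing) checks the generic identity, the integral over λ of λ^N exp(−λX) equals N!/X^{N+1}, by crude quadrature.

```python
import math

# Illustrative values: N plays the role of the number of registered
# classes, X the coefficient of lambda in the exponent.
N, X = 5, 2.0

# Closed form: the normalizing constant of a gamma density.
closed = math.factorial(N) / X ** (N + 1)

# Trapezoidal quadrature; the integrand is negligible beyond `upper`.
h, upper = 1e-3, 40.0
grid = [k * h for k in range(int(upper / h) + 1)]
f = [t ** N * math.exp(-X * t) for t in grid]
numeric = h * (sum(f) - 0.5 * (f[0] + f[-1]))
```

Because this integral is available in closed form, λ can be marginalized exactly rather than sampled, shrinking the MCMC state space by one dimension.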
The MCMC sampler that was described in Nicholls and Gray (2008) has state x = ((E, V, t), μ). We add to the (E, V, t)- and μ-updates of Nicholls and Gray (2008) further MCMC updates acting on the catastrophe vector k = (k_1, ..., k_{2L−2}), on the catastrophe parameters κ and ρ, and on ξ, the probability parameter for observable data matrix elements. The catastrophe rate parameter ρ is added to time-scaling updates in which subsets of parameters are simultaneously scaled by a common random factor s: if θ is a parameter in the scaled subset having units time^u, then θ → s^u θ. The probability 0 < ξ_i < 1 for an element of the registered data matrix to be observable is, for many leaf languages, close to 1, so we update those parameters by scaling 1 − ξ_i.
We incorporate updates adding and deleting catastrophes (the filled dots that are marked on the branches of Fig. 1) plus an update which moves a catastrophe from an edge to a parent, child or sibling edge. For the addition and deletion of catastrophes, we do not need to use reversible jump MCMC methods, as the state vector specifies the numbers, and not the locations, of catastrophes on edges.
We omit the details of these moves but give, as an example, the update that moves a catastrophe from an edge to a parent, child or sibling edge. Let k_T = Σ_{i=1}^{2L−2} k_i be the total number of catastrophes. Given a state x = (g, μ, κ, ρ, ξ) with g = (V, E, t, k), we pick edge ⟨i, j⟩ ∈ E with probability k_i/k_T. Let E_{i,j} be the set of edges neighbouring edge ⟨i, j⟩ (child, sibling and parent edges, but excluding the edge ⟨r, A⟩) and let q_i = card(E_{i,j}). We have in general q_i = 4. However, for i the index of a leaf node, q_i = 2 (one parent; one sibling; no children). If j is the root and i is not a leaf, then q_i = 3 (one sibling; two children), and if j is the root and i is a leaf we have q_i = 1 (a sibling edge). Choose a neighbouring edge ⟨ĩ, j̃⟩ uniformly at random from E_{i,j} and move one catastrophe from ⟨i, j⟩ to ⟨ĩ, j̃⟩. The candidate state is x′ = ((V, E, t, k′), μ, κ, ρ, ξ), with k′_i = k_i − 1, k′_ĩ = k_ĩ + 1 and k′_j = k_j for j ≠ i, ĩ. This move is accepted with probability α(x′ | x) = min{1, q_i k′_ĩ p(x′ | D) / (q_ĩ k_i p(x | D))}. We assessed convergence with the asymptotic behaviour of the autocorrelation for the parameters μ, κ, ρ and t_r, as suggested by Geyer (1992). This method indicated that we could use runs of about 10 million samples; we also let the MCMC algorithm run for 100 million samples and checked that the computed statistics did not vary.
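The acceptance probability above is the Metropolis-Hastings correction for this asymmetric proposal, and its correctness can be checked by enumeration on a toy example. In the sketch below, everything is our own construction: a three-edge path graph, an arbitrary positive target over catastrophe placements with fixed total k_T, and the move's off-diagonal transition probabilities, which should satisfy detailed balance with respect to the target.

```python
from itertools import product

# Toy setting: three edges forming a path, so edge 1 neighbours edges
# 0 and 2; q_i is the number of neighbours of edge i.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
weights = [1.0, 2.0, 3.0]   # arbitrary positive weights for the toy target
kT = 2

def target(k):
    """Unnormalized toy posterior over catastrophe vectors k."""
    p = 1.0
    for i, ki in enumerate(k):
        p *= weights[i] ** ki
    return p

# All placements of kT catastrophes on the three edges.
states = [k for k in product(range(kT + 1), repeat=3) if sum(k) == kT]

def transition(x, y):
    """Probability of proposing and accepting the move x -> y (x != y):
    pick edge i with probability k_i/kT, pick a neighbour j uniformly
    among the q_i neighbours, move one catastrophe, and accept with the
    ratio quoted in the text."""
    prob = 0.0
    for i in range(3):
        if x[i] == 0:
            continue
        for j in neighbors[i]:
            k2 = list(x)
            k2[i] -= 1
            k2[j] += 1
            if tuple(k2) != y:
                continue
            q_i, q_j = len(neighbors[i]), len(neighbors[j])
            alpha = min(1.0, q_i * k2[j] * target(y) / (q_j * x[i] * target(x)))
            prob += (x[i] / kT) * (1.0 / q_i) * alpha
    return prob
```

Detailed balance, target(x)·transition(x, y) = target(y)·transition(y, x) for all pairs of states, confirms that the q and k factors in the acceptance ratio exactly cancel the asymmetry of the proposal.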

Validation
We made a number of tests using synthetic data. Fitting the model to synthetic data simulated according to the likelihood P{D = D | g, μ, λ, ρ, κ, ξ, D = R(D̃)} (in-model data) shows us just how informative the data are for catastrophe placement, as well as providing a check on our implementation.
We simulated data under our model for various values of λ, μ, κ and ρ, to explore the various possible scenarios of rate heterogeneity, such as few large catastrophes or many small catastrophes. We synthesized data for 20 languages, with on average λ/μ = 40 traits per language and five clade constraints. We set μ = 2 × 10^−5 deaths per year, since the estimates for that parameter in Nicholls and Gray (2008) were around that value. We studied small (κ = 0.1), medium (κ = 0.2) and large (κ = 0.5) catastrophes (corresponding, by T_C = −log(1 − κ)/μ, to catastrophes equivalent to approximately 5270, 11160 and 34660 years of change respectively). We show typical results in Fig. 2. The true topology was almost perfectly reconstructed in every case; the position and size of catastrophes were perfectly reconstructed; the posterior evaluations of the parameters were close to the true values. This remains true for a wide variety of parameter values.
When catastrophes are present in the true tree, the signal for them is strong. We wished to check the influence of catastrophes on our reconstructions, so we tried to fit a model without catastrophes to data which did, in fact, evolve with catastrophes. Fig. 3 shows that, if we do not include catastrophes in the model, our reconstructed parameters (root age and death rate μ) are very far from the true values. The reconstructed topology is also much further from the truth than when we fit the model with catastrophes.
We also fitted out-of-model data: synthetic data that are simulated under likely model violation scenarios and are used to identify sources of systematic bias. We summarize results for synthetic data simulated by using the parameter values that we estimate in Section 8 on the real data. For in-model data we correctly reconstruct the topology, the root age and the number and position of catastrophes. Further details are given in the supplementary material. Nicholls and Gray (2008) used out-of-model data representing unmodelled loan words (which is called borrowing), rate heterogeneity in time and space, rate heterogeneity across cognacy classes and the empty field approximation. They also discussed model misspecification due to missing data and the incorrect identification of cognacy classes (in particular, a hazard for deeply rooted classes to be split). Rate heterogeneity in time and space, and missing data, are now part of the in-model analysis.

Model misspecification
We synthesized out-of-model data with the borrowing model of Nicholls and Gray (2008), to see whether unmodelled borrowing biased our results. For what Nicholls and Gray (2008) called low-to-moderate levels of borrowing, we could reconstruct true parameter values well; see the supplementary material for details. With high levels of borrowing, we underestimate the root age and overestimate the rate parameters. However, unidentified loan words in the registered data do not generate model misspecification unless they are copies of cognates which occur in the observed meaning categories of languages that are ancestral to the leaf languages. It is not plausible that unidentified borrowing of this kind is present at high levels.

Fig. 3. Importance of including the catastrophes: given data synthesized under a true tree with catastrophes (a), which was well reconstructed by a model with catastrophes (b), we tried to fit a model without catastrophes; the topology (c), root age (d) and death rate μ (e) were all badly reconstructed.

We have not repeated the Nicholls and Gray (2008) out-of-model analysis of rate heterogeneity across cognacy classes.
In a real vocabulary, distinct cognacy classes that share a meaning category would not evolve independently. Also, a real language might be expected to have a word in each of the core meaning categories. This constrains the number of cognacy classes in each meaning category to be non-zero. Our model allows empty meaning categories. Nicholls and Gray (2008) found no substantial bias in a fit to out-of-model data respecting the constraint.
Our treatment of missing data introduces a new misspecification. We modelled matrix elements as missing independently. However, missing data arise when we do not know the word that is used in a given language to cover a given meaning category, so matrix elements are typically missing in blocks corresponding to all the cognacy classes for that meaning. Because the ages of the languages with the most missing data are well predicted in the cross-validation study below, we have not pursued this issue further.

Prediction tests for calibration
In the next section we estimate the age of a tree node (the root). In this section we test to see whether the uncertainties that we estimate are reliable.
The calibration data that were described in Section 2 fix the topologies and root ages of the subtrees that are marked with bars in Fig. 6 in Section 8. We remove each calibration constraint in turn and use the data and the remaining constraints to estimate the age of the constrained node, and the probability for the constrained subtree topology. The topological constraints were all perfectly reconstructed. In 25 of the 30 tests the constrained age interval overlaps the 95% highest posterior density interval, as shown in the bottom half of Fig. 4.
How good or bad is each of these predictions? Looking at the Hittite prediction in Fig. 4, the large prediction uncertainty only just allows the calibration interval. Is this bad? We quantify the goodness of fit, for each calibration, by using Bayes factors in place of p-values as indices of misfit. For each constraint c = 1, 2, . . . , C we compute a Bayes factor measuring the support for the fully constrained model compared with a model with just the cth constraint removed.
For c = 1, 2, . . . , C let

B_{C,C−c} = P(D | g ∈ Γ_K^C) / P(D | g ∈ Γ_K^{C−c})
          = P(D | g ∈ Γ_K^C, g ∈ Γ_K^{C−c}) / P(D | g ∈ Γ_K^{C−c})
          = P(g ∈ Γ_K^C | D, g ∈ Γ_K^{C−c}) / P(g ∈ Γ_K^C | g ∈ Γ_K^{C−c}),

where the second line follows since Γ_K^C ⊂ Γ_K^{C−c} and the third from the definition of the conditional probabilities. The numerator P(g ∈ Γ_K^C | D, g ∈ Γ_K^{C−c}) is the posterior probability for the cth constraint to be satisfied given the data and the other constraints. The denominator P(g ∈ Γ_K^C | g ∈ Γ_K^{C−c}) is the prior probability for the cth constraint to be satisfied given the other constraints. We estimate these probabilities by simulating the posterior and prior distributions with constraint c removed. The Bayes factors are estimated with negligible uncertainty, and 2 log(B_{C,C−c}) is plotted for c = 1, 2, . . . , C in the top half of Fig. 4.
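Because B_{C,C−c} is the ratio of a posterior to a prior probability of the held-out constraint, the sampler output delivers it directly: run the chain (and a prior-only chain) with constraint c removed and count how often the constraint happens to be satisfied. A minimal sketch of that estimator follows; the 0/1 indicator lists stand in for real MCMC output:

```python
import math

def bayes_factor_estimate(post_ok, prior_ok):
    """Estimate B_{C,C-c} as the ratio of posterior and prior probabilities
    that constraint c holds, each estimated as a Monte Carlo frequency over
    samples drawn WITHOUT constraint c.  post_ok and prior_ok are sequences
    of 0/1 indicators, one per sample."""
    p_post = sum(post_ok) / len(post_ok)
    p_prior = sum(prior_ok) / len(prior_ok)
    return p_post / p_prior

# Example: constraint satisfied in 60% of unconstrained posterior samples
# but only 5% of prior samples, giving B = 12 in favour of the constraint.
B = bayes_factor_estimate([1] * 60 + [0] * 40, [1] * 5 + [0] * 95)
two_log_B = 2 * math.log(B)
```

In practice many samples are needed when the prior probability of the constraint is small, which is why long runs are required before the Bayes factors are estimated with negligible uncertainty.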
Strong evidence against a calibration is failure at prediction. Taking a Bayes factor exceeding 12 against the constraint (i.e. 2 log(B_{C,C−c}) ≤ −5 in Fig. 4) as strong evidence, following Raftery (1996), we have conflict for three of the 30 constraints: the ages of Old Irish and Avestan, and the age of the Balto-Slav clade. As our analysis in Section 8 shows, there is a high posterior probability that a catastrophe event occurred on the branch between Old Irish and Welsh, and another between Old Persian and Avestan. The evidence for rate heterogeneity in the rest of the tree is so slight that, when we try to predict these calibrations, we are predicting atypical events.
Our missing data analysis has helped here. The calibration interval for the Hittite vocabulary in these data is 3200-3700 BP. A reconstruction of the age of Hittite which ignores missing data predicts 60-2010 BP, which is well outside the constraints. The 95% highest posterior density interval for the age of Hittite in our model is 430-3250 BP, which just overlaps the constraint. The Bayes factor gives odds that are less than 3:1 against, so the evidence against the constraint is 'hardly worth mentioning'.

Results
In this section we present results for our MCMC simulation of the full posterior, equation (8). An upper limit T = 16 000 was used in the tree prior of Section 3.1. Any value for T exceeding around T = 10 000 would lead to the same results. We show a consensus tree in Fig. 5. In a tree, an edge corresponds to a split partitioning the leaves into two sets. A consensus tree displays just those splits that are present in at least 50% of the posterior sample. Splits which receive less than 95% support are labelled. Where no split is present in 50% of the posterior sample, the consensus tree is multifurcating. The date that is shown for a node is the average posterior date given the existence of the split; similarly, the number of catastrophes that is shown on an edge is the average posterior number of catastrophes on that edge given the existence of the split, rounded to the nearest integer. The 95% highest posterior probability density intervals for our estimates for the parameters are as follows: μ = 1.86 × 10⁻⁴ ± 1.47 × 10⁻⁵ deaths per year; κ = 0.361 ± 0.055; ρ = 6.4 × 10⁻⁵ ± 3.3 × 10⁻⁵ catastrophes per year (corresponding to large but rare catastrophes: about one catastrophe every 15 000 years, or an average of 3.4 on the tree, with each catastrophe corresponding to 2400 years of change).
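The "large but rare" summaries quoted for ρ and κ follow from the point estimates by simple arithmetic. The sketch below reproduces them, assuming (as the figures in the text suggest) that a catastrophe of size κ is equated to T years of ordinary change via exp(−μT) = 1 − κ:

```python
import math

# Posterior point estimates quoted in the text
mu = 1.86e-4    # deaths per year
kappa = 0.361   # catastrophe size
rho = 6.4e-5    # catastrophes per year

# Mean waiting time between catastrophes on a lineage
years_between = 1.0 / rho
# Years of ordinary anagenetic change equivalent to one catastrophe,
# from exp(-mu * T) = 1 - kappa  =>  T = -log(1 - kappa) / mu
years_of_change = -math.log(1.0 - kappa) / mu
```

These recover the roughly 15 000-year waiting time and the roughly 2400-year change-equivalent per catastrophe quoted above.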
We display in Fig. 6 the calibration constraints on a tree sampled from the posterior. The constraints cannot be shown on a consensus tree, as slices across a consensus tree are not isochronous.
The analysis reconstructs some well-known features of the Indo-European tree. The Germanic, Celtic and Italic families are grouped together, but no particular configuration of their relative positions is favoured. The Indo-Persian group can fall outside the Balto-Slav group but the relative position of these two is uncertain. The deep topology of the tree is left quite uncertain by these data, especially the position of Albanian. We find evidence for catastrophic rate heterogeneity in three positions: on the edges leading to Old Irish and Old Persian, and in the Umbrian-Oscan clade.

Nicholls and Gray (2008) analysed a subset of the Ringe et al. (2002) data (they discarded languages with too much missing data). Where the comparison is possible, the topology that they reconstructed is similar to the consensus tree that is presented in Fig. 5; the main differences are that we have more uncertainty on the relative positions of the Germanic, Celtic and Italic families and on the relative positions of the Balto-Slav and Indo-Iranian families, and that we find somewhat more support for a subtree grouping Hittite, Luvian, Lycian and the Tocharian languages.
The 95% highest posterior probability density interval for our estimate for the root age of the Indo-European family is 7110-9750 years BP. The distribution of this key statistic is close to normal. This is in line with previous quantitative analyses of Indo-European lexical data (Gray and Atkinson, 2003; Nicholls and Gray, 2008).

Conclusions
Our results give a root age for the most recent common ancestor of the Indo-European family of language vocabularies in agreement with earlier phylogenetic studies. Our results are in agreement with models which put this date around 8500 BP, and in conflict with models which require it to be less than 6500 years BP. Our studies of synthetic out-of-model data and reconstruction tests for known historical data support our view that this main result is robust to model error. It would not be robust to a step change in the rate of lexical diversification acting in a co-ordinated fashion across the Indo-European languages extant about 3000-5000 years ago.
The methods that are outlined here for handling missing data and rate heterogeneity in the diversification of languages, as seen through lexical data, will find applications to generic trait data.
The recursion is evaluated from the leaves i ∈ V_L, at which