Nonparametric Bayesian estimation of multivariate Hawkes processes

This paper studies nonparametric estimation of parameters of multivariate Hawkes processes. We consider the Bayesian setting and derive posterior concentration rates. First rates are derived for L1-metrics for stochastic intensities of the Hawkes process. We then deduce rates for the L1-norm of interactions functions of the process. Our results are exemplified by using priors based on piecewise constant functions, with regular or random partitions and priors based on mixtures of Betas distributions. Numerical illustrations are then proposed with in mind applications for inferring functional connec-tivity graphs of neurons.

1. Introduction. In this paper we study the properties of Bayesian nonparametric procedures in the context of multivariate Hawkes processes. The aim of this paper is to give some general results on posterior concentration rates for such models and to study some families of nonparametric priors.
1.1. Hawkes processes. Hawkes processes, introduced by Hawkes (1971), are specific point processes which are extensively used to model data whose occurrences depend on previous occurrences of the same process. To describe them, we first consider N a point process on R. We denote by B(R) the Borel σ-algebra on R and for any Borel set A ∈ B(R), we denote by N (A) the number of occurrences of N in A. For short, for any t ≥ 0, N t denotes the number of occurrences in [0, t]. We assume that for any t ≥ 0, N t < ∞ almost surely. If G t is the history of N until t, then, λ t , the predictable intensity of N at time t, which represents the probability to observe a new occurrence at time t given previous occurrences, is defined by λ t dt = P(dN t = 1 | G t − ), for φ : R → R + and h : R → R. We recall that the last integral means The case of linear Hawkes processes corresponds to φ(x) = ν +x and h(t) ≥ 0 for any t. The parameter ν ∈ R * + is the spontaneous rate and h is the self-exciting function. We now assume that N is a marked point process, meaning that each occurrence T i of N is associated to a mark m i ∈ {1, . . . , K}, see [17]. In this case, we can identify N with a multivariate point process and for any k ∈ {1, . . . , K}, N k (A) denotes the number of occurrences of N in A with mark k. In the sequel, we only consider linear multivariate Hawkes processes, so we assume that λ k t , the intensity of N k , is where ν k > 0 and h ,k , which is assumed to be non-negative and supported by R + , is the interaction function of N on N k . Theorem 7 of [9] shows that if the K × K matrix ρ, with has a spectral radius strictly smaller than 1, then there exists a unique stationary distribution for the multivariate process N = (N k ) k=1,...,K with the previous dynamics and finite average intensity. Hawkes processes have been extensively used in a wide range of applications. They are used to model earthquakes [42,30,45], interactions in social networks [41,44,27,4,16,28,43], financial data [18,6,5,3,1], violence rates [29,34], genomes [23,11,39] or neuronal activities [10,15,31,32,33,24,36,37], to name but a few.
Parametric inference for Hawkes models based on the likelihood is the most common in the literature and we refer the reader to [30,11] for instance. Non-parametric estimation has first been considered by Reynaud-Bouret and Schbath [39] who proposed a procedure based on minimization of an 2 -criterion penalized by an 0 -penalty for univariate Hawkes processes. Their results have been extended to the multivariate setting by Hansen, Reynaud-Bouret and Rivoirard [24] where the 0 -penalty is replaced with an 1 -penalty. The resulting Lasso-type estimate leads to an easily implementable procedure providing sparse estimation of the structure of the underlying connectivity graph. To generalize this procedure to the high-dimensional setting, Chen, Witten and Shojaie [14] proposed a simple and computationally inexpensive edge screening approach, whereas Bacry, Gaïffas and Muzy [4] combine These are all frequentist methods; Bayesian approaches for Hawkes models have received much less attention. To the best of our knowledge, the only contributions for the Bayesian inference are due to Rasmussen [35] and Blundell, Beck and Heller [8] who explored parametric approaches and used MCMC to approximate the posterior distribution of the parameters.
1.2. Our contribution. In this paper, we study nonparametric posterior concentration rates when T → +∞, for estimating the parameter f = ((ν k ) k=1,...,K , (h ,k ) k, =1,...,K ) by using realizations of the multivariate process (N k t ) k=1,...,K for t ∈ [0, T ]. Analyzing asymptotic properties in the setting where T → +∞ means that the observation time becomes very large hence providing a large number of observations. Note that along the paper, K, the number of observed processes, is assumed to be fixed and can be viewed as a constant. Considering K → +∞ is a very challenging problem beyond the scope of this paper. Using the general theory of Ghosal and van der Vaart [19], we express the posterior concentration rates in terms of simple and usual quantities associated to the prior on f and under mild conditions on the true parameter. Two types of posterior concentration rates are provided: the first one is in terms of the L 1 -distance on the stochastic intensity functions (λ k ) k=1,...,K and the second one is in terms of the L 1 -distance on the parameter f (see precise notations below). To the best of our knowledge, these are the first theoretical results on Bayesian nonparametric inference in Hawkes models. Moreover, these are the first results on L 1 -convergence rates for the interaction functions h ,k . In the frequentist literature, theoretical results are given in terms of either the L 2 -error of the stochastic intensity, as in [4] and [7], or in terms of the L 2 -error on the interaction functions themselves, the latter being much more involved, as in [39] and [24]. In [39], the estimator is constructed using a frequentist model selection procedure with a specific family of models based on piecewise constant functions. In the multivariate setting of [24], more generic families of approximation models are considered (wavelets of Fourier dictionaries) and then combined with a Lasso procedure, but under a somewhat restrictive assumption on the size of models that can be used to construct the estimators (see Section 5.2 of [24]). Our general results do not involve such strong conditions and therefore allow us to work with approximating families of models that are quite general. In particular, we can apply them to two families of prior models on the interaction functions h ,k : priors based on piecewise constant functions, with regular or random partitions and priors based on mixtures of Betas distributions. From the posterior concentration rates, we also deduce a frequentist convergence rate for the posterior mean, seen as a point estimator. We finally propose an MCMC algorithm to simulate from the posterior distribution for the priors constructed from piecewise constant functions and a simulation study is conducted to illustrate our results.
1.3. Overview of the paper. In Section 2, Theorem 1 first states the posterior convergence rates obtained for stochastic intensities. Theorem 2 constitues a variation of this first result. From these results, we derive L 1 -rates for the parameter f (see Theorem 3) and for the posterior mean (see Corollary 1). Examples of prior models satisfying conditions of these theorems are given in Section 2.3. In Section 3, numerical results are provided.

Notations and assumptions.
We denote by f 0 = ((ν 0 k ) k=1,...,K , (h 0 ,k ) k, =1,...,K ) the true parameter and assume that the interaction functions h 0 ,k are supported by a compact interval [0, A], with A assumed to be known. Given a parameter f = ((ν k ) k=1,...,K , (h ,k ) k, =1,...,K ), we denote by ρ the spectral norm of the matrix ρ associated with f and defined in (1.2). We recall that ρ provides an upper bound of the spectral radius of ρ and we set H = {(h ,k ) k, =1,...,K ; h ,k ≥ 0, support(h ,k ) ⊂ [0, A], ρ ,k < ∞, ∀ k, = 1, . . . , K, ρ < 1} and We assume that f 0 ∈ F and denote by ρ 0 the matrix such that ρ 0 ,k = A 0 h 0 ,k (t)dt. For any function h : R → R, we denote by h p the L p -norm of h. With a slight abuse of notations, we also use for Finally, we consider d 1,T , the following stochastic distance on F: where λ k t (f ) and λ k t (f ) denote the stochastic intensity (introduced in (1.1)) associated with f and f respectively. We denote by N (u, H 0 , d) the covering number of a set H 0 by balls with respect to the metric d with radius u. We set for any , µ 0 the mean of λ t (f 0 ) under P 0 where P 0 denotes the stationary distribution associated with f 0 and E 0 is the expectation associated with P 0 . We also write u T v T if |u T /v T | is bounded when T → +∞ and similarly u T v T if |v T /u T | is bounded.
2. Main results. This section contains main results of the paper. We first provide an expression for the posterior distribution.
2.1. Posterior distribution. Using Proposition 7.3.III of [17], and identifying a multivariate Hawkes process as a specific marked Hawkes process, we can write the log-likelihood function of the process observed on the interval [0, T ], conditional on With a slight abuse of notation, we shall also denote L T (λ) instead of L T (f ).
Recall that we restrict ourselves to the setup where for all , k, h ,k has support included in [0, A] for some fixed A > 0. This hypothesis is very common in the context of Hawkes processes, see [24]. Note that, in this case, the conditional distribution of (N k ) k=1,...,K observed on the interval [0, T ] given G 0 − is equal to its conditional distribution given Hence, in the following, we assume that we observe the process (N k ) k=1,...,K on [−A, T ], but we base our inference on the log-likelihood (2.1), which is associated to the observation of (N k ) k=1,...,K on [0, T ]. We consider a Bayesian nonparametric approach and denote by Π the prior distribution on the parameter f = ((ν k ) k=1,...,K , (h ,k ) k, =1,...,K ). The posterior distribution is then formally equal to We approximate it by the following pseudo-posterior distribution, which we write Π (·|N ) which thus corresponds to choosing dΠ(f ) = dΠ(f |G − 0 ).

2.2.
Posterior convergence rates for d 1,T and L 1 -metrics. In this section we give two results of posterior concentration rates, one in terms of the stochastic distance d 1,T and another one in terms of the L 1 -distance, which constitutes the main result of this paper. We define T and δ 0 > 0 and C α two positive constants not depending on T . From Lemmas 3 and 4 in Section 4.7, we have that for all α > 0 there exist C α > 0 and δ 0 > 0 only depending on α and f 0 such that when T is large enough. In the sequel, we take α > 1 and C α accordingly. Note in particular that, on Ω T , when T is large enough. We then have the following theorem.
Theorem 1. Consider the multivariate Hawkes process (N k ) k=1,...,K observed on [−A, T ], with likelihood given by (2.1). Let Π be a prior distribution on F. Let T be a positive sequence such that and assume following conditions are satisfied for T large enough.
(i) There exists c 1 > 0 and B > 0 such that (ii) There exists a subset H T ⊂ H, such that where κ T := κ log(r −1 T ) log log T , with r T defined in (4.11) and κ defined in (4.9). (iii) There exist ζ 0 > 0 and x 0 > 0 such that Then, there exist M > 0 and C > 0 such that Assumptions (i), (ii) and (iii) are very common in the literature about posterior convergence rates. As expressed by Assumption (ii), some conditions are required on the prior on H T but not on F T . Except the usual concentration property of ν around ν 0 expressed in the definition of B( T , B), which is in particular satisfied if ν has a positive continuous density with respect to Lebesgue measure, we have no further condition on the tails of the distribution of ν.
Remark 1. As appears in the proof of Theorem 1, the term √ log log T appearing in the posterior concentration rate can be dropped if B( T , B) is replaced by in Assumption (i). In this case, r T = 1/2 in Assumption (ii) and κ T does not depend on T . This is used for instance in Section 2.3.1 to study random histograms priors whereas mixtures of Beta priors are controlled using the L 2 -norm.
Similarly to other general theorems on posterior concentration rates, we can consider some variants. Since the metric d 1,T is stochastic, we cannot use slices in the form d 1,T (f 0 , f ) ∈ (j T , (j + 1) T ) as in Theorem 1 of Ghosal and van der Vaart [19], however we can consider other forms of slices, using a similar idea as in Theorem 5 of Ghosal and van der Vaart [20]. This is presented in the following theorem.
Theorem 2. Consider the setting and assumptions of Theorem 1 except that assumption (iii) is replaced by the following one: There exists a sequence of sets for some positive constant x 0 > 0. Then, there exists M > 0 such that The posterior concentration rates of Theorems 1 and 2 are in terms of the metric d 1,T on the intensity functions, which are data dependent and therefore not completely satisfying to understand concentration around the objects of interest namely f 0 . We now use Theorem 1 to provide a general result to derive a posterior concentration rate in terms of the L 1 -norm.
Theorem 3. Assume that the prior Π satisfies following assumptions.
(i) There exists ε T = o(1) such that ε T ≥ δ T (see the definition of Ω T ) and c 1 > 0 such that The prior on ρ satisfies : for all u 0 > 0, when T is large enough, Then, for any w T → +∞, Remark 2. Condition (i) of Theorem 3 is in particular verified under the assumptions of Theo- Remark 3. Compared to Theorem 1, we also assume (ii), i.e. that the prior distribution puts very little mass near the boundary of space {f ; ρ < 1}. In particular, if under Π, ρ has its support included in [0, 1 − ] for a fixed small > 0 then (2.5) is verified.
A consequence of previous theorems is that the posterior meanf = E π [f |N ] is converging to f 0 at the rate ε T , which is described by the following corollary.
Corollary 1. Under the assumptions of Theorem 1 or Theorem 2, together with (2.5) with ε T = √ log log T T and if F f 1 dΠ(f ) < +∞, then for any w T → +∞ The proof of Corollary 1 is given in Section 4.6. We now illustrate these general results on specific prior models.

2.3.
Examples of prior models. The advantage of Theorems 1 and 3 is that the conditions required on the priors on the functions h k, are quite standard, in particular if the functions h k, are parameterized in the following way We thus consider priors on θ = (ν , ρ k, ,h k, , k, ≤ K) following the scheme We consider Π ν absolutely continuous with respect to the Lebesgue measure on R + with positive and continuous density π ν , Π ρ a probability distribution on the set of matrices with positive entries and spectral norm ρ < 1, with positive density with respect to Lebesgue measures and satisfying (2.5). We now concentrate on the nonparametric part, namely the prior distribution Π h . Then, from Theorems 1 and 3 it is enough that Π h satisfies for each k, ≤ K, for some B > 0 and c > 0 such that there exists F 1,T with for ζ > 0, x 0 > 0 and C > 0 large enough. Note that from remark 1, if we have that for all , k T then it is enough to verify in place of (2.8). These conditions have been checked for a large selection of types of priors on the set of densities. We discuss here two cases: one based on random histograms, these priors make sense in particular in the context of modeling neuronal interactions and the second based on mixtures of Betas, because it leads to adaptive posterior concentration rates over a large collection of functional classes. To simplify the presentation we assume that A = 1 but generalization to any A > 0 is straightforward.
2.3.1. Random histogram prior. These priors are motivated by the neuronal application, where one is interested in characterizing time zones when neurons are or are not interacting (see Section 3). Random histograms have been studied quite a lot recently for density estimation, both in semi and non parametric problems. We consider two types of random histograms: regular partitions and random partitions histograms. Random histogram priors are defined by: for J ≥ 1, and T 0 = 0 < t 1 < · · · < t J = 1.
In both cases, the prior is constructed in the following hierarchical manner: where c 1 and c 2 are two positive constants. Denoting S J the J-dimensional simplex, we assume that the prior on (w 1 , · · · , w J ) satisfies : for all M > 0, for all w 0 ∈ S J with for any j, w 0j ≤ M/J and all u > 0 small enough, there exists c > 0 such that Many probability distributions on S J satisfy (2.12). For instance, if Π w is the Dirichlet distribution D(α 1,J , · · · , α J,J ) with c 3 J −a ≤ α i,J ≤ c 4 , for a, c 3 and c 4 three positive constants, then (2.12) holds, see for instance Castillo and Rousseau [12]. Also, consider the following hierarchical prior allowing some the of w j 's to be equal to 0. Set and (j 1 , · · · , j sz ) the indices corresponding to Z j = 1. Then, Regular partition histograms correspond to t j = j/J for j ≤ J, in which case we writeh w,J instead of h w,t,J ; while in random partition histograms we put a prior on (t 1 , · · · , t J ). We now consider Hölder balls of smoothness β and radius L 0 , denoted H(β, L 0 ), and prove that the posterior concentration rate associated with both types of histogram priors is bounded by T = 0 (log T /T ) β/(2β+1) for 0 < β ≤ 1, where 0 is a constant large enough. From Remark 1, we use the version of assumption (i) based on and need to verify (2.9). Then applying Lemma 4 of the supplementary material of Castillo and Rousseau [12], we obtain for allh 0 ∈ H(β, L 0 ) and ifh 0 is not the null function This thus implies that Π (B ∞ ( T , B)) pe −c T 2 T for some c > 0. This result holds both for the regular grid and random grid histograms with a prior on the grid points (t 1 , · · · , t J ) given by (u 1 , · · · , u J ) ∼ D(α, · · · , α) with u j = t j − t j−1 . Then condition (2.5) is verified if Π( ρ > 1 − u) e −a u −a with a > 3/β and a > 0, for u small enough. This condition holds for any β ∈ (0, 1] if there exist a , τ > 0 such that when u is small enough Moreover, set F 1,T = {h w,J , J ≤ J 1 (T / log T ) 1/(2β+1) } for J 1 a constant, then for all ζ > 0, Therefore, (2.9) is checked. We finally obtain the following corollary.
Corollary 3. Consider the random histogram prior (2.10) based on random partition with a prior on w satisfying (2.11) and (2.12) and with a Dirichlet prior on u = (t j − t j−1 , j ≤ J), with parameter α ≥ 6. If (2.13) is satisfied, then if for any k, = 1, . . . , K, h 0 k, belongs to H(β, L) for 0 < β ≤ 1, then for any w T → +∞, The proof of this corollary is given in Section 4.8. In the following section, we consider another family of priors suited for smooth functions h k, and based on mixtures of Beta distributions.

2.3.2.
Mixtures of Betas. The following family of prior distributions is inspired by Rousseau [40]. Consider functions where M k, are bounded signed measures on [0, 1] such that |M k, | = 1. In other words the above functions are the positive parts of mixtures of Betas distributions with parameterization (α/ , α/(1− )) so that is the mean parameter. The mixing random measures M k, are allowed to be negative. The reason for allowing M k, to be negative is that h k, is then allowed to be null on sets with positive Lebesgue measure. The prior is then constructed in the following way. Writing h k, = ρ k, h k, we define a prior onh k, via a prior on M k, and on α k, . In particular we assume that M k, iid ∼ Π M and α k, iid ∼ π α . As in Rousseau [40] we consider a prior on α absolutely continuous with respect to Lebesgue measure and with density satisfying: there exists b 1 , c 1 , c 2 , c 3 , A, C > 0 such that for all u large enough, (2.14) There are many ways to construct discrete signed measures on [0, 1], for instance, writing the prior on M is then defined by J ∼ Π J and conditionally on J, where Ra denotes the Rademacher distribution taking values {−1, 1} each with probability 1/2. Assume that G has positive continuous density on [0, 1] and that there exists A 0 > 0 such that J j=1 a j ≤ A 0 . We have the following corollary.
Corollary 4. Consider a prior as described above. Assume that for all k, ≤ K h 0 k, = (g 0 k, ) + for some functions g 0 k, ∈ H(β, L 0 ) with β > 0. If condition (2.13) holds and if G has density with respect to Lebesgue measure verifying then, for any w T → +∞, Note that in the context of density estimation, T −β/(2β+1) is the minimax rate and we expect that it is the same for Hawkes processes.
3. Numerical illustration in the neuroscience context. It's now well-known that neurons receive and transmit signals as electrical impulses called action potentials. Although action potentials can vary somewhat in duration, amplitude and shape, they are typically treated as identical stereotyped events in neural coding studies. Therefore, an action potential sequence, or spike train, can be characterized simply by a series of all-or-none point events in time. Multivariate Hawkes processes have been used in neuroscience to model spike trains of several neurons and in particular to model functional connectivity between them through mutual excitation or inhibition [26]. In this section, we conduct a simulation study mimicking the neural context, through appropriate choices of parameters. The protocol is similar to the setting proposed in Section 6 of [24].
3.1. Simulation scenarios. We consider three simulation scenarios involving respectively K = 2 and K = 8 neurons. The scenarios are roughly similar to the one tested in [24]. Following the notations introduced in the previous sections, for any (k, ) ∈ {1, . . . K} 2 , h k, denotes the interaction function of neuron k over neuron . We now describe the three scenarios. The upper bound of each h k, 's support, denoted [0, A] is set equal to A = 0.04 seconds.
• Scenario 1 : We first consider K = 2 neurons and piecewise constant interactions: • Scenario 2 : In this scenario, we mimic K = 8 neurons belonging to three independent groups.
The non-null interactions are the piecewise constant functions defined as: In Figure 1, we plot the subsequent interactions directed graph between the 8 neurons: the vertices represent the K neurons and an oriented edge is plotted from vertex k to vertex if the interaction function h k, is non-null. • Scenario 3 : Setting K = 2, we consider non piecewise constant interactions functions defined as: In all the scenarios, we consider ν = 20, = 1, . . . , K. For each scenario, we simulate 25 datasets on the time interval [0, 22] seconds. The Bayesian inference is performed considering recordings on three possible periods of length T = 5 seconds, T = 10 seconds and T = 20 seconds. For any dataset, we remove the initial period of 2 seconds -corresponding to 50 times the length of the support of the h k, -functions, assuming that, after this period, the Hawkes processes have reached their stationary distribution.

3.2.
Prior distribution on f = (ν , h k, ) l,k∈{1,...,K} . We use the prior distribution described in Section 2.3 setting a log-prior distribution on the ν 's of parameter µ ν , s 2 ν . About the interaction functions (h k, ) k, ∈{1,...,K} , the prior distribution is defined on the set of piecewise constant functions, h k, being written as follows: Here, δ (k, ) is a global parameter of nullity for h k, : for all (k, ) ∈ {1, . . . , K} 2 , For all (k, ) ∈ {1, . . . , K} 2 , the number of steps (J (k, ) ) follows a translated Poisson prior distribution: To minimize the influence of η on the posterior distribution, we consider an hyperprior distribution on the hyperparameter η: Given J (k, ) , we consider a spike and slab prior distribution on (β 1} denote a sign indicator for each step, we set: ∀j ∈ {1, . . . , J (k, ) }: We consider two prior distributions on (t . The first one (refered as the Regular histogram prior) is a regular partition of [0, A]: The second prior distribution is refered as random histogram prior and specifies: In the simulations studies, we set the following hyperparameters: Posterior sampling. The posterior distribution is sampled using a standard Reversible-jump Markov chain Monte Carlo. Considering the current parameter (ν, h), ν (c) is proposed using a Metropolisadjusted Langevin proposal. For a fixed J (k, ) , the heights β (k, ) j are proposed using a random walk proposing null or non-null candidates. Changes in the number of steps J (k, ) are proposed by standard birth and death moves [22]. In this simulation study, we generate chains of length 30000 removing the first 10000 burn-in iterations. The algorithm is implemented in R on an Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz.
The computation times (mean over the 25 datasets) are given in Table 1. First note that the computation time increases roughly as a linear function of T . This is due to the fact that the heavier task in the algorithm is the integration of the conditional likelihood and the computation time of this operation is roughly a linear function of the length of the integration (observation) time interval. Besides, because we implemented a Reversible Jumps algorithm, the computation time is a stochastic quantity: the algorithm can explore parts of the domain where the number of bins J k is large, thus increasing the computation time. This point can explain the unexpected computation times for K = 2. Moreover, we remark that the computation time explodes as K increases (due to the fact that K 2 intensity functions have to be estimated), reaching computation times greater than a day. 3.4. Results. We describe here the results for each scenario. We first present the L 1 -distances on λ k and h k, for all 3 the scenarios, all three length observation times T and the two prior distributions. In Table 2, we show the estimated L 1 -distances on λ k and h k, . More precisely, we evaluate the L 1distances on the interactions functions   and the following stochastic distance : where f 0 is the true set of parameters, d 1,T (f, f 0 ) has been defined in Section 1.4 and the posterior expectations are approximated by Monte Carlo method using the outputs of the Reversible Jumps algorithm.
As expected, the error decreases as T increases. As we will detail later, the random histogram prior gives better results than the regular prior. Finally, we perform better when the true interaction function (h k, ) are step functions (due to the form of the prior distribution).
3.4.1. Results for scenario 1: K = 2 with step functions. When K = 2, we estimate the parameters using both regular and random prior distributions on (t (k, ) j ) (equations (3.6) and (3.7)). One typical  Table 2 L1-distances on h k, and λ k posterior distribution of ν is given in Figure 2b (left), for a randomly chosen dataset, clearly showing a smaller variance when the length of the observation interval increases. We also present the global estimation results, over the 25 simulated datasets. The distribution of the posterior mean estimators for (ν 1 , ν 2 ) computed for the 25 25 is given in Figure  2b on the right panel, showing an expected decreasing variance for the estimator as T increases. On the top panels the posterior is based on the regular grid prior while on the bottom the posterior is based on the random (grid) histogram prior: the results are equivalent.
About the estimation of the interaction functions, for the same given dataset, the estimation of the h k, is plotted in Figure 2a (upper panel) for the regular prior, with its credible interval. Its corresponding estimation with the random prior is given in Figure 2a  where the interaction functions are null are also well identified. The estimation given with the random histogram prior is in general better than the one supplied by the regular prior. This may be due to several factors. First, the random histogram prior leads to a sparser estimation than the regular one. Secondly, it is easier to design a proposal move in the Reversible Jump algorithm in the former case than in the latter context. Moreover, the interaction graph is perfectly inferred since the posterior probability for δ (2,2) to be 0 is almost 1. For the 25 dataset, we estimate the posterior probabilities P(δ (k, ) = 1|(N sim t ) t∈[0,T ] ) for k, = 1, 2 and sim = 1 . . . 25. In Table 3, we display the mean of these posterior quantities. Even for the shorter observation time interval (T = 5) these quantities -defining completely the connexion graph-are well recovered. These results are improved when T increases. Once again, the random histogram prior (3.7) gives better results.
Finally, we also have a look at the conditional intensities Figure 3, we plot 50 realizations of the conditional intensity from the posterior distributions. More precisely, for one given dataset, for 50 parameters ..K sampled from the posterior distribution (obtained at the end of the MCMC chain), we compute the corresponding (λ k(i) t ) and plot them. For the sake of clarity, only the conditional intensity of the first process (k = 1) is plotted and we restrict the graph to a short time interval [3.2, 3.6]. As noticed before, the conditional intensity (a) Estimation of the (h k, ) k, =1,2 using the regular prior (upper panel) and the random histogram prior (bottom panel). The gray region indicates the credible region for h k, (t) (delimited by the 5% and 95% percentiles of the posterior distribution). The true h k, is in plain line, the posterior expectation and posterior median for h k, (t) are in dotted and dashed lines respectively.
(b) On the left, posterior distribution of (ν 1 , ν 2 ) with T = 5, T = 10 and T = 20 for one dataset. On the right, distribution of the posterior mean of (ν 1 ,  Table 3 Scenario 1, K=2. Mean of the posterior estimations: In the context of neurosciences, we are especially interested in recovering the interaction graph of the K = 8 neurons. In Figure 5a, we consider the same dataset as the one used in Figure 4a and plot the posterior estimation of the interaction graph, for respectively T = 10 on the left and T = 20 on the right. The width and the gray level of the edges are proportional to the estimated posterior probability P(δ (k, ) = 1|(N t ) t∈[0,T ] ). The global structure of the graph is recovered (to be compared to the true graph plotted in Figure 1). We observe that the false positive edges appearing when T = 10 disappear when T = 20. In Figure 5b, we consider the mean of the estimates of the graph over the 25 datasets. The resulting graph for T = 10 is on the left and for T = 20 on the right. Note that, in this example, for any (k, ) such that the true δ (k, ) = 1, the estimated posterior probability P(δ (k, ) = 1|(N sim t ) t∈[0,T ] ) is equal to 1, for any dataset and any length of observation interval. In other words, the non-null interactions are perfectly recovered. In a simulation scenario with other interaction functions, the results could have been different. In Figure 4b, we plot the posterior means (with credible regions) of the non-null interaction functions for the same simulated dataset as in Figure 5a. The time intervals where the interaction functions are null are again perfectly recovered. The posterior incertainty around the non-null functions h k, decreases when T increases.

4.1.
Proof of Theorem 1. To prove Theorem 1, we apply the general methodology of Ghosal and van der Vaart [19], with modifications due to the fact that exp(L T (f )) is the likelihood of the distribution of (N k ) k=1,...,K on [0, T ] conditional on G 0 − and that the metric d 1,T depends on the observations. We set M T = M √ log log T , for M a positive constant. Let and for j ≥ 1, we set by using Lemma 2 of Section 4.4. Remember we have set ρ 0 k, := h 0 k, 1 and ρ k, := h k, 1 . Since h k, and h 0 k, are non-negative functions, then for any = 1, . . . , K, This implies for f ∈ S j that On Ω T , so that, for T large enough, for all j ≥ 1 S j ⊂ F j with ..,N j be the centering points of a minimal L 1 -covering of F j by balls of radius ζj T with ζ = 1/(6N 0 ) (with N 0 defined in Section 2) and define φ (j) = max i=1,...,N j φ f i ,j where φ f i ,j is the individual test defined in Lemma 1 associated to f i and j (see Section 4.3). Note also that there exists a constant C 0 such that where N (ζj T /2, H T , . 1 ) is the covering number of H T by L 1 -balls with radius ζj T /2. There exists from hypothesis (iii) in Theorem 1. Combining this with Lemma 1, we have for all j ≥ 2ζ 0 /ζ, if M is a constant large enough, which terminates the proof of Theorem 1.

4.2.
Proof of Theorem 2. The proof of Theorem 2 follows the same lines as for Theorem 1, except that the decomposition of F T is based on the sets F j and H T,i , i ≥ 1 and j ≥ M T for some M T > 0. For each i ≥ 1, j ≥ M T , consider S i,j a maximal set of ζj T -separated points in F j ∩ H T,i (with a slight abuse of notations) and φ i,j = max f 1 ∈S i,j φ f 1 with φ f 1 defined in Lemma 1. Then, using similar computations as for the proof of Theorem 1, we have: Assumptions of the theorem allow us to deal with the first three terms. So, we just have to bound the last two ones. Using the same arguments and the same notations as for Theorem 1, Now, for γ a fixed positive constant smaller than x 2 , setting π T,i = Π(H T,i ), we have But, we have for M a contant large enough. This terminates the proof of Theorem 2.

Construction of tests.
As usual, the control of the posterior distributions is based on specific tests. We build them in the following lemma. Lemma 1. Let j ≥ 1, f 1 ∈ F j and define the test with N 0 is defined in Section 2 and By using (4.3), observe that on the event Ω T , and for T large enough, Let j ≤ µ 0 −1 T and x = x 1 j 2 T 2 T , for x 1 a constant. We use inequality (7.7) of [24], with τ = T , If x 1 ≤ 1/(1024µ 0 ) and x 1 ≤ 36, we have that If j ≥ µ 0 −1 T , we apply the same inequality but with x = x 0 jT T with x 0 = µ 0 × x 1 . Then, where we have used (4.5). It implies and This implies in particular that if N 0 ζ < 1, We then have Note that we can adapt inequality (7.7) of [24], with H t = 1 A 1, (t) to the case of conditional probability given G 0 − since the process E t defined in the proof of Theorem 3 of [24], being a supermartingale, satisfies E f [E t |G 0 − ] ≤ E 0 = 1 and, given that from (4.2) and (4.4), for T large enough, we obtain: We use the same computations as before, observing thatṽ = v + K(j + 1)T T .

Now, if
and the same computations are run with A 1, playing the role of A c 1, . This ends the proof of Lemma 1.

4.4.
Control of the denominator. The following lemma gives a control of D T .
On B( T , B), for T larger than T 0 , with T 0 some constant depending on f 0 , with and r T is defined in (4.11). (4.10) for C a constant only depending on f 0 and B.
Proof. We consider the setΩ T defined in Lemma 3 and we set N T = C α log T . We have: where for u > 0, Ψ(u) := − log(u) − 1 + u ≥ 0. First, observe that onΩ T ∩ B( T , B), Furthermore, observe that for u ∈ [r T , 1/2), Ψ(u) ≤ log(r −1 T ), since r T = o(1). And for all u ≥ 1/2, Ψ(u) ≤ (u − 1) 2 . Finally, for any u ≥ r T , We first deal with the first term. Using stationarity of the process and Proposition 2 of [24] We now deal with R T . We have, on B( T , B), Conversely, So, using Lemma 3, if α is an absolute constant large enough, R T = o(1) and so that we can take r T = 1/2 and R T = o(T 2 T ). We now study We have for any integer Q T such that x := T /(2Q T ) > A, Note that F q is a measurable function of the points of N appearing in [2qx − A; 2qx + x) denoted by F (N |[2qx−A;2qx+x) ). Using Proposition 3.1 of [38], we consider an i.i.d. sequence (M x q ) q=0,...,Q T −1 of Hawkes processes with the same distribution as N but restricted to [2qx − A; 2qx + x) and such that for all q, the variation distance between M x q and N |[2qx−A;2qx+x) is less than 2P 0 (T e > x − A), where T e is the extinction time of the process. We then set for any q, We have built an i.i.d. sequence (G q ) q=0,...,Q T −1 with the same distributions as the F q 's. Furthermore, for any q, We now have, by stationarity We first deal with the first term of the previous expression:

Now, by setting dM
Note that onΩ T , for any t where C 1 (B, f 0 ) only depends on B and f 0 . Then, λ k t (f 0 )dt and using same arguments as for the bound of KL(f 0 , f ), the previous term is bounded by log(r −1 T up to a constant. Since for any u ≥ 1/2, we have | log(u)| ≤ 2|u−1|, we have for any u ≥ r T , By taking α ≥ 2 and using Lemma 3, we obtain: Finally, T . for C 2 (B, f 0 ) a constant only depending on B and f 0 , and It remains to deal with the last term of the previous expression. The proof of Proposition 3 of [24] shows that there exists a constant D only depending on f 0 such that if we take x = D log T , which is larger than A for T large enough, then We now have log(r −1 T )N 2 T × (T /Q T ) = O(log log(T ) log 3 (T )), which ends the proof of the lemma.
Using (4.2), there exists C 0 such that for all f ∈ A ε T , on Ω T , Therefore, on Ω T , We set u T := u 0 (log T ) 1/6 ε 1/3 T with u 0 a large constant to be chosen later.
for T large enough. Following the same lines as in the proof of Theorem 1, we then have where P f denotes the stationary distribution when the true parameter is f . We will now prove that with J T such that J T = κ 0 (log T ) −1 T u 2 T and κ 0 a constant chosen later. Note that J T → +∞ and

From Lemma 5 we have that there exists (depending on
The problem in dealing with the right hand side of the above inequality is that the Z m, 's are not independent. We therefore show that we can construct independent random variablesZ m, such that, on Ω 1,T . For all 1 ≤ m ≤ J T − 1, define N 0,m the sub-counting measure of N generated from the ancestors of any type born on [(2m − 1)T /(2J T ), (2m + 1)T /(2J T )] and the K-multivariate point processN m defined bȳ where N 0,m,k if the kth coordinate of N 0,m . Observe that if I m = [2mT /(2J T ) − A, (2m + 1)T /(2J T )], thenN m (I m ) is the number of points ofN m lying in I m . We have: Since by construction theZ m, are positive, independent, identically distributed and independent of G 0 − , the Bernstein inequality gives .
We have to bound E f (Z 2 1, ). Observe that and We then have to bound T −2 J 2 T max k E f [N 0,1,k (I 1 ) 2 ]. Using notations of Lemma 7, we have: We now bound E f [(W ) 2 ] by using Lemma 6. Without loss of generality, we can assume that ρ > 1/2.
Therefore, since f ∈ F T , there exists a constant C K only depending on K such that where the last inequality follows from the definition of u T and J T . We obtain the desired bound as soon as u 0 is large enough.
Using (4.17) and Assumption (i), we then have that (4.15) is true, which proves the theorem.

4.6.
Proof of Corollary 1. Let w T → +∞. The proof of Corollary 1 follows from the usual convexity argument, so that together with a control of the second term of the right hand side similar to the proof of Theorem 3.
We write where the last inequality comes from the proof of Theorem 3. Similarly, using the proof of Theorem 1, and P 0 ( f − f 0 1 > 3w T ε T ) = o(1). Since this is true for any w T → +∞, this terminates the proof.
for T large enough.
Proof. For the first part, we split the interval [−A; T ] into disjoint intervals of length A and we use Proposition 2 of [24]. For the second part, we set Furthermore, for T large enough, Proof of Lemma 4. We use Proposition 3 of [24] and notations introduced for this result. We denote N [−A, 0) the total number of points of N in [−A, 0), all marks included. Let δ T := δ 0 (log T ) 3 /T , with δ 0 a constant. We have: (4.20) and we observe that , where S is the shift operator introduced in Proposition 3 of [24]. We then have So, for any α > 0, the second term of (4.20) is O(T −α ) for δ 0 large enough depending on α and f 0 . The first term is controlled by using Inequality (7.7) of [24] with τ = T , We take x 0 a positive constant such that 8µ 0 k x 0 < 1, so that, for T large enough

Therefore, we have
which terminates the proof.

4.7.3.
Lemma on E f [Z 1, ]. We have the following result which is useful to prove Theorem 3.

Lemma 5.
For for all f ∈ F T such that d 1,T (f, f 0 ) ≤ ε T , there exists (depending on f and f 0 ) such that on Ω T , where C is a constant depending on f 0 .
Proof. By using the first bound of (4.2), we observe that on Ω T , for any , since inf ν 0 > 0, then inf µ 0 > 0 (by using (4.3)) and we obtain that K k=1 ρ k, and K k=1 ν k are bounded. Therefore f 1 is bounded. On Ω T , since ε T ≥ δ T , still using (4.2), for any , for M a constant large enough. By using the formula Therefore, since ρ = ρ T (ρρ T and ρ T ρ have the same eigenvalues), T (log T ) 1/6 . Therefore, µ is bounded. As in [24], we denote Q f a measure such that under Q f the distribution of the full point process restricted to (−∞, 0] is identical to the distribution under P f and such that on (0, ∞) the process consists of independent components each being a homogeneous Poisson process with rate 1. Furthermore, the Poisson processes should be independent of the process on (−∞, 0]. From Corollary 5.1.2 in [25] the likelihood process is given by Let τ > 0 satisfying • Assume that for any , ν − ν 0 < τ f − f 0 1 . Then, for any , Then, for any , and We denote where a is a fixed constant chosen later. We then have Note that on Ω k , Since on F T , where C 1 and C(f 0 ) are some constants, we have, by definition of Q f , . , We also have, using (4.21), Furthermore, with a = 2K. Finally, and using (4.22), where C depends on f 0 . • We now assume that there exists such that In this case, using similar arguments, still with a = 2K, for C depending on f 0 . Lemma 5 is proved.

4.7.4.
Upper bound for the Laplace transform of the number of points in a cluster. In the next lemma, we refine the proof of Lemma 1 of [24]. Given an ancestor of type , we denote W the number of points in its cluster. We have the following result.
Lemma 6. Assume ρ < 1 and consider t such that . Then, we have for any ∈ {1, . . . , K}, Moreover, if ρ ≤ 1/2, then there exist two absolute constants c 0 and C 0 such that if Proof of Lemma 6. We introduce K (n) ∈ R K the vector of the number of descendants of the nth generation from a single ancestral point of type , with K (0) = e , where (e ) k = 1 k= . More precisely, (K (n)) k is the number of descendants of the nth generation and of the type k from a single ancestral point of type . Then, We now set for any θ ∈ R K , and φ(θ) = (φ 1 (θ), . . . , φ K (θ)) T .
By the previous result, the right hand side is bounded if t is small enough. More precisely, for all The second point is obvious in view of previous computations. Moreover, since We obtain by induction that E f [1 T K (n)] = 1 T (ρ T ) n e and taking the limit, since ρ < 1,

Lemma onN m .
Lemma 7. There existsc 0 such that for all c 0 > 0 such that for T large enough, Furthermore, there exists a constant κ 0 > 0 (see the definition of J T ) such that Proof of Lemma 7. We use computations of the proof of Proposition 2 of Hansen et al. [24]. To boundN m (I m ), first observe that we only consider points of N whose ancestors are born before (2m−1)T /(2J T ), i.e. the distance between the occurrence of an ancestor and I m is at least 2mT Using the cluster representations of the process, for any p ∈ Z and for any ∈ {1, . . . , K}, we consider B p, the number of ancestors of type born in the interval [p, p + 1]. The B p, 's are iid Poisson random variables with parameter ν . We have where W p,k is the number of points in the cluster generated by the ancestor k which is of type and For the first term of the previous right hand side, we have used same arguments as Hansen et al. [24] and the lower bound of the distance determined previously. For the second term of the right hand side, since p ≤ 0, this lower bound is at least −p − 1 + T J T − A. Conditioned on the B p, 's, the variables (W p,k ) k are iid with same distribution as W introduced in Lemma 6. Furthermore, by Lemma 6 applied with f = f 0 , since ρ 0 < 1, we know that for t 0 > 0 small enough (only depending on ρ 0 and K), where C 0 is a constant. So, for any c > 0, Therefore, Similarly,