An explicit split point procedure in model-based trees allowing for a quick fitting of GLM trees and GLM forests

Classification and regression trees (CART) prove to be a true alternative to fully parametric models such as linear models (LM) and generalized linear models (GLM). Although CART suffer from a biased variable selection issue, they are commonly applied to various topics and used for tree ensembles and random forests because of their simplicity and computation speed. Conditional inference tree and model-based tree algorithms, for which variable selection is tackled via fluctuation tests, are known to give more accurate and interpretable results than CART, but yield longer computation times. Using a closed-form maximum likelihood estimator for GLMs, this paper proposes a split point procedure based on the explicit likelihood in order to save time when searching for the best split for a given splitting variable. A simulation study for non-Gaussian responses is performed to assess the computational gain when building GLM trees. We also propose a benchmark on simulated and empirical datasets of GLM trees against CART, conditional inference trees and LM trees, in order to identify situations where GLM trees are efficient. This approach is extended to multiway split trees and log-transformed distributions. Making GLM trees feasible through a new split point procedure allows us to investigate the use of GLMs in ensemble methods. We propose a numerical comparison of GLM forests against other random forest-type approaches. Our simulation analyses show cases where GLM forests are good challengers to random forests.


Introduction
Recursive binary partitioning is a statistical technique for building decision trees by separating a dataset into different homogeneous subgroups according to partitioning explanatory variables, generally known as features. These models have been widely used in supervised learning for regression and classification for more than 50 years (Loh 2014). Most binary trees comprise two steps: (1) recursively splitting the dataset by selecting the best split, which forms a node or a leaf, until reaching a stopping criterion; (2) fitting an intercept-only model at each terminal node. The CART algorithm (Breiman et al. 1984) is one of the most popular because it is efficient and easy to interpret. However, it suffers from a well-known selection bias problem induced by an exhaustive search over all possible splits simultaneously, see, e.g., Hothorn et al. (2006) or Breiman et al. (1984, p. 42). Several solutions permitting an unbiased split selection in tree algorithms based on statistical tests have been proposed in the literature, see, e.g., the FACT (Loh and Vanichsetakul 1988), QUEST (Loh and Shih 1997), CRUISE (Kim and Loh 2001) or CTREE (Hothorn et al. 2006) algorithms. However, these approaches may lead to long computation times, especially on large datasets.
While these classical algorithms are built on a set of nodes with constant values for predictions, the concept of decision trees can be cleverly combined with parametric models through MOdel-Based recursive partitioning (MOB), introduced by Zeileis et al. (2008). The idea consists in fitting separate parametric models for each terminal node of a tree, with subgroup-specific parameters. For instance, Rusch and Zeileis (2013) and Seibold et al. (2018) use the MOB approach with LM trees and GLM trees. For growing the tree, a unique objective function, such as the log-likelihood function, is considered both for the model parameter estimation and for the partitioning variable selection. These variables are selected via a fluctuation test (Zeileis and Hornik 2007) for parameter instability, which overcomes the variable selection bias issue. This general framework is the successor of different previous models which also integrate a parametric model into a tree, see among others generalized regression trees (Ciampi 1991), functional trees (Gama 2004), GUIDE (Loh 2002) or maximum likelihood trees under the assumption of a Gaussian response (Su et al. 2004). By using a well-established parametric model, a key asset of MOB is that it provides predictions given by a statistical model adapted to each node, which are easy to interpret. However, the computation time may be longer than for simple tree algorithms with intercept-only nodes, since the splitting procedure requires maximizing the objective function and assessing parameter instability using a hypothesis test.
For all tree methods, single trees suffer from an instability issue, i.e., the resulting tree can be significantly affected by small changes in the training data; see Philipp et al. (2018) for a general framework for assessing the stability of results generated by supervised statistical learning algorithms. Single trees may also be less competitive in terms of prediction than other classical approaches in artificial intelligence and machine learning, like neural networks (Lawrence 1994) or support vector machines (Cortes and Vapnik 1995). Predictions can be improved by introducing ensemble tree methods based on bagging (Breiman 1996), random forests (Breiman 2001) or boosting (Friedman 2002), but this is done at the expense of interpretability. In this literature, the overwhelming majority of approaches are based on the CART algorithm, although its variable selection may be biased. Unbiased trees for forests are also available, see the function cforest in Hothorn and Zeileis (2015) for the CTREE algorithm introduced in Hothorn et al. (2006), or the function mobForest implementing random forests for MOB (Garge et al. 2013), both implemented in R (R Core Team 2021). Since ensembles are based on random sampling, the longer computation time of these unbiased tree methods makes them more difficult to use than ensemble trees based on CART, as stated, e.g., by Garge et al. (2013) or Fokkema (2020).
In this paper, we propose a new approach for reducing the computation time of recursive binary partitioning, based on the explicit fitted GLM likelihood as an objective function. Based on the idea of Brouste et al. (2020), we show that closed-form estimators can be derived when computing the split points with respect to each partitioning variable for GLM-type trees. More precisely, we replace a time-consuming step of the splitting procedure, in which the split point is determined by optimizing a score function based on the GLM likelihood, with an explicit computation. This approach is general, as the results obtained by Brouste et al. (2020) are valid for any distribution belonging to the one-parameter exponential family (possibly with an additional dispersion parameter) and any link function in the case of categorical explanatory variables. Hence, it can be applied to several GLM-type models such as maximum likelihood trees (Su et al. 2004) or GLM trees (Rusch and Zeileis 2013). We specifically focus on this last approach in this paper due to its flexibility.
The remainder of this paper is organized as follows. In Sect. 2, we present GLMs and GLM trees, as well as the framework for building GLM trees based on a closed-form estimator. Section 3 assesses the performance of our approach on simulated and empirical datasets. In Sect. 4, we propose a GLM forest algorithm based on bagging GLM trees. Finally, Sect. 5 concludes.

Generalized linear models binary trees
In this section, we first describe GLMs and GLM-based trees and explain how we introduce a closed-form estimator for GLMs in the original GLM tree algorithm. Then, we discuss how this approach can be extended to other decision trees based on maximum likelihood estimation (MLE). Finally, we introduce several improvements of the estimators proposed by Brouste et al. (2020) that are useful for tree models.
In the following, for the sake of clarity, we use bold notations for vectors of R^p, R^q or R^n, where p, q, n ∈ N. We use the letter x for explanatory variables on which GLMs are fitted and the letter z for partitioning variables used for the split point procedure. The index i ∈ I = {1, . . . , n} is reserved for observations, while the indices j, k, l are used both for explanatory and partitioning variables.

Generalized linear models
Let x_1, . . . , x_n be deterministic exogenous explanatory variables, with x_i = (x_{i,1}, . . . , x_{i,p}) ∈ R^p for i ∈ I. It is assumed that x_{i,j} is either a real number for numeric variables or a binary dummy for categorical variables. Generalized linear models, see, e.g., McCullagh and Nelder (1989), assume that (i) the random response vector Y has independent components, (ii) the distribution of the random variable Y_i, i ∈ I, belongs to the exponential family with natural parameter λ_i, see Eq. (1), (iii) a so-called link function g governs the relation between the random response variables Y_i and the deterministic explanatory variables x_i, see Eq. (2). In the rest of the paper, the lower case y_i is used for the corresponding observation of the random variable Y_i.
In other words, the log-likelihood associated with a single observation y_i of the random variable Y_i, i ∈ I, in the exponential family is given by

log L(λ_i, φ, y_i) = (y_i λ_i − b(λ_i)) / a(φ) + c(y_i, φ),    (1)

where a : R → R, b : Λ → R and c : Y × R → R are known real-valued differentiable measurable functions. The term λ_i is called the natural parameter and φ is the dispersion parameter.
Since the expectation is E[Y_i] = b'(λ_i) for the exponential family, the relationship between the expectation of Y_i and x_i is governed by

g(E[Y_i]) = η_i = ⟨x_i, θ⟩,    (2)

where η_i is the linear predictor, ⟨., .⟩ denotes the scalar product, and λ_i(θ) stresses the dependence on the finite-dimensional parameter θ. Since g is a bijective function, we have the equivalent relationship λ_i(θ) = (b')^{−1}(g^{−1}(⟨x_i, θ⟩)). The link function g is called canonical when g = (b')^{−1}. Some usual examples are given in Table 1. It is worth noting that θ is estimated in a first step; the dispersion parameter φ, if any, is estimated in a second step. This additional dispersion parameter makes it possible to use nonconstant-variance distributions such as the gamma or inverse Gaussian distributions, see, e.g., McCullagh and Nelder (1989, Chapter 8).
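To make the mean relation E[Y_i] = b'(λ_i) and the canonical-link identity concrete, the Poisson entry of Table 1 can be checked numerically. The following is a minimal Python sketch (the paper's computations are carried out in R); the function names are ours:

```python
import math

# Poisson family (Table 1): b(lam) = exp(lam), so E[Y] = b'(lam) = exp(lam)
# and the canonical link g = (b')^{-1} is the logarithm.
def b_prime(lam):        # derivative of the cumulant function b
    return math.exp(lam)

def g_canonical(mu):     # canonical link: inverse of b'
    return math.log(mu)

mu = 3.0
lam = g_canonical(mu)    # natural parameter such that E[Y] = mu
assert abs(b_prime(lam) - mu) < 1e-12                 # E[Y] = b'(lambda)
assert abs(g_canonical(b_prime(lam)) - lam) < 1e-12   # g o b' is the identity
```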
If the model is identifiable, the MLE θ̂_n has desirable properties (existence, consistency, asymptotic normality), see Ludwig and Heinz (1985). In the general case, when x_i is a vector of numerical and/or categorical explanatory variables, there is no explicit solution for the MLE, and estimation relies on an iteratively weighted least-squares (IWLS) algorithm. Conversely, it is well known that the MLE can be derived directly with a closed-form formula for the Gaussian family, see, e.g., Seber and Lee (2003) and Weisberg (2005).

GLM-based trees and explicit likelihood split point
The GLM-based tree algorithm introduced by Rusch and Zeileis (2013) consists of splitting the dataset recursively based on a set of partitioning variables and of fitting a GLM on a set of explanatory variables to the observations in each node. First, the selection of the variable j among all partitioning variables is performed by assessing parameter stability and finding the partitioning variable with the highest parameter instability. Given the partitioning variable j, this procedure locally estimates, in a second step, a parametric model for all possible splits of that variable, see Sect. 2.2.1. This step can be time-consuming since an exhaustive search is performed to find the optimal split point s, which requires fitting as many GLMs as there are possible split points. For instance, for a categorical variable x with 4 levels A, B, C, D, the following seven GLMs are fitted using the IWLS algorithm:
- left node 1_{x∈{A,B,C}} against right node 1_{x∈{D}}
- left node 1_{x∈{A,B,D}} against right node 1_{x∈{C}}
- left node 1_{x∈{A,C,D}} against right node 1_{x∈{B}}
- left node 1_{x∈{B,C,D}} against right node 1_{x∈{A}}
- left node 1_{x∈{A,B}} against right node 1_{x∈{C,D}}
- left node 1_{x∈{A,C}} against right node 1_{x∈{B,D}}
- left node 1_{x∈{A,D}} against right node 1_{x∈{B,C}}.
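In general, a categorical variable with l levels yields 2^{l−1} − 1 candidate binary splits, each requiring one IWLS fit in the original procedure. The enumeration can be sketched in Python as follows (the helper name is ours):

```python
from itertools import combinations

def binary_splits(levels):
    """All 2^(l-1) - 1 ways to split a set of levels into two non-empty
    children (left/right order is irrelevant, so the first level stays left)."""
    first, rest = levels[0], levels[1:]
    splits = []
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            left = {first, *extra}
            right = set(levels) - left
            if right:                      # skip the trivial all-left split
                splits.append((left, right))
    return splits

splits = binary_splits(["A", "B", "C", "D"])
assert len(splits) == 2 ** 3 - 1           # 7 candidate splits for 4 levels
```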
For these two steps, we show in Sect. 2.2.2 how to improve this procedure based on the results of Brouste et al. (2020), where a closed-form solution for GLMs with categorical variables is available for any link function g and any probability distribution. More precisely, explicit solutions are derived for GLM-based tree models for any type of partitioning variable in situations where the GLMs are fitted with only an intercept or with one categorical explanatory variable. The difference between non-explicit and explicit GLM trees is how the GLMs are fitted: the former rely on the IWLS algorithm to compute the MLE, while our procedure computes the closed-form MLE as the tree grows.

Model-based partitioning trees
The MOB model introduced by Zeileis et al. (2008) is a flexible framework for model parameter estimation based on least squares, maximum likelihood or, more broadly, M-estimation approaches. Within this approach, a parametric model with node-specific parameters θ_b, b = 1, . . . , B, i ∈ I, is locally fitted using the vectors of explanatory covariates {x_i} on each subset of a partition {B_b} of B segments. The partition is determined when growing a tree based on the partitioning variables z_i = (z_{i,1}, . . . , z_{i,q}) ∈ R^q, and an objective function is maximized to obtain the generic collection of parameters {θ_b}.
In this paper, the local model on which we focus is the GLM, similarly to Rusch and Zeileis (2013), where the parameters θ_b, b = 1, . . . , B, are estimated by maximizing the log-likelihood on the set of explanatory covariates. Their algorithm for GLM trees is given in Algorithm 1. Contrary to CART, MOB does not require a post-pruning procedure of the tree. For instance, a pre-pruning step can be applied to prevent the size of a node from becoming too small.
First, a full GLM is fitted on all observations of the current node b (i ∈ b) with all available explanatory variables x_i (continuous and/or categorical). Candidate variables for splitting are either all partitioning variables or a subset of them if a random selection is performed. Unless the explanatory variables come from a single categorical variable, for which an explicit MLE exists (Brouste et al. 2020), θ̂_b is computed by the IWLS algorithm.
Second, a variable selection is performed to avoid selecting useless variables. It is based on an M-fluctuation test. Let W_j(t) be an empirical fluctuation process, i.e., a normalized cumulative score process. Zeileis and Hornik (2007) show that, under regularity conditions, a normalized W_j(t) converges toward a standard Brownian bridge, which allows performing M-fluctuation tests of the null hypothesis of parameter stability for each partitioning variable, and thus selecting the most significant one. For GLMs, the i-th score contribution is explicit and given by

s_i(θ) = (y_i − μ_i) h'(η_i) / (a(φ) b''(λ_i(θ))) × x_i,    (3)

where μ_i = h(η_i), h = g^{−1} is the inverse link function and η_i = ⟨x_i, θ⟩. The corresponding covariance matrix J(θ) is also explicit, see, e.g., Zeileis and Hornik (2007, Section 4.1). Therefore, for GLMs, the empirical fluctuation process has the expression

W_j(t, θ̂) = Ĵ^{−1/2} |b|^{−1/2} Σ_{i=1}^{⌊|b| t⌋} ŝ_{σ(i),j},    (4)

where ŝ_{i,j} = s_{i,j}(θ̂), σ(i) is the ordering permutation giving the anti-rank observation of z_{i,j}, |b| is the cardinality of the set b and Ĵ = J(θ̂) is the fitted covariance matrix.

Data: Response vector y, vectors of explanatory covariates x_1, . . . , x_n, partitioning variables z_1, . . . , z_n.
while no significant instability is detected do (loop over nodes b)
    Compute the observation number n_b = |b| for node b.
    if n_b is too small then stop the process for that node. end
    1. Fit the local model: fit a GLM by maximizing (1) for obs. i ∈ b to obtain the fitted parameter θ̂_b.
    2. Assess parameter instability with M-fluctuation tests based on (4) for each candidate partitioning variable j (numerical or categorical).
    3. Choose the best splitting point s:
        if j is a numerical variable then search for the optimal split point s ∈ (min_i z_{i,j}, max_i z_{i,j}) based on (10).
        else search for the optimal subset s ⊂ {v_{j,1}, . . . , v_{j,l_j}} based on (10). end
end
Algorithm 1: Recursive partitioning algorithm for GLMs for a binary tree.
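To illustrate the behavior of the fluctuation process under the null hypothesis of parameter stability, the following Python sketch uses an intercept-only Gaussian model, for which the score contributions reduce to centered residuals. It is a simplified stand-in for (3)-(4), not the paper's R implementation; all names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
z = rng.uniform(size=n)                  # partitioning variable
y = rng.normal(size=n)                   # stable parameters: no split is needed

# Intercept-only Gaussian model: scores are centered residuals and J is
# estimated by their empirical variance.
theta_hat = y.mean()
scores = y - theta_hat
J_hat = np.mean(scores ** 2)

order = np.argsort(z)                    # anti-ranks of z
W = np.cumsum(scores[order]) / np.sqrt(n * J_hat)   # empirical fluctuation process
stat = np.max(np.abs(W))                 # sup-type functional of the process

# Under stability, W behaves like a Brownian bridge: moderate excursions,
# and the process comes back to zero at t = 1.
assert W.shape == (n,)
assert abs(W[-1]) < 1e-8
```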
The function t → W_j(t, θ̂) is a step function, and we denote by Δ_v W_j(t, θ̂) the increment for variable j in category v. Hence, the splitting variable with the highest significant instability is selected, i.e., the one with the lowest p-value satisfying the significance level adjusted with the Bonferroni correction.
Once the best splitting variable j is selected, the algorithm chooses the best split point based on the objective function calculated on the B ≥ 2 daughter nodes of the splitting variable

Σ_{b=1}^{B} Σ_{i ∈ L_b(j)} log L(θ̂_b, y_i),    (5)

where L_b(j) corresponds to the b-th segment with respect to the values taken by variable j. For a binary tree (B = 2), only one split point for a continuous variable, or one subset for a categorical variable, hereafter noted s, has to be exhibited.
In general, the objective function (5) is not explicit because the parameter vectors θ̂_b have to be estimated by the IWLS algorithm. In what follows, we focus on this last step and propose closed-form formulas for θ̂_1, . . . , θ̂_B, leading to an explicit objective function. More precisely, our GLM-based tree algorithm with closed-form solutions covers these particular situations:
- a GLM with no explanatory variable and a set of partitioning variables (intercept-only nodes, see Sect. 2.2.2);
- a GLM with one categorical explanatory variable and a set of partitioning variables (see Sect. 2.4.2).
When continuous explanatory variables or more than one categorical explanatory variable are considered, it is appropriate to return to the general framework set by Zeileis et al. (2008).

Explicit likelihood split point for binary trees with intercept-only nodes
Based on Brouste et al. (2020), we now detail how the split point search can be performed more efficiently for a GLM-based recursive partition, either for a numerical or a categorical splitting variable j. For the sake of clarity, we present the case of binary trees with intercept-only nodes, i.e., a GLM-based tree without any explanatory variable. Such a tree with intercept-only nodes is especially relevant for comparisons with classical recursive partitioning algorithms like CART or CTREE, or for generating ensembles of decision trees, see, e.g., Fokkema (2020). This presentation can be extended to trees with non-constant nodes, i.e., multiple categorical explanatory variables, and to multiway splits, see Sect. 2.4.

When the j-th partitioning variable is numerical, we consider two binary dummies for a known split point s for all i ∈ I

x_i^L(s) = 1_{z_{i,j} ≤ s},   x_i^R(s) = 1_{z_{i,j} > s}.    (6)

Split point values s are taken in the range of the j-th partitioning variable: s ∈ (min_i z_{i,j}, max_i z_{i,j}). When the j-th variable is categorical, taking its values in {v_{j,1}, . . . , v_{j,l_j}}, we consider two binary dummies for a known subset s for all i ∈ I

x_i^L(s) = 1_{z_{i,j} ∈ s},   x_i^R(s) = 1_{z_{i,j} ∉ s}.    (7)

Subset values s are taken among the binary partitions of the modality set of the j-th partitioning variable. In order to unify both situations, we use a generic notation for the linear predictor

η_i(s) = θ_L 1_{i ∈ L(j,s)} + θ_R 1_{i ∈ R(j,s)},    (8)

where L(j, s) and R(j, s) are the children subsets resulting from the split. From Theorem 3.1 of Brouste et al. (2020), given the link function g, the MLE θ̂(s) = (θ̂_L(s), θ̂_R(s)) is given by

θ̂_L(s) = g(ȳ_L^j(s)),   θ̂_R(s) = g(ȳ_R^j(s)),    (9)

where ȳ_L^j(s), ȳ_R^j(s) are the average responses for the left and right subsets, respectively, conditional on the j value, see Table 2. As mentioned in Brouste et al. (2020), the fitted parameters do not depend on the distribution (functions a, b, c) but only on the link function g.
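The closed-form MLE (9) amounts to applying the link function to the children's average responses. A minimal Python sketch (the paper's code is in R; the function names are ours):

```python
import numpy as np

def split_mle(y, z, s, link):
    """Closed-form intercept MLE for the two children of split point s:
    theta_hat = g(average response), whatever the response distribution."""
    left, right = y[z <= s], y[z > s]
    return link(left.mean()), link(right.mean())

y = np.array([1.0, 3.0, 2.0, 8.0, 10.0, 12.0])
z = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])

# Log link (e.g., Poisson or gamma regression): theta = log(mean response).
theta_L, theta_R = split_mle(y, z, s=0.5, link=np.log)
assert np.isclose(theta_L, np.log(2.0))   # mean of {1, 3, 2}
assert np.isclose(theta_R, np.log(10.0))  # mean of {8, 10, 12}
```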
Furthermore, we can deduce the fitted log-likelihood (5) using Corollary 3.1 of Brouste et al. (2020), see "Appendix A.1". Equation (15) in "Appendix A.1" has terms which do not depend on s, and the same applies to the fitted deviance, see "Appendix A.2". Hence, we need neither to estimate φ nor to rely on the functions a and c in order to maximize the log-likelihood log L_j with respect to s, i.e., the application s → log L(θ̂_L(s), θ̂_R(s), y, s) (or the deviance) can be simplified.
Let ȳ_j(s) = (ȳ_L^j(s), ȳ_R^j(s)) be the vector of conditional average responses and m_j(s) = (m_L^j(s), m_R^j(s)) be the vector of node sizes. Thanks to Brouste et al. (2020), we maximize the following explicit objective function s → O(ȳ_j(s), m_j(s)), given the splitting variable j, in order to find the best split point or the best subset s:

O(ȳ_j(s), m_j(s)) = Σ_{N ∈ {L,R}} m_N^j(s) [ ȳ_N^j(s) (b')^{−1}(ȳ_N^j(s)) − b((b')^{−1}(ȳ_N^j(s))) ].    (10)

The objective function (10) is thus linked to the distribution assumption only through the function b, but depends neither on a, c, nor on the link function g. Hence, the choice of link function has an impact on the fitted parameter θ̂(s), but neither on the fitted likelihood nor on the objective function.
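The resulting split search reduces to scanning candidate split points and summing explicit node contributions of the form m(ȳ (b')^{−1}(ȳ) − b((b')^{−1}(ȳ))). A Python sketch for the Poisson family of Table 1 (function names are ours, not the paper's R code):

```python
import numpy as np

def node_term(m, ybar, b, b_prime_inv):
    """Contribution m * (ybar * lam - b(lam)) of one node, lam = (b')^{-1}(ybar)."""
    lam = b_prime_inv(ybar)
    return m * (ybar * lam - b(lam))

def best_split(y, z, b, b_prime_inv):
    """Scan all candidate split points of a numeric partitioning variable."""
    candidates = np.unique(z)[:-1]        # split strictly inside the range
    scores = []
    for s in candidates:
        L, R = y[z <= s], y[z > s]
        scores.append(node_term(L.size, L.mean(), b, b_prime_inv)
                      + node_term(R.size, R.mean(), b, b_prime_inv))
    return candidates[int(np.argmax(scores))]

# Poisson family (Table 1): b(lam) = exp(lam), (b')^{-1}(mu) = log(mu).
y = np.array([1.0, 2.0, 1.0, 9.0, 10.0, 11.0])
z = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
s = best_split(y, z, b=np.exp, b_prime_inv=np.log)
assert s == 3.0                           # the jump in the mean occurs after z = 3
```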
As noted by Zeileis et al. (2008), the framework introduced by MOB is quite general and unifies different decision tree models, such as the maximum likelihood trees introduced by Su et al. (2004) for regression models with intercept only. This remains true in our approach, which is in line with this framework and allows some classical models to benefit from explicit solutions. The objective function (10) is actually very close to classical criteria, namely the entropy function and the residual sum of squares used in the CART algorithm.
Consider a Bernoulli response with λ = log(p/(1 − p)), for which Y = {0, 1} and Λ = R. Using Table 1 and ignoring s in frequencies and averages for ease of notation, Eq. (10) becomes

O = m_L [ȳ_L log(ȳ_L) + (1 − ȳ_L) log(1 − ȳ_L)] + m_R [ȳ_R log(ȳ_R) + (1 − ȳ_R) log(1 − ȳ_R)].

This formula is of type p log(p) + (1 − p) log(1 − p), as the entropy function used in classification trees, see Venables and Ripley (2002, p. 255). Note that if, for some splits, the left node or the right node contains only successes or only failures, the above objective function has an indeterminate form 0 × ∞. In practice, the limit of p log(p) is used, i.e., p log(p) → 0 as p → 0, or analogously (1 − p) log(1 − p) → 0 as p → 1. Hence, the contribution of the problematic node to the objective function is set to zero.
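The limit convention can be implemented directly; below is a small Python sketch of the Bernoulli objective (the names are ours):

```python
import math

def xlogx(p):
    """p * log(p) with the limit convention p*log(p) -> 0 as p -> 0."""
    return 0.0 if p == 0.0 else p * math.log(p)

def bernoulli_objective(m_L, p_L, m_R, p_R):
    """Entropy-type objective for a Bernoulli response in both children."""
    ent = lambda p: xlogx(p) + xlogx(1.0 - p)
    return m_L * ent(p_L) + m_R * ent(p_R)

# A pure node (all successes) contributes 0 thanks to the limit convention.
assert bernoulli_objective(10, 1.0, 10, 0.5) == 10 * (xlogx(0.5) + xlogx(0.5))
# Purer children give a larger (less negative) objective value.
assert bernoulli_objective(10, 0.9, 10, 0.1) > bernoulli_objective(10, 0.6, 10, 0.4)
```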
For a Gaussian response with mean μ = λ and variance σ² = φ, for which Y = R and Λ = R, Eq. (10) becomes, using Table 1 and ignoring s,

O = (m_L ȳ_L² + m_R ȳ_R²) / 2.

Although the formula looks different from the usual regression tree criterion, this objective function is indeed proportional to the loss in deviance. From Chambers and Hastie (1993, p. 414) with their notation, the loss in deviance is defined as

D = Σ_i (y_i − μ̂)² − [ Σ_{i ∈ L} (y_i − μ̂_L)² + Σ_{i ∈ R} (y_i − μ̂_R)² ].

Using μ̂_L = ȳ_L^j, μ̂_R = ȳ_R^j, m_L = m_L^j and m_R = m_R^j leads to D = m_L ȳ_L² + m_R ȳ_R² − n ȳ², where ȳ is the parent-node average. Since n ȳ² does not depend on the split, D is proportional to our objective function.
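The proportionality between the Gaussian objective and the loss in deviance can be verified numerically; a Python sketch (the names are ours):

```python
import numpy as np

y = np.array([0.5, 1.2, 0.8, 4.9, 5.1, 5.3])
L, R = y[:3], y[3:]

# Gaussian objective: (m_L * ybar_L^2 + m_R * ybar_R^2) / 2.
obj = (L.size * L.mean() ** 2 + R.size * R.mean() ** 2) / 2.0

# Loss in deviance: RSS of the parent minus RSS of the two children.
rss = lambda v: np.sum((v - v.mean()) ** 2)
dev_loss = rss(y) - (rss(L) + rss(R))

# Both rank splits identically: 2*obj equals the deviance loss up to the
# constant n * ybar^2, which does not depend on the split point.
assert np.isclose(2.0 * obj - y.size * y.mean() ** 2, dev_loss)
```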

Non-Gaussian non-binary distributions for GLM trees
An advantage of the GLM framework is the possibility to use any probability distribution in the exponential family: we are not limited to the Gaussian distribution for numeric responses and the Bernoulli distribution for binary responses, see the usual distributions in Table 1. In the insurance field, actuaries and statisticians usually use the Poisson distribution to model claim frequencies and the gamma distribution to model claim severities, see, e.g., Denuit et al. (2019). In biology, there are also several examples of the relevance of non-Gaussian non-binary distributions, e.g., Wilson and Grenfell (1997) and Szöcs and Schäfer (2015).

Classical distributions for GLMs
The objective function (10) contains many other special cases. We give below typical probability distributions generally used within the GLM framework.

A gamma distribution parametrized by its mean μ and its shape parameter ν is obtained with λ = −1/μ and φ = 1/ν, for which Y = (0, +∞) and Λ = (−∞, 0). Using Table 1, we get

O = −m_L (1 + log ȳ_L) − m_R (1 + log ȳ_R).

A Poisson distribution with mean μ is obtained with λ = log(μ), for which Y = N and Λ = R. Using Table 1, we get

O = m_L (ȳ_L log ȳ_L − ȳ_L) + m_R (ȳ_R log ȳ_R − ȳ_R).

An inverse Gaussian distribution parametrized by its mean μ and its shape parameter σ² is obtained with λ = −1/(2μ²) and φ = 1/σ², for which Y = (0, +∞) and Λ = (−∞, 0). Using Table 1, we get

O = m_L / (2 ȳ_L) + m_R / (2 ȳ_R).

These examples illustrate non-quadratic objective functions (10) for continuous distributions and non-logit objective functions for discrete distributions. All distributions considered are generally already implemented in statistical software, such as in the glm() function of the R statistical software (R Core Team 2021).
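The corresponding node contributions can be coded in a few lines; a Python sketch specialized from Table 1 (the function names are ours):

```python
import math

# Node contributions m * (ybar * lam - b(lam)), lam = (b')^{-1}(ybar),
# specialized for three classical GLM families.
def poisson_term(m, ybar):          # b(lam) = exp(lam), lam = log(ybar)
    return m * (ybar * math.log(ybar) - ybar)

def gamma_term(m, ybar):            # b(lam) = -log(-lam), lam = -1/ybar
    return m * (-1.0 - math.log(ybar))

def inv_gaussian_term(m, ybar):     # b(lam) = -sqrt(-2*lam), lam = -1/(2*ybar^2)
    return m / (2.0 * ybar)

# Sanity checks on small closed-form values.
assert abs(poisson_term(1, 1.0) - (-1.0)) < 1e-12       # 1*(0 - 1)
assert abs(gamma_term(2, 1.0) - (-2.0)) < 1e-12         # -2*(1 + 0)
assert abs(inv_gaussian_term(4, 2.0) - 1.0) < 1e-12     # 4 / (2*2)
```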

Taking weights into account
From a computational point of view, the weighted sums and averages of Table 3 can be computed as follows. Let us first compute the total weight and the total weighted sum

m_w = Σ_{i ∈ I} w_i,   S_w = Σ_{i ∈ I} w_i y_i.

Then, only for the left node, using Table 3, we deduce m_{j,w}^L(s) and S_{j,w}^L(s). Since the right-branch statistics can be deduced from m_w and m_{j,w}^L(s), S_w and S_{j,w}^L(s), we easily obtain the needed weighted averages as

ȳ_{j,w}^L(s) = S_{j,w}^L(s) / m_{j,w}^L(s),   ȳ_{j,w}^R(s) = (S_w − S_{j,w}^L(s)) / (m_w − m_{j,w}^L(s)).

This may reduce the computation time if the selection {i ∈ L(j, s)} is computer intensive.

We now present a usual distribution that benefits from using a weighted MLE, in order to illustrate this framework. The binomial distribution enters the exponential family framework by considering the random variable Y/m when Y ∼ B(m, p) for a size parameter m and a probability parameter p, see, e.g., Zeileis and Hornik (2007). Using Table 1, the distribution of Y/m belongs to the exponential family, for which Y = {0, 1/m, . . . , 1} and Λ = R. In practice, the size parameter m is known from the binomial experiment and does not have to be estimated. Furthermore, this parameter is not common to every observation, that is, we model Y_i/m_i with observation-specific sizes m_i. We can tackle this issue by considering a weighted MLE with weights w_i = m_i, without changing the a and b functions, yet c becomes c(x, φ) = log C(m_i, m_i x), with C the binomial coefficient. Using Table 1, the objective function for the binomial distribution is the same as for the Bernoulli distribution, up to changing the arithmetic means ȳ into weighted means.
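The subtraction trick for the right-branch statistics can be sketched as follows in Python (the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
y = rng.gamma(2.0, size=n)
w = rng.integers(1, 5, size=n).astype(float)   # e.g., binomial sizes as weights
z = rng.uniform(size=n)                        # partitioning variable

# Totals are computed once ...
m_w, S_w = w.sum(), (w * y).sum()

s = 0.5
left = z <= s
# ... then only the left-node statistics are accumulated; the right node
# follows by subtraction, sparing a second pass over the data.
m_L, S_L = w[left].sum(), (w[left] * y[left]).sum()
m_R, S_R = m_w - m_L, S_w - S_L

ybar_L, ybar_R = S_L / m_L, S_R / m_R          # weighted averages per node
assert np.isclose(m_R, w[~left].sum())
assert np.isclose(ybar_R, np.average(y[~left], weights=w[~left]))
```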

Other special cases for log transformed variable
Finally, we close our overview of distributions belonging to the GLM framework by considering a transformation of the response variable. Various parametrized transformations have been proposed to nonlinearly transform response variables so that the resulting distribution is tractable, typically the Gaussian distribution. In that perspective, a classical transform is the Box-Cox transform (Box and Cox 1964). Transforming the response variable is different from transforming the expectation through a (possibly parametrized) link function, see the discussion of McCullagh and Nelder (1989, Section 11.3.3). Table 4 gives the notations for the conditional averages of the transform t(x), with frequencies given in Table 2.
In order to keep tractable solutions of MLE, we consider the log transformation which leads to known distributions. More precisely, we consider the transformation t(x) = log(d 1 x + d 2 ) and denote by T i = t(Y i ) the transformed random variables, where d 1 , d 2 are known parameters, i.e., no estimation is needed. We assume that T 1 , . . . , T n are independent random variables with a distribution defined in Eq. (1).
In "Appendix A.4", we show that the log-likelihood (1) only differs by a newc functioñ whereas a and b remain identical to the original distribution. As shown in Brouste et al. (2020), the MLE is still explicit and the closed-form expression (9) has to be updated by replacing the original variable Y i by the transformed variable T i . Thus, we obtain expressions given in Table 4.
As for non-transformed responses, the fitted log-likelihood is explicit, see "Appendix A.4". Hence, the objective function, when searching for the best split point or the best subset s, is similar to (10), with the conditional averages t̄ of the transformed responses in place of the averages ȳ:

O(t̄_j(s), m_j(s)) = Σ_{N ∈ {L,R}} m_N^j(s) [ t̄_N^j(s) (b')^{−1}(t̄_N^j(s)) − b((b')^{−1}(t̄_N^j(s))) ].    (11)

We now present important distributions which fall in this framework. The Pareto 1 distribution with a known threshold parameter d is obtained by considering d_1 = 1/d and d_2 = 0. Indeed, as studied in Brouste et al. (2020, Section 4), in that case, if Y_i follows a Pareto 1 distribution with shape parameter λ and threshold parameter d, then T_i = log(Y_i/d) follows an exponential distribution E(λ), i.e., a special case of the gamma distribution G(1, 1/λ). Hence, this falls in the GLM framework with a unitary dispersion φ = 1 in Table 1. Using Eq. (11), we obtain the following objective function

O = −m_L (1 + log t̄_L) − m_R (1 + log t̄_R).

The shifted lognormal distribution with a known threshold parameter d is obtained by considering d_1 = 1 and d_2 = −d. Indeed, as studied in Brouste et al. (2020, Section 5), in that case, if Y_i follows a shifted lognormal distribution with parameters μ, σ² and threshold parameter d, then T_i = log(Y_i − d) follows a N(μ, σ²) distribution. Using Table 1 and Eq. (11), we obtain the following objective function

O = (m_L t̄_L² + m_R t̄_R²) / 2.
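As an illustration, the Pareto 1 case reduces to an exponential-type (gamma) split criterion on the log-transformed responses. A deterministic Python toy example (the names and data are ours):

```python
import numpy as np

d = 100.0                                 # known Pareto threshold parameter
z = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = d * np.array([1.1, 1.2, 1.15, 9.0, 11.0, 10.0])   # heavier tail for z > 3

t = np.log(y / d)                         # T = log(Y/d) is exponential

def objective(s):
    """Gamma-type objective evaluated on the transformed responses."""
    L, R = t[z <= s], t[z > s]
    return -L.size * (1 + np.log(L.mean())) - R.size * (1 + np.log(R.mean()))

candidates = z[:-1]
s_hat = candidates[np.argmax([objective(s) for s in candidates])]
assert s_hat == 3.0                       # the tail index changes after z = 3
```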

Extensions to more complex trees
In this section, we discuss two possible extensions for GLM-based trees, by including more than two splits in Sect. 2.4.1 and several categorical explanatory variables in the GLM in Sect. 2.4.2.

Split into more than two segments
Our approach can easily deal with multiway splits at tree nodes. Generalizing the linear predictor (8), we use a generic notation for the linear predictor

η_i(s) = Σ_{k=1}^{m} θ_{L_k} 1_{i ∈ L_k(j,s)},    (12)

where L_k(j, s) is the k-th leaf subset resulting from the split. For numeric partitioning variables, L_1 ∪ · · · ∪ L_m is a partition of the interval [min_i z_{i,j}, max_i z_{i,j}], while for categorical partitioning variables, it is a partition of the modality set {v_{j,1}, . . . , v_{j,l_j}}. Again, from Theorem 3.1 of Brouste et al. (2020), the MLE θ̂(s) depends only on the link function g and is given by

θ̂_{L_k}(s) = g(ȳ_{L_k}^j(s)),   k = 1, . . . , m.

The fitted log-likelihood is also explicit using Corollary 3.1 of Brouste et al. (2020), and the objective function (10) becomes

O = Σ_{k=1}^{m} m_{L_k}^j(s) [ ȳ_{L_k}^j(s) (b')^{−1}(ȳ_{L_k}^j(s)) − b((b')^{−1}(ȳ_{L_k}^j(s))) ].    (13)

This objective function is completely tractable and deals with both categorical and numerical partitioning variables. This differs from other algorithms for multiway splits. Indeed, CHAID by Kass (1980) and CRUISE by Kim and Loh (2001) build up contingency tables to compute chi-square statistics in the search for the best partition, not restricted to two children nodes. FACT by Loh and Vanichsetakul (1988) is based on linear discriminant analysis, which allows multiple splits at the same node. Note that categorical partitioning variables in CRUISE and FACT are converted to numerical variables via the largest discriminant coordinate. Functional trees (FT) by Gama (2004) provide a general algorithm but no explicit objective function to be maximized at each node. LMT by Landwehr et al. (2005) deals only with Bernoulli or multinomial responses and proposes an iterative algorithm (LogitBoost) to fit logistic regression models, as well as heuristics to speed up the algorithm.
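The multiway objective (13) is evaluated exactly like its binary counterpart, summing closed-form node terms over all leaves; a Python sketch for the Gaussian family (the names are ours):

```python
import numpy as np

def multiway_objective(y, labels, b, b_prime_inv):
    """Sum of closed-form node terms m * (ybar * lam - b(lam)) over m >= 2 leaves."""
    total = 0.0
    for k in np.unique(labels):
        yk = y[labels == k]
        lam = b_prime_inv(yk.mean())
        total += yk.size * (yk.mean() * lam - b(lam))
    return total

y = np.array([1.0, 1.2, 4.0, 4.2, 9.0, 9.5])
three_way = np.array([0, 0, 1, 1, 2, 2])      # partition into three leaves
two_way = np.array([0, 0, 0, 0, 2, 2])        # coarser partition

# Gaussian family: b(lam) = lam^2 / 2 and (b')^{-1} is the identity.
obj3 = multiway_objective(y, three_way, lambda l: l ** 2 / 2, lambda m: m)
obj2 = multiway_objective(y, two_way, lambda l: l ** 2 / 2, lambda m: m)
assert obj3 > obj2                             # the finer, well-placed split fits better
```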

Explicit MLE solutions with categorical explanatory variables
Consider now one categorical explanatory variable taking d_1 modalities v_1, . . . , v_{d_1}. The linear predictor (8) generalizes to

η_i(s) = Σ_{k=1}^{m} Σ_{l=1}^{d_1} θ_{L_k}^l 1_{i ∈ L_k(j,s)} 1_{x_i = v_l},    (14)

where the θ_{L_k}^l for different k and l are unknown parameters. The MLE is then explicit:

θ̂_{L_k}^l(s) = g(ȳ_j^{k,l}(s)),

where ȳ_j^{k,l} is the empirical average over node k and modality v_l given the partitioning variable j. Again, the fitted log-likelihood can be derived using Corollary 3.3 of Brouste et al. (2020): this consists in replacing the averages ȳ_{L_k}^j(s) by ȳ_j^{k,l}(s), as well as the node sizes m_{L_k}^j(s) by m_j^{k,l}(s), in (13).

With a higher number of categorical explanatory variables, we may define dummy variables based on interaction terms as in (14). However, an explicit log-likelihood is available only when the assumptions of Theorem 3.2 of Brouste et al. (2020) are satisfied. More precisely, for two categorical explanatory variables with modality numbers d_2 > 1 and d_3 > 1, the MLE θ̂ is obtained by solving a system zeroing the score, where q = 1 + d_2 + d_3 linear constraints on the parameter θ (Rθ = 0) have to be imposed, R being a real matrix of linear contrasts whose rank equals q, built from Kronecker products (⊗) of the identity matrices I_{d_2}, I_{d_3} and the ones-vectors 1_{d_2}, 1_{d_3}, 1_{d_2 d_3}, where d_2 d_3 is the cardinality of the set of combined modalities of the two categorical explanatory variables (excluding unobserved combinations). In the case of no intercept and no single effect, this simplifies, and the linear predictor (14) is generalized to two explanatory variables as

η_i(s) = Σ_{k=1}^{m} Σ_{l_2=1}^{d_2} Σ_{l_3=1}^{d_3} θ_{L_k}^{l_2,l_3} 1_{i ∈ L_k(j,s)} 1_{x_i^{(1)} = v_{l_2}} 1_{x_i^{(2)} = v_{l_3}},

where the θ_{L_k}^{l_2,l_3} are unknown parameters and 1_{x_i^{(1)} = v_{l_2}} 1_{x_i^{(2)} = v_{l_3}} is a new dummy variable. In that case, the MLE is obtained as the g-transformed average response over L_k given the modalities of x^{(1)} and x^{(2)}, for a given split point s, k = 1, . . . , m, l_2 = 1, . . . , d_2 and l_3 = 1, . . . , d_3:

θ̂_{L_k}^{l_2,l_3}(s) = g(ȳ_j^{k,l_2,l_3}(s)).
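With one categorical explanatory variable, the explicit MLE is simply the g-transformed average per (leaf, modality) cell; a Python sketch (the names are ours; unobserved combinations are skipped):

```python
import numpy as np

def explicit_mle(y, node, modality, link):
    """Closed-form MLE with one categorical explanatory variable:
    theta_hat[(k, l)] = g(average response in leaf k and modality l)."""
    theta = {}
    for k in np.unique(node):
        for l in np.unique(modality):
            mask = (node == k) & (modality == l)
            if mask.any():                     # unobserved combinations are skipped
                theta[(k, l)] = link(y[mask].mean())
    return theta

y = np.array([1.0, 3.0, 2.0, 6.0, 10.0, 8.0])
node = np.array([0, 0, 0, 1, 1, 1])            # leaf membership from the split
mod = np.array([0, 0, 1, 0, 1, 1])             # modality of the explanatory variable

theta = explicit_mle(y, node, mod, np.log)     # log link, as in Poisson/gamma GLMs
assert np.isclose(theta[(0, 0)], np.log(2.0))  # mean of {1, 3}
assert np.isclose(theta[(1, 1)], np.log(9.0))  # mean of {10, 8}
```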
To our knowledge, this approach cannot be directly extended to one or more continuous explanatory variables, since the explicit MLE is only available for categorical explanatory variables. Several ways of bypassing this issue are possible. In some situations, continuous explanatory variables are never used directly but are split into levels, typically based on a regular grid or on quantile values. In experimental situations, continuous explanatory variables are also sometimes recorded with a finite precision, such as concentration, temperature or volume, so that a categorization of the variable is practicable. For instance, this is the case for insurance pricing, where the policyholder age is discretized, see, e.g., Denuit et al. (2019). In ecotoxicology, Szöcs and Schäfer (2015) use discrete treatment values in mg/L to explain mayfly larvae counts. Alternatively, if modelers want to benefit from the explicit MLE for GLMs when building the tree, continuous variables can be used as partitioning variables. Finally, the original glmtree model (without explicit solutions) can of course be used in situations where continuous explanatory variables should remain numeric, so that a single coefficient is fitted in order to obtain monotonicity of the response with respect to that variable.

Computational gain of closed-form solutions for GLM trees
In this section, we compare the performance, both in terms of accuracy and computation time, of model-based recursive partitioning methods against a CART model on simulated datasets. Following the literature (Zeileis et al. 2008), the superiority of GLM-based trees is expected, but our approach should significantly reduce the runtime.

Dataset simulation
We build various test datasets following the approach used in Wood (2011) for benchmarking generalized additive models. Precisely, for m explanatory variables, we first generate independent uniformly distributed random variables (x_{i,j})_{i,j}. Then, we simulate continuous independent variables Y_i, i = 1, ..., n, with mean μ_i = g^{-1}(η_i) for a linear predictor η_i defined as a sum of functions f_j of the explanatory variables, where the first five f_j are Simon Wood's smooth functions and the last ten f_j are similar nonlinear test functions defined in "Appendix B.1". The distributions considered are given in Table 5. For these distributions, a dispersion φ is needed to fully characterize the distribution; see the last column of Table 5 for the φ used when generating datasets and Table 1 for the link between the shape parameter and φ. Various sample sizes ranging from n = 100 to n = 50,000 are considered, as well as different numbers of explanatory variables, m = 10 or m = 20. The naming convention for simulated datasets is given in Table 6; for instance, contIG2 refers to a simulated dataset with m = 20 continuous explanatory variables and inverse Gaussian responses.
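A minimal sketch of this simulation design (a single placeholder nonlinear function stands in for the sum of the f_j, and the log link is assumed): uniform covariates, mean μ_i = g^{-1}(η_i), and gamma responses with dispersion φ so that Var(Y_i) = φ μ_i²:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_gamma_dataset(n=1000, m=10, phi=0.5):
    """Sketch of the simulation design: uniform covariates, a nonlinear
    predictor eta_i, gamma responses with mean mu_i = exp(eta_i)."""
    X = rng.uniform(size=(n, m))
    eta = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2   # placeholder for sum of f_j
    mu = np.exp(eta)                                   # inverse log link
    shape = 1.0 / phi                                  # gamma: Var = phi * mu^2
    y = rng.gamma(shape, mu / shape)                   # mean mu, dispersion phi
    return X, y

X, y = simulate_gamma_dataset()
```

An inverse Gaussian response would be simulated analogously by swapping the sampling step while keeping the same mean structure.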

Implementation details
Numerical illustrations in this paper have been carried out in the R statistical software. Core functions, such as glm, are implemented in the stats package, but some CRAN packages are also used. We use ctree from the package partykit by Hothorn and Zeileis (2015) to fit recursive-partitioning-based conditional inference trees (CTREE), and rpart from the package rpart by Therneau and Atkinson (2019) for the CART algorithm. New functions have been written to compare GLM-based trees: in the following, GLM tree refers to the output of glmtree, a special case of mob in partykit, and explicit GLM tree refers to the new estimation procedure for GLM trees proposed in Sect. 2. In addition, the function lmtree from package partykit refers to linear model trees, which are equivalent to GLM trees with a Gaussian distribution; recall that a closed-form solution exists for this last method. The variables (x_{i,j})_{i,j} can play the role of both explanatory and partitioning variables in the GLM tree algorithm.
All these variables are considered as partitioning variables in Sects. 3.3 and 3.4, i.e., GLMs are fitted with an intercept. GLMs with one explanatory variable are also considered in Sect. 3.4; they are estimated with our Explicit GLM Tree model if the explanatory variable is categorical, and with GLM Tree if the explanatory variable is continuous. All analyses are done with the default parameters, with the following exceptions: all trees are constrained to have a minimum of 7 observations per terminal node, 20 observations per internal node, and a maximum tree depth of 9. Furthermore, we also consider three options of ctree obtained by modifying the way the distribution of the test statistic is computed: Teststatistic refers to the raw statistic for computing the p values, while Bonferroni and Univariate correspond respectively to adjusted and unadjusted p values from the asymptotic distribution. It is worth noting that the tuning of parameters has not been optimized; a wise choice of these parameters could improve the prediction quality of the different models.

Runtime comparison between GLM tree and explicit GLM tree
In this section, we focus on the gain in runtime between the original GLM tree method and our new approach. To examine the superiority of our approach in terms of computation speed, we perform the comparison simulation described above and generate several samples of different sizes. Here, performance is assessed based on computation time only, since the predictions are identical. Figure 1 displays average computation times (over 10 runs) as a function of sample size from n = 10^2 to 10^4 for ten continuous explanatory variables. Subfigures 1a, b correspond to the gamma distribution while Subfigures 1c, d are for the inverse Gaussian distribution. The results clearly show the substantial gain in computation time in all subfigures of Fig. 1 as the size of the simulated sample increases. For sample sizes n > 10^3, explicit GLM tree is particularly fast. Note also that comparisons cannot be done on samples of larger size because glmtree, which uses glm, does not converge when GLMs are fitted during the split point search. This advantage holds irrespective of the chosen distribution or the number of explanatory variables.
Time ratios for other distributions and link functions are displayed in Fig. 6 in "Appendix B.2". Note that for some distributions, such as normal or gamma, GLM tree does not converge for a non-canonical link, so that a comparison with explicit GLM tree is unfortunately not possible. Nonconvergence issues for GLM tree are due to some combinations of split points and explanatory variables for which glm is badly initialized and does not converge. Time ratios for the mean, median and standard deviation of the runtime have similar patterns, leading to the same conclusion.

Comparison with benchmark models
Since CART models are often preferred to unbiased tree methods even though their accuracy may be worse, our aim is to show how the trade-off between computation time and predictive accuracy evolves with the introduction of our approach. We compare the performance in accuracy, complexity and computation time of our version of GLM trees with an explicit solution (Explicit GLM Tree) with intercept-only nodes; GLM trees with one explanatory variable (GLM Tree reg), which admit an explicit solution when this variable is categorical; the original approach based on IWLS (GLM Tree); CTREE with different test specifications; and CART. These performance indicators are assessed by performing a bootstrap cross-validation based on 100 bootstrap replications with replacement for each dataset. The sample size is fixed to n = 1000. The different models are trained on each bootstrap sample and validated based on the leave-out cross-validation estimates. The accuracy of each model is calculated through the root mean squared error (RMSE) since the response is continuous. The complexity of a tree is assessed as its number of terminal nodes. Finally, we assess the computation times in seconds for each of these methods. Figure 2 displays the distribution of the predictive RMSE of the considered methods over 4 simulated datasets introduced in Table 6, namely CategG1, CategIG1, ContG1 and ContIG1. For almost all datasets, the Explicit GLM Tree and GLM Tree with inverse Gaussian or gamma distributions produce the smallest median RMSE compared to the other specifications. This result is expected since inverse Gaussian or gamma distributions are used for generating the response variable. The performance of GLM Tree reg with one explanatory variable is fairly similar for CategG1 and CategIG1, except for the gamma distribution, and for ContIG1. We observe however that the Explicit GLM Tree performs better for ContG1.
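The bootstrap cross-validation scheme described above can be sketched as follows (a simplified illustration with a hypothetical mean-only model, not the paper's actual benchmark code): each model is trained on a bootstrap sample drawn with replacement and scored by RMSE on the left-out observations.

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_rmse(X, y, fit, predict, n_boot=100):
    """Train on each bootstrap sample (drawn with replacement) and evaluate
    RMSE on the left-out (out-of-bag) observations."""
    n = len(y)
    rmses = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # bootstrap indices
        oob = np.setdiff1d(np.arange(n), idx)     # left-out observations
        if oob.size == 0:
            continue
        model = fit(X[idx], y[idx])
        err = y[oob] - predict(model, X[oob])
        rmses.append(np.sqrt(np.mean(err ** 2)))
    return np.array(rmses)

# toy check with a mean-only "model"
X = np.zeros((50, 1))
y = rng.normal(size=50)
scores = bootstrap_rmse(X, y, fit=lambda X, y: y.mean(),
                        predict=lambda m, X: np.full(len(X), m))
```

The distribution of the resulting scores over replications is what the RMSE boxplots summarize.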
We also note that the Explicit GLM Tree with Gaussian distribution outperforms CART and the CTREE specifications for all samples, even if the difference with the latter is visually small for CategIG1 and ContG1. Differences are observed between the results of LM Tree and those of Explicit GLM Tree with Gaussian distribution. These discrepancies, which appear in a material way only for the ContIG1 dataset, are explained by the splitting criterion, which differs between a linear model and a generalized linear model: the first is based on the residual sum of squares while the second is based on the log-likelihood of the model. Furthermore, the interquartile range of the predictive RMSE does not show that Explicit GLM Tree models provide more reliable predictions than the other algorithms. Table 7 focuses on the average over the 100 bootstrap samples of the predictive complexity. Despite its lesser performance, we clearly note that CART produces less complex trees compared to the other models. As a result, this model retains an important interest compared to its competitors to the extent that CART provides more interpretable and simpler-to-explain results. The complexities of Explicit GLM Tree and LM Tree are close to each other, but much higher than that of CART. For instance, on the CategG1 dataset, the average complexity is around 3.3 times the number of terminal nodes of CART. Hence, the choice of tree model can be assessed in terms of the trade-off between performance and complexity. The complexity of an Explicit GLM Tree and a GLM Tree reg is similar with categorical variables, while a GLM Tree reg is clearly less complex in the presence of one continuous explanatory variable and continuous partitioning variables.
This comes from the fact that the trees of GLM Tree reg are grown with one less partitioning variable than Explicit GLM Tree, which clearly reduces the number of possible nodes for trees built on continuous partitioning variables. For all datasets, the Bonferroni option is the most parsimonious without a loss of performance, see Fig. 2. Regarding complexity, Explicit GLM Trees are less complex than this specification on datasets with categorical partitioning variables (CategG1 and CategIG1) and more complex on datasets with continuous partitioning variables. The mean runtime is summarized in Table 8. The results of Explicit GLM Tree and GLM Tree are presented separately since the computation times obtained are different. Similarly to Sect. 3.3, it can be observed that Explicit GLM Tree largely outperforms GLM Tree. Concerning CTREE, the performance gap with GLM trees differs depending on whether the partitioning variables are continuous or categorical. For the CategG1 dataset for example, the models with a gamma distribution and an inverse Gaussian distribution have average computation times of 0.658 s and 0.612 s respectively, against 0.541 s for a CTREE with the Bonferroni option. The improvement is thus substantial compared to the use of glmtree, for which the computation time is 2.45 times and 2.65 times higher on average. A comparable result is observed with the CategIG1 dataset. Furthermore, it is also interesting to compare the performance of Explicit GLM Trees and GLM Trees reg for categorical and continuous variables. With categorical variables, GLM Trees reg admit an explicit formula and clearly outperform Explicit GLM Tree. For continuous data, although the Explicit GLM Tree algorithm relies on an explicit formula, it does not outperform GLM Tree reg due to its higher complexity.
For continuous partitioning variables, the superiority of CTREE over Explicit GLM Tree is more marked; this is the case, for example, for the models with a gamma distribution on the ContG1 dataset. In this situation, the gap in terms of computation time between the two types of algorithms is amplified since the number of split points is greater than for categorical partitioning variables. The gain compared to glmtree due to the use of a closed-form formula remains however very interesting. In addition, we can observe that the computation time of CART is much lower than those of the other algorithms. The rpart package is indeed known for its speed of execution as it relies on C code, whereas the partykit package is entirely developed in the R language, which largely explains the important differences in computation times.

Performance on BostonHousing and Hitters datasets
We now assess the performance of Explicit GLM Tree, CTREE and CART on two public benchmark datasets: BostonHousing and Hitters from the R packages mlbench (Leisch and Dimitriadou 2021) and ISLR (James et al. 2017). In these datasets, most variables are continuous, including the response variable. Hence, we consider three distributions for GLM trees (Gaussian, inverse Gaussian and gamma) with 100 bootstrap replications as in Sect. 3.2. Figure 3 displays the distribution of the predictive RMSE. For BostonHousing, CTREEs are the best, yet the performance of the other algorithms is very comparable; the response variable for this dataset is the median value of owner-occupied homes. For Hitters, LM Tree is the best, yet all models perform similarly. Since the shape of the density of the response variable is fairly close to that of a gamma distribution, the specification with this law is the best among Explicit GLM trees. Table 9 displays both the average and the median over the 100 bootstrap samples of the predictive complexity and the runtime. The most complex trees are observed for CTREE on both datasets. Again, we observe that the Bonferroni option produces the most parsimonious CTREEs. The least complex trees are produced by CART for BostonHousing, and by Explicit GLM Trees as well as LM Trees for Hitters. Regarding mean and median runtimes, Explicit GLM Trees clearly outperform GLM Trees on both datasets. Since most partitioning variables are continuous, all CTREEs are more efficient than Explicit GLM Trees, not due to differences in complexity but rather to the way the Explicit GLM Tree algorithm is implemented in R. Finally, we note that the CART algorithm remains the fastest method due to its very efficient implementation compared to the other models (Table 9).

Random forest based on GLM Trees
In this section, we assess the benefits of our closed-form approach by implementing a random forest-type method for the GLM tree model, called GLM forest hereafter. This analysis is conducted on simulated datasets, where we compare the performance of our approach against two classical random forest competitors: the function cforest from package partykit, which fits random forests based on CTREE, and the function randomForest from the R package randomForest (Liaw and Wiener 2002).

Datasets and implementation details
We use the datasets ContG2 and ContIG2 defined in Sect. 3.1, that is, datasets with m = 20 continuous explanatory variables, gamma or inverse Gaussian responses and n = 1000 observations, see Table 6.
We consider three versions of GLM forest by choosing Gaussian, gamma and inverse Gaussian distributions (with canonical link) as for GLM tree in Sect. 3. Regarding cforest, we also consider three versions depending on the way the distribution of the test statistic is computed: Teststatistic refers to the raw statistic, Bonferroni and Univariate correspond respectively to adjusted and unadjusted p values as in the previous section. Finally, the randomForest function from package of the same name is used with the default arguments, except for those listed below.
In order to make a fair and reliable comparison, we control forest-type algorithms using the following arguments:
- ntree, the number of trees;
- mtry, the number of input variables randomly sampled as splitting candidates at each node;
- maxdepth, the maximum depth of the tree. An infinite value means that no restriction is applied to tree sizes.
For randomForest, the terminal node number is capped at 2^maxdepth since there is no argument for the maximum depth. We consider two metrics for this benchmark: the usual root mean squared error (RMSE) as well as the mean absolute error (MAE), which proves to be a discriminating metric.
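The two benchmark metrics can be written down directly (a trivial sketch; the toy values below just illustrate that a single large error inflates RMSE more than MAE):

```python
import numpy as np

def rmse(y, yhat):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(yhat)) ** 2)))

def mae(y, yhat):
    """Mean absolute error, less sensitive to large individual errors than RMSE."""
    return float(np.mean(np.abs(np.asarray(y) - np.asarray(yhat))))

y = [1.0, 2.0, 4.0]
yhat = [1.0, 2.0, 1.0]   # one large error of 3
```

Because RMSE squares the errors, the two metrics can rank algorithms differently, which is why MAE is reported alongside RMSE below.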

Benchmark accuracy results
In Fig. 4, we display the RMSE and MAE of the seven algorithms considered in this paper as a function of the maximum tree depth, with ntree=500 and mtry=5. Regarding RMSE, all algorithms have similar performance on both datasets. The three cforests are very close to one another, leading to the conclusion that the choice of test statistic has no impact on RMSE, although the computation times differ, as we will see. Depending on the distribution used, GLM forests perform well for small trees on ContIG2 in Subfigure 4b and for large trees on ContG2 in Subfigure 4a. On these datasets, the randomForest algorithm proves to be less competitive, with no situation where it is the best, even for unconstrained trees (infinite maxdepth).
Regarding MAE, we observe a large difference between randomForests and the other algorithms. This alternative metric reveals a visible difference between the three cforests, where cforest Test Stat. appears as the worst of all. In Subfigure 4d, the best algorithm is always a GLM forest, whereas in Subfigure 4c cforest Bonferroni is the best for medium-size trees. Overall, Gaussian GLM forests perform particularly well, even though the datasets are generated with non-Gaussian distributions.
This analysis was also performed on the datasets ContG1 and ContIG1, with fewer simulated explanatory variables, with similar conclusions. We also tested our seven algorithms on a wide range of tree numbers (ntree) and partitioning variable numbers (mtry), for which similar patterns of RMSE and MAE are observed.

Runtime and complexity analysis
The analysis of runtime and complexity gives another point of view on our benchmark. The complexity of a forest for a given run is the sum of the numbers of terminal nodes over all its trees. The maximum complexity for a given maxdepth is thus ntree × 2^maxdepth. In Table 10, we display the mean and median complexity as well as computation times (in seconds) for maxdepth=8.
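The complexity measure and its cap can be sketched as follows (hypothetical helper names, not from the paper):

```python
def forest_complexity(terminal_nodes_per_tree):
    """Complexity of a forest for one run: sum of terminal node counts over trees."""
    return sum(terminal_nodes_per_tree)

def max_complexity(ntree, maxdepth):
    """Upper bound: a binary tree of depth maxdepth has at most 2**maxdepth leaves."""
    return ntree * 2 ** maxdepth

example = forest_complexity([12, 9, 15])   # a 3-tree toy forest
cap = max_complexity(500, 8)               # bound for ntree=500, maxdepth=8
```

Comparing the observed complexity to this cap indicates how strongly the depth constraint actually binds for each algorithm.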
By far, the randomForest algorithm produces the most complex forests, with a complexity ten times higher than that of the cforest Teststatistic algorithm, which is in turn ten times higher than the other complexities. In conjunction with Fig. 4, at maxdepth=8, we can conclude that the most complex forests are not necessarily the most accurate. Regarding runtimes, the randomForest package is well optimized with some routines in C and Fortran, unlike partykit, and our implementation of GLM forests is developed in R. Therefore, the superiority of randomForest was expected, since the different algorithms are not compared on the same playing field in terms of language. Yet, it should be noted that the greater complexity of randomForest leads to a larger number of operations, which means that an implementation of GLM forests in an equivalent language could lead to similar runtime performance.

Performance on BostonHousing and Hitters datasets
Finally, we assess the performance of GLM forests, cforests and randomForests with ntree=500, mtry=5 and maxdepth=8 on BostonHousing and Hitters, see Sect. 3.5. As in Sect. 4.1, we consider three versions of GLM forests by choosing Gaussian, gamma and inverse Gaussian distributions (with canonical link), as for GLM trees. Regarding cforests, we restrict ourselves to the single version based on Teststatistic, which is the most effective. Figure 5 displays the predictive RMSE, whereas Table 11 displays the complexity and the runtime over 100 bootstrap replications. For BostonHousing, the most complex forests, produced by randomForest, are the best, followed by cforest. Again, there is no advantage in using non-Gaussian distributions for this dataset, and the predictive performance of GLM forests is slightly worse and more unstable than that of cforests. For Hitters, the RMSE boxplots of all algorithms are closer, yet randomForests perform slightly better at the price of higher complexity. It is thus interesting to note the parsimony of cforests and GLM forests.

Conclusion
This paper focuses on GLM-based trees, a particular case of MOB employed for partitioning GLMs. This algorithm is of particular interest for identifying relevant subgroups in data on which a GLM is locally estimated. We propose a new fast algorithm for growing GLM trees based on the use of explicit MLE solutions for GLMs with categorical explanatory variables. The algorithm proposed in Sect. 2 is flexible and can be applied to any probability distribution in the one-parameter exponential family and any link function. The main method presented in the paper relies on binary trees with intercept-only nodes, but we also show that it can be combined with multiway splits and the use of several categorical explanatory variables in the GLM. Our approach also has explicit objective functions for a weighted MLE and for transformed response variables.
We demonstrate on simulated and empirical datasets that this approach greatly increases the computation speed of the GLM-based tree model compared to the features originally offered by the R package partykit. In particular, we show in Sect. 3 that the proposed algorithm is three times faster than the original for a Bernoulli response and five times faster for continuous non-Gaussian responses. Furthermore, our numerical applications on continuous simulated datasets confirm the effective out-of-sample performance compared to other tree-based approaches such as RPART or ctree, with both categorical and continuous partitioning variables. Hence, this increases the interest of these models for applications where they are used intensively, as for ensembles of decision trees. Despite the use of closed-form estimators, our approach, developed in R and based on the partykit package, remains however much slower than the rpart and ctree functions, indicating that additional efforts would be needed to speed up this method with C or Fortran code, both available in the base R API.
Furthermore, this new algorithm makes it possible to derive a GLM forest algorithm in Sect. 4. Although this was also proposed by Garge et al. (2013), the use of closed-form estimators reduces the gap in computation time with other classical random forest algorithms such as randomForests and cforests. Similarly to cforests, GLM forests produce less complex trees than randomForests, which can be interesting for the purpose of interpretability. In numerical applications based on simulated datasets, our GLM forest approach performs better than its competitors in terms of MAE and is similar in terms of RMSE, depending on the chosen maximum tree depth. For the two chosen empirical datasets, GLM forest does not perform better than classical random forests, but offers a good compromise between performance and complexity.
This approach opens up pathways for future research. Other types of distributions could be studied in the framework of model-based trees, for instance inflated distributions such as the zero-inflated Poisson, see, e.g., the CORE models of Liu et al. (2019), two-parameter exponential families such as the beta and negative binomial distributions, or heavy-tailed distributions as in Farkas et al. (2021). In addition, we believe this method can be applied to other ensemble decision tree algorithms, such as boosted trees, or to prediction rule ensembles (Fokkema 2020), where the features of MOBs are of interest for interpretable rule generation.