A closed-form alternative estimator for GLM with categorical explanatory variables

Abstract The parameters of generalized linear models (GLMs) are usually estimated by the maximum likelihood estimator (MLE), which is known to be asymptotically efficient. But the MLE is computed using a Newton-Raphson-type algorithm, which is time-consuming for a large number of variables or modalities or for a large sample size. An alternative closed-form estimator is proposed in this paper for the case of categorical explanatory variables. Asymptotic properties of the alternative estimator are studied. The performance of the proposed estimator, in terms of both computation time and asymptotic variance, is compared with that of the MLE for a Gamma distributed GLM.


Introduction
Generalized linear models (GLMs) deal with regression models in which the distribution of the response variable belongs to the exponential family and whose natural parameter is, up to a link function, a linear combination of the explanatory variables. It is worth recalling that the distributions of exponential type include most of the classical discrete distributions (binomial, Poisson, etc.) and continuous distributions (Gaussian, gamma, inverse Gaussian, etc.); see, e.g., McCullagh and Nelder (1989).
The unknown parameters of GLMs are generally estimated by the maximum likelihood estimator (MLE). The asymptotic normality and efficiency of the MLE for GLMs were studied in Fahrmeir and Kaufmann (1985). In a general framework, the MLE has no closed form and is numerically computed by a gradient-descent-type method known as Fisher scoring, which can be equivalently written as an iteratively re-weighted least squares (IWLS) method; see for example McCullagh and Nelder (1989, Chapter 2). This computation method is time-consuming for large datasets or high dimensions.
The case of categorical explanatory variables is singular due to the non-identifiability of the model, so the aforementioned results do not apply directly. Generally, linear identifiability conditions are imposed via a contrast matrix. In this setting, closed-form MLEs have been exhibited in Brouste, Dutang, and Rohmer (2020) for any distribution of exponential type under a supplementary assumption on the contrast matrix. Thanks to the closed-form formula, the parameters are quickly estimated and the methodology can handle large datasets and/or a large number of modalities.
The alternative estimator proposed in this paper is motivated by theoretical arguments as well as practical considerations. Indeed, the MLE has no explicit solution in the case of two or more explanatory variables used only as single effects. Our proposed closed-form estimator makes it possible to deal with large datasets or a large number of explanatory variables while avoiding the time-consuming IWLS algorithm.
Dealing with only categorical explanatory variables is of particular interest. For instance, in the insurance industry, policy pricing uses a finite number of risk groups relying on categorical explanatory variables. Typically, motor insurance ratings rely on vehicle classification with a large number of modalities; e.g., the dataset pg17trainpol in CASdatasets by Dutang and Charpentier (2020) exhibits this feature with 1023 vehicle models for 101 vehicle brands. Another appealing example is Kadarmideen, Thompson, and Simm (2000), where the authors study disease, fertility and milk production in dairy cattle and consider 7530 levels for the herd-year-season effect. In both situations, having a fast and efficient estimator is essential.
In most situations, explanatory variables are used as single effects. Indeed, McCullagh and Nelder (1989) use single-effect models with the binomial distribution (Chapter 4) and with the Poisson distribution (Chapter 6); Lindsey (1997) also uses single-effect models for the Bernoulli distribution in Chapter 2. This is particularly true in the actuarial field, where Denuit, Hainaut, and Trufin (2020, Chap. 4) model claim count distributions, and Wuethrich and Merz (2021, Chap. 5) model claim size distributions with single-effect models.
Whatever the domain of application, finding closed-form solutions is of great value. In the choice-modeling literature, Lipovetsky and Conklin (2014) provide analytical formulae for the multinomial logit model which permit inference on the characteristics of the model's quality (standard errors of the utilities, choice probabilities, the residual deviance and pseudo-R-square); see also Lipovetsky (2015) for the special case of logit regression. Marley, Islam, and Hawkins (2016) show that Lipovetsky and Conklin (2014)'s analytical closed-form solutions and Frischknecht et al. (2014)'s normalized best-worst scores provide better fits to the aggregate choices in several best-worst choice data sets.
Furthermore, Lipovetsky, Liakhovitski, and Conklin (2015) rely on Lipovetsky and Conklin (2014) to derive analytical formulas for the sample size needed in Best-Worst Scaling studies; these formulas make it possible to test the underlying assumptions, demonstrate the soundness of the approach, and yield a rough empirical rule for determining the sample size in Best-Worst Scaling.
When building decision trees, having a closed-form estimator is also of particular interest. In that respect, Dutang and Guibert (2022) propose a fast estimation procedure based on Brouste, Dutang, and Rohmer (2020) to fit GLM trees, which also makes it possible to fit GLM forests in practical applications.
The paper is structured as follows. Notations are introduced in Sec. 2. In Sec. 3.1, the general setting and the asymptotic properties of the alternative estimator are presented. In Sec. 3.2, the case of two categorical explanatory variables (and also of one explanatory variable) is described with several examples. We also focus on the particular case of single effects only. A simulation analysis is performed in Sec. 4 in order to benchmark the proposed estimator against the IWLS algorithm in terms of computation time and to compare the asymptotic variances on a Gamma distributed GLM with single effects only.

Notation
In the following, vectors of $\mathbb{R}^p$ or $\mathbb{R}^n$ are bolded; the index $i$ is reserved for the observations, while the indexes $j$, $k$, $l$ are used for the explanatory variables.
In the GLM setting, the sample $\mathbf{Y} = (Y_1, \dots, Y_n)$ is composed of independent random variables valued in $\mathcal{Y} \subset \mathbb{R}$. Each response $Y_i$ belongs to a family of probability measures of one-parameter exponential type with respective parameters $\lambda_1, \dots, \lambda_n$. That is, the log-likelihood $L$ associated to the statistical experiment is
$$\log L(\lambda_1, \dots, \lambda_n \mid \mathbf{Y}) = \sum_{i=1}^{n} \left( \frac{\lambda_i Y_i - b(\lambda_i)}{a(\phi)} + c(Y_i, \phi) \right), \qquad (1)$$
where $a : \mathbb{R} \to \mathbb{R}$, $b : \Lambda \to \mathbb{R}$ and $c : \mathcal{Y} \times \mathbb{R} \to \mathbb{R}$ are known real-valued measurable functions and $\phi$ is the dispersion parameter; see, e.g., McCullagh and Nelder (1989, Section 2.2).
In the GLM setting, the parameters $\lambda_1, \dots, \lambda_n$ of Equation (1) depend on a finite-dimensional parameter $\vartheta \in \Theta \subset \mathbb{R}^m$. A direct computation of theoretical moments leads to
$$\mathbb{E}(Y_i) = b'(\lambda_i), \qquad \operatorname{Var}(Y_i) = a(\phi)\, b''(\lambda_i).$$
In the following, we consider deterministic exogenous variables $x_1, \dots, x_n$, with $x_i = (x_{i,1}, \dots, x_{i,p}) \in \mathbb{R}^p$ for $i = 1, \dots, n$, representing the categorical variable encoding. That is, $x_{i,j}$ is binary, typically indicating a label for a categorical variable (see Sec. 3 for the proper encoding of categorical explanatory variables).
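The moment identities above can be checked numerically. The following sketch (an illustration, not part of the paper's development) assumes the Poisson family, for which $b(\lambda) = e^\lambda$ and $a(\phi) = 1$, and verifies $\mathbb{E}(Y) = b'(\lambda)$ and $\operatorname{Var}(Y) = b''(\lambda)$ by finite differences:

```python
import math

def b(lam):
    # Cumulant function of the Poisson family: b(lambda) = exp(lambda)
    return math.exp(lam)

def first_derivative(f, x, h=1e-6):
    # Central finite-difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-4):
    # Central finite-difference approximation of f''(x)
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

lam = 0.7                          # a natural-parameter value
mu = first_derivative(b, lam)      # E[Y] = b'(lambda)
var = second_derivative(b, lam)    # Var(Y) = a(phi) b''(lambda), with a(phi) = 1

# For the Poisson family both moments equal exp(lambda)
assert abs(mu - math.exp(lam)) < 1e-6
assert abs(var - math.exp(lam)) < 1e-5
```

The same check works for any exponential-type family once $b$ is specified.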
Let $g$ be a twice continuously differentiable and bijective function from $b'(\Lambda)$ to $\mathbb{R}$. GLMs are defined by assuming the following relation between the expectation $\mathbb{E}_\vartheta Y_i$ and the predictor:
$$g(\mathbb{E}_\vartheta Y_i) = \eta_{x_i},$$
where the $\eta_{x_i}$ are the linear predictors. The parameter $\vartheta \in \Theta \subset \mathbb{R}^p$ is unknown and has to be estimated, and $g$ is the so-called link function. Consequently, the (bijective) function $\ell = (b')^{-1} \circ g^{-1}$ maps linear predictors to parameters as $\lambda_i(\vartheta) = \ell(\eta_{x_i})$, which can be summarized as
$$\mathcal{X} \longrightarrow \mathcal{D} \overset{\ell}{\longrightarrow} \Lambda,$$
where $\mathcal{D}$ is the space of linear predictors and $\mathcal{X}$ the possible set of values of $x_i$ for $i \in \{1, \dots, n\}$. In the special case $\ell(t) = t$, we speak of a canonical link function.
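The composition $\ell = (b')^{-1} \circ g^{-1}$ can be made concrete on a small example. The sketch below assumes the gamma family with the convention $b(\lambda) = -\log(-\lambda)$ (so $b'(\lambda) = -1/\lambda$ on $\lambda < 0$) and the log link $g = \log$; under these assumptions $\ell(\eta) = -e^{-\eta}$, and the code checks that composing forward again recovers the mean:

```python
import math

# Gamma family cumulant (illustrative convention): b(lambda) = -log(-lambda), lambda < 0
b_prime = lambda lam: -1.0 / lam        # b'(lambda), the mean function
g_inv   = lambda eta: math.exp(eta)     # inverse of the log link g = log

# ell = (b')^{-1} o g^{-1} maps a linear predictor to the natural parameter
ell = lambda eta: -math.exp(-eta)

for eta in (-1.0, 0.0, 0.5, 2.0):
    # Consistency check: b'(ell(eta)) must equal g^{-1}(eta)
    assert abs(b_prime(ell(eta)) - g_inv(eta)) < 1e-12
```

With the canonical (inverse) link of the gamma family, $\ell$ would instead reduce to a linear map, illustrating the special case $\ell(t) = t$ up to sign conventions.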
Consequently, the log-likelihood (1) can be rewritten as $\log L(\vartheta \mid \mathbf{Y}, x_1, \dots, x_n)$. Since the covariates are supposed to be categorical in this paper, the linear predictors $\eta_{x_i}$ take $d$ distinct values, namely $\eta_1, \dots, \eta_d$. We consider the unique $d \times p$ matrix $Q$ such that $(\eta_1, \dots, \eta_d)^T = Q\vartheta$. In the case of categorical explanatory variables, the model is not identifiable and linear identifiability conditions are imposed. Precisely, the MLE solves the optimization problem
$$\widehat{\vartheta}_n = \underset{\vartheta \,:\, R\vartheta = 0}{\arg\max}\ \log L(\vartheta \mid \mathbf{Y}, x_1, \dots, x_n),$$
where $R$ is a contrast matrix that ensures that the model is identifiable. The identifiability condition is equivalent to the positive definiteness of the matrix $Q^T Q + R^T R$. As mentioned in the introduction, the computation of the MLE is time-consuming when it is not explicit.

General setting
Consider the case where all $m$ explanatory variables are categorical, that is, for $j = 1, \dots, m$ every observation $(x_i^{(j+1)})_i$ takes values in a finite set $\{v_{j,1}, \dots, v_{j,d_j}\}$, and $x_i^{(1)} = 1$ is the intercept. Assuming values are unordered, $x_i^{(j+1)}$ needs to be encoded using binary dummies as follows:
$$x_i^{(j+1),k} = 1_{\{x_i^{(j+1)} = v_{j,k}\}}, \qquad k \in \{1, \dots, d_j\}.$$
These binary dummies can be used both in single-effect models and in cross-effect models. To take all possible GLM settings into account, we consider a GLM whose predictor (5) sums the intercept, the single effects and the cross effects of the variables, where $g$ is the link function and the indexes satisfy $j_i \in \{2, \dots, m+1\}$ and $k_j \in \{1, \dots, d_j\}$ for $j = 2, \dots, m+1$. In the $m$-variables case, the unknown parameter vector $\vartheta$ stacks the intercept, the single-effect coefficients and the cross-effect coefficients.
We introduce a specific notation based on Kronecker products of ones vectors $\mathbf{1}$ and identity matrices $I$ to denote cross effects of variables $j_1 < \dots < j_k$, where $j_i \in \{2, \dots, m+1\}$, with the convention that $\mathbf{1}_0 = I_0 = 1$. When $j_i, j_{i+1}$ are consecutive integers, i.e., $j_{i+1} = j_i + 1$, ones vectors disappear and simplifications occur: $I_{d_{j_i}} \otimes I_{d_{j_{i+1}}} = I_{d_{j_i} d_{j_{i+1}}}$. There are two special cases: $M_m^{()} = \mathbf{1}_{d_{m+1} \cdots d_2}$ is the ones vector and $M_m^{(2,\dots,m+1)} = I_{d_2 \cdots d_{m+1}}$. Using (7), we define the $Q$ matrix in (8) by concatenating these Kronecker-product blocks column-wise; Table 1 gives examples of such $Q$ matrices for 1, 2 and 3 explanatory variables. We identify the (unique) $m$-tuple $(k_2, \dots, k_{m+1})$ for the $i$th observation; hence the linear predictor $\eta_{x_i}$ of Equation (5) simplifies such that the $i$th observation belongs to the $k_j$th modality of the $j$th variable for $j = 2, \dots, m+1$. Naturally, there are redundancies in the linear predictors and we must impose a contrast matrix $R \in \mathbb{R}^{q \times p}$ in order to identify the unknown parameters, namely $R\vartheta = 0$. We assume in the following that the matrix $R$ is such that $Q^T Q + R^T R$ is positive definite and $\operatorname{rank}(R) = q$. This contrast matrix also allows one to build every sub-model of a GLM. In particular, the single-effect case (no interaction) can be considered. In the R statistical software (R Core Team 2021), glm removes the first modality of each variable and the associated interaction terms (which corresponds to the second example of Table 3 below for two categorical explanatory variables).
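The Kronecker-product construction above can be sketched with numpy for $m = 2$ variables. The modality counts $d_2 = 2$, $d_3 = 3$ and the block layout of $Q$ are illustrative choices following the pattern of Table 1:

```python
import numpy as np

d2, d3 = 2, 3  # modality counts of the two categorical variables

ones = lambda d: np.ones((d, 1))

# Cross-effect blocks (7): identity for an included variable, ones vector otherwise
M_none = np.kron(ones(d2), ones(d3))     # intercept column, the ones vector
M_2    = np.kron(np.eye(d2), ones(d3))   # single effect of variable 2
M_3    = np.kron(ones(d2), np.eye(d3))   # single effect of variable 3
M_23   = np.kron(np.eye(d2), np.eye(d3)) # interaction of variables 2 and 3

# Q concatenates the blocks column-wise: one row per distinct linear predictor
Q = np.hstack([M_none, M_2, M_3, M_23])

assert Q.shape == (d2 * d3, 1 + d2 + d3 + d2 * d3)
# Consecutive variables: I ⊗ I collapses to a single identity, I_{d2 d3}
assert np.array_equal(M_23, np.eye(d2 * d3))
```

Each of the $d_2 d_3$ rows of $Q$ selects exactly the coefficients entering one distinct linear predictor, which makes the redundancy (and hence the need for a contrast matrix $R$) visible: $Q$ has more columns than rows.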
To present our alternative estimator, using Equation (9), we introduce the absolute frequencies $m_{k_2,\dots,k_{m+1}}$ over the $m$ explanatory variables and the cross-effect average responses $\bar{Y}_{k_2,\dots,k_{m+1}}$ defined in (11). For the sake of clarity, we consider $m_{k_2,\dots,k_{m+1}} > 0$; nevertheless, the results can be adapted to the case where $m_{k_2,\dots,k_{m+1}} \geq 0$ in the same way as in Brouste, Dutang, and Rohmer (2020). In this paper, we propose the closed-form estimator defined by
$$\widetilde{\vartheta}_n = (Q^T Q + R^T R)^{-1} Q^T g(\bar{\mathbf{Y}}) = Q_R\, g(\bar{\mathbf{Y}}), \qquad (12)$$
where $Q$ is defined in (8) and $g(\bar{\mathbf{Y}})$ is the vector of $g$-transformations of the average responses (11). Below, we exhibit the asymptotic properties of the alternative estimator $\widetilde{\vartheta}_n$. It can be quickly computed by solving a linear system and can therefore be used to handle large datasets (see Sec. 4). We also investigate situations in which the alternative estimator is the MLE, i.e., $\widetilde{\vartheta}_n = \widehat{\vartheta}_n$.
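The estimator amounts to solving the linear system $Q\vartheta = g(\bar{\mathbf{Y}})$, $R\vartheta = 0$ through its normal equations. A minimal numpy sketch, with illustrative choices throughout (one categorical variable with an intercept, the first-level contrast, the log link, and made-up average responses):

```python
import numpy as np

# One categorical variable with d modalities plus an intercept (bottom of Table 1)
d = 3
Q = np.hstack([np.ones((d, 1)), np.eye(d)])   # d x (1 + d)
R = np.zeros((1, 1 + d))
R[0, 1] = 1.0                                  # contrast: remove the first level

g = np.log                                     # link function (illustrative)
Ybar = np.array([1.5, 2.0, 3.0])               # per-modality average responses (11)

# Closed-form estimator: solve (Q^T Q + R^T R) theta = Q^T g(Ybar)
theta = np.linalg.solve(Q.T @ Q + R.T @ R, Q.T @ g(Ybar))

# The solution reproduces every transformed average and satisfies the contrast
assert np.allclose(Q @ theta, g(Ybar))
assert np.allclose(R @ theta, 0.0)
```

No iterative reweighting is involved: one matrix factorization of a $p \times p$ system replaces the whole IWLS loop, which is what drives the timing results of Sec. 4.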

Main results
Before stating the theorems, using Equation (9), we introduce the vector of theoretical expectations $\mu_{k_2,\dots,k_{m+1}}$ in (13). We also define the theoretical probabilities $p_{k_2,\dots,k_{m+1}}$ as the limits of $m_{k_2,\dots,k_{m+1}}/n$, with $p_{k_2,\dots,k_{m+1}} \in (0,1)$ (14). Theorem 1 gives the asymptotic distribution of $\widetilde{\vartheta}_n$, whereas Theorem 2 gives sufficient conditions under which the alternative estimator is the MLE, i.e., $\widetilde{\vartheta}_n = \widehat{\vartheta}_n$.

Theorem 1. Regardless of $q$, the proposed estimator $\widetilde{\vartheta}_n$ is strongly consistent. Furthermore, $\widetilde{\vartheta}_n$ is asymptotically normal, namely
$$\sqrt{n}\,\big(\widetilde{\vartheta}_n - \vartheta\big) \xrightarrow{d} \mathcal{N}\big(0,\; Q_R\, \Sigma\, Q_R^T\big), \qquad (15)$$
where $\Sigma$ is the diagonal matrix whose diagonal elements are functions of $\mu_{k_2,\dots,k_{m+1}}$ defined in (13) and $p_{k_2,\dots,k_{m+1}}$ defined in (14).
Proof. The proof is postponed to Appendix A.2.
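The asymptotic normality in Theorem 1 follows the classical delta-method route: the central limit theorem for the cell averages combines with the smoothness of $g$. As a hedged sketch of the mechanism (here $\sigma^2_{k_2,\dots,k_{m+1}}$ denotes the response variance in the corresponding cell; the exact diagonal entries of $\Sigma$ are those stated in the theorem):

```latex
\sqrt{n}\left(g(\bar{Y}_{k_2,\dots,k_{m+1}}) - g(\mu_{k_2,\dots,k_{m+1}})\right)
\xrightarrow{d}
\mathcal{N}\!\left(0,\;
  g'(\mu_{k_2,\dots,k_{m+1}})^2\,
  \frac{\sigma^2_{k_2,\dots,k_{m+1}}}{p_{k_2,\dots,k_{m+1}}}\right),
\qquad
\sqrt{n}\left(\widetilde{\vartheta}_n - \vartheta\right)
= Q_R\,\sqrt{n}\left(g(\bar{\mathbf{Y}}) - g(\boldsymbol{\mu})\right)
\xrightarrow{d}
\mathcal{N}\!\left(0,\; Q_R\,\Sigma\,Q_R^T\right).
```

The middle equality uses $\vartheta = Q_R Q \vartheta = Q_R\, g(\boldsymbol{\mu})$, valid under the constraint $R\vartheta = 0$.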
Remark 1. It is worth emphasizing that, due to the non-identifiability and the imposed linear constraints, the matrix $Q_R \Sigma Q_R^T$ in (15) is singular. Let us define the matrix $R_D = R_{-m} - R^{(2,\dots,m+1)} Q_{-m}$, with $R_{-m}$ and $Q_{-m}$ the matrices obtained by removing the $d_2 \cdots d_{m+1}$ last columns of $R$ and $Q$ respectively, and $R^{(2,\dots,m+1)}$ the matrix made of only the $d_2 \cdots d_{m+1}$ last columns of $R$. Let $q_{\min}$ be the minimal number of conditions required to obtain identifiability for the GLM (5), that is $q_{\min} = p - \prod_{j=2}^{m+1} d_j$; note that $q = \operatorname{rank}(R) \geq q_{\min}$.

Theorem 2. For $q = q_{\min}$, the identifiability condition is equivalent to the invertibility of the matrix $R_D$. In that case, $\widetilde{\vartheta}_n = \widehat{\vartheta}_n$.

Proof. The proof is postponed to Appendix A.3.

Two and one categorical explanatory variable(s)
We now focus on the case of two categorical explanatory variables, for which the alternative estimator is generally not the MLE. Equation (5) simplifies to
$$g(\mathbb{E}_\vartheta Y_i) = \vartheta^{(1)} + \vartheta^{(2)}_k + \vartheta^{(3)}_l + \vartheta^{(2,3)}_{k,l}, \qquad (17)$$
when the $i$th observation belongs to the $k$th modality of the first variable and the $l$th modality of the second. Using Table 1, the $Q$ matrix concatenates the blocks $M_2^{()}, M_2^{(2)}, M_2^{(3)}, M_2^{(2,3)}$. The contrast matrix $R = (R_{-2}, R^{(2,3)})$ has a general row number $q \geq 1 + d_2 + d_3$ with $\operatorname{rank}(R) = q$; the lower bound is $q_{\min} = 1 + d_2 + d_3$.

Corollary 1. For $q = q_{\min}$, the identifiability condition is given in Table 2, and in that case $\widetilde{\vartheta}_n = \widehat{\vartheta}_n$. Regardless of $q$, the proposed estimator $\widetilde{\vartheta}_n$ is asymptotically normal, where $\Sigma$ is the diagonal matrix whose diagonal elements are given in Table 2.
Below, we exhibit two examples to illustrate the previous result. Example 1 presents the usual contrasts for two explanatory variables in cases where the MLE differs from the closed-form estimator: an over-contrasted matrix $R$ which removes the interactions of the model, and then one which removes both the interactions and the second variable.
Example 1. Let us consider the case of model (17) without interaction; the contrast matrix then removes the interaction terms. Using the block-matrix inversion results of Lu and Shiou (2002) together with Lemma B.1, one shows that the alternative estimator $\widetilde{\vartheta}_n$ fails to be the MLE because $R\widetilde{\vartheta}_n \neq 0$. We perform a simulation analysis in Sec. 4 to compare asymptotic variances and computation times between the proposed alternative estimator and the asymptotically efficient MLE. An exception is the canonical Gaussian case with a balanced design, i.e., $m_{1,1} = \dots = m_{d_2,d_3}$, for which the alternative estimator is still the MLE.
We finish this section with the case of a single categorical explanatory variable. The linear predictor (5) simplifies to an intercept and a single sum, and the $Q$ matrix is given at the bottom of Table 1. A corollary similar to Corollary 1 can be derived: the minimum constraint number is $q = 1$, and the identifiability condition and the variance components are given at the bottom of Table 2. In that case, the alternative estimator is the MLE, as demonstrated by Brouste, Dutang, and Rohmer (2020). The formulas of the matrices used for $\widetilde{\vartheta}_n$ under the three usual contrasts (no intercept, no first level, zero-sum) are given in Table 4, while the proofs are placed in Appendix A.4.
Simulation analysis

In our simulations, we consider a GLM with a gamma distribution. Namely, we assume a sample $Y_1, \dots, Y_n$ of independent variables following a gamma distribution with shape parameter $k > 0$ and rate parameter $\theta_i > 0$; the associated log-likelihood is given by (1). We consider two explanatory variables $x_i^{(2)}$ and $x_i^{(3)}$ with $d_2$ and $d_3$ modalities. Given a parameter value $\vartheta$, we assume a zero-sum condition on the coefficients, which corresponds to a zero-sum contrast matrix $R$. The simulation procedure for a given sample size $n$ consists of simulating the explanatory variables, then simulating the gamma variables and finally estimating the MLE $\widehat{\vartheta}_n$ and the proposed estimator $\widetilde{\vartheta}_n$. The procedure is summarized as follows:
1. simulate the explanatory variables;
2. compute the linear predictors using the parameter values of Table 5;
3. compute the rate parameters of the gamma distribution;
4. repeat:
(a) simulate the gamma responses;
(b) estimate the GLM with the two methods;
(c) return the computation times, the MLE $\widehat{\vartheta}_n$ and the proposed estimator $\widetilde{\vartheta}_n$;
end repeat;
5. compute the variance of the two estimators.
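The simulation steps above can be sketched with numpy as follows. All concrete values (cell predictors, shape parameter, modality counts) are illustrative and are not those of Table 5; the inverse link $g(x) = 1/x$ is used, so the gamma mean in a cell is $\mu = 1/\eta$ and the rate is $k/\mu = k\eta$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d2, d3 = 200_000, 2, 3
k = 2.0                                   # gamma shape parameter (illustrative)

# Illustrative linear predictors eta_{k2,k3} per cell, for the inverse link
eta = np.array([[0.5, 1.0, 2.0],
                [1.0, 2.0, 4.0]])

# 1. simulate the two categorical explanatory variables
x2 = rng.integers(0, d2, size=n)
x3 = rng.integers(0, d3, size=n)

# 2.-3. per-observation predictor and gamma rate: mu_i = 1/eta_i,
#       so rate = k / mu_i = k * eta_i (shape/rate parametrization)
eta_i = eta[x2, x3]
y = rng.gamma(shape=k, scale=1.0 / (k * eta_i))

# 4. per-cell averages Ybar feed the closed-form estimator through g(Ybar)
ybar = np.array([[y[(x2 == a) & (x3 == b)].mean() for b in range(d3)]
                 for a in range(d2)])

assert np.allclose(1.0 / ybar, eta, rtol=0.05)   # g(Ybar) ≈ eta for large n
```

From the cell averages, the closed-form estimator is then obtained by the single linear solve of (12), while the MLE requires running IWLS on all $n$ observations.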
We first present the computation-time results for two link functions of the gamma distribution, $g(x) = 1/x$ and $g(x) = \log(x)$. The computation time displayed is the average time over $M = 5$ runs for two sample sizes, $n = 10^5$ and $n = 10^6$, and $d \in \{5, 10, 15, 20, 25, 30, 35, 40, 45, 50\}$. The parameter values are given in Table 5 and the dispersion parameter is fixed to $\phi = 8$. Computation times for both the MLE (denoted by IWLS) and the proposed estimator (denoted by explicit) are displayed in Figure 1 for the inverse link. We observe that the IWLS has a computation time which increases almost linearly with the modality number $d$. In comparison, the proposed estimator's computation time is almost constant and significantly lower. Table 6 shows the ratio of the computation time of $\widehat{\vartheta}_n$ to that of $\widetilde{\vartheta}_n$. Irrespective of the sample size, this ratio increases from 6 to 67 as $d$ increases from 5 to 50. When considering the log link function, we observe similar results, although the computation time of $\widehat{\vartheta}_n$ is more erratic and higher than for the canonical link; see Figure 2 and Table 6.
We now turn our attention to the asymptotic variances for the two link functions of the gamma distribution, focusing first on the inverse link $g(x) = 1/x$. Here, we consider $M = 10000$ replicates, sample sizes $n = 10^3, 10^4$, and $d_2 = 2$ and $d_3 = 3$. The true parameter vector is chosen to guarantee the positivity of the parameter $\theta_i$ for the inverse link function; see Table 7 for the parameter values of the gamma distribution used for the asymptotic variances.
Again, the dispersion parameter is fixed to $\phi = 8$. Figure 3 displays the distribution of errors of $\widetilde{\vartheta}_n$ for $n = 1000$ with the inverse link function. Figure 3(a) shows a small bias of $\widetilde{\vartheta}_n$ when estimating the intercept, which is not present for the other coefficients. The asymptotic variances of the MLE and the proposed estimator are very close except for $\vartheta^{(2),1}$: indeed, Figure 3(b) shows a narrower distribution for the MLE than for the proposed estimator. Similar conclusions can be drawn for $n = 10000$ and/or the log link function, as well as for other distributions.

Conclusion
A closed-form estimator for GLMs with categorical explanatory variables has been presented. It is a fast computable alternative to the MLE, in particular in the practical case of GLMs with single effects only. The asymptotic properties of this estimation procedure have been studied.
The closed-form estimator avoids the IWLS algorithm, which is time-consuming for a large number of variables or modalities. Numerical illustrations quantify the performance of our explicit estimator against the MLE in different GLM examples.
In order to address the asymptotic inefficiency of the proposed estimator (compared to the MLE), the Le Cam one-step procedure could be investigated in further research.
A. Proofs of Section 3
Hence the identifiability condition is equivalent to $R^T R + Q^T Q$ being positive definite.

A.2. Proof of Theorem 1
Consider the empirical average vector $\bar{\mathbf{Y}}$, which converges almost surely to $\boldsymbol{\mu}$. Multiplying $g(\bar{\mathbf{Y}})$ by $Q_R$ and using $g(\boldsymbol{\mu}) = Q\vartheta$, then using $R\vartheta = 0$ and replacing $Q_R Q \vartheta = \vartheta$, leads to the desired result.

A.3. Proof of Theorem 2
In the general case of m categorical explanatory variables, the log-likelihood (3) becomes

Figure 1. Computation time for gamma response with inverse link function (average over 5 runs).

Figure 2. Computation time for gamma response with log link function (average over 5 runs).

Table 1. Examples of Q matrix for 1, 2 or 3 variables.

Table 3. Three contrasts for two variables and full-rank R matrix.

Table 5. Parameter values of the gamma distribution used for computation times.

Table 6. Time ratio of IWLS over explicit methods for sample size n and modality number d.