|Home | About | Journals | Submit | Contact Us | Français|
The generalized linear mixed-effects model (GLMM) is a popular paradigm to extend models for cross-sectional data to a longitudinal setting. When applied to modeling binary responses, different software packages and even different procedures within a package may give quite different results. In this report, we describe the statistical approaches that underlie these different procedures and discuss their strengths and weaknesses when applied to fit correlated binary responses. We then illustrate these considerations by applying these procedures implemented in some popular software packages to simulated and real study data. Our simulation results indicate a lack of reliability for most of the procedures considered, which carries significant implications for applying such popular software packages in practice.
Longitudinal study designs have become increasingly popular in clinical trials and observational studies across all disciplines, as they capture both between-individual differences and within-subject dynamics. They also offer the opportunity to study more complex biological, psychological and behavioral hypotheses, especially those involving changes over time such as causal treatment effects. Since longitudinal study designs create serial, or within-subject, correlations over repeated assessments, statistical methods for cross-sectional data analysis must be generalized to account for such correlated observations.
One popular paradigm to handle this extension is the generalized linear mixed-effects model (GLMM). This approach explicitly models the within-subject correlation using random effects, or latent random variates . Since the likelihood function arising from GLMM is generally quite complex, it is difficult to directly maximize this function except for the special linear models case for continuous outcomes. As a result, different estimates have been proposed and implemented by major statistical packages to provide inference about model parameters. For example, in SAS , which is arguably the most popular software package for data analysis, there are two major procedures that provide support for fitting GLMM. In R , another popular package for data analysis, especially among academic investigators, there are at least three packages for inference from GLMM. Thus, for many practitioners, it is quite confusing to select an appropriate procedure to use for their data at hand. A more important question is whether these procedures all provide valid inference.
In this paper, we discuss the major algorithms that drive the different procedures within as well as across different packages and investigate the performance of these procedures when applied to binary responses. We chose to examine the binary response since our experiences indicate this response type seems to mostly accentuate the differences between the different procedures, although our considerations apply equally well to other response types such as count responses. In Section 2, we give a brief review of GLMM for modeling binary responses. In Section 3, we discuss major software packages for fitting GLMM. In Section 4, we use simulated data and real data to examine the performance of these procedures/packages. In Section 5, we give our concluding remarks.
We start with a brief review of the generalized linear mixed-effects model (GLMM) for modeling longitudinal binary outcomes and then discuss two popular approaches for inference from GLMM.
Consider a sample of n subjects and let yi denote a binary response and xi a vector of explanatory variables (predictors/covariates) of interest from the ith subject (i = 1, 2, …n). The logistic regression model is given by:
where i.d. denotes independently distributed, Bernoulli(μi) the Bernoulli distribution with mean μi, and β the vector of model parameters. Inference about β using either maximum likelihood or estimating equations has been extensively discussed in the literature [e.g., 3, Chapters 4 and 9; 4, Chapter 2] and thus is not repeated for space consideration.
For a longitudinal study, each subject will be repeatedly assessed over time. For ease of exposition, we assume a design with m fixed assessment points and let yit and xit denote the binary response and explanatory variables of interest from the ith subject at the tth assessment time (t = 1, 2, …M). Jointly modeling the repeated outcomes yit is difficult since no valid joint distribution can be constructed either in general, or for a particular set of parameter values. For example, if yi = (yi1, …, yim) is an equicorrelated vector of binary responses with the same success probability and a symmetric distribution, then the lower bound on the correlation parameter depends on n and μi . If further restrictions are placed on the joint distribution, as in Bahadur , the correlation is also restricted from above [8,9].
Given such complexity, a popular approach to extend the model in (1) to account for dependence across repeated yit is to use random effects [5,10,11,12,13,14]. More precisely, consider the following extension of (1):
where N (μ, Σ) denotes a multivariate normal with mean μ and variance Σ, and zit a vector of predictors/covariates. Although bi is typically assumed to follow a multivariate normal as in (2), other types of distributions may also be considered . In addition, even conditioning on xit, zit, and bi, the yit’s may still be correlated. However, we focus on the normal random-effects and independent yit’s (conditional on xit, zit, bi) as specified in (2) in this report, as it is the most popular in applications.
For the standard logistic model in (1), the log-likelihood is readily evaluated and an objective function for estimation of the parameters is simple to construct based on the independence of the observations. For the GLMM-based extension in (2), the log-likelihood is also easily derived based on the independent yit’s upon conditioning on xit, zit and bi:
where ϕ (· | 0, Σb) denotes the probability density function of a multivariate normal with mean 0 and variance Σb. Although the log-likelihood seems simple in appearance, it is not a straightforward task to numerically compute the maximum likelihood estimate (MLE) because of the potentially high-dimensional integration when a large number of random effects (i.e., high dimension of bi) is used.
The two most popular approaches for estimation in GLMM as implemented in major software packages such as SAS and R are (1) approximating the log-likelihood function and (2) approximating the model. Algorithms in the second category are developed using linearization of or Laplace approximation to the model. They employ some type of expansions such as the Taylor series expansion to approximate the model by one based on pseudo-data with fewer nonlinear components. The process of computing the linear approximation must be repeated several times until some criteria indicate lack of further progress [15,16]. The fitting methods based on linearization typically involve two levels of iterations. The GLMM is first approximated by a linear mixed-effects model based on current values of the covariance parameter estimates, and the resulting linear mixed-effects model is then fit, which is itself an iterative process. On convergence, the new parameter estimates are used to update the linearization, which results in a new linear mixed-effects model. The iterative process terminates when the difference in parameter estimates between successive linear mixed model fits falls within a specified tolerance level. Note that the Laplace approximation is a special case of the Gauss-Hermite quadrature discussed below, when there exists only one quadrature point.
The advantages of the linearization approach include a relatively simple form of the linearized model that typically can be fit based only on the mean and variance in the linearized form. This approach can fit models with a large number of random effects, crossed random effects, multiple types of subjects, and even correlated yit’s after conditioning on xit, zit and bi. However, as it does not maximize the underlying log-likelihood function, this approach produces estimates with unknown asymptotic properties, except in the special linear mixed-effects model case for which this approach does produce the MLE . In addition, algorithms implementing this approach can fail at both levels of the double iteration scheme.
In contrast, the integral approximation approach aims to directly approximate the log-likelihood in (3) and maximize the approximated function [13,14]. Various techniques have been proposed to compute the approximation including Newton and Gauss-Hermite quadratures, Monte Carlo integration, and Markov Chain Monte Carlo. The advantage of this alternative is to provide an actual objective function for optimization, albeit it is an approximated log-likelihood. Nonetheless, this enables likelihood ratio tests and likelihood-based fit statistics. Further, unlike the linearization approach whose approximation accuracy and thus the asymptotic properties of the estimates are limited by the type of models fit (e.g., linear, logistic, etc.), the approximation to the log-likelihood under this approach can be improved to any degree by increasing the precision of numerical integration, at least in principle (see Section 4.1 for such a dependent relationship in some procedures from simulated data). Thus, algorithms based on the integral approximation are expected to provide better estimates than their linearization-based alternatives. In particular, since they can get arbitrarily close to the MLE as the precision of numerical integration increases, estimates obtained under this approach have similar nice large sample properties such as consistency and asymptotic normality and efficiency. One drawback of the integral approximation is the difficulty in getting a good approximation when there are many random effects and/or nested random effects in the model. In Section 4, we compare the performance of the different approaches as implemented in the two most popular software, SAS and R, using simulated data.
In this section, we discuss major procedures in SAS and R for inference about the GLMM model in (2).
Two major procedures in SAS for fitting GLMM are GLIMMIX and NLMIXED. The primary difference between the two lies in the estimation approaches employed. While GLIMMIX implements the linearization approach based on the Laplace approximation, NLMIXED employs the integral approximation to the log-likelihood to provide inference. NLMIXED approximates the log-likelihood by integral approximation through an adaptive Gaussian quadrature when carrying out the dual quasi-Newton optimization. The objective function for the NLMIXED procedure is the marginal log likelihood in (2) obtained by integrating out the random effects from the joint distribution of responses and random effects using quadrature techniques. As noted earlier, this approach is expected to yield estimates that can be made arbitrarily close to the MLE as the approximation precision to the log-likelihood improves. The NLMIXED procedure not only generates true log-likelihood fit statistics that can be used to compare nested models, but also permits greater flexibility to accommodate user-defined likelihood functions to implement other statistical models. A major limitation of NLMIXED is that the yit’s in (2) cannot be correlated after controlling for xit, zit and bi.
GLIMMIX, on the other hand, fits GLMM based on the approach of linearization . This allows multiple random effects, nested and crossed random effects, multiple cluster types, and within our context, a random structure to accommodate correlated yit’s in (2) even after controlling for xit, zit and bi. However, as noted earlier, the procedure only generates Wald-type test statistics and cannot perform likelihood-based tests such as the likelihood ratio statistic due to the absence of a true objective function for the overall optimization. GLIMMIX implements the restricted maximum likelihood (REML) method, which has been shown to produce better estimates than maximum likelihood (ML) when the number of higher-level units is small . In addition, GLIM-MIX also supports sandwich variance estimates, whereas NLMIXED supports only model-based standard errors. The latter option has been shown to provide more robust estimates in the special case of linear mixed-effects model in the presence of missing data when the assumed parametric distributions failed to adequately fit the data . However, a major problem with GLIMMIX is that it does not maximize the underlying log-likelihood function, producing estimates with unknown asymptotic properties except in the special linear model case. This flaw is particularly problematic for modeling binary responses as we demonstrate using simulated data in Section 4.
Note that initial parameter values are specified differently between the two procedures. Initial values must be manually set for NLMIXED prior to running the procedure. Within our context of binary outcomes, such values may be generated by fitting the data first using generalized estimating equations [4,9] via PROC GENMOD. In contrast, GLIMMIX expects no such user input and uses a double iteration scheme to generate initial values from an iteratively derived approximated linear mixed-effects model.
There are also multiple procedures (or packages as they are called in R) available for fitting GLMM in R. One popular package is lme4 (version 0.99875-8 was used in the simulation study), which implements the Gauss-Hermite quadrature to approximate the log-likelihood using numerical integration. It defaults to the Laplace approximation when only one quadrature point is used (the default option of lme4) . Another available package is ZELIG, which was proposed to serve as a common framework for statistical analysis and software development built around the R language . For binary responses, this function uses the same computing engine as in lme4 to provide estimates, and thus yields quite similar results as the latter package. A third option is the glmmML package . Like lme4 and ZELIG, glmmML offers integral approximations to the log-likelihood using the Gauss-Hermite quadrature, in addition to the Laplace approximation. Thus, like its SAS counterpart NLMIXED, lme4, ZELIG and glmmML can generate true log-likelihood fit statistics to provide improved inference about model parameters. For small samples, the Wald statistic based on the MLE may not provide reliable inference as the sampling distribution is likely to be poorly approximated by a (multivariate) normal distribution. The likelihood statistic, however, may still provide good inference since it is computed based on the height rather than the entire distribution of the log-likelihood statistic. As in the case of NLMIXED, a primary limitation of the integral approximation approach is that the yit’s in (2) must be uncorrelated after controlling for xit, zit and bi, making it impossible to model correlated errors above and beyond those accounted for by the random effects.
We illustrate our considerations with simulated as well as real study data. We investigate the performance of each procedure/package within R and SAS so we can compare the accuracy of estimates and inference across the different procedures/packages. All the codes for the simulation and real study data are available from the authors upon request. In all the examples, we set the type I error level at α = 0.05. We start with simulated data.
We simulated data from a GLMM model with a single explanatory variable within a longitudinal data setting containing 3 assessments. To evaluate the performance of the procedures, we repeated the analysis for a range of sample size, n = 50, 100, 500, representing small, moderate and large samples. To obtain reliable estimates of model parameters and type I errors, we perform 1, 000 Monte Carlo (MC) replications for each simulation run.
We simulated the explanatory variable xi and the response yit based on the following GLMM:
where β0 = β1 = 1 and τ = 0.001, 0.5 and 2. For τ = 0.001, the within-subject correlation was very small and thus negligible, making the yit conditional on xi follow approximately three independent Bernoulli distributions. Thus, for small τ, the different procedures/packages should yield similar results. For τ = 0.5 and τ = 2, the within-subject correlations were about 0.2 and 0.5 (estimated based on the simulated data), respectively. We did not consider higher correlations in our simulations since they are unrealistic for most applications in our opinion. Further, our experience with the software packages seemed to produce unreliable estimates when fitting GLMM to data with extremely high correlations.
We considered the hypothesis, H0 : β1 = 1 vs. Ha : β1 ≠ 1, and simulated the yit’s from the model in (4) with different τ’s. For each model, we estimated the type I error based on the Wald statistic, is the element of the estimated asymptotic variance β corresponding to 1. This statistic at the m th MC replication (1 ≤ m ≤ 1000) will be denoted by . Then, the type I error rate for testing H0 was estimated by: , where q1,0.95 is the 95th percentile of .
Shown in Table 1, ,2,2, and and33 are the estimates of model parameters β and associated standard errors, averaged over 1,000 MC replications, and the type I error rates based on the different procedures (packages) in SAS (R) for the three sample sizes considered. For SAS GLIMMIX, both maximum likelihood (ML) and restricted maximum likelihood (REML) estimates were reported. For R, results were reported for both the linearization (labeled as "Laplace") and integral approximation (labeled as "Gauss-Hermite") approaches, with the latter results based on 20 quadrature points.
As expected, for small τ = 0.001, all the procedures and packages gave similar estimates of β and standard errors across the sample sizes. As τ increased to 0.5 and 2, results for the estimates and standard errors become quite discrepant across the different procedures (packages). Within SAS, there was downward bias in both the estimates of β and standard errors by GLIMMIX; the amount of bias reached almost 50% at τ = 2. In both cases of τ, NLMIXED provided good estimates.
As in the case of SAS, estimates of β and standard errors based on R packages began to differ as τ increased. Results from lme4 and ZELIG were closer to each other than those from glmmML, which is not surprising, given that lme4 and ZELIG were based on the same computing algorithms as noted earlier. Estimates of β and standard errors from glmmML were downwardly biased, though the patterns of bias were more difficult to decipher for lme4 and ZELIG. The integral approach based on the Gauss-Hermite approximation seemed to reduce the amount of bias in the estimated β’s and the standard errors, especially for lme4 and ZELIG. It is interesting to note that although the Gauss-Hermite approximation in these packages employs the same integral approximation approach to the log-likelihood as its SAS NLMIXED counterpart, it does not seem to provide more accurate estimates (than its Laplace counterpart) for these packages.
Estimates of type I error rates were quite variable across the different procedures and functions as well. For τ = 0.001, all type I error rates were closer to the nominal value 0.05, especially for n = 500, though those from SAS GLIMMIX and R glmmML has small downward bias, while the ones from lme4 and ZELIG showed some upward bias. For τ = 0.5 and 2, all type I error rates increased substantially, except for those from NLMIXED that remained close to 0.05. It is quite clear that with the exception of NLMIXED, none of the procedures would provide acceptable type I error levels for inference about the null H0 : β1 = 1 when the within-subjects correlation is 0.2 or greater.
Note that during the review of this manuscript, one reviewer suggested that we reran the simulation model within the same setting, except for increasing the magnitude of the β’s, to see if larger (absolute) values of β’s would play any role in the bias. We found that the biases and type I errors did seem to depend on the magnitude of β’s. For example, shown in Tables 4, ,55 and and66 are the estimates, standard errors and type I errors for three different set values of the β’s, β0 = β1 = 0, 1.5 and 2. As seen, the bias in the type I error does increase, as the values of β’s move further away from 0. The bias in the type I error showed a similar pattern of increase when β’s were shifted away from 0 to the left.
Note also that the standard error estimates are generally smaller for the SAS GLIMMIX and the R procedures compared to the SAS NLMIXED, especially for GLIMMIX. However, smaller standard error estimates may not be a good thing when parameter estimates are highly biased. As estimates from some procedures such as those based on the Laplace approximation have unknown asymptotic properties, it is more important to examine the bias in the parameter estimates and in the type I error. Highly biased estimates with smaller standard estimates are the worst combination since it not only leads to wrong conclusions, but does so with greater statistical power.
We also applied the different procedures/packages to a multi-center randomized clinical trial comparing two treatments for a respiratory illness . There were 111 subjects randomized to active (54) or placebo (57) treatment across two sites. The primary outcome was respiratory status (0=poor, 1=good), which was assessed at 4 visits. The study data was analyzed by Davis using distribution-free (or semi-parametric) longitudinal methods . In addition to the treatment conditions, he also included baseline respiratory status, study site, sex and age as covariates. There was no missing data in the outcome, the predictor (treatment conditions), and the covariates. Our reanalysis using GLMM included the same predictor and covariates as the fixed effects, and a random intercept.
Shown in Table 7 are the estimates, standard errors and p-values from fitting the GLMM using the different procedures/packages. The results from the R packages lme4 and glmmML were based on the Gauss-Hermite approach for integral approximation with 3 quadrature points, since there is no significant improvement using more quadrature points for these procedures (see Discussion for more details). Since ZELIG yielded almost identical results as lme4, we only reported the results for the latter package. Likewise, we only reported the results from the REML option of SAS GLIMMIX, since they showed slightly less bias than those from the ML method as observed in the simulation study.
The estimates of β and standard errors showed similar patterns as those from the simulation study. In particular, the standard errors from SAS GLIMMIX and R glmmML all had almost 50% downward bias (with NLMIXED as a reference standard), consistent with the amount of bias observed for these procedures in the simulation study. The standard errors from lme4 also seemed to have some downward bias. The differences in the p-values between NLMIXED and the two R packages are larger than those between NLMIXED and GLIMMIX. The downward bias in both the estimates and standard errors by GLIMMIX may have helped reduce the amount of upward bias in the p-values yielded by this SAS procedure.
We examined the performances of procedures/packages for fitting generalized linear mixed-effects models (GLMM) for correlated binary responses using the popular SAS and R statistical software packages. Unlike the special case of a linear mixed-effects model, maximum likelihood estimates (MLE) are difficult to compute and approximate estimates have been proposed and implemented in most statistical software packages such as SAS and R. As some of these estimates have unknown large sample properties, it is important to evaluate their performances to determine if they can provide reasonably good estimates and inference for practical purposes.
We reviewed two popular approaches for providing such approximate estimates as implemented in SAS and R, and discussed the pros and cons between the two approaches. Although the linearization approach is flexible, and can accommodate a complex structure of random effects, it may not provide reliable estimates, since it is difficult to theoretically characterize the properties of such estimates. In particular, it is unclear whether such estimates are even consistent. In comparison, the integral-based approximation approach yields estimates that approximate the MLE, thereby providing not only consistent, but asymptotically normal estimates. The superior performance of the latter approach as demonstrated by the SAS NLMIXED in our simulation study confirms the theoretical behavior of the estimates from this approach. We are a bit surprised by the performance of the R lme4 and glmmML packages, as neither seems to yield comparable inference as its SAS NLMIXED counterpart given that it implements the same integral approximation approach, albeit using different algorithms.
To further investigate the robustness of inference against non-normal random effects, we also changed the bivariate normal random effects bi to a shifted bivariate-Gamma with the same variance structure, i.e.,
where SG(μ, Φ) denotes a Shifted bivariate Gamma with mean μ and variance Φ .
Shown in Table 8 are the simulation results from this revised model for τ = 0.5 and 2 based on a sample size n = 100 under 1,000 MC replications. Again, we used 3 quadrature points for the R procedures to save computing time. Although parameter estimates and type I errors were less affected for τ = 0.5, a large bias was obvious for both when τ = 2, even for NLMIXED. Along with the findings for normal random effects, these results seem to indicate that GLMM for binary outcomes is not robust against misspecification of random effects.
Judging from the results of our simulation study, we conclude that the SAS NLMIXED procedure provides the most accurate parameter estimates and inference (type I error) under correct model assumptions. Among the remaining procedures, estimates are generally biased, especially the type I errors, with the level of accuracy depending not so much on the number of quadrature points (for the procedures based on the integral approximation), but rather on the magnitude of the values of the parameters. For example, we varied the number of for the latter in the range of 3 and 20 for the R procedures, but failed to observe any improvement in type I errors when increasing the quadrature points from 3 to 20, except for a significant increase in computing time, with the latter increased by 15 times when using 20 over 3 quadrature points. Thus, for all practical purposes, it seems that 3 or 4 quadrature points is sufficient when using these procedures. However, as noted in Section 2.2, we should limit NLMIXED to models with a relatively small number of random effects to ensure accurate intergral approximation to the log-likelihood.
Unexpectedly, though quite interestingly as well, the levels of accuracy of these remaining procedures seem to depend on the magnitude of the values of the parameters, with larger (absolute) values yielding higher biases. Further investigations are needed to determine the sources of such a relationship so improvement may be made to these procedures. If these procedures have to be used in a given application for reasons such as those discussed in Section 3, we recommend the use of bootstrapped standard errors to improve inference reliability .
This research is supported in part by NIH grant U54 RR023480 and UL1-RR024160.
Conflict of Interests Statement
The authors have declared no conflict of interest.