A common class of models for longitudinal data are random effects (mixed) models. In these models, the random effects covariance matrix is typically assumed constant across subjects. However, in many situations this matrix may differ by measured covariates. In this paper, we propose an approach to model the random effects covariance matrix by using a special Cholesky decomposition of the matrix. In particular, we will allow the parameters that result from this decomposition to depend on subject-specific covariates and also explore ways to parsimoniously model these parameters. An advantage of this parameterization is that there is no concern about the positive definiteness of the resulting estimator of the covariance matrix. In addition, the parameters resulting from this decomposition have a sensible interpretation. We propose fully Bayesian modelling for which a simple Gibbs sampler can be implemented to sample from the posterior distribution of the parameters. We illustrate these models on data from depression studies and examine the impact of heterogeneity in the covariance matrix on estimation of both fixed and random effects.
Random effects (mixed) models are a class of models used frequently to model longitudinal data. They offer many advantages, including the ability to handle different observation times across subjects and to allow non-stationary covariance structures. In these models, little time is typically spent on modelling the random effects covariance matrix. In particular, examining whether this matrix is the same for all subjects or whether it differs depending on subject-specific characteristics is often neglected in the modelling. For discrete longitudinal data modelled using generalized linear mixed models, ignoring this heterogeneity can result in biased estimates of the fixed effects. For continuous longitudinal data, which will be our focus here, the standard errors for the fixed and random effects, and consequently the inferences, will be incorrect, the random effects will not be shrunk correctly, and prediction of subject-specific trajectories can be poor. In addition, incorrectly modelling the covariance structure in the presence of missing data can result in biased estimates of fixed effects.
Accounting for heterogeneity in covariance matrices has recently been discussed by several authors. In marginal models, Chiu et al. model the covariance matrix using a log matrix parameterization and obtain estimates using estimating equations. In non-linear mixed models, heterogeneous covariance structures are frequently used, but typically with the variance a function of the mean and constant correlations across subjects. Pourahmadi and Daniels develop a class of models they call dynamic conditionally linear mixed models in which the marginal covariance matrix is allowed to vary across individuals, but they consider the random effects covariance matrix to be constant across subjects. In the context of linear mixed models, Lin et al. examined heterogeneity in the within-individual variances, and Zhang and Weiss discussed heterogeneity in the random effects covariance matrix but mainly considered models that allow the entire matrix to differ by a multiplicative factor. Little work has been done on modelling the entire random effects covariance matrix.
Here, we propose an approach that allows all the parameters of the random effects covariance matrix to be modelled, not just the variances. Specifically, we will model the parameters that result from a modified Cholesky decomposition of the covariance matrix (this decomposition has also been called a modified Gaussian decomposition). This decomposition has been used previously to model the marginal covariance matrix of longitudinal observations on a subject [4, 8–10]. This decomposition results in parameters that can be easily modelled without concern about the positive definiteness of the resulting estimator, that have nice interpretations, and that allow for relatively easy model fitting (sampling from the posterior distribution). Some discussion of interpretation in the context of modelling the random effects covariance matrix will be given in Section 2.
We reanalyse a data set previously discussed in Pourahmadi and Daniels. The data were from a series of five depression studies conducted in Pittsburgh from 1982–1992. Patients were assigned to one of two active treatments and measured at baseline and then weekly for 16 weeks. Here, we examine the rate of improvement of, and dependence in, weekly depression scores over this 16-week period for the 549 subjects with no missing baseline covariates. Previous work investigated the time to recovery from depression and examined the rate of improvement in depression scores with a different class of models than those proposed here.
Preliminary analyses suggested that the random effects covariance matrix was not the same for subjects with different combinations of treatment (two levels: drug and psychotherapy versus psychotherapy only) and initial severity of depression (two levels: high and low). This motivated our modelling framework, in which we will model the parameters resulting from the modified Cholesky decomposition as a function of both drug (treatment) and severity.
In Section 2, we discuss the modified Cholesky decomposition of the covariance matrix, the interpretation of the parameters in this setting, and its use in modelling the random effects covariance matrix. We build models for the random effects covariance matrix on the depression data and discuss our results in Section 3. Conclusions and discussion comprise Section 4.
For each subject i, we observe an ni × 1 vector of responses, Yi. We assume the Yi follow a random effects model

Yi = Xiβ + Zibi + εi,  bi ~ N(0, Σi),  εi ~ N(0, τ2Ini)  (1)
where β is a p × 1 vector of fixed effects with corresponding design matrix Xi, bi is a q × 1 vector of random effects with corresponding design matrix Zi, and the random effects covariance matrix Σi is indexed by subject, i. The bi represent subject-specific deviations from an overall curve. These curves may be modelled using orthogonal polynomials or splines [12, 13]. We order the components of bi according to their corresponding basis functions. For example, for orthogonal polynomials, a natural ordering corresponds to overall average, linear, quadratic, cubic, etc. We can similarly order the basis functions for cubic splines. To model Σi, we use the modified Cholesky decomposition [8, 14]. We discuss this in detail next.
We decompose the random effects covariance matrix using the approach described in Pourahmadi. This decomposition results in a set of dependence parameters, the generalized autoregressive parameters (GARP), and a set of variance parameters, the innovation variances (IV). These parameters can be understood in the context of the dynamic model described below. First, we define b̂ik to be the linear least-squares predictor of bik based on its predecessors bi,k−1, …, bi1, and eik = bik − b̂ik to be its prediction error with variance σ2ik. Thus, we have

bik = ϕi,k1bi1 + ⋯ + ϕi,k,k−1bi,k−1 + eik,  k = 2, …, q  (2)
The modified Cholesky decomposition of Σi is defined by

TiΣiT′i = Di

where Di = diag(σ2i1, …, σ2iq) and Ti is the unit lower triangular matrix with −ϕi,kj as its (k, j)th entry. We refer to the ϕi,kj as the GARP and the σ2ik as the IV. The only constraint on these parameters for Σi to be positive definite is that the IV must be positive.
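As a concrete illustration (not from the paper, using a hypothetical 3 × 3 matrix), the GARP and IV can be recovered from any positive definite matrix via its standard Cholesky factor; a minimal sketch in Python/NumPy:

```python
import numpy as np

def garp_iv(sigma):
    """Modified Cholesky decomposition T @ sigma @ T.T = D.

    Returns (phi, iv): phi holds the GARP, with T[k, j] = -phi[k, j]
    for j < k, and iv = diag(D) holds the innovation variances.
    """
    C = np.linalg.cholesky(sigma)               # sigma = C C', C lower triangular
    T = np.diag(np.diag(C)) @ np.linalg.inv(C)  # unit lower triangular
    iv = np.diag(C) ** 2                        # D = diag(C)^2
    phi = -np.tril(T, k=-1)                     # GARP: negated strict lower part of T
    return phi, iv

# Hypothetical random effects covariance matrix
sigma = np.array([[2.0, 1.0, 0.5],
                  [1.0, 3.0, 1.0],
                  [0.5, 1.0, 4.0]])
phi, iv = garp_iv(sigma)
```

Note that the IV come out positive automatically, and T Σi T′ is exactly diagonal by construction.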
The GARP parameters characterize the dependence of the random effects and the IV parameters the variance. For example, consider using orthogonal polynomials up to the second order (quadratic) as we use in the example in Section 3. Define bi1 to be the random effect corresponding to the subject-specific deviation from the overall average, bi2 to be the random effect corresponding to the deviation from the overall linear trend, and bi3 to be the random effect corresponding to the deviation from the overall quadratic trend. Then (2) can be written in three parts:
bi1 = ei1
bi2 = ϕi,21bi1 + ei2  (3)
bi3 = ϕi,31bi1 + ϕi,32bi2 + ei3

where eik ~ N(0, σ2ik). The first equation corresponds to the marginal distribution of the random effect corresponding to the deviation from the overall average for subject i; the second to the conditional distribution of the linear random effect given the overall average; and the third to the conditional distribution of the quadratic random effect given the linear random effect and the overall average random effect. The GARP clearly model the dependence between the different components of the orthogonal polynomial as we add higher order components one at a time, and the IV model their variability as we add more complexity (higher order terms) to the model (σ2i3 quantifies how well we can predict the deviation from the overall quadratic trend, given the deviations from the overall linear trend and the overall average).
We model the GARP/IV parameters using linear and log link functions:

ϕi,kj = W′i,kjγ  (4)
log σ2ik = H′ikλ  (5)
where Wi,kj and Hik are q2- and q1-dimensional design vectors, respectively. These design vectors are used to model the GARP/IV parameters as functions of subject-specific covariates and/or to place structure on the ϕi,kj and σ2ik within a subject. For example, within subject i, log σ2ik might be linear in k, i.e. Hik = (1, k)′. For some examples of these structured GARP/IV models, see Pourahmadi and Daniels. Under this formulation, it is easy to allow only a subset of the parameters of Σi to depend on covariates since there are no concerns about positive definiteness.
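Under links of this form, a subject's covariance matrix can be assembled directly from the linear predictors, and it is positive definite for any coefficient values. A minimal sketch, with hypothetical design vectors containing an intercept and a single binary covariate (such as drug) and hypothetical coefficients:

```python
import numpy as np

def sigma_from_links(gamma, lam, w, h, q=3):
    """Assemble Sigma_i = T^{-1} D T^{-T} from GARP/IV link models.

    gamma, lam : coefficients of the linear (GARP) and log (IV) links
    w : (q*(q-1)/2, dim gamma) design rows for phi, in order (2,1), (3,1), (3,2)
    h : (q, dim lam) design rows for log innovation variances
    """
    phi = w @ gamma                        # linear link for the GARP
    log_iv = h @ lam                       # log link for the IV
    T = np.eye(q)
    T[np.tril_indices(q, k=-1)] = -phi     # -phi in the strict lower triangle of T
    D = np.diag(np.exp(log_iv))            # IV positive by construction
    Tinv = np.linalg.inv(T)
    return Tinv @ D @ Tinv.T

# Hypothetical subject on drug; every design row is (intercept, drug)
drug = 1
w = np.array([[1.0, drug]] * 3)
h = np.array([[1.0, drug]] * 3)
gamma = np.array([0.4, -0.2])  # hypothetical GARP coefficients
lam = np.array([0.5, 0.3])     # hypothetical log-IV coefficients
sigma_i = sigma_from_links(gamma, lam, w, h)
```

Because the IV are exponentiated and the GARP are unconstrained, no choice of γ or λ can produce a non-positive-definite Σi.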
We use a Gibbs sampler with Metropolis–Hastings steps to sample from the posterior distribution of (β, bi, τ2, γ, λ). The full conditional distributions of β, bi, γ, and τ2 are all known forms: normal for β, bi, and γ, and inverse gamma for τ2. To sample from the full conditional distribution of λ, we use a Metropolis–Hastings step with the normal approximation to the full conditional distribution as the candidate distribution. Details on the forms of the full conditionals and the Metropolis–Hastings sampler can be found in the Appendix. Sampling the parameters of the random effects covariance matrix is quite simple under this parameterization, as there is no concern about positive definiteness and the full conditional distribution for the dependence parameters has a closed form. Computations are often a major impediment to modelling the covariance matrix in non-standard ways.
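The Metropolis–Hastings step for λ can be sketched generically; here the normal approximation plays the role of a fixed (independence) candidate distribution, and the target density is a hypothetical stand-in for the full conditional of λ:

```python
import numpy as np

rng = np.random.default_rng(1)

def mh_independence_step(current, log_target, cand_mean, cand_cov):
    """One Metropolis-Hastings step with a fixed normal candidate distribution."""
    prop = rng.multivariate_normal(cand_mean, cand_cov)

    def log_q(x):  # log candidate density, up to an additive constant
        d = x - cand_mean
        return -0.5 * d @ np.linalg.solve(cand_cov, d)

    # Acceptance ratio corrects for the candidate not depending on `current`
    log_alpha = (log_target(prop) - log_target(current)) + (log_q(current) - log_q(prop))
    if np.log(rng.uniform()) < log_alpha:
        return prop, True
    return current, False
```

If the normal approximation were exact, every proposal would be accepted; a high observed acceptance rate (as reported in the Appendix) indicates the approximation is close to the true full conditional.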
The GARP/IV parameterization provides parameters that can easily be modelled without concern about the estimator being positive definite, that have a sensible interpretation, and that allow for simple computations. However, there are other parameterizations that can be considered. The log matrix and the eigenvalue/Givens angle parameterizations also pose no problems in relation to positive definiteness, but the resulting parameters can be difficult to interpret and model fitting (sampling from the posterior distribution) can be challenging. Parameterizing the matrix using the variances and correlations provides easily interpretable parameters, but is impeded by the restrictions on the variances and correlations for the matrix to be positive definite. This can be troublesome when trying to model these parameters parsimoniously and/or as a function of covariates.
There were a total of 549 subjects in the five depression studies. About 30 per cent (2840) of the possible measurements (a baseline measure plus 16 weekly measures per subject) were missing, mostly intermittently. This can be seen in Table I, where the sample sizes at each time (out of 549) are given. As in Pourahmadi and Daniels, we assume the data are missing at random (MAR). Previous analyses of similar studies and communication with the investigators involved in the study provided support for a MAR assumption. This assumption implies that missingness is explained by measured covariates in the model and/or observed responses on the subject prior to and/or after the missing visit. In addition, some of the missingness was by design, as some of the studies only measured patients every other week.
The scale used to quantify depression in these studies was the Hamilton Rating Scale for Depression (HRSD), which was measured at baseline and then weekly for 16 weeks; higher scores on this scale indicate more severe depression. For a detailed description of the psychotherapy treatment and the drugs used in these studies, we refer the reader to Thase et al. The sample means in Table I supported a quadratic trend for the HRSD trajectory over time.
The main questions of interest in this analysis were: (1) the effect of treatment (drug and psychotherapy versus psychotherapy only) on depression scores at week 16 and on the change from baseline; (2) the impact of initial severity of depression on the rate of improvement; and (3) whether drug and severity interacted in their effect on the rate of improvement.
The components of the fixed effects design vector were motivated by these questions and the apparent quadratic trend over time. For the ith patient at time j, xij is the 14 × 1 design vector, given by xij = (polyj(2), drugi × polyj(2), severityi × polyj(2), drugi × severityi × polyj(2), genderi, agei), where polya(k) is a kth order orthogonal polynomial in a. Drug was coded as 1 if the subject was on the drug/psychotherapy treatment and 0 if the subject was on the psychotherapy-only treatment; severity was coded as 1 for high severity and 0 for low severity; and gender was coded as 1 for male and 0 for female. The random effects design vector is set to zij = (polyj(2)); this implies each subject has their own HRSD trajectory.
Based on preliminary exploration of the data, we considered models for Σi which included drug and severity. In particular, we considered the models specified in Table II. The simple model is a standard random effects model. The Drug and Sev models allow the entire random effects covariance matrix to differ by drug and severity, respectively. The Drug × Sev model includes four separate random effects covariance matrices, one for each combination of the binary covariates drug and severity. These four models could be fit in SAS proc mixed using likelihood methods; see the Appendix for the syntax. The Drug + Sev and Drug + Sev* models are more parsimonious models that involve modelling all or some of the parameters of the modified Cholesky decomposition of the covariance matrix as an additive function of both drug and severity; these cannot be fit in SAS.
To select the best models, we use a combination of examining 95 per cent credible intervals for the relevant parameters and computation of an overall statistic of model fit. For the latter, we compute the deviance information criterion (DIC). The DIC has a form similar to the AIC: a goodness-of-fit term, the deviance evaluated at the posterior mean of the parameters, and a penalty term, two times the effective number of parameters, computed as the mean deviance minus the deviance evaluated at the posterior mean. Thus

DIC = Dev(θ̄) + 2pD

where θ̄ is the posterior mean of θ and pD = D̄ev − Dev(θ̄), where D̄ev is the posterior mean of the deviance. We define the deviance as Dev(θ) = −2 log f(y | θ) and use the same parameterization as in Pourahmadi and Daniels. Based on the form of the DIC, we prefer models with small values. The DIC has the advantage of being easy to compute using output from a Gibbs sampler. We refer the reader to the paper by Spiegelhalter et al. for a rigorous justification of this criterion.
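Given the deviance evaluated at each posterior draw and at the posterior mean, the DIC and pD are one-line computations; a sketch with hypothetical deviance values:

```python
import numpy as np

def dic(dev_draws, dev_at_post_mean):
    """DIC = Dev(theta_bar) + 2 * pD, with pD = mean(Dev) - Dev(theta_bar).

    dev_draws        : deviance evaluated at each posterior draw of theta
    dev_at_post_mean : deviance evaluated at the posterior mean of theta
    """
    p_d = np.mean(dev_draws) - dev_at_post_mean   # effective number of parameters
    return dev_at_post_mean + 2.0 * p_d, p_d

# Hypothetical deviance output from a short MCMC run
draws = np.array([110.0, 108.0, 112.0, 109.0, 111.0])
value, p_d = dic(draws, 107.0)  # mean(draws) = 110, so p_d = 3 and DIC = 113
```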
Posterior means and 95 per cent credible intervals for (γ, λ) appear in Table III and Table IV. It is clear that drug and severity are important explanatory factors for the covariance structure. In particular, the additive model for drug and severity appears to fit best. This is confirmed by the table of DIC (Table V) and examination of the 95 per cent credible intervals in Table III and Table IV (that is, whether the credible intervals for the parameters corresponding to the effects of drug and severity cover zero). The model with the smallest DIC was the one in which drug and severity enter the model additively. In addition, from the credible intervals for this model, it appears that the GARP did not depend on severity. We fit a simplified version of the additive model, which then had the smallest DIC; this was the model Drug + Sev* given in Table II, in which the GARP depend only on drug. In addition, given that the ϕi and σ2i were each only three-dimensional here, and there was no apparent structure, we did not pursue more parsimonious models.
The effective number of parameters, pD, was roughly what we would expect for all the models. For example, for the simple random effects model, pD = 21: the 14 parameters in β, plus the one τ2, plus six parameters in Σ. However, in some models the effective number was less than expected. This is a general phenomenon observed with the DIC and motivates the terminology 'effective' number of parameters.
The estimates of the GARP/IV parameters appear in Table III and Table IV. To examine the covariance structure of the random effects, we first focus on the simple model. The log(IV) decrease as we add more complexity to the model (linear, quadratic), but not in a linear fashion. The three GARP (dependence parameters) correspond to the regression parameters from regressing the linear random effect on the overall average random effect (ϕ21) and from regressing the quadratic random effect on the linear and overall average (ϕ31 and ϕ32, respectively); see the sequence of equations in (3). The posterior mean of ϕ21 is positive and the credible interval is above zero, which implies a significant positive relationship: as the overall average increases, the linear trend increases, i.e. becomes less steep, since the slopes are negative. We see a similar relationship between the quadratic trend and the overall average, after adjusting for the linear trend.
For the final model, the marginal variance of the deviation from the overall average significantly increases for an individual on drug and psychotherapy and for individuals with high initial severity. In addition, the conditional variances significantly increase for those with high initial severity. For the GARP, the effect of being on drug and psychotherapy versus only psychotherapy indicates a marginally significant weaker association between the deviations from the linear trend and the overall average and between the deviations from the quadratic and the overall average when adjusting for the deviation from the linear trend.
Figure 1 contains the posterior mean of the subject-specific trajectories for eight subjects in the study for all the models in Table II. The first four are subjects with small ni (three or four weekly observations) and the last four are subjects with large ni (16 or 17 weekly observations), with each set of four representing all combinations of drug and severity. It is clear from the figure that subject-specific trajectories can be sensitive to the model for the random effects covariance matrix. In particular, subjects 1 and 4 have considerable variability in their mean trajectories across models. In addition, the credible intervals for the random effects are sensitive to the model for the random effects covariance matrix. Table VI shows the posterior mean of the random effects and 95 per cent credible intervals. We observe differences in both the posterior means and the widths of the credible intervals when comparing the simple model to the final model. In particular, for subject 1 the credible intervals are considerably tighter for the final model, and for subject 4 they are considerably wider (relative to the simple model). We would argue that, since the random effects covariance matrix clearly depends on both drug and severity, the credible intervals for the final model are more accurate than the ones based on the simple random effects model.
Posterior means and 95 per cent credible intervals for β appear in Table VII for the simple random effects model and the best fitting random effects model. Given the large sample size, n = 549, it is not too surprising that there are only small differences in the posterior means. However, the 95 per cent credible intervals for most of the β in the Drug + Sev* model are narrower. Thus, we can see the potential increase in efficiency from properly modelling the covariance structure here. In particular, we observe the decreased width of the credible intervals for the main effects of drug and severity. In addition, age is significant (credible interval does not cover 0) in the final model, but not in the simple random effects model. (The Gibbs sampler was run for enough iterations so that differences in credible intervals are not due to Monte Carlo error.)
In this section, we provide a comparison of the final model chosen by the DIC criterion with the Drug × Sev model, which can be fit in SAS proc mixed. The third column of Table VI and Table VII contain the estimated random effects and fixed effects under the Drug × Sev model fit in SAS. In Table VI, there are some differences in the point estimates of the random effects and slight differences in the confidence intervals. For the estimated fixed effects, given in Table VII, the effect of age goes from significant to marginally significant when using the model fit in SAS, and the severity-by-drug interaction becomes significant in the SAS model. Clearly, the overall conclusions changed very little, but specific inferences, as mentioned above, were affected.
There was a significant non-linear decrease in depression scores over the 16-week treatment period (Figure 2). We will first address the first question of Section 3.1. Table VIII contains the posterior means for the week 16 score and the change from baseline for the four combinations of drug and severity. The worst group at week 16 (highest HRSD) was the psychotherapy only/high severity arm and the best was the drug/low severity arm, and these were significantly different (95 per cent credible intervals for the difference excluded zero). In terms of change from baseline, the largest change was observed in the drug/high severity arm and the smallest change was seen in the psychotherapy only/low severity arm (and these were significantly different). If we focus on the effect of drug among low and high severity patients, for the outcome at week 16, those on drug were slightly better off (lower HRSD), and this difference was significant among high severity patients. For the change from baseline outcome, those on drug and psychotherapy had much larger improvements than those only on psychotherapy within both the high and low severity patients (and these were significant).
The second question of interest stated in Section 3.1 addresses the impact of initial severity of depression. Individuals with more severe depression at the start of the study did worse on average than those with less severe depression. However, they improved at a quicker rate; there was a larger change from baseline (Table VIII), which was significant, and there was a significant interaction with the linear trend (Table VII) representing a steeper slope for the high severity subjects (also, see Figure 2).
There was not a significant interaction between drug and severity on the rate of improvement as the 95 per cent credible intervals for these two interactions covered zero (Table VII); this addresses the third question of interest given in Section 3.1.
We have proposed a simple way to model the random effects covariance matrix via structure and/or subject-specific covariates. This approach is computationally attractive and provides parameters which have a nice interpretation for modelling trajectories over time; in particular, the covariance parameters correspond to characterizing the dependence and variability of the random effects as we increase the complexity of the subject-specific curves. For the depression example, it is clear that this matrix differs by severity and drug, and that it affects inferences in terms of both the fixed effects (smaller credible intervals) and prediction of subject-specific trajectories (Figure 1). In our example, the random effects covariance matrix depended only on categorical covariates. If all the parameters in the covariance matrix differ across all possible combinations of these categorical covariates, then a separate matrix can be fit for each combination and the models can be fit in SAS or using a 'standard' Gibbs sampler for Bayesian random effects models. However, if this is not the case, or there are continuous covariates, the models developed here would be necessary and useful.
Clearly, when the main interest in a longitudinal data analysis is the covariance structure or subject-specific predictions, care must be taken in how the covariance structure is modelled. However, even when these are not of direct interest, taking care is important. For example, when the true random effects matrix is a function of subject-specific covariates, but not modelled that way, the standard errors and confidence intervals for the fixed (and random) effects will be incorrect, which can result in incorrect inferences (for example, a parameter being significantly different from zero). In addition, in cases with unbalanced longitudinal data where the missing data is MAR, incorrectly modelling the covariance structure can even result in biased estimates of the fixed effects.
We can compare the DIC of the models in this paper to those in Pourahmadi and Daniels . The best-fitting models in their paper were superior to the best ones here. The reason is that there is considerable non-i.i.d. variability after including the random effects; that is, the residuals in (1), εi, conditional on the random effects, bi, were not independent with constant variance. The models in this paper could be extended to allow a more general form for the covariance of these residuals by replacing the first level covariance matrix τ2I with a more general form. Non-linear models for the trend might also be considered to avoid the slight upturn in the quadratic curves near week 16, which is likely an artefact of the quadratic model.
Other parameterizations can be used to model the random effects covariance matrix. The GARP/IV parameterization has the advantages of computational attractiveness and a logical interpretation of parameters. This parameterization implies a non-linear model on the variances and correlations. For example, in the simple 2 × 2 case, the covariance matrix expressed in terms of the model for the GARP/IV parameters given in (4) and (5) is

Σi = [ σ2i1          ϕi,21σ2i1
       ϕi,21σ2i1     σ2i2 + ϕ2i,21σ2i1 ]

with σ2ik = exp(H′ikλ) and ϕi,21 = W′i,21γ.
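The induced non-linearity in the variances and correlation is easy to check numerically; a minimal sketch of the 2 × 2 case with hypothetical parameter values:

```python
import numpy as np

# Hypothetical GARP/IV values for the 2x2 case
phi21 = 0.5                  # GARP, as produced by the linear link
iv = np.exp([0.2, -0.1])     # innovation variances, as produced by the log link

# Sigma_i = T^{-1} D T^{-T} with T = [[1, 0], [-phi21, 1]]
T = np.array([[1.0, 0.0], [-phi21, 1.0]])
D = np.diag(iv)
Tinv = np.linalg.inv(T)
sigma = Tinv @ D @ Tinv.T

# Closed-form entries: additive models on the GARP/IV scale are
# non-linear in the variances and covariance
var1 = iv[0]
var2 = iv[1] + phi21**2 * iv[0]
cov12 = phi21 * iv[0]
```

The second marginal variance mixes an exponentiated linear predictor with the square of another linear predictor, which is why directly modelling variances and correlations additively would not reproduce this structure.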
One issue with modelling the covariance matrix is choosing the link function; that is, the vector function which transforms the parameters of the original covariance matrix, the variances and covariances, to a new set of parameters on which the regression parameters are linear. For the specific form of the link function in our model, see Pourahmadi. The DIC can be used to choose between various link functions for these models.
Finally, we are exploring extending these models to discrete data (for which bias in the fixed effects is a major issue) and to multivariate longitudinal data.
Contract/grant sponsor: NIH; contract/grant number: CA85295-01A1.
The work of the first author was supported by NIH grant CA85295-01A1. The authors would like to thank Zhongqi Zhang who did some of the initial work on deriving the full conditional distributions for the Gibbs sampling algorithm and Professors Joe Hogan and Hal Stern for helping to clarify the presentation of the example and the model. The authors would also like to thank two referees for comments which improved the manuscript.
In this Appendix, we provide details on the forms of the prior distributions and full conditional distributions for the parameters used in the Gibbs sampling algorithm. We also provide the code to fit the Drug × Sev model in SAS.
We assume the prior distributions are
where IG is an inverse gamma prior, parameterized to have mean 1/(b(a − 1)). In general, we recommend specifying these priors to be non-informative. In our example, we chose λ*, β*, γ* to all be vectors of zeros, with the respective Ω's identity matrices multiplied by a scalar, which we set equal to 10000. Specifying non-informative priors for variance components can be tricky. Here, we specified a = 3 and b = 9 for the inverse gamma prior on τ2, based on some preliminary exploration of the data. Owing to the large sample size, however, the posterior distribution for τ2 was insensitive to this choice.
The full conditional distributions for bi, β, τ2, and γ have the following closed forms:
where Gi is a q × q2 matrix with the first row Gi1 equal to 0 and the kth row Gik equal to Σj<k bijW′i,kj when k > 1.
The full conditional distribution of λ does not have a standard form. It is proportional to the following:
To sample from this full conditional, we use a Metropolis–Hastings step with a normal approximation to the full conditional as the candidate distribution. For details, see Daniels and Pourahmadi. In our example, the acceptance rate was very high.
The convergence of the Gibbs sampler was monitored by examining time series plots of the parameters over iteration and the Gelman and Rubin approach of using multiple chains. For all the models fit, convergence to the posterior distribution was quick and the mixing was good.
The SAS code for fitting the Drug × Sev model follows:
proc mixed data=dep method=reml;
  class subject dsc;
  model y = linear quad sev drug sex age drug*sev drug*linear drug*quad
            sev*linear sev*quad drug*sev*linear drug*sev*quad / s;
  random intercept linear quad / type=un sub=subject grp=dsc;
run;
The ‘grp=’ statement allows the random effects covariance matrix to differ by a categorical covariate. Here, ‘dsc’ is a covariate that takes four levels, one for each of the drug by severity combinations.