|Home | About | Journals | Submit | Contact Us | Français|
Analyzing problem-behavior trajectories can be difficult. The data are generally categorical and often quite skewed, violating distributional assumptions of standard normal-theory statistical models. In this paper, we present several currently-available modeling options, all of which make appropriate distributional assumptions for the observed categorical data. Three are based on the generalized linear model: a hierarchical generalized linear model (HGLM), a growth mixture model (GMM), and a latent class growth analysis (LCGA). We also describe a longitudinal latent class analysis (LLCA), which requires fewer assumptions than the first three. Finally, we illustrate all of the models using actual longitudinal adolescent alcohol-use data. We guide the reader through the model-selection process, comparing the results in terms of convergence properties, fit and residuals, parsimony, and interpretability. Advances in computing and statistical software have made the tools for these types of analyses readily accessible to most researchers. Using appropriate models for categorical data will lead to more accurate and reliable results, and their application in real data settings could contribute to substantive advancements in the field of development and the science of prevention.
Preventing mental health disorders and problem behaviors, such as delinquency, risky sexual behaviors, and substance use, in childhood and adolescence is critically important to the well-being of young people and, ultimately, to our society. Problem behaviors often occur in tandem with one another (Donovan & Jessor, 1985) and are associated with concurrent difficulties, such as family dysfunction, academic failure, and poor peer relationships during childhood and adolescence (e.g., Hawkins, Catalano, & Miller, 1992; Wiesner & Windle, 2004). Later, as these behaviors continue into emerging adulthood, avenues toward a successful life course may be shut off. This can lead to adulthood failures in areas such as work and education, as well as to physical and emotional disorders, all of which are costly to the individual and society as a whole (Hill, White, Chung, Hawkins, & Catalano, 2000; Marmorstein & Iacono, 2005; Wiesner & Silbereisen, 2003). Understanding the etiology of childhood and adolescent problem behaviors, in part through optimal statistical modeling of developmental trajectories, may help point the way toward more successful approaches to, and timing of, interventions (Shaw, Gilliom, Ingoldsby, & Nagin, 2003).
Problem behaviors are problematic to study for a number of reasons. Because psychopathology is a developmental process, proper understanding requires longitudinal data and analyses (Cicchetti & Toth, 1998; Sameroff, 2000). In addition, if researchers want to generalize their results to the population at large, they need to study population-based samples.1 But problem behaviors are, by definition, rare in the general population, and the data generally, often egregiously, violate assumptions of standard normal-theory linear models. They are frequently measured on a categorical scale, and the categories are generally unevenly spaced (e.g., During the past year, how often did you drink alcohol? 1 = never, 2 = a couple of times, 3 = 1 – 3 times a month, 4 = 1 or more times a week, 5 = daily). The data tend to have large masses in the lowest category, which signifies an absence of the behavior; and, to the extent that people do display the behavior, the distributions tend to be quite skewed. Ordinal data like these are most often treated as though continuous and normally distributed. Although some researchers have suggested that categorical variables with 4 or more categories could reasonably be analyzed using models that assume normal distributions (e.g., Bentler & Chou, 1987), other research suggests that this is not the case. Rather, Dolan (1994) found that this business-as-usual approach can lead to biased estimates, incorrect standard errors, and incorrect fit statistics even when the data are symmetric; and these problems get worse to the extent that the distribution departs from symmetry (also see, e.g., DiStefano, 2002; Feldman & Masyn, 2008; West, Finch, & Curran, 1995). Treating the data as censored normal (censored at zero) is a popular way of accounting for the asymmetry caused by the pileup of zeros, but it still assumes that the data above zero are continuous, so is subject to many of the same problems that occur when the data are treated as normally distributed.
When these ordinal variables are modeled as inherently categorical, a lack of symmetry is not a problem because the proportions in each category are explicitly modeled. Additionally, the mass at the bottom of the scale is only a problem to the extent that it may, with a limited sample size, be responsible for empty cells at the high end of the scale; and too many empty cells can cause estimation difficulties. Although longitudinal or clustered categorical-data models have been available for a number of years and used in fields such as medicine (Harville & Mee, 1984; Hedeker & Gibbons, 1994), economics (Butler & Moffitt, 1982), and education (Bock & Lieberman, 1970); until recently, software that estimated these models was relatively difficult to use and the estimation process was computationally arduous. Because of advances in computers and statistical software, appropriate analysis of longitudinal categorical data has become much more straightforward for applied researchers.
In this paper, we demonstrate some of the methods appropriate for modeling trajectories with longitudinal categorical observed data. Our goal is to place these techniques into a coherent framework. Such a framework will, we believe, help developmental researchers feel more confident in selecting and using the methods. A second important contribution this paper makes is to offer a systematic approach to choosing from among competing models for longitudinal categorical data. Because there is no generally available test of overall fit for most of these models, nor any single widely-accepted criterion for model selection, we present several techniques for assessing the models. We utilize adolescent-alcohol-use data to illustrate the process of model specification, selection, and interpretation; and we guide the reader through the process of choosing an optimal model for the data, from among those investigated, by comparing results with respect to fit (using comparative fit statistics and residual analysis), parsimony, and substantive considerations.
In the sections that follow, we describe several models for characterizing individual differences in change or growth over time, given categorical data. Appendix A lists the models we discuss here, along with their distributional assumptions and alternative names and acronyms sometimes seen in the literature.
To introduce models for categorical data, we begin with the most basic and general form of a generalized linear model,2 in which the expected value (or mean), μi, of individual i's response, yi (given covariate values x0i, x1i, … , xmi), is related to the covariates through a linear predictor, ηi. ηi has a linear relationship with the covariates which is quantified by regression coefficients β0, β1, … ,βm (Agresti, Booth, Hobert, & Caffo, 2000; Skrondal & Rabe-Hesketh, 2004):
where, usually, xi0 = 1 (making β0 the intercept of the equation), and there are m covariates. ηi is related to μi (given individual i's values on covariates) through some link function, g:
The type of link function used depends on the conditional distribution of the data.3 For instance, when the distribution is normal (Gaussian), g is typically an identity function and the equation is a standard linear regression (with an identity link, the estimate,i, can be thought of as ŷi):
But if yi has a different distribution, an identity link may not be appropriate. For example, if yi is binary, then its distribution is Bernoulli and the expected value, μi, is the conditional probability of giving a positive (or correct) response:pr(yi = 1 | x1,…,xm) = μi. Using an identity link in this case is improper because it would allow the model-predicted response, i, to go below zero or above one—outside of the range of probabilities. Because of this, either a logit (or logistic) link or a probit link is usually used with dichotomous data. In the logit-link model for binary data, the natural log of the odds of giving a positive response (often referred to as the log-odds or logit) is modeled as a linear function of the covariates:
In this paper, the logit-link model is used for ordinal data—an extension of the binary model referred to as the ordered logit, or cumulative logit model. For an ordinal variable with j = 1, 2, … J categories, the ordered logit model represents the probability of scoring in category j or above, versus any category below j (see, e.g., Agresti, 2002). In discussing ordered logit models, it is helpful to think of the ordinal observed outcome as a coarse categorization of an underlying (latent) continuous variable, yi* (see Figure 1), for which the residual errors are assumed to have a logistic distribution (Agresti, 2002; Gurland, Lee, & Dahm, 1960; Hedeker & Gibbons, 1994; Skrondal & Rabe-Hesketh, 2004).4 In this characterization, referred to by Skrondal and Rabe-Hesketh (2004) as a latent response formulation, the cutpoints that separate the underlying continuous distribution into categories are referred to as thresholds, or τ’s, and, the thresholds define the relationship between the categorical observed yi (in Figure 1, yi = 1, 2, 3, or 4), and the continuous latent yi*, such that yi≥j if yi*≥τj, and yi = j if both yi*≥τj and yi*<τj+1;j = 0, 1, 2, …, J − 1, τ0 = –∞ and τJ = ∞.
Imagine, for example, that Figure 1 is showing an ordinal measure used to assess frequency of suicidal ideation in the past year in which 0 = no ideation, 1 = thought of suicide once, 2 = thought of suicide more than once, but not frequently, 3 = contemplated suicide many times. If individual i's latent suicidal ideation score, yi* (latent, because only a category is observed), is greater than or equal to τ1 (Figure 1), then the observed value for individual i (yi) is 1, 2, or 3 (this individual has experienced suicidal ideation). However, if yi* is greater than or equal to τ1 and less than τ2, then yi = 1 (this individual thought of suicide once). Likewise, if yi* is less than τ1, then yi = 0 (not suicidal) and if it is equal to or greater than τ3, then individual i’s response is 3 (chronically suicidal).
The expected value of yi*, μi*, is modeled directly with an identity link, as in Equation 3:
where εi ~ Logistic with fixed mean and variance.
It is important to note that the ordered logit model assumes that the multiplicative effect of a covariate (observed or latent) on the odds of being in a category j, is the same for all j = 0, 1, 2, … J − 1. This proportional odds assumption is necessary to permit the relationship between the ordinal variable and each covariate to be quantified by a single coefficient.5
When researchers gather multiple cross-sectional or longitudinal measures on the same individuals (for example, repeatedly measuring individuals’ levels of suicidality over several years), the repeated measures on an individual are correlated with one another. This violates an important assumption of statistical models—that of conditionally independent observations—resulting in standard errors that are too low. Generalized estimating equations (GEE) were developed to solve this problem with categorical data. GEE use a “working” correlation matrix (specified by the data analyst) to represent the correlations between observations, and it uses a special estimator (known as a “sandwich” estimator), to correct the standard errors. However, the GEE approach has a few limitations that hamper its usefulness in modeling psychological data. The estimates that result from GEE are population averaged estimates. That is, the estimates are mean effects for the population as a whole, rather than for any given individual (one coefficient applies equally to all members of the population). GEE does not allow the researcher to explicitly model the population heterogeneity, or individual differences. For instance, in the suicidality example, GEE would give only an estimate of the average intercept and slope of suicidality in the population. GEE can estimate mean differences based observed covariates (e.g., the difference between male and female averages), but not, for instance, a specific individual’s estimated suicidality level or trajectory (or how it might change based on a covariate value). Nor would it permit individual prediction from the intercept and slopes to a distal outcome.
A longitudinal model for categorical data that does model the individual differences is the hierarchical generalized linear model (HGLM; Raudenbush & Bryk, 2002), also known as a generalized linear mixed model (GLMM; Skrondal & Rabe-Hesketh, 2004).6 This approach combines the generalized linear model (GLM) with a hierarchical linear model (HLM; also known as a random-effects, random-coefficients, mixed, or latent growth-curve model), in which repeated measurements on individuals are expressed as a function of time (see Figure 2). Individual differences in the outcome variable when time equals zero and change in the outcome over time are modeled by permitting the intercept and slope coefficients to vary across individuals. The intercept and slope(s) are, therefore, referred to as random coefficients, random effects, or (latent) growth factors. Both time-varying (not discussed in this paper) and time-invariant covariates may be used to explain within- and between-person variability, respectively.
In modeling the categorical drinking data, we use the logit link described in Equation 4 to Equation 6. A categorical random-effects model has been described by Hedeker and colleagues (Hedeker & Gibbons, 1994; Hedeker, Gibbons, & Flay, 1994; Hedeker & Mermelstein, 1998, 2000), and others (Agresti, 2002; Agresti et al., 2000; B. Muthén & Asparouhov, 2002; Raudenbush & Bryk, 2002; Skrondal & Rabe-Hesketh, 2004; Vermunt, 2006). In the models that follow, μti* (individual i's expected latent response value at time t; see Equation 7 & Equation 8), is modeled as a linear function of the random coefficients (growth factors), which comprise the intercept coefficient, β0i, (the expected latent response value for individual i when time = 0) and, usually, one or more slope coefficients, βsi (s = 1, 2, …, S) which are multiplied by powers of time (e.g., β1iati, β2iati2, β3iati3, …, βSiatiS). The slope coefficients describe systematic linear or curvilinear change over time in individual i's latent responses.7 The times of measurement, ati, are usually centered by subtracting a relevant age or timepoint; often this is the time of, or age at, the first measurement occasion, as is shown in Figure 2. This makes the intercept, β0i, individual i’s expected latent response at the first measurement occasion. The times of measurement may be the same for all individuals, as shown in Figure 2, or may be specified in the model to be unique for each individual (e.g., actual age).
The model that follows has linear slope (higher order polynomials are also common), and is expressed as having two levels. Level 1 characterizes the individual latent responses at each timepoint, and Level 2 characterizes the individual trajectories over time. The generalized linear model (Equation 4 – Equation 8) is at Level 1, embodied in yti*:
where: εti ~ Logistic with fixed mean and variance, and are assumed independent between individuals (i.e., cov[εti, εtj] = 0), and conditionally independent between times (i.e., cov[εti, εvi] = 0; referred to as local independence).
At Level 2, individual differences in the random coefficients from Level 1 (β0i, β1i) are represented by variability (u0i, u1i) around the mean intercept (γ00) and mean slope (γ10). The individual differences are modeled as a function of an individual-level, time-invariant covariate, xi (multiple covariates are possible), quantified by regression coefficients γ01 and γ11 for intercept and slope, respectively. The conditional joint distribution of the intercept and slope is assumed to be multivariate normal:
where: u ~ N(0, Ψ), cov(εt, us) = 0 for all t = 1, …, T and s = 0, 1, and γ00 = 0 for identification. This identification restriction is necessary because the scale and location of the intercept are arbitrary when this latent response parameterization is used.8 The thresholds (τ’s) are assumed to be constant across time.
Utilizing the earlier example of suicidality, this hierarchical generalized linear model (HGLM) characterizes individuals’ systematic change over time in the continuous latent response. The τ's that relate the latent suicidality score to the observed categorical suicidality measures are estimated as part of the model (Equation 5 – Equation 8) and, because they are constant across time, we know that the latent suicidality score, despite changing over time, relates to the observed categories of the ordinal variables in the same way at each measurement occasion. The intercept mean is be defined to equal zero (γ00 = 0) on the arbitrary scale of the latent suicidal-ideation response (necessary in this parameterization); however, the estimated individual intercepts vary around zero. The individual slopes vary about the estimated average slope, as well, and these individual intercepts and slopes are often correlated.
Several assumptions have been discussed that are very important for the hierarchical generalized linear model (HGLM): (1) the random coefficients (intercept and slopes) are assumed to have a multivariate normal distribution; (2) change is a smooth function of time (most often linear or curvilinear); (3) proportional odds; and (4) measures are independent, conditional on the random effects (local independence). Any of these assumptions might not hold, especially when modeling problem-behavior data. In the models that follow, we discuss methods that relax the first three assumptions.
To begin, when the data are categorical and highly skewed, it is possible that the individual intercepts and slopes are not normally distributed, potentially resulting in biased parameter estimates (B. Muthén & Asparouhov, 2006; Vermunt, 2006). However this assumption is difficult to test (Agresti, 2002, p. 496; Carlin, Wolfe, Brown, & Gelman, 2001; Vermunt, 2006). The Level 2 normality assumption can be relaxed by using a mixture of normal distributions, known as mixing components or classes, to characterize the joint distribution of the random effects. This is pictured in Figure 3a. This model is a type of finite mixture model (see McLachlan & Peel, 2000), which we refer to here as a (generalized linear) growth mixture model (GMM), and it assumes only within-class normality.9 The classes have different mean structures at Level 2, and can comprise different proportions of the population. The within-class Level-2 variance-covariance matrix can also differ across classes, however, under some circumstances, this may lead to estimation problems, resulting from an unbounded likelihood (McLachlan & Peel, 2000). The multinomial latent class variable, ci, which identifies individual i’s class membership, takes on values k = 1, 2, … K, where K is the number of classes. Here, the τ’s are assumed to be constant across both time and classes.
Figure 3a shows how the individual differences are, in this model, characterized in part as discrete classes, and in part as continuous distributions within those classes. Because of this within-class variability, covariates can be used to predict individual differences within class, as well as to predict the probability of class membership. The hierarchical generalized linear model (HGLM) can be seen as a special case of the growth mixture model (GMM), with just a single class (K = 1). When GMM has two or more classes, Level 1 equations are indexed by class:
for individual i at time t in class k; εkti distributed as before.
The Level 2 equations are also indexed by class, and a multinomial logistic equation is added, which expresses the conditional probability of membership in class k as a function of a class-specific intercept, αk and the covariate, xi, multiplied by coefficient, λk:
where γK00 = 0 and αK = γK = 0 for identification; uki distributed as before.
This model has a more complicated interpretation than the hierarchical generalized linear model (HGLM). In the suicidality example, if there are, for instance, three classes, it implies three unique sets of intercept and slope means; one for each class. Within each class, individual intercepts and slopes vary about those means and can be predicted by covariates, as can class membership. The grouping of individuals into classes, however, is seldom perfect. Instead, person i has some estimated probability of membership in each suicidality class. These are referred to as posterior probabilities. For each individual, the posterior probabilities sum to 1 and are based on that individual’s observed outcome responses and covariate values. Thus, person i's estimated trajectory of suicidality is a function of the probability of membership in each of the classes, the class-specific means and covariance structures, and possibly covariates.
There are several reasons that the within-class normality assumption of the growth mixture model (GMM) may not be sustained by the data. The primary one is that there may be insufficient variability to estimate the within-class variances and covariances in one or more classes. Two or more classes can be collapsed together, but if the classes are well separated (i.e., the mean trajectories are very different), then collapsing them might not be optimal. In this case, it is possible to constrain one or more elements in the within-class covariance matrix (Ψk) to zero. When all variances and covariances in a class are fixed at zero, it means, effectively, replacing a continuous, normally-distributed class with a mass point, as shown in Figure 3b. We refer to this model as a (generalized linear) latent class growth analysis (LCGA) and, like the hierarchical generalized linear model (HGLM) this model is a special case of GMM.10 The Level 1 equation is the same as that in GMM, but the Level 2 equations change to reflect the elimination of within-class variances (and with no variation to explain, within-class predictors are also dropped):
When all within-class variances and covariances equal zero, individual differences at Level 2 are being modeled solely by class membership, so it is likely that more classes will be needed to adequately characterize the variability in the random effects distribution (B. Muthén & Muthén, 2000b). The probability of membership in each of the K classes may be regressed upon the covariates as in Equation 13, with the same identification constraints. Figure 4a shows a path diagram of a growth mixture model (GMM) and a latent class growth analysis (LCGA). The solid lines belong to both models and dashed lines show the optional parameters that, if any are present, define the model as GMM (for more on the comparisons between these models, see, e.g., Kreuter & Muthén, 2007; Nagin, 2005; Raudenbush, 2005; Reinecke, 2006).
This model has a slightly simpler interpretation than the GMM because all individual differences in estimated suicidality trajectories are characterized by class membership, and those differences can be modeled as a function of predictors of class membership. Within each class, all individuals are assumed to share the same suicidality trajectory. One way to think about the shared within-class trajectory is that it is an approximation of the true within-class individual latent trajectories, similar to the way mass points approximate the true distribution of the random effect shown in Figure 3b. Person i's predicted trajectory of suicidality is a straightforward function of his or her posterior class membership probabilities and the class intercept and slope means.
Finally, it is possible that development cannot be characterized as a simple function of time, as is assumed in the models we have been discussing, or that the proportional odds assumption does not hold. All of the models thus far are bound by these two assumptions because they model scaled growth (change across time is along a single continuous scale). Longitudinal latent class analysis (LLCA; latent class analysis applied to longitudinal data) is a type of mixture model but it does not belong to the family of growth models we have been discussing. This is because, rather than modeling scaled change, LLCA models patterns of states across time.11 If the outcome data are multinomial, LLCA is the only appropriate analytic approach because any changes are necessarily state changes.
Longitudinal latent class analysis (LLCA) models the joint distribution of the repeated outcome measures directly with a latent class variable, which characterizes both the within-person variation that had been handled at Level 1 of the previous growth models, and the between-person differences that had been handled at Level 2 (see, also, Appendix A). Its only assumption is local independence. In LLCA, the time- and class-specific probability of scoring in or above category j is modeled directly, so the thresholds are indexed by time (t) and class (k), in addition to category:
Class membership is based on groupings of similar patterns of responses over time, and covariates can be used to predict class membership probabilities, as shown in Equation 13. Note that, because it is the category thresholds that define the differences both across time and across classes in longitudinal latent class analysis (LLCA), they are all estimated. Thus, while LLCA makes no assumptions about the distribution of the observed variables, or the form of change, it trades fewer assumptions for many more estimated parameters (Vermunt, 2006), in a manner similar to spline or piecewise models (e.g., McArdle, 2004). Additionally, the model, as parameterized here, does not permit unique, individually varying times of measurement. For an application of LLCA to drinking patterns across time, see Lanza and Collins (2006).
In a longitudinal latent class analysis (LLCA) of the suicidality example, patterns of change are defined by the variable categories, rather than estimated intercepts and slopes. Like the latent class growth analysis (LCGA), everybody in a class shares the same parameters, however now it is the category thresholds that are shared within class and different between classes. Because thresholds can also change across time, change can be much more irregular or complex in LLCA. For instance, the course of suicidality for individual i might fluctuate or be intermittent. Differences in suicidality could, perhaps, be associated with influences such as puberty, high-school graduation, or other shared events; and the covariates can be used to predict the individual’s probability of membership in each class.
The selection and modeling of covariates in mixture models is a complex issue currently under investigation, and an extended discussion of the issues is beyond the scope of this article. However, early work (Nylund & Masyn, 2008) suggests that the number of classes should be determined using an unconditional model, and that substantial changes in parameter values after covariates are included indicate misspecification of their effects (e.g., omitting direct effects of covariates on observed variables). As is the case when estimating any statistical model, it is also important to let substantive theory, along with consideration of parsimony and interpretability, inform the inclusion and specification of covariates. The added complexity of mixture models makes this a particularly critical issue. For example, in a growth mixture model, it is theoretically feasible to allow a covariate to affect class membership as well as one or more growth factor variances in one or more classes. Additionally, all of the standard rules for typical regression and HLM models still pertain (e.g., concern with multicollinearity among covariates, the need to dummy-code nominal predictors).
Longitudinal mixture models (GMM, LCGA, and LLCA) are often used to find unobserved but distinct groups of individuals (e.g., Lanza & Collins, 2006; B. Muthén & Muthén, 2000a; Shaw, Lacourse, & Nagin, 2005). However, any non-normal distribution can be approximated by a finite mixture of normal distributions (Bauer & Curran, 2003; McLachlan & Peel, 2000), so the extraction of two or more classes does not necessarily indicate that distinct groups exist in the population. It is possible that finding multiple classes is an indication of a non-normal distribution of the random effects or, potentially, some other violation or model misspecification (Bauer, 2005). In addition, class membership is seldom unequivocal. Rather, as mentioned earlier, the individual has an estimated probability of belonging to each class.
In this paper, we use actual alcohol-use data, collected from a sample of adolescents when they were in seventh through twelfth grades, to illustrate the similarities and differences between these analytic approaches and to motivate the process of model selection (see Appendix B for annotated Mplus syntax). The value these models is enhanced to the extent that childhood events and circumstances may be used to help to predict later pathological processes, as this can potentially allow targeted interventions to take place before such processes begin. In addition, under some circumstances, covariates can aid in model selection. Thus we include a covariate that has been found to predict adolescent alcohol use—association with alcohol-using peers (e.g., Curran, Stice, & Chassin, 1997; Hawkins et al., 1992)—to demonstrate how a covariate may be specified and interpreted in the models.
Data come from a longitudinal, community epidemiological study of 451 (52% female) target youths and their families from two-parent households in the Midwest. Participants in the study were recruited from eight rural counties and the original sample of families was primarily lower middle- or middle-class. Due to a very small minority population in this area, all participants are of European heritage. Annually from 1989 (seventh grade) through 1992 (tenth grade) and again in 1994 (twelfth grade) the adolescents and participating family members individually completed questionnaires pertaining to subjects such as the demographic characteristics of the family, their own personal characteristics and behavior, and characteristics and behavior of other family members and friends. The retention rate in 1994 (the last wave used in the current study) was 94%. Additional details about the study, which is still ongoing, can be found in Conger and Conger (2002).
The primary outcome for these analyses is the alcohol use measure from the questionnaire data. Target adolescents were asked each year how often they had consumed beer, wine, or hard liquor during the previous year. The drinking items were coded into a single, ordinal alcohol-use variable with four possible responses: 0 = never, 1 = less than weekly, 2 = once or twice a week, and 3 = three or more times a week. Each year, the target adolescents were also asked how many of their close friends used alcohol during the previous year. Answers ranged from 0 = “none of them,” to 4 = “all of them.” The adolescent’s answer to this question in seventh grade was used as a covariate, predicting the intercept and change in the target adolescent’s own drinking.12
We began the unconditional analyses with hierarchical generalized linear model (HGLM; Figure 2), using both linear and quadratic slopes, and utilizing the logit link and model shown in Equation 9 and Equation 10. This model has the simplest interpretation and it gives an overview of the form of change. Following HGLM, we tested the three mixture models, beginning with growth mixture model (GMM; Equation 11 – Equation 13). GMM comprises a very large set of potential models. For example, a quadratic model with three classes has 14 possible ways of specifying the random effects if they are constrained to be equal across classes, and dozens more if they are permitted to vary across classes. To limit the number of models we were testing, we decided to constrain variances equal across classes, as discussed earlier, and to use a systematic approach to relaxing the constraints on the variance/covariance matrix. For each model, we tested the following: (a) intercept variance only, (b) intercept and linear slope variances, and (c) intercept, linear, and quadratic slope variances. We tested each model both with all possible covariances estimated, and with all covariances constrained to zero.
All intercept and slope variances were fixed to zero for latent class growth analysis (LCGA; Equation14), and, finally, we modeled the joint distribution of the repeated drinking outcomes directly with longitudinal latent class analysis (LLCA; Equation15 & Equation16). Intercepts for HGLM, GMM, and LCGA are placed at ninth grade—the first year of high school. Entry to high school begins a time of increased risk for developing problem drinking. Incoming students are exposed to older adolescents, many of whom have already begun to drink, and alcohol is much more available in high school (Johnston, O’Malley, Bachman, & Schulenberg, 2006).
For all analyses, we used a full-information maximum likelihood (FIML) estimator with robust standard errors (MLR), as implemented in Mplus 4.21 (B. Muthén & Muthén, 1998 – 2006), to estimate the parameters (the robust standard errors offer protection against inflated alpha values in case of certain types of model or distributional misspecifications). With any longitudinal study, some individuals will miss some assessments or drop out altogether. These models utilize all of the available data under a missing-at-random (MAR) assumption (as would multiple imputation), which allows that the missingness may be related to variables included in the analysis (Little & Rubin, 2002). Unless missing data are missing completely at random (MCAR; i.e., missingness is unrelated to any data being modeled, present or missing), which is a much stronger assumption, FIML yields less biased estimates than other methods of missing-data methods, such as listwise deletion (which also reduces the power of the analysis by reducing sample size) or mean imputation (Schafer & Graham, 2002). FIML does not, however, guarantee unbiased estimates if missingness is related to the missing values, themselves, and this risk of bias increases with the proportion of missingness (the maximum for a single wave in this study is 10.5%). However, we feel that, given the broad range of topics under study, MAR is a reasonable assumption (for more on mechanisms of missingness and techniques for dealing with missing data, see, e.g., Foster, Fang, & Conduct Problems Prevention Research Group, 2004; Little, 1995; Little & Rubin, 2002; Schafer & Graham, 2002). FIML can be computationally demanding with categorical data, however, it offers advantages over methods such as penalized quasi-likelihood (PQL) and marginal quasi-likelihood (MQL), which, while computationally less demanding, may result in biased estimates.
Any given data set can potentially be analyzed using a multitude of statistical models, however, there are generally theoretical and statistical reasons that many possible models can be ruled out. For example, we are not considering any models that require an assumption of continuous observed data (e.g., two-part models; see Olsen & Schafer, 2001; Tooze, Grunwald, & Jones, 2002). Furthermore, because we are interested in modeling change across multiple timepoints, rather than pairs of timepoints, are we not considering any autoregressive-type models, such as latent transition analysis (LTA).
For most longitudinal categorical analyses, there are no readily available statistics that assess overall fit of the model to the data. The two categorical goodness-of-fit statistics, the Pearson chi-square and likelihood chi-square (also called deviance, or G2) may not perform well when there are many low expected cell frequencies (Agresti, 2002), and this is often the case when longitudinal categorical data are used (e.g., five waves of a 4-category variable results in 45 or 1024 cells, most of which are likely to be empty). Nor is there a method for comparing models that is widely accepted as best (B. Muthén & Asparouhov, 2006; Nylund, Asparouhov, & Muthén, 2007). Thus, when assessing models, researchers generally consider both statistical and substantive criteria (e.g., Jackson, Sher, & Schulenberg, 2005; Tucker, Orlando, & Ellickson, 2003). In this section, we will discuss four complementary approaches to model assessment: (1) quality of convergence; (2) comparative fit; (3) residual analysis; and (4) visualization. In addition, issues such as parsimony and, in mixture models, class size should be taken into consideration. When a class comprises a small number of individuals, the parameter estimates may not be very stable or reliable. Often, no single criterion is, alone, sufficient to select a model and two or more of the criteria we discuss must be considered simultaneously.
The estimation algorithm searches for the global maximum of the likelihood function, but in complex categorical data models and mixture models, algorithms are more likely to converge on local maxima than with continuous data or in less complicated models, so the use of multiple starts from random locations in the parameter space is recommended (Hagenaars & McCutcheon, 2002; Hipp & Bauer, 2006; McLachlan & Peel, 2000; B. Muthén & Asparouhov, 2006; Vermunt, 2006; see Appendix B). In some cases, there may not be a global maximum (McLachlan & Peel, 2000). This can result in non-convergence, or failure of the algorithm to replicate a maximum loglikelihood value over many starting values. That is, each start may end up at a different local maximum, suggesting that the model parameter estimates are untrustworthy. A second indication that adequate convergence was not reached is a failure of the algorithm to generate standard errors. This can be brought about by a singular information matrix, which implies that the model is not identified (L. K. Muthén & Muthén, 1998–2006). A model that has not converged, has resulted in a singular information matrix, or has failed to yield a consistent maximum loglikelihood value is considered a failed model, in this paper, and eliminated from further consideration.
The standard chi-square difference test (likelihood ratio test; LRT) is not helpful in determining which model is best, because the different models are not necessarily nested. It also cannot help choose the optimal number of classes in a mixture model, because regularity conditions of the test are violated when comparing a k-class model to a (k − 1)-class model (McLachlan & Peel, 2000). Often, an information index, such as the Bayesian Information Criterion (BIC; sometimes called Schwartz Information Index) is used in model selection. This index and similar ones (e.g., Akaike Information Criterion) take into account the model loglikelihood (higher is better), and penalize for model complexity (i.e., the number of parameters estimated relative to the sample size). Thus, using them reduces the risk of overfitting the model to a single sample, thereby improving the possibility of replicating the model findings with future samples. In general, a lower value on an information criterion indicates a better model. Mplus reports three information indices, but the most widely used is the BIC (see Hagenaars & McCutcheon, 2002; Nylund et al., 2007; Vermunt, 2006).
Two more statistics, available in Mplus 4.21 (B. Muthén & Muthén, 1998 – 2006), may be useful for determining the optimal number of classes in a mixture model. The Lo-Mendell-Rubin test (LMR; B. Muthén, 2004) analytically approximates the LRT distribution and the bootstrapped LRT (BLRT), suggested by McLachlan and Peel (2000), uses bootstrap samples to empirically derive the sampling distribution of the LRT statistic (Nylund et al., 2007). Both tests compare a k-class model with a (k − 1)-class model and, in both cases, a statistically significant p-value suggests the current model offers improvement over the model with one class fewer.
Also discussed in the literature on mixture models is entropy (B. Muthén, 2004), which, similar to the average posterior probabilities for most likely class membership, serves as a measure of the precision of individual classification. It ranges from 0 (everybody has an equal posterior probability of membership in all classes) to 1 (each individual has posterior probability 1 of membership in a single class, and probability 0 of membership in the remaining classes). Equivalently, high entropy indicates clear class separation. Entropy is not a measure of fit, nor was it originally intended for model selection (Ramaswamy, Desarbo, Reibstein, & Robinson, 1993); however, if it is extremely low, that suggests the model may not be useful for some purposes. For instance, if the model is intended to find homogenous clusters of individuals with distinctive patterns of change (e.g., Nagin, 1999), low entropy indicates that it may be doing a poor job. However, the reverse is not necessarily true; that is, high entropy does not necessarily show the existence of homogeneous clusters of individual trajectories (Feldman & Masyn, 2008).
The observed and model-predicted cell and marginal proportions can be compared to help assess fit of the model to the data (Carlin et al., 2001; Skrondal & Rabe-Hesketh, 2004). As mentioned earlier, when a model is large and complex, such as a growth process with multiple timepoints, most cells have expected frequencies that are very small or approach zero. However, often, a few patterns of responses recur relatively frequently. In this case, it may be helpful to inspect standardized Pearson residuals (Haberman, 1973) in the cells with these more common response patterns (B. Muthén & Asparouhov, 2006). In addition, standardized residuals from the univariate margins (i.e., category proportions at each timepoint) and bivariate margins (i.e., proportions in cells of all two-timepoint cross-tabulations) may be inspected (Skrondal & Rabe-Hesketh, 2004). Standardized residuals can be compared with a standard normal distribution (Haberman, 1973); thus, too many standardized residuals greater than 2 suggests poor fit of the model to the data. All of these observed and expected cell proportions, along with standardized residuals, can be generated in Mplus with the TECH10 output command (see Appendix B). Instructions on how to calculate the expected frequencies when other software is used are available from the first author.
Graphical methods can give a clearer picture than numeric output alone of model-predicted trajectories and patterns of change, but using them effectively for model selection depends upon researcher judgment and substantive knowledge. For instance, plots may make it easier to judge whether results of an analysis make interpretive sense or map reasonably onto theory and previous findings. Mixture models, in particular, must be considered in the context of theory, interpretability, and usefulness, because they tend to be used in an exploratory fashion and are innately data-driven.
Linear (and curvilinear) trajectories based on the (within-class) means of the random effects can be plotted, and these give an easy-to-describe picture of the overall mean change; but because the scale of these parameters is arbitrary, the plots can potentially be difficult to interpret. To facilitate interpretation in our line plots, we include the estimated thresholds (τj) that divide the categories of the observed data. In addition, we show plots of the time- and class-specific estimated category probabilities, which give a more nuanced picture of how alcohol use changes over time. The predicted-probabilities plots were created in Excel using the TECH7 output from Mplus, and line plots were generated in Excel by using Mplus estimated intercept and slope means multiplied by time (and the model-estimated thresholds, which are fixed across time).
The sample size for these analyses is 451 (236 females). The observed category proportions at each grade are shown in Figure 5, broken out by gender. Nearly 30% of the target adolescents of both genders reported drinking in seventh grade, and this grew to approximately 65% by twelfth grade (Figure 5). Going along with this increase in the likelihood of drinking is an increase in the proportion of youths drinking heavily. By the time the adolescents graduated from high school, approximately 20% of them (17% of females, 21% of males) reported drinking weekly or more often (Figure 5). In addition, out of 451 seventh graders, approximately 1/3 (27% of females and 34% of males) reported having at least one close friend who drank alcohol and 5% of females and 7% of males reported that half or more of their friends drank. Because drinking did not differ by very much between males and females (Figure 5), we did not use gender as a covariate in the model.
Fit indices are shown in Table 1 for the all models. Underlined values identify the best fitting number of classes within model type, as selected by the statistic in the row.
Two unconditional hierarchical generalized linear models were tested. The first characterized in drinking-frequency trajectories over adolescence with an intercept (in ninth grade) and linear slope, and the second added a quadratic slope. The quadratic slope significantly improved the model fit (Δχ2 = 49.83, Δdf = 4, p < .001) and was retained. Based on this result, the mixture models that follow were also specified with quadratic slopes. The intercept mean was fixed to zero for identification (Equation 9 &Equation 10) and the average linear trend was positive and statistically significant (β10 = 0.84, p < .001; see Table 2). The mean quadratic slope was not significant, however, there was considerable individual variability around the intercept and both slopes. In addition, the statistically significant covariances suggest that the frequency with which adolescents were drinking in ninth grade was related to changes in their drinking over the 6-year timespan (Table 2). On average, more drinking or more rapidly increasing drinking in ninth grade was associated with trajectories that were flattening out by the end of high school; while less frequent drinking in ninth grade was associated with drinking trajectories that were accelerating in twelfth grade. The BIC for this model was the lowest of all models tested. The other three statistics in Table 1 do not apply unless there are two or more classes.
In general, when more within-class variances and covariances were estimated in the growth mixture models, fewer classes were successfully extracted. For example, when all three variances and their covariances were estimated even the 2-class model failed to converge, but with the covariances fixed to zero, the same 2-class model did converge (see Table 1) Increasing the number of classes to three resulted in a singular information matrix.
In the next model, the quadratic variance was constrained to zero and only intercept and linear slope variances were estimated. The 2-class models converged, both with and without the covariance, but there was no significant difference in fit between the two (Δχ2 = 0.59, df = 1, p = .44), so we constrained the covariance to zero. Here, again, the model did not converge with three classes.
In the last set of GMM models only the intercept variance was estimated, and the remaining variances (and, necessarily, all covariances) were constrained to zero. In this case, we were able to estimate up to four classes. The 5-class model did not converge (the information matrix was singular; Table 1) and we ruled out the 4-class model because one class was too small to estimate reliably (4%, representing approximately 18 individuals). This left plausible models with two and three classes. In comparing the 2- and 3-class models, the fit statistics did not agree on which was optimal: the LMR preferred two classes, the BIC three classes, and the BLRT no fewer than four classes (Table 1). It is worth noting that entropy for all of the GMM models was much lower than for it was for any other mixture models (Table 1). We retained all of the models the converged except the 4-class GMM.
As mentioned earlier, in latent class growth analysis, because all within-class variances are constrained to zero, more classes may be needed to characterize the same joint distribution of growth factors. Equivalently, as the number of classes increases, the latent class variable is likely to account for more and more between-individual variability. Although a maximum of only four classes could be estimated with GMM, seven classes were successfully extracted with LCGA. The 7-class LCGA had a class that was too small to estimate well (3%, representing approximately 12 individuals). Nonetheless, the BLRT was still significant at seven classes, suggesting that no fewer than seven classes would be needed to model these data (Table 1). The BIC and LMR both selected the 4-class LCGA, but the 3-class model had a BIC value that was nearly equivalent to that of the 4-class model. We retained the 3- and 4-class models.
The longitudinal latent class analysis failed to replicate the best loglikelihood when seven classes were specified, and none of the fit statistics selected more than four classes (Table 1). The BLRT suggested that four classes were adequate, and the BIC and LMR both selected the 2-class model. Because the LLCA models require so many parameters, it is not surprising that no fit index selected a model with more than four classes, nor that its BIC values were consistently higher than for any other models (as was entropy, suggesting cleaner class separation when more parameters were estimated). We retained the models with two, three, and four classes.
We inspected the univariate and bivariate residuals for the HGLM and the best 10 out of the 17 mixture models that had converged properly. Table 3 shows the frequencies of univariate and bivariate residuals greater than 2.13 All models were able to replicate the univariate observed values adequately, but there was more variability among the bivariate residuals (Table 3). The 3- and 4- class LLCA models fit the bivariate cell counts the most accurately and the worst-fitting were the 2-class models: LLCA and the three GMMs (Table 3).
Table 4 shows comparisons of observed and predicted cell frequencies for the most commonly observed patterns of responses over the five occasions. Note that, although each of the four categories of the outcome were endorsed by at least a few individuals at every timepoint, the most commonly observed patterns comprised only zeros and ones (representing non-drinkers  and infrequent drinkers ). Here, again, the 3- and 4-class LLCA modeled offer the best fit to the observed data. Only the 2-class LLCA model did a noticeably worse job than the other models under consideration at reproducing the cell counts. It is also worth noting that the frequency of one response pattern (01111) was not reproduced well by any model except the most highly parameterized LLCAs.
In summary, the bivariate residuals suggested that all of the 2-class models fit the data relatively poorly, and the 2-class LLCA also failed to adequately match the common response patterns. We dropped the 2-class models from further consideration. In contrast, 3- and 4-class LLCAs both fit the observed frequencies particularly well, but the 4-class model required 16 more parameters, had a higher BIC (Table 1). Thus, we rejected the 4-class LLCA in favor of the more parsimonious 3-class model. We continued to consider the HGLM, the 3-class GMM with intercept variance, the 3-class LLCA, and 3- and 4-class LCGAs.
Plots of model-predicted trajectories and patterns of change can also help a researcher choose the most appropriate or reasonable model from several competing ones, and plots are often used this way in model selection (e.g., Greenbaum, Del Boca, Darkes, Wang, & Goldman, 2005; B. Muthén & Muthén, 2000b). Figure 6 shows a comparison of the estimated mean curves, based on the y* metric, generated by the 3-class growth mixture model (GMM; top) and 3-class latent class growth analysis (LCGA; bottom). Because the log-odds trajectories are on arbitrary scales, the plots show category thresholds as dashed lines to facilitate comparison across analyses. In addition, the GMM plot includes SD bars above and below the class means at each timepoint (based on the intercept variance). These SD bars illustrate the overlap of classes that accompanies the extremely low entropy. The GMM finds one unexpected class, comprising slightly over 1/3 of the sample, in which average drinking probabilities are constant across adolescence. This class shows approximately 50% of the individuals reporting alcohol use at each time (the first threshold separates non-drinkers from drinkers). Its other two classes are increasing, one early and one later, which is consistent with the covariances found in the hierarchical generalized linear model (HGLM; Table 2). The LCGA plot shows one class with consistent low drinking probabilities, one with rapidly increasing probabilities, and one with relatively high drinking probabilities across all timepoints. These last two classes are also consistent with the HGLM covariances in Table 2. That is, one class has lower drinking probabilities in ninth grade, but those probabilities are increasing rapidly, and the other, more likely to be drinking (or drinking more) in ninth grade, is increasing more slowly (and shows more quadratic slope; Figure 6).
The second way to plot the trajectories and patterns of change is to show the predicted category probabilities from each time (see Figure 7). This gives a somewhat richer picture of the change because it illustrates the shifts in proportions at each level of a categorical variable. It complements the plots of y*, shown in Figure 6, and when the categories are not evenly spaced, or have different substantive meanings, this approach can be more informative than a line plot. Figure 7 shows a comparison of the 3-class GMM, LCGA, and longitudinal latent class analysis (LLCA, which only yields category probabilities). The LLCA and LCGA classes are very similar to one another, both in class-specific patterns of change and size, suggesting that the extra parameters in LLCA are not yielding much more information. Both are consistent with prior literature (e.g., Hix-Small, Duncan, Duncan, & Okut, 2004; White, Johnson, & Buyske, 2000), in which researchers have found classes with low, high, and increasing alcohol-use probabilities across adolescence.
Only GMM found the class with constant moderate drinking probabilities, and Figure 7 shows that this class groups together individuals who responded in all four drinking categories. Because this class fails to differentiate high versus low drinking probabilities (all categories are represented in the class), it may not be very informative. It is also not consistent with prior literature. In addition to these problems, the overlap of classes in GMM makes it difficult to interpret, and a mixture model with such indistinct classes may offer little if any advantage over the more standard hierarchical generalized linear model (HGLM) without classes. For all of these reasons, we eliminated it from further consideration.
The 4-class LCGA was selected by both LMR and BIC as the best LCGA model, however, the BICs for the 3- and 4-class models were very close (Table 1) and, overall, the two models did approximately equally well with respect to the standardized residuals (Table 3 & Table 4). Figure 8 shows the small added class (12%; labeled “High at A”) which is largely a subset of the low class from the 3-class model, but with higher drinking probabilities in the seventh and eighth grades (such class splitting can be investigated with a cross-tabulation of modal class assignments). By tenth grade, it does not differ substantially from the low class. This type of pattern, in which adolescents drink more in the early and later years, but not in between, is not predicted by theory or prior literature (e.g., Colder, Campbell, Ruel, Richardson, & Flay, 2002; Hill et al., 2000; Orlando, Tucker, Ellickson, & Klein, 2005). It is our feeling that this added class arises as an artifact of a testing effect that is well-established among self-report measures in longitudinal studies; higher levels of negative feelings and behaviors are often reported at the first wave of data collection (see, e.g., Knowles, Coker, Scott, Cook, & Neville, 1996; Twenge & Nolen-Hoeksema, 2002). An inflated first-year report could explain the added class, so we decided to eliminate the 4-class LCGA.
At this point, based on combinations of fit statistics, residuals, considerations of parsimony, and plots, we felt that all 2- and 4-class models, as well as all GMMs could be reasonably ruled out for this data set, leaving the 3-class LCGA and LLCA models and the HGLM still under consideration.
After selecting the best unconditional models, we used the adolescent’s seventh-grade report of alcohol consumption by close friend to explain individual differences and to help guide model selection. The covariate was allowed to predict class membership in the mixture models and the growth-factor variability in the hierarchical generalized linear model (HGLM). If a growth mixture model (GMM) were still under consideration, it would be possible to use the covariate to explain within-class variation, as well as the probability of class membership. In the case of GMM, as in the other cases, covariates are used to explain individual differences. When a covariate is permitted to affect both within-class and between-class variability, it is predicting a complex combination of an individual’s probability of class membership and that individual’s trajectory, relative to the mean trajectory in each class in which the individual has some non-zero probability of membership. This can make the covariate’s effects potentially difficult to interpret.
In the HGLM, targets with more close seventh-grade friends drinking alcohol were, themselves, predicted to be drinking more in ninth grade (β01 = 1.36, SE = 0.24, p < .001), but it was also predicted that their drinking would be increasing more slowly in ninth grade (β11 = −.19, SE = 0.07, p = .011). Despite this effect, however, a target who reported that even a few of his or her friends were drinking in seventh grade (i.e., 1 on the 0 to 4 scale) would still end up, on average, drinking more in 12th grade than an adolescent who reported having no drinking friends in seventh grade (see Figure 9).
The regression coefficient for a single class represents the change in the log-odds of membership in that class, versus a reference class, for a one-unit change in the covariate (for members of those two classes). The reference class can be any class, and Mplus 4.21(B. Muthén & Muthén, 1998 – 2006) gives results with each class, in turn, treated as the reference. Viewing results this way permits comparison of all pairs of classes with respect to the covariate, which may be more useful than simply using a single reference class. Note, however, that, as the number of classes increases, testing all possible pairs may not be necessary or useful, and could introduce the possibility of an inflated Type-1 error rate. Additionally, global tests of association between the latent class variable (with a given number of classes) and one or more covariates can be conducted using the standard likelihood ratio test for nested models.
Table 5 shows all paired class comparisons in 3-class LCGA and LLCA models. The results across the two models are very similar. Members of the high class reported more alcohol-using friends in seventh grade than either of the remaining two classes, which did not differ significantly from one another. In deciding between LLCA and LCGA 3-class models, we weighed the better fit of the LLCA against its lack of parsimony; as compared with the LCGA (the LLCA requires 34 more parameters; Table 1). If the LLCA patterns of change had shown noticeable signs of non-linearity, then it would have made sense to choose that model over LCGA. However, it did not (Figure 7), and the substantive stories told by the trajectory plots and the relationships between classes and the covariate are the same for the two models. Based on all of these factors, we retained the LCGA and dropped the LLCA. Parameter estimates for the conditional 3-class LCGA are shown in Table 6.
A single set of categorical adolescent-drinking data was analyzed using a hierarchical generalized linear model (HGLM) with linear and quadratic slopes, and 21 specifications of mixture models. The HGLM showed a statistically significant mean linear increase in drinking (with a non-significant mean quadratic slope), and reflected significant variability around the intercept, in ninth grade, and both linear and quadratic slopes. All of the mixture models also revealed an overall increase in drinking across time. In general, freeing more variances and covariances in the growth mixture models (GMMs) led to fewer classes being estimable, resulting in a range of one class (equivalent to HGLM), when all variances and covariances were estimated, to seven classes, when all variances were constrained to zero (in the latent class growth analysis; LCGA). Based on the convergence properties, fit indices (BIC and LMR proved most useful), and, in two cases, the extraction of unacceptably small classes, we were able to pare our list down to 10 potential models, including the HGLM.
We next considered the fit in terms of model-predicted versus observed category frequencies by looking at standardized residuals for the univariate and bivariate marginal frequencies (Table 3), and the most commonly occurring patterns of responses (Table 4). Too many standardized residuals larger than 2 is a sign of poor fit of the model to the data. All models fit the univariate marginal frequencies fairly well, and all but one—the 2-class longitudinal latent class analysis (LLCA) model—fit the common-pattern frequencies adequately. We eliminated the 2-class LLCA and also dropped the remaining 2-class models, all of which had 15 or more large bivariate residuals (this cutoff was chosen somewhat arbitrarily). The 3- and 4-class LLCA models both predicted cell frequencies extremely well, so we eliminated the 4-class model, which required 16 more parameters and had a higher BIC.
For the remaining selection decisions, we used plots to help us consider substantive issues: interpretability, usefulness, relationships with covariates, and consistency with theory and previous literature. Based on the plots, we rejected the sole remaining GMM. One of its classes was of little use in terms of explaining outcomes and it lacked a class with consistently low drinking probabilities that is predicted by theory and previous literature (e.g., Colder et al., 2002).14 Finally, the plots showed how poor the class separation was when the intercept variances were estimated. Given the overlap between classes and the difficulty in estimating and interpreting this model, even with all other things being equal, we would see little reason to select this GMM over the more straightforward and parsimonious HGLM.
We also compared the plots for the 3- and 4-class LCGAs, which had very similar BICs (Table 1). The added class showed a pattern that was neither predicted by theory nor found in previous studies of adolescent alcohol use. Rather, this new class appeared to be an artifact resulting from test-retest effects often found in longitudinal studies comprising self-reported data. We rejected the 4-class model in favor of the 3-class.
The 3-class LCGA and LLCA models were very similar to one another (Figure 7). Both had one large class with low drinking probabilities across adolescence and a small increase toward the end of high school (labeled “low”), a small class with relatively high drinking probabilities (including more frequent drinking) across adolescence (“high”), and a class that started with low drinking probabilities in seventh grade and then increased fairly rapidly, until its drinking probabilities were similar to the “high” class in 12th grade (“increasing”).
Finally, we added a covariate. The three remaining models (LCGA and LLCA 3-class models, and HGLM) all showed that adolescents who reported having close friends who used alcohol in seventh grade were likely to be drinking more frequently themselves across all of the junior high and high school years. Such similar findings across the three models suggest that all offer reasonable representations of drinking across middle and late adolescence, and given the large difference in parsimony between LCGA and LLCA (13 vs. 47 parameters, respectively), it made sense to reject the LLCA in favor of the LCGA. In situations where plots show that change over time is less continuous and linear or curvilinear (e.g., episodic problems like suicidality), or when the proportional odds assumption is not tenable, LLCA may offer considerable advantage over the other models.
The two remaining models, HGLM and the 3-class LCGA, were equivalent in terms of fit to the data and to theory, and they told essentially the same substantive story, so we did not choose between them. It is possible that, based on further analyses, one of the two would stand out as a clearly better model. For example, differences might emerge when other covariates are used, or one of the two models could prove a better predictor of individual values on a distal outcome.
In any data-analysis situation, there is a multitude of decisions that must be made, from determining a statistical model for the distribution of the data to choosing the structural models that best answer the research questions. The study of problem behaviors can be a particular challenge; the behaviors represent developmental processes, requiring longitudinal analyses, and because the behaviors are rare, their distributions are highly skewed. In addition, most data measuring these behaviors are discrete (binary, ordinal, or count). We have presented a selection of alternative models for analyzing longitudinal categorical data. These approaches assume more appropriate distributions for problem-behavior data than the traditional continuous-data models, permitting researchers to have more faith in the results.
Starting with the concepts involved in the generalized linear model (GLM) for categorical data, we discussed longitudinal extensions, including the hierarchical (or longitudinal) version of the generalized linear model (HGLM), and several specifications of mixture models. While there has been a great deal of excitement recently over using mixture models to find homogeneous unobserved subgroups in the population, we showed how such approaches can also be viewed as extensions of the random-effects model (Bauer & Curran, 2003; B. Muthén & Asparouhov, 2006), which relax some of its strong assumptions. Because of this, they might be the most appropriate choices for certain types of data distributions, irrespective of any question of subpopulations.
Our goal in this paper was to present a group of related models for longitudinal categorical data and to offer substantive researchers a useful guide to testing and selecting between the alternative models. This is a particularly difficult issue with categorical data and complicated longitudinal models, such as mixture models, because, as we find here, there are no completely reliable fit statistics for these types of models, and no readily available measures of absolute fit. We suggested several considerations to use in model selection. The first have to do with convergence properties, measures of comparative fit, and residual analysis. We dismissed models that did not converge well (including both improper solutions and failure to repeat the best loglikelihood value). From the remaining models, we chose the most plausible ones based on the fit statistics, and then ruled out several more because they failed to adequately reproduce the observed cell frequencies or were equivalent to more parsimonious models with respect to fit. Next, we included substantive considerations in model selection. For instance, we rejected a model because its substantive meaning was both obscure, rendering the model less useful, and not a reasonable fit to theory or to prior research. We also selected one of two similar models, despite poorer fit, because it was more parsimonious. Of course, we tested a limited number of models and only a single covariate. Choice and specification of covariates are an important factor in model selection and differences in these could potentially lead to different model choices
Through this process of elimination, we selected the two models that offered the most reasonable combination of parsimony, fit to the data, and fit to theory—the hierarchical generalized linear model (HGLM) and the 3-class latent class growth analysis (LCGA). The two models also use roughly the same number of parameters: 11 and 13, respectively and the substantive findings were similar. Both showed that, while adolescents are increasingly likely to drink as they approach the end of high school, those who, in seventh grade, reported having more friends who were drinking, were more likely to be drinking, themselves, throughout the rest of the secondary school years. The choice between these two models, based on the data we have presented, is, in our opinion, largely a matter of ease of interpretation, although further investigation could turn up substantive or statistical reasons to choose one over the other, or even select a different model over these two.
This study has the strengths and the weaknesses that inevitably result from using actual data as an example. On the one hand, it shows how the models might look in an actual research situation. On the other hand, with simulated data, we could compare the model results with a known population to discover whether there are any systematic biases in the analyses. It is also possible that different approaches to measuring alcohol you could yield different results (Feldman & Masyn, 2008). Finally, there is one more critical piece which warrants further study. Often the purpose of longitudinal studies is to predict what will happen to the young people later in life. The models may not be of much use if they fail in that respect, no matter how well they appear to characterize the data at hand. Future research needs to take distal outcomes into account in assessing the adequacy of the models.
In this paper we present and compare a fairly large set of models, but it is not an exhaustive list of potential models. There is an infinite set of possible models and, with empirical data, we cannot know what the true model is. For instance, there are models outside of the GLM framework that could be considered when growth is not linear or curvilinear, as well as two-part models, in which behavior is modeled as two related processes: (1) exhibiting the behavior versus not doing so, and (2) intensity or extent of the behavior (Olsen & Schafer, 2001; Tooze et al., 2002). However, although there remains some uncertainty with respect to model choice, we feel that all of the methods shown here may offer improvement over approaches that treat all data, irrespective of their actual distributions, as continuous and normally distributed.
In addition, because comparing longitudinal categorical models is not always straightforward, we have offered a systematic approach to assessing and selecting models from among several competing, equally appropriate statistical options. It is our belief that adopting this new business-as-usual approach will improve our ability to understand trajectories of problem behaviors over childhood and adolescence. With improved models, we can better determine risks and protective factors, and this information could potentially be used for early identification of those most at-risk, helping to select the best candidates for targeted preventive interventions.
Support for this research has come from multiple sources, including the National Institute of Mental Health (MH00567, MH19734, MH43270, MH48165, and MH51361), the National Institute on Drug Abuse (DA05347 and HD047573), the National Institute of Child Health and Human Development (HD027724 and HD047573), the Bureau of Maternal and Child Health (MCJ-109572), the MacArthur Foundation Research Network on Successful Adolescent Development Among Youth in High-Risk Settings, and the Iowa Agriculture and Home Economics Experiment Station (Project No. 3320). We thank Shelley Blozis, Keith Widaman, Sophia Rabe-Hesketh for their helpful comments on this paper. We also thank the editor and anonymous reviewers for many insightful and helpful remarks.
|Name||Acronym||Alternative Name||Link||Distributional Assumptions|
|(acronym)||Level 1||Level 2 +|
|Hierarchical Linear Model||HLM||Random Effects Model|
Random Coefficients Model
Multilevel Model (MLM)
|Hierarchical Generalized Linear|
|HGLM||Generalized Linear Mixed|
|May be identity,|
log, logistic or
|Growth Mixture Model||GMM||Growth Mixture Analysis|
Finite Mixture Model
|May be identity,|
log, logistic or
|Latent Class Growth||LCGA||Semi-parametric Group-|
Based Model (SGBM)
|Latent Profile Analysis||Identity||Normal||Multinomial|
|Longitudinal Latent||LLCA||Latent Class Analysis|
|May be identity,|
logistic or logit,
Publisher's Disclaimer: The following manuscript is the final accepted manuscript. It has not been subjected to the final copyediting, fact-checking, and proofreading required for formal publication. It is not the definitive, publisher-authenticated version. The American Psychological Association and its Council of Editors disclaim any responsibility or liabilities for errors or omissions of this manuscript version, any version derived from this manuscript by NIH, or other third parties. The published version is available at http://www.apa.org/journals/dev/.
1Results from analyses of clinical samples can only be generalized to that population and may not apply to the general public, and use of other types of non-representative samples may also have problems with respect to inference (Hernán, Hernández-Diaz, & Robins, 2004).
2For extended discussions of generalized linear models, including methods for various types of non-Gaussian distributions, see, e.g., Agresti (2002), McCullagh and Nelder (1999), and Skrondal and Rabe-Hesketh (2004).
3All of these models assume that the distribution of the observed response variable is a member of the exponential family of distributions; most commonly seen among discrete distributions: multinomial (ordered or unordered categorical data), Poisson (count data), or Bernoulli (dichotomous data); and continuous distributions: gamma and Gaussian (or normal).
4In the probit model, the errors are assumed to be normally distributed (Agresti, 2002; Skrondal & Rabe-Hesketh, 2004). For an early references to similarly formulated probit models, see, e.g., Ashford (1959), Goldberger (1964).
7For models in which change over time is permitted to be non-linear, see, e.g., Blozis (2004, 2007), and Ferrer and McArdle (2003).
8For examples of alternative parameterizations, in which different restrictions on the thresholds permit γ00 be estimated, see Mehta, Neale, and Flay (2004), Millsap and Yun-Tein (2004), and Skrondal and Rabe-Hesketh (2004).
11LLCA differs from latent transition analysis (LTA) because LTA models state changes across consecutive timepoints, rather than patterns that span several timepoints (see, e.g., Collins, Graham, Rousculp, & Hansen, 1997; Graham, Collins, Wugalter, Chung, & Hansen, 1991). As such, it answers different types of questions than LLCA and the other models discussed here, just as autoregressive models answer different types of questions than latent growth curves.
12Allowing covariates to affect the slope means that the effect of the covariates on the outcome changes across time; however, the covariate is assumed to affect all individuals in the same fashion and is known as a “fixed effect.” Alternatively, we could constrain the covariate effect to be time-invariant but allow that effect to vary across individuals (as a random effect).
13Absolute size of standardized residuals can also be taken into consideration, if they are very large (e.g., > 10), which did not occur in these analyses.
14It is possible, in GMM and other mixture models, to specify a class with zero probability of drinking at all timepoints if such a class is believed to exist. For details on how this is done, see Kreuter and Muthén (2007).
Betsy J. Feldman, Graduate School of Education, University of California, Berkeley.
Katherine E. Masyn, Department of Human and Community Development, University of California, Davis.
Rand D. Conger, Department of Human and Community Development, University of California, Davis.