|Home | About | Journals | Submit | Contact Us | Français|
This study fills in the current knowledge gaps in statistical analysis of longitudinal zero-inflated count data by providing a comprehensive review and comparison of the hurdle and zero-inflated Poisson models in terms of the conceptual framework, computational advantage, and performance under different real data situations. The design of simulations represents the special features of a well-known longitudinal study of alcoholism so that the results can be generalizable to the substance abuse field. When the hurdle model is more natural under the conceptual framework of the data, the zero-inflated Poisson model tends to produce inaccurate estimates. Model performance improves with larger sample sizes, lower proportions of missing data, and lower correlations between covariates. The simulation also shows that the computational strength of the hurdle model disappears when random effects are included.
Measuring symptomatology of a target disease longitudinally can provide useful data for assessing disease progression or evaluating long-term effects of treatment or intervention. Substance use disorders are relapsing–remitting in nature . The disease course manifests itself clinically by nondeterministic fluctuations between periods of worsening symptoms and periods of improvement. Such longitudinal trajectories can hardly be delineated by the popular growth curve modeling that employs polynomial functions of time because (i) they have all orders of derivatives everywhere, (ii) polynomial degree cannot be controlled continuously, and (iii) further individual observations can have a large influence on remote parts of the curve . Another common feature of this type of data is that the symptom count measure tends to have excess zero values beyond what would be expected by a classical Poisson model, especially when the sample is drawn from the general population or a community . Moreover, the large individual differences in developmental trajectories commonly observed in the substance abuse field (e.g., ) certainly increase the difficulty level of the data analysis.
Longitudinal zero-inflated count data frequently occur in not only the substance abuse field but also in other fields such as healthcare utilization , pharmaceutical research , and vaccination safety . Thus, statistical models designed for this kind of data have many practical applications. In recent years, the applications of regression splines to longitudinal data analysis have increased dramatically because of their flexibility to different developmental patterns and robust model assumptions. The aim of this paper is to review and compare two competing models for zero-inflated count data, the hurdle model, and the zero-inflated Poisson (ZIP) model, in the setting of longitudinal data analysis with regression splines for modeling longitudinal trajectories as well as random effects for handling within-subject correlation and between-subject heterogeneity. We demonstrate the programming in SAS  and conduct simulations to evaluate the performance of the competing models based on the data features of the Michigan Longitudinal Study (MLS), which is an ongoing multiwave prospective study of youth at high risk for alcoholism. The alcohol use disorder (AUD) symptom counts collected from this sample from childhood to adulthood provide a typical example of longitudinal zero-inflated count data.
We organize this paper as follows. In Section 2, we introduce and compare the hurdle model and the ZIP model with an emphasis on the conceptual and computational differences between the models. We also discuss the issues of applications in the substance abuse field. In Section 3, we present a motivational example using the MLS data. In Section 4, we conduct simulation studies to assess the performance of the statistical models under different levels of the sample size, proportion of missing data, and correlation between covariates. We also compare the computational time consumed by the competing models. We present discussion and concluding remarks in Section 5. We provide a SAS program example in Appendix A.
The hurdle model  has mostly been adopted to conduct economic analysis of healthcare utilization. The model postulates a two-stage decision structure in the demand process: the first stage involves the decision to seek care, and the second stage determines how much care is demanded among the subgroup of users for whom the hurdle is crossed. One major strength of the hurdle model is that it can simultaneously accommodate two sets of factors that contribute to separate stages.
In a longitudinal setting, suppose that Yij is the response such as the symptom count for subject i at time tij (i = 1, …, n, j = 1, …, mi ) and it is from a finite mixture:
The probability distribution is thus written as
Let xij be the set of factors that contribute to the severity of symptomatology for the people who have developed some symptoms and zij be another set of factors that account for the probability of being symptom free. The parameters can be modeled by
where β and γ are fixed effects for covariates xij and zij, respectively; ai and bi are random effects accounting for within-subject correlation and between-subject heterogeneity; g(tij) and h(tij) are baseline functions for the nonlinear time effects. It is assumed that
We fit the functions g(·) and h(·) with the following piecewise quadratic polynomials (for the ease of presentation, we drop ij from t):
where the + indicates the positive value from the expression inside the parentheses; t1, …, tK are the knots that divide the range of t into segments. The number of knots controls the amount of smoothing and can be chosen by goodness-of-fit statistics such as BIC. We can use SAS PROC TRANSREG  to estimate these regression splines. We include a program example in Appendix A.
By assuming that Yij, j = 1, …, mi are conditionally independent given the random effects (ai, bi), the likelihood function for data from subject i can be written as
where I(·) is an indicator function; F(·) is the joint distribution function of (ai, bi). To obtain the maximum likelihood estimators for the unknown parameters β, γ and Σ, we need to optimize the likelihood function
An alternative model for zero-inflated count data, the ZIP model, was originally proposed to model the number of defects on an item in a manufacturing process that is assumed to move randomly back and forth between a perfect state and an imperfect state . Like the hurdle model, the ZIP model can simultaneously accommodate one set of factors that make the perfect state more likely and another set of factors that contribute to fewer defects in the imperfect state. The model has been applied in many fields including health science where the population consists of the people who are at risk for a disease and others who are not at risk . The original model has also been extended to accommodate an upper-bounded count situation as well as a repeated measures design .
Like the hurdle model, the ZIP model is a finite mixture model:
The probability distribution is thus written as
We can view both the hurdle and ZIP models with random effects as nonlinear mixed effects models. Vonesh and Chinchilli have provided a systematic account on this class of models . The estimation procedure used in SAS PROC NLMIXED is based on the marginal likelihood. Following the theory of maximum likelihood estimation, the resulting estimators follow an asymptotic normal distribution. Furthermore, the standard errors of the estimates can be computed using the inverse of Fisher information matrix.
SAS PROC NLMIXED uses twice the negative of the log likelihood to measure goodness of fit. This is similar to the deviance used in generalized linear models. Specifically, for the ZIP model with random effects, define
for each observation yij and
To calculate dij, we need to have predicted values of the random variables, ij and ij, which can be estimated using the empirical Bayes method. We can then use the Chi-square test (degrees of freedom = number of observations–number of parameters) to examine goodness of fit. We can conduct a goodness-of-fit test for the hurdle model with random effects in a similar way.
Conceptually, the ZIP model is more intuitive when the population consists of a group of people who are at risk for a disease and another group who are not at risk (e.g., women are not at risk for prostate cancer), whereas the hurdle model is more appropriate when all people in the population are considered at risk of an event and the realization of the event represents a hurdle that has been crossed . In substance abuse research, either way of conceptualization makes sense. For example, the ZIP model is applicable when we divide the population into nondrinkers who can only have zero alcohol-related symptoms and drinkers who may have zero symptoms. For another example, the hurdle model is appropriate if every person in the population is considered at risk for alcohol dependence but only some people meet at least one symptom criterion. For longitudinal studies such as the one described in the next section, people are assumed to be at different levels of risk for alcoholism-related symptoms at different ages. People also change from nondrinkers to drinkers or the other way around across time. The hurdle model tends to have reasonable grounds for such settings.
As shown in previous sections, the hurdle and ZIP models are both finite mixture models – the hurdle mixes 0 and a truncated Poisson, whereas the ZIP mixes 0 and a regular Poisson. The two models can be shown to be mathematically equivalent – one is just a reparameterization of the other when there are no covariates involved . In the case when covariates are included in the models, it is not clear how the two models relate to each other, although it was shown that these two models produce similar estimates and have indistinguishable goodness-of-fit measures in empirical data analysis . Simulation studies are needed to investigate the consequences of using one model when the other model is more natural under the conceptual framework of the data.
The major strength of the hurdle model is that it can handle not only zero-inflated data but also zero-deflated data, whereas the ZIP model can only deal with zero-inflated data . Although this strength makes the hurdle model more applicable in general settings, it is less relevant for substance abuse data because zero-inflated count data are typically the norm, whereas zero-deflated count data are extremely rare in the field. Symptom count data collected from the general population or a community sample tend to be zero-inflated because many participants are nondrinkers or drinkers who have not yet developed symptomatology. Even when we collect symptom count data from a treatment sample such as patients with alcohol dependence, we are likely to observe many participants having only the minimum number of symptoms (e.g., the DSM-IV  requires at least three of seven possible symptoms to meet an alcohol dependence diagnosis). The resultant data can again be analyzed by the hurdle or ZIP model with a shift on the location.
Another strength of the hurdle model is its computational simplicity. When there is no random effect, it is shown that the log-likelihood function of the hurdle model can be factored into two terms with one involving β (in Equation (1)) and another involving γ (in Equation (2)), so one can obtain maximum likelihood estimates by separately maximizing the two terms . For the ZIP model, on the other hand, the model components must be fit simultaneously and therefore is more complex computationally. Nevertheless, when random effects are included such as Equations (1) and (2), this strength may disappear as the model components must be fit simultaneously for both models because the random effects have to be integrated out in the optimization process (Equation (4)). To the best of our knowledge, the computational time of the two models has never been compared systematically, especially when random effects are involved.
In this study, we aim to extend the applications of the hurdle and ZIP models to analyze longitudinal zero-inflated count data in the substance abuse field. We fill in the current knowledge gaps by conducting simulations on the basis of the data features of a well-known longitudinal study on alcoholism to (i) investigate the consequences of using one model when the other model is more natural under the conceptual framework of the data; (ii) compare the computational time consumed by the two models; and (iii) evaluate the performance under different sample sizes, proportions of missing data, and correlations between covariates.
The MLS is an ongoing multiwave prospective study of people at high risk for substance use disorders [15, 16]. The study recruited participant families using fathers’ drunk driving conviction records and door-to-door community canvassing in a four-county area in mid-Michigan. All participants received extensive in-home assessments of their psychiatric symptoms including alcoholism-related symptoms at baseline, and thereafter at 3-year intervals. In this study, we use longitudinal data from a sample of 635 children (71% male) for analysis. Their mean age at the latest assessment wave was 20 years.
The following is a brief list of the 11 DSM-IV symptom criteria for AUD :
|Abuse symptom 1:||Failure to fulfill major role obligations|
|Abuse symptom 2:||Hazardous use|
|Abuse symptom 3:||Legal problems|
|Abuse symptom 4:||Social or interpersonal problems|
|Dependence symptom 1:||Tolerance|
|Dependence symptom 2:||Withdrawal|
|Dependence symptom 3:||Taken in larger amounts or over a longer period|
|Dependence symptom 4:||Persistent desire or unsuccessful efforts to cut down|
|Dependence symptom 5:||A great deal of time spent|
|Dependence symptom 6:||Important activities given up or reduced|
|Dependence symptom 7:||Physical or psychological problems|
The symptom count (range of 0–11) serves as an important indicator for AUD severity. The zero values in the data from this community sample are more than what would be expected from a classical Poisson regression model (83% across waves). Thus, statistical models for zero-inflated data such as the hurdle and ZIP models are needed.
The substance abuse literature has shown that children of alcoholics, early onset drinkers, men, or youth with high internalizing or externalizing behavior are at a higher risk for progression into AUD [17, 18]. We applied both the hurdle model and the ZIP model to estimate the effects of these risk factors on AUD symptom counts using the MLS data. Table I shows the estimated regression coefficients, standard errors, p-values for t -tests, and goodness-of-fit test results. Most of the estimates in the Poisson component produced by the two models are close, whereas the ones in the zero component are very different. Given the significance level at .05, both models identify externalizing behavior in youth as an important risk factor that contributes to more AUD symptoms and a lower likelihood of being nondrinkers (ZIP) or symptom free (hurdle). Early onset of drinking is found by both models to be a significant factor in the zero component but is only significant in the Poisson component under the hurdle model. Moreover, children of alcoholics are shown by the hurdle model to be less likely to have no AUD symptoms. Although there are significant individual differences (i.e., the random effect) in the Poisson component under both models, the random effect term in the zero component is only found to be necessary under the hurdle model. As shown on the bottom of Table I, both models fit as well as the saturated model.
Figures 1–2 show the spline functions of time. Like the estimates of fixed effects in the Poisson component (shown in Table I), the spline functions in the Poisson component generated by the two models look very similar – the severity level of AUD symptomatology grows rapidly from age 10 to 13 years and stays at the same high level throughout early adulthood (Figure 1). On the other hand, Figure 2 shows that the two models produce very different spline functions in the zero component during early adolescence. The ZIP model indicates that the probability of being a nondrinker increases from age 10 to 13 years, whereas the hurdle model delineates that the probability of being symptom free decreases rapidly during the same period. Both models demonstrate that the corresponding probability gradually decreases year by year from age 13 to early 20s and afterwards stays relatively flat.
The developmental trajectories of AUD symptomatology under the hurdle model are more legitimate as the youth in this high risk sample tend to start drinking earlier and thus are more likely to develop alcohol-related symptoms quickly in early adolescence. Furthermore, for longitudinal studies such as the MLS, people are assumed to be at different levels of risk for alcoholism-related symptoms at different ages. People also change from nondrinkers to drinkers or the other way around across time. The logic behind the hurdle model thus tends to have reasonable grounds in such a setting. For these reasons, we designate the hurdle model as the true model in the simulation study.
We adopted the fitted hurdle model in the motivating example as the true model to generate simulated data that closely represent the structure of real data so that the results can be more generalizable to the substance abuse field . To generate the five covariates (x = z), we drew random samples from a multivariate normal distribution N5(0, ψ), where the diagonal element ψii = 1 and the off-diagonal element ψij = r|i−j| (i, j = 1, …, 5). We employed three levels of correlation in this experiment: small (r = 0.00), medium (r = 0.25), and large (r = 0.50). We also manipulated the sample size at three levels: small (N = 100), medium (N = 200), and large (N = 400). For each subject, we randomly generated a covariate set at each of 20 predetermined assessment waves with a random shift added to each wave so that the subjects may not follow exactly the same assessment schedule such as real data. In addition, we varied the proportion of missing data commonly observed in real data at three levels: small (p = 0.20), medium (p = 0.30), and large (p = 0.40). In summary, we manipulated three factors: the sample size (N), the proportion of missing data (p), and the correlation between covariates (r), with three levels for each factor. Thus, there were in total nine situations. For each situation, we conducted 200 replications.
In each situation, we evaluated the performance of three alternative models: the Poisson regression model, the ZIP model, and the hurdle model. All the models involve both fixed and random effects (i.e., mixed effects). We used the Poisson regression model as a control model to examine the consequences of ignoring zero inflation in the data. We compared the performance of the ZIP model with the one of the hurdle model to evaluate the consequences of using the ZIP model when the hurdle model was more natural under the conceptual framework of the data. For the fixed effects (β, λ) and the variance–covariance of the random effects Σ, we used the mean squared error (MSE) summarizing the deviation of the 200 estimates from the parameter as the criterion for performance evaluation. For the nonlinear function of time g(t), we calculated the integrated mean squared error (IMSE) for each replication:
We used the average of the 200 IMSEs to measure the deviation from the true function. We applied the same computation to the other spline function h(t). To save space, we only show those results featuring the effects of one factor holding the other two factors at their medium values in Tables II–IV. Interested readers may request the tables for other situations from the first author.
Table II shows the effects of varied sample sizes on performance of the three alternative models holding the correlation between covariates and the proportion of missing data at the medium values (r = 0.25, p = 0.30). For any model, the performance improves (i.e., smaller MSE) as the sample size increases. When the levels of the three factors are fixed, the hurdle model performs better than the ZIP model, especially on estimating the parameters in the zero component. The Poisson regression model performs much worse than the other two models because it ignores the zero inflation in the data.
In Table III, the effects of varied proportions of missing data are depicted with the correlation between covariates and the sample size fixed at the medium values (r = 0.25, N = 200). As the proportion of missing data increases, the performance of every model becomes worse (i.e., larger MSE). For any combinations of levels in the three factors, the hurdle model outperforms the ZIP model, especially on estimating the parameters in the zero component. Both the hurdle model and the ZIP model perform much better than the Poisson regression model that imposes an incorrect assumption on the count data.
Table IV summarizes the simulation results with the correlation between covariates manipulated while holding the other two factors constant (p = 0.30, N = 200). Like the results in Tables II–III, the Poisson regression model tends to produce much larger MSE’s than the other two models across different levels of correlation. The performance of the hurdle model tends to decline as the correlation becomes higher. The other two models, on the other hand, do not have a clear pattern of changes with varied correlations.
One objective of this study is to compare the two models in terms of their computational time, particularly when random effects are involved. Our simulation shows that the average computational time per replication is 15 min and 45 s for the ZIP model; it is 17 min and 24 s for the hurdle model. Therefore, this confirms our hypothesis that the computational strength of the hurdle model disappears when random effects are included in the model.
This study has filled in the current knowledge gaps in statistical analysis of longitudinal zero-inflated count data by providing a comprehensive review and comparison of the ZIP and hurdle models in terms of the conceptual framework, computational advantage, and performance under different real-data situations. The design of our simulation study is unique because it represents the special features of a well-known longitudinal study on alcoholism risk so that the results can be generalizable to the substance abuse field.
Our simulation results demonstrate the danger of using the Poisson regression model to analyze longitudinal count data when excess zeros exist in the data: the model tends to produce much higher MSEs than the hurdle and ZIP models. On the basis of both the simulations and empirical analysis of real data, when the hurdle model is more natural under the conceptual framework of the data, the ZIP model tends to result in relatively high MSEs, particularly in the zero component. Moreover, the performance of the models improves with a larger sample size, lower proportion of missing data, and lower correlation between covariates. The simulation also shows that the computational strength of the hurdle model disappears when random effects are included in the model.
According to our comparison between the hurdle and ZIP models in Section 2.4 and the simulation results, we would like to provide a general guideline on applications of these models for practitioners. If there exist a group of subjects who are at risk for an event and another group who are not at risk, the ZIP model captures the conceptual framework better than the hurdle model that is more appropriate when all the subjects are considered at risk for an event. Both models are designed to address the issue of zero-inflated data. However, only the hurdle model can handle zero-deflated data. When there is no random effect (or the random effect is ignorable), the hurdle model has computational advantage. But such advantage should play a minimal role in choosing a model because the conceptual framework and features of the data are more important.
In this study, we only consider the Poisson link function for both the hurdle and ZIP models because it fits well with the real data example. In some applications where overdispersion is common, such as analyzing vaccine adverse event count data , the negative binomial link should be considered.
In our analysis, we implicitly assumed that the mechanism of missing data is missing completely at random, so we only needed to model the response process with the use of available observations. It would be an interesting research topic to consider other patterns of missing data such as missing at random, where the missing process is correlated with the response process. For the case of missing at random, to obtain consistent and unbiased estimates of the parameters for the response process, we need to take into account the missing process appropriately (e.g., inverse probability approach), and we should model both the missing process and the response process. This may be a topic for future research.
As shown by the simulations, the average computational time for fitting either model is 15–17 min because it involves both numerical integration and optimization procedures. Future methodological work is needed to improve the computational efficiency in real-life applications.
Buu’s research was supported by a National Institutes of Health (NIH) grant, K01 AA16591; Li’s research was supported by NIH grants, R21 DA024260 & P50 DA10075, and a National Science Foundation (NSF) grant DMS 0348869; Tan’s research was supported by an NIH grant P50 DA10075; and Zucker’s research was supported by an NIH grant R37 AA07065. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH or the NSF.
We carried out all programming in SAS version 9.2. We employed PROC TRANSREG to construct piecewise quadratic polynomials in Equation (3) with the use of degree two B-splines (DEGREE=2) that have nice numerical properties and are easy to manipulate . We chose five knots (NKNOTS=5) because the corresponding model fits the MLS data well. This results in 5 + 3 terms, t_0–t_7, that were stored in the data set basis.
PROC TRANSREG DATA= mls; MODEL IDENTITY(y) = BSPLINE(t/DEGREE=2 NKNOTS = 5 ); OUTPUT OUT = basis PREDICTED; RUN;
We merged the resulting transformations of t with the original data set mls, which contains the ID number (target), the outcome (y), and the covariates (coa, earlyos, male, inttscr, exttscr).
DATA final; MERGE mls basis; KEEP target y coa earlyos male inttscr exttscr t_0–t_7; RUN;
We used PROC NLMIXED to fit the hurdle model as described in Section 2.1. The variables created in the program are defined as follows:
infprob: The probability of being symptom-free ij linpinfl: The logit (ij) in Equation (2) mu: The mean for truncated Poisson μij sp_eff: The random effect bi in Equation (2) varsp: The variance of bi sl_eff: The random effect ai in Equation (1) varsl: The variance of ai cors: The correlation between ai and bi ll: The log-likelihood function
PROC NLMIXED DATA=final; varsp = exp(2*logsp); varsl = exp(2*logsl); linpinfl = sp_eff+a1*coa+a2*earlyos+a3*male+a4*inttscr +a5*exttscr+a6*t_0+a7*t_1+a8*t_2+a9*t_3+a10*t_4 +a11*t_5+a12*t_6+a13*t_7; infprob = 1/(1+exp(−linpinfl)); mu = exp(sl_eff+b1*coa+b2*earlyos+b3*male+b4*inttscr+b5*exttscr +b6*t_0+b7*t_1+b8*t_2+b9*t_3+b10*t_4+b11*t_5 +b12*t_6+ b13*t_7); IF y=0 THEN ll = log(infprob); ELSE ll = log((1−infprob)) −mu+y*log(mu) −lgamma(y+1) −log(1−exp(−mu)); MODEL y ~ general(ll); RANDOM sp_eff sl_eff ~ NORMAL([0,0], [varsp, cors, varsl]) SUBJECT=target; ESTIMATE ’Varsp ’ varsp; ESTIMATE ’Varsl ’ varsl; ESTIMATE ’cors ’ cors; RUN;
We used PROC NLMIXED to fit the ZIP model as described in Section 2.2. All the codes are identical to the codes for the hurdle model except for the log-likelihood function ll.
PROC NLMIXED DATA=final; varsp = exp(2*logsp); varsl = exp(2*logsl); linpinfl = sp_eff+a1*coa+a2*earlyos+a3*male+a4*inttscr +a5*exttscr+a6*t_0+a7*t_1+a8*t_2+a9*t_3+a10*t_4 +a11*t_5+a12*t_6+a13*t_7; infprob = 1/(1+exp(−linpinfl)); mu = exp(sl_eff+b1*coa+b2*earlyos+b3*male+b4*inttscr+b5*exttscr +b6*t_0+b7*t_1+b8*t_2+b9*t_3+b10*t_4+b11*t_5 +b12*t_6+b13*t_7); IF y=0 THEN ll = log(infprob+(1−infprob)*exp(−mu)); ELSE ll = log((1−infprob)) −mu + y*log(mu) −lgamma(y+1); MODEL y ~ general(ll); RANDOM sp_eff sl_eff ~ NORMAL([0,0], [varsp, cors, varsl]) SUBJECT=target; ESTIMATE ’Varsp ’ varsp; ESTIMATE ’Varsl ’ varsl; ESTIMATE ’cors ’ cors; RUN;