|Home | About | Journals | Submit | Contact Us | Français|
The aim of the paper is to produce a methodology that will allow users of ordinal scale data to more accurately model the distribution of ordinal outcomes in which some subjects are susceptible to exhibiting the response and some are not (i.e., the dependent variable exhibits zero inflation). This situation occurs with ordinal scales in which there is an anchor that represents the absence of the symptom or activity, such as “none”, “never” or “normal”, and is particularly common when measuring abnormal behavior, symptoms, and side effects. Due to the unusually large number of zeros, traditional statistical tests of association can be non-informative. We propose a mixture model for ordinal data with a built-in probability of non-response that allows modeling of the range (e.g., severity) of the scale, while simultaneously modeling the presence/absence of the symptom. Simulations show that the model is well behaved and a likelihood ratio test can be used to choose between the zero-inflated and the traditional proportional odds model. The model, however, does have minor restrictions on the nature of the covariates that must be satisfied in order for the model to be identifiable. The method is particularly relevant for public health research such as large epidemiological surveys where more careful documentation of the reasons for response may be difficult.
One of the most common issues surrounding the analysis of clinical and utilization outcomes is what statisticians refer to as the excess zero or zero inflation problem. This occurs when an outcome is measured, e.g. counts of service utilization, and the resulting data collected contain many observations that are zero, i.e., did not use the service being measured. In the case of normal or lognormal data, there are a variety of applications where there are a considerable number of subjects with zero values, either due to the fact that they do not participate in the activity (e.g., drinking alcohol ) or that they are below a certain threshold for detection . These cases have both been dealt with through the application of a mixture model with the appropriate continuous distribution and a built-in probability of non-response, also known as a two-part model [3, 4]. These models provide a conditional test of association between the outcome and any predictor of interest, i.e., a test that removes the effect of the zero responses, in order to answer meaningful research questions .
The need for a conditional test of association may also apply to ordinal scales in which there are anchors that represent the absence of the symptom or activity, such as “none”, “never” or “normal”. A mixture distribution in this context would imply that there are patients who are “susceptible” and patients who are not. However, for discrete data the definition of “excess” zero is more complicated. The choice to be made is whether or not we wish to include some of the zeros in the conditional test of association. If it is assumed that all zeros are “true” zeros, then the second distribution in the mixture can be modified (e.g. truncated) to reflect the association with the probability of response . However, given sampling uncertainty and measurement error, the usual approach has been to split the zeros into two distinct populations: a group of subjects that do not have a response, and a group of subjects that have a response, who nonetheless exhibit zero values. Farewell and Sprott  used the term “sampling zero” for those zeros not attributed to “cure”. In the derivation for the zero inflated poisson (ZIP) model , the mixture probability was referred to as the probability of the “perfect state” and the second distribution referred to another state in which “defects are possible, but not inevitable”, allowing for some zero counts in the conditional distribution. For the ordinal case presented here, a zero measure would indicate that the symptom is not present at the time of measurement but that the subject is still susceptible to the phenomenon being measured. This pattern of response is particularly evident in the case of low incidence sequelae such as side effects or suicidal ideation. For example, it is the case that pharmacological side effects, when they are present, are usually dose related. However, there are a considerable number of subjects who will not experience these side effects at any dose, while others will not experience the side effect at their current dose at the time of the assessment. If there are sampling zeros of this nature, we would expect an underlying association between dose and severity of the side effect in those patients who are susceptible. If the symptom is measured on an ordinal scale, this would coincide with a relationship that is consistent with the proportional odds assumption. We would also expect that dose may not be related to presence or absence of the side effect, but only related to the severity of the side effect when it occurs. Thus we have two distinct reasons for the application of a mixture model if there are, in fact, different populations of patients: 1) to explicitly model the clinical source of the zero inflation, and 2) to clarify the relationship with possible predictors.
The proportional odds (PO) model, when it applies, is unique among the ordinal regression models in that it is invariant to collapsing across categories, which is often needed to summarize results. More importantly to the current application, however, is its similarity to the results of a traditional linear regression on the underlying variable in that it allows for a test of association between a predictor and the outcome variable that is not category specific. Given that most ordinal scales are constructed with an underlying variable in mind, it would be desirable to the clinician to retain the ability to perform one test of association. It is assumed that for ordinal regression, the direct results of a mixture of responders and non-responders would be a deviation from the proportional odds assumption. When the proportional odds assumption is violated, another alternative is the partial proportional odds model (PPO) . For this particular application, the constrained form of the model in which only the zero category was specified to have non-proportional odds would be the most likely analogue to the proposed model. It differs from the current model in that it does not allow for sampling zeros, and it does not incorporate the probability of response into the conditional (non-zero) distribution. However, we will consider this model further in our clinical example.
Our aim is to produce a methodology that will allow users of ordinal scale data to more accurately model the distribution of ordinal outcomes when it is assumed that not all patients are susceptible to the phenomenon being measured, and that this is the primary reason for any deviation from the proportional odds assumption. Since most applications we will consider are measures of symptoms or side effects, we will refer to the two aspects of the distribution as “incidence” and “severity” in order to simplify the discussion. Similar terms that have also been used are “occurrence” and “intensity”. The model proposed allows modeling of the range (e.g., severity) of the scale, while simultaneously modeling the presence/absence of the symptom, and allows the predictors of incidence and severity to differ.
The development of a zero inflated proportional odds model is similar in derivation to previous zero inflated models [8, 10], with the distribution component consisting of a multinomial distribution with the logit link used for the linear predictor.
Let yi be an ordinal measure, on the ith subject, i=1,…, n, with levels 0,1,…, J with a multinomial distribution:
with cumulative probabilities defined such that:
For this adaptation, the response is distributed as a mixture of two distributions, a point mass at 0 and a multinomial:
For the ordinal regression, the parameters p and (γ0,…,γJ) will be modeled via a canonical logit link. If xi denotes the full covariate vector, we can choose any subset, including xi itself, for Gi and Bi to form the linear predictors for the probabilities of the perfect state and other state, respectively. The vectors of covariate effects corresponding to the perfect state and the multinomial state will be denoted by τ and β, respectively. In order to maintain notational consistency with proportional odds models, we use the parameter θ for the intercept parameters in the ordinal regression, and we will identify the complete parameter vector for the ordinal part as η′= (θ0,θ1,…,θJ−1, β)′. The model equations will then be:
with F being the cumulative distribution function for the logit:
This results in an observed-data log-likelihood, in terms of the regression parameters, of the form
If we could observe which data came from which part of the mixture in an indicator variable, Z=(z1,…zn)′, i.e.,
where the “perfect state” refers to the point mass at zero (i.e., those subjects that do not exhibit a response), then we could construct a complete data log-likelihood:
Rearranging and simplifying, we then have:
Thus, the complete data log-likelihood is easily maximized due to the fact that τ and η can be maximized separately. It is also useful to note that the part involving η can be estimated by using appropriate weights in the fit of a traditional proportional odds model. The part of the complete-data likelihood involving τ is identical to the zero inflated poisson (ZIP) model, and can be solved using the method derived by Lambert , which involves augmenting the data with an indicator vector and then solving the resulting weighted logistic regression using the appropriate weights for each piece of the likelihood. The ZIPO models were estimated using S-PLUS® (Insightful Corporation, Seattle, WA) code adapted from the Design library and modified by the authors to fit the proposed models.
The estimation process is a function of the ordinal link we use. We will show the derivation for the logit link; the others can be derived similarly. The main problem in our estimation of the zero-inflated model is that we don’t in fact observe the zi’s. We can, however, formulate the problem with the zi’s as incomplete data and solve using the EM algorithm . The EM algorithm maximizes the log-likelihood iteratively by estimating the zi’s using the current estimates of the parameters, and then maximizing the log-likelihood with the zi’s fixed at their estimated expected values. This process is repeated until the algorithm converges.
For iteration k, estimate E[zi | yi, τ(k), η (k)] by its posterior mean zi(k) given the data y, and the current estimates of τ(k) and η(k) (see Appendix for derivation):
Given that the expected values of the zi’s are constants, the maximization of Q = EZ[lC(Y, Z | τ(k), η(k))] is, in this instance, equivalent to maximizing the augmented logistic and the weighted ordinal regression, using the current estimates of zi=zi(k) as weights.
To initialize values for the conditional parameters (in this case η), we used the standard proportional odds estimates. In order to point the algorithm towards the mixture solution, we calculated a z-score using the mean and variance of the non-zero data and used it to derive the expected proportion of zeros using the logistic cumulative distribution function., We then used the logit transform of this value as the initial τ0, with τ1 initialized at zero. For standard error estimates, we directly evaluated the observed information matrix using the appropriate derivatives of (2.1).
Traditional likelihood ratio statistics can be computed for the parameters as well as any test statistics for nested hypotheses . In addition, an hypothesis of the form H0: p=0, can be tested using the method of Self and Liang  for hypothesis tests on the boundary of the parameter space. This provides a likelihood ratio test of the benefit of the zero inflated model over the proportional odds model, i.e., for the presence of “nonresponders”.
It can be shown that a single binary predictor for both portions of the model will result in a non-unique solution. However, simple bounds on the predictors can ensure identifiability of the proposed model. Given that the sufficient statistics for the multinomial probabilities are the respective counts for that anchor, and these probabilities are mutually exclusive, the only difference in estimation from the regular multinomial problem is for the probability of a zero response. Given that the marginal distribution of any particular response category is Bernoulli, the subject-specific probability of a zero response will then be a mixture of two Bernoulli distributions. It has been shown that mixtures of two binomials (parameters n and p) are only identifiable when n > 1 (Teicher ). However, Follmann and Lambert  determined that finite mixtures of logistic regressions with a Bernoulli response can be identified, as long as the number of unique predictor combinations (covariates) is sufficient for identifiability. Specifically, they show that the number of components of the mixture, c, is constrained by
where N1 is the number of unique observed values of the covariate vector. Thus, a single binary covariate will only support one component distribution. Given Follmann and Lambert’s theorem (2.5), a mixture of two Bernoulli distributions is identifiable if the number of unique combinations of the covariate vector is at least seven (i.e., seven is sufficient, but not necessary). Given the nature of most applications, this restriction will rarely be prohibitive. If the focus of inference is a comparison of groups, one can simply incorporate a continuous but nonsignificant predictor to ensure identifiability that would have little effect on the group comparison.
Because the multinomial distribution has mutually exclusive categories, it is a possibility that the estimate for the conditional probability of a zero response (γ0) can approach zero, resulting in a corresponding regression coefficient (θ0) which tends to infinity. This would essentially result in a solution in which all the zeros were classified as “non-responders” and the non-zero data would be modeled using the resulting conditional probabilities. To provide an empirical estimate, we decreased the convergence criterion until there was clear separation between the valid solutions and those on the boundary. Then, in order to determine the extent of the impact of this boundary solution, we tracked the percentage of occurrences of this type of solution by considering any fitted model with θ0 > 5.6 as indicating a dataset with the solution on the boundary.
In order to simulate data with the expected underlying mechanism, we began with a choice of regression coefficients (τ,η) and proceeded to determine the resulting binomial and multinomial probabilities associated with a fixed covariate x (p, γ0,γ1,…γJ). We assumed the same full linear predictor (i.e., B=G=x) for both parts of the mixture, although this is not necessary for estimation. We then generated the ordinal variate, y, using the following process:
The simulations were designed to test the asymptotic properties of the maximum likelihood estimates in finite sample sizes (50,100,200,500). We used fixed parameters as indicated below, and evaluated the following measures in S=2000 simulations (for definitions see the Appendix):
We created an example where the baseline (x=0) probability of nonresponse was 0.18, and the baseline cumulative probabilities for the ordinal categories 0,1,2,3 were equal to 0.1, 0.3, 0.6, 1, which corresponds to multinomial probabilities of 0.1, 0.2, 0.3, 0.4. These correspond to parameter values of τ0= −1.5 and θ=(2.1972,0.8473, −0.8473, −2.1972). For the relationship with covariates we chose values of τ1=2.0, β=2.0, and the predictor x was given equally spaced values between 0 and 1. This provided a range of the probability of response (1−p) between 0.18 and 0.62 and the conditional probability of nonresponse (γ0) between 0.015 and 0.1. The appropriate probabilities were calculated for each subject based on their linear predictor value.
Some simulation data sets could not be used for assessment of the asymptotic properties due to particular properties of the data set. These properties are defined below and the frequency of their occurrence is documented in Table 1.
Because the proportional odds model has intercepts for each category of the scale, we eliminated simulated datasets that did not have all possible values of the dependent variable (e.g., 0,1,2,3,4), because the number of intercepts, as well as the values of the intercepts, would not correspond to the proposed values. However, this would not be a problem in practice, as the model fits only as many intercepts as are necessary for the data.
In the evaluation of the ZIP model, Lambert discusses briefly that some data sets exhibit singular observed information and that removing these instances from the simulation results improves the estimation of the parameters considerably. In order to further clarify this source of error for the proposed model, we tracked the occurrence of singularity of observed information in the simulations. Singularity is clearly a function of the sample size [Table 1], with larger sample sizes less likely to exhibit singular information.
The proportion of ordinal solutions on the boundary of the parameter space decreases as the sample size increases, as expected [Table 1]. Given that the true parameter value of θ0 was set to be in the interval (0.015–0.1) for this set of simulations, it is likely that the low values of this conditional probability were the cause of this numerical problem, as the algorithm must choose which zeros are from which distribution. In those cases in which the probability of a sampling zero is low, either higher sample sizes can be used, such as with the clinical example presented here, or a two-part model  should be considered.
Assessment of the valid simulations revealed that while estimates of the linear predictor parameters are variable [Table 2], the nature of the logistic function is such that the convergence of the underlying probabilities was reasonably accurate [Table 3]. As with the ZIP model  the linear predictor for the probability of a “perfect” zero is not estimated well. We assume that this is partly due to the indeterminate nature of assigning the zeros to the two distributions, and perhaps also due to the over-parameterization of what is essentially a point mass. We suspect that this is a property of all discrete zero inflated models, and possibly the continuous two-part models as well. Unfortunately, simulations of this nature are not available for the majority of other zero-inflated models so we have no way of examining this proposition without additional investigation. In this multinomial extension of the problem, it appears that this over-parameterization also effects the estimation of the probability of a sampling zero, which is consistently underestimated.
It is encouraging, however, that the covariate parameters have the desired qualities. Wald confidence intervals were surprisingly accurate for the ordinal parameter at all sample sizes [Table 4]. In contrast, the corresponding confidence intervals for the predictor of non-response (τ1) are somewhat conservative for the smaller sample sizes (n< 200). This is an interesting result and requires further investigation using other zero-inflated models to determine if an adjustment could be made that would correct this problem. Although the proposed method is much more suitable for larger (n ≥ 200) sample sizes, analyses of smaller data sets using this technique could still provide more accurate results if it is assumed that there is a mixture of subjects in the sample.
The Alcohol Use Disorders Identification Test (AUDIT) is a frequently used measure of the level of consumption as well as some of the negative consequences resulting from the consumption of alcohol . Recently, a focus of primary care services research has been to identify hazardous drinkers, i.e., those patients who drink at a level that may cause health and legal repercussions but have not yet developed clinically diagnosable alcoholism . In studying these patients, many of the traditional measures used for alcoholics have been used despite the fact that some items may not directly apply to this population. Many primary care patients do not drink at all, resulting in an inflation of “never” responses. This provides a setting for the mixture model proposed in order to correctly determine the predictors of severity. A similar approach was used by Olsen and Schafer , who used a continuous mixture model to assess predictors of alcohol consumption in teenagers. The data used for the example come from a study of Early Lifestyles Management (ELM), which identified possible hazardous drinkers through the use of a screening evaluation of alcohol consumption in the waiting room of 12 primary care clinics in the Pittsburgh area . For the main predictor variable, we chose gender, which is a general demographic also collected on a large number of screens. There is considerable evidence of gender differences in consumption [20, 21], with males commonly reporting more drinks per week than females.
We included an additional covariate, age group, for two purposes: 1) to force the model to be identifiable and 2) because age was a significant confounder as indicated by its association with both gender and consumption. For our sample, the males were considerably older than the females (grouped means of 55 and 45 years, respectively) and those who reported never drinking alcohol tended to be older than those who drink (grouped means of 54 vs 48 years). Unfortunately, due to the brief nature of the survey, age was not a continuous variable but a categorical variable with nine categories. We chose to keep age group as a continuous variable in the analysis, however, to simplify the interpretation and keep the focus on our primary variable of interest, gender. We did this by substituting the midpoint age value for each category and treating the resulting variable as continuous. The corresponding frequency distribution for the 11,492 completed screening data forms with valid age group and gender is shown in Figure 1. It can be seen that a large proportion of the subjects chose the “never” option. However, the next anchor, monthly or less, may be more frequent than some occasional drinkers were willing to admit, making it possible that not all zeros were true abstainers. In addition, we have sufficient reason to believe that the predictors of abstinence may be quite different from predictors of consumption in drinkers. This provides initial justification for use of the new model.
Although the observed proportion of non-drinkers is nearly identical in male and female patients [40.3% vs. 40.5%, χ2 =0.041, p=0.840], the test of association between gender and consumption level is highly significant in the ordinal regression, indicating that males consume more alcohol [Table 5]. This apparent contradiction leads us to pursue a different model from a clinical perspective. One way of reconciling the discrepant results is if we consider abstinence as an additional factor. Our clinical objective would then be to determine if there is a gender difference in prevalence of abstinence and subsequently if there is a gender difference in consumption in those who are not abstinent. The proposed model tests this explicitly, as stated in the introduction; however we will first consider possible constraints for the PPO model that might allow us to incorporate abstinence in a similar way. If we fit equal slopes to all categories but the first (zero), we get the results in Table 5. The interpretation of the gender effect for this model is similar to the PO model in that males have a lower probability of being in the “never” category only the difference is less pronounced for the constrained PPO model due to the significant offset to the parameter in that group. However, the data are not consistent with this result as the proportions in the “never” category are nearly identical.
If we incorporate the concept of abstinence as an additional source of “never” responses that is separate from the level of consumption, the results indicate that males have a slightly higher proportion of abstainers than females [Table 6], which could then account for the fact that there is no gender difference in the observed proportions in the “never” group. Coincidentally, both information criteria (AIC and BIC) as well as a comparison of the likelihoods show a preference for the zero-inflated model over the PO and PPO models for these data. A likelihood ratio test indicating the improvement with the mixture can be constructed using a boundary test as discussed by Self and Liang . This test is based on a reference distribution of a 50:50 percent mixture of χ2 distributions with 0 and 1 degrees of freedom respectively. This result is highly significant [χ2=428.94, df=0.5, p < 0.0005] although in the case of the current model, it is somewhat less informative as it cannot indicate if the lack of fit is due to the need for a mixture or the assumption of proportional odds.
Due to the nature of the cumulative logit models, both the PO and the PPO model predict higher incidence of “never” responses among females due to the fact that males drink more at the higher levels of consumption [Table 6]. However with the mixture model, the data indicate that for these data males are abstainers in higher proportions, but when they do drink, they also have higher levels of consumption [Table 6]. This interpretation of the current sample fits better with existing data on gender and alcohol use. The National Health Interview Survey of drinking  differentiated between total abstainers and what they called “lifetime infrequent drinkers”, which would correspond to the two populations of zeros we suspect. However, the national data shows that females have higher proportions in both groups, thus we would not expect to see equal proportions of “never” responses across gender. The only group of non-drinkers with higher proportions in males is “previous drinkers”, or those who abstain most likely due to prior issues with drinking. Thus, the current data could be explained if there were a significant proportion of previous drinkers in the current sample. It turns out that the recruitment strategy for this study focused on areas in which high levels of drinking were expected and thus, may also have higher incidence of previous drinkers.
In this article, we have adapted the use of mixture models for zero inflation in categorical data such as the ZIP and zero inflated binomial (ZIB)  models to include ordinal level variables fit using the multinomial distribution. The extension to the multinomial distribution requires a few more restrictions on the nature of the predictors in the model; however, these restrictions can easily be met in most modeling applications. The model, however, does tend to require larger sample sizes (n ≥ 200) due to the number of additional parameters involved. This should not be a problem for large survey applications, but for smaller samples any opportunities to collapse across categories should be considered carefully in order reduce the number of parameters under consideration. Although our model can potentially be used for any type of ordinal variable, it appears to apply best to ordinal scales where there is an anchor that represents a lack of response so that there is a justification for the existence of true nonresponders and thus can be easily interpreted from a clinical perspective.
This work was funded in part by a seed grant (PI MK) from the VISN-4 Mental Illness Research, Education, and Clinical Center (MIRECC), Department of Veterans Affairs, Pittsburgh, PA and Public Health Service Grants No. U10CA-69974 and U10CA-69651 from the National Cancer Institute and the Department of Health and Human Services.
The authors would like to thank Steve A. Maisto, PhD, Syracuse University, and Joseph Conigliaro, MD, University of Kentucky for kindly providing the alcohol screening data. This work was funded in part by a seed grant (PI MK) from the VISN-4 Mental Illness Research, Education, and Clinical Center (MIRECC), Department of Veterans Affairs, Pittsburgh, PA and by Public Health Service Grants No. U10CA-69974 and U10CA-69651 from the National Cancer Institute and the Department of Health and Human Services..
S=number of simulations; β represents the parameter of interest; vjj is the appropriate diagonal of the observed information matrix; I is the generic indicator function
Sample mean and variance were defined as
Averaged standard deviation estimate from the observed information
Averaged tail probabilities, and overall confidence interval coverage (normal theory):
Given that zi=0 when yi ≠ 0 (because (Pr[perfect state=0]) the calculation is only for when yi=0
the calculation then equals
then zi is