The recently rekindled debate on health care reform in the United States and interest in how to pay for an expanded national program of care provide motivation to understand the relative influence of demographics, patient-level illness burden and provider-level variability in practice patterns on these costs. With advanced data collection and management techniques, medical cost data are now recorded routinely by hospitals, disease registries, and health insurance companies. In combination with the development and application of state-of-the-art statistical methods of analysis, such widely available data can now be used to enhance our understanding of these various influences.
Medical cost data are frequently right-skewed, involve a substantial proportion of zero values, and may exhibit heteroscedasticity. For example, healthy people may incur no costs in a given year, while otherwise similar patients having a particular disease (e.g., cancer) may incur medical costs that increase tremendously with disease severity. Both observations suggest that no simple parametric distribution is suitable for describing such “semi-continuous” data.
Recognition of the need to account for the semi-continuous nature of medical cost data arguably began with the development of the “two-part” model (Cragg 1971; Manning et al., 1981). Specifically, denote the cost of a given subject by Y; then, a two-part model for the probability distribution of Y consists of (i) modeling the probability that Y > 0 using, say, a logistic or probit model (“Part I”); and (ii) separately modeling the probability distribution of Y | Y > 0 (“Part II”). Numerous options exist for Part II; a convenient and common choice is to assume that [log Y | Y > 0] follows a linear regression model with normal errors (Manning et al. 1983; Diehr et al. 1999).
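To make the structure of the two-part model concrete, the following is a minimal sketch in Python, not the model fitted in this paper. It assumes an intercept-only specification (no covariates), so that Part I reduces to the observed proportion of positive costs and Part II to the mean and variance of log Y among subjects with Y > 0. The function names `fit_two_part` and `two_part_mean` are hypothetical, the toy data are made up, and the implied mean of Y relies on the normal-errors assumption for [log Y | Y > 0].

```python
import math


def fit_two_part(costs):
    """Intercept-only two-part fit (illustrative assumption: no covariates).

    Part I : p_hat = proportion of subjects with Y > 0.
    Part II: mu_hat, sigma2_hat = maximum likelihood estimates of the mean
             and variance of log Y among subjects with Y > 0.
    """
    positive = [y for y in costs if y > 0]
    if not costs or not positive:
        raise ValueError("need at least one observation and one positive cost")
    p_hat = len(positive) / len(costs)
    logs = [math.log(y) for y in positive]
    mu_hat = sum(logs) / len(logs)
    sigma2_hat = sum((v - mu_hat) ** 2 for v in logs) / len(logs)
    return p_hat, mu_hat, sigma2_hat


def two_part_mean(p_hat, mu_hat, sigma2_hat):
    """E[Y] = P(Y > 0) * E[Y | Y > 0], using the lognormal mean for Part II."""
    return p_hat * math.exp(mu_hat + 0.5 * sigma2_hat)


# Toy data (hypothetical): two zero-cost and three positive-cost subjects.
p_hat, mu_hat, sigma2_hat = fit_two_part([0.0, 0.0, 10.0, 100.0, 1000.0])
print(p_hat, mu_hat, sigma2_hat, two_part_mean(p_hat, mu_hat, sigma2_hat))
```

With covariates, Part I would typically be replaced by a logistic or probit regression and Part II by a linear regression on the log scale; the factorization of the mean into the two parts is unchanged.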
More recently, the two-part model has also been extended to deal with correlated outcome data. For example, Lu, Lin and Shi (2004) proposed a “marginal” version of this model, utilizing generalized estimating equations (GEE) for the purposes of estimation and inference. Olsen and Schafer (2001) and Tooze, Grunwald and Jones (2002) instead extended the two-part model to account for correlation through the introduction of random effects; see also Zhang, Strawderman, Cowen and Wells (2006), who considered a related Bayesian formulation of this model. In the present context, such models permit one to account for situations in which costs incurred by subjects may be related to each other, as might be expected for patients served by the same physician or treated within the same hospital, or in cases where longitudinal cost information is available on a given patient. A typical random effects formulation of the two-part model employs a generalized linear mixed model (GLMM) for the binary outcomes Y > 0 in Part I and a linear mixed model for the positive continuous outcomes [log Y | Y > 0] in Part II. The two parts of the model are then linked together by imposing a correlation structure on the random effects that appear in Part I and Part II. Given the pair of random effects and all covariates, the propensity (i.e., Y > 0) is usually assumed to be independent of the conditional response level (i.e., [log Y | Y > 0]); averaging over the random effects thus induces two forms of correlation. In the context of medical cost data with clusters of patients served by different physicians, the first form is standard and captures (i) the marginal correlation between the propensity to incur costs among patients served by the same physician; and (ii) the marginal correlation between the level of costs actually incurred by patients served by the same physician. The second form is more specific to the two-part model and captures the relationship between the physician-specific propensity to incur cost and the physician-specific level of actual costs incurred. Such “cross-part” correlation is of interest; for example, in the case of pharmacy cost data, it is interesting to ask whether a physician who has a higher probability of prescribing medication for his patients also tends to prescribe more expensive medications for these patients (Zhang et al., 2006). As shown in Albert (2005), the parameters of a marginal two-part model, such as that proposed in Lu et al. (2004), can depend strongly on the underlying degree of cross-part correlation, resulting in a misleading assessment of the regression effects. The specification of a random effects model avoids this problem, allowing one to estimate interpretable regression effects as well as characterize the marginal effects of each variable on the response variable.
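The cross-part correlation described above can be illustrated by simulation. The sketch below is a hedged illustration rather than the model proposed in this paper: it assumes a logistic Part I, a lognormal Part II, and physician-level random intercepts (a, b) drawn from a bivariate normal distribution with correlation rho. All parameter values and the names `simulate_cross_part` and `pearson` are hypothetical choices made for illustration only.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_cross_part(rho, n_physicians=200, n_patients=100, seed=1):
    """Correlation between physician-level propensity and cost level.

    Returns the correlation, across physicians, between the observed
    proportion of patients with positive cost and the mean log cost
    among those patients.
    """
    rng = random.Random(seed)
    alpha, beta = 1.0, 4.0            # fixed intercepts for Parts I and II (assumed)
    sd_a, sd_b, sd_e = 1.0, 0.5, 1.0  # random-effect and residual SDs (assumed)
    prop_positive, mean_log_cost = [], []
    for _ in range(n_physicians):
        # Correlated random intercepts (a, b) with corr(a, b) = rho.
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        a = sd_a * z1
        b = sd_b * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
        p = 1.0 / (1.0 + math.exp(-(alpha + a)))  # Part I: P(Y > 0 | a)
        # Part II: log Y | Y > 0, b ~ Normal(beta + b, sd_e^2); given (a, b),
        # the propensity and the cost level are independent.
        logs = [rng.gauss(beta + b, sd_e)
                for _ in range(n_patients) if rng.random() < p]
        if logs:  # skip the rare physician with no positive-cost patients
            prop_positive.append(len(logs) / n_patients)
            mean_log_cost.append(sum(logs) / len(logs))
    return pearson(prop_positive, mean_log_cost)


print(simulate_cross_part(rho=0.8), simulate_cross_part(rho=0.0))
```

Under these assumptions, a positive rho should show up as a clearly positive physician-level association between the propensity to incur cost and the level of cost incurred, while rho = 0 should yield an association near zero.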
One important disadvantage of the typical formulation of the two-part model relates to the use of a transformation in Part II of the model. The resulting difficulties of retransformation are only compounded in the presence of random effects. Duan (1983) proposed the “smearing” method for estimating the mean of the untransformed response Y after fitting a linear regression model to a transformed response (e.g., log Y). However, if the transformation does not stabilize the variance, heteroscedasticity may be present (Duan et al. 1983; Manning 1998; Zhou, Stroup, and Tierney 2001; Zhou, Lin and Johnson, 2008). In this case, Duan’s smearing estimate cannot be employed and the use of ordinary least squares (OLS) leads to a biased estimate of the covariate effect on the untransformed mean of Y (e.g., Manning, 1998; Mullahy, 1998; Manning and Mullahy, 2001). In response to these and other concerns, Manning, Basu, and Mullahy (2005, hereafter MBM) proposed to use the generalized gamma distribution for modeling the probability distribution of Y | Y > 0. This distribution is useful in that it contains the standard gamma, inverse gamma, Weibull and lognormal distributions as special cases. As a result, standard testing methods for nested hypotheses may be used in evaluating the fit of these simpler models; in addition, the model provides greater flexibility in cases where none of these simpler models permits an adequate description of the data.
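As a concrete illustration of the nesting property, the sketch below evaluates a generalized gamma density and checks numerically that it reduces to simpler families. It assumes the three-parameter (mu, sigma, kappa) form commonly used with this distribution, under which kappa = 1 gives the Weibull, kappa = sigma the standard gamma, kappa = -sigma the inverse gamma, and kappa = 0 the lognormal as a limiting case. This parameterization is an assumption made here and may differ from the one adopted elsewhere in the paper; all function names are hypothetical.

```python
import math


def gen_gamma_pdf(y, mu, sigma, kappa):
    """Generalized gamma density in an assumed (mu, sigma, kappa) form."""
    if y <= 0 or sigma <= 0:
        raise ValueError("requires y > 0 and sigma > 0")
    if kappa == 0.0:
        # Limiting case: lognormal density.
        z = (math.log(y) - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * y * math.sqrt(2.0 * math.pi))
    gam = 1.0 / (kappa * kappa)
    z = math.copysign(1.0, kappa) * (math.log(y) - mu) / sigma
    u = gam * math.exp(abs(kappa) * z)
    # Work on the log scale to avoid overflow in gam ** gam.
    log_f = (gam * math.log(gam) - math.log(sigma * y) - 0.5 * math.log(gam)
             - math.lgamma(gam) + z * math.sqrt(gam) - u)
    return math.exp(log_f)


def weibull_pdf(y, shape, scale):
    return (shape / scale) * (y / scale) ** (shape - 1.0) * math.exp(-(y / scale) ** shape)


def gamma_pdf(y, shape, rate):
    return math.exp(shape * math.log(rate) + (shape - 1.0) * math.log(y)
                    - rate * y - math.lgamma(shape))


def inv_gamma_pdf(y, shape, scale):
    return math.exp(shape * math.log(scale) - (shape + 1.0) * math.log(y)
                    - scale / y - math.lgamma(shape))


mu, sigma = 1.2, 0.7          # arbitrary illustrative values
g = 1.0 / sigma ** 2          # implied gamma / inverse gamma shape
for y in (0.5, 2.0, 10.0):
    # Each pair of numbers on a row should agree up to rounding error.
    print(y,
          gen_gamma_pdf(y, mu, sigma, 1.0), weibull_pdf(y, 1.0 / sigma, math.exp(mu)),
          gen_gamma_pdf(y, mu, sigma, sigma), gamma_pdf(y, g, g * math.exp(-mu)),
          gen_gamma_pdf(y, mu, sigma, -sigma), inv_gamma_pdf(y, g, g * math.exp(mu)))
```

Because the simpler families correspond to restrictions on kappa (and, for the gamma cases, on its relationship to sigma), a test of each restriction against the unrestricted generalized gamma fit is a comparison of nested models.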
In this paper, we propose a new two-part model that incorporates correlated random effects. Analogously to MBM, our model uses a generalized gamma GLMM for Y | Y > 0 in Part II. Similarly to MBM and also Zhou et al. (2001), we further permit the scale parameter of this distribution to depend on covariate information, allowing for subject-level heteroscedasticity in the cost data. The proposed model is both flexible and more general than those currently available in the literature, being capable of dealing with clustered data, heteroscedasticity, and cross-part correlation. Maximum likelihood estimation for the proposed model is easily implemented within SAS Proc NLMIXED (Littell et al. 2006). Further contributions of this paper include a clear demonstration of the effects of heteroscedasticity in Part II of the two-part model on the interpretation and bias of estimated covariate effects and a detailed discussion of the relationship between the proposed model and a strongly associated class of marginal two-part models.
We illustrate the use of this model by analyzing pharmacy cost data from a mid-western U.S. managed care organization (MCO) on 56,245 patients served by 239 primary care physicians (PCPs). Important features of the data include the clustering of patients within PCPs, the substantial proportion (i.e., 26%) of zero cost patients, and the highly skewed nature of the cost data among those patients with non-zero cost during the year under study (i.e., a mean and median cost of $497.95 and $87.63, respectively). A similar dataset is analyzed in Zhang et al. (2006) using a Bayesian two-part random effects model, with a focus on profiling the physician contribution to patient pharmacy costs. In the present paper, the primary interest lies more in characterizing the patient-level factors (i.e., covariates) that influence the pharmaceutical expenditures of adult patients, accounting for the possibility of both a physician effect and heteroscedasticity.
We remark here that the two-part model has been the subject of some controversy in the econometrics literature, where it has been frequently compared and contrasted with the sample selection model of Heckman (1976). The most relevant comparison is perhaps with the adjusted tobit model of van de Ven and van Praag (1981), a variant of the sample selection model in which (i) ‘censored’ observations are actually observed as true zeros instead of missing data; and, (ii) a correlation structure is imposed on the two possible counterfactual responses (i.e., potential outcomes) at the level of the subject. Duan, Manning, Morris and Newhouse (1984) demonstrated that these two model classes are in fact distinct; Manning, Duan and Rogers (1987) provided Monte Carlo evidence to show the sensitivity of the adjusted tobit model to exclusion restrictions. However, Leung and Yu (1996) later presented alternative Monte Carlo evidence that demonstrated the results of Manning et al. (1987) were inherently biased against the selection model, particularly so when model parameters are estimated via limited information maximum likelihood. Leung and Yu (1996) concluded that the two-part and sample selection model classes are in general designed to answer distinct inferential questions and that both are useful in their respective contexts.
Despite appearances, the correlation structure induced by the two-part model with correlated random effects is distinct from that imposed by the adjusted tobit model of van de Ven and van Praag (1981). As indicated above, the former asserts that the correlation structure exists at the level of the physician; it is not imposed at the level of the subject, as is the case with the adjusted tobit model. Moreover, unlike the sample selection model, there is information in the observed data (i.e., the cluster structure) that allows one to identify this correlation structure regardless of whether exclusion restrictions are imposed. For these reasons, the relationship between two-part and sample selection models is not considered further in this paper.
The rest of the paper is organized as follows. In Section 2, we review the generalized gamma distribution and its properties and then introduce the proposed two-part model. In Section 3, we give the relevant likelihood function and describe how estimation can be carried out in SAS; example SAS code is provided in the Appendix. In Section 4, simulation is used to assess the performance of the estimation method. In Section 5, the proposed model is applied to the dataset described above. Concluding remarks are given in Section 6.