The applied literature on propensity scores has often cited the c-statistic as a measure of the ability of the propensity score to control confounding. However, a high c-statistic in the propensity model is neither necessary nor sufficient for control of confounding. Moreover, use of the c-statistic as a guide in constructing propensity scores may result in less overlap in propensity scores between treated and untreated subjects; this may require the analyst to restrict populations for inference. Such restrictions may reduce precision of estimates and change the population to which the estimate applies. Variable selection based on prior subject matter knowledge, empirical observation, and sensitivity analysis is preferable and avoids many of these problems.
The use of propensity scores to reduce confounding bias in non-experimental studies has increased dramatically1 since their introduction by Rosenbaum and Rubin.2 The propensity score is the predicted probability of treatment (alternatively, exposure) conditional on selected covariates, and is used as part of a two-stage analytic process. First, the propensity score is estimated from the data, typically1 (although not always3–4) using logistic regression. Second, the effect of the treatment on the outcome is assessed in persons with the same estimated propensity score,5–7 for example by matching on the propensity score,2 by stratification,2,8 or by weighting by a function of the propensity score.9
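To make the weighting approach concrete, the sketch below computes inverse-probability-of-treatment weights from already-estimated propensity scores. The function name and the example scores are hypothetical, and IPTW targeting the average treatment effect is only one of several possible weighting schemes.

```python
def iptw_weights(propensity, treated):
    """Inverse-probability-of-treatment weights (one common choice,
    targeting the average treatment effect): treated subjects are
    weighted by 1/ps, untreated subjects by 1/(1 - ps)."""
    return [1.0 / p if t else 1.0 / (1.0 - p)
            for p, t in zip(propensity, treated)]

# Hypothetical estimated propensity scores and treatment indicators
ps = [0.8, 0.5, 0.2]
tx = [True, False, False]
print(iptw_weights(ps, tx))  # → approximately [1.25, 2.0, 1.25]
```

Note how a treated subject with a low propensity score (or an untreated subject with a high one) receives a large weight: such subjects carry the information about the "counterfactual" arm, which is one reason non-overlap of the score distributions is so consequential.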
Treated and untreated subjects with the same propensity score have measured baseline covariates that come from the same distribution. Assuming no unmeasured confounders, conditioning on the propensity score allows one to obtain an unbiased estimate of the average treatment effect at that value of the propensity score.
In this commentary, we discuss the estimation of the propensity score itself and the use (and misuse) of the c-statistic in this process. The widespread reporting of the c-statistic from the propensity score model may suggest an underlying misconception about the goal of the propensity score.
As this work comprises only methodological commentary, ethics approval was not sought.
The purpose of the propensity score is to eliminate confounding bias, which occurs when risk factors for the outcome are unequally (alternatively, non-randomly) distributed among treatment groups. Thus, to control for confounding using propensity scores, the epidemiologist estimates a score which attempts to achieve an equal distribution of observed risk factors for the outcome between treated and untreated subjects: that is, to balance those risk factors between treatment groups. The goal of the propensity score is not to predict treatment as well as possible. Balancing covariates so as to control confounding and predicting treatment are separate goals that require different considerations for variable selection. Thus, for example, a propensity score should not attempt to balance covariates unrelated to the outcome (e.g. instruments), as these variables do not lead to confounding.10–11 Nor is the use of a single propensity score model to examine the association of a single treatment with multiple outcomes12–14 necessarily advisable.
Epidemiologists tend to view propensity scores (not unjustly) as an efficient means to control for a large number of covariates, possibly even when an outcome is rare.15–16 Perhaps as a result, the literature on selection of covariates for inclusion in propensity scores is less well developed than the literature on selection of covariates for inclusion in outcome regression analysis. Nonetheless, several guidelines have emerged. Leaving a confounder out of a propensity score model will result in bias in the final effect estimate similar (in magnitude and direction) to the bias observed when this confounder is left out of an outcome regression model.11 The inclusion of covariates in the propensity score which are not associated with the disease outcome (including instrumental variables) decreases precision of the treatment effect estimate without any advantage with respect to bias10 and may increase bias in the presence of unmeasured confounders.17–18 The inclusion of risk factors for the outcome that are not associated with treatment increases precision of the treatment effect estimate.10 Although not specifically assessed for propensity scores, the inclusion of colliders in a model can lead to collider stratification bias.19–20 These issues are not unique to propensity score models, although the instrumental variable issue will tend to be more relevant to propensity scores if they are misunderstood as treatment prediction scores. Given these guidelines, does the use of the c-statistic aid epidemiologists in creating propensity scores which achieve their goal: namely, eliminating bias due to measured confounders?
The c, or concordance, statistic is a measure of the discriminatory power of a predictive model, and is equivalent to the area under the receiver operating characteristic curve.21–22 For a model for a dichotomous variable (in the propensity score setting, a treatment), the c-statistic is calculated as the proportion of all possible pairs of subjects comprising one treated subject (A) and one untreated subject (B) in which the predicted probability of treatment is higher in the treated subject (A) than in the untreated subject (B).22 The c-statistic takes on values from 0.5 (classification no better than a coin flip) to 1.0 (perfect classification), and can be calculated for any method that generates predicted values, including logistic regression and machine-learning classifiers.3
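The pairwise definition above can be computed directly; a minimal sketch follows, in which the function name and the example predicted probabilities are illustrative, and ties count as half per the usual AUC convention.

```python
def c_statistic(scores_treated, scores_untreated):
    """Concordance statistic: the fraction of (treated, untreated)
    pairs in which the treated subject has the higher predicted
    probability of treatment. Tied pairs count as one half."""
    concordant = 0.0
    for a in scores_treated:
        for b in scores_untreated:
            if a > b:
                concordant += 1.0
            elif a == b:
                concordant += 0.5
    return concordant / (len(scores_treated) * len(scores_untreated))

# Hypothetical predicted treatment probabilities from a fitted model
print(c_statistic([0.9, 0.7, 0.6], [0.4, 0.5, 0.8]))  # 7 of 9 pairs concordant
```

This O(n²) form is only practical for small samples, but it makes the definition transparent; standard software computes the equivalent area under the ROC curve more efficiently.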
The c-statistic is often cited as a measure of the “fit” of a propensity score model; that is, the ability of the model to predict treatment status using observed covariates. Across two large reviews of the propensity score literature, 91 of 224 articles (41%) reported a c-statistic or an equivalent.1,23 In each review, several studies reported c-statistics greater than 0.90, indicating very good ability of the propensity score model to predict treatment status. Indeed, the frequent reporting of the c-statistic suggests that many investigators view propensity score estimation as a predictive model for the treatment. Not discussed is whether a high (or low) c-statistic in a propensity score model gives us any information on whether that propensity score model achieves its goal.
If the c-statistic is used as a guide for variable selection into a propensity score model, it may lead to the inclusion of useless or even harmful variables in that model. In particular, covariates strongly related to treatment but unrelated to the outcome will increase the c-statistic and thus be preferentially included in the model; but the inclusion of such variables will lead to distributions of propensity scores with relatively little overlap between the treated and the untreated.23 Because the treatment-outcome effect is estimated in persons with the same propensity score, data that fall outside a common range of the propensity score distributions in the treated and the untreated are typically lost for the second stage of a propensity score analysis: either because these individuals cannot be matched, or because they are specifically excluded from further analysis.
We exclude subjects in the non-overlapping tails of the propensity score distribution because treatment effect cannot be estimated without variation of treatment given the propensity score. Formally, positivity is violated in these subjects.24 Positivity requires that there are both treated and untreated subjects at every level of all covariates under consideration, and is one of the key assumptions for causal inference.24–25
Propensity score distribution overlap is often assessed using treatment-stratified histograms or kernel-smoothed density estimates. If the propensity score density estimates do not overlap, it is likely that there is non-positivity for some combination of covariates. (However, positivity is not guaranteed even if the propensity score distributions fully overlap; smoothed curves may obscure regions of non-positivity.) Identification of populations that are never or always treated may be one of the main advantages of propensity scores, and regions of non-overlap are often trimmed from the analysis.16 Trimming the propensity score distributions reduces sample size and thus precision, and also changes the population in which inference is being made in complex ways. Consistent with this, Austin et al. found that as the c-statistic (area under the ROC curve) increased, the number of matched pairs available for analysis decreased.26 Creating unnecessary non-overlap by including unnecessary covariates (i.e. non-confounders) in propensity score models should be avoided on both counts.
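Trimming to the region of overlap can be done in several ways; one of the simplest restricts analysis to the range where the treated and untreated score distributions overlap. The sketch below uses that min/max convention; the function name is hypothetical, and percentile-based rules are also common in practice.

```python
def trim_to_common_support(ps_treated, ps_untreated):
    """Keep only subjects whose estimated propensity score lies in
    the overlap of the treated and untreated score ranges (the
    'common support'). Subjects outside it are dropped, which costs
    sample size and changes the population of inference."""
    lo = max(min(ps_treated), min(ps_untreated))
    hi = min(max(ps_treated), max(ps_untreated))
    kept_t = [p for p in ps_treated if lo <= p <= hi]
    kept_u = [p for p in ps_untreated if lo <= p <= hi]
    return kept_t, kept_u

# Hypothetical scores: the untreated subject at 0.1 falls below the
# lowest treated score (0.2) and is trimmed away
t, u = trim_to_common_support([0.2, 0.5, 0.9], [0.1, 0.3, 0.6])
print(t, u)
```

The wider the gap between the two distributions (e.g. after including a strong instrument), the more subjects such a rule discards, which is precisely the precision and generalizability cost described above.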
Using a c-statistic to guide propensity score modeling is therefore ill-advised. Perhaps more to the point, a high c-statistic in the propensity score model is neither necessary nor sufficient for the control of confounding. Imagine a propensity score estimated in a randomized trial in which all risk factors for the outcome are perfectly balanced between treatment arms. The propensity score model built with these risk factors will have a c-statistic of 0.5: risk factors do not help predict the treatment assignment. But given perfect balance, there will be no confounding bias; thus a high c-statistic is not necessary. Conversely, a high c-statistic can be achieved by the inclusion of a strong instrument, independent of all confounders; thus a high c-statistic is not sufficient.
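The "not necessary" half of this argument is easy to check by simulation: when treatment is randomized, even a strong risk factor for the outcome yields a c-statistic near 0.5, because the ranking of subjects by that risk factor carries no information about treatment assignment. A small sketch (sample size and seed are arbitrary):

```python
import random

random.seed(0)
n = 2000
# A risk factor for the outcome, generated independently of a
# randomized (coin-flip) treatment assignment
x = [random.gauss(0.0, 1.0) for _ in range(n)]
t = [random.random() < 0.5 for _ in range(n)]

# Any propensity model monotone in x produces the same c-statistic,
# since the c-statistic depends only on the ranking of the scores
treated = [xi for xi, ti in zip(x, t) if ti]
untreated = [xi for xi, ti in zip(x, t) if not ti]

concordant = 0.0
for a in treated:
    for b in untreated:
        concordant += 1.0 if a > b else (0.5 if a == b else 0.0)
c = concordant / (len(treated) * len(untreated))
print(round(c, 2))  # close to 0.5: x does not predict a randomized treatment
```

Yet in this trial the risk factor is (in expectation) balanced between arms, so there is no confounding to remove despite the uninformative c-statistic.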
Correspondingly, Weitzen et al. found in a simulation study that the c-statistic “had no relationship with residual confounding in…treatment effect estimates.”27 Austin et al. reiterated that “there was no clear relationship between the [c-statistic] of a given propensity score model and the degree to which conditioning on the propensity score balanced prognostically important variables between treated and untreated subjects in the matched sample.”26 In a third report, Austin argued that the c-statistic gives no indication as to whether confounders have been omitted from the propensity-score model, nor as to whether the propensity-score model has been correctly specified.28
Neither necessary nor sufficient to ensure the control of confounding, the c-statistic is of limited value for covariate selection into a propensity score model. It provides no assurance that all measured confounders have been balanced between treatment groups, or that interactions among covariates or higher-order terms have been balanced. Indeed, preference for covariates which yield a high c-statistic may lead us to balance non-confounders, including instrumental variables, resulting in reduced overlap of propensity score distributions between treatment arms and reduced efficiency. This discussion is largely compatible with previous recommendations,2,26–29 and in general these cautions apply to other markers of model fit, including the Hosmer-Lemeshow goodness-of-fit test statistic,27 as well as to automated model fitting procedures which rely on the c-statistic.
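Rather than model fit, covariate balance itself can be checked directly, covariate by covariate, after matching or weighting; the standardized mean difference is one widely used diagnostic. A minimal sketch (the function name and the conventional 0.1 threshold mentioned in the comment are illustrative):

```python
from statistics import mean, variance

def standardized_difference(x_treated, x_untreated):
    """Standardized mean difference for one covariate: the difference
    in group means divided by the pooled standard deviation. Values
    near 0 indicate balance; an absolute value above roughly 0.1 is
    a common, if arbitrary, flag for residual imbalance."""
    d = mean(x_treated) - mean(x_untreated)
    pooled_sd = ((variance(x_treated) + variance(x_untreated)) / 2.0) ** 0.5
    return d / pooled_sd

# Hypothetical covariate values in each arm: means differ by one
# pooled standard deviation, giving a standardized difference of 1.0
print(standardized_difference([2.0, 3.0, 4.0], [1.0, 2.0, 3.0]))
```

Unlike the c-statistic, this diagnostic speaks to the propensity score's actual goal, since it measures how similar the covariate distributions are between treatment groups after conditioning on the score.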
Unfortunately, the determination of what is a risk factor for the outcome is not always straightforward, and there is no infallible way to separate risk factors from non-risk factors in a given study.17 While some have argued in favor of selecting covariates according to their empirical association with the outcome, more simulation studies are necessary before this approach can be recommended without reservation. Model selection approaches, such as those based on cross-validation,30–31 may also be able to improve the performance of a given propensity score model; however, these approaches currently require that the initial model yield a consistent estimator. Thus, these methods still require the analyst to decide which variables to include in the initial model.
Therefore, rather than letting the c-statistic guide selection of variables into the propensity score model, we recommend that selection of covariates into a propensity score model begin with analysis of causal diagrams based on prior subject-matter knowledge and hypotheses.32 From there, the literature supports the inclusion of strong risk factors for the outcome, whether or not they are related to the exposure, and the exclusion of variables that are strong predictors of exposure but have no obvious relation to the outcome. However, as no method for variable selection is foolproof, the epidemiologist should consider reporting results under different, sensible, and transparently explained model specifications. As in conventional analyses, such a sensitivity analysis approach allows the analyst and the reader to assess the sensitivity of the results to model assumptions that are not supported by strong subject-matter knowledge.18
Funding. D.W. received funding from an unrestricted educational training grant from the UNC-GlaxoSmithKline Center of Excellence in Pharmacoepidemiology and Public Health, UNC School of Public Health; NIH/NIAID grant T32-AI-07001; and NIH/NICHD K99-HD-06-3961. M.J.F. is supported by a career development award from the Agency for Healthcare Research & Quality (AHRQ K02 HS17950) and an unrestricted grant from the UNC-GSK Center of Excellence in Pharmacoepidemiology and Public Health. M.A.B. is supported by a career development grant from the National Institute on Aging (AG-027400). T.S. received investigator-initiated research funding and support as Principal Investigator (RO1 AG023178) and Co-Investigator (RO1 AG018833) from the National Institute on Aging at the National Institutes of Health, and as Principal Investigator of the UNC-DEcIDE center from the Agency for Healthcare Research and Quality. T.S. does not accept personal compensation of any kind from any pharmaceutical company, though he receives salary support from unrestricted research grants from pharmaceutical companies to UNC.
Conflicts. None declared.