In this case study and tutorial on propensity score methods, we have illustrated the use of different propensity score methods for estimating treatment effects when using observational data. Several observations merit highlighting and discussion.
First, we highlight that specifying the propensity score model was an iterative process: we repeatedly respecified the model and reassessed the balance of measured baseline covariates between treated and untreated participants in the propensity score matched sample. In this case study, we required three steps before we were satisfied that the propensity score model had been adequately specified. Balance assessment plays a critical role in any propensity score analysis.
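The balance check described above can be sketched for a single continuous covariate as follows. This is a minimal illustration, not the code used in the case study: the function name, the example values, and the 0.1 threshold (a commonly cited rule of thumb) are assumptions made for the example.

```python
import numpy as np

def standardized_difference(x_treated, x_untreated):
    """Absolute standardized difference for one continuous covariate.

    Uses the pooled standard deviation sqrt((s_t^2 + s_u^2) / 2),
    with sample variances (ddof=1). Returned as a fraction, not a percentage.
    """
    x_t = np.asarray(x_treated, dtype=float)
    x_u = np.asarray(x_untreated, dtype=float)
    pooled_sd = np.sqrt((np.var(x_t, ddof=1) + np.var(x_u, ddof=1)) / 2.0)
    return abs(x_t.mean() - x_u.mean()) / pooled_sd

# Hypothetical ages in a matched sample (illustrative values only).
age_treated = [58, 61, 64, 70, 55, 67]
age_untreated = [59, 60, 66, 69, 57, 68]

d = standardized_difference(age_treated, age_untreated)
# Assumed rule of thumb: values below 0.1 suggest negligible imbalance.
print(f"standardized difference = {d:.3f}; balanced = {d < 0.1}")
```

In an iterative workflow, a check of this kind would be repeated for every measured baseline covariate after each respecification of the propensity score model.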
Second, different propensity score methods eliminated systematic differences between treated and untreated participants to differing degrees. Propensity score matching and inverse probability of treatment weighting using the propensity score reduced systematic differences between treated and untreated participants to a greater extent than did stratification on the propensity score or covariate adjustment using the propensity score. These observations are similar to prior empirical observations and to the results of Monte Carlo simulations (Austin, 2009c).
Third, when outcomes were binary, propensity score methods allowed estimation of absolute risk reductions (or differences in proportions) and relative risks. In contrast, conventional logistic regression only allowed estimation of odds ratios. Many authors have suggested that relative risks and risk differences are preferable to odds ratios for quantifying the magnitude of treatment effects (Sackett, 1996; Sinclair & Bracken, 1994). The reader is referred elsewhere for a more detailed discussion of propensity score methods for estimating risk differences and relative risks (Austin, 2008d, 2010e; Austin & Laupacis, 2011).
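To make the distinction between these effect measures concrete, the following sketch computes the risk difference, relative risk, and odds ratio from the same two-by-two table. The function name and the counts are hypothetical and are not taken from the case study.

```python
def effect_measures(events_treated, n_treated, events_untreated, n_untreated):
    """Return the risk difference, relative risk, and odds ratio
    comparing treated with untreated participants for a binary outcome."""
    p1 = events_treated / n_treated      # risk of the outcome if treated
    p0 = events_untreated / n_untreated  # risk of the outcome if untreated
    return {
        "risk_difference": p1 - p0,
        "relative_risk": p1 / p0,
        "odds_ratio": (p1 / (1 - p1)) / (p0 / (1 - p0)),
    }

# Hypothetical counts: 20 deaths among 100 treated, 25 among 100 untreated.
measures = effect_measures(20, 100, 25, 100)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts the odds ratio lies further from one than the relative risk, which is one reason the two measures should not be interpreted interchangeably when the outcome is common.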
Fourth, when estimating absolute and relative reductions in the probability of mortality within 3 years of hospital discharge, the magnitude of estimated treatment effects varied across the different propensity score methods. The table of effect sizes (Comparison of Effect Sizes Across Different Propensity Score Methods) summarizes the estimated absolute risk reductions and relative risks across the four propensity score methods. The estimated relative risks varied from 0.75 (stratification on the propensity score with five strata) to 0.88 (propensity score matching). Furthermore, in only one instance (stratification with five strata) did the associated 95% confidence interval exclude unity. Stratification on the quintiles of the propensity score removed less of the systematic differences between treated and untreated participants than did matching or weighting using the propensity score. Thus, the greater effect size obtained using stratification may reflect a greater amount of residual bias. In our comparison of treated and untreated participants in the original sample, we observed that treated participants tended to be younger and healthier than untreated participants. Furthermore, they were more likely to receive discharge prescriptions for medications that reduce cardiac mortality and morbidity. Thus, some of the difference between the stratified estimate and those obtained using weighting or matching may reflect persistent residual differences between treated and untreated participants, with the treated participants being healthier than the untreated participants. Similarly, when estimating absolute risk reductions, the greatest reduction in mortality was observed when stratification was used, whereas the smallest absolute reduction was observed when matching was used. As with relative risks, stratification was the only propensity score method that resulted in a 95% confidence interval for the absolute risk reduction that excluded the null value.
Comparison of Effect Sizes Across Different Propensity Score Methods
Fifth, we highlight that propensity score matching allows one to estimate the ATT, whereas the other three methods allow one to estimate the ATE (Imbens, 2004, although we note that the other methods can be adapted to estimate the ATT as well). The latter three methods allow one to estimate the effect on average mortality in the population if one shifted the entire population from receiving no counseling to receiving smoking cessation counseling. Because of how matching was done, the matching estimator is estimating the effect of smoking cessation counseling in those participants who ultimately did not receive counseling. We have already noted that participants who did not receive smoking cessation counseling tended to be older and sicker than patients who received counseling. Thus, the populations to which each estimate applies are qualitatively and quantitatively different from one another. In comparing the two panels of the accompanying figure, one should note that the probability of survival to 3 years is different in the untreated population (matched analysis) compared with survival in the overall population if all participants were untreated (weighted analysis). Complicating the interpretation of the matched estimator is the fact that of those patients who did not receive in-patient smoking cessation counseling, only 86% were successfully matched to a patient who did receive smoking cessation counseling. Ideally, each participant who did not receive counseling would be matched to a participant who received counseling. Then, the estimated treatment effect would apply to the population of participants who did not receive counseling. However, we have noted that, of those participants who did not receive counseling, unmatched participants were systematically different from those who were matched. In particular, unmatched participants were substantially older. Due to incomplete matching, it is not clear how to describe the population to which the matched estimator applies. Incomplete matching appears to occur frequently in applied research, complicating the interpretation of the matched estimator. Applied investigators need to decide which of the ATE or the ATT is more meaningful in their research context. In the context of smoking cessation counseling offered to patients hospitalized with an AMI, the choice may depend in part on the intensity of counseling and the degree to which patients’ commitment is required.
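The difference in target populations discussed above can also be seen in how weights are constructed when weighting is adapted to different estimands. The sketch below uses the standard weight formulas; the function name, the estimand labels, and the example values are assumptions for illustration and do not reproduce the analysis in the case study.

```python
import numpy as np

def propensity_weights(treated, propensity, estimand="ATE"):
    """Weights for a binary treatment indicator z and propensity score e.

    ATE:       z / e + (1 - z) / (1 - e)   (target: entire population)
    ATT:       z + (1 - z) * e / (1 - e)   (target: treated participants)
    untreated: z * (1 - e) / e + (1 - z)   (target: untreated participants)
    """
    z = np.asarray(treated, dtype=float)
    e = np.asarray(propensity, dtype=float)
    if estimand == "ATE":
        return z / e + (1 - z) / (1 - e)
    if estimand == "ATT":
        return z + (1 - z) * e / (1 - e)
    if estimand == "untreated":
        return z * (1 - e) / e + (1 - z)
    raise ValueError(f"unknown estimand: {estimand}")

# Hypothetical data: one treated participant (e = 0.25), one untreated (e = 0.5).
z = [1, 0]
e = [0.25, 0.5]
for target in ("ATE", "ATT", "untreated"):
    print(target, propensity_weights(z, e, target))
```

Each set of weights reshapes the sample to resemble a different population, which is why estimates of the ATE and the ATT need not agree when treated and untreated participants differ systematically.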
Sixth, we contrast the different odds ratios that were obtained using different methods. In the sample weighted by the inverse probability of treatment, we obtained two odds ratios: 0.84 and 0.98. The first was obtained by regressing survival on treatment status, whereas the second was obtained after additional adjustment for baseline covariates. Neither odds ratio was statistically significantly different from one (p > .09). In contrast, conventional logistic regression in the original unweighted sample resulted in an odds ratio of 0.73 (this was attenuated to 0.77 when cubic smoothing splines were used to model the relationship between continuous baseline covariates and the log-odds of mortality). The first two odds ratios are estimates of the marginal odds ratio for the reduction in mortality due to counseling, whereas the latter two odds ratios are estimates of the conditional odds ratio (Rosenbaum, 2005). Differences between these two sets of estimates reflect the fact that propensity score methods allow for estimation of marginal treatment effects, whereas regression adjustment allows for estimation of conditional treatment effects (Rosenbaum, 2005). For odds ratios, marginal and conditional effects do not coincide (Gail, Wieand, & Piantadosi, 1984; Greenland, 1987).
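A small numerical sketch may help show why marginal and conditional odds ratios differ even in the absence of confounding. The setup is hypothetical: equally sized strata with different baseline risks, and the same conditional odds ratio applied within every stratum. The function names and values are assumptions for illustration only.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

def risk_from_odds(o):
    """Convert odds back to a probability."""
    return o / (1 + o)

def marginal_odds_ratio(baseline_risks, conditional_or):
    """Marginal odds ratio implied by a common conditional odds ratio
    applied within equally sized strata (so there is no confounding)."""
    treated_risks = [risk_from_odds(conditional_or * odds(p)) for p in baseline_risks]
    risk_untreated = sum(baseline_risks) / len(baseline_risks)
    risk_treated = sum(treated_risks) / len(treated_risks)
    return odds(risk_treated) / odds(risk_untreated)

# Hypothetical strata: untreated risks of 0.2 and 0.8, conditional odds ratio 0.5.
print(f"marginal odds ratio = {marginal_odds_ratio([0.2, 0.8], 0.5):.3f}")
```

In this constructed example the conditional odds ratio is 0.5 in both strata, yet the marginal odds ratio is closer to one. The same pattern, with the conditional estimate further from the null than the marginal estimate, is consistent with the ordering of the odds ratios reported above, although other factors may also contribute to those differences.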
Seventh, we highlight that there are well-developed methods for assessing the similarity of treated and untreated participants conditional on the propensity score. These methods allow one to assess whether the propensity score model has been adequately specified. Having removed or reduced systematic differences between treatment groups, one can then directly compare outcomes in the resultant matched, stratified, or weighted sample. In contrast, when using regression adjustment it is more difficult to assess whether the regression model relating outcomes to treatment and baseline covariates has been correctly specified. In our initial conventional logistic regression model, the odds ratio for counseling was 0.73 (p = .0371). However, this was attenuated to 0.77 (p = .0942) when cubic smoothing splines were used to model the relationship between continuous baseline variables and the log-odds of mortality. Even so, uncertainty persists as to whether this second model had been adequately specified.
Eighth, we remind the reader that propensity score methods only allow one to account for measured baseline variables. Each of the estimates of treatment effect may be susceptible to bias due to unmeasured confounding variables. The reader is referred elsewhere for an illustration of this (Austin, Mamdani, Stukel, Anderson, & Tu, 2005). Rosenbaum and Rubin (1983) described sensitivity analyses to assess the sensitivity of the study conclusions to unmeasured covariates when propensity score methods are used. These sensitivity analyses allow one to assess how strongly an unmeasured confounder would have to be associated with treatment selection in order for a previously statistically significant treatment effect to become statistically nonsignificant if the unmeasured confounder had been accounted for. However, in our case study, the large majority of estimated effects were not statistically significant. Thus, we did not employ these sensitivity analyses in this case study.
Ninth, we highlight that the question of whether providing smoking cessation counseling reduces postdischarge mortality in AMI patients is a complex clinical question. Readers are referred elsewhere for an examination of this clinical question (Van Spall, Chong, & Tu, 2007). The analyses presented in the current case study were merely intended to illustrate the use of different statistical methods and were not intended to address this clinical question. However, to underline the importance of the clinical question, we note that a prior study found that approximately 31% of patients who were discharged alive from the hospital with a diagnosis of AMI were current smokers at the time of the infarction, whereas 36% were former smokers (Rea et al., 2002). A meta-analysis of 12 cohort studies found that smoking cessation following AMI reduced the odds of subsequent mortality by 46% (Wilson, Gibson, Willan, & Cook, 2000). It is important to note that this mortality benefit was consistent across a range of factors. Given the large number of patients hospitalized with acute myocardial infarction, the high prevalence of current smokers among these patients, the high mortality rate in this patient population, and the potential benefit of smoking cessation in these patients, it is critical that effective means of successfully encouraging smoking cessation be developed. A systematic review of 33 randomized and quasi-randomized controlled trials found that smoking cessation counseling in hospitalized smokers increased the odds of smoking cessation at 6 and 12 months by 65% if the counseling began during hospitalization and included supportive contacts for more than 1 month after hospital discharge (Rigotti, Munafo, & Stead, 2008). However, interventions with less postdischarge contact were not found to be effective. In the context of patients hospitalized with AMI, a randomized controlled trial found that bedside smoking cessation counseling followed by seven telephone calls over the first 6 months after discharge had a substantial effect on smoking cessation 1 year after discharge (Dornelas, Sampson, Gray, Waters, & Thompson, 2000). In an analysis of a multicenter registry of patients hospitalized with an AMI, a multivariable analysis found that, although individual smoking cessation counseling did not influence the odds of smoking cessation, being treated at a facility that offered an in-patient smoking cessation program increased the odds of smoking cessation (Dawood et al., 2008). Finally, a meta-analysis found that, when comparing different health care providers, smoking cessation was most effective when provided by physicians (Gorin & Heck, 2004).
In summary, we have illustrated the appropriate steps in conducting analyses using different propensity score methods. Increased use of these methods may allow for more transparent estimation of causal treatment effects using observational data.