In this paper we used Monte Carlo simulations to examine the performance of different propensity-score methods for estimating risk differences. The estimators based on using IPTW resulted in unbiased estimation of the risk differences. The other propensity-score methods introduced minor to modest bias. The IPTW doubly robust estimator with the correctly specified outcomes regression model resulted in estimates with the lowest estimated standard errors. Similarly, this method resulted in estimates with the lowest MSE, although the other IPTW estimators were close competitors. The IPTW doubly robust estimator with the correctly specified outcomes regression model resulted in confidence intervals with the advertised coverage rates, whereas those of other IPTW estimators had coverage rates that exceeded the advertised levels. In some scenarios, propensity-score matching resulted in confidence intervals whose coverage rates were substantially lower than the advertised levels. Similarly, the IPTW doubly robust estimator with the correctly specified outcomes regression model resulted in approximately correct type I error rates, whereas the other IPTW estimators had type I error rates lower than the nominal level. Finally, we observed that the standard errors for stratification, matching, and the doubly robust IPTW estimator with the correctly specified regression model closely approximated the standard deviation of the sampling distribution of the estimators.
In our case study, we observed that the different propensity-score methods resulted in qualitatively similar estimates of the absolute reduction in the probability of mortality within 1 year due to receipt of a beta-blocker prescription at hospital discharge. Similarly, estimates of the NNT to avoid one death within 1 year were qualitatively similar across the different methods (range 19–21). IPTW using the doubly robust estimator with the full regression model resulted in the narrowest 95 per cent confidence interval. Propensity-score matching resulted in a 95 per cent confidence interval that was 13 per cent wider than that of the doubly robust method.
When outcomes are binary, measures of treatment effect can be reported using odds ratios, relative risks, or risk differences. Several studies have examined the performance of different propensity-score methods for estimating relative risks and odds ratios. Austin et al. found that propensity-score methods result in biased estimation of conditional or adjusted odds ratios [6]. Furthermore, propensity-score matching, stratification on the propensity score, and covariate adjustment using the propensity score result in sub-optimal inferences about marginal or population-average odds ratios [7]. However, propensity-score methods allow for unbiased estimation of relative risks in the presence of a uniform treatment effect [8]. When used for estimating relative risks, stratification and matching displayed the variance–bias trade-off: matching resulted in estimates with less bias, whereas stratification resulted in estimates with lower variance and greater precision [8]. In contrast to these prior studies examining estimation of odds ratios and relative risks, there is a paucity of information on the performance of different propensity-score methods for estimating risk differences.
Clinical commentators have suggested that absolute risk reductions and numbers needed to treat provide important information for clinical decision making that is lacking in relative measures of effect such as the odds ratio and the relative risk [1–4]. Furthermore, some medical journals require that the NNT be reported for any randomized clinical trial with binary outcomes [5]. These clinically meaningful measures of treatment effect can be easily computed using propensity-score methods. The current study provides the first comprehensive examination of the performance of the four different propensity-score methods for estimating risk differences. Given the advantages of the absolute risk reduction and the NNT for clinical decision making, we suggest that these measures of effect should also be reported for any observational study with binary outcomes.
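As a brief illustration of how the two measures relate, the following sketch computes the absolute risk reduction (ARR) and the corresponding NNT, taken as the reciprocal of the ARR, from two estimated risks. The function name and the input risks are hypothetical and chosen only for illustration; they are not estimates from our case study.

```python
def absolute_risk_reduction_and_nnt(risk_untreated, risk_treated):
    """Return (ARR, NNT) for a binary outcome.

    ARR is the risk in untreated subjects minus the risk in treated subjects;
    NNT is its reciprocal. Both inputs are probabilities in [0, 1].
    """
    for risk in (risk_untreated, risk_treated):
        if not 0.0 <= risk <= 1.0:
            raise ValueError("risks must lie in [0, 1]")
    arr = risk_untreated - risk_treated
    if arr <= 0:
        # The NNT is only defined when treatment lowers the risk.
        raise ValueError("NNT is undefined unless treatment reduces the risk")
    return arr, 1.0 / arr


# Hypothetical risks, not taken from the paper.
arr, nnt = absolute_risk_reduction_and_nnt(risk_untreated=0.20, risk_treated=0.15)
print(f"ARR = {arr:.3f}, NNT = {nnt:.1f}")
```

With these illustrative inputs, an absolute reduction of about five percentage points corresponds to an NNT of about 20. In practice the NNT is usually reported rounded up to a whole number of subjects.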
Based on the results of our Monte Carlo simulations, IPTW using the doubly robust estimator had the best performance among the propensity-score methods examined. It resulted in essentially unbiased estimation of the true risk difference, had the lowest standard error of the estimated risk difference, had the lowest MSE, resulted in 95 per cent confidence intervals with approximately correct coverage rates, and had approximately correct type I error rates. Each of the competing approaches had inferior performance on at least one of these metrics compared with the doubly robust approach. A limitation of the doubly robust approach is the requirement that one specify an outcomes regression model relating the outcome to baseline covariates. However, we found that even when the outcomes model was mis-specified through the omission of several predictor variables, the doubly robust approach still outperformed stratification or matching on the propensity score. Propensity-score matching and stratification on the quintiles of the propensity score are two commonly used approaches in the medical literature, whereas IPTW methods are rarely employed [19, 20]. When the comparison was restricted to stratification and matching, we observed that matching had superior performance for low to moderate effect sizes, whereas stratification had superior performance for larger effect sizes. However, matching resulted in an approximately correct type I error rate, whereas stratification had a substantially inflated type I error rate.
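To make the distinction between the simple weighted estimator and the doubly robust estimator concrete, the following is a minimal sketch of both for a risk difference. It uses a standard augmented IPTW form of the doubly robust estimator; whether this matches the exact variants evaluated in our simulations is an assumption. The data-generating process, its coefficients, and all function names are hypothetical, and the true propensity scores and outcome probabilities are plugged in directly rather than estimated from fitted models.

```python
import math
import random


def expit(v):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-v))


def iptw_risk_difference(z, y, e):
    """Simple IPTW estimate of the risk difference (treated minus untreated)."""
    n = len(y)
    risk1 = sum(zi * yi / ei for zi, yi, ei in zip(z, y, e)) / n
    risk0 = sum((1 - zi) * yi / (1 - ei) for zi, yi, ei in zip(z, y, e)) / n
    return risk1 - risk0


def doubly_robust_risk_difference(z, y, e, m1, m0):
    """Augmented IPTW (doubly robust) estimate of the risk difference.

    m1 and m0 hold outcome-model predictions of P(Y = 1) for each subject
    under treatment and under no treatment, respectively.
    """
    n = len(y)
    risk1 = sum(zi * yi / ei - (zi - ei) / ei * a
                for zi, yi, ei, a in zip(z, y, e, m1)) / n
    risk0 = sum((1 - zi) * yi / (1 - ei) + (zi - ei) / (1 - ei) * b
                for zi, yi, ei, b in zip(z, y, e, m0)) / n
    return risk1 - risk0


def simulate(n=50_000, seed=1):
    """Hypothetical data-generating process with a single confounder x."""
    rng = random.Random(seed)
    z, y, e, m1, m0 = [], [], [], [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        ei = expit(0.5 * x)            # true propensity score
        p0 = expit(-1.0 + 0.5 * x)     # risk if untreated
        p1 = expit(-1.5 + 0.5 * x)     # risk if treated
        zi = 1 if rng.random() < ei else 0
        yi = 1 if rng.random() < (p1 if zi else p0) else 0
        z.append(zi)
        y.append(yi)
        e.append(ei)
        m1.append(p1)
        m0.append(p0)
    # Sample average of the subject-level risk differences.
    true_rd = sum(a - b for a, b in zip(m1, m0)) / n
    return z, y, e, m1, m0, true_rd


z, y, e, m1, m0, true_rd = simulate()
print("true RD:", round(true_rd, 4))
print("IPTW:   ", round(iptw_risk_difference(z, y, e), 4))
print("DR:     ", round(doubly_robust_risk_difference(z, y, e, m1, m0), 4))
```

In this sketch both estimators should land close to the true risk difference, because the propensity score is known exactly. The augmentation terms involving `m1` and `m0` are what allow the doubly robust estimator to borrow precision from the outcomes model.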
There has been limited comparison of methods that employ IPTW with other propensity-score methods in the literature. Lunceford and Davidian [18] compared methods employing weighting via the propensity score with stratification in the context of a continuous outcome and a linear treatment effect. Some of our observations mirror their findings. For instance, they note that stratification is not a consistent estimator, resulting in biased estimation of linear treatment effects. Furthermore, they note that the IPTW-DR-1 estimator has lower variance than the IPTW1 estimator, which reflects the findings of our Monte Carlo simulations. In addition, Lunceford and Davidian observed that stratification resulted in estimates with higher MSE compared with IPTW-DR-1, and that confidence intervals for stratified estimators did not have the advertised coverage rates. The current study made two novel contributions. The first was the focus on estimating risk differences rather than differences in means. In randomized controlled trials of medical interventions, binary outcomes are more prevalent than continuous outcomes [30]. Therefore, our focus on estimating risk differences may provide more guidance to medical researchers examining the effects of treatments on health outcomes using observational data. The second novel contribution was the inclusion of propensity-score matching and covariate adjustment using the propensity score. Both of these methods are used more frequently than methods based on IPTW [19, 20]. Because propensity-score matching is frequently used in the medical literature, it is important to determine its performance relative to competing methods for estimating risk differences.
The propensity score is a balancing score: conditional on the propensity score, the distribution of measured baseline covariates is similar between the treated and the untreated subjects [9]. Several recent studies have compared the relative ability of different propensity-score methods to balance measured baseline covariates between the treated and the untreated subjects. Propensity-score matching has been shown to eliminate a greater proportion of the systematic differences between the treated and the untreated subjects compared with stratification on the propensity score [25, 26, 31]. Similarly, propensity-score matching eliminated a greater degree of the systematic differences between the treated and the untreated subjects compared with covariate adjustment using the propensity score [31]. Finally, in some settings, propensity-score matching and IPTW using the propensity score eliminated systematic differences between the treated and the untreated subjects to an approximately equivalent degree [31]. However, there were some scenarios in which propensity-score matching eliminated a modestly greater proportion of the observed imbalance compared with IPTW using the propensity score [31].
A limitation to the use of methods based on IPTW is the paucity of methods that have been described in the literature for assessing whether the propensity-score model has been correctly specified in this context. When stratification or matching on the propensity score is employed, a range of diagnostics have been described for assessing the adequacy of the specification of the propensity-score model [10, 25, 26, 32]. However, there are limited descriptions of methods to assess the goodness-of-fit of the propensity-score model in the context of IPTW (one assumes that many methods for matching could be adapted to the use of IPTW using the propensity score). Rubin writes ‘In rare situations, the individually estimated probabilities (i.e. the estimated propensity scores) themselves may be used in the process of estimating treatment effects … If it is, the propensity-score estimation has to be conducted far more carefully. … In such cases, the estimated probabilities can be very influential on the estimated effects of treatment versus control, and so the probabilities themselves must be very well-estimated. In such cases, diagnostics of the accuracy of the estimated probabilities are appropriate, although diagnostics of the estimated underlying (logistic) regression coefficients are generally irrelevant’ [33]. Thus, a prelude to the greater use of methods based on IPTW using the propensity score may be the development of diagnostics for assessing the accuracy of the estimated propensity scores.
The apparent superiority of IPTW using the propensity score compared with propensity-score matching may be worrisome, given the popularity of the latter method [12–14]. However, a possible explanation for the discrepancies between these two methods is that they are estimating different measures of effect. The econometrics literature differentiates between the average treatment effect (ATE) and the average treatment effect for the treated (ATT) [34]. Imbens [34] states that stratification using the propensity score and IPTW using the propensity score allow one to estimate the ATE, whereas matching on the propensity score allows one to estimate the ATT. The data-generating process in the current study induced a specified ATE. Thus, the biased estimation that arose when using propensity-score matching may primarily be a result of the fact that matching estimates the ATT, whereas stratification and weighting estimate the ATE.
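The difference between the two estimands can be illustrated with a small weighting sketch. It contrasts the usual inverse-probability weights, which target the ATE, with weights that leave treated subjects unweighted and weight untreated subjects by the odds of treatment, a standard way of targeting the ATT. The ATT weights, the normalized weighted proportions, the function names, and the two-stratum data set are all assumptions introduced for illustration; none of them was part of our simulations or case study.

```python
def propensity_weights(z, e, estimand="ATE"):
    """Weights derived from propensity scores e for treatment indicators z."""
    if estimand == "ATE":
        return [1.0 / ei if zi else 1.0 / (1.0 - ei) for zi, ei in zip(z, e)]
    if estimand == "ATT":
        return [1.0 if zi else ei / (1.0 - ei) for zi, ei in zip(z, e)]
    raise ValueError("estimand must be 'ATE' or 'ATT'")


def weighted_risk_difference(z, y, w):
    """Difference in weighted event proportions, treated minus untreated."""
    def weighted_risk(group):
        num = sum(wi * yi for zi, yi, wi in zip(z, y, w) if zi == group)
        den = sum(wi for zi, wi in zip(z, w) if zi == group)
        return num / den
    return weighted_risk(1) - weighted_risk(0)


def make_stratum(e, n_treated, events_treated, n_untreated, events_untreated):
    """Build (z, y, e) records for one stratum sharing a propensity score."""
    rows = []
    rows += [(1, 1, e)] * events_treated
    rows += [(1, 0, e)] * (n_treated - events_treated)
    rows += [(0, 1, e)] * events_untreated
    rows += [(0, 0, e)] * (n_untreated - events_untreated)
    return rows


# Hypothetical data: treatment lowers risk in stratum A, where most subjects
# are treated, and has no effect in stratum B, where few subjects are treated.
rows = make_stratum(0.8, 80, 8, 20, 6) + make_stratum(0.2, 20, 10, 80, 40)
z, y, e = (list(col) for col in zip(*rows))

ate = weighted_risk_difference(z, y, propensity_weights(z, e, "ATE"))
att = weighted_risk_difference(z, y, propensity_weights(z, e, "ATT"))
print(f"ATE-weighted RD = {ate:.2f}, ATT-weighted RD = {att:.2f}")
```

In this constructed example the ATE-weighted risk difference is about -0.10 and the ATT-weighted risk difference is about -0.16. The two differ because the treatment effect varies across strata and treated subjects are concentrated in the stratum where treatment is beneficial. When the effect is heterogeneous in this way, an estimator targeting the ATT will appear biased if it is judged against the ATE.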
In conclusion, our study suggests that methods based on IPTW should be used more widely for estimating risk differences in observational studies. This is particularly true when the interest is in estimating ATEs. Although the focus in the past has been on odds ratios and relative risks, estimation of absolute risk reductions and numbers needed to treat may provide greater information for clinical decision making.