In this paper we have examined the impact of not accounting for the matched nature of a propensity-score matched sample versus accounting for the matched design on type I error rates, coverage rates of confidence intervals, and variance estimation. We examined a wide range of measures of effect: difference in means, odds ratios, hazard ratios, rate ratios, and relative risks. In an empirical case study, we demonstrated that differing inferences can be obtained depending on whether one accounts for the matched nature of the sample.

When estimating a difference in means, we found that both matched and unmatched methods tended to have the appropriate type I error rate when the baseline covariates explained a low to moderate proportion of the variance in the outcome. However, when the baseline covariates explained a larger proportion of the variance in the outcome, a matched test had the correct type I error rate, while an unmatched test was overly conservative, with a type I error rate below 0.05. When estimating odds ratios, rate ratios, and relative risks, matched tests had the correct type I error rate, while unmatched tests had incorrect type I error rates. Finally, when estimating hazard ratios, unmatched tests had type I error rates that were more conservative than those of matched tests.

When estimating non-null treatment effects, unmatched analyses tended to result in standard errors of the estimated treatment effect that overestimated the sampling variability of the treatment effect. In contrast, matched analyses resulted in estimates of the standard error of the treatment effect that were closer to the standard deviation of the sampling distribution of the treatment effect. Furthermore, for estimating rate ratios and relative risks, matched analyses resulted in confidence intervals that had coverage rates closer to the nominal level than did unmatched analyses.
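One way to see why an unmatched analysis can overstate the standard error of a difference in means is that the variance of a within-pair difference equals Var(treated) + Var(control) - 2 Cov(treated, control), and matched subjects with similar baseline covariates tend to have positively correlated outcomes. The following is a minimal sketch of that comparison, not the simulation code used in this study; the function names and the six matched pairs are hypothetical and chosen purely for illustration.

```python
import math
from statistics import variance

def unmatched_se(treated, control):
    """SE of the difference in means, treating the two groups as independent samples."""
    return math.sqrt(variance(treated) / len(treated) + variance(control) / len(control))

def matched_se(treated, control):
    """SE of the mean within-pair difference, which respects the pairing."""
    diffs = [t - c for t, c in zip(treated, control)]
    return math.sqrt(variance(diffs) / len(diffs))

# Hypothetical outcomes for six matched pairs (treated[i] is matched to control[i]).
# Outcomes within a pair are similar because the pair shares similar covariates.
treated = [10.2, 12.1, 14.3, 16.0, 18.4, 20.1]
control = [10.0, 12.4, 14.0, 16.3, 18.0, 20.5]

se_unmatched = unmatched_se(treated, control)
se_matched = matched_se(treated, control)
print(f"unmatched SE: {se_unmatched:.3f}, matched SE: {se_matched:.3f}")
```

In this illustrative data set the outcomes are strongly correlated within pairs, so the paired standard error is much smaller than the independent-samples one. When the within-pair correlation is close to zero, the two estimates are similar.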

A systematic review of the use of propensity-score methods in the medical literature found that they were most frequently used to analyze the effect of treatments and exposures on dichotomous or time-to-event outcomes (Sturmer 2006). Our findings suggest that in these settings, applied researchers should apply statistical methods that account for the matched nature of the propensity-score matched sample. Accounting for the matched nature of the sample will result in tests with appropriate type I error rates and confidence intervals with coverage rates that are closer to the nominal level.
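For a dichotomous outcome, one common test that accounts for pairing is McNemar's test, which uses only the discordant pairs. The sketch below is an illustration under stated assumptions rather than a description of the exact procedure used in this study: the function name and the discordant-pair counts are hypothetical, and the p-value relies on the large-sample chi-square approximation with one degree of freedom (no continuity correction).

```python
import math

def mcnemar_test(b, c):
    """McNemar's chi-square test for matched pairs with a binary outcome.

    b: pairs in which only the treated subject had the event
    c: pairs in which only the control subject had the event
    Returns (statistic, p_value) from the 1-df chi-square approximation.
    """
    if b + c == 0:
        raise ValueError("No discordant pairs: the test is undefined.")
    statistic = (b - c) ** 2 / (b + c)
    # For a chi-square variable with 1 df, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts of discordant pairs in a matched sample.
stat, p = mcnemar_test(b=15, c=5)
print(f"McNemar statistic: {stat:.2f}, p-value: {p:.4f}")
```

Concordant pairs, in which both or neither subject had the event, do not enter the statistic. This is the key difference from a Pearson chi-square test applied to the same data as if the two groups were independent.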

In the medical literature, propensity-score methods are less frequently used to determine the effects of exposures or treatments on continuous outcomes (Sturmer 2006). Our study demonstrates that there is no advantage to employing an unmatched analysis. In contrast, a matched analysis resulted in estimates of the standard error of the treatment effect that better reflect the sampling variation of the treatment effect. Furthermore, when the baseline covariates explained a moderate proportion of the variability in the outcome, a matched analysis resulted in type I error rates and coverage rates for confidence intervals that were closer to the advertised level.

Two prior systematic reviews of propensity-score matching in the medical literature found that the large majority of published studies ignored the matched nature of the propensity-score matched sample when estimating the variance of the treatment effect (Austin 2007a, 2008b). In the current study, we found that ignoring the matched nature of the propensity-score matched sample can result in tests with incorrect type I error rates, confidence intervals that do not have the advertised coverage rates, and incorrect estimates of the sampling variability of the estimates of the treatment effect. Applied researchers should employ matched analyses when estimating differences in means, odds ratios, hazard ratios, rate ratios, and relative risks.