Randomized controlled trials (RCTs) are the gold standard for estimating the effect of treatments, interventions, and exposures on health outcomes. In RCTs, there will, on average, be no systematic differences in baseline characteristics between treated and untreated subjects. This allows outcomes to be compared directly between the different treatment arms, permitting the reporting of clinically-meaningful measures of treatment effect. In particular, absolute measures of treatment effect can be estimated when outcomes are binary or time-to-event in nature. Several clinical commentators have suggested that absolute measures of treatment effect are superior to relative measures of treatment effect for making treatment-related decisions for patients (
Laupacis et al., 1988;
Cook and Sackett, 1995;
Jaeschke et al., 1995), while others have criticized the reporting of odds ratios in RCTs (
Sackett et al., 1996). At the very least, the reporting of relative measures of treatment effect should be supplemented by the reporting of absolute measures of treatment effect (
Schechtman, 2002;
Sinclair and Bracken, 1994).
There is growing interest in using observational or non-randomized studies to examine the effect of treatment on health outcomes. However, in non-randomized studies, there are often systematic differences between treated and untreated subjects. Historically, researchers have used regression methods to adjust for observed systematic differences between treated and untreated subjects. A limitation of this approach is that, when outcomes are binary or time-to-event in nature, the resultant measure of treatment effect is a relative measure such as an odds ratio or a hazard ratio. We suggest that absolute measures of treatment effect should also be reported for observational studies of the effect of treatment on outcomes.
In this paper we have summarized methods that have been proposed in the literature for estimating clinically-meaningful measures of treatment effect in observational studies. When outcomes are binary, we have described methods for estimating relative risks, absolute risk reductions, and the number needed to treat (NNT). When outcomes are time-to-event in nature, we have described methods for estimating the absolute reduction in the probability of an event occurring within a specified duration of follow-up time (and the associated NNT). We have also suggested methods for estimating the effect of treatment on expected survival time. Application of these methods allows for supplementing the reporting of relative measures of treatment effect by absolute measures of treatment effect. Furthermore, when outcomes are dichotomous, the described methods allow for the reporting of relative risks, and free one from using the odds ratio as a measure of association and effect. When outcomes are common, the odds ratio does not provide an approximation of the relative risk; rather, it magnifies the apparent association between treatment and outcome.
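To make the relationship between these measures concrete, the following minimal sketch computes the absolute risk reduction, relative risk, odds ratio, and NNT from a pair of marginal risks. The function name `effect_measures` and the example risks are hypothetical and chosen only for illustration; they are not taken from the study data.

```python
def effect_measures(risk_treated, risk_control):
    """Return effect measures computed from two marginal event risks."""
    if not (0 < risk_treated < 1 and 0 < risk_control < 1):
        raise ValueError("risks must lie strictly between 0 and 1")
    # Absolute risk reduction: positive when treatment lowers the risk.
    arr = risk_control - risk_treated
    odds_treated = risk_treated / (1 - risk_treated)
    odds_control = risk_control / (1 - risk_control)
    return {
        "absolute_risk_reduction": arr,
        "relative_risk": risk_treated / risk_control,
        "odds_ratio": odds_treated / odds_control,
        # A negative value would be read as a number needed to harm.
        "nnt": 1 / arr if arr != 0 else float("inf"),
    }


# Hypothetical risks: the relative risk is 0.8 in both settings.
common = effect_measures(0.40, 0.50)    # common outcome
rare = effect_measures(0.004, 0.005)    # rare outcome

for label, result in (("common", common), ("rare", rare)):
    print(label, {name: round(value, 4) for name, value in result.items()})
```

With these illustrative inputs the relative risk is the same in both settings, but the odds ratio stays close to it only when the outcome is rare; when the outcome is common the odds ratio lies noticeably further from unity.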
We have considered two different families of approaches for estimating clinically-meaningful measures of treatment effect in observational studies: regression-based approaches and propensity score-based approaches. Each clinically-meaningful measure of treatment effect can be estimated using either approach. We would argue that there are advantages to the propensity score-based approaches compared to the regression-based approaches. First, propensity-score methods reflect a design-based approach to removing confounding, whereas regression methods reflect an analysis-based approach to removing confounding (
Rubin, 2007). As
Rubin (2007) has argued, the use of propensity score methods allows one to separate the design of an observational study from the analysis of an observational study. Using propensity score methods, no reference is made to the outcome until the propensity score model has been specified and adequate balance in baseline covariates has been observed between treated and untreated subjects with similar propensity scores. A second advantage to propensity-score based approaches is that one can explicitly assess the degree to which confounding has been removed. When matching, stratifying or weighting using the propensity score, one can examine the similarity of treated and untreated subjects within the matched sample, within propensity-score strata or within the weighted sample, respectively (
Austin, 2009c). These balance diagnostics serve as an empirical test of whether the propensity score model has been adequately specified. When using regression-based approaches it is more difficult to assess whether the outcomes model has been adequately specified, and whether confounding between treatment and baseline covariates has been removed.
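One diagnostic of this kind is the standardized difference of a baseline covariate between treated and untreated subjects. The sketch below is illustrative only: the function names and the toy data are hypothetical, and it uses weighted means with variances divided by the sum of the weights, which is a simplification. Passing IPTW weights gives the weighted-sample version; omitting them gives the version for a matched or crude sample.

```python
from math import sqrt


def weighted_mean_var(values, weights):
    """Weighted mean and (population-style) weighted variance."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, var


def standardized_difference(x, treated, weights=None):
    """Absolute standardized difference of covariate x between groups.

    Assumes the covariate is not constant in both groups (non-zero pooled SD).
    """
    if weights is None:
        weights = [1.0] * len(x)
    groups = {True: ([], []), False: ([], [])}
    for value, z, w in zip(x, treated, weights):
        values, wts = groups[bool(z)]
        values.append(float(value))
        wts.append(float(w))
    mean_t, var_t = weighted_mean_var(*groups[True])
    mean_c, var_c = weighted_mean_var(*groups[False])
    pooled_sd = sqrt((var_t + var_c) / 2)
    return abs(mean_t - mean_c) / pooled_sd


# Hypothetical ages for four treated and four untreated subjects.
age = [70, 65, 80, 75, 60, 62, 58, 66]
treated = [1, 1, 1, 1, 0, 0, 0, 0]
d_age = standardized_difference(age, treated)
print(round(d_age, 3))
```

A value near zero after matching, stratifying, or weighting suggests that the covariate is similarly distributed in the two groups; no comparable check is available for the outcome model in a regression-based analysis.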
We have described how three different propensity score methods can be used to estimate clinically-meaningful measures of treatment effect: propensity score matching, stratification on the propensity score, and inverse probability of treatment weighting (IPTW) using the propensity score. There are subtle differences between these methods. First, propensity-score matching allows one to estimate average treatment effects for the treated (ATT), whereas stratification and weighting allow one to estimate average treatment effects (ATE) (
Imbens, 2004). However, we would note that use of different weights allows one to estimate either the ATT or the average treatment effect for the controls (ATC) when using IPTW. Furthermore, the stratification estimator can be modified to estimate the ATT (
Imbens, 2004). Second, empirical studies have shown that matching and weighting eliminate a greater degree of the observed differences between treated and untreated subjects than does stratification (
Austin and Mamdani, 2006;
Austin et al., 2007;
Austin, 2009a). Simulations have shown that in some settings matching and weighting remove equivalent amounts of imbalance between treated and untreated subjects, while in other settings matching removes modestly more imbalance (
Austin, 2009a).
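The way the choice of weights changes the estimand under IPTW can be sketched as follows. The weight formulas are the forms commonly given for the ATE, ATT, and ATC and are stated here as an assumption, since they are not written out in this section; the function names and the toy data are hypothetical. Here `z` is the treatment indicator, `e` the estimated propensity score, and `y` a binary outcome.

```python
def iptw_weight(z, e, estimand="ATE"):
    """Weight for one subject with treatment z (0/1) and propensity score e."""
    if not 0 < e < 1:
        raise ValueError("propensity score must lie strictly between 0 and 1")
    if estimand == "ATE":   # reweight both groups to the full sample
        return 1 / e if z else 1 / (1 - e)
    if estimand == "ATT":   # reweight untreated subjects to resemble the treated
        return 1.0 if z else e / (1 - e)
    if estimand == "ATC":   # reweight treated subjects to resemble the untreated
        return (1 - e) / e if z else 1.0
    raise ValueError("estimand must be 'ATE', 'ATT' or 'ATC'")


def weighted_risk_difference(y, z, e, estimand="ATE"):
    """Difference in weighted event proportions (treated minus untreated)."""
    w = [iptw_weight(zi, ei, estimand) for zi, ei in zip(z, e)]
    risk = {}
    for arm in (1, 0):
        numerator = sum(wi * yi for wi, yi, zi in zip(w, y, z) if zi == arm)
        denominator = sum(wi for wi, zi in zip(w, z) if zi == arm)
        risk[arm] = numerator / denominator   # normalized weighted mean
    return risk[1] - risk[0]


# Hypothetical outcomes, treatments and propensity scores for eight subjects.
y = [0, 1, 0, 0, 1, 1, 0, 1]
z = [1, 1, 1, 0, 0, 0, 1, 0]
e = [0.8, 0.6, 0.5, 0.5, 0.4, 0.2, 0.6, 0.3]
for estimand in ("ATE", "ATT", "ATC"):
    print(estimand, round(weighted_risk_difference(y, z, e, estimand), 3))
```

The sketch normalizes the weights within each arm, which is one of several reasonable choices; variance estimation, which must account for the weighting, is omitted.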
In the accompanying table, we summarize the different estimated measures of effect for the impact of beta-blocker prescribing on death within one year of discharge in our study sample. Several observations can be made from an examination of this table. First, the adjusted odds ratio (0.73) is further from unity than are all the estimated adjusted relative risks (relative risks range from 0.77 to 0.81). This highlights the fact that the odds ratio overstates the strength of the association, compared with the relative risk, when the outcome is common. Second, apart from conditional standardization by centering covariates (which estimates the relative risk for a specific covariate pattern), the other relative risks lay between 0.79 and 0.81. Third, when estimating risk differences, five of the six methods resulted in qualitatively similar estimates (−0.053 to −0.057).
We have noted above that randomization will ensure that, on average, treated and untreated subjects do not differ systematically from one another. However, in any given randomization, it is possible that residual differences may exist between treatment groups. Several authors have suggested that regression adjustment be used to adjust for potential differences in baseline covariates that are predictive of the outcome (
Senn, 1989;
Senn, 1994;
Altman and Dore, 1991;
Lavori et al., 1983). When outcomes are binary or time-to-event in nature, regression adjustment results in the odds ratio or the hazard ratio being reported as the measure of treatment effect. Several of the methods described in the current paper can be directly applied to RCTs to estimate clinically-meaningful measures of effect when regression adjustment is used and outcomes are binary or time-to-event in nature (
Austin, 2010b,
2010c).
In summary, the design of RCTs allows for the reporting of simple, clinically-meaningful measures of treatment effect. The recently revised CONSORT statement on the reporting of results for RCTs recommends that, for RCTs with dichotomous outcomes, both relative and absolute measures of treatment effect be reported (
Schulz et al., 2010). In observational studies of the effect of treatment or exposure on outcomes, relative measures of treatment effect, such as the odds ratio or the hazard ratio, are frequently reported. In this paper we have summarized different statistical methods that allow for estimating clinically-meaningful measures of treatment effect in observational studies. We encourage researchers to report absolute risk reductions, numbers needed to treat, and relative risks when outcomes are binary. This would allow the reporting of treatment effects in observational studies to mirror what is recommended for RCTs. When outcomes are time-to-event in nature, we encourage authors to report the absolute reduction in the risk of an event occurring within a specified duration of follow-up (along with the associated number needed to treat).
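For time-to-event outcomes, the absolute reduction in the risk of an event within a specified duration of follow-up can be sketched with Kaplan-Meier estimates in each treatment group. The example below is a minimal illustration that assumes the two groups already form a propensity-score matched sample; the function names, the follow-up times, and the 12-month horizon are hypothetical, and confidence intervals are omitted.

```python
def km_survival(times, events, horizon):
    """Kaplan-Meier estimate of the probability of surviving past `horizon`.

    events[i] is 1 if subject i had the event at times[i], 0 if censored.
    """
    survival = 1.0
    event_times = sorted({t for t, d in zip(times, events) if d and t <= horizon})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        n_events = sum(1 for u, d in zip(times, events) if u == t and d)
        survival *= 1 - n_events / at_risk
    return survival


def risk_reduction_at(horizon, treated, control):
    """Absolute risk reduction and NNT at `horizon`; each arm is (times, events)."""
    risk_treated = 1 - km_survival(*treated, horizon)
    risk_control = 1 - km_survival(*control, horizon)
    arr = risk_control - risk_treated
    nnt = 1 / arr if arr != 0 else float("inf")
    return arr, nnt


# Hypothetical follow-up times in months for a small matched sample.
treated_arm = ([5, 8, 12, 12, 15], [1, 0, 1, 0, 0])
control_arm = ([3, 6, 9, 12, 14], [1, 1, 0, 1, 0])
arr_12, nnt_12 = risk_reduction_at(12, treated_arm, control_arm)
print(round(arr_12, 3), round(nnt_12, 1))
```

In a weighted rather than matched sample, the counts at risk and the event counts would be replaced by sums of weights; that variant is not shown here.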