Propensity score analyses attempt to control for confounding in non-experimental studies by adjusting for the likelihood that a given patient is exposed. Such analyses have been proposed to address confounding by indication, but there is little empirical evidence that they achieve better control than conventional multivariate outcome modeling.
Using PubMed and Science Citation Index, we assessed the use of propensity scores over time and critically evaluated studies published through 2003.
Use of propensity scores increased from a total of 8 papers before 1998 to 71 in 2003. Most of the 177 published studies abstracted assessed medications (N=60) or surgical interventions (N=51), mainly in cardiology and cardiac surgery (N=90). Whether PS methods or conventional outcome models were used to control for confounding had little effect on results in those studies in which such a comparison was possible. Only 9 of 69 studies (13%) had an effect estimate that differed by more than 20% from that obtained with a conventional outcome model in all PS analyses presented.
Publication of results based on propensity score methods has increased dramatically, but there is little evidence that these methods yield substantially different estimates compared with conventional multivariable methods.
Randomized controlled trials are considered the gold standard for assessing the efficacy of medications, medical procedures, or clinical strategies. Nevertheless, particularly for research on the prevention of chronic disease, randomized trials are often infeasible because of their size, time, and budget requirements, questionable generalizability, and ethical constraints.
On the other hand, non-experimental studies of interventions have frequently been criticized because of their potential for selection bias. This concern reached a crescendo with the disparity in estimated effects of hormone replacement therapy from randomized trials and non-experimental studies. This imbroglio highlighted the need to develop and apply improved methods to reduce bias in non-experimental studies in which selection bias or confounding is likely to occur.
The use of multivariate confounder scores to combine many covariates into a single variable can be traced back to Miettinen in 1976. In 1983, Rosenbaum and Rubin developed the concept of propensity scores (PS) estimated at baseline to control for selection bias in cohort studies. This technique has become popular to control confounding bias in epidemiologic studies that assess the outcomes of drugs and medical procedures. Propensity scores estimate the predicted probability (propensity) of use of a given drug or procedure in a particular subject, based on his or her characteristics when the treatment is chosen. In principle, the effect of the treatment can then be measured among patients who have the same predicted propensity of treatment, thus controlling for confounding. Use of PS to reduce bias is especially appealing since, under the assumption that all relevant predictors of treatment have been adequately captured, subjects with the same PS should have the same chance of receiving treatment. Therefore, PS are often conceptualized as mimicking randomized trials, although they do so only with respect to factors that have been adequately measured. Randomization, in contrast, removes bias from both measured and unmeasured factors. PS allow simultaneous control for confounding by several variables in situations where conventional multivariable models might not be appropriate, owing to the small number of outcomes. PS, however, are frequently used in settings where the outcome is common; their value in this situation is not yet clear. We sought to review the application of PS in the medical literature and to assess its practical value.
A propensity score (PS) can be defined as the probability of exposure (e.g., receipt of a treatment) given observed covariates. The score is usually estimated with a multivariable logistic regression model, but a variety of multivariable scoring functions can be used. In a logistic model, the scores range from 0 to 1 and reflect the estimated probability, based on the subject’s characteristics, that the subject will receive the treatment of interest, such that individuals with the same estimated PS have the same chance of receiving treatment. Any two subjects with the same PS can have different values for specific covariates, but overall, covariates entered in the PS model will tend to be balanced between treated and untreated subjects with similar PS. This balance of covariates can easily be checked, and how well the PS achieves it can be clearly communicated, for example by presenting the distribution of covariates in exposed and unexposed subjects separately, stratified by quintiles of the PS.
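As a concrete illustration, the estimation and balance check described above might look as follows in Python; this is a minimal sketch using pandas and statsmodels, in which the data frame df and all column names (treated, age, sex, diabetes) are hypothetical:

```python
# Minimal sketch: estimate a PS by logistic regression and check
# covariate balance within PS quintiles. All column names are
# hypothetical; `treated` is a 0/1 exposure indicator.
import pandas as pd
import statsmodels.api as sm

covariates = ["age", "sex", "diabetes"]

X = sm.add_constant(df[covariates])
ps_model = sm.Logit(df["treated"], X).fit()
df["ps"] = ps_model.predict(X)  # estimated probability of treatment

# Balance check: covariate means by exposure status within quintiles
# of the estimated PS, as suggested in the text.
df["ps_quintile"] = pd.qcut(df["ps"], 5, labels=False)
balance = df.groupby(["ps_quintile", "treated"])[covariates].mean()
print(balance)
```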
By estimating the PS and analyzing the data within homogeneous levels of PS, in theory one can achieve a ‘virtual randomization’, in which comparable patients are separated into the exposed and unexposed groups. Since PS are estimated using measured data, however, they cannot control for unmeasured or imperfectly measured variables. Therefore, residual systematic bias cannot be excluded.
Once PS are estimated they can be used in various ways to control for selection bias or confounding in non-experimental cohort studies. Possible implementations include matching on the PS, stratified analysis using PS as the stratification variable, and combinations of these two approaches with conventional multivariable outcome modeling. In theory, within each PS stratum, some patients will have received the treatment of interest while others will not. In practice, however, this is not always the case (see figure 1) and how one uses PS in analyses can make a difference.
One strategy is to match each exposed subject to one or more unexposed subjects with a similar PS, thus avoiding the complexity of matching within multiple strata. A variety of matching methods are available to identify unexposed subjects with PS similar to those of exposed subjects.[7,8] Effective balancing is achieved by any matching procedure that produces good agreement between the mean PS in exposed and unexposed subjects. Selecting an equal number of exposed and unexposed subjects within categories of the PS (frequency matching) instead of individual matching enables the inclusion of exposed subjects for whom no exact unexposed match can be found, but may introduce bias stemming from non-overlapping ranges for exposed and unexposed subjects at the extremes of the distribution of the PS. This bias can be avoided by restricting analyses to the range of PS common to both exposed and unexposed patients, i.e. excluding unexposed patients with a PS lower than the lowest PS observed in exposed patients and excluding exposed patients with a PS higher than the highest PS observed in unexposed patients. Plotting the PS distribution for exposed and unexposed subjects is an easy diagnostic for non-overlap (see figure 1).
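One such procedure, 1:1 nearest-neighbor matching without replacement within a caliper, restricted to the overlapping range of PS as just described, can be sketched as follows (continuing from the sketch above; the caliper value is an arbitrary choice, not taken from the reviewed studies):

```python
# Sketch: restrict to the common range of PS, then greedily match each
# exposed subject to the nearest unexposed subject within a caliper.
lo = df.loc[df["treated"] == 1, "ps"].min()  # lowest PS among exposed
hi = df.loc[df["treated"] == 0, "ps"].max()  # highest PS among unexposed
overlap = df[(df["ps"] >= lo) & (df["ps"] <= hi)].copy()

exposed = overlap[overlap["treated"] == 1]
pool = overlap[overlap["treated"] == 0].copy()

caliper = 0.05  # maximum allowed PS difference (arbitrary)
pairs = []
for idx, row in exposed.iterrows():
    if len(pool) == 0:
        break
    dist = (pool["ps"] - row["ps"]).abs()
    if dist.min() <= caliper:
        match_idx = dist.idxmin()
        pairs.append((idx, match_idx))
        pool = pool.drop(match_idx)  # matching without replacement

print(f"matched {100 * len(pairs) / len(exposed):.0f}% of exposed subjects")
```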
As an alternative to matching, one can include all available subjects in an analysis and control for the PS. This can be achieved by simple stratification or by modeling the PS-disease association, e.g. with the PS as a continuous covariate. Often, a model with indicator variables for quintiles of the observed PS is used, but control for confounding may be better when PS are modeled as continuous variables. Again, inclusion of all subjects might introduce bias from subjects whose PS lies outside the range common to exposed and unexposed. As with matching, this bias may be reduced by excluding unexposed patients with a PS lower than the lowest PS observed in exposed patients and vice versa.
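Controlling for the PS in the outcome model might be sketched as follows, assuming a binary outcome column (all names as in the earlier sketches):

```python
# Sketch: outcome models controlling for the PS, either as quintile
# indicators or as a continuous covariate; `outcome` is a 0/1 endpoint.
import statsmodels.formula.api as smf

# Quintile indicators: the coefficient on `treated` is the adjusted
# log odds ratio for the treatment-outcome association.
fit_q = smf.logit("outcome ~ treated + C(ps_quintile)", data=df).fit()
print(fit_q.params["treated"])

# Continuous PS, which the text notes may control confounding better.
fit_c = smf.logit("outcome ~ treated + ps", data=df).fit()
print(fit_c.params["treated"])
```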
A somewhat controversial issue in the use of PS is whether better control of confounding, and hence better estimates of the effect of treatment on the outcome, can be achieved by including the PS along with other important predictors of the outcome.[12,13] In theory, confounding can be controlled and a treatment effect estimated validly if only one of the two models - the treatment model (PS) or the outcome model (‘traditional’ multivariable modeling) - is specified correctly. Strategies that include both approaches, and thus involve possibly redundant control of confounding, have therefore been called ‘doubly robust’. The theory behind these methods is complex, however, and software tools with adequate documentation are not yet available.
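Although documented software is lacking, the underlying idea can be sketched. The following illustrates one doubly robust construction, augmented inverse probability weighting (AIPW); it is our illustrative example under the same hypothetical column names as above, not a method used in the reviewed studies:

```python
# Sketch of the doubly robust idea via AIPW: the risk-difference
# estimate is consistent if either the treatment model (PS) or the
# outcome model is correctly specified. Column names are hypothetical.
import numpy as np
import statsmodels.formula.api as smf

ps = smf.logit("treated ~ age + sex + diabetes", data=df).fit().predict(df)
out = smf.logit("outcome ~ treated + age + sex + diabetes", data=df).fit()

d1, d0 = df.assign(treated=1), df.assign(treated=0)
m1, m0 = out.predict(d1), out.predict(d0)  # predicted risk under each arm

t, y = df["treated"], df["outcome"]
risk_diff = np.mean(t * (y - m1) / ps + m1) - \
            np.mean((1 - t) * (y - m0) / (1 - ps) + m0)
```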
Using PubMed and Science Citation Index, we identified studies in which propensity scores were used. Initially, a keyword search was performed in PubMed to identify studies containing the term “propensity”. This broad search yielded 5311 unduplicated references published through December 31, 2003. After review of the abstracts, we identified 167 articles that used propensity score methods in the study of medical interventions and health outcomes (excluding articles focusing solely on methodological or statistical aspects, editorials, review articles or letters, and foreign-language articles). To increase the sensitivity of our search, we also searched for articles that cited one of the important propensity score methods articles.[5, 6, 13, 15–17] This search yielded another 73 articles. All these papers were obtained and read by one of the authors. We excluded 48 articles: those that did not include analysis of data (28), randomized clinical trials (9), case-control studies (2), and articles primarily analyzing cost-effectiveness (6) or practice patterns (3).
Our search revealed 58 substantive medical research studies that used PS in 2003,[18–75] 38 in 2002,[76–113] 28 in 2001,[114–141] 6 in 2000,[142–147] 5 in 1999,[148–152] 5 in 1998,[153–157] and a total of 5 before 1998.[158–162] A citation search of the significant methods articles on PS, using Science Citation Index, yielded an additional 13 medical research studies that used PS in 2003,[163–175] 13 in 2002,[176–188] 11 in 2001,[189–199] 3 in 2000,[200–202] 1 in 1999,[203] 3 in 1998,[204–206] and a total of 3 before 1998.[207–209] We present the number of studies with results based on PS methods published in each of these years in figure 2.
After further review, fifteen articles were excluded from analysis because the outcomes were continuous and it was not possible to calculate an odds ratio or risk ratio.[59, 86, 88, 111, 120, 124, 137, 150, 152, 160, 162, 193, 195, 196, 206] The final selection of studies abstracted comprised 70 articles from 2003, 48 from 2002, 33 from 2001, 9 from 2000, 4 from 1999, 7 from 1998, and 6 from before 1998.
For all selected papers published through 2003 we abstracted the following items: the number of variables used to predict treatment and outcome, respectively; the unadjusted (crude) estimate for the treatment-outcome association; the estimates for the treatment-outcome association adjusted by use of PS matching, PS adjustment, and/or multivariable outcome models, including models without PS and with PS as well as covariates; the predictive value of the PS as assessed by the area under the receiver operating characteristic (ROC) curve (equivalent to the c-statistic in logistic regression); and the percent of exposed participants that could be matched to unexposed participants (where applicable). We extracted or calculated odds ratios or relative risks whenever adequate data were presented.
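Where articles presented only counts, effect measures were computed from the 2x2 table of exposure by outcome; for reference, a sketch of those standard calculations (with a = outcomes and b = non-outcomes among exposed, and c and d the corresponding counts among unexposed):

```python
# Sketch: effect measures from a 2x2 table of exposure by outcome.
def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    return (a / b) / (c / d)

def risk_ratio(a: int, b: int, c: int, d: int) -> float:
    return (a / (a + b)) / (c / (c + d))
```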
Table 1 presents the studies published in 2003. Corresponding tables for 2002 and prior years are included in the e-Extra online version. The following results are based on all 177 substantive studies reporting on dichotomous exposures and outcomes published through 2003. The medical specialties covered in these papers included cardiology (including cardiac and vascular surgery) (N=90), general internal medicine (N=34), oncology (N=20), nephrology (N=9), psychiatry (N=4), and rheumatology (N=2). The treatments studied included medications (N=60), surgical interventions (N=51), catheterization (N=13), other medical procedures (including care after myocardial infarction and in end-stage renal failure), lifestyle factors, and a wide variety of other comparisons. The main outcome assessed was mortality (N=118). Other outcomes included myocardial infarction (N=6), stroke (N=3), and a wide variety of others, including complications of infection, gastrointestinal events, and emergency hospitalizations.
The number of exposed subjects (or unexposed subjects, if this number was smaller) ranged from 61 to over 1,380,000, and the number of outcomes ranged from 23 to 285,965. In 109 studies, the number of exposed subjects was larger than the number of subjects who experienced the outcome; in 13 studies it was smaller. To estimate the PS, 2 to 112 variables were used (in those papers in which this information was presented), compared with 1 to 45 used in multivariable outcome models. Direct comparison of the number of variables was possible in 90 studies, of which only 51 used more variables to estimate the PS than to estimate the corresponding outcome model; 27 used fewer variables to do so. Sixty-five studies had fewer than 8 outcomes for each variable entered into the PS model, i.e. a setting where the use of PS methods was shown to be advantageous compared with conventional outcome modeling. In 60% of studies (96 out of 161) the number of outcomes would have been sufficient to enter all variables used in the propensity score model in the corresponding outcome model.
The area under the receiver operating characteristic (ROC) curve or c-statistic was presented for 73 studies. It ranged from 0.56 to 0.94, indicating poor to good predictive power. The lowest predictive value (c=0.56) was achieved predicting the annual volume of patients treated by admitting physicians (in a study assessing its association with mortality in acute myocardial infarction); the highest (c=0.94) was achieved when predicting revascularization in coronary artery disease and thrombolysis in patients with stroke. Very high values (c > 0.90) were reported in six additional studies for treatments including statins, amiodarone after acute myocardial infarction, chemotherapy in colon cancer, heart valve repair vs. replacement, bilateral thoracic artery bypass, and a hospital comparison.
Fifty-one studies used matching on the PS as either the main analytic strategy or as one of several analytic strategies presented. The percentage of exposed participants that could be matched to an unexposed participant was presented for 49 studies, and ranged from 26% to 100% (median=90%).
Most studies showed clear evidence of confounding, with substantial changes in the point estimate after adjustment. Whether PS methods or conventional outcome models were used to control for confounding, however, seemed to matter little in most of those 69 studies in which such a comparison was possible. These included 10 studies in which the authors made a qualitative statement that (mostly PS) analyses showed “similar” results.
In 20% of studies (14 out of 69), however, there was a more than 20% difference in the point estimate obtained from the conventional outcome model compared with any propensity score method presented.[22, 24, 34, 51, 52, 62, 73, 100, 102, 105, 107, 121, 123, 192] We used this arbitrary cut-point as a marker of a substantial difference in results. Of these, 5 [22, 100, 107, 121, 123] showed results not meeting our 20% criterion for at least one of the analytic strategies using PS. In four of these studies, the PS strategy not meeting the criterion was the addition of the PS to the conventional multivariable outcome model.[22, 100, 107, 123] In the study by Foody et al., the result based on PS matching did not meet our criterion. This left the remaining 13% of studies (9 out of 69) in which all PS analyses presented showed a substantial difference compared with conventional outcome models.
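Under one plausible reading of this cut-point, a relative difference between point estimates (our interpretation for illustration, not a formula stated in the reviewed papers), the criterion amounts to:

```python
# Sketch of the 20% criterion: flag a substantial difference between
# the conventional and PS-based point estimates (relative difference).
def substantial_difference(conventional: float, ps_based: float) -> bool:
    return abs(ps_based - conventional) / conventional > 0.20
```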
The number of studies using PS methods, though not yet large, is climbing rapidly. According to the authors of many of these studies, the main reason to use PS methods was better control for confounding compared with conventional multivariable outcome modeling. We found no empirical evidence, however, that PS analyses controlled confounding more effectively than conventional outcome modeling in the majority of the studies where results from both methods were presented. Potentially meaningful differences in the control of confounding were observed in less than 15% of studies. Since the true underlying association is unknown, it remains unclear whether these differences are due to better control for confounding using the PS or whether adjusting for an inaccurate PS distorted results in some studies.[212,213] The use of PS as the only analytic technique applied comes at the price of losing potentially useful information about predictors of outcome. It therefore seems desirable to use PS only if a reduction in bias or an improvement in efficiency can be achieved.
Cook and Goldman compared the performance of tests of significance under the null hypothesis (i.e. assuming no difference between treatments) for PS and for ‘traditional’ multivariable outcome models using simulations. PS appeared to produce valid results in most circumstances, but were biased in situations with very strong treatment-confounder associations.
In some practical situations the choice of analytic method will be limited. Because 10 events per covariate is usually considered a minimum requirement for stable estimates in multivariable models,[215,216] PS analyses combining multiple covariates into a single score are especially desirable if the treatment is common and the outcome is rare.[217,218] A recent simulation study comparing PS with multivariable outcome models concluded that PS performed better in situations with fewer than 8 outcomes per covariate. Apart from this specific condition (relevant in 65 of the 161 studies presenting the necessary information), there is little if any practical guidance for researchers regarding when the use of PS will produce different, and in particular better, estimates compared with conventional multivariable outcome models.
PS are used to reduce bias. Drake observed that the magnitude and direction of bias resulting from omitting an important confounder from analysis was similar in multivariable outcome modeling and in estimating the treatment-outcome relation controlling for PS. This observation implies that PS may not be superior to conventional multivariable outcome models in controlling bias from unobserved confounders.
Several strategies for using PS are currently applied in medical research, and often the results of more than one strategy are reported in a single paper. Individual matching on a PS has intuitive appeal; in the studies that used matching, the proportion of exposed subjects that could be matched ranged from 26% to 100%. Excluding a large proportion of exposed subjects because of a lack of unexposed matches, however, may severely alter the composition of the study population. Because comparisons may still be valid within that altered population, we would not call this issue a bias. Nevertheless, it is essential to appreciate and to describe clearly the differences between the altered population and the original study population. On the other hand, including subjects with a PS outside the overlapping range, whether in conventional outcome modeling or in PS methods that do not exclude non-overlapping ranges, can lead to bias due to model extrapolation or smoothing. Such subjects might include, for example, patients with absolute indications or contraindications to treatment, who should not be included in any treatment comparison but are usually not recognized with conventional multivariable outcome modeling. Since identifying such patients is a clear advantage of PS, a graphical exploration similar to figure 1 could be used as a routine procedure before any multivariable outcome modeling in treatment comparisons. Unfortunately, systematic comparisons of the different strategies for applying PS with respect to validity and efficiency, with specific attention to exclusion of participants and non-linear associations between the PS and the outcome, are sparse.
Variable selection in constructing PS is at present an ad hoc process that lacks guidelines and well-understood model diagnostics. The area under the ROC curve or c-statistic (from logistic regression) to quantify the predictive power of a model is a well established concept in clinical epidemiology. Its value when assessing the performance of PS to control confounding is unclear, however. Indeed, a very high c-statistic can indicate considerable non-overlap in PS distributions between exposed and unexposed as shown in figure 1.
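Both diagnostics are easy to compute; a sketch using scikit-learn's roc_auc_score for the c-statistic, alongside a crude check of PS overlap (column names as in the earlier sketches):

```python
# Sketch: c-statistic of the PS model and a check for non-overlap of
# the PS distributions in exposed vs. unexposed (cf. figure 1).
from sklearn.metrics import roc_auc_score

c_stat = roc_auc_score(df["treated"], df["ps"])  # 0.5 = no discrimination

exp_ps = df.loc[df["treated"] == 1, "ps"]
unexp_ps = df.loc[df["treated"] == 0, "ps"]
print(f"c-statistic: {c_stat:.2f}")
print(f"exposed PS range:   [{exp_ps.min():.2f}, {exp_ps.max():.2f}]")
print(f"unexposed PS range: [{unexp_ps.min():.2f}, {unexp_ps.max():.2f}]")
```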
Some authors argue that variables that only predict treatment choice but are not associated with the study outcome should not be included in the PS. By definition, these are not confounders, but they may increase the area under the ROC curve and thereby erroneously imply a high validity of the PS analysis.
A practical way of assessing the value of the PS model in controlling for confounding is to check the balance of important risk factors for the outcome between exposed and unexposed within levels of the estimated PS. This method has the advantage of being driven by substantive knowledge rather than statistics, and the results can easily be communicated to the reader in a table. It allows direct assessment of comparability of exposed and unexposed by the reader, a clear advantage of using PS methods compared with the ‘black box’ of the conventional outcome model.
This review of the application of propensity score methods in the medical literature has several limitations. We may well have missed some studies by using a specific search strategy, but this problem should not affect the comparison over time. Important information in understanding similarities and differences between the analytic approaches, including description of the types of variables, variable selection procedures, and measures of model adequacy, could not be abstracted systematically, since these are rarely presented with sufficient detail in published papers.
In conclusion, methods using propensity scores may be good candidates for improving inference in non-experimental studies, but a better understanding of the benefits and limitations of these methods in practical circumstances is needed. Meanwhile, propensity scores, like any other method, should not be automatically regarded as a preferable and sole method to control for confounding in non-experimental research, but rather as a promising addition.
The project was funded by a grant from the National Institute on Aging (R01 AG023178).