While much psychiatric research is based on randomized controlled trials (RCTs), where patients are randomly assigned to treatments, sometimes RCTs are not feasible. This paper describes propensity score approaches, which are increasingly used for estimating treatment effects in non-experimental settings. The primary goal of propensity score methods is to create sets of treated and comparison subjects who look as similar as possible, in essence replicating a randomized experiment, at least with respect to observed patient characteristics. A study estimating the metabolic effects of antipsychotic medication in a sample of Florida Medicaid beneficiaries with schizophrenia illustrates the methods.
While much psychiatric research is based on randomized controlled trials (RCTs), where patients are randomly assigned to treatments, sometimes RCTs are not feasible. Ethical concerns might preclude randomization, as when it would require assigning subjects to smoke, or randomization may be impractical, such as when the treatment of interest is widely available and commonly used. When RCTs are unethical or infeasible, a carefully constructed non-experimental study can be used to estimate treatment effects. While non-experimental studies are disadvantaged by the lack of randomization, the study costs may be lower, the study sample may be broader, and follow-up may be longer than in an RCT (1,2).
The primary challenge for estimation of treatment effects is the identification of subjects who are as similar as possible on all background characteristics other than the treatment of interest. By virtue of randomization, RCTs ensure, on average, the treatment and comparison groups are similar on background characteristics, measured and unmeasured. In non-experimental studies, there is no such guarantee. Treatment and comparison groups may systematically differ on factors that also affect the outcome, a problem referred to as “selection bias.” Selection bias leads to confounding, “a situation in which the estimated intervention effect is biased because of some difference between the comparison groups apart from the planned interventions such as baseline characteristics, prognostic factors, or concomitant interventions. For a factor to be a confounder, it must differ between the comparison groups and predict the outcome of interest” (3).
Numerous design and analytical strategies are available to account for measured confounders, but the major limitation is the potential for unmeasured confounders. Well-designed non-experimental studies make good use of measured confounders by creating treatment groups that look as similar as possible on the measured characteristics. Researchers then assume that, given comparability (or balance) between the groups on measured confounders, there are no measured or unmeasured differences other than treatment received. This assumption has many names: “unconfounded treatment assignment,” “no hidden bias,” “ignorable treatment assignment,” or “selection on observables” (4–6).
We describe approaches that, through the careful design and analysis of non-experimental studies, create balance between treatment groups. The key idea is to use relatively recently developed techniques, known as propensity score methods, to ensure that the treatment and comparison subjects are as similar as possible. The goal is to replicate a randomized experiment, at least with respect to the measured confounders, by making the treatment and comparison groups look as if they could have been randomly assigned to the groups, in the sense of having similar distributions of the confounders. We describe the five key stages to this process (Table 1). A study that compares atypical and conventional antipsychotic medications with regard to their effect on adverse metabolic outcomes (dyslipidemia, Type II diabetes, and obesity) (16) illustrates the methods. The study uses data from Florida Medicaid beneficiaries (18 to 64 years), diagnosed with schizophrenia and continuously enrolled from 1997 to 2001. Although the bulk of the evidence on the causal associations of antipsychotics comes from studies using U.S. and U.K. administrative and medical databases, RCTs have been used to assess the metabolic effects of antipsychotic drugs (e.g., 17,18). Findings of these RCTs, however, are not regarded as representative of the adverse events of these drugs as used in routine practice. A possible exception is the CATIE trial (19), an effectiveness trial in which, apart from the randomization, every aspect of care was naturalistic. Conducting this type of trial is costly and generally infeasible.
The first step involves clearly specifying the treatment of interest, and identifying individuals who experienced that treatment. One way to address this is to consider what treatment would be randomized if randomization were possible. For example, we could randomly assign patients to receive an atypical medication. We then need to select an appropriate comparison condition. Because this study investigates the metabolic effects of atypical antipsychotics, the relevant question is whether the comparison of interest is another type of medication, no medication, or either. Virtually all subjects with schizophrenia during this time frame are treated with some type of antipsychotic agent, and thus the key clinical question is not whether the patient should receive an antipsychotic medication, but rather which type of antipsychotic medication should be used. We compare atypical antipsychotics (specifically, clozapine, olanzapine, quetiapine, and risperidone) to conventional antipsychotics (specifically, chlorpromazine, trifluoperazine, fluphenazine, perphenazine, thioridazine, haloperidol, and thiothixene). We use Medicaid claims data, so that atypical (conventional) antipsychotic users are those subjects who filled at least one prescription for an atypical (conventional) antipsychotic. Prescribing information is unavailable, so only subjects who were written an antipsychotic prescription and filled it are included. As in an intent-to-treat analysis, we only know that the prescription was filled, not whether the medication was actually taken.
The next consideration is identification of confounders: factors that have previously been found to be associated with receipt of atypical antipsychotics and/or with metabolic outcomes. Key confounders in the Medicaid study include demographic and clinical variables, listed in Table 1, such as sex, age, race, and medical comorbidities. A good study will have a large set of measured confounders so that the assumption of no hidden bias is likely to be satisfied.
Once the treatment group, comparison group, and potential confounders are identified, researchers need to identify data on those groups and the confounders. The particular data elements necessary are: subjects, some of whom received the treatment (atypical antipsychotics) and others the comparison condition (conventional antipsychotics), an indicator for which subject is in which group, potential confounders, and outcomes. Confounders are measured before treatment assignment to ensure that they are not affected by the treatment (20,21) and outcomes are measured after treatment assignment, to ensure temporal ordering. In the Medicaid study, we determined periods during which an individual had some minimal exposure to an antipsychotic drug, at least 6 months of Medicaid enrollment preceding treatment initiation (from which we obtained the covariate information), and a 12-month follow-up period to examine incidence of metabolic outcomes. Often it is not possible to have truly longitudinal data, and researchers instead use cross-sectional data where assumptions regarding the time ordering of the variables being measured are made. We analyze one measurement occasion for each subject, measured 12 months following antipsychotic initiation. See the paper by Marcus et al. in this series for methods for estimating causal effects with multiple outcome occasions (22).
Table 2 (Columns 1–3) compares the means of the potential confounders between atypical and conventional antipsychotic users. The differences in percentages (for binary variables) or standardized differences (for continuous variables) are also reported. The standardized difference is the difference in means divided by the standard deviation of the confounder among the full set of conventional users (1,11,23). We then multiply by 100 to express the difference as a percentage. The conventional users are older on average (by 26% of a standard deviation) and more likely to be African American (34% vs. 24%), as compared to the atypical users. Because of these differences between the groups, comparing the raw outcomes between the two treatment groups would result in bias (24). Statistical adjustments are required to deal with the differences in the observed confounders.
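As a concrete sketch, the standardized difference described above can be computed as follows. The data are simulated for illustration, and scaling by the conventional-group standard deviation follows the convention used in this paper (some authors use a pooled standard deviation instead):

```python
import numpy as np

def standardized_difference(x_treated, x_control):
    """Difference in means, divided by the SD of the control (here,
    conventional-user) group, expressed as a percentage."""
    diff = np.mean(x_treated) - np.mean(x_control)
    return 100.0 * diff / np.std(x_control, ddof=1)

# Illustrative (simulated) ages for atypical vs. conventional users
rng = np.random.default_rng(0)
age_atypical = rng.normal(40, 11, size=500)
age_conventional = rng.normal(43, 11, size=500)
print(f"{standardized_difference(age_atypical, age_conventional):.1f}%")
```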
Ideally we want to compare atypical and conventional users who have “exactly” the same values for all the confounders. Assuming no unmeasured confounders, any difference in the outcomes could then be attributed to the treatment. However, exact matching on all of the covariates is often infeasible given the large number of covariates and relatively small number of subjects available. In the Medicaid study, if we were to make each of our 11 confounders binary, we would have 2,048 (= 2¹¹) distinct strata and need to have both atypical and conventional antipsychotic users in each. Because this is not feasible, a reasonable strategy is to make the “distributions” of the confounders similar between the atypical and conventional antipsychotic users—e.g., similar age, similar race, similar chronic medical comorbidity status. There are several general strategies to create comparable groups.
A common approach to adjusting for confounders is regression adjustment, whereby the treatment effect is estimated by regressing the outcome of interest on an indicator for the treatment received and the set of confounders. The coefficient on the treatment indicator provides an estimate of the treatment effect (Table 3, Column 1).² A drawback of this approach is that if the atypical and conventional groups are very different on the observed covariates (e.g., with over a 25% standard deviation difference on average age, as seen in Table 2), the regression adjustment relies heavily on the particular model form and extrapolates between the two groups (24, 25). Why does this pose a problem? First, the regression approach will provide a prediction of what would have happened to atypical users had they instead used conventional antipsychotics using information from a set of conventional users who are very different from, e.g., older than, those atypical users. Second, in most cases, the regression approach assumes a linear relationship between the measured covariates and the outcome of interest—an assumption that may not be true and is often difficult to test. Third, the output of standard regression analysis provides no information regarding covariate balance between the two treatment groups. Other approaches avoid these problems by ensuring that the comparisons are made between groups that are similar.
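To make regression adjustment concrete, here is a minimal simulated sketch (the data, effect size, and confounder structure are all invented for illustration); the coefficient on the treatment indicator is the adjusted treatment effect:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
X = rng.normal(size=(n, 3))                        # simulated confounders
t = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))    # treatment depends on X
y = 0.10 * t + 0.05 * X[:, 0] + rng.normal(0, 0.1, size=n)  # true effect 0.10

# Regress the outcome on an intercept, the treatment indicator, and the
# confounders; the coefficient on t is the adjusted treatment effect
design = np.column_stack([np.ones(n), t, X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
print(f"estimated treatment effect: {coef[1]:.3f}")
```

The extrapolation concern raised in the text arises because this fit is trusted even in regions where the two groups' confounder distributions barely overlap.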
A useful tool to achieve comparable confounder distributions is the “propensity score,” defined as the probability of receiving the treatment given the measured covariates (6). A property of the propensity score makes it possible to select subjects based on their similarity with respect to the propensity score (a single number summary of the covariates, similar to a comorbidity score) in order to achieve comparability on all the measured confounders, rather than having to consider each confounder separately. If a group of subjects have similar propensity scores, then they have similar probabilities of receiving the treatment, given the measured confounders. Within a small range of propensity score values, the atypical and conventional users should only differ randomly on the measured confounders, in essence replicating a randomized experiment.
Because the true propensity score for each subject is unknown, it is estimated with a model, such as a logistic regression, predicting treatment received given the measured confounders. Each subject’s propensity score is their predicted probability of receiving the treatment, generated from the model. The diagnostics for propensity score estimation are not the standard logistic regression diagnostics, as concern is not with the parameter estimates or predictive ability of the model. Rather, the success of a propensity score model (and subsequent matching or stratification procedure) is determined by the covariate balance achieved.
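A minimal sketch of the estimation step, using a hand-rolled Newton-Raphson logistic fit on simulated data (in practice any standard GLM routine would be used; the data here are invented for illustration):

```python
import numpy as np

def fit_propensity(X, t, n_iter=25):
    """Estimate propensity scores by logistic regression of the treatment
    indicator t on the measured confounders X, fit by Newton-Raphson."""
    Xd = np.column_stack([np.ones(len(t)), X])          # add an intercept
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-Xd @ beta))
        W = p * (1 - p)
        # Newton step: beta += (X'WX)^{-1} X'(t - p)
        beta += np.linalg.solve(Xd.T @ (W[:, None] * Xd), Xd.T @ (t - p))
    return 1 / (1 + np.exp(-Xd @ beta))

# Simulated confounders and a treatment that depends on them
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4))
t = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * X[:, 0] - 0.3 * X[:, 1]))))
propensity = fit_propensity(X, t)
print(propensity.round(2)[:5])
```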
One of the simplest ways of ensuring the comparability of groups is to select for each treated individual the comparison individual with the closest propensity score³ (26). We illustrate a 1:1 matching algorithm where one conventional antipsychotic user is selected for each atypical antipsychotic user. Variations on this algorithm include selecting multiple matches for each atypical user, matching atypical users to a variable number of conventional users (27), and prioritizing certain variables (12). For example, if there are a large number of potential control subjects relative to the number of treated, it may be possible to get 2 or 3 good matches for each treated individual, which will increase the precision of estimates without sacrificing much balance (27,28). In our study, because the numbers of conventional and atypical users are nearly equal, we used matching with replacement, meaning that each conventional user could be used as a match multiple times (29).
Figure 1 Panel A illustrates the resulting matches in the Medicaid study, with 1,809 conventional users matched to the 3,384 atypical users. The x-axis reflects the propensity scores; the y-axis is used to group the subjects into atypical (treated) vs. conventional (control), and matched vs. unmatched; the vertical spread of the symbols within each grouping is done to show the symbols more clearly. The figure shows the relative weight different subjects receive in the analyses of the outcomes, with the relative size of the symbols reflecting the number of times a subject was matched. Thus, conventional users selected as a match multiple times have larger symbols. The goal is to see good “overlap” between the propensity scores of the atypical and conventional users, which we have. However, there are quite a few conventional users with low propensity scores who are left unmatched. This illustrates a common drawback of nearest neighbor matching, in that sometimes subjects are unmatched, including some with propensity scores similar to those in the other group.
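The nearest-neighbor step itself is simple. A sketch on simulated scores, matching on the propensity score directly (matching on its logit is a common refinement):

```python
import numpy as np

def match_with_replacement(ps_treated, ps_control):
    """For each treated subject, return the index of the control subject
    with the closest propensity score; controls may be reused
    (matching with replacement)."""
    ps_treated = np.asarray(ps_treated)
    ps_control = np.asarray(ps_control)
    # Full distance matrix; fine for moderate sample sizes
    dist = np.abs(ps_treated[:, None] - ps_control[None, :])
    return dist.argmin(axis=1)

matches = match_with_replacement([0.30, 0.62, 0.55], [0.20, 0.50, 0.90])
print(matches)  # -> [0 1 1]: the 0.50 control is matched twice
```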
A second approach, inverse probability of treatment weighting (IPTW), avoids this problem by using data from all subjects (9,13,30). The idea of IPTW is similar to that of survey sampling weights, where individuals in a survey sample are weighted by their inverse probabilities of selection so that they then represent the full population from which the sample was selected. In our setting we treat each of the treatment groups (the atypical users and the conventional users) as a separate sample, and weight each up to the “population,” which in this case is all study subjects. Each subject receives a weight that is the inverse of the probability of being in the group they are in. However, instead of having known survey sampling probabilities, we use the estimated propensity scores. In particular, atypical users are weighted by one over their probability of receiving an atypical antipsychotic (the propensity score), and conventional users are weighted by one over their probability of receiving a conventional antipsychotic (one minus the propensity score). In the Medicaid study, the conventional users with low probabilities of receiving a conventional antipsychotic will receive relatively large weights, because they actually look more similar to the atypical users, thus providing good information about what would happen to the atypical users if they had instead taken conventional antipsychotics.
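The weight construction is a one-liner. A sketch (these weights target the combined study population; weights targeting the treated group alone are constructed differently):

```python
import numpy as np

def iptw_weights(t, propensity):
    """IPTW: treated subjects are weighted by 1/e(x), controls by
    1/(1 - e(x)), where e(x) is the estimated propensity score."""
    t = np.asarray(t)
    e = np.asarray(propensity)
    return np.where(t == 1, 1.0 / e, 1.0 / (1.0 - e))

# A control whose propensity score is high (0.8) resembles the treated
# group and therefore gets a large weight
print(iptw_weights([1, 0, 0], [0.8, 0.8, 0.2]))  # -> [1.25 5.   1.25]
```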
Subclassification, also called stratification, is a method that also uses all subjects, by forming groups (subclasses) of individuals with similar propensity scores (31). In the Medicaid study the subclasses were created to have approximately the same number of subjects taking atypical antipsychotics (about 565); the number of conventional users in each subclass ranges from 287 to 933 (Figure 1 Panel B; Table 4). Because of the properties of propensity scores described above, within each subclass, the subjects look similar on the measured confounders.
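Forming such subclasses at the quintiles of the treated group's propensity scores can be sketched as follows (the scores are simulated; as in the study's design, this holds the number of treated subjects per subclass constant while the control counts vary):

```python
import numpy as np

rng = np.random.default_rng(3)
ps_treated = rng.uniform(0.2, 0.9, size=200)   # simulated propensity scores
ps_control = rng.uniform(0.1, 0.8, size=200)

# Cut points at the 20th/40th/60th/80th percentiles of the treated scores
cuts = np.quantile(ps_treated, [0.2, 0.4, 0.6, 0.8])
subclass_treated = np.digitize(ps_treated, cuts)   # subclass labels 0..4
subclass_control = np.digitize(ps_control, cuts)

print(np.bincount(subclass_treated, minlength=5))  # 40 treated per subclass
print(np.bincount(subclass_control, minlength=5))  # control counts vary
```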
Is it better to match or to stratify/weight? The answer depends on whether the investigator is more concerned about bias or about having enough power to detect an effect. Matching approaches are often used when it is important to reduce differences between treatment groups as much as possible; consequently, not all subjects are used, which reduces the total sample size available to detect differences. While subclassification and weighting retain all subjects (generally yielding some efficiency gain), there is a risk of making comparisons between individuals who are not as alike as desired.
How do we know if the atypical and conventional groups are “similar,” at least on the measured covariates? After using one of the approaches described above, the crucial next step is to check the resulting “balance”: the similarity of the confounders between the treatment and comparison groups. Common (and sometimes misguided) measures used for balance checks are standard hypothesis tests, such as t-tests. The danger in using test statistics is that they conflate changes in balance with changes in the sample size; comparing p-values before and after matching can be misleading, implying that balance has improved when in fact it has not (1,11).
A good balance measure, and the one we suggest, is the standardized difference in means. This is most appropriate for continuous variables. A general rule of thumb is that an acceptable standardized difference is less than 10% (11). Differences larger than 10% roughly imply that 8% or more of the area covered by atypical and conventional users combined is not overlapping.⁴ For binary variables the absolute value of the difference in proportions is examined. These measures are generally calculated both in the full dataset (Table 2, Column 3), as well as in the dataset after applying one of the propensity score methods described above (Table 2, Column 4); if the propensity score method was successful the standardized differences and differences in proportions should be smaller than they were in the original data set. After 1:1 matching (Table 2, Column 4) the largest standardized difference is 3%, which is a good situation. Similar balance was achieved with weighting and subclassification. In contrast, the largest standardized difference prior to matching was 26%, which is clearly an unacceptable situation. In some cases adequate balance may not be achieved with the available data. This is an indication that estimating the treatment effect with that data may be unreliable. It may be necessary to add interactions of the measured covariates in the propensity score model, seek additional data sources, or reconsider the question of interest.
Once adequate balance is achieved, the next step is to estimate the treatment effect. Note that this is the first time that the outcome is used; the propensity score method itself is not selected or implemented using the metabolic outcome measures, beyond the idea of selecting confounders that may be correlated with the outcome(s).
One method of estimating the treatment effect is to regress the outcomes for subjects in the original (unmatched) dataset on the measured confounders. In the antipsychotic study, we estimated a linear regression, where the coefficient of the atypical antipsychotic variable represents the increase (or decrease) in risk for atypical users. The results of this approach are shown in Table 3, Column 1, where atypical antipsychotic use increases the risk of dyslipidemia and of obesity. This regression is easy to conduct, but has the drawbacks discussed above, particularly when the treatment groups are far apart on the covariates. Despite these limitations of regression adjustment on its own, however, combining it with the propensity score methods described above has been found to be very effective (10,32–34), and we use that combined approach for the remaining methods.
Outcome analysis after 1:1 nearest neighbor matching is very straightforward. With paired data and binary outcomes, a natural method is McNemar’s test. McNemar’s test indicates a statistically significant adverse effect of atypical antipsychotics on obesity (χ² = 14.61 on 1 degree of freedom; p = 0.0001): 5% of the 3,384 pairs had discordant outcomes and in 65% of the discordant pairs, the atypical subjects had obesity.
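McNemar's test depends only on the discordant pairs. A sketch (the counts below are illustrative, chosen to roughly match the percentages reported above, not the study's exact cell counts; no continuity correction is applied here):

```python
from scipy.stats import chi2

def mcnemar(b, c):
    """McNemar's chi-square (without continuity correction) for paired
    binary outcomes; b and c are the two discordant-pair counts."""
    stat = (b - c) ** 2 / (b + c)
    return stat, chi2.sf(stat, df=1)

# Roughly 5% of 3,384 pairs discordant, with the atypical member having
# the outcome in about 65% of them (illustrative counts: 110 vs. 59)
stat, p = mcnemar(110, 59)
print(f"chi2 = {stat:.2f} on 1 df, p = {p:.4f}")
```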
Alternatively, any analysis that would have been conducted on the full dataset can instead be conducted on the matched dataset (10). We estimated a regression model with each metabolic outcome predicted by whether someone took an atypical antipsychotic and the measured confounders, using the matched sample. Because the matching was done with replacement, the regression analysis was run using weights to account for that design (12). We find that atypical antipsychotics increased the risk of obesity, but not dyslipidemia or Type II diabetes (Table 3, Column 2), consistent with the results found using McNemar’s test.
After constructing IPTW weights, the effect estimate is obtained by estimating a weighted regression model using the IPTW weights (13). The results are consistent with those of the standard regression adjustment, indicating increased risk of dyslipidemia and obesity for those taking atypical antipsychotics (Table 3, Column 3).
With subclassification, treatment effects are first estimated separately within each subclass. Because of the potential for residual bias when the subclasses are relatively large, it is particularly important to estimate these effects using regression adjustment within each subclass, controlling for the confounders (13). If the treatment effects are similar across subclasses, it may make sense to combine the subclass-specific estimates to obtain an overall estimate. The results for the antipsychotic study do not indicate substantial treatment differences across subclasses (Table 4). After combining the subclass results by taking a precision-weighted average of the effects within each subclass, we find that the overall effects are similar to those from the simple regression adjustment and from weighting (Table 3, Column 4). An advantage of the subclassification approach is that it permits non-linear associations in the effects across the subclasses.
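Combining the subclass-specific estimates by a precision-weighted (inverse-variance) average can be sketched as follows; the subclass estimates and standard errors below are invented for illustration:

```python
import numpy as np

def pooled_estimate(effects, ses):
    """Inverse-variance weighted average of subclass-specific effect
    estimates, with the standard error of the pooled estimate."""
    effects = np.asarray(effects, dtype=float)
    w = 1.0 / np.asarray(ses, dtype=float) ** 2   # precision weights
    overall = np.sum(w * effects) / np.sum(w)
    return overall, np.sqrt(1.0 / np.sum(w))

# Hypothetical risk differences and standard errors for five subclasses
est, se = pooled_estimate([0.04, 0.06, 0.05, 0.07, 0.05],
                          [0.02, 0.02, 0.03, 0.02, 0.03])
print(f"overall effect = {est:.3f} (SE {se:.3f})")
```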
Selection of matching versus subclassification or weighting involves a bias/variance trade-off. One-to-one matching generally yields more closely matched samples and thus lower bias, but higher variance because of the smaller sample size used. The better balance generally obtained by matching also sometimes yields smaller point estimates of effects. In our example, the lack of a statistically significant finding on dyslipidemia when using 1:1 matching, but a significant finding when using the other approaches, appears to result from a combination of these factors. The effect on dyslipidemia is much weaker than the effect on obesity: for dyslipidemia, 53% of the discordant pairs had an atypical user with dyslipidemia (χ² = 2.613 on 1 degree of freedom; p = 0.11), whereas for obesity, 65% of the discordant pairs had an atypical user with obesity. The discrepancy in results also indicates the value of assessing sensitivity by trying a few different approaches; those that yield the best covariate balance should be used (10).
The final question in any non-experimental study is how sensitive the results are to a potential unmeasured confounder. We illustrate an approach that determines how strongly an unmeasured confounder would have to be related to the decision to fill an atypical antipsychotic prescription in order to make the observed effect go away (i.e., lose statistical significance) (35). We illustrate the approach using the matched pairs from 1:1 matching and the obesity outcome. Table 5 indicates that for two subjects who appear similar on the measured covariates, if their odds of filling an atypical antipsychotic prescription differ by a factor of 1.5 or larger, then the treatment effect becomes statistically insignificant. The size of these odds needs to be interpreted in the context of the particular problem. In our analyses, the largest observed odds ratio was 1.75 (95% CI: 1.55, 1.98), reflecting an increased odds of receiving an atypical antipsychotic for white subjects relative to black subjects. Given the size of this observed odds ratio, the small number of confounders available in the data, and the sensitivity of the results at an odds ratio of 1.5, we are cautious in concluding that atypical antipsychotic use increases the risk of obesity compared to conventional antipsychotic use. These results need to be replicated in other studies.
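For matched pairs with a binary outcome, this style of sensitivity analysis has a simple form: for a hypothesized factor Γ by which hidden bias could shift the within-pair odds of treatment, compute the worst-case p-value. A sketch following Rosenbaum-style bounds for the sign/McNemar test (the counts are illustrative, not the study's exact cells):

```python
from scipy.stats import binom

def worst_case_p(n_discordant, n_treated_worse, gamma):
    """Upper bound on the one-sided McNemar/sign-test p-value when an
    unmeasured confounder can multiply the within-pair odds of treatment
    by up to gamma. gamma = 1 recovers the usual test."""
    p_plus = gamma / (1.0 + gamma)   # worst-case per-pair probability
    return binom.sf(n_treated_worse - 1, n_discordant, p_plus)

# Illustrative: 169 discordant pairs, atypical member with the outcome in 110
for gamma in (1.0, 1.25, 1.5):
    print(f"gamma = {gamma:.2f}: worst-case p <= {worst_case_p(169, 110, gamma):.3f}")
```

A result that loses significance already at a modest Γ, such as 1.5, is fragile to hidden bias, which is the pattern reported above.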
This paper has provided an overview of the approaches for estimating treatment effects with non-experimental data, with a focus on propensity score methods that ensure comparison of similar individuals. While in this study the propensity score approaches gave results similar to those of traditional regression adjustment, we can have more confidence because of the balance obtained by the matching, weighting, and subclassification methods. The methods generally imply increased risk of dyslipidemia and obesity for individuals on atypical antipsychotics and no increased risk of Type II diabetes. However, we should interpret these results with caution, as the effect on dyslipidemia was sensitive to the particular method used and even the (stronger) effect on obesity is potentially sensitive to an unmeasured confounder.
There are a number of complications that researchers may encounter when designing an observational study. The first is missing data: rarely do researchers measure all of the variables of interest for all study subjects. If there are not many patterns of missing data, a first solution is to estimate separate propensity scores for each missing data pattern (6). A second approach is to include missing data indicators in the propensity score model; this will essentially match individuals on both the observed values (when possible) and on the patterns of missingness (36,37). A third approach is to use multiple imputation and undertake the propensity score matching and outcome analysis separately within each multiply imputed dataset (38).
A second complication involves questions where the treatment of interest is not a simple binary comparison. Interest might be in the effect of different types or dosages of antipsychotic medications. Two solutions exist in this type of setting. First, if scientifically interesting, focus can be shifted to a binary comparison, for example comparing low vs. high doses. Second, a new area of methodological research has developed generalized propensity scores for use with non-binary treatments (5,16,39).
A final concern with any non-experimental study is that of unmeasured confounding: there may be some unmeasured variable related to both which treatment an individual receives and their outcome. Using propensity score approaches to deal with measured confounders is an important step, but there is always concern about effects of unmeasured confounders. One approach to assess whether this could be a problem is to examine an outcome that should not be affected by the treatment of interest; if an effect is actually found, that may indicate the presence of unmeasured confounding. We have also illustrated here a statistical sensitivity analysis, which can be used to assess how important such an unmeasured confounder may be with respect to the study conclusions.
What are the primary lessons? When reading a study that uses non-experimental data, readers should:
When estimating treatment effects using non-experimental methods, researchers should:
In conclusion, propensity score approaches such as matching, weighting, and subclassification are an important step forward in the estimation of treatment effects using observational data. Whenever treatment effects are estimated using non-experimental studies, particular care should be taken to ensure that the comparison is being done using treated and comparison subjects who are as similar as possible; propensity scores are one way of doing so. Propensity score methods can thus help researchers, as well as users of that research, to have more confidence in the resulting study findings.
Dr. Stuart’s effort was supported by the Center for Prevention and Early Intervention, jointly funded by the National Institute of Mental Health (NIMH) and the National Institute on Drug Abuse (Grant MH066247; PI: N. Ialongo). Dr. Normand’s effort was supported by Grant MH61434 from NIMH. Dr. Gibbons’ effort was supported by NIMH Grant R56-MH078580, and Dr. Horvitz-Lennon’s by NIMH Grant P50-MH073469. The authors are indebted to Larry Zaborski, MS, Harvard Medical School, for earlier programming help and to Richard Frank, PhD, Harvard Medical School, for generously providing the Medicaid data.
² Although our outcomes are binary, we present results from a linear regression model. This was done for comparability with the analyses described for the propensity score approaches with weights. If a logistic regression model is used, the difference in absolute risk can be obtained by comparing predictions of the outcomes for the full sample under each of the treatment conditions. In this study the results are virtually identical. Section IV provides more detail.
³ Often the matches are based on the logits (the log-odds of the predicted probabilities) because the logits have better statistical properties.
⁴ The 10% threshold is a small effect size using Cohen’s effect size criteria (21).