All of the research designs and estimation methods discussed in this paper have advantages and disadvantages. All rely on assumptions, many of which are untestable, especially if one is allowed to use only the abstract numbers in the current data file. All can produce valid causal inference when their assumptions plausibly are met. Pearl's (2000)
directed acyclic graphs provide a framework for identifying the causal assumptions that are necessary to draw valid causal inference from hierarchical models (i.e., models in which the causal arrows all go in one direction). The following comparison of RCTs and IV estimation could be extended to the other observational data methods discussed later in the paper.
There are many disadvantages to IV estimation, in addition to the difficulty of testing that Z is uncorrelated with the error term in the outcome equation. IV estimation is known to perform poorly in the presence of weak instruments (Murray 2006).
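The contrast between OLS and IV can be made concrete with a small simulation. The sketch below uses illustrative variable names and parameter values (all assumptions for the demonstration, not from the paper): an unobserved confounder u drives both the treatment T and the outcome Y, so OLS is biased, while the simple IV ratio estimator recovers the true effect because the instrument Z shifts T but not Y directly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Simulated data with an endogenous treatment: the unobserved
# confounder u raises both t and y, so OLS over-states the effect.
z = rng.normal(size=n)                   # instrument: affects t, not y directly
u = rng.normal(size=n)                   # unobserved confounder
t = 0.8 * z + u + rng.normal(size=n)
y = 2.0 * t + u + rng.normal(size=n)     # true treatment effect = 2.0

# OLS slope: cov(t, y) / var(t)  (biased upward by u)
beta_ols = np.cov(t, y)[0, 1] / np.var(t, ddof=1)

# IV estimator: cov(z, y) / cov(z, t)
beta_iv = np.cov(z, y)[0, 1] / np.cov(z, t)[0, 1]

print(round(beta_ols, 2), round(beta_iv, 2))  # OLS biased, IV near 2.0
```

Shrinking the 0.8 coefficient on z toward zero makes the instrument weak: the denominator cov(z, t) approaches zero and the IV estimate becomes erratic, which is the weak-instrument problem noted above.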
In some situations, the estimated treatment effect easily could be heterogeneous (varying from one subject to another). But the estimated treatment effect also might vary from one type of Z to another. For example, one might not expect the same response to a smoking cessation program if the subjects enrolled in it because (a) they were encouraged to do so by a friend; (b) they were encouraged to do so by their employer; or (c) they were paid to do so.
In other situations, the IV Z may be an important determinant of assignment to the treatment versus control group for only a portion of the subjects. Some subjects may have chosen the treatment or control group regardless of the value of the instrument. Patients for whom the instrument is an important determinant of choice of the treatment versus control group are said to be “on the margin” with respect to the instrument.
If the effect of the treatment is homogeneous across subjects, then the treatment effects estimated for patients on the margin will be identical to those for the rest of the sample. If the treatment effect is different for different subjects (heterogeneous), then it is important to ask, “Who is the marginal subject?” (Harris and Remler 1998
). Are subjects for whom the instrument works well the same subjects for whom the treatment has a significant effect? Models with heterogeneous treatment effects are more difficult to estimate in any research design, including RCTs, and require samples large enough to estimate the different treatment effects in each subsample.
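The "marginal subject" point can be illustrated with a simulation (the subject-type shares and effect sizes below are illustrative assumptions). Only compliers change their treatment status in response to the instrument, so the IV (Wald) estimator recovers the effect for subjects on the margin, not the population-average effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Three subject types: compliers take the treatment only when the
# binary instrument z encourages them; always-takers and never-takers
# ignore z entirely.
typ = rng.choice(["complier", "always", "never"], size=n, p=[0.4, 0.3, 0.3])
z = rng.integers(0, 2, size=n)
t = np.where(typ == "always", 1, np.where(typ == "never", 0, z))

# Heterogeneous effects: the treatment helps compliers (+2.0) more
# than always-takers (+1.0); never-takers are never treated.
effect = np.where(typ == "complier", 2.0, 1.0)
y = effect * t + rng.normal(size=n)

# Wald estimator: (E[y|z=1] - E[y|z=0]) / (E[t|z=1] - E[t|z=0])
wald = (y[z == 1].mean() - y[z == 0].mean()) / (t[z == 1].mean() - t[z == 0].mean())
print(round(wald, 2))  # close to 2.0, the complier (marginal-subject) effect
```

A different instrument that moved a different subpopulation onto the margin would estimate a different effect, which is exactly the smoking-cessation point above.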
Although RCTs are vulnerable to attrition bias, compliance problems and cross-contamination of the treatment and control groups, an important advantage of RCTs is that randomization affects assignment equally for all
the subjects in the experiment. Nonetheless, heterogeneous treatment effects still are a problem. In extreme cases of heterogeneous treatment effects, an RCT cannot distinguish between a treatment that has no effect on the outcome and one that either kills the patient or saves her life. The problem is that unobserved variation in the patient's medical condition can act as a moderator of the effect of the treatment on the outcome of interest. An example, adapted from Pearl (2000)
, is provided in Appendix SA2
. Heckman (2008)
also notes that RCTs do not permit study of subjects' choice of the treatment versus control group, which could have a dramatic effect on the average treatment effect in real-world applications if subjects who self-select into the treatment group have a different response than those who were randomized to the treatment.
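The extreme case described above can be reproduced in a few lines. In this sketch (a deliberately stylized assumption, echoing the Pearl-style example rather than reproducing the appendix), the treatment saves half the subjects and kills the other half, yet the RCT's average treatment effect is indistinguishable from that of a useless treatment.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# Unobserved moderator: for half the subjects the treatment is
# life-saving, for the other half it is lethal.
helped = rng.integers(0, 2, size=n).astype(bool)
t = rng.integers(0, 2, size=n)        # randomized assignment

# Outcome: 1 = survives. Survival depends on the interaction of the
# treatment with the unobserved moderator.
y = np.where(helped, t, 1 - t)

ate = y[t == 1].mean() - y[t == 0].mean()
print(round(ate, 2))                  # near 0: looks like "no effect"
```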
In RCTs it is difficult to ensure close correspondence of either the intervention or the subjects as specified in the RCT to real-world application. Internal validity of a study is improved by careful specification of both the intervention and the eligible subjects, but narrow specification of either reduces the study's external validity, that is, its generalizability.
Critics also cite the expense of RCTs, and the ethical paradox that the cost entails. Unless one has good reason to think that the treatment may be effective, the cost of an RCT is difficult to justify. However, if one has good reason to think that the treatment is effective, it is difficult to justify withholding it from the control group.
Particularly in the small samples that are common in clinical trials, randomization can fail to achieve balance on observed covariates. As Urbach (1985)
notes, Fisher's original rationale for randomization was to justify the statistical significance of the test of the null hypothesis (no difference in outcomes in the two groups). But suppose that following randomization, the treatment and control groups are unbalanced on some observable variables. What should the analyst do? The analyst might rerandomize patients to the treatment and control groups and test again for differences in observed variables, but Urbach notes that:
… if randomization purports to underwrite the significance test, then picking and choosing among the distributions thrown up by chance is quite illicit, for this undermines that test. But if we do not allow this kind of post-randomization selection, we might end up with test groups that differ substantially in their relevant characteristics, in which case we shall be assured of reaching a false conclusion. (p. 260)
Alternatively the analyst might use multivariate regression (or propensity scores) to control for differences in observed covariates. But if one is prepared to rely on statistical controls to correct imbalance, why bother with randomization in the first place? The hope is that randomization will balance unobserved confounders in the treatment and control groups, but how justified is that hope if randomization has failed to achieve balance on observed confounders?
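How often simple randomization fails to balance observed covariates is easy to check by simulation. The trial size, covariate count, and imbalance threshold below are illustrative assumptions, chosen to resemble a small clinical trial.

```python
import numpy as np

n, k = 40, 10          # small trial with 10 baseline covariates

# Simulate many small trials and ask how often randomization leaves at
# least one covariate badly imbalanced between the two arms.
def max_imbalance(seed):
    r = np.random.default_rng(seed)
    x = r.normal(size=(n, k))                     # unit-variance covariates
    arm = r.permutation(np.repeat([0, 1], n // 2))
    diff = x[arm == 1].mean(axis=0) - x[arm == 0].mean(axis=0)
    return np.abs(diff).max()                     # worst mean difference

worst = np.array([max_imbalance(s) for s in range(1_000)])
frac = (worst > 0.5).mean()   # share of trials with a gap above 0.5 SD
print(round(frac, 2))
```

With these assumed numbers, a majority of simulated trials show at least one covariate gap larger than half a standard deviation, which is the situation Urbach's dilemma addresses.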
A valid but expensive response to unbalanced groups is to continue with the experiment and then repeat the experiment many times, compiling the results into an unbiased estimate of the average treatment effect computed across all the experiments.13 Deaton (2009)
discusses other potential difficulties with RCTs.
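The repeated-experiment logic can be checked directly. In the simulated sketch below (trial size, error scale, and the true effect of 1.0 are illustrative assumptions), any single small trial can be thrown far off by chance imbalance on a prognostic covariate, but the average across many replications recovers the true effect.

```python
import numpy as np

true_effect = 1.0
n = 20                                  # small, imbalance-prone trial

def one_trial(r):
    x = r.normal(size=n)                # prognostic baseline covariate
    arm = r.permutation(np.repeat([0, 1], n // 2))
    y = true_effect * arm + x + 0.5 * r.normal(size=n)
    return y[arm == 1].mean() - y[arm == 0].mean()

estimates = np.array([one_trial(np.random.default_rng(s)) for s in range(2_000)])
print(round(estimates.mean(), 2))       # close to the true effect of 1.0
print(round(estimates.std(), 2))        # single trials vary widely around it
```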
Additional Observational Data Methods
The impracticality of RCTs for many questions has led to the development of a variety of analytic approaches for observational data in addition to IV. The sample selection model of Heckman (1974) and Lee (1976) jointly estimates an equation that represents endogenous choice of the treatment (e.g., T estimated as a function of Z, additional covariates, and v) along with an outcome equation (e.g., Y estimated as a function of T, additional covariates, and u). The correlation of u and v is estimated as one of the model's parameters. The data requirements for the IV and sample selection models are the same,14 and the sample selection model typically is estimated either by a two-step or a maximum likelihood estimator (MLE) method. The MLE has been criticized for its assumption that the error terms u and v have a bivariate normal distribution, but the two-step estimator can be derived by assuming only that v and u have a linear regression relationship (Olsen 1980). Lee (1983) showed that the MLE can be generalized to any distribution of v and u.
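A two-step estimator of this kind can be sketched in a short simulation. Everything below (sample size, parameter values, the correlation of 0.6 between the errors) is an illustrative assumption: a probit first stage for selection, then an outcome regression augmented with the inverse Mills ratio, which corrects the bias that naive OLS on the selected sample exhibits.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(5)
n = 20_000

# Simulated sample-selection model: y is observed only for subjects who
# select in, and the selection error v is correlated with the outcome
# error u (rho = 0.6), so OLS on the selected sample is biased.
z = rng.normal(size=n)                 # shifts selection, excluded from outcome
x = rng.normal(size=n)
rho = 0.6
v = rng.normal(size=n)
u = rho * v + np.sqrt(1 - rho**2) * rng.normal(size=n)
s = (0.5 + z + x + v) > 0              # selection equation
y = 1.0 + 2.0 * x + u                  # outcome, true slope on x = 2.0

# Naive OLS on the selected sample (biased slope on x).
Xs = np.column_stack([np.ones(s.sum()), x[s]])
naive = np.linalg.lstsq(Xs, y[s], rcond=None)[0]

# Step 1: probit of selection on a constant, z, and x (maximum likelihood).
W = np.column_stack([np.ones(n), z, x])
def negll(g):
    p = np.clip(norm.cdf(W @ g), 1e-10, 1 - 1e-10)
    return -(np.log(p[s]).sum() + np.log(1 - p[~s]).sum())
g_hat = minimize(negll, np.zeros(3), method="BFGS").x

# Step 2: OLS of y on x plus the inverse Mills ratio, selected sample only.
idx = W[s] @ g_hat
mills = norm.pdf(idx) / norm.cdf(idx)
X2 = np.column_stack([np.ones(s.sum()), x[s], mills])
beta = np.linalg.lstsq(X2, y[s], rcond=None)[0]
print(round(naive[1], 2), round(beta[1], 2))  # biased vs. corrected slope
```

Dropping z from the selection equation would leave the second step identified only by the nonlinearity of the Mills ratio, which is why an exclusion restriction is usually insisted upon in practice.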
One popular choice for an instrument is a "natural" experiment: an exogenous shock to a system that affects the causal variable of interest but is uncorrelated with the error term in the (outcome) equation of interest. A change in tax policy that occurs in one state but not in another might alter the relative price of a good or service in ways that facilitate study of the effects of prices on consumption. Of course, the obvious question is, "Why did the change in tax policy occur in one state but not another?" The analyst must argue that the (unobserved) factors that caused the change in policy have no direct causal connection to the dependent variable except through the price of the taxed commodity.
When data are available on the same subjects before and after the intervention (e.g., the occurrence of the natural experiment's shock) difference-in-difference (DID) models can be used to compare changes in the dependent variable in the preintervention and postintervention time periods and contrast those changes for the treatment and control groups. DID models based on only one observation per subject in the preintervention and postintervention time periods cannot capture differences in time trends in either period. When multiple observations are available on the same subject in the preintervention and postintervention time periods, panel data methods (referred to in some literatures as “interrupted time series”) can be used to estimate trend lines in the preintervention and postintervention periods. In observational data, fixed effects for the research subjects often are used to control for the effects of time-invariant characteristics of the subject that could represent unobserved confounders.
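The basic two-group, two-period DID calculation amounts to differencing out group and period effects. The sketch below uses simulated data with illustrative numbers (a common time trend of +2.0, a permanent group difference, and a true policy effect of +1.5).

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5_000

# Two groups, two periods: both share a common time trend; the policy
# shock adds +1.5 to the treated group in the post period only.
group = rng.integers(0, 2, size=n)               # 1 = treated state
base = 3.0 * group + rng.normal(size=n)          # permanent group difference
y_pre = base + 0.3 * rng.normal(size=n)
y_post = base + 2.0 + 1.5 * group + 0.3 * rng.normal(size=n)

# DID: change in the treated group minus change in the control group.
did = ((y_post[group == 1].mean() - y_pre[group == 1].mean())
       - (y_post[group == 0].mean() - y_pre[group == 0].mean()))
print(round(did, 2))    # close to the true policy effect of 1.5
```

The permanent group difference and the common trend both cancel; what the estimator cannot remove, as the text notes, is a difference in trends between the groups, which is why multiple pre- and post-period observations are valuable.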
Regression discontinuity models are applicable when a break point in a continuous assignment variable divides the sample into the treatment and control groups (Cook 2008
). An example is school children who either are or are not given remedial help based on a break point in a baseline test score. Subjects just to either side of the break point are assumed to be similar. Causal inference from regression discontinuity models can be sensitive to accurate modeling of the often nonlinear relationship between the assignment variable and outcome variable.
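A sharp regression discontinuity estimate can be sketched by fitting separate local regressions on each side of the break point and taking the gap between them at the cutoff. The remedial-help framing follows the example above; the cutoff, bandwidth, and effect size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 4_000

# Sharp RD: children scoring below the cutoff (0) get remedial help,
# which raises the outcome by a true effect of 2.0.
score = rng.uniform(-1, 1, size=n)            # baseline test score
treated = score < 0
y = 1.0 + 0.8 * score + 2.0 * treated + 0.5 * rng.normal(size=n)

# Local linear fits within a bandwidth on each side of the cutoff;
# the treatment effect is the gap between the fits at the break point.
h = 0.5
left = (score < 0) & (score > -h)
right = (score >= 0) & (score < h)
b_left = np.polyfit(score[left], y[left], 1)
b_right = np.polyfit(score[right], y[right], 1)
gap = np.polyval(b_left, 0.0) - np.polyval(b_right, 0.0)
print(round(gap, 2))      # close to the true effect of 2.0
```

Here the underlying relationship really is linear, so the linear fits are correct by construction; with a curved relationship, the same code would illustrate the sensitivity to functional form that the text warns about.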