Recent legislation around investments in comparative effectiveness research (CER) has raised awareness of, and enthusiasm for, the development of methods for such research. A contemporaneous investment in health information technology has raised hopes for the development of richer and more comprehensive observational databases based on electronic medical records. Despite the push for greater use of such databases in CER, the fundamental methodological challenge of selection bias arising out of non-random assignment of treatments remains. Since the goal of CER is to generate information that can inform better treatment selection in practice, causal estimation of treatment effects remains central to the CER theme. Otherwise, interventions that do not provide sufficient value may be adopted and treatments that do may be eliminated.

Selection bias (i.e., confounding by indication) arises when factors that can influence the treatment choice, such as patient health and provider skills, also influence outcomes. This is a common phenomenon in observational studies of treatment outcomes. The significance of this well-known limitation was famously illustrated in the case of hormone replacement therapy in post-menopausal women. As several large-scale observational studies consistently showed these treatments to be effective for preventing chronic cardiovascular disease, hormone replacement therapy became widely adopted. Use then plummeted when these studies were eventually disproven by a large randomized trial [35]. It has subsequently been shown that the reason for the discrepant results was that the observational studies failed to consider certain confounders like socioeconomic status [25] or failed to distinguish initiation of therapy from prevalence of therapy [24]. The significance of overcoming the limitations of common observational study designs cannot be overstated, as doing so could lead to fewer mistaken conclusions regarding treatment effectiveness and a greater use of sound observational studies to develop the evidence base of comparative effectiveness research.

A wide range of statistical methods have been developed to address overt selection bias, that is, bias that arises from differences between patients receiving different treatments in the levels of confounders that are observed by the analyst of the observational data. Some of the most common techniques used to address overt bias include regression methods, propensity score matching and doubly robust estimators [3, 34, 36, 38, 40]. Techniques that rely on propensity scores, and related techniques that ensure balance of confounders between groups, are being widely adopted in comparative effectiveness research as they often provide better estimates of treatment effects [39] and can be implemented across a wide range of settings using readily available data. However, these methods have limitations if confounders that are not observed by the analyst give rise to hidden selection bias [43, 44]. This hidden selection bias presents the biggest challenge for comparative effectiveness research, as aptly illustrated in the hormone replacement therapy example.

Because of the prevalence of hidden selection bias, instrumental variable (IV) analysis, whose origins date back to the 1920s [42], has been a cornerstone method for observational studies. In the last couple of decades, these methods have gained popularity in the medical literature on the evaluation of alternative medical treatments [9, 10, 14, 27, 43], a type of evaluation that was previously by and large restricted to clinical trials. The instrumental variables determine or affect treatment choice, but do not have a direct effect on outcomes except to the extent that they influence the choice of treatment [1, 2, 16]. Thus, IVs can induce substantial variation in the treatment variable while having no direct effect on the outcome variable of interest. One can then estimate how much the variation in the treatment variable that is induced by the instrument (and only that induced variation) affects the outcome measure. In econometric terminology, this induced variation is called the *exogenous variation* and identifies the desired estimate. These analyses constitute an important body of work that has advanced the field of CER by going beyond establishing associations between treatments and outcomes to estimating causal effects of treatments on outcomes, of the kind that an RCT conducted on a similar population can inform. The adoption of these techniques for CER, although limited thus far, appears to be accelerating.
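
The logic of using only the instrument-induced variation can be illustrated with a small simulation. The sketch below is not taken from any of the cited studies: the data-generating process, the coefficient values, and the function name `simulate_iv` are all assumptions chosen for illustration. It assumes a binary instrument, a single unobserved confounder, and a treatment effect that is constant across patients, and it compares a naive treated-versus-untreated contrast with the simple Wald (ratio) form of the IV estimator.

```python
import numpy as np

def simulate_iv(n=200_000, true_effect=2.0, seed=0):
    """Compare a naive contrast with the Wald IV estimate (illustrative only)."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)            # confounder NOT seen by the analyst
    z = rng.binomial(1, 0.5, size=n)  # binary instrument, independent of u
    # Treatment choice depends on the instrument and on the hidden confounder.
    d = (0.8 * z + u + rng.normal(size=n) > 0.4).astype(float)
    # Outcome depends on treatment and the confounder, but not directly on z.
    y = true_effect * d + 1.5 * u + rng.normal(size=n)

    # Naive estimate: difference in mean outcomes by treatment received.
    naive = y[d == 1].mean() - y[d == 0].mean()
    # Wald estimate: effect of z on y divided by effect of z on d.
    reduced_form = y[z == 1].mean() - y[z == 0].mean()
    first_stage = d[z == 1].mean() - d[z == 0].mean()
    return naive, reduced_form / first_stage

if __name__ == "__main__":
    naive, wald = simulate_iv()
    print(f"naive: {naive:.2f}  wald IV: {wald:.2f}")
```

Under these assumed conditions the naive contrast overstates the effect, because the hidden confounder raises both the chance of treatment and the outcome, whereas the Wald ratio lands close to the assumed constant effect. The constant-effect assumption is doing real work here; it is exactly the assumption that is relaxed later in this introduction.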

The field of CER itself is also grappling with issues about heterogeneity of treatment effects. In many situations, people respond differently to the same treatment. This is called *response heterogeneity*. More importantly, the differential response from alternative treatments may vary across people. This is called *treatment-effect heterogeneity*, and it will be the primary focus of discussion in this paper. There are strong economic reasons why heterogeneity is important in this field [4, 6]. But what has received less attention is how such heterogeneity can compromise the traditional evidence generation infrastructure (e.g. randomized clinical trials and observational data analyses) in CER.

Let us take the case of IV approaches. An IV estimate of treatment effect using standard methods (e.g. two-stage least squares) is comparable to that arising from an RCT only under the assumption that treatment effects are constant for everyone in the population with the same observed characteristics. Even if treatment effects are allowed to be heterogeneous, IV estimates assume that patients or their physicians do not have any information, beyond what the analyst of the observational data possesses, that would enable them to anticipate these effects and to select the treatment that would potentially give them the largest benefits. Such assumptions are clearly a stretch for modeling treatment choices in health care, especially given the practical limitations of observational data in capturing all relevant information pertaining to treatment choices. Note that such assumptions are also implicitly made in RCTs, where selection into RCTs is hardly ever studied, even though there are several instances where clinicians have questioned the generalizability of RCTs [12].

When such assumptions are relaxed, recent econometric literature has demonstrated several limitations of the traditional and newer IV approaches that we discussed above [16, 17]. Relaxing them lets subjects and their providers self-select treatments based on the patient’s expected idiosyncratic gains; that is, unobserved characteristics of patients that influence treatment choices are also allowed to be moderators of treatment effects (I will later develop a weaker assumption than self-selection that can also lead to such moderation). Imbens and Angrist [26] showed that standard IV methods can identify parameters that reflect the treatment effects for a group of marginal patients, i.e. the patients whose actual treatment choices are driven by the specific instrumental variables, but who are otherwise indifferent between alternative treatments. Therefore, the marginal patients identified by an IV are entirely dependent on the specific instrument being used and how this instrument affects treatment choices [2, 16]. Consequently, the use of different instruments will produce different treatment effects because they represent the effects for different groups of marginal patients, and IV results become instrument dependent. The key insight, originally highlighted by Heckman [15], is that this makes IV results difficult to interpret and apply to clinical practice, where patients are often believed to select treatment based on their idiosyncratic net gains or preferences. Under such selection, what most traditional IV methods estimate is a Local Average Treatment Effect (LATE). This estimate is often substantially different from mean treatment-effect concepts such as the Average Treatment Effect (ATE). This result, in one sense, is analogous to the problems of interpreting RCT results when self-selection into RCTs is common. In fact, under heterogeneity and self-selection, even if results from IV methods applied to observational data and results from an RCT are both internally valid, there is no reason to expect that these results should tally with each other. Yet much of the applied literature has tried to replicate RCT results with IV methods.
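
The instrument dependence described above can be made concrete with a stylized simulation. Everything in the sketch is an assumption made for illustration rather than a result from the cited literature: a single unobserved trait `v` both raises a patient's gain from treatment and makes the patient more likely to choose it, the population average effect is fixed at 2.0, and each hypothetical instrument is represented simply as a shift in the threshold that `v` must exceed for treatment to be chosen. The function name `wald_late` is likewise hypothetical.

```python
import numpy as np

def wald_late(threshold_off, threshold_on, n=400_000, seed=0):
    """Wald IV estimate when an instrument moves the choice threshold.

    Patients are treated when their unobserved trait v exceeds a threshold;
    the instrument lowers that threshold from threshold_off to threshold_on.
    """
    rng = np.random.default_rng(seed)
    v = rng.normal(size=n)            # known to patient/physician, not to analyst
    gain = 2.0 + v                    # individual treatment effect; mean is 2.0
    z = rng.binomial(1, 0.5, size=n)  # binary instrument, independent of v
    threshold = np.where(z == 1, threshold_on, threshold_off)
    d = (v > threshold).astype(float) # selection on the idiosyncratic gain
    y = rng.normal(size=n) + d * gain # untreated outcome is pure noise here

    reduced_form = y[z == 1].mean() - y[z == 0].mean()
    first_stage = d[z == 1].mean() - d[z == 0].mean()
    return reduced_form / first_stage

if __name__ == "__main__":
    # Instrument A shifts patients with high v; instrument B shifts low-v patients.
    late_a = wald_late(threshold_off=1.0, threshold_on=0.5)
    late_b = wald_late(threshold_off=-0.5, threshold_on=-1.0)
    print(f"LATE, instrument A: {late_a:.2f}")
    print(f"LATE, instrument B: {late_b:.2f}")
    print("population ATE by construction: 2.00")
```

In this assumed setup both estimates are internally valid for their own marginal patients, yet instrument A yields an estimate above the population average and instrument B one below it, because the patients whose choices each instrument moves sit at different points of the gain distribution.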

To recover the full distribution of treatment effects across all possible margins of patients' choices, not just the one directly influenced by an IV, one needs to explicitly develop a choice model for treatment selection. This choice model tries to explain choices based on all observed risk factors and also all possible IVs that are identified in the data, so that for each predicted level of probability for treatment choice, we observe some patients choosing treatment and some who do not. One can then study how the difference in average outcomes between these two groups, the marginal treatment effect (MTE), varies over levels of the probability of treatment choice. This approach, known as the local instrumental variable (LIV) approach, uses control function methods to identify the MTEs and subsequently combines them to form interpretable and decision-relevant parameters of interest such as the ATE or the Effect on the Treated (TT) or the Untreated (TUT). (Heckman series) ATE estimates the average gain if everyone undergoes treatment as compared to an alternative treatment or no treatment at all. This has been one of the most popular parameters of interest for health economists and policy analysts when making inference about health care policies [46]. Treatment Effect on the Treated (TT) estimates the average gain to those who actually select into treatment and is one ingredient for determining whether a given treatment should be shut down or retained as a medical practice or in the formularies. It is informative on the question of whether the persons choosing the treatment benefit from it in gross terms. Recently, Basu et al. [5] applied these methods to estimate ATE and TT of breast cancer treatments on costs.
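
For readers who prefer notation, the relationship between the MTE and the aggregate parameters can be sketched as follows. The symbols are assumptions introduced here for illustration and are not defined elsewhere in this introduction: $Y_1$ and $Y_0$ are potential outcomes with and without treatment, $D$ is the treatment indicator, $X$ are observed risk factors, $Z$ are instruments, $P(X,Z)$ is the predicted probability of treatment, and $U_D$ is the unobserved resistance to treatment, normalized to be uniform on $[0,1]$ so that $D = 1\{P(X,Z) > U_D\}$. Under this standard latent-index formulation:

$$
\begin{aligned}
\mathrm{MTE}(x,u) &= E\left[\,Y_1 - Y_0 \mid X = x,\; U_D = u\,\right]
  = \left.\frac{\partial\, E\left[\,Y \mid X = x,\; P(X,Z) = p\,\right]}{\partial p}\right|_{p=u},\\[4pt]
\mathrm{ATE}(x) &= \int_0^1 \mathrm{MTE}(x,u)\,du,\\[4pt]
\mathrm{TT}(x) &= \int_0^1 \mathrm{MTE}(x,u)\,\omega_{TT}(x,u)\,du,
\qquad
\omega_{TT}(x,u) = \frac{\Pr\left(P(X,Z) > u \mid X = x\right)}{E\left[\,P(X,Z) \mid X = x\,\right]}.
\end{aligned}
$$

The ATE weights every margin equally, whereas the TT gives more weight to low values of $u$, that is, to patients who would choose treatment even when the predicted probability of treatment is low.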

In this paper, my goal is to highlight these challenges in the context of using instrumental variable methods on observational data and to discuss potential solutions to these problems.