The “what” and “when” questions are relatively easy to answer. “Why” is more difficult.

Francis Galton, Karl Pearson, and Correlation

Although many reviews of causality begin with Aristotle and Plato and work their way through David Hume in the 1700s (e.g., Holland 1986), the most recent methodological split in causal inference began in the late 1800s with the work of Francis Galton and Karl Pearson.

Karl Pearson, Francis Galton's student, credits Galton with the discovery of correlation (Pearson 1920). Pearson was captivated by correlation, but he rejected any notion of causation beyond correlation. For that reason, Judea Pearl refers to Pearson as “causality's worst adversary” (Pearl 2000, p. 105).

To put Pearson's objection in context, consider an outcome variable of interest, denoted *Y*. The value of *Y* for the *i*th subject is *Y*_{i}. *X*_{i} refers to the set of variables that might, in the ordinary language of everyday experience, have a causal influence on the value that *Y* assumes. The *X* variables could include the primary causal variable or mediators or surrogates by which the causal effect is transmitted from a more “fundamental” cause to *Y*. An *X* of particular interest is one that designates membership in the treatment group (*T*) versus the control group. *T* can be either a discrete or continuous variable. Although the distinction between a continuous versus discrete *T* has important implications for estimation, the discussion of the issues in this section applies to a one unit change in either type of treatment. Also, there could be many different treatments and many types of control groups, but for simplicity, the examples in this paper are limited to one of each. Although the language of causality can be controversial, it is important to remember that the entire purpose of forming and talking about treatment and control groups in the first place is that one thinks it is possible that the treatment might have a causal effect on the outcome of interest.

I state the fundamental causal question as, “What is the change in the expected value of *Y*, or the probability that *Y* assumes a particular value, when the value of *T* is changed, by external means, by one unit?” The emphasis on *change* separates questions of causality from questions of association. By “external means” I mean that the change in *T* is induced by some mechanism that is uncorrelated with the unobserved factors that affect *Y*. Randomization assigns research subjects to the treatment versus control group, but the act of randomization, per se, is assumed to have no direct effect on the outcome variable of interest. Similarly, the distance from a person's place of residence to a hospital that offers one type of treatment for acute myocardial infarction (AMI) versus another could affect the treatment an AMI patient receives, but that distance, per se, plausibly has no direct effect on the patient's immediate health outcome resulting from treatment at the hospital.

This version of the causal question reflects specific positions on a host of interesting and controversial questions. For example, references to the causal effect of *X* on the *expected* value of *Y*, or the *probability* that *Y* takes on a specific value, are a relatively recent development in the literature (Suppes 1970). Cartwright (2007) objects to the assumption that the values of other *X* variables can be held constant in observational studies when the value of *T* changes.

In some of the discussion, it will be helpful to refer to a linear model that takes the form^{1}:

*Y*_{i} = *β*_{0} + *β*_{T}*T*_{i} + *β*_{X}*X*_{i} + *u*_{i}  (1)

where *u* is unobserved error. The *β*'s are coefficients. The model is written in linear form for simplicity, but all the discussion in this paper can be generalized to models that are nonlinear in either the variables or the parameters, even though different estimation approaches are required.

Equation (1) has no causal interpretation, per se, as long as one is restricted to information contained in the numeric values of *Y, X*, and *T*. The numeric values of *Y, X*, and *T* cannot reveal which way the causal arrows go in . The direction of the arrows relies on information beyond the data in hand, for example, knowing that *X* assumes its value in a time period before the time period during which *Y* assumes its value. That knowledge does not establish a causal relationship between *X* and *Y* because many events could precede *Y* and still be causally unrelated to *Y*, or *X* and *Y* could be the result of some common cause *W*, but the knowledge that *X* precedes *Y* does at least rule out *Y*→*X*.

Pearson appears to have taken a conservative approach to allowing such extraneous information to inform the assessment of causal effects. However, as discussed later in the paper, if the restriction on information beyond the numeric values of *Y, X*, and *T* is enforced strictly, causal inference will be the least of the analyst's worries. The controversies in the causal literature center on precisely *what* information beyond the numeric values of *Y, X*, and *T* the analyst will be allowed to consider.

Now, suppose that in addition to the variables *T* and *Y* we add a confounder variable *W* that has a causal effect on both *T* and *Y*, as shown in . If *W* is unobserved, the resulting estimate of the causal effect of *T* on *Y* (*β*_{T}) will be biased. In that case, some analysts might refer to *W* as an unobserved confounder, while others would say that *T* and *Y* are “spuriously correlated.” Econometricians would refer to the problem as omitted variable bias.

If is true (the crucial assumption) and *W* is observed, an unbiased estimate of *β*_{T} is the partial correlation of *T* and *Y* controlling for *W* in standardized data (i.e., mean 0 and standard deviation 1 for all variables) and is computed by regressing *T* on *W* and *Y* on *W* in separate regressions. The correlation of the residuals from those two equations is *β*_{T}.
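The residual-on-residual procedure just described can be sketched in a few lines. The data-generating process, coefficients, and sample size below are hypothetical, chosen only to illustrate the computation: *W* confounds both *T* and *Y*, and the partial correlation is computed from the residuals of the two separate regressions on *W*.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical confounded data: W affects both T and Y.
W = rng.normal(size=n)
T = 0.8 * W + rng.normal(size=n)
Y = 0.5 * T + 0.7 * W + rng.normal(size=n)

def standardize(x):
    # Rescale to mean 0, standard deviation 1.
    return (x - x.mean()) / x.std()

Ys, Ts, Ws = map(standardize, (Y, T, W))

def residuals(y, x):
    # OLS residuals of y regressed on x (no intercept needed
    # because the standardized variables have mean 0).
    slope = (x @ y) / (x @ x)
    return y - slope * x

rT = residuals(Ts, Ws)  # T with the linear effect of W removed
rY = residuals(Ys, Ws)  # Y with the linear effect of W removed

# Partial correlation of T and Y controlling for W.
partial_corr = np.corrcoef(rT, rY)[0, 1]
raw_corr = np.corrcoef(Ts, Ys)[0, 1]
print(partial_corr, raw_corr)
```

Because *W* inflates the raw association between *T* and *Y* in this simulation, the partial correlation comes out noticeably smaller than the raw correlation, illustrating the bias reduction the text describes.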

From our vantage point, it seems obvious that the ability to calculate the correlation of *T* and *Y* controlling for *W*, thereby reducing (though not necessarily eliminating) the bias in *β*_{T}, would have been viewed as a major improvement in causal analysis. Apparently Pearson did not see it that way. There are several possible explanations. First, is filled with untestable assumptions if one is restricted to information contained in the numeric values of *Y, X*, and *T*. Second, the computational difficulty of controlling for multiple *W* variables in regression equations in the early 20th century posed a formidable practical limitation on causal analysis.^{2}

Pearson's reticence regarding information beyond the numeric values of *Y, X*, and *T* may have been due in part to the limited number of ways in which the value of *T* could arise when he published his history of correlation in 1920. Six years later, Ronald Fisher discovered the randomized trial.

Ronald Fisher and Philip Wright

Two major discoveries in the 1920s defined the alternative empirical approaches to causal analysis that, over time, split the field of statistics: the discovery of random assignment by Ronald Fisher (1926) and the discovery of IV by Philip Wright (1928). The split is shown in .

The key to the RCT was the analyst's ability to manipulate the causal variable of interest (*T*). That approach was quite feasible in Fisher-type problems, for example, the effect of different fertilizers (*T*) on crop yields (*Y*). The great attraction of randomization was its claim to render all unobserved confounding variables causally irrelevant, a claim questioned by Urbach (1985) and others.

The key to IV estimation (Appendix SA1) was the analyst's ability to identify an IV that, like randomization, affected the assignment of individuals to the treatment group versus the control group, but was uncorrelated with the unobserved factors that affect the outcome. The great attraction of IV estimation was that it could be applied to problems where random assignment to potentially endogenous^{3} explanatory variables was difficult or impossible.

shows a causal model in which unobserved variables (*v*) that affect or are affected by^{4} assignment to the treatment group (*T*) are correlated with unobserved variables (*u*) that affect the outcome variable of interest (*Y*), resulting in correlation of *T* and *u*, and thus biased estimates of *β*_{T}.^{5} The variable *Z* represents the variable that affects assignment of subjects to the treatment versus control group. *Z* could be either randomization or an IV. The important characteristic of *Z* is that it is uncorrelated with *u*. That characteristic is controversial in the case of IV because it is not directly testable using only the data represented by the variables in . That point is discussed in greater detail later in the paper.
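The contrast between a naive regression and an IV estimate can be illustrated with simulated data; the data-generating process and all coefficients below are hypothetical. With a single instrument, the IV estimator reduces to the ratio cov(*Z*, *Y*)/cov(*Z*, *T*), while ordinary least squares is biased by the correlation between *T* and *u*.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical data-generating process: u and v are correlated
# unobservables, so T is endogenous; Z shifts T but is drawn
# independently of u (the IV assumption).
u = rng.normal(size=n)
v = 0.8 * u + rng.normal(size=n)   # corr(u, v) != 0
Z = rng.normal(size=n)
T = 0.6 * Z + v
Y = 0.5 * T + u                    # true beta_T = 0.5

# Naive OLS slope of Y on T (all variables have mean ~0),
# biased upward because corr(T, u) > 0.
ols = (T @ Y) / (T @ T)

# IV estimate with a single instrument: cov(Z, Y) / cov(Z, T).
iv = np.cov(Z, Y)[0, 1] / np.cov(Z, T)[0, 1]

print(ols, iv)
```

In this simulation the OLS slope lands well above the true coefficient of 0.5, while the IV ratio recovers it, because *Z* is correlated with *T* but not with *u*.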

Fisher's discovery was well-received and his experimental approach to causal inference came to dominate the field of statistics, in part, because random assignment helped to link the relatively new field of statistics to the prestigious natural sciences. The statistician who could calculate the combinatorics for a Latin Square design and lead an RCT became an integral part of the natural science research team.

Wright's discovery of IV, on the other hand, was ignored for approximately 20 years before being rediscovered by the Cowles Commission after World War II (Stock and Trebbi 2003, p. 182). Even when econometricians adopted Wright's approach, they largely eschewed explicit causal language.

Pearl (1997) refers to the 20th century as a “century of denial” that valid causal inference could be drawn from observational data and laments “an alarming tendency among economists and social scientists to view a structural equation^{6} as an algebraic object that carries functional and statistical assumptions but is void of causal content.” He wonders, “… what has happened to (structural equation modeling) SEM over the past 50 years, and why the basic (and still valid) teachings of Wright, Haavelmo, Marschak, Koopmans, and Simon have been forgotten” (Pearl 2000, pp. 135–7).

Pearl (2000, p. 138) suggests that causal language was abandoned by SEM proponents in an attempt to mollify statisticians, “the arbiters of respectability.” Even today, it is surprising how much print in econometrics textbooks is devoted to the task of obtaining estimators with desirable large and small sample properties, for example, unbiasedness, consistency and efficiency, and how little is devoted to explaining exactly what one has estimated in an unbiased, consistent, and efficient manner.

Bertrand Russell and Physical Laws

Bertrand Russell was one of the towering figures in the fields of mathematics and philosophy in the early to mid 20th century. Russell's views on causality had a profound influence on the developing field of statistics. One of his most famous quotes is the following:

All philosophers, of every school, imagine that causation is one of the fundamental axioms or postulates of science, yet, oddly enough, in advanced sciences such as gravitational astronomy, the word “cause” never occurs … The law of causality, I believe, like much that passes muster among philosophers, is a relic of a bygone age, surviving, like the monarchy, only because it is erroneously supposed to do no harm. (Russell 1913)

If causal relationships would not conform to the physical laws, Russell was prepared to abandon causality. Russell maintained that causal relationships needed to have three attributes which he thought characterized all physical laws: (1) causal symmetry; (2) irrelevance of time ordering (time invariance); and (3) determinism (no stochastic processes involved).^{7}

Later developments in quantum physics proved Russell's assertions wrong,^{8} but Kempthorne (1978) cites a more fundamental problem with Russell's analysis: failure to recognize the importance of experimentation:

… I, a lowly statistician, am compelled to regard Russell as being very stupid in this connection, for the reason that he did not, it would seem, give the slightest recognition of the idea of experimentation. (Kempthorne 1978, p. 8)

Nonetheless, Russell's analogy to physical laws was an influential force in the development of statistics. Philip Wright anticipated that his estimation approach to omitted confounders would face enthusiastic opposition. He was careful to state that: “Estimates of [demand and supply]^{9} elasticities may be made, but any hope of obtaining numerical values comparable with results to be obtained in physical science must be abandoned.” The desire to maintain the link between statistics and the natural sciences survived despite Russell's errors and the irony that some of the most stunning achievements in physics in the early 20th century were based on observational data and conjectures for which no empirical test was possible at the time (e.g., Einstein's conjecture that identical clocks placed at the equator and either of the poles would keep different time).

The Split Solidifies

Over time, the different empirical approaches to causal analyses pioneered by Fisher and Wright solidified into the “experimental” versus “observational” approaches to empirical investigations of causality. But the mere development of two different approaches to causal analysis, by itself, cannot explain the deep and long-lasting intellectual divide between statisticians and structural equation modelers (SEMs) that came to include sociologists, econometricians, and other representatives largely from the social sciences. As Robert Moffitt (1996, p. 462) notes:

IV (instrumental variables) is widely regarded by economists as one of the most versatile and flexible techniques, applicable in an enormous number of disparate applications. Yet it is scarcely used or discussed by statisticians, who often do not see the point of it all.

There are several factors that contributed to the depth and endurance of the split. The first is that statisticians and SEMs often were working on different types of problems. The statisticians' problems often fit neatly into the *T*→*Y* framework of . Wright's model, based on the intersection of simultaneously but separately determined supply and demand curves, was very different from the standard problem of omitted variable bias, even the IV version shown in .

Second, statisticians often were working on a very special subset of the problems in where it was possible for the analyst to manipulate *T*, for example, through random assignment. The compelling appeal of randomization left statisticians suspicious of empirical investigations of causality where such manipulation was impossible.

Holland (1986, pp. 954, 959) describes the statistician's position:

Put as bluntly and contentiously as possible, in this article I take the position that causes are only those things that could, in principle, be treatments in experiments.

Donald Rubin and I once made up the motto

NO CAUSATION WITHOUT MANIPULATION

to emphasize the importance of this restriction.

This remarkable restriction eliminates race and gender as the cause of being denied a promotion or receiving inferior health care (Holland 1986, p. 946; Shadish, Cook, and Campbell 2002, pp. 7–8)—often a surprise to health services researchers engaged in disparities research. Holland elaborates:

As an example, the schooling a student receives can be a cause, in our sense, of the student's performance on a test, whereas the student's race or gender cannot. (Holland 1986, p. 946)

Variables like race or gender are termed “attributes” of subjects because they are not amenable to manipulation, and attributes cannot be causes.^{10} Statisticians might agree, however, that racial *discrimination* could cause people to be denied employment, proper health care, and so on, because racial discrimination, at least in theory, could be modified by the analyst.

The quote also is notable in light of the fact that, at the time Holland's article was written, Donald Rubin and Paul Rosenbaum were collaborating on the development of propensity score analysis of observational data (Rosenbaum and Rubin 1983, 1984), following Rubin's (1974) endorsement of “carefully controlled” analyses of nonrandomized data to estimate causal effects as a “reasonable and necessary procedure in many cases” (p. 688).

Objection to “Untestable Assumptions”

Although the split between statisticians and SEMs may have begun with different types of problems that were amenable to different empirical approaches, the difference often is formalized by statisticians into an objection regarding the untestable assumption of IV estimation, specifically that the variable *Z* in affects *T* but is uncorrelated with *u*. There are several points to make about objections to untestable assumptions.

First, what information should the analyst be allowed to use in tests of a model's assumptions? Should it be only the information contained in the numeric values of *Y, X*, and *T*? If so, the analyst is in deep trouble. Staring at a sheet of paper containing the values of *Y, X*, and *T*, how can the analyst know for sure that the variable in the first column actually is “diastolic blood pressure” rather than the last 100 winning numbers in the state lottery? How does the analyst know that the blood pressure variable was collected before the variable indicating whether the patient experienced a stroke? How does the analyst know that the data truly were generated by an RCT rather than a clever matching strategy? Causal inference in any setting, whether it is daily application of common sense or sophisticated data analysis, requires some assumptions about the process that generated the data and it often will be impossible to test those assumptions using only the single dataset in hand.

Second, statistical analyses are filled with untestable assumptions. What is the empirical test for the existence of the hypothetical repeated samples that underlie frequentist statistics? One can draw repeated samples from a known population, but when one has only a sample of data, generalizing to any conceptual population becomes a matter of speculation. What is the empirical test that any single sample, even one drawn from a population with a previously estimated mean, is not in fact a member of a “relevant subset” of the population that has a different mean (Johnstone 1989)?^{11} What is the empirical test that randomization was successful with respect to *unobserved* confounders? What is the empirical test that the Bayesian's priors (whether informative or not) are, in any sense, the best approximation to objective rather than subjective reality? In theory, these untestable assumptions could be prioritized, but until they are, broad objections to untestable assumptions minimally lack precision.

Third, discomfort with untestable assumptions can bleed over into a general aversion regarding models of relationships between unobserved variables. Holland (1986, p. 946) reminds us that statistics is concerned with measurement, and it is difficult to imagine measuring what one cannot observe.^{12} Statisticians seem comfortable with models of heteroscedasticity and autocorrelation, for example, that incorporate relationships among unobserved variables, and indeed those assumptions can be subjected to specification tests. But how is the assumption that the error terms for two people in a regression are correlated (perhaps because they belong to the same family) any more testable than the assumption that *u* and *v* in are correlated (the correlation produced by the Heckman–Lee sample selection model)?

Fourth, any assertion that attempts to limit the domain of “legitimate” knowledge is open to a self-referential critique. If an analyst asserts that legitimate causal knowledge can be obtained only from research designs free of untestable assertions, we might ask the analyst to prove that assertion using only information from research designs free of untestable assertions. This problem in logic may seem esoteric to health services researchers, but it is similar to the self-referential critique that led to the demise of logical positivism (the assertion that the only valid data are data obtained from our senses and subject to empirical verification) in philosophy departments in the mid 20th century.

The objection to untestable assumptions needs to be carefully delineated. There can be no general objection to all information beyond that contained in the numeric values of the variables, and once the door is open to additional information, the “untestable” assumptions of IV are likely to appear less forbidding.

Here is an example. In studies that compare the health outcomes resulting from two different treatments, a popular IV (*Z*) in health services research is the distance from the subject's place of residence to a health care facility that offers one type of treatment versus another. A famous example is the McClellan, McNeil, and Newhouse (1994) study of the effect of more intensive treatments on mortality in patients with acute myocardial infarction (AMI). The authors' assumption was that AMI patients did not choose their place of residence to be closer to a hospital offering one type of treatment for AMI patients versus another. That assumption is testable in two ways. First, one simply could ask the patients how they chose their place of residence. Second, one could agree with the authors that unobserved measures of health status were the variables most likely to represent unobserved confounders in the analysis. Then one could do what the authors did and check to see whether the distance variable was correlated with *observed* measures of health status. Since it was not, the authors concluded that it was unlikely that distance was correlated with *unobserved* measures of health status. The same comparison of observed variables in the treatment and control groups is used to test that randomization was successful.
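A minimal sketch of such a balance check, using simulated (hypothetical) data: correlate the candidate instrument with each observed health-status measure, as the authors did with distance. Small correlations are consistent with, but do not prove, instrument validity. The variable names and distributions below are illustrative assumptions, not taken from the study.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000

# Hypothetical candidate instrument and observed covariates,
# drawn independently to mimic a "valid instrument" scenario.
distance = rng.exponential(scale=10.0, size=n)
observed_health = rng.normal(size=(n, 3))  # e.g., age, blood pressure, comorbidity score

# Balance check: correlation of the instrument with each observed measure.
balance = [
    np.corrcoef(distance, observed_health[:, j])[0, 1]
    for j in range(observed_health.shape[1])
]
for j, r in enumerate(balance):
    print(f"covariate {j}: corr with distance = {r:+.3f}")
```

In practice one would run the same table of comparisons that an RCT report uses to show covariate balance between treatment and control groups, substituting the instrument for the randomization indicator.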