Suppose that you attend a meeting with 99 of your peers. Unbeknownst to you, 10 of the 100 attendees were pre-symptomatically infected with influenza upon arrival. We ignore transmission, the effects of which we would not see for a few days. Everyone at the meeting ate either one of 50 chicken sandwiches or one of 50 egg-salad sandwiches in a boxed lunch, chosen at random. Without further information, what is the expected number of influenza cases in the 50 people who ate a chicken sandwich? The answer is five, and the same is true for the 50 people who ate an egg-salad sandwich. The data in the accompanying table illustrate this example. In panel A of the table, the risk of influenza is 5/50 in both sandwich groups, and the risk difference is 0. Because sandwich type was randomized, we expect no association between pre-existing influenza and sandwich type. Ignoring chance imbalances due to the relatively small sample size, the data in panel A represent what we would expect to see in an infinite sample (one can think of each of the 50 people as representing a much larger number of homogeneous individuals).
Data illustrative of selection bias, due to conditioning on a collider
That evening, you and 54 others develop a 102°F fever. Let’s say that in our hypothetical world there are only two ways to get such a fever: influenza or consuming one of the 50 tainted egg-salad sandwiches. Among those individuals with a fever, therefore, all were exposed to either influenza or an egg-salad sandwich (or both). Put another way, by restricting our attention to only those individuals with a fever, or by conditioning on (stratifying by) the variable fever, we have conditioned on a ‘common effect’ of both influenza and sandwich type.
Therefore, knowing that you have a fever, if we ask you whether you ate an egg-salad sandwich and you respond ‘no’, then we know that your fever is due to having influenza. In panel B of the table, we see that, among those with a fever, the influenza risk among those who ate a chicken sandwich is 5/5 = 1, compared with 5/50 = 0.1 for those who ate an egg-salad sandwich, yielding a risk difference of 0.9. (Conversely, all individuals without a fever were exposed to neither influenza nor tainted egg-salad.) This association is introduced by conditioning on a common effect, namely fever. Recall that sandwich type was randomly assigned. This association was not present before we knew about (and conditioned on) your fever status.
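The arithmetic behind panels A and B can be checked with a short sketch. It assumes the expected counts implied by the example (5 influenza cases in each group of 50) and the stated rule that fever arises only from influenza or from an egg-salad sandwich; the variable and function names are illustrative and do not come from the original analysis.

```python
from fractions import Fraction

# Expected counts implied by the example: 100 attendees, 10 with influenza,
# 50 chicken and 50 egg-salad sandwiches assigned at random.
# Each record is (sandwich, influenza, number of people).
population = [
    ("chicken", True, 5),
    ("chicken", False, 45),
    ("egg_salad", True, 5),
    ("egg_salad", False, 45),
]

def has_fever(sandwich, influenza):
    # Hypothetical world of the example: fever comes only from influenza
    # or from eating a (tainted) egg-salad sandwich.
    return influenza or sandwich == "egg_salad"

def influenza_risk(rows, sandwich):
    # Proportion with influenza among those who ate the given sandwich.
    cases = sum(n for s, flu, n in rows if s == sandwich and flu)
    total = sum(n for s, flu, n in rows if s == sandwich)
    return Fraction(cases, total)

def risk_difference(rows):
    # Chicken minus egg-salad, matching the comparison in the text.
    return influenza_risk(rows, "chicken") - influenza_risk(rows, "egg_salad")

# Conditioning on fever means keeping only the febrile records.
febrile = [row for row in population if has_fever(row[0], row[1])]

n_febrile = sum(n for _, _, n in febrile)  # you and 54 others
rd_all = risk_difference(population)       # 5/50 - 5/50
rd_febrile = risk_difference(febrile)      # 5/5 - 5/50

print(n_febrile, rd_all, rd_febrile)
```

Exact fractions are used so that the risk differences match the hand calculation without floating-point rounding. The afebrile stratum is left out of the comparison because it contains no egg-salad eaters, so a risk difference is undefined there.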
In general, we may introduce bias by conditioning on common effects of otherwise unrelated variables: we call this selection bias. Consider the variables influenza I, sandwich type S and fever F. A causal diagram⁸ illustrating the associations as described previously is drawn in panel A of the accompanying figure. A variable like F, where two arrowheads meet, is called a ‘collider’ on the ‘path’ I–F–S; it may not be a collider on other paths (if they existed). When we condition on a collider, we may introduce associations in one or more strata that were not present in the source population. One way to understand the bias caused by conditioning on a collider is to envision a connection made between I and S when F is conditioned upon; indeed, some methods for working with causal diagrams explicitly draw such connections.⁹ The bias in this example is not in the association of exposure (S) with disease (F), but it is in the apparent I–S association, within one or both levels of F.
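As a rough illustration of the collider structure, the following simulation draws I and S independently and sets F to occur whenever either is present. This is a sketch under stated assumptions: independent Bernoulli draws with probabilities 0.10 and 0.50 stand in for the fixed counts of the example, and the sample size, seed and function names are arbitrary choices for illustration.

```python
import random

def simulate(n, p_flu=0.10, p_egg=0.50, seed=0):
    """Tabulate people and influenza cases by (egg-salad, fever)."""
    rng = random.Random(seed)
    counts = {}  # (egg, fever) -> [people, influenza cases]
    for _ in range(n):
        flu = rng.random() < p_flu   # I: drawn independently of S
        egg = rng.random() < p_egg   # S: randomized sandwich type
        fever = flu or egg           # F: common effect of I and S
        cell = counts.setdefault((egg, fever), [0, 0])
        cell[0] += 1
        cell[1] += flu
    return counts

def risk(counts, egg, fever_levels):
    # Influenza risk among people with the given sandwich type,
    # restricted to the listed levels of fever.
    people = sum(counts.get((egg, f), [0, 0])[0] for f in fever_levels)
    cases = sum(counts.get((egg, f), [0, 0])[1] for f in fever_levels)
    return cases / people

counts = simulate(200_000)

# Source population: both fever levels pooled (no conditioning on F).
rd_source = risk(counts, False, (True, False)) - risk(counts, True, (True, False))
# Febrile stratum: conditioning on the collider F.
rd_febrile = risk(counts, False, (True,)) - risk(counts, True, (True,))

print(f"risk difference, source population: {rd_source:.3f}")
print(f"risk difference, febrile stratum:   {rd_febrile:.3f}")
```

Under these assumptions the risk difference in the source population should be close to 0, while the risk difference among the febrile should be close to 0.9, reflecting the I–S association that appears only once F is conditioned upon.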
Causal diagrams depicting scenarios described in Example 1 (A), Example 2 (B) and in Discussion (C)