Perhaps the most common strategy for identifying important imbalances in individual confounders between intervention and comparison groups is to use significance tests such as χ2 tests (for dichotomous variables) or t tests (for continuous variables). A problem with these tests is that their significance levels are sensitive to sample size, and the tests are usually not very meaningful when applied to studies with very large numbers of subjects (as is often the case for cohort studies). Under such circumstances, differences may be statistically significant but not clinically meaningful. For example, in the comparison restricted to people with dementia, a difference of about three months in mean age between groups is significant (P < 0.001) but may not be clinically relevant. Conversely, if the samples are small, differences that are clinically meaningful may not be significant. For these reasons this approach to the assessment of differences is of little value.
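To illustrate why significance levels depend on sample size, consider a minimal sketch with hypothetical numbers (a three month mean age difference and a standard deviation of eight years, values chosen only for illustration): the test statistic for the same fixed difference grows with the square root of the group size.

```python
import math

def z_statistic(mean_diff, sd, n_per_group):
    """Two-sample z statistic for a difference in means, assuming
    equal group sizes and a common standard deviation."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    return mean_diff / standard_error

# Hypothetical values: 0.25 year (three month) age difference, SD of 8 years.
small = z_statistic(0.25, 8.0, 100)      # modest cohort
large = z_statistic(0.25, 8.0, 100_000)  # very large cohort

# The identical, clinically trivial difference is non-significant in the
# small cohort (|z| < 1.96) but highly significant in the large one.
print(round(small, 2), round(large, 2))
```

The difference itself never changes; only the standard error shrinks as the cohort grows, which is why a significant P value alone says nothing about clinical relevance.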
An alternative to traditional significance testing is to use standardised differences (effect sizes) to examine between-group differences in patient characteristics. A standardised difference expresses the mean difference as a proportion of the standard deviation: the difference between groups is divided by the pooled standard deviation of the two groups. This measure is not as sensitive to sample size as traditional tests and gives a sense of the relative magnitude of differences. Standardised differences greater than 0.1 are typically considered meaningful.13
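The calculation can be sketched as follows (the numbers are illustrative, and the pooled standard deviation is taken here as the root mean square of the two group standard deviations, one common convention):

```python
import math

def standardised_difference(mean1, sd1, mean2, sd2):
    """Mean difference divided by the pooled standard deviation
    (root mean square of the two group standard deviations)."""
    pooled_sd = math.sqrt((sd1 ** 2 + sd2 ** 2) / 2.0)
    return (mean1 - mean2) / pooled_sd

# Hypothetical mean ages: a three month (0.25 year) difference, SD of 8 years.
d = standardised_difference(80.25, 8.0, 80.0, 8.0)
print(round(d, 3))  # 0.031 -- well below the 0.1 threshold
```

Unlike a P value, this quantity does not shrink or grow with the number of subjects, so the same value is obtained whether the cohort has hundreds or hundreds of thousands of members.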
Traditional significance testing found that all 19 potential confounders differed significantly (P < 0.001) in comparison 1, and that 13 of the 19 characteristics had standardised differences greater than 0.1. Of particular note is the large standardised difference for history of dementia. Restricting the study to people with dementia eliminates the possibility of confounding from this characteristic. For comparison 2, traditional significance tests showed that 8 of the 18 potential confounders differed significantly (P < 0.001), but only two had a standardised difference greater than 0.1. The standardised difference technique shows that comparison 1 has substantial selection bias, particularly for dementia, whereas comparison 2 has much less potential for bias.
Both traditional significance testing and standardised differences focus on one potential confounder at a time and do not provide an overall perspective on how the comparison groups differ. For example, two groups could have the same mean age and proportion of women, but one could contain old men and young women and the other old women and young men. An increasingly common approach in the analysis of cohort studies of health care interventions is to use propensity score methods,14 15 a technique that involves multivariate assessment of confounders (see bmj.com for a brief discussion and an example).
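As a sketch of the general idea (not the specific analysis discussed here): a propensity score is the estimated probability of receiving the intervention given the measured confounders, typically obtained from a logistic regression; subjects can then be matched, stratified, or weighted on the score. The toy example below, on synthetic data in which treatment depends on age, fits a one-covariate logistic model by gradient ascent.

```python
import math
import random

random.seed(42)

# Synthetic cohort: probability of treatment rises with age (the confounder).
n = 2000
ages = [random.gauss(75.0, 8.0) for _ in range(n)]
treated = [1 if random.random() < 1 / (1 + math.exp(-0.2 * (a - 75.0))) else 0
           for a in ages]

# Standardise age so plain gradient ascent converges quickly.
mean_age = sum(ages) / n
sd_age = math.sqrt(sum((a - mean_age) ** 2 for a in ages) / n)
x = [(a - mean_age) / sd_age for a in ages]

# Fit logistic regression P(treated | age) by gradient ascent
# on the log likelihood.
b0, b1 = 0.0, 0.0
for _ in range(500):
    g0 = g1 = 0.0
    for xi, t in zip(x, treated):
        p = 1 / (1 + math.exp(-(b0 + b1 * xi)))
        g0 += t - p
        g1 += (t - p) * xi
    b0 += g0 / n
    b1 += g1 / n

# Each subject's propensity score: estimated probability of treatment.
scores = [1 / (1 + math.exp(-(b0 + b1 * xi))) for xi in x]
print(round(b1, 2))  # positive: older subjects are more likely to be treated
```

Comparing subjects with similar scores balances all the modelled confounders at once, rather than one characteristic at a time; in practice the model would include every measured confounder, not a single covariate.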
Selection bias in cohort studies can result in confounding. Here we have defined questions that can help identify potential confounders. In the next article we will examine statistical methods that can be used to reduce the effect of confounding and strategies that can be used to determine if the results of a study are plausible.
Has there been a systematic effort to identify and measure potential confounders?
Is there information on how the potential confounders are distributed between the comparison groups?
What methods are used to assess differences in the distribution of potential confounders?