Over 50% of the pairwise associations between baseline nongenetic characteristics in our study were statistically significant at the 0.05 level, an 11-fold increase over what would be expected if these characteristics were independent. Findings were similar at the 0.01 level (45-fold increase over expected) and the 0.0001 level (3,000-fold increase over expected). This illustrates the considerable difficulty of determining which associations are valid and potentially causal against a background of highly correlated factors, and reflects the tendency of behavioural, socioeconomic, and physiological characteristics to cluster. Because of this tendency, there will often be high levels of confounding when any single factor is studied in relation to an outcome. Given the complexity of such confounding, even after formal statistical adjustment, a lack of data for some confounders and measurement error in assessed confounders will leave considerable scope for residual confounding [4]. When epidemiological studies present adjusted associations as a reflection of the magnitude of a causal association, they are assuming that all possible confounding factors have been accurately measured and that their relationships with the outcome have been appropriately modelled. We think this is unlikely to be the case in most observational epidemiological studies [26].
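The fold increases quoted above follow from simple arithmetic: if the characteristics were independent, the expected proportion of pairwise tests reaching a threshold would equal that threshold. The short Python sketch below back-calculates the proportions implied by the reported fold increases. The variable and function names are illustrative assumptions, not taken from the study's analysis code.

```python
# Fold increases as reported in the text, keyed by significance threshold.
REPORTED_FOLD = {0.05: 11, 0.01: 45, 0.0001: 3000}


def implied_proportion(fold, alpha):
    """Proportion of significant pairs implied by a fold increase over alpha.

    Assumes that, under independence, the expected proportion of
    significant pairwise tests equals the threshold alpha.
    """
    return fold * alpha


implied = {alpha: implied_proportion(fold, alpha)
           for alpha, fold in REPORTED_FOLD.items()}

for alpha, proportion in implied.items():
    print(f"alpha = {alpha}: implied proportion significant = {proportion:.0%}")
```

The implied proportions agree with the figures given elsewhere in the text (over 50% at the 0.05 level, 45% at the 0.01 level, and 30% at the 0.0001 level).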
Predictably, such confounded relationships will be particularly marked for highly socially and culturally patterned risk factors, such as dietary intake. This high degree of confounding might underlie the poor concordance between observational epidemiological studies that identified dietary factors (such as beta carotene, vitamin E, and vitamin C intake) as protective against cardiovascular disease and cancer and the findings of randomized controlled trials of these dietary factors [1, 27]. Indeed, with 45% of the pairwise associations of nongenetic characteristics being “statistically significant” at the p < 0.01 level in our study, and our study being unexceptional with regard to the levels of confounding that will be found in observational investigations, it is clear that the large majority of associations that exist in observational databases will not reach publication. We suggest that those that do achieve publication will reflect apparent biological plausibility (a weak causal criterion [28]) and the interests of investigators. Examples exist of investigators reporting provisional analyses in abstracts (such as antioxidant vitamin intake being apparently protective against future cardiovascular events in women with clinical evidence of cardiovascular disease [29]) but not going on to full publication of these findings, perhaps because randomized controlled trials that appeared soon after the presentation of the abstracts [30] rendered their findings unlikely to reflect causal relationships. Conversely, it is likely that the large majority of null findings will not achieve publication unless they contradict high-profile prior findings, as has been demonstrated in molecular genetic research [31].
The magnitudes of most of the significant correlations between nongenetic characteristics were small (see ), with a median value of 0.08 at both p ≤ 0.01 and p ≤ 0.05, and it might be considered that such weak associations are unlikely to be important sources of confounding. However, a large number of associated nongenetic variables, even when only weakly correlated, can present considerable potential for residual confounding. For example, we have previously demonstrated how 15 socioeconomic and behavioural risk factors, each with weak but statistically independent (at p ≤ 0.05) associations with both vitamin C levels and coronary heart disease (CHD), could together account for an apparently strong protective effect (odds ratio = 0.60 comparing top to bottom quarter of vitamin C distribution) of vitamin C on CHD [32].
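The way many weak confounders can combine into an apparently strong effect can be illustrated with a small simulation. The sketch below is not the analysis reported in [32]; it is a hypothetical toy model in which 15 independent standard-normal factors each slightly raise an exposure and slightly lower disease risk, while the exposure itself has no causal effect on disease. The sample size, weights, baseline risk, and seed are all illustrative assumptions.

```python
import math
import random


def simulate_confounded_or(n=40000, k=15, weight=0.1,
                           baseline_logit=-2.0, seed=1):
    """Odds ratio for disease, top versus bottom quarter of exposure.

    Hypothetical model: k independent confounders each shift the exposure
    up and the log-odds of disease down by `weight`. The exposure has no
    causal effect on disease in this simulation.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        confounder_sum = sum(rng.gauss(0.0, 1.0) for _ in range(k))
        exposure = weight * confounder_sum + rng.gauss(0.0, 1.0)
        logit = baseline_logit - weight * confounder_sum
        risk = 1.0 / (1.0 + math.exp(-logit))
        rows.append((exposure, rng.random() < risk))

    rows.sort(key=lambda row: row[0])  # order participants by exposure
    quarter = n // 4
    bottom, top = rows[:quarter], rows[-quarter:]

    def odds(group):
        cases = sum(1 for _, diseased in group if diseased)
        return cases / (len(group) - cases)

    return odds(top) / odds(bottom)


odds_ratio = simulate_confounded_or()
print(f"Top vs bottom quarter odds ratio, no causal effect: {odds_ratio:.2f}")
```

With these settings each factor correlates only weakly with the exposure (roughly 0.09), yet the crude odds ratio falls clearly below 1, which is the pattern the paragraph above describes.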
The independence of genetic and environmental factors is of importance in other domains of genetic epidemiology, in addition to that of mendelian randomization. First, case-only studies necessarily assume the independence of genetic and environmental factors in their basic rationale [33, 34]. Second, statistical methods for analysing case-control studies in genetic epidemiology can enhance precision by assuming the independence of genetic and environmental factors, as demonstrated by several authors [35–37]. Such approaches have been applied to the analysis of empirical datasets [38]. Conversely, it is commonplace to see statistical adjustment for environmental factors applied to associations between genetic variants and outcomes. This adjustment is probably unnecessary, given the independence of the genetic variants and the environmental factors, and it also provides an opportunity for data-derived selection of the adjusted model that provides the strongest evidence for an association with the genetic variant in question. In some cases, indeed, only the adjusted analyses are presented. We suggest that routine adjustment of genetic associations with phenotypic outcomes for potential nongenetic confounding factors is unnecessary and can be misleading.
Three of the authors decided a priori which baseline characteristics were likely to be biologically closely related to each other, or likely to be measuring the same underlying characteristic, and did not include such variables in the overall correlations. Other investigators might have come up with a somewhat different grouping of variables. However, the very high proportion of statistically significant associations at all three levels of significance, and the similar findings in sensitivity analyses using different nongenetic characteristics (e.g., total cholesterol instead of triglycerides, high-density lipoprotein cholesterol, and low-density lipoprotein cholesterol), suggest that our findings are likely to be replicated even with different opinions about which baseline nongenetic variables should be included in the analyses (provided this selection of nongenetic variables was done a priori within any given dataset). We also deliberately chose only one genetic variant when we had typed several within a gene; this selection ensured that there was no association caused by linkage disequilibrium due to close physical proximity of variants. It is possible that pleiotropy or population stratification could generate associations between genetic variants and nongenetic factors, but we do not see strong evidence of this in our study population of United Kingdom (UK) women, very largely of white European origin.
The genetic polymorphisms that we investigated were those that had been assayed in this cohort study. The variants that we have typed to date are those that we (or study collaborators) wish to use in mendelian randomization studies or to replicate previous association studies. Thus, these variants have all been selected on the grounds that there was some evidence that they relate to biological differences between individuals for phenotypes or disease outcomes that we have assessed in this cohort. Therefore, they are a group of variants that will tend to be related to phenotypic differences. Our variants include, for example, the C→T677 MTHFR variant and the SNP that marks the lactase persistence trait, two well-known and widely studied variants with clear biological correlates. The number of associations found with phenotypic variables should, therefore, be higher for our SNPs than for a group of SNPs selected without reference to known function. Four of the chosen variants (lying at the APO_AV, HL, LPL, and TNFA loci) were associated with more phenotypes than expected at either the 0.05 or 0.01 significance level. It is possible that these variants are involved in such a wide range of biological processes that the observations are causal. However, these “positive” findings, particularly those at the 0.05 level, may well simply represent the play of chance and be nonreplicable in future studies. In support of our general hypothesis that in mendelian randomization studies, genetic variants are seldom confounded by phenotypic factors [10], overall we found no more associations with phenotypes than would be expected by chance at the 0.05 or 0.01 level.
At a more realistic p-value threshold for genetic association studies (p ≤ 0.0001), only four (0.18%) out of 2,208 associations of 23 genetic variants with 96 nongenetic variants were statistically significant. Although this is greater than the number (0.22) expected by chance, the proportion of statistically significant associations of genotype with nongenetic characteristics is considerably smaller than the proportion of significant associations between nongenetic characteristics (0.18% versus 30%) at this level of significance. It is difficult to believe that all or a substantial proportion of the 1,378 statistically significant associations (at p ≤ 0.0001) between two nongenetic characteristics are truly causal, whereas the four associations of genetic variants with nongenetic factors at this level of significance may well be real. The associations of variants in lactase with mean outdoor temperature and rainfall for the area and month of birth of the participant are likely to reflect the established population stratification for this variant [39, 40]. Since the allele frequency of this variant is known to vary by ancestral geography, we would take this into account in any mendelian randomization studies of this variant. The other two associations, CETP with high-density lipoprotein cholesterol [41–43] and LPL with triglycerides [44], reflect the biological actions of these genes.
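The counts in the preceding paragraph can be checked directly. The Python sketch below reproduces the number of genotype-by-characteristic tests, the number expected to be significant by chance, and the observed proportion. As an additional illustration that is not part of the reported analysis, it also computes a Poisson tail probability for seeing four or more significant results. That step assumes the tests are independent, which is a simplification, and the variable names are illustrative.

```python
import math

n_variants = 23        # genetic variants (from the text)
n_nongenetic = 96      # nongenetic characteristics (from the text)
alpha = 0.0001         # significance threshold
observed = 4           # significant genotype associations (from the text)

n_tests = n_variants * n_nongenetic      # number of pairwise tests
expected_by_chance = n_tests * alpha     # expected significant under the null
observed_share = observed / n_tests      # observed proportion significant

# ASSUMPTION (not in the source): tests treated as independent, so the
# count of chance findings is approximately Poisson with this mean.
p_at_least_observed = 1.0 - sum(
    math.exp(-expected_by_chance) * expected_by_chance ** k / math.factorial(k)
    for k in range(observed)
)

print(f"tests: {n_tests}")
print(f"expected by chance: {expected_by_chance:.2f}")
print(f"observed share: {observed_share:.2%}")
print(f"P(at least {observed} by chance), Poisson approximation: "
      f"{p_at_least_observed:.1e}")
```

The first three quantities match the 2,208 tests, 0.22 expected, and 0.18% observed that are quoted in the text. The tail probability is only a rough guide, because correlated phenotypes violate the independence assumption.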
Our findings provide reassuring evidence that utilising genetic variants in mendelian randomization studies is generally a legitimate strategy. Furthermore, statistical methods that assume independence of genetic and environmental factors are also legitimate in many circumstances [33–38]. Our findings are concordant with the demonstration that a large number of genetic variants were unrelated to participation or nonparticipation in a series of case-control studies [45]; with occasional reports of gene–environment independence that have focused on a limited number of variants and environmental factors [46]; with the very similar distribution of allelic frequencies among blood donors and a representative population sample in the UK [47]; and with a detailed review of gene–environment correlations in behavioural genetics [48]. We have demonstrated a fundamental difference in the degree of confounding of genetic variants and other variables. This difference can be exploited by using genetic variants as exposure indicators to study the effects on common diseases of modifiable risk factors that are too heavily confounded to be studied robustly through conventional observational epidemiological approaches [10].