There are obstacles on the “environment” side of gene-environment interaction that are not present on the “gene” side. Environmental epidemiology does not have the economy of scale seen in genomics, where the difference in cost between measuring a million variants and one variant is a small fraction of the average cost per participant in a case-control or cohort study that collects DNA. We may be missing important environmental determinants of disease because we do not know what to look for, or because we do not know how or when to measure accurately the exposures we do know to look for. A person's genetic makeup may be too far removed from complex physiologic or biochemical processes that could be more important risk factors for disease. Germ-line variation is static and so can be captured at any point, but variation in the timing of exposure, and in the timing of subsequent risk, complicates the study of environmental factors; at the same time, variation in exposure and risk over time can provide important clues about etiology. In addition, the major advances in the use of biomarkers in research and medical applications, most notably for infectious diseases, are not yet close to yielding useful measures of long-term exposure to dietary factors, pharmaceuticals, and polluted air and water for the large numbers of persons needed for studies of rare diseases. Even as biomarkers continue to improve measurement of some exposures, we must also improve the accuracy of epidemiologic questionnaires, medical records, occupational records, and other proxy measurements of environmental factors.
Investigation of gene-environment interaction to learn about etiology and public health is feasible with existing data. An agnostic strategy that is implemented carelessly, however, will generate a large supply of false-positive findings and cause well-founded skepticism about claims of interactions, given the low prior probabilities of most hypotheses (15). Researchers conducting GWAS are demanding replications and requiring P values for significance below what we have ever thought realistic in epidemiology (20) in order to avoid false-positive findings in studying main effects of a million genetic variants. Imagine 10–30 times more tests of interaction involving genes, demographic factors, and personal and environmental exposures. Hypotheses about interaction have lower prior probabilities than hypotheses about main effects, and tests for interaction have lower power than tests for main effects of comparable effect size. In addition, exposures are measured with greater misclassification than genetic variants are. Huge sample sizes are required to reach the very low P values used in GWAS of main effects, which are finding small effects. How will we decide on and achieve the enormous sample sizes needed for interactions when there are more hypotheses and lower prior probabilities of effect, and when good exposure assessment will be critical? How will we be able to distinguish the few interactions likely to be real from the myriad false-positive ones, and draw attention to them?
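The arithmetic behind these concerns can be sketched briefly. The example below is only an illustration under stated assumptions: it uses a Bonferroni correction (one common choice, not necessarily the one used in the cited work) and Bayes' rule for the probability that a "significant" finding is false. The function names, the 0.05 family-wise level, the 80% power, and the prior probabilities are all hypothetical values chosen for illustration.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test P-value threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def prob_significant_finding_is_false(alpha: float, power: float, prior: float) -> float:
    """Bayes' rule: P(no true effect | test is significant at level alpha)."""
    false_positives = alpha * (1.0 - prior)  # null hypotheses that reach significance
    true_positives = power * prior           # real effects that reach significance
    return false_positives / (false_positives + true_positives)


# Assumed numbers: one million main-effect tests, then 10x and 30x as many
# interaction tests, as in the text's thought experiment.
n_main = 1_000_000
for multiplier in (1, 10, 30):
    threshold = bonferroni_threshold(0.05, n_main * multiplier)
    print(f"{multiplier:>2}x tests -> per-test threshold {threshold:.2e}")

# Assumed priors: lower priors (as argued for interactions) make a nominally
# significant result more likely to be a false positive.
for prior in (0.01, 0.001, 0.0001):
    p_false = prob_significant_finding_is_false(alpha=0.05, power=0.8, prior=prior)
    print(f"prior {prior:g} -> P(false | significant) = {p_false:.3f}")
```

Under these assumptions, both the required threshold and the share of significant findings that are false worsen as the number of hypotheses grows and the prior probability falls, and this is before accounting for the lower power of interaction tests.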
The decades-old problem of defining interaction (21) is even more prominent in the GWAS era. The statistical models we have used to declare interaction as departure from additive or multiplicative joint effects may be inadequate to describe the underlying biology of joint gene-environment effects on complex disease. The flood of new empirical data becoming available may allow us to examine both gene-gene and gene-environment interactions in new ways.
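To make the scale dependence concrete, the sketch below computes two standard summary measures from relative risks: the relative excess risk due to interaction (departure from additivity) and the ratio of the joint relative risk to the product of the individual ones (departure from multiplicativity). The function names and the relative risks used are hypothetical and chosen only for illustration; they are not taken from any study.

```python
def additive_interaction(rr_gene: float, rr_env: float, rr_both: float) -> float:
    """Relative excess risk due to interaction; 0 means exactly additive joint effects."""
    return rr_both - rr_gene - rr_env + 1.0


def multiplicative_interaction(rr_gene: float, rr_env: float, rr_both: float) -> float:
    """Ratio of joint RR to the product of single-factor RRs; 1 means exactly multiplicative."""
    return rr_both / (rr_gene * rr_env)


# Assumed relative risks, each compared with persons who have neither factor.
rr_gene, rr_env, rr_both = 2.0, 3.0, 5.0

reri = additive_interaction(rr_gene, rr_env, rr_both)
ratio = multiplicative_interaction(rr_gene, rr_env, rr_both)

print(f"departure from additivity: {reri:+.2f}")
print(f"departure from multiplicativity (ratio): {ratio:.2f}")
```

With these assumed values the joint effect is greater than additive but less than multiplicative, so whether "interaction" is declared, and in which direction, depends entirely on the scale chosen.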
Systems biology provides novel experimental approaches to quantify molecular components of a biologic system, to assess their interactions, and to integrate such information into graphic models that may explain or predict emergent phenomena (23). However, there is still a large schism between modeling of interactions in cellular and biologic processes and our ability to use that information in observing health and disease in human populations. How can we use biologic information for defining interaction or choosing which analytic method is most useful for identifying risk factors, genetic or environmental; for describing their joint effects; and for predicting and stratifying risk? Do we look for higher-order effects only when a main genetic effect has been found? Do we try to fit a variety of models of interactions, including additive and multiplicative effects? Do we remain truly agnostic in our approach and let the data speak for themselves by using other approaches such as data mining techniques (24)? Do we continue using the multiplicative model to remove one dimension of complexity (25)? We need some analytic help to make the GEWIS efforts more productive by addressing biologic, clinical, and public health questions, not only academic abstractions!