Despite the yield of recent genome-wide association (GWA) studies, the identified variants explain only a small proportion of the heritability of most complex diseases. This unexplained heritability could be partly due to gene-environment (G×E) interactions or more complex pathways involving multiple genes and exposures. This article provides a tutorial on the available epidemiological designs and statistical analysis approaches for studying specific G×E interactions and choosing the most appropriate methods. I discuss the approaches that are being developed to study entire pathways and available techniques for mining interactions in GWA data. I also explore approaches to marrying hypothesis-driven pathway-based approaches with “agnostic” GWA studies.
The term ‘interaction’ has various meanings in the epidemiologic literature, depending on the context (Box 1). The focus of this article is on gene-environment (G×E) interaction, here defined as a joint effect of one or more genes with one or more environmental factors that cannot be readily explained by their separate marginal effects. By convention in epidemiology, a multiplicative model is taken as the null hypothesis; that is, the relative risk of disease in individuals with both the genetic and environmental risk factors is the product of the relative risks of each separately. Thus, any joint effect that differs from this prediction is considered a form of interaction. Other null hypotheses, such as an additive model for the excess risk, would yield different interpretations about interaction (Box 1).
Both public health and biological interactions lead to an additive risk model as the natural null hypothesis164, although in epidemiology, the multiplicative model is more commonly used. Various authors25,165-167 have offered classifications of different types of G×E interactions, including qualitative interactions (crossing, no effect of environment in those not genetically susceptible, no effect of genotype in the unexposed, etc.) and quantitative. See these papers for examples of each.
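To make the two null hypotheses concrete, the following minimal Python sketch compares an observed joint relative risk with what the multiplicative and additive models would predict. The function name and the example relative risks are hypothetical. The additive departure is expressed as the relative excess risk due to interaction (RERI), a commonly used measure that is not named in the text above.

```python
def interaction_measures(rr_g, rr_e, rr_ge):
    """Departures from the multiplicative and additive null models.

    All inputs are relative risks versus the unexposed, non-susceptible
    reference group (hypothetical illustrative values, not study data).
    """
    # Ratio of observed joint RR to the product of the separate RRs:
    # equals 1.0 when the multiplicative null holds.
    multiplicative_ratio = rr_ge / (rr_g * rr_e)
    # Relative excess risk due to interaction:
    # equals 0.0 when the additive null (for excess risks) holds.
    reri = rr_ge - rr_g - rr_e + 1.0
    return multiplicative_ratio, reri


# Joint RR of 6 with separate RRs of 2 and 3: no multiplicative
# interaction, but a positive departure from additivity.
print(interaction_measures(2.0, 3.0, 6.0))
# Joint RR of 4: exactly additive, but less than multiplicative.
print(interaction_measures(2.0, 3.0, 4.0))
```

The same data can therefore show interaction on one scale and none on the other, which is why the choice of null model has to be stated.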
G×E interactions are worth studying for many reasons1,2 (Box 2), not least of which is the insight they could provide into biological pathways. If some of the unexplained heritability in genome-wide association (GWA) studies is due to interactions, then one goal might be to use interactions to discover novel genes that act synergistically with other factors without having demonstrable marginal effects, rather than discovery of the interaction per se3. Conversely, one might wish to discover environmental hazards that affect only a subpopulation of genetically susceptible individuals. For example, G×E interactions might allow the effects of the components of a complex mixture like air pollution to be dissected4. Understanding the failure to replicate the findings of GWA studies is another goal, as it could provide insights into disease complexity by identifying sources of real heterogeneity5,6. Finally, taking account of G×E interactions in risk prediction models can have important implications for both public health and personalized medicine7.
Traditionally, G×E interactions were investigated using candidate-gene studies. This research often begins with an established association with an environmental factor and proceeds to explore genes in pathways known to metabolize it. Over time, candidate gene studies have become more elaborate investigations of entire pathways, including all the genes, exposures, and cofactors thought to be involved in a particular mechanism. With the advent of GWA studies, a different philosophy has gained prominence, based on “agnostic” searches with no prior hypotheses. Understandably, most reports have focused on genetic main effects, but they are now increasingly directed at gene-gene (G×G) interactions8. Although many GWA studies have not collected data on environmental factors, some are based on epidemiologic cohort or case-control studies that have well-characterized exposure information and could be scanned for novel G×E interactions. Such scans for G×G and G×E interactions have been viewed as agnostic. Recently, however, there has been an intriguing convergence of the two philosophies, either by using external pathway knowledge to inform the analysis of GWA data to better detect signals that do not achieve genome-wide significance9 or by mining patterns of interaction effects in GWA data to discover novel pathways10.
In the current post-GWA era the focus is on integrating findings from the vast body of data that has been generated through large consortia. A key feature of this next phase should be a renewed focus on G×E interactions, but this will require careful consideration of epidemiologic study design, exposure assessment, and methods of analysis, with particular attention to harmonization of these features across the consortium. Another key feature is the integration of GWA data with external biological knowledge from –omics databases.
I first discuss some of the challenges facing investigators studying environmental factors. Next, I provide a tutorial on the various types of study designs and analytical methods for studying G×E interactions in different contexts, ranging from specific interactions, to more extensive biological pathways, to GWA studies (“Gene-Environment-Wide Interaction Studies”, GEWIS)11. I discuss various ways that external data can be exploited in these types of analyses. Finally, I discuss some emerging directions and needs for making further progress.
Whatever study design is used, the major challenges to the success of a G×E study — in addition to the usual challenges for genetic association studies that have been thoroughly discussed elsewhere — are exposure assessment, sample size, and heterogeneity.
Many environmental factors are multi-dimensional; air pollution, for example, is a complex mixture of gases and particles with differing biological effects. Most environmental agents have degrees of exposure intensity, usually varying over time. Even if an exposure is not time-dependent, the resulting disease risk is likely to be modified by temporal factors like age at or duration of exposure12. Seldom are accurate measurements of exposure over a lifetime available on all participants in a large epidemiologic study, but more detailed information may be obtainable on a stratified subsample to allow correction for measurement error13. Exposures may not even be measured on individuals, but assigned on the basis of ecologic-level exposures or a prediction model. Two-phase case-control designs that leverage readily available exposure surrogates to select individuals for more in-depth exposure assessment and/or genotyping might be used. Uncertainties in exposure assignments can be large and lead to unpredictable biases, particularly if differential with respect to disease, and can induce spurious interactions9. Although methods of correction for exposure or genotype measurement errors are well established for main effects, they have seldom been applied to interaction analyses14,15. In general, however, interactions are less likely to be biased than main effects unless the measurement errors are differentially related to both exposure and genotype.
Sample size requirements for G×E studies can be enormous. A useful rule-of-thumb is that detection of an interaction requires at least four times the sample size needed to detect a main effect of comparable magnitude16. Sample sizes in the thousands of cases are typically needed for G×E analyses in candidate gene studies (Suppl. Fig. 1a) and tens of thousands in GWA studies because of the more stringent significance levels required (Suppl. Fig. 1b). In addition to study design, the key determinants of power or sample size requirements are the prevalence of exposure (or its distribution if continuous), the allele frequency, mode of inheritance, interaction odds ratio ORG×E (and to a lesser extent the ORs for the main effects), significance level, and desired power. Several programs for sample size and power calculations are freely available, notably Quanto17 and POWER18. It is likely that at least some of the poor track record of replicating claims of G×E interactions is due to underpowered studies in the initial discovery or replication attempts19-21. This has led some to suggest that the search for interactions is not worthwhile, as genes involved in interactions are more likely to be detected through their marginal effects22. Nevertheless, a range of interaction effect sizes can be detected in a GWA study by either a test of interaction or a genetic effect in an environmental subgroup even when the marginal effects are not detectable (Suppl. Fig. 1c). Despite claims that interaction in the absence of main effects is a “ubiquitous” phenomenon in nature23,24, most examples are found at the molecular or cellular level and there are few convincing examples in human epidemiology. Nevertheless, there are examples of genetic effects that are apparent only in groups with the relevant environmental exposure or of environmental factors that affect only those with the susceptible genotype (Box 1).
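One way to see where a factor of about four can come from is to compare large-sample variances of the log odds ratios. The sketch below is only an illustration under stated assumptions: binary G and E, G-E independence, no true effects, and Woolf's variance (the sum of reciprocal cell counts). It is not necessarily the derivation behind the cited rule of thumb, and the function names and inputs are hypothetical.

```python
def woolf_variance(cell_counts):
    """Large-sample variance of a log odds ratio: sum of 1/n over its cells."""
    return sum(1.0 / n for n in cell_counts)


def variance_ratio(n_cases, n_controls, p_g, p_e):
    """Var(log OR for GxE) divided by Var(log OR for the G main effect).

    Uses expected cell counts when G and E are independent and neither
    affects disease (an assumption made purely for illustration).
    """
    main_cells = []         # 2 x 2 table: disease status by genotype
    interaction_cells = []  # 2 x 2 x 2 table: status by genotype by exposure
    for n in (n_cases, n_controls):
        for g in (p_g, 1.0 - p_g):
            main_cells.append(n * g)
            for e in (p_e, 1.0 - p_e):
                interaction_cells.append(n * g * e)
    return woolf_variance(interaction_cells) / woolf_variance(main_cells)


# Because required sample size scales with variance, this ratio
# approximates the sample-size inflation for the interaction test.
print(variance_ratio(1000, 1000, p_g=0.5, p_e=0.5))
print(variance_ratio(1000, 1000, p_g=0.3, p_e=0.2))
```

Under these assumptions the ratio equals 1 / (p_e × (1 − p_e)), which is smallest (four) when half the population is exposed and grows as the exposure becomes rarer or more common.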
When comparing studies with different exposure assessment tools, different distributions or characteristics of exposure (e.g., different sizes or chemical constituents of particulate air pollution across regions), or different confounders (e.g., co-pollutants, ethnic distributions with differing genetic background risk), the potential for true heterogeneity is magnified. If explanations can be found for such heterogeneity5, there is an opportunity for insights about the complexity of the disease, but spurious inconsistency due to methodological or data quality differences will just add confusion.
Any of the standard epidemiological designs to study main effects of genes or environmental factors — cohort, case-control, or hybrid designs such as nested case-control or case-cohort25-27 — can also be applied to the study of G×E interactions. The issues for choosing between the designs are similar for main effects and interactions — for example, control of confounding and other biases, temporal sequence of exposure and disease, data quality, ability to examine multiple endpoints, and efficiency to detect rare diseases or rare risk factors (Table 1). For simplicity, I treat G in this section as a single functional polymorphism, but it could comprise a risk-associated haplotype, several causal variants within a gene, or some risk index composed of multiple rare variants. The same analysis techniques could be applied in any case (e.g., multiple logistic regression) and the design considerations would be similar. The following non-traditional designs offer particular advantages for studying interactions.
One of the earliest non-traditional designs was the case-only (or “case-case”) design28, which can only be used for testing interactions, not main effects. This design relies on an assumption of gene-environment independence in the source population to avoid estimating this association among controls, thereby increasing power for the test of interaction. While this assumption would be reasonable for most exogenous exposures like air pollution, the case-only design will yield a biased estimate of ORG×E and an elevated type I error rate if the independence assumption is violated. For example, genes involved in behavioral traits such as addiction might be expected to produce a causal association between G and E (e.g. tobacco smoking29,30) in the general population. Other G-E associations could arise indirectly; for instance, between oral contraceptives and BRCA1 through the effect of the gene on family history — a sister of an affected case might choose to take oral contraceptives to lessen her risk of ovarian cancer31.
Broeks et al.32 used a case-only design to assess the interaction between radiotherapy (RT) for treatment of a first breast cancer and mutations in four DNA damage repair genes (BRCA1, BRCA2, CHEK2, and ATM) on the subsequent risk of contralateral breast cancer (CBC). Among RT+ cases, there was a 2.2-fold higher prevalence of germline mutations in one or more of these genes than among RT– cases. Here it seems unlikely that genotypes would have affected the choice of treatment, except perhaps indirectly through tumor characteristics or stage at diagnosis (factors that could be adjusted for).
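The case-only estimate described above is simply the G-E odds ratio among cases. A minimal sketch follows, with hypothetical counts (not the Broeks et al. data) and a Wald confidence interval. It is valid as an interaction estimate only under the G-E independence assumption discussed above.

```python
import math


def case_only_interaction(exp_carrier, exp_noncarrier,
                          unexp_carrier, unexp_noncarrier):
    """Case-only estimate of the multiplicative interaction odds ratio.

    Arguments are counts of CASES only, cross-classified by exposure and
    carrier status. Returns the odds ratio and an approximate 95% CI.
    """
    or_ge = (exp_carrier * unexp_noncarrier) / (exp_noncarrier * unexp_carrier)
    # Woolf standard error of the log odds ratio.
    se = math.sqrt(1.0 / exp_carrier + 1.0 / exp_noncarrier
                   + 1.0 / unexp_carrier + 1.0 / unexp_noncarrier)
    log_or = math.log(or_ge)
    ci = (math.exp(log_or - 1.96 * se), math.exp(log_or + 1.96 * se))
    return or_ge, ci


# Hypothetical cases: 40 exposed carriers, 60 exposed non-carriers,
# 20 unexposed carriers, 80 unexposed non-carriers.
estimate, interval = case_only_interaction(40, 60, 20, 80)
print(round(estimate, 2), tuple(round(x, 2) for x in interval))
```

No controls enter the calculation, which is the source of both the power gain and the vulnerability to G-E association in the source population.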
It is tempting to begin by testing for G-E association in controls and then decide whether to use the case-only test (for greater power if there is no G-E association) or the case-control test (for greater validity if there is). However, this naïve procedure leads to biased tests and estimates because it fails to take proper account of this two-step inference procedure33. More appropriate empirical Bayes34 or Bayes model averaging35 approaches have been developed that essentially provide weighted averages of the case-only and case-control estimators, yielding an acceptable trade-off between bias and efficiency. For example, Mukherjee et al.34 reanalyzed data on glutathione-S-transferase (GSTM1) and N-acetyl-transferase (NAT2) genotypes in relation to smoking and dietary factors. They found a strong association between NAT2 and smoking, so that their empirical Bayes estimate of the interaction between the two was closer to the case-control estimate than to the case-only one, which was in the opposite direction. However, there was no association between GSTM1 and fruit consumption, so the empirical Bayes estimate of that interaction was similar to both the case-control and case-only estimates, but took advantage of the smaller standard error of the latter.
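The idea of a data-driven weighted average can be sketched as follows. The weights used here are one simple shrinkage form chosen for illustration and should be read as an assumption, not as the exact estimators of the cited papers. Here `theta_ge` is the estimated log odds ratio for G-E association among controls and `var_cc` is the variance of the case-control interaction estimate. All names are hypothetical.

```python
def shrinkage_interaction(beta_co, beta_cc, var_cc, theta_ge):
    """Weighted average of case-only and case-control log interaction ORs.

    beta_co  : case-only estimate (efficient, biased if G and E associate)
    beta_cc  : case-control estimate (robust, less precise)
    var_cc   : variance of beta_cc (must be > 0)
    theta_ge : estimated log OR for G-E association in controls
    """
    # Strong evidence of G-E association (large theta_ge squared relative to
    # var_cc) pushes the weight toward the robust case-control estimate.
    weight_cc = theta_ge ** 2 / (theta_ge ** 2 + var_cc)
    return weight_cc * beta_cc + (1.0 - weight_cc) * beta_co


# No G-E association in controls: the case-only estimate is returned.
print(shrinkage_interaction(0.5, 0.2, var_cc=0.04, theta_ge=0.0))
# Marked G-E association: the result moves close to the case-control value.
print(shrinkage_interaction(0.5, 0.2, var_cc=0.04, theta_ge=2.0))
```

This mirrors the pattern in the NAT2 and GSTM1 examples: the combined estimate tracks the case-control estimate when G and E are associated and borrows the precision of the case-only estimate when they are not.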
Family-based association tests (FBATs) — case-parent-trios36, case-sibling37, designs using extended pedigrees38, and modified segregation analysis39 — are appealing because they avoid bias from population stratification, but are generally less powerful for testing main effects than case-control studies using unrelated controls. However, they can be more powerful for testing G×E interactions if relatives’ exposures are not too highly correlated37. Population stratification can bias G×E interactions only if the substructure is related to the gene and the environmental factor differentially—different ancestry-genotype associations in exposed and unexposed individuals—which seems unlikely. The case-parent trio design requires exposure information only on the cases (although it does require surviving parents for genotyping, making it more suitable for early-onset diseases) and entails a comparison of genetic relative risks between exposed and unexposed cases. The discordant sibship design requires exposure information on all cases and controls and uses standard conditional logistic regression tests of interaction. Twin studies40, segregation41, and linkage analysis42-44 can also be used for testing the existence of G×E with unknown genes or specific regions25.
Two other novel designs use different ways of selecting controls to improve the power for detecting either main effects or interactions. The two-phase case-control design45 is useful where a surrogate for exposure is readily available but data on exact doses, confounders, or modifiers require additional expensive data collection46. (Note that the kinds of two-phase sampling designs described here are fundamentally different from the two-stage genotyping designs for GWA studies described below.) These designs entail independent subsampling on the basis of both disease status and the exposure surrogate variable from a first-phase case-control or cohort study. Data from both phases are combined in the analysis, with appropriate allowance for the biased sampling in phase two. The optimal design entails over-representing the rarer cells, typically the exposed cases. Although most applications have focused on its use for improving exposure characterization for main effects or for better control of confounding, it can also be highly efficient for studying interaction effects. For example, Li et al.47 used a two-phase design nested within the Atherosclerosis Risk in Communities (ARIC) study to study the interaction between GSTM1/GSTT1 and cigarette smoking on the risk of coronary heart disease. Their sampling scheme was not fully efficient for addressing this particular question because it stratified only on intima media thickness, not smoking, and only for the controls, and did not exploit the information from the original cohort in the analysis. Reanalyses of other data from the ARIC study48 showed the considerable improvement in efficiency that can be obtained by using the full cohort information.
Countermatching is essentially a matched variant of the two-phase design. Here one or more controls are selected for each case on the basis of exposure so that each matched set contains the same number of exposed individuals. Another study of CBC in relation to RT and DNA damage repair genes49 counter-matched each CBC case to two controls with unilateral breast cancer, such that each matched set contained two RT+ subjects. Radiation doses to each quadrant of the contralateral breast were then estimated and DNA was obtained for genotyping candidate DNA repair genes and for a GWA scan. Langholz50 has demonstrated the considerable gains in power that can be obtained, both for main effects and for interactions. In particular, for G×E interactions Andrieu et al.51 showed that a 1:1:1:1 design counter-matched on surrogates for both exposure and genotype was more powerful than conventional 1:3 nested case-control or 1:3 or 2:2 designs counter-matched on just one of these factors.
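The control-selection step of a counter-matched design like the one described above (two controls per case, two RT+ subjects per matched set) might be sketched as below. The record layout, function name, and defaults are hypothetical. The sketch covers sampling only and omits the weights that the analysis would need to allow for the exposure-dependent sampling.

```python
import random


def countermatched_set(case, risk_set, exposed_per_set=2, set_size=3, rng=None):
    """Pick controls so that each matched set has a fixed number exposed.

    case     : dict with keys 'id' and 'rt' (True if RT+)
    risk_set : list of eligible control dicts with the same keys
    Raises ValueError if too few eligible controls of the needed type exist.
    """
    rng = rng or random.Random()
    need_exposed = exposed_per_set - int(case["rt"])
    need_unexposed = (set_size - 1) - need_exposed
    exposed = [s for s in risk_set if s["rt"]]
    unexposed = [s for s in risk_set if not s["rt"]]
    controls = rng.sample(exposed, need_exposed) + rng.sample(unexposed, need_unexposed)
    return [case] + controls


# Hypothetical risk set of ten eligible controls, alternating RT status.
pool = [{"id": i, "rt": i % 2 == 0} for i in range(10)]
matched = countermatched_set({"id": "case-1", "rt": True}, pool, rng=random.Random(1))
print([(s["id"], s["rt"]) for s in matched])
```

An RT+ case receives one RT+ and one RT− control, whereas an RT− case receives two RT+ controls, so every set is informative about the exposure.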
So far I have considered interactions between one gene and one environmental factor, but most candidate gene studies are based on a conceptual model for one or more hypothesized pathways. For example, most of the genetic studies being done for susceptibility to the effects of air pollution on children's asthma and lung growth within the Southern California Children's Health Study (CHS) have been motivated by a theoretical framework involving oxidative stress, inflammation, and modifiers such as anti-oxidant intake52. Typically such hypotheses lead to the selection of a set of candidate genes to be studied together. How then can these data be analyzed in combination to learn about the overall effect of the postulated pathway(s)?
Many exploratory methods have been developed for multivariate analysis of high-dimensional data ranging from standard multiple regression techniques to various machine learning or pattern recognition methods8,53,54. Perhaps the most popular of these methods to study interactions is Multifactor Dimension Reduction (MDR)8,55,56, which I applied in Box 3 to data on a reported four-way interaction between two exposures (smoking and red meat) and two genes (cytochrome P-450 (CYP1A2) and NAT2) in colorectal cancer57. Although this study is widely quoted as one of the few examples of a higher-order interaction, this analysis makes clear that the 4-way interaction is not internally reproducible by cross-validation. In this instance, MDR is more useful for putting a high-dimensional interaction in context than for discovering one, and emphasizes that if two-way interactions require large sample sizes, higher-order interactions require even larger sample sizes. Nevertheless, the interaction is biologically plausible (similar replicated interactions among NAT2, GSTM1, tobacco smoking, and occupational exposures have been reported for bladder cancer58) and is worth studying further using techniques that leverage known pathways.
A reanalysis by the author of grouped data from Le Marchand et al.44 on colorectal cancer in relation to two exposures, smoking and red meat (RM, R/M=rare/medium, WD=well done), and phenotypic markers of two genes, CYP1A2 and NAT2 (S/I=slow/intermediate, R=rapid acetylators), using the MDR technique. Blue shading indicates low risk strata, yellow high risk.
Training subset (9/10):
|CYP1A2||NAT2||Numbers of cases / controls|
Testing subset (1/10)
|CYP1A2||NAT2||Numbers of cases / controls|
The proportion correctly classified in the testing subset by the rule derived from the training data for this realization is 58/85 = 68.2%. Across 10 random training/testing subsets, however, the mean classification accuracy is only 49.7% (range 31.9–74.1%); this is no better than chance, due to the small numbers of subjects (12 cases, 2 controls) in the highest risk stratum. MDR explored all possible models (combinations of genes and environmental factors) and found that only the main effect of smoking on CRC risk was replicable.
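The train-then-test logic of this box can be sketched in a few lines. This is a simplified illustration of the MDR idea and not the published software: each multifactor cell is labelled high risk when its case:control ratio in the training data exceeds the overall ratio, and the resulting rule is scored on the held-out subset. The data layout and function names are hypothetical.

```python
from collections import Counter


def mdr_rule(train):
    """Return the set of high-risk cells learned from training data.

    train : list of (cell, y) pairs, where cell is a tuple of factor levels
            and y is 1 for a case, 0 for a control. Needs >= 1 control.
    """
    cases = Counter(cell for cell, y in train if y == 1)
    controls = Counter(cell for cell, y in train if y == 0)
    overall_ratio = sum(cases.values()) / sum(controls.values())
    # A cell is high risk if it holds proportionally more cases than overall.
    return {cell for cell in set(cases) | set(controls)
            if cases[cell] > overall_ratio * controls[cell]}


def classification_accuracy(high_risk, test):
    """Proportion of held-out subjects whose status matches the rule."""
    correct = sum((cell in high_risk) == (y == 1) for cell, y in test)
    return correct / len(test)


# Hypothetical toy data with two factors (smoking, meat doneness).
train = ([(("smoker", "WD"), 1)] * 8 + [(("smoker", "WD"), 0)] * 2
         + [(("never", "RM"), 1)] * 2 + [(("never", "RM"), 0)] * 8)
test = [(("smoker", "WD"), 1), (("never", "RM"), 0), (("never", "RM"), 1)]
rule = mdr_rule(train)
print(rule, classification_accuracy(rule, test))
```

Repeating this over many random splits and averaging the accuracy is what exposes rules that rest on sparse cells, as happened with the highest-risk stratum here.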
Since candidate pathway studies are hypothesis-driven, it seems appropriate to carry this reasoning through to the analysis59,60. Two approaches that attempt to leverage external information about biological pathways are summarized below and in Box 4. These methods, though promising, have not been widely applied to candidate gene studies so far.
Gene set enrichment analysis shifts the emphasis from the effects of individual SNPs to sets of genes known a priori to have related functions. First, each SNP is assigned to one or more genes, typically based on proximity, and a summary statistic for each gene is obtained (e.g., the minimum p-value for all SNPs assigned to it). Then genes are assigned to gene sets and the distribution of gene-specific summary statistics for each set is compared with its null distribution, typically using the Kolmogorov-Smirnov test. Permutation may be used to allow for the non-uniformity of the null distributions. This method seems to have been applied only to purely genetic analyses, but could be extended to the genes involved in G×E interactions.
Hierarchical modeling supplements a traditional epidemiologic analysis (e.g., multiple logistic regression) with a second level in which the first-level regression coefficients are modeled in relation to a set of “prior covariates” derived from external information, such as pathway or genomic databases (see the figure). This shifts the main focus of inference from the effects of specific exposures, genes, or interactions to the effects of the pathways or other external predictors. It also provides more stable estimates of the individual risk factor effects by “borrowing strength” from related risk factors. The first-level associations may comprise a mixture of null and non-null ones, with probability depending upon prior covariates. The prior means of the non-null effects are regressed on prior covariates and their covariances can depend on a matrix of gene-gene connections. Rebbeck et al.18 provide a discussion of various sources of prior covariate information.
Gene set enrichment analysis (GSEA)61 tests whether disease-associated genes are significantly enriched for particular pathways. Although GSEA is widely used in the analysis of gene-expression data, methods for applying it in association studies have only recently been developed62-64 and have not yet been used for G×E studies.
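The gene-set procedure summarized in Box 4 (SNP to gene, gene to set, comparison with a null distribution) can be sketched with the standard library alone. This is a toy illustration under stated assumptions: each SNP maps to one gene, the set statistic is a one-sample Kolmogorov-Smirnov distance from the uniform distribution, and the null is generated by drawing random gene sets of the same size (permuting phenotypes instead would be a different, arguably preferable, null). All names are hypothetical.

```python
import random


def gene_min_p(snp_p, snp_to_gene):
    """Summarize each gene by the minimum p-value of its assigned SNPs."""
    best = {}
    for snp, p in snp_p.items():
        gene = snp_to_gene[snp]
        best[gene] = min(p, best.get(gene, 1.0))
    return best


def ks_uniform(values):
    """One-sample Kolmogorov-Smirnov distance from Uniform(0, 1)."""
    xs = sorted(values)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


def gene_set_test(gene_stats, gene_set, n_perm=1000, seed=0):
    """KS statistic for a gene set (needs >= 1 member) and a permutation p-value."""
    rng = random.Random(seed)
    genes = sorted(gene_stats)
    members = [g for g in genes if g in gene_set]
    observed = ks_uniform([gene_stats[g] for g in members])
    exceed = 0
    for _ in range(n_perm):
        # Random gene set of the same size as the set under test.
        draw = rng.sample(genes, len(members))
        if ks_uniform([gene_stats[g] for g in draw]) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)
```

Because a minimum p-value is not uniformly distributed even under the null (genes with more SNPs tend to have smaller minima), the raw KS distance is not interpretable on its own. This is the reason the text mentions permutation.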
Hierarchical models extend traditional multiple regression methods for exploring main effects and interactions in an epidemiological dataset by regressing the first-level coefficients on external data65-67. External information can include simple pathway indicator variables68, genomic annotation or pathway ontologies69, functional assays70, in silico predictions of function or evolutionary conservation71, or simulation of pathway kinetics72,73.
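The two-level structure described for hierarchical models can be written schematically as follows. The notation is introduced here for illustration and is an assumption, not taken from the source: X_{ij} denotes the j-th risk factor (a genotype, exposure, or product term) for subject i, Z_j the vector of prior covariates for that risk factor, and \delta_0 a point mass at zero representing null effects.

```latex
\begin{aligned}
\text{Level 1:}\quad &
  \operatorname{logit}\Pr(Y_i = 1 \mid \mathbf{X}_i)
  = \alpha + \sum_{j} \beta_j X_{ij} \\
\text{Level 2:}\quad &
  \beta_j \sim (1 - p_j)\,\delta_0
  + p_j\,\mathcal{N}\!\left(\mathbf{Z}_j^{\top}\boldsymbol{\pi},\ \tau^2\right),
  \qquad
  \operatorname{logit}(p_j) = \mathbf{Z}_j^{\top}\boldsymbol{\omega}
\end{aligned}
```

In this sketch the non-null effects are taken as independent with common variance \tau^2. A version in which their covariance matrix depends on a matrix of gene-gene connections would replace \tau^2 with a structured covariance. Inference then centers on \boldsymbol{\pi} and \boldsymbol{\omega}, the effects of the external predictors, while the individual \beta_j are shrunk toward their prior means.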
Both the GSEA and hierarchical modeling approaches can be thought of as “empirical” as they use external information only to guide the selection of terms to include in a model or to stabilize their estimation. These approaches do not fit strong mechanistic models directly — our understanding of the basic biology is too primitive — although there have been notable successes. Some of the earliest were stochastic models for multistage carcinogenesis74,75, but they have not been applied to pathways involving specific genes. Another area that has seen extensive mathematical modeling is the pharmacokinetics and pharmacodynamics of drug metabolism76, exposure to toxic substances77,78, and normal metabolism79,80. While inter-individual variation in metabolic rate parameters has long been recognized, their genetic basis has only recently been incorporated into this kind of modeling81,82.
Even when supplemented with external information, the informativeness of epidemiological studies of chronic disease endpoints for the purpose of pathway analysis is limited by the dichotomous nature of the phenotype. The information content may be improved by obtaining biomarker data on some of the intermediate steps in the process. Ideally, biomarker specimens would be sampled longitudinally and before disease onset. This may be prohibitively expensive, so the two-phase case-control design samples individuals from a cohort or case-control study based on disease, exposure, and genotype information83. Nested case-control studies within biobanks overcome the problem of reverse causation by using stored specimens and exposure information obtained at enrollment. Mendelian randomization84,85 provides another way to avoid reverse causation by using genes (which are not subject to this problem) as instrumental variables86 for the biomarker–disease relationship. In a randomized trial of estrogen plus progestin, Dai et al.87 used a two-phase design to assess interactions of treatment with thrombosis biomarkers and found that estimates of the interaction effect were considerably more precise than those from the case-control study alone or standard two-phase estimators not assuming G-E independence.
Although the approaches described above could be used in a genome-wide context, the enormous cost, computational burden, multiple comparisons penalty, and general absence of prior knowledge about most SNPs pose additional complexities. For main effects of genes, various design and analysis issues have been widely discussed88,89, so the remainder of this Review focuses on the use of GWA data for G×E. Both two-stage genotyping designs and two-step analyses of a single-stage design discussed below could be applied to interaction studies (Box 5). In contrast to the pathway-based approaches in the previous section, these novel techniques are readily applicable to GWA data now.
Although any of the designs for studying G×E interactions with single genes could be used for GWA studies including interactions (GEWIS), the following five have the potential to greatly improve power or cost-efficiency:
The two-stage genotyping design90 has been extended to GWA scale91-94 and used to discover main effects in many studies. The design is also attractive for GEWIS, but requires choices about how to select the SNPs to be carried forward to the second stage based on promising main effects and interactions. Any SNP for which the main effect or any of the G×E/G×G interaction tests attained the appropriately Bonferroni-corrected significance level would be chosen for inclusion in stage 2 genotyping. While an optimal selection of numbers of hits of each type to pursue so as to maximize the yield of true positives would require knowledge of the distribution of true effect sizes of each type, reasonable bets might be made based on previous literature and calculation of the power to detect similar effects.
A conventional two-step analysis of G×G interactions in a single-stage GWA study restricts the search for interactions to gene pairs for which one or both members shows a marginal association. It can be more powerful than an exhaustive scan for all possible pair-wise interactions, but risks missing those with no or weak marginal effects8,95-97. In addition, scanning for higher-order (G×G×G…) interactions is computationally infeasible without filtering based on main effects and/or lower-order interactions. While this filtering approach could also be applied to G×E interactions, it does not exploit the ability of the following two-step approaches to use different designs.
The case-only design is appealing for a GEWIS because of its greater power than the case-control design and because most GWA SNPs are unlikely to be correlated with environmental factors in the source population. Nevertheless, some false positives due to G-E association may occur, and even if only a small proportion of all SNPs were associated, they could represent a high proportion of all reported G×E interactions. Since any scan for interactions is likely to have been accompanied by a main effects scan, controls are probably available anyway, so it would be wasteful not to use them. (The exception would be if public controls with no environmental data, or non-comparable data, were used for the main effects scan, combining case-only information on G×E interactions with case-control information on genetic main effects98.) Two basic approaches have been suggested for taking advantage of controls to protect against false positives while exploiting the power advantage of the case-only design. Murcray et al.99 introduced a two-step analysis of a single-stage GWA study (FIG 1), in which G-E association is first tested in the combined case and control sample and only the most significant SNPs are then tested for G×E interaction using the standard case-control test. The second general approach is the empirical Bayes34 or Bayes model averaging35 methods that combine the case-only and case-control estimators to provide a reasonable trade-off between validity and efficiency. Simulation studies show that these approaches can have better power than the two-step analysis over a range of modest interaction relative risks, while the two-step approach is more powerful for larger relative risks.
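The two-step analysis described above can be sketched as a screening step followed by a corrected test. The thresholds are assumptions for illustration: the step 1 cutoff `alpha1` is arbitrary, and applying a Bonferroni correction for only the SNPs that pass step 1 is a common choice rather than a detail stated in the text. The p-values are taken as precomputed inputs and all names are hypothetical.

```python
def two_step_scan(step1_p, step2_p, alpha1=0.05, alpha=0.05):
    """Screen on G-E association, then test GxE among the survivors.

    step1_p : dict SNP -> p-value for G-E association in cases + controls
    step2_p : dict SNP -> p-value for the case-control GxE interaction test
    Returns (SNPs passing step 1, SNPs declared significant in step 2).
    """
    passed = [snp for snp, p in step1_p.items() if p < alpha1]
    if not passed:
        return [], []
    # Correct only for the number of tests actually carried out in step 2.
    threshold = alpha / len(passed)
    hits = [snp for snp in passed if step2_p[snp] < threshold]
    return passed, hits


# Hypothetical p-values for four SNPs.
screen = {"rs1": 0.001, "rs2": 0.30, "rs3": 0.04, "rs4": 0.80}
interaction = {"rs1": 0.01, "rs2": 0.0001, "rs3": 0.20, "rs4": 0.50}
print(two_step_scan(screen, interaction))
```

In this toy example rs2 has the smallest interaction p-value but is never tested because it fails the screen, which illustrates the trade-off: the reduced multiple-testing burden in step 2 comes at the cost of missing interactions that the screening statistic does not detect.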
Another possible approach to saving on genotyping costs is DNA pooling, at least for an initial screen, to be followed by individual genotyping of promising loci100. Beyond the technical challenges in forming comparable pools and assaying allelic concentrations, this approach would be feasible for studies of G×E interactions only if the pools were stratified on the basis of exposure, thus limiting the number of possible environmental factors that could be considered. Recent advances in DNA bar-coding101, however, would permit the reconstruction of individual genotypes from within pools, thereby allowing a broader range of interaction analyses102.
One must sift through a massive number of potential “hits” to decide which should be considered in subsequent stages of a multi-stage genotyping design, in independent replication studies, or in functional assays. This decision is usually based on statistical significance, but also entails expert judgment based on the internal consistency of the results and the coherence with other knowledge (e.g., the existence of other GWA associations for the same or related traits or biological pathways). Coherence has tended to be a more informal judgment, but various methods have emerged for formalizing this process. The following techniques can be viewed as well established and available for application now, although because of their novelty, there are few applications so far. See REF. 103 for an excellent review of the available techniques in the context of genetic main effects.
One of the first was a weighted False Discovery Rate (FDR) approach104, which uses external information to prioritize some SNPs or regions while maintaining a fixed overall FDR. Bayesian versions of the FDR have also been described105,106, as well as the use of Bayes factors107 and empirical Bayes shrinkage108. Both GSEA and hierarchical modeling approaches are also amenable to incorporating external knowledge. Several authors109-111 have described applications of the hierarchical Bayes modeling approach to GWA data using prior covariates extracted from genomic or pathway ontologies. While these have focused on main effects, the methods are also applicable to GEWIS11, the limiting factor presently being the lack of suitable ontologies for interaction effects. Meanwhile, a growing literature is discussing various ways of using GSEA or other methods of integrating pathway knowledge into GWA analyses9,62-64,112-116. Few studies have explicitly included G×E interactions in formal pathway-based analyses of GWA data117. A promising approach entails incorporating metabolomics, as in the first GWA of a large panel of metabolite phenotypes118, which found associations of 4 genes with metabolite concentration ratios for enzymatic activities that matched the pathways in which these enzymes act.
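A weighted FDR procedure of the kind mentioned at the start of this paragraph can be sketched as a weighted Benjamini-Hochberg step-up rule: each p-value is divided by a prior weight (rescaled to average one), and the usual procedure is applied to the result. This is a generic illustration, not necessarily the exact procedure of the cited work. The weights, which would come from external information such as pathway membership, are hypothetical here.

```python
def weighted_bh(p_values, weights, q=0.05):
    """Indices rejected by a weighted Benjamini-Hochberg procedure.

    p_values : list of p-values, one per SNP or region
    weights  : non-negative prior weights (rescaled here to mean 1)
    q        : target false discovery rate
    """
    m = len(p_values)
    mean_w = sum(weights) / m
    adjusted = [min(1.0, p / (w / mean_w)) if w > 0 else 1.0
                for p, w in zip(p_values, weights)]
    order = sorted(range(m), key=lambda i: adjusted[i])
    # Step-up rule: find the largest rank whose value is under its threshold.
    k = 0
    for rank, i in enumerate(order, start=1):
        if adjusted[i] <= q * rank / m:
            k = rank
    return sorted(order[:k])


p = [0.001, 0.04, 0.03, 0.5]
print(weighted_bh(p, [1, 1, 1, 1]))          # unweighted: [0]
print(weighted_bh(p, [0.5, 0.5, 2.5, 0.5]))  # third test up-weighted: [0, 2]
```

Because the weights average one, up-weighting some hypotheses necessarily down-weights others, so the prioritization pays off only when the external information is informative.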
An emerging idea is to use Bayesian network analysis119-121 or similar techniques to discover novel pathways. Bayesian networks have been widely used in the analysis of gene co-expression data to discover cliques of interacting loci. The starting point is usually a matrix of gene-gene correlations across multiple experimental conditions (e.g., time series of synchronized cell cultures or different environmental stressors), which can be used to derive a parsimonious graphical representation of the important interactions. Unlike co-expression data, GWA data provide only a single estimate of the association between genotype and phenotype, but no information about gene-gene connections. G×G interaction analyses do, however, yield information about pairs of genes that could be mined in a similar way, as could G×E interactions. Sebastiani et al.10 applied the technique to modeling the posterior probability of genotypes and exposures given disease status, yielding graphical models that can be interpreted in terms of interactions. However, these probabilities depend on both the risk of disease given G and E (and their interactions) and the correlations among these factors, so they do not represent a pure interactome122 model. Alternatively, a known network can be used as either a prior covariance matrix for main effects or as prior covariates for interactions in a hierarchical model (Box 4). Although potentially exciting, such methods have yet to be applied on a GWA scale.
Experimental studies offer unique promise for validating G×E interactions, as both exposure and genotypes can be carefully controlled through randomization. Model organisms are commonly used for evaluating genetic modifiers of drug response; for example, Koch and Britton123 used selective breeding of rats on aerobic capacity to study gene-diet interactions in body weight and various metabolic markers. In human challenge studies, a randomized crossover design is typically used, in which volunteers are exposed to one or more environmental exposures in random order. In one intra-nasal challenge study of allergen alone or with diesel exhaust particles, various immunological responses were measured124. Stratified analyses revealed that those with the GSTM1 null or GSTP1 I/I genotypes had significantly larger increases in IgE and histamine levels after diesel challenge. Subjects were not preselected on the basis of genotype, so results were limited by the relatively small numbers of subjects with the susceptible genotypes. Challenge studies nested within epidemiologic cohorts for which genotypes (and possibly various outcomes) are already available could be more powerful.
Clinical trials also allow controlled comparisons for G×E interactions and more powerful designs using two-phase sampling on various combinations of genotype, treatment, outcomes, and possibly other factors93,125. For example, Israel et al.126 performed a clinical trial of albuterol in asthmatics, matching pairs on forced expiratory volume and β2AR genotypes, and found a highly significant gene × treatment interaction. A case-only design nested within a clinical trial is particularly appealing for evaluating gene-treatment interactions on survival or other treatment responses, as treatment assignment is independent of genotype by virtue of randomization127,128.
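The case-only logic can be sketched in a few lines. Because randomization makes treatment independent of genotype in the trial population, the genotype-by-treatment odds ratio among cases alone estimates the multiplicative gene × treatment interaction. The following is a minimal sketch using a simple 2×2 table and a Wald confidence interval; the function name `case_only_interaction` and the counts are hypothetical, not data from any cited trial.

```python
import math

def case_only_interaction(a, b, c, d, z=1.96):
    """Case-only estimate of a multiplicative gene x treatment interaction.

    All counts refer to cases (e.g., trial participants with the event):
      a: carriers, treated        b: carriers, untreated
      c: noncarriers, treated     d: noncarriers, untreated
    Assumes genotype and treatment are independent in the source
    population, which randomization is meant to guarantee.
    Returns (odds_ratio, (ci_lower, ci_upper)).
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (a * d) / (b * c)
    # Standard error of the log odds ratio (Woolf's formula)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    ci = (math.exp(log_or - z * se), math.exp(log_or + z * se))
    return odds_ratio, ci

# Hypothetical counts among cases only
or_hat, (lo, hi) = case_only_interaction(40, 20, 60, 60)
print(f"interaction OR = {or_hat:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

An odds ratio of 1 corresponds to no departure from a multiplicative joint effect; the estimate says nothing about the main effects of genotype or treatment, which require the non-cases.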
The biggest barrier to integrating biological knowledge with agnostic GEWIS data may be the lack of ontologies designed to bring together information not only on SNPs, genes, and pathways, but also on their relevant environmental substrates, known relationships to disease, metabolic parameters, and toxicological information. The creation of such a database is arguably one of the most important contributions of the Human Genome Epidemiology Network (HuGE NET) project129, but is highly labor-intensive because expert curation of the literature is needed; their valuable series of reviews on specific topics130,131 does not replace the need for a searchable database that could provide prior covariate information in a systematic and unbiased manner. Automatic literature-mining approaches132,133 have been developed that can help assign sets of genes to shared pathways or interaction networks. However, they are still vulnerable to bias in what is investigated and published; the current literature on G×E interactions is very sparse, highly subject to publication bias, poorly replicated, and tends to reflect a “looking under the lamppost” mentality in terms of what gets studied. Other genomic or pathway ontologies134-136 tend to be limited to purely genetic information and are only partially useful for G×E modeling.
One of the aims of pathway-based modeling is to understand how genetic and environmental effects are mediated through intermediate events such as changes in gene expression, epigenetic events like DNA methylation137, somatic mutations138, and small-interfering RNAs139. These phenomena have been studied in relation to disease and to a lesser extent exposure140,141, but the full pathways from genes and exposures through epigenetics to disease remain to be studied137. For example, the seminal observation142 that MZ twins start life with identical methylation patterns but subsequently diverge suggests the effect of environmental factors and may provide a mechanism for their subsequent discordance in disease. Latent variable models could be used to treat biomarker measurements as surrogate observations of a long-term unobserved process leading to disease. Various –omics technologies could provide high-dimensional measurements of intermediate processes on targeted subsamples of epidemiologic study subjects, although the multiple comparisons challenges of relating high-dimensional phenotypes to high-dimensional genotypes and interactions are even more daunting than for regular GWA studies. Alternatively, stand-alone studies or external databases can be used to construct prior covariates to inform G×E analyses of epidemiologic studies. For example, GWA data on immunologic markers for a challenge study of allergen and diesel exhaust particles are being used to define a set of immunologic covariates associated with each SNP as priors in a hierarchical model for a GWA study of asthma. Associations of genome-wide expression with genome-wide SNPs143 could be used in a similar manner, and would likely be even more promising for G×E interactions if based on expression studies conducted under a range of environmental conditions.
Increasing attention is being paid to the possibility that rare variants might account for at least some of the missing heritability144. Next-generation sequencing methods are making it feasible to sequence portions of the genome identified through a GWA study in a subset of study subjects. Until it becomes possible to obtain and manage genome-wide sequence information on the massive sample sizes that would be required to discover associations with rare variants directly, some form of informative sampling will be required. For example, one might sequence a subsample of cases and controls, stratified by associated SNPs in a given region, family history, and environmental factors, both to discover novel variants in the region and to permit a joint analysis of subsample and main study data94,145. The imminent availability of the 1000 Genomes Project146 data will doubtless have a profound effect on the design of such studies.
Insights from G×E interactions could have important policy implications for environmental health standards147, targeting of interventions148, and treatment selection149 (Box 2). For example, the Clean Air Act directs the U.S. Environmental Protection Agency to set standards to protect the most sensitive, including genetically susceptible individuals150, although it has been argued that public health interventions aimed at the whole population may be more effective151. As another example, suppose the joint effect of mutations in BRCA1/2 and radiotherapy in an individual were multiplicative; then even if the radiation effect in mutation carriers alone was not statistically significant or the joint effect was not significantly greater than additive, it would be misleading to conclude that radiotherapy was no more dangerous for carriers than for noncarriers, owing to their much higher baseline risk152. Since any statement about interaction is necessarily scale dependent (Box 1), it is essential that claims about the presence or absence of an interaction make clear whether it is a departure from an additive or multiplicative model on a scale of absolute or attributable risk, odds, underlying liability, or some other scale that is being discussed. Unfortunately, translation of scientific understanding about G×E interactions into risk assessment and prevention policies has so far been limited153.
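The scale dependence in the radiotherapy example can be made concrete with a small calculation. The sketch below uses purely hypothetical relative risks (not estimates from any cited study) chosen so that the joint effect is exactly multiplicative, and then reports the same data on the additive scale using the relative excess risk due to interaction (RERI) and the absolute excess risk attributable to exposure in each genotype group.

```python
def interaction_measures(rr_g, rr_e, rr_ge):
    """Summarize a joint effect on multiplicative and additive scales.

    rr_g:  relative risk for genotype alone (vs. unexposed noncarriers)
    rr_e:  relative risk for exposure alone
    rr_ge: relative risk for genotype and exposure together
    Returns (multiplicative_ratio, reri):
      multiplicative_ratio = 1 means no departure from a multiplicative model
      reri = 0 means no departure from an additive model
    """
    multiplicative_ratio = rr_ge / (rr_g * rr_e)
    reri = rr_ge - rr_g - rr_e + 1.0
    return multiplicative_ratio, reri

def excess_risk_from_exposure(baseline_risk, rr_without, rr_with):
    """Absolute excess risk due to exposure within one genotype group."""
    return baseline_risk * (rr_with - rr_without)

# Hypothetical values: carriers have 10x baseline risk, exposure gives 1.5x,
# and the joint effect is exactly multiplicative (10 * 1.5 = 15).
rr_g, rr_e, rr_ge = 10.0, 1.5, 15.0
baseline = 0.01  # hypothetical risk in unexposed noncarriers

ratio, reri = interaction_measures(rr_g, rr_e, rr_ge)
excess_noncarriers = excess_risk_from_exposure(baseline, 1.0, rr_e)
excess_carriers = excess_risk_from_exposure(baseline, rr_g, rr_ge)

print(f"multiplicative ratio = {ratio:.2f}, RERI = {reri:.2f}")
print(f"excess risk from exposure: noncarriers = {excess_noncarriers:.3f}, "
      f"carriers = {excess_carriers:.3f}")
```

Under these assumed numbers there is no interaction on the multiplicative scale, yet the absolute excess risk from exposure is ten times larger in carriers, which is why a statement of "no interaction" must name its scale.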
The current enthusiasm for studying genetic associations with disease, recently enhanced by the advent of GWA studies, has tended to overshadow the important role of environmental factors and G×E interactions. While these are much more difficult to study than purely genetic associations, requiring careful collection of exposure data and rigorous study designs, standard epidemiologic designs can be used and several recently developed variants of them can enhance power. Nevertheless, large consortia will likely be needed to fully explore G×E interactions, requiring attention to these principles and harmonization across studies. The use of powerful pathway-based methods that leverage external biological knowledge can further enhance power and insight.
Duncan Thomas is Professor and Director of the Biostatistics Division of the Department of Preventive Medicine in the Keck School of Medicine at the University of Southern California and holds the Verna Richter Chair in Cancer Research. His major research interests are in the development of study design and statistical analysis methods for genetic and environmental epidemiology and their interface. He has been a coinvestigator on numerous studies ranging from radiation carcinogenesis to the health effects of air pollution and their genetic modifiers.