For decades, the pages of the *Journal* have been filled with philosophical debates over the meaning of words such as “interaction” and “synergism,” as well as distinctions among statistical, biologic, and public health contexts. Recently, there has been a resurgence of interest in this topic by a new cohort of genetic epidemiologists working on gene-environment (*G-E*) and gene-gene (*G-G*) interactions. Various authors have offered classifications of patterns of *G-E* interaction (1–3), although the concept of “epistasis” (*G-G* interaction) can be traced back to 1909 (4) and of *G-E* interaction to 1938 (5).

Most epidemiologic analyses of interactions have tested for a departure from some simple main effects model, most commonly a multiplicative one. Without belaboring the issue, we point out that this may not be interpretable as a biologic interaction or a synergistic public health impact (6). With the advent of genome-wide association studies, discussions have shifted from these philosophical topics to more practical concerns about study designs and analysis methods for discovering *G-E* interactions on a massive scale, what Khoury et al. called a “GEWIS” (gene-environment-wide interaction study(ies)) (6, 7). The advent of “EWAS” (environment-wide association studies) (8) and the “exposome” concept (9, 10) is likely to ratchet the importance of this topic up yet another notch. Two papers in this issue (11, 12) compare the performance of several novel approaches to GEWIS analysis.

Mukherjee et al. (11) used simulation to compare case-control, case-only, and several approaches that combine them in various ways. They found, as expected, that the case-only design generally yields the greatest power but is subject to substantial false positives in the presence of *G-E* associations. Empirical Bayes (13), Bayesian model averaging (14), and 2-step (15) methods all yield better power than the case-control method in most situations, with some performing better than the others for particular parameter combinations. The one notable exception is when a population-level *G-E* association goes in the opposite direction from the *G-E* interaction. Here, the case-only method also has low power because the *G-E* association among cases will tend to be small. Because any GEWIS is testing many different single nucleotide polymorphisms (SNPs) (and often several different exposure variables), there is no uniformly most powerful procedure across the full range of possible model parameters.
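The relationship between the two basic tests can be illustrated on a single 2 × 2 × 2 table. The sketch below assumes a binary genotype and a binary exposure, uses the usual large-sample (Woolf) standard error for a log odds ratio, and relies on made-up counts and hypothetical function names; it is not the simulation code of Mukherjee et al.

```python
import math

def log_odds_ratio(n11, n10, n01, n00):
    """Log G-E odds ratio and its Woolf standard error for one 2x2 table.

    Cell order: (G=1,E=1), (G=1,E=0), (G=0,E=1), (G=0,E=0).
    """
    log_or = math.log((n11 * n00) / (n10 * n01))
    se = math.sqrt(1 / n11 + 1 / n10 + 1 / n01 + 1 / n00)
    return log_or, se

def interaction_estimates(cases, controls):
    """Return (estimate, SE) pairs for the case-only and case-control tests.

    Case-only: the G-E log odds ratio among cases alone.
    Case-control: the difference between the G-E log odds ratios in cases
    and in controls (the multiplicative interaction on the log scale).
    """
    lor_cases, se_cases = log_odds_ratio(*cases)
    lor_controls, se_controls = log_odds_ratio(*controls)
    case_only = (lor_cases, se_cases)
    case_control = (lor_cases - lor_controls,
                    math.sqrt(se_cases ** 2 + se_controls ** 2))
    return case_only, case_control

# Illustrative (invented) counts; G and E are exactly independent in controls.
example_cases = (40, 60, 60, 140)
example_controls = (25, 75, 75, 225)
case_only, case_control = interaction_estimates(example_cases, example_controls)
print("case-only   :", case_only)
print("case-control:", case_control)
```

With these counts the two point estimates coincide because the *G-E* odds ratio among controls is exactly 1, while the case-only standard error is smaller, which is the source of its power advantage. If the odds ratio among controls departs from 1, the case-only estimate absorbs that departure and no longer estimates the interaction alone.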

One observation reported by Mukherjee et al. (11) is that, for a fixed number of cases and a fixed screening threshold α₁, the power of the 2-step method appears to decline with increasing number of controls, the only example we are aware of where having more data appears to be worse! However, the reason for this apparent “lack of coherence” is that the power of the 2-step design for a fixed α₁ (at α₁ = 0.05 or 0.0005) and the optimal first-step critical value α₁ depend strongly on the control:case ratio. If one chooses the optimal α₁ for a given control:case ratio, the power does increase monotonically with increasing sample size, as one might expect. In addition to the control:case ratio, the optimal choice for α₁ depends strongly on the population disease prevalence and number of SNPs analyzed. Because all of these quantities are known or easily estimated prior to analysis, choosing an optimal or near-optimal α₁ is possible in practice (16).
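The role of α₁ is easier to see when the 2-step logic is written out. The following is a minimal sketch, assuming that the per-SNP *P* values for both steps have already been computed: the screening *P* values are taken to come from a test of *G-E* association in the combined case-control sample, as in the 2-step approach (15), and the second step is assumed to apply a Bonferroni correction only over the SNPs that pass the screen. The function and argument names are hypothetical.

```python
def two_step_scan(screen_p, interaction_p, alpha1=0.05, alpha=0.05):
    """Sketch of a 2-step G-E interaction scan.

    screen_p      : step-1 P values (G-E association screen), one per SNP
    interaction_p : step-2 P values (standard case-control interaction test)
    alpha1        : screening threshold for step 1
    alpha         : overall significance level, split over SNPs passing step 1
    Returns the indices passing the screen and the indices declared significant.
    """
    passed = [i for i, p in enumerate(screen_p) if p < alpha1]
    if not passed:
        return passed, []
    threshold = alpha / len(passed)  # Bonferroni over the passing SNPs only
    hits = [i for i in passed if interaction_p[i] < threshold]
    return passed, hits

# Invented P values for four SNPs.
passed, hits = two_step_scan(
    screen_p=[0.001, 0.2, 0.03, 0.8],
    interaction_p=[0.01, 0.0001, 0.04, 0.5],
)
print(passed, hits)
```

A small α₁ passes few SNPs and so relaxes the step-2 threshold, but it also risks screening out true interactions (the second SNP above is never tested despite its small interaction *P* value). That tradeoff is why an optimal α₁ exists and why it shifts with the design.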

In their discussion, Mukherjee et al. (11) claim that the 2-step procedure violates the likelihood principle (and that the other methods compared do not). Actually, any test of significance (whether multistep or not) violates the likelihood principle, as it does not rely exclusively on the likelihood for inferences but considers unobserved outcomes as well (more extreme ones than actually observed) (17). Multistep methods have a long tradition in statistics. For example, sequential and group sequential methods have been used in industrial applications to reduce costs and in clinical trials to minimize potential adverse side effects. Valid multistep methods have also been proposed in situations where the data are not collected sequentially, often leading to substantial improvements in power (18, 19).

We are aware of 2 applications to real data of the 2-step *G-E* approach. Ege et al. (20) applied the approach to data on asthma from genome-wide association studies from the GABRIELA consortium. They identified 15 genes showing evidence of interaction with farm-related variables, although none attained genome-wide significance. Figueiredo et al. (21) compared case-only, case-control, and the 2-step procedure on data from the Colorectal Cancer Family Registries for interactions with 14 established environmental risk factors. None attained genome-wide significance by any of the 3 methods. This work points out the difficulty in identifying *G-E* interactions for a complex trait and suggests the need for quite large sample sizes in addition to efficient analytical approaches. Programs to compute required sample size for interaction tests are available for several study designs (22, 23) including 2-step testing in a genome-wide association study (16).

Cornelis et al. (12) took a different approach, comparing similar methods on real data from 2 large GEWIS of type 2 diabetes. Here, the interacting variable was an “adipogenic environment,” as measured by a dichotomization of body mass index. What makes this an interesting application is that this “exposure” variable is also partially under genetic control, probably by some of the same genes that are involved in diabetes, so one might expect a substantial number of false positives due to *G-E* associations. Surprisingly, on the basis of an examination of the quantile-quantile (QQ) plot of *P* values, they found no evidence of an inflated type I error rate for any of these tests, even the simple case-only test. However, the loci that yielded the most significant interactions were generally those that were also most strongly associated with obesity, which should cause some concern. As the true state of nature is, of course, unknown, it isn’t possible to assess whether any of these are real interactions or simply reflect *G-E* associations.

One interesting observation in this paper was the lack of robustness of even the standard case-control test when the model for a continuous exposure variable was misspecified. This phenomenon was recently explored by Tchetgen Tchetgen and Kraft (24), who also proposed using a robust sandwich variance estimator that does not require dichotomization of the exposure variable (with its inherent loss of power) (25). Alternatively, more flexible modeling of the exposure using, e.g., generalized additive models (26) could also help to make the inferences on *G-E* interactions with quantitative exposures more robust.

So how plausible is the assumption of *G-E* independence? Most tag-SNPs used in genome-wide association studies are unlikely to be related to either the disease or exposure, so a screening tool using a case-only test seems reasonable. Even if a small proportion of the interactions discovered in this way are false positives, they will be weeded out by a second step that does not rely on this assumption. Nevertheless, there are a few circumstances where caution is warranted. One is for diseases with a strong behavioral component, such as lung cancer where many genes might be associated with nicotine addiction. Hormone-related cancers are another example where various genes could influence a woman’s age at menarche, menopause, or reproductive history as the “exposures” of interest. A third example is a nonrandomized study of treatment outcomes (e.g., second-cancer studies), where indications for treatment could relate to disease severity or other characteristics that are genetically influenced. Uncontrolled population stratification can easily induce spurious *G-E* associations due to confounding by genetic ancestry and cultural factors influencing exposures, emphasizing the importance of proper adjustment for ancestry covariates (27). Finally, differential survival over time can induce associations between genes and exposures so that both are risk factors even if the 2 are independent initially (28). Uncontrolled confounding of the *G-E* association will lead to inflated type I errors for the case-only or empirical Bayesian approach, but not for the case-control or 2-step approaches.

Although it is tempting to pretest for *G-E* independence among controls and on this basis decide whether to use the more powerful case-only test (which requires it for validity) or the more robust case-control test (which does not), Albert et al. (29) showed that this can be a seriously biased strategy. This bias arises because the pretesting is ignored when assessing significance in the follow-up test. Unlike the screening step in the Murcray et al. approach, the test of *G-E* independence using only controls is not independent of the standard case-control test. However, at least in principle, pretesting for *G-E* independence in controls would result in an acceptable 2-step test if one properly accounted for pretesting by either conditioning on the outcome of the pretest or by considering the true unconditional distribution of the resulting test statistic. This distribution is a weighted mixture of the case-only and case-control statistics with weights given by the probabilities of acceptance and rejection of the hypothesis of *G-E* independence in the pretest. The latter is similar to the empirical Bayes and the Bayes-model-averaging procedures, both of which are weighted averages of the 2 statistics. Nevertheless, one should be cautious about blindly disregarding concerns about the validity of the *G-E* independence assumption. Even if the various 2-step procedures (properly applied) ensure a valid test, power can be adversely affected if the first step passes too high a proportion of false positives to the second step.

As pointed out by Mukherjee et al. (11), the 2-step approach is the only alternative to the standard 1-step case-control test that guarantees asymptotic control of the type I error under departures from *G-E* independence. Both the empirical Bayes and Bayes model-averaging statistics, being weighted averages of the case-only and case-control statistics, are necessarily liberal under departures from *G-E* independence. Thus, because even a modest inflation of the type I error can translate into a sizable power increase, some of the power gain of the empirical Bayes and the Bayes model-averaging approaches over the standard 1-step case-control might be due to type I error inflation. One can argue that, among tests with similar power, it is preferable to use one that controls the type I error. After all, why trade an unknown increase in type I error (even if small) for extra power when one can simply increase the level of significance to achieve the same goal but with a known type I error?
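The link between type I error and power can be made concrete with a generic calculation. The sketch below assumes a two-sided test whose statistic is normally distributed with unit variance and a given noncentrality under the alternative; the noncentrality value and the two significance levels are arbitrary illustrations, not quantities taken from either paper.

```python
from statistics import NormalDist

def z_test_power(noncentrality, alpha):
    """Power of a two-sided z-test whose statistic is N(noncentrality, 1)."""
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha / 2)
    upper_tail = 1 - std_normal.cdf(critical - noncentrality)
    lower_tail = std_normal.cdf(-critical - noncentrality)
    return upper_tail + lower_tail

# Same effect size evaluated at a nominal level and at a level twice as large.
nominal_power = z_test_power(noncentrality=5.0, alpha=5e-8)
inflated_power = z_test_power(noncentrality=5.0, alpha=1e-7)
print(nominal_power, inflated_power)
```

Comparing the two printed values shows how much apparent power is bought simply by testing at a more liberal level, which is the comparison to keep in mind when a method's type I error is not controlled.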

However, perhaps a deeper question is whether we need be concerned about type I error at all in a climate that demands independent replication before publication in a top-tier journal. Won’t virtually all false positives be weeded out by the requirement of genome-wide significance in the discovery sample followed by significance at a conventional replication level such as α = 0.05? Perhaps yes, but this may be too conservative a requirement, tantamount to requiring a genome-wide significance level of α = 0.05² = 0.0025. The role of replication is more to rule out bias and to ensure generalizability by testing associations or interactions with different methods, by different investigators, in different populations, than to avoid chance statistical flukes (which can always be accomplished within a single study simply by adopting a more stringent significance level) (30, 31). If independent replication is planned anyway, then a good case could be made for always using the most powerful case-only test for the initial scan, provided the replication is performed with a case-control test in an independent data set. This, however, is not always an option in practice. In particular, for unique exposure situations, uniquely well-characterized cohorts, or consortia that comprise essentially the entirety of the world’s data to generate sufficient cases for studying rare diseases, replication may never be feasible (32). These situations put a premium on using a powerful testing procedure that maintains control of the type I error rate.
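The arithmetic behind the 0.05² figure can be sketched as follows. This is one reading of the argument, under assumptions of our own: *M* denotes the number of tests in the scan (a symbol not used above), the discovery threshold is a Bonferroni-corrected 0.05/*M*, and the replication sample is independent of the discovery sample.

```latex
\Pr(\text{a given null test passes both stages})
  = \underbrace{\frac{0.05}{M}}_{\text{discovery}} \times \underbrace{0.05}_{\text{replication}}
  = \frac{0.05^{2}}{M}
  = \frac{0.0025}{M}
```

Under these assumptions, the per-test threshold 0.0025/*M* is what a single-stage Bonferroni scan would use at a genome-wide level of 0.0025 rather than 0.05.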

These 2 papers are certainly not the last word on this subject. Various extensions of 2-step procedures are possible. Kraft et al. (33) discussed the use of a 2 df joint test for gene main effects and *G-E* interactions, where the goal is not to detect the interaction per se but rather to identify genes that may be etiologically relevant either directly or through an interaction (this test is also evaluated in the paper by Cornelis et al. (12)). Our group recently described 2 different kinds of 2-step procedures, one for case-parent trios that exploits a between-family comparison of *G-E* association among the parents (19) and a hybrid approach (16) for case-control data that screens SNPs on the basis of both marginal association (34) and *G-E* association (15). Similar methods are applicable for *G-G* interactions, where the multiple comparisons burden is orders of magnitude more severe (half a trillion tests for an exhaustive scan of 1 million SNPs) and the power advantages of a 2-step method may be even larger than for *G-E* interaction scans (35). Tests that exploit Hardy-Weinberg equilibrium in the population (36), discussed in the contribution by Cornelis et al. (12), may also enhance power. As we move into the era of targeted, whole-exome, or even whole-genome sequencing, 2-step procedures for interaction testing may become even more necessary. Power for testing interactions with specific rare variants is likely to be minuscule, but interaction testing for aggregate indices of multiple rare variants in a gene or for discovering more complex pathways may be feasible. Exposure measurement error is a longstanding problem and can have unpredictable effects on *G-E* interactions, although in general it is likely to make their detection more difficult (37–39). The larger sample sizes available in a consortium setting may be necessary to achieve adequate power (40). Methods to analyze *G-E* interaction in the consortium setting have begun to appear (41), but this is an area of statistical research that requires more attention. All the GEWIS methods discussed so far are “agnostic” (with respect to the genes), but methods that incorporate external genetic and environmental information offer further potential to achieve substantial power gains (42, 43).
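The “half a trillion” figure quoted above for an exhaustive *G-G* scan is simply the number of unordered SNP pairs. The short check below also computes the Bonferroni threshold implied by that count, assuming a family-wise level of 0.05 (our choice for illustration); the function name is hypothetical.

```python
import math

def pairwise_tests(n_snps):
    """Number of unordered SNP pairs in an exhaustive G-G interaction scan."""
    return math.comb(n_snps, 2)

n_pairs = pairwise_tests(1_000_000)
bonferroni_threshold = 0.05 / n_pairs  # assumes a family-wise level of 0.05
print(n_pairs)                # 499999500000, i.e., about half a trillion
print(bonferroni_threshold)
```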

We commend Mukherjee et al. (11) for their rigorous comparison of several methods and Cornelis et al. (12) for their thoughtful application of methods to real data. Together, these papers raise a number of important issues, including the largely untapped potential of GEWIS to discover novel genetic variants in existing genome-wide association data sets. For nearly all complex human diseases, it is clear that neither genes nor environmental factors are exclusively to blame for increased risk. As we move forward, well-designed studies with careful measurement and efficient analysis of both genetic and environmental factors will likely hold the key to further understanding complex disease etiologies.