Cancer Inform. 2010; 9: 115–120.
Published online 2010 May 12.
PMCID: PMC2879605

An Efficient Gatekeeper Algorithm for Detecting GxE


The risk for many complex diseases is believed to be a result of the interactive effects of genetic and environmental factors. Developing efficient techniques to identify gene-environment interactions (GxE) is important for unraveling the etiologic basis of many modern day diseases including cancer. The problem of false positives and false negatives continues to pose significant roadblocks to detecting GxE and informing targeted public health screening and intervention. A heuristic gatekeeper method is presented to guide the selection of single nucleotide polymorphisms (SNPs) in the design phase of a GxE study.

Keywords: gene-environment interaction, multiplicity corrected confidence intervals, SNP microarrays


Advances in bioinformatics and genomics have opened the door for personalized medicine, enabling epidemiologists to identify genetic variations that predispose some, but not others, to disease. Single nucleotide polymorphisms (SNPs), which on average occur in about every 1,000 base pairs throughout the human genome, also may be useful for determining variability in individual response to treatment and could potentially lead to the development of novel therapeutics custom tailored to patients’ genetic profiles. However, studies often have failed to yield consistent findings or definitive results, in part because analyses have not accounted for gene-environment interactions (GxE).

While genetics play a significant role in many diseases, few common medical disorders are explained by a single SNP or genetic mutation. Rather, environmental factors are thought to modulate an individual’s genetic predisposition for certain diseases [1-3]. In the case of GxE, risk for disease occurs only when genetic and environmental factors are present in combination, while individual factors alone convey little or no risk for disease. Correctly identifying GxE is particularly difficult in the context of high density SNP arrays, because the number of multiple comparisons can be in the thousands.

In this paper, an efficient gatekeeper algorithm is presented to identify GxE. The technique involves computing a multiplicity adjusted lower bound on an indirect estimate for GxE. The indirect estimate is then used to independently screen for GxE in a direct disease association study in order to correctly identify risk or propensity for disease.


The method for indirectly estimating the odds ratio (OR) for GxE from a case-control study and the technique for computing multiplicity corrected confidence intervals for a relative effects estimate have been separately described in previous publications and are only briefly summarized below [4,5]. Their combination forms the basis for the procedure to screen for GxE.

Indirect OR estimate and confidence interval (CI) for GxE

The odds ratio (OR) for environmental exposure (E) associated with disease (D) in the population may be expressed as

\[
\mathrm{OR}(E \mid D) = \frac{P(E \mid D)/P(\bar{E} \mid D)}{P(E \mid \bar{D})/P(\bar{E} \mid \bar{D})}. \tag{1}
\]

Assuming that D is relatively rare in both exposed and unexposed populations, and that genotype (G) is independent of environmental exposure (E) [i.e., P(G|E)=P(G|Ē)=g], equation (1) simplifies to

\[
\frac{[P(D \mid GE)\,g / P(D \mid \bar{G}\bar{E})] + [P(D \mid \bar{G}E)(1-g) / P(D \mid \bar{G}\bar{E})]}{[P(D \mid G\bar{E})\,g / P(D \mid \bar{G}\bar{E})] + (1-g)}. \tag{2}
\]

Considering the simple case when OR(GĒ|D) = OR(ḠE|D) = 1, the OR for GxE given disease may be written as

\[
\mathrm{OR}(GE \mid D) = \frac{\mathrm{OR}(E \mid D) - 1 + g}{g}, \tag{3}
\]

where [OR(E|D) − 1] ≥ −g by unity constraints on the joint conditional probabilities. Treating (g) as fixed, the (1−α/2) × 100% CI for OR(GE|D) is approximately equal to

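\[
\exp\left\{ \log\left[\frac{\mathrm{OR}(E \mid D) - 1 + g}{g}\right] \pm Z_{(1-\alpha/2)}\, \frac{\mathrm{OR}(E \mid D)\,\sqrt{\operatorname{var}(\mathrm{OR}(E \mid D))}}{\mathrm{OR}(E \mid D) - 1 + g} \right\}, \tag{4}
\]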

where Z(1−α/2) is the (1−α/2) × 100th percentile of a standard normal distribution.
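The point estimate in equation 3 and the interval in equation 4 can be sketched in a few lines of Python. This is a minimal illustration rather than the author's implementation: the function names are hypothetical, and the variance term in equation 4 is read here as the variance of log OR(E|D), so that the standard error of log OR(GE|D) is OR(E|D) × SE[log OR(E|D)] / [OR(E|D) − 1 + g]. That reading is an assumption.

```python
import math
from statistics import NormalDist


def indirect_or_gxe(or_e: float, g: float) -> float:
    """Equation 3: OR(GE|D) = [OR(E|D) - 1 + g] / g."""
    if not 0.0 < g <= 1.0:
        raise ValueError("g must lie in (0, 1]")
    numerator = or_e - 1.0 + g
    if numerator <= 0.0:
        raise ValueError("OR(E|D) - 1 + g must be positive")
    return numerator / g


def indirect_or_gxe_ci(or_e: float, se_log_or_e: float, g: float,
                       alpha: float = 0.05) -> tuple[float, float, float]:
    """Return (lower, estimate, upper) for OR(GE|D), treating g as fixed.

    Assumption: the variance term is read as var[log OR(E|D)], so that
    SE[log OR(GE|D)] = OR(E|D) * SE[log OR(E|D)] / [OR(E|D) - 1 + g].
    """
    estimate = indirect_or_gxe(or_e, g)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    se_log_or_ge = or_e * se_log_or_e / (or_e - 1.0 + g)
    half_width = z * se_log_or_ge
    # The interval is symmetric on the log scale around the estimate.
    return (estimate * math.exp(-half_width), estimate,
            estimate * math.exp(half_width))
```

For instance, `indirect_or_gxe(1.5, 0.5)` evaluates equation 3 with the hypothetical inputs OR(E|D) = 1.5 and g = 0.5 and returns 2.0.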

Multiplicity corrected CIs

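Given a set of (i) SNPs, the P value corresponding to the statistical significance of OR(GE|D)i is computed as

\[
p_i = 2\,[1 - \Phi(z_i)],
\]

where Φ denotes the standard normal cumulative distribution function,

\[
z_i = \left| \frac{\log[\mathrm{OR}(GE \mid D)_i]}{SE[\log[\mathrm{OR}(GE \mid D)_i]]} \right| = \left| \frac{\log[\mathrm{OR}(GE \mid D)_i]}{-\{\log[LCI[\mathrm{OR}(GE \mid D)_i]] - \log[\mathrm{OR}(GE \mid D)_i]\}/Z_{(1-\alpha/2)}} \right|,
\]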

and LCI is the (1−α/2) × 100% lower CI from equation 4. Ordering the P values (pi’s) from the lowest to highest values, i.e., p(1)≤ p(2) ≤ ... p(i) ≤ ... p(n) (with arbitrary ordering in the case of ties), the multiplicity corrected P values denoted by “*” are computed as


where j ranges from 1 to n in a 1:1 identity mapping with the i values, and p(j)* is bounded by unity. The multiplicity corrected (1−α/2) × 100% CI for OR(GE|D)(i) is then computed as

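\[
CI_{(1-\alpha/2)} = \exp\left\{ \log(\mathrm{OR}(GE \mid D)_i) \pm Z_{(1-\alpha/2)}\, SE^{*}[\log(\mathrm{OR}(GE \mid D)_i)] \right\},
\]

where

\[
SE^{*}[\log(\mathrm{OR}(GE \mid D)_i)] = \frac{\log(\mathrm{OR}(GE \mid D)_i)}{\Phi^{-1}[1 - p^{*}_{(j)}/2]}.
\]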
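The steps in this subsection can be sketched as follows. The sketch recovers each standard error from the unadjusted lower limit, converts it to a two-sided P value, adjusts the P values, and back-transforms to a wider interval. The adjustment shown is the Hochberg step-up procedure (reference 11), chosen here as an assumption because the adjustment formula itself is not reproduced in this text; the function names are hypothetical.

```python
import math
from statistics import NormalDist

_NORMAL = NormalDist()


def hochberg_adjust(pvals):
    """Hochberg step-up adjusted P values, returned in the input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])  # ascending P values
    adjusted = [0.0] * n
    running_min = 1.0  # adjusted values are bounded by unity
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, (n - rank + 1) * pvals[i])
        adjusted[i] = running_min
    return adjusted


def corrected_cis(or_ge, lci, alpha=0.05):
    """Multiplicity corrected (lower, upper) limits; requires 0 < LCI < OR."""
    z_crit = _NORMAL.inv_cdf(1.0 - alpha / 2.0)
    log_ors, p_raw = [], []
    for est, lower in zip(or_ge, lci):
        log_or = math.log(est)
        se = -(math.log(lower) - log_or) / z_crit  # SE from the unadjusted LCI
        z_i = abs(log_or / se)
        log_ors.append(log_or)
        p_raw.append(2.0 * (1.0 - _NORMAL.cdf(z_i)))
    intervals = []
    for log_or, p_star in zip(log_ors, hochberg_adjust(p_raw)):
        quantile = 1.0 - p_star / 2.0
        if quantile <= 0.5:  # p* = 1, so SE* is unbounded
            intervals.append((0.0, math.inf))
            continue
        if quantile >= 1.0:  # p* underflowed to zero in double precision
            se_star = 0.0
        else:
            # abs() keeps SE* positive when OR < 1 (a choice of this sketch)
            se_star = abs(log_or) / _NORMAL.inv_cdf(quantile)
        intervals.append((math.exp(log_or - z_crit * se_star),
                          math.exp(log_or + z_crit * se_star)))
    return intervals
```

With a single SNP the adjustment leaves the P value unchanged, so the corrected lower limit reproduces the input lower limit; with several SNPs the corrected limits for OR > 1 are no larger than the unadjusted ones.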

Screening for GxE in a direct association study

A SNP will be selected as a possible candidate in a direct disease association study for GxE if the (1−α/2) × 100% multiplicity corrected lower CI (MCLCI) estimate for OR(GE|D) is greater than an a priori specified threshold value, for example, OR = 3.0. Letting α1 and α2 denote the type I error for the indirect and direct tests, the statistical significance of the overall procedure will be protected at α ≤ α1 + α2, where the upper bound holds under independence of the indirect and direct tests. The significance level of the likelihood ratio test for the global null hypothesis β = 0, where β denotes the vector of β coefficients in a direct multivariable logistic regression model, may be set to a nominal value (e.g., ≤0.001), such that the significance level for the overall procedure (indirect and direct combined) will be protected at an α-level only slightly greater than the type I error for the indirect test. Furthermore, the total number of SNPs allowed to enter the direct model may be fixed at a small number, for example ≤10, based on the rank order of the lower 95% CIs for SNPs passing the OR threshold value.
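The selection rule described above can be sketched as a small filter. The function names, the dictionary layout, and the SNP identifiers in the usage note are hypothetical; the threshold of 3.0, the cap of 10 SNPs, and the additive bound on the overall type I error come from the text.

```python
def gatekeeper_select(mclci, threshold=3.0, max_snps=10):
    """Return SNP ids whose multiplicity corrected lower limit exceeds the
    threshold, ranked from the highest lower limit, capped at max_snps."""
    passing = [(snp, lower) for snp, lower in mclci.items() if lower > threshold]
    passing.sort(key=lambda item: item[1], reverse=True)
    return [snp for snp, _ in passing[:max_snps]]


def overall_alpha_bound(alpha_indirect, alpha_direct):
    """Upper bound on the type I error of the two-stage procedure."""
    return alpha_indirect + alpha_direct
```

For example, `gatekeeper_select({"snp_a": 3.5, "snp_b": 2.9, "snp_c": 4.1})` keeps `snp_c` and `snp_a`, in that order, and drops `snp_b`.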

In practice, a significant gain in power may be realized by using a meta-analysis estimate for OR(E|D) and 95% CI. Because a meta-analysis combines several studies, the resulting confidence interval will be more precise than a single case-control estimate of the effect.
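To illustrate why a pooled estimate is more precise, the sketch below combines study-level log odds ratios by inverse-variance weighting. The paper does not say which meta-analytic model is intended, so the fixed-effect model, the function name, and the inputs are assumptions made for illustration only.

```python
import math
from statistics import NormalDist


def pool_fixed_effect(odds_ratios, se_logs, alpha=0.05):
    """Inverse-variance (fixed-effect) pooling of log odds ratios.

    Returns (pooled OR, pooled SE of log OR, (lower, upper) CI).
    """
    weights = [1.0 / se ** 2 for se in se_logs]
    total = sum(weights)
    pooled_log = sum(w * math.log(o) for w, o in zip(weights, odds_ratios)) / total
    pooled_se = math.sqrt(1.0 / total)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    ci = (math.exp(pooled_log - z * pooled_se),
          math.exp(pooled_log + z * pooled_se))
    return math.exp(pooled_log), pooled_se, ci
```

Under this model the pooled standard error is never larger than the smallest study-level standard error, which is the source of the narrower interval.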

Example (hypothetical)

A population-based meta-analysis of several case-control studies estimates that children living on a farm have an OR(E|D) of 1.5 (95% CI = 1.1676–1.9270) for childhood brain cancer compared with controls. The aim of a future association study is to determine whether GxE are occurring between a panel of 100 innate immunity SNPs and exposure to farm life. The study investigator is interested in finding interactions with an OR(GE|D) ≥ 3.0. The population allele frequencies (g) for the 100 SNPs and computed 95% MCLCI for the indirect estimates of OR(GE|D) are shown in Table 1. Upon examining Table 1, the investigator observes that 7 SNPs (highlighted in gray in the 3 rightmost columns) have a MCLCI ≥ 3.0 for the indirect estimate of OR(GE|D), and these will be included in a new association study to directly test for GxE. Assuming that the type I error rate for the direct test will be controlled at α2 = 0.001, the statistical significance of the overall procedure will be protected at α ≤ 0.051.
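The quantities quoted in this example can be checked with a short calculation. The sketch assumes the reported interval is symmetric on the log scale, recovers the standard error of log OR(E|D) from its limits (about 0.128), and adds the two error rates; the helper name is hypothetical.

```python
import math
from statistics import NormalDist


def se_log_or_from_ci(lower, upper, alpha=0.05):
    """Standard error of log OR implied by a log-symmetric confidence interval."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return (math.log(upper) - math.log(lower)) / (2.0 * z)


lower, upper = 1.1676, 1.9270              # 95% CI for OR(E|D) from the example
se_log = se_log_or_from_ci(lower, upper)   # standard error on the log scale
midpoint = math.sqrt(lower * upper)        # geometric midpoint, close to 1.5
alpha_overall = 0.05 + 0.001               # alpha1 + alpha2 bound
```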

Table 1
Multiplicity corrected 95% lower confidence intervals (LCI) for OR(GE|D) given the population allele frequency (g) for 100 innate immunity SNPs and OR(E|D) = 1.5 (95% LCI = 1.1676).

Power and sample size computation

Power and sample size for the direct study may be computed using standard maximum likelihood methods for a logistic regression model and setting the α-level equal to α2 [6]. When the joint distribution of covariates is unknown, the sample size for a multivariable model may be estimated by multiplying the univariate result by a variance inflation factor 1/(1 − ρ²_{1.2,3...p}), where ρ²_{1.2,3...p} denotes the squared multiple correlation coefficient and p equals the number of model covariates [7]. In the example above, approximately n1 = 260 cases and n2 = 260 controls are needed in a direct study to have at least 80% power to detect an OR(GE|D) ≥ 3.18 (corresponding to the LCI of the minimum SNP passing the threshold for entrance into the direct model; upper right hand value in highlighted region of Table 1) at the α2 = 0.001 level of statistical significance (2-sided test), given ρ²_{1.2,3...p} = 0.2, P(E) = 0.10 and P(GE) = (0.88)(0.10) = 0.088 [8]. Accordingly, the overall test procedure is protected at α ≤ 0.051 (i.e., α1 + α2 = 0.05 + 0.001 = 0.051).
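The variance inflation adjustment amounts to a single division. In the sketch below the function names are hypothetical and the univariate sample size is left as an input, since it is not reported in the text; rounding up to a whole number of subjects is left to the user.

```python
def variance_inflation_factor(rho_squared):
    """VIF = 1 / (1 - rho^2), with rho^2 the squared multiple correlation."""
    if not 0.0 <= rho_squared < 1.0:
        raise ValueError("rho_squared must lie in [0, 1)")
    return 1.0 / (1.0 - rho_squared)


def inflate_sample_size(n_univariate, rho_squared):
    """Approximate multivariable sample size from a univariate result."""
    return n_univariate * variance_inflation_factor(rho_squared)
```

With the squared multiple correlation of 0.2 used in the example, the factor is 1.25, so the multivariable model needs about 25% more subjects than the univariate calculation.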

For models involving non-null main effects [i.e., P(G = 1, E = 0) ≠ 1 and/or P(G = 0, E = 1) ≠ 1], power may be computed by excluding these effects from the sample space when estimating the power for OR(GE|D).


To the best of our knowledge, this method is the first to screen for GxE using an indirect estimate for OR(GE|D). Statistical power is significantly increased in this approach by eliminating SNPs prior to conducting a direct association study of GxE. Furthermore, study cost is greatly reduced since fewer SNPs need to be genotyped.

The approach has several advantages. For example, the derived indirect estimate requires only knowledge of the OR for environmental exposure OR(E|D) and the population allele frequency (g). Also, the model can detect interactions even when the OR for an environmental effect is null, i.e., OR(E|D) = 1.

An indirect estimate for OR(GE|D) can be computed regardless of whether a biologic rationale exists for the underlying effect. Since the inclusion of biologically irrelevant or non-functional SNPs will inflate type I error, pathway analysis and other molecular techniques are recommended to determine the relevance of SNPs prior to analysis [9].

The face validity of the method is based on established probabilistic principles and theory [10-12]. Nonetheless, further validation of the technique will require testing its ability to detect biologically and clinically meaningful results that hold under replication in future independent studies. Furthermore, since the multiplicity corrected CIs used in the indirect screening phase of the method were derived heuristically and represent approximate estimates of the true interval widths, re-sampling methods are recommended in situations requiring exact coverage.

A practical limitation of the method is that genotype must be independent of environmental exposure. This assumption may be violated, for example, when an underlying gene affects behavior such that an individual is predisposed to seek (or avoid) the environmental exposure (e.g., a gene that causes craving for alcohol). Additionally, the method does not account for complex gene-environment interactions that may underlie multifactorial diseases.

An implicit assumption of the method is that estimates for OR(E|D) and (g) remain unchanged in the population under consideration in the direct association study. However, this may not hold true when samples are collected based on strict population stratification or the target population has changed over time. Accordingly, a prudent comparison of known epidemiologic characteristics for the indirect and direct populations is advised prior to the implementation of this method. Additionally, the user must use caution in the interpretation of results when the decimal precision of estimates is limited.

When the allelic frequency of SNPs is very low, the multiplicity adjusted P values will approach zero. To remedy this limitation, a GMP-based implementation of the Schonhage-Strassen algorithm may be used to perform arbitrary-precision arithmetic [13,14]. This algorithm uses fast Fourier transforms in rings with 2^(2^n) + 1 elements to enable multiplicative computation of factors near absolute zero.
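The underflow problem is easy to reproduce. The sketch below uses Python's standard `decimal` module as a stand-in for an arbitrary-precision library; it is neither GMP nor the Schonhage-Strassen algorithm, and the operands are arbitrary illustrative values. It only shows that a product lost in double precision survives with a wider exponent range.

```python
from decimal import Decimal, getcontext

getcontext().prec = 50  # significant digits carried by Decimal arithmetic

# In IEEE double precision the product is below the smallest representable
# positive number and underflows to exactly zero.
tiny_float = 1e-200 * 1e-200

# The same product in arbitrary-precision decimal arithmetic stays non-zero.
tiny_decimal = Decimal("1e-200") * Decimal("1e-200")
```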

The ultimate success of detecting GxE will depend on the accurate and precise measurement of environmental exposures on par with recent advances in genotyping technology. For many diseases, this will entail determining life-course environmental exposures from birth onward [1]. Parsimonious questionnaire design and the use of targeted biomarkers will play a key role in assessing environmental exposures in the context of GxE.

In summary, the failure to account for multiplicity in large scale GxE studies may lead to the misinterpretation of results. Furthermore, disease association studies for GxE are expensive and time consuming, and careful control of these factors is important to consider in study design [15]. The method presented in this paper offers an easy to implement and efficient means to identify GxE that will provide more efficacious use of research and clinical resources.


This manuscript was made possible by a grant from NCMHD/NIH (P20MD002289) entitled “Teamwork in Research and Intervention to Alleviate Disparities Project (TRIAD).” Elizabeth Tornquist (UNC-CH) and Debra C. Wallace (UNCG) offered valuable comments during the writing of this manuscript and their knowledge and insight are greatly appreciated.


CI: confidence interval
GxE: gene-environment interaction
LCI: lower confidence interval
MCLCI: multiplicity corrected lower confidence interval
OR: odds ratio
SNP: single nucleotide polymorphism
SE: standard error



This manuscript has been read and approved by the author. This paper is unique and is not under consideration by any other publication and has not been published elsewhere. The author and peer reviewers of this paper report no conflicts of interest. The author confirms that they have permission to reproduce any copyrighted material.


1. Wild P. Complementing the genome with an “exposome”: The outstanding challenge of environmental exposure measurement in molecular epidemiology. Cancer Epidemiol Biomarkers Prev. 2005;14:1847–50.
2. von Mutius E. Gene-environment interactions in asthma. J Allergy Clin Immunol. 2009;123:3–11.
3. Vercelli D. Gene-environment interactions: The road less traveled by in asthma genetics. J Allergy Clin Immunol. 2009;123:26–7.
4. Efird J. A method for indirectly estimating gene-environment effect modification and power given only genotype frequency and odds ratio of environmental exposure. Eur J Epidemiol. 2005;20:389–93.
5. Efird J, Searles Nielsen S. A method to compute multiplicity corrected confidence intervals for odds ratios and other relative effect estimates. Int J Environ Res Public Health. 2008;5:394–8.
6. Lyles R, Lin H, Williamson J. A practical approach to computing power for generalized linear models with nominal, count, or ordinal responses. Stat Med. 2007;26:1632–48.
7. Whittemore A. Sample size for logistic regression with small response probability. J Am Stat Assoc. 1981;76:27–32.
8. Faul F, Erdfelder E, Buchner A, Lang A-G. Statistical power analyses using G*Power 3.1: tests for correlation and regression analyses. Behav Res Methods. 2009;41:1149–60.
9. Jorgensen T, Ruczinski I, Kessing B, Smith M, Shugart Y, Alberg A. Hypothesis-driven candidate gene association studies: practical design and analytical considerations. Am J Epidemiol. 2009;170:986–93.
10. Roy S, Bose R. Simultaneous confidence interval estimation. Ann Math Stat. 1953;24:513–36.
11. Hochberg Y. A sharper Bonferroni procedure for multiple tests of significance. Biometrika. 1988;75:800–2.
12. Marcus R, Peritz E, Gabriel K. On closed testing procedures with special reference to ordered analysis of variance. Biometrika. 1976;63:655–60.
13. Schonhage A, Strassen V. Schnelle Multiplikation grosser Zahlen. Computing. 1971;7:281–92.
14. Gaudry P, Kruppa A, Zimmermann P. A GMP-based implementation of Schonhage-Strassen's large integer multiplication algorithm. Proceedings of the 2007 International Symposium on Symbolic and Algebraic Computation; pp. 167–74.
15. Li C, Li M, Long J, Cai Q, Zheng W. Evaluating cost efficiency of SNP chips in genome-wide association studies. Genet Epidemiol. 2008;32:387–95.

Articles from Cancer Informatics are provided here courtesy of Libertas Academica