|Home | About | Journals | Submit | Contact Us | Français|
It is important to investigate whether genetic susceptible variants exercise the same effects in populations that are differentially exposed to environmental risk factors. Here, we assess the power of four two-phase case-control design strategies for assessing multiplicative gene-environment (G-E) interactions or for assessing genetic or environmental effects in the presence of G-E interactions. With a di-allelic SNP and a binary E, we obtained closed-form maximum likelihood estimates of both main effect and interaction odds ratio parameters under the constraints of G-E independence and Hardy-Weinberg Equilibrium, and used the Wald statistic for all tests. We concluded that i) for testing G-E interactions or genetic effects in the presence of G-E interactions when data for E is fully available, it is preferable to ascertain data for G in a subsample of cases with similar numbers of exposed and unexposed and a random subsample of controls; and ii) for testing G-E interactions or environmental effects in the presence of G-E interactions when data for G is fully available, it is preferable to ascertain data for E in a subsample of cases that has similar numbers for each genotype and a random subsample of controls. In addition, supplementing external control data to an existing casecontrol sample leads to improved power for assessing effects of G or E in the presence of G-E interactions.
Many genetic variants have recently been found to be associated with complex human phenotypes in genome-wide association studies (GWAS). Capitalizing on these findings for personalized medicine calls for investigations on the synergy between these genes and environmental risk factors. In the post GWAS era when genotype data for millions of genomic loci has been made available for thousands of people, it is of great interest to consider how to best utilize this existing resource to achieve improved power in G-E interaction studies. Similarly, it is important to consider how to expand case-control studies that did not collect biological samples for cost-effective studies of G-E interactions. In general, the two-phase design, which is a cost-effective option for studying expensive risk factors, has recently been advocated for the study of G-E interactions . In this design, data for either genetic variants or environmental exposures is collected only on a judiciously selected subgroup of subjects. In this work, we consider two-phase case-control study designs for assessing multiplicative G-E interactions. We also evaluate the efficiency of these designs for jointly testing genetic or environmental main and G-E interaction effects, as these joint tests may lead to improved power for detecting genetic variants or environmental risk factors in the presence of G-E interactions .
Efficient study designs must be discussed in conjunction with statistical methods for analysis. While the prospective likelihood method for analyzing case-control genetic association studies is frequently applied , recent years have seen important advances in the development of statistically efficient methods for assessing G-E interactions. To analyze binary genetic and environmental variables in relation to a rare phenotype, under the constraint of G-E independence, the case-only method, which ignores data from controls and estimates the G-E interaction odds ratio (OR) parameter as the OR for G-E association in cases, is much more precise than the prospective case-control method . This case-only OR estimate is actually the maximum likelihood estimate (MLE) of the same parameter in a log-linear model under the constraint of G-E independence in controls . Chatterjee and Carroll  proposed to exploit the G-E independence in the maximum likelihood analysis of case-control data under a logistic regression model. Their method had much improved precision for estimating OR parameters that quantify joint G-E effects. Based on these powerful methods, Mukherjee et al.  proposed practical sample size calculation methods for designing case-control G-E interaction studies. In this work, we consider a di-allelic SNP and a binary environmental exposure for a rare phenotype and adopt a retrospective likelihood method for analysis. Our method not only constrains the control population by the G-E independence, but also by the Hardy-Weinberg Equilibrium (HWE) for the genotype variable. The analysis of two-phase designs coupled with this powerful method of analysis yields novel insights into cost-effective designs of G-E interaction studies.
This paper is organized as follows. In Section 2, we provide closed-form formulas for OR parameter estimates that quantify G-E main and interaction effects with standard case-control data. In Section 3, we provide closed-form formulas for the analysis of two-phase case-control data by extending results in Section 2. Using these formulas, we discuss the efficiency of four slightly different two-phase designs, where either G or E is collected only on a subset of cases and controls, or data for G or E from additional controls is supplemented. In Section 4, we perform extensive simulation studies to assess implications of the HWE constraint for testing OR association parameters with the standard case-control data and assess the efficiency of various two-phase design sampling strategies. We discuss practical implications of our findings in Section 5.
Let E denote a binary environmental factor, G denote the count of the minor allele for a di-allelic SNP, and Y denote the case-control status (Y = 1: case; Y = 0: control). Data for (G, E) is collected from n1 cases and n0 controls. We describe the association between Y and (G, E) by a logistic regression model
where f(G) is a pre-specified function that reflects different numerical codings for G. For example, f(G) can be the count of the minor allele with f(G) = G (log-additive model), can be the presence or absence of the minor allele with f(G) = I(G>0) (dominant model), or can be an indicator function for the genotype with f(G) = (I(G=1),I(G=2)) (co-dominant model). Denote β = (βg, βe, βI). The case-control data for fitting model (1) is summarized in Table 1, for which the standard retrospective likelihood function can be written as . Following a result in Satten and Kupper , this standard likelihood function can also be written as
Without any constraints, the nuisance probability p(Gj, Ej|Yj = 0) in the above likelihood can be fully parameterized by 5 parameters. When the phenotype is rare, joint maximization of β and these 5 nuisance parameters leads to an estimate of β that is identical to that obtained from standard prospective likelihood analysis. We assume G-E independence and HWE in the control population, and , where pa denotes the minor allele frequency (MAF). Let pe denote p(E = 1|Y = 0). The retrospective likelihood function can then be written as
which we maximize to obtain the MLE of (β, pa, pe). We calculate the estimates in two steps. First, simple algebra leads to solutions and
Then we solve for and OR estimates of genetic effects among the exposed and unexposed, eβg and eβ*g = eβg+βI, from the following profile log-likelihood obtained by replacing (pa, eβe) by in the likelihood function L(β, pe, pa):
The estimate can then be obtained as . The estimate of the MAF, , is the same regardless of the numerical coding adopted for G. Below, we provide explicit formulas for and corresponding to different numerical codings for G, focusing on results for the most widely used log-additive model for G. We also provide formula for under the log additive model for G.
Under the log-additive model for G, estimates can be expressed explicitly as functions of the cell counts in Table 1:
We found that both G-E independence and HWE constraints are required to obtain these closed-form formulas. That is, the HWE constraint does have an impact on the estimation of parameters that characterize the joint G-E effect. In the above formulas, the MAF estimate appeared only in but not in . Therefore, we may conjecture that the impact will be mainly on the estimation of genetic main effect parameter βg, but not much on the interaction parameter βI. In fact, is the OR estimate based on a case-only analysis as follows. First, create a contingency table for cases that cross-classifies E and the two alleles, treating each chromosome as a subject and the environmental exposure E as the outcome variable. Then is the standard OR estimate from this 2-by-2 table. This result reminds the allelic OR for analyzing standard case-control SNP data, which is valid only under certain conditions . These conditions, when applied to the current context, are as follows: i) the log-additive model is the true model for relating binary E and G in cases and ii) the HWE constraint is valid in the population of unexposed cases. Since the G-E independence and HWE in controls imply the HWE among the unexposed (E = 0), these two conditions are guaranteed as long as the penetrance model (1) is correct.
Interestingly, and , and thus , can also be obtained directly via a stratified analysis as follows. That is, > is the allelic OR based only on the unexposed cases and all n0 controls regardless of the exposure status, and is the allelic OR similarly based only on exposed cases and all n0 controls. Note that the allelic OR within each stratum is the MLE based on a similar likelihood as (2) where p(G|Y = 0) satisfies the HWE constraint. These observations reveal the impact of G-E independence and HWE constraints: analysis that is stratified on E with the most efficient analysis performed within each stratum results in the most efficient estimates of all association parameters. It is straightforward to obtain the variance-covariance matrix for using results for standard multinomial distributions:
We focus on the estimation of βg and βI, since can not be simplified as that under the log-additive coding. Similar to the log-additive model, closed-form estimates and can be obtained via efficient stratified analysis. For the analysis of case-control SNP genotype data under the co-dominant coding, the MLEs for the two OR parameters that exploit the HWE in controls have the same forms as the standard OR estimates based on 2-by-3 contingency tables but with the observed control counts replaced by the expected numbers under the HWE . Let βg = (β1, β2) be the logarithm of the two genetic main effect ORs, and βI = (βI1, βI2) be the two interaction effects log ORs. Then , and , are obtained by applying results of Chen and Chatterjee  directly to unexposed cases and all controls and exposed cases and all controls, respectively. The closed-form formulas are as follows:
It appears that the HWE constraint indeed has an impact on the estimation of genetic main effects through the estimated MAF . But the estimated interaction OR parameters appeared to be the same as those obtained under only G-E independence constraint, which approach the true parameter values as the sample size increases. Therefore, the estimation of interaction ORs is robust with respect to the HWE constraint under the co-dominant coding for G. Similar to the results under the log-additive coding, can be obtained based on the case-only analysis using cases with G = 0 or G = 1 or cases with G = 0 or G = 2, respectively. The estimates of all OR parameters can also be obtained by applying results of Chen and Chatterjee  separately to the analysis of all controls together with either exposed or unexposed cases. The variance-covariance matrix for , following Chen and Chatterjee  formula, is as follows:
When the dominant coding is adopted for G, the MLE of eβI, , is the OR estimate from the case-only analysis with E being the binary outcome variable, which is the same as that obtained under only the G-E independence constraint by Umbach and Weinberg . The estimate of the main effect eβg under the additional HWE constraint is different from that without the HWE constraint:
The variance-covariance matrix for is
All above estimates approach the true parameter values as the sample size increases when the penetrance model (1) and both constraints are correct. It has been well recognized that deviation from the G-E independence constraint can lead to intolerable biases in parameter estimates even when the HWE constraint is not imposed [5,10]. Here, it appears that the consistency of the main effect OR estimates, , requires that the HWE hold. For the estimation of the interaction OR parameter βI, under the log-additive model, its consistency requires both G-E independence and HWE constraints. But under other models, only the G-E independence is required. The closed-form formulas we provided facilitate explicit quantification of the magnitude of the bias. We will not further discuss the bias issue since the main interest of the current work is to provide guidelines on optimal study designs. The power for different study designs assuming the above methods for analysis is optimal when the two constraints hold, and the corresponding sample sizes similarly represent the minimum required.
In the simplest two-phase case-control design for assessing joint G-E effects, data for either E or G is available for all cases and controls, but that for the other one is available only on a selected subset. Without imposing the G-E independence or HWE constraints, the balanced design , which “balances” the numbers of phase II subjects, that is, those for whom both E and G are ascertained, in strata defined by the case-control status and variables completely collected on cases and controls (“phase I variable”), is nearly optimal for estimating the main and interaction effect parameters when analyzed by the maximum likelihood method . Here, we consider four variants of the two-phase design: E is the phase I variable and G is ascertained on a subset of cases and controls (Design I) selected with or without referring to E; G is the phase I variable and E is ascertained on a subset of cases and controls selected with or without referring to G (Design II); Data on E is available on an external set of controls (Supplemented Design I); and data on G is available on an external set of controls (Supplemented Design II). The two supplemented designs are obviously special cases of designs I and II, respectively. Below we focus on the log-additive coding for G, and results under other codings can be obtained in a straightforward manner.
The results above for the standard case-control data immediately suggest efficient two-phase sampling strategies for the estimation and testing of genetic and environmental effects. First consider Design I where E is available for all cases and controls. Above, only the data from cases is used in interaction OR parameter estimates, where cases with E = 1 are used as “cases”, and cases with E = 0 are used as “controls”. To avoid confusion, below we refer to cases with E = 1 as “c-cases” and those with E = 0 as “c-controls”. The accompanying association model is
where f(G) is the same as that in model (1). Now consider that we design such a case-control study. Intuitively, standard principles for designing a retrospective case-control study would apply here: a desirable design would balance the numbers of c-cases and c-controls to achieve an optimal power. For analysis, one can simply ignore the selective sampling and perform standard prospective analysis. The estimate of βI would be valid, although the intercept parameter estimate is not a consistent estimate of αo . The most efficient estimate of βI is obtained by applying the retrospective likelihood method that exploits the HWE  to the data from the sampled c-cases and c-controls. Due to the G-E independence in the control sample, stratification on E in controls would not help improve the precision for estimating any association parameters. Therefore, Design I that selects a balanced sub-sample of exposed and unexposed cases and a random sample of controls for ascertaining G is preferable for the estimation and testing of genetic and environmental effects. Similarly, supplementing data for E (Supplemented Design I) is not expected to help the estimation of βI, although it is expected to lead to improved prediction for estimating pe and βe.
For Design II where G is available for all cases and controls, the case-only analysis with model (4) using phase II cases yields valid estimates for both αo and βI, although the most efficient analysis would also utilize data for G for cases who are not selected into phase II. Similar as the arguments above, a balanced selection of cases with G = 0, G = 1 and G = 2 is expected to lead to improved efficiency for estimating βI. In addition, data for G from additional controls (Supplemented Design II) would improve the efficiency for estimating βg, but not βI.
Let R be a binary variable taking values 1 or 0 depending on whether a subject is selected into phase II or not. For Design I, we obtained the parameter estimates by maximizing the likelihood function
We found that has the same form as (3) and , the estimates obtained when (E,G) is available for all n1 cases and n0 controls. For estimating (pa, βg, βI), we found that the same profile likelihood as that for the standard case-control design above applies, except that only phase II cases and controls who have both G and E measurements are used. Therefore, estimates and their variance-covariance matrix are largely the same as those for the standard case-control design above, except that each count in the formula is replaced by the corresponding one in the phase II data. Let m1 and m0 denote the respective number of phase II cases and controls, and mijk has the same meaning as nijk. Under the log-additive coding for G, formulas for and var are as follows:
In Supplemented Design I where data on E is available for m additional controls, let and be the number of supplemented controls with E = 1 and E = 0, respectively. We obtain . Under the log-additive coding for G, the estimated main environmental effect and its asymptotic variance are as follows:
Estimates of other parameters remain the same as the standard case-control design.
Let R, m, and mijk be defined similarly as those for Design I. The likelihood function for Design II, where the selection of cases and controls for collecting E may stratify on G, can be written as
Contrary to Design I, one generally can not get closed-form estimates for OR estimates. This result may seem counter-intuitive since E and G appear to be symmetric in their relationship to the phenotype variable. But the difference in the analysis of Design I and Design II is that the distribution of the phase I variable G in Design II is constrained via the HWE, but the phase I variable E in Design I was not constrained. In an important special case where data for both E and G is collected for cases (but E is still available only for a subset of controls), the closed-form solutions exist for all OR parameters. In this case, , and
For Supplemented Design II where G is collected from ms additional controls, the OR estimates and variance-covariance matrix have the same form as those for the standard case-control design, but with where , are the respective number of supplemented controls with genotypes 1 and 2.
We conducted extensive simulation studies to evaluate the power of different study designs for testing three hypotheses: i) null G-E interaction effect, βI = 0; ii) null genetic effect, βg = βI = 0; and iii) null environmental effect, βe = βI = 0. We assumed the log additive model for G and used the Wald statistic for all tests based on the closed-form estimates provided in the above sections. First, we assessed the impact of imposing the HWE constraint on the estimation efficiency and power for testing different sets of association parameters under the standard case-control design. We considered the standard prospective method (“Standard”), the method that imposed the G-E independence constraint but not the HWE constraint (“GE-O”), and the method that imposed both the G-E independence and HWE constraints (“GE-HWE”). The comparison of these methods would shed light on the power improvement incurred by the two constraints. Next, with GE-HWE as the method of analysis, we compared the efficiency of four two-phase sampling strategies for testing the three hypotheses above. We considered a range of penetrance models in the form of (1) by varying the magnitude of OR parameters. For example, G may have an effect only in the presence of E, or E may have an effect only in the presence of G. We first generated data for controls, assuming that E followed a Bernoulli distribution and SNP genotype data G satisfied the HWE. Then we generated (G, E) for cases from the conditional distribution p(G, E|Y = 1) where
In all tests, we set the nominal level at 0.0001, assuming that 500 tests were performed. In practice, the test of βg = βI = 0 may be at a different significance level than that for testing βe = βI = 0. Here we used the same level mainly to facilitate power comparison. The test for all three hypotheses had type I error rates that were close to the nominal level, as shown in Table 2. We generated 5,000 replicates for assessing the power of all tests.
Panels A and B in Figure 1 demonstrate the relative power of the three methods for testing βI = 0 and βg = βI = 0, where βg = 0 for Panel A and βg = ln(1.2) for Panel B. For testing βI = 0, the power of GE-HWE appeared to be similar to that of GE-O, and both are higher than the standard method with the difference rising sharply with the magnitude of βI. For example, with βI = ln(1.5), the power difference was around 20%. But with βI = 1.8, the power difference was around 60%. For testing βg = βI = 0, the power of GE-HWE and GE-O was very similar but much higher than the standard method. For example, the power difference was around 60% at βI = ln(1.8) and βg = 0 (Panel A) and was around 20% at βI = ln(1.8) and βg = ln(1.2). These data indicate that imposing the HWE constraint in addition to the G-E independence had limited influences on testing genetic effects or G-E interactions under the log-additive model for G. Panels C and D display the results for the relative power of the three methods for testing βI = 0 and βe = βI = 0. Regardless of the presence or absence of the main effect of E (Panel C: βe = 0; Panel D: βe = log(1.5)), GE-HWE and GE-O have nearly identical power for both tests, and both had higher power than the standard method. This indicates that the HWE constraint hardly has any impact on power for testing βe = βI = 0.
We quantified the relationship between all parameter values and the ratio of power for GE-HWE to that for the standard method using simulation studies. We first obtained the relative power for a wide range of parameter setups. Then we performed linear regression analysis, using the log relative power as the outcome variable and the true parameter values as explanatory variables. The estimated mean log relative power for testing βI = 0, βg = βI = 0, and βe = βI = 0 is 3.5−1.1pa −0.33pe +0.43βg +0.17 βe −2.88βI, 1.51−0.44pa −0.35pe −0.57βg −0.15βI, and 1.6 − 0.5pa − 0.56pe + 0.02βg + 0.44βe − 0.30βI, respectively. Therefore, the magnitude of βI plays a dominant role in the relative power for testing G-E interactions, but the magnitude of βg and βe plays a greater role in testing genetic and environmental effects, respectively.
Table 3 presents the mean estimates, averaged estimated asymptotic variances, and empirical variances of the three methods, where the data was generated using the same parameter setup as that for panels A and B in Figure 1. The mean estimates with GE-HWE appeared to be close to the true parameter values. The averaged estimated asymptotic variances for all parameter estimates appeared to be close to their empirical counterparts. The empirical variances of main effect parameters estimated with GE-HWE were generally close to those of GE-O but smaller than that those under the standard method, and that for the interaction parameter βI could be smaller by more than 60%.
We investigated efficient two-phase design strategies for testing the genetic effect βg = βI = 0 and environmental effect βe = βI = 0 using GE-HWE for analysis. In each replicate, we first generated (Y, G, E) for 1,000 cases and 1,000 controls. Then we created a two-phase sample by selecting an equal proportion of cases and controls into phase II, and either data for G (Design I) or E (Design II) were deleted for those unselected. For cases, we selected the phase II subset either randomly or following a “balanced design” strategy by stratifying on E in Design I or G in Design II. The balanced design included all cases with E = 1 for a rare exposure in Design I, and it included as equal as possible numbers of cases with G = 0, G = 1, or G = 2 in Design II, respectively. With a small MAF, all cases with G = 2 are selected. To further evaluate the impact of control selection on the efficiency of the design, we considered two-phase designs with 300 phase II cases but a varying proportion of phase II controls ranging from 30% to 100%.
Figures 2 displays the power of Design I for testing βg = βI = 0 and βe = βI = 0 as a function of the proportion of phase II cases and/or controls. In general, the power under balanced sampling for testing βg = βI = 0 was much higher than that under random sampling, with the power difference becoming greater at smaller phase II case/control proportions and larger MAF (Panel A). But the difference between the two sampling strategies was small for testing βe = βI = 0 (Panel B). With a fixed subset of phase II cases, the power for testing genetic and environmental effect is nearly identical under both stratified and random sampling of controls (Panels C and D), and it increased with the proportion of selected controls for testing βg = βI = 0 (Panel C) but remained constant for testing βe = βI = 0 (Panel D). These results suggest that sampling stratified on E in cases is generally preferred for testing genetic effects or G-E interactions when data on E is available on all subjects. Parameter estimates corresponding to Panel C are presented in Table 4.
Figure 3 displays the power of Design II for testing βg = βI = 0 and βe = βI = 0 as a function of the proportion of phase II cases and controls. In general, for testing βg = βI = 0, the difference between the two sampling strategies appeared to be small (Panel A), and the power remained constant with a varying proportion of phase II controls (Panel C) when the subset of phase II cases is fixed. On the other hand, the power under balanced sampling for testing βe = βI = 0 was much higher than that under random sampling, with the power difference getting greater at smaller phase II case/control proportions and larger prevalence of E (Panel B). The power under both balanced and random sampling of controls when the subset of phase II cases was fixed slightly increased with the proportion of selected controls (Panel D). These results suggest that sampling stratified on G in cases for ascertaining data for E is generally preferred for assessing environmental effects.
Figure 4 displays the power of Supplemented Design I for testing βe = βI = 0 as a function of the number of supplemented controls m at different values of pe. The magnitude of power increase due to the supplement of additional control data for E increased with βe, βI, and pe, particularly when m was less than 500. For example, with pa = 0.2, pe = 0.15, βg = log(1.2), βe = βI = log(1.5) (Panel A), supplementing E from 500 and 2, 000 additional controls to data from 500 cases and 500 controls led to around 20% and 40% increase in power, respectively. But with βe reduced to log(1.2), the respective increase was only around 5% and 10%. The power of Supplemented Design I for testing βI = 0 and βg = βI = 0 remained constant regardless of the number of supplemented controls (data not shown).
Figure 5 displays the power of Supplemented Design II for testing βg = βI = 0 as a function of m, the number of additional controls with data on G. Similar as Supplemented Design I, the power increase at a given m appeared to be larger with increasing βg. For example, with pa = 0.2, pe = 0.15, βg = log(1.2), and βI = log(1.5) (Panel A), supplementing G from 500 and 2000 controls to 300 cases and 300 controls led to 10% and 24% increase in power, respectively. But with pa = 0.2, pe = 0.15, βg = log(1.2), and βI = log(1.3), the respective increase was only 7% and 16%. In the absence of genetic main effect (βg = 0), the respective increase became negligible. The increase also became sharper with a greater pa. Not surprisingly, the power of Supplemented Design II for testing βI = 0 and βe = βI = 0 remains nearly constant regardless of the number of supplemented controls (data not shown).
We assessed the efficiency of two-phase case-control designs for assessing genetic and environmental effects when the control population is constrained by the G-E independence and HWE. A balanced selection of the exposed and unexposed cases appears to be a nearly optimal strategy for testing G-E interactions when data for cases can not be completely ascertained. Random sampling of controls suffices in the sense that stratified sampling in controls does not lead to improved power for association analysis. Supplementing data for G or E from additional controls generally does not help improve the power for testing G-E interactions. For testing genetic effects in the presence of G-E interactions, supplementing data for G from additional controls is helpful, particularly when the genetic effect is moderate or large. Similarly, supplementing data for E from additional controls is helpful for assessing environmental effects in the presence of G-E interactions, and the power increase becomes higher with increased environmental effects. Although we considered a binary environmental variable in this work, we expect that our conclusions hold when the environmental variable is continuous.
We obtained closed-form formulas for odds ratio association parameter estimates assuming a di-allelic SNP and a binary environmental variable. Regardless of the numerical coding adopted for the SNP genotype, we found that the estimation of the G-E interaction odds ratio parameter requires only the data of cases. In particular, the allelic odds ratio estimate in the case-only G-E interaction analysis is the MLE under the log-additive coding for the SNP genotype. Thus, our results generalized the case-only analysis with a binary genotype variable to a broader range of numerical coding schemes. For testing genetic effects or environmental effects in the presence of G-E interactions, incorporating the HWE constraint leads to improved power, although the HWE constraint hardly has any effect on the power for testing G-E interaction effects beyond that it is required to obtain closed-form estimates under the log-additive coding.
In this work, we assumed that the same numerical coding for the genotype variable was adopted in the main and multiplicative interaction effects. If the specification of the main effects is incorrect, the test for interaction would be invalid. In practice, one may base a test for interaction on a model where the co-dominant coding is adopted for the main effect of G. Then a valid test is guaranteed under the null hypothesis of no interaction. We did not consider this approach in this paper, mainly because we did not find closed-form estimates for OR parameters and because our conclusions for two-phase designs appeared to hold under this model.
This research was supported by ES016626 and the Long-Range Research Initiative of the American Chemistry Council and the Intramural Research Program of the /Eunice Kennedy Shriver/ National Institute of Child Health and Human Development, National Institutes of Health.