|Home | About | Journals | Submit | Contact Us | Français|
Gene-environment interactions are of interest in genetic association studies for several reasons. Firstly, the power to detect genetic effects may be substantially decreased if those effects differ according to environmental exposure, and if no account is taken of this interaction with environmental exposure in the analysis. Secondly, such interactions may indicate a phenomenon of genuine biological interest (whereby a particular genetic effect only operates in the presence of an environmental trigger, or vice versa), understanding of which can lead us to a greater understanding of possible mechanisms and pathways in disease progression. Here I discuss the testing and estimation of gene-environment interactions via the case/pseudocontrol and related approaches. As originally proposed, the case/pseudocontrol approach applies to case/parent trios with no missing genotype data. I discuss some recent extensions that allow larger pedigree structures with some missing genotype data, and present computer simulations to compare the performance of several competing approaches.
Complex genetic diseases are, by definition, believed to result from the interplay of numerous different genetic and environmental factors. Interactions between such factors could account for the relatively modest successes in the detection of disease-predisposing genetic variants for common, complex diseases . If genetic and environmental factors interact to cause disease susceptibility, the power to detect such effects, even in the current generation of large-scale, well-powered, genome-wide studies , may be compromised unless one stratifies by, or in some other way takes account of, the other factor(s) involved. For this reason, there is a growing interest in modeling gene-gene and gene-environment interactions, both in candidate gene and genome-wide studies. An additional motivation for investigation of interactions is a belief that these represent a phenomenon of biological interest, the elucidation of which may help uncover possible mechanisms and pathways in disease progression.
From a statistical point of view, interaction simply signifies departure from a linear model describing how two predictors (x1 and x2, say) predict an outcome variable (y, say). This is perhaps most easily understood when y represents some quantitative trait (such as height, for example) and x1 and x2 are binary indicator variables representing the presence or absence of some predictive factors. If the data is well fitted by a linear model y=μ+β1x1+β2x2 (where β1 and β2 are regression coefficients, representing the effects due to variables and x1, x2, and μ is some baseline mean trait value that is expected in the absence of either factor) then the variables x1 and x2 are said to not interact with regards to predicting y. If, on the other hand, the data is better represented by a model that includes additional terms, such as y=μ+β1x1+β2x2+β3x1x2, then we say that x1 and x2 interact. More complex linear models may be postulated when the predictor variables take on several different levels or are measured on a quantitative scale and/or the outcome variable is qualitative or dichotomous (such as indicating presence/absence of disease). For a disease outcome and case/control data, the usual approach is to model the log odds of disease ln[p/(1−p)] (where p represents the probability of an individual becoming diseased) as a linear function of the relevant predictor variables . For example, we might model the log odds as ln[p/(1−p)]=β0+βex1+βgx2+βgex1x2 , where x1 and x2 are binary indicator variables representing presence or absence of environmental and genetic exposures respectively, βe and βg are regression coefficients representing the environmental and genetic main effects, and βge represents a gene-environment interaction term .
The concept of interaction as departure from a linear model for the main effects of two variables is visualised pictorially in Figure 1. Expected trait values are shown for two different levels of a binary environmental exposure (‘low’ or ‘high’) and a three level genotype (such as might occur at a single nucleotide polymorphism). Figure 1A shows a situation where the genetic and environmental variables do not interact with regards to prediction of the trait: at each genetic level the effect of changing environment is to shift the trait mean by a constant amount, while at each level of the environment, the shift in trait mean between the different genotypes also remains constant. Figure 1 B-D on the other hand show varying types of interaction: in Figure 1B the difference in trait mean between the different genotypes is seen to be much greater when environment is ‘high’ than when environment is ‘low’, and, equivalently, the effect on the trait mean of moving from the ‘low’ to the ‘high’ environment is seen to be stronger for individuals with genotype g3 than for those with g1. In Figure 1C this pattern is taken to the extreme in that there is no effect of genotype on trait at all in the ‘low’ environment. In Figure 1D the effect of genotype on trait is seen to be reversed in the ‘low’ environment as compared to the ‘high’ environment, sometimes referred to as a crossover model.
The difficulties in biological interpretation of statistical interaction are well-known [5-7] and result partly from the fact that statistical interaction is not invariant to transformations of scale of the outcome variable. For example, simply taking a monotonic transformation trait→ trait3/50000 can convert the diagram in Figure 1A to the one shown in Figure 1B. Therefore, two variables which do not interact with regards to how they predict trait as measured on the original scale, may well interact with regards to how they predict trait3/50000. This complicates biological interpretation of statistical interaction, unless the required scale of measurement is ‘obvious’ (so that we know there is only one particular scale in which we are interested) or unless an interaction term would be required on every scale (as would be the case in Figure 1C).
In spite of difficulties in interpretation, consideration of interaction terms may be warranted on the grounds of increasing power, given that the trait will generally be modeled on some particular scale, and interaction effects on that scale may well exist. Inclusion of interaction terms also allows improved predictive value of a model. Kraft et al.  showed that if the focus of a study is the detection of genetic effects, an appealing procedure is to fit a model that includes main effects and interactions and conduct a joint test for marginal (genetic main effect) association and gene-environment interaction (i.e. perform a test of association allowing for interaction). In the context of the linear model ln[p/(1−p)]=β0+βex1+βgx2+βgex1x2, this joint test is a 2 df test of βg=βge=0. Although not universally most powerful, the joint test is nearly optimal over a wide range of plausible penetrance models.
Family-based association studies are a popular alternative to case/control studies for the detection of genetic effects. Although limited by sample size (since families are generally harder to collect than unrelated cases and controls), families have some advantages over case/control samples, allowing the construction of tests that are generally robust to population stratification , and the examination of potentially more interesting effects such as those due to maternal genotype and/or imprinting [9-12]. A popular design is to collect cases and their parents, who may be analysed using methods based on the transmission of alleles from heterozygous parents to affected offspring . A more general method, that allows the fitting of linear regression models similar to those used in case/control studies, is the case/pseudocontrol approach [12,14], which conditions on the observed parental genotypes and constructs sets of matched ‘controls’ for the affected offspring from the untransmitted parental genotypes. This approach builds on previous methods for testing and estimation of genotype relative risks at a single locus [15,16] by extending these methods to allow for haplotype associations, gene-gene and gene-environment interactions, maternal genotype and parent-of-origin effects. In this approach, any environmental variables posessed by the case are copied over to the pseudocontrols, with the result that we cannot assess main effects of environment, but we can assess genetic effects and gene-environment interactions. As originally proposed, the case/pseudocontrol approach deals with missing data (such as unknown haplotype configurations due to phase uncertainty) through a complex conditioning argument [12,14], similar to the ‘conditioning on sufficient’ statistic approach used in the FBAT [17,18] program. A more efficient approach is to model the full likelihood (rather than the likelihood conditional on parental genotypes) and to account for missing data through use of missing data likelihood  or multiple imputation  approaches. Extensions to nuclear families with more than one affected offspring, and to larger pedigrees, can be derived either by conditioning on the identity-by-descent of alleles of related individuals   or by use of an empirical variance estimate  that is robust to genotype correlations among related individuals.
Table 1 shows the results of simulations of case/parent trios to assess the performance of several different methods for testing and estimation of haplotype effects that influence, in conjunction with a binary environmental exposure, a disease outcome. Under the null, all methods give correct type 1 error rates of approximately 5%. Correct type 1 error rates are also seen when testing for gene-environment interaction if the data is generated under a main effects model (so that no interaction effects exist). Regardless of the true disease model, all methods are found to give unbiased parameter estimation and correct 95% confidence interval coverage when both main genetic (haplotype) effects and haplotype-environment interactions are included in the regression analysis (i.e. a ‘joint analysis’ is performed) although note that the 95% confidence interval coverage for the MI-TDT method is slightly over-conservative (i.e. greater than 95%). When a ‘marginal analysis’ (including only haplotype effects in the regression model) is performed, unbiased parameter estimation is seen when the marginal model is in fact correct (e.g. under null and main effects models). For models where a gene-environment interaction term exists, so that the marginal analysis model is misspecified, the value of the parameter estimate exp(βg) (which equals the relative risk for the 2-2 haplotype) is seen to be biased towards the (unmodeled) value of the interaction term exp(βge).
In all situations, the three methods (MI-TDT, Unphased, Unph-Pa) that try to estimate or reconstruct missing haplotype data show considerably higher power than the case/pseudocontrol approach implemented in the pseudocc program, which uses only families and pseudocontrols in which the haplotype configurations can be reconstructed with certainty. Power is also higher for a specific test of the effect of haplotype 2-2 (a 1df test, or a 2df test if tested jointly together with a haplotype environment-interaction) than for a 3df (or 6df if tested jointly together with haplotype-environment interactions) test of any difference in risk conferred by the four possible haplotypes. This is not unexpected, given the fewer degrees of freedom in the haplotype specific test. In practice, this higher power would only be achievable if we had reason to believe a priori that it is haplotype 2-2 that confers the differential disease risk, and so wish to test only this specific haplotype.
Our results illustrate the observation by Kraft et al.  that performing a joint test of genotype and genotype interaction can be a powerful strategy for detection of genetic factors predisposing to disease. When gene-environment interaction effects exist, the joint test has higher power than either a marginal test (in which only genetic effects are included in the regression model) or a test of the gene-environment interaction term alone. When gene-environment interaction effects do not exist, the joint test loses only a little power compared to the optimal (in this situation) marginal test of genetic effects only.
In the simulations presented here, the power of the three approaches (MI-TDT, Unphased, Unph-Pa) that try to estimate or reconstruct missing data appears to be very similar and, on this basis, there is little to choose between them. However, these simulations were conducted assuming that the data arise from a homogeneous population. In the presence of population stratification, the performance (and in particular the type 1 error) of the methods considered here is likely to vary considerably. The ‘additional conditioning event’ approach implemented in pseudocc, like the conditioning on sufficient statistic approach implemented in FBAT [17,18], should provide complete robustness to population stratification, but, as we have seen, this comes at the expense of power. A locally optimal approach that also provides complete robustness to population stratification has been proposed by Allen and Satten ; however this too loses power  when there is a strong genetic effect and a high proportion of missing data. The missing data likelihood approach implemented in Unphased, particularly when parental association parameters are modeled, can considerably improve the power at the expense of only a small increase in type 1 error  when there is missing genotype data and population stratification.
The multiple imputation approach used here provides only partial protection from population stratification . However, it does have some advantages over the missing data likelihood approach in terms of its flexibility. In the multiple imputation approach, once multiple ‘complete’ imputed data sets have been constructed, we are free to fit whatever models we wish via analysis in a standard statistical package such as Stata or R (using methods from the multiple imputation literature [22,23] to combine estimates across the different multiple imputed data sets, and to perform tests). This allows us to fit models and perform tests that have not been implemented in specialised genetic analysis packages, such as the joint test of genotype main effect and gene-environment interaction, or the 3df test of haplotype interaction, neither of which are currently implemented in the Unphased software.
A danger of methods that reconstruct missing data probablistically is that one is tempted to treat the reconstructed data as it it were actually observed. Unless the uncertainty in the reconstructed missing data is appropriately allowed for in the analysis, one runs the risk of overestimating the amount of information actually available in the observed sample , resulting in an inflation of type 1 error and anti-conservative confidence intervals. Standard statistical theory, bourne out by our simulations, predicts that inference based on a proper missing data likelihood, as is used in the Unphased software, should appropriately allow for the uncertainty in haplotype reconstruction. Similarly, multiple imputation under the alternative hypothesis, using an IP (imputation/posterior sampling) algorithm  to sample the full Bayesian posterior distribution of haplotype data given the observed genotype and phenotype data, as is used in the MI-TDT software , should appropriately allow for any uncertainty in haplotype reconstruction (uncertainty given the complete-data model parameters, as well as uncertainty about these unknown model parameters). This is achieved through the generation of multiple reconstructed (imputed) complete data sets in which uncertainty is reflected through multiple alternative (differing) reconstructions (as output from the IP algorithm, essentially a Gibbs sampler), which must then be combined appropriately to provide the final inference .
It is common in the multiple imputation literature to use a relatively small number (e.g. 3-10) of imputed data sets to perform this final step. In the genetic context described here, and in previous investigations [19,26], the use of 10 imputed data sets appeared to be adequate, provided the level of missing data was not too large (up to about 30% genotypes missing). As pointed out by Nicolae et al. , the amount of information in an incomplete data set depends not only on the relationship between the observed and missing data, but on the hypothesis and test statistic being evaluated. Investigation of the sensitivity of the multiple imputation procedure to the number of imputed data sets generated, as well as to the use of different hypothesis tests, would be an interesting topic for further investigation. In theory, the use of larger numbers of imputed data sets should provide inferences that are better calibrated  as well as providing a better reflection of the inherent uncertainty in the haplotype reconstruction.
Simulations were conducted to examine the performance of several different family-based methods for testing and estimation of haplotype effects that influence, in conjunction with a binary environmental exposure, a disease outcome. In each case, 1000 case/parent trios were simulated under a model that assumed disease risk could be influenced by a two-marker haplotype acting in conjunction with a possible environmental exposure. The two genetic markers were both diallelic, leading to four possible haplotypes, denoted 1-1, 1-2, 2-1, 2-2 (here i-j denotes the occurrence of allele i at locus 1 in coupling with allele j at locus 2). Four different disease models were considered: 1) a null model in which there were no genetic effects but the environmental exposure (assumed to be at 30% frequency in the population) multiplied the offspring disease risk by a factor of two; 2) a main effects model in which the environmental exposure multiplied the offspring disease risk by two and each 2-2 haplotype multiplied the offspring disease risk by 1.5; 3) a pure interaction model in which each 2-2 haplotype multiplied the offspring disease risk by 1.5 only in the presence of the environmental trigger; and 4) a model in which main effects of both genotype and environment and their interaction all contributed to disease risk. In each case, 15% of genotypes in the parents and offspring were randomly set to be missing, in addition to missing data generated in the form of undeterminable phase (haplotype) resolutions.
The methods evaluated were: 1) the case/pseudocontrol approach  as implemented in the Stata program pseudocc (part of the genassoc  package); 2) the multiple imputation approach  implemented in the MI-TDT program; and 3) the missing data likelihood-based approach  implemented in Unphased program. The case/pseudocontrol approach deals with the missing data through use of an additional conditioning event  that means we only use families and pseudocontrols for which the haplotype configuration is inferrable; in addition the implementation in pseudocc only uses families in which there are no missing genotypes. The multiple imputation approach uses an iterative procedure to repeatedly fill in the missing data (missing genotypes or phase resolutions) as described in . In order to allow the fitting of gene-environment interaction effects, imputation here was performed within classes defined by the offspring's environmental exposure. This resulted in the generation of ten complete data sets that were then analysed using the Stata program mim  in order to combine estimates and construct tests [22,23]. The likelihood approach implemented in Unphased deals with missing data through direct maximisation of a full missing data likelihood, modeling genetic association parameters separately in parents and offspring . The default is to assume that there is no genetic association in the parents; we also considered use of the -parentrisk (-pa) option in Unphased, in which association parameters are estimated (separately) in the parents as well as in the offspring.
For each disease model, 1000 simulation replicates were used to examine the bias (the difference between the true and the expected values of the relative risks and log relative risks) for the genetic main effects and gene-environment interactions, the coverage of the 95% confidence intervals for these parameters (which, if a method is working correctly, should equal 95%) and the powers and type 1 errors i.e. the probability of declaring a significant result (at the 5% significance level) when the null hypothesis is false (for power) or true (for type 1 error).
Support for this work was provided by the Wellcome Trust (Grant reference 074524)