|Home | About | Journals | Submit | Contact Us | Français|
Joint analysis of multiple SNP markers can be informative, but studying joint effects of haplotypes and environmental exposures is challenging. Population structure can involve both genes and exposures and a case-control study is susceptible to bias from either source of stratification. We propose a procedure that uses case-parent triad data and, though not fully robust, resists bias from population structure.
Our procedure assumes that haplotypes under study have no influence on propensity to exposure. Then, under a no-interaction null hypothesis (multiplicative scale), transmission of a causative haplotype from parents to affected offspring might show distortion from Mendelian proportions but should be independent of exposure. We used this insight to develop a permutation test of no haplotype-by-exposure interaction.
Simulations showed that our proposed test respects the nominal Type I error rate and provides good power under a variety of scenarios. We illustrate by examining whether SNP variants in GSTP1 modify the association between maternal smoking and oral clefting.
Our procedure offers desirable features: no need for haplotype estimation, validity under unspecified genetic main effects, tolerance to Hardy-Weinberg disequilibrium, ability to handle missing genotypes and a relatively large number of SNPs. Simulations suggest resistance to bias due to exposure-related population stratification.
Gene-environment interactions, where the risk of disease associated with carrying a certain genotype is modified by environmental exposures or, equivalently, where the effects of an exposure are modified by genotype, are likely to be important in the etiology of complex diseases. We consider the assessment of gene-by-environment interaction in the context of a case-parent triad design for a young-onset condition.
Haplotypes can carry more information about interactions than single SNPs. Multiple SNPs together may change protein folding and cause enzyme dysfunction, and the risk conferred by an impaired enzyme may depend on the individual's exposure level. Alternatively, association of disease with untyped causative SNPs may be captured by typed markers that are in linkage disequilibrium with them, and again risk may vary with exposure.
Extensions of single-marker methods to haplotype-based methods are not straightforward, however. Haplotypes cannot be uniquely reconstructed because of phase ambiguity, and the number of possible haplotypes grows exponentially with the number of SNPs. Studying haplotype-environment interaction brings added challenges. For a test of haplotype-by-environment interaction to be valid as broadly as possible, the null model should avoid restrictive specification of genetic main effects, because misspecification of the usually unknown mode of inheritance can invalidate a test for interaction. Another complication is that genetic population stratification can also involve the exposure. Haplotype frequencies may correlate with exposure prevalences across subpopulations with different baseline risks of disease, leading to spurious evidence for haplotype-by-environment interaction in case-control studies. Such bias can afflict even single-SNP family-based tests of interaction if they are not constructed carefully .
Family-based studies offer several important advantages over case-control studies. They can provide robustness to population stratification if analyzed properly, and they allow testing of nonstandard genetic effects, such as maternally-mediated genetic effects and imprinting (parent-of-origin) effects. Knowledge of family structure helps resolve phase ambiguity. Moreover, for a gene-environment interaction study involving a single SNP, family-based designs can require fewer genotypes than a case-control design to achieve the same power [2,3,4]. Self-selection based on exposure levels can bias results for a case-control design, but not for a family-based approach, where the inference is based on a correlation between the exposure and transmission patterns. Family designs also offer practical advantages, because recruiting family members can be much easier than identifying and recruiting appropriate population-based controls.
Other studies have considered testing haplotype-environment interaction using case-parent triads. Cordell et al.  generalized the case/pseudo-control approach  to haplotype analysis and also incorporated estimation and testing of haplotype-environment interaction effects. The software for implementing this approach allows flexible modeling of haplotype relative risks and interaction effects but discards triads when the haplotype phase of the child cannot be deduced from genotype information – potentially inducing substantial loss of statistical power, especially when some genotype data are missing. Allen and Satten  developed a method to cope with missing phase and genotype information when analyzing haplotypes and haplotype-environment interactions. Their approach, projection conditional on parental haplotypes (PCPH), is implemented in software that works well when only a few SNPs are studied. Dudbridge  developed a widely applicable likelihood-based association analysis for nuclear families. His software, UNPHASED, accommodates multiple offspring and is capable of testing haplotype-environment interactions in triads. These three approaches are based on parametric models for the relationship between individual haplotypes, exposure, and disease risk. With this reliance on parametric models, one can estimate the relative risk parameters in the model and perform tests for individual parameters. On the other hand, inferences depend to a large extent on the correctness of the postulated model and, with many haplotypes, the number of parameters quickly becomes challenging. Vansteelandt et al.  described an adaptation of the FBAT approach of Laird and colleagues  for studying haplotype-environment interactions. This approach relies in part on choosing a coding for the haplotypes and in this sense shares some pros and cons of the three fully parametric approaches.
We describe a non-parametric method to test haplotype-environment interaction using case-parent triads under the assumption that haplotypes under study have no influence on propensity to exposure. Then, under a no-interaction null hypothesis (multiplicative scale), transmission of a causative haplotype from parents to affected offspring might show distortion from Mendelian expectations but should be independent of exposure. We exploited this insight to develop a permutation test. Defining strata by sums of unphased parental genotypes, we base our test statistic on an approximation to the average of stratum-specific covariances between exposure and a score reflecting transmission. We generate its null distribution by permuting exposure among families within each stratum.
Our method offers several desirable features: it does not require investigators to know or to estimate haplotypes or their frequencies; it retains validity under any genetic main effects; it tolerates Hardy-Weinberg disequilibrium; it handles missing genotypes and it can accommodate a relatively large number of SNPs. Although it is not fully robust, we characterize it as ‘resistant’ to bias because our simulations suggest that it does not exceed its nominal Type I error rate under population stratification that could inflate the Type I error rate of a case-control study. We study its performance under scenarios with different numbers of SNPs, with missing genotypes, with a quantitative or a dichotomous exposure, and with or without exposure-involved population stratification.
The fundamental genotype data structure is the case-complement genotype-difference matrix D.  Consider k di-allelic markers and n families, where families each contribute one case-parent triad. Let Mij, Fij, and Cij(for i = 1, …, n, j = 1, …, k) represent the number of copies of the designated allele carried by the mother, father, and affected offspring in triad i at marker locus j (it does not matter which allele is enumerated). We define the complement as the hypothetical matched sibling who carries all the alleles not transmitted to the affected child. Thus, the complement inherits Mij+ Fij – Cij. The paired difference in genotype at every locus between each affected offspring and their complement produces an n × k matrix of differences, D, having the ij-th component equal to Dij = 2Cij – (Mij+ Fij) (i = 1, …, n, j = 1, …, k, and Dij ε[−2, −1, 0, 1, 2]). We used the observed matrix D to study apparent transmission distortion of haplotypes from parents to affected offspring through a haplotype-based transmission distortion test called TRIMM . Like the TDT, D reflects transmissions and, regardless of population stratification, has expected value 0 under the null that the set of k markers is not linked or not associated (within families) with the outcome under study. For a single SNP analysis, a case-complement analysis captures most of the information available through a conditional logistic  or log-linear  model, except that (as for the TDT) some information is lost for triads where both parents are heterozygous. But the case-complement approach (unlike the others) is easily extended to handle multiple linked SNPs.
Let X represent an n-component column vector with component Xi denoting the exposure for triad i. The exposure can be dichotomous or quantitative; it could be an exposure that was experienced by the mother during her pregnancy with the affected offspring or an exposure experienced post-natally by the affected offspring.
Any methods that use a case-parent triad design to study haplotype-by-environment interactions require a conditional or within-family independence assumption: we assume that conditional on parental genotypes, the haplotypes transmitted by the parents do not influence the offspring's propensity to exposure. This within-family independence is much weaker than the assumption required for a case-only analysis, where genotype and exposure must be independent unconditionally . Within-family independence is compatible with, for example, exposure-involved population stratification, whereas unconditional independence is not. In addition, when tests rely on LD between unmeasured causative SNPs and other measured SNPs, any method based on transmissions must assume that the exposure is not confounded with the underlying LD structure.
Assuming conditional independence of haplotypes and exposure, we develop a test for the omnibus null hypothesis that any risk-related multi-locus diplotypes act independently of the exposure, in the sense that the joint effects of genotype and exposure on disease risk are multiplicative. Operationally this null means that, even though transmissions of specific haplotypes to affected offspring might violate Mendelian proportions, those transmissions should nonetheless be statistically independent of the exposure.
Under the no-interaction null, one might expect the covariance between a transmission score at locus j, defined by Dij, and the exposure Xi to have expected value 0 for every j. We need, however, to account for the possibility that the exposure participates in bias-inducing population stratification, which can produce an overall correlation between transmissions to affected offspring and exposure, even under this null . Just as when studying a single SNP, protection against bias requires a strategy that groups together families with similar parental genotypes. Our approach calculates covariances within such groups of families.
For the moment imagine that we know phase and can stratify on parental diplotypes. Let g index strata defined by particular pairs of parental diplotypes. Under the no-interaction null, the covariance between Dij and Xi should be 0 within each stratum regardless of exposure-related genetic population stratification; under alternatives, transmission of some SNPs to affected offspring will instead be correlated with exposure. With that background as motivation, we designed our test procedure to estimate the average conditional-on-g (within-stratum) covariance at each component SNP. We make use of the following decomposition (suppressing subscript i) for locus j:
Here, Eg denotes the expected value taken over g, and Cov(.,.g) denotes the covariance between two specified arguments conditional on parental diplotypes. The last term on the right in (1) can be nonzero under exposure-involved population stratification but does not reflect within-stratum correlation between Dij and Xi. This decomposition indicates how to distinguish an interaction signal from population-stratification-induced bias and serves to motivate the development of our testing approach.
The mean of the cross product terms DijXi across all triads, namely:
estimates the first term on the right in equation (1). We can generate a corresponding permutation distribution for locus j by repeatedly permuting exposure values among families within each g stratum and computing a cj each time. Let denote the average cj across permutations. This permutation process recapitulates the null hypothesis by allowing transmissions to behave independently of exposure within each stratum. For the permutation distribution, then, the left side of (1) is zero by construction, and the two terms on the right must equal each other. Thus, , which estimates E(DjX) for permutations, also estimates Eg[E(Djg)E(Xg)] for permutations. Consequently, we propose to regard permutation-based as an estimate of Eg[E(Djg)E(Xg)] under the null and to use it to adjust for bias in cj due to exposure-involved population stratification.
To construct a multi-locus test of no haplotype-by-exposure interaction, we consider a vector with component cj for each locus j (j = 1, 2, 3, …, k). Because the bias-related second term on the right in (1) will differ for different loci, from each component cj we subtract , its permutation-based mean under the null, to form . We then form a test statistic as the sum of squared vj, that is,
We use the within-stratum permutation procedure described above to generate the permutation distribution for .
A modification of the statistic may enhance power under certain circumstances. The k SNPs may contribute unequally to the genetic risks, hence to the interaction, especially when a disease-susceptibility mutation is one of the typed SNPs and that SNP not only confers risk among the unexposed but also interacts with the exposure. In that circumstance one could potentially increase the power by combining the vj in a weighted fashion, using the estimated marginal genetic main effects of marker SNPs as weights. Accordingly, in a second procedure we used zj, the j-th component of the Z statistic from TRIMM  (the standardized mean of the j-th column of D) as a weight for SNP j and formed an alternate test statistic
The use of Tmax–ZV should also tend to favor finding interactions when the genotype main effect and a genotype-by-exposure interaction both involve the same haplotype.
In addition to the two test statistics and Tmax–ZV, we also considered a combined test, Tcombined, that takes advantage of the complementary strengths of both. To do this, we used the sum_log(P) approach of Shi et al. . If the interaction effect is due to a single causative SNP, a SNP that was fortuitously included in our measured set of k markers, then we expect Tmax–ZV to do better than . If instead a susceptibility haplotype is related to risk, or the SNPs studied behave collectively as markers for an untyped causative SNP or haplotype, then the advantage should go to . Consequently, the combination of the two might perform well under either scenario. We refer to the test based on Tcombined as GEI-TRIMM (gene-environment interaction triad multi-marker method).
With unphased genotype data, parental diplotypes are typically unknown, so instead of strata defined by parental diplotype pairs, we used strata defined by the vector of sums of parental genotypes (Mi,1 + Fi,1, Mi,2 + Fi,1, L, Mi,k + Fi,k). Strata defined this way represent a coarsening of strata based on parental diplotypes. Consequently, tests based on those strata are less than fully robust but remain resistant to bias from population structure. Other than this re-definition of strata, one generates the permutation distribution as described above. The empirical p value is the proportion of permutated-data-based test statistics that are larger than or equal to the observed-data-based test statistic.
Several available programs, such as Haplore  infer haplotypes based on genotypes from a pedigree even with incomplete genotype data. These programs typically provide a list for each family of the haplotype configurations compatible with the observed genotypes and corresponding estimated probabilities. These lists provide an efficient way to include families with missing genotypes. Using such a list of haplotype configurations, we calculated expected allele counts for each person with a missing SNP genotype and used those expected counts to replace any missing values. In this way, we imputed missing genotypes before calculating D and the observed-data cj.
Depending on a family's list of possible data-compatible haplotype configurations, the family may be assignable to several different strata based on the vector of sums of parental genotypes. For permutations, we handled the assignment to strata stochastically, by sampling. For each family with more than one possible triad-haplotype configuration, we first sampled a configuration from the list with probability proportional to the estimated probability from Haplore. Each family then had a provisional haplotype configuration and corresponding assigned stratum. We next generated a permuted data set by permuting exposure values among families within strata defined by summing the two corresponding parental genotypes. Note that, when calculating test statistics for permuted data, we use the same set of (possibly imputed) genotypes as those used for calculating observed-data statistics. The sampling of a triad-haplotype configuration thus helps define the permutation strata (which determine how to permute exposure), but it does not affect D. We then repeated the process by sampling (with replacement) a new configuration for each family from all possible triad-haplotype configurations and then carrying out exposure permutations within the parental-genotype-based strata. For improved computational efficiency, we generated five exposure permutes for each set of sampled triad-haplotype configurations, but one could also use one permute per set.
We used simulations to evaluate the performance of our proposed tests in a realistic context. We simulated haplotypes based on HapMap-phased genotype data from the sample with European ancestry, using haplotypes and their frequencies for a 100-kb region around the replication factor C1 gene (RFC1). This haplotype set is defined by 12 SNPs, which specify 17 different haplotypes (Appendix). We then created a thirteenth SNP, representing a causative mutation, whose variant allele resided on haplotype 1. We also constructed a second haplotype set with fewer SNPs by using 5 LD tagging SNPS for RFC1. We again introduced a causative SNP, residing on the same haplotype. This second set contained 12 haplotypes (Appendix). In both these haplotype sets, only one haplotype was associated with risk and its frequency was typically 0.28, although we also examined the influence of risk-haplotype frequency on the performance of our tests. Our simulations always involved only a single risk-associated haplotype; in contrast, our test procedures neither presuppose a particular number of risk haplotypes nor designate any haplotypes as candidates.
We simulated both dichotomous and continuous exposures. Dichotomous exposures were generated via Bernoulli trials with outcome probability equal to the designated exposure prevalence. We also simulated continuous exposures from a zero-enriched truncated lognormal distribution. A proportion of individuals had no exposure (again generated via Bernoulli trials). For the remaining individuals, we generated exposure levels by exponentiating a standard normal random number and rejecting values above 6.
Besides specifying distributions of haplotypes and exposure for our simulations, we need a risk model to generate case-parent triads. Our simulations allowed the risk associated with haplotypes to depend on exposure level; but the risk model that we used, though convenient for simulations, is neither required for the validity of nor imposed by our tests. Because the case-parent triad design precludes estimation of effects of exposures, our simulations assumed for simplicity that exposure was unrelated to risk among offspring with no copies of the risk-associated haplotype. Suppose X0 is a designated reference level of the exposure and let H denote the number of offspring-inherited copies of the risk-associated haplotype. In our simulations, X0 was always set to zero. We specify the genetic main effects parameters Rh for h = 1 or 2 as:
Thus, haplotype relative risks at the reference exposure level depend on the inherited number of copies, H, conditional on parental genotypes. For δ = X – X0, we define interaction parameters for the risk-associated haplotype as Ih(δ) for h = 1 or 2 as:
Ih(δ) could be any function of δ, but for simulations we used a log-linear function, so that the log Ih(δ) of was proportional to δ, a function we can characterize by specifying choices of Ih(1) (simply denoted Ih) for h = 1 or 2. Consequently, risk models in our simulations, whether exposure is dichotomous or continuous, can be represented by the risk vector (R1, R2, I1, I2) for the designated risk-associated haplotype. Implicitly, all other simulated haplotypes have the risk vector (1, 1, 1, 1).
Extending the notation established above from a single risk haplotype to include separate functions Ih(δ) for each haplotype under study allows a more precise statement of the null hypothesis of interest, namely, H0: I1(δ) = 1 = I2(δ) for all δ and for all haplotypes studied. This null asserts that, conditional on parental diplotypes, the relative risks associated with having inherited copies of any haplotype do not vary with the level of the exposure. This null is equivalent to one where the effect on risk of a change in exposure level does not depend on how many copies of any haplotype were inherited. Alternatives would include situations where inheritance of one specific haplotype influences risk in the unexposed while a different haplotype may interact with the exposure. For simplicity, however, we simulate only scenarios where the same haplotype that contributes to risk in the unexposed might also interact with the exposure.
For each simulated scenario, we generated 1,000 data sets, each with 1,000 triads. We generated triads by sampling parental pairs (with population stratification, both parents came from the same subpopulation), randomly generating the offspring haplotypes based on Mendelian inheritance and an offspring exposure from the scenario's exposure model, and then generating disease status randomly from diplotype and exposure through the scenario's presumed risk model. Cases were retained until 1,000 case-parent triads were generated. For each risk scenario, we considered four situations: the mutation SNP was either typed or not and genotypes had either 0 or 20% of loci randomly missing. We applied our three proposed tests to each simulated data set. This tactic enabled us to examine whether Tcombined (GEI-TRIMM) had power superior to Tmax–ZV when the causative SNP was genotyped and power superior to when the causative SNP was not genotyped. We used 1,000 permutations with each simulated data set to estimate p values.
To examine validity of our tests under population stratification, we simulated a no-interaction null scenario with a dichotomous exposure and two equal-sized subpopulations, each subpopulation having haplotypes in Hardy-Weinberg equilibrium (HWE). We used only the second haplotype set with 6 SNPs (Appendix). The disease ratio between subpopulations was 1:1.5 (ratio of risks for unexposed individuals with no copies of the risk haplotype). Risk-haplotype frequency and exposure prevalence were both 0.2 in one subpopulation and both 0.4 in the other. Thus this no-interaction scenario had both genetic-and exposure-related population stratification. We set (R1, R2, I1, I2) = (1.73, 3, 1, 1) in each subpopulation.
To further examine the Type I error behavior under population stratification, we used RFC1 with 6 SNPs again; but we reduced the risk haplotype frequencies and varied the proportional representation of the remaining haplotypes between the two subpopulations. We set the risk haplotype frequencies at 0.1 and 0.05 in the two subpopulations, respectively. In the first sub-population, all non-risk haplotypes retained their original HapMap frequencies rescaled to total 1–0.1; in the second sub-population, the non-risk haplotype frequencies were rescaled to total 1–0.05 and then re-assigned at random so that the frequency ranking of haplotypes differed between the two subpopulations. The exposure prevalences, disease ratio, and relative risk parameters were as in the previous paragraph. We constructed 12 distinct simulation scenarios by letting each haplotype in turn be the risk haplotype, and we simulated 1,000 data sets for each scenario.
We examined power for detecting interaction with a dichotomous exposure where either 13 or 6 SNPs (Appendix) were genotyped. Samples came from a homogeneous population, where the exposure prevalence was 0.2. For simplicity, we simulated genotypes in HWE (not required for test validity). We simulated data with four (R1, R2, I1, I2) settings: (1, 1, 2, 4), (1, 1, 1, 4) (1, 2, 2, 2), or (1.414, 2, 1.414, 2).
We also studied a quantitative exposure where the risk haplotype and exposure had a pure interaction, that is, haplotype was unrelated to risk among the unexposed. We drew triads from a population consisting of a mixture of two equal-sized subpopulations, each with haplotypes in HWE and with disease ratio 1.65. The proportions of unexposed individuals were 0.2 and 0.4, and the haplotype frequencies were 0.5 and 0.2 for the two subpopulations, respectively. We simulated scenarios with four (R1, R2, I1, I2) settings: a null scenario (1, 1, 1, 1) and three alterative scenarios (1, 1, 1.2, 1.44), (1, 1, 2.4, 2.4) and (1, 1, 1, 1.36).
To study the influence of risk-haplotype frequency on power, we generated triads either from a homogeneous population or from a stratified population formed from two equal-sized subpopulations each in HWE and with disease ratio 1.65. We used a quantitative exposure with unexposed prevalence of 0.3 for the homogeneous scenarios and of 0.2 and 0.4, respectively, for each subpopulation in the stratified scenarios. In these simulations, we set (R1, R2, I1, I2) at (1.2, 1.5, 1.2, 1.44). For the homogeneous scenarios, risk-haplotype frequencies ranged from 0.05 to 0.95. For the stratified scenarios, risk-haplotype frequencies (f1 and f2, respectively) were set so that the odds ratio:
and the overall haplotype frequency, namely, (f1 + f2)/2, ranged from 0.05 to 0.95. For example, for an overall haplotype frequency of 0.3, the frequencies in the two subpopulations are 0.5 and 0.2 respectively, which meets the odds ratio specification with (f1 + f2)/2 = 0.3.
We compared the performance of our method on complete genotype data with the pseudocontrol approach . Because the current implementations of UNPHASED  and PCPH  do not provide for saturating the genetic main effects (i.e. for fitting a separate parameter for each diplotype), we excluded them from our comparison. We evaluated Type I error rates under a null scenario with population stratification in our second haplotype set with 5 SNPs (causative SNP not typed). This scenario used a quantitative exposure with unexposed prevalences 0.1 and 0.9 and risk haplotype frequencies 0.9 and 0.1 for the two equal-sized subpopulations, respectively. The disease ratio was set at 1.65 and (R1, R2, I1, I2) at (1.73, 3, 1, 1). We also simulated populations with three haplotypes, each with two SNPs. We simulated a no-interaction null scenario with population stratification: a quantitative exposure with unexposed prevalences 0.2 and 0.4 and risk haplotype frequencies 0.5 and 0.2 for the two equal-sized subpopulations, respectively, with disease ratio 1.65. (R1, R2, I1, I2) was set at (1, 3, 1, 1). We also simulated a non-null scenario with a homogeneous population: a quantitative exposure with unexposed prevalence 0.3, risk haplotype frequency 0.3, and (R1, R2, I1, I2) set at (1, 1, 1.2, 1.44). For the pseudocontrol approach , we used ‘genotype’ coding, as described in the program's R-package documentation, to saturate main effects by fitting a parameter for each diplotype and tested for log-additive interaction with a likelihood ratio test.
With our second haplotype set (5–6 SNPs, Appendix) and a dichotomous exposure, our tests maintained nominal Type I error rates in the presence of exposure-involved population structure. The observed Type I error rates for all three tests were always statistically compatible with the nominal 0.05 with complete genotypes but fell slightly below nominal with 20% of SNPs missing (data not shown). In the set of simulations where each haplotype in turn was the risk haplotype (with frequency <10%), the empirical Type I error rates across the 12 complete-data scenarios ranged from 0.037 to 0.059.
With a homogeneous population, the power of our three tests with 1,000 triads was reasonably good for a dichotomous exposure with prevalence 0.20 and a risk-haplotype frequency of 0.28 in the pure interaction scenarios (1, 1, 2, 4) and (1, 1, 1, 4), regardless of the number of SNPs (table (table1).1). Power was much reduced, however, for scenarios (1, 2, 2, 2) and (1.4, 2, 1.4, 2), where the haplotype influenced risk among both unexposed and exposed. Compared to complete data scenarios, power dropped somewhat under all four risk scenarios when 20% of the SNP genotypes were sporadically missing. When all other factors were equal, 6-SNP scenarios had somewhat higher power than 13-SNP scenarios. This observation could in part reflect the LD structures for the particular haplotypes used in the simulations; or it could reflect the increase in the number of strata that need to be accommodated when the lists of data-compatible haplotype configurations become long.
The drop in power from the pure interaction risk scenarios with a dichotomous exposure can be understood by considering I2/I1. In families where one parent is heterozygous for the risk haplotype and the other is homozygous, for example, the scenario with (1, 2, 2, 2) produces identical transmission patterns to affected offspring regardless of exposure (because I1 = I2 = 2). Hence, those families become noninformative for interaction under this scenario.
For a quantitative exposure with exposure-related population stratification, the Type I error rates for our tests were statistically consistent with 0.05 when no genotypes were missing (table (table2).2). Both the 6- and 13-SNP scenarios appeared to maintain the nominal rate whether the causative SNP was typed or not, although the rate was slightly elevated for the scenario with 6 SNPs when the causative SNP was not typed. With 20% of genotypes missing, the Type I error rates were slightly below nominal for all scenarios. This conservatism can perhaps be understood by recognizing that our permutation distribution is constructed as a mixture of possible permutation distributions and as such has a larger variance than the idealized permutation distribution that would have been generated with no phase ambiguity.
As we predicted, for both dichotomous and quantitative exposures, the power for was superior to that of Tmax–ZV when the causative SNP was not genotyped, while the reverse was true when the causative SNP was genotyped. In both situations the combined test, GEI-TRIMM, achieved power that was close to the better of the procedures being combined.
Power for our tests with 1,000 case-parent triads for our quantitative exposure scenarios under population stratification was reasonably good under most risk scenarios that we examined (table (table2).2). We saw the lowest power with risk scenario (1, 1, 2.4, 2.4) with 13 SNPs. Generally speaking, power was larger when the causative SNP was typed compared to not typed, when data were complete compared to missing, and when 6 tagging SNPs were studied compared to 13.
We examined the power of our combined test GEI-TRIMM as a function of the risk-haplotype frequency for homogeneous and stratified populations, with the causative SNP either typed or not, and with 6 SNPs (fig. (fig.1)1) or 13 SNPs (fig. (fig.2).2). For 1,000 case-parent triads, power exceeded ~70% for risk-haplotype frequencies between 0.25 and 0.75, with much lower power for more extreme frequencies.
With our 5-SNP haplotype set (causative SNP not typed) and strong exposure-related population stratification, the pseudocontrol method showed inflated Type I error rates of 0.20, whereas GEI-TRIMM showed a rate of 0.04, which was statistically consistent with 0.05. In a 5-SNP 12-haplotype null scenario with no exposure-related population stratification, the pseudocontrol method still showed inflated Type I error rates (data not shown). These results demonstrate that GEI-TRIMM can cope with larger numbers of SNPs or haplotypes better than can the otherwise sound parametric method, reflecting the difficulties of asymptotic inference in highly parameterized models. For the 2-SNP null scenario, pseudocontrol, and GEI-TRIMM both yielded nominal Type I error rates near 0.05. For the 2-SNP non-null scenario, GEI-TRIMM (power = 0.948) performed better than did the pseudocontrol approach (power = 0.785).
GSTP1, one of the major enzymes contributing to inactivation of metabolites from cigarette smoke, is widely expressed in normal human epithelial tissue and is particularly abundant in the lung, esophagus and placenta. GSTP1 is the most highly expressed GST gene in all tissues of both human embryo and fetus . When treated with carcinogens, mice with deletions in the GST-π gene cluster (Gst-π1, and Gst-π2) showed a significantly increased incidence of papillomas . Thus, variable GSTP1 expression may be an important determinant of susceptibility to environmental chemicals. Shi et al. , using the method of Kistner et al. , reported a significant interaction between SNP rs947894 in GSTP1 and maternal smoking during pregnancy on the risk of cleft lip with or without cleft palate (CL/P) in a Danish sample. We reanalyzed those data to explore possible haplotype-by-smoking interaction using three SNPs (rs947894, rs1799811, rs762803) in GSTP1 genotyped in 498 Danish triads with CL/P.
We observed p = 0.062 for , p = 0.059 for Tmax–ZV, and p = 0.050 for GEI-TRIMM, suggesting GSTP1 haplotypes modulate the effect of maternal smoking on risk of CL/P. The corrected mean cross product vector was (0.026, 0.036, 0.013), where the positive signs indicate increased transmission of minor alleles to affected offspring born to smoking mothers compared to affected offspring born to nonsmokers.
Such interactions are, however, hard to interpret without examining main effects. In this example, when we applied TRIMM to the data without considering maternal smoking status, the Z vector was (−0.5, −1.3, −1.7) with p = 0.25. Thus, we saw no significant overall main effect of haplotypes, with possibly some under-transmission of the interaction-identified haplotype to affected offspring. When smokers and nonsmokers were considered separately, the Z statistic among nonsmokers was (−1.6, −1.9, −2.45), with a TRIMM p = 0.05, while that among smokers was (1.8, 1.3, 0.1) with a TRIMM p = 0.21. This analysis suggests that a haplotype marked by the minor alleles (the ones being enumerated) at these three loci confers protection among nonsmokers (since all the signs are negative for the Z vector) but that the protection is attenuated by maternal smoking.
Multiplicative interaction between an inherited set of SNP markers and an exposure can be assessed through our combined GEI-TRIMM test. Stratifying on the vector of the sum of unphased parental genotypes allows permutation of exposure measures among genetically similar families, a strategy that confers resistance to exposure-related population stratification and its attendant biases, but does not guarantee complete robustness. Our simulations suggest that, in a realistic setting with a moderate sample size, GEI-TRIMM applied to the offspring-inherited genotype resists bias due to fairly strong exposure-related population stratification and is also reasonably powerful for detecting haplotype-by-exposure interactions – even though Type I error rates sometimes fall below the nominal level (table (table2).2). It continues to perform well when 20% of SNPs are sporadically missing, or when certain individuals (e.g. 20% of the fathers) are missing altogether (data not shown). It requires no parametric assumptions about the exposure distribution, and the investigator need not assume a particular underlying mode of inheritance. Its construction, by building on a combination of tests, allows GEI-TRIMM to perform well if there is a single causative SNP, whether or not that SNP was directly typed. Our simulations were, however, limited to scenarios where at most one haplotype was involved in main effects and in interactions with the exposure. Software for implementing GEI-TRIMM is freely available at http://www.niehs.nih.gov/research/atniehs/labs/bb/staff/weinberg/index.cfm#downloads.
Although GEI-TRIMM can cope more successfully with larger numbers of SNPs than its competitors can in testing haplotype-environment interaction, increasing numbers of SNPs do impose costs. As the number of SNPs increases, the number of possible strata based on the vectors of sums of parental genotypes increases rapidly. Large numbers of strata will reduce power because many strata will then have only one triad, or possibly contain only triads with tied exposure values, and such strata will not contribute to the test statistic. In one extreme example among our simulations with a dichotomous exposure, when 13 SNPs were included in the analysis, we lost 60% of triads in this way. Also, with missing genotype data, GEI-TRIMM and its competitors assume that no recombinations have occurred between parents and offspring. With sets of SNPs that span long genomic regions, the likelihood of recombination grows; and the performance of GEI-TRIMM and its competitors will suffer.
Because GEI-TRIMM does not fully stratify on parental diplotypes, residual bias due to population stratification can potentially persist. One can construct extreme scenarios where a stratum defined by M + F still contains subpopulations that differ in both their exposure and risk haplotype distributions. In such a setting the Type I error rate for GEI-TRIMM can be inflated. Consequently, the proposed method can claim only to be resistant to population stratification and not fully robust. Nevertheless, our simulations revealed a Type I error rate that respected the nominal level of the test. Moreover, if many linked SNPs were included in the analysis, fully stratifying on the parental diplotypes, even if possible, would require more strata than would stratification on M + F, rendering the number of strata so large that the power of the test would be compromised. From this perspective, GEI-TRIMM represents a pragmatic compromise between bias and efficiency.
Just as our TRIMM method  can be adapted to detect effects either of an offspring-inherited haplotype or of a haplotype carried by the mother that acts on the offspring via effects on the prenatal environment, the GEI-TRIMM method can also be adapted to detect interactions between exposure and maternal genotype. One simply needs to replace the case-complement difference with the mother-father difference. Conveniently, when no genotypes are missing, GEI-TRIMM tests for interactions with maternal haplotypes are statistically independent of their transmission-based counterparts for interactions with offspring-inherited haplotypes. The user should be cautious, however, when employing these methods to study gene-environment interactions related to maternal genetic effects. The conditional independence assumption needed for the validity of tests for interaction with offspring-inherited genes is a weak one, specifying that inheritance of the haplotype does not influence the offspring's propensity to be exposed, conditional on the parental genotypes. The corresponding assumption needed to assess maternal haplotype-by-exposure interaction is stronger: one must assume that when the parents differ in their diplotypes, that difference is not predictive of the propensity for the mother to be exposed (presumably during or before the pregnancy). In a population with exposure-related population stratification, such an assumption might often fail, for example, if maternal ancestry is related to both maternal genotype and maternal exposures, and ethnically mixed marriages are common.
To our knowledge, GEI-TRIMM is the first method that can validly study the possible multiplicative interaction of many SNPs simultaneously in relation to an exposure, without requiring prior knowledge of candidate haplotypes, or identification of phased diplotypes, or specification of a genetic inheritance model. It can handle sporadically missing SNPs and can fully use data from triads having only one parent available for genotyping. Though GEI-TRIMM evidently can become conservative when many SNPs are missing, it was designed to be approximately robust to exposure-involved genetic population stratification, and our simulations support that robustness.
This research was supported by the Intramural Research Program of the NIH, National Institute of Environmental Health Sciences (Z01-ES040007; Z01-ES45002). We thank Dr. Jeff Murray for sharing the clefting genotype data and Drs. Dana Hancock and Dmitri Zaykin for helpful comments on the manuscript.
Haplotype structures and frequencies used in the simulation studies were based on HapMap phased genotype data for the RFC1 gene in the sample with European ancestry.
|SNP index||Haplotype frequency|
SNP 13 is the hypothetical causative SNP that we created (when SNP 13 is not typed, we refer to a 12-SNP scenario); haplotype 1 is the risk-associated haplotype.
|SNP index||Haplotype frequency|
Five LD-tagging SNPs from the previous set plus SNP 13, which is the hypothetical causative SNP that we created (when SNP 13 is not typed, we refer to a 5-SNP scenario); haplotype 1 is the risk-associated haplotype.