Complex traits are expected to arise from the interplay of multiple genetic variants and environmental factors through complicated mechanisms. If two genes are jointly involved in producing the variability of a phenotype, whether additively or not, biological interaction between them or their products must be involved [Wang, et al. 2010a]. In addition, there may be statistical interaction that may or may not be removable by a transformation of the data. Thus, statistical approaches that consider gene-gene/gene-environment interactions, including high-order interactions, are more likely to capture this complexity and could improve the discovery of important genetic variants. In this paper, we proposed a Forward U-Test for the joint association of multiple genetic variants, with consideration of possible gene-gene interaction. Through simulation, we have shown that our method performs better than GMDR under various scenarios, whether or not statistical interaction exists. This improvement can be explained by the following reasons: 1) Our method is an entirely non-parametric approach and makes no assumption about the trait distribution, while GMDR is based on a generalized linear model and implicitly specifies the link function with an assumption about the trait distribution. 2) Similar to MDR, GMDR reduces a quantitative trait to two levels by clustering multi-locus genotypes into a high-risk group and a low-risk group. Our method measures the differences in the trait among genotype groups without constraining the genotype groups to two levels, and may therefore draw more information from the quantitative variation of the trait. 3) Unlike MDR and GMDR, which select a set of candidate models for each model size, the Forward U-Test uses a cross-validation procedure to choose the most parsimonious model, which makes the result easier to interpret and replicate. 4) Our method uses a forward search rather than the exhaustive search used by GMDR. Forward selection can substantially reduce the search space of interaction combinations. As discussed by Wu and Zhao [Wu and Zhao 2009], the performance of the selection strategies depends on the underlying disease model. Our results indicated that, under additive and multiplicative models, forward selection outperformed exhaustive selection. However, we expect the power of forward selection to decrease if none of the genetic variants has any marginal effect. In this specific case, exhaustive selection will perform better than forward selection.
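The last point, that forward selection can lose to exhaustive selection when no causal variant has a marginal effect, can be illustrated with a small toy example. The sketch below is not the Forward U-Test: it assumes a stand-in score (the share of trait variance explained by joint genotype groups), a made-up data set of three binary loci, and hypothetical function names chosen for this illustration only.

```python
from collections import defaultdict
from itertools import combinations, product


def explained_fraction(genotypes, trait, loci):
    """Share of trait variance explained by the joint genotype groups of `loci`.

    Stand-in score for illustration only; it is not the U-statistic of the paper.
    """
    n = len(trait)
    grand_mean = sum(trait) / n
    total_ss = sum((y - grand_mean) ** 2 for y in trait)
    groups = defaultdict(list)
    for row, y in zip(genotypes, trait):
        groups[tuple(row[j] for j in loci)].append(y)
    between_ss = sum(
        len(ys) * (sum(ys) / len(ys) - grand_mean) ** 2 for ys in groups.values()
    )
    return between_ss / total_ss


def forward_search(genotypes, trait, max_size, tol=1e-9):
    """Greedy search: add the locus that most improves the score; stop when none helps."""
    n_loci = len(genotypes[0])
    selected, best = [], 0.0
    while len(selected) < min(max_size, n_loci):
        scores = {
            j: explained_fraction(genotypes, trait, selected + [j])
            for j in range(n_loci)
            if j not in selected
        }
        j_best = max(scores, key=scores.get)
        if scores[j_best] - best <= tol:
            break  # no remaining locus improves the current model
        selected.append(j_best)
        best = scores[j_best]
    return selected


def exhaustive_search(genotypes, trait, size):
    """Score every locus combination of the given size and return the best one."""
    n_loci = len(genotypes[0])
    return list(
        max(
            combinations(range(n_loci), size),
            key=lambda loci: explained_fraction(genotypes, trait, loci),
        )
    )


# Toy data (assumed): every combination of three binary loci, observed once.
# Loci 0 and 1 act only jointly (no marginal effect); locus 2 is a weak decoy.
genotypes = list(product([0, 1], repeat=3))
trait = [(g0 ^ g1) + 0.1 * g2 for g0, g1, g2 in genotypes]

forward_model = forward_search(genotypes, trait, max_size=2)
exhaustive_model = exhaustive_search(genotypes, trait, size=2)
```

On this toy data the greedy search first picks the decoy locus 2, finds that neither causal locus improves the score on its own, and stops, whereas the exhaustive search over pairs recovers loci 0 and 1. Under additive or multiplicative models the causal loci do carry marginal signal, so the greedy first step is much less likely to be misled.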
Besides the potential improvement in testing power, forward selection is also less computationally intensive than exhaustive selection. As the number of loci increases, the computation time and memory use of the exhaustive search increase exponentially, while those of the forward selection algorithm increase only quadratically. This makes it computationally feasible to test high-order interactions in high-dimensional data (e.g., genome-wide data). High-order epistasis may play an important role in gene networks. Early evidence in plants showed that the aggressiveness of isolates of Phytophthora capsici is influenced by high-order epistasis [Bartual, et al. 1993]. A recent study also detected a significant three-locus interaction associated with the development of inflammatory bowel disease (IBD) in humans [Wang, et al. 2010b]. Furthermore, in our study, we illustrated the proposed method with a moderate number of SNPs. For genome-wide association studies with millions of SNPs, Li et al. recently proposed a two-step analysis framework that integrates a trait preconditioning procedure with feature selection [Li, et al. 2011]. This approach first predicts a ‘preconditioned’ trait by a linear combination of features that are strongly correlated with the trait, and then applies feature selection to the ‘preconditioned’ trait. It has been shown that preconditioning can improve the performance of feature selection, especially for interaction effects. Such a strategy may also help detect genetic interactions by combining trait preconditioning with the proposed forward selection procedure.
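The exponential versus quadratic growth mentioned in this paragraph can be made concrete by counting candidate models. The following is a rough sketch under simplifying assumptions: it counts one score evaluation per candidate locus set and ignores cross-validation and permutation, which multiply both counts by the same kind of constant factor. The function names are hypothetical.

```python
from math import comb


def exhaustive_model_count(n_loci, max_order=None):
    """Number of locus sets an exhaustive search scores, for sizes 1..max_order."""
    if max_order is None:
        max_order = n_loci
    return sum(comb(n_loci, k) for k in range(1, max_order + 1))


def forward_model_count(n_loci, max_order=None):
    """Number of locus sets a forward search scores when it adds one locus per step."""
    if max_order is None:
        max_order = n_loci
    # Step k (0-based) tries each of the n_loci - k loci not yet in the model.
    return sum(n_loci - k for k in range(max_order))


for p in (10, 20, 40):
    print(p, exhaustive_model_count(p), forward_model_count(p))
```

With no cap on the model size, the exhaustive count is 2^p - 1 and the forward count is p(p + 1)/2; for 10 loci this is 1023 versus 55 candidate models.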
We also compared the power of the proposed method with that of stepwise linear regression. The stepwise linear regression was performed using the glm and step functions in R. During the stepwise regression process, SNPs were entered into the model by forward selection, and the most parsimonious model was determined based on the Akaike information criterion (AIC). Through simulation, we found that the Forward U-Test outperformed linear regression. For instance, under the two-locus multiplicative model with the largest marginal effect in the simulation (first scenario in ), the power of stepwise linear regression was 0.16 without considering the interaction effects and 0.152 when interaction effects were considered, both much lower than the power of the Forward U-Test. We also applied stepwise linear regression to the SAGE data. Because modeling interactions requires a large number of parameters, we applied stepwise linear regression considering only marginal effects. Applying stepwise linear regression to the initial FSCD data selected 26 SNPs. Further evaluation of these 26 SNPs in COGA and COGEND showed that they were not significantly associated with the trait. This result may indicate that parametric methods, such as linear regression, are less robust when a large number of SNPs are considered.
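For readers who want to see the comparison procedure spelled out, the sketch below imitates forward selection by AIC for a linear model with marginal (additive) SNP effects only. It is a Python stand-in written for illustration, not the R glm/step calls used in the analysis; the AIC is computed up to an additive constant, which does not change the model ranking, and the function names and the nine-sample data set are assumptions.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def residual_ss(columns, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(p)] for a in range(p)]
    xty = [sum(row[a] * yi for row, yi in zip(design, y)) for a in range(p)]
    beta = solve_linear_system(xtx, xty)
    return sum(
        (yi - sum(bj * xj for bj, xj in zip(beta, row))) ** 2
        for row, yi in zip(design, y)
    )


def aic(columns, y):
    """Gaussian AIC up to an additive constant: n * log(RSS / n) + 2 * k."""
    n = len(y)
    return n * math.log(residual_ss(columns, y) / n) + 2 * (len(columns) + 1)


def forward_aic_selection(snps, y):
    """Add the SNP that lowers AIC the most; stop when no addition lowers it."""
    selected, current = [], aic([], y)
    while len(selected) < len(snps):
        scores = {
            j: aic([snps[k] for k in selected + [j]], y)
            for j in range(len(snps))
            if j not in selected
        }
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= current:
            break
        selected.append(j_best)
        current = scores[j_best]
    return selected


# Assumed toy data: two SNPs coded 0/1/2 in a balanced 3 x 3 design.
snp_a = [a for a in (0, 1, 2) for b in (0, 1, 2)]
snp_b = [b for a in (0, 1, 2) for b in (0, 1, 2)]
# SNP a has an additive effect; the small remainder is a pure interaction term.
trait = [2.0 * a + 0.1 * (a - 1) * (b - 1) for a, b in zip(snp_a, snp_b)]

selected = forward_aic_selection([snp_a, snp_b], trait)
```

In this toy design only the first SNP enters the model: the second SNP contributes solely through the interaction term, which a marginal-effects model cannot represent, so adding it leaves the residual sum of squares unchanged and raises the AIC.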
The Forward U-Test also differs from existing U-Statistic based methods: 1) It calculates the global U-Statistic by a summation over the U-Statistics of multi-SNP genotype groups instead of each single SNP, which implicitly accounts for joint gene-gene action, whether additive or not; 2) It searches for the multi-SNP genotypes by a forward selection algorithm, which is important for high-dimensional data with a large number of non-functional SNPs. The size of the model selected by the forward selection algorithm may depend on the study sample size. The larger the sample size, the more complex the model, possibly including high-order interactions, that the approach can find. In addition, the choice of the weight parameter ω
can also have an impact on the performance of the approach. Different weights can be used in the proposed method (e.g. ωll′
for all l,l
′), but we chose
in our study because it appeared to have the best testing power.
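To make the structure of a weighted global statistic easier to picture, the sketch below sums weighted pairwise U-statistics over multi-SNP genotype groups. Several details are assumptions made for illustration and are not taken from this section: the kernel (the trait difference between a higher-mean and a lower-mean group), the ordering of groups by mean trait, the two example weight choices, and all function and parameter names.

```python
from collections import defaultdict
from itertools import combinations


def global_u(genotypes, trait, loci, weight=lambda m_low, m_high: 1.0):
    """Weighted sum of pairwise two-sample U-statistics over joint genotype groups.

    `weight` receives the sizes of the two groups being compared and returns the
    weight for that pair; the default gives every pair equal weight (an assumption).
    """
    groups = defaultdict(list)
    for row, y in zip(genotypes, trait):
        groups[tuple(row[j] for j in loci)].append(y)
    # Order groups by mean trait so that each pairwise contribution is non-negative.
    ordered = sorted(groups.values(), key=lambda ys: sum(ys) / len(ys))
    total = 0.0
    for low, high in combinations(ordered, 2):
        # Kernel assumed here: y_high - y_low, averaged over all cross-group pairs.
        u_pair = sum(yh - yl for yh in high for yl in low) / (len(high) * len(low))
        total += weight(len(low), len(high)) * u_pair
    return total


# Assumed toy data: one SNP with three genotype groups.
genotypes = [(0,), (0,), (1,), (1,), (2,)]
trait = [1.0, 3.0, 4.0, 6.0, 10.0]

u_equal = global_u(genotypes, trait, loci=[0])
u_size = global_u(genotypes, trait, loci=[0], weight=lambda m_low, m_high: m_low * m_high)
```

Passing a different `weight` function changes how much each pair of genotype groups contributes, which is one way to see why the choice of weights can affect testing power; the weights shown here are examples only.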
In the real data application, we identified two SNPs, located in CHRNA5 and NTRK2, jointly associated with ND. Both CHRNA5 and NTRK2 have been suggested to be functionally related to ND. SNP rs16969968, a non-synonymous coding SNP in exon 5 of CHRNA5
, was first reported to be ND-related with a significance level of 0.00064 [Saccone, et al. 2007], and has been replicated in several other studies [Berrettini, et al. 2008; Caporaso, et al. 2009; Grucza, et al. 2008; Schuckit, et al. 2008; Spitz, et al. 2008; Stevens, et al. 2008]. Studies have also suggested that CHRNA5 may interact with CHRNA3 to affect ND and lung cancer [Li, et al. 2010b; Li, et al. 2010c; Schlaepfer, et al. 2008; Weiss, et al. 2008]. SNP rs1122530, a non-coding SNP in NTRK2, has been found to be associated with ND in a haplotype analysis with two other SNPs (rs1659400 and rs1187272) of NTRK2 [Beuten, et al. 2007]. A previous study has found evidence of joint action between NTRK2
and multiple functional genes for ND, such as CHRNA4
, and BDNF
[Li, et al. 2008
]. However, to our knowledge, no joint action has been previously reported for CHRNA5 and NTRK2. Although the joint association of CHRNA5 and NTRK2 with ND, involving statistical interaction, reached statistical significance and was replicated in independent studies, further study is needed to replicate and investigate the statistical interaction.