Gene–gene interactions have an important role in complex human diseases. Detection of gene–gene interactions has long been a challenge due to their complexity. The standard method aiming at detecting SNP–SNP interactions may be inadequate as it does not model linkage disequilibrium (LD) among SNPs in each gene and may lose power due to a large number of comparisons. To improve power, we propose a principal component (PC)-based framework for gene-based interaction analysis. We analytically derive the optimal weight for both quantitative and binary traits based on pairwise LD information. We then use PCs to summarize the information in each gene and test for interactions between the PCs. We further extend this gene-based interaction analysis procedure to allow the use of imputation dosage scores obtained from a popular imputation software package, MACH, which incorporates multilocus LD information. To evaluate the performance of the gene-based interaction tests, we conducted extensive simulations under various settings. We demonstrate that gene-based interaction tests are more powerful than SNP-based tests when more than two variants interact with each other; moreover, tests that incorporate external LD information are generally more powerful than those that use genotyped markers only. We also apply the proposed gene-based interaction tests to a candidate gene study on high-density lipoprotein. As our method operates at the gene level, it can be applied to a genome-wide association setting and used as a screening tool to detect gene–gene interactions.
gene–gene interaction; linkage disequilibrium; imputation
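The imputation dosage scores mentioned above can be illustrated with a short sketch. It assumes the usual definition of a dosage as the expected number of copies of one allele given the three posterior genotype probabilities; the function name and argument names are hypothetical and are not taken from MACH.

```python
def dosage(p_hom_ref: float, p_het: float, p_hom_alt: float) -> float:
    """Expected number of copies (0 to 2) of the alternate allele.

    Assumption: the three arguments are posterior genotype probabilities
    for 0, 1 and 2 copies of the alternate allele and sum to 1.
    """
    total = p_hom_ref + p_het + p_hom_alt
    if abs(total - 1.0) > 1e-6:
        raise ValueError("genotype probabilities must sum to 1")
    # E[copies] = 0 * P(0 copies) + 1 * P(1 copy) + 2 * P(2 copies)
    return p_het + 2.0 * p_hom_alt


if __name__ == "__main__":
    # A confidently typed heterozygote and an uncertain imputed genotype.
    print(dosage(0.0, 1.0, 0.0))
    print(dosage(0.1, 0.3, 0.6))
```

A dosage is a continuous value, so it can replace a hard genotype call in a regression or a principal component summary without discarding imputation uncertainty.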
Genome-wide association studies (GWAS) have been widely applied to identify informative SNPs associated with common and complex diseases. Besides single-SNP analysis, the interaction between SNPs is believed to play an important role in disease risk due to the complex networking of genetic regulations. While many approaches have been proposed for detecting SNP interactions, the relative performance and merits of these methods in practice are largely unclear. In this paper, a ground-truth based comparative study is reported involving 9 popular SNP detection methods using realistic simulation datasets. The results provide general characteristics and guidelines on these methods that may be informative to the biological investigators.
Genome-wide association study; single-nucleotide polymorphism; SNP interaction
Recent studies have shown that quantitative phenotypes may be influenced not only by multiple single nucleotide polymorphisms (SNPs) within a gene but also by the interaction between SNPs at unlinked genes. We propose a new statistical approach that can detect gene-gene interactions at the allelic level which contribute to the phenotypic variation in a quantitative trait. By testing for the association of allelic combinations at multiple unlinked loci with a quantitative trait, we can detect the SNP allelic interaction whether or not it can be detected as a main effect. Our proposed method assigns a score to unrelated subjects according to their allelic combination inferred from observed genotypes at two or more unlinked SNPs, and then tests for the association of the allelic score with a quantitative trait. To investigate the statistical properties of the proposed method, we performed a simulation study to estimate type I error rates and power and demonstrated that this allelic approach achieves greater power than the more commonly used genotypic approach to test for gene-gene interaction. As an example, the proposed method was applied to data obtained as part of a candidate gene study of sodium retention by the kidney. We found that this method detects an interaction between the calcium-sensing receptor gene (CaSR), the chloride channel gene (CLCNKB) and the Na, K, 2Cl cotransporter gene (CLC12A1) that contributes to variation in diastolic blood pressure.
quantitative trait loci; allelic test; interaction effect; blood pressure
Complex phenotypes are known to be associated with interactions among genetic factors. A growing body of evidence suggests that gene–gene interactions contribute to many common human diseases. Identifying potential interactions of multiple polymorphisms thus may be important to understand the biology and biochemical processes of the disease etiology. However, despite the great success of genome-wide association studies that mostly focus on single locus analysis, it is challenging to detect these interactions, especially when the marginal effects of the susceptible loci are weak and/or they involve several genetic factors. Here we describe a Bayesian classification tree model to detect such interactions in case-control association studies. We show that this method has the potential to uncover interactions involving polymorphisms showing weak to moderate marginal effects as well as multi-factorial interactions involving more than two loci.
Epistasis; GWAS; Bayesian CART; MCMC; logistic regression; Crohn’s disease
Detecting epistatic (nonlinear) interactions among single nucleotide polymorphisms (SNPs) at multiple loci is important in the analysis of genomic data in association studies. We developed a Bayesian combinatorial partitioning (BCP) method for detecting such interactions among SNPs that are predictive of disease. When compared with multifactor dimensionality reduction (MDR), a widely used combinatorial partitioning method for detecting interactions, BCP has significantly greater power and is computationally more efficient.
Studies have shown that interactions among single nucleotide polymorphisms (SNPs) may play an important role in understanding the causes of complex disease. Machine learning approaches provide useful features to explore interactions more effectively and efficiently. We have proposed an integrated method that combines two machine learning methods, Random Forests (RF) and Multivariate Adaptive Regression Splines (MARS), to identify a subset of important SNPs and detect interaction patterns. In this two-stage RF-MARS (TRM) approach, RF is first applied to detect a predictive subset of SNPs, and then MARS is used to identify the interaction patterns among the selected SNPs. We evaluated the performance of TRM in four models: three causal models with one two-way interaction and one null model. RF variable selection was based on the out-of-bag classification error rate (OOB) and the variable importance spectrum (IS). First, we compared the selection of important variables by RF and MARS. Our results indicate that RFOOB performed better than MARS and RFIS in detecting important variables. We also evaluated the true positive rate and false positive rate of identifying interaction patterns in TRM and MARS. This study demonstrates that TRMOOB, which is RFOOB plus MARS, combines the strengths of RF and MARS in identifying SNP-SNP interaction patterns in a scenario of 100 candidate SNPs. TRMOOB had a greater true positive rate and a lower false positive rate than MARS, particularly when searching for interactions with a strong association with the outcome. Therefore, TRMOOB is favored for exploring SNP-SNP interactions in a large-scale genetic variation study.
polymorphism; interaction; machine learning
We describe a hierarchical clustering algorithm for using Single Nucleotide Polymorphism (SNP) genetic data to assign individuals to populations. The method assumes neither Hardy-Weinberg equilibrium nor linkage equilibrium among loci in the sampled individuals.
We show that the algorithm can assign sample individuals highly accurately to their corresponding ethnic groups in our tests using HapMap SNP data and it is also robust to admixed populations when tested with Perlegen SNP data. Moreover, it can detect fine-scale population structure as subtle as that between Chinese and Japanese by using genome-wide high-diversity SNP loci.
The algorithm provides an alternative approach to the popular STRUCTURE program, especially for fine-scale population structure detection in genome-wide association studies. This is the first successful separation of Chinese and Japanese samples using random SNP loci with high statistical support.
An important challenge in the analysis of single nucleotide polymorphism (SNP) data is the identification of SNPs that interact in a nonlinear fashion in their association with disease. Such epistatic interactions among genetic variants at multiple loci likely underlie the inheritance of common diseases. We have developed a novel method called the Bayesian combinatorial method (BCM) for detecting combinations of genetic variants that are predictive of disease. When compared with multifactor dimensionality reduction (MDR), a widely used combinatorial method, BCM has significantly greater power to detect interactions and is computationally more efficient.
Genetic mutations may interact to increase the risk of human complex diseases. Mapping of multiple interacting disease loci in the human genome has recently shown promise in detecting genes with little main effect. The power of interaction association mapping, however, can be greatly influenced by the set of single nucleotide polymorphisms (SNPs) genotyped in a case–control study. Previous imputation methods focus only on the imputation of individual SNPs without considering their joint distribution of possible interactions. We present a new method that simultaneously detects multilocus interaction associations and imputes missing SNPs from a full Bayesian model. Our method treats both the case–control sample and the reference data as random observations. The output of our method is the posterior probabilities of SNPs for their marginal and interacting associations with the disease. Using simulations, we show that the method produces accurate and robust imputation with little overfitting. We further show that, with the type I error rate maintained at a common level, SNP imputation can consistently and sometimes substantially improve the power of detecting disease interaction associations. We use a data set of inflammatory bowel disease to demonstrate the application of our method.
Bayesian analysis; Case–control studies; Missing data
Gene-based and single-nucleotide polymorphism (SNP) set association studies provide an important complement to SNP analysis. Kernel-based nonparametric regression has recently emerged as a powerful and flexible tool for this purpose. Our goal is to explore whether this approach can be extended to incorporate and test for interaction effects, especially for genes containing rare variant SNPs. Here, we construct nonparametric regression models that can be used to include a gene-environment interaction effect under the framework of the least-squares kernel machine and examine the performance of the proposed method on the Genetic Analysis Workshop 17 unrelated individuals data set. Two hundred simulated replicates were used to explore the power for detecting interaction. We demonstrate through a genome scan of the quantitative phenotype Q1 that the simulated gene-environment interaction effect in the data can be detected with reasonable power by using the least-squares kernel machine method.
How dopamine genes, impulsivity and childhood trauma interact in their association with substance abuse remains unclear.
To clarify the impacts and the interactions of the catechol-O-methyltransferase (COMT) gene, impulsivity and childhood trauma on the age of onset of heroin use among heroin-dependent patients in China.
202 male and 248 female inpatients who met DSM-IV criteria for heroin dependence were enrolled. Impulsivity and childhood trauma were measured using the BIS-11 (Barratt Impulsiveness Scale-11) and the ETISR-SF (Early Trauma Inventory Self Report-Short Form). The single nucleotide polymorphism (SNP) rs737866 on the COMT gene, which has previously been associated with heroin abuse, was genotyped using a DNA sequence detection system. A structural equation model was used to assess the interaction paths between these factors and the age of onset of heroin use.
A chi-square test indicated that individuals with the TT genotype had an earlier age of onset of heroin use than those with the CT or CC genotype. In the correlation analysis, the severity of childhood trauma was positively correlated with the impulsivity score, and both were negatively related to the age of onset of heroin use. In the structural equation model, both the COMT gene and childhood trauma affected the age of onset of heroin use, either directly or via impulsive personality.
Our findings indicate that the COMT gene, impulsive personality traits and childhood trauma experience interact to affect the age of onset of heroin use, which plays a critical role in the development of heroin dependence. The impact of the environmental factor on the development of heroin dependence was greater than that of the COMT gene.
A number of recent works have introduced statistical methods for detecting genetic loci that affect phenotypic variability, which we refer to as variability-controlling quantitative trait loci (vQTL). These are genetic variants whose allelic state predicts how much phenotype values will vary about their expected means. Such loci are of great potential interest in both human and non-human genetic studies, one reason being that a detected vQTL could represent a previously undetected interaction with other genes or environmental factors. The simultaneous publication of these new methods in different journals has in many cases precluded opportunity for comparison. We survey some of these methods, the respective trade-offs they imply, and the connections between them. The methods fall into three main groups: classical non-parametric, fully parametric, and semi-parametric two-stage approximations. Choosing between alternatives involves balancing the need for robustness, flexibility, and speed. For each method, we identify important assumptions and limitations, including those of practical importance, such as their scope for including covariates and random effects. We show in simulations that both parametric methods and their semi-parametric approximations can give elevated false positive rates when they ignore mean-variance relationships intrinsic to the data generation process. We conclude that choice of method depends on the trait distribution, the need to include non-genetic covariates, and the population size and structure, coupled with a critical evaluation of how these fit with the assumptions of the statistical model.
In case-control studies identifying disease susceptibility loci, it has been shown that the interaction caused by multiple single nucleotide polymorphisms (SNPs) within a gene as well as by SNPs at unlinked genes plays an important role in influencing the risk of a disease. A novel statistical approach is proposed to detect gene-gene interactions at the allelic level contributing to a disease trait. With a new allelic score inferred from the observed genotypes at two or more unlinked SNPs, we derive a score test from logistic regression and test for association of the allelic scores with a disease trait. Furthermore, F and likelihood ratio tests are derived from Cochran-Armitage regression. By testing for this association, the interaction can be assessed whether or not the SNP association can be detected as a main effect in a single-SNP approach. The analytical power and type I error rates over six two-way interaction models are investigated based on the non-centrality parameter approximation of the score test. Simulation studies demonstrate that (1) the power of the score test is asymptotically equivalent to that of the test statistics from the Cochran-Armitage method and (2) the allelic-based method provides higher power than two genotypic-based methods.
Allelic test; Interaction effect; Score test; Cochran-Armitage method; Epistasis
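For orientation, the classical Cochran-Armitage trend test that the abstract above builds on can be sketched as follows. This is the textbook single-SNP statistic with additive weights (0, 1, 2), not the allelic score test or the F and likelihood ratio tests proposed in the abstract; the function name and the example counts are hypothetical.

```python
import math


def cochran_armitage_z(case_counts, control_counts, weights=(0, 1, 2)):
    """Z statistic of the Cochran-Armitage trend test on a 2 x k table.

    case_counts / control_counts: counts per genotype category
    (for a SNP: 0, 1 or 2 copies of the tested allele).
    """
    r_total = sum(case_counts)
    s_total = sum(control_counts)
    n_total = r_total + s_total
    col = [r + s for r, s in zip(case_counts, control_counts)]

    t_r = sum(t * r for t, r in zip(weights, case_counts))
    t_n = sum(t * n for t, n in zip(weights, col))
    t2_n = sum(t * t * n for t, n in zip(weights, col))

    # T = N * sum(t_i r_i) - R * sum(t_i n_i)
    numerator = n_total * t_r - r_total * t_n
    # Var(T) = (R * S / N) * (N * sum(t_i^2 n_i) - (sum(t_i n_i))^2)
    variance = r_total * s_total * (n_total * t2_n - t_n ** 2) / n_total
    if variance <= 0:
        raise ValueError("trend variance is zero; table is degenerate")
    return numerator / math.sqrt(variance)


if __name__ == "__main__":
    # Hypothetical genotype counts (0, 1, 2 copies) in cases and controls.
    print(cochran_armitage_z((30, 50, 20), (45, 45, 10)))
```

Under the null hypothesis of no trend, the statistic is approximately standard normal, so its square can be compared with a chi-square distribution with one degree of freedom.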
The detection of epistatic interactive effects of multiple genetic variants on the susceptibility of human complex diseases is a great challenge in genome-wide association studies (GWAS). Although methods have been proposed to identify such interactions, the lack of an explicit definition of epistatic effects, together with computational difficulties, makes the development of new methods indispensable. In this paper, we introduce epistatic modules to describe epistatic interactive effects of multiple loci on diseases. On the basis of this notion, we put forward a Bayesian marker partition model to explain observed case-control data, and we develop a Gibbs sampling strategy to facilitate the detection of epistatic modules. Comparisons of the proposed approach with three existing methods on seven simulated disease models demonstrate the superior performance of our approach. When applied to a genome-wide case-control data set for Age-related Macular Degeneration (AMD), the proposed approach successfully identifies two known susceptible loci and suggests that a combination of two other loci—one in the gene SGCD and the other in SCAPER—is associated with the disease. Further functional analysis supports the speculation that the interaction of these two genetic variants may be responsible for the susceptibility of AMD. When applied to a genome-wide case-control data set for Parkinson's disease, the proposed method identifies seven suspicious loci that may contribute independently to the disease.
Although genome-wide association studies (GWAS) have been quite popular due to recent advances in low-cost genotyping techniques, most of the reported studies only analyze single-locus effects because traditional multi-locus methods are not computationally practical in the detection of epistatic interactive effects of multiple loci. Here, on the basis of a rigorous definition of epistatic modules that describe interactive effects of multiple loci, we take advantage of a Bayesian model with a properly designed Gibbs sampling strategy to facilitate the detection of such modules. We confirm via extensive simulation studies that the proposed method, named epiMODE, is not only feasible in detecting multi-locus effects but also more powerful than three representative methods on seven disease models. We apply the proposed method to an Age-related Macular Degeneration (AMD) data set and discover that a combination of two loci (one in the gene SGCD and the other in SCAPER) might be associated with AMD. Considering its advantages, we suggest that the proposed method be applied to more GWAS data for the detection of multi-locus interactive effects.
Rare variants are believed to play an important role in disease etiology. Recent advances in high-throughput sequencing technology enable investigators to systematically characterize the genetic effects of both common and rare variants. We introduce several approaches that simultaneously test the effects of common and rare variants within a single-nucleotide polymorphism (SNP) set based on logistic regression models and logistic kernel machine models. Gene-environment interactions and SNP-SNP interactions are also considered in some of these models. We illustrate the performance of these methods using the unrelated individuals data from Genetic Analysis Workshop 17. Three true disease genes (FLT1, PIK3C3, and KDR) were consistently selected using the proposed methods. In addition, compared to logistic regression models, the logistic kernel machine models were more powerful, presumably because they reduced the effective number of parameters through regularization. Our results also suggest that a screening step is effective in decreasing the number of false-positive findings, which is often a big concern for association studies.
Recent genome-wide association studies have identified independent susceptibility loci for prostate cancer (CaP) that could influence risk through interaction with other, possibly undetected, susceptibility loci. We explored evidence of interaction between pairs of 13 known susceptibility loci and single nucleotide polymorphisms (SNPs) across the genome to generate hypotheses about the functionality of CaP susceptibility regions. We used data from Cancer Genetic Markers of Susceptibility: Stage I included 523,841 SNPs in 1175 cases and 1100 controls; Stage II included 27,383 SNPs in an additional 3941 cases and 3964 controls. Power calculations assessed the magnitude of interactions our study is likely to detect. Logistic regression was used with alternative methods that exploit constraints of gene-gene independence between unlinked loci to increase power. Our empirical evaluation demonstrated that an empirical Bayes (EB) technique is powerful and robust to possible violation of the independence assumption. Our EB analysis identified several noteworthy interacting SNP pairs, although none reached genome-wide significance. We highlight a Stage II interaction between the major CaP susceptibility locus in the subregion of 8q24 that contains POU5F1B and an intronic SNP in the transcription factor EPAS1, which has potentially important functional implications for 8q24. Another noteworthy result involves interaction of a known CaP susceptibility marker near the prostate protease genes KLK2 and KLK3 with an intronic SNP in PRXX2. Overall, the interactions we have identified merit follow-up study, particularly the EPAS1 interaction which has implications not only in CaP but also in other epithelial cancers that are associated with the 8q24 locus.
Conservation of the spatial binding organizations at the level of physico-chemical interactions is important for the formation and stability of protein-protein complexes as well as protein and drug design. Due to the lack of computational tools for recognition of spatial patterns of interactions shared by a set of protein-protein complexes, the conservation of such interactions has not been addressed previously.
We performed extensive spatial comparisons of physico-chemical interactions common to different types of protein-protein complexes. We observed that 80% of these interactions correspond to known hot spots. Moreover, we show that spatially conserved interactions allow prediction of hot spots with a success rate higher than obtained by methods based on sequence or backbone similarity. Detection of spatially conserved interaction patterns was performed by our novel MAPPIS algorithm. MAPPIS performs multiple alignments of the physico-chemical interactions and the binding properties in three dimensional space. It is independent of the overall similarity in the protein sequences, folds or amino acid identities. We present examples of interactions shared between complexes of colicins with immunity proteins, serine proteases with inhibitors and T-cell receptors with superantigens. We unravel previously overlooked similarities, such as the interactions shared by the structurally different RNase-inhibitor families.
The key contribution of MAPPIS is in discovering the 3D patterns of physico-chemical interactions. The detected patterns describe the conserved binding organizations that involve energetically important hot spot residues and are crucial for the protein-protein associations.
Large genetic association studies based on hundreds of thousands of single-nucleotide polymorphisms (SNPs) are a popular option for the study of complex diseases. The evaluation of gene × gene interactions in such studies is a sensible method of capturing important genetic effects. The number of tests required to consider all pairs of SNPs, however, can lead to a computational burden, and efficient strategies to reduce the number of tests performed are desirable. In this study, we compare two-stage strategies for pairwise SNP interactions testing. Those approaches rely on the selection of SNPs based on the single-locus test results obtained at the first stage. In the simultaneous approach, SNPs that fall below the marginal significance thresholds (p = 0.05 and p = 0.1) in stage 1 are selected and tested for within-group pairwise interaction in stage 2. With the conditional approach, SNPs that reach Bonferroni-adjusted significance at the first stage are tested in pairwise combinations with all SNPs in the data set. We compared the performance of those strategies by using Replicate 1 of the simulated data set of the Genetic Analysis Workshop 15 Problem 3. Most interactions detected resulted from SNP pairs within 1000 kb of each other. The remaining were false positives involving SNPs with excessively strong marginal signals. Our results highlight the need to account for locus proximity in the evaluation of interaction effects and emphasize the importance of marginal signal strength in logistic regression-based interaction modeling. We found that modeling additive genetic effects alone was sufficient to capture underlying dominance interaction effects in the data.
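The reduction in testing burden that motivates the two-stage strategies above can be made concrete with a small counting sketch. The SNP counts used in the example are hypothetical, and the conditional count assumes that each stage-1 SNP is paired with every other SNP and that pairs among selected SNPs are counted once.

```python
from math import comb


def exhaustive_pairs(n_snps: int) -> int:
    """Number of tests when every pair of SNPs is tested."""
    return comb(n_snps, 2)


def simultaneous_pairs(n_selected: int) -> int:
    """Simultaneous approach: only pairs within the stage-1 selection."""
    return comb(n_selected, 2)


def conditional_pairs(n_selected: int, n_snps: int) -> int:
    """Conditional approach: each selected SNP against all other SNPs."""
    if n_selected > n_snps:
        raise ValueError("cannot select more SNPs than are available")
    within = comb(n_selected, 2)
    across = n_selected * (n_snps - n_selected)
    return within + across


if __name__ == "__main__":
    n = 500_000  # hypothetical genome-wide panel size
    print(exhaustive_pairs(n))
    print(simultaneous_pairs(25_000))  # e.g. 5% of SNPs pass stage 1
    print(conditional_pairs(10, n))    # e.g. 10 SNPs pass Bonferroni
```

The counts only describe computational cost; they say nothing about power, which depends on whether interacting SNPs carry marginal signals strong enough to pass stage 1.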
Multifactor Dimensionality Reduction (MDR) is a popular and successful data mining method developed to characterize and detect nonlinear complex gene-gene interactions (epistasis) that are associated with disease susceptibility. Because MDR uses a combinatorial search strategy to detect interaction, several filtration techniques have been developed to remove genes (SNPs) that have no interactive effects prior to analysis. However, the cutoff values implemented for these filtration methods are arbitrary, therefore different choices of cutoff values will lead to different selections of genes (SNPs).
We suggest incorporating a global test of p-values to filtration procedures to identify the optimal number of genes/SNPs for further MDR analysis and demonstrate this approach using a ReliefF filter technique. We compare the performance of different global testing procedures in this context, including the Kolmogorov-Smirnov test, the inverse chi-square test, the inverse normal test, the logit test, the Wilcoxon test and Tippett’s test. Additionally we demonstrate the approach on a real data application with a candidate gene study of drug response in Juvenile Idiopathic Arthritis.
Extensive simulations of correlated p-values show that the inverse chi-square test is the most appropriate approach to incorporate into the screening step to determine the optimal number of SNPs for the final MDR analysis. The Kolmogorov-Smirnov test has highly inflated Type I error when p-values are highly correlated or when p-values peak near the center of the histogram. Tippett’s test has very low power when the effect size of GxG interactions is small.
The proposed global tests can serve as a screening approach prior to individual tests to prevent false discovery. Strong power in small sample sizes and well-controlled Type I error in the absence of GxG interactions make global tests highly recommended in epistasis studies.
P-value; Global tests; ReliefF; Multifactor dimensionality reduction
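Two of the global tests compared above can be sketched in a few lines. The sketch takes the inverse chi-square test to be Fisher's combined probability test and assumes independent p-values, which is a simplification: the abstract studies correlated p-values. Function names are hypothetical.

```python
import math


def _check(pvalues):
    if not pvalues or any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must be non-empty and lie in (0, 1]")


def fisher_inverse_chi_square(pvalues):
    """Return (statistic, combined p-value) for Fisher's method.

    Statistic: X = -2 * sum(ln p_i), chi-square with 2k degrees of
    freedom under the global null with independent p-values.
    """
    _check(pvalues)
    k = len(pvalues)
    half = -sum(math.log(p) for p in pvalues)  # X / 2
    # Chi-square survival function for even df 2k has a closed form:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    series = 1.0
    for i in range(1, k):
        term *= half / i
        series += term
    return 2.0 * half, math.exp(-half) * series


def tippett(pvalues):
    """Combined p-value from the minimum p-value (Tippett's test)."""
    _check(pvalues)
    return 1.0 - (1.0 - min(pvalues)) ** len(pvalues)


if __name__ == "__main__":
    ps = [0.04, 0.20, 0.01, 0.50]  # hypothetical filter p-values
    print(fisher_inverse_chi_square(ps))
    print(tippett(ps))
```

Fisher's method accumulates evidence across all p-values, whereas Tippett's test reacts only to the smallest one, which is consistent with the low power reported above for small interaction effects.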
Gene-gene interaction is believed to play an important role in understanding complex traits. Multifactor dimensionality reduction (MDR) was proposed by Ritchie et al. to identify multiple loci that simultaneously affect disease susceptibility. Although the MDR method has been widely used to detect gene-gene interactions, few studies have been reported on MDR analysis when there are missing data. Currently, there are four approaches available in MDR analysis to handle missing data. The first approach uses only complete observations that have no missing data, which can cause a severe loss of data. The second approach is to treat missing values as an additional genotype category, but interpretation of the results may then be unclear and the conclusions may be misleading. Furthermore, it performs poorly when the missing rates are unbalanced between the case and control groups. The third approach is a simple imputation method that imputes missing genotypes as the most frequent genotype, which also may produce biased results. The fourth approach, Available, uses all data available for the given loci to increase power. In any real data analysis, it is not clear which MDR approach one should use when there are missing data. In this paper, we consider a new EM Impute approach to handle missing data more appropriately. Through simulation studies, we compared the performance of the proposed EM Impute approach with the current approaches. Our results showed that the Available and EM Impute approaches perform better than the three other current approaches in terms of power and precision.
Gene-gene interaction; Multifactor Dimensionality Reduction; Missing genotypes; Association study
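The first three missing-data approaches described above are simple enough to sketch directly. The encoding is an assumption made for illustration: genotypes are coded 0, 1 or 2, a missing call is represented by None, and the extra category is coded 3. The EM Impute approach of the abstract is not reproduced here.

```python
from collections import Counter

MISSING = None  # assumed marker for a missing genotype call


def complete_cases(rows):
    """Approach 1: keep only individuals with no missing genotype."""
    return [row for row in rows if all(g is not MISSING for g in row)]


def missing_as_category(genotypes, code=3):
    """Approach 2: treat a missing call as an additional genotype level."""
    return [code if g is MISSING else g for g in genotypes]


def impute_most_frequent(genotypes):
    """Approach 3: replace missing calls at one SNP by the modal genotype."""
    observed = [g for g in genotypes if g is not MISSING]
    if not observed:
        raise ValueError("no observed genotypes to impute from")
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if g is MISSING else g for g in genotypes]


if __name__ == "__main__":
    snp = [0, 1, None, 1, 2]  # one SNP across five individuals
    print(impute_most_frequent(snp))
    print(missing_as_category(snp))
    print(complete_cases([[0, 1], [None, 2], [2, 2]]))
```

Complete-case filtering acts on individuals (rows), while the other two functions act on one SNP (column) at a time, which is why the loss of data under the first approach grows quickly with the number of loci.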
The detection of gene-gene interaction is an important approach to understand the etiology of rheumatoid arthritis (RA). The goal of this study is to identify gene-gene interaction of SNPs at the allelic level contributing to RA using real data sets (Problem 1) of North American Rheumatoid Arthritis Consortium (NARAC) provided by Genetic Analysis Workshop 16 (GAW16). We applied our novel method that can detect the interaction by a definition of nonrandom association of alleles that occurs when the contribution to RA of a particular allele inherited in one gene depends on a particular allele inherited at other unlinked genes. Starting with 639 single-nucleotide polymorphisms (SNPs) from 26 candidate genes, we identified ten two-way interacting genes and one case of three-way interacting genes. SNP rs2476601 on PTPN22 interacts with rs2306772 on SLC22A4, which interacts with rs881372 on TRAF1 and rs2900180 on C5, respectively. SNP rs2900180 on C5 interacts with rs2242720 on RUNX1, which interacts with rs881375 on TRAF1. Furthermore, rs2476601 on PTPN22 also interacts with three SNPs (rs2905325, rs1476482, and rs2106549) in linkage disequilibrium (LD) on IL6. The other three SNPs (rs2961280, rs2961283, and rs2905308) in LD on IL6 interact with two SNPs (rs477515 and rs2516049) on HLA-DRB1. SNPs rs660895 and rs532098 on HLA-DRB1 interact with rs2834779 and four SNPs in LD on RUNX1. Three-way interacting genes of rs10229203 on IL6, rs4816502 on RUNX1, and rs10818500 on C5 were also detected.
Genotype imputation methods have become increasingly popular for recovering untyped genotype data. An important application of imputed genotypes is testing genetic association with disease. Imputation-based association tests can provide additional insight beyond what is provided by testing on typed tagging SNPs only. A variety of effective imputation-based association tests have been proposed. However, their performance is affected by a variety of genetic factors, which have not been well studied. In this study, using both simulated and real data sets, we investigated the effects of LD, the MAF of the untyped causal SNP and the imputation accuracy rate on the performance of seven popular imputation-based association methods: MACH2qtl/dat, SNPTEST, ProbABEL, Beagle, Plink, BIMBAM and SNPMStat. We also aimed to provide a comprehensive comparison among methods. The results show that (1) imputation-based association tests can boost signals and improve power under medium and high LD levels, with the power improvement increasing as LD strengthens; (2) power increases with higher MAF of untyped causal SNPs under medium to high LD levels; (3) under low LD, a high imputation accuracy rate cannot guarantee an improvement in power; and (4) among methods, MACH2qtl/dat, ProbABEL and SNPTEST perform similarly and consistently outperform the other methods. Our results are helpful in guiding the choice of imputation-based association test in practical applications.
In genetic studies of complex disease a consideration for the investigator is detection of joint effects. The Multifactor Dimensionality Reduction (MDR) algorithm searches for these effects with an exhaustive approach. Previously unknown aspects of MDR performance were the power to detect interactive effects given large numbers of non-model loci or varying degrees of heterogeneity among multiple epistatic disease models.
To address the performance with many non-model loci, datasets of 500 cases and 500 controls with 100 to 10,000 SNPs were simulated for two-locus models, and one hundred 500-case/500-control datasets with 100 and 500 SNPs were simulated for three-locus models. Multiple levels of locus heterogeneity were simulated in several sample sizes.
These results show MDR is robust to locus heterogeneity when the definition of power is not as conservative as in previous simulation studies where all model loci were required to be found by the method. The results also indicate that MDR performance is related more strongly to broad-sense heritability than sample size and is not greatly affected by non-model loci.
Sampling a population with high heritability estimates predisposes an MDR study to success more than ascertaining a larger sample from a population with smaller estimates.
Epistasis; MDR; Heterogeneity
Identifying variants associated with complex human traits in high-dimensional data is a central goal of genome-wide association studies. However, complicated etiologies such as gene-gene interactions are ignored by the univariate analysis usually applied in these studies. Random Forests (RF) are a popular data-mining technique that can accommodate a large number of predictor variables and allow for complex models with interactions. RF analysis produces measures of variable importance that can be used to rank the predictor variables. Thus, single nucleotide polymorphism (SNP) analysis using RFs is gaining popularity as a potential filter approach that considers interactions in high-dimensional data. However, the impact of data dimensionality on the power of RF to identify interactions has not been thoroughly explored. We investigate the ability of rankings from variable importance measures to detect gene-gene interaction effects and their potential effectiveness as filters compared to p-values from univariate logistic regression, particularly as the data becomes increasingly high-dimensional.
RF effectively identifies interactions in low dimensional data. As the total number of predictor variables increases, probability of detection declines more rapidly for interacting SNPs than for non-interacting SNPs, indicating that in high-dimensional data the RF variable importance measures are capturing marginal effects rather than capturing the effects of interactions.
While RF remains a promising data-mining technique that extends univariate methods to condition on multiple variables simultaneously, RF variable importance measures fail to detect interaction effects in high-dimensional data in the absence of a strong marginal component, and therefore may not be useful as a filter technique that allows for interaction effects in genome-wide data.
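To make the notion of an interaction without a marginal component concrete, the sketch below uses a hypothetical two-locus penetrance table (both loci at allele frequency 0.5, in Hardy-Weinberg equilibrium, and independent of each other) and computes the marginal penetrance of each single-locus genotype. The table and the names are illustrative assumptions and are not taken from the study.

```python
# Genotype frequencies under Hardy-Weinberg equilibrium with allele frequency 0.5
GENO_FREQ = [0.25, 0.5, 0.25]  # 0, 1, or 2 copies of the minor allele

# Hypothetical penetrance table: rows = genotype at locus A, columns = locus B
PENETRANCE = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]

def marginal_penetrance_a(penetrance, geno_freq):
    """P(disease | genotype at A), averaging over genotypes at locus B."""
    return [sum(f * p for f, p in zip(geno_freq, row)) for row in penetrance]

def marginal_penetrance_b(penetrance, geno_freq):
    """P(disease | genotype at B), averaging over genotypes at locus A."""
    columns = list(zip(*penetrance))
    return [sum(f * p for f, p in zip(geno_freq, col)) for col in columns]

marg_a = marginal_penetrance_a(PENETRANCE, GENO_FREQ)
marg_b = marginal_penetrance_b(PENETRANCE, GENO_FREQ)
```

Every single-locus genotype has the same disease risk in this table, so a ranking driven by marginal effects has nothing to detect at either locus, even though the joint genotype changes the risk.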
There is growing evidence that gene-gene interactions are ubiquitous in determining the susceptibility to common human diseases. The investigation of such gene-gene interactions presents new statistical challenges for studies with relatively small sample sizes as the number of potential interactions in the genome can be large. Breast cancer provides a useful paradigm to study genetically complex diseases because commonly occurring single nucleotide polymorphisms (SNPs) may additively or synergistically disturb the system-wide communication of the cellular processes leading to cancer development.
In this study, we systematically studied SNP-SNP interactions among 19 SNPs from 18 key genes involved in major cancer pathways in a sample of 398 breast cancer cases and 372 controls from Ontario. We discuss the methodological issues associated with the detection of SNP-SNP interactions in this dataset by applying and comparing three commonly used methods: the logistic regression model, classification and regression trees (CART), and the multifactor dimensionality reduction (MDR) method.
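For orientation, the sketch below counts the pairwise tests implied by 19 SNPs and shows what the interaction term of a saturated logistic model measures for two binary carrier indicators: the log of the joint odds ratio divided by the product of the two single-SNP odds ratios. The case and control counts are hypothetical and the function names are illustrative, not the study's code.

```python
from math import comb, exp, log

n_pairs = comb(19, 2)  # number of distinct SNP pairs among 19 SNPs

def odds(cases, controls):
    return cases / controls

def interaction_log_or(counts):
    """Interaction coefficient of a saturated logistic model.

    counts[(a, b)] = (cases, controls) for carrier status a, b in {0, 1}.
    """
    ref = odds(*counts[(0, 0)])
    or_a = odds(*counts[(1, 0)]) / ref   # SNP A alone vs. reference
    or_b = odds(*counts[(0, 1)]) / ref   # SNP B alone vs. reference
    or_ab = odds(*counts[(1, 1)]) / ref  # both vs. reference
    return log(or_ab / (or_a * or_b))

# Hypothetical case/control counts by carrier status at two SNPs
counts = {
    (0, 0): (100, 100),
    (1, 0): (60, 50),
    (0, 1): (55, 50),
    (1, 1): (60, 20),
}
b3 = interaction_log_or(counts)
```

A value of `b3` near zero means the joint odds ratio is close to the product of the single-SNP odds ratios, so there is no departure from a multiplicative model. Positive values indicate a synergistic departure.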
Our analyses show evidence for several simple (two-way) and complex (multi-way) SNP-SNP interactions associated with breast cancer. For example, all three methods identified XPD-[Lys751Gln]*IL10-[G(-1082)A] as the most significant two-way interaction. CART and MDR identified the same critical SNPs participating in complex interactions. Our results suggest that using multiple statistical approaches (or an integrated approach) rather than a single methodology could be the best strategy for elucidating complex gene interactions, which generally have very different patterns.
The strategy used here has the potential to identify complex biological relationships among breast cancer genes and processes. This will lead to the discovery of novel biological information, which will improve breast cancer risk management.