Although genome-wide association studies (GWAS) have made great progress, the significant SNP associations they identify account for only a few percent of the genetic variance, leading many to ask where and how the missing heritability can be found.
It is expected that genome-wide interaction analysis can be a possible source of the heritability unexplained by current GWAS. However, existing statistics for testing interaction have low power for genome-wide interaction analysis. To meet the challenges raised by genome-wide interaction analysis, we develop a novel statistic for testing interaction between two loci (either linked or unlinked) and validate the null distribution and type I error rates of the new statistic through simulations. Through extensive power studies we show that the novel statistic has much higher power to detect interaction than classical logistic regression. To provide evidence of gene–gene interactions as a possible source of the missing heritability unexplained by current GWAS, we performed genome-wide interaction analysis of psoriasis in two independent studies. The preliminary results identified 44 and 211 pairs of SNPs showing significant evidence of interaction with FDR<0.001 and 0.001
There is a growing awareness that interactions between multiple genes play an important role in the risk of common, complex, multi-factorial diseases. Many common diseases are affected by certain genotype combinations (associated with some genes and their interactions). The identification and characterization of these susceptibility genes and gene-gene interactions have been limited by small sample sizes and the large number of potential interactions between genes. Several methods have been proposed to detect gene-gene interaction in a case-control study. Penalized logistic regression (PLR), a variant of logistic regression with L2 regularization, is a parametric approach to detecting gene-gene interaction. On the other hand, Multifactor Dimensionality Reduction (MDR) is a nonparametric, genetic model-free approach to detecting genotype combinations associated with disease risk.
We compared the power of MDR and PLR for detecting two-way and three-way interactions in a case-control study through extensive simulations. We generated several interaction models with different magnitudes of interaction effect. For each model, we simulated 100 datasets, each with 200 cases, 200 controls, and 20 SNPs. We considered a wide variety of models, including models with only main effects, models with only interaction effects, and models with both main and interaction effects. We also compared the performance of MDR and PLR in detecting gene-gene interaction associated with acute rejection (AR) in kidney transplant patients.
In this paper, we studied the power of MDR and PLR for detecting gene-gene interaction in a case-control study through extensive simulation. We compared their performance for different two-way and three-way interaction models, studied the effect of different allele frequencies on these methods, and evaluated their performance on a real dataset. As expected, neither method was consistently better across all data scenarios, but MDR generally outperformed PLR for more complex models. ROC analysis suggests that MDR also outperforms PLR in detecting gene-gene interaction on the real dataset.
As one might expect, the relative success of each method is context dependent. This study demonstrates the strengths and weaknesses of the methods to detect gene-gene interaction.
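The penalized logistic regression approach described above can be sketched in a few lines. The fitting routine, toy data, and effect sizes below are illustrative assumptions, not taken from the study:

```python
import numpy as np

def plr_fit(X, y, lam=1.0, iters=25):
    """Ridge (L2) penalized logistic regression fitted by Newton's method.
    The intercept is left unpenalized."""
    n, p = X.shape
    Xb = np.column_stack([np.ones(n), X])
    w = np.zeros(p + 1)
    pen = np.full(p + 1, lam)
    pen[0] = 0.0                                   # do not penalize the intercept
    for _ in range(iters):
        mu = 1 / (1 + np.exp(-Xb @ w))
        grad = Xb.T @ (mu - y) + pen * w
        hess = (Xb * (mu * (1 - mu))[:, None]).T @ Xb + np.diag(pen)
        w -= np.linalg.solve(hess, grad)
    return w

# Toy purely epistatic model: risk rises with the product of the two genotypes.
rng = np.random.default_rng(0)
n = 2000
snp1 = rng.integers(0, 3, n)                       # genotypes coded 0/1/2
snp2 = rng.integers(0, 3, n)
logit = -1.0 + 0.6 * snp1 * snp2
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(float)

X = np.column_stack([snp1, snp2, snp1 * snp2])     # main effects + product term
w = plr_fit(X, y)                                  # w[3] estimates the interaction
```

Including the product column alongside the main effects is what lets the penalized fit separate a genuine interaction from two marginal effects.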
Identifying gene-gene interactions or gene-environment interactions in studies of human complex diseases remains a big challenge in genetic epidemiology. An additional, often forgotten, challenge is to account for important lower-order genetic effects, which may hamper the identification of genuine epistasis. If lower-order genetic effects contribute to the genetic variance of a trait, identified statistical interactions may simply be due to a signal boost of these effects. In this study, we restrict attention to quantitative traits and bi-allelic SNPs as genetic markers, and our interaction study focuses on 2-way SNP-SNP interactions. Via simulations, we assess the performance of different corrective measures for lower-order genetic effects in Model-Based Multifactor Dimensionality Reduction epistasis detection, using additive and co-dominant coding schemes. Performance is evaluated in terms of power and familywise error rate. Our simulations indicate that empirical power estimates are reduced when lower-order effects are corrected for, as are familywise error rates. Easy-to-use automatic SNP selection procedures, SNP selection based on “top” findings, or SNP selection based on a p-value criterion for interesting main effects result in reduced power but also almost zero false positive rates. Always accounting for main effects in the SNP-SNP pair under investigation during Model-Based Multifactor Dimensionality Reduction analysis adequately controls false positive epistasis findings, particularly when adopting a co-dominant corrective coding scheme. In conclusion, automatic search procedures to identify lower-order effects to correct for during epistasis screening should be avoided, as should procedures that adjust for lower-order effects prior to Model-Based Multifactor Dimensionality Reduction analysis and use the residuals as the new trait.
We advocate “on-the-fly” adjustment for lower-order effects when screening for SNP-SNP interactions with Model-Based Multifactor Dimensionality Reduction analysis.
Gene-gene interactions may play an important role in the genetics of a complex disease. Detection and characterization of gene-gene interactions is a challenging issue that has stimulated the development of various statistical methods to address it. In this study, we introduce a method to measure gene interactions using entropy-based statistics from a contingency table of trait and genotype combinations. We also develop a graph-based exploration procedure. We propose a standardized relative information gain (RIG) measure to evaluate the interactions between single nucleotide polymorphism (SNP) combinations. To identify kth-order interactions, contingency tables of trait and genotype combinations of k SNPs are constructed, with which RIGs are calculated. The RIGs are standardized using the mean and standard deviation from permuted datasets. SNP combinations yielding high standardized RIG are chosen as gene-gene interactions. Detection of high-order interactions and comparison of interaction strengths between different orders are made possible by using the standardized RIG. We have applied the proposed standardized entropy-based method to two types of data sets, from a simulation study and a real genetic association study. We have compared our method and the multifactor dimensionality reduction (MDR) method through power analysis of eight different genetic models with varying penetrance rates, numbers of SNPs, and sample sizes. Our method successfully identifies genetic associations and gene-gene interactions both in simulated and real genetic data. Simulation results suggest that the proposed entropy-based method is better able to detect high-order interactions and is superior to the MDR method in most cases. The proposed method is well suited for detecting interactions without main effects as well as for models including main effects.
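One plausible reading of the standardized RIG described above can be sketched as follows; the exact formula may differ from the paper's, and all names and toy data are illustrative:

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (bits) from a vector of counts."""
    p = counts[counts > 0] / counts.sum()
    return -(p * np.log2(p)).sum()

def relative_info_gain(combo, trait):
    """RIG = [H(trait) - H(trait | genotype combination)] / H(trait).
    One plausible reading of the measure described in the abstract."""
    h_t = entropy(np.bincount(trait))
    h_cond = sum((combo == g).mean() * entropy(np.bincount(trait[combo == g]))
                 for g in np.unique(combo))
    return (h_t - h_cond) / h_t

def standardized_rig(combo, trait, n_perm=200, seed=1):
    """Standardize the observed RIG by the mean and SD of RIGs computed
    on trait-permuted data sets, as the abstract describes."""
    rng = np.random.default_rng(seed)
    obs = relative_info_gain(combo, trait)
    null = [relative_info_gain(combo, rng.permutation(trait))
            for _ in range(n_perm)]
    return (obs - np.mean(null)) / np.std(null)

# Toy example: the trait depends on the joint genotype of two SNPs.
rng = np.random.default_rng(2)
n = 1000
snp1, snp2 = rng.integers(0, 3, n), rng.integers(0, 3, n)
combo = 3 * snp1 + snp2                    # 9 joint genotype classes
trait = (rng.random(n) < 0.2 + 0.5 * (snp1 * snp2 >= 2)).astype(int)
z = standardized_rig(combo, trait)         # large z flags an interacting pair
```

Because the null mean and SD come from permutations of the same data, standardized RIGs of different interaction orders become comparable on a common scale.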
Testing one SNP at a time does not fully realise the potential of genome-wide association studies to identify multiple causal variants, which is a plausible scenario for many complex diseases. We show that simultaneous analysis of the entire set of SNPs from a genome-wide study to identify the subset that best predicts disease outcome is now feasible, thanks to developments in stochastic search methods. We used a Bayesian-inspired penalised maximum likelihood approach in which every SNP can be considered for additive, dominant, and recessive contributions to disease risk. Posterior mode estimates were obtained for regression coefficients that were each assigned a prior with a sharp mode at zero. A non-zero coefficient estimate was interpreted as corresponding to a significant SNP. We investigated two prior distributions and show that the normal-exponential-gamma prior leads to improved SNP selection in comparison with single-SNP tests. We also derived an explicit approximation for type-I error that avoids the need to use permutation procedures. As well as genome-wide analyses, our method is well suited to fine mapping with very dense SNP sets obtained from re-sequencing and/or imputation. It can accommodate quantitative as well as case-control phenotypes and covariate adjustment, and can be extended to search for interactions. Here, we demonstrate the power and empirical type-I error of our approach using simulated case-control data sets of up to 500K SNPs, a real genome-wide data set of 300K SNPs, and a sequence-based data set, each of which can be analysed in a few hours on a desktop workstation.
Tests of association with disease status are normally conducted one SNP at a time, ignoring the effects of all other genotyped SNPs. We developed a computationally efficient method to simultaneously analyse all SNPs, either in a genome-wide association (GWA) study, or a fine-mapping study based on re-sequencing and/or imputation. The method selects a subset of SNPs that best predicts disease status, while controlling the type-I error of the selected SNPs. This brings many advantages over standard single-SNP approaches, because the signal from a particular SNP can be more clearly assessed when other SNPs associated with disease status are already included in the model. Thus, in comparison with single-SNP analyses, power is increased and the false positive rate is reduced because of reduced residual variation. Localisation is also greatly improved. We demonstrate these advantages over the widely used single-SNP Armitage Trend Test using GWA simulation studies, a real GWA dataset, and a sequence-based fine-mapping simulation study.
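As a rough illustration of how a sharp-at-zero penalty selects a sparse subset of SNPs, here is a plain L1 (lasso) coordinate descent. This is a stand-in, not the normal-exponential-gamma prior method the abstracts describe, and the toy data are invented:

```python
import numpy as np

def lasso_cd(X, y, lam, iters=100):
    """Coordinate-descent lasso on standardized columns; the soft-threshold
    update zeroes out coefficients whose partial correlation falls below lam,
    mimicking a prior with a sharp mode at zero."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(iters):
        for j in range(p):
            r = y - X @ w + X[:, j] * w[j]          # partial residual for SNP j
            rho = X[:, j] @ r / n
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0)
    return w

# Toy data: 20 SNPs, only the first affects the quantitative trait.
rng = np.random.default_rng(3)
n, p = 500, 20
G = rng.integers(0, 3, (n, p)).astype(float)
G = (G - G.mean(0)) / G.std(0)                      # standardize columns
y = 1.0 * G[:, 0] + rng.normal(0, 0.5, n)
w = lasso_cd(G, y, lam=0.1)                         # most coefficients are exactly zero
```

As in the simultaneous-analysis approach above, the signal SNP keeps a large coefficient while the unassociated SNPs are driven exactly to zero rather than merely shrunk.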
SNP genotyping arrays have been developed to characterize single-nucleotide polymorphisms (SNPs) and DNA copy number variations (CNVs). Nonparametric and model-based statistical algorithms have been developed to detect CNVs from SNP data using the marker intensities. However, these algorithms lack specificity to detect small CNVs owing to the high false positive rate when calling CNVs based on the intensity values. Therefore, the resulting association tests lack power even if the CNVs affecting disease risk are common. An alternative procedure called PennCNV uses information from both the marker intensities as well as the genotypes and therefore has increased sensitivity.
By using the hidden Markov model (HMM) implemented in PennCNV to derive the probabilities of different copy number states, which we subsequently used in a logistic regression model, we developed a new genome-wide algorithm to detect CNV associations with diseases. We compared this new method with the association test applied to the most probable copy number state for each individual, which PennCNV provides after an initial HMM analysis followed by application of the Viterbi algorithm, a step that removes information about copy number probabilities. In one of our simulation studies, we showed that for large CNVs (number of SNPs ≥ 10), the association tests based on PennCNV calls gave more significant results, but the new algorithm retained high power. For small CNVs (number of SNPs < 10), the logistic algorithm provided smaller average p-values (e.g., p = 7.54e-17 when the relative risk RR = 3.0) in all the scenarios and could capture signals that PennCNV did not (e.g., p = 0.020 when RR = 3.0). From a second set of simulations, we showed that the new algorithm is more powerful in detecting disease associations with small CNVs (number of SNPs ranging from 3 to 5) under different penetrance models (e.g., when RR = 3.0, for relatively weak signals, power = 0.8030 compared with 0.2879 obtained from the association tests based on PennCNV calls). The new method is implemented in the software GWCNV, freely available at http://gwcnv.sourceforge.net under a GPL license.
We conclude that the new algorithm is more sensitive and can be more powerful in detecting CNV associations with diseases than the existing HMM algorithm, especially when the CNV association signal is weak and a limited number of SNPs are located in the CNV.
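The core idea, carrying copy-number uncertainty into the association model rather than collapsing it to a hard call, can be illustrated as follows; the probabilities are invented for illustration:

```python
import numpy as np

# Posterior probabilities of copy-number states 0..4 for two individuals,
# as an HMM such as PennCNV's could produce (numbers are illustrative).
probs = np.array([
    [0.05, 0.60, 0.30, 0.04, 0.01],   # probably a deletion, but uncertain
    [0.00, 0.02, 0.95, 0.02, 0.01],   # confidently two copies
])
states = np.arange(5)

dosage = probs @ states               # expected copy number: keeps uncertainty
hard_call = states[probs.argmax(1)]   # Viterbi-style call: discards uncertainty
```

The continuous dosages would then enter a logistic regression of disease status as a covariate, whereas the hard calls throw away exactly the probability information that makes weak, small-CNV signals detectable.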
The risk of common diseases is likely determined by the complex interplay between environmental and genetic factors, including single nucleotide polymorphisms (SNPs). Traditional methods of data analysis are poorly suited for detecting complex interactions due to sparseness of data in high dimensions, which often occurs when data are available for a large number of SNPs in a relatively small number of samples. Validation of associations observed using multiple methods should be implemented to minimize the likelihood of false-positive associations. Moreover, high-throughput genotyping methods allow investigators to genotype thousands of SNPs at one time. Investigating associations for each individual SNP, or interactions between SNPs, using traditional approaches is inefficient and prone to false positives.
We developed the Polymorphism Interaction Analysis tool (PIA version 2.0) to include different approaches for ranking and scoring SNP combinations, to account for imbalances in case:control ratios, to stratify on particular factors, and to examine associations of user-defined pathways (based on SNPs or genes) with case status. PIA v. 2.0 detected 2-SNP interactions as the highest-ranking model 77% of the time, using simulated data sets of genetic models of interaction (minor allele frequency = 0.2; heritability = 0.01; N = 1600) generated previously [Velez DR, White BC, Motsinger AA, Bush WS, Ritchie MD, Williams SM, Moore JH: A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction. Genet Epidemiol 2007, 31:306–315.]. Interacting SNPs were detected in both balanced (20 SNPs) and imbalanced data (case:control 1:2 and 1:4, 10 SNPs) in the context of non-interacting SNPs.
PIA v. 2.0 is a useful tool for exploring gene-gene or gene-environment interactions and for identifying a small number of putative associations that may be investigated further using other statistical methods and in replication study populations.
Recently, Wu and colleagues proposed two novel statistics for genome-wide interaction analysis using case/control or case-only data. In computer simulations, their proposed case/control statistic outperformed competing approaches, including the fast-epistasis option in PLINK and logistic regression analysis under the correct model; however, the reasons for its superior performance were not fully explored. Here we investigate the theoretical properties and performance of Wu et al.'s proposed statistics and explain why, in some circumstances, they outperform competing approaches. Unfortunately, we find minor errors in the formulae for their statistics, resulting in tests that have higher than nominal type 1 error. We also find minor errors in PLINK's fast-epistasis and case-only statistics, although theory and simulations suggest that these errors have only a negligible effect on type 1 error. We propose adjusted versions of all four statistics that, both theoretically and in computer simulations, maintain correct type 1 error rates under the null hypothesis. We also investigate statistics based on correlation coefficients that maintain similar control of type 1 error. Although designed to test specifically for interaction, we show that some of these previously proposed statistics can, in fact, be sensitive to main effects at one or both loci, particularly in the presence of linkage disequilibrium. We propose two new “joint effects” statistics that, provided the disease is rare, are sensitive only to genuine interaction effects. In computer simulations we find, in most situations considered, that the highest power is achieved by analysis under the correct genetic model. Such an analysis is unachievable in practice, as we do not know this model. However, generally high power over a wide range of scenarios is exhibited by our joint effects and adjusted Wu statistics.
We recommend use of these alternative or adjusted statistics and urge caution when using Wu et al.'s originally-proposed statistics, on account of the inflated error rate that can result.
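A generic statistic of the family discussed above, the difference of between-locus allelic log odds ratios in cases versus controls, can be sketched as follows; this is neither Wu et al.'s statistic nor any specific adjusted version, and the counts are invented:

```python
import numpy as np

def log_or(table):
    """Log odds ratio and its variance from a 2x2 allele-count table,
    with a 0.5 continuity correction."""
    t = table + 0.5
    lor = np.log(t[0, 0] * t[1, 1] / (t[0, 1] * t[1, 0]))
    return lor, (1.0 / t).sum()

def epistasis_z(case_table, control_table):
    """Generic fast-epistasis-style z statistic: difference of between-locus
    allelic log odds ratios in cases versus controls."""
    l1, v1 = log_or(case_table)
    l0, v0 = log_or(control_table)
    return (l1 - l0) / np.sqrt(v1 + v0)

# Toy counts of two-locus allele combinations (rows: A/a, columns: B/b).
cases = np.array([[120.0, 30.0], [30.0, 120.0]])      # alleles co-occur in cases
controls = np.array([[75.0, 75.0], [75.0, 75.0]])     # independent in controls
z = epistasis_z(cases, controls)                       # large |z| suggests interaction
z_null = epistasis_z(controls, controls)               # identical tables give z = 0
```

Subtracting the control-sample log odds ratio is what distinguishes this case/control form from a pure case-only test, which would be confounded by linkage disequilibrium between the loci.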
Gene–gene interactions are a topic of great interest to geneticists carrying out studies of how genetic factors influence the development of common, complex diseases. Genes that interact may not only make important biological contributions to underlying disease processes, but also be more difficult to detect when using standard statistical methods in which we examine the effects of genetic factors one at a time. Recently a method was proposed by Wu and colleagues for detecting pairwise interactions when carrying out genome-wide association studies (in which a large number of genetic variants across the genome are examined). Wu and colleagues carried out theoretical work and computer simulations that suggested their method outperformed other previously proposed approaches for detecting interactions. Here we show that, in fact, the method proposed by Wu and colleagues can result in an excess of false positive findings. We propose an adjusted version of their method that reduces the false positive rate while maintaining high power. We also propose a new method for detecting pairs of genetic effects that shows similarly high power but has some conceptual advantages over both Wu's method and other previously proposed approaches.
Gene–gene interactions have an important role in complex human diseases. Detection of gene–gene interactions has long been a challenge due to their complexity. The standard approach of testing SNP–SNP interactions may be inadequate, as it does not model linkage disequilibrium (LD) among SNPs in each gene and may lose power due to the large number of comparisons. To improve power, we propose a principal component (PC)-based framework for gene-based interaction analysis. We analytically derive the optimal weight for both quantitative and binary traits based on pairwise LD information. We then use PCs to summarize the information in each gene and test for interactions between the PCs. We further extend this gene-based interaction analysis procedure to allow the use of imputation dosage scores obtained from a popular imputation software package, MACH, which incorporates multilocus LD information. To evaluate the performance of the gene-based interaction tests, we conducted extensive simulations under various settings. We demonstrate that gene-based interaction tests are more powerful than SNP-based tests when more than two variants interact with each other; moreover, tests that incorporate external LD information are generally more powerful than those that use genotyped markers only. We also apply the proposed gene-based interaction tests to a candidate gene study on high-density lipoprotein. As our method operates at the gene level, it can be applied in a genome-wide association setting and used as a screening tool to detect gene–gene interactions.
gene–gene interaction; linkage disequilibrium; imputation
Purely epistatic multi-locus interactions cannot generally be detected via single-locus analysis in case-control studies of complex diseases. Recently, many two-locus and multi-locus analysis techniques have shown promise for epistasis detection. However, exhaustive multi-locus analysis requires prohibitively large computational effort when problems involve large-scale or genome-wide data. Furthermore, there is no explicit proof that a combination of multiple two-locus analyses can lead to the correct identification of multi-locus interactions.
The proposed 2LOmb algorithm performs an omnibus permutation test on ensembles of two-locus analyses. The algorithm consists of four main steps: two-locus analysis, a permutation test, global p-value determination and a progressive search for the best ensemble. 2LOmb is benchmarked against an exhaustive two-locus analysis technique, a set association approach, a correlation-based feature selection (CFS) technique and a tuned ReliefF (TuRF) technique. The simulation results indicate that 2LOmb produces a low false-positive error. Moreover, 2LOmb has the best performance in terms of its ability to identify all causative single nucleotide polymorphisms (SNPs) and a low number of output SNPs in purely epistatic two-, three- and four-locus interaction problems. The interaction models constructed from the 2LOmb outputs via a multifactor dimensionality reduction (MDR) method are also included for confirmation of epistasis detection. 2LOmb is subsequently applied to a type 2 diabetes mellitus (T2D) data set, obtained as part of the UK genome-wide genetic epidemiology study by the Wellcome Trust Case Control Consortium (WTCCC). After a preliminary screen for SNPs that are located within or near 372 candidate genes and exhibit no marginal single-locus effects, the T2D data set is reduced to 7,065 SNPs from 370 genes. The 2LOmb search in the reduced T2D data reveals that four intronic SNPs in PGM1 (phosphoglucomutase 1), two intronic SNPs in LMX1A (LIM homeobox transcription factor 1, alpha), two intronic SNPs in PARK2 (Parkinson disease (autosomal recessive, juvenile) 2, parkin) and three intronic SNPs in GYS2 (glycogen synthase 2 (liver)) are associated with the disease. The 2LOmb result suggests that there is no interaction between each pair of the identified genes that can be described by purely epistatic two-locus interaction models.
Moreover, there are no interactions between these four genes that can be described by purely epistatic multi-locus interaction models with marginal two-locus effects. The findings provide an alternative explanation for the aetiology of T2D in a UK population.
An omnibus permutation test on ensembles of two-locus analyses can detect purely epistatic multi-locus interactions with marginal two-locus effects. The study also reveals that SNPs from large-scale or genome-wide case-control data which are discarded after single-locus analysis detects no association can still be useful for genetic epidemiology studies.
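The permutation step behind such an omnibus test can be sketched as follows; this is a generic illustration, not the published 2LOmb implementation, and the toy data are invented:

```python
import numpy as np

def two_locus_chi2(combo, trait):
    """Pearson chi-square of a 2-locus genotype combination (codes 0..8)
    against a binary trait."""
    tab = np.zeros((9, 2))
    np.add.at(tab, (combo, trait), 1)
    exp = tab.sum(1, keepdims=True) * tab.sum(0) / tab.sum()
    ok = exp > 0
    return (((tab - exp) ** 2)[ok] / exp[ok]).sum()

def perm_pvalue(stat_fn, combo, trait, n_perm=200, seed=4):
    """Global p-value by permuting the trait and recomputing the statistic;
    the add-one correction keeps the estimate strictly positive."""
    rng = np.random.default_rng(seed)
    obs = stat_fn(combo, trait)
    hits = sum(stat_fn(combo, rng.permutation(trait)) >= obs
               for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)

# Toy purely epistatic pair: risk is elevated only for one joint genotype.
rng = np.random.default_rng(5)
n = 600
snp1, snp2 = rng.integers(0, 3, n), rng.integers(0, 3, n)
combo = 3 * snp1 + snp2
trait = (rng.random(n) < 0.2 + 0.5 * ((snp1 == 2) & (snp2 == 2))).astype(int)
p = perm_pvalue(two_locus_chi2, combo, trait)
```

Permuting the trait rather than the genotypes preserves the LD structure among SNPs, so the null distribution reflects the actual marker correlations in the data.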
Genetic mutations may interact to increase the risk of human complex diseases. Mapping of multiple interacting disease loci in the human genome has recently shown promise in detecting genes with weak main effects. The power of interaction association mapping, however, can be greatly influenced by the set of single nucleotide polymorphisms (SNPs) genotyped in a case–control study. Previous imputation methods focus only on imputation of individual SNPs without considering the joint distribution of possible interactions. We present a new method that simultaneously detects multilocus interaction associations and imputes missing SNPs from a full Bayesian model. Our method treats both the case–control sample and the reference data as random observations. The output of our method is the posterior probabilities of SNPs for their marginal and interacting associations with the disease. Using simulations, we show that the method produces accurate and robust imputation with little overfitting. We further show that, with the type I error rate maintained at a common level, SNP imputation can consistently and sometimes substantially improve the power of detecting disease interaction associations. We use a data set on inflammatory bowel disease to demonstrate the application of our method.
Bayesian analysis; Case–control studies; Missing data
Multifactorial diseases such as cancer and cardiovascular diseases are caused by the complex interplay between genes and environment. The detection of these interactions remains challenging due to computational limitations. Information-theoretic approaches use computationally efficient directed search strategies and thus provide a feasible solution to this problem. However, the power of information-theoretic methods for interaction analysis has not been systematically evaluated. In this work, we compare the power and Type I error of an information-theoretic approach to those of existing interaction analysis methods.
The k-way interaction information (KWII) metric for identifying variable combinations involved in gene-gene interactions (GGI) was assessed using several simulated data sets under models of genetic heterogeneity driven by susceptibility-increasing loci with varying allele frequencies, penetrance values and heritability. The power and proportion of false positives of the KWII were compared to those of multifactor dimensionality reduction (MDR), the restricted partitioning method (RPM) and logistic regression.
The power of the KWII was considerably greater than that of MDR on all six simulation models examined. For a given disease prevalence at high values of heritability, the power of both RPM and KWII was greater than 95%. For models with low heritability and/or genetic heterogeneity, the power of the KWII was consistently greater than that of RPM; the improvements in power for the KWII over RPM ranged from 4.7% to 14.2% at α = 0.001 in the three models with the lowest heritability values examined. The KWII performed similarly to logistic regression.
Information-theoretic models are flexible and have excellent power to detect GGI under a variety of conditions that characterize complex diseases.
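For two SNPs and a phenotype, the KWII reduces to the three-way interaction information, which can be computed directly from joint entropies; the toy XOR example below is illustrative, not from the study:

```python
import numpy as np

def H(*cols):
    """Joint Shannon entropy (bits) of one or more discrete variables."""
    _, counts = np.unique(np.column_stack(cols), axis=0, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def kwii3(a, b, c):
    """Three-way k-way interaction information (e.g. two SNPs and a phenotype):
    KWII(A;B;C) = H(AB) + H(AC) + H(BC) - H(A) - H(B) - H(C) - H(ABC)."""
    return (H(a, b) + H(a, c) + H(b, c)
            - H(a) - H(b) - H(c) - H(a, b, c))

# XOR-style phenotype: neither SNP is informative alone, only jointly.
rng = np.random.default_rng(6)
n = 2000
a = rng.integers(0, 2, n)
b = rng.integers(0, 2, n)
synergy = kwii3(a, b, a ^ b)                # ≈ 1 bit: pure interaction, no main effects
noise = kwii3(a, b, rng.integers(0, 2, n))  # ≈ 0: no interaction
```

The XOR case is exactly the "interaction without main effects" setting mentioned above: each SNP alone carries zero information about the phenotype, yet the KWII is large.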
Gene-gene epistatic interactions likely play an important role in the genetic basis of many common diseases. Recently, machine-learning and data mining methods have been developed for learning epistatic relationships from data. A well-known combinatorial method that has been successfully applied for detecting epistasis is Multifactor Dimensionality Reduction (MDR). Jiang et al. created a combinatorial epistasis learning method called BNMBL to learn Bayesian network (BN) epistatic models. They compared BNMBL to MDR using simulated data sets. Each of these data sets was generated from a model that associates two SNPs with a disease and includes 18 unrelated SNPs. For each data set, BNMBL and MDR were used to score all 2-SNP models, and BNMBL learned significantly more correct models. In real data sets, we ordinarily do not know the number of SNPs that influence phenotype. BNMBL may not perform as well if we also score models containing more than two SNPs. Furthermore, a number of other BN scoring criteria have been developed, and they may detect epistatic interactions even better than BNMBL.
Although BNs are a promising tool for learning epistatic relationships from data, we cannot confidently use them in this domain until we determine which scoring criteria work best or even well when we try learning the correct model without knowledge of the number of SNPs in that model.
We evaluated the performance of 22 BN scoring criteria using 28,000 simulated data sets and a real Alzheimer's GWAS data set. Our results were surprising in that the Bayesian scoring criterion with large values of a hyperparameter called α performed best. This score performed better than other BN scoring criteria and MDR at recall using simulated data sets, at detecting the hardest-to-detect models using simulated data sets, and at substantiating previous results using the real Alzheimer's data set.
We conclude that representing epistatic interactions using BN models and scoring them using a BN scoring criterion holds promise for identifying epistatic genetic variants in data. In particular, the Bayesian scoring criterion with large values of a hyperparameter α appears more promising than a number of alternatives.
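A minimal version of a Bayesian scoring criterion with hyperparameter α, here a BDeu-style log marginal likelihood for a 2-SNP model, can be sketched as follows; the formulation and toy data are illustrative assumptions, not the study's exact criterion:

```python
import numpy as np
from math import lgamma

def bdeu_score(combo, trait, alpha=18.0, n_parent=9, r=2):
    """BDeu-style log marginal likelihood of a model in which a binary trait
    depends on a 2-SNP genotype combination (9 joint parent states).
    `alpha` is the equivalent-sample-size hyperparameter varied in the study."""
    a_j = alpha / n_parent
    a_jk = alpha / (n_parent * r)
    score = 0.0
    for j in range(n_parent):
        n_jk = np.array([np.sum((combo == j) & (trait == k)) for k in range(r)])
        score += lgamma(a_j) - lgamma(a_j + n_jk.sum())
        score += sum(lgamma(a_jk + m) - lgamma(a_jk) for m in n_jk)
    return score

# Toy data: the trait depends on SNPs 1 and 2, while SNPs 3 and 4 are noise.
rng = np.random.default_rng(7)
n = 800
g = rng.integers(0, 3, (n, 4))
trait = (rng.random(n) < 0.2 + 0.5 * (g[:, 0] * g[:, 1] >= 2)).astype(int)

causal = bdeu_score(3 * g[:, 0] + g[:, 1], trait)   # interacting pair
noise = bdeu_score(3 * g[:, 2] + g[:, 3], trait)    # unassociated pair
```

Ranking candidate SNP pairs by such a score is the scoring step the abstracts describe; larger α spreads the Dirichlet prior over more pseudo-counts per cell, which is the knob the study found to matter.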
Performing high throughput sequencing on samples pooled from different individuals is a strategy to characterize genetic variability at a small fraction of the cost required for individual sequencing. In certain circumstances some variability estimators have even lower variance than those obtained with individual sequencing. SNP calling and estimating the frequency of the minor allele from pooled samples, though, is a subtle exercise for at least three reasons. First, sequencing errors may have a much larger relevance than in individual SNP calling: while their impact in individual sequencing can be reduced by setting a restriction on a minimum number of reads per allele, this would have a strong and undesired effect in pools because it is unlikely that alleles at low frequency in the pool will be read many times. Second, the prior allele frequency for heterozygous sites in individuals is usually 0.5 (assuming one is not analyzing sequences coming from, e.g. cancer tissues), but this is not true in pools: in fact, under the standard neutral model, singletons (i.e. alleles of minimum frequency) are the most common class of variants because P(f) ∝ 1/f and they occur more often as the sample size increases. Third, an allele appearing only once in the reads from a pool does not necessarily correspond to a singleton in the set of individuals making up the pool, and vice versa, there can be more than one read – or, more likely, none – from a true singleton.
To improve upon existing theory and software packages, we have developed a Bayesian approach for minor allele frequency (MAF) computation and SNP calling in pools (and implemented it in a program called snape): the approach takes into account sequencing errors and allows users to choose different priors. We also set up a pipeline which can simulate the coalescence process giving rise to the SNPs, the pooling procedure and the sequencing. We used it to compare the performance of snape to that of other packages.
We present software that helps call SNPs in pooled samples: it has good power while retaining a low false discovery rate (FDR). The method also provides the posterior probability that a SNP is segregating and the full posterior distribution of f for every SNP. To test the behaviour of our software, we generated (through simulated coalescence) artificial genomes and computed the effect of a pooled sequencing protocol, followed by SNP calling. In this setting, snape has better power and FDR than the comparable packages samtools, PoPoolation and Varscan: for N = 50 chromosomes, snape has power ≈ 35% and FDR ≈ 2.5%. snape is available at http://code.google.com/p/snape-pooled/ (source code and precompiled binaries).
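A simplified sketch of the kind of posterior computation described above, assuming a binomial read model with a symmetric error rate and the P(f) ∝ 1/f prior, is shown below; it is not snape's exact likelihood, and all names are illustrative:

```python
import numpy as np
from math import comb

def pool_maf_posterior(n_alt, depth, n_chrom=50, err=0.01):
    """Posterior over the minor-allele count in a pool of n_chrom chromosomes,
    given n_alt alternate reads out of `depth` and a per-base error rate,
    under the neutral P(f) ∝ 1/f prior."""
    ks = np.arange(1, n_chrom)                       # segregating allele counts
    f = ks / n_chrom
    p_read = f * (1 - err) + (1 - f) * err           # miscalls flip alleles
    lik = np.array([comb(depth, n_alt)
                    * p ** n_alt * (1 - p) ** (depth - n_alt)
                    for p in p_read])
    post = lik / ks                                  # likelihood × (1/f) prior
    return ks, post / post.sum()

ks, post = pool_maf_posterior(n_alt=20, depth=100)   # 20% alternate reads
map_count = ks[post.argmax()]                        # MAP allele count in the pool
```

Because the full posterior over f is kept, one can report both a point estimate and the probability that the site is segregating, rather than forcing a hard SNP call from a read-count threshold.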
In case-control studies identifying disease susceptibility loci, it has been shown that interactions caused by multiple single nucleotide polymorphisms (SNPs) within a gene, as well as by SNPs at unlinked genes, play an important role in influencing the risk of a disease. A novel statistical approach is proposed to detect gene-gene interactions at the allelic level contributing to a disease trait. With a new allelic score inferred from the observed genotypes at two or more unlinked SNPs, we derive a score test from logistic regression and test for association of the allelic scores with a disease trait. Furthermore, F and likelihood ratio tests are derived from Cochran-Armitage regression. By testing for this association, the interaction can be assessed both when the SNP association can and cannot be detected as a main effect in a single-SNP approach. The analytical power and type I error rates over six two-way interaction models are investigated based on the non-centrality parameter approximation of the score test. Simulation studies demonstrate that (1) the power of the score test is asymptotically equivalent to that of the test statistics from the Cochran-Armitage method and (2) the allelic-based method provides higher power than two genotype-based methods.
Allelic test; Interaction effect; Score test; Cochran-Armitage method; Epistasis
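The score and Cochran–Armitage tests above are not given in closed form here. As a minimal illustration, the standard single-locus Cochran–Armitage trend statistic can be sketched using the identity X² = N·r², where r is the Pearson correlation between genotype score and case/control status (hypothetical counts; this is the textbook single-SNP form, not the authors' multi-locus allelic-score derivation):

```python
import numpy as np

def trend_test(case_counts, control_counts, scores=(0, 1, 2)):
    """Cochran-Armitage trend statistic via the identity X^2 = N * r^2,
    where r is the Pearson correlation between genotype score and
    case/control status over all N individuals."""
    g, y = [], []
    for w, r, s in zip(scores, case_counts, control_counts):
        g += [w] * (r + s)          # genotype scores for this class
        y += [1] * r + [0] * s      # 1 = case, 0 = control
    g, y = np.asarray(g, float), np.asarray(y, float)
    rho = np.corrcoef(g, y)[0, 1]
    return len(g) * rho ** 2        # ~ chi-square, 1 df, under H0

# hypothetical genotype counts (AA, Aa, aa) at one SNP
x2 = trend_test(case_counts=(50, 120, 80), control_counts=(90, 110, 50))
```

Under the null hypothesis of no trend, the statistic is compared against a chi-square distribution with one degree of freedom.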
To facilitate whole-genome association studies (WGAS), several high-density SNP genotyping arrays have been developed. Genetic coverage and statistical power are the primary benchmark metrics in evaluating the performance of SNP arrays. Ideally, such evaluations would be done on a SNP set and a cohort of individuals that are both independently sampled from the original SNPs and individuals used in developing the arrays. Without utilization of an independent test set, previous estimates of genetic coverage and statistical power may be subject to an overfitting bias. Additionally, the SNP arrays' statistical power in WGAS has not been systematically assessed on real traits. One robust setting for doing so is to evaluate statistical power on thousands of traits measured from a single set of individuals. In this study, 359 newly sampled Americans of European descent were genotyped using both Affymetrix 500K (Affx500K) and Illumina 650Y (Ilmn650K) SNP arrays. From these data, we were able to obtain estimates of genetic coverage, which are robust to overfitting, by constructing an independent test set from among these genotypes and individuals. Furthermore, we collected liver tissue RNA from the participants and profiled these samples on a comprehensive gene expression microarray. The RNA levels were used as a large-scale set of quantitative traits to calibrate the relative statistical power of the commercial arrays. Our genetic coverage estimates are lower than previous reports, providing evidence that previous estimates may be inflated due to overfitting. The Ilmn650K platform showed reasonable power (50% or greater) to detect SNPs associated with quantitative traits when the signal-to-noise ratio (SNR) is greater than or equal to 0.5 and the causal SNP's minor allele frequency (MAF) is greater than or equal to 20% (N = 359). 
In testing each of the more than 40,000 gene expression traits for association with each of the SNPs on the Ilmn650K and Affx500K arrays, we found that the Ilmn650K yielded 15% more discoveries than the Affx500K at the same false discovery rate (FDR) level.
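Comparing discovery counts "at the same FDR level" is commonly done with the Benjamini–Hochberg step-up procedure; a minimal sketch with made-up p-values follows (the study's exact FDR machinery is not specified here, so BH is an assumption):

```python
import numpy as np

def bh_discoveries(pvals, fdr=0.05):
    """Benjamini-Hochberg step-up: number of rejections at a given FDR.
    Reject the k smallest p-values, where k is the largest index with
    p_(k) <= fdr * k / m."""
    p = np.sort(np.asarray(pvals, float))
    m = len(p)
    below = np.nonzero(p <= fdr * np.arange(1, m + 1) / m)[0]
    return 0 if below.size == 0 else int(below[-1]) + 1

# made-up p-values from one platform, tested at FDR = 0.05
k = bh_discoveries([1e-6, 1e-4, 0.003, 0.04, 0.2, 0.5], fdr=0.05)
```

The platform yielding the larger rejection count at a fixed FDR level would be deemed the more powerful one under this criterion.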
Advances in SNP genotyping array technologies have made whole-genome association studies (WGAS) a readily available approach. Genetic coverage and statistical power are two key properties to evaluate on the arrays. In this study, 359 newly sampled individuals were genotyped using Affymetrix 500K and Illumina 650Y SNP arrays. From these data, we obtained new estimates of genetic coverage by constructing a test set from among these genotypes and individuals that is independent of the SNPs and individuals used to construct the arrays. These estimates are notably smaller than previous ones, which we argue is due to an overfitting bias in previous studies. We also collected liver tissue RNA from the participants and profiled these samples on a comprehensive gene expression microarray. The RNA levels were used as a large-scale set of quantitative traits to calibrate the relative statistical power of the commercial arrays. Through this dataset and simulations, we find that the SNP arrays provide adequate power to detect quantitative trait loci when the causal SNP's minor allele frequency is greater than 20%, but low power when it is less than 10%. Importantly, we provide evidence that sample size has a greater impact on the power of WGAS than SNP density or genetic coverage.
Genome-wide association studies (GWAS) aim to identify genetic variants related to diseases by examining the associations between phenotypes and hundreds of thousands of genotyped markers. Because many genes are potentially involved in common diseases and a large number of markers are analyzed, it is crucial to devise an effective strategy to identify truly associated variants that have individual and/or interactive effects, while controlling false positives at the desired level. Although a number of model selection methods have been proposed in the literature, including marginal search, exhaustive search, and forward search, their relative performance has only been evaluated through limited simulations due to the lack of an analytical approach to calculating the power of these methods. This article develops a novel statistical approach for power calculation, derives accurate formulas for the power of different model selection strategies, and then uses the formulas to evaluate and compare these strategies in genetic model spaces. In contrast to previous studies, our theoretical framework allows for random genotypes, correlations among test statistics, and a false-positive control based on GWAS practice. After the accuracy of our analytical results is validated through simulations, they are utilized to systematically evaluate and compare the performance of these strategies in a wide class of genetic models. For a specific genetic model, our results clearly reveal how different factors, such as effect size, allele frequency, and interaction, jointly affect the statistical power of each strategy. An example is provided for the application of our approach to empirical research. The statistical approach used in our derivations is general and can be employed to address the model selection problems in other random predictor settings. 
We have developed an R package markerSearchPower to implement our formulas, which can be downloaded from the Comprehensive R Archive Network (CRAN) or http://bioinformatics.med.yale.edu/group/.
Almost all published genome-wide association studies are based on single-marker analysis. Intuitively, joint consideration of multiple markers should be more informative when multiple genes and their interactions are involved in disease etiology. For example, an exhaustive search among models involving multiple markers and their interactions can identify certain gene–gene interactions that will be missed by single-marker analysis. However, an exhaustive search is difficult, or even impossible, to perform because of the computational requirements. Moreover, searching more models does not necessarily increase statistical power, because there may be an increased chance of finding false positive results when more models are explored. For power comparisons of different model selection methods, the published studies have relied on limited simulations due to the highly computationally intensive nature of such simulation studies. To enable researchers to compare different model search strategies without resorting to extensive simulations, we develop a novel analytical approach to evaluating the statistical power of these methods. Our results offer insights into how different parameters in a genetic model affect the statistical power of a given model selection strategy. We developed an R package to implement our results. This package can be used by researchers to compare and select an effective approach to detecting SNPs.
Detecting gene–gene interactions or epistasis in studies of human complex diseases is a major challenge in epidemiology. To address this problem, several methods have been developed, mainly in the context of data dimensionality reduction. One of these methods, Model-Based Multifactor Dimensionality Reduction, has so far mainly been applied to case–control studies. In this study, we evaluate the power of Model-Based Multifactor Dimensionality Reduction for quantitative traits to detect gene–gene interactions (epistasis) in the presence of error-free and noisy data. Considered sources of error are genotyping errors, missing genotypes, phenotypic mixtures and genetic heterogeneity. Our simulation study encompasses a variety of settings with varying minor allele frequencies and genetic variances for different epistasis models. On each simulated dataset, we have performed Model-Based Multifactor Dimensionality Reduction in two ways: with and without adjustment for main effects of (known) functional SNPs. In line with their binary-trait counterparts, our simulations show that the power is lowest in the presence of phenotypic mixtures or genetic heterogeneity compared to scenarios with missing genotypes or genotyping errors. In addition, empirical power estimates reduce even further with main-effects corrections, but at the same time, false-positive percentages are reduced as well. In conclusion, phenotypic mixtures and genetic heterogeneity remain challenging for epistasis detection, and careful thought must be given to the way important lower-order effects are accounted for in the analysis.
Model-Based Multifactor Dimensionality Reduction; gene–gene interactions; quantitative traits; complex diseases; noisy data
One goal of personal genomics is to use information about genomic variation to predict who is at risk for various common diseases. Technological advances in genotyping have spawned several personal genetic testing services that market genotyping services directly to the consumer. An important goal of consumer genetic testing is to provide health information along with the genotyping results. This has the potential to integrate detailed personal genetic and genomic information into healthcare decision making. Despite the potential importance of these advances, there are some important limitations. One concern is that much of the literature that is used to formulate personal genetics reports is based on genetic association studies that consider each genetic variant independently of the others. It is our working hypothesis that the true value of personal genomics will only be realized when the complexity of the genotype-to-phenotype mapping relationship is embraced, rather than ignored. We focus here on complexity in genetic architecture due to epistasis or nonlinear gene-gene interaction. We have previously developed a multifactor dimensionality reduction (MDR) algorithm and software package for detecting nonlinear interactions in genetic association studies. In most prior MDR analyses, the permutation testing strategy used to assess statistical significance was unable to differentiate MDR models that captured only interaction effects from those that also detected independent main effects. Statistical interpretation of MDR models required post-hoc analysis using entropy-based measures of interaction information. We introduce here a novel permutation test that allows the effects of nonlinear interactions between multiple genetic variants to be specifically tested in a manner that is not confounded by linear additive effects. 
We show, using data simulated across 35 different epistasis models with varying effect sizes (heritabilities = 0.01, 0.025, 0.05, 0.1, 0.2, 0.3, 0.4) and sample sizes (n = 400, 800, 1600), that the power to detect interactions using the explicit test of epistasis is no different from that of a standard permutation test. We also show that the test has the appropriate size or type I error rate of approximately 0.05. We then apply MDR with the new explicit test of epistasis to a large genetic study of bladder cancer (n = 914) and show that a previously reported nonlinear interaction between two XPD gene polymorphisms is indeed significant (P = 0.005), even after considering the strong additive effect of smoking in the model. Finally, we evaluated the power of the explicit test of epistasis to detect the nonlinear interaction between two XPD gene polymorphisms by simulating data from the MDR model of bladder cancer susceptibility. We show that the power to detect the interaction alone was 1.00, while the power to detect the independent effect of smoking alone was 0.06, which is close to the expected type I error rate of 0.05. Importantly, the power to detect the interaction with smoking in the model was 0.94. The results of this study provide, for the first time, a simple method for explicitly testing epistasis or gene-gene interaction effects in genetic association studies. An important advantage of the method is that it can be combined with any modeling approach. The explicit test of epistasis brings us a step closer to the type of routine gene-gene interaction analysis that is needed if we are to enable personal genomics.
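The permutation-testing logic above can be sketched generically (hypothetical data; note that the explicit test of epistasis described in the abstract is more refined, in that its permutation scheme preserves marginal single-locus effects, which the plain label shuffle below does not):

```python
import numpy as np

rng = np.random.default_rng(0)

def perm_pvalue(stat_fn, geno, pheno, n_perm=999):
    """Empirical p-value: shuffle phenotype labels to build the null
    distribution of an association statistic, then count how often the
    permuted statistic matches or exceeds the observed one."""
    observed = stat_fn(geno, pheno)
    null = [stat_fn(geno, rng.permutation(pheno)) for _ in range(n_perm)]
    return (1 + sum(s >= observed for s in null)) / (n_perm + 1)

def abs_corr(g, y):
    return abs(np.corrcoef(g, y)[0, 1])

# hypothetical data: genotype weakly associated with a binary phenotype
g = rng.integers(0, 3, size=200)
y = (g + rng.normal(0, 2, size=200) > 1).astype(int)
p = perm_pvalue(abs_corr, g, y)
```

The "+1" in numerator and denominator is the usual correction that keeps the empirical p-value strictly positive.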
Genome-wide association studies are revolutionizing the search for the genes underlying human complex diseases. The main decisions to be made at the design stage of these studies are the choice of the commercial genotyping chip to be used and the numbers of case and control samples to be genotyped. The most common method of comparing different chips is using a measure of coverage, but this fails to properly account for the effects of sample size, the genetic model of the disease, and linkage disequilibrium between SNPs. In this paper, we argue that the statistical power to detect a causative variant should be the major criterion in study design. Because of the complicated pattern of linkage disequilibrium (LD) in the human genome, power cannot be calculated analytically and must instead be assessed by simulation. We describe in detail a method of simulating case-control samples at a set of linked SNPs that replicates the patterns of LD in human populations, and we used it to assess power for a comprehensive set of available genotyping chips. Our results allow us to compare the performance of the chips to detect variants with different effect sizes and allele frequencies, look at how power changes with sample size in different populations or when using multi-marker tags and genotype imputation approaches, and how performance compares to a hypothetical chip that contains every SNP in HapMap. A main conclusion of this study is that marked differences in genome coverage may not translate into appreciable differences in power and that, when taking budgetary considerations into account, the most powerful design may not always correspond to the chip with the highest coverage. We also show that genotype imputation can be used to boost the power of many chips up to the level obtained from a hypothetical “complete” chip containing all the SNPs in HapMap. 
Our results have been encapsulated into an R software package that allows users to design future association studies and our methods provide a framework with which new chip sets can be evaluated.
Genome-wide association studies are a powerful and now widely-used method for finding genetic variants that increase the risk of developing particular diseases. These studies are complex and must be planned carefully in order to maximize the probability of finding novel associations. The main design choices to be made relate to sample sizes and choice of commercially available genotyping chip and are often constrained by cost, which can currently be as much as several million dollars. No comprehensive comparisons of chips based on their power for different sample sizes or for fixed study cost are currently available. We describe in detail a method for simulating large genome-wide association samples that accounts for the complex correlations between SNPs due to LD, and we used this method to assess the power of current genotyping chips. Our results highlight the differences between the chips under a range of plausible scenarios, and we demonstrate how our results can be used to design a study with a budget constraint. We also show how genotype imputation can be used to boost the power of each chip and that this method decreases the differences between the chips. Our simulation method and software for comparing power are being made available so that future association studies can be designed in a principled fashion.
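Assessing power by simulation, as described above, can be sketched for the simplest single-SNP case (hypothetical parameters; a multiplicative per-allele risk model and an allelic two-proportion test are assumed here, whereas the study simulates whole sets of linked SNPs with realistic LD patterns):

```python
import math
import numpy as np

rng = np.random.default_rng(1)

def simulate_power(maf, odds_ratio, n_cases, n_controls,
                   alpha=5e-8, n_rep=200):
    """Estimate power by simulation: draw case/control allele counts under
    a multiplicative per-allele risk model, apply an allelic two-proportion
    z-test, and count replicates reaching genome-wide significance."""
    # risk-allele frequency in cases implied by the allelic odds ratio
    p_case = maf * odds_ratio / (maf * odds_ratio + 1 - maf)
    n1, n2 = 2 * n_cases, 2 * n_controls    # numbers of alleles
    hits = 0
    for _ in range(n_rep):
        a1 = rng.binomial(n1, p_case)       # risk alleles in cases
        a2 = rng.binomial(n2, maf)          # risk alleles in controls
        p_hat = (a1 + a2) / (n1 + n2)
        if p_hat == 0 or p_hat == 1:
            continue
        z = (a1 / n1 - a2 / n2) / math.sqrt(
            p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
        if math.erfc(abs(z) / math.sqrt(2)) < alpha:   # two-sided p-value
            hits += 1
    return hits / n_rep

power = simulate_power(maf=0.3, odds_ratio=1.5, n_cases=2000, n_controls=2000)
```

With these parameters the effect is strong, so the estimated power is close to 1; lowering the odds ratio or sample size shows how quickly power decays.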
It has been hypothesized that multivariate analysis and systematic detection of epistatic interactions between explanatory genotyping variables may help resolve the problem of "missing heritability" currently observed in genome-wide association studies (GWAS). However, even the simplest bivariate analysis is still held back by significant statistical and computational challenges that are often addressed by reducing the set of analysed markers. Theoretically, it has been shown that combinations of loci may exist that show weak or no effects individually, but show significant (even complete) explanatory power over phenotype when combined. Reducing the set of analysed SNPs before bivariate analysis could easily omit such critical loci.
We have developed an exhaustive bivariate GWAS analysis methodology that yields a manageable subset of candidate marker pairs for subsequent analysis using other, often more computationally expensive techniques. Our model-free filtering approach is based on classification using ROC curve analysis, an alternative to much slower regression-based modelling techniques. Exhaustive analysis of studies containing approximately 450,000 SNPs and 5,000 samples requires only 2 hours using a desktop CPU or 13 minutes using a GPU (Graphics Processing Unit). We validate our methodology with analysis of simulated datasets as well as the seven Wellcome Trust Case-Control Consortium datasets that represent a wide range of real life GWAS challenges. We have identified SNP pairs that have considerably stronger association with disease than their individual component SNPs that often show negligible effect univariately. When compared against previously reported results in the literature, our methods re-detect most significant SNP-pairs and additionally detect many pairs absent from the literature that show strong association with disease. The high overlap suggests that our fast analysis could substitute for some slower alternatives.
We demonstrate that the proposed methodology is robust, fast and capable of exhaustive search for epistatic interactions using a standard desktop computer. First, our implementation is significantly faster than timings for comparable algorithms reported in the literature, especially as our method allows simultaneous use of multiple statistical filters with low computing time overhead. Second, for some diseases, we have identified hundreds of SNP pairs that pass formal multiple test (Bonferroni) correction and could form a rich source of hypotheses for follow-up analysis.
A web-based version of the software used for this analysis is available at http://bioinformatics.research.nicta.com.au/gwis.
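A classification-based pair filter in the spirit described above can be sketched as follows (hypothetical data; the per-individual score here is a simple in-sample two-locus risk table, with the AUC computed via the rank-sum identity, and the actual GWIS filter statistics may differ):

```python
import numpy as np

rng = np.random.default_rng(2)

def average_ranks(x):
    """1-based ranks, with tied values assigned their average rank."""
    order = np.argsort(x, kind="stable")
    sx, ranks = x[order], np.empty(len(x))
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and sx[j + 1] == sx[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pair_auc(g1, g2, y):
    """Filter statistic for one SNP pair: score each individual by the
    in-sample case fraction of their two-locus genotype cell, then
    compute the AUC via the rank-sum (Mann-Whitney) identity."""
    cell = 3 * g1 + g2                      # 9 two-locus genotype cells
    score = np.zeros(len(y), float)
    for c in range(9):
        mask = cell == c
        if mask.any():
            score[mask] = y[mask].mean()
    ranks = average_ranks(score)
    n1 = int(y.sum())
    n0 = len(y) - n1
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)

# hypothetical pair: risk is elevated only when both loci carry a minor allele
g1, g2 = rng.integers(0, 3, 500), rng.integers(0, 3, 500)
y = rng.binomial(1, 0.2 + 0.4 * ((g1 > 0) & (g2 > 0)))
a = pair_auc(g1, g2, y)
```

Because the AUC needs only a single sort per pair, this kind of filter is far cheaper than fitting a logistic regression for every one of the billions of candidate pairs, which is the efficiency argument made above.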
Genome-wide association studies (GWAS) have identified numerous associations between genetic loci and individual phenotypes; however, relatively few GWAS have attempted to detect pleiotropic associations, in which loci are simultaneously associated with multiple distinct phenotypes. We show that pleiotropic associations can be directly modeled via the construction of simple Bayesian networks, and that these models can be applied to produce single or ensembles of Bayesian classifiers that leverage pleiotropy to improve genetic risk prediction. The proposed method includes two phases: (1) Bayesian model comparison, to identify Single-Nucleotide Polymorphisms (SNPs) associated with one or more traits; and (2) cross-validation feature selection, in which a final set of SNPs is selected to optimize prediction. To demonstrate the capabilities and limitations of the method, a total of 1600 case-control GWAS datasets with two dichotomous phenotypes were simulated under 16 scenarios, varying the association strengths of causal SNPs, the size of the discovery sets, the balance between cases and controls, and the number of pleiotropic causal SNPs. Across the 16 scenarios, prediction accuracy varied from 90% to 50%. In the 14 scenarios that included pleiotropically associated SNPs, the pleiotropic model search and prediction methods consistently outperformed the naive model search and prediction. In the two scenarios in which there were no true pleiotropic SNPs, the differences between the pleiotropic and naive model searches were minimal. To further evaluate the method on real data, a discovery set of 1071 sickle cell disease (SCD) patients was used to search for pleiotropic associations between cerebral vascular accidents and fetal hemoglobin level. Classification was performed on a smaller validation set of 352 SCD patients, and showed that the inclusion of pleiotropic SNPs may slightly improve prediction, although the difference was not statistically significant.
The proposed method is robust, computationally efficient, and provides a powerful new approach for detecting and modeling pleiotropic disease loci.
pleiotropy; SNP; GWAS; prediction; Bayesian
Association studies (especially genome-wide association studies) now play a key role in the identification and characterization of disease-predisposing genetic variants, which customarily involve multiple single nucleotide polymorphisms (SNPs) in a candidate region or across the genome. The case–control association design remains the most popular, and a challenging issue in the statistical analysis is the optimal use of all information contained in these SNPs. Previous approaches often treated gene–gene interaction as a deviation from additive genetic effects or replaced it with SNP–SNP interaction. However, these approaches are limited by their failure to consider gene–gene interaction or gene–gene co-association at the gene level. Although the co-association of the SNPs within a candidate gene can be detected by a principal component analysis-based logistic regression model, the detection of co-association between genes across the genome remains uncertain. Here, we proposed a canonical correlation-based U statistic (CCU) for detecting gene-based gene–gene co-association in the case–control design. We explored its type I error rates and power through simulation and analyzed two real data sets. By treating the gene as a functional unit in the analysis, we found that CCU was a strong alternative to previous approaches. We discussed the performance of CCU as a gene-based gene–gene co-association statistic and the prospects for further improvement.
gene-based; gene–gene co-association; canonical correlation
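The canonical-correlation ingredient of the CCU statistic can be sketched as follows (hypothetical genotype matrices; this computes only the largest canonical correlation between two SNP blocks via whitening and SVD, not the full U-statistic construction):

```python
import numpy as np

rng = np.random.default_rng(3)

def first_canonical_corr(X, Y, eps=1e-10):
    """Largest canonical correlation between two column blocks, computed
    as the top singular value of the product of their orthonormalized
    (whitened) column spaces."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    def orthobasis(M):
        u, s, _ = np.linalg.svd(M, full_matrices=False)
        return u[:, s > eps]            # orthonormal basis of column space
    qx, qy = orthobasis(X), orthobasis(Y)
    s = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return min(1.0, float(s[0]))

# hypothetical genotype matrices: 300 individuals, 4 SNPs per gene,
# with one shared latent factor inducing between-gene co-association
latent = rng.normal(size=300)
A = rng.integers(0, 3, (300, 4)).astype(float)
B = rng.integers(0, 3, (300, 4)).astype(float)
A[:, 0] += latent
B[:, 0] += latent
rho = first_canonical_corr(A, B)
```

Treating each gene's SNPs as one block in this way is what lets a gene-level test aggregate signal that single-SNP analyses would spread across many weak tests.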
Due to the low statistical power of individual markers in a genome-wide association study (GWAS), detecting causal single nucleotide polymorphisms (SNPs) for complex diseases is a challenge. SNP combinations have been suggested to compensate for the low statistical power of individual markers, but evaluating SNP combinations from GWAS data entails high computational complexity.
We aim to detect type 2 diabetes (T2D) causal SNP combinations from a GWAS dataset with optimal filtration and to discover the biological meaning of the detected SNP combinations. Optimal filtration can enhance the statistical power of SNP combinations by comparing the error rates of SNP combinations obtained from various Bonferroni thresholds and p-value range-based thresholds combined with linkage disequilibrium (LD) pruning. T2D causal SNP combinations are selected using random forests with variable selection from an optimal SNP dataset. T2D causal SNP combinations and genome-wide SNPs are mapped into functional modules using expanded gene set enrichment analysis (GSEA), considering pathway, transcription factor (TF)-target, miRNA-target, gene ontology, and protein complex functional modules. Prediction error rates are measured for SNP sets from functional module-based filtration, which selects SNPs within functional modules from genome-wide SNPs based on the expanded GSEA.
A T2D causal SNP combination containing 101 SNPs from the Wellcome Trust Case Control Consortium (WTCCC) GWAS dataset is selected using the optimal filtration criteria, with an error rate of 10.25%. Matching the 101 SNPs with known T2D genes and functional modules reveals the relationships between T2D and the SNP combinations. The prediction error rates of SNP sets from functional module-based filtration are not significantly different from those of randomly selected SNP sets or of the T2D causal SNP combination from optimal filtration.
We propose a detection method for complex disease causal SNP combinations from an optimal SNP dataset by using random forests with variable selection. Mapping the biological meanings of detected SNP combinations can help uncover complex disease mechanisms.