Motivation: Most complex diseases involve multiple genes and their interactions. Although genome-wide association studies (GWAS) have shown some success for identifying genetic variants underlying complex diseases, most existing studies are based on limited single-locus approaches, which detect single nucleotide polymorphisms (SNPs) essentially based on their marginal associations with phenotypes.
Results: In this article, we propose an ensemble approach based on boosting to study gene–gene interactions. We extend the basic AdaBoost algorithm by incorporating an intuitive importance score based on Gini impurity to select candidate SNPs. Permutation tests are used to control the statistical significance. We have performed extensive simulation studies using three interaction models to evaluate the efficacy of our approach at realistic GWAS sizes, and have compared it with existing epistatic detection algorithms. Our results indicate that our approach is valid, efficient for GWAS and on disease models with epistasis has more power than existing programs.
Due to the low statistical power of individual markers from a genome-wide association study (GWAS), detecting causal single nucleotide polymorphisms (SNPs) for complex diseases is a challenge. SNP combinations are suggested to compensate for the low statistical power of individual markers, but SNP combinations from GWAS generate high computational complexity.
We aim to detect type 2 diabetes (T2D) causal SNP combinations from a GWAS dataset with optimal filtration and to discover the biological meaning of the detected SNP combinations. Optimal filtration can enhance the statistical power of SNP combinations by comparing the error rates of SNP combinations from various Bonferroni thresholds and p-value range-based thresholds combined with linkage disequilibrium (LD) pruning. T2D causal SNP combinations are selected using random forests with variable selection from an optimal SNP dataset. T2D causal SNP combinations and genome-wide SNPs are mapped into functional modules using expanded gene set enrichment analysis (GSEA) considering pathway, transcription factor (TF)-target, miRNA-target, gene ontology, and protein complex functional modules. The prediction error rates are measured for SNP sets from functional module-based filtration that selects SNPs within functional modules from genome-wide SNPs based expanded GSEA.
A T2D causal SNP combination containing 101 SNPs from the Wellcome Trust Case Control Consortium (WTCCC) GWAS dataset are selected using optimal filtration criteria, with an error rate of 10.25%. Matching 101 SNPs with known T2D genes and functional modules reveals the relationships between T2D and SNP combinations. The prediction error rates of SNP sets from functional module-based filtration record no significance compared to the prediction error rates of randomly selected SNP sets and T2D causal SNP combinations from optimal filtration.
We propose a detection method for complex disease causal SNP combinations from an optimal SNP dataset by using random forests with variable selection. Mapping the biological meanings of detected SNP combinations can help uncover complex disease mechanisms.
Motivation: In genome-wide association studies (GWAS), up to millions of single nucleotide polymorphisms (SNPs) are genotyped for thousands of individuals. However, conventional single locus-based approaches are usually unable to detect gene–gene interactions underlying complex diseases. Due to the huge search space for complicated high order interactions, many existing multi-locus approaches are slow and may suffer from low detection power for GWAS.
Results: In this article, we develop a simple, fast and effective algorithm to detect genome-wide multi-locus epistatic interactions based on the clustering of relatively frequent items. Extensive experiments on simulated data show that our algorithm is fast and more powerful in general than some recently proposed methods. On a real genome-wide case–control dataset for age-related macular degeneration (AMD), the algorithm has identified genotype combinations that are significantly enriched in the cases.
Contact: email@example.com; firstname.lastname@example.org
Supplementary information: Supplementary data are available at Bioinformatics online.
Detecting epistatic interactions associated with complex and common diseases can help to improve prevention, diagnosis and treatment of these diseases. With the development of genome-wide association studies (GWAS), designing powerful and robust computational method for identifying epistatic interactions associated with common diseases becomes a great challenge to bioinformatics society, because the study of epistatic interactions often deals with the large size of the genotyped data and the huge amount of combinations of all the possible genetic factors. Most existing computational detection methods are based on the classification capacity of SNP sets, which may fail to identify SNP sets that are strongly associated with the diseases and introduce a lot of false positives. In addition, most methods are not suitable for genome-wide scale studies due to their computational complexity.
We propose a new Markov Blanket-based method, DASSO-MB (Detection of ASSOciations using Markov Blanket) to detect epistatic interactions in case-control GWAS. Markov blanket of a target variable T can completely shield T from all other variables. Thus, we can guarantee that the SNP set detected by DASSO-MB has a strong association with diseases and contains fewest false positives. Furthermore, DASSO-MB uses a heuristic search strategy by calculating the association between variables to avoid the time-consuming training process as in other machine-learning methods. We apply our algorithm to simulated datasets and a real case-control dataset. We compare DASSO-MB to other commonly-used methods and show that our method significantly outperforms other methods and is capable of finding SNPs strongly associated with diseases.
Our study shows that DASSO-MB can identify a minimal set of causal SNPs associated with diseases, which contains less false positives compared to other existing methods. Given the huge size of genomic dataset produced by GWAS, this is critical in saving the potential costs of biological experiments and being an efficient guideline for pathogenesis research.
Genome-wide association studies (GWAS) are a widely used study design for detecting genetic causes of complex diseases. Current studies provide good coverage of common causal SNPs, but not rare ones. A popular method to detect rare causal variants is haplotype testing. A disadvantage of this approach is that many parameters are estimated simultaneously, which can mean a loss of power and slower fitting to large datasets.
Haplotype testing effectively tests both the allele frequencies and the linkage disequilibrium (LD) structure of the data. LD has previously been shown to be mostly attributable to LD between adjacent SNPs. We propose a generalised linear model (GLM) which models the effects of each SNP in a region as well as the statistical interactions between adjacent pairs. This is compared to two other commonly used multimarker GLMs: one with a main-effect parameter for each SNP; one with a parameter for each haplotype.
We show the haplotype model has higher power for rare untyped causal SNPs, the main-effects model has higher power for common untyped causal SNPs, and the proposed model generally has power in between the two others. We show that the relative power of the three methods is dependent on the number of marker haplotypes the causal allele is present on, which depends on the age of the mutation. Except in the case of a common causal variant in high LD with markers, all three multimarker models are superior in power to single-SNP tests.
Including the adjacent statistical interactions results in lower inflation in test statistics when a realistic level of population stratification is present in a dataset.
Using the multimarker models, we analyse data from the Molecular Genetics of Schizophrenia study. The multimarker models find potential associations that are not found by single-SNP tests. However, multimarker models also require stricter control of data quality since biases can have a larger inflationary effect on multimarker test statistics than on single-SNP test statistics.
Analysing a GWAS with multimarker models can yield candidate regions which may contain rare untyped causal variants. This is useful for increasing prior odds of association in future whole-genome sequence analyses.
GWAS has facilitated greatly the discovery of risk SNPs associated with complex diseases. Traditional methods analyze SNP individually and are limited by low power and reproducibility since correction for multiple comparisons is necessary. Several methods have been proposed based on grouping SNPs into SNP sets using biological knowledge and/or genomic features. In this article, we compare the linear kernel machine based test (LKM) and principal components analysis based approach (PCA) using simulated datasets under the scenarios of 0 to 3 causal SNPs, as well as simple and complex linkage disequilibrium (LD) structures of the simulated regions. Our simulation study demonstrates that both LKM and PCA can control the type I error at the significance level of 0.05. If the causal SNP is in strong LD with the genotyped SNPs, both the PCA with a small number of principal components (PCs) and the LKM with kernel of linear or identical-by-state function are valid tests. However, if the LD structure is complex, such as several LD blocks in the SNP set, or when the causal SNP is not in the LD block in which most of the genotyped SNPs reside, more PCs should be included to capture the information of the causal SNP. Simulation studies also demonstrate the ability of LKM and PCA to combine information from multiple causal SNPs and to provide increased power over individual SNP analysis. We also apply LKM and PCA to analyze two SNP sets extracted from an actual GWAS dataset on non-small cell lung cancer.
Genome-wide association studies (GWAS) using single nucleotide polymorphism (SNP) markers provide opportunities to detect epistatic SNPs associated with quantitative traits and to detect the exact mode of an epistasis effect. Computational difficulty is the main bottleneck for epistasis testing in large scale GWAS.
The EPISNPmpi and EPISNP computer programs were developed for testing single-locus and epistatic SNP effects on quantitative traits in GWAS, including tests of three single-locus effects for each SNP (SNP genotypic effect, additive and dominance effects) and five epistasis effects for each pair of SNPs (two-locus interaction, additive × additive, additive × dominance, dominance × additive, and dominance × dominance) based on the extended Kempthorne model. EPISNPmpi is the parallel computing program for epistasis testing in large scale GWAS and achieved excellent scalability for large scale analysis and portability for various parallel computing platforms. EPISNP is the serial computing program based on the EPISNPmpi code for epistasis testing in small scale GWAS using commonly available operating systems and computer hardware. Three serial computing utility programs were developed for graphical viewing of test results and epistasis networks, and for estimating CPU time and disk space requirements.
The EPISNPmpi parallel computing program provides an effective computing tool for epistasis testing in large scale GWAS, and the epiSNP serial computing programs are convenient tools for epistasis analysis in small scale GWAS using commonly available computer hardware.
In this paper, we develop a powerful test for identifying SNP-sets that are predictive of survival with data from genome-wide association studies (GWAS). We first group typed SNPs into SNP-sets based on genomic features and then apply a score test to assess the overall effect of each SNP-set on the survival outcome through a kernel machine Cox regression framework. This approach uses genetic information from all SNPs in the SNP-set simultaneously and accounts for linkage disequilibrium (LD), leading to a powerful test with reduced degrees of freedom when the typed SNPs are in LD with each other. This type of test also has the advantage of capturing the potentially non-linear effects of the SNPs, SNP-SNP interactions (epistasis), and the joint effects of multiple causal variants. By simulating SNP data based on the LD structure of real genes from the HapMap project, we demonstrate that our proposed test is more powerful than the standard single SNP minimum p-value based test for association studies with censored survival outcomes. We illustrate the proposed test with a real data application.
cox model; genetic studies; gene-based analysis; kernel machine; multi-locus test; score test; single nucleotide polymorphism
There is growing evidence that gene-gene interactions are ubiquitous in determining the susceptibility to common human diseases. The investigation of such gene-gene interactions presents new statistical challenges for studies with relatively small sample sizes as the number of potential interactions in the genome can be large. Breast cancer provides a useful paradigm to study genetically complex diseases because commonly occurring single nucleotide polymorphisms (SNPs) may additively or synergistically disturb the system-wide communication of the cellular processes leading to cancer development.
In this study, we systematically studied SNP-SNP interactions among 19 SNPs from 18 key genes involved in major cancer pathways in a sample of 398 breast cancer cases and 372 controls from Ontario. We discuss the methodological issues associated with the detection of SNP-SNP interactions in this dataset by applying and comparing three commonly used methods: the logistic regression model, classification and regression trees (CART), and the multifactor dimensionality reduction (MDR) method.
Our analyses show evidence for several simple (two-way) and complex (multi-way) SNP-SNP interactions associated with breast cancer. For example, all three methods identified XPD-[Lys751Gln]*IL10-[G(-1082)A] as the most significant two-way interaction. CART and MDR identified the same critical SNPs participating in complex interactions. Our results suggest that the use of multiple statistical approaches (or an integrated approach) rather than a single methodology could be the best strategy to elucidate complex gene interactions that have generally very different patterns.
The strategy used here has the potential to identify complex biological relationships among breast cancer genes and processes. This will lead to the discovery of novel biological information, which will improve breast cancer risk management.
With the availability of large-scale genome-wide association study (GWAS) data, choosing an optimal set of SNPs for disease susceptibility prediction is a challenging task. This study aimed to use single nucleotide polymorphisms (SNPs) to predict psoriasis from searching GWAS data.
Totally we had 2,798 samples and 451,724 SNPs. Process for searching a set of SNPs to predict susceptibility for psoriasis consisted of two steps. The first one was to search top 1,000 SNPs with high accuracy for prediction of psoriasis from GWAS dataset. The second one was to search for an optimal SNP subset for predicting psoriasis. The sequential information bottleneck (sIB) method was compared with classical linear discriminant analysis(LDA) for classification performance.
The best test harmonic mean of sensitivity and specificity for predicting psoriasis by sIB was 0.674(95% CI: 0.650-0.698), while only 0.520(95% CI: 0.472-0.524) was reported for predicting disease by LDA. Our results indicate that the new classifier sIB performs better than LDA in the study.
The fact that a small set of SNPs can predict disease status with average accuracy of 68% makes it possible to use SNP data for psoriasis prediction.
Susceptibility variants identified by genome-wide association studies (GWAS) have modest effect sizes. Whether such variants provide incremental information in assessing risk for common 'complex' diseases is unclear. We investigated whether measured and imputed genotypes from a GWAS dataset linked to the electronic medical record alter estimates of coronary heart disease (CHD) risk.
Study participants (n = 1243) had no known cardiovascular disease and were considered to be at high, intermediate, or low 10-year risk of CHD based on the Framingham risk score (FRS) which includes age, sex, total and HDL cholesterol, blood pressure, diabetes, and smoking status. Of twelve SNPs identified in prior GWAS to be associated with CHD, four were genotyped in the participants as part of a GWAS. Genotypes for seven SNPs were imputed from HapMap CEU population using the program MACH. We calculated a multiplex genetic risk score for each patient based on the odds ratios of the susceptibility SNPs and incorporated this into the FRS.
The mean (SD) number of risk alleles was 12.31 (1.95), range 6-18. The mean (SD) of the weighted genetic risk score was 12.64 (2.05), range 5.75-18.20. The CHD genetic risk score was not correlated with the FRS (P = 0.78). After incorporating the genetic risk score into the FRS, a total of 380 individuals (30.6%) were reclassified into higher-(188) or lower-risk groups (192).
A genetic risk score based on measured/imputed genotypes at 11 susceptibility SNPs, led to significant reclassification in the 10-y CHD risk categories. Additional prospective studies are needed to assess accuracy and clinical utility of such reclassification.
Complex genetic disorders are a result of a combination of genetic and non-genetic factors, all potentially interacting. Machine learning methods hold the potential to identify multi-locus and environmental associations thought to drive complex genetic traits. Decision trees, a popular machine learning technique, offer a computationally low complexity algorithm capable of detecting associated sets of SNPs of arbitrary size, including modern genome-wide SNP scans. However, interpretation of the importance of an individual SNP within these trees can present challenges.
We present a new decision tree algorithm denoted as Bagged Alternating Decision Trees (BADTrees) that is based on identifying common structural elements in a bootstrapped set of ADTrees. The algorithm is order nk2, where n is the number of SNPs considered and k is the number of SNPs in the tree constructed. Our simulation study suggests that BADTrees have higher power and lower type I error rates than ADTrees alone and comparable power with lower type I error rates compared to logistic regression. We illustrate the application of these data using simulated data as well as from the Lupus Large Association Study 1 (7822 SNPs in 3548 individuals). Our results suggest that BADTrees holds promise as a low computational order algorithm for detecting complex combinations of SNP and environmental factors associated with disease.
Machine Learning; Genetic Association; Gene-Gene Interaction; Multi-locus Models
Despite the great success of genome-wide association studies (GWAS) in identification of the common genetic variants associated with complex diseases, the current GWAS have focused on single-SNP analysis. However, single-SNP analysis often identifies only a few of the most significant SNPs that account for a small proportion of the genetic variants and offers only a limited understanding of complex diseases. To overcome these limitations, we propose gene and pathway-based association analysis as a new paradigm for GWAS. As a proof of concept, we performed a comprehensive gene and pathway-based association analysis of 13 published GWAS. Our results showed that the proposed new paradigm for GWAS not only identified the genes that include significant SNPs found by single-SNP analysis, but also detected new genes in which each single SNP conferred a small disease risk; however, their joint actions were implicated in the development of diseases. The results also showed that the new paradigm for GWAS was able to identify biologically meaningful pathways associated with the diseases, which were confirmed by a gene-set-rich analysis using gene expression data.
genome-wide association studies; gene and pathway-based analysis; complex diseases; combining P-values; gene-set enrichment analysis
Genome-wide association studies (GWAS) with hundreds of żthousands of single nucleotide polymorphisms (SNPs) are popular strategies to reveal the genetic basis of human complex diseases. Despite many successes of GWAS, it is well recognized that new analytical approaches have to be integrated to achieve their full potential. Starting with a list of SNPs, found to be associated with disease in GWAS, here we propose a novel methodology to devise functionally important KEGG pathways through the identification of genes within these pathways, where these genes are obtained from SNP analysis. Our methodology is based on functionalization of important SNPs to identify effected genes and disease related pathways. We have tested our methodology on WTCCC Rheumatoid Arthritis (RA) dataset and identified: i) previously known RA related KEGG pathways (e.g., Toll-like receptor signaling, Jak-STAT signaling, Antigen processing, Leukocyte transendothelial migration and MAPK signaling pathways); ii) additional KEGG pathways (e.g., Pathways in cancer, Neurotrophin signaling, Chemokine signaling pathways) as associated with RA. Furthermore, these newly found pathways included genes which are targets of RA-specific drugs. Even though GWAS analysis identifies 14 out of 83 of those drug target genes; newly found functionally important KEGG pathways led to the discovery of 25 out of 83 genes, known to be used as drug targets for the treatment of RA. Among the previously known pathways, we identified additional genes associated with RA (e.g. Antigen processing and presentation, Tight junction). Importantly, within these pathways, the associations between some of these additionally found genes, such as HLA-C, HLA-G, PRKCQ, PRKCZ, TAP1, TAP2 and RA were verified by either OMIM database or by literature retrieved from the NCBI PubMed module. With the whole-genome sequencing on the horizon, we show that the full potential of GWAS can be achieved by integrating pathway and network-oriented analysis and prior knowledge from functional properties of a SNP.
Genome-wide association studies (GWAS) are popular for identifying genetic variants which are associated with disease risk. Many approaches have been proposed to test multiple single nucleotide polymorphisms (SNPs) in a region simultaneously which considering disadvantages of methods in single locus association analysis. Kernel machine based SNP set analysis is more powerful than single locus analysis, which borrows information from SNPs correlated with causal or tag SNPs. Four types of kernel machine functions and principal component based approach (PCA) were also compared. However, given the loss of power caused by low minor allele frequencies (MAF), we conducted an extension work on PCA and used a new method called weighted PCA (wPCA). Comparative analysis was performed for weighted principal component analysis (wPCA), logistic kernel machine based test (LKM) and principal component analysis (PCA) based on SNP set in the case of different minor allele frequencies (MAF) and linkage disequilibrium (LD) structures. We also applied the three methods to analyze two SNP sets extracted from a real GWAS dataset of non-small cell lung cancer in Han Chinese population. Simulation results show that when the MAF of the causal SNP is low, weighted principal component and weighted IBS are more powerful than PCA and other kernel machine functions at different LD structures and different numbers of causal SNPs. Application of the three methods to a real GWAS dataset indicates that wPCA and wIBS have better performance than the linear kernel, IBS kernel and PCA.
Recently introduced pathway-based approach is promising and advantageous to improve the efficiency of analyzing genome-wide association scan (GWAS) data to identify disease variants by jointly considering variants of the genes that belong to the same biological pathway. However, the current available pathway-based approaches for analyzing GWAS have limited power and efficiency.
We proposed a new and efficient permutation strategy based on SNP randomization for determining significance in pathway analysis of GWAS. The developed permutation strategy was evaluated and compared to two previously available methods, i.e. sample permutation and gene permutation, through simulation studies and a study on a real dataset. Results showed that the proposed permutation strategy is more powerful and efficient with greatly reducing the computational complexity.
Our findings indicate the improved performance of SNP permutation and thus render pathway-based analysis of GWAS more applicable and attractive.
Single nucleotide polymorphism (SNP) based association studies aim at identifying SNPs associated with phenotypes, for example, complex diseases. The associated SNPs may influence the disease risk individually (main effects) or behave jointly (epistatic interactions). For the analysis of high throughput data, the main difficulty is that the number of SNPs far exceeds the number of samples. This difficulty is amplified when identifying interactions.
In this paper, we propose an Adaptive Group Lasso (AGL) model for large-scale association studies. Our model enables us to analyze SNPs and their interactions simultaneously. We achieve this by introducing a sparsity constraint in our model based on the fact that only a small fraction of SNPs is disease-associated. In order to reduce the number of false positive findings, we develop an adaptive reweighting scheme to enhance sparsity. In addition, our method treats SNPs and their interactions as factors, and identifies them in a grouped manner. Thus, it is flexible to analyze various disease models, especially for interaction detection. However, due to the intensive computation when millions of interaction terms needs to be searched in the model fitting, our method needs to combined with some filtering methods when applied to genome-wide data for detecting interactions.
By using a wide range of simulated datasets and a real dataset from WTCCC, we demonstrate the advantages of our method.
Genome-wide association (GWA) studies have recently emerged as a major approach to gene discovery for many complex diseases. Since GWA scans are expensive, cost efficiency is an important factor to consider in study design. However, it often requires extensive and time consuming computer simulations to compare cost efficiency across different SNP chips. Here we propose two simulation-free approaches to cost efficiency comparisons across SNP chips. In the first method, the overall power under a given disease model is calculated for each SNP chip and various sample sizes. Then SNP chips can be compared with respect to the sample sizes required to achieve the same level of power. In the second method, for a desired level of genomic coverage, the effective r2 threshold values are calculated for each SNP chip. Since r2 is inversely proportional to the sample size to achieve the same power, the required sample sizes can then be compared among SNP chips. These two methods are complementary to each other. The first approach provides direct power comparisons, but it requires information on disease model and may not be reliable for SNP chips that contain many non-HapMap SNPs. The second approach allows sample size comparisons based on the coverage of SNP chips, and it can be modified for SNP chips that contain non-HapMap SNPs. These methods are particularly relevant for large epidemiological studies in which enough subjects are available for GWA screening and follow-up stages. We illustrate these approaches using five currently available whole genome SNP chips.
genome-wide association; cost efficiency; HapMap
Multiple genetic factors and their interactive effects are speculated to contribute to complex diseases. Detecting such genetic interactive effects, i.e., epistatic interactions, however, remains a significant challenge in large-scale association studies.
We have developed a new method, named SNPInterForest, for identifying epistatic interactions by extending an ensemble learning technique called random forest. Random forest is a predictive method that has been proposed for use in discovering single-nucleotide polymorphisms (SNPs), which are most predictive of the disease status in association studies. However, it is less sensitive to SNPs with little marginal effect. Furthermore, it does not natively exhibit information on interaction patterns of susceptibility SNPs. We extended the random forest framework to overcome the above limitations by means of (i) modifying the construction of the random forest and (ii) implementing a procedure for extracting interaction patterns from the constructed random forest. The performance of the proposed method was evaluated by simulated data under a wide spectrum of disease models. SNPInterForest performed very well in successfully identifying pure epistatic interactions with high precision and was still more than capable of concurrently identifying multiple interactions under the existence of genetic heterogeneity. It was also performed on real GWAS data of rheumatoid arthritis from the Wellcome Trust Case Control Consortium (WTCCC), and novel potential interactions were reported.
SNPInterForest, offering an efficient means to detect epistatic interactions without statistical analyses, is promising for practical use as a way to reveal the epistatic interactions involved in common complex diseases.
Association study (especially the genome-wide association study) now has a key function in identification and characterization of disease-predisposing genetic variant(s), which customarily involve multiple single nucleotide polymorphisms (SNPs) in a candidate region or across the genome. Case–control association design remains the most popular and a challenging issue in the statistical analysis is the optimal use of all information contained in these SNPs. Previous approaches often treated gene–gene interaction as deviation from additive genetic effects or replaced it with SNP–SNP interaction. However, these approaches are limited for their failure of consideration of gene–gene interaction or gene–gene co-association at gene level. Although the co-association of the SNPs within a candidate gene can be detected by principal component analysis-based logistic regression model, the detection of co-association between genes in genome remains uncertain. Here, we proposed a canonical correlation-based U statistic (CCU) for detecting gene-based gene–gene co-association in the case–control design. We explored its type I error rates and power through simulation and analyzed two real data sets. By treating gene as a functional unit in analysis, we found that CCU was a strong alternative to previous approaches. We discussed the performance of CCU as a gene-based gene–gene co-association statistic and the prospect of further improvement.
gene-based; gene–gene co-association; canonical correlation
The high genomic density of the single-nucleotide polymorphism (SNP) sets that are typically surveyed in genome-wide association studies (GWAS) now allows the application of haplotype-based methods. Although the choice of haplotype-based vs. individual-SNP approaches is expected to affect the results of association studies, few empirical comparisons of method performance have been reported on the genome-wide scale in the same set of individuals. To measure the relative ability of the two strategies to detect associations, we used a large dataset from the North American Rheumatoid Arthritis Consortium to: 1) partition the genome into haplotype blocks, 2) associate haplotypes with disease, and 3) compare the results with individual-SNP association mapping. Although some associations were shared across methods, each approach uniquely identified several strong candidate regions. Our results suggest that the application of both haplotype-based and individual-SNP testing to GWAS should be adopted as a routine procedure.
One of the most challenging points in studying human common complex diseases is to search for both strong and weak susceptibility single-nucleotide polymorphisms (SNPs) and identify forms of genetic disease models. Currently, a number of methods have been proposed for this purpose. Many of them have not been validated through applications into various genome datasets, so their abilities are not clear in real practice. In this paper, we present a novel SNP association study method based on probability theory, called ProbSNP. The method firstly detects SNPs by evaluating their joint probabilities in combining with disease status and selects those with the lowest joint probabilities as susceptibility ones, and then identifies some forms of genetic disease models through testing multiple-locus interactions among the selected SNPs. The joint probabilities of combined SNPs are estimated by establishing Gaussian distribution probability density functions, in which the related parameters (i.e., mean value and standard deviation) are evaluated based on allele and haplotype frequencies. Finally, we test and validate the method using various genome datasets. We find that ProbSNP has shown remarkable success in the applications to both simulated genome data and real genome-wide data.
Association study; SNPs; probability theory; Gaussian distribution; case-control
Large genetic association studies based on hundreds of thousands of single-nucleotide polymorphisms (SNPs) are a popular option for the study of complex diseases. The evaluation of gene × gene interactions in such studies is a sensible method of capturing important genetic effects. The number of tests required to consider all pairs of SNPs, however, can lead to a computational burden, and efficient strategies to reduce the number of tests performed are desirable. In this study, we compare two-stage strategies for pairwise SNP interactions testing. Those approaches rely on the selection of SNPs based on the single-locus test results obtained at the first stage. In the simultaneous approach, SNPs that fall below the marginal significance thresholds (p = 0.05 and p = 0.1) in stage 1 are selected and tested for within-group pairwise interaction in stage 2. With the conditional approach, SNPs that reach Bonferroni-adjusted significance at the first stage are tested in pairwise combinations with all SNPs in the data set. We compared the performance of those strategies by using Replicate 1 of the simulated data set of the Genetic Analysis Workshop 15 Problem 3. Most interactions detected resulted from SNP pairs within 1000 kb of each other. The remaining were false positives involving SNPs with excessively strong marginal signals. Our results highlight the need to account for locus proximity in the evaluation of interaction effects and emphasize the importance of marginal signal strength in logistic regression-based interaction modeling. We found that modeling additive genetic effects alone was sufficient to capture underlying dominance interaction effects in the data.
Genome-wide association studies (GWAS) have emerged as the method of choice for identifying common variants affecting complex disease. In a GWAS, particular attention is placed, for obvious reasons, on single-nucleotide polymorphisms (SNPs) that exceed stringent genome-wide significance thresholds. However, it is expected that many SNPs with only nominal evidence of association (e.g., P < 0.05) truly influence disease. Efforts to extract additional biological information from entire GWAS datasets have primarily focused on pathway-enrichment analyses. However, these methods suffer from a number of limitations and typically fail to lead to testable hypotheses. To evaluate alternative approaches, we performed a systems-level analysis of GWAS data using weighted gene coexpression network analysis. A weighted gene coexpression network was generated for 1918 genes harboring SNPs that displayed nominal evidence of association (P ≤ 0.05) from a GWAS of bone mineral density (BMD) using microarray data on circulating monocytes isolated from individuals with extremely low or high BMD. Thirteen distinct gene modules were identified, each comprising coexpressed and highly interconnected GWAS genes. Through the characterization of module content and topology, we illustrate how network analysis can be used to discover disease-associated subnetworks and characterize novel interactions for genes with a known role in the regulation of BMD. In addition, we provide evidence that network metrics can be used as a prioritizing tool when selecting genes and SNPs for replication studies. Our results highlight the advantages of using systems-level strategies to add value to and inform GWAS.
genome-wide association study (GWAS); systems biology; coexpression network; osteoporosis
Genome-wide association studies (GWAS) are increasingly utilized for identifying novel susceptible genetic variants for complex traits, but there is little consensus on analysis methods for such data. Most commonly used methods include single single nucleotide polymorphism (SNP) analysis or haplotype analysis with Bonferroni correction for multiple comparisons. Since the SNPs in typical GWAS are often in linkage disequilibrium (LD), at least locally, Bonferroni correction of multiple comparisons often leads to conservative error control and therefore lower statistical power. In this paper, we propose a hidden Markov random field model (HMRF) for GWAS analysis based on a weighted LD graph built from the prior LD information among the SNPs and an efficient iterative conditional mode algorithm for estimating the model parameters. This model effectively utilizes the LD information in calculating the posterior probability that an SNP is associated with the disease. These posterior probabilities can then be used to define a false discovery controlling procedure in order to select the disease-associated SNPs. Simulation studies demonstrated the potential gain in power over single SNP analysis. The proposed method is especially effective in identifying SNPs with borderline significance at the single-marker level that nonetheless are in high LD with significant SNPs. In addition, by simultaneously considering the SNPs in LD, the proposed method can also help to reduce the number of false identifications of disease-associated SNPs. We demonstrate the application of the proposed HMRF model using data from a case–control GWAS of neuroblastoma and identify 1 new SNP that is potentially associated with neuroblastoma.
Empirical Bayes; False discovery; Iterative conditional model; Linkage disequilibrium