It has been suggested that pathway analysis can complement single-SNP analysis in exploring genomewide association data. Pathway analysis incorporates the available biological knowledge of genes and SNPs and is expected to improve the chances of revealing the underlying genetic architecture of complex traits. Methods for pathway analysis can be classified as competitive (enrichment) or self-contained (association) according to the hypothesis tested. Although association tests are statistically more powerful than enrichment tests they can be difficult to calibrate because biases in analysis accumulate across multiple SNPs or genes. Furthermore, enrichment tests can be more scientifically relevant than association tests, as they detect pathways with relatively more evidence for association than the remaining genes. Here we show how some well known association tests can be simply adapted to test for enrichment, and compare their performance to some established enrichment tests. We propose versions of the Adaptive Rank Truncated Product (ARTP), Tail Strength Measure and Fisher’s combination of p-values for testing the enrichment null hypothesis. We compare the behaviour of these proposed methods with the established Hypergeometric Test and Gene-Set Enrichment Analysis (GSEA). The results of the simulation study show that the modified version of the ARTP method has generally the best performance across the situations considered. The methods were also applied for finding enriched pathways for body mass index (BMI) and platelet function phenotypes. The pathway analysis of BMI identified the Vasoactive Intestinal Peptide pathway as significantly associated with BMI. This pathway has been previously reported as associated with BMI and the risk of obesity. The ARTP method was the method that identified the largest number of enriched pathways across all tested pathway databases and phenotypes. 
The simulation and data application results are in agreement with previous work on association tests and suggest that the ARTP should be preferred for both enrichment and association testing.
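As a rough illustration of two of the combination statistics named above, the following sketch computes Fisher's combination of p-values and the tail strength measure in their standard association (self-contained) forms. It assumes independent p-values strictly greater than zero, and it does not reproduce the enrichment adaptations proposed in the abstract; the function names are hypothetical.

```python
import math


def fisher_combination(p_values):
    """Combine independent p-values (each > 0) with Fisher's method.

    Under the null, -2 * sum(log p) is chi-square with 2k degrees of
    freedom when the k p-values are independent.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    half_stat = -sum(math.log(p) for p in p_values)  # the statistic divided by 2
    # Closed-form chi-square survival function for even degrees of freedom:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)


def tail_strength(p_values):
    """Tail strength: the mean over k of 1 - p_(k) * (m + 1) / k.

    p_(k) is the k-th smallest of the m p-values. Values near zero are
    expected under the global null; positive values indicate an excess
    of small p-values.
    """
    m = len(p_values)
    ordered = sorted(p_values)
    return sum(1.0 - p * (m + 1) / k for k, p in enumerate(ordered, start=1)) / m
```

In practice SNPs within a gene are correlated through linkage disequilibrium, so the chi-square reference distribution used here would normally be replaced by a permutation null.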
Current GWAS have primarily focused on testing association of single SNPs. To only test for association of single SNPs has limited utility and is insufficient to dissect the complex genetic structure of many common diseases. To meet conceptual and technical challenges raised by GWAS, we propose gene and pathway-based GWAS as complementary to the current single SNP-based GWAS. This publication develops three statistics for testing association of genes and pathways with disease: linear combination test, quadratic test and decorrelation test which take correlations among SNPs within a gene or genes within a pathway into account. The null distribution of the proposed statistics is examined and the statistics are applied to GWAS of rheumatoid arthritis in the Wellcome Trust Case Control Consortium and the North American Rheumatoid Arthritis Consortium studies. The preliminary results show that the proposed gene and pathway-based GWAS offer several remarkable features. First, not only can they identify the genes that have large genetic effects, but also they can detect new genes in which each single SNP conferred a small amount of disease risk, and their joint actions can be implicated in the development of diseases. Second, gene and pathway-based analysis can allow the formation of the core of pathway definition of complex diseases and unravel the functional bases of an association finding. Third, replication of association findings at the gene or pathway level is much easier than replication at the individual SNP level.
GWAS; gene association analysis; pathway association analysis; complex diseases
Recently we have witnessed a surge of interest in using genome-wide association studies (GWAS) to discover the genetic basis of complex diseases. Many genetic variations, mostly in the form of single nucleotide polymorphisms (SNPs), have been identified in a wide spectrum of diseases, including diabetes, cancer, and psychiatric diseases. A common theme arising from these studies is that the genetic variations discovered by GWAS can only explain a small fraction of the genetic risks associated with the complex diseases. New strategies and statistical approaches are needed to address this lack of explanation. One such approach is pathway analysis, which considers the genetic variations underlying a biological pathway jointly, rather than separately as in traditional GWAS. A critical challenge in pathway analysis is how to combine evidence of association over multiple SNPs within a gene and multiple genes within a pathway. Most current methods choose the most significant SNP from each gene as a representative, ignoring the joint action of multiple SNPs within a gene. This approach leads to preferential identification of genes with a greater number of SNPs.
We describe a SNP-based pathway enrichment method for GWAS studies. The method consists of the following two main steps: 1) for a given pathway, using an adaptive truncated product statistic to identify all representative (potentially more than one) SNPs of each gene, calculating the average number of representative SNPs for the genes, then re-selecting the representative SNPs of genes in the pathway based on this number; and 2) ranking all selected SNPs by the significance of their statistical association with a trait of interest, and testing if the set of SNPs from a particular pathway is significantly enriched with high ranks using a weighted Kolmogorov-Smirnov test. We applied our method to two large genetically distinct GWAS data sets of schizophrenia, one from European-American (EA) and the other from African-American (AA). In the EA data set, we found 22 pathways with nominal P-value less than or equal to 0.001 and corresponding false discovery rate (FDR) less than 5%. In the AA data set, we found 11 pathways by controlling the same nominal P-value and FDR threshold. Interestingly, 8 of these pathways overlap with those found in the EA sample. We have implemented our method in a JAVA software package, called SNP Set Enrichment Analysis (SSEA), which contains a user-friendly interface and is freely available at http://cbcl.ics.uci.edu/SSEA.
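The second step above relies on a weighted Kolmogorov-Smirnov running-sum statistic. The sketch below shows the general GSEA-style form of such a statistic, assuming each SNP carries a non-negative association score (for example -log10 of its P-value). The function name, the `weight` exponent and the score choice are illustrative assumptions, not details taken from SSEA, and the first step (selecting representative SNPs) is not shown.

```python
def weighted_ks_enrichment(scores, in_set, weight=1.0):
    """Signed maximum deviation of a weighted KS-like running sum.

    scores : association scores, larger meaning stronger evidence
    in_set : booleans aligned with scores, True if the SNP is in the pathway
    weight : exponent applied to the scores of pathway SNPs (assumed default)
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hit_total = sum(abs(scores[i]) ** weight for i in order if in_set[i])
    n_miss = sum(1 for flag in in_set if not flag)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("need pathway SNPs with non-zero scores and SNPs outside the pathway")
    running, best = 0.0, 0.0
    for i in order:
        if in_set[i]:
            running += abs(scores[i]) ** weight / hit_total  # step up on a hit
        else:
            running -= 1.0 / n_miss  # step down on a miss
        if abs(running) > abs(best):
            best = running
    return best
```

The statistic alone is not a test; its significance would normally be assessed against a permutation null.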
The SNP-based pathway enrichment method described here offers a new alternative approach for analysing GWAS data. By applying it to schizophrenia GWAS studies, we show that our method is able to identify statistically significant pathways, and importantly, pathways that can be replicated in large genetically distinct samples.
Genome-wide association studies (GWAS) aim to detect single nucleotide polymorphisms (SNP) associated with trait variation. However, due to the large number of tests, standard analysis techniques impose highly stringent significance thresholds, leaving potentially associated SNPs undetected, and much of the trait genetic variation unexplained. Pathway- and network-based methodologies applied to GWAS aim to detect associations missed by standard single-marker approaches. The complex and non-random architecture of the genome makes it a challenge to derive an appropriate testing framework for such methodologies. We developed a rapid and simple permutation approach that uses GWAS SNP association results to establish the significance of pathway associations while accounting for the linkage disequilibrium structure of SNPs and the clustering of functionally related elements in the genome. All SNPs used in the GWAS are placed in a “circular genome” according to their location. Then the complete set of SNP association P values are permuted by rotation with respect to the genomic locations of the SNPs. Once these “simulated” P values are assigned, the joint gene P values are calculated using Fisher’s combination test, and the association of pathways is tested using the hypergeometric test. The circular genomic permutation approach was applied to a human genome-wide association dataset. The data consists of 719 individuals from the ORCADES study genotyped for ∼300,000 SNPs and measured for 51 traits ranging from physical to biochemical measurements. KEGG pathways (n = 225) were used as the sets of pathways to be tested. Our results demonstrate that the circular genomic permutations provide robust association P values. 
The non-permuted hypergeometric analysis generates ∼1400 pathway-trait combination results with an association P value of P ≤ 0.05, whereas applying circular genomic permutation reduces the number of significant results to a more credible 40% of that value. The circular permutation software (“genomicper”) is available as an R package at http://cran.r-project.org/.
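The rotation idea can be sketched as follows, assuming the SNP P values are stored in genomic order and that the pathway statistic is any function of that ordered list. This is not the genomicper implementation: the function names are hypothetical, and the statistic passed in stands in for the Fisher-then-hypergeometric pipeline described above.

```python
import random
from math import comb


def rotate_p_values(p_values, offset):
    """Shift P values along the circular genome, keeping their relative order."""
    values = list(p_values)
    offset %= len(values)
    if offset == 0:
        return values
    return values[-offset:] + values[:-offset]


def hypergeom_upper_tail(population, successes, draws, observed):
    """P(X >= observed) for a hypergeometric count X."""
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(observed, upper + 1)
    )
    return favourable / comb(population, draws)


def circular_permutation_p(p_values, statistic, n_perm=1000, seed=0):
    """Empirical P value of `statistic` under random rotations.

    `statistic` maps the ordered P values to a number where larger means
    stronger evidence; SNP-to-gene assignments stay fixed by position.
    """
    rng = random.Random(seed)
    observed = statistic(p_values)
    n = len(p_values)
    exceed = 0
    for _ in range(n_perm):
        rotated = rotate_p_values(p_values, rng.randrange(1, n))
        if statistic(rotated) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)
```

Because a rotation moves whole blocks of neighbouring SNPs together, the local correlation between adjacent P values is preserved in every permuted data set.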
GWAS; pathway-based; permutation method; genomicper R package; cardiac disease
Where causal SNPs (single nucleotide polymorphisms) tend to accumulate within biological pathways, the incorporation of prior pathways information into a statistical model is expected to increase the power to detect true associations in a genetic association study. Most existing pathways-based methods rely on marginal SNP statistics and do not fully exploit the dependence patterns among SNPs within pathways.
We use a sparse regression model, with SNPs grouped into pathways, to identify causal pathways associated with a quantitative trait. Notable features of our “pathways group lasso with adaptive weights” (P-GLAW) algorithm include the incorporation of all pathways in a single regression model, an adaptive pathway weighting procedure that accounts for factors biasing pathway selection, and the use of a bootstrap sampling procedure for the ranking of important pathways. P-GLAW takes account of the presence of overlapping pathways and uses a novel combination of techniques to optimise model estimation, making it fast to run, even on whole genome datasets.
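The abstract does not spell out the estimation steps, so the following is only a sketch of the standard building block of group lasso solvers: the group soft-thresholding operator, which shrinks all coefficients of one pathway together and sets the whole group to zero when its norm falls below a threshold. Treating the threshold as the penalty times an adaptive pathway weight is an assumption made for illustration.

```python
import math


def group_soft_threshold(beta_group, threshold):
    """Proximal operator of threshold * ||beta_group||_2.

    beta_group : coefficients of the SNPs in one pathway
    threshold  : assumed here to be lambda times the pathway's adaptive weight
    """
    norm = math.sqrt(sum(b * b for b in beta_group))
    if norm <= threshold:
        return [0.0] * len(beta_group)  # the whole pathway drops out of the model
    scale = 1.0 - threshold / norm
    return [scale * b for b in beta_group]
```

A larger weight for a pathway raises its threshold and makes it harder for that pathway to be selected, which is one way a weighting scheme can offset selection bias.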
In a comparison study with an alternative pathways method based on univariate SNP statistics, our method demonstrates high sensitivity and specificity for the detection of important pathways, showing the greatest relative gains in performance where marginal SNP effect sizes are small.
pathways; GWAS; quantitative traits; group lasso; penalised regression; Alzheimer’s disease; imaging genetics
Though rooted in genomic expression studies, pathway analysis for genome-wide association studies (GWAS) has gained increasing popularity, since it has the potential to discover hidden disease pathogenic mechanisms by combining statistical methods with biological knowledge. Generally, algorithms or programs proposed recently can be categorized by different types of input data, null hypothesis or counts of analysis stages. Due to complexity caused by SNP, gene and pathway relationships, re-sampling strategies like permutation are always utilized to derive an empirical distribution for test statistics for evaluating the significance of candidate pathways. However, evaluation of these algorithms on real GWAS datasets and real biological pathway databases needs to be addressed before we apply them widely with confidence.
Two algorithms which use summary statistics from GWAS as input were implemented in KGG, a novel and user-friendly software tool for GWAS pathway analysis. Comparisons of these two algorithms as well as the other five selected algorithms were conducted by analyzing the WTCCC Crohn's Disease dataset utilizing the MsigDB canonical pathways. As a result of using permutation to obtain empirical p-value, most of these methods could control Type I error rate well, although some are conservative. However, the methods varied greatly in terms of power and running time, with the PLINK truncated set-based test being the most powerful and KGG being the fastest.
Raw data-based algorithms, such as those implemented in PLINK, are preferable for GWAS pathway analysis as long as computational capacity is available. It may be worthwhile to apply two or more pathway analysis algorithms on the same GWAS dataset, since the methods differ greatly in their outputs and might provide complementary findings for the studied complex disease.
Single nucleotide polymorphisms (SNPs) in genes derived from distinct pathways are associated with breast cancer risk. Identifying possible SNP-SNP interactions in genome-wide case–control studies is an important task when investigating genetic factors that influence common complex traits; the effects of SNP-SNP interaction need to be characterized. Furthermore, observations of the complex interplay (interactions) between SNPs for high-dimensional combinations are still computationally and methodologically challenging. An improved branch and bound algorithm with feature selection (IBBFS) is introduced to identify SNP combinations with a maximal difference of allele frequencies between the case and control groups in breast cancer, i.e., the high/low risk combinations of SNPs.
Data from 220 breast cancer cases and 334 controls are used to test IBBFS and identify significant SNP combinations. We used the odds ratio (OR) as a quantitative measure to estimate the associated cancer risk of multiple SNP combinations to identify the complex biological relationships underlying the progression of breast cancer, i.e., the most likely SNP combinations. Experimental results show that, in the low risk groups, the estimated odds ratios of the best SNP combinations with genotypes are significantly smaller than 1 (between 0.165 and 0.657) for specific combinations of the tested SNPs. In the high risk groups, the estimated odds ratios of the predicted SNP combinations with genotypes are significantly greater than 1 (between 2.384 and 6.167) for specific combinations of the tested SNPs.
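For readers unfamiliar with the measure, an odds ratio and an approximate 95% confidence interval for carriers versus non-carriers of a SNP combination can be computed from a 2x2 table as in the sketch below. It uses the standard log-odds (Woolf) interval; the function name is hypothetical and any counts passed to it are illustrative, not data from the study.

```python
import math


def odds_ratio_ci(carrier_cases, noncarrier_cases, carrier_controls, noncarrier_controls, z=1.96):
    """Odds ratio with a Woolf (log-scale) confidence interval.

    All four cell counts must be positive; z = 1.96 gives roughly 95% coverage.
    """
    a, b, c, d = carrier_cases, noncarrier_cases, carrier_controls, noncarrier_controls
    if min(a, b, c, d) <= 0:
        raise ValueError("all cell counts must be positive")
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper
```

An interval lying entirely above 1 points to a high risk combination, and one lying entirely below 1 to a low risk combination.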
This study proposes an effective high-speed method to analyze SNP-SNP interactions in breast cancer association studies. A number of important SNPs are found to be significant for the high/low risk group. They can thus be considered a potential predictor for breast cancer association.
Summary: Traditional methods of genetic study design and analysis work well under the scenario that a handful of single nucleotide polymorphisms (SNPs) independently contribute to the risk of disease. For complex diseases, susceptibility may be determined not by a single SNP, but rather a complex interplay between SNPs. For large studies involving hundreds of thousands of SNPs, a brute force search of all possible combinations of SNPs associated with disease is not only inefficient, but also results in a multiple testing paradigm, whereby larger and larger sample sizes are needed to maintain statistical power. Pathway-based methods are an example of one of the many approaches in identifying a subset of SNPs to test for interaction. To help determine which SNP–SNP interactions to test, we developed Path, a software application designed to help researchers interface their data with biological information from several bioinformatics resources. To this end, our application brings together currently available information from nine online bioinformatics resources including the National Center for Biotechnology Information (NCBI), Online Mendelian Inheritance in Man (OMIM), Kyoto Encyclopedia of Genes and Genomes (KEGG), UCSC Genome Browser, Seattle SNPs, PharmGKB, Genetic Association Database, the Single Nucleotide Polymorphism database (dbSNP) and the Innate Immune Database (IIDB).
Availability: The software, example datasets and tutorials are freely available from http://genapha.icapture.ubc.ca/PathTutorial.
Background and Aims
The immune system is likely to play a key role in the etiology of gliomas. Genetic polymorphisms in the mannose-binding lectin gene, a key activator in the lectin complement pathway, have been associated with risk of several cancers.
To examine the role of the lectin complement pathway, we combined data from prospectively collected cohorts with available DNA specimens. Using a nested case-control design, we genotyped 85 single nucleotide polymorphisms (SNPs) in 9 genes in the lectin complement pathway (3 additional SNPs in MBL2 were tested post hoc). Initial SNPs were selected using tagging SNPs for haplotypes; the second group of SNPs for MBL2 was selected based on functional SNPs related to phenotype. Associations were examined using logistic regression analysis. All statistical tests were two-sided. Nominal p-values are presented and are not corrected for multiple comparisons.
A total of 143 glioma cases and 419 controls were available for this analysis. Statistically significant associations were observed for two SNPs in the mannose-binding lectin 2 (MBL2) gene and risk of glioma (rs1982266 and rs1800450, test for trend p = 0.003 and p = 0.04, respectively, using the additive model). One of these SNPs, rs1800450, was associated with a 58% increase in glioma risk among those carrying one or two mutated alleles (odds ratio = 1.58, 95% confidence interval = 0.99–2.54), compared to those homozygous for the wild type allele.
Overall, our findings suggest that MBL may play a role in the etiology of glioma. Future studies are needed to confirm these findings which may be due to chance, and if reproduced, to determine mechanisms that link glioma pathogenesis with the MBL complement pathway.
Relationships are unclear between polymorphisms in genes involved in metabolism and detoxification of various chemicals and papillary thyroid cancer (PTC) risk as well as their potential modification by alcohol or tobacco intake. We evaluated associations between 1647 tagging single nucleotide polymorphisms (SNPs) in 132 candidate genes/regions involved in metabolism of exogenous and endogenous compounds (Phase I/II, oxidative stress, and metal binding pathways) and PTC risk in 344 PTC cases and 452 controls. For 15 selected regions and their respective SNPs, we also assessed interaction with alcohol and tobacco use. Logistic regression models were used to evaluate the main effect of SNPs (Ptrend) and interaction with alcohol/tobacco intake. Gene- and pathway-level associations and interactions (Pgene interaction) were evaluated by combining Ptrend values using the adaptive rank-truncated product method. While we found associations between PTC risk and nine SNPs (Ptrend≤0.01) and seven genes/regions (Pregion<0.05), none remained significant after correction for the false discovery rate. We found a significant interaction between UGT2B7 and NAT1 genes and alcohol intake (Pgene interaction=0.01 and 0.02 respectively) and between the CYP26B1 gene and tobacco intake (Pgene interaction=0.02). Our results are suggestive of interaction between the genetic polymorphisms in several detoxification genes and alcohol or tobacco intake on risk of PTC. Larger studies with improved exposure assessment should address potential modification of PTC risk by alcohol and tobacco intake to confirm or refute our findings.
Pathway analysis has been proposed as a complement to single SNP analyses in GWAS. This study compared pathway analysis methods using two lung cancer GWAS data sets based on four studies: one a combined data set from Central Europe and Toronto (CETO); the other a combined data set from Germany and MD Anderson (GRMD). We searched the literature for pathway analysis methods that were widely used, representative of other methods, and had available software for performing analysis. We selected the programs EASE, which uses a modified Fisher's exact calculation to test for pathway associations, GenGen (a version of Gene Set Enrichment Analysis (GSEA)), which uses a Kolmogorov-Smirnov-like running sum statistic as the test statistic, and SLAT, which uses a p-value combination approach. We also included a modified version of the SUMSTAT method (mSUMSTAT), which tests for association by averaging χ2 statistics from genotype association tests. There were nearly 18,000 genes available for analysis, following mapping of more than 300,000 SNPs from each data set. These were mapped to 421 GO level 4 gene sets for pathway analysis. Among the methods designed to be robust to biases related to gene size and pathway SNP correlation (GenGen, mSUMSTAT and SLAT), the mSUMSTAT approach identified the most significant pathways (8 in CETO and 1 in GRMD). This included a highly plausible association for the acetylcholine receptor activity pathway in both CETO (FDR≤0.001) and GRMD (FDR = 0.009), although two strong association signals at a single gene cluster (CHRNA3-CHRNA5-CHRNB4) drive this result, complicating its interpretation. Few other replicated associations were found using any of these methods. Difficulty in replicating associations hindered our comparison, but results suggest mSUMSTAT has advantages over the other approaches, and may be a useful pathway analysis tool to use alongside other methods such as the commonly used GSEA (GenGen) approach.
Interpreting Genome-Wide Association Studies (GWAS) at a gene level is an important step towards understanding the molecular processes that lead to disease. In order to incorporate prior biological knowledge such as pathways and protein interactions in the analysis of GWAS data it is necessary to derive one measure of association for each gene. We compare three different methods to obtain gene-wide test statistics from Single Nucleotide Polymorphism (SNP) based association data: choosing the test statistic from the most significant SNP; the mean test statistics of all SNPs; and the mean of the top quartile of all test statistics. We demonstrate that the gene-wide test statistics can be controlled for the number of SNPs within each gene and show that all three methods perform considerably better than expected by chance at identifying genes with confirmed associations. By applying each method to GWAS data for Crohn's Disease and Type 1 Diabetes we identified new potential disease genes.
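The three gene-wide summaries compared above can be sketched as follows, assuming each SNP in a gene has a test statistic where larger values mean stronger evidence (for example a chi-square statistic). The rounding rule for the top quartile and the function name are assumptions, and the adjustment for the number of SNPs per gene mentioned in the abstract is not shown.

```python
import math


def gene_wide_statistics(snp_stats):
    """Summarise one gene's SNP statistics in three ways.

    Returns the statistic of the most significant SNP, the mean over all
    SNPs, and the mean of the top quartile (at least one SNP, rounded up).
    """
    if not snp_stats:
        raise ValueError("gene has no SNP statistics")
    ordered = sorted(snp_stats, reverse=True)
    n = len(ordered)
    top_n = max(1, math.ceil(n / 4))
    return {
        "max": ordered[0],
        "mean": sum(ordered) / n,
        "top_quartile_mean": sum(ordered[:top_n]) / top_n,
    }
```

All three summaries tend to grow with the number of SNPs in a gene, which is why a correction for gene size is needed before genes are compared.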
Studying the association between quantitative phenotypes (such as height or weight) and single nucleotide polymorphisms (SNPs) is an important problem in biology. To understand the underlying mechanisms of complex phenotypes, it is often necessary to consider joint genetic effects across multiple SNPs. The ANOVA (analysis of variance) test is routinely used in association studies. Important findings from studying gene-gene (SNP-pair) interactions are appearing in the literature. However, the number of SNPs can be up to millions. Evaluating joint effects of SNPs is a challenging task even for SNP-pairs. Moreover, with a large number of correlated SNPs, a permutation procedure is preferred over simple Bonferroni correction for properly controlling the family-wise error rate and retaining mapping power, which dramatically increases the computational cost of association studies.
In this paper, we study the problem of finding SNP-pairs that have significant associations with a given quantitative phenotype. We propose an efficient algorithm, FastANOVA, for performing ANOVA tests on SNP-pairs in a batch mode, which also supports large permutation tests. We derive an upper bound of the SNP-pair ANOVA test statistic, which can be expressed as the sum of two terms. The first term is based on the single-SNP ANOVA test. The second term is based on the SNPs and is independent of any phenotype permutation. Furthermore, SNP-pairs can be organized into groups, each of which shares a common upper bound. This allows for maximum reuse of intermediate computation, efficient upper bound estimation, and effective SNP-pair pruning. Consequently, FastANOVA only needs to perform the ANOVA test on a small number of candidate SNP-pairs without the risk of missing any significant ones. Extensive experiments demonstrate that FastANOVA is orders of magnitude faster than the brute-force implementation of ANOVA tests on all SNP pairs.
Association study; ANOVA test
Genome-wide association studies (GWAS) are now widely used to identify genes involved in human complex disease. The standard GWAS analysis examines SNPs/genes independently and identifies only a small number of the most significant SNPs. It ignores the combined effect of weaker SNPs/genes, which makes it difficult to explore biological function and mechanism from a systems point of view. Although gene set enrichment analysis (GSEA) has been introduced to GWAS to overcome these limitations by identifying the correlation between pathways/gene sets and traits, its heavy dependence on genotype data, which are not easily available for most published GWAS investigations, has limited its application. In order to perform GSEA on a simple list of GWAS SNP P-values, we implemented GSEA by using SNP label permutation. We further improved GSEA (i-GSEA) by focusing on pathways/gene sets with a high proportion of significant genes. To provide researchers an open platform to analyze GWAS data, we developed the i-GSEA4GWAS (improved GSEA for GWAS) web server. i-GSEA4GWAS implements the i-GSEA approach and aims to provide new insights in complex disease studies. i-GSEA4GWAS is freely available at http://gsea4gwas.psych.ac.cn/.
Common genetic variation may play an important role in altering lung cancer risk. We conducted a pathway-based candidate gene evaluation to identify genetic variations that may be associated with lung cancer in a population-based case–control study in Xuan Wei, China (122 cases and 111 controls). A total of 1260 single-nucleotide polymorphisms (SNPs) in 380 candidate genes for lung cancer were successfully genotyped and assigned to one of 10 pathways based on gene ontology. Logistic regression was used to assess the marginal effect of each SNP on lung cancer susceptibility. The minP test was used to identify statistically significant associations at the gene level. Important pathways were identified using a test of proportions and the rank truncated product methods. The cell cycle pathway was found as the most important pathway (P = 0.044) with four genes significantly associated with lung cancer (PLA2G6 minP = 0.001, CCNA2 minP = 0.006, GSK3β minP = 0.007 and EGF minP = 0.013), after adjusting for multiple comparisons. Interestingly, most cell cycle genes that were associated with lung cancer in this analysis were concentrated in the AKT signaling pathway, which is essential for regulation of cell cycle progression and cellular survival, and may be important in lung cancer etiology in Xuan Wei. These results should be viewed as exploratory until they are replicated in a larger study.
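The rank truncated product used for the pathway-level test combines only the K smallest p-values in a set. The sketch below computes that statistic on the log scale and approximates its significance by simulating independent uniform p-values. The independence assumption, the simulation settings and the function names are illustrative choices; a real analysis would typically permute phenotypes to respect the correlation between SNPs.

```python
import math
import random


def rtp_statistic(p_values, k):
    """Log of the product of the k smallest p-values (each must be > 0)."""
    smallest = sorted(p_values)[:k]
    return sum(math.log(p) for p in smallest)


def rtp_p_value(p_values, k, n_sim=2000, seed=0):
    """Monte Carlo p-value for the rank truncated product.

    The null distribution is simulated from independent uniform p-values,
    which is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    observed = rtp_statistic(p_values, k)
    m = len(p_values)
    hits = 0
    for _ in range(n_sim):
        # 1 - random() lies in (0, 1], which avoids log(0)
        null = [1.0 - rng.random() for _ in range(m)]
        if rtp_statistic(null, k) <= observed:
            hits += 1
    return (hits + 1) / (n_sim + 1)
```

Smaller products indicate stronger evidence, so the p-value counts simulated statistics at or below the observed one.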
Genome-wide association studies commonly involve simultaneous tests of millions of single nucleotide polymorphisms (SNP) for disease association. The SNPs in nearby genomic regions, however, are often highly correlated due to linkage disequilibrium (LD, a genetic term for correlation). Simple Bonferroni correction for multiple comparisons is therefore too conservative. Permutation tests, which are often employed in practice, are both computationally expensive for genome-wide studies and limited in their scope. We present an accurate and computationally efficient method, based on Poisson de-clumping heuristics, for approximating genome-wide significance of SNP associations. Compared with permutation tests and other multiple comparison adjustment approaches, our method computes the most accurate and robust p-value adjustments for millions of correlated comparisons within seconds. We demonstrate analytically that the accuracy and the efficiency of our method are nearly independent of the sample size, the number of SNPs, and the scale of p-values to be adjusted. In addition, our method can be easily adapted to estimate the false discovery rate. When applied to genome-wide SNP datasets, we observed highly variable p-value adjustment results evaluated from different genomic regions. The variation in adjustments along the genome, however, is well conserved between the European and the African populations. The p-value adjustments are significantly correlated with LD among SNPs, recombination rates, and SNP densities. Given the large variability of sequence features in the genome, we further discuss a novel approach of using SNP-specific (local) thresholds to detect genome-wide significant associations. This article has supplementary material online.
Genome-wide association study; Multiple comparison; Poisson approximation
In a genetic association study, it is often desirable to perform an overall test of whether any or all single-nucleotide polymorphisms (SNPs) in a gene are associated with a phenotype. Several such tests exist, but most of them are powerful only under very specific assumptions about the genetic effects of the individual SNPs. In addition, some of the existing tests assume that the direction of the effect of each SNP is known, which is a highly unlikely scenario. Here we propose a new kernel-based association test (KBAT) of joint association of several SNPs. Our test is non-parametric and robust, and does not make any assumption about the directions of individual SNP effects. It can be used to test multiple correlated SNPs within a gene and can also be used to test independent SNPs or genes in a biological pathway. Our test uses an analysis of variance (ANOVA) paradigm to compare variation between cases and controls to the variation within the groups. The variation is measured using kernel functions for each marker, and then a composite statistic is constructed to combine the markers into a single test. We present simulation results comparing our statistic to the U-statistic based method by Schaid et al. and another statistic by Wessel and Schork. We consider a variety of different disease models and assumptions about how many SNPs within the gene are actually associated with disease. Our results indicate that our statistic has higher power than other statistics under most realistic conditions.
genetic similarity; association study; multilocus association
Genome-wide association studies (GWAS) testing several hundred thousand SNPs have been performed in multiple sclerosis (MS) and other complex diseases. Typically, the number of markers in which the evidence for association exceeds the genome-wide significance threshold is very small, and markers that do not exceed this threshold are generally neglected. Classical statistical analysis of these datasets in MS revealed genes with known immunological functions. However, many of the markers showing modest association may represent false negatives. We hypothesize that certain combinations of genes flagged by these markers can be identified if they belong to a common biological pathway. Here we conduct a pathway-oriented analysis of two GWAS in MS that takes into account all SNPs with nominal evidence of association (P < 0.05). Gene-wise P-values were superimposed on a human protein interaction network and searches were conducted to identify sub-networks containing a higher proportion of genes associated with MS than expected by chance. These sub-networks, and others generated at random as a control, were categorized for membership of biological pathways. GWAS from eight other diseases were analyzed to assess the specificity of the pathways identified. In the MS datasets, we identified sub-networks of genes from several immunological pathways including cell adhesion, communication and signaling. Remarkably, neural pathways, namely axon-guidance and synaptic potentiation, were also over-represented in MS. In addition to the immunological pathways previously identified, we report here for the first time the potential involvement of neural pathways in MS susceptibility.
Genome-wide association studies (GWAS) are now used routinely to identify SNPs associated with complex human phenotypes. In several cases, multiple variants within a gene contribute independently to disease risk. Here we introduce a novel Gene-Wide Significance (GWiS) test that uses greedy Bayesian model selection to identify the independent effects within a gene, which are combined to generate a stronger statistical signal. Permutation tests provide p-values that correct for the number of independent tests genome-wide and within each genetic locus. When applied to a dataset comprising 2.5 million SNPs in up to 8,000 individuals measured for various electrocardiography (ECG) parameters, this method identifies more validated associations than conventional GWAS approaches. The method also provides, for the first time, systematic assessments of the number of independent effects within a gene and the fraction of disease-associated genes housing multiple independent effects, observed at 35%–50% of loci in our study. This method can be generalized to other study designs, retains power for low-frequency alleles, and provides gene-based p-values that are directly compatible for pathway-based meta-analysis.
Genome-wide association studies (GWAS) have successfully identified genetic variants associated with complex human phenotypes. Despite a proliferation of analysis methods, most studies rely on simple, robust SNP–by–SNP univariate tests with ever-larger population sizes. Here we introduce a new test motivated by the biological hypothesis that a single gene may contain multiple variants that contribute independently to a trait. Applied to simulated phenotypes with real genotypes, our new method, Gene-Wide Significance (GWiS), has better power to identify true associations than traditional univariate methods, previous Bayesian methods, popular L1 regularized (LASSO) multivariate regression, and other approaches. GWiS retains power for low-frequency alleles that are increasingly important for personal genetics, and it is the only method tested that accurately estimates the number of independent effects within a gene. When applied to human data for multiple ECG traits, GWiS identifies more genome-wide significant loci (verified by meta-analyses of much larger populations) than any other method. We estimate that 35%–50% of ECG trait loci are likely to have multiple independent effects, suggesting that our method will reveal previously unidentified associations when applied to existing data and will improve power for future association studies.
Any given single nucleotide polymorphism (SNP) in a genome may have little or no functional impact. A biologically significant effect may emerge only when a number of key SNP-related genotypes occur together in a single organism. Thus, in the analysis of many SNPs in association studies of complex diseases, it may be useful to look at combinations of genotypes. Genes related to signal transmission, e.g., ion channel genes, may be of interest in this respect in the context of bipolar disorder. In the present study, we analysed 803 SNPs in 55 genes related to aspects of signal transmission and calculated all combinations of three genotypes from the 3×803 SNP genotypes for 1355 controls and 607 patients with bipolar disorder. Four clusters of patient-specific combinations were identified. Permutation tests indicated that some of these combinations might be related to bipolar disorder. The WTCCC bipolar dataset was used for replication; 469 of the 803 SNPs were present in the WTCCC dataset either directly (n = 132) or by imputation (n = 337), covering 51 of our selected genes. We found three clusters of patient-specific 3×SNP combinations in the WTCCC dataset. Different SNPs were involved in the clusters in the two datasets. The present analyses of the combinations of SNP genotypes support a role for both genetic heterogeneity and interactions in the genetic architecture of bipolar disorder.
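The search for patient-specific combinations of three genotypes can be sketched as a brute-force enumeration. This is only an illustration under stated assumptions: genotypes are coded 0/1/2, a combination is "patient-specific" when it occurs in at least one patient and in no control, and the function names are hypothetical. With 803 SNPs each individual carries roughly 86 million triples, so a study-sized analysis would need a far more economical algorithm than this toy version.

```python
from itertools import combinations

def genotype_triples(individual):
    """All sets of three (snp_index, genotype) pairs carried by one individual."""
    items = list(enumerate(individual))
    return set(combinations(items, 3))

def patient_specific_triples(patients, controls):
    """Count, per triple, the patients carrying it, keeping triples absent from all controls."""
    seen_in_controls = set()
    for person in controls:
        seen_in_controls |= genotype_triples(person)
    counts = {}
    for person in patients:
        for triple in genotype_triples(person) - seen_in_controls:
            counts[triple] = counts.get(triple, 0) + 1
    return counts

# Toy data: each tuple holds one individual's genotypes at four SNPs.
toy_patients = [(0, 1, 2, 0), (0, 1, 2, 1)]
toy_controls = [(0, 1, 1, 0), (2, 1, 2, 0)]
toy_counts = patient_specific_triples(toy_patients, toy_controls)
```

The resulting counts could then be compared against counts obtained after permuting case/control labels, in the spirit of the permutation tests mentioned above.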
The genetic basis for bipolar disorder (BPD) is complex with the involvement of multiple genes. As it is well established that cyclic adenosine monophosphate (cAMP) signaling regulates behavior, we tested variants in 29 genes that encode components of this signaling pathway for associations with BPD type I (BPD I) and BPD type II (BPD II). A total of 1172 individuals with BPD I, 516 individuals with BPD II and 1728 controls were analyzed. Single SNP (single-nucleotide polymorphism), haplotype and SNP × SNP interactions were examined for association with BPD. Several statistically significant single-SNP associations were observed between BPD I and variants in the PDE10A gene and between BPD II and variants in the DISC1 and GNAS genes. Haplotype analysis supported the conclusion that variation in these genes is associated with BPD. We followed up PDE10A's association with BPD I by sequencing a 23-kb region in 30 subjects homozygous for seven minor allele risk SNPs and discovered eight additional rare variants (minor allele frequency <1%). These single-nucleotide variants were genotyped in 999 BPD cases and 801 controls. We obtained a significant association for these variants in the combined sample using multiple methods for rare variant analysis. After using newly developed methods to account for potential bias from sequencing BPD cases only, the results remained significant. In addition, SNP × SNP interaction studies suggested that variants in several cAMP signaling pathway genes interact to increase the risk of BPD. This report is among the first to use multiple rare variant analysis methods following common tagSNP associations with BPD.
bipolar disorder; cAMP signaling; DISC1; GNAS; PDE10A
Genome-wide association studies (GWAS) with hundreds of thousands of single nucleotide polymorphisms (SNPs) are popular strategies to reveal the genetic basis of human complex diseases. Despite many successes of GWAS, it is well recognized that new analytical approaches have to be integrated to achieve their full potential. Starting with a list of SNPs found to be associated with disease in GWAS, we propose a novel methodology to identify functionally important KEGG pathways through the genes within them that are implicated by SNP analysis. Our methodology is based on functionalization of important SNPs to identify affected genes and disease-related pathways. We have tested our methodology on the WTCCC Rheumatoid Arthritis (RA) dataset and identified: i) previously known RA-related KEGG pathways (e.g., Toll-like receptor signaling, Jak-STAT signaling, Antigen processing, Leukocyte transendothelial migration and MAPK signaling pathways); ii) additional KEGG pathways (e.g., Pathways in cancer, Neurotrophin signaling, Chemokine signaling pathways) as associated with RA. Furthermore, these newly found pathways included genes which are targets of RA-specific drugs. Whereas GWAS analysis alone identified 14 of the 83 genes known to be used as drug targets for the treatment of RA, the newly found functionally important KEGG pathways led to the discovery of 25 of the 83. Within previously known pathways (e.g., Antigen processing and presentation, Tight junction), we identified additional genes associated with RA. Importantly, the associations between RA and some of these additionally found genes, such as HLA-C, HLA-G, PRKCQ, PRKCZ, TAP1 and TAP2, were verified either in the OMIM database or in literature retrieved from the NCBI PubMed module.
With whole-genome sequencing on the horizon, we show that the full potential of GWAS can be achieved by integrating pathway and network-oriented analysis with prior knowledge of the functional properties of SNPs.
The development of high-throughput genotyping technologies, which enable the simultaneous genotyping of hundreds of thousands of single nucleotide polymorphisms (SNPs), has the potential to increase the benefits of genetic epidemiology studies. Although the enhanced resolution of these platforms increases the chance of interrogating functional SNPs that are themselves causative or in linkage disequilibrium with causal SNPs, commonly used single SNP-association approaches suffer from serious multiple hypothesis testing problems and provide limited insights into combinations of loci that may contribute to complex diseases. Drawing inspiration from Gene Set Enrichment Analysis developed for gene expression data, we have developed a method, named GLOSSI (Gene-loci Set Analysis), that integrates prior biological knowledge into the statistical analysis of genotyping data to test the association of a group of SNPs (loci-set) with complex disease phenotypes. The most significant loci-sets can be used to formulate hypotheses from a functional viewpoint that can be validated experimentally.
In a simulation study, GLOSSI showed sufficient power to detect loci-sets with less than 10% of SNPs having moderate-to-large effect sizes and intermediate minor allele frequency values. When applied to a biological dataset where no single SNP-association was found in a previous study, GLOSSI was able to identify several loci-sets that are significantly related to blood pressure response to an antihypertensive drug.
GLOSSI is valuable for association of SNPs at multiple genetic loci with complex disease phenotypes. In contrast to methods based on the Kolmogorov-Smirnov statistic, the approach is parametric and only utilizes information from within the interrogated loci-set. It properly accounts for dependency among SNPs and allows the testing of loci-sets of any size.
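A familiar example of a parametric test that uses only the SNPs inside the interrogated set is Fisher's combination of p-values. The sketch below is offered as an illustration of that general idea, not as the GLOSSI procedure: the abstract does not state which statistic GLOSSI uses, and this version assumes independent p-values, so it makes no attempt at the dependency correction described above. Function names are hypothetical.

```python
import math

def chi2_sf_even_dof(x, dof):
    """Upper-tail probability of a chi-square variable with even dof (closed form)."""
    k = dof // 2
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i  # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total

def fisher_combined_p(p_values):
    """Fisher's statistic -2 * sum(log p), referred to chi-square with 2k dof."""
    stat = -2.0 * sum(math.log(p) for p in p_values)
    return stat, chi2_sf_even_dof(stat, 2 * len(p_values))

# Toy loci-set of single-SNP p-values (assumed independent).
toy_stat, toy_set_p = fisher_combined_p([0.04, 0.20, 0.01, 0.30])
```

When SNPs in a set are in linkage disequilibrium, the chi-square reference distribution used here is too optimistic, which is why a method intended for real loci-sets has to account for dependency.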
Many complex diseases are influenced by genetic variations in multiple genes, each with only a small marginal effect on disease susceptibility. Pathway analysis, which identifies biological pathways associated with disease outcome, has become increasingly popular for genome-wide association studies (GWAS). In addition to combining weak signals from a number of SNPs in the same pathway, results from pathway analysis also shed light on the biological processes underlying disease. We propose a new pathway-based analysis method for GWAS, the supervised principal component analysis (SPCA) model. In the proposed SPCA model, a selected subset of SNPs most associated with disease outcome is used to estimate the latent variable for a pathway. The estimated latent variable for each pathway is an optimal linear combination of a selected subset of SNPs; therefore, the proposed SPCA model provides the ability to borrow strength across the SNPs in a pathway. In addition to identifying pathways associated with disease outcome, SPCA also carries out additional within-category selection to identify the most important SNPs within each gene set. The proposed model operates in a well-established statistical framework and can handle design information such as covariate adjustment and matching information in GWAS. We compare the proposed method with currently available methods using data with realistic linkage disequilibrium structures and we illustrate the SPCA method using the Wellcome Trust Case-Control Consortium Crohn Disease (CD) dataset.
SNPs; genome-wide association; pathway analysis; principal component analysis
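The two steps described for SPCA, selecting the SNPs most associated with outcome and then estimating a latent pathway variable as a linear combination of them, can be sketched as follows. This is a simplified illustration, not the published model: the correlation threshold, the use of Pearson correlation as the screening score, and the power-iteration routine for the first principal component are all assumptions, and the sketch omits covariate adjustment and any inference that accounts for the selection step.

```python
import math

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def first_principal_component(columns, n_iter=200):
    """Leading eigenvector of the cross-product matrix via power iteration."""
    centered = [[v - sum(col) / len(col) for v in col] for col in columns]
    p = len(centered)
    cov = [[sum(a * b for a, b in zip(centered[i], centered[j]))
            for j in range(p)] for i in range(p)]
    w = [1.0] * p  # assumes this start is not orthogonal to the leading eigenvector
    for _ in range(n_iter):
        w = [sum(cov[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    n = len(centered[0])
    scores = [sum(w[i] * centered[i][r] for i in range(p)) for r in range(n)]
    return w, scores

def supervised_pc(genotypes, outcome, threshold=0.3):
    """Screen SNPs by |correlation| with outcome, then take the first PC of those kept."""
    columns = [list(col) for col in zip(*genotypes)]
    selected = [j for j, col in enumerate(columns)
                if abs(pearson(col, outcome)) >= threshold]
    weights, scores = first_principal_component([columns[j] for j in selected])
    return selected, weights, scores

# Toy pathway: 8 individuals (rows) by 4 SNPs (columns), first 4 rows are cases.
toy_genotypes = [
    [2, 1, 0, 1], [2, 2, 1, 0], [1, 2, 2, 1], [2, 1, 1, 0],
    [0, 0, 1, 1], [1, 0, 0, 0], [0, 1, 1, 1], [0, 0, 2, 0],
]
toy_outcome = [1, 1, 1, 1, 0, 0, 0, 0]
toy_selected, toy_weights, toy_scores = supervised_pc(toy_genotypes, toy_outcome)
```

The pathway-level test would then relate `toy_scores` to the outcome; the SNP weights indicate which of the selected SNPs contribute most to the latent variable.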