Gene-set analysis (GSA) methods are used as complementary approaches to genome-wide association studies (GWASs). The single marker association estimates of a predefined set of genes are either contrasted with those of all remaining genes or with a null non-associated background. To pool the p-values from several GSAs, it is important to take into account the concordance of the observed patterns resulting from single marker association point estimates across any given gene set. Here we propose META-GSA, an enhanced version of Fisher’s inverse χ2-method that weights each study to account for imperfect correlation between association patterns.
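For orientation, the unweighted starting point of this approach, Fisher's inverse χ2-method, can be sketched as follows. This is a minimal illustration only: the function name `fisher_combined_p` and the example p-values are hypothetical, and the study weighting that distinguishes META-GSA is not reproduced here because the text above does not specify it.

```python
import math

def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's inverse chi-square method.

    Hypothetical helper for illustration; it does NOT apply the study
    weights that META-GSA adds on top of this statistic.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    # Test statistic X = -2 * sum(ln p_i); under H0 it is chi-square with 2k df.
    half_x = -sum(math.log(p) for p in p_values)
    # For even degrees of freedom the chi-square survival function is closed-form:
    # P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half_x / j
        total += term
    return math.exp(-half_x) * total

# Made-up p-values from four studies, for illustration only.
study_p = [0.04, 0.10, 0.20, 0.03]
print(fisher_combined_p(study_p))
```

The closed-form tail avoids any dependency on a statistics library; with a single study the function simply returns that study's p-value.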
Simulation and Power
We investigated the performance of META-GSA by simulating GWASs with 500 cases and 500 controls at 100 diallelic markers in 20 different scenarios, simulating different relative risks between 1 and 1.5 in gene sets of 10 genes. Wilcoxon’s rank sum test was applied as GSA for each study. We found that META-GSA has greater power to discover truly associated gene sets than simple pooling of the p-values, e.g. 59% versus 37% when the true relative risk for 5 of 10 genes was assumed to be 1.5. Under the null hypothesis of no difference in the true association pattern between the gene set of interest and the set of remaining genes, the results of both approaches are almost uncorrelated. We recommend not relying on p-values alone when combining the results of independent GSAs.
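A rank sum test used as a competitive GSA compares the association scores of genes in the set with those of the remaining genes. The sketch below is a hedged illustration: the function names, the one-sided alternative, and the normal approximation without tie correction are assumptions of this example, not details taken from the study.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_test(set_scores, other_scores):
    """One-sided Wilcoxon rank sum test (normal approximation).

    Tests whether scores in the gene set tend to be larger than the
    scores of the remaining genes. No tie correction is applied to the
    variance (an assumption made to keep the sketch short).
    """
    n1, n2 = len(set_scores), len(other_scores)
    ranks = average_ranks(list(set_scores) + list(other_scores))
    w = sum(ranks[:n1])                       # rank sum of the gene set
    mean_w = n1 * (n1 + n2 + 1) / 2.0
    var_w = n1 * n2 * (n1 + n2 + 1) / 12.0
    z = (w - mean_w) / math.sqrt(var_w)
    p_upper = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(Z > z)
    return w, z, p_upper

# Made-up gene scores (e.g. -log10 p per gene), for illustration only.
print(rank_sum_test([5.0, 6.0, 7.0], [1.0, 2.0, 3.0, 4.0]))
```

For gene sets as small as the ten genes simulated here, an exact or permutation-based p-value may be preferable to the normal approximation.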
We applied META-GSA to pool the results of four case-control GWASs of lung cancer risk (Central European Study, Toronto/Lunenfeld-Tanenbaum Research Institute Study, German Lung Cancer Study and MD Anderson Cancer Center Study), which had already been analyzed separately with four different GSA methods (EASE, SLAT, mSUMSTAT and GenGen). This application revealed the pathway GO0015291 “transmembrane transporter activity” as significantly enriched with associated genes (GSA-method: EASE, p = 0.0315 corrected for multiple testing). Similar results were found for GO0015464 “acetylcholine receptor activity” but only when not corrected for multiple testing (all GSA-methods applied; p≈0.02).
Recently, gene set analysis (GSA) has been extended from use on gene expression data to use on single-nucleotide polymorphism (SNP) data in genome-wide association studies. When GSA has been demonstrated on SNP data, two popular statistics from gene expression data analysis (gene set enrichment analysis [GSEA] and Fisher's exact test [FET]) have been used. However, GSEA and FET have shown a lack of power and robustness in the analysis of gene expression data. The purpose of this work is to investigate whether the same issues are also true for the analysis of SNP data. Ultimately, we conclude that GSEA and FET are not optimal for the analysis of SNP data when compared with the SUMSTAT method. In analysis of real SNP data from the Framingham Heart Study, we find that SUMSTAT finds many more gene sets to be significant when compared with other methods. In an analysis of simulated data, SUMSTAT demonstrates high power and better control of the type I error rate. GSA is a promising approach to the analysis of SNP data in GWAS and use of the SUMSTAT statistic instead of GSEA or FET may increase power and robustness.
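As a rough illustration of a sum-type gene-set statistic, the sketch below adds up per-gene association statistics within a set and compares the total with randomly drawn gene sets of the same size. The names `sumstat` and `sumstat_empirical_p`, the toy statistics, and the use of random gene sets as the null are all assumptions of this example; the original work may define the statistic and its null distribution differently (for instance by permuting phenotypes).

```python
import random

def sumstat(gene_stats, gene_set):
    """Sum of per-gene association statistics over the genes in the set."""
    return sum(gene_stats[g] for g in gene_set)

def sumstat_empirical_p(gene_stats, gene_set, n_draws=2000, seed=0):
    """Empirical p-value against random gene sets of the same size.

    Assumption: random same-size gene sets stand in for the null; this
    is a simplification chosen for the sketch.
    """
    rng = random.Random(seed)
    genes = sorted(gene_stats)
    observed = sumstat(gene_stats, gene_set)
    size = len(gene_set)
    hits = 0
    for _ in range(n_draws):
        draw = rng.sample(genes, size)
        if sumstat(gene_stats, draw) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_draws + 1)

# Toy per-gene statistics (hypothetical values).
toy_stats = {f"g{i}": float(i) for i in range(20)}
print(sumstat_empirical_p(toy_stats, ["g17", "g18", "g19"]))
```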
Pathway-based analysis as an alternative approach can provide complementary information to single-marker genome-wide association studies (GWASs), which ignore epistasis and do not have sufficient power to find rare variants. In this study, using genotypes from a genome-wide association study (GWAS), pathway-based association studies were carried out by a modified Gene Set Enrichment Algorithm (GSEA) method (GenGen) for triglyceride in 1028 unrelated European-American extremely obese females (BMI≥35kg/m2) and normal-weight controls (BMI<25kg/m2), and another pathway association analysis (ICSNPathway) was also used to verify the GenGen result in the same data. The GO0009110 pathway (vitamin anabolism) was among the strongest associations with triglyceride (empirical P<0.001); the result remained significant after FDR correction (P = 0.022). MMAB, an obesity-related locus, is included in this pathway. The ABCG1 and BCL6 genes were found in several triglyceride-related pathways (empirical P<0.05), which were also replicated by ICSNPathway (empirical P<0.05, FDR<0.05). We also performed single-marker GWAS using PLINK for TG levels (log-transformed). Significant associations were found between ASTN2 gene SNPs and plasma triglyceride levels (rs7035794, P = 2.24×10−10). Our study suggested that the vitamin anabolism pathway, BCL6 gene pathways and the ASTN2 gene may contribute to the genetic variation of plasma triglyceride concentrations.
Pathway-based analysis has been verified as an alternative and effective approach to identifying disease-related genes or loci. To decipher the genetic background of plasma adiponectin levels, we performed genome-wide pathway-based association studies in extremely obese individuals and normal-weight controls. The modified Gene Set Enrichment Algorithm (GSEA) was used to perform the pathway-based analyses (the GenGen Program) in 746 European American females, who were collected from our previous GWAS in extremely obese (BMI > 35 kg/m2) and never-overweight (BMI<25 kg/m2) controls. The Rac1 cell motility signaling pathway was associated with plasma adiponectin after false-discovery rate (FDR) correction (empirical P < 0.001, FDR = 0.008, family-wise error rate = 0.008). Several other Rac1-centered pathways, such as cdc42racPathway (empirical P < 0.001) and hsa00603 (empirical P = 0.003), were among the top associations. The RAC1 pathway association was replicated by the ICSNPathway method, yielding an FDR of 0.002. Quantitative pathway analyses yielded similar results (empirical P = 0.001) for the Rac1 pathway, although it failed to pass the multiple test correction (FDR = 0.11). We further replicated our pathway associations in the ADIPOGen Consortium data by the GSA-SNP method. Our results suggest that Rac1 and related cell motility pathways might be associated with plasma adiponectin levels and biological functions of adiponectin.
The typical objective of genome-wide association (GWA) studies is to identify single-nucleotide polymorphisms (SNPs) and corresponding genes with the strongest evidence of association (the 'most-significant SNPs/genes' approach). Borrowing ideas from micro-array data analysis, we propose a new method, named RS-SNP, for detecting sets of genes enriched in SNPs moderately associated with the phenotype. RS-SNP assesses whether the number of significant SNPs, with p-value P ≤ α, belonging to a given SNP set is statistically significant. The rationale of the proposed method is that two kinds of null hypotheses are taken into account simultaneously. In the first null model the genotype and the phenotype are assumed to be independent random variables, and the null distribution gives the probability that the number of significant SNPs in the SNP set is greater than that observed by chance. The second null model assumes that the number of significant SNPs in the SNP set depends on the size of the set and not on the identity of the SNPs in it. Statistical significance is assessed using non-parametric permutation tests.
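The counting step and the second null model can be illustrated with a small sketch. It counts SNPs with P ≤ α in a set and, as an analytic stand-in for drawing random SNP sets of the same size, evaluates a hypergeometric upper tail. The function names and the toy p-values are hypothetical, and the hypergeometric shortcut is an assumption of this example; the method described above uses non-parametric permutation tests and also considers a first null model that is not shown here.

```python
from math import comb

def count_significant(snp_p, snp_set, alpha=0.05):
    """Number of SNPs in the set whose p-value is at most alpha."""
    return sum(1 for s in snp_set if snp_p[s] <= alpha)

def hypergeom_upper_tail(k, n_total, n_sig, n_drawn):
    """P(X >= k) when n_drawn SNPs are drawn at random without replacement
    from n_total SNPs of which n_sig are significant.

    Assumption: SNP sets of equal size are equally likely, which ignores
    linkage disequilibrium between SNPs.
    """
    top = min(n_sig, n_drawn)
    tail = sum(comb(n_sig, i) * comb(n_total - n_sig, n_drawn - i)
               for i in range(k, top + 1))
    return tail / comb(n_total, n_drawn)

# Toy genome-wide p-values with hypothetical SNP labels.
snp_p = {"snp1": 0.01, "snp2": 0.20, "snp3": 0.04, "snp4": 0.60,
         "snp5": 0.03, "snp6": 0.90, "snp7": 0.50, "snp8": 0.70}
snp_set = ["snp1", "snp3", "snp4"]
k = count_significant(snp_p, snp_set)
n_sig = count_significant(snp_p, list(snp_p))
print(k, hypergeom_upper_tail(k, len(snp_p), n_sig, len(snp_set)))
```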
We applied RS-SNP to the Crohn's disease (CD) data set collected by the Wellcome Trust Case Control Consortium (WTCCC) and compared the results with GENGEN, an approach recently proposed in literature. The enrichment analysis using RS-SNP and the set of pathways contained in the MSigDB C2 CP pathway collection highlighted 86 pathways rich in SNPs weakly associated to CD. Of these, 47 were also indicated to be significant by GENGEN. Similar results were obtained using the MSigDB C5 pathway collection. Many of the pathways found to be enriched by RS-SNP have a well-known connection to CD and often with inflammatory diseases.
The proposed method is a valuable alternative to other techniques for enrichment analysis of SNP sets. It is well founded from a theoretical and statistical perspective. Moreover, the experimental comparison with GENGEN highlights that it is more robust with respect to false positive findings.
Recently we have witnessed a surge of interest in using genome-wide association studies (GWAS) to discover the genetic basis of complex diseases. Many genetic variations, mostly in the form of single nucleotide polymorphisms (SNPs), have been identified in a wide spectrum of diseases, including diabetes, cancer, and psychiatric diseases. A common theme arising from these studies is that the genetic variations discovered by GWAS can only explain a small fraction of the genetic risks associated with the complex diseases. New strategies and statistical approaches are needed to address this lack of explanation. One such approach is pathway analysis, which considers the genetic variations underlying a biological pathway jointly, rather than separately as in traditional GWAS. A critical challenge in pathway analysis is how to combine evidence of association over multiple SNPs within a gene and multiple genes within a pathway. Most current methods choose the most significant SNP from each gene as a representative, ignoring the joint action of multiple SNPs within a gene. This approach leads to preferential identification of genes with a greater number of SNPs.
We describe a SNP-based pathway enrichment method for GWAS studies. The method consists of the following two main steps: 1) for a given pathway, using an adaptive truncated product statistic to identify all representative (potentially more than one) SNPs of each gene, calculating the average number of representative SNPs for the genes, then re-selecting the representative SNPs of genes in the pathway based on this number; and 2) ranking all selected SNPs by the significance of their statistical association with a trait of interest, and testing if the set of SNPs from a particular pathway is significantly enriched with high ranks using a weighted Kolmogorov-Smirnov test. We applied our method to two large genetically distinct GWAS data sets of schizophrenia, one from European-American (EA) and the other from African-American (AA). In the EA data set, we found 22 pathways with nominal P-value less than or equal to 0.001 and corresponding false discovery rate (FDR) less than 5%. In the AA data set, we found 11 pathways by controlling the same nominal P-value and FDR threshold. Interestingly, 8 of these pathways overlap with those found in the EA sample. We have implemented our method in a JAVA software package, called SNP Set Enrichment Analysis (SSEA), which contains a user-friendly interface and is freely available at http://cbcl.ics.uci.edu/SSEA.
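The enrichment test in step 2 can be illustrated with a GSEA-style weighted Kolmogorov-Smirnov running sum. This is a sketch under stated assumptions: the function name `enrichment_score`, the weighting exponent, and the toy scores are hypothetical, and the exact weighting used in SSEA may differ. Significance would still need to be assessed by permutation, which is omitted here.

```python
def enrichment_score(ranked_items, scores, item_set, weight=1.0):
    """Weighted Kolmogorov-Smirnov style enrichment score.

    ranked_items: SNP (or gene) identifiers sorted from most to least
                  strongly associated.
    scores:       mapping from identifier to association score.
    item_set:     identifiers belonging to the pathway.
    weight:       exponent applied to |score|; weight=0 gives the
                  classical unweighted KS running sum.
    """
    in_set = [item in item_set for item in ranked_items]
    n_hits = sum(in_set)
    n_miss = len(ranked_items) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("set must contain some, but not all, ranked items")
    hit_norm = sum(abs(scores[item]) ** weight
                   for item, hit in zip(ranked_items, in_set) if hit)
    running = 0.0
    best = 0.0
    for item, hit in zip(ranked_items, in_set):
        if hit:
            running += abs(scores[item]) ** weight / hit_norm  # step up on a hit
        else:
            running -= 1.0 / n_miss                            # step down on a miss
        if abs(running) > abs(best):
            best = running          # keep the largest deviation from zero
    return best

# Toy ranking with hypothetical identifiers and scores.
ranked = ["a", "b", "c", "d"]
toy_scores = {"a": 4.0, "b": 3.0, "c": 2.0, "d": 1.0}
print(enrichment_score(ranked, toy_scores, {"a", "b"}))
```

A score near +1 means the pathway's SNPs cluster at the top of the ranking, and a score near -1 means they cluster at the bottom.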
The SNP-based pathway enrichment method described here offers a new alternative approach for analysing GWAS data. By applying it to schizophrenia GWAS studies, we show that our method is able to identify statistically significant pathways, and importantly, pathways that can be replicated in large genetically distinct samples.
We conducted multilevel analyses to identify potential susceptibility loci for renal cell carcinoma (RCC), which may be overlooked in traditional genome-wide association studies (GWAS). A gene set enrichment analysis was performed utilizing a GWAS dataset comprised of 894 RCC cases and 1,516 controls using GenGen, SNP ratio test, and ALIGATOR. The antigen processing and presentation pathway was consistently significant (P = 0.001, P = 0.004, and P < 0.001, respectively). The versatile gene-based association study (VEGAS) approach was applied to the top-ranked pathway and identified the driving genes. By comparing the expression of the genes in RCC tumor and adjacent normal tissues, we observed significant overexpression of HLA genes in tumor tissues, which was also supported by public databases. We sought to validate genetic variants in the antigen processing and presentation pathway in an independent GWAS dataset comprised of 1,311 RCC cases and 3,424 control subjects from the National Cancer Institute; one SNP, rs1063355, was significant in both populations (Pmeta-analysis = 9.15 × 10−4, Pheterogeneity = 0.427). Strong correlation indicated that rs1063355 was a cis-expression quantitative trait locus (eQTL) associated with HLA-DQB1 expression (Spearman's rank r = −0.59, p = 5.61 × 10−6). The correlation was further validated using a public dataset. Our results highlighted the role of immune-related pathways and genes in the etiology of RCC.
RCC; GWAS; GSEA; eQTL
Genome-wide association studies (GWAS) are designed to identify individual regions associated with cancer risk, but only explain a small fraction of the inherited variability. An alternative approach that analyzes genetic variants within biological pathways has been proposed to discover networks of susceptibility genes with additional effects. Gene set enrichment analysis (GSEA) may complement and expand traditional GWAS analysis to identify novel genes and pathways associated with bladder cancer risk. We selected three GSEA methods (Gen-Gen, Aligator, and the SNP Ratio Test) to evaluate cellular signaling pathways involved in bladder cancer susceptibility in a Texas GWAS population. The candidate genetic polymorphisms from the significant pathways selected by GSEA were validated in an independent NCI GWAS. We identified 18 novel pathways (P < 0.05) significantly associated with bladder cancer risk. Five of the most promising pathways (P ≤ 0.001 in any of the three GSEA methods) among the 18 pathways included two cell cycle pathways and the neural cell adhesion molecule (NCAM), platelet-derived growth factor (PDGF), and unfolded protein response pathways. We validated the candidate polymorphisms in the NCI GWAS and found that variants of RAPGEF1, SKP1, HERPUD1, CACNB2, CACNA1C, CACNA1S, COL4A2, and SRC were associated with bladder cancer risk. Two CCNE1 variants, rs8102137 and rs997669, from cell cycle pathways showed the strongest associations; the CCNE1 signal at 19q12 has already been reported in previous GWAS. These findings offer additional etiologic insights highlighting the specific genes and pathways associated with bladder cancer development. GSEA may be a complementary tool to GWAS to identify additional loci of cancer susceptibility.
pathway analysis; gene set enrichment analysis; GWAS; bladder cancer; susceptibility loci
The genome-wide association study (GWAS) is nowadays widely used to identify genes involved in human complex disease. The standard GWAS analysis examines SNPs/genes independently and identifies only a number of the most significant SNPs. It ignores the combined effect of weaker SNPs/genes, which makes it difficult to explore biological function and mechanism from a systems point of view. Although gene set enrichment analysis (GSEA) has been introduced to GWAS to overcome these limitations by identifying the correlation between pathways/gene sets and traits, its heavy dependence on genotype data, which is not easily available for most published GWAS investigations, has limited its application. In order to perform GSEA on a simple list of GWAS SNP P-values, we implemented GSEA by using SNP label permutation. We further improved GSEA (i-GSEA) by focusing on pathways/gene sets with a high proportion of significant genes. To provide researchers with an open platform to analyze GWAS data, we developed the i-GSEA4GWAS (improved GSEA for GWAS) web server. i-GSEA4GWAS implements the i-GSEA approach and aims to provide new insights in complex disease studies. i-GSEA4GWAS is freely available at http://gsea4gwas.psych.ac.cn/.
Modern large-scale genetic association studies generate increasingly high-dimensional datasets. Therefore, some variable selection procedure should be performed before the application of traditional data analysis methods, for reasons of both computational efficiency and problems related to overfitting. We describe here a “wrapper” strategy (SIZEFIT) for variable selection that uses a Random Forests classifier, coupled with various local search/optimization algorithms. We apply it to a large dataset consisting of 2,425 African-American and non-Hispanic white individuals genotyped for 4,869 single-nucleotide polymorphisms (SNPs) in a coronary heart disease (CHD) case–cohort association study (Atherosclerosis Risk in Communities), using incident CHD and plasma low-density lipoprotein (LDL) cholesterol levels as the dependent variables. We show that most SNPs can be safely removed from the dataset without compromising the predictive (classification) accuracy, with only a small number of SNPs (sometimes less than 100) containing any predictive signal. A statistical (SUMSTAT) approach is also applied to the dataset for comparison purposes. We describe a novel method for refining the subset of signal-containing SNPs (FIXFIT), based on an Extremal Optimization algorithm. Finally, we compare the top SNP rankings obtained by different methods and devise practical guidelines for researchers trying to generate a compact subset of predictive SNPs from genome-wide association datasets. Interestingly, there is a significant amount of overlap between seemingly very heterogeneous rankings. We conclude by constructing compact optimal predictive SNP subsets for CHD (less than 150 SNPs) and LDL (less than 300 SNPs) phenotypes, and by comparing various rankings for two well-known positive control SNPs for LDL in the apolipoprotein E gene.
coronary heart disease; genome-wide association studies; Random Forests classifier; SNPs; variable selection
Complex diseases such as hypertension are inherently multifactorial and involve many factors of mild-to-minute effect sizes. A genome-wide association study (GWAS) typically tests hundreds of thousands of single-nucleotide polymorphisms (SNPs), and offers the opportunity to evaluate aggregated effects of many genetic variants with effects that are too small to detect individually. Gene-set-enrichment analysis (GSEA) is a pathway-based approach that tests for such aggregated effects of genes that are linked by biological functions. A key step in GSEA is the summary statistic (gene score) used to measure the overall relevance of a gene based on all SNPs tested in the gene. Existing GSEA methods use maximum statistics that are sensitive to gene size and linkage disequilibrium. We propose the approach of variable set enrichment analysis (VSEA) and study new gene score methods that are less dependent on gene size. The new method treats groups of variables (SNPs or other variants) as base units for summarizing gene scores and relies less on gene definition itself. The power of VSEA is analyzed by simulation studies modeling various scenarios of complex multiloci interactions. Results show that the new gene scores generally performed better, some substantially so, than the existing GSEA extension to GWAS. The new methods are implemented in an R package and, when applied to a real GWAS data set, demonstrated their practical utility.
gene set enrichment; pathway-based analysis; SNP; genome-wide association
Despite the success of genome-wide association studies (GWAS), there still remains “missing heritability” for many traits. One contributing factor may be the result of examining one marker at a time as opposed to a group of markers that are biologically meaningful in aggregate. To address this problem, a variety of gene- and pathway-level methods have been developed to identify putative biologically relevant associations. A simulation was conducted to systematically assess the performance of these methods. Using genetic data from 4,500 individuals in the Wellcome Trust Case Control Consortium (WTCCC), case–control status was simulated based on an additive polygenic model. We evaluated gene-level methods based on their sensitivity, specificity, and proportion of false positives. Pathway-level methods were evaluated on the relationship between proportion of causal genes within the pathway and the strength of association.
The gene-level methods had low sensitivity (20-63%), high specificity (89-100%), and low proportion of false positives (0.1-6%). The gene-level program VEGAS using only the top 10% of associated single nucleotide polymorphisms (SNPs) within the gene had the highest sensitivity (28.6%) with less than 1% false positives. The performance of the pathway-level methods depended on their reliance upon asymptotic distributions or if significance was estimated in a competitive manner. The pathway-level programs GenGen, GSA-SNP and MAGENTA had the best performance while accounting for potential confounders.
Novel genes and pathways can be identified using the gene and pathway-level methods. These methods may provide valuable insight into the “missing heritability” of traits and provide biological interpretations to GWAS findings.
Electronic supplementary material
The online version of this article (doi:10.1186/s12863-015-0191-2) contains supplementary material, which is available to authorized users.
Genome-wide Association Studies; Gene Set; Biological Pathways
Many gene-set analysis methods have been previously proposed and compared through simulation studies and analysis of real datasets for binary phenotypes. We focused on the survival phenotype and compared the performances of Gene Set Enrichment Analysis (GSEA), Global Test (GT), Wald-type Test (WT) and Global Boost Test (GBST) methods in a simulation study and on two ovarian cancer data sets. We considered two versions of GSEA by allowing different weights: GSEA1 uses equal weights, yielding results similar to the Kolmogorov-Smirnov test; while GSEA2's weights are based on the correlation between genes and the phenotype.
We compared GSEA1, GSEA2, GT, WT and GBST in a simulation study with various settings for the correlation structure of the genes and the association parameter between the survival outcome and the genes. Simulation results indicated that GT, WT and GBST consistently have higher power than GSEA1 and GSEA2 across all scenarios. However, the power of the five tests depends on the combination of correlation structure and association parameter. For the ovarian cancer data set, using the FDR threshold of q < 0.1, the GT, WT and GBST detected 12, 6 and 8 significant pathways, respectively, whereas neither GSEA1 nor GSEA2 detected any significant pathways. In addition, among the pathways found significant by GT, WT, and GBST, three pathways - Purine metabolism, Leukocyte transendothelial migration and Jak-STAT signaling pathway - overlapped with those reported in previous ovarian cancer microarray studies.
Simulation studies and a real data example indicate that GT, WT and GBST tend to have high power, whereas GSEA1 and GSEA2 have lower power. We also found that the power of the five tests is much higher when genes are correlated than when genes are independent, when survival is positively associated with genes. It seems that there is a synergistic effect in detecting significant gene sets when significant genes have within-class correlation and the association between survival and genes is positive or negative (i.e., one-direction correlation).
Characterizing the genetic determinants of complex diseases can be further augmented by incorporating knowledge of underlying structure or classifications of the genome, such as newly developed mappings of protein-coding genes, epigenetic marks, enhancer elements and non-coding RNAs.
We apply a simple class-level testing framework, termed Genetic Class Association Testing (GenCAT), to identify protein-coding gene association with 14 cardiometabolic (CMD) related traits across 6 publicly available genome wide association (GWA) meta-analysis data resources. GenCAT uses SNP-level meta-analysis test statistics across all SNPs within a class of elements, as well as the size of the class and its unique correlation structure, to determine if the class is statistically meaningful. The novelty of findings is evaluated through investigation of regional signals. A subset of findings are validated using recently updated, larger meta-analysis resources. A simulation study is presented to characterize overall performance with respect to power, control of family-wise error and computational efficiency. All analysis is performed using the GenCAT package, R version 3.2.1.
We demonstrate that class-level testing complements the common first stage minP approach that involves individual SNP-level testing followed by post-hoc ascribing of statistically significant SNPs to genes and loci. GenCAT suggests 54 protein-coding genes at 41 distinct loci for the 13 CMD traits investigated in the discovery analysis, that are beyond the discoveries of minP alone. An additional application to biological pathways demonstrates flexibility in defining genetic classes.
We conclude that it would be prudent to include class-level testing as standard practice in GWA analysis. GenCAT, for example, can be used as a simple, complementary and efficient strategy for class-level testing that leverages existing data resources, requires only summary level data in the form of test statistics, and adds significant value with respect to its potential for identifying multiple novel and clinically relevant trait associations.
Genetic researchers often collect disease related quantitative traits in addition to disease status because they are interested in understanding the pathophysiology of disease processes. In genome-wide association (GWA) studies, these quantitative phenotypes may be relevant to disease development and serve as intermediate phenotypes or they could be behavioral or other risk factors that predict disease risk. Statistical tests combining both disease status and quantitative risk factors should be more powerful than case-control studies, as the former incorporates more information about the disease. In this paper, we proposed a modified inverse-variance weighted meta-analysis method to combine disease status and quantitative intermediate phenotype information. The simulation results showed that when an intermediate phenotype was available, the inverse-variance weighted method had more power than did a case-control study of complex diseases, especially in identifying susceptibility loci having minor effects. We further applied this modified meta-analysis to a study of imputed lung cancer genotypes with smoking data in 1154 cases and 1137 matched controls. The most significant SNPs came from the CHRNA3-CHRNA5-CHRNB4 region on chromosome 15q24–25.1, which has been replicated in many other studies. Our results confirm that this CHRNA region is associated with both lung cancer development and smoking behavior. We also detected three significant SNPs—rs1800469, rs1982072, and rs2241714—in the promoter region of the TGFB1 gene on chromosome 19 (p = 1.46×10−5, 1.18×10−5, and 6.57×10−6, respectively). The SNP rs1800469 is reported to be associated with chronic obstructive pulmonary disease and lung cancer in cigarette smokers. The present study is the first GWA study to replicate this result. Signals in the 3q26 region were also identified in the meta-analysis. 
We demonstrate that an intermediate phenotype can potentially enhance the power of complex disease association analysis and that the modified meta-analysis method is a robust way to incorporate an intermediate phenotype or other quantitative risk factor into the analysis.
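The standard inverse-variance weighted combination that underlies this kind of analysis can be sketched as below. This shows only the textbook fixed-effect form; the modification proposed in the study is not reproduced because it is not specified in the text above. The function name and the example numbers are hypothetical, and the sketch assumes the two effect estimates are independent and already expressed on a comparable scale.

```python
import math

def inverse_variance_meta(estimates, std_errors):
    """Fixed-effect inverse-variance weighted pooling of effect estimates.

    Assumes independent estimates on a common scale (an assumption of
    this sketch, not a statement about the study's method).
    """
    weights = [1.0 / se ** 2 for se in std_errors]   # w_i = 1 / se_i^2
    total_w = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total_w
    pooled_se = math.sqrt(1.0 / total_w)
    z = pooled / pooled_se
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))  # 2 * P(Z > |z|)
    return pooled, pooled_se, z, p_two_sided

# Made-up effects for one SNP: one from disease status, one from the
# quantitative intermediate phenotype.
print(inverse_variance_meta([0.20, 0.40], [0.10, 0.10]))
```

More precise estimates receive larger weights, so a well-measured intermediate phenotype can contribute substantially to the pooled evidence.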
Genome-wide association studies (GWAS) have been successful in identifying single nucleotide polymorphisms (SNPs) associated with many traits and diseases. However, at existing sample sizes, these variants explain only part of the estimated heritability. Leverage of GWAS results from related phenotypes may improve detection without the need for larger datasets. The Bayesian conditional false discovery rate (cFDR) constitutes an upper bound on the expected false discovery rate (FDR) across a set of SNPs whose p values for two diseases are both less than two disease-specific thresholds. Calculation of the cFDR requires only summary statistics and has several advantages over traditional GWAS analysis. However, existing methods require distinct control samples between studies. Here, we extend the technique to allow for some or all controls to be shared, increasing applicability. Several different SNP sets can be defined with the same cFDR value, and we show that the expected FDR across the union of these sets may exceed the expected FDR in any single set. We describe a procedure to establish an upper bound for the expected FDR among the union of such sets of SNPs. We apply our technique to pairwise analysis of p values from ten autoimmune diseases with variable sharing of controls, enabling discovery of 59 SNP-disease associations which do not reach GWAS significance after genomic control in individual datasets. Most of the SNPs we highlight have previously been confirmed using replication studies or larger GWAS, a useful validation of our technique; we report eight SNP-disease associations across five diseases not previously declared. Our technique extends and strengthens the previous algorithm, and establishes robust limits on the expected FDR. This approach can improve SNP detection in GWAS, and give insight into shared aetiology between phenotypically related conditions.
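A commonly used empirical estimate of the cFDR from summary statistics can be sketched as follows. The function name and toy p-values are hypothetical, and the sketch assumes distinct control samples; the adjustment for shared controls and the bound on the FDR across unions of SNP sets described above are not implemented here.

```python
def cfdr_estimate(p_i, p_j, pvals_i, pvals_j):
    """Empirical conditional FDR for thresholds (p_i, p_j).

    Estimate: p_i * #{P_j <= p_j} / #{P_i <= p_i and P_j <= p_j},
    capped at 1. pvals_i and pvals_j hold p-values for the same SNPs,
    in the same order, for the principal and the conditional disease.
    Assumes the two studies share no controls.
    """
    n_cond = sum(1 for b in pvals_j if b <= p_j)
    n_both = sum(1 for a, b in zip(pvals_i, pvals_j) if a <= p_i and b <= p_j)
    if n_both == 0:
        raise ValueError("no SNP satisfies both thresholds")
    return min(1.0, p_i * n_cond / n_both)

# Toy p-values for four SNPs in two diseases (made up for illustration).
disease_i = [0.001, 0.2, 0.5, 0.8]
disease_j = [0.01, 0.02, 0.6, 0.9]
print(cfdr_estimate(0.001, 0.02, disease_i, disease_j))
```

The estimate shrinks when SNPs that are associated with the conditional disease are also enriched for small p-values in the principal disease, which is how related phenotypes can improve detection.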
Many diseases have a significant hereditary component, only part of which has been explained by analysis of genome-wide association studies (GWAS). Shared aetiology, treatment protocols, and overlapping results from existing GWAS suggest similarities in genetic susceptibility between related diseases, which may be exploited to detect more disease-associated SNPs without the need for further data. We extend an existing method for detecting SNPs associated with a given disease by conditioning on association with another disease. Our extension allows GWAS for the two conditions to share control samples, enabling larger overall control groups and application to the common case when GWAS for related diseases pool control samples. We demonstrate that our technique limits the expected overall false discovery rate at a threshold dependent on the two diseases. We apply our technique to genotype data from ten immune mediated diseases. Overall pleiotropy between phenotypes is demonstrated graphically. We are able to declare several SNPs significant at a genome-wide level whilst controlling at a lower false-discovery rate than would be possible using a conventional approach, identifying eight previously unknown disease associations. This technique can improve SNP detection in GWAS by re-analysing existing data, and gives insight into the shared genetic bases of autoimmune diseases.
Due to the low statistical power of individual markers from a genome-wide association study (GWAS), detecting causal single nucleotide polymorphisms (SNPs) for complex diseases is a challenge. SNP combinations are suggested to compensate for the low statistical power of individual markers, but SNP combinations from GWAS generate high computational complexity.
We aim to detect type 2 diabetes (T2D) causal SNP combinations from a GWAS dataset with optimal filtration and to discover the biological meaning of the detected SNP combinations. Optimal filtration can enhance the statistical power of SNP combinations by comparing the error rates of SNP combinations from various Bonferroni thresholds and p-value range-based thresholds combined with linkage disequilibrium (LD) pruning. T2D causal SNP combinations are selected using random forests with variable selection from an optimal SNP dataset. T2D causal SNP combinations and genome-wide SNPs are mapped into functional modules using expanded gene set enrichment analysis (GSEA) considering pathway, transcription factor (TF)-target, miRNA-target, gene ontology, and protein complex functional modules. The prediction error rates are measured for SNP sets from functional module-based filtration, which selects SNPs within functional modules from genome-wide SNPs based on expanded GSEA.
A T2D causal SNP combination containing 101 SNPs from the Wellcome Trust Case Control Consortium (WTCCC) GWAS dataset is selected using optimal filtration criteria, with an error rate of 10.25%. Matching the 101 SNPs with known T2D genes and functional modules reveals the relationships between T2D and SNP combinations. The prediction error rates of SNP sets from functional module-based filtration show no significant difference from the prediction error rates of randomly selected SNP sets and of T2D causal SNP combinations from optimal filtration.
We propose a detection method for complex disease causal SNP combinations from an optimal SNP dataset by using random forests with variable selection. Mapping the biological meanings of detected SNP combinations can help uncover complex disease mechanisms.
Recently, a pathway-based approach has been developed to evaluate the cumulative contribution of the functionally related genes for genome-wide association studies (GWASs), which may help utilize GWAS data to a greater extent.
In this study, we applied this approach for the GWAS of basal cell carcinoma (BCC) of the skin. We first conducted the BCC GWAS among 1,797 BCC cases and 5,197 controls in Caucasians with 740,760 genotyped SNPs. 115,688 SNPs were grouped into gene transcripts within 20 kb in distance and then into 174 Kyoto Encyclopedia of Genes and Genomes pathways, 205 BioCarta pathways, as well as two positive control gene sets (pigmentation gene set and BCC risk gene set). The association of each pathway with BCC risk was evaluated using the weighted Kolmogorov-Smirnov test. One thousand permutations were conducted to assess the significance.
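The weighted Kolmogorov-Smirnov statistic used for pathway scoring can be sketched as a GSEA-style running sum over genes ranked by their association statistic. The sketch below is illustrative only: the weighting exponent, the function names, and the use of gene-label permutation (rather than whatever permutation scheme the study applied) are assumptions, not details taken from the study.

```python
import random

def enrichment_score(gene_stats, gene_set, weight=1.0):
    """Weighted KS enrichment score: maximum deviation of the running sum."""
    ranked = sorted(gene_stats.items(), key=lambda kv: kv[1], reverse=True)
    hit_total = sum(abs(s) ** weight for g, s in ranked if g in gene_set)
    n_miss = sum(1 for g, _ in ranked if g not in gene_set)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("gene set must have non-zero hits and leave some misses")
    running, best = 0.0, 0.0
    for gene, stat in ranked:
        if gene in gene_set:
            running += abs(stat) ** weight / hit_total  # step up, weighted
        else:
            running -= 1.0 / n_miss                     # step down, uniform
        if abs(running) > abs(best):
            best = running
    return best

def permutation_p_value(gene_stats, gene_set, n_perm=1000, seed=0):
    """One-sided p value from random gene sets of the same size (assumption)."""
    rng = random.Random(seed)
    genes = list(gene_stats)
    size = sum(1 for g in genes if g in gene_set)
    observed = enrichment_score(gene_stats, gene_set)
    exceed = sum(
        1 for _ in range(n_perm)
        if enrichment_score(gene_stats, set(rng.sample(genes, size))) >= observed
    )
    return (exceed + 1) / (n_perm + 1)
```

Permuting gene labels ignores correlation between genes; permuting case-control labels and recomputing the gene statistics is the more conservative choice when genotype data are available.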
Both of the positive control gene sets reached pathway p-values<0.05. Four other pathways were also significantly associated with BCC risk: the heparan sulfate biosynthesis pathway (p = 0.007, false discovery rate, FDR = 0.35), the mCalpain pathway (p = 0.002, FDR = 0.12), the Rho cell motility signaling pathway (p = 0.011, FDR = 0.30), and the nitric oxide pathway (p = 0.022, FDR = 0.42).
We identified four pathways associated with BCC risk, which may offer new insights into the etiology of BCC upon further validation, and this approach may help identify potential biological pathways that might be missed by the standard GWAS approach.
Duchenne muscular dystrophy (DMD) causes progressive muscle degeneration, cardiomyopathy and respiratory failure in approximately 1/5,000 boys. Golden Retriever muscular dystrophy (GRMD) resembles DMD both clinically and pathologically. Like DMD, GRMD exhibits remarkable phenotypic variation among affected dogs, suggesting the influence of modifiers. Understanding the role(s) of genetic modifiers of GRMD may identify genes and pathways that also modify phenotypes in DMD and reveal novel therapies. Therefore, our objective in this study was to identify genetic modifiers that affect discrete GRMD phenotypes.
We performed a linear mixed-model (LMM) analysis using 16 variably-affected dogs from our GRMD colony (8 dystrophic, 8 non-dystrophic). All of these dogs were either full or half-siblings, and phenotyped for 19 objective, quantitative biomarkers at ages 6 and 12 months. Each biomarker was individually assessed. Gene expression profiles of 59 possible candidate genes were generated for two muscle types: the cranial tibialis and medial head of the gastrocnemius. SNPs significantly associated with GRMD biomarkers were identified on multiple chromosomes (including the X chromosome). Gene expression levels for candidate genes located near these SNPs correlated with biomarker values, suggesting possible roles as GRMD modifiers.
The results of this study enhance our understanding of GRMD pathology and represent a first step toward the characterization of GRMD modifiers that may be relevant to DMD pathology. Such modifiers are likely to be useful for DMD treatment development based on their relationships to GRMD phenotypes.
Electronic supplementary material
The online version of this article (doi:10.1186/s12864-016-2948-z) contains supplementary material, which is available to authorized users.
Muscular dystrophy; Duchenne muscular dystrophy; DMD; Golden retriever muscular dystrophy; GRMD; Modifier; Linear mixed-model analysis; Gene expression
The analysis of high-throughput gene expression data with respect to sets of genes rather than individual genes has many advantages. A variety of methods have been developed for assessing the enrichment of sets of genes with respect to differential expression. In this paper we provide a comparative study of four of these methods: Fisher's exact test, Gene Set Enrichment Analysis (GSEA), Random-Sets (RS), and Gene List Analysis with Prediction Accuracy (GLAPA). The first three methods use associative statistics, while the fourth uses predictive statistics. We first compare all four methods on simulated data sets to verify that Fisher's exact test is markedly worse than the other three approaches. We then validate the other three methods on seven real data sets with known genetic perturbations and then compare the methods on two cancer data sets where our a priori knowledge is limited.
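For reference, Fisher's exact test in the enrichment setting reduces to a one-sided hypergeometric tail probability: the chance of seeing at least the observed overlap between a gene set and a list of selected (for example, differentially expressed) genes. The sketch below uses only the standard library; the function and argument names are hypothetical.

```python
from math import comb

def enrichment_p_value(n_universe, n_set, n_selected, n_overlap):
    """P(X >= n_overlap) for X ~ Hypergeometric(n_universe, n_set, n_selected).

    n_universe: all genes measured
    n_set:      genes in the gene set
    n_selected: genes called significant
    n_overlap:  significant genes that fall in the gene set
    """
    upper = min(n_set, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_set, k) * comb(n_universe - n_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total
```

Because the test needs a hard cut-off to decide which genes count as selected, it discards the ranking information that the other three methods use.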
The simulation study highlights that none of the three methods consistently outperforms the others. GSEA and RS are able to detect weak signals of deregulation, and they perform differently when genes in a gene set are both up- and down-regulated. GLAPA is more conservative, and large differences between the two phenotypes are required for the method to detect differential deregulation in gene sets. This is because the enrichment statistic in GLAPA is prediction error, which is a stronger criterion than the classical two-sample statistics used in RS and GSEA. This was reflected in the analysis of real data sets, where GSEA and RS were significant for particular gene sets while GLAPA was not, suggesting a small effect size. We find that the ranking of gene set enrichment induced by GLAPA is more similar to that of RS than to that of GSEA. More importantly, the rankings of the three methods share significant overlap.
The three methods considered in our study recover relevant gene sets known to be deregulated in the experimental conditions and pathologies analyzed. There are differences between the three methods and GSEA seems to be more consistent in finding enriched gene sets, although no method uniformly dominates over all data sets. Our analysis highlights the deep difference existing between associative and predictive methods for detecting enrichment and the use of both to better interpret results of pathway analysis. We close with suggestions for users of gene set methods.
Integrating single nucleotide polymorphism (SNP) p-values from genome-wide association studies (GWAS) across genes and pathways is a strategy to improve statistical power and gain biological insight. Here, we present Pascal (Pathway scoring algorithm), a powerful tool for computing gene and pathway scores from SNP-phenotype association summary statistics. For gene score computation, we implemented analytic and efficient numerical solutions to calculate test statistics. We examined in particular the sum and the maximum of chi-squared statistics, which measure the average and the strongest association signals per gene, respectively. For pathway scoring, we use a modified Fisher method, which not only offers a significant power improvement over more traditional enrichment strategies, but also eliminates the problem of arbitrary threshold selection inherent in any binary membership-based pathway enrichment approach. We demonstrate the marked increase in power by analyzing summary statistics from dozens of large meta-studies for various traits. Our extensive testing indicates that our method not only excels in rigorous type I error control, but also results in more biologically meaningful discoveries.
Genome-wide association studies (GWAS) typically generate lists of trait- or disease-associated SNPs. Yet, such output sheds little light on the underlying molecular mechanisms and tools are needed to extract biological insight from the results at the SNP level. Pathway analysis tools integrate signals from multiple SNPs at various positions in the genome in order to map associated genomic regions to well-established pathways, i.e., sets of genes known to act in concert. The nature of GWAS association results requires specifically tailored methods for this task. Here, we present Pascal (Pathway scoring algorithm), a tool that allows gene and pathway-level analysis of GWAS association results without the need to access the original genotypic data. Pascal was designed to be fast, accurate and to have high power to detect relevant pathways. We extensively tested our approach on a large collection of real GWAS association results and saw better discovery of confirmed pathways than with other popular methods. We believe that these results together with the ease-of-use of our publicly available software will allow Pascal to become a useful addition to the toolbox of the GWAS community.
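To make the pathway-scoring idea concrete, the sketch below implements the classical (unmodified) Fisher combination of independent p values: the statistic -2 Σ ln p follows a chi-squared distribution with 2k degrees of freedom under the null. It is not Pascal's modified Fisher method, whose details are not given here, and it assumes independent gene scores; the function names are hypothetical.

```python
from math import exp, log

def chi2_sf_even_df(x, df):
    """Survival function of the chi-squared distribution for even df.

    Uses the closed form exp(-x/2) * sum_{i<df/2} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return exp(-half) * total

def fisher_combined_p(p_values):
    """Return (statistic, combined p value) for independent p values."""
    if not p_values or any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p values must lie in (0, 1]")
    statistic = -2.0 * sum(log(p) for p in p_values)
    return statistic, chi2_sf_even_df(statistic, 2 * len(p_values))
```

No significance threshold enters the calculation, which is the property the pathway-scoring paragraph contrasts with binary membership-based enrichment tests.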
We report a genome-wide association study (GWAS) of nicotine dependence defined on the basis of scores on the Fagerström Test for Nicotine Dependence in European-American (EA) and African-American (AA) populations.
Our sample, derived from the one used in our previous GWAS, included only subjects who had smoked >100 cigarettes lifetime (2114 EA and 2602 AA subjects) and an additional 927 AA and 2003 EA subjects from the Study of Addiction: Genetics and Environment project [via the database of Genotypes and Phenotypes (dbGaP)]. GWAS analysis considered Fagerström Test for Nicotine Dependence score as an ordinal trait, separately in each population and sample and by combining the results in meta-analysis. We also conducted analyses that were adjusted for other substance use disorder criteria in a single nucleotide polymorphism (SNP) subset.
In EAs, one chromosome 7 intergenic region was genome-wide significant (GWS): rs13225753, p = 3.48 × 10−8 (adjusted). In AAs, GWS associations were observed at numerous SNPs mapped to a region on chromosome 14 of >305,000 base pairs (minimal p = 4.74 × 10−10). Two chromosome 8 regions were associated: p = 4.45 × 10−8 at DLC1 SNP rs289519 (unadjusted) and p = 1.10 × 10−9 at rs6996964 (adjusted for other substances), located between CSGALNACT1 and INTS10. No GWS associations were observed at the chromosome 15 nicotinic receptor gene cluster (CHRNA5-CHRNA3-CHRNB4) previously associated with nicotine dependence and smoking quantity traits. TSNAX-DISC1 SNP rs821722 (p = 1.46 × 10−7) was the most significant result with substantial contributions from both populations; we previously identified DISC1 associations with opioid dependence. Pathway analysis identified association with nitric oxide synthase and adenosine monophosphate-activated protein kinase pathways in EAs.
The key risk loci identified, which require replication, offer novel insights into nicotine dependence biology.
AMPK pathway; DISC1; DLC1; eNOS pathway; FTND; GWAS; Nicotine dependence; Population differences
Pathway analysis of genome-wide association studies (GWAS) offers a unique opportunity to collectively evaluate genetic variants with effects that are too small to be detected individually. We applied a pathway analysis to a bladder cancer GWAS containing data from 3,532 cases and 5,120 controls of European background (n = 5 studies). Thirteen hundred and ninety-nine pathways were drawn from five publicly available resources (BioCarta, KEGG, NCI-PID, HumanCyc, and Reactome), and we constructed 22 additional candidate pathways previously hypothesized to be related to bladder cancer. In total, 1421 pathways, 5647 genes and ∼90,000 SNPs were included in our study. A logistic regression model adjusting for age, sex, study, DNA source, and smoking status was used to assess the marginal trend effect of SNPs on bladder cancer risk. Two complementary pathway-based methods (gene-set enrichment analysis [GSEA], and adapted rank-truncated product [ARTP]) were used to assess the enrichment of association signals within each pathway. Eighteen pathways were detected by either GSEA or ARTP at P≤0.01. To minimize false positives, we used the I2 statistic to identify SNPs displaying heterogeneous effects across the five studies. After removing these SNPs, seven pathways (‘Aromatic amine metabolism’ [PGSEA = 0.0100, PARTP = 0.0020], ‘NAD biosynthesis’ [PGSEA = 0.0018, PARTP = 0.0086], ‘NAD salvage’ [PARTP = 0.0068], ‘Clathrin derived vesicle budding’ [PARTP = 0.0018], ‘Lysosome vesicle biogenesis’ [PGSEA = 0.0023, PARTP<0.00012], ‘Retrograde neurotrophin signaling’ [PGSEA = 0.00840], and ‘Mitotic metaphase/anaphase transition’ [PGSEA = 0.0040]) remained. These pathways seem to belong to three fundamental cellular processes (metabolic detoxification, mitosis, and clathrin-mediated vesicles). Identification of the aromatic amine metabolism pathway provides support for the ability of this approach to identify pathways with established relevance to bladder carcinogenesis.
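The I2 heterogeneity filter mentioned above can be sketched from per-study effect estimates and standard errors. The code uses the standard definition based on Cochran's Q with inverse-variance weights, I2 = max(0, (Q − df)/Q) × 100%; the function name and the toy inputs are hypothetical, and the cut-off used to exclude SNPs is not stated in the abstract and is therefore left to the caller.

```python
def i_squared(effects, std_errors):
    """Return I2 (in percent) for one SNP across studies.

    effects:    per-study effect estimates (e.g. log odds ratios)
    std_errors: matching per-study standard errors
    """
    if len(effects) != len(std_errors) or len(effects) < 2:
        raise ValueError("need matching estimates from at least two studies")
    weights = [1.0 / se ** 2 for se in std_errors]   # inverse-variance weights
    pooled = sum(w * b for w, b in zip(weights, effects)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, effects))
    df = len(effects) - 1
    if q <= 0.0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0
```

A SNP with identical effects in every study yields 0%, while two precisely estimated but opposite effects push the value towards 100%.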
Recently, microarray data analyses using functional pathway information, e.g., gene set enrichment analysis (GSEA) and significance analysis of function and expression (SAFE), have gained recognition as a way to identify biological pathways/processes associated with a phenotypic endpoint. In these analyses, a local statistic is used to assess the association between the expression level of a gene and the value of a phenotypic endpoint. Then these gene-specific local statistics are combined to evaluate association for pre-selected sets of genes. Commonly used local statistics include t-statistics for binary phenotypes and correlation coefficients that assume a linear or monotone relationship between a continuous phenotype and gene expression level. Methods applicable to continuous non-monotone relationships are needed. Furthermore, for multiple experimental categories, methods that combine multiple GSEA/SAFE analyses are needed.
For continuous or ordinal phenotypic outcome, we propose to use as the local statistic the coefficient of multiple determination (i.e., the square of multiple correlation coefficient) R2 from fitting natural cubic spline models to the phenotype-expression relationship. Next, we incorporate this association measure into the GSEA/SAFE framework to identify significant gene sets. Unsigned local statistics, signed global statistics and one-sided p-values are used to reflect our inferential interest. Furthermore, we describe a procedure for inference across multiple GSEA/SAFE analyses. We illustrate our approach using gene expression and liver injury data from liver and blood samples from rats treated with eight hepatotoxicants under multiple time and dose combinations. We set out to identify biological pathways/processes associated with liver injury as manifested by increased blood levels of alanine transaminase in common for most of the eight compounds. Potential statistical dependency resulting from the experimental design is addressed in permutation based hypothesis testing.
The proposed framework captures both linear and non-linear association between gene expression level and a phenotypic endpoint and thus can be viewed as extending the current GSEA/SAFE methodology. The framework for combining results from multiple GSEA/SAFE analyses is flexible to address practical inference interests. Our methods can be applied to microarray data with continuous phenotypes with multi-level design or the meta-analysis of multiple microarray data sets.