Genome-wide association (GWA) studies aim to identify the genetic factors associated with traits of interest. However, the power of GWA analysis has been seriously limited by the enormous number of markers tested. Recently, gene set analysis (GSA) methods were introduced to GWA studies to address the association of gene sets that share common biological functions. GSA considerably increased the power of association analysis and successfully identified coordinated association patterns of gene sets. Several approaches have been proposed in this direction, each with some limitations. Here, we present a general approach for GSA in GWA analysis and a stand-alone software package, GSA-SNP, that implements three widely used GSA methods. GSA-SNP provides fast computation and an easy-to-use interface. The software and test datasets are freely available at http://gsa.muldas.org. We provide an exemplary analysis of adult height in a Korean population.
With the recent success of genome-wide association studies (GWAS), a wealth of association data has been accumulated for more than 200 complex diseases/traits, creating a strong demand for data integration and interpretation. A combinatory analysis of multiple GWAS datasets, or an integrative analysis of GWAS data and other high-throughput data, has been particularly promising. In this study, we proposed an integrative analysis framework of multiple GWAS datasets by overlaying association signals onto the protein-protein interaction network, and demonstrated it using schizophrenia datasets. Building on a dense module search algorithm, we first searched for significantly enriched subnetworks for schizophrenia in each single GWAS dataset and then implemented a discovery-evaluation strategy to identify module genes with consistent association signals. We validated the module genes in an independent dataset, and also examined them through meta-analysis of the related SNPs using multiple GWAS datasets. As a result, we identified 205 module genes with a joint effect significantly associated with schizophrenia; these module genes included a number of well-studied candidate genes such as DISC1, GNA12, GNA13, GNAI1, GPR17, and GRIN2B. Further functional analysis suggested that these genes are involved in neuron-related processes. Additionally, meta-analysis found that 18 SNPs in 9 module genes had Pmeta < 1×10−4, including the gene HLA-DQA1 located in the MHC region on chromosome 6, which was reported in previous studies using the largest cohort of schizophrenia patients to date. These results demonstrate that our bi-directional network-based strategy is efficient for identifying disease-associated genes with modest signals in GWAS datasets. This approach can be applied to any other complex disease or trait for which multiple GWAS datasets are available.
The recent success of genome-wide association studies (GWAS) has generated a wealth of genotyping data critical to studies of the genetic architecture of many complex diseases. In contrast to traditional single marker analysis, an integrative analysis of multiple genes and the assessment of their joint effects have been particularly promising, especially given the availability of many GWAS datasets and other high-throughput datasets for numerous complex diseases. In this study, we developed an integrative analysis framework for multiple GWAS datasets and demonstrated it in schizophrenia. We first constructed a GWAS-weighted protein-protein interaction (PPI) network and then applied a dense module search algorithm to identify subnetworks with combinatory disease effects. We applied combinatorial criteria for module selection based on permutation tests to determine whether the modules are significantly different from random gene sets and whether the modules are associated with the disease under investigation. Importantly, considering that many complex diseases have multiple GWAS datasets available, we proposed a discovery-evaluation strategy to search for modules with consistent combined effects from two or more GWAS datasets. This approach can be applied to any disease or trait that has two or more GWAS datasets available.
Genome-wide association studies (GWAS) for epithelial ovarian cancer (EOC), the most lethal gynecologic malignancy, have identified novel susceptibility loci. GWAS for survival after EOC have had more limited success. The association of each single nucleotide polymorphism (SNP) individually may not be well-suited to detect small effects of multiple SNPs, such as those operating within the same biological pathway. Gene set analysis (GSA) overcomes this limitation by assessing overall evidence for association of a phenotype with all measured variation in a set of genes.
To determine gene sets associated with EOC overall survival, we conducted GSA using data from two large GWASes (N cases = 2,813, N deaths = 1,116), with a novel Principal Component – Gamma GSA method. Analysis was completed for all cases and then separately for high grade serous (HGS) histological subtype.
Analysis of the HGS subjects resulted in 43 gene sets with p < 0.005 (1.7%); of these, 21 gene sets had p < 0.10 in both GWASes, including intracellular signaling pathway (p = 7.3 × 10−5) and macrolide binding (p = 6.2 × 10−4) gene sets. The top gene sets in analysis of all cases were meiotic mismatch repair (p = 6.3 × 10−4) and macrolide binding (p = 1.0 × 10−3). Of 18 gene sets with p < 0.005 (0.7%), eight had p < 0.10 in both GWASes.
This research detected novel gene sets associated with EOC survival.
Novel gene sets associated with EOC survival might lead to new insights and avenues for development of novel therapies for EOC and pharmacogenomic studies.
pathway analysis; genetic association; GWAS; SNPs; gynecologic neoplasm
Motivation: An important question that has emerged from the recent success of genome-wide association studies (GWAS) is how to detect genetic signals beyond single markers/genes in order to explore their combined effects on mediating complex diseases and traits. Integrative testing of GWAS association data with that from prior-knowledge databases and proteome studies has recently gained attention. These methodologies may hold promise for comprehensively examining the interactions between genes underlying the pathogenesis of complex diseases.
Methods: Here, we present a dense module searching (DMS) method to identify candidate subnetworks or genes for complex diseases by integrating the association signal from GWAS datasets into the human protein–protein interaction (PPI) network. The DMS method extensively searches for subnetworks enriched with low P-value genes in GWAS datasets. Compared with pathway-based approaches, this method introduces flexibility in defining a gene set and can effectively utilize local PPI information.
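The greedy search described above can be illustrated with a minimal sketch. It is not the dmGWAS implementation: the scoring rule (sum of gene z-scores divided by the square root of the module size), the stopping parameter r, and the toy network and p-values are all assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def module_score(genes, z):
    """Assumed module score: sum of gene z-scores divided by sqrt(module size)."""
    return sum(z[g] for g in genes) / sqrt(len(genes))


def grow_module(seed, graph, z, r=0.1):
    """Greedily expand a module from a seed gene.

    At each step the direct neighbor that gives the highest module score is
    added, provided the score rises by more than a factor of (1 + r).
    The value of r is a hypothetical tuning parameter.
    """
    module = {seed}
    score = module_score(module, z)
    while True:
        neighbors = set().union(*(graph[g] for g in module)) - module
        if not neighbors:
            break
        best = max(neighbors, key=lambda g: module_score(module | {g}, z))
        best_score = module_score(module | {best}, z)
        if best_score <= score * (1 + r):
            break
        module.add(best)
        score = best_score
    return module, score


# Hypothetical gene-level p-values and a toy PPI network (adjacency sets).
p_values = {"A": 1e-4, "B": 1e-3, "C": 0.5, "D": 0.9}
z = {g: NormalDist().inv_cdf(1 - p) for g, p in p_values.items()}
graph = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A"}, "D": {"B"}}

module, score = grow_module("A", graph, z)
print(sorted(module), round(score, 2))
```

In this toy network the module seeded at gene A absorbs gene B, whose low p-value raises the score, and stops before adding C or D. A real analysis would also need a permutation step to judge whether a module score is unusual, which this sketch omits.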
Results: We implemented the DMS method in an R package, which can also evaluate and graphically represent the results. We demonstrated DMS in two GWAS datasets for complex diseases, i.e. breast cancer and pancreatic cancer. For each disease, the DMS method successfully identified a set of significant modules and candidate genes, including some well-studied genes not detected in the single-marker analysis of GWA studies. Functional enrichment analysis and comparison with previously published methods showed that the genes we identified by DMS have higher association signal.
Availability: dmGWAS package and documents are available at http://bioinfo.mc.vanderbilt.edu/dmGWAS.html.
Supplementary Information: Supplementary data are available at Bioinformatics online.
Gene set analysis (GSA) is useful in interpreting a genome-wide association study (GWAS) result in terms of biological mechanism. We compared the performance of two GSA implementations, GSA-SNP and i-GSEA4GWAS, which accept GWAS p-values of single nucleotide polymorphisms (SNPs) or gene-by-gene summaries thereof, under the same settings of inputs and parameters. GSA runs were made with two sets of p-values from a Korean type 2 diabetes mellitus GWAS: 259,188 and 1,152,947 SNPs of the original and imputed genotype datasets, respectively. When Gene Ontology terms were used as gene sets, i-GSEA4GWAS produced 283 and 1,070 hits for the unimputed and imputed datasets, respectively. On the other hand, GSA-SNP reported 94 and 38 hits for the unimputed and imputed datasets, respectively. Similar trends, though to a lesser degree, were observed with Kyoto Encyclopedia of Genes and Genomes (KEGG) gene sets as well. The huge number of hits by i-GSEA4GWAS for the imputed dataset was probably an artifact due to the scaling step in the algorithm. The decrease in hits by GSA-SNP for the imputed dataset may be due to the fact that it relies on Z-statistics, which are sensitive to variations in the background level of associations. Judicious evaluation of the GSA outcomes, perhaps based on multiple programs, is recommended.
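To make the sensitivity of a Z-statistic to the background level concrete, the following sketch computes a generic gene set Z-score. It is not the GSA-SNP code: scoring each gene as -log10 of a single per-gene p-value, and the gene names and p-values, are assumptions for illustration.

```python
from math import log10, sqrt
from statistics import mean, pstdev


def gene_scores(gene_pvalues):
    """Assumed gene-level score: -log10 of one representative p-value per gene."""
    return {g: -log10(p) for g, p in gene_pvalues.items()}


def z_statistic(gene_set, scores):
    """Z = (set mean - background mean) * sqrt(set size) / background SD."""
    members = [scores[g] for g in gene_set if g in scores]
    background_mean = mean(scores.values())
    background_sd = pstdev(scores.values())
    return (mean(members) - background_mean) * sqrt(len(members)) / background_sd


# Hypothetical per-gene p-values.
pvals = {"g1": 1e-4, "g2": 1e-2, "g3": 0.1, "g4": 0.1}
scores = gene_scores(pvals)
z_set = z_statistic({"g1", "g2"}, scores)
print(round(z_set, 3))
```

Because the background mean and standard deviation enter the formula directly, any change in the overall distribution of gene scores (for example after imputation adds many SNPs) shifts every gene set's Z-score, even when the scores of the set's own genes are unchanged.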
gene set analysis; genome-wide association study; GSA-SNP; i-GSEA4GWAS; imputation
The search for the missing heritability in genome-wide association studies (GWAS) has become an important focus for the human genetics community. One suspected location of these genetic effects is in gene-gene interactions, or epistasis. The computational burden of exploring gene-gene interactions in the wealth of data generated in GWAS, along with small to moderate sample sizes, has led to epistasis being an afterthought, rather than a primary focus of GWAS analyses. In this review, we discuss some potential approaches to filter a GWAS dataset to a smaller, more manageable dataset where searching for epistasis is considerably more feasible. We describe a number of alternative approaches, but primarily focus on the use of prior biological knowledge from databases in the public domain to guide the search for epistasis. Prior knowledge can be incorporated into a GWA study in many ways, and these data can be extracted from a variety of database sources. We discuss a number of these approaches and propose that a comprehensive approach will likely be most fruitful for searching for epistasis in large-scale genomic studies, both with current state-of-the-art data and in the future.
epistasis; prior knowledge; pathways; protein-protein interactions; gene-gene interactions
The efforts of the Human Genome Project are beginning to provide important findings for human health. Technological advances in the laboratory, particularly in characterizing human genomic variation, have created new approaches for studying the human genome - genome-wide association studies (GWAS). However, current statistical and computational strategies are taking only partial advantage of this wealth of information. In the quest for susceptibility genes for complex diseases in GWAS data, several different analytic strategies are being pursued. In a recent report, Baranzini and colleagues used a pathway- and network-based analysis to explore potentially interesting single locus association signals in a GWAS of multiple sclerosis. This and other pathway-based approaches are likely to continue to emerge in the GWAS literature, as they provide a powerful strategy to detect important modest single-locus effects and gene-gene interaction effects.
The recent success of genome-wide association (GWA) studies has greatly expanded our understanding of many complex diseases by delivering previously unknown loci and genes. A large number of GWAS datasets have already been made available, with more being generated. To explore the underlying moderate and weak signals, we recently developed a network-based dense module search (DMS) method for identification of disease candidate genes from GWAS datasets, leveraging on the joint effect of multiple genes. DMS is designed to dynamically search for the best nodes in a step-wise fashion and, thus, could overcome the limitation of pre-defined gene sets. Here, we propose an improved version of DMS, the topologically-adjusted DMS, to facilitate the analysis of complex diseases. Building on the previous version of DMS, we improved the randomization process by taking into account the topological character, aiming to adjust the bias potentially caused by high-degree nodes in the whole network. We demonstrated the topologically-adjusted DMS algorithm in a GWAS dataset for schizophrenia. We found the improved DMS strategy could effectively identify candidate genes while reducing the burden of high-degree nodes. In our evaluation, we found more candidate genes identified by the topologically-adjusted DMS algorithm have been reported in the previous association studies, suggesting this new algorithm has better performance than the unweighted DMS algorithm. Finally, our functional analysis of the top module genes revealed that they are enriched in immune-related pathways.
dmGWAS; dense module search; GWAS; schizophrenia; network; gene set enrichment analysis
The last decade of human genetic research witnessed the completion of hundreds of genome-wide association studies (GWASs). However, the genetic variants discovered through these efforts account for only a small proportion of the heritability of complex traits. One explanation for the missing heritability is that the common analysis approach, assessing the effect of each single-nucleotide polymorphism (SNP) individually, is not well suited to the detection of small effects of multiple SNPs. Gene set analysis (GSA) is one of several approaches that may contribute to the discovery of additional genetic risk factors for complex traits. Complex phenotypes are thought to be controlled by networks of interacting biochemical and physiological pathways influenced by the products of sets of genes. By assessing the overall evidence of association of a phenotype with all measured variation in a set of genes, GSA may identify functionally relevant sets of genes corresponding to relevant biomolecular pathways, which will enable more focused studies of genetic risk factors. This approach may thus contribute to the discovery of genetic variants responsible for some of the missing heritability. With the increased use of these approaches for the secondary analysis of data from GWAS, it is important to understand the different GSA methods and their strengths and weaknesses, and consider challenges inherent in these types of analyses. This paper provides an overview of GSA, highlighting the key challenges, potential solutions, and directions for ongoing research.
pathway analysis; multilocus; complex traits; genetic association studies
Genome-wide association studies (GWAS) are becoming the approach of choice to identify genetic determinants of complex phenotypes and common diseases. The astonishing amount of generated data and the use of distinct genotyping platforms with variable genomic coverage are still analytical challenges. Imputation algorithms combine information from directly genotyped markers with the haplotypic structure of the population of interest to infer poorly genotyped or missing markers, and are considered a near-zero-cost approach to allow the comparison and combination of data generated in different studies. Several reports stated that imputed markers have an overall acceptable accuracy, but no published report has performed a pairwise comparison of imputed and empirical association statistics of a complete set of GWAS markers.
In this report we identified a total of 73 imputed markers that yielded a nominally statistically significant association at P < 10−5 for type 2 diabetes mellitus and compared them with results obtained based on empirical allelic frequencies. Interestingly, despite their overall high correlation, association statistics based on imputed frequencies were discordant in 35 of the 73 (47%) associated markers, considerably inflating the type I error rate of imputed markers. We comprehensively tested several quality thresholds, the haplotypic structure underlying imputed markers and the use of flanking markers as predictors of inaccurate association statistics derived from imputed markers.
Our results suggest that association statistics from imputed markers that fall in specific minor allele frequency (MAF) ranges, are located in weak linkage disequilibrium blocks, or strongly deviate from local patterns of association are prone to inflated false positive association signals. The present study highlights the potential of imputation procedures and proposes simple procedures for selecting the best imputed markers for follow-up genotyping studies.
From the early 1990s to the middle of the last decade, the search for genes influencing osteoporosis proved difficult with few successes. However, over the last 5 years this has begun to change with the introduction of genome-wide association (GWA) studies. In this short period of time, GWA studies have significantly accelerated the pace of gene discovery, leading to the identification of nearly 100 independent associations for osteoporosis-related traits. However, GWA does not specifically pinpoint causal genes or provide functional context for associations. Thus, there is a need for approaches that provide systems-level insight on how associated variants influence cellular function, downstream gene networks, and ultimately disease. In this review we discuss the emerging field of “systems genetics” and how it is being used in combination with and independent of GWA to improve our understanding of the molecular mechanisms involved in bone fragility.
Systems genetics; Systems biology; Coexpression network; Causality modeling; Genetics of osteoporosis; Genome-wide association study
Gene set analysis (GSA) is used to elucidate genome-wide data, in particular transcriptome data. A multitude of methods have been proposed for this step of the analysis, and many of them have been compared and evaluated. Unfortunately, there is no consolidated opinion regarding which methods should be preferred, and the variety of available GSA software and implementations poses a difficulty for the end-user who wants to try out different methods. To address this, we have developed the R package Piano, which collects a range of GSA methods into the same system for the benefit of the end-user. Furthermore, we refine the GSA workflow by using modifications of the gene-level statistics. This enables us to divide the resulting gene set P-values into three classes, describing different aspects of gene expression directionality at gene set level. We use our fully implemented workflow to investigate the impact of the individual components of GSA by using microarray and RNA-seq data. The results show that the evaluated methods are globally similar and the major separation correlates well with our defined directionality classes. As a consequence, we suggest using a consensus scoring approach based on multiple GSA runs. In combination with the directionality classes, this constitutes a more thorough basis for an enriched biological interpretation.
Genome wide association studies (GWAS) have proven useful as a method for identifying genetic variations associated with diseases. In this study, we analyzed GWAS data for 61 diseases and phenotypes to elucidate common associations based on single nucleotide polymorphisms (SNP). The study was an expansion on a previous study on identifying disease associations via data from a single GWAS on seven diseases.
Adjustments to the originally reported study included expansion of the SNP dataset using Linkage Disequilibrium (LD) and refinement of the four levels of analysis to encompass SNP, SNP block, gene, and pathway level comparisons. A pair-wise comparison between diseases and phenotypes was performed at each level and the Jaccard similarity index was used to measure the degree of association between two diseases/phenotypes. Disease relatedness networks (DRNs) were used to visualize our results. We saw predominant relatedness between Multiple Sclerosis, type 1 diabetes, and rheumatoid arthritis for the first three levels of analysis. Expected relatedness was also seen between lipid- and blood-related traits.
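As a small illustration of the pair-wise comparison step, the Jaccard similarity index between two sets is the size of their intersection divided by the size of their union. The disease labels and SNP identifiers below are invented placeholders, not data from the study.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; defined here as 0.0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical associated-SNP sets per disease (placeholder identifiers).
associated = {
    "disease_X": {"snp1", "snp2", "snp3"},
    "disease_Y": {"snp2", "snp3", "snp4"},
    "disease_Z": {"snp5"},
}

# Pair-wise similarity between every pair of diseases.
similarity = {
    (d1, d2): jaccard(associated[d1], associated[d2])
    for d1, d2 in combinations(sorted(associated), 2)
}
for pair, value in similarity.items():
    print(pair, round(value, 2))
```

The same function applies unchanged at the SNP block, gene, and pathway levels, since only the elements of the sets differ.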
The predominant associations between Multiple Sclerosis, type 1 diabetes, and rheumatoid arthritis can be validated by clinical studies. The diseases have been proposed to share a systemic inflammation phenotype that can result in progression of additional diseases in patients with one of these three diseases. We also noticed unexpected relationships between metabolic and neurological diseases at the pathway comparison level. The less significant relationships found between diseases require a more detailed literature review to determine validity of the predictions. The results from this study serve as a first step towards a better understanding of seemingly unrelated diseases and phenotypes with similar symptoms or modes of treatment.
Recent genome-wide association studies (GWAS) have identified a number of novel genetic associations with complex human diseases. In spite of these successes, results from GWAS generally explain only a small proportion of disease heritability, an observation termed the ‘missing heritability problem’. Several sources for the missing heritability have been proposed, including the contribution of many common variants with small individual effect sizes, which cannot be reliably found using the standard GWAS approach. The goal of our study was to explore a complementary approach, which combines GWAS results with functional data in order to identify novel genetic associations with small effect sizes. To do so, we conducted a GWAS for lymphocyte count, a physiologic quantitative trait associated with asthma, in 462 Hutterites. In parallel, we performed a genome-wide gene expression study in lymphoblastoid cell lines from 96 Hutterites. We found significant support for genetic associations using the GWAS data when we considered variants near the 193 genes whose expression levels across individuals were most correlated with lymphocyte counts. Interestingly, these variants are also enriched with signatures of an association with asthma susceptibility, an observation we were able to replicate. The associated loci include genes previously implicated in asthma susceptibility as well as novel candidate genes enriched for functions related to T cell receptor signaling and adenosine triphosphate synthesis. Our results, therefore, establish a new set of asthma susceptibility candidate genes. More generally, our observations support the notion that many loci of small effects influence variation in lymphocyte count and asthma susceptibility.
QT prolongation is associated with increased risk of cardiac arrhythmias. Identifying the genetic variants that mediate antipsychotic induced prolongation may help to minimize this risk, which might prevent the removal of efficacious drugs from the market. We performed candidate gene analysis and five drug specific genome-wide association studies (GWAS) with 492K SNPs to search for genetic variation mediating antipsychotic induced QT prolongation in 738 schizophrenia patients from the Clinical Antipsychotic Trial of Intervention Effectiveness (CATIE) study.
Our candidate gene study suggests the involvement of NOS1AP and NUBPL (p-values =1.45×10−05 and 2.66×10−13, respectively). Furthermore, our top GWAS hit achieving genome-wide significance, defined as a q-value <0.10, (p-value =1.54×10−7, q-value =0.07), located in SLC22A23, mediated the effects of quetiapine on prolongation. SLC22A23 belongs to a family of organic ion transporters that shuttle a variety of compounds including drugs, environmental toxins, and endogenous metabolites across the cell membrane. This gene is expressed in the heart and is integral in mouse heart development. The genes mediating antipsychotic induced QT prolongation partially overlap with the genes affecting normal QT interval variation. However, some genes may also be unique for drug induced prolongation. This study demonstrates the potential of GWAS to discover genes and pathways that mediate antipsychotic induced QT prolongation.
candidate gene analysis; genome-wide association study; schizophrenia; adverse effects; CATIE
Schizophrenia is a complex genetic disorder. Gene set-based analytic (GSA) methods have been widely applied for exploratory analyses of large, high-throughput datasets, but less commonly employed for biological hypothesis testing. Our primary hypothesis is that variation in ion channel genes contributes to the genetic susceptibility to schizophrenia. We applied Exploratory Visual Analysis (EVA), one GSA application, to analyze European-American (EA) and African-American (AA) schizophrenia genome-wide association study datasets for statistical enrichment of ion channel gene sets, comparing GSA results derived under three SNP-to-gene mapping strategies: (1) GENIC; (2) 500-Kb; (3) 2.5-Mb and three complementary SNP-to-gene statistical reduction methods: (1) minimum p value (pMIN); (2) a novel method, proportion of SNPs per gene with p-values below a pre-defined α-threshold (PROP); and (3) the truncated product method (TPM). In the EA analyses, ion channel gene set(s) were enriched under all mapping and statistical approaches. In the AA analysis, ion channel gene set(s) were significantly enriched under pMIN for all mapping strategies and under PROP for broader mapping strategies. Less extensive enrichment in the AA sample may reflect true ethnic differences in susceptibility, sampling or case ascertainment differences, or higher dimensionality relative to sample size of the AA data. More consistent findings under broader mapping strategies may reflect enhanced power due to increased SNP inclusion, enhanced capture of effects over extended haplotypes or significant contributions from regulatory regions. While extensive pMIN findings may reflect gene size bias, the extent and significance of PROP and TPM findings suggest that common variation at ion channel genes may capture some of the heritability of schizophrenia.
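The three SNP-to-gene reduction statistics named above can be sketched as follows. This is an illustrative reading rather than the EVA implementation: the thresholds alpha and tau and the example p-values are assumptions, and the step that converts the TPM product into a p-value (analytically or by permutation) is omitted.

```python
from math import prod


def p_min(pvalues):
    """pMIN: the smallest SNP p-value mapped to the gene."""
    return min(pvalues)


def prop(pvalues, alpha=0.05):
    """PROP: proportion of the gene's SNPs with p-value below alpha."""
    return sum(p < alpha for p in pvalues) / len(pvalues)


def tpm_statistic(pvalues, tau=0.05):
    """TPM statistic: product of the p-values that are <= tau (1 if none)."""
    return prod(p for p in pvalues if p <= tau)


# Hypothetical SNP p-values mapped to one gene.
snp_pvalues = [0.001, 0.04, 0.2, 0.6]
summary = (p_min(snp_pvalues), prop(snp_pvalues), tpm_statistic(snp_pvalues))
print(summary)
```

The sketch also shows why pMIN is prone to gene size bias: a gene with more mapped SNPs has more chances to contain one small p-value, whereas PROP normalizes by the number of SNPs.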
Millions of genetic variants have been assessed for their effects on traits of interest in genome-wide association studies (GWAS). Complex traits are affected by sets of inter-related genes. However, the typical GWAS examines the association of only a single genetic variant at a time. The effects of individual variants on a complex trait are usually small, and the simple sum of these individual effects may not reflect the holistic effect of the genetic system. High-throughput methods enable genomic studies to produce a large amount of data to expand the knowledge base of biological systems. Biological networks and pathways are built to represent the functional or physical connectivity among genes. Integrated with GWAS data, network- and pathway-based methods complement single genetic variant analysis and may improve the power to identify trait-associated genes. By taking advantage of biological knowledge, these approaches are valuable for interpreting the functional role of genetic variants and for further understanding the molecular mechanisms influencing the traits. Network- and pathway-based methods have demonstrated their utility, and will be increasingly important in addressing a number of challenges facing mainstream GWAS.
Genome-wide association studies (GWAS) aim to identify causal variants and genes for complex disease by independently testing a large number of SNP markers for disease association. Although genes have been implicated in these studies, few utilise the multiple-hit model of complex disease to identify causal candidates. A major benefit of multi-locus comparison is that it compensates for some shortcomings of current statistical analyses that test the frequency of each SNP in isolation for the phenotype population versus control.
Here we developed and benchmarked several protocols for GWAS data analysis using different in-silico gene prediction and prioritisation methodologies. We adopted a high sensitivity approach to the data, using less conservative statistical SNP associations. Multiple gene search spaces, either of fixed-widths or proximity-based, were generated around each SNP marker. We used the candidate disease gene prediction system Gentrepid to identify candidates based on shared biomolecular pathways or domain-based protein homology. Predictions were made either with phenotype-specific known disease genes as input; or without a priori knowledge, by exhaustive comparison of genes in distinct loci. Because Gentrepid uses biomolecular data to find interactions and common features between genes in distinct loci of the search spaces, it takes advantage of the multi-locus aspect of the data.
Results suggest testing multiple SNP-to-gene search spaces compensates for differences in phenotypes, populations and SNP platforms. Surprisingly, domain-based homology information was more informative when benchmarked against gene candidates reported by GWA studies compared to previously determined disease genes, possibly suggesting a larger contribution of gene homologs to complex diseases than Mendelian diseases.
Mood disorders are highly heritable forms of major mental illness. A major breakthrough in elucidating the genetic architecture of mood disorders was anticipated with the advent of genome-wide association studies (GWAS). However, to date few susceptibility loci have been conclusively identified. The genetic etiology of mood disorders appears to be quite complex, and as a result, alternative approaches for analyzing GWAS data are needed. Recently, a polygenic scoring approach that captures the effects of alleles across multiple loci was successfully applied to the analysis of GWAS data in schizophrenia and bipolar disorder (BP). However, this method may be overly simplistic in its approach to the complexity of genetic effects. Data mining methods are available that may be applied to analyze the high dimensional data generated by GWAS of complex psychiatric disorders. We sought to compare the performance of five data mining methods, namely, Bayesian Networks (BN), Support Vector Machine (SVM), Random Forest (RF), Radial Basis Function network (RBF), and Logistic Regression (LR), against the polygenic scoring approach in the analysis of GWAS data on BP. The different classification methods were trained on GWAS datasets from the Bipolar Genome Study (2,191 cases with BP and 1,434 controls) and their ability to accurately classify case/control status was tested on a GWAS dataset from the Wellcome Trust Case Control Consortium. The performance of the classifiers in the test dataset was evaluated by comparing area under the receiver operating characteristic curves (AUC). BN performed the best of all the data mining classifiers, but none of these did significantly better than the polygenic score approach. We further examined a subset of SNPs in genes that are expressed in the brain, under the hypothesis that these might be most relevant to BP susceptibility, but all the classifiers performed worse with this reduced set of SNPs. 
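For readers unfamiliar with the evaluation metric, the sketch below computes a simple polygenic score and the area under the ROC curve (AUC) from case and control scores. It is a generic illustration, not the study's pipeline: the SNP names, weights, genotypes, and scores are invented, and the AUC is computed by the pair-counting definition, in which ties count as one half.

```python
def polygenic_score(genotype, weights):
    """Weighted sum of risk-allele counts (0, 1 or 2) over the scored SNPs."""
    return sum(weights[snp] * genotype.get(snp, 0) for snp in weights)


def auc(case_scores, control_scores):
    """Probability that a random case scores higher than a random control."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical per-SNP weights and one individual's genotype.
weights = {"snpA": 0.2, "snpB": 0.1}
score = polygenic_score({"snpA": 2, "snpB": 1}, weights)

# Hypothetical classifier outputs for cases and controls in a test set.
test_auc = auc([0.9, 0.7, 0.4], [0.6, 0.3, 0.2])
print(round(score, 2), round(test_auc, 3))
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation, which is the scale on which the classifiers and the polygenic score are compared.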
The discriminative accuracy of all of these methods is unlikely to be of diagnostic or clinical utility at the present time. Further research is needed to develop strategies for selecting sets of SNPs likely to be relevant to disease susceptibility and to determine if other data mining classifiers that utilize other algorithms for inferring relationships among the sets of SNPs may perform better.
data mining; Genome-Wide Association; Mood Disorders
Genome-wide association studies (GWAS) are a valuable approach to understanding the genetic basis of complex traits. One of the challenges of GWAS is the translation of genetic association results into biological hypotheses suitable for further investigation in the laboratory. To address this challenge, we introduce Network Interface Miner for Multigenic Interactions (NIMMI), a network-based method that combines GWAS data with human protein-protein interaction data (PPI). NIMMI builds biological networks weighted by connectivity, which is estimated by use of a modification of the Google PageRank algorithm. These weights are then combined with genetic association p-values derived from GWAS, producing what we call ‘trait prioritized sub-networks.’ As a proof of principle, NIMMI was tested on three GWAS datasets previously analyzed for height, a classical polygenic trait. Despite differences in sample size and ancestry, NIMMI captured 95% of the known height associated genes within the top 20% of ranked sub-networks, far better than what could be achieved by a single-locus approach. The top 2% of NIMMI height-prioritized sub-networks were significantly enriched for genes involved in transcription, signal transduction, transport, and gene expression, as well as nucleic acid, phosphate, protein, and zinc metabolism. All of these sub-networks were ranked near the top across all three height GWAS datasets we tested. We also tested NIMMI on a categorical phenotype, Crohn’s disease. NIMMI prioritized sub-networks involved in B- and T-cell receptor, chemokine, interleukin, and other pathways consistent with the known autoimmune nature of Crohn’s disease. NIMMI is a simple, user-friendly, open-source software tool that efficiently combines genetic association data with biological networks, translating GWAS findings into biological hypotheses.
The gene-based association approach can be regarded as a complementary analysis to single-SNP association analysis. We meta-analyzed findings from the gene-based association approach using genome-wide association study (GWAS) data from Chinese and European subjects, confirmed several well-established bone mineral density (BMD) genes, and suggested several novel BMD genes.
The introduction of GWAS has greatly increased the number of genes that are known to be associated with common diseases. Nonetheless, single-SNP GWAS has lower power to detect genes with multiple causal variants. We aimed to assess the association of each gene with BMD variation at the spine and hip using a gene-based GWAS approach.
We studied 778 Hong Kong Southern Chinese (HKSC) women and 5,858 Northern Europeans (dCG); age, sex, and weight were adjusted in the model. The main outcome measure was BMD at the spine and hip.
Nine genes showed suggestive p values in HKSC, while 4 and 17 genes showed significant and suggestive p values, respectively, in dCG. Meta-analysis using the weighted Z-transformed test confirmed several known BMD genes and suggested some novel ones at 1q21.3, 9q22, 9q33.2, 20p13, and 20q12. Top BMD genes were significantly associated with connective tissue, skeletal, and muscular system development and function (p < 0.05). Gene network inference revealed that a large number of these genes were significantly connected with each other to form a functional gene network, and several signaling pathways were strongly connected with these gene networks.
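The weighted Z-transformed test mentioned above can be sketched as follows. This is a minimal illustration of the general technique (often called the weighted Stouffer method), not the authors' code: each one-sided p-value is converted to a Z score, and the Z scores are combined with per-study weights. The abstract does not state which weights were used; the square root of the sample size is a common choice and is an assumption here, as is the function name.

```python
from math import sqrt
from statistics import NormalDist

def weighted_z_meta(pvalues, sample_sizes):
    """Combine one-sided p-values with the weighted Z-transform test.

    Weights are sqrt(sample size) (an assumed, commonly used choice).
    Each p-value must lie strictly between 0 and 1.
    """
    norm = NormalDist()
    weights = [sqrt(n) for n in sample_sizes]
    # Small p-values map to large positive Z scores.
    z_scores = [norm.inv_cdf(1.0 - p) for p in pvalues]
    z = sum(w * zi for w, zi in zip(weights, z_scores))
    z /= sqrt(sum(w * w for w in weights))
    return 1.0 - norm.cdf(z)

# Sample sizes from the text (778 HKSC, 5,858 dCG); the p-values are made up.
combined_p = weighted_z_meta([0.01, 0.01], [778, 5858])
```

Because the weights grow with sample size, a signal in the larger cohort moves the combined statistic more than the same signal in the smaller cohort.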
Our gene-based GWAS confirmed several BMD genes and suggested several novel BMD genes. Genetic contribution to BMD variation may operate through multiple genes identified in this study in functional gene networks. This finding may be useful in identifying and prioritizing candidate genes/loci for further study.
Electronic supplementary material
The online version of this article (doi:10.1007/s00198-011-1779-7) contains supplementary material, which is available to authorized users.
Association study; Bone mineral density; Genetic epidemiology; Meta-analysis; Osteoporosis
Gene set analysis (GSA) has become a successful tool to interpret gene expression profiles in terms of biological functions, molecular pathways, or genomic locations. GSA performs statistical tests for independent microarray samples at the level of gene sets rather than individual genes. Nowadays, an increasing number of microarray studies are conducted to explore the dynamic changes of gene expression in a variety of species and biological scenarios. In these longitudinal studies, gene expression is repeatedly measured over time such that a GSA needs to take into account the within-gene correlations in addition to possible between-gene correlations.
We provide a robust nonparametric approach to compare the expressions of longitudinally measured sets of genes under multiple treatments or experimental conditions. The limiting distributions of our statistics are derived when the number of genes goes to infinity while the number of replications can be small. When the number of genes in a gene set is small, we recommend permutation tests based on our nonparametric test statistics to achieve reliable type I error and better power while incorporating unknown correlations between and within genes. Simulation results demonstrate that the proposed method has greater power than other methods for various data distributions and heteroscedastic correlation structures. The method was applied to an IL-2 stimulation study, in which significantly altered gene sets were identified.
The simulation study and the real data application showed that the proposed gene set analysis provides a promising tool for longitudinal microarray analysis. R scripts for simulating longitudinal data and calculating the nonparametric statistics are posted on the North Dakota INBRE website http://ndinbre.org/programs/bioinformatics.php. Raw microarray data is available in Gene Expression Omnibus (National Center for Biotechnology Information) with accession number GSE6085.
We conducted data-mining analyses using the Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE) and molecular genetics of schizophrenia genome-wide association study supported by the genetic association information network (MGS-GAIN) schizophrenia data sets and performed bioinformatic prioritization for all the markers with P-values ≤0.05 in both data sets. In this process, we found that in the CMYA5 gene, there were two non-synonymous markers, rs3828611 and rs10043986, showing nominal significance in both the CATIE and MGS-GAIN samples. In a combined analysis of both the CATIE and MGS-GAIN samples, rs4704591 was identified as the most significant marker in the gene. Linkage disequilibrium analyses indicated that these markers were in low LD (rs3828611–rs10043986, r2 = 0.008; rs10043986–rs4704591, r2 = 0.204). In addition, CMYA5 was reported to be physically interacting with the DTNBP1 gene, a promising candidate for schizophrenia, suggesting that CMYA5 may be involved in the same biological pathway and process. On the basis of this information, we performed replication studies for these three single-nucleotide polymorphisms. Marker rs3828611 was found to have conflicting results in our Irish samples and was dropped without further investigation. The other two markers were verified in 23 other independent data sets. In a meta-analysis of all 23 replication samples (family samples, 912 families with 4160 subjects; case–control samples, 11 380 cases and 15 021 controls), we found that both markers are significantly associated with schizophrenia (rs10043986, odds ratio (OR) = 1.11, 95% confidence interval (CI) = 1.04–1.18, P = 8.2 × 10−4 and rs4704591, OR = 1.07, 95% CI = 1.03–1.11, P = 3.0 × 10−4). The results were also significant for the 22 Caucasian replication samples (rs10043986, OR = 1.11, 95% CI = 1.03–1.17, P = 0.0026 and rs4704591, OR = 1.07, 95% CI = 1.02–1.11, P = 0.0015).
Furthermore, haplotype conditioned analyses indicated that the association signals observed at these two markers are independent. On the basis of these results, we concluded that CMYA5 is associated with schizophrenia and further investigation of the gene is warranted.
association study; cardiomyopathy; GWA data mining; meta-analysis; schizophrenia
Complex diseases are commonly caused by multiple genes and their interactions with each other. Genome-wide association (GWA) studies provide us the opportunity to capture those disease associated genes and gene-gene interactions through panels of SNP markers. However, a proper filtering procedure is critical to reduce the search space prior to the computationally intensive gene-gene interaction identification step. In this study, we show that two commonly used SNP-SNP interaction filtering algorithms, ReliefF and tuned ReliefF (TuRF), are sensitive to the order of the samples in the dataset, giving rise to unstable and suboptimal results. However, we observe that the ‘unstable’ results from multiple runs of these algorithms can provide valuable information about the dataset. We therefore hypothesize that aggregating results from multiple runs of the algorithm may improve the filtering performance.
We propose a simple and effective ensemble approach in which the results from multiple runs of an unstable filter are aggregated based on the general theory of ensemble learning. The ensemble versions of the ReliefF and TuRF algorithms, referred to as ReliefF-E and TuRF-E, are robust to sample order dependency and enable a more informative investigation of data characteristics. Using simulated and real datasets, we demonstrate that both the ensemble of ReliefF and the ensemble of TuRF can generate a much more stable SNP ranking than the original algorithms. Furthermore, the ensemble of TuRF achieved the highest success rate in comparison to many state-of-the-art algorithms as well as traditional χ2-test and odds ratio methods in terms of retaining gene-gene interactions.
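The aggregation step can be sketched generically. The example below is not ReliefF-E or TuRF-E: it wraps an arbitrary order-sensitive scoring function, reruns it on shuffled sample orders, and averages the per-SNP scores before ranking. The abstract does not say whether scores or ranks are aggregated, so averaging scores is an assumption, and `toy_filter` is a hypothetical stand-in that is order-sensitive only because it looks at the first `k` samples.

```python
import random

def ensemble_ranking(score_fn, samples, n_runs=50, seed=0):
    """Average per-SNP scores over runs on shuffled sample orders, then rank.

    samples: list of (genotypes, label) tuples, genotypes being a list of
    per-SNP values. Returns (SNP indices sorted best first, mean scores).
    """
    rng = random.Random(seed)
    n_snps = len(samples[0][0])
    totals = [0.0] * n_snps
    for _ in range(n_runs):
        shuffled = samples[:]          # leave the caller's order untouched
        rng.shuffle(shuffled)
        for j, s in enumerate(score_fn(shuffled)):
            totals[j] += s
    mean_scores = [t / n_runs for t in totals]
    ranking = sorted(range(n_snps), key=lambda j: mean_scores[j], reverse=True)
    return ranking, mean_scores

def toy_filter(samples, k=20):
    """Hypothetical order-sensitive filter: scores SNPs from the first k samples."""
    head = samples[:k]
    scores = []
    for j in range(len(head[0][0])):
        cases = [g[j] for g, y in head if y == 1]
        ctrls = [g[j] for g, y in head if y == 0]
        if not cases or not ctrls:
            scores.append(0.0)
        else:
            # Absolute difference in mean genotype between cases and controls.
            scores.append(abs(sum(cases) / len(cases) - sum(ctrls) / len(ctrls)))
    return scores
```

A single call to `toy_filter` depends on which samples happen to come first; the ensemble result depends only on the data and the seed, which is the stability property the ensemble versions are designed to provide.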
Interactions among genomic loci (also known as epistasis) have been suggested as one of the potential sources of missing heritability in single locus analysis of genome-wide association studies (GWAS). The computational burden of searching for interactions is compounded by the extremely low threshold for identifying significant p-values due to multiple hypothesis testing corrections. Utilizing prior biological knowledge to restrict the set of candidate SNP pairs to be tested can alleviate this problem, but systematic studies that investigate the relative merits of integrating different biological frameworks and GWAS data have not been conducted.
We developed four biologically based frameworks to identify pairwise interactions among candidate SNP pairs as follows: (1) for each human protein-coding gene, a set of SNPs associated with that gene was constructed providing a gene-based interaction model, (2) for each known biological pathway, a set of SNPs associated with the genes in the pathway was constructed providing a pathway-based interaction model, (3) for a disease-related subnetwork, a set of SNPs associated with the genes in the subnetwork was constructed providing a network-based interaction model, and (4) SNP pairs were defined by the function of SNPs, providing a function-based interaction model. The last approach uses expression SNPs (eSNPs or eQTLs), which are SNPs or loci that have defined effects on the abundance of transcripts of other genes. We constructed pairs of eSNPs and SNPs located in the target genes whose expression is regulated by eSNPs. For all four frameworks the SNP sets were exhaustively tested for pairwise interactions within the sets using a traditional logistic regression model after excluding genes that were previously identified to associate with the trait. Using previously published GWAS data for type 2 diabetes (T2D) and the biologically based pairwise interaction modeling, we identified twelve genes not seen in the previous single locus analysis.
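To make the search-space reduction concrete, the sketch below enumerates candidate SNP pairs within each biologically defined set (for example, one set per gene or per pathway) and compares the count with the exhaustive all-pairs search. It covers only the pair enumeration, not the logistic regression interaction test; the function names and SNP identifiers are hypothetical.

```python
from itertools import combinations

def candidate_pairs(snp_sets):
    """Enumerate unique within-set SNP pairs.

    snp_sets: dict mapping a set name (gene, pathway, ...) to SNP ids.
    Returns a sorted list of (snp_a, snp_b) tuples with snp_a < snp_b.
    """
    pairs = set()
    for snps in snp_sets.values():
        # Sorting makes (a, b) and (b, a) collapse to one canonical pair.
        for a, b in combinations(sorted(set(snps)), 2):
            pairs.add((a, b))
    return sorted(pairs)

def exhaustive_pair_count(snp_sets):
    """Number of pairs if every SNP were tested against every other SNP."""
    n = len({s for snps in snp_sets.values() for s in snps})
    return n * (n - 1) // 2

# Hypothetical SNP ids grouped by hypothetical genes.
toy_sets = {"geneA": ["snp1", "snp2", "snp3"], "geneB": ["snp3", "snp4"]}
restricted = candidate_pairs(toy_sets)
```

Each retained pair would then be passed to the interaction test, so the multiple-testing burden scales with the number of within-set pairs rather than with all possible pairs.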
We present four approaches to detect interactions associated with complex diseases. The results show our approaches outperform the traditional single locus approaches in detecting genes that previously did not reach significance; the results also provide novel drug targets and biomarkers relevant to the underlying mechanisms of disease.