A major challenge in interpreting the large volume of mutation data identified by next-generation sequencing (NGS) is to distinguish driver mutations from neutral passenger mutations to facilitate the identification of targetable genes and new drugs. Current approaches are primarily based on mutation frequencies of single-genes, which lack the power to detect infrequently mutated driver genes and ignore functional interconnection and regulation among cancer genes. We propose a novel mutation network method, VarWalker, to prioritize driver genes in large scale cancer mutation data. VarWalker fits generalized additive models for each sample based on sample-specific mutation profiles and builds on the joint frequency of both mutation genes and their close interactors. These interactors are selected and optimized using the Random Walk with Restart algorithm in a protein-protein interaction network. We applied the method in >300 tumor genomes in two large-scale NGS benchmark datasets: 183 lung adenocarcinoma samples and 121 melanoma samples. In each cancer, we derived a consensus mutation subnetwork containing significantly enriched consensus cancer genes and cancer-related functional pathways. These cancer-specific mutation networks were then validated using independent datasets for each cancer. Importantly, VarWalker prioritizes well-known, infrequently mutated genes, which are shown to interact with highly recurrently mutated genes yet have been ignored by conventional single-gene-based approaches. Utilizing VarWalker, we demonstrated that network-assisted approaches can be effectively adapted to facilitate the detection of cancer driver genes in NGS data.
A cancer genome typically harbors both driver mutations, which contribute to tumorigenesis, and passenger mutations, which tend to be neutral and occur randomly. Cancer genomes differ dramatically due to genetic and environmental factors. A major challenge in interpreting the large volume of mutation data identified in cancer genomes using next-generation sequencing (NGS) is to distinguish driver mutations from neutral passenger mutations. We propose a novel mutation network method, VarWalker, to prioritize driver genes in large scale cancer mutation data. Applying our approach in a large cohort of lung adenocarcinoma samples and melanoma samples, we derived a consensus mutation subnetwork for each cancer containing significantly enriched cancer genes and cancer-related functional pathways. Our results indicated that driver genes occur within a broad spectrum of frequency, interact with each other, and converge in several key pathways that play critical roles in tumorigenesis.
Recent studies have demonstrated that gene set analysis, which tests disease association with genetic variants in a group of functionally related genes, is a promising approach for analyzing and interpreting genome-wide association studies (GWAS) data. These approaches aim to increase power by combining association signals from multiple genes in the same gene set. In addition, gene set analysis can also shed more light on the biological processes underlying complex diseases. However, current approaches for gene set analysis are still in an early stage of development in that analysis results are often prone to sources of bias, including gene set size and gene length, linkage disequilibrium patterns and the presence of overlapping genes. In this paper, we provide an in-depth review of the gene set analysis procedures, along with parameter choices and the particular methodology challenges at each stage. In addition to providing a survey of recently developed tools, we also classify the analysis methods into larger categories and discuss their strengths and limitations. In the last section, we outline several important areas for improving the analytical strategies in gene set analysis.
Genome-wide association study; Gene set; Pathway; Gene-set enrichment analysis; Statistical significance; Complex disease
A number of genetic studies have suggested numerous susceptibility genes for dental caries over the past decade with few definite conclusions. The rapid accumulation of relevant information, along with the complex architecture of the disease, provides a challenging but also unique opportunity to review and integrate the heterogeneous data for follow-up validation and exploration. In this study, we collected and curated candidate genes from four major categories: association studies, linkage scans, gene expression analyses, and literature mining. Candidate genes were prioritized according to the magnitude of evidence related to dental caries. We then searched for dense modules enriched with the prioritized candidate genes through their protein-protein interactions (PPIs). We identified 23 modules comprising of 53 genes. Functional analyses of these 53 genes revealed three major clusters: cytokine network relevant genes, matrix metalloproteinases (MMPs) family, and transforming growth factor-beta (TGF-β) family, all of which have been previously implicated to play important roles in tooth development and carious lesions. Through our extensive data collection and an integrative application of gene prioritization and PPI network analyses, we built a dental caries-specific sub-network for the first time. Our study provided insights into the molecular mechanisms underlying dental caries. The framework we proposed in this work can be applied to other complex diseases.
Copy number variation (CNV) is a prevalent form of critical genetic variation that leads to an abnormal number of copies of large genomic regions in a cell. Microarray-based comparative genome hybridization (arrayCGH) or genotyping arrays have been standard technologies to detect large regions subject to copy number changes in genomes until most recently high-resolution sequence data can be analyzed by next-generation sequencing (NGS). During the last several years, NGS-based analysis has been widely applied to identify CNVs in both healthy and diseased individuals. Correspondingly, the strong demand for NGS-based CNV analyses has fuelled development of numerous computational methods and tools for CNV detection. In this article, we review the recent advances in computational methods pertaining to CNV detection using whole genome and whole exome sequencing data. Additionally, we discuss their strengths and weaknesses and suggest directions for future development.
Kinase inhibitors are accepted treatment for metastatic melanomas that harbor specific driver mutations in BRAF or KIT, but only 40–50% of cases are positive. To uncover other potential targetable mutations, we performed whole-genome sequencing of a highly aggressive BRAF (V600) and KIT (W557, V559, L576, K642, D816) wildtype melanoma. Surprisingly, we found a somatic BRAF L597R mutation in exon 15. Analysis of BRAF exon 15 in 49 tumors negative for BRAF V600 mutations as well as driver mutations in KIT, NRAS, GNAQ, and GNA11, showed that 2 (4%) harbored L597 mutations and another 2 involved BRAF D594 and K601 mutations. In vitro signaling induced by L597R/S/Q mutants was suppressed by MEK inhibition. A patient with BRAF L597S mutant metastatic melanoma responded significantly to treatment with the MEK inhibitor, TAK-733. Collectively, these data demonstrate clinical significance to BRAF L597 mutations in melanoma.
melanoma; BRAF L597; whole genome sequencing; BRAF inhibitor; MEK inhibitor; TAK-733
Gene set-based analysis of genome-wide association study (GWAS) data has recently emerged as a useful approach to examine the joint effects of multiple risk loci in complex human diseases or phenotypes. Dental caries is a common, chronic, and complex disease leading to a decrease in quality of life worldwide. In this study, we applied the approaches of gene set enrichment analysis to a major dental caries GWAS dataset, which consists of 537 cases and 605 controls. Using four complementary gene set analysis methods, we analyzed 1331 Gene Ontology (GO) terms collected from the Molecular Signatures Database (MSigDB). Setting false discovery rate (FDR) threshold as 0.05, we identified 13 significantly associated GO terms. Additionally, 17 terms were further included as marginally associated because they were top ranked by each method, although their FDR is higher than 0.05. In total, we identified 30 promising GO terms, including ‘Sphingoid metabolic process,’ ‘Ubiquitin protein ligase activity,’ ‘Regulation of cytokine secretion,’ and ‘Ceramide metabolic process.’ These GO terms encompass broad functions that potentially interact and contribute to the oral immune response related to caries development, which have not been reported in the standard single marker based analysis. Collectively, our gene set enrichment analysis provided complementary insights into the molecular mechanisms and polygenic interactions in dental caries, revealing promising association signals that could not be detected through single marker analysis of GWAS data.
Integrating evidence from multiple domains is useful in prioritizing disease candidate genes for subsequent testing. We ranked all known human genes (n = 3819) under linkage peaks in the Irish Study of High-Density Schizophrenia Families using three different evidence domains: 1) a meta-analysis of microarray gene expression results using the Stanley Brain collection, 2) a schizophrenia protein-protein interaction network, and 3) a systematic literature search. Each gene was assigned a domain-specific p-value and ranked after evaluating the evidence within each domain. For comparison to this ranking process, a large-scale candidate gene hypothesis was also tested by including genes with Gene Ontology terms related to neurodevelopment. Subsequently, genotypes of 3725 SNPs in 167 genes from a custom Illumina iSelect array were used to evaluate the top ranked vs. hypothesis selected genes. Seventy-three genes were both highly ranked and involved in neurodevelopment (category 1) while 42 and 52 genes were exclusive to neurodevelopment (category 2) or highly ranked (category 3), respectively. The most significant associations were observed in genes PRKG1, PRKCE, and CNTN4 but no individual SNPs were significant after correction for multiple testing. Comparison of the approaches showed an excess of significant tests using the hypothesis-driven neurodevelopment category. Random selection of similar sized genes from two independent genome-wide association studies (GWAS) of schizophrenia showed the excess was unlikely by chance. In a further meta-analysis of three GWAS datasets, four candidate SNPs reached nominal significance. Although gene ranking using integrated sources of prior information did not enrich for significant results in the current experiment, gene selection using an a priori hypothesis (neurodevelopment) was superior to random selection. As such, further development of gene ranking strategies using more carefully selected sources of information is warranted.
Background and Methods
Esophageal adenocarcinoma (EAC) is characterized by a steep rise in incidence rates in the Western population. The unique miRNA signature that distinguishes EAC from other upper gastrointestinal cancers remains unclear. Herein, we performed a comprehensive microarray profiling for the specific miRNA signature associated with EAC. We validated this signature by qRT-PCR.
Microarray analysis showed that 21 miRNAs were consistently deregulated in EAC. miR-194, miR-192, miR-200a, miR-21, miR-203, miR-205, miR-133b, and miR-31 were selected for validation using 46 normal squamous (NS), 23 Barrett’s esophagus (BE), 17 Barrett’s high grade dysplasia (HGD), 34 EAC, 33 gastric adenocarcinoma (GC), and 45 normal gastric (NG) tissues. The qRT-PCR analysis indicated that 2 miRNAs (miR-21 and miR-133b) were deregulated in both EAC and GC, and 6 miRNAs (up-regulated: miR-194, miR-31, miR-192, and miR-200a; down-regulated: miR-203 and miR-205) in EAC, as compared to BE but not in GC, indicating their potential unique role in EAC. Our data showed that miR-194, miR-192, miR-21, and miR-31 were up-regulated in BE adjacent to HGD lesions relative to isolated BE samples. Analysis of clinicopathological features indicated that down-regulation of miR-203 is significantly associated with progression and tumor stages in EAC. Interestingly, the overexpression levels of miR-194, miR-200a, and miR-192 were significantly higher in early EAC stages, suggesting that these miRNAs may be involved in EAC tumor development rather than progression.
Our findings demonstrate the presence of a unique miRNA signature for EAC. This may provide some clues for the distinct molecular features of EAC to be considered in future studies of the role of miRNAs in EAC and their utility as disease biomarkers.
Next generation sequencing (NGS) technologies allow us to explore virus interactions with host genomes that lead to carcinogenesis or other diseases; however, this effort is largely hindered by the dearth of efficient computational tools. Here, we present a new tool, VirusFinder, for the identification of viruses and their integration sites in host genomes using NGS data, including whole transcriptome sequencing (RNA-Seq), whole genome sequencing (WGS), and targeted sequencing data. VirusFinder’s unique features include the characterization of insertion loci of virus of arbitrary type in the host genome and high accuracy and computational efficiency as a result of its well-designed pipeline. The source code as well as additional data of VirusFinder is publicly available at http://bioinfo.mc.vanderbilt.edu/VirusFinder/.
Pathway analysis of large-scale omics data assists us with the examination of the cumulative effects of multiple functionally related genes, which are difficult to detect using the traditional single gene/marker analysis. So far, most of the genomic studies have been conducted in a single domain, e.g., by genome-wide association studies (GWAS) or microarray gene expression investigation. A combined analysis of disease susceptibility genes across multiple platforms at the pathway level is an urgent need because it can reveal more reliable and more biologically important information.
We performed an integrative pathway analysis of a GWAS dataset and a microarray gene expression dataset in prostate cancer. We obtained a comprehensive pathway annotation set from knowledge-based public resources, including KEGG pathways and the prostate cancer candidate gene set, and gene sets specifically defined based on cross-platform information. By leveraging on this pathway collection, we first searched for significant pathways in the GWAS dataset using four methods, which represent two broad groups of pathway analysis approaches. The significant pathways identified by each method varied greatly, but the results were more consistent within each method group than between groups. Next, we conducted a gene set enrichment analysis of the microarray gene expression data and found 13 pathways with cross-platform evidence, including "Fc gamma R-mediated phagocytosis" (PGWAS = 0.003, Pexpr < 0.001, and Pcombined = 6.18 × 10-8), "regulation of actin cytoskeleton" (PGWAS = 0.003, Pexpr = 0.009, and Pcombined = 3.34 × 10-4), and "Jak-STAT signaling pathway" (PGWAS = 0.001, Pexpr = 0.084, and Pcombined = 8.79 × 10-4).
Our results provide evidence at both the genetic variation and expression levels that several key pathways might have been involved in the pathological development of prostate cancer. Our framework that employs gene expression data to facilitate pathway analysis of GWAS data is not only feasible but also much needed in studying complex disease.
A variety of species and experimental designs have been used to study genetic influences on alcohol dependence, ethanol response, and related traits. Integration of these heterogeneous data can be used to produce a ranked target gene list for additional investigation.
In this study, we performed a unique multi-species evidence-based data integration using three microarray experiments in mice or humans that generated an initial alcohol dependence (AD) related genes list, human linkage and association results, and gene sets implicated in C. elegans and Drosophila. We then used permutation and false discovery rate (FDR) analyses on the genome-wide association studies (GWAS) dataset from the Collaborative Study on the Genetics of Alcoholism (COGA) to evaluate the ranking results and weighting matrices. We found one weighting score matrix could increase FDR based q-values for a list of 47 genes with a score greater than 2. Our follow up functional enrichment tests revealed these genes were primarily involved in brain responses to ethanol and neural adaptations occurring with alcoholism.
These results, along with our experimental validation of specific genes in mice, C. elegans and Drosophila, suggest that a cross-species evidence-based approach is useful to identify candidate genes contributing to alcoholism.
Genome-wide association studies (GWAS) have generated a wealth of valuable genotyping data for complex diseases/traits. A large proportion of these data are embedded with many weakly associated markers that have been missed in traditional single marker analyses, but they may provide valuable insights in dissecting the genetic components of diseases. Gene set analysis (GSA) augmented by protein-protein interaction network data provides a promising way to examine GWAS data by analyzing the combined effects of multiple genes/markers, each of which may have only individually weak to moderate association effects. A critical issue in GSA of GWAS data is the definition of gene-wise P values based on multiple SNPs mapped to a gene.
In this study, we proposed an alternative restricted search approach based on our previously developed dense module search algorithm, and we demonstrated it in the CATIE GWAS dataset for schizophrenia. Specifically, we explored three ways of computing gene-wise P values and examined their effects on the resultant module genes. These methods calculate gene-wise P values based on all the SNPs, the top ranked SNPs, or the most significant SNP among all the SNPs mapped to a gene. We applied the restricted search approach and identified a module gene set for each of the gene-wise P value data set. In our evaluation using an independent method, ALIGATOR, we showed that although each of these input datasets generated a unique set of module genes, all of them were significant in the GWAS dataset. Further functional enrichment analysis of these module genes showed that at the pathway level, they were all consistently related to neuro- and immune-related pathways. Finally, we compared our method with a previously reported method.
Our results showed that the approaches to computing gene-wise P values in GWAS data are critical in GSA. This work is useful for evaluating key factors in GSA of GWAS data.
QT prolongation is associated with increased risk of cardiac arrhythmias. Identifying the genetic variants that mediate antipsychotic induced prolongation may help to minimize this risk, which might prevent the removal of efficacious drugs from the market. We performed candidate gene analysis and five drug specific genome-wide association studies (GWAS) with 492K SNPs to search for genetic variation mediating antipsychotic induced QT prolongation in 738 schizophrenia patients from the Clinical Antipsychotic Trial of Intervention Effectiveness (CATIE) study.
Our candidate gene study suggests the involvement of NOS1AP and NUBPL (p-values =1.45×10−05 and 2.66×10−13, respectively). Furthermore, our top GWAS hit achieving genome-wide significance, defined as a q-value <0.10, (p-value =1.54×10−7, q-value =0.07), located in SLC22A23, mediated the effects of quetiapine on prolongation. SLC22A23 belongs to a family of organic ion transporters that shuttle a variety of compounds including drugs, environmental toxins, and endogenous metabolites across the cell membrane. This gene is expressed in the heart and is integral in mouse heart development. The genes mediating antipsychotic induced QT prolongation partially overlap with the genes affecting normal QT interval variation. However, some genes may also be unique for drug induced prolongation. This study demonstrates the potential of GWAS to discover genes and pathways that mediate antipsychotic induced QT prolongation.
candidate gene analysis; genome-wide association study; schizophrenia; adverse effects; CATIE
The recent success of genome-wide association (GWA) studies has greatly expanded our understanding of many complex diseases by delivering previously unknown loci and genes. A large number of GWAS datasets have already been made available, with more being generated. To explore the underlying moderate and weak signals, we recently developed a network-based dense module search (DMS) method for identification of disease candidate genes from GWAS datasets, leveraging on the joint effect of multiple genes. DMS is designed to dynamically search for the best nodes in a step-wise fashion and, thus, could overcome the limitation of pre-defined gene sets. Here, we propose an improved version of DMS, the topologically-adjusted DMS, to facilitate the analysis of complex diseases. Building on the previous version of DMS, we improved the randomization process by taking into account the topological character, aiming to adjust the bias potentially caused by high-degree nodes in the whole network. We demonstrated the topologically-adjusted DMS algorithm in a GWAS dataset for schizophrenia. We found the improved DMS strategy could effectively identify candidate genes while reducing the burden of high-degree nodes. In our evaluation, we found more candidate genes identified by the topologically-adjusted DMS algorithm have been reported in the previous association studies, suggesting this new algorithm has better performance than the unweighted DMS algorithm. Finally, our functional analysis of the top module genes revealed that they are enriched in immune-related pathways.
dmGWAS; dense module search; GWAS; schizophrenia; network; gene set enrichment analysis
With the recent success of genome-wide association studies (GWAS), a wealth of association data has been accomplished for more than 200 complex diseases/traits, proposing a strong demand for data integration and interpretation. A combinatory analysis of multiple GWAS datasets, or an integrative analysis of GWAS data and other high-throughput data, has been particularly promising. In this study, we proposed an integrative analysis framework of multiple GWAS datasets by overlaying association signals onto the protein-protein interaction network, and demonstrated it using schizophrenia datasets. Building on a dense module search algorithm, we first searched for significantly enriched subnetworks for schizophrenia in each single GWAS dataset and then implemented a discovery-evaluation strategy to identify module genes with consistent association signals. We validated the module genes in an independent dataset, and also examined them through meta-analysis of the related SNPs using multiple GWAS datasets. As a result, we identified 205 module genes with a joint effect significantly associated with schizophrenia; these module genes included a number of well-studied candidate genes such as DISC1, GNA12, GNA13, GNAI1, GPR17, and GRIN2B. Further functional analysis suggested these genes are involved in neuronal related processes. Additionally, meta-analysis found that 18 SNPs in 9 module genes had Pmeta<1×10−4, including the gene HLA-DQA1 located in the MHC region on chromosome 6, which was reported in previous studies using the largest cohort of schizophrenia patients to date. These results demonstrated our bi-directional network-based strategy is efficient for identifying disease-associated genes with modest signals in GWAS datasets. This approach can be applied to any other complex diseases/traits where multiple GWAS datasets are available.
The recent success of genome-wide association studies (GWAS) has generated a wealth of genotyping data critical to studies of genetic architectures of many complex diseases. In contrast to traditional single marker analysis, an integrative analysis of multiple genes and the assessment of their joint effects have been particularly promising, especially upon the availability of many GWAS datasets and other high-throughput datasets for numerous complex diseases. In this study, we developed an integrative analysis framework for multiple GWAS datasets and demonstrated it in schizophrenia. We first constructed a GWAS-weighted protein-protein interaction (PPI) network and then applied a dense module search algorithm to identify subnetworks with combinatory disease effects. We applied combinatorial criteria for module selection based on permutation tests to determine whether the modules are significantly different from random gene sets and whether the modules are associated with the disease in investigation. Importantly, considering there are many complex diseases with multiple GWAS datasets available, we proposed a discovery-evaluation strategy to search for modules with consistent combined effects from two or more GWAS datasets. This approach can be applied to any diseases or traits that have two or more GWAS datasets available.
A critical step in detecting variants from next-generation sequencing data is post hoc filtering of putative variants called or predicted by computational tools. Here, we highlight four critical parameters that could enhance the accuracy of called single nucleotide variants and insertions/deletions: quality and deepness, refinement and improvement of initial mapping, allele/strand balance, and examination of spurious genes. Use of these sequence features appropriately in variant filtering could greatly improve validation rates, thereby saving time and costs in next-generation sequencing projects.
Protein-protein interaction (PPI) network analysis has been widely applied in the investigation of the mechanisms of diseases, especially cancer. Recent studies revealed that cancer proteins tend to interact more strongly than other categories of proteins, even essential proteins, in the human interactome. However, it remains unclear whether this observation was introduced by the bias towards more cancer studies in humans. Here, we examined this important issue by uniquely comparing network characteristics of cancer proteins with three other sets of proteins in four organisms, three of which (fly, worm, and yeast) whose interactomes are essentially not biased towards cancer or other diseases. We confirmed that cancer proteins had stronger connectivity, shorter distance, and larger betweenness centrality than non-cancer disease proteins, essential proteins, and control proteins. Our statistical evaluation indicated that such observations were overall unlikely attributed to random events. Considering the large size and high quality of the PPI data in the four organisms, the conclusion that cancer proteins interact strongly in the PPI networks is reliable and robust. This conclusion suggests that perturbation of cancer proteins might cause major changes of cellular systems and result in abnormal cell function leading to cancer.
Cancer proteins; Cancer genes; Protein-protein interactions; Protein interaction network; Global network characteristics; Network topology
Pathway analysis of a set of genes represents an important area in large-scale omic data analysis. However, the application of traditional pathway enrichment methods to next-generation sequencing (NGS) data is prone to several potential biases, including genomic/genetic factors (e.g., the particular disease and gene length) and environmental factors (e.g., personal life-style and frequency and dosage of exposure to mutagens). Therefore, novel methods are urgently needed for these new data types, especially for individual-specific genome data.
In this study, we proposed a novel method for the pathway analysis of NGS mutation data by explicitly taking into account the gene-wise mutation rate. We estimated the gene-wise mutation rate based on the individual-specific background mutation rate along with the gene length. Taking the mutation rate as a weight for each gene, our weighted resampling strategy builds the null distribution for each pathway while matching the gene length patterns. The empirical P value obtained then provides an adjusted statistical evaluation.
We demonstrated our weighted resampling method to a lung adenocarcinomas dataset and a glioblastoma dataset, and compared it to other widely applied methods. By explicitly adjusting gene-length, the weighted resampling method performs as well as the standard methods for significant pathways with strong evidence. Importantly, our method could effectively reject many marginally significant pathways detected by standard methods, including several long-gene-based, cancer-unrelated pathways. We further demonstrated that by reducing such biases, pathway crosstalk for each individual and pathway co-mutation map across multiple individuals can be objectively explored and evaluated. This method performs pathway analysis in a sample-centered fashion, and provides an alternative way for accurate analysis of cancer-personalized genomes. It can be extended to other types of genomic data (genotyping and methylation) that have similar bias problems.
Motivation: In genome-wide association studies (GWAS) of complex diseases, genetic variants having real but weak associations often fail to be detected at the stringent genome-wide significance level. Pathway analysis, which tests disease association with combined association signals from a group of variants in the same pathway, has become increasingly popular. However, because of the complexities in genetic data and the large sample sizes in typical GWAS, pathway analysis remains to be challenging. We propose a new statistical model for pathway analysis of GWAS. This model includes a fixed effects component that models mean disease association for a group of genes, and a random effects component that models how each gene's association with disease varies about the gene group mean, thus belongs to the class of mixed effects models.
Results: The proposed model is computationally efficient and uses only summary statistics. In addition, it corrects for the presence of overlapping genes and linkage disequilibrium (LD). Via simulated and real GWAS data, we showed our model improved power over currently available pathway analysis methods while preserving type I error rate. Furthermore, using the WTCCC Type 1 Diabetes (T1D) dataset, we demonstrated mixed model analysis identified meaningful biological processes that agreed well with previous reports on T1D. Therefore, the proposed methodology provides an efficient statistical modeling framework for systems analysis of GWAS.
Availability: The software code for mixed models analysis is freely available at http://biostat.mc.vanderbilt.edu/LilyWang.
Contact: firstname.lastname@example.org; email@example.com
Supplementary information: Supplementary data are available at Bioinformatics online.
Recently, we prioritized 160 schizophrenia candidate genes (SZGenes) by integrating multiple lines of evidence and subsequently identified twenty-four pathways in which these 160 genes are overrepresented. Among them, four neurotransmitter-related pathways were top ranked. In this study, we extended our previous pathway analysis by applying a systems biology approach to identifying candidate genes for schizophrenia. We constructed protein-protein interaction subnetworks for four neurotransmitter-related pathways and merged them to obtain a general neurotransmitter network, from which five candidate genes stood out. We tested the association of four genes (GRB2, HSPA5, YWHAG, and YWHAZ) in the Irish Case-Control Study of Schizophrenia (ICCSS) sample (1021 cases and 626 controls). Interestingly, six of the seven tested SNPs in GRB2 showed significant signal, two of which (rs7207618 and rs9912608) remained significant after permutation test or Bonferroni correction, suggesting that GRB2 might be a risk gene for schizophrenia in Irish population. To our knowledge, this is the first report of GRB2 being significantly associated with schizophrenia in a specific population. Our results suggest that the systems biology approach is promising for identification of candidate genes and understanding the etiology of complex diseases.
Schizophrenia; systems biology; association; GRB2; GWAS; neurotransmitter
Motivation: An important question that has emerged from the recent success of genome-wide association studies (GWAS) is how to detect genetic signals beyond single markers/genes in order to explore their combined effects on mediating complex diseases and traits. Integrative testing of GWAS association data with that from prior-knowledge databases and proteome studies has recently gained attention. These methodologies may hold promise for comprehensively examining the interactions between genes underlying the pathogenesis of complex diseases.
Methods: Here, we present a dense module searching (DMS) method to identify candidate subnetworks or genes for complex diseases by integrating the association signal from GWAS datasets into the human protein–protein interaction (PPI) network. The DMS method extensively searches for subnetworks enriched with low P-value genes in GWAS datasets. Compared with pathway-based approaches, this method introduces flexibility in defining a gene set and can effectively utilize local PPI information.
Results: We implemented the DMS method in an R package, which can also evaluate and graphically represent the results. We demonstrated DMS in two GWAS datasets for complex diseases, i.e. breast cancer and pancreatic cancer. For each disease, the DMS method successfully identified a set of significant modules and candidate genes, including some well-studied genes not detected in the single-marker analysis of GWA studies. Functional enrichment analysis and comparison with previously published methods showed that the genes we identified by DMS have higher association signal.
Availability: dmGWAS package and documents are available at http://bioinfo.mc.vanderbilt.edu/dmGWAS.html.
Supplementary Information: Supplementary data are available at Bioinformatics online.
Numerous genetic and genomic datasets related to complex diseases have been made available during the last decade. It is now a great challenge to assess such heterogeneous datasets to prioritize disease genes and perform follow up functional analysis and validation. Among complex disease studies, psychiatric disorders such as major depressive disorder (MDD) are especially in need of robust integrative analysis because these diseases are more complex than others, with weak genetic factors at various levels, including genetic markers, transcription (gene expression), epigenetics (methylation), protein, pathways and networks.
In this study, we proposed a comprehensive analysis framework at the systems level and demonstrated it in MDD using a set of candidate genes that have recently been prioritized based on multiple lines of evidence including association, linkage, gene expression (both human and animal studies), regulatory pathway, and literature search. In the network analysis, we explored the topological characteristics of these genes in the context of the human interactome and compared them with two other complex diseases. The network topological features indicated that MDD is similar to schizophrenia compared to cancer. In the functional analysis, we performed the gene set enrichment analysis for both Gene Ontology categories and canonical pathways. Moreover, we proposed a unique pathway crosstalk approach to examine the dynamic interactions among biological pathways. Our pathway enrichment and crosstalk analyses revealed two unique pathway interaction modules that were significantly enriched with MDD genes. These two modules are neuro-transmission and immune system related, supporting the neuropathology hypothesis of MDD. Finally, we constructed a MDD-specific subnetwork, which recruited novel candidate genes with association signals from a major MDD GWAS dataset.
This study is the first systematic network and pathway analysis of candidate genes in MDD, providing abundant important information about gene interaction and regulation in a major psychiatric disease. The results suggest potential functional components underlying the molecular mechanisms of MDD and, thus, facilitate generation of novel hypotheses in this disease. The systems biology based strategy in this study can be applied to many other complex diseases.
Evidence is presented that the linked histone modifications H2B ubiquitylation and H3 methylation play an important role in preventing the ectopic association of silencing proteins with telomere-distant euchromatic genes important for the cellular response to heat.
In Saccharomyces cerevisiae, ubiquitylation of histone H2B signals methylation of histone H3 at lysine residues 4 (K4) and 79. These modifications occur at active genes but are believed to stabilize silent chromatin by limiting movement of silencing proteins away from heterochromatin domains. In the course of studying atypical phenotypes associated with loss of H2B ubiquitylation/H3K4 methylation, we discovered that these modifications are also required for cell wall integrity at high temperatures. We identified the silencing protein Sir4 as a dosage suppressor of loss of H2B ubiquitylation, and we showed that elevated Sir4 expression suppresses cell wall integrity defects by inhibiting the function of the Sir silencing complex. Using comparative transcriptome analysis, we identified a set of euchromatic genes—enriched in those required for the cellular response to heat—whose expression is attenuated by loss of H2B ubiquitylation but restored by disruption of Sir function. Finally, using DNA adenine methyltransferase identification, we found that Sir3 and Sir4 associate with genes that are silenced in the absence of H3K4 methylation. Our data reveal that H2B ubiquitylation/H3K4 methylation play an important role in limiting ectopic association of silencing proteins with euchromatic genes important for cell wall integrity and the response to heat.
Unlike the typical analysis of single markers in genome-wide association studies (GWAS), we incorporated Gene Set Enrichment Analysis (GSEA) and hypergeometric test and combined them using Fisher's combined method to perform pathway-based analysis in order to detect genes’ combined effects on mediating schizophrenia. A few pathways were consistently found to be top ranked and likely associated with schizophrenia by these methods; they are related to metabolism of glutamate, the process of apoptosis, inflammation, and immune system (e.g., glutamate metabolism pathway, TGF-beta signaling pathway, and TNFR1 pathway). The genes involved in these pathways had not been detected by single marker analysis, suggesting this approach may complement the original analysis of GWAS dataset.
Schizophrenia; gene set enrichment analysis; GWAS; pathway; candidate gene
Understanding individual differences in the susceptibility to metabolic side effects as a response to antipsychotic therapy is essential to optimize the treatment of schizophrenia. Here we perform genomewide association studies (GWAS) to search for genetic variation affecting the susceptibility to metabolic side effects. The analysis sample consisted of 738 schizophrenia patients, successfully genotyped for 492K SNPs, from the genomic subsample of the Clinical Antipsychotic Trial of Intervention Effectiveness (CATIE) study. Outcomes included twelve indicators of metabolic side effects, quantifying antipsychotic-induced change in weight, blood lipids, glucose and hemoglobin A1c, blood pressure and heart rate. Our criterion for genomewide significance was a pre-specified threshold that ensures, on average, only 10% of the significant findings are false discoveries. Twenty-one SNPs satisfied this criterion. The top finding indicated a SNP in MEIS2 mediated the effects of risperidone on hip circumference (q =.004). The same SNP was also found to mediate risperidone's effect on waist circumference (q =.055). Genomewide significant finding were also found for SNPs in PRKAR2B, GPR98, FHOD3, RNF144A, ASTN2, SOX5 and ATF7IP2, as well as several intergenic markers. PRKAR2B and MEIS2 both have previous research indicating metabolic involvement and PRKAR2B has previously been shown to mediate antipsychotic response. Although our findings require replication and functional validation, this study demonstrates the potential of GWAS to discover genes and pathways that potentially mediate adverse effects of antipsychotic medication.
genomewide association; antipsychotics; pharmacogenomics; personalized medicine; metabolic side effects