Related Articles
Background
Many methods have been developed to test the enrichment of genes related to certain phenotypes or cell states in gene sets. These approaches usually combine gene expression data with functionally related gene sets as defined in databases such as GeneOntology (GO), KEGG, or BioCarta. The results based on gene set analysis are generally more biologically interpretable, accurate and robust than the results based on individual gene analysis. However, while most available methods for gene set enrichment analysis test the enrichment of the entire gene set, it is more likely that only a subset of the genes in the gene set may be related to the phenotypes of interest.
Results
In this paper, we develop a novel method, termed Sub-GSE, which measures the enrichment of a predefined gene set, or pathway, by testing its subsets. The application of Sub-GSE to two simulated and two real datasets shows Sub-GSE to be more sensitive than previous methods, such as GSEA, GSA, and SigPath, in detecting gene sets associated with a phenotype of interest. This is particularly true for cases in which only a fraction of the genes in the gene set are associated with the phenotypes. Furthermore, the application of Sub-GSE to two real data sets demonstrates that it can detect more biologically meaningful gene sets than GSEA.
Conclusion
We developed a new method to measure the gene set enrichment. Applications to two simulated datasets and two real datasets show that this method is sensitive to the associations between gene sets and phenotype. The program Sub-GSE can be downloaded from .
doi:10.1186/1471-2105-9-362
PMCID: PMC2543030
PMID: 18764941
Background
Autism spectrum disorder is a severe early onset neurodevelopmental disorder with high heritability but significant heterogeneity. Traditional genome-wide approaches to test for an association of common variants with autism susceptibility risk have met with limited success. However, novel methods to identify moderate risk alleles in attainable sample sizes are now gaining momentum.
Methods
In this study, we utilized publicly available genome-wide association study data from the Autism Genome Project and annotated the results (P <0.001) for expression quantitative trait loci present in the parietal lobe (GSE35977), cerebellum (GSE35974) and lymphoblastoid cell lines (GSE7761). We then performed a test of enrichment by comparing these results to simulated data conditioned on minor allele frequency to generate an empirical P-value indicating statistically significant enrichment of expression quantitative trait loci in top results from the autism genome-wide association study.
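The enrichment test described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the ten-bin frequency matching, the number of simulations and the add-one correction are all assumptions made for the example.

```python
import random
from collections import defaultdict

def maf_bin(maf, n_bins=10):
    """Map a minor allele frequency in [0, 0.5] to one of n_bins bins (assumed binning)."""
    return min(int(maf * n_bins / 0.5), n_bins - 1)

def empirical_enrichment_p(top_snps, background_snps, n_sim=1000, seed=0):
    """Estimate an empirical P-value for eQTL enrichment among top GWAS SNPs.

    top_snps, background_snps: lists of (maf, is_eqtl) tuples (hypothetical format).
    Returns (observed eQTL count, empirical P-value).
    """
    rng = random.Random(seed)
    # Group background SNPs by frequency bin so draws can be frequency-matched.
    by_bin = defaultdict(list)
    for maf, is_eqtl in background_snps:
        by_bin[maf_bin(maf)].append(is_eqtl)
    observed = sum(is_eqtl for _, is_eqtl in top_snps)
    hits = 0
    for _ in range(n_sim):
        # Draw one frequency-matched background SNP for every top SNP.
        simulated = sum(rng.choice(by_bin[maf_bin(maf)]) for maf, _ in top_snps)
        if simulated >= observed:
            hits += 1
    # Add-one correction keeps the estimate from being exactly zero.
    return observed, (hits + 1) / (n_sim + 1)
```

The top SNPs are assumed to be a subset of the background, so every frequency bin that is sampled is non-empty.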
Results
Our findings show a global enrichment of brain expression quantitative trait loci, but not lymphoblastoid cell line expression quantitative trait loci, among top single nucleotide polymorphisms from an autism genome-wide association study. Additionally, the data implicate individual genes SLC25A12, PANX1 and PANX2 as well as pathways previously implicated in autism.
Conclusions
These findings provide supportive rationale for the use of annotation-based approaches to genome-wide association studies.
doi:10.1186/2040-2392-3-3
PMCID: PMC3484025
PMID: 22591576
Autism; annotation; cerebellum; enrichment; expression quantitative trait (eQTL); GWAS; LCL; pannexin; parietal; SLC25A12
Dinu, Irina | Potter, John D | Mueller, Thomas | Liu, Qi | Adewale, Adeniyi J | Jhangri, Gian S | Einecke, Gunilla | Famulski, Konrad S | Halloran, Philip | Yasui, Yutaka
Background
Gene-set analysis evaluates the expression of biological pathways, or a priori defined gene sets, rather than that of individual genes, in association with a binary phenotype, and is of great biologic interest in many DNA microarray studies. Gene Set Enrichment Analysis (GSEA) has been applied widely as a tool for gene-set analyses. We describe here some critical problems with GSEA and propose an alternative method by extending the individual-gene analysis method, Significance Analysis of Microarray (SAM), to gene-set analyses (SAM-GS).
Results
Using a mouse microarray dataset with simulated gene sets, we illustrate that GSEA gives statistical significance to gene sets that have no gene associated with the phenotype (null gene sets), and has very low power to detect gene sets in which half the genes are moderately or strongly associated with the phenotype (truly-associated gene sets). SAM-GS, on the other hand, performs very well. The two methods are also compared in the analyses of three real microarray datasets and relevant pathways, the diverging results of which clearly show advantages of SAM-GS over GSEA, both statistically and biologically. In a microarray study for identifying biological pathways whose gene expressions are associated with p53 mutation in cancer cell lines, we found biologically relevant performance differences between the two methods. Specifically, there are 31 additional pathways identified as significant by SAM-GS over GSEA, that are associated with the presence vs. absence of p53. Of the 31 gene sets, 11 actually involve p53 directly as a member. A further 6 gene sets directly involve the extrinsic and intrinsic apoptosis pathways, 3 involve the cell-cycle machinery, and 3 involve cytokines and/or JAK/STAT signaling. Each of these 12 gene sets, then, is in a direct, well-established relationship with aspects of p53 signaling. Of the remaining 8 gene sets, 6 have plausible, if less well established, links with p53.
Conclusion
We conclude that GSEA has important limitations as a gene-set analysis approach for microarray experiments for identifying biological pathways associated with a binary phenotype. As an alternative statistically-sound method, we propose SAM-GS. A free Excel Add-In for performing SAM-GS is available for public use.
doi:10.1186/1471-2105-8-242
PMCID: PMC1931607
PMID: 17612399
We present GSE, the Genomic Spatial Event database, a system to store, retrieve, and analyze all types of high-throughput microarray data. GSE handles expression datasets, ChIP-chip data, genomic annotations, functional annotations, the results of our previously published Joint Binding Deconvolution algorithm for ChIP-chip, and precomputed scans for binding events. GSE can manage data associated with multiple species; it can also simultaneously handle data associated with multiple ‘builds’ of the genome from a single species. The GSE system is built upon a middle software layer for representing streams of biological data; we outline this layer, called GSEBricks, and show how it is used to build an interactive visualization application for ChIP-chip data. The visualizer software is written in Java and communicates with the GSE database system over the network. We also present a system to formulate and record binding hypotheses: simple descriptions of the relationships that may hold between different ChIP-chip experiments. We provide a reference software implementation for the GSE system.
PMCID: PMC2674223
PMID: 18229714
Background
Gene set enrichment testing has helped bridge the gap from an individual gene to a systems biology interpretation of microarray data. Although gene sets are defined a priori based on biological knowledge, current methods for gene set enrichment testing treat all genes equally. It is well-known that some genes, such as those responsible for housekeeping functions, appear in many pathways, whereas other genes are more specialized and play a unique role in a single pathway. Drawing inspiration from the field of information retrieval, we have developed and present here an approach to incorporate gene appearance frequency (in KEGG pathways) into two current methods, Gene Set Enrichment Analysis (GSEA) and the logistic regression-based LRpath framework, to generate more reproducible and biologically meaningful results.
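One way to picture an appearance-frequency weight is the inverse document frequency used in information retrieval. The sketch below is an assumption made for illustration only: the abstract does not give the exact weighting formula, and the function name and pathway contents are hypothetical.

```python
import math
from collections import Counter

def appearance_frequency_weights(pathways):
    """Compute an IDF-style weight log(N / n_g) for every gene.

    pathways: dict mapping pathway name -> iterable of gene symbols.
    N is the number of pathways, n_g the number of pathways containing gene g.
    """
    n_pathways = len(pathways)
    # Count each gene at most once per pathway.
    counts = Counter(g for genes in pathways.values() for g in set(genes))
    return {g: math.log(n_pathways / c) for g, c in counts.items()}

# Hypothetical toy pathways: GAPDH stands in for a gene found in every pathway.
toy_pathways = {
    "pathway_1": ["GAPDH", "TP53"],
    "pathway_2": ["GAPDH", "BRCA1"],
    "pathway_3": ["GAPDH", "BRCA1"],
    "pathway_4": ["GAPDH"],
}
weights = appearance_frequency_weights(toy_pathways)
```

Under this weighting a gene present in every pathway receives weight zero, while a gene unique to one pathway receives the largest weight, which matches the intuition that specialized genes are more informative.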
Results
Two breast cancer microarray datasets were analyzed to identify gene sets differentially expressed between histological grade 1 and 3 breast cancer. The correlation of Normalized Enrichment Scores (NES) between gene sets, generated by the original GSEA and GSEA with the appearance frequency of genes incorporated (GSEA-AF), was compared. GSEA-AF resulted in higher correlation between experiments and more overlapping top gene sets. Several cancer related gene sets achieved higher NES in GSEA-AF as well. The same datasets were also analyzed by LRpath and LRpath with the appearance frequency of genes incorporated (LRpath-AF). Two well-studied lung cancer datasets were also analyzed in the same manner to demonstrate the validity of the method, and similar results were obtained.
Conclusions
We introduce an alternative way to integrate KEGG PATHWAY information into gene set enrichment testing. The performance of GSEA and LRpath can be enhanced with the integration of appearance frequency of genes. We conclude that, generally, gene set analysis methods with the integration of information from KEGG PATHWAY perform better both statistically and biologically.
doi:10.1186/1471-2105-12-81
PMCID: PMC3213687
PMID: 21418606
Motivation: There is great interest in pathway-based methods for genomics data analysis in the research community. Although machine learning methods, such as random forests, have been developed to correlate survival outcomes with a set of genes, no study has assessed the abilities of these methods in incorporating pathway information for analyzing microarray data. In general, genes that are identified without incorporating biological knowledge are more difficult to interpret. Correlating pathway-based gene expression with survival outcomes may lead to biologically more meaningful prognosis biomarkers. Thus, a comprehensive study on how these methods perform in a pathway-based setting is warranted.
Results: In this article, we describe a pathway-based method using random forests to correlate gene expression data with survival outcomes and introduce novel bivariate node-splitting random survival forests. The proposed method allows researchers to identify important pathways for predicting patient prognosis and time to disease progression, and discover important genes within those pathways. We compared different implementations of random forests with different split criteria and found that bivariate node-splitting random survival forests with the log-rank test is among the best. We also performed simulation studies that showed random forests outperform several other machine learning algorithms and have comparable results with a newly developed component-wise Cox boosting model. Thus, pathway-based survival analysis using machine learning tools represents a promising approach for dissecting pathways and for generating new biological hypotheses from microarray studies.
Availability: R package Pwayrfsurvival is available from URL: http://www.duke.edu/~hp44/pwayrfsurvival.htm
Contact: pathwayrf@gmail.com
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btp640
PMCID: PMC2804301
PMID: 19933158
Recently, we demonstrated the feasibility of a chemical synthetic lethality screen in cultured human cells. We now demonstrate the principles for a genetic synthetic lethality screen. The technology employs both an immortalized human cell line deficient in the gene of interest, which is complemented by an episomal survival plasmid expressing the wild-type cDNA for the gene of interest, and the use of a novel GFP-based double-label fluorescence system. Dominant negative genetic suppressor elements (GSEs) are selected from an episomal library expressing short truncated sense and antisense cDNAs for a gene likely to be synthetic lethal with the gene of interest. Expression of these GSEs prevents spontaneous loss of the GFP-marked episomal survival plasmid, thus allowing FACS enrichment for cells retaining the survival plasmid (and the GSEs). The dominant negative nature of the GSEs was validated by the decreased resident enzymatic activity present in cells harboring the GSEs. Also, cells mutated in the gene of interest exhibit reduced survival upon GSE expression. The identification of synthetic lethal genes described here can shed light on functional genetic interactions between genes involved in normal cell metabolism and in disease.
PMCID: PMC60228
PMID: 11600719
Rho, Kyoohyoung | Kim, Bumjin | Jang, Youngjun | Lee, Sanghyun | Bae, Taejeong | Seo, Jihae | Seo, Chaehwa | Lee, Jihyun | Kang, Hyunjung | Yu, Ungsik | Kim, Sunghoon | Lee, Sanghyuk | Kim, Wan Kyu
Background
Gene set analysis is a powerful method of deducing biological meaning for an a priori defined set of genes. Numerous tools have been developed to test statistical enrichment or depletion in specific pathways or gene ontology (GO) terms. Major difficulties towards biological interpretation are integrating diverse types of annotation categories and exploring the relationships between annotation terms of similar information.
Results
GARNET (Gene Annotation Relationship NEtwork Tools) is an integrative platform for gene set analysis with many novel features. It includes tools for retrieval of genes from annotation databases, statistical analysis and visualization of annotation relationships, and managing gene sets. In an effort to allow access to a full spectrum of amassed biological knowledge, we have integrated a variety of annotation data that include the GO, domain, disease, drug, chromosomal location, and custom-defined annotations. Diverse types of molecular networks (pathways, transcription and microRNA regulations, protein-protein interaction) are also included. The pair-wise relationship between annotation gene sets was calculated using kappa statistics. GARNET consists of three modules (gene set manager, gene set analysis and gene set retrieval), which are tightly integrated to provide virtually automatic analysis for gene sets. A dedicated viewer for annotation networks has been developed to facilitate exploration of the related annotations.
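The kappa statistic between two annotation gene sets can be illustrated with Cohen's kappa computed over a shared gene universe. This is a minimal sketch under that assumption; the function name and inputs are hypothetical and do not describe GARNET's own code.

```python
def gene_set_kappa(set_a, set_b, universe):
    """Cohen's kappa for the agreement between two gene sets over a gene universe.

    Each gene in the universe is classified as in/out of set_a and in/out of set_b.
    """
    universe = set(universe)
    a_in = set(set_a) & universe
    b_in = set(set_b) & universe
    n = len(universe)
    both = len(a_in & b_in)
    only_a = len(a_in - b_in)
    only_b = len(b_in - a_in)
    neither = n - both - only_a - only_b
    # Observed agreement: genes on which the two annotations agree.
    observed = (both + neither) / n
    # Agreement expected by chance from the marginal set sizes.
    expected = ((both + only_a) * (both + only_b)
                + (only_b + neither) * (only_a + neither)) / n ** 2
    if expected == 1.0:
        return 1.0  # Degenerate case: both sets empty or both equal to the universe.
    return (observed - expected) / (1.0 - expected)
```

Values near 1 indicate two annotation terms that cover nearly the same genes, values near 0 indicate chance-level overlap, and negative values indicate less overlap than expected by chance.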
Conclusions
GARNET (gene annotation relationship network tools) is an integrative platform for diverse types of gene set analysis, where complex relationships among gene annotations can be easily explored with an intuitive network visualization tool (http://garnet.isysbio.org/ or http://ercsb.ewha.ac.kr/garnet/).
doi:10.1186/1471-2105-12-S1-S25
PMCID: PMC3044280
PMID: 21342555
Despite important advances in microarray-based molecular classification of tumors, its application in clinical settings remains formidable. This is in part due to the limitation of current analysis programs in discovering robust biomarkers and developing classifiers with a practical set of genes. Genetic programming (GP) is a type of machine learning technique that uses an evolutionary algorithm to simulate natural selection as well as population dynamics, hence leading to simple and comprehensible classifiers. Here we applied GP to cancer expression profiling data to select feature genes and build molecular classifiers by mathematical integration of these genes. Analysis of thousands of GP classifiers generated for a prostate cancer data set revealed repetitive use of a set of highly discriminative feature genes, many of which are known to be disease associated. GP classifiers often comprise five or fewer genes and successfully predict cancer types and subtypes. More importantly, GP classifiers generated in one study are able to predict samples from an independent study, which may have used different microarray platforms. In addition, GP yielded classification accuracy better than or similar to conventional classification methods. Furthermore, the mathematical expression of GP classifiers provides insights into relationships between classifier genes. Taken together, our results demonstrate that GP may be valuable for generating effective classifiers containing a practical set of genes for diagnostic/prognostic cancer classification.
PMCID: PMC1854845
PMID: 17460773
Molecular diagnostics; biomarkers; prostate cancer; evolutionary algorithm; microarray profiling
The gene set testing problem has become the focus of microarray data analysis. A gene set is a group of genes that are defined by a priori biological knowledge. Several statistical methods have been proposed to determine whether functional gene sets express differentially (enrichment and/or deletion) in variations of phenotypes. However, little attention has been given to analyzing the dependence structure among gene sets. In this study, we have proposed a novel statistical method of gene set association analysis to identify significantly associated gene sets using the coefficient of intrinsic dependence. The simulation studies show that the proposed method outperforms the conventional methods to detect general forms of association in terms of control of type I error and power. The correlation of intrinsic dependence has been applied to a breast cancer microarray dataset to quantify the unsupervised relationship between two sets of genes in the tumor and non-tumor samples. It was observed that the existence of gene-set association differed across various clinical cohorts. In addition, supervised learning was employed to illustrate how gene sets, in signaling transduction pathways or subnetworks regulated by a set of transcription factors, can be discovered using microarray data. In conclusion, the coefficient of intrinsic dependence provides a powerful tool for detecting general types of association. Hence, it can be useful to associate gene sets using microarray expression data. Through connecting relevant gene sets, our approach has the potential to reveal underlying associations by drawing a statistically relevant network in a given population, and it can also be used to complement the conventional gene set analysis.
doi:10.1371/journal.pone.0058851
PMCID: PMC3597597
Background
Microarray experiments examine the change in transcript levels of tens of thousands of genes simultaneously. To derive meaningful data, biologists investigate the response of genes within specific pathways. Pathways are comprised of genes that interact to carry out a particular biological function. Existing methods for analyzing pathways focus on detecting changes in the mean or over-representation of the number of differentially expressed genes relative to the total of genes within the pathway. The issue of how to incorporate the influence of correlation among the genes is not generally addressed.
Results
In this paper, we propose a non-parametric rank test for analyzing pathways that takes into account the correlation among the genes, and compare it with two existing methods, Global and Gene Set Enrichment Analysis (GSEA), using two publicly available data sets. A simulation study was conducted to demonstrate the advantage of the rank test method.
Conclusions
The data indicate the advantages of the rank test. The method can distinguish significant changes in pathways due to either correlations or changes in the mean or both. In the simulation study, the rank test outperformed Global and GSEA. The greatest gain in performance was for the sample size case which makes the application of the rank test ideal for microarray experiments.
doi:10.1186/1471-2105-11-60
PMCID: PMC3098106
PMID: 20109181
Colorectal cancer (CRC) is one of the leading malignant cancers with a rapid increase in incidence and mortality. The recurrences of CRC after curative resection are sometimes unavoidable and often take place within the first year after surgery. MicroRNAs may serve as biomarkers to predict early recurrence of CRC, but identifying them from over 1,400 known human microRNAs is challenging and costly. An alternative approach is to analyze existing expression data of messenger RNAs (mRNAs) because, generally speaking, the expression levels of microRNAs and their target mRNAs are inversely correlated. In this study, we extracted six mRNA expression datasets of CRC from four studies (GSE12032, GSE17538, GSE4526 and GSE17181) in the Gene Expression Omnibus (GEO). We inferred microRNA expression profiles and performed computational analysis to identify microRNAs associated with CRC recurrence using the IMRE method based on the MicroCosm database that includes 568,071 microRNA-target connections between 711 microRNAs and 20,884 gene targets. Two microRNAs, miR-29a and miR-29c, were disclosed and further meta-analysis of the six mRNA expression datasets showed that these two microRNAs were highly significant based on the Fisher p-value combination (p = 9.14×10^-9 for miR-29a and p = 1.14×10^-6 for miR-29c). Furthermore, these two microRNAs were experimentally tested in 78 human CRC samples to validate their effect on early recurrence. Our empirical results showed that the two microRNAs were significantly down-regulated (p = 0.007 for miR-29a and p = 0.007 for miR-29c) in the early-recurrence patients. This study shows the feasibility of using mRNA profiles to indicate microRNAs. We also show that miR-29a/c could be potential biomarkers for CRC early recurrence.
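The Fisher p-value combination mentioned above pools independent p-values through the statistic X = -2 Σ ln(p_i), which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. The sketch below is a generic illustration with a hypothetical function name, not the authors' pipeline; it uses the closed-form chi-square tail for even degrees of freedom so that only the standard library is needed.

```python
import math

def fisher_combined_p(p_values):
    """Combine k independent p-values with Fisher's method.

    Returns (chi-square statistic, combined p-value). All p-values must be > 0.
    """
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0
    # For 2k degrees of freedom the survival function has the closed form
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, math.exp(-half) * total

# Hypothetical per-dataset p-values for one microRNA across six datasets.
stat, combined = fisher_combined_p([0.04, 0.10, 0.03, 0.20, 0.01, 0.08])
```

Because the statistic sums evidence across datasets, several moderately small p-values can combine into a much smaller one, which is how six datasets can jointly reach the significance levels reported.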
doi:10.1371/journal.pone.0031587
PMCID: PMC3278467
PMID: 22348113
Relating chemical features to bioactivities is critical in molecular design and is used extensively in the lead discovery and optimization process. A variety of techniques from statistics, data mining and machine learning have been applied to this process. In this study, we utilize a collection of methods, called associative classification mining (ACM), which are popular in the data mining community, but so far have not been applied widely in cheminformatics. More specifically, classification based on predictive association rules (CPAR), classification based on multiple association rules (CMAR) and classification based on association rules (CBA) are employed on three datasets using various descriptor sets. Experimental evaluations on anti-tuberculosis (antiTB), mutagenicity and hERG (the human Ether-a-go-go-Related Gene) blocker datasets show that these three methods are computationally scalable and appropriate for high speed mining. Additionally, they provide comparable accuracy and efficiency to the commonly used Bayesian and support vector machines (SVM) methods, and produce highly interpretable models.
doi:10.1186/1758-2946-4-29
PMCID: PMC3515428
PMID: 23176548
Associative classification mining; Fingerprint; Pipeline Pilot; Bayesian; SVM
COFECO is a web-based tool for a composite annotation of protein complexes, KEGG pathways and Gene Ontology (GO) terms within a class of genes and their orthologs under study. Widely used functional enrichment tools using GO and KEGG pathways create large lists of annotations that make it difficult to derive consolidated information and often include over-generalized terms. The interrelationship of annotation terms can be more clearly delineated by integrating the information of physically interacting proteins with biological pathways and GO terms. COFECO has the following advanced characteristics: (i) The composite annotation sets of correlated functions and cellular processes for a given gene set can be identified in a more comprehensive and specified way by the employment of protein complex data together with GO and KEGG pathways as annotation resources. (ii) Orthology based integrative annotations among different species complement the defective annotations in an individual genome and provide the information of evolutionary conserved correlations. (iii) A term filtering feature enables users to collect the specified annotations enriched with selected function terms. (iv) A cross-comparison of annotation results between two different datasets is possible. In addition, COFECO provides a web-based GO hierarchical viewer and KEGG pathway viewer where the enrichment results can be summarized and further explored. COFECO is freely accessible at http://piech.kaist.ac.kr/cofeco.
doi:10.1093/nar/gkp331
PMCID: PMC2703949
PMID: 19429688
Background
The search for enriched features has become widely used to characterize a set of genes or proteins. A key aspect of this technique is its ability to identify correlations amongst heterogeneous data such as Gene Ontology annotations, gene expression data and genome location of genes. Despite the rapid growth of available data, very little has been proposed in terms of formalization and optimization. Additionally, current methods mainly ignore the structure of the data, which causes redundancy in the results. For example, when searching for enrichment in GO terms, genes can be annotated with multiple GO terms and should be propagated to the more general terms in the Gene Ontology. Consequently, the gene sets often overlap partially or totally, and this causes the reported enriched GO terms to be both numerous and redundant, hence overwhelming the researcher with non-pertinent information. This situation is not unique: it arises whenever some hierarchical clustering is performed (e.g. based on the gene expression profiles), the extreme case being when genes that are neighbors on the chromosomes are considered.
Results
We present a generic framework to efficiently identify the most pertinent over-represented features in a set of genes. We propose a formal representation of gene sets based on the theory of partially ordered sets (posets), and give a formal definition of target set pertinence. Algorithms and compact representations of target sets are provided for the generation and the evaluation of the pertinent target sets. The relevance of our method is illustrated through the search for enriched GO annotations in the proteins involved in a multiprotein complex. The results obtained demonstrate the gain in terms of pertinence (up to 64% redundancy removed), space requirements (up to 73% less storage) and efficiency (up to 98% fewer comparisons).
Conclusion
The generic framework presented in this article provides a formal approach to adequately represent available data and efficiently search for pertinent over-represented features in a set of genes or proteins. The formalism and the pertinence definition can be directly used by most of the methods and tools currently available for feature enrichment analysis.
doi:10.1186/1471-2105-8-332
PMCID: PMC2206060
PMID: 17848190
Historically, probabilistic models for decision support have focused on discrimination, e.g., minimizing the ranking error of predicted outcomes. Unfortunately, these models ignore another important aspect, calibration, which indicates the magnitude of correctness of model predictions. Using discrimination and calibration simultaneously can be helpful for many clinical decisions. We investigated tradeoffs between these goals, and developed a unified maximum-margin method to handle them jointly. Our approach, called Doubly Optimized Calibrated Support Vector Machine (DOC-SVM), concurrently optimizes two loss functions: the ridge regression loss and the hinge loss. Experiments using three breast cancer gene-expression datasets (i.e., GSE2034, GSE2990, and Chanrion's datasets) showed that our model generated more calibrated outputs when compared to other state-of-the-art models like Support Vector Machine (p = 0.03, p = 0.13, and p < 0.001) and Logistic Regression (p = 0.006, p = 0.008, and p < 0.001). DOC-SVM also demonstrated better discrimination (i.e., higher AUCs) when compared to Support Vector Machine (p = 0.38, p = 0.29, and p = 0.047) and Logistic Regression (p = 0.38, p = 0.04, and p < 0.0001). DOC-SVM produced a model that was better calibrated without sacrificing discrimination, and hence may be helpful in clinical decision making.
doi:10.1371/journal.pone.0048823
PMCID: PMC3490990
PMID: 23139819
Background
Analysis of High Throughput (HTP) Data such as microarray and proteomics data has provided a powerful methodology to study patterns of gene regulation at genome scale. A major unresolved problem in the post-genomic era is to assemble the large amounts of data generated into a meaningful biological context. We have developed a comprehensive software tool, WholePathwayScope (WPS), for deriving biological insights from analysis of HTP data.
Result
WPS extracts gene lists with shared biological themes through color cue templates. WPS statistically evaluates global functional category enrichment of gene lists and pathway-level pattern enrichment of data. WPS incorporates well-known biological pathways from KEGG (Kyoto Encyclopedia of Genes and Genomes) and Biocarta, GO (Gene Ontology) terms as well as user-defined pathways or relevant gene clusters or groups, and explores gene-term relationships within the derived gene-term association networks (GTANs). WPS simultaneously compares multiple datasets within biological contexts either as pathways or as association networks. WPS also integrates Genetic Association Database and Partial MedGene Database for disease-association information. We have used this program to analyze and compare microarray and proteomics datasets derived from a variety of biological systems. Application examples demonstrated the capacity of WPS to significantly facilitate the analysis of HTP data for integrative discovery.
Conclusion
This tool represents a pathway-based platform for discovery integration to maximize analysis power. The tool is freely available at .
doi:10.1186/1471-2105-7-30
PMCID: PMC1388242
PMID: 16423281
Selective inhibition of specific genes can be accomplished using genetic suppressor elements (GSEs) that encode antisense RNA, dominant negative mutant proteins, or other regulatory products. GSEs may correspond to partial sequences of target genes, usually identified by trial and error. We have used bacteriophage lambda as a model system to test a concept that biologically active GSEs may be generated by random DNA fragmentation and identified by expression selection. Fragments from eleven different regions of lambda genome, encoding specific peptides or antisense RNA sequences, rendered E. coli resistant to the phage. Analysis of these GSEs revealed some previously unknown functions of phage lambda, including suppression of the cellular lambda receptor by an 'accessory' gene of the phage. The random fragment selection strategy provides a general approach to the generation of efficient GSEs and elucidation of novel gene functions.
PMCID: PMC312009
PMID: 1531871
Most methods for large-scale gene expression microarray and RNA-Seq data analysis are designed to determine the lists of genes or gene products that show distinct patterns and/or significant differences. The most challenging and rate-limiting step, however, is to determine what the resulting lists of genes and/or transcripts biologically mean. Biomedical ontology and pathway-based functional enrichment analysis is widely used to interpret the functional role of tightly correlated or differentially expressed genes. The groups of genes are assigned to the associated biological annotations using Gene Ontology terms or biological pathways and then tested if they are significantly enriched with the corresponding annotations. Unlike previous approaches, Gene Set Enrichment Analysis takes quite the reverse approach by using pre-defined gene sets. Differential co-expression analysis determines the degree of co-expression difference of paired gene sets across different conditions. Outcomes in DNA microarray and RNA-Seq data can be transformed into the graphical structure that represents biological semantics. A number of biomedical annotation and external repositories including clinical resources can be systematically integrated by biological semantics within the framework of concept lattice analysis. This array of methods for biological knowledge assembly and interpretation has been developed during the past decade and clearly improved our biological understanding of large-scale genomic data from the high-throughput technologies.
doi:10.1371/journal.pcbi.1002858
PMCID: PMC3531281
PMID: 23300429
Gene set enrichment analysis (GSEA) is a widely used technique in transcriptomic data analysis that uses a database of predefined gene sets to rank lists of genes from microarray studies to identify significant and coordinated changes in gene expression data. While GSEA has been playing a significant role in understanding transcriptomic data, no similar tools are currently available for understanding metabolomic data. Here, we introduce a web-based server, called Metabolite Set Enrichment Analysis (MSEA), to help researchers identify and interpret patterns of human or mammalian metabolite concentration changes in a biologically meaningful context. Key to the development of MSEA has been the creation of a library of ∼1000 predefined metabolite sets covering various metabolic pathways, disease states, biofluids, and tissue locations. MSEA also supports user-defined or custom metabolite sets for more specialized analysis. MSEA offers three different enrichment analyses for metabolomic studies including overrepresentation analysis (ORA), single sample profiling (SSP) and quantitative enrichment analysis (QEA). ORA requires only a list of compound names, while SSP and QEA require both compound names and compound concentrations. MSEA generates easily understood graphs or tables embedded with hyperlinks to relevant pathway images and disease descriptors. For non-mammalian or more specialized metabolomic studies, MSEA allows users to provide their own metabolite sets for enrichment analysis. The MSEA server also supports conversion between metabolite common names, synonyms, and major database identifiers. MSEA has the potential to help users identify obvious as well as ‘subtle but coordinated’ changes among a group of related metabolites that may go undetected with conventional approaches. MSEA is freely available at http://www.msea.ca.
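Overrepresentation analysis of the kind described above is commonly carried out with a one-sided hypergeometric test. The sketch below assumes that test for illustration; the abstract does not state which test the MSEA server uses, and the function and parameter names are hypothetical.

```python
from math import comb

def ora_p_value(n_universe, n_set, n_query, n_overlap):
    """One-sided hypergeometric p-value P(X >= n_overlap).

    n_universe: number of compounds in the reference library
    n_set:      number of compounds in the predefined metabolite set
    n_query:    number of compounds in the user's list
    n_overlap:  number of compounds shared by the list and the set
    """
    upper = min(n_set, n_query)
    total = comb(n_universe, n_query)
    # Sum the probability of every overlap at least as large as the observed one.
    tail = sum(
        comb(n_set, k) * comb(n_universe - n_set, n_query - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 5 of 20 submitted compounds fall in a 30-compound set
# drawn from a 1000-compound library.
p = ora_p_value(n_universe=1000, n_set=30, n_query=20, n_overlap=5)
```

Only the membership of compounds in the list matters here, which is consistent with ORA needing compound names but no concentrations.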
doi:10.1093/nar/gkq329
PMCID: PMC2896187
PMID: 20457745
Purpose
To characterize the functional role of JNK and other apoptotic pathways in grape seed extract (GSE)-induced apoptosis in human leukemia cells by using pharmacologic and genetic approaches.
Experimental Design
Jurkat cells were treated with various concentrations of GSE for 12 h and 24 h, or with 50 μg/ml of GSE for various time intervals, after which apoptosis, caspase activation, and cell signaling pathways were evaluated. Parallel studies were performed in U937 and HL-60 human leukemia cells.
Results
Exposure of Jurkat cells to GSE resulted in a dose- and time-dependent increase in apoptosis and caspase activation, events associated with a pronounced increase in Cip1/p21 protein level. Furthermore, treatment of Jurkat cells with GSE resulted in a marked increase in levels of phospho-JNK. Conversely, interruption of the JNK pathway by pharmacological (e.g. the inhibitor SP600125) or genetic (e.g. siRNA) approaches conferred significant protection against GSE-mediated lethality in Jurkat cells.
Conclusions
The results of the present study show that GSE induces apoptosis in Jurkat cells through a process that involves sustained JNK activation and Cip1/p21 up-regulation, culminating in caspase activation.
doi:10.1158/1078-0432.CCR-08-1447
PMCID: PMC2760842
PMID: 19118041
Apoptosis; Leukemia; Grape seed extract; JNK; Cip1/p21
Background
Microarray experiments produce expression measurements at genomic scale. One way to derive functional understanding from the data is to focus on functional sets of genes, such as pathways, instead of individual genes. While common practice for pathway-level analysis has been functional enrichment analysis, such as over-representation analysis and gene set enrichment analysis, an alternative approach has also been explored. In this approach, gene expression data are first aggregated at the pathway level to transform the original data into a compact representation in which each row corresponds to a pathway instead of a gene. Thereafter, the pathway expression data can be used for differential expression and classification analyses in pathway space, leveraging existing algorithms usually applied to gene expression data. While several studies have proposed pathway-level aggregation methods, it remains unclear how they compare with one another, since the evaluations were done only to a limited extent. This study therefore presents a comprehensive evaluation of the six most prominent aggregation methods.
Results
The compared methods include five existing methods--mean of all member genes (Mean all), mean of condition-responsive genes (Mean CORGs), analysis of sample set enrichment scores (ASSESS), principal component analysis (PCA), and partial least squares (PLS)--and a variant of an existing method (Mean top 50%, averaging top half of member genes). Comprehensive and stringent benchmarking was performed by collecting seven pairs of related but independent datasets encompassing various phenotypes. Aggregation was done in the space of KEGG pathways. Performance of the methods was assessed by classification accuracy validated both internally and externally, and by examining the correlative extent of pathway signatures between the dataset pairs. The assessment revealed that (i) the best accuracy and correlation were obtained from ASSESS and Mean top 50%, (ii) Mean all showed the lowest accuracy, and (iii) Mean CORGs and PLS gave rise to the largest extent of discordance in the pathway signature correlation.
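To make the idea of pathway-level aggregation concrete, the sketch below implements the two simplest methods named above, Mean all and Mean top 50%, on a toy expression table. It is not the authors' code. In particular, the abstract does not state how member genes are ranked for Mean top 50%, so ranking by the absolute difference of class means is an assumption, as are the function names and the data layout (a dict mapping gene name to a list of per-sample values).

```python
from statistics import mean


def mean_all(expr, members):
    """Per-sample mean over all member genes present in expr."""
    genes = [g for g in members if g in expr]
    n_samples = len(expr[genes[0]])
    return [mean(expr[g][s] for g in genes) for s in range(n_samples)]


def mean_top_half(expr, members, labels):
    """Per-sample mean over the top half of member genes.

    Assumption: genes are ranked by the absolute difference between
    the mean expression of class 0 and class 1 samples.
    """
    genes = [g for g in members if g in expr]

    def score(gene):
        class0 = [v for v, lab in zip(expr[gene], labels) if lab == 0]
        class1 = [v for v, lab in zip(expr[gene], labels) if lab == 1]
        return abs(mean(class0) - mean(class1))

    ranked = sorted(genes, key=score, reverse=True)
    top = ranked[: max(1, (len(ranked) + 1) // 2)]
    return mean_all(expr, top)


# Toy data: four genes, four samples (two per class).
expr = {
    "g1": [1.0, 1.0, 5.0, 5.0],
    "g2": [2.0, 2.0, 2.0, 2.0],
    "g3": [0.0, 0.0, 3.0, 3.0],
    "g4": [1.0, 1.0, 1.0, 1.0],
}
labels = [0, 0, 1, 1]
pathway = ["g1", "g2", "g3", "g4"]

row_all = mean_all(expr, pathway)
row_top = mean_top_half(expr, pathway, labels)
```

Each returned list is one row of the pathway-by-sample matrix. In the toy data, averaging only the responsive half (g1 and g3) separates the two classes more sharply than averaging all four genes, which illustrates why the choice of aggregation method can matter for downstream classification.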
Conclusions
The two best-performing methods (ASSESS and Mean top 50%) are suggested as the preferred choices. The benchmarking analysis also suggests that there is both room and need for developing a novel method for pathway-level aggregation.
doi:10.1186/1471-2164-13-S7-S26
PMCID: PMC3521227
PMID: 23282027
Background
High-throughput technologies like functional screens and gene expression analysis produce extended lists of candidate genes. Gene-Set Enrichment Analysis is a commonly used and well-established technique to test for the statistically significant over-representation of particular pathways. A shortcoming of this method, however, is that most genes investigated in the experiments have very sparse functional or pathway annotation and therefore cannot be the target of such an analysis. The approach presented here aims to assign lists of genes with limited annotation to previously described functional gene collections or pathways. This works by comparing InterPro domain signatures of the candidate gene lists with domain signatures of gene sets derived from known classifications, e.g. KEGG pathways.
Results
In order to validate our approach, we designed a simulation study. Based on all pathways available in the KEGG database, we create test gene lists by randomly selecting pathway genes, removing these genes from the known pathways and adding variable amounts of noise in the form of genes not annotated to the pathway. We show that we can recover pathway memberships based on the simulated gene lists with high accuracy. We further demonstrate the applicability of our approach on a biological example.
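The construction of a simulated test list described above can be sketched as follows. This is an illustrative reading of the abstract, not the authors' implementation: the function name `simulate_gene_list`, the uniform sampling, and the seed handling are assumptions.

```python
import random


def simulate_gene_list(pathway, universe, n_true, n_noise, seed=0):
    """Build one simulated gene list for a pathway-recovery test.

    Returns (gene_list, reduced_pathway), where gene_list holds
    n_true genes drawn from the pathway plus n_noise genes drawn
    from outside it, and reduced_pathway is the pathway with the
    drawn genes removed.
    """
    rng = random.Random(seed)
    # Sort before sampling so results are reproducible for a given seed.
    true_genes = rng.sample(sorted(pathway), n_true)
    outside = sorted(set(universe) - set(pathway))
    noise_genes = rng.sample(outside, n_noise)
    reduced_pathway = set(pathway) - set(true_genes)
    return true_genes + noise_genes, reduced_pathway
```

Removing the sampled genes from the pathway before scoring means that a recovery method cannot succeed by exact gene matching alone; it has to rely on other evidence, which in the approach described here is the shared domain signature.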
Conclusion
Results based on simulation and data analysis show that domain-based pathway enrichment analysis is a very sensitive method to test for enrichment of pathways in sparsely annotated lists of genes. An R-based software package, domainsignatures, for routinely performing this analysis on the results of high-throughput screening is available via Bioconductor.
doi:10.1186/1471-2105-9-3
PMCID: PMC2245903
PMID: 18177498
Davis, Allan Peter | Murphy, Cynthia Grondin | Johnson, Robin | Lay, Jean M. | Lennon-Hopkins, Kelley | Saraceni-Richards, Cynthia | Sciaky, Daniela | King, Benjamin L. | Rosenstein, Michael C. | Wiegers, Thomas C. | Mattingly, Carolyn J.
The Comparative Toxicogenomics Database (CTD; http://ctdbase.org/) provides information about interactions between environmental chemicals and gene products and their relationships to diseases. Chemical–gene, chemical–disease and gene–disease interactions manually curated from the literature are integrated to generate expanded networks and predict many novel associations between different data types. CTD now contains over 15 million toxicogenomic relationships. To navigate this sea of data, we added several new features, including DiseaseComps (which finds comparable diseases that share toxicogenomic profiles), statistical scoring for inferred gene–disease and pathway–chemical relationships, filtering options for several tools to refine user analysis and our new Gene Set Enricher (which provides biological annotations that are enriched for gene sets). To improve data visualization, we added a Cytoscape Web view to our ChemComps feature, included color-coded interactions and created a ‘slim list’ for our MEDIC disease vocabulary (allowing diseases to be grouped for meta-analysis, visualization and better data management). CTD continues to promote interoperability with external databases by providing content and cross-links to their sites. Together, this wealth of expanded chemical–gene–disease data, combined with novel ways to analyze and view content, continues to help users generate testable hypotheses about the molecular mechanisms of environmental diseases.
doi:10.1093/nar/gks994
PMCID: PMC3531134
PMID: 23093600
In this study, the BALB/c and Qs mouse responses to infection by the parasite Neospora caninum were investigated in order to identify host response mechanisms. The investigation used gene set (enrichment) analyses of microarray data. GSEA, MANOVA, Romer, subGSE and SAM-GS were used to study the contrasts Neospora strain type, mouse type (BALB/c and Qs) and time post infection (6 hours post infection and 10 days post infection). The analyses show that the major signal in the core mouse response to infection is from time post infection and can be defined by the gene ontology terms Protein Kinase Activity, Cell Proliferation and Transcription Initiation. Several terms linked to signaling, morphogenesis, response and fat metabolism were also identified. At 10 days post infection, genes associated with fatty acid metabolism were identified as upregulated. The value of gene set (enrichment) analyses in the analysis of microarray data is discussed.
doi:10.4137/BBI.S9954
PMCID: PMC3448498
PMID: 23012496
mouse model; microarray; gene set; host response; immunity; neospora