Gene-set enrichment analysis is a useful technique to help functionally characterize large gene lists, such as the results of gene expression experiments. This technique finds functionally coherent gene-sets, such as pathways, that are statistically over-represented in a given gene list. Ideally, the number of resulting sets is smaller than the number of genes in the list, thus simplifying interpretation. However, the increasing number and redundancy of gene-sets used by many current enrichment analysis software works against this ideal.
To overcome gene-set redundancy and help in the interpretation of large gene lists, we developed “Enrichment Map”, a network-based visualization method for gene-set enrichment results. Gene-sets are organized in a network, where each set is a node and edges represent gene overlap between sets. Automated network layout groups related gene-sets into network clusters, enabling the user to quickly identify the major enriched functional themes and more easily interpret the enrichment results.
Enrichment Map is a significant advance in the interpretation of enrichment analysis. Any research project that generates a list of genes can take advantage of this visualization framework. Enrichment Map is implemented as a freely available and user friendly plug-in for the Cytoscape network visualization software (http://baderlab.org/Software/EnrichmentMap/).
Pathway enrichment analysis represents a key technique for analyzing high-throughput omic data, and it can help to link individual genes or proteins found to be differentially expressed under specific conditions to well-understood biological pathways. We present here a computational tool, SEAS, for pathway enrichment analysis over a given set of genes in a specified organism against the pathways (or subsystems) in the SEED database, a popular pathway database for bacteria. SEAS maps a given set of genes of a bacterium to pathway genes covered by SEED through gene ID and/or orthology mapping, and then calculates the statistical significance of the enrichment of each relevant SEED pathway by the mapped genes. Our evaluation of SEAS indicates that the program provides highly reliable pathway mapping results and identifies more organism-specific pathways than similar existing programs. SEAS is publicly released under the GPL license agreement and freely available at http://csbl.bmb.uga.edu/~xizeng/research/seas/.
Curated gene sets from databases such as KEGG Pathway and Gene Ontology are often used to systematically organize lists of genes or proteins derived from high-throughput data. However, the information content inherent to some relationships between the interrogated gene sets, such as pathway crosstalk, is often underutilized. A gene set network, where nodes representing individual gene sets such as KEGG pathways are connected to indicate a functional dependency, is well suited to visualize and analyze global gene set relationships. Here we introduce a novel gene set network construction algorithm that integrates gene lists derived from high-throughput experiments with curated gene sets to construct co-enrichment gene set networks. Along with previously described co-membership and linkage algorithms, we apply the co-enrichment algorithm to eight gene set collections to construct integrated multi-evidence gene set networks with multiple edge types connecting gene sets. We demonstrate the utility of approach through examples of novel gene set networks such as the chromosome map co-differential expression gene set network. A total of twenty-four gene set networks are exposed via a web tool called MetaNet, where context-specific multi-edge gene set networks are constructed from enriched gene sets within user-defined gene lists. MetaNet is freely available at http://blaispathways.dfci.harvard.edu/metanet/.
Over the past decade, pathway and gene-set enrichment analysis has evolved into the study of high-throughput functional genomics. Owing to poorly annotated and incomplete pathway data, researchers have begun to combine pathway and gene-set enrichment analysis as well as network module-based approaches to identify crucial relationships between different molecular mechanisms.
To meet the new challenge of molecular phenotype discovery, in this work, we have developed an integrated online database, the Pathway And Gene Enrichment Database (PAGED), to enable comprehensive searches for disease-specific pathways, gene signatures, microRNA targets, and network modules by integrating gene-set-based prior knowledge as molecular patterns from multiple levels: the genome, transcriptome, post-transcriptome, and proteome.
The online database we developed, PAGED http://bio.informatics.iupui.edu/PAGED is by far the most comprehensive public compilation of gene sets. In its current release, PAGED contains a total of 25,242 gene sets, 61,413 genes, 20 organisms, and 1,275,560 records from five major categories. Beyond its size, the advantage of PAGED lies in the explorations of relationships between gene sets as gene-set association networks (GSANs). Using colorectal cancer expression data analysis as a case study, we demonstrate how to query this database resource to discover crucial pathways, gene signatures, and gene network modules specific to colorectal cancer functional genomics.
This integrated online database lays a foundation for developing tools beyond third-generation pathway analysis approaches on for discovering molecular phenotypes, especially for disease-associated pathway/gene-set enrichment analysis.
Motivation: Gene expression class comparison studies may identify hundreds or thousands of genes as differentially expressed (DE) between sample groups. Gaining biological insight from the result of such experiments can be approached, for instance, by identifying the signaling pathways impacted by the observed changes. Most of the existing pathway analysis methods focus on either the number of DE genes observed in a given pathway (enrichment analysis methods), or on the correlation between the pathway genes and the class of the samples (functional class scoring methods). Both approaches treat the pathways as simple sets of genes, disregarding the complex gene interactions that these pathways are built to describe.
Results: We describe a novel signaling pathway impact analysis (SPIA) that combines the evidence obtained from the classical enrichment analysis with a novel type of evidence, which measures the actual perturbation on a given pathway under a given condition. A bootstrap procedure is used to assess the significance of the observed total pathway perturbation. Using simulations we show that the evidence derived from perturbations is independent of the pathway enrichment evidence. This allows us to calculate a global pathway significance P-value, which combines the enrichment and perturbation P-values. We illustrate the capabilities of the novel method on four real datasets. The results obtained on these data show that SPIA has better specificity and more sensitivity than several widely used pathway analysis methods.
Availability: SPIA was implemented as an R package available at http://vortex.cs.wayne.edu/ontoexpress/
Supplementary information: Supplementary data are available at Bioinformatics online.
Gene set enrichment analysis for analyzing large profiling and screening experiments can reveal unifying biological schemes based on previously accumulated knowledge represented as “gene sets”. Most of the existing implementations use a fixed fold-change or P value cutoff to generate regulated gene lists. However, the threshold selection in most cases is arbitrary, and has a significant effect on the test outcome and interpretation of the experiment. We developed a new gene set enrichment analysis method, ie, FDR-FET, which dynamically optimizes the threshold choice and improves the sensitivity and selectivity of gene set enrichment analysis. The procedure translates experimental results into a series of regulated gene lists at multiple false discovery rate (FDR) cutoffs, and computes the P value of the overrepresentation of a gene set using a Fisher’s exact test (FET) in each of these gene lists. The lowest P value is retained to represent the significance of the gene set. We also implemented improved methods to define a more relevant global reference set for the FET. We demonstrate the validity of the method using a published microarray study of three protease inhibitors of the human immunodeficiency virus and compare the results with those from other popular gene set enrichment analysis algorithms. Our results show that combining FDR with multiple cutoffs allows us to control the error while retaining genes that increase information content. We conclude that FDR-FET can selectively identify significant affected biological processes. Our method can be used for any user-generated gene list in the area of transcriptome, proteome, and other biological and scientific applications.
gene set enrichment analysis; false discovery rate; Fisher’s exact test; microarray profiling; protease inhibitors
COFECO is a web-based tool for a composite annotation of protein complexes, KEGG pathways and Gene Ontology (GO) terms within a class of genes and their orthologs under study. Widely used functional enrichment tools using GO and KEGG pathways create large list of annotations that make it difficult to derive consolidated information and often include over-generalized terms. The interrelationship of annotation terms can be more clearly delineated by integrating the information of physically interacting proteins with biological pathways and GO terms. COFECO has the following advanced characteristics: (i) The composite annotation sets of correlated functions and cellular processes for a given gene set can be identified in a more comprehensive and specified way by the employment of protein complex data together with GO and KEGG pathways as annotation resources. (ii) Orthology based integrative annotations among different species complement the defective annotations in an individual genome and provide the information of evolutionary conserved correlations. (iii) A term filtering feature enables users to collect the specified annotations enriched with selected function terms. (iv) A cross-comparison of annotation results between two different datasets is possible. In addition, COFECO provides a web-based GO hierarchical viewer and KEGG pathway viewer where the enrichment results can be summarized and further explored. COFECO is freely accessible at http://piech.kaist.ac.kr/cofeco.
Many methods have been developed to test the enrichment of genes related to certain phenotypes or cell states in gene sets. These approaches usually combine gene expression data with functionally related gene sets as defined in databases such as GeneOntology (GO), KEGG, or BioCarta. The results based on gene set analysis are generally more biologically interpretable, accurate and robust than the results based on individual gene analysis. However, while most available methods for gene set enrichment analysis test the enrichment of the entire gene set, it is more likely that only a subset of the genes in the gene set may be related to the phenotypes of interest.
In this paper, we develop a novel method, termed Sub-GSE, which measures the enrichment of a predefined gene set, or pathway, by testing its subsets. The application of Sub-GSE to two simulated and two real datasets shows Sub-GSE to be more sensitive than previous methods, such as GSEA, GSA, and SigPath, in detecting gene sets assiated with a phenotype of interest. This is particularly true for cases in which only a fraction of the genes in the gene set are associated with the phenotypes. Furthermore, the application of Sub-GSE to two real data sets demonstrates that it can detect more biologically meaningful gene sets than GSEA.
We developed a new method to measure the gene set enrichment. Applications to two simulated datasets and two real datasets show that this method is sensitive to the associations between gene sets and phenotype. The program Sub-GSE can be downloaded from .
Motivation: Many pathway analysis (or gene set enrichment analysis) methods have been developed to identify enriched pathways under different biological states within a genomic study. As more and more microarray datasets accumulate, meta-analysis methods have also been developed to integrate information among multiple studies. Currently, most meta-analysis methods for combining genomic studies focus on biomarker detection and meta-analysis for pathway analysis has not been systematically pursued.
Results: We investigated two approaches of meta-analysis for pathway enrichment (MAPE) by combining statistical significance across studies at the gene level (MAPE_G) or at the pathway level (MAPE_P). Simulation results showed increased statistical power of meta-analysis approaches compared to a single study analysis and showed complementary advantages of MAPE_G and MAPE_P under different scenarios. We also developed an integrated method (MAPE_I) that incorporates advantages of both approaches. Comprehensive simulations and applications to real data on drug response of breast cancer cell lines and lung cancer tissues were evaluated to compare the performance of three MAPE variations. MAPE_P has the advantage of not requiring gene matching across studies. When MAPE_G and MAPE_P show complementary advantages, the hybrid version of MAPE_I is generally recommended.
Supplementary information: Supplementary data are available at Bioinformatics online.
Widespread use of microarrays has generated large amounts of data, the interrogation of the public microarray repositories, identifying similarities between microarray experiments is now one of the major challenges. Approaches using defined group of genes, such as pathways and cellular networks (pathway analysis), have been proposed to improve the interpretation of microarray experiments. We propose a novel method to compare microarray experiments at the pathway level, this method consists of two steps: first, generate pathway signatures, a set of descriptors recapitulating the biologically meaningful pathways related to some clinical/biological variable of interest, second, use these signatures to interrogate microarray databases. We demonstrate that our approach provides more reliable results than with gene-based approaches. While gene-based approaches tend to suffer from bias generated by the analytical procedures employed, our pathway based method successfully groups together similar samples, independently of the experimental design. The results presented are potentially of great interest to improve the ability to query and compare experiments in public repositories of microarray data. As a matter of fact, this method can be used to retrieve data from public microarray databases and perform comparisons at the pathway level.
Microarray experiments examine the change in transcript levels of tens of thousands of genes simultaneously. To derive meaningful data, biologists investigate the response of genes within specific pathways. Pathways are comprised of genes that interact to carry out a particular biological function. Existing methods for analyzing pathways focus on detecting changes in the mean or over-representation of the number of differentially expressed genes relative to the total of genes within the pathway. The issue of how to incorporate the influence of correlation among the genes is not generally addressed.
In this paper, we propose a non-parametric rank test for analyzing pathways that takes into account the correlation among the genes and compared two existing methods, Global and Gene Set Enrichment Analysis (GSEA), using two publicly available data sets. A simulation study was conducted to demonstrate the advantage of the rank test method.
The data indicate the advantages of the rank test. The method can distinguish significant changes in pathways due to either correlations or changes in the mean or both. From the simulation study the rank test out performed Global and GSEA. The greatest gain in performance was for the sample size case which makes the application of the rank test ideal for microarray experiments.
Network-based approaches may leverage genome-wide association (GWA) analysis by testing for the aggregate association across several pathway members. We aimed to examine if networks of genes that represent experimentally determined protein-protein interactions are enriched in genes associated with risk of coronary heart disease (CHD).
Methods and Results
GWA analyses of ~700,000 SNPs in 899 incident CHD cases and 1,823 age- and sex-matched controls within the Nurses’ Health and the Health Professionals Follow-Up Studies were used to assign gene-wise p-values. A large database of protein-protein interactions (PPI) was used to assemble 8,300 unbiased protein complexes and corresponding gene-sets. Superimposed gene-wise p-values were used to rank gene-sets based on their enrichment in genes associated with CHD. After correcting for the number of complexes tested, one gene-set was overrepresented in CHD-associated genes (p-value=0.002). Centered on the beta-1-adrenergic receptor gene (ADRB1), this complex included 18 protein interaction partners that, so far, have not been identified as candidate loci for CHD. Five of the 19 genes in the top-complex are reported to be involved in abnormal cardiovascular system physiology based on knock-out mice (4-fold enrichment; p-value, Fisher’s exact test= 0.006). Ingenuity pathway analysis revealed that especially canonical pathways related to blood pressure regulation were significantly enriched in the genes from the top complex.
The integration of a GWA study with PPI data successfully identifies a set of candidate susceptibility genes for incident CHD that would have been missed in single-marker GWA analysis.
Genetics of cardiovascular disease; acute myocardial infarction; epidemiology
The relative contribution of epigenetic mechanisms to carcinogenesis is not well understood, including the extent to which epigenetic dysregulation and somatic mutations target similar genes and pathways. We hypothesize that during carcinogenesis, certain pathways or biological gene sets are commonly dysregulated via DNA methylation across cancer types. The ability of our logistic regression-based gene set enrichment method to implicate important biological pathways in high-throughput data is well established.
We developed a web-based gene set enrichment application called LRpath with clustering functionality that allows for identification and comparison of pathway signatures across multiple studies. Here, we employed LRpath analysis to unravel the commonly altered pathways and other gene sets across ten cancer studies employing DNA methylation data profiled with the Illumina HumanMethylation27 BeadChip. We observed a surprising level of concordance in differential methylation across multiple cancer types. For example, among commonly hypomethylated groups, we identified immune-related functions, peptidase activity, and epidermis/keratinocyte development and differentiation. Commonly hypermethylated groups included homeobox and other DNA-binding genes, nervous system and embryonic development, and voltage-gated potassium channels. For many gene sets, we observed significant overlap in the specific subset of differentially methylated genes. Interestingly, fewer DNA repair genes were differentially methylated than expected by chance.
Clustering analysis performed with LRpath revealed tightly clustered concepts enriched for differential methylation. Several well-known cancer-related pathways were significantly affected, while others were depleted in differential methylation. We conclude that DNA methylation changes in cancer tend to target a subset of the known cancer pathways affected by genetic aberrations.
This study aimed to discover ‘signature pathways’ characterizing biological processes based on genes differentially expressed in the uterine cervix before and after spontaneous labor.
The cervical transcriptome was previously characterized from biopsies taken before and after term labor. Pathway analysis was used to study the differentially expressed genes based on two gene-to-pathway annotation databases (KEGG and Metacore™). Over-represented and highly impacted pathways and connectivity nodes were identified.
Fifty-two pathways in the Metacore™ database were significantly enriched in differentially expressed genes. Three of the top 5 pathways were known to be involved in cervical remodeling.Two novel pathways were: plasmin signaling and plasminogen activator urokinase (PLAU) signaling. The same analysis in the KEGG database identified 4 significant pathways, of which impact analysis confirmed. Multiple nodes providing connectivity within the plasmin and PLAU signaling pathways were identified..
Three strategies for pathway analysis were consistent in their identification of novel, unexpected as well as expected networks, suggesting that this approach is both valid and effective for the elucidation of biological mechanisms involved in cervical dilation and remodeling.
cervix; cervical dilation; cervical remodeling; labor; microarray; gene signature network; pathway analysis; plasmin; systems biology; parturition
Identification of canonical pathways through enrichment of differentially expressed genes in a given pathway is a widely used method for interpreting gene lists generated from high-throughput experimental studies. However, most algorithms treat pathways as sets of genes, disregarding any inter- and intra-pathway connectivity information, and do not provide insights beyond identifying lists of pathways.
We developed an algorithm (PathNet) that utilizes the connectivity information in canonical pathway descriptions to help identify study-relevant pathways and characterize non-obvious dependencies and connections among pathways using gene expression data. PathNet considers both the differential expression of genes and their pathway neighbors to strengthen the evidence that a pathway is implicated in the biological conditions characterizing the experiment. As an adjunct to this analysis, PathNet uses the connectivity of the differentially expressed genes among all pathways to score pathway contextual associations and statistically identify biological relations among pathways. In this study, we used PathNet to identify biologically relevant results in two Alzheimer’s disease microarray datasets, and compared its performance with existing methods. Importantly, PathNet identified de-regulation of the ubiquitin-mediated proteolysis pathway as an important component in Alzheimer’s disease progression, despite the absence of this pathway in the standard enrichment analyses.
PathNet is a novel method for identifying enrichment and association between canonical pathways in the context of gene expression data. It takes into account topological information present in pathways to reveal biological information. PathNet is available as an R workspace image from
Canonical pathways; Pathway enrichment; Pathway association; Pathway interaction; Pathway topology
The development of expression-based gene signatures for predicting prognosis or class membership is a popular and challenging task. Besides their stringent validation, signatures need a functional interpretation and must be placed in a biological context. Popular tools such as Gene Set Enrichment have drawbacks because they are restricted to annotated genes and are unable to capture the information hidden in the signature’s non-annotated genes.
We propose concepts to relate a signature with functional gene sets like pathways or Gene Ontology categories. The connection between single signature genes and a specific pathway is explored by hierarchical variable selection and gene association networks. The risk score derived from an individual patient’s signature is related to expression patterns of pathways and Gene Ontology categories. Global tests are useful for these tasks, and they adjust for other factors. GlobalAncova is used to explore the effect on gene expression in specific functional groups from the interaction of the score and selected mutations in the patient’s genome.
We apply the proposed methods to an expression data set and a corresponding gene signature for predicting survival in Acute Myeloid Leukemia (AML). The example demonstrates strong relations between the signature and cancer-related pathways. The signature-based risk score was found to be associated with development-related biological processes.
Many authors interpret the functional aspects of a gene signature by linking signature genes to pathways or relevant functional gene groups. The method of gene set enrichment is preferred to annotating signature genes to specific Gene Ontology categories. The strategies proposed in this paper go beyond the restriction of annotation and deepen the insights into the biological mechanisms reflected in the information given by a signature.
High-throughput gene-expression studies result in lists of differentially expressed genes. Most current meta-analyses of these gene lists include searching for significant membership of the translated proteins in various signaling pathways. However, such membership enrichment algorithms do not provide insight into which pathways caused the genes to be differentially expressed in the first place. Here, we present an intuitive approach for discovering upstream signaling pathways responsible for regulating these differentially expressed genes. We identify consistently regulated signature genes specific for signal transduction pathways from a panel of single-pathway perturbation experiments. An algorithm that detects overrepresentation of these signature genes in a gene group of interest is used to infer the signaling pathway responsible for regulation. We expose our novel resource and algorithm through a web server called SPEED: Signaling Pathway Enrichment using Experimental Data sets. SPEED can be freely accessed at http://speed.sys-bio.net/.
MicroRNAs (miRNAs) are known to be important post-transcriptional regulators that are involved in the etiology of complex psychiatric traits. The present study aimed to incorporate miRNAs information into pathway analysis using a genome-wide association dataset to identify relevant biological pathways for bipolar disorder (BPD). We selected psychiatric- and neurological-associated miRNAs (N = 157) from PhenomiR database. The miRNA target genes (miTG) predictions were obtained from microRNA.org. Canonical pathways (N = 4,051) were downloaded from the Molecule Signature Database. We employed a novel weighting scheme for miTGs in pathway analysis using methods of gene set enrichment analysis and sum-statistic. Under four statistical scenarios, 38 significantly enriched pathways (P-value < 0.01 after multiple testing correction) were identified for the risk of developing BPD, including pathways of ion channels associated (e.g., gated channel activity, ion transmembrane transporter activity, and ion channel activity) and nervous related biological processes (e.g., nervous system development, cytoskeleton, and neuroactive ligand receptor interaction). Among them, 19 were identified only when the weighting scheme was applied. Many miRNA-targeted genes were functionally related to ion channels, collagen, and axonal growth and guidance that have been suggested to be associated with BPD previously. Some of these genes are linked to the regulation of miRNA machinery in the literature. Our findings provide support for the potential involvement of miRNAs in the psychopathology of BPD. Further investigations to elucidate the functions and mechanisms of identified candidate pathways are needed.
microRNA; bipolar disorder; pathway analysis; genome-wide association; ion channel
With the integration of the KEGG and Predictome databases as well as two search engines for coexpressed genes/proteins using data sets obtained from the Stanford Microarray Database (SMD) and Gene Expression Omnibus (GEO) database, VisANT 3.0 supports exploratory pathway analysis, which includes multi-scale visualization of multiple pathways, editing and annotating pathways using a KEGG compatible visual notation and visualization of expression data in the context of pathways. Expression levels are represented either by color intensity or by nodes with an embedded expression profile. Multiple experiments can be navigated or animated. Known KEGG pathways can be enriched by querying either coexpressed components of known pathway members or proteins with known physical interactions. Predicted pathways for genes/proteins with unknown functions can be inferred from coexpression or physical interaction data. Pathways produced in VisANT can be saved as computer-readable XML format (VisML), graphic images or high-resolution Scalable Vector Graphics (SVG). Pathways in the format of VisML can be securely shared within an interested group or published online using a simple Web link. VisANT is freely available at http://visant.bu.edu.
Genome-wide association studies (GWAS) focus on relatively few highly significant loci while less attention is given to other genotyped markers. Employing pathway analysis to existing GWAS data may shed light on relevant biological processes, and illuminate new candidate genes. We employed a pathway-based approach to the breast cancer GWAS data of the National Cancer Institute (NCI) Cancer Genetic Markers of Susceptibility (CGEMS) project that includes 1145 cases and 1142 controls. Pathways were retrieved from three databases: KEGG, BioCarta, and the NCI’s Protein Interaction Database (PID). Genes were represented by their most strongly associated SNP, and an enrichment score (ES) reflecting the overrepresentation of gene-based association signals in each pathway was calculated using a weighted Kolmogorov-Smirnov procedure. Finally, hierarchical clustering was used to identify pathways with overlapping genes, and clusters with excess of association signals were determined by the adaptive rank-truncated product (ARTP) method. A total of 421 pathways containing 3962 genes were included in our study. Of these, three pathways (‘Syndecan-1-mediated signaling ‘, ‘Signaling of Hepatocyte Growth Factor Receptor’ and ‘Growth Hormone Signaling’) were highly enriched with association signals (PES < 0.001, False Discovery Rate (FDR) = 0.118). Our clustering analysis revealed that pathways containing key components of the RAS/RAF/MAPK canonical signaling cascade, were significantly more likely to have excess of association signals than expected by chance (PARTP = 0.0051, FDR = 0.07). These results suggest that genetic alterations associated with these three pathways and one canonical signaling cascade may contribute to breast cancer susceptibility.
Pathways; GWAS; Breast cancer; Susceptibility; Genetics
Analyzing sets of genes in genome-wide association studies is a relatively new approach that aims to capitalize on biological knowledge about the interactions of genes in biological pathways. This approach, called pathway analysis or gene set analysis, has not yet been applied to the analysis of rare variants. Applying pathway analysis to rare variants offers two competing approaches. In the first approach rare variant statistics are used to generate p-values for each gene (e.g., combined multivariate collapsing [CMC] or weighted-sum [WS]) and the gene-level p-values are combined using standard pathway analysis methods (e.g., gene set enrichment analysis or Fisher’s combined probability method). In the second approach, rare variant methods (e.g., CMC and WS) are applied directly to sets of single-nucleotide polymorphisms (SNPs) representing all SNPs within genes in a pathway. In this paper we use simulated phenotype and real next-generation sequencing data from Genetic Analysis Workshop 17 to analyze sets of rare variants using these two competing approaches. The initial results suggest substantial differences in the methods, with Fisher’s combined probability method and the direct application of the WS method yielding the best power. Evidence suggests that the WS method works well in most situations, although Fisher’s method was more likely to be optimal when the number of causal SNPs in the set was low but the risk of the causal SNPs was high.
Most methods for large-scale gene expression microarray and RNA-Seq data analysis are designed to determine the lists of genes or gene products that show distinct patterns and/or significant differences. The most challenging and rate-liming step, however, is to determine what the resulting lists of genes and/or transcripts biologically mean. Biomedical ontology and pathway-based functional enrichment analysis is widely used to interpret the functional role of tightly correlated or differentially expressed genes. The groups of genes are assigned to the associated biological annotations using Gene Ontology terms or biological pathways and then tested if they are significantly enriched with the corresponding annotations. Unlike previous approaches, Gene Set Enrichment Analysis takes quite the reverse approach by using pre-defined gene sets. Differential co-expression analysis determines the degree of co-expression difference of paired gene sets across different conditions. Outcomes in DNA microarray and RNA-Seq data can be transformed into the graphical structure that represents biological semantics. A number of biomedical annotation and external repositories including clinical resources can be systematically integrated by biological semantics within the framework of concept lattice analysis. This array of methods for biological knowledge assembly and interpretation has been developed during the past decade and clearly improved our biological understanding of large-scale genomic data from the high-throughput technologies.
Gene expression profiling offers a great opportunity for studying multi-factor diseases and for understanding the key role of genes in mechanisms which drive a normal cell to a cancer state. Single gene analysis is insufficient to describe the complex perturbations responsible for cancer onset, progression and invasion. A deeper understanding of the mechanisms of tumorigenesis can be reached focusing on deregulation of gene sets or pathways rather than on individual genes. We apply two known and statistically well founded methods for finding pathways and biological processes deregulated in pathological conditions by analyzing gene expression profiles. In particular, we measure the amount of deregulation and assess the statistical significance of predefined pathways belonging to a curated collection (Molecular Signature Database) in a colon cancer data set. We find that pathways strongly involved in different tumors are strictly connected with colon cancer. Moreover, our experimental results show that the study of complex diseases through pathway analysis is able to highlight genes weakly connected to the phenotype which may be difficult to detect by using classical univariate statistics. Our study shows the importance of using gene sets rather than single genes for understanding the main biological processes and pathways involved in colorectal cancer. Our analysis evidences that many of the genes involved in these pathways are strongly associated to colorectal tumorigenesis. In this new perspective, the focus shifts from finding differentially expressed genes to identifying biological processes, cellular functions and pathways perturbed in the phenotypic conditions by analyzing genes co-expressed in a given pathway as a whole, taking into account the possible interactions among them and, more importantly, the correlation of their expression with the phenotypical conditions.
Microarray; pathway analysis; prediction accuracy; machine learning; gene expression; colon cancer.
Motivation: Development of high-throughput technology makes it possible to measure expressions of thousands of genes simultaneously. Genes have the inherent pathway structure, where pathways are composed of multiple genes with coordinated biological functions. It is of great interest to identify differential gene pathways that are associated with the variations of phenotypes.
Results: We propose the following approach for detecting differential gene pathways. First, we construct gene pathways using databases such as KEGG or GO. Second, for each pathway, we extract a small number of representative features, which are linear combinations of gene expressions and/or their transformations. Specifically, we propose using (i) principal components (PCs) of gene expression sets, (ii) PCs of expanded gene expression sets and (iii) expanded sets of PCs of gene expressions, as the representative features. Third, we identify differential gene pathways as those with representative features significantly associated with the variations of phenotypes, particularly disease clinical outcomes, in regression models. The false discovery rate approach is used to adjust for multiple comparisons. Analysis of three gene expression datasets suggests that (i) the proposed approach can effectively identify differential gene pathways; (ii) PCs that explain only a small amount of variations of gene expressions may bear significant associations between gene pathways and phenotypes; (iii) including second-order terms of gene expressions may lead to identification of new differential gene pathways; (iv) the proposed approach is relatively insensitive to additional noises; and (v) the proposed approach can identify gene pathways missed by alternative approaches.
Supplementary information:Supplementary data are available at Bioinformatics online.
Metastasis is believed to progress in several steps including different pathways but the determination and understanding of these mechanisms is still fragmentary. Microarray analysis of gene expression patterns in breast tumors has been used to predict outcome in recent studies. Besides classification of outcome, these global expression patterns may reflect biological mechanisms involved in metastasis of breast cancer. Our purpose has been to investigate pathways and transcription factors involved in metastasis by use of gene expression data sets.
We have analyzed 8 publicly available gene expression data sets. A global approach, "gene set enrichment analysis" as well as an approach focusing on a subset of significantly differently regulated genes, GenMAPP, has been applied to rank pathway gene sets according to differential regulation in metastasizing tumors compared to non-metastasizing tumors. Meta-analysis has been used to determine overrepresentation of pathways and transcription factors targets, concordant deregulated in metastasizing breast tumors, in several data sets.
The major findings are up-regulation of cell cycle pathways and a metabolic shift towards glucose metabolism reflected in several pathways in metastasizing tumors. Growth factor pathways seem to play dual roles; EGF and PDGF pathways are decreased, while VEGF and sex-hormone pathways are increased in tumors that metastasize. Furthermore, migration, proteasome, immune system, angiogenesis, DNA repair and several signal transduction pathways are associated to metastasis. Finally several transcription factors e.g. E2F, NFY, and YY1 are identified as being involved in metastasis.
By pathway meta-analysis many biological mechanisms beyond major characteristics such as proliferation are identified. Transcription factor analysis identifies a number of key factors that support central pathways. Several previously proposed treatment targets are identified and several new pathways that may constitute new targets are identified.