ConsensusPathDB is a meta-database that integrates different types of functional interactions from heterogeneous interaction data resources. Physical protein interactions, metabolic and signaling reactions and gene regulatory interactions are integrated in a seamless functional association network that simultaneously describes multiple functional aspects of genes, proteins, complexes, metabolites, etc. With 155432 human, 194480 yeast and 13648 mouse complex functional interactions (originating from 18 databases on human and eight databases on yeast and mouse interactions each), ConsensusPathDB currently constitutes the most comprehensive publicly available interaction repository for these species. The Web interface at http://cpdb.molgen.mpg.de offers different ways of utilizing these integrated interaction data, in particular with tools for visualization, analysis and interpretation of high-throughput expression data in the light of functional interactions and biological pathways.
Knowledge of the functional interactions between physical entities in the cell has high explanatory power regarding biological processes in health and disease (1). Thus, numerous methods for mapping functional association networks such as physical protein interaction networks, metabolic and signaling pathways and gene regulatory networks have been applied in many organisms. The data resulting from such analyses are currently scattered across hundreds of databases that typically contain only a single aspect of functional interactions of genes, proteins, etc. (2). For example, some databases specialize in storing protein–protein interaction data, while others focus on the curation of biochemical pathways and still others on gene regulatory interactions. In the cell, however, all types of functional interactions are operative at the same time: for example, genes are regulated to produce proteins that interact physically with other proteins to form complexes that catalyze metabolic reactions. ConsensusPathDB, which we previously reported in (3), assembles a functional association network from multiple heterogeneous public interaction resources by integrating physical entities based on their accession numbers and functional interactions based on their participants. As the combined interaction network in ConsensusPathDB reveals multiple functional aspects of cellular entities at the same time by combining highly complementary data, it is closer to biological reality than the separate source networks. The content of ConsensusPathDB can be exploited in different ways and contexts through its public Web interface at http://cpdb.molgen.mpg.de. It features interaction querying and visualization, network validation and several tools for the interaction- and pathway-level interpretation of user-specified gene or protein expression data.
In this database update report, we highlight the major extensions of ConsensusPathDB regarding database content and functionality of its Web interface.
Since the previous database report (3), the human interaction content of ConsensusPathDB has been increased significantly (Figure 1, left panel). Due to the integration of six additional interaction data resources and updates on the previously integrated 12 resources, the human interaction data in ConsensusPathDB have more than doubled from 74289 to 155432 unique complex functional interactions. The newly integrated data include complex protein interactions from Corum (4), large-scale protein interaction networks from IntAct (5) (designated IntAct-LS), manually curated protein–protein interactions from MIPS-MPPI (6), protein–protein interactions from the Pathogen Interaction Gateway (PIG) meta-database (7), the Edinburgh Human Metabolic Network reconstruction (EHMN) (8) and biological pathways from INOH (http://www.inoh.org). We have additionally imported 5238 physical interactions between human transcription factors published recently in ref. 9. Furthermore, pathway definitions in the form of lists of genes participating in biological pathways were imported from PharmGKB (10) for use in pathway-based analysis of expression data. With the addition of PIG, 20098 host–pathogenic protein–protein interactions were introduced into ConsensusPathDB involving proteins from 864 viral and bacterial species. Thus, the integrated ConsensusPathDB network can now additionally serve as explanatory basis in the context of infectious diseases.
Table 1 shows the number of human interactions imported from each database, as well as the pairwise overlaps of source databases. To assess these overlaps and to avoid redundant interactions in ConsensusPathDB, physical entities and functional interactions from source databases are mapped to each other. The mapping process is detailed in Supplementary Data.
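The idea of mapping entities and interactions onto each other can be illustrated with a minimal sketch. It is not the actual integration pipeline (which is detailed in Supplementary Data); it only assumes that entities are mapped to a common accession and that interactions with identical participant sets are merged. The function names, the `accession_map` lookup and the toy records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def integrate(records, accession_map):
    """Merge interactions that have the same set of mapped participants.

    records: iterable of (source_db, participants) tuples.
    accession_map: hypothetical lookup from source-specific identifiers
        to a common accession; unmapped identifiers are kept as they are.
    Returns a dict: frozenset of accessions -> set of source databases.
    """
    merged = defaultdict(set)
    for source, participants in records:
        key = frozenset(accession_map.get(p, p) for p in participants)
        merged[key].add(source)
    return dict(merged)


def pairwise_overlaps(merged):
    """Count the interactions shared by each pair of source databases."""
    overlaps = defaultdict(int)
    for sources in merged.values():
        for a, b in combinations(sorted(sources), 2):
            overlaps[(a, b)] += 1
    return dict(overlaps)


# Toy data, invented for illustration only.
accession_map = {"geneA": "P1", "ENSG_A": "P1",
                 "geneB": "P2", "ENSG_B": "P2",
                 "geneC": "P3"}
records = [
    ("DB1", ["geneA", "geneB"]),
    ("DB2", ["ENSG_B", "ENSG_A"]),   # same interaction, different identifiers
    ("DB2", ["geneA", "geneC"]),
]
merged = integrate(records, accession_map)
overlaps = pairwise_overlaps(merged)
```

In this toy case three source records collapse into two unique interactions, one of which is shared by both databases. A real integration would probably also have to compare interaction types and participant roles, which this sketch ignores.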
Apart from extending the human functional interaction network, we have created ConsensusPathDB instances for two more organisms: Saccharomyces cerevisiae and Mus musculus, integrating eight interaction resources each: Reactome (11), KEGG (12), BioCyc (13), IntAct (5), DIP (14), MINT (15), BioGRID (16) and MIPS (6,17). The mouse instance additionally includes 1145 interactions between mouse transcription factors obtained from ref. 9. As in the human ConsensusPathDB instance, only metabolic reactions have been imported from KEGG in the mouse and yeast database instances, because KEGG does not make signaling reactions available in any computer-readable format. However, KEGG’s signaling pathways are stored in ConsensusPathDB in the form of gene lists for use in the gene expression analyses described below.
Overall, ConsensusPathDB currently contains 41271 physical entities, 155432 functional interactions and 2205 biological pathways in human; 14532 physical entities, 194480 functional interactions and 734 biological pathways in yeast; and 21946 physical entities, 13648 functional interactions and 1381 biological pathways in mouse. The numbers correspond to the content after integration, i.e. unique item counts (for example, the number of non-unique human interactions before integration is 306003). Our meta-database is updated every 3 months with the newest releases of its interaction resources.
For the vast majority of functional interactions and physical entities, annotation in the form of literature references and sequence database identifiers, respectively, is imported from the source databases. Literature references are especially useful for protein–protein interactions, as they are often used to estimate interaction confidence. We do not make any judgments on the quality of interactions: all interactions from all source databases are treated equally. For example, physical interactions detected by both large-scale and small-scale experiments are accommodated in ConsensusPathDB without applying any interaction filtering. Users of ConsensusPathDB can themselves choose to filter interactions, e.g. by the number of publications, the scale of the interaction detection method or the number of source databases per interaction, since this information is stored and provided in ConsensusPathDB.
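User-side filtering of this kind could look like the following sketch. The `Interaction` record and its field names are assumptions made for illustration and do not reflect the actual ConsensusPathDB download format; the identifiers in the toy data are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Interaction:
    """Hypothetical in-memory record of one integrated interaction."""
    participants: frozenset
    publications: set = field(default_factory=set)  # e.g. PubMed identifiers
    sources: set = field(default_factory=set)       # source database names


def filter_interactions(interactions, min_publications=1, min_sources=1):
    """Keep interactions supported by enough publications and databases."""
    return [
        i for i in interactions
        if len(i.publications) >= min_publications
        and len(i.sources) >= min_sources
    ]


# Toy data, invented for illustration only.
interactions = [
    Interaction(frozenset({"P1", "P2"}), {"pm1", "pm2"}, {"DB1", "DB2"}),
    Interaction(frozenset({"P1", "P3"}), {"pm3"}, {"DB2"}),
    Interaction(frozenset({"P2", "P4"}), {"pm4", "pm5"}, {"DB3"}),
]
well_supported = filter_interactions(interactions,
                                     min_publications=2, min_sources=2)
```

With these thresholds only the first toy interaction is retained. The thresholds are left to the user, in line with the policy of not judging interaction quality in the database itself.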
For all physical entities, interactions and pathways, the different source databases are recorded and links to the original data are provided where applicable.
Apart from boosting the interaction content in the ConsensusPathDB repository, we have further developed its publicly accessible Web interface (http://cpdb.molgen.mpg.de) to add new functionality (Figure 1, right panel). This includes mainly (i) an advanced, more flexible network visualization framework that features for example overlay of expression values on physical entity nodes and (ii) new facilities in the gene expression data analysis context for the detection of interaction sub-networks and other functional gene groups that have changed activity between phenotypes.
After searching for interactions of particular physical entities, for biological pathways or for shortest paths of interactions connecting any two physical entities in ConsensusPathDB, the Web interface user currently has two choices for visualizing the selected interactions: the previously described static-image-based visualization framework and a new Java-applet-based framework. Both frameworks display interaction networks in the same style, so switching between them requires hardly any adjustment by the user. The Java-based framework requires a Java Runtime Environment to be installed on the client computer and thus has higher processor and memory requirements than a static image, but it has several advantages, especially for visualizing larger networks. Network nodes (physical entities/functional interactions) are movable and can be rearranged automatically using different layout methods. Network viewing is further facilitated by a zoom function controlled with the mouse wheel. In this Java-based visualization environment, gene/protein expression data can be overlaid on the nodes of the currently viewed network to enable the interaction network-based interpretation of these data (Figure 2).
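The shortest-path search mentioned above can be illustrated with a breadth-first search over an unweighted network. This is only a sketch of the general technique, not the implementation behind the Web interface; the adjacency dictionary and node names are invented.

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Return one shortest path from start to goal, or None if unreachable.

    adjacency: dict mapping each node to an iterable of neighboring nodes.
    """
    if start == goal:
        return [start]
    previous = {start: None}      # also serves as the set of visited nodes
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor in previous:
                continue
            previous[neighbor] = node
            if neighbor == goal:
                # Walk back from the goal to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Toy undirected network, invented for illustration only.
adjacency = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "E"],
    "E": ["D"],
}
path = shortest_path(adjacency, "A", "E")
```

In a network that contains both entity nodes and interaction nodes, as displayed by the visualization frameworks, the same search would alternate between the two node kinds.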
Using the Web interface of ConsensusPathDB, gene expression data can be analyzed with statistical methods on the level of predefined functional gene sets. These gene sets are based on neighborhood in the functional interaction network, cooperation in curated biochemical pathways or, as a recent addition, co-annotation with Gene Ontology (18) categories. One way to interpret gene expression data is gene set over-representation analysis, a functionality that we described in our previous database report (3). Here, the user uploads a list of genes that are differentially expressed in a phenotype of interest, typically a disease phenotype, compared to a control phenotype. Based on the hypergeometric test, predefined functional gene sets such as pathways or interaction sub-networks are identified that contain significantly more of the uploaded genes of interest than expected by chance. For example, if differentially expressed genes are over-represented in a network region, this can indicate that the region is dysregulated in the phenotype in question. In addition to over-representation analysis, we have implemented a gene set enrichment analysis method, which we have reported in (19). In this approach, denoted Wilcoxon enrichment analysis, the complete set of measured genes is uploaded with two expression values per gene, rather than just an unweighted list of genes that pass a significance threshold as in over-representation analysis. The per-gene values typically represent gene expression levels in the two phenotypes being compared. For every predefined gene set, a Wilcoxon signed-rank test is calculated to evaluate the joint expression difference of the whole predefined gene set rather than of individual genes. In other words, even if a predefined functional gene set, such as a pathway, contains no genes with significant differential expression, the joint expression of the group of genes may be significantly changed, indicating potential pathway deregulation through small but consistent expression changes of its genes.
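The two tests described above can be sketched in plain Python. This is an illustrative approximation and not the code used by ConsensusPathDB: the function names and toy values are hypothetical, and the signed-rank P-value uses the normal approximation without tie or continuity correction, which is only rough for small gene sets.

```python
import math


def hypergeom_pvalue(k, n, K, N):
    """Over-representation P-value P(X >= k).

    N: number of background genes, K: genes in the predefined set,
    n: uploaded genes of interest, k: uploaded genes that fall in the set.
    """
    total = math.comb(N, n)
    tail = sum(math.comb(K, i) * math.comb(N - K, n - i)
               for i in range(k, min(n, K) + 1))
    return tail / total


def wilcoxon_signed_rank(x, y):
    """Two-sided signed-rank P-value for paired values (normal approximation)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))


# Over-representation: 2 of 3 uploaded genes hit a set of 4 among 10 genes.
p_ora = hypergeom_pvalue(k=2, n=3, K=4, N=10)

# Enrichment: mean expression of the 10 genes of one set in two phenotypes
# (toy values, invented for illustration only).
phenotype_a = [5.0, 4.8, 6.0, 5.5, 4.9, 5.2, 6.1, 5.0, 4.7, 5.8]
phenotype_b = [5.1, 5.0, 6.3, 5.9, 5.4, 5.8, 6.8, 5.8, 5.6, 6.8]
p_enrichment = wilcoxon_signed_rank(phenotype_a, phenotype_b)
```

In the toy enrichment example every gene changes only slightly, but all changes point in the same direction, so the set as a whole receives a small P-value. This is the situation in which the enrichment approach can detect a gene set that over-representation analysis would miss.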
To demonstrate the utility of the new Wilcoxon enrichment analysis functionality and the overlay of expression values on interaction networks within the new Java network visualization environment, we applied these tools to gene expression measurements from Yu et al. (20) comparing patients with prostate carcinoma to patients with metastatic prostate cancer. The Yu et al. data were downloaded from Oncomine 3.0 (21) in February 2009. These data constitute gene expression measurements in 64 prostate carcinoma samples and 25 metastatic prostate cancer samples and are summarized in Oncomine in the form of mean normalized gene expression values for the two sample cohorts, as well as a t-test P-value reflecting the significance of differential gene expression. We additionally filtered the data to exclude ESTs and ambiguously identified genes. For Wilcoxon enrichment analysis in ConsensusPathDB, we uploaded the resulting list of 7807 genes together with their mean expression values for both patient cohorts. We selected interaction neighborhood-based entity sets of radius 1, curated pathways, and Gene Ontology level 3 biological process categories as predefined functional sets for enrichment analysis with default parameter settings. The results, summarized in Supplementary Table S1, clearly correspond to the hallmarks of human cancer (22): changes in the cell cycle, transcription, translation, signaling, angiogenesis and immune response.
For example, among the pathways whose activity is significantly changed in metastatic cancer compared to primary carcinoma according to the Wilcoxon enrichment analysis (Supplementary Table S1) are ‘Ribosome’ (KEGG) [see (23)]; ‘Translation’ (Reactome); ‘Mitotic cell cycle’ (Reactome); ‘Interleukin-5 immune pathway’ (NetPath); ‘VEGF, hypoxia and angiogenesis’ (BioCarta); as well as several cancer-related signaling pathways like ‘GPCR signaling’ (Reactome); ‘PDGFR-beta signaling’ (Pathway Interaction Database); ‘ERK signaling’ (Reactome); ‘RAS signaling’ (Reactome); ‘JAK/STAT signaling’ (INOH). Notably, KEGG’s ‘Non-small cell lung cancer pathway’ is also among the significantly enriched pathways. Although no Gene Ontology categories were significant at the 0.05 P-value threshold after correction for multiple testing, among the categories with significant Wilcoxon enrichment P-values (Supplementary Table S1) are ‘Chromosome organization’, ‘Chromatin assembly’, ‘Regulation of organ growth’ and ‘Mitotic cell cycle’.
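The difference between nominal and corrected P-values can be illustrated with a small example. The text does not state which correction procedure was applied, so the Benjamini-Hochberg procedure below is an assumption chosen only because it is widely used; the P-values are invented.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Go from the largest to the smallest P-value so that the adjusted
    # values never decrease with increasing rank.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Toy P-values, invented for illustration only.
nominal = [0.01, 0.04, 0.03, 0.20]
corrected = benjamini_hochberg(nominal)
```

Here the second category has a nominal P-value of 0.04 but a corrected value of about 0.053, so it would pass a 0.05 threshold only before correction.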
As for enriched neighborhood-based entity sets (Supplementary Table S1), the most significantly enriched one (Wilcoxon signed-rank test P-value=8.34e-6) has Histone H3-K9 methyltransferase 2 (gene symbol SUV39H2) as the set center. This neighborhood-based entity set is constructed from physical interactions and biochemical reactions originating from a total of nine different source databases. The central gene SUV39H2 plays a crucial role in cell cycle, transcriptional regulation and cell differentiation [Gene Ontology annotation, UniProt (24) keywords] and its mutations have been shown to increase the risk of cancer in humans and in mouse models (25,26). It is important to mention that SUV39H2 itself is not contained in the expression data set that we uploaded and used for Wilcoxon enrichment analysis, but many of its network neighbors have expression measurements that show coherent transcriptional dysregulation. Figure 2 shows the functional interaction neighborhood sub-network of SUV39H2, where the Yu et al. data are overlaid on protein nodes as logarithmized gene expression fold change. To conclude, our results support previous findings (27) that utilizing functional interaction data can substantially improve expression data interpretation and can point to disease genes that are not detected by gene expression analysis alone. Moreover, our results underline the importance of data integration, as the SUV39H2-centered sub-network comprises different types of functional interactions originating from nine interaction databases.
Examples of further significantly enriched neighborhood-based entity sets include those centered on ribosomal proteins (e.g., 40S ribosomal protein S4, Y isoform 2: UniProt: RS4Y2_HUMAN), on Nucleosome assembly complex protein 1-like 4 (UniProt: NP14L_HUMAN), on cell cycle proteins (e.g., MAT1 and MAP kinase p38 delta: UniProt: MK13_HUMAN) and on the transcription factor SP1, which, according to UniProt annotation, may modulate the cellular response to DNA damage.
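How a neighborhood-based entity set of radius 1 might be derived from an interaction list is sketched below. The exact construction used in ConsensusPathDB is not specified here, so the sketch assumes that a set consists of a center entity plus every entity that shares at least one interaction with it; all names are placeholders.

```python
from collections import defaultdict


def neighborhood_sets(interactions):
    """Map each entity to itself plus all of its direct interaction partners.

    interactions: iterable of participant collections, one per interaction.
    """
    sets = defaultdict(set)
    for participants in interactions:
        for center in participants:
            sets[center].update(participants)
    return dict(sets)


def measured_members(entity_set, expression):
    """Restrict a set to the entities that have expression measurements."""
    return sorted(e for e in entity_set if e in expression)


# Toy data, invented for illustration only. The center "C" is not measured.
interactions = [("C", "N1", "N2"), ("C", "N3"), ("N3", "N4")]
expression = {"N1": (5.0, 5.4), "N2": (4.8, 5.3), "N3": (6.0, 6.6)}

sets = neighborhood_sets(interactions)
testable = measured_members(sets["C"], expression)
```

As with SUV39H2 in the analysis above, the center of the toy set has no expression measurement itself, yet the set can still be tested through its measured neighbors.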
The Web interface to ConsensusPathDB is freely available to academic users through http://cpdb.molgen.mpg.de.
The protein interaction part of the ConsensusPathDB interaction network can be downloaded in tab-delimited or PSI-MI (28) formats. While the complete database content is not downloadable due to licensing limitations imposed by several source databases, we provide a list of identifiers of matching interactions across source databases upon request. Furthermore, we offer web services for automated pathway analysis of gene expression data. Through the ConsensusPathDB plugin (29) for the Cytoscape network visualization and analysis software (30), experimentalists can automatically mine ConsensusPathDB for evidence supporting sets of newly detected protein interactions and highlight the novel interactions among them. For the interaction data, web services, and Cytoscape plugin, please visit the ‘data access’ section of ConsensusPathDB’s Web site.
Supplementary Data are available at NAR Online.
The Max Planck Society (IMPRS-CBSC); the European Commission under its 7th Framework Programme with the grant APO-SYS (Health-F4-2007-200767); the German Ministry for Research with the grant PREDICT (0315428A); and the Austrian Nationalstiftung and the Austria Wirtschaftsservice GmbH in the framework of the IMGuS research program. Funding for open access charge: European Union’s project APO-SYS (Health-F4-2007-200767).
Conflict of interest statement. None declared.
We thank the developers of all ConsensusPathDB’s source databases for making their interaction data available.