The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions/at/oxfordjournals.org
In recent years, the Munich Information Center for Protein Sequences (MIPS) yeast protein–protein interaction (PPI) dataset has been used in numerous analyses of protein networks and has been called a gold standard because of its quality and comprehensiveness [H. Yu, N. M. Luscombe, H. X. Lu, X. Zhu, Y. Xia, J. D. Han, N. Bertin, S. Chung, M. Vidal and M. Gerstein (2004) Genome Res., 14, 1107–1118]. MPact and the yeast protein localization catalog provide information related to the proximity of proteins in yeast. Besides the integration of high-throughput data, information about experimental evidence for PPIs in the literature was compiled by experts, adding up to 4300 distinct PPIs connecting 1500 proteins in yeast. As the interaction data are a complementary part of CYGD, interactive mapping of the data onto other integrated data types such as the functional classification catalog [A. Ruepp, A. Zollner, D. Maier, K. Albermann, J. Hani, M. Mokrejs, I. Tetko, U. Güldener, G. Mannhaupt, M. Münsterkötter and H. W. Mewes (2004) Nucleic Acids Res., 32, 5539–5545] is possible. A survey of signaling proteins and a comparison with pathway data from KEGG demonstrate that an extensive overview of the complexity of this functional network in yeast can be obtained from these manually annotated data alone. The implementation of a web-based PPI-analysis tool allows analysis and visualization of protein interaction networks and facilitates integration of our curated data with high-throughput datasets. The complete dataset as well as user-defined sub-networks can be retrieved easily in the standardized PSI-MI format. The resource can be accessed through http://mips.gsf.de/genre/proj/mpact.
The analysis of numerous genomes over the past decade contributed only in part to a comprehensive understanding of the complex biological processes in living cells, since the ‘parts list’ of a genome lacks any information on the action of genes in the context provided by the cellular environment. Several types of interaction networks, such as metabolic pathways, regulatory modules or signaling cascades, which require the coordinated action of many different proteins, can be distinguished. The most exhaustively studied model for functional interactions in eukaryotes is the yeast Saccharomyces cerevisiae. In addition to the impressive number of individual experiments that uncover protein–protein interactions (PPIs) in yeast, data generated by several high-throughput techniques are available. In particular, large-scale yeast two-hybrid analyses added valuable information to the understanding of the protein network in yeast (1,2). However, a major disadvantage of most high-throughput approaches is their significant rate of false-positive interactions. The overlap between the two large but independent yeast two-hybrid datasets has been found to be remarkably low, which raised the question of how these data should be weighted. Since no straightforward benchmark standards of truth are available, manually curated data in the MPact dataset are accepted as a trusted standard (3,4).
In addition to providing a sound reference for the evaluation of experimental results, MPact has been used intensively for the validation of bioinformatics methods that predict functional associations from experimental data. It was shown that genes with similar expression profiles are more likely to encode interacting proteins, thus describing a subset of functional modules named ‘party hubs’, in contrast to ‘date hubs’, which consist of interacting proteins not synchronized by co-regulation (5–9). Extracting information from the scientific literature and processing it for systematic storage is a time-consuming and expensive task. Accordingly, only a few databases of manually compiled PPIs exist. CYGD (10), DIP (11), BIND (12), MINT (13) and HPRD (14) are important resources of this kind.
As applications that map experimental data to PPIs ask for dynamic and interactive ways of retrieval, a new section of the Munich Information Center for Protein Sequences (MIPS) Genome Research Environment (GenRE) (http://mips.gsf.de/genre/proj/genre) was developed. This section, initially designed as a generic and versatile data structure for interaction data, was used to structure interaction data on mammalian proteomes (15). Using MPact, mapping of interaction data to other secondary information such as the functional classification catalog (FunCat), or mapping of interaction data to related proteomes, is feasible (16). For instance, the latter revealed that the likelihood of having an ortholog in other ascomycota species correlates with the number of interacting partners, which show a clear preference to be conserved as a pair (17).
We describe MPact, a manually annotated protein interaction database for yeast, as a reference for experimental and theoretical work to elucidate the characteristics of cellular protein interaction networks (3,5,7,9). The power of the manually curated dataset is illustrated using the network of proteins involved in signal transduction as an example. In addition, we describe a web-based tool that allows scientists to analyze user-defined PPI networks and thus to investigate protein subsets of interest. The resource can be accessed through http://mips.gsf.de/genre/proj/mpact.
The MIPS interaction information resource is divided into several physically separated independent databases. This approach was chosen to fulfill different requirements of diverse protein interaction projects at MIPS like the MPPI resource (15).
To avoid redundancy and possible inconsistencies, we focus on interaction-relevant information and retrieve additional information about the interaction partners from related databases. Therefore, we decided to implement the resource with a component-oriented approach. The MIPS GenRE (http://mips.gsf.de/genre/proj/genre) concept is built on linked but distributed components following the J2EE (http://java.sun.com/j2ee/) specification. The design principles of GenRE allow for seamless integration of different data sources and their representation as domain objects. The advantage of GenRE is its modularity: through a multi-tier architecture with separated layers, its components can become part of integrated distributed environments (Figure 1).
The core classes comply with a light-weight object-oriented data model able to map the minimal information about protein interactions (http://mips.gsf.de/genre/proj/mpact/info/about.html), in accordance with the PSI-MI standard for exchange of protein interaction data (18). PSI-MI specifies minimal requirements for the description of molecular interactions like confidence levels and information necessary for protein identification. Additionally, it provides controlled vocabularies for experiment types or the role of the interactor in the experiment (e.g. bait and prey). The classes are mapped within the integration layer using Hibernate, an object/relational mapping technology (http://hibernate.org/) and do not access the databases directly. Retrieval of the data is performed by data access objects using the Hibernate persistence mechanism. Supplementary information about the interaction partners, such as functional annotation or localization, is accessed with similar components already available in GenRE.
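The kind of light-weight data model described above can be pictured with a minimal sketch. The actual MPact core classes are Java components mapped with Hibernate and are not reproduced here; the Python class and field names below (`Evidence`, `Interaction`, `high_throughput`) are hypothetical, and the protein identifiers and PubMed IDs in the usage note are placeholders, not database content.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass(frozen=True)
class Evidence:
    """One annotated piece of experimental evidence (hypothetical model)."""
    method: str                    # experiment type, e.g. a controlled-vocabulary term
    pubmed_id: int                 # original source of the information
    high_throughput: bool = False  # corresponds to the 'htp' tag


@dataclass
class Interaction:
    """A binary interaction between two proteins with its evidence list."""
    protein_a: str                 # systematic name of the first partner
    protein_b: str                 # systematic name of the second partner
    kind: str                      # "physical" or "genetic"
    evidences: List[Evidence] = field(default_factory=list)

    def key(self) -> FrozenSet[str]:
        # Order-independent identifier: A-B and B-A are the same interaction.
        return frozenset((self.protein_a, self.protein_b))

    def is_manually_curated(self) -> bool:
        # True if at least one evidence does not carry the htp tag.
        return any(not ev.high_throughput for ev in self.evidences)
```

For example, `Interaction("P1", "P2", "physical", [Evidence("two hybrid", 1, True)])` describes an interaction supported only by high-throughput evidence, so `is_manually_curated()` returns `False` until a literature-derived evidence is appended.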
On top of the core classes we developed components located in the application tier for further processing. Data is wrapped into a generic XML format allowing HTML generation by XSL style sheet transformation for the presentation layer. The generic XML format contains all the interaction information, including protein and gene annotation from in-house databases. Furthermore, the relevant subset of this information can be compiled into PSI-MI XML documents.
We do not restrict access to internal applications but offer the same functionality for external access over the web. We therefore developed a HOBIT service layer (http://hobit.sf.net) based on web service technology to share MPact with the public in a programming-language-independent way. The MPact web service is accessible at http://mips.gsf.de/proj/hobitws/services/PsimiService?wsdl.
Although natural language analysis and text-mining approaches have been published (19,20), automatic extraction of information from scientific articles is still in its infancy and cannot yet compete with high-quality manual annotation. While many journals require authors to deposit sequence information for new proteins and genes in one of the publicly available sequence databases, other knowledge such as protein interactions, regulation, signaling, cellular location or function is rarely submitted to an appropriate database or in an appropriate notation and hence is effectively lost for systematic approaches. The MIPS group has acquired long-standing experience in protein and genome annotation, contributing to the protein database PIR-International as well as to the annotation of several genomes such as yeast and Arabidopsis thaliana (21–25). In order to annotate PPIs, relevant articles are selected from PubMed using text-mining tools and processed by a human expert. The collection of yeast PPI data was started in the context of the original effort to annotate the S.cerevisiae genome for the Comprehensive Yeast Genome Database (CYGD) (25). As a consequence, our protein interaction data are well integrated with other CYGD data such as protein descriptions, localization data and the functional classification of proteins using the FunCat annotation scheme (26). Since then, newly discovered PPIs have been added continuously. Moreover, CYGD continues to integrate data from high-throughput experiments.
The key information in the annotation of a PPI is the identity of the interacting partners, the kind of experiment and the original source of information (PubMed ID). For a standardized specification of experiments an evidence catalog exists. This is in line with the requirements for PSI-MI compliant annotation (see below). Based on this information it is possible to filter the data according to interaction type (physical/genetic) or to restrict the analysis to a certain type of experiment. As large-scale experiments have their unique strengths and weaknesses, and produce a significant fraction of false positives, it is important to distinguish these data from individual, manually extracted interactions described in the literature (3). We make a clear distinction in our data using the ‘high-throughput’ tag (htp), indicating that these interactions should be filtered first while browsing the data or performing in-depth analysis.
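The filtering options mentioned above can be illustrated with a short sketch. It assumes that each evidence is available as a plain dictionary with the hypothetical keys `kind`, `method` and `htp`; these names are chosen for illustration only and do not reflect the MPact schema.

```python
def filter_evidence(records, kind=None, method=None, exclude_htp=True):
    """Select evidence records by interaction type, method and htp tag.

    records     -- iterable of dicts with keys 'kind', 'method', 'htp'
    kind        -- "physical" or "genetic"; None keeps both types
    method      -- experiment type to keep; None keeps all methods
    exclude_htp -- drop records tagged as high-throughput when True
    """
    selected = []
    for rec in records:
        if kind is not None and rec["kind"] != kind:
            continue
        if method is not None and rec["method"] != method:
            continue
        if exclude_htp and rec["htp"]:
            continue
        selected.append(rec)
    return selected
```

Excluding high-throughput records is the default in this sketch because the text recommends filtering them first; passing `exclude_htp=False` keeps them.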
The reliability of any individual interaction described in the database increases with the number of annotated evidences. In MPact, the manually extracted data have on average 2.6 interactions per protein and are annotated with 1.2 evidences per interaction; 2.5 interactions are published per reference (Figure 2). In contrast to the lower quality of high-throughput datasets, highly reliable co-immunoprecipitation and affinity chromatography experiments are the major source of the extracted data (Table 1).
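Averages of this kind can be derived from a flat list of evidence records. The sketch below shows one plausible way to compute them; the exact counting rules behind the figures quoted above are not specified in the text, so the definitions used here (for instance, counting interactions per protein as the number of distinct partners) are assumptions, and the function name is hypothetical.

```python
from collections import defaultdict


def evidence_statistics(rows):
    """Compute average annotation statistics from evidence rows.

    rows -- iterable of (protein_a, protein_b, pubmed_id) tuples,
            one tuple per annotated evidence.
    """
    evidences_by_pair = defaultdict(int)   # unordered pair -> number of evidences
    pairs_by_reference = defaultdict(set)  # PubMed ID -> pairs reported there
    partners = defaultdict(set)            # protein -> distinct partners

    for a, b, pmid in rows:
        pair = frozenset((a, b))
        evidences_by_pair[pair] += 1
        pairs_by_reference[pmid].add(pair)
        partners[a].add(b)
        partners[b].add(a)

    if not evidences_by_pair:
        raise ValueError("no evidence rows given")

    return {
        "interactions_per_protein":
            sum(len(p) for p in partners.values()) / len(partners),
        "evidences_per_interaction":
            sum(evidences_by_pair.values()) / len(evidences_by_pair),
        "interactions_per_reference":
            sum(len(s) for s in pairs_by_reference.values()) / len(pairs_by_reference),
    }
```

A different convention, such as dividing the number of distinct interactions by the number of distinct proteins, would give different values for the first statistic.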
The database can be accessed through http://mips.gsf.de/genre/proj/mpact. Further details concerning the implementation are described in the method section.
Several types of predefined queries are available. ‘Query by Protein’ offers simple queries that search for interactions of individual proteins by their systematic name, gene name or aliases. Queries are not limited to single proteins; alternatively, proteins can be selected by attributes such as functional categories based on the MIPS FunCat (26), cellular localization and EC number.
More complex restrictions of the search space are possible using ‘Query by Interaction’. Several filters can be applied; searches between two distinct individual proteins or between two lists of proteins can be performed. The result set contains all interactions with at least one partner from each list. As in the ‘Query by Protein’ form, combinations of attributes are available. To account for the different strengths of certain interaction detection methods, the user can choose to display only interactions derived from a specific method based on the PSI-MI controlled vocabulary. To single out the manually extracted data as described above, high-throughput experiments can be excluded. Since MPact contains both physical and genetic interactions, we provide a separate exclusion option for each type. Finally, interactions described in a certain reference (PubMed ID) can be selected.
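The list-against-list search can be expressed compactly. The following is a minimal sketch of the selection rule stated above (at least one partner from each list), not the query code of the web interface; the function name and the representation of interactions as pairs of identifiers are assumptions.

```python
def query_by_interaction(interactions, list_a, list_b):
    """Return interactions that connect a protein in list_a with one in list_b.

    interactions -- iterable of (protein_1, protein_2) pairs
    list_a, list_b -- protein identifiers defining the two sides of the query
    """
    set_a, set_b = set(list_a), set(list_b)
    hits = []
    for p, q in interactions:
        # An interaction is undirected, so both orientations are checked.
        if (p in set_a and q in set_b) or (p in set_b and q in set_a):
            hits.append((p, q))
    return hits
```

A search between two individual proteins is the special case in which both lists contain a single identifier.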
Search results are presented as tables; depending on whether the short or the long output option is selected, different levels of detail are displayed. Short descriptions of the interaction participants are linked to the corresponding entries of the CYGD database. For convenient navigation through the interaction network, a link to the direct interactions of a specific participant is available. The long format additionally provides details such as the type of experimental evidence, PubMed references and a description of the interaction. MPact offers the possibility of extracting this result set in the standardized PSI-MI format.
Complementary to the tabular format, visualization of the interaction graph or its selected subgraphs is offered. Edges of the interaction graph are colored according to the number of evidences supporting the respective interaction. Additional information from CYGD including the functional annotation of the interacting proteins is included. Visualizations may be downloaded in PDF format for offline use and to allow enlargement of interesting regions even for very large graphs. As an example for the visualization and analysis of a complex cellular protein interaction network we focus on signal transduction processes.
Protein interactions involved in signaling pathways provide a suitable example to illustrate the information collected and structured by MPact. Figure 3 shows all proteins of S.cerevisiae that have been annotated as signal transduction proteins and their physical interactions. Our dataset contains evidence of physical interactions with at least one partner for 190 (204 physical and genetic) out of a total of 231 signaling proteins. The vast majority of signal transduction proteins are connected through one large network including 51 members. While the majority of proteins in the graph are connected to only one or two binding partners, a few nodes exhibit connections with a large number of other proteins, thus serving as signaling hubs in the interaction network. This characteristic feature of scale-free networks has been shown to be applicable to most known biological interactions (27). The complete network as well as the signal transduction network follows a power law distribution, indicating scale-free behavior. The overall picture in Figure 3 shows that signal transduction in yeast is a highly complex network in which regulatory proteins do not necessarily interact directly but are linked through different players.
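The network properties discussed above (number of binding partners per protein, hubs and the largest connected network) can be derived from a plain list of interactions. The sketch below is an illustration under stated assumptions, not the analysis used for Figure 3: the function names are hypothetical, self-interactions are ignored, and the code only tabulates the degree distribution without fitting a power law.

```python
from collections import Counter, defaultdict, deque


def build_graph(edges):
    """Build an undirected adjacency map from (protein_1, protein_2) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        if a != b:  # self-interactions do not add a binding partner
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def degree_distribution(graph):
    """Map each degree k to the number of proteins with k partners."""
    return Counter(len(neighbours) for neighbours in graph.values())


def largest_component(graph):
    """Return the node set of the largest connected sub-network."""
    seen, best = set(), set()
    for start in graph:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:  # breadth-first search from the start node
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        if len(component) > len(best):
            best = component
    return best
```

In a scale-free network most proteins fall into the low-degree bins of `degree_distribution`, while a few hubs populate the high-degree tail.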
To get a notion of the completeness of our interaction collection, we compared our dataset with signal transduction pathways documented in the KEGG pathway database (28). In KEGG, signal transduction in yeast is shown for the MAPK signaling pathway, two-component system, second messenger pathway and phosphatidylinositol signaling system.
The MAPK pathways are highly conserved signaling units present in all eukaryotes, where they play essential roles in the response to environmental signals, hormones, growth factors and cytokines. They control cell growth, morphogenesis, proliferation and stress response (29). Figure 3 shows that members of the MAPK pathway appear in the centre of the large network, which agrees with the pivotal role of the MAPK pathway in information transfer processes. In KEGG, 54 proteins are displayed in the MAPK pathway of S.cerevisiae. Of these proteins, 40 were annotated in CYGD as involved in signal transduction. Proteins that were assigned to signal transduction in KEGG but not in CYGD were found to be linked only peripherally to signal transduction processes. How to distinguish between core proteins of a biological process and others that are only associated with it is a general problem in the functional assignment of proteins. In KEGG, a total of 40 protein–protein relations are represented by single arrows or lines, which could theoretically also be found as PPIs in MPact. In fact, our dataset includes 27 PPIs that are also part of the KEGG dataset. Differences between KEGG and our manually annotated dataset can arise for different reasons.
The two-component system of yeast in KEGG consists of three proteins and their interactions. These are also part of the MAPK pathway and are completely represented in our dataset. The second messenger pathway and the phosphatidylinositol signaling system in KEGG are represented by 17 and 9 proteins, respectively; 14 and 6 of those are found in our dataset.
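The counts reported for the KEGG comparison can be converted into coverage percentages with simple arithmetic. The sketch below uses only the numbers quoted in the text; the dictionary layout and the function name are chosen for illustration. Note that the MAPK entry counts protein–protein relations, whereas the other entries count proteins.

```python
# (found in MPact, total in KEGG) as quoted in the text
KEGG_COMPARISON = {
    "MAPK signaling pathway (relations)": (27, 40),
    "two-component system (proteins)": (3, 3),
    "second messenger pathway (proteins)": (14, 17),
    "phosphatidylinositol signaling system (proteins)": (6, 9),
}


def coverage_percent(found, total):
    """Percentage of KEGG entries that are also present in the dataset."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * found / total


if __name__ == "__main__":
    for pathway, (found, total) in KEGG_COMPARISON.items():
        print(f"{pathway}: {found}/{total} = {coverage_percent(found, total):.1f}%")
```

With these figures, coverage ranges from about two-thirds (MAPK relations and phosphatidylinositol signaling) to complete coverage of the two-component system.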
Although physical interaction is not an obligatory condition in many signal transduction processes, a comparison with three important signaling cascades taken from the KEGG database revealed good coverage of the respective pathways by our physical interaction data.
A comprehensive resource on yeast protein interaction data was set up as a reference for comparative genomics and as a standard for other organisms such as human (15). To access the data, a convenient data structure as well as a public interface is available, allowing user-defined analysis of sub-networks and data retrieval in the standardized PSI-MI format. The data resource is interlinked with the CYGD database, enabling in-depth mapping and analysis employing functional classification or localization data. As the resource is continuously updated, its value for the community will steadily increase in the future.
We thank Louise Riley for critical reading of the manuscript, Gisela Fobo, Barbara Brauner, Goar Frishman, Corinna Montrone and Irmtraud Dunger for excellent annotation. This work was supported by a grant of the German Federal Ministry of Education and Research (BMBF) within the BFAM framework (031U112C/212C), the European Commission (QLRI-CT 1999-01333) and the Impuls- und Vernetzungsfonds der Helmholtz-Gemeinschaft Deutscher Forschungszentren eV. Funding to pay the Open Access publication charges for this article was provided by the GSF National Research Center for Environment and Health.
Conflict of interest statement. None declared.