Genome information is traditionally stored in databases containing entries as instances of predefined rigid data structures (i.e. formats). However, the development of concepts to cope with complex data structures within a database becomes practically unmanageable as soon as several independent data sources have to be covered. Therefore information pointing to the same biological objects is distributed over a large number of independent and often syntactically incompatible databases (e.g. nucleic acids, proteins, protein interactions, metabolic and regulatory networks and the like). While passive integration of these databases is feasible through database indexing and integration of flat files or web resources [e.g. PubMed (7)], it does not allow for any semantic integration required for comprehensive annotation purposes.
Table 1. URL addresses for MIPS database resources
The Munich Information Center for Protein Sequences (MIPS) Genome Research Environment System (GenRE) provides a flexible technology to cope with the needs of biological data representation. It is a J2EE-based multi-tier architecture, implemented with established software design patterns. Seamless integration of distributed information resources (databases and applications) is realized with Enterprise Java Beans (EJBs) capable of retrieving information in XML format for straightforward web publishing, including expression-based queries similar to those of PubMed.
Internally, GenRE is based on three different types of objects and components. Components of the first type are responsible for access to applications and databases. These EJBs provide a uniform interface while hiding the access mechanisms that depend on the data resource in the data integration tier. Databases, for example, are typically accessed via Hibernate, an object-relational mapping tool, whereas applications are often accessed directly. Input and output are commonly XML documents and data objects. Data objects, which represent the second type within GenRE, abstract biological entities such as genes, proteins or even complexes at a semantic level. In this layered approach, the data object level is unambiguously separated from the underlying data sources. These objects are used for semantic integration in a third type of component. Components of this third type are realized as EJBs and are responsible for any further information processing. This allows the association of any biochemical entities (e.g. RNA, drugs, etc.) either with an entity describing binary relationships (e.g. protein interactions from yeast two-hybrid experiments) or with many-to-many relationships [e.g. functional assignments using the MIPS FunCat (2)].
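As an illustration only, the three component types described above might be sketched in plain Java as follows. All class and method names (DataSource, BioEntity, InMemorySource, IntegrationService) are assumptions made for this sketch and are not taken from the GenRE code base; an in-memory map stands in for Hibernate-backed access, and plain classes stand in for EJBs.

```java
import java.util.*;

// Type 1: access component with a uniform interface.
// (An EJB in GenRE; a plain interface here. All names are hypothetical.)
interface DataSource<T> {
    Optional<T> findById(String id);
}

// Type 2: data object abstracting a biological entity, independent of its source.
final class BioEntity {
    final String id;
    final String kind; // e.g. "protein", "RNA", "drug"

    BioEntity(String id, String kind) {
        this.id = id;
        this.kind = kind;
    }
}

// In-memory stand-in for a database- or application-backed source.
final class InMemorySource implements DataSource<BioEntity> {
    private final Map<String, BioEntity> rows = new HashMap<>();

    void put(BioEntity e) { rows.put(e.id, e); }

    @Override
    public Optional<BioEntity> findById(String id) {
        return Optional.ofNullable(rows.get(id));
    }
}

// Type 3: integration component that relates data objects to each other.
final class IntegrationService {
    private final Map<String, Set<String>> partners = new HashMap<>();
    private final Map<String, Set<String>> categoriesByEntity = new HashMap<>();
    private final Map<String, Set<String>> entitiesByCategory = new HashMap<>();

    // Binary relationship, e.g. a two-hybrid interaction (stored symmetrically).
    void relate(BioEntity a, BioEntity b) {
        partners.computeIfAbsent(a.id, k -> new HashSet<>()).add(b.id);
        partners.computeIfAbsent(b.id, k -> new HashSet<>()).add(a.id);
    }

    // Many-to-many relationship, e.g. a functional category assignment.
    void annotate(BioEntity e, String category) {
        categoriesByEntity.computeIfAbsent(e.id, k -> new HashSet<>()).add(category);
        entitiesByCategory.computeIfAbsent(category, k -> new HashSet<>()).add(e.id);
    }

    Set<String> partnersOf(String id) {
        return partners.getOrDefault(id, Collections.emptySet());
    }

    Set<String> categoriesOf(String id) {
        return categoriesByEntity.getOrDefault(id, Collections.emptySet());
    }

    Set<String> entitiesIn(String category) {
        return entitiesByCategory.getOrDefault(category, Collections.emptySet());
    }
}
```

In this sketch the integration component only ever sees data objects, never the source they came from, which mirrors the separation between the data object level and the underlying data sources.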
Hence GenRE not only allows for the flexible creation of the different object types needed to include various types of ‘omics’ data, but is also capable of incorporating relations between instances from different data sources. Even complex data models suitable for handling biological networks together with functional annotation of the distinct nodes are realized. In combination with integrated applications like SIMAP, components for comparative proteomics can be realized.
The MIPS protein–protein interaction resource (MPact) (8) illustrates the advantage of our approach in extending the single-protein view into a network perspective. The data model allows extensions for interactions of proteins with other biochemical entities (e.g. RNA, drugs, etc.). Interacting objects can be associated not only with each other, to represent complexes for example, but also with external information describing the corresponding experiments (e.g. yeast two-hybrid, co-immunoprecipitation or mass spectrometry data). In the same way, information about the evidence of interactions and various protein annotations such as functions, motifs and cellular localization is associated with the interacting object. Owing to the object-oriented approach, any instance of the interacting object (notably a protein) can furthermore be associated with the corresponding entry, e.g. in a genome database.
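A minimal sketch of such a data model is given below, purely for illustration. The class names (Interactor, Evidence, Interaction), their fields and the free-text labels are assumptions of this sketch and do not reproduce the MPact schema.

```java
import java.util.*;

// Any biochemical entity that can take part in an interaction.
// (All names in this sketch are hypothetical.)
final class Interactor {
    final String id;
    final String type;          // e.g. "protein", "RNA", "drug"
    final String genomeEntryId; // cross-reference to a genome database entry; may be null
    final Map<String, Set<String>> annotations = new HashMap<>();

    Interactor(String id, String type, String genomeEntryId) {
        this.id = id;
        this.type = type;
        this.genomeEntryId = genomeEntryId;
    }

    // Attach an annotation such as ("function", ...) or ("localization", ...).
    Interactor annotate(String kind, String value) {
        annotations.computeIfAbsent(kind, k -> new TreeSet<>()).add(value);
        return this;
    }
}

// Experimental evidence attached to an interaction.
final class Evidence {
    final String method;    // e.g. "yeast two-hybrid"
    final String reference; // free-text label of the source

    Evidence(String method, String reference) {
        this.method = method;
        this.reference = reference;
    }
}

// Two participants model a binary interaction; more than two model a complex.
final class Interaction {
    final List<Interactor> participants;
    final List<Evidence> evidence = new ArrayList<>();

    Interaction(Interactor... participants) {
        if (participants.length < 2) {
            throw new IllegalArgumentException("need at least two participants");
        }
        this.participants =
            Collections.unmodifiableList(new ArrayList<>(Arrays.asList(participants)));
    }

    Interaction support(Evidence e) {
        evidence.add(e);
        return this;
    }

    boolean isComplex() { return participants.size() > 2; }

    boolean hasMethod(String method) {
        for (Evidence e : evidence) {
            if (e.method.equals(method)) return true;
        }
        return false;
    }
}
```

Keeping the annotations and the genome cross-reference on the interacting object, and the evidence on the interaction, is one way to let both be queried independently.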
Our implementation allows two different approaches to querying the repository. On the one hand, a gene-centric or protein-centric query is possible, in which distinct interacting objects can be retrieved within a specific context. Since the proteins and interactions are ‘decorated’ with annotation information, it is possible to query for specific attributes of the proteins (e.g. functions) or of the interactions (e.g. evidence). On the other hand, network-centric queries can be performed. It is possible to query both for the nodes (the proteins) and for the edges (the interactions) of the graph. Based on functional annotation, a traversal of the network graph is possible. This traversal can be used to quickly scan the network for false-positive interactions between proteins whose functions and/or localizations differ completely, or to assign new functions to proteins without functions that interact specifically within a certain functional context. Furthermore, extraction of sub-networks based on any associated context (function, localization, experiment) is possible. Notably, our approach enables seamless context-dependent views ranging from single genes to complete networks. MIPS databases are implemented in the GenRE environment.
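The two query styles can be sketched on a small annotated graph, as below. This is an illustrative sketch under stated assumptions, not the GenRE query interface: the names ProteinNode, InteractionEdge and InteractionNetwork are hypothetical, and "differ completely" is interpreted here as two functionally annotated proteins sharing neither a function nor a localization.

```java
import java.util.*;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Node of the graph: a protein decorated with annotations (hypothetical names).
final class ProteinNode {
    final String id;
    final Set<String> functions;
    final Set<String> localizations;

    ProteinNode(String id, Collection<String> functions, Collection<String> localizations) {
        this.id = id;
        this.functions = new TreeSet<>(functions);
        this.localizations = new TreeSet<>(localizations);
    }
}

// Edge of the graph: an interaction decorated with its experimental evidence.
final class InteractionEdge {
    final ProteinNode a;
    final ProteinNode b;
    final String evidence;

    InteractionEdge(ProteinNode a, ProteinNode b, String evidence) {
        this.a = a;
        this.b = b;
        this.evidence = evidence;
    }
}

final class InteractionNetwork {
    private final List<InteractionEdge> edges = new ArrayList<>();

    void add(InteractionEdge e) { edges.add(e); }

    int size() { return edges.size(); }

    // Protein-centric query: partners of one protein, filtered by an edge attribute.
    List<ProteinNode> partners(String id, Predicate<InteractionEdge> filter) {
        List<ProteinNode> out = new ArrayList<>();
        for (InteractionEdge e : edges) {
            if (!filter.test(e)) continue;
            if (e.a.id.equals(id)) out.add(e.b);
            else if (e.b.id.equals(id)) out.add(e.a);
        }
        return out;
    }

    // Network-centric scan: edges between functionally annotated proteins
    // that share neither a function nor a localization.
    List<InteractionEdge> suspectEdges() {
        return edges.stream()
            .filter(e -> !e.a.functions.isEmpty() && !e.b.functions.isEmpty())
            .filter(e -> Collections.disjoint(e.a.functions, e.b.functions)
                      && Collections.disjoint(e.a.localizations, e.b.localizations))
            .collect(Collectors.toList());
    }

    // Sub-network extraction based on any associated context.
    InteractionNetwork subNetwork(Predicate<InteractionEdge> context) {
        InteractionNetwork sub = new InteractionNetwork();
        edges.stream().filter(context).forEach(sub::add);
        return sub;
    }

    // Candidate functions for a protein: how often each function occurs among its partners.
    Map<String, Integer> neighbourFunctionCounts(String id) {
        Map<String, Integer> counts = new TreeMap<>();
        for (ProteinNode n : partners(id, e -> true)) {
            for (String f : n.functions) counts.merge(f, 1, Integer::sum);
        }
        return counts;
    }
}
```

Expressing every context as a predicate on the edge keeps function, localization and experiment filters interchangeable; a production system would more plausibly push such filters down to the database rather than scan an in-memory list.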