The RIKEN integrated database of mammals (http://scinets.org/db/mammal) is the official undertaking to integrate its mammalian databases produced from multiple large-scale programs that have been promoted by the institute. The database integrates not only RIKEN’s original databases, such as FANTOM, the ENU mutagenesis program, the RIKEN Cerebellar Development Transcriptome Database and the Bioresource Database, but also imported data from public databases, such as Ensembl, MGI and biomedical ontologies. Our integrated database has been implemented on the infrastructure of publication medium for databases, termed SciNetS/SciNeS, or the Scientists’ Networking System, where the data and metadata are structured as a semantic web and are downloadable in various standardized formats. The top-level ontology-based implementation of mammal-related data directly integrates the representative knowledge and individual data records in existing databases to ensure advanced cross-database searches and reduced unevenness of the data management operations. Through the development of this database, we propose a novel methodology for the development of standardized comprehensive management of heterogeneous data sets in multiple databases to improve the sustainability, accessibility, utility and publicity of the data of biomedical information.
Securing the sustainability of databases is one of the most important issues for research institutes, funding agencies and research communities, because the accumulated cost of maintenance becomes a serious burden on the responsible institutes and communities (1). Moreover, the development of technology for biomedical analyses has brought about a dramatic increase in the amount and variety of data and information. The outdating of isolated data is also a serious problem. Association with the public data records broadly used in the research community is crucially important to improve the usability and accessibility of data. If data are isolated in the application software without updates from external data, then they will become increasingly difficult to retrieve by external retrieval systems and will become useless, unnecessarily occupying storage resources. By contrast, the integration of a datum with external data will generally increase its usability and value, often promoting unexpected uses and knowledge discovery. In the community of mammalian research, authoritative data are provided by the Mouse Genome Informatics Database (MGI), HUGO Gene Nomenclature Committee (HGNC) and Rat Genome Database (RGD) with nomenclature activities for genes, alleles and strains for each species (2–4). Data from the National Center for Biotechnology Information (NCBI) and Ensembl are also broadly used across species (5,6). The Open Biomedical Ontology (OBO) Consortium, an umbrella body for the developers of life-science ontologies, also provides ontologies developed with the aim of comprehensive annotation of biological information (7,8).
In the mouse genetics research community, these issues have been discussed by international consortia. The Mouse Phenotype Database Integration Consortium (InterPhenome) (http://www.interphenome.org/) and the Coordination and Sustainability of International Mouse Informatics Resources (CASIMIR) (http://www.casimir.org.uk/) have discussed broad issues regarding the integration, coordination, interoperability and sustainability of databases, such as methodologies to integrate phenotype information, the association of phenotype with human disease, models for the long-term and financial sustainability of databases and legal issues of data accessibility (9–11). A complete solution that satisfies these multiple and broad requirements at once is desired to ensure the sustainability of databases.
One effective way to reduce the management cost of databases is to share common fundamental infrastructures such as the hardware and application software used in their implementations. Recently, such common operations have been effectively implemented through ‘cloud computing’, which is a type of internet-based computing whereby shared resources, software and information are provided on demand. Cloud computing is often economically beneficial for the facility in terms of the running costs of space, electricity, cooling and staff support (12). If data are properly and continuously managed and integrated with the public data records that are regarded as the de facto standard in the biomedical community, then a common infrastructure could be one of the best ways to achieve cost effectiveness and advanced usability. On the other hand, the ‘semantic web’ offers a series of methods and technologies to develop extensions of the current World Wide Web (WWW) in which information is given well-defined meanings and integrated (13). These technologies include the Resource Description Framework (RDF), a variety of data interchange formats (e.g. RDF/XML, N3, Turtle and N-Triples), and notations such as the RDF Schema (RDFS) and the Web Ontology Language (OWL), all of which are intended to provide a formal description of concepts, terms and relationships within a given knowledge domain. The semantic web is regarded as an integrator across different content and information applications and systems and provides mechanisms for the realization of a common information system. It is also useful for the dissemination of data, because it provides a standardized framework for describing metadata, recommended by the World Wide Web Consortium (W3C), that aids the automated (and also manual) processing of disseminated data to derive meaning from the data. The dissemination of data with standardized metadata reduces the risk of the extinction of the data and creates the opportunity to promote the discovery of new knowledge.
Consequently, the semantic web seems to be suitable as a fundamental technology to implement the common infrastructure.
In this study, we developed a new database, the RIKEN integrated database of mammals, as an official undertaking in RIKEN to integrate heterogeneous mammal-related data in multiple individual databases. This database was constructed on the Scientists’ Networking System (SciNetS: http://www.riken.jp/engn/r-world/info/release/press/2009/090331_2/), a general fundamental system that applies the semantic web technology to provide massive data management, supported by Japan’s national database integration project. In this system, we achieved the top-level ontology-based re-organization of imported data to integrate the typical and instructive knowledge with individual data records. The RIKEN integrated database of mammals is complementary to the original databases. For example, the FANTOM web resource aims to present data on the dynamic behavior of transcription and its regulation in the expanding fields of the transcriptome, epigenome and transcriptional networks (14,15). By contrast, this integrated database attaches greater importance to the standardization of data for better distribution, metadata-level integration and cross-database retrieval.
In RIKEN, there are a number of databases related to mammalian research resources. In the primary development of the integrated database, we integrated six database projects: the Functional Annotation of the Mammalian Genome 4 (FANTOM 4: http://fantom.gsc.riken.jp) (14–16), the RIKEN Cerebellar Development Transcriptome Database (CDT-DB: http://www.cdtdb.brain.riken.jp/CDT/Top.jsp) (17,18), the resource database from the RIKEN BioResource Center (BRC) (19–21) including mutant resources produced by the ENU mutagenesis program (22,23) and the Resource of Asian Primary Immunodeficiency Diseases (RAPID) (24), the RIKEN Structural Genomics/Proteomics Initiative (RSGI) and two data repositories for the Reference Database of Immune Cells (RefDIC) (25) and the RIKEN Expression Array Database (READ) (26), all of which are produced from individual research projects in the human and mouse. Each database project has its original data schema to represent a variety of data ranging from research resources, such as biological strains, cell lines and DNA clones, to experimental data, such as gene expression and phenotypic analyses. There are no relationships defined among original data tables, which are described by various data formats such as text, images and movies. However, as is usual for most databases, they are compiled in a main data table to represent the objects of the database and related information (Table 1). In the discussions of InterPhenome and CASIMIR, it was recommended that the equivalences or relationships among records from the MGI database for genes and alleles, the International Mouse Strain Resource (IMSR) for experimental strain (27) and terms of OBO ontologies be specified. To show the association between the institute’s data and the public data broadly used in the research community, we constructed an association between RIKEN’s data and public data (Supplementary Table S1).
We have implemented the integrated database on the data-hosting system, SciNetS, which is a fully web-based common platform that ensures cloud computing in the scientific community on the basis of semantic web technologies (Figure 1). It has multiple features useful for data integration:
An overview of the implementation of this database is presented in Figure 2. The mammalian data and public data shown in Table 1 and Supplementary Table S1, respectively, were imported to SciNetS as individual database projects such that their original data schemas were reflected fully or partially. According to the forms of the original data sources, the databases were imported as three distinct types of projects implemented in SciNetS. First, in the database-type project, the original database elements are replicated: a database table is represented as a class and a data record as an instance. Second, the ontology-type project is a replication of an ontology with the OWL methodology. To import OBO ontologies, ontology files in the OWL format are downloaded from the OBO Foundry website (http://www.obofoundry.org/) and then imported directly into SciNetS. Third, in repository projects, the complete data from a database are stored as single or multiple files. As a result, 27 projects (17 database projects, nine ontology projects and one repository project) composed of 108396 classes and 777319 instances were defined as of September 2010. These projects are updated monthly on average from the constituent databases and ontologies.
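The database-type import described above (one table becomes one class, one record becomes one instance) can be sketched as follows. This is a minimal illustration only, not the SciNetS implementation: the names `ClassNode`, `Instance` and `import_table`, the table name and the example rows are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """One imported data record."""
    identifier: str
    properties: dict


@dataclass
class ClassNode:
    """One imported database table."""
    name: str
    instances: list = field(default_factory=list)


def import_table(table_name, rows, key):
    """Replicate a table as a class and each of its rows as an instance.

    `key` names the column used as the instance identifier; all other
    columns are kept as properties of the instance.
    """
    node = ClassNode(name=table_name)
    for row in rows:
        props = {k: v for k, v in row.items() if k != key}
        node.instances.append(Instance(identifier=row[key], properties=props))
    return node


# Hypothetical rows standing in for a main data table of a source database.
rows = [
    {"id": "record-1", "symbol": "geneA", "species": "mouse"},
    {"id": "record-2", "symbol": "geneB", "species": "mouse"},
]
gene_class = import_table("Example gene table", rows, key="id")
print(gene_class.name, len(gene_class.instances))
```

Keeping the original column names as property names is one way to reflect the source schema without interpreting it at import time; the semantic classification is applied in a later step.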
Then, we examined the contents and semantics (not the data format or syntax) of 41 classes of imported projects, which play the principal roles in each project. To ensure the consistent classification of the content, we used a top-middle level ontology, YAMATO-GXO Lite (http://scinets.org/item/rib23i/), which is a lightweight version of the middle-level ontology, Genetics Ontology (GXO) (30) (http://www.brc.riken.jp/lab/bpmp/ontology/ontology_gxo.html), to bridge between the experimental genetics domain and the latest top-level ontology, Yet Another More Advanced Top-level Ontology (YAMATO) (31) (http://www.ei.sanken.osaka-u.ac.jp/hozo/onto_library/upperOnto.htm). YAMATO-GXO Lite was developed with the ontology editor in SciNetS (paper in preparation). As a result, the 41 classes conveying the key information from each project are classified under the nine upper classes as follows: ‘Genome segment and gene in mammal’, ‘Allele in mammal’, ‘Transcript in mammal’, ‘Protein in mammal’, ‘Strain resource in mammal’, ‘Cell line resource in mammal’, ‘Disease’, ‘Experimental data with mammalian sample’ and ‘Mammalian Orthologous group’ (Figure 3). The RIKEN Integrated Database of Mammals is implemented as a project that defines these classes as roots (http://SciNetS.org/db/mammal). The ontology-based classification of contents was embodied with rdfs:subClassOf links, which can be applied across multiple projects in SciNetS. To integrate databases that span multiple species, we applied the ‘query-class’, which dynamically refers only to specific instances from another class. For example, the diffraction data class in the SSBC project includes the diffraction data from mammal and non-mammal proteins. To extract only mammal data, we implemented the query-class, which is an expanded use of the owl:oneOf element to define a class by enumerating its elements.
With these operations, the project for the integrated database works as the bridge to connect the YAMATO-GXO Lite and imported projects, in which the imported classes are defined as lower concepts of the top-level ontology as shown in Figure 2.
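The two mechanisms above, rdfs:subClassOf links across projects and a query-class that enumerates only the matching instances of another class, can be illustrated with a small sketch. It is written under stated assumptions: the dictionary-based class tree, the function names, the record identifiers and the `organism_group` field are hypothetical and do not reflect the internal SciNetS data model.

```python
# Child class -> parent class, standing in for rdfs:subClassOf links.
# 'MGI gene' and 'MGI allele' are imported classes named in the text;
# the parents are two of the upper classes of the integrated database.
SUBCLASS_OF = {
    "MGI gene": "Genome segment and gene in mammal",
    "MGI allele": "Allele in mammal",
}


def ancestors(cls, subclass_of):
    """Return all upper classes of `cls`, nearest first."""
    found = []
    while cls in subclass_of:
        cls = subclass_of[cls]
        if cls in found:  # guard against an accidental cycle
            break
        found.append(cls)
    return found


def query_class(instances, predicate):
    """Define a class by enumerating the instances that satisfy a query.

    This mirrors the idea of owl:oneOf, where a class is given by an
    explicit list of members; here the list is computed from a predicate.
    """
    return frozenset(rec["id"] for rec in instances if predicate(rec))


# Hypothetical diffraction records from mammal and non-mammal proteins.
diffraction_data = [
    {"id": "diff-1", "organism_group": "mammal"},
    {"id": "diff-2", "organism_group": "bacteria"},
    {"id": "diff-3", "organism_group": "mammal"},
]
mammal_diffraction = query_class(
    diffraction_data, lambda rec: rec["organism_group"] == "mammal"
)
print(ancestors("MGI gene", SUBCLASS_OF))
print(sorted(mammal_diffraction))
```

Because the query-class is computed from the source class, re-running the query after an update of the source project would refresh its membership without editing the enumeration by hand.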
In the next step, to ensure further semantic integration of the imported data, we examined the equivalencies of property links (semantic links) between the upper ontology and lower classes in imported projects. For example, the ‘Allele’ class in YAMATO-GXO Lite has a property named ‘variant_of’ whose range is the ‘Genome segment’ class. It is the logical representation of one of the features of an allele, namely that an ‘allele is a variant of a genome segment’. Examination of the properties in lower classes reveals that the ‘MGI allele’ class has an ‘MGI gene’ property whose range is the ‘MGI gene’ class, and that this property is equivalent to ‘variant_of’. Consequently, we defined the ‘MGI gene’ property as a specialized type of (rdfs:subPropertyOf) ‘variant_of’ to show that Gdf5Rgsc451, an instance of the MGI allele class, is a variant of Gdf5, an instance of the MGI gene class. With this equivalence mapping of properties between YAMATO-GXO Lite and lower imported database classes, we built the ontology-based information structure so that information defined in the upper classes is instantiated in lower database classes and instances.
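The effect of declaring the ‘MGI gene’ property as an rdfs:subPropertyOf ‘variant_of’ can be shown with a short sketch of the standard entailment: every statement made with the sub-property also holds for the super-property. The triple representation and the function name below are assumptions made for illustration; only the class, property and instance names come from the text.

```python
# Sub-property -> super-property, standing in for rdfs:subPropertyOf.
SUBPROPERTY_OF = {"MGI gene": "variant_of"}

# (subject, property, object) statements as imported from the source database.
triples = {("Gdf5Rgsc451", "MGI gene", "Gdf5")}


def infer_superproperties(triples, subproperty_of):
    """Add (s, super, o) for every (s, p, o) where p is a sub-property of super.

    The chain of super-properties is followed transitively.
    """
    inferred = set(triples)
    for subject, prop, obj in triples:
        seen = {prop}
        while prop in subproperty_of and subproperty_of[prop] not in seen:
            prop = subproperty_of[prop]
            seen.add(prop)
            inferred.add((subject, prop, obj))
    return inferred


closure = infer_superproperties(triples, SUBPROPERTY_OF)
for triple in sorted(closure):
    print(triple)
```

With this mapping in place, a query phrased only in terms of the upper-ontology property ‘variant_of’ can reach records that the source database stores under its own property name.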
In addition, on importing external and internal data records, we found multiple overlapping records (instances) that represent a single identical entity in the real world (e.g. instances of a gene in the Ensembl, MGI and FANTOM projects). We therefore examined such equality between instances in lower classes that belong to a single upper class, and related identical data items with a semantic link that is equivalent to owl:sameAs.
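Because owl:sameAs is symmetric and transitive, pairwise links between records in different projects imply groups of records that denote one entity. A minimal sketch of this grouping with a union-find structure follows; the record identifiers are hypothetical placeholders, and the approach is a generic technique rather than the method used in SciNetS.

```python
def same_as_clusters(links):
    """Group identifiers connected by sameAs-like links into clusters."""
    parent = {}

    def find(x):
        # Locate the representative of x, halving the path on the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical identifiers for the same genes in three imported projects.
links = [
    ("mgi:gene-A", "ensembl:gene-A"),
    ("ensembl:gene-A", "fantom:gene-A"),
    ("mgi:gene-B", "ensembl:gene-B"),
]
clusters = same_as_clusters(links)
for cluster in clusters:
    print(cluster)
```

Note that the MGI and FANTOM records for gene A end up in one cluster even though no direct link between them was given.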
On the top page of this integrated database, users can view all the classes of the integrated databases and their data sizes, which are shown in Supplementary Table S2. An overview of the data structure is presented on the ‘data folder’ page, where users can navigate down the class hierarchy across database or ontology projects by clicking on the folder icons that represent classes. On the page of each project, detailed explanations of the project and URL links to the original database websites are shown. On the class and instance pages, detailed explanations, a table view of instances, a graphic representation of semantic links and links to original data records are displayed (Figure 3).
SciNetS implements two kinds of search function, the internal-search and the cross-search. When users search with the ‘Search’ button, SciNetS executes an internal-search to retrieve matches within the current project and related projects and shows the number of hits on each folder icon of the current page (Figure 4). To run a cross-search over all SciNetS data, users can use the ‘Search All’ button. SciNetS returns search results in descending order according to traffic. From this cross-search, users can jump to the Positional Medline (PosMed) search, which allows the user to retrieve various information (e.g. genes, phenotypes or diseases) correlated with a genomic position through a full-document search of various contents: scientific literature, genome annotations, phenome information, protein–protein interactions, co-expression data, orthologous genes, drugs and metabolite information (32,33). These search functions are implemented by our original database search engine GRASE (28). Data in this database are downloadable from the ‘Download’ links of each project, with licenses specified via CC or GNU. SciNetS provides several standard formats, such as RDF, OWL or tab-delimited files.
The RIKEN integrated database of mammals is, to our knowledge, the first practical database to perform the direct integration of a top-level ontology, domain-specific ontologies and existing databases. Although there is much room for improvement, this database represents a simple and practical methodology to generate a consistent and scalable body of information that is interoperable with the global informational whole based on semantic web technology. In the process of the integration, we investigated the data schema of each database and classified their contents based on the top-level ontology. These operations are comparable to the ‘annotation’ of databases.
Currently, the main knowledge framework is provided by a top-middle level ontology, YAMATO-GXO Lite. During the development of this ontology, it was optimized to allow the integration of multiple biological databases used by the mammalian genetics community. For example, the basic definition of mammalian genes is provided by the Mouse Genomic Nomenclature Committee (MGNC), which is suitable for data management of genome information. It defines a gene as ‘a functional unit, usually encoding a protein or RNA, whose inheritance can be followed experimentally’; also, ‘a gene symbol should be unique within the species’. This definition is faithfully represented in the MGI database because each gene record is stored in the genome segment (phrased as ‘genetic marker’ in MGI) database as a subset (or a subclass) that has a biological function and is unique in the mouse genome. An allele is defined as a variant form of a genome segment, which is usually unique in its sequence. Here, we should mention that there are at least two ways to conceptualize genome segments and alleles. One attaches greater importance to the instantiation toward a molecule. Such a classification may be performed in the BioTop top-level ontology (34). The other conceptualizes gene and allele as classes and allows them to have their own instances such as Gdf5 and Gdf5Rgsc451. YAMATO-GXO Lite adopts the latter because it is useful for integrating databases. A gene is a subclass of the genome segment that has a biological function. An allele is defined as a separate class whose instances are unique units of information, each identified with its nucleotide sequence.
The consistent knowledge framework contributes to metadata-based and cross-database retrieval by allowing easy and clear specification of the range of the search object. Such retrieval was previously only available for individual databases. For example, to search for ‘the mouse genome segment that has a variant with a point mutation’, a cross-database retrieval is usually performed with a combination of the text strings ‘genome segment’, ‘mouse’ and ‘point mutation’. Such a search never indicates the range of the search resource, ‘genome segment of mouse’, which is a subclass of genome segments of mammals. Furthermore, the range must be clearly distinguished from the mouse allele, which is the entity that has the point mutation. In this database, the nine upper classes and the lower class-tree are explicitly defined to represent the range of resources and the organization of metadata. Therefore, the knowledge framework enables the retrieval of specific resources, such as ‘genome segment of mouse’, that are related to the text ‘point mutation’ (which may be described in the instance of an allele) using query languages such as SPARQL or GRASQL. On the GUI of this database, simple GRASQL-based searches are implemented as simple text searches, as described above.
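The difference between a plain text search and a class-scoped retrieval can be made concrete with a small sketch. It restricts the result to instances under ‘Genome segment of mouse’ while matching the text ‘point mutation’ on the related allele instances, following the ‘variant_of’ link. The records, descriptions and function names are hypothetical, and the sketch uses plain Python rather than SPARQL or GRASQL.

```python
# Child class -> parent class (rdfs:subClassOf-style links).
SUBCLASS_OF = {
    "Genome segment of mouse": "Genome segment and gene in mammal",
    "MGI gene": "Genome segment of mouse",
    "MGI allele": "Allele in mammal",
}

# Hypothetical instances; descriptions are invented for illustration.
INSTANCES = {
    "geneA": {"class": "MGI gene"},
    "geneB": {"class": "MGI gene"},
    "alleleA1": {"class": "MGI allele", "variant_of": "geneA",
                 "description": "chemically induced point mutation"},
    "alleleB1": {"class": "MGI allele", "variant_of": "geneB",
                 "description": "targeted deletion"},
}


def is_a(cls, target, subclass_of):
    """True if `cls` equals `target` or is one of its lower classes."""
    visited = set()
    while cls is not None and cls not in visited:
        if cls == target:
            return True
        visited.add(cls)
        cls = subclass_of.get(cls)
    return False


def segments_with_variant_text(instances, subclass_of, segment_class, text):
    """Find instances of `segment_class` that have a variant mentioning `text`."""
    hits = set()
    for record in instances.values():
        target = record.get("variant_of")
        if target is None or text not in record.get("description", ""):
            continue
        if target in instances and is_a(
            instances[target]["class"], segment_class, subclass_of
        ):
            hits.add(target)
    return sorted(hits)


result = segments_with_variant_text(
    INSTANCES, SUBCLASS_OF, "Genome segment of mouse", "point mutation"
)
print(result)
```

The text match is evaluated on the allele, but the returned resource is the genome segment, which is the distinction a keyword-only search cannot express.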
The knowledge framework also contributes to ensuring the cost-effective sustainability and updating of data. In the implementation of SciNetS, the common body for data integration, the continuous maintenance and management of data are essential. These operations differ with respect to not only the formalism of the data but also the contents of each database. The consistently integrated data, which represent classification and inheritance between property links, reveal a content-oriented standardization of the formalism of data items. We are now developing content-oriented procedures for data maintenance specified for data contents such as gene, allele and strain. The standardized data formulation provided by top- and middle-level ontologies reduces the labor cost of data management through the reduction of unevenness in the operations of individual databases. Thus, the ‘annotation’ of databases helps to design content-oriented common user interfaces and data management procedures for imported databases, which had been independently developed in different research projects.
Another advantage of the data integration on SciNetS is that continuous improvements and enhancements are ensured by the data tracking system used to integrate newly added projects. We are planning to incorporate other mammal-related databases in RIKEN to disseminate them to broad communities. Public data are also incorporated to provide higher usability by establishing relationships among data. For example, we still do not ensure fully functional cross-species integration of anatomies and phenotypes, which are provided as species-specific ontologies. To solve this problem, we need equivalence mapping of homologous organs/tissues and phenotypes. Some ontology developers are working on this issue to establish relationships between the Mammalian Phenotype ontology (MP) (35) and the Human Phenotype Ontology (HPO) (36,37) mediated by the Phenotypic Quality Ontology (PATO) (38–44). The implementation of such equivalence information in the integrated database will greatly improve the utility of phenotype data by providing cross-mapping information with diseases. Furthermore, we are also integrating plant omics data using SciNetS with a similar methodology (K. Doi et al. manuscript in preparation). Referring to the same top-level ontology, we are planning to integrate the mammalian database with the plant one. One of the merits of institute-oriented data integration is the promotion of data integration across phylogenetically distant species, because the species- or community-oriented integration of plant and mammal information is often difficult.
We will continue the development of this database to enhance the data, retrieval functions and semantics as described above. In addition, we are also planning to incorporate other top-middle level ontologies beyond YAMATO-GXO Lite, such as the Basic Formal Ontology (BFO) (45), the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE) (46), BioTop and the Ontology for Biomedical Investigations (OBI). In YAMATO, the interoperability among these top-level ontologies represents a general model to explain differentiation and interrelationships among classes (31). With this enhancement, we will cooperate with the global efforts of the OBO Foundry, the initiative of the OBO consortium that aims to coordinate the scientific methods in ontology development toward forming a consistent, cumulatively expanding and algorithmically tractable whole (7) based on the BFO as the semantic framework.
Supplementary Data are available at NAR Online.
Maintenance of SciNetS is supported by the Integrated Database Project of the Ministry of Education, Culture, Sports, Science and Technology (MEXT).
Conflict of interest statement. None declared.
The authors thank Drs Kaoru Saijyo, Kazuyuki Mekada and Hatsumi Nakata of RIKEN BRC for their help with data import from the Resource database to SciNetS.