This paper describes how DISCO, the data aggregator that supports the Neuroscience Information Framework (NIF), has been extended to play a central role in automating the complex workflow required to support and coordinate the NIF’s data integration capabilities. The NIF is an NIH Neuroscience Blueprint initiative designed to help researchers access the wealth of data related to the neurosciences available via the Internet. A central component is the NIF Federation, a searchable database that currently contains data from 231 data and information resources regularly harvested, updated, and warehoused in the DISCO system. In the past several years, DISCO has greatly extended its functionality and has evolved to play a central role in automating the complex, ongoing process of harvesting, validating, integrating, and displaying neuroscience data from a growing set of participating resources. This paper provides an overview of DISCO’s current capabilities and discusses a number of the challenges and future directions related to the process of coordinating the integration of neuroscience data within the NIF Federation.
data integration; database federation; database interoperation; neuroinformatics; biomedical informatics
This paper describes the capabilities of DISCO, an extensible approach that supports integrative Web-based information dissemination. DISCO is a component of the Neuroscience Information Framework (NIF), an NIH Neuroscience Blueprint initiative that facilitates integrated access to diverse neuroscience resources via the Internet. DISCO facilitates the automated maintenance of several distinct capabilities using a collection of files 1) that are maintained locally by the developers of participating neuroscience resources and 2) that are “harvested” on a regular basis by a central DISCO server. This approach allows central NIF capabilities to be updated as each resource’s content changes over time. DISCO currently supports the following capabilities: 1) resource descriptions, 2) “LinkOut” to a resource’s data items from NCBI Entrez resources such as PubMed, 3) Web-based interoperation with a resource, 4) sharing a resource’s lexicon and ontology, 5) sharing a resource’s database schema, and 6) participation by the resource in neuroscience-related RSS news dissemination. The developers of a resource are free to choose which DISCO capabilities their resource will participate in. Although DISCO is used by NIF to facilitate neuroscience data integration, its capabilities have general applicability to other areas of research.
Data integration; database federation; database interoperation; neuroinformatics
The amount of biomedical data available in Semantic Web formats has been rapidly growing in recent years. While these formats are machine-friendly, user-friendly web interfaces allowing easy querying of these data are typically lacking. We present “Entrez Neuron”, a pilot neuron-centric interface that allows for keyword-based queries against a coherent repository of OWL ontologies. These ontologies describe neuronal structures, physiology, mathematical models and microscopy images. The returned query results are organized hierarchically according to brain architecture. Where possible, the application makes use of entities from the Open Biomedical Ontologies (OBO) and the ‘HCLS knowledgebase’ developed by the W3C Interest Group for Health Care and Life Science. It makes use of the emerging RDFa standard to embed ontology fragments and semantic annotations within its HTML-based user interface. The application and underlying ontologies demonstrate how Semantic Web technologies can be used for information integration within a curated information repository and between curated information repositories. They also demonstrate how information integration can be accomplished on the client side, through simple copying and pasting of portions of documents that contain RDFa markup.
Semantic Web; Neuroscience; OWL; Web User Interface; Reasoning; RDFa; Information integration
Integrative neuroscience research needs a scalable informatics framework that enables semantic integration of diverse types of neuroscience data. This paper describes the use of the Web Ontology Language (OWL) and other Semantic Web technologies for the representation and integration of molecular-level data provided by several of the SenseLab suite of neuroscience databases.
Based on the original database structure, we semi-automatically translated the databases into OWL ontologies with manual addition of semantic enrichment. The SenseLab ontologies are extensively linked to other biomedical Semantic Web resources, including the Subcellular Anatomy Ontology, Brain Architecture Management System, the Gene Ontology, BIRNLex and UniProt. The SenseLab ontologies have also been mapped to the Basic Formal Ontology and Relation Ontology, which helps ease interoperability with many other existing and future biomedical ontologies for the Semantic Web. In addition, approaches to representing contradictory research statements are described. The SenseLab ontologies are designed for use on the Semantic Web, which enables their integration into a growing collection of biomedical information resources.
We demonstrate that our approach can yield significant potential benefits and that the Semantic Web is rapidly becoming mature enough to realize its anticipated promises. The ontologies are available online at http://neuroweb.med.yale.edu/senselab/
Semantic Web; neuroscience; description logic; ontology mapping; Web Ontology Language; integration
Maintaining a large controlled biomedical vocabulary requires ensuring the content's internal consistency. This is done through rules, specified by the vocabulary's curators, which denote how the vocabulary's concepts should be defined. When individual organizations deploy such vocabularies, local concepts are typically added and linked to concepts in the main vocabulary: the process of maintaining and linking local content should follow the same rules. The operation of content-maintenance software can be facilitated by maintaining such rules in computable form. In this paper, we demonstrate how to implement computable rules for attribute usage in SNOMED CT using a table-driven approach where a given rule is expressed as one or more rows in a table and is consulted by generic code. This approach, which is tailored to database implementations, is computationally efficient and allows new attribute-definition rules to be created as data while needing minimal or no code modification.
SNOMED CT; description logic; table-driven methods
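As a minimal sketch of the table-driven idea described in the abstract above, the following Python fragment stores attribute-usage rules as rows and checks them with one generic function. The rule rows, the tiny concept hierarchy, and the names `RULES`, `PARENT`, `ancestors`, and `is_allowed` are hypothetical illustrations; they are not actual SNOMED CT content and not the authors' database implementation.

```python
# Hypothetical rule table: each row states that concepts under a domain
# hierarchy may use an attribute whose value lies under a range hierarchy.
RULES = [
    # (domain hierarchy, attribute, range hierarchy)
    ("Clinical finding", "Finding site", "Body structure"),
    ("Clinical finding", "Causative agent", "Organism"),
    ("Procedure", "Procedure site", "Body structure"),
]

# Hypothetical concept -> parent map (single inheritance, for brevity only).
PARENT = {
    "Pneumonia": "Clinical finding",
    "Appendectomy": "Procedure",
    "Lung structure": "Body structure",
    "Streptococcus": "Organism",
}


def ancestors(concept):
    """Return the concept together with all of its ancestors."""
    found = {concept}
    while concept in PARENT:
        concept = PARENT[concept]
        found.add(concept)
    return found


def is_allowed(concept, attribute, value):
    """Generic check: consult the rule rows instead of hard-coding each rule."""
    concept_lineage = ancestors(concept)
    value_lineage = ancestors(value)
    return any(
        domain in concept_lineage and attr == attribute and rng in value_lineage
        for domain, attr, rng in RULES
    )
```

Under these assumptions, a new attribute-definition rule is added by appending a row to `RULES`; `is_allowed` itself does not change, which is the property the abstract attributes to the table-driven approach.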
As the number of neuroscience databases increases, the need for neuroscience data integration grows. This paper reviews and compares several approaches, including the Neuroscience Database Gateway (NDG), Neuroscience Information Framework (NIF) and Entrez Neuron, which enable neuroscience database annotation and integration. These approaches cover a range of activities spanning the registry, discovery and integration of a wide variety of neuroscience data sources. They also provide different user interfaces for browsing, querying and displaying query results. In Entrez Neuron, for example, four different facets or tree views (neuron, neuronal property, gene and drug) are used to hierarchically organize concepts that can be used for querying a collection of ontologies. The facets are also used to define the structure of the query results.
data integration; neuroinformatics; ontology; semantic web; user interface
To devise an automated approach for integrating federated database information using database ontologies constructed from their extended metadata.
One challenge of database federation is that the granularity of representation of equivalent data varies across systems. Dealing effectively with this problem is analogous to dealing with precoordinated vs. postcoordinated concepts in biomedical ontologies.
The authors describe an approach based on ontological metadata mapping rules defined with elements of a global vocabulary, which allows a query specified at one granularity level to fetch data, where possible, from databases within the federation that use different granularities. This is implemented in OntoMediator, a newly developed production component of our previously described Query Integrator System. OntoMediator's operation is illustrated with a query that accesses three geographically separate, interoperating databases. An example based on SNOMED also illustrates the applicability of high-level rules to support the enforcement of constraints that can prevent inappropriate curator or power-user actions.
A rule-based framework simplifies the design and maintenance of systems where categories of data must be mapped to each other, for the purpose of either cross-database query or for curation of the contents of compositional controlled vocabularies.
This article presents the latest developments in neuroscience information dissemination through the SenseLab suite of databases: NeuronDB, CellPropDB, ORDB, OdorDB, OdorMapDB, ModelDB and BrainPharm. These databases include information related to: (i) neuronal membrane properties and neuronal models, and (ii) genetics, genomics, proteomics and imaging studies of the olfactory system. We describe here: the new features for each database, the evolution of SenseLab’s unifying database architecture and instances of SenseLab database interoperation with other neuroscience online resources.
neuroscience; databases; SenseLab; neuroinformatics; Human Brain Project
This paper describes the NIF LinkOut Broker (NLB) that has been built as part of the Neuroscience Information Framework (NIF) project. The NLB is designed to coordinate the assembly of links to neuroscience information items (e.g., experimental data, knowledge bases, and software tools) that are (1) accessible via the Web, and (2) related to entries in the National Center for Biotechnology Information’s (NCBI’s) Entrez system. The NLB collects these links from each resource and passes them to the NCBI which incorporates them into its Entrez LinkOut service. In this way, an Entrez user looking at a specific Entrez entry can LinkOut directly to related neuroscience information. The information stored in the NLB can also be utilized in other ways. A second approach, which is operational on a pilot basis, is for the NLB Web server to create dynamically its own Web page of LinkOut links for each NCBI identifier in the NLB database. This approach can allow other resources (in addition to the NCBI Entrez) to LinkOut to related neuroscience information. The paper describes the current NLB system and discusses certain design issues that arose during its implementation.
Data integration; Entrez LinkOut; Gateway; GUID; Neurodatabases; PubMed
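The broker role described in the abstract above can be pictured as building an index from NCBI identifiers to the links contributed by each resource. The sketch below is an assumption-laden illustration in Python: the resource names, identifiers, URLs, and the functions `build_index` and `links_for` are all hypothetical and do not reflect the actual NLB data model or the NCBI LinkOut file format.

```python
from collections import defaultdict

# Hypothetical link submissions: resource name -> list of (NCBI id, URL).
SUBMISSIONS = {
    "ResourceA": [("12345", "http://example.org/a/item1")],
    "ResourceB": [
        ("12345", "http://example.org/b/item9"),
        ("67890", "http://example.org/b/item2"),
    ],
}


def build_index(submissions):
    """Group every submitted link under the NCBI identifier it relates to."""
    index = defaultdict(list)
    for resource in sorted(submissions):  # sorted for a stable link order
        for ncbi_id, url in submissions[resource]:
            index[ncbi_id].append((resource, url))
    return dict(index)


def links_for(index, ncbi_id):
    """Return the links that a per-identifier page would list."""
    return index.get(ncbi_id, [])


INDEX = build_index(SUBMISSIONS)
```

In this sketch, the same index could serve both uses mentioned in the abstract: exporting links for incorporation into Entrez LinkOut, and rendering a per-identifier page of links for other resources.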
The overarching goal of the NIF (Neuroscience Information Framework) project is to be a one-stop-shop for Neuroscience. This paper provides a technical overview of how the system is designed. The technical goal of the first version of the NIF system was to develop an information system that a neuroscientist can use to locate relevant information from a wide variety of information sources by simple keyword queries. Although the user would provide only keywords to retrieve information, the NIF system is designed to treat them as concepts whose meanings are interpreted by the system. Thus, a search for a term should also find records containing synonyms of that term. The system is targeted to find information from web pages, publications, databases, web sites built upon databases, XML documents and any other modality in which such information may be published. We have designed a system to achieve this functionality. A central element in the system is an ontology called NIFSTD (for NIF Standard), constructed by amalgamating a number of known and newly developed ontologies. NIFSTD is used by our ontology management module, called OntoQuest, to perform ontology-based search over data sources. The NIF architecture currently provides three different mechanisms for searching heterogeneous data sources including relational databases, web sites, XML documents and full text of publications. Version 1.0 of the NIF system is currently in beta test and may be accessed through http://nif.nih.gov.
ontology; data federation; neuroscience resource
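The idea of treating a keyword as a concept, so that a search also matches synonyms, can be sketched as a simple query-expansion step. The Python below is only an illustration under stated assumptions: the synonym table, the sample records, and the functions `expand` and `search` are hypothetical and stand in for, rather than reproduce, NIFSTD and OntoQuest.

```python
# Hypothetical miniature vocabulary: concept -> set of equivalent labels.
SYNONYMS = {
    "purkinje cell": {"purkinje cell", "purkinje neuron"},
    "cerebellum": {"cerebellum"},
}

# Hypothetical records to search.
RECORDS = [
    "Firing patterns of the Purkinje neuron in vitro",
    "Hippocampal place cells",
    "Purkinje cell dendritic spines",
]


def expand(term):
    """Map a keyword to every label of the concept it names, if known."""
    term = term.lower()
    for labels in SYNONYMS.values():
        if term in labels:
            return set(labels)
    return {term}  # unknown terms fall back to plain keyword matching


def search(term, records):
    """Return records that mention the term or any of its synonyms."""
    labels = expand(term)
    return [r for r in records if any(label in r.lower() for label in labels)]
```

With this sketch, `search("Purkinje cell", RECORDS)` returns both the record that uses "Purkinje cell" and the one that uses "Purkinje neuron", which is the behavior the abstract describes for concept-based keyword queries.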
This paper describes a pilot query interface that has been constructed to help us explore a “concept-based” approach for searching the Neuroscience Information Framework (NIF). The query interface is concept-based in the sense that the search terms submitted through the interface are selected from a standardized vocabulary of terms (concepts) that are structured in the form of an ontology. The NIF contains three primary resources: the NIF Resource Registry, the NIF Document Archive, and the NIF Database Mediator. These NIF resources are very different in their nature and therefore pose challenges when designing a single interface from which searches can be automatically launched against all three resources simultaneously. The paper first discusses briefly several background issues involving the use of standardized biomedical vocabularies in biomedical information retrieval, and then presents a detailed example that illustrates how the pilot concept-based query interface operates. The paper concludes by discussing certain lessons learned in the development of the current version of the interface.
Data search; Web search; Ontologies; Database mediation; Data federation; Text search; Neuroscience
With support from the Institutes and Centers forming the NIH Blueprint for Neuroscience Research, we have designed and implemented a new initiative for integrating access to and use of Web-based neuroscience resources: the Neuroscience Information Framework. The Framework arises from the expressed need of the neuroscience community for neuroinformatic tools and resources to aid scientific inquiry, builds upon prior development of neuroinformatics by the Human Brain Project and others, and directly derives from the Society for Neuroscience’s Neuroscience Database Gateway. Partnered with the Society, its Neuroinformatics Committee, and volunteer consultant-collaborators, our multi-site consortium has developed: (1) a comprehensive, dynamic, inventory of Web-accessible neuroscience resources, (2) an extended and integrated terminology describing resources and contents, and (3) a framework accepting and aiding concept-based queries. Evolving instantiations of the Framework may be viewed at http://nif.nih.gov, http://neurogateway.org, and other sites as they come on line.
Neurodatabases; Data sharing; Terminologies; Portals
Data sparsity and schema evolution issues affecting clinical informatics and bioinformatics communities have led to the adoption of vertical or object-attribute–value-based database schemas to overcome limitations posed when using conventional relational database technology. This paper explores these issues and discusses why biomedical data are difficult to model using conventional relational techniques. The authors propose a solution to these obstacles based on a relational database engine using a sparse, column-store architecture. The authors provide benchmarks comparing the performance of queries and schema-modification operations using three different strategies: (1) the standard conventional relational design; (2) past approaches used by biomedical informatics researchers; and (3) their sparse, column-store architecture. The performance results show that their architecture is a promising technique for storing and processing many types of data that are not handled well by the other two semantic data models.
The present study describes an open source application, ResourceLog, that allows website administrators to record and analyze the usage of online resources. The application includes four components: logging, data mining, administrative interface, and back-end database. The logging component is embedded in the host website. It extracts and streamlines information about the Web visitors, the scripts, and dynamic parameters from each page request. The data mining component runs as a set of scheduled tasks that identify visitors of interest, such as those who have heavily used the resources. The identified visitors are automatically presented with a voluntary user survey. The usage of the website content can be monitored through the administrative interface and subjected to statistical analyses. As a pilot project, ResourceLog has been implemented in SenseLab, a Web-based neuroscience database system. ResourceLog provides a robust and useful tool to aid system evaluation of a resource-driven Web application, with a focus on determining the effectiveness of data sharing in the field and with the general public.
Neuroscientists often need to access a wide range of data sets distributed over the Internet. These data sets, however, are typically neither integrated nor interoperable, resulting in a barrier to answering complex neuroscience research questions. Domain ontologies can enable the querying of heterogeneous data sets, but they are not sufficient for neuroscience since the data of interest commonly span multiple research domains. To this end, e-Neuroscience seeks to provide an integrated platform for neuroscientists to discover new knowledge through seamless integration of the very diverse types of neuroscience data. Here we present a Semantic Web approach to building this e-Neuroscience framework by using the Resource Description Framework (RDF) and its vocabulary description language, RDF Schema (RDFS), as a standard data model to facilitate both representation and integration of the data.
We have constructed a pilot ontology for BrainPharm (a subset of SenseLab) using RDFS and then converted a subset of the BrainPharm data into RDF according to the ontological structure. We have also integrated the converted BrainPharm data with existing RDF hypothesis and publication data from a pilot version of SWAN (Semantic Web Applications in Neuromedicine). Our implementation uses the RDF Data Model in Oracle Database 10g release 2 for data integration, query, and inference, while our Web interface allows users to query the data and retrieve the results in a convenient fashion.
Accessing and integrating biomedical data which cuts across multiple disciplines will be increasingly indispensable and beneficial to neuroscience researchers. The Semantic Web approach we undertook has demonstrated a promising way to semantically integrate data sets created independently. It also shows how advanced queries and inferences can be performed over the integrated data, which are hard to achieve using traditional data integration approaches. Our pilot results suggest that our Semantic Web approach is suitable for realizing e-Neuroscience and generic enough to be applied in other biomedical fields.
Integrative neuroscience involves the integration and analysis of diverse types of neuroscience data involving many different experimental techniques. This data will increasingly be distributed across many heterogeneous databases that are web-accessible. Currently, these databases do not expose their schemas (database structures) and their contents to web applications/agents in a standardized, machine-friendly way. This limits database interoperation. To address this problem, we describe a pilot project that illustrates how neuroscience databases can be expressed using the Web Ontology Language, which is a semantically-rich ontological language, as a common data representation language to facilitate complex cross-database queries. In this pilot project, an existing tool called “D2RQ” was used to translate two neuroscience databases (NeuronDB and CoCoDat) into OWL, and the resulting OWL ontologies were then merged. An OWL-based reasoner (Racer) was then used to provide a sophisticated query language (nRQL) to perform integrated queries across the two databases based on the merged ontology. This pilot project is one step toward exploring the use of semantic web technologies in the neurosciences.
Query Integrator System (QIS) is a database mediator framework intended to address robust data integration from continuously changing heterogeneous data sources in the biosciences. Currently in the advanced prototype stage, it is being used on a production basis to integrate data from neuroscience databases developed for the SenseLab project at Yale University with external neuroscience and genomics databases. The QIS framework uses standard technologies and is intended to be deployable by administrators with a moderate level of technological expertise; it comes with various tools, such as interfaces for the design of distributed queries. The QIS architecture is based on a set of distributed network-based servers (data source servers, integration servers, and ontology servers) that exchange metadata as well as mappings of both metadata and data elements to elements in an ontology. Determination of metadata version differences, coupled with decomposition of stored queries, is used as the basis for partial query recovery when the schema of a data source changes.
The EAV/CR framework, designed for database support of rapidly evolving scientific domains, utilizes metadata to facilitate schema maintenance and automatic generation of Web-enabled browsing interfaces to the data. EAV/CR is used in SenseLab, a neuroscience database that is part of the national Human Brain Project. This report describes various enhancements to the framework. These include (1) the ability to create “portals” that present different subsets of the schema to users with a particular research focus, (2) a generic XML-based protocol to assist data extraction and population of the database by external agents, (3) a limited form of ad hoc data query, and (4) semantic descriptors for interclass relationships and links to controlled vocabularies such as the UMLS.
The paper provides an overview of neuroinformatics research at Yale University being performed as part of the national Human Brain Project. This research is exploring the integration of multidisciplinary sensory data, using the olfactory system as a model domain. The neuroinformatics activities fall into three main areas: 1) building databases and related tools that support experimental olfactory research at Yale and can also serve as resources for the field as a whole, 2) using computer models (molecular models and neuronal models) to help understand data being collected experimentally and to help guide further laboratory experiments, 3) performing basic neuroinformatics research to develop new informatics technologies, including a flexible data model (EAV/CR, entity-attribute-value with classes and relationships) designed to facilitate the integration of diverse heterogeneous data within a single unifying framework.
The Olfactory Receptor Database (ORDB; http://senselab.med.yale.edu/senselab/ordb) is a central repository of olfactory receptor (OR) and olfactory receptor-like gene and protein sequences. To deal with the very large OR gene family, we have constructed an algorithm that automatically downloads sequences from web sources such as GenBank and SWISS-PROT into the database. The algorithm uses hypertext markup language (HTML) parsing techniques that extract information relevant to ORDB. The information is then correlated with the metadata in the ORDB knowledge base to encode the unstructured text extracted into the structured format compliant with the database architecture, entity attribute value with classes and relationships (EAV/CR), which supports the SenseLab project as a whole. Three population methods (batch, automatic, and semi-automatic) are discussed. The data is imported into the database using extensible markup language (XML).
Objective: To develop a guideline document model that includes a sufficiently broad set of concepts to be useful throughout the guideline life cycle.
Design: Current guideline document models are limited in that they reflect the specific orientation of the stakeholder who created them; thus, developers and disseminators often provide few constructs for conceptualizing recommendations, while implementers de-emphasize concepts related to establishing guideline validity. The authors developed the Guideline Elements Model (GEM) using XML to better represent the heterogeneous knowledge contained in practice guidelines. Core constructs were derived from the Institute of Medicine's Guideline Appraisal Instrument, the National Guideline Clearinghouse, and the augmented decision table guideline representation. These were supplemented by additional concepts from a literature review.
Results: The GEM hierarchy includes more than 100 elements. Major concepts relate to a guideline's identity, developer, purpose, intended audience, method of development, target population, knowledge components, testing, and review plan. Knowledge components in guideline documents include recommendations (which in turn comprise conditionals and imperatives), definitions, and algorithms.
Conclusion: GEM is more comprehensive than existing models and is expressively adequate to represent the heterogeneous information contained in guidelines. Use of XML contributes to a flexible, comprehensible, shareable, and reusable knowledge representation that is both readable by human beings and processible by computers.
Background: The entity-attribute-value representation with classes and relationships (EAV/CR) provides a flexible and simple database schema to store heterogeneous biomedical data. In certain circumstances, however, the EAV/CR model is known to retrieve data less efficiently than conventionally based database schemas.
Objective: To perform a pilot study that systematically quantifies performance differences for database queries directed at real-world microbiology data modeled with EAV/CR and conventional representations, and to explore the relative merits of different EAV/CR query implementation strategies.
Methods: Clinical microbiology data obtained over a ten-year period were stored using both database models. Query execution times were compared for four clinically oriented attribute-centered and entity-centered queries operating under varying conditions of database size and system memory. The performance characteristics of three different EAV/CR query strategies were also examined.
Results: Performance was similar for entity-centered queries in the two database models. For attribute-centered queries, the EAV/CR model was approximately three to five times less efficient than its conventional counterpart. The differences in query efficiency became slightly greater as database size increased, although they were reduced with the addition of system memory. The authors found that EAV/CR queries formulated using multiple, simple SQL statements executed in batch were more efficient than single, large SQL statements.
Conclusion: This paper describes a pilot project to explore issues in and compare query performance for EAV/CR and conventional database representations. Although attribute-centered queries were less efficient in the EAV/CR model, these inefficiencies may be addressable, at least in part, by the use of more powerful hardware or more memory, or both.
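To make the distinction between entity-centered and attribute-centered queries concrete, the following Python sketch stores the same toy data in a conventional table and in a plain entity-attribute-value table, using the standard `sqlite3` module. The table names, column names, and sample rows are hypothetical, and the sketch uses simple EAV without the classes and relationships of EAV/CR; it illustrates query shape only and makes no claim about the study's schema or timings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Conventional representation: one column per attribute.
cur.execute("CREATE TABLE culture (id INTEGER PRIMARY KEY, organism TEXT, specimen TEXT)")
cur.executemany(
    "INSERT INTO culture VALUES (?, ?, ?)",
    [(1, "E. coli", "urine"), (2, "S. aureus", "blood"), (3, "E. coli", "blood")],
)

# EAV representation: one row per (entity, attribute, value) triple.
cur.execute("CREATE TABLE eav (entity INTEGER, attribute TEXT, value TEXT)")
cur.executemany(
    "INSERT INTO eav VALUES (?, ?, ?)",
    [
        (1, "organism", "E. coli"), (1, "specimen", "urine"),
        (2, "organism", "S. aureus"), (2, "specimen", "blood"),
        (3, "organism", "E. coli"), (3, "specimen", "blood"),
    ],
)

# Entity-centered query: everything known about entity 1.
conv_entity = cur.execute(
    "SELECT organism, specimen FROM culture WHERE id = 1"
).fetchall()
eav_entity = cur.execute(
    "SELECT attribute, value FROM eav WHERE entity = 1 ORDER BY attribute"
).fetchall()

# Attribute-centered query: entities with organism 'E. coli' from 'blood'.
conv_attr = [row[0] for row in cur.execute(
    "SELECT id FROM culture WHERE organism = 'E. coli' AND specimen = 'blood'"
)]
# In EAV, each attribute tested requires another pass over the same table,
# expressed here as a self-join.
eav_attr = [row[0] for row in cur.execute(
    """
    SELECT a.entity
    FROM eav AS a JOIN eav AS b ON a.entity = b.entity
    WHERE a.attribute = 'organism' AND a.value = 'E. coli'
      AND b.attribute = 'specimen' AND b.value = 'blood'
    """
)]
```

Both representations return the same answers here; the difference is that the attribute-centered EAV query needs one self-join per attribute tested, which is one plausible reason such queries cost more in an EAV layout.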
The task of creating and maintaining a front end to a large institutional entity-attribute-value (EAV) database can be cumbersome when using traditional client-server technology. Switching to Web technology as a delivery vehicle solves some of these problems but introduces others. In particular, Web development environments tend to be primitive, and many features that client-server developers take for granted are missing. WebEAV is a generic framework for Web development that is intended to streamline the process of Web application development for databases having a significant EAV component. It also addresses some challenging user interface issues that arise when any complex system is created. The authors describe the architecture of WebEAV and provide an overview of its features with suitable examples.