The National Center for Biomedical Ontology is now in its seventh year. The goals of this National Center for Biomedical Computing are to: create and maintain a repository of biomedical ontologies and terminologies; build tools and web services to enable the use of ontologies and terminologies in clinical and translational research; educate their trainees and the scientific community broadly about biomedical ontology and ontology-based technology and best practices; and collaborate with a variety of groups who develop and use ontologies and terminologies in biomedicine. The centerpiece of the National Center for Biomedical Ontology is a web-based resource known as BioPortal. BioPortal makes available for research in computationally useful forms more than 270 of the world's biomedical ontologies and terminologies, and supports a wide range of web services that enable investigators to use the ontologies to annotate and retrieve data, to generate value sets and special-purpose lexicons, and to perform advanced analytics on a wide range of biomedical data.
Collaborative technologies; knowledge representations; knowledge acquisition and knowledge management; controlled terminologies and vocabularies; ontologies; knowledge bases; applications that link biomedical knowledge from diverse primary sources (includes automated indexing); statistical analysis of large datasets; methods for integration of information from disparate sources; discovery; and text and data mining methods; automated learning; information retrieval; HIT data standards; representing; identifying; and modeling biological structures; developing and refining ehr data standards (including image standards)
Ontologies are commonly used in biomedicine to organize concepts to describe domains such as anatomies, environments, experiment, taxonomies etc. NCBO BioPortal currently hosts about 180 different biomedical ontologies. These ontologies have been mainly expressed in either the Open Biomedical Ontology (OBO) format or the Web Ontology Language (OWL). OBO emerged from the Gene Ontology, and supports most of the biomedical ontology content. In comparison, OWL is a Semantic Web language, and is supported by the World Wide Web consortium together with integral query languages, rule languages and distributed infrastructure for information interchange. These features are highly desirable for the OBO content as well. A convenient method for leveraging these features for OBO ontologies is by transforming OBO ontologies to OWL.
We have developed a methodology for translating OBO ontologies to OWL using the organization of the Semantic Web itself to guide the work. The approach reveals that the constructs of OBO can be grouped together to form a similar layer cake. Thus we were able to decompose the problem into two parts. Most OBO constructs have easy and obvious equivalence to a construct in OWL. A small subset of OBO constructs requires deeper consideration. We have defined transformations for all constructs in an effort to foster a standard common mapping between OBO and OWL. Our mapping produces OWL-DL, a Description Logics based subset of OWL with desirable computational properties for efficiency and correctness. Our Java implementation of the mapping is part of the official Gene Ontology project source.
Our transformation system provides a lossless roundtrip mapping for OBO ontologies, i.e. an OBO ontology may be translated to OWL and back without loss of knowledge. In addition, it provides a roadmap for bridging the gap between the two ontology languages in order to enable the use of ontology content in a language independent manner.
Integrative neuroscience research needs a scalable informatics framework that enables semantic integration of diverse types of neuroscience data. This paper describes the use of the Web Ontology Language (OWL) and other Semantic Web technologies for the representation and integration of molecular-level data provided by several of SenseLab suite of neuroscience databases.
Based on the original database structure, we semi-automatically translated the databases into OWL ontologies with manual addition of semantic enrichment. The SenseLab ontologies are extensively linked to other biomedical Semantic Web resources, including the Subcellular Anatomy Ontology, Brain Architecture Management System, the Gene Ontology, BIRNLex and UniProt. The SenseLab ontologies have also been mapped to the Basic Formal Ontology and Relation Ontology, which helps ease interoperability with many other existing and future biomedical ontologies for the Semantic Web. In addition, approaches to representing contradictory research statements are described. The SenseLab ontologies are designed for use on the Semantic Web that enables their integration into a growing collection of biomedical information resources.
We demonstrate that our approach can yield significant potential benefits and that the Semantic Web is rapidly becoming mature enough to realize its anticipated promises. The ontologies are available online at http://neuroweb.med.yale.edu/senselab/
Semantic Web; neuroscience; description logic; ontology mapping; Web Ontology Language; integration
SSWAP (Simple Semantic Web Architecture and Protocol; pronounced "swap") is an architecture, protocol, and platform for using reasoning to semantically integrate heterogeneous disparate data and services on the web. SSWAP was developed as a hybrid semantic web services technology to overcome limitations found in both pure web service technologies and pure semantic web technologies.
There are currently over 2400 resources published in SSWAP. Approximately two dozen are custom-written services for QTL (Quantitative Trait Loci) and mapping data for legumes and grasses (grains). The remaining are wrappers to Nucleic Acids Research Database and Web Server entries. As an architecture, SSWAP establishes how clients (users of data, services, and ontologies), providers (suppliers of data, services, and ontologies), and discovery servers (semantic search engines) interact to allow for the description, querying, discovery, invocation, and response of semantic web services. As a protocol, SSWAP provides the vocabulary and semantics to allow clients, providers, and discovery servers to engage in semantic web services. The protocol is based on the W3C-sanctioned first-order description logic language OWL DL. As an open source platform, a discovery server running at (as in to "swap info") uses the description logic reasoner Pellet to integrate semantic resources. The platform hosts an interactive guide to the protocol at , developer tools at , and a portal to third-party ontologies at (a "swap meet").
SSWAP addresses the three basic requirements of a semantic web services architecture (i.e., a common syntax, shared semantic, and semantic discovery) while addressing three technology limitations common in distributed service systems: i.e., i) the fatal mutability of traditional interfaces, ii) the rigidity and fragility of static subsumption hierarchies, and iii) the confounding of content, structure, and presentation. SSWAP is novel by establishing the concept of a canonical yet mutable OWL DL graph that allows data and service providers to describe their resources, to allow discovery servers to offer semantically rich search engines, to allow clients to discover and invoke those resources, and to allow providers to respond with semantically tagged data. SSWAP allows for a mix-and-match of terms from both new and legacy third-party ontologies in these graphs.
The World Wide Web plays a critical role in enabling molecular, cell, systems and computational biologists to exchange, search, visualize, integrate, and analyze experimental data. Such efforts can be further enhanced through the development of semantic web concepts. The semantic web idea is to enable machines to understand data through the development of protocol free data exchange formats such as Resource Description Framework (RDF) and the Web Ontology Language (OWL). These standards provide formal descriptors of objects, object properties and their relationships within a specific knowledge domain. However, the overhead of converting datasets typically stored in data tables such as Excel, text or PDF into RDF or OWL formats is not trivial for non-specialists and as such produces a barrier to seamless data exchange between researchers, databases and analysis tools. This problem is particularly of importance in the field of network systems biology where biochemical interactions between genes and their protein products are abstracted to networks.
For the purpose of converting biochemical interactions into the BioPAX format, which is the leading standard developed by the computational systems biology community, we developed an open-source command line tool that takes as input tabular data describing different types of molecular biochemical interactions. The tool converts such interactions into the BioPAX level 3 OWL format. We used the tool to convert several existing and new mammalian networks of protein interactions, signalling pathways, and transcriptional regulatory networks into BioPAX. Some of these networks were deposited into PathwayCommons, a repository for consolidating and organizing biochemical networks.
The software tool Sig2BioPAX is a resource that enables experimental and computational systems biologists to contribute their identified networks and pathways of molecular interactions for integration and reuse with the rest of the research community.
The Clinical Narrative Temporal Relation Ontology (CNTRO) has been developed for the purpose of allowing temporal information of clinical data to be semantically annotated and queried, and using inference to expose new temporal features and relations based on the semantic assertions and definitions of the temporal aspects in the ontology. While CNTRO provides a formal semantic foundation to leverage the semantic-web techniques, it is still necessary to arrive at a shared set of semantics and operational rules with commonly used ontologies for the time domain. This paper introduces CNTRO 2.0, which tries to harmonize CNTRO 1.0 and a list of existing time ontologies or top-level ontologies into a unified model—an OWL based ontology of temporal relations for clinical research.
Ontologies have become an essential asset in the bioinformatics toolbox and a number of ontology access resources are now available, for example, the EBI Ontology Lookup Service (OLS) and the NCBO BioPortal. However, these resources differ substantially in mode, ease of access, and ontology content. This makes it relatively difficult to access each ontology source separately, map their contents to research data, and much of this effort is being replicated across different research groups.
OntoCAT provides a seamless programming interface to query heterogeneous ontology resources including OLS and BioPortal, as well as user-specified local OWL and OBO files. Each resource is wrapped behind easy to learn Java, Bioconductor/R and REST web service commands enabling reuse and integration of ontology software efforts despite variation in technologies. It is also available as a stand-alone MOLGENIS database and a Google App Engine application.
OntoCAT provides a robust, configurable solution for accessing ontology terms specified locally and from remote services, is available as a stand-alone tool and has been tested thoroughly in the ArrayExpress, MOLGENIS, EFO and Gen2Phen phenotype use cases.
Genetic testing for personalizing pharmacotherapy is bound to become an important part of clinical routine. To address associated issues with data management and quality, we are creating a semantic knowledge base for clinical pharmacogenetics. The knowledge base is made up of three components: an expressive ontology formalized in the Web Ontology Language (OWL 2 DL), a Resource Description Framework (RDF) model for capturing detailed results of manual annotation of pharmacogenomic information in drug product labels, and an RDF conversion of relevant biomedical datasets. Our work goes beyond the state of the art in that it makes both automated reasoning as well as query answering as simple as possible, and the reasoning capabilities go beyond the capabilities of previously described ontologies.
pharmacogenetics; pharmacogenomics; ontology; medical informatics; clinical decision support systems
The amount of data generated from genome-wide association studies (GWAS) has grown rapidly, but considerations for GWAS phenotype data reuse and interchange have not kept pace. This impacts on the work of GWAS Central – a free and open access resource for the advanced querying and comparison of summary-level genetic association data. The benefits of employing ontologies for standardising and structuring data are widely accepted. The complex spectrum of observed human phenotypes (and traits), and the requirement for cross-species phenotype comparisons, calls for reflection on the most appropriate solution for the organisation of human phenotype data. The Semantic Web provides standards for the possibility of further integration of GWAS data and the ability to contribute to the web of Linked Data.
A pragmatic consideration when applying phenotype ontologies to GWAS data is the ability to retrieve all data, at the most granular level possible, from querying a single ontology graph. We found the Medical Subject Headings (MeSH) terminology suitable for describing all traits (diseases and medical signs and symptoms) at various levels of granularity and the Human Phenotype Ontology (HPO) most suitable for describing phenotypic abnormalities (medical signs and symptoms) at the most granular level. Diseases within MeSH are mapped to HPO to infer the phenotypic abnormalities associated with diseases. Building on the rich semantic phenotype annotation layer, we are able to make cross-species phenotype comparisons and publish a core subset of GWAS data as RDF nanopublications.
We present a methodology for applying phenotype annotations to a comprehensive genome-wide association dataset and for ensuring compatibility with the Semantic Web. The annotations are used to assist with cross-species genotype and phenotype comparisons. However, further processing and deconstructions of terms may be required to facilitate automatic phenotype comparisons. The provision of GWAS nanopublications enables a new dimension for exploring GWAS data, by way of intrinsic links to related data resources within the Linked Data web. The value of such annotation and integration will grow as more biomedical resources adopt the standards of the Semantic Web.
Ontology; Phenotype; GWAS; RDF
The clinical element model (CEM) is an information model designed for representing clinical information in electronic health records (EHR) systems across organizations. The current representation of CEMs does not support formal semantic definitions and therefore it is not possible to perform reasoning and consistency checking on derived models. This paper introduces our efforts to represent the CEM specification using the Web Ontology Language (OWL). The CEM-OWL representation connects the CEM content with the Semantic Web environment, which provides authoring, reasoning, and querying tools. This work may also facilitate the harmonization of the CEMs with domain knowledge represented in terminology models as well as other clinical information models such as the openEHR archetype model. We have created the CEM-OWL meta ontology based on the CEM specification. A convertor has been implemented in Java to automatically translate detailed CEMs from XML to OWL. A panel evaluation has been conducted, and the results show that the OWL modeling can faithfully represent the CEM specification and represent patient data.
Ontologies; Semantic Web; OWL; Clinical Element Model; Secondary Use of EHR
Complexity and amount of post-genomic data constitute two major factors limiting the application of Knowledge Discovery in Databases (KDD) methods in life sciences. Bio-ontologies may nowadays play key roles in knowledge discovery in life science providing semantics to data and to extracted units, by taking advantage of the progress of Semantic Web technologies concerning the understanding and availability of tools for knowledge representation, extraction, and reasoning.
This paper presents a method that exploits bio-ontologies for guiding data selection within the preparation step of the KDD process. We propose three scenarios in which domain knowledge and ontology elements such as subsumption, properties, class descriptions, are taken into account for data selection, before the data mining step. Each of these scenarios is illustrated within a case-study relative to the search of genotype-phenotype relationships in a familial hypercholesterolemia dataset. The guiding of data selection based on domain knowledge is analysed and shows a direct influence on the volume and significance of the data mining results.
The method proposed in this paper is an efficient alternative to numerical methods for data selection based on domain knowledge. In turn, the results of this study may be reused in ontology modelling and data integration.
Motivation: For many years, the Unified Medical Language System (UMLS) semantic network (SN) has been used as an upper-level semantic framework for the categorization of terms from terminological resources in biomedicine. BioTop has recently been developed as an upper-level ontology for the biomedical domain. In contrast to the SN, it is founded upon strict ontological principles, using OWL DL as a formal representation language, which has become standard in the semantic Web. In order to make logic-based reasoning available for the resources annotated or categorized with the SN, a mapping ontology was developed aligning the SN with BioTop.
Methods: The theoretical foundations and the practical realization of the alignment are being described, with a focus on the design decisions taken, the problems encountered and the adaptations of BioTop that became necessary. For evaluation purposes, UMLS concept pairs obtained from MEDLINE abstracts by a named entity recognition system were tested for possible semantic relationships. Furthermore, all semantic-type combinations that occur in the UMLS Metathesaurus were checked for satisfiability.
Results: The effort-intensive alignment process required major design changes and enhancements of BioTop and brought up several design errors that could be fixed. A comparison between a human curator and the ontology yielded only a low agreement. Ontology reasoning was also used to successfully identify 133 inconsistent semantic-type combinations.
Availability: BioTop, the OWL DL representation of the UMLS SN, and the mapping ontology are available at http://www.purl.org/biotop/.
Motivation: Ontologies are essential in biomedical research due to their ability to semantically integrate content from different scientific databases and resources. Their application improves capabilities for querying and mining biological knowledge. An increasing number of ontologies is being developed for this purpose, and considerable effort is invested into formally defining them in order to represent their semantics explicitly. However, current biomedical ontologies do not facilitate data integration and interoperability yet, since reasoning over these ontologies is very complex and cannot be performed efficiently or is even impossible. We propose the use of less expressive subsets of ontology representation languages to enable efficient reasoning and achieve the goal of genuine interoperability between ontologies.
Results: We present and evaluate EL Vira, a framework that transforms OWL ontologies into the OWL EL subset, thereby enabling the use of tractable reasoning. We illustrate which OWL constructs and inferences are kept and lost following the conversion and demonstrate the performance gain of reasoning indicated by the significant reduction of processing time. We applied EL Vira to the open biomedical ontologies and provide a repository of ontologies resulting from this conversion. EL Vira creates a common layer of ontological interoperability that, for the first time, enables the creation of software solutions that can employ biomedical ontologies to perform inferences and answer complex queries to support scientific analyses.
Availability and implementation: The EL Vira software is available from http://el-vira.googlecode.com and converted OBO ontologies and their mappings are available from http://bioonto.gen.cam.ac.uk/el-ont.
The Open Biomedical Ontologies (OBO) Foundry is a coordinated community-wide effort to develop ontologies that support the annotation and integration of scientific data. In work supported by the National Database of Autism Research (NDAR), we are developing an ontology of autism that extends the ontologies available in the OBO Foundry. We undertook a systematic literature review to identify domain terms and relationships relevant to autism phenotypes. To enable user queries and inferences about such phenotypes using data in the NDAR repository, we augmented the domain ontology with an information model. In this paper, we show how our approach, using a combination of description logic and rule-based reasoning, enables high-level phenotypic abstractions to be inferred from subject-specific data. Our integrated domain ontology–information model approach allows scientific data repositories to be augmented with rule-based abstractions that facilitate the ability of researchers to undertake data analysis.
Background: There is an increasing need to integrate phenotype measurement data across studies for both human studies and those involving model organisms. Current practices allow researchers to access only those data involved in a single experiment or multiple experiments utilizing the same protocol. Results: Three ontologies were created: Clinical Measurement Ontology, Measurement Method Ontology and Experimental Condition Ontology. These ontologies provided the framework for integration of rat phenotype data from multiple studies into a single resource as well as facilitated data integration from multiple human epidemiological studies into a centralized repository. Conclusion: An ontology based framework for phenotype measurement data affords the ability to successfully integrate vital phenotype data into critical resources, regardless of underlying technological structures allowing the user to easily query and retrieve data from multiple studies.
We demonstrate the use of Semantic Web technology to integrate the ALFRED allele frequency database and the Starpath pathway resource. The linking of population-specific genotype data with cancer-related pathway data is potentially useful given the growing interest in personalized medicine and the exploitation of pathway knowledge for cancer drug discovery. We model our data using the Web Ontology Language (OWL), drawing upon ideas from existing standard formats BioPAX for pathway data and PML for allele frequency data. We store our data within an Oracle database, using Oracle Semantic Technologies. We then query the data using Oracle’s rule-based inference engine and SPARQL-like RDF query language. The ability to perform queries across the domains of population genetics and pathways offers the potential to answer a number of cancer-related research questions. Among the possibilities is the ability to identify genetic variants which are associated with cancer pathways and whose frequency varies significantly between ethnic groups. This sort of information could be useful for designing clinical studies and for providing background data in personalized medicine. It could also assist with the interpretation of genetic analysis results such as those from genome-wide association studies.
Using Semantic-Web specifications to represent temporal information in clinical narratives is an important step for temporal reasoning and answering time-oriented queries. Existing temporal models are either not compatible with the powerful reasoning tools developed for the Semantic Web, or designed only for structured clinical data and therefore are not ready to be applied on natural-language-based clinical narrative reports directly. We have developed a Semantic-Web ontology which is called Clinical Narrative Temporal Relation ontology. Using this ontology, temporal information in clinical narratives can be represented as RDF (Resource Description Framework) triples. More temporal information and relations can then be inferred by Semantic-Web based reasoning tools. Experimental results show that this ontology can represent temporal information in real clinical narratives successfully.
OWL, the Web Ontology Language, provides syntax and semantics for representing knowledge for the semantic web. Many of the constructs of OWL have a basis in the field of description logics. While the formal underpinnings of description logics have lead to a highly computable language, it has come at a cognitive cost. OWL ontologies are often unintuitive to readers lacking a strong logic background.
In this work we describe GLEEN, a regular path expression library, which extends the RDF query language SparQL to support complex path expressions over OWL and other RDF-based ontologies. We illustrate the utility of GLEEN by showing how it can be used in a query-based approach to defining simpler, more intuitive views of OWL ontologies. In particular we show how relatively simple GLEEN-enhanced SparQL queries can create views of the OWL version of the NCI Thesaurus that match the views generated by the web-based NCI browser.
Motivation: Advancing the search, publication and integration of bioinformatics tools and resources demands consistent machine-understandable descriptions. A comprehensive ontology allowing such descriptions is therefore required.
Results: EDAM is an ontology of bioinformatics operations (tool or workflow functions), types of data and identifiers, application domains and data formats. EDAM supports semantic annotation of diverse entities such as Web services, databases, programmatic libraries, standalone tools, interactive applications, data schemas, datasets and publications within bioinformatics. EDAM applies to organizing and finding suitable tools and data and to automating their integration into complex applications or workflows. It includes over 2200 defined concepts and has successfully been used for annotations and implementations.
Availability: The latest stable version of EDAM is available in OWL format from http://edamontology.org/EDAM.owl and in OBO format from http://edamontology.org/EDAM.obo. It can be viewed online at the NCBO BioPortal and the EBI Ontology Lookup Service. For documentation and license please refer to http://edamontology.org. This article describes version 1.2 available at http://edamontology.org/EDAM_1.2.owl.
CiTO, the Citation Typing Ontology, is an ontology for describing the nature of reference citations in scientific research articles and other scholarly works, both to other such publications and also to Web information resources, and for publishing these descriptions on the Semantic Web. Citation are described in terms of the factual and rhetorical relationships between citing publication and cited publication, the in-text and global citation frequencies of each cited work, and the nature of the cited work itself, including its publication and peer review status. This paper describes CiTO and illustrates its usefulness both for the annotation of bibliographic reference lists and for the visualization of citation networks. The latest version of CiTO, which this paper describes, is CiTO Version 1.6, published on 19 March 2010. CiTO is written in the Web Ontology Language OWL, uses the namespace http://purl.org/net/cito/, and is available from http://purl.org/net/cito/. This site uses content negotiation to deliver to the user an OWLDoc Web version of the ontology if accessed via a Web browser, or the OWL ontology itself if accessed from an ontology management tool such as Protégé 4 (http://protege.stanford.edu/). Collaborative work is currently under way to harmonize CiTO with other ontologies describing bibliographies and the rhetorical structure of scientific discourse.
Protégé OWL1 is an open source tool created to support ontology development for the
Semantic Web. It is a plug-in extension to the Protégé ontology
development platform. Protégé OWL allows users
to edit ontologies in the Web Ontology Language (OWL) and to use description
logic classifiers to maintain consistency of their ontologies. Protégé OWL can also assist developers of intelligent
applications in biomedicine, because many of the problem-solving tasks
they seek to automate can be construed as classification tasks. Protégé OWL
provides access to emerging knowledge representation
standards such as OWL and high-performance classifiers. Being
integrated with Protégé, the OWL Plug-in allows users
to exploit Protégé’s core features and services
such as graphical user interfaces, a variety of storage formats, and
data acquisition and visualization tools. Finally, Protégé OWL
provides an API allowing it to be integrated into applications.
This paper illustrates how Semantic Web technologies (especially RDF, OWL, and SPARQL) can support information integration and make it easy to create semantic mashups (semantically integrated resources). In the context of understanding the genetic basis of nicotine dependence, we integrate gene and pathway information and show how three complex biological queries can be answered by the integrated knowledge base.
We use an ontology-driven approach to integrate two gene resources (Entrez Gene and HomoloGene) and three pathway resources (KEGG, Reactome and BioCyc), for five organisms, including humans. We created the Entrez Knowledge Model (EKoM), an information model in OWL for the gene resources, and integrated it with the extant BioPAX ontology designed for pathway resources. The integrated schema is populated with data from the pathway resources, publicly available in BioPAX-compatible format, and gene resources for which a population procedure was created. The SPARQL query language is used to formulate queries over the integrated knowledge base to answer the three biological queries.
Simple SPARQL queries could easily identify hub genes, i.e., those genes whose gene products participate in many pathways or interact with many other gene products. The identification of the genes expressed in the brain turned out to be more difficult, due to the lack of a common identification scheme for proteins.
Semantic Web technologies provide a valid framework for information integration in the life sciences. Ontology-driven integration represents a flexible, sustainable and extensible solution to the integration of large volumes of information. Additional resources, which enable the creation of mappings between information sources, are required to compensate for heterogeneity across namespaces.
Semantic Web; Semantic mashup; Nicotine dependence; Information integration; Ontologies
There are lots of question-based data elements from the pharmacogenomics research network (PGRN) studies. Many data elements contain temporal information. To semantically represent these elements so that they can be machine processiable is a challenging problem for the following reasons: (1) the designers of these studies usually do not have the knowledge of any computer modeling and query languages, so that the original data elements usually are represented in spreadsheets in human languages; and (2) the time aspects in these data elements can be too complex to be represented faithfully in a machine-understandable way.
In this paper, we introduce our efforts on representing these data elements using semantic web technologies. We have developed an ontology, CNTRO, for representing clinical events and their temporal relations in the web ontology language (OWL). Here we use CNTRO to represent the time aspects in the data elements. We have evaluated 720 time-related data elements from PGRN studies. We adapted and extended the knowledge representation requirements for EliXR-TIME to categorize our data elements. A CNTRO-based SPARQL query builder has been developed to customize users’ own SPARQL queries for each knowledge representation requirement. The SPARQL query builder has been evaluated with a simulated EHR triple store to ensure its functionalities.
An approach towards heterogeneous neuroscience dataset integration is proposed that uses Natural Language Processing (NLP) and a knowledge-based phenotype organizer system (PhenOS) to link ontology-anchored terms to underlying data from each database, and then maps these terms based on a computable model of disease (SNOMED CT®). The approach was implemented using sample datasets from fMRIDC, GEO, The Whole Brain Atlas and Neuronames and allowed for complex queries such as “List all disorders with a finding site of brain region X, and then find the semantically related references in all participating databases based on the ontological model of the disease or its anatomical and morphological attributes”. Precision of the NLP-derived coding of the unstructured phenotypes in each dataset was 88% (n=50), and precision of the semantic mapping between these terms across datasets was 98% (n=100). To our knowledge, this is the first example of the use of both semantic decomposition of disease relationships and hierarchical information found in ontologies to integrate heterogeneous phenotypes across clinical and molecular datasets.
computational ontologies; phenotypes; database interoperability; Mediated Schema; SNOMED
An approach towards heterogeneous neuroscience dataset integration is proposed that uses Natural Language Processing (NLP) and a knowledgebased phenotype organizer system (PhenOS) to link ontology-anchored terms to underlying data from each database, and then maps these terms based on a computable model of disease (SNOMED CT®). The approach was implemented using sample datasets from fMRIDC, GEO and Neuronames and allowed for complex queries such as “List all disorders with a finding site of brain region X, and then find the semantically related references in all participating databases based on the ontological model of the disease or its anatomical and morphological attributes”. Precision of the NLP-derived coding of the unstructured phenotypes in each datasets was 88% (n=50), and precision of the semantic mapping between these terms across datasets was 98% (n=100). To our knowledge, this is the first example of the use of both semantic decomposition of disease relationships and hierarchical information found in ontologies to integrate heterogeneous phenotypes across clinical and molecular datasets.