The wealth of phenotypic descriptions documented in the published articles, monographs, and dissertations of phylogenetic systematics is traditionally reported in a free-text format, and it is therefore largely inaccessible for linkage to biological databases for genetics, development, and phenotypes, and difficult to manage for large-scale integrative work. The Phenoscape project aims to represent these complex and detailed descriptions with rich and formal semantics that are amenable to computation and integration with phenotype data from other fields of biology. This entails reconceptualizing the traditional free-text characters into the computable Entity-Quality (EQ) formalism using ontologies.
We used ontologies and the EQ formalism to curate a collection of 47 phylogenetic studies on ostariophysan fishes (including catfishes, characins, minnows, knifefishes) and their relatives with the goal of integrating these complex phenotype descriptions with information from an existing model organism database (zebrafish, http://zfin.org). We developed a curation workflow for the collection of character, taxonomic and specimen data from these publications. A total of 4,617 phenotypic characters (10,512 states) for 3,449 taxa, primarily species, were curated into EQ formalism (for a total of 12,861 EQ statements) using anatomical and taxonomic terms from teleost-specific ontologies (Teleost Anatomy Ontology and Teleost Taxonomy Ontology) in combination with terms from a quality ontology (Phenotype and Trait Ontology). Standards and guidelines for consistently and accurately representing phenotypes were developed in response to the challenges that were evident from two annotation experiments and from feedback from curators.
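As a rough illustration of what an EQ statement looks like as data, the following Python sketch pairs an anatomical entity term with a quality term for a given taxon. The class names, the rendering format and the term identifiers are all hypothetical placeholders chosen for this example; they are not taken from the Phenoscape curation tools or from the ontologies themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """A term drawn from an ontology (identifiers below are placeholders)."""
    ontology: str
    term_id: str
    label: str


@dataclass(frozen=True)
class EQStatement:
    """One Entity-Quality annotation attached to a taxon."""
    taxon: Term
    entity: Term
    quality: Term

    def render(self) -> str:
        # Human-readable form of the annotation: taxon, then entity and quality.
        return f"{self.taxon.label}: E={self.entity.label}, Q={self.quality.label}"


# Hypothetical example; the IDs are placeholders, not real ontology identifiers.
example = EQStatement(
    taxon=Term("TTO", "TTO:0000000", "Danio rerio"),
    entity=Term("TAO", "TAO:0000000", "dorsal fin"),
    quality=Term("PATO", "PATO:0000000", "present"),
)
```

Keeping each slot as a structured term rather than free text is what would let such statements be compared and linked across databases.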
The challenges we encountered and many of the curation standards and methods for improving consistency that we developed are generally applicable to any effort to represent phenotypes using ontologies. This is because an ontological representation of the detailed variations in phenotype, whether between mutant and wildtype, among individual humans, or across the diversity of species, requires a process by which a precise combination of terms from domain ontologies is selected and organized according to logical relations. The efficiencies that we have developed in this process will be useful for any attempt to annotate complex phenotypic descriptions using ontologies. We also discuss some ramifications of EQ representation for the domain of systematics.
In bioinformatics projects, scientific workflow systems are widely used to manage computational procedures. Full-featured workflow systems have been proposed to fulfil the demand for workflow management. However, such systems tend to be too heavyweight for actual bioinformatics practice. We observe that scientific workflow management often prioritizes quick deployment of cutting-edge software implementing advanced algorithms and data formats, and continuous adaptation to changes in computational resources and the environment. These priorities have a greater affinity with the agile software development method, which proceeds through iterative development phases of trial and error.
Here, we show the application of a scientific workflow system Pwrake to bioinformatics workflows. Pwrake is a parallel workflow extension of Ruby's standard build tool Rake, the flexibility of which has been demonstrated in the astronomy domain. Therefore, we hypothesize that Pwrake also has advantages in actual bioinformatics workflows.
We implemented the Pwrake workflows to process next generation sequencing data using the Genome Analysis Toolkit (GATK) and Dindel. GATK and Dindel workflows are typical examples of sequential and parallel workflows, respectively. We found that in practice, actual scientific workflow development iterates over two phases, the workflow definition phase and the parameter adjustment phase. We introduced separate workflow definitions to help focus on each of the two developmental phases, as well as helper methods to simplify the descriptions. This approach increased iterative development efficiency. Moreover, we implemented combined workflows to demonstrate modularity of the GATK and Dindel workflows.
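The separation between the workflow definition phase and the parameter adjustment phase can be sketched in a few lines of Python. This is only an analogy: it is not Pwrake or Rake syntax, the task names are invented, and the command strings are placeholders rather than real GATK or Dindel invocations.

```python
# Parameter adjustment phase: values a developer tunes between runs (hypothetical).
PARAMS = {"threads": 4, "min_quality": 20}

# Workflow definition phase: target -> (prerequisites, action). Filled by @task.
TASKS = {}


def task(target, deps=()):
    """Register a function as the action that produces `target`."""
    def register(fn):
        TASKS[target] = (list(deps), fn)
        return fn
    return register


@task("aligned", deps=["reads"])
def align(params):
    return f"align threads={params['threads']}"  # placeholder command


@task("calls", deps=["aligned"])
def call_variants(params):
    return f"call min_quality={params['min_quality']}"  # placeholder command


def plan(target, params, seen=None):
    """Return the commands needed for `target`, prerequisites first.

    Targets without a registered task (here "reads") are treated as
    existing input files and contribute no command.
    """
    seen = set() if seen is None else seen
    if target in seen or target not in TASKS:
        return []
    seen.add(target)
    deps, action = TASKS[target]
    commands = []
    for dep in deps:
        commands.extend(plan(dep, params, seen))
    commands.append(action(params))
    return commands
```

Because the task graph never mentions concrete values, `plan("calls", {"threads": 8, "min_quality": 30})` re-plans the same workflow under new parameters without touching the definitions.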
Pwrake enables agile management of scientific workflows in the bioinformatics domain. The internal domain specific language design built on Ruby gives the flexibility of rakefiles for writing scientific workflows. Furthermore, readability and maintainability of rakefiles may facilitate sharing workflows among the scientific community. Workflows for GATK and Dindel are available at http://github.com/misshie/Workflows.
Ontologies are commonly used in biomedicine to organize concepts to describe domains such as anatomies, environments, experiments, taxonomies, etc. NCBO BioPortal currently hosts about 180 different biomedical ontologies. These ontologies have been mainly expressed in either the Open Biomedical Ontology (OBO) format or the Web Ontology Language (OWL). OBO emerged from the Gene Ontology, and supports most of the biomedical ontology content. In comparison, OWL is a Semantic Web language, and is supported by the World Wide Web Consortium together with integral query languages, rule languages and distributed infrastructure for information interchange. These features are highly desirable for the OBO content as well. A convenient method for leveraging these features for OBO ontologies is by transforming OBO ontologies to OWL.
We have developed a methodology for translating OBO ontologies to OWL using the organization of the Semantic Web itself to guide the work. The approach reveals that the constructs of OBO can be grouped together to form a similar layer cake. Thus we were able to decompose the problem into two parts. Most OBO constructs have easy and obvious equivalence to a construct in OWL. A small subset of OBO constructs requires deeper consideration. We have defined transformations for all constructs in an effort to foster a standard common mapping between OBO and OWL. Our mapping produces OWL-DL, a Description Logics based subset of OWL with desirable computational properties for efficiency and correctness. Our Java implementation of the mapping is part of the official Gene Ontology project source.
Our transformation system provides a lossless roundtrip mapping for OBO ontologies, i.e. an OBO ontology may be translated to OWL and back without loss of knowledge. In addition, it provides a roadmap for bridging the gap between the two ontology languages in order to enable the use of ontology content in a language independent manner.
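The idea of a lossless roundtrip can be illustrated with a deliberately tiny Python sketch that handles only three OBO tags (`id`, `name`, `is_a`) and represents the OWL side as simple subject-predicate-object tuples. This is an assumption-laden toy, not the authors' Java mapping: the term identifiers are invented, and real OBO and OWL documents contain many more constructs than are handled here.

```python
# A made-up OBO fragment; the EX: identifiers are hypothetical.
OBO_TEXT = """\
[Term]
id: EX:0000001
name: example parent

[Term]
id: EX:0000002
name: example child
is_a: EX:0000001 ! example parent
"""


def parse_obo_terms(text):
    """Parse [Term] stanzas, keeping only the id, name and is_a tags."""
    terms, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "[Term]":
            current = {"id": None, "name": None, "is_a": []}
            terms.append(current)
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            value = value.split(" ! ", 1)[0].strip()  # drop trailing OBO comment
            if tag == "is_a":
                current["is_a"].append(value)
            elif tag in ("id", "name"):
                current[tag] = value
    return terms


def to_triples(terms):
    """Express each term as OWL-style triples (class, label, subclass axioms)."""
    triples = []
    for term in terms:
        triples.append((term["id"], "rdf:type", "owl:Class"))
        triples.append((term["id"], "rdfs:label", term["name"]))
        for parent in term["is_a"]:
            triples.append((term["id"], "rdfs:subClassOf", parent))
    return triples


def from_triples(triples):
    """Rebuild the term dictionaries from the triples."""
    by_id = {}
    for subject, predicate, obj in triples:
        term = by_id.setdefault(subject, {"id": subject, "name": None, "is_a": []})
        if predicate == "rdfs:label":
            term["name"] = obj
        elif predicate == "rdfs:subClassOf":
            term["is_a"].append(obj)
    return list(by_id.values())
```

For this restricted subset, translating to triples and back reproduces the parsed terms exactly, which is the property the roundtrip claim describes at full scale.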
The formal description of experiments for efficient analysis, annotation and sharing of results is a fundamental part of the practice of science. Ontologies are required to achieve this objective. A few subject-specific ontologies of experiments currently exist. However, despite the unity of scientific experimentation, no general ontology of experiments exists. We propose the ontology EXPO to meet this need. EXPO links SUMO (the Suggested Upper Merged Ontology) with subject-specific ontologies of experiments by formalizing the generic concepts of experimental design, methodology and results representation. EXPO is expressed in the W3C standard ontology language OWL-DL. We demonstrate the utility of EXPO and its ability to describe different experimental domains by applying it to two experiments: one in high-energy physics and the other in phylogenetics. The use of EXPO made the goals and structure of these experiments more explicit, revealed ambiguities, and highlighted an unexpected similarity. We conclude that EXPO is of general value in describing experiments and a step towards the formalization of science.
ontology; formalization; annotation; artificial intelligence; metadata
A variety of key activities within life sciences research involves integrating and intelligently managing large amounts of biochemical information. Semantic technologies provide an intuitive way to organise and sift through these rapidly growing datasets via the design and maintenance of ontology-supported knowledge bases. To this end, OWL, a W3C standard declarative language, has been extensively used in the deployment of biochemical ontologies that can be conveniently organised using the classification facilities of OWL-based tools. One of the most established ontologies for the chemical domain is ChEBI, an open-access dictionary of molecular entities that supplies high quality annotation and taxonomical information for biologically relevant compounds. However, ChEBI is being expanded manually, which hinders its potential to grow due to the limited availability of human resources.
In this work, we describe a prototype that performs automatic classification of chemical compounds. The software we present implements a sound and complete reasoning procedure for a formalism that extends datalog and builds upon an off-the-shelf deductive database system. We capture a wide range of chemical classes that are not expressible with OWL-based formalisms such as cyclic molecules, saturated molecules and alkanes. Furthermore, we describe a surface ‘less-logician-like’ syntax that allows application experts to create ontological descriptions of complex biochemical objects without prior knowledge of logic. In terms of performance, a noticeable improvement is observed in comparison with previous approaches. Our evaluation has discovered subsumptions that are missing from the manually curated ChEBI ontology as well as discrepancies with respect to existing subclass relations. We thus illustrate the potential of an ontology language suitable for the life sciences domain that exhibits a favourable balance between expressive power and practical feasibility.
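To give a feel for why a class such as "cyclic molecule" depends on the structure of the whole bond graph, the following Python sketch tests a molecule's skeleton for a ring using a union-find pass over its bonds. This is an ordinary graph algorithm written for illustration; it is not the datalog extension or the reasoning procedure described above, and the encoding of a molecule as an atom count plus a list of index pairs is an assumption of this example.

```python
def is_cyclic(num_atoms, bonds):
    """Return True if the undirected bond graph contains a ring.

    `bonds` is a list of (atom_index, atom_index) pairs; each bond is
    assumed to be listed once.
    """
    parent = list(range(num_atoms))

    def find(x):
        # Follow parent links to the component root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in bonds:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            # The two atoms were already connected, so this bond closes a ring.
            return True
        parent[root_a] = root_b
    return False


# Carbon skeletons only (hydrogens omitted).
CYCLOPROPANE = (3, [(0, 1), (1, 2), (2, 0)])
PROPANE = (3, [(0, 1), (1, 2)])
```

Here `is_cyclic(*CYCLOPROPANE)` is true and `is_cyclic(*PROPANE)` is false; a rule-based classifier would state the same condition declaratively rather than procedurally.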
Our proposed methodology can form the basis of an ontology-mediated application to assist biocurators in the production of complete and error-free taxonomies. Moreover, such a tool could contribute to a more rapid development of the ChEBI ontology and to the efforts of the ChEBI team to make annotated chemical datasets available to the public. From a modelling point of view, our approach could stimulate the adoption of a different and expressive reasoning paradigm based on rules for which state-of-the-art and highly optimised reasoners are available; it could thus pave the way for the representation of a broader spectrum of life sciences and biomedical knowledge.
Semantic technologies; Knowledge representation and reasoning; Logic programming and answer set programming; Datalog extensions; Cheminformatics
An in-silico experiment can be naturally specified as a workflow of activities implementing, in a standardized environment, the process of data and control analysis. A workflow has the advantage of being reproducible, traceable and compositional, since it can reuse other workflows. In order to support the daily work of a bioscientist, several Workflow Management Systems (WMSs) have been proposed in bioinformatics. Generally, these systems centralize the workflow enactment and do not exploit standard process definition languages to describe workflows in a reusable way. While almost all WMSs require heavy stand-alone applications to specify new workflows, only a few of them provide a web-based process definition tool.
We have developed BioWMS, a Workflow Management System that supports, through a web-based interface, the definition, the execution and the results management of an in-silico experiment. BioWMS has been implemented over an agent-based middleware. It dynamically generates, from a user workflow specification, a domain-specific, agent-based workflow engine. Our approach exploits the proactiveness and mobility of the agent-based technology to embed the application domain features inside the agents' behaviour. Agents are workflow executors and the resulting workflow engine is a multiagent system (a distributed, concurrent system) that is typically open, flexible, and adaptive. A demo is available at .
BioWMS, supported by the Hermes mobile computing middleware, guarantees the flexibility, scalability and fault tolerance required for workflow enactment over distributed and heterogeneous environments. BioWMS is funded by the FIRB project LITBIO (Laboratory for Interdisciplinary Technologies in Bioinformatics).
The Open Biomedical Ontologies (OBO) Foundry is a collection of freely available ontologically structured controlled vocabularies in the biomedical domain. Most of them are disseminated via both the OBO Flatfile Format and the semantic web format Web Ontology Language (OWL), which draws upon formal logic. Based on the interpretations underlying OWL description logics (OWL-DL) semantics, we scrutinize the OWL-DL releases of OBO ontologies to assess whether their logical axioms correspond to the meaning intended by their authors.
We analyzed ontologies and ontology cross products available via the OBO Foundry site http://www.obofoundry.org for existential restrictions (someValuesFrom), from which we examined a random sample of 2,836 clauses.
According to a rating done by four experts, 23% of all existential restrictions in OBO Foundry candidate ontologies are suspicious (Cohen's κ = 0.78). We found that a smaller proportion of existential restrictions in OBO Foundry cross products are suspicious, but in this case an accurate quantitative judgment is not possible due to low inter-rater agreement (κ = 0.07). We identified several typical modeling problems, for which satisfactory ontology design patterns based on OWL-DL were proposed. We further describe several usability issues with OBO ontologies, including the lack of ontological commitment for several common terms, and the proliferation of domain-specific relations.
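For readers unfamiliar with the agreement statistic, the following Python sketch computes Cohen's κ for two raters as (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The ratings are invented toy data (1 = "suspicious", 0 = "acceptable"), and the sketch covers only the two-rater case; how the study combined the judgments of its four experts is not stated here and is not reproduced.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Toy ratings for ten clauses (hypothetical data).
RATER_A = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]
RATER_B = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
```

In this toy case the raters agree on 8 of 10 items (p_o = 0.8) while chance agreement is 0.3 × 0.3 + 0.7 × 0.7 = 0.58, giving κ = 0.22 / 0.42 ≈ 0.52.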
The current OWL releases of OBO Foundry (and Foundry candidate) ontologies contain numerous assertions which do not properly describe the underlying biological reality, or are ambiguous and difficult to interpret. The solution is a better anchoring in upper ontologies and a restriction to relatively few, well defined relation types with given domain and range constraints.
The diversity and the largely independent nature of chemical research efforts over the past half century are, most likely, the major contributors to the current poor state of chemical computational resource and database interoperability. While open software for chemical format interconversion and database entry cross-linking has partially addressed database interoperability, computational resource integration is hindered by the great diversity of software interfaces, languages, access methods, and platforms, among others. This has, in turn, translated into limited reproducibility of computational experiments and the need for application-specific computational workflow construction and semi-automated enactment by human experts, especially where emerging interdisciplinary fields, such as systems chemistry, are pursued. Fortunately, the advent of the Semantic Web, and the very recent introduction of RESTful Semantic Web Services (SWS), may present an opportunity to integrate all of the existing computational and database resources in chemistry into a machine-understandable, unified system that draws on the entirety of the Semantic Web.
We have created a prototype framework of Semantic Automated Discovery and Integration (SADI) SWS that exposes the QSAR descriptor functionality of the Chemistry Development Kit. Since each of these services has formal ontology-defined input and output classes, and each service consumes and produces RDF graphs, clients can automatically reason about the services and available reference information necessary to complete a given overall computational task specified through a simple SPARQL query. We demonstrate this capability by carrying out QSAR analysis backed by a simple formal ontology to determine whether a given molecule is drug-like. Further, we discuss parameter-based control over the execution of SADI SWS. Finally, we demonstrate the value of computational resource envelopment as SADI services through service reuse and ease of integration of computational functionality into formal ontologies.
The work we present here may trigger a major paradigm shift in the distribution of computational resources in chemistry. We conclude that envelopment of chemical computational resources as SADI SWS facilitates interdisciplinary research by enabling the definition of computational problems in terms of ontologies and formal logical statements instead of cumbersome and application-specific tasks and workflows.
Hypothesis generation in molecular and cellular biology is an empirical process in which knowledge derived from prior experiments is distilled into a comprehensible model. The requirement of automated support is exemplified by the difficulty of considering all relevant facts that are contained in the millions of documents available from PubMed. Semantic Web provides tools for sharing prior knowledge, while information retrieval and information extraction techniques enable its extraction from literature. Their combination makes prior knowledge available for computational analysis and inference. While some tools provide complete solutions that limit the control over the modeling and extraction processes, we seek a methodology that supports control by the experimenter over these critical processes.
We describe progress towards automated support for the generation of biomolecular hypotheses. Semantic Web technologies are used to structure and store knowledge, while a workflow extracts knowledge from text. We designed minimal proto-ontologies in OWL for capturing different aspects of a text mining experiment: the biological hypothesis, text and documents, text mining, and workflow provenance. The models fit a methodology that allows focus on the requirements of a single experiment while supporting reuse and posterior analysis of extracted knowledge from multiple experiments. Our workflow is composed of services from the 'Adaptive Information Disclosure Application' (AIDA) toolkit as well as a few others. The output is a semantic model with putative biological relations, with each relation linked to the corresponding evidence.
We demonstrated a 'do-it-yourself' approach for structuring and extracting knowledge in the context of experimental research on biomolecular mechanisms. The methodology can be used to bootstrap the construction of semantically rich biological models using the results of knowledge extraction processes. Models specific to particular experiments can be constructed that, in turn, link with other semantic models, creating a web of knowledge that spans experiments. Mapping mechanisms can link to other knowledge resources such as OBO ontologies or SKOS vocabularies. AIDA Web Services can be used to design personalized knowledge extraction procedures. In our example experiment, we found three proteins (NF-Kappa B, p21, and Bax) potentially playing a role in the interplay between nutrients and epigenetic gene regulation.
A scientific ontology is a formal representation of knowledge within a domain, typically including central concepts, their properties, and relations. With the rise of computers and high-throughput data collection, ontologies have become essential to data mining and sharing across communities in the biomedical sciences. Powerful approaches exist for testing the internal consistency of an ontology, but not for assessing the fidelity of its domain representation. We introduce a family of metrics that describe the breadth and depth with which an ontology represents its knowledge domain. We then test these metrics using (1) four of the most common medical ontologies with respect to a corpus of medical documents and (2) seven of the most popular English thesauri with respect to three corpora that sample language from medicine, news, and novels. Here we show that our approach captures the quality of ontological representation and guides efforts to narrow the breach between ontology and collective discourse within a domain. Our results also demonstrate key features of medical ontologies, English thesauri, and discourse from different domains. Medical ontologies have a small intersection, as do English thesauri. Moreover, dialects characteristic of distinct domains vary strikingly as many of the same words are used quite differently in medicine, news, and novels. As ontologies are intended to mirror the state of knowledge, our methods to tighten the fit between ontology and domain will increase their relevance for new areas of biomedical science and improve the accuracy and power of inferences computed across them.
An ontology represents the concepts and their interrelation within a knowledge domain. Several ontologies have been developed in biomedicine, which provide standardized vocabularies to describe diseases, genes and gene products, physiological phenotypes, anatomical structures, and many other phenomena. Scientists use them to encode the results of complex experiments and observations and to perform integrative analysis to discover new knowledge. A remaining challenge in ontology development is how to evaluate an ontology's representation of knowledge within its scientific domain. Building on classic measures from information retrieval, we introduce a family of metrics including breadth and depth that capture the conceptual coverage and parsimony of an ontology. We test these measures using (1) four commonly used medical ontologies in relation to a corpus of medical documents and (2) seven popular English thesauri (ontologies of synonyms) with respect to text from medicine, news, and novels. Results demonstrate that both medical ontologies and English thesauri have a small overlap in concepts and relations. Our methods suggest efforts to tighten the fit between ontologies and biomedical knowledge.
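The classic information-retrieval measures that such metrics build on can be sketched in Python by treating an ontology and a corpus as two sets of concepts. The sketch computes plain precision and recall only; it does not reproduce the breadth and depth metrics introduced above, whose exact definitions are not given here, and the concept sets are invented toy data.

```python
def precision_recall(ontology_concepts, corpus_concepts):
    """Precision and recall of an ontology's concepts against a corpus.

    precision: share of ontology concepts that actually occur in the corpus.
    recall:    share of corpus concepts that the ontology covers.
    """
    ontology_concepts = set(ontology_concepts)
    corpus_concepts = set(corpus_concepts)
    shared = ontology_concepts & corpus_concepts
    precision = len(shared) / len(ontology_concepts) if ontology_concepts else 0.0
    recall = len(shared) / len(corpus_concepts) if corpus_concepts else 0.0
    return precision, recall


# Hypothetical toy data.
ONTOLOGY = {"fever", "cough", "rash", "unused term"}
CORPUS = {"fever", "cough", "fatigue"}
```

With these sets, two of the four ontology concepts occur in the corpus (precision 0.5) and two of the three corpus concepts are covered (recall 2/3), which illustrates how a low-precision ontology carries unused concepts while a low-recall one misses parts of the discourse.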
To develop and apply formal ontology creation methods to the domain of antimicrobial prescribing and to formally evaluate the resulting ontology through intrinsic and extrinsic evaluation studies.
We extended existing ontology development methods to create the ontology and implemented the ontology using Protégé-OWL. Correctness of the ontology was assessed using a set of ontology design principles and domain expert review via the laddering technique. We created three artifacts to support the extrinsic evaluation (set of prescribing rules, alerts and an ontology-driven alert module, and a patient database) and evaluated the usefulness of the ontology for performing knowledge management tasks to maintain the ontology and for generating alerts to guide antibiotic prescribing.
The ontology includes 199 classes, 10 properties, and 1,636 description logic restrictions. Twenty-three Semantic Web Rule Language rules were written to generate three prescribing alerts: 1) antibiotic-microorganism mismatch alert; 2) medication-allergy alert; and 3) non-recommended empiric antibiotic therapy alert. The evaluation studies confirmed the correctness of the ontology, usefulness of the ontology for representing and maintaining antimicrobial treatment knowledge rules, and usefulness of the ontology for generating alerts to provide feedback to clinicians during antibiotic prescribing.
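The logic behind a medication-allergy alert can be sketched in plain Python as a check of each prescribed drug, and of its drug class, against a patient's recorded allergies. This is an illustrative stand-in, not the Semantic Web Rule Language rules or the alert module described above; the lookup table, function name and alert wording are hypothetical and cover only two drugs.

```python
# Minimal drug-to-class lookup for the example (deliberately incomplete).
DRUG_CLASS = {
    "amoxicillin": "penicillin",
    "ciprofloxacin": "fluoroquinolone",
}


def allergy_alerts(prescribed, allergies):
    """Return one alert per prescribed drug that matches a recorded allergy.

    A drug matches if the drug itself or its class appears in `allergies`.
    """
    alerts = []
    for drug in prescribed:
        drug_class = DRUG_CLASS.get(drug)
        if drug in allergies or (drug_class is not None and drug_class in allergies):
            alerts.append(f"medication-allergy alert: {drug}")
    return alerts
```

For a patient with a recorded penicillin allergy, `allergy_alerts(["amoxicillin", "ciprofloxacin"], {"penicillin"})` flags only amoxicillin; in an ontology-driven system the class membership would come from the ontology rather than from a hard-coded table.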
This study contributes to the understanding of ontology development and evaluation methods and addresses one knowledge gap related to using ontologies as a clinical decision support system component—a need for formal ontology evaluation methods to measure their quality from the perspective of their intrinsic characteristics and their usefulness for specific tasks.
Ontology; Clinical decision support; Evaluation
The AMBIT web services package is one of the several existing independent implementations of the OpenTox Application Programming Interface and is built according to the principles of the Representational State Transfer (REST) architecture. The Open Source Predictive Toxicology Framework, developed by the partners in the EC FP7 OpenTox project, aims at providing unified access to toxicity data and predictive models, as well as validation procedures. This is achieved by i) an information model, based on a common OWL-DL ontology; ii) links to related ontologies; iii) data and algorithms, available through a standardized REST web services interface, where every compound, data set or predictive method has a unique web address, used to retrieve its Resource Description Framework (RDF) representation, or initiate the associated calculations.
The AMBIT web services package has been developed as an extension of AMBIT modules, adding the ability to create (Quantitative) Structure-Activity Relationship (QSAR) models and providing an OpenTox API compliant interface. The representation of data and processing resources in W3C Resource Description Framework facilitates integrating the resources as Linked Data. Uploaded datasets with chemical structures and an arbitrary set of properties become automatically available online in several formats. The services provide unified interfaces to several descriptor calculation, machine learning and similarity searching algorithms, as well as to applicability domain and toxicity prediction models. All Toxtree modules for predicting the toxicological hazard of chemical compounds are also integrated within this package. The complexity and diversity of the processing is reduced to the simple paradigm "read data from a web address, perform processing, write to a web address". The online service allows users to run predictions easily, without installing any software, as well as to share datasets and models online. The downloadable web application allows researchers to set up an arbitrary number of service instances for specific purposes and at suitable locations. These services could be used as a distributed framework for processing of resource-intensive tasks and data sharing or in a fully independent way, according to the specific needs. The advantage of exposing the functionality via the OpenTox API is seamless interoperability, not only within a single web application, but also in a network of distributed services. Last, but not least, the services provide a basis for building web mashups, end user applications with friendly GUIs, as well as embedding the functionalities in existing workflow systems.
Biobanks are a critical resource for translational science. Recently, semantic web technologies such as ontologies have been found useful in retrieving research data from biobanks. However, recent research has also shown that there is a lack of data about the administrative aspects of biobanks. These data would be helpful to answer research-relevant questions such as what is the scope of specimens collected in a biobank, what is the curation status of the specimens, and what is the contact information for curators of biobanks. Our use cases include giving researchers the ability to retrieve key administrative data (e.g. contact information, contact's affiliation, etc.) about the biobanks where specific specimens of interest are stored. Thus, our goal is to provide an ontology that represents the administrative entities in biobanking and their relations. We base our ontology development on a set of 53 data attributes called MIABIS, which were in part the result of semantic integration efforts of the European Biobanking and Biomolecular Resources Research Infrastructure (BBMRI). The previous work on MIABIS provided the domain analysis for our ontology. We report on a test of our ontology against competency questions that we derived from the initial BBMRI use cases. Future work includes additional ontology development to answer additional competency questions from these use cases.
We created an open-source ontology of biobank administration called Ontologized MIABIS (OMIABIS) coded in OWL 2.0 and developed according to the principles of the OBO Foundry. It re-uses pre-existing ontologies when possible in cooperation with developers of other ontologies in related domains, such as the Ontology for Biomedical Investigations. OMIABIS provides a formalized representation of biobanks and their administration. Using the ontology and a set of Description Logic queries derived from the competency questions that we identified, we were able to retrieve test data with perfect accuracy. In addition, we began development of a mapping from the ontology to pre-existing biobank data structures commonly used in the U.S.
In conclusion, we created OMIABIS, an ontology of biobank administration. We found that basing its development on pre-existing resources to meet the BBMRI use cases resulted in a biobanking ontology that is re-useable in environments other than BBMRI. Our ontology retrieved all true positives and no false positives when queried according to the competency questions we derived from the BBMRI use cases. Mapping OMIABIS to a data structure used for biospecimen collections in a medical center in Little Rock, AR showed adequate coverage of our ontology.
The importance of ontologies in the biomedical domain is generally recognized. However, their quality is often too poor for large-scale use in critical applications, at least partially due to insufficient training of ontology developers.
To show the efficacy of guideline-based ontology development training on the performance of ontology developers. The hypothesis was that students who received training on top-level ontologies and design patterns perform better than those who only received training in the basic principles of formal ontology engineering.
A curriculum was implemented based on a guideline for ontology design. A randomized controlled trial on the efficacy of this curriculum was performed with 24 students from bioinformatics and related fields. After joint training on the fundamentals of ontology development the students were randomly allocated to two groups. During the intervention, each group received training on different topics in ontology development. In the assessment phase, all students were asked to solve modeling problems on topics taught differentially in the intervention phase. Primary outcome was the similarity of the students’ ontology artefacts compared with gold standard ontologies developed by the authors before the experiment; secondary outcome was the intra-group similarity of group members’ ontologies.
The experiment showed no significant effect of the guideline-based training on the performance of ontology developers: (a) the ontologies developed after specific training were only slightly, and not significantly, closer to the gold standard ontologies than the ontologies developed without prior specific training; (b) although significant differences for certain ontologies were detected, the intra-group similarity was not consistently influenced in one direction by the differential training.
Methodologically limited, this study cannot be interpreted as a general failure of a guideline-based approach to ontology development. Further research is needed to increase insight into whether specific development guidelines and practices in ontology design are effective.
Biomedical ontologies are being widely used to annotate biological data in a computer-accessible, consistent and well-defined manner. However, due to their size and complexity, annotating data with appropriate terms from an ontology is often challenging for experts and non-experts alike, because there exist few tools that allow one to quickly find relevant ontology terms to easily populate a web form.
Life-science laboratories make increasing use of Next Generation Sequencing (NGS) for studying bio-macromolecules and their interactions. Array-based methods for measuring gene expression or protein-DNA interactions are being replaced by RNA-Seq and ChIP-Seq. Sequencing is generally performed by specialized facilities that have to keep track of sequencing requests, trace samples, ensure quality and make data available according to predefined privileges.
An integrated tool helps to troubleshoot problems, to maintain a high quality standard, and to reduce time and costs. Commercial and non-commercial tools called LIMS (Laboratory Information Management Systems) are available for this purpose. However, they often come at prohibitive cost and/or lack the flexibility and scalability needed to adjust seamlessly to the frequently changing protocols employed.
In order to manage the flow of sequencing data produced at the Genomic Unit of the Italian Institute of Technology (IIT), we developed SMITH (Sequencing Machine Information Tracking and Handling).
SMITH is a web application with a MySQL server at the backend. It was developed by wet-lab scientists of the Centre for Genomic Science and database experts from the Politecnico of Milan in the context of a Genomic Data Model Project. The database schema stores all the information of an NGS experiment, including the descriptions of all protocols and algorithms used in the process. Notably, an attribute-value table allows an unconstrained textual description to be associated with each sample and with all the data subsequently produced from it. This method permits the creation of metadata that can be used to search the database for specific files as well as for statistical analyses.
SMITH runs automatically and limits direct human interaction mainly to administrative tasks. SMITH data-delivery procedures were standardized, making it easier for biologists and analysts to navigate the data. Automation also helps save time. The workflows are available through an API provided by the workflow management system. The parameters and input data are passed to the workflow engine that performs de-multiplexing, quality control, alignments, etc.
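The attribute-value idea can be illustrated with a small sketch. This is not the SMITH schema: the table names, column names, and sample values below are hypothetical, and SQLite (from the Python standard library) stands in for MySQL so that the example is self-contained.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a sample table and an attribute-value table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sample (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        -- One row per (sample, attribute, value); attributes are free text,
        -- so new kinds of metadata need no schema change.
        CREATE TABLE sample_attribute (
            sample_id INTEGER NOT NULL REFERENCES sample(id),
            attr TEXT NOT NULL,
            value TEXT NOT NULL
        );
    """)
    # Hypothetical samples and annotations, for illustration only.
    conn.executemany("INSERT INTO sample (id, name) VALUES (?, ?)",
                     [(1, "S1"), (2, "S2"), (3, "S3")])
    conn.executemany(
        "INSERT INTO sample_attribute (sample_id, attr, value) VALUES (?, ?, ?)",
        [(1, "antibody", "H3K4me3"), (1, "cell_line", "HeLa"),
         (2, "antibody", "H3K27ac"), (2, "cell_line", "HeLa"),
         (3, "antibody", "H3K4me3"), (3, "cell_line", "K562")])
    return conn


def find_samples(conn, **criteria):
    """Return the names of samples matching every attribute=value pair given."""
    if not criteria:
        return [row[0] for row in conn.execute("SELECT name FROM sample ORDER BY name")]
    clauses, params = [], []
    for attr, value in criteria.items():
        clauses.append(
            "id IN (SELECT sample_id FROM sample_attribute WHERE attr = ? AND value = ?)")
        params.extend([attr, value])
    sql = "SELECT name FROM sample WHERE " + " AND ".join(clauses) + " ORDER BY name"
    return [row[0] for row in conn.execute(sql, params)]
```

Calling `find_samples(conn, antibody="H3K4me3", cell_line="HeLa")` on the demo database selects the samples that carry both annotations. The trade-off of this layout is that values are untyped text, so consistency of attribute names has to be enforced by convention or by the application.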
SMITH standardizes, automates, and speeds up sequencing workflows. Annotation of data with key-value pairs facilitates meta-analysis.
The indexing of scientific literature and content is a relevant and contemporary requirement within life science information systems. Navigating information available in legacy formats continues to be a challenge both in enterprise and academic domains. The emergence of semantic web technologies and their fusion with artificial intelligence techniques has provided a new toolkit with which to address these data integration challenges. In the emerging field of lipidomics such navigation challenges are barriers to the translation of scientific results into actionable knowledge, critical to the treatment of diseases such as Alzheimer's syndrome, Mycobacterium infections and cancer.
We present a literature-driven workflow involving document delivery and natural language processing steps that generate tagged sentences containing lipid, protein and disease names, which are instantiated into a custom-designed lipid ontology. We describe the design challenges in capturing lipid nomenclature, the mandate of the ontology and its role as query model in the navigation of the lipid bibliosphere. We illustrate the extent of the description logic-based A-box query capability provided by the instantiated ontology using a graphical query composer to query sentences describing lipid-protein and lipid-disease correlations.
As scientists accept the need to readjust the manner in which we search for information and derive knowledge, we illustrate a system that can constrain the literature explosion and knowledge navigation problems. Specifically, we have focussed on solving this challenge for lipidomics researchers, who have to deal with the lack of standardized vocabulary, differing classification schemes, and a wide array of synonyms before being able to derive scientific insights. The use of the OWL-DL variant of the Web Ontology Language (OWL) and description logic reasoning is pivotal in this regard, providing the lipid scientist with advanced query access to the results of text mining algorithms instantiated into the ontology. The visual query paradigm assists in the adoption of this technology.
The recent availability of high-throughput data in molecular biology has increased the need for a formal representation of this knowledge domain. New ontologies are being developed to formalize knowledge, e.g. about the functions of proteins. As the Semantic Web is being introduced into the Life Sciences, the basis for a distributed knowledge-base that can foster biological data analysis is laid. However, there still is a dichotomy, in tools and methodologies, between the use of ontologies in biological investigation, that is, in relation to experimental observations, and their use as a knowledge-base.
RDFScape is a plugin that has been developed to extend a software oriented to biological analysis with support for reasoning on ontologies in the semantic web framework. We show with this plugin how the use of ontological knowledge in biological analysis can be extended through the use of inference. In particular, we present two examples relative to ontologies representing biological pathways: we demonstrate how these can be abstracted and visualized as interaction networks, and how reasoning on causal dependencies within elements of pathways can be implemented.
The use of ontologies for the interpretation of high-throughput biological data can be improved through the use of inference. This allows the use of ontologies not only as annotations, but as a knowledge-base from which new information relevant for specific analysis can be derived.
Ontologies have increasingly been used in the biomedical domain, which has prompted the emergence of different initiatives to facilitate their development and integration. The Open Biological and Biomedical Ontologies (OBO) Foundry consortium provides a repository of life-science ontologies, which are developed according to a set of shared principles. This consortium has developed an ontology called OBO Relation Ontology aiming at standardizing the different types of biological entity classes and associated relationships. Since ontologies are primarily intended to be used by humans, the use of graphical notations for ontology development facilitates the capture, comprehension and communication of knowledge between its users. However, OBO Foundry ontologies are captured and represented basically using text-based notations. The Unified Modeling Language (UML) provides a standard and widely-used graphical notation for modeling computer systems. UML provides a well-defined set of modeling elements, which can be extended using a built-in extension mechanism named Profile. Thus, this work aims at developing a UML profile for the OBO Relation Ontology to provide a domain-specific set of modeling elements that can be used to create standard UML-based ontologies in the biomedical domain.
We have studied the OBO Relation Ontology, the UML metamodel and the UML profiling mechanism. Based on these studies, we have proposed an extension to the UML metamodel in conformance with the OBO Relation Ontology and we have defined a profile that implements the extended metamodel. Finally, we have applied the proposed UML profile in the development of a number of fragments from different ontologies. Particularly, we have considered the Gene Ontology (GO), the PRotein Ontology (PRO) and the Xenopus Anatomy and Development Ontology (XAO).
The use of an established and well-known graphical language in the development of biomedical ontologies provides a more intuitive form of capturing and representing knowledge than using only text-based notations. The use of the profile requires the domain expert to reason about the underlying semantics of the concepts and relationships being modeled, which helps prevent the introduction of inconsistencies in an ontology under development and facilitates the identification and correction of errors in an already defined ontology.
There is an increasing interest in developing ontologies and controlled vocabularies to improve the efficiency and consistency of manual literature curation, to enable more formal biocuration workflow results and ultimately to improve analysis of biological data. Two ontologies that have been successfully used for this purpose are the Gene Ontology (GO) for annotating aspects of gene products and the Molecular Interaction ontology (PSI-MI) used by databases that archive protein–protein interactions. The examination of protein interactions has proven to be extremely promising for the understanding of cellular processes. Manual mapping of information from the biomedical literature to bio-ontology terms is one of the most challenging components in the curation pipeline. It requires that expert curators interpret the natural language descriptions contained in articles and infer their semantic equivalents in the ontology (controlled vocabulary). Since manual curation is a time-consuming process, there is strong motivation to implement text-mining techniques to automatically extract annotations from free text. A range of text mining strategies has been devised to assist in the automated extraction of biological data. These strategies either recognize technical terms used recurrently in the literature and propose them as candidates for inclusion in ontologies, or retrieve passages that serve as evidential support for annotating an ontology term, e.g. from the PSI-MI or GO controlled vocabularies. Here, we provide a general overview of current text-mining methods to automatically extract annotations of GO and PSI-MI ontology terms in the context of the BioCreative (Critical Assessment of Information Extraction Systems in Biology) challenge. Special emphasis is given to protein–protein interaction data and PSI-MI terms referring to interaction detection methods.
Linked Science is the practice of inter-connecting scientific assets by publishing, sharing and linking scientific data and processes in end-to-end loosely coupled workflows that allow the sharing and re-use of scientific data. Much of this data does not live in the cloud or on the Web, but rather in multi-institutional data centers that provide tools and add value through quality assurance, validation, curation, dissemination, and analysis of the data. In this paper, we make the case for the use of scientific scenarios in Linked Science. We propose a scenario in river-channel transport that requires biogeochemical experimental data and global climate-simulation model data from many sources. We focus on the use of ontologies—formal machine-readable descriptions of the domain—to facilitate search and discovery of this data. Mercury, developed at Oak Ridge National Laboratory, is a tool for distributed metadata harvesting, search and retrieval. Mercury currently provides uniform access to more than 100,000 metadata records; 30,000 scientists use it each month. We augmented search in Mercury with ontologies, such as the ontologies in the Semantic Web for Earth and Environmental Terminology (SWEET) collection by prototyping a component that provides access to the ontology terms from Mercury. We evaluate the coverage of SWEET for the ORNL Distributed Active Archive Center (ORNL DAAC).
Linked Science; ontologies; BioPortal; semantic search; climate change; data discovery
Most biomedical ontologies are represented in the OBO Flatfile Format, which is an easy-to-use graph-based ontology language. The semantics of the OBO Flatfile Format 1.2 enforces a strict predetermined interpretation of relationship statements between classes. It does not allow flexible specifications that provide better approximations of the intuitive understanding of the considered relations. If relations cannot be accurately expressed then ontologies built upon them may contain false assertions and hence lead to false inferences. Ontologies in the OBO Foundry must formalize the semantics of relations according to the OBO Relationship Ontology (RO). Therefore, being able to accurately express the intended meaning of relations is of crucial importance. Since the Web Ontology Language (OWL) is an expressive language with a formal semantics, it is suitable to define the meaning of relations accurately.
We developed a method to provide definition patterns for relations between classes using OWL and describe a novel implementation of the RO based on this method. We implemented our extension in software that converts ontologies in the OBO Flatfile Format to OWL, and also provide a prototype to extract relational patterns from OWL ontologies using automated reasoning. The conversion software is freely available at http://bioonto.de/obo2owl, and can be accessed via a web interface.
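To make the idea of definition patterns concrete, the following minimal sketch translates one OBO stanza into Manchester-style axiom strings. It is not the conversion software described above. The default template reflects the common reading of an OBO `relationship` line as an existential restriction; the `lacks_part` template, the `EX:` identifiers, and all function names are illustrative assumptions.

```python
# Default reading of "relationship: R C": every instance has some R-successor in C.
DEFAULT_PATTERN = "{rel} some {target}"

# Relation-specific templates override the default (hypothetical example).
PATTERNS = {
    "lacks_part": "not (has_part some {target})",
}


def strip_comment(value):
    """Drop a trailing OBO comment introduced by '!'."""
    return value.split("!", 1)[0].strip()


def stanza_to_axioms(stanza):
    """Translate the id, is_a and relationship tags of one [Term] stanza."""
    term_id = None
    superclasses = []
    for line in stanza.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("[") or ":" not in line:
            continue
        tag, value = line.split(":", 1)
        tag, value = tag.strip(), strip_comment(value)
        if tag == "id":
            term_id = value
        elif tag == "is_a":
            superclasses.append(value)
        elif tag == "relationship":
            rel, target = value.split(None, 1)
            pattern = PATTERNS.get(rel, DEFAULT_PATTERN)
            superclasses.append(pattern.format(rel=rel, target=target))
    if term_id is None:
        raise ValueError("stanza has no id tag")
    return [f"{term_id} SubClassOf {expr}" for expr in superclasses]


EXAMPLE = """
[Term]
id: EX:0000001
is_a: EX:0000002 ! hypothetical parent class
relationship: part_of EX:0000003 ! hypothetical whole
relationship: lacks_part EX:0000004
"""
```

With the default template alone, `lacks_part` would be read as "has some `lacks_part` successor", which is not what a curator intends; keeping the template per relation is what allows the intended meaning to be stated explicitly.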
Explicitly defining relations permits their use in reasoning software and leads to a more flexible and powerful way of representing biomedical ontologies. Using the extended language and semantics avoids several mistakes commonly made in formalizing biomedical ontologies, and can be used to automatically detect inconsistencies. The use of our method enables the use of graph-based ontologies in OWL, and makes complex OWL ontologies accessible in a graph-based form. Thereby, our method provides the means to gradually move the representation of biomedical ontologies into formal knowledge representation languages that incorporate an explicit semantics. Our method facilitates the use of OWL-based software in the back-end while ontology curators may continue to develop ontologies with an OBO-style front-end.
Text definitions for entities within bio-ontologies are a cornerstone of the effort to gain a consensus in understanding and usage of those ontologies. Writing these definitions is, however, a considerable effort and there is often a lag between specification of the main part of an ontology (logical descriptions and definitions of entities) and the development of the text-based definitions. The goal of natural language generation (NLG) from ontologies is to take the logical description of entities and generate fluent natural language. The application described here uses NLG to automatically provide text-based definitions from an ontology that has logical descriptions of its entities, so avoiding the bottleneck of authoring these definitions by hand.
To produce the descriptions, the program collects all the axioms relating to a given entity, groups them according to common structure, realises each group through an English sentence, and assembles the resulting sentences into a paragraph, to form as ‘coherent’ a text as possible without human intervention. Sentence generation is accomplished using a generic grammar based on logical patterns in OWL, together with a lexicon for realising atomic entities. We have tested our output for the Experimental Factor Ontology (EFO) using a simple survey strategy to explore the fluency of the generated text and how well it conveys the underlying axiomatisation. Two rounds of survey and improvement show that overall the generated English definitions are found to convey the intended meaning of the axiomatisation in a satisfactory manner. The surveys also suggested that one form of generated English will not be universally liked; that intrusion of too much ‘formal ontology’ was not liked; and that too much explicit exposure of OWL semantics was also not liked.
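The collect, group, realise and assemble steps can be sketched in a few lines. This is a toy illustration under stated assumptions, not the generic grammar used by the tool: axioms are reduced to hypothetical `(kind, property, filler)` triples, only two axiom kinds are handled, and the article choice is a rough first-letter heuristic.

```python
def article(noun):
    """Pick 'a' or 'an' from the first letter (a rough heuristic)."""
    return "an" if noun[0].lower() in "aeiou" else "a"


def join_list(items):
    """Join phrases as 'x', 'x and y', or 'x, y, and z'."""
    if len(items) == 1:
        return items[0]
    if len(items) == 2:
        return f"{items[0]} and {items[1]}"
    return ", ".join(items[:-1]) + f", and {items[-1]}"


def define(entity, axioms, lexicon):
    """Generate a short definition paragraph for one entity.

    axioms  -- (kind, property, filler) triples about the entity, where kind is
               'subclass' (property is None) or 'some' (existential restriction)
    lexicon -- maps entity and property identifiers to English phrases
    """
    name = lexicon.get(entity, entity)
    # Group axioms that share the same logical structure and property.
    groups = {}
    for kind, prop, filler in axioms:
        groups.setdefault((kind, prop), []).append(lexicon.get(filler, filler))
    # Realise each group as one sentence, aggregating its fillers.
    sentences = []
    for (kind, prop), fillers in groups.items():
        phrases = [f"{article(f)} {f}" for f in fillers]
        if kind == "subclass":
            sentences.append(f"{article(name).capitalize()} {name} is {join_list(phrases)}.")
        elif kind == "some":
            verb = lexicon.get(prop, prop.replace("_", " "))
            sentences.append(f"It {verb} {join_list(phrases)}.")
        else:
            raise ValueError(f"unsupported axiom kind: {kind}")
    # Assemble the sentences into a paragraph.
    return " ".join(sentences)
```

For example, `define("liver cell", [("subclass", None, "cell"), ("some", "part_of", "liver"), ("some", "part_of", "organism")], {"part_of": "is part of"})` aggregates the two `part_of` axioms into a single sentence rather than emitting one sentence per axiom, which is the main benefit of grouping by common structure.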
Our prototype tools can generate reasonable paragraphs of English text that can act as definitions. The definitions were found acceptable by our survey and, as a result, the developers of EFO are sufficiently satisfied with the output that the generated definitions have been incorporated into EFO. Whilst not a substitute for hand-written textual definitions, our generated definitions are a useful starting point.
An on-line version of the NLG text definition tool can be found at http://swat.open.ac.uk/tools/. The questionnaire and sample generated text definitions may be found at http://mcs.open.ac.uk/nlg/SWAT/bio-ontologies.html.
Cheminformatics is the application of informatics techniques to solve chemical problems in silico. There are many areas in biology where cheminformatics plays an important role in computational research, including metabolism, proteomics, and systems biology. One critical aspect in the application of cheminformatics in these fields is the accurate exchange of data, which is increasingly accomplished through the use of ontologies. Ontologies are formal representations of objects and their properties using a logic-based ontology language. Many such ontologies are currently being developed to represent objects across all the domains of science. Ontologies enable the definition, classification, and support for querying objects in a particular domain, enabling intelligent computer applications to be built which support the work of scientists both within the domain of interest and across interrelated neighbouring domains. Modern chemical research relies on computational techniques to filter and organise data to maximise research productivity. The objects which are manipulated in these algorithms and procedures, as well as the algorithms and procedures themselves, enjoy a kind of virtual life within computers. We will call these information entities. Here, we describe our work in developing an ontology of chemical information entities, with a primary focus on data-driven research and the integration of calculated properties (descriptors) of chemical entities within a semantic web context. Our ontology distinguishes algorithmic, or procedural information from declarative, or factual information, and renders of particular importance the annotation of provenance to calculated data. The Chemical Information Ontology is being developed as an open collaborative project. More details, together with a downloadable OWL file, are available at http://code.google.com/p/semanticchemistry/ (license: CC-BY-SA).
Workflow engine technology represents a new class of software with the ability to graphically model step-based knowledge. We present an application of this novel technology to the domain of clinical decision support. Successful implementation of decision support within an electronic health record (EHR) remains an unsolved research challenge. Previous research efforts were mostly based on healthcare-specific representation standards and execution engines and did not reach wide adoption. We focus on two challenges in decision support systems: the ability to test decision logic on retrospective data prior to prospective deployment and the challenge of user-friendly representation of clinical logic.
We present our implementation of a workflow engine technology that addresses the two above-described challenges in delivering clinical decision support. Our system is based on a cross-industry standard of XML (extensible markup language) process definition language (XPDL). The core components of the system are a workflow editor for modeling clinical scenarios and a workflow engine for execution of those scenarios. We demonstrate, with an open-source and publicly available workflow suite, that clinical decision support logic can be executed on retrospective data. The same flowchart-based representation can also function in a prospective mode where the system can be integrated with an EHR system and respond to real-time clinical events. We limit the scope of our implementation to decision support content generation (which can be EHR system vendor independent). We do not focus on supporting complex decision support content delivery mechanisms due to lack of standardization of EHR systems in this area. We present results of our evaluation of the flowchart-based graphical notation as well as architectural evaluation of our implementation using an established evaluation framework for clinical decision support architecture.
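The point that one flowchart can serve both retrospective and prospective modes can be shown with a small sketch. It does not parse XPDL or use the workflow suite described above: the flowchart is a plain Python dictionary, and the rule, field names, and thresholds are hypothetical placeholders with no clinical meaning.

```python
# A flowchart as a dictionary: decision nodes branch on a test, end nodes carry a result.
WORKFLOW = {
    "start": {
        "test": lambda p: p["age"] >= 50,
        "yes": "check_screening",
        "no": "end_none",
    },
    "check_screening": {
        "test": lambda p: p["years_since_screening"] is None
        or p["years_since_screening"] > 10,
        "yes": "end_remind",
        "no": "end_none",
    },
    "end_remind": {"result": "remind"},
    "end_none": {"result": None},
}


def run_workflow(workflow, record, start="start", max_steps=100):
    """Walk the flowchart for one patient record and return the end node's result."""
    node_id = start
    for _ in range(max_steps):
        node = workflow[node_id]
        if "result" in node:
            return node["result"]
        node_id = node["yes"] if node["test"](record) else node["no"]
    raise RuntimeError("workflow did not reach an end node")


def run_retrospective(workflow, records):
    """Apply the same flowchart to a batch of historical records."""
    return [(record["id"], run_workflow(workflow, record)) for record in records]
```

In this sketch, retrospective testing is a call to `run_retrospective` over stored records, while prospective use would call `run_workflow` once per incoming event; the decision logic itself is defined only once.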
We describe an implementation of a free workflow technology software suite (available at http://code.google.com/p/healthflow) and its application in the domain of clinical decision support. Our implementation seamlessly supports clinical logic testing on retrospective data and offers a user-friendly knowledge representation paradigm. With the presented software implementation, we demonstrate that workflow engine technology can provide a decision support platform which evaluates well against an established clinical decision support architecture evaluation framework. Due to cross-industry usage of workflow engine technology, we can expect significant future functionality enhancements that will further improve the technology's capacity to serve as a clinical decision support platform.