An ontology is a formal representation of a domain modeling the entities in the domain and their relations. When a domain is represented by multiple ontologies, there is a need to create mappings among these ontologies in order to facilitate the integration of data annotated with these ontologies and reasoning across ontologies. The objective of this paper is to recapitulate our experience in aligning large anatomical ontologies and to reflect on some of the issues and challenges encountered along the way. The four anatomical ontologies under investigation are the Foundational Model of Anatomy, GALEN, the Adult Mouse Anatomical Dictionary and the NCI Thesaurus. Their underlying representation formalisms are all different. Our approach to aligning concepts (directly) is automatic, rule-based, and operates at the schema level, generating mostly point-to-point mappings. It uses a combination of domain-specific lexical techniques and structural and semantic techniques (to validate the mappings suggested lexically). It also takes advantage of domain-specific knowledge (lexical knowledge from external resources such as the Unified Medical Language System, as well as knowledge augmentation and inference techniques). In addition to point-to-point mapping of concepts, we present the alignment of relationships and the mapping of concepts group-to-group. We have also successfully tested an indirect alignment through a domain-specific reference ontology. We present an evaluation of our techniques, both against a gold standard established manually and against a generic schema matching system. The advantages and limitations of our approach are analyzed and discussed throughout the paper.
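The idea of proposing mappings lexically and then validating them structurally can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy ontology fragments, the `normalize` function, the stop-word list, and the validation rule (a lexical match counts as supported only if some pair of the two concepts' parents is also matched). It is not the rule set or the data used in the paper.

```python
import re

STOP_WORDS = {"of", "the"}  # assumed stop-word list, for illustration only

def normalize(label):
    """Lowercase, drop punctuation and stop words, and ignore word order."""
    tokens = re.findall(r"[a-z0-9]+", label.lower())
    return " ".join(sorted(t for t in tokens if t not in STOP_WORDS))

def lexical_matches(onto_a, onto_b):
    """Pair concepts whose normalized labels are identical."""
    index = {}
    for concept, info in onto_b.items():
        index.setdefault(normalize(info["label"]), []).append(concept)
    return [(a, b)
            for a, info in onto_a.items()
            for b in index.get(normalize(info["label"]), [])]

def structurally_supported(pair, matches, onto_a, onto_b):
    """True if some parent of the first concept is matched to a parent of the second."""
    a, b = pair
    matched = set(matches)
    return any((pa, pb) in matched
               for pa in onto_a[a]["parents"]
               for pb in onto_b[b]["parents"])

# Hypothetical toy fragments of two ontologies.
onto_a = {
    "A1": {"label": "Heart", "parents": []},
    "A2": {"label": "Valve of heart", "parents": ["A1"]},
    "A3": {"label": "Nail", "parents": ["A4"]},
    "A4": {"label": "Limb", "parents": []},
}
onto_b = {
    "B1": {"label": "heart", "parents": []},
    "B2": {"label": "Heart valve", "parents": ["B1"]},
    "B3": {"label": "nail", "parents": ["B4"]},
    "B4": {"label": "Hardware", "parents": []},
}

matches = lexical_matches(onto_a, onto_b)
validated = [m for m in matches
             if structurally_supported(m, matches, onto_a, onto_b)]
```

In this toy data the two "nail" concepts match lexically but lack structural support because their parents do not correspond, which is the kind of false positive the validation step is meant to catch. Root concepts have no parents and so cannot be validated by this particular rule.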
Ontology; ontology alignment; knowledge representation; anatomy; Semantic Web
The formal description of experiments for efficient analysis, annotation and sharing of results is a fundamental part of the practice of science. Ontologies are required to achieve this objective. A few subject-specific ontologies of experiments currently exist. However, despite the unity of scientific experimentation, no general ontology of experiments exists. We propose the ontology EXPO to meet this need. EXPO links SUMO (the Suggested Upper Merged Ontology) with subject-specific ontologies of experiments by formalizing the generic concepts of experimental design, methodology and results representation. EXPO is expressed in the W3C standard ontology language OWL-DL. We demonstrate the utility of EXPO and its ability to describe different experimental domains by applying it to two experiments: one in high-energy physics and the other in phylogenetics. The use of EXPO made the goals and structure of these experiments more explicit, revealed ambiguities, and highlighted an unexpected similarity. We conclude that EXPO is of general value in describing experiments and a step towards the formalization of science.
ontology; formalization; annotation; artificial intelligence; metadata
There is an increasing interest in developing ontologies and controlled vocabularies to improve the efficiency and consistency of manual literature curation, to enable more formal biocuration workflow results and ultimately to improve analysis of biological data. Two ontologies that have been successfully used for this purpose are the Gene Ontology (GO) for annotating aspects of gene products and the Molecular Interaction ontology (PSI-MI) used by databases that archive protein–protein interactions. The examination of protein interactions has proven to be extremely promising for the understanding of cellular processes. Manual mapping of information from the biomedical literature to bio-ontology terms is one of the most challenging components in the curation pipeline. It requires that expert curators interpret the natural language descriptions contained in articles and infer their semantic equivalents in the ontology (controlled vocabulary). Since manual curation is a time-consuming process, there is strong motivation to implement text-mining techniques to automatically extract annotations from free text. A range of text mining strategies has been devised to assist in the automated extraction of biological data. These strategies either recognize technical terms used recurrently in the literature and propose them as candidates for inclusion in ontologies, or retrieve passages that serve as evidential support for annotating an ontology term, e.g. from the PSI-MI or GO controlled vocabularies. Here, we provide a general overview of current text-mining methods to automatically extract annotations of GO and PSI-MI ontology terms in the context of the BioCreative (Critical Assessment of Information Extraction Systems in Biology) challenge. Special emphasis is given to protein–protein interaction data and PSI-MI terms referring to interaction detection methods.
Numerous ontologies have recently been developed in life sciences to support a consistent annotation of biological objects, such as genes or proteins. These ontologies undergo continuous changes that can affect existing annotations. Therefore, it is valuable for users of ontologies to study the stability of ontologies and to see how many and what kind of ontology changes occurred.
We present OnEX (Ontology Evolution EXplorer), a system for exploring ontology changes. Currently, OnEX provides access to about 560 versions of 16 well-known life science ontologies. The system is based on a three-tier architecture including an ontology version repository, a middleware component and the OnEX web application. Interactive workflows allow a systematic and explorative change analysis of ontologies and their concepts as well as the semi-automatic migration of outdated annotations to the current version of an ontology.
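As a rough illustration of what a change analysis between two ontology versions involves, the sketch below compares two versions represented as dictionaries from concept identifier to label. The representation, the function name `diff_versions` and the identifiers are hypothetical and do not reflect OnEX's actual data model or API.

```python
def diff_versions(old, new):
    """Report concepts added, deleted, or relabeled between two versions."""
    added = sorted(new.keys() - old.keys())
    deleted = sorted(old.keys() - new.keys())
    changed = sorted(c for c in old.keys() & new.keys() if old[c] != new[c])
    return {"added": added, "deleted": deleted, "changed": changed}

# Hypothetical versions of a small ontology (identifier -> label).
version_1 = {"X:1": "cell growth", "X:2": "cell death", "X:3": "transport"}
version_2 = {"X:1": "cell growth", "X:3": "transmembrane transport", "X:4": "signaling"}

changes = diff_versions(version_1, version_2)
```

Annotations that point to an identifier listed under `deleted` are the ones that would need migration to the current version.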
OnEX provides a user-friendly web interface to explore information about changes in current life science ontologies. It is available at .
Several biomedical ontologies cover the domain of biological functions, including molecular and cellular functions. However, there is currently no publicly available ontology of anatomical functions.
Consequently, no explicit relation between anatomical structures and their functions is expressed in the anatomy ontologies that are available for various species. Such an explicit relation would be useful for accurately defining the classes of both the anatomy and the phenotype ontologies.
We provide an ontological analysis of functions and functional abnormalities. From this analysis, we derive an approach to the automatic extraction of anatomical functions from existing ontologies which uses a combination of natural language processing, graph-based analysis of the ontologies and formal inferences. Additionally, we introduce a new relation to link material objects to processes that realize the function of these objects. This relation is introduced to avoid a needless duplication of processes already covered by the Gene Ontology in a new ontology of anatomical functions.
Ontological considerations on the nature of functional abnormalities and their representation in current phenotype ontologies show that we can extract a skeleton for an ontology of anatomical functions by using a combination of process, phenotype and anatomy ontologies automatically. We identify several limitations of the current ontologies that still need to be addressed to ensure a consistent and complete representation of anatomical functions and their abnormalities.
The source code and results of our analysis are available at http://bioonto.de.
To develop and apply formal ontology creation methods to the domain of antimicrobial prescribing and to formally evaluate the resulting ontology through intrinsic and extrinsic evaluation studies.
We extended existing ontology development methods to create the ontology and implemented the ontology using Protégé-OWL. Correctness of the ontology was assessed using a set of ontology design principles and domain expert review via the laddering technique. We created three artifacts to support the extrinsic evaluation (set of prescribing rules, alerts and an ontology-driven alert module, and a patient database) and evaluated the usefulness of the ontology for performing knowledge management tasks to maintain the ontology and for generating alerts to guide antibiotic prescribing.
The ontology includes 199 classes, 10 properties, and 1,636 description logic restrictions. Twenty-three Semantic Web Rule Language rules were written to generate three prescribing alerts: 1) antibiotic-microorganism mismatch alert; 2) medication-allergy alert; and 3) non-recommended empiric antibiotic therapy alert. The evaluation studies confirmed the correctness of the ontology, usefulness of the ontology for representing and maintaining antimicrobial treatment knowledge rules, and usefulness of the ontology for generating alerts to provide feedback to clinicians during antibiotic prescribing.
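The logic behind two of the three alerts can be sketched in plain Python. The study itself expresses such rules in the Semantic Web Rule Language over an OWL ontology; the version below only mimics the conditions, and all drug names, organism names, coverage data and class assignments are placeholders invented for illustration.

```python
from dataclasses import dataclass, field

# Placeholder knowledge base (not clinical data).
COVERS = {
    "drug_a": {"organism_x"},
    "drug_b": {"organism_y", "organism_z"},
}
DRUG_CLASS = {"drug_a": "class_1", "drug_b": "class_2"}

@dataclass
class Patient:
    prescriptions: list
    isolates: list = field(default_factory=list)
    allergies: set = field(default_factory=set)

def generate_alerts(patient):
    """Return (alert type, drug, trigger) tuples for one patient."""
    alerts = []
    for drug in patient.prescriptions:
        # Mismatch: the prescribed drug does not cover an isolated organism.
        for organism in patient.isolates:
            if organism not in COVERS.get(drug, set()):
                alerts.append(("antibiotic-microorganism mismatch", drug, organism))
        # Allergy: the patient is allergic to the class the drug belongs to.
        if DRUG_CLASS.get(drug) in patient.allergies:
            alerts.append(("medication-allergy", drug, DRUG_CLASS[drug]))
    return alerts

patient = Patient(prescriptions=["drug_b"],
                  isolates=["organism_x"],
                  allergies={"class_2"})
patient_alerts = generate_alerts(patient)
```

An ontology-based implementation would draw the coverage and class knowledge from the ontology rather than from hard-coded dictionaries, which is what makes the rules maintainable as treatment knowledge changes.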
This study contributes to the understanding of ontology development and evaluation methods and addresses one knowledge gap related to using ontologies as a clinical decision support system component—a need for formal ontology evaluation methods to measure their quality from the perspective of their intrinsic characteristics and their usefulness for specific tasks.
Ontology; Clinical decision support; Evaluation
As biomedical investigators strive to integrate data and analyses across spatiotemporal scales and biomedical domains, they have recognized the benefits of formalizing languages and terminologies via computational ontologies. Although ontologies for biological entities—molecules, cells, organs—are well-established, there are no principled ontologies of physical properties—energies, volumes, flow rates—of those entities. In this paper, we introduce the Ontology of Physics for Biology (OPB), a reference ontology of classical physics designed for annotating biophysical content of growing repositories of biomedical datasets and analytical models. The OPB's semantic framework, traceable to James Clerk Maxwell, encompasses modern theories of system dynamics and thermodynamics, and is implemented as a computational ontology that references available upper ontologies. In this paper we focus on the OPB classes that are designed for annotating physical properties encoded in biomedical datasets and computational models, and we discuss how the OPB framework will facilitate biomedical knowledge integration.
Motivation: Many published manuscripts contain experiment protocols which are poorly described or deficient in information. This means that the published results are very hard or impossible to repeat. This problem is being made worse by the increasing complexity of high-throughput/automated methods. There is therefore a growing need to represent experiment protocols in an efficient and unambiguous way.
Results: We have developed the Experiment ACTions (EXACT) ontology as the basis of a method of representing biological laboratory protocols. We provide example protocols that have been formalized using EXACT, and demonstrate the advantages and opportunities created by using this formalization. We argue that the use of EXACT will result in the publication of protocols with increased clarity and usefulness to the scientific community.
Availability: The ontology, examples and code can be downloaded from http://www.aber.ac.uk/compsci/Research/bio/dss/EXACT/
Contact: Larisa Soldatova email@example.com
Researchers use animal studies to better understand human diseases. In recent years, large-scale phenotype studies such as Phenoscape and EuroPhenome have been initiated to identify genetic causes of a species' phenome. Species-specific phenotype ontologies are required to capture and report about all findings and to automatically infer results relevant to human diseases. The integration of the different phenotype ontologies into a coherent framework is necessary to achieve interoperability for cross-species research.
Here, we investigate the quality and completeness of two different methods to align the Human Phenotype Ontology and the Mammalian Phenotype Ontology. The first method combines lexical matching with inference over the ontologies' taxonomic structures, while the second method uses a mapping algorithm based on the formal definitions of the ontologies. Neither method could map all concepts. Although the formal definitions method provides mappings for more concepts than the lexical matching method does, it does not outperform lexical matching in a biological use case. Our results suggest that combining both approaches will yield better mappings in terms of completeness, specificity and application purposes.
The recent availability of high-throughput data in molecular biology has increased the need for a formal representation of this knowledge domain. New ontologies are being developed to formalize knowledge, e.g. about the functions of proteins. As the Semantic Web is being introduced into the Life Sciences, the basis for a distributed knowledge-base that can foster biological data analysis is laid. However, there still is a dichotomy, in tools and methodologies, between the use of ontologies in biological investigation, that is, in relation to experimental observations, and their use as a knowledge-base.
RDFScape is a plugin that has been developed to extend software oriented toward biological analysis with support for reasoning on ontologies in the Semantic Web framework. We show with this plugin how the use of ontological knowledge in biological analysis can be extended through the use of inference. In particular, we present two examples involving ontologies that represent biological pathways: we demonstrate how these can be abstracted and visualized as interaction networks, and how reasoning on causal dependencies among elements of pathways can be implemented.
The use of ontologies for the interpretation of high-throughput biological data can be improved through the use of inference. This allows the use of ontologies not only as annotations, but as a knowledge-base from which new information relevant for specific analysis can be derived.
Ontologies help to identify and formally define the entities and relationships in specific domains of interest. Bio-ontologies, in particular, play a central role in the annotation, integration, analysis, and interpretation of biological data. Missing from the number of bio-ontologies is one that includes phenotypic trait information found in livestock species. As a result, the Animal Trait Ontology (ATO) project being carried out under the auspices of the USDA-National Animal Genome Research Program is aimed at the development of a standardized trait ontology for farm animals and software tools to assist the research community in collaborative creation, editing, maintenance, and use of such an ontology. The ATO is currently inclusive of cattle, pig, and chicken species, and will include other livestock species in the future. The ATO will eventually be linked to other species (e.g., human, rat, mouse) so that comparative analysis can be efficiently performed between species.
ontology; trait; phenotype; animal; cattle; chicken
Application oriented ontologies are important for reliably communicating and
managing data in databases. Unfortunately, they often differ in the
definitions they use and thus do not live up to their potential. This
problem can be reduced when using a standardized and ontologically
consistent template for the top-level categories from a top-level formal
foundational ontology. This would support ontological consistency within
application oriented ontologies and compatibility between them. The Basic
Formal Ontology (BFO) is such a foundational ontology for the biomedical
domain that has been developed following the single inheritance policy. It
provides the top-level template within the Open Biological and Biomedical
Ontologies Foundry. If it wants to live up to its expected role, its three
top-level categories of material entity (i.e., ‘object’,
‘fiat object part’, ‘object
aggregate’) must be exhaustive, i.e. every concrete material entity
must instantiate exactly one of them.
By systematically evaluating all possible basic configurations of material
building blocks we show that BFO's top-level categories of material
entity are not exhaustive. We provide examples from biology and everyday
life that demonstrate the necessity for two additional categories:
‘fiat object part aggregate’ and
‘object with fiat object part aggregate’. By
distinguishing topological coherence, topological adherence, and metric
proximity we furthermore provide a differentiation of clusters and groups as
two distinct subcategories for each of the three categories of material
entity aggregates, resulting in six additional subcategories of material entity.
We suggest extending BFO to incorporate two additional categories of material
entity as well as two subcategories for each of the three categories of
material entity aggregates. With these additions, BFO would exhaustively
cover all top-level types of material entity that application oriented
ontologies may use as templates. Our result, however, depends on the premise
that all material entities are organized according to a constitutive
The development of high-throughput experimentation has led to astronomical growth in biologically relevant lipids and lipid derivatives identified, screened, and deposited in numerous online databases. Unfortunately, efforts to annotate, classify, and analyze these chemical entities have largely remained in the hands of human curators using manual or semi-automated protocols, leaving many novel entities unclassified. Since chemical function is often closely linked to structure, accurate structure-based classification and annotation of chemical entities is imperative to understanding their functionality.
As part of an exploratory study, we have investigated the utility of semantic web technologies in automated chemical classification and annotation of lipids. Our prototype framework consists of two components: an ontology and a set of federated web services that operate upon it. The formal lipid ontology we use here extends a part of the LiPrO ontology and draws on the lipid hierarchy in the LIPID MAPS database, as well as literature-derived knowledge. The federated semantic web services that operate upon this ontology are deployed within the Semantic Annotation, Discovery, and Integration (SADI) framework. Structure-based lipid classification is enacted by two core services. Firstly, a structural annotation service detects and enumerates relevant functional groups for a specified chemical structure. A second service reasons over lipid ontology class descriptions using the attributes obtained from the annotation service and identifies the appropriate lipid classification. We extend the utility of these core services by combining them with additional SADI services that retrieve associations between lipids and proteins and identify publications related to specified lipid types. We analyze the performance of SADI-enabled eicosanoid classification relative to the LIPID MAPS classification and reflect on the contribution of our integrative methodology in the context of high-throughput lipidomics.
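The two core steps, structural annotation followed by reasoning over class descriptions, can be sketched as follows. This is a simplified stand-in rather than the SADI services or the lipid ontology themselves: the annotation step is stubbed (it takes an already extracted list of functional groups instead of a chemical structure), and the class descriptions in `CLASS_DESCRIPTIONS` are hypothetical minimum-count constraints.

```python
from collections import Counter

# Hypothetical class descriptions: minimum number of each functional group.
CLASS_DESCRIPTIONS = {
    "fatty acid": {"carboxylic acid": 1},
    "hydroxy fatty acid": {"carboxylic acid": 1, "hydroxyl": 1},
    "dihydroxy fatty acid": {"carboxylic acid": 1, "hydroxyl": 2},
}

def annotate(functional_groups):
    """Stand-in for the structural annotation service: count the groups found."""
    return Counter(functional_groups)

def classify(annotation):
    """Return every class whose description is satisfied, most specific first."""
    satisfied = [
        name for name, required in CLASS_DESCRIPTIONS.items()
        if all(annotation[group] >= count for group, count in required.items())
    ]
    # More required groups is treated here as a proxy for a more specific class.
    return sorted(satisfied,
                  key=lambda name: sum(CLASS_DESCRIPTIONS[name].values()),
                  reverse=True)

annotation = annotate(["carboxylic acid", "hydroxyl", "hydroxyl"])
classes = classify(annotation)
```

A description logic reasoner would derive the same kind of result from class axioms and would also compute the subsumption hierarchy, which the sort key above only approximates.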
Our prototype framework is capable of accurate automated classification of lipids and facile integration of lipid class information with additional data obtained with SADI web services. The potential of programming-free integration of external web services through the SADI framework offers an opportunity for development of powerful novel applications in lipidomics. We conclude that semantic web technologies can provide an accurate and versatile means of classification and annotation of lipids.
GALEN technology for re-usable terminologies using formal classification is being applied to the creation and maintenance of a reference terminology for drugs. GALEN's techniques are being used to address specific deficiencies of existing drug classifications that make it difficult to create and maintain guidelines to support prescribing in the care of patients with chronic diseases. The reference terminology is in two parts; firstly, a re-usable and automatically-classified 'ontology' is built with GALEN technology; this describes generic drugs, their composition in terms of chemicals and chemical classes, their actions, indications and interactions. Secondly, a 'dictionary' of prescribable proprietary products is integrated with this ontology. The result is a drug resource designed to support both the traditional uses of a drug knowledge base (e.g. prescribing and messaging), and the specialized demands of guideline authoring and execution.
The Protein Ontology (PRO) is designed as a formal and principled Open Biomedical Ontologies (OBO) Foundry ontology for proteins. The components of PRO extend from a classification of proteins on the basis of evolutionary relationships at the homeomorphic level to the representation of the multiple protein forms of a gene, including those resulting from alternative splicing, cleavage and/or post-translational modifications. Focusing specifically on the TGF-beta signaling proteins, we describe the building, curation, usage and dissemination of PRO.
PRO is manually curated on the basis of PrePRO, an automatically generated file with content derived from standard protein data sources. Manual curation ensures that the treatment of the protein classes and the internal and external relationships conform to the PRO framework. The current release of PRO is based upon experimental data from mouse and human proteins wherein equivalent protein forms are represented by single terms. In addition to the PRO ontology, the annotation of PRO terms is released as a separate PRO association file, which contains, for each given PRO term, an annotation from the experimentally characterized sub-types as well as the corresponding database identifiers and sequence coordinates. The annotations are added in the form of relationship to other ontologies. Whenever possible, equivalent forms in other species are listed to facilitate cross-species comparison. Splice and allelic variants, gene fusion products and modified protein forms are all represented as entities in the ontology. Therefore, PRO provides for the representation of protein entities and a resource for describing the associated data. This makes PRO useful both for proteomics studies where isoforms and modified forms must be differentiated, and for studies of biological pathways, where representations need to take account of the different ways in which the cascade of events may depend on specific protein modifications.
PRO provides a framework for the formal representation of protein classes and protein forms in the OBO Foundry. It is designed to enable data retrieval and integration and machine reasoning at the molecular level of proteins, thereby facilitating cross-species comparisons, pathway analysis, disease modeling and the generation of new hypotheses.
A scientific ontology is a formal representation of knowledge within a domain, typically including central concepts, their properties, and relations. With the rise of computers and high-throughput data collection, ontologies have become essential to data mining and sharing across communities in the biomedical sciences. Powerful approaches exist for testing the internal consistency of an ontology, but not for assessing the fidelity of its domain representation. We introduce a family of metrics that describe the breadth and depth with which an ontology represents its knowledge domain. We then test these metrics using (1) four of the most common medical ontologies with respect to a corpus of medical documents and (2) seven of the most popular English thesauri with respect to three corpora that sample language from medicine, news, and novels. Here we show that our approach captures the quality of ontological representation and guides efforts to narrow the breach between ontology and collective discourse within a domain. Our results also demonstrate key features of medical ontologies, English thesauri, and discourse from different domains. Medical ontologies have a small intersection, as do English thesauri. Moreover, dialects characteristic of distinct domains vary strikingly as many of the same words are used quite differently in medicine, news, and novels. As ontologies are intended to mirror the state of knowledge, our methods to tighten the fit between ontology and domain will increase their relevance for new areas of biomedical science and improve the accuracy and power of inferences computed across them.
An ontology represents the concepts and their interrelation within a knowledge domain. Several ontologies have been developed in biomedicine, which provide standardized vocabularies to describe diseases, genes and gene products, physiological phenotypes, anatomical structures, and many other phenomena. Scientists use them to encode the results of complex experiments and observations and to perform integrative analysis to discover new knowledge. A remaining challenge in ontology development is how to evaluate an ontology's representation of knowledge within its scientific domain. Building on classic measures from information retrieval, we introduce a family of metrics including breadth and depth that capture the conceptual coverage and parsimony of an ontology. We test these measures using (1) four commonly used medical ontologies in relation to a corpus of medical documents and (2) seven popular English thesauri (ontologies of synonyms) with respect to text from medicine, news, and novels. Results demonstrate that both medical ontologies and English thesauri have a small overlap in concepts and relations. Our methods suggest efforts to tighten the fit between ontologies and biomedical knowledge.
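The classic information retrieval measures that such metrics build on can be illustrated with a short sketch. The function below computes plain precision and recall of an ontology's concept set against the concepts found in a corpus; it is not the paper's definition of breadth or depth, and the concept sets are made up for the example.

```python
def precision_recall(ontology_concepts, corpus_concepts):
    """Precision: share of ontology concepts that occur in the corpus.
    Recall: share of corpus concepts that the ontology contains."""
    shared = ontology_concepts & corpus_concepts
    precision = len(shared) / len(ontology_concepts) if ontology_concepts else 0.0
    recall = len(shared) / len(corpus_concepts) if corpus_concepts else 0.0
    return precision, recall

# Hypothetical concept sets.
ontology_concepts = {"heart", "lung", "kidney", "spleen"}
corpus_concepts = {"heart", "lung", "fever"}

precision, recall = precision_recall(ontology_concepts, corpus_concepts)
```

Read this way, low recall points to concepts in the discourse that the ontology misses, while low precision points to ontology concepts that the discourse does not use.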
Motivation: Describing biological sample variables with ontologies is complex due to the cross-domain nature of experiments. Ontologies provide annotation solutions; however, for cross-domain investigations, multiple ontologies are needed to represent the data. These are subject to rapid change, are often not interoperable and present complexities that are a barrier to biological resource users.
Results: We present the Experimental Factor Ontology, designed to meet cross-domain, application focused use cases for gene expression data. We describe our methodology and open source tools used to create the ontology. These include tools for creating ontology mappings, ontology views, detecting ontology changes and using ontologies in interfaces to enhance querying. The application of reference ontologies to data is a key problem, and this work presents guidelines on how community ontologies can be presented in an application ontology in a data-driven way.
Supplementary information: Supplementary data are available at Bioinformatics online.
Phenotype ontologies are used in species-specific databases for the annotation of mutagenesis experiments and to characterize human diseases. The Entity-Quality (EQ) formalism is a means to describe complex phenotypes based on one or more affected entities and a quality. EQ-based definitions have been developed for many phenotype ontologies, including the Human and Mammalian Phenotype ontologies.
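A minimal sketch of an EQ description as a data structure is given below. The rendering pattern `has_part some (Q and inheres_in some E)` is one pattern commonly used for EQ-based class definitions and is assumed here for illustration; the class name `EQ` and the entity and quality names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EQ:
    """An Entity-Quality description: a quality that inheres in an entity."""
    entity: str
    quality: str

    def to_class_expression(self):
        # Assumed pattern, written in a Manchester-like syntax.
        return f"has_part some ({self.quality} and inheres_in some {self.entity})"

enlarged_spleen = EQ(entity="spleen", quality="increased_size")
expression = enlarged_spleen.to_class_expression()
```

A relational quality such as a concentration of one substance in an organ involves a second entity and does not fit this two-slot structure directly, which hints at why such qualities pose a representational challenge.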
We analyze formalizations of complex phenotype descriptions in the Web Ontology Language (OWL) that are based on the EQ model, identify several representational challenges and analyze potential solutions to address these challenges.
In particular, we suggest a novel, role-based approach to represent relational qualities such as concentration of iron in spleen, discuss its ontological foundation in the General Formal Ontology (GFO) and evaluate its representation in OWL and the benefits it can bring to the representation of phenotype annotations.
Our analysis of OWL-based representations of phenotypes can contribute to improving consistency and expressiveness of formal phenotype descriptions.
Ontologies are intended to capture and formalize a domain of knowledge. The
ontologies comprising the Open Biological Ontologies (OBO) project, which includes
the Gene Ontology (GO), are formalizations of various domains of biological
knowledge. Ontologies within OBO typically lack computable definitions that serve to
differentiate a term from other similar terms. The computer is unable to determine the
meaning of a term, which presents problems for tools such as automated reasoners.
Reasoners can be of enormous benefit in managing a complex ontology. OBO term
names frequently implicitly encode the kind of definitions that can be used by
computational tools, such as automated reasoners. The definitions encoded in the
names are not easily amenable to computation, because the names are ostensibly
natural language phrases designed for human users. These names are highly regular
in their grammar, and can thus be treated as valid sentences in some formal or
computable language. With a description of the rules underlying this formal language,
term names can be parsed to derive computable definitions, which can then be
reasoned over. This paper describes the effort to elucidate that language, called Obol,
and the attempts to reason over the resulting definitions. The current implementation
finds unique non-trivial definitions for around half of the terms in the GO, and
has been used to find 223 missing relationships, which have since been added to
the ontology. Obol has utility as an ontology maintenance tool, and as a means of
generating computable definitions for a whole ontology.
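The idea of parsing regular term names into computable definitions can be sketched with a few pattern rules. The rules, the relation names (`regulates`, `has_participant` and so on) and the genus labels below are assumptions chosen for illustration; they are not the Obol grammar itself.

```python
import re

# Each rule pairs a name pattern with a builder for a genus-differentia definition.
RULES = [
    (re.compile(r"^(?P<sign>positive|negative) regulation of (?P<target>.+)$"),
     lambda m: {"genus": "biological regulation",
                "relation": f"{m['sign']}ly_regulates",
                "filler": m["target"]}),
    (re.compile(r"^regulation of (?P<target>.+)$"),
     lambda m: {"genus": "biological regulation",
                "relation": "regulates",
                "filler": m["target"]}),
    (re.compile(r"^(?P<chemical>.+) (?P<process>biosynthesis|catabolism|transport)$"),
     lambda m: {"genus": m["process"],
                "relation": "has_participant",
                "filler": m["chemical"]}),
]

def parse_term(name):
    """Return a definition for the first rule that matches, or None."""
    for pattern, build in RULES:
        match = pattern.match(name)
        if match:
            return build(match)
    return None

definition = parse_term("negative regulation of cell growth")
```

Names that match no rule, such as a bare "nucleus", return `None` and yield no definition, in line with the observation that only part of an ontology's terms can be defined this way.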
The software is available under an open-source license from:
http://www.fruitfly.org/~cjm/obol. Supplementary material for this article can be found at: http://www.
The Basic Formal Ontology (BFO) is a top-level formal foundational ontology for the biomedical domain. It has been developed to serve as an ontologically consistent template for top-level categories of application oriented and domain reference ontologies within the Open Biological and Biomedical Ontologies Foundry (OBO). BFO is important for enabling OBO ontologies to support reliable communication and management of data and metadata within and across biomedical databases. Following its intended single inheritance policy, BFO's three top-level categories of material entity (i.e. ‘object’, ‘fiat object part’, ‘object aggregate’) must be exhaustive and mutually disjoint. We have shown elsewhere that, to accommodate all types of constitutively organized material entities, BFO must be extended by additional categories of material entity.
Unfortunately, most biomedical material entities are cumulative-constitutively organized. We show that even the extended BFO does not exhaustively cover cumulative-constitutively organized material entities. We provide examples from biology and everyday life that demonstrate the necessity for ‘portion of matter’ as another material building block. This implies the necessity for further extending BFO by ‘portion of matter’ as well as three additional categories that possess portions of matter as aggregate components. These extensions are necessary if the basic assumption that all parts that share the same granularity level exhaustively sum to the whole should also apply to cumulative-constitutively organized material entities. By suggesting a notion of granular representation we provide a way to maintain the single inheritance principle when dealing with cumulative-constitutively organized material entities.
We suggest extending BFO to incorporate additional categories of material entity and rearranging its top-level material entity taxonomy. With these additions and the notion of granular representation, BFO would exhaustively cover all top-level types of material entities that application oriented ontologies may use as templates, while still maintaining the single inheritance principle.
Recent increases in the volume and diversity of life science data and information and an increasing emphasis on data sharing and interoperability have resulted in the creation of a large number of biological ontologies, including the Cell Ontology (CL), designed to provide a standardized representation of cell types for data annotation. Ontologies have been shown to have significant benefits for computational analyses of large data sets and for automated reasoning applications, leading to organized attempts to improve the structure and formal rigor of ontologies to better support computation. Currently, the CL employs multiple is_a relations, defining cell types in terms of histological, functional, and lineage properties, and the majority of definitions are written with sufficient generality to hold across multiple species. This approach limits the CL's utility for computation and for cross-species data integration.
To enhance the CL's utility for computational analyses, we developed a method for the ontological representation of cells and applied this method to develop a dendritic cell ontology (DC-CL). DC-CL subtypes are delineated on the basis of surface protein expression, systematically including both species-general and species-specific types and optimizing DC-CL for the analysis of flow cytometry data. We avoid multiple uses of is_a by linking DC-CL terms to terms in other ontologies via additional, formally defined relations such as has_function.
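The single-parent idea described above can be illustrated with a minimal Python sketch. It is not the DC-CL implementation: the `Term` class, the helper functions, the relation name `expresses_surface_protein`, and all term labels are hypothetical placeholders chosen for illustration. Only `is_a` and `has_function` are named in the text. Each term has at most one asserted is_a parent, and every other property is attached through a separately named relation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Term:
    """A term with at most one asserted is_a parent (single inheritance)."""
    name: str
    is_a: Optional["Term"] = None
    # Non-taxonomic links, keyed by relation name (e.g. "has_function").
    relations: Dict[str, Set[str]] = field(default_factory=dict)

    def relate(self, relation: str, target: str) -> None:
        """Attach a named relation to a target term label."""
        self.relations.setdefault(relation, set()).add(target)


def ancestors(term: Term) -> List[str]:
    """Walk the single is_a chain from the term up to the root."""
    chain = []
    parent = term.is_a
    while parent is not None:
        chain.append(parent.name)
        parent = parent.is_a
    return chain


def related(term: Term, relation: str) -> Set[str]:
    """Collect relation targets asserted on the term or inherited via is_a."""
    found: Set[str] = set()
    node: Optional[Term] = term
    while node is not None:
        found |= node.relations.get(relation, set())
        node = node.is_a
    return found


# Illustrative placeholder terms, not actual DC-CL content.
cell = Term("cell")
dendritic_cell = Term("dendritic cell", is_a=cell)
dendritic_cell.relate("has_function", "antigen presentation")
subtype = Term("example DC subtype", is_a=dendritic_cell)
subtype.relate("expresses_surface_protein", "protein X")

print(ancestors(subtype))
print(related(subtype, "has_function"))
```

In this sketch the functional property is inherited along the single is_a chain rather than being asserted as a second parent, which keeps the taxonomy a tree while still allowing queries by function or by surface protein.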
This approach brings benefits in the form of increased accuracy, support for reasoning, and interoperability with other ontology resources. Accordingly, we propose our method as a general strategy for the ontological representation of cells. DC-CL is available from .
Annotated reference corpora play an important role in biomedical information extraction. A semantic annotation of the natural language texts in these reference corpora using formal ontologies is challenging due to the inherent ambiguity of natural language. The provision of formal definitions and axioms for semantic annotations offers a means of ensuring consistency and enables the development of verifiable annotation guidelines. Consistent semantic annotations facilitate the automatic discovery of new information through deductive inferences.
We provide a formal characterization of the relations used in the recent GENIA corpus annotations. For this purpose, we both select existing axiom systems based on the desired properties of the relations within the domain and develop new axioms for several relations. To apply this ontology of relations to the semantic annotation of text corpora, we implement two ontology design patterns. In addition, we provide a software application to convert annotated GENIA abstracts into OWL ontologies by combining both the ontology of relations and the design patterns. As a result, the GENIA abstracts become available as OWL ontologies and are amenable to automated verification, deductive inferences and other knowledge-based applications.
Documentation, implementation and examples are available from http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/.
Bioinformatic analyses typically proceed as chains of data-processing tasks. A pipeline, or 'workflow', is a well-defined protocol, with a specific structure defined by the topology of data-flow interdependencies, and a particular functionality arising from the data transformations applied at each step. In computer science, the dataflow programming (DFP) paradigm defines software systems constructed in this manner, as networks of message-passing components. Thus, bioinformatic workflows can be naturally mapped onto DFP concepts.
To enable the flexible creation and execution of bioinformatics dataflows, we have written a modular framework for parallel pipelines in Python ('PaPy'). A PaPy workflow is created from re-usable components connected by data-pipes into a directed acyclic graph, which together define nested higher-order map functions. The successive functional transformations of input data are evaluated on flexibly pooled compute resources, either local or remote. Input items are processed in batches of adjustable size, allowing one to tune the trade-off between parallelism and lazy evaluation (memory consumption). An add-on module ('NuBio') facilitates the creation of bioinformatics workflows by providing domain-specific data-containers (e.g., for biomolecular sequences, alignments, structures) and functionality (e.g., to parse/write standard file formats).
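The trade-off between parallelism and lazy evaluation can be illustrated with a minimal sketch that uses only the Python standard library. It does not use or reproduce the PaPy API: the function `batched_lazy_map`, the two example stages, and the input data are hypothetical names introduced here. Each stage pulls at most `batch_size` items from its upstream iterator, evaluates that batch on a shared worker pool, and yields results in input order.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batched_lazy_map(func, items, batch_size, pool):
    """Lazily apply func to items, one batch at a time, preserving order.

    A larger batch_size exposes more parallelism per step; a smaller one
    keeps fewer items in memory at once.
    """
    iterator = iter(items)
    while True:
        # Pull only the next batch from upstream (lazy evaluation).
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        # Executor.map submits the whole batch and yields results in order.
        yield from pool.map(func, batch)


def clean(record):
    """First hypothetical stage: normalise a raw sequence string."""
    return record.strip().upper()


def length(sequence):
    """Second hypothetical stage: compute the sequence length."""
    return len(sequence)


raw_records = [" acgt ", "ttga", " gg\n", "a"]

with ThreadPoolExecutor(max_workers=2) as pool:
    # Two stages chained into a linear pipeline (the simplest acyclic graph).
    cleaned = batched_lazy_map(clean, raw_records, batch_size=2, pool=pool)
    lengths = list(batched_lazy_map(length, cleaned, batch_size=2, pool=pool))

print(lengths)
```

With `batch_size=1` the pipeline is fully lazy but effectively serial; with a batch as large as the input it is maximally parallel but holds every intermediate result in memory.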
PaPy offers a modular framework for the creation and deployment of parallel and distributed data-processing workflows. Pipelines derive their functionality from user-written, data-coupled components, so PaPy also can be viewed as a lightweight toolkit for extensible, flow-based bioinformatics data-processing. The simplicity and flexibility of distributed PaPy pipelines may help users bridge the gap between traditional desktop/workstation and grid computing. PaPy is freely distributed as open-source Python code at http://muralab.org/PaPy, and includes extensive documentation and annotated usage examples.
Biomedical processes can provide essential information about the (mal-) functioning of an organism and are thus frequently represented in biomedical terminologies and ontologies, including the GO Biological Process branch. These processes often need to be described and categorised in terms of their attributes, such as rates or regularities. The adequate representation of such process attributes has been a contentious issue in bio-ontologies recently; and domain ontologies have correspondingly developed ad hoc workarounds that compromise interoperability and logical consistency.
We present a design pattern for the representation of process attributes that is compatible with upper ontology frameworks such as BFO and BioTop. Our solution rests on two key tenets: firstly, that many of the sorts of process attributes which are biomedically interesting can be characterised by the ways that repeated parts of such processes constitute, in combination, an overall process; secondly, that entities for which a full logical definition can be assigned do not need to be treated as primitive within a formal ontology framework. We apply this approach to the challenge of modelling and automatically classifying examples of normal and abnormal rates and patterns of heart beating processes, and discuss the expressivity required in the underlying ontology representation language. We provide full definitions for process attributes at increasing levels of domain complexity.
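The second tenet, that fully definable entities need not be primitives, can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' OWL pattern: the class names, the thresholds of 60 and 100 beats per minute, and the 10% regularity tolerance are illustrative placeholders. Each process attribute is expressed as a condition over the repeated parts (individual beats) of the overall process, and classification follows from those conditions alone.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HeartBeatingProcess:
    """An overall process described by the times (s) of its repeated parts."""
    beat_times: List[float]


def intervals(process: HeartBeatingProcess) -> List[float]:
    """Durations between consecutive beats."""
    times = process.beat_times
    if len(times) < 2:
        raise ValueError("at least two beats are needed")
    return [later - earlier for earlier, later in zip(times, times[1:])]


def beats_per_minute(process: HeartBeatingProcess) -> float:
    """Rate derived from how many beat intervals fill the observed duration."""
    gaps = intervals(process)
    return 60.0 * len(gaps) / sum(gaps)


def is_regular(process: HeartBeatingProcess, tolerance: float = 0.1) -> bool:
    """Regular if the interval spread stays within a fraction of the mean."""
    gaps = intervals(process)
    mean = sum(gaps) / len(gaps)
    return max(gaps) - min(gaps) <= tolerance * mean


# Defined (non-primitive) classes: each name maps to a full membership test.
# Thresholds are illustrative assumptions, not values from the paper.
DEFINITIONS = {
    "slow rate process": lambda p: beats_per_minute(p) < 60,
    "normal rate process": lambda p: 60 <= beats_per_minute(p) <= 100,
    "fast rate process": lambda p: beats_per_minute(p) > 100,
    "regular process": is_regular,
    "irregular process": lambda p: not is_regular(p),
}


def classify(process: HeartBeatingProcess) -> List[str]:
    """Return every defined class whose conditions the process satisfies."""
    return sorted(name for name, test in DEFINITIONS.items() if test(process))


steady = HeartBeatingProcess([0.0, 1.0, 2.0, 3.0, 4.0])
erratic = HeartBeatingProcess([0.0, 0.3, 0.9, 1.2, 1.8])
print(classify(steady))
print(classify(erratic))
```

Because every class in `DEFINITIONS` carries a complete membership test, no instance has to be assigned to a class by hand, which mirrors the role a reasoner plays for fully defined classes in an ontology.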
We show that a logical definition of process attributes is feasible, though limited by the expressivity of DL languages so that the creation of primitives is still necessary. This finding may endorse current formal upper-ontology frameworks as a way of ensuring consistency, interoperability and clarity.
With the emergence of “service oriented science,” the need arises to orchestrate multiple services to facilitate scientific investigation—that is, to create “science workflows.” We present here our findings in providing a workflow solution for the caGrid service-based grid infrastructure. We choose BPEL and Taverna as candidates, and compare their usability in the lifecycle of a scientific workflow, including workflow composition, execution, and result analysis. Our experience shows that BPEL as an imperative language offers a comprehensive set of modeling primitives for workflows of all flavors; while Taverna offers a dataflow model and a more compact set of primitives that facilitates dataflow modeling and pipelined execution. We hope that this comparison study not only helps researchers select a language or tool that meets their specific needs, but also offers some insight on how a workflow language and tool can fulfill the requirement of the scientific community.
scientific workflow; functional programming; Taverna; BPEL; caGrid