A constant influx of new data poses a challenge in keeping the annotation in biological databases current. Most biological databases contain significant quantities of textual annotation, which often contains the richest source of knowledge. Many databases reuse existing knowledge; during the curation process annotations are often propagated between entries. However, this is often not made explicit. Therefore, it can be hard, potentially impossible, for a reader to identify where an annotation originated. Within this work we attempt to identify annotation provenance and track its subsequent propagation. Specifically, we exploit annotation reuse within the UniProt Knowledgebase (UniProtKB), at the level of individual sentences. We describe a visualisation approach for the provenance and propagation of sentences in UniProtKB which enables a large-scale statistical analysis. Initially, levels of sentence reuse within UniProtKB were analysed, showing that reuse is heavily prevalent, which enables the tracking of provenance and propagation. By analysing sentences throughout UniProtKB, a number of interesting propagation patterns were identified, covering over sentences. Over sentences remain in the database after they have been removed from the entries where they originally occurred. Analysing a subset of these sentences suggests that approximately are erroneous, whilst appear to be inconsistent. These results suggest that being able to visualise sentence propagation and provenance can aid in the determination of the accuracy and quality of textual annotation.
Source code and supplementary data are available from the authors' website at http://homepages.cs.ncl.ac.uk/m.j.bell1/sentence_analysis/.
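The idea of tracing a sentence back to the entry where it first appeared, and then following it through later releases, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the naive sentence splitter, the function names, the release labels (r1 to r3), the entry identifiers (P1, P2) and the annotation text are all hypothetical, chosen only to make the example self-contained.

```python
import re
from collections import defaultdict


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def trace_sentences(releases):
    """Record where each sentence first appears and where it occurs afterwards.

    `releases` is a chronologically ordered list of
    (release_name, {entry_id: annotation_text}) pairs.
    """
    traces = {}
    for release, entries in releases:
        seen = defaultdict(list)  # sentence -> entries containing it in this release
        for entry_id in sorted(entries):
            for sentence in split_sentences(entries[entry_id]):
                if entry_id not in seen[sentence]:
                    seen[sentence].append(entry_id)
        for sentence, entry_ids in seen.items():
            if sentence not in traces:
                # First sighting: treat this release and these entries as the provenance.
                traces[sentence] = {"origin": (release, entry_ids), "history": []}
            traces[sentence]["history"].append((release, entry_ids))
    return traces


def orphaned_sentences(traces, latest_release):
    """Sentences still present in the latest release but in none of their origin entries."""
    orphans = []
    for sentence, trace in traces.items():
        last_release, last_entries = trace["history"][-1]
        origin_entries = set(trace["origin"][1])
        if last_release == latest_release and not origin_entries & set(last_entries):
            orphans.append(sentence)
    return orphans


# Hypothetical toy data: three releases of two entries.
releases = [
    ("r1", {"P1": "Binds DNA. Involved in repair."}),
    ("r2", {"P1": "Binds DNA. Involved in repair.", "P2": "Binds DNA."}),
    ("r3", {"P1": "Involved in repair.", "P2": "Binds DNA."}),
]
traces = trace_sentences(releases)
print(traces["Binds DNA."]["origin"])
print(orphaned_sentences(traces, "r3"))
```

In this toy data the sentence "Binds DNA." originates in P1, propagates to P2, and survives in P2 after being removed from P1, which is the kind of pattern the abstract describes as remaining in the database after removal from the originating entry.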
Motivation: Annotations are a key feature of many biological databases, used to convey our knowledge of a sequence to the reader. Ideally, annotations are curated manually; however, manual curation is costly, time-consuming and requires expert knowledge and training. Given these issues and the exponential increase of data, many databases implement automated annotation pipelines in an attempt to avoid un-annotated entries. Both manual and automated annotations vary in quality between databases and annotators, making assessment of annotation reliability problematic for users. The community lacks a generic measure for determining annotation quality and correctness, which we look at addressing within this article. Specifically, we investigate word reuse within bulk textual annotations and relate this to Zipf's Principle of Least Effort. We use the UniProt Knowledgebase (UniProtKB) as a case study to demonstrate this approach since it allows us to compare annotation change, both over time and between automated and manually curated annotations.
Results: By applying power-law distributions to word reuse in annotation, we show clear trends in UniProtKB over time, which are consistent with existing studies of quality on free text English. Further, we show a clear distinction between manual and automated analysis and investigate cohorts of protein records as they mature. These results suggest that this approach holds distinct promise as a mechanism for judging annotation quality.
Availability: Source code is available at the authors' website: http://homepages.cs.ncl.ac.uk/m.j.bell1/annotation.
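As a rough illustration of relating word reuse to a power law, the sketch below ranks words by frequency and estimates a Zipf-style exponent from the slope of log(frequency) against log(rank). The least-squares fit is a simplification swapped in for illustration; the abstract does not state which fitting procedure the authors used, and the tokeniser and function names here are assumptions.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Word counts in descending order, using a simple lower-case tokeniser (an assumption)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return sorted(Counter(words).values(), reverse=True)


def zipf_exponent(frequencies):
    """Estimate alpha in frequency ~ rank**(-alpha) by least squares on log-log values."""
    if len(frequencies) < 2:
        raise ValueError("need at least two ranked frequencies")
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # slope is negative for a decaying distribution


# Idealised Zipfian frequencies (frequency proportional to 1/rank) give alpha close to 1.
ideal = [1000 / rank for rank in range(1, 51)]
print(round(zipf_exponent(ideal), 6))
```

Applied to the annotation text of successive database releases, such an exponent could be tracked over time or compared between manually and automatically curated entries; how the value should be interpreted as a quality signal is the subject of the article itself and is not settled by this sketch.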
Motivation: Biological experiments give insight into networks of processes inside a cell, but are subject to error and uncertainty. However, due to the overlap between the large number of experiments reported in public databases it is possible to assess the chances of individual observations being correct. In order to do so, existing methods rely on high-quality ‘gold standard’ reference networks, but such reference networks are not always available.
Results: We present a novel algorithm for computing the probability of network interactions that operates without gold standard reference data. We show that our algorithm outperforms existing gold standard-based methods. Finally, we apply the new algorithm to a large collection of genetic interaction and protein–protein interaction experiments.
Availability: The integrated dataset and a reference implementation of the algorithm as a plug-in for the Ondex data integration framework are available for download at http://bio-nexus.ncl.ac.uk/projects/nogold/
Supplementary data are available at Bioinformatics online.
Motivation: The rise of high-throughput technologies in the post-genomic era has led to the production of large amounts of biological data. Many of these datasets are freely available on the Internet. Making optimal use of these data is a significant challenge for bioinformaticians. Various strategies for integrating data have been proposed to address this challenge. One of the most promising approaches is the development of semantically rich integrated datasets. Although well suited to computational manipulation, such integrated datasets are typically too large and complex for easy visualization and interactive exploration.
Results: We have created an integrated dataset for Saccharomyces cerevisiae using the semantic data integration tool Ondex, and have developed a view-based visualization technique that allows for concise graphical representations of the integrated data. The technique was implemented in a plug-in for Cytoscape, called OndexView. We used OndexView to investigate telomere maintenance in S. cerevisiae.
Availability: The Ondex yeast dataset and the OndexView plug-in for Cytoscape are accessible at http://bsu.ncl.ac.uk/ondexview.
Supplementary information: Supplementary data are available at Bioinformatics online.
Many areas of biology are open to mathematical and computational modelling. The application of discrete, logical formalisms defines the field of biomedical ontologies. Ontologies have been put to many uses in bioinformatics. The most widespread is for description of entities about which data have been collected, allowing integration and analysis across multiple resources. There are now over 60 ontologies in active use, increasingly developed as large, international collaborations. There are, however, many opinions on how ontologies should be authored; that is, what is appropriate for representation. Recently, a common opinion has been the “realist” approach that places restrictions upon the style of modelling considered to be appropriate.
Here, we use a number of case studies for describing the results of biological experiments. We investigate the ways in which these could be represented using both realist and non-realist approaches; we consider the limitations and advantages of each of these models.
From our analysis, we conclude that while realist principles may enable straightforward modelling for some topics, there are crucial aspects of science and the phenomena it studies that do not fit into this approach; realism appears to be over-simplistic which, perversely, results in overly complex ontological models. We suggest that it is impossible to avoid compromise in modelling ontology; a clearer understanding of these compromises will better enable appropriate modelling, fulfilling the many needs for discrete mathematical models within computational biology.
The creation of accurate quantitative Systems Biology Markup Language (SBML) models is a time-intensive, manual process often complicated by the many data sources and formats required to annotate even a small and well-scoped model. Ideally, the retrieval and integration of biological knowledge for model annotation should be performed quickly, precisely, and with a minimum of manual effort.
Here we present rule-based mediation, a method of semantic data integration applied to systems biology model annotation. The heterogeneous data sources are first syntactically converted into ontologies, which are then aligned to a small domain ontology by applying a rule base. We demonstrate proof-of-principle of this application of rule-based mediation using off-the-shelf semantic web technology through two use cases for SBML model annotation. Existing tools and technology provide a framework around which the system is built, reducing development time and increasing usability.
Integrating resources in this way accommodates multiple formats with different semantics, and provides richly-modelled biological knowledge suitable for annotation of SBML models. This initial work establishes the feasibility of rule-based mediation as part of an automated SBML model annotation system.
Detailed information on the project files as well as further information on and comparisons with similar projects is available from the project page at http://cisban-silico.cs.ncl.ac.uk/RBM/.
Understanding the distinction between function and role is vexing and difficult. While it appears to be useful, in practice this distinction is hard to apply, particularly within biology.
I take an evolutionary approach, considering a series of examples, to develop and generate definitions for these concepts. I test them in practice against the Ontology for Biomedical Investigations (OBI). Finally, I give an axiomatisation and discuss methods for applying these definitions in practice.
The definitions in this paper are applicable, formalizing current practice. As such, they make a significant contribution to the use of these concepts within biomedical ontologies.
Experimental descriptions are typically stored as free text without using standardized terminology, creating challenges in comparison, reproduction and analysis. These difficulties impose limitations on data exchange and information retrieval.
The Ontology for Biomedical Investigations (OBI), developed as a global, cross-community effort, provides a resource that represents biomedical investigations in an explicit and integrative framework. Here we detail three real-world applications of OBI, provide detailed modeling information and explain how to use OBI.
We demonstrate how OBI can be applied to different biomedical investigations to both facilitate interpretation of the experimental process and increase the computational processing and integration within the Semantic Web. The logical definitions of the entities involved allow computers to unambiguously understand and integrate different biological experimental processes and their relevant components.
OBI is available at http://purl.obolibrary.org/obo/obi/2009-11-02/obi.owl
The Minimum Information for Biological and Biomedical Investigations (MIBBI) project provides a resource for those exploring the range of extant minimum information checklists and fosters coordinated development of such checklists.
In recent years, ontologies have become a mainstream topic in biomedical research. When biological entities are described using a common schema, such as an ontology, they can be compared by means of their annotations. This type of comparison is called semantic similarity, since it assesses the degree of relatedness between two entities by the similarity in meaning of their annotations. The application of semantic similarity to biomedical ontologies is recent; nevertheless, several studies have been published in the last few years describing and evaluating diverse approaches. Semantic similarity has become a valuable tool for validating the results drawn from biomedical studies such as gene clustering, gene expression data analysis, prediction and validation of molecular interactions, and disease gene prioritization.
We review semantic similarity measures applied to biomedical ontologies and propose their classification according to the strategies they employ: node-based versus edge-based and pairwise versus groupwise. We also present comparative assessment studies and discuss the implications of their results. We survey the existing implementations of semantic similarity measures, and we describe examples of applications to biomedical research. This will clarify how biomedical researchers can benefit from semantic similarity measures and help them choose the approach most suitable for their studies.
Biomedical ontologies are evolving toward increased coverage, formality, and integration, and their use for annotation is increasingly becoming a focus of both effort by biomedical experts and application of automated annotation procedures to create corpora of higher quality and completeness than are currently available. Given that semantic similarity measures are directly dependent on these evolutions, we can expect to see them gaining more relevance and even becoming as essential as sequence similarity is today in biomedical research.
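To make the node-based, pairwise category concrete, the following Python sketch computes Resnik similarity, one well-known node-based measure, on a toy ontology. The information content of a term is taken as IC(t) = -log p(t), where p(t) is the fraction of annotations made to t or any of its descendants, and the similarity of two terms is the highest IC among their shared ancestors. The term names, the graph and the annotation counts are hypothetical and serve only as an illustration.

```python
import math

# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "rna_binding": ["binding"],
}

# Hypothetical number of entities annotated directly to each term.
DIRECT_COUNTS = {"root": 0, "binding": 2, "catalysis": 4, "dna_binding": 1, "rna_binding": 1}


def ancestors(term, parents):
    """Return the term itself together with all of its ancestors."""
    found = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def information_content(parents, direct_counts):
    """IC(t) = log(total / annotations to t or its descendants)."""
    total = sum(direct_counts.values())
    cumulative = dict.fromkeys(parents, 0)
    for term, count in direct_counts.items():
        for ancestor in ancestors(term, parents):
            cumulative[ancestor] += count
    return {t: math.log(total / c) for t, c in cumulative.items() if c > 0}


def resnik(term_a, term_b, parents, ic):
    """Resnik similarity: the IC of the most informative common ancestor."""
    common = ancestors(term_a, parents) & ancestors(term_b, parents)
    return max((ic[t] for t in common if t in ic), default=0.0)


ic = information_content(PARENTS, DIRECT_COUNTS)
print(resnik("dna_binding", "rna_binding", PARENTS, ic))
print(resnik("dna_binding", "catalysis", PARENTS, ic))
```

The two sibling binding terms share the informative ancestor "binding", whereas a binding term and "catalysis" share only the root, which carries no information. An edge-based measure would instead count the links separating two terms, and a groupwise measure would compare whole sets of annotations rather than single term pairs.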
With the quantity of genomic data increasing at an exponential rate, it is imperative that these data be captured electronically, in a standard format. Standardization activities must proceed within the auspices of open-access and international working bodies. To tackle the issues surrounding the development of better descriptions of genomic investigations, we have formed the Genomic Standards Consortium (GSC). Here, we introduce the minimum information about a genome sequence (MIGS) specification with the intent of promoting participation in its development and discussing the resources that will be required to develop improved mechanisms of metadata capture and exchange. As part of its wider goals, the GSC also supports improving the ‘transparency’ of the information contained in existing genomic databases.
The bio-ontology community falls into two camps: first we have biology domain experts, who actually hold the knowledge we wish to capture in ontologies; second, we have ontology specialists, who hold knowledge about techniques and best practice on ontology development. In the bio-ontology domain, these two camps have often come into conflict, especially where pragmatism comes into conflict with perceived best practice. One of these areas is the insistence of computer scientists on a well-defined semantic basis for the Knowledge Representation language being used. In this article, we will first describe why this community is so insistent. Second, we will illustrate this by examining the semantics of the Web Ontology Language and the semantics placed on the Directed Acyclic Graph as used by the Gene Ontology. Finally we will reconcile the two representations, including the broader Open Biomedical Ontologies format. The ability to exchange between the two representations means that we can capitalise on the features of both languages. Such utility can only arise by the understanding of the semantics of the languages being used. By this illustration of the usefulness of a clear, well-defined language semantics, we wish to promote a wider understanding of the computer science perspective amongst potential users within the biological community.
The Annual Bio-Ontologies Meeting has now reached its seventh consecutive year,
running as a special interest group (SIG) of the much larger ISMB conference. This
year's meeting in Glasgow had approximately 100 attendees. Since the advent of
the Gene Ontology, which coincided with the first Bio-Ontologies Meeting, we have
seen a year-on-year strengthening of the field; bio-ontologies has moved from being
dominated by computer science to being led by biological applications; discussion is less
about ‘what is an ontology?’ and more about ‘how to build an ontology which is fit
for purpose?’. This strengthening of the field can be seen elsewhere. Both the main
ISMB conference and this year's Pacific Symposium on Biocomputing (PSB) have
seen a large number of submissions to their ontologies track. For the first time a
selection of the papers from the SIG is being published in this issue of
Comparative and Functional Genomics. We hope that this will complement the publications of
the larger conferences, bringing to a wider audience the cutting edge research that
characterizes the Bio-Ontologies SIG.
The Annual Bio-Ontologies meeting (http://www.cs.man.ac.uk/~stevens/meeting03/)
has now been running for 6 consecutive years, as a special interest group (SIG)
of the much larger ISMB conference. It met in Brisbane, Australia, this summer, the
first time it was held outside North America or Europe. The bio-ontologies meeting
is 1 day long and normally has around 100 attendees. This year there were many
fewer, no doubt a result of the distance, global politics and SARS.
The meeting consisted of a series of 30 min talks with no formal peer review or
publication. Talks ranged in style from fairly formal and complete pieces of work,
through works in progress, to the very informal and discursive. Each year's meeting
has a theme and this year it was ‘ontologies, and text processing’. There is a tendency
for those submitting talks to ignore the theme completely, but this year's theme
obviously struck a chord, as half the programme was about ontologies and text
analysis (http://www.cs.man.ac.uk/~stevensr/meeting03/programme.html). Despite the
smaller size of the meeting, the programme was particularly strong this year, meaning
that the tension between allowing time for the many excellent talks, discussion and
questions from the floor was particularly keenly felt. A happy problem to have!
In this article we describe an approach to representing and building ontologies
advocated by the Bioinformatics and Medical Informatics groups at the University
of Manchester. The hand-crafting of ontologies offers an easy and rapid avenue to
delivering ontologies. Experience has shown that such approaches are unsustainable.
Description logic approaches have been shown to offer computational support for
building sound, complete and logically consistent ontologies. A new knowledge
representation language, DAML + OIL, offers a new standard that is able to support
many styles of ontology, from hand-crafted to full logic-based descriptions with
reasoning support. We describe this language, the OilEd editing tool, reasoning
support and a strategy for the language’s use. We finish with a current example,
in the Gene Ontology Next Generation (GONG) project, that uses DAML + OIL as
the basis for moving the Gene Ontology from its current hand-crafted form to one
that uses logical descriptions of a concept’s properties to deliver a more complete
version of the ontology.