The value of any kind of data is greatly enhanced when it exists in a form that allows it to be integrated with other data. One approach to integration is through the annotation of multiple bodies of data using common controlled vocabularies or ‘ontologies’. Unfortunately, the very success of this approach has led to a proliferation of ontologies, which itself creates obstacles to integration. The Open Biomedical Ontologies (OBO) consortium is pursuing a strategy to overcome this problem. Existing OBO ontologies, including the Gene Ontology, are undergoing coordinated reform, and new ontologies are being created on the basis of an evolving set of shared principles governing ontology development. The result is an expanding family of ontologies designed to be interoperable and logically well formed and to incorporate accurate representations of biological reality. We describe this OBO Foundry initiative and provide guidelines for those who might wish to become involved.
In the search for what is biologically and clinically significant in the swarms of data being generated by today’s high-throughput technologies, a common strategy involves the creation and analysis of ‘annotations’ linking primary data to expressions in controlled, structured vocabularies, thereby making the data available to search and to algorithmic processing1. The most successful such endeavor, measured both by numbers of users and by reach across species and granularities, is the Gene Ontology (GO)2. There exist over 11 million annotations relating gene products described in the UniProt, Ensembl and other databases to terms in the GO3, of which half a million have been manually verified by specialist curators in different model-organism communities on the basis of the analysis of experimental results reported in 52,000 scientific journal articles (http://www.ebi.ac.uk/GOA/). Data related to some 180,000 genes have been manually annotated in this way, an endeavor now being refined and systematized within the Reference Genome Project (US National Institutes of Health National Human Genome Research Institute grant 2P41HG002273-07), which will provide comprehensive GO annotations for both the human genome and a representative set of model-organism genomes in support of research on the primary molecular systems affecting human health.
The domain of molecular biology is marked by the availability of large amounts of well defined data that can be used without restriction as inputs to algorithmic processing. In the clinical domain, by contrast, only limited amounts of data are available for research purposes, and these still consist overwhelmingly of natural language text. Even where more systematic clinical data are available, the use of local coding schemes means that these data do not cumulate in ways useful to research4. One approach to solving this problem is the Unified Medical Language System (UMLS)5, a compendium of some 100 source vocabularies combined through a process of retrospective mapping based on the identification of synonymy relations between constituent terms. The UMLS has yielded very useful results for applications such as indexing and retrieval of documents. But because the separate vocabularies have no common architecture6,7, UMLS mappings do not meld their terms together into any single system8.
Increasingly, therefore, the need is being recognized for strategies of prospective standardization designed to bring about the progressive improvement and reciprocal alignment of the frameworks employed for the management, description and publication of biomedical data. Two conspicuous products of this trend are the US National Cancer Institute’s Cancer Biomedical Informatics Grid (caBIG) project9 and HL7’s Reference Information Model (RIM) (http://hl7.org). caBIG seeks to integrate all cancer research data in a common cyberinfrastructure by standardizing the ways in which such data are acquired, formatted, processed and stored. The HL7 RIM, similarly, offers a standard for the exchange, management and integration of all information relevant to healthcare, from clinical genomics to hospital billing. However, because both caBIG and HL7 focus on the meta-level question of how data and information should be represented in computer and messaging systems, it can be argued that they fail to do justice to the object-level question of how best to represent the proteins, organisms, diseases or drug interactions that are of primary interest in biomedical research7,10.
In 2001, Ashburner and Lewis initiated a strategy to address this object-level question by creating OBO, an umbrella body for the developers of life-science ontologies. OBO applies the key principles underlying the success of the GO, namely, that ontologies be open, orthogonal, instantiated in a well-specified syntax and designed to share a common space of identifiers11. Ontologies must be open in the sense that they and the bodies of data described in their terms should be available for use without any constraint or license and so be applicable to new purposes without restriction. They are also receptive to modification as a result of community debate. They must be orthogonal to ensure additivity of annotations and to bring the benefits of modular development. They must be syntactically in good order to support algorithmic processing. And they must employ a common system of identifiers to enable backward compatibility with legacy annotations as the ontologies evolve.
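As a rough illustration of what an orthogonality check might involve, the following Python sketch reports term labels that occur in more than one ontology. The ontology names and labels are hypothetical, and matching on labels is only a crude proxy: orthogonality concerns what terms mean, not merely how they are spelled.

```python
from collections import defaultdict


def find_overlaps(ontologies):
    """Return {label: [ontology names]} for labels found in more than one ontology.

    `ontologies` maps an ontology name to an iterable of term labels.
    Labels are compared case-insensitively.
    """
    owners = defaultdict(set)
    for name, labels in ontologies.items():
        for label in labels:
            owners[label.lower()].add(name)
    return {
        label: sorted(names)
        for label, names in owners.items()
        if len(names) > 1
    }


# Hypothetical example: 'neuron' is claimed by two ontologies.
example = {
    "cell_types": ["neuron", "hepatocyte"],
    "anatomy": ["liver", "neuron"],
}
overlaps = find_overlaps(example)
```

Each overlap found in this way would be a candidate for the kind of negotiation between ontology developers that the orthogonality principle calls for.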
OBO now comprises over 60 ontologies, and its role as an ontology information resource is supported by the NIH Roadmap National Center for Biomedical Ontology (NCBO) through its BioPortal12. At the same time, the developers of a subset of OBO ontologies have initiated the OBO Foundry, a collaborative experiment based on the voluntary acceptance by its participants of an evolving set of principles (available at http://obofoundry.org) that extend those of the original OBO by requiring in addition that ontologies (i) be developed in a collaborative effort, (ii) use common relations that are unambiguously defined, (iii) provide procedures for user feedback and for identifying successive versions and (iv) have a clearly bounded subject-matter (so that an ontology devoted to cell components, for example, should not include terms like ‘database’ or ‘integer’). A graphical representation of the coverage of the initial Foundry ontologies is provided in Table 1.
Since the OBO Foundry was established, ontologies such as the GO and the Foundational Model of Anatomy (FMA)13 have been reformed and new ontologies created on the basis of its principles14-16. Perhaps most importantly, ontologies have been laid to rest. Before the OBO Foundry there existed at least four cell-type ontologies: one from Bard, Rhee and Ashburner17, another from Kelso et al.18, a third implicit within the GO and the fourth a subontology within the FMA. The first three now form a single cell-type ontology (CL)19, which is itself being integrated with the cell-type representations contained within the FMA.
The Foundry initiative also serves to align ontology development efforts carried out by separate communities, for example in research on different model organisms. The potential of such research to yield results valuable for the understanding of human disease rests on our ability to make reliable cross-species comparisons. Because so much model-organism data is localized to anatomical structures, drawing inferences on the basis of such comparisons has been hampered by the lack of coordination in anatomy ontology development among different communities. Some ontologies represent structure, others represent function, yet others represent stages of development, and some draw on combinations of these, in ways that close off opportunities for automatic reasoning. The Foundry has created a roadmap for the incremental resolution of this problem through the initiation of the Common Anatomy Reference Ontology (CARO)14, which is providing guidelines both for model-organism communities with legacy anatomy ontologies who wish to initiate reforms in the direction of compatibility and for communities who wish to build new ontologies from scratch. CARO is based on the top-level types of the FMA and is serving as a template for the creation of the Fish Multi-Species, Ixodidae and Argasidae (tick), mosquito and Xenopus anatomy ontologies, and also as a basis for reforms of the Drosophila and zebrafish anatomy ontologies19.
The Ontology for Biomedical Investigations (OBI) addresses the need for controlled vocabularies to support integration of experimental data, a need originally identified in the transcriptomics domain by the Microarray Gene Expression Data Society (MGED), which developed the MGED Ontology20 as an annotation resource for microarray data. In response to the recognition of convergent needs in areas such as protein and metabolite characterization, this effort was broadened to become what was initially known as FuGO (Functional Genomics Investigation Ontology)21. FuGO was further expanded in 2006 to include clinical and epidemiological research, biomedical imaging and a variety of further experimentation domains to become what is today OBI, an ontology designed to serve the coordinated representation of designs, protocols, instrumentation, materials, processes, data and types of analysis in all areas of biological and biomedical investigation. Twenty-five groups are now involved in building OBI (http://obi.sf.net/community), and the Foundry discipline has proven essential to its distributed development.
Unlike most OBO ontologies, which use the OBO file format and the associated OBO-Edit software favored by model-organism and other biologist communities, OBI uses the OWL-DL Web Ontology Language. The need to make OWL and OBO ontologies interoperable has sparked the creation of bidirectional OBO–OWL conversion tools22 that integrate data annotated in terms of the GO and other OBO ontologies with the bodies of data coming onstream within the framework of the Semantic Web23, an influential initiative to exploit OWL ontologies to encode knowledge in distributed computer systems24.
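To make the OBO file format more concrete, the following Python sketch reads `[Term]` stanzas from OBO-style text. It is an illustration only: it handles a small subset of the format, it is not the implementation used by OBO-Edit or by any conversion tool, and the `EX:` identifiers in the sample are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample in OBO flat-file style: one stanza per term,
# one "tag: value" pair per line, "!" introducing a trailing comment.
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: EX:0000001
name: biogenic amine

[Term]
id: EX:0000002
name: serotonin
is_a: EX:0000001 ! biogenic amine
"""


def parse_obo_terms(text):
    """Return {term_id: {tag: [values]}} for every [Term] stanza in `text`."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are collected.
            current = defaultdict(list) if line == "[Term]" else None
            continue
        if current is None:
            continue  # header lines or stanzas of other kinds
        tag, _, value = line.partition(":")
        value = value.split("!")[0].strip()  # drop the trailing comment
        current[tag.strip()].append(value)
        if tag.strip() == "id":
            terms[value] = current
    return terms


terms = parse_obo_terms(SAMPLE_OBO)
```

A dictionary of this kind could serve as the starting point for emitting an equivalent OWL class per term, although a faithful conversion has to deal with many details that this sketch ignores.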
Each Foundry ontology forms a graph-theoretic structure, with terms connected by edges representing relations such as ‘is_a’ or ‘part_of’ in assertions such as ‘serotonin is_a biogenic amine’ or ‘cytokinesis part_of cell proliferation’. Because relations in OBO ontologies were initially used in inconsistent ways25, the OBO Relation Ontology (RO)26 was developed to provide guidelines to ontology builders in the consistent formulation of relational assertions. These guidelines are already proving useful—for example, in the representation of anatomical change27 and in linking diverse image collections to phylogenetic datasets28.
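The graph-theoretic view can be sketched in a few lines of Python. The class below is a hypothetical helper, not part of any OBO tool: it stores labeled edges and follows a single relation upward, treating ‘is_a’ and ‘part_of’ as transitive. The first two assertions are those quoted above; ‘biogenic amine is_a amine’ is added here purely to show a two-step path.

```python
from collections import defaultdict


class OntologyGraph:
    """Terms as nodes, relational assertions as labeled edges."""

    def __init__(self):
        # (child term, relation) -> set of parent terms
        self._edges = defaultdict(set)

    def add(self, child, relation, parent):
        self._edges[(child, relation)].add(parent)

    def ancestors(self, term, relation="is_a"):
        """All terms reachable from `term` by following `relation` edges."""
        seen = set()
        stack = [term]
        while stack:
            node = stack.pop()
            for parent in self._edges.get((node, relation), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


graph = OntologyGraph()
graph.add("serotonin", "is_a", "biogenic amine")
graph.add("biogenic amine", "is_a", "amine")  # illustrative addition
graph.add("cytokinesis", "part_of", "cell proliferation")

serotonin_types = graph.ancestors("serotonin")
cytokinesis_wholes = graph.ancestors("cytokinesis", relation="part_of")
```

Keeping each relation separate in the traversal reflects the point of the RO guidelines: an inference is only as reliable as the consistency with which each relation is used.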
Other areas in which the Foundry is providing guidelines include naming conventions29 and pathway representations30. The model of good practice in the formulation of definitions is the FMA13, a representation of types of anatomical entities built around two backbone hierarchies of ‘is_a’ and ‘part_of’ relations. The FMA imposes a rule whereby all definitions take the genus-species form:
Anchoring definitions in the ‘is_a’ hierarchy in this way diminishes the role of opinion in determining where terms should be placed in the hierarchy, thereby fostering consistency both within and between ontologies and helping to prevent common errors6,7,26.
To maximize cross-ontology coordination, compound terms should be built as far as possible out of constituent terms drawn from Foundry ontologies linked using relational expressions from the RO31. This methodology of cross-products is being applied, in one of the biological projects driving the NCBO, to the annotation of Drosophila, zebrafish and human alleles for genes implicated in disease12,32. Specialist curators associate these alleles with phenotype descriptions formulated using terms drawn from more than one OBO Foundry ontology—for example, composing the Phenotypic Quality Ontology (PATO) term ‘increased concentration’ with the FMA term ‘blood’ and the ChEBI term ‘glucose’ to represent increased blood glucose phenotypes. Such creation of terms through explicit composition avoids the bottlenecks created where, as for example in the Mammalian Phenotype Ontology, each new term must be approved for inclusion in the ontology before it can be used in annotations. But the approach will work only if the resultant terms are unambiguous, and here the Foundry helps provide the necessary rigor. The orthogonality principle helps to reduce the need for arbitrary decisions between equivalent-seeming terms drawn from different ontologies, the PATO phenotypic-quality ontology provides templates for term formation, and the RO provides formally coherent glue for combination33.
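The composition of the blood glucose example can be sketched as follows. This is a minimal illustration in Python, not the curation tooling itself; the class names and the relation names `inheres_in` and `towards` are assumptions chosen for readability rather than expressions confirmed to be in the RO.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Term:
    ontology: str  # source ontology, e.g. "PATO"
    label: str     # human-readable term label


@dataclass(frozen=True)
class ComposedPhenotype:
    quality: Term                   # what is observed, expected from PATO
    bearer: Term                    # the entity that bears the quality
    towards: Optional[Term] = None  # optional second entity involved

    def __post_init__(self):
        # With orthogonal ontologies, a quality can come from one place only.
        if self.quality.ontology != "PATO":
            raise ValueError("quality must be drawn from PATO")

    def describe(self) -> str:
        parts = [_show(self.quality), "inheres_in", _show(self.bearer)]
        if self.towards is not None:
            parts += ["towards", _show(self.towards)]
        return " ".join(parts)


def _show(term: Term) -> str:
    return f"{term.ontology}:'{term.label}'"


phenotype = ComposedPhenotype(
    quality=Term("PATO", "increased concentration"),
    bearer=Term("FMA", "blood"),
    towards=Term("ChEBI", "glucose"),
)
print(phenotype.describe())
```

Because the composed description is built only from existing terms, two curators who combine the same terms obtain equal objects, and no new term needs to be approved before the annotation can be made.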
The current scope of the OBO Foundry initiative is summarized in Table 2. Foundry ontologies are created and maintained by biologists with a thorough knowledge of the underlying science. Where domain experts jointly control ontology, data, and annotations (as in the case of the GO/UniProt collaboration), all three can be curated in tandem in a way that provides a reality check at each stage of the process34. As results of experiments are described in annotations, this leads to extensions or corrections of the ontology, which in turn lead to better annotation35. The results of the Foundry’s work can then be applied by external groups as benchmarks, for example to help identify genes mutated at significant frequencies in human cancers36 or to identify cellular components involved in antigen processing37 or, in general, to refine otherwise noisy results of text- and data-mining38-41.
A demonstration of the utility of the Foundry methodology is provided by ongoing work to create the NeuronDB database within the Senselab project (http://senselab.med.yale.edu/). NeuronDB encompasses three types of neuronal property: voltage-gated conductances, neurotransmitters and neurotransmitter receptors. An initial representation of neurotransmitters defined an ‘is_a’ hierarchy with classes such as ‘neurotransmitter receptor’ and subclasses such as ‘GABA receptor’. In this initial ontology, receptors were not defined, and strictly speaking one would not have known, for example, whether a receptor was a protein or a protein complex. The Foundry provided a set of principles and at least one test that may be applied in making such choices: namely, the scope of each ontology should be clearly bounded and (by orthogonality) no term should appear in more than one ontology. Reviewing the existing ontologies, we found that the GO Molecular Function (GO MF) ontology already had classes such as ‘receptor activity’ (GO:0004872) and a number of subclasses that described receptor activities that were referred to in NeuronDB.
We reviewed one hundred thirty resultant receptor classes. Where they existed, we reused MF classes; where they did not, we created subclasses of existing MF classes and submitted the results to GO for future inclusion. Arranging NeuronDB to interoperate transparently with GO provided the further benefit that we can now take advantage of GO annotations to find the proteins that correspond to the receptor classes by searching annotations to the MF terms. This is a model for how small ontology builders can constructively contribute to the growth of shared resources while simultaneously benefiting users of their own ontologies.
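The lookup described above, in which proteins are found by searching annotations to an MF term and its subclasses, can be sketched as follows. The function names are hypothetical, and the sample data are invented for illustration: only GO:0004872 comes from the text, while the `EX:` identifiers and protein names are placeholders.

```python
def descendants(term, is_a_parents):
    """Return `term` plus every class that reaches it through is_a links.

    `is_a_parents` maps each child class to the set of its direct parents.
    """
    result = {term}
    changed = True
    while changed:
        changed = False
        for child, parents in is_a_parents.items():
            if child not in result and parents & result:
                result.add(child)
                changed = True
    return result


def proteins_for_class(term, is_a_parents, annotations):
    """Proteins annotated to `term` or to any of its subclasses."""
    wanted = descendants(term, is_a_parents)
    return sorted({protein for protein, cls in annotations if cls in wanted})


# Hypothetical hierarchy beneath 'receptor activity' (GO:0004872).
is_a_parents = {
    "EX:0000010": {"GO:0004872"},  # placeholder receptor subclass
    "EX:0000011": {"EX:0000010"},  # placeholder sub-subclass
    "EX:0000099": {"EX:0000098"},  # unrelated branch
}

# Hypothetical (protein, class) annotation pairs.
annotations = [
    ("protein_A", "EX:0000011"),
    ("protein_B", "GO:0004872"),
    ("protein_C", "EX:0000099"),
]

receptor_proteins = proteins_for_class("GO:0004872", is_a_parents, annotations)
```

The lookup depends on the subclass links being present: a receptor class created outside the MF hierarchy would be invisible to it, which is one practical reason for submitting new subclasses back to the GO.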
In support of research on neurodegenerative and neurological disease within the Biomedical Informatics Research Network (BIRN)42, the BIRN Ontology Task Force is applying the Foundry principles to formally represent several large domains, including (i) neuroanatomy43, where annotations must capture not only the structural systems of parthood and topological connection but also cytoarchitectural parcellations such as the CA1, CA2 and CA3 regions of the hippocampus, (ii) functional systems, such as the basal ganglion circuits for motor planning and motor memory and (iii) neurochemistry (for example, of brainstem monoamine nuclei). The members of the BIRN Ontology Task Force see the Foundry as providing a framework within which these distinct axes can be algorithmically combined, and they are incorporating the results into BIRN’s neuroimage atlasing project and using them to integrate spatially mapped microarray expression data with mouse imaging results.
The Minimum Information for Biological and Biomedical Investigations (MIBBI) initiative represents the first new standards effort that takes OBO and the OBO Foundry as its role model44. MIBBI provides information resources to promote the consolidation of the many prescriptive checklists that specify core metadata items to be included when reporting results in a variety of experimentation domains45. The proliferation of such ‘minimum information’ checklists has made it increasingly difficult to obtain an overview of existing specifications, unnecessarily duplicating efforts and creating problems when third parties try to use described information. The MIBBI Portal operates analogously to OBO and the NCBO BioPortal as an open information resource for all initiatives addressing these problems; the MIBBI Foundry fosters collaborative development and integration of checklists into orthogonal modules46.
Like OBO, the OBO Foundry is an open community. Any individual or group working in the domain of biomedicine wishing to join the initiative is encouraged to do so, and all discussion forums (listed at http://obofoundry.org) are open to all interested parties without restriction. The recommended first step is to join one or more mailing lists in salient areas as a way to become familiar with the Foundry’s collaborative methodology and identify members with overlapping expertise. Those with new ontology resources are invited to submit them for informal consideration by existing members; this will be followed by a period in which compliance with the Foundry principles is addressed, especially as concerns potential conflicts in areas of overlap. Membership in the Foundry initiative then flows from a commitment to incremental implementation of these principles as they evolve over time, with the Foundry coordinators (currently Ashburner, Lewis, Mungall and Smith) serving as analogs of journal editors, whereby the division of labor that results from orthogonality helps ensure that development decisions are made by the authors of single ontologies. By joining the initiative, the authors of an ontology commit to working with other members to ensure that, for any particular domain, there is convergence on a single ontology. Criticism, too, is welcomed: the Foundry is an attempt to apply the scientific method to the task of ontology development, and thus it accepts that no resource will ever exist in a form that cannot be further improved.
Our long-term goal is that the data generated through biomedical research should form a single, consistent, cumulatively expanding and algorithmically tractable whole. Our efforts to realize this goal, which are still very much in the proving stage, reflect an attempt to walk the line between the flexibility that is indispensable to scientific advance and the institution of principles that is indispensable to successful coordination.
The Foundry is receiving ad hoc funding under the BISC Gene Ontology Consortium, MGED, NCBO and RNA Ontology grants. We are grateful to all of these sources, and also to the ACGT Project of the European Union and to the Humboldt and Volkswagen Foundations.