In recent years, as a knowledge-based discipline, bioinformatics has been made more computationally amenable. After its beginnings as a technology advocated by computer scientists to overcome problems of heterogeneity, ontology has been taken up by biologists themselves as a means to consistently annotate features from genotype to phenotype. In medical informatics, artifacts called ontologies have been used for a longer period of time to produce controlled lexicons for coding schemes. In this article, we review the current position in ontologies and how they have become institutionalized within biomedicine. As the field has matured, the much older philosophical aspects of ontology have come into play. With this and the institutionalization of ontology has come greater formality. We review this trend and what benefits it might bring to ontologies and their use within biomedicine.
In this briefing, we explore the current state and future prospects of the use of ontologies within bioinformatics and medical informatics. Since an earlier Briefing in 2000 , the role of ontologies within bioinformatics has changed markedly. It has moved from a niche activity to one that is, in all respects, a mainstream activity. It is useful, however, to remind ourselves why this interest is so large before we move on to review the current state and future prospects of biomedical ontologies.
Biology is unlike physics and much of chemistry in that—although it contains many laws and models—few of these are reduced to a mathematical form. It is not possible to take a protein’s sequence of amino acids, apply some formula, and derive a set of characteristics such as accurate three-dimensional shape, functionality, forms of modification, etc. Instead of mathematical laws, biomedical scientists use what they understand about characterized entities to make inferences about uncharacterized entities. This is, for example, the basis of the similarity search—similarity between biological sequences is made mathematically, but any inference about that similarity is made by a biologist reading annotations. What we are using to make these inferences is what we know about the entities being compared. This is our knowledge about those entities.
Instead of the convenience of mathematical forms, biomedical scientists collect facts, often recording them in natural language, and then use that knowledge to make inferences about yet uncharacterized observations. Yet, this knowledge is highly heterogeneous. While it is easy to compare, for instance, nucleic acid or polypeptide sequences between bioinformatics resources, the knowledge component of these resources is very difficult to compare, both for humans and computers, because the knowledge is represented in a wide variety of lexical forms [2, 3].
In computer science, ontologies are a technique or technology used to represent and share knowledge about a domain by modeling the things in that domain and the relationships between those things . These relationships describe the properties of those things; in essence, what it is to be one of those things in the domain being modeled. An ontology represents a conceptualization of reality or simply reality (This philosophical aspect of the ontological discipline is beyond the scope of this article). The labels used for the things and their properties in an ontological model can provide a language for a community to talk about their domain. By agreeing on a particular ontological representation, a common vocabulary can be used to describe and ultimately analyze data.
Such sharing has obvious benefits for humans using facts to help make inferences about a domain of study. Those facts, the knowledge about the domain, become much easier to handle as the same things are referred to in the same manner across the resources in which those facts are stored. Ultimately, we would like to be able to handle knowledge computationally in a manner comparable to that in which we handle numeric data. What is more, as will be described later in the ‘New formalisms and tools for representing bio-ontologies’ section, given a well-defined semantics for the knowledge representation language, machines can make inferences about the facts expressed in that language.
This article will show how this basic idea has become a central theme within biomedical research, to the stage where it now has a national center in the US (see the section titled ‘Institutionalization of bio-ontologies’). The section titled ‘Timeline and recent additions’ shows how ontologies have a long history in the biomedical domain, particularly in biology, and now represent a broad spectrum of important biological knowledge. Later in the article, the future direction of these current trends will be explored. It is not possible in such an article to do justice to all the resources available. Our aim, however, is to give a ‘briefing’ as to what exists. Electronic references to the ontological resources are available in Appendix 1.
Human beings like to put the things (instances) they see around them into categories. What is more, categories can have subcategories. We see classification throughout human activities: we do it to people, library books, Web pages, etc. Biomedical scientists are no different. Biologists have long classified the phenomena they observe in the world around them. After mediaeval bestiaries, a classic starting point for talking about classification in biology is the Linnaean classification of species . This classification is all pervasive and species taxonomies still form a backbone of how we talk about biological data, especially in the realm of evolution.
Ontology and classification are, however, not the same. Classification might be a component of ontology, but the latter adds something more. An ontology attempts to describe what we understand exists in our domain and to capture what it is to belong to one of the classes, categories or types in that model. An ontology, more formally, is a set of logical axioms that form a model of a portion of (a conceptualization of) reality (after ). There are many artifacts that are called ontologies. One’s bias usually depends on the purpose of the modeling, the representation used for modeling and one’s philosophical viewpoint . What computer scientists call ontologies are not really ontologies; they are knowledge structures or conceptual models, but the term has now been established. So, in this article, we are very inclusive in what we call ‘ontology’.
This article is not the place for a deep discussion of what counts as a real ontology in the true philosophical sense of the discipline. It is not that such a debate is wasted, but, for the most part, what we call ontologies are being built to perform a job: sharing what we understand about the world of biomedicine. The spectrum of ontology-like structures ranges from controlled vocabularies, thesauri, directed acyclic graphs and frame-based systems to rich logical axioms encapsulating our knowledge . In this article, almost anything along this spectrum will be included, but the further toward the right-hand end of the spectrum (that is, toward rich logical axioms), the more ‘ontology-like’ (from a computer science perspective) the artifact becomes.
The use of the word ontology within biology is relatively recent. Figure 1 shows a timeline for the appearance of what we might call ontologies or ontology-like artifacts within bioinformatics. In the early phase, computer scientists had a technique for knowledge representation (from which they build what they call ontologies). They recognized in biological data a domain in which such techniques were needed in order to overcome the massive semantic heterogeneity in the domain [2, 3]. Rich, high-fidelity models of biology, such as can be provided by ontologies, are also seen as a way of providing a means of forming knowledge bases such as EcoCyc , RiboWeb  and PharmGKB . In TAMBIS (Transparent Access to Multiple Bioinformatics Information Sources), we also see the use of ontologies to form a global schema over multiple heterogeneous resources . Here, ontology forms a mechanism for building queries by using a common ontological form that is mapped to each of the underlying resources. Finally, in this phase, we see the use of ontology as a reference model of what exists in biology. The Molecular Biology Ontology (MBO)  was an early attempt to begin to define the entities in the domain to promote consistent interpretation across resources.
A second phase saw the adoption of ontology by the biological community itself. Pre-eminent among these efforts is the Gene Ontology (GO) . Biologists recognized that, as whole genomes became available, nucleic acid and polypeptide sequence data allowed easy comparative studies. The problem, however, was that, while sequence comparison was easy, comparing the functional annotation of those data was hard. In order to address this problem, the mouse, yeast and fly gene communities came together to develop the GO. The GO has three aspects or separate ontologies: molecular function, biological process and cellular component.
Together these capture three of the major aspects that biologists wish to describe about the gene products they place in databases. As genome database providers commit to the GO (that is, they agree with its view of the world) and adopt the terminology delivered by the GO, each resource describes its gene products in a common form. This sharing, together with the structure provided by the relationships between terms in the GO (Figure 2), makes querying within and between resources possible (Figure 3).
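The way shared terms and their relationships support querying across resources can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `EX:` term names, the two resources and the gene identifiers are all hypothetical, and the traversal is a generic walk over parent links rather than the implementation used by any GO tool.

```python
# Hypothetical mini-ontology: child term -> list of (relation, parent term).
PARENTS = {
    "EX:catalytic_activity": [("is_a", "EX:molecular_function")],
    "EX:kinase_activity": [("is_a", "EX:catalytic_activity")],
    "EX:protein_kinase_activity": [("is_a", "EX:kinase_activity")],
    "EX:binding": [("is_a", "EX:molecular_function")],
}

# Hypothetical annotations held by two separate resources that have both
# committed to the same vocabulary.
ANNOTATIONS = {
    "resource_A": {"geneA1": "EX:protein_kinase_activity", "geneA2": "EX:binding"},
    "resource_B": {"geneB1": "EX:kinase_activity"},
}


def ancestors(term):
    """Return the term plus every term reachable by following parent links."""
    seen = {term}
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in PARENTS.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def query(term):
    """Find gene products, in any resource, annotated to `term` or a descendant."""
    hits = []
    for resource, genes in ANNOTATIONS.items():
        for gene, annotated_term in genes.items():
            if term in ancestors(annotated_term):
                hits.append((resource, gene))
    return sorted(hits)


if __name__ == "__main__":
    print(query("EX:kinase_activity"))
```

Because both resources use the same terms, a single query for the more general term retrieves gene products annotated with more specific terms in either resource.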
From its start with some 3500 terms in 1998, covering three databases, GO now holds some 20,000 terms and is adopted by about 20 databases. These are largely species-specific genome databases, but also include cross-community resources such as UniProt and InterPro.
The Gene Ontology (GO) has been phenomenally successful and it is useful to examine why this has been so. The Gene Ontology has put its success down to the following points :
The success of the GO in meeting its objectives, its wide uptake by other databases for attributing gene product functionality and finally the use of the GO outside its original use has led to many other groups starting to develop ontologies for database annotation. In order to provide some coordination to these efforts, the OBO consortium was established.
OBO is guided by a set of principles that are used to give coherence to wider ontological efforts across the community:
Through these simple criteria, the ontology community is attempting not to repeat the errors most of their ontologies have been developed to resolve, namely the massive syntactic and semantic heterogeneity extant in bioinformatics resources. There are many resources under the OBO umbrella, and most of these are shown in Figure 4, in which the OBO ontologies have been roughly arranged along a spectrum from genotype to phenotype.
The two most significant OBO ontologies are the GO  and the Sequence Ontology . The former is used to annotate the principal attributes of gene products and the latter provides a vocabulary to describe the features of biological sequences. A common language to describe parts (regions) of nucleic acid and protein sequences across many resources has a potentially huge impact not only on querying but also on the computational analysis of biological sequence data.
Moving along the spectrum toward phenotype, we see increasing numbers of species ontologies on the same subject: development and anatomy. While the description of sequence features and major attributes of gene products might be core to molecular biology, these descriptions need to be placed in a context. At what stages of development are these sequence features and these gene products important? In what organ, tissue or other anatomical parts are these gene products important? Obviously, each species has its own development and anatomy, but an interesting trend over the coming years will be efforts to explore what different groupings of organisms have in common. In a sense, all explorations of molecular biology are a search for mechanisms that produce a phenotype. As a consequence, we are seeing a general trend towards descriptions of phenotype.
Other OBO ontologies include some that describe experiments that generate biological data. Foremost amongst these is the Microarray Gene Expression Data (MGED) ontology . This ontology provides a vocabulary for describing a biological sample used in an experiment, the treatment that the sample receives in the experiment and the microarray chip technology used in the experiment. This basic information will aid researchers exploring third-party data to validate comparisons between data and help confirm interpretations of data. It is, after all, necessary to know how an experiment was performed in order to interpret findings and make comparison between interpretations. As more high-throughput experimental techniques come into play across the domain, each needing vocabularies, the Functional Genomics Ontology (FUGO) (http://www.fugo.org) has been conceived in order to bring coherence to these ontological developments.
Use of clinical terminologies has a much longer history in medicine. Being able to predict disease outbreak is predicated upon reliable aggregation of statistics on those diseases. Yet, if different communities use different terminologies for the diseases being monitored, then those statistics and hence predictions become unreliable. As long ago as the early 17th century, the authorities in London drew up a list of ‘ways in which people died’. For example, the term ‘French pox’ was used for the same cause of death in each London parish, and consequently more reliable statistics were gathered. The London Bills of Mortality remained in use for many years and not just in London. In the late 1880s, the International Classification of Diseases (ICD) was published. This brought the old Bill of Mortality’s terminology up to date and provided mankind with some 200 ways of dying (what conveniently fitted on two sides of paper). ICD is now in its 10th edition and now has some 13 000 rubrics.
This need for coding is central to the use of terminology in medicine. Originally created for epidemiological purposes, ICD now plays a major role in billing within hospitals. To make this task more complex, several vocabularies have been developed for similar purposes; exactly the problem that the Open Biomedical Ontologies Consortium wishes to avoid. Figure 5 shows the timeline for the appearance of these terminologies. The need for such common, shared means of referring to phenomena of interest has a longer history in medicine, perhaps reflecting its more immediate practical benefit (not dying, for instance). Classification of what we know about the world, the putting of things into categories, is such a natural human activity that no domain can claim its use first. The use of the word ontology, in its computer science usage to denote a means of capturing and sharing a common representation of knowledge, is fairly recent and dates back less than 20 years in both fields.
For many years, the ICD was the only medical terminology. In medicine as in biology, the increasing use of information technology and increasing quantities of data have highlighted the need to be able to talk about medicine in a common manner for both humans and machines. It does not take long to think of the consequences of prescribing drugs if inconsistent and confusing terminology is used for drugs, prescribing regimes and side effects. An attempt to make those vocabularies ‘interoperable’ is represented by the Unified Medical Language System® (UMLS®), a terminology integration system comprising over 130 biomedical vocabularies . There is a debate about whether these artifacts are ontology. This is not the forum for that debate, but suffice it to say that these artifacts are structured representations of things in the biomedicine domain.
Figure 6 shows these medical terminologies arranged according to ‘phenome’, or space of observable characteristics and along the ‘prescriptome’ or space of treatments. This movement from left to right transitions from anatomy, physiology and biochemistry (how the normative human organism, or common variants of it, are supposed to work and how they respond to stressors), through symptoms that suggest one or more diseases and further investigations to filter that list, to treatment options with goals and outcomes on the far right.
Often referred to as a ‘cottage industry’ by Mark Musen, ontology development was indeed characterized, until recently, by individual researchers modeling knowledge for particular applications, without sophisticated tools or formalisms and independently of existing ontologies. As a result, the ontologies of this era were only minimally sharable and reusable. More recently, the equivalent of an industrial revolution for ontology was marked by the appearance of both new technologies (see the section titled ‘New formalisms and tools for representing bio-ontologies’) and institutions. It is beyond the scope of this article to give an exhaustive list of ontology centers, even in biomedicine. The institutions presented below were selected because of their impact on the community at large.
The Institute for Formal Ontology and Medical Information Science (http://www.ifomis.uni-saarland.de/) (IFOMIS) was founded in 2002 with a grant from a German nonprofit foundation, the Alexander von Humboldt Foundation. Directed by Barry Smith, a philosopher, IFOMIS is an interdisciplinary research group, with members from philosophy, computer and information science, logic, medicine, and medical informatics. Over the past years, IFOMIS has contributed to applying formal ontology to biomedicine (e.g. ) and has developed collaborations with developers of biomedical ontologies such as the Gene Ontology Consortium and the Structural Informatics Group at the University of Washington.
Created as part of the National Centers for Biomedical Computing in 2006 and funded by the National Institutes of Health, the National Center for Biomedical Ontology (http://bioontology.org/) (NCBO), led by Mark Musen and Suzanna Lewis, defines itself as ‘a consortium of leading biologists, clinicians, informaticians, and ontologists who develop innovative technology and methods that allow scientists to create, disseminate, and manage biomedical information and knowledge in machine processable form.’ NCBO is now involved in the development of ontologies from the OBO family. The Center draws on the experience of long-time contributors to the field of biomedical ontology, both on the side of the content (with several core members of the Gene Ontology and OBO Consortia; see the section titled ‘The Gene Ontology Phenomenon’) and on the side of the tools (with key contributors to the ontology editor and knowledge acquisition system Protégé; see the section titled ‘Protégé’). NCBO is doing much to draw together activity within the biomedical ontology field and to maintain and encourage coherence and perceived best practice in ontology development.
Other ontology centers have been created recently, both in Europe and the US, with a focus on ontological research, but not limited to biomedicine in their applications. The National Center for Ontological Research (http://ncor.us/) (NCOR) was established in 2005 and is codirected by Barry Smith and Mark Musen. The European Center for Ontological Research (http://www.ecor.uni-saarland.de/) (ECOR) was founded in 2004 and is currently directed by Nicola Guarino.
Over the past couple of years, the interest of the Semantic Web community (http://www.w3.org/2001/sw/) has shifted in part toward the healthcare and life sciences community . One year after a successful workshop bringing together over 100 biologists, computer scientists and other researchers, the World Wide Web Consortium (W3C) announced the creation of the Health Care and Life Sciences Interest Group (http://www.w3.org/2001/sw/hcls/) in November 2005, “to develop and support the use of Semantic Web technologies to improve collaboration, research and development, and innovation adoption in the of Health Care and Life Science domains.” Several task forces currently address key areas necessary for the implementation of a Semantic Web for healthcare and life sciences, for example, the conversion of existing resources into the Semantic Web formalisms RDF (Resource Description Framework) and OWL (Web Ontology Language). Semantic Web technologies are presented in more detail in the section titled ‘Semantic web technologies’.
In the past ten years, bio-ontologies have become ‘mainstream’ in biomedical conferences and the literature. The pioneering workshop in the field was created in 1998 at the Intelligent Systems for Molecular Biology (ISMB) conference (http://www.iscb.org/), and has been held annually since. There is now an ontology track at ISMB. A successful session on ‘Biomedical ontologies’ was organized at the Pacific Symposium on Biocomputing (http://psb.stanford.edu/) (PSB) for three years (2003–2005). Similarly, the number of presentations on ontology has steadily increased at medical informatics conferences such as the American Medical Informatics Association (http://www.amia.org/) (AMIA) Annual Symposium, Medical Informatics Europe (MIE), organized by the European Federation for Medical Informatics (http://www.efmi.org/) (EFMI), and Medinfo, organized by the International Medical Informatics Association (http://www.imia.org/) (IMIA).
As shown in Figure 7, the number of articles on ontology has grown exponentially in PubMed/Medline, from fewer than 10 in 1996 to almost 500 in 2005. Noticeably, over half of the growth is attributable to the GO. Bio-ontologies appear in the literature through permanent sections and special issues. For example, the leading journal Bioinformatics has an ontology section. Recently, two major medical informatics journals have each devoted a special issue to bio-ontologies. Issues 7–8 of Computers in Biology and Medicine (Volume 36, July–August 2006) present 14 papers on various aspects of biomedical ontology [20–33], ranging from ontology development, evaluation and mapping to the use of ontologies for ontology integration, semantic similarity computation and task modeling. Also presented are ontologies for specialized domains including public health, colon carcinoma, adverse drug reactions and heart failure. Issue 3 of the Journal of Biomedical Informatics (Volume 39, June 2006) is a collection of 10 papers presented at the 2005 meeting of the International Medical Informatics Association Working Group 6 [5, 34–43]. This series of articles offers a more formal perspective on biomedical ontologies, discussing issues such as reality, granularity, mereology and reference ontologies. Together, these two journal issues provide a panorama of bio-ontologies, covering both foundational issues and practical aspects.
The book Ontologies for Bioinformatics , published in 2005, provides a good technological overview of bio-ontologies in the context of the Semantic Web. The introduction to ontologies puts a strong emphasis on Semantic Web technologies (see the section titled ‘Semantic web technologies’), with examples from bioinformatics. The chapters devoted to ‘Building and using ontologies’ also present query languages and transformation methods based on XML. The last part of the book is an introduction to Bayesian networks. As this summary suggests, this book takes an extremely broad view of ontology, even including XML Schema. Also of interest to bioinformaticians is the Handbook on Ontologies , which presents ontology from the perspective of computer science rather than bioinformatics. Besides the expected chapters on ontology languages and ontology engineering, the Handbook is also relevant to our community with chapters on building ontologies from medical thesauri  and ontologies in bioinformatics . Finally, Ontologies in Medicine  is a collection of nine papers reporting on issues in and applications of ontologies in the medical domain.
Biomedical terminologies are typically large, covering tens to hundreds of thousands of entities (e.g. about 20 000 for the GO and 300 000 for SNOMED Clinical Terms). Until recently, no widely used ontology development environments (as opposed to ontology editors, to use a software development analogy) were available and ontologies were developed essentially ‘by hand’ or with rudimentary tools such as file-system-like tree editors. In the past 15 years, Protégé has emerged as the leading ontology editor across disciplines. At the same time, description logics (DL) have superseded frame-based languages to become the leading formalism for representing ontologies. Finally, Semantic Web technologies are playing an increasing role in knowledge representation. This cross-discipline view is in contrast to that in bioinformatics and medical informatics. Within bio-ontology, in-house tools have been developed by the Gene Ontology Consortium in the form of DAG-Edit and latterly OBO-Edit. Medical informatics has used a variety of tools, either proprietary or open-source. In this section, we briefly review some knowledge representations and ontology development tools.
Developed by the Stanford Medical Informatics group with funding from various US Government agencies in the past 15 years (and now a core technology of the National Center for Biomedical Ontology), Protégé (http://protege.stanford.edu/) is the leading ontology editor across disciplines, with a community of about 50 000 users, representing research and industrial projects in more than 100 countries. Originally developed for representing frame-based ontologies, in accordance with the Open Knowledge Base Connectivity (OKBC) protocol, Protégé has evolved, in collaboration with the University of Manchester, to represent ontologies in OWL, which is based on description logics. Many large biomedical ontologies have adopted Protégé for their representation, including the Foundational Model of Anatomy (frame-based) and the NCI Thesaurus (DL-based), though Protégé is not used for the majority of OBO ontologies. Besides the support of OWL, recent changes to Protégé include support for exporting Protégé ontologies into a variety of formats (e.g. RDF/S, OWL and XML Schema; see the section titled ‘Semantic web technologies’). Based on an open architecture, Protégé can be extended through plug-in components, some of which are contributed by users. Examples of services provided through the 69 plug-ins currently available for Protégé include ontology visualization (OntoViz), ontology alignment (PROMPT) and interfaces with rule engines (e.g. Jess, http://www.jessrules.com/jess/index.shtml) and formalisms (e.g. SWRL, the Semantic Web Rule Language (http://www.w3.org/Submission/SWRL/)).
It is beyond the scope of this article to give a detailed introduction to description logics. (The interested reader is referred to  for more information). Instead, we will show why they have emerged as a popular ontology language in biomedicine and other domains. Intuitively, highly expressive knowledge representation formalisms such as first-order logic (FOL) could be thought of as ideal for ontologies. In practice, however, reasoning in FOL is undecidable in general and hence intractable, or, more simply, too complex to be computed. Description logics represent a family of languages defined as a trade-off between expressivity and tractability. The aforementioned OWL can be used to illustrate this trade-off. OWL actually comes in three varieties of decreasing expressivity but of increasing tractability: OWL Full, OWL DL and OWL Lite .
DLs are usually considered sufficiently expressive to represent most biomedical ontologies. The first large biomedical ontology developed with description logics was GALEN—the Generalized Architecture for Languages, Encyclopedias and Nomenclatures in medicine. The development of GALEN started in the early 1990s—before the times of the Semantic Web—and its authors started by designing a DL-based language for representing medical knowledge: GRAIL, the GALEN Representation And Integration Language . Another important milestone in the use of DLs for developing biomedical terminologies is the creation of SNOMED Clinical Terms (SNOMED CT). Not only did SNOMED CT result from merging two major clinical terminologies—SNOMED Reference Terminology (SNOMED RT) and Clinical Terms Version 3 (formerly known as the Read Codes), but it was also engineered using a different technology: a DL-based authoring system developed by Apelon (http://www.apelon.com/). Other large biomedical terminologies such as the NCI Thesaurus have recently adopted OWL for their representation . With OWL DL becoming a de facto standard ontology language, many attempts to convert existing terminologies and ontologies into OWL DL have taken place recently (e.g. MeSH ). However, in most cases, converting to OWL DL is not simply a matter of syntactic translation: information implicit in the formalism of origin may need to be made explicit in OWL DL in order to fully take advantage of the possibilities offered by the language, which often requires enriching the original representation [54, 55].
In addition to contributing to specialized domains such as healthcare and life sciences, the World Wide Web Consortium (W3C) creates the very infrastructure of the Semantic Web. The W3C originally developed the specifications of HTML, the markup language used to represent documents in the World Wide Web. Similarly, the W3C produced the specifications of other formalisms for representing documents, resources and ontologies, including XML, RDF/S and OWL. Collectively known as Semantic Web technologies, these specifications define the building blocks of the Semantic Web. Building upon them, additional formalisms are defined to represent, for example, rules. Some of these technologies will be briefly reviewed, with emphasis on their relations to biomedical applications. The interested reader is referred to the corresponding chapters in  for further information.
The Resource Description Framework (RDF) extends the capabilities of the Extensible Markup Language (XML) in that it enables many-to-many relationships between resources and data. The resulting structure is a graph in which the nodes are resources (identified by a Uniform Resource Identifier or URI) or data (e.g. strings, numerals) and the edges are relationships (called properties). RDF integrates limited inference rules, making it possible, for example, to define subclasses and subproperties. Some extensive resources such as UniProt have already been converted to RDF (http://expasy3.isb-sib.ch/~ejain/rdf/). The BioRDF (http://esw.w3.org/topic/HCLSIG_BioRDF_Subgroup) task force of the W3C Semantic Web Health Care and Life Sciences Interest Group currently investigates methods whereby existing resources can be converted to RDF.
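The graph structure and the limited inference described above can be sketched without any RDF library. In this illustration, triples are plain Python tuples; the `ex:` names are hypothetical, while `rdf:type`, `rdfs:subClassOf` and `rdfs:subPropertyOf` are the standard RDF Schema terms. Only three entailment patterns are applied, which is an assumption made for brevity and not a complete RDFS reasoner.

```python
# Triples are (subject, property, object). Resources stand in for URIs;
# the quoted string is a data (literal) node.
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
SUBPROP = "rdfs:subPropertyOf"

triples = {
    ("ex:Kinase", SUBCLASS, "ex:Enzyme"),
    ("ex:Enzyme", SUBCLASS, "ex:Protein"),
    ("ex:phosphorylates", SUBPROP, "ex:modifies"),
    ("ex:protein1", TYPE, "ex:Kinase"),
    ("ex:protein1", "ex:phosphorylates", "ex:protein2"),
    ("ex:protein1", "ex:label", '"example kinase"'),
}


def rdfs_closure(graph):
    """Apply three RDFS entailment patterns until no new triple appears."""
    inferred = set(graph)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in inferred:
            for s2, p2, o2 in inferred:
                # subClassOf is transitive.
                if p == SUBCLASS and p2 == SUBCLASS and o == s2:
                    new.add((s, SUBCLASS, o2))
                # An instance of a class is an instance of its superclasses.
                if p == TYPE and p2 == SUBCLASS and o == s2:
                    new.add((s, TYPE, o2))
                # A statement using a property also holds for its superproperty.
                if p2 == SUBPROP and p == s2:
                    new.add((s, o2, o))
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred


if __name__ == "__main__":
    for triple in sorted(rdfs_closure(triples) - triples):
        print(triple)
```

The printed triples are those that were not asserted but follow from the subclass and subproperty declarations, for example that `ex:protein1` is also an `ex:Protein`.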
OWL plays a central role in bio-ontologies and has been mentioned multiple times already. OWL DL, the description logic flavor of OWL, is particularly well suited for representing bio-ontologies. In addition to many bio-ontologies, BioPAX (http://www.biopax.org/), a data exchange format for biological pathway data, uses OWL for its representation.
The inference supported by RDF and OWL is limited compared to rule-based languages. For example, clinical decision support systems typically require complex knowledge better expressed with rules. The role of ontologies in this context is to provide the vocabulary used in the rules. The Arden Syntax is one example of formalisms developed for representing rules supporting medical practice (e.g. drug interactions). Recent efforts related to Semantic Web technologies include SWRL (http://www.w3.org/Submission/SWRL/) and the rule markup language RuleML (http://www.ruleml.org/).
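The division of labor described above, in which the ontology supplies the vocabulary and the rules supply the decision logic, can be sketched as follows. This is not Arden Syntax, SWRL or RuleML; it is a plain Python stand-in, and the drug names, class names and the rule itself are hypothetical placeholders with no clinical meaning.

```python
# Hypothetical drug classification: drug -> its direct class, class -> parent.
IS_A = {
    "drug_x": "ClassP",
    "drug_y": "ClassQ",
    "drug_z": "ClassQ",
    "ClassP": "Drug",
    "ClassQ": "Drug",
}


def classes_of(entity):
    """All classes an entity belongs to, following is_a links upward."""
    found = []
    while entity in IS_A:
        entity = IS_A[entity]
        found.append(entity)
    return found


# A rule is written against ontology classes, not individual drug names,
# so it applies to any drug later classified under those classes.
RULES = [
    {"if_classes": ("ClassP", "ClassQ"), "then": "alert: possible interaction"},
]


def evaluate(prescription):
    """Return (drug, drug, message) for every rule that fires on a drug pair."""
    alerts = []
    for rule in RULES:
        first, second = rule["if_classes"]
        for a in prescription:
            for b in prescription:
                if a != b and first in classes_of(a) and second in classes_of(b):
                    alerts.append((a, b, rule["then"]))
    return alerts


if __name__ == "__main__":
    print(evaluate(["drug_x", "drug_z"]))
```

Adding a new drug under `ClassQ` in the classification makes the existing rule apply to it without the rule being edited.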
Some formalisms and tools have been developed specifically by the bio-ontology community, where they enjoy great popularity. OBO-Edit (http://www.geneontology.org/GO.sourceforge.links.shtml#obo) is ‘an open source, platform-independent application for viewing and editing OBO ontologies’. Formerly known as DAG-Edit, OBO-Edit is a tool for visualizing and editing the graph structure of an ontology. The OBO format is used to represent the majority of the ontologies seen in Figure 4. It covers a large subset of the expressivity allowed in OWL (see the section titled ‘Limitations’). It allows the creation of types, subtype relationships and other kinds of relationship. It can express disjunction of types and features of relationships such as transitivity and symmetry. It does not express, for example, quantification in relationships, nor does it allow expressions to be built using types. Conversely, the OBO format has several built-in features for supporting terminology, as opposed to ontology, that OWL does not. It has built-in support for thesaurus constructs and semantic-free identifiers. It also supports view-like mechanisms over a terminology.
As illustrated in Figure 8, the OBO format is informally expressed, but its extensive documentation (http://www.geneontology.org/GO.format.shtml#oboflat) can be used to derive the language semantics which means it can be converted into OWL (that is, the semantics of the language are the same). Indeed, the GO has provided an OWL translation of its ontologies for many years. The directed acyclic graph used by the GO is a subset of the OBO format.
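To make the relationship between the OBO flat file and OWL more concrete, the following minimal Python sketch reads `[Term]` stanzas and emits subclass-style triples. The three stanzas are a hand-written fragment with made-up `EX:` identifiers, the function names `parse_obo` and `to_triples` are hypothetical, and only the `is_a` and `relationship` tags are handled. Rendering a relationship as `rel some target` is assumed here as the existential reading; this is an illustration under those assumptions, not the official OBO-to-OWL translation.

```python
from collections import defaultdict

# Hand-written fragment in OBO flat-file style (identifiers are made up).
OBO_TEXT = """\
[Term]
id: EX:0000001
name: organelle

[Term]
id: EX:0000002
name: nucleus
is_a: EX:0000001 ! organelle
relationship: part_of EX:0000003 ! cell

[Term]
id: EX:0000003
name: cell
"""


def parse_obo(text):
    """Return one {tag: [values]} mapping per [Term] stanza."""
    terms, current = [], None
    for raw in text.splitlines():
        # Drop the trailing "! comment"; a sketch-level simplification.
        line = raw.split("!", 1)[0].strip()
        if not line:
            continue
        if line.startswith("["):
            # Only [Term] stanzas are kept; other stanza types are skipped.
            current = defaultdict(list) if line == "[Term]" else None
            if current is not None:
                terms.append(current)
            continue
        if current is not None and ":" in line:
            tag, value = line.split(":", 1)
            current[tag.strip()].append(value.strip())
    return terms


def to_triples(terms):
    """Map is_a to subClassOf and relationships to 'rel some target'."""
    triples = []
    for term in terms:
        subject = term["id"][0]
        for parent in term["is_a"]:
            triples.append((subject, "subClassOf", parent))
        for rel in term["relationship"]:
            rel_name, target = rel.split()
            triples.append((subject, "subClassOf", f"{rel_name} some {target}"))
    return triples


for triple in to_triples(parse_obo(OBO_TEXT)):
    print(triple)
```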
Seen in the context of how GO and OBO have developed (see the section titled ‘The Gene Onthology Phenomenon’), the development of the language and its tools have been central to the success of biologists’ uptake of ontology. It should be remembered that representations such as OWL are more recent additions to the catalog of representations and their use is still being explored. In addition, the OBO community has paid more attention to the needs of a biologist type of user than the knowledge representation specialist in, for instance, the OWL tools.
Apart from DAG-Edit, the Gene Ontology Consortium and the wider community have built a wide range of tools and resources, such as AmiGO (see Figures 2 and 3), that allow display and querying of the GO and annotations stored in a specialist GO database. Further tools allow searching GO, annotating data using GO, and microarray analysis. A catalog of these tools can be found at the Gene Ontology Web site (http://www.geneontology.org).
COBrA is another ontology editor developed within the bioinformatics community, this time by a group interested in developmental anatomy. COBrA has the standard editing features and can export to both OBO format and Semantic Web languages. It is distinguished by giving prominence to the formation of links between ontologies, for instance, joining a tissue type to a cell type. As various ontologies, especially those in OBO, become cross-linked, such features as the support of modularization in ontologies will become of increasing importance.
Formal ontology stems from philosophy and provides a rigorous framework for understanding and representing differences between entities. Counterintuitively, formal ontology is not the same as the formal languages used to represent ontologies. That is, an ontology expressed in a formal language such as OWL does not necessarily adhere to the principles of formal ontology, although the formality of the language can help in making ontological distinctions. This section briefly reviews some important formal ontological distinctions and properties and their applications. The notions of top-level ontology and reference ontology are presented next. We then emphasize the importance of relations in bio-ontologies before illustrating some of the current limitations of formal languages used in bio-ontologies.
Important formal ontological distinctions include the difference between continuants, which continue to exist through time, and occurrents (or processes), which unfold through time in successive phases. Continuants are themselves divided into dependent and independent continuants, based on whether or not they require the existence of any other entity in order to exist. Occurrents always depend on some independent continuant. For example, the process oxygen transport and the dependent continuant oxygen transporter both depend on the independent continuant oxygen. These distinctions, along with metaproperties such as identity, rigidity, unity and dependency, form the basis for OntoClean, a methodology for analyzing and validating ontologies.
The top-level distinctions presented in the preceding text can be used as the basis for creating top-level (or upper-level) ontologies, i.e. ontologies in which high-level categories are defined. All entities and processes constitutive of a particular domain can then be defined in reference to (e.g. as subclasses of) these top-level categories. As mundane as it might seem to biologists, upper-level ontologies end up being discussed in mainstream biology journals. To date, it is probably fair to say that there is no agreement on what constitutes a good top-level ontology. Candidates include the Basic Formal Ontology (BFO), the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE) and the Suggested Upper Merged Ontology (SUMO). The UMLS Semantic Network (http://semanticnetwork.nlm.nih.gov/) is sometimes regarded as an upper-level ontology for the biomedical domain.
Ontologies defined independently of specific objectives are often referred to as reference ontologies. By definition, top-level ontologies should be reference ontologies as they constitute the top-level structure of many domain ontologies. However, the notion of reference ontology can be extended to domain ontologies. For example, the Foundational Model of Anatomy (FMA), a reference ontology of structural anatomy, has been proposed as a reference for describing physiology and pathology. More generally, cell types and chemical entities are often referred to in the names of other entities, such as cytotoxic T cell differentiation and 6-alpha-maltosylglucose catabolism. Ontologies of cell types (e.g. the OBO cell ontology) and chemical entities (e.g. ChEBI, the Chemical Entities of Biological Interest) could be used as a reference and guide the development of the ontology of biological processes in the GO. This strategy is being implemented progressively by the Gene Ontology Consortium, in part through the Obol language.
The semantics of the relations used in most biomedical terminologies are weak. For example, in the Medical Subject Headings (MeSH), A narrower than B simply means that users interested in Bs might also be interested in As. The MeSH terms found under Accidents include kinds of accidents, as expected (e.g. Traffic accidents), but also Accident prevention. In contrast, A isa B implies that all As are also Bs, i.e. that A necessarily inherits all the properties of B. The publication of the OBO relations therefore represents an important contribution to bio-ontologies. That publication defines 10 relations: isa, part_of, located_in, contained_in, adjacent_to, transformation_of, derives_from, preceded_by, has_participant and has_agent. Interestingly, these relations were defined and agreed upon by a multidisciplinary group including philosophers, physicians, biologists and computer scientists. Logical definitions are provided for each relation and relations are defined at both class and instance levels whenever appropriate. This core set of relations has been proposed for use in the OBO family of ontologies. Moreover, some relations such as has_participant and has_agent are defined in reference to formal ontological distinctions between continuants (e.g. the lungs) and processes (e.g. breathing), the processes having continuants as their agents or participants.
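The difference between class-level and instance-level relations can be illustrated with a small Python sketch. It reads a class-level isa as "every instance of A is an instance of B" and a class-level part_of as "every instance of A is part of some instance of B". This is a simplified reading that ignores any temporal qualification, and the instance identifiers and the tables `INSTANCES` and `PART_OF` are hypothetical.

```python
# Instance-level facts; every identifier here is hypothetical.
INSTANCES = {
    "Nucleus": {"n1", "n2"},
    "Organelle": {"n1", "n2", "m1"},
    "Cell": {"c1", "c2", "c3"},
}

# Instance-level part_of assertions as (part, whole) pairs.
PART_OF = {("n1", "c1"), ("n2", "c2"), ("m1", "c3")}


def is_a(a, b):
    """Class-level isa: every instance of a is also an instance of b."""
    return INSTANCES[a] <= INSTANCES[b]


def part_of(a, b):
    """Class-level part_of: each instance of a is part of some instance of b."""
    return all(
        any((x, y) in PART_OF for y in INSTANCES[b])
        for x in INSTANCES[a]
    )


print(is_a("Nucleus", "Organelle"))
print(part_of("Nucleus", "Cell"))
print(is_a("Organelle", "Nucleus"))
```

Under this reading a class-level assertion is a claim about all instances, which is what distinguishes it from a navigational link such as narrower than in MeSH.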
Formality, both in the ontological and representation language sense, is a stern friend. A formal language has a well-defined interpretation of the world and a well-defined language with which to say things about that world. The OBO relations, described in the preceding text, take a standard logical view of binary relationships and describe a world with binary relationships between individuals (instances of a class). Expressed in OWL, each and every instance of a class must hold such a relationship (or none at all hold the relationship). In this sense, OWL talks about universals. These instances form sets or classes. Subclass relationships can hold and, by implication, every instance in a subclass must also be an instance of its superclass.
In OWL, we can place some kind of quantification on what goes at the other end of a relationship (its successor). It is possible to say there is at least one successor (existential quantification) or that an instance of a class of objects is the only kind of instance that may appear as a successor (universal quantification). In OWL, a modeler can use these constructions to describe restrictions on what instances may be members of a class. These conditions can be of two types:
For example, the OWL class expression in Figure 9 shows a complete definition for a ReceptorProteinTyrosinePhosphatase. These OWL statements state that a ReceptorProteinTyrosinePhosphatase is any protein that, among other things, has at least one TyrosinePhosphataseCatalyticDomain and at least one TransmembraneDomain. Any protein having these features can be recognized to be a member of this class of protein phosphatase. Note the phrase ‘among other things’—OWL has an assumption of an open world. Just because our description does not mention other sequence features or domains that are possible, or functions, substrates, processes, etc. it does not mean there are none—we simply have not mentioned them. OWL explicitly states what is known, whether to the positive or negative. Unless explicitly stated, the model simply does not know. As there is much we do not know about biology, OWL’s open-world assumption can make a lot of sense.
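The open-world reading described above can be sketched in Python. This is a deliberately small illustration, not a description logic reasoner: the property name `hasDomain` and the function `recognise` are assumptions (Figure 9 is not reproduced here), and the three-valued answer stands in for what a real reasoner would derive.

```python
# Conditions taken from the text: at least one domain of each kind.
# The property name "hasDomain" is an assumption made for this sketch.
REQUIRED = frozenset({
    ("hasDomain", "TyrosinePhosphataseCatalyticDomain"),
    ("hasDomain", "TransmembraneDomain"),
})


def recognise(types, known, known_absent=frozenset()):
    """Classify one individual under an open-world reading.

    types        -- asserted classes of the individual
    known        -- (property, filler) pairs asserted to hold at least once
    known_absent -- pairs explicitly asserted NOT to hold
    """
    if REQUIRED & set(known_absent):
        return "not a member"   # an explicit negative rules it out
    if "Protein" in types and REQUIRED <= set(known):
        return "member"         # all stated conditions are met
    return "unknown"            # silence is not evidence of absence


described = {
    ("hasDomain", "TyrosinePhosphataseCatalyticDomain"),
    ("hasDomain", "TransmembraneDomain"),
    ("hasDomain", "SomeOtherDomain"),  # extra features do not matter
}
print(recognise({"Protein"}, described))
print(recognise({"Protein"}, {("hasDomain", "TransmembraneDomain")}))
```

The second call returns "unknown" rather than a negative answer: the description simply does not say whether a catalytic domain is present.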
This is only a subset of OWL’s expressivity. An ontologist can use statements in OWL to create ontologies that have precise meanings, precise enough that a machine can reason over those statements and make inferences [65–67]. As such, the formality (strictness) of the language is good: this precision means that automatic reasoning with the symbols of the ontology can take place. This very strictness is, however, potentially restrictive. OWL has many limitations in what it offers an ontologist:
For a more detailed analysis of OWL limitations, the interested reader is referred to .
Within its limitations, OWL has patterns that can provide ways to model, for instance, n-ary relationships, lists and exceptions. There are, however, large islands of biology that can be modeled with great success using OWL. The formality of the language not only means that machines are able to use the ontology to make inferences [66, 67, 69, 70], but the formality also makes an ontologist ask hard ontological questions about what he or she is modeling; in this sense ontology and language formality are linked.
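One widely used pattern for n-ary relationships is to introduce a fresh node that stands for the relationship itself and to attach each participant to it with an ordinary binary relationship. The Python sketch below illustrates the idea; the function `reify`, the blank-node naming scheme and the example roles are all hypothetical.

```python
from itertools import count

_ids = count(1)  # source of fresh node numbers


def reify(relation, **roles):
    """Turn one n-ary statement into binary triples about a fresh node."""
    node = f"_:{relation}_{next(_ids)}"
    triples = [(node, "type", relation)]
    triples += [(node, role, value) for role, value in roles.items()]
    return triples


# A three-place statement (protein, ligand, affinity) becomes four
# binary triples that all share the same subject node.
for triple in reify("Binding", protein="ProteinX", ligand="LigandY", affinity="high"):
    print(triple)
```

The cost of the pattern is an extra class of "relationship" individuals in the ontology, which is one reason it is usually presented as a workaround rather than as native support for n-ary relations.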
The successes and, more generally, the developments observed in the field of bio-ontologies over the past 5 years certainly make sense in today’s context, but would not necessarily have been easily predictable. In the rest of this article, we take the risk of outlining some directions, which, in our opinion, may shape biomedical ontology in the years to come.
Like many new domains, biomedical ontology is still as much an art as it is a science. Methods are just emerging, and beliefs and doctrine make up for the lack of objective metrics for evaluating quality. To the casual observer, being assertive and charismatic seems to be all it takes at this early stage to become a guru in the field of bio-ontology. Would-be ontologists are eager to embrace bio-ontologies and become disciples. We are, however, near the end of this era. More than individuals, multidisciplinary fora, such as the National Center for Biomedical Ontology, will now shape the discipline. Biologists interested in ontologies for their usefulness also increasingly recognize the importance of rigor in building these ontologies. As scientific techniques become available for building ontologies, and as objective metrics are developed for measuring their quality [31, 72, 73], today’s gurus who have contributed to promoting such techniques will be remembered as visionaries in the history of bio-ontologies.
Not all ontologies are equal. As mentioned earlier, some ontologies called reference ontologies, representing a limited domain rigorously and consistently, can serve as a reference for developing domain and application ontologies. A new organization, the OBO Foundry (http://obofoundry.org/), promotes guidelines for ontology development in relation to the OBO family of ontologies. It also selects ‘high-quality’ ontologies and promotes their use as reference ontologies within the OBO family. This certification process relies on objective metrics for evaluating bio-ontologies, most of which are still to be defined.
By and large, there has been a wide use of ontology to generate vocabularies with which to describe biomedical data. With the increasing volumes of data (and, it is hoped, described data), there is an increasing need to automate analyses of these data. The precise capture of biological knowledge in a computational form means it is possible to compute with knowledge as we compute with continuous mathematics and strings. To meet this pressing need, biomedical ontologies will have to become more formal and richer.
We can see in Figure 4 the expansion of topics covered in bio-ontologies. At present these are orthogonal, but relationships are implicit between, for example, GO’s biological processes and ChEBI’s chemicals. Plans exist to formalize this cross-linking, and this trend will increase. This will help in querying and analyzing data, and also in building and maintaining the ontologies themselves.
We also expect to see medical research and biological research join ontologies at the level of anatomy, drugs, etc. The two communities might need different granularities or even different views of the same ontology, but they have interests in common.
With the success of collaborative resources such as Wikipedia (http://en.wikipedia.org/wiki/Main_Page), the computer science community is increasingly interested in the social organization supported by the Web and Semantic Web technologies (see, for instance, the 2006 World Wide Web conference (http://www2006.org/)). Extended to ontologies, the notion of Semantic Wikipedia has arisen. Harnessing the knowledge resource of the community, to an even greater extent than has been seen with the GO, has the potential to shift knowledge gathering and defining from a small community of experts to a larger number of ‘eyeballs’, i.e. to knowledgeable scientists who would not be otherwise involved with ontology development for geographical or other reasons. Experiments in collaborative development of biomedical ontologies have been reported already. While it is still unclear what the correct framework for collaborative development is, or whether it will even work, this phenomenon should certainly not be ignored without investigation.
Analogously, a collaborative approach could also be used for the curation of bio-ontologies. Every assertion in an ontology could be commented upon by users, and the result of such critical evaluation could be recorded in the form of annotations added to the ontology. Early implementations of this emerging trend are becoming available. There are obvious dangers to such an approach, in building, curating and annotating, but this approach has potential, especially where funding is scarce. In such a situation, where a community decides if an ontology is necessary, a decision has to be made about whether something is better than nothing.
R. S. would like to acknowledge funding by the Sealife project (IST-2006-027269). This research was supported in part by the Intramural Research Program of the National Institutes of Health (NIH), National Library of Medicine (NLM).
See Appendix Table
Olivier Bodenreider, Staff Scientist in the Cognitive Science Branch of the Lister Hill National Center for Biomedical Communications at the U.S. National Library of Medicine. His research interests include terminology, knowledge representation and ontology in the biomedical domain, both from a theoretical perspective and in their application to natural language understanding, reasoning, information visualization and interoperability.
Robert Stevens, Senior Lecturer in bioinformatics in the School of Computer Science, University of Manchester, U.K. He has degrees in biochemistry, biological computation and computer science. He was a member of the groundbreaking TAMBIS project, which was the first in bioinformatics to use a description logic ontology to form a homogenizing query layer over bioinformatics resources. Interest in the use of formal ontology has continued in the development of semantic similarity metrics over ontologically annotated corpora. His other work includes the development of methodologies to migrate ontologies from the informal to the formal and to use reasoning to increase structural validity. His current work includes the use of protein family ontologies to catalog proteins in genomes and the use of ontologies to describe in silico experiments. He has co-chaired the annual bio-ontologies meeting at ISMB for many years and is a co-developer of a highly successful OWL training course.