To provide typical examples of biomedical ontologies in action, emphasizing the role played by biomedical ontologies in knowledge management, data integration and decision support.
Biomedical ontologies selected for their practical impact are examined from a functional perspective. Examples of applications are taken from operational systems and the biomedical literature, with a bias towards recent journal articles.
The ontologies under investigation in this survey include SNOMED CT, the Logical Observation Identifiers, Names, and Codes (LOINC), the Foundational Model of Anatomy, the Gene Ontology, RxNorm, the National Cancer Institute Thesaurus, the International Classification of Diseases, the Medical Subject Headings (MeSH) and the Unified Medical Language System (UMLS). The roles played by biomedical ontologies are classified into three major categories: knowledge management (indexing and retrieval of data and information, access to information, mapping among ontologies); data integration, exchange and semantic interoperability; and decision support and reasoning (data selection and aggregation, decision support, natural language processing applications, knowledge discovery).
Ontologies play an important role in biomedical research through a variety of applications. While ontologies are used primarily as a source of vocabulary for standardization and integration purposes, many applications also use them as a source of computable knowledge. Barriers to the use of ontologies in biomedical applications are discussed.
The need for standardizing biomedical vocabulary is not recent. As long ago as the 17th century, health authorities in London used a standard list of about 200 causes of death - later integrated into the International Classification of Diseases - to compile accurate health statistics known as the Bills of Mortality. In addition to terms, scientists such as Linnaeus started formalizing the relations among biological entities, in order to represent and share their knowledge of the world.
The last decade has seen a marked increase in the number of artifacts created for representing biomedical entities, their terms and their relations, often referred to as vocabularies, terminologies and ontologies. As shown in Figure 1, the number of citations on ontologies and controlled vocabularies in the PubMed/MEDLINE database has grown by 600% to about 1200 per year. While some authors have proposed definitions for these artifacts [3, 4] and attempted to characterize the distinctions among them, in practice, these names are often used interchangeably. This phenomenon is reflected in part by the fact that 5–10% of the PubMed/MEDLINE citations indexed under the MeSH descriptor "Vocabulary, Controlled" also contain the word "ontology" (dark section of the histogram in Figure 1). For the sake of simplicity, we henceforth refer to these various types of artifacts as ontologies. Another interesting trend in the past decade is the change in the relative importance of these ontologies, as measured by the number of mentions in PubMed/MEDLINE citations. As shown in Figure 2, the Gene Ontology (GO) has become the most cited ontology, with over 450 citations per year. In contrast, the footprint of the Unified Medical Language System (UMLS) seems smaller now than ten years ago, although the number of citations has remained essentially constant throughout the decade.
A number of recent reviews have presented the major biomedical ontologies to various audiences, most often with an emphasis on their design and structural characteristics, mentioning their use only in passing [6–11]. Other reviews have presented the role played by several biomedical ontologies in specific applications, such as clinical decision support and discovery applications, or in a specific domain, such as bioinformatics. In contrast, one review provides a functional perspective on biomedical ontologies. The interested reader is referred to these reviews for more information about these ontologies.
In the present survey, we analyze some high-impact biomedical ontologies through a functional lens, classifying their roles - somewhat arbitrarily - into three major categories: knowledge management, including the indexing and retrieval of data and information; data integration, exchange and semantic interoperability; and decision support and reasoning. The three categories, however, are not mutually exclusive and we examine the various roles played by each ontology. For example, LOINC is used as a source of standard vocabulary for retrieval purposes, for the integration and exchange of laboratory data [17, 18], and for "reliable execution of decision logic in clinical decision support systems". More generally, reference ontologies are designed independently of any particular application and are expected to be useful in a variety of tasks [19, 20].
The ontologies under investigation in this survey include SNOMED CT, a comprehensive concept system for healthcare [21–23]; the Logical Observation Identifiers, Names, and Codes (LOINC), a vocabulary for laboratory tests and clinical observations [24–26]; the Foundational Model of Anatomy (FMA), a domain ontology of structural human anatomy [20, 27]; the Gene Ontology, a controlled vocabulary for the functional annotation of gene products across species [28–30]; RxNorm, a controlled vocabulary of normalized names and codes for clinical drugs [31–33]; the National Cancer Institute Thesaurus, a public domain terminology that provides broad coverage of the cancer domain [34–36]; the International Classification of Diseases, the 115-year-old medical terminology, now part of a family of health classifications [37, 38]; the Medical Subject Headings (MeSH), a controlled vocabulary for the indexing and retrieval of the biomedical literature [39, 40]; and the Unified Medical Language System (UMLS), a terminology integration system in which all the above ontologies are integrated (with the exception of the FMA, soon-to-be integrated) [41–43]. Some characteristics of these ontologies (based on information present in the UMLS) are shown in Table 1, including scope, number of entities, distribution of the number of terms per entity, and existence of a subsumption hierarchy.
Due to the large number of publications on the subject, this review will necessarily be superficial. Its objective is to provide not an exhaustive list of references, but rather examples, hopefully typical, of biomedical ontologies in action. In order to reflect the state of the art, the selection of references is also somewhat biased towards recent journal articles.
One major role of biomedical ontologies is to serve as a source of vocabulary, i.e., a list of names for the entities represented in these ontologies. Strictly speaking, collecting names is the function of terminology, not ontology, and ontology languages such as OWL, the Web Ontology Language, treat names as labels or annotations. In practice, however, most biomedical ontologies under investigation here (with the notable exception of LOINC) provide lists of names for the entities they accommodate, in addition to properties and relations for these entities. The terminological component of biomedical ontologies is an important resource for natural language processing systems and supports knowledge management tasks such as annotation (or indexing) of resources, information retrieval, access to information and mapping across resources. However, the corpus of entity names present in biomedical ontologies covers the lexicon of the domain only in part (especially for languages other than English) and forms only a basis for managing term variation [46, 47]. As shown in Table 1, the number of terms per entity varies widely among ontologies.
Virtually every ontology in our survey serves as a source of vocabulary for the purpose of annotating data or indexing documents. Besides the prototypical examples of MeSH, used for indexing the biomedical literature, and the Gene Ontology, used for the functional annotation of gene products in several dozen model organisms, many other ontologies have also been used for annotation purposes.
Indexing is principally used in reference to the assignment of entries from a controlled vocabulary to documents, e.g., the biomedical literature. While the indexing of large collections such as PubMed/MEDLINE is still performed manually for the most part, automatic indexing systems have been developed (e.g., [49, 50]). Although the goal is to assign MeSH descriptors, these systems often take advantage of the large set of terms and relations provided by the UMLS. Systems such as GoPubMed co-annotate the biomedical literature to both MeSH and the Gene Ontology.
The indexing of clinical documents is generally referred to as coding - and biomedical ontologies are sometimes called "code sets". The International Classification of Diseases (ICD) has been used for over a century for coding morbidity and mortality and, more recently, as a coding system for reimbursement purposes. SNOMED CT is being adopted as a standard terminology for electronic health records by a growing number of countries [21, 53] and has also been evaluated as a source of vocabulary for clinical research. The UMLS Metathesaurus as a whole has also been used to support the coding of clinical documents, such as surgical pathology reports. Like indexing, most coding is still performed manually. However, automatic techniques have been developed and evaluated (e.g., for ICD [56–58]), some of which exhibit high accuracy in limited domains.
In biology, the functional description of experimental data is usually referred to as annotation. Here again, (semi-)automatic methods for acquiring annotations from text have been investigated recently [59–63], but annotations are still most often the product of manual curation. Functional annotation is not limited to the annotation of gene products to the Gene Ontology, but can be seen more generally as a "normalization" process applied to datasets, enabling further processing. For example, SNOMED CT and the NCI Thesaurus were used to annotate tissue microarray data in the Stanford Tissue Microarray Database. Analogously, MeSH was used to annotate mentions of human diseases in the Gene Expression Omnibus, a public repository of gene expression data, in order to create gene-disease networks. Related to the notion of indexing is that of term recognition, i.e., the process of automatically identifying mentions of entities of interest in text through natural language processing (NLP) techniques. A number of term recognition systems have been developed for the biomedical domain, exploiting the rich sources of vocabulary provided by biomedical ontologies. UMLS-based systems include MetaMap and MetaPhrase. Developed more recently are systems such as Termine and Whatizit, which cover genomics (e.g., gene and protein names) in addition to clinical medicine.
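The idea of term recognition against an ontology's vocabulary can be illustrated with a minimal dictionary-lookup sketch. It is not a description of how MetaMap, Termine or any other system named above works; the lexicon, the `TOY:` identifiers and the `recognize` function are hypothetical, and the greedy longest-match strategy is simply one common baseline.

```python
import re

# Hypothetical lexicon: lower-cased entity names mapped to made-up identifiers.
LEXICON = {
    "breast cancer": "TOY:001",
    "cancer": "TOY:002",
    "tamoxifen": "TOY:003",
}

def recognize(text, lexicon=LEXICON, max_words=3):
    """Greedy longest-match lookup of lexicon names in text.

    Returns (surface form, identifier, start offset, end offset) tuples.
    """
    tokens = [(m.group().lower(), m.start(), m.end())
              for m in re.finditer(r"\w+", text)]
    mentions, i = [], 0
    while i < len(tokens):
        # Try the longest window first so "breast cancer" beats "cancer".
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            candidate = " ".join(tok[0] for tok in tokens[i:i + n])
            if candidate in lexicon:
                start, end = tokens[i][1], tokens[i + n - 1][2]
                mentions.append((text[start:end], lexicon[candidate], start, end))
                i += n
                break
        else:
            i += 1  # no name starts at this token
    return mentions

print(recognize("Tamoxifen for breast cancer"))
```

Real systems add handling of term variation (inflection, word order, abbreviations), which is precisely where a plain list of names falls short.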
The main function of the indexing of large document collections such as MEDLINE is to support accurate retrieval, i.e., with high recall and high precision. With hierarchical controlled vocabularies such as MeSH or the UMLS [71, 72], queries can be expanded to the descendants of the original input term, in addition to being enriched with synonyms, which contributes to improving recall.
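Query expansion over a hierarchical vocabulary can be sketched as follows. The hierarchy and synonym tables are toy data invented for illustration, not actual MeSH or UMLS content, and `expand_query` is a hypothetical helper rather than the interface of any search engine.

```python
from collections import deque

# Toy vocabulary (illustrative only): parent -> children, and term -> synonyms.
CHILDREN = {
    "heart disease": ["arrhythmia", "heart failure"],
    "arrhythmia": ["atrial fibrillation"],
}
SYNONYMS = {
    "heart disease": ["cardiac disease"],
    "atrial fibrillation": ["auricular fibrillation"],
}

def descendants(term):
    """Return every term below `term` in the hierarchy, breadth-first."""
    found, seen = [], set()
    queue = deque(CHILDREN.get(term, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        found.append(current)
        queue.extend(CHILDREN.get(current, []))
    return found

def expand_query(term):
    """Expand a query term with its descendants and with all their synonyms."""
    expanded = []
    for t in [term] + descendants(term):
        for name in [t] + SYNONYMS.get(t, []):
            if name not in expanded:
                expanded.append(name)
    return expanded

print(expand_query("heart disease"))
```

A document indexed only under "atrial fibrillation" is then retrieved by a query on "heart disease", which is the recall gain described above.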
More generally, by providing lists of synonyms, relations among concepts, high-level categorization and co-occurrence information, the UMLS plays a major role in the retrieval of various types of documents, not only the biomedical literature in MEDLINE , but also medical textbooks available on the Internet , knowledge bases (e.g., of medical computational problems ) and medical images [76–78]. Because they provide terms in several languages, the UMLS and MeSH have also been used for cross-language information retrieval [79, 80].
Several biomedical search engines exploit MeSH and the UMLS to provide access to the biomedical literature, including SAPHIRE , Essie  and Textpresso , as well as web resources for consumers (e.g., WRAPIN , MedicoPort ). Several specialized search engines have been created as well. Of particular interest are systems supporting evidence-based medicine and answering clinical questions. Such systems often exploit existing search engines (or term recognition systems) [86, 87], and add specific constraints to the search [88, 89].
Besides MeSH and the UMLS, other biomedical ontologies have been used for the retrieval of specific information. In addition to model organism databases, most microarray experiment databases can be searched by terms from the Gene Ontology, including ArrayExpress and the Cancer gene expression database (CGED). The Stanford Tissue Microarray Database includes an NCI Thesaurus browser for searching disorders; SNOMED CT is used in a system that helps patients find physicians with particular expertise; and medical web resources are indexed with the International Classification of Diseases in the HealthCyberMap. In the case of Emily, the ontology itself - here the Foundational Model of Anatomy - is used as the knowledge source for question answering purposes. Finally, some search engines such as GoPubMed organize the documents according to two ontologies and support searches on either ontology or both. For example, a search on "COX-2" in GoPubMed shows index terms from both MeSH (Cyclooxygenase 2) and the Gene Ontology (cyclooxygenase pathway).
The automatic classification of biomedical documents is also generally supported by ontologies. For example, the high-level categorization of UMLS Metathesaurus concepts with semantic types from the Semantic Network has been used for topic detection in medical texts, as well as document clustering. The hierarchy of MeSH terms has been used for categorizing MEDLINE documents. Even when they do not exploit their structure, some document classification systems use the list of synonyms provided by ontologies such as MeSH and the Gene Ontology to aggregate document features (i.e., using concepts as features instead of words).
The availability of several dozen biomedical ontologies is both a blessing and a curse. On the one hand, users can choose from a variety of ontologies and select the artifact that best fits their purpose. On the other, resources annotated to different ontologies become more difficult to integrate, unless mappings are created among ontologies in order to identify equivalent concepts across ontologies. This issue was identified several decades ago and was in part the motivation for creating the Unified Medical Language System. In effect, the UMLS Metathesaurus is a terminology integration system, in which synonymous terms from various terminologies are clustered into concepts, allowing for the seamless mapping between terms from different terminologies through a UMLS concept. As mentioned earlier, these groupings of terms are often exploited for query expansion purposes in information retrieval. Some terminologies provide mapping information to other terminologies (e.g., SNOMED to ICD-9-CM), which, in some cases, is recorded in the UMLS. Such features of the UMLS have been used for mapping between MeSH and SNOMED CT in the context of a digital library. However, due to large differences in scope and granularity among vocabularies, direct mapping through synonymy and built-in mapping information fails to provide mapping for most concepts. In addition to these features, the hierarchical and associative relations among UMLS concepts have also been exploited for automatic mapping purposes (sometimes in combination with lexical mapping [102, 103]), allowing concepts from one terminology to be mapped to more generic concepts in another terminology [104, 105]. Other large ontologies such as SNOMED CT have also been exploited for mapping between clinical terminologies. Analogously, the Foundational Model of Anatomy has been used as a reference for aligning anatomical ontologies.
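The two mapping strategies described here, direct mapping through a shared concept and fallback to a more generic concept through the hierarchy, can be sketched on toy data. All concept identifiers, source names and codes below are hypothetical; the sketch does not reproduce the structure of the UMLS distribution files.

```python
# Toy concept table: concept id -> list of (source, code, name) entries.
# Every identifier here is invented for illustration.
CONCEPT_ATOMS = {
    "TOY0001": [("SOURCE_A", "A10", "Myocardial infarction"),
                ("SOURCE_B", "B-77", "Heart attack")],
    "TOY0002": [("SOURCE_A", "A20", "Asthma")],
    "TOY0003": [("SOURCE_B", "B-30", "Respiratory disorder")],
}
# Toy hierarchy between concepts: child concept -> parent concept.
PARENT = {"TOY0002": "TOY0003"}

# Reverse index: (source, code) -> concept id.
INDEX = {(source, code): cui
         for cui, atoms in CONCEPT_ATOMS.items()
         for source, code, _name in atoms}

def map_code(source, code, target):
    """Map a code to `target`, climbing to broader concepts when needed."""
    cui = INDEX.get((source, code))
    while cui is not None:
        hits = [(c, n) for s, c, n in CONCEPT_ATOMS[cui] if s == target]
        if hits:
            return hits
        cui = PARENT.get(cui)  # no direct equivalent: try the parent concept
    return []

print(map_code("SOURCE_A", "A10", "SOURCE_B"))  # direct, through a shared concept
print(map_code("SOURCE_A", "A20", "SOURCE_B"))  # indirect, to a broader concept
```

The second call illustrates the loss of granularity that comes with mapping to a more generic concept.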
Finally, medication reconciliation, i.e., the process of comparing a patient's medication orders to all of the medications that the patient has been taking, can be facilitated by the mapping among drug vocabularies realized in systems such as RxNorm and the UMLS, as is the exchange of medication information between federal agencies.
Biomedical ontologies are often cited as an important element of semantic interoperability and information exchange in biomedicine, along with messaging standards and clinical information models. For example, one report notes the role of ontologies (called "standards") in the standardization of patient data to be exchanged across electronic health record (EHR) systems, helping to connect "islands of data". Analogously, ontologies are key to clinical guideline models such as SAGE, where they standardize the representation of knowledge, thus facilitating maintenance, shareability and interoperability with EHR systems. Ontologies also play a major role in the integration of heterogeneous data from disparate sources, which is critical to translational research.
The use of RxNorm, UMLS, and SNOMED CT has been reported as part of a mediation strategy to exchange medication data between the Veterans Affairs (VA) and the Department of Defense (DoD) clinical information systems. LOINC is used widely in the exchange of laboratory data [18, 115], often in conjunction with HL7.
Semantic interoperability projects such as BRIDG, CDA and caCORE also rely on ontologies, although indirectly in most cases. The BRIDG model, developed by the Biomedical Research Integrated Domain Group, is an information model designed to "support practical application and data interchange" for clinical research. Semantic interoperability between clinical trials information systems is supported in BRIDG through semantic harmonization. Although BRIDG stopped short of binding the information model to specific ontologies, its developers acknowledge the role of ontologies in semantic interoperability. (Methods for binding clinical terminologies to information models have been presented, and the mapping of the Outcome and Assessment Data Set (OASIS-B1) to LOINC and other terminologies has been investigated.)
The HL7 Clinical Document Architecture, Release 2 (CDA R2) model is "richly expressive, enabling the formal representation of clinical statements", including clinical observations, medication administrations, and adverse events . CDA R2 associates the HL7 Reference Information Model with terminologies such as LOINC, SNOMED CT and RxNorm for representing the semantics of a clinical document.
The Common Ontologic Representation Environment (caCORE) is a model-driven infrastructure developed to support an interoperable biomedical information system for cancer research . Ontologies, including the NCI Thesaurus , represent an important element of this infrastructure.
Ontologies support data integration in two different ways, corresponding to two different approaches to data integration: warehousing and mediation. On the one hand, by providing a controlled vocabulary in a given domain, ontologies support the standardization required by warehousing approaches to data integration, in which the sources to be integrated are transformed into a common format and converted to a common vocabulary. For example, the integration of model organism databases is facilitated by the existence of the Gene Ontology, used - natively or after conversion - for the functional annotation of gene products in many species. Analogously, the integration of data from microarray experiments benefits from the standardization of their description with ontologies.
On the other hand, mediation-based approaches use ontologies for defining a global schema (in reference to which queries are made) and mapping between the global schema and local schemas (the schemas of the sources to be integrated). TAMBIS, the BioMediator and OntoFusion provide examples of such systems. The UMLS is used (along with the Gene Ontology) for the creation of the global schema in OntoFusion. A similar approach, also based on the UMLS, is used in ARIANE, a system that provides access to heterogeneous medical databases.
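A mediation-based approach can be sketched as a query posed against a global schema and rewritten for each local source. The sources, field names and local terms below are hypothetical, and the sketch makes no claim about the architecture of TAMBIS, BioMediator, OntoFusion or ARIANE.

```python
# Hypothetical sources. Each one declares how the global fields ("analyte",
# "value") and the global vocabulary map onto its own schema and local terms.
SOURCES = {
    "lab_db": {
        "rows": [{"test": "GLU", "val": 5.4}, {"test": "K", "val": 4.1}],
        "fields": {"analyte": "test", "value": "val"},
        "terms": {"glucose": "GLU", "potassium": "K"},
    },
    "clinic_db": {
        "rows": [{"obs_name": "Glucose, serum", "result": 6.0}],
        "fields": {"analyte": "obs_name", "value": "result"},
        "terms": {"glucose": "Glucose, serum"},
    },
}

def query(analyte):
    """Answer a global-schema query by rewriting it for each local source."""
    results = []
    for name, source in SOURCES.items():
        local_term = source["terms"].get(analyte)
        if local_term is None:
            continue  # this source has nothing mapped to the global term
        fields = source["fields"]
        for row in source["rows"]:
            if row[fields["analyte"]] == local_term:
                results.append({"source": name,
                                "analyte": analyte,
                                "value": row[fields["value"]]})
    return results

print(query("glucose"))
```

The data stay in their sources; only the mappings are maintained centrally, in contrast with the warehousing approach, where the data themselves are converted.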
More generally, ontologies facilitate the integration of datasets, often by providing a common reference for biomedical entities in several datasets. For example, LOINC has been used for integrating laboratory data with adverse events , the Foundational Model of Anatomy for the integration of genomic information sources , and SNOMED CT for the integration of disease and pathway information .
Ontologies represent domain knowledge in computable, reusable form . Simple ontologies (e.g., limited to subsumption hierarchies) are useful for data aggregation and clustering. Rich ontologies comprise large networks of associative relations among the entities of a given domain. Such ontologies provide domain knowledge to applications and support the interpretation of relations identified in datasets through data mining processes based on linguistic or statistical techniques. Five broad kinds of applications of ontologies are discussed next: data selection, data aggregation, decision support, natural language processing, and knowledge discovery.
Many clinical and epidemiological research studies involve the creation of groups (from an independent variable) whose characteristics (dependent variables) are examined for differences (e.g., survival rate at five years in breast cancer patients). By providing an abstraction of some domain, ontologies can help define groups from a high-level value for the independent variable (e.g., breast cancer), instead of listing all possible values (e.g., cancer of upper-inner quadrant of breast, of lower-outer quadrant, etc.). The International Classification of Diseases is used pervasively for selecting groups of patients in association with a high-level disease category. For example, in a study of emergency department visits for supraventricular tachycardia (SVT), the selection of cases of SVT was based on the descendants of two high-level ICD codes. Analogously, survival risk ratios for trauma patients have been calculated for various groups of hierarchically defined diagnostic categories in ICD, and high-level ICD codes have been used in a study of stroke hospitalization over time. Many other ontologies are used for data selection purposes, including SNOMED CT, used for querying clinical data warehouses. A hierarchical structure was added to LOINC in order to facilitate public health reporting.
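Selecting cases from the descendants of a few high-level codes can be sketched as below. The codes are invented placeholders, not real ICD codes, and the sketch assumes that the hierarchy is encoded in the code string itself (a subcategory extends its category code after a dot); hierarchies stored as explicit parent-child relations would need a traversal instead.

```python
def is_descendant(code, ancestor):
    """True if `code` equals `ancestor` or is one of its dotted subcategories."""
    return code == ancestor or code.startswith(ancestor + ".")

def select_cases(records, high_level_codes):
    """Keep the records whose diagnosis falls under any high-level code."""
    return [r for r in records
            if any(is_descendant(r["dx"], a) for a in high_level_codes)]

# Toy patient records with placeholder diagnosis codes.
records = [
    {"id": 1, "dx": "T10.1"},
    {"id": 2, "dx": "T20.0"},
    {"id": 3, "dx": "T10"},
    {"id": 4, "dx": "T100.2"},  # shares a prefix with T10 but is not below it
    {"id": 5, "dx": "T30.9"},
]

cases = select_cases(records, ["T10", "T30"])
print([r["id"] for r in cases])
```

Matching on `ancestor + "."` rather than on the bare prefix avoids wrongly selecting record 4.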
In addition to data selection, ontologies are used for identifying the characteristics of groups obtained through various methods (e.g., the characteristics of patients in a group of long-term cancer survivors). Here again, ontologies support the aggregation of characteristics and ICD is often used for aggregating diagnoses. For example, in a study of the evolution over time of discharge diagnoses in emergency departments, the major categories of diagnoses investigated correspond to the top-level categories in ICD-10 . (The accuracy of such studies, which depends on the quality of the coding and the type of study, is discussed in [137, 138]).
In biology, microarray technologies for measuring gene expression typically identify groups of genes up- and down-regulated under certain circumstances. The simultaneous activity (or inactivity) of genes in these groups represents only one clue into their participation in biological activities and such groups of genes generally require further characterization, especially through functional annotations. Some fifty tools have been developed to date for the characterization of gene sets, exploiting Gene Ontology annotations and other resources (e.g., Onto-Express and GoMiner). Some tools specifically take advantage of the hierarchical organization of terms in the Gene Ontology. Several tools also use MeSH descriptors for characterizing sets of genes, sometimes in combination with Gene Ontology terms [146, 147]. The functional characterization of gene expression signatures is used widely. A search combining "gene ontology" and "gene expression" in PubMed/MEDLINE yields over 800 citations. A recent trend in gene expression profiling is co-clustering, i.e., the use of functional annotations not for the post hoc characterization of gene sets, but as part of the clustering process itself [148–150]. One limitation of data aggregation based on hierarchies is the heterogeneous density of terms throughout the ontology (i.e., some branches are more richly developed than others). Semantic similarity metrics based on information content have been developed to address this issue and successfully applied to the Gene Ontology. These metrics provide a new approach to clustering genes.
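One well-known instance of an information-content metric is Resnik's similarity, in which the similarity of two terms is the information content of their most informative common ancestor. The text does not say which metrics were applied to the Gene Ontology, so the choice of Resnik's measure is an assumption, and the hierarchy and annotation counts below are toy data rather than actual Gene Ontology content.

```python
import math

# Toy is-a hierarchy (term -> parents) and direct annotation counts.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna binding": ["binding"],
    "rna binding": ["binding"],
}
DIRECT_COUNTS = {"root": 0, "binding": 2, "catalysis": 10,
                 "dna binding": 4, "rna binding": 4}

def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    result = {term}
    for parent in PARENTS[term]:
        result |= ancestors(parent)
    return result

def information_content():
    """IC(t) = -log p(t), where p(t) counts annotations to t or its descendants."""
    totals = dict.fromkeys(PARENTS, 0)
    for term, count in DIRECT_COUNTS.items():
        for a in ancestors(term):
            totals[a] += count
    grand_total = totals["root"]
    return {t: -math.log(c / grand_total) for t, c in totals.items()}

def resnik(ic, a, b):
    """Information content of the most informative common ancestor of a and b."""
    return max(ic[t] for t in ancestors(a) & ancestors(b))

ic = information_content()
print(resnik(ic, "dna binding", "rna binding"))
print(resnik(ic, "dna binding", "catalysis"))
```

Because the score depends on annotation frequencies rather than on the number of edges between terms, it is less sensitive to branches of uneven depth, which is the limitation of hierarchy-based aggregation noted above.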
Clinical decision support systems generally benefit from ontologies in two principal ways. First, as mentioned earlier, ontologies provide a standard vocabulary for biomedical entities, helping standardize and integrate data sources. For example, a system for drug allergies must be able to resolve drug names into standard codes and map between drug coding systems and the allergy knowledge base. Second, ontologies are a source of computable domain knowledge that can be exploited for decision support purposes, often in combination with business rules [153, 154]. For example, in an alert system for drug allergies, allergy to beta-lactams can be represented efficiently if the system can access a classification of drugs (as opposed to direct links to specific drugs). The interested reader is referred to  for a discussion about the role of ontologies in specific clinical decision systems. Issues discussed earlier about knowledge management support for evidence-based medicine (2.2) and the role of ontologies in clinical guidelines (3.1) are also relevant to clinical decision support.
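The beta-lactam example can be made concrete with a minimal sketch in which an allergy is recorded once, at the level of a drug class, and an alert fires for any ordered drug that falls under that class. The hand-written classification table and the function names are illustrative assumptions; an operational system would draw the classification from a drug ontology.

```python
# Toy drug classification (child -> parent class), written by hand for
# illustration; it stands in for the classification a drug ontology provides.
DRUG_CLASS = {
    "amoxicillin": "penicillins",
    "cephalexin": "cephalosporins",
    "penicillins": "beta-lactams",
    "cephalosporins": "beta-lactams",
    "ciprofloxacin": "fluoroquinolones",
}

def lineage(drug):
    """Return the drug followed by its successively broader classes."""
    chain = [drug]
    while chain[-1] in DRUG_CLASS:
        chain.append(DRUG_CLASS[chain[-1]])
    return chain

def allergy_alerts(ordered_drug, recorded_allergies):
    """Return the recorded allergies that the ordered drug falls under."""
    classes = lineage(ordered_drug)
    return [allergy for allergy in recorded_allergies if allergy in classes]

print(allergy_alerts("amoxicillin", ["beta-lactams"]))
print(allergy_alerts("ciprofloxacin", ["beta-lactams"]))
```

A single "beta-lactams" entry in the allergy list covers every current and future drug classified under it, which is the efficiency gain over direct links to specific drugs.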
Besides clinical decision support, ontologies support reasoning in other applications. The Foundational Model of Anatomy (FMA) was used as a source of anatomical knowledge for reasoning about penetrating injuries, more precisely for predicting the consequences of penetrating injury. In this application, knowledge about spatial relations between the path of injury and vital organs is provided by the FMA. The availability of the NCI Thesaurus in OWL (Web Ontology Language) format makes it amenable to automatic processing by reasoners developed for OWL, enabling consistency checking and automatic classification. Leveraging such automatic reasoning services, an automatic grading system for gliomas has been developed. Ontologies sometimes participate indirectly in reasoning processes. For example, the Gene Ontology played a role in the extraction of information required for creating an ontology of phosphatases. This ontology was subsequently used for reasoning about phosphatases. Although isolated, these examples illustrate the potential benefit of ontologies for decision support and reasoning.
As mentioned earlier, Natural Language Processing (NLP) techniques support term recognition, exploiting the vocabulary provided by biomedical ontologies. Ontologies also provide the domain knowledge necessary for advanced NLP applications, including information extraction for a specific task, relation extraction, document summarization, question answering, literature-based discovery, and more generally, text mining .
While term recognition systems merely identify entities in text, advanced systems identify specialized facts - sometimes on the basis of information provided by term recognition systems - used to guide specific applications (e.g., mentions of smoking in patient records, used for selecting cases ; medical problems from patient records, used to maintain problem lists ; respiratory findings from emergency department reports, for biosurveillance purposes ). Systems such as BioCaster  and EpiSpider  apply term recognition techniques to health news feeds and integrate the extracted information with other resources (including ontologies), creating what is known as "mashups". These resources can help track cases of, say, avian influenza and support biosurveillance and public health.
In addition to entity recognition, some systems extract relations (i.e., facts asserted in text), thus "interpreting" the text. Examples of such systems exploiting the UMLS for processing clinical text and the biomedical literature include SemRep, (Bio)MedLEE [165, 166] and commercial systems such as Tessi. More specifically, SemRep draws on MetaMap for identifying entities in text and relies on the UMLS Semantic Network as its source of domain knowledge for the interpretation of the semantic predications it extracts.
The UMLS has been used in advanced NLP applications including question-answering systems [168–171] and the summarization of medical documents [172–174]. More generally, NLP techniques have evolved to support the high-throughput processing of the biomedical literature , in a similar fashion to the high-throughput processing of genomic data enabled by sequence alignment techniques. Massive amounts of data such as the MEDLINE database are now routinely exploited, often in combination with ontologies, for hypothesis generation and knowledge discovery purposes. Literature-based discovery systems take advantage of the UMLS and MeSH as sources of knowledge, which are combined with the knowledge extracted from text to support the discovery process [176–179].
By supporting the high-throughput processing of biological and clinical data, ontologies are a component of the data-driven approach to biomedical research, synergistic with the traditional hypothesis-driven approach . Moreover, data mining often operates on datasets resulting from the integration of heterogeneous resources, also supported by ontologies .
Because of the availability of datasets coded with the International Classification of Diseases (ICD), clinical data exploration often involves the mining of ICD codes, along with, for example, geographic data  or meteorological data . The availability of large volumes of data makes it possible to detect rare events, such as adverse reactions to drugs (e.g., diabetic ketoacidosis  and hepatic toxicity ). In biology, the functional annotations of gene products from multiple model organisms to the Gene Ontology represent an important knowledge source, often mined in combination with sequence similarity [186, 187], gene expression data [188, 189], or both . Predicting the molecular function or subcellular localization of uncharacterized genes is an active field of research. While most methods exploit the annotations of related gene products to the Gene Ontology, some also take advantage of the hierarchical structure of the Gene Ontology .
Finally, as mentioned earlier, ontologies have been used for identifying relations between genotype and phenotype, both for the vocabulary they provide [65, 166], and for the relations among entities asserted in these ontologies . Ontologies have also been used for creating and interpreting gene networks [193, 194], as well as drug-target networks .
Ontologies have become important resources for biomedical research and researchers have come to rely on ontologies such as the International Classification of Diseases and the Gene Ontology in a large variety of applications, taking their existence for granted. There are still barriers, however, to the use of ontologies in biomedical applications, including availability, discoverability, the formalisms used for their representation, integration and quality.
A large number of ontologies are freely available, including LOINC, the Foundational Model of Anatomy, the Gene Ontology, the NCI Thesaurus and MeSH. Because some of the ontologies integrated in the UMLS are subject to intellectual property restrictions, however, its users must sign a license agreement to get access to the UMLS content. RxNorm follows the same model, although the part of its content owned by the National Library of Medicine is made freely available through a browser and an application programming interface. Finally, the availability of SNOMED CT to users depends on whether their country is a member of the International Health Terminology Standards Development Organization (IHTSDO). Being freely available is one of the requirements for ontologies to be included in the Open Biomedical Ontology repository, as is also expected of the ontologies used in the Semantic Web.
With over 140 ontologies, the UMLS is the largest repository of biomedical ontologies (accessible through the Knowledge Source Server), but its coverage is somewhat biased towards healthcare applications. The National Center for Biomedical Ontology's BioPortal provides access to about ninety ontologies, including those from the Open Biomedical Ontology (OBO) collection, with a bent for biological ontologies. Ontologies such as the Gene Ontology and the NCI Thesaurus are present in both collections. While useful, these two resources do not completely compensate for the lack of a registry allowing users to discover biomedical ontologies corresponding to their needs, which leads both to the underutilization of existing but unpublicized resources and to the development of roughly similar artifacts by independent groups.
The ontologies integrated in the UMLS are all converted to the so-called RRF format, regardless of their native representation formalism. RRF supports the representation of both the terms and relations natively present in these ontologies and the concept-oriented view superimposed by the UMLS. On the other hand, most ontologies available on BioPortal are represented in OBO format, the others being in frame-based Protégé or OWL format. Despite the availability of converters between OBO and OWL and of terminology servers supporting multiple formats, such as LexGrid, the multiplicity of formats remains an impediment to the use of biomedical ontologies.
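To give a sense of why format multiplicity matters in practice, here is a minimal sketch of reading the OBO flat-file format, in which each term is a `[Term]` stanza of `tag: value` lines. It is an illustration under simplifying assumptions rather than a complete parser: it keeps only the `id`, `name` and `is_a` tags, treats everything after " ! " as a comment, and ignores header lines, other stanza types, escape sequences and trailing modifiers. The sample content and its `EX:` identifiers are hypothetical. An application that must also consume OWL or RRF needs a separate reader of this kind for each format, or a converter.

```python
# Hypothetical sample in OBO flat-file style; the EX: identifiers are made up.
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: EX:0000001
name: anatomical entity

[Term]
id: EX:0000002
name: organ
is_a: EX:0000001 ! anatomical entity

[Typedef]
id: part_of
name: part of
"""


def parse_obo_terms(text):
    """Return {term_id: {"name": str or None, "is_a": [parent ids]}} for [Term] stanzas."""
    terms = {}
    current = None
    in_term = False
    for raw in text.splitlines():
        line = raw.strip()
        # A bracketed line opens a new stanza; only [Term] stanzas are kept.
        if line.startswith("[") and line.endswith("]"):
            in_term = line == "[Term]"
            current = {"name": None, "is_a": []} if in_term else None
            continue
        if not in_term or not line or ":" not in line:
            continue
        # Split on the first colon only, since identifiers contain colons too.
        tag, _, value = line.partition(":")
        value = value.split(" ! ")[0].strip()  # drop the trailing comment
        if tag == "id":
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            current["is_a"].append(value)
    return terms
```

Calling `parse_obo_terms(SAMPLE_OBO)` yields the two terms with their names and parent links, while the `[Typedef]` stanza is skipped.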
There are essentially two approaches to integrating ontologies. On the one hand, the UMLS realizes the post hoc integration of ontologies, from the bottom up, without interfering with the development process or governance of the ontologies being integrated. On the other, the OBO Foundry promotes a model of coordinated development of ontologies. Both approaches are useful for data integration. By integrating existing ontologies "as is", the former only links them to the extent possible (as they might show limited compatibility), but has the advantage of facilitating the integration of the vast datasets annotated to these ontologies (e.g., ICD, MeSH). In contrast, the top-down approach of the OBO Foundry model ensures consistency ab initio, but is virtually impossible to apply retrospectively to large, widely used, legacy ontologies.
Intuitively, the poor quality of some ontologies might result in inaccuracies in the applications they support. In practice, assessing the quality of biomedical ontologies with intrinsic criteria is difficult and might be futile if disconnected from practical applications. On the one hand, the evaluation of quality can be seen as the responsibility of users, who can share their experience with other users by commenting on the usefulness of a given ontology (or part thereof) from the perspective of their application. This constitutes a democratic approach to quality evaluation. For others, the determination of the quality of ontologies should be based solely on science and left to an oligarchy of specialists. While the accuracy of statements in ontologies is important, other factors such as installed base (how many users does it have?) and governance (who makes decisions about development and maintenance?) also need to be taken into account when selecting an ontology for a given application.
Ontologies play an important role in biomedical research through a variety of applications. They provide the controlled vocabulary required for the annotation of biological datasets, the biomedical literature and patient records, facilitating the retrieval of and, more generally, access to information. Such standardization also facilitates the exchange of information and contributes to semantic interoperability among systems. By providing a representation of a domain, ontologies are also used in the mediation approach to integrating datasets. Finally, many applications use ontologies as a source of computable domain knowledge, including natural language processing applications and decision support systems. Ontologies are also critical to hypothesis generation and knowledge discovery in a data-driven approach to biomedical research.
This research was supported in part by the Intramural Research Program of the National Institutes of Health (NIH), National Library of Medicine (NLM).
1Citations indexed by the MeSH term "Diagnostic and Statistical Manual of Mental Disorders" (DSM) are excluded, because, in most articles, DSM is used not as a terminology, but as a source of diagnostic criteria for mental diseases.