Although controlled biomedical terminologies have been with us for centuries, it is only in the last couple of decades that close attention has been paid to the quality of these terminologies. The result of this attention has been the development of auditing methods that apply formal methods to assessing whether terminologies are complete and accurate. We have performed an extensive literature review to identify published descriptions of these methods and have created a framework for characterizing them. The framework considers manual, systematic and heuristic methods that use knowledge (within or external to the terminology) to measure quality factors of different aspects of the terminology content (terms, semantic classification, and semantic relationships). The quality factors examined included concept orientation, consistency, non-redundancy, soundness and comprehensive coverage. We reviewed 130 studies that were retrieved based on keyword search on publications in PubMed, and present our assessment of how they fit into our framework. We also identify which terminologies have been audited with the methods and provide examples to illustrate each part of the framework.
The quality of a controlled terminology can be characterized from any of several different perspectives. The design of a terminology can, from the outset, determine much about the future capabilities of the terminology. Many aspects of terminology design have been identified and characterized as desirable or undesirable [1, 2]. Standards development organizations have paid much attention to creating guidelines for quality control in terminology development. For example, the ISO/TC215 WG3 (Health Informatics - Semantic Content) has been working on such guidelines, and the latest American National Standards Institute guidelines for designing controlled terminologies (ANSI/NISO Z39.19-2005) serve as a comprehensive reference. In some cases, there is a lack of consensus about the desirability of particular design features (for example, some desire multiple hierarchies [1, 2] while others feel they should be avoided).
The structure of a terminology can be studied to determine whether it supports or contradicts the stated design principles of the terminology. For example, Logical Observation Identifiers Names and Codes (LOINC) is designed to have meaningless identifiers; its use of sequential integers with check digits satisfies this requirement . Similarly, the relationships in the Unified Medical Language System (UMLS) are designed to be reciprocal; the MRREL file provides a mechanism for delivering this information, as described in .
Finally, the content of a terminology can be assessed to determine if it is comprehensive and accurate from lexical and semantic (as opposed to structural) standpoints. For example, the list of all laboratory tests contained in LOINC can be evaluated to identify whether it in fact contains all the terms used by hospital laboratories.
To illustrate these distinctions, consider the assessment of a terminology with respect to multiple hierarchies. A terminology can be designed to include multiple hierarchies, but can be found to have a structural characteristic that interferes with true multiple hierarchies, such as the tree addresses used in the Medical Subject Headings (MeSH), as described in . Even when the terminology has a high-quality structure to support multiple hierarchies, its content might be deficient if a term that should have two parents is found to have only one.
A great deal of thoughtful planning is generally applied to terminology design, construction and maintenance. Design decisions (however controversial) are made with care, while the structural integrity can generally be guaranteed through good programming and database design. The quality of a terminology’s content, on the other hand, is often not immediately obvious. However well-intentioned, authoritative, and cautious a terminology builder may be, there is always the chance for errors of omission or commission.
At the very least, good quality assurance practices dictate that assessment for errors should be a standard part of terminology management . However, these practices, collectively referred to as auditing, can be challenging. Manual, expert review of a large terminology may provide little confidence that all errors have been detected. For example, any manual attempt to identify redundant terms in a large (>100,000 term) terminology will likely require memory that goes beyond human capacity.
To address this problem, informatics researchers and terminology developers have devised a number of methods to audit terminologies in systematic ways. Their methods often use knowledge in the terminology itself to perform the assessment and use computers to support - and in some cases entirely automate - the assessment. This paper reviews the major efforts in this area and organizes them into a framework that considers the aspects of terminology content that are audited, the methods used in the audits, and the terminology content that is employed to actually support the auditing process.
We first identify quality factors by which terminology content can be assessed. We consider intrinsic quality factors that are inherent to terminology content and that can be audited independently from external reference standards. Intrinsic factors include concept orientation, consistency, soundness, and non-redundancy. We also consider extrinsic quality factors that are contingent on comprehensive coverage of external user requirements, domain-specific contextual needs, or other external reference standards. Both types of quality factors can be further applied to the content and knowledge structure of a terminology. We describe these factors below and summarize them in Table 1.
Concept orientation refers to the principle that the units of discourse in a controlled terminology should actually be the meanings (or concepts), rather than the human-readable labels (that is, the terms) that are enumerated in a terminology with the intention of conveying the meanings. In some of the terminological literature and documentation, the words “concept” and “term” are used inconsistently or interchangeably. In this paper, we will generally use “concepts” when we are referring to the meanings being conveyed and “terms” when we are referring to the character strings that are names for concepts. In the case of the UMLS, where strings are grouped together based on their meanings and the groupings are called “concepts” and given unique identifiers, the distinction between terms and concepts is less clear; we refer to these groupings as “UMLS concepts”.
While the precise intended meanings of terms in a terminology may be difficult to audit directly, concept orientation at a minimum requires that the items in a terminology must correspond to at least one meaning (“non-vagueness”) and to no more than one meaning (“non-ambiguity”). For example, Poison Ivy can be used to refer to a disease or a plant, which can cause concept ambiguity. Methods aimed at auditing concept orientation have therefore sought to identify undefined terms and terms with multiple meanings (polysemy) that are nevertheless mapped to a single term identifier (implying a single concept) [9-11].
Consistency refers to adherence to semantic and/or linguistic rules for representing terms in a terminology. Linguistic inconsistency may occur when the same lexical modifier applied to different terms implies different relationships to the original terms. For example, there is a hierarchical relationship between congenital porphyria and porphyria, but a sibling relationship between congenital Addison’s disease and Addison’s disease (as provided by SNOMED). Inconsistency may also occur among hierarchical relationships in ways that are unrelated to lexical phenomena. Most terminologies include some kind of hierarchical relationship among their terms. These may be in the form of is-a relationships, broader-narrower relationships, part-whole relationships, or a mixture of these; in some cases, the meanings are not explicitly or implicitly determined. We consider all of these representations to convey some type of organization into groups of terms with similar properties – that is, classification – including the semantic type assignments in the UMLS. Whatever the intent of such classifications within a particular terminology, the consistent application of that classification is a desirable quality. In the UMLS, for example, “adulthood” and “old-age” are assigned both semantic types “Idea or Concept” and “Age Group”, while other terms, such as “childhood”, “juvenile”, and “young adults” are assigned only to the latter. The UMLS is particularly sensitive to classification inconsistencies, since it attempts to combine information from multiple terminologies that may be consistent with themselves but not with each other. For example, in the Computer Retrieval of Information on Scientific Projects (CRISP) terminology, “colorectal neoplasm” is broader than “colon neoplasm”, while in MeSH the relationship is reversed; the UMLS contains both relationships, such that the two terms appear to be each other’s parent.
Non-redundancy refers to the absence of unwanted repetitive information. Typical redundancy errors include redundant terms, where terms with the same meaning are represented with separate, independent identifiers (rather than being linked as synonyms). For example, in the UMLS the following pairs are redundant : C0000760: ABNORMAL PAP SMEAR, C0240660: PAP SMEAR ABNORMAL, and C0002965: Angina, Unstable, C0235466: ANGINA UNSTABLE.
Redundant classifications can also occur in a terminology. Redundant is-a assignments can be found as assignments wherein a term is identified as being in two classes, with the first class being a descendant of the second (the second assignment is implied by the first assignment due to the transitivity of the is-a relationship between classes) [13, 15]. For example, in the UMLS, “year” has two semantic types: “Temporal Concept” and “Idea or Concept”, but the former is a child of the latter. Therefore, these semantic type assignments are redundant.
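The transitivity argument above can be illustrated with a minimal Python sketch. It is not taken from any of the cited studies; the function names and the dictionary representation of the is-a hierarchy are assumptions made for illustration, and only the “year” example comes from the text.

```python
def ancestors(cls, parents):
    """Return every class reachable from `cls` by following is-a links upward."""
    seen = set()
    stack = list(parents.get(cls, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def redundant_assignments(assigned, parents):
    """Return assigned classes already implied by a more specific assigned class."""
    assigned = set(assigned)
    redundant = set()
    for cls in assigned:
        redundant |= ancestors(cls, parents) & assigned
    return redundant


# Fragment of the semantic type hierarchy from the "year" example (child -> parents).
semantic_network = {"Temporal Concept": ["Idea or Concept"]}
year_types = ["Temporal Concept", "Idea or Concept"]

print(redundant_assignments(year_types, semantic_network))
```

For the “year” example the sketch flags “Idea or Concept”, since that assignment is already implied by “Temporal Concept”.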
Soundness refers to the accuracy of the knowledge represented in a terminology. Typical terminology features that are audited for their soundness include term classifications , and term names and definitions . There are 9,296 UMLS concepts with the semantic type “Pharmacologic Substance” that have as children UMLS concepts of type “Clinical Drug”; however, the former refers to chemicals and the latter refers to manufactured objects consisting (in part) of chemicals . This is an example of incorrect classification.
Comprehensive coverage is, in some ways, the converse of soundness. While soundness refers to the accuracy of what a terminology does contain, comprehensive coverage refers to what a terminology should contain. Auditing for comprehensive coverage, therefore, must be related to the terminology’s intended domain in the world outside the terminology itself. Auditing methods assess a number of different aspects of terminology coverage, including: 1) Terms and synonyms. Some composite terms exist for which the derivative terms are missing. In the UMLS, for example, “monitor for seizure activity” and “seizure activity not present” exist, but “seizure activity” (an electroencephalographic finding) is missing. 2) Defining attributes. As an example of incomplete definitions, the UMLS does not support formal definitions and includes narrative definitions for only a small portion of its concepts. 3) Hierarchical classification. In the UMLS, “Inert Gas Narcosis” is an “Injury or Poisoning”, but based on common sense, it is also an “Occupational Disease”. The inference that “Injury or Poisoning” is-descendant-of “Disease or Syndrome” should be added to the semantic network. This is an example of a missing ancestor-descendant relationship in the UMLS Semantic Network. 4) Non-hierarchical semantic relationships. For example, “cleft lip” is a “disease” but has no relation to “finding of appearance of lip” in the UMLS.
Terminology errors may sometimes relate to multiple quality factors. For example, as mentioned previously, Gu, et al.  found that the UMLS concepts “adulthood” and “old-age” were classified as being of the semantic type “Idea or Concept” while other similar UMLS concepts such as “childhood”, “juvenile”, and “young adults” were not. This indicates inconsistent classification. At the same time, these UMLS concepts referred to multiple meanings, for example, the state of age and the group of people in that state, which violates non-ambiguity of concept orientation.
Auditing a terminology can be considered a comparative process, in which the content of a terminology is compared to some source of truth. There are numerous potential sources of knowledge that can be used to audit the knowledge a terminology contains, as both the literature and our own experience with Columbia University’s Medical Entities Dictionary (MED) clearly demonstrate. In a certain respect, all of these sources lie outside of the actual targeted internal representation in the terminology. However, for the purposes of this discussion, we define intrinsic knowledge as information derived from the classification scheme, hierarchy, semantic relationships to other terms, or attributes such as lexical information that are present within the terminology itself. We define extrinsic knowledge as a comparative standard deriving from an outside source, such as other terminologies, user requirements, or human expert knowledge. A recent example of automated auditing using extrinsic knowledge can be found in . The National Cancer Institute Thesaurus (NCIt)’s Gene hierarchy was audited for missing “has associated process” relationships using two sources, the NCBI Entrez Gene database and the NCIt’s Biologic Process hierarchy, to check hierarchical relationships. A given audit process could potentially use intrinsic knowledge, extrinsic knowledge, or a combination of both.
When a terminology seeks to maintain synchronization with some other terminology, as is the case with the UMLS and the MED, the external terminology itself becomes an extrinsic source. Maintaining synchronization with source terminologies entails a persistent ongoing audit to determine what has been added, changed or deleted. The UMLS must remain synchronized with over 100 standard source terminologies, while the MED must remain synchronized with terminologies from multiple ancillary systems (e.g., multiple laboratory, pharmacy, clinical documentation, and radiology systems) on at least a weekly basis, as well as the annually updated set of ICD9 diagnoses and procedures [22-24].
Although a source terminology may provide sufficient information for the auditing process, occasionally additional external sources are required. For example, if a new laboratory test term is to be added to a terminology such as the MED or LOINC, the terminology maintainer must consult a trusted external information source (whether published or based on his or her own knowledge) to determine whether it should be assigned to an existing class in the terminology or if a new class is needed.
External sets of clinical terms have frequently been used to assess coverage, in what are known as coding exercises, which are generally semi-automated or manual attempts to find terms. Targets for these studies have included SNOMED-CT, UMLS [26, 27], nursing taxonomies, Read Codes, and multiple terminologies in large studies [30, 31].
Expert user review is another external knowledge source for audits. Whether by direct perusal of a terminology , post hoc review to correctly classify and semantically link external data (discussed above), or to review output of audits based on intrinsic factors (discussed below), expert knowledge is the final arbiter in many evaluations.
The intrinsic knowledge that can be used in auditing processes includes hierarchical relationships, the non-hierarchical semantic relationships between terms, and the lexical knowledge about terms. We start with an example from the experience of two of the authors (DMB and JJC). Because the MED is integrated with a live clinical information system, updating content is a multistage process, in which changes are first entered into an editing environment, then a test environment, and finally the production environment. During each editing cycle, there are more than 25 automated audits that test for primary violations of terminology rules and structure. These audits are inextricably tied to the initial design. Some use hierarchical information rules (“cannot remove the last parent - all terms must have a parent”, “there can be no hierarchical cycles”, “a term should not have two hierarchically related parents”), some use semantic relationship rules between terms (“all semantic slots must have a reciprocal”, “no redundant semantic relationships”), and some use rules combining classification and semantics (“a term cannot have two hierarchically related values in the same semantic slot; the more specialized value should be used – i.e., refinement is enforced”). This is a short sampling of the checks for the more commonly identified errors.
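Three of the rules quoted above (every term has a parent, no hierarchical cycles, every semantic slot has a reciprocal) can be sketched as simple structural checks. The sketch below is an illustration only and is not the MED's actual audit code; the data layout, function names, relation names, and test terms are all hypothetical, apart from the colorectal/colon neoplasm pair, which mirrors the UMLS example given earlier in this paper.

```python
def find_orphans(parents, roots):
    """Terms with no parent that are not declared hierarchy roots."""
    return {term for term, ps in parents.items() if not ps and term not in roots}


def find_cycle_members(parents):
    """Terms that can reach themselves by following parent links upward."""
    def reaches_self(start):
        seen, stack = set(), list(parents.get(start, ()))
        while stack:
            node = stack.pop()
            if node == start:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(parents.get(node, ()))
        return False
    return {term for term in parents if reaches_self(term)}


def find_missing_reciprocals(relations, inverse_of):
    """Triples (source, relation, target) whose inverse triple is absent."""
    return {
        (s, r, t)
        for (s, r, t) in relations
        if (t, inverse_of[r], s) not in relations
    }


# Hypothetical toy terminology: term -> list of parents.
parents = {
    "root": [],
    "neoplasm": ["root"],
    "colorectal neoplasm": ["neoplasm", "colon neoplasm"],
    "colon neoplasm": ["colorectal neoplasm"],
    "stray term": [],
}
# Hypothetical semantic slots and their declared inverses.
relations = {
    ("serum glucose test", "measures", "glucose"),
    ("glucose", "measured-by", "serum glucose test"),
    ("serum sodium test", "measures", "sodium"),
}
inverse_of = {"measures": "measured-by", "measured-by": "measures"}

print(find_orphans(parents, roots={"root"}))
print(find_cycle_members(parents))
print(find_missing_reciprocals(relations, inverse_of))
```

Checks of this kind only flag candidates; as the text notes, deciding how to repair a flagged term typically still requires review by a terminology editor.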
A number of distributed biomedical terminologies have been assessed for similar adherence to design rules or terminological principles using intrinsic knowledge. These include assessment for adherence to basic ontological principles of SNOMED-CT  and the Foundational Model of Anatomy (FMA) , assessment for internal consistency in terminologies such as the READ Thesaurus [36, 37], and various approaches to removing cycles in the UMLS [38, 39]. Outside the domain of biomedicine, a set of formal consistency checking rules based on intrinsic hierarchical and semantic inputs has been proposed for the WordNet™ 1.5 lexical database  that share similarities with the routine MED audits.
Intrinsic knowledge can be used for more than highlighting violations of design principles; it has also been used in a variety of interesting ways to correct or suggest correction to content. In general the correction of content using intrinsic knowledge involves the additional input of extrinsic knowledge, often in the form of an expert to manually review items brought to light by an automated process .
A recurring theme in the literature is the application of semantic relationship patterns to partition a terminology into more manageable pieces for manual review. Occurrences of concepts having identical relationships (or associations to other concepts) and hierarchy are then brought to the attention of human reviewers. Using intrinsic knowledge for this kind of partitioning has been applied to the MED [42, 43], SNOMED [44-46], and the NCIt to reveal issues such as missing classifications and redundancy.
Another example of this approach is to use knowledge in the UMLS to derive a metaschema for the UMLS Semantic Network by grouping semantic types with identical relationship sets [47, 48]. In [13, 49], assignment of UMLS concepts to multiple UMLS semantic types, especially when those types are considered to be in different groups of semantic types of a metaschema [9, 50] or mutually exclusive , has been used to suggest classification errors, ambiguity and inconsistency. These methods almost always use expert manual review as a follow-up knowledge source. The MED regularly uses internal semantic relationships to determine classification; Cimino, et al.  described an algorithmic approach to enrich hierarchical structure in the MED and determine the correct location in the hierarchy for newly added terms.
Lexical information embedded in terminologies has also been used for auditing processes. In the simplest case, lexical information is used to fix lexical targets, such as spelling errors and uniqueness of term names, an audit performed with every MED update. Lexical information has also been applied to reveal other quality issues. For example, Campbell, et al. used term substrings in SNOMED to suggest classification omissions . In other work, synonymous terms in the UMLS were used to compile a list of keyword synonyms that, in combination with semantic types, was used to detect redundancy .
Consistent use of linguistic phenomena, such as adjectival modifiers, has been used to assess potential inconsistency in the UMLS and SNOMED. In one study by Bodenreider, et al., the intrinsic characteristics of lexical usage of adjective pairs such as “acute”/“chronic” and “primary”/“secondary” were reviewed in the context of known, extrinsic knowledge to reveal inconsistencies. In another lexical study, drug descriptions from four leading pharmacy system knowledge base vendors were compared across each field to determine lexical consistency of usage in the pharmacy domain. In yet another application of intrinsic lexical knowledge, all terms, synonyms and headings containing the conjunctions “and” or “or” in SNOMED were assessed against editorial board policy on usage, which specifies that “and” should imply logical AND (both must be present), “and/or” should be used when one or both must be present, and “either_or” should be used when one but not both must be present. Usage in this regard was found to be inconsistent.
The knowledge described in previous sections can be applied to the content of controlled biomedical terminologies in different ways. We summarize auditing methods into four major categories. The most straightforward but probably most labor-intensive method is manual review (with or without the support of a computerized user interface), by which a terminology reviewer (often a domain expert) audits the terminology with respect to certain quality factors. Automated systematic methods involve implementing knowledge into rule-checking programs that scan the terminology for potential problems with respect to particular quality factors, usually in a “batch” mode, to identify errors and inconsistencies. The automated systematic methods are generally reproducible and reduce the need for detailed, costly manual review. Automated heuristic methods involve the use of rules that make inferences about terminology content and then seek to identify those inferences that lead to illogical or inconsistent conclusions.
For the above three categories, we classify published reports based on the terminology attributes being audited:
In addition to the above three major categories, we discuss separately some high-level change management methods that deal with logistics issues involved in auditing terminologies.
In the following subsections, we examine examples of each of these methods from the published literature and our own experience, with consideration of the terminology attributes audited and the knowledge used to support the processes.
Because controlled medical terminologies are typically in a constant state of development, expansion and refinement, a formal representation and taxonomy were proposed by Cimino and Clayton [22, 24, 55] to characterize changes in a terminology, based on its syntactic properties (i.e., addition, deletion, name change, and code change) in order to characterize the semantic changes they represented. For example, a name change could represent a change in meaning (major name change) or not (minor name change). Once the semantic changes were characterized, they could be dealt with formally to maintain concept orientation and concept permanence. For example, a major name change would generally require the retirement of an existing concept (corresponding to the previous version of the term) and creation of a new one (corresponding to the new version). Those changes were reconciled through manual review by domain experts.
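The syntactic change types listed above (addition, deletion, name change, and code change) can be detected by comparing two versions of a terminology. The following Python sketch is a simplified illustration rather than the published formalism: the code-to-name dictionary layout, the function name, and the sample terms are hypothetical, and it makes no attempt to separate major from minor name changes, which the text describes as a matter for manual review.

```python
def classify_changes(old, new):
    """Compare two versions (code -> name) and bucket the syntactic changes."""
    added = set(new) - set(old)
    deleted = set(old) - set(new)

    # A code change: a name disappears under one code and reappears under another.
    deleted_by_name = {old[code]: code for code in deleted}
    code_changes = {
        (deleted_by_name[new[code]], code)
        for code in added
        if new[code] in deleted_by_name
    }
    recoded_old = {old_code for old_code, _ in code_changes}
    recoded_new = {new_code for _, new_code in code_changes}

    # A name change: the code persists but its name differs between versions.
    name_changes = {
        code: (old[code], new[code])
        for code in set(old) & set(new)
        if old[code] != new[code]
    }
    return {
        "additions": added - recoded_new,
        "deletions": deleted - recoded_old,
        "name_changes": name_changes,
        "code_changes": code_changes,
    }


# Hypothetical versions of a small source terminology.
old_version = {"T1": "Serum glucose", "T2": "Serum sodium",
               "T3": "Chest x-ray", "T4": "Urine culture"}
new_version = {"T1": "Glucose, serum", "T2": "Serum sodium",
               "T9": "Chest x-ray", "T5": "Serum potassium"}

changes = classify_changes(old_version, new_version)
for kind, items in changes.items():
    print(kind, items)
```

Each reported name change would still need an expert to judge whether the meaning changed, in which case the text indicates that the existing concept is retired and a new one created.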
Subsequently, Fischer  developed formal rules to check redundancy and consistency in a lexical database. Later, Wroe collaborated with Cimino and Rector  to model and integrate different drug formulation terminologies based on formal definitions that were manually created using the OpenGalen knowledge representation scheme. Terms in each terminology could be compared to corresponding ones in other terminologies based on the definitions, to identify inconsistencies between the two. Recently, Smith et al. presented a case study of applying similar formal principles to the Gene Ontology (GO) .
Several other manual auditing methods involved the coding of clinical records with terminologies, with their adequacy and accuracy judged by experts. For example, Chute, et al.  assessed various clinical terminologies for their content coverage by parsing 14,247 words into 3,061 distinct concepts. These concepts were grouped into Diagnoses, Modifiers, Findings, Treatments and Procedures, and Other. An attempt was made to manually code each concept in ICD-9-CM, ICD-10, CPT, SNOMED III, Read V2, UMLS 1.3, and NANDA, with the result scored as “no match”, “fair match”, and “complete match”. Coding consistency was assured by a secondary reviewer.
To audit attributes of controlled clinical terminologies, such as completeness, term definitions, “clarity” (represented as the inverse of the rate at which the same data might be coded in duplicate ways), consistency in clinical taxonomy, and administrative mapping, a Computer-based Patient Record Institute (CPRI) workgroup  assembled 1929 source records based on an initial expert evaluation and an organized consensus discussion within the workgroup. The source records were coded in each terminology scheme by an investigator and checked by the coding scheme owner. The coding was then scored by an independent panel of clinicians for acceptability on a Likert scale. The investigator for each scheme exhaustively searched a sample of coded records for duplications.
Many studies have been conducted to audit comprehensive coverage of a terminology in different settings. Humphreys, et al.  carried out a distributed national experiment using the Internet and the UMLS to determine the extent to which a combination of existing machine-readable health terminologies covered the terms and concepts needed for a comprehensive controlled terminology for health information systems. Several studies were performed within specific clinical information system settings: Wasserman and Wang  evaluated the breadth of terms and concepts for the coding of diagnosis and problem lists by clinicians within a physician order entry system; Kushniruk, et al.  conducted an observational study to identify concepts missing from an outpatient information system.
As Rector  argued, explicit information is key to most existing coding and classification systems, and different types of content must be separated based on their conceptual, linguistic, inferential and pragmatic correctness. Many research studies have evaluated comprehensive coverage of different domain-specific terminologies. Cieslowski, et al. and Moss, et al.  worked on nursing terminology integration and evaluation. Chiang, et al.  extracted ophthalmology concepts from patient reports and manually reviewed their coverage in five terminologies. Warnekar and Carter  assessed HIV term coverage of a commercial terminology. Smith and Kumar [63, 64] audited semantic appropriateness of term names and definitions of the GO terms. While combining laboratory data sets across terminologies, Baorto, et al.  used the LOINC knowledge model to identify missing concepts and synonyms. Zhu, et al.  created a terminology model for the missing acupuncture terms in the UMLS.
By examining violations of ontological semantics, Kumar and Smith  were able to assess the appropriateness of semantic classification of GO terms. Similarly, Schulze-Kremer, et al.  applied five ontological principles to audit knowledge classification in the UMLS semantic network, and Smith, et al. [16, 63] applied a set of ontological principles against GO to audit the assignment of hierarchical relations. Several years later, Mendonça, et al. , Spackman and Reynoso  used formal ontological definitions to audit misclassification in SNOMED, and Bodenreider, et al.  investigated subsumption in large description logic-based biomedical terminologies based on unique ontological principles.
Specifically, these ontological principles specified that all relationships of a parent class must either be inherited by each child or refined in the child, and refinement from parent to child should uniquely result in every case either from refinement of the value of a common role or introduction of a new role.
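The first part of this principle (each parent relationship is either inherited unchanged or refined in the child) can be sketched as a check over role-value pairs. This is a minimal illustration under assumed data structures, not the procedure used in the cited studies; the role names, values, and value hierarchy are hypothetical.

```python
def is_same_or_descendant(value, ancestor, parents):
    """True if `value` equals `ancestor` or lies below it in the value hierarchy."""
    stack, seen = [value], set()
    while stack:
        node = stack.pop()
        if node == ancestor:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return False


def unrefined_roles(parent_roles, child_roles, parents):
    """Roles of the parent that the child neither inherits nor refines."""
    return sorted(
        role
        for role, parent_value in parent_roles.items()
        if role not in child_roles
        or not is_same_or_descendant(child_roles[role], parent_value, parents)
    )


# Hypothetical value hierarchy (child -> parents) and role definitions.
value_hierarchy = {"left lung": ["lung"], "lung": ["organ"]}
parent_concept = {"finding site": "lung"}

refined_child = {"finding site": "left lung", "causative agent": "bacterium"}
dropped_child = {"causative agent": "bacterium"}
generalized_child = {"finding site": "organ"}

print(unrefined_roles(parent_concept, refined_child, value_hierarchy))
print(unrefined_roles(parent_concept, dropped_child, value_hierarchy))
print(unrefined_roles(parent_concept, generalized_child, value_hierarchy))
```

In this toy data the first child passes, because it refines the value of a common role and introduces a new role, while the other two are flagged because the parent's role is dropped or generalized.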
In a similar way, a qualitative, rather than a quantitative analysis of the NCIt was performed by Ceusters, et al.  in order to assess NCIt’s ontological principles. Using an OWL-representation, their inspection of the system was performed breadth first, top down, entry-by-entry to detect inconsistencies with respect to the term-formation principles used, the underlying knowledge representation, and missing or inappropriately assigned textual and formal definitions.
Schulz and Hahn proposed a semi-automatic knowledge engineering approach for converting the human anatomy and pathology portion of the UMLS Metathesaurus into a terminological knowledge base. Their approach consisted of an integrity check of the emerging taxonomic and partonomic hierarchies, and elimination of terminological cycles and inconsistencies. They used LOOM to represent the Metathesaurus, followed by a medical expert review. In detecting missing and incorrect is-a, part-of, and has-part relations, special attention was paid to the proper representation of part-whole hierarchies, while running experiments on 164,000 UMLS concepts and 76,000 relations in the terminological knowledge base. Their approach provided a formal description-logic framework to support taxonomic and partonomic reasoning. Subsequently, Arts, et al. used a semi-automated method to evaluate the structure of diagnosis terms used in intensive care. A description-logic reasoner was designed to find incorrect relations, along with manual review of relations performed by domain experts.
Table 2 summarizes the quality factors and knowledge sources in studies that applied manual auditing methods to various terminologies.
To assure unambiguous concept representation, Schulz, et al.  encoded simple rules into a system used to manage Read Codes, in order to control the uniqueness of concept identifiers (primary keys) and assure that each concept had a unique preferred name. Cimino  processed the UMLS concept strings and created an index consisting of normalized and synonymous lexical tokens to search equivalent and possibly duplicate UMLS concepts; the specificity was further improved by applying constraints based on semantic types. For example, “Angina, Unstable” and “ANGINA UNSTABLE” were found to be duplicates with different concept identifiers. Starting with a similar approach, Hole and Srinivasan  introduced more sophisticated algorithms for normalizing terms and enriching the lexicon of synonymous tokens so as to increase the sensitivity of their methods for identifying duplicate UMLS concepts.
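The general idea of lexical normalization for finding candidate duplicates can be sketched as follows. This is a deliberately reduced illustration and not the published algorithms: it only lowercases, strips punctuation, and ignores word order, and it omits the synonym lexicon and the semantic type constraints described above. The function names are hypothetical; the four concept identifiers and names are the redundant pairs quoted earlier in this paper.

```python
import re
from collections import defaultdict


def normalize(term):
    """Lowercase, drop punctuation, and ignore word order."""
    tokens = re.findall(r"[a-z0-9]+", term.lower())
    return tuple(sorted(tokens))


def candidate_duplicates(concepts):
    """Group concept identifiers whose names share the same normalized key."""
    groups = defaultdict(list)
    for identifier, name in concepts.items():
        groups[normalize(name)].append(identifier)
    return sorted(sorted(ids) for ids in groups.values() if len(ids) > 1)


concepts = {
    "C0000760": "ABNORMAL PAP SMEAR",
    "C0240660": "PAP SMEAR ABNORMAL",
    "C0002965": "Angina, Unstable",
    "C0235466": "ANGINA UNSTABLE",
}

print(candidate_duplicates(concepts))
```

Ignoring word order raises sensitivity at the cost of specificity, which is one reason the approaches described above add constraints such as semantic types before reporting a pair as a likely duplicate.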
Ceusters, et al.  identified duplicate concepts in SNOMED-CT by using a commercial medical ontology, LinKBase, and its associated search algorithm based on flexible string matching and ontology relation traversal; SNOMED-CT terms that mapped to the same LinKBase concept were considered redundancy-prone. For frame-based terminologies (again using Read Codes as a case study), Schulz, et al.  searched concepts with identical definitions (attribute-values) and considered them possible duplicates.
Rogers, et al.  proposed a method that translated Read Codes into a description logic (DL)-based representation and applied a DL reasoning program to audit the formal definitions of the terms; problems such as missing attributes in the definitions and missing inherited attributes from parent terms to child terms were found. Similarly, Cornet and Abu-Hanna [82, 83] audited the formal definitions of the Diagnoses for Intensive Care Evaluation (DICE) terminology by translating its frame-based representation into DL and applied a DL reasoning engine to check for problems such as redundant and ambiguous terms.
To audit DL-based terminologies (using the NCIt as a case study), Min, et al.  proposed a method to partition a terminology’s classification into networks containing terms with identical DL roles (e.g., has initiator process), organized into single-rooted sub-networks. In particular, their method assumed that smaller sub-networks of networks with few sub-networks were highly error-prone. The underlying idea is that concepts with a rare combination of DL roles and classification are highly error-prone. Using their approach, they were able to find missing synonyms, missing concepts, and duplicate terms. For auditing more complicated DL-based terminologies such as the multi-parented Specimen hierarchy of SNOMED-CT, Wang, et al.  advanced Min’s method by refining the procedure of generating single-rooted sub-networks, explicitly differentiating whether each DL role of the sub-networks is inherited or newly introduced. They also added the assumptions that sub-networks containing many DL roles but few terms (i.e., semantically specific) and sub-networks with multiple inherited roles (i.e., semantically heterogeneous) are error-prone , as are overlapping sub-networks . They found inaccurate concept naming and incorrect synonyms, in addition to missing synonyms, missing concepts, and duplicate terms.
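The idea that rare combinations of DL roles deserve attention can be sketched in a few lines. This is a deliberately reduced illustration, not the partitioning procedure of the cited studies: it only groups concepts by their exact role set and ignores the hierarchy, roots, and inheritance. The concept names, the role "has_target", and the size threshold are hypothetical.

```python
from collections import defaultdict

def partition_by_roles(roles_of: dict[str, set[str]]) -> dict[frozenset, list[str]]:
    """Group concepts that carry exactly the same set of DL roles."""
    groups = defaultdict(list)
    for concept, roles in roles_of.items():
        groups[frozenset(roles)].append(concept)
    return dict(groups)

def rare_role_groups(groups: dict[frozenset, list[str]], max_size: int = 1) -> dict[frozenset, list[str]]:
    """Return the groups small enough to be flagged for manual review."""
    return {roles: sorted(members)
            for roles, members in groups.items()
            if len(members) <= max_size}

# Hypothetical concepts and roles.
roles_of = {
    "Concept A": {"has_initiator_process"},
    "Concept B": {"has_initiator_process"},
    "Concept C": {"has_initiator_process"},
    "Concept D": {"has_initiator_process", "has_target"},
}
suspicious = rare_role_groups(partition_by_roles(roles_of))
```

Here only "Concept D" is flagged, because it is the sole concept with its particular combination of roles.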
Cimino  checked UMLS concepts assigned to mutually exclusive semantic types, assuming those classifications are error-prone. For example, if a UMLS concept is classified both as “Animal” and as “Plant”, then one of the semantic type assignments should be wrong. Implementing the above principle from a preventive perspective (which can be considered pre-release auditing), Schulz, et al.  imposed restrictions in Read Codes to disallow assigning a term to semantically exclusive classes. Geller, et al.  and Gu, et al.  advanced Cimino’s method with an algorithm that refined the UMLS Semantic Network into pure and intersected semantic types so that incorrect, redundant, and missing classifications were more easily exposed. Similar to Geller, Gu, et al. [9, 50] grouped the semantic types into broader meta-types [47, 48] first and then checked meta-type intersections with a small number of UMLS concepts, assuming that rare combinations between the different semantic groups strongly imply erroneous classification.
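A check for mutually exclusive semantic types, as in the “Animal”/“Plant” example above, might look like the following sketch. The table of exclusive pairs and the concept names are assumptions made for illustration; a real audit would need a curated list of exclusions.

```python
from itertools import combinations

# Assumed exclusion table; only the Animal/Plant pair comes from the text.
EXCLUSIVE_PAIRS = {frozenset({"Animal", "Plant"})}

def exclusive_conflicts(assignments: dict[str, set[str]]) -> dict[str, list[tuple[str, str]]]:
    """Map each concept to the mutually exclusive type pairs it was assigned."""
    conflicts = {}
    for concept, types in assignments.items():
        bad = [pair for pair in combinations(sorted(types), 2)
               if frozenset(pair) in EXCLUSIVE_PAIRS]
        if bad:
            conflicts[concept] = bad
    return conflicts

# Hypothetical concepts.
assignments = {
    "Concept X": {"Animal", "Plant"},
    "Concept Y": {"Animal"},
}
conflicts = exclusive_conflicts(assignments)
```

The same table could be consulted at editing time to reject an exclusive assignment before release, which corresponds to the preventive use described above.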
Another method proposed by Cimino [6, 10] for auditing erroneous UMLS semantic type assignments was based on the principle that the hierarchical relation between two semantic types should be consistent with the parent-child relations of the UMLS concepts assigned to the types, i.e., a child UMLS concept should always be assigned a semantic type no broader than the semantic type of its parent UMLS concept. For example, the concept “Lys-Lys” should not be classified as the semantic type “Organic Chemical”, because its parent “Dipeptides” is classified as the semantic type “Amino Acid, Peptide, or Protein”, which is the child of type “Organic Chemical”. This expected consistency was used to automatically identify suspicious concepts needing manual auditing in the extent (set of concepts) assigned a given semantic type .
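The consistency rule above can be expressed as a short sketch. It assumes, for simplicity, that each concept has a single semantic type and a single parent, and it encodes only the one type relation mentioned in the example; the function names are hypothetical.

```python
# Child type -> parent type; only the relation from the example is encoded.
TYPE_PARENT = {"Amino Acid, Peptide, or Protein": "Organic Chemical"}

def type_ancestors(sem_type: str, type_parent: dict[str, str]) -> list[str]:
    """Return all semantic types strictly broader than sem_type."""
    found = []
    while sem_type in type_parent:
        sem_type = type_parent[sem_type]
        found.append(sem_type)
    return found

def broader_than_parent(concept_parent: dict[str, str],
                        concept_type: dict[str, str],
                        type_parent: dict[str, str]) -> list[str]:
    """Flag child concepts whose type is broader than their parent's type."""
    return [child for child, parent in concept_parent.items()
            if concept_type[child] in type_ancestors(concept_type[parent], type_parent)]

concept_parent = {"Lys-Lys": "Dipeptides"}
concept_type = {
    "Lys-Lys": "Organic Chemical",
    "Dipeptides": "Amino Acid, Peptide, or Protein",
}
suspicious = broader_than_parent(concept_parent, concept_type, TYPE_PARENT)
```

The flagged concepts are candidates for manual auditing; the rule alone cannot tell whether the type assignment or the parent-child link is at fault.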
Applying a similar principle, Peng, et al.  audited redundant semantic type assignment for UMLS concepts that were assigned to both a particular semantic type and a parent (or ancestor) of that type, taking the parent type as redundant according to the rule of semantic type assignment specificity . Fan, et al.  built automatic classifiers using lexical features from the UMLS concept strings and contextual features from a PubMed corpus to reclassify the UMLS concepts into broad classes. Their method found erroneous and missing classifications by checking the disagreement between their broader classification and the original semantic types.
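The redundancy rule (a concept assigned both a semantic type and an ancestor of that type) lends itself to a compact sketch. The type hierarchy below contains only the single relation cited earlier in this section, and the function name is hypothetical.

```python
# Child type -> parent type; a toy fragment, not the full Semantic Network.
TYPE_PARENT = {"Amino Acid, Peptide, or Protein": "Organic Chemical"}

def redundant_types(assigned: set[str], type_parent: dict[str, str]) -> set[str]:
    """Return assigned types that are ancestors of another assigned type."""
    redundant = set()
    for sem_type in assigned:
        current = sem_type
        while current in type_parent:
            current = type_parent[current]
            if current in assigned:
                redundant.add(current)
    return redundant

assigned = {"Amino Acid, Peptide, or Protein", "Organic Chemical"}
redundant = redundant_types(assigned, TYPE_PARENT)
```

Under the specificity rule, the broader type ("Organic Chemical" in this toy case) is the one reported as redundant.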
Schulz, et al.  audited the hierarchical relationships in the Read Codes by automating two rules: 1) the attributes of a child term should be the same as or more detailed than that of its parent term (this helps audit the correctness of the is-a relations) and 2) a term with more detailed attributes than another term should be considered a child of that term (this helps audit the completeness of the is-a relations). Campbell, et al.  examined the problem of missing hierarchical and non-hierarchical semantic relations in SNOMED by using lexical algorithms to suggest the existence of relationships between terms with common substrings.
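The two attribute-based rules can be sketched as follows. This is a simplification under stated assumptions: "more detailed" is approximated as "has every attribute-value pair of the other term plus at least one more", value hierarchies are ignored, only directly asserted is-a links are consulted, and all terms and attributes are hypothetical.

```python
def rule1_violations(is_a, attrs):
    """Rule 1: a child must carry every attribute-value pair of its parent."""
    return [(child, parent) for child, parent in is_a
            if not attrs[parent].items() <= attrs[child].items()]

def rule2_suggestions(is_a, attrs):
    """Rule 2: a strictly more detailed term should be a child of the other."""
    known = set(is_a)
    return sorted((a, b) for a in attrs for b in attrs
                  if a != b
                  and attrs[b].items() < attrs[a].items()
                  and (a, b) not in known)

# Hypothetical frame-based definitions.
attrs = {
    "Fracture": {"morphology": "fracture"},
    "Femur disorder": {"site": "femur"},
    "Fracture of femur": {"morphology": "fracture", "site": "femur"},
}
# Asserted (child, parent) links; the second one is deliberately wrong.
is_a = [("Fracture of femur", "Fracture"),
        ("Femur disorder", "Fracture of femur")]

wrong_links = rule1_violations(is_a, attrs)
missing_links = rule2_suggestions(is_a, attrs)
```

The first check audits the correctness of existing is-a relations and the second their completeness, mirroring the two rules above.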
Ceusters, et al.  used two algorithms to audit incorrect and missing relations in SNOMED-CT: 1) a DL-based classification algorithm and 2) a search algorithm that estimated semantic distance by implying correct subsumption relations.
The methods by Cimino [6, 10] mentioned above that examine consistency of semantic classification of UMLS concepts were used simultaneously to audit erroneous parent-child relations between them. Cimino  also suggested inferring non-hierarchical semantic relations between UMLS semantic types from the relations between the terms specified by source terminologies to improve the completeness of the semantic relationships.
Bodenreider, et al.  audited the hierarchical relations of SNOMED-CT by automatically checking against four ontological principles: 1) each hierarchy must have a single root, 2) each class (except for the root) must have at least one parent, 3) non-leaf classes must have at least two children, and 4) each child must differ from its parent and siblings must differ from one another.
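The first three structural principles can be checked mechanically, as in the sketch below. It assumes the hierarchy is given as a mapping from every class to its set of parents, so a class with no parent is either the root or an orphan. The fourth principle is omitted because it requires comparing definitions, which this toy representation does not include. Class names are hypothetical.

```python
from collections import defaultdict

def audit_hierarchy(parents: dict[str, set[str]]) -> dict[str, list[str]]:
    """Report parentless classes and classes with exactly one child."""
    children = defaultdict(set)
    for child, class_parents in parents.items():
        for parent in class_parents:
            children[parent].add(child)
    # Principles 1 and 2: exactly one class (the root) may lack a parent.
    parentless = sorted(c for c, ps in parents.items() if not ps)
    # Principle 3: a non-leaf class should have at least two children.
    single_child = sorted(c for c in parents if len(children[c]) == 1)
    return {"parentless": parentless, "single_child": single_child}

# Hypothetical hierarchy; every class appears as a key.
parents = {
    "Root": set(),
    "A": {"Root"},
    "B": {"Root"},
    "A1": {"A"},
    "Orphan": set(),
}
report = audit_hierarchy(parents)
```

More than one entry under "parentless" signals either multiple roots or orphaned classes; entries under "single_child" mark classes whose sole child may be redundant with the parent or may be missing siblings.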
By partitioning the hierarchy of a DL-based terminology such as NCIt or SNOMED-CT into sub-networks of terms with identical DL roles, the methods in [8, 45] support the detection of wrong and missing roles. Furthermore, a more refined partition of those sub-networks into single-rooted sub-networks helps to highlight wrong and missing hierarchical relationships.
The following table (Table 3) summarizes the quality factors and knowledge sources in studies that applied automated systematic auditing methods to various terminologies.
Cimino and Barnett  created frame-based definitions manually, and automated a translation method for comparing a source term with all terms in each of the other target terminologies. The approach was applied to cardiac procedures in ICD9-CM, MeSH, SNOMED and CPT, with a score produced based on semantic distance. The scores were then ranked and manually reviewed.
To identify synonymy and near-synonymy, Barrows, et al.  explored lexical and morphologic text matching techniques to map clinically useful terms into a controlled medical terminology. Hole and Srinivasan  attempted to discover missed word and phrase synonyms in a large concept-oriented metathesaurus through lexical matching, selective algorithms, and expert reviews.
Tulipano, et al.  used manually created knowledge-based descriptions to represent molecular imaging terms, which were then automatically mapped to GO. This process was created to address issues related to synonymy, redundancy, and ambiguity in concept mapping. Recent work from Huang, et al.  introduced ways to generate new synonyms. Terms with multiple words were decomposed into single words. Synonyms for single words were identified; these provided the basis for new multiword terms to be reconstructed. In , techniques for limiting the combinatorial explosion caused by substituting WordNet synonyms for the single words of multiword terms are introduced. Patel and Cimino  used network analysis to support decompositional terminology translation in order to identify, using a clustering coefficient, those primitive concepts that should be related to more complex concepts.
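The decompose-substitute-recombine idea for generating candidate synonyms can be sketched as follows. The word-level lexicon is a hypothetical stand-in for WordNet, and the simple cap on the number of candidates is only a crude placeholder for the published techniques for limiting combinatorial explosion.

```python
from itertools import islice, product

# Hypothetical word-level synonym lexicon (stand-in for WordNet lookups).
WORD_SYNONYMS = {"kidney": ["renal"], "stone": ["calculus"]}

def generate_synonyms(term: str, limit: int = 10) -> list[str]:
    """Recombine word-level synonyms into at most `limit` candidate terms."""
    words = term.lower().split()
    options = [[word] + WORD_SYNONYMS.get(word, []) for word in words]
    candidates = (" ".join(combo) for combo in product(*options))
    next(candidates)  # the first combination is the original term itself
    return list(islice(candidates, limit))

candidates = generate_synonyms("kidney stone")
```

Every generated string is only a candidate synonym: the number of combinations grows multiplicatively with term length, and many recombinations will not be valid terms, so filtering and expert review remain necessary.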
Bodenreider, et al.  assessed the systematic use of linguistic phenomena for both lexical and semantic features in SNOMED and the UMLS Metathesaurus. Frequently co-occurring adjectival modifiers were identified syntactically and studied in combination with the contexts of each modifier. Bodenreider, et al.  also evaluated the content coverage of the UMLS with an attempt to find exact matches first, followed by normalization and semantic incompatibility checking. Five broad classes of UMLS concepts were extracted using their system (LocusLink) and mapped to the UMLS Metathesaurus. The search also covered contents of gene products, phenotypes, molecular functions, biological processes, and cellular components.
Cimino, et al.  automated term subsumption, using manually created knowledge, followed by manual review of the system’s suggestions for subclass partitioning, that is, the creation of new subclasses and inclusion of terms in those subclasses. The subsumption rules suggested partitioning large classes based on characteristics of subsets within the class, for example, chemical tests could be separated into subsets such as hormone tests, lipid tests, drug tests, etc. Semantic definitions for the new terms were created and added to the knowledge-base through a combination of automatic and manual means. For instance, in order to partition the large class “Chemistry Laboratory Tests”, the system found that many children had a “substance measured” relation to a drug term, while many others had relations to non-drug chemical terms. The system therefore proposed a new semantic class “Drug Measurement Tests” to subsume the former, as a new subclass of “Chemistry Laboratory Tests”. The same group  also built a rule-based terminology management system prototype in an object-oriented (OO) environment to check the inconsistency in classification.
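The partitioning heuristic in the “Chemistry Laboratory Tests” example can be sketched as grouping the children of a large class by the class of the term they point to. The test names, substance classes, and minimum group size below are hypothetical, and the sketch does not reproduce the rules of the cited system.

```python
from collections import defaultdict

def propose_partitions(substance_measured: dict[str, str],
                       substance_class: dict[str, str],
                       min_size: int = 2) -> dict[str, list[str]]:
    """Group tests by the class of the substance they measure."""
    buckets = defaultdict(list)
    for test, substance in substance_measured.items():
        buckets[substance_class[substance]].append(test)
    # Only groups of at least min_size tests are proposed as new subclasses.
    return {cls: sorted(tests) for cls, tests in buckets.items()
            if len(tests) >= min_size}

# Hypothetical children of one large class and their "substance measured" targets.
substance_measured = {
    "Serum digoxin test": "digoxin",
    "Serum phenytoin test": "phenytoin",
    "Serum glucose test": "glucose",
}
substance_class = {
    "digoxin": "Drug",
    "phenytoin": "Drug",
    "glucose": "Non-drug chemical",
}
proposals = propose_partitions(substance_measured, substance_class)
```

Each proposed group corresponds to a suggested subclass (for instance, tests measuring a drug) that a human reviewer would then accept, rename, or reject.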
Gu, et al. [49, 102] further experimented with OO modeling through the development of an OO database schema, using visualization techniques to identify places in the UMLS where errors and inconsistencies of semantic type assignments occurred. A different OO modeling approach [42, 43] was applied to the MED, yielding an upper-level schema, which helped highlight ambiguity and inconsistency in the MED modeling. Moreover, to enhance comprehensibility and usability of a terminology system, Gu, et al.  developed an OO representation that provided an abstract or skeleton view, using a theoretical paradigm and a methodology that partitioned schemas into manageably sized fragments (they refer to this as a “forest subschema”). A set of rules was applied sequentially to simplify classification schemes such that OO classes were grouped into partitions that were relatively independent of each other and contained highly interrelated UMLS concepts. Classification errors could be detected by this step-wise heuristic model. Subclass hierarchy was refined by a medical domain expert in conjunction with a computer.
Recent work from Gu, et al. [9, 50] included a combination of an auditing technique and an expert review that determined the pure intersections of meta-semantic types of the metaschema , which yielded a compact abstract view of the UMLS Semantic Network. Ambiguity, incorrect classification, and inconsistencies were readily identified in the pure intersections that were examined. New conjugate and complex semantic types are suggested in  to better capture the semantics of concepts assigned multiple structurally viewed chemical semantic types. For such concepts the proper semantics expresses a chemical reaction or mixture rather than the typical conjunctive semantics of multiple semantic types.
Zhang and Bodenreider  provided an operational definition of fifteen ontological principles and investigated the degree to which a large ontology of anatomy complied with them. Three rules were proposed to detect incompatible relationships. One such rule was that two terms cannot stand both in taxonomic and partitive relations, that is, for every pair of terms x and y, x and y do not have both IS-A and PART-OF relationships. For example, “Myocardium proper of right atrium” has both “REGIONAL PART OF” and “IS-A” relations with “Myocardium”, which are considered to be incompatible relationships.
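The rule that two terms cannot stand in both taxonomic and partitive relations can be checked with a short sketch. It assumes relations are available as (term, relation, term) triples and treats the listed relation names as partitive; which relation names count as partitive in a given terminology is an assumption that would have to be confirmed.

```python
# Relation names treated as partitive in this sketch (an assumption).
PARTITIVE = {"PART-OF", "REGIONAL PART OF"}

def incompatible_pairs(triples: list[tuple[str, str, str]]) -> list[tuple[str, ...]]:
    """Return term pairs linked by both an IS-A and a partitive relation."""
    taxonomic = {frozenset((x, y)) for x, rel, y in triples if rel == "IS-A"}
    partitive = {frozenset((x, y)) for x, rel, y in triples if rel in PARTITIVE}
    return sorted(tuple(sorted(pair)) for pair in taxonomic & partitive)

triples = [
    ("Myocardium proper of right atrium", "IS-A", "Myocardium"),
    ("Myocardium proper of right atrium", "REGIONAL PART OF", "Myocardium"),
    ("Right atrium", "PART-OF", "Heart"),
]
flagged = incompatible_pairs(triples)
```

Pairs are compared regardless of direction, so a term that is both a kind of and a part of another term is flagged whichever way the relations were recorded.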
Since redundant hierarchical relations are generally semantically consistent, semantic inconsistency detected in redundant hierarchical relations could be used as an indicator of potential mis-classification of one or both terms, and to trigger a review of these terms by editors of biomedical terminologies. Bodenreider  calculated an index of redundancy between pairs of hierarchically related terms to detect errors such as multiple pathways. Burgun and Bodenreider  proposed a method for assessing the semantic and hierarchical relationships between MeSH terms that co-occurred in literature citations. Zhang, et al. [93, 106, 107] expanded the UMLS Semantic Network into a multiple subsumption structure with a directed acyclic graph IS-A hierarchy, which allows a semantic type to have more than one parent. They argued that parent-child relations in the Metathesaurus implied the same in the Semantic Network, and finding the name of one type contained in the other implied the latter was a refinement of the former. New is-a relations were added to the Semantic Network, and new semantic types were created to support the multiple subsumption framework, from which new connected groups could be derived for the meta-semantic types of the metaschema for the expanded semantic network . Two methodologies were applied to identify and validate new is-a relations: a lexical-based string matching process (involving names and definitions of various semantic types in the Semantic Network) and a process for converting the partition’s disconnected groups of the Semantic Network  into connected ones.
Table 4 summarizes the quality factors and knowledge sources in studies that applied automated heuristic auditing methods to various terminologies.
The auditing of contemporary biomedical terminologies often involves feedback from multiple users with different requirements. If we consider a temporal axis to auditing methods (i.e., auditing a terminology at different time points), it appears to be a dauntingly complex task that deserves robust management techniques to handle the vast amount of changes made during the auditing process. Those high level change management methods are different from the methods directly auditing terminology content. Research closely related to change management of terminologies has been found in the literature on ontology evolution and versioning. A typical method used in ontology evolution (or versioning) is to log the “who”, “what”, “when”, and “why” concerning the changes made. A benefit of detailed logging is the reversibility of changes if they are later found to be erroneous.
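The logging practice described above, and the reversibility it provides, can be illustrated with a minimal sketch. The class and field names are hypothetical and do not correspond to any of the tools discussed in this section; only a rename operation is modeled.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Change:
    """One logged edit: who, what, why, and (by default, now) when."""
    who: str
    what: str
    why: str
    concept: str
    old_name: str
    new_name: str
    when: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class Terminology:
    def __init__(self, names: dict[str, str]):
        self.names = dict(names)
        self.log: list[Change] = []

    def rename(self, concept: str, new_name: str, who: str, why: str) -> None:
        """Apply a rename and record enough detail to reverse it later."""
        change = Change(who, "rename", why, concept, self.names[concept], new_name)
        self.names[concept] = new_name
        self.log.append(change)

    def undo_last(self) -> Change:
        """Reverse the most recent change using the logged old value."""
        change = self.log.pop()
        self.names[change.concept] = change.old_name
        return change

terminology = Terminology({"C001": "ANGINA UNSTABLE"})
terminology.rename("C001", "Unstable angina", who="auditor_1", why="normalize preferred name")
```

Because each log entry stores the previous value, an erroneous change can be retracted without consulting an earlier release of the terminology.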
The circular evolution process by Stojanovic, et al.  provides a framework for considering the methods involved in change management. The six phases in their framework are: 1) Change capture (deciding what changes to make; this corresponds to the output of the manual, automated systematic, and automated heuristic methods covered in the previous subsections), 2) Change representation (e.g., extract_superconcept is a change that occurs when a single concept is split into several subconcepts, with distribution of properties among them and their associated metadata; e.g., auditor, timestamp, and reason_for_change), 3) Semantics of change (e.g., to spot inconsistencies that could be introduced by certain change operations), 4) Change implementation (referring to both pre-implementation proofreading and the implementation itself), 5) Change propagation (applications or other terminologies that depend on the changed terminology need to be updated correspondingly), and 6) Change validation (referring to field-testing and retracting inappropriate changes based on the result of the testing).
The methods reviewed in the previous subsections are generally to be performed at certain fixed time points. However, considering Stojanovic’s framework, failing to characterize potential risks at Phase 3 would unwittingly allow illegal changes to the terminology; in Phase 5, either failing to propagate correct changes or propagating incorrect changes would spoil the terminology. Additionally, if an error occurs in Phase 3 and is propagated in Phase 5, a subsequent Phase 1 can identify corrections to be made in a subsequent Phase 2.
Tools have been developed or augmented to support change management in terminology auditing. Stojanovic and Motik  evaluated three generic ontology editors (Protégé, OntoEdit, and OilEd) and found that they satisfy some functions related to the temporal processes of auditing, each with its own strengths and weaknesses. The KArlsruhe ONtology Management Infrastructure (KAON) [117, 119] is another ontology editor that supports change management functions such as evolution strategies, consistency checking, and transparency of actions. Noy, et al.  developed the CHange and Annotation Ontology (CHAO) to facilitate collaborative change management, with emphasis on functions such as controlling access privileges and resolving conflicts in a multi-editor environment. CHAO was implemented as a set of plugins to Protégé and has been tested on the NCIt. The usability of such collaborative editing tools can be generalized to multi-auditor tasks. Oliver and Shahar  proposed the CONCORDIA (CONcept and Change Operation Representation for DIAlects) model to address synchronization and logging requirements peculiar to maintaining a local terminology that diverges from an original shared terminology.
Rogers has presented a framework for quality assurance methods applied to logic-based biomedical ontologies. He concludes that there are four aspects: philosophical validity, meta-ontological commitment, content correctness, and fitness for purpose . Our review extends beyond ontologies to include all forms of controlled biomedical terminologies and characterizes the actual methods used to assess such properties. We find that most of the methods are focused on content correctness, in one form or another, and assume that good quality content implies fitness for purpose and vice versa.
Our framework divides terminology auditing methods into manual, systematic and heuristic methods. Some of these methods are best suited for assessing the terms and concepts in a terminology, while others are better suited for assessing semantic classification and relationships. Some methods can be used to audit multiple terminology attributes at the same time, e.g., simultaneously auditing problematic semantic type assignments and hierarchical relations between terms . On the other hand, different methods can be used to audit the same terminology attribute and cross-validate each other, e.g., the mis-assignment of “Lys-Lys” to the semantic type Organic Chemical (see section 4.2.2) could be found through either a rule-based  or an automated reclassification  approach.
Each of these methods makes use of some knowledge. Most interesting are those that use knowledge that is intrinsic to the terminology itself, making the auditing process an exercise in introspection . The disadvantage of such an approach is that it relies on “extra” knowledge that might not normally be present in a terminology. This may require terminology developers to exert effort beyond basic terminology construction (that is, enumerating and arranging the terms) to add such knowledge. The advantage of this approach is that automated auditing methods can be created that can operate independently from human expertise. Despite the added effort, many modern terminologies, such as LOINC, RxNorm, the FMA, and SNOMED, as well as the UMLS Metathesaurus, now include formal definitional information that can be exploited by auditing techniques. Thus, methods that rely on intrinsic knowledge are becoming increasingly practical.
Also interesting are the methods that are largely automated, since many terminologies are too large for complete comprehension by individual human auditors. Systematic methods (such as referential integrity) are already being incorporated directly into the terminology maintenance processes being used by standards development organizations. The methods described in this review offer additional ways to assure that terminologies are following their own rules.
Most interesting, in our opinion, is the combination of the use of intrinsic knowledge with heuristic methods. These methods can identify terminology errors that might escape deterministic automated methods and human-centered manual methods. Inherent in the nature of heuristic approaches is their imperfection, resulting in false positive findings. However, as Min, et al.  point out, a modest false positive rate is an acceptable trade-off when attempting to identify the parts of a terminology that should be scrutinized by human reviewers, especially when the availability of such reviewers is limited by resources or the limits of human attention.
As this review shows, there is now a rich tradition of formal auditing methods for controlled terminologies that has largely arisen in the last decade or so. Although much of the work began with theoretical approaches on limited data sets, the growing presence of rich biomedical ontologies is providing a fertile ground for further development of practical, usable auditing procedures. Indeed, much of the work presented here has resulted in feedback to UMLS, SNOMED, ICD9, MeSH and GO, leading to their improvement. Our own auditing methods are applied daily to the MED at Columbia University to support clinical, administrative and research information systems at New York Presbyterian Hospital. At the same time, the appreciation of high-quality controlled biomedical terminologies is currently on the rise, providing impetus for the actual use of such procedures to improve the terminologies that, in turn, are being increasingly relied upon to improve biomedicine.
This paper summarizes a wealth of terminology auditing methods and provides only the briefest descriptions of the work of many creative researchers. We hope that the framework we present serves to guide those who seek to develop their own auditing methods based on work that has come before. At the same time, the absence of entries in Tables 2-4 points to many heretofore unexplored opportunities for exploiting the rich knowledge in modern terminologies to support their improvement.
The last twenty years have been witness to a proliferation of terminology auditing methods that employ a variety of creative methods and exploit a variety of terminological knowledge to better evaluate and improve the terminologies that are emerging today as important components of biologic, clinical and public health systems. Much of the work has gone beyond the experimental stage to become key components of standards development and information system maintenance.
Dr. Cimino was supported in part by funds from the intramural research program at the National Institutes of Health (NIH) Clinical Center and the National Library of Medicine (NLM).
Publications of Auditing Methods Applied to Controlled Biomedical Terminologies
*See Documents under http://www.tc215wg3.nhs.uk/pages/default.asp