Auditing a terminology can be considered a comparative process, in which the content of a terminology is compared to some source of truth. There are numerous potential sources of knowledge that can be used to audit the knowledge a terminology contains, as both the literature and our own experience with Columbia University’s Medical Entities Dictionary (MED) clearly demonstrate [20]. In a certain respect, all of these sources lie outside of the actual targeted internal representation in the terminology. However, for the purposes of this discussion, we define intrinsic knowledge as information derived from the classification scheme, hierarchy, semantic relationships to other terms, or attributes such as lexical information that are present within the terminology itself. We define extrinsic knowledge as a comparative standard deriving from an outside source, such as other terminologies, user requirements, or human expert knowledge. A recent example of automated auditing using extrinsic knowledge can be found in [21]: the Gene hierarchy of the National Cancer Institute Thesaurus (NCIt) was audited for missing “has associated process” relationships using two sources, the NCBI Entrez Gene database and the NCIt’s Biologic Process hierarchy, to check hierarchical relationships. A given audit process could potentially use intrinsic knowledge, extrinsic knowledge, or a combination of both.
When a terminology seeks to maintain synchronization with some other terminology, as is the case with the UMLS and the MED, the external terminology itself becomes an extrinsic source. Maintaining synchronization with source terminologies entails a persistent ongoing audit to determine what has been added, changed or deleted. The UMLS must remain synchronized with over 100 standard source terminologies, while the MED must remain synchronized with terminologies from multiple ancillary systems (e.g., multiple laboratory, pharmacy, clinical documentation, and radiology systems) on at least a weekly basis, as well as the annually updated set of ICD9 diagnoses and procedures [22
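The synchronization audit described above amounts to comparing two snapshots of a source terminology. The following is a minimal sketch, assuming each snapshot is available as a simple mapping from code to display name; the function name `diff_snapshots` and the laboratory codes are hypothetical and are not taken from the MED or the UMLS.

```python
def diff_snapshots(previous, current):
    """Compare two snapshots of a source terminology (code -> display name).

    Returns the codes that were added, changed, or deleted between snapshots.
    """
    old_codes, new_codes = set(previous), set(current)
    added = sorted(new_codes - old_codes)
    deleted = sorted(old_codes - new_codes)
    changed = sorted(
        code for code in old_codes & new_codes if previous[code] != current[code]
    )
    return {"added": added, "changed": changed, "deleted": deleted}


# Hypothetical laboratory codes, invented purely for illustration.
last_week = {
    "LAB001": "Serum sodium",
    "LAB002": "Serum potassium",
    "LAB003": "Blood glucose",
}
this_week = {
    "LAB001": "Serum sodium",
    "LAB002": "Plasma potassium",
    "LAB004": "Serum chloride",
}

report = diff_snapshots(last_week, this_week)
for kind, codes in report.items():
    print(kind, codes)
```

In practice each reported code would still need review, since a changed name may or may not reflect a change in meaning.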
Although a source terminology may provide sufficient information for the auditing process, occasionally additional external sources are required. For example, if a new laboratory test term is to be added to a terminology such as the MED or LOINC, the terminology maintainer must consult a trusted external information source (whether published or based on his or her own knowledge) to determine whether it should be assigned to an existing class in the terminology or if a new class is needed.
External sets of clinical terms have frequently been used to assess coverage, in what are known as coding exercises, which are generally semi-automated or manual attempts to find terms. Targets for these studies have included SNOMED-CT [25], UMLS [26], nursing taxonomies [28], Read Codes [29], and multiple terminologies in large studies [30
Expert user review is another external knowledge source for audits. Whether through direct perusal of a terminology [32], post hoc review to correctly classify and semantically link external data (discussed above), or review of the output of audits based on intrinsic factors (discussed below), expert knowledge is the final arbiter in many evaluations.
The intrinsic knowledge that can be used in auditing processes includes hierarchical relationships, the non-hierarchical semantic relationships between terms, and the lexical knowledge about terms. We start with an example from the experience of two of the authors (DMB and JJC). Because the MED is integrated with a live clinical information system, updating content is a multistage process, in which changes are first entered into an editing environment, then moved to a test environment, and finally to the production environment. During each editing cycle, there are more than 25 automated audits that test for primary violations of terminology rules and structure [33]. These audits are inextricably tied to the initial design. Some use hierarchical information rules (“cannot remove the last parent - all terms must have a parent”, “there can be no hierarchical cycles”, “a term should not have two hierarchically related parents”), some use semantic relationship rules between terms (“all semantic slots must have a reciprocal”, “no redundant semantic relationships”), and some use rules combining classification and semantics (“a term cannot have two hierarchically related values in the same semantic slot; the more specialized value should be used – i.e., refinement is enforced”). This is a short sampling of the checks for the more commonly identified errors.
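The three hierarchical rules quoted above can be sketched in a few lines. This is an illustrative sketch only, not the MED's audit code: it assumes the hierarchy is held as a mapping from each term to the set of its direct parents, and that a single designated root term is exempt from the parent requirement. The function names and the sample terms are hypothetical.

```python
def ancestors(term, parents):
    """Return all strict ancestors of a term; safe even if the graph has cycles."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def audit_hierarchy(parents, root):
    """Check three rules: every term has a parent, no cycles, no related parents."""
    violations = []
    for term in sorted(parents):
        direct = parents[term]
        # Rule 1: all terms other than the assumed root must have a parent.
        if term != root and not direct:
            violations.append(("no_parent", term))
        # Rule 2: a term must not be its own ancestor.
        if term in ancestors(term, parents):
            violations.append(("cycle", term))
        # Rule 3: no direct parent may be an ancestor of another direct parent.
        for p in sorted(direct):
            for q in sorted(direct):
                if q != p and q in ancestors(p, parents):
                    violations.append(("related_parents", term, p, q))
    return violations


# Hypothetical toy hierarchy containing one violation of each rule.
toy = {
    "Test": set(),
    "Lab Test": {"Test"},
    "Chemistry Test": {"Lab Test"},
    "Glucose Test": {"Chemistry Test", "Lab Test"},
    "Orphan": set(),
    "A": {"B"},
    "B": {"A"},
}

for violation in audit_hierarchy(toy, root="Test"):
    print(violation)
```

Recomputing ancestors for every term is quadratic in the worst case; for a large terminology the ancestor sets would presumably be cached or computed in a single traversal.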
A number of distributed biomedical terminologies have been assessed for similar adherence to design rules or terminological principles using intrinsic knowledge. These include assessment for adherence to basic ontological principles of SNOMED-CT [34] and the Foundational Model of Anatomy (FMA) [35], assessment for internal consistency in terminologies such as the READ Thesaurus [36], and various approaches to removing cycles in the UMLS [38]. Outside the domain of biomedicine, a set of formal consistency checking rules based on intrinsic hierarchical and semantic inputs has been proposed for the WordNet™ 1.5 lexical database [40] that shares similarities with the routine MED audits.
Intrinsic knowledge can be used for more than highlighting violations of design principles; it has also been used in a variety of interesting ways to correct, or suggest corrections to, content. In general, the correction of content using intrinsic knowledge involves the additional input of extrinsic knowledge, often in the form of an expert who manually reviews items brought to light by an automated process [41
A repeating theme in the literature is the application of semantic relationship patterns to partition a terminology into more manageable pieces for manual review. Concepts having identical relationships (or associations to other concepts) and hierarchy are then brought to the attention of human reviewers. Using intrinsic knowledge for this kind of partitioning has been applied to the MED [42], SNOMED [44], and the NCIt [8] to reveal issues such as missing classifications and redundancy.
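The partitioning idea can be illustrated by grouping concepts on the set of relationship types they carry. The sketch below is a simplification under stated assumptions: it ignores hierarchy, it treats each concept as a name mapped to a set of relationship labels, and the concept names and labels are invented for illustration. It does not reproduce the specific methods of the cited studies.

```python
from collections import defaultdict


def partition_by_relationships(concept_relationships):
    """Group concepts that carry exactly the same set of relationship types."""
    groups = defaultdict(list)
    for concept, relationships in concept_relationships.items():
        groups[frozenset(relationships)].append(concept)
    return {key: sorted(members) for key, members in groups.items()}


# Hypothetical concepts and relationship labels, for illustration only.
concepts = {
    "Serum Glucose Test": {"measures", "has-specimen"},
    "Serum Sodium Test": {"measures", "has-specimen"},
    "Urine Sodium Test": {"measures"},
    "Chest X-ray": {"has-body-site"},
}

groups = partition_by_relationships(concepts)

# List the smallest groups first; a reviewer might start there (an assumption).
for key, members in sorted(groups.items(), key=lambda item: len(item[1])):
    print(sorted(key), members)
```

Here "Urine Sodium Test" lands in a group of its own because it lacks the "has-specimen" relationship that its apparent siblings carry, which is the kind of item a human reviewer would then examine.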
Another example of this approach is to use knowledge in the UMLS to derive a metaschema for the UMLS Semantic Network by grouping semantic types with identical relationship sets [47]. In [13], assignment of UMLS concepts to multiple UMLS semantic types, especially when those types are considered to be in different groups of semantic types of a metaschema [9] or mutually exclusive [10], has been used to suggest classification errors, ambiguity and inconsistency. These methods almost always use expert manual review as a follow-up knowledge source. The MED regularly uses internal semantic relationships to determine classification; Cimino, et al. [51] described an algorithmic approach to enrich hierarchical structure in the MED and determine the correct location in the hierarchy for newly added terms.
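The check on concepts assigned to semantic types from different groups, described above, can be sketched as follows. The type-to-group mapping and the concept identifiers here are toy values chosen for illustration; they are assumptions and should not be read as the actual UMLS assignments or as the metaschema of the cited work.

```python
def flag_cross_group_concepts(concept_types, type_group):
    """Return concepts whose semantic types fall into more than one group."""
    flagged = {}
    for concept, types in concept_types.items():
        groups = {type_group[t] for t in types}
        if len(groups) > 1:
            flagged[concept] = sorted(groups)
    return flagged


# Toy mapping from semantic type to group (illustrative, not authoritative).
type_group = {
    "Pharmacologic Substance": "Chemicals",
    "Organic Chemical": "Chemicals",
    "Disease or Syndrome": "Disorders",
}

# Hypothetical concept identifiers with their assigned semantic types.
concept_types = {
    "concept-1": {"Pharmacologic Substance", "Organic Chemical"},
    "concept-2": {"Pharmacologic Substance", "Disease or Syndrome"},
    "concept-3": {"Disease or Syndrome"},
}

suspects = flag_cross_group_concepts(concept_types, type_group)
print(suspects)
```

A flagged concept is only a candidate; as the text notes, expert manual review decides whether the assignment is an error, an ambiguity, or legitimate.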
Lexical information embedded in terminologies has also been used for auditing processes. In the simplest case, lexical information is used to fix lexical targets, such as spelling errors and uniqueness of term names, an audit performed with every MED update. Lexical information has also been applied to reveal other quality issues. For example, Campbell, et al. used term substrings in SNOMED to suggest classification omissions [52]. In other work, synonymous terms in the UMLS were used to compile a list of keyword synonyms that, in combination with semantic types, was used to detect redundancy [10
Consistent use of linguistic phenomena, such as adjectival modifiers, has been used to assess potential inconsistency in the UMLS and SNOMED. In one study by Bodenreider, et al. [12], the intrinsic characteristics of lexical usage of adjective pairs such as “acute”/“chronic” and “primary”/“secondary” were reviewed in the context of known, extrinsic knowledge to reveal inconsistencies. In another lexical study, drug descriptions from four leading pharmacy system knowledge base vendors were compared across each field to determine lexical consistency of usage in the pharmacy domain [53]. In yet another application of intrinsic lexical knowledge, all terms, synonyms and headings containing the conjunctions “and” or “or” in SNOMED were assessed against editorial board policy on usage, which specifies that “and” should imply logical AND (both must be present), “and/or” should be used when one or both must be present, and “either_or” should be used when one but not both must be present. Usage in this regard was found to be inconsistent [54
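The first step of such a conjunction audit, finding the terms that contain a conjunction so that they can be compared with the usage policy, can be sketched with a regular expression. This is a hypothetical illustration rather than the method of the cited study; the pattern, the function name, and the sample terms are assumptions.

```python
import re

# Longer alternatives come first so "and/or" is not reported as a bare "and".
CONJUNCTION = re.compile(r"\b(and/or|either_or|and|or)\b", re.IGNORECASE)


def terms_with_conjunctions(terms):
    """Map each term containing a conjunction to the conjunctions found in it."""
    hits = {}
    for term in terms:
        found = [match.group(1).lower() for match in CONJUNCTION.finditer(term)]
        if found:
            hits[term] = found
    return hits


# Invented sample terms; "hand" and "tumor" must not trigger a match.
sample_terms = [
    "Nausea and vomiting",
    "Fracture of radius and/or ulna",
    "Disorder of hand",
    "Benign or malignant tumor",
]

hits = terms_with_conjunctions(sample_terms)
for term, conjunctions in hits.items():
    print(term, conjunctions)
```

Deciding whether each use matches the intended logical meaning is not something the pattern can do; that judgment remains with a human reviewer.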