J Biomed Inform. Author manuscript; available in PMC 2010 June 1.
Published in final edited form as:
PMCID: PMC2714188

Structural Group-based Auditing of Missing Hierarchical Relationships in UMLS


The Metathesaurus of the UMLS was created by integrating various source terminologies. The inter-concept relationships were either integrated into the UMLS from the source terminologies or specially generated. Due to the extensive size and inherent complexity of the Metathesaurus, the accidental omission of some hierarchical relationships was inevitable. We present a recursive procedure which allows a human expert, with the support of an algorithm, to locate missing hierarchical relationships. The procedure starts with a group of concepts with exactly the same (correct) semantic type assignments. It then partitions the concepts, based on child-of hierarchical relationships, into smaller, singly rooted, hierarchically connected subgroups. The auditor only needs to focus on the subgroups with very few concepts and their concepts with semantic type reassignments. The procedure was evaluated by comparing it with a comprehensive manual audit and it exhibits a perfect error recall.

Keywords: UMLS, partition, semantic type assignment, auditing, semantic refinement, refined semantic network, hierarchical relationships, refined semantic type

1. Introduction

The Unified Medical Language System (UMLS) [1] helps health professionals and researchers to retrieve biomedical information integrated from a variety of terminology sources. The Metathesaurus (META) [2] of the UMLS contains about 1.5 million biomedical concepts and provides a uniform, integrated distribution format for more than 100 biomedical source terminologies. The META includes all relationships present in its source terminologies and some additional relationships introduced by UMLS editors to connect new concepts that they had to create to disambiguate ambiguous concepts. Since the source terminologies might be missing relationships or contain wrong relationships, the META does not necessarily include all possible relationships between the concepts it contains [3]. The use of hierarchical relationships is a primary feature of the META. Any wrong or missing hierarchical relationships in the META will not only mislead the users of the UMLS, but may also cause other errors in the META. For example, locating missing hierarchical relationships may help to expose other kinds of errors, such as missing lateral relationships, missing or wrong semantic type assignments, and redundant or ambiguous concepts [4,5]. Therefore, locating missing hierarchical relationships or correcting wrong hierarchical relationships in the META is an important and essential task for the UMLS maintainers.

In a recent study of UMLS user preferences [6], users expressed that 35% of a putative UMLS budget should be spent for auditing (more than for any other task). Users were surveyed on the degrees to which they were bothered by twelve kinds of errors. Among the six errors related to missing aspects of a concept, the most concerning one was missing hierarchical relationships. Therefore, it is imperative to audit the META for missing hierarchical relationships to ensure the overall quality and usability of the UMLS.

Due to the extensive size and complexity of the META, auditing for missing hierarchical relationships is an overwhelming task. It is helpful if algorithmic techniques are available for identifying a limited number of concepts with high likelihood of missing hierarchical relationships. In this way, the ever limited resources of domain experts and terminology editors can be utilized more efficiently for finding such errors. To facilitate the auditing task, we describe a methodology for auditing missing hierarchical relationships by applying a “divide and conquer” technique to a collection of small groups of concepts.

In the META, each concept is assigned one or more semantic types (STs) of the Semantic Network (SN) [7-9], which provides a consistent categorization of all concepts in the META. The extent of a semantic type T is defined as the set of all the concepts assigned this ST T. Concepts in the extent of an ST are expected to have the semantics of that ST. Our group-based auditing methodology was designed to audit groups of concepts with the same semantics at a time. The basic idea is that an auditor who is looking at a group of (supposedly) similar concepts will find it relatively easy to notice a concept that does not fit into the group. Such concepts warrant a closer look by an auditor, since they have a high likelihood of an error. Similarly, a concept that is glaringly absent is more likely to be noticed as missing.

As will be pointed out in Section 2, the extent of an ST is not necessarily a semantically uniform group of concepts. Thus, in [10], we introduced a technique in which an extent is partitioned into several sets of concepts which are semantically uniform. We call such a set a refined ST extent. In our previous work [10], incorrect ST assignments were identified and eliminated from the refined ST extents. In this paper, the previous auditing methodology of [10] is referred to as semantic auditing. This paper builds on the methodology and on the results of our work in [10]. However, this paper is completely self-contained, and all necessary results are cited.

In our current methodology, each resulting refined ST extent is further partitioned into cohesive sets. A cohesive set is defined as a singly rooted set of concepts connected through hierarchical relationships (Section 3.1). Hence, for each cohesive set, all concepts are specializations of the root concept. Thus, the concepts of the cohesive set share a uniform overarching semantics, which is more specific than the uniform semantics of being assigned the same refined ST.

The auditing methodology presented in this paper, which is called hierarchical auditing, is based on the hypothesis that root concepts of small cohesive sets with three or fewer concepts have a higher probability of missing hierarchical relationships than other concepts. The methodology checks whether root concepts in each small cohesive set are missing a hierarchical relationship to a large cohesive set. If a missing hierarchical relationship to another concept in a different large cohesive set is located, the two concepts will be connected by an added hierarchical relationship.

Furthermore, if a root of a small cohesive set had its ST assignment corrected by the semantic auditing methodology [10], this may be indicative of a problem. In particular, if a concept was assigned a new ST by semantic auditing, this may also indicate a missing child-of. For example, the concept Mouse Models of Human Cancer, originally assigned Experimental Model of Disease (EMD), additionally assigned Neoplastic Process as a result of auditing, was missing a child-of to Animal Cancer Model.

Note that because of applying our hierarchical auditing methodology to a refined ST extent, the number of cohesive sets in the refined ST extent will be reduced since some small cohesive sets are joined into large cohesive sets.

In this paper, we demonstrate our hierarchical auditing techniques, designed for processing one refined ST extent at a time, by examining the extents of the refined STs derived from Experimental Model of Disease (EMD) and Environmental Effect of Humans (EEH) of the UMLS 2006AB version after the semantic auditing methodology in [10] was applied to them.

2. Background

2.1. The Refined Semantic Network

The META and the SN are components of the UMLS. Each concept in the META has been assigned at least one ST to reflect its meaning. In [7], it was stated that “Semantic types were assigned to Meta-1 concepts and reviewed in the following manner. First, where possible, suggested types were assigned algorithmically using the information available from source vocabularies. Second, types were reviewed or assigned by subject matter experts. Third, all assignments were reviewed by a smaller group of NLM and contractor staff, and finally, all assignments were reviewed and revised by a small team using analyses produced from a relational version of Meta-1.” In [7] it was also stated that “Semantic types are assigned to reflect the meaning of terms in their sources.” If the same term in a source terminology has multiple meanings, the concept is assigned multiple STs accordingly. “Sometimes a single term has different meanings in different sources. In these cases, the terms are represented as separate concepts in Meta-1, each with appropriate semantic type.”

All concepts in the extent E(T) of T are expected to exhibit the same semantics. However, if the extent of a ST contains concepts which are also assigned other STs, the concepts in the extent may have different semantics since some concepts will appear in multiple ST extents. For example, in E(EMD), the concept Carcinoma, Lewis Lung is assigned EMD and also Neoplastic Process (NP), while the concept Experimental Lung Inflammation is assigned only EMD. Thus, an extent, such as E(EMD), is semantically non-uniform since E(EMD) also contains concepts that are in E(NP).

In order to create semantically uniform extents, we introduced the Refined Semantic Network (RSN) [11,12] in our previous research. Figure 1 uses a Venn diagram to explain part of the RSN constructed for E(EMD) resulting from the semantic auditing in [10]. Each ellipse represents the extent of the semantic type written above it. Each box represents a concept. An overlapping part of ellipses represents an intersection of extents of STs.

Fig. 1
The types and intersections of RSN for concepts assigned EMD.

The RSN is a semantically uniform abstraction network consisting of two kinds of STs: pure STs and intersection STs. Both of these kinds are called refined STs. A pure ST T' of RSN is derived directly from the original ST T of the SN. However, the extent E(T') of T' for RSN is not identical to the extent E(T) of T for SN since concepts in E(T) which are also assigned any other STs are not in E(T'). Only those concepts that are not assigned any ST of SN other than T are still assigned this pure ST T' of RSN. Those concepts are considered to have the simple semantics expressed by the pure ST T'. For example, Experimental Lung Inflammation is assigned the pure ST EMD and has the simple semantics of experimental model of disease.

An intersection ST is defined for each non-empty intersection of extents, involving any number of original semantic types. Here, we are using “intersection” in the sense of the standard mathematical notion of set intersection, since extents of STs are defined as sets. We are using the mathematical symbol ∩ for intersection semantic types. A concept originally assigned more than one ST is now assigned a unique intersection ST in RSN. A concept with an assignment of an intersection ST is considered to have a compound semantics. For example, Carcinoma, Lewis Lung is assigned EMD and Neoplastic Process (NP) of SN. Thus, we say it is assigned the intersection ST EMD ∩ NP of RSN and has the compound semantics EMD ∩ NP. Because every concept is assigned exactly one refined ST, all extents of refined STs are disjoint. Thus, any auditing process focusing on one refined ST extent at a time will not encounter the same concept more than once.
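The derivation of refined ST extents from the original ST assignments can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the implementation used for the RSN: the function name `refined_extents` and the dictionary layout are hypothetical, and a refined ST is represented simply as the set of original STs assigned to a concept (one element for a pure ST, several for an intersection ST). The four concepts and their assignments are the ones named in this section.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List


def refined_extents(
    assignments: Dict[str, Iterable[str]],
) -> Dict[FrozenSet[str], List[str]]:
    """Group concepts by their exact set of original semantic types.

    A key with one element stands for a pure ST, a key with several
    elements for an intersection ST (hypothetical representation).
    """
    extents: Dict[FrozenSet[str], List[str]] = defaultdict(list)
    for concept, semantic_types in assignments.items():
        extents[frozenset(semantic_types)].append(concept)
    return dict(extents)


# ST assignments taken from the examples in the text.
assignments = {
    "Experimental Lung Inflammation": ["EMD"],
    "Disease Model": ["EMD"],
    "Carcinoma, Lewis Lung": ["EMD", "NP"],
    "Melanoma, Experimental": ["EMD", "NP"],
}

extents = refined_extents(assignments)
pure_emd = extents[frozenset({"EMD"})]        # extent of the pure ST EMD
emd_np = extents[frozenset({"EMD", "NP"})]    # extent of EMD ∩ NP
```

Since each concept lands under exactly one key, the resulting extents are disjoint by construction, which is the property that lets an audit proceed one refined ST extent at a time.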

There are 31 concepts assigned the pure ST EMD; Figure 2(a) shows, for example, Experimental Lung Inflammation and Disease Model. There are 33 concepts assigned the intersection ST EMD ∩ NP; Figure 2(a) shows, for example, Melanoma, Experimental and Carcinoma, Lewis Lung. Figure 2(b) shows the placement of the intersection ST EMD ∩ NP in the Refined Semantic Network.

Fig. 2
Refined STs derived from the semantic type EMD and their IS-A relationships.

2.2. Auditing Semantic Type Assignments for a Semantic Type Extent

In principle, a refined ST extent E(T') is semantically uniform, since all concepts share the same semantics, either simple or compound, as expressed by their ST assignments. However, some concepts in E(T') may have been assigned the refined ST T' by mistake. If T' is an intersection ST, various situations may occur: all intersected STs of T' are assigned to some concept incorrectly; some intersected STs are wrong for a concept; or some STs may be missing for a concept. Such errors in the refined ST assignments are likely because these refined STs and their extents were not originally created as semantically uniform sets. Rather, they were derived from the original ST assignments, where the extents of the original STs were not necessarily uniform. In our previous research [10], we have developed a group-centered approach to facilitate the task of auditing of ST assignments, concentrating on auditing of a single ST extent. This methodology applies a “divide and conquer” technique by auditing all the refined ST extents of an original ST extent separately.

The methodology of [10] uses an algorithm to identify suspicious concepts based on the ST assignments of parents and children. A suspicious concept is a concept whose ST assignment is neither the same as its parents' assignment(s), nor is it the descendant of its parents' assignment(s). For each refined ST extent, only such suspicious concepts are audited. By auditing only such suspicious concepts, our methodology concentrates on concepts with higher probability of having errors. The auditing methodology in [10] featured a dynamic process, where a re-invocation after the correction of an ST misassignment at a parent concept can lead to the discovery of suspicious children, which were not initially suspicious. This dynamic feature of the methodology enables the auditor to increase the number of errors found with only slightly more effort.

2.3. Auditing Techniques for Terminologies

The UMLS is extremely large and exceedingly complex, consisting of over 100 integrated terminologies organized in a two-layer structure, with the Semantic Network constituting the glue for these terminologies. Thus it is unlikely that a human reviewer would be able to locate all or a majority of existing errors, even when expending significant time and effort. However, many systematic attempts have been made to support error detection in the UMLS. In our previous paper [10] we presented an extensive review of the literature relevant to auditing of terminologies in general and of the UMLS in particular. In the interest of brevity only a summary will be presented here.

Cimino [4,13] presented methods for detecting classification errors in the UMLS. Bodenreider et al. [14-16] investigated redundancy and circularity problems in the UMLS. Redundant semantic type assignments are forbidden in the UMLS [9]. This issue was also investigated by Peng et al. [17]. Hole and Srinivasan presented a method for finding undetected synonymy in the UMLS [18]. A number of approaches have attempted to globally improve the UMLS model by correcting the Semantic Network, as opposed to making local changes to the Metathesaurus [19-21].

The importance of auditing terminologies has been stressed by Min et al. [22]. Various auditing techniques have been proposed and applied to different medical terminologies such as the NCI Thesaurus and the SNOMED [23-29]. Formal tools, such as Description Logics, have been used successfully towards this end [30-33]. In [34-38] different techniques have been used to find errors in the Gene Ontology.

2.4. The Classification Algorithm for Ontologies

Classification is a limited reasoning mechanism that was introduced as part of the KL-ONE family of knowledge representation systems [39]. A detailed description of the KL-ONE classifier can be found in [40]. Citing [41], “Classification is the process of taking a new class description and putting it where it belongs in the class hierarchy … A class is in the right place if it is below all classes that subsume it and above all that it subsumes.”

Thus, the classification algorithm is also referred to as subsumption algorithm. Following [42] “the classifier for KL-ONE deduces that the set denoted by some concept necessarily includes the set denoted by a second concept but where no subsumption relation between the concepts was explicitly entered.” In other words, the classification algorithm takes the descriptions of two concepts as input, for which no “IS-A relationship” was explicitly entered by the knowledge base builder, and it determines whether such an IS-A relationship should hold between those two concepts.

In a landmark paper, reprinted and extended in [43], the authors analyzed two languages FL and FL- that differ only in one representational feature. They show that for FL- subsumption can be computed in polynomial time, while for FL subsumption is intractable. In other words, there is a fundamental tradeoff between the number of features a knowledge representation language provides (expressibility) and the computability of its reasoning algorithms, as demonstrated for subsumption. Thus, a first problem is that the knowledge to which the subsumption algorithm can be applied is fairly limited.

A second problem is that obtaining the logically precise descriptions of the two concepts which are used as input to the classification algorithm is difficult for natural (real world) concepts. These two problems have limited the practical use of the classification algorithm considerably.

The lack of formality of some members of the KL-ONE family led to a general move towards recasting KL-ONE-like structured inheritance networks as Terminological Logics [44,45] and subsequently as Description Logics. Note that [45] is considered the first of an (almost) annual series of Description Logics workshops [46].

The Description Logic Handbook [30] makes it clear that the subsumption algorithm is still front and center stage in Description Logics. As [47] writes: “The basic inference on concept expressions in Description Logics is subsumption,…” Determining subsumption is the problem of checking whether the subsumer is considered more general than the subsumee. “In other words, subsumption checks whether the first concept always denotes a subset of the set denoted by the second concept.” In our approach, we are compensating for all the limitations of the classification approach by letting a human make the subsumption decision at every stage, while the computer organizes the logical order of these subsumption decisions.

2.5. The child-of relationships in the Metathesaurus

The child-of relationship in the Metathesaurus is an important hierarchical relationship. Its instances in the UMLS are derived from corresponding hierarchical relationships in the different source terminologies of the UMLS [15]. Many instances of the child-of relationship appear with an additional annotation, such as is_a, branch_of, member_of or part_of [3]. This label elucidates the nature of each instance of the child-of relationship. Regrettably, in many cases (60%) no further information about the child-of relationship is available, which is indicated by the annotation null. Table 1 shows the distribution of labels associated with the child-of relationship in the UMLS. Approximately 38% are labeled with is_a. These is_a labels are typically derived from well-designed source terminologies such as the SNOMED, the NCI, etc. and greatly improve the representation of relationship semantics in the UMLS.

Table 1
RELA Distribution of child-of in the UMLS

3. Methods

In the group-based approach underlying our methodology in [10], we present an auditor with a group of concepts purportedly exhibiting exactly the same overarching semantics. In this way, concepts not conforming to the semantics should be readily discernable. This motif is repeated in the following methods for partitioning a refined ST extent into even smaller groups.

3.1. Partitioning of Refined ST Extent into Cohesive Sets

As a result of semantic auditing, E(T) will be partitioned into smaller refined ST extents E(Ti). After semantic auditing, each E(Ti), for a fixed i, is deemed semantically uniform. To aid in the further auditing of the concepts of this extent, we will now employ a second step of our “divide and conquer” approach.

Even though all concepts of E(Ti), for a fixed i, have the same semantics, as expressed by the refined ST assignments, the concepts may still differ in their details. For a better comprehension of the concepts of E(Ti), it would help to further partition this set into smaller subsets, each of which has a more precise semantics than the set E(Ti).

To guide us to this more refined partition, we will utilize child-of hierarchical relationships between concepts of E(Ti). The child-of relationship is a fundamental feature in the Metathesaurus, which represents increasing levels of specialization.

Definition (Descendant-of path)

A sequence of concepts P = {c1, c2, …, cn} of E(Ti) is called a descendant-of path if ∀j: 1 ≤ j < n, cj is child-of cj+1.

Note that for n = 2, the descendant-of path consists of just {c1, c2}. In this case, c1 is both descendant-of c2 and child-of c2.
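The definition of a descendant-of path can be checked mechanically, as in the following short Python sketch. The names `is_descendant_of_path` and `parents` are hypothetical, and requiring at least two concepts in a path is an assumption made here for the sketch. The example edges are the child-of relationships among Sarcoma, Avian, Tumor Virus Infections and Neoplasms, Experimental mentioned in this paper.

```python
from typing import Dict, List, Set


def is_descendant_of_path(
    path: List[str], parents: Dict[str, Set[str]], extent: Set[str]
) -> bool:
    """Return True if every concept lies in the extent and each concept
    is child-of the next one in the sequence."""
    if len(path) < 2:  # assumption: a path needs at least two concepts
        return False
    if any(concept not in extent for concept in path):
        return False
    return all(
        path[j + 1] in parents.get(path[j], set())
        for j in range(len(path) - 1)
    )


# concept -> its direct parents (child-of edges), as described in the text
parents = {
    "Sarcoma, Avian": {"Tumor Virus Infections", "Sarcoma, Experimental"},
    "Tumor Virus Infections": {"Neoplasms, Experimental"},
    "Sarcoma, Experimental": {"Neoplasms, Experimental"},
}
extent = {
    "Sarcoma, Avian",
    "Tumor Virus Infections",
    "Sarcoma, Experimental",
    "Neoplasms, Experimental",
}

path = ["Sarcoma, Avian", "Tumor Virus Infections", "Neoplasms, Experimental"]
path_ok = is_descendant_of_path(path, parents, extent)
```

Note that the two-element sequence Sarcoma, Avian, Neoplasms, Experimental is not itself a path in this sense, because there is no direct child-of edge between the two concepts, even though the first is a descendant of the second.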

Definition (Transitive)

A relationship R is transitive if whenever (a R b) and (b R c) is true, it is also true that (a R c). □

As a descendant of another descendant is also a descendant, “descendant-of” is a transitive relationship.

All of the concepts of a descendant-of path are (by transitivity of the descendant-of relationship) specializations of the last concept cn of the path. □

Definition (Root Concept)

A concept r of E(Ti) is a root of E(Ti) if no parent of r is in E(Ti). □

This leads to the central definition of this section.

Definition (Cohesive Set)

A set of concepts of E(Ti) is called a cohesive set if it contains a root concept such that all the other concepts of the set have a descendant-of path directed to the root concept. □

This definition implies a unique root of a cohesive set. We use the name “cohesive set” for this set of vertices since all its concepts are descendant-of the unique root concept (by transitivity of descendant-of), that is, all these concepts are specializations of the root concept. In such a case, we say that all the concepts in the cohesive set share the semantics of the root concept, called the overarching semantics of the cohesive set. For example, there are six cohesive sets, which are enclosed by dashed boxes, in Figure 3. The cohesive set rooted at Neoplasms, Experimental (Figure 3(a)) contains 21 concepts at three different layers of the hierarchy. All these concepts share the overarching semantics of the root Neoplasms, Experimental, but with increased specializations. In other words, all 21 concepts in this set represent experimental cancer diseases. For example, Sarcoma, Avian is a specialization of both Tumor Virus Infections and Sarcoma, Experimental, both of which are specializations of Neoplasms, Experimental.

Fig. 3
Examples of cohesive sets of E(EMD ∩ NP).

Definition (Singleton set (in E(Ti)))

A singleton set is a cohesive set of one concept called a singleton concept (which is its root). □

In Figure 3(b), cohesive sets 2-6 are singletons.

To summarize, our partitioning technique further divides a refined ST extent into cohesive sets. The cohesive sets are typically smaller than the original refined ST extent. They help auditors orient themselves in, and navigate, a refined ST extent during the auditing process.
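The partitioning just summarized can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the names `find_roots`, `cohesive_set` and `parents` are hypothetical, the example extent is a heavily abridged version of the one in Figure 3, and the parent Neoplasms placed outside the extent is invented purely to show that parents outside the extent do not prevent a concept from being a root.

```python
from typing import Dict, Set

Parents = Dict[str, Set[str]]  # concept -> its direct parents (child-of edges)


def find_roots(extent: Set[str], parents: Parents) -> Set[str]:
    """A root has no parent inside the extent; outside parents do not count."""
    return {c for c in extent if not (parents.get(c, set()) & extent)}


def cohesive_set(root: str, extent: Set[str], parents: Parents) -> Set[str]:
    """The root plus every concept of the extent that reaches it by a
    descendant-of path running entirely inside the extent."""
    children: Dict[str, Set[str]] = {}
    for concept in extent:
        for parent in parents.get(concept, set()) & extent:
            children.setdefault(parent, set()).add(concept)
    members, stack = {root}, [root]
    while stack:
        for child in children.get(stack.pop(), set()):
            if child not in members:
                members.add(child)
                stack.append(child)
    return members


extent = {
    "Neoplasms, Experimental", "Sarcoma, Experimental",
    "Tumor Virus Infections", "Sarcoma, Avian",
    "Mouse Glucagonoma", "Sarcoma, Jensen",
}
parents = {
    "Neoplasms, Experimental": {"Neoplasms"},  # hypothetical parent outside the extent
    "Sarcoma, Experimental": {"Neoplasms, Experimental"},
    "Tumor Virus Infections": {"Neoplasms, Experimental"},
    "Sarcoma, Avian": {"Tumor Virus Infections", "Sarcoma, Experimental"},
}

roots = find_roots(extent, parents)
sets_by_root = {r: cohesive_set(r, extent, parents) for r in roots}
```

In this abridged example, the two singletons Mouse Glucagonoma and Sarcoma, Jensen each form their own cohesive set. If a concept had parents under two different roots, it would appear in both sets of this sketch; that multi-root situation is outside what the sketch handles.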

Ideally, one can partition the extent of a refined ST into several disjoint cohesive sets. However, a set of concepts of a refined ST, which are connected by child-of relationships, may have multiple roots, in which case, this set of concepts is not a cohesive set. In this stage of the research, we will assume a partition of the extent of a refined ST into disjoint cohesive sets. At the end of this section, we will show how to deal with a set of connected concepts with multiple roots.

The audit process focuses on cohesive sets with very few concepts. This kind of set represents potential irregularities and has a high likelihood of errors. The reason is that if a cohesive set exists due to its legitimate hierarchical relationships and overarching semantics, then there would probably be at least several concepts in it. We present the following hypothesis:

Hypothesis 1

The probability of missing hierarchical relationships for root concepts of small cohesive sets with three or fewer concepts and especially for singletons is higher than for roots of larger cohesive sets.

For example, the singletons in Figure 3(b) are likely to erroneously lack hierarchical relationships to other concepts. Following this hypothesis, our auditing methodology requires an auditor to manually review the small cohesive sets, which have a relatively high likelihood of errors. This methodology requires only a limited amount of time of an auditor. For the example of Figure 3(b), all singletons are indeed missing child-of relationships. For example, Sarcoma, Jensen should be child-of Sarcoma, Experimental and Experimental Hepatoma should be child-of Neoplasms, Experimental.

Our second hypothesis relates to missing hierarchical relationships for concepts with a wrong ST assignment:

Hypothesis 2

The likelihood of a missing hierarchical relationship is higher for concepts which had a wrong ST assignment than for concepts which had a correct ST assignment.

The basis for this hypothesis is that an error in an ST assignment may indicate a misconception or confusion regarding the concept with that erroneous ST assignment. Such a misconception or confusion may underlie further errors.

3.2. Auditing for Missing Hierarchical Relationships Based on Cohesive Sets

We partition the t cohesive sets of a refined ST extent into two groups: k small sets, each with up to three concepts, and t-k large sets, each with more than three concepts. The k small cohesive sets are rooted at r1, r2, …, rk, respectively, while rk+1, rk+2, …, rt are the roots of the t-k large cohesive sets.
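As a small illustration of this split, the following Python sketch separates cohesive sets by size using the threshold of three concepts stated above. The function name `split_by_size` and the input layout (root mapped to its set of concepts) are hypothetical, and the example sets are abridged for brevity; in the real data the set rooted at Neoplasms, Experimental has 21 concepts.

```python
from typing import Dict, Set, Tuple

SetsByRoot = Dict[str, Set[str]]  # root concept -> concepts of its cohesive set


def split_by_size(
    sets_by_root: SetsByRoot, max_small: int = 3
) -> Tuple[SetsByRoot, SetsByRoot]:
    """Return (small, large): small sets have at most max_small concepts."""
    small = {r: s for r, s in sets_by_root.items() if len(s) <= max_small}
    large = {r: s for r, s in sets_by_root.items() if len(s) > max_small}
    return small, large


# Abridged example sets.
sets_by_root = {
    "Neoplasms, Experimental": {
        "Neoplasms, Experimental", "Sarcoma, Experimental",
        "Tumor Virus Infections", "Sarcoma, Avian",
    },
    "Mouse Glucagonoma": {"Mouse Glucagonoma"},
    "Sarcoma, Jensen": {"Sarcoma, Jensen"},
}

small, large = split_by_size(sets_by_root)
t = len(sets_by_root)  # total number of cohesive sets
k = len(small)         # number of small cohesive sets
```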

We observed that integrating two large cohesive sets potentially involves the interweaving of their concepts, in a way that neither of the two hierarchies is preserved as a complete sub-hierarchy in the resulting integrated hierarchy. Figure 4 shows an abstract example of integrating two large cohesive sets A and B into a cohesive set C. None of the child-of relationships in the cohesive set B is preserved in the integrated cohesive set C, although they are implied by transitivity. A procedure to handle such a case would be very complex and is beyond what we suggest in this paper. Thus we will avoid the integration of two large cohesive sets but focus on the integration of a small cohesive set into a large cohesive set. Later we will discuss the integration of two small cohesive sets that were not integrated into a large cohesive set.

Fig. 4
Example of integrating two large cohesive sets, which potentially involves interweaving their concepts.

For simplicity, we first describe how to integrate a singleton set into an appropriate large cohesive set, if such an appropriate cohesive set exists. The integration of a non-singleton small cohesive set into a large cohesive set will be discussed later. Remember that ri, the root of each singleton set, has neither parents nor children in the extent of the refined ST which we are processing. This methodology is performed in a recursive way. Thus, we will describe only traversing through one level of a large cohesive set. Traversal of lower levels is implicitly described by the recursion.

In the following description, the root of the large cohesive set, rj, has m (m ≥ 0) child concepts c1, c2, …, cm. The root of a singleton cohesive set is ri, and the purpose is to find whether ri fits into the cohesive set rooted at rj (i.e., is more specific than rj). If the answer is “yes,” we continue to check whether it fits into the subhierarchy of rj rooted at its first child c1. Every decision of “fitting” has to be made by a human. In case this is indeed true, the process continues recursively at the children of c1. If the answer is “no,” the methodology continues to check all other children cq, 2 ≤ q ≤ m, of rj. If ri does not fit into any of the subhierarchies of the children of rj, it is added as a new child of rj, since it is more specific than rj.

We note that ri may be more specific than several children of rj, in which case it will be added as a child of several concepts. In such a case, ri ends up with multiple parents. We also note that if ri is not more specific than c1, the methodology checks whether c1 is more specific than ri. In such a case, ri is added between rj and c1, as child of rj and parent of c1.

By applying this recursive methodology through all the levels of the large cohesive set, the process described is similar to the classical classification process used when constructing an ontology, as described in Section 2.

In addition to the steps in the methodology above, we also need to check for the “unusual case” that the root of a large cohesive set, rj, is more specific than a singleton concept, ri. Checking this “unusual case” will occur in the Procedure AuditingAllHierarchicalRelationships(E(Ti)). Lastly, it is also possible that ri is not related by a child-of relationship to rj or any of its descendants. In other words, it is possible that the relationship “is more specific than” does not exist between ri and rj, in either direction.
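The recursive integration described in this section, together with the unusual case, can be sketched in Python as follows. This is a hedged illustration under stated assumptions and not the authors' procedure: the names `integrate`, `audit_singleton`, `children` and `oracle` are hypothetical, the human auditor is modeled by a callback, and dropping the direct edge between rj and a child when ri is inserted between them is an assumption of this sketch. The `TRUTH` table merely replays auditor judgments of the kind reported in this paper on an abridged hierarchy.

```python
from typing import Callable, Dict, Set

Children = Dict[str, Set[str]]       # parent -> direct children (inverted child-of)
Oracle = Callable[[str, str], bool]  # oracle(a, b): auditor judges a more specific than b


def integrate(ri: str, rj: str, children: Children, more_specific: Oracle) -> bool:
    """Place ri in the subhierarchy rooted at rj; return True if ri fits there."""
    if not more_specific(ri, rj):
        return False
    inserted = False
    # Iterate over a sorted snapshot, since the loop may edit children[rj].
    for c in sorted(children.get(rj, set()) - {ri}):
        if integrate(ri, c, children, more_specific):
            inserted = True                    # ri was placed somewhere below c
        elif more_specific(c, ri):
            # ri goes between rj and c (removing the rj-c edge is an assumption).
            children[rj].discard(c)
            children[rj].add(ri)
            children.setdefault(ri, set()).add(c)
            inserted = True
    if not inserted:
        children.setdefault(rj, set()).add(ri)  # ri becomes a new leaf child of rj
    return True


def audit_singleton(ri: str, rj: str, children: Children, more_specific: Oracle) -> str:
    """Try to integrate ri under rj, then check the unusual case."""
    if integrate(ri, rj, children, more_specific):
        return "integrated"
    if more_specific(rj, ri):                   # unusual case: ri is more general
        children.setdefault(ri, set()).add(rj)
        return "new parent of root"
    return "unrelated"


# Stand-in for the auditor: concept -> every concept it is more specific than.
TRUTH = {
    "Mouse Glucagonoma": {"Neoplasms, Experimental"},
    "Sarcoma, Jensen": {"Sarcoma, Experimental", "Neoplasms, Experimental"},
    "Rous Sarcoma": {"Tumor Virus Infections", "Sarcoma, Experimental",
                     "Neoplasms, Experimental"},
    "Hepatoma, Morris": {"Experimental Hepatoma", "Neoplasms, Experimental"},
    "Experimental Hepatoma": {"Neoplasms, Experimental"},
}


def oracle(a: str, b: str) -> bool:
    return b in TRUTH.get(a, set())


# Abridged large cohesive set rooted at Neoplasms, Experimental.
children: Children = {
    "Neoplasms, Experimental": {"Sarcoma, Experimental", "Tumor Virus Infections"},
    "Sarcoma, Experimental": {"Sarcoma, Avian"},
    "Tumor Virus Infections": {"Sarcoma, Avian"},
}
for singleton in ["Mouse Glucagonoma", "Sarcoma, Jensen", "Rous Sarcoma",
                  "Hepatoma, Morris", "Experimental Hepatoma"]:
    audit_singleton(singleton, "Neoplasms, Experimental", children, oracle)
```

With this input order, Hepatoma, Morris is first added as a leaf child of the root, and Experimental Hepatoma is then inserted between the root and Hepatoma, Morris. Every comparison is delegated to the callback, which reflects the design choice that the human makes each subsumption decision while the program only organizes their order.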

Let us use the singletons in Figure 3(b) to demonstrate the above methodology. Our goal is to check if those singletons fit into the large cohesive set rooted at Neoplasms, Experimental by applying the recursive methodology. Here, we will show several scenarios which are handled by the algorithm: 1) The singleton concept is added as a leaf child of the root of a large cohesive set; 2) The concept is added as a leaf descendant of a child of the root of a large cohesive set; 3) The concept is added as a child of multiple concepts; 4) The concept is inserted between two concepts; and 5) The concept is added as a parent of the root of a large cohesive set.

Adding a Concept as a Leaf Child of the Root: The first singleton to be audited is Mouse Glucagonoma. The root of the large cohesive set is Neoplasms, Experimental. An auditor checks whether Mouse Glucagonoma is more specific than Neoplasms, Experimental. The answer is “yes.” Then the flag inserted is set to 0 (line 6), indicating Mouse Glucagonoma has not been inserted into the large cohesive set. The recursive call is applied to the 10 children of Neoplasms, Experimental (Figure 3(a)) one by one (lines 7-8).

We find that Mouse Glucagonoma is not more specific than any of the children. We then check whether any of the 10 children of Neoplasms, Experimental are more specific than Mouse Glucagonoma. The answer is again “no.” Therefore, lines 9-13 are not executed in this case, and the flag inserted remains 0. The recursive calls exit at the first level of children. Since the flag inserted is 0 (line 14), Mouse Glucagonoma is added to the cohesive set as a leaf child of Neoplasms, Experimental.

Adding a Concept as a Leaf Descendant of a Child of the Root: When auditing the second singleton Sarcoma, Jensen, we find that it is more specific than Neoplasms, Experimental. We set the flag inserted to 0 and then check whether it is more specific than any children of Neoplasms, Experimental (line 2). Thus, Sarcoma, Jensen is recursively compared with all 10 children of Neoplasms, Experimental (line 6). The first five, from left to right, are neither more specific nor more general than Sarcoma, Jensen. The recursions exit when applied to those children.

When compared with Sarcoma, Experimental, the singleton Sarcoma, Jensen is found to be more specific. Therefore, recursive calls are applied to the children of Sarcoma, Experimental (Figure 3(a)). However, none of these children is either more specific or more general than Sarcoma, Jensen, which is therefore inserted as a child of Sarcoma, Experimental. Using the same methodology, Sarcoma, Jensen is compared with the rest of the children of Neoplasms, Experimental, but no other child is found that is more specific or less specific than it.

Adding a Concept as a Child of Multiple Concepts: When auditing the third singleton Rous Sarcoma, we find that it is more specific than Neoplasms, Experimental. We then check whether it is also more specific than any of the children of Neoplasms, Experimental. Thus, Rous Sarcoma is recursively compared with all 10 children of Neoplasms, Experimental. Rous Sarcoma is more specific than Tumor Virus Infections; therefore, it is compared with Sarcoma, Avian, the only child of Tumor Virus Infections. The recursion is then complete. Since Rous Sarcoma is neither more specific nor more general than Sarcoma, Avian, it is added as a direct child of Tumor Virus Infections. Among the rest of the children of Neoplasms, Experimental, Rous Sarcoma is more specific than Sarcoma, Experimental. Recursive calls are then applied to the five children of Sarcoma, Experimental, including Sarcoma, Jensen, added earlier. However, none of those children is either more specific or more general than Rous Sarcoma, which is added as a child of Sarcoma, Experimental. In this scenario, Rous Sarcoma will have two parents: Tumor Virus Infections and Sarcoma, Experimental.

The process for integrating the remaining two singletons in Figure 3(b) depends on the order in which these singletons are selected as input. For example, if Experimental Hepatoma is considered first, then Experimental Hepatoma is added as a leaf child of the root Neoplasms, Experimental (Scenario 1), followed by adding the leaf Hepatoma, Morris as a child of Experimental Hepatoma, which is itself a child of the root (Scenario 2). However, if Hepatoma, Morris is selected as an input before Experimental Hepatoma, adding Hepatoma, Morris follows the case of adding a leaf child of the root (Scenario 1), but adding Experimental Hepatoma becomes more complicated. It needs to be inserted between Hepatoma, Morris and Neoplasms, Experimental, since Experimental Hepatoma is more specific than Neoplasms, Experimental, but more general than Hepatoma, Morris, as will be discussed next.

Inserting a Concept between two Concepts: Suppose Hepatoma, Morris is added before Experimental Hepatoma as a child of Neoplasms, Experimental. The recursive methodology is applied here for inserting Experimental Hepatoma. As Experimental Hepatoma is more specific than Neoplasms, Experimental, it is compared with all 11 children (including the added Hepatoma, Morris) of Neoplasms, Experimental. All the recursive calls exit at the first level, since Experimental Hepatoma is not more specific than any of those 11 concepts. When each recursive call exits, a test is performed whether any of the 11 children is more specific than Experimental Hepatoma, and only Hepatoma, Morris is. Therefore, Experimental Hepatoma is inserted between Hepatoma, Morris and Neoplasms, Experimental.
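The scenarios walked through above can be condensed into a short recursive sketch. It is an illustration under stated assumptions, not the listing of the procedure itself: the expert's judgment is modeled by a hypothetical callable `more_specific(a, b)` backed by a hand-written is-a table, the hierarchy is a plain dictionary from parent to set of children, and the names `make_oracle`, `insert_concept` and `placed_below` are invented here (`placed_below` plays a role similar to the flag inserted). Only a reduced version of the set rooted at Neoplasms, Experimental is used, with the three existing concepts named in the text.

```python
from collections import defaultdict


def make_oracle(is_a):
    """Stand-in for the human expert: answers from a hand-written is-a table."""
    def more_specific(a, b):
        # True if b is reachable from a by following is-a links upward.
        stack, seen = list(is_a.get(a, ())), set()
        while stack:
            cur = stack.pop()
            if cur == b:
                return True
            if cur not in seen:
                seen.add(cur)
                stack.extend(is_a.get(cur, ()))
        return False
    return more_specific


def insert_concept(children, new, node, more_specific):
    """Place `new` at or below `node`; return True if `new` belongs under `node`."""
    if not more_specific(new, node):
        return False
    placed_below = False
    for child in sorted(children[node]):  # sorted() also snapshots the set
        if child == new:
            continue
        if insert_concept(children, new, child, more_specific):
            placed_below = True  # `new` found a place under this child
        elif more_specific(child, new):
            # `new` sits between `node` and `child`: redirect the child-of link.
            children[node].discard(child)
            children[new].add(child)
    if not placed_below:
        children[node].add(new)  # direct child of `node`
    return True


# Reduced hierarchy (assumption: only concepts named in the text are included).
root = "Neoplasms, Experimental"
children = defaultdict(set)
children[root].update({"Sarcoma, Experimental", "Tumor Virus Infections"})
children["Tumor Virus Infections"].add("Sarcoma, Avian")

# The expert's answers, taken from the scenarios described in the text.
expert = make_oracle({
    "Sarcoma, Experimental": [root],
    "Tumor Virus Infections": [root],
    "Sarcoma, Avian": ["Tumor Virus Infections"],
    "Mouse Glucagonoma": [root],
    "Sarcoma, Jensen": ["Sarcoma, Experimental"],
    "Rous Sarcoma": ["Tumor Virus Infections", "Sarcoma, Experimental"],
    "Experimental Hepatoma": [root],
    "Hepatoma, Morris": ["Experimental Hepatoma"],
})

for singleton in ["Mouse Glucagonoma", "Sarcoma, Jensen", "Rous Sarcoma",
                  "Hepatoma, Morris", "Experimental Hepatoma"]:
    insert_concept(children, singleton, root, expert)
```

In this run Hepatoma, Morris is deliberately processed before Experimental Hepatoma, so the last call exercises the insertion between two concepts. In practice each call to `more_specific` would be a question put to the auditor rather than a table lookup.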

After auditing the hierarchical relationships, one cohesive set, consisting of all the cohesive sets from Figure 3, is shown in Figure 5. Broken lines represent the missing hierarchical relationships added after applying the procedure AuditingHierarchicalRelationships to integrate each singleton set into the cohesive set rooted at Neoplasms, Experimental.

Fig. 5
Audited hierarchical relationships for cohesive sets in Figure 3.

As mentioned earlier, a connected component of a refined ST extent may have several roots. In such a case it is not a cohesive set, which by definition has only one root. For such a set, one can add an artificial root “Thing” as the parent of all the roots of the extent and then apply the AuditingHierarchicalRelationships(ri, Thing) procedure. Note that since our procedure allows an added singleton concept to become a child of several concepts, the new singleton may be a descendant of several original roots. Figure 6(a) and (b) show, respectively, the situation before and after integrating a singleton set rooted at x into a multi-rooted set rooted at a and b. An artificial root “Thing” is added before the procedure AuditingHierarchicalRelationships(x, Thing) is called.

Fig. 6
Integration of a singleton set into a multiply rooted set.
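A minimal sketch of the artificial-root step follows. The function name `add_artificial_root` and the dictionary representation are assumptions made for illustration, and the small example is an abstract multi-rooted set with roots a and b; it does not attempt to reproduce the exact shape of Figure 6.

```python
def add_artificial_root(children, concepts, top="Thing"):
    """Attach a synthetic root above every concept that has no parent."""
    has_parent = {c for kids in children.values() for c in kids}
    roots = sorted(c for c in concepts if c not in has_parent)
    children.setdefault(top, set()).update(roots)
    return roots


# Abstract multi-rooted connected set: roots a and b share the descendant c.
children = {"a": {"c"}, "b": {"c", "d"}}
concepts = {"a", "b", "c", "d"}
roots = add_artificial_root(children, concepts)
```

After this step the set has the single root “Thing”, so a singly rooted insertion routine can be started from it.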

We have described the procedure AuditingHierarchicalRelationships(ri, rj) and demonstrated it by illustrating several common scenarios that arise when integrating a singleton into a large cohesive set. We now present the AuditingAllHierarchicalRelationships(E(Ti)) procedure, which audits the hierarchical relationships between any small cohesive set and any large cohesive set. We first split the non-singleton small cohesive sets into singletons, inserting their concepts into the large cohesive sets (those with more than three concepts) one at a time. The reason for splitting a small cohesive set into singleton concepts is that, as we saw in the example of Figure 4, the hierarchy of the small cohesive set is not necessarily preserved as is when it is integrated into a large cohesive set. Hence, integrating it as a unit would be complex. By splitting its concepts into singletons and integrating them one by one, the complex integration process is divided into several simple integrations, each utilizing the AuditingHierarchicalRelationships(ri, rj) procedure. As the example shows, if the small cohesive set does appear in the integrated cohesive set as a subhierarchy preserving the original child-of relationships, the same result is obtained by the repeated insertion of its concepts as singletons.

For each singleton rooted at rh, we start by checking the “unusual case” that the root rj of a large cohesive set is more specific than the singleton rh, in which case we make rj child-of rh (and thus rh becomes the new root of this large cohesive set). If rj is more general than rh, we check whether rh fits into the large cohesive set rooted at rj by calling the AuditingHierarchicalRelationships(rh, rj) procedure. If rh is neither more general nor more specific than rj, we continue to check the next cohesive set.
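The three-way case analysis just described can be sketched as a driver loop. This is a hedged illustration: `audit_all`, `insert_into` and `leftover` are names invented here, the expert is again a hypothetical callable `more_specific`, and the recursive insertion is passed in as a callback so that the block stays self-contained. In the example the callback is only a placeholder that adds a leaf child, and the singleton "X" is a made-up unrelated concept.

```python
def audit_all(singletons, large_roots, more_specific, insert_into, children):
    """Try each singleton against the root of each large cohesive set."""
    leftover = []
    for rh in singletons:
        placed = False
        for i in range(len(large_roots)):
            rj = large_roots[i]
            if more_specific(rj, rh):
                # Unusual case: the singleton becomes the new root of the set.
                children.setdefault(rh, set()).add(rj)
                large_roots[i] = rh
                placed = True
            elif more_specific(rh, rj):
                insert_into(rh, rj)  # the recursive procedure would go here
                placed = True
            # Otherwise rh and rj are unrelated: move on to the next set.
        if not placed:
            leftover.append(rh)
    return leftover


# Expert answers as (more specific, more general) pairs, taken from the EMD example.
pairs = {("Animal Disease Models", "Disease Model"),
         ("Rodent Model", "Disease Model")}
children = {}
large_roots = ["Animal Disease Models"]
leftover = audit_all(
    ["Disease Model", "Rodent Model", "X"],
    large_roots,
    lambda a, b: (a, b) in pairs,
    lambda rh, rj: children.setdefault(rj, set()).add(rh),  # placeholder insertion
    children,
)
```

Singletons that end up in `leftover` are the ones that would go on to the pairwise comparison among remaining singletons.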


Our procedure does not check whether there are missing hierarchical relationships between two singleton concepts (or between a concept of a small non-singleton cohesive set and a singleton concept). We will now discuss the first case. When the application of the procedure is complete, it is possible that some hierarchical relationships are still missing among singleton concepts. To identify all those missing hierarchical relationships, if some singletons are left after checking whether each singleton fits into some large cohesive set, we look for a missing hierarchical relationship between any two such singletons, in both directions. Similarly, we try to insert every singleton into every non-singleton small cohesive set. For the case of two small non-singleton sets, we separate one of them into singletons and then try to integrate these singletons into the other small cohesive set, similar to the integration into a large cohesive set discussed earlier. Finally, we note that if we tried to integrate pairs of small cohesive sets before trying to integrate them into the large cohesive sets, the integrated (still relatively small) cohesive set would be harder to integrate into a large cohesive set. Thus we follow the order of integration as presented.

Finally, we need to prove that the resulting hierarchy is independent of the order in which the singleton concepts are considered. This is straightforward when two singletons end up in two independent branches of the hierarchy. However, it is not so clear when one is a parent or an ancestor of the other in the final hierarchy. To prove order independence in such a case, let us consider an abstract example of two singletons x and y, such that an expert determines that y is more specific than x. Suppose also that the given hierarchy is rooted at concept a, which has a concept b as a child (b child-of a). Furthermore, we assume that each of the concepts x and y is more specific than a and more general than b. Figures 7(a) and (b) show the relative situation of these four concepts before and after the integration of the singletons.

Fig. 7
Before and after integrating the singletons x and y into a hierarchy rooted at a.

We need to show that for either order of processing the singletons x and y, the final hierarchy will be as shown in Figure 7(b). Let us assume first that x is processed before y. When x is considered for integration, it is found to be more specific than a but more general than b. Thus it is added between a and b to yield the hierarchy in Figure 8. When y is considered later (not necessarily immediately after x), it is considered for integration into the hierarchy of Figure 8. Now y is found to be more specific than a, and more specific than x, but more general than b. Thus it is added between x and b to yield the hierarchy of Figure 7(b).

Fig. 8
The hierarchy after adding x when x is considered before y.

Now let us consider the alternative case where y is considered for integration into the hierarchy of Figure 7(a) before x. When y is compared to a, it is found to be more specific, and when it is compared to b, it is found to be more general. Thus, y is added between the concepts a and b, yielding the hierarchy of Figure 9. Later (not necessarily immediately) the concept x is considered for integration into the hierarchy of Figure 9. It is compared to a and found to be more specific. Then it is compared to y, which is a child of a in Figure 9, and it is found to be more general (according to the assumption made originally about the relationship between x and y). Thus, x is added between a and y to yield the hierarchy of Figure 7(b). Hence the same hierarchy (of Figure 7(b)) is obtained independent of the order of processing the insertion of the singletons x and y.

Fig. 9
The hierarchy after adding y when y is considered before x.
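The order-independence argument for x and y can also be checked mechanically on the abstract example. The sketch below restates a compact insertion routine so that it runs on its own; the routine and the names `insert_concept` and `integrate` are illustrative assumptions rather than the procedure as published, and the expert's answers are hard-coded as the transitive closure of b is-a y is-a x is-a a.

```python
from collections import defaultdict
from itertools import permutations

# (more specific, more general) pairs the expert would confirm for Figure 7.
MORE_SPECIFIC = {("x", "a"), ("y", "a"), ("b", "a"),
                 ("y", "x"), ("b", "x"), ("b", "y")}


def more_specific(p, q):
    return (p, q) in MORE_SPECIFIC


def insert_concept(children, new, node):
    """Place `new` at or below `node`; return True if it belongs under `node`."""
    if not more_specific(new, node):
        return False
    placed_below = False
    for child in sorted(children[node]):
        if insert_concept(children, new, child):
            placed_below = True
        elif more_specific(child, new):
            # Insert `new` between `node` and `child`.
            children[node].discard(child)
            children[new].add(child)
    if not placed_below:
        children[node].add(new)
    return True


def integrate(order):
    """Start from the hierarchy of Figure 7(a) and add the singletons in `order`."""
    children = defaultdict(set)
    children["a"].add("b")
    for singleton in order:
        insert_concept(children, singleton, "a")
    return {parent: set(kids) for parent, kids in children.items() if kids}


results = [integrate(order) for order in permutations(["x", "y"])]
```

Both entries of `results` describe the same chain from a through x and y down to b, which is the hierarchy the proof arrives at for either processing order.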

We note that such a situation occurs for the extent of EMDNP, regarding the concepts Hepatoma, Morris and Experimental Hepatoma as described earlier for the case of inserting a concept between two concepts. Other more complex configurations can occur. For example, there may be a concept in the hierarchy which is more specific than a but more general than b. Another possibility is that there may be an additional concept c which is also more specific than a but independent of b. Such a situation occurred, for example, when concept Hepatoma Novikoff was added to the above two concepts. For such a configuration, the proof that the same hierarchy will result, independent of the order of considering the singletons, is very similar to the proof given above.

4. Results

We have chosen to demonstrate our partitioning and auditing techniques for the extents of Experimental Model of Disease (EMD) (representation in a non-human organism of a human disease for the purpose of research into its mechanism or treatment) and Environmental Effect of Humans (EEH) (change in the natural environment that is a result of the activities of human beings) of the UMLS 2006AB version.¹

4.1. Auditing the extent of Experimental Model of Disease

4.1.1. Partition of refined ST Extent into Cohesive Sets

Figures 10 and 11 show the hierarchies of the extents of the refined STs EMD and EMD ∩ Neoplastic Process after applying semantic auditing [10] with 23 and 14 cohesive sets respectively. There are 23 cohesive sets in Figure 10: a large cohesive set containing eight concepts, one cohesive set containing three concepts, and 21 singletons. According to Hypothesis 1, the roots of the 22 cohesive sets with three or fewer concepts are likely to be missing hierarchical relationships. Therefore, the auditing for missing hierarchical relationships is focused on these 22 small cohesive sets.

Fig. 10
EMD (pure ST) after semantic auditing.
Fig. 11
EMDNP (intersection ST) after semantic auditing.
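The partition into cohesive sets and the split into small and large sets can be sketched as follows. The function names `cohesive_sets` and `split_by_size` are invented for illustration, the sketch assumes an acyclic extent in which every concept reaches exactly one root, and the concept names are placeholders chosen only so that the set sizes match those reported for Figure 10 (one set of eight, one set of three, 21 singletons).

```python
def cohesive_sets(concepts, child_of):
    """child_of: iterable of (child, parent) pairs. Returns {root: members}."""
    kids, has_parent = {}, set()
    for child, parent in child_of:
        if child in concepts and parent in concepts:
            kids.setdefault(parent, set()).add(child)
            has_parent.add(child)
    sets = {}
    # A root has no child-of link to another concept of the same extent.
    for root in sorted(c for c in concepts if c not in has_parent):
        members, stack = set(), [root]
        while stack:
            cur = stack.pop()
            if cur not in members:
                members.add(cur)
                stack.extend(kids.get(cur, ()))
        sets[root] = members
    return sets


def split_by_size(sets, threshold=3):
    """Small sets have at most `threshold` concepts; the rest are large."""
    small = {root for root, members in sets.items() if len(members) <= threshold}
    return small, set(sets) - small


# Placeholder extent with the sizes reported for Figure 10.
concepts = ({f"L{i}" for i in range(8)} | {"T0", "T1", "T2"}
            | {f"S{i}" for i in range(21)})
links = [(f"L{i}", "L0") for i in range(1, 8)] + [("T1", "T0"), ("T2", "T1")]
sets = cohesive_sets(concepts, links)
small, large = split_by_size(sets)
```

With these placeholder data the partition yields 23 cohesive sets, of which 22 are small, matching the counts given in the text.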

The procedure AuditingAllHierarchicalRelationships(E(Ti)) was first applied to the extent of the pure ST EMD. In Figure 10, there is one cohesive set rooted at Transgenic Model containing three concepts. According to the methodology, this cohesive set is split into three singletons. Therefore, after the split, there are 21 + 3 = 24 singletons in E(EMD). Then we check whether each of these 24 singletons fits into the large cohesive set starting at the root concept Animal Disease Models.

Eleven concepts are added as leaf children of the root Animal Disease Models, since they are more specific than Animal Disease Models, but none of the eleven is more specific or more general than any of the children of Animal Disease Models (see Figure 12). Five additional concepts are added as leaf descendants of children of the root Animal Disease Models (see Figure 12).

Fig. 12
EMD (pure ST) after hierarchical auditing.

Adding a Concept as Parent of the Root: The singleton concept Disease Model is more general than Animal Disease Models, the root of the large cohesive set. This is a case of the procedure AuditingHierarchicalRelationships(ri, rj) that was not demonstrated in Section 3.2. Thus, a child-of relationship is added from Animal Disease Models to Disease Model. After the root has been changed to Disease Model, the singletons rooted at Rodent Model, Non-Rodent Model, Xenograft Model and Transgenic Model are inserted as leaf children of the new root.

The cohesive set rooted at Transgenic Model is split into singletons and each concept is considered for insertion from top to bottom. Therefore, the addition of the first of these three singleton concepts, Transgenic Model, follows the steps of adding a leaf child of the root, while the other two are added as leaf descendants of a child of the root. As a consequence, the original cohesive set of three concepts appears as a whole, under Disease Model (see Figure 12). In total, 21 missing hierarchical links were added in E(pure ST EMD) and one cohesive set (Figure 12) is obtained.

Among the 13 cohesive sets in E(EMD ∩ Neoplastic Process) (see Figure 13), there is one large cohesive set with 21 concepts, and there are 12 singleton cohesive sets. Therefore, the auditing efforts concentrated on these 12 singleton concepts.

Fig. 13
EMDNP (intersection ST) after hierarchical auditing.

In a process similar to the auditing of E(pure ST EMD), the procedure AuditingAllHierarchicalRelationships(E(Ti)) is applied to E(EMD ∩ Neoplastic Process). Five concepts are added as leaf children of the root Neoplasms, Experimental. Six concepts are added as leaf children of the descendants of the root. As was demonstrated in Section 3.2, one concept, Rous Sarcoma, appears as a child of multiple parents. In total, 13 hierarchical links are added, as shown in Figure 13, and as a result all concepts in E(EMD ∩ Neoplastic Process) are connected.

To evaluate the hypotheses on auditing of hierarchical relationships for the EMD extent, we conducted an independent, exhaustive review by looking for missing hierarchical relationships among all pairs of cohesive sets, and only the same 13 hierarchical links were added.

4.2. Auditing the extent of EEH

4.2.1. Partition Refined ST Extent into Cohesive Sets

We also performed an audit of EEH. After performing semantic auditing [10], there were 21 cohesive sets for the refined ST EEH, among which 20 were singletons. As a result of hierarchical auditing, we added six child-of relationships. For example, Thermal Water Pollution is more specific than Water Pollution. Therefore, a child-of link was added to establish the hierarchical relationship between these two concepts. Environmental Sludge and Atmospheric Pollution were singletons. They are kinds of Environmental Pollution, just like their counterparts, such as Indoor Pollution. Pollutant Transport and Contaminant Transport are specializations of Environmental Transport. Therefore, the proper child-of links were established. Noise, Transportation was a singleton that was missing a child-of link to Noise Pollution, and the hierarchical link was added. No hierarchical relationships were added for the refined STs EEH ∩ Hazardous or Poisonous Substance, EEH ∩ Substance and EEH ∩ Quantitative Concept.

To evaluate the hypotheses on auditing of hierarchical relationships for the EEH extent, an independent, exhaustive review was conducted by checking for missing hierarchical relationships among all pairs of cohesive sets. Only the same six hierarchical relationships were found missing.

5. Discussion

5.1. Evaluation

In order to evaluate the auditing results obtained by our methodologies, we applied them to two different STs with small extents, EMD and EEH. To measure the performance of our methodology, we conducted a comprehensive manual audit for each of the two tasks for the two STs. With respect to the extents of the refined STs of EMD, the pure EMD and the intersection EMDNP, a recall of 1.0 was achieved for finding missing hierarchical relationships in the whole extent of EMD. In fact, when the process was completed there were two cohesive groups, one for each refined ST, connecting all the previously isolated smaller cohesive groups. No multiple parents were found in the manual review. Similar results were found for EEH-related extents. The recall was also 1.0.

The results for EMD show that the recall for the roots of small cohesive sets missing hierarchical relationships, as predicted by Hypothesis 1, is 1.0. Among the two roots of the large cohesive sets, one (50%), Animal Disease Models, was missing a hierarchical relationship. Among the 34 small cohesive sets (22 for the pure EMD extent and 12 for the EMDNP extent), all were missing hierarchical relationships. (One concept, Rous Sarcoma, was missing two hierarchical relationships.) Hence for the EMD extent, 100% of the roots of the small cohesive sets were missing hierarchical relationships. The results for the STs of EMD confirmed Hypothesis 1 that the probability of missing hierarchical relationships for roots of cohesive sets is higher in small cohesive sets with three or fewer concepts (100%) than in large cohesive sets (50%). For the EEH extent, 30% (6/20) of the roots of the small cohesive sets were missing hierarchical relationships. No missing hierarchical relationships were found for the large cohesive set, also confirming Hypothesis 1.
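The recall and the Hypothesis 1 proportions quoted above are simple ratios, and the arithmetic can be laid out explicitly. In this sketch the helper names `recall` and `share` are invented, the counts are the ones given in the text, and the integer ids in the recall example are placeholders standing in for the six EEH relationships rather than actual data.

```python
from fractions import Fraction


def recall(found, gold):
    """Fraction of the relationships in `gold` that the procedure also found."""
    return Fraction(len(found & gold), len(gold))


def share(roots_missing_a_link, roots_total):
    """Proportion of cohesive-set roots that were missing a relationship."""
    return Fraction(roots_missing_a_link, roots_total)


# Placeholder ids: the manual review found the same six EEH relationships.
eeh_gold = set(range(6))
eeh_found = set(range(6))
eeh_recall = recall(eeh_found, eeh_gold)

# Counts reported in the text for roots of small and large cohesive sets.
rates = {
    "EMD small": share(34, 34),
    "EMD large": share(1, 2),
    "EEH small": share(6, 20),
    "EEH large": share(0, 1),
}
percent = {name: float(rate * 100) for name, rate in rates.items()}
```

Exact fractions are used so that the percentages come out without floating-point rounding noise.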

For Hypothesis 2, only E(EMDNP) can provide data, since there were no changes to the ST assignments of any of the concepts in E(pure EMD). For EMDNP, there were six concepts with missing hierarchical relationships among the seven concepts with erroneous ST assignments (84%), versus six concepts with missing hierarchical relationships (one missing two such relationships) among 26 concepts with correct ST assignments (23%). The EEH extent does not provide data for Hypothesis 2, because there were no changes to the ST assignments of any of the concepts in E(pure EEH). No missing hierarchical relationships were identified in the other refined extents of EEH: E(EEH ∩ Substance) and E(EEH ∩ Hazardous or Poisonous Substance).

Hence, only the results for EMD supported Hypothesis 2 about an expected higher likelihood of missing hierarchical relationships for concepts with erroneous ST assignments. The data on EMD show that ST assignment errors tend to expose other errors as well. However, the evidence for Hypothesis 2 is weak, as it occurred only for one of the several refined STs. More studies are needed to assess Hypothesis 2.

5.2. Implementation

The ST assignments for UMLS concepts are an artifact created by the NLM when integrating various source terminologies [48]. Hence, the NLM is the only organization responsible for the assignments and has no outside constraints preventing it from correcting wrong assignments. However, there are a few possible ways of handling missing child-of relationships. If both child and parent concepts appear in the same source terminology, then it is possible to communicate the correction of adding the hierarchical relationship to the organization maintaining this source terminology. For other cases, only the NLM can add such a relationship.

In Table 2, all the missing child-of relationships we identified for the concepts assigned EMD are listed. For both the child and the parent, their source terminologies are listed. For 16 (highlighted) out of 26 concepts, both child and parent appear in the MESH [49] source terminology. Thus, these results can be submitted to the MESH editor with the suggestion that the missing child-of relationships be added. The corrections in the MESH terminology would then propagate to future releases of the UMLS. One missing child-of relationship appears between two concepts from the NCI [50] source terminology. This can be corrected by an NCI editor. The remaining nine cases are between concepts from different source terminologies. For EEH, only one child-of relationship is missing between two concepts of the source terminology MESH, from Thermal Water Pollution to Water Pollution. Hence, a side effect of such an auditing effort is that missing hierarchical relationships can be corrected in UMLS source terminologies as well.

Table 2
Missing hierarchical relationships between EMD concepts (and the concepts' source terminologies)
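The routing rule described in this subsection (report to a shared source terminology if there is one, otherwise to the NLM) can be written as a small helper. The function name `responsible_party` is an assumption, the source labels follow the spelling used in the text, and the second call uses a hypothetical cross-source pair that is not taken from Table 2.

```python
def responsible_party(child_sources, parent_sources):
    """Return the shared source terminologies, or ["NLM"] if there are none."""
    shared = sorted(set(child_sources) & set(parent_sources))
    return shared if shared else ["NLM"]


# Thermal Water Pollution and Water Pollution both come from MESH (per the text).
same_source = responsible_party({"MESH"}, {"MESH"})

# Hypothetical pair whose child and parent come from different sources.
cross_source = responsible_party({"MESH"}, {"NCI"})
```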

5.3. Limitations

The methodology of hierarchical auditing presented in this paper was tested for two STs with small extents. Experiments with more STs with larger extents are needed to confirm the general applicability of the methodology. Such experiments will also provide more information about the recall obtained by the methodology. In general, we cannot expect the perfect recall obtained for our two test STs.

We limited our research to the child-of hierarchical relationships, which are marked as CHD/PAR in the MRREL file. The reason is that they are more reliable than the narrower/broader relationships, which are marked as RN/RB in the MRREL file, since the first kind is given as a hierarchical relationship in its source terminology, while the second kind lacks such a designation in the source terminology [15]. Also, due to the very general interpretation of the narrower/broader relationships, it is difficult to determine when such a relationship is truly missing.
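As a rough sketch of this restriction, the snippet below keeps only CHD and PAR rows from pipe-delimited MRREL lines and ignores RN/RB. The column positions and the reading direction (REL taken as the relationship of the second concept to the first) are assumptions that should be checked against the UMLS reference documentation for the release in use; the function name `child_of_pairs`, the identifiers and the source label `SRC_X` in the sample rows are made up.

```python
# Assumed zero-based column positions in a pipe-delimited MRREL row.
CUI1, REL, CUI2, SAB = 0, 3, 4, 10


def child_of_pairs(lines):
    """Collect (child, parent, source) triples from CHD/PAR rows only."""
    pairs = set()
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) <= SAB:
            continue  # malformed or truncated row
        if fields[REL] == "CHD":
            # Assumed reading: CUI2 is a child of CUI1.
            pairs.add((fields[CUI2], fields[CUI1], fields[SAB]))
        elif fields[REL] == "PAR":
            # Assumed reading: CUI2 is a parent of CUI1.
            pairs.add((fields[CUI1], fields[CUI2], fields[SAB]))
        # RN/RB and all other relationship types are skipped.
    return pairs


# Made-up sample rows; the identifiers are not real UMLS content.
sample = [
    "C0000002|A2|AUI|PAR|C0000001|A1|AUI||R1||SRC_X|SRC_X|||N||",
    "C0000001|A1|AUI|CHD|C0000002|A2|AUI||R2||SRC_X|SRC_X|||N||",
    "C0000003|A3|AUI|RB|C0000001|A1|AUI||R3||SRC_X|SRC_X|||N||",
]
pairs = child_of_pairs(sample)
```

The first two sample rows express the same link from both ends, so they collapse into a single child-of pair.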

When partitioning the extent of a refined semantic network into singly rooted cohesive sets, we may encounter a problem if a cycle of child-of relationships exists. Such a cycle may occur in the UMLS due to the integration of hierarchical relationships from various source terminologies that are not necessarily consistent with one another. In [15,16] algorithms are presented for detection and elimination of circular hierarchical relationships in the UMLS. In case we encounter such cycles, we will apply the techniques of [15,16] to eliminate them before the partition into singly rooted cohesive sets.
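Before partitioning, the presence of such a cycle can be detected with a standard depth-first search. The sketch below is a generic check, not the algorithms of [15,16], and it only reports one cycle without removing it; the function name `find_cycle` and the dictionary from concept to parents are assumptions made for illustration.

```python
def find_cycle(parents):
    """Return one child-of cycle as a list of concepts, or None if acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}
    path = []

    def visit(concept):
        color[concept] = GREY  # on the current search path
        path.append(concept)
        for parent in sorted(parents.get(concept, ())):
            state = color.get(parent, WHITE)
            if state == GREY:
                # Reached a concept already on the path: a cycle is closed.
                return path[path.index(parent):] + [parent]
            if state == WHITE:
                found = visit(parent)
                if found:
                    return found
        path.pop()
        color[concept] = BLACK  # fully explored, no cycle through it
        return None

    for concept in sorted(parents):
        if color.get(concept, WHITE) == WHITE:
            found = visit(concept)
            if found:
                return found
    return None


cyclic = find_cycle({"a": {"b"}, "b": {"c"}, "c": {"a"}})
acyclic = find_cycle({"a": {"b"}, "b": {"c"}})
```

The recursive form is adequate for small extents such as the ones audited here; a very deep hierarchy would call for an iterative traversal.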

Another issue resulting from the UMLS being an integrated terminological system of many sources is that the same concept may have different meaning in different sources. Thus when considering a missing child-of relationship, the consideration should be source sensitive for the meaning of both parent and child of this potential missing child-of relationship.

5.4. Auditing for Wrong Hierarchical Relationships

As we saw in Section 3.2, concepts with erroneous semantic type assignments are more likely than other concepts to lack hierarchical relationships. A natural question is: Do such concepts also have higher likelihood of wrong hierarchical relationships? If such a hypothesis could be confirmed, then one could audit for wrong hierarchical relationships by concentrating only on the concepts with corrected ST assignments. This would limit the auditing effort, while correcting a relatively high percentage of wrong hierarchical relationships.

The procedure in [10] for checking ST assignments of concepts consists of two parts. First, we determine algorithmically whether the ST assignment is suspicious. The assignment of a concept c is suspicious if it has a parent concept p with an assignment of a ST Z, such that c is neither assigned Z nor a descendant of Z.
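The algorithmic part of this check can be sketched as follows. This is one reading of the rule stated above, not the code of [10]: a concept is flagged if some ST of some parent is neither one of the concept's own STs nor an ancestor of one of them in the semantic network. The function names, the dictionary `st_parent` (each ST mapped to its parent ST) and the abstract labels Z, Z1 and Y are assumptions made for illustration.

```python
def is_same_or_descendant(st, ancestor, st_parent):
    """True if `st` equals `ancestor` or lies below it in the ST hierarchy."""
    while st is not None:
        if st == ancestor:
            return True
        st = st_parent.get(st)
    return False


def is_suspicious(concept, concept_parents, st_of, st_parent):
    """Flag `concept` if a parent's ST is not covered by the concept's own STs."""
    own = st_of[concept]
    for parent in concept_parents.get(concept, ()):
        for z in st_of[parent]:
            if not any(is_same_or_descendant(t, z, st_parent) for t in own):
                return True
    return False


# Abstract example: Z1 is a child ST of Z, and Y is unrelated to Z.
st_parent = {"Z1": "Z"}
st_of = {"c1": {"Z1"}, "c2": {"Y"}, "p": {"Z"}}
concept_parents = {"c1": {"p"}, "c2": {"p"}}
flag_c1 = is_suspicious("c1", concept_parents, st_of, st_parent)
flag_c2 = is_suspicious("c2", concept_parents, st_of, st_parent)
```

Only the flagged concepts would be passed on to the human auditor for review.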

The human auditor reviews only the suspicious ST assignment of the concept c. We now consider whether the auditor should also review the child-of relationship from c to p in the case where the ST assignment of c is indeed wrong or is missing an extra ST. The motivation for such a review is that a wrong or missing ST assignment for c may hint at a misconception regarding the modeling of c. Such a misconception may also be manifested by a wrong parent for c.

On the other hand, the expectation of a subtype relationship between the STs of c and p is based on the assumption that p is indeed a parent of c. In such a case, there is no motivation for auditing c for a wrong parent. The only case in which it is justified to audit c for a wrong parent, after the ST assignment of c was corrected, is when p is a wrong parent for c and the STs assigned to c and p are not in a subtype relationship. That is, although the expected subtype relationship between the STs assigned to c and p is not justified, its absence nevertheless leads to the correction of the ST of c.

We realize that once there is an error in modeling a concept, reflecting some misconception, it may indicate more errors regarding this concept, even in unexpected ways. Before we recommend that an auditor check the parent for every concept c, the ST assignment of which was corrected by the procedure of [10], we would like to estimate how many wrong hierarchical relationships could be found this way. In an exhaustive review, we found only one example for such a case in EMD. The child-of relationship from Genetically Engineered Mouse (reassigned from EMD to Mammal), originally directed to Organism Modification (assigned Research Activity), was indeed redirected in release 2006AD to Laboratory Animal (also assigned Mammal).

Three such cases were found for EEH. For example, Sewage is currently a child of both Waste Product and Waste Management. According to its definition, “Refuse liquid or waste matter carried off by sewers,” Sewage should only be a child of Waste Product, not of the management of waste. Therefore, the child-of link directed to Waste Management should be removed. For details on all four cases from both EMD and EEH, see Table 3.

Table 3
Cases of wrong child-of relationships

Another place to search for wrong child-of relationships is among suspicious concepts which were not corrected. That is, the STs of c and p are not in a subtype relationship, but the semantic type of c was found to be correct. One may wonder whether p is a wrong parent for c, i.e., there is a wrong child-of relationship. We reviewed all such cases for both the EMD and the EEH extents, but did not find any wrong child-of relationships.

Due to the low success rate of the first technique (only four wrong child-of relationships found) and the complete lack of success of the second technique, we conclude that these are unlikely to be fertile procedures for auditing child-of relationships. More experiments with other STs may change this judgement. More research is needed to find a technique to identify hierarchical relationships which are suspicious and have a high probability of being wrong.

6. Conclusions

We presented a hierarchical auditing paradigm for the UMLS that is based on groups of concepts which have exactly the same correct semantic type assignments. The uniform groups of concepts are further partitioned into cohesive sets. In a cohesive set, one special concept, the root, is reachable from every other concept by a chain of child-of links. The root itself does not have any child-of links to other concepts within the same extent. We have developed a recursive methodology which allows a human expert, with the support of an algorithm, to combine pairs of cohesive sets into a smaller number of cohesive sets by inserting missing child-of links. The resulting structure will be a tree or a Directed Acyclic Graph (DAG). It is not always possible to combine all concepts of a group into a singly rooted DAG. However, in the paper, it was shown how to modify our methodologies for a multi-rooted, connected set. Two hypotheses were formulated to express the efficiency of our technique. Our methodologies were demonstrated with the extents of the two semantic types Experimental Model of Disease and Environmental Effect of Humans. We found that 21 hierarchical relationships were missing in the pure ST Experimental Model of Disease extent and 13 in the intersection ST Experimental Model of Disease ∩ Neoplastic Process extent. Six missing hierarchical relationships were identified in the pure ST Environmental Effect of Humans extent. All those missing hierarchical relationships were found from the roots of the small cohesive sets. Missing hierarchical relationships occurred for 84% of the concepts with ST mis-assignments, but only for 23% of the concepts with correct ST assignments. Thus the hypotheses were supported by the results with the two semantic types, Experimental Model of Disease and Environmental Effect of Humans.


This work was partially supported by the United States National Library of Medicine under grant R 01 LM008445-01A2.



¹ The use of this version is due to the use of the results of [10] in this paper, which used that version.


[1] Humphreys BL, Lindberg DAB, Schoolman HM, Barnett GO. The Unified Medical Language System: An informatics research collaboration. JAMIA. 1998;5(1):1–11. [PMC free article] [PubMed]
[2] Schuyler PL, Hole WT, Tuttle MS, Sherertz DD. The UMLS Metathesaurus: Representing different views of biomedical concepts. Bull Med Libr Assoc. 1993;81(2):217–222. [PMC free article] [PubMed]
[3] U. S. Dept. of Health and Human Services, National Institutes of Health, National Library of Medicine Unified Medical Language System (UMLS) 2008.
[4] Cimino JJ. Auditing the unified medical language system with semantic methods. JAMIA. 1998;5:41–51. [PMC free article] [PubMed]
[5] Cimino JJ, Min H, Perl Y. Consistency across the hierarchies of the UMLS Semantic Network and Metathesaurus. JBI. 2003;36(6):450–461. [PubMed]
[6] Chen Y, Perl Y, Geller J, Cimino JJ. Analysis of a study of the users, uses and future agenda of the UMLS. JAMIA. 2007;14(2):221–231. [PMC free article] [PubMed]
[7] McCray AT, Hole WT. The scope and structure of the first version of the UMLS Semantic Network. Proc. Fourteenth Annual SCAMC; Los Alamitos, CA. Nov, 1990. pp. 126–130.
[8] McCray AT. An Upper-Level Ontology for the Biomedical Domain. Comp Func Genom. 2003;4:80–84. [PMC free article] [PubMed]
[9] McCray AT, Nelson SJ. The representation of meaning in the UMLS. Methods of Information in Medicine. 1995;34:193–201. [PubMed]
[10] Chen Y, Gu H, Perl Y, Geller J, Halper M. Group auditing of a semantic type's extent. accepted. [PubMed]
[11] Geller J, Gu H, Perl Y, Halper M. Semantic refinement and error correction in large terminological knowledge bases. Data and Knowledge Engineering. 2003;45(1):1–32.
[12] Gu H, Perl Y, Geller J, Halper M, Liu LM, Cimino JJ. Representing the UMLS as an object-oriented database: Modeling issues and advantages. JAMIA. 2000 Jan-Feb;7(1):66–80.
[13] Cimino JJ. Battling Scylla and Charybdis: the search for redundancy and ambiguity in the 2001 UMLS metathesaurus. In: Overhage JM, editor. Proc. 2001 AMIA Annual Symposium. 2001. pp. 120–124.
[14] Bodenreider O. Strength in numbers: Exploring redundancy in hierarchical relations across biomedical terminologies. Proc. 2003 AMIA Annual Symposium. 2003. pp. 101–105.
[15] Bodenreider O. Circular hierarchical relationships in the UMLS: Etiology, diagnosis, treatment, complications and prevention. Proc. AMIA Symp. 2001. pp. 57–61.
[16] Mougin F, Bodenreider O. Approaches to eliminating cycles in the UMLS metathesaurus: Naive vs. formal. Proc. 2005 AMIA Annual Symposium. 2005. pp. 550–554.
[17] Peng Y, Halper M, Perl Y, Geller J. Auditing the UMLS for redundant classifications. Proc. 2002 AMIA Annual Symposium; San Antonio, TX. November 2002. pp. 612–616.
[18] Hole WT, Srinivasan S. Discovering missed synonymy in a large concept-oriented metathesaurus. In: Overhage JM, editor. Proc. 2000 AMIA Annual Symposium; Los Angeles, CA. November 2000. pp. 354–358.
[19] Schulze-Kremer S, Smith B, Kumar A. Revising the UMLS Semantic Network. Proc. Medinfo 2004; San Francisco, CA. September 2004. p. 1700.
[20] Zhang L, Perl Y, Geller J, Halper M, Cimino JJ. An enriched UMLS Semantic Network with a multiple inheritance hierarchy. JAMIA. 2004;11(3):195–206.
[21] Zhang L, Halper M, Perl Y, Geller J, Cimino JJ. Relationship structures and semantic type assignments of the UMLS enriched semantic network. JAMIA. 2005 July;12(6):657–666.
[22] Min H, Perl Y, Chen Y, Halper M, Geller J, Wang Y. Auditing as part of the terminology design life cycle. JAMIA. 2006 November/December;13(6):676–690.
[23] Ceusters W, Smith B, Kumar A, Dhaen C. Mistakes in medical ontologies: Where do they come from and how can they be detected? In: Pisanelli DM, editor. Ontologies in Medicine: Proc. Workshop on Medical Ontologies; Rome. October 2003. pp. 145–164.
[24] Ceusters W, Smith B, Kumar A, Dhaen C. Ontology-based error detection in SNOMED-CT. In: Fieschi M, Coiera E, Li Y-C, editors. Proc. Medinfo 2004; San Francisco, CA. September 2004. pp. 482–486.
[25] Ceusters W, Smith B, Goldberg L. A terminological and ontological analysis of the NCI Thesaurus. Methods of Information in Medicine. 2005;44:498–507.
[26] Ceusters W, Spackman KA, Smith B. Would SNOMED-CT benefit from realism-based ontology evolution? In: Teich JM, Suermondt J, Hripcsak G, editors. Proc. 2007 AMIA Annual Symposium; Chicago, IL. November 2007. pp. 105–109.
[27] Wang Y, Halper M, Min H, Perl Y, Chen Y, Spackman KA. Structural methodologies for auditing SNOMED. JBI. 2007 October;40(5):561–581.
[28] Halper M, Wang Y, Min H, Chen Y, Hripcsak G, Perl Y, Spackman KA. Analysis of error concentrations in SNOMED. In: Teich JM, Suermondt J, Hripcsak G, editors. Proc. 2007 AMIA Annual Symposium; Chicago, IL. November 2007. pp. 314–318.
[29] Bodenreider O, Smith B, Kumar A, Burgun A. Investigating subsumption in DL-based terminologies: A case study in SNOMED CT. In: Hahn U, Schulz S, Cornet R, editors. Proc. First Int'l Workshop on Formal Biomedical Knowledge Representation (KR-MED 2004); Whistler, Canada. 2004. pp. 12–20.
[30] Baader F, Calvanese D, McGuinness DL, Nardi D, Patel-Schneider PF, editors. The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press; 2003.
[31] Schlobach S, Huang Z, Cornet R, Van Harmelen F. Debugging incoherent terminologies. Journal of Automated Reasoning. 2007;39:317–349.
[32] Cornet R, Abu-Hanna A. Auditing description-logic-based medical terminological systems by detecting equivalent concept definitions. Int J Med Inform. 2008;77(5):336–345.
[33] De Keizer NF, Abu-Hanna A, Cornet R, Zwersloot-Schonk JH, Stoutenbeek CP. Analysis and design of an ontology for intensive care diagnoses. Methods of Information in Medicine. 1999 June;38(2):102–112.
[34] Köhler J, Munn K, Rüegg A, Skusa A, Smith B. Quality control for terms and definitions in ontologies and taxonomies. BMC Bioinformatics. 2006;7:212.
[35] Kumar A, Smith B. The Unified Medical Language System and the Gene Ontology: Some critical reflections. In: Günter A, Kruse R, Neumann B, editors. KI 2003, Advances in Artificial Intelligence. 2003. pp. 135–148. Lecture Notes in Artificial Intelligence 2821, Springer.
[36] Smith B, Williams J, Schulze-Kremer S. The ontology of the Gene Ontology. In: Musen MA, editor. Proc. 2003 AMIA Annual Symposium; Washington, DC. November 2003. pp. 609–613.
[37] Smith B, Köhler J, Kumar A. On the application of formal principles to life science data: A case study in the Gene Ontology. Proc. DILS 2004 (Data Integration in the Life Sciences); 2004. pp. 79–94. Lecture Notes in Bioinformatics 2994, Springer.
[38] Kumar A, Smith B, Borgelt C. Dependence relationships between Gene Ontology terms based on TIGR Gene Product Annotations. Proc. Third International Workshop on Computational Terminology. 2004. pp. 31–38.
[39] Brachman RJ, Schmolze JG. An overview of the KL-ONE knowledge representation system. Cognitive Science. 1985;9
[40] Schmolze JG, Lipkis TA. Classification in the KL-ONE knowledge representation system. IJCAI. 1983;1:330–331.
[41] Rundensteiner EA. A classification algorithm for supporting object-oriented views. Proc. Third International Conference on Information and Knowledge Management; Gaithersburg, Maryland. 1994. pp. 18–25.
[42] Kaczmarek TS, Bates R, Robins G. Recent developments in NIKL. Proc. AAAI-86 Proceedings. 1986. pp. 978–985.
[43] Levesque HJ, Brachman R. A Fundamental Tradeoff in Knowledge Representation and Reasoning. Morgan Kaufmann Publishers; Los Altos, CA: 1985.
[44] Patel-Schneider PF. Adding number restrictions to a four-valued terminological logic. AAAI. 1988:485–490.
[45] Peltason C, Nebel B, Luck KV, editors. Proc. of the International Workshop on Terminological Logics; 1991. DFKI-D-91-13.
[47] Nardi D, Brachman RJ. The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press; 2003. chapter 1 An Introduction to Description Logics.
[48] Lomax J, McCray AT. Mapping the Gene Ontology into the Unified Medical Language System. Comparative and Functional Genomics. 2004;5(5):345–361.
[49] Medical Subject Headings.
[50] National Cancer Institute Thesaurus.