PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
 
Stud Health Technol Inform. Author manuscript; available in PMC 2010 June 1.
Published in final edited form as:
PMCID: PMC2879046
NIHMSID: NIHMS202315

A Tool for Abstracting Relevant Classes of Concepts: the Common Ancestry Summarizer

Abstract

Controlled Medical Terminologies (CMTs) are indispensable tools for present medical information systems and medical informatics researches. Concept-oriented architecture and multiple-hierarchy have been accepted as two of the desiderata of contemporary CMTs. A common problem for informaticians working with class-based methods for terminologies is to find the Most Relevant Common Ancestor (MRCA) for a group of concepts, the common ancestor which has the closest semantic distance to all the given concepts. Finding an ancestor concept to summarize a group of concepts is required to map concepts from the knowledgebase to those of appropriate granularity in CMTs. However, manually exploring the hierarchical relationships and determining the MRCA are daunting tasks due to the massive size, the multiple hierarchies and in some case the cycles of the hierarchical networks of terminologies. In our study, we developed a web-based visualization tool, the Common Ancestry Summarizer (CAS), that graphically displays the hierarchical relations of multiple concepts within CMTs, and also assist users to identify the MRCA of these concepts by identifying the Lowest Common Ancestors (LCAs). The CAS has been shown useful in organizing classes for the controlled terms used in clinical guidelines (e.g. “the bottom-up method”).

Keywords: Controlled medical terminologies, Semantic network, Multiple hierarchies, Common ancestors

Introduction

Controlled medical terminologies (CMTs) are indispensable tools for present medical information systems and medical informatics researches. Concept-oriented architectures and multiple-hierarchies have been accepted as two of the desiderata of contemporary CMTs[1, 2]. The Unified Medical Language System (UMLS®)[3-5], the Systematized Nomenclature of Medicine Reference Terminology (SNOMED RT®)[6] and the Medical Entities Dictionary (MED)[7, 8] are typical CMTs embodying such desiderata. The concepts and their hierarchical and non-hierarchical relations within such CMTs form complex semantic networks. A common problem for informaticians working with class-based methods for terminologies is to find a higher-level concept as a class for summarizing a group of lower-level concepts so that concepts in knowledge bases, such as rules in rule-based systems, can be mapped to concepts of appropriate granularity in CMTs. Elucidating the the Most Relevant Common Ancestor (MRCA), a common ancestor concept which has the closest semantic distances to a group of given concepts, is required for class-based methods. . However, manually selecting the MRCA is daunting due to the massive size, the multiple hierarchies and in some case the cycles (e.g. the UMLS) in the networks.

In this study, we developed an efficient visualization tool that graphically displays the hierarchical relations of multiple concepts within CMTs, and also provides suggestions to help users identify the MRCA of these concepts.

Background

Difficulties

The difficulties of manually finding hierarchical relationships and the MRCA of multiple concepts reside in the following facts of modern hierarchical CMTs.

First, the larger number of concepts contained in contemporary CMTs and the multiple-hierarchical architecture make it difficult to traverse the hierarchical relations of concepts. For instance, the metathesaurus of the UMLS® 2003AA contains 875,255 distinct concepts. There are 10,553,299 relations and 765,223 Parent-Child relations among these concepts. Even the MED, a much smaller medical terminology mainly used within a local institution (Columbia Presbyterian Medical Center, CPMC), consists of more than 75,000 concepts presently and it is still expanding. Moreover, multiple-hierarchy allows each concept to have more than one parent, which substantially increases the number of ancestors to be manually verified. For example, identifying the hierarchical relationships of a medical concept, Aspirin, within the metathethaurus of the UMLS® 2003 AA requires users to look up its 152 ancestors. In the case of the MED, the number of Aspirin’s ancestors is significantly smaller (10 ancestors). If we want to detect the hierarchical relationships of N concepts, the work-load could be multiplied N times in the worst case.

Furthermore, multiple-hierarchy complicates the task of identifying the MRCA of multiple concepts. In the case of single hierarchy architecture, each concept in a tree structure has no more than one parent. We can obtain the MRCA of a list of concepts found in a tree by finding their single lowest common ancestor (LCA), as illustrated in Figure 1 (a). In a tree structure, the LCA (e.g. Concept A1) is inevitably the MRCA of a list of concepts (e.g. C1, C2 and C3). In contrast, in a semantic network which allows multiple-hierarch, the LCA is not necessarily the MRCA. As shown in the Directed Acyclic Graph (DAG) of Figure 1 (b), it still has to be determined which LCA out of A1 and A2 has the shortest overall semantic distance to C1, C2 and C3.

Figure 1
Multiple hierarchy architecture in CMTs. C1, C2 and C3: given concepts; A1, A2 and A3: common ancestors of C1, C2 and C3.

Cycles are found in some CMTs (e.g. the UMLS) that create complicated ancestor-descendant relationships increasing the difficulty of finding an MRCA. For example, if the common ancestors of a group of concepts form cycles as illustrated in Figure 1 (c), we have no way to tell which ancestor out of A1, A2 and A3 is the MRCA. Methods have been published to simplify the UMLS to a DAG[9].

Related tools

A number of visualization tools for exploring medical terminologies have been published [10-15], but none of them was designed to prune the display of hierarchical relationships to illustrate the MRCA of multiple concepts. These visualization tools display an excessive number of irrelevant concepts and do not solve the MRCA problem.

SemNav is a graphic navigation tool for the UMLS® based on the graph visualization package GraphViz[10-12]. The Sem-Nav can take only one single concept as focus concept at a time. Users cannot view the hierarchical relationships of multiple concepts within one screen. In addition, it limits the number of concepts displayed on the screen. To construct a complete hierarchy structure, users have to click and follow each most distal ancestor displayed.

In the editing tool of the MED, MUMPS MED Editor, the hierarchy structure of the MED can be displayed in an indented format in the upper frame of the window. However, it only displays the hierarchical architecture from the concept of interest to the root term, “Medical Entity”. It does not provide a view of multiple concept relationships. Moreover, it is hard for this indented format to represent multiple hierarchy architecture. The capability of displaying hierarchy relationships was improved considerably in another MED browser, AccessMED[13]. Unfortunately, AccessMED does not allow focusing on relationships among multiple concepts.

We also investigated a general-purposed ontology editor Protégé 2000[14]. The Diagram Widget and Graph Widget are two built-in graphic functions of Protégé 2000. However, they are limited in their abilities to draw the hierarchy relationships automatically. OntoViz is a plug-in designed for Protégé 2000[15]. For the first time users are allowed to pick up multiple focus concepts. Users can also decide whether to show the subclass or superclass closure of focus concepts. Nonetheless, it does not show common ancestors of input concepts and it does not integrate any algorithms to calculate the MRCA either.

Related algorithms

The problem of evaluating semantic relatedness within semantic networks has existed for a long history in artificial intelligence and cognitive science. It can be dated back to the spreading activation approach of Quillian in 1968[16]. This method was also known as intersection search or marker passing in literature. Exploring hierarchical relationships in semantic network is a special case of semantic relatedness where only IS-A links are considered. Therefore, it is feasible to explore hierarchical relationships of concepts in CMTs using this method. Nevertheless, further methods are still required to determine the MRCA for specified concepts.

Related to the problem of finding the MRCA, several algorithms of determining concept semantic distance or semantic similarity (opposite to distance) have been proposed, such as edge-counting[17, 18], Resnik’s method[19], Lin’s method[20] and Jiang-Conrath’s method[21]. Because MRCA is defined as the ancestor concept within the hierarchical relationships that can connect all the given concepts in the shortest semantic distance, these algorithms provide potential ways of determining the MRCA.

In our tool, we use spreading activation [16] as the basic algorithm to find the common ancestors of the concepts in CMTs. Because the MCRA is one of the LCAs in a DAG, we mark the LCAs to narrow down the searching scope of the MCRA for users in the current version of our tool. It is up to the users to determine the MCRA out of the marked LCAs according to their understanding of the semantic relatedness to the given concepts. We also provided typical implementation architecture of these algorithms in a possible knowledge engineering environment.

Methods

Finding common ancestors

First, we search all the common ancestors of a number of given concepts by spreading activation. Because we explore only hierarchical relations, other types of links among concepts are not considered.

The steps are listed below:

  1. Follow the Children-Parents relations of one concept recursively to traverse all the ancestors of the concept.
  2. Mark its ancestor nodes as common ancestors.
  3. For each one of other concepts, follow its Children-Parents relations recursively to traverse all of its ancestors.
  4. In step 3, update the common ancestors by marking the concept’s ancestors which are in common with the previously marked common ancestors.
  5. In the end, return the marked common ancestors.

Identifying the LCAs

After obtaining the common ancestors of specified concepts, we mark those common ancestors which don’t have any other common ancestors as their children in a different color as the LCAs.

The steps are listed below:

  1. Recursively follow the Children-Parents relations of one of the common ancestors.
  2. Mark all the ancestors of that common ancestor within the list of common ancestors obtained previously.
  3. Repeat step 1 and 2 using remained common ancestors.
  4. Return the common ancestors which are left unmarked as the LCAs.

Implementation

We named the software system “Common Ancestry Summarizer” (CAS). To demonstrate the usefulness of our method we integrated our algorithm with a terminology browser and a clinical guideline browser, which are two common tools in the working environment of knowledge engineers working on clinical guideline encoding. Thus the CAS consists of three components: 1) The terminological browser, 2) the Guideline and MLM browser, and 3) the Ancestry Summarizer.

We implemented the previously described algorithms over the semantic network of two CMTs (MED, UMLS). Because the UMLS is not a DAG, we developed a metric that unveils and avoids cycles whenever they were found during the iterating processes. Related methods that remove cycles in the UMLS are discussed in[9].

In the terminology browser, users can search the MED and the UMLS by the code of a term, a key word, and parents, children, ancestors or descendants of a term. Users can also input any terms in batch from a text file and output the displayed graph in XML format.

The second component, clinical guideline browser, allows users to select one guideline from a list to see its content. In the current version, we display 245 Medical Logical Modules (MLMs) from the Columbia Presbyterian Medical Center (containing MED codes) and 17 MLMs from the Jersey Shore University Medical Center (containing UMLS codes). The terminology codes in a MLM loaded in the browser are high-lighted. Users can select any of the terminology codes and seek their hierarchical relationships and LCAs by pressing the “Ancestry Summarize” button.

The third Component, the Ancestry Summarizer, generates the summary of MRCAs in a separated window (Figure 2). Figure 2 shows an example output of the hierarchical relationships of the following list of concepts: SERUM CHOLESTEROL MEASUREMENT2, SERUM CHOLESTEROL MEASUREMENT, PLASMA CHOLESTEROL MEASUREMENT and SERUM CHOLESTEROL TEST (the four nodes at the bottom). The common ancestors (originally in red) of the four concepts are designated by (CA) in Figure 2. The two LCA concepts (originally in pink) are designated by (LCA) in the figure. Then, users select the MRCA from the marked LCAs according their knowledge of the concepts. In this case, the user will probably select the concept INTRAVASCULAR CHOLESTEROL TESTS as the MRCA because it is more specific than the concept CPMC SINGLE-RESULT LABORATORY TEST although the later seems closer to the four given concepts in the hierarchical linkages.

Figure 2
An example output showing the common ancestors (CA) and the lowest common ancestors (LCA) of four concepts (C) in the MED.

We implemented the user interface using Java Applets due to the requirement of graphic display and easy sharing through intranet access. Figure 3 shows a screen shot of the clinical guideline browsers.

Figure 3
Screen shot of the clinical guideline browser of the CAS

Evaluation

This tool has been put into practical use for one year to assist knowledge engineers revising the MLMs of the CPMC’s clinical event monitor. Current CPMC’s MLMs can contain from 1 to 125 MED codes in their queries. We intend to simplify those enumerations with simpler class-based concepts. The CAS system was used for such a task by six knowledge engineers (KEs) that also compare its use to other tools, such as the web-based MED browser and the AccessMED. The objective was to find the MRCAs of the MED codes in the database queries embedded in approximately 30 distinct MLMs. Over 600 distinct MED codes were investigated using this tool and they were subsequently replaced by roughly 45 class-based queries in the 30 MLMs. According to the qualitative feedback of the KEs, the time spent on finding the MRCA using current tools varied from ten minutes (e.g. using AccessMED) to more than an hour (using the web-based MED browse). The two KEs who responded to a questionnaire unanimously agreed that the same task requires less than one minute using the CAS system. None of the users has reported logic errors in the tool.

Discussion

There are generally two strategies for encoding concepts using terminology codes: top-down methods and bottom-up methods. In top-down methods, users browse a terminology from the topmost term according to their understanding of the concepts. However, users’ different understandings of the hierarchical architectures make this strategy fallible. Users may end up omitting the appropriate concepts that is in a branch they did not visualize. In bottom-up methods, users search for all the concepts related to a key word and then select the most exact term for the concept from the list of retrieved terms according to their hierarchical relationships. Our tool acts as an assistant for the bottom-up method by providing a graphic view of the hierarchical relationships and their LCAs. Because the MRCA is within a few LCAs, users are required to pick the MRCA from a pruned list of LCAs.

We also realize several limitations in our work. First, we did not provide algorithms to calculate the MRCA directly. Instead, we suggest the LCAs as the candidates for the MRCA. It is the user’s responsibility to choose the correct MRCA from those LCAs for the given concepts. It will be advantageous if our program is able to present the MRCA automatically. Second, our algorithm cannot guarantee a correct result if CMTs contain errors and inconsistencies in themselves, such as the existence of cycles in the UMLS. Last, the architecture of our implement is not mobile enough to be integrated in other terminology tools or guideline editing tools.

Future work

Because the calculation of MRCA need the evaluation of semantic distance and the algorithms of evaluating semantic distance have existed, we will investigate the possibility of using those algorithms in deciding the MRCA.

We are also considering encapsulating our tool as a portable component for other terminology and guideline editing tools (e.g. Protégé 2000). The support for more CMTs, such as SNOMED CT, will be tested by using filters over the UMLS server.

Conclusion

The “spreading activation” method [16] can be used as the basis of an efficient algorithm to explore CMTs and to find common ancestors of given concepts. LCAs of a group of concepts are good candidates of their MRCA. A graphic tool implementing such functions provides assistance to the task of exploring hierarchical relationships of CMTs through their common ancestors and determining the MRCA in the process of clinical guideline encoding. The practical use of the CAS demonstrated that finding MRCAs is critical for the development and maintenance of class-based knowledge in rule-based decision support systems. This tool has been shown to increase significantly the productivity of knowledge engineering on our health monitor.

Acknowledgments

The authors thank Jeeyae Choi, Pallav Sharda, Jianbo Lei and Sunmoo Yoon for their comments and evaluations. This work was partially conducted under the 1DIB TM 00043-01 contract from the Office of Advanced Telemedicine (OAT) of the Health Resources and Services Administration (HRSA), the grant # 5-67674 of the Center for Advanced technology of Columbia University (NYSTAR/CAT) and the 528753/PO P417322 contract with Virginia Commonwealth University’s Medical Informatics and Technology Applications Consortium, a National Aeronautics and Space Administration (NASA) Commercial Space Center.

References

[1] Cimino JJ. Desiderata for controlled medical vocabularies in the twenty-first century. Methods Inf Med. 1998;37(4-5):394–403. [PMC free article] [PubMed]
[2] Chute CG, Cohn SP, Campbell JR. A Framework for Comprehensive Health Terminology Systems in the United States: Development Guidelines, Criteria for Selection, and Public Policy Implications. J Am Med Inform Assoc. 1998;5(6):503–510. [PMC free article] [PubMed]
[3] Humphreys BL, Lindberg DAB. Building the Unified Medical Language System; Proceedings: The Thirteenth Annual Symposium on Computer Applications in Medical Care; 1989; 1989. Cat. No.89TH0286-5.
[4] Lindberg DAB, Humphreys BL, McCray AT. The Unified Medical Language System. Methods of Information in Medicine. 1993;32 [PubMed]
[5] Campbell KE, Oliver DE, Shortliffe EH. The Unified Medical Language System: Toward a Collaborative Approach for Solving Terminologic Problems. J Am Med Inform Assoc. 1998;5(1):12–16. [PMC free article] [PubMed]
[6] Spackman KA, Campbell KE, Cote RA. SNOMED RT: a reference terminology for health care. Proc AMIA Annu Fall Symp. 1997:640–4. [PMC free article] [PubMed]
[7] Cimino J, Clayton P, Hripcsak G, Johnson S. Knowledge-based approaches to the maintenance of a large controlled medical terminology. J Am Med Inform Assoc. 1994;1(1):35–50. [PMC free article] [PubMed]
[8] Cimino JJ. From Data to Knowledge through Concept-oriented Terminologies: Experience with the Medical Entities Dictionary. J Am Med Inform Assoc. 2000;7(3):288–297. [PMC free article] [PubMed]
[9] Bodenreider O. Circular hierarchical relationships in the UMLS: etiology, diagnosis, treatment, complications and prevention. Proc AMIA Symp. 2001:57–61. [PMC free article] [PubMed]
[10] Bodenreider O. GenNav: Visualizing Gene Ontology as a graph; Proceedings of the AMIA 2002 Annual Symposium; 2002.2002. p. 978.
[11] Bodenreider O. Experiences in visualizing and navigating biomedical ontologies and knowledge bases; Proceeding of the ISMB’2002 SIG meeting “Bio-ontologies”; 2002.2002. pp. 29–32.
[12] Bodenreider O. A semantic navigation tool for the umls; Proceeding of AMIA Annual Symposium; 2000.2000. p. 971.
[13] Allen BA BR, Desai N. In: JJ C, editor. AccessMED, a tool for browsing, searching and maintaining a large controlled medical vocabulary; Proceedings of the 1996 AMIA Fall; 1996.1996. p. 940.
[14] Musen MAF RW, Grosso WE, Noy NF, Crubezy M, Gennari JH. In: Shi ZF B, Musen M, editors. Component-based support for building knowledge-acquisition systems; 16th World Computer Congress 2000. Proceedings of Conference on Intelligent Information Processing; 2000.2000. pp. 18–22.
[16] Quillian MR. Semantic memory. In: Minsky ML, editor. Semantic information processing. MIT Press; Cambridge, MA: 1968.
[17] Rada R, Mili H, Bicknell E, Blettner M. Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man and Cybernetics. 1989;19(1):17–30.
[18] Leacock C, Chodorow M. Combining Local Context and WordNet Similarity for Word Sense Identification. In: Fellbaum C, editor. WordNet: An Electronic Lexical Database. Mass MIT Press; Cambridge: 1999. pp. 265–283.
[19] Resnik P. In: Mellish CS, editor. Using information content to evaluate semantic similarity in a taxonomy; IJCAI-95. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence; 1995; Morgan Kaufmann Publishers: 1995. pp. 448–52.
[20] Jiang JJ, Conrath DW. semantic similarity based on corpus statistics and lexical taxonomy; Proceedings of International Conference on Research in Computational Linguistics; Taiwan. 1997.1997.
[21] Lin D. In: Shavlik J, editor. An information-theoretic definition of similarity; Machine Learning. Proceedings of the Fifteenth International Conference (ICML’98); Madison, WI, USA. 1998.1998. pp. 296–304.