AMIA Annu Symp Proc. 2012; 2012: 681–689.
Published online 2012 November 3.
PMCID: PMC3540580

Deriving an Abstraction Network to Support Quality Assurance in OCRe

Christopher Ochs, MS,1 Ankur Agrawal, BE,1 Yehoshua Perl, PhD,1 Michael Halper, PhD,1 Samson W. Tu, MS,3 Simona Carini, MA,2 Ida Sim, MD, PhD,2 Natasha Noy, PhD,3 Mark Musen, MD, PhD,3 and James Geller, PhD1

Abstract

An abstraction network is an auxiliary network of nodes and links that provides a compact, high-level view of an ontology. Such a view lends support to ontology orientation, comprehension, and quality-assurance efforts. A methodology is presented for deriving a kind of abstraction network, called a partial-area taxonomy, for the Ontology of Clinical Research (OCRe). OCRe was selected as a representative of ontologies implemented using the Web Ontology Language (OWL) based on shared domains. The derivation of the partial-area taxonomy for the Entity hierarchy of OCRe is described. Utilizing the visualization of the content and structure of the hierarchy provided by the taxonomy, the Entity hierarchy is audited, and several errors and inconsistencies in OCRe’s modeling of its domain are exposed. After appropriate corrections are made to OCRe, a new partial-area taxonomy is derived. The generalizability of the paradigm of the derivation methodology to various families of biomedical ontologies is discussed.

Introduction

Biomedical ontologies support communication and integration among healthcare information systems. These terminological knowledge bases are typically large and complex, making them difficult to visualize and comprehend. As a result, errors and inconsistencies in their content are nearly unavoidable and are often difficult to detect. Visualization tools like FlexViz [1] help in this regard by providing a view of one concept’s contextual neighborhood, including its parents, children, and target concepts of its lateral relationships. Such visualization tools provide micro-level contextual knowledge for concepts, but are not helpful in gaining an understanding of the whole ontology, i.e., getting an idea of its “gestalt.” To accomplish that task, other tools are required.

One example of such a tool is an abstraction network, a relatively compact, auxiliary collection of nodes and links that summarizes the structure and content of an ontology. Each node in an abstraction network—derived from the ontology itself—represents a whole subset of classes, unified by a similar structure and/or semantics. The nodes are organized hierarchically in a manner derived from the hierarchical relationships in the original ontology.

While abstraction networks have previously been constructed for various terminologies and terminological systems [2], idiosyncrasies in the underlying knowledge models can limit the overall applicability of the methodologies. There is a need to formulate a unified abstraction methodology that can work with an entire family of ontologies exhibiting similar knowledge properties. In this paper, we begin the work toward that goal. In fact, fruitful ground for this work can be found in the extensive collection of important ontologies hosted in the National Center for Biomedical Ontology (NCBO) BioPortal repository [3, 4]. In particular, we present an abstraction methodology for the Ontology of Clinical Research (OCRe), as a representative description-logic-based ontology, modeled using OWL. The abstraction network derived for OCRe is called a partial-area taxonomy. Its usefulness in auditing OCRe and guiding selective remodeling is demonstrated. Indeed, the results of this research have already been incorporated into a newer release of OCRe. The general applicability of the methodology with respect to other ontologies is discussed.

Background

In previous work, two abstraction networks, the area taxonomy and the partial-area taxonomy, have been formulated and successfully employed in quality-assurance efforts for SNOMED CT hierarchies [5]. The area taxonomy is derived based on the relationship patterns exhibited by the hierarchy’s concepts, which are partitioned accordingly into groups called areas. The partial-area taxonomy goes further by systematically dividing areas into smaller groups of concepts, called partial-areas, based on relationship patterns and hierarchical proximity. The methodology presented in this paper significantly adapts some of the techniques to OCRe (and related ontologies).

The Ontology of Clinical Research (OCRe) provides classes and relationships to characterize the different types of human studies in a uniform way [6]. It was developed in the Web Ontology Language (OWL) 1.1 [6] using Protégé 4.1 [6], and focuses on annotating human studies according to their design and analysis. OCRe is organized as a set of modular components with the core modules being study protocol, study design, and statistics. OCRe includes significant information on human investigations, with its Revision 258 consisting of 342 unique classes and 192 different kinds of relationships [7]. In this research, we have accessed OCRe through the National Center for Biomedical Ontology (NCBO) BioPortal, which provides access to a collection of over 300 ontologies with a uniform framework [2].

OWL [8] is based on description logic (DL) and defines ontologies in an abstract syntax. Ontology authoring tools, such as Protégé [9, 10], help create ontologies in OWL format. Within NCBO’s BioPortal [4, 11], many ontologies, including OCRe, are released in this format [12–14]. Overall, an OWL ontology consists of a collection of classes that may be related to one another through a class hierarchy. All classes are subclasses of owl:Thing, which is the root class, and all classes are subclassed by owl:Nothing, which is the empty class. An important knowledge element used in this paper is the object property, which defines a directed binary relationship between two classes, allowing for their respective instances to be related. As an example from [15], consider this definition of the object property madeFromGrape:

<owl:ObjectProperty rdf:ID="madeFromGrape">
  <rdfs:domain rdf:resource="#Wine"/>
  <rdfs:range rdf:resource="#WineGrape"/>
</owl:ObjectProperty>

This states that madeFromGrape has the domain (class) Wine and the range (class) WineGrape. Therefore, for example, LindemansBin65Chardonnay (a type of wine) can be related to ChardonnayGrape (a type of grape) using the object property madeFromGrape.
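To make the domain and range reading concrete, the following is a minimal sketch that extracts both from the definition above using only Python's standard library. The wrapping rdf:RDF element and the helper name read_object_properties are our additions for illustration; a real OWL file would normally be read with a dedicated OWL or RDF library.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"

# The definition from the text, wrapped in rdf:RDF (our addition) so that
# the namespace prefixes are declared and the fragment parses on its own.
DOC = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}" xmlns:owl="{OWL}">
  <owl:ObjectProperty rdf:ID="madeFromGrape">
    <rdfs:domain rdf:resource="#Wine"/>
    <rdfs:range rdf:resource="#WineGrape"/>
  </owl:ObjectProperty>
</rdf:RDF>"""


def read_object_properties(xml_text):
    """Return {property name: (domain class, range class)}."""
    root = ET.fromstring(xml_text)
    result = {}
    for prop in root.findall(f"{{{OWL}}}ObjectProperty"):
        name = prop.get(f"{{{RDF}}}ID")
        domain = prop.find(f"{{{RDFS}}}domain").get(f"{{{RDF}}}resource")
        range_ = prop.find(f"{{{RDFS}}}range").get(f"{{{RDF}}}resource")
        # Drop the leading "#" of the local resource references.
        result[name] = (domain.lstrip("#"), range_.lstrip("#"))
    return result


properties = read_object_properties(DOC)
print(properties)
```

The sketch assumes that every object property carries exactly one asserted domain and one asserted range, which holds for this snippet but not for OWL ontologies in general.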

OCRe is released in the asserted OWL format. An ontology released in the asserted format consists of only the knowledge explicitly defined by the editor(s) of the ontology. This stands in contrast to the inferred view of an ontology, which is obtained by running a reasoner on the asserted view. More precisely, classifiers take the facts from the asserted view of an ontology and infer new knowledge to obtain the inferred view of the same ontology. The inferred view may contain new hierarchical relationships, different domains for object properties, and other knowledge that was not explicitly expressed in the asserted view.
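One way a reasoner can add hierarchical relationships is through property domains: a class that is restricted on a property must lie within that property's domain. The sketch below imitates only this single rule plus transitivity on toy data; it is not a description-logic reasoner and is not how Pellet is implemented. The class HouseBlend and the function name inferred_subclasses are hypothetical.

```python
def inferred_subclasses(asserted, domains, restrictions):
    """Return asserted plus inferred (subclass, superclass) pairs.

    asserted:     set of (subclass, superclass) pairs stated by the editor
    domains:      {property: asserted domain class}
    restrictions: set of (class, property) pairs, meaning the class is
                  defined with a restriction on that property
    """
    inferred = set(asserted)
    # Domain rule: a class restricted on p must lie inside the domain of p.
    for cls, prop in restrictions:
        domain = domains.get(prop)
        if domain is not None and domain != cls:
            inferred.add((cls, domain))
    # Transitive closure of the subclass relation.
    changed = True
    while changed:
        changed = False
        for sub, mid in list(inferred):
            for mid2, sup in list(inferred):
                if mid == mid2 and sub != sup and (sub, sup) not in inferred:
                    inferred.add((sub, sup))
                    changed = True
    return inferred


# Toy data; HouseBlend is a made-up class used only for illustration.
asserted = {("Wine", "PotableLiquid")}
domains = {"madeFromGrape": "Wine"}
restrictions = {("HouseBlend", "madeFromGrape")}

view = inferred_subclasses(asserted, domains, restrictions)
new_links = view - asserted
print(sorted(new_links))
```

Under these assumptions, HouseBlend is never asserted to be a Wine, yet the inferred view places it under Wine (and, transitively, under PotableLiquid). An incorrectly specified domain would produce an unintended subclass link in the same way.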

Methods

An Abstraction Network for OCRe

The derivation of a partial-area taxonomy abstraction network for OCRe requires the reformulation of the taxonomic elements of area and partial-area. Here, these notions are based on object properties and their defined (class) domains. This represents a shift from a reliance on actual relationship occurrences to potential relationship occurrences.

Let O be a non-empty set of object properties. The area with respect to O is defined as the set of all classes that are explicitly defined (or are inferred) as being in the exact domains of O’s object properties. The object properties collectively are used to name the area. For example, the OCRe class Entity is explicitly asserted as the domain for the object properties has part and part of; therefore, it belongs to the area named {has part, part of}. All of the descendants of Entity are also implicitly within the domain of has part and part of. However, many descendants “introduce” new object properties of their own in the sense of being the asserted domain of the properties and therefore will have different (larger) sets of object properties and reside in different areas. This inheritance and introduction of object properties within OCRe is the basis for defining an area taxonomy. Areas are linked by child-of relationships that abstract the underlying subclass hierarchy.
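The area definition can be sketched in a few lines: compute each class's full property set (introduced plus inherited) and group classes whose sets are identical. The fragment below uses class and property names mentioned in this paper, but the direct parent links are simplifying assumptions, and the function names are ours.

```python
from collections import defaultdict

# Toy fragment. The direct parent links are assumed for illustration.
parents = {
    "Entity": [],
    "Physical entity": ["Entity"],
    "Material": ["Physical entity"],
    "Organism": ["Physical entity"],
    "Collection": ["Entity"],
    "Population": ["Collection"],
}
# Properties for which a class is the asserted domain ("introduced").
introduced = {
    "Entity": {"has part", "part of"},
    "Physical entity": {"is element of", "plays"},
    "Collection": {"has element"},
}


def property_set(cls, parents, introduced, cache):
    """Introduced properties of cls plus everything inherited."""
    if cls not in cache:
        props = set(introduced.get(cls, ()))
        for parent in parents[cls]:
            props |= property_set(parent, parents, introduced, cache)
        cache[cls] = frozenset(props)
    return cache[cls]


def areas(parents, introduced):
    """Group classes by their exact property set; each group is an area."""
    cache = {}
    grouped = defaultdict(set)
    for cls in parents:
        grouped[property_set(cls, parents, introduced, cache)].add(cls)
    return dict(grouped)


areas_by_props = areas(parents, introduced)
for props, classes in areas_by_props.items():
    print(sorted(props), "->", sorted(classes))
```

In this fragment Entity sits alone in {has part, part of}, while Physical entity and its two children share a four-property area, matching the counts given in the text.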

A root within an area is defined as a class such that the set of object properties having the class as their domains differs from all such sets of its superclasses. An area may have more than one root. A partial-area, which is based on a root within an area, is defined as a collection of classes that share a common set of object properties and a common ancestor class, namely, the root, which introduced the partial-area’s new object properties (while the rest of the object properties were inherited from ancestors of the root). Let R be a root of an area A. The set of classes consisting of R and all its descendants in A is called a partial-area and is named after the root. Partial-areas are linked by child-of relationships derived from the underlying subclass hierarchy in the ontology. In this paper, we focus on OCRe’s Entity hierarchy, which is the largest and features a rich set of object properties. As of Version 244, there were 120 distinct classes and 75 unique types of object properties whose domains are subclasses of Entity. This hierarchy also contains the important Study class, which is considered the primary element of OCRe.
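The root and partial-area definitions can be illustrated as follows. This is a sketch on a toy fragment: the property sets are given directly, the placement of Time point directly under Entity is an assumption, and the extra property of Person is a hypothetical placeholder because the paper does not name it. Since property sets only grow down the hierarchy, comparing a class with its direct parents is enough to decide whether it is a root.

```python
from collections import defaultdict

parents = {
    "Entity": [],
    "Time point": ["Entity"],          # direct link assumed
    "Physical entity": ["Entity"],     # direct link assumed
    "Material": ["Physical entity"],
    "Organism": ["Physical entity"],
    "Person": ["Organism"],
}
BASE = frozenset({"has part", "part of"})
PHYS = BASE | {"is element of", "plays"}
props = {
    "Entity": BASE,
    "Time point": BASE,
    "Physical entity": PHYS,
    "Material": PHYS,
    "Organism": PHYS,
    # Placeholder: the paper does not list Person's own properties.
    "Person": PHYS | {"hypothetical person property"},
}


def is_root(cls, parents, props):
    """A root's property set differs from that of every parent."""
    return all(props[p] != props[cls] for p in parents[cls])


def partial_area(root, parents, props):
    """The root plus all of its descendants that stay in the same area."""
    children = defaultdict(list)
    for cls, ps in parents.items():
        for p in ps:
            children[p].append(cls)
    members, stack = set(), [root]
    while stack:
        cls = stack.pop()
        if cls in members:
            continue
        members.add(cls)
        stack.extend(c for c in children[cls] if props[c] == props[root])
    return members


roots = [c for c in parents if is_root(c, parents, props)]
partial_areas = {r: partial_area(r, parents, props) for r in roots}
print(partial_areas)
```

Here Physical entity heads a partial-area of three classes, while Person starts a partial-area of its own because it introduces a property.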

A preliminary step in creating the partial-area taxonomy was to run a reasoner on the ontology. We applied Pellet, provided within Protégé [7]. As the first step, we analyzed where object properties were introduced within the hierarchy. Figure 1 shows an indented subhierarchy of 21 classes from Entity, along with classes that are explicitly defined as the domains for the given object properties. The object properties introduced at a given class are shown in brackets next to the class name. Background color alternates between white and light blue to help identify on which level of the subhierarchy a given concept resides. As an example, the class Physical entity is defined as being within the domain of two object properties, is element of and plays. In addition, Physical entity has the two object properties has part and part of that are inherited from the Entity class. Altogether, Physical entity is in the domain of four object properties. Object properties are color-coded according to the total number of properties for the class at which they are introduced. For example, all classes with a green-colored object property have three in total. This color coding will be utilized in the following figures.

Figure 1.
A portion of OCRe’s Entity hierarchy in an indented format with object properties introduced. The number after “+” indicates the number of inherited properties.

Once the inferred hierarchy was established and all the properties were identified, the hierarchy was traversed using a depth first search algorithm and classes that are in the domains of the exact same set of properties were grouped together. This second step established the area taxonomy for Entity. Figure 2 shows the grouping of classes for the part of the hierarchy from Figure 1. Areas are represented as colored boxes. Different colors indicate different numbers of object properties. Sets of properties are shown at the top of each colored box. Each such list of properties is the respective area’s name. Classes with that set of properties are shown in the box, with descendants of each root shown indented. For example, the class Collection in Figure 2 has the object property set {has part, part of, has element} and a child class Population along with two grandchildren, Enrolled population and Study population. Edges are used to represent child-of links between areas. Child-of links indicate the chain of inheritance involved with a particular set of object properties. The edge from the area containing Organization indicates that this area inherited five properties from the area containing Social institution.
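The traversal described in this step might look like the sketch below: a depth-first walk from the root that carries the inherited property set downward, assigns each class to an area, and records a child-of link whenever a class lands in a different area than its parent. For brevity the sketch assumes every class has a single parent (a tree); with multiple parents, the inherited sets of all parents would have to be merged. The direct parent links and the function name are assumptions.

```python
from collections import defaultdict

# Toy fragment with single inheritance; direct links are assumed.
children = {
    "Entity": ["Collection", "Physical entity"],
    "Collection": ["Population"],
    "Population": ["Enrolled population", "Study population"],
}
introduced = {
    "Entity": frozenset({"has part", "part of"}),
    "Collection": frozenset({"has element"}),
    "Physical entity": frozenset({"is element of", "plays"}),
}


def build_area_taxonomy(root, children, introduced):
    """Depth-first grouping of classes into areas, plus area child-of links."""
    areas = defaultdict(set)
    child_of = set()
    stack = [(root, frozenset())]  # (class, property set of its parent)
    while stack:
        cls, parent_area = stack.pop()
        area = parent_area | introduced.get(cls, frozenset())
        areas[area].add(cls)
        # A class that introduces properties starts a link to the parent's area.
        if cls != root and area != parent_area:
            child_of.add((area, parent_area))
        for child in children.get(cls, []):
            stack.append((child, area))
    return dict(areas), child_of


areas, child_of = build_area_taxonomy("Entity", children, introduced)
for area, classes in areas.items():
    print(sorted(area), "->", sorted(classes))
```

On this fragment, Collection, Population, Enrolled population, and Study population all fall into the area {has part, part of, has element}, which is a child of the root area.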

Figure 2.
The grouping of classes from Figure 1 into areas based on each class’ set of object properties. Edges are child-of links between areas.

The third and final step was to identify the roots of each area and group the descendants of these roots into partial-areas. In Figure 2, Planned activity is the ancestor of five classes that all share the same set of object properties: has part, part of, has effective time, has planned component relationship, and occurs in. This group of classes is joined into a single partial-area rooted at Planned activity. Partial-areas are then arranged into the partial-area taxonomy. Child-of links between partial-areas are established, linking each root to the partial-area where its superclass resides. Figure 3 shows the partial-area taxonomy for the subset of classes from Figure 2. In Figure 3, we explicitly identify the root that names each partial-area. The numbers in parentheses indicate the total numbers of classes of the respective partial-areas. The targets of the child-of’s between partial-areas are listed in the boxes after “CHILD OF.”
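Once every class has been assigned to a partial-area, the child-of links follow mechanically: each root points to the partial-area that contains each of its superclasses. The sketch below assumes the class-to-partial-area assignment is already available as a dictionary; the class names are taken from this paper, while the function name is ours.

```python
# Each class mapped to the root that names its partial-area.
partial_area_of = {
    "Physical entity": "Physical entity",
    "Material": "Physical entity",
    "Organism": "Physical entity",
    "Person": "Person",
    "Biospecimen": "Biospecimen",
}
# Superclasses of the classes above that lie inside this fragment.
parents = {
    "Material": ["Physical entity"],
    "Organism": ["Physical entity"],
    "Person": ["Organism"],
    "Biospecimen": ["Material"],
}


def child_of_links(parents, partial_area_of):
    """Link each root to the partial-area in which each superclass resides."""
    links = set()
    for cls, root in partial_area_of.items():
        if cls != root:
            continue  # only roots carry child-of links
        for parent in parents.get(cls, []):
            links.add((root, partial_area_of[parent]))
    return links


links = child_of_links(parents, partial_area_of)
print(sorted(links))
```

A root with superclasses in two different partial-areas would yield two links, which is the situation later flagged for review in the audit.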

Figure 3.
Partial-areas derived for each area in Figure 2. The targets of child-of links within partial-areas are indicated after “CHILD OF.”

Figure 4 shows the final diagram representation of the partial-area taxonomy created for the subhierarchy of Entity given in Figure 1. It is a representation that is more compact than the original hierarchy. In Figure 4, we have condensed the 21 classes of Figure 1 into a structure of ten partial-areas residing in nine areas. In the figure, the colored boxes represent areas. The white boxes in an area are the partial-areas. The number of classes for each partial-area is shown. We organized areas into levels based on their number of object properties. Areas containing classes with the fewest object properties are at the top, while those with the most properties are at the bottom.
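The level layout can be approximated by ranking areas by the size of their property sets. The paper does not spell out the exact rule, so the dense ranking below (each distinct property count gets the next level number) is our assumption; it reproduces the placement of the root area at Level 0 and of Physical entity's area at Level 2. Each area is labeled by one of its classes for readability.

```python
# Areas labeled by a representative class, with their property sets.
area_properties = {
    "Entity": ["has part", "part of"],
    "Collection": ["has part", "part of", "has element"],
    "Physical entity": ["has part", "part of", "is element of", "plays"],
    "Planned activity": [
        "has part", "part of", "has effective time",
        "has planned component relationship", "occurs in",
    ],
}


def assign_levels(area_properties):
    """Rank areas by property count; fewest properties is Level 0."""
    counts = sorted({len(props) for props in area_properties.values()})
    level_of_count = {count: level for level, count in enumerate(counts)}
    return {
        area: level_of_count[len(props)]
        for area, props in area_properties.items()
    }


levels = assign_levels(area_properties)
print(levels)
```

Because levels are ranks rather than raw counts, a level number and the number of properties on that level need not coincide once some counts are skipped.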

Figure 4.
Partial-area taxonomy for the subset of classes from OCRe’s Entity hierarchy used in Figures 1–3.

Within the partial-area taxonomy, edges are used to represent the child-of links between partial-areas. As a graphical simplification, we draw edges only between areas when all of the partial-areas of a given area are children of the same parent area. As the level structure is now explicit, arrow heads may be omitted. For additional clarity, edges may be color coded based on which area the parent partial-area resides in. For example, Biospecimen is child-of Physical entity. Because Physical entity is at Level 2 (blue), a blue line is used to connect the two areas that contain these two partial-areas.

Auditing Methodology

One way to perform quality assurance is to review the partial-area taxonomy to see whether it conforms to the original conception that the designer of the ontology had. For example, do the various partial-areas indeed have the correct sets of object properties? Such a review can be done by an individual who is familiar with the content and structure of the ontology. Another way of utilizing the taxonomy is by identifying any components that display an anomaly vis-à-vis the rest of the ontology. For example, a partial-area that is much larger than all the other partial-areas might be considered an anomaly. Another example is a partial-area in which a very large number of object properties are introduced.

A further anomaly may relate to exceptions in the number of child-of relationships emanating from a partial-area. If, for example, most partial-areas have just one child-of and only a few have multiple child-of’s, the latter constitute an exception to the norm and are recommended for review. It is not necessarily the case that each such anomaly manifests an error, but the anomalous classes are recommended for in-depth review by an auditor or curator of the ontology. Some anomalies are the results of modeling errors that can be discovered during an in-depth review.
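These heuristics lend themselves to simple screening code. The sketch below flags partial-areas that are much larger than the median and partial-areas with more than one child-of link. The sizes are the ones reported for the larger partial-areas in the Results section; the 1.5 times median cutoff and the function names are arbitrary choices made for illustration, not part of the methodology.

```python
from statistics import median

# Sizes of the larger partial-areas, as reported in the Results section.
sizes = {
    "Entity": 14,
    "Study design": 13,
    "Outcome analysis specification": 34,
    "Planned activity": 6,
    "Study": 24,
}
# Number of child-of links emanating from selected partial-areas.
child_of_counts = {"Relative time point": 2, "Person": 1, "Biospecimen": 1}


def unusually_large(sizes, factor=1.5):
    """Partial-areas whose size exceeds factor times the median size."""
    cutoff = factor * median(sizes.values())
    return sorted(name for name, size in sizes.items() if size > cutoff)


def multiple_child_of(child_of_counts):
    """Partial-areas with more than one child-of link."""
    return sorted(name for name, n in child_of_counts.items() if n > 1)


large = unusually_large(sizes)
multi = multiple_child_of(child_of_counts)
print(large)
print(multi)
```

A flag produced this way is only a pointer for in-depth review by an auditor, not evidence of an error in itself.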

Results

The complete partial-area taxonomy for the Entity hierarchy, shown in Figure 5, was created for Version 244 of OCRe, as hosted on NCBO’s BioPortal repository. This version of Entity consists of 120 classes and 75 unique types of object properties. Levels are numbered, with the root area {has part, part of} at Level 0. Lower levels have larger level numbers and also larger numbers of object properties; however, these numbers are usually not equal.

Figure 5.
Complete partial-area taxonomy for OCRe’s Entity hierarchy prior to auditing efforts.

The partial-area Physical entity is in the area {has part, part of, is element of, plays} at Level 2. Physical entity has three classes, the root and its two children, Material and Organism (see Figure 2). There are two partial-areas that are child-of the partial-area Physical entity: Person which is a subclass of Organism, and Biospecimen which is a subclass of Material in the original ontology. The corresponding child-of relationships are shown as (blue) lines in Figure 5. In total, the taxonomy has 21 areas organized into nine levels. There are 23 partial-areas in total, because two areas at Level 1 contain two partial-areas.

Twelve of the partial-areas consist of just one class. To observe the main focus of the content of the Entity hierarchy, we should concentrate on the larger partial-areas: Entity (14 classes), Study design (13 classes), Outcome analysis specification (34 classes), Planned activity (6 classes), and Study (24 classes). By reviewing the 23 partial-areas and concentrating on the large ones, one can get an orientation to the structure and content of this hierarchy. The two largest partial-areas, Outcome analysis specification and Study, should be considered anomalous. At first, the OCRe curators were surprised to see so many classes included within the former. Upon closer inspection, it was seen that all 33 descendants of Outcome analysis specification describe statistical methods and clearly do not belong under this class. Furthermore, they do not even belong in the Entity hierarchy. The reasoner inferred the subsumption relationships because of the erroneous domain specification of a property. After the error was fixed, the 33 classes no longer showed up as inferred subclasses of Outcome analysis specification.

Figure 6 shows the taxonomy of the revised Entity hierarchy, which has only 88 classes. It was made available on BioPortal as Version 258 of OCRe. In the revised taxonomy, the partial-area Outcome analysis specification contains just one class on Level 2 (blue) with only four object properties. Outcome analysis specification was removed as the domain of two object properties, has dependent variable and has independent variable, which were formulated as existential restrictions on Outcome analysis specification. Variable specification was added as the domain of these properties’ inverses. Because of this addition, Variable specification moved down one level to Level 7 in the taxonomy of Figure 6.

Figure 6.
Partial-area taxonomy for OCRe’s Entity hierarchy, revised after an audit.

Another anomaly was encountered at the partial-area Relative time point (one class). It is the only partial-area that has two child-of’s emanating from it, one to the partial-area Entity, where it is a subclass of the Time point class, and the other to the partial-area Time interval, where it is a subclass of the root class Time interval. According to the definition provided within the ontology, Relative time point is not a Time interval at all, but a Time point in reference to some other given time point. Furthermore, the subclass relationship to Time interval was not in the asserted view of OCRe, but was instead inferred by the Pellet reasoner. The unintended subsumption relationship was inferred because the domain of the duration object property was specified incorrectly. During the review of OCRe, the domain of the property was changed, and as a result the second subclass relationship to Time interval is no longer inferred. In Figure 6, Relative time point appears on Level 3 (red), with only five object properties. This is because the two object properties has start time and has stop time, originally inherited from Time interval, disappeared once Relative time point was no longer a subclass of Time interval.

Another change in the ontology resulted from observations made upon review of the various partial-areas and their sets of object properties. The partial-area Physical quantity (two classes) on Level 2 (blue) had an irrelevant object property has semantic constraint, which was removed. This partial-area is now on Level 1 (green) of the taxonomy (Figure 6).

Comparing Figures 5 and 6, we see that changes occurred on all levels except Level 0. However, as it happens, Level 5 (comprising one class Biospecimen) in Figure 6 is identical to Level 6 in Figure 5. On every other level, there are significant changes between the two figures.

Discussion

Our long-range goal in a broader research project is to develop methodologies for deriving partial-area taxonomies and similar abstraction networks that are applicable to whole families of ontologies, with members sharing a similar model and similar “meta-properties.” Examples of candidate ontologies that all share a similar model and have similar properties to OCRe include the Bone Dysplasia Ontology (BDO) [16] and the Sleep Domain Ontology (SDO) [17, 18], which are expressed in OWL and include many object properties having explicitly defined domains. Hence, we hypothesize that they form a family and that the methodology developed in this paper for deriving a partial-area taxonomy is applicable to them—and further that such a taxonomy is useful for quality assurance. However, we note that OCRe is somewhat unusual among OWL ontologies in its careful assignment of domains for object properties. Future research may lead to distinguishing among various kinds of OWL ontologies, and to the formulation of several families requiring potentially different methodologies for taxonomy derivation.

OCRe served as an initial test case for developing a methodology for deriving taxonomies for medical ontologies based on DL and expressed using OWL. It is an encouraging sign that the derived taxonomy for OCRe has proven helpful in quality assurance and was embraced by the OCRe editorial team for this purpose. Further taxonomy derivation is planned for the other six hierarchies of OCRe for the purpose of auditing. The mis-classifications uncovered in the audit can lead to errors in the inferred hierarchy if it is used to answer queries on data annotated with OCRe classes. Queries for outcome analysis specification or data involving time intervals may be answered with data on particular statistical analysis or data involving time points, respectively.

In [19], it is proposed that an ontology can be audited in terms of “philosophical validity, compliance with meta-ontological commitments, ‘content correctness,’ and fitness for purpose.” In contrast to ontology evaluation methodologies that focus on fitness for purpose (e.g., satisfaction of competency questions [20]) or on philosophical validity and compliance with meta-ontological commitments (e.g., OBO Foundry Principles [21]), the partial-area taxonomy auditing methodology provides tools and heuristics to detect potential content errors at a more abstract level. Of course, the different auditing methodologies are complementary to each other. For example, a fitness-for-purpose approach may highlight areas in the ontology that should be more intensely examined using our partial-area approach.

With regard to the methodology used in this paper, we note that the intermediate stages, reflected by Figures 1, 2, and 3, are only shown for expository purposes. The methodology is algorithmic and can automatically generate the partial-area taxonomy (as has been done previously for SNOMED CT). The purpose of highlighting these stages and the transitions between them was to help the reader develop an understanding of the process. In other words, we provided the figures for the intermediate stages to demystify the “black box” derivation. A human auditor will only see the final results, corresponding to Figure 5 before the auditing and Figure 6 afterward.

The remodeling of OCRe following the auditing that was facilitated by the taxonomy of Figure 5 has not yet been completed. In addition to the changes reflected in Figure 6 and presented in the previous section, there is some additional remodeling work underway for the partial-area Study (bottom of Figure 5 on Level 8). This partial-area has 24 object properties, a large increase versus the nine object properties at Level 7 in the taxonomy. It is surprising that besides part of and has part all 22 other object properties have the same class Study as their domains. One could envision some of the object properties having children or grandchildren of Study as their domains. For example, the object properties has recruitment status or has biospecimen collected may not be relevant for all 24 classes of this partial-area and should be introduced at appropriate descendants. This modification would partition the large partial-area Study into several smaller ones, likely improving the presentation of OCRe to users. The editorial team of OCRe is currently using this feedback to re-examine the modeling of these classes, which are critical to the purpose of OCRe. Some initial changes are reflected in Figure 6, where Study has grown from 24 to 26 classes, compared to Figure 5, reflecting a finer distinction between classes.

Conclusions

An abstraction-network derivation methodology has been introduced and used to obtain a partial-area taxonomy for OCRe, an OWL ontology, focusing on human studies. This taxonomy was shown to support quality-assurance efforts and led to an improved version of the Entity hierarchy of OCRe, which was reduced from 120 classes to 88 classes. The corrections suggested by the study have already been incorporated by the OCRe editorial team into Version 258 of OCRe, hosted in the NCBO BioPortal, and more corrections are expected in the future. The derivation methodology is applicable to three more ontologies in the NCBO BioPortal. Further work is planned in studying the generalizability of the paradigm for taxonomy derivation methodologies for various families of ontologies sharing similar models and properties.

References

1. The CHISEL Group, University of Victoria [cited 2012 January 24]; Available from: http://www.thechiselgroup.org/flexviz.
2. Min H, Perl Y, Chen Y, Halper M, Geller J, Wang Y. Auditing as part of the terminology design life cycle. J Am Med Inform Assoc. 2006;13(6):676–90. Epub 2006/08/25.
3. BioPortal. [cited 2012 January 30]; Available from: http://bioportal.bioontology.org/
4. Musen MA, Noy NF, Shah NH, Whetzel PL, Chute CG, Story MA, et al. The national center for biomedical ontology. J Am Med Inform Assoc. 2012;19(2):190–5.
5. Wang Y, Halper M, Min H, Perl Y, Chen Y, Spackman KA. Structural methodologies for auditing SNOMED. J Biomed Inform. 2007;40(5):561–81. Epub 2007/02/06.
6. Tu S, Carini S, Rector A, Maccalum P, Toujilov I, Harris S, et al. OCRe: An Ontology of Clinical Research; 11th International Protege Conference; 2009.
7. OCRe in BioPortal. BioPortal [updated February 21, 2012; cited 2012 February 22]; Available from: http://bioportal.bioontology.org/ontologies/45657.
8. OWL Web Ontology Language Overview. [cited 2012 February 23]; Available from: http://www.w3.org/TR/owl-features.
9. Noy NF, Crubezy M, Fergerson RW, Knublauch H, Tu SW, Vendetti J, et al. Protege-2000: an open-source ontology-development and knowledge-acquisition environment. AMIA Annu Symp Proc. 2003:953. Epub 2004/01/20.
10. Protégé. [cited 2012 February 23]; Available from: http://protege.stanford.edu/
11. Rein-Lien Hsu, Abdel-Mottaleb M, Jain AK. Face detection in color images. Pattern Analysis and Machine Intelligence. IEEE Transactions on. 2002;24(5):696–706.
12. Bertaud-Gounot V, Duvauferrier R, Burgun A. Ontology and medical diagnosis. Inform Health Soc Care. 2012;37(1):22–32.
13. Bright TJ, Yoko Furuya E, Kuperman GJ, Cimino JJ, Bakken S. Development and evaluation of an ontology for guiding appropriate antibiotic prescribing. J Biomed Inform. 2012;45(1):120–8.
14. Hardiker NR, Kim TY, Coenen AM, Jansen KR. A dynamic classification approach for nursing. AMIA Annu Symp Proc. 2011:543–8.
15. OWL Web Ontology Language Guide. [cited 2012 March 14]; Available from: http://www.w3.org/TR/owl-guide.
16. Bone Dysplasia Ontology. [cited 2012 March 4]; Available from: http://bioportal.bioontology.org/ontologies/1613.
17. Arabandi S, Ogbuji C, Redline S, Chervin R, Boero J, Benca R, et al. Developing a Sleep Domain Ontology. AMIA Clinical Research Informatics Summit; San Francisco: Mar, 2010. pp. 12–13.
18. Sleep Domain Ontology. [cited 2012 March 4]; Available from: http://bioportal.bioontology.org/ontologies/1651.
19. Rogers JE. Quality assurance of medical ontologies. Methods Inf Med. 2006;45(3):267–74.
20. Gruninger M, Fox MS. Methodology for the Design and Evaluation of Ontologies; Proceedings of the Workshop on Basic Ontological Issues in Knowledge Sharing, IJCAI-95; 1995.
21. OBO Foundry Principles. [cited 2012 July 16]; Available from: http://www.obofoundry.org/wiki/index.php/Category:Accepted.

Articles from AMIA Annual Symposium Proceedings are provided here courtesy of American Medical Informatics Association