Protégé Knowledge Model
Protégé is a suite of tools for ontology development and use. Before describing the tool itself, it is important to understand how terminologies and ontologies are built and stored in computers. Briefly, terminologies comprise lists of terms (also called “entities”), and they may also specify attributes (also called “slots”) for those terms (for example, RadLex contains a term called supraaortic valve area, which has an attribute Version Number). Ontologies are similar to terminologies, but contain rich relationships amongst terms, enabling them to represent knowledge in a domain (for example, a relationship SegmentOf linking supraaortic valve area to thoracic arota represents the fact that the supraaortic valve area is a segment of the thoracic aorta).
In the Protégé knowledge model, terminologies and ontologies are represented using “
frames” (classes, slots, and facets).
9 An ontology in Protégé consists of frames and axioms.
Classes are the entities (“terms”) used in the domain of discourse. Classes are the sometimes-called “concepts” in terminologies. For example, if we wish to have a term in our RadLex vocabulary representing the supraaortic valve area, we would create the class supraaortic valve area in the Protégé ontology (Fig. ).
Slots describe properties or attributes of classes. For example, if we wanted all RadLex terms in our ontology to have a string unique identifier called “Name,” which had a nonempty value, then we would create a Name slot and give it the necessary facets (Fig. ). Facets describe characteristics of slots (such as cardinality, required values, etc). Facets allow us to express the fact that the name of a RadLex term class is required (Fig. ). Using a set of slots, we can fully describe each RadLex term. For example, we can express the fact that the supraaortic valve area is a segment of the thoracic aorta, that “root of aorta” is a synonym, and that it has a definition and a RadLex identifier of “RID482” (Fig. ).
Axioms specify additional constraints, but will not be described further here as they are not used in RadLex at this time. An instance is a frame built from at least one class (“instantiation”) that carries particular values for the slots. A “knowledge base” includes the ontology (classes, slots, facets, and axioms) as well as instances of particular classes with specific values for slots. The distinction between classes and instances is not an absolute one; however, because terminologies generally do not contain instances, we can simplify and limit our discussion to classes.
Classes, slots, and facets are the basic building blocks of terminologies. A tutorial on using these elements and Protégé for creating terminologies and ontologies is available
10 and will not be repeated here. Instead, we will focus on the features of Protégé that are particularly relevant to accessing, editing, and sharing ontologies and using them in applications.
Architecture of Protégé
Protégé is implemented in Java and runs on a broad range of software platforms, including Windows, MacOS, Linux, and Unix. Protégé is built using a tiered architecture (Fig. ), providing an ontology storage layer, a knowledge model layer, a graphical user interface (GUI), and an application programming interface (API). Users running the desktop application interact with ontologies using the GUI, while application programs access ontologies in Protégé via the API. The Protégé GUI uses the API to access the Protégé knowledge model of the ontology. Thus, the Protégé GUI, plug-ins, and user applications all access the ontology through the same API, making the Protégé architecture modular and highly flexible (Fig. ).
In addition, Protégé has a plug-in architecture, permitting developers to extend Protégé’s core functionality in many ways without needing to modify the Protégé source code. There are more than 90 Protégé plug-ins providing advanced capabilities such as import/export, validation, and visualization of large ontologies (
http://protege.cim3.net/cgi-bin/wiki.pl?ProtegePluginsLibraryByTopic), some of which we discuss below.
Import and export of ontology content is a critical feature of any tool, because there are many formats for storing ontologies and terminologies. There are several plug-ins for Protégé, enabling it to import and export ontologies in many different formats, including XML, RDF, OWL, and the native Protégé format.
If the ontology is stored in a file, the entire ontology is read into memory. Protégé also provides a relational database backend, which is useful when working with ontologies too large to reside in memory. Finally, through Protégé’s plug-in mechanism, developers can create their own custom import/export plug-ins to work with custom or specialized formats.
Protégé Application and Graphical User Interface
Protégé can be run as a stand-alone application or through a Protégé client in communication with a remote server (the latter is particularly useful for collaborative ontology development). When the Protégé desktop application is launched, the user can create a new ontology, open an existing ontology, or import an ontology from a variety of formats. Protégé creates a project file that records display-related information and the location of the ontology source file; the project file simplifies the process of reopening ontologies, and it maintains user GUI customizations.
Once an ontology is opened, the user works with it in the Protégé GUI (Fig. ). The Protégé GUI is organized into separate panels (accessed by tabs along the top of the GUI) that provide different views into the contents of the ontology. The first tab is the Classes tab, which is the most common view used to browse ontologies and terminologies. On this tab, the ontology is shown as an expandable tree on the left of the GUI, and the attributes of a class selected by the user is shown on the right (Fig. ). In the tree view, terms that are indented and below a term are called “child” terms, whereas the term above is the “parent.” The interpretation of a child term is that it is subsumed by the parent by an “is-a” relationship (unless the tree of classes is displayed according to a different slot). For example, Protégé shows that the RadLex term aorta is a child of systemic artery, which is interpreted as “aorta is-a systemic artery” (Fig. ).
The tree view of RadLex can facilitate the process of curating the terminology to find problems to be corrected, because the various hierarchies can be easily expanded and browsed to detect inconsistencies. In this tree view, for example, we see that the supraaortic value area is a child of thoracic aorta (Fig. ; “supraaortic value area is-a thoracic aorta”). As it is not correct to say that “supraaortic valve area is-a thoracic aorta,” the RadLex ontology should be changed in a future release (what was likely intended was to say “supraaortic value area part-of thoracic aorta,” which could be reflected by changing the Part-of value).
The classes tab permits full editing of the ontology. In the tree view of the ontology on this tab, users can drag and drop classes to reorganize the hierarchy, create and rename classes using intuitive GUI paradigms. The values for attributes of classes can also be directly edited. For example, we could change the name of the class, left coronary sinus, by simply selecting this class and changing the value of the “Preferred Name” slot in the GUI (Fig. , right panel).
The other tabs accessible from the Protégé GUI are the slots, forms, and instances tab, which are less often used by those curating terminologies. The slots tab enables users to view and edit all slots in an ontology; the instances tab displays all instances associated with the classes; and the forms view permits users to customize the layout of the elements on display forms.
Ontology Difference
Ontologies change over time. They are developed iteratively, with incremental changes contributed by people working collaboratively on them. RadLex will soon be releasing a new version, adding new terms as well as changes to the original terminology based on suggestions from users of the first version. There is a need to maintain and compare versions of ontologies, similar to the need for document version management in large projects. Document versioning systems enable users to compare versions, examine changes, and accept or reject changes. However, a versioning system for ontologies must compare and present structural changes rather than changes in the text of a document.
PROMPT is a plug-in to Protégé providing a suite of tools for aligning and comparing two ontologies. PROMPT contains PROMPTDIFF, a versioning environment for ontologies, allowing developers to track
changes in ontologies over time.
11,12 PROMPTDIFF is a version-comparison algorithm that produces a structural difference between two versions of an ontology. The results are presented to the users enabling them to view classes that were added, deleted, and moved. The users can then act on the changes, accepting or rejecting them.
Suppose we edited a copy of RadLex to update it based on community feedback. Consider that we move the class supraaortic valve area and place it under the anatomic location class, and rename it to supraaortic area. In addition, suppose we create a new class noncoraonary sinus under supraaortic area, and that we update the
Part-of relationship on supraaortic area to say it is
Part-of thoracic aorta. When we run PROMPTDIFF, comparing our edited version of RadLex against the original version, the PROMPT interface displays the following differences (Fig. ):
- supraaortic valve area was renamed to supraaortic area
- supraaortic area was moved from being under thoracic aorta to being under anatomic location.
- noncoraonary sinus is a new class, under supraaortic area
- thoracic aorta was added as a value for the Part-of relationship on the supraaortic area class.
In addition to viewing versions of ontologies, PROMPT also includes a tool to align two different ontologies to locate classes that are the same or similar to each other. Such functionality could be useful in RadLex for finding related terms in other vocabularies such as SNOMED and the ACR index.
Ontology Visualization Components
The Protégé GUI provides a tree view of ontologies (Fig. ). Although an expandable tree is suitable for many ontologies and terminologies, it can be difficult to visualize those that have multiple types of relationships among the terms. A tree shows the relationships among terms using one relationship (usually the is-a relationship). RadLex contains several types of relationships, which are best shown using a graph (Fig. ). There are several types of ontology visualization components that have been developed for Protégé, two of which we highlight here in the context of RadLex.
OntoViz is a plug-in to Protégé that creates graph visualizations of ontologies (
http://protege.cim3.net/cgi-bin/wiki.pl?OntoViz). OntoViz produces its graphs using the Graphviz software package.
13 The types of visualizations are highly configurable and include ability to (1) visualize part of an ontology by selecting specific classes and instances, (2) display slots and slot edges, (3) customize the graph by specifying colors for nodes and edges, and (4) visualize the neighborhood of classes for particular classes or instances by applying various operators (e.g., subclasses, superclasses).
TGVizTab is a plug-in for Progégé for visualizing ontologies as a TouchGraph, a software implementation for producing an interactive graph that can be navigated by clicking on the nodes and edges in the graph. The TouchGraph library has been modified and integrated into TGVizTab for visualizing ontologies.
14 TGVizTab is well suited for navigating RadLex, because of the size of the terminology: the TouchGraph display provides a compact and graphical way to visualize large numbers of classes (Fig. ). Unlike OntoViz, the user can dynamically navigate the ontology and interrogate each of its components in real time. This visualization paradigm makes the terminology display simple and intuitive.
Collaborative Ontology Development
A key challenge for maintaining a large terminology such as RadLex is to collect and process the feedback from the large community of users. The current model for feedback on terminologies and ontologies is for users to contact the developers or to post comments to online discussion forums. The problem with this approach is that the user feedback is disconnected from the ontology under discussion; it is difficult for the ontology developer to keep track of these discussions, to prioritize the feedback, and to decide which parts of the ontology need updated first.
Collaborative Protégé
15 is a recent extension of the Protégé system that supports collaborative ontology development by enabling the community to annotate ontologies. A set of annotation components can be accessed by people viewing terminologies such as RadLex, enabling them to associate their comments directly with the parts of the ontology to which they apply. Thus, for example, if a user noticed that there is a duplicate term in RadLex, that person could create an annotation on the class in question in which the nature of the problem is described (Fig. ). At a later time, the RadLex curator can view the parts of the ontology for which users have made annotations, either acting on them or making annotations on the annotations themselves. All annotations can be viewed by the community as well, enabling a dialog on ontologies that is tied to the specific components under discussion. This mechanism can streamline the process of acquiring and reviewing community feedback on ontologies.
The Collaborative Protégé components provide search and filtering of user annotations based on different criteria. There are also two types of voting mechanisms that can be used for voting of change proposals. All annotations made by one user are seen immediately by other users, enabling a community-based discussion about ontologies to be directly linked to the applicable portions of the ontologies.
Programming Interfaces
Despite the variety of plug-ins available for Protégé, many users will have custom needs related to how they work with ontologies, which will require direct programmatic access to the ontology content. For example, a RadLex curator may wish to find the classes that have missing values for slots (currently, many SNOMED and UMLS identifiers have not yet been collected for RadLex terms). To find these classes in RadLex, a program would be needed to traverse all classes in the RadLex ontology, looking for those that have missing values for the slots in question.
Protégé provides two programming interfaces that developers can use to access its ontology content. The Protégé Script Tab is a plug-in to Protégé that provides a scripting environment in several interpreted languages such as python, pearl, and ruby (
http://protege.cim3.net/cgi-bin/wiki.pl?ProtegeScriptTab). The Script Tab provides the simplest and quickest programmatic access to ontologies in Protégé; it accessed via an additional tab in the Protégé GUI into which users can enter commands (Fig. ). The commands will be applied to the ontology currently loaded into Protégé. Thus, in our example above, we could perform the query to find classes having missing values for slots by entering a few lines of code into the Script Tab and viewing the results. An example of the output of a few simple commands entered into the Script Tab is shown in Figure .
In addition to the Script Tab, Protégé provides an API. System developers can use the Protégé API, a Java interface to access and programmatically manipulate Protégé ontologies—this is the same API that the Protégé GUI uses to access ontologies (Fig. ). Developers can access the Protege Java API from their Java programs by calling the appropriate methods to access the ontology in Protégé and to perform the desired operations. Documentation for guiding developers in using the Protégé API for application and plug-in development is available (
http://protege.stanford.edu/doc/dev.html).