We propose a typology of representational artifacts for the health care and life sciences domains and associate this typology with different kinds of formal ontology and logic, drawing conclusions as to the strengths and limitations of different kinds of logical resources for ontology, with a focus on description logics.
The four types of domain representation we consider are: (i) lexico-semantic representation, (ii) representation of types of entities, (iii) representation of background knowledge, and (iv) representation of individuals.
We advocate a clear distinction between these four kinds of representation in order to provide a more rational basis for the use of ontologies and related artifacts to advance the integration of data and the interoperability of associated reasoning systems.
We highlight the fact that only a minor portion of scientifically relevant facts in a domain such as biomedicine can be adequately represented by formal ontologies when the latter are conceived as representations of entity types. In particular, the attempt to encode default or probabilistic knowledge using ontologies so conceived is prone to produce unintended, erroneous models.
It is increasingly recognized that the complexity of the health care and life sciences domain demands a consensus on the terms and language used in documentation and communication. This need is driven by the exponential growth of data generated in the contexts of both patient care and life science research. At the moment, this data cannot be fully exploited for integration, retrieval, or interoperability, because the underlying terminology and classification systems (often subsumed under the heading “biomedical vocabularies”, see Table 1) are inadequate in various ways. Their heterogeneity reflects the different backgrounds, tasks and needs of different communities – including communities on the side of information technology – and creates a serious obstacle to consistent data aggregation and interoperability of the sort demanded by biomedical research, health care, and translational medicine.
What were formerly referred to as "terminology systems" or "vocabularies" are today often vaguely referred to by the name "ontology". This term first became common in biology circles with the success of the Gene Ontology (GO), and its use is becoming more and more popular in the medical domain as well. The so-called "omics" disciplines are a further major driving force for the development and adoption of ontologies. Within this context, the Open Biomedical Ontologies (OBO) Foundry initiative, which currently comprises over 60 ontologies and builds on the successes of the GO, is becoming a standard resource (Smith et al. 2007).
But the term “ontology” itself is notoriously affected by multiple inconsistent interpretations (Kusnierczyk 2006) and thus users often tend to have unrealistic expectations as to what ontologies can achieve (Stenzhorn et al. unpub. data). Therefore, any use of this term should ideally be preceded by an explanation of its intended meaning. To illustrate the sorts of problems which can arise, we can point to the stark contrast between the sample definitions developed by computer scientists and the ones inspired by philosophers:
Although these two families of definitions differ strongly, ontologies are seen in both cases as formal systems that apply fundamental principles and formalisms, drawing on mathematical logic, to represent entities of certain kinds, whether on the side of mind or language ("concepts") or on the side of reality ("properties", "types", and "classes"). The major role of ontologies is in both cases to provide a system of domain-independent distinctions to structure domain-specific theories with the goal of integrating and retrieving data and fostering interoperability. We are interested here only in ontologies in which a formal approach is used to support an aim of this sort, and to highlight this feature we shall use the term "formal ontology" in our deliberations henceforth. We believe that the focus on formality most clearly distinguishes the new generation of biomedical ontologies – including SNOMED CT and recent versions of the GO – from their vocabulary-like predecessors, which still bear traces of their origins in the domain of library science and literature indexing.
In this paper we focus on the role formal ontology can play in resolving some of the problems caused by the heterogeneity of terminologies and classification systems used in the biomedical domain. We want to clarify how the representation of the entities studied by the life sciences can benefit from formal ontologies in a way that helps to capture domain knowledge more adequately. We address two important aspects which are too seldom dealt with explicitly: (i) the representation of meta- or background knowledge, and (ii) the relation of ontologies to human language. We seek to highlight the role played by these factors in developing and using formal ontologies. We also seek to clarify those situations in which domain knowledge cannot be adequately accounted for by formal ontologies, especially due to vagueness and uncertainty. Two questions arise at this point.
We try to answer these questions focusing on representation standards developed by the Semantic Web community. We give examples of the use of these formalisms for representing biomedical entities. We also point to some common associated misconceptions and errors in ontology design and show how they can be rectified.
A simple universal representation scheme that lends itself to the representation of a broad range of entities and the relations between them is given by the so-called Object – Attribute – Value (OAV) triples. This encoding scheme was already popular in early expert systems (Shortliffe et al. 1975) and currently plays an important role in the Semantic Web initiative (W3C 2008), where it takes the form of Subject – Predicate – Object (SPO) triples within the Resource Description Framework (RDF) (Klyne et al. 2004). This representation is also very similar to the way the Unified Medical Language System (UMLS) Metathesaurus and other vocabulary resources link pairs of concepts from different terminology systems by means of relations such as broader_than, narrower_than, part_of, mapped_to, is_a, and so on. Table 2 gives some examples of this kind of representation.
One advantage of the triple format becomes evident when looking at this table: simple assertions are represented in a straightforward way that comes close to human language expressions. One disadvantage is that it promotes a confusion of use and mention (for example when it is asserted that Fever is both a synonym of Hyperthermia and a symptom of Inflammation). The triple format also has difficulties when it comes to the formulation of more complex assertions such as "In 2008, diabetes mellitus had a prevalence of 18.3% of US citizens age 60 and older", which need to be split into sets of simpler assertions if they are to fit the triple format. Table 3 depicts one possible OAV representation of such an assertion, where the successive rows are joined together into a compound conjunctive statement. One drawback here is that many competing models of this kind can claim to represent the given statement equally well, and this creates forking: different groups carry out the needed translations in different ways, and as a result their information systems are no longer interoperable. To avoid this silo effect, a single uniform representation model is needed.
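The decomposition just described can be sketched with a minimal triple store. This is only an illustration of the principle: the auxiliary statement node ("statement-1") and all attribute names are hypothetical choices of ours, not drawn from RDF, the UMLS, or any standard.

```python
# A minimal Object-Attribute-Value (OAV) triple store. Simple assertions fit
# directly; the compound prevalence assertion must be split into several
# triples hanging off an auxiliary statement node.

triples = set()

def add(obj, attr, value):
    triples.add((obj, attr, value))

def query(obj=None, attr=None, value=None):
    """Return all triples matching the given pattern (None = wildcard)."""
    return {(o, a, v) for (o, a, v) in triples
            if obj in (None, o) and attr in (None, a) and value in (None, v)}

# Simple assertions are naturally triple-shaped (note the use/mention
# confusion: a term-level and an entity-level claim share one format):
add("Fever", "synonym_of", "Hyperthermia")
add("Fever", "symptom_of", "Inflammation")

# "In 2008, diabetes mellitus had a prevalence of 18.3% of US citizens
# age 60 and older" only fits once decomposed around an auxiliary node:
add("statement-1", "about_disease", "Diabetes_mellitus")
add("statement-1", "has_prevalence", "18.3%")
add("statement-1", "in_year", "2008")
add("statement-1", "in_population", "US citizens age 60 and older")

# The four triples jointly encode the one compound assertion:
print(query(obj="statement-1"))
```

A different group might decompose the same sentence around different nodes and attribute names with equal justification, which is exactly the forking problem noted above.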
Another drawback of the OAV representation scheme is that it is not obvious in any given case how its assertions are to be interpreted. The assertion that Smoking causes Cancer, for example, could be interpreted in such a way that its author believes that smoking always (i.e., without exception) causes cancer. But it could also be interpreted to mean that smoking often, usually or typically causes cancer, or even, as within the UMLS Semantic Network, that the expression "Smoking causes cancer" is semantically meaningful. Without additional knowledge on how to interpret the relation causes, we cannot decide which alternative is meant in any given case. Certainly, in many everyday situations humans communicate perfectly well when using ambiguous statements. But this is so because humans are able to position them spontaneously within a relevant context of implicit background assumptions. In the case of machine processing, however, such implicit knowledge is lacking. And it is for this reason that logical definitions, and axioms expressed in an appropriate formal language, are required to preclude, or at least constrain, competing interpretations. Unfortunately, as will be clear from the examples given below, applying the rigor of logic is not only very expensive in human resources; it also does not allow, even in principle, the formal expression of everything we know. We can however still capture an important part of our knowledge in a way that is, we believe, indispensable to computational reasoning and to the resolution of our three problems of integration, retrieval and interoperability.
To illustrate how basic ontological assertions concerning the entities in a given domain can be formulated using logical resources, we introduce the family of Description Logics (hereafter DLs) (Baader et al. 2007). DLs are subsets of first-order logic (subsequently called FOL). Although DLs are far from being able to express everything one might desire for a comprehensive logical account for ontologies (which would require the whole range of FOL) we set this focus for the following reasons:
To use DLs properly, one has to understand their basic building blocks, represented by terms like "class", "relation", and "individual", and also understand how their constituent logical symbols and expressions are interpreted. For example, all past, present, and future individual hands in the world are instances of the class Hand. Binary relations ("object properties" in OWL DL) have pairs of individuals as their extensions (Patel-Schneider et al. 2004), e.g., the pair constituted by the first author's right thumb and his right hand. Classes in DL are always distinct from individuals, and classes of classes are not allowed. OWL DL object properties express binary relations without any direct reference to time. This is a major drawback from an ontological – and biological – point of view, since we often need to attach time-indexes to assertions about individuals, for example to the effect that a given individual belongs to the class Embryo at t1 and to the class Fetus at t2. One also has to be careful to recognize that the same expressions may be interpreted in different ways in different disciplines. For instance, a statement to the effect that all hands have thumbs is limited to the domain of normal (so-called canonical) human anatomy. It clearly does not hold if the domain includes injured or malformed humans, or humans in early embryonic states (Schulz et al. 2008, Neuhaus et al. 2007).
In the following, we illustrate the DL syntax and semantics through a set of increasingly complex examples. To start, we take a look at the class Liver. When we introduce this class, we define its extension to be the set of all livers of all organisms at all times. In the same vein, the class Bodily_Organ then has as its extension all individual bodily organs at all times. To relate the two classes, we can introduce the key concept of taxonomic subsumption: The class Liver is a subclass (subtype) of the class Bodily_Organ. In DL notation, this is expressed by the subsumption operator ⊑:

Liver ⊑ Bodily_Organ

and the relation in question is commonly referred to as the is_a relation.
In contrast, the instantiation relation instance_of links individuals to the classes of which they are the instances. For example, each individual liver is an instance of the class Liver, so the (individual) liver of the first author of this paper is one specific instance_of Liver. It is noteworthy that DLs do not allow a distinction to be expressed between an individual's membership in a howsoever defined class, on the one hand, and an individual's instantiation of a universal or type, on the other. Both are represented by means of the instance_of relation.
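The set-theoretic reading of these two relations can be sketched as follows: classes are interpreted as sets of individuals, instance_of as set membership, and is_a as set inclusion. The two individuals named here are hypothetical placeholders.

```python
# Classes as extensions (sets of individuals); instance_of is membership,
# is_a (subsumption) is set inclusion.

liver_1 = "liver of the first author"   # an individual
heart_1 = "heart of the first author"   # another individual

Liver = {liver_1}                       # extension: all individual livers
Bodily_Organ = {liver_1, heart_1}       # extension: all individual organs

def instance_of(individual, cls):
    """An individual instantiates a class iff it is in the class extension."""
    return individual in cls

def is_a(sub, sup):
    """sub is_a sup iff every instance of sub is also an instance of sup."""
    return sub <= sup

print(instance_of(liver_1, Liver))      # → True
print(is_a(Liver, Bodily_Organ))        # → True  (Liver ⊑ Bodily_Organ)
print(is_a(Bodily_Organ, Liver))        # → False (not every organ is a liver)
```

Note that the formalism itself offers only membership and inclusion; it cannot distinguish membership in an arbitrarily defined class from instantiation of a genuine type, as remarked above.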
More complex statements can be obtained by using operators and quantifiers. In the following example we use both the conjunction operator ⊓ ("and") and add a quantified role, using the existential quantifier ∃ ("exists"). The expression

Inflammatory_Disease ⊓ ∃has_location.Liver

then denotes the class of all instances that belong to the class Inflammatory_Disease and are further related through the relation has_location to some instance of the class Liver.
This example actually gives us both the necessary and the sufficient conditions needed in order to fully define the class Hepatitis:

Hepatitis ≡ Inflammatory_Disease ⊓ ∃has_location.Liver
The equivalence operator ≡ in this formula tells us that: (i) each particular instance of hepatitis is an instance of inflammatory disease that is located in some liver, and (ii) everything that is an instance of inflammatory disease located in some liver is an instance of hepatitis. Hence, in any situation, the term on the left can be replaced by the expression on the right without any loss of meaning.
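The model-theoretic semantics of this definition can be sketched over a toy interpretation. All individual names below are invented for illustration; the class and relation names follow the text.

```python
# A toy interpretation: extensions for the named classes and the relation.
Inflammatory_Disease = {"inflammation-1", "inflammation-2"}
Liver = {"liver-1"}
has_location = {("inflammation-1", "liver-1"),   # located in a liver
                ("inflammation-2", "skin-1")}    # located elsewhere

def exists(relation, filler):
    """Extension of ∃relation.filler: all x related to some y in filler."""
    return {x for (x, y) in relation if y in filler}

# Conjunction ⊓ is set intersection; the equivalence fixes the extension of
# Hepatitis exactly, in both directions:
Hepatitis = Inflammatory_Disease & exists(has_location, Liver)
print(Hepatitis)  # → {'inflammation-1'}
```

The skin inflammation is excluded because it is not located in any liver, which is precisely the bidirectional force of the equivalence.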
Note that when we express such an equivalence statement, this statement has to hold at all times without exception. Therefore we cannot use statements of this form to express, for instance, that hepatitis has the symptom fever in most (but not in all) cases. We could, of course, form the expression

Inflammatory_Disease ⊓ ∃has_location.Liver ⊓ ∃normally_has_symptom.Fever
and assert an equivalence with Hepatitis. In virtue of the DL interpretation of the existential quantifier, however, this assertion implies that for every instance of the class Hepatitis (without exception) there also exists some instance of Fever. The word normally in the property name normally_has_symptom can be interpreted by humans, but it plays no logical role at all. This is clearly not in accordance with the intended meaning.
Such logical effects are important, since errors arise where they are not taken into account by users of DL formalisms. Abundant instances of such errors can be found in the current version of SNOMED CT. Its concept Biopsy_Planned (ID: 183993008), for example, is related to the concept Biopsy as follows:
This expression states that for each planned biopsy (we assume that this is the meaning of Biopsy_Planned) there always exists at least one instance of an actual biopsy, which can certainly not be what is intended, since not all plans for biopsies are ever realized. SNOMED CT also has the class Drug_Abuse_Prevention (ID: 408941008):
This expression states, quite absurdly, that whenever an act of drug abuse prevention is performed, then there is also some instance of drug abuse.
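The problem with such existential axioms can be made concrete by checking one of them against a toy interpretation describing a perfectly plausible situation. Here r stands in for the (unnamed) SNOMED CT relation linking the planned procedure to the procedure; the individual names are ours.

```python
# An interpretation in which a biopsy is planned but never carried out.
Biopsy_Planned = {"plan-1"}
Biopsy = set()            # no biopsy exists
r = set()                 # hence nothing is related to any biopsy

def exists(relation, filler):
    """Extension of ∃relation.filler: all x related to some y in filler."""
    return {x for (x, y) in relation if y in filler}

def subsumed(sub_extension, sup_extension):
    """An axiom C ⊑ D holds iff the extension of C is included in that of D."""
    return sub_extension <= sup_extension

# The axiom Biopsy_Planned ⊑ ∃r.Biopsy fails here: the DL statement
# (wrongly) forces an actual biopsy to exist for every planned one.
print(subsumed(Biopsy_Planned, exists(r, Biopsy)))  # → False
```

In DL terms, every model of the SNOMED axiom must contain an actual biopsy for each planned one, so the situation above, though medically unremarkable, is ruled out as logically impossible.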
These two examples illustrate how easy it is to create statements with unintended meanings when using even very simple DLs. The reason such examples are so common in current biomedical terminologies is that the ontology developers are often domain experts who are not familiar with the complexities of formal logic and pay too little attention to the principles of sound ontology development. They tend to be guided, rather, by the superficial simplicity of such statements, and thus do not realize that their logical interpretation contradicts the intended meaning. The resultant invalid statements then support invalid inferences when they are used for automated reasoning.
It is clear, however, that some ontology users will need in their work to define classes such as Biopsy_Plan or Drug_Abuse_Prevention. Because any non-negated use of existentially quantified roles in a DL formalism corresponds to a statement of the form "for all … there is some …", we must resort to so-called value restrictions if we are to bring about the needed effect. This means that the universal quantifier ∀ is used in a quantified role to specify the allowed range for a given relation. We could then (correctly) state the following:
In plain words, this expression states that a biopsy plan is a plan that – if realized – can be realized only by some instance of Biopsy. In contrast to the simple existential statements, this does not say that some Biopsy must exist for each Biopsy_Plan. Similar constructs are needed for other realizable entities, such as functions, roles and dispositions (Grenon 2003).
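The crucial difference in the semantics of the value restriction can again be sketched over a toy interpretation. The relation name realized_by is our hypothetical choice; the point is that ∀realized_by.Biopsy is satisfied even by a plan with no realization at all.

```python
# The same "no biopsy ever happened" situation as before.
Biopsy = set()                 # no biopsy exists
realized_by = set()            # the plan was never realized
domain = {"plan-1"}            # all individuals in the interpretation

def forall(relation, filler, domain):
    """Extension of ∀relation.filler: all x whose relation-successors
    (possibly none) all lie in filler."""
    return {x for x in domain
            if all(y in filler for (x2, y) in relation if x2 == x)}

# Unlike the existential, the value restriction is vacuously satisfied by
# an unrealized plan, so the axiom no longer forces a biopsy to exist:
print(forall(realized_by, Biopsy, domain))  # → {'plan-1'}
```

This is exactly the reading given in the text: if the plan is realized, it can only be realized by a biopsy, but nothing requires that it ever be realized.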
By using the universal quantifier ∀, however, we move away from simple but scalable DL dialects like EL (Baader et al. 2007) to DLs with a computational complexity that poses severe problems for large ontologies like SNOMED CT. It is even more complicated to define classes like Drug_Abuse_Prevention with the appropriate logical rigor. Here we need to say that if such a procedure is applied, then this causes a state in the organism that precludes the organism from participating in Drug_Abuse. So in order to express this properly, we need to introduce the negation operator ¬ as follows:
In this definition the class Person is instantiated twice, but it is not specified whether those two instances are identical – as they should be. No DL is able to express the fact that they are identical; this would require the full expressive power of FOL, and thus a move beyond the realm of decidability.
Other cases of medical terms that exceed the expressiveness of decidable description logics include expressions involving "without", such as "concussion of the brain without loss of consciousness", as discussed in (Bodenreider et al. 2004, Ceusters et al. 2007, Schulz et al. 2008). Such terms are highly relevant and important in medicine. Yet their representation is intricate, not only because of their demand for expressive logical constructors, but also because of the difficulty of agreeing univocally upon their meaning, taking into account tacit (again often time-related) assumptions.
The above examples clearly demonstrate the dilemma of logic-based representations: If the purpose is to logically encode and classify large terminological systems like SNOMED CT (Baader et al. 2006), then the set of allowed constructors must be limited, since value restrictions and negations lead to computational intractability. Some (Rector et al. 2008) nonetheless stress that it is important to include even computationally more expensive constructs so that adequate domain representations are not precluded. An alternative strategy is to distinguish the constructs contained within the terminology from their use in specific sentential contexts, where negation and other terms (such as “on examination”) are properly at home.
As should by now be clear, it is often not possible with computable, logic-based domain representation formalisms, like DLs, to truthfully represent important aspects of biomedical knowledge. Many types of assertions require other means of representation. We thus propose to distinguish between different categories of domain representation which call for distinct sorts of treatment even though they are often treated within formal ontologies as if they were similar. Our interest in keeping these categories apart is to highlight the fact that each representation requires its own formalisms with its own semantics, and that inadequate use of undifferentiated representation formalisms leads to unwanted results. As a result of our discussion, we also aim to contribute to a more clear-cut understanding of what formal ontologies can and cannot accomplish in the biomedical domain.
We use "lexico-semantic representation" to refer to thesauri, semantic lexicons and similar artifacts, which are centered on the meanings of the expressions found in natural language. Typically, they address both the fact that one lexical entry may have two or more meanings (as illustrated for example by the polysemy of terms such as "fracture" or "poisoning"), and the fact that one meaning may be expressed by one or more lexicon entries (for example the synonymy of "hyperthermia" and "fever"). They may also contain word or term translations. Thesauri and semantic lexicons may further contain semantic relations between the individual lexicon entries such as broader_than or narrower_than. WordNet (Fellbaum 1998), MeSH, and most parts of the UMLS Metathesaurus (NLMb 2008) are examples of such representation systems, which have a long tradition in library science, with literature retrieval as a widely accepted use case.
The question of how lexico-semantic relations such as synonymy should be correctly expressed is not in fact an issue to be addressed by ontologies. Ontologies concern themselves in a language-independent way with the entities in reality. They describe these entities and the relations between them, but do not describe the entities of (human) language itself, i.e., terms and the expressions relating them. Even where human language is used (alongside formal logical definitions) to describe the entities in reality, the goal of such descriptions is not to describe the language itself. Relations such as broader_than or narrower_than, which are semantically arbitrary subclassification relations (Obrst 2006) characterizing the MeSH thesaurus, are thus fundamentally different from the subclass (is_a) relation that defines the taxonomic backbone of a properly constructed ontology. As an example, in MeSH we find both Plasma narrower_than Blood and Fetal_Blood narrower_than Blood although, from an ontological point of view, the relations involved here are fundamentally different: in the first case we are dealing with a parthood (part_of) relation, but in the second with the subtype (is_a) relation. This difference may not matter in the relevant context, since the narrower_than relation, even though semantically ill-defined, fits perfectly well with the current needs of literature indexing and retrieval: articles on blood plasma are as relevant to a query on "blood" as are articles on fetal blood.
Problems arise already at the present stage of information retrieval when it is proposed to "ontologize" MeSH simply by mapping all narrower_than relations to taxonomic subsumption relations (Soualmia et al. 2004), such as Plasma ⊑ Blood and Fetal_Blood ⊑ Blood. For while the result is a seemingly perfect subsumption graph that can easily be processed by standard DL tools, this exercise creates a typical case of unintended models, since it ignores the true meaning of subsumption. Errors such as classifying plasma as a kind of blood are then the result.
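The failure mode of such a naive conversion can be sketched in a few lines: because the thesaurus relation conflates part_of and is_a, the converted hierarchy treats plasma as a kind of blood.

```python
# Two MeSH-style links that hide ontologically different relations:
narrower_than = {("Plasma", "Blood"),        # ontologically: part_of
                 ("Fetal_Blood", "Blood")}   # ontologically: is_a

# Naive "ontologization": every narrower_than link becomes a subsumption.
is_a = set(narrower_than)

def subclasses_of(cls):
    """Direct subclasses of cls in the converted hierarchy."""
    return {sub for (sub, sup) in is_a if sup == cls}

# A reasoner would now treat every portion of plasma as an instance of
# Blood, although plasma is a *part* of blood, not a kind of blood:
print(subclasses_of("Blood"))  # → {'Plasma', 'Fetal_Blood'}
```

Only the second of the two inferred subsumptions is ontologically sound, which is why an automated conversion can serve at best as a rough draft for manual curation, as noted below.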
While lexico-semantic relations have certain features in common with the ontological relations between entities in reality, the construction of an ontology out of a thesaurus requires numerous additional assumptions, for example concerning quantification. Hence, any automated conversion process cannot provide anything more than a raw sketch that requires careful manual elaboration and curation before it can be of any serious utility for inference purposes (Schulz et al. 2001).
Although we see lexicons or term lists as lying outside the realm of formal ontology, we want to stress that virtually all formal ontology applications require a link between ontology classes and lexical items. However, we advocate that these two issues should be treated by the two separate artifacts of formal ontologies, on the one hand, and lexico-semantic representations, on the other.
Scientific realism postulates the existence of an objective reality that can be studied by science and about which we can discover truths (Boyd 2002). A proper scientific theory, and hence a proper ontology, contains for example assertions to the effect that entities instantiating a given class stand in given relations to entities instantiating some other given class. It is important to stress that this account involves explicit recognition that all scientific assertions can rest on error and thus must be capable of being revised at every stage. Different theories of reality have been propounded – for example theories based on three- and four-dimensionalist approaches – but scientific realism as thus described is compatible with a wide range of such theories. While the realist view is still controversial and not shared by all ontology developers (Smith et al. 2006), it has a number of practical advantages. Thus, for example, it allows a view of ontologies as providing a canon of axiomatic assertions about simple relations between the most scientifically basic types of entities, which can then be taken for granted in further, more complex types of work. Examples of such assertions are “cells have membranes”, “hearts have cavities”, “every case of hepatitis is located in a liver”, “every aspirin tablet contains salicylate”, and so on.
It is useful to produce artifacts that afford computationally amenable automated reasoning on the basis of such assertions, as demonstrated above. However, this is not identical with the attempt to produce formal theories that aim at characterizing a domain in reality. In practical ontology engineering, these two objectives have to be reconciled. Experience in the use of the GO supports the thesis that features of reality can often be sufficiently well represented even through a relatively simple logic. However, as will be clear from our discussion of DLs above, we must thereby always bear in mind that such formalisms do not possess the richness necessary to create complete definitions in many cases. The necessary expressiveness conflicts with the need to construct computationally tractable models. It must therefore be accepted that ontologies (like scientific theories) provide only partial representations of reality. They state what is considered to be true of all instances of given classes: "There is no hepatitis outside the liver"; "there is no NaCl solution without chloride ions"; "there is no cell without a cell membrane". But it is quite clear that such statements constitute only a minor portion of the knowledge required to adequately capture a domain. As Rector (2008) expresses it, "There are very few interesting items of knowledge that are truly ontological in this strict sense." Yet it is also evident that such items are nonetheless crucially important, as they form the basis for all reasoning both by human beings and by computer applications.
Furthermore, it has been largely ignored so far that this kind of domain representation (statements about what is true of all instances of a class) is also present in numerous artifacts that are seldom identified as ontologies. UniProt, a large, central repository ("database") of protein data (Uniprot 2008), is a typical example. Under ontological scrutiny, most of its content describes protein types (not individuals) in terms of what is universally true for every single protein molecule of a given type. We therefore consider this kind of representation, too, as essentially ontological in nature.
The term “background knowledge” as used by Rector (2008) encompasses default knowledge, presumptive knowledge, and probabilistic knowledge, and refers to all kinds of statements that are assumed to be at least typically (but not necessarily universally) true in some domain and in some context. Such knowledge is traditionally conveyed through scientific textbooks in a highly context-dependent fashion, often invoking prototypical assertions, for example, concerning the relationship between diseases, signs and symptoms, or between adverse effects and drugs, that are expressed in terms of qualitative probabilities.
It is familiarity with this background knowledge, rather than familiarity with the knowledge that can be conveyed using formal ontologies, that distinguishes an expert from a novice, just as it marks the difference in content between a textbook and a dictionary. The examples below highlight how formal ontology approaches and logical representation formalisms reach their limits when it comes to representing this kind of knowledge. Using DL-based formalisms for even simplified accounts of prototypical knowledge would lead to flawed results. There exist other logical formalisms that are capable of expressing this kind of knowledge, but those formalisms are again computationally expensive if not undecidable.
One example type of background knowledge is default knowledge (Rector 2004, Hoehndorf et al. 2007). This is knowledge concerned with what can be assumed to be typically true in the absence of contravening evidence. DL does not give us the means to state what is typically true. But especially with regard to canonical anatomy vs. clinical anatomy (Smith et al. 2005), one would like to state that, e.g., hands normally have thumbs. A statement such as

Hand ⊑ ∃has_part.Thumb
does not appropriately account for this. It states that every hand has a thumb and rules out the possibility of a hand without a thumb; that is, it rules out non-prototypical hands (e.g. after accidents).
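What is wanted instead is a defeasible rule, which lies outside DL. A minimal sketch of such default reasoning follows; the ad hoc rule and exception tables are our own illustrative device, not any standard default-logic syntax.

```python
# "Hands normally have thumbs": a class-level default that individual
# (non-prototypical) hands can override.

defaults = {"Hand": {"has_thumb": True}}                  # typically true
exceptions = {"hand-after-accident": {"has_thumb": False}}  # contravening evidence

def holds(individual, cls, attribute):
    """Apply the class default unless the individual is a known exception."""
    if individual in exceptions and attribute in exceptions[individual]:
        return exceptions[individual][attribute]
    return defaults[cls][attribute]

print(holds("hand-1", "Hand", "has_thumb"))               # → True
print(holds("hand-after-accident", "Hand", "has_thumb"))  # → False
```

The conclusion for "hand-1" is nonmonotonic: it would be retracted if contravening evidence about that hand were later recorded, which is precisely what the universally quantified DL axiom cannot accommodate.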
Other statements of background knowledge are meta-statements concerning classes. They hold true when viewed as assertions about classes as wholes, but become false when viewed as assertions about instances. The DL view is that all statements about classes are statements about the corresponding sets of instances. When this is ignored, seemingly obvious subsumption statements like:
would then lead to the false conclusion that
The problem here is one of erroneously treating population-related properties of a given type, such as frequency, as properties that are inherited by the subtypes of this type. In the above, the symbol ⊑ (is_a) is used in two logically distinct senses, only one of which is sanctioned by DLs; the resultant is_a overloading has been identified as a typical error that occurs when building ontologies in an unprincipled way (Guarino 1999, Welty & Guarino 2001, Smith et al. 2004).
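The is_a overloading error can be sketched as follows. The disease names and the frequency label are invented for illustration; the point is only that a population-level property of a class is not inherited by its subclasses.

```python
# A population-level property attached to a class (not to its instances):
frequency = {"Diabetes": "frequent"}

# A genuine taxonomic link: the subtype really is a kind of diabetes.
is_a = {("Rare_Diabetes_Subtype", "Diabetes")}

def inherited_frequency(cls):
    """Naive inheritance: copy the superclass's frequency to the subclass,
    as if frequency held of every instance."""
    for (sub, sup) in is_a:
        if sub == cls and sup in frequency:
            return frequency[sup]
    return frequency.get(cls)

# The subclass wrongly "inherits" being frequent, although by construction
# it is rare -- frequency is a property of the population, not of instances:
print(inherited_frequency("Rare_Diabetes_Subtype"))  # → frequent
```

A correct treatment would keep such meta-statements about classes outside the taxonomy, rather than letting the subsumption machinery propagate them.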
Encoding non-trivial facts in formal ontologies may require complicated additional constructs, such as the addition of representations of dispositions to convey information about potentialities. It is important to note that dispositions can exist without ever being realized and even if we cannot tell the precise conditions in which this disposition is realized (Jansen 2007). An analgesic drug, for example, is a substance that has a disposition to treat pain. But it will realize this disposition only when administered in a certain way to a certain sort of patient. We can represent the class of processes of treating (a patient) for pain with:
We can then represent the class of dispositions realized when pain is treated:
The following definition now declares an Analgesic_Drug to be a substance in which this disposition inheres:
Such constructions can strongly affect the scalability of an ontology implementation, since a larger set of such expressions, e.g., for representing the pharmacodynamics of substances, cannot be handled efficiently by current reasoning algorithms.
The body of scientific and clinical assertions is not restricted to the expression of default assumptions and dispositional features. It also includes uncertain assertions, for instance, concerning the effect of a drug in treating a given disease, or concerning the existence of a suspected risk factor for a certain condition. For the aforementioned reasons, the encoding of such assertions in formal ontologies can be highly cumbersome and it is actually questionable whether such assertions should be encoded in a formal ontology in the first place.
As an example, an ontology is being created in the context of the European Union project @neurIST as a basis for the semantic mediation and integration of data in the area of brain aneurysms and subarachnoidal bleedings (Boeker et al. 2007). The data within the project originate from a multitude of sources and show a high degree of fragmentation and heterogeneity both in format and scale. The ontology needs to represent all relevant types of entities and also respect the various views of these entities held by the disciplines engaged in studying them, such as medicine or epidemiology. To do justice to all these aspects, the ontology applies dispositional statements in the formulation of class definitions and is split into two parts: (i) an ontology in the proper sense of the word and (ii) a set of representational artifacts capturing context-specific knowledge about certain facts, e.g., risk factors in clinical contexts. (A similar approach is also pursued by the Ontology for Biomedical Investigations (OBI 2008).) In the @neurIST ontology, the class Hypertensive_Disease is a subclass of Biological_Process_or_State that is associated with High_Blood_Pressure and causes some Rupture_Disposition, i.e., a disposition to the effect that an aneurysm will burst. This disposition is then further connected to the class Risk_Factor for Aneurysm_Rupture (and thereby identified as such), in that this latter class is also defined to be such that its instances cause some instance of Rupture_Disposition:
The following assertion is crucial to the study of aneurysms but transgresses the limits of formal ontology. It is incomplete in the sense that the constraints that are contextually defined and which make this statement valid are missing:
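The assertion in question, rendered in DL (a hedged reconstruction, since the original display is missing from this version), would be:

```latex
Hypertensive\_Disease \sqsubseteq Risk\_Factor\_for\_Aneurysm\_Rupture
```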
The above states that hypertensive disease is generally a risk factor, which is hardly convincing. On the other hand, hypertensive disease certainly happens to be a risk factor for cerebral aneurysms. So what we want to express is that there is a strong correlation between the two and this assertion is of fundamental importance (but there are, of course, other risk factors as well).
These examples show the sorts of steps which would have to be taken in order for a DL framework to be extended in such a way as to account for certain kinds of background knowledge, thus gaining the advantage of DL reasoning support without incurring the risk of unintended models.
However, the difficulty of representing all the hidden assumptions underlying background knowledge (and the performance problems that result from using the rich logic needed) may suggest that we instead use a much simpler triple-based representation, as mentioned in the introductory section, and devise special-purpose reasoning services to fit. Alternatively one might resort to a broad range of knowledge representation artifacts such as default logics (Reiter 1980), frames (Minsky 1974), F-logic (Kifer et al. 1989), and several kinds of computationally expensive DL extensions (Baader 2007, ch. 6). The resultant knowledge representation artifacts, however, are not formal ontologies as we use this term. Still, we can reuse the classes formally defined in an ontology as symbols in these formalisms, along the lines pointed out in our examples above.
Whereas the first three representation types described above make generalizations about all entities of some given kind, much of medicine involves descriptions of individual entities, such as a specific tumor, lab test, treatment episode, or the occurrence of a specific disease in a given patient group. The disciplines of epidemiology and public health deal with political and geographical entities such as Brazil, New Orleans, the Southern Pacific islands, or the upper Rio Negro region.
Statements of individual facts can be expressed in a straightforward manner in DL terms as instantiations of corresponding classes, or in other words, as so-called A-box assertions (with the letter “A” standing for assertions about individuals), as contrasted with the T-box component of DLs, which captures what is called “terminological knowledge” (or, perhaps better, “knowledge pertaining to types”). Consider for example,
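The displayed example is missing from this version; a plausible reconstruction, with an illustrative individual name d_1, is:

```latex
Hepatitis(d_1)
```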
which asserts that a particular disease is an instance of hepatitis.
A molecular interaction statement such as “Lmo-2 interacts with Elf-2” as published in a scientific article is, first of all, an assertion about certain individuals, namely two instances of Lmo-2 and Elf-2 portions (or molecule collections) that have been shown to exhibit some interaction in some specific experimental assay (Schulz et al. 2008).
Therefore we assert a particular interaction event in which the two substance portions under scrutiny participate:
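The display did not survive extraction; a hedged reconstruction, with illustrative individual names p_1, p_2, e and an assumed relation has_participant, is:

```latex
\begin{align*}
&Lmo\text{-}2(p_1) \qquad Elf\text{-}2(p_2) \qquad Interaction(e)\\
&has\_participant(e, p_1) \qquad has\_participant(e, p_2)
\end{align*}
```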
There are domains, like geography, in which individuals, not classes, constitute the primary targets of knowledge. Any detailed description of geographic or political divisions of the sort that would be of interest, for example, for epidemiology or public health, abounds in references to particular entities which instantiate only a small number of classes (Smith et al. 2005). For instance, a complete political division of the U.S. can be created on the basis of four nested levels (with one instance of the class Country, 50 instances of State, 3,077 instances of County, and over 50,000 instances of Municipality) (see also geographic entities in GAZ (Genomics Standard Consortium 2008)). Note the difference in representation compared to anatomical divisions in Table 4.
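In A-box terms, such a division consists almost entirely of assertions about named individuals instantiating a handful of classes; a sketch (individual and relation names illustrative):

```latex
\begin{align*}
&State(Louisiana) \qquad County(Orleans\_Parish)\\
&part\_of(Orleans\_Parish, Louisiana) \qquad part\_of(Louisiana, USA)
\end{align*}
```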
This example shows that assertions concerning classes differ formally from assertions about individuals. The relations employed are, however, the same, since DLs do not provide special relations holding between classes. Logically relating classes always requires the use of quantifiers, which are not needed in assertions relating individuals. This explains why, prior to any logic-based representation, it must be made clear whether the entities under scrutiny are classes or individuals. Especially in the field of molecular biology, however, this is not trivial at all. Thus our example assertion “Lmo-2 interacts with Elf-2” can perfectly well be understood as a universal statement concerning the class of Lmo-2 molecules, and thus as expressing dispositional knowledge in the sense of:
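A DL rendering of this universal, dispositional reading (a reconstruction; the relation name interacts_with is assumed) would be:

```latex
Lmo\text{-}2 \sqsubseteq \exists interacts\_with.Elf\text{-}2
```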
There are good arguments to be made on behalf of either reading, and so disambiguation cannot be effected without first analyzing the context in which the utterance is being made.
In practice, the individual/class boundary is often drawn in an idiosyncratic way. For example, UniProt entries are asserted to denote “instances” of the class protein. A computer scientist might contend that this choice of terminology is mainly motivated by the view the modeler has of a domain: “Deciding whether a particular concept is a class in an ontology or an individual instance depends on what the potential applications of the ontology are.” (Noy & Mcguinness 2001). We believe, however, that no arbitrariness should be involved in the distinction between this particular cell in this particular test tube here and now (an instance), and Cell (a class). Moreover, encouraging the supposition that there is such arbitrariness has the potential to lead to forking of representations which will hamper the very interoperability of data resources ontologies are intended to support.
Indeed we contend that a formal ontological analysis can be coherent only on the basis of a view of the distinction between individuals and classes as an unalterable distinction obtaining on the side of the entities themselves. Individuals, on the one hand, exist in space and time; they do not stand to each other in subsumption relations; they can be referred to by proper names and (in many cases) photographed. Classes, on the other hand, do not exist in space and time; they do stand to each other in subsumption relations; and they can be referred to by common nouns. Whether an entity is an individual or a class is thus not a matter of choice on the part of modelers, and, in our experience, the controversial cases which seem to suggest such optionality always reveal hidden ambiguities upon closer inspection. Some defenders of the view that the human MPDU-1 gene is an instance of the class Gene refer to genes as instances of information content entities, as in OBI (OBI 2008). The same genetic information entity can be encoded in different nucleic acid macromolecules, just as the same text can be disseminated in many hard copies. Others, however, claim that the human MPDU-1 gene is not an instance but a subclass of the class Gene; they are then referring to genes as types of macromolecular sequences, the instances of which are the real nucleotide sequences replicated in the cells of our bodies.
As we already saw in the section on background knowledge, an implicit reference to individuals underlies typical probabilistic statements. An example is the following statement: “In 2000, worldwide prevalence of diabetes mellitus was 2.8%”. Here we have two classes, viz. Human and (case of) Human_Diabetes. Both classes have a cardinality (integer value), and the prevalence is given by their ratio. The prevalence is therefore not a characteristic of the disease but of the population of persons who have a case of the disease. We here extend the DL notation by symbolizing the cardinality of the extension of a class (i.e., the number of instances) by enclosing the class name in “| |”.
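With this convention, and an assumed relation has_disease linking persons to their cases of disease, the prevalence statement can be sketched as:

```latex
\frac{\left|\,Human \sqcap \exists has\_disease.Human\_Diabetes\,\right|}
     {\left|\,Human\,\right|} = 0.028
```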
This demonstrates that probabilistic background knowledge could in principle be expressed by DL A-boxes extended by arithmetic operators over individuals. Such knowledge is therefore not within the scope of formal ontologies, any more than it is in alternative approaches such as probabilistic T-box extensions (Koller 1997, Klinov 2008); nor can it be expressed in currently available DLs.
The discipline of knowledge representation evolved in the context of artificial intelligence research with the purpose of enabling computers to draw new conclusions from existing data and information. When the term “ontologies” became popular in computer science in the nineties, it was thus often regarded as a new catchword for something that already existed, namely knowledge representation artifacts. However, two strands of research have evolved since that have demonstrated the need for a more principled methodology.
First, Description Logics (DLs) were developed as computable fragments of First-Order Logic (FOL) that are sufficiently expressive to allow the formulation of assertions about classes of individuals, as well as their relations, in such a way that new theorems can be derived automatically. This required a well-defined semantics calling basically for a bipartition into classes and individuals; it demanded also a formal account of subsumption and of role quantification. Whereas in more primitive, semantic-network-style representations, such as the UMLS Metathesaurus, statements such as “aspirin is a salicylate”, “aspirin contains an aromatic ring”, and “aspirin prevents myocardial infarction” all look quite similar, attempts at more formal representation reveal fundamental differences. In DL, the first statement is straightforward and does not require any relation beyond that of subclass, the second requires a quantified role expression, and the third cannot be adequately represented at all.
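The contrast between the first two statements can be sketched as follows (the relation name has_part is assumed); the third statement has no adequate DL counterpart:

```latex
\begin{align*}
&Aspirin \sqsubseteq Salicylate\\
&Aspirin \sqsubseteq \exists has\_part.Aromatic\_Ring
\end{align*}
```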
Secondly, in parallel to the evolution of the representational languages like OWL, philosophers and computer scientists confronted the history-laden discipline of philosophical ontology with the requirements of the modern information society and created the discipline of applied ontology (Guarino 1998). Biomedicine became a testbed for the convergence of DLs and applied ontology. The OBO Foundry effort, and increasingly the redesign activities of SNOMED CT, bear witness thereto.
We can now summarize the results of this paper by means of the crude delimitation of four kinds of statements we have introduced above, namely: (i) lexicosemantic representation, (ii) representation of types of entities, (iii) representation of background knowledge, and (iv) representation of individuals.
Our distinctions coincide to some degree with those proposed by Obrst (2006) in the Ontology Spectrum. Our first category corresponds to his “weak taxonomies and thesauri” and our second to logical theories (“strong ontologies”). The “weak ontologies” category in the Ontology Spectrum integrates aspects of both of these and is used in data modeling (UML) rather than for domain representation. While Obrst mentions the class vs. instance distinction in his portrayal of strong ontologies, he does not further elaborate on this distinction.
This is in line with the main argument we have attempted to convey in this communication, namely that knowledge representation – which might more properly be referred to as the modeling of beliefs widespread among scientists – is not a task for formal ontologies. Nor do formal ontologies describe entities properly belonging to the domain of human language. They represent different things, serve different purposes and use different formalisms. We postulate that a clearer understanding of these differences will facilitate the definition of more robust and useful interfaces between them, thereby reduce the occurrence of unintended models, and thus help to create a more rational basis for semantically interoperable systems in biology and medicine.
This work was supported by the European Union projects @neurIST and DEBUGIT, and by the National Institutes of Health through the NIH Roadmap for Medical Research, Grant 1 U 54 HG004028.
Stefan Schulz holds a medical degree (Heidelberg University, Germany) and is a senior researcher and professor at the Institute for Medical Biometry and Medical Informatics of the University Medical Center Freiburg, where he leads the Medical Informatics Research Group. His work focuses on biomedical terminologies and ontologies, biomedical knowledge representation, cross-language medical document retrieval, text and data mining in clinical document repositories, eLearning in medicine, and health informatics in developing countries. After clinical work in surgery and internal medicine he obtained his doctoral degree in the field of tropical hygiene, carrying out a parasitological field study in São Luís, Brazil. After obtaining a technical qualification in medical computing, he moved to the University of Freiburg, where he participated in clinical and educational software development projects and in several research projects in the fields of information extraction, biomedical terminologies, medical language engineering and semantic technologies. He has played a leading role in several EU-funded research projects. Stefan Schulz is the author of more than a hundred peer-reviewed publications and has received several awards. Since 2001 he has repeatedly contributed to Brazilian health informatics research projects as a visiting researcher at the Paraná Catholic University (PUC-PR).
Holger Stenzhorn is a computational linguist (Saarland University, Germany) and a research associate at the Institute for Medical Biometry and Medical Informatics of the University Medical Center Freiburg, Germany. His work focuses on the representation and management of information and data, ontologies and Semantic Web technologies, biomedical informatics, natural language processing, multimodal user interfaces, and software design and development. In the past he participated in the development of multilingual document retrieval, information extraction, and natural language generation systems, both in industry and academia. Currently, he is involved in several ontology engineering tasks: an ontology for research on cerebral aneurysms (EU-funded @neurIST project), an ontology for clinical trials on nephroblastoma and breast cancer (EU-funded ACGT project), and the BioTop top-domain ontology. Holger is a member of the W3C Healthcare and Life Sciences Interest Group.
1. A “workaround” exists to represent n-ary relations in OWL via reification – see http://www.w3.org/TR/swbp-n-aryRelations
Stefan Schulz, Institute for Medical Biometry and Medical Informatics, University Medical Center Freiburg, Freiburg, Germany.
Holger Stenzhorn, Institute for Medical Biometry and Medical Informatics, University Medical Center Freiburg, Freiburg, Germany.
Martin Boeker, Institute for Medical Biometry and Medical Informatics, University Medical Center Freiburg, Freiburg, Germany.
Barry Smith, Department of Philosophy, Center of Excellence in Bioinformatics and Life Sciences, and National Center for Biomedical Ontology, University at Buffalo, Buffalo, USA.