Formal representation of phenotypes
We logically define phenotypes by asserting an equivalence between each class in the pre-composed phenotype ontology and an EQ description. Each such description consists of the following elements: Q, the type of quality (characteristic) that the genotype affects; E, the type of entity that bears the quality; E2, an optional additional entity type, for relational qualities; and M, a modifier.
We can then translate the EQ description to an ontology language such as OBO Format or OWL (Web Ontology Language). This allows us to use powerful general-purpose ontology tools such as automated reasoners to query and manipulate phenotype descriptions, and to compute subsumption hierarchies in phenotype ontologies (Figure 3). Ontology languages provide a means of composing descriptions in a logically unambiguous fashion as intersections between classes. The modeling strategy used is described in detail elsewhere [23], but a brief summary follows here as background.
Figure 3 Equivalence relations between MP classes and EQ descriptions. Equivalence relations between two MP classes and their equivalent EQ descriptions. Here we treat MP 'degeneration' terms as in the PATO quality (Q) 'degenerate', rather than the process of ...
We use the formal inheres_in relation for relating qualities to their bearers. We treat the phenotype 'femur shape' as the class intersection of (a) the class 'shape' and (b) the class of all things that stand in an inheres_in relationship to a 'femur'.
In OBO Format this is written as:
intersection_of: PATO:0000052 ! shape
intersection_of: inheres_in MA:0001359 ! femur
Note that the text after the '!' is merely a comment, not a part of the format, used here to provide the human readable name for that class.
This can be read as a genus-differentia style definition, a <shape that inheres_in a femur>. We translate any EQ pair to <Q that inheres_in E>. For relational qualities we use the towards relation to connect the quality to the additional entity type on which the quality depends (for example, the concentration in urine of calcium). Here we use a simple 'EQ syntax' to explain our results, although the underlying representation is in OBO format (OBO Format, 2009). The accompanying table shows the mapping between these two schemes. Our equivalence mappings are available in both OBO and OWL formats from the PATO wiki [24], or alternatively from the OBO logical definitions download page [25].
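The translation from an EQ description to OBO Format intersection_of lines can be sketched in a few lines of Python. This is only an illustrative sketch, not the tooling used to build the mappings: the names Term, EQ and eq_to_obo are hypothetical, and only the Q, E and E2 slots are handled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Term:
    """An ontology class: its ID plus a human-readable name."""
    id: str
    name: str


@dataclass(frozen=True)
class EQ:
    """A minimal EQ description (hypothetical helper type)."""
    quality: Term                          # Q
    entity: Term                           # E
    related_entity: Optional[Term] = None  # E2, for relational qualities


def eq_to_obo(eq: EQ) -> List[str]:
    """Render <Q that inheres_in E [towards E2]> as intersection_of lines."""
    lines = [
        # Genus: the quality itself.
        f"intersection_of: {eq.quality.id} ! {eq.quality.name}",
        # Differentia: the bearer of the quality.
        f"intersection_of: inheres_in {eq.entity.id} ! {eq.entity.name}",
    ]
    if eq.related_entity is not None:
        # Relational qualities point at a second entity via 'towards'.
        lines.append(
            f"intersection_of: towards {eq.related_entity.id} "
            f"! {eq.related_entity.name}"
        )
    return lines


# 'femur shape', using the IDs quoted in the text.
femur_shape = EQ(Term("PATO:0000052", "shape"), Term("MA:0001359", "femur"))
print("\n".join(eq_to_obo(femur_shape)))
```

For the femur example this prints the same two intersection_of lines shown earlier; supplying a related_entity adds a third line using the towards relation.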
Translation between variables in EQ templates and logic based OBO or OWL class intersections
We have developed a collection of equivalence mappings from classes in pre-composed phenotype ontologies to PATO-based formal description structures; we call these collections of mappings 'XP' ontologies (the 'XP' stands for cross-product). The descriptions are drawn from the cross-product of two sets of classes: the set of PATO classes and the set of classes from other OBO ontologies. For example, MP-XP is a collection of mappings between individual MP classes and their corresponding EQ descriptions. We can further partition the sets according to this scheme - for example, MP-XP-MA is the collection of such mappings whose descriptions are drawn from the cross-product of PATO classes and MA classes. Note that the mappings are all intended to be ones of equivalence - the EQ description should be neither more general nor more specific than the mapped pre-composed class.
In this paper we focus on the MP ontology. This is partly because of its relevance to translational research, its maturity and its comprehensiveness (6,844 classes), and partly to fulfill the data analysis needs of a particular project [20]. However, we also present preliminary results in mapping other pre-composed phenotype ontologies: HP, WP and TO. The last of these was chosen to demonstrate the applicability of the technique outside metazoans. The mapping of the portion of HP corresponding to musculoskeletal phenotypes is described elsewhere [17].
The total numbers of classes from MP, HP, WP and TO that we can map to PATO-based cross-product descriptions are summarized in the accompanying table. We attempt to achieve maximal coverage by combining initial automated term syntax parsing (see Materials and methods section) with subsequent manual curation of the results to check for biological validity. The MP-XP set has been curated most extensively, and of that set, the MP-XP-CL subset has been analyzed most thoroughly.
Summary of equivalence mapping results
Phenotypic mapping groups
The phenotype mappings fell into different overlapping categories, such as those based on basic anatomy, abnormality, compositional descriptions, processes, relational descriptions and absence. These categories are described in the following sections, and the accompanying table shows examples of these phenotype classes and the breakdown of their EQ descriptions.
Examples of equivalence mappings between pre-composed phenotype classes and EQ descriptions
Basic anatomical phenotypes
Most of the classes in the pre-composed phenotype ontologies are gross anatomy phenotypes - they can be defined in terms of a quality of some part of the body. For example: MP:decreased diameter of femur*; MP:hypothalamus hypoplasia; MP:large lymphoid organs; MP:muscular atrophy; MP:truncated notochord*; MP:motor neuron degeneration*; MP:axon degeneration*; HP:narrow pelvis*; TO:leaf area*; WP:shrunken intestine*; MP:situs inversus* (examples marked with an asterisk are shown in the table of example equivalence mappings).
The first step to creating mappings for these pre-composed phenotypes is selection of the appropriate anatomical ontology. For worm and plant phenotypes, there is a single unified gross anatomy ontology covering each. For human phenotypes from HP, we use the FMA, and although the FMA does not include developing structures, this is not currently a limitation because the HP does not include many phenotypes for developing structures such as 'neural tube'.
The MP is intended as a mammalian phenotype ontology. Although most of the phenotypes defined are applicable to all mammals (and sometimes more general taxa) there is a bias towards mouse, as this ontology is generally used for mouse genotype annotation. This, and the fact that there was no general mammalian anatomy ontology, led us to use solely mouse anatomy (MA) ontologies for the decomposition of MP. We used MA (the adult mouse anatomy ontology) wherever possible. EMAP (Theiler stages 1 to 26) posed a problem due to the lack of generalized classes for developmental structures, such as 'notochord', forcing us to choose an arbitrary time stage-specific class (for example, 'notochord at TS20' to define 'truncated notochord'; see the table of example equivalence mappings). For cellular phenotypes such as 'motor neuron degeneration' we used CL, which is applicable across all taxa. For subcellular anatomy phenotypes, such as 'axon degeneration', we used the GO-CC ontology (also applicable across all taxa).
Many of the anatomical phenotypes are of the form 'abnormal X morphology' or 'increased/decreased size of X', where X is a class in the anatomy ontology or the cell ontology. Equivalence mappings for these were initially generated automatically (see Materials and methods). Manual assistance is required to map clinical terms such as 'situs inversus' (MP) to precise EQ descriptions (see Discussion).
The majority of all mapped phenotype classes fall into this category. This holds across all phenotype ontologies, but particularly for HP, which is by nature highly morphological.
Both MP and HP are ontologies of abnormal phenotypes. Many classes are of the form 'abnormal X', where the exact nature of the abnormality is not specified; for example: MP:abnormal neuroepithelium of ampullary crest; MP:abnormal septation of the cloaca; HP:abnormality of vision*.
Here we elide a detailed discussion of what constitutes 'normal' or 'abnormal', as this is beyond the scope of this paper. We simply use a has_qualifier relation to replicate the intended structure of the MP class.
Note that the WP does not classify phenotypes as abnormal, but rather as 'variants'.
Compositional descriptions of anatomical entities
Mapping a class such as abnormal Purkinje cell dendrite morphology* (MP:0008572) requires a slight variation on the basic EQ scheme. 'Purkinje cell' is represented in CL, and 'dendrite' is represented in GO-CC, but GO-CC does not specifically pre-compose 'Purkinje cell dendrite'. Logically, this presents no problem, as we can make an anonymous class defined using an intersection construct to specify this entity, using the part_of relation from the Relations Ontology. To accomplish this, we extended the simple EQ syntax such that we can use compositional expressions as IDs [26], and write the following:
E = dendrite^part_of(Purkinje_cell) Q = morphology M = abnormal
When translating the above EQ description to OBO or OWL we end up with a nested description, for example, in OWL Manchester syntax:
morphology that inheres_in some (dendrite that part_of some Purkinje cell) and has_qualifier some abnormal
However, tools that are downstream consumers of nested MP-XP class expressions must be able to interpret these appropriately, and the additional expressivity may pose problems for these tools. In addition, we need a way in which to present the descriptions in an intuitive manner to biologists.
We therefore extended EQ syntax to include the EW (Entity Whole) tag as below:
E = dendrite EW = Purkinje cell Q = morphology M = abnormal
This is equivalent to the above EQ description, but is simpler for tools to deal with, and simpler to present in tabular form to users.
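The equivalence between the flat EW form and the nested expression can be illustrated with a short Python sketch. The function name eq_to_manchester is hypothetical, and the output uses class labels in the simplified Manchester-style rendering shown above rather than quoted labels or IRIs.

```python
from typing import Optional


def eq_to_manchester(e: str, q: str,
                     ew: Optional[str] = None,
                     m: Optional[str] = None) -> str:
    """Expand a flat E/EW/Q/M record into a nested class expression."""
    # EW (Entity Whole) becomes a part_of restriction on the bearer.
    bearer = f"({e} that part_of some {ew})" if ew else e
    expr = f"{q} that inheres_in some {bearer}"
    if m:
        # The modifier is attached with has_qualifier.
        expr += f" and has_qualifier some {m}"
    return expr


# E = dendrite, EW = Purkinje cell, Q = morphology, M = abnormal
expr = eq_to_manchester("dendrite", "morphology",
                        ew="Purkinje cell", m="abnormal")
print(expr)
```

A downstream tool can therefore store the four tabular fields and regenerate the nested expression only when it is needed for reasoning.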
This approach could be termed 'post-compositional', as the expression denoting the anatomical entity class is created after the anatomical entity ontology is deployed. However, the terminology becomes confusing here, so we reserve the term post-compositional specifically for the creation of such expressions at annotation time.
Process oriented phenotypes
A significant number of classes in MP are described in terms of a biological process rather than a static description of an anatomical part. Examples include: MP:delayed kidney development*; MP:increased mast cell degranulation; TO:respiration rate; WP:hyperactive egg laying; HP:impaired spermatogenesis.
For these classes, we used PATO in combination with GO biological process (GO-BP) classes. PATO is divided at the top level between qualities of biological objects and qualities of processes. The former includes qualities such as size, shape, and structure and is used in conjunction with anatomical classes. The latter includes temporal qualities such as delayed, increased rate and is used in conjunction with GO-BP classes.
Chemical entities and relational qualities
MP definitions occasionally reference types of chemical entities. For example: MP:hypocalciuria (excretion of abnormally low amounts of calcium in the urine); MP:abnormal spleen iron level*; TO:abscisic acid concentration.
Here we used the CHEBI ontology, typically using the CHEBI class as the related entity for a relational quality, where the bearer entity is a body substance such as blood or urine. In EQ syntax we would write the definition of hypocalciuria as:
E = urine Q = decreased concentration of E2 = calcium
For phenotypes that reference specific proteins such as 'interleukin-1' we can use the OBO PRO. At this time, the PRO does not include many of the required classes but these are easily added to the MP-XP definitions when they become available.
Absence or change in number of parts
Mutations in or deletions of genes may result in the loss of a body part, or a change in the number of parts. Some example phenotypes are: MP:absent middle ear ossicles; MP:loss of basal ganglia neurons*; MP:alopecia (loss of hair); MP:absent spleen; WP:no oocytes; HP:polydactyly.
With PATO we typically describe absence in terms of the entity that is missing the part. For example, the following is problematic:
Q = absent E = spleen
Logically this is incoherent because there is no spleen to possess the quality of non-existence. Instead we can use a cognate 'relational quality' in order to compose a description:
E = abdomen Q = lacking all parts of type E2 = spleen
This second form is both more coherent and more expressive. For example, in defining 'loss of basal ganglia neurons' we can say:
E = basal ganglion Q = has fewer parts of type E2 = neuron
This obviates the need for a class 'basal ganglion neuron' (not present in the mouse anatomy ontology or the cell ontology). These PATO classes are grouped under the PATO class 'has number of' and have logical definitions that can be used in reasoning.
When translating 'absence' phenotypes to representations in an ontology language such as OBO or OWL we have the option of treating the above description as a logical construct called a cardinality restriction. In OWL Manchester Syntax the absent spleen phenotype could be written as:
Abdomen that has_part exactly 0 spleen
This works for stating a number or number range, but cannot be used to state a relative increase or decrease in number. Another issue with the explicit representation is that it can create inconsistencies if it contradicts what is stated in the anatomy ontology. A full discussion is outside the scope of this paper, but one solution that has been previously proposed is to use non-monotonic logic [27].
Validation using automated reasoners
A reasoner can be used to automatically classify (that is, place terms in the is_a hierarchy) a compositional ontology, such as a pre-composed phenotype ontology. We can also reverse the direction of implication, and use reasoners to validate the XP mappings based on the existing asserted is_a links in these ontologies. We used a variety of reasoning strategies to validate the MP mappings to EQs.
For each pre-composed phenotype ontology, we reasoned over the combined set consisting of the phenotype ontology, the XP mappings, and the ontologies referenced in those mappings. This yielded additional is_a links in the phenotype ontology, which were submitted to the maintainers of the ontology for approval, and often resulted in improvements to the ontology. For example, the reasoner suggested 'Purkinje cell degeneration' is_a 'neuron degeneration' (inferred from the CL is_a hierarchy), which was previously missing from MP, and was promptly added [28]. In other cases the reasoner suggestions were rejected, because of problems in either the XP mappings or the referenced ontologies.
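The kind of inference described here can be illustrated with a toy structural reasoner in Python. This is a deliberately simplified sketch and not the OBO-Edit or database-backed reasoner used in this work: it handles only is_a links and plain <Q that inheres_in E> definitions, and the direct 'Purkinje cell' is_a 'neuron' link in the toy data is an assumption made for brevity.

```python
from typing import Dict, Set, Tuple

# Toy is_a links in the referenced ontologies (child -> direct parents).
IS_A: Dict[str, Set[str]] = {
    "Purkinje cell": {"neuron"},
    "neuron": {"cell"},
}

# Logical definitions: phenotype class -> (Q, E).
DEFS: Dict[str, Tuple[str, str]] = {
    "Purkinje cell degeneration": ("degenerate", "Purkinje cell"),
    "neuron degeneration": ("degenerate", "neuron"),
}


def ancestors_or_self(term: str) -> Set[str]:
    """Reflexive, transitive closure of is_a starting from one term."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in IS_A.get(stack.pop(), set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def inferred_is_a() -> Set[Tuple[str, str]]:
    """A is_a B when A's Q and E are each subsumed by B's Q and E."""
    links = set()
    for child, (cq, ce) in DEFS.items():
        for parent, (pq, pe) in DEFS.items():
            if (child != parent
                    and pq in ancestors_or_self(cq)
                    and pe in ancestors_or_self(ce)):
                links.add((child, parent))
    return links


print(inferred_is_a())
```

With these toy inputs the only inferred link is 'Purkinje cell degeneration' is_a 'neuron degeneration', mirroring the example in the text. A real reasoner must also handle part_of, nested expressions and qualifiers.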
To validate this approach, we examined a particular subset, MP-XP-CL, the terms in MP for which there are mappings that involve CL. Using the OBO-Edit reasoner we inferred the existence of 88 possibly missing is_a relationships in MP. These were submitted to the MP curator for review. Of these, 48 were deemed to be correct, and the new links were added to the MP graph. One link was only partially correct, and resulted in a small rearrangement of a portion of the MP graph. Twenty-two links were rejected outright, and traced back to errors in the MP-XP-CL mappings, which were subsequently fixed. The remaining 17 are still pending, and mostly derive from inconsistencies between classification of normal cells in CL and abnormal cells in MP.
We also performed a partial validation of the mappings by attempting to recapitulate is_a links asserted in existing phenotype ontologies. We started by removing all is_a links from the phenotype ontology (but not from the ontologies referenced in the mappings) and attempted to recover these links using a reasoner. We found that 37% of the existing links in MP and 14% of the links in HP can be automatically reconstructed (see the table of reasoner-inferred links). Of the false negatives (relationships between mapped classes that we cannot reconstruct), the problem was often an absence of supporting links in the referenced ontologies. For example, MP contains the statement 'asymmetric snout' is_a 'abnormal facial morphology'. At the time of reasoning, the MA contained no relationships linking the classes 'face' and 'snout', which means there is no way to infer the stated MP link from first principles. After discussion, the MA curator (TF Hayamizu, personal communication) added a part_of link to the ontology between 'snout' and 'face', which was sufficient to allow inference of the MP link from the logical definitions. This is an example of how the combination of composing logical descriptions and using a reasoner can contribute to the development of a suite of ontologies, making them more consistent with one another. This is a guiding principle of the OBO Foundry. The same table also lists the novel relationships inferred by the reasoner; not all have been evaluated, and some will be true positives that will result in additions to the MP, such as the previously mentioned Purkinje cell example.
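The bookkeeping behind this recapitulation test amounts to comparing two sets of links. The Python sketch below shows one way to do it; the function compare_links and the toy link sets are hypothetical (the 'child_1'/'parent_1' pair is a placeholder, not a real MP class), and the reported 37% and 14% figures come from the full ontologies rather than from this example.

```python
from typing import Dict, Set, Tuple

Link = Tuple[str, str]  # (child, parent)


def compare_links(asserted: Set[Link],
                  inferred: Set[Link]) -> Dict[str, object]:
    """Split links into recovered, missed and novel, and compute recall."""
    recovered = asserted & inferred
    return {
        "recovered": recovered,          # asserted and re-derived
        "missed": asserted - inferred,   # false negatives
        "novel": inferred - asserted,    # candidate new links for review
        "recall": len(recovered) / len(asserted) if asserted else 0.0,
    }


# Toy data only: one asserted link is recovered, one is missed,
# and the reasoner proposes one link that was not asserted.
asserted = {
    ("asymmetric snout", "abnormal facial morphology"),
    ("child_1", "parent_1"),
}
inferred = {
    ("child_1", "parent_1"),
    ("Purkinje cell degeneration", "neuron degeneration"),
}

report = compare_links(asserted, inferred)
print(f"recall = {report['recall']:.0%}")
```

Links in the "missed" group point to gaps in the mappings or in the referenced ontologies, while links in the "novel" group are the ones that would be sent to curators for review.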
Reasoner-inferred links for both human and mouse
One problem we encountered was that the size of the combined ontologies proved too much for existing memory-bound reasoners to handle. We used two strategies to overcome this: using a relational database backed reasoner, which is not memory bound [29]; and ontology segmentation - dividing the reasoned set into manageable subsets. For example, rather than reasoning over all the ontologies referenced in MP-XP, we would select individual pair-wise subsets, such as MP-XP-MA, and reason over these sequentially. Both approaches have strengths and drawbacks; the relational database approach is too slow to be part of the ontology development cycle, and the simple pair-wise strategy can give incomplete results for complex phenotypes involving classes from more than one other ontology.
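A minimal Python sketch of pair-wise segmentation is given below, assuming each mapping is available as a list of the class IDs it references. The function names and the IDs containing 'EXAMPLE' or the 'EX:' prefix are hypothetical placeholders; only PATO:0000052, MA:0001359 and MP:0008572 are taken from the text.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def prefixes(class_ids: List[str]) -> Set[str]:
    """Ontology prefixes of a list of IDs, e.g. 'MA:0001359' -> 'MA'."""
    return {cid.split(":", 1)[0] for cid in class_ids}


def segment(mappings: Dict[str, List[str]]
            ) -> Tuple[Dict[str, List[str]], List[str]]:
    """Group mappings by the single non-PATO ontology they reference.

    Mappings that reference more than one other ontology are returned
    separately, since pair-wise reasoning may be incomplete for them.
    """
    segments: Dict[str, List[str]] = defaultdict(list)
    multi: List[str] = []
    for phenotype, refs in mappings.items():
        external = prefixes(refs) - {"PATO"}
        if len(external) == 1:
            segments[next(iter(external))].append(phenotype)
        elif len(external) > 1:
            multi.append(phenotype)
    return dict(segments), multi


mappings = {
    # 'femur shape': PATO quality plus one MA class.
    "EX:femur_shape": ["PATO:0000052", "MA:0001359"],
    # abnormal Purkinje cell dendrite morphology: needs both CL and GO.
    "MP:0008572": ["PATO:EXAMPLE", "GO:EXAMPLE", "CL:EXAMPLE"],
}

segments, multi = segment(mappings)
print(segments)
print(multi)
```

Here the femur mapping falls into the MA segment, while MP:0008572 is flagged because it draws on both CL and GO. This is the kind of phenotype for which the simple pair-wise strategy can miss inferences.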
A multi-species anatomy ontology for translational research
Our results show how classes in phenotype ontologies can be mapped to logical descriptions utilizing species-centric anatomical ontologies plus PATO qualities. These mappings enable us to query a mouse dataset, annotated using MP IDs such as MP:0001314 (corneal opacity), using the MA class 'cornea'. However, if we wish to query across combined multi-species datasets for all morphological phenotypes of the cornea, we need a more generalized class representing that which is shared by all vertebrate corneas. We have commenced construction of such a multi-species anatomical ontology, called Uber-ontology or Uberon. The current version of Uberon consists of over 2,800 classes, and it also contains links to over 9,300 classes in external, mostly species-centric anatomical ontologies. We do not attempt to generalize beyond metazoans [30]. Uberon is available from the main OBO website [31].