Motivation: The anatomy of model species is described in ontologies, which are used to standardize the annotations of experimental data, such as gene expression patterns. To compare such data between species, we need to establish relations between ontologies describing different species.
Results: We present a new algorithm, and its implementation in the software Homolonto, to create new relationships between anatomical ontologies, based on the homology concept. Homolonto uses a supervised ontology alignment approach. Several alignments can be merged, forming homology groups. We also present an algorithm to generate relationships between these homology groups. This has been used to build a multi-species ontology, for the database of gene expression evolution Bgee.
Availability: download section of the Bgee website http://bgee.unil.ch/
Supplementary information: Supplementary data are available at Bioinformatics online.
Databases dedicated to model species rely on the usage of ontologies, for example the zebrafish anatomy for ZFIN (Sprague et al., 2006), or the Mouse gross anatomy and development (Baldock et al., 2003). Such ontologies of anatomy and development facilitate the organization of functional data pertaining to a species. For example, all gene expression patterns described in ZFIN are annotated using the zebrafish anatomical ontology. A list of such ontologies is kept on the Open Biomedical Ontologies (OBO) website (Smith et al., 2007).
To pool the experimental data from different model species, we need to encode corresponding information between ontologies which describe different anatomies (e.g. zebrafish and human). For example, we are interested in integrating and comparing gene expression patterns between several species (Bastian et al., 2008). The most widely accepted criterion to make such comparisons in biology is homology (Hall, 1994; Hossfeld and Olsson, 2005). When we compare two elements, whether or not they are derived from the same ancestral element defines our expectation of similarity between them, and the interpretation of differences. For example, since a chicken wing is not homologous to a fly wing, we do not expect the same underlying structures, and similarities can be attributed to functional convergence. By contrast, the chicken wing is homologous (as a limb) to the human arm; thus we do expect the same underlying structures, and differences can be attributed to divergent evolution. There are different definitions of homology (Roux and Robinson-Rechavi, 2010), and our algorithm does not in itself impose one on the user. We do recommend choosing an explicit definition and using it consistently throughout the analysis.
In practice, hundreds of terms must be compared between ontologies that may differ both in the actual biology modeled (i.e. a fish is not a mammal) and in the representation used. Although a purely manual annotation of homologies is possible, it would be too time consuming to be done for all terms between several divergent species. Kruger et al. (2007) have used a manual approach to find similarities between simplified anatomy ontologies for human and mouse. As both are mammals, they share most structures and terminology. There are also on-going efforts to integrate anatomical ontologies (Haendel et al., 2008; Washington et al., 2009), which are often geared towards the comparison of phenotypes (Lussier and Li, 2004). As far as we know, the question of using homology to align anatomical ontologies has never been explicitly addressed.
Since the problem is to find correspondences between the concepts of two ontologies, we draw on methods from ‘schema matching’, or ‘ontology alignment’ (Euzenat and Shvaiko, 2007; Lambrix and He, 2008). As opposed to more generalist solutions, we present an algorithm which is specialized in the alignment of anatomical ontologies. The specificities of these ontologies include high redundancy of terms, and few types of relations. Finally, a specific issue is that structures which have the same name and are related to similar concepts may not be homologous. This is the case for the insect eye and the mammalian eye. While some underlying molecular mechanisms are similar, these structures evolved independently and are not considered homologous (discussed in Hall, 1994; Shubin et al., 2009). Unsupervised alignment algorithms would misleadingly align such similarities; this is for instance the case for the LOOM software used on the NCBO portal (Ghazvinian et al., 2009).
In principle, an alignment algorithm should aim at finding the largest number of true positives, while avoiding false positives. In practice, our experience is that the size and structure of anatomical ontologies lead to very large numbers of false positives if a naive approach is taken (i.e. matching on common words). Thus, the basic aim of Homolonto is to present the best candidate pairs of homologs to the user first, and to avoid the need to consider many irrelevant pairs.
Ontology alignment is the process of determining correspondences between ontology concepts. We present our approach based on the classification of ontology matching systems proposed by Euzenat and Shvaiko (2007; Shvaiko and Euzenat, 2005).
Biological ontologies simplify some aspects relative to the general case. The types of concepts (e.g. anatomical structures) and the relationships (e.g. part_of) are known in advance, and known to be common between the ontologies to align. Moreover, in the present implementation we only seek to establish one type of relation, homology.
Our algorithm can be described as a composite system (Fig. 1), using: (i) language-based comparison of names with tokenization (element level, syntactic technique); (ii) graph-based matching of children of elements (structure level, syntactic technique); (iii) data analysis, e.g. statistics on word occurrence (structure level, syntactic technique); (iv) external input from the user (element level, external technique; classification following Euzenat and Shvaiko, 2007). We combine the results in parallel, as opposed to in sequence, by using a sum of scores from different techniques. Thus, we make use of both schema and element level information. In a first step, the algorithm produces anchors at the element level, generated by the language-based technique and potentially by the user (external); it then uses information from the schema, the elements and user input to improve the alignment based on these anchors.
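The parallel combination of techniques by a sum of scores can be illustrated with a minimal sketch. The scoring equations themselves are not reproduced here, so the functions below (`tokenize`, `word_frequencies`, `name_score`, `children_score`, `composite_score`) and their weighting are hypothetical stand-ins chosen for illustration, not the formulas used by Homolonto.

```python
import re
from collections import Counter


def tokenize(name):
    """Split a term name into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", name.lower())


def word_frequencies(names):
    """Count in how many term names each word occurs (data analysis step)."""
    counts = Counter()
    for name in names:
        counts.update(set(tokenize(name)))
    return counts


def name_score(name_a, name_b, counts):
    """Element level: shared words, each down-weighted by its frequency.

    The 1/frequency weighting is an assumption made for this sketch.
    """
    shared = set(tokenize(name_a)) & set(tokenize(name_b))
    return sum(1.0 / max(counts[w], 1) for w in shared)


def children_score(children_a, children_b, validated):
    """Structure level: number of child pairs already validated as homologs."""
    return float(sum((a, b) in validated for a in children_a for b in children_b))


def composite_score(name_a, name_b, children_a, children_b,
                    counts, validated, user_score=0.0):
    """Combine the techniques in parallel by summing their scores."""
    return (name_score(name_a, name_b, counts)
            + children_score(children_a, children_b, validated)
            + user_score)
```

With such a scheme, a frequent word like ‘somite’ contributes less to a score than a rare word, and a pair of terms whose children have already been validated as homologs is promoted even when the names differ.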
Importantly, each proposition of homology between elements must be validated by the user (external input), to take into account such cases as the eye, discussed in the ‘Introduction’ section. Thus our process is a supervised one.
Finally, we note that the alignment we obtain is of the form many to many, not one to one.
A central concept in our algorithm is that of a ‘proposition’ (similar to ‘suggestion’ in Lambrix and He, 2008). A proposition is a pair of terms (also called ‘class’ in OWL) from the two ontologies for which a score has been computed. This may have been done based on homonymy (common words) of the term names (also called ‘class label’ in OWL), or propagation through the ontology. It is important to note (i) that not all possible propositions (i.e. pairs of terms) are created during the alignment, and (ii) that the list of propositions evolves during iterations of the algorithm.
For performance, our algorithm is not symmetric. Propositions are managed relative to one ontology, ‘to align’, which is being aligned to the ‘reference ontology’ (the one loaded first by the user). This allows us to store explicitly the information that term A of the ontology to align has two propositions, with term X and with term Y, of the reference ontology. If X has propositions not only with A but also with B of the ontology to align, this will not be taken into account explicitly.
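The asymmetric bookkeeping described above can be sketched as a store keyed by the terms of the ontology to align. The class and attribute names (`Proposition`, `PropositionStore`, `best_pending`) are hypothetical and only illustrate the idea; the actual data structures of Homolonto are not described in this section.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Proposition:
    """A scored pair of terms; validated is None until the user decides."""
    term_to_align: str
    reference_term: str
    score: float
    validated: Optional[bool] = None


class PropositionStore:
    """Propositions indexed only by the term of the ontology to align."""

    def __init__(self) -> None:
        self._by_term: Dict[str, Dict[str, Proposition]] = {}

    def add(self, term_to_align: str, reference_term: str, score: float) -> None:
        self._by_term.setdefault(term_to_align, {})[reference_term] = Proposition(
            term_to_align, reference_term, score)

    def for_term(self, term_to_align: str) -> List[Proposition]:
        """All propositions of one term to align, best score first."""
        props = self._by_term.get(term_to_align, {}).values()
        return sorted(props, key=lambda p: p.score, reverse=True)

    def best_pending(self) -> Optional[Proposition]:
        """Highest-scoring proposition that the user has not yet judged."""
        pending = [p for props in self._by_term.values()
                   for p in props.values() if p.validated is None]
        return max(pending, key=lambda p: p.score, default=None)
```

In this layout, looking up the propositions of term A is a direct dictionary access, whereas finding every term to align that points to reference term X would require a scan, which mirrors the asymmetry described in the text.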
Homolonto displays the input OBO ontologies as a tree. The user may browse the ontology, and a basic ‘find’ tool has been implemented. Before starting the alignment algorithm, the user can manually specify homology relations. This allows potential anchoring of structures with very different names between species, based on known biology (e.g. limb and fin). Once the alignment algorithm is run, a new window opens and displays the best propositions, one at a time, in order of score. For each term of a proposition, the parents are shown for two levels, to help the decision. Clicking on a term identifier opens the first occurrence of that term in the ontology browser window, where the user can check for more information (e.g. synonyms, develops_from relations). Decisions can be annotated with comments and with a link, similar to the ‘dbxref’ field of OBO-Edit (Day-Richter et al., 2007).
To facilitate alignment of large ontologies, keyboard shortcuts are implemented for the most common decisions: enter key = validation as homology plus computation and iteration; escape key = invalidate plus computation and iteration; right and left arrows to see the next and previous propositions without computation.
When several pairwise alignments have been conducted, Homolonto offers a function to reconcile them, if they share a common ontology. Thus, if the pairs human/mouse and human/zebrafish have both been aligned, human-mouse-zebrafish triplets are created. This means that the number of propositions to validate does not need to increase in O(N²). Rather, each new ontology must be fully aligned to only one already aligned ontology, and the missing homologies must then be filled in. A judicious choice of the initial pairwise alignment should minimize these missing homologies.
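The reconciliation of two pairwise alignments through their shared ontology can be sketched as a join on the terms of that ontology. The function names and the term labels in the example are made up for illustration, and the sketch assumes each alignment is stored as a mapping from a term of the shared ontology to its homologs in the other ontology.

```python
def reconcile(shared_to_b, shared_to_c):
    """Join two pairwise alignments on the terms of the shared ontology.

    Each argument maps a term of the shared ontology to a list of
    homologous terms in another ontology (alignments are many to many).
    """
    triplets = []
    for shared_term, b_terms in shared_to_b.items():
        for b in b_terms:
            for c in shared_to_c.get(shared_term, ()):
                triplets.append((shared_term, b, c))
    return sorted(triplets)


def full_alignments_needed(n_ontologies, all_pairs):
    """Number of complete pairwise alignments for N ontologies."""
    if all_pairs:
        return n_ontologies * (n_ontologies - 1) // 2  # grows as O(N^2)
    return n_ontologies - 1  # each new ontology aligned to one existing one


# Hypothetical labels, not real ontology identifiers.
human_mouse = {"human:heart": ["mouse:heart"], "human:arm": ["mouse:forelimb"]}
human_zebrafish = {"human:heart": ["zebrafish:heart"]}
triplets = reconcile(human_mouse, human_zebrafish)
```

Here `human:arm` yields no triplet because the second alignment has no entry for it; this is the kind of missing homology that must be filled in afterwards. For seven ontologies, aligning every pair would require 21 full alignments, against 6 when each new ontology is aligned to a single already aligned one.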
Homolonto is used to generate pairwise homology relationships between anatomical ontologies. As homology relationships are transitive, Homolonto offers the option to merge these pairwise alignments into homologous organs groups (HOGs). This generates both the HOGs, and the mapping of species-specific anatomical structures to these HOGs. HOGs then need to be structured as an ontology to allow reasoning on them. This means that, at a minimum, relationships amongst them have to be designed. Another algorithm has thus been developed to infer relationships between HOGs.
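Because homology is treated as transitive, merging pairwise relationships into groups amounts to computing connected components. The following is a minimal sketch using a union-find structure; it is one standard way to do this and is not claimed to be the procedure implemented in Homolonto. The term labels in the example are made up.

```python
def build_hogs(homolog_pairs):
    """Group terms linked by pairwise homology into transitive groups."""
    parent = {}

    def find(term):
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]  # path halving
            term = parent[term]
        return term

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b in homolog_pairs:
        union(a, b)

    groups = {}
    for term in list(parent):
        groups.setdefault(find(term), set()).add(term)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical labels, not real ontology identifiers.
pairs = [
    ("human:heart", "mouse:heart"),
    ("human:heart", "zebrafish:heart"),
    ("human:eye", "mouse:eye"),
]
hogs = build_hogs(pairs)
```

Each resulting group plays the role of one HOG, and the membership of a term in a group gives the mapping from species-specific structures to HOGs.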
To date, the use of Homolonto, followed by a curation process, has allowed us to define 1002 HOGs, involving 4459 structures from seven anatomical ontologies: ZFA (Sprague et al., 2006), EHDAA (Aitken, 2005; Hunter et al., 2003), EV (Kelso et al., 2003), EMAPA (Aitken, 2005; Hunter et al., 2003), MA (Smith et al., 2007), XAO (Bowes et al., 2008) and FBbt (Grumbling et al., 2006). The algorithm to design relationships amongst the HOGs inferred 1411 relations. With the most stringent parameters (ontology coverage = 1, within-ontology agreement = 1, inter-ontology agreement = 1), 222 of them were defined automatically as part_of, 15 as is_a, all the others as broader_than. After curation, there are 1179 part_of and 232 is_a relations. The resulting alignments are used in the database Bgee (Bastian et al., 2008). Thus an important result is that we have been able to implement anatomical homology relationships in a practical manner.
Here, we present, in more detail, two alignments (Table 1): first, zebrafish/Xenopus, which illustrates a best case scenario of two consistent ontologies, conforming to the CARO standards (Haendel et al., 2008), with annotations of synonyms and definitions, and low redundancy. On the other hand, Xenopus (a frog) and zebrafish (a ray-finned fish) present important differences in anatomy. And second, human/mouse which, despite the similarity in anatomy, illustrates a more difficult scenario of large ontologies, with issues such as repetition of names (76 occurrences of ‘mesenchyme’ in human, 93 in mouse), due to splitting of concepts among morphological structures or among developmental stages.
The main observation is that our algorithm is successful at ordering propositions. In the ‘easy’ case of zebrafish/Xenopus (Supplementary Figs S1 and S2), there are only seven invalidated propositions in the first 150 (95% validation). This is followed by a relatively short interval of iterations where validated and invalidated propositions are mixed: 46% of validations between iterations 151 and 200, and 20% between 201 and 250. Further iterations generate mostly invalidated propositions (3% validation from 251 to 735). Thus, 93% of all validations occurred in the first 250 iterations. Looking in more detail, the first propositions are terms which share many children. Thus, the first proposition pairs ‘organism subdivision’ from each ontology, which share four children with identical names (‘head’, ‘trunk’, ‘tail’ and ‘surface structure’). The second proposition pairs two terms which have different names, but are identified readily thanks to their synonyms: XAO:0000023 ‘skin’, synonym ‘integument’ and ZFA:0000368 ‘integument’, synonym ‘skin’ (IDs correspond to the versions used for the alignment; Table 1). The first invalidated proposition (iteration 77) has a peculiar status, since both ontologies include a term ‘unspecified’, which are equivalent but cannot be defined as homologous. The next invalidated proposition (iteration 130) is between XAO:0000313 ‘head somite’ and ZFA:0001462 ‘somite border’. Indeed, early in the iterations, sharing a parent ‘somite’ plus sharing the word ‘somite’ brings a relatively high score. But since propositions based on this are usually invalidated, the word ‘somite’ loses weight (Equation 6), and further propositions based on this similarity receive lower scores. Thus, whereas there are in principle 24 possible propositions between the Xenopus and zebrafish ontologies based on ‘somite’, only 13 were considered in this very thorough alignment (including the validated pair XAO:0000058 ‘somite’—ZFA:0000155 ‘somite’). 
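The way a word such as ‘somite’ loses weight after repeated invalidations can be illustrated with a small sketch. Equation 6 is not reproduced in this section, so the update rule below, a simple smoothed validation ratio, is an assumed stand-in and not the published formula; `WordWeights` is a hypothetical name.

```python
from collections import defaultdict


class WordWeights:
    """Track how often propositions sharing a word were validated."""

    def __init__(self):
        self.validated = defaultdict(int)
        self.invalidated = defaultdict(int)

    def record(self, shared_words, is_valid):
        """Update the counts for every word shared by a judged proposition."""
        target = self.validated if is_valid else self.invalidated
        for word in shared_words:
            target[word] += 1

    def weight(self, word):
        """Assumed rule: starts at 1.0 and decays with invalidations."""
        v = self.validated[word]
        i = self.invalidated[word]
        return (v + 1) / (v + i + 1)


weights = WordWeights()
for _ in range(3):
    weights.record({"somite"}, is_valid=False)
after_invalidations = weights.weight("somite")
weights.record({"somite"}, is_valid=True)
after_one_validation = weights.weight("somite")
```

Under this assumed rule, later propositions that rely only on the word ‘somite’ receive lower scores and drop down the ranking, which is consistent with only part of the possible ‘somite’ pairs being presented to the user.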
At the other extreme of the alignment, the last validated propositions (iterations 607–610) concern aortic arches which were named, e.g. ‘aortic arch 4’ in zebrafish, but ‘fourth aortic arch’ in Xenopus. Their low scores were due to the high frequency of the words ‘aortic’ and ‘arch’ in both ontologies (Table 2).
The pattern is similar for the human/mouse alignment (Supplementary Fig. S3). In the first 1400 iterations, 99% of propositions are validated. In the next 600 iterations, the figure reduces to 63%, and in the last 962 iterations it falls to 21%. This slower decrease illustrates the complexity of this alignment. Although 2962 iterations may seem large, three points should be noted: (i) this is a worst case scenario, aligning two large anatomical ontologies, which lack important information such as definitions and synonyms, and are not up to recent standards (Haendel et al., 2008). (ii) This represents in our experience only 15 person-days of work, which means an iteration takes on average 2–3 min (on a dual-core processor at 2.66 GHz, with 2 GB of DDR2 memory). This is possible because many answers are obvious to the annotator in the context of the information provided by the graphical user interface. For example, while the term EMAPA:18280 ‘intrinsic’ may appear enigmatic, its part_of relationship to ‘skeletal muscle’ part_of ‘tongue’, makes its homology to EHDAA:9140 ‘intrinsic muscle’ part_of ‘skeletal muscle’ part_of ‘tongue’ clear. Conversely, EMAPA:16370 ‘cardiovascular system’ part_of ‘extraembryonic component’, is not homologous to EHDAA:394 ‘cardiovascular system’, part_of ‘organ system’ part_of ‘embryo’ (Table 2). (iii) The 2962 propositions evaluated represent much less than the 8 202 675 possible pairs of terms between these two ontologies (2327 × 3525; Table 1). The validation rate of 66% shows that these were mostly propositions worth considering, and that the time spent was indeed due to the size of the ontologies, not to a flaw in the algorithm. Results also show that manual expertise is necessary, since even in the high scoring propositions some are invalid (Table 2).
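The figures quoted for the human/mouse alignment can be checked with a few lines of arithmetic. The count of 1959 validated homologs is the one reported for this alignment in the text; the 8-hour working day is an assumption made here to convert person-days into minutes.

```python
terms_a, terms_b = 2327, 3525           # ontology sizes (Table 1)
possible_pairs = terms_a * terms_b      # every pair of terms

evaluated = 2962                        # propositions shown to the user
validated = 1959                        # propositions accepted as homologs

fraction_evaluated = evaluated / possible_pairs
validation_rate = validated / evaluated

hours_per_day = 8                       # assumption, not stated in the text
minutes_per_iteration = 15 * hours_per_day * 60 / evaluated

print(possible_pairs)
print(f"{fraction_evaluated:.4%} of possible pairs evaluated")
print(f"{validation_rate:.0%} validated")
print(f"{minutes_per_iteration:.1f} min per iteration")
```

The user therefore examined well under a tenth of a percent of all possible pairs, and under the stated assumption the time per iteration falls within the 2–3 min range given above.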
The example of ‘cardiovascular system’ (EMAPA:16370/EHDAA:394) given above appears at iteration 416, with a score improved by shared subcomponents (‘venous system’ and ‘arterial system’). Overall, 27% of invalidations are pairs of terms with identical names. Interestingly, Homolonto manages to give these misleading homonyms low priority: homonyms within the first 1000 iterations have a 99% chance of being homologs, whereas homonyms within the last 1000 iterations only have a 19% chance of being homologs. Thus, 93% of invalidated homonyms appear after iteration 1400.
It is also of interest to consider the capacity of Homolonto to recover homologous terms which are not described by the same name, in a case such as human/mouse where synonyms are not available. Of the 1959 validated homologs, 17% do not have identical names. Many of these share partial homonymy, as between EMAPA:17865 ‘bulbo-ventricular region’ and EHDAA:766 ‘bulbo-ventricular groove’. Such propositions will be recovered by the combination of word matching and propagation of other validated homology relationships (i.e. both are part_of ‘heart’). Structural matching is also able to recover cases with no word matching, as in EMAPA:16211 ‘cardiac muscle’/EHDAA:430 ‘myocardium’. In this case, both terms are part_of ‘early primitive heart tube’. In both ontologies, the latter term has two other children, which are homonyms and homologs: ‘endocardial tube’ and ‘cardiac jelly’. When the homonymous terms have been validated, ‘cardiac muscle’ and ‘myocardium’ remain the only pair of children of ‘early primitive heart tube’, which permits their pairing as a reasonable proposition, following Equation 5b. Similarly, XAO:0003033 ‘nostril’ and ZFA:0000550 ‘naris’ are correctly identified as homologs, since both have is_a relations to ‘surface structure’, and part_of ‘head’.
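The ‘cardiac muscle’/‘myocardium’ case can be illustrated by a sketch of structural matching by elimination: once the other children of two homologous parents have been paired, the children left over become candidate pairs. Equation 5b is not shown in this section, so the function below only generates candidates and does not compute the published score; `remaining_child_pairs` is a hypothetical name.

```python
def remaining_child_pairs(children_a, children_b, validated_pairs):
    """Pair the children of two homologous parents that are still unmatched."""
    set_a, set_b = set(children_a), set(children_b)
    relevant = [(a, b) for a, b in validated_pairs if a in set_a and b in set_b]
    matched_a = {a for a, _ in relevant}
    matched_b = {b for _, b in relevant}
    left_a = [c for c in children_a if c not in matched_a]
    left_b = [c for c in children_b if c not in matched_b]
    return [(a, b) for a in left_a for b in left_b]


# Children of 'early primitive heart tube' in the two ontologies.
mouse_children = ["cardiac muscle", "endocardial tube", "cardiac jelly"]
human_children = ["myocardium", "endocardial tube", "cardiac jelly"]
validated = {
    ("endocardial tube", "endocardial tube"),
    ("cardiac jelly", "cardiac jelly"),
}
candidates = remaining_child_pairs(mouse_children, human_children, validated)
```

In this example the only candidate left is the pair (‘cardiac muscle’, ‘myocardium’), although the two names share no word. The pair would still have to be validated by the user.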
The main feature of Homolonto is its efficiency in identifying and ranking valid pairs of terms. Although most homologies concern terms with the same name, the algorithm is successful both in generating relevant propositions for terms with different names, and in giving a low rank to terms with the same name which are not homologs. The algorithm has been shown to perform well in proposing valid pairs of homologous terms for two quite different cases. Zebrafish and Xenopus have divergent anatomies, from the two major branches of vertebrates (ray-finned fishes and tetrapods), but are described by ontologies which follow consistent guidelines (Haendel et al., 2008). The Xenopus ontology is also relatively small. Conversely, human and mouse have very similar anatomies (both are mammals), but are described by large ontologies with little structured information. Despite these differences, the results of Homolonto are consistent, proposing almost exclusively valid pairs in a first series of iterations covering approximately half of the smaller ontology: 250 iterations for Xenopus/zebrafish, 1400 iterations for human/mouse.
The size of some biological ontologies makes the user interface important. The GUI of Homolonto provides rapid access to information about the terms considered, and includes keyboard shortcuts. The combination of an algorithm which proposes relevant pairs of terms, and of this GUI, allows the alignment of large ontologies of anatomy in reasonable time (i.e. weeks).
As all propositions have to be manually validated, the expertise of the curator is important to consider. In our experience, most propositions between closely related species represent ‘text-book’ knowledge, which does not require the curator to be an anatomy expert (although she/he needs to be a biologist). On the other hand, when dealing with complex structures (e.g. substructures of the brain) or distant species (e.g. alignment of insect and vertebrate anatomies), such expertise might be needed.
Future development of Homolonto should include more relationships than simple homology. For example, homoplasy (analogy in the common sense of the word) may be relevant in cases of functional equivalence, such as the vertebrate and insect eyes. Also, it would be of interest to model explicitly serial homology, to improve the management of e.g. somites.
We thank Aurélie Comte, Anne Niknejad and Emilie Person for manual verification of homology groups within Bgee.
Funding: Etat de Vaud, Swiss National Science Foundation (116798); the Décrypthon program of Association Française contre les Myopathies; the European program Crescendo.
Conflict of Interest: none declared.