Semantic similarity measures depend on the quality and completeness of both the ontology and the annotated corpus they use. Biomedical ontologies have several irregularities, such as variable edge length (edges at the same level convey different semantic distances), variable depth (terms at the same level have different levels of detail), and variable node density (some areas of the ontology have a greater density of terms than others). These irregularities stem from the irregular nature of biomedical knowledge and our limited understanding of it, and should be taken into account by semantic similarity measures, particularly edge-based measures, which are sensitive to these issues. There are also problems with the use of annotated corpora in node-based measures, because these corpora are biased by biomedical research interests (terms related to subjects of research focus are expected to be annotated more frequently than other terms) and limited by current experimental techniques (if the most popular current technique can discover gene product functional aspects only to a certain level of detail, then more detailed terms will seldom be used for annotation). Despite these issues, the use of IC in semantic similarity measures still makes sense probabilistically, if not always semantically, because commonly used terms are more likely to be shared by gene products than uncommonly used terms.
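To illustrate the probabilistic point, the information content of a term is conventionally taken as the negative log of its annotation frequency in the corpus; rarer (more specific) terms thus carry more information. A minimal sketch, with invented term identifiers and counts:

```python
import math

# Hypothetical annotation corpus: term -> number of gene products annotated
# with that term or any of its descendants (counts are illustrative only).
term_counts = {"GO:0008150": 1000, "GO:0008152": 400, "GO:0006096": 25}
total = term_counts["GO:0008150"]  # the root term covers every annotation

def information_content(term):
    """IC(t) = -log p(t), where p(t) is the term's annotation frequency."""
    return -math.log(term_counts[term] / total)

# The rarer, more specific term has the higher IC:
assert information_content("GO:0006096") > information_content("GO:0008152")
```

Note that the counts, and hence the IC values, change with the corpus used, which is why scores computed over different annotation sets are not comparable.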
Another issue particular to GO is that not all annotations are equal, because evidence codes show the type of evidence that supports the annotation. While most evidence codes symbolize experimental methods and the corresponding annotations are manually curated, the most prevalent evidence code (IEA) indicates that the annotation was inferred electronically (by sequence similarity, from other databases, or from keyword mapping files) and was not manually curated. Whether IEA annotations should be used or disregarded for semantic similarity is still a subject of open debate, because using them entails a loss of precision, whereas excluding them entails a loss of coverage (over 97% of all annotations in the GOA-UniProt database are IEA, and less than 2% of GO terms have non-IEA annotations). As the proportion of IEA annotations continues to increase, and as they steadily improve in quality (precisions of up to 91%–100% have been reported), there will be fewer reasons to ignore them, and they will eventually be widely adopted. Since IEA annotations are usually made by inference through similarity to sequences of model organisms, improvements in the experimental annotation of model organisms will result in higher-quality IEA annotations. Meanwhile, perhaps the best way to decide whether to include or discard IEA annotations for a particular study is to first analyze the corpus in question and verify whether the gene products of interest are well characterized by manually curated annotations or whether IEA annotations are essential for their characterization. Note that using different sets of annotations (e.g., all annotations versus only the manually curated ones) will affect the IC values calculated for the terms, making the resulting scores incomparable. It is also important to stress that only results obtained with the same versions of GO's ontology and annotation data are comparable, since changes to both the ontologies and the annotations made with them affect semantic similarity.
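The corpus analysis suggested above can be sketched as a simple triage step that flags gene products characterized solely by IEA evidence; the accessions, terms, and evidence codes below are invented for illustration:

```python
# Illustrative annotation records: (gene product, GO term, evidence code).
annotations = [
    ("P12345", "GO:0006096", "EXP"),
    ("P12345", "GO:0008152", "IEA"),
    ("Q67890", "GO:0008152", "IEA"),
]

curated = {gp for gp, _, code in annotations if code != "IEA"}
all_products = {gp for gp, _, _ in annotations}

# Gene products that would lose all annotation if IEA were excluded:
iea_only = all_products - curated
print(sorted(iea_only))  # ['Q67890']
```

If this set covers a substantial fraction of the dataset of interest, discarding IEA annotations would sacrifice too much coverage.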
An important issue in the evaluation of semantic similarity measures based on the IC is that of data circularity between the data used to evaluate the measures and the GO annotations. For instance, if a given measure is evaluated by correlation with sequence similarity, then using annotations based on sequence similarity (those with evidence codes ISS and IEA) to calculate the IC leads to data circularity, as there is a direct dependency between the two data sources. The same is true for the use of annotations inferred by physical interaction (IPI) when a measure is evaluated by correlation with protein-protein interactions, and other similar cases. To minimize the effect of data circularity, evaluation studies should (and usually do) remove annotations based on evidence of the same nature as the data used to evaluate the measures.
Data circularity is not the only problem in evaluating GO-based semantic similarity measures. The lack of a gold standard for evaluating semantic similarity makes it hard to assess the quality of the different measures and to find out which are best for which goals. One of the reasons for the continued lack of a gold standard is that measures are often developed for specific goals and evaluated only in that context. Furthermore, there are pros and cons to all data sources used to evaluate semantic similarity. The best solution would likely be a gold standard designed by experts to cover most applications of semantic similarity, rather than one based on proxies for true functional similarity, such as sequence similarity or gene coexpression.
How to Choose a Semantic Similarity Measure
Researchers who wish to employ semantic similarity in their work need to spend some time defining their requirements to choose an adequate measure. Since different measures interpret ontology and annotation information in different ways, researchers need to understand the differences and decide which interpretation is best suited to their needs. Below, we outline some of the steps that should be taken before choosing a semantic similarity measure.
- Identify your scope: Comparing one aspect versus comparing multiple aspects of gene products;
- Identify your level of detail: Comparing gene products in specific functions or overall similarity; and
- Analyze the annotations of your dataset: Determining the number of annotations per gene product (both including and excluding IEA annotations) and the depth of those annotations.
When wishing to compare single aspects of gene products, researchers should opt for maximum approaches (“pairwise–all pairs–maximum”). These give a measure of how similar two gene products are in their most similar aspect. For comparing multiple aspects, the best measures are “pairwise–best pairs–average” or groupwise approaches, since they allow for the comparison of several terms. However, depending on the level of detail desired, “pairwise–best pairs–average” or “groupwise–set” should be used for a higher degree of specificity (since only direct annotations are used), and “groupwise–vector” or “groupwise–graph” for a more generalized similarity (since all annotations are used). To further reduce the influence of specificity, unweighted graph or vector measures can be employed, so that high-level GO terms are not considered less relevant.
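The contrast between the two pairwise aggregation strategies can be sketched on a toy term-to-term similarity matrix (all values invented); the “best pairs” variant shown is a simple one-directional best-match average:

```python
# Rows: terms annotating gene product A; columns: terms annotating product B.
# Entry sim[i][j] is an assumed term-to-term similarity score.
sim = [
    [0.9, 0.1],
    [0.2, 0.7],
]

# "all pairs - maximum": score of the single most similar aspect.
max_score = max(v for row in sim for v in row)

# "best pairs - average": average each A-term's best match among B's terms
# (a one-directional best-match average, for simplicity of the sketch).
bma = sum(max(row) for row in sim) / len(sim)

print(max_score)  # 0.9 -- the two products share one very similar aspect
print(bma)        # ~0.8 -- overall similarity across all of A's aspects
```

The maximum rewards a single shared function, while the best-pairs average reflects agreement across all annotated aspects, which is why the latter suits multi-aspect comparisons.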
However, having to analyze the dataset before deciding which measure to use can be cumbersome to researchers who just need a “quick and dirty” semantic similarity calculation. In this case, researchers should resort to one of the several semantic similarity tools available and use their good judgment in analyzing the results. Most semantic similarity measures proposed so far have shown fair if not good results, and for less detailed analyses any one of them can give a good overview of the similarities between gene products.
Over the last decade, ontologies have become an increasingly important component of biomedical research studies, because they provide the formalism, objectivity, and common terminology necessary for researchers to describe their results in a way that can be easily shared and reused by both humans and computers. One benefit of the use of ontologies is that concepts, and entities annotated with those concepts, can be objectively compared through the use of semantic similarity measures. Although the use of semantic similarity in the biomedical field is recent, there is already a wide variety of measures available that can be classified according to the approach they use.
There are two main semantic similarity approaches for comparing concepts: edge-based, which rely on the structure of the ontology; and node-based, which rely on the terms themselves, using information content to quantify their semantic meaning. Node-based measures are typically more reliable in the biomedical field because most edge-based measures assume either that all relationships in an ontology are equidistant or that edge distance is a simple function of depth, neither of which is true for existing biomedical ontologies.
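A classic node-based measure of this kind is Resnik's, which scores two terms by the IC of their most informative common ancestor (MICA). A minimal sketch over an invented toy ontology with assumed annotation probabilities:

```python
import math

# Illustrative annotation probabilities and ancestor sets (terms include
# themselves as ancestors); all names and numbers are invented.
prob = {"root": 1.0, "metabolism": 0.4,
        "glycolysis": 0.025, "gluconeogenesis": 0.03}
ancestors = {
    "glycolysis": {"glycolysis", "metabolism", "root"},
    "gluconeogenesis": {"gluconeogenesis", "metabolism", "root"},
}

def resnik(t1, t2):
    """Resnik similarity: IC of the most informative common ancestor."""
    common = ancestors[t1] & ancestors[t2]
    return max(-math.log(prob[t]) for t in common)

print(round(resnik("glycolysis", "gluconeogenesis"), 3))  # 0.916
```

Because the score is the MICA's IC rather than a path length, it is unaffected by variable edge length and node density.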
Because biomedical entities are often annotated with several concepts, semantic similarity measures for comparing entities need to rely on sets of concepts rather than single concepts. There are two main approaches for this comparison: pairwise, in which entities are represented as lists of concepts that are then compared individually; and groupwise, in which the annotation graphs of each entity are compared as a whole.
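A minimal groupwise example is a set-based comparison such as the Jaccard index over the two entities' annotation sets, the idea behind measures like simUI (which applies it to the full induced annotation graphs); the term sets below are invented:

```python
# Illustrative annotation sets for two gene products.
a = {"GO:0006096", "GO:0008152", "GO:0008150"}
b = {"GO:0008152", "GO:0008150", "GO:0016301"}

# Jaccard index: shared terms over all terms annotating either entity.
jaccard = len(a & b) / len(a | b)
print(jaccard)  # 0.5
```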
Several studies have been conducted to assess the performance of different similarity measures, by correlating semantic similarity with biological properties such as sequence similarity, or other classification schemas such as Pfam. Most measures were shown to perform well, but as few comprehensive studies have been conducted, it is difficult to draw any conclusion about which measure is best for any given goal.
Until now, most research efforts in this area have focused on developing novel measures or adapting preexisting ones to biomedical ontologies, with most novel measures being more complex than their predecessors. This increased complexity is mainly a result of more recent measures combining several strategies to arrive at a final score. Although the need for improved measures is unquestionable, this trend fails to answer the most pressing community needs: (1) easy application to both small and large datasets, which would be best achieved by developing tools that are at once easy to use and powerful; and (2) elucidation of which measure is best suited to the researcher's needs, which would require comparative studies of all existing measures and approaches.
Although important efforts in these two areas have already been made, semantic similarity is still far from reaching the status of other gene product similarities, such as sequence-based ones, in which fast and reliable algorithms, coupled with the development of ever-growing databases, have made them the cornerstone of present-day molecular biology. One important step forward would be the development of a gold standard for gene product function that would allow the effective comparison of semantic similarity measures. Nevertheless, semantic similarity is not restricted to gene products, and it can be expected that, as more biomedical ontologies are developed and used, these measures will soon be applied to different scenarios. It is thus crucial that bioinformaticians focus on strategies to make semantic similarity a practical, useful, and meaningful approach to biological questions.