In this section we survey existing annotation- and topology-based approaches and set up a novel GO semantic similarity metric in order to measure GO term closeness in the hierarchy of the GO-directed acyclic graph (DAG). This novel GO term semantic similarity measure is derived in order to ensure effective exploitation of the large amounts of biological knowledge that GO offers. This, in turn, provides a measurement of functional similarity of proteins on the basis of their annotations from heterogeneous data using semantic similarities of their GO terms.
2.1. Existing GO-IC-Based Semantic Similarity Approaches
We are interested in the IC-based approaches, and unlike the graph-based or hybrid approach introduced by Wang et al. [28
], which is based on the intrinsic structure of the GO-DAG, that is, only uses the GO-DAG topology to compute the semantic similarity, other measures do not consider only the topology. Most of them are adapted from Resnik [29
] or Lin's [30
] methods, in which the information content (or semantic value) of a term conveying its biological description and specificity is based on the annotation statistics related to the term [2
], and thus they have a natural singularity problem caused by orphan terms. Here these approaches are referred to as Resnik-related approaches. In these approaches, the more often the term is used for annotation, the lower its semantic value, and as pointed out by Wang et al., this may lead to different semantic values of the GO terms for GO annotation data derived from different sources. However, each biological term in the ontology is expected to have a fixed semantic value when used in genome annotation. The semantic value is defined as the biological content of a given term, and this is particularly a problem in the hierarchical structure of the GO-DAG if the information will be used to predict functions of uncharacterized proteins in the genome, since one source can annotate a given protein with a term at a low level and another source with a term at a higher level in the hierarchy. Furthermore, the description and specificity of a given term in GO essentially depends on its GO annotation specification, translated by its position in the GO-DAG structure or topology.
To overcome these limitations, Wang introduced a topology-based semantic similarity measure in which the semantic value of a term z is given by
denotes the set of ancestors of the term z
, and Sz
) is calculated as follows:
) being the set of children of the term t
, and ωe
the semantic contribution factor for “is_a” and “part_a” relations set to 0.8 and 0.6, respectively. The semantic similarity of the two GO terms is given by
It has been shown that the Wang et al. approach performs better than Resnik's approach in clustering gene pairs according to their semantic similarity [27
On the edge-based similarity approaches, Zhang et al. [32
] introduced a GO-topology-based approach to assess protein functional similarity for retrieving functionally related proteins from a specific proteome, overcoming the common issue of other edge-based approaches mentioned previously. This was achieved by computing a measure called the D
value, which depends only on the children of a given GO term and is numerically equal to the sum of D
values of all its children. Thus, the D
value of a GO term is calculated using a recursive formula starting from leaves in the hierarchical structure, where the D
-value of all leaves are equal and set to the inverse multiplicative of the count of the root obtained by recursively summing the counts of all the direct children from the bottom up, with the count of the leaf set to 1. Note that the count of a given nonleaf term is just the number of all paths from that term node to all leaves connected to the term. In this approach, the D
value for a pair of terms x
is given by
However, a general limitation common to all these semantic similarity measures is that none of them fully address the issue related to the depth of the GO-DAG as stated previously; that is, the depth sometimes reflects vagaries in different levels of knowledge. An example is where the structure is just growing deeper in one path without spreading sideways. In the context of the GO-DAG, such a term is sometimes declared obsolete and automatically replaced by its parent. Thus, to consider this issue, we are introducing a topological identity or synonym term measure based on term topological information in which a parent term having only one child and that child term having only that parent are assumed to be topologically identical and they are assigned the same semantic value. This provides an absolute difference between more general terms closer to the root and more specific terms further from the root node, depending on the topology of the GO-DAG, that is, whether a branch splits into more than one possible path of specificity. Furthermore, this is consistent with the human language in which the semantic similarity between a parent term and its child depends on the number of children that the parent term possesses and also the number of parents that the child term has. Intuitively a parent having more children loses specificity and this parent is no longer relevant to be used for its child specification, thus leading to a lower similarity score between this parent and each of its children.
To illustrate this, let us consider the hierarchical structure in where “a”, “b”, “c”, “d”, and “e” are terms used to annotate proteins in a given genome and these terms are linked by the relation “is_a”. For the Zhang et al. approach, the semantic values of “b” and “d” are the same, which is 1.09861 (−ln (1/3)), but it fails to distinguish between “d” and “e”, which would be expected to have different semantic values. The Wang et al. approach will assign different semantic values to “b” and “d”; the semantic value of “b” is 2.44 and that of “d” is 2.952, although they are topologically identical in the sense that there is no other option going down the DAG except to “d”. For annotation-based approaches, if we consider a genome, for example, which has been annotated by two different labs, referred to as heterogeneous sources, it is likely that the terms “b” and “d” will not occur at the same frequency, in which case “b” and “d” will have different semantic values. For this new measure, the term “b” has only one child “d”, which has only one parent “b” (no sideways spread) and therefore the term “d” does not have additional value compared to “b” in the illustration in . This means that “b” and “d” are topologically identical (synonymous) and have the same fixed semantic value, equal to 1.38629. This is different to the semantic value of the term “e”, which is 3.46574 as “e” could be “derived” from two different branches.
Fictitious hierarchical structure illustrating the computation of term semantic values. Terms are nodes with “r” as a root.
2.2. GO Term Topological Information and New GO Term Similarity Approach
Translating the biological content of a given GO term into a numeric value, called the semantic value or topological information, on the basis of its location in the GO-DAG, requires knowledge of the topological position characteristics of its immediate parents. This leads to a recursive formula for measuring topological information of a given GO term, in which the child is expected to be more specific than its parents. The more children a term has, the more specific its children are compared to that term, and the greater the biological difference. In addition, the more parents a term has, the greater the biological difference between this term and each of its parent terms. The three separate ontologies, namely, molecular function (MF), biological process (BP), and cellular component (CC) with GO Ids GO: 0003674, GO: 0008150, and GO: 0005575 respectively, are roots for the complete ontology, located at level 0, the reference level, and are assumed to be biologically meaningless. Unless specified explicitly, in the rest of this work the level of a term is considered to be the length of the longest path from the root down to that term in order to avoid a given term and its child having the same level. GO
will, respectively, express the set of GO terms and links, (x
represents the link or association between a given parent x
and its child y
, and the level of the link (x
) is the level of its source node x
. Finally, [x
indicates that the level of term x
is lower than that of y
The topological information ICT
) of a given term z GO
is computed as
) is a topological position characteristic of z
, recursively obtained using its parents gathered in the set z
}, and given by
being the number of children of parent term x
A topological position is thus a function μ
→ [0,1], such that for any term t GO
) defines a reachability measure of an instance of term t
. Obviously, μ
is monotonically increasing as one moves towards the root; that is, if t1
, then μ
) ≤ μ
). For the top node or root, the reachability measure is 1. Furthermore, this reachability measure takes into account information of parents of the term under consideration through their reachability measures and that of every parent's children by incorporating the number of children that each parent term has in order to quantify how specific a given child is compared to each of its parent terms.
Note that, in general, the information we possess about something is a measure of how well we understand it and how well ordered it is. μ
) provides a precise indicator of all we know about the term z
in the DAG structure. As μ
is decreasing when moving towards leaves and a strictly positive defined function, the multiplicative inverse of μ
is an increasing function. This implies that 1/μ
) is a measure of how we understand the term z
and how ordered it is in the DAG, which merely means that the inverse of μ
) measures the information we possess about the term z
in the context of the DAG structure. The formula in (5
) is a logarithmic weighting of the inverse of μ
), referred to as topological information and measuring what we know about the term z
in the DAG structure.
To illustrate the way this approach works, consider the hierarchical structure shown in . In this DAG from top to bottom, we have the following.
Hierarchical structure illustrating how our approach works. Nodes are represented by integers from 0 to 11 with 0 as a root. The numbers beside each node represent its topological position characteristic and information content.
- The topological position characteristic of the root 0 is μ(0) = 1, and so its topological information is ICT(1) = −ln (1) = 0.
- As 1 and 2 have only parent 0, which has only these two children with μ(0) = 1, this yields μ(1) = 1/2 = μ(2), and so their topological information is ICT(1) = −ln (1/2) = 0.69315 = ICT(2).
- 3 has only one direct parent 1 with μ(1) = 1/2 and this parent has two children, we have μ(3) = 1/4, and its topological information is then ICT(3) = −ln (1/4) = 1.38639.
- 4 has two direct parents 1 and 2. 1 has two children with μ(1) = 1/2 and 2 has three children with μ(2) = 1/2. Thus, its topological position characteristic is the product of topological position characteristics of its parents, respectively, divided by the number of children for each parent μ(4) = 1/41/6 = 1/24 and its topological information is ICT(4) = −ln (1/24) = 3.17806.
- 5 has only one direct parent 2, which has three children and μ(2) = 1/2. Its topological position characteristic is μ(5) = 1/6 and its topological information is ICT = −ln (1/6) = 1.79176.
Unlike edge-based approaches where nodes and edges are uniformly distributed, and edges at the same level of the ontology correspond to the same semantic distance between terms [9
], in this new approach these parameters depend on the topological position characteristic of terms, which are not necessarily the same. In this illustration, nodes 3, 4, and 5 are at the same level but they do not have the same topological position characteristic, thus leading to different topological information or semantic values. Furthermore, the aforemetioned illustration reveals that the product in formula (6
) of topological position characteristic must be carefully considered when implementing the approach, since the exponential tail-off with increasing depth is severe depending on the density of the hierarchical structure under consideration. Here, we suggest computing μ
) iteratively when performing this product, and every time the multiplication is done, the obtained value must immediately be converted to a pair of numbers (α
) such that μ
) = α
with 0.1 ≤ α
< 1 and β
< 0. This means that every time the product is performed, the new value is converted to this format so that in the end, the topological position characteristic is just given by (α
) such that μ
) = α
= −ln (α
) − β
are topologically identical or synonym terms and denoted by
, if the following properties are satisfied.
- ICT(x) = ICT(y) or μ(x) = μ(y).
- There exists one path pxy from x to y.
Therefore, two GO terms are equal if and only if they are either the same or topologically identical terms. Suppose that there exists a path pxy
from term x
to term y
is a more general term compared to y
, or y
is more specific compared to x
and denoted by
) < I
) or μ
) < μ
The topological position μ provides a new way of assessing the intrinsic closeness of GO terms. Two terms in the GO-DAG may share multiple ancestors as a GO term can have several parents through multiple paths. Therefore, we define the topological position μs(x, y) of x and y as that of their common ancestor with the smallest topological position characteristic, that is,
being the set of ancestral terms shared by both terms x
. Finally, the semantic similarity score of the two GO terms is given by
) = −ln μs
) being the topological information shared by the two concepts x
The semantic similarity measure SGO
proposed here is referred to as the GO-universal similarity measure [33
], as it induces a distance or a metric, dGO
, given by dGO
) = 1 − SGO
) (see Supplementary Material available online at doi:10.1155/2012/975783), which in Information Theory is known as a universal metric [34
]. The more topological information two concepts share, the smaller their distance and the more similar they are. Moreover, the similarity formula in (8
) emphasizes the importance of the shared GO terms by giving more weight to the shared ancestors corrected by the maximum topological information, and thus measuring how similar each GO term is to the other. Thus, for two GO terms sharing less informative ancestors the distance is greater and the similarity is smaller, while for two GO terms sharing more informative ancestors, they are closer and their similarity is higher.
To illustrate the GO-universal approach, we use (5
) and (6
) to compute the reachability measure μ
) and topological information measure ICT
) of GO terms z
in a minimum spanning graph shown in adapted from [32
]. Results are shown in for our approach and the Zhang et al. approach. To relate the scale of Zhang et al. to ours, the D
value of a given term is considered to be the probability of usage or occurrence of the term in the structure as suggested by Zhang et al. This means that the information content (IC) of a term x
is calculated as
Subgraph of the GO BP. Each box represents a GO term with GO ID, D value (Zhang et al. measure). This is used to illustrate our approach and compare its effectiveness to the Zhang et al. approach.
Names and characteristics of GO terms in , including topological position characteristics μ and information content ICT from our approach and ICZ and ICZu from the Zhang et al. approach.
Moreover, two approaches, Resnik and Lin's approaches, are used for scaling the semantic similarity measure induced by ICZ
between 0 and 1. The uniform Resnik's measure is given by
) is the uniform ICZ
) obtained by dividing ICZ
) by the maximum scale whose value is ln N
is the total number of terms within the ontology under consideration. ICZu
) is therefore computed as follows:
is the number of terms in the ontology under consideration. Lin's semantic similarity measure is given by
As we can see, the more specific the term, that is, the further it is from the root node, the higher its topological information, meaning that children are more informative or more specific than their parents, and for two GO terms in the same path, the more specific one will either be more informative or topologically identical to that closer to the root. This is not the case for the Zhang et al. approach, in which the semantic values of the terms at the same level tend to be uniform and a child term is not necessarily more specific than a given parent term, independent of the number of parents that the child term has. Our method distinguishes these different local topologies.
We calculate the semantic similarity between every two consecutive GO terms in and results are given in for three different approaches. The formula in (6
) shows that, for our approach, the contribution of a given parent to the term depends on the parent reachability measure. The smaller the reachability measure of that parent and the fewer children it possesses, the higher its similarity compared to another parent of the term. From the results in , we see that GO:0042771 is more similar to GO:0008630 than to GO:0030330, both of which are its parents. This is topologically explained by the lower reachability of GO:0008630 compared to GO:0030330 and the higher number of children the term GO:0030330 possesses. This reduces its influence on each of its children, becoming less relevant for it to represent a given child due to the lower similarity between them. Furthermore, GO:0006977 is more similar to GO:0031571 than to GO:0030330. This is numerically due to the influence of GO:0030330, reflected by its reachability measure, which is lower than that of GO:0031571. It is topologically caused by the higher level of the term GO:0031571 compared to the level of GO:0030330, and therefore gives the term GO:0031571 a higher biological content property than GO:0030330 for better representing the child term GO:0006977.
Table 2 Semantic similarity values between child-parent pairwise terms in from the Wang et al. and Zhang et al. approaches are compared to our approach. SW refers to the semantic similarity between two GO terms obtained using the Wang semantic similarity (more ...)
also includes the semantic similarity between every two consecutive GO terms computed using the Zhang et al. and Wang et al. methods. These results show that Wang's semantic similarity measure between a given term and its immediate child is always greater than 0.6, which is the semantic factor of “part_of” relations, and is independent of the characteristics of the position of these terms in the GO-DAG, including the number of children belonging to the parent term and their levels. This shows how our approach provides a scalable and consistent measurement method, in which the semantic similarity of two terms is completely determined by their reachability measures and that of their highest informative ancestor, that is, the ancestor with the smallest reachability measure. Using the intrinsic topology property of the GO-DAG, the semantic similarity measure of two terms is in agreement with the GO consortium vocabulary, in the sense that two terms whose most common informative ancestor is close to the root share less topological information compared to those having the highest common informative ancestor far from the root.
2.3. Functional Similarity of Proteins Based on GO Similarity
A given protein may perform several functions, thus requiring several GO terms to describe these functions. For characterized or annotated pairwise proteins with known GO terms, functional closeness or GO similarities based on their annotations and consequently the distances between these proteins can be evaluated using the Czekanowski-Dice approach [35
] as follows:
) is the set of GO terms of a given protein p
for a given ontology X
= MF, BP, CC, and |TGOX
)| stands for its number of elements.
Czekanowski-Dice's measure is not convenient for using in the case of GO term sets, since GO terms may be similar at some level without being identical. This aspect cannot be captured in Czekanowski-Dice's measure which only requires the contribution from the GO terms exactly matched between the sets of GO terms of these proteins. One can attempt to avoid this difficulty by incorporating the true path rule in the computation of the intersection and union of GO term sets for proteins. However, in most cases where these proteins are annotated by successive GO terms in the GO-DAG, this may lead to the situation where the number of elements in the union of these sets is equal to that of their intersection plus one, in which case, the functional closeness of these proteins is forced to converge to 1, independently of the biological contents of the GO terms in the GO-DAG.
To overcome this problem, we set up a functional similarity between proteins which emphasizes semantic similarity between terms in their sets of GO terms considered to be uniformly distributed. This functional similarity is given by
)) = 1 − dGO
)), with dGO
)) being the distance between a given term t
and a set of terms TGOX
) for a given protein p
, mathematically defined as follows:
Thus, owing to the fact that dGO(s, t) = 1 − SGO(t, s), we obtain
This shows that the functional closeness formula emphasizes the importance of the shared GO terms by assigning more weight to similarities than differences. Thus, for two proteins that do not share any similar GO terms, the functional closeness value is 0, while for two proteins sharing exactly the same set of GO terms, the functional closeness value is 1. The functional similarity between proteins in (14
) is a value that ranges between 0 and 1 and indicates the percentage of similarity the two proteins share, on average, based on their annotations. For example, a functional similarity between two proteins of 0.9 means that these proteins are 90% similar, on average, based on their annotations.
Note that the approach used here to combine GO term topological information for calculating protein functional similarity scores was used in the context of annotation-based approaches and is referred to as the best match average (BMA) approach. This approach has been suggested to be better than the average (Avg) [2
] or maximum (Max) [19
] approaches from a biological point of view [36
]. However, even Avg and Max approaches can also be used to combine GO term semantic similarity scores produced using this new measure to quantify protein functional similarity depending on the application. Furthermore, the GO-universal metric can be used in the context of the SimGIC approach [9
] derived from the Jaccard index based on the Tversky ratio model of similarity [39
], which uses GO term IC directly in order to compute protein functional similarity scores, and referred to as SimUIC. These approaches are generally referred to as term-based approaches. The GO term topological information scores can also be used to construct protein functional similarity schemes relying on other Tversky ratio models, for example, using the Dice index, referred to as SimDIC, and SimUIX which uses a universal index, given by