Home | About | Journals | Submit | Contact Us | Français |

**|**BMC Bioinformatics**|**v.11(Suppl 1); 2010**|**PMC3009507

Formats

Article sections

- Abstract
- Background
- Methods
- Results and discussion
- Conclusion
- Competing interests
- Authors' contributions
- References

Authors

Related links

BMC Bioinformatics. 2010; 11(Suppl 1): S35.

Published online 2010 January 18. doi: 10.1186/1471-2105-11-S1-S35

PMCID: PMC3009507

Jayesh Pandey: ude.eudrup.sc@yednapj; Mehmet Koyutürk: ude.esac.scee@krutuyok; Ananth Grama: ude.eudrup.sc@gya

Selected articles from the Eighth Asia-Pacific Bioinformatics Conference (APBC 2010)

Laxmi Parida and Gene Myers

The Eighth Asia Pacific Bioinformatics Conference (APBC 2010)

18-21 January 2010

Bangalore, India

Copyright ©2010 Pandey et al; licensee BioMed Central Ltd.

This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This article has been cited by other articles in PMC.

Analyzing interaction networks for functional characterization poses significant challenges arising from the noisy, incomplete, and generic nature of both the interaction data as well as functional annotation of molecules. Network-based methods focus on interacting molecules (pairs or sets) occurring in close proximity to infer functional associations.

In this paper we perform a formal comparative investigation of the relationship between functional coherence and topological proximity in networks. We investigate the problem of assessing the coherence of sets of biomolecules (or segments thereof) taking into account functional specificity as well as the distribution of functional attributes across entity groups. We also propose novel measures of topological proximity that are more robust to noisy and incomplete interaction data.

We derive the following results in this paper: (i) there exists strong correlation between functional similarity and topological proximity in various network abstractions, with domain interaction networks (DDIs) demonstrating higher correlation than protein interaction networks (PPIs); (ii) measures that quantify coherence among entire sets of proteins are superior to aggregates of known pair-wise measures; and (iii) random-walk based measures of topological proximity are better suited to existing interaction data. We validate our methods on diverse data, including experimentally and computationally derived PPIs and DDIs, as well as on sets of known biologically related groups of molecules.

Analysis of interaction data generated from high throughput experiments takes a network-centric view of functions of biological systems and the role of the underlying components. Recent advances in this area have focused on the development of computational tools for network-based functional annotation [1], identification of functionally coherent modules [2], and relationship between network structure and function [3,4], among others. Network proximity and connectivity are also shown to be effective in identifying proteins that are implicated in similar phenotypes [5].

In this paper, we comprehensively investigate the relationship between topological and functional modularity in the context of two network abstractions - protein-protein interaction (PPI) and domain-domain interaction (DDI) networks. Key to understanding the relationship between network topology and functional modularity are: (i) suitable measures for assessing the *functional coherence *(or similarity) of a group of entities with respect to each other, and (ii) measures for quantifying the *topological proximity *in a network with potential missing interactions and noise. To assess functional coherence, canonical libraries of molecular function, such as Gene Ontology [6], are typically used [7]. Since annotations for different types of molecular entities (*e.g.*, proteins or domains) are derived in different ways [8], they have different implications with respect to their specificity and frequency distributions. Consequently, an important challenge in assessment of functional coherence is the development of measures that are robust to variations in distribution as well as missing data. In recent work, we have shown that information-theoretic measures that are specifically designed to address these challenges are effective in capturing the relationship between the functional coherence and network proximity of pairs of proteins [9]. In this paper, we build upon existing methods for quantifying functional coherence and topological proximity through the following key results:

• We propose novel measures for assessing the functional coherence of a group of molecules (in contrast to pairs of molecules).

• We propose the use of information flow based modeling of topological proximity and connectivity in a network of interactions (in contrast to traditionally used interaction counts or shortest paths).

We elaborate on these contributions below.

Traditional measures of functional coherence, including our own prior results [9], have largely focused on pair-wise distance measures. Generalizing from pair-wise measures to coherence measures for sets of molecules adds significant complexity. For example, in testing the hypothesis that functional modularity is related to connectivity in PPI networks, it is common to investigate the functional purity of groups of proteins that induce dense subgraphs in the network [10]. While these enrichment-based methods have been widely used, they provide common overrepresented GO terms in a given set. They do not, however, provide a measure for the homogeneity of underlying modules (sets). We show that simple extensions of pair-wise measures to group measures by averaging, taking the min, max, or other such associative operations result in sub-optimal set-coherence measures. We propose novel measures of homogeneity of entire protein sets and demonstrate their superiority over generalized pair-wise measures on known groups of homogeneous complexes as compared to a control of randomly generated protein sets.

Topological information is used to identify functionally related proteins using shortest paths or density of direct interactions [1,5]. However, evidence suggests that multiple alternate paths between functionally associated proteins are often conserved through evolution, owing to their contribution to robustness against perturbations, as well as amplification of signals [11]. Consequently, consideration of multiple paths between molecules in a network of interactions is likely to be more effective in capturing the functional association between these molecules. Furthermore, consideration of alternate paths may account for missing data and noise in PPI networks [12]. There exist many methods for the assessment of network proximity based on the multiplicity of paths between nodes, including effective resistance [13], commute distance [14], and random walk proximity [15]. In this paper, we adapt an abstraction that models information flow in the cell using random walks with restarts [16].

Several methods have been proposed for assessing functional similarity of biological entities (genes, proteins, domains) [17-19]. Since the functional categories in which these entities are categorized are themselves interrelated through a taxonomy (*e.g.*, Gene Ontology), measures for similarity must consider the underlying taxonomy while comparing molecules in terms of their functional annotation [20]. Various approaches take into account different factors, including taxonomic distance, specificity/generality (rank in hierarchy) of common ancestors, and associated number of molecules for the functional terms being compared (statistical significance or information content). Since most molecules are associated with multiple functional terms, assessment of functional similarity between two molecules poses the additional challenge of evaluating the similarity between two *sets *of terms, as opposed to a *pair *of terms. In [9], we developed an information theoretic measure for computing similarity of two sets of terms associated with a pair of molecules. We show that our measure is superior to other composite measures computed by applying associative operators (average, max, etc.) to pairwise term similarity measures.

In this paper, we generalize and extend our results to quantify the functional coherence (or similarity) of a *set *of biomolecules (as opposed to a pair). Since each molecule corresponds to a set of annotations, the problem is one of quantifying the coherence of a set of sets of terms. A straightforward approach to this would compute pairwise similarities of each pair of molecules in the set and to aggregate them using associative operators (min, max, average). Pairwise similarities (similarity of two sets of annotations) may themselves be computed using our information theoretic measure. An alternate approach to the problem, proposed in this paper, computes the coherence of the set of molecules without computing intermediate pairwise similarity scores. We show that the latter approach is strictly superior to the former in quantifying the coherence of a set of biomolecules. We validate this claim by applying our proposed measure, along with several other currently used measures to a test group of known functionally related proteins. We also apply the measures to randomly generated groups and identify measures that induce the greatest separation between the test and random groups.

Finally, in order to study the correlation between functional coherence and topological proximity in networks, we also need a measure for topological proximity. Traditional measures of topological proximity rely on the shortest path between two nodes. While this measure is more suited to well-curated and complete datasets, it is susceptible to missing interactions and noise. A single false positive or negative may lead to significant (erroneous) perturbation in shortest-path based measures. Measures based on random walks with restart [16], on the other hand, are more resilient to incomplete and noisy data. We consider both classes of measures of topological proximity, and evaluate their correlation with various functional similarity measures for both protein interaction (PPI) and domain interaction (DDI) networks. We show that a combination of random-walk based topological proximity and our similarity measure([9]) yield the strongest correlation between network proximity and functional coherence.

Let *C *= {*c*_{i}|1 ≤ *i *≤ *N *} be a finite partially ordered set of concepts. In terms of Gene Ontology (GO), these concepts represent the GO terms in the sub-ontologies (*i.e.*, molecular function, biological process, and cellular component). Without loss of generality, we refer to concepts as terms throughout this paper. Terms are related to each other through *is a *and *part of *relationships, such that *c*_{i }→ *c*_{j }denotes *c*_{i }*is a/part of c*_{j}. Note that, if *c*_{i }→ *c*_{j}, then the molecules associated with *c*_{i }are also associated with *c*_{j}, known as the *true path rule*. Based on these relationships, we define a binary relation over *C*, denoted by . We say *c*_{j }is an ancestor of *c*_{i}, denoted by *c*_{i } *c*_{j }if and only if either *c*_{i }→ *c*_{j}, or for some ≥ 1, there exist for 1 ≤ ≤ 1 such that for 1 ≤ < l, and (*c*_{j }is an ancestor of *c*_{i }in GO hierarchy). Two terms *c*_{i}, *c*_{j }are comparable, denoted by *c*_{i }~ *c*_{j}, if either *c*_{j } *c*_{i }or *c*_{i } *c*_{j}. If *c*_{i }and *c*_{j }are comparable, then the shortest path between *c*_{i }and *c*_{j }is given by *L*(*c*_{i}, *c*_{j}) = *L*(*c*_{j}, *c*_{i}) = + 1 for minimum such .

We denote the set of ancestors of a term *c*_{i }by *A*_{i }= {*c*_{k } *C*|*c*_{i } *c*_{k}}. Note that, not all ancestors of a term are comparable, since the GO hierarchy is a directed acyclic graph, as opposed to a tree. We represent the root term of GO with a terminal concept *r*, such that *c*_{i } *r *∀*c*_{i } *C*.

Semantic similarity measures quantify the similarity between two terms based on the underlying taxonomical relationships. The *information content *based measure of semantic similarity quantifies similarity between a pair of terms by taking into account the distribution of terms among molecules. Specifically, it rewards infrequent similar terms, over those that are frequent. Let *G*_{c }be the set of molecules associated with term *c *in the available database, with *G*_{r }being the set of all molecules. The information content of a term is defined as *I*(*c*) = - log2(|*G*_{c}|/|*G*_{r}|) [20]. Clearly, *I*(*r*) = 0, and as a consequence of the true path rule, *I*(*c*_{j}) ≥ *I*(*c*_{i}) for *c*_{j } *c*_{i}. Then, the semantic similarity between two terms is defined as

(1)

Here, is said to be the *minimum common ancestor *of *c*_{i }and *c*_{j}.

Observe that this measure does not take into account the specificity of terms with identical common ancestors. This problem can be alleviated by normalizing the similarity between two terms by the self-similarities of the terms being compared, *e.g.*, by [21]. Note, this measure has a well defined maximum of 1 and offer bounded interpretation (ranging from 0 to 1) of Resnik's metric. We now generalize these term-similarity measures to set-similarity.

Biomolecules are generally associated with multiple molecular functions and often involved in multiple processes. Consequently annotations of molecules correspond to sets of terms, as opposed to individual terms. While assessing the similarity of sets of terms, we assume that the sets are non-redundant, *i.e.*, each set consists of terms that are not comparable. This can be easily enforced by ensuring that each branch in the hierarchy is represented by at most one term in each set. In GO, this involves considering only the *most specific annotations *associated with a gene, which provides a non-redundant representation of functional annotation. In this representation, the association between the gene and the ancestors of the most specific term is implied by the true path rule.

An important challenge in the assessment of the functional coherence of sets is that these sets are often incomplete (that is, for many molecules, some of their functions are unknown). Therefore, a reliable measure is one that rewards the abundance of similar terms in the terms, but does not penalize existence of unrelated terms in one of the sets, since the relation between these terms and the other set may be currently unknown. Simple associative measures that aggregate the similarity of pairs of terms in the two terms, such as average (*ρ*_{A}) [17], maximum (*ρ*_{M}) [22], or average of maximums (*ρ*_{H}) [18] do not satisfy these properties [9].

Motivated by these considerations, in prior work, we extend the notion of minimum common ancestors to sets of terms, and generalize the concept of information content from a single term to a set of terms [9]. Let be the minimum common ancestor set of term sets *S*_{i }and *S*_{j}, and denote a generalized union operator that preserves non-redundancy by keeping the most specific terms. The similarity between two term sets is defined as the information content of the set of minimum common ancestors, *i.e.*,

(2)

where is the set of molecules that are associated with all terms in the set Λ(*S*_{i}, *S*_{j}). Note that *ρ*_{I }also needs to be normalized with respect to self similarities, *i.e.*, *ρ*_{JC }= 1/(*ρ*_{I}(*S*_{i}, *S*_{i}) + *ρ*_{I}(*S*_{j}, *S*_{j}) - 2*ρ*_{I}(*S*_{i}, *S*_{j}) + 1)

Let be a set of *n *molecular entities (genes, proteins, domains), with each entity being associated with a set of terms, *i.e.*, = {*S*_{1}, *S*_{2}, ..., *S*_{n}}. We aim to develop a measure *σ*() to assess the functional coherence of this set, such that a larger *σ *indicates more semantic similarity between the terms in sets *S*_{1}, *S*_{2}, ..., and *S*_{n}. Without loss of generality, we call a module, since the objective here can also be considered as assessing the modularity of . We consider various measures to assess the functional coherence of a module, which are discussed below. In order to illustrate each measure, we use a running example based on the ontology shown in Figure Figure1.1. In the figure, let _{1 }= {*S*_{1}, *S*_{2}, *S*_{3}, *S*_{4}} be a module that can be interpreted as a complex composed of two sub-complexes _{2 }= {*S*_{1}, *S*_{2}, *S*_{3}} (with the shared term *c*_{4}) and _{3 }= {*S*_{3}, *S*_{4}} (with the shared term *c*_{6}), in which *S*_{3 }"bridges" the two sub-complexes _{2 }and _{3}.

A straightforward way of computing set coherence is to compute the average of the pairwise *n*(*n *- 1)/2 set similarity scores [19,23]:

(3)

In our running example, the average pairwise information content of the molecules in complex _{1 }is given by *σ*_{A}(_{1}) = (*I*(*c*_{4}) + *I*(*c*_{4})/2 + 0 + *I*(*c*_{4})/2 + 0 + *I*(*c*_{6})/4)/6 = 3/8, while that of sub-complex _{2 }is given by *σ*_{A}(_{2}) = (*I*(*c*_{4}) + *I*(*c*_{4})/2 + *I*(*c*_{4})/2)/3 = 2/3, given that *I*(*c*_{4}) = *I*(*c*_{6}) = -log_{2 }(3/6) = 1. Bridged complexes get lower score than specialized complexes due to differences in sub-complex annotations.

It is possible to extend the notion of the minimum common ancestor of pairs of terms to tuples of terms as . In the other words, the minimum common ancestor of a set of *n *terms is defined as the most specific among the terms that are common ancestors of all of *n *terms in the set. Then, for each n-tuple *c*_{1 } *S*_{1}, *c*_{2 } *S*_{2}, ..., and *c*_{n } *S*_{n}, the functional coherence of these terms can be quantified as . Consequently, the minimum common ancestor set of *S*_{1}, *S*_{2}, ..., *S*_{n}can be computed as

leading to a generalization of the information content based measure:

(4)

In our running example, since *λ*(*c*_{4}, *c*_{1}) = *λ*(*c*_{4}, *c*_{6}) = *λ*(*c*_{4}, *c*_{1}, *c*_{6}) = *r*, we have Λ(_{1}) = {*r*}, thus the generalized information content of complex _{1 }is *σ*_{I}(_{1}) = *I*(*r*) = 0. On the other hand, since Λ(_{2}) = {*c*_{4}}, we have *σ*_{I}(_{2}) = *I*(*c*_{4}). As illustrated by this example, *σ*_{I }is a rather conservative measure of functional coherence and it only rewards specialized modules in which all molecules share very similar functions.

We extend the graph information content measure proposed by Pesquita *et al. *[24]. The idea behind this approach is that, if a group of molecules are coherent, then the information content of the DAG induced by the intersection of ancestors is close to the information content of the DAG induced by the union of ancestors. In other words, defining as the ancestor set *S*_{i}, graph information content of set is defined as

(5)

Observe that, if all molecules are annotated with the same set of terms, *σ*_{G}() would be equal to one, and zero if they have no common terms. Similar to *σ*_{I}, a drawback of this measure is its sensitivity to outliers; that is, if a single molecule in the set is sufficiently functionally different it has a significant impact on the score. Indeed, in our running example, we have *σ*_{G}(_{1}) = *I*(*r*) = 0, while *σ*_{G}(_{2}) = (*I*(*c*_{4}) + *I*(*c*_{2}))/(*I*(*c*_{4}) + *I*(*c*_{2}) + *I*(*c*_{6}) + *I*(*c*_{3})) = 1/2, given that *I*(*c*_{2}) = *I*(*c*_{3}) = -log_{2 }(3/6) = 1.

Complexes are functionally cohesive modules, but they are often composed of sub-complexes, each performing a specific part of the general function of the complex [25]. However, as illustrated by our running example, generalized information content (*σ*_{I}) and graph information content (*σ*_{G}) require all molecules to be functionally coherent with each other for the module to be considered coherent. In order to provide a more relaxed, and biologically motivated measure of functional coherence, we consider shared functionality between all combinations of molecules and weigh the information content of shared functionality by the number of molecules that contribute to the shared functionality.

Specifically, let be the set of terms in the ancestor set of *S*_{i }that are not shared with any other molecule in . Then, weighted information content of set is defined as the ratio of the information content of all terms that are shared in at least two molecules to the information content of all terms associated with at least one molecule in the set; that is:

(6)

In other words, we consider all the partial DAGs () generated by each *S*_{i }in . All the terms that are part of overlapping DAG correspond to shared information among those proteins. The numerator in the above equation corresponds to the information content of the overlapping DAG, while the denominator normalizes that score with total information of the combined DAG. In our running example, we have

*σ*_{W }(_{1}) = (3*I*(*c*_{4}) + 3*I*(*c*_{2}) + 2*I*(*c*_{6}) + 2*I*(*c*_{3}))/(3*I*(*c*_{4}) + 3*I*(*c*_{2}) + 2*I*(*c*_{6}) + 2*I*(*c*_{3}) + *I*(*c*_{1})) = 0.86 and *σ*_{W }(_{2}) = (3*I*(*c*_{4}) + 3*I*(*c*_{2}))/(3*I*(*c*_{4}) + 3*I*(*c*_{2}) + *I*(*c*_{6}) + *I*(*c*_{3})) = 3/4, given that *I*(*c*_{1}) = - log_{2 }(2/6) ≈ 1.6 Since members of the module _{1 }share all functions other than *c*_{1}, this measure captures the coherence of the bridged module better than other methods. This method only penalizes for functions which are not shared by a member with rest of the module.

We now discuss how coherence scores are processed to make them comparable against each other for different module sizes and across various sub-ontologies.

The scores discussed above can be based on any of the three sub-ontologies of GO. Since cellular component annotations are sparser than annotations of biological process and molecular function, we use the method proposed by Schlicker et al. [26]. For pairs of molecules, we combine the two coherence scores obtained from biological process and molecular function ontologies as:

where max *ρ*^{(BP) }and max *ρ*^{(MF) }are the maximum possible scores for biological process and molecular function, respectively. Module coherence scores (*σ*) are based only on biological process ontology.

In order to compare modules of different sizes, we normalize the functional coherence scores based on a background distribution that characterizes the coherence of modules of identical size. Specifically, for a given module , we generate a sufficiently large number of random modules of size || and compute the functional coherence of each of these modules. Then, letting denote the average functional coherence of these modules, we compute the size-adjusted coherence score of as .

In order to compare various measures of functional coherence, we assemble a positive (test) group and a randomly selected (control) group of proteins. The positive set comprises of proteins that are known to be functionally related based on prior biological knowledge (*i.e.*, they are known to exist in complexes and perform related functions). Clearly, if we plot coherence values for samples from the test set and from the control set, we expect to see two distinct distributions - samples from the test group are expected to have higher coherence scores than those from the control group. The separation of the two distributions induced by each method indicates the fitness of the measure in quantifying coherence in sample sets, in terms of distinguishing coherent and arbitrary sets. This separation is quantified as:

which is proportional to the area under the binormal ROC curve [27]. Here, *T *and *C *denote the sets of test and control modules, respectively.

The most commonly used measure of topological proximity is graph distance, where the distance between a pair of nodes in a connected graph is defined as the length of the shortest path between them. In the context of biological networks, there are several drawbacks to this measure. It is particularly susceptible to missing or incorrect data - *i.e.*, a single missing edge may reduce proximity significantly, alternately, a single false edge may increase proximity incorrectly [28]. Furthermore, this measure does not take into account the global structure and connectedness of the graph, with alternate paths between a pair of nodes.

Nodes connected to each other via disjoint paths are likely to be functionally closer than nodes that are connected via a single path. Indeed, evidence suggests that multiple alternate paths between functionally associated proteins are often conserved through evolution, owing to their contribution to robustness against perturbations, as well as amplification of signals [11].

To alleviate these drawbacks, we consider an alternate measure that captures the multi-faceted relationship between a pair of nodes [16]. This measure uses a random walk with periodic restarts to estimate the affinity between pairs of nodes. In this model, the random walk is initiated at node *i*, with neighbor transition probability proportional to edge weight, and at each step, the walk returns to source node *i *with probability *c*. The proximity of node *j *to node *i *is defined as the relative amount of time spent at node *j *by such an infinite random walk. It can be shown that the proximity of all nodes to node *j *can be computed iteratively as

Here, **W **is the stochastic matrix derived from the adjacency matrix of the network, is the restart vector with if *j *= *i *and 0 otherwise, and . Then, the proximity of node *j *to node *i *is given by . Repeating this procedure for all proteins, we obtain a matrix of network proximity scores for all pairs of proteins. Note, however, that this measure of proximity is not symmetric (proximity of *j *to *i *is not necessarily equal to the proximity of *i *to *j*). Therefore, we take the average of the two proximity values to compute the proximity between a pair of proteins. Using the proposed measures of functional coherence and the random-walk based measure for topological proximity, we quantify the relationship between topological proximity and functional coherence by computing the correlation of the resulting matrices.

We obtain **protein interaction data **for *S. cerevisiae *and *S. pombe*, from the BioGRID database [29] version 2.0.51. We filter the dataset to obtain a set of physical interactions between proteins, *i.e.*, genetic interactions are removed based on experiment systems (*e.g.*, knockout experiments) mentioned on the BioGRID website. Integr8 [30] is used to map the proteins in the interaction dataset to their Uniprot names, keeping only those proteins that we can map to a Gene Ontology term using Integr8.

We obtain **domain interaction data **from the DOMINE database [31] version 1.1. This dataset is composed of known, as well as predicted domain interactions. Based on the source and quality of the data, we partition this dataset. **Struct **interactions are inferred from PDB entries of protein complexes and are collected from iPfam and 3did. **Comp-2 **interactions are predicted by at least two computational methods that infer domain interactions from protein interaction networks using techniques such as maximum likelihood estimation or from co-evolution of conserved sites in protein sequences. **HC+MC **interactions consists of high and medium confidence interactions (for details, please refer to [31]).

To test the **functional coherence of sets**, we obtain positive and random cases from GRIP [32]. GRIP generates positive cases from MIPS CYGD complex catalogue [33] by picking sets from known complexes. For wildtype cases, GRIP selects proteins at random. We generate a total of 16 datasets of which eight are made up of positive cases and eight are random. Each set consists 2000 sets of proteins (complexes), ranging from four to eleven proteins each.

**Gene Ontology Annotation **(GOA) [34] release 47.0 dated 2009/03/09 is used to obtain annotation information for Uniprot proteins. GOA combines manual and automated inferences of gene product annotations. The mapping of Pfam-A domains to their Gene Ontology functions is obtained from pfam2go http://www.geneontology.org/external2go/pfam2go released on 2009/03/04. We use only the Biological Process and Molecular Function sub-ontologies of Gene Ontology [6] version 1.550 for evaluation, since the coverage for the Cellular Component sub-ontology is relatively sparse.

We first compare the behavior of the molecular similarity metrics by examining their correlation with different topological proximity measures, and follow with a detailed look at their behavior on comprehensive PPI and DDI data. We then investigate the differences between PPI and DDI networks in terms of the relationship between network proximity and functional similarity using our generalized information content based metric. Finally we compare various measures for computing the functional coherence of sets. To evaluate similarity vis-a-vis proximity, we compute, for every pair of nodes in a network, the shortest distance between them, proximity for a given value of *c*, and various semantic similarity measures. Using these, we compute correlation of topological proximity metrics and functional similarity measures. As in [9] we normalize raw similarity scores to obtain a mean similarity score of zero and standard deviation of one. We create groups of pairs based on their topological proximity and compute average semantic similarity for each group.

We first evaluate the proximity measure based on random walks. Since the parameter *c *can be varied to perturb the affinity between nodes, we first estimate an optimal value for *c*. We compute the proximity matrix for various values of *c*, ranging from 0.1 to 0.9, for the domain network *HC+MC*. We also compute the semantic similarity scores for different metrics - average of information content (IC) based term similarity (*ρ*_{A}/*δ*_{I}), average of self-normalized IC based term similarity (*ρ*_{A}/*δ*_{JC}), IC based molecule similarity (*ρ*_{I}), and self-normalized IC based molecule similarity (*ρ*_{JC}). We compute the correlation between these computed functional similarity scores and topological proximity. Semantic similarity is computed for the *biological process *(BP) and *molecular function *(MF) ontology separately, as well as by combining the two scores.

As evident in Figure 2(a), for *c *= 0.3, we obtain the best correlation between the proximity matrix and any semantic similarity metric using combined BP and MF ontology. For further analyses, we use this value (*c *= 0.3) to compute the proximity matrix. In this network we also note that topological proximity (*c *= 0.3) has much better correlation with functional similarity than shortest path, for all similarity metrics. This validates our proposed use of random-walk based topological proximity measure. Indeed, this behavior follows our hypothesis that since proximity takes into account all paths from one node to another, two nodes connected in multiple ways are expected to be more functionally similar.

In Figure 2(b) we plot the correlation of topological proximity and semantic similarity measures using BP, MF and both ontologies. BP offers slightly better correlation than MF. In general, MF corresponds to a lower level property of a molecule directly related to its structure. BP is a higher level construct, related to the wider neighborhood in the network. Hence interacting molecules are more likely to belong to the same processes even if they have different functions. Finally the correlation obtained by combining the two ontologies is higher than taking them separately.

Using the measure *ρ*_{JC }by combining BP and MF ontology, we compare the relationship between functional similarity, random walk based network proximity (Figure (Figure3a),3a), and network distance (Figure (Figure3b).3b). We plot the normalized average semantic similarity, as in the previous case, for the PPI and DDI networks for various groupings of proximity values and shortest path distances. Each bin in Figure Figure3a3a is adjusted such that the number of pairs in each bin in Figure Figure3a3a is approximately equal to that in Figure Figure3b.3b. As evident in the figure, the larger the proximity (between a pair of nodes) the (more) similar their functions. Conversely, lesser the distance between a pair of nodes, higher their similarity. Larger the slope between the groupings the better the measure performs (or dataset is) in grouping similar functioned molecules together. For both proximity measures, we find that DDIs have better functional similarity than PPIs, as also noted in [9]. Further, it is apparent that the relationship between functional similarity and topological proximity is stronger in computationally inferred DDI networks than that in PPI networks. Among the PPI networks, *S. cerevisiae*, which is the most completely annotated and studied, we observe stronger correlation between functional similarity and both proximity measures, compared to other PPIs.

Correlation of proximity measures and similarity in Figure Figure2a2a provides comparison of the curves in Figures Figures3a3a and and3b.3b. Further comparison of Figures Figures3a3a and and3b3b indicates that the slopes of the curves are are generally higher for random walk based proximity, as compared to shortest path. Again, since proximity binds two nodes not just on topological distance but also on number of paths in between, only strongly correlated nodes have higher proximity values. These observations suggest that network proximity based on random walks is likely to be more relevant to, hence indicative of, functional coherence and modularity.

We evaluate coherence measures for sets using index of detectability on sets with (known) functionally correlated protein and sets of randomly selected proteins. We compute functional coherence using the following measures: average of pairwise information content (using term *σ*_{A}/*ρ*_{A}/*δ*_{I }and molecule *σ*_{A}/*ρ*_{I }based similarity), Generalized Information Content (*σ*_{I}), Extended graph information content (*σ*_{G}), and Weighted information content (*σ*_{W}). As we observe in the previous section, since biological processes span wider neighborhoods, they are more likely to be shared in a module. For this reason, we compute the coherence score using only the biological process ontology. As the index of detectability is a measure of significance, we can plot a curve indicating the threshold corresponding to p-value < 0.05.

Figure Figure44 shows the index of detectability for various measures as the size of modules is varied from 4 to 11. We note that average of pairwise similarity based on molecule IC performs best from small modules, and that its performance remains steady as module size increases. Extended graph information content performs the worst and its performance decreases drastically as module size increases. As the module size increases, we expect the complex to be composed of sub-complexes with specific function, while the overall functionally shared among all molecules in this complex may be general. We see similar behavior in the generalized information content measure. The weighted information content based measure demonstrates improved performance as set size increases. This is because it can detect all shared functionality among sub-complexes that are parts of the entire complex, and have overlaps or bridges among them to carry out the biological tasks. Figure Figure44 also displays a curve indicating the threshold on index of detectability that corresponds to a statistical significance of *p *< 0.05 (according to normal distribution). This curve shows that only the weighted information content and pairwise molecule based similarity metric deliver significant performance in distinguishing known complexes and random sets of genes (*p *< 0.05), and the performance of the proposed measure increases with increasing complex size.

We draw the following conclusions from our study: (i) our proposed measure of functional coherence of sets of entities (proteins, domains) is superior to other existing measures, (ii) we comprehensively study the relationship between functional coherence and topological proximity using suitable measures and derive formal conclusions for process- and function- based annotations, and (iii) we use our measures to study a range of PPIs and DDIs and establish the suitability of these abstractions to various kinds of analyses.

The authors declare that they have no competing interests.

JP developed the methods and performed all analyses. The three authors participated in the conception of the methods and the analysis. Together, the three authors wrote this manuscript.

This research was supported by NSF grants 0800568, 0835541, and 0641037.

This article has been published as part of *BMC Bioinformatics *Volume 11 Supplement 1, 2010: Selected articles from the Eighth Asia-Pacific Bioinformatics Conference (APBC 2010). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S1.

- Sharan R, Ulitsky I, Shamir R. Network-based prediction of protein function. Mol Syst Biol. 2007;3 doi: 10.1038/msb4100129. [PMC free article] [PubMed] [Cross Ref]
- Spirin V, Mirny LA. Protein complexes and functional modules in molecular networks. PNAS. 2003;100(21):12123–12128. doi: 10.1073/pnas.2032324100. [PubMed] [Cross Ref]
- Yook SH, Oltvai ZN, Barabási AL. Functional and topological characterization of protein interaction networks. Proteomics. 2004;4:928–942. doi: 10.1002/pmic.200300636. [PubMed] [Cross Ref]
- Przulj N, Wigle DA, Jurisica I. Functional topology in a network of protein interactions. Bioinformatics. 2004;20:340–348. doi: 10.1093/bioinformatics/btg415. [PubMed] [Cross Ref]
- Lage K, Karlberg OE, Størling ZM, Páll, Pedersen AG, Rigina O, Hinsby AM, Tümer Z, Pociot F, Tommerup N. et al. A human phenome-interactome network of protein complexes implicated in genetic disorders. Nature Biotechnology. 2007;25(3):309–316. doi: 10.1038/nbt1295. http://dx.doi.org/10.1038/nbt1295 [PubMed] [Cross Ref]
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT. et al. Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics. 2000;25:25–29. doi: 10.1038/75556. [PMC free article] [PubMed] [Cross Ref]
- Khatri P, Drãghici S. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics. 2005;21:3587–3595. doi: 10.1093/bioinformatics/bti565. [PMC free article] [PubMed] [Cross Ref]
- Schug J, Diskin S, Mazzarelli J, Brunk B, Stoeckert C. Predicting gene ontology functions from ProDom and CDD protein domains. Genome Res. 2002;12:648–655. doi: 10.1101/gr.222902. [PubMed] [Cross Ref]
- Pandey J, Koyuturk M, Subramaniam S, Grama A. Functional coherence in domain interaction networks. Bioinformatics. 2008;24(16):i28–34. doi: 10.1093/bioinformatics/btn296. http://dx.doi.org/10.1093/bioinformatics/btn296 [PubMed] [Cross Ref]
- Grossmann S, Bauer S, Robinson PN, Vingron M. An Improved Statistic for Detecting Over-Represented Gene Ontology Annotations in Gene Sets. 10th International Conference on Research in Computational Moecular Biology (RECOMB'06) 2006. pp. 85–98.
- Kelley R, Ideker T. Systematic interpretation of genetic interactions using protein networks. Nature Biotechnology. 2005;23(5):561–566. doi: 10.1038/nbt1096. http://dx.doi.org/10.1038/nbt1096 [PMC free article] [PubMed] [Cross Ref]
- Kelley BP, Sharan R, Karp RM, Sittler T, Root DE, Stockwell BR, Ideker T. Conserved pathways within bacteria and yeast as revealed by global protein network alignment. PNAS. 2003;100(20):11394–11399. doi: 10.1073/pnas.1534710100. [PubMed] [Cross Ref]
- Tetali P. Random walks and the effective resistance of networks. Journal of Theoretical Probability. 1991;4:101–109. doi: 10.1007/BF01046996. [Cross Ref]
- Chandra AK, Raghavan P, Ruzzo WL, Smolensky R. In STOC '89: Proceedings of the twenty-first annual ACM symposium on Theory of computing. New York, NY, USA: ACM; 1989. The electrical resistance of a graph captures its commute and cover times; pp. 574–586. full_text.
- Fouss F, Pirotte A, Saerens M. In WI '05: Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence. Washington, DC, USA: IEEE Computer Society; 2005. A Novel Way of Computing Similarities between Nodes of a Graph, with Application to Collaborative Recommendation; pp. 550–556. full_text.
- Tong H, Faloutsos C, Pan JY. Random walk with restart: fast solutions and applications. Knowl Inf Syst. 2008;14(3):327–346. doi: 10.1007/s10115-007-0094-2. [Cross Ref]
- Lord P, Stevens R, Brass A, Goble C. Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics. 2003;19:1275–1283. doi: 10.1093/bioinformatics/btg153. [PubMed] [Cross Ref]
- Schlicker A, Huthmacher C, Ramírez F, Lengauer T, Albrecht M. Functional evaluation of domain-domain interactions and human protein interaction networks. Bioinformatics. 2007;23:859–865. doi: 10.1093/bioinformatics/btm012. [PubMed] [Cross Ref]
- Chagoyen M, Carazo JM, Pascual-Montano A. Assessment of protein set coherence using functional annotations. BMC Bioinformatics. 2008;9:444. doi: 10.1186/1471-2105-9-444. [PMC free article] [PubMed] [Cross Ref]
- Resnik P. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. IJCAI. 1995. pp. 448–453.http://citeseer.ist.psu.edu/resnik95using.html
- Jiang JJ, Conrath DW. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. ICRCL. 1997.
- Sevilla J, Segura V, Podhorski A, Guruceaga E, Mato J, Martínez-Cruz L, Corrales F, Rubio A. Correlation between gene expression and GO semantic similarity. IEEE/ACM Trans Comput Biol Bioinform. 2005;2:330–338. doi: 10.1109/TCBB.2005.50. [PubMed] [Cross Ref]
- Pu S, Vlasblom J, Emili A, Greenblatt J, Wodak SJ. Identifying functional modules in the physical interactome of Saccharomyces cerevisiae. Proteomics. 2007;7:944–960. doi: 10.1002/pmic.200600636. [PubMed] [Cross Ref]
- Pesquita C, Faria D, Bastos H, Ferreira AE, Falcão AO, Couto FM. Metrics for GO based protein semantic similarity: a systematic evaluation. BMC Bioinformatics. 2008;9(Suppl 5):S4. doi: 10.1186/1471-2105-9-S5-S4. [PMC free article] [PubMed] [Cross Ref]
- Gavin AC, Superti-Furga G. Protein complexes and proteome organization from yeast to man. Curr Opin Chem Biol. 2003;7:21–27. doi: 10.1016/S1367-5931(02)00007-8. [PubMed] [Cross Ref]
- Schlicker A, Rahnenfuhrer J, Albrecht M, Lengauer T, Domingues F. GOTax: investigating biological processes and biochemical activities along the taxonomic tree. Genome Biology. 2007;8(3):R33. doi: 10.1186/gb-2007-8-3-r33. http://genomebiology.com/2007/8/3/R33 [PMC free article] [PubMed] [Cross Ref]
- Simpson A, Fitter M. What is the best index of detectability? Psychological Bulletin. 1973;80(6):481–488. doi: 10.1037/h0035203. [Cross Ref]
- Edwards AM, Kus B, Jansen R, Greenbaum D, Greenblatt J, Gerstein M. Bridging structural biology and genomics: assessing protein interaction data with known complexes. Trends Genet. 2002;18:529–536. doi: 10.1016/S0168-9525(02)02763-4. [PubMed] [Cross Ref]
- Breitkreutz B, Stark C, Reguly T, Boucher L, Breitkreutz A, Livstone M, Oughtred R, Lackner D, Bähler J, Wood V, The BioGRID Interaction Database: 2008 update. Nucleic Acids Res. 2007. [PMC free article] [PubMed]
- Kersey P, Bower L, Morris L, Horne A, Petryszak R, Kanz C, Kanapin A, Das U, Michoud K, Phan I. et al. Integr8 and Genome Reviews: integrated views of complete genomes and proteomes. Nucleic Acids Res. 2005;33:297–302. doi: 10.1093/nar/gki039. [PMC free article] [PubMed] [Cross Ref]
- Raghavachari B, Tasneem A, Przytycka T, Jothi R. DOMINE: a database of protein domain interactions. Nucleic Acids Res. 2007. [PMC free article] [PubMed]
- Browne F, Wang H, Zheng H, Azuaje F. GRIP: A web-based system for constructing Gold Standard datasets for protein-protein interaction prediction. Source Code Biol Med. 2009;4:2. doi: 10.1186/1751-0473-4-2. [PMC free article] [PubMed] [Cross Ref]
- Güldener U, Müunsterkötter M, Kastenmüller G, Strack N, van Helden J, Lemer C, Richelles J, Wodak SJ, García-Martínez J, Pérez-Ortín JE. et al. CYGD: the Comprehensive Yeast Genome Database. Nucleic Acids Res. 2005;33:D364–368. doi: 10.1093/nar/gki053. [PMC free article] [PubMed] [Cross Ref]
- Barrell D, Dimmer E, Huntley RP, Binns D, O'Donovan C, Apweiler R. The GOA database in 2009-an integrated Gene Ontology Annotation resource. Nucleic Acids Res. 2009;37:396–403. doi: 10.1093/nar/gkn803. [PMC free article] [PubMed] [Cross Ref]

Articles from BMC Bioinformatics are provided here courtesy of **BioMed Central**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |