Accurate interaction networks (interactomes) are fundamental to answering questions about how the biochemical machinery of cells organizes matter, processes information, and carries out transformations to perform specific functions leading to various phenotypes. Toward this goal, a number of experimental
[
1] and computational
[
2-
4] techniques have been devised and applied to map the interactions of human proteins
[
5-
8] and those of model organisms such as yeast
[
9-
12]. Despite their incompleteness
[
13], current interactome maps already serve as a basis for numerous methods aiming to elucidate biological processes in health and disease
[
14,
15]. Current interactome maps are contaminated with false positive interactions that can make up a considerable portion of the data
[
13,
16-
20]. These false positive interactions dim the explanatory light of interaction networks and also decrease the predictive value of methods using such data. It is thus of primary importance to derive confidence values for individual interactions, which can serve to refine current interactome maps or can be used as interaction weights. For example, it has been shown recently that the performance of complex detection approaches is better in confidence-weighted protein-protein interaction networks than in non-weighted networks
[
19,
21].
Several approaches have been proposed for interaction confidence assessment, many of which are reviewed in
[
19,
22,
23]. Most of these methods integrate additional data like interaction homology
[
17], co-expression of genes encoding interacting proteins
[
17,
24,
25], or a combination of these and other evidence features
[
26,
27]. The outcome from such methods depends on the additional data sets. Others combine multiple topological features with additional knowledge to achieve better predictions
[
20,
28]. Methods which are able to use network topology alone to predict interaction veracity
[
29-
32] are the tools of choice for interaction confidence assessment if other types of data are limited or biased.
At various levels (globally as well as locally), the topology of interaction networks encodes biological properties which are largely independent of the biochemical function of the individual members of the network
[
33,
34]. This has been demonstrated through analysis of global properties exploiting topological features such as node degree
[
35] or distance
[
36,
37]. The biological importance of network topology may be even more clear for local structures, as in the case of specific wiring patterns of interaction partners
[
34]. Likewise, modularity of interaction networks is currently the most successful concept for addressing the dynamics of cellular processes
[
8,
38,
39].
Goldberg and Roth
[
29] proposed a connectivity based approach for interaction confidence assessment where the number of common neighbors of a pair of predicted interaction partners counts in support of the interaction. They defined interaction confidence as the level of enrichment of common network neighbors of interacting proteins. It is quantified by the hypergeometric distribution P-value given the number of common neighbors and total network neighbors of both interacting proteins. The underlying principle of the approach has been established in seminal studies demonstrating that biological networks are marked with short interaction paths separating random pairs of proteins in the network (small-world property), and densely connected local neighborhoods (neighborhood cohesiveness property)
[
40]. Real protein-protein interactions are expected to meet the network cohesiveness property more frequently than false positives. More recently, Kuchaiev and co-authors
[
32] proposed another method that embeds interaction networks into a low-dimensional Euclidean space based on network metrics (shortest path length) and then calculates confidence of interactions depending on the Euclidean distance between proteins within that space. The basis of the approach is the geometric graph model that was proposed to better reflect biological networks than e.g. the small-world model
[
41]. Although the biological basis of the geometric graph model remains elusive, the authors show that it measures network distance more reliably. Both of these topology based methods assign confidence as numerical values to protein-protein interactions in a network and are additionally able to predict new interaction candidates by assigning confidence scores to non-interactions. However, both methods have certain shortcomings. The method by Goldberg and Roth is able to assess the confidence of those interactions whose participants have common neighbors only. Often, however, interacting proteins do not share neighbors. The method of Kuchaiev
et al. appears limited in that it requires fixing six free parameters. These include algorithm-specific parameters as well as the prior probability for interactions which depends on knowledge about the interactome size.
Here, we propose CAPPIC (cluster-based assessment of protein-protein interaction confidence) – a novel approach that exploits the inherent modular structure of interactomes for confidence assessment of protein-protein interactions. Our method combines the basic principles of the topology based methods described above: high neighborhood interconnectedness of a couple of proteins and short distance between them (the features exploited by Goldberg and Roth and Kuchaiev
et al., respectively) are indicators that both proteins participate in the same module. We apply Markov clustering
[
42] to the line graph
[
43] of an interaction network to dissect it into modules of interactions. As demonstrated in
[
44], this strategy can generate interaction clusters that significantly overlap with known biological pathways. Notably, the interaction clusters overlap in their protein constitution. This is biologically more meaningful than clustering the proteins into disjoint modules because pathways and protein machineries are known to overlap
[
10,
21]. The rationale behind our approach is that proteins that are specific to certain modules are expected to have more interactions with proteins that are specific to the same modules than with other proteins
[
39]. Intuitively, we assign low confidence to interactions that disagree with the modular structure of biological networks and high confidence to those that comply with it. This rationale has also been used as a basis of approaches for the detection of binary interactions
[
10] or protein complexes
[
45] from complex purification data or to reveal dynamic interaction patterns during the human spliceosome cycle
[
8]. While the aim of CAPPIC is to detect false positive interactions, a different approach, which is however also based on the principle of high link density within network modules, has been proposed for identifying false negatives
[
46].
We applied our method to six large-scale interaction networks from yeast to assess its performance and compare it to previous topology-based methods (Table
). The six networks were fundamentally different with respect to their biological and topological properties as they have been generated using different techniques. These included: 1) a network that was generated using the protein-fragment complementation assay (PCA) technology
[
12] (
Tarassov-all); 2) a sub-network of Tarassov-all obtained by the authors after applying several filtering steps
[
12] (
Tarassov-hq); 3) a combined network of interactions found by yeast-two-hybrid (Y2H) screens (
Yu-Ito-Uetz) comprising the networks published by Yu
et al.[
9], Ito
et al.[
47] and Uetz
et al.[
48] (the integrated data set was retrieved from
[
9]); 4) a network of interactions predicted by Collins
et al.[
49] from protein complex data resulting from affinity purification assays coupled to mass spectrometry (AP-MS)
[
10,
11] (
Collins), downloaded from BioGRID
[
50]; 5) a comprehensive physical interaction network from the interaction meta-database ConsensusPathDB, release 6(yeast)
[
51] obtained by the integration of multiple publicly accessible interaction repositories (CPDB-yeast); and 6) a genetic interaction map published by Costanzo
et al.[
52] obtained at a stringent experimental cutoff (
Costanzo). The physical interaction networks constitute a representative benchmark since they result from different, major interaction detection techniques: yeast-two-hybrid, protein-fragment complementation, affinity purification, and integration of interaction data obtained with different methods. We applied our method additionally to the genetic interaction map by Costanzo
et al. to provide evidence that it is not limited to physical interactome maps. To show that CAPPIC’s performance was consistent across taxonomic species, we also applied it to two human networks. The first was obtained by merging the 15 largest, high-quality human yeast-two-hybrid data sets including refs.
[
5-
8] (Additional file
1: Table S1) (
Y2H-human). The second network corresponded to the top 5% interactions from a probabilistic binary data set generated by Mazloom
et al.[
53] from mass spectrometry-based analysis of 3,290 immuno-precipitation experiments
[
54] (
Mazloom). The properties of the two human networks are summarized in Additional file
2: Table S2.
| Table 1Yeast interactome maps used in this study for method evaluation |