Interpretation of the completed biological genome sequences initiated a decade of landmark studies addressing the critical aspects of cell biology on a system-wide level, including gene expression analysis [
1,
2], gene disruptions detection [
3,
4], identification of protein subcellular location [
5,
6] and so on. An important and challenge task in proteomics is the detection of protein complexes from the available protein-protein interaction (PPI) networks generated by various experimental technologies such as yeast-two-hybrid [
7], affinity purification [
8], mass spectrometry [
9], etc.
Protein complexes, consisting of molecular aggregations of proteins assembled by multiple protein interactions, are of the fundamental units of macro-molecular organizations and play crucial roles in integrating individual gene products to perform useful cellular functions. It is confirmed by the fact that the complex 'RNA polymerase II' transcribes genetic information into messages for ribosomes to produce proteins. Unfortunately, the mechanism for most of biological activities is still unknown and hence accurately predicting protein complexes from the available PPI data has a considerable merit of practice because it allows us to infer the principles of biological processes.
The general methods for protein complexes prediction are based on experimental and computational notions. Experimentally, the Tandem Affnity Purification (TAP) with mass spectrometry [
9] turns out to be popular. However, it is far away from being a satisfying answer because of the limits on TAP [
10]. For example, the transient low affinity protein complexes may be excluded because of the washing and purification operations in the TAP-MS. At the same time, this experimental approach needs the tag proteins to infer the protein complex. Gavin
et al. [
8] have indicated that only limited known yeast protein complex subunits can be extracted by the TAP-MS. Moreover, Schonbach [
11] showed that, in order to validate the experimental results using the subcellular localization information, a preparation of subcellular fractionated lysates is a must. But the preparation procedure is time-consuming. That's why the computational approaches are becoming promising alternatives to complement the experimental ones.
Generally, protein interaction data can be effectively modeled as a graph (also called a network) by regarding each protein as a vertex and each known interaction between two proteins as an edge. Although there are plenty of related results in graph theory and many graph algorithms have been developed, it is still non-trivial to design an efficient algorithm to mine protein complexes from PPI networks. One reason is that there has not been an exact definition for a protein complex. To overcome this difficulty, Tong
et al. [
12] assumed that a protein complex corresponds to a dense subgraph since proteins in the same complex interact frequently among themselves, and similar discussion was also made in Ref. [
13].
Although it is non-trivial to design effective and efficient computational methods for predicting complexes, many algorithms have been devoted to the issue. Markov Cluster Algorithm (MCL) [
14,
15] simulated random walks within graphs based on the intuition that a walker started at an arbitrary protein and visited a neighborhood vertex with a predefined probability. If he walked into a dense region, it is hard to get out of the region. Molecular Complex Detection (MCODE) [
16] relied on the topological structure of a network, where it is assumed that a protein belongs to some complex if it has a subset of neighbors with high degree and there are many interactions among them. CFinder [
17] defined a dense subgraph by using the concept of adjacent
k-cliques. Other non-topological properties such as the functional information [
18] and data of protein binding interface [
19] are also incorporated into algorithms with an immediate purpose to improve the accuracy of prediction. In addition, there are some others relying solely on TAP data [
20-
22], which can be summarized as two points: first, a reliable PPI network is constructed by applying specific scoring strategies based on the purification records and selected protein interactions with high scores; second, some existing algorithms are employed to detect dense clusters in the newly constructed networks.
Except the biological information, some newly developed algorithms using the core-attachment structure in complexes revealed by Gavin
et al. [
8] (As shown in Figure ). Leung
et al. [
23] proposed the CORE algorithm, a statistical framework to identify protein-complex cores. The probability for two proteins to be in the same protein-complex core is mainly determined by two factors: whether the two proteins interact or not and the number of their common neighbors. The CORE then calculates the p-values for all pairs of proteins to detect cores. Wu
et al. [
24] presented the Coach consisting of two steps: it first defines core vertices from the neighborhood graphs and then detects protein-complex cores as the hearts of protein complexes; it then includes the attachments into the cores to form biologically meaningful structures. Ma
et al. [
25] showed that the density of a subgraph is insufficient to characterize the complex and further demonstrated that the graph communicability is much better in characterizing the protein complexes. There are also many newly developed techniques for protein complex prediction [
26-
29]. Further information concerning the computational approaches for predicting protein complexes can be obtained from [
30].
The core-attachment based approaches outperform dramatically the available state-of-the-art algorithms, demonstrating the significance of the structure and indicating the critical role of it in discovering protein complexes. This is one of the our major motivations. On the other hand, another major problem confounding the existing computational algorithm is that, available PPI networks are too sparse, for instance, the average numbers of interactions per protein are 5.29, 6.98, and 10.62 in DIP [
31], Krogan [
22], and Gavin [
21], respectively. In these PPI networks, many protein complexes are difficult to be extracted since the sparse networks are full of noises [
32]. Therefore, designing an efficient algorithm that gets rid of the noise is an important and challenging task to predict protein complexes. Unfortunately, previous algorithms did not pay enough attention to the problem since they only filter the noise by deleting nodes with degree 1 based on the fact that the interactions between proteins have lower reliability to the topological reliability measures [
33,
34]. Aside from issues of noise, all the existing computational approaches only make use of the topological structure information from the vertices and fail to take into consideration the roles of edges. It, however, is unreasonable to ignore the roles of edges, say the
weak tie theory [
35] and
percolation [
36], since an edge may play an important role in enhancing the locality or be significant in maintaining the global connectivity. For example, the famous weak ties theory indicates the job opportunities and new ideas are usually from persons with weak connections. Furthermore, the weak ties can be used to characterized the topological properties of networks such as the stability of biological functions [
37], the accuracy of network structure prediction [
38], the structure in mobile communication networks [
39]. And the percolation characterizes the tendency to undergo a topological phase transition as the number of connections is progressively increased. Motivated by these observations, we pose the following question:
Question: whether the roles of edges can be used in protein complexes detection?
In this study, we aim to investigate the possibility to extract protein complexes by exploring the roles of edges and develop an affirmative answer to the above question. In detail, similar to the weak ties effects in mobile communication [
39] and document networks [
40], we prove complementary results on the PPI networks that is the edges connecting less similar nodes are more significant in maintaining the global connectivity. By using the weak ties and percolation, a reliable virtual network is constructed from the original PPI network, in which each maximal clique corresponds to a protein complex. A core-attachment based method is developed. To test the performance of the proposed algorithm, we applied it to the PPI networks. The experimental results on the yeast PPI network show that the proposed method outperforms DPClus [
41], DECAFF [
42], MCL [
14], MCODE [
16] and Coach [
24]. Further, the analysis of detected modules by the present algorithm suggests that most of these modules have well biological significance in context of complexes, suggesting that the roles of edges are critical in discovering protein complexes.