With the rapid development of technologies to predict protein interactions, huge data sets portrayed as networks have been available. Most real networks typically contain parts in which the nodes are more highly connected to each other than to the rest of the network. The sets of such nodes are usually called clusters, communities, or modules [1
]. The presence of biologically relevant functional modules in Protein-Protein Interaction (PPI) graphs has been confirmed by many researchers [4
]. Identification of functional modules is crucial to the understanding of the structural and functional properties of networks [6
]. There is a major distinction between two biological concepts, namely, protein complexes and functional modules [7
]. A protein complex is a physical aggregation of several proteins (and possibly other molecules) via molecular interaction (binding) with each other at the same location and time. A functional module also consists of a number of proteins (and other molecules) that interact with each other to control or perform a particular cellular function. Unlike protein complexes, proteins in a functional module do not necessarily interact at the same time and location. In this paper, we do not distinguish protein complexes from functional modules because the protein interaction data used for detecting protein complex in this work do not provide temporal and spatial information.
Recently, many research works have been conducted to solve the problem of clustering protein interaction networks [8
]. Some of them are using the graph-based clustering methods for mining functional modules [11
]. These studies are mainly based on the observation that densely connected regions in the PPI networks often correspond to actual protein functional modules. In short, methods proposed in these studies are used to detect densely connected regions of a graph that are separated by sparse regions. Some graph clustering approaches using PPI networks for mining functional modules are introduced in the following. Bader and Hogue [17
] proposed the Molecular COmplex Detection (MCODE) algorithm that utilizes connectivity values in protein interaction graphs to mine for protein complexes. The algorithm first computes the vertex weight value from its neighbour density and then traverses outward from a seed protein with a high weighting value to recursively include neighbouring vertices whose weights are above a given threshold. However, since the highly weighted vertices may not be highly connected to each other, the algorithm does not guarantee that the discovered regions are dense. A simultaneous protein interaction network (SPIN) introduced by Jung et al [12
] specifies mutually exclusive interactions (MEIs). Taking advantages of the SPINs, SPIN_MCODE has outperformed the plain MCODE method.
Amin et al. [18
] proposed a cluster periphery tracking algorithm (DPClus) to detect protein complexes by keeping track of the periphery of a detected cluster. DPClus first weighs each edge based on the common neighbours between two proteins and further weighs nodes by their weighted degree. To form a protein complex, DPClus first selects the seed node having the highest weight as the initial cluster and then iteratively augments this cluster by including vertices one by one, which are out of but closely related with the current cluster. Li et al. [13
] modified the DPClus algorithm for identifying protein complexes that have a small diameter (or a small average vertex distance) and satisfy a different cluster connectivity-density property. The performance of such algorithms depends heavily on the quality of the seeds and the criterion of extending clusters.
Adamcsek et al. [19
] provided a software called CFinder to find functional modules in PPI networks. CFinder detects the k-clique percolation clusters as functional modules using a Clique Percolation Method (CPM)[20
]. In particular, a k-clique is a clique with k nodes and two k-cliques are adjacent if they share (k - 1) common nodes. A k-clique percolation cluster is then constructed by linking all the adjacent k-cliques as a bigger subgraph. Li et al. [14
] proposed a new clustering algorithm called IPC-MCE to identify protein complexes based on maximal clique, and then extend all the maximal cliques by adding their neighbourhoods iteratively. Liu et al [15
] developed an algorithm called Clustering based on Maximal Cliques (CMC) to discover complexes from the weighted PPI network. CMC first finds maximal cliques from PPI networks, and then removes or merges highly overlapped maximal cliques based on their inter-connectivity. However, CMC generates less number of significant functional modules having P-value less than 1E-5 than the DPClus algorithm in the unweighted PPI network [11
]. Wang et al. [16
] also developed an algorithm called CP-DR based on the new topological model for identifying protein complexes. Wang's algorithm extended the definition of k-clique community of the CPM algorithm and introduced distance restriction.
Above computational studies mainly focus on detecting highly connected subgraphs in PPI networks as protein complexes but ignore their inherent organization. However, recent analysis indicates that experimentally detected protein complexes generally contain Core/attachment structures. Protein complexes often include cores in which proteins are highly co-expressed and share high functional similarity. And core proteins are usually more highly connected to each other and may have higher essential characteristics and lower evolutionary rates than those of peripheral proteins [26
]. A protein complex core is often surrounded by some attachments, which assist the core to perform subordinate functions. Gavin et al.'s work [28
] also demonstrates the similar architecture and modularity for protein complexes. Therefore, protein complexes have their inherent organization [26
] of core-attachment. To provide insights into the inherent organization of protein complexes, some methods [21
] are proposed to detect protein complexes in two stages. In the first stage, protein complex cores, as the heart of the protein complexes, are first detected. In the second stage, protein complexes are expanded by incorporating attachments into the protein complex cores. Wu et al. [21
] presented a COre-AttaCHment based method (COACH) and Leung et al. also developed an approach called CoreMethod. These approaches are used to detect protein complexes in PPI networks by identifying their cores and attachments separately [29
]. To detect cores, COACH performs local search within vertex's neighbourhood graphs while the CoreMethod [29
] computes the p-values between all the proteins in the whole PPI networks.
In this paper, a Greedy Search Method based on Core-Attachment structure called GSM-CA is introduced. Comparing with the other methods of core-attachment, the new edge weight calculation method and evaluation criterion for judging a node as a core node or an attachment node are proposed in our GSM-CA method. The GSM-CA method uses a pure greedy procedure to move a node between two different sets. The detected clusters are also core-attachment structures. In particular, GSM-CA firstly defines seed edges of the core from the neighbourhood graphs based on the highest weight and then detects protein-complex cores as the hearts of protein complexes. Finally, GSM-CA includes attachments into these cores to form biologically meaningful structures. The new algorithm is applied to the protein interaction network of S. cerevisiae. The modules identified by the new algorithm are mapped to the MIPS [22
] benchmark complexes and validated by GO [23
] annotations. The experimental results show that the identified modules are statistically significant. In terms of prediction accuracy, the GSM-CA method outperforms several other competing algorithms. Moreover, most of the previous methods can not detect the overlapping functional modules by generating separate subgraphs. But GSM-CA can not only generate non-overlapping clusters, but also overlapping clusters.
The GSM-CA method achieves high accuracy. However, it is computationally expensive. Many module detection approaches are based on the traditional hierarchical methods, which is also computationally inefficient because the hierarchical tree structure generated by the repeated computational process cannot provide adequate information to identify whether a network belongs to a module structure or not. To further improve the computational process of these module detection approaches, the Greedy Search Method based on Fast Clustering (GSM-FC) is proposed in this paper. The edge weight based GSM-FC method uses the greedy procedure to traverse all edges just once to separate the network into the suitable sets of modules. The experimental results demonstrate that the newly proposed algorithm can reduce the computational time noticeably while maintaining high prediction accuracy compared to GSM-CA.
Briefly then, the outline of this paper is as follows. In Section 2 the implementation of our two methods are described in details. In Section 3, our algorithm is applied to the protein interaction network of S. cerevisiae yeast and the results are analyzed. In Section 4, the conclusions are given.