Advances in technology have enabled scientists to determine, identify and validate pairwise protein interactions through a range of experimental approaches. Recently, several high-throughput approaches have produced a large scale of protein–protein interaction (PPI) datasets. These approaches include yeast two-hybrid, protein co-immunoprecipitation followed by mass spectrometry (MS), protein chip technologies and tandem affinity purification (TAP) with MS. Such data have led researchers to discover protein functions through PPI networks, in which a node represents a protein and an edge mimics an interaction between two proteins. A fundamental goal here is to discover functional modules or protein complexes in order to predict the function of unannotated proteins.
The fundamental concept of identifying functional modules is that a pair of proteins interacting with each other has higher probability of sharing the same function than two proteins not interacting with each other. The dense sub-networks in a PPI network can therefore be identified as functional modules. Thus, identifying functional modules is similar to detecting communities (clusters) in a network (graph). However, traditional community detection algorithms are usually ‘hard’ clustering algorithms, i.e. they produces non-overlapped clusters, whereas functional modules are highly ‘overlapped’ (Li et al., 2010
). As a result, a number of ‘soft’ clustering algorithms have been recently proposed to identify functional modules in a PPI network, and they can be grouped into three categories.
The first category includes algorithms such as Peacock (Gregory, 2009
), hub-duplication (Ucar et al., 2006
) and DECAFF (Li et al., 2007
). These algorithms identify the bridge nodes at the beginning, i.e. nodes belong to multiple clusters, and then either duplicate or remove the bridge nodes from the network. A hard clustering algorithm is then applied on the modified network. The problem with this approach is that only the identified bridge nodes can belong to multiple clusters, and it is conflicted with the literature (Ashburner et al., 2000
) that a large fraction of proteins belong to multiple functional modules. For example, in the yeast network in BioGRID database (Stark et al., 2011
), there are 3085 proteins annotated by low-level Gene Ontology (GO) terms, whose information content (see Sec 4.1
) is higher than 2.5, and 2392 of 3085 proteins are annotated by at least two of these GO terms.
Algorithms in the second category adopt line-graph transformation. These algorithms (Ahn et al., 2010
; PereiraLeal et al., 2004
) first transform the input network into a line graph, in which a node represents an edge in the original network. Then, a hard clustering algorithm is applied on the line graph, so edges are clustered instead of nodes. A node in the original network belongs to multiple clusters if its incident edges are clustered into different clusters. It has been pointed out in the literature (Fortunato, 2010
) that clustering edges has a similar issue as clustering nodes: a ‘bridge edge’ that connects nodes of different clusters can only be clustered into one cluster by the line-graph technique. Furthermore, while functional modules are so highly overlapped that an interaction might belong to multiple modules, these algorithms cannot successfully identify all overlapped functional modules.
Algorithms in the third category aim to find local dense subnetworks instead of globally clustering a graph. Each node forms a singleton cluster at the beginning, and then each cluster iteratively adds a neighbor node according to different criteria. Algorithms in this category include MCODE (Bader and Hogue, 2003
), CFinder (Adamcsek et al., 2006
), DPClus (Altaf-Ul-Amin et al., 2006
), IPCA (Li et al., 2008
), MoNet (Luo et al., 2007
), CORE (Leung et al., 2009
), COACH (Wu et al., 2009
), DME (Georgii et al., 2009
), RRW (Macropol et al., 2009
), NWE (Maruyama and Chihara, 2011
), SPICi (Jiang and Singh, 2010
), HUNTER (Chin et al., 2010
) and HC-PIN (Wang et al., 2011
Although the resulting clusters could be highly overlapped, one main drawback of those algorithms is that the criterion for adding a node usually considers relatively local topology. Given that PPI networks are estimated to be quite noisy (Brohee and Helden, 2006
), these algorithms could add several nodes connected by noisy edges.
In addition to those in the above three categories, there are some other algorithms, such as RNSC (King et al., 2004
), principal component analysis (PCA)-based consensus clustering (Asur et al., 2007
) and Markov clustering (MCL) algorithm (Dongen, 2000
), which have targeted identification of functional modules. The detail of most above-mentioned algorithms can be found in recent surveys (Fortunato, 2010
; Li et al., 2010
). MCL, which is based on manipulation of transition probabilities or stochastic flows between nodes of the graph, is shown to be particularly noise-tolerant as well as effective in identifying high-quality functional modules (Brohee and Helden, 2006
; Vlasblom and Wodak, 2009
). Several studies, such as (Friedel et al., 2009
), (Moschopoulos et al., 2008
) and (Srihari et al., 2009
), have adopted MCL as a base algorithm to produce more accurate results. Recently, (Satuluri et al., 2010
) propose an efficient and robust variation of MCL, called Regularized MCL (R-MCL). They show that R-MCL's regularize operation and balance parameter can improve the accuracy of identifying functional modules. Nevertheless, MCL and R-MCL only generate non-overlapped clusters, and they always assign all proteins into clusters while not all proteins are functionally annotated. As a result, MCL and R-MCL usually produce more false-positive clusters than other algorithms (Brohee and Helden, 2006
; Li et al., 2010
In this article, we redress the limitation of R-MCL and propose a new variation called ‘Soft’ R-MCL (SR-MCL), which produces overlapped clusters. The intuition of SR-MCL is to produce overlapped clusters by iteratively re-executing R-MCL while ensuring the resulting clusters are not always the same. In order to produce different clusterings in each iteration, the stochastic flows are penalized if they flow into a node that was an attractor node in previous iterations. Since iteratively re-executing R-MCL would produce several redundant and low-quality clusters, a postprocessing is applied to remove those clusters. Only a cluster that is not removed by the post-processing is predicted as a functional module, so not all proteins are assigned into clusters.
We have conducted a series of experiments on three networks in Saccharomyces cerevisiae
. Based on the gold standard annotation, GO terms (Ashburner et al., 2000
), we find that SR-MCL has significantly higher accuracy than R-MCL. SR-MCL also outperforms a range of algorithms on these three networks. Since it has been pointed out that there are different scales of potential functional relevance within a PPI network (Lewis et al., 2010
), we also demonstrate that R-MCL is capable of identifying both the parent module as well as the child module in the GO hierarchy.