PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of bmcbioiBioMed Centralsearchsubmit a manuscriptregisterthis articleBMC Bioinformatics
 
BMC Bioinformatics. 2010; 11: 521.
Published online 2010 October 19. doi:  10.1186/1471-2105-11-521
PMCID: PMC2974752

Discover Protein Complexes in Protein-Protein Interaction Networks Using Parametric Local Modularity

Abstract

Background

Recent advances in proteomic technologies have enabled us to create detailed protein-protein interaction maps in multiple species and in both normal and diseased cells. As the size of the interaction dataset increases, powerful computational methods are required in order to effectively distil network models from large-scale interactome data.

Results

We present an algorithm, miPALM (Module Inference by Parametric Local Modularity), to infer protein complexes in a protein-protein interaction network. The algorithm uses a novel graph theoretic measure, parametric local modularity, to identify highly connected sub-networks as candidate protein complexes. Using gold standard sets of protein complexes and protein function and localization annotations, we show our algorithm achieved an overall improvement over previous algorithms in terms of precision, recall, and biological relevance of the predicted complexes. We applied our algorithm to predict and characterize a set of 138 novel protein complexes in S. cerevisiae.

Conclusions

miPALM is a novel algorithm for detecting protein complexes from large protein-protein interaction networks with improved accuracy than previous methods. The software is implemented in Matlab and is freely available at http://www.medicine.uiowa.edu/Labs/tan/software.html.

Background

Protein complexes carry out the majority of biological processes within a cell. Correctly identifying protein complexes in an organism is useful for deciphering the molecular mechanisms underlying many cellular functions. Recent advances in proteomics technologies such as two-hybrid system and mass spectrometry has allowed enormous amount of data on protein-protein interactions (PPI) to be released into the public domain [1]. As the amount of global high throughput protein interaction data keeps increasing, methods for accurately identifying protein complexes from such data become a bottleneck for further analysis of the resulting interactome.

There is a large body of research on computational methods for de novo protein complex detection in PPI networks. These methods can be roughly divided into three categories. Methods in the first group define explicit complex criterion such as dense connectivity within a complex. A heuristic search strategy is then employed to identify complexes [2-4]. In contrast, the second group of methods also define a complex criterion but use complete enumeration to find all complexes that satisfy the criterion [5-7]. Instead of using local search strategy, the third group of methods are based on global graph partitioning techniques [8-11]. For instance, maximization of the modularity (Q) measure proposed by Newman and Girvan [12] has been successfully applied to PPI networks [11]. However, the global modularity measure has an inherent resolution limit for detecting small sub-networks [13], such as protein complexes whose median size is fewer than 10 proteins per complex. The reason for this resolution limit is that global modularity uses the entire network to compute the expected connectivity within a set of proteins, which may not be an appropriate measure of the background around protein complexes. Muff et al. [9] introduced a local version of the modularity measure (LQ) by only considering the immediate neighbors of a complex instead of the entire network. Applying it to the PPI network of E. coli, they showed that LQ was better at identifying small but biologically meaningful protein complexes.

Q and LQ represent two extremes of the neighborhood measure used to estimate background connectivity in a random network. Neither may be optimal for a given PPI network. In this study, we introduce a tunable parameter into the original formulation of modularity to help determine the optimal neighborhood size in calculating expected connectivity of a set of proteins. Another drawback of the previous LQ approach is that the computationally expensive optimization technique, simulated annealing, was used to maximize LQ, which is not feasible for large PPI networks such as yeast or human networks although it was proven useful for the smaller E. coli PPI network.

In this paper we introduce a novel algorithm to infer protein complexes by combining a parametric local modularity measure and a greedy search strategy. We evaluate our approach on the yeast PPI networks using two reference sets of protein complexes and additional functional annotations of yeast proteins. Compared to four existing methods, our algorithm achieves a significantly performance improvement in terms of F-measure and biological relevance of predicted complexes. By applying our method to two large-scale PPI networks, we predict a set of 138 novel protein complexes in the baker's yeast S. cerevisiae that warrant future experimental characterization.

Results

Local Modularity with Coarseness Parameter Improves Complex Prediction

Previously, global (Q) [11] and local modularity (LQ) [9] have been proposed as a measure to detect protein complexes in large PPI networks. However, both measures have their drawbacks. The global modularity measure has an inherent resolution limit for small sub-networks such as protein complexes [13]. The local modularity measure only considers first neighbors of a sub-network, which might not provide enough information for estimating the true background connectivity pattern of a random network. In this paper, we propose a new local modularity measure, LQa (parametric local modularity with the coarseness parameter α) for inferring protein complexes in large PPI networks. To compare how effective the three measures are to detect protein complexes, we first implemented three complex detection algorithms using a common greedy search strategy and each of the three modularity measures as the scoring function. We used the yeast full PPI network from the DIP database [14] and two sets of gold standard protein complex annotations (see Methods). As shown in Figure Figure1,1, LQa performed the best in terms of F-measure when evaluated using both gold standard sets. Note that Q and LQ have no coarseness parameter to set and the sets of predicted complexes are the same for the two sets of known annotations. For LQa we set the coarseness parameter to yield the best F-measure for each set of known complexes.

Figure 1
Performance comparison of three modularity measures. The yeast DIP full network was used as input. Optimal coarseness parameter, a, was optimized on three known complex sets separately.

The number and average size of the predicted complexes are listed in Table Table1.1. As expected, Q found a very small number of complexes with a large number of members, which caused a low recall rate and F-measure. LQ further resolved those large sub-networks into a number of smaller ones. However, the average size of the predicted complexes (37.5) was still much larger than the average size of known complexes (< 10). In contrast, LQa found a reasonable number of complexes in the same size range as the known complexes.

Table 1
Number and average size (arithmetic mean, in parenthesis) of predicted complexes using three different modularity measures and the DIP PPI network as input.

Putting All Together: the miPALM Algorithm

We introduce a novel algorithm, miPALM (module inference by Parametric Local Modularity), for inferring protein complexes from large-scale protein interactome data. The input to miPALM consists of an un-weighted PPI graph and two parameters, α and δ. The algorithm has three major steps. Algorithmic details of each step and the corresponding pseudo-code are described in the Methods section. We briefly describe the major steps of the algorithm here. First, from the input PPI network, miPALM identifies a set of triangle seeds using topological overlap measure. A pair of nodes in a network has high topological overlap if they are both strongly connected to the same group of nodes (see Methods). Therefore, the use of topological overlap measure serves to exclude spurious or isolated connections in the network. Second, from each seed, the algorithm uses a greedy search to expand it into candidate complex(es). Local modularity is used as a scoring function to assess the quality of a candidate complex. The parameter α is used to control the background neighborhood size around a candidate complex. Finally, a filtering step is performed on the set of candidate complexes based on their density scores which is controlled by the parameter δ. The complete algorithm for complex prediction is shown in Algorithm 4.

Performance Comparison with Existing Methods

Next, we compare the performance of our algorithm with four representative algorithms for protein complex prediction, MCODE [2], MCL [10], COACH [15], and DME [7]. MCODE relies on the concept of K-core (a sub-graph in which all nodes have a degree at least k) and greedy search. MCL is a global graph partitioning algorithm that works by simulating stochastic flows in a graph. COACH is conceptually similar to MCODE. It first identifies the core of a candidate complex (maximal set of connected vertices whose degrees are greater than the network average) and then expand the core by including additional nodes if more than 50% of their edges are shared with the core. DME detects all node subsets that satisfy a user-defined minimum density threshold in a greedy fashion. Of the five algorithms, MCL cannot detect overlapping complexes whereas MCODE, COACH, DME, and miPALM can. Additionally, MCL is a global graph partitioning method whereas the other four are based on seeding and local search.

We tested the performance of all five methods using two sets of known complexes in the baker's yeast, S. cerevisiae. CYC08 is a set of protein complexes manually curated from published small-scale studies [16]. Since most small-scale studies tend to be biased towards complexes involved in a limited number of cellular processes, to complement this set, we also used the YHTP08 set of protein complexes [16]. It was constructed by analyzing two recent and most comprehensive genome-wide protein complex screens based on affinity purification coupled with mass spectrometry experiment [17,18].

For performance comparison we determined the optimal parameters for each algorithm to achieve the highest F-measure, given a gold standard set (see Methods). The comparison results are presented in Figures Figures22 and and33 and Table Table2.2. For each method, we report the precision, recall, and F-measure. As can be seen in Figure Figure2A,2A, both COACH and miPALM achieved a much higher F-measure compared to the other three methods. The average F-measure was 0.42, 0.39, 0.23, 0.16, and 0.12 for COACH, miPALM, MCL, MCODE, and DME, respectively.

Figure 2
Performance comparison of five complex detection algorithms. A) F-measure of the five algorithms with their best parameters optimized to two sets of known complexes CYC08 and YHTP08 using the DIP network as input. B) Precision and recall of the methods. ...
Figure 3
Biological relevance of predicted complexes by five algorithms with optimized parameters and the DIP PPI network as input. A) Percentage of complexes enriched for at least one GO term. B) Percentage of complexes whose members are co-localized in the same ...
Table 2
Statistics of predicted complexes by five algorithms with the best parameters optimized on CYC08 and YHTP08 sets and the DIP PPI network as input.

Figure Figure2B2B shows a breakdown of the F-measure into precision and recall for all five methods. On average, MCL achieved the highest recall mainly due to its large number of predictions. On the other hand, MCODE achieved the highest precision because it tends to identify a subset of known complexes with higher overlap than other methods. However, the overall accuracy of both methods (as measured by the F-measure) was lower than those of COACH and miPALM because MCL had a much lower precision and MCODE had a much lower recall. In other words, the higher F-measure achieved by COACH and miPALM is due to a balanced increase in both their recall and precision.

Although F-measure is a popular metric for evaluating the performance of a complex predictor, it is not the only one. Biological relevance is also an important indicator of the quality of predicted complexes. Accordingly, we next conducted GO term enrichment and co-localization analyses to determine the biological relevance of the predicted complexes. Genome-wide protein localization data has been reported for Baker's yeast using fluorescent imaging [19]. For each predicted complex, we calculated a log-odds score that measures the extent to which members of the complex co-localize to the same sub-cellular compartments (see Methods). Compared to the F-measure that relies on an incomplete gold standard set, both GO term and co-localization annotations used here are more comprehensive and thus complementary to the F-measure.

At a p-value of 0.05, our set of predictions had the highest fractions of complexes with enriched functional categories (Figure (Figure3A).3A). Compared to the second best performer (MCODE), the average increase in the fraction of enriched complexes was 8.9% across the two gold standard sets of complexes. For complex member co-localization, our predictions had an 18.8% average increase compared to the second best performer, DME (Figure (Figure3B3B).

Taken together, our benchmarking analyses demonstrated that miPALM achieved the second highest F-measure (3% lower than COACH) when evaluated using known complexes. On the other hand, miPALM outperforms all other algorithms by a large margin (8.9% and 18.8%) when evaluated using functional annotations of complex members.

Novel Complex Predictions Using Large Yeast PPI Networks

Next, we applied miPALM to discover novel protein complexes in two large-scale yeast PPI networks based on interactions obtained from the BioGRID database [20]. The first network consists of all yeast interactions in the BioGRID database. The majority of interactions are derived from high throughput experiments. The second network consists of high-confidence interactions derived by filtering the BioGRID interactions based on their lines of supporting evidence [21]. For brevity's sake, these two networks are termed BioGRID and HC networks in this paper. The BioGRID network contains 5591 proteins and 51880 physical interactions and the HC network contains 2228 proteins and 6209 physical interactions. By studying two networks with different amount of noise, we can assess the robustness of our method on noisy data.

To predict complexes, we set the coarseness parameter α to be 0.364 that gave the highest F-measure as described in the performance comparison section.

In total, miPALM predicted 168 and 208 protein complexes from the BioGRID and HC network, respectively. The respective F-measures for the two sets of predictions are 0.31 and 0.52 (Figure (Figure4A).4A). As expected, predictions using HC network has a higher F-measure due to the higher quality of the input data. Nevertheless, as shown in Figure Figure5,5, the two sets of complexes overlap by 33.3% (56/168). To assess the significance of the overlap, we also used the other four methods in the benchmarking study to predict complexes in the BioGRID and HC networks. We used the same optimized parameters for each method as described in the performance comparison section. The two sets of complexes predicted by COACH had the highest overlap of 43.3%. The average overlap for the four methods was 26.6%. As an additional check, we considered miPALM predictions using the DIP networks as input. The average overlap between the three sets of predictions is 38.3% (Figures S6, S7 in Additional file 1). Taken together, the high level of overlap between miPALM predictions suggests that it is fairly robust against noisy data.

Figure 4
F-measure and biological relevance of predicted complexes using two large-scale PPI networks as inputs. A) F-measures. Circles, F-measures measured against CYC08 (C) or YHTP08 (Y) using the BioGRID network as input. Squares, F-measures measured against ...
Figure 5
Overlap between miPALM predicted complexes using two large-scale PPI networks as inputs. A) Overlap between protein members in the predicted complexes. B) Overlap between complexes. C) Overlap between novel complexes. Blue, predicted complexes using HC ...

After merging overlapped complexes, we ended up with 322 predicted complexes from the two networks. Two hundred thirty two of these complexes (72.5%) are enriched for at least one GO term (Table (Table3),3), suggesting many of them are true protein complexes. Examined separately, 109 (64.9%) BioGRID and 173 (83.2%) HC predictions are enriched for at least one GO term, respectively (Figure (Figure4B4B).

Table 3
Supporting evidence for novel complexes predicted by miPALM compared to gold standard sets of known complexes.

To further corroborate our predictions, we next used a genome-wide protein localization data set to examine if members of our predicted complexes tend to co-localize in the same sub-cellular compartments. For each of our predicted complex, we calculated a co-localization log-odds score that compares the member co-localization probability of a predicted complex to the probability of the same number of random proteins in the PPI network (See Methods). For the set of 320 predicted complexes, 208 (65.0%) are enriched for at least one sub-cellular compartments (Table (Table3).3). Examined separately, 115 (68.5%) BioGRID and 123 (62.0%) HC predictions are enriched for at least one sub-cellular compartment, respectively (Figure (Figure4B4B).

To identify new complexes in our prediction, we used the union of CYC08 and YHTP08 as the set of known complexes. After filtering those complexes matching any of the known complexes, we were left with 138 novel protein complexes. To evaluate the quality of these novel protein complexes, we computed the fraction of complexes that have enriched GO functional terms or are co-localized to the same sub-cellular compartments. Eight five (61.6%) of the novel complexes were enriched for at least one GO terms and 95 (68.8%) complexes were enriched for at least one sub-cellular compartments (Table (Table3).3). The fraction of GO term enriched complexes was comparable to known complexes. Remarkably, the fraction of co-localized complexes in our prediction was much higher than those of the two gold standard sets (Table (Table3).3). These results provide further evidence that the set of novel complexes are true protein complexes. Information about the complete set of predicted complexes with supporting evidence is reported in Additional files 1, 2, 3 and 4.

Discussion

The global modularity measure proposed by Newman and Girvan [12] identifies clusters (sub-networks) in a network by comparing the observed fraction of edges inside a cluster to the expected fraction of edges in the cluster. In doing so, it assumes that connections between all pairs of nodes in the network are equally probable, which reflects all connectivity among all clusters. However, in many molecular interaction networks, most sub-networks are only connected locally. For instance, in metabolic networks, major pathways occur as clusters that are sparsely linked among each other [22]. The same observation can also be made on protein complexes [23].

In this study, we introduced parametric local modularity as a new measure for the quality of clusters in a network. It takes into account local cluster connectivity and overcomes global network dependency. As an analogy, the coarseness parameter functions as the resolution dial of a microscope. By changing the value of the coarseness parameter, we can adjust the size of the cluster neighborhoods when calculating the expected fraction of edges within a cluster. Since different biological networks might have distinct neighborhood connectivity, a tunable local modularity measure allow us to best estimate the local neighborhood connectivity by changing the size of the neigbhorhood under consideration.

Protein complexes are dynamic molecular entities. Depending on the cellular states, membership of a protein complex could change and different complexes could have shared members [18]. Our algorithm can detect overlapping complexes if during the seed expansion step seeds of different candidate complexes are close enough.

The F-measure used for performance evaluation is a popular approach. A drawback of F-measure is that it cannot distinguish whether a predicted complex overlap with just one or multiple known complexes and vice versa. It has been argued that predictions that overlap with fewer known complexes should be regarded as having a higher quality [24]. To further evaluate the methods using this criterion, we use the separation metric introduced by Brohee and van Helden [24] which takes into account the observation above. As shown in Figure S8 (Additional file 1), miPALM again outperforms the other methods. Therefore, it is unlikely that the performance improvement by miPALM is due to a bias in the benchmarking metrics used.

In summary, using three alternative performance measures (F-measure, Biological Relevance, Separation), our benchmarking analysis demonstrate that miPALM achieve an overal best performance among the five algorithms compared. The performance measures of the methods using three input interaction networks are summarized in Additional file 1, Tables S4, S5, S6.

The proposed algorithm can be naturally extended to handle weighted networks by using edge weights for local modularity calculation. Edge weights can be calculated based on topological features of the PPI network and domain-specific information from other omic data, such as microarray gene expression, genome-wide association study, and genome-wide sequence mutation data (e.g. cancer mutation screening). Integration of functional genomic data into miPALM will enable us to find context-dependent sub-networks that are active under specific growth conditions.

Conclusions

Using several performance measures (F-measure, Biological Relevance, and Separation), we have demonstrated that miPALM achieved an overall improvement over previous algorithms. miPALM combines the strength of three key features, triangle seed identification using topological overlap measure, parametric local modularity as a cluster quality measure, and recursive greedy search. By including functional genomic data as edge weights, miPALM can be extended to identify context-dependent gene modules that can in turn be used to assist in network comparison and classification tasks.

Methods

Protein interaction and complex data

Protein interaction networks

Yeast protein-protein interaction data were downloaded from the DIP [14] and BioGRID [20] databases. The DIP "full" set of PPIs (including all physical interactions in the DIP database instead of a subset of high confidence interactions) were used for algorithm development and comparison. The BioGRID and high-confidence [21] sets of PPIs were used for novel protein complex prediction. After removing self-loops and multiple edges, the three networks contain 4859, 5591, and 2228 proteins and 17138, 51880, and 6209 interactions, respectively.

Known annotated protein complexes

Two sets of annotated protein complexes were used for performance evaluation. Pu et al. generated a comprehensive catalogue of 408 protein complexes manually curated from published small-scale experiments reported as of 2008 [16]. This set provides an update of the widely used gold-standard MIPS complexes. In the same study, they also generated a catalogue of 400 high-throughput complexes by a systematic analysis of all high throughput protein-protein interaction data reported as of 2008. After removing complexes with fewer than 3 members, we ended up with two reference sets of protein complexes, termed CYC08 (236 complexes) and YHTP08 (207 complexes), respectively.

Construction of the seed set

Seeding strategy is crucial for a network searching algorithm since the search result is dependent on the starting point (e.g. a node, an edge, or a sub-network). Here we describe how to construct seeds and to rank them based on the local property of the network.

First, we weight every interaction in the PPI network. For discovering good seeds, it is important to rank within-complex edges high and between-complex edges low. We used a modified version of the topological overlap measure by Ravasz et al. [25] as edge weight. It is defined as following:

OT(v,w)=Avw|Γ(v,w)|/(kv+kw)/2
(1)

where |Γ(v, w)| is the number of common neighbors of node v and w, kv and kw are the degrees of node v and w, Avw = 1 if v and w have a direct link and zero otherwise.

In the original definition of OT (v, w), the number of shared interacting partners is normalized by dividing |Γ(v, w)| by min(kv, kw) instead of (kv + kw)/2. We modified the normalization factor because it is improper to treat two proteins topologically equal if one protein has three interactors and the other has 100 interactors (e.g. hub proteins) even though these two proteins share the same three interacting partners.

Second, we enumerated all triangles in the PPI network using the enumeration algorithm described in Algorithm 1. All triangles in the PPI network can be located by Algorithm 1 in O(kmax·m) time with an upper bound of O(n·m), where kmax is the largest node degree in the network.

Algorithm 1: TriangleEnumeration (G)

1   input: Unweighted graph G = (V, E)

2   output: all triangles of G

3   begin

4    for e [set membership] E do

5         (v, w) ← a pair of nodes connected by e

6         Γ (v, w) ← a set of common nodes shared by v and w

7         for x [set membership] Γ(v, w) do

8            output triplet {v, w, x}

9         remove e from G

10   end

We then rank all triangles found by Algorithm 1 based on their triangle-weights obtained by averaging pair-wise edge weights.

Local modularity as the scoring function

The total modularity Q of a network with M modules is defined as following [12]:

Q=s=1MQs=s=1M[mssm(ds2m)2]
(2)

where m is the total number of edges in the network, mss is the number of intra-module edges in module S, and ds is the sum of the degrees of nodes in module S. Essentially, Q is the difference in the fraction of within-module edges between the observed network and a random configuration network model. This definition of modularity is global in the sense that the comparison of mss/m with (ds/2m)2 assumes equal probability of connection between any pair of nodes in the random network model.

During module search, when a node v and a sub-network S are merged, the change in global modularity can be derived, as followings,

ΔQ(v,S)=QvS(Qv+QS)=1m(mvSdvdS2m) 
(3)

where Qv and QS are the modularity of v and S, respectively and QvS is the modularity of the sub-network created by merging v and S.

In order to overcome the resolution limit of the global modularity measure, Muff et al. proposed the local modularity measure LQ [9]

LQ=s=1M[mssms(ds2ms)2] 
(4)

where mss is the number of edges within sub-network S and ms is the total number of edges in S and its first neighbours. LQ is based on the observation that in real world networks most sub-networks are only connected to a small fraction of the entire network.

Inspired by previous work, we introduce a new local modularity measure for a single subnetwork as defined below:

LQα=mSSm(dS2m(α+1)/2)2,0α1
(5)

where the denominator of the second term in Eq. 4 is not fixed to 2m, but varied with a parameter α that we call the coarseness parameter.

After merging v and S the change in the newly defined local modularity is then:

ΔLQα(v,S)=1m(mvSdvdS2mα),0α1 
(6)

Readers are referred to the Suppl. Methods (Additional file 1) for detailed derivation of ΔLQα from LQα

When α = 1, ΔLQα is equivalent to ΔQ in Eq. 3. Decreasing α leads to a smaller number of edges to be considered. For example, if α = 0.5, the ratio of considered edges to the total number of edges in the network (i.e. edge-coverage ratio, r= 2mα /2m )) is m-1/2. Conversely, if we want to cover locally 50% of edges (r = 0.5), then α can be set to 1+logm(0.5). As α goes down to zero, the size of the detected sub-network becomes smaller and smaller because the expected fraction of within-module edges, the second term in Eq. 5, becomes larger. Suppl. Figure S1 (Additional file 1) shows the edge-coverage ratio and size of resultant detected sub-networks as a function of α.

Greedy search by maximizing local modularity measure

The problem of finding a network partition with maximum global modularity is known to be NP-hard [26]. Thus, various heuristic approaches were proposed [27-32]. In particular, greedy search [31,32] based on global modularity have been studied extensively due to its single peakness [33] and fast speed for analyzing very large networks.

Our scoring function (Eq. 5) made it possible to adopt a greedy search strategy to expand a given triangle seed to a larger sub-network iteratively until the increase in local modularity becomes negative. Pseudo codes for our greedy search algorithm are shown in Algorithms 2 and 3. Briefly, starting with the top ranked triangle seed {x, y, z}, our greedy algorithm always merge the direct neighbor w of the seed that increases local modularity the most, growing the seed into a larger sub-network S={w, x, y, z}. The algorithm outputs S if it has no additional neighbor merging of which leads to an increase in the local modularity. This searching process (or seed expansion) is then repeated with a new seed. The time-consuming step of the greedy search algorithm is the calculation of ΔLQα after each merging. We avoid recalculating ΔLQα(v, S') for all neighbours of S', v[set membership]Ns' by taking advantage of the recursive relationship for ΔLQα between before and after merging (see Suppl. Methods and Figure S3 for details, Additional file 1). The upper bound for the time complexity of our search algorithm is O(ns·ds) where ns is the number of proteins in the sub-network S and ds is the sum of degrees of all nodes in the sub-network S.

Algorithm 2: RecursiveGreedySearch (S, A, α)

1   input: triangle seed S, adjacency matrix A, and coarseness parameter α

2   output: Expanded sub-network S' and its neighbor nodes Ns'

3   begin

      Ns ← neighbor nodes of S

5      ΔLQα (·, S) ← change in our local modularity for all v in Ns

6      if max((ΔLQα (·, S)) < 0 then

7         return S and Ns

8      [S', Ns'] ← GrowSeed(S, A, Ns, α, ΔLQα (·, S))

9      return S' and Ns'

10   end

Algorithm 3: GrowSeed (S, A, Ns, α, ΔLQα (·,S))

1   input: triangle seed S, adjacency matrix A, a set of neighbor nodes of S Ns, coarseness parameter α, change in local modularity ΔLQα (v, S) for all v in Ns

2   output: Expanded sub-network S' and its neighbor nodes Ns'

3   begin

4      v*argmaxv{ΔLQα(v,S)}

5      Nv* all neighbor nodes of v*

6      S' ← {S, v*}

7      Ns(Ns{v*})(Nv*(NsS)), vNs{v*}

8      ΔLQα(v,s)ΔLQa(v,S)+ΔLQa(v,v*), vNv*(NsS)

9      ΔLQa(v,S)dvds2mα+1+ΔLQα(v,v*)

10      if max (ΔLQα (·,S')) < 0 then

11         return S' and Ns'

12      [S',Ns']GrowSeed(S', A, α,Ns', ΔLQα(,S'))

13   end

Elimination of unpromising seeds

Unpromising seeds are those that cannot be expanded into larger sub-networks. In other words, they are triangles that have no neighbors that can cause positive change in local modularity if merged. We filtered out those triangles after seed expansion step to speed up the algorithm and reduce the number of false positives (see Figure S2 in Additional file 1).

Complex merging

Proteins in a PPI network could belong to one or more protein complexes simultaneously. This multiple membership of proteins should be uncovered by the clustering algorithm. Complexes found by our method can be overlapped if they are within the same densely connected region in the PPI network. While revealing overlapped complexes is important for understanding their dynamics, allowing algorithm to make overlapped predictions often produce an excessive number of complexes. For example, the algorithm DME [7] predicted 14,780 complexes (minimum density threshold 0.95) on the yeast DIP full set. The majority of them are overlapped, causing low precision and poor overall performance. In this paper we merged any two complexes S and T if they have an overlap score of greater than 0.5, which is defined as |S [intersection operator] T|/min(|S|, |T|).

Complex filtering by density score

After merging complexes produced by the seed expansion step, we rank the candidate complexes by their density score δs that is defined as the product of the connectivity and size of complex S,δs=mssns(ns1)/2·ns.

The miPALM algorithm

Our algorithm takes as input an unweighted PPI network Gn, m={V, E} with n nodes and m edges and outputs a set of predicted protein complexes, M. The pseudo code of the algorithm is shown in Algorithm 4.

Algorithm 4: miPALM (G, α, δ)

1   Input: Unweighted graph Gn, m ={V, E}, n=/V/, m=/E|, coarseness parameter α, and density score threshold δ

3   Output: a set of sub-networks, M

4   begin

5      T ← TriangleEnumeration (G)

6      t ← choose the top ranked triad-seed in T

7      T ← delete t from list T

8      while T is not empty do

9         S ← RecursiveGreedySearch (t, A, α)

10         t ← choose the top triad-seed uncovered by the previous search

11         T ← delete t from list T

12         if the size of S is three then

13         continue

14         S ← refine S by looking around S

15         M ← {M, S}, output S

16      S ← merge sub-networks in S

17      for S [set membership] M do

18         δs get density score f S

19         if δs <δ then

20            delete S from M

21   end

Performance evaluation

We used the F-measure to evaluate the performance of complex prediction algorithms. F-measure is the harmonic mean of the two quantities, precision (Pre) and recall (Rec), 2 Pre Rec/(Pre + Rec). Precision is defined as the ratio of the number of matched sub-networks to the number of predicted sub-networks by each algorithm. Recall is the ratio of the number of matched sub-networks to the number of known complexes.

For comparison purpose, we used the complex matching criterion used in MCODE [2] to identify predicted complexes that overlap with gold standard complexes. A predicted sub-network is considered matched to a known complex if it has a matching score of 0.2 or greater. Matching score is defined as ω = c2/a·b, where a, b are the size of the sub-network and the known complex, respectively, and c is the number of protein members overlapped between the prediction and the known complex. We also examine the precision and recall rates at different overlap scores (see Figure S9 in Additional file 1).

Parameter selection

Our algorithm has two parameters, α for determining the size of the local neighborhood of a candidate complex and δ for filtering candidate complexes based on their density score. For benchmarking purpose, we used the F-measure to determine the parameters yielding the best performance of the algorithm on three sets of known complex. Because the δ parameter is only used for post-search filtering, we first searched for the optimal α value. We varied α from 0 to 1 with an initial step size of 0.01. Once the range of optimal α value was located, we further searched for the optimal parameter value using a finer step size of 0.001 (Figure S4 in Additional file 1). After an optimal α was found, we determined the optimal δ by searching from 0 to 3.5 with a step size of 0.01. To determine the sensitivity of the algorithm to parameter changes, we determined the overlaps between predicted complexes using two α values differed by 0.01. As can be seen in Figure S5 (Additional file 1), our algorithm is not overly sensitive to parameter changes.

For the other four programs we compared, we tested the following parameter ranges that gave optimal F-measure on the three sets of known complexes. For COACH, the affinity threshold was varied from 0 to 1 with a step size of 0.01. For MCL, the inflation parameter was varied from 1.2 to 5.0 with a step size of 0.01. For DME, the density threshold parameter was varied from 0.91 to 1.0 with a step size of 0.01. For MCODE, vertex weight percentage = 0.2, haircut = TRUE, and fluff = FALSE were used. These parameters of MCODE have been optimized to produce the best results by default.

Gene ontology term enrichment test

Yeast Gene Ontology (GO) slim terms were used to evaluate the biological relevance of predicted complexes. P-value for GO term enrichment was calculated using the hypergeometric distribution. A Bonferroni-corrected p-value of 0.05 is considered to be significant.

Co-localization analysis

Based on fluorescence imaging, Huh et al. [19] classified 75% of the yeast proteome into 22 distinct sub-cellular compartments. Protein localization data was downloaded from the yeast GFP fusion localization database http://yeastgfp.yeastgenome.org. To compute a log-odds score of complex sub-cellular localization, we compared the observed number of protein pairs within a sub-network S that are co-localized to sub-cellular compartment k (msk) to the expected number of such pairs in a random network msk¯, defined as following,

msk¯=nsk(nsk1)2·ps

and

ps=2mss/ns(ns1)

where nsk is the number of proteins localized in compartment k in sub-network S and ps is the connectivity for the sub-network. We consider a complex to be localized to a compartment k if the log-odds score log(msk/msk¯)>0.

Authors' contributions

JK and KT conceived and designed the study. JK performed the experiments. JK and KT analyzed the data. All authors have read and approved the final manuscript.

Supplementary Material

Additional file 1:

Supplemental Methods and Materials.

Additional file 2:

Supplemental Table S1.

Additional file 3:

Supplemental Table S2.

Additional file 4:

Supplemental Table S3.

Acknowledgements

We thank the anonymous reviewers for their helpful comments. This work is supported by the American Cancer Society [77-004-31 to K.T.] and the Pharmaceutical Research and Manufacturers of America Foundation [to K.T.].

References

  • Beyer A, Bandyopadhyay S, Ideker T. Integrating physical and genetic maps: from genomes to interaction networks. Nat Rev Genet. 2007;8:699–710. doi: 10.1038/nrg2144. [PMC free article] [PubMed] [Cross Ref]
  • Bader GD, Hogue CW. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics. 2003;4:2. doi: 10.1186/1471-2105-4-2. [PMC free article] [PubMed] [Cross Ref]
  • Everett L, Wang LS, Hannenhalli S. Dense subgraph computation via stochastic search: application to detect transcriptional modules. Bioinformatics. 2006;22:e117–123. doi: 10.1093/bioinformatics/btl260. [PubMed] [Cross Ref]
  • Adamcsek B, Palla G, Farkas IJ, Derenyi I, Vicsek T. CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics. 2006;22:1021–1023. doi: 10.1093/bioinformatics/btl039. [PubMed] [Cross Ref]
  • Palla G, Derenyi I, Farkas I, Vicsek T. Uncovering the overlapping community structure of complex networks in nature and society. Nature. 2005;435:814–818. doi: 10.1038/nature03607. [PubMed] [Cross Ref]
  • Spirin V, Mirny LA. Protein complexes and functional modules in molecular networks. Proc Natl Acad Sci USA. 2003;100:12123–12128. doi: 10.1073/pnas.2032324100. [PubMed] [Cross Ref]
  • Georgii E, Dietmann S, Uno T, Pagel P, Tsuda K. Enumeration of condition-dependent dense modules in protein interaction networks. Bioinformatics. 2009;25:933–940. doi: 10.1093/bioinformatics/btp080. [PMC free article] [PubMed] [Cross Ref]
  • Pereira-Leal JB, Enright AJ, Ouzounis CA. Detection of functional modules from protein interaction networks. Proteins. 2004;54:49–57. doi: 10.1002/prot.10505. [PubMed] [Cross Ref]
  • Muff S, Rao F, Caflisch A. Local modularity measure for network clusterizations. Phys Rev E. 2005;72:056107. doi: 10.1103/PhysRevE.72.056107. [PubMed] [Cross Ref]
  • van Dongen S. Graph Clustering by Flow Simulation. University of Utrecht, Physics. 2000.
  • Girvan M, Newman ME. Community structure in social and biological networks. Proc Natl Acad Sci USA. 2002;99:7821–7826. doi: 10.1073/pnas.122653799. [PubMed] [Cross Ref]
  • Newman MEJ, Girvan M. Finding and evaluating community structure in networks. Physical Review E. 2004;69:026113. doi: 10.1103/PhysRevE.69.026113. [PubMed] [Cross Ref]
  • Fortunato S, Barthelemy M. Resolution limit in community detection. Proc Natl Acad Sci USA. 2007;104:36–41. doi: 10.1073/pnas.0605965104. [PubMed] [Cross Ref]
  • Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D. The Database of Interacting Proteins: 2004 update. Nucleic Acids Res. 2004;32:D449–451. doi: 10.1093/nar/gkh086. [PMC free article] [PubMed] [Cross Ref]
  • Wu M, Li X, Kwoh CK, Ng SK. A core-attachment based method to detect protein complexes in PPI networks. BMC Bioinformatics. 2009;10:169. doi: 10.1186/1471-2105-10-169. [PMC free article] [PubMed] [Cross Ref]
  • Pu S, Wong J, Turner B, Cho E, Wodak SJ. Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res. 2009;37:825–831. doi: 10.1093/nar/gkn1005. [PMC free article] [PubMed] [Cross Ref]
  • Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, Ignatchenko A, Li J, Pu S, Datta N, Tikuisis AP. et al. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature. 2006;440:637–643. doi: 10.1038/nature04670. [PubMed] [Cross Ref]
  • Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dumpelfeld B. et al. Proteome survey reveals modularity of the yeast cell machinery. Nature. 2006;440:631–636. doi: 10.1038/nature04532. [PubMed] [Cross Ref]
  • Huh W-K, Falvo JV, Gerke LC, Carroll AS, Howson RW, Weissman JS, O'Shea EK. Global analysis of protein localization in budding yeast. Nature. 2003;425:686–691. doi: 10.1038/nature02026. [PubMed] [Cross Ref]
  • Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 2006;34:D535–539. doi: 10.1093/nar/gkj109. [PMC free article] [PubMed] [Cross Ref]
  • Batada NN, Reguly T, Breitkreutz A, Boucher L, Breitkreutz BJ, Hurst LD, Tyers M. Still stratus not altocumulus: further evidence against the date/party hub distinction. PLoS Biol. 2007;5:e154. doi: 10.1371/journal.pbio.0050154. [PMC free article] [PubMed] [Cross Ref]
  • Guimera R, Nunes Amaral LA. Functional cartography of complex metabolic networks. Nature. 2005;433:895–900. doi: 10.1038/nature03288. [PMC free article] [PubMed] [Cross Ref]
  • Barabasi AL, Oltvai ZN. Network biology: understanding the cell's functional organization. Nat Rev Genet. 2004;5:101–113. doi: 10.1038/nrg1272. [PubMed] [Cross Ref]
  • Brohee S, van Helden J. Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics. 2006;7:488. doi: 10.1186/1471-2105-7-488. [PMC free article] [PubMed] [Cross Ref]
  • Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabasi AL. Hierarchical organization of modularity in metabolic networks. Science. 2002;297:1551–1555. doi: 10.1126/science.1073374. [PubMed] [Cross Ref]
  • Brandes U, Delling D, Gaertler M, Gorke R, Hoefer M, Nikoloski Z, Wagner D. On modularity clustering. Ieee Transactions on Knowledge and Data Engineering. 2008;20:172–188. doi: 10.1109/TKDE.2007.190689. [Cross Ref]
  • Agarwal GaK D. Modularity-maximizing graph communities via mathematical programming. The European Physical Journal B-Condensed Matter and Complex Systems. 2008;66:409–418. doi: 10.1140/epjb/e2008-00425-1. [Cross Ref]
  • Brandes U, Delling D, Gaertler M, Gorke R, Hoefer M, Nikoloski Z, Wagner D. On Finding Graph Clusterings with Maximum Modularity. Graph-Theoretic Concepts in Computer Science. 2007. pp. 121–132. full_text.
  • Noack AaR R. Experimental Algorithms. Vol. 5526. Heidelberg: Springer; 2009. Multi-level Algorithms for Modularity Clustering; pp. 257–268. full_text.
  • Sales-Pardo M, Guimera R, Moreira AA, Amaral LA. Extracting the hierarchical organization of complex systems. Proc Natl Acad Sci USA. 2007;104:15224–15229. doi: 10.1073/pnas.0703740104. [PubMed] [Cross Ref]
  • Newman MEJ. Fast algorithm for detecting community structure in networks. Physical Review E. 2004;69:066133. doi: 10.1103/PhysRevE.69.066133. [PubMed] [Cross Ref]
  • Schuetz P, Caflisch A. Multistep greedy algorithm identifies community structure in real-world and computer-generated networks. Physical Review E. 2008;78:026112. doi: 10.1103/PhysRevE.78.026112. [PubMed] [Cross Ref]
  • Clauset A, Newman MEJ, Moore C. Finding community structure in very large networks. Physical Review E. 2004;70:066111. doi: 10.1103/PhysRevE.70.066111. [PubMed] [Cross Ref]

Articles from BMC Bioinformatics are provided here courtesy of BioMed Central