To facilitate the realization of biological functions, proteins are often organized into complexes. While computational techniques are used to predict these complexes, detailed understanding of their organization remains inadequate. Apart from complexes that reside in very dense regions of a protein interaction network, which most algorithms are able to identify, we observe that many other complexes, while not residing in very dense regions, reside in regions with low neighborhood density. We develop an algorithm that identifies protein complexes by considering these two types of complexes separately. We test our algorithm on several yeast protein interaction networks and show that it identifies complexes more accurately than existing algorithms. A software program, NDComplex, implementing the algorithm is available at http://faculty.cse.tamu.edu/shsze/ndcomplex.
Since experimental determination of protein complexes remains difficult, computational techniques are used to predict protein complexes. While a common strategy is to predict complexes from given protein interaction networks (Bader and Hogue, 2003; King et al., 2004; Sharan et al., 2005; Altaf-Ul-Amin et al., 2006; Hirsh and Sharan, 2007; Li et al., 2007; Chua et al., 2008; Qi et al., 2008; Liu et al., 2009; Moschopoulos et al., 2009; Jung et al., 2010; Shi et al., 2011; Tu et al., 2011), recent combined experimental–computational strategies utilize these techniques to construct protein complexes from purification data (Gavin et al., 2002; Ho et al., 2002; Krogan et al., 2006).
Apart from the general agreement that protein complexes form dense subgraphs in an interaction network (Spirin and Mirny, 2003), which leads to the strategy of first generating small dense subgraphs and either extending or merging these subgraphs to construct protein complexes (Bader and Hogue, 2003; Li et al., 2007), detailed understanding of the organization of protein complexes remains inadequate. To improve the modeling of complexes, recent approaches separate the tasks of predicting a core complex and its attachment proteins (Leung et al., 2009; Wu et al., 2009).
We observe that most complexes either reside in very dense regions, which most algorithms are able to identify, or in regions with low neighborhood density, which most algorithms are less successful at identifying. We investigate the following algorithm to handle these two types of complexes separately. Given a protein interaction network, we first identify all the maximal cliques. For each maximal clique, we count the number of other maximal cliques that overlap significantly with it and use this count to define its neighborhood density. We then subdivide the cliques into two sets, one containing cliques with low neighborhood density and the other containing cliques with high neighborhood density.
Since maximal cliques with low neighborhood density are likely to correspond to the core region of a complex, we extend each such clique to include more proteins as long as the density remains high. This allows each prediction to become a dense subgraph that is not necessarily a clique. Since maximal cliques with high neighborhood density overlap with many other maximal cliques, using these cliques directly as predicted complexes would lead to a significant overestimate of the number of complexes. Instead, we extract the most shared sets of proteins from these cliques. We obtain the set of predicted complexes by collecting the above two types of predictions.
We compare the performance of our algorithm to other complex prediction algorithms on a few protein interaction networks and show that our algorithm is able to identify complexes more accurately with respect to complex agreement, complex accuracy, and protein pair agreement measures.
Given a protein interaction network represented by a graph G=(V, E), in which each vertex represents a protein and each edge represents an interaction between two proteins, we first obtain the set of all maximal cliques with at least three vertices by using a branch-and-bound algorithm that adds one vertex at a time until no more vertices can be added. Although enumerating all maximal cliques takes exponential time in the worst case, the degree of most vertices in G is small, so the number of maximal cliques is not large and this step is feasible in practice.
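The enumeration step can be sketched with a simple Bron–Kerbosch-style recursion. This is a minimal illustration, not the paper's implementation: the function name and the adjacency-dictionary representation are ours, and the actual branch-and-bound code may prune differently.

```python
def maximal_cliques(adj, min_size=3):
    """Enumerate all maximal cliques with at least min_size vertices.

    adj: dict mapping each vertex to the set of its neighbors
    (a vertex must not be listed as its own neighbor).
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique; p: candidates that extend r;
        # x: vertices already tried, used to test maximality
        if not p and not x:
            if len(r) >= min_size:
                cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p.remove(v)
            x.add(v)

    expand(set(), set(adj), set())
    return cliques
```

On a graph consisting of the triangle {a, b, c} plus an extra edge c–d, the only maximal clique with at least three vertices is {a, b, c}.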
Given two sets C1 and C2, we define their similarity to be S(C1, C2)=|C1∩C2|²/(|C1||C2|) (Bader and Hogue, 2003; Altaf-Ul-Amin et al., 2006; Li et al., 2007; Chua et al., 2008; Moschopoulos et al., 2009; Wu et al., 2009; Jung et al., 2010; Shi et al., 2011). For each maximal clique C, we count the number of other maximal cliques C′ with S(C, C′)≥t, where t is a given threshold. We define a clique C to have low neighborhood density if this number is below a given threshold c, and high neighborhood density otherwise. The worst case time complexity of this step is quadratic in the number of maximal cliques.
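The classification by neighborhood density can be sketched as follows, with the thresholds t=0.3 and c=3 used in our experiments as defaults. The function name is ours for illustration.

```python
def classify_cliques(cliques, t=0.3, c=3):
    """Split cliques into (low, high) neighborhood density lists.

    S(C1, C2) = |C1 ∩ C2|^2 / (|C1| |C2|); a clique has high
    neighborhood density if at least c other cliques overlap it
    with similarity >= t.
    """
    def S(c1, c2):
        inter = len(c1 & c2)
        return inter * inter / (len(c1) * len(c2))

    low, high = [], []
    for ci in cliques:
        n = sum(1 for cj in cliques if cj is not ci and S(ci, cj) >= t)
        (high if n >= c else low).append(ci)
    return low, high
```

For example, four cliques sharing a four-protein core classify each other as high neighborhood density, while a disjoint clique is classified as low.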
Given an induced subgraph G′=(V′, E′) of a graph G=(V, E), we define its density to be D(V′)=2|E′|/(|V′|(|V′|−1)) (Bader and Hogue, 2003; Altaf-Ul-Amin et al., 2006; Li et al., 2007; Moschopoulos et al., 2009; Wu et al., 2009; Shi et al., 2011). For each maximal clique with low neighborhood density, we iteratively identify the best vertex to add so that the density of the enlarged subgraph remains high and the length of the shortest path between any two vertices in the enlarged subgraph remains small. We repeat the procedure until no more changes can be made, and use the resulting dense subgraphs as predicted complexes (Fig. 1). Since these subgraphs reside in regions with low neighborhood density, they are likely to function as independent units.
For each potential vertex to add, it takes O(|V|) time to compute the density of the enlarged subgraph. By precomputing all vertex pairs that have shortest paths of length at most two, the worst case time complexity of the procedure is polynomial in |V|, although it should terminate quickly when the density threshold is large.
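The greedy extension can be sketched as below. For simplicity this sketch keeps only the density condition and omits the shortest-path constraint described above; the function name and default threshold d=0.7 follow our parameter settings, but the selection rule is a simplified stand-in for the paper's "best vertex" criterion.

```python
from itertools import combinations

def extend_clique(adj, clique, d=0.7):
    """Greedily grow a clique into a dense subgraph: repeatedly add
    the neighboring vertex that keeps the density
    D = 2|E'| / (|V'|(|V'|-1)) highest, while it stays >= d."""
    current = set(clique)

    def density(vs):
        edges = sum(1 for u, v in combinations(vs, 2) if v in adj[u])
        n = len(vs)
        return 2 * edges / (n * (n - 1)) if n > 1 else 1.0

    while True:
        # candidate vertices: neighbors of the current subgraph
        candidates = set().union(*(adj[v] for v in current)) - current
        best, best_d = None, d
        for v in candidates:
            dv = density(current | {v})
            if dv >= best_d:
                best, best_d = v, dv
        if best is None:
            return current
        current.add(best)
```

Starting from the triangle {a, b, c} in a graph where d is adjacent to all three but e is adjacent only to a, the procedure adds d (density stays 1.0) and then stops because adding e would drop the density below the threshold.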
For each maximal clique with high neighborhood density, if its most similar maximal clique overlaps significantly with it, we replace it by the intersection of the two cliques. We repeat the procedure until no more changes can be made, and use the remaining cliques as predicted complexes (Fig. 2). This procedure reduces the number of highly overlapping predictions significantly, since many cliques become identical after the intersections.
By labeling each clique with a positive integer and picking the one with the lowest label when resolving ties in similarity, we can guarantee that at least two cliques become identical after each iteration, so the procedure takes at most m iterations, where m is the number of maximal cliques with high neighborhood density. The worst case time complexity of the procedure is polynomial in m, although it should terminate quickly when the similarity threshold is large.
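The intersection procedure can be sketched as follows. The function name is ours; as a stand-in for the label-based tie-breaking described above, this sketch resolves similarity ties by list order.

```python
def reduce_cliques(cliques, s=0.2):
    """Repeatedly replace each clique by its intersection with its
    most similar clique whenever that similarity is >= s, collapsing
    duplicates, until no more changes can be made."""
    def S(c1, c2):
        if not c1 or not c2:
            return 0.0
        inter = len(c1 & c2)
        return inter * inter / (len(c1) * len(c2))

    current = list(dict.fromkeys(cliques))  # distinct cliques, stable order
    while True:
        new, changed = [], False
        for ci in current:
            others = [cj for cj in current if cj != ci]
            if others:
                # ties resolved by position, mimicking the lowest label
                best = max(others, key=lambda cj: S(ci, cj))
                inter = ci & best
                if S(ci, best) >= s and inter != ci:
                    ci, changed = inter, True
            new.append(ci)
        new = list(dict.fromkeys(new))
        if not changed:
            return new
        current = new
```

Three cliques sharing the core {1, 2, 3} all collapse to that core in one iteration, illustrating how many overlapping cliques reduce to a single highly shared prediction.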
To evaluate the predicted complexes, we use a combined set of true complexes that contain at least two proteins, including 214 curated complexes from the Munich Information Center for Protein Sequences (MIPS) database (Mewes et al., 2004), 101 curated complexes from Aloy et al. (2004), and 363 complexes extracted from the Saccharomyces Genome Database (SGD; Cherry et al., 1998) according to GO slim complex annotations (Friedel et al., 2009). To reduce evaluation bias, we removed complexes that contain more than 100 proteins from the SGD complexes. After removing duplicates from the combined set, a total of 574 true complexes remain. Most of these complexes are small, with an average of 9.4 proteins per complex (Table 1).
We used the yeast protein interaction network from the MIPS database (Mewes et al., 2004), the yeast protein interaction network from the Database of Interacting Proteins (DIP database) (Xenarios et al., 2000), and the yeast protein interaction network from the Biological General Repository for Interaction Datasets (BioGRID database) (Stark et al., 2006) to obtain separate sets of predicted complexes from each network. For the BioGRID network, only physical interactions are included. These networks have different densities, with the BioGRID network being the densest (Table 2).
We compared the performance of our algorithm NDComplex to the Markov cluster algorithm (MCL) (Enright et al., 2002), which subdivides a given graph into clusters by Markov clustering, to the Molecular complex detection algorithm (MCODE) (Bader and Hogue, 2003), which uses a seed-extension algorithm to identify complexes in dense regions, to the Dense-neighborhood extraction using connectivity and confidence features algorithm (DECAFF) (Li et al., 2007), which is based on the identification and merging of dense subgraphs, and to the Core-attachment based method (COACH) (Wu et al., 2009), which is based on the prediction of a core complex and its attachment proteins. We also compared the performance of our algorithm to the Maximal clique algorithm (CLIQUE), which uses the set of maximal cliques with at least three vertices as predicted complexes.
For MCL, we set inflation to 3.5. For MCODE, we set depth to 100, haircut to true, fluff to false, and the fluff density threshold to 0.2. For DECAFF, we implement steps 1 to 4 of Algorithm 1 in Li et al. (2007), including the local clique detection step, the hub removal step, and the merging step, and set the density threshold to 0.8 and the neighborhood affinity threshold to 0.5. Since the other algorithms do not use functional information, we skip the filtering steps 5 to 9 of DECAFF that use the MIPS functional catalog (Mewes et al., 2004). For COACH, we set the neighborhood affinity threshold to 0.1. For NDComplex, we set the similarity threshold t and the occurrence threshold c during the computation of neighborhood density to 0.3 and 3, respectively. We set the density threshold d to 0.7 during the computation of predicted complexes in regions with low neighborhood density, and the similarity threshold s to 0.2 during the computation of predicted complexes in regions with high neighborhood density. These parameters were determined by testing a few combinations and choosing the one that gives the best overall performance on the test sets. We collect the predicted complexes from both types of regions for performance evaluation, with each distinct prediction counted once.
Given a similarity threshold u, we evaluate the agreement between a set of true complexes and a set of predicted complexes by defining the precision (PC) to be the ratio of the number of predicted complexes P that have a true complex C with similarity S(C,P)≥u to the total number of predicted complexes, and the recall (RC) to be the ratio of the number of true complexes C that have a predicted complex P with similarity S(C, P)≥u to the total number of true complexes (Chua et al., 2008; Moschopoulos et al., 2009; Wu et al., 2009; Jung et al., 2010; Shi et al., 2011). We compute the F-measure=2×(PC×RC)/(PC+RC).
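The complex agreement measure can be sketched as follows; the function name is ours, and the threshold value in the example is arbitrary.

```python
def complex_agreement(true_cs, pred_cs, u):
    """Precision (PC), recall (RC) and F-measure for complex
    agreement at similarity threshold u, with
    S(C, P) = |C ∩ P|^2 / (|C| |P|)."""
    def S(a, b):
        inter = len(a & b)
        return inter * inter / (len(a) * len(b))

    # PC: fraction of predictions matched by some true complex
    pc = sum(1 for p in pred_cs
             if any(S(c, p) >= u for c in true_cs)) / len(pred_cs)
    # RC: fraction of true complexes matched by some prediction
    rc = sum(1 for c in true_cs
             if any(S(c, p) >= u for p in pred_cs)) / len(true_cs)
    f = 2 * pc * rc / (pc + rc) if pc + rc else 0.0
    return pc, rc, f
```

With one true complex and two predictions, only one of which matches it, PC = 0.5, RC = 1.0, and the F-measure is 2/3.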
Figure 3 shows that NDComplex has the best overall performance when the similarity threshold u is low, while MCODE and COACH perform better when the similarity threshold u is high, which corresponds to a small number of almost perfect predictions. DECAFF has the next best performance, followed by CLIQUE and MCL.
In addition to the complex agreement measure, we use the complex accuracy measure of Brohée and van Helden (2006), Friedel et al. (2009), and Tu et al. (2011), which does not rely on a similarity threshold. Given a set of true complexes and a set of predicted complexes, we compute the sensitivity (Sn), which is the ratio of the sum of the maximum overlap of each true complex with a predicted complex to the total size of the true complexes; the positive predictive value (PPV), which is the ratio of the sum of the maximum overlap of each predicted complex with a true complex to the total size of the predicted complexes; and the accuracy (Acc), which is the geometric mean of Sn and PPV.
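These three quantities can be sketched directly from their verbal definitions above; note that the PPV denominator here follows the wording "total size of predicted complexes" rather than any particular variant in the cited papers, and the function name is ours.

```python
import math

def complex_accuracy(true_cs, pred_cs):
    """Sn, PPV and Acc for complex accuracy:
    Sn  = sum_i max_j |C_i ∩ P_j| / sum_i |C_i|
    PPV = sum_j max_i |C_i ∩ P_j| / sum_j |P_j|
    Acc = sqrt(Sn * PPV)."""
    sn = (sum(max(len(c & p) for p in pred_cs) for c in true_cs)
          / sum(len(c) for c in true_cs))
    ppv = (sum(max(len(c & p) for c in true_cs) for p in pred_cs)
           / sum(len(p) for p in pred_cs))
    return sn, ppv, math.sqrt(sn * ppv)
```

For a single true complex of four proteins and a single prediction covering two of them, Sn = 0.5, PPV = 1.0, and Acc = sqrt(0.5).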
Figure 4 shows that NDComplex has the best accuracy in almost all cases, followed by CLIQUE, DECAFF, and COACH. The performance differences between NDComplex and MCL, MCODE, DECAFF, or COACH are the largest on the BioGRID network. DECAFF puts emphasis on sensitivity, while MCODE puts emphasis on PPV.
We evaluate the protein pair agreement by defining a true positive (TP) to be a protein pair that is within the same true complex and within the same predicted complex, a false positive (FP) to be a protein pair that is within the same predicted complex but not in the same true complex, and a false negative (FN) to be a protein pair that is within the same true complex but not in the same predicted complex. We compute the precision PC=|TP|/(|TP|+|FP|), the recall RC=|TP|/(|TP|+|FN|), and the F-measure=2×(PC×RC)/(PC+RC).
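The pair-based measure can be sketched by collecting co-complexed pairs from each set; the function name is ours for illustration.

```python
from itertools import combinations

def pair_agreement(true_cs, pred_cs):
    """Precision, recall and F-measure over protein pairs: a pair
    counts as co-complexed if it appears together in at least one
    complex of the set."""
    def pairs(complexes):
        out = set()
        for c in complexes:
            out.update(frozenset(p) for p in combinations(sorted(c), 2))
        return out

    true_pairs, pred_pairs = pairs(true_cs), pairs(pred_cs)
    tp = len(true_pairs & pred_pairs)  # pairs agreeing in both sets
    pc = tp / len(pred_pairs) if pred_pairs else 0.0
    rc = tp / len(true_pairs) if true_pairs else 0.0
    f = 2 * pc * rc / (pc + rc) if pc + rc else 0.0
    return pc, rc, f
```

With one true complex {1, 2, 3} (three pairs) and one prediction {1, 2} (one pair, which is correct), PC = 1.0, RC = 1/3, and the F-measure is 0.5.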
When the true complexes and the predicted complexes both consist of disjoint sets of proteins, this measure evaluates the resemblance of the two sets. Since a protein can appear in more than one complex, this measure evaluates how accurately an algorithm can predict whether a given pair of proteins is within the same complex, although this is only an approximation since the set of true complexes may not be complete.
Figure 5 shows that COACH and NDComplex have the best performance with respect to protein pairs. The performance differences between these algorithms and MCL, MCODE, or DECAFF are especially large on the BioGRID network, with CLIQUE performing better than MCL, MCODE, and DECAFF. MCL and DECAFF have low performance on the BioGRID network due to a large number of false positive protein pairs. MCODE puts emphasis on precision, while DECAFF puts emphasis on recall.
We have developed an algorithm to identify complexes from protein interaction networks based on using different strategies for regions with different neighborhood densities. We have shown that this approach is very effective and it achieves the best performance with respect to complex agreement, complex accuracy, and protein pair agreement measures. Among the three networks that we have tested, we found that each algorithm performs the best on the BioGRID network, which has the highest density, with weaker performance on the DIP network, followed by the MIPS network, which has the lowest density. On the BioGRID network, our algorithm NDComplex performs the best with respect to the complex agreement measure when the similarity threshold u is low, and with respect to the complex accuracy measure, while COACH and NDComplex have the best and comparable performance with respect to the protein pair agreement measure.
Ideally, the correspondence between the true complexes and the predicted complexes should be one-to-one. To evaluate the degree of success of each algorithm in reaching this goal, we use the separation measure in Brohée and van Helden (2006) and Tu et al. (2011). Given a true complex C and a predicted complex P, their separation Sep(C, P) is defined as in Brohée and van Helden (2006). Given a set of true complexes and a set of predicted complexes, we compute the average separation over the set of true complexes, the average separation over the set of predicted complexes, and the separation Sep, which is the geometric mean of the above two averages.
Table 3 shows that while MCODE has the highest separation, it produces a small number of predictions that cover very few proteins. MCL achieves the next highest separation, but it produces a disjoint partition of proteins, which is biologically inaccurate since it does not allow a protein to appear in multiple predicted complexes. COACH achieves a better separation than NDComplex, but it covers fewer proteins. CLIQUE and DECAFF cover the most interactions, but their numbers of predictions are very high and they have the worst separation. NDComplex achieves better separation than CLIQUE and DECAFF.
Table 4 shows the statistics of maximal cliques and predicted complexes during different stages of our algorithm NDComplex. Within each network, the number of maximal cliques with high neighborhood density is much larger than the number of maximal cliques with low neighborhood density, which means that there are significant overlaps among the maximal cliques with high neighborhood density. For regions with low neighborhood density, our strategy can add a large number of proteins to a prediction, which is especially evident on the BioGRID network. While the density of these predictions remains high, the number of predictions decreases since some of them become identical after the addition of proteins. The situation is very different for regions with high neighborhood density, where the number of predictions decreases drastically after the intersections of cliques and each prediction contains very few proteins, which means that there are only a small number of highly shared parts within these cliques. The predictions from these two types of regions are distinct (compare with Table 3), with almost all predictions coming from regions with low neighborhood density. Since these predictions are largely sufficient to ensure high quality, a complex can mostly be modeled as an independent unit with dense internal connections and sparse external connections, while the small number of predictions from very dense regions is still needed to reduce false negatives.
Although the worst case time complexity of our algorithm NDComplex is high, the actual running time is not high (Table 5). For sparse networks such as MIPS and DIP, it takes less than an hour to complete all the steps, and the time to obtain the maximal cliques dominates. For denser networks such as BioGRID, it takes about a day, and the time to obtain predicted complexes from cliques with high neighborhood density dominates.
To investigate the effect of different parameters on our algorithm NDComplex, we picked a few sets of parameters close to the one we chose and examined the performance differences. Figure 6 shows that NDComplex is not very sensitive to its parameters.
To illustrate the differences in predictions that can be obtained from different algorithms, we examine the predicted complex from each algorithm on the BioGRID network that has the highest similarity to the true complex containing the MAP kinase cascade of the pheromone response pathway and the filamentation/invasion pathway (Gustin et al., 1998) from the MIPS database. Figure 7 shows that CLIQUE finds the complex that contains the MAP kinase cascade of the filamentation/invasion pathway, but it does not include the protein Fus3 from the true complex. NDComplex expands the complex to include two extra proteins, Bem1 and Hek2, in addition to all the proteins in the true complex. COACH and DECAFF return the largest complexes, which contain a few extra proteins, while MCL and MCODE do not return a complex that overlaps the true complex.
One drawback of our algorithm is that identifying all the maximal cliques takes exponential time, so it is unlikely to scale up to handle dense networks. One future direction is to investigate whether polynomial time heuristics can be developed without a large decrease in prediction performance.
This work was supported by grants from the National Science Foundation (CCF-0830455, MCB-0951120).
The authors declare that no competing financial interests exist.