|Home | About | Journals | Submit | Contact Us | Français|
Motivation: Clustering algorithms play an important role in the analysis of biological networks, and can be used to uncover functional modules and obtain hints about cellular organization. While most available clustering algorithms work well on biological networks of moderate size, such as the yeast protein physical interaction network, they either fail or are too slow in practice for larger networks, such as functional networks for higher eukaryotes. Since an increasing number of larger biological networks are being determined, the limitations of current clustering approaches curtail the types of biological network analyses that can be performed.
Results: We present a fast local network clustering algorithm SPICi. SPICi runs in time O(V log V+E) and space O(E), where V and E are the number of vertices and edges in the network, respectively. We evaluate SPICi's performance on several existing protein interaction networks of varying size, and compare SPICi to nine previous approaches for clustering biological networks. We show that SPICi is typically several orders of magnitude faster than previous approaches and is the only one that can successfully cluster all test networks within very short time. We demonstrate that SPICi has state-of-the-art performance with respect to the quality of the clusters it uncovers, as judged by its ability to recapitulate protein complexes and functional modules. Finally, we demonstrate the power of our fast network clustering algorithm by applying SPICi across hundreds of large context-specific human networks, and identifying modules specific for single conditions.
Availability: Source code is available under the GNU Public License at http://compbio.cs.princeton.edu/spici
Supplementary information: Supplementary data are available at Bioinformatics online.
High-throughput experimental technologies, along with computational predictions, have resulted in large-scale biological networks for numerous organisms. In recent years, much research effort has focused on analyzing these biological networks in order to obtain hints about cellular organization and functioning. Clustering is perhaps the most common approach for global network analysis, and is frequently applied to uncover functional modules and protein complexes, and to infer protein function (Bader and Hogue, 2003; Hartwell et al., 1999; Pereira-Leal et al., 2004; Rives and Galitski, 2003; Spirin and Mirny, 2003). As a result, numerous clustering algorithms for biological networks have been developed (e.g. Altaf-Ul-Amin et al., 2006; Bader and Hogue, 2003; Blatt et al., 1996; Brun et al., 2003; Chen and Yuan, 2006; Colak et al., 2009; Enright et al., 2002; Georgii et al., 2009; King et al., 2004; Loewenstein et al., 2008; Navlakha et al., 2009; Palla et al., 2005; Samanta and Liang, 2003; Sharan et al., 2005).
Previous methods for clustering biological networks work well on networks of moderate size. However, the size and number of biological networks continue to grow. For example, by extensive data integration, proteome-scale functional networks have been built for hundreds of organisms across the evolutionary spectrum (Jensen et al., 2009). Recently, by additionally considering specific biological processes (BPs) of interest, hundreds of context-specific functional networks for human have been built (Huttenhower et al., 2009). Moreover, in the near future, biological networks will include numerous additional biological entities such as non-coding RNAs as well as a wider range of interaction types.
Large networks present considerable challenges for existing clustering approaches. Here, we develop a new efficient network clustering algorithm SPICi (‘spicy’, Speed and Performance In Clustering). SPICi builds clusters greedily, starting from local seeds that have high weighted degree, and adding nodes that maintain the density of the clusters and are adjacent to a suitable fraction of nodes within them. The intuition underlying SPICi is similar to that of DPClus (Altaf-Ul-Amin et al., 2006). However, SPICi exploits a simpler cluster expansion approach, uses a different seed selection criterion and incorporates interaction confidences. Approaches based on enumeration have also been developed; these aim to uncover all clusters with specific density requirements. CFinder (Palla et al., 2005) finds clusters such that each consists of a maximal connected component of adjacent cliques of size k where two cliques are adjacent if they share k − 1 nodes. An alternate approach relaxes the requirement of complete cliques and instead finds all subsets of nodes with high density (Colak et al., 2009; Georgii et al., 2009). While these approaches guarantee that they output all clusters with a particular property, they are computationally intensive. In contrast, SPICi takes a heuristic approach with respect to the clusters it outputs, but guarantees a runtime of O(V log V+E), where V and E are the number of vertices and edges in the network.
We demonstrate SPICi's excellent runtime and its state-of-the-art performance via several analyses. First, we compare SPICi to nine previous network clustering algorithms (Altaf-Ul-Amin et al., 2006; Bader and Hogue, 2003; Blatt et al., 1996; Enright et al., 2002; Georgii et al., 2009; King et al., 2004; Loewenstein et al., 2008; Palla et al., 2005; Sharan et al., 2005) on a test set of five existing biological networks (Table 1). SPICi is more than 4–1000 times faster than the previous approaches on the networks for which the approaches terminate within 12 h on a standard desktop machine. Moreover, it is the only algorithm of those tested that is able to cluster all the networks within a reasonable amount of time. Second, we show that even though SPICi is much faster than previous clustering approaches, the clusters it uncovers in biological networks recapitulate functional modules just as well. Third, we perform a robustness analysis on synthetic networks, as described by Brohee and van Helden (2006), and show that SPICi has very good performance in recapitulating protein complexes, deteriorating only on extremely incomplete networks. We also find SPICi to be robust to perturbations in dense functional networks. Finally, we use SPICi to cluster 230 large, context-specific human networks (Huttenhower et al., 2009) and identify modules specific for single conditions; because of the size and number of networks, this type of analysis was made feasible only by utilizing our new fast clustering approach.
Given a weighted network, the goal of our algorithm is to output a set of disjoint dense subgraphs. We model the network as a undirected graph G=(V, E) with a confidence score 0<wu,v≤1 for every edge (u,v)E. For any two vertices u and v without an edge between them, we set wu,v=0. Our approach utilizes several measures. For each vertex u, we define its weighted_degree, dw(u), as the sum of all of its incident edges' confidence values:
For each set of vertices SV, we define its density as the sum of the weights of the edges among them, divided by the total number of possible edges (i.e. the density of a set is a measure of how close the induced sub-graph is to a clique, and varies from 0 to 1):
Finally, for each vertex u and set SV, we define the support of u by S as the sum of the confidences of u's edges that are incident to vertices in S:
We use a heuristic approach to greedily build clusters. SPICi builds one cluster at a time, and each cluster is expanded from an original seed pair of proteins. The unclustered node that has the highest support for the cluster is added if the support is high enough and the density of the cluster remains higher than a user-defined threshold; otherwise, the cluster is output and its nodes are removed from the network. SPICi thus has two parameters: Ts, the support threshold and Td, the density threshold. (See Fig. 1 for a simplified example.)
To select the seed vertices, we first find the vertex u that has the highest weighted degree in the current network. Then, we divide the neighboring vertices of u into five bins based on their edge weights: (0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8] and (0.8, 1]. We search from the highest weight bin (0.8, 1] to the lowest weight bin (0, 0.2]. If our current bin is not empty, we use the vertex v in it with the highest weighted degree as the second seed vertex. We refer to (u, v) as the seed edge. We utilize this heuristic approach for seed selection based on two observations for functional networks. First, there is a positive correlation between the weighted degree of a node and a measure of the overall functional enrichment found among its interacting proteins (data not shown); this suggests that high weighted degree nodes are meaningful starting points for local module searches in functional networks. Second, two vertices are more likely to be in the same module if the weight on the edge between them is higher. This is why we search from the highest weight bin to the lowest weight bin. For vertices in each bin, their edge weights to the first seed vertex are quite similar, and by taking the one with highest weighted degree, we obtain a larger candidate set for continuing the search.
After obtaining two seed nodes with an edge between them, we grow each cluster in a procedure similar to that of Altaf-Ul-Amin et al. (2006). At each step, we have a current vertex set S for the cluster, which initially consists of the two seed vertices. We search for the vertex u with maximum value of support(u, S) amongst all the unclustered vertices that are adjacent to a vertex in S. If support(u, S) is smaller than a threshold, we stop expanding this cluster and output it. If not, we put vertex u into S and update the density value. If the density value is smaller than our density threshold Td, we do not include u in the cluster and output S. We repeat this procedure until all vertices in the graph are clustered.
We implement our algorithm using two critical data structures, described in more detail in the next paragraph. The first data structure is a priority queue, DegreeQ, to pick the seed proteins from which clusters are built. Initially, all proteins are organized based on their weighted degree. Once a cluster is built and output, its proteins are removed from DegreeQ and the weighted degrees of all proteins adjacent to these are decreased to reflect their connectivity to other unclustered proteins. Thus, in addition to extracting the maximum degree node, DegreeQ also needs to support deletions and decrease key operations. The second data structure, CandidateQ, is used to expand clusters. It is also a priority queue, where each element is a node u adjacent to one of the nodes in the cluster S being built, and is prioritized based on support(u, S). In addition to extracting the max element, CandidateQ needs to support insertions, as neighbors of nodes added to S are added to it, and increase key operations, since as S grows, the supports of all vertices with respect to S increase. We describe SPICi with these data structures in the pseudocode given in Figure 2.
The specific data structures we use are now described. For CandidateQ, we need efficient Insert, ExtractMax and IncreaseKey operations. We use a Fibonacci heap (Fredman and Tarjan, 1987), as amortized analysis gives a time complexity of O(1) for Insert and IncreaseKey and of O(log n) for ExtractMax, where n is the number of possible items in the heap. We now count the overall time cost for all Expand operations. There are at most O(V) ExtractMax operations, so the overall cost for them is O(V log V). For IncreaseKey, each operation is associated with an edge connected to that vertex, so the overall cost for these operations is O(E). For Insert, the overall cost is O(E) since each insertion is associated with an edge from S. So, the overall time cost for all Expand operations is O(V log V + E). For the Fibonacci heap, the space complexity is O(n), so we get a space complexity of O(V).
For DegreeQ, we need to support ExtractMax, Delete and DecreaseKey. As an optimization, we round off each vertex's weighted degree to an integer, and utilize an extremely fast data structure, which we refer to as an Integer heap, with time complexity of O(1) for Delete and DecreaseKey, and an amortized time complexity of O(1) for ExtractMax. More specifically, since every element in the heap is an integer, we use an array as its backbone, and at every slot in the array, we use a doubly linked list for all vertices with weighted degree rounded to that slot index. Since each edge has confidence ≤1, the number of slots is O(V), and the total space necessary to handle all vertices is O(V). (See Fig. 3 for a schematic of our Integer heap data structure.) For insertion and deletion, we require the procedure call to provide the pointer to the doubly linked list node, and so these two operations can be performed in O(1) time. For DecreaseKey, we first disconnect the node and then reconnect it to a new slot with O(1) cost. Note that we store the initial weights per edge, and then round each time we perform DecreaseKey. For ExtractMax, we just need to pop out a value at the top slot of our array. If any array slot becomes empty, we need to search down the array until we reach a new non-empty slot. The total number of all down searches is V−1, which is the maximum length of the array. Thus, if there are a total of V operations, the amortized time for each operation is O(1). The time complexity of procedure Search without considering the time spent for Expand is O(V+E)=O(E), as there are at most V ExtractMax and Delete operations, E DecreaseKey operations, and each edge is considered at most once when finding the second seed vertex. Thus, considering Search and Expand together, SPICi has time complexity O(V log V+E) and space complexity O(E).
We concentrate our initial analysis on two networks for yeast and three networks for human (Table 1). The two Biogrid (Breitkreutz et al., 2008) networks consist of experimentally determined physical and genetic interactions. The two STRING (Jensen et al., 2009) networks and the human Bayesian network (Huttenhower et al., 2009) consist of functional associations between proteins that are derived from data integration. For Biogrid, we extract all non-redundant interaction pairs, including all protein physical and genetic interactions. For STRING (Jensen et al., 2009), we use all weighted interactions. For the Bayesian human network, we use the global network from Huttenhower et al. (2009); this network is not tuned toward any specific BP. In subsequent analysis, we also use the 229 context-specific human networks from Huttenhower et al. (2009); here each context is a BP from the Gene Ontology (GO; Ashburner et al., 2000), and the training set is altered according to the specific BP context so that the network will better represent that specific context. None of the networks are further processed. For functional module discovery and/or protein function prediction, it may be beneficial in practice to remove high-degree nodes and/or otherwise prune the networks; since different processing may be necessary for different networks, and our primary goal is to test SPICi's ability to cluster large networks, such variations are not explored here.
All experiments are run on an Intel 2 GHz dual core computer with 2 GB memory. We compare our approach to SPC (Blatt et al., 1996), MCL (Enright et al., 2002), MCODE (Bader and Hogue, 2003), RNSC (King et al., 2004), Cfinder (Palla et al., 2005), NetworkBLAST (Sharan et al., 2005), DPClus (Altaf-Ul-Amin et al., 2006), MCUPGMA (Loewenstein et al., 2008) and DME (Georgii et al., 2009). We briefly highlight the main features of these algorithms. SPC associates a ‘spin’ with each node, and spin–spin correlations are used to partition the network. MCL is a global clustering approach based on modified random walks on networks. MCODE is one of the first approaches specifically geared for clustering interactomes, and greedily grows clusters from a seed node. CFinder finds a set of k-clique percolation clusters, each of which consists of a maximal connected component of adjacent cliques of size k where two cliques are adjacent if they share k−1 nodes. NetworkBLAST, designed for comparing multiple protein networks but applicable for clustering a single protein network, greedily builds small ‘dense’ clusters. DPClus is a greedy approach that grows clusters based on adding nodes that are well connected to other nodes in the cluster and that maintains cluster density. MCUPGMA is a memory-efficient average-link hierachical clustering algorithm. DME finds all clusters that satisfy a user-defined minimum density threshold.
For MCL, we set the inflation factor to 1.8, as this has been found to yield the best performance in clustering biological networks (Brohee and van Helden, 2006). For RNSC, we use the parameters given in the sample README file that comes with the software (-e3 -n2 -N100 -D40 -d10 -c300 -t15 -T2). For SPICi, we set both Ts and Td to 0.5. We use the default parameters for the other approaches. For MCUPGMA, the distance between two nodes u and v is set to 1−wu,v. By default, MCUPGMA allocates a fixed amount of memory corresponding to a prespecified limit on the number of edges that are allowed into memory in each clustering iteration. Without this limit, MCUPGMA runs out of memory on the large Bayesian human networks. While we carefully experimented with changing this memory limit for the human functional networks, the running time of MCUPGMA on these networks still exceeded our time limits and thus for simplicity, we ran the program with its default parameter. We also note that MCUPGMA outputs a hierarchical clustering dendogram without a split into clusters. We obtain clusters using an inconsistency coefficient (Jain and Dubes, 1988); this is a standard procedure in MATLAB's statistics toolbox for processing hierarchical clustering dendograms to obtain clusters. The inconsistency coefficient for each merge of the the dendogram is computed by taking its height in the dendogram and subtracting the average height of all merges considered, and dividing this by the standard deviation of these the merges. The higher the value of this coefficient, the more a merge would connect dissimilar nodes. As in MATLAB's default parameters, we consider only the current merge, and the merges one level below. We use a value of 0.8 to cut the tree, as these values range from 0 to 1.2 in the considered dendograms and there are only two peaks in the distribution (data not shown), one corresponding to merges with inconsistency value <0.1, (i.e. the initial merges) and the other corresponding to merges in the range of 0.7 and 0.8.
All reported runtimes are wall clock times for running the clustering portion of the programs only. We do not report CPU times, as some of the clustering algorithms are designed to be run within a user interface, making strict system timing calls difficult.
The quality of clusters obtained from all algorithms are evaluated using the framework described in Song and Singh (2009). We use GO to build our reference set of functional modules, with all IEA (inferred from electronic annotation) terms removed, as suggested in Rhee et al. (2008). For each organism, only GO terms that annotate at most 1000 proteins are considered. For a given GO annotation A, we define the ‘functional module set’ GA to consist of all genes annotated with A. Song and Singh (2009) utilize the following three measures to measure the overlap between computationally derived cluster and the GO functional modules:
Each of these three measures varies from 0 to 1, with higher values indicating better agreement of the uncovered clusters with functional modules corresponding to GO. We consider the GO BP and cellular component (CC) ontologies separately. For each cluster, we calculate these three measures separately for both ontologies, and these measures are assigned to all proteins within the cluster. Genes in singleton clusters are penalized by having Jaccard, PR and semantic density values of 0. Finally, for each of the six measures (three BP and three CC), we compute its average value over all proteins in the network; this is equivalent in the case of non-overlapping clusters to taking a weighted average over all clusters, where each cluster is weighted by its size. For more details, please see Song and Singh (2009). Some of the tested approaches output overlapping clusters. In this case, if a protein is found in more than one cluster, then each of its six measures is obtained by averaging over the values obtained from each of its clusters.
In order to characterize SPICi's robustness to changes in the network, we use the procedure of Brohee and van Helden (2006) to characterize how well MIPS complexes are recapitulated from synthetic test network data. In particular, networks are initially created for each of the 104 Saccharomyces cerevisiae MIPS (Mewes et al., 2004) complexes that are not determined from high-throughput experiments. A node is included in the network for each protein in one of these complexes, and an edge is included in the network between any two proteins in the same complex. This network with |E| edges is then modified as follows. For an edge addition rate pa and an edge deletion rate pd, first pa · |E| edges are added to the network, and then pd · |E| edges are chosen uniformly at random for deletion. Brohee and van Helden (2006) utilize two measures, Accuracy and Separation, for evaluating clusterings. Accuracy measures how well the clustering recovers the gold standard MIPS complexes. Separation measures how specifically the clustering can be mapped to the MIPS gold standard set without cross-complex contamination. [See Brohee and van Helden (2006) for precise definitions of these two measures.]
We run SPICi and nine previous clustering approaches on our five network datasets. Table 2 gives the runtime and memory usage of each approach on each of the datasets. SPICi is the only approach that can cluster each of the five networks within 12 h; indeed it takes <10 s for four of the five networks and takes <20 min on the largest dense functional network. Even on networks that can be clustered by the other approaches, SPICi obtains substantial speed-ups. This decrease in runtime is accompanied by a decrease in memory usage as well. For the human Bayesian functional network, SPICi uses 1.11 GB of memory, which corresponds to the size of the network itself.
We use the procedure from Song and Singh (2009) to assess the overall quality of the clusters we find. Three approaches (SPICi, MCUPGMA, MCL) can cluster four of the networks, and we focus on these methods and networks in our analysis in the main body of the article. We find that neither SPICi, MCUPGMA nor MCL clearly dominates the other approaches (Table 3 and Supplementary Table 1). We observe that these three approaches have complimentary strengths when considering clusters of different sizes on the functional networks (see Supplementary Fig. 1). For clusters with at most five proteins, MCUPGMA has the highest average quality measures. On the other hand, SPICi's clusters of intermediate size (from 6 to 150 proteins) generally have higher quality measures. For clusters with more than 150 proteins, SPICi and MCL perform best.
Many of the remaining seven algorithms can cluster the smaller sized networks well, and in some cases may outperform the three approaches here; however, their runtime or memory requirements limit their applicability. We note that while it is possible to reduce the large functional networks into smaller ones by only keeping edges with a weight above a certain threshold, we find that, for all approaches we tested, by keeping all interactions, we find additional ‘unique’ functionally enriched clusters as well as an increase in the number of proteins in functionally enriched clusters. (Supplementary Material.)
We first apply the procedure of Brohee and van Helden (2006) to compare the robustness of SPICi against that of MCL and MCUPGMA. We build 10 synthetic test networks edges for each pairwise combination of 10 addition rates and 10 deletion rates. The averaged Separation and Accuracy measures (Brohee and van Helden, 2006) for each addition and deletion rate are shown in Figure 4 (see also Supplementary Table 2). We find that SPICi has better overall performance than MCUPGMA (Fig. 4b). When comparing MCL and SPICi, it is clear that neither method dominates the other at all noisy edge insertion and deletion rates. For low interaction insertion and deletion rates, the methods perform comparably. For high deletion rates, MCL generally outperforms SPICi. For high interaction addition rates and low interaction deletion rates, SPICi has better overall performance than MCL. These results suggest that SPICi is less sensitive to noisy edge addition than MCL, and is perhaps better suited for dense functional networks such as the STRING networks. Consistent with this, we find that SPICi is quite robust to perturbations of confidence values in both the STRING human and yeast networks, with a steady but relatively modest decrease in average Jaccard, PR and semantic density values as increasing amounts of noise are added (Supplementary Table 3).
While the Bayesian human network from Huttenhower et al. (2009), obtained by global data integration of multiple sources of evidence linking proteins, proves to be challenging for all the previous network clustering approaches, the authors actually created 229 additional networks of similar size. Each of these networks corresponds to one of 229 specific BPs. Here, we show the type of analysis SPICi enables by its fast clustering approach—analysis that would not be possible by the previous approaches. In particular, we utilize SPICi to uncover context-specific modules from these context-specific networks.
We use SPICi to cluster all 229+1 human functional networks. Altogether, we get 63 973 clusters of size >5 and density >0.5. We select context-specific modules utilizing the following criteria. For each candidate cluster, we require that:
By applying these three criteria, we attempt to uncover modules that are unique to a certain context. In total, 2088 clusters passed these criteria. As an example, we look at one such cluster, found in the response to inorganic substance network (Fig. 5). There are 10 proteins in this cluster. This cluster has very limited overlap (at most two proteins) with clusters found in the other networks. Moreover, all other networks contain this set of proteins with a density <0.25. The cluster is found to be enriched via the hypergeometric distribution with the annotations response to metal ion (P-value=1.39E-015, seven proteins annotated) and transport (P-value=4.20E-006, seven proteins annotated). An interesting case is the DRG1 protein (also known as developmentally regulated GTP-binding protein 1). It is annotated with GO terms such as GTP binding and transcription factor binding, but has no known annotations related with response to metal ion or transport. This uncovered cluster reveals DRG1's potential role in metal ion response and transport.
We have developed a fast, memory-efficient clustering algorithm, SPICi. SPICi is significantly faster than previous clustering algorithms for biological networks, and importantly, enables us to cluster larger networks than previously possible. Moreover, we have demonstrated via several analyses that the clusters uncovered by SPICi are of comparable quality to those found by other state-of-the-art algorithms. In our experience, SPICi is especially well-suited for dense networks, such as functional networks. Within sparser networks, we have found that SPICi also readily identifies dense regions, but for reasonable parameter settings will conservatively leave many proteins unclustered.
We have shown that SPICi can be effectively run on hundreds of large human context-specific networks in order to find context-specific modules. In the future, we foresee using SPICi to perform other types of comparative interactomics. For example, protein interaction networks for a single organism can be modified to incorporate information about each protein's tissue- or condition-specific expression, and comparing clusterings across these networks can help to identify modules that are either conserved across numerous conditions or specific to certain conditions. Given the large number of expression datasets, this leads to the possibility of hundreds or even thousands of varying networks across a single organism. SPICi's runtime and memory efficiency enables these new types of analyses, and should be particularly useful as biological networks continue to grow in size and number.
We thank Curtis Huttenhower and Olga Troyanskaya for providing their human context-specific networks. We thank members of the Singh group, especially Tao Yue and Jimin Song, for helpful discussions on our approach and for comments on our manuscript.
Funding: National Science Foundation (ABI-0850063 to M.S., in part); National Institutes of Health (GM076275 to M.S., in part); National Institute of Health Center of Excellence (grant P50 GM071508 to David Botstein, in part).
Conflict of Interest: none declared.