3.1 Edge weight distribution shape
Figure 1 shows the clustering performance for all four datasets over the threshold range. Clustering performance has been overlaid with the normalized edge weight distribution associated with each dataset.
Fig. 1. Clustering performance and edge weight distributions. Each plot shows the F-measure clustering performance metric for both the Force and MCL clustering algorithms over a range of binned −log(E-value) thresholds, together with a normalized edge weight distribution.
The edge weight distributions for all four datasets share similar characteristics. The maximum point in each distribution is located at a very low edge weight value, at or near zero. As the edge weight value increases, the normalized edge count begins to decay. It descends from the maximum value of one towards a small value between 0.1 and zero. In three of the four distributions (A, B and D), a second, much smaller peak is also present. The smaller local maximum is located further along each distribution, at a larger edge weight value. Eventually, as the edge weight increases, each edge weight distribution drops to a value of zero and does not rise again.
The four edge weight distributions may be further subdivided into two broad categories based on the rate at which each distribution descends from the maximum toward the local minimum. In the Amidohydrolase and SLC distributions (A and B, respectively), the descent is immediate, occurring over a range of fewer than five edge weight bins. In the Enolase and Kinase distributions (C and D, respectively), the descent is more gradual, occurring over a range of 20 or more edge weight bins. We refer to the former as rapid-descent histograms, and the latter as gradual-descent histograms.
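The binned, normalized distributions described above can be sketched as follows. This is a minimal illustration, not the paper's actual binning procedure; the `weights` list and the bin width of 1.0 are hypothetical choices.

```python
from collections import Counter

def normalized_edge_histogram(weights, bin_width=1.0):
    """Bin -log(E-value) edge weights and normalize counts so the
    tallest bin has a value of 1.0, as in the overlaid plots."""
    counts = Counter(int(w // bin_width) for w in weights)
    peak = max(counts.values())
    # Map each bin's lower edge to its normalized count.
    return {b * bin_width: c / peak for b, c in sorted(counts.items())}

# Toy example: many low-weight edges, a smaller secondary peak higher up.
weights = [0.5] * 10 + [1.5] * 4 + [30.0] * 2
hist = normalized_edge_histogram(weights)
```

The global maximum at low edge weight normalizes to 1.0, and the smaller secondary peak appears as a lower value further along the distribution.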
3.2 MCL performance over threshold range
MCL algorithm performance follows the same general trend for all four datasets. Initially, the unthresholded MCL clustering results produce a low F-measure. For three of the four datasets (A, C and D), the initial MCL F-measure is below 0.5. The MCL inflation parameter is known to influence the granularity of the clustering. We therefore explored the impact of alternate inflation parameters on the initial MCL F-measures for these superfamilies (Supplementary Table S1). These alternate inflation values did not yield significant improvements.
As the edge weight threshold increases and the edge weight distribution begins to decrease from its maximum, MCL performance improves. When the edge weight distribution approaches the local minimum, the MCL performance measure plateaus at its maximum value. For three of the four datasets (A, B and D), the maximum performance plateau is above 0.9. Finally, as the edge weight distribution decays completely to zero, MCL performance also drops significantly. Since MCL performance depends strongly on the shape of the edge weight distribution, it is not surprising that performance improves at a greater rate, and plateaus at a lower threshold, in the rapid-descent histograms than in the gradual-descent histograms.
3.3 Force performance over threshold range
Force algorithm performance diverges across the two categories of edge weight distributions. In the two rapid-descent histograms, Force performs well even when no initial threshold is present. Force performance for both rapid-descent histograms lies between 0.8 and 0.9 at edge weight zero. Performance then rises to approximately 0.9 as additional thresholds are considered. Eventually, when the threshold becomes too great, Force performance quickly decays in a manner similar to MCL performance. These results indicate that thresholding provides the Force clustering algorithm with little additional benefit when a rapid-descent distribution is present. Furthermore, unthresholded Force clustering will outperform unthresholded MCL clustering on a rapid-descent histogram.
For the two gradual-descent histograms, the opposite holds true. Initially, unthresholded Force performance is poor, falling below 0.5. As the threshold rises from zero, performance increases. This increase, however, is more gradual than the increase in MCL performance over the same threshold range. Eventually, Force performance plateaus at a maximum value equal to the best MCL performance. However, because of the gradual performance increase, Force reaches its maximum value at a greater threshold than MCL. Eventually, the threshold becomes too large, and algorithm performance decays.
3.4 Developing an automated threshold selection heuristic from edge weight distributions
As the edge weight distribution drops rapidly, we observe an increase in clustering quality. From these observations, we infer that at low thresholds, most of the edges removed exist between protein families. These intercluster edges are effectively noise that degrades the overall clustering results. Eventually, when the threshold is high enough, a boundary is reached at which most intercluster edges have already been removed. This boundary represents the optimal threshold separating intercluster edges from intracluster edges. At this boundary, the protein family components may begin to be affected by the filtration process, and certain loosely connected nodes may break off from the main network. However, the gain in clustering precision outweighs any loss in clustering recall, and the overall clustering quality noticeably improves. Thus, we would like to use this boundary as our threshold in order to maximize the filtration of intercluster edges while minimizing the disruption of the protein family clusters.
Our goal was to estimate this boundary automatically, without knowing in advance the identity of the proteins in the network. We heuristically approximated this boundary b using two available network properties. The first property, Nn(Th), is the number of nodes connected by one or more edges at threshold Th. The second property, SE(Th), is the number of edges remaining after threshold Th is applied. We combined these two properties into a single network summary value, Nsv, where Nsv(Th) = SE(Th)/Nn(Th). Conceptually, Nsv(Th) is equivalent to the average weighted node degree at threshold Th.
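Under the definitions above, Nsv(Th) can be sketched as follows. This is a minimal illustration assuming edges are stored as `(node, node, weight)` tuples and that thresholding keeps edges with weight ≥ Th; the function name and the toy edge list are hypothetical.

```python
def network_summary_value(edges, th):
    """Nsv(Th) = SE(Th) / Nn(Th): the number of edges surviving
    threshold th, divided by the number of nodes still touched by
    at least one surviving edge."""
    kept = [(u, v) for u, v, w in edges if w >= th]
    if not kept:
        return 0.0
    nodes = {n for e in kept for n in e}  # nodes with >= 1 surviving edge
    return len(kept) / len(nodes)

# Toy edge list as (node, node, -log(E-value)) tuples.
edges = [("a", "b", 5.0), ("b", "c", 1.0), ("c", "d", 40.0)]
```

For example, at Th = 0 all three edges and four nodes survive, giving Nsv = 3/4; at Th = 2 the weight-1.0 edge is removed while all four nodes remain connected, giving Nsv = 2/4.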
We chose to examine Nsv because its derivative with respect to the threshold, dNsv(Th)/dTh, could potentially reveal interesting behaviors pertaining to the network. At low value thresholds, most of the filtered edges are between families, while the individual family components remain strongly connected. At these thresholds, the value of SE decreases while the value of Nn(Th) remains stable. Thus, as we begin to increase the threshold and filter out the noisy intercluster edges, we expect the value of dNsv(Th)/dTh to be negative.
When Th = b, we expect most intercluster edges to have been removed already. We also expect a few poorly connected outlier nodes to disconnect completely from the network, leading to a decrease in Nn(Th). SE(Th) will also continue to decrease, but at a lesser rate than at lower thresholds, due to a slowdown in the initial rapid decay observed in the edge weight distribution. If the decrease in Nn(Th) is great enough, and the decrease in SE(Th) is small enough, then the value of Nsv may actually increase. In this case, dNsv(Th)/dTh will take on a positive value at a threshold proximate to the boundary, but not at lower thresholds. This leads to the following threshold estimation heuristic: b is approximated by the minimum threshold Th at which dNsv(Th)/dTh > 0. If Nsv does not increase at any point in the distribution, then no threshold is returned.
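The heuristic just described, which scans thresholds in increasing order and returns the first at which Nsv rises, can be sketched as below. Again, the `(node, node, weight)` tuple representation, the function name and the toy network are hypothetical illustrations, not the paper's implementation.

```python
from itertools import combinations

def select_threshold(edges, thresholds):
    """Return the smallest threshold at which Nsv(Th) increases
    (i.e. dNsv/dTh > 0), or None if Nsv never increases."""
    prev = None
    for th in sorted(thresholds):
        kept = [(u, v) for u, v, w in edges if w >= th]
        if not kept:  # network fully filtered; stop scanning
            break
        nodes = {n for e in kept for n in e}
        nsv = len(kept) / len(nodes)
        if prev is not None and nsv > prev:
            return th
        prev = nsv
    return None

# Toy network: two tight 4-node cliques (heavy edges) joined by two
# weak intercluster edges, plus one loosely attached outlier node "i".
heavy = [(u, v, 50.0) for grp in ("abcd", "efgh")
         for u, v in combinations(grp, 2)]
edges = heavy + [("d", "e", 1.0), ("a", "f", 1.0), ("i", "a", 2.0)]
```

In this toy network, Nsv dips while the weak intercluster edges are filtered out, then rises once the outlier node disconnects (edge count falls by one while node count also falls by one); the heuristic returns that turning point.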
We applied this heuristic to all four networks, generating thresholds of 1.0 for SLC, 1.0 for Amidohydrolase, 20.0 for Enolase and 69.0 for Kinase. Plots of dNsv(Th)/dTh for all four networks are available in Supplementary Figure S1. The heuristically determined thresholds lead to a performance increase in three of the four datasets (A, C and D) for both the Force and MCL algorithms, relative to the unthresholded performance of these same algorithms. For the fourth dataset (B), thresholding leads to an increase in MCL performance and a slight decrease of 0.01 in Force performance. For all four datasets, MCL performance at the heuristically determined threshold is greater than Force performance at that same threshold.
We qualitatively confirmed the improvement in clustering quality by visualizing the generated clusters before and after filtering with the heuristically determined thresholds described above (Figs 2 and 3) using Cytoscape (Shannon et al., 2003), a biological network visualization and analysis tool with both MCL and Force clustering capabilities, through its ClusterMaker plugin (http://www.cgl.ucsf.edu/cytoscape/cluster/clusterMaker.html). Visualization was carried out by removing all edges from each similarity network that did not connect pairs of proteins within the same cluster. Afterwards, the clusters within each network were made visible through Cytoscape's Organic layout, a force-directed layout algorithm similar to Fruchterman–Reingold (Fruchterman and Reingold, 1991). Nodes in the visualized network were colored by known protein family assignment to allow visual assessment of clustering quality.
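The pre-layout filtering step described above, keeping only edges whose endpoints were clustered together, could be sketched as follows. This is a hypothetical helper for illustration, not part of Cytoscape or ClusterMaker; edges are assumed to be `(node, node, weight)` tuples and cluster assignments a node-to-cluster-ID mapping.

```python
def intracluster_edges(edges, cluster_of):
    """Keep only edges whose endpoints share a cluster assignment, so a
    force-directed layout pulls each cluster into its own visual group."""
    return [(u, v, w) for u, v, w in edges
            if u in cluster_of and v in cluster_of
            and cluster_of[u] == cluster_of[v]]

# Toy data: "a" and "b" share a cluster; "c" is in another; "x" is unassigned.
clusters = {"a": 0, "b": 0, "c": 1}
edges = [("a", "b", 5.0), ("b", "c", 1.0), ("c", "x", 2.0)]
kept = intracluster_edges(edges, clusters)
```

Only the intracluster edge `("a", "b", 5.0)` survives; the intercluster and unassigned edges are dropped before layout.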
Fig. 2. Visualizing MCL Clusters for the SLC Superfamily. Each set of clustering results has been visualized in Cytoscape using the Force-directed layout algorithm. Each node represents a protein, colored by the currently best available family assignments.
Fig. 3. Visualizing MCL Clusters for the Kinase Superfamily. Each set of clustering results has been visualized in Cytoscape using the Force-directed layout algorithm. Each node represents a protein, colored by the currently best available family assignments.
Figure 2A shows the unthresholded MCL clustering output for the SLC superfamily, where many false negatives are present. Multiple proteins belonging to the same family are split across distinct clusters instead of being grouped together. In Figure 2B, the heuristically selected threshold has been applied and the number of false negatives has correspondingly decreased. Eliminating certain redundant edges between subsets of proteins within families prevents these subsets from clustering into distinct groups. As a result, more proteins within the same family are clustered together.
Figure 3A shows the unthresholded MCL clustering output for the Kinase superfamily. The two resulting clusters provide little discrimination among sequences that are known to belong to different functional families. Figure 3B shows the corresponding change in clustering output after applying the heuristically selected threshold. Many of the families that had previously clustered together now separate into their own distinct clusters. The same pattern seen in the Kinase superfamily also holds for Amidohydrolase (Supplementary Fig. S2) and Enolase (Supplementary Fig. S3).
Visualization of the clustering of well-annotated protein families reaffirms that the increase in F-measure values after automated thresholding corresponds to a genuine increase in clustering quality. Thresholding eliminates false positives in some networks and false negatives in others, leading to a relevant improvement in the accuracy of the final clustered results. Visualizing the thresholded networks prior to clustering (Supplementary Fig. S4) emphasizes the role of thresholding in the improvement of clustering quality. When the thresholded SLC, Amidohydrolase and Enolase networks are displayed using a force-directed layout, the boundaries between individual families become clearly visible, even though edges continue to connect the families. In the thresholded Kinase network, clusters of individual families separate out completely, resulting in high-quality input for any protein family-specific clustering algorithm.
To investigate further why families separate out from the thresholded Kinase network, but not from the other three datasets, we hypothesized that this network consisted of tightly connected Kinase families with unusually high edge weights. To confirm this, we calculated the average edge weight within families, and also between families, for each of the four datasets (Supplementary Table S2
). We found that while the average intrafamily edge weight for Kinase ranked highly at 99.7, the average intrafamily edge weight for Enolase ranked even higher at 114.9. The key to understanding what made Kinase distinct lay in the average edge weight between families. The average interfamily edge weight for Kinase was very low, at 13.8, relative to its high intrafamily edge weight. Furthermore, the Kinase superfamily was the only dataset assigned a threshold greater than its average interfamily edge weight. In the other three datasets, our heuristic assigned a threshold that filters some, but not all, of the intercluster edges. From these results, we conclude that our threshold heuristic filters out some network noise without completely disrupting network connectivity, except in those cases where there is a significant difference between interfamily and intrafamily edge weights.
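The intrafamily versus interfamily averages used in this comparison can be computed with a sketch like the following. The helper name and toy data are hypothetical; the real values come from Supplementary Table S2.

```python
def avg_edge_weights(edges, family_of):
    """Average edge weight within families vs. between families,
    given edges as (node, node, weight) tuples and a node-to-family map."""
    intra, inter = [], []
    for u, v, w in edges:
        (intra if family_of[u] == family_of[v] else inter).append(w)
    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0
    return mean(intra), mean(inter)

# Toy data: heavy within-family edges, light between-family edges.
family_of = {"a": "F1", "b": "F1", "c": "F2", "d": "F2"}
edges = [("a", "b", 100.0), ("c", "d", 120.0),
         ("a", "c", 10.0), ("b", "d", 20.0)]
```

Here the intrafamily average (110.0) far exceeds the interfamily average (15.0), the kind of gap that, as observed for Kinase, lets a threshold separate families completely.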
Additionally, our results showed that the average intrafamily edge weights for SLC and Amidohydrolase were considerably lower, at 38.7 and 51.6, respectively. This leads us to hypothesize that the difference in shape between the gradual-descent and rapid-descent distributions is related to the connectivity between families. In the two rapid-descent datasets, connectivity exists at lower edge weights, resulting in a more rapid transition between interfamily edges and intrafamily edges. We believe this more rapid transition produces the more rapid edge weight decay observed in our plots.
3.5 Automated threshold selection performance on additional clustering algorithms
The following table lists the performance of the MCL, Force, TransClust, SCPS and Affinity Propagation algorithms on both the unthresholded networks and the networks filtered using our automated threshold selection heuristic. MCL outperforms the other four algorithms, but only when the automated threshold is applied.
F-measure scores of clustering algorithms for thresholded and unthresholded superfamilies
Force ranks second in overall performance. It produces results with an F-measure greater than 0.80 for two of the unthresholded networks and three of the thresholded networks. The fourth thresholded network, Enolase, scores an F-measure of 0.74 under Force. This is a substantial improvement over the unthresholded F-measure of 0.43, but still falls short of the 0.83 F-measure achieved by MCL clustering of the thresholded Enolase network.
TransClust ranks third in clustering performance. Both the thresholded and unthresholded SLC networks score an F-measure of 0.87. The thresholded Kinase network scores an F-measure of 0.82, improving significantly from the unthresholded F-measure of 0.15. Thresholding the Enolase and Amidohydrolase networks also improves the TransClust output, but not to a significant extent. Both these thresholded networks score an F-measure of less than 0.70.
SCPS performs exceedingly poorly when its default parameter values remain unchanged. Although clustering improvements are observed in all four networks after thresholding is applied, F-measure values fall below 0.70 for three of our four datasets. In an effort to improve these results, we adjusted the SCPS epsilon parameter, which produces tighter clustering at higher values. By sampling the epsilon parameter in increments of 0.01, we determined that an epsilon value of 1.1 leads to better clustering than the default value of 1.0. At epsilon 1.1, SCPS clustering of the thresholded SLC and Kinase networks results in F-measure scores equal to or greater than 0.80. The thresholded F-measure for Enolase is 0.72, a significant improvement over the unthresholded F-measure of 0.40. Unfortunately, the Amidohydrolase F-measure remains exceedingly poor, scoring below 0.40 even after thresholding.
Affinity Propagation is unable to cluster the four protein networks into functionally meaningful families. All F-measures fall below 0.20. Sampling alternate Affinity Propagation parameters did not yield any improvement.
To reconfirm these quantitative observations, we recalculated the clustering table using the Geometric Separation statistic (Brohee and van Helden, 2006), initially developed to compare the quality of protein interaction network clustering approaches. The Geometric Separation results (Supplementary Table S3) confirm the conclusions drawn from the F-measure table. The use of an automatically selected threshold improves the Separation statistic, and the thresholded MCL Separation scores rank highest relative to the other clustering algorithms in the table.