Performance of hierarchical clustering as tree building methods
We generated pairwise similarity scores between all pairs of domains in the SCOP C class using three different structure alignment methods, VAST, SHEBA and DALI, and two different similarity metrics for each method, altogether yielding six similarity matrices. We used four different hierarchical clustering methods (Single, Average and Complete linkage, and Ward's method [20]) to build dendrograms. The dendrograms were cut using seven SCOP-independent strategies and three SCOP-dependent strategies. Finally, two different criteria for terminating the tree-cutting process were investigated: we stopped the cutting either when a 1% false positive rate (FPR) was reached or when the number of clusters matched the number of folds in the SCOP C class, which is 94 for the current dataset. The true positive rate at 1% false positive rate (TPR01) and the number of clusters in each partition, resulting from these various partitioning approaches with the 1% FPR termination criterion, are reported in Tables and .
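The pipeline above (a precomputed pairwise distance matrix, a Ward's method dendrogram, a level cut to a fixed number of clusters) can be sketched with SciPy. The toy data and the names `dist` and `n_clusters` are illustrative, not taken from the paper's implementation:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Stand-in data: 30 "domains" with a Euclidean distance matrix playing
# the role of a structure-comparison distance matrix (illustrative).
rng = np.random.default_rng(0)
pts = rng.normal(size=(30, 4))
dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)

# Ward's method dendrogram built from the condensed distance matrix.
Z = linkage(squareform(dist), method="ward")

# Level cut: one horizontal cut producing a fixed number of clusters
# (94 for the SCOP C class; 5 here for the toy data).
n_clusters = 5
labels = fcluster(Z, t=n_clusters, criterion="maxclust")
print(len(set(labels)))
```

Swapping `method="ward"` for `"single"`, `"average"` or `"complete"` reproduces the other three linkage variants on the same distance matrix.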
TPR values at 1% FPR (TPR01) for Hierarchical Cluster Methods and Tree Cutting Strategies.
Number of Clusters for Hierarchical Cluster Methods and Tree Cutting Strategies at 1% FPR.
Trees generated by Ward's method give better performance. The TPR01 values obtained for Ward's clustering method are, with one exception, always higher than those for the Average, Complete or Single linkage methods. The simplest strategy, the level cut, achieves with Ward's method an average TPR01 of 50% across all six structure comparison scores. Complete and Average linkage clustering constitute the next best alternatives to Ward's method, with average TPR01 values across all similarity scores and structure comparison methods of 32% and 29%, respectively. Single linkage clustering uniformly shows the most divergence from SCOP. Not only does Ward's method achieve the highest average TPR01 value, but its values also vary less across different similarity scores and structure comparison methods than do those for Complete, Average and Single linkage, suggesting that Ward's method is more satisfactory for this application.
With other tree cutting strategies the trend among various clustering methods is the same. For the largest size cut strategy, Ward's method gives an average TPR01 value of 61%, at least 10% better than for the three other clustering methods. Again, Complete and Average Linkage give similar average TPR01 values, while Single Linkage is very low in comparison. In view of the clearly superior results with Ward's method, we investigate tree cutting strategies using that method alone.
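A minimal sketch of the largest size cut strategy, under the assumption that at each step the current cluster containing the most leaves is split into its two dendrogram children until the desired number of clusters is reached; the helper `largest_size_cut` is hypothetical, not the paper's code:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

def largest_size_cut(Z, k):
    """Repeatedly split the current cluster with the most leaves."""
    roots = [to_tree(Z)]                 # start with one cluster: the whole tree
    while len(roots) < k:
        big = max(roots, key=lambda n: n.get_count())
        if big.is_leaf():
            break                        # nothing left to split
        roots.remove(big)
        roots.extend([big.get_left(), big.get_right()])
    return [sorted(n.pre_order()) for n in roots]   # leaf ids per cluster

rng = np.random.default_rng(1)
Z = linkage(rng.normal(size=(20, 3)), method="ward")
clusters = largest_size_cut(Z, 4)
print(len(clusters), sum(len(c) for c in clusters))  # 4 clusters covering all 20 leaves
```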
Performance of tree cutting strategies
In the following, the performance of tree-cutting strategies is analyzed. The 10 tested tree-cutting strategies fall into four groups according to their TPR01 values (Figure and Table ), and are represented by the longest branch cut (1), level cut (2), largest size cut (3) and mutual information cut (4) strategies. Figure illustrates the comparative performance among these representative tree cutting strategies, for the VAST number of matched residues similarity score.
Figure 1 Receiver Operating Characteristic (ROC) curves for various tree cutting strategies, using Ward's method clustering based on the VAST Nres similarity score. The true positive rate (TPR, Y axis) is plotted against the false positive rate (FPR, X axis).
Among the four SCOP-independent tree cutting strategies shown, "largest size cut" achieves the best performance with its ROC curve dominating all other SCOP-independent ROC curves. Moreover, "largest size cut" ROC curve is the closest to that for Mutual Information cut, a strategy which attempts to maximize the agreement of the partition with SCOP at each step. Similar results (data not shown) were seen for SHEBA and DALI similarity scores.
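The TPR and FPR underlying these ROC curves can be computed by pair counting: a positive is a pair of domains in the same reference (SCOP) fold, and a predicted positive is a pair placed in the same cluster. A small sketch with illustrative labels:

```python
from itertools import combinations

def tpr_fpr(candidate, reference):
    """Pair-counting TPR/FPR of a candidate partition vs. a reference."""
    tp = fp = pos = neg = 0
    for i, j in combinations(range(len(reference)), 2):
        same_ref = reference[i] == reference[j]    # a "positive" pair
        same_cand = candidate[i] == candidate[j]   # a "predicted positive" pair
        if same_ref:
            pos += 1
            tp += same_cand
        else:
            neg += 1
            fp += same_cand
    return tp / pos, fp / neg

ref = [0, 0, 0, 1, 1, 2]       # three "folds" (illustrative)
cand = [0, 0, 1, 1, 1, 2]      # a candidate clustering
print(tpr_fpr(cand, ref))      # -> (0.5, 0.18181818181818182)
```

Sweeping the cut depth of a dendrogram and recording one (FPR, TPR) point per partition traces out a ROC curve of this kind.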
The longest branch cut group (Table , Group 1) contains the strategies of lowest performance, e.g. the tree skewness strategy, as indicated by their respective mean TPR01 values. The longest branch cut strategy yields one of the lowest agreements with SCOP, in particular for FPR values below 30% (Figure ). Above this FPR level, its performance becomes more acceptable relative to the upper bound given by the mutual information cut. This strategy appears to make more sense when the inter-cluster distance is high, i.e. for the highest nodes of the tree, but it behaves much worse than the others when the inter-cluster distance is reduced, as its ROC curve lies closer to the main diagonal than the ROC curves of the other strategies. In the lowest part of the tree, inter-cluster distances vary only slightly, so the longest-branch criterion for choosing, among the many possibilities, the next sub-tree to cut may not be discriminative enough.
Level cut and maximum entropy cut strategies form another group (Table , Group 2) characterized by middling performance. Their grouping is explained by the fact that these strategies perform comparably at low FPR values and have comparable average TPR01 values. Higher variability among the TPR values is noted for the maximum entropy cut strategy, however. Indeed, for VAST and SHEBA, the level cut strategy is, as a rule, better than the maximum entropy cut, with the DALI number of matched residues (Nres) score the only exception. The highest TPR01 for the maximum entropy cut strategy (58%) is obtained with the Nres metric, while the lowest value (40%) is obtained with VAST Pcli.
The third group, built around the largest size cut strategy (Table , Group 3), also includes the tree completeness cut and highest tree cut strategies, and corresponds to the SCOP-independent strategies of highest performance, with average TPR01 values ranging from 58% to 61%. Tree height and tree completeness are closely related to the number of leaves, i.e. the size of the tree. These results indicate that of these three topological properties, size is the better measure for the purpose of creating partitions that agree with SCOP. This holds for partitions of small size (large FPR values) as well as when the number of clusters is large (small FPR values).
The fourth group (Table , Group 4) is made up of the mutual information cut and best TPR/FPR ratio cut strategies, both of which make direct use of the SCOP fold partition and thus are not useful for an independent classification. The poor performance of the best TPR cut strategy (Table , Group 1) indicates that strategies which ignore the false positives will not generate SCOP-like partitions.
The size of each partition is reported in Table , and shows a large variation in the number of clusters depending on the hierarchical clustering method. Single linkage always leads to partitions with a high number of clusters, in fact close to the total number of domains M, when FPR is kept low at 1%.
Comparison of partitions
The relative organization of automatically generated partitions can be understood by first defining a distance measure between two partitions and then displaying the partitions in a distance-preserving graph. The distance between partitions (Eq. 8), computed as the total number of disagreements about whether or not a pair of domains is in the same cluster, is reported in Table . In this exercise, partitions with exactly 94 clusters were compared, including the SCOP fold partition.
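The Δ-distance of Eq. 8 counts the domain pairs on which two partitions disagree about co-membership. A sketch that computes it from the contingency table, avoiding the explicit loop over all M(M−1)/2 pairs; names are illustrative:

```python
import numpy as np

def pair_disagreements(a, b):
    """Number of domain pairs clustered together in exactly one of a, b."""
    a, b = np.asarray(a), np.asarray(b)
    _, ai = np.unique(a, return_inverse=True)
    _, bi = np.unique(b, return_inverse=True)
    n = np.zeros((ai.max() + 1, bi.max() + 1), dtype=np.int64)
    np.add.at(n, (ai, bi), 1)                            # contingency table
    m = len(a)
    same_a = (np.square(n.sum(axis=1)).sum() - m) // 2   # co-clustered pairs in a
    same_b = (np.square(n.sum(axis=0)).sum() - m) // 2   # co-clustered pairs in b
    same_both = (np.square(n).sum() - m) // 2            # co-clustered in both
    return same_a + same_b - 2 * same_both               # pairs in exactly one

print(pair_disagreements([0, 0, 1, 1], [0, 1, 1, 1]))    # -> 3
```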
Distance between partitions*, (Δ-distance in 1000s)
Average distances (Table ) among partitions within a given comparison method (VAST, SHEBA or DALI) are uniformly lower, by almost a factor of two, than the distances between these partitions and SCOP. Partitions from VAST tend to be slightly more heterogeneous than those from DALI or SHEBA. On average, automatic partitions from the three comparison methods are similarly distant from SCOP. Similarly, for a given cut strategy, the average distance among its partitions is consistently smaller than the distances from those automated partitions to SCOP (Table ). Further, largest size cut partitions are half as distant among themselves as are those of the level cut strategy. The pattern of distances confirms that the largest size cut partitions are much closer to the SCOP fold partition than level cut partitions, as also seen in the ROC curves.
Δ-distance*, in 1000s, between partitions within each method and minimum Δ-distance from each method to SCOP
Table shows that every automatic partition is much closer to other automatic partitions, than to the SCOP fold partition. For example, the DALI Z-score level-cut partition (DZL) is at most 56,000 units from all other automated partitions, but 91,000 units from SCOP. The closest automated partition to SCOP is DALI Z-score, largest size cut (DZS), with a distance of 63,000 units, which in turn is no farther than 49,000 units from all other automated partitions. DZL is farthest from both SCOP and from at least one other automatic partition. Conversely, DALI Nres with Largest size-cut (DNS) is perhaps most representative of automated methods as it minimizes the maximum distance to other such methods.
Distance between Automatic Partitions and SCOP fold Partition
A convenient, intuitive representation of the organization of these partitions is obtained using multi-dimensional scaling, a technique which embeds the 13 partitions into a low-dimension Euclidean space so that the pairwise distances are approximately preserved. Figure represents the automatic and SCOP fold partitions in a 2-dimensional space. Automatic partitions are well-separated from the SCOP fold partition, which appears isolated. Largest size-cut partitions are generally closer to SCOP than are level-cut partitions, and are also less spread.
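Classical MDS can be computed directly from a Δ-distance matrix by double-centering and eigendecomposition; the 3×3 toy matrix below stands in for the paper's 13×13 partition distance table:

```python
import numpy as np

def classical_mds(D, dim=2):
    """Embed points so pairwise Euclidean distances approximate D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    w, v = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]              # keep the top eigenvalues
    return v[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# Toy distance matrix (a 3-4-5 right triangle, exactly embeddable in 2-D).
D = np.array([[0.0, 3.0, 4.0],
              [3.0, 0.0, 5.0],
              [4.0, 5.0, 0.0]])
X = classical_mds(D)
print(np.linalg.norm(X[0] - X[1]))               # ~3.0: distances are recovered
```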
Figure 2 Classical multidimensional scaling (MDS) plot of various automatically generated partitions and the SCOP fold partition. The MDS plot approximately preserves the distances, Δ, between partitions (see Methods).
Comparison of cluster size distributions
In addition to the number of clusters in a partition, the distribution of cluster sizes may be of interest in selecting an appropriate classification. Figure shows the cluster size distribution for six automatically generated partitions and for SCOP. For comparison, a partition of the same number of domains randomly assigned to 94 clusters with equal probability is shown. This random distribution is well approximated by a Poisson distribution with mean 1330/94 ≈ 14.1 domains per cluster.
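The random baseline can be simulated directly; with N = 1330 domains and K = 94 clusters the expected cluster size is N/K ≈ 14.1:

```python
import numpy as np

# N domains assigned uniformly at random to K clusters; cluster sizes are
# then approximately Poisson-distributed with mean N/K.
rng = np.random.default_rng(42)
N, K = 1330, 94
labels = rng.integers(0, K, size=N)
sizes = np.bincount(labels, minlength=K)
print(sizes.mean())                    # exactly N/K = 14.148...
```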
Figure 3 Box plots of the cluster size distributions for six automatically generated partitions with 94 clusters and the SCOP partition of the C class. Each partition is associated with a box plot summarizing the median (line within the box) and spread of cluster sizes.
We observe first that all partitions, including SCOP, have a lower median cluster size and greater spread of sizes than the random partition. There is evidently sufficient signal within the similarity score matrix to influence the size distribution. Second, the 75th percentiles for automated methods tend to be larger than for SCOP, while the SCOP distribution shows larger positive skewness, with a greater number of unusually large clusters. Third, there is some uniformity in the size distributions within tree cutting strategies, with the largest size cut showing somewhat higher 75th percentiles and less skewness than the level cut distributions. Level cut distributions are closer to the SCOP distribution than the largest size cut distributions in terms of median, 75th percentile size and the larger number of outliers.
The largest SCOP fold (c.1) is in fact split by all strategies and methods, as none includes a cluster with 182 domains. The largest size cut strategy intentionally eliminates outliers of large size, thereby creating more clusters of intermediate sizes, with greater spread (inter-quartile range) than the level cut. The level cut generates a few large-size outliers, but its clusters are typically smaller. This is consistent with the earlier observation that the level cut behaves like the maximum entropy cut at small FPR ranges. Using partitions of 94 clusters, we find FPR values higher than 1% for the level cut strategy but lower than 1% for the largest size cut strategy (Table ), indicating that a trade-off must be made between matching the size distribution and maximizing the TPR relative to the SCOP partition.
Identification of dispersed folds
The spanning cluster of a SCOP fold is the smallest cluster in the dendrogram which spans, or includes, all domains in that fold. The excess span of that fold (see Methods) consists of the domains from other folds that are included in its spanning cluster. A homogeneous cluster is a cluster which includes only domains from a single SCOP fold. The size of the excess span and the size of the largest homogeneous cluster are given in Table for each fold in the SCOP C class for three different dendrograms. These two measures allow comparison of each dendrogram to SCOP on a fold-by-fold basis, and can highlight regions of agreement or disagreement between the two systems. When the excess span is zero and the largest homogeneous cluster is 100% of the fold size, the dendrogram and SCOP are in perfect agreement for that fold: a hypothetical tree-cutting strategy could isolate this particular sub-tree to form a cluster exactly matching the fold.
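The spanning cluster and excess span of a fold can be found by descending the dendrogram from the root to the smallest node that still contains all of the fold's domains. A sketch on a synthetic two-group dataset, where the first five leaves play the role of a fold; `excess_span` is a hypothetical helper, not the paper's code:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

def excess_span(Z, fold_members):
    """Foreign leaves inside the smallest node spanning all fold members."""
    fold = set(fold_members)
    node = to_tree(Z)                            # the root spans everything
    while True:
        for child in (node.get_left(), node.get_right()):
            if child is not None and fold <= set(child.pre_order()):
                node = child                     # a child still spans the fold
                break
        else:
            break                                # node is the spanning cluster
    return set(node.pre_order()) - fold          # domains from other folds

# Two well-separated point clouds; leaves 0-4 act as one "fold".
rng = np.random.default_rng(3)
pts = np.vstack([rng.normal(0.0, 0.5, (5, 2)), rng.normal(20.0, 0.5, (5, 2))])
Z = linkage(pts, method="ward")
print(len(excess_span(Z, [0, 1, 2, 3, 4])))      # -> 0: the fold is a clean subtree
```

A dispersed fold, by contrast, would return a non-empty excess span: foreign domains that any cluster spanning the whole fold is forced to include.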
Excess Span and Largest Homogeneous Cluster Size for SCOP C Class folds.
The first 38 folds reported in Table are in perfect agreement with dendrograms of all three structure comparison methods VAST, SHEBA and DALI. Together they comprise 187 of 1330 domains in the C-class. The next 23 folds (Table , rows 39–61, comprising 227 domains) agree perfectly with the dendrogram of at least one structure comparison method. Thus 61 out of 94 SCOP folds in the C class are consistent with the dendrogram built from at least one pairwise structural similarity measure.
Structure comparison methods differ in the consistency of their associated dendrograms with the SCOP folds. The dendrogram derived from each similarity score disagrees with SCOP for various folds: 48 folds (comprising 1,080 domains) disagree for the VAST tree, 41 (1,026 domains) for the SHEBA tree, and 43 (991 domains) for the DALI tree, according to this criterion.
We consider a fold to be highly dispersed if it disagrees with the trees of all three structure comparison methods. There are 33 such folds (comprising 916 domains), reported in Table , rows 62–94. None of these 33 folds could be obtained as a homogeneous cluster, so each contributes to the loss of agreement between any automatic partition and the SCOP fold partition. For these 33 folds, the intersection of their excess spans was computed and is reported in Table , in the column labelled INT. INT is a measure of the disagreement with SCOP that remains even if all three dendrograms were combined. Thirty-one of the 33 folds give rise to a positive INT, meaning that the same domains contributed to their dispersion within all three trees. These 31 folds therefore contribute to the remaining distance between the automatic and expert-curated partitions, regardless of the tree cutting strategy or structure comparison method used.
Dispersion caused by low structural similarity within folds
To examine such dispersed folds in detail, we select two examples. Figure schematically represents the situation of fold c.58 within each dendrogram. Instead of forming one homogeneous cluster including all domains of the fold, c.58 consists mainly of two homogeneous clusters set far apart in the tree. Between these two homogeneous clusters, domains from other C class folds intervene, in particular from c.78. Thus, one homogeneous cluster of fold c.58 is more similar to domains from fold c.78 than to other domains from fold c.58, and this holds for all three structure comparison methods.
Figure 4 Schematic of the dispersion of fold c.58. Nodes labelled c.58 represent two of the homogeneous clusters of fold c.58. The node labelled c.78 is one homogeneous cluster of fold c.78. Relative join positions reflect those in the complete Ward's method dendrograms.
Figures and present the matrices of pairwise distances between all domains within fold c.58, for VAST, SHEBA and DALI, respectively. Coding the distances by color makes it obvious why fold c.58 is split into mainly two homogeneous clusters by all three dendrograms. The figures also include domain d1b74a1 from fold c.78, which is highly similar to domains of fold c.58. Indeed, many of the pairwise distances involving d1b74a1 (coded yellow) are small enough that it would fit into any cluster of fold c.58. This pattern of high distances between the homogeneous clusters of c.58 and low distances of a fold c.78 domain to members of c.58 explains why no clustering method and no tree-cutting strategy is likely to identify fold c.58 perfectly.
Figure 5 Matrices of pairwise distances among domains of folds c.58 and c.78. Pairwise distances between domains were obtained from the number of matched residues similarity scores, according to Eq. (1), and symmetrized as specified in Methods. The yellow entries mark distances involving domain d1b74a1 from fold c.78.
This pattern of structural similarity measures is further analyzed by structural superposition of domains. Figures and superpose domain pairs from the same homogeneous cluster, and show the degree of structural similarity typical within a cluster. The structural superposition of domains from different homogeneous clusters of c.58 is shown in Figure . The low similarity observed here is mainly due to differences in the N-terminal features, although the three-layer α/β/α arrangement typical of this fold is present in both domains. Figure shows the structural superposition of a domain from c.58 with one from c.78. Again, the strands and helices forming the three-layer α/β/α arrangement are well aligned.
Figure 6 Superposition of c.58 and c.78 domain pairs, by DALI alignment. Panel (a): one intra-cluster pair (cluster 1) of c.58, d1edza2 (146 residues, red) and d1b0aa2 (121 residues, blue); 118 residues aligned, RMSD = 2.7 Å, Z-score = 14.4.
In Figure , the structural superposition (d) of domains from different folds produces an even better alignment than the superposition (c) of domains from two different clusters within the same fold c.58. The disagreement between the automatic and expert-curated classifications arises directly from the low similarity within fold c.58 and the substantial similarity to domains of another fold. It is impossible to merge the two distinct homogeneous clusters of fold c.58 without also including domains from c.78.
As another example, we select fold c.1, one of the largest in the dataset and a highly dispersed fold. The intersection of spanning clusters for this fold includes three domains from a single fold, c.6 (Cellulases), which are partial barrels. Fold c.6 is easily identifiable by all three similarity methods (Table , row 22). Figure shows two TIM barrel fold structures and one Cellulases structure. TIM barrel structures are easily recognizable and are not likely to be confounded with any other fold by a human expert. Yet the typical TIM barrel structure d1clxa_ (Figure ) appears more similar to the Cellulases structure d1dysa_ (Figure ) than to another TIM barrel, d1a4ma_ (Figure ), by all three structure comparison methods. TIM barrel structure (c) has 8 strands, although two of them are much longer and two much shorter than the rest, making the barrel somewhat distorted compared to the typical TIM barrel structure (a). The lower similarity between a typical TIM barrel domain and another member of that fold, compared to the similarity of a c.6 domain to the typical TIM barrel, means that automated clustering methods cannot separate these two folds perfectly. The structural distinctions between the two folds are evidently too subtle to be detected by these structure comparison methods.
Figure 7 Structures from TIM barrel fold c.1 and its variant fold c.6. Panel (a): d1clxa_, a typical TIM barrel domain belonging to family c.1.8.3, with 345 residues. Panel (b): d1dysa_, c.6.1.1, with 345 residues. Panel (c): d1a4ma_, also a TIM barrel.