The combined use of the ROC curve and the confusion matrix heat map has been key to this large-scale analysis of protein classification. Several authors [14, 17, 20, 23] have used the ROC curve to evaluate structure comparison methods using the CATH or SCOP protein classification database as the reference. In the most recent and comprehensive study, Kolodny et al. [20] compared six different methods and found the highest true positive rate to be 50%, at a 1% false positive rate, attained by the DALI method using the native DALI score and CATH as the reference. Our ROC analysis finds true positive rates of 61.6% and 74.8% at a 1% false positive rate for the comparisons of VAST and SHEBA to SCOP, respectively. The differences between their result and ours might be explained by differences between the comparison methods (DALI, VAST, SHEBA), by differences between the definitions used for the false and true positive rates (they do not give explicit equations), and/or by the use of different databases of protein structures (CATH vs. SCOP). In particular, CATH groups domains into different numbers of folds than does SCOP, as noted by Hadley & Jones [25] and Day et al. [26].
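As an illustration of what "true positive rate at a 1% false positive rate" means operationally, the following minimal sketch scans the ranked domain pairs and reports the most permissive cutoff whose false positive rate stays within the target. It assumes the standard definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN), that a higher score means greater similarity, and that a pair is "positive" when the reference classification places both domains in the same fold. The function name `tpr_at_fpr` and its arguments are hypothetical and are not taken from VAST, SHEBA, or DALI.

```python
def tpr_at_fpr(scores, same_fold, max_fpr=0.01):
    """Return (cutoff, tpr, fpr) at the most permissive cutoff with FPR <= max_fpr.

    scores    -- similarity score for each domain pair (assumed: higher = more similar)
    same_fold -- True if the reference classification puts the pair in one fold
    A pair is predicted "similar" when its score is >= cutoff.
    """
    pairs = sorted(zip(scores, same_fold), key=lambda p: -p[0])
    n_pos = sum(1 for _, y in pairs if y)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both same-fold and different-fold pairs")

    best = (float("inf"), 0.0, 0.0)  # cutoff so strict that nothing is accepted
    tp = fp = 0
    i = 0
    while i < len(pairs):
        s = pairs[i][0]
        # Accept every pair tied at this score before evaluating the rates.
        while i < len(pairs) and pairs[i][0] == s:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        if fp / n_neg > max_fpr:
            break  # lowering the cutoff further would exceed the target FPR
        best = (s, tp / n_pos, fp / n_neg)
    return best
```

If a method reports a quantity for which smaller values indicate greater similarity (for example a P-value), the scores would need to be negated before calling this function.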
Aside from providing a global measure of the agreement, ROC curves are also useful because they provide a practical means of selecting a score cutoff value for deciding whether a pair of structures is to be considered similar, by trading off true and false positive rates. Other studies have used methods other than ROC analysis or have ignored that trade-off entirely. In their comparison of several structure comparison methods with CATH, Sierk and Pearson [23] selected a decision level corresponding to the first 100 errors made by the program. Other approaches [24-28] do not use the ROC curve and often fail to properly acknowledge the obligatory trade-off between false and true positive rates, making it difficult to compare their reported degree of agreement with that of others.
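To see why a cutoff fixed by an error count is hard to compare across studies, the sketch below takes one possible reading of a "first N errors" rule: the cutoff is the score of the N-th highest-scoring pair that the reference places in different folds. This reading is an assumption for illustration, not a description of the procedure actually used in [23], and the function names are hypothetical. The second function reports the false positive rate that such a cutoff implies, which depends on how many different-fold pairs the data set contains.

```python
def cutoff_at_nth_error(scores, same_fold, n_errors):
    """Score of the n-th highest-scoring different-fold pair, or None if too few.

    Assumes higher scores mean greater similarity.
    """
    negatives = sorted(
        (s for s, y in zip(scores, same_fold) if not y), reverse=True
    )
    if len(negatives) < n_errors:
        return None
    return negatives[n_errors - 1]


def implied_fpr(scores, same_fold, cutoff):
    """False positive rate when pairs with score >= cutoff are called similar."""
    negatives = [s for s, y in zip(scores, same_fold) if not y]
    if not negatives:
        raise ValueError("no different-fold pairs")
    return sum(1 for s in negatives if s >= cutoff) / len(negatives)
```

Under this reading, the same error count corresponds to a lower false positive rate as the number of different-fold pairs grows, so two studies using the same count on data sets of different size are not operating at the same point on the ROC curve.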
Although the ROC AUC varies somewhat by method, none of the reported values is as high as desired. This raises a fundamental and important question: what mechanisms cause the automatic structural comparison methods to diverge so significantly from SCOP or CATH? To address this question, we need to descend from a global view of the database to a more detailed view of individual folds and, finally, of the domains comprising each fold. To investigate why structural comparison methods diverge from SCOP, we used the confusion matrix to distribute the 1% false positive comparisons among the individual fold pairs, resulting in a "false and true positive rates" map of the protein fold space. This map differs from the map of the fold space constructed by Hou et al. [31, 32], who applied multi-dimensional scaling to pair-wise similarity scores. The exploration of the fold space, guided by our map, leads directly and objectively to the areas or subsets of folds where the divergence between SCOP and the structural comparison methods is most evident. In particular, it has allowed us to move from the areas of high false positive or false negative rates to the corresponding properties of the fold space. False negative rates are seen to relate directly to the issues of core variation and repeated sub-structures within a fold, while false positive rates are linked to the sharing of a common sub-structure between folds. Since the mathematical quantities FPR and TPR are interdependent, so are the corresponding properties of the fold space.
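The idea of distributing comparisons over fold pairs can be sketched as follows. For every pair of folds, the code computes the fraction of domain pairs that score at or above a given cutoff. A diagonal cell (same fold) then plays the role of a fold-specific true positive rate, and an off-diagonal cell (two different folds) that of a fold-pair false positive rate. This cell definition is an assumption made for illustration and may differ in detail from the rates used to build our heat map; the function name and the `score` callable are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def fold_rate_matrix(domains, score, cutoff):
    """Fraction of domain pairs scoring >= cutoff, for each pair of folds.

    domains -- dict mapping domain identifier -> fold label
    score   -- callable(domain_a, domain_b) -> similarity score (higher = more similar)
    Returns a dict keyed by (fold_i, fold_j) with fold_i <= fold_j.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for a, b in combinations(sorted(domains), 2):
        key = tuple(sorted((domains[a], domains[b])))
        totals[key] += 1
        if score(a, b) >= cutoff:
            hits[key] += 1
    # Diagonal keys (f, f) hold TPR-like values; off-diagonal keys hold FPR-like values.
    return {key: hits[key] / totals[key] for key in totals}
```

Plotting the returned values on a fold-by-fold grid gives a heat map in which bright off-diagonal cells mark fold pairs that the method tends to merge and dim diagonal cells mark folds that it tends to split.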
For a particular area of our heat map, we can calculate an index of how likely a method is to confuse the folds in that area, as the ratio of the average fold-specific false positive rate to the average fold-specific true positive rate in that area. A value near 1 indicates that, on average, the folds in this area cannot be distinguished by the structure comparison method. Because it is expressed in terms of true and false positive rates, this index is cutoff dependent and can thus be obtained for more or less stringent false positive rates. The index of confusion is related to, but distinct from, the index of "gregariousness" defined by Harrison et al. [13] for the CATH folds (topology level). Gregariousness is a property of a single fold that measures the number of other folds that are similar to it, as judged by comparing the score to that of an empirically established standard score distribution at a certain cutoff level. The substantial number of highly confused sets of folds listed in Table allows us to examine in detail the source of the discrepancy between SCOP and our structure comparison methods.
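The index of confusion for an area of the map can be written down in a few lines. The sketch below assumes the rates are stored in a dictionary keyed by sorted fold pairs, with diagonal entries holding fold-specific true positive rates and off-diagonal entries holding fold-pair false positive rates. That storage layout, the unweighted averaging over cells, and the name `confusion_index` are assumptions made for illustration.

```python
def confusion_index(rates, folds):
    """Mean off-diagonal rate divided by mean diagonal rate over a set of folds.

    rates -- dict keyed by (fold_i, fold_j) with fold_i <= fold_j;
             (f, f) entries are true positive rates, others false positive rates
    folds -- the folds that make up the area of the heat map under study
    """
    folds = sorted(set(folds))
    tprs = [rates[(f, f)] for f in folds if (f, f) in rates]
    fprs = [
        rates[(f, g)]
        for i, f in enumerate(folds)
        for g in folds[i + 1:]
        if (f, g) in rates
    ]
    if not tprs or not fprs:
        raise ValueError("area must contain diagonal and off-diagonal cells")
    mean_tpr = sum(tprs) / len(tprs)
    mean_fpr = sum(fprs) / len(fprs)
    if mean_tpr == 0:
        return float("inf")  # no same-fold pair is recognized at this cutoff
    return mean_fpr / mean_tpr
```

A value close to 1 means that, at the chosen cutoff, pairs from different folds in the area are accepted about as often as pairs from the same fold. Recomputing the rates at a different cutoff will generally change the index.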
Causes of false negatives and false positives
In the Results section we presented several examples of false negative and false positive cases related in one way or another to the common core. SCOP defines the common core of domains in the same fold to have the "same secondary structure elements in the same arrangement with the same topological connections" (Brenner et al. [29]), leaving open the possibility for some variation, such as differences in length, relative orientations and/or number of the SSEs, which we call variation of the common core.
Variation of the common core of domains within a fold, considered insignificant by SCOP, may still be large enough to cause VAST and SHEBA to find the domains dissimilar, giving rise to false negatives as in Figures and . False negatives may also occur when the common core is so small compared to the whole structure that the overall structural similarity is unrecognizable, as in Figure . Evidence of structural variation of the common core of proteins within the same fold was presented by Chothia & Lesk [4]. When the percentage of sequence identity between domains decreases much below 40%, their common cores tend to diverge structurally. The analysis of the confusion matrix shows that some false negatives for folds reported in Table arise from such core structure variations.
When two domains share an apparent common core, but SCOP judges the core elements to be significantly different, SCOP places the domains in distinct folds. However, the automatic methods may find the domains similar, as in Figure and , giving rise to false positives. Conversely to the case in Figure , when the repeats of a common motif are organized in a regular fashion in a domain, our methods may consider the domains similar, but SCOP may place them in distinct folds (see Figure ). Table enumerates a number of false positive cases arising from closely related common cores in distinct SCOP folds.
VAST and SHEBA decide on similarity on the basis of the largest fraction of matching secondary structural elements or residues. However, visual inspection may allow the overall context of the matching and mismatching parts to play a role. If only a small part matches, but the matching part appears to be the core of each structure, then the match may appear more meaningful. If the number of repeats in a structure appears to be an important property of the structure, structures with different numbers of repeats may be placed in different folds. If, on the other hand, the precise number of repeats is not important for a structure, structures with different numbers of repeats may all be placed in the same fold. If almost all parts match, but some important part, perhaps one critical beta-strand or even an irregular loop, is missing or placed differently in one structure, that structure may be placed in a different fold.
It is possible that the problem is rooted, in part, in the way structural alignment is currently conceived. Analogous to sequence alignment methodology, structural alignment maximizes the match between two structures, at the residue or secondary structure level, to infer a similarity relationship. The concept of similarity implicitly defined by SCOP, on the other hand, is focused on the sharing of higher-level (above SSEs) motifs. This is in contrast to similarity measures based on residue or SSE-level matches, as defined by many structure comparison methods. We have shown examples (beta propellers or alpha-solenoids) where the occurrence of a motif is more appropriate for inferring similarity than is the maximum residue or SSE-level structural match. Although not evaluated directly here, we suspect that the structural comparison methods agree with SCOP when these two concepts agree, i.e. when the motif in question coincides with the maximum residue or SSE-level structural match, but disagree otherwise. Automatic structural similarity measures might thus be improved either by incorporating higher-level structural motifs such as barrels or sheets, rather than remaining at the level of residues, strands or helices, or by weighting matching residues according to their structural context or functional importance.
Problems encountered by structural comparison methods might also be a reflection of intrinsic properties of the protein fold space. We have reported examples which tend to support the idea of structural drift [33], i.e. a series of gradual steps which connect one fold with another, and showed areas where folds were highly confused. In such sets of folds, some structures within the same fold are too dissimilar for their relationship to be detected by structural comparison methods, while those in different folds are not always completely distinct. This raises questions about the fold definition. We have observed, for example, that the distinction between beta barrel and two-layer beta-sandwich domains can be surprisingly difficult. As the relative orientations of the strands in the two beta sheets in a barrel depart from orthogonality and become more parallel, the distinction between barrel and two-layer beta-sandwich motifs becomes fuzzy. Drawing the proper separations within a set of domains in which such a phenomenon is observed is not obvious and necessarily introduces some arbitrariness. Should such diverse folds be sub-divided into two, three or more folds? If this decision is taken at some point in time, with the then available structures, how stable and universal will this distinction remain over time? VAST and SHEBA are able to reproduce a major part of the fold classification of SCOP, consistent with the notion that protein folds are well-defined, discrete entities. However, despite many attempts, SCOP folds or CATH topologies continue to elude precise quantitative or computational definition. We therefore suggest that, for some parts of the fold space, folds are not well-separated entities but more nearly a continuum of structural arrangements, as also observed in [1, 3, 34-36], with some regions more populous than others. Here, apparent "folds" may arise as much from density fluctuations in regions where experimentally determined structures are sparse as from thermodynamic stability wells which would partition the fold space. We speculate that the idea of a continuum will become more apparent as a larger number of new structures are solved by structural genomics projects [31]. In any case, the classification of structures into folds is probably a valuable and practical way of describing the fold space. When the fold space is continuous, this necessitates some arbitrary classification decisions, which may in fact not be completely reproducible by any automated approach.