For convenience, we briefly summarize our methods; further details are provided in the Methods section. A selection of conserved domain (CD) families was taken from the Conserved Domain Database (CDD). For each CD family, a query (representative) structure was chosen and a list of similar structures (neighbors) was generated using the VAST algorithm. The lists were filtered by sequence identity to reduce redundancy. For a given query structure, a neighbor on its list was considered a "true positive" if and only if it belongs to the same superfamily as the query in the SCOP database. A given structural similarity measure (score) can be used to rank the pairs of queries and neighbors, and for a chosen cutoff we can compute the fractions of true positives (sensitivity) and false positives found at or above the cutoff. The fractions of true and false positives provide a basis for comparing the performance of the different similarity measures.
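The cutoff-based evaluation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the flat list of pooled query-neighbor pairs, and the handling of tied scores are all assumptions made for illustration. The sketch walks down the ranked pairs and reports the highest sensitivity reached while the fraction of false positives at or above the cutoff stays within the chosen error rate.

```python
from typing import Sequence


def sensitivity_at_error_rate(
    scores: Sequence[float],
    is_true_positive: Sequence[bool],
    max_error_rate: float,
    higher_is_better: bool = True,
) -> float:
    """Sensitivity at a fixed false-positive rate (hypothetical helper).

    scores            one similarity score per query-neighbor pair
    is_true_positive  True if the neighbor shares the query's superfamily
    max_error_rate    allowed fraction of false positives (e.g. 0.01, 0.05)
    higher_is_better  set to False for distance-like scores such as RMSD
    """
    if len(scores) != len(is_true_positive):
        raise ValueError("scores and labels must have the same length")

    # Rank pairs from most similar to least similar.
    pairs = sorted(
        zip(scores, is_true_positive),
        key=lambda pair: pair[0],
        reverse=higher_is_better,
    )
    n_pos = sum(1 for _, label in pairs if label)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one true and one false positive")

    best = 0.0
    tp = fp = 0
    i = 0
    while i < len(pairs):
        # Pairs with identical scores fall on the same side of any cutoff,
        # so they are counted together.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        if fp / n_neg > max_error_rate:
            break  # this cutoff admits too many false positives
        best = tp / n_pos
        i = j
    return best
```

Calling the function once per similarity measure, with `max_error_rate` set to 0.01 and 0.05, would yield one pair of sensitivities per measure that can then be compared directly.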
Table 1 shows the sensitivities of all eight similarity scores at two error (false positive) rates (1% and 5%). As can be seen from this table, the LHM, GSAS and HCS measures demonstrate greater sensitivity than the conventional measures of structural and sequence similarity. For example, at the 1% error rate, LHM detects on average more than twice as many true positives as RMSD and fraction aligned, and more than 1.5 times as many true positives as percent identity. In Figure 1 we plot the sensitivity curves for the three best-performing scores (GSAS, LHM and HCS). It is apparent from this figure that the LHM curve lies lower than the curves corresponding to HCS and GSAS, indicating that LHM on average outperforms these two other measures for the overall test set of 152 families.
Table 1 Sensitivity values estimated from curves. Sensitivity values estimated from the curves (Figure 1) at 1% and 5% error rates (fraction of false positives) are listed for different similarity measures: loop Hausdorff measure (LHM), HCS score (HCS), gapped ...
Figure 1 Sensitivity curves for the three best-performing measures. The fraction of correctly ranked homologous VAST neighbors (true positives, sensitivity) is plotted against the fraction of incorrectly ranked homologous VAST neighbors for similarity measures ...
It is also of interest to compare the performance of the different measures with respect to ranking difficulty. To estimate the ranking difficulty for each CDD family, we take the average percent identity between its query structure and the non-redundant set of true positive structures (homologous structure neighbors). As shown in Figure 2, there is a broad distribution of sensitivity values across the different degrees of ranking difficulty, implying that some domain families are easier to recognize than others. Queries with closely related structure neighbors show higher sensitivity, and queries with distantly related neighbors show lower sensitivity; this trend is apparent for all similarity scores used in the study. It should be noted that this analysis was done on a smaller test set of 97 families that had enough family members (at least 20) to make the calculation of sensitivities per family more reliable. We also note that 13 of the CDD families are in the most difficult bin (no more than 10% average sequence identity) and 52 are in the second most difficult bin, where the average sequence identity ranges from 10–20%. Thus, 65 of the 97 CDD families may be considered to be well within the zone of sequence similarity where homology is hard to ascertain.
Figure 2 Performance on families of differing degrees of difficulty. The barplot shows the sensitivity at 5% error rate for each bin of ranking difficulty. Ranking difficulty is estimated as an average percent identity between the query structure and non-redundant ...
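The ranking-difficulty estimate and its binning can be sketched as follows. This is only an illustration under stated assumptions: the function names, the fixed 10% bin width, the choice to make the upper bin edge inclusive (so that "no more than 10%" falls in the first bin), and the family labels in the usage note are all hypothetical rather than taken from the study.

```python
import math
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def ranking_difficulty(percent_identities: Sequence[float]) -> float:
    """Average percent identity between a query structure and its
    non-redundant true positive neighbors (lower means harder)."""
    if not percent_identities:
        raise ValueError("a family needs at least one true positive")
    return mean(percent_identities)


def difficulty_bin(avg_identity: float, width: float = 10.0) -> Tuple[float, float]:
    """Return the (low, high] bin containing avg_identity.

    The upper edge is inclusive (an assumption), so exactly 10% lands
    in the 0-10 bin and anything just above it in the 10-20 bin.
    """
    index = max(math.ceil(avg_identity / width) - 1, 0)
    return (index * width, (index + 1) * width)


def group_families_by_difficulty(
    identities_by_family: Dict[str, Sequence[float]],
    width: float = 10.0,
) -> Dict[Tuple[float, float], List[str]]:
    """Map each difficulty bin to the families that fall into it."""
    groups: Dict[Tuple[float, float], List[str]] = defaultdict(list)
    for family, identities in identities_by_family.items():
        bin_key = difficulty_bin(ranking_difficulty(identities), width)
        groups[bin_key].append(family)
    return dict(groups)
```

For example, `group_families_by_difficulty({"famA": [8, 12], "famB": [15, 18]})` would place the hypothetical `famA` (mean 10%) in the 0-10 bin and `famB` (mean 16.5%) in the 10-20 bin; the per-bin sensitivity at the 5% error rate can then be averaged over the families in each group.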
Comparing the different scores, it is clear from Figure 2 that HCS, GSAS and LHM exhibit better sensitivity than the other scores used in this study in the twilight zone of sequence similarity, below 30% identity. Moreover, HCS and GSAS outperform the others in the most difficult cases, below 10% sequence identity. This is not surprising. GSAS, for example, is a combination measure that uses alignment length, RMSD and the number of unaligned gapped regions, and it is not unexpected that a combination measure should do well. As was shown earlier, a linear combination of an alignment-based structural score (RMSD) and a loop-based structural score (LHM) performed much better than either score used separately [19
The HCS scores use CD core models that have been determined by careful manual alignment curation using both sequence and structure data. From Figure 2 it is quite clear that recognizing this common conserved core is a powerful method for inferring homology and functional similarity in the most difficult cases. For example, for the Class I aminoacyl-tRNA synthetase (aaRS) catalytic core domain (cd00802), the HCS score yields a sensitivity of 0.79 at the 5% error rate, whereas the sensitivities obtained with other measures are substantially lower (0.44, 0.26, 0.67, 0.23, and 0.44 with percent identity, RMSD, LHM, fraction aligned, and GSAS, respectively). The aaRS catalytic core domain has 56 non-redundant structure neighbors, of which 12 are in the same SCOP superfamily, with an average of about 10% sequence identity. The aaRS structural core is based on the Rossmann fold and is well conserved, with a number of functionally important sites located at different core regions. These include a pair of ATP-binding sites with important sequence/structural motifs (the "HIGH" and "KMSKS" motifs) that are characteristic of class I aaRS and are included in the core model. Such features cause the HCS score to rank the SCOP superfamily members in this family more highly than the numerous other Rossmann folds with more remote evolutionary relationships and less functional similarity.
The preceding analysis concerns the average performance of the various measures. In practice, however, most researchers will be interested in particular protein families, so we should also investigate what happens in specific cases. To do so, we first further limit the test set to those CDD families with at least 10 true positives and 10 false positives among their non-redundant structure neighbors; there are 44 such CDD families altogether. We found that there are 20 CDD families for which at least one similarity score (LHM, HCS or GSAS) had a sensitivity higher than 80% at the 5% false positive rate. On the other hand, there are seven CDD families for which all three scores have a sensitivity of less than 50% at the 5% false positive rate (Table 2).
Table 2 Difficult families. Difficult families for all of the measures. For these seven CDD families, all six of the measures had sensitivities of under 0.50 at the 5% error rate.
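The per-family selection and the two thresholds used in this analysis can be written down as a short sketch. The function names and the three category labels are assumptions introduced here for illustration; only the thresholds (at least 10 true and 10 false positives, sensitivity above 0.80, sensitivity below 0.50) come from the text.

```python
from typing import Dict


def has_enough_neighbors(n_true_pos: int, n_false_pos: int, minimum: int = 10) -> bool:
    """Keep a family only if it has at least `minimum` true positives and
    `minimum` false positives among its non-redundant neighbors."""
    return n_true_pos >= minimum and n_false_pos >= minimum


def classify_family(
    sensitivities: Dict[str, float],
    high: float = 0.80,
    low: float = 0.50,
) -> str:
    """Classify one family from its sensitivities at the 5% error rate.

    sensitivities maps a score name (e.g. "LHM", "HCS", "GSAS") to the
    sensitivity that score achieved for this family.
    """
    values = list(sensitivities.values())
    if not values:
        raise ValueError("at least one score is required")
    if any(value > high for value in values):
        return "well_recognized"  # at least one score exceeds the high threshold
    if all(value < low for value in values):
        return "difficult"  # every score falls below the low threshold
    return "intermediate"
```

With the cd00802 sensitivities quoted earlier in the text (HCS 0.79, LHM 0.67, GSAS 0.44), `classify_family` returns `"intermediate"`, since no score exceeds 0.80 and not all scores fall below 0.50.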
It is apparent from Table 2 that the seven "difficult" CDD families involve folds that span a broad range of sequence, function, and phylogenetic diversity and are often referred to as "superfolds". It is to be expected that the measures we consider should have difficulty producing the correct evolutionary ranking for structure neighbors of such families. Most of these superfolds have protein cores that are very well conserved among all the diverse members of the fold because of stability, foldability, or other requirements. Moreover, the subtle structural or sequence features and motifs that may provide clues to evolutionary relationships are not all included in our CDD-derived core models. Finally, as was shown previously, there is evidence that all proteins from certain superfolds have a common ancestor and are therefore possibly all homologous (by definition) [19
We also compared the measures over the four different major SCOP fold classes, at the 1% and 5% error rates. These results are available as supplementary data [see Additional file 1] and via the internet at [22