Reviewer’s report 1
Arcady Mushegian, Stowers Institute for Medical Research, Kansas City, MO, USA
The manuscript by Boratyn et al. describes a new addition to the BLAST family of programs. The main idea of DELTA-BLAST is to start the sequence comparison by matching the query sequence not to individual homologous sequences in the peptide database (as PSI-BLAST does), but to models of conserved sequence domains in a domain database (in this case, the NCBI CDD database), and to build a probabilistic family model (PSSM) from the alignment of the query to the high-scoring domain models. The PSSM is then submitted to a round of sequence database search, the same as in PSI-BLAST. The majority of the manuscript is devoted to benchmarking the performance of DELTA-BLAST against PSI-BLAST and CS-BLAST (an approach, developed by J. Söding’s group, similar to DELTA-BLAST but relying on the library of patterns that is shorter and may be less well curated than CDD).
The authors have an outstanding track record in improving methods of sequence database search, and I am sure that DELTA-BLAST, too, will find its uses. I feel, however, that the paper’s focus on benchmarking leaves several more substantive questions not very well answered. My concerns are along three lines, i.e., what is the scientific problem that DELTA-BLAST is aimed at solving; why does it work; and how is it integrated with the suite of other BLAST programs.
We address these concerns below, in our answers to the Reviewer’s more detailed questions.
One goal of the effort seems to be finding more homologs of a given query than in PSI-BLAST searches. Is that indeed so? In the later sections of the Results and in Discussion, we also see notes on better alignment quality in DELTA-BLAST, and on the ability of DELTA-BLAST to classify the query sequence (“classify” is not defined; does this mean “detect a CD that is the closest match to the query”? Obviously, that would not be the same as saying that the query is in the in-group of that CD family; it may be in an out-group). Which of these goals are primary, and which are more of auxiliary benefits?
We added the following explanatory text as the sixth paragraph of Background: “Our primary goals for DELTA-BLAST are to make use of a PSSM in the search (as in PSI-BLAST) to find more homologs, but to avoid the time spent in the initial BLASTP search. DELTA-BLAST also allows us to explore whether it is better to use longer homologous alignments to quickly construct a PSSM than the short profiles of Biegert and Söding. In future work, it may serve as a platform to experiment with different methods for quickly finding initial matches to a query that can then be used to construct a PSSM.”
The quality of alignments produced by DELTA-BLAST is an auxiliary benefit. Classification of protein sequences was never our goal. We used the term ‘classify’ to describe the first step in DELTA-BLAST, i.e., finding CDs that model a query sequence. We replaced the two occurrences of the word ‘classify’ in Background and Discussion to avoid this confusion.
A related concern is: why is aligning a query to a PSSM of matching conserved domains a better first step in the search strategy (better in the most important sense, i.e., ensuring better sensitivity) than building a PSSM from matching sequences, as PSI-BLAST does? Sometimes the first BLAST search of sequence databases produces no above-the-threshold similarities, and therefore nothing to build a PSSM with, whereas RPS-BLAST gives a significant match to one or more CDs, enabling one to construct a PSSM; I understand that in these cases, DELTA-BLAST would be more sensitive than PSI-BLAST. But many other times, the CD to which the query actually belongs has not been described yet and is not in CDD, and yet PSI-BLAST finds some homologs in the database, allowing PSSM construction and iterative search. In these cases, PSI-BLAST is sensitive and DELTA-BLAST is moot.
We added the following two paragraphs at the end of Discussion: “PSSMs are created from MSAs and constructing an appropriate MSA is critical for any profile-sequence-based search. DELTA-BLAST uses already prepared MSAs stored in CDD for the purpose of annotating protein sequences with conserved domains. DELTA-BLAST performance, whether it is search sensitivity or quality of alignment, strongly depends on the quality and comprehensiveness of the CDD collection. Large numbers of CDs are manually curated to improve MSAs as well as their sensitivity and specificity as search models. CDD also imports MSAs from other projects, which ensures a comprehensive database.”
“Additionally, because CDD search is more sensitive than sequence search, DELTA-BLAST achieves better performance at finding appropriate models to construct a PSSM. Furthermore, manually curated MSAs are less likely to be corrupted by false positive matches as can be the case for a PSI-BLAST PSSM built on the fly. Many query sequences match more than one CD, allowing DELTA-BLAST to build a composite PSSM that may be more effective than the PSSMs associated with individual matching CDs. For sequences that do not match to any CDs, DELTA-BLAST performs a BLASTP search that can be iterated with PSI-BLAST.”
For queries that do not match any CDs, our initial small-scale experiments suggested that it is beneficial in some cases to construct a PSSM using possibly non-homologous segments of CDs. This can be done by increasing DELTA-BLAST’s domain inclusion E-value threshold (a user-controlled parameter). This requires more thorough research, which we plan to do in the future. Furthermore, if a query does not match any CDs, DELTA-BLAST defaults to PSI-BLAST.
Was the testing described in the study done on sequences that mostly followed the former scenario? If so, why? Is a random subset of sequences from a protein database dominated by sequences that are already assigned to a CD?
Yes, the testing was done on a set in which the majority of sequences have an assigned CD. Large-scale experiments that involve different types of proteins require a benchmark set with known homologies. Unfortunately, such a set will often include known proteins, and many known proteins are already assigned to a CD. Currently, about 78% of sequences in the NR database match at least one CD with an E-value below 0.01.
The performance was tested on a relatively small set of queries and a relatively small database, and it is possible that both are indeed strongly enriched in sequences with known domain composition. Have there been any tests that mimic other common use cases, e.g., the set of queries is a complete list of proteins encoded by newly sequenced genomes, or the database is NR or all proteins encoded by genomes in the GenBank Genome division? Would the gain in sensitivity by DELTA-BLAST be the same?
We performed the experiments presented in the manuscript on a gold standard benchmark set with known homologies, so that search accuracy could be compared with results presented in other publications. To mimic other common uses, we looked at the second-iteration PSI-BLAST searches submitted through the NCBI BLAST web page between February 6 and February 13, 2012. Of the 1064 unique sequences submitted during this time, 73% matched at least one CD. We also selected four recently sequenced genomes from diverse taxonomic nodes: Archaea (http://www.ncbi.nlm.nih.gov/genome/11226), Bacteria (http://www.ncbi.nlm.nih.gov/genome/12533), Eukaryota (http://www.ncbi.nlm.nih.gov/genome/11437), and Virus (http://www.ncbi.nlm.nih.gov/genome/12485), and computed the fraction of protein sequences that match at least one CD with an E-value below 0.05 (the default threshold for DELTA-BLAST). In the archaeal genome, 67% of the 2835 protein sequences match a CD. For the bacterial genome, 78% of the 3881 sequences align to a CD. In the eukaryotic genome, 85% of the 4434 sequences match at least one CD, as do 36% of the 105 sequences in the virus genome. We expect that DELTA-BLAST would provide improved sensitivity for the above sets of sequences, although the gain would probably be smaller than for our benchmark set.
Finally, it would be helpful to describe the software offering better: is it integrated with other BLAST programs in any way? Most immediately, if there are no matching CDs, will the program default to PSI-BLAST automatically?
We added the following explanatory text at the end of Background: “DELTA-BLAST is fully integrated with the NCBI BLAST website and the stand-alone BLAST+ package. It is available from the ‘Protein BLAST’ link at the NCBI BLAST website (http://blast.ncbi.nlm.nih.gov). A DELTA-BLAST search on the website can be followed up by PSI-BLAST iterations or the results can be processed further by the distance tree or multiple alignment tools. A new program named deltablast will be part of the command-line BLAST+ package starting with the 2.2.26+ release. Source code and applications for popular platforms are available at ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/.”
We also added the following as the last sentence of Discussion: “For sequences that do not match to any CDs, DELTA-BLAST performs a BLASTP search that can be iterated with PSI-BLAST.”
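For readers of this exchange who use the stand-alone package, a deltablast run might be assembled along the following lines. This is only a sketch and not part of the authors' response: the file and database names ("query.fasta", "nr", "cdd_delta") are hypothetical, and the option names (-rpsdb, -domain_inclusion_ethresh, -num_iterations) are given as understood from the BLAST+ command-line help; they are assumptions to verify with `deltablast -help` for the installed release. The helper only builds the argument list and does not run the search.

```python
import shlex


def build_deltablast_command(query_fasta, protein_db, domain_db="cdd_delta",
                             domain_inclusion_evalue=0.05, num_iterations=1):
    """Assemble an argument list for a stand-alone deltablast run.

    All option names are assumptions to check against `deltablast -help`.
    The default domain inclusion E-value of 0.05 is the DELTA-BLAST default
    mentioned in the responses above.
    """
    if domain_inclusion_evalue <= 0:
        raise ValueError("domain_inclusion_evalue must be positive")
    if num_iterations < 1:
        raise ValueError("num_iterations must be at least 1")
    return [
        "deltablast",
        "-query", query_fasta,            # hypothetical FASTA file with the query
        "-db", protein_db,                # protein sequence database to search
        "-rpsdb", domain_db,              # conserved domain (CDD) database
        "-domain_inclusion_ethresh", str(domain_inclusion_evalue),
        "-num_iterations", str(num_iterations),  # values > 1 continue with PSI-BLAST rounds
    ]


if __name__ == "__main__":
    # Default run, then a run with a relaxed domain inclusion threshold.
    print(shlex.join(build_deltablast_command("query.fasta", "nr")))
    print(shlex.join(build_deltablast_command("query.fasta", "nr",
                                              domain_inclusion_evalue=1.0)))
```

Raising the domain inclusion threshold, as in the second call, corresponds to the user-controlled parameter the authors mention for queries without strong CD matches.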
Quality of written English: Acceptable
Reviewer’s report 2
Nick V Grishin, University of Texas Southwestern Medical Center, Dallas, TX, USA
The main product of this work is a piece of software from the BLAST family that, without iterations, achieves and possibly surpasses iterated PSI-BLAST performance on ASTRAL superfamilies, thus allowing faster and likely more accurate sequence database searches. This fact alone is enough to raise the interest of researchers. The suggested innovation is that, prior to the sequence database search, the new software performs a CDD search to find homologous families and uses their pre-computed and curated alignments to seed the sequence database search.
On the conceptual level, the authors argue that seeding the search with pre-compiled alignments of homologous families is advantageous to seeding the search with short, possibly non-homologous segments similar to the query sequence. This logical statement is firmly supported by comparing their new program, DELTA-BLAST, with CS-BLAST. However, it might be interesting to study whether there is any advantage in combining the two techniques, and whether adding short segment profiles might help searches when homologous profiles in CDD are either very thin or not found.
We thank the Reviewer for this suggestion. Our small-scale experiments suggested that using short CD segments possibly non-homologous to a query may improve DELTA-BLAST sensitivity when there are no strong CD matches. We plan to research this idea further.
My main concern, as always, is with validation. I fully agree with the authors that the validation presented is sufficient to support the main conclusions sought. 1) DELTA-BLAST is not worse, and might be even better than PSI-BLAST on some occasions. Indeed, why would it be worse? It is the same thing, but seeded with more accurate curated CDD alignments. 2) DELTA-BLAST outperforms CS-BLAST, and how could it not? Homologous profiles are expected to be more powerful. However, beyond these conclusions it might not be possible to understand the behavior of the three programs better, for the following reasons:
1. The ASTRAL superfamily dataset is not ideal. According to SCOP, proteins placed in the same superfamily are homologous. However, it is not stated by the SCOP authors that proteins from different superfamilies and even folds are not homologous. Indeed, there are many homologous proteins in different superfamilies and folds, e.g. many proteins in the a/b class (Rossmann-like folds) are most likely homologous regardless of the fold they are placed into, and their detection by sequence search software with an alignment that matches the structure-based alignment should not be counted as a “false positive”. Moreover, by not performing the evaluation on a very rich dataset of pairs within the same fold but in different superfamilies, the authors neglect the most interesting “gray” area of sequence search that their sensitive approach is targeted at, and skew the performance statistics. That is, the majority of protein pairs are thrown away from this evaluation.
Obviously, it is difficult to deal with these pairs, because some of them are homologous, while others are not. However, approaches have been proposed in the literature to deal with this problem.
We agree with the critique and we plan to perform more experiments in the future. We used a gold standard data set used in other publications, so that results can be compared.
2. A ROC curve on all data pooled together might not be fully informative. It might be skewed towards families with longer sequences and thicker profiles that attain lower E-values. Thus the ROC region shown might be dominated by Rossmann folds and P-loop proteins. It might be worth comparing how different programs rank hits for each query, e.g. by checking ROCx plots: the fraction of queries with a ROCx score above a given value vs. the value. The ROCx score is the ratio of the area under the ROC curve up to the x-th false positive to the area under the ideal ROC curve. x is usually small, e.g. around 5.
We included the ROC5 plot suggested by the Reviewer in a figure, along with appropriate text in Results (the last two paragraphs of subsection Homology detection), Discussion (the third sentence), and Methods (the second paragraph of Retrieval accuracy).
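The ROCx score defined in the Reviewer's comment can be illustrated with a short sketch. It is not taken from the manuscript: the function names are hypothetical, and it assumes the common convention that, when a hit list contains fewer than x false positives, the missing false positives are treated as ranking below every reported hit.

```python
def roc_n(ranked_labels, total_true, n=5):
    """ROCn score for one query.

    ranked_labels: booleans for the hit list, best-scoring hit first
                   (True = true positive, False = false positive).
    total_true:    number of true positives that exist for this query.
    n:             number of false positives at which the curve is cut off.

    Returns the area under the ROC curve up to the n-th false positive
    divided by the area under the ideal curve (n * total_true).
    """
    if total_true <= 0 or n <= 0:
        raise ValueError("total_true and n must be positive")
    true_seen = 0
    false_seen = 0
    area = 0
    for is_true in ranked_labels:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen  # true positives ranked above this false positive
            if false_seen == n:
                break
    # Assumption: false positives missing from a short hit list rank after all hits.
    area += (n - false_seen) * true_seen
    return area / (n * total_true)


def fraction_at_or_above(scores, threshold):
    """Fraction of queries whose ROCn score is at or above the threshold."""
    scores = list(scores)
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(score >= threshold for score in scores) / len(scores)


if __name__ == "__main__":
    hits = [True, True, False, True, False]
    print(roc_n(hits, total_true=4, n=2))
    print(fraction_at_or_above([0.2, 0.5, 0.9], 0.4))
```

Plotting `fraction_at_or_above` over a range of thresholds gives the per-query view the Reviewer asks for, in which every query contributes equally regardless of how many hits its family produces.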
3. Different protein types may show different behavior. Is there a dominant fold type contributing to the ROC curve? Could it be Rossmann-like and P-loop proteins? What is the performance in different protein classes?
We computed the number of true positive pairs in each SCOP fold as a fraction of all true positive pairs in the benchmark set. P-loop containing nucleoside triphosphate hydrolases (c.37) have the largest share of true positive pairs: 15%. The share for DNA/RNA-binding 3-helical bundle (a.4) is 11%, for NAD(P)-binding Rossmann-fold domains (c.2) 9%, and for Immunoglobulin-like beta-sandwich (b.1) 9%. Other folds have much smaller shares, and the distribution of the number of true positive pairs per fold has a long tail, with many folds having a relatively small number of pairs. As suggested by the Reviewer, we included in the manuscript a table that shows ROCn scores computed for SCOP classes, along with appropriate text in Results (the fourth paragraph of subsection Homology detection), Discussion (the second paragraph), and Methods (the last sentence of the first paragraph of Retrieval accuracy).
I am not suggesting that all these concerns be addressed in the present study; however, these points might be worth considering in future work.
As a minor problem, it might be instructive to the readers, especially biologists, to clarify what proteins hide behind the code name “superfamily c.37.1”. It is P-loop NTPases, a very special and interesting group.
We added “(P-loop NTPases)” next to the single occurrence of c.37.1 in the manuscript text.
Reviewer’s report 3
Frank Eisenhaber, Bioinformatics Institute, Singapore
The authors propose another variant of the successful BLAST suite of programs for similarity searches among protein sequences. The weak point of PSI-BLAST was the automated, simplified generation of multiple alignments, and their sometimes unsatisfactory quality was one of the main reasons why the program did not find certain homologues. Not surprisingly, competing developments such as CS-BLAST attempted to improve the alignment construction by using specially created libraries. The idea of relying on the existing large collection of manually curated alignments provided by CDD is a nice workaround and certainly worth pursuing.
The authors provide an exhaustive assessment of the accuracy/sensitivity of their tool and it looks quite convincing that the large alignment library indeed boosts the likelihood of finding homologues.
Quality of written English: Acceptable
We thank the Reviewer for these comments.