PSI-BLAST achieves a remarkable compromise between speed and quality. Ideally, an alignment method should accurately identify related sequences in today's rapidly growing databases within the shortest possible time. While we want to simultaneously optimize speed and reliability, in practice there is a tradeoff: very accurate alignment methods are relatively slow (e.g. profile–profile alignment algorithms), while very fast methods are far less sensitive than we might wish (e.g. BLAST; Altschul et al.). PSI-BLAST (Altschul et al.) strikes an excellent compromise between speed and sensitivity.
Consensus sequences improve PSI-BLAST performance. Consensus sequences were used early on to improve alignments (Patthy, 1987). The initial approaches mimicked profile-sequence alignments (Henikoff and Henikoff, 1997; Sonnhammer and Kahn, 1994). Many improvements followed (Finn et al.; Kahsay et al.; Letunic et al.; Marchler-Bauer et al.; Merkeev and Mironov, 2006; Schaffer et al.; Schultz et al.; Servant et al.; Thelen et al.). However, none of those methods approached the success of PSI-BLAST. We have recently proposed a simple add-on to PSI-BLAST that substantially improves its performance (Przybylski and Rost, 2007). The add-on did not require any code change in PSI-BLAST. It consisted of adding a final step of ‘freezing’ the profile after the standard, iterative search against native sequences and then using it to search a database with the native sequences replaced by their consensus counterparts. This simple add-on improves the performance throughout the entire sensitivity curve. However, it is not clear how the underlying residue composition of database sequences affects the statistics of alignment scores. This is an important issue because users rely on the estimates of statistical significance to judge retrieved alignments. In addition, incorrect scoring might invalidate iterative searches against consensus sequences; a single false alignment in one of the intermediate searches might pollute a scoring profile and thereby all subsequent searches.
This study was motivated by the following three assumptions: (1) For a given residue substitution scoring matrix, the statistical significance of alignment scores depends on the residue compositions of the aligned sequences. Assume that a particular scoring matrix highly rewards the alignment of tryptophan. This implies that sequences rich in tryptophan will likely generate higher alignment scores than those with average tryptophan content. (2) In general, the composition of consensus sequences differs from that of native sequences. Therefore, the distribution of alignment scores is likely to differ between consensus and native sequences, at least when using the same scoring matrix for both [such as BLOSUM62 (Henikoff and Henikoff, 1992) or the corresponding position-specific scoring matrices]. (3) PSI-BLAST is very popular, well-maintained, and has a great impact on the community of scientists that use sequence alignments. Therefore, it is desirable to improve PSI-BLAST performance without changing its alignment parameters (including scoring matrices and gap scores), with which the community is already familiar. To accomplish this, we asked the following questions: how much do the parameters of the alignment score distribution change for various types of consensus sequences? Can PSI-BLAST compensate for compositional variations through its internal composition-based adjustments (Schaffer et al.)? Or can we build consensus sequences in a way that keeps the statistical significance reported by PSI-BLAST valid? Finally, can we apply PSI-BLAST to iteratively search consensus sequence databases?
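Assumption (1) can be made concrete with the Karlin-Altschul scale parameter λ for ungapped local alignments, which is the positive solution of Σ p_a q_b exp(λ s_ab) = 1, where p and q are the residue frequencies of the two sequences and s_ab are the substitution scores. The expected number of chance hits with score at least S falls off as exp(−λS), so a smaller λ means that a given raw score is less significant. The sketch below is a toy illustration, not the statistics actually implemented in PSI-BLAST: it assumes a two-letter alphabet (tryptophan `W` versus all other residues lumped into `x`), hypothetical scores (only the W-W value of 11 mirrors BLOSUM62) and illustrative frequencies.

```python
import math


def expected_score(p, q, s):
    """Expected score of one aligned residue pair under compositions p and q."""
    return sum(p[a] * q[b] * s[a][b] for a in p for b in q)


def ungapped_lambda(p, q, s, tol=1e-12):
    """Positive root of sum_ab p_a q_b exp(lambda * s_ab) = 1, by bisection."""
    if expected_score(p, q, s) >= 0:
        raise ValueError("the expected pair score must be negative")
    if not any(p[a] * q[b] > 0 and s[a][b] > 0 for a in p for b in q):
        raise ValueError("at least one positive score must be possible")

    def f(lam):
        total = sum(p[a] * q[b] * math.exp(lam * s[a][b]) for a in p for b in q)
        return total - 1.0

    # f(0) = 0 and f is convex with negative slope at 0, so f is negative
    # between 0 and the root and positive beyond it.
    lo, hi = 0.0, 1.0
    while f(hi) <= 0:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


# Hypothetical toy scoring matrix over the reduced alphabet {W, x}.
TOY_SCORES = {"W": {"W": 11, "x": -3}, "x": {"W": -3, "x": -1}}

# Illustrative compositions: the query and a native-like database share the
# same tryptophan frequency; the second database is tryptophan-rich.
query = {"W": 0.013, "x": 0.987}
native_like_db = {"W": 0.013, "x": 0.987}
w_rich_db = {"W": 0.05, "x": 0.95}

lam_native = ungapped_lambda(query, native_like_db, TOY_SCORES)
lam_rich = ungapped_lambda(query, w_rich_db, TOY_SCORES)
print(f"lambda, native-like database:     {lam_native:.3f}")
print(f"lambda, tryptophan-rich database: {lam_rich:.3f}")
```

In this toy setting the tryptophan-rich database yields the smaller λ, so the same raw score would be reached by chance more often. That is the kind of composition effect that could bias significance estimates if consensus sequences are scored with parameters derived for native sequences.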