Local alignments are an essential tool for biologists and often provide the first information about the function of an unknown nucleotide or protein sequence. An important question concerns the relationship of the score of a local alignment with the probability that the alignment occurred by chance.
Karlin and Altschul [1] developed an asymptotic theory for local alignments, assuming that no gaps are permitted. For two random sequences I and J of lengths m and n, respectively, the resulting distribution of the optimal alignment score S approximates a Gumbel distribution

$$P(S > y) \approx 1 - \exp\left(-k\,m\,n\,e^{-\lambda y}\right) \qquad (1)$$

The two statistical parameters in Equation (1) are λ, the scale parameter, and k, the pre-factor.
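As a minimal sketch, assuming the usual form of the Gumbel approximation, P(S > y) ≈ 1 − exp(−k m n e^{−λy}), the tail probability can be evaluated as follows. The function name and the numerical values of λ, k, m, n, and y are illustrative assumptions and are not taken from the text.

```python
import math

def gumbel_pvalue(y, m, n, lam, k):
    """Approximate P(S > y) as 1 - exp(-k*m*n*exp(-lam*y))."""
    # Expected number of chance alignments scoring above y.
    expected_hits = k * m * n * math.exp(-lam * y)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-expected_hits)

# Hypothetical inputs: query length 250, database length 1e6, score 60.
p = gumbel_pvalue(y=60, m=250, n=1_000_000, lam=0.267, k=0.041)
print(f"P(S > 60) ~ {p:.3e}")
```

Because the sequence lengths enter only through the product mn, any correction to the lengths acts directly on the exponent of this expression.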
Subsequent work extended this framework to local alignments with gaps and showed that the Gumbel distribution from Equation (1) is still valid, though different values for λ and k are required. The need for a “finite-size correction” to the lengths m and n, to improve the accuracy of Equation (1), has also been discussed. The resulting statistics are an integral part of the Basic Local Alignment Search Tool (BLAST).
The following presentation emphasizes intuition over mathematical formality, to explain how the finite-size correction can account for the finite sequence lengths m and n to improve the accuracy of Equation (1). Let us begin with an optimal local alignment, which starts from score 0 and requires a non-zero sequence length within both I and J before it achieves score y (Figure 1). Let $L_I(y)$ ($L_J(y)$) be the required random length within I (J), and let $l_I(y)$ ($l_J(y)$) be the corresponding means. The main idea is that the optimal local alignment cannot start anywhere along the full length m (n) of sequence I (J), because there might be insufficient sequence to permit it to achieve the score y. The finite-size correction described previously and used in BLAST therefore replaced the area mn of the alignment matrix for Equation (1) by

$$[m - l_I(y)]\,[n - l_J(y)] \qquad (2)$$

Figure 1. Sequence alignment graph of two random sequences I and J of lengths m and n, respectively. The black circles are the initiation vertices of local alignment paths just remaining within the large rectangle of the sequence alignment graph before achieving ...
Equation (2) approximates the area within the alignment matrix where the optimal local alignment can start and on average still have enough space to exceed the score y. If $m < l_I(y)$ or $n < l_J(y)$, however, the resulting value in Equation (2) might become negative. The BLAST code for the old finite-size correction therefore set the corrected sequence length to an ad hoc value (typically 1). For very short query or database sequences, the ad hoc correction could underestimate the significance of an alignment by many orders of magnitude.
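The mean-based correction and its fallback can be sketched as below, assuming the corrected area has the form [m − l_I(y)][n − l_J(y)] and that a corrected length is replaced by the ad hoc value whenever it drops below that value. The function name, the exact threshold, and the example numbers are assumptions for illustration; how BLAST obtains the mean lengths is not described here, so they are passed in as plain inputs.

```python
def old_corrected_area(m, n, mean_len_i, mean_len_j, floor=1):
    """Mean-based corrected area (m - l_I(y)) * (n - l_J(y)).

    A corrected length below `floor` is replaced by `floor`, mimicking
    the ad hoc fallback (the threshold itself is an assumption).
    """
    eff_m = max(m - mean_len_i, floor)
    eff_n = max(n - mean_len_j, floor)
    return eff_m * eff_n

# Hypothetical case: a 30-residue query whose mean required length is 40.
print(old_corrected_area(30, 1_000_000, 40, 40))  # query length clamped to 1
print(30 * 1_000_000)                             # uncorrected area m*n
```

For a short query the clamp, rather than the sequence itself, determines the corrected area, which is why the result can differ from the uncorrected area by orders of magnitude.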
The purpose of this note is to present a new finite-size correction formula for the BLAST statistics. It avoids the ad hoc correction and improves on the old formula by considering the (approximately normal) distributions of the random lengths $L_I(y)$ and $L_J(y)$ explicitly, and not just the corresponding means $l_I(y)$ and $l_J(y)$. We demonstrate below that the new finite-size correction is better than the older one, both in theory and in practice. All BLAST+ protein-protein applications (i.e., BLASTP, BLASTX) use the new finite-size correction by default, starting with version 2.2.26.
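To illustrate why a distribution behaves better than a mean, the sketch below averages the positive part max(m − L, 0) over a normal distribution for the required length L. This is a generic one-dimensional illustration under an assumed normal model with a hypothetical standard deviation; it is not the correction formula of this note, and the function name is invented for the example.

```python
import math

def expected_positive_part(m, mean_len, sd_len):
    """E[max(m - L, 0)] for L ~ Normal(mean_len, sd_len**2), sd_len > 0."""
    z = (m - mean_len) / sd_len
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))       # standard normal CDF at z
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density at z
    return (m - mean_len) * cdf + sd_len * pdf

# Hypothetical case: query of length 30, mean required length 40, s.d. 10.
print(30 - 40)                             # mean-based corrected length is negative
print(expected_positive_part(30, 40, 10))  # distribution-based value stays positive
```

The averaged quantity is always positive and approaches m minus the mean when m is much larger than the required length, so no clamp is needed.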