Proteins with low-complexity sequences are common and functionally important but are not well aligned by existing procedures. These proteins are rich in a few amino acids and thus have an overall composition significantly different from the “average” compositions seen in the multiple alignments used to construct the BLOSUM alignment scoring matrices and for the BLAST statistical analyses (16). About 10% of known protein sequences have overall low complexity; eukaryotic genomes and some bacterial pathogens contain even higher percentages of low-complexity sequences (24). The NCBI nonredundant database currently contains approximately 3.2 million sequences. Thus, there are about 320,000 low-complexity sequences that cannot be accurately aligned and therefore cannot be compared on any large scale, either functionally or evolutionarily. In addition, there are low-complexity segments in half of all proteins (32). These segments also cannot be reliably aligned and so are currently “masked” by SEG or similar procedures and then ignored by the alignment tools (29). In globular proteins, low-complexity sequences tend to occur as loops within and between globular domains (19), regions often important for protein function. Recent papers have highlighted the need to solve this problem, and a logical solution is the modification of scoring matrices to compensate for the composition of the query sequence (6).
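The masking step mentioned above can be illustrated with a minimal sketch. This is not the SEG algorithm itself; it is a simplified sliding-window filter based on Shannon entropy, and the function names, the window length of 12, and the threshold of 2.2 bits are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (bits per residue) of the residue composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_mask(seq, window=12, threshold=2.2):
    """Return one boolean per residue: True if it lies in any low-entropy window.

    window and threshold are illustrative assumptions, not values taken from the text.
    """
    mask = [False] * len(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) <= threshold:
            for i in range(start, start + window):
                mask[i] = True
    return mask


def apply_mask(seq, mask, mask_char="X"):
    """Replace masked residues with mask_char so an aligner would ignore them."""
    return "".join(mask_char if m else r for r, m in zip(seq, mask))


# A Ser/Thr-only stretch has an entropy of at most 1 bit and is masked entirely.
st_segment = "STSTSSTTSTST"
print(apply_mask(st_segment, low_complexity_mask(st_segment)))  # XXXXXXXXXXXX
```

A filter of this kind removes the biased segment from consideration altogether, which is why the masked regions contribute nothing to subsequent comparisons.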
Fungal cell wall proteins are representative of low-complexity sequences; they average 35% Ser and Thr residues, with some 100-residue segments composed almost exclusively of these two amino acids (11). As a result, wall proteins are normally aligned only after SEG filtering to remove the low-complexity segments, so sequence comparisons cannot be made for the low-complexity regions. If there were rapid search and alignment protocols that could compare such compositionally biased segments, then both evolutionary and structural comparisons could be attempted.
The major alignment problem for low-complexity sequences is called low-complexity corruption (31). Intuitively, low-complexity corruption results from the alignment of high-frequency residues. In fungal cell wall proteins, the problem is most egregious for Ser, Thr, Pro, Ala, and Val. This phenomenon gives high alignment scores and low e values to nonhomologous pairs of protein segments (high-scoring pairs [HSPs]). For example, alignments of Ser with Ser and Thr with Thr in cell wall proteins give alignment scores of +4 and +5, respectively, in BLOSUM62, the standard scoring matrix. Because the residue alignment scores are summed over the segments being aligned, the many pairs of aligned Ser and Thr residues will give a high summed total alignment score, even if the frequently occurring amino acids are randomly distributed in the sequences. Indeed, in searches using low-complexity proteins as the query sequence, there are enough abnormally high-scoring pairs that the distribution of all scores is skewed by the overrepresentation of high scores (Fig. ). The skew means that the score distribution deviates from the expected extreme value distribution, and e values calculated from the scores are invalid because the underlying distribution is different. For low-complexity sequences, this combination of anomalous high scores and small e values appears with any search and alignment tool that uses BLOSUM matrices, including BLAST, FASTA, and the initial alignments in PSI-BLAST. Thus, if the alignment scores for frequently occurring amino acids were reduced appropriately, alignments of these residues would not artificially inflate the scores to generate HSPs from sequences with similar amino acid compositions but dissimilar sequences.
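The score inflation described above can be made concrete with a small calculation. The sketch below assumes a hypothetical segment composed only of Ser and Thr in equal proportion and uses three BLOSUM62 entries: the +4 and +5 diagonal scores cited in the text, plus the Ser/Thr off-diagonal score of +1. The function names and the 50:50 composition are illustrative assumptions.

```python
# BLOSUM62 entries for Ser and Thr only: S-S = +4 and T-T = +5 (as cited in the text),
# and S-T = +1 (the corresponding off-diagonal BLOSUM62 entry).
BLOSUM62_ST = {("S", "S"): 4, ("T", "T"): 5, ("S", "T"): 1, ("T", "S"): 1}


def ungapped_score(a, b, matrix=BLOSUM62_ST):
    """Sum the substitution scores over an ungapped alignment of two equal-length segments."""
    if len(a) != len(b):
        raise ValueError("segments must have equal length")
    return sum(matrix[(x, y)] for x, y in zip(a, b))


def expected_pair_score(freqs, matrix=BLOSUM62_ST):
    """Expected score of one aligned pair when both residues are drawn independently from freqs."""
    return sum(freqs[x] * freqs[y] * matrix[(x, y)] for x in freqs for y in freqs)


# Hypothetical composition: Ser and Thr in equal proportion, randomly ordered.
st_only = {"S": 0.5, "T": 0.5}
per_pair = expected_pair_score(st_only)
print(per_pair)        # 2.75
print(100 * per_pair)  # 275.0, the expected score of a random 100-residue ungapped alignment
```

Under these assumptions, two unrelated Ser/Thr segments gain score on average with every aligned position. Local alignment statistics of the kind used by BLAST assume that the expected score per aligned pair is negative, so a positive expectation of this size is one way to see why the resulting e values cannot be trusted.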
FIG. 1. Alignments for best-scoring HSPs (e ≤ 10⁻³) for Aga1p, Muc1p, and Fig2p. The first 50 aligned residues in each alignment are shown, and identities are shown between the query sequence and the similar sequence. Residues S and T in boldface ...
Matrices other than BLOSUM have been shown to be more appropriate for sequences of nonaverage composition. For example, to make discriminatory matrices and predict hydrophobic and transmembrane segments in proteins, the specialized matrices PHAT and SLIM use the background frequencies present in transmembrane alignments instead of standard amino acid frequencies (23). Similarly, position-specific scoring matrices (PSSMs) are used to predict coiled-coil structures and in all iterative searches after the first in PSI-BLAST (5). The effectiveness of these specialized matrices on their intended targets attests to the fact that adjustment of matrices to account for amino acid composition in the query and target sequences can be highly discriminating and sensitive.
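The dependence of a substitution score on background composition can be sketched with the standard log-odds form, in which the score for a residue pair is proportional to the logarithm of the observed pair frequency divided by the product of the two background frequencies. The example below is a toy illustration of that relationship only, not the procedure used by PHAT, SLIM, or PSI-BLAST; the function name and all frequencies are hypothetical, and the pair frequency is held fixed for simplicity, whereas a real adjustment would re-derive it consistently with the new background.

```python
import math


def log_odds_score(q_ij, p_i, p_j, scale=2.0):
    """Log-odds score: scale * log2(q_ij / (p_i * p_j)).

    q_ij is the frequency of the aligned pair in trusted alignments, and
    p_i, p_j are background residue frequencies. scale=2 gives half-bit
    units, the unit in which BLOSUM62 is commonly reported.
    """
    return scale * math.log2(q_ij / (p_i * p_j))


# All numbers below are hypothetical and chosen only to make the effect visible.
q_ss = 0.0144                                # assumed Ser-Ser pair frequency
standard = log_odds_score(q_ss, 0.06, 0.06)  # assumed "average" Ser background of 6%
ser_rich = log_odds_score(q_ss, 0.18, 0.18)  # assumed Ser-rich background of 18%
print(round(standard, 2), round(ser_rich, 2))  # 4.0 -2.34
```

With the higher background frequency, a Ser-Ser pair is less surprising, and the same pair frequency yields a lower score. This is the sense in which a matrix built from the composition of the sequences being compared can stop frequent residues from dominating the alignment score.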