|Home | About | Journals | Submit | Contact Us | Français|
Yeast glycoproteins are representative of low-complexity sequences, those sequences rich in a few types of amino acids. Low-complexity protein sequences comprise more than 10% of the proteome but are poorly aligned by existing methods. Under default conditions, BLAST and FASTA use the scoring matrix BLOSUM62, which is optimized for sequences with diverse amino acid compositions. Because low-complexity sequences are rich in a few amino acids, these tools tend to align the most common residues in nonhomologous positions, thereby generating anomalously high scores, deviations from the expected extreme value distribution, and small e values. This anomalous scoring prevents BLOSUM62-based BLAST and FASTA from identifying correct homologs for proteins with low-complexity sequences, including Saccharomyces cerevisiae wall proteins. We have devised and empirically tested scoring matrices that compensate for the overrepresentation of some amino acids in any query sequence in different ways. These matrices were tested for sensitivity in finding true homologs, discrimination against nonhomologous and random sequences, conformance to the extreme value distribution, and accuracy of e values. Of the tested matrices, the two best matrices (called E and gtQ) gave reliable alignments in BLAST and FASTA searches, identified a consistent set of paralogs of the yeast cell wall test set proteins, and improved the consistency of secondary structure predictions for cell wall proteins.
The ability to accurately align protein sequences is central to inferences about the evolutionary history of genes and therefore to the evolution of organelles and organisms as well. In addition, homology modeling and even functional inference through the annotation of similar sequences depends on alignment accuracy. For low-complexity sequences such as fungal cell wall proteins, errors caused by anomalous high scores for nonhomologous sequences will inevitably lead to erroneous inferences for evolution, structure, and function.
Proteins with low-complexity sequences are common and functionally important but are not well aligned by existing procedures. These proteins are rich in a few amino acids and thus have overall composition significantly different from the “average” compositions seen in the multiple alignments used to construct the BLOSUM alignment scoring matrices and for the BLAST statistical analyses (16). About 10% of known protein sequences have overall low complexity; eukaryotic genomes and some bacterial pathogens contain even higher percentages of low-complexity sequences (24, 32). The NCBI nonredundant database currently contains approximately 3.2 million sequences. Thus, there are about 320,000 low-complexity sequences that cannot be accurately compared or aligned and therefore cannot be compared on any large scale, either functionally or evolutionarily. In addition, there are low-complexity segments in half of all proteins (32). These segments also cannot be reliably aligned and so are currently “masked” by SEG or similar procedures and then ignored by the alignment tools (29, 34). In globular proteins, low-complexity sequences tend to occur as loops within and between globular domains (19, 21), regions often important for protein function. Recent papers have highlighted the need to solve this problem, and a logical solution is the modification of scoring matrices to compensate for the composition of the query sequence (6, 36, 37).
Fungal cell wall proteins are representative of low-complexity sequences; they average 35% Ser and Thr residues, with some 100-residue segments composed almost exclusively of these two amino acids (11, 20, 28). As a result, wall proteins are normally aligned only after SEG filtering to remove the low-complexity segments, so sequence comparisons cannot be made for the low-complexity regions. If there were rapid search and alignment protocols that could compare such compositionally biased segments, then both evolutionary and structural comparisons could be attempted.
The major alignment problem for low-complexity sequences is called low-complexity corruption (31). Intuitively, low-complexity corruption results from the alignment of high-frequency residues. In fungal cell wall proteins, the problem is most egregious for Ser, Thr, Pro, Ala, and Val. This phenomenon gives high alignment scores and low e values to nonhomologous pairs of protein segments (high-scoring pairs [HSPs]). For example, alignments of Ser with Ser and Thr with Thr in cell wall proteins give alignment scores of +4 and +5, respectively, in BLOSUM62, the standard scoring matrix. Because the residue alignment scores are summed over the segments being aligned, the many pairs of aligned Ser and Thr residues will give a high summed total alignment score, even if the frequently occurring amino acids are randomly distributed in the sequences. Indeed, in searches using low-complexity proteins as the query sequence, there are enough abnormally high-scoring pairs that the distribution of all scores is skewed by the overrepresentation of high scores (Fig. (Fig.1B).1B). The skew means that the score distribution deviates from the expected extreme value distribution, and e values calculated from the scores are invalid because the underlying distribution is different. For low-complexity sequences, this combination of anomalous high scores and small e values appears with any search and alignment tool that uses BLOSUM matrices, including BLAST, FASTA, and the initial alignments in PSI-BLAST. Thus, if the alignment scores for frequently occurring amino acids were reduced appropriately, alignments of these residues would not artificially inflate the scores to generate HSPs from sequences with similar amino acid compositions but dissimilar sequences.
Matrices other than BLOSUM have been shown to be more appropriate for sequences of nonaverage composition. For example, to make discriminatory matrices and predict hydrophobic and transmembrane segments in proteins, the specialized matrices PHAT and SLIM use the background frequencies present in transmembrane alignments instead of standard amino acid frequencies (23, 25). Similarly, position-specific scoring matrices (PSSM) are used to predict coiled-coil structures and in all iterative searches after the first in PSI-BLAST (5, 7, 22). The effectiveness of these specialized matrices on their intended targets attests to the fact that adjustment of matrices to account for amino acid composition in the query and target sequences can be highly discriminating and sensitive.
To improve the alignment of low-complexity sequences, we have developed and tested modifications to produce scoring matrices that are adjusted for the composition of each query. The goals are to prevent alignments of sequences that are compositionally similar but nonhomologous and to generate statistically significant, homology-driven alignments of low-complexity segments necessary for structural and evolutionary studies of the low-complexity portion of the proteome.
Each matrix modification method was evaluated based on the following criteria, as summarized in Table Table1:1: sensitivity (the ability to find a high number of homologs) for both low-complexity and high-complexity query sequences, discrimination against randomized sequences and nonhomologous proteins with similar amino acid compositions, conformance with the expected extreme value distribution of alignment scores that should be generated during the search, accuracy of derived e values, and computational efficiency. The results demonstrated that two of the composition-based matrices are powerful adaptations for BLAST and FASTA searches and alignments for low-complexity Saccharomyces cerevisiae glycoprotein sequences.
We searched for sequences similar to each of 10 yeast cell wall proteins. These proteins have low-complexity regions that constitute 40 to 100% of the open reading frame (ORF) length. For each query sequence, the amino acid frequencies were determined, and the scoring matrix was altered by the rules described below. The modified scoring matrices (Table (Table2)2) were rescaled to the same κ and λ statistical parameters as the standard scoring matrix (BLOSUM62) so that the reported e values were distributed similarly to those from BLOSUM62-based searches of high-complexity sequences (see Table S1 in the supplemental material). A similar rescaling strategy is used in PSI-BLAST (5). (Note that additional mathematical definitions and relationships are described in the supplemental material.) The query sequence was then used as the query in BLAST or FASTA. HSPs were ranked by e values.
FASTA calculates e values for each search by comparison to scores generated by randomized query sequences. Therefore, the e values reported for FASTA searches are appropriate for each query sequence and scoring matrix.
One way to change scoring matrices is to adjust each scoring element, Sij, to compensate for the probability of a match at random. This approach keeps the target frequencies, Qij, equal to the standard target frequencies, in the hope that this will reduce random alignments of frequently appearing amino acids. Each new matrix element, S*ij, can be calculated as follows:
where Pi is the probability of the occurrence of an individual amino acid, i, and P*i is the probability of amino acid i in the query sequence, and the new score is calculated from Sij and Pj. Pj and Qij are taken to be unchanged, so one compensates for the low complexity in the query but not in the database sequence. λ predicts the width of the extreme value score distribution. In essence, each score, S*ij, is reduced or raised to compensate for the degree to which the frequency for i in the query sequence differs from the frequency for i in the standard ratios used in BLOSUM62 (16, 17). The new matrix will have the same target frequencies in the context of the amino acid composition of the query sequence that the original matrix had in the context of standard amino acid composition. Because target frequencies, Qij, are kept constant, equation 1 guarantees that the λ of the matrix in the context of the amino acid composition of the query sequence should not change. BLAST, however, requires that the matrix entries be integers, so λ does change after rounding of the score. For each search, λ* can be set to the λ of the original matrix by multiplying each score by the ratio of the λ* of the unscaled matrix to the λ of original matrix, as described previously (31). We call this matrix modification Q, for target frequency.
The problem of complexity corruption can be thought of in another manner. The expected score, E, of a given matrix is as follows:
The BLAST statistical model requires that value E in equation 2 be negative (18). If the probability of amino acid i in the query sequence is larger than the standard probability for i used in the database or score distribution simulation, the expected score for the ij pair will unduly contribute to the total score of alignments and will select for randomly aligned segments that have amino acid compositions similar to the query. Once again, we can adjust the score of the matrix to compensate for the fluctuation in the amino acid composition of a query from the standard amino acid composition and yet retain the intrinsic property (i.e., expected score, E) of the matrix in the context of the query's amino acid composition as follows:
We call this matrix modification E, for expected score. This modification significantly changes the value of λ, which is then reset according to equation 1.
Fig. S1 in the supplemental material shows the impact of matrix modifications on positive and negative scores relative to the ratio between the probability that amino acid i occurs in the query and the standard Robinson and Robinson probability for that amino acid (3). These matrix modifications decrease positive scores for frequent amino acid pairs but increase negative ones. Matrix modification E increases the negative scores for frequent pairs, but such a negative score never becomes positive nor does a positive score ever become negative. In contrast, Q modifications can convert a negative score into a positive one or vice versa. As a result, the two types of matrix modifications produced distinct total scores and alignments.
A “greater-than” (gt) matrix modification was also implemented. Under this modification, scores are reduced only if a residue is more frequent in the query sequence than in “standard” frequencies calculated according to Robinson and Robinson frequencies (i.e., P*i/Pi > 1) (3). When applied to matrix modification E or to matrix modification Q, this produces scoring matrices gtE and gtQ, respectively.
PSI-BLAST uses BLAST-PGP with a 32-fold scale-up of BLOSUM62 to enhance sensitivity during the first round of comparisons. We have used the same scaling factor to augment the BLOSUM62 matrix before adjusting for amino acid composition deviation. This generates the gtE32 and gtQ32 matrices; gap costs are also scaled up (Table (Table2).2). The gtE32 and gtQ32 matrix modifications were implemented for FASTA only.
As a test of the effects of composition-based matrix modifications, we carried out searches on two sets of proteins. The first was a test to find homologs of low-complexity yeast cell wall proteins in a combined database of the yeast proteome and three complete sets of randomized yeast ORF pseudosequences. Randomizations of the sequences were global (the entire sequence randomized for each ORF) or local. For local randomizations, the sequence was randomized within contiguous windows of 12 residues. This window length corresponds to that of the SEG filter and maintains the local entropy of the sequences. For searches with cell wall proteins as queries, HSPs with authentic yeast ORFs were counted as “true” hits, and HSPs with randomized sequences were counted as “false.” This designation favors nondiscriminating matrices such as BLOSUM62, because some nonhomologous sequences were counted as “true” hits for the tests shown in Fig. Fig.2.2. Inspection of alignments (Fig. (Fig.1)1) and comparison of annotations (see Table Table4)4) showed that these nonhomologous “true” hits were not reported in searches with gtQ and E matrices. The other search set was the Aravind data set, which contains 103 domain-specific query sequences and a total of 1,005 true positives in the yeast proteome, curated as described previously by Schaffer et al. (31). We used those definitions of “true” and “false” hits.
To test the sensitivity and selectivity of pairwise search algorithms for high-complexity sequences with the modified scoring matrices, stand-alone versions of BLAST (version 2.2.2) and FASTA (version 3) were used. As recommended previously (2, 16, 27), gap costs of 9 to 13 were used with BLAST, lower costs, 5 to 9, were used for FASTA searches, and the gap extension cost was set at 1. Each search used one of the scoring matrices (described above) based upon the amino acid frequencies in the individual query sequence. The notations that describe each type of search with each type of matrix are summarized in Table Table22.
All BLAST searches were implemented using the command-line executable “blastall” with the BLAST-x matrices or the command-line executable “blastpgp” with the BLAST-PGP matrix and the composition-based statistics flag on “(-t T).” Both command-line executables produce gapped pairwise alignments, but BLASTPGP uses composition-based statistics to assess significance and can be used to generate PSSM from first-round hits. The PSSM is used to score the second round of searches in PSI-BLAST. To preserve comparability, blastpgp searches were relegated to one round “(-j 1).” The FASTA searches were conducted with the command line-executable “fasta34.” Command line options were default options unless specified otherwise. All matrix modification searches were PERL and BASH shell scripts executed on a Sun Microsystems SunBlade100 workstation running Debian GNU Linux. Searches were performed without SEG filtering unless specifically designated and were repeated for several different gap values.
We tested whether the similarity sets were closed for yeast cell wall proteins. These tests compared output from BLAST-PGP, FASTA-B, and the four matrix modifications that are sufficiently sensitive and discriminating to support searches with low-complexity sequences: E, gtQ, gtQ32, and gtE32. These searches used the 10 yeast cell wall proteins as query sequences to search the yeast protein database (retrieved from the NCBI). Searches were done with the gap costs shown in Fig. Fig.2.2. HSPs with e values less than the specified cutoff for distinct new proteins in each round became the query set for the next round, still against the same database. This process continued until no new, distinct proteins with e values below the specified cutoff were obtained (14, 35).
Comparisons of the transitive closure sets were performed using a Java web application and other Java codes. The WAR file for the web application is available from the authors. The glycosylphosphatidylinositol (GPI) protein set was taken from data described previously (8, 11). Using the Gene Ontology (GO) database terms “cell wall (sensu fungi)” and “cell wall organization and biogenesis,” the Gene Ontology sets were obtained from the Saccharomyces Genome Database website (http://www.yeastgenome.org/). We curated the “cell wall protein,” “non-cell wall protein,” “wall biogenesis,” and “unknown or ambiguous” classifications shown in Table Table44.
The PERL and shell scripts, customized databases, and supplemental sensitivity curves described in this paper can be obtained from the authors.
Two types of score adjustments, with several variations of each, were designed and evaluated in tests using a variety of query sequences and databases summarized in Table Table1.1. The score adjustments and queries are described below.
The E and Q matrix modification methods reduce the alignment score, Sij, for aligned residues i and j for amino acids occurring at a high frequency in a query sequence but preserve the net negative value for the matrix that is required for accurate statistical analyses of the alignments (3). Each modification method yields a different scoring matrix for each query sequence. Each modification method and its variants compensate in different ways for the deviation from the standard Robinson and Robinson frequencies used to derive the gapped BLAST statistical parameters for BLOSUM62, as summarized in Materials and Methods (3, 18). The E method keeps the expected score of the matrix constant, while the Q method keeps the target frequencies, Qij, constant, where Qij is the expected frequency that a residue, i, in one sequence is replaced by j in randomly aligned sequences (18). These frequencies are determined in a set of standard alignments using BLOSUM62.
All matrix modifications are summarized in Table Table22 and are described in detail in Materials and Methods. Throughout, we append suffixes to indicate which modifications were applied to a search method (Table (Table2).2). For example, A “BLAST-BF” search indicates that unmodified BLOSUM62 (“B”) was used with SEG filtering (“F”). FASTA-gtE32 indicates that we carried out a FASTA search with three modifications to the BLOSUM62 matrix: E, gt, and 32. We adopt the BLAST filtering criterion as a working definition for a low-complexity sequence, that is, one with Shannon entropy less than 2.2 over a window of at least 12 amino acid residues (5, 34).
The cell wall query set for most searches with low-complexity queries was a group of 10 cell wall GPI class mannoproteins (8, 11, 20): Cwp2p, Sag1p, Ssr1p, Tip1p, Sed1p, Tir1p, Flo11p, Aga1p, Flo1p, and Fig2p, with lengths of 92, 650, 238, 210, 338, 254, 1,367, 725, 1,537, and 1,609 residues, respectively (8, 11). These sequences are representative of GPI-anchored fungal cell wall proteins and include six unique genes, two members of the FLO gene family, and two members of the TIR/TIP family. These and other cell wall proteins are mosaics of high-complexity and low-complexity segments (8, 10, 11, 20).
Tests with high-complexity queries used a standard data set of 103 yeast signal transduction proteins as queries in searches of the S. cerevisiae proteome and three copies of the proteome with the ORF sequences randomized (31).
The problem of low-complexity corruption is illustrated in Fig. Fig.1.1. BLOSUM62-based BLAST or FASTA searches with yeast cell wall proteins as queries identified homologs with highly similar sequences (Fig. (Fig.1A)1A) but also returned HSPs with randomized sequences and nonhomologous proteins, even when score statistics were adjusted by PGP or when low-complexity regions were masked with SEG (Fig. 1B to D). These alignments were based on high frequencies of matched Ser and Thr residues and therefore identified many nonhomologous sequences as highly similar, a known consequence of low-complexity corruption (31, 34). In the BLAST-B search, the highest-scoring match to Muc1p was a random pseudoprotein segment derived from Dan4p. Similarly, the three highest-scoring matches to Fig2p (e < 10−62) were randomized versions of Muc1p. Like the BLAST- BLOSUM62 searches, BLAST-BF and BLAST-PGP, which uses composition-based statistical analyses with BLOSUM62, gave matches in which >80% of the identities were Ser or Thr (Fig. 1C and D). Other residues were seldom aligned. PGP also identified a large number of best hits with similar compositions but unlikely homology: among the highest-scoring matches for Aga1p was Snt1p, a histone deacetylase subunit, and for Muc1p, the third highest-scoring match was to the Sec31p subunit of the endoplasmic reticulum protein translocation pore. These proteins are unlikely to be homologous on the basis of functional analogy, cellular localization, or alignment of conserved sequence motifs. In addition, BLAST-B, PGP, and BF searches identified many randomized sequences as HSPs with an e value of <10−3.
Alignments were greatly improved after matrix scores were adjusted to reflect the composition of the query sequences. Of the matrix variants listed in Table Table2,2, the E and gtQ variations with BLAST or FASTA, as well as gtQ32 with FASTA, gave more specific alignments. (Our website, http://diverge.hunter.cuny.edu:8080/modmat, has automated, composition-based matrix modifications and search capability for any query sequence.) E matrices were highly specific; they required regions of extensive identity to achieve HSPs with significant e values. The Muc1p/Bsc1p homology (Fig. (Fig.1A)1A) was the only significant hit for any of the three query proteins illustrated in Fig. Fig.1.1. gtQ matrices showed more high-quality HSPs, a result of acquisition of significant scores over even relatively short but highly similar segments (Fig. (Fig.1E).1E). All of the significant HSPs were to proteins that are also localized to cell walls. Note that with gtQ, the best match for Aga1p was in a segment that was aligned with a randomized Muc1p pseudoprotein in the best match of the BLOSUM62-based search (Fig. (Fig.1B1B).
Thus, the alignments showed that searches with BLOSUM62 matrices were subject to low-complexity corruption, even with PGP statistics or SEG filtering. These findings were confirmed in the structural comparisons and the sensitivity and transitive closure tests described below. In contrast, gtQ matrices were highly sensitive, reaching significant e values in relatively short segments of both low-complexity and high-complexity compositions. The E matrices were highly discriminatory and identified only long HSPs with a high likelihood of homology.
Alignments are especially important in structural searches. There are few structures known for low-complexity proteins, and indeed, structures for low-complexity sequences are severely underrepresented in the Protein Databank (21). Therefore, apparent matches to nonhomologous sequences may be used mistakenly as the basis for alignment and modeling. Use of gtQ and E matrices can assure better alignments and more accurate structural predictions.
If aligned regions are homologous, they should have similar secondary structures (15). We tested the composition-modified matrices as predictors of concordant secondary structure predictions for pairs of HSPs with e values of ≤10−3. The cell wall query proteins were used to search the S. cerevisiae genome database. Each aligned sequence segment was used as the input for GOR IV, a secondary structure predictor that does not depend on BLOSUM62-based alignment to homologous sequences (13). The GOR IV secondary structure predictions of α-helix or β-sheet were compared (Table (Table3).3). The gtQ matrices gave the highest degree of concordance, over 80%, followed by E and B matrices. However, the concordance values with PGP had high variance due to the inclusion of nonhomologous HSPs (Fig. (Fig.1).1). We repeated the test for the subset of HSPs with 10−5 ≥ e ≥ 10−30, values for the alignments most likely to be relevant for such predictions. For these HSPs, E and gtQ matrices outperformed BLOSUM62-based matrices. Again, PGP searches had poor concordance and the greatest standard deviation (not shown), indicating variation in the quality of the matches, as expected in situations where HSPs include nonhomologous matches. Thus, the use of modified matrices significantly improved the reliability of secondary structure predictions.
Sensitivity curves are a standard way to illustrate the effectiveness of search strategies (31). These graphs (Fig. (Fig.2)2) illustrate sensitivity (number of homologs identified as HSPs) as horizontal displacement and discrimination (number of false hits identified as HSPs) as vertical displacement. Thus, good performance is indicated by a curve that has a long horizontal component with minimal verticality apparent only at the right-hand end of the curve. Previous work has defined false hits either as randomized sequences of composition similar to that of the true hits (3, 31) or as proteins known to be nonhomologous (31). To test the composition-modified matrices for discrimination against nonhomologous, low-complexity sequences similar to the query sequences, we searched the cell wall protein query set against the S. cerevisiae genome combined with the locally randomized pseudoprotein sequences described in Materials and Methods.
Figure Figure22 shows sensitivity plots for the cell wall protein query set against the S. cerevisiae proteome and three locally randomized copies. All tested matrix modification methods performed better than BLAST with B or BF and FASTA-B, which were unable to discriminate between authentic and randomized sequences. BLAST-PGP, which uses composition-based statistics with BLOSUM62, found 25 true hits (including the 10 query sequences themselves) at e values below that of the first false hit. Among the modified matrix searches, BLAST-E was highly discriminatory (it found very few false hits even with large e values). The gtQ matrices showed by far the best sensitivity (105 true hits with lower e values than the best-scoring false hit). Thus, FASTA-gtQ32 identified the 10 query sequences and 95 paralogs of the query proteins at e values that excluded false hits, whereas BLAST-PGP identified only 15 paralogs.
We used transitive closure as an empirical test of the usefulness of the composition-based matrix modifications. The 10 cell wall proteins were used as query sequences in BLAST and FASTA searches. Each query was used with different matrices derived from its own composition. The ORFs corresponding to all hits with e values of <10−3 were used as the query sequences in the next round of searches, again with scoring matrices derived from each specific composition. This procedure was repeated until no new HSPs were identified. If a search method discriminates between similar and nonsimilar sequences, transitive closure should terminate after a relatively small set of sequences is identified. On the other hand, low-complexity corruption or other artifacts will result in frequent identification of nonhomologous proteins with low e values. The consequences will include a larger number of search rounds to achieve closure, and the significant “hits” will potentially include much of the proteome.
As expected, BLAST-B failed to achieve closure on the low-complexity query sequences, even with a cutoff e value of ≤10−9. With a standard cutoff e value of ≤10−3, there were many new hits in each round, with a total of 863 sequences after five rounds (15% of the yeast proteome) (Fig. (Fig.33 and Table Table4).4). BLAST-BF also failed to close. The other methods achieved closure in 3 to 10 rounds (Table (Table4).4). There were 192 different ORFs identified in one or more of the searches with composition-modified matrices. Of these, 47 ORFs were identified in all searches, with 1 more ORF identified by five of the six modified matrix methods. Thus, there was a core of 48 hits that were most similar to the query sequences.
BLAST-PGP was the most sensitive method that closed, but it did not discriminate against nonhomologous sequences. The BLAST-PGP test identified 135 hits not found in any other search. Most of these extra hits were due to low-complexity corruption, similar to that seen in Fig. Fig.11 and and2.2. The alignments were rich in pairings of nonhomologous Ser and Thr, and there were multiple different alignments in the same segments of the protein pairs with the same score. Such multiple equivalent HSPs are typical of low-complexity corruption. Furthermore, the vast majority of these hits were for proteins that are unlikely to be related to cell wall proteins (Table (Table44).
We reasoned that the most likely homologs of the query sequences would be other cell wall and cell surface proteins, since their composition and domain structures are similar to each other and substantially different from those of globular proteins (20). Therefore, we functionally classified the hits identified in the transitive closure tests. The 343 ORFs identified in any modified matrix search or BLAST-PGP or FASTA-BF were labeled cell wall or not cell wall, either in accordance with the GO database or as curated by the authors. BLAST-PGP and FASTA-BF searches included many non-cell wall proteins among the significant hits (12). In contrast, searches with E and gtQ composition-modified matrices identified a highly similar set of ORFs, almost all of which were classified as cell wall proteins in either BLAST or FASTA. A complete list of hits for BLAST-PGP and composition-modified matrix searches is shown in Table S4 in the supplemental material.
To assess the effects of composition-based matrix modification on searches with high-complexity sequences, we also tested our methods in searches with globular (high-complexity) proteins as queries. The Aravind data set is a set of curated signal transduction proteins within the S. cerevisiae proteome (31). A total of 103 of these proteins were used as queries in BLAST and FASTA searches, counting the number of alignments with curated “true” and “false” homologs within the previously established criterion that the e value was ≤10−2 (Table (Table5)5) (31). As previously reported, BLAST with BLOSUM62 was the most sensitive method, returning 46% of the known homologs at this e value (31). Among the composition-modified matrices, searches with gtQ performed well, with 82 to 86% of BLOSUM62's sensitivity in BLAST and 75% sensitivity in FASTA searches. B and gtQ had similar levels of discrimination against false hits. Again, the E matrices were highly discriminatory and gave no false hits, but the searches were less sensitive. Thus, composition-modified matrices provided moderately lower sensitivity but similar (gtQ) or increased (E) discrimination in searches with sequences whose composition is near the Robinson and Robinson average.
The reliability of e values depends on the statistical distribution of the alignment scores, which must conform to the Gumbel extreme value distribution (18). We tested this conformance for BLOSUM62 and the gtQ and E modifications. Each test used a 1,000-residue segment from Flo1p, a randomized sequence with the same composition as the yeast cell wall query data set, and a random sequence of the same composition as the Robinson and Robinson high-complexity data set as queries. As in previous tests of searches and matrices (3), each query was tested for Smith-Waterman alignments (27) against four databases, each with a size of 104: high-complexity sequences randomized globally, high-complexity sequences locally randomized, low-complexity sequences randomized globally, and low-complexity sequences randomized locally. For each search, the 104 alignment scores were binned and compared to expected scores in the extreme value distribution with Pearson's χ2 test. The distributions of alignment scores generated by the composition-modified matrices, as they should be, were similar to the extreme value distribution with a P value of <0.005. However, in BLOSUM62-based searches for low-complexity sequences in both low-complexity databases, the P value was >0.03 to 0.07. Thus, BLOSUM62 conformed less well than the modified matrices to an extreme value distribution. The detailed data appear in Table S1 in the supplemental material.
The score distributions were used to estimate the statistical parameters κ and λ of the distributions as well (3). For FASTA searches, assuming conformance with the extreme value distribution, κ and λ are calculated and e values are derived from the distribution for each search (26). In contrast, standard BLAST assumes values for these parameters that were derived from empirical estimates in gapped searches of high-complexity sequences. It is noteworthy that for the BLOSUM62-based searches of cell wall queries against the randomized cell wall pseudosequences, the value of κ was as much as 106 times greater than the standard value of 0.0243. This difference is probably the major source of the inaccuracy of e values and subsequent low-complexity corruption in low-complexity searches using BLOSUM62. In contrast, the composition-modified matrices generated score distributions with κ values that differed from the standard by less than fourfold. The λ values were all close to the BLAST-assumed value of 0.24, again with the exceptions of the BLOSUM62-based cell wall searches against the low-complexity and low-complexity pseudosequence databases (see Table S1 in the supplemental material).
Another test for conformance is probability plots of the inverse Poisson distribution P values for alignment scores. Although such plots are often used to compare scores for two samplings of a population, they can also be used to illustrate the number of scores at each probability in two distributions (9). The plots in Fig. S2 show the cumulative fraction of scores above given index scores for comparisons of the E and gtQ matrices compared to the distribution in the BLAST-PGP search of the high-complexity query and database (31). The plots are linear, as expected for comparable score distributions.
In an extreme value distribution, the mean best e value of false hits should be 1 (26). We therefore calculated this quantity for each matrix modification in both BLAST and FASTA searches using high- and low-complexity queries. In searches with high-complexity queries, all matrices had mean first false hit scores between 0.41 and 11.7 (see Table S2 in the supplemental material). Again, E matrices were the most discriminatory and had the largest e values for false hits. In contrast, in searches with low-complexity queries, the composition-modified matrices far outperformed BLOSUM62. For BLOSUM62, even with SEG filtering or composition-modified statistics, the mean e values for the first false hits were between 10−3 and 10−46. Furthermore, the best-scoring false hit in a BLOSUM62-based search had an e value of 10−110. In contrast, the modified matrices generated mean e values of between 10−2 and 102. Thus, in high-complexity searches, the E and gtQ modifications produced e values close to 1 for the first false hit, as expected. For low-complexity sequences, the E and gtQ modifications produced e values much closer to the expected value of 1 than in searches with BLOSUM62.
In BLAST, the major computational burden is the time needed to extend the two- to four-letter words from the query sequence that find similarity to sequences in the database (4, 5). We therefore measured the computation times in BLAST and FASTA. BLAST-E and BLAST-gtQ ran faster than BLAST-B and BLAST-PGP for low-complexity sequences for both the S. cerevisiae genome database and the database that consisted of the genome with randomized sequences (Table (Table6).6). The maximum difference was about a 25-fold speed-up for the BLAST-E search with low-complexity queries. For high-complexity sequences, E matrices were slightly more efficient and gtQ matrices were 40% slower than standard BLAST methods. In contrast, composition-based matrix modifications had little effect on the scan times for searches by FASTA (data not shown).
There is an acute need for bioinformatic tools that align and compare low-complexity sequences. Most available programs merely identify or mask such segments (1, 19, 33, 34). We have shown that strategies that base alignment scores on the frequency of specific amino acids in the query sequence greatly improve the reliability and usefulness of BLAST and FASTA searches for low-complexity query sequences. These E and gtQ matrix modification methods decreased the scores for common residues and were highly discriminatory against nonhomologous sequences. The searches using these matrices identified a consistent set of paralogs of known yeast wall proteins (Table (Table4).4). These proteins share homologous sequence regions and motifs that have not been identified in BLOSUM62-based searches (J. Coronado et al., unpublished data). Searches with composition-modified matrices also improved structural concordance in aligned sequences (Table (Table33).
The modified matrices yielded alignment scores in BLAST and FASTA that conformed to the extreme value distribution (see Table S1 and Fig. S2 in the supplemental material) and generated e values more accurately than BLOSUM62-based searches (see Table S2 in the supplemental material) for low-complexity sequences. In searches with high-complexity queries, the distributions also conformed to the expected extreme value distribution, but the increased discriminatory power of the modified matrices decreased sensitivity somewhat (Table (Table5).5). This finding is consistent with a previous report that BLOSUM62 is the most sensitive matrix for searches with high-complexity sequences (17).
The transitive closure tests demonstrated that searches with E or gtQ modified matrices reliably identified apparent homologs of cell wall query sequences (Table (Table4,4, GO annotation and manually curated sets). In contrast, BLOSUM62-based searches with standard statistics did not close and hit a large fraction of the yeast proteome. The transitive closure test closed with BLAST-PGP, but the majority of the hits with e values of <10−3 were not cell wall-related proteins (Table (Table4).4). Indeed, inspection revealed that most of them were low-complexity sequences in mobile elements or RNA-processing enzymes.
The BLAST and FASTA transitive closure tests with the three best-performing composition-based matrices (BLAST with E or gtQ and FASTA with E, gtQ, or gtQ32) identified 61 apparent homologs of the yeast cell wall proteins with alignment e values smaller than 10−3. Of those apparent homologs, 48 were retrieved by all five of these modified-matrix searches; FASTA-E retrieved only these 48 ORFs. One additional ORF, Ylr110c, was retrieved by the four other modified-matrix searches. Nine more ORFs were identified by BLAST-gtQ, FASTA-gtQ, and FASTA-gtQ32. Based on inspection of the significant alignments and resistance to low-complexity corruption, the E and gtQ modifications used in BLAST, or used with high gap costs in FASTA, define a consistent set of potentially homologous low-complexity proteins efficiently and accurately (Table (Table4;4; see Table S4 in the supplemental material).
Matrices modified for composition of both query and target sequences might further increase sensitivity but at the cost of calculating a new matrix for each HSP. An analysis of reciprocal hits in the transitive closure test shows that query-based modifications were sufficient to find all known paralogous pairs (see the supplemental material).
In a different approach, Yu and colleagues (6, 36, 37) previously proposed composition-based modifications of BLOSUM scoring matrices to do alignments of low-complexity sequences without SEG filtering. The scoring matrices described previously (37) are corrected by keeping the total entropy of each matrix constant, a strategy to maximize sensitivity for queries of unusual composition. Thus, these modifications would apply to a different aspect of the low-complexity search and alignment problem. The consequences of such matrices on a large scale have not yet been published.
Disordered regions of proteins often include low-complexity sequences. DISORDER, a scoring matrix specific for disordered regions of structurally well-characterized proteins, improves scores for homologous protein pairs with 40 to 50% identity (30). The discrimination ability is similar to that of BLOSUM62, and the increase in sensitivity appears to be twofold. In contrast, the E and gtQ matrices increased discrimination for any query sequence, and gtQ showed a greater sensitivity. The result was better agreement in predicted secondary structures of the aligned segments.
We have presented several ways to normalize the alignment scores and statistical parameters for individual query sequences (Table (Table2).2). Of these, the E and gtQ modifications support sensitive, discriminating, and accurate search and scoring statistics for proteins or segments whose amino acid composition is far outside the Robinson and Robinson amino acid frequencies originally used to estimate the statistical parameters of λ and κ.
The scoring matrix modifications E and gtQ rendered SEG filtering unnecessary and generated alignment scores that conformed to the extreme value distribution, which BLOSUM62-based searches could not do for these sequences of unusual composition. The composition-based matrix modifications also generated score distributions with statistical parameters much closer to those assumed in gapped BLAST statistics, so the resultant e values were more accurate than those from BLOSUM62 and at least as accurate as composition-based statistics in BLAST-PGP. Therefore, BLAST or FASTA with the E or gtQ modified matrices showed great resistance to low-complexity corruption and reliably identified apparent homologs of these important, low-complexity sequences without masking out the low-complexity segments. Furthermore, for these sequences, the efficiency of BLAST was improved, and the efficiency of FASTA was not significantly changed. For query sequences containing low-complexity regions, BLAST-gtQ and FASTA-gtQ32 were the most sensitive search methods and had good discrimination against nonhomologous sequences with similar amino acid compositions. Matrix modification E with either BLAST or FASTA searches had maximal discrimination against nonhomologous sequences but was somewhat less sensitive. The results presented here demonstrate that composition-based matrix modifications discriminate against nonhomologous alignments and therefore make accurate comparative studies of low-complexity sequences possible. This accuracy is necessary for phylogenetics and for structural comparisons.
Another benefit of these matrices will be an analogous improvement in the accuracy of genomic annotations, which are often based on functional analogies for homologous sequences. For instance, transitive closure identified a set of 48 sequences in S. cerevisiae that are similar to the cell wall protein queries. Searches through fungal genomes have revealed that apparent homologs of these proteins are present in other ascomycetes and basidiomycetes (Coronado et al., unpublished). These homologies in turn imply commonalities in cell wall structure and function for fungi whose walls are not as well characterized as those of S. cerevisiae.
This work was funded by the NIH/NCRR/RCMI program (grant RR 03037), the NIH/MBRS-SCORE program (grant S06 GM 60654), and the Howard Hughes Medical Institute (grant 52002679). J.C. is a Fellow of the NSF-MAGNET and RCMI programs.
†Supplemental material for this article may be found at http://ec.asm.org/.