The large numbers of protein sequences generated by whole genome sequencing projects require rapid and accurate methods of annotation. The detection of homology through computational sequence analysis is a powerful tool for determining the complex evolutionary and functional relationships that exist between proteins. Homology search algorithms employ amino acid substitution matrices to detect similarity between protein sequences. The substitution matrices in common use today are constructed using sequences aligned without reference to protein structure. Here we present amino acid substitution matrices constructed from the alignment of a large number of protein domain structures from the structural classification of proteins (SCOP) database. We show that when incorporated into the homology search algorithms BLAST and PSI-BLAST, the structure-based substitution matrices enhance the efficacy of detecting remote homologs.
Whole genome sequencing projects have yielded sequences of a large number of proteins for which no direct experimental information is available. Homology detection is a widely used tool for the structural and functional annotation of such proteins, since two related proteins with a common ancestor may retain the same ancestral function. A number of sequence homology search algorithms1–5 have been developed for this purpose, and are in wide use. Despite these developments, identification of distant homologs in the twilight zone (sequence identity <25%) has remained a challenge.
Pairwise sequence homology search programs1,3 evaluate alignments between sequences by using a scoring scheme that includes a 20 × 20 amino acid substitution matrix and a penalty function for gaps. The substitution matrix assigns a scaled log-odds score for each aligned residue pair. Substitution matrices are also important in programs that utilize “sequence-profile”2 and “profile–profile”5,6 alignments, since the initial profiles are built by collecting homologous sequences identified by using a pairwise comparison score matrix.
Many amino acid substitution matrices have been devised over the years, utilizing a variety of methods.7–11 The popular BLOSUM series is built from multiply aligned sequence segments or ‘blocks’ that represent the most conserved regions in aligned families.12 The accuracy of these alignments will obviously impact the success of these matrices in detecting homologs. In the case of sequence-based matrices such as the BLOSUM series, the sequence alignments tend to become less reliable at large evolutionary distances.13
Structure-based matrices obtain amino acid substitution data directly from structural alignments and hence largely avoid the issues relating to sequence alignment at large evolutionary distances. Thus, sequence alignments obtained from structure alignments have long been considered to be the gold standard, only being superseded by human-curated alignments.14 When compared with sequence-based matrices, though, structure-based matrices13,15–17 have suffered from a paucity of data in the form of homologous protein structures required to construct a substitution matrix.13,18 More recent applications of structure-based matrices have focused primarily on the quality of sequence alignments.13,17 In some evaluations,19 structure-based matrices have not performed as well as sequence-based matrices in the detection of remote homologs, and in the most recent comparison of substitution matrices performed by Brenner and colleagues,20 structure-based matrices were omitted from the comparison.
A large number of protein structures have become available in the past decade, fueled in part by the structural genomics initiative.21 In this paper, we investigate whether structure-based amino acid substitution matrices that exploit this resource could improve the detection of remote homologs. Access to larger datasets of remote homology significantly improved the performance of sequence-based matrices20 and we hypothesized that the same may hold true for structure-based matrices as well.
The structurally aligned substitution matrices (SASM) that we describe here were computed using structurally aligned protein domain pairs. These pairs were selected from an all-against-all pairwise structural superposition of protein domains obtained from the ASTRAL SCOP22 protein domain database. The SASMs that we computed were implemented in BLAST and PSI-BLAST, and their effectiveness in detecting remote homologs was compared against BLOSUM62.
Nonredundant data sets of protein domains with less than 40% and 50% sequence identity to each other, which excluded structures determined by NMR, were selected from the ASTRAL SCOP v1.67 database.22 Domain selections for these sets were based on the SPACI score,22 which is a measure of structure quality. The 0%–40% dataset contained 6551 domains, and the 0%–50% dataset contained 7444 domains. Domains in each set were subjected to an all-against-all pairwise structural superposition using the structure comparison program SHEBA,23 in order to generate a structurally superposed domain pair dataset. The total number of pairwise structural superpositions for the two datasets was 42,915,601 (0%–40%) and 55,413,136 (0%–50%). For each domain pair ab, the better of the two superpositions (ab vs. ba) was selected. Self superpositions (aa) were removed from further consideration.
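The pair counts quoted above follow directly from the dataset sizes. The short sketch below reproduces that arithmetic; the function name is hypothetical, and the final count of unordered pairs is an inference from the description (one superposition kept per pair, self pairs dropped), not a figure reported in the text.

```python
def superposition_counts(n_domains: int) -> dict:
    """Count pairs in an all-against-all comparison of n_domains domains."""
    ordered = n_domains * n_domains   # every (a, b), including a == b
    self_pairs = n_domains            # (a, a) self superpositions
    # Keeping the better of (a, b) and (b, a) leaves one entry per unordered pair.
    unordered = (ordered - self_pairs) // 2
    return {"ordered": ordered, "self": self_pairs, "unordered": unordered}


for label, n in [("0%-40%", 6551), ("0%-50%", 7444)]:
    print(label, superposition_counts(n))
```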
Structurally aligned domain pairs were selected from among the structurally superposed pairs, using the following criteria. These criteria are based on the number of aligned residue pairs, m. SHEBA determines the aligned residue pairs by using a dynamic programming algorithm on two superposed structures. A necessary condition for a pair of residues to be “aligned” is that the distance between the alpha carbons of the pair is less than 3.5Å after superposition of the domains.
where the z-score was defined as
z = (mf − <mf>)/s
with
mf = m/number of residues in the larger domain
<mf> = mean of mf over all pairs involving the given domain
s = standard deviation of mf over all pairs involving the given domain
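As an illustration of the quantities just defined, the following is a minimal sketch of a per-domain z-score calculation. The function names and the input layout (a dictionary of aligned residue counts per domain pair) are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from collections import defaultdict
from statistics import mean, pstdev


def fraction_aligned(m: int, size_a: int, size_b: int) -> float:
    """mf: aligned residue pairs divided by the size of the larger domain."""
    return m / max(size_a, size_b)


def z_scores(aligned: dict, sizes: dict) -> dict:
    """aligned maps (a, b) -> m, one entry per unordered domain pair.
    sizes maps domain -> number of residues.
    Returns {(domain, partner): z}, with z computed relative to `domain`."""
    per_domain = defaultdict(dict)
    for (a, b), m in aligned.items():
        mf = fraction_aligned(m, sizes[a], sizes[b])
        per_domain[a][b] = mf
        per_domain[b][a] = mf

    result = {}
    for domain, partners in per_domain.items():
        values = list(partners.values())
        mu = mean(values)
        sd = pstdev(values)  # assumption: population standard deviation
        if sd == 0:
            continue         # z is undefined when all mf values are equal
        for partner, mf in partners.items():
            result[(domain, partner)] = (mf - mu) / sd
    return result
```

A pair would then pass a z-score filter of, say, 6.5 when its z value exceeds that threshold; how the two domain-specific z values of a pair are combined is not stated in the text and is left open here.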
The log-odds scoring matrix was obtained as follows:24
where Sij is the score given when a residue of type i is aligned with a residue of type j. The frequency qij was computed as
where Nij is the number of aligned residue pairs of types i and j. The corresponding frequency expected for a randomly aligned protein pair was calculated by
where Ni(a) is the number of residues of type i in protein a and Nj(b) is the number of residues of type j in protein b which is aligned to protein a, and the summation labeled ab is over all aligned domain pairs, a−b. The constant factor 1/c is set to 2/ln2, to express the score in half-bit units. The matrices generated using structurally aligned protein domain pairs from the 0%–40%, 0%–50% and 0%–60% sequence identity sets were labeled SASM40, SASM50, and SASM60, respectively. Ten matrices were generated for each sequence identity set, by varying the z-score filter used.
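The equations referred to above are not reproduced in this text. The sketch below shows one log-odds calculation that is consistent with the definitions given: observed pair frequencies from aligned residue counts, expected frequencies from the residue compositions of each aligned pair of proteins, and a scale factor of 2/ln 2 for half-bit units. The function name, the input layout, the symmetric counting of (i, j) and (j, i), and the normalization of the expected frequencies are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import product


def log_odds_matrix(alignments, alphabet):
    """alignments: list of (seq_a, seq_b, aligned_positions), where
    aligned_positions is a list of (index_in_a, index_in_b) tuples.
    Returns {(i, j): score in half-bit units} for pairs with nonzero counts."""
    pair_counts = Counter()      # N_ij, counted symmetrically (assumption)
    expected_counts = Counter()  # sum over pairs a-b of N_i(a) * N_j(b)

    for seq_a, seq_b, aligned_positions in alignments:
        for pos_a, pos_b in aligned_positions:
            i, j = seq_a[pos_a], seq_b[pos_b]
            pair_counts[(i, j)] += 1
            pair_counts[(j, i)] += 1
        comp_a, comp_b = Counter(seq_a), Counter(seq_b)
        for i, j in product(alphabet, repeat=2):
            expected_counts[(i, j)] += comp_a[i] * comp_b[j] + comp_a[j] * comp_b[i]

    total_pairs = sum(pair_counts.values())
    total_expected = sum(expected_counts.values())
    scale = 2 / math.log(2)      # 1/c = 2/ln 2, giving half-bit units

    scores = {}
    for i, j in product(alphabet, repeat=2):
        q = pair_counts[(i, j)] / total_pairs
        e = expected_counts[(i, j)] / total_expected
        if q > 0 and e > 0:      # the log-odds score is undefined otherwise
            scores[(i, j)] = scale * math.log(q / e)
    return scores
```

Matrices distributed for BLAST normally hold integer scores, so a rounding step would follow in practice; it is omitted here because the text does not describe it.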
The relative entropy H of a matrix was computed according to Altschul24 as follows:
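The expression itself does not appear in this text. In Altschul's formulation, the relative entropy of a substitution matrix is the expected score per aligned residue pair under the observed (target) frequencies, which in bits is the sum over all pairs of q·log2(q/e). The sketch below computes that quantity; the function name and the symbol e for the expected frequency are assumptions introduced here.

```python
import math


def relative_entropy_bits(q: dict, e: dict) -> float:
    """Relative entropy H in bits.
    q maps (i, j) -> observed pair frequency; e maps (i, j) -> expected frequency.
    Pairs with q == 0 contribute nothing to the sum."""
    return sum(
        q_ij * math.log2(q_ij / e[pair])
        for pair, q_ij in q.items()
        if q_ij > 0
    )


# Toy two-letter example: only identical residues are ever aligned.
q = {("A", "A"): 0.5, ("G", "G"): 0.5, ("A", "G"): 0.0, ("G", "A"): 0.0}
e = {pair: 0.25 for pair in q}
print(relative_entropy_bits(q, e))
```

Two matrices are isentropic when this value is the same for both, which is how the comparison with BLOSUM62 is framed in the Results.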
SASMs were implemented in BLAST and PSI-BLAST to evaluate their performance in pairwise and profile homology search algorithms. The statistical parameters for the appropriate extreme value distribution, which are required to calculate the normalized score and the E-value of a hit, were computed using a program provided by Stephen Altschul. PSI-BLAST experiments were performed for 20 cycles (or to convergence) using a threshold E-value of 0.001.25,26 The default (11,1) affine gap penalty scheme was used for all experiments.
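To show why matrix-specific statistical parameters are needed, the sketch below applies the standard Karlin-Altschul relations: the E-value is K·m·n·exp(−λS) and the normalized (bit) score is (λS − ln K)/ln 2. The function names are hypothetical, and the λ and K values in the example are illustrative placeholders, not the parameters computed for the SASMs. BLAST also applies effective-length corrections to m and n, which are ignored here.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalized score in bits for a raw alignment score."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score: float, lam: float, k: float, m: int, n: int) -> float:
    """Expected number of chance hits scoring at least raw_score.
    m is the query length and n the total database length."""
    return k * m * n * math.exp(-lam * raw_score)


# Illustrative placeholder parameters only.
lam, k = 0.267, 0.041
m, n = 150, 1_000_000
for raw in (40, 60, 80):
    print(raw, round(bit_score(raw, lam, k), 1), e_value(raw, lam, k, m, n))
```

Because λ and K depend on the substitution matrix and the gap penalties, the same raw score maps to different E-values under different matrices, so a fixed E-value threshold such as 0.001 is only comparable across matrices when each has its own parameters.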
True hits were determined based on the human-curated SCOP27 database assignments, rather than a pure structure-alignment based score.28 Domains that belong to the same SCOP superfamily were considered to be homologous (true hits), while domains belonging to different folds were considered non-homologous (false hits).19,25,29 The domains that belong to the same fold, but different superfamilies, were not counted as either true or false hits.
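The three-way labeling described above can be expressed with SCOP concise classification strings (sccs), which take the form class.fold.superfamily.family, for example b.1.1.1. The following is a minimal sketch; the function name and the label strings are hypothetical.

```python
def classify_hit(query_sccs: str, hit_sccs: str) -> str:
    """Label a hit from the SCOP classification of the query and the hit.
    Returns "true", "false", or "ignored"."""
    q = query_sccs.split(".")
    h = hit_sccs.split(".")
    if q[:3] == h[:3]:   # same class, fold, and superfamily
        return "true"
    if q[:2] != h[:2]:   # different fold (or different class)
        return "false"
    return "ignored"     # same fold, different superfamily


print(classify_hit("b.1.1.1", "b.1.1.4"))  # same superfamily
print(classify_hit("b.1.1.1", "b.1.2.1"))  # same fold, different superfamily
print(classify_hit("b.1.1.1", "c.2.1.3"))  # different fold
```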
For all experiments, the target sequence set was a subset of the ASTRAL SCOP22 v1.65 database, which contained 6,442 protein domain sequences, each with less than 50% sequence identity to any other sequence in the database. The query sequence set consisted of 92 protein domain sequences, each of which had at least 10 SCOP family members (including self) in the target sequence set. This was to ensure that a large number of true positives existed in the target dataset. Also, the most difficult pairwise relations to detect tend to be those between members of larger families and superfamilies.20 The 92 query sequences represented six classes and 64 folds in the SCOP database. The BLOSUM62 substitution matrix was chosen as a benchmark for two reasons. First, it continues to be a popular choice for detecting homologs, and is the default matrix employed in both BLAST and PSI-BLAST. Second, in recent tests conducted on the performance of amino acid substitution matrices, the BLOSUM matrices have fared well against sequence- and structure-based matrices.19,20 The (11,1) affine gap penalty function, which is optimal for BLOSUM62,30 was used for both BLOSUM62 and the SASMs.
Protein domain datasets in ASTRAL SCOP v1.67,22 each containing protein domains selected by pairwise sequence identity, were downloaded, and all protein domains within each dataset were structurally superposed in a pair-wise manner, using the program SHEBA.23
The selection criteria described previously (see Methods) were used to identify structurally aligned protein domain pairs from among the structurally superposed domain pairs, for each ASTRAL SCOP dataset. Increasing the z-score filter resulted in increasing the stringency, and a concomitant reduction in the number of protein domain pairs selected (Table 1).
According to Table 1, the SASM40 matrix that is isentropic with BLOSUM62 (relative entropy 0.7) was constructed using a z-score filter of 6.5, and the corresponding amino acid substitution scores are given in Table 2. The coefficient of determination (R² value) for the two matrices is 0.91. The single substitution score that shows a sign inversion is the cysteine/valine substitution, which has a negative score in BLOSUM62.
In order to evaluate their effectiveness in detecting homologous sequences, the SASMs were implemented in BLAST. A subset of the ASTRAL SCOP v1.65 database, containing 6,442 sequences having no more than 50% sequence identity to each other, served as the target database. A set of 92 query sequences was chosen from this database such that each sequence had at least 10 SCOP family members in the target database. The results were evaluated with respect to the default matrix in BLAST, BLOSUM62. A hit (E-value < 0.001) was considered true if it belonged to the same SCOP superfamily as the query.25,29
SASM50 and SASM40 matrices performed better than BLOSUM62 in returning more true hits for a greater number of queries when the z-score filter was between 3.5 and 8.0 (Figure 1). SASM60, which is formally the SASM matrix most akin to BLOSUM62, performed the poorest in this comparison (data not shown). There were no false positives observed for SASM or BLOSUM62 in BLAST.
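The per-query comparison used here ("more true hits for a greater number of queries") amounts to a simple tally. The sketch below assumes hypothetical dictionaries mapping each query identifier to its number of true hits under each matrix; the query names and counts in the example are invented for illustration only.

```python
def compare_per_query(true_hits_a: dict, true_hits_b: dict) -> tuple:
    """Count the queries for which matrix A finds more true hits than
    matrix B, the queries for which B finds more, and the ties."""
    wins_a = wins_b = ties = 0
    for query, count_a in true_hits_a.items():
        count_b = true_hits_b[query]
        if count_a > count_b:
            wins_a += 1
        elif count_b > count_a:
            wins_b += 1
        else:
            ties += 1
    return wins_a, wins_b, ties


# Invented toy data, not results from the study.
sasm = {"q1": 12, "q2": 8, "q3": 5, "q4": 9}
blosum = {"q1": 10, "q2": 8, "q3": 6, "q4": 7}
print(compare_per_query(sasm, blosum))
```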
SASMs were tested in PSI-BLAST using the same query sequences and target database as for BLAST (see above), and the results were evaluated with respect to BLOSUM62 (Figure 2). As in the case of BLAST, SASM50 and SASM40 matrices performed better than BLOSUM62 in returning more true hits for a greater number of queries when the z-score filter was between 3.5 and 8.0. This effect was more pronounced for PSI-BLAST than for BLAST (Figures 1 and 2). The SASM40 matrix that returned more true hits for the greatest number of queries when compared with BLOSUM62 (23 vs 7) was constructed using a z-score filter of 6.5 (Table 2), and is isentropic with BLOSUM62.
The ability of a similarity detection method to report homologous sequences (sensitivity) must be balanced against the spurious detection of nonhomologs or false hits (specificity). Receiver operating characteristic (ROC) curves31 provide a convenient method of indicating the number of true hits for a given number of false hits.19,26 A comparison of ROC50 curves generated from results of PSI-BLAST using BLOSUM62 and the SASM40 matrix isentropic with BLOSUM62 (Table 2) is given in Figure 3. The results show that the latter finds more true hits at all E-value cutoff levels.
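A truncated ROC curve of this kind can be built by ranking hits from most to least significant and recording how many true hits have accumulated each time a false hit is encountered, stopping at the 50th false hit. The sketch below assumes hits are given as (E-value, label) tuples with labels "true", "false", or "ignored"; the function name, the labels, and the pooling of hits into one list are assumptions, since the text does not specify how hits from different queries were combined.

```python
def roc_n_curve(hits, n: int = 50) -> list:
    """hits: iterable of (e_value, label) with label in {"true", "false", "ignored"}.
    Returns a list whose k-th entry is the number of true hits ranked
    ahead of the (k+1)-th false hit, truncated after n false hits."""
    curve = []
    true_count = 0
    for _, label in sorted(hits, key=lambda hit: hit[0]):
        if label == "true":
            true_count += 1
        elif label == "false":
            curve.append(true_count)
            if len(curve) == n:
                break
        # "ignored" hits (same fold, different superfamily) are skipped
    return curve


# Invented toy data, not results from the study.
example = [
    (1e-30, "true"), (1e-10, "true"), (1e-5, "false"),
    (1e-3, "true"), (0.1, "ignored"), (1.0, "false"),
]
print(roc_n_curve(example))
```

Plotting this list against the false-hit index for two matrices gives curves of the kind compared in Figure 3; the curve that lies higher finds more true hits at the same number of false hits.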
Accurate alignment of protein sequences is critical in obtaining the reliable amino acid substitution frequencies required for computing substitution matrices. This task becomes more challenging as the evolutionary distances between proteins increase. Since sequence alignments based on structural alignments have long been considered to be the gold standard, it is reasonable to expect that substitution matrices based on structural alignments can lead to better performance in detecting remote homologs. Efforts at obtaining structure-based substitution matrices have been constrained by the limited number of solved structures available, when compared with sequence data. For example, two of the more recent structure-based matrices, BC17 and SDM13 were computed from the alignments of protein pairs that numbered in the hundreds. These matrices have been useful in generating improved sequence alignments. However, the BLOSUM matrices have been shown to be superior in detecting remote homologs.19 It has therefore remained an open question as to whether structure-based matrices could prove useful in the detection of remote homologs. In this study, we have shown that structure-based matrices computed using the expanded set of protein structures now available can detect a greater number of homologs in the popular homology detection programs BLAST and PSI-BLAST, when compared to BLOSUM62.
The large set of structurally aligned protein pairs used in this study (Table 1) was selected from an even larger set of structurally superposed protein pairs (see Methods), and the amount of data clearly precluded a manual examination of structural superpositions as a basis for selection. The criteria that were developed to automate the selection process include the use of a z-score filter (see Methods). The superiority of the structure-based SASM matrices in detecting remote homologs is relatively insensitive to the value of the z-score filter used, in the range between 3.5 and 8.0 (Figures 1 and 2). This is somewhat remarkable, given that the number of protein domain pairs selected changes by a factor of three in this range (Table 1). When the z-score filter value is further decreased (<3.5), we have anecdotal evidence showing that structural alignments in the beta sheet regions may, in some cases, be poor, which will lead to errors in pairwise frequency counts. These results also suggest that the matrix elements themselves may be relatively insensitive to future increases in the size of the protein structure database.
The selection of structurally aligned protein domain pairs, which are presumed to be homologous, was based solely on the application of the selection criteria described in Methods. This selection procedure is also supported by the observation that most protein domain pairs thus selected were related by SCOP classification.
There is an expectation that the optimal set of frequencies utilized to compute a substitution matrix is, in the words of Karlin and colleagues, “simply those found in the sort of region we seek to identify.”32 Our results are consistent with this expectation, since SASM50 performs better than SASM60 (or BLOSUM62) when using a set of query sequences which had, at most, 50% sequence identity to target sequences. In the case of sequence-based BLOSUM matrices, BLOSUM62 is often preferred to BLOSUM45 in detecting remote homologs. This discrepancy may be due to the difficulties associated with aligning sequences at greater evolutionary distances, in the absence of structural information.
We thank Dr. Stephen Altschul for providing the program that calculates the statistical parameter values for amino acid substitution matrices. Nalin CW Goonesekere received funding through a Summer Fellowship Award from UNI. The author reports no conflicts of interest in this work.