Whole genome sequencing projects have yielded sequences of a large number of proteins for which no direct experimental information is available. Homology detection is a widely used tool for the structural and functional annotation of such proteins, since two related proteins with a common ancestor may retain the same ancestral function. A number of sequence homology search algorithms1
have been developed for this purpose, and are in wide use. Despite these developments, identification of distant homologs in the twilight zone (sequence identity <25%) has remained a challenge.
Pairwise sequence homology search programs1
evaluate alignments between sequences by using a scoring scheme that includes a 20 × 20 amino acid substitution matrix and a penalty function for gaps. The substitution matrix assigns a scaled log-odds score for each aligned residue pair. Substitution matrices are also important in programs that utilize “sequence-profile”2
alignments, since the initial profiles are built by collecting homologous sequences identified by using a pairwise comparison score matrix.
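As an illustration of the scoring scheme described above, the following minimal sketch derives scaled log-odds scores from a set of aligned residue pairs and then scores a gapped alignment. The function names, the scale factor of 2 (half-bit units), and the gap costs (open 11, extend 1) are assumptions chosen for illustration; they are not taken from the matrices described in this paper.

```python
import math
from collections import Counter


def log_odds_matrix(aligned_pairs, scale=2.0):
    """Build scaled log-odds scores s(a, b) = scale * log2(q_ab / (p_a * p_b)).

    aligned_pairs: iterable of (a, b) residue pairs taken from alignments.
    scale: hypothetical scaling factor (2.0 gives half-bit units).
    """
    counts = Counter()
    for a, b in aligned_pairs:
        # Count each pair in both orders so the matrix is symmetric.
        counts[(a, b)] += 1
        counts[(b, a)] += 1
    total = sum(counts.values())

    # q_ab: observed frequency of each ordered residue pair.
    q = {pair: n / total for pair, n in counts.items()}

    # p_a: background frequency of residue a (marginal of q).
    p = Counter()
    for (a, _b), freq in q.items():
        p[a] += freq

    # Positive scores mark pairs seen more often than expected by chance.
    return {
        (a, b): round(scale * math.log2(freq / (p[a] * p[b])))
        for (a, b), freq in q.items()
    }


def score_alignment(row_a, row_b, scores, gap_open=11, gap_extend=1):
    """Score two equal-length alignment rows ('-' marks a gap).

    A gap of length k costs gap_open + k * gap_extend (affine penalty).
    Residue pairs absent from `scores` raise KeyError in this sketch.
    """
    total = 0
    gap_in = None  # which row currently holds an open gap: 'a', 'b', or None
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            side = "a" if a == "-" else "b"
            if gap_in != side:
                total -= gap_open  # a new gap starts here
            total -= gap_extend
            gap_in = side
        else:
            total += scores[(a, b)]
            gap_in = None
    return total


# Toy data: identities dominate, A/L substitutions are rare.
pairs = [("A", "A")] * 6 + [("L", "L")] * 6 + [("A", "L")] * 2
toy_scores = log_odds_matrix(pairs)
toy_total = score_alignment("AL-A", "ALLA", toy_scores)
```

In this toy example the identical pairs receive positive scores and the rare A/L substitution a negative one, which is the behavior a log-odds matrix is designed to capture. A real matrix would be estimated from many thousands of aligned pairs covering all 20 amino acids.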
Many amino acid substitution matrices have been devised over the years, utilizing a variety of methods.7
The popular BLOSUM series is built from multiply aligned sequence segments or ‘blocks’ that represent the most conserved regions in aligned families.12
The accuracy of these alignments directly affects the success of these matrices in detecting homologs. In the case of sequence-based matrices such as the BLOSUM series, the sequence alignments tend to become less reliable at large evolutionary distances.13
Structure-based matrices obtain amino acid substitution data directly from structural alignments and hence largely avoid the problems of sequence alignment at large evolutionary distances. Indeed, sequence alignments derived from structure alignments have long been considered the gold standard, superseded only by human-curated alignments.14
Compared with sequence-based matrices, though, structure-based matrices13
have suffered from a paucity of the homologous protein structures required to construct a substitution matrix.13
More recent applications of structure-based matrices have focused primarily on the quality of sequence alignments.13
In some evaluations,19
structure-based matrices have not performed as well as sequence-based matrices in the detection of remote homologs, and in the most recent comparison of substitution matrices performed by Brenner and colleagues,20
structure-based matrices were omitted from the comparison.
A large number of protein structures have become available in the past decade, fueled in part by the structural genomics initiative.21
In this paper, we investigate whether structure-based amino acid substitution matrices that exploit this resource could improve the detection of remote homologs. Access to larger datasets of remote homologs significantly improved the performance of sequence-based matrices,20
and we hypothesized that the same might hold true for structure-based matrices.
The structurally aligned substitution matrices (SASM) that we describe here were computed using structurally aligned protein domain pairs. These pairs were selected from an all-against-all pairwise structural superposition of protein domains obtained from the ASTRAL SCOP22
protein domain database. The SASMs that we computed were implemented in BLAST and PSI-BLAST, and their effectiveness in detecting remote homologs was compared against BLOSUM62.