A primary challenge for structural genomics is the automated functional characterization of protein structures. We have developed a sequence-independent method called S-BLEST (Structure-Based Local Environment Search Tool) for the annotation of previously uncharacterized protein structures. S-BLEST encodes the local environment of an amino acid as a vector of structural property values. It has been applied to all amino acids in a nonredundant database of protein structures to generate a searchable structural resource. Given a query amino acid from an experimentally determined or modeled structure, S-BLEST quickly identifies similar amino acid environments using a K-nearest neighbor search. In addition, the method gives an estimation of the statistical significance of each result. We validated S-BLEST on X-ray crystal structures from the ASTRAL 40 nonredundant dataset. We then applied it to 86 crystallographically determined proteins in the protein data bank (PDB) with unknown function and with no significant sequence neighbors in the PDB. S-BLEST was able to associate 20 proteins with at least one local structural neighbor and identify the amino acid environments that are most similar between those neighbors.
Understanding the relationship between protein structure and chemical function is a problem of growing importance.1 In particular, structural genomics initiatives are determining the structures of targets without known or characterized function. Some of these initiatives have prioritized targets with potentially novel folds based on sequence with little similarity to known structure. Currently, conservation within a sequence alignment or phylogenetic tree remains the primary method for computationally identifying functional residues, and many experimental methods rely on site-directed mutagenesis in combination with other functional assays.2
Generally, function is inferred computationally by assessing similarity to proteins of known function. This guilt-by-association approach has proven to be valuable. In addition to sequence comparative methods, current structural methods for identifying function rely on one of the following:
Few clustering methods can identify functional residues automatically based on structural properties alone. Sequence-based methods for functional characterization rely on identifying conserved residues within protein structures. More sophisticated methods, such as the evolutionary trace method, use phylogenies combined with structure to define residues of functional importance.2 It is important to develop sequence-independent methods for identifying function to complement sequence-based methods when they are limited by lack of sequence similarity or small datasets.
Methods for identifying key functional residues, or molecular fingerprints, can classify function. These include Fuzzy Functional Forms,4 PROCAT,3 a neural network method developed by Stawiski et al.,6 and FEATURE.7 FEATURE describes a local environment around an arbitrary three-dimensional point in space by building a vector of property values that lie within several radial shells centered about the point. The properties are discrete structural property values for each atom within a shell. These values contain the number of atoms associated with a given residue type, secondary structure, van der Waals volume, and solvent accessibility. Given two sets of vectors, one set associated with some common functional or structural attribute and the other set lacking that attribute, FEATURE uses supervised machine learning to predict new positions within a protein structure that share the common attribute.
SCOP has proven to be a powerful tool for studying known protein structures.8 By maintaining a complete, annotated classification of all known proteins based on sequence, structure, and functional information, the structural components that classify a family can be determined. SCOP is a manually curated database and often is used as a gold standard for structural classification of proteins.
Sequence-independent structure-based methods for function assignment are challenging for several reasons. First, aligning local structure is a difficult computational task.9 Second, estimating the statistical significance of the results is challenging.10 Third, scanning through the entire protein data bank (PDB)11 can be computationally demanding. Finally, and perhaps most vexing, structural similarity and functional similarity are not always well correlated.12
We have developed a method for unsupervised mining of structural datasets and automatically identifying local regions within protein structures that are statistically associated with a given annotation. Methods exist for unsupervised mining of structural topology. These include VAST,13 DALI,14 the method of Singh and Saha,15 Dubey et al.,16 and PINTS.17 Our method complements these methods by defining the most structurally significant residue environments for a given classification, based on the structural environments represented in that database. S-BLEST (Structure-Based Local Environment Search Tool) is based on the FEATURE representation of a local environment, and rapidly searches databases of vectors of local structure properties. This method is a structural analog to sequence-based similarity search methods such as BLAST.18 We parameterized and evaluated the method by measuring how well selected residue environments in the ASTRAL 40 dataset are associated with their annotated SCOP family.
A residue is encoded as a vector of properties using the following procedure, similar to others.7,19 To describe the local environment for each residue, a vector of properties is taken from a set of concentric shells extending outward from the position of the residue’s beta-carbon (Cβ) atom. Cβ atom positions for glycine residues were estimated by determining the average position of a Cβ (relative to the Cα, N, and C atoms) from serine protease 1DSU. Each shell contains 66 properties, which include the number of atoms associated with a given residue type, the number of positively and negatively charged ions, the van der Waals volume of the shell, and the solvent accessibility. Each vector contains four shells with the following outer radial boundaries: 1.875, 3.75, 5.625, and 7.5 Å. With four shells each having 66 properties, the resulting vector that describes the local environment of a residue has 264 dimensions. The properties are identical to the ones used by Bagley and Altman.7 This representation is orientation independent and can be used on arbitrary coordinates within a protein structure.
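The shell-based encoding can be illustrated with a minimal sketch. It assumes the four listed radii are the outer boundaries of successive shells (consistent with 264 = 4 × 66 dimensions) and that a boundary distance belongs to the inner shell; it computes only a raw atom count per shell as a stand-in for one of the 66 properties. The function names and the toy coordinates are hypothetical and are not taken from the S-BLEST implementation.

```python
import math

# Outer radii of the shells in angstroms, as listed in the text.
SHELL_BOUNDS = [1.875, 3.75, 5.625, 7.5]

def shell_index(distance):
    """Return the 0-based shell containing `distance`, or None beyond 7.5 A.

    Assumption: a distance exactly on a boundary belongs to the inner shell.
    """
    for i, outer in enumerate(SHELL_BOUNDS):
        if distance <= outer:
            return i
    return None

def count_atoms_per_shell(center, atoms):
    """Count atoms falling in each shell around `center` (e.g. a C-beta position)."""
    counts = [0] * len(SHELL_BOUNDS)
    for atom in atoms:
        idx = shell_index(math.dist(center, atom))
        if idx is not None:
            counts[idx] += 1
    return counts

# Toy coordinates (hypothetical): one atom per shell plus one outside 7.5 A.
cb = (0.0, 0.0, 0.0)
atoms = [(1.0, 0.0, 0.0), (0.0, 3.0, 0.0), (0.0, 0.0, 5.0),
         (7.0, 0.0, 0.0), (9.0, 0.0, 0.0)]
counts = count_atoms_per_shell(cb, atoms)
```

Because only distances from the central point are used, the resulting counts do not depend on how the structure is oriented, which is the orientation independence noted above.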
To encode the entire PDB, the Cβ of each amino acid from each structure in the ASTRAL 40 nonredundant structure database was encoded as a vector as described above. We use a nonredundant database so that features derived from recent common descent are minimized. All hetero-atoms were removed before encoding, as was all atomic information from chains other than the one containing the input residue. A vector set containing the entire PDB with other chains included was also built and tested. Each component of the vectors was normalized as integers from 0 to 255 with the formula:
where min_i and max_i are the minimum and maximum value of the ith component across the entire vector set. In addition, the minimum and maximum are capped at a maximum of 18 standard deviations from the mean, in order to prevent odd outliers from skewing the results. Only X-ray crystal structures were used in the analysis. Approximately one million vectors were in the ASTRAL 40 v1.65 vector set. This set was stored in a binary file that contains the normalization factors, min_i and max_i, for each dimension and the vector data. Each vector is encoded with the PDB id and chain (5 bytes), the residue type (1 byte), the residue number (1 byte), the insertion code (1 byte), and the vector data (264 bytes).
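The normalization formula itself is not reproduced in this text. A plausible reading, assuming a linear min-max rescaling of each component to the range 0 to 255 with the bounds capped at 18 standard deviations from the mean, is sketched below. The rounding rule, the clipping of out-of-range values, and the function name are assumptions rather than details confirmed by the source.

```python
import statistics

def normalize_component(values, max_sd=18.0):
    """Rescale one vector component to integers in 0..255.

    Assumptions: linear min-max scaling, bounds capped at `max_sd` population
    standard deviations from the mean, values outside the capped bounds
    clipped, and results rounded to the nearest integer.
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    lo = max(min(values), mean - max_sd * sd)  # capped minimum (min_i)
    hi = min(max(values), mean + max_sd * sd)  # capped maximum (max_i)
    if hi == lo:
        # A constant component carries no information; map it to 0.
        return [0 for _ in values], lo, hi
    scaled = []
    for v in values:
        clipped = min(max(v, lo), hi)
        scaled.append(round(255 * (clipped - lo) / (hi - lo)))
    return scaled, lo, hi

# Toy values for one component across a small vector set.
scaled, lo, hi = normalize_component([0.0, 2.0, 10.0])
```

Storing each component as a value from 0 to 255 is what allows a 264-dimensional vector to occupy 264 bytes in the binary file described above.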
The S-BLEST method relies on nearest-neighbor searches using a Manhattan distance metric. Manhattan distance was chosen because it is inexpensive to calculate and most derivative statistics are easy to determine. The closest vector from each chain in the dataset is determined, sorted, and output. A significance score (z-score) is calculated by estimating the mean and variance of all distances between the query residue and the residues in the dataset using the following formula:
Given a query residue, S-BLEST can find the most similar residue in each chain in the dataset and provide a score for the similarity using the z-score.
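The search step can be sketched as follows. Since the z-score formula is not reproduced in this text, the sketch assumes the standard form z = (d - mean) / sd, where d is the Manhattan distance of a hit and the mean and standard deviation are taken over the distances from the query to every residue in the dataset. Under that form, more similar environments receive more negative z-scores, which matches the negative thresholds used later. The database layout, identifiers, and function names are hypothetical.

```python
import statistics

def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def search(query, database):
    """Return the closest residue per chain as sorted (z, chain, residue) tuples.

    `database` is a list of (chain_id, residue_label, vector) records.
    Assumed z-score: (distance - mean) / sd over all query-to-dataset distances.
    """
    distances = [(manhattan(query, vec), chain, label)
                 for chain, label, vec in database]
    all_d = [d for d, _, _ in distances]
    mean = statistics.fmean(all_d)
    sd = statistics.pstdev(all_d)

    best = {}  # chain_id -> (distance, residue_label) of its closest residue
    for d, chain, label in distances:
        if chain not in best or d < best[chain][0]:
            best[chain] = (d, label)

    hits = []
    for chain, (d, label) in best.items():
        z = (d - mean) / sd if sd > 0 else 0.0
        hits.append((z, chain, label))
    return sorted(hits)  # most negative (most similar) first

# Toy database with two hypothetical chains and short vectors.
database = [
    ("chainA", "GLY10", [0, 0, 0]),
    ("chainA", "ALA11", [10, 10, 10]),
    ("chainB", "SER5", [200, 200, 200]),
    ("chainB", "HIS6", [100, 100, 100]),
]
hits = search([1, 0, 0], database)
```

This brute-force scan is linear in the number of stored vectors; it is meant only to show the logic, not the performance characteristics of the actual tool.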
If a query protein is a member of a known class (such as a SCOP family), the residue environments most associated with that family can be readily determined by performing an S-BLEST query on each residue and applying the following protocol. The performance of each residue can be determined by creating a receiver operating characteristic (ROC) plot of the ranking, where the true-positive rate is plotted against the false-positive rate. A true positive is a protein structure that belongs to the same SCOP family as the query protein with a z-score of greater magnitude than the threshold. A false positive is a protein structure that does not belong to the same SCOP family but has a z-score of greater magnitude than the threshold. Each point on the plot represents the true-positive rate and false-positive rate of the ranking at a given z-score threshold. The ROC plot can be summarized by calculating the area under the curve (AUC). The AUC of a residue in a query structure of known function indicates how well the residue environment classifies the SCOP family of the structure and can range from 0.0 (perfect reverse classification) to 1.0 (perfect classification).
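The AUC for one query residue can be computed directly from its ranked hits without drawing the curve. The sketch below uses the rank-based (Mann-Whitney) equivalent of the area under the ROC curve: the fraction of (family member, non-member) pairs in which the member has the more negative z-score, with ties counted as one half. The input layout and function name are illustrative assumptions.

```python
def roc_auc(hits):
    """Area under the ROC curve for hits given as (z_score, is_family_member).

    More negative z-scores are treated as more similar, so a family member
    "wins" a pair when its z-score is lower than the non-member's.
    """
    pos = [z for z, member in hits if member]
    neg = [z for z, member in hits if not member]
    if not pos or not neg:
        raise ValueError("need at least one family member and one non-member")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p < n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy rankings (hypothetical z-scores).
perfect = roc_auc([(-6.0, True), (-5.0, True), (-2.0, False)])
mixed = roc_auc([(-6.0, True), (-5.0, False), (-4.0, True), (-1.0, False)])
```

A residue whose AUC is near 1.0 ranks its own family ahead of unrelated structures, whereas a value near 0.5 indicates an environment that does not discriminate the family.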
Congruence approaches are a useful way to combine several searches to increase statistical significance.20 When given a query with multiple residues, such as all the residues in a query chain, S-BLEST can identify chains in the dataset that are most similar to the query chain and pinpoint the residues between the query chain and the dataset chain that are similar. The score for the dataset chain is the average z-score of the k most similar residues in the chain.
The following procedure is used to identify and score the most similar chain in the dataset to a query chain. For each residue in the query chain, the most similar residue in each dataset chain is identified and scored (using the above z-score). If there were n residues in the query chain, there would be n residues (possibly redundant) in the dataset chain that are identified as most similar to each of the n residues in the query, each with a z-score. The score for the chain is the average of the top k z-scores. Each chain can then be ranked according to this averaged z-score, and the top k residues are reported as the residues bringing the query chain and database chain together. Because of the large computational task of building and ranking a table, hits with z-scores weaker than -2.5 (that is, of smaller magnitude) are filtered out. Although it is possible that filtering out low-scoring hits may affect the results, we did not observe any significant differences in the test cases (data not shown).
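The chain-scoring procedure can be sketched as below. The value of k is not stated in this text, so k is left as a parameter; the input layout, the function name, and the choice to average over fewer than k hits when a chain has fewer surviving matches are all assumptions. The filter keeps only hits at least as strong as -2.5, reading the filter as removing weak (small-magnitude) z-scores.

```python
def score_chains(per_residue_hits, k, z_filter=-2.5):
    """Rank dataset chains by the average of their k strongest z-scores.

    `per_residue_hits` maps each query residue to
    {chain_id: (z_score, matched_dataset_residue)}.
    Hits weaker than `z_filter` are dropped before ranking.
    """
    by_chain = {}
    for q_res, chain_hits in per_residue_hits.items():
        for chain, (z, d_res) in chain_hits.items():
            if z <= z_filter:
                by_chain.setdefault(chain, []).append((z, q_res, d_res))

    ranked = []
    for chain, matches in by_chain.items():
        top = sorted(matches)[:k]  # most negative z-scores first
        avg = sum(z for z, _, _ in top) / len(top)
        ranked.append((avg, chain, top))
    return sorted(ranked)  # strongest average first

# Toy input: three query residues against two hypothetical chains.
toy = {
    "Q1": {"chainX": (-7.0, "X10"), "chainY": (-1.0, "Y3")},
    "Q2": {"chainX": (-6.0, "X12"), "chainY": (-3.0, "Y8")},
    "Q3": {"chainX": (-2.0, "X40"), "chainY": (-2.4, "Y9")},
}
ranked = score_chains(toy, k=2)
```

The `top` list returned for each chain holds the paired query and dataset residues, which corresponds to the residues reported as bringing the two chains together.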
We empirically determined the z-score threshold for search results by taking 100 random SCOP families in ASTRAL 40. We then calculated the best cutoff by balancing a high positive predictive value and a large number of true positive hits. This analysis is displayed in Figure 1.
The ASTRAL 40 v1.65 vector set encodes 4,129 crystallographically determined structures. Each search takes approximately 2 s to encode and query a single vector on an Intel Xeon 2.8-GHz processor. Figure 2 shows example background distributions used to calculate the z-scores. These distributions are generally not Gaussian and often contain shoulders or evidence of higher complexity.
To illustrate the utility of using the AUC of an ROC plot, we determined how well each residue environment in a protein was associated with the protein’s annotated SCOP family. We looked in detail at the S-BLEST search results for residues in P38 mitogen-activated protein kinase from Homo sapiens 1DI9 chain A (1DI9:A)21 and found that S-BLEST identifies residues near functional regions of the structure as being associated with the protein’s SCOP family of protein kinases. The functional environments were considered to be the adenosine 5′-triphosphate (ATP) binding site, the peptide binding channel residues, and residues known to be phosphorylated. Figure 3 illustrates how the top scoring residues discriminate function. The functional residues were identified by ranking the AUC for S-BLEST search of each residue in the protein. The top 10 residues are shown in Figure 3. The residues that are good at classification form a core that is close to all three of the functionally interesting regions of the enzyme, the peptide binding channel, the ATP binding site, and the activating phosphorylated residues.
Our goal is to show that S-BLEST finds structurally similar environments with potential implications for fold, family, and function. To do this, we selected 100 random SCOP families in ASTRAL 40. For each protein structure, an S-BLEST search was performed for every amino acid in the structure. The result of each search is a list of residue environments from a database of protein structures ranked by their similarity to the query residue based on a significance score (z-score). Only one residue, the one with highest similarity, is selected from each structure in the dataset. The datasets developed include all analyzable X-ray crystallographic structures from the ASTRAL 40 nonredundant dataset.22 To evaluate how well the environmental similarity of a residue from each structure can be used to assign the SCOP family of the structure, we examined the rankings of the members of the SCOP family associated with each structure using the procedure described in Materials and Methods.
A z-score threshold for each protein was set at -5.5. The positive predictive value (PPV) of the search can be defined as the number of true positives above the threshold divided by the total number of hits above the threshold.
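As a minimal sketch of this definition, the PPV at a given threshold can be computed from a list of hits labeled by SCOP family membership. A hit is taken to pass the threshold when its z-score is at least as negative as the threshold; the input layout and function name are assumptions, and the toy z-scores are illustrative only.

```python
def ppv_at_threshold(hits, threshold):
    """Return (PPV, true-positive count) for hits given as (z, is_family_member).

    A hit passes when z <= threshold, i.e. its magnitude is at least that of
    the (negative) threshold. PPV is None when nothing passes.
    """
    passing = [member for z, member in hits if z <= threshold]
    true_pos = sum(passing)
    ppv = true_pos / len(passing) if passing else None
    return ppv, true_pos

# Toy hits (hypothetical): three pass a threshold of -5.5, two of them correct.
hits = [(-7.0, True), (-6.0, True), (-5.6, False), (-5.0, True), (-3.0, False)]
ppv, true_pos = ppv_at_threshold(hits, -5.5)
```

Sweeping the threshold and recording both values shows the trade-off described earlier: a stricter threshold tends to raise the PPV while reducing the number of true-positive hits retained.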
We next applied S-BLEST to crystal structures of proteins with unknown function. Eighty-six of these structures had no significant hits when searched against the PDB using BLAST with an e-value cutoff of 1e-4 (Table I). These proteins were selected from the PDB by searching for the phrase “unknown function.” Because the search phrase “unknown function” can have several intended meanings, these proteins represent a broad spectrum of proteins whose function is understood to varying degrees of precision. Table I lists the 86 structures and highlights the structures that were returned for each query protein with an average z-score better than the threshold of -5.00. With this procedure, we have identified residues in the PDB whose local environments are similar to those in the query structure and are of potential structural significance.
Among all the proteins of unknown function, we chose several interesting results for detailed analysis. Succinyl diaminopimelate desuccinylase from Neisseria meningitidis, 1VGY:A, illustrates S-BLEST’s effectiveness in identifying statistically interesting residues from an uncharacterized protein structure. S-BLEST found that 1VGY:A shared highly significant residues with a dinuclear zinc aminopeptidase PepV from Lactobacillus delbrueckii, 1LFW:A, with a z-score of -6.36. BLAST matches these proteins with an e-value of 3 × 10^-4. The matching top five residues from 1VGY:A paired with 1LFW:A are ARG97 with ARG115, HIS68 with HIS87, ASP70 with ASP89, GLY98 with GLY112, and GLU136 with GLU154. As illustrated in Figure 4(A), the best matched residues in 1LFW flank the active site of the protein and are in close contact with the AEP ligand that was crystallized with the structure. This suggests that the corresponding residues in 1VGY are likely in this region as well. To further test S-BLEST, we assigned 1VGY to a SCOP family based on the top hit. 1LFW:A, the protein with multiple similar residues, is associated with SCOP family c.56.5.4, bacterial dinuclear zinc exopeptidases. We hypothesized that 1VGY:A is a member of this SCOP family and, in a process analogous to the one described earlier, we performed an S-BLEST search on each residue and determined the residues most associated with that SCOP family. There are five structures with less than 40% sequence identity (according to ASTRAL 1.65) that belong to the SCOP family of bacterial dinuclear zinc exopeptidases. We find that residues HIS68, GLU135, ASP70, ASP134, and HIS350 from 1VGY:A all can be used to annotate the structure very well (Fig. 5). Each of these residues is localized to one region of the structure, and the corresponding residues in 1LFW:A are near the active site.
Sometimes residues sit in an environment that is sufficiently unique to give several hits, but those hits are based on unique structural properties, and not necessarily the protein’s function. For example, the hypothetical gene product from Escherichia coli, 1OYZ:A, is matched with protein phosphatase PP2A from H. sapiens, 1B3U:A, with a z-score of -5.21. We observe that residues that are found at the helix-loop interface and are oriented toward another secondary structural element are often identified as being good matches between structures [Fig. 4(B)]. For the strict purposes of this article, these should be considered as false positives, although the underlying reason for their uniqueness and any functional relationship between these seemingly unrelated proteins is intriguing and may deserve follow-up.
Another interesting hit, an archaeal SM-like protein AF-SM2 from Archaeoglobus fulgidus, 1LJO:A, is matched with a small nuclear ribonucleoprotein SM D1 from H. sapiens, 1B34:A, with a z-score of -5.64. BLAST matches these proteins with an e-value of 1e-7. The top five matching residues all are close in space and all are close in sequence between the query and the match [Fig. 4(C)]. The proteins share the same fold, and several matching environments are identified.
The characterization of proteins from their structure is an important goal for the high throughput structural genomics pipeline. S-BLEST provides a method for quickly identifying similar local structures and the corresponding residue environments. Furthermore, it does not rely on fold recognition or the pre-identification of evolutionarily conserved residues. This method is intended to identify statistically significant environments in protein structures and will be complementary to both sequence-based methods such as BLAST or HMMs and fold recognition methods.
S-BLEST can be easily combined with BLAST for a sequence-structure analysis of a query protein. This allows for identification of highly conserved structural sites, as well as highly conserved sequence neighbors. For example, with the analysis of a random member from each of 100 random families, S-BLEST (threshold of -5.1) finds 28 SCOP family members that BLAST (threshold of 1e-5) does not find, and BLAST finds 89 family members that S-BLEST does not find, because of local structural variability between the proteins. There is a cost, however, of 66 false-family positives, all but 13 of which share the superfamily of the query. Additionally, for each BLAST hit, the degree of structural conservation of each residue environment can be easily determined using S-BLEST.
We were surprised to find that many residues that were annotated as being important for enzyme chemistry are not the ones that are most useful for recognizing structural similarities. The method sometimes does not select the critical residues (such as the catalytic triad), likely because the environments around those residues are structurally variable between members. The residue environments that are chosen, however, are those environments that are structurally conserved across a family. There are several possible explanations for why apparently critical residue environments are not conserved. These residue environments may adopt different structures in the presence of different ligands, crystallizing conditions, or in the presence of mutation. Methods that take into account protein dynamics may uncover similar ensembles of important residues that appear different in static structures.
The computational requirements of our method are relatively modest. For a single residue search, a 3.06-GHz Intel Xeon CPU can complete the search in less than 30 s. Querying with a 300-residue protein against the PDB can take as long as 4 h and requires a relatively large amount of memory (1-2 GB), whereas the same protein takes less than one half of an hour with ASTRAL 40. The vector data for the entire PDB is currently split between two files, each around 1.2 GB in size.
We developed S-BLEST to meet a need for rapidly identifying similar structures to a query protein using local structural features. To complement fold-recognition methods, we sought a method that could identify the local residue environments that correspond to that match. Our solution, S-BLEST, identifies constellations of structurally similar residues between the query protein and the full database of known protein structures. Moreover, we find that many of the structural environments in SCOP have statistically significant local environment neighbors.
S.D.M. was funded by an American Cancer Society John Peter Hoffman Fellowship and NIH grant LM06244 (Russ Altman, PI) and is now funded by the INGEN grant from the Lilly Endowment. M.H.-P.L. is funded by NIH grants LM-05652, LM-07033, and GM-63495. The authors acknowledge Giselle Knudsen and David Konerding for helpful comments.
Grant sponsor: American Cancer Society; Grant sponsor: National Institutes of Health; Grant numbers: LM06244, LM-05652, LM-07033, and GM-63495; Grant sponsor: Lilly Endowment