Understanding the relationship between protein structure and chemical function is a problem of growing importance.1
In particular, structural genomics initiatives are determining the structures of targets without known or characterized function. Some of these initiatives have prioritized targets with potentially novel folds, selected because their sequences show little similarity to those of known structures. Currently, conservation within a sequence alignment or phylogenetic tree remains the primary method for computationally identifying functional residues, and many experimental methods rely on site-directed mutagenesis in combination with other functional assays.2
Generally, function is inferred computationally by assessing similarity to proteins of known function. This guilt-by-association approach has proven valuable. In addition to sequence-comparison methods, current structural methods for identifying function rely on one of the following:
- phylogenetic trees derived from sequence similarity,2
- hand-curated molecular fingerprints,3,4 or
- fold recognition and alignment methods.5
Few clustering methods can identify functional residues automatically based on structural properties alone. Sequence-based methods for functional characterization rely on identifying conserved residues within protein structures. More sophisticated methods, such as the evolutionary trace method, use phylogenies combined with structure to define residues of functional importance.2
It is important to develop sequence-independent methods for identifying function to complement sequence-based methods when they are limited by lack of sequence similarity or small datasets.
Methods for identifying key functional residues, or molecular fingerprints, can classify function. These include Fuzzy Functional Forms,4 a neural network method developed by Stawiski et al.,6 and FEATURE.
FEATURE describes a local environment around an arbitrary three-dimensional point in space by building a vector of property values computed within several radial shells centered on the point. The properties are discrete structural property values for each atom within a shell. These values include the number of atoms associated with a given residue type, secondary structure, van der Waals volume, and solvent accessibility. Given two sets of vectors, one set associated with some common functional or structural attribute and the other set lacking that attribute, FEATURE uses supervised machine learning to predict new positions within a protein structure that share the common attribute.
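The shell-based description above can be illustrated with a minimal sketch. It is not the FEATURE implementation: the `Atom` record, the `RESIDUES` and `SECONDARY` category lists, the shell count, and the shell width are all assumptions chosen for illustration, and per-shell volume and accessibility are simply summed.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    """Hypothetical atom record; field names are illustrative only."""
    coords: tuple          # (x, y, z) in angstroms
    residue: str           # residue type the atom belongs to
    secondary: str         # secondary-structure class of that residue
    vdw_volume: float      # van der Waals volume
    accessibility: float   # solvent accessibility


# Assumed category lists; a real property set would be larger.
RESIDUES = ("HIS", "CYS", "ASP")
SECONDARY = ("H", "E", "C")


def environment_vector(center, atoms, n_shells=6, shell_width=1.25):
    """Build one flat property vector over concentric radial shells.

    Each shell contributes: atom counts per residue type, atom counts per
    secondary-structure class, summed volume, and summed accessibility.
    """
    per_shell = len(RESIDUES) + len(SECONDARY) + 2
    vector = [0.0] * (n_shells * per_shell)
    for atom in atoms:
        shell = int(math.dist(center, atom.coords) // shell_width)
        if shell >= n_shells:
            continue  # atom lies outside the outermost shell
        base = shell * per_shell
        if atom.residue in RESIDUES:
            vector[base + RESIDUES.index(atom.residue)] += 1
        if atom.secondary in SECONDARY:
            vector[base + len(RESIDUES) + SECONDARY.index(atom.secondary)] += 1
        vector[base + per_shell - 2] += atom.vdw_volume
        vector[base + per_shell - 1] += atom.accessibility
    return vector
```

Two such sets of vectors, one from sites that carry an attribute and one from sites that lack it, would then serve as training data for a supervised classifier.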
SCOP has proven to be a powerful tool for studying known protein structures.8
Because it maintains a complete, annotated classification of all known proteins based on sequence, structure, and functional information, the structural components that characterize a family can be determined. SCOP is a manually curated database and is often used as a gold standard for structural classification of proteins.
Sequence-independent structure-based methods for function assignment are challenging for several reasons. First, aligning local structure is a difficult computational task.9
Second, estimating the statistical significance of the results is challenging.10
Third, scanning through the entire Protein Data Bank (PDB)11
can be computationally demanding. Finally, and perhaps most vexing, structural similarity and functional similarity are not always well correlated.12
We have developed a method for unsupervised mining of structural datasets and for automatically identifying local regions within protein structures that are statistically associated with a given annotation. Methods already exist for unsupervised mining of structural topology. These include VAST,13
the method of Singh and Saha,15
and the method of Dubey et al.16
Our method complements these methods by defining the most structurally significant residue environments for a given classification, based on the structural environments represented in the database. S-BLEST (Structure-Based Local Environment Search Tool) is based on the FEATURE representation of a local environment and rapidly searches databases of vectors of local structure properties. This method is a structural analog of sequence-based similarity search methods such as BLAST.18
We parameterized and evaluated the method by assessing how well selected residue environments in the ASTRAL 40 dataset are associated with their annotated SCOP family.
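As a rough illustration of searching a database of environment vectors and checking family association, the sketch below ranks stored vectors by distance to a query and reports the fraction of top hits that share the query's family label. The Euclidean distance, the value of `k`, the function names, and the toy labels are assumptions made for this example; they are not the scoring scheme or parameters of S-BLEST.

```python
import math


def nearest_environments(query, database, k=3):
    """Return the k database entries closest to the query vector.

    `database` is a list of (family_label, vector) pairs. Euclidean
    distance is an assumed stand-in for the actual scoring function.
    """
    ranked = sorted(database, key=lambda entry: math.dist(query, entry[1]))
    return ranked[:k]


def family_agreement(query_family, hits):
    """Fraction of hits annotated with the same family as the query."""
    if not hits:
        return 0.0
    matches = sum(1 for family, _ in hits if family == query_family)
    return matches / len(hits)


# Toy database with made-up labels and two-dimensional vectors.
database = [
    ("fam_A", [0.0, 0.0]),
    ("fam_A", [1.0, 0.0]),
    ("fam_B", [5.0, 5.0]),
    ("fam_B", [0.5, 0.0]),
]
hits = nearest_environments([0.0, 0.0], database, k=3)
agreement = family_agreement("fam_A", hits)
```

A linear scan like this is only practical for small databases; a rapid search over many structures would need indexing or filtering that this sketch does not attempt.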