Elucidation of a protein’s three-dimensional (3D) structure is viewed as a major step in understanding the molecular basis of its biological function. Knowledge of structure may not be sufficient, however, for understanding the mechanism of function, because biological function often depends on conformational dynamics. Usually, the protein function is associated with particular sequence or structure motifs, and the identification of functional patterns and their role in the overall dynamics of the protein requires additional data and analysis. With the exponential growth in the number of experimentally determined structures in the Protein Data Bank (PDB) (Berman et al., 2000), a wealth of computational methods, usually based on sequence comparisons, have been developed to examine or extract such patterns, while structural dynamics has not been systematically invoked. One can utilize pattern recognition approaches using prior biological knowledge, or adopt pattern discovery methods to find statistically significant patterns that can be tested and verified experimentally.
A small number of residues are usually reported to be directly involved in protein function. This suggests that there are strong correlations between function and microenvironment. Microenvironment refers to the local structure formed by residues that are close in space but not necessarily contiguous along the sequence. Protein function is, however, a collective property of the structure, and it is conceivable that the global protein dynamics is coupled with the local structure near the active site.
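As an illustration of the microenvironment definition above, the following minimal Python sketch collects the residues lying within a distance cutoff of a chosen residue, regardless of their separation along the sequence. The function name `microenvironment`, the use of one representative coordinate per residue, the toy coordinates and the 6.0 Å cutoff are all assumptions made for illustration; they are not taken from the method described in this work.

```python
import math


def microenvironment(coords, center, radius):
    """Return indices of residues within `radius` of residue `center`.

    `coords` holds one representative (x, y, z) position per residue,
    listed in sequence order. The center residue itself is excluded.
    """
    origin = coords[center]
    return [
        i
        for i, xyz in enumerate(coords)
        if i != center and math.dist(origin, xyz) <= radius
    ]


# Hypothetical toy coordinates in angstroms, listed in sequence order.
coords = [
    (0.0, 0.0, 0.0),   # residue 0, the residue of interest
    (3.8, 0.0, 0.0),   # residue 1, adjacent in sequence and in space
    (20.0, 0.0, 0.0),  # residue 2, far away in space
    (30.0, 5.0, 0.0),  # residue 3, far away in space
    (1.0, 4.0, 2.0),   # residue 4, distant in sequence but close in space
]

# Illustrative cutoff of 6.0 angstroms (an assumption, not a value from the text).
neighbors = microenvironment(coords, center=0, radius=6.0)
print(neighbors)
```

In this toy case residue 4 enters the microenvironment of residue 0 although three residues separate them along the chain, which is the situation that purely sequence-based pattern searches can miss.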
A catalytic residue dataset (Bartlett et al., 2002) was built by manually extracting information from primary sources in the literature. A thorough analysis, in terms of secondary structure, solvent accessibility, flexibility, conservation, quaternary structure and function, has been performed on the residues directly involved in catalysis in 178 enzyme active sites. This work provided a good understanding of the molecular features that affect catalytic function and, in particular, the importance of flexibility. However, it is not designed to retrieve information automatically, or to discover new frequent patterns (FPs).
Several web-based databases for searching similar substructures are freely available in the public domain. PROCAT (Wallace et al., 1997) uses a geometric hashing algorithm to build and search 3D enzyme active site templates from conserved geometry. WEBFEATURE (Bagley and Altman, 1995; Liang et al., 2003) applies a Bayesian supervised learning algorithm for a succinct characterization of the site microenvironment, expressed in terms of a set of biochemical properties. However, the results are sensitive to assumptions about background distributions and to the training sites chosen. The PINTS (Patterns in non-homologous tertiary structures) server (Stark and Russell, 2003) allows for searches of a reasonably large number of predefined structure and function motifs in three different ways: (1) protein versus pattern database, (2) pattern versus protein database and (3) pairwise comparison of proteins. This work is significant as it detects similarities in the spatial arrangement of side chains among protein structures without any prior knowledge of the active or binding site (Russell, 1998). It also develops a statistic to calculate the significance of the root-mean-square deviation (RMSD) between spatial positions of equivalent amino acids after optimal superimposition of matching structural patterns (Stark and Russell, 2003). However, the algorithm is developed to find local patterns (radius <7.5 Å) exclusively in non-homologous proteins, and suffers if two similar proteins are compared. It also excludes amino acids whose side chains contain only H and C atoms (Ala, Phe, Gly, Ile, Leu, Pro and Val), as these are not specific enough to discriminate efficiently between correct and false matches.
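The RMSD after optimal superimposition mentioned above can be sketched as follows. This is a minimal example that assumes the two patterns are given as equally sized arrays of already matched atom positions and that uses the standard Kabsch algorithm for the superimposition; the function name `superposed_rmsd` and the toy coordinates are hypothetical, and the sketch covers neither the matching step nor the significance statistic of PINTS.

```python
import numpy as np


def superposed_rmsd(P, Q):
    """RMSD between matched point sets P and Q after optimal rigid superposition.

    P and Q are (N, 3) arrays; row i of P is assumed to correspond to row i of Q.
    The rotation is obtained with the Kabsch algorithm.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("P and Q must both have shape (N, 3)")

    # Remove translation by centering both sets on their centroids.
    P0 = P - P.mean(axis=0)
    Q0 = Q - Q.mean(axis=0)

    # Covariance matrix and its singular value decomposition.
    H = P0.T @ Q0
    U, _, Vt = np.linalg.svd(H)

    # Correct for a possible reflection so that R is a proper rotation.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    # Rotate P onto Q and measure the residual deviation.
    P_rot = P0 @ R.T
    return float(np.sqrt(np.mean(np.sum((P_rot - Q0) ** 2, axis=1))))


# Hypothetical pattern of four non-coplanar points (angstroms).
P = np.array([[0.0, 0.0, 0.0],
              [1.5, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 2.5]])

# Q is P rotated by 40 degrees about the z axis and then translated.
theta = np.deg2rad(40.0)
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
Q = P @ Rz.T + np.array([5.0, -3.0, 1.0])

rmsd = superposed_rmsd(P, Q)
print(rmsd)
```

Because Q differs from P only by a rigid-body motion, the computed RMSD is zero up to floating-point error; any remaining deviation between two real patterns would reflect genuine differences in geometry.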
Two other unsupervised methods have proved successful in discovering novel sequence–structure patterns. I-site (Bystroff and Baker, 1998) is a library of short sequence patterns that strongly correlate with 3D structural elements of proteins. It provides a new methodology for local structure prediction. TRILOGY (Bradley et al., 2002) treats both the sequence and the structure components as patterns, which are identified and extended simultaneously during the search process. Thousands of significant sequence–structure patterns were discovered in this work. However, structurally conserved residues are not necessarily adjacent in the protein sequence and can occur in any order, as in the trypsin-like catalytic triad. Patterns that lack a sequence-pattern component will not be detected by these algorithms.
In this work, a novel unsupervised learning approach is proposed to discover FPs in protein (sub)families. In addition to sequence and structure similarities, structural dynamics are considered. The patterns are thus characterized in terms of the dynamics, biochemistry and geometry of their microenvironment. Without any sequence alignment, structurally conserved residues whose sequence and order are not necessarily well maintained can be identified. Experiments indicate that the microenvironment of the catalytic triad differs among the three protease subfamilies examined, and that the detected patterns correctly distinguish these subfamilies.