The physical interactions between proteins and other molecules in protein crystal structures provide crucial insights into protein function. It is precisely these structures that enable researchers to study interactions in atomic detail and to determine, for example, how a specific mutation in a protein affects its function, or how modifying a few atoms in a small molecule might lead to a more effective drug. With the large number of available crystal structures (nearly 60,000 currently in the RCSB Protein Data Bank), it is of great importance to improve the tools available for studying these interactions.
Moreover, a powerful method of inference can be used to predict function and interactions. It is based on the observation that homologous proteins have similar functions and often interact with their small molecules in a similar manner. Thus it is possible to infer protein-small molecule interactions even if there are no crystal structures available for a particular protein of interest, as long as there are structures of sufficiently close homologs. Recent estimates suggest that the majority of Entrez Protein sequences have homologs with a known structure [1, 2], thereby providing a reasonable chance to find relevant interactions via structures for protein sequences.
Homology inference methods, although powerful, have certain limitations. Common descent does not necessarily imply similarity in function or interactions, and annotations transferred from one protein to a homolog may result in incorrect functional or interolog assignment at larger evolutionary distances [3-6]. To verify and guide annotations, it is often essential to ensure close evolutionary relationships and, at the same time, to characterize the details of interactions in terms of binding site similarity. Current binding site prediction methods can be subdivided into several major categories: those which use evolutionary conservation of binding site motifs [7-9], those which use information about the structure of a complex [10-12], and docking and other methods [13, 14]. Structure-based methods use detailed knowledge of the protein structure to identify binding sites on the basis of the physico-chemical properties of individual residues, their electrostatic contribution, and their location in the 3D structure [15-26].
A number of methods and servers have been developed for predicting protein function by identifying similarities in sequence and structural features of binding pockets in homologous proteins, or evolutionary constraints on residues [27], or by using threading and other approaches [20, 28-39]. The main goal of these methods is to provide functional annotation for proteins out to the most distant homology relationships.
FINDSITE [40], for example, uses threading to find structural templates with bound small molecules for a query protein. The templates are superimposed and the centers of mass of the bound small molecules are clustered to annotate putative binding sites on the query. Threading-based methods, although capable of recognizing distant functional relations, are limited by the complexity of model building and the low reliability of function transfer associated with distant homology [41, 42].
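The idea of clustering ligand centers of mass after template superposition can be illustrated with a minimal sketch. It is not FINDSITE's actual procedure: the greedy assignment scheme, the 4.0 Å cutoff, and the function names `center_of_mass` and `cluster_centers` are all assumptions made for illustration, and the coordinates are taken to be already superimposed onto the query frame.

```python
import math

def center_of_mass(atoms):
    """Mass-weighted centroid of a ligand.

    atoms: list of (mass, (x, y, z)) tuples in the superimposed frame.
    """
    total_mass = sum(mass for mass, _ in atoms)
    return tuple(
        sum(mass * xyz[i] for mass, xyz in atoms) / total_mass
        for i in range(3)
    )

def cluster_centers(centers, cutoff=4.0):
    """Greedy clustering of ligand centers (cutoff in angstroms is an assumed value).

    Each point joins the first cluster whose current centroid lies within
    `cutoff`; otherwise it starts a new cluster. Clusters are returned
    largest first, so the most populated putative site comes first.
    """
    clusters = []
    for point in centers:
        for members in clusters:
            centroid = tuple(
                sum(p[i] for p in members) / len(members) for i in range(3)
            )
            if math.dist(point, centroid) <= cutoff:
                members.append(point)
                break
        else:
            clusters.append([point])
    return sorted(clusters, key=len, reverse=True)
```

In this sketch the size of a cluster serves as a crude proxy for how many templates support a putative site; a real implementation would likely use a more robust clustering scheme and additional scoring.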
Firestar [31] predicts functionally important residues based on PSI-BLAST [43] alignments between the query sequence and structures with functional information derived from the PDB and the Catalytic Site Atlas [44].
PHUNCTIONER [20] uses sequence profiles based on clustered sequences with matching GO [45] terms; potential binding sites are detected from sequence conservation. This method is capable of inferring the location of highly conserved small molecule binding sites, but its predictions may be questionable if the conservation of sites is caused by factors other than binding.
Transitive annotation of small molecule binding sites is also possible by detection of functional domains in the query protein sequence through BLAST heuristics and mapping the functionally important residues and/or features from the domain family members [30, 46].
There are a few other methods that directly detect small molecule binding sites via geometric analysis of protein structures. These methods include LIGSITEcsc [29], CAST [47], PASS [48], SURFNET [49], SCREEN [50], and ConCavity [51]. All of these algorithms attempt to identify solvent-accessible pockets formed by surface residues on the protein, and to rank those pockets (for example by volume), in order to assign the most highly ranked pockets as the predicted/putative small molecule binding sites. LIGSITEcsc, SURFNET, and ConCavity use a more complex ranking function that takes into account residue conservation of binding site residues. These geometric methods are reasonably accurate, achieving success rates of 60-70% in correctly identifying small molecule binding sites. In their evaluation of LIGSITEcsc, the authors showed that their algorithm outperformed the other three methods on a test set of 48 structures [29]. The SCREEN method identifies binding sites geometrically, and also computes feature vectors that are used by machine learning techniques. SCREEN is included in a suite of powerful modeling tools for functional annotation [52].
Recently we have developed a new database and method called "IBIS" (Inferred Biomolecular Interaction Server [53], http://www.ncbi.nlm.nih.gov/Structure/ibis/ibis.html), which enables researchers to conveniently study biomolecular interactions that have been observed in protein structures and, through inference by homology, to formulate predictions and hypotheses about biomolecular interactions even when data for the specific biomolecules are not available. IBIS can therefore be considered a resource for functional annotation of proteins that have relevant homologs in the PDB [54]. An input protein sequence may or may not have a structure itself; if not, it is assigned to the most closely related structure(s) using BLAST. IBIS can identify and infer a protein's interaction partners together with the locations of the corresponding binding sites on the protein query. It provides annotations of binding sites for proteins, small molecules (chemicals), nucleic acids, peptides and ions. In this paper we describe the method used in IBIS to annotate protein-small molecule interactions. To ensure biological relevance of binding sites, IBIS clusters similar binding sites found in homologous proteins based on conservation of sequence and structure of the binding site residues. Binding sites which appear evolutionarily conserved among non-redundant sets of homologous proteins are given higher priority. Additionally, binding site clusters are validated by comparing them with available binding site annotations from a manually curated subset of the CDD database [55, 56], and sites with non-biological small molecules are excluded. After binding sites are clustered, position-specific score matrices (PSSMs) are constructed from the corresponding binding site alignments. Together with other measures, the PSSMs are subsequently used to rank binding sites, both to assess how well they match the query and to gauge their biological relevance with respect to the query.
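To make the PSSM step more concrete, the following is a minimal sketch of how a position-specific score matrix could be built from a gapless alignment of binding site residues and then used to score a query. It does not reproduce the IBIS implementation: the uniform background frequencies, the pseudocount of 1.0, and the names `build_pssm` and `score_query` are assumptions chosen for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned_sites, pseudocount=1.0):
    """Build a log-odds PSSM from aligned binding site residues.

    aligned_sites: equal-length strings, one per binding site in a cluster;
    column i holds the residue each site places at aligned position i.
    Background frequencies are assumed uniform (1/20) for simplicity.
    """
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all aligned binding sites must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(
            site[col] for site in aligned_sites if site[col] in AMINO_ACIDS
        )
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score_query(pssm, query_site):
    """Sum per-position log-odds scores for the query's aligned residues.

    Residues outside the 20 standard amino acids contribute 0.
    """
    return sum(column.get(res, 0.0) for column, res in zip(pssm, query_site))
```

Under these assumptions, a query whose residues at the aligned positions match the conserved residues of a cluster receives a higher score than one that does not, which is the property a ranking scheme of this kind relies on.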
A critical difference between our method and others is that IBIS pays particular attention to ensuring the biological relevance of binding sites, and homology between the unknown query sequence and the known structures of protein complexes. Our method might miss some remote similarities which could be detectable, for example by FINDSITE, but in exchange IBIS's top-ranked annotations should be considered highly reliable. Unlike other methods, IBIS does not filter out similar structures to speed up the search process, but accounts for all structures so that interesting small molecule binding complexes are easily accessible. Our method derives the actual binding sites from observed structures, and groups them to account for variations in the binding site residues due to differences in small molecule size and conformation. This is essential for proteins which are important drug targets, as they have often been co-crystallized with a great variety of inhibitors. The clustering (grouping) of binding sites by similarity is very important because it identifies the distinct binding modes and allows for an easier interpretation of the results, despite the great growth in the amount of structure data over the last several years. As we have shown, it is possible to do the clustering automatically and in a biologically meaningful way.
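The grouping of binding sites by similarity can be sketched as follows. This is only an illustration under stated assumptions: each binding site is represented as the set of alignment positions in contact with the small molecule, similarity is measured by the Jaccard index, sites are merged by single linkage at an assumed threshold of 0.5, and the names `jaccard` and `cluster_sites` are hypothetical. The similarity measure actually used by IBIS combines sequence and structure conservation and is not reproduced here.

```python
def jaccard(a, b):
    """Jaccard index of two sets of aligned binding site positions."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_sites(sites, threshold=0.5):
    """Single-linkage grouping of binding sites (threshold is an assumed value).

    sites: dict mapping a site label to the set of alignment positions
    that contact the small molecule in that structure.
    Returns lists of labels, largest group first.
    """
    names = list(sites)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if jaccard(sites[first], sites[second]) >= threshold:
                parent[find(first)] = find(second)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values(), key=len, reverse=True)

# Placeholder labels and positions, not taken from any real structure.
example_sites = {
    "siteA": {10, 11, 12, 40},
    "siteB": {10, 11, 12},
    "siteC": {80, 81, 85},
}
example_clusters = cluster_sites(example_sites)
```

With these placeholder inputs, `siteA` and `siteB` share most of their positions and fall into one group, while `siteC` forms its own group, mirroring how sites that differ only because of small molecule size would be kept together as one binding mode.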