IBIS is the NCBI Inferred Biomolecular Interaction Server. This server organizes, analyzes and predicts interaction partners and locations of binding sites in proteins. IBIS provides annotations for different types of binding partners (proteins, chemicals, nucleic acids and peptides), and facilitates the mapping of a comprehensive biomolecular interaction network for a given protein query. IBIS reports interactions observed in experimentally determined structural complexes of a given protein, and at the same time infers binding sites and interacting partners by inspecting protein complexes formed by homologous proteins. Similar binding sites are clustered together based on their sequence and structure conservation. To emphasize biologically relevant binding sites, several algorithms are used for verification in terms of evolutionary conservation, biological importance of binding partners, size and stability of interfaces, as well as evidence from the published literature. IBIS is updated regularly and is freely accessible via http://www.ncbi.nlm.nih.gov/Structure/ibis/ibis.html.
Proteins function by interacting with other biomolecules, and a complete protein functional annotation is impossible without knowledge of the protein interactions. Mapping biomolecular interactions is invaluable in deciphering the interactome, the entire set of molecular interactions in a cell. Recent advances in the experimental and computational tools for identifying proteins and their complexes have spawned a wealth of information that encourages such a mapping (1,2).
The most successful function prediction methods rely on evolutionary relationships between proteins and the conservation of their molecular function; they look for sequence similarities between unknown queries and functionally annotated proteins (3,4). A similar approach has been used to infer protein interaction partners from a set of homologous proteins, where an interaction between two proteins is predicted if this interaction has been observed between orthologs (interologs) in other species (5). Homology inference methods have certain limitations, though. Common descent does not necessarily imply similarity in function or interactions, and annotations transferred from one homologous protein to another may result in incorrect functional or interolog assignment at larger evolutionary distances (3,6–8). To verify and guide annotations, it is often essential to detect functionally important binding sites. Current binding site prediction methods can be subdivided into several major categories: those which use evolutionary conservation of binding site motifs, those which use information about a structure of a complex, and docking methods (9).
The knowledge of protein structure may facilitate and improve the annotation of protein function and the characterization of protein binding partners and binding sites. Structure-based methods use detailed knowledge of the protein structure to identify binding sites on the basis of the physico-chemical properties of individual residues, their electrostatic contribution, and their location in the 3D structure (10–14). A number of servers have been developed for predicting protein binding sites from structures by locating the binding pockets, by identifying sequence and structural features of homologous proteins which are important for binding, or by using threading and other approaches (14–22).
We have developed a new database and server called IBIS (Inferred Biomolecular Interaction Server), which provides tools to investigate biomolecular interactions observed in a given protein structure together with the complex set of interactions inferred from its close homologs. IBIS identifies and predicts a protein’s interaction partners together with the locations of the corresponding binding sites on the protein query. It does not focus on one specific type of interacting molecule, but provides annotations of binding sites for proteins, small chemicals, nucleic acids and peptides (interactions with ions are currently under development). This may allow the mapping of a comprehensive biomolecular interaction network for a given query, depending on the data available for its protein family.
To focus on biologically relevant binding sites, IBIS clusters similar binding sites found in homologous proteins based on the sites’ conservation of sequence and structure. Binding sites which appear evolutionarily conserved among non-redundant sets of homologous proteins are given higher priority in the displays. Additionally, binding site clusters are validated by comparing them with binding site annotations from a manually curated subset of the Conserved Domain Database (CDD) (23), if available. In the case of protein–protein binding sites, IBIS also compares its findings to binding interfaces confirmed by the PISA algorithm (24), which estimates the stability of protein–protein interfaces observed in crystal structures. After binding sites are clustered, position-specific scoring matrices (PSSMs) are constructed from the corresponding binding site alignments. Together with other measures, the PSSMs are subsequently used to rank binding sites, to assess how well they match the query, and to gauge the biological relevance of binding sites with respect to the query.
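As an illustration of how a PSSM can be derived from a binding site alignment and used to score a query, the following is a minimal sketch. It is not the IBIS implementation: the uniform background frequencies, the pseudocount of 1 and the function names `build_pssm` and `score_site` are assumptions made for this example, and the toy alignment is invented.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Build a log-odds matrix from equal-length binding site strings.

    Assumptions (not from the paper): uniform background frequencies
    and a fixed pseudocount per residue type.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        # Count only standard residues; gaps and unknowns are ignored.
        counts = Counter(c for c in column if c in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score_site(pssm, site):
    """Sum the column scores of the query residues at the site positions."""
    return sum(col.get(aa, 0.0) for col, aa in zip(pssm, site))

# Toy alignment of three cluster members (invented data).
pssm = build_pssm(["KGS", "KGT", "KAS"])
print(score_site(pssm, "KGS") > score_site(pssm, "WWW"))
```

A query whose residues at the aligned positions resemble those of the cluster members receives a higher score than an unrelated one.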
The current release of the Molecular Modeling Database (MMDB) (25), an automatically parsed and validated derivative of the Protein Data Bank (PDB) (26) hosted by the National Center for Biotechnology Information (NCBI), is used in this study. MMDB addresses several issues in interpreting PDB’s 3D structure data and provides standardized structural information. For example, MMDB attempts to fix atom name ambiguity, establishes chemical graphs that contain explicit bonding information, extracts biopolymer sequences and small molecules that are deposited into the corresponding databases, and cross-references its entries to GenBank, PubMed, the NCBI taxonomy database and PubChem.
Protein–protein interactions are identified and analyzed on the level of domains. The Conserved Domain Search service (CD-Search) provides domain annotation for query sequences and pre-computed annotation for the majority of all entries in NCBI’s Entrez Protein database (27). If a complete protein chain is used as a query, protein–protein interaction annotations are provided separately for each domain identified on this query. Interactions between protein domains may occur on the same protein chain, not involving any other molecule. For other types of binding partners (chemicals, nucleic acids and peptides), interactions are defined for a complete protein chain regardless of its domain annotations and always involve another molecule. To reduce the number of non-biological contacts, a chemical that interacts with multiple chains is treated as a special case. If contacts to that chemical are dominated by one of the chains (>75%), then its interactions with the other chains are not considered; otherwise each protein interaction with the chemical is listed separately.
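The rule for a chemical contacting several chains can be sketched as follows. The 75% threshold comes from the text; how contacts are counted (here, one list entry per contact, labelled by chain) and the function name `chains_to_report` are assumptions for illustration only.

```python
from collections import Counter

DOMINANCE_CUTOFF = 0.75  # from the text: one chain makes >75% of the contacts

def chains_to_report(contact_chains):
    """Return the chains whose interaction with the chemical is listed.

    contact_chains: one chain ID per contact with the chemical
    (an assumed representation, not the IBIS data model).
    """
    counts = Counter(contact_chains)
    total = sum(counts.values())
    if total == 0:
        return []
    chain, n = counts.most_common(1)[0]
    if n / total > DOMINANCE_CUTOFF:
        return [chain]          # the dominant chain only
    return sorted(counts)       # no dominant chain: list every chain

print(chains_to_report(["A"] * 9 + ["B"]))      # chain A makes 90% of contacts
print(chains_to_report(["A"] * 6 + ["B"] * 4))  # 60/40 split
```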
An interaction and its binding site are defined if a protein has at least five residues in contact with another protein, chemical, DNA, RNA or peptide. A contact is defined if any inter-atomic distance between heavy atoms is shorter than 4 Å. The binding site is defined as the group of residues which make contact with a given type of interaction partner. For protein–DNA interactions each DNA strand is considered separately. In the case of protein–chemical interactions, chemical ligands are all validated and standardized (if possible) by the PubChem databases and have explicit links to PubChem (28), which may provide extensive information on their known biological activities. There are two types of interactions and binding sites recorded in IBIS: ‘observed’ from experimental structures and ‘inferred’ from homologs.
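The contact and binding site definitions above translate directly into a distance test. The sketch below uses the 4 Å cutoff and the five-residue minimum from the text; the tuple-based atom representation and the function name `binding_site` are assumptions for this example, and a real implementation would read coordinates from MMDB or PDB records.

```python
import math

CONTACT_CUTOFF = 4.0    # angstroms, from the text
MIN_SITE_RESIDUES = 5   # from the text

def binding_site(protein_atoms, partner_atoms):
    """Return sorted residue IDs in contact with the partner.

    protein_atoms: list of (residue_id, element, (x, y, z))
    partner_atoms: list of (element, (x, y, z))
    Returns [] when fewer than five residues are in contact.
    """
    # Only heavy (non-hydrogen) atoms take part in the contact test.
    heavy_partner = [xyz for element, xyz in partner_atoms if element != "H"]
    site = set()
    for res_id, element, xyz in protein_atoms:
        if element == "H" or res_id in site:
            continue
        if any(math.dist(xyz, p) < CONTACT_CUTOFF for p in heavy_partner):
            site.add(res_id)
    return sorted(site) if len(site) >= MIN_SITE_RESIDUES else []
```

The quadratic scan over atom pairs keeps the example short; for whole structures a spatial index such as a cell list or k-d tree would be the usual choice.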
A flowchart that summarizes the inference of binding partners and sites on a query is presented in Figure 1. First we collect homologs with known structures and higher than 30% identity to the query. To ensure good quality alignments, the VAST structure–structure comparison algorithm is used (29). If a query does not have a structure, the BLAST heuristic is applied to find the most closely related structure (30). The closest homolog (with an E-value <0.01) is picked with a conservative threshold for alignment extent, requiring 80% or more of the query sequence to be aligned.
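For a query without a structure, the selection of the closest structural homolog can be sketched as a filter over BLAST hits. The E-value (<0.01) and alignment extent (80% or more of the query) thresholds are those given above; the dictionary keys, the hit identifiers and the choice of the lowest E-value as "closest" are assumptions for illustration.

```python
def closest_structure(hits, query_length, max_evalue=0.01, min_coverage=0.80):
    """Pick the best hit that passes both thresholds, or None.

    hits: list of dicts with hypothetical keys
          'id', 'evalue' and 'aligned_query_residues'.
    """
    passing = [
        h for h in hits
        if h["evalue"] < max_evalue
        and h["aligned_query_residues"] / query_length >= min_coverage
    ]
    # Assumption: "closest" means the lowest E-value among passing hits.
    return min(passing, key=lambda h: h["evalue"], default=None)

# Invented hits for a 300-residue query.
hits = [
    {"id": "hitA", "evalue": 1e-50, "aligned_query_residues": 150},  # too short
    {"id": "hitB", "evalue": 1e-30, "aligned_query_residues": 270},
    {"id": "hitC", "evalue": 0.5, "aligned_query_residues": 290},    # weak E-value
]
best = closest_structure(hits, query_length=300)
print(best["id"] if best else None)
```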
A binding site cluster represents a collection of structures which are related to the query, and where all members of the cluster contain similar overlapping binding sites when mapped onto the query. Similarity between binding sites is measured in terms of sequence similarity, and those positions which overlap structurally are assigned an additional weight. Binding sites are clustered by a hierarchical complete linkage clustering procedure. To decide on the cutoff for clustering, we use a recently described energy function which maximizes the mean similarity of members within a cluster and minimizes the complexity of the description provided by cluster membership (number of bits required to describe the data) (31). Clusters which contain an actual interaction observed in the query structure are marked by the letter ‘O’. By expanding the cluster one can see additional information about its members.
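Complete linkage clustering of binding sites can be illustrated with a small agglomerative procedure. This is a sketch under strong simplifications: the similarity used here is the Jaccard overlap of query positions rather than the weighted sequence similarity described above, and a fixed cutoff stands in for the energy-function criterion of reference (31). The names `complete_linkage` and `jaccard` are introduced for this example only.

```python
def jaccard(a, b):
    """Stand-in similarity: overlap of two sets of query positions."""
    return len(a & b) / len(a | b)

def complete_linkage(items, similarity, cutoff):
    """Merge clusters until no pair has complete-linkage similarity >= cutoff.

    Returns clusters as sorted lists of item indices.
    """
    clusters = [[i] for i in range(len(items))]

    def link(a, b):
        # Complete linkage: a pair of clusters is only as similar
        # as its least similar pair of members.
        return min(similarity(items[i], items[j]) for i in a for j in b)

    while len(clusters) > 1:
        best, (x, y) = max(
            ((link(a, b), (x, y))
             for x, a in enumerate(clusters)
             for y, b in enumerate(clusters) if x < y),
            key=lambda t: t[0],
        )
        if best < cutoff:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe because y > x
    return [sorted(c) for c in clusters]

# Four binding sites mapped onto query positions (invented data).
sites = [{1, 2, 3, 4}, {2, 3, 4, 5}, {40, 41, 42}, {41, 42, 43}]
print(complete_linkage(sites, jaccard, cutoff=0.4))
```

With complete linkage, every member of a cluster is at least as similar as the cutoff to every other member, which matches the requirement that all members contain overlapping binding sites when mapped onto the query.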
All binding site clusters are ranked in terms of their predicted biological relevance and similarity to the query. The components of the ranking score are the sequence-PSSM score; the average sequence identity between the query and cluster members, calculated over the whole structure–structure alignment; the number of interfacial contacts; and the average sequence conservation of binding site alignment columns. All components of the ranking score are then normalized and all clusters are ranked with respect to the Z-scores.
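The normalization step can be illustrated by converting each of the four components into Z-scores across clusters and combining them. The text does not state how the normalized components are combined, so the unweighted sum below is an assumption, as are the dictionary keys and the example values.

```python
from statistics import mean, pstdev

COMPONENTS = ("pssm_score", "avg_identity", "n_contacts", "conservation")

def zscores(values):
    """Standardize a list of values; a constant component contributes 0."""
    mu, sigma = mean(values), pstdev(values)
    return [0.0 if sigma == 0 else (v - mu) / sigma for v in values]

def rank_clusters(clusters):
    """Return (name, combined score) pairs, best first.

    Assumption: the combined score is the unweighted sum of the
    per-component Z-scores.
    """
    per_component = [zscores([c[k] for c in clusters]) for k in COMPONENTS]
    totals = [sum(column) for column in zip(*per_component)]
    order = sorted(range(len(clusters)), key=lambda i: totals[i], reverse=True)
    return [(clusters[i]["name"], totals[i]) for i in order]

# Invented component values for three clusters.
clusters = [
    {"name": "b", "pssm_score": 5, "avg_identity": 45, "n_contacts": 20, "conservation": 0.6},
    {"name": "a", "pssm_score": 10, "avg_identity": 60, "n_contacts": 30, "conservation": 0.9},
    {"name": "c", "pssm_score": 1, "avg_identity": 35, "n_contacts": 12, "conservation": 0.4},
]
for name, score in rank_clusters(clusters):
    print(name, round(score, 2))
```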
To emphasize biologically relevant binding sites we validate sites according to a few criteria. First, we assess the evolutionary conservation of binding site clusters. Those sites which reoccur in diverse enough protein complexes are ranked higher, an idea which was previously implemented in the Conserved Binding Modes (CBM) database (32). Clusters that have only one non-redundant member (after members with >90% identity are purged) are considered ‘singletons’ and are displayed at the bottom of the interaction summary table with a low rank. Another way to evaluate binding sites is to compare them with manually curated site annotations from the Conserved Domain Database (CDD), which have been extracted from the published literature or derived from manual interpretation of individual three-dimensional structures (23). Binding site clusters which overlap by >50% with a CDD annotation are ranked first. For protein–chemical interactions, we exclude by default chemicals such as buffers, salts, detergents, solvents and ions that are typically added for the purpose of crystallization and/or purification. Most often, these are not relevant with respect to the protein’s biological function. Finally, we employ the PISA algorithm (24) to validate protein–protein interaction interfaces and eliminate those interfaces which appear to be the result of crystal packing.
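The validation criteria above suggest a three-tier display order: clusters matching a CDD annotation first, other clusters by ranking score, and singletons last. The sketch below encodes that reading. The 90% and 50% thresholds come from the text; the greedy redundancy purge, the use of the CDD site as the denominator of the overlap, the placement of singletons below CDD-matched clusters, and all function and key names are assumptions.

```python
def nonredundant(members, identity, max_identity=90.0):
    """Greedy purge: keep a member only if it shares <= 90% identity
    with every member kept so far (assumed purging scheme)."""
    kept = []
    for m in members:
        if all(identity(m, k) <= max_identity for k in kept):
            kept.append(m)
    return kept

def matches_cdd(site, cdd_site, min_overlap=0.5):
    """True if more than half of the CDD-annotated residues are in the site.
    The choice of denominator is an assumption."""
    cdd = set(cdd_site)
    if not cdd:
        return False
    return len(set(site) & cdd) / len(cdd) > min_overlap

def display_order(clusters):
    """clusters: dicts with hypothetical keys
    'name', 'score', 'cdd_match' and 'singleton'."""
    def key(c):
        if c["singleton"]:
            return (2, 0.0)              # no ranking score for singletons
        tier = 0 if c["cdd_match"] else 1
        return (tier, -c["score"])       # higher score first within a tier
    return [c["name"] for c in sorted(clusters, key=key)]
```

A cluster is treated as a singleton when `len(nonredundant(members, identity)) == 1`.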
Currently, a total of 40 716 proteins (151 887 protein chains/domains) are represented in IBIS with at least one type of interaction observed in their structural complexes. As can be seen from Figure 2, protein–protein and protein–chemical interactions are the most frequent types of interactions observed in protein structures. Protein–protein interactions are the most prevalent interactions as reflected by the number of domains involved in interactions and the number of binding sites. The number of inferred interactions is always higher than the number of observed interactions, especially for protein–peptide and protein–nucleic acid interactions, where the number of inferred interactions exceeds the number of observed ones (in terms of the number of protein chains) almost 5-fold. This ratio is even higher for binding site clusters (Figure 2B). Altogether, IBIS provides information on binding partners and binding site locations with averages of 3.4 protein–chemical binding site clusters per chain, and eight protein–protein binding site clusters per domain. The scale of such annotations is approaching the scale of whole interactomes.
IBIS may be queried by supplying either a protein NCBI GenBank identifier or PDB code (the one letter PDB chain identifier is optional). For a given query, it is possible to see different types of interactions, protein–protein, protein–chemical, protein–DNA, protein–RNA and protein–peptide, by navigating through different tabs at the top of the page (the display of protein–ion interactions is currently under development). Figure 3 illustrates an IBIS Interaction Summary page. Observed and inferred binding site clusters are sorted by the ranking score. Each row in the table corresponds to a binding site cluster and can be expanded to show the cluster members.
The main features of binding sites and interaction partners in the Interaction Summary table are as follows:
‘Interaction partner’—name of the interaction partner which interacts with either the actual query (‘observed’ interactions) or homologs of the query from within a given binding site cluster (‘inferred’ interactions). For protein–protein interactions, the CDD domain name of the binding partner is listed. For protein–chemical interactions, the column reports the name of the chemical bound to a representative member of the cluster. For protein–nucleic acid and protein–peptide interactions, the column reports the sequence of the first 20 biopolymer residues from the interaction partner of a representative cluster member.
‘Ranking score’—the score which ranks the binding site clusters in terms of their biological relevance and similarity to the query. The ranking score is not defined for the ‘singleton’ clusters.
‘Number of cluster members’—the number of cluster members. Upon cluster expansion only non-redundant cluster members are displayed (at <90% identity level). A complete list of members can also be viewed by clicking the ‘See all members’ link.
‘Average percent identity to query’—the average sequence identity between the query and the cluster members calculated over all of their structural alignments with the query.
‘Number of binding site residues’—the union of binding sites mapped from all members of the cluster to the query.
‘Number of chemicals’ (for protein–chemical interactions)—the number of unique, standardized chemicals present in a given binding site cluster.
‘Curator annotation’—binding site annotation from the CDD which overlaps by >50% with the sites annotated by IBIS. Binding site clusters with matching CDD annotation are top-ranked irrespective of their ranking score.
‘Taxonomic diversity’—the last common ancestor of the proteins from a given cluster, listed with a link to NCBI’s Taxonomy Browser, so that one can explore all taxonomic groups represented by the cluster.
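The last common ancestor reported under ‘Taxonomic diversity’ can be computed from root-first lineages as their longest shared prefix. The sketch below is illustrative only: the function name is an assumption and the lineages are heavily abbreviated, not full NCBI Taxonomy paths.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-first lineages,
    or None if they share nothing."""
    lca = None
    for ranks in zip(*lineages):      # walk down from the root in parallel
        if len(set(ranks)) != 1:
            break                     # lineages diverge here
        lca = ranks[0]
    return lca

# Abbreviated lineages for illustration.
lineages = [
    ["cellular organisms", "Eukaryota", "Metazoa", "Chordata", "Mammalia"],
    ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda"],
    ["cellular organisms", "Eukaryota", "Metazoa", "Chordata", "Aves"],
]
print(last_common_ancestor(lineages))
```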
The actual binding site residue alignment can be seen upon expanding the clusters, including the PDB codes corresponding to all complex structures summarized by the clusters. It is also possible to view the inferred binding sites projected onto the actual query structure using the Cn3D visualization software (http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml). For the case of protein–protein interactions, the expanded table will provide the PISA validation status for each interaction interface. PISA may not be able to process a particular complex structure; these cases are indicated by an ‘N/A’ symbol.
The features of binding site clusters can be examined by using the ‘Advanced search’ option found on the left side bar. This option allows one to filter the interactions within a given interaction type by various criteria such as the level of sequence identity, structural similarity and the names of interacting partners. In the case of chemical binding sites, for example, it is possible to pick and inspect the various sites a particular chemical may bind to on a given query.
Spleen tyrosine kinase (Syk) is a non-receptor tyrosine kinase, expressed in a wide range of cell types, which plays an important role in immunoreceptor signaling (33). It is an attractive drug target for the treatment of allergic and antibody-mediated autoimmune diseases, as well as breast and gastric cancers. Syk is characterized by two N-terminal SH2 adapter domains, a linker region and a C-terminal catalytic domain. Several drugs/inhibitors target the active site of the Syk catalytic domain and decrease its activity.
Here, we demonstrate how IBIS can be used to annotate the binding sites of the Syk catalytic domain. We start with a Syk sequence for which a structure of the complex with the ligands is available (PDB code: 1XBB); we predict binding sites using IBIS, and finally compare predicted sites with the actual binding sites observed in the structure. First we find the closest homolog with a known structure, a Zap-70 kinase (1U59 Chain A; BLAST E-value of 6e-99 and 77% identity to the query sequence, Figure 2). Second, we use the structure of 1U59 as a query in IBIS and find nine protein–chemical binding site clusters. The top two clusters overlap with the ‘active site/ATP binding site’ CDD annotations. The first binding site cluster includes 360 homologous structures bound to 170 different chemicals. The consensus binding site alignment is 65 residues long, due to the diversity and size variation of the chemicals bound, but it highlights 13 highly conserved residues. The ATP-binding site represents an attractive target for the design of kinase inhibitors, and IBIS provides a concise summary of interactions at that site, which would otherwise require significant comparative analysis. Here IBIS groups and identifies an ATP-binding site, and provides a list of various chemicals, among them many kinase inhibitors, which might potentially bind to and inhibit the query protein. All binding sites observed in the actual structure complex with the anticancer drug imatinib (1XBB) are correctly annotated by IBIS (see table in Figure 4). Interestingly, imatinib binds not only to the ATP-binding site but also to a regulatory myristoylation site on the C-terminus (from binding site cluster #8) that can be annotated on the query sequence.
In addition to chemical binding sites, it is also possible to predict protein interaction partners for the Syk protein. For example, binding site cluster #1 under protein–protein interactions points to a potential SH2 domain binding site which is further validated by CDD curator annotation, although no structural complexes have been solved between Syk and SH2.
In this paper, we presented a comprehensive, web-accessible database, which organizes, analyzes and predicts different types of interaction partners and binding sites in proteins. For proteins with or without known binding partners, IBIS provides a succinct and informative representation of observed binding sites and binding sites inferred from homologs with known 3D structure. It provides analysis of how well a binding site is conserved across members of a homologous protein family. Several structures of the same protein or close homologs with different binding partners may be available in the Protein Data Bank, or the same protein may have been crystallized under different physiological conditions. In such cases, the IBIS database facilitates a detailed classification and analysis of binding sites. IBIS also attempts to validate binding sites by assessing their biological relevance and ranks them accordingly. It can be used to annotate oligomeric states by inferring relevant homo-oligomer interfaces and should prove useful in studying the evolution of protein interactions.
IBIS is updated regularly (currently on a biweekly schedule) to account for the growth of the GenBank, PDB/MMDB, VAST and CDD databases. Recently, it was estimated that almost half of all sequences in the GenBank database have at least one structure homolog with an extensive alignment and at least 30% identical residues (34). As the on-going structural genomics initiative continues to close the sequence-structure gap, IBIS serves as a powerful knowledge-based annotation system for proteins of unknown structure.
Funding. National Institutes of Health/DHHS (Intramural Research program of the National Library of Medicine). Funding for open access charge: National Institutes of Health/DHHS (Intramural Research program of the National Library of Medicine).
Conflict of interest statement. None declared.
The authors would like to thank Yanli Wang and Lewis Geer for useful discussions and Eugene Krissinel for help with the PISA software.