The modulation of protein–protein interactions (PPIs) by small drug-like molecules is a relatively new area of research and has opened up new opportunities in drug discovery. However, the progress made in this area is limited to a handful of known cases of small molecules that target specific diseases. With the increasing availability of protein structure complexes, it is highly important to devise strategies exploiting homologous structure space on a large scale for discovering putative PPIs that could be attractive drug targets. Here, we propose a scheme that allows performing large-scale screening of all protein complexes and finding putative small-molecule and/or peptide binding sites overlapping with protein–protein binding sites (so-called “multibinding sites”). We find more than 600 nonredundant proteins from 60 protein families with multibinding sites. Moreover, we show that the multibinding sites are mostly observed in transient complexes, largely overlap with the binding hotspots and are more evolutionarily conserved than other interface sites. We investigate possible mechanisms of how small molecules may modulate protein–protein binding and discuss examples of new candidates for drug design.
Protein–protein interactions (PPIs) play a key role in numerous biological processes such as cell proliferation, growth, differentiation, signal transduction and apoptosis; moreover, it has been shown that PPIs are disrupted in many diseases including cancer.1,2 This suggests the attractive possibility of manipulating PPIs for therapeutic intervention. However, targeting PPIs is more challenging than traditional drug discovery that, for example, designs small molecules to bind to enzyme active sites. The complications arise from the fact that PPI interfaces are relatively large, less conserved, often flat or more shallow and featureless in contrast with ligand binding pockets.3–7 Presently, there are small-molecule drugs known to affect about 1% of human proteins,8 and 10–15% of all human proteins are considered “druggable”.9 The historical record of drug design and discovery has given rise to the idea that PPIs are much more intractable with respect to small-molecule drug discovery.8 Indeed, for therapeutic use, the chemicals or drugs should be small enough to get inside the cell and also be able to affect the large and often shallow PPI interaction sites.
Nevertheless, a number of studies suggest targeting PPIs to treat some human diseases,10–21 showing that protein–protein interfaces or regions near interfaces might be inherently flexible or intrinsically disordered, allowing a small molecule to penetrate these complexes and displace the protein interaction partner.22–24 Several papers review the progress made in this research area.8,25–29 There is also evidence that small molecules do not have to cover the entire protein–protein binding interface but can instead target only a small number of interface residues, the so-called “binding hotspot” sites, which contribute the most to the binding energy.30
Many approaches so far have focused on discovering druggable PPIs by in silico screening of small-molecule libraries and searching the chemical space of PPI inhibitors. It was found that PPI inhibitors usually represent relatively large rigid small molecules, containing hydrophobic and aromatic groups.4–7 A recent study showed that SVM (support vector machine) kernels can be successfully used to select molecular descriptors for PPI inhibitors, which characterize specific molecular shapes and the presence of a privileged number of aromatic groups.9 Molecular dynamics simulations, drug design and protein docking studies tried to uncover dynamic and physicochemical properties of protein–protein complexes to find those regions and pockets that can be targeted in small-molecule library screenings.31 Recently, a model was proposed to predict the druggability of pharmaceutically important proteins based on the crystal structures of the binding pockets.32 Despite these efforts, the most comprehensive list of known structure complexes with small molecules disrupting protein–protein interfaces is still very limited [27 Protein Data Bank (PDB) complexes in total]26 and is represented by only eight protein families. More recently, a database with an extensive analysis of these known protein–protein interfaces was made available.33
It is clear that a systematic and large-scale analysis of experimentally observed protein–small-molecule and protein–protein complexes is needed to discover potential protein–protein interfaces that, at the same time, have tendencies to bind small molecules. Such an approach has been undertaken recently by looking for homologous complexes in a protein structure database with overlapping protein–protein and protein–small-molecule binding sites.34,35 This study demonstrated that sampling of the space of homologs is an extremely useful and encouraging approach that not only allowed the recovery of known interaction modulators but also provided a list of potential drug targets. However, a large source of error might come from including complexes that are the result of crystal packing interactions. In this regard, specific methods have been developed to predict and confirm the biological relevance of specific interfaces in crystal structures.36,37 Another source of annotation errors is the homology inference itself: common descent does not necessarily imply similarity in function or interactions, so annotations transferred from one protein to a homolog may result in incorrect functional or interolog assignment at larger evolutionary distances.38 In the current study, we applied the homology inference approach using our recently developed IBIS (Inferred Biomolecular Interaction Server) method39,40 to find those protein complexes that can potentially be targeted by small molecules. To ensure biological relevance of binding sites to the query, IBIS clusters similar binding sites found in homologous proteins based on their sequence and structure conservation, further validates them using various approaches and finally ranks binding sites to assess how well they match the query.
We look for those sites that bind both proteins and small molecules and define them as “multibinding sites”. We ask what thermodynamic and structural properties of protein complexes make them more targetable by small molecules. We find that small molecules have a tendency to bind to hotspot residues and preferentially target weaker and more transient protein–protein interfaces. Moreover, we show that multibinding sites are more conserved than the rest of the interface. From the most recent update of the protein structure databank (Research Collaboratory for Structural Bioinformatics PDB41), we compile a nonredundant set of potential PPI interfaces from 642 proteins representing 60 protein families, with strong evidence of multibinding and potential properties of small-molecule PPI inhibitors.
Protein complexes from the current release of the Molecular Modeling Database (MMDB),55 an automatically parsed and validated derivative of the Research Collaboratory for Structural Bioinformatics PDB,41 are used in this study. PPIs are identified and analyzed at the domain level. The domain assignment is performed by searching the protein sequence against a comprehensive collection of domain models in the Conserved Domain Database (CDD).56 PPIs are recorded between different functional domains in the same chain or between different chains from a protein complex.
Protein–small-molecule and protein–peptide interactions are defined for a complete protein chain regardless of its domain annotations. Peptide is defined as a segment of polypeptide chain of 20 amino acids or fewer. Both protein–protein and protein–peptide interactions are considered as PPIs in this study. For small molecules bound to multiple chains, the interaction is assigned to the chain with dominant contacts (>75% of the contacts); otherwise, each protein interaction with the small molecule is recorded separately. An interaction is defined if a protein domain/chain has at least five residues in contact with another protein, small molecule or peptide, and two residues are said to be in contact if any of the heavy-atom interatomic distances is smaller than 4 Å. The “binding site” refers to a group of residues that make a contact with an interaction partner.
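The contact and interaction criteria above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the IBIS implementation: the data layout (a dict mapping a residue identifier to a list of heavy-atom coordinates) and all function names are hypothetical, while the 4 Å cutoff and the five-residue minimum are taken from the text.

```python
from math import dist

CONTACT_CUTOFF = 4.0      # Angstrom; heavy-atom distance threshold from the text
MIN_CONTACT_RESIDUES = 5  # minimum number of contacting residues for an interaction


def residues_in_contact(atoms_a, atoms_b, cutoff=CONTACT_CUTOFF):
    """True if any heavy-atom pair between the two residues is closer than cutoff."""
    return any(dist(p, q) < cutoff for p in atoms_a for q in atoms_b)


def binding_site(chain, partner, cutoff=CONTACT_CUTOFF):
    """Residues of `chain` that contact any residue (or ligand) of `partner`.

    Both arguments are hypothetical dicts: residue id -> list of (x, y, z)
    heavy-atom coordinates.
    """
    return {
        res_id
        for res_id, atoms in chain.items()
        if any(residues_in_contact(atoms, other, cutoff) for other in partner.values())
    }


def is_interaction(chain, partner):
    """An interaction is recorded only if at least five residues are in contact."""
    return len(binding_site(chain, partner)) >= MIN_CONTACT_RESIDUES
```

A small molecule can be passed as a `partner` dict with a single entry holding all of its heavy atoms.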
We have used the IBIS method described earlier39,40 to analyze experimentally observed complexes and at the same time infer interaction partners and binding sites in proteins without known complexes by inspecting homologs with known interactions. For a given query protein, IBIS collects all its homologs with known structures of complexes from MMDB that have significant structural similarity and at least 30% sequence identity to the query as calculated from the structure–structure VAST57 alignment. IBIS then clusters binding sites using a complete-linkage clustering algorithm. Binding site similarity is assessed based on the structural alignment using similarity scores. At the end of this step, a list of all inferred binding site clusters and binding partners (chain/domains, small molecules and peptides) is compiled, which is derived from homologous structural complexes. We refer to these inferred binding site clusters as inferred binding sites. For each interaction type (protein–protein, protein–small molecule and protein–peptide), all binding site clusters are ranked in terms of their biological relevance and similarity to the query. The components of the ranking score include the sequence PSSM score, the average sequence identity between the query and cluster members calculated over the whole structure–structure alignment, the number of interfacial contacts and the average sequence conservation of binding site alignment columns. The binding site clusters that contain the observed interactions of the query are regarded as observed binding sites, and the rest are defined as inferred binding sites coming from the homologs.
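To make the clustering step concrete, the following is a minimal sketch of complete-linkage clustering applied to binding sites. It is not the IBIS code: the Jaccard distance on sets of aligned residue positions stands in for the IBIS similarity scores, which are not specified here, and the function names and the `max_distance` parameter are assumptions made for the example.

```python
def jaccard_distance(site_a, site_b):
    """1 - |a & b| / |a | b| for two sets of aligned residue positions."""
    union = site_a | site_b
    return 1.0 - len(site_a & site_b) / len(union) if union else 0.0


def complete_linkage(sites, max_distance):
    """Agglomerative clustering of binding sites (lists of site indices).

    Two clusters are merged only while the largest pairwise distance between
    their members (the complete-linkage criterion) stays <= max_distance.
    """
    clusters = [[i] for i in range(len(sites))]

    def linkage(c1, c2):
        return max(jaccard_distance(sites[i], sites[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        # Find the closest pair of clusters under complete linkage.
        d, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if d > max_distance:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```

Complete linkage keeps clusters compact: a site joins a cluster only if it is similar to every existing member, which suits the goal of grouping sites that occupy the same location on the query.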
In addition to the ranking scheme, which aims to rank the inferred binding sites using evolutionary relatedness with respect to the query protein, we used other sources. In the case of protein–small-molecule interactions, small molecules were all validated and standardized by the PubChem database,58 which often provides extensive information on their known biological activities. Small molecules with less than five heavy atoms and/or having a molecular mass outside the range of 70–800 Da were ignored in this study. We also excluded nonbiological small molecules based on the list used in our previous study.40 A small-molecule-inferred binding site is deemed nonbiological if all the bound small molecules in an inferred binding site are nonbiological.
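The small-molecule filters described above can be written as two short predicates. This is a sketch under stated assumptions: the function names are hypothetical, the range limits are treated as inclusive, and the set of nonbiological ligand identifiers passed in is a placeholder for the list from the earlier study, which is not reproduced here.

```python
MIN_HEAVY_ATOMS = 5            # molecules with fewer heavy atoms are ignored
MASS_RANGE_DA = (70.0, 800.0)  # molecular mass window in Da, assumed inclusive


def passes_size_filter(heavy_atoms, mass_da):
    """True if the small molecule is large enough and within the mass window."""
    low, high = MASS_RANGE_DA
    return heavy_atoms >= MIN_HEAVY_ATOMS and low <= mass_da <= high


def site_is_nonbiological(ligand_ids, nonbiological_ids):
    """A site is nonbiological only if every bound small molecule is nonbiological.

    `ligand_ids` is assumed to be non-empty; `nonbiological_ids` is a
    hypothetical stand-in for the exclusion list used in the study.
    """
    return all(ligand in nonbiological_ids for ligand in ligand_ids)
```

Under this rule, a single biologically relevant ligand is enough to keep an inferred site, even when crystallization additives occupy the same location in other homologs.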
Likewise for PPIs, the oligomeric states and binding interfaces were verified using the PISA algorithm,36 which identifies biologically relevant interfaces present in crystal structures. If all interfaces in an inferred PPI binding site are invalid according to PISA, the site is deemed nonbiological.
To detect those sites that can bind to both proteins/peptides and small molecules, for a given protein with known structure, we extract all IBIS-inferred sites interacting with other proteins, peptides and small molecules. An inferred binding site is a union of the observed binding site residues of the members of an inferred binding site cluster. The overlap score between protein–protein binding sites and protein–small-molecule binding sites is calculated as:
where Na is the number of residues in protein–protein binding site “a”, Nb is the number of residues in the protein–small-molecule binding site “b” and Nab is the number of residues in the intersection of the binding sites “a” and “b”, that is, the number of multibinding residues (as indexed with respect to the query). The multibinding sites are defined as those having an SC score greater than 0.5.
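As an illustration of how such an overlap score could be computed from residue sets, consider the sketch below. The three counts and the 0.5 threshold follow the definitions above, but the normalization (shared residues divided by the size of the smaller site) is an assumption made for this example and may differ from the score actually used in the study; the function names are hypothetical.

```python
MULTIBINDING_THRESHOLD = 0.5  # sites with a score above this are multibinding


def overlap_score(ppi_site, small_molecule_site):
    """Overlap between two binding sites given as sets of query residue indices.

    ASSUMPTION: the score is N_ab / min(N_a, N_b). Only N_a, N_b and N_ab are
    defined in the text; this normalization is illustrative.
    """
    n_a, n_b = len(ppi_site), len(small_molecule_site)
    if n_a == 0 or n_b == 0:
        return 0.0
    n_ab = len(ppi_site & small_molecule_site)  # multibinding residues
    return n_ab / min(n_a, n_b)


def is_multibinding(ppi_site, small_molecule_site):
    """True if the overlap score exceeds the 0.5 threshold."""
    return overlap_score(ppi_site, small_molecule_site) > MULTIBINDING_THRESHOLD
```

With this normalization, a small ligand site that lies mostly inside a large protein–protein interface scores highly even though it covers only a fraction of that interface.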
Currently, a total of 239,395 protein chains/domains from 61,413 protein structures are present in IBIS with at least one type of interaction either observed in their structural complexes or inferred from their homologs†. Our method allows analyzing the mechanisms of how a small molecule competes with a natural protein partner.
First, we focused on those cases where protein–protein and protein–small-molecule complexes are available as separate structures in the structure databases. Some of the known examples are shown in Fig. 1 with the structure superpositions of the two different complexes of the same protein. We first tested if our method can recover known examples from the literature of PPIs that are modulated by small molecules. The method successfully recovered six out of eight known cases (Table SM1 and Fig. 1). The interaction between IL2 and IL2-alpha receptor was not recovered because of a partial Conserved Domain Database (CDD) domain mapping (see Materials and Methods). The case of tumor necrosis factor-alpha trimer dissociation mediated by a small molecule has been missed due to the stringent overlap threshold used in our method. In a survey of all the observed PPI and small-molecule interactions in the current PDB, 3223 domains/chains were found to have their observed PPI interfaces overlap with observed or inferred small-molecule binding sites; on the other hand, 4532 protein domains/chains have their observed small-molecule binding sites overlapping with observed or inferred PPI interfaces.
The same protein may be represented in multiple structures solved under different conditions or with mutations and/or in complex with several different small molecules. To account for this, we inferred PPI from close homologs with more than 90% sequence identity and found 6255 chains/domains in PDB with multibinding interfaces. A few examples are presented in Table 1, and the complete list can be accessed from the Web page provided in Supplementary Information.
For each protein chain in PDB/Molecular Modeling Database (MMDB), we assembled a comprehensive list of inferred protein–protein and protein–small-molecule binding sites using IBIS. Since inferred binding sites represent the consensus of binding sites from close homologs, their location reflects the conformational diversity and variability of homologous complexes and at the same time differences in the sizes of small molecules. A total of 27,340 protein chains from 501 CDD families have been determined to contain inferred multibinding sites, and a nonredundant set of 642 chains (culled at 50% sequence identity) from 60 families was compiled with multibinding interfaces that are biologically relevant according to the PISA (Protein Interfaces, Surfaces, and Assemblies server) algorithm.36
Although establishing the biological validity of each of these multibinding sites still requires experimental verification, these sites might be used as starting points to target small-molecule PPI inhibitors. We provide additional annotation for the multibinding sites including their PISA status, biological relevance of small molecules or verification of small molecules using DrugBank42 in Supplementary Materials. For example, of the 642 protein chains, about 400 have at least one multibinding site in which the bound small molecule is also biological.
It should be mentioned that protein structures in PDB often contain additives, detergents and other types of substances used for crystallization. These are not true biological ligands, but they can sometimes be difficult to distinguish from the biologically relevant ones. In the current study, we distinguish the most common biological ligands (see Supplementary Materials). Indeed, the crystallization agents, if present, may sometimes provide additional insights into the binding interfaces on a protein. For example, the nonbiological small molecule in the interface between the E2 protein and E1 helicase of human papillomaviruses probably defines additional regions of the binding pocket that could be exploited to design more potent inhibitors21 (Fig. 1d).
We also note that the PISA program36 is regarded as a state-of-the-art method for the annotation of biological assemblies, with an accuracy estimated at 80–90%. One might expect that the author-determined biological units in the PDB files would be the most reliable. However, it is not difficult to find examples where the annotations in the PDB file are not consistent with the authors' own paper; some authors are quite diligent about this, whereas others may not be so careful. Moreover, an estimate for the accuracy of the author-determined annotations is lacking. Therefore, a highly dependable method such as PISA provides the most consistent results, although researchers should be careful and should consult the relevant literature when accuracy is critical.
We calculated a conservation score using the Shannon entropy with the Henikoff–Henikoff sequence weights for each position in a multibinding site based on the binding site cluster alignments. The highly conserved residues are most likely critical for binding and may well be the binding hotspot residues, namely, those that contribute the most to the binding energy of the protein complex. We analyzed the nonredundant set of protein chains in our data set to check how often the multibinding residues are predicted to be binding hotspots. Hotspot residues were predicted using the PCRPi (Presaging Critical Residues in Protein interfaces) method, which integrates a number of different metrics involving sequence and structure, sequence conservation, “topographical index”, computational alanine scanning and others into a probabilistic measure by using Bayesian networks.30 A residue in an inferred binding site is considered a hotspot if the corresponding residue in the homolog contributing to the site is annotated as a hotspot by PCRPi.
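The conservation calculation can be sketched as follows: position-based Henikoff–Henikoff weights are computed from the alignment, and each column's Shannon entropy is then taken over the weighted residue frequencies. This is a minimal sketch, not the code used in the study. It assumes equal-length aligned strings, treats a gap character as just another symbol, and reports the raw entropy in bits (lower means more conserved) because the exact scaling of the published score is not given here.

```python
from collections import Counter
from math import log2


def henikoff_weights(alignment):
    """Position-based sequence weights, normalized to sum to 1.

    In each column, a sequence receives 1 / (k * n), where k is the number of
    distinct symbols in the column and n is the number of sequences sharing
    this sequence's symbol.
    """
    weights = [0.0] * len(alignment)
    for column in zip(*alignment):
        counts = Counter(column)
        k = len(counts)
        for i, residue in enumerate(column):
            weights[i] += 1.0 / (k * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]


def weighted_entropy(column, weights):
    """Shannon entropy (bits) of one column using weighted frequencies."""
    freqs = Counter()
    for residue, w in zip(column, weights):
        freqs[residue] += w
    return -sum(p * log2(p) for p in freqs.values() if p > 0)


def column_entropies(alignment):
    """Weighted entropy of every alignment column; 0 means fully conserved."""
    weights = henikoff_weights(alignment)
    return [weighted_entropy(column, weights) for column in zip(*alignment)]
```

The weighting reduces the influence of near-identical homologs, so a cluster dominated by many copies of one protein does not appear more conserved than it is.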
Among the 642 nonredundant protein sequences with biological interfaces and multibinding sites, 259 chains had at least one hotspot on their multibinding inferred site. We found that the association between multibinding sites and hotspots is statistically significant (χ² p-value 0.01) (Fig. 2), which points to the critical role of binding hotspots in modulating PPIs by small molecules.
At the same time, the analysis of residues in the protein–protein interfaces showed that multibinding residues are more evolutionarily conserved than the rest of the interface (χ² p-value 0.01) (Fig. 3), and hotspots on multibinding interfaces are more conserved than the rest of the multibinding interface (χ² p-value 0.01). Previously, it was shown that binding hotspots are more evolutionarily conserved than the rest of the interface.43
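For readers who want to reproduce this kind of association test, a χ² test on a 2×2 contingency table (for example, interface residues classified as multibinding or not, and as hotspot or not) can be sketched as below. The function name and the table layout are assumptions, no continuity correction is applied, and the residue counts underlying the reported p-values are not given in this text, so any numbers passed in would be hypothetical.

```python
from math import erfc, sqrt


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test for the 2x2 table [[a, b], [c, d]].

    Returns (chi2, p_value) with one degree of freedom. For 1 dof the
    survival function of the chi-squared distribution is erfc(sqrt(chi2 / 2)).
    All four marginal totals must be non-zero.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value
```

Here `a` would hold, for instance, the number of multibinding residues that are hotspots, `b` the multibinding non-hotspots, and `c` and `d` the corresponding counts for the remaining interface residues.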
We note that small-molecule binding sites can overlap protein–protein binding sites for different functional reasons. First, small molecules can represent natural substrates of enzymes that, in turn, can be inhibited through the mechanism of competitive binding by other proteins. A second group consists of all other cases where small molecules modulate PPIs. Automatically classifying a small molecule as a native substrate is a challenging task, and there is no such information provided in the PDB files. Therefore, we used a data source from the previous study,44 which compared the PDB ligands to small molecules from the ENZYME and KEGG databases using graph matching algorithms to assess the chemical similarity between small molecules. We mapped these native/cognate ligand annotations to inferred binding sites in IBIS. A total of 108 proteins from the nonredundant set of 642 proteins were found to have at least one multibinding site for a cognate ligand/native substrate. The analysis of these multibinding sites with native substrates in these 108 proteins showed a similar trend: multibinding residues are more conserved and have a higher tendency to include binding hotspots (χ² p-value 0.01). However, the first group of enzyme–substrate complexes showed a lower hotspot frequency but a higher evolutionary conservation of multibinding residues compared to the second group of multibinding interfaces.
The functional enrichment of the nonredundant set of 642 multibinding proteins has been assessed using a selected list (GO slim) of gene ontology (GO)45 functional terms. We define a study group (the set of multibinding proteins found in PDB in our study) and a population group (all of the proteins in PDB). The frequency of annotation to a GO term for the study group is then compared to that of the overall population, which compensates for the functional bias in the PDB. We used Ontologizer,46 which performs a modified Fisher's exact test with correction for multiple testing and also takes the parent–child relationship47 in the GO hierarchy into consideration. GO assignments to PDB entries have been derived from the “gene_association.goa_pdb” gene association file provided by the UniProtKB-GOA group. We found that metabolic, regulation, transducer activity, electron carrier activity and multicellular organismal processes are significantly enriched with multibinding proteins (χ² p-value 0.01) (Fig. 4). An analysis based on the assigned Enzyme Commission48 numbers of the nonredundant set of multibinding proteins and the rest of the proteins in PDB showed that multibinding sites are significantly overrepresented in enzymes (χ² p-value 0.01), that is, protein chains with assigned Enzyme Commission numbers.
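The core of the study-group versus population comparison can be illustrated with a plain one-sided Fisher's exact test, which is equivalent to a hypergeometric tail probability. This sketch is a simplification and not a reimplementation of Ontologizer: it omits the parent–child adjustment and the multiple-testing correction, and the function and parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(study_hits, study_size, pop_hits, pop_size):
    """One-sided p-value for over-representation of a GO term.

    Probability of seeing at least `study_hits` annotated proteins when
    `study_size` proteins are drawn without replacement from a population of
    `pop_size` proteins, of which `pop_hits` carry the annotation.
    """
    max_hits = min(study_size, pop_hits)
    tail = sum(
        comb(pop_hits, k) * comb(pop_size - pop_hits, study_size - k)
        for k in range(study_hits, max_hits + 1)
    )
    return tail / comb(pop_size, study_size)
```

In the setting described above, `study_size` would be the 642 multibinding proteins and `pop_size` the number of annotated proteins in PDB; a p-value would be computed per GO term and then corrected for the number of terms tested.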
We tested the hypothesis that the ability of small molecules to modulate and inhibit PPIs depends on the stability of protein–protein complexes. We estimated the free energy of dissociation ΔGdiss by using the PISA algorithm. We compared the stability of dimeric complexes containing multibinding sites against all nonredundant dimers in the PDB as determined by PISA. As can be seen in Fig. 5, dimers with multibinding sites are less stable compared to the other dimers (t-test p-value = 0.016). Indeed, proteins with multibinding sites constitute a significantly smaller fraction among permanent protein complexes with the dissociation constant in the nanomolar-to-picomolar range and ΔGdiss of 20–30 kcal/mol or higher.
A majority of multibinding proteins observed in structure databases41 include synthetic small molecules targeting protein inhibitor binding sites. The modulation of PPIs is carried out through disruption, inhibition, stabilization or allosteric regulation. For example, kirromycin antibiotics such as aurodox lock EF-Tu (elongation factor-thermo unstable) in its EF-Tu/GTP conformation, preventing its release from the ribosome, which illustrates the mediation of a PPI by a small molecule. Similarly, the fungal phytotoxin fusicoccin stabilizes the interaction of 14-3-3 with PMA2, a plant proton pump.49 Two more small molecules, pyrrolidone 1 and epibestatin, have recently been found to stabilize the 14-3-3–PMA2 interaction.50 Rapamycin, a potent immunosuppressive drug, mediates the interaction between the human signaling proteins FKBP12 and FRAP that do not normally interact with one another.51 The overlap of the rapamycin binding site (1FAP_A: RAP) with type I TGF-beta receptor in complex with FKBP12 (1B6C_A:B) suggests a possible role of rapamycin in TGF-beta signaling. Indeed, FKBP12 binding to the TGF-beta receptor shields a regulatory segment of the receptor from phosphorylation, which maintains the receptor in its inactive state. Therefore, rapamycin bound to FKBP12 should permit easier activation of the TGF-beta receptor.52
We found that the GTPase (Cdc42) and cell polarity protein (Par6) PDZ domain interface (1NF3_A:C) overlaps with the AMP binding site of a small GTPase Rab1b (3NKV:AMP), suggesting a possible regulation of Par6 PDZ binding to Cdc42 through AMPylation. It has also been shown that mutational disruption of Cdc42–Par6 PDZ coupling leads to inactivation of Par6 in a certain type of epithelial cells.53 Another example is the cresidine binding site of Cu–Zn superoxide dismutase (SOD1) (2WZ0_F:ZZT), which overlaps with the dimer interface of Cu–Zn superoxide dismutase (2NNX_D:A). Recently, this binding site has been annotated as “druggable” for therapeutic purposes for SOD1-associated motor neuron disease.54 A complete list of all the protein chains with observed and inferred multibinding sites is available in Supplementary Materials.
The rapid increase in data on protein sequences and structures is posing new challenges to interpret and use these data productively. While many studies have presented novel methods for functional annotation of these sequences and structures, the annotations in public databases are still error prone or often hypothetical. Even when structural data are available for a particular protein, it can be inconclusive or hard to interpret. Manual curation by an expert, which is the most rigorous and reliable means to legitimize a functional site, is limited by the sheer volume of the data.
Toward addressing this problem, we recently developed an algorithm (IBIS) to analyze and conservatively annotate binding sites in proteins based on knowledge gained via homologous complexes. One important advantage of IBIS-derived sites is that they are weighted based on their recurrence in homologous proteins and ranked using binding site sequence and structure conservation. In the current study, we used the IBIS database to discover many protein–protein interfaces that potentially also bind small molecules. We have found that about 33% of all protein chains/domains with both PPI and small-molecule sites inferred from their homologs have multibinding sites. The likelihood that these sites are biologically relevant is increased by the conservative thresholds built into our method. The GO analysis suggests that multibinding sites are often enriched in metabolic processes. This is also reflected by the significant number of multibinding sites in GO enzyme annotations.
An earlier study showed that those positions, which may bind both small molecules and other proteins, are less conserved compared to monofunctional sites and also exhibit different amino acid propensities.34 However, we found that multibinding sites are more evolutionarily conserved and more likely to contain binding hotspots than other interface positions that are potentially involved in binding of one partner. Moreover, it has been shown previously that hotspots are more conserved than other interface residues,43 and we confirmed this result here with regard to hotspots on multibinding interfaces. Since hotspot residues contribute the most to the binding energy of PPIs, we suggest that binding of small molecules to these positions will have the most disruptive effect on those interactions. It is especially true if these interactions are not very strong. Indeed, as was shown in our study, small molecules mostly bind to transient protein–protein complexes.
It should be mentioned that the apparent conflict between our conservation results for multibinding residues and the results presented previously34 is most likely due to important differences in methodology between the two studies. For instance, the earlier work used SCOP and CATH domain definitions and classifications, which allow for considerably more remotely related proteins and may affect the reliability of homology inference. IBIS uses a conservative threshold for homology inference (at least 30% sequence identity), and we observe that multibinding sites are under stronger evolutionary constraints than even the fairly conserved family background.
Binding of small molecules to proteins and protein complexes could cause a shift of equilibrium in favor of a subset of conformations that has higher or lower preferences to binding another partner. A small molecule may bind at a site far from the protein–protein interface and regulate protein binding through allosteric mechanisms or might bind at or near the protein–protein interface and directly influence the binding. In this study, we focused on the latter case. We showed that small molecules can bind to hotspots and through competitive binding prevent PPI. At the same time, we show examples of small molecules mediating PPIs. Other mechanisms of small molecule binding and their functional roles might be elucidated in future studies when more structural complexes are available.
This work was supported by National Institutes of Health/Department of Health and Human Services (Intramural Research Program of the National Library of Medicine).
†All multibinding sites observed in our study for proteins in the PDB are accessible from http://www.ncbi.nlm.nih.gov/Structure/ibis/P-D/multibinding.html.