To date, structural information has had relatively little impact on the construction of protein-protein interactomes, primarily because far fewer proteins have an experimentally determined structure than have a known sequence. For example, as of early 2010, the PDB (Protein Data Bank) provides structures for ~600 of the total complement of ~6,500 yeast proteins (~10%), while structural coverage of protein-protein complexes is even sparser, with only about 300 structures available out of the approximately 75,000 PPIs (<0.5%) recorded in publicly available databases. However, ~3,600 additional yeast proteins have homology models in either the ModBase10 databases. Moreover, there were about 37,000 protein-protein complexes derived from multiple organisms in the PDB and PQS12 (Protein Quaternary Structure) databases that might be used as “templates” to model PPIs. Clearly, if structure is to be useful on a large scale, it is essential that modeling of individual proteins and of complexes be exploited.
A number of studies have used structurally characterized complexes as “templates” to construct models of complexes that might be formed between proteins that have been classified as having sequence and/or structural relationships to the proteins in the template13–15. Here we search more broadly for templates using geometric relationships between groups of secondary structure elements as revealed by structural alignment, independently of how they are classified. It has been demonstrated that even distantly related proteins often use regions of their surface with similar arrangements of secondary structure elements to bind to other proteins16–18, suggesting the possibility of significantly expanding the number of putative PPIs that can be identified. It is likely that further expansion can be achieved if interactions involving unstructured regions of proteins are taken into account, but these are not considered in the current work.
Our approach to the prediction of PPIs is embodied in an algorithm we have named PrePPI (Predicting Protein-Protein Interactions) that combines structural and non-structural interaction clues using Bayesian statistics (see online Methods for details). The structural component of PrePPI involves a number of steps. Briefly, given a pair of query proteins (QA and QB), we first use sequence alignment to identify structural representatives (MA and MB) that correspond to either their experimentally determined structures or homology models. We then use structural alignment to find both close and remote structural neighbors (NAi and NBj) of MA and MB (an average of ~1,500 neighbors is found for each structure). Whenever two (e.g. NA1 and NB3) of the over 2 million pairs of neighbors of MA and MB form a complex reported in the PDB, this defines a template for modeling the interaction of QA and QB. Models of the complex are created by superimposing the representative structures on their corresponding structural neighbors in the template (i.e., MA on NA1 and MB on NB3). This procedure produces about 550 million “interaction models” for about 2.4 million PPIs involving about 3,900 yeast proteins, and about 12 billion models for about 36 million PPIs involving about 13,000 human proteins. Note that an interaction model is based on structure-based sequence alignments of query proteins to their individual templates (Figure S1) and that we do not construct a three-dimensional model of each complex, since the scoring of so many individual complexes would be prohibitively time consuming using standard energy functions (for example as used in docking19).
Predicting protein-protein interactions using PrePPI
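The template search described above can be illustrated with a minimal Python sketch. It is not the PrePPI implementation: the function name, the chain identifiers of the form "PDBID_CHAIN", and the representation of PDB complexes as a set of chain pairs are all assumptions made for illustration. The sketch only shows the pairing logic, namely that every pair of structural neighbors (one of MA, one of MB) found together in a known complex defines a template.

```python
from itertools import product


def find_templates(neighbors_a, neighbors_b, pdb_complexes):
    """Return (NA, NB) neighbor pairs that form a known complex.

    neighbors_a, neighbors_b: chain identifiers (hypothetical "PDBID_CHAIN"
        format) of the structural neighbors of MA and MB.
    pdb_complexes: set of frozensets, each holding two chain identifiers
        reported to be in contact (hypothetical, precomputed input).
    """
    templates = []
    for na, nb in product(neighbors_a, neighbors_b):
        # A chain cannot serve as both sides of the template at once.
        if na != nb and frozenset((na, nb)) in pdb_complexes:
            templates.append((na, nb))
    return templates


# Toy data: two neighbors of MA, three neighbors of MB, two known complexes.
neighbors_of_ma = ["1abc_A", "2xyz_B"]
neighbors_of_mb = ["1abc_B", "3def_A", "2xyz_B"]
known_complexes = {
    frozenset(("1abc_A", "1abc_B")),
    frozenset(("2xyz_B", "3def_A")),
}

templates = find_templates(neighbors_of_ma, neighbors_of_mb, known_complexes)
print(templates)
```

With ~1,500 neighbors per structure, the loop visits on the order of 2 million neighbor pairs per query pair, which is why the membership test is done against a hashed set rather than by scanning the list of complexes.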
Once an interaction model has been created, it is evaluated using a combination of five empirical scores that measure properties derived from alignments of the individual monomers to their templates (Figure S1). The first score, SIM, depends on the structural similarity between models of the two query proteins (i.e. MA and MB) and those in the template complex (i.e. NA1 and NB3). The next two scores determine whether the interface in the template complex actually exists in the model. They are SIZ, the number, and COV, the fraction, of interacting residue pairs in the template (e.g. NA1-NB3) that align to some pair of residues in the model (MA-MB). The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (e.g., residue type, evolutionary conservation, or statistical propensity to be in protein-protein interfaces). This information is obtained from three publicly available servers that predict interfacial residues based on the sequence and structure of the individual subunits of the model20–22. These scores are OS, which is identical to SIZ but with the additional requirement that both residues in an interacting pair of the template align to predicted interfacial residues in MA and MB, and OL, the number of template interfacial residues that align to predicted interfacial residues in MA and MB. We note that although the interaction models produced by our procedure can reveal the approximate locations of potential interfaces, they will not, in general, be accurate at atomic resolution.
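The four alignment-based scores can be made concrete with a short Python sketch that follows the verbal definitions above. The data layout is an assumption: the template interface is given as a list of residue-index pairs, each alignment as a dictionary from template residue to model residue, and the server predictions as sets of model residues. SIM is omitted because it depends on the structural alignment itself.

```python
def interface_scores(template_pairs, aln_a, aln_b, predicted_a, predicted_b):
    """Compute SIZ, COV, OS and OL for one interaction model.

    template_pairs: list of (i, j) interacting residue pairs in the template,
        i on the first template chain, j on the second.
    aln_a, aln_b: dicts mapping template residues to aligned model residues
        (MA and MB); unaligned template residues are simply absent.
    predicted_a, predicted_b: sets of model residues predicted to be
        interfacial (hypothetical output of the prediction servers).
    """
    # Template pairs whose two residues both align to the model.
    aligned = [(i, j) for i, j in template_pairs if i in aln_a and j in aln_b]
    siz = len(aligned)
    cov = siz / len(template_pairs) if template_pairs else 0.0

    # OS: as SIZ, but both aligned residues must be predicted interfacial.
    os_score = sum(
        1 for i, j in aligned
        if aln_a[i] in predicted_a and aln_b[j] in predicted_b
    )

    # OL: template interfacial residues aligned to predicted interfacial ones.
    interface_a = {i for i, _ in template_pairs}
    interface_b = {j for _, j in template_pairs}
    ol = sum(1 for i in interface_a if aln_a.get(i) in predicted_a)
    ol += sum(1 for j in interface_b if aln_b.get(j) in predicted_b)

    return {"SIZ": siz, "COV": cov, "OS": os_score, "OL": ol}


scores = interface_scores(
    template_pairs=[(10, 5), (10, 6), (11, 6), (12, 7)],
    aln_a={10: 101, 11: 102},          # template residue 12 is unaligned
    aln_b={5: 201, 6: 202, 7: 203},
    predicted_a={101},
    predicted_b={202, 203},
)
print(scores)  # {'SIZ': 3, 'COV': 0.75, 'OS': 1, 'OL': 3}
```

Because every quantity is a count over aligned residues, no three-dimensional model of the complex is needed, which is consistent with the motivation given in the text for avoiding energy-based scoring.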
The five empirical scores are combined using a Bayesian network (Figure S2) to yield a likelihood ratio (LR) that a candidate protein-protein complex represents a true interaction (see Methods online). The network is trained on positive and negative “gold standard” reference datasets. Similar to two recent studies23,24, we combine interaction data from multiple databases to ensure a broad coverage of true interactions. We divide these sets into high-confidence (HC) and low-confidence (LC) subsets (Table S1); the HC sets contain 11,851 yeast interactions and 7,409 human interactions that are each supported by more than one publication, while interactions with only one supporting publication compose the LC set. All potential PPIs in a given genome not in the HC+LC set form the negative (N) reference set. Using the Bayesian network classifier trained on the yeast HC set, we select, for each PPI, the interaction model with the highest LR.
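As an illustration of the bookkeeping described above, the following Python sketch splits candidate pairs into HC, LC and N sets by publication count and keeps the highest-LR interaction model for each PPI. The function names, the input formats and the toy values are assumptions; the Bayesian network that produces the LR values is not reproduced here.

```python
def make_pair(a, b):
    """Unordered protein pair, so that (A, B) and (B, A) are the same PPI."""
    return frozenset((a, b))


def split_reference_sets(publication_counts, candidate_pairs):
    """Split pairs into HC (>1 publication), LC (1 publication) and N (rest)."""
    high_conf = {p for p, n in publication_counts.items() if n > 1}
    low_conf = {p for p, n in publication_counts.items() if n == 1}
    negatives = set(candidate_pairs) - high_conf - low_conf
    return high_conf, low_conf, negatives


def best_model_per_ppi(models):
    """Keep, for each pair, the (template, LR) entry with the highest LR.

    models: iterable of (pair, template_id, lr) tuples (hypothetical format).
    """
    best = {}
    for pair, template, lr in models:
        if pair not in best or lr > best[pair][1]:
            best[pair] = (template, lr)
    return best


counts = {make_pair("P1", "P2"): 3, make_pair("P1", "P3"): 1}
candidates = [make_pair("P1", "P2"), make_pair("P1", "P3"),
              make_pair("P2", "P3"), make_pair("P3", "P4")]
hc, lc, neg = split_reference_sets(counts, candidates)

models = [(make_pair("P1", "P2"), "tmplA", 12.0),
          (make_pair("P1", "P2"), "tmplB", 750.0),
          (make_pair("P2", "P3"), "tmplC", 0.4)]
best = best_model_per_ppi(models)
print(best[make_pair("P1", "P2")])  # ('tmplB', 750.0)
```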
To quantitatively assess the performance of structural modeling (SM), we compared it with a number of non-structural clues previously used to infer PPIs24–26: a) essentiality of the proteins in the interacting pair; b) co-expression level; c) Gene Ontology (GO) functional similarity; d) MIPS functional similarity; and e) phylogenetic profile similarity. For the first four clues we used the same algorithms or data as Gerstein and coworkers25, but we developed our own phylogenetic profile algorithm (see details in Methods online and Table S2). Briefly, a phylogenetic profile was constructed for every protein using a set of completely resolved proteomes as references. Since interacting proteins tend to co-evolve, proteins with similar profiles are predicted to interact.
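The idea behind a phylogenetic profile can be sketched in a few lines of Python. This is a generic illustration and not the algorithm described in the online Methods: the binary presence/absence encoding and the Jaccard similarity are assumptions chosen for simplicity, and the reference proteomes and homolog assignments are invented toy data.

```python
def phylogenetic_profile(homolog_genomes, reference_proteomes):
    """Binary profile: 1 if a homolog is present in the reference proteome."""
    return tuple(1 if g in homolog_genomes else 0 for g in reference_proteomes)


def profile_similarity(p, q):
    """Jaccard similarity between two binary profiles (assumed measure)."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0


references = ["E. coli", "B. subtilis", "S. pombe", "H. sapiens"]
profile_x = phylogenetic_profile({"E. coli", "S. pombe", "H. sapiens"}, references)
profile_y = phylogenetic_profile({"S. pombe", "H. sapiens"}, references)
profile_z = phylogenetic_profile({"B. subtilis"}, references)

print(profile_x, profile_y)                      # (1, 0, 1, 1) (0, 0, 1, 1)
print(profile_similarity(profile_x, profile_y))  # 2 shared of 3 present
print(profile_similarity(profile_x, profile_z))  # 0.0
```

Proteins X and Y, which are present or absent in nearly the same proteomes, receive a high similarity and would be scored as more likely to interact than X and Z.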
As shown in Figures S3 and S4, SM yields performance comparable to that of the other clues over the entire range of false positive rate (FPR) but is considerably more effective at low FPR (e.g. FPR ≤ 0.1%). This is critical since, owing to the huge number of negative interactions, only a very low FPR can produce a number of false positives small enough to be useful in practice. At low FPR, SM by itself outperforms even the naïve Bayesian classifiers that combine all non-structure-based clues (NS). Looking specifically at the thousands of high confidence SM predictions in the LC and the N sets with an LR score > 600 (a value used in Ref. 25 and corresponding in our study to an FPR of ~0.1%, see Methods online), about 70% and 50%, respectively, share a GO biological term at, or more specific than, the 6th level of the GO hierarchy, suggesting that many of these interactions are real (Figure S5).
As mentioned above, PrePPI combines structural and non-structural clues using a naïve Bayesian network24–26. Figure S4 shows that PrePPI’s performance is superior to that obtained from structural or non-structural evidence alone, implying that the two sources of information are largely complementary. This point can be clearly seen in the Venn diagrams of high confidence (LR > 600) predictions shown in Figure S6. It is evident from the figure that combining structural and non-structural clues yields many more high confidence predictions and identifies more HC interactions than either source of information alone. As an independent test of PrePPI, we assessed its performance against one of the challenges in the 2009 DREAM (Dialogue for Reverse Engineering Assessments and Methods) workshop specifically aimed at PPI predictions27. As discussed in Table S3, PrePPI outperformed all other methods for cases where structural information is available.
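The way a naïve Bayesian scheme combines clues can be shown with a small Python sketch. Under the naïve (conditional independence) assumption, the combined likelihood ratio is the product of the individual ratios, and multiplying by the prior odds gives the posterior odds. The numerical values below, including the prior odds, are illustrative assumptions and are not taken from the paper.

```python
import math


def combine_likelihood_ratios(lrs):
    """Combined LR under the naive assumption that the clues are independent."""
    return math.prod(lrs)


def posterior_probability(lr, prior_odds):
    """Posterior probability of a true interaction from LR and prior odds."""
    odds = lr * prior_odds
    return odds / (1.0 + odds)


# Two moderately informative clues (assumed values): neither exceeds the
# LR cutoff of 600 on its own, but their product does.
lr_structural = 40.0
lr_non_structural = 20.0
lr_combined = combine_likelihood_ratios([lr_structural, lr_non_structural])

prior_odds = 1.0 / 600.0  # assumed prior odds, for illustration only
print(lr_combined)                                    # 800.0
print(posterior_probability(lr_combined, prior_odds))
```

The example shows why complementary evidence matters: two weak clues that each fall well below the cutoff can together support a high confidence prediction.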
We have compared the performance of PrePPI to that of high-throughput (HT) experiments (Table S4) using data provided in a detailed comparison of different HT techniques reported by Vidal and coworkers23. We used their datasets to define true positives and compiled a new negative reference set consisting of protein pairs in which the two proteins are annotated as localized to different cellular compartments (see Figure S7 and Methods online). This was essential for comparison to experimental assays since, as constructed, our N set excludes data compiled from HT experiments, and hence the FPR for experimental assays is artificially zero (see also related discussion in SOM of Ref. 23).
As can be seen in the ROC (receiver operating characteristic) curves reported here and in Figure S8, PrePPI performance is generally comparable to, and overall somewhat better than, that of HT methods for most data sets that were tested. The accompanying Venn diagram is based on a PrePPI dataset defined by an LR cutoff of 600 (FPR ≈ 0.1%). Results for other LRs and additional reference sets are shown in Figure S9. As can be seen, many of the interactions inferred by PrePPI are different from those identified by HT assays. Methods that combine both approaches may thus prove to be highly effective in expanding the coverage of PPIs.
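Each point on a ROC curve corresponds to one LR cutoff. The Python sketch below shows how the true positive rate and FPR at a given cutoff follow from the LR values of the positive and negative reference sets. The function name and the toy LR values are assumptions; only the cutoff of 600 comes from the text.

```python
def rates_at_cutoff(lrs_positive, lrs_negative, cutoff):
    """Return (TPR, FPR) when pairs with LR above the cutoff are predicted."""
    true_pos = sum(1 for lr in lrs_positive if lr > cutoff)
    false_pos = sum(1 for lr in lrs_negative if lr > cutoff)
    return true_pos / len(lrs_positive), false_pos / len(lrs_negative)


# Toy LR values for a positive and a negative reference set.
positive_lrs = [1200, 700, 650, 90, 5]
negative_lrs = [800, 40, 12, 3, 1, 0.5, 0.2, 0.1, 0.05, 0.01]

tpr, fpr = rates_at_cutoff(positive_lrs, negative_lrs, cutoff=600)
print(tpr, fpr)  # 0.6 0.1
```

Sweeping the cutoff from high to low values traces the full curve; the low-FPR end, which the text identifies as the practically relevant regime, corresponds to the highest cutoffs.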
ROC curve (A) and Venn diagram (B) for PrePPI predictions and high-throughput (HT) experiments for yeast
At an LR cutoff of 600, PrePPI predicts 31,402 high confidence interactions for yeast and 317,813 interactions for human. These, as well as predictions with lower LR scores, are available in a database from the PrePPI website (http://bhapp.c2b2.columbia.edu/PrePPI/). As a further validation of PrePPI we tested its performance on the approximately 24,000 new interactions involving human proteins that were added to public databases after August 2010 (Table S5). Among these interactions, 1,644 are predicted by PrePPI to have an LR>600 (based on a Bayesian classifier derived from pre-2009 data on yeast) so that they essentially correspond to experimental validation of true predictions.
Specific experimental validation of 19 individual PrePPI predictions, using co-immunoprecipitation (co-IP) assays, was carried out in four separate labs, leading to confirmation of 15 of these interactions (Figures S10–S14, Table S6). Specifically, the investigators in each lab queried the PrePPI database for previously uncharacterized interactions that involved proteins of interest and that, as much as possible, had relatively high SM and PrePPI scores (see Table S6 for more information). Here we briefly discuss some of our findings with emphasis on the structural domains predicted by PrePPI to form the protein-protein interface.
One set of predictions involves potential PPIs formed between the nuclear receptor peroxisome proliferator-activated receptor gamma (PPARγ) and other transcription factors. PPARγ plays a pivotal role in regulating glucose and lipid metabolism, inflammatory response and tumorigenesis and is known to heterodimerize with Retinoid X Receptors (RXRs) and to recruit cofactors to regulate target gene transcription. PrePPI predicts high confidence interactions between PPARγ and the transcription factors LXRβ, PAX7, PDX1, NKX2.2 and HHEX (Table S6). Except for HHEX, all of the interactions were validated (Figure S10). The predicted interaction with nuclear receptor LXRβ might have been expected based on the ability of these proteins to heterodimerize through their ligand binding domains. Nevertheless, this specific interaction had not previously been characterized and suggests a heretofore unrecognized convergence of signaling and metabolic pathways regulated by these two nuclear receptors. The interactions between the ligand binding domain of PPARγ and the homeodomains of PAX7, PDX1 and NKX2.2 are fundamentally new observations that require further studies, as they suggest that PPARγ may have a role in endocrine progenitor and pancreatic beta-cell development.
A second set of examples involves the suppressor of cytokine signaling protein, SOCS3, an SH2 domain-containing protein that negatively regulates cytokine-induced signal transduction. To date, the mechanism of the inhibitory function of SOCS3 has been primarily established for its involvement in the JAK/STAT pathway. PrePPI predicts that SOCS3 forms complexes with GRB2 and RAF1, two key components in the Ras/MAPK pathway, and these interactions were confirmed experimentally (Figure S11A and B). PrePPI also predicts the formation of a complex between SOCS3 and BTK, a cytoplasmic tyrosine kinase important in B-lymphocyte development, differentiation, and signaling, and this interaction was also validated (Figure S11C). The SOCS3-GRB2 interaction is predicted to be mediated by their SH2 domains, whereas the SOCS3 interaction with BTK is predicted to be mediated by an SH2-SH3 domain interaction. Analysis of the predicted binding preferences of SH2 domains as well as results on other protein families indicates that the PrePPI scoring function accounts, at least in part, for the binding preference of closely related protein domains (Figure S15, see also below).
A third group of novel observations involves the identification of kinases that interact with the clustered protocadherin proteins (protocadherins α, β and γ; PCDHα, β and γ). The PCDHs have six cadherin-like extracellular domains and unique cytoplasmic domains. They assemble into large complexes at the cell surface, and associate with a variety of proteins, including signaling adaptors, kinases and phosphatases. Analysis of potential PCDH-kinase PPIs confirmed published interactions of PCDHα and γ with the tyrosine kinase RET, and predicted interactions with ROR2, VEGFR2 and ABL1 (Table S6, Figure S12; experiments done in mice). PrePPI predicts that these PPIs are mediated by the extracellular cadherin domains and Ig domains, a result that was confirmed experimentally (Figure S12A–D). A hydrophobic residue, Phe 64, of the ROR2-Ig domain is predicted to be in the center of the interface it forms with PCDHα4. Mutating this Phe to an Ala, a smaller hydrophobic residue, has no detectable effect on binding, while mutating it to charged residues significantly weakens the interaction (Figure S12B and C). These results suggest that, in addition to predicting binary interactions, PrePPI has the potential to reveal novel and unsuspected interfaces.
The fourth group of experiments was carried out with the goal of identifying new components of large protein-protein complexes. We validated two previously uncharacterized interactions between the special AT-rich sequence-binding protein SATB2 and the Emerin “proteome” complex 32, and one involving the pre-mRNA-processing factor PRPF19 and the centromere chromatin complex (Figure S13). It is important to emphasize that each of the PPIs detected must be confirmed through appropriate in vivo experiments. Taken together, however, these findings suggest that PrePPI has sufficient accuracy and sensitivity to provide a wealth of novel hypotheses that can drive biological discovery.
The accuracy and range of applicability of PrePPI, and the crucial role of structural modeling, were unanticipated, but should not come as a complete surprise. Most protein complexes in the PDB have structural neighbors that share binding properties17, and protein interface space may well be close to “complete” in terms of the packing orientations of secondary structure elements18. Moreover, these elements can be identified with geometric alignment methods17,28, a fact that has been exploited in the approach introduced here. Although the information required to predict whether two proteins interact appears to be present in the PDB, the question has been how to mine the data.
Three key elements are responsible for the success of structural modeling and PrePPI. The first is the significant expansion of the number of interactions that can be modeled, due to the use of both homology models and remote structural relationships. About 8,600 PDB structures but more than 31,000 models are found as representatives of at least one domain of ~14,100 human proteins. If we had only used experimentally determined structures in our analysis, a total of only ~2.5 million human PPIs (vs. 36 million when homology models are used) could have been modeled. Similarly, had we limited ourselves to structural neighbors taken from the same SCOP fold, only ~225 thousand interactions could have been modeled, as opposed to 36 million.
As might be expected, predictions based on structural modeling that use only PDB structures or close structural neighbors are more likely to recover known interactions (defined by their presence in databases) than those that use only homology models or remote structural relationships (Figure S16). However, the latter, on their own, yield a dramatic expansion in the total number of interaction models and, consequently, many more high confidence predictions and known interactions. Most importantly, in the calculation of the PrePPI score, the huge number of low confidence structural interaction models leads to an even greater expansion in high-confidence predictions when combined with functional, evolutionary and other sources of evidence (Figure S16).
The second key element in our strategy is the efficiency of our scoring scheme for interaction models, which allows us to evaluate an extremely large number of models while still discriminating among closely related family members. Discrimination among complexes involving members of the same protein family, i.e. specificity, is obtained from the properties of the predicted interface, e.g. the statistical propensity of certain amino acids to appear in interfaces20,21 (and, additionally, from non-structural clues, e.g. whether the two proteins are co-expressed). As examples, our analysis of the SH2 and GTPase families shows that the structural modeling (and PrePPI scores) for these closely related proteins produce a wide range of LRs, with the higher LRs associated with a higher probability of being a known interaction (Figure S15).
The third element responsible for the success of PrePPI is the Bayesian evidence integration method that allows independent and possibly weak interaction clues to be combined so as to make reliable predictions and to improve prediction specificity (Figures S15 and S16).
Two examples illustrate the use of remote structural relationships and homology models. In the first (panel A), an HC set interaction of serine/threonine-protein kinase D1 (PKD1) and protein kinase C epsilon (PKCε) is recovered by structural modeling using a complex of two proteins in the ubiquitin pathway (not kinases) as template. Note that PKD1 and PKCε are not sequence homologues of the two corresponding ubiquitin pathway proteins and are classified as belonging to different SCOP folds. However, the interaction model has a significant SM score (LR=130) arising from both local structural similarity and a conserved interface. The second example (panel B) is a prediction of an LC set interaction between the elongation factor 1-delta (EF1δ) and the von Hippel-Lindau tumor suppressor (VHL) using the same template as in the first example. Again, there is no sequence relationship between the target and the template proteins, and they are classified into different folds. Nevertheless, the interaction model has an LR of 70. We note that EF1δ and VHL were found to interact using mass spectrometry29 and by co-IP experiments reported here (Figure S14).
Models for the PPI formed between (A) PKD1 and PKCε, and (B) EF1δ and VHL using homology models and remote structural relationships
The exploitation of homology models and of remote structural relationships implies that each new structure that is determined experimentally can be used to detect large numbers of new functional relationships even if the protein in question is of only limited biological interest on its own. In this regard, our approach has benefitted from structural genomics initiatives, which produced a large increase in the coverage of sequence families that did not have structural representatives30. We note that PrePPI appears in many cases to offer a viable alternative to HT experiments, yielding, in addition to a likelihood of a given interaction, a model (albeit a crude one) of the domains and residues that form the relevant protein-protein interface. This should in turn facilitate the generation of experimentally testable hypotheses as to the presence of a true physical interaction. In conclusion, our study suggests that a structural “face” can be added to a large number of PPIs and that Structural Biology can play an important role in molecular Systems Biology.