Protein-protein interactions (PPIs) play a central role in all biological processes. Akin to the complete sequencing of genomes, a complete description of interactomes is a fundamental step towards a deeper understanding of biological processes, and has vast potential to impact systems biology, genomics, molecular biology and therapeutics. Although high-throughput biochemical approaches for discovering PPIs have proven very successful[1], the coverage of experimentally determined PPI data remains poor (Table S1) and the data are prone to errors[5]. Such low coverage is partly because the set of possible PPIs to be verified is so large (roughly 50 million pairs for a species with 10,000 genes) that any exhaustive experimental verification will take a long time, even with high-throughput techniques. While the rate of PPI discovery has leveled off in recent years (see Fig. S1), the number of solved protein structural complexes has grown rapidly: there was a 40% increase in the number of complex templates in the 14 months between two versions of the Structural Classification of Proteins database (SCOP, 1.65 and 1.69)[7]. This growing body of structural data presents an opportunity for accurate PPI prediction.
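The figure of 50 million quoted above follows from counting unordered pairs of distinct genes, n(n-1)/2. A quick check, assuming one protein per gene and ignoring self-interactions (the function name is illustrative only):

```python
from math import comb


def candidate_pairs(n_genes: int) -> int:
    """Number of unordered pairs of distinct genes, n * (n - 1) / 2."""
    return comb(n_genes, 2)


print(candidate_pairs(10_000))  # 49995000, i.e. roughly 50 million
```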
There have recently been proposals to harness structure-based computational approaches as a potentially high-quality, high-coverage data source for large-scale integrative approaches to interactome construction[8]. Prieto and De Las Rivas[13] have reviewed publicly available interaction databases of known structural data that facilitate analysis of PPIs[14]. In the absence of a solved structure for a pair of protein “query” sequences, structure-based approaches typically rely on aligning the query sequences to either sequence- or structure-based “templates” for solved structures in the Protein Data Bank (PDB)[17].
In one such approach, homology modeling, two protein sequences are assumed to interact based simply on their primary sequence homology to known interacting proteins. Homology modeling has had considerable success at predicting PPIs on a genome scale[11] and at reconstructing and predicting 3D multi-protein complexes[9]. More recently, Fukuhara and Kawabata have described HOMCOS[21], a web server that performs a similar task to Aloy and Russell's InterPreTS[9], again by homology modeling. MODBase is a database of homology models for protein complexes that have more than 50% sequence similarity to known structures[23]. ADAN is a specialized database for predicting protein-protein interactions mediated by linear motifs; it uses position-specific matrices to assess putative interactions[24]. Other sequence-based methods use genetic information and multiple sequence alignments to predict specific protein-protein interactions[25]. However, effective use of homology modeling requires relatively high sequence similarity between the query and template protein pairs[8].
In another popular approach, threading, the three-dimensional (3D) structure of a pair of protein query sequences is predicted by aligning the sequences to templates for complexes in the PDB, using both sequence and structure profiles, to see whether a similar structure can be found. The quality of a query pair-template alignment is evaluated using a scoring function. The essential computational components of a PPI threading approach are template construction, alignment of query sequences to templates, and interaction scoring. Lu et al. developed Multiprospector[29], a threading algorithm that constructs statistical potential functions to evaluate potential PPIs[30]. Singh, Xu and Berger further proposed DBLRAP, a machine-learning-based threading algorithm that also performs full-complex threading, and demonstrated its superiority over homology modeling and Multiprospector in predicting PPIs[8]. Threading identifies compatible structures for proteins that share less sequence similarity with the template, and thus typically widens the range of proteins for which predictions can be made relative to homology modeling.
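To illustrate the interaction-scoring component, the sketch below sums a pairwise residue potential over the interface contacts of a template, given an alignment of each query sequence to its template chain. It is a toy under stated assumptions and does not reproduce the scoring function of Multiprospector, DBLRAP or any other cited method: the potential values, the gap penalty, the data layout and all names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical toy potential (lower is more favorable); real threaders derive
# statistical potentials from solved complexes. Keys are unordered residue pairs.
TOY_POTENTIAL: Dict[FrozenSet[str], float] = {
    frozenset("KE"): -1.0,  # oppositely charged pair
    frozenset("LV"): -0.5,  # hydrophobic pair
    frozenset("K"): 1.0,    # K-K, like charges
}
GAP_PENALTY = 0.5  # assumed cost when a contact position is left unaligned


def interface_score(
    seq_a: str,
    seq_b: str,
    aln_a: Dict[int, int],
    aln_b: Dict[int, int],
    contacts: List[Tuple[int, int]],
) -> float:
    """Score a query pair against a template's interface contacts.

    aln_a / aln_b map template positions to query indices (missing key = gap);
    contacts lists (position in chain A, position in chain B) template pairs.
    """
    score = 0.0
    for pos_a, pos_b in contacts:
        i, j = aln_a.get(pos_a), aln_b.get(pos_b)
        if i is None or j is None:
            score += GAP_PENALTY
            continue
        pair = frozenset((seq_a[i], seq_b[j]))
        score += TOY_POTENTIAL.get(pair, 0.0)  # unknown pairs score neutrally
    return score


# Toy example: three template contacts, one of which is gapped in chain A.
example = interface_score(
    "AKLE", "GEVK",
    aln_a={10: 1, 11: 2},
    aln_b={20: 1, 21: 2, 22: 3},
    contacts=[(10, 20), (11, 21), (12, 22)],
)
print(example)
```

In this layout the alignment and the scoring are decoupled, so the same scoring routine could be applied to alignments produced by any method.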
While homology modeling and threading approaches have good overall accuracy when sequences are reasonably similar to their putative templates, they perform poorly in the “twilight zone” of sequence identity. In particular, they often give inaccurate alignments in the putative interaction regions for sequences with low similarity and are therefore unable to predict interactions accurately in such cases, as we demonstrated previously for the special case of cytokines[32]. It has been observed that functional residues, such as those at the interface, are more conserved than non-functional ones, both in sequence[33] and in structure[36]. Furthermore, it has recently been shown that partial homology models, based only on interface alignments, are good candidates for templates in docking studies[38]. Here we capitalize on these observations by performing threading on only the protein-protein interface after a suitable complex template has been identified.
We introduce the program iWRAP (Interface Weighted RAPtor), which predicts whether two proteins interact by combining a novel linear programming approach for interface alignment with a boosting classifier[39] for interaction prediction. iWRAP simultaneously optimizes contacts when aligning query sequences to templates of protein-protein interfaces, after constraining the alignments to only those residues likely to be involved in the interaction. This approach contrasts with existing threading approaches, which align each sequence individually to an entire protein structure in the complex. We recently demonstrated the utility of interface threading on two cytokine receptor families by implementing LTHREADER[32], for which we manually generated templates specific to these families and aligned each query sequence separately to each template. The driving hypothesis of iWRAP's approach is that more accurate prediction of protein-protein interfaces improves prediction of protein-protein interactions. We show in this paper for general PPIs that (i) more accurate interface alignments lead to improved interface contact prediction, which in turn (ii) significantly improves PPI prediction. Thus, by optimizing the interface alignments after identifying a suitable template, iWRAP exploits functional conservation at the interface to predict PPIs.
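To make the classification step concrete, the following is a minimal sketch of a boosting classifier (discrete AdaBoost over decision stumps) applied to per-pair score vectors. It is an illustration under stated assumptions, not iWRAP's implementation: the text above specifies neither the boosting variant nor the feature set, and the two features, the toy data and all function names here are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# A stump is (feature index, threshold, polarity, weight alpha).
Stump = Tuple[int, float, int, float]


def stump_predict(x: Sequence[float], feat: int, thresh: float, polarity: int) -> int:
    """Predict `polarity` if x[feat] <= thresh, otherwise -polarity."""
    return polarity if x[feat] <= thresh else -polarity


def fit_adaboost(X: List[Sequence[float]], y: List[int], rounds: int = 5) -> List[Stump]:
    """Discrete AdaBoost; labels must be +1 (interacting) or -1 (non-interacting)."""
    n = len(X)
    w = [1.0 / n] * n  # start from uniform example weights
    model: List[Stump] = []
    for _ in range(rounds):
        best = None
        # Exhaustively pick the stump with the lowest weighted error.
        for feat in range(len(X[0])):
            for thresh in sorted({x[feat] for x in X}):
                for polarity in (1, -1):
                    err = sum(
                        wi for wi, xi, yi in zip(w, X, y)
                        if stump_predict(xi, feat, thresh, polarity) != yi
                    )
                    if best is None or err < best[0]:
                        best = (err, feat, thresh, polarity)
        err, feat, thresh, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((feat, thresh, polarity, alpha))
        # Up-weight misclassified examples, down-weight correct ones, renormalize.
        w = [
            wi * math.exp(-alpha * yi * stump_predict(xi, feat, thresh, polarity))
            for wi, xi, yi in zip(w, X, y)
        ]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def predict(model: List[Stump], x: Sequence[float]) -> int:
    """Sign of the alpha-weighted vote of all stumps."""
    vote = sum(alpha * stump_predict(x, f, t, p) for f, t, p, alpha in model)
    return 1 if vote >= 0 else -1


# Hypothetical energy-like features per protein pair (lower = better fit):
# (alignment score, interface contact score).
X = [(-3.0, -2.5), (-2.8, -1.9), (-2.5, -2.2), (-0.5, -0.3), (-1.0, 0.2), (0.1, -0.4)]
y = [1, 1, 1, -1, -1, -1]
model = fit_adaboost(X, y)
print([predict(model, x) for x in X])
```

A boosted ensemble of this kind combines several weak threshold rules on the threading scores rather than relying on a single score cutoff.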
We demonstrate the efficacy of these techniques on two datasets: SCOPPI, a database that classifies protein complexes in the PDB[40], and the yeast genome. First, we use SCOPPI as our gold-standard database to confirm hypothesis (i): we show that interface threading, i.e. localized threading, leads to better interface contact prediction than full-complex threading. For difficult alignment problems over a range of sequence identities below 40%, iWRAP outperforms standard threading and sequence-based methods, while for easier problems the methods are comparable. Our results on the full yeast genome scan address hypothesis (ii): we demonstrate that our method, which uses boosting[39] in a novel way to classify iWRAP's interface threading scores for PPI prediction, outperforms methods based on whole-sequence alignments. In particular, we perform a full genome scan of yeast to predict interactions and compare iWRAP's performance on experimental data with that of DBLRAP, which has been shown to have the best performance among available structure-based PPI prediction methods[8].
As an application, by mapping yeast cancer-related genes and their putative interactions to the human genome, we identify interactions enriched relative to a recent yeast genetic interaction set[41]. We find that these interacting genes are involved in chromatin remodeling, ribonuclear complex assembly and nucleosome organization[42], processes known to be critically involved in cancer. We focus on yeast cancer-related genes and putative interactions because the functions and interactions of yeast genes are much better understood than those of human genes[43]. Moreover, the malignant behavior of human cells is often caused by dysregulation of cell cycle, growth and apoptosis processes that are conserved across eukaryotic organisms at the level of genes and their interactions[44].
iWRAP's predictions are made publicly available at its website so that they can be used for further exploration or systems-level integrative approaches.