PrePPI (http://bhapp.c2b2.columbia.edu/PrePPI) is a database that combines predicted and experimentally determined protein–protein interactions (PPIs) using a Bayesian framework. Predicted interactions are assigned probabilities of being correct, which are derived from likelihood ratios (LRs) calculated by combining structural, functional, evolutionary and expression information, with the most important contribution coming from structure. Experimentally determined interactions are compiled from a set of public databases that manually collect PPIs from the literature and are also assigned LRs. A final probability is then assigned to every interaction by combining the LRs for both predicted and experimentally determined interactions. The current version of PrePPI contains ~2 million PPIs that have a probability greater than ~0.1, of which ~60 000 PPIs for yeast and ~370 000 PPIs for human are considered high confidence (probability > 0.5). The PrePPI database constitutes an integrated resource that enables users to examine aggregate information on PPIs, including both known and potentially novel interactions, and that provides structural models for many of the PPIs.
Knowledge of protein–protein interactions (PPIs) is essential to understanding cellular regulatory processes. Much effort, involving a multitude of methods, has been devoted to determining direct physical interactions between proteins (1,2). Although most detection methods can only be used for small-scale studies, a few techniques, such as yeast two-hybrid assays and affinity purification, can be scaled up to determine PPIs in a high-throughput manner (3,4). These high-throughput techniques have been applied to genome-wide studies of PPIs for a number of model organisms, including yeast (5–12), fly (13), worm (14), bacteria (15,16), human (17–19) and, more recently, Arabidopsis (20).
A number of databases have been created to systematically collect and store information on experimentally determined PPIs, including the Munich Information Center for Protein Sequence (MIPS) protein interaction database (21), the database of interacting proteins [DIP, (22)], the protein interaction database [IntAct, (23)], the molecular interaction database [MINT, (24)], the Human Protein Reference Database [HPRD, (25)] and the Biological General Repository for Interaction Datasets [BioGRID, (26)]. To date, hundreds of thousands of PPIs have been stored in these databases that cover hundreds of different organisms and contain interactions determined by tens of different methods (27,28).
Although these databases are valuable resources, they inevitably contain some false interactions (false positives) and are largely incomplete in that many interactions are still not annotated (false negatives) (29–31). Whereas false negatives mainly result from the inherent limitations of different detection methods and incomplete screening of the vast possible interaction space, false positives in these databases can result from errors or ambiguities in experiments (32). In particular, data sets generated from high-throughput methods are estimated to have a much higher error rate than traditional small-scale studies (33). In addition to experimental errors, false-negative and false-positive interactions also result from curation errors. For example, a study of discrepancies between different databases showed that, even for the same set of publications, two databases on average fully agree on only 42% of the interactions and 62% of the proteins (34). The differences were attributed to, among other causes, divergent assignments of organism or splice isoforms and alternative representations of multiprotein complexes.
In parallel with experimental studies and literature curation, computational predictions have also been used to infer new interactions from indirect clues. Information such as sequence and structural homology, domain–domain interaction profile, genomic context, gene fusion, phylogenetic profile/tree similarity, gene co-expression, function similarity and network topology has been effectively exploited to evaluate the reliability of experimentally determined interactions (35,36), and to predict PPIs on a large scale (37–41). Usually, every indirect clue by itself is only a weak PPI predictor, but predictions can be improved by integrating different sources of evidence using a variety of machine learning methods. A number of online databases store PPIs predicted by these integrative methods, such as STRING (42), Predictome (43), OPHID (44) and its replacement I2D, IntNetDB (45) and PIPs (46). These databases have their own limitations, and it should be noted that, owing to the nature of many prediction methods, many of the predicted interactions are more indicative of protein functional associations than of direct physical interactions.
Recently, we described a PPI prediction method (PrePPI) that is largely based on 3D protein structural information (47). We showed that, with the exploitation of homology models and remote geometric relationships, structural information can be used to accurately predict PPIs on a genome-wide scale. The further integration of structural with other functional clues yields prediction performance comparable with high-throughput experiments. Experimental tests of a number of predictions demonstrate the ability of the structure-based algorithm to identify novel unsuspected PPIs of significant biological interest.
Given the inconsistent levels of reliability and lack of complete overlap between different PPI databases, a resource that integrates different sources of information and that reports an appropriate measure of reliability should be extremely valuable. In this article, we describe the PrePPI database that contains interactions predicted from our structure-based integrative method, and also includes interactions compiled from a set of public databases that manually curate experimentally determined PPIs from the literature. A probability for each interaction is calculated using a Bayesian framework as described later in the text.
Predicted interactions in the PrePPI database are generated by our structure-based integrative PPI prediction method that combines structural modeling with other genomic, evolutionary and functional clues (47). Briefly, for a pair of proteins of interest, we first search for representative structures of the query proteins in the PDB and homology model databases, and then use these to search for structural neighbors of each protein. A protein–protein complex found in the Protein Quaternary Structure database or Protein Data Bank is used as a ‘template’ for the interaction whenever it contains a pair of interacting chains that are structural neighbors of the respective query proteins. We then construct a model by superposing the individual subunits on their corresponding structural neighbors in the template complex and calculate a likelihood ratio (LR) for each model to represent a true interaction using a Bayesian network trained on a positive and a negative interaction reference set. We finally combine the structure-derived LR with non-structural evidence associated with the query proteins using a naïve Bayesian classifier.
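The template-selection rule described above can be illustrated with a minimal sketch. It is not the PrePPI implementation: the function name, the chain identifiers and the use of plain sets to represent structural neighbors are all assumptions made for illustration. A pair of interacting chains qualifies as a template when one chain is a structural neighbor of the first query protein and the other is a structural neighbor of the second.

```python
def find_templates(neighbors_a, neighbors_b, interacting_chain_pairs):
    """Return interacting chain pairs usable as templates for queries A and B.

    neighbors_a, neighbors_b: sets of chain identifiers that are structural
        neighbors of query proteins A and B (hypothetical identifiers).
    interacting_chain_pairs: iterable of (chain1, chain2) tuples, each a pair
        of chains that interact within a known complex.
    Each returned tuple is oriented as (neighbor of A, neighbor of B).
    """
    templates = []
    for c1, c2 in interacting_chain_pairs:
        if c1 in neighbors_a and c2 in neighbors_b:
            templates.append((c1, c2))
        elif c2 in neighbors_a and c1 in neighbors_b:
            # Same complex, but the chains are listed in the opposite order.
            templates.append((c2, c1))
    return templates
```

In this sketch the modeling, scoring and Bayesian network steps are deliberately left out; only the lookup that decides whether a complex can serve as a template is shown.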
Our analyses show that the performance of the prediction method is comparable with that of high-throughput studies, and that this is primarily due to the large-scale use of structural information, made possible by the use of homology models and by a broad search of protein structure space for structure/function relationships. To put this in perspective, using structure alone we build structural models for ~2.4 million yeast interactions and ~36 million human interactions.
We collected PPIs from six publicly available databases (MIPS, DIP, IntAct, MINT, HPRD and BioGRID) and obtained 117 803 interactions for yeast and 82 060 interactions for human. We mapped protein identifiers from different databases to UniProt accession numbers and used pairs of accession numbers as the unique identifiers of all PPIs. Different databases contain different numbers of false-positive and false-negative interactions that are due to both experimental and curation errors. We have used Bayesian statistics to calculate an LR for database interactions as follows. We used a positive reference set that contains 11 851 yeast interactions and 7409 human interactions that have more than one supporting publication, and a negative reference set constructed by pairing proteins located in different cellular compartments (47). We assigned each of these interactions to one of seven categories and calculated an LR for each category. The first category contains interactions that are present in multiple databases, and the other six contain interactions present in exclusively one of the databases listed earlier in the text. In this way, we obtain an objective evaluation that accounts for both experimental and curation quality.
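As a rough sketch of the category-based scoring described above, the following code estimates an LR for each category as the fraction of positive reference pairs falling in that category divided by the fraction of negative reference pairs falling in it. This is a generic counting estimate, not necessarily the exact procedure used for PrePPI; the function names, the category label `"multiple"` and the choice to return infinity when no negative pair falls in a category are assumptions.

```python
import math
from collections import Counter

def pair_key(acc_a, acc_b):
    """Order-independent identifier for a PPI: a sorted pair of accessions."""
    return tuple(sorted((acc_a, acc_b)))

def category(source_dbs):
    """'multiple' if several databases report the pair, else the single database name."""
    dbs = set(source_dbs)
    if not dbs:
        raise ValueError("an interaction must come from at least one database")
    return "multiple" if len(dbs) > 1 else next(iter(dbs))

def category_lrs(sources, positives, negatives):
    """Estimate LR(c) = P(c | positive) / P(c | negative) by counting.

    sources: dict mapping pair_key -> set of database names reporting the pair.
    positives, negatives: lists of pair_keys forming the reference sets.
    Reference pairs absent from every database fall in no category but still
    count toward the denominators of the two fractions.
    """
    pos = Counter(category(sources[p]) for p in positives if p in sources)
    neg = Counter(category(sources[p]) for p in negatives if p in sources)
    lrs = {}
    for cat in pos.keys() | neg.keys():
        pos_frac = pos[cat] / len(positives)
        neg_frac = neg[cat] / len(negatives)
        # Assumption: with no negatives in a category the estimate is unbounded.
        lrs[cat] = pos_frac / neg_frac if neg_frac > 0 else math.inf
    return lrs
```

With six source databases this yields at most seven categories, matching the scheme in the text. In practice a pseudocount would likely be preferable to an infinite LR for sparsely populated categories.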
An advantage of using a Bayesian framework to calculate an LR for each database category is that experimentally determined interactions can easily be combined with computationally predicted ones. Because the two sources of evidence are only weakly correlated, we treat them as independent and use a naïve Bayesian classifier: the two LR scores are simply multiplied to obtain a combined LR score for each interaction.
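The multiplication step can be written in a few lines. In this sketch the function name is hypothetical, and treating a missing source of evidence as a neutral LR of 1 is an assumption rather than something stated in the text.

```python
def combined_lr(prediction_lr=None, database_lr=None):
    """Naive Bayes combination: multiply the LRs of the available evidence.

    A source with no evidence (None) is treated as neutral, i.e. LR = 1.
    That convention is an assumption made for this sketch.
    """
    lr = 1.0
    for value in (prediction_lr, database_lr):
        if value is None:
            continue
        if value < 0:
            raise ValueError("likelihood ratios must be non-negative")
        lr *= value
    return lr
```

For example, a prediction LR of 50 and a database LR of 20 give a combined LR of 1000, whereas an interaction supported by prediction alone keeps its prediction LR unchanged.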
In the PrePPI database, we have scaled the combined LR to a probability using the following equation:
The PrePPI database now contains ~2 million PPIs with a probability >0.1. Of these, 61 720 PPIs for yeast and 372 545 PPIs for human have a probability >0.5.
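The sketch below is not Equation 1 itself. It shows the generic Bayesian conversion from an LR to a probability, in which posterior odds equal the LR multiplied by the prior odds. The function name and the `prior_odds` parameter are assumptions, and the value used in the usage note is hypothetical rather than the one used by PrePPI.

```python
def lr_to_probability(lr, prior_odds):
    """Convert a likelihood ratio to a posterior probability.

    posterior_odds = lr * prior_odds
    probability    = posterior_odds / (1 + posterior_odds)
    This is algebraically the same as lr / (lr + 1 / prior_odds), so the
    probability reaches 0.5 exactly when lr equals 1 / prior_odds.
    """
    if lr < 0 or prior_odds <= 0:
        raise ValueError("lr must be non-negative and prior_odds positive")
    posterior_odds = lr * prior_odds
    return posterior_odds / (1.0 + posterior_odds)
```

With a hypothetical prior odds of 1/500, an LR of 500 maps to a probability of 0.5, larger LRs map to values approaching 1, and an LR of 0 maps to 0. Under this form, the high-confidence threshold of probability > 0.5 corresponds to a fixed LR cutoff.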
The PrePPI database can be queried through the UniProt accession number (e.g. P03989), gene name (e.g. PRNP) or protein name (e.g. Histone H2A) of a gene or protein. The server will return a description of the query protein, the number of proteins it interacts with and a table with detailed information about each interaction (Figure 1). Each row of the table lists proteins predicted to interact with the query, the sources of information used in the prediction, different LRs and the final combined probability, as well as whether the interaction has been documented in databases or in the literature.
The sources of information used in the prediction are represented by their ‘prediction codes’. Details on different types of information can be found in the ‘Help’ page of the web server. The ‘Prediction LR’ column shows the LR obtained from the Bayesian network that combines the different sources of structural and non-structural evidence for the interaction represented by their prediction codes [see (47) for details on the types of evidence used]. We also calculate a ‘database LR’ as described earlier and combine this with the prediction LR to get a final LR, which is shown in the table as a probability (Final prob.) determined from Equation 1. If an interaction has been previously documented, we put the corresponding database symbols in the seventh column and the PubMed links to the description of the relevant experiments in the eighth column.
Interactions are ordered according to their final probabilities. By default, we only show the high confidence predictions (final probability >0.5), but predictions with lower probabilities can be viewed by clicking the link at the bottom right. All interactions for the query protein can be downloaded by clicking the link at the bottom left.
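For users who download the full interaction list for a query protein, the default view described above can be reproduced offline. This is a minimal sketch: the `Interaction` record and its field names are hypothetical and do not reflect the actual download format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interaction:
    partner: str       # accession of the interacting protein (hypothetical field)
    final_prob: float  # final combined probability (hypothetical field)

def high_confidence(rows, threshold=0.5):
    """Keep interactions with final probability above the threshold,
    ordered from most to least probable."""
    kept = [row for row in rows if row.final_prob > threshold]
    return sorted(kept, key=lambda row: row.final_prob, reverse=True)
```

Lowering `threshold` mimics following the link to lower-probability predictions.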
A unique feature of the PrePPI database is the availability of structural interaction models for those PPIs predicted from our structural modeling algorithm. Figure 2 shows an example of an interaction model built for the human TGF-β receptor type-1 (P36897) and the complement component C1q receptor (Q9NPY3), using a homology model from Skybase (49) for Q9NPY3 and exploiting the remote structural relationship between these monomer structures and a designed protein that forms a homodimer (50). Users can investigate the interaction model and generate experimentally testable hypotheses for how the two proteins interact. It is important to emphasize that no structural refinement of PrePPI models is carried out, so they may contain physically unrealistic features such as steric clashes. The structure-based LR for the model is shown in the viewer and, together with the reasonableness of the model itself, should be considered when evaluating its biological relevance and when deciding whether some form of structural refinement might be of value.
The goal of PrePPI is to generate testable hypotheses derived in part from structure, but its use should be seen, in our opinion, as an early step in the process of biological discovery. PrePPI is under constant development, but at this stage, it is worth pointing out a number of caveats. First, although we have shown that the structure-based LR can account for specificity in the sense that it can differentiate closely related structural domains that form complexes from those that do not [see Figure S15 in the supplemental material of (47)], the methods used are not perfect and predictions should be considered carefully in the context of any additional data that might be available (for example, the highest scoring predictions may be paralogs that appear in different cellular compartments). As discussed earlier in the text, other problems may arise from the fact that we do not attempt to evaluate the 3D model of a putative complex beyond scoring of the interface (47) so that in many cases the model may appear physically unrealistic. Ideally, it will be possible to address such issues automatically through, for example, the use of orthology databases or refinement of side chains, loops and relative domain orientations. We plan to implement such features in future versions of PrePPI. However, because PrePPI evaluates billions of interaction models (47), structural refinement would have to be carried out in a later filtering step, perhaps motivated by biological interest. At this stage, we have chosen to present all high probability predictions with the expectation that a thoughtful user will be able to recognize obvious false positives using the information available on the server itself, in external databases or in the biological literature.
Finally, we note that a high-probability PrePPI prediction for an interaction says nothing about the oligomerization state of the proteins involved. Our goal at this stage is to assign a probability that an interaction between two proteins occurs and to provide an initial model of where an interface might be located. Again, our hope is that the interested user will be able to use the information provided in the PrePPI database as a basis for new experimental and computational efforts on a particular system of interest.
The PrePPI database differs from other PPI databases in four respects: (i) PrePPI provides structural information for many more interactions than has previously been possible using structure-enabled approaches and databases (51–53); (ii) the predicted PPIs in PrePPI are obtained by combining structural and non-structural information; (iii) the PrePPI database integrates information on PPIs from major PPI databases and provides a Bayesian measure of the confidence level of these interactions; and (iv) the PrePPI database assigns a single probability to each interaction using a Bayesian framework that combines quantitative results based on computational predictions with evidence contained in publicly available databases. PrePPI now offers a comprehensive source of PPI information for the yeast and human genomes and will soon be expanded to other model organisms.
National Institutes of Health [GM030518, GM094597, CA121852]. Funding of the open access charge: Howard Hughes Medical Institute.
Conflict of interest statement. None declared.