Knowledge of protein–protein interactions (PPIs) is essential to understanding cellular regulatory processes. Much effort, involving a multitude of methods, has been devoted to determining direct physical interactions between proteins (1). Although most detection methods can only be used for small-scale studies, a few techniques, such as yeast two-hybrid assays and affinity purification, can be scaled up to determine PPIs in a high-throughput manner (3). These high-throughput techniques have been applied to genome-wide studies of PPIs for a number of model organisms, including yeast (5–12), fly (13), worm (14), bacteria (15), human (17–19) and, more recently, Arabidopsis.
A number of databases have been created to systematically collect and store information on experimentally determined PPIs, including the Munich Information Center for Protein Sequences (MIPS) protein interaction database (21), the Database of Interacting Proteins [DIP, (22)], the protein interaction database [IntAct, (23)], the molecular interaction database [MINT, (24)], the Human Protein Reference Database [HPRD, (25)] and the Biological General Repository for Interaction Datasets [BioGRID, (26)]. To date, hundreds of thousands of PPIs have been stored in these databases, which cover hundreds of different organisms and contain interactions determined by dozens of different methods (27).
Although these databases are valuable resources, they inevitably contain some false interactions (false positives) and are largely incomplete in that many interactions are still not annotated (false negatives) (29–31). Whereas false negatives mainly result from the inherent limitations of the different detection methods and from incomplete screening of the vast space of possible interactions, false positives can result from errors or ambiguities in experiments (32). In particular, data sets generated by high-throughput methods are estimated to have a much higher error rate than traditional small-scale studies (33). In addition to experimental errors, false-negative and false-positive interactions also result from curation errors. For example, a study of discrepancies between different databases showed that, even for the same set of publications, two databases on average fully agree on only 42% of the interactions and 62% of the proteins (34). The differences were attributed to, among other factors, divergent assignments of organism or splice isoform and alternative representations of multiprotein complexes.
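The kind of pairwise comparison behind such agreement figures can be illustrated with a minimal sketch. It assumes that each database is reduced to a set of unordered protein pairs and that agreement is measured as the shared fraction of the combined set (a Jaccard index). The exact measure used in the cited study is not given here, and the function names and protein identifiers below are hypothetical.

```python
def normalize(pairs):
    """Treat interactions as unordered pairs so (A, B) and (B, A) match."""
    return {frozenset(pair) for pair in pairs}


def agreement(db_a, db_b):
    """Fraction of all reported interactions that both databases share.

    This is a Jaccard index, chosen for illustration only; it is an
    assumption, not the measure used in the cited comparison.
    """
    a, b = normalize(db_a), normalize(db_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical curated interaction lists from two databases.
db_a = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
db_b = [("P2", "P1"), ("P3", "P2"), ("P6", "P7")]

score = agreement(db_a, db_b)
print(f"agreement = {score:.2f}")
```

In practice, the identifier mapping is the harder problem: the discrepancies noted above (organism and splice-isoform assignments, representation of complexes) arise before any such set comparison can be made.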
In parallel with experimental studies and literature curation, computational predictions have also been used to infer new interactions from indirect clues. Information such as sequence and structural homology, domain–domain interaction profile, genomic context, gene fusion, phylogenetic profile/tree similarity, gene co-expression, function similarity and network topology has been effectively exploited to evaluate the reliability of experimentally determined interactions (35), and to predict PPIs on a large scale (37–41). Usually, each indirect clue is by itself only a weak PPI predictor, but predictions can be improved by integrating different sources of evidence using a variety of machine learning methods. A number of online databases store PPIs predicted by these integrative methods, such as STRING (42), Predictome (43), OPHID (44) and its replacement I2D, IntNetDB (45) and PIPs (46). These databases have their own limitations, and it should be noted that, owing to the nature of many prediction methods, many of the predicted interactions are more indicative of functional associations between proteins than of direct physical interactions.
Recently, we described a PPI prediction method (PrePPI) that is largely based on 3D protein structural information (47). We showed that, with the exploitation of homology models and remote geometric relationships, structural information can be used to accurately predict PPIs on a genome-wide scale. The further integration of structural with other functional clues yields prediction performance comparable with high-throughput experiments. Experimental tests of a number of predictions demonstrate the ability of the structure-based algorithm to identify novel unsuspected PPIs of significant biological interest.
Given the inconsistent levels of reliability and lack of complete overlap between different PPI databases, a resource that integrates different sources of information and that reports an appropriate measure of reliability should be extremely valuable. In this article, we describe the PrePPI database that contains interactions predicted from our structure-based integrative method, and also includes interactions compiled from a set of public databases that manually curate experimentally determined PPIs from the literature. A probability for each interaction is calculated using a Bayesian framework as described later in the text.
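As a generic illustration of how a Bayesian framework can turn several sources of evidence into a single probability, the following sketch multiplies per-clue likelihood ratios under a naive independence assumption and converts the resulting posterior odds into a probability. This is not the PrePPI implementation: the function name, the likelihood ratios and the prior odds are all hypothetical values chosen for the example.

```python
import math


def posterior_probability(likelihood_ratios, prior_odds):
    """Combine independent evidence into an interaction probability.

    posterior odds = prior odds * product of likelihood ratios
    probability    = posterior odds / (1 + posterior odds)

    Treating the clues as independent (naive Bayes) is an assumption
    made for this sketch.
    """
    combined_lr = math.prod(likelihood_ratios)
    posterior_odds = prior_odds * combined_lr
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical likelihood ratios for three clues (for example structural,
# co-expression and functional similarity) and hypothetical prior odds.
lrs = [50.0, 4.0, 3.0]
prior_odds = 1.0 / 600.0

p = posterior_probability(lrs, prior_odds)
print(f"probability of interaction = {p:.3f}")
```

Working with likelihood ratios keeps each clue's contribution separable, so a weak clue (ratio near 1) barely changes the result, while several moderately informative clues can together outweigh low prior odds.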