The development of high-throughput techniques during the last decade has led to an unprecedented increase in the volume of identified human protein-protein interactions (PPIs). The currently available individual PPI data sets can be roughly categorized into three sets:
1) proteome-wide, large-scale screenings aimed at investigating all possible PPIs [1-3],
2) semi-large-scale screenings aimed at investigating the interactions between a specific group of proteins (typically in a pathway) and all other proteins [4, 5], and
3) small-scale, traditional studies aimed at detecting specific PPIs among biologically interesting proteins, e.g., oncogenes and their regulators. Although this last set is still numerically dominant (~80% of all PPIs belong to it), examples of the first two types of investigations are expanding rapidly.
Given this extensive resource of known human PPIs and their continuous accelerated growth, how to globally analyze and aggregate the data remains a challenge. Statistical methods for inferring the confidence of protein interactions can be broadly divided into two groups [6-8]: scoring schemes that rely on the interaction data themselves (e.g., affinity purification/mass-spectrometry [AP/MS] data or yeast two-hybrid [Y2H] data) and scoring schemes that require additional data sources not directly related to the interactions per se (e.g., functional annotation or gene expression data). Herein, we address the question of how to extract high-confidence PPIs while relying only on the aggregated interaction data themselves.
The most intuitive approach to inferring high-confidence PPIs is to score each PPI by the number of times it has been reported [9-11]. However, when the number of times a PPI has been reported across different studies (its occurrence) is used as the metric of reliability, it can be influenced by numerous unknowable experimental factors. For example, recent studies have demonstrated that such factors may decrease the reliability of PPIs containing frequently studied proteins [12]. Moreover, a large number of PPIs share the same number of reported occurrences, making it impossible to use occurrence alone to establish the reliability of these PPIs and rank-order them. In the data analyzed here, for instance, we found that the majority (>83%) of currently available human PPIs have been reported only once.
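The occurrence metric discussed above can be illustrated with a minimal Python sketch. The input format (a list of `(study_id, protein_a, protein_b)` tuples), the function names, and the toy data are assumptions made for illustration; they are not taken from the databases or the analysis described here.

```python
from collections import Counter

def occurrence_counts(reports):
    """Count how many distinct studies report each undirected protein pair.

    `reports` is an assumed format: an iterable of (study_id, protein_a, protein_b).
    """
    seen = set()
    for study_id, a, b in reports:
        pair = tuple(sorted((a, b)))   # treat A-B and B-A as the same PPI
        seen.add((study_id, pair))     # a study counts at most once per pair
    return Counter(pair for _, pair in seen)

def fraction_reported_once(counts):
    """Fraction of distinct PPIs whose occurrence is exactly one."""
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)

# Hypothetical toy data, for illustration only.
reports = [
    ("study1", "TP53", "MDM2"),
    ("study2", "MDM2", "TP53"),    # same pair, reversed order, second study
    ("study2", "TP53", "EP300"),
    ("study3", "BRCA1", "BARD1"),
    ("study3", "BARD1", "BRCA1"),  # duplicate within the same study
]
counts = occurrence_counts(reports)
once = fraction_reported_once(counts)
```

In this toy set, two of the three distinct pairs share an occurrence of one, so occurrence alone cannot order them relative to each other. This is the tie problem described above.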
Herein, we propose an unsupervised statistical approach to score and rank a set of diverse, experimentally identified PPIs. We applied this methodology to human PPIs (non-physical associations excluded) aggregated from nine publicly available primary databases that exclusively contain experimental data (Additional file 1): the Biomolecular Interaction Network Database (BIND) [13], the Biological General Repository for Interaction Datasets (BioGRID) [14], the Database of Interacting Proteins (DIP) [15], the Human Protein Reference Database (HPRD) [16], IntAct [17], the Molecular INTeraction database (MINT) [18], the mammalian PPI database of the Munich Information Center on Protein Sequences (MIPS) [19], PDZBase (a PPI database for PDZ-domains) [20], and Reactome [21]. Our method re-normalizes the importance of frequently occurring proteins among PPIs to avoid giving added (and potentially artificial) weight to those interactions. We estimated the importance of a PPI by comparing the actual observed occurrence of a PPI with its occurrence in a randomized sample. This calculation gauges the likelihood that the interaction occurs by chance in the set of all observed PPIs. Using these estimates, we rank-ordered the aggregated input PPI data set, allowing us to create high-confidence subsets based on a given rank threshold. At the lowest rank threshold, all interactions are included and there is no difference between the ranked data and the original set of PPIs.
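The idea of comparing an observed occurrence with its occurrence in randomized samples can be sketched as a simple permutation test. This is not the IDBOS procedure itself, whose details are not given in this section. The randomization scheme (shuffling one partner column so that each protein keeps its overall frequency), the add-one smoothing, and all names below are assumptions chosen for illustration.

```python
import random
from collections import Counter

def pair_key(a, b):
    """Order-independent key for an undirected protein pair."""
    return (a, b) if a <= b else (b, a)

def shuffle_scores(reports, n_shuffles=200, seed=0):
    """Empirical chance estimate per pair (lower means less likely by chance).

    `reports` is an assumed format: a list of (protein_a, protein_b), one
    entry per reported observation, so repeated pairs appear repeatedly.
    """
    rng = random.Random(seed)
    observed = Counter(pair_key(a, b) for a, b in reports)
    left = [a for a, _ in reports]
    right = [b for _, b in reports]
    exceed = Counter()
    for _ in range(n_shuffles):
        # Shuffling one column keeps every protein's frequency unchanged,
        # so frequently studied proteins also pair up often by chance.
        rng.shuffle(right)
        shuffled = Counter(pair_key(a, b) for a, b in zip(left, right))
        for pair, obs in observed.items():
            if shuffled[pair] >= obs:
                exceed[pair] += 1
    # Add-one smoothing keeps every estimate strictly above zero.
    return {p: (exceed[p] + 1) / (n_shuffles + 1) for p in observed}

def rank_pairs(scores):
    """Most confident pairs first; ties are broken alphabetically."""
    return sorted(scores, key=lambda p: (scores[p], p))

# Hypothetical toy data: one frequently studied protein and one rare pair.
reports = [("HUB", f"P{i}") for i in range(8)] + [("X", "Y"), ("X", "Y")]
scores = shuffle_scores(reports)
ranking = rank_pairs(scores)
```

Truncating `ranking` at a chosen position gives a subset at that rank threshold, and keeping the full list reproduces the original set of pairs.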
The presented scoring and ranking procedure can be seen as an extension of our previous effort to infer high-confidence interactions from raw affinity purification data, termed interaction detection based on shuffling (IDBOS) [22], and, in the following, we will also refer to our scoring and ranking scheme as IDBOS. Our proposed procedure shares similarities with estimating probabilities of observed interactions above a random background based on the hypergeometric distribution [23], with the distinction that the IDBOS-generated probability density distribution functions correct for biases toward self-interaction among frequently studied proteins. Although other methods exist for assigning confidence scores to PPIs, these generally require additional data or reference sets [24, 25], or a priori assumptions of network topology [26]. To the best of our knowledge, this is the first application of an unsupervised probabilistic scoring and ranking scheme to create subsets of unbiased high-confidence human PPI networks.
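For comparison, a hypergeometric background estimate of the kind mentioned above can be sketched as follows. The parameterization is an assumption made for illustration: `N` is the total number of reported interactions, `n_a` and `n_b` are the numbers of interactions involving each protein, and `k` is the observed number of reports of the pair. The scheme in [23] may define these quantities differently.

```python
from math import comb

def hypergeom_tail(k, N, n_a, n_b):
    """P(X >= k) for X ~ Hypergeometric(population N, n_a marked items, n_b draws).

    Under the assumed parameterization, a small value suggests the pair is
    reported more often than expected from the two proteins' frequencies.
    """
    upper = min(n_a, n_b)
    total = comb(N, n_b)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    favorable = sum(
        comb(n_a, i) * comb(N - n_a, n_b - i) for i in range(k, upper + 1)
    )
    return favorable / total

# Hypothetical numbers, for illustration only.
p_rare = hypergeom_tail(2, 1000, 3, 4)     # two rarely studied proteins
p_hub = hypergeom_tail(2, 1000, 300, 400)  # two frequently studied proteins
```

With the same observed count, the pair of frequently studied proteins receives the larger chance probability, which is the frequency bias that any background model of this kind has to account for.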
We evaluated the improvement in using IDBOS-ranked PPI data by comparing it with other methods and assessing their ability to retrieve biological associations from a number of diverse and independent reference sets. These reference sets contain known biological data that are either directly (e.g., crystallographically determined protein complexes) or indirectly (e.g., co-expressed genes) linked to interactions between proteins. The hypothesis we tested was that sets of highly ranked PPIs are enriched in biological associations as determined from the diverse reference sets. We quantified the average effect of using ranked protein interaction data to retrieve this information and showed that, when compared to randomly ranked interaction data sets, IDBOS created a larger enrichment (~134%) than either ranking based on the hypergeometric test (~109%) or occurrence ranking (~46%).
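One way to express enrichment relative to a randomly ranked data set is sketched below. The exact definition behind the percentages quoted above is not given in this section, so the formula here is an assumption: the reference-set hit rate among the top-ranked pairs is compared with the overall hit rate, which is the rate expected on average under random ranking. The function name and toy data are hypothetical.

```python
def enrichment_percent(ranked_pairs, reference, top_k):
    """Percent gain in reference-set hit rate among the top_k ranked pairs.

    The baseline is the hit rate over all pairs, i.e. the expected hit rate
    of a randomly ordered list truncated at the same top_k.
    """
    if not 0 < top_k <= len(ranked_pairs):
        raise ValueError("top_k must be between 1 and the number of pairs")
    top_hits = sum(1 for p in ranked_pairs[:top_k] if p in reference)
    all_hits = sum(1 for p in ranked_pairs if p in reference)
    if all_hits == 0:
        raise ValueError("no ranked pair appears in the reference set")
    top_rate = top_hits / top_k
    base_rate = all_hits / len(ranked_pairs)
    return 100.0 * (top_rate / base_rate - 1.0)

# Hypothetical toy ranking and reference set, for illustration only.
ranked = [("A", "B"), ("C", "D"), ("E", "F"), ("G", "H")]
reference = {("A", "B"), ("E", "F")}
gain_top1 = enrichment_percent(ranked, reference, 1)
gain_all = enrichment_percent(ranked, reference, 4)
```

Under this definition, keeping every pair gives zero enrichment, which matches the statement that at the lowest rank threshold the ranked data and the original set coincide.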
From our evaluations, it was clear that ranked interactions were always of value because higher-ranked PPIs had a higher likelihood of retrieving biologically relevant data. Statistically removing the biasing factors inherent in aggregated PPI data via the IDBOS-ranking scheme further increased the accuracy and enrichment of biological information associated with PPIs.