A crucial step towards understanding cellular systems properties is mapping networks of physical DNA-, RNA- and protein-protein interactions, the “interactome network”, of an organism of interest as completely and accurately as possible. One approach consists in systematically testing all pairwise combinations of predicted proteins to derive the “binary” interactome. Early attempts at binary interactome mapping used high-throughput yeast two-hybrid (Y2H), in which a protein interaction reconstitutes a transcription factor that activates expression of reporter genes. High-throughput Y2H maps have been generated for Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, and human (8). An alternative approach consists in generating “co-complex” interactome maps, achievable by high-throughput co-affinity purification followed by mass spectrometry (AP/MS) to identify proteins bound to tagged baits, as done for Escherichia coli, S. cerevisiae, and human (17).
To investigate fundamental questions of interactome network structure and function, it is necessary to understand how the size and quality of currently available maps might have affected conclusions about global and local properties of interactome networks (18), which requires a thorough evaluation of the differences between binary and co-complex maps. Here, we address these issues using the yeast S. cerevisiae as a model system.
First, we compared the quality of existing high-throughput binary and co-complex datasets to information obtained from curating low-throughput experiments described in the literature. For binary interactions we examined: (i) the subset found by Uetz et al. in a proteome-scale all-by-all screen (“Uetz-screen”), excluding the pairs found in a focused, potentially biased experiment involving only 193 baits (“Uetz-array”) (2); and (ii) the Ito et al. interactions found three times or more (“Ito-core”), independently from those found one or two times (“Ito-noncore”), a distinction recommended by the authors but seldom applied in the literature (3). For co-complex associations, we investigated two high-throughput AP/MS datasets referred to as “Gavin” (15) and “Krogan” (16). For literature-curated interactions, we only considered those curated from two or more publications (“LC-multiple”) (20), which we considered of higher quality than those curated from a single publication.
Fig. 1 Evaluation of S. cerevisiae protein-protein interaction datasets. (A) Number of interactions reported in various large-scale S. cerevisiae protein-protein interaction datasets. (B) Schema of pipeline used to assemble binary positive and random reference ...
To experimentally compare the quality of these datasets, we selected a representative sample of ~200 protein interaction pairs from each one and tested them by means of two independent interaction assays, Y2H and a yellow fluorescent protein complementation assay (PCA) (21) [Supporting Online Material (SOM) I]. In PCA, bait and prey proteins are fused to non-fluorescent fragments of yellow fluorescent protein that, when brought in close proximity by interacting proteins, reconstitute a fluorescent protein in mammalian cells. In contrast, reconstitution of a transcription factor in Y2H experiments takes place in the nucleus of yeast cells. In terms of assay designs, Y2H and PCA can be considered as orthogonal assays and can be used to validate each other's results.
No single assay is expected to detect 100% of genuine interactions, and the actual fraction of positives detected is inherently linked to the stringency at which the assay is implemented. To identify the optimal scoring condition of each assay we selected a set of ~100 well-documented yeast protein-protein interaction pairs [“positive reference set” (PRS)] and a set of ~100 random pairs [“random reference set” (RRS)] (SOM II). Because RRS pairs were picked uniformly from the 14×10^6 possible pairings of proteins within our yeast ORFeome collection (22) (excluding those reported as interacting), these pairs are extremely unlikely to be interacting.
Sampled pairs from binary Uetz-screen and Ito-core datasets tested positive at levels as high as the positive control PRS, demonstrating their high quality. A sample of literature-curated LC-multiple interactions tested slightly lower with Y2H, while being indistinguishable by PCA, demonstrating that high-throughput Y2H datasets can be comparable in quality to literature-curated information. In striking contrast, sampled pairs from Ito-noncore tested at levels similar to the negative control RRS, confirming the low quality of this particular dataset.
Sampled pairs from Gavin and Krogan high-throughput AP/MS datasets tested poorly in our two binary interaction assays, albeit at levels similar to Munich Information Center for Protein Sequences (MIPS) complexes, a widely-used gold standard (23). This observation demonstrates that, at least for detecting binary interactions, Y2H performs better than AP/MS.
Our experimental data quality assessment shows that binary Uetz-screen, Ito-core, and LC-multiple datasets are of high quality, while Ito-noncore should not be used. AP/MS datasets, although of intrinsically good quality (15), should be used with caution when binary interaction information is needed.
Our experimental results contrast strikingly with computational analyses that suggested that high-throughput Y2H datasets contain more false positives than literature-curated or high-throughput AP/MS datasets (24). In computational analyses, the quality of a dataset is often determined by the fraction of interactions also present in a pre-defined gold standard set (24). Generally, MIPS complexes have been considered as gold standard with all proteins constituting a given complex modeled as interacting with each other. Such modeling yields a limited sample that is biased against binary interactions, since not all proteins in a complex contact each other directly (fig. S1), and not all direct physical interactions occur within complexes (fig. S2; SOM III). Hence, while MIPS complexes are appropriate for benchmarking co-complex membership datasets, they are not appropriate for binary interaction datasets. This distinction is corroborated by the poor experimental confirmation rate of pairs from MIPS complexes using binary assays.
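The effect of modeling every member of a complex as interacting with every other member can be illustrated with a minimal sketch. The five-subunit complex and its chain of direct contacts below are hypothetical toy data, not taken from MIPS, and the function name is introduced here for illustration only.

```python
from itertools import combinations

def all_against_all_pairs(members):
    """Return every unordered pair of complex members (all-against-all model)."""
    return {frozenset(p) for p in combinations(sorted(members), 2)}

# Hypothetical complex: five subunits arranged as a linear chain of contacts.
complex_members = ["A", "B", "C", "D", "E"]
direct_contacts = {
    frozenset(p) for p in [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
}

modeled = all_against_all_pairs(complex_members)
fraction_direct = len(modeled & direct_contacts) / len(modeled)
print(len(modeled), fraction_direct)
```

In this toy case only 4 of the 10 modeled pairs are direct contacts, so a binary assay could confirm at most 40% of the "gold standard" pairs even if it were perfectly sensitive.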
To computationally re-examine the quality of existing yeast interactome datasets we assembled a binary gold standard set (“Binary-GS”) of 1,318 high-confidence physical binary interactions (SOM III). Binary-GS includes direct physical interactions within well-established complexes as well as conditional interactions (e.g., dependent on posttranslational modifications) and thus represents well-documented direct physical interactions in the yeast interactome (26). When measured against Binary-GS, the quality of high-throughput Y2H datasets (with the exception of Ito-noncore) was substantially better (SOM IV and V) than that of high-throughput AP/MS datasets. Our results demonstrate the distinct nature of binary and co-complex data. Generally, Y2H datasets contain high-quality direct binary interactions, whereas AP/MS co-complex datasets are composed of direct interactions mixed with a preponderance of indirect associations (SOM VI).
The proteome-wide binary datasets, Uetz-screen and Ito-core, contain 682 and 843 interactions, respectively (2). The overlap between these two datasets appears low (3): 19% of Uetz-screen and 15% of Ito-core interactions were detected in the other dataset. Given our demonstration of high quality for these datasets, we conclude that the small overlap stems primarily from low sensitivity (i.e., many false negatives) rather than from low specificity (i.e., many false positives as previously suggested).
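The directional overlap quoted above (the fraction of each dataset recovered in the other) can be computed with simple set operations. The sketch below uses hypothetical protein names and toy interaction lists; it treats interactions as unordered pairs, which is an assumption of this illustration.

```python
def normalize(pairs):
    """Treat interactions as unordered pairs so that (A, B) equals (B, A)."""
    return {frozenset(p) for p in pairs}

def directional_overlap(a, b):
    """Return (fraction of a found in b, fraction of b found in a)."""
    shared = a & b
    return len(shared) / len(a), len(shared) / len(b)

# Hypothetical toy datasets sharing a single interaction (P1-P2).
screen_1 = normalize([("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P4", "P5")])
screen_2 = normalize(
    [("P2", "P1"), ("P5", "P6"), ("P6", "P7"), ("P7", "P8"), ("P8", "P9")]
)
print(directional_overlap(screen_1, screen_2))
```

Because the two datasets differ in size, the same shared set gives two different percentages, as with the 19% and 15% figures reported for Uetz-screen and Ito-core.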
Several factors might affect sensitivity. First, the space of pair-wise protein combinations actually tested in each dataset might have been considerably different. We refer to the fraction of all possible pairs tested in a given screen as the “completeness”. For example, missing 10% of ORFs in each mapping project could reduce the common tested space down to 66% [(0.9×0.9) × (0.9×0.9)] of all possible pair-wise combinations. Second, different protein interaction assays or even different versions of the same assay detect different subsets of pairs out of all possible interactions, partly explaining the limited overlap between datasets obtained with different Y2H versions. For any assay, the “assay-sensitivity” is estimated as the fraction of PRS interactions detected, which for our Y2H assay was determined empirically to be ~20%. Finally, when screening tens if not hundreds of millions of protein pairs in any tested space, that search space might need to be sampled multiple times to report all or nearly all interactions detectable by the assay used. The fraction of all interactions theoretically detectable by a particular assay that is found in a given experiment is its “sampling-sensitivity”. These three parameters fully account for the seemingly small overlap between Ito-core and Uetz-screen (SOM VII), demonstrating that a large fraction of the S. cerevisiae binary interactome remains to be mapped. Therefore, we carried out a new proteome-scale yeast high-throughput Y2H screen (fig. S3).
We used 5,796 Gateway-cloned ORFs available in the yeast MORF collection (22). After subcloning these ORFs into Y2H vectors and removing auto-activators (27), our search space became 3,917 DB-Xs against 5,246 AD-Ys, representing a completeness of 77% (SOM VI), comparable to that of recent AP/MS datasets (15) (~78%; SOM VI).
Fig. 2 Large-scale Y2H interactome screen. (A) Completeness of the Y2H screen. (B) Sampling-sensitivity of CCSB Y2H screens measured by screening a subspace multiple times. (C) Fraction of protein pairs in PRS, RRS, and CCSB-YI1 that test positive by PCA, MAPPIT ...
To address sampling-sensitivity, we determined what fraction of all detectable interactions is found in each pass after eight trials in a search space of 658 DB-X and 1,249 AD-Y ORFs. A single trial identified about 60% of all possible interactions that can be detected with our high-throughput Y2H, whereas three to five repeats were required to obtain 80-90% (SOM VI). We decided to screen the whole search space three times independently to yield an estimated sampling-sensitivity of 85%. In total ~88,000 colonies were picked, of which 21,432 scored positive upon more detailed phenotyping (SOM I). After identifying all putative interaction pairs by sequencing, phenotypically retesting them using fresh cultures from archival stocks, and eliminating de novo auto-activators, we obtained a final dataset, “CCSB-YI1”, of 1,809 interactions among 1,278 proteins.
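A saturation curve of the kind described for the repeated screens can be sketched as follows. The sketch approximates "all detectable interactions" by the union of all trials, which is an assumption of this illustration rather than necessarily the procedure used in SOM VI; the integer IDs are hypothetical stand-ins for interaction pairs.

```python
def cumulative_saturation(trials):
    """Fraction of the union of all trials recovered after each successive trial."""
    total = set().union(*trials)
    seen = set()
    fractions = []
    for found in trials:
        seen |= found
        fractions.append(len(seen) / len(total))
    return fractions

# Hypothetical toy data: each set holds the interactions found in one trial.
trials = [
    {1, 2, 3, 4, 5, 6},
    {1, 2, 3, 7, 8},
    {2, 4, 6, 8, 9},
    {1, 5, 9, 10},
]
print(cumulative_saturation(trials))  # [0.6, 0.8, 0.9, 1.0]
```

The curve flattens because each additional trial mostly rediscovers interactions already seen, which is why a small number of repeats can capture most of what an assay is able to detect.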
To validate the overall quality of CCSB-YI1, we tested 94 randomly-chosen interactions by PCA and mammalian protein-protein interaction trap (MAPPIT; SOM I) (21). MAPPIT takes place at the mammalian cell membrane and measures interactions via activation of STAT3-dependent reporter expression. Using both PCA and MAPPIT the confirmation rate of CCSB-YI1 was similar to those of Ito-core and Uetz-screen. The precision [i.e., fraction of true positives in the dataset (30)] of CCSB-YI1 is estimated at 94-100% (fig. S4; SOM VI). Additionally, the performance of our high-throughput Y2H approach was confirmed via a larger RRS of 1,000 random pairs (30), none of which tested positive (SOM II).
The overlaps of Uetz-screen (27%) and Ito-core (35%) with CCSB-YI1 can be explained by the completeness, assay- and sampling-sensitivity of the three experiments (SOM VII) and agree well with the results of the pairwise confirmation of those two datasets. Similar principles apply to other large-scale experiments such as AP/MS, likely accounting for the low overlap between Krogan and Gavin (~25%; fig. S5B).
Factoring in completeness, precision, assay-, and sampling-sensitivity, we estimated that the yeast binary interactome consists of ~18,000±4,500 interactions (SOM VI), experimentally validating previous computational estimates of 17,000 to 25,000 interactions (31). To obtain a more comprehensive map of the binary yeast interactome we combined the three available high-quality proteome-scale Y2H datasets (SOM VII). The union of Uetz-screen, Ito-core, and CCSB-YI1, “Y2H-union”, contains 2,930 binary interactions among 2,018 proteins, which, according to our empirical estimate of the interactome size, represents ~20% of the whole yeast binary interactome.
Fig. 3 Network analysis of Y2H-union, Combined-AP/MS and LC-multiple datasets. (A) Network representations. Shown are relationships between increasing degree of a gene product and (B) the fraction of essential genes with the corresponding degree, (C) the fraction ...
We re-examined global topological features of this new yeast interactome network, facing lower risk of over-interpreting properties due to limited sampling and various biases in the data (18). To contrast topological properties of the binary Y2H-union network with those of the co-complex network, we used an integrated AP/MS dataset (33), which was generated by combining raw high-throughput AP/MS data (15). This “Combined-AP/MS” dataset, composed of 9,070 co-complex membership associations between 1,622 proteins, attempts to model binary interactions from co-complex data.
As found previously for other macromolecular networks, the connectivity or “degree” distribution of all three datasets is best approximated by a power-law (34) (fig. S6; SOM VIII). Highly connected proteins, or “hubs”, are reportedly more likely to be encoded by essential genes than less connected proteins (35). Surprisingly, Y2H-union lacked any correlation between degree and essentiality. This discrepancy might stem from biases in the datasets available at the time of the original observation: interactions reported in Uetz et al. (Uetz-array and Uetz-screen) and literature-curated interactions. Although Uetz-array is of high quality (fig. S7), its experimental design could negatively influence network analyses. Most hub proteins in Uetz-array were found as baits (fig. S8) and the percentage of essential proteins in the 193 bait proteins is two times higher (34.7%) than that of all protein-encoding ORFs in the yeast genome (18.4%), explaining the high correlation between degree and essentiality. Likewise, literature-curated interactions seem prone to sociological and other inspection biases (SOM VII). Thus, we refrain hereafter from using LC-multiple in our further topological and biological analyses. No significant correlation between degree of connectedness and essentiality was observed in any of the three proteome-wide high-throughput binary datasets available today (i.e., Ito-core, Uetz-screen, and CCSB-YI1), or in new versions of our C. elegans and human interactome maps (fig. S9; SOM IX).
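The degree-versus-essentiality comparison can be sketched as below. The network and the set of essential proteins are hypothetical toy data, the edge list is assumed to be de-duplicated, and the use of a cumulative threshold (degree at least k) is an assumption of this illustration rather than a confirmed detail of the original analysis.

```python
from collections import Counter

def degrees(edges):
    """Count interaction partners per protein from an undirected edge list."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        if a != b:  # a self-interaction is counted once
            deg[b] += 1
    return deg

def essential_fraction(deg, essential, min_degree):
    """Fraction of essential proteins among those with degree >= min_degree."""
    selected = [p for p, d in deg.items() if d >= min_degree]
    if not selected:
        return None
    return sum(p in essential for p in selected) / len(selected)

# Hypothetical toy network: H is a hub; A and C are essential.
edges = [("H", "A"), ("H", "B"), ("H", "C"), ("H", "D"), ("A", "B")]
essential = {"A", "C"}
deg = degrees(edges)
for k in (1, 2, 4):
    print(k, essential_fraction(deg, essential, k))
```

If hubs were preferentially essential, the fraction would rise with k; in this toy network it falls, mirroring the absence of a positive trend reported for Y2H-union.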
Hub proteins instead relate to pleiotropy, the number of phenotypes observed as a consequence of gene knock-out (SOM I). There was a significant correlation in Y2H-union between connectivity and the number of phenotypes observed in global phenotypic profiling analyses of yeast genes (36). Thus the number of binary physical interactions mediated by a protein seems to better correlate with the number of cellular processes in which it participates than with its essentiality. The correlation between degree and number of phenotypes is not observed in Combined-AP/MS, likely because co-complex associations reflect the size of protein complexes more than the number of processes they might be involved in.
We confirmed the concept of modularity in the yeast interactome network, whereby date hubs that dynamically interact with their partners appear particularly central to global connectivity while static party hubs appear to function locally in specific biological modules (37). The proportion of date and party hubs is strikingly different between Y2H-union and Combined-AP/MS. There are significantly more date hubs in the binary network, whereas party hubs are prevalent in the co-complex network. In the binary network, date hubs are crucial to the topological integrity of the network, while party hubs have minimal effects. However, in the co-complex network, date and party hubs affect the topological integrity of the network equally, likely because most hubs in Combined-AP/MS reside in large stable complexes, while hubs in Y2H-union preferentially connect diverse cellular processes.
Surprisingly, essential proteins strongly tended to interact with each other (SOM IX). Concentrating on the subnetwork formed by interactions mediated by and among essential proteins (fig. S10), we found a giant component whose size is much larger than expected by chance. To better understand the clustering of essential proteins, we examined the interacting essential protein pairs that are also reported to be in the same complex, finding 106 interacting essential protein pairs, a greater number than expected by chance (SOM IX).
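A comparison of giant component size against chance can be sketched as follows. The null model used here (drawing the same number of proteins uniformly at random) is an assumption of this illustration, and the randomization in SOM IX may differ; the network and the "essential" set are hypothetical toy data.

```python
import random
from collections import defaultdict

def largest_component(nodes, edges):
    """Size of the largest connected component of the subnetwork induced by nodes."""
    nodes = set(nodes)
    adj = defaultdict(set)
    for a, b in edges:
        if a in nodes and b in nodes:
            adj[a].add(b)
            adj[b].add(a)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            size += 1
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        best = max(best, size)
    return best

def random_expectation(all_nodes, edges, k, n_iter=1000, seed=0):
    """Mean largest-component size over random node sets of size k."""
    rng = random.Random(seed)
    pool = sorted(all_nodes)
    sizes = [largest_component(rng.sample(pool, k), edges) for _ in range(n_iter)]
    return sum(sizes) / n_iter

# Hypothetical toy network of seven proteins; four are labeled essential.
edges = [(1, 2), (2, 3), (3, 4), (5, 6)]
all_nodes = range(1, 8)
essential = {1, 2, 3, 5}
observed = largest_component(essential, edges)
expected = random_expectation(all_nodes, edges, len(essential), n_iter=200)
print(observed, expected)
```

An observed size well above the random mean would indicate that the labeled proteins cluster in the network more than chance alone would produce.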
Fig. 4 Clustering of essential proteins. (A) Average fraction of essential proteins among proteins whose distance is equal to d from a protein selected from essential, non-essential and all proteins. (B) Giant component size of network formed by essential proteins ...
We investigated the overall relationships between Y2H-union and Gene Ontology (GO) attributes (38), phenotypic and expression profiling similarities (39), and transcriptional regulatory networks (40). Both Y2H-union and Combined-AP/MS show significant enrichment for functionally similar pairs in all three GO branches (41). There is also significant enrichment of positive correlations of phenotypic profiles (36) between interacting pairs in both datasets (fig. S11). Such interactions supported by strong phenotypic information are likely candidates for functional relationships. Lastly, both datasets are significantly enriched with pairs co-expressed across many conditions (fig. S12), although Combined-AP/MS shows higher enrichment, agreeing well with the different nature of the two assays: AP/MS aims at detecting stable complexes whereas Y2H tends to detect more transient and condition-specific protein interactions. This observation is further supported by enrichment of kinase-substrate pairs in Y2H-union (SOM X; fig. S13).
Fig. 5 Biological features of yeast interactome datasets. (A) Enrichment of interacting protein pairs (relative to random) that share GO annotations in the biological process, cellular component and molecular function branches of GO ontology. (B) Pearson correlation ...
To explore the mechanisms behind co-expression of interacting protein pairs we combined transcriptional regulatory networks with interactome network information (40). Interacting proteins in both networks showed a tendency to be co-regulated by common transcription factors (TFs). Similarly to what we observed in the co-expression correlation analysis, the enrichment for interacting pairs in Combined-AP/MS was significantly higher than that of Y2H-union. Strikingly, we observed a significant enrichment of protein-protein interactions between TFs involved in a common “multi-input motif” (42) (MIM, where multiple TFs co-regulate a given set of genes; SOM X). The fraction of co-regulating TF pairs is much higher in the binary interactome than in the co-complex network, suggesting that various TFs function together to form transient complexes to differentially regulate transcriptional targets (44).
These observations suggest that our binary interactome dataset is enriched in transient or condition-specific interactions linking different subcellular processes and molecular machines. To further explore this possibility we calculated “edge-betweenness” for each interaction in a merged network of all available interactions (SOM XI), measuring the number of shortest paths between all protein pairs that traverse a given edge. The higher edge-betweenness of interactions from Y2H-union shows the tendency of Y2H to detect key interactions outside of complexes, significantly more often than AP/MS. Several examples of such complex-to-complex connectivity are evident in a complete map of MIPS complexes connected by Y2H interactions (fig. S14).
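Edge-betweenness as defined above can be computed with a breadth-first variant of Brandes' algorithm. The sketch below is a minimal pure-Python version for an unweighted, undirected network; it is not necessarily the implementation used in SOM XI, and the toy network (two triangles joined by a single bridging interaction) is hypothetical.

```python
from collections import defaultdict, deque

def edge_betweenness(edges):
    """Map each edge (as a frozenset) to the number of shortest paths between
    unordered node pairs that traverse it; tied shortest paths share credit."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    score = defaultdict(float)
    for source in adj:
        # Breadth-first search: distances, shortest-path counts, predecessors.
        dist = {source: 0}
        sigma = defaultdict(float)
        sigma[source] = 1.0
        preds = defaultdict(list)
        order = []
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
                if dist[nxt] == dist[node] + 1:
                    sigma[nxt] += sigma[node]
                    preds[nxt].append(node)
        # Propagate path credit back from the farthest nodes to the source.
        delta = defaultdict(float)
        for node in reversed(order):
            for pred in preds[node]:
                share = sigma[pred] / sigma[node] * (1.0 + delta[node])
                score[frozenset((pred, node))] += share
                delta[pred] += share
    # Every unordered pair was counted once from each of its two endpoints.
    return {edge: value / 2.0 for edge, value in score.items()}

# Hypothetical toy network: two triangular "complexes" linked by edge C-D.
edges = [("A", "B"), ("B", "C"), ("A", "C"),
         ("C", "D"),
         ("D", "E"), ("E", "F"), ("D", "F")]
scores = edge_betweenness(edges)
print(scores[frozenset(("C", "D"))], scores[frozenset(("A", "B"))])
```

The bridging edge C-D carries all nine shortest paths between the two triangles, whereas an edge inside a triangle such as A-B carries only one, which illustrates why interactions that connect complexes score high on this measure.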
Overall, we infer that Y2H interrogates a different subspace within the whole interactome than AP/MS, and Y2H interactions represent key connections between different complexes and pathways. Y2H and AP/MS provide orthogonal information about the interactome and are both vital to obtain a complete picture of cellular protein-protein interaction networks.