In a large-scale interaction assay, we consider each protein's interactions to be a random sample from the population of observed interactions. A simple and general theoretical error model, based on the hypergeometric distribution, can be used to calculate the probability of observing each interaction from a random background. This model builds on related models that have previously been applied to several linkage and interaction types [18-21]. Within a given dataset, the probability (P-value) of an interaction between proteins A and B being observed at random is:
$$P(I \ge k \mid n, m, N) = \sum_{i=k}^{\min(n,m)} p(i \mid n, m, N)$$
where
$$p(i \mid n, m, N) = \frac{\binom{n}{i}\binom{N-n}{m-i}}{\binom{N}{m}}$$
and k is the number of times the interaction between A and B is observed, n and m are the total numbers of interactions for proteins A and B, respectively, and N is the total number of interactions observed in the entire dataset. When applied to the matrix model interpretation of protein interactions, the scoring scheme can identify highly accurate subsets of interactions. The process is illustrated in Figure .
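As a minimal sketch of the scoring step, assuming the P-value is the standard upper tail of the hypergeometric distribution over k, n, m, and N as defined above, the calculation could be written in Python as follows. The function names are illustrative and not taken from the original work.

```python
from math import comb


def hypergeom_pmf(i, n, m, N):
    """Probability of exactly i A-B observations under the hypergeometric model.

    n, m: total interactions of proteins A and B; N: total interactions
    in the dataset. Values of i outside the feasible range have probability 0.
    """
    if i < max(0, n + m - N) or i > min(n, m):
        return 0.0
    return comb(n, i) * comb(N - n, m - i) / comb(N, m)


def interaction_p_value(k, n, m, N):
    """Probability of observing the A-B interaction at least k times at random."""
    return sum(hypergeom_pmf(i, n, m, N) for i in range(k, min(n, m) + 1))


# Toy example: A has 3 interactions, B has 4, the dataset has 10 in total,
# and the A-B interaction was observed twice.
p = interaction_p_value(2, 3, 4, 10)
```

Exact integer binomials keep the sketch dependency-free; for datasets of realistic size, a library routine for the hypergeometric survival function would likely be faster.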
We generated matrix-model interpretations of the Ho, Gavin, and Krogan datasets. The only other TAP-MS dataset of significant scale [22] is a subset of [11] and was omitted. We then scored each dataset separately, assigning to each interaction a P-value calculated from the observations within that dataset alone. We evaluated the quality of the scoring by calculating recall and precision against the set of protein complexes manually defined from literature sources by the Munich Information Center for Protein Sequences (MIPS) [23]. Recall was defined as TP/(TP + FN), where TP, true positives, are experimental interactions that are in the MIPS set and FN, false negatives, are the MIPS interactions not present in the experimental data. Precision was defined as TP/(TP + FP), where TP is as above and FP, false positives, are interactions observed experimentally where both corresponding proteins are in the MIPS set, but the interaction is not. For all three datasets, the method displays improved recall and/or precision relative not only to the spoke-model interpretation of the same dataset but also to the complexes published by the originating group (Figure ). As each co-complex dataset represents an independent experimental observation, the probabilities can be combined to provide higher confidence in repeated observations. We therefore combined the three scored datasets by multiplying the P-values for a given interaction across all three datasets, assigning a P-value of 1 to any interaction missing from a dataset. The combined interaction dataset, which we call the Probabilistic Integrated Co-complex (PICO) network, is more accurate and provides greater coverage than any of the individual datasets it comprises.
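The evaluation and integration steps described above can be illustrated with a short Python sketch. It assumes interactions are stored as unordered protein pairs and each scored dataset as a dictionary from pair to P-value; these data structures and function names are hypothetical choices made for illustration only.

```python
from math import prod


def pair(a, b):
    """Order-independent key for an undirected interaction."""
    return frozenset((a, b))


def recall_precision(observed, reference):
    """Recall and precision of observed pairs against reference pairs.

    A false positive is an observed pair absent from the reference whose
    two proteins both occur somewhere in the reference set.
    """
    ref_proteins = set().union(*reference)
    tp = len(observed & reference)
    fn = len(reference - observed)
    fp = sum(1 for p in observed - reference if p <= ref_proteins)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


def combine_p_values(datasets):
    """Multiply P-values across datasets, using 1 where a pair is missing."""
    all_pairs = set().union(*datasets)
    return {p: prod(d.get(p, 1.0) for d in datasets) for p in all_pairs}


# Toy example with made-up proteins and scores.
reference = {pair("A", "B"), pair("B", "C")}
observed = {pair("A", "B"), pair("A", "C"), pair("A", "X")}
recall, precision = recall_precision(observed, reference)
combined = combine_p_values([
    {pair("A", "B"): 0.01, pair("A", "C"): 0.5},
    {pair("A", "B"): 0.1},
    {},
])
```

In the toy example, the pair A-X is ignored when counting false positives because X does not occur in the reference set, which mirrors the definition of FP given in the text.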
The PICO network contains a large number (~160,000) of protein-protein interactions, each with a relative confidence measure given by its P-value. The full list is available for download [see Additional File 1]. We filtered out low-confidence interactions before deriving complexes from the data, beginning by rank-ordering the interactions by P-value, lowest to highest. We then applied a series of increasingly stringent expected (E) value thresholds, starting with E = 1 and tightening in order-of-magnitude increments to E = 10^-6. The number of interactions in the PICO network at each threshold is shown in Figure .
We derived a set of complexes at each threshold using MCL [24], an implementation of the Markov clustering algorithm. MCL was evaluated in [25] and was used to derive complexes from the raw data in [12]. To evaluate the accuracy of each set of complexes, we measured the Hubert statistic, H, of the derived complexes versus a reference set of complexes [26]. Briefly, calculating H involves generating a matrix M of protein pairs (i, j), where M(i, j) = 1 if the proteins are in the same complex and 0 otherwise. The correlation between the experimental and reference matrices is then measured, resulting in a score from -1 to 1, with 1 implying identical complex assignments and values near zero indicating random assignment. We measured the Hubert statistic of the complexes derived at each threshold against the set of curated MIPS complexes [23] with ribosomal subunits removed and against a filtered set of Gene Ontology (GO) Cellular Component (CC) annotations (see Methods). The correlations generally improve with increasing stringency (Figure ), although the rate of increase in correlation with GO component drops off sharply after the 10^-2 cutoff. This improvement in accuracy comes at the price of decreasing coverage, reflected in the decreasing number of interactions at each threshold as shown in Figure . In an attempt to balance accuracy and coverage, we selected the complexes derived from the E = 10^-2 threshold, hereafter called the E-2 complexes, for further study.
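The Hubert statistic described above amounts to a Pearson correlation between two binary co-membership matrices. The following Python sketch computes it without materializing the matrices. It assumes that the protein universe is the union of proteins appearing in either set of complexes, a choice the text does not specify, and the function names are illustrative.

```python
from itertools import combinations
from math import sqrt


def comembership(complexes):
    """Set of protein pairs (i, j), i < j, that share at least one complex."""
    pairs = set()
    for members in complexes:
        pairs.update(combinations(sorted(set(members)), 2))
    return pairs


def hubert_statistic(derived, reference):
    """Correlation between the co-membership matrices of two complex sets."""
    proteins = set().union(*derived) | set().union(*reference)
    total = len(proteins) * (len(proteins) - 1) // 2  # number of protein pairs
    a, b = comembership(derived), comembership(reference)
    mean_a, mean_b = len(a) / total, len(b) / total
    # Covariance and variances of two 0/1 variables over all pairs.
    cov = len(a & b) / total - mean_a * mean_b
    var_a, var_b = mean_a * (1 - mean_a), mean_b * (1 - mean_b)
    if var_a == 0 or var_b == 0:
        return float("nan")  # correlation undefined for a constant matrix
    return cov / sqrt(var_a * var_b)


# Identical assignments score 1; fully crossed assignments score below zero.
same = hubert_statistic([{"A", "B", "C"}, {"D", "E"}], [{"A", "B", "C"}, {"D", "E"}])
crossed = hubert_statistic([{"A", "B"}, {"C", "D"}], [{"A", "C"}, {"B", "D"}])
```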
Features of the E-2 complexes
The E-2 complexes contain 1689 proteins grouped into 390 clusters of sizes ranging from two to 35 subunits. A network view of the complexes, generated using Cytoscape [27], is shown in Figure ; the Cytoscape file is available for download [see Additional File 2]. To measure the accuracy of individual complexes, we tested each for significant enrichment of GO component annotation. GO component annotations enriched at P < 0.01 (with Bonferroni correction for multiple hypothesis testing) are noted for each complex [see Additional File 3]. The Simpson coefficient of each enriched annotation is also listed as an easily understood metric for measuring the completeness with which any GO term describes a complex (or vice versa).
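To make the two per-complex metrics concrete, the sketch below computes the Simpson (overlap) coefficient and a Bonferroni-corrected enrichment P-value in Python. The text does not state which enrichment test was used; a hypergeometric upper tail is assumed here purely for illustration, and all function and parameter names are hypothetical.

```python
from math import comb


def simpson_coefficient(complex_members, term_members):
    """Overlap divided by the size of the smaller set (1.0 if one contains the other)."""
    a, b = set(complex_members), set(term_members)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def enrichment_p_value(overlap, complex_size, term_size, population):
    """Chance of at least `overlap` annotated proteins in a random complex.

    Assumed test: hypergeometric upper tail over a population of annotated
    proteins, `term_size` of which carry the GO term.
    """
    upper = min(complex_size, term_size)
    tail = sum(
        comb(term_size, i) * comb(population - term_size, complex_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / comb(population, complex_size)


def bonferroni(p_value, n_tests):
    """Bonferroni-corrected P-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Toy example: a 4-protein complex sharing 2 proteins with a 3-protein term.
s = simpson_coefficient({"A", "B", "C", "D"}, {"A", "B", "X"})
```

A Simpson coefficient of 1 indicates that the complex is fully contained in the annotation or the annotation fully contained in the complex, which is why it reads naturally in either direction.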
The large fraction of E-2 complexes that correspond to existing annotations suggests that the dataset is highly accurate. Of the 132 complexes with four or more subunits, 69% (91) are highly enriched for one or more specific GO component annotations; of the 44 complexes of size eight or larger, 84% (37) are so annotated. Furthermore, there are virtually no uncharacterized genes in these large complexes, and the few that appear have relatively weak connections to the other members of their respective clusters. This suggests that the yeast community has achieved a fairly complete description of a large fraction of the "complex-ome," at least for complexes containing many proteins. In fact, only one complex of size four or greater consists entirely of unnamed subunits and thus could be considered truly novel (complex C132, composed of proteins YAL049C, YDL025C, YGR016W, and YHR009C).
Several E-2 clusters represent amalgamations of known complexes. The MCL algorithm assigns each protein to exactly one complex, so protein complexes with shared subunits are sometimes found combined into a single cluster in the E-2 complexes. The C1 cluster, for example, includes RNA polymerases I, II, and III, largely because all three enzymes contain the Rpb5, Rpb8, Rpb10, and Rpo26 subunits. Likewise, complex C7 contains the TFIID complex and the SAGA transcription factor/chromatin remodeling complex; these complexes share the Taf5, Taf6, Taf9, Taf10, and Taf12 proteins. It seems clear from the RNA polymerase case that the E-2 clusters occasionally contain discrete complexes that presumably do not physically interact.
Even the clusters that lack significant GO terms tend to have subunits that share similar free-text descriptions in the Saccharomyces Genome Database (SGD) [28]. For example, complex C44 contains eight proteins, all of which are essential. Of these, seven are explicitly described in SGD as being involved in 60S ribosome biogenesis or as components of 66S pre-ribosomal particles, and the eighth is involved in export of pre-ribosomal large subunits from the nucleus. No GO term enrichment is found because the CC annotation is typically "nucleolus," a weak term excluded from our analysis (see Methods). Likewise, unannotated complexes C20, C30, and C78 contain 13, 10, and 5 proteins, respectively (10, 9, and 5 essential), that are all known or suspected to be involved in ribosome biogenesis. Other unannotated complexes include C43, eight largely nonessential proteins in the well-described cyclin/cyclin-dependent kinase group; C51, seven nonessential proteins involved in catabolite inactivation of FBPase; and C72, six proteins (five essential), of which five are involved in retrograde Golgi-to-ER trafficking and the sixth, Sec39, is of unknown function but "proposed to be involved in protein secretion."
Hierarchical structure of co-complex network
The high-confidence subset of the PICO network from which the E-2 complexes were derived contains 5,352 interactions; of these, 4,411 are present in the E-2 complex map of 390 complexes. The remaining 941 interactions all occur between subunits of different complexes. We examined the structure of these interactions by collapsing each complex into a single node and looking at the interactions between complexes. The resulting intercomplex network, depicted in Figure , suggests a hierarchical organization of protein complexes in the cell. Over one-third of the interactions (341, or 36%) appear in just three clusters: the U4/U6·U5 tri-snRNP complex and its neighbors (191 interactions), the C20/C30/C44/C78 ribosome biogenesis nexus (86 interactions), and the C17 histone-associated complex (64 interactions). In all three cases, the intercomplex interactions link complexes that are involved in closely related physiological processes. Taken together, these observations suggest that yeast protein complexes exhibit a hierarchical organization, with complexes interacting with each other in a well-ordered fashion.
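Collapsing each complex into a single node, as described above, can be sketched in Python as follows. The sketch relies on the fact that each protein belongs to exactly one complex; the input format (a dictionary from complex identifier to member set, and a list of protein pairs) is an assumption made for illustration.

```python
from collections import Counter


def intercomplex_network(interactions, complexes):
    """Split interactions into within-complex and between-complex counts.

    interactions: iterable of (protein_a, protein_b) pairs.
    complexes: dict mapping complex id -> set of member proteins.
    Returns (number of within-complex interactions,
             Counter of interaction counts keyed by sorted complex-id pair).
    """
    owner = {p: cid for cid, members in complexes.items() for p in members}
    within, between = 0, Counter()
    for a, b in interactions:
        ca, cb = owner.get(a), owner.get(b)
        if ca is None or cb is None:
            continue  # skip proteins that were not assigned to any complex
        if ca == cb:
            within += 1
        else:
            between[tuple(sorted((ca, cb)))] += 1
    return within, between


# Toy example with two complexes and four interactions.
complexes = {"C1": {"A", "B"}, "C2": {"C", "D"}}
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
within, between = intercomplex_network(edges, complexes)
```

The resulting counter is an edge-weighted graph over complexes, where each weight is the number of protein-level interactions linking the two complexes.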
Essentiality of protein complexes
The E-2 network shows an enrichment of essential genes in general: the 1689 proteins in the network comprise 29% of all yeast proteins but include 58% of all essential proteins (602 essentials out of 1033 total). The descriptions above, as well as a glance at the complex map in Figure , suggest that essential proteins are concentrated in some complexes and excluded from others (see Additional File 4). To measure whether there is such a concentration, we considered the distribution of complexes with respect to the fraction of essential proteins in each and sorted this distribution into ten uniformly spaced bins. We bootstrapped a background distribution by randomly assigning the same number of essential genes to an identical set of complexes, repeating this process 10,000 times, and calculating the mean for each bin. We then took the log of the ratio of the observed to the random frequencies in each bin. The results, plotted in Figure , show clear enrichment for complexes either mostly essential (>70%) or almost completely nonessential (<10%), with underrepresentation in intermediate values.
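The randomization procedure described above can be sketched in Python as follows. The base of the logarithm and the handling of empty bins are not stated in the text, so the natural log and a `None` placeholder are assumed here; the function names and the fixed random seed are likewise illustrative.

```python
import math
import random


def essential_fraction_histogram(complexes, essential, bins=10):
    """Count complexes per bin of essential-protein fraction (uniform bins)."""
    counts = [0] * bins
    for members in complexes:
        k = sum(p in essential for p in members)
        # Integer arithmetic avoids floating-point edge effects; 100% goes in the last bin.
        counts[min(k * bins // len(members), bins - 1)] += 1
    return counts


def log_enrichment(complexes, essential, trials=10_000, bins=10, seed=0):
    """Log ratio of observed to mean randomized bin counts (None if undefined)."""
    rng = random.Random(seed)
    proteins = sorted(set().union(*complexes))
    n_essential = sum(p in essential for p in proteins)
    observed = essential_fraction_histogram(complexes, essential, bins)
    totals = [0] * bins
    for _ in range(trials):
        # Reassign the same number of essential labels at random.
        shuffled = set(rng.sample(proteins, n_essential))
        for b, c in enumerate(essential_fraction_histogram(complexes, shuffled, bins)):
            totals[b] += c
    expected = [t / trials for t in totals]
    return [
        math.log(o / e) if o > 0 and e > 0 else None
        for o, e in zip(observed, expected)
    ]


# Toy example: three two-protein complexes, one of them fully essential.
toy = [{"A", "B"}, {"C", "D"}, {"E", "F"}]
ratios = log_enrichment(toy, {"A", "B"}, trials=200)
```

Positive values mark bins that hold more complexes than expected under random assignment of essentiality, and negative values mark depleted bins.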