Computational and experimental approaches have now mapped a great many yeast and human protein interactions, but how many interactions should we expect? We argue here that the sizes of the complete yeast and human protein interaction networks will be larger than most early estimates. We do not yet know the size of any complete protein-interaction network. We can, however, roughly estimate the expected sizes for the yeast network using two different approaches that agree reasonably well. These estimates are derived from considering the interactions shared between each pair of large-scale protein interaction assays published so far.

First, provided two large-scale assays sample the same portion of 'interaction space' (that is, they sample the same pairs of interacting proteins, usually a subset of the interactome), then the number of interactions detected by both assays should be distributed according to the hypergeometric distribution, which is well approximated for large populations by the binomial distribution. Given two assays of size $n_1$ and $n_2$ interactions, respectively, with $k$ in common, as well as estimates of the false-positive rates of the two assays ($\mathit{fpr}_1$ and $\mathit{fpr}_2$), the maximum likelihood estimate of the number of interactions, $N$, within that subspace is $[n_1(1 - \mathit{fpr}_1) \times n_2(1 - \mathit{fpr}_2)]/k$, provided $n_1$ and $n_2$ are sufficiently large ($n_1(1 - \mathit{fpr}_1) \times n_2(1 - \mathit{fpr}_2) \gg N$; see Additional data file 1 for a derivation of the statistics). This intersection analysis (Figure ) has a rich history in other fields, such as mark-recapture methods for estimating the size of an animal population [30], and has recently been applied to protein-interaction networks [31].
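
As a minimal sketch of the estimator described above, the function below computes the subspace size from the two assay sizes, their overlap, and their false-positive rates. The estimate follows from equating the observed overlap with its expected value under the hypergeometric model (the product of the two true-positive counts divided by the subspace size). The function name, argument names, and example numbers are illustrative assumptions, not values from the paper.

```python
def estimate_subspace_interactions(n1, n2, k, fpr1=0.0, fpr2=0.0):
    """Estimate N, the number of true interactions in the shared subspace.

    n1, n2     : number of interactions reported by each assay
    k          : number of interactions reported by both assays
    fpr1, fpr2 : estimated false-positive rate of each assay, in [0, 1)
    """
    if k <= 0:
        raise ValueError("k must be positive; an empty overlap gives no estimate")
    for fpr in (fpr1, fpr2):
        if not 0.0 <= fpr < 1.0:
            raise ValueError("false-positive rates must lie in [0, 1)")
    true1 = n1 * (1.0 - fpr1)  # expected true positives in assay 1
    true2 = n2 * (1.0 - fpr2)  # expected true positives in assay 2
    return true1 * true2 / k


# Hypothetical numbers for illustration only (not from any published dataset).
print(estimate_subspace_interactions(1000, 2000, 100, fpr1=0.5, fpr2=0.25))  # 7500.0
```

Because the overlap appears in the denominator, a small intersection makes the estimate very sensitive to chance: one more or one fewer shared interaction changes the result substantially.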

To use this method, the datasets must first be corrected for their error rates. One method for estimating the false-positive rates of large-scale assays, described by D'haeseleer and Church [32], involves comparing the two datasets to each other and to a reference dataset. The method does not require a gold-standard reference, only that the reference not be biased toward either of the samples being measured. This requirement is met by comparing two similar assays: that is, either two mass spectrometry or two two-hybrid datasets. The method, described in Figure , uses the ratio of the intersections of the three datasets to estimate the number of true positives in each sample. An example using the interactions derived from the two recent genome-scale TAP/mass spectrometry assays published by Gavin *et al*. [27] and Krogan *et al*. [28], compared to the Munich Information Center for Protein Sequences (MIPS) reference set [33], is presented in Figure . In this and all subsequent analyses, the interaction data were used as follows: for Krogan *et al*. [28], bait-prey pairs; for Gavin *et al*. [27], bait-prey pairs derived from the lists of prey associated with each bait.
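
The following sketch shows one way a three-way intersection ratio argument of this kind can be set up. It is an assumption about the general form of the calculation, not a reproduction of the procedure in [32] or in the figure, which may differ in detail. It assumes that interactions found by both assays are true positives and that the unbiased reference covers the same fraction of true interactions everywhere, so that the fraction of the assay-assay overlap found in the reference can be used to scale up each assay's overlap with the reference. All names are hypothetical.

```python
def estimate_false_positive_rates(assay_a, assay_b, reference):
    """Estimate false-positive rates of two assays from three-way overlaps.

    Each argument is an iterable of hashable interaction identifiers,
    for example frozenset({"YAL001C", "YBR123C"}) for an unordered pair.
    Assumes interactions shared by both assays are true positives and the
    reference samples true interactions without bias toward either assay.
    """
    a, b, ref = set(assay_a), set(assay_b), set(reference)
    shared = a & b
    shared_in_ref = shared & ref
    if not shared_in_ref:
        raise ValueError("the three datasets share no interactions")
    # Fraction of true interactions that the reference recovers.
    coverage = len(shared_in_ref) / len(shared)
    # True positives in each assay: its overlap with the reference,
    # scaled up by the reference coverage.
    true_a = len(a & ref) / coverage
    true_b = len(b & ref) / coverage
    fpr_a = max(0.0, 1.0 - true_a / len(a))
    fpr_b = max(0.0, 1.0 - true_b / len(b))
    return fpr_a, fpr_b


# Toy identifiers (integers) standing in for interactions; illustrative only.
toy_a = range(0, 100)
toy_b = range(90, 190)
toy_ref = list(range(0, 20)) + list(range(95, 100)) + list(range(180, 190))
print(estimate_false_positive_rates(toy_a, toy_b, toy_ref))
```

In the toy example, 5 of the 10 shared interactions appear in the reference (coverage 0.5), so the 25 reference hits in the first dataset imply about 50 true positives out of 100 reported interactions.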

To estimate the interactome size by intersection analysis, we first take the interactions in each dataset that are derived from the common sample space of the two assays. (Figure shows only the interactions in this common sample space.) Each group purified around 2,000 TAP-tagged strains for mass spectrometry, with the common set of baits numbering 1,243, of which 1,128 yielded at least one identical interaction. While a true 'apples-to-apples' comparison of these results is difficult given the data that these two groups have published, as discussed by Goll and Uetz [34], we tried to extract the interactions derived from these common baits for this analysis from the published filtered datasets. After calculating error rates and subtracting false positives from the two datasets, their intersection was used to predict the number of interactions within the subspace they sample. That prediction was then scaled up to the size of the whole interactome (around $5{,}800^2/2$) to estimate the total number of protein-protein interactions in the organism.
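
The scaling step can be sketched as a simple proportion, assuming that true interactions are spread evenly over the space of protein pairs. The text does not state how the size of the sampled subspace was counted, so `pairs_sampled` below is a hypothetical input that the reader would need to supply; the function name and example values are likewise illustrative.

```python
def scale_to_interactome(n_subspace, pairs_sampled, n_proteins=5800):
    """Scale a subspace interaction estimate up to the whole interactome.

    n_subspace    : estimated true interactions within the sampled subspace
    pairs_sampled : number of protein pairs in that subspace (assumed known)
    n_proteins    : proteome size; the text uses around 5,800 for yeast
    """
    if pairs_sampled <= 0:
        raise ValueError("pairs_sampled must be positive")
    total_pairs = n_proteins ** 2 / 2  # size of the full pair space
    return n_subspace * total_pairs / pairs_sampled


# Hypothetical example: a subspace covering half of all protein pairs.
half_space = 5800 ** 2 / 4
print(scale_to_interactome(1000, half_space))  # 2000.0
```

The uniform-density assumption is the weak point of this step: if the baits chosen by both groups are enriched for well-connected proteins, a linear scale-up would overestimate the total.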

The error estimates for Gavin *et al*. [27] and Krogan *et al*. [28], as well as those for other large-scale yeast interaction datasets, are shown in Table . The false-positive rate of the computationally derived Jansen dataset [22] was determined by comparing it to Gavin *et al*. [27] and Krogan *et al*. [28] individually, although these comparisons may violate the no-bias requirement for the reference dataset. Table shows the interactome size predictions derived from these pairs of mass spectrometry assays, which give an average interactome size of about 53,000 interactions, although the Gavin-Krogan pairwise estimate has the largest intersection and is, therefore, likely to be the most accurate estimate of the three. The two-hybrid assays [35, 36] share too few interactions to give a meaningful estimate of interactome size.

| **Table 1** Yeast protein-interaction assay false-positive rates: yeast datasets |

| **Table 2** Prediction of the size of the yeast interactome |

These projected interactome sizes agree with those generated by a simple, very approximate, scaling argument: we observe approximately 5-10 unique interactions per yeast protein in current networks; multiplying these values by around 5,800 yeast genes gives estimates of approximately 29,000-58,000 interactions. These values are somewhat larger than previous estimates of 10,000-30,000 total yeast interactions [20, 29, 31, 37-39].

Unfortunately, applying these techniques to high-throughput assays of human protein-protein interactions is still problematic. The two large-scale yeast two-hybrid screens published recently [14, 15] share only six interactions, too small an intersection to generate reliable error rate or interactome size estimates; similarly, data from Stelzl *et al*. [15] share only 5 and 13 interactions with the orthology-transferred interactions from Lehner and Fraser [40] and the computationally derived set of Rhodes *et al*. [23], respectively, ruling out these comparisons for estimating interactome size. However, comparison of the Rual *et al*. [14] data with those of Lehner and Fraser [40] and Rhodes *et al*. [23] yielded consistent false-positive estimates, suggesting that reference bias is minimal (Table ). The human interactome estimates generated from these pairs of datasets are shown in Table . These projections, while consistent with the estimate of approximately 260,000 interactions offered by Rual *et al*. [14], still stem from small intersections and limited information about sample space, and should be considered very rough estimates.

| **Table 3** Human protein-interaction assay false-positive rates: human datasets |

| **Table 4** Prediction of the size of the human interactome |