At any time, hundreds of thousands of macromolecular interactions occur in a cell, mediating functions that maintain normal cellular activities. High-throughput approaches have been developed to determine interactions in many organisms at large scale. Current high-throughput protein-protein “interactome” datasets are of high quality, but have low coverage1,2. For humans, more than 95% of the interactome remains to be mapped1.
A bottleneck for high-throughput interactome mapping methods, such as yeast one-hybrid3, two-hybrid, and three-hybrid5 systems, is determining the identities of the interacting protein, DNA, or RNA molecules. Implementation of next-generation DNA sequencing (NGS) technologies6–8, as opposed to Sanger technology, would substantially increase throughput and decrease cost. Although highly effective for genome and transcriptome “shotgun” sequencing, NGS technologies are not readily applicable to identifying interacting pairs. The necessary pooling of PCR amplicons in the preparation of interacting sequence tags (ISTs) would inevitably eliminate the association within each pair of DNA sequences coding for interacting molecules.
Figure 1 Stitch-Seq interactome mapping. (a) Interactome mapping using different sequencing technologies. Above, each DNA fragment within each interacting pair is PCR-amplified individually and Sanger sequenced. The association is tracked via position on the plate.
Here we describe a massively parallel interactome mapping strategy that incorporates NGS, and test the strategy in a high-throughput yeast two-hybrid (Y2H) system. This general scheme can be readily extended to increase throughput and decrease cost for other interactome mapping methods, particularly other binary protein-protein interaction assays1, yeast one-hybrid3, or genetic screens where pairs of DNA molecules are selected and identified9.
In current protocols for high-throughput Y2H screens, the open reading frames (ORFs) or cDNAs encoding selected pairs of interacting hybrid proteins (X fused to a DNA binding domain (DB-X) and Y fused to an activation domain (AD-Y)) are amplified directly from yeast transformants and subsequently identified by Sanger DNA sequencing (Supplementary Fig. 1). Since X and Y originate from recorded positions in paired PCR plates, they can be computationally re-assembled to form pairs of ISTs10.
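The position-based re-assembly described above can be sketched in a few lines. This is a minimal illustration rather than the pipeline used in the study: the `(plate, well)` keys, the use of `None` for a failed reaction, and the function name `pair_ists_by_well` are all assumptions made for the example.

```python
def pair_ists_by_well(db_calls, ad_calls):
    """Pair DB-X and AD-Y identifications that share a plate position.

    db_calls / ad_calls: dicts mapping (plate, well) -> ORF identifier,
    with None marking a failed PCR or sequencing reaction
    (a hypothetical layout chosen for this sketch).
    """
    pairs = {}
    for position, x_orf in db_calls.items():
        y_orf = ad_calls.get(position)
        # An IST pair is recovered only when both reactions succeeded.
        if x_orf is not None and y_orf is not None:
            pairs[position] = (x_orf, y_orf)
    return pairs


# Hypothetical example: well A2 is lost because the AD-side read failed.
db_calls = {("P1", "A1"): "ORF_X1", ("P1", "A2"): "ORF_X2"}
ad_calls = {("P1", "A1"): "ORF_Y1", ("P1", "A2"): None}
ist_pairs = pair_ists_by_well(db_calls, ad_calls)
```

The sketch makes the dependency explicit: the X-Y association exists only in the plate coordinates, which is why pooling amplicons before sequencing would destroy it.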
The first step of our methodology, termed “Stitch-Seq”, is PCR stitching, which places a pair of sequences encoding interacting proteins on the same PCR amplicon, and which has previously been used to link genes encoding interacting pairs20. PCR stitching consists of two rounds of PCR. In the first round, X and Y (present on the Y2H DB-X and AD-Y vectors) are amplified with DB- and AD-vector-specific upstream primers, respectively (Supplementary Table 1a). A common sequence on the downstream primers is complementary to the Gateway-specific attB2 site immediately following the ORFs. We tested the PCR stitching concept for Y2H experiments using Gateway clones, though the approach can be generalized to other interactome mapping assays with different vectors. In the second round of PCR (Supplementary Table 1b), X and Y amplicons from the first round are used as templates to produce a concatenated PCR product composed of X and Y ORFs connected by an 82 bp linker sequence. All PCR products are then pooled and sequenced by NGS to produce stitched ISTs or “sISTs”.
Concatenated PCR products should, on average, be twice the length of single ORFs. To test the length limit of PCR stitching, we chose four DB-X and four AD-Y constructs of various ORF lengths: 500 bp, 1 kb, 2 kb, and 3 kb (Supplementary Fig. 2a). As expected, the first-step colony PCR reactions succeeded at amplifying all eight ORFs (Supplementary Fig. 2b). Second-step PCR reactions tested all 16 possible combinations, with the longest combination (A4–D4) over 6 kb. Concatenated ORF pairs up to 6 kb in total length were generated efficiently and accurately (Supplementary Fig. 2c).
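The expected sizes of the 16 stitched products follow directly from the ORF lengths and the 82 bp linker. The sketch below enumerates them under stated assumptions: the construct labels A1 to A4 and D1 to D4 are inferred from the “A4–D4” name and may not match the supplementary figure, the nominal lengths are taken at face value, and flanking vector or primer sequence is ignored.

```python
from itertools import product

LINKER_BP = 82  # linker length stated in the text

# Nominal ORF lengths of the four DB-X and four AD-Y test constructs.
# Labels are assumed for illustration only.
orf_lengths_bp = [500, 1000, 2000, 3000]
ad_constructs = {f"A{i}": n for i, n in enumerate(orf_lengths_bp, start=1)}
db_constructs = {f"D{i}": n for i, n in enumerate(orf_lengths_bp, start=1)}


def stitched_length(x_bp, y_bp, linker_bp=LINKER_BP):
    """Approximate stitched amplicon length, ignoring flanking sequence."""
    return x_bp + linker_bp + y_bp


# All pairwise combinations of the two construct sets.
combos = {
    (a, d): stitched_length(ad_constructs[a], db_constructs[d])
    for a, d in product(ad_constructs, db_constructs)
}
longest = max(combos, key=combos.get)
```

With these nominal values the longest product pairs the two 3 kb ORFs and comes to 6,082 bp, in line with the statement that the longest combination exceeds 6 kb.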
We next applied PCR stitching to pairs of ORFs identified from a Y2H screen aimed at expanding the human interactome map11. After Y2H screening of a 6K × 6K search space within the ORFeome 3.1 set of human ORFs12 with two rounds of phenotype testing, we selected ~5,200 positive colonies. PCR stitching applied to these colonies produced ~5,000 stitched PCR amplicons. We sequenced stitched amplicons with the 454 FLX platform7, producing ~400,000 reads. The average read length was 207 bases, which is 125 bases longer than the 82 bp linker sequence, so that many reads could unambiguously identify pairs of unique X and Y ORFs, thereby generating sISTs. To identify ORFs encoding pairs of interacting proteins, we selected reads that contained the linker sequence (~10%) and also covered at least 15 bases of ORF-specific sequence on both ends of the linker. After matching these sequences to human ORFeome v3.1 (ref. 12), we identified 2,089 unique sISTs.
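The read-selection rule (linker present, at least 15 ORF-specific bases on each side, both flanks assignable to an ORF) can be sketched as follows. This is an illustrative simplification and not the authors' software: it uses exact string matching where a real pipeline would need error-tolerant alignment and strand handling, and the linker and ORF sequences below are short made-up placeholders, not the real 82 bp linker or ORFeome entries. The function names are hypothetical.

```python
MIN_FLANK = 15  # minimum ORF-specific bases on each side of the linker (from the text)


def split_on_linker(read, linker, min_flank=MIN_FLANK):
    """Return (left_flank, right_flank) if the read spans the linker with
    enough sequence on both sides, otherwise None."""
    start = read.find(linker)
    if start == -1:
        return None
    left = read[:start]
    right = read[start + len(linker):]
    if len(left) < min_flank or len(right) < min_flank:
        return None
    return left, right


def assign_orf(flank, orfs):
    """Return the single ORF whose sequence contains the flank, else None."""
    hits = [name for name, seq in orfs.items() if flank in seq]
    return hits[0] if len(hits) == 1 else None


def call_sist(read, linker, orfs):
    """Return an (X, Y) ORF pair for a read, or None if the read is uninformative."""
    flanks = split_on_linker(read, linker)
    if flanks is None:
        return None
    x, y = assign_orf(flanks[0], orfs), assign_orf(flanks[1], orfs)
    return (x, y) if x is not None and y is not None else None


# Toy placeholder sequences, for illustration only.
toy_linker = "ACGTACGTACGTACGTACGTGG"
toy_orfs = {
    "X1": "ATGGCTAGCTAGGATCCGATCGATTACG",
    "Y1": "ATGCCCGGGTTTAAACCCGGGTTTAAAC",
}
good_read = toy_orfs["X1"][-16:] + toy_linker + toy_orfs["Y1"][:16]
short_read = toy_orfs["X1"][-10:] + toy_linker + toy_orfs["Y1"][:16]
sist = call_sist(good_read, toy_linker, toy_orfs)
rejected = call_sist(short_read, toy_linker, toy_orfs)
```

Requiring a unique ORF match for each flank is one conservative design choice; reads whose flanks match several ORFs (for example, splice isoforms) are simply discarded in this sketch.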
Figure 2 Human interactome (“HI-NGS”) produced by massively parallel interactome mapping. (a) ORF search spaces within the human ORFeome 3.1 space12 of HI1 (8K × 8K)11 and HI-NGS (6K × 6K). (b) Length distribution of 454 reads for ...
Comparison of interactome mapping and IST-sIST identification numbers at key pipeline steps for Sanger sequencing and Stitch-Seq protocols.
We experimentally retested all sISTs by pairwise Y2H, starting from fresh yeast transformants stored in our collection, and confirmed 1,318 pairs of ORFs as encoding Y2H-interacting proteins. Because the collection contains multiple ORFs for some genes (e.g., splice isoforms), the final tally was 979 interactions among proteins encoded by 997 genes. This confirmation rate is virtually identical to that previously described for Y2H screens using Sanger sequencing11. Furthermore, the confirmation rate does not vary between sISTs discovered only once and sISTs discovered multiple times (Supplementary Note 1 and Supplementary Fig. 3).
For comparison, we also sequenced all of the ~5,200 positive colonies individually by Sanger sequencing, and identified 820 interactions among proteins encoded by 914 genes. Of these, 633 interactions were also identified by 454 FLX sequencing. This overlap (633 of 820, ~77%) is higher than the expected overlap of ~70% (Supplementary Fig. 4), even taking into account a ~5% failure rate of PCR and Sanger sequencing reactions. We detected 19% more interactions using our “Stitch-Seq” strategy, probably because of the higher coverage of 454 FLX sequencing and the inherent failure rate of Sanger sequencing.
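The comparison between the two sequencing routes reduces to simple arithmetic on the reported counts. The short sketch below reproduces it; the variable names are chosen for illustration, and only the three input numbers come from the text.

```python
stitch_seq = 979  # interactions confirmed from 454 FLX sISTs
sanger = 820      # interactions identified by Sanger sequencing
shared = 633      # interactions identified by both routes

# Fraction of Sanger-identified interactions also found by 454 FLX.
overlap_fraction = shared / sanger
# Relative gain of Stitch-Seq over Sanger sequencing.
extra_fraction = stitch_seq / sanger - 1
# Interactions found by either route (inclusion-exclusion).
union = stitch_seq + sanger - shared
```

Rounded, these values give an overlap of about 77% and a gain of about 19%. The union of 1,166 equals the number of interactions reported for the combined HI-NGS dataset.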
We next quantitatively evaluated the quality of this interactome dataset using orthogonal interaction assays1,13. We selected 94 protein pairs at random from each of three groups of verified interactions: 1) those identified only by 454 FLX sequencing (“454 Unique”); 2) those identified only by Sanger sequencing (“Sanger Unique”); and 3) those identified by both (“454 and Sanger”). We combined these 282 interactions with positive and random reference sets (PRS and RRS) of 92 pairs each2,13, which serve to benchmark assay performance13. We tested the 466 pairs by two assays orthogonal to Y2H: a protein complementation assay (PCA)13 and a modified version of the nucleic acid programmable protein array (wNAPPA)13. In all three groups, the detection rate of new interactions was statistically indistinguishable from the PRS detection rate in both PCA and wNAPPA (all P values > 0.2), and significantly higher than that of the RRS pairs (all P values < 0.001). PRS interactions in the search space were recovered at the expected rate and no RRS pair was found (Supplementary Note 2). Because shorter products can amplify more efficiently than longer ones by PCR, our stitching scheme might have favored identification of shorter ORFs, but the size distributions of ORFs, as determined by both 454 FLX and Sanger sequencing, were identical to that of the ORFs in the previous human interactome version 1 (HI1)11. Thus, large numbers of high-quality sISTs can be identified in a single next-generation sequencing reaction.
Combining 454 and Sanger sequencing results produced a high-quality human interactome dataset, HI-NGS (Human Interactome produced with Next-Generation Sequencing), containing 1,166 interactions among proteins encoded by 1,147 human genes (Supplementary Table 2). This represents a 42% (1,149 novel interactions) increase over HI1 (ref. 11). The overlap of 127 interactions between the two datasets is consistent with the expected overlap of 138 pairs (Supplementary Note 3). The distribution of numbers of interactors per protein in HI-NGS is similar to that of previous datasets (Supplementary Fig. 5).
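The distribution of interactors per protein (the degree distribution) can be computed from a list of interaction pairs as sketched below. The edge list is a made-up toy example, not HI-NGS data, and the treatment of duplicates and self-interactions (each counted once) is an assumption of this sketch rather than a documented choice of the study.

```python
from collections import Counter


def degree_distribution(edges):
    """Map degree -> number of proteins with that many distinct interactors.

    Pairs are treated as undirected, duplicates are collapsed, and a
    self-interaction contributes one interactor to that protein.
    """
    unique = {tuple(sorted(edge)) for edge in edges}
    degree = Counter()
    for a, b in unique:
        degree[a] += 1
        if b != a:
            degree[b] += 1
    return Counter(degree.values())


# Hypothetical toy network: P1-P2 is listed twice, P4 self-interacts.
toy_edges = [("P1", "P2"), ("P2", "P1"), ("P1", "P3"), ("P4", "P4")]
dist = degree_distribution(toy_edges)
```

In the toy network, one protein has two interactors and three proteins have one each. Applying the same function to two datasets yields the distributions that a comparison like the one above would plot.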
HI-NGS network. (a) Network view (main connected component above the unconnected components) of HI-NGS (gold) produced with PCR stitching compared to HI1 (blue). (b) Degree distribution of HI-NGS compared to HI1.
Although the PCR stitching protocol involves one additional PCR reaction for each ORF pair compared to the traditional Y2H method, our strategy reduces the overall cost by at least ~40%, and should therefore allow increased throughput (Supplementary Fig. 6 and Supplementary Note 4). With continued improvement of NGS technologies, the cost of sequencing should keep diminishing14. Because 454 sequencing can accommodate lower capacity runs and because samples can be combined with other sequencing samples, there is no lower size limit for screens to which this method can be applied. The 82 bp linker has no identical sequence in all of GenBank, so sISTs can in principle be sequenced in combination with other samples (Supplementary Note 5). The approach should be as effective with cDNA library screens as it was here for an ORFeome library screen.
The linker length of 82 bp requires that the average read length be >100 bp for reliable identification of sISTs. Among existing NGS technologies, the 454 technology is to our knowledge the only one that reliably produces reads of more than 100 bp on average7. The application of paired-end sequencing15 to stitched PCR products would extend the approach to NGS platforms that have average read lengths of less than 100 bp (Supplementary Note 6).
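The read-length constraint can be made concrete with a small calculation. The sketch below assumes the 15-base flank criterion used for read selection in this study; the function names are hypothetical, and a different flank threshold would shift the minimum accordingly.

```python
LINKER_BP = 82     # linker length from the text
MIN_FLANK_BP = 15  # ORF-specific bases required on each side in this study


def min_spanning_read_length(linker_bp=LINKER_BP, flank_bp=MIN_FLANK_BP):
    """Shortest single read that covers the linker plus both flanks."""
    return linker_bp + 2 * flank_bp


def flank_budget(read_length, linker_bp=LINKER_BP):
    """Bases left for ORF sequence once the linker is fully covered."""
    return max(read_length - linker_bp, 0)


min_len = min_spanning_read_length()  # 82 + 2 * 15
avg_budget = flank_budget(207)        # budget at the reported average read length
```

Under this criterion, a single read must be at least 112 bases long to be informative, which is in line with the >100 bp requirement. A read of the reported average length of 207 bases leaves 125 bases to share between the two ORF flanks.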
The Stitch-Seq strategy implemented here for Y2H can be readily applied to other types of interaction assays, leading to improved capacity and expanded scope of interactome network mapping.