Domain-based interactome mapping
To define interaction domains, we developed a Y2H approach based on screening a PCR-generated library of systematically produced protein domains fused to the Gal4p activation domain (AD-Fragment library). This unbiased approach should identify novel protein interaction domains as well as domains corresponding to computationally defined domain signatures. In addition, using an AD-Fragment library should increase the completeness of interaction networks. Current interactome maps are far from complete, partly due to inherent limitations in the methods used (Venkatesan et al., personal communication). Y2H fusion proteins are frequently incapable of interacting, for example because they do not fold properly in yeast or because the full-length protein is locked in a “closed” conformation that masks potential interaction domains. The use of multiple fragments for each protein in a fragment library increases the probability that at least one fusion product will be capable of interacting in the assay. In addition, false negatives due to underrepresentation of particular proteins can be significantly reduced by using a normalized fragment library as we generate here (Reboul et al., 2003).
Strategy for generating the AD-Fragment library and effect on Y2H sensitivity and specificity
We first examined the effect of using a fragment library on specificity and detectability of the Y2H system based on a literature derived set of binary interactions between human proteins (Venkatesan et al., personal communication). Specifically, we tested if the AD-Fragment library approach could recover a higher fraction of 20 literature derived interactions than a full-length clone based approach, while retaining specificity, i.e. not identifying interactions between 20 random protein pairs that serve as a negative control. We recovered the 3 literature derived interactions that we previously found to test positive using full-length constructs (Venkatesan et al., personal communication), as well as 4 additional interactions already described in the literature. These findings are consistent with the idea that using a fragment library increases the sensitivity of the Y2H system. Importantly, we did not identify any of the 20 randomly selected protein pairs, suggesting that specificity is not dramatically decreased.
An early embryogenesis interactome domain map
To generate a high-quality early embryogenesis AD-Fragment library, we first generated sequence-verified wild-type full-length Gateway (Hartley et al., 2000) entry clones for 681 early embryogenesis proteins (Table S1 and File S1). These clones and an additional 68 full-length PCR products were used as templates in PCR reactions to generate fragments. Most self-folding domains are estimated to be between 100 and 200 residues long (Trifonov and Berezovsky, 2003). We generated all possible fragments up to a size of 800 base pairs (266 residues). In addition, we generated select fragment sizes between 800 base pairs and full length. Finally, for each ORF we generated three full-length constructs, starting at base pairs 1, 7, and 13, to increase the probability of identifying interactions with (nearly) full-length constructs. In total, we completed 32,158 PCRs for 804 ORFs corresponding to 749 genes, resulting in an average of 40 fragments per ORF (Table S2). PCR fragments were cloned into the Y2H AD vector and pooled to generate the final AD-Fragment library.
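The figures quoted in this paragraph can be checked with a few lines of arithmetic. The sketch below only restates numbers given in the text; the variable and function names are illustrative and are not taken from the original work.

```python
# Numbers quoted in the text above; all names here are illustrative.
MAX_FRAGMENT_BP = 800
TOTAL_PCRS = 32_158
TOTAL_ORFS = 804
FULL_LENGTH_STARTS_BP = (1, 7, 13)  # 1-based nucleotide start positions

# An 800 bp fragment encodes at most 266 complete codons.
max_fragment_residues = MAX_FRAGMENT_BP // 3

# Mean number of PCR fragments generated per ORF.
mean_fragments_per_orf = TOTAL_PCRS / TOTAL_ORFS


def start_codon_index(start_bp: int) -> int:
    """Return the 1-based codon index for a 1-based, in-frame start position."""
    if (start_bp - 1) % 3 != 0:
        raise ValueError("start position is not in frame")
    return (start_bp - 1) // 3 + 1


# The three (nearly) full-length constructs begin at codons 1, 3, and 5.
codon_starts = [start_codon_index(s) for s in FULL_LENGTH_STARTS_BP]
```

All three start positions are in frame, so the two shorter "full-length" constructs lack only the first 2 or 4 residues.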
As bait proteins, we generated 706 full-length Gal4p DNA binding domain (DB) fusion constructs that do not result in auto-activation of Y2H reporter genes (Walhout and Vidal, 2001a) (Table S2). To obtain the highest coverage possible, the AD-Fragment library should ideally be screened with multiple fusions for each bait protein. As this was not feasible for all ORFs, we tested the benefits of using multiple DB-ORF fusion constructs for two molecular machines: the centrosome and the nuclear pore complex (NPC). For 16 centrosome and 12 NPC proteins (Table S2), we generated 5 additional bait constructs corresponding to the N-terminal and C-terminal fragments spanning ~2/3 of the proteins, and the N-terminal, middle, and C-terminal fragments spanning ~1/3 of the proteins.
All DB-ORF strains were screened against the AD-Fragment library described above, as well as an AD-cDNA library generated from mixed stage C. elegans (a kind gift from X. Xin and C. Boone, U. Toronto). To increase the precision of our interaction data set, we eliminated de novo autoactivators that arose during the screening process (Vidalain et al., 2004; Walhout and Vidal, 1999), and included only those interactions found in two or more independent yeast colonies. The final data set involves 522 proteins and 755 Y2H interactions between them (Table S3), of which only 92 were previously published or identified by Y2H mapping. Of the 755 interactions, 472 were between early embryogenesis proteins.
Properties of the Y2H protein-protein interaction network
Experimental verification of interactions
To provide an overall estimate of the quality of our data set, we retested a sample of the identified interactions in an independent assay: the Mammalian Protein-Protein Interaction Trap (MAPPIT) (Eyckerman et al., 2001). MAPPIT is based on reconstitution of a JAK/STAT signaling pathway through interaction of a bait protein fused to a receptor lacking STAT binding sites with a prey protein fused to a STAT recruitment domain. Previously we found that MAPPIT recovers 25% ± 4.7% of 40 literature derived interactions between C. elegans proteins (Simonis et al., personal communication). We tested all pairs for which we had wild-type full-length Gateway clones of both proteins available (355 pairs, corresponding to 47% of all interactions). The overall proportion of pairs verified by MAPPIT was 20% ± 2.2%. This represents 80% of the maximum number of interactions expected to test positive using MAPPIT based on the retest rate of the literature derived pairs. Verification by MAPPIT was only attempted using full-length constructs. This is likely the main reason why interactions originally found with full-length AD-ORF fusions retested at a higher rate than those where only truncated AD-ORF clones were found (29% ± 4.1% and 16% ± 2.4%, respectively).
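The normalization behind the 80% figure can be written out explicitly. This is a minimal sketch using only the point estimates quoted above; it ignores the stated error margins, and the variable names are illustrative.

```python
# Point estimates quoted in the text; error margins are not propagated here.
literature_retest_rate = 0.25   # MAPPIT recovery of literature derived pairs
dataset_retest_rate = 0.20      # MAPPIT recovery of the Y2H pairs tested
pairs_tested = 355
total_interactions = 755

# Fraction of the data set that could be tested with full-length clones.
fraction_tested = pairs_tested / total_interactions

# Observed retest rate relative to the ceiling set by the literature pairs.
relative_verification = dataset_retest_rate / literature_retest_rate

print(f"tested: {fraction_tested:.0%}, relative verification: {relative_verification:.0%}")
```

Dividing by the literature retest rate treats the 25% recovery of known interactions as the assay's practical ceiling, which is how the text arrives at 80%.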
AD-Fragment library screens increase fraction of detectable interactions
Most interactions between early-embryogenesis proteins (376/472) were found only using the AD-Fragment library. This is likely due to a combination of in-depth screening of a normalized library, and detection of interactions that cannot be detected using full-length constructs. The AD-cDNA library derived interactions enabled us to examine the level of saturation of our AD-Fragment library screens, i.e. the fraction of interactions detected out of all interactions that can be identified using the exact Y2H procedure employed here. Out of 96 cDNA derived interactions where both proteins are present in the AD-Fragment library, we recovered 75 (78%) in the AD-Fragment library screens. This high recovery rate indicates that the AD-Fragment library screens approach saturation.
Most interactions were identified exclusively by AD-ORF clones smaller than the full-length ORF. For the AD-Fragment library, a full-length clone was identified for 34% of interactions, significantly less than the 60% expected based on the contents of the AD-Fragment library and the number of times the library was sampled (p < 1×10^-5). This indicates that we indeed identify interactions that are difficult or impossible to find using full-length clones.
We examined the properties of proteins that were only identified as truncated AD-ORF clones, and found that these proteins are much larger than those for which a full-length clone was observed (average 777 vs. 393 amino acids). We suspect that this is due to larger proteins folding less efficiently in yeast. In addition, although not statistically significant, proteins found as full length were enriched 3.4 fold for the Gene Ontology (GO) term ‘nuclear’, while proteins found only as truncated clones were enriched 4 and 4.6 fold for the GO terms ‘membrane’ and ‘membrane part’ respectively. This fits well with the notion that the Y2H system, which relies on interactions to occur in the nucleus, may have difficulty identifying interactions with membrane proteins.
Although the MAPPIT results already demonstrated the overall quality of the data set, we also examined whether certain protein regions taken out of context of the full-length protein may become promiscuous interactors. A promiscuously interacting fragment would result in a prey protein connected to many different bait proteins. Bait proteins were only tested as full-length constructs and would lack such highly connected promiscuous interactors. We therefore compared the distribution of connectivity of bait and prey proteins. We also compared the connectivity distribution of prey proteins found as full-length with prey proteins never found as full-length. In both cases we observed no significant difference (Mann-Whitney U test p-values >0.96 and >0.92 respectively). Thus, the use of fragments does not appear to result in additional promiscuous interactors.
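For readers unfamiliar with the test, the sketch below shows one way a Mann-Whitney U comparison of two connectivity distributions can be computed. It is a pure-Python illustration that uses the normal approximation without tie or continuity corrections, not the implementation used in this study, and the degree lists are made-up toy values.

```python
import math


def average_ranks(values):
    """Assign 1-based ranks to values, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U statistic for x, two-sided p-value by normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    return u1, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical toy data: number of partners per bait and per prey protein.
bait_degrees = [1, 2, 2, 3, 5, 8]
prey_degrees = [1, 1, 2, 3, 4, 9]
u_stat, p_value = mann_whitney_u(bait_degrees, prey_degrees)
```

A p-value close to 1, as reported above, means the ranks of the two groups are almost perfectly interleaved, which is what one expects if fragments do not create highly connected prey proteins.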
An expanded network of early embryogenesis
We compared our data set with the most recent version of the worm interactome (CCSB-WI8), which contains 108 interactions between early embryogenesis proteins (http://interactome.dfci.harvard.edu/C_elegans) (Simonis et al., personal communication). Our screens found 45 of these, and identified an additional 427 interactions between early embryogenesis proteins, a nearly 5-fold expansion of interactions between early embryogenesis proteins. In addition, the AD-cDNA library screens identified 283 interactions linking early embryogenesis proteins to the rest of the proteome.
We used two different criteria to establish the biological relevance of our data set. First, we found that 52 of our interactions were previously identified in C. elegans or as interologs (Matthews et al., 2001; Walhout and Vidal, 2001b) in other organisms (Table S4), as opposed to 4 interactions when the prey names were shuffled. This result supports the overall biological relevance of our interactions.
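The shuffled-prey control described above can be sketched as follows. This is an illustration under stated assumptions: the number of shuffles, the random seed, and the toy protein names are all hypothetical, and the text does not specify how many permutations were used.

```python
import random


def count_supported(pairs, reference):
    """Count bait-prey pairs that also occur in a reference set of known pairs."""
    return sum(1 for bait, prey in pairs if frozenset((bait, prey)) in reference)


def shuffled_prey_control(pairs, reference, n_shuffles=1000, seed=0):
    """Mean number of supported pairs after randomly reassigning prey names."""
    rng = random.Random(seed)
    baits = [bait for bait, _ in pairs]
    preys = [prey for _, prey in pairs]
    total = 0
    for _ in range(n_shuffles):
        rng.shuffle(preys)  # permute prey labels, keeping baits fixed
        total += count_supported(zip(baits, preys), reference)
    return total / n_shuffles


# Hypothetical toy data, not taken from the study.
observed = [("A", "B"), ("C", "D"), ("E", "F"), ("G", "H")]
reference = {frozenset(p) for p in [("A", "B"), ("C", "D"), ("E", "F")]}

observed_support = count_supported(observed, reference)
shuffled_support = shuffled_prey_control(observed, reference)
```

Shuffling prey names keeps the degree of every bait and the composition of the prey set unchanged, so the difference between the observed and shuffled counts reflects the specific pairings rather than which proteins were screened.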
We next compared the Y2H interactions with the RNAi phenotypes of the corresponding genes. Detailed phenotypic characterizations are available from RNAi experiments for most of the genes involved in early embryogenesis (Sönnichsen et al., 2005). Out of 320 interactions where a phenotypic profile was determined for both binding partners, 55 (17%) belonged to the same functional class. To determine the significance of this observation, we calculated the phenotypic similarity between each interacting protein pair (Gunsalus et al., 2005). We found a significant enrichment in protein pairs with similar phenotypes, as well as a significant depletion of pairs with low phenotypic correlation. In addition, interacting protein pairs were more likely to share functional annotations (GO terms), and show similar mRNA expression profiles.
Enrichment in similar phenotypes, GO terms, and mRNA expression profiles for interacting protein pairs
Finally, we examined whether interactions identified only by truncated clones are as biologically relevant as interactions where a full-length clone was identified. We therefore compared the enrichment in shared GO terms, phenotypes, and expression profiles between these subsets of interactions (Figure S2). We restricted the analysis of interactions where only truncated clones were identified to those interactions where a full-length clone was >50% likely to have been identified. Although the numbers that can be examined are low and there were variations, no significant differences were found between the two sets. Therefore, interactions where only truncated AD-ORF clones were found are not dramatically less biologically relevant by these criteria.
Centrosome assembly and nuclear pore complex architecture
We used our domain-based interaction data set to examine interactions within two different molecular machines: the nuclear pore complex (NPC) and the centrosome. The NPC is a symmetric molecular array whose structure has been solved at high resolution using conventional methods, whereas centrosomes, apart from the centriole, have no apparent ultrastructural organization. We first examined the results of using multiple DB-ORF fusion constructs for each bait protein. In the entire screen, 37% of full-length DB-ORF fusions yielded interactors. The use of 5 additional bait constructs for 28 centrosome and nuclear pore proteins resulted in the identification of interactors for 23 of these 28 proteins (82%), illustrating that greater coverage can be obtained by using multiple constructs for each bait protein.
Current understanding of NPC architecture is summarized in (adapted from (Alber et al., 2007; Lim and Fahrenkrog, 2006; Schwartz, 2005)). Out of 20 known C. elegans NPC proteins (Galy et al., 2003), we used the 12 identified as required for early embryogenesis as bait (Table S2). We identified 6 interactions between NPC proteins and 8 interactions between proteins located near the surface of the NPC and the nuclear import/export machinery. The relatively low number of binary interactions recovered within the core NPC is consistent with a view of the nuclear pore as an assembly of soluble multiprotein sub-complexes refractory to dissection as binary protein interactions. All but one of the 14 interactions identified are consistent with published interactions and EM localization data for proteins within the NPC (Alber et al., 2007; Lim and Fahrenkrog, 2006; Schwartz, 2005). Among the core components, the interaction between NPP-7 (NUP-153) and NPP-10 (NUP96) is novel and suggests a mechanism for anchoring the nuclear basket to the nuclear face of the NPC.
Y2H results of nuclear pore complex (NPC) and centrosome screens
illustrates current understanding of centrosome assembly during the first cell division of C. elegans, based primarily on a genetic hierarchy of localization dependencies (Oegema and Hyman, 2006). Centrosome assembly starts with duplication of the centriole, which requires sequential and dynamic recruitment of SPD-2, ZYG-1, and SAS-4, SAS-5, SAS-6 (Dammermann et al., 2008; Delattre et al., 2006; Pelletier et al., 2006). The Polo kinase PLK-1 is also localized to the centriole in a SPD-2 dependent manner (Kemp et al., 2004), although its role in centrosome function is less well understood. Following centriole duplication, the pericentriolar material (PCM) is assembled, a process that is critically dependent on SPD-5, a coiled-coil protein required to recruit all known effector components to the PCM (Dammermann et al., 2004; Hamill et al., 2002). Surprisingly, the only protein known to interact with SPD-5 to date is RSA-2, the centrosome targeting subunit of a protein phosphatase 2A (PP2A) complex (Schlaitz et al., 2007).
We recovered 12 interactions between proteins throughout the centrosome assembly pathway, indicating that this process can be viewed as a set of binary protein-protein interactions that can occur independently of one another. We identified all four previously described direct physical interactions (SAS-5/SAS-6, SPD-5/RSA-2, AIR-1/TPXL-1, and TAC-1/ZYG-9). The remaining intra-centrosomal interactions are novel physical interactions consistent with previous epistatic analyses. The homotypic interactions of SAS-5 and SPD-5 suggest a scaffolding role for these proteins in centriole duplication and PCM assembly, respectively. The binding of both SPD-2 and AIR-1 (the aurora A homolog in C. elegans) to SPD-5 provides a testable biochemical model for the genetic requirement of all three proteins for PCM growth. Moreover, both SAS-4 and SPD-2 are required for centriole duplication and bind PLK-1. As SPD-2 is required to target PLK-1 to the centrioles, the role of SPD-2 in centriole duplication might in part be the targeting of PLK-1 to SAS-4.
We also identified two novel interactors of RSA-2: the microtubule-associated proteins TAG-201 and EBP-1. TAG-201 is uncharacterized, while EBP-1 is an evolutionarily conserved protein that binds the growing plus-ends of microtubules. Functional analysis of RSA-2 binding to the microtubule-binding proteins should shed light on how PP2A stabilizes microtubules in mitosis.
Identification and validation of minimal regions of interaction
For each interaction, we defined the minimal region of interaction (MRI) as the smallest region shared by all interacting protein fragments. Our approach was sensitive enough to resolve two independent Ran-binding domains in NPP-9. The AD-Fragment library screens defined MRIs in 149 proteins. We observed a small tendency for MRIs to localize toward the C-terminus of proteins (Figure S3). On average, MRIs are 217 amino acids long and correspond to ~39% of their respective full-length protein. Only 30 proteins were found solely as full-length fusions. These proteins were generally small (average length 288 amino acids, compared to 565 for all proteins in the AD-Fragment library) and likely consist of a single globular domain that fails to fold properly when truncated. The AD-cDNA derived interactions define MRIs for an additional 134 proteins. However, as the AD-cDNA library contains mostly 5’ deletions, these MRIs are less well refined, with an average length of 400 amino acids, over 67% of their corresponding full-length proteins. Two examples of MRIs that fully encompass a structurally determined binding region are shown in Figure S4, and graphical representations of all MRIs are shown in Figure S5.
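The MRI definition above, the smallest region shared by all interacting fragments, amounts to intersecting residue intervals. The sketch below illustrates this with made-up coordinates; the function name and the inclusive 1-based coordinate convention are assumptions for the example, not details taken from the study.

```python
from typing import Iterable, Optional, Tuple

Interval = Tuple[int, int]  # inclusive, 1-based residue coordinates (assumed)


def minimal_region(fragments: Iterable[Interval]) -> Optional[Interval]:
    """Return the region shared by all fragments, or None if there is none."""
    fragments = list(fragments)
    if not fragments:
        return None
    start = max(s for s, _ in fragments)  # latest fragment start
    end = min(e for _, e in fragments)    # earliest fragment end
    return (start, end) if start <= end else None


# Hypothetical interacting fragments of one prey protein with one bait.
example_fragments = [(1, 300), (120, 400), (150, 350)]
mri = minimal_region(example_fragments)

# Fragments that share no residues yield None.
disjoint = minimal_region([(1, 100), (200, 300)])
```

An empty intersection would be one signal that a protein carries more than one independent binding region, as reported above for NPP-9; in that case the fragments would need to be grouped before an MRI is computed for each group.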
Identification and validation of minimal regions required for interaction (MRIs)
To verify the accuracy of the identified MRIs, we first compared them to published interaction domains. For 26 proteins in our data set, interaction domains were present in the literature. For 23 (88%), the MRI identified is consistent with the known interaction site of the C. elegans or orthologous protein, demonstrating the accuracy of our approach (Table S4). For three, we found a difference between our MRI and the interaction site of the orthologous human proteins. Differences in the MRIs in NPP-7 and NPP-9 and their human counterparts can be explained by evolutionary divergence between the proteins. For example, in our data set IMB-4 binds to the N-terminus of NPP-9, while the mammalian counterpart of IMB-4, Exportin1, binds to a Zinc-finger-rich region located in the center of the NPP-9 homolog RanBP2 (Singh et al., 1999). This region is largely lacking in NPP-9, and motif searches identify only one potential Zinc finger in NPP-9. Interestingly, this region appears subject to rapid evolution, as bovine, mouse, and human RanBP2 have 5, 6, and 8 Zinc fingers, respectively. It is generally assumed that maintaining interactions, especially essential ones, restricts evolutionary drift. These examples indicate that it is possible to maintain an interaction while changing the binding site.
Comparison of MRIs with computational domain predictions
To experimentally demonstrate the functional relevance of novel MRIs, we examined the subcellular localization of SAS-5 and RSA-2 MRIs by fusing them to GFP. SAS-5 localizes to centrioles in a SAS-6 dependent manner, while RSA-2 localizes to the PCM in a SPD-5 dependent manner. We generated transgenic lines expressing GFP fusions of the SAS-5 and RSA-2 MRIs responsible for binding to SAS-6 and SPD-5 respectively. The RSA-2 and SAS-5 MRIs accurately recapitulated the localization of the full-length proteins to the PCM and centrioles, respectively. SAS-5 MRI localization was observed starting at the ~32 cell stage. The recapitulation of subcellular localization by MRIs further demonstrates their relevance in vivo.
Comparison of MRIs with computational predictions
Although protein interactions have traditionally been viewed as being between two structured domains, many interactions involve one structured domain and a short, linear amino acid motif (Davey et al., 2006; Puntervoll et al., 2003) typically present in a disordered loop or tail (Fuxreiter et al., 2007; Mohan et al., 2006). To better understand the structural composition of the MRIs delineated, we examined them for overlap with computational domain and structure predictions (Table S5). The predictors used were: Pfam-A and Superfamily, two collections of manually curated domain signatures (Finn et al., 2008; Gough et al., 2001); Pfam-B, a collection of automatically generated domain signatures (Finn et al., 2008); Ginzu, a protocol using orthologous protein sequences to predict the boundaries of globular domains (Chivian et al., 2003); COILS, a coiled-coil prediction algorithm (Lupas et al., 1991); and two different predictors of disordered regions: PONDR VL-XT (Li et al., 1999; Romero et al., 2001) and VSL2 (Obradovic et al., 2005; Peng et al., 2006). We did not observe enrichment of any domain predictions in MRIs compared to the whole proteins.
We used the overlap between MRIs and the domain predictions to classify our MRIs as known folding region (Pfam-A, Superfamily, structure-based Ginzu), predicted folding region (Pfam-B, coiled-coil, non-structure-based Ginzu), unstructured region (>50% of residues predicted to be disordered), or potential new folding region. As minimal overlap cutoffs for classifying an MRI we used 20%, 40%, 60%, or 80% of the MRI length. Depending on the cutoff chosen, the fraction of novel folding and disordered MRIs ranges from 14% to 38%. Interactions with peptide motifs are especially difficult to predict, because they appear frequently at random in a protein. Our data should help narrow searches for linear motifs that mediate interactions.
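The classification scheme described above can be sketched as a small decision function. The order in which categories are tested (known, then predicted, then unstructured, then new) is an assumption made for this illustration, since the text does not state how conflicts between predictors were resolved; the function names and example values are likewise hypothetical.

```python
def overlap_fraction(mri, regions):
    """Fraction of MRI residues covered by any predicted region (inclusive coordinates)."""
    start, end = mri
    covered = set()
    for r_start, r_end in regions:
        covered.update(range(max(start, r_start), min(end, r_end) + 1))
    return len(covered) / (end - start + 1)


def classify_mri(known_overlap, predicted_overlap, disordered_fraction, cutoff):
    """Assign an MRI to one of the four classes named in the text.

    The precedence of the checks is an assumption of this sketch.
    """
    if known_overlap >= cutoff:
        return "known folding region"
    if predicted_overlap >= cutoff:
        return "predicted folding region"
    if disordered_fraction > 0.5:
        return "unstructured region"
    return "potential new folding region"


# Hypothetical MRI spanning residues 1-100 with one Pfam-A hit at 51-150.
known = overlap_fraction((1, 100), [(51, 150)])
labels = {
    cutoff: classify_mri(known, predicted_overlap=0.0,
                         disordered_fraction=0.1, cutoff=cutoff)
    for cutoff in (0.2, 0.4, 0.6, 0.8)
}
```

In this toy case the same MRI is a known folding region at the 20% and 40% cutoffs and a potential new folding region at 60% and 80%, which shows why the reported fraction of novel and disordered MRIs depends on the cutoff.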
Finally, we compared our experimentally defined MRIs with binding sites predicted by InSite, a recently developed algorithm that predicts protein-protein interaction binding sites based on the domain composition of proteins (Wang et al., 2007). We used InSite to predict Pfam-A binding sites for those interactions where the MRI overlaps with a single Pfam-A domain, and the protein contains more than one Pfam-A domain. For 78 interactions satisfying these criteria, 53 binding site predictions (68%) matched our experimentally defined MRI. Randomly assigning a Pfam-A domain as binding site for each interaction results in a 35% overlap with our MRIs. The high overlap between binding site predictions and experimentally defined MRIs further highlights the quality of our approach.
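The comparison with random domain assignment can be expressed as a simple ratio. The sketch below uses only the counts and percentages quoted above; the fold-enrichment value is a derived illustration and is not a statistic reported in the text.

```python
# Counts and rates quoted in the text; names are illustrative.
interactions_evaluated = 78
predictions_matching_mri = 53
random_match_rate = 0.35

# Fraction of InSite predictions that agree with the experimental MRI.
insite_match_rate = predictions_matching_mri / interactions_evaluated

# How much better the predictions agree with MRIs than random assignment does.
fold_over_random = insite_match_rate / random_match_rate

print(f"InSite match rate: {insite_match_rate:.0%}, "
      f"{fold_over_random:.1f}-fold over random assignment")
```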