The sequence specificity of DNA-binding proteins is the primary mechanism by which the cell recognizes genomic features. Here, we describe systematic determination of yeast transcription factor DNA-binding specificities. We obtained binding specificities for 112 DNA-binding proteins representing 19 distinct structural classes, one-third of which have not been previously reported. Several newly discovered binding sequences have striking genomic distributions relative to transcription start sites, supporting their biological relevance and suggesting a role in promoter architecture. Among these are Rsc3 binding sequences, containing the core CGCG, which are found preferentially ~100 bp upstream of transcription start sites. Mutation of RSC3 results in a dramatic increase in nucleosome occupancy in hundreds of proximal promoters containing a Rsc3 binding element, but has little impact on promoters lacking Rsc3 binding sequences, indicating that Rsc3 plays a broad role in targeting nucleosome exclusion at yeast promoters.
The targeting of a transcription factor (TF) to specific genomic loci is determined by its DNA-binding activity, which is typically encoded by a conserved DNA-binding domain (DBD), together with cofactor interactions and the chromatin state of potential targets (Barrera and Ren, 2006). A foundation of any complete and accurate model of transcriptional regulation will be knowledge of the sequence specificities of DNA-binding proteins (Beer and Tavazoie, 2004; Segal et al., 2008). Despite intense study, there is currently no organism for which a complete encyclopaedia of such TF sequence specificities exists. Even in the well-studied yeast S. cerevisiae, prior to this study, binding sequences were understood with confidence for only about half of its ~200 TFs. The majority of yeast TFs have been analyzed by ChIP-chip, but even when assayed under several different growth conditions (Harbison et al., 2004), these experiments often fail to identify either significant binding events or associated motifs, presumably because the TF is not binding DNA under the assay conditions. Further complicating de novo motif identification is the possibility that ChIP-chip and related techniques (e.g. ChIP-seq) may identify binding sequences for cofactors rather than the intended TF (Carroll et al., 2005). In some cases it may be possible to infer TF sequence preferences on the basis of similarity among DBDs or identities of DNA-contacting residues (Berger et al., 2008; Wolfe et al., 2000), but for no DBD class is there a complete and accurate combinatorial code that dictates sequence specificity.
Incomplete knowledge of TF binding specificities hinders our understanding of basic mechanisms of transcription and nuclear organization. For example, RSC (remodel the structure of chromatin) is an abundant nuclear protein complex with a role in nucleosome organization at many yeast promoters (Cairns et al., 1996; Ng et al., 2002; Parnell et al., 2008). RSC contains two Gal4-class transcription-factor-like proteins (Rsc3 and Rsc30) with very similar amino acid (AA) sequences but apparently different cellular functions (Angus-Hill et al., 2001; Wilson et al., 2006). Neither Rsc3 nor Rsc30 has known sequence specificity, and the mechanisms that target RSC to individual loci remain poorly-defined.
More generally, the mechanisms responsible for nucleosome-free regions (NFRs) in yeast promoters are incompletely understood. Current models of intrinsic nucleosome-DNA preference do not explain all of the observed nucleosome positioning and occupancy (Lee et al., 2007; Segal et al., 2006; Yuan and Liu, 2008). TF binding sequences are often enriched in NFRs (Lee et al., 2007; Liu et al., 2006), and in at least some cases TFs make strong contributions to the local chromatin landscape. For example, Abf1, Reb1, and Rap1 are found frequently in yeast promoters, and are able to define chromatin domains and enable activation or repression by other TFs in diverse pathways (Chasman et al., 1990; Elemento and Tavazoie, 2005; Fourel et al., 2002; Planta et al., 1995). Abf1, Reb1, or Rap1 binding sites are found in only a minority of promoters, however (Harbison et al., 2004), highlighting the probability that additional nucleosome-displacing factors, or combinations of factors, remain to be identified.
Here, we have measured the sequence preferences of the majority of yeast TF DBDs, using a combination of systematic microarray-based approaches. These data provide a resource for genomic analyses, and for the study of the evolution of both the genome and the TFs themselves. Our data include binding preferences for 36 proteins for which there was previously no reported binding specificity information, and provide independent support for many more that were previously inferred from ChIP-chip or identified on the basis of one or a few binding sequences. Among the proteins for which we have defined specificities for the first time are Rsc3 and Rsc30. Binding sequences for these proteins occur preferentially between −125 and −75 upstream of TSS, and Rsc3 is essential for the maintenance of a nucleosome-free region in hundreds of yeast promoters as well as transcript abundance from these promoters.
We began by creating a list of 218 yeast proteins that either contain a TF DBD or are known to bind to specific DNA sequences and regulate transcription (Supplementary Table 1). We were able to clone 207 of the 218 DBDs (or full-length proteins in the event that the DBD is unknown) as GST and/or MBP fusion proteins, and upon expression obtained a protein for 195. We analyzed the sequence specificities of these 195 using at least one of three methods: (i) Protein Binding Microarrays (PBMs), in which the proteins are applied to an Agilent microarray consisting of 40,330 double-stranded 60-mers, each containing a unique 35-mer, such that all 10-mers are represented once and only once (Berger et al., 2006; Mintseris and Eisen, 2006); (ii) Cognate Site Identifier (CSI) (Warren et al., 2006), in which proteins are applied to a Nimblegen array of 262,148 DNA hairpins each containing an 11bp randomized region permitting display of all possible 10-mers; and/or (iii) DNA immunoprecipitation chip (Dip-chip) (Liu et al., 2005), in which a purified transcription factor, bound to yeast genomic DNA, is immunoprecipitated in vitro and analyzed using microarrays.
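The probe count quoted for the PBM design follows from simple counting: a 35-mer contains 35 − 10 + 1 = 26 overlapping 10-mers, and 4^10 / 26 rounds up to 40,330. The sketch below checks that arithmetic and builds a small de Bruijn sequence (order 3 rather than 10, to keep it fast) to illustrate the "once and only once" property. The function names are illustrative, and the construction is the textbook Lyndon-word algorithm, which is an assumption here and not necessarily the procedure used to design the actual arrays.

```python
import math
from collections import Counter


def probes_needed(k: int, variable_len: int, alphabet_size: int = 4) -> int:
    """Fewest probes whose variable regions can jointly hold every k-mer once."""
    kmers_per_probe = variable_len - k + 1
    return math.ceil(alphabet_size ** k / kmers_per_probe)


def de_bruijn(alphabet: str, k: int) -> str:
    """Cyclic de Bruijn sequence of order k (standard Lyndon-word construction)."""
    n = len(alphabet)
    a = [0] * (n * k)
    seq = []

    def db(t: int, p: int) -> None:
        if t > k:
            if k % p == 0:
                seq.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, n):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in seq)


def kmer_counts(cyclic_seq: str, k: int) -> Counter:
    """Count every k-mer in a cyclic sequence by wrapping the first k-1 bases."""
    linear = cyclic_seq + cyclic_seq[:k - 1]
    return Counter(linear[i:i + k] for i in range(len(cyclic_seq)))


if __name__ == "__main__":
    print(probes_needed(k=10, variable_len=35))
    toy = de_bruijn("ACGT", 3)
    print(len(toy), set(kmer_counts(toy, 3).values()))
```

A real design would additionally cut the long sequence into overlapping 35-mers and add constant flanking sequence to reach 60 bases; that step is omitted here.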
Supplementary Table 1 and our project website contain a summary of which proteins were analyzed by each method, and details on motif derivation. The majority of data produced resulted from PBMs (Berger et al., 2006). To discover the motifs preferentially bound by each protein in the PBM experiments, we first took the median signal intensity across the array from the 32 spots containing each 8-mer, and expressed this as a Z-score (Berger et al., 2006). We then sought DNA sequence motifs (Position Weight Matrices or PWMs) that produced predicted binding scores (Granek and Clarke, 2005) that correlated with the 8-mer based Z-scores for each factor (see Experimental Procedures for details). The 112 resulting motifs identified are shown in Fig 1. Fig 2A illustrates how the PWM-derived scores correlate with the 8-mer Z-score data for Gzf3. Fig 2B, which shows a comparison of 8-mer Z-scores obtained for Gzf3 using either PBM or CSI, demonstrates that the imperfect correlation cannot be attributed primarily to measurement noise in the assay or the array platform, because the 8-mer profile is consistent between these two different experiment types, even among less-preferred 8-mers. This observation may reflect shortcomings in PWM and consensus models (Benos et al., 2002). PWMs do, however, identify the best binding sequences in all of our experiments, and since they are compact, intuitive, and compatible with existing analysis techniques, we used PWMs for the remainder of our analyses.
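The 8-mer summarization step described above can be sketched as follows. This is a minimal illustration under stated assumptions: each 8-mer is pooled with its reverse complement, the median is taken over all probes containing it, and the Z-score is a plain (median − mean) / standard deviation across 8-mers. The exact normalization of Berger et al. (2006) may differ, and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median, pstdev

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Represent a k-mer and its reverse complement by one shared key."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def kmer_median_intensity(probes, k: int = 8) -> dict:
    """probes: iterable of (sequence, intensity). Returns median intensity per k-mer."""
    values = defaultdict(list)
    for seq, intensity in probes:
        # Count each k-mer once per probe, regardless of orientation.
        present = {canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)}
        for kmer in present:
            values[kmer].append(intensity)
    return {kmer: median(v) for kmer, v in values.items()}


def z_scores(medians: dict) -> dict:
    """Z-score of each k-mer median relative to all k-mers (assumes non-zero spread)."""
    mu = mean(medians.values())
    sd = pstdev(medians.values())
    return {kmer: (m - mu) / sd for kmer, m in medians.items()}


if __name__ == "__main__":
    toy_probes = [("ACG", 10.0), ("CGT", 2.0), ("TTT", 4.0)]  # toy data, k = 2
    med = kmer_median_intensity(toy_probes, k=2)
    print(med)
    print(z_scores(med))
```

On a full array each non-palindromic 8-mer would contribute 32 values to its median, as stated in the text; the toy input only demonstrates the bookkeeping.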
We next asked if the 112 motifs we obtained agree with those previously identified for the same proteins, from either global ChIP-chip analysis (Harbison et al., 2004; MacIsaac et al., 2006) or individual studies in the literature (Nash et al., 2007, and others), by manual comparison of logos, consensus sequences, and individual binding sites (Supplementary Table 1). Sixty-three of our motifs bear an obvious correspondence to previous information (although not always all previous information), while 11 are inconsistent. The remaining 38 represent newly discovered specificities, although most of these motifs are consistent with expectations in some way (see below).
For some of the 11 discrepancies, additional evidence suggests that our measurements are likely to represent at least a correct in vitro monomeric binding sequence (Supplementary Table 2). For example, our Fhl1 motif is a close match to that of its human homolog, FoxN1 (Schlake et al., 1997). Our motifs for Stp4 and Yml081w are very similar to those we obtained from Stp3 and Zms1, respectively, their corresponding yeast paralogs that arose from an ancient whole genome duplication (WGD) (Kellis et al., 2004). We verified by Electrophoretic Mobility Shift Assay (EMSA) that Stp3 and Yml081w bind to DNA sequences matching our motifs and not those previously described (Supplementary Fig 1).
A few other discrepancies can be explained by the methodology we employed. For example, the A/T-rich motif we obtained for Sum1 is different from the published motif because when cloning DBDs we selected the N-terminal AT hook domain rather than the C-terminal fragment, which binds the established Sum1 motif but does not contain a known conserved domain (Pierce et al., 2003). Despite this discrepancy, promoter scans with our Sum1 motif do have a high correspondence to ChIP-chip results, suggesting that this additional DNA-binding activity of Sum1 may contribute to targeting in vivo (Spearman correlation P < 10^−92; Wilcoxon rank sum P < 0.000011 with 61 targets defined by Harbison et al. (2004) at P < 0.001).
Other variations from the literature are likely reproducible in vitro phenomena that are characteristic of members of a structural class. Four of the eight GATA-class proteins we analyzed (Ecm23, Srd1, Gat3, and Gat4) bound unexpectedly to sequences resembling the palindrome AGATCT. No binding sequences have been described for three of these four proteins, Ecm23, Srd1, or Gat4, and we know of no other in vitro or in vivo data that confirms or refutes our observations. A noncanonical motif different from AGATCT was derived for the fourth protein, Gat3, on the basis of ChIP-chip and sequence conservation of putative target sites (MacIsaac et al., 2006), and has not been experimentally pursued to our knowledge. Our motif does not correlate with the ChIP-chip data, which is highly enriched for subtelomeric loci. However, we confirmed by EMSA that Gat3 binds the sequence we identified more strongly than the sequence identified by ChIP-chip, and that Ecm23 binds to the newly-identified motif (Supplementary Fig 1).
Three of the discrepancies (Ecm22, Put3, and Ume6) are for Gal4-class proteins, which also have characteristic behaviour in our analyses. It appears that our data largely capture monomeric specificities, rather than the dimeric motifs typically associated with proteins in this class (MacPherson et al., 2006) (for all DBD classes, we counted correct monomeric specificities as consistent with previous information for dimeric proteins). Still, all but two of the motifs we obtained for Gal4-class proteins do contain the expected CGG core sequence (MacPherson et al., 2006), which is not always the case for the motifs derived from other studies. The capture of monomeric specificities could be a consequence of the domain definitions used for expression, or the epitope-tagging strategy. In order to include dimerization contacts, our Gal4-class constructs included 50 AAs of flanking sequence beyond the boundaries of the DBD (or to the end of the protein if within 50 AAs). The choice of flanking sequence length was based on inspection of a number of Gal4-class protein-DNA complexes, all of which are dimers in the crystal. However, the family is structurally diverse in the way the DBD dimerizes, and it may be that for some members of the family the flanking sequence that was included was insufficient to mediate dimerization. In addition, our constructs are N-terminal GST fusions; Gal4-class DBDs are typically found at the N-terminus of yeast proteins, and either dimerization or DNA-binding by dimers may be intolerant of, or otherwise influenced by, N-terminal GST tags. The array designs we used may also fail to detect long motifs, because the arrays are designed primarily to detect sequences up to ~10 bases (for PBM and CSI).
Nonetheless Gal4-class proteins do sometimes function in vivo as monomers (Kim et al., 2003; Larochelle et al., 2006; Vik and Rine, 2001), and several of our monomeric motifs are enriched in the promoters of functionally-related genes and at specific promoter positions (see below).
Most of the 36 proteins we classified as having no previously established binding sequences are members of structural classes that have characteristic binding site properties, and many are members of gene families that might be expected to share related sequence specificities. Indeed, most of our new motifs conform to expectation. The C2H2 zinc finger family provides several such examples (Fig 3). All three Mig proteins share virtually identical DNA-binding activities, as expected (Lutfiyya et al., 1998), as do Stp3 and Stp4 as described above. In contrast, C2H2 zinc-finger proteins with unique motifs (Azf1, Crz1, Fzf1, Rpn4, Rei1, Rim101) all have less than 60% identity to any other yeast protein in the DBD. ClustalW-derived phylograms similar to Fig 3 are given for all other structural classes in Supplementary Fig 2. Three major observations include: (i) Two Gal4-class proteins with related DBD sequences, Rsc3 and Rsc30, prefer sites that contain CGCG rather than the CGG typical of this class of proteins. Not coincidentally, perhaps, these two proteins are also unusual in having glycine at a position that is almost always lysine or arginine (corresponding to K20 in the Gal4 DBD). The lysine or arginine normally found at this position is in close proximity to the phosphate backbone in crystal structures of protein-DNA complexes (Supplementary Fig 3). It is also just two positions C-terminal to the residue that makes base-specific contacts to the usual CGG half-site. Thus, the unusual glycine at this position in Rsc3 and Rsc30 may affect the orientation of the domain with respect to DNA, resulting in the unusual DNA binding specificity discovered here. (ii) Dot6 and Ybl054w, a pair of related SANT domain proteins originating from the WGD (Kellis et al., 2004), both bound to sequences containing the core CGATG, which resembles the PAC (Polymerase A and C) motif (Dequard-Chablat et al., 1991). 
However, we found no evidence indicating that they bind to the promoters of genes containing these motifs (Harbison et al., 2004). (iii) We obtained similar motifs containing the core TGTCA for Tos8 and Cup9, a pair of homeodomain proteins originating from the WGD. Neither protein has previously-established binding specificity.
We next scanned the yeast genome with the motifs and asked if the potential binding sites for each TF are associated with genes in shared functional classes. Twenty-seven of the 112 motifs had a hypergeometric P-value of < 0.000005 (corresponding to a Bonferroni-corrected P-value of 0.01) for enrichment of at least one GO Biological Process category among the top 100 promoter/motif hits. Expected enrichments include Ste12 (Sterile 12), with “cell-cell fusion” (P < 2.2 × 10^−14) and Pdr1 (Pleiotropic Drug Resistance), with “response to drug” (P < 1 × 10^−6). Our analysis is consistent with the function of Rgt1 (Restores Glucose Transport 1) as a Gal4-class TF that binds DNA as a monomer in vivo (Kim et al., 2003), since our monomeric motif is associated with “hexose transport” (P < 6.1 × 10^−10). Ypr196w and Ydr520c binding sequences were also enriched in the promoters of hexose transporters (P < 2.4 × 10^−8; 6.35 × 10^−7); the motifs for these proteins are related to that of Rgt1 and the top promoter/motif matches are found in an overlapping but not identical set of transporters, suggesting a more complex regulatory network of sugar utilization than that currently known. We were also intrigued to find that the monomeric motif we obtained for Lys14 has the same enrichment in promoters of lysine biosynthesis genes as the established dimeric motif (P < 3.8 × 10^−6 for both), suggesting that both binding modes may be used in vivo.
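For readers unfamiliar with the enrichment test, the hypergeometric upper-tail probability and the Bonferroni adjustment can be sketched as below. The pairing of a 0.000005 per-test cutoff with a corrected value of 0.01 implies a correction factor of 2,000; treating that factor as the number of categories tested is an assumption of this sketch. The function names are illustrative, and the example numbers are toy values, not data from this study.

```python
from math import comb


def hypergeom_upper_tail(k: int, population: int, annotated: int, drawn: int) -> float:
    """P(overlap >= k) when `drawn` genes are sampled without replacement from
    `population` genes, of which `annotated` belong to the category."""
    total = comb(population, drawn)
    upper = min(annotated, drawn)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


def bonferroni_per_test_cutoff(alpha: float, n_tests: int) -> float:
    """Per-test P-value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    # Toy example: 6,000 genes, 40 in a category, top 100 hits, 8 of them annotated.
    p = hypergeom_upper_tail(8, 6000, 40, 100)
    print(p, p < bonferroni_per_test_cutoff(0.01, 2000))
```

Exact integer arithmetic via `math.comb` avoids the underflow that can affect naive floating-point factorials for small tail probabilities.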
We next examined how the occurrences of the motifs we discovered were distributed within promoters. Fig 4A shows that most of our 21 monomeric Gal4 motifs occur preferentially in the position of the NFR (approximately −130 to −50, relative to TSS), providing support for their widespread in vivo relevance. Fig 4B shows 14 motifs we classified as new and unexpected; several of these are also located preferentially in the NFR. The most striking instances are Rsc3 and Rsc30, which share very similar binding preferences to sequences containing CGCG. At a stringent motif score threshold, these sequences are 16-fold more likely to occur in the position of the NFR than they are within genes. Only a handful of other TFs have this extreme bias (Lee et al., 2007), most notably Abf1 and Reb1, which are capable of remodelling chromatin in the vicinity of their binding sites. At a more liberal PWM score threshold, 708 yeast genes contain a potential Rsc3 binding sequence in the NFR region (−130 to −75), compared to only 146 found in an identical amount of ORF sequence. These 708 genes represent a broad spectrum of functional classes, including 169 (of 1101) that are essential for cell viability (hypergeometric P < 2.2 × 10^−6). Given that RSC is an abundant protein complex that repositions nucleosomes (Angus-Hill et al., 2001; Cairns et al., 1996; Parnell et al., 2008), we reasoned that Rsc3 and Rsc30 may play a broad role in directing the establishment or maintenance of nucleosome-free regions in promoters. We focused on Rsc3 because it is essential, and therefore its activity is required under typical laboratory growth conditions.
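The positional analysis above amounts to scanning sequence upstream of each TSS with a PWM and asking whether a hit above threshold falls inside a chosen window. A minimal sketch follows, assuming a log-odds score against a uniform background, scoring of both strands, and a toy CGCG matrix. The matrix, the threshold, and the function names are all hypothetical; they are not the Rsc3 PWM or the cutoffs used in this study.

```python
import math

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def log_odds(pwm, background: float = 0.25):
    """pwm: list of {base: probability} dicts, one per motif position."""
    return [{b: math.log2(col[b] / background) for b in "ACGT"} for col in pwm]


def hit_starts(seq: str, lod, threshold: float):
    """0-based start offsets where either strand scores at or above threshold."""
    width = len(lod)
    hits = []
    for i in range(len(seq) - width + 1):
        site = seq[i:i + width]
        score = max(
            sum(col[base] for col, base in zip(lod, s))
            for s in (site, revcomp(site))
        )
        if score >= threshold:
            hits.append(i)
    return hits


def has_hit_in_window(upstream: str, lod, threshold: float, window=(-130, -75)) -> bool:
    """`upstream` ends at the TSS, so its last base is position -1."""
    width = len(lod)
    for i in hit_starts(upstream, lod, threshold):
        first = i - len(upstream)      # position of the first motif base
        last = first + width - 1       # position of the last motif base
        if first >= window[0] and last <= window[1]:
            return True
    return False


# Toy CGCG matrix: 0.97 for the consensus base, 0.01 for each alternative.
TOY_PWM = [
    {b: (0.97 if b == c else 0.01) for b in "ACGT"} for c in "CGCG"
]

if __name__ == "__main__":
    lod = log_odds(TOY_PWM)
    promoter = "A" * 100 + "CGCG" + "A" * 96   # motif starts at -100
    print(has_hit_in_window(promoter, lod, threshold=7.0))
```

Counting the genes for which this returns true, and repeating the scan over an equal amount of ORF sequence, would give the kind of NFR-versus-ORF comparison reported in the text.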
Three previous studies have analyzed RSC binding sites in the yeast genome using ChIP-chip (Damelin et al., 2002; Ng et al., 2002; Parnell et al., 2008), two involving Rsc3. Promoters containing the Rsc3 motif displayed a statistically significant correspondence to overall RSC occupancy in these previous studies: among 5,015 (4,947 with ChIP-chip data) yeast genes with well-defined TSS (Lee et al., 2007), 2,325 (2,296 with ChIP-chip data) have a match to our Rsc3 motif (using our liberal cutoff). Among these are 416 of 667 RSC targets defined in Ng et al. (2002), using a combined P-value cutoff of < 0.01 (the P-value of this overlap among 4,947 genes is P < 4.36 × 10^−19). The correspondence to Rsc3 ChIP-chip occupancy (defined in Ng et al., 2002 using a P-value cutoff < 0.01) is lower, although still significant (162 out of 293 targets; P < 0.0011). We note, however, along with others (Parnell et al., 2008), that ChIP-chip experiments with RSC subunits, particularly Rsc3, tend to have very low enrichment ratios. One possible explanation, consistent with the activity of RSC as an enzyme that displaces nucleosomes, may be that the association of RSC with target promoters is transient, as may be the case for the DNA-binding TFIIIC module, which also has relatively low ChIP-chip enrichments (Roberts et al., 2003; Soragni and Kassavetis, 2008). We therefore sought an alternative functional assay to ask if Rsc3 binding sites in promoters influence nucleosome occupancy.
We assayed nucleosome occupancy in the rsc3-1 mutant (Angus-Hill et al., 2001) using full-genome tiling arrays with 4-nt resolution (Lee et al., 2007). The biochemical defect of rsc3-1 is unknown, but the mutations (M709I and L828S) are outside the DBD (AA1-37). We compared nucleosomal DNA enrichment (i.e. ratio of nucleosomal DNA vs. total genomic DNA) in the rsc3-1 mutant to that in an isogenic wildtype control grown at the same temperature (37 degrees, for 6 hours). Fig 5A shows an example locus in which nucleosome depletion over a Rsc3 binding sequence in a promoter region is dependent on RSC3. Fig 5B shows that this phenomenon occurs at many yeast promoters, with a clear preference for the affected region to be located near −100 from TSS. Moreover, the location of the increase in nucleosome occupancy (and the position of the NFR itself) tracks with the Rsc3 binding sequence across hundreds of promoters. Such changes are not observed at promoters that do not contain Rsc3 binding sequences (Fig 5C); in fact, nucleosome occupancy appears to decrease in these promoters, perhaps as a consequence of microarray signal normalization or redistribution of nucleosomes in vivo. This observation illustrates specificity of this phenomenon for Rsc3 binding sequences, and not just NFRs in general. Unlike a previous study that used a greater tiling interval on selected promoters to examine the effects of mutating another RSC subunit (Parnell et al., 2008), we saw little or no effect on nucleosome positioning or occupancy at tRNA genes (Supplementary Fig 4), indicating that the effects we observed are distinct from a general loss of RSC activity. We also surveyed RNA abundance in the rsc3-1 strain using the same arrays, and observed a clear trend in which the Pol II promoters with an increase in nucleosome occupancy tend to exhibit lower RNA abundance (Fig 6). 
Overall, our results are consistent with a function for Rsc3 in nucleosome removal and promoting transcription from Pol II promoters that contain Rsc3 binding sequences in the NFR region.
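The comparison underlying these observations can be illustrated with a small sketch: compute the nucleosomal enrichment at each probe, subtract wildtype from mutant, and average the change in a window around a binding sequence. The log2 scale, the simple averaging, and the function names are assumptions made for illustration; the actual analysis also involves array normalization steps that are not shown.

```python
import math


def log2_enrichment(nucleosomal, genomic):
    """Per-probe log2 ratio of nucleosomal DNA signal to total genomic DNA signal."""
    return [math.log2(n / g) for n, g in zip(nucleosomal, genomic)]


def occupancy_change(mutant, wildtype):
    """Per-probe difference in log2 enrichment (mutant minus wildtype)."""
    return [m - w for m, w in zip(mutant, wildtype)]


def mean_change_near(positions, change, center, half_width=50):
    """Average change over probes within half_width bases of `center`."""
    values = [c for p, c in zip(positions, change) if abs(p - center) <= half_width]
    return sum(values) / len(values) if values else float("nan")


if __name__ == "__main__":
    # Toy probes at positions relative to the TSS; values are invented.
    positions = [-200, -150, -100, -50, 0]
    genomic = [1.0] * 5
    wt = log2_enrichment([2.0, 2.0, 0.5, 2.0, 2.0], genomic)   # depleted at -100
    mut = log2_enrichment([2.0, 2.0, 2.0, 2.0, 2.0], genomic)  # depletion lost
    delta = occupancy_change(mut, wt)
    print(mean_change_near(positions, delta, center=-100, half_width=25))
```

A positive value near the motif corresponds to increased nucleosome occupancy in the mutant, which is the pattern described for promoters carrying Rsc3 binding sequences.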
In order to ask whether the effect of Rsc3 is mediated by RSC, we compared the relative occupancy of Rsc8 in wildtype and rsc3-1 strains using ChIP-chip. In previous studies (Damelin et al., 2002; Ng et al., 2002; Parnell et al., 2008), Rsc8 has the highest occupancy ratios of any RSC subunit, with up to 6-fold enrichment at tRNAs. In our wildtype strain, Rsc8 occupancy ratios are also highest at tRNAs (maximum enrichment 8.5-fold in our analysis, Supplementary Fig 4), and at Pol II promoters there is a significant correspondence between Rsc8 occupancy and the Rsc3 motif score (Spearman rank correlation P < 1.3 × 10^−9). Furthermore, occupancy at tRNAs is not affected by rsc3-1 (Supplementary Fig 4), suggesting that RSC is targeted to Pol III transcripts by a RSC3-independent mechanism. Surprisingly, in rsc3-1, we saw a global (albeit modest) increase in occupancy of Rsc8 at Pol II promoters (Fig 6), which could be an indirect effect of the fitness defects seen in rsc3-1 mutant cells (Angus-Hill et al., 2001), and/or the dramatic alterations we observed in chromatin organization and transcript profiles. Nonetheless, the increase is clearly smaller for promoters in which nucleosome occupancy increases in response to rsc3-1 (Fig 6) and it is also smaller for those promoters carrying a Rsc3 sequence (Wilcoxon rank sum test P < 2.7 × 10^−5 among Rsc8-bound promoters, with Rsc3 positives defined as genes with a Rsc3 site in the NFR (−150 to −70)). Together these observations suggest that Rsc3 may function by targeting RSC, but do not rule out the possibility that Rsc3 acts by other mechanisms.
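The Spearman rank correlation used to relate occupancy to motif score is the Pearson correlation of the ranked values. The sketch below shows the coefficient only, with ties given their average rank; it does not compute the associated P-value, and the function names are illustrative.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their rank positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient (assumes neither input is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation coefficient."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    motif_score = [0.1, 0.4, 0.2, 0.9]   # toy values
    occupancy = [1.0, 2.5, 1.2, 3.0]     # toy values
    print(spearman(motif_score, occupancy))
```

Because only ranks enter the calculation, the statistic is insensitive to the scale of either measurement, which suits comparisons between PWM scores and array ratios.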
Finally, we asked whether other TFs have an impact on nucleosome occupancy and transcription similar to that observed for Rsc3. Indeed, the correspondence between Rsc3 binding sequences and the impact of the rsc3-1 mutant on nucleosome occupancy in promoters and transcript levels from the corresponding gene is similar to that seen with Abf1 and Reb1 (Fig 6 and Supplementary Fig 5). Binding sequences for these TFs are found in the proximal promoter of hundreds of yeast genes, and, as predicted from their known roles as chromatin modifiers, mutation of each TF results in a specific increase in the occupancy of nucleosomes over the potential binding site (Fig 6), with the most affected NFRs in the mutants typically containing the TF binding sequence. We also analyzed nucleosome occupancy in mutants in the essential DNA-binding proteins Tbf1, Rap1, and Mcm1; all three appear to influence nucleosome occupancy at promoters containing their cognate binding sequences, although the number of promoters affected is smaller than for Rsc3, Abf1, and Reb1 (Supplementary Figs 5 and 6). By way of comparison, there is no relationship between binding sequences for Cep3, a centromere-binding protein, and nucleosome occupancy at Pol II promoters (Fig 6 and Supplementary Fig 5). There is, however, a perfect match to the Cep3 motif in all sixteen yeast centromeres, and the array signal in our nucleosome preparations at each centromere is depleted in the cep3 mutant (Supplementary Fig 7; signal from centromere probes could reflect occupancy by centrosomes).
Our in vitro survey of yeast TF-DBD sequence specificities raises the number of yeast TFs with known sequence preference to 174, or ~80% (Supplementary Table 1). This expanded index of sequence preferences provides a new resource for exploration of the function and evolution of gene regulatory networks. Our comparison of predicted promoter preferences to GO categories represents only one possible exploratory approach; by examining correlations between theoretical promoter affinity for TFs (Granek and Clarke, 2005) and relative induction or repression in individual microarray experiments, we have identified hundreds of statistically significant associations (unpublished data). In addition, because motif representations almost certainly do not fully describe in vitro TF binding preferences (e.g. see Fig 2), and because previous studies have concluded that weak and/or non-canonical binding sites are likely to be functional in some instances (Blackwell et al., 1993; Buck and Lieb, 2006; Tanay, 2006), in the future it may be useful to scan the genome with indices of relative affinity to individual sequences, rather than positional models of specificity.
One aspect of global gene expression and regulation that has been difficult to model is precisely how factors within cells assemble at promoters, rather than other genomic locations with similar sequence characteristics. In our study, Rsc3 emerged as a major player in NFR formation/maintenance and promoter function for hundreds of yeast genes. Our data are consistent with prior conjecture that Rsc3 uses its sequence-specific binding activity to target RSC to promoters and create the NFR (Angus-Hill et al., 2001; Parnell et al., 2008; Wilson et al., 2006). Our data are also consistent with previous ChIP-chip analyses of RSC, because promoters containing Rsc3 binding sites are enriched in RSC immunoprecipitates. Rsc3 itself is frustratingly refractory to study by ChIP-chip (Parnell et al., 2008); although there is a significant enrichment of Rsc3 binding sites among ChIP-chip targets, the enrichment ratios, the overlap with Rsc3 binding sequences, and the resolution of published ChIP-chip data (Damelin et al., 2002; Ng et al., 2002; Parnell et al., 2008) are all too low to specify exact target interactions. Therefore, we cannot rule out that the effects of Rsc3 on occupancy of many promoters are indirect, although we have no other explanation for the extremely strong association between Rsc3 binding sequences and the promoter nucleosome occupancy changes in the rsc3-1 mutant (Figs 5 and 6). Several other TFs bind to sequences containing CGCG (e.g. Mbp1, Swi6, Dal82, and Rsc30), but no other known TF binding site (Harbison et al., 2004) or binding sequence (MacIsaac et al., 2006, and this study) correlates as powerfully with the rsc3-1 data as does that of our Rsc3 PWM (Spearman rank correlation P < 4.4 × 10^−43 between the Rsc3 PWM score and the relative change in the NFR in rsc3-1 shown in Fig 6). Moreover, motif searches in the promoters most affected in rsc3-1 yield CGCG-containing sequences (data not shown).
Promoters in diverse organisms are enriched for both characteristic DNA structural features and binding sites for specific proteins (Lee et al., 2007). Our analyses extend these observations and furthermore demonstrate that TFs contribute to either establishment or maintenance of the NFR (Figs 5, 6, and Supplementary Figs 3 and 4). Our data also link NFR formation to promoter function, since in all of the TF mutants we analyzed, an increase in nucleosome occupancy in the NFR generally corresponds to a decrease in transcript levels (Fig 6 and Supplementary Fig 4). Correlation between binding sequence and effect of the mutation is, however, imperfect in all cases, supporting the notion that NFRs, and promoters, are created by a combination of factors, likely including both DNA structural features and specific TF recognition sites. It is curious and somewhat unexpected that the TFs that play key roles in NFR formation in yeast are not highly-conserved proteins: obvious orthologs of Reb1, Abf1, and Rsc3 are not found outside of fungi (Wilson et al., 2006). Possibly, TFs involved in promoter establishment evolve with gene architecture, chromosome structure, and nuclear organization. If this is the case, then large-scale study of TF binding specificities in other organisms may be needed as much to understand how the cell identifies genomic landmarks as to map regulatory pathways.
Additional details and data are found in Supplementary Methods and on our project web site (see below).
We cloned PCR amplicons (pfam-defined DBDs plus 50 flanking residues) into pMAGIC (Li and Elledge, 2005). Resulting inserts were transferred into pTH1137, a T7-GST-tagged variant of pML280 (Berger et al., 2008). We obtained proteins by either purification from E. coli C41 DE3 cells (Lucigen), or in vitro transcription/translation reactions (Ambion ActivePro Kit) without purification, as indicated on our project web site.
The Supplementary Methods contain a detailed description of microarray analyses and motif derivation methods. PBM arrays and assays were as described (Berger et al., 2006). CSI methods essentially followed Warren et al. (2006). Dip-chip was carried out as described previously (Liu et al., 2005) and the resulting DNA was hybridized to NimbleGen microarrays covering the yeast genome at 32 bp resolution.
Extraction of nucleosomal DNA from the samples and hybridization onto the yeast tiling array were performed according to Lee et al. (2007). Isolation of total RNA and hybridization onto the tiling arrays followed Juneau et al. (2007), except that Actinomycin D was added at a final concentration of 6 μg/ml during cDNA synthesis to prevent antisense artefacts.
We grew isogenic wildtype and rsc3-1 strains, each carrying Rsc8-TAP, in parallel under rsc3-1 restrictive growth conditions. After formaldehyde crosslinking and chromatin extraction we performed a single pulldown with IgG sepharose. Following decrosslinking, we analyzed these samples on Nimblegen tiling arrays using a two-color procedure, comparing the pulled-down DNA to genomic DNA. We then compared relative enrichment between wildtype and rsc3-1.
The probability of a transcription factor binding somewhere within a promoter was estimated using PWMs obtained in this study and the program GOMER (Granek and Clarke, 2005), run with default parameters, with promoters defined as the 600bp region 5′ to the ORF. The top 100 hits were input into FunSpec (Robinson et al., 2002).
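The bookkeeping around this step, extracting the 600 bp region 5′ to each ORF with correct strand handling and keeping the top 100 scoring promoters, can be sketched as follows. This does not reimplement GOMER: the per-promoter scores are taken as given, and the coordinate convention (0-based, half-open, forward-strand coordinates) and function names are assumptions of the sketch.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def promoter(chrom_seq: str, orf_start: int, orf_end: int, strand: str,
             length: int = 600) -> str:
    """Sequence 5' to the ORF, read in the ORF's own orientation.

    orf_start/orf_end are 0-based, half-open coordinates on the forward strand.
    The region is truncated at chromosome ends.
    """
    if strand == "+":
        begin = max(0, orf_start - length)
        return chrom_seq[begin:orf_start]
    # For a reverse-strand ORF the 5' region lies after orf_end on the forward strand.
    return revcomp(chrom_seq[orf_end:orf_end + length])


def top_hits(scores: dict, n: int = 100) -> list:
    """Gene names ranked by descending score, truncated to the top n."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


if __name__ == "__main__":
    chrom = "ACGTAAAACCCC"                       # toy chromosome
    print(promoter(chrom, 8, 12, "+", length=4))  # upstream of a forward ORF
    print(promoter(chrom, 0, 4, "-", length=4))   # upstream of a reverse ORF
    print(top_hits({"geneA": 0.1, "geneB": 0.9, "geneC": 0.5}, n=2))
```

The resulting gene list is what would then be passed to a functional-enrichment tool such as FunSpec.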
This work was supported by Genome Canada through the Ontario Genomics Institute, the Ontario Research Fund, a grant from the CIHR to CN and TRH (MOP 86705) and grants from NIH (GM069420) and USDA/Hatch to AZA. GBB was supported by a CIHR postdoctoral fellowship, HvB by the Netherlands Organization for Scientific Research (825.06.033), CDC by American Heart Association Predoctoral Fellowship No. 0615615Z, CLW by Computation and Informatics in Biology and Medicine Training Grant T15LM007359. AZA is a Shaw Scholar. JDL and AJG are supported by NIH R01-GM072518. We thank Brenda Andrews, Charlie Boone, Li Zhijang, Zhaolei Zhang, Quaid Morris, Larry Hiesler, Martha Bulyk, Mike Berger, and Andrew Gehrke for assistance and helpful discussions.
Supplementary material and URLs. Supplementary data files including clone sequences and 8-mer scores and motifs for all TFs are posted at http://hugheslab.ccbr.utoronto.ca/supplementary-data/yeastpbm/. Affymetrix tiling array data is available at ArrayExpress (record E-MEXP-1754); all other microarray data is available at GEO (record GSE12349).
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.