|Home | About | Journals | Submit | Contact Us | Français|
The orchestrated binding of transcriptional activators and repressors to specific DNA sequences in the context of chromatin defines the regulatory program of eukaryotic genomes. We developed a digital approach to assay regulatory protein occupancy on genomic DNA in vivo by dense mapping of individual DNase I cleavages from intact nuclei using massively parallel DNA sequencing. Analysis of > 23 million cleavages across the Saccharomyces cerevisiae genome revealed thousands of protected regulatory protein footprints, enabling de novo derivation of factor binding motifs as well as the identification of hundreds of novel binding sites for major regulators. We observed striking correspondence between nucleotide-level DNase I cleavage patterns and protein-DNA interactions determined by crystallography. The data also yielded a detailed view of larger chromatin features including positioned nucleosomes flanking factor binding regions. Digital genomic footprinting provides a powerful approach to delineate the cis-regulatory framework of any organism with an available genome sequence.
The binding of transcriptional regulators to specific sites on DNA provides the fundamental mechanism for actuating genomic programs of gene expression, DNA replication, environmental response and other basic cellular processes. Delineation of the complete set of genomic sites bound in vivo by these proteins is therefore essential for an understanding of genome function. The discovery more than 35 years ago that regulatory proteins protect their underlying DNA sequences from nuclease attack1,2 has been widely exploited to define cis-regulatory elements in diverse organisms. Although conceptually simple, classical DNase I ‘footprinting’3, which reveals a DNA sequence protected from nuclease cleavage relative to flanking exposed nucleotides, is laborious in practice and particularly challenging to apply systematically to the study of in vivo protein binding in the context of native chromatin. Current genomic approaches for localizing sites of regulatory factor-DNA interaction in vivo such as chromatin immunoprecipitation coupled to DNA microarrays4 or to high-throughput DNA sequencing5,6, while more readily executed on a large scale, require both prior knowledge of binding factors and factor-specific reagents, yet do not provide nucleotide-level resolution.
Regulatory factor binding to DNA in place of canonical nucleosomes results in markedly increased accessibility of the DNA template both immediately surrounding the factor binding regions, and over neighboring chromatin. This accessibility is manifest as DNase I hypersensitive sites in chromatin, which comprise a structural signature of the regulatory regions of eukaryotic genes from yeast to humans7. Within hypersensitive sites, cleavages accumulate at nucleotides that are not protected by protein binding. We therefore reasoned that these binding sites could be detected systematically provided sufficiently dense local sampling of DNase I cleavage sites. We present an analysis of binding sites on yeast DNA based on over 23 million DNA sequence reads mapped back to the genome.
Here we couple DNase I digestion of intact nuclei with massively parallel sequencing and computational analysis of nucleotide-level patterns to disclose the in vivo occupancy sites of DNA-binding proteins genome-wide. The resulting maps provide gene-by-gene views of transcription factor binding and related cis-regulatory phenomena at the resolution of individual factor binding sites. This level of detail is sufficient to define regulatory factor binding motifs de novo, and to correlate factor occupancy patterns with higher-level features such as chromatin remodeling, gene expression, and chromatin modifications.
To visualize regulatory protein occupancy across the genome of Saccharomyces cerevisiae, we coupled DNase I digestion of yeast nuclei with massively parallel DNA sequencing to create a dense whole-genome map of DNA template accessibility at nucleotide-level. We analyzed a single well-studied environmental condition, yeast a cells treated with the pheromone α-factor, which synchronizes cells in the G1 phase of the cell cycle. We isolated yeast nuclei and treated them with a DNase I concentration sufficient to release short (<300 bp) DNA fragments while maintaining the bulk of the sample in high molecular weight species (Supplementary Fig.1). These small fragments derive from two DNase I “hits” in close proximity, and therefore their isolation minimizes contamination by single fragment ends derived from random shearing8. Because each end of the released DNase I ‘double-hit’ fragments represents an in vivo DNase I cleavage site, the sequence and hence genomic location of these sites can be readily determined by sequencing (Supplementary methods).
Using an Illumina Genome Analyzer I, we obtained 23.8 million high-quality 27 bp end-sequence reads that could be localized uniquely within the S. cerevisiae genome following filtering for duplicated sequences such as telomeric regions, transposable elements, tRNA genes, rDNA genes, and other paralogous elements (Supplementary methods). The DNase I cleavages mapped by these 23.8 million sequences were confined to 6.4 million unique positions within the yeast genome. We computed both the density of DNase I cleavage sites across the genome using a 50 bp sliding window, as well as the number of times that individual nucleotide positions had been cleaved by DNase I (per nucleotide cleavage). To control for possible DNase I cleavage bias at particular nucleotide combinations, we carried out a parallel experiment with naked DNA from the same cells digested to yield an equivalent fragment size distribution. We obtained 3.27 million DNase I cleavages mapping to distinct genomic positions, from which we computed background cleavage rates for all possible dinucleotide pairs flanking (i.e., tetranucleotides straddling) the DNase I cleavage sites (Supplementary Table 1). We then used these background propensities to normalize the per nucleotide cleavage counts obtained from in vivo DNase I treatment (Supplementary Fig.2, Supplementary methods).
Data from an exemplary 100 kb region (Fig.1a) showed that regional peaks in DNase I cleavage density concentrate in yeast intergenic regions (Fig.1b), where they coincide with contiguous stretches of individual nucleotides that had been struck repeatedly by DNase I (Fig.1c). Within the upstream regions of yeast genes, individual nucleotide positions were cleaved tens to hundreds of times.
On close inspection, we observed that DNase I cleavage patterns upstream of transcriptional start sites (TSSs) were punctuated by short stretches of protected nucleotides consistent with the footprints of DNA-binding proteins, and that in many cases individual footprints could be matched to known DNA-binding motifs (Fig.1d). We also examined the degree to which computationally predicted factor binding sites within yeast intergenic regions exhibited DNase I protection. For any given factor, computational predictions are expected to contain a mixture of true- and false-positive sites. Fig.2a shows the DNase I cleavage patterns surrounding 907 computationally-predicted9 Reb1 binding sites (+/-25 bp) within yeast intergenic regions, ranked by the ratio of DNase I cleavage flanking the motif to that within the motif. This analysis showed that a significant proportion of predicted Reb1 sites exhibited DNase I protection consistent with protein binding in vivo and, moreover, that the DNase I protection patterns were specifically localized to the motif region. We observed analogous patterns for other motifs, with considerable variation in the fraction of computationally predicted motif instances that evidenced DNase I protection (data not shown), commensurate with the expectation that many (if not most) binding sites predicted from motif scans alone are not actuated in vivo.
To detect footprints systematically across the yeast genome, we developed a computational algorithm to identify short regions (between 8 and 30 bp) over which the DNase I cleavage density was significantly reduced compared with the immediately flanking regions (Supplementary methods). To assess statistical significance and compute a false discovery rate (FDR) for footprint predictions, we compared predictions with a randomly shuffled local background distribution (Supplementary methods). Using this approach, we identified 4,384 footprints within the intergenic regions of the yeast genome at a false discovery rate of 5% (FDR=0.05; Supplementary Table 2). At least one FDR=0.05 footprint was identified in the proximal promoter region of 1,778 genes, with 630 genes harboring two or more footprints. At a false discovery rate of 10%, we identified 6,056 footprints distributed across 2,929 gene promoters, with 1,048 of them evincing >2 footprints.
We categorized the 4,384 FDR=0.05 footprints by deriving sequence motifs de novo using MEME9, and comparing the results with previously-described factor-binding motifs. The predicted numbers of in vivo binding sites across the yeast genome for different regulators vary by nearly two orders of magnitude10. However, MEME readily recovered high-quality motifs corresponding to many important yeast regulators including Reb1, Abf1, Hsf1, Rap1, Mcm1, and Cbf1 (Supplementary Table 3 and Supplementary methods).
Beyond the factor binding sequences recovered de novo using relatively stringent thresholds, footprints were significantly enriched (vs. yeast intergenic regions generally) for a broad range of regulators (Supplementary Table 4), indicating that the footprinted space was densely populated with previously recognized protein binding sites. Collectively, 35.2% of the FDR=0.05 footprints overlapped an occurrence of a conserved factor binding site inferred from ChIP data10. To assess the effect of stringently thresholded footprint detection, we computed factor motif-specific receiver-operator characteristic (ROC) curves for a variety of regulators (Supplementary Fig.3). All curves were well above the diagonal, indicating strong enrichment of previously-recognized factor binding sites near the P<0.05 threshold. This observation implies that many additional real sites exist in the data and simply do not meet the selected detection threshold.
Since footprints identified at the FDR=0.05 level are well-distinguished from their local background, we speculated that these might be generally enriched in factors with strong binding specificities, while many more weakly binding factors might not have yet achieved requisite coverage depth for detection using our algorithm. In both cases, we predicted that protection of the underlying DNA sequence from nuclease attack should be roughly inversely proportional to the binding affinity of the overlying regulatory factors. To test this, we compared the information content (a measure of the size and complexity of the predicted binding site10) of 117 known factor motifs with the level of DNase I protection within all predicted matches of each motif genome-wide, and found them to be significantly anticorrelated (P<10-16; Supplementary Fig.4). This result suggested that high information content of a binding site was a good predictor of the affinity of a factor for its cognate DNA sequences, and consequently its propensity to generate footprints detectable at the FDR=0.05 level given the current depth of sequence sampling. The result also indicates that weaker motifs should be progressively recovered at deeper levels of DNase I cleavage sampling whereupon their cognate footprints may become reliably distinguished from the background.
To visualize consensus nucleotide-level DNase I protection patterns for motifs corresponding to the most abundant footprints, we computed aggregate per-nucleotide DNase I cleavage and evolutionary conservation (PhastCons11) across all instances of each motif (Fig.2b). This showed that several footprint-derived consensus sequences were more information-rich than prior predictions based on inference from ChIP and conservation data alone (Fig.2b)10,12. For example, the previously-characterized motif weight matrix for Reb1 spans 8 nucleotides10, whereas the footprint-derived consensus fine-tunes the motif core and extends it an additional 3 nucleotides (Fig.2b). In some cases, such as Hsf1, the de novo footprint-derived motif is substantially more complex than previous predictions (Fig.2b).
We observed that nucleotide-level DNase I protection closely parallels evolutionary conservation for virtually all factors, further attesting to the significance of the footprints and their derived cognate motifs (Fig.2b). The width of the conserved region is typically broader than the span of previously-derived consensus sequence, but matches closely the footprint-derived consensus. To assess the significance of the aggregate conservation patterns for each motif, we used a permutation approach to compare the observed patterns to random samples from yeast intergenic regions (Fig.2b and Supplementary Methods). These calculations confirmed the significance of the patterns seen for factors such as Reb1, Rap1, Mcm1 and others (Supplementary Fig.5), paralleling previous results from analyses of factor binding sites across yeast species13,14. Although the majority of individual footprints genome-wide were well-conserved, many lacked significant conservation, consistent with the known potential for some sites to undergo rapid evolutionary turnover15.
In comparison with binding site catalogues based on ChIP and conservation data10, digital footprinting revealed 678 Reb1 sites vs. the 158 previously predicted; 536 vs. 151 Abf1 sites; and 311 Rap1 footprints vs. 42 previously predicted10 (Fig.2b). These discrepancies are partly a reflection of the statistical significance thresholds applied both to earlier and the present data, though they suggest an important contribution of condition-specific binding.
A striking feature of the DNase I cleavage and protection profiles for many factors is the presence of complex patterns within and surrounding the derived consensus motif sequence. For example, Mcm1 sites display a characteristic multi-phasic cleavage pattern, with three short protected regions alternating with two accessible regions (Fig.3a). Analogously, Cbf1 sites evince a broad protected region with a central zone of accessibility (Fig.3b). We surmised that these and other stereotypical ‘structural motifs’ reflected patterns of interaction of each factor with the DNA helix. To examine this in detail, we aligned the nucleotide-level DNase I accessibility motifs, the corresponding sequence motifs, and co-crystal structures of Mcm116 and a Cbf1 homolog17 (Fig.3a,b), This revealed striking correspondence between mean nucleotide-level DNase I accessibility and the pattern of protein:DNA contacts. Mcm1 is a MADS box factor that binds DNA through long α-helices that make numerous contacts along the major groove16. Mcm1 binding introduces significant bends into the DNA helix, which distort the opposing minor grooves, rendering them more susceptible to nuclease attack18. These effects are evident in the nucleotide-level cleavage patterns which show a concentration of nuclease attack opposite the Mcm1 alpha helices (Fig.3a). Similarly, in the case of the helix-loop-helix protein Cbf1, alignment of the DNase I cleavage profile to the crystal structure of the human homologue (which shares the same DNA-binding residues) reveals protection of nucleotides by the opposite alpha helices, separated by a central region of increased accessibility (Fig.3b). Taken together, these data suggest that the mean nucleotide-level DNA accessibility patterns derived from digital genomic footprinting of specific factors represent structural motifs that parallel protein:DNA interactions in vivo.
Digital genomic footprinting data are sufficiently dense to enable analysis of regulatory factor occupancy patterns at the level of individual regulatory regions. The examples in Fig.4 and Supplementary Fig.6 provide snapshots of a diverse population of regulators and binding site contexts. In many cases, high-confidence footprints agree with previous predictions for specific regulators (Fig.4a,b,d,e and Supplementary Fig.6a). However, we also observed numerous examples of discordance (Fig.4c,e), possibly reflecting condition-specific binding. For example, at the REB1 promoter (Fig.4e), we detected footprints at two previously-identified evolutionarily-conserved Reb1 binding sites19, neither of which were identified under conditions used in prior ChIP experiments. Conversely, ChIP data annotated a nearby Rpn4 site that does not fall within an FDR=0.05 footprint.
The data also illustrate considerable variability in the degree to which a given regulator protects different cognate recognition sites (compare pairs of Rap1, Reb1, and Pdr3 sites in Fig.4a,e, and f, respectively). In some cases, the identification of footprints matching characterized regulators could be used to revise gene annotations. For example, we identified a Rap1 site upstream of RPS30B that is situated within the hypothetical open reading frame for FYV12. However, the marked DNase I sensitivity and general lack of evolutionary conservation within this region suggest that FYV12 is not a gene but rather the promoter of the neighboring RPS30B (Supplementary Fig.7).
We next sought to visualize patterns of DNase I cleavage and protection at the level of extended promoter domains. We extracted DNase I cleavage data from -1 kb to +1 kb intervals around the TSSs of ~5,000 yeast genes and performed hierarchical clustering (Fig.5a). This revealed that 93% of yeast genes could be organized into four distinct clusters, ranging from low (red cluster) to high (purple) mean chromatin accessibility (Fig.5a). For genes in the red cluster, chromatin accessibility was maximal over the -100 region, visualized in Fig.5a as a prominent central vertical yellow stripe. Even at this resolution, a ~10 bp footprint centrally positioned within the -100 region could be discerned at a surprising proportion of genes (Fig.5a). A prominent feature of the DNase I cleavage patterns is the presence of regular undulations in accessibility, with a period of ~175 bp symmetrically flanking the central high-accessibility zone (Fig.5a). This pattern is consistent with the presence of phased nucleosomes. We further observed that the periodic pattern emanated from the boundaries of the central high-accessibility region, even though this region varied in size between the four clusters. This observation suggested that phased nucleosomes were in fact distributed relative to central sites occupied by factors. To explore further the relationship between nucleosome-level features and factor occupancy, we examined the long-range distribution of DNase I cleavages surrounding footprints of individual regulators across the genome. The distribution of DNase I cleavages relative to footprints for Reb1 and Abf1 revealed periodic undulations, consistent with phased nucleosome arrays symmetrically distributed relative to the factor-binding sites. However, Rap1 and Mcm1 exhibited less prominent patterns (Fig.5c), suggesting that some factors (e.g. Reb1 and Abf1) have a more determinative role in establishing chromatin architecture at promoters20. Collectively, these data are consistent with statistical positioning of nucleosomes relative to factor binding-induced ‘barrier’ events21,22.
We also observed that the binding of many factors appears to be positionally constrained relative to transcriptional start sites. For six factors (Reb1, Abf1, Rap1, Mcm1, Cbf1, and Pdr3), footprints exhibit tight clustering into a ~50 bp zone centered ~100 bp upstream of the TSS (Fig.5b). Furthermore, the region immediately 3′ to the -100 region is generally depleted of footprints (Fig.5b), consistent with the presence of a positioned nucleosome. These results are compatible with the existence of a common focal point for the organization of promoter architecture of a substantial fraction of yeast genes22,23.
We next asked whether the four chromatin structural clusters (Fig.5a) were correlated with expression of their constituent genes. We found that the average expression level of genes from each cluster increases monotonically with the extent of chromatin disruption upstream of the TSS (Fig.5d). This organization is most readily explained by the size of the domain over which factor binding takes place. For the genes in the red cluster, factor binding is largely restricted to the -100 region, with a prominent -1 nucleosome around -200. By contrast, for genes in the blue cluster, the accessible factor-binding region extends from the TSS to approximately -360, with a 5′ shift in the -1 nucleosome. For genes in the green and purple clusters, the factor-binding region extends to -450 bp and -750, respectively. Taken together, these observations suggest that, rather than simple gain or loss of an upstream nucleosome23-26, high expression of yeast genes may involve increases in both the number and longitudinal extent of regulatory factors bound in the upstream region. Conversely, many genes expressed at a low level nonetheless exhibited high chromatin accessibility across their promoter regions, with attendant footprints indicative of factor binding. The existence of such promoters parallels reports of binding by well-described regulators such as Heat Shock Factor (Hsf1), Gal4, Abf1 and Pdr1/Pdr3 under conditions in which they do not activate transcription27. These results emphasize the heterogeneous nature of factor binding and consequent control of gene expression, requiring gene-level analyses of factor occupancy.
DNase I footprinting has long been used in an in vitro context to interrogate protein-DNA interactions. However, application of this approach to the study of in vivo interactions has proven difficult, and only a handful of studies have been reported for highly targeted loci such as individual cis-regulatory elements28. By coupling DNase I digestion of intact nuclei with massively parallel sequencing and computational analysis of nucleotide-level patterns, the digital genomic footprinting approach we describe now enables genome-scale detection of the in vivo occupancy of genomic sites by DNA-binding proteins. Although detection of individual binding events is dependent on the depth of sequence coverage at a given position, the approach takes advantage of the concentration of cleavages within DNase I hypersensitive regions. In the case of mammalian genomes, DNase I cleavage is highly targeted to DNase I hypersensitive sites, which comprise only 1-2% of the genome in each cell type. As such, although the human genome is ~250-fold larger than the yeast genome, the collective span of human DNase I hypersensitive sites is only 1-2% of the genome, and therefore potentially addressable with only modest scale-up.
To date, genome-scale localization of regulatory factor binding sites has largely relied on a top-down approach centered on chromatin immunoprecipitation. Several limitations of this approach are addressed by digital genomic footprinting. Whereas ChIP requires prior knowledge of each DNA-binding protein to be interrogated by genome-wide location analysis, and can be carried out on only one protein at a time, DNase I footprinting addresses all factors simultaneously in their native state, and detects regions of direct binding at nucleotide precision vs. inference based on motif enrichment analysis. However, many regulatory factors share common binding sequences, and ChIP offers definitive identification of the protein of interest. The joint application of digital genomic footprinting with ChIP should therefore provide particularly rich information concerning the fine-scale architecture of cis-regulatory circuitry.
Digital genomic footprinting also provides a powerful tool for annotation of the genomes of diverse organisms about which little is known beyond the genome sequence itself. In these contexts, top-down approaches to regulatory factor binding site localization are limited. By contrast, digital genomic footprinting can be applied to develop rapidly both a gene-by-gene map and a lexicon of major regulatory motifs.
Cis-regulatory alterations accompanying different growth, conditions or cell differentiation and cycling impact multiple regulators simultaneously and are difficult to study. The approach described herein is readily extensible to the analysis of such changes across the genome by sampling sequential time points to visualize cis-regulatory dynamics. Digital genomic footprinting therefore has the potential to expose and probe the cis-regulatory regulatory framework of virtually any sequenced organism in a single experiment, regardless of its prior level of functional characterization.
Footprints were identified using a computational algorithm that evaluates short regions (between 8 and 30 bp) over which the DNase I cleavage density was significantly reduced compared with the immediately flanking regions (Supplementary Methods). FDR thresholds were assigned by comparing p-values obtained from real and shuffled cleavage data. Software and data used for this analysis are available at http://noble.gs.washington.edu/proj/footprinting/.
Supplementary Figure 1: Digital genomic footprinting strategy.
Supplementary Figure 2: Correction of in vivo DNase I cleavage data for sequence bias.
Supplementary Figure 3: Motif-specific ROC curves for the top motifs detected in Supplementary Table S4
Supplementary Figure 4: Information content vs. DNase I cleavages within putative transcription factor motifs.
Supplementary Figure 5: Summary plots comparing DNase I cleavage and conservation of ChIP-defined sites for 13 transcription factors.
Supplementary Figure 6: Individual yeast regulatory regions and factor binding sites.
Supplementary Figure 7: Footprint within a hypothetical ORF.
Supplementary Table 1: Dinucleotide cleavage biases.
Supplementary Table 2: Footprints identified from digital DNase I cleavage data.
Supplementary Table 3: Motif matches to major regulators.
Supplementary Table 4: Numbers of occurrences of ChIP-defined motifs in the identified footprints.
We thank the staff of the University of Washington Genome Sciences High-Throughput Genomics Unit for technical assistance with Illumina/Solexa sequencing, and members of the Stamatoyannopoulos and Fields labs for many helpful discussions. This work was supported by NIH grants R01GM071923 and U54HG004592 to J.S. and P41RR11823 to S.F. S.F. is an Investigator of the Howard Hughes Medical Institute. X.C. was supported by a fellowship from the Natural Sciences and Engineering Research Council of Canada (NSERC PGS D3).
Additional methods: Additional methodological details are provided in Supplementary Methods.
The authors declare no competing financial interests.
EdSumm: Dense mapping of DNase I cleavage sites across the whole yeast genome by next generation sequencing reveals a global view of the binding of regulatory proteins to genomic DNA. The high resolution allows the identification of new binding sites for known factors as well as the de novo derivation of factor binding motifs.