|Home | About | Journals | Submit | Contact Us | Français|
The study of 5-hydroxylmethylcytosines (5hmC) has been hampered by the lack of a method to map it at single-base resolution on a genome-wide scale. Affinity purification-based methods cannot precisely locate 5hmC nor accurately determine its relative abundance at each modified site. We here present a genome-wide approach, Tet-assisted Bisulfite Sequencing (TAB-Seq), for mapping 5hmC at base resolution and quantifying the relative abundance of 5hmC as well as 5mC when combined with traditional bisulfite sequencing. Application of this method to embryonic stem cells not only confirms widespread distribution of 5hmC in the mammalian genome, but also reveals sequence bias and strand asymmetry at 5hmC sites. We observe high levels of 5hmC and reciprocally low levels of 5mC near but not on transcription factor binding sites. Additionally, the relative abundance of 5hmC varies significantly among distinct functional sequence elements, suggesting different mechanisms for 5hmC deposition and maintenance.
5-methylcytosine (5mC) in mammalian genomic DNA is essential for normal development and impacts a variety of biological functions. In 2009, 5-hydroxymethylcytosine (5hmC) was discovered as another relatively abundant form of cytosine modification in embryonic stem cells (ESCs) and Purkinje neurons (Kriaucionis and Heintz, 2009; Tahiliani et al., 2009). The TET proteins, which are responsible for conversion of 5mC to 5hmC, have been shown to function in ESC regulation, myelopoiesis and zygote development (Dawlaty et al., 2011; Gu et al., 2011; Iqbal et al., 2011; Ito et al., 2010; Ko et al., 2010; Koh et al., 2011; Wossidlo et al., 2011). 5hmC was found to be widespread in many tissues and cell types, although with diverse levels of abundance (Globisch et al., 2010; Munzel et al., 2010; Song et al., 2011; Szwagierczak et al., 2010). Proteins that can recognize 5hmC-containing DNA have also been investigated (Frauer et al., 2011; Yildirim et al., 2011). In addition, 5hmC can be further oxidized to 5-formylcytosine (5fC) and 5-carboxylcytosine (5caC) by TET proteins (He et al., 2011; Ito et al., 2011; Pfaffeneder et al., 2011), and demethylation pathways through these modified cytosines have been shown (Cortellino et al., 2011; Guo et al., 2011; He et al., 2011; Maiti and Drohat, 2011; Zhang et al., 2012). Together, these studies provide an emerging paradigm in which 5mC oxidation plays important roles in sculpting a cell’s epigenetic landscape and developmental potential through the regulation of dynamic DNA methylation states.
Strategies to label and/or enrich 5hmC in genomic DNA have been developed to investigate the distribution and function of 5hmC in the genome (Ficz et al., 2011; Pastor et al., 2011; Robertson et al., 2012; Robertson et al., 2011; Song et al., 2011; Stroud et al., 2011; Williams et al., 2011; Wu et al., 2011; Xu et al., 2011). While 5hmC is more enriched in gene bodies than transcription starting sites in mouse cerebellum (Song et al., 2011; Szulwach et al., 2011b), all genome-wide maps of 5hmC in human and mouse embryonic stem cells indicate that 5hmC tends to exist in gene bodies, promoters, and enhancers (Ficz et al., 2011; Pastor et al., 2011; Stroud et al., 2011; Szulwach et al., 2011a; Williams et al., 2011; Wu et al., 2011; Xu et al., 2011). However, in all cases, the resolution of these maps was restricted by the size of the immunoprecipitated or chemically captured DNA, which varied from several hundred to over a thousand bases.
The study of 5mC has been facilitated by the development of whole genome bisulfite sequencing methods that can resolve the genomic location of methylcytosine at single-base resolution (Cokus et al., 2008; Lister et al., 2008; Lister et al., 2009). However, current bisulfite sequencing methods cannot distinguish between 5mC and 5hmC (Huang et al., 2010; Jin et al., 2010). Therefore, the genome-wide bisulfite sequencing maps generated in recent years may not accurately capture the true abundance of 5mC at each base in the genome. A more detailed understanding of the function of 5hmC as well as 5mC has, therefore, been hampered by the lack of a single-base resolution sequencing technology capable of detecting the relative abundance of 5hmC per cytosine.
Here we present a Tet-assisted bisulfite sequencing (TAB-Seq) strategy, which provides a method for single-base resolution detection of 5hmC amenable to both genome-wide and loci-specific sequencing. Applying this new method, we have generated the first genome-wide, single-base resolution maps of 5hmC in ESCs. Distinct classes of functional elements exhibit variable abundance of 5hmC, with promoter-distal regulatory elements harboring the highest levels of 5hmC. High levels of 5hmC and reciprocally low levels of 5mC can be found near binding sites of transcription factors. In contrast to 5mC, 5hmC sites display strand asymmetry and sequence bias. Finally, the base-resolution maps of 5hmC provide more accurate estimates of both 5hmC and 5mC levels at each modified cytosine than previous whole genome bisulfite sequencing approaches. Our results support a dynamic DNA methylation process at distal regulatory elements, and suggest that different mechanisms of DNA modification may be involved at distinct classes of functional sequences in the genome.
Traditional bisulfite sequencing cannot discriminate 5mC from 5hmC because both resist deamination by bisulfite treatment (Huang et al., 2010; Jin et al., 2010). We have recently found that TET proteins not only oxidize 5mC to 5hmC, but also further oxidize 5hmC to 5caC, and that 5caC exhibits similar behavior as unmodified cytosine after bisulfite treatment (He et al., 2011; Ito et al., 2011). This deamination difference between 5caC and 5mC/5hmC under standard bisulfite conditions inspired us to explore TAB-Seq. In this approach, we introduce a glucose onto 5hmC using β-glucosyltransferase (βGT), generating β-glucosyl-5-hydroxymethylcytosine (5gmC) to protect 5hmC from further TET oxidation. After blocking of 5hmC, all 5mC is converted to 5caC by oxidation with excess of recombinant Tet1 protein. Bisulfite treatment of the resulting DNA then converts all C and 5caC (derived from 5mC) to uracil or 5caU, respectively, while the original 5hmC bases remain protected as 5gmC. Thus, subsequent sequencing will reveal 5hmC as C, providing an accurate assessment of abundance of this modification at each cytosine when combined with traditional bisulfite sequencing (Figure 1A). We first confirmed that 5gmC is read as C in traditional bisulfite sequencing (data not shown). We cloned and expressed the catalytic domain of mouse Tet1 (mTet1) (Figure S1A), as previously reported (Ito et al., 2010). We tested a synthetic double-stranded DNA with site-specifically incorporated 5mC or 5hmC modification (Figure 1B). Application of our method with Sanger sequencing of the PCR amplified products showed that the original 5mC was completely converted into T after treatment, indicating efficient oxidation of 5mC to 5caC by mTET1 (Figure 1B). However, the original 5hmC was sequenced as C, confirming that the protected 5gmC is resistent to deamination under bisulfite treatment (Figure 1B). The products of each step were confirmed by MALDI-TOF/TOF using a shorter model duplex DNA (Figure 1C). Full conversion of 5mC in the context of genomic DNA was also confirmed by conventional bisulfite, PCR, and both Sanger and semiconductor sequencing (Figure S1B–C). Additionally, application to genomic DNA confirmed conversion of 5mC to 5caC and protection of 5hmC, and that 5fC is undetectable by immunoblot on the final reaction products (Figure 1D). Thus, coupling βGT-mediated transfer of glucose to 5hmC with mTet1-catalyzed oxidation of 5mC to 5caC enables the distinction of 5hmC from both C and 5mC after sodium bisulfite treatment.
The ability to distinguish 5hmC at base resolution offers a significant opportunity to further parse DNA methylation/hydroxymethylation states at specific genomic loci. We applied traditional bisulfite sequencing and TAB-Seq to known 5hmC-enriched loci in mouse cerebellum that were identified previously (Song et al., 2011; Szulwach et al., 2011b). Comparing the sequencing results, we were able to identify genuine 5hmC and 5mC sites (Figure S1D).
We next applied TAB-Seq to genomic DNA from H1 human ES cells and E14Tg2a mouse ES cells, and sequenced to an average depth of 26.5X and 17X per cytosine, respectively. Successful detection of 5hmC is governed by three key parameters: 1) efficient conversion of unmodified cytosine to uracil; 2) efficient conversion of 5mC to 5caU/U; and 3) efficient protection of 5hmC. To directly assess these conversion rates in the context of genomic DNA, sequenced samples were spiked in with fragments of lambda DNA amplified by PCR to contain three distinct domains having either unmodified cytosine, 5mC, or 5hmC. We observe low non-conversion rates for unmodified cytosine (0.38%) and 5mC (2.21%), contrasted to a high non-conversion rate of 5hmC (84.4%) (Figure S2B). Further analysis indicates that this latter value is an underestimate of the true 5hmC protection rate in H1, which is closer to 87.0% (Figure S2D–E). These data further confirm the capability of TAB-Seq for robust distinction of 5hmC from 5mC and unmodified cytosine in the context of genomic DNA.
We next focused our analysis on the map of H1 human ES cells. To confidently identify 5hmC-modified bases we took advantage of the highly annotated H1 methylome generated using methylC-Seq, which identifies both 5mC as well as 5hmC. Accordingly, we restricted our search for 5hmC to the subset of methylated bases previously identified by methylC-Seq (Lister et al., 2009). The probability that a cytosine can be confidently identified as 5hmC is governed by the sequencing depth at the cytosine and abundance of the modification (Figure S2C). Modeling this probabilistic event with a binomial distribution (Lister et al., 2009) with N as the depth of sequencing at the cytosine and p as the 5mC non-conversion rate, we identified a total of 691,414 5hmCs with a false discovery rate of 5% (Figure S2F, see Extended Experimental Procedures). Given an average sequencing depth of 26.5, our assay can on average resolve 5hmC having an abundance of 20% or higher (Figure S2C).
Genomic profiles of absolute 5hmC levels are comparable to a map previously generated using an affinity-based approach (Szulwach et al., 2011a) (Figure 2A). As sequenced fragments are equally distributed among the population of cells, TAB-Seq provides a steady-state glimpse of 5hmC in the entire population. This is in contrast to affinity-based approaches, which bias sequencing towards 5hmC-enriched DNA fragments. By TAB-Seq, identified 5hmCs are highly clustered, unlike 5mCs (Figure S3A), and track well with peaks of 5hmC enrichment previously identified by affinity sequencing (Figure 2A). There are 7.6 times as many 5hmCs overlapping affinity-identified regions as expected by chance (Figure 2B, Z-score = 1,579). Furthermore, 81.5% of these 82,221 affinity-identified regions were recovered by at least one 5hmC. In contrast, only 35.6% of 5hmCs are recovered by affinity-based approaches, suggesting an increased sensitivity of TAB-Seq. Using semiconductor sequencing, we verified the presence/absence of 5hmC at 57 out of 59 individual cytosines (9 out of 11 hydroxymethylated CpGs, with depth ≥ 30) within regions that previously escaped detection by 5hmC affinity capture (Figure S2A), underscoring the sensitivity and specificity of our approach.
Application of TAB-Seq to mouse ESCs resulted in 2,057,636 high-confidence 5hmCs. This larger number of sites is likely attributable to higher level expression of both Tet1 and Tet2 in mouse ESCs as revealed by RNA-Seq analysis (Lister et al., 2011; Myers et al., 2011) (B.R., unpublished data). Like H1, these 5hmCs are also significantly enriched at genomic loci recovered by affinity sequencing (Figure S2J). In addition, these 5hmC sites are significantly enriched for previously mapped binding sites of Tet1 (Williams et al., 2011; Wu et al., 2011), confirming the TAB-Seq approach.
DNA methylation of cytosines can exist in several contexts: CpG (denoted CG), CHG, and CHH (H = A, C, or T). While it has been suggested that mouse ESCs may harbor 5hmC in non-CG content (Ficz et al., 2011) and while non-CG methylation is present in human and mouse ESCs (Lister et al., 2009; Stadler et al., 2011), we found that nearly all (99.89%) of H1 5hmCs exist in the CG context (Figure 2C). Similarly, this figure is 98.7% in mouse ESCs (Figure S2G).
The combination of traditional methylC-Seq and TAB-Seq maps allows us to estimate the true abundance of both 5hmC and 5mC. We observe that, in a steady-state population of cells, 5mC and 5hmC often coexist at the same cytosine (Figure 2D). The median observed abundance of 5hmC at 5hmC-rich cytosines is 19.2%, compared to 60.7% for 5mC as estimated from traditional bisulfite sequencing (Figure 2E). Adjusting for the 87.0% protection rate of 5hmC by TAB-Seq, we estimate the corrected median 5hmC and 5mC abundance to be 22.1% and 57.8%, respectively. These results suggest that, at the base level, the abundance of 5hmC is lower than 5mC. This observation is corroborated in mouse ESCs (Figure S2H–I), and is consistent with a previous estimate of global 5hmC levels in embryonic stem cells (Tahiliani et al., 2009).
Previous studies using affinity-based approaches have demonstrated that 5hmC is enriched at promoters, enhancers, CTCF binding sites, exons, and gene bodies (Ficz et al., 2011; Pastor et al., 2011; Stroud et al., 2011; Szulwach et al., 2011a; Williams et al., 2011; Wu et al., 2011; Xu et al., 2011), suggesting an extensive role for this modification in gene regulation. Supporting a functional role of 5hmC, we observe a trend of increasing sequence conservation for increasing abundance of 5hmC (Figure S3B). However, the absolute abundance of 5hmC cannot be assessed from affinity-based detection methods, therefore precluding further quantitative analysis of 5hmC’s role at each class of regulatory elements. In H1, we found that almost half (46.4%) of the 5hmCs reside in distal regulatory elements mapped by ChIP-Seq and DNase-Seq (Figure 3A). Assessing relative enrichment of 5hmC at each class of regulatory element by normalizing with genomic coverage, H1 distal regulatory elements including p300 binding sites (observed/expected = 7.6), predicted enhancers (o/e =7.8), CTCF binding sites (o/e = 5.1), and DNase I hypersensitive sites (o/e = 3.4) are more enriched with 5hmC than other genic regions (Figure 3B). Intriguingly, the subset of cytosines showing nearly equal levels of 5mC and 5hmC are more enriched in distal regulatory elements and less enriched at promoters and genic features (Figure S3E), suggesting that active demethylation is strongest outside of genes. In support of this observation, promoter-distal ChIP-Seq peaks for OCT4, SOX2, NANOG, KLF4, and TAFII are also more enriched with 5hmC than genic features (Figure S3D). Finally, we observe that increasing DNase I hypersensitivity signal correlates well with increased 5hmC and decreased 5mC enrichment (Figure S3C). These results are also supported by observations in mouse ESCs (Figure S3F–G), though we observe an increase in intragenic 5hmC occupancy.
Examining only those genomic elements having significant 5hmC enrichment, we found that the absolute levels of 5hmC at all classes of distal regulatory elements are significantly higher than promoter-proximal elements (Figure 3C). In contrast, gene bodies with significant levels of 5hmC show statistically lower levels of 5hmC. Furthermore, examining the estimated level of 5mC at these loci, we observed an inverse relationship between 5mC and 5hmC (Figure 3C). Distal regulatory elements have the lowest levels of 5mC, with p300 and enhancers having median abundances of 42.2% and 53.7%, respectively. This suggests that highly demethylated elements such as p300 contain more cytosines in a non-5mC/5hmC form, implicating stronger demethylation at these regulatory elements.
In combination with the observations that: 1) between 44% and 74% of distal regulatory elements are significantly enriched with 5hmC in human and mouse ESCs (Figure 3D, S3H); 2) the same class of elements are also enriched in mouse ESCs (Figure 3E, S3G); and 3) the sequence-conserved distal-regulatory elements in H1 are conserved for 5hmC in mouse ESCs (Figure 3F), our data suggests that the marking of functional regulatory elements with 5hmC is an evolutionarily conserved phenomenon with potential functional consequences. Together, these data show that 5hmC is most abundant at promoter-distal regulatory elements, and particularly enriched in distal regulatory elements.
Besides distal regulatory elements, we observe significant enrichment of 5hmC at genes of all tiers, but lowly expressed genes are more enriched than highly expressed genes (Figure S3I), consistent with previous studies (Pastor et al., 2011). In contrast to the abundant 5hmC found at regulatory elements in H1, the vast majority of repetitive elements are highly enriched with 5mC, but not 5hmC (Figure S3J). Between 3.5 and 7.5% of repetitive elements are significantly enriched with 5hmC, with LTRs being the highest (Figure S3K). At these significant loci, the absolute abundance of 5hmC is on par with promoters, but less than distal-regulatory elements (Figure S3L).
5mC is thought to confer specificity to gene regulation by influencing transcription factor binding or serving as a substrate of recognition for chromatin regulators (Bird, 2011; Chen and Riggs, 2011; Jaenisch and Bird, 2003; Quenneville et al., 2011). Similarly, it has been suggested that 5hmC offers a different platform upon which transcription factors may bind or 5mC specific binding proteins may be excluded (Hashimoto et al., 2012; Kriaucionis and Heintz, 2009; Valinluck et al., 2004; Yildirim et al., 2011). Since 5hmC is enriched near enhancers, one possibility is that this modified base is specifically recognized by transcription factors as a core base in binding motifs. But as sequence motifs are typically shorter than 20bp, the resolution of affinity-based approaches is not sufficient to resolve whether 5hmC is actually present within or outside of the binding site. We observed that while 5hmC is abundant within 500bp of distal p300 binding sites, there is a local depletion near the expected TF binding site (Figure 4A, Figure S4A). To increase resolution, we anchored p300 binding with the OCT4/SOX2/TCF4/NANOG consensus motif (Lister et al., 2009). Total DNA methylation (5mC+5hmC) decreases towards the motif, in agreement with a recent study (Stadler et al., 2011), while 5hmC displays a bimodal peak of enrichment centered at the motif with a maximum average abundance of 12.3% (Figure 4B).
Similarly for CTCF binding sites, we observed a bimodal enrichment profile of 5hmC abundance ~150bp around the motif, with almost no 5hmC within the motif itself (Figure 4C). 5hmC increases to a maximum abundance of 13.4%, coinciding with a dramatic depletion of 5mC from an average high of 86.2% to a low of 21.0% (Figure 3D). We also observed similar results for NANOG binding sites (Figure S4B–C). Together, these data suggest that 5hmC is typically not observed within potential binding sites of transcription factors, but rather is most enriched in regions immediately adjacent to sequence motifs. The reciprocal profiles of 5hmC and 5mC is consistent with a model of dynamic DNA methylation associated with DNA-binding transcription factors, and provides additional evidence supporting a role for 5hmC in the locally reduced levels of 5mC at distal regulatory elements (Stadler et al., 2011).
Cytosine methylation in CG context is symmetric, and the maintenance methyltransferase DNMT1 ensures efficient propagation of symmetric 5mCG during cell division, thus providing one of the central modes of epigenetic inheritance (Bird, 2011; Chen and Riggs, 2011; Goll and Bestor, 2005; Jaenisch and Bird, 2003; Wigler et al., 1981). Our observation that the bimodal distribution of 5hmC around CTCF is strand-asymmetric (Figure 4C–D) prompted us to examine if 5hmC is strand-biased in H1. While 91.8% of 5mCs are symmetrically modified, we found that only 21.0% of 5hmCs are symmetric. However, since the abundance of 5hmC is rare at any given cytosine (median 19.2%, Figure 2E), it is possible that sequencing depth was not sufficient to identify all 5hmCs, making this an under-estimate. To address this issue, we compared the pool of all called 5hmCs with the pooled 5hmC content on the opposite cytosine (Figure 5A). The average abundance of 5hmC is 20.0% at called 5hmCs, compared to 10.9% at the opposite cytosine, which corresponds to an 83.8% enrichment of 5hmC (Figure 5B, p < 1E-15, binomial). As a control, the base-line 5hmC content of all methylated cytosines in CG context is symmetric and comparable to the methylcytosine non-conversion rate (Figure 5C). At promoters and within gene bodies, we found that strand bias is not dependent on the orientation of the transcript (Figure S5A) (ppromoter = 0.0339, pgene body = 0.0719).
To confirm the asymmetry of 5hmCG, we examined the difference in methylation state of called 5hmCs and the cytosines located at the opposite strands. From traditional bisulfite sequencing, the median difference in total methylation (5mCG+5hmCG) between called and opposite cytosines is 0%. In contrast, TAB-Seq reveals a shifted distribution with a median of 10.9% less hydroxymethylation on the opposite cytosine (Figure 5D, p <1E-15, Wilcoxon). Simultaneous examination of the absolute levels of 5hmC on both called and opposite cytosines showed that the shift in hydroxymethylation state towards the called cytosine is evident, in contrast to DNA methylation levels that remain symmetric (Figure 5E). Our analysis of the spike-in lambda DNA showed no strand nor sequence bias of the TAB-Seq method (Figure S5B, Figure S2D). This conclusion was further supported by analyzing the βGT-catalyzed glucosylation efficiency of a fully-hydroxymethylated model dsDNA, which is over 90% (Figure S5C).
The asymmetry of 5hmC in H1 suggests that, on a population average, one strand is more likely to be hydroxymethylated than the other strand. One possible explanation for this phenomenon is a sequence preference of 5hmC for one strand compared to the other. To examine this systematically, we aligned all 5hmCs in CG context and examined base composition (Figure 6A). On the strand containing 5hmC, we observed a modest increase in local guanine abundance with depletion of adenine and thymine content. Within a window of 100bp around 5hmCs, the local sequence content of guanine increases to an average of 29.9%, significantly higher than the 25.6% observed for randomly sampled methylated cytosines (Figure S6A, p < 1E-15, Wilcoxon). These observations are not a function of regulatory element class, as similar trends hold for subsets of 5hmC found at promoters, distal regulatory elements, and genic regions (Figure S6C). Furthermore, similar trends are observed in mouse ESCs (Figure S6D), and analysis of the spike-in lambda DNA shows that this observation is not a systematic bias of the TAB-Seq method (Figure S6B).
Our observations suggest that 5hmC deposition is biased towards the strand with a higher local density of guanine. To test this hypothesis, we developed a predictive algorithm: given that a strand-biased hydroxymethylation event exists at a particular CG (p-value = 0.01, Fisher’s exact test) and that one strand has local guanine content significantly different from the other strand (p-value = 0.01, Fisher’s exact test), we predict the strand with higher guanine content to have the hydroxymethylation event. This model correctly predicts the hydroxymethylated strand with 82.7% accuracy, significantly better than the 50% expected by chance (Figure 6B, p < 1E-15, binomial), confirming that local sequence content plays a role in strand-specific hydroxymethylation. However, while both human and mouse ESCs exhibit a bias of 5hmCG to occur on the strand with more guanine content (Figure S6C–D), the effect is weaker in mouse ESCs (Figure S6E–F), which is one potential reason that guanine content does not predict 5hmC in mouse ES cells. One possible explanation is the large difference in the expression levels of TET1 and TET2 in human and mouse ESCs.
Recent affinity-based studies in mouse ESCs have observed 5hmC to be frequently enriched at CpG island-containing promoters (Ficz et al., 2011; Pastor et al., 2011; Williams et al., 2011), and that the highest levels of 5hmC correspond to the highest density of CpGs (Ficz et al., 2011). In contrast, an affinity-based map of 5hmC produced in H1 found 5hmC-rich regions to be depleted of CpG dinucleotides (Szulwach et al., 2011a). These confounding results prompted us to examine the relationship between absolute steady-state 5hmC level and CpG content at promoters. We found that promoters with the highest levels of 5hmCG are almost exclusively of low CpG content (Figure 7A), which are also the promoters most likely to have the highest 5mCG (Figure S7A). In agreement with this observation, when we divide promoters by CpG content, we observe that the density of 5hmC is lowest at HCPs, while at LCPs and ICPs 5hmC is at least 3.3 times more abundant (Figure S7H). Analyses of mouse ESCs give similar results (Figure 7B). In both human and mouse ESCs, CpG-rich promoters are almost devoid of steady-state 5hmC. Moreover, these results apply to promoters containing H3K4me3 or bivalent chromatin modifications (Figure S7G).
Together with our observation of an increased local density of guanine on the strand of hydroxymethylation, we postulated that promoters with high GC content but low CpG density are more likely to be hydroxymethylated. Indeed, such bivalent (p < 1E-300) and H3K4me3-only promoters (p = 7.8E-286) are more enriched with 5hmC (Figure 7C).
To determine if hydroxymethylation at distal regulatory elements is also biased towards low CpG density, we examined three classes of DNase I hypersensitive sites (DHSs): 1) those lacking the enhancer histone modifications H3K4me1 and H3K27ac; 2) putative poised enhancers bearing only H3K4me1; and 3) putative active enhancers with both modifications (Hawkins et al., 2010; Myers et al., 2011). Poised and active enhancers exhibit the strongest enrichment of 5hmC (Figure 7D–F), which almost exclusively corresponds to low CpG density regions. Like promoters, the few distal DHSs with high CpG density are generally composed of low 5hmCG content. We also observed similar results at distal p300 binding sites (Figure S7B–C). Together, these results suggest that the highest levels of 5hmC occur at regions of the genome with low CpG density.
Comparing DNase I hypersensitive sites lacking the H3K4me1 and H3K27ac enhancer marks to poised enhancers having only H3K4me1, 5mCG drops by 12.6% and 5hmC increases by 2.7-fold (Figure S7D–F). In contrast, active enhancers having both H3K4me1 and H3K27ac have 8.3% less 5mCG than poised enhancers, but with only a 1.08-fold increase in 5hmC. These results suggest that while 5mCG is inversely related to both H3K4me1 and H3K27ac, 5hmC is primarily proportional to H3K4me1.
Bisulfite sequencing has been broadly used to analyze the genomic distribution and abundance of 5mC (Bernstein et al., 2007; Clark et al., 1994; Lister et al., 2008; Meissner, 2010; Pelizzola and Ecker, 2011). However, because traditional bisulfite sequencing cannot distinguish 5mC from 5hmC, results from such approaches cannot yet accurately reveal 5mC abundance (Huang et al., 2010; Jin et al., 2010). Recent experiments show that 5hmC is widespread in the mammalian genome, and at least two functions have been proposed for this cytosine modification: 1) 5hmC serves as an intermediate in the process of DNA demethylation, either passively (Inoue and Zhang, 2011), or actively through further oxidation (He et al., 2011; Ito et al., 2011; Maiti and Drohat, 2011; Zhang et al., 2012); 2) 5hmC may be recognized by chromatin factors (Frauer et al., 2011; Yildirim et al., 2011), and that its presence could reduce binding of certain methyl-CpG-binding proteins (Hashimoto et al., 2012; Kriaucionis and Heintz, 2009; Valinluck et al., 2004). These functions implicate two opposing notions about the relative stability of 5hmC at distinct genomic loci. As the first step toward understanding these molecular mechanisms associated with 5hmC function, it is important to not only precisely locate 5hmC in the genome, but also to determine the relative abundance at each modified site. Here we describe a modified bisulfite sequencing method that when combined with traditional bisulfite sequencing can determine the location of 5hmC at single-base resolution and quantitatively assess the abundance of 5mC and 5hmC at each modified cytosine.
Using synthetic model DNA we demonstrated that coupling βGT-mediated protection of 5hmC with mTet1-based oxidation of 5mC allows for the distinction of 5hmC from unmodified cytosine and 5mC by sequencing. 5fC and 5caC presented in the original genomic DNA do not interfere with TAB-Seq since they behave like unmodified cytosine under bisulfite treatment (He et al., 2011). We also utilized this method to examine previously reported 5hmC enriched loci and successfully identified genuine 5hmC sites. These results show the general utility of TAB-Seq to assess 5hmC in a loci-specific manner, much the same as traditional bisulfite sequencing is currently used.
We applied this technique to mammalian genomes by generating single-base resolution maps of 5hmC in human and mouse ESCs. We show that these maps agree well with previous maps generated using affinity-based 5hmC profiling. Importantly, these single-base maps also revealed a significant number of new 5hmC sites. Analyses of two 5hmC maps in ESCs identified several novel sequence-based characteristics of 5hmC that were previously unknown. We observed that, much like 5mC, 5hmC tends to occur primarily at CpG-dinucleotides yet, unlike 5mC, exhibits an asymmetric strand bias. We also observed a relatively strong local sequence preference surrounding 5hmC, with 5hmC occurring within a G-rich context. This observation is consistent with previous report that 5hmC regions are GC-skewed (Stroud et al., 2011). These sequence-based features associated with 5hmC may provide a basis for future mechanistic insight into the means by which 5hmC is deposited, recognized, and dynamically regulated.
The ability to quantify 5hmC abundance with base resolution offered the unique opportunity to assess its relative abundance at various regulatory elements and genomic annotations without bias. In contrast to the nearly uniform distribution of 5mC outside of promoter regions, we found that the abundance of 5hmC varies among different classes of functional sequences. It is most enriched at distal regulatory regions where levels of 5mC are correspondingly lower than the genome average. This observation agrees with recent findings from others (Stadler et al., 2011), and suggests that active demethylation occurs at active regulatory elements through 5hmC. This active demethylation is distributed around, but not within, transcription factor consensus motifs. Supporting the notion of active demethylation, total DNA methylation exhibits a strong negative correlation with 5hmC at distal regulatory elements (Spearman correlation = −0.30). One interesting observation of these distal cis-regulatory elements is that 5hmC and 5mC often occur together at the same cytosine. Currently, the exact mechanisms that determine the dynamics of 5hmC and 5mC at these cis-regulatory sequences are unclear.
Previous affinity-based studies have suggested enrichment of 5hmC at CpG-rich transcription start sites. However, these observations relied heavily on antibody-base detection, which has been shown to exhibit bias toward 5hmC dense regions. Here we find that, in general, 5hmC is most abundant at regions of low CpG content. Furthermore, even promoters with relatively high 5hmC content tend to have low CpG content in both mouse and human ESCs. These findings highlight the utility of a base-resolution method for measuring 5hmC abundance, and provide new insight into its dynamic regulation at promoter sites with distinct CpG content.
Tahiliani and colleagues (Tahiliani et al., 2009) recently estimated the genome-wide abundance of 5hmC to be about 14 times less than that of 5mC, which would correspond to ~4.4 million 5hmCs in human. However, as our results indicate that the base-level abundance of 5hmC is several times lower than 5mC, this is likely an under-estimate. The comparatively low number of 5hmCs confidently detected in our study (691,414) is likely explained by the frequent hydroxymethylation of gene bodies previously observed in affinity-based studies (Ficz et al., 2011; Pastor et al., 2011; Stroud et al., 2011; Szulwach et al., 2011a; Williams et al., 2011; Wu et al., 2011; Xu et al., 2011). Since genic cytosines likely exist at a relatively low abundance of 5hmC (between 3–4%), they would have escaped detection at our current sequencing depth. In order to resolve low abundance 5hmCs at single-base precision, significantly more sequencing would be required. This observation highlights the biases inherent in affinity-based 5hmC mapping, which can amplify frequent weak signals found in gene bodies to overshadow rare but stronger ones at distal regulatory elements.
In summary, we have developed a genome-wide approach to determine 5hmC distribution at base resolution, and generated the first base-resolution maps of 5hmC in both human and mouse ESCs. These maps provide a template for further understanding the biological roles of 5hmC in stem cells as well as gene regulation in general. In conjunction with methylC-Seq, the TAB-Seq method described here represents a general approach to measure the absolute abundance of 5mC and 5hmC at specific sites or genome-wide, which could be widely applied to various cell types and tissues.
Glucosylation reaction was performed in a 50 µl solution with 50 mM HEPES buffer (pH 8.0), 25 mM MgCl2, 100 ng/µl sonicated genomic DNA with spike-in control, 200 µM UDP-Glc, and 1 µM wild-type βGT. The reaction was incubated at 37 °C for 1 h. After the reaction, the DNA was purified by QIAquick Nucleotide Removal Kit (Qiagen). The oxidation reaction was performed in a 50 µl solution with 50 mM HEPES buffer (pH 8.0), 100 µM ammonium iron (II) sulfate, 1 mM α-ketoglutarate, 2 mM ascorbic acid, 2.5 mM DTT, 100 mM NaCl, 1.2 mM ATP, 10 ng/µl glucosylated DNA and 3 µM recombinant mTet1. The reaction was incubated at 37 °C for 1.5 h. After proteinase K treatment, the DNA was purified with Micro Bio-Spin 30 Columns (Bio-Rad) and then by QIAquick PCR Purification Kit (Qiagen).
For a given genomic interval, the abundance of hydroxymethylation (%hmCG) is estimated as the number of cytosine base calls in the interval divided by the number of cytosine plus thymine base calls in the interval from TAB-Seq reads, where the reference is in CG context. To estimate %5mC level, we subtracted the total methylation level from methylC-Seq by the %5hmC level from TAB-Seq. In all instances, only base calls with Phred score ≥ 20 were considered.
This study was supported by National Institutes of Health (GM071440 to C.H., NS051630 and P50AG025688 to P.J., U01 ES017166 to B.R.), a Catalyst Award (C.H. and J.-H.M.) from the Chicago Biomedical Consortium with support from the Searle Funds at The Chicago Community Trust, the Emory Genetics Discovery Fund (P.J.), the Simons Foundation Autism Research Initiative (P.J.), the Autism Speaks grant (#7660 to X.L.), and the Ludwig Institute for Cancer Research (B.R.).
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Sequencing data have been deposited to GEO (accession GSE36173).