5-Hydroxymethylcytosine (5-hmC) was first observed in mammals in 1972, but it received little attention until recently1. In 2009, 5-hmC was found to exist in relatively high abundance in Purkinje neurons and embryonic stem cells (ESCs), and to be produced specifically through oxidation of 5-methylcytosine (5-mC) catalyzed by the Tet family of proteins2,3. 5-hmC is thought to be an intermediate in an active demethylation process and may have direct roles in gene expression, as the modified base itself cannot be recognized by most 5-mC–binding proteins3–7. With the development and application of more sensitive detection technologies, 5-hmC has been found to be present at different levels in the genomes of various cell types or tissues8–11. Genome-wide profiling of 5-hmC further indicates potential regulatory roles of 5-hmC in ESC regulation, myelopoiesis, zygote development and neurodevelopment, thus suggesting that it may serve as an epigenetic mark12–20.
After the discovery of 5-hmC, several groups independently reported further oxidation of 5-hmC to 5-formylcytosine (5-fC) and 5-carboxylcytosine (5-caC) catalyzed by Tet proteins21–23. Both 5-caC and 5-fC can be recognized and excised by thymine DNA glycosylase (TDG) and then converted back to cytosine through the base excision repair pathway21,24,25. This newly discovered active demethylation pathway again suggests that 5-hmC is an intermediate of demethylation. However, 5-hmC accumulates to high abundance in certain brain tissues, implying functional roles other than as an intermediate in demethylation. Determining the exact location and relative abundance of 5-hmC will be crucial to fully unveil the biology associated with this base modification. We describe here a detailed protocol for the TAB-seq method that we recently published for single-base resolution sequencing of 5-hmC26.
Development of the protocol
Traditional bisulfite sequencing, which has been widely used to detect 5-mC at single-base resolution, cannot differentiate 5-mC from 5-hmC, as both resist deamination during the treatment of DNA with sodium bisulfite7,27. The protocol described here overcomes this limitation by selectively converting 5-mC to 5-caC in two steps: protection of 5-hmC through glucosylation, followed by mTet1-mediated oxidation of 5-mC to 5-caC. After subsequent bisulfite conversion, the protected β-glucosyl-5-hydroxymethylcytosine (5-gmC; from 5-hmC) is sequenced as C, whereas 5-caC and C read as T, enabling single-base resolution sequencing of 5-hmC26.
Overview of Tet-assisted bisulfite sequencing (TAB-seq). 5-hmC is protected specifically by β-GT to generate 5-gmC, followed by oxidation of 5-mC to 5-caC by mTet1. Only 5-gmC is read as C after bisulfite treatment and PCR amplification.
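The readout logic of the two treatments can be summarized in a short lookup sketch. This is an idealized illustration only: it assumes 100% efficient glucosylation, oxidation and deamination, and the function and dictionary names are hypothetical rather than part of the protocol. 5-fC is included only on the TAB-seq path, because its readout under standard bisulfite treatment is partial.

```python
# Idealized model of how each cytosine form is read after bisulfite
# treatment and PCR. All names here are illustrative assumptions.

# Readout of each base form once it enters bisulfite conversion.
BISULFITE_READOUT = {
    "C": "T",      # unmodified cytosine is deaminated and read as T
    "5-mC": "C",   # resists deamination
    "5-hmC": "C",  # resists deamination
    "5-gmC": "C",  # glucosylated (protected) 5-hmC resists deamination
    "5-caC": "T",  # read as T
}


def tab_treat(base):
    """Apply beta-GT glucosylation, then mTet1 oxidation, to one base form."""
    after_bgt = "5-gmC" if base == "5-hmC" else base
    # mTet1 oxidizes 5-mC and 5-fC to 5-caC; 5-gmC is protected.
    return {"5-mC": "5-caC", "5-fC": "5-caC"}.get(after_bgt, after_bgt)


def read_standard(base):
    """Base call after conventional bisulfite sequencing."""
    return BISULFITE_READOUT[base]


def read_tab(base):
    """Base call after TAB-seq (treatment followed by bisulfite)."""
    return BISULFITE_READOUT[tab_treat(base)]


if __name__ == "__main__":
    for base in ("C", "5-mC", "5-hmC"):
        print(base, "standard:", read_standard(base), "TAB-seq:", read_tab(base))
```

In this model, 5-mC and 5-hmC are indistinguishable under standard bisulfite sequencing, whereas only 5-hmC is read as C after TAB-seq.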
In the first step, β-glucosyltransferase (β-GT), a T4 bacteriophage protein, is used to transfer a glucose to the hydroxyl group of 5-hmC and generate 5-gmC28,29. This β-GT–catalyzed glucosylation is highly selective and efficient with either natural or chemically modified uridine diphosphate (UDP)-glucose11,30. Several groups, including ours, have used this selective glucosylation reaction of 5-hmC for the enrichment of 5-hmC–containing genomic DNA fragments11,31–33.
In the second step, 5-mC is converted by Tet proteins to 5-caC, which is read as T in bisulfite sequencing. 5-fC, which can be partially converted to T under standard bisulfite treatment, can also be oxidized by Tet proteins to 5-caC34. Thus, only protected 5-gmC will read as C in TAB-seq. Most reagents in the protocol are readily available. Active mTet1 is now commercially available (Wisegen), and the expression and purification procedures for wild-type β-GT and the active domain of mTet1 can be followed as reported11,13,26. We also provide a detailed protocol for producing and purifying a recombinant mTet1 protein (Box 1).
Applications of the method and limitations
TAB-seq is amenable to both whole-genome sequencing and locus-specific sequencing. This method has recently been used to produce genome-wide 5-hmC maps at base resolution in human and mouse ESCs26. Although we have not tested this method combined with reduced representation bisulfite sequencing (RRBS), we believe that the TAB method is compatible with RRBS35. In TAB-seq, the detection limit is governed by the conversion rate of 5-mC, the protection efficiency of 5-hmC, the abundance of 5-hmC at the modification site and the sequencing depth26. With the protocol described here, highly efficient conversion of 5-mC to T (above 96%) in genomic DNA can be achieved, with at least 90% of the 5-hmC protected from conversion. Thus, the sensitivity and specificity of 5-hmC detection by TAB-seq depend on sequencing depth. A cytosine base with a less-abundant 5-hmC modification (i.e., <5%) will require more sequencing depth than a base with a higher level of 5-hmC. With RRBS or locus-specific sequencing, better sensitivity of 5-hmC detection may be achieved owing to higher sequencing depth at selected bases.
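To illustrate why depth matters, one possible way to decide whether the C reads at a site exceed background is a one-sided binomial test against the non-conversion rate. The sketch below is an assumption made for illustration, not a prescribed analysis step. The default background of 0.04 simply mirrors the "above 96%" conversion figure quoted above, and the function names are hypothetical.

```python
from math import comb


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def hmc_p_value(c_reads, depth, nonconversion_rate=0.04):
    """Probability of seeing at least `c_reads` C calls out of `depth` reads
    if the site carried no 5-hmC and every C call came from non-conversion.
    `nonconversion_rate` is an assumed background; measure it from spike-ins.
    """
    if not 0 <= c_reads <= depth:
        raise ValueError("c_reads must lie between 0 and depth")
    return binomial_tail(c_reads, depth, nonconversion_rate)


if __name__ == "__main__":
    # The same 20% C fraction is far more convincing at higher depth.
    print(hmc_p_value(2, 10))
    print(hmc_p_value(10, 50))
```

At a fixed C fraction, the p-value shrinks as depth grows, which is why low-abundance sites need deeper sequencing to separate from background.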
Comparison with other methods
Affinity-based methods have been widely used to enrich 5-hmC–containing fragments and profile the distribution of 5-hmC in the genome. Two main strategies have been developed. One is antibody based, wherein antibodies against 5-hmC12,36–38 or cytosine 5-methylenesulphonate, the product of 5-hmC after bisulfite treatment31, are used. The other is β-GT based and involves one of the following three approaches to achieve selective modification of 5-hmC for affinity purification: (i) an azide-modified glucose is transferred onto 5-hmC, followed by selective chemical labeling to attach a biotin tag11; (ii) a natural glucose is transferred, followed by periodate oxidation and biotinylation31; or (iii) a protein (JBP1) that specifically recognizes and binds to 5-gmC is used to enrich glucosylated 5-hmC32,33. However, affinity-based methods can neither detect 5-hmC at single-base resolution nor quantify its abundance at the modification site. In the antibody-based approach, recovery of hydroxymethylated fragments can be affected by the density of 5-hmC, especially in 5-hmC immunoprecipitation (hMeDIP), which uses antibodies that recognize 5-hmC31. Regions with high 5-hmC density may be overrepresented, whereas regions with low 5-hmC density may be underrepresented.
Single-molecule real-time sequencing
Single-molecule real-time (SMRT) sequencing can identify modified bases on the basis of the different polymerase passing rates at and around the base. Although this technology is capable of detecting 5-mC and 5-hmC modifications directly, its application is limited by low sensitivity and low throughput39. In early 2012, we modified our previous 5-hmC labeling method and combined it with SMRT DNA sequencing40. With a larger kinetic signature, increased 5-hmC abundance and a reduced amount of DNA to sequence, SMRT sequencing can be applied to detect 5-hmC in genomic DNA at single-base resolution. However, the quantitative information about 5-hmC at each modification site is lost during enrichment, and the throughput of the method needs to be improved for the sequencing of large genomes.
Oxidative bisulfite sequencing
Oxidative bisulfite sequencing (oxBS-seq), which can discriminate 5-mC from 5-hmC, was recently reported34. In this modified bisulfite-sequencing method, KRuO4 selectively oxidizes 5-hmC to 5-fC at high efficiency; the 5-fC is then converted to T by subsequent bisulfite treatment and PCR amplification. A comparison of the results of oxBS-seq with those of standard bisulfite sequencing allows for the quantitative sequencing of both 5-mC and 5-hmC at single-base resolution. Application of oxBS-seq requires multiple rounds of bisulfite treatment to fully deaminate 5-fC, and chemical oxidation may cause extensive oxidative DNA damage. Compared with oxBS-seq, TAB-seq gives direct reads of 5-hmC, and the treatment procedure incurs less DNA damage34. However, TAB-seq requires highly active Tet protein for efficient conversion of 5-mC to 5-caC.
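The difference between the subtractive readout of oxBS-seq and the direct readout of TAB-seq can be sketched as follows. This is a simplified illustration that assumes ideal conversion in every treatment. The function names are hypothetical, and clamping negative differences to zero is a choice made here, not something the methods prescribe.

```python
def oxbs_estimates(bs_c_fraction, oxbs_c_fraction):
    """Per-site estimates from standard bisulfite plus oxBS-seq.

    Standard bisulfite reads 5-mC and 5-hmC as C; oxBS-seq reads only
    5-mC as C, so 5-hmC is inferred by subtraction.
    """
    mc = oxbs_c_fraction
    hmc = max(0.0, bs_c_fraction - oxbs_c_fraction)  # clamp sampling noise
    return mc, hmc


def tab_estimates(bs_c_fraction, tab_c_fraction):
    """Per-site estimates from standard bisulfite plus TAB-seq.

    TAB-seq reads only 5-hmC as C, so 5-hmC is measured directly and
    5-mC is inferred by subtraction.
    """
    hmc = tab_c_fraction
    mc = max(0.0, bs_c_fraction - tab_c_fraction)  # clamp sampling noise
    return mc, hmc


if __name__ == "__main__":
    print(oxbs_estimates(0.8, 0.6))
    print(tab_estimates(0.8, 0.2))
```

Because 5-hmC is usually the rarer mark, measuring it directly avoids taking a small difference between two larger, noisy numbers.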
Purification of the active domain of mTet1
As incomplete oxidation of 5-mC will result in false-positive 5-hmC signals, the availability of highly active mTet1 protein is crucial to TAB-seq. The expression and purification procedures of recombinant mTet1 are described in Box 1. To ensure high activity of the recombinant mTet1, all steps must be performed at 4 °C or on ice during purification. Aliquots of mTet1 protein are stored at −80 °C before use. Improper storage or more than two freeze-thaw cycles may result in decreased oxidation activity. The protocol to test the activity of purified mTet1 is described in Box 2; it is strongly recommended to carry out the activity test on each batch of newly purified recombinant mTet1 before applying it to TAB-seq.
For both locus-specific and whole-genome sequencing, two key parameters exist for an accurate estimation of 5-hmC abundance besides the conversion rate of unmodified cytosine to uracil: the oxidation efficiency of 5-mC to 5-caC and the protection efficiency of 5-hmC. Although nonprotected 5-hmC can result in an underestimation of 5-hmC abundance, nonconversion of unmodified C and 5-mC will result in false-positive 5-hmC signals, and should therefore be determined in each experiment. With sufficiently high C and mC conversion rates and 5-hmC protection rates, the abundance of 5-hmC can be quantified from the frequency with which C is read compared with T at a given genomic position in any sequencing experiment. To assess the conversion rates in genomic DNA, controls containing 5-mC and 5-hmC need to be spiked in before treatment. Such controls should be of sufficient complexity and contain modified cytosines in various sequence contexts (i.e., multiple CpGs throughout). For genome-wide sequencing, spike-in DNA should span at least 1 kb of the sequence, such that subsequent random fragmentation by sonication and sequencing can distinguish PCR duplicates. Furthermore, after bisulfite conversion, spike-in DNA should not align to a bisulfite-converted target genome. In practice, we find that DNA from the lambda phage and the pUC19 plasmid work well as spike-ins for mouse and human samples.
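As a minimal sketch of how these rates might be tabulated from spike-in reads, the function below takes C and T base-call counts at three classes of control cytosines and returns the three rates discussed above. The function names, argument layout and example counts are all hypothetical; in practice the counts would come from reads aligned to the spike-in references.

```python
def c_fraction(c_reads, t_reads):
    """Fraction of base calls at control cytosines that were read as C."""
    total = c_reads + t_reads
    if total == 0:
        raise ValueError("no coverage at control cytosines")
    return c_reads / total


def spike_in_rates(unmod_c, unmod_t, mc_c, mc_t, hmc_c, hmc_t):
    """Summarize TAB-seq performance from spike-in base-call counts.

    unmod_*: calls at unmodified cytosines in the control
    mc_*:    calls at 5-mC positions in the control
    hmc_*:   calls at 5-hmC positions in the control
    """
    return {
        # C calls at unmodified C mean bisulfite deamination failed.
        "c_nonconversion": c_fraction(unmod_c, unmod_t),
        # C calls at 5-mC mean oxidation (or deamination) failed.
        "mc_nonconversion": c_fraction(mc_c, mc_t),
        # C calls at 5-hmC mean glucosylation protected the base.
        "hmc_protection": c_fraction(hmc_c, hmc_t),
    }


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(spike_in_rates(5, 995, 30, 970, 920, 80))
```

The two non-conversion rates set the false-positive background, and the protection rate is the factor later used to correct the measured 5-hmC abundance.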
For the 5-mC control, DNA can either be selectively methylated at CpG sites using CpG methyltransferase or amplified with 5-mdCTP. However, the CpG-methylated control is recommended, as the frequent neighboring 5-mC generated by PCR may lead to the underestimation of the oxidation efficiency.
For the 5-hmC control, there is no enzyme that can selectively generate 100% 5-hmC from C or 5-mC; therefore, besides synthesizing long oligonucleotides containing 5-hmC at required positions, the easiest and most cost-effective way to generate DNA longer than 1 kb with multiple 5-hmC sites may be through PCR amplification with 5-hydroxymethyl dCTP (5-hmdCTP). With this method, each C position is expected to be 100% 5-hmC. However, we have found that many commercial 5-hmdCTPs contain contaminant dCTP, which will result in the underestimation of protection efficiency because unmodified cytosine will read as 'T' in TAB-seq. Purifying the commercial 5-hmdCTP with HPLC offers one solution to this problem and takes about 2 d (Box 3). However, it should be noted that it may be difficult to amplify fragments larger than 2 kb with purified 5-hmdCTP. Alternatively, if experiments are run alongside conventional bisulfite treatments, dCTP contamination can be adjusted for by direct measurement of bisulfite-converted cytosine in the 5-hmC spike-in control. Typically, >90% protection efficiency can be achieved for 5-hmC after taking contamination into account.
HPLC analysis of commercial 5-hmdCTP. The commercial 5-hmdCTP contains 4–5% of dCTP. mAU, milliabsorbance units.
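The contamination adjustment can be written as a one-line correction, sketched here under a simplifying assumption: in the conventional bisulfite run, every T call at a nominal 5-hmC position of the spike-in is taken to come from contaminant unmodified C. The function name and example values are hypothetical. The 5% figure only echoes the 4–5% dCTP content noted above.

```python
def corrected_protection(tab_c_fraction, bs_t_fraction):
    """Adjust the 5-hmC protection rate for dCTP contamination.

    tab_c_fraction: fraction of C calls at spike-in cytosines after TAB-seq
    bs_t_fraction:  fraction of T calls at the same positions after
                    conventional bisulfite treatment, taken here as the
                    fraction of positions carrying unmodified C
    """
    if not 0 <= bs_t_fraction < 1:
        raise ValueError("bs_t_fraction must be in [0, 1)")
    true_hmc_fraction = 1.0 - bs_t_fraction
    # Only genuine 5-hmC positions could have been protected.
    return tab_c_fraction / true_hmc_fraction


if __name__ == "__main__":
    # Made-up example: 87.4% C calls observed, 5% contaminant C.
    print(corrected_protection(0.874, 0.05))
```

Without the correction, the contaminant cytosines would be counted as unprotected 5-hmC and the protection rate would look lower than it is.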
In this protocol, we use methylated λ-DNA as C and 5-mC control in both genome-wide and locus-specific sequencing. For the 5-hmC control, a PCR product of 1.64 kb from a pUC19 vector (5-hmC control 1; generated with 5-hmC_F_1 and 5-hmC_R_1 primers) is used for the estimation of 5-hmC protection efficiency only in genome-wide sequencing. The 290-bp control (5-hmC control 2; generated with 5-hmC_F_2 and 5-hmC_R_2 primers), which is relatively easier to amplify and clone after bisulfite treatment, is used for the verification of 5-hmC protection efficiency after β-GT and mTet1 treatment with TOPO cloning. In both 5-hmC spike-in controls, all Cs are 5-hmC except for those in the primer sequence.
Verification of 5-mC conversion and 5-hmC protection on spike-in controls with TOPO cloning is strongly recommended after β-GT and mTet1 treatment but before proceeding with large-scale sequencing. However, the procedure may be simplified if no quantitative conversion or protection rate is required for locus-specific sequencing.
Data analysis for TAB-seq is nearly identical to that for traditional bisulfite sequencing. At each genomic locus, the estimated abundance of 5-hmC (A5-hmC) is measured as the number of cytosine base calls divided by the total (C + T) sequencing depth at the locus. For genome-wide analysis, only good-quality base calls (Phred score of 20 or greater) are considered. To correct for the 5-hmC protection rate (r5-hmC) being less than 100%, the absolute abundance of 5-hmC is calculated as E5-hmC = A5-hmC / r5-hmC.
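The calculation above can be sketched for a single cytosine position as follows. The input layout (a list of base and Phred score pairs, already oriented to the cytosine strand) and the function name are assumptions made for illustration. Capping the corrected value at 1.0 is an added safeguard, not part of the formula.

```python
def estimate_hmc(base_calls, protection_rate, min_phred=20):
    """Estimate 5-hmC abundance at one cytosine position.

    base_calls:      iterable of (base, phred) pairs covering the position
    protection_rate: r, the 5-hmC protection rate measured on spike-ins
    Returns (A, E, depth), or None if no usable C or T calls remain.
    """
    if not 0 < protection_rate <= 1:
        raise ValueError("protection_rate must be in (0, 1]")
    # Keep only good-quality calls; bases other than C or T are ignored.
    good = [base for base, phred in base_calls if phred >= min_phred]
    c_calls = good.count("C")
    t_calls = good.count("T")
    depth = c_calls + t_calls
    if depth == 0:
        return None
    uncorrected = c_calls / depth                        # A = C / (C + T)
    corrected = min(1.0, uncorrected / protection_rate)  # E = A / r, capped
    return uncorrected, corrected, depth


if __name__ == "__main__":
    calls = [("C", 30)] * 2 + [("T", 35)] * 8 + [("C", 10)]
    print(estimate_hmc(calls, protection_rate=0.9))
```

In the example, the low-quality C call is discarded, leaving two C calls in a depth of ten before the correction is applied.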
The resolution of TAB-seq to detect 5-hmC is crucially dependent on sequencing depth. For example, the median abundance of 5-hmC at 5-hmC sites in H1 ESCs is just under 20%, and detecting base-resolution 5-hmC at this level requires sequencing at a depth of ~25 times per cytosine, or ~50 times the haploid genome size. For cells with higher levels of 5-hmC, less sequencing is required: we estimate that a depth of ~15 times per cytosine is sufficient to detect base-resolution 5-hmC with an abundance of 30%. If base-resolution precision is not necessary, much less sequencing is required. We recommend sequencing reads with a length of at least 100 bp. For additional details, see published genome-wide bisulfite sequencing methods26,41.
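For planning purposes, the relationship between abundance and required depth can be explored with a simple binomial power calculation. The sketch below is a hypothetical planning aid and is not the procedure used to derive the depths quoted above. The protection rate, background rate, significance level and power are all assumed values, so the depths it returns should not be expected to match those figures.

```python
from math import comb


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def min_depth(abundance, protection=0.90, nonconversion=0.04,
              alpha=0.01, power=0.80, max_depth=500):
    """Smallest per-cytosine depth at which a site with the given 5-hmC
    abundance is called with probability >= `power`, using a one-sided
    binomial test at level `alpha` against the non-conversion background.
    Returns None if no depth up to `max_depth` suffices.
    """
    # Expected C fraction at a true site: protected 5-hmC plus background.
    p_alt = abundance * protection + (1 - abundance) * nonconversion
    for n in range(1, max_depth + 1):
        # Smallest C count that is significant at this depth, if any.
        threshold = next(
            (k for k in range(n + 1)
             if binomial_tail(k, n, nonconversion) <= alpha),
            None,
        )
        if threshold is not None and binomial_tail(threshold, n, p_alt) >= power:
            return n
    return None


if __name__ == "__main__":
    for abundance in (0.10, 0.20, 0.30):
        print(abundance, min_depth(abundance))
```

Whatever thresholds are chosen, the required depth never increases as abundance rises, which matches the trend described above.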
The entire workflow for the method is shown in the accompanying workflow figure, and the details of all oligonucleotides used in this protocol are listed in the table below. Supplementary Note 1 shows the sequence of the 5-hmC spike-in controls.
Details of the oligonucleotides used in the protocol.