PvuRts1I is a modification-dependent restriction endonuclease that recognizes 5-hydroxymethylcytosine (5hmC) as well as 5-glucosylhydroxymethylcytosine (5ghmC) in double-stranded DNA. Using PvuRts1I as the founding member, we define a family of homologous proteins with similar DNA modification-dependent recognition properties. At the sequence level, these proteins share a few uniquely conserved features. We show that these enzymes introduce a double-stranded cleavage on the 3′-side of, and at a distance from, the recognized modified cytosine. The distances between the cleavage sites and the modified cytosine are fixed within a narrow range, with the majority being 11–13nt away in the top strand and 9–10nt away in the bottom strand. The recognition sites of these enzymes generally require two cytosines on opposite strands around the cleavage sites, i.e. 5′-CN11–13↓N9–10G-3′/3′-GN9–10↓N11–13C-5′, with at least one cytosine being modified for efficient cleavage. As one potential application for these enzymes is to provide useful tools for selectively mapping 5hmC sites, we have compared the relative selectivity of a few PvuRts1I family members towards different forms of modified cytosines. Our results show that the inherently different relative selectivity towards modified cytosines can have practical implications for their application. By using AbaSDFI, a PvuRts1I homolog with the highest relative selectivity towards 5ghmC, to analyze rat brain DNA, we show it is feasible to map genomic 5hmC sites close to base resolution. Our study offers unique tools for determining more accurate hydroxymethylomes in mammalian cells.
Modification-dependent restriction endonucleases are widely present in bacterial genomes and are thought to protect hosts from invading bacteriophages containing modified DNA (1). Among many examples are the T-even phages, in which only 5-hydroxymethylcytosines (5hmC) are incorporated into the genome during replication and further modified to 5-glucosylhydroxymethylcytosine (5ghmC) by glucosyltransferases (1). Although T4 wild-type DNA is resistant to most regular restriction enzymes, there are types of modification-dependent restriction enzymes that are able to restrict their infection in vivo, including PvuRts1I (2,3) among a few others. For a long time, the detailed in vitro biochemical properties of PvuRts1I remained obscure (4).
In mammalian genomes, it is commonly believed that 5-methylcytosine (5mC) is the major form of epigenetic base modification. Recently, the observation of 5hmC as the enzymatic oxidative product of 5mC in mammalian genomes (5,6) has added an extra layer of complexity to the current understanding of epigenetic regulation and spurred rising interest in determining its genomic locations and metabolism. However, although the modified base 5hmC was discovered in bacteriophages >50 years ago (7), there are few useful methods, either enzymatic or chemical, to specifically recognize 5hmC residues and pinpoint their locations in DNA, largely due to their close structural similarity to 5mC. For example, 5mC-dependent endonucleases, such as the MspJI family (8) or McrBC (9), do not distinguish 5mC and 5hmC; 5mC-sensitive endonucleases, such as MspI or HpaII, in most cases are equally affected by 5mC and 5hmC (4). The widely used bisulfite conversion method cannot differentiate between 5mC and 5hmC and reports both forms indistinguishably (10,11). Recently, the availability of 5hmC-specific antibodies has enabled a few enrichment-based methods [e.g. hMeDIP (12)]. However, the format of the experiment, based on affinity pull-down, may limit the range of its application, and the resolution of the data is still far from base resolution.
Given mounting evidence for the importance of 5hmC in mammalian epigenetics and the previous experimental observations that PvuRts1I is able to specifically recognize 5hmC both in vivo and in vitro (3), we have set out to investigate the in vitro biochemical properties of PvuRts1I and its homologs identified in REBASE (4). During the course of our study, Szwagierczak et al. (13) reported that recombinant PvuRts1I selectively cleaves 5hmC-containing DNA substrates and that the double-stranded cleavage sites are at N11–12/N9–10 on the 3′-side of the recognized 5hmC site. In addition, the authors noted that PvuRts1I prefers to cleave at symmetric sites 5′-hmCN11–12↓N9–10G-3′/3′-GN9–10↓N11–12hmC-5′, suggesting a likely in-cis dimerization cleavage process (13). Still, a number of questions are left unanswered. For example, it is not clear whether PvuRts1I is suitable for mapping genomic 5hmC sites, and details concerning its practical use are lacking. In this regard, a quantitative description of substrate selectivity on 5hmC versus 5mC or unmodified cytosine is crucial, because in most human tissue DNA, the level of 5hmC is usually on the order of 0.01% of the total nucleotide (14). During our investigation, we have observed that PvuRts1I is sensitive to different purification procedures, such that certain ions used in the buffer may quickly inactivate most of the enzyme in crude lysates (see ‘Results’ section). We have thus optimized purification conditions to obtain highly active enzymes. Furthermore, we have observed that in certain reaction conditions (e.g. reaction buffer or high enzyme concentration), PvuRts1I starts to digest 5mC and 5hmC indiscriminately (Figure 3 in Results). This raises the concern of a possibly elevated false discovery rate if the enzyme is used improperly, which must be carefully monitored during its application.
In this article, we systematically characterized the enzymatic properties of several members in the PvuRts1I family. In particular, we focus on comparing their substrate selectivity on different forms of cytosine modifications and evaluating their suitability in mapping genomic 5hmC sites. As one of the conclusions, we show that by using AbaSDFI, a homolog of PvuRts1I with much higher substrate selectivity, it is possible to map genomic 5hmC sites close to base resolution.
Genes in the PvuRts1I family, including PvuRts1I, PpeHI and AbaSDFI (Supplementary Table S1), were synthesized using the optimized Escherichia coli codon set from Integrated DNA Technologies Inc. They were then sub-cloned into pTXB1, and overexpressed in E. coli strain T7 Express (NEB #C2566). Cells were grown at 30°C in LB medium with ampicillin to late log phase and induced by IPTG at 16°C overnight. Cells were harvested by centrifugation and re-suspended in 0.5M KOAc, 10mM Tris–OAc (pH 8.0) (column buffer). After sonication and centrifugation, the clear supernatant was loaded onto a chitin column (NEB #S6651), which was equilibrated with the column buffer containing 0.1% Triton-X100. The column was washed with 50 column volumes of the column buffer. For intein cleavage, the column was flushed with the column buffer containing 30mM DTT and incubated at 4°C overnight. Fractions containing the purified protein were eluted from the column using the column buffer.
Activities of enzymes were assayed on either T4 gt DNA or T4 wt DNA, depending on the preference of each enzyme. One unit of the enzyme is defined as the amount to digest 1µg of substrate DNA (T4 gt or T4 wt) to completion in NEB buffer 4 at 23°C within 20min.
To test the enzyme sensitivity to different salts in crude lysates, 1.5ml PvuRts1I-expressing E. coli cells from overnight culture were spun down and supernatant was removed, then 1.5-µl 1-M Tris–acetate (pH 8.0) and 150µl of a 1-M solution of each different salt, all buffered to pH 8.0 by its own ion type, were added. Cells were then sonicated, spun again and left at 23°C for 6h. The supernatant was then diluted in 10-, 100- or 1000-fold by diluent (250mM KOAc, 10mM Tris–acetate, pH 8.0 and 200µg/ml BSA). Of diluted supernatant, 3µl was tested for activity by incubating with 125ng T4 gt DNA at 23°C for 20min in NEB buffer 4. The reactions were stopped by adding 6× loading dye and visualized on a 1% agarose gel (Figure 3).
To prepare the DNA substrates used in Figures 2C and 3, DNA fragments were PCR-amplified from the T4 gt genomic or pUC19 DNA by using dATP/dGTP/dTTP mixed with dhmCTP (Bioline #BIO-39046), dmCTP (NEB #N0356S) or dCTP, respectively. PCRs were carried out using Phusion polymerase (NEB #M0530). The DNA fragment containing 5ghmC was obtained by further modification of the 5hmC DNA fragment with T4 β-glucosyltransferase (NEB #M0357). All PCR primers are listed in Supplementary Table S2.
The synthetic oligonucleotides used in Figure 4 were made by PCR using primers hmCG_ACGT_F and hmCG_ACGT_R on the hmCG_ACGT_template in the presence of dhmCTP/dATP/dGTP/dTTP (Supplementary Table S3). The oligonucleotide sequence is designed so that there is only one CG site (underlined in Supplementary Table S3), which contains the 5hmC in the top strand. Before PCR, each primer was individually labeled using γ-33P-ATP (Perkin-Elmer) and T4 polynucleotide kinase (NEB #M0201), followed by purification using G-25 columns (GE Healthcare) to make the 5′-end labeled substrates. PCR products were purified using the QIAGEN Nucleotide Removal kit. To make the 3′-end labeled substrate, purified unlabeled PCR product was incubated with Taq polymerase (NEB #M0273) and α-33P-dATP (Perkin-Elmer), so that both of its 3′-ends were labeled. As the final step, all labeled DNA fragments were further modified by T4 β-glucosyltransferase.
The synthetic oligonucleotides containing 5hmC used in Figure 5 were synthesized in-house (Supplementary Table S4). Each oligo was resuspended in H2O to 20µM. Equal volumes of top strand and bottom strand were then mixed. The final concentration of double-stranded substrate is at 10µM. To be used as AbaSDFI substrate, each double-stranded oligo was glucosylated using T4 β-glucosyltransferase. In Table 4, 5hmC_21C_top pairs with 5hmC_215hmC_bottom as substrate used in Figure 5B. Similarly, 5hmC_21C_top pairs with 5hmC_21mC_bottom (Figure 5C); 5hmC_21C_top pairs with 5hmC_21C_bottom (Figure 5D); 5hmC_nonC_top pairs with 5hmC_nonC_bottom (Figure 5E); C_21C_top pairs with 5hmC_21C_bottom (Figure 5F).
In each digestion series, 125ng substrate DNA was digested by PvuRts1I, PpeHI or AbaSDFI in a 2-fold serial dilution in NEB buffer 4 with additional KOAc (final concentration 250mM) for 20min at 23°C. Addition of KOAc was found to significantly inhibit the enzyme activity on 5hmC, 5mC and C, with less effect on 5ghmC. The ratio of the relative selectivity is determined by the comparison of the extent of digestion on different substrates.
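The comparison across a 2-fold dilution series can be sketched as a small calculation. The sketch below is only an illustration under stated assumptions: it assumes that selectivity is read from the last dilution step at which each substrate still reaches the same chosen extent of digestion, and the function name `selectivity_from_endpoints` and the end-point steps in the example are hypothetical, not measured values.

```python
def selectivity_from_endpoints(endpoint_steps, dilution_factor=2):
    """Convert end-point dilution steps into relative selectivity ratios.

    endpoint_steps maps each substrate to the index of the last dilution
    step (0 = most concentrated enzyme) that still gives the chosen extent
    of digestion. Assumption: each further step that still digests the
    substrate corresponds to a `dilution_factor`-fold higher activity.
    """
    weakest = min(endpoint_steps.values())
    return {
        substrate: dilution_factor ** (step - weakest)
        for substrate, step in endpoint_steps.items()
    }


# Hypothetical end-point steps, chosen only to illustrate the arithmetic.
example_steps = {"5ghmC": 8, "5hmC": 7, "5mC": 1, "C": 0}
example_ratios = selectivity_from_endpoints(example_steps)
print(example_ratios)
```

With these illustrative steps the ratios come out as powers of two relative to the least-digested substrate, which is the form in which selectivity ratios from a 2-fold series are naturally expressed.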
Of T4 gt DNA, 0.9µg was digested by PvuRts1I and purified using spin columns. Digested DNA was then treated with T4 DNA polymerase (NEB #M0203) for end-polishing. The DNA fragments were ligated to dephosphorylated linear pUC19 (linearized by HincII). Colonies were picked after transformation to NEB Turbo competent cells (C2984) and the inserts were sequenced.
A total of 2µg rat brain genomic DNA from mixed tissue was glucosylated using the T4 β-glucosyltransferase at 37°C overnight. After heat inactivation at 65°C for 20min, the DNA was precipitated using isopropanol and re-suspended in 20µl water. The digestion was carried out in a total volume of 30µl, with 3µl NEB buffer 4, 6µl KOAc (2M, pH=8), 20µl glucosylated gDNA and 100 U of AbaSDFI, at room temperature for 1h. The DNA was then precipitated using isopropanol and re-suspended in 20µl water. Ligation was performed in a total volume of 10.5µl, with 1µl ligation buffer, 8µl DNA, 0.5µl of double-stranded adaptor (P1b_top_2N+P1b_bottom for the 2N library or P1b_top_3N+P1b_bottom for the 3N library, both at 10µM; see Supplementary Table S4 for sequences) and 1µl T4 ligase (NEB #M0202), at room temperature overnight. The ligated DNA was then resolved on a 1% low-melting agarose gel (Lonza #50080) with a DNA size marker. The gel piece containing DNA fragments within the 1–3 kb size range was excised and digested using β-agarase (NEB #M0392). Adaptor-specific PCR was performed using primer P1XbaIcloningprimer (Supplementary Table S4) and Phusion DNA polymerase. After PCR, the DNA fragments were cloned into the PmeI site in pNEB193 and individually sequenced in 96-well format (2 plates for the 2N library and 3 plates for the 3N library). The cloned genomic fragments were identified by trimming the adaptor sequence (but leaving the randomized 2N or 3N nt) and aligned to the rat reference genome (REFSEQ ID: NC_005109.2) using BLASTN (15). The ends of each cloned genomic fragment signify half of the enzymatic cleavage sites. The other half of each cleavage site was inferred by extracting the adjacent 30-nt sequences from the reference genome and joined to the cloned sequence for analysis (Figure 6).
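The last step, joining each fragment end to the adjacent reference sequence, can be illustrated with a minimal sketch. It assumes the reference is available as a top-strand string and that the alignment gives 0-based, half-open coordinates for the insert. The function name `cleavage_site_windows` is hypothetical, and taking an equal-length window from inside the insert is an assumption made for illustration.

```python
def cleavage_site_windows(reference, start, end, flank=30):
    """Rebuild the sequence context around both ends of an aligned insert.

    reference : top-strand reference sequence (string)
    start, end: 0-based, half-open coordinates of the insert on the reference
    flank     : number of reference nucleotides taken outside each insert end
                (30 nt in the text); the same length is taken from inside
                the insert, which is an assumption of this sketch.
    Returns (left_window, right_window); each window spans one cleavage site.
    """
    left_outside = reference[max(0, start - flank):start]
    left_inside = reference[start:min(end, start + flank)]
    right_inside = reference[max(start, end - flank):end]
    right_outside = reference[end:end + flank]
    return left_outside + left_inside, right_inside + right_outside


# Toy reference: the insert is the run of C's at positions 5..14.
toy_reference = "AAAAACCCCCCCCCCGGGGG"
left, right = cleavage_site_windows(toy_reference, start=5, end=15, flank=3)
print(left, right)  # AAACCC CCCGGG
```

Each returned window places the inferred cleavage position at its midpoint, so both halves of a site can be inspected together.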
Using PvuRts1I protein sequence as the query, we searched the NR and ENV_NR databases at NCBI using BLAST (15) and identified a number of homologs; collectively, we call them the PvuRts1I enzyme family. Using both in vivo phage restriction assays against T4 phages and in vitro digestion assays on modified DNA, we evaluated the activity of each homolog and summarized our results in Supplementary Table S1. In the following, we focus our discussion on three representative members in the family: PvuRts1I from Proteus vulgaris Rts1, PpeHI from Proteus penneri ATCC 35198 and AbaSDFI from Acinetobacter baumannii SDF. All enzyme entries can be found in REBASE (4).
For a long time, PvuRts1I was placed into the ‘weirdo’ class of restriction endonucleases in REBASE (4), mainly due to its unique biological properties and lack of detailed experimental characterization. With our recent screening efforts, we have identified a number of active PvuRts1I homologs from complete bacterial genome sequences as well as environmental sequences (Supplementary Table S1 and Figure 1). These genes are significantly similar to each other at the sequence level, yet no previously known conserved domains can be identified in the family. Examination of the multiple sequence alignment (Supplementary Figure S2) of the PvuRts1I family protein sequences does not reveal the hallmarks of the usual catalytic motifs that are often observable in the restriction endonucleases, such as PD…(D/E)XK or HNH motifs, etc. (16). Figure 1A shows a schematic sequence conservation profile at the amino acid level abstracted from the multiple sequence alignment (Supplementary Figure S2) (17). The scale of the conservation is from 0 to 9, with 9 being most conserved. The absolutely conserved residues and the predicted secondary structure elements are shown on the top of the profile in Figure 1A (18). It appears that the N-terminal region of the PvuRts1I family is more evolutionarily constrained, with more conserved residues and more well-defined structural elements than the C-terminal region. In the absence of previously known catalytic motifs, we attempted to identify potential catalytically important residues based on conservation and observed enzymatic properties. The activities of the PvuRts1I enzymes are dependent on Mg2+ in the reaction buffer, suggesting the possible involvement of metal-ion chelating residues. Figure 1B shows a multiple sequence alignment encompassing a conserved cluster of negatively charged residues in the N-terminal region (box B in Figure 1A). 
It is likely that this region may be responsible for metal ion binding and can act as the catalytic center. Systematic mutagenesis experiments and structure determination are needed in the future to test the above speculations.
All genes were synthesized using optimized E. coli codons. We first fused a few genes with a 6×His-tag, either at the N- or C-terminus, to facilitate quick purification. To our surprise, although a high level of cytosine modification-dependent activity was detected in the crude lysate of the expression clones, a large portion of the activity was quickly lost after purification, even though the target protein was successfully recovered and purified. It appeared to us that the loss of activity may be due to the specific chemicals used during purification. We then investigated the sensitivity of PvuRts1I enzymes in crude lysates to different salt concentrations, as shown in Figure 2A for PvuRts1I. Indeed, a high concentration of imidazole salts, as routinely used for eluting the His-tagged protein from chelating columns, leads to the loss of the majority of the PvuRts1I activity in crude lysates (Figure 2A, lanes d and e). High concentrations of another anion, Cl−, which is commonly used to increase the ionic strength of the buffer, also seem to inhibit activity. Most of these activity losses appear irreversible, since dilution or buffer change cannot restore the lost enzymatic activities. Since the presence of high concentrations of NaCl or KCl can adversely affect PvuRts1I enzyme in crude lysates, many common salt gradient elution purification schemes cannot be used as the first step. In Supplementary Figure S1, we show the salt sensitivity of the two other enzymes, PpeHI and AbaSDFI. It appears that the sensitivity profile of each enzyme to a specific salt also varies.
To find a mild and universal purification method, we expressed the recombinant protein fused with a cleavable intein and a chitin-binding domain (CBD) (19). First, the fusion protein was bound to the chitin column under mild conditions; then, the CBD tag of the fusion protein was cleaved off by the embedded intein in the presence of dithiothreitol (DTT) (19). Using this strategy, we obtained each wild-type enzyme in highly active form and close to homogeneity on an SDS-PAGE gel (Figure 2B). The activity of each enzyme was assayed on wild-type T4 (containing 5ghmC, referred to as T4 wt hereafter) or a mutant phage T4 gt (containing 5hmC, referred to as T4 gt hereafter) genomic DNA (see ‘Materials and Methods’ section). Table 1 lists the basic properties of each purified enzyme. These preparations were then used in the following characterization experiments.
Figure 2C demonstrates the modification-dependent activity of AbaSDFI on a set of DNA fragments designed to test modification selectivity when a choice is offered. Each differently sized fragment carries one form of cytosine at all C locations—5ghmC, 5hmC, 5mC or unmodified C. Under such competitive digestion conditions, it can be seen that AbaSDFI digests 5ghmC- and 5hmC-containing DNA, but prefers the former, and does not act on either 5mC- or C-containing DNA (Figure 2C).
It is known that wild-type restriction endonucleases sometimes exhibit activity on non-canonical sites under certain in vitro conditions, e.g. high enzyme concentrations or extended incubation times, etc. (20). These so-called ‘star’ activities usually do not impair the fitness of the bacterial hosts from which these enzymes originate and are thus not selected against by nature, as the in vivo concentration of the enzymes is relatively low. Similarly, restriction endonucleases known to recognize one particular modification may exhibit activity on other modifications, as long as these modifications are not present in the hosts or are not deleterious. A good example is ScoMcrA, which recognizes both DNA phosphorothioation and methylation (21). To use PvuRts1I-like enzymes to map 5hmC sites in the mammalian genome, it is important to know their relative selectivity to different modified cytosines, as C, 5mC and 5hmC all exist in the genome and 5hmC constitutes only a tiny fraction of the cytosine pool (14). Enzymes with low relative substrate selectivity can result in a high false discovery rate.
To quantify the relative selectivity of the PvuRts1I enzymes on different cytosine modifications, we adopted an approach similar to that previously used for regular restriction endonucleases (20). For example, as shown in Figure 3A, with an increasing 2-fold titration of purified PvuRts1I, the enzyme shows a different activity profile on each substrate DNA. PvuRts1I acts on 5hmC and 5ghmC DNA almost equally. When the enzyme concentration is relatively high, PvuRts1I starts to digest DNA containing only 5mC and C as well. From a practical standpoint of mapping 5hmC sites, this is undesired. We define quantitatively the relative selectivity of each enzyme as the ratio of specific activity on different forms of modified cytosines. For example, the relative selectivity for PvuRts1I is 5hmC:5ghmC:5mC:C=2000:2000:8:1 (Figure 3A). Similarly, the relative selectivity for PpeHI is 5hmC:5ghmC:5mC:C=128:256:2:1 (Figure 3B); the relative selectivity for AbaSDFI is 5hmC:5ghmC:5mC:C=500:8000:1:ND (ND: none detected) (Figure 3C). Figure 3D shows the comparison of the three enzymes’ relative selectivity normalized based on the activity towards 5mC. From the comparison, we conclude that among the active PvuRts1I-like enzymes we characterized, AbaSDFI has the best discriminative power on 5ghmC over 5mC and C. In addition, only AbaSDFI does not have detectable activity towards unmodified cytosine (Figure 3D). These properties were used in our 5hmC site mapping experiment (see below). Here, we consider 5ghmC equally important as 5hmC because although 5ghmC is not known to be present in the mammalian genome, in vitro 5hmC can be converted to 5ghmC essentially completely using the T4 β-glucosyltransferase (22).
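The normalization to 5mC used for the comparison can be reproduced from the ratios quoted above. The following sketch takes the three reported ratios as input; the dictionary layout and the function name `normalize_to_5mc` are illustrative choices, and ND (none detected) is represented as `None`.

```python
# Relative selectivity ratios as reported in the text (5hmC:5ghmC:5mC:C).
selectivity = {
    "PvuRts1I": {"5hmC": 2000, "5ghmC": 2000, "5mC": 8, "C": 1},
    "PpeHI": {"5hmC": 128, "5ghmC": 256, "5mC": 2, "C": 1},
    "AbaSDFI": {"5hmC": 500, "5ghmC": 8000, "5mC": 1, "C": None},  # None = ND
}


def normalize_to_5mc(ratios):
    """Rescale one enzyme's ratios so that its activity on 5mC equals 1."""
    reference = ratios["5mC"]
    return {
        substrate: None if value is None else value / reference
        for substrate, value in ratios.items()
    }


normalized = {enzyme: normalize_to_5mc(r) for enzyme, r in selectivity.items()}
for enzyme, values in normalized.items():
    print(enzyme, values)
```

After normalization the 5ghmC:5mC values are 250 for PvuRts1I, 128 for PpeHI and 8000 for AbaSDFI, which makes the difference in discriminative power directly comparable.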
To investigate the cleavage positions of PvuRts1I enzymes near the modified sites, we individually labeled oligonucleotide substrates at either the 5′- or the 3′-ends (Figure 4A). In the example shown in Figure 4B, the enzyme used was AbaSDFI and the recognition site is a hemi-5ghmC site in the top strand. In Figure 4B, left panel, the top-strand-labeled substrate (lane 2) and the bottom-strand-labeled substrate (lane 1) were separately digested by AbaSDFI. The digested products were resolved on a denaturing polyacrylamide gel to single base resolution for small fragments and compared with synthetic size markers for the bottom strand cleavage site (Figure 4B, left panel). It can be seen that AbaSDFI cleaves both the top strand and the bottom strand on the 3′-side of the recognition site, producing a large-labeled fragment from the top-strand-labeled substrate (lane 2) and a short-labeled fragment from the bottom-strand-labeled substrate (lane 1). The bottom strand cleavage products are of a size that allows discrimination at single-base resolution. The distance from the bottom strand cleavage site to the modified cytosine is predominately 10nt for this particular substrate, with minor cleavage plus or minus 1nt.
To precisely map the top strand cleavage site, α-33P-dATP was incorporated into the 3′-ends of both strands by the non-templated polymerization activity of Taq polymerase. The AbaSDFI-digested products were resolved by PAGE (Figure 4B, right panel) and compared with synthetic size markers (Figure 4A). From lane 3 in Figure 4B right panel, the top-strand cleavage site can be deduced to be 12 or 13nt away from the 3′-side of the modified cytosine for this particular substrate.
Overall, our results suggest that AbaSDFI generates a double-stranded cleavage on the 3′-side and away from the modified cytosine. The substrate tested in Figure 4 is hemi-modified. We have additionally tested fully modified sites and observed activity. For a fully modified site, AbaSDFI cleaves on both sides of the site, essentially carving a small fragment from the DNA, like enzymes in the MspJI family (8). The difference is that the cleavage distance from the recognition site for MspJI (N12/N16) is longer than that of PvuRts1I. Thus, whereas MspJI can produce 32-mer fragments from the fully modified sites, the length of such small fragments for PvuRts1I is ~24bp, which may provide only limited resolution power in the human genome if sequenced. Since the amount of 5hmC is usually very low, we have not visually observed the appearance of the 24-mer band from the digestion of different genomic DNAs using PvuRts1I-family enzymes. Another intriguing observation according to our experiment is that AbaSDFI generates a mixture of fragments with either 2- or 3-base 3′-overhang, which provides the basis for our later genomic mapping experiment (see below).
To determine whether PvuRts1I enzymes require other sequence elements in addition to modified cytosines, we initially cloned and sequenced the digested T4 wt or T4 gt genomic DNA fragments. The T4 wt or gt genomic DNA provides a complex, fully modified substrate, thus allowing identification of preference for sequence context. Briefly, T4 gt DNA was digested by PvuRts1I to completion. The digested DNA fragments were blunt-ended by T4 DNA polymerase and cloned into pUC19 for sequencing (see ‘Materials and Methods’ section for details). After mapping the inserts to the T4 genome, the sequences encompassing the ends of the inserts, which signify the cleavage sites of PvuRts1I, were analyzed further for compositional bias. The identified consensus recognition sites are shown in Supplementary Figure S3. Two 5hmCs on opposite strands are significantly enriched on either side of the cleavage site, with distance to the cleavage site either 12nt (top strand cut) or 9–10nt (bottom strand cut) (Supplementary Figure S3). The distances between the two 5hmCs are either 21nt (47% of all cases) or 22nt (45% of all cases), which reflects the variable cleavage positions of the enzyme. Importantly, there is little compositional bias in the positions adjacent to the two 5hmCs (Supplementary Figure S3), suggesting the possibility that PvuRts1I primarily recognizes the two 5hmCs. Our findings in this experiment are consistent with the study published by Szwagierczak et al. (13).
The symmetrical configuration of the recognition sites suggests a possible cleavage process in which two individual monomers bind, one to each modified site, then interact through dimerization leading to double-stranded cleavage. On the other hand, it is important to realize that all cytosines in the T4 wt or T4 gt DNA are in the form of 5ghmC or 5hmC. To provide further experimental support for the dimerization hypothesis, we tested whether there is a dependence on the modification status of the suitably placed cytosines on opposite strands. Figure 5 compares the activity of AbaSDFI on synthetic oligonucleotides containing designed sites with one constant 5ghmC and another base, either as 5ghmC (Figure 5B), 5mC (Figure 5C), unmodified C (Figure 5D) or no cytosine properly placed in the opposite strand (Figure 5E). As a control, Figure 5F shows that AbaSDFI does not act on non-modified DNA substrate. By comparing the cleavage efficiency in Figure 5, it can be concluded that the cleavage efficiency decreases ~25-fold when one of the two 5ghmCs in the recognition site changes to 5mC or unmodified C. However, the cleavage efficiency drops dramatically when there is no cytosine within the suitable distance range in the opposite strand (Figure 5A and E). Supplementary Figure S4 shows the activity of PvuRts1I on the same set of oligonucleotide substrates (without glucosylation). Similar to AbaSDFI, PvuRts1I prefers sites with two properly placed 5hmC (Supplementary Figure S4B). Sites with one 5mC or one C are digested with a lower efficiency (Supplementary Figure S4CD). The efficiency further drops on sites with only one 5hmC (Supplementary Figure S4E). Consistent with the results in Figure 3, PvuRts1I even digests unmodified DNA substrate in high concentration (Supplementary Figure S4F). These results further support the high substrate selectivity of AbaSDFI.
Overall, it appears that the PvuRts1I-family of enzymes recognizes two cytosines on opposite strands, which are separated by 21 or 22nt and at least one cytosine needs to be suitably modified as 5hmC or 5ghmC.
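To make the site geometry concrete, the sketch below scans a top-strand sequence for candidate sites: a C followed by a G (i.e. a C on the bottom strand) at the stated separation. It assumes that "separated by 21 or 22nt" counts the intervening nucleotides. The function name `find_candidate_sites` is hypothetical, and sequence alone cannot indicate whether either cytosine is actually modified.

```python
def find_candidate_sites(top_strand, spacings=(21, 22)):
    """List candidate sites as (c_index, g_index, spacing) tuples.

    A candidate is a top-strand C followed, after `spacing` intervening
    nucleotides, by a G, whose complement is a C on the bottom strand.
    Indices are 0-based positions on the top strand.
    """
    seq = top_strand.upper()
    hits = []
    for i, base in enumerate(seq):
        if base != "C":
            continue
        for n in spacings:
            j = i + n + 1  # position of the partner G on the top strand
            if j < len(seq) and seq[j] == "G":
                hits.append((i, j, n))
    return hits


site_21 = "C" + "A" * 21 + "G"
site_20 = "C" + "A" * 20 + "G"
print(find_candidate_sites(site_21))  # [(0, 22, 21)]
print(find_candidate_sites(site_20))  # []
```

The `spacings` argument is left adjustable because the cleavage distances vary between enzymes and substrates, so other separations within the reported range can be scanned in the same way.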
The property of introducing a double-stranded cleavage at a narrowly specified distance from 5hmC sites by the PvuRts1I family enzymes suggests a potential application for mapping genomic 5hmC sites. As a proof-of-principle experiment, we chose the enzyme AbaSDFI, which has the highest relative selectivity on 5ghmC versus 5mC or C. Briefly, we first glucosylated rat brain genomic DNA, using recombinant T4 β-glucosyltransferase (see ‘Materials and Methods’ section for details). AbaSDFI was then used to digest the glucosylated genomic DNA. The digested DNA was then ligated with a double-stranded adaptor with either a 2- or 3-base randomized 3′-overhang; the resulting libraries are referred to as the 2N or the 3N libraries hereafter. The ligated DNA was size-selected from 1 to 3kb on an agarose gel and PCR-amplified with an adaptor-specific primer. The amplified DNA was cloned into pUC19 for sequencing the inserts. The sequenced inserts were then aligned to the rat reference genome to identify the cleavage sites at both ends.
Figure 6 summarizes the analysis of the sequence fragments around the 122 identified cleavage sites in the 2N library. One of the advantages of using the 2N adaptor is that it preserves the 2-base 3′ extension on the digested DNA fragments to allow precise determination of the cleavage sites in both strands from the sequencing data. By considering the variable cleavage distance of the enzyme, i.e. either 12/10 (denoted as C12/10, where C is the cytosine being recognized) or 11/9 (denoted as C11/9) for the 2N library, the sequences around the cleavage sites can be grouped into a few different configurations (Figure 6A and B): for example, sequences with two symmetrical C12/10 cleavages, sequences with two symmetrical C11/9 cleavages, or sequences with one C12/10 and one C11/9 cleavage (Figure 6B). Figure 6B compares the frequency of occurrence of these sites in the cloned library with that expected by chance. For example, ~20% of the cleavage sites have two symmetrical C12/10 (category 2) and ~20% have two symmetrical C11/9 (category 1), which are significantly higher than the 3.5% expected by chance (Figure 6B). The same significant overrepresentation is seen for sites with one C12/10 and one C11/9 (category 3 in Figure 6B). While these configurations are significantly overrepresented in the cloned library, sites with a C on only one side of the cleavage site (category 5 in Figure 6B) and sites with no suitable C in the vicinity of the cleavage site (category 6 in Figure 6B) constitute only 11% and 3.3% of the cloned library, respectively, much lower than the 49% and 32% expected by chance. Thus, it appears that sites that are not recognized by AbaSDFI are significantly underrepresented in the cloned library. About 20% of the sequences contain ‘C12/10C11/9’ as one half site and a C12/10 or C11/9 as the other half site (category 4 in Figure 6B). For these, we could not determine which C in the ‘CC’ is recognized by the enzyme.
Nevertheless, they are significantly overrepresented as well and may be 5hmC-containing sites. Overall, a high percentage (86%) of all the cleavage sites appears to be true cleavage sites catalyzed by the enzyme.
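The 86% figure follows from the category frequencies quoted above, as the short calculation below illustrates. It uses only the percentages stated in the text for categories 5 and 6; the variable names are illustrative, and the depletion factors are simple observed-to-expected ratios rather than a statistical test.

```python
# Observed and chance-expected frequencies (percent of cleavage sites)
# for the two configurations not expected to be recognized by AbaSDFI.
observed = {"category_5_one_sided_C": 11.0, "category_6_no_C": 3.3}
expected = {"category_5_one_sided_C": 49.0, "category_6_no_C": 32.0}

# Everything outside categories 5 and 6 is counted as a likely true site.
unrecognized_pct = sum(observed.values())
recognized_pct = 100.0 - unrecognized_pct
print(f"likely true cleavage sites: {recognized_pct:.1f}%")

# Fold depletion of each unrecognized configuration relative to chance.
depletion = {name: expected[name] / observed[name] for name in observed}
for name, fold in depletion.items():
    print(f"{name}: {fold:.1f}-fold below chance")
```

The remainder, 85.7%, rounds to the 86% reported for categories 1 to 4 combined.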
Figure 6C shows the sequence logo representation of the sites in categories 1 (symmetrical C11/9), 2 (symmetrical C12/10) and 3 (1 C12/10 1 C11/9) in Figure 6B (23). Supplementary Figures S6 and S7 list the aligned sequences in categories 1 and 2. It is important to realize that due to the symmetrical nature of the AbaSDFI recognition sites, the motifs presented in Figure 6C do not distinguish which of the two cytosines around the cleavage sites is the real 5ghmC site. Based on the results in Figure 5, it is possible that both cytosines are 5hmC, or, one is 5hmC and the other is 5mC or unmodified cytosine. Indeed, this may be reflected by the appearance of the enriched CG dinucleotide encompassing the recognized cytosines in Figure 6C. On the one hand, it is expected since the 5hmCs most likely arise from the methylated CpG sites from the action of the TET enzymes in the brain DNA (5); on the other hand, the methylated CG sites may also constitute half of the recognition sites for AbaSDFI. Interestingly, the flanking position on the 5′-side of the recognized C shows an overrepresentation of A or T in the symmetrical C11/9 cleavages, whereas it is absent in the symmetrical C12/10 cleavages (Figure 6C). This suggests that the cleavage distance may be affected by the nucleotide flanking the recognized cytosine. Further experiments are needed to test this hypothesis. In addition, Supplementary Figure S5 summarizes the analysis of 188 sequenced cleavage sites in the 3N library from which similar observations can be made.
In this article, we compared the in vitro biochemical properties of a few members of the PvuRts1I family. The first example of this family, PvuRts1I, was known to restrict T-even bacteriophages with 5hmC or 5ghmC in their genomic DNA (3). Using a relatively mild purification procedure, we were able to obtain pure enzymes in highly active form; all exhibit DNA modification-dependent endonuclease activity with similar cleavage properties. In addition, our results suggest that these enzymes differ from each other in their relative selectivity toward various forms of modified cytosine. The relative selectivity provides a quantitative index of their ‘fidelity’ toward each desired form of modified cytosine. From the application perspective, this is important because of the extremely low abundance of 5hmC compared with 5mC and C in the genome. We find that AbaSDFI, a homolog of PvuRts1I, has the highest (8000:1) relative selectivity between 5ghmC and 5mC. Furthermore, it does not have any detectable activity on unmodified C. These properties allow reliable mapping of genomic 5hmC sites in large mammalian genomes. This high selectivity may also reflect the enzyme's inherent ability to distinguish the major structural difference between 5ghmC and 5mC.
The PvuRts1I family differs from many other well-studied restriction endonucleases in several aspects. First, its members display no identifiable previously known motifs for metal-ion chelation or catalysis, suggesting a novel enzymatic DNA cleavage chemistry. Multiple sequence alignment reveals that the N-terminal region is more conserved than the C-terminal region, both in amino acid sequence and in secondary structure elements (Figure 1A). There are strings of conserved positions in both the N-terminal and the C-terminal regions (Figure 1A). We surmise that a cluster of conserved acidic residues in the N-terminal region may be responsible for chelating Mg2+ and could form part of the active center (Figure 1B). Furthermore, it is tempting to assume that the nearby conserved histidines may act as the general base in the cleavage process (16). If true, this implies a domain organization with the N-terminus responsible for cleavage activity and the C-terminus responsible for binding. This is different from the domain organization of other type IIS restriction endonucleases such as FokI (24) or MspJI (8), which have an N-terminal domain for binding and a C-terminal domain for cleavage, but is similar to that of MmeI (25). Site-directed mutagenesis and structure determination will be needed for further elucidation. Second, the requirement of two cytosines within a defined distance range on separate DNA strands suggests a likely dimerization step in the cleavage process. Intriguingly, our results suggest that as long as one binding site contains the recognized modified cytosine, e.g. 5hmC or 5ghmC, the other site can be 5mC, or even unmodified cytosine, with a moderate decrease in cleavage efficiency (Figure 5). It is possible that PvuRts1I-like enzymes recognize not only the 5-modification on the cytosine, but also other structural elements of cytosine; this may explain why two cytosines are required for cleavage. Further biochemical studies are needed to clarify the role of the second cytosine.
Theoretically, the coverage of 5hmC sites using the PvuRts1I-like enzymes could be quite high because of their flexible binding-site requirement. Based on our data, each 5hmC site should elicit enzymatic cleavage as long as there is another cytosine, modified or not, on its 3′-side in the opposite strand and 20–22nt away. This translates to a theoretical coverage of ~58% (1−0.75*0.75*0.75, i.e. the probability that at least one of the three candidate positions is a cytosine), assuming there is no severe bias of base composition in the genome. In our proof-of-principle experiment in Figure 6, all the mapped sites contain at least one 5hmC. Although there is still ambiguity in the results as to which cytosine is the real 5hmC, this provides a much higher resolution than the hMeDIP-like approaches and may offer better insights into the existence of 5hmC in the genome. Future experiments will apply these enzymes to the genomic mapping of different cell types using the latest high-throughput sequencing technologies.
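The coverage estimate above is a complement-rule calculation, which can be sketched as follows. This is a minimal illustration, assuming three candidate positions (20, 21 and 22 nt away) and an independent probability of 0.25 that each one is a cytosine; the function name `theoretical_coverage` and its parameters are hypothetical.

```python
def theoretical_coverage(n_positions=3, p_c=0.25):
    """Probability that at least one of n_positions candidate sites is a C.

    Assumes each position is a cytosine independently with probability p_c.
    """
    if not 0.0 <= p_c <= 1.0:
        raise ValueError("p_c must be a probability between 0 and 1")
    if n_positions < 0:
        raise ValueError("n_positions must be non-negative")
    # P(at least one C) = 1 - P(no C at any candidate position)
    return 1.0 - (1.0 - p_c) ** n_positions


coverage = theoretical_coverage()
print(f"theoretical coverage: {coverage:.1%}")
```

With the default arguments this evaluates 1 − 0.75³, the ~58% figure quoted in the text. Changing `p_c` gives a rough sense of how a biased base composition would shift the estimate.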
The in vitro biochemical properties of the PvuRts1I enzymes dictate the experimental approaches to mapping genomic 5hmC sites along with the computational interpretation of the sequencing data. Because these enzymes can generate a mixture of ends from a single recognition site, we used double-stranded DNA adaptors with randomized 2- or 3-base 3′-overhangs to separate the population. This provides the advantage of precisely locating the cleavage sites in both DNA strands in the sequencing data, which in turn reduces the uncertainty in searching for the nearby recognition sites. Given the variable cleavage distance property of the PvuRts1I enzymes, this may be a crucial step in library construction.
In summary, the PvuRts1I family of enzymes defines a unique group of DNA modification-dependent restriction endonucleases. Combined with high-throughput sequencing platforms, these enzymes should make it possible to improve the resolution of current hydroxymethylomes in mammalian cells.
Supplementary Data are available at NAR Online.
Funding for open access charge: National Institutes of General Medical Sciences, SBIR (grant #4R44GM095209-02 to Y.Z., in part); New England Biolabs Inc.
Conflict of interest statement. None declared.
The authors would like to thank Drs Elisabeth Raleigh, William Jack and Zhiyi Sun for critical reading and suggestions on the article. We thank in-house organic synthesis for synthesizing 5hmC oligos. The rat brain DNA was a kind gift from Dr Alan Herbert (Boston University School of Medicine).