|Home | About | Journals | Submit | Contact Us | Français|
Cytosine DNA methylation is important in regulating gene expression and in silencing transposons and other repetitive sequences 1, 2. Recent genomic studies in Arabidopsis have revealed that many endogenous genes are methylated either within their promoters or within their transcribed regions, and that gene methylation is highly correlated with transcription levels 3-5. However, plants have different types of methylation controlled by different genetic pathways, and detailed information on the methylation status of each cytosine in any given genome is lacking. To this end, we generated a map at single base pair resolution of methylated cytosines for Arabidopsis, by combining bisulfite treatment of genomic DNA with ultra-high-throughput sequencing using the Illumina 1G Genome Analyzer and Solexa sequencing technology 6. This approach, termed BS-Seq, unlike previous microarray-based methods, allows one to sensitively measure cytosine methylation on a genome-wide scale within specific sequence contexts. We describe methylation on previously inaccessible components of the genome along with an analysis of the DNA methylation sequence composition and distribution. We also describe the effect of various DNA methylation mutants on genome-wide methylation patterns, and demonstrate that our newly developed library construction and computational methods can be applied to large genomes such as mouse.
To generate a DNA methylation map at one nucleotide resolution across the genome, we adapted the Illumina 1G Genome Analyzer using Solexa sequencing technology (Illumina GA) for shotgun sequencing of bisulfite-treated Arabidopsis genomic DNA. Sodium bisulfite converts unmethylated cytosines to uracils, but 5-methylcytosines remain unconverted. Hence, after polymerase chain reaction amplification, unmethylated cytosines appear as thymines and methylated cytosines appear as cytosines 7. We created genomic DNA libraries after bisulfite conversion and produced ~3.8 billion nucleotides of high quality sequence which successfully mapped to the genome. We subsequently used several filters to ensure accuracy, including only retaining reads mapping to sequences that are unique in the genome after bisulfite conversion from every possible methylation pattern (see Supplementary Methods and Supplementary Table 1). This resulted in a conservative dataset of ~2.6 billion nucleotides mapping to unique genomic locations with very high confidence, covering ~93% of all cytosines which could theoretically be covered (~92% of the ~43 million cytosines in the ~120 Mbp Arabidopsis genome can be covered uniquely with 31 nucleotide sequences). This represents ~20-fold average coverage, similar to typical coverage in a traditional bisulfite sequencing experiment for a single locus.
Methylation in Arabidopsis exists in three sequence contexts, CG, CHG (where H = A, C, or T), and asymmetric CHH 1. We observed overall genome-wide levels of 24% CG, 6.7% CHG, and 1.7% CHH methylation (Supplementary Fig. 1a). Most CGs were either unmethylated or were highly methylated (80–100%), whereas CHH sites were either unmethylated or methylated at ~10%. CHG sites showed a more uniform distribution between 20–100% (Supplementary Fig. 1b-d). These differences underscore the fact that each type of methylation is under distinct genetic control 1. Our reads also contained 504-fold average coverage of 99.97% of theoretically-coverable cytosines in the unmethylated chloroplast genome 3, 8, giving false positive rates of 0.29% (CG), 0.29% (CHG), and 0.25% (CHH) (Supplementary Fig. 1a, Supplementary Fig. 2). The BS-Seq data were highly consistent with traditional bisulfite sequencing data from individual methylated or unmethylated loci 3 (Supplementary Table 2, Supplementary Fig. 3, and below).
While CG, CHG, and CHH methylation were highly correlated, showing enrichment in repeat-rich pericentromeric regions (Fig. 1a), a striking deviation was found within gene bodies, which contained almost exclusively CG methylation (Fig. 1b). This is consistent with previous studies 3, 4, 9 and with a depletion of short interfering RNAs (siRNAs) in the bodies of genes (Fig. 1b). Conversely, genomic regions corresponding to siRNAs were highly correlated with CG, CHG, and CHH methylation, consistent with the known molecular nature of RNA-directed DNA methylation (Fig. 1c) 1. For methylation of all types there was a strong positive correlation with the length of the methylated sequence (Fig. 1d).
BS-Seq appears to be more sensitive than previously-employed microarray-based methods 3-5. For example, we found a cluster of 5 methylated CG sites in a 34 base pair region and a lone methylated CG site, both within the FWA locus, that were not detected by previous methods (Supplementary Fig. 4). We also found CG methylation within genes previously classified as unmethylated 3, 4 (Supplementary Fig. 5). Finally, in analyzing genes whose expression is de-repressed in DNA methyltransferase mutants, BS-Seq was more accurate in identifying genes with promoter methylation that was otherwise variably detected in previous microarray studies (Supplementary Fig. 6).
BS-Seq can be used to analyze repetitive sequences that are difficult to study with microarrays as they may exceed the dynamic detection range or cross-hybridize. For example, we mapped methylation across the highly repetitive rDNA loci and found high levels of CG, CHG, and CHH methylation, including on the minimal promoter and upstream Sal1 repeats (Supplementary Fig. 7). Further, we detected methylation in telomeric repeat sequences (CCCTAAA)n which have not been previously shown to be methylated (Fig. 1e). Interestingly, the vast majority of methylation occurred at the cytosine in the third position (Fig. 1e).
The single base resolution of BS-Seq allows determination of the precise boundaries between methylated and unmethylated regions. For example, we found that the boundary between tandem repeats and flanking DNA showed a sharp drop in methylation, but DNA methylation extended from inverted repeats into flanking DNA, showing a more gradual reduction (Fig. 1b). This apparent “spreading” of methylation was not correlated with siRNA spreading because siRNA abundance levels drop sharply at the flanks of both tandem and inverted repeats (Fig. 1b).
We analyzed the relationship between sequence context and preference of methylation. We calculated the percent methylation of all possible 7-mer sequences in which the methylated cytosine was either in the fifth position (allowing an analysis of four nucleotides upstream of CG, CHG, and CHH methylation; Fig. 2, Supplementary Table 3) or in the first position (allowing analysis of 6 nucleotides following the methylated cytosine; Supplementary Fig. 8, Supplementary Table 4). To ensure that sequence preferences were not simply 7-mers enriched in particular components of the genome, we analyzed either all of chromosome 1, only sequences previously defined to be methylated by methyl-DNA immunoprecipitation, or a group of 9,507 body-methylated genes containing mostly CG methylation 3 (Fig. 2, Supplementary Fig. 8 and 9). We observed a surprisingly high level of sequence context specificity. The highest and lowest methylated 7-mers showed a 13-fold difference for CG-methylation, an 11-fold difference for CHG methylation, and > 900-fold difference for CHH-methylation (Supplementary Table 3). Sequences with the lowest CG methylation were highly enriched for the sequence ACGT (Fig. 2, Supplementary Fig. 9). Poorly methylated CHG sites were depleted of upstream cytosines but tended to contain cytosine following the methylated C. This trend is consistent with nearest-neighbour analysis of wheat germ DNA that found CAG and CTG sites methylated at a higher level than CCG sites 10. Highly methylated CHH sequences had a very specific configuration, with a tendency for cytosines and CG dinucleotides to be present upstream (Supplementary Table 3) and the sequence TA following the methylated cytosine. In contrast, poorly methylated CHH sequences always contained a cytosine following the methylated cytosine, and frequently contained a cytosine but always lacked an adenine two nucleotides downstream (Fig. 2, Supplementary Fig. 8). These results are consistent with data from individual plant genes showing that cytosines preceding a cytosine are undermethylated while those following a cytosine are more heavily methylated 11-13, and with asymmetric methylation in mammalian genomes that is found at CT and CA sequences more frequently than CC sequences 14. It is also of interest that Arabidopsis telomere sequences (CCCTAAA)n are composed of nearly optimal asymmetric target units, possibly explaining the high methylation of the third cytosines (Fig. 1e). While the molecular basis for these trends is unknown, the results suggest that DNA methyltransferases show strong sequence preferences beyond the CG, CHG, and CHH contexts. Finally, we found that regions with higher concentrations of CG dinucleotides were more heavily methylated at CG sites (Supplementary Fig. 10). Interestingly, this is different from mammalian genomes that show the opposite trend: CGs are depleted in methylated regions and at a higher density in unmethylated CpG islands.
We used autocorrelation analysis to examine the correlation between methylation in different sequence contexts and methylation at adjacent residues. We observed significant correlation between methylated cytosines for distances up to 5,000 nucleotides or more, a likely reflection of regional foci of methylation throughout the genome and of large blocks of pericentromeric heterochromatin (Supplementary Fig. 11, Supplementary Table 5). We also found a high correlation of CHG and CHH methylation within several nucleotides downstream of methylated CG sites, and a tendency for CHH methylation four nucleotides downstream of methylation at CHG sites (Fig. 2, Supplementary Fig. 12, Supplementary Table 5). These data suggest complex interactions between the different types of methylation.
We analyzed the propensity for full methylation of the strand-symmetrical CG and partially symmetrical CHG sequences. As expected, CG methylation on one strand was highly correlated with CG methylation on the opposing strand. We also saw a high correlation for CHG methylation of the two strands, showing that, like CG methylation, CHG sites show a strong tendency for symmetrical methylation (Supplementary Fig. 12). Unexpectedly, we observed a correlation between CHH methylation on one strand, and methylation at the cytosine three nucleotides downstream and on the opposite strand (Supplementary Fig. 12, Supplementary Table 5). Since the sequence of such sites is CHHG, this shows that “asymmetric” methylation shows a propensity for symmetrical methylation at these sites, even though methylation on CHHG sites is not particularly prominent in the genome (Supplementary Fig. 8, Supplementary Table 4).
Autocorrelation analysis also revealed a striking periodicity of 10 nucleotides (the length of one helical DNA turn) for CHH methylation (Fig. 3a, b). We confirmed this period in data from the whole genome and from regions previously defined to be methylated, and confirmed that the periodicity was not due to our computational filtering of the data (Supplementary Fig. 13). We observed this period both when looking at average methylation of cytosines in the genome (Fig. 3a, b, Supplementary Fig. 13) and when individual reads are directly examined (Supplementary Fig. 14). Mammalian Dnmt3a was recently shown to act as a tetramer with Dnmt3L, and two active sites methylate two CG sequences spaced ~8–10 nucleotides apart 15. Since DRM2 is the main enzyme controlling asymmetric methylation and is the ortholog of Dnmt3a 16, these data suggest that the mechanism of action of these enzymes may be conserved between plants and mammals.
Autocorrelation also showed a period of 167 nucleotides (Fig. 3c, Supplementary Fig. 15), which is similar to, but slightly shorter than, estimates of the average spacing of nucleosomes in plant chromatin 17-19. One explanation for this period is that nucleosomes or particular histone modifications might dictate access to the DNA by methyltransferase proteins. Furthermore, the slightly shorter length of 167 nucleotides relative to most estimates of plant nucleosome repeat length (175-185 nt) 17-19 suggests that DNA methylated chromatin may be more compact because of shorter linker regions or depletion in linker histones 20.
We utilized BS-Seq to study the genome-wide effects of a variety of methyltransferase mutants on DNA methylation (Fig. 4). The MET1, CMT3, and DRM1/DRM2 DNA methyltransferase enzymes are mostly responsible for CG, CHG, and CHH methylation, respectively, though at many loci CHG and CHH methylation is redundantly controlled by CMT3 and DRM1/DRM2 1, 12. We sequenced and mapped ~90 million nucleotides of BS-Seq data from each of several combinations of DNA methyltransferase mutants (Supplementary Table 1) including met1 single mutants, cmt3 single mutants, drm1 drm2 double mutants, met1 cmt3 double mutants, met1 drm1 drm2 triple mutants, and drm1 drm2 cmt3 triple mutants 21. We then analyzed the effect of these mutants on global methylation, the methylation on genes and chromosomes, and the methylation on rDNA and telomeres (Supplementary Table 6, Fig. 1e, Fig. 4, Supplementary Fig. 7, 16). The met1 single mutant, or any mutant combination containing met1, essentially eliminated CG methylation throughout the genome. For instance, gene body methylation, which is almost exclusively CG, was eliminated in all met1-containing strains (Fig. 4a). Surprisingly, in the met1 drm1 drm2 triple mutant, we observed a marked hypermethylation of CHG sites in the bodies of genes (Fig. 4a). This methylation was skewed toward the 3’ end and in this way assumed a pattern of methylation similar to the missing CG methylation. Although previous studies have suggested that the drm1 drm2 cmt3 triple mutant eliminates CHG and CHH methylation 12, BS-Seq data shows residual methylation (Supplementary Table 6), particularly in pericentromeric heterochromatin (Fig. 4b), suggesting that another enzyme is involved 22. Furthermore, the met1 cmt3 double mutant was equally effective in reducing CHH methylation as was drm1 drm2 cmt3 (Supplementary Table 6), suggesting that CHH methylation depends in part on the presence of CG and CHG methylation. These compensating behaviours suggest that the different DNA methyltransferases act redundantly, and help explain the viability of these mutant combinations whereas the met1 cmt3 drm1 drm2 quadruple mutant causes embryonic lethality 21.
The BS-Seq procedure described here should be generally useful in other organisms. For example we applied BS-Seq to quantify the overall genomic methylation difference between wild type mouse embryonic stem cells and cells carrying a mutation in the UHRF1 gene recently shown to control maintenance of CG methylation 23, 24. By analyzing ~60 million nucleotides of shotgun sequencing data from each, we found that Uhrf1−/− cells contained only 25% of the CpG methylation level of wild type (Fig. 4c). Furthermore, to demonstrate that the complete analysis pipeline used for Arabidopsis is applicable to larger genomes, we produced a library from mouse germ cell tissue and generated ~46 million nucleotides of high quality mapped BS-seq data. Approximately 66% of the reads mapped uniquely, a level only slightly lower than that of Arabidopsis (Supplementary Table 1), suggesting that it is practical to apply BS-Seq to entire mammalian genomes.
In summary, BS-Seq analysis of wild type and methyltransferase mutants has allowed a more detailed characterization of the Arabidopsis methylome. In addition, the computational approaches developed in this study should be generally useful for other short read sequencing genomics approaches. An installation of the UCSC browser allowing community access to detailed methylation patterns of individual genes and a source code distribution of the computational methods are available at http://epigenomics.mcdb.ucla.edu/BS-Seq/.
Bisulfite treatment of DNA was performed as previously described 25, except that adaptor sequences and PCR conditions were modified and optimized for this study. Library generation and ultra-high-throughput sequencing were carried out according to manufacturer instructions (Illumina).
Raw data from Illumina GA was processed using the initial stages of the Solexa software pipeline (Illumina) into short reads, except that per-lane per-cycle multidimensional Gaussian mixture models (GMMs) were developed to optimize base call A–vs.–C–vs.–G–vs.–T probability distribution accuracies at each sequenced base compared to the Solexa software pipeline’s _ prb files. Sequenced reads were mapped to reference genomes fully using per-base probabilities from the GMMs using highly-optimized novel C++ tools. Sequences that mapped to more than one position with similar scores (within 1% of the maximum likelihood mapping) were removed in order to retain only reads that map uniquely. To eliminate unconverted bisulfite reads, a filter discarded reads with three or more consecutive methylated cytosines when each of these was in a CHH context, resulting in a loss of ~0.23% of reads. This filter was effective and with only minimal loss of true CHH methylation (Supplementary Table 1, Supplementary Fig. 13, 17, and 18).
Traditional bisulfite sequencing was employed to validate BS-Seq results at select loci (Supplementary Table 2, Supplementary Fig. 4, 6, 17). The PCR primers used in validation are listed in Supplementary Table 7.
We thank Yana Bernatavichute for nuclear DNA isolation protocols, Amander Clarke for providing ES cell DNA, Angelique Girard and Greg Hannon for providing mouse germ cell DNA, Jonathan Hetzel for technical assistance, and Carey Fey Li for assistance with rDNA annotation. This work was supported in part by a grant from the NSF Plant Genome Research Program (award number 0701745) and some aspects of the work were performed in the UCLA DNA Microarray Facility. S.F. is a Howard Hughes Medical Institute Fellow of the Life Science Research Foundation. X.Z. was supported by a fellowship from the Jonsson Cancer Center Foundation. S.E.J. is an investigator of the Howard Hughes Medical Institute.