To characterize the abundant non-CG methylation in the H1 genome, we compared the average density of methylation relative to the underlying density of all potential sites of methylation in each context (henceforth referred to as the relative methylation density) throughout various genomic features (Supplementary Fig. 5). We observed a correlation between the density of mCG and the distance from the transcriptional start site (TSS), with mCG density increasing through the 5’ UTR and reaching, in exons, introns and the 3’ UTR, a level similar to that 2 kb upstream of the TSS. We generally observed lower relative densities of methylation at CG islands and TSSs; however, a subset of these regions did not display this depletion (Supplementary Fig. 6). mCHG and mCHH methylation densities also decreased significantly toward the TSS and returned to the same level as 2 kb upstream at the end of the 5’ UTR; however, within exons, introns and 3’ UTRs the non-CG methylation densities were twice as high. Intriguingly, the mCHH density was approximately 15–20% higher in exons than within introns and the 3’ UTR. To identify links between gene activity and non-CG methylation level within the gene body, we performed strand-specific RNA-Seq¹⁵ and observed a positive correlation between gene expression and mCHG (R = 0.60) or mCHH (R = 0.58) density, with highly expressed genes containing 3-fold higher non-CG methylation density than non-expressed genes (Supplementary Fig. 7a). However, no correlation was observed between CG methylation density and gene expression in the H1 cells.
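The relative methylation density defined in this paragraph can be illustrated with a minimal sketch. It assumes a single forward-strand sequence and a set of 0-based methylated positions; the function names `cytosine_context` and `relative_methylation_density` are hypothetical and are not taken from the analysis pipeline used in this study.

```python
from collections import Counter


def cytosine_context(seq, i):
    """Return 'CG', 'CHG' or 'CHH' for a forward-strand cytosine at index i.

    Returns None if seq[i] is not a cytosine or the context cannot be
    resolved (sequence end, or an ambiguous base N).
    """
    if seq[i] != "C":
        return None
    nxt = seq[i + 1 : i + 3]
    if len(nxt) >= 1 and nxt[0] == "G":
        return "CG"
    if len(nxt) < 2 or "N" in nxt:
        return None
    return "CHG" if nxt[1] == "G" else "CHH"


def relative_methylation_density(seq, methylated_positions):
    """Methylated sites divided by all potential sites, per context.

    Assumption: only the forward strand of one feature is considered.
    """
    potential = Counter()
    methylated = Counter()
    for i in range(len(seq)):
        ctx = cytosine_context(seq, i)
        if ctx is None:
            continue
        potential[ctx] += 1
        if i in methylated_positions:
            methylated[ctx] += 1
    return {ctx: methylated[ctx] / potential[ctx] for ctx in potential}


density = relative_methylation_density("CGCAGCTTCG", {0, 2})
```

Normalizing by the number of potential sites in each context keeps features with very different CG, CHG and CHH content comparable, which is the purpose of the relative measure.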
Non-CG DNA methylation in H1 embryonic stem cells
We identified 447 and 226 genes that were proximal to genomic regions highly enriched for mCHG and mCHH, respectively, with 180 genes in common. An example of a gene showing such non-CG methylation enrichment is Splicing Factor 1. Analysis of gene ontology terms for each set revealed significant enrichment for genes involved in RNA processing, RNA splicing, and RNA metabolic processes (P ≤ 2 × 10⁻¹¹, Supplementary Fig. 7b). Unexpectedly, we found a significant enrichment of non-CG methylation on the anti-sense strand of gene bodies, for both the mCHG and mCHH enriched sets of genes (P < 0.1 and P < 0.001, respectively). The anti-sense strand serves as the template for RNA polymerization, and further investigation will be required to determine whether this non-CG methylation strand bias has functional repercussions. We also observed that genes in H1 had significantly more RNA originating from introns than in IMR90, relative to the total number of sequenced reads in each sample, and this discrepancy in intronic read abundance was significantly enhanced in the mCHG and mCHH enriched genes (P < 0.001). The higher abundance of intronic reads was associated with higher non-CG methylation within gene bodies, rather than differential non-CG methylation of exons versus introns.
In the Arabidopsis genome, the methylation state of a cytosine in the CG and CHG contexts is highly correlated with the methylation of the cytosine on the opposite strand within the symmetrical site¹⁵,¹⁶. While we observed that 99% of mCG sites from the human cell lines were methylated on both strands, mCHG was surprisingly highly asymmetrical, with 98% of mCHG sites being methylated on only one strand. This raises an interesting question as to how these sites of DNA methylation are consistently methylated in a considerable fraction of the genomes without two hemi-methylated CHG sites as templates for faithful propagation of the methylation state. It is not yet known whether continual, but indiscriminate, de novo methyltransferase activity preferentially methylates particular CHG sites after replication, or whether a persistent targeting signal is present that drives CHG methylation.
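The strand-symmetry comparison can be sketched as follows. This is an illustrative example only: the function name `symmetric_fraction` and its inputs (sets of 0-based forward-coordinate positions of methylcytosines on each strand) are assumptions, and CHG sites are restricted here to CAG/CTG, the cases in which the opposite-strand cytosine is itself in a CHG context.

```python
def symmetric_fraction(seq, fwd_meth, rev_meth, context):
    """Fraction of methylated symmetrical sites methylated on both strands.

    seq      : forward-strand sequence
    fwd_meth : forward-strand methylcytosine positions (0-based)
    rev_meth : reverse-strand methylcytosine positions, given in forward
               coordinates (i.e. the position of the paired G)
    context  : 'CG' or 'CHG'
    Sites with no methylation on either strand are ignored.
    Returns None if no methylated site of that context is found.
    """
    offset = {"CG": 1, "CHG": 2}[context]
    both = one = 0
    for i in range(len(seq) - offset):
        if seq[i] != "C" or seq[i + offset] != "G":
            continue
        # Simplifying assumption: only CAG/CTG are treated as symmetrical
        # CHG sites, because in CCG the opposite-strand cytosine lies in a
        # CG context rather than a CHG context.
        if context == "CHG" and seq[i + 1] not in "AT":
            continue
        f = i in fwd_meth
        r = (i + offset) in rev_meth
        if f and r:
            both += 1
        elif f or r:
            one += 1
    total = both + one
    return both / total if total else None


seq = "ACAGTCGCTG"
cg_sym = symmetric_fraction(seq, {1, 5}, {6, 9}, "CG")
chg_sym = symmetric_fraction(seq, {1, 5}, {6, 9}, "CHG")
```

In this toy input the single CG site is methylated on both strands, whereas each of the two CHG sites is methylated on one strand only, mirroring the direction of the contrast reported for mCG and mCHG.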
We analyzed the genome sequence proximal to sites of non-CG methylation to determine whether enrichment of particular local sequences was evident, as previously reported in Arabidopsis. Whereas no local sequence enrichment was observed for mCG sites, a preference for the TA dinucleotide upstream of non-CG methylation was observed (Supplementary Fig. 8). Furthermore, the base following a non-CG methylcytosine was most commonly an A, with a T also observed relatively frequently, a sequence preference observed in previous in vitro studies of the mammalian DNMT3 methyltransferases²¹,²².
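A local sequence preference of this kind can be examined by tallying the bases at fixed offsets around each methylcytosine. The sketch below is a simplified illustration on a single forward-strand sequence; the function name `flanking_base_counts` and the default `flank` width are hypothetical choices and do not describe the method used in this study.

```python
from collections import Counter


def flanking_base_counts(seq, methylated_positions, flank=2):
    """Count bases at each offset from -flank to +flank around methylcytosines.

    Returns a dict mapping offset (excluding 0) to a Counter of bases.
    Positions too close to either end of the sequence are skipped.
    Assumption: forward strand only, 0-based positions.
    """
    counts = {off: Counter() for off in range(-flank, flank + 1) if off != 0}
    for pos in sorted(methylated_positions):
        if pos - flank < 0 or pos + flank >= len(seq):
            continue
        for off in counts:
            counts[off][seq[pos + off]] += 1
    return counts


# Toy sequence with methylcytosines at indices 2 and 7, each preceded by TA
counts = flanking_base_counts("TACATTACAG", {2, 7})
```

Comparing such counts with the base composition around unmethylated cytosines in the same context would be needed before calling any offset enriched.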
To determine whether there was any preference for the distance between adjacent sites of DNA methylation in the human genome, we analyzed the relative distance between methylcytosines in each context within 50 nucleotides in introns. We focused on introns because these are genomic regions enriched in non-CG methylation but, unlike exons, are not constrained by protein-coding selective pressures. Analyses for random genomic sequences and exons are presented in Supplementary Fig. 9, together with mCG spacing patterns. For methylcytosines in all contexts, a periodicity of 8–10 bases was evident (Supplementary Fig. 9), but interestingly a strong tendency was observed for two pairs of 8-base separated mCHG sites spaced with 13 bases between them. An 8–10 base periodicity was also evident for mCHH sites, corresponding to a single turn of the DNA helix, as previously observed in Arabidopsis, indicating that the molecular mechanisms governing de novo methylation at CHH sites may be common between the plant and animal kingdoms. A structural study of the mammalian de novo methyltransferase DNMT3A and its partner protein DNMT3L found that two copies of each form a heterotetramer that contains two active sites separated by the length of 8–10 nucleotides in a DNA helix²⁶,²⁷. The consistent 8–10 nucleotide spacing we observed in the human genome suggests that DNMT3A may be responsible for catalysing the methylation at non-CG sites. Notably, the mCHG and mCHH relative spacing patterns were distinct, suggesting that this sub-categorization of the non-CG methylation is appropriate, and that distinct pathways may be responsible for the deposition of mCHG and mCHH in the human genome.
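The spacing analysis can be illustrated by a histogram of pairwise distances between methylcytosines of one context lying within 50 nucleotides of each other. The following is a minimal sketch under stated assumptions: positions are 0-based coordinates on one strand within a single intron, and the function name `spacing_histogram` is hypothetical.

```python
from collections import Counter


def spacing_histogram(positions, max_dist=50):
    """Histogram of distances between all pairs of sites <= max_dist apart.

    positions : iterable of methylcytosine coordinates in one context
    Returns a Counter mapping distance (in bases) to number of pairs.
    """
    pos = sorted(set(positions))
    hist = Counter()
    for idx, a in enumerate(pos):
        for b in pos[idx + 1 :]:
            d = b - a
            if d > max_dist:
                break  # positions are sorted, so later sites are farther
            hist[d] += 1
    return hist


# Toy example: two 8-base separated pairs with 13 bases between them
hist = spacing_histogram([10, 18, 31, 39, 100])
```

A periodicity would appear as recurring peaks in such a histogram once it is aggregated over many introns and compared against a background from random genomic sequence.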