Methylation of cytosines is a pervasive feature of eukaryotic genomes and an important epigenetic layer that is fundamental for cellular differentiation processes and control of transcriptional potential. DNA methylation patterns can be inherited and influenced by the environment, diet and aging, and disrupted in diseases.
Complete DNA methylomes for several organisms are now available, helping clarify the evolutionary story of this epigenetic mark and its distribution in key genomic elements. Nonetheless, a complete understanding of its role, the mechanisms responsible for its establishment and maintenance, and its cross talk with other components of cellular machinery remains elusive.
Methylation at the carbon 5 position of cytosines (5meC) constitutes an important epigenetic layer that contributes to the definition of transcriptional and regulatory potential of genomic DNA [1, 2]. DNA methylation is a typical characteristic of most eukaryotes and some of its features are conserved in many species. While cytosine methylation is a stable modification of the genomic DNA that can be inherited, it also changes dynamically during the lifespan of certain cells and tissues of an organism and is susceptible to diet and other environmental influences. Indeed, it is essential for the correct onset of differentiation processes and for defining tissue-specific transcriptional profiles, and can be dysregulated in disease states. Recently, methods have been developed for the genome-wide detection of 5meC, and complete maps for several organisms are available, including humans. Currently available high-throughput data combined with results from classic genetic experiments are beginning to clarify the roles of DNA methylation in a variety of processes. Nonetheless, the mechanisms for the establishment and maintenance of epigenetic patterns and the complex interplay of DNA methylation with other epigenetic and regulatory layers are not well understood. The correct reprogramming of DNA methylation is critical when considering regenerative medicine for the generation of induced Pluripotent Stem Cells (iPSCs) with full differentiation potential. However, base-resolution iPSC DNA methylomes are not yet available. Several studies have shown that methylation profiles of iPSCs are aberrant with respect to those of Embryonic Stem Cells (ESCs) and these differences may decrease or restrict the differentiation potential [4, 5]. Additionally, aberrant methylation in iPSCs was observed to be inherited from the progenitor cell type.
In general, additional genome-wide high-resolution data from both healthy and diseased cells will be required to shed light on the dynamic variation of these marks, their role in healthy cells, and their relevance in diseases.
Recently, a plethora of new methods have been developed for the determination of genome-wide DNA methylation patterns. These developments are beginning to contribute to our comprehension of the role of this epigenetic mark in both development and disease states. Traditionally, DNA methylation could be determined only for specific loci through Sanger sequencing of bisulfite-converted and PCR-amplified genomic DNA fragments. While sodium bisulfite has no effect on 5meC, it specifically converts cytosine to uracil, and during PCR amplification of bisulfite-treated DNA, uracil is replaced with thymine.
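The conversion chemistry described above can be illustrated with a minimal in-silico sketch. It assumes complete conversion, a single strand and no sequencing errors; the function names and the toy sequence are hypothetical and are not taken from any published pipeline.

```python
def bisulfite_convert(seq, methylated_positions):
    """Return the sequence read after bisulfite treatment and PCR.

    seq: DNA string for one strand (A, C, G, T).
    methylated_positions: set of 0-based indices of methylated cytosines.
    Assumes 100% conversion efficiency (an idealisation).
    """
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            converted.append("T")   # unmethylated C -> U, read as T after PCR
        else:
            converted.append(base)  # 5meC is protected and still reads as C
    return "".join(converted)


def call_methylation(reference, read):
    """Infer the methylation state of each reference cytosine from a read.

    Returns {position: True/False}; positions where the read shows neither
    C nor T are skipped (for example, sequencing errors).
    """
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference.upper(), read.upper())):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[i] = True    # cytosine retained: methylated
        elif read_base == "T":
            calls[i] = False   # cytosine converted: unmethylated
    return calls


reference = "ACGTCCGA"
read = bisulfite_convert(reference, methylated_positions={1})
print(read)
print(call_methylation(reference, read))
```

In this toy case only the cytosine at index 1 is protected, so it is the only position that still reads as C after conversion.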
Several methods have been developed that enable genome-wide profiling of DNA methylation. These can be divided into three types: 1) enrichment of methylated genomic DNA fragments, 2) digestion with methylation-sensitive restriction enzymes (RE) and 3) sequencing of bisulfite-converted DNA. Each of these methods has been scaled for the analysis of genome-wide profiles, with quantification of methylation based on either microarrays or high-throughput DNA sequencing.
Methylated DNA Immunoprecipitation (MeDIP) is the most common method based on enrichment, where an antibody specific for 5meC is used to capture methylated genomic DNA fragments [8, 9]. This method can provide relatively cheap and reasonably comprehensive genome-wide data, but the resolution is limited and the enrichment is not linearly related to the actual methylation level. Exemplifying methods based on methylation-sensitive RE, the HpaII tiny fragment Enrichment by Ligation-mediated PCR assay (HELP) can be used to determine genome-wide patterns based on the combined activity of the HpaII and MspI restriction enzymes. The main disadvantages of this approach are the resolution of the data and the bias due to the non-uniform distribution of RE cutting sites. The only methods that currently provide genome-wide base-resolution methylation information are based on high-throughput sequencing of bisulfite-converted DNA [12, 13]. While these methods (BS-Seq and MethylC-Seq) are still relatively expensive for large genomes (currently ~$10,000 for 30x coverage of the human genome), the cost of sequencing is dramatically decreasing at greater than Moore’s law pace (doubling every 18 months), meaning that soon the cost of enrichment will be significantly greater than the cost of sequencing.
Alternative methods to target specific regions of the genome have also been developed. Some of these methods allow the target regions to be chosen, like padlock-probe based targeting or enrichment methods coupled with promoter tiling arrays [4, 9]. Another method, Reduced Representation Bisulfite Sequencing (RRBS), relies on a combination of RE digestion, fragment size selection and bisulfite sequencing in order to fractionate the genome and enrich for fragments from regions with high CpG content prior to sequencing.
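The fractionation step of RRBS can be sketched in silico as a digestion followed by a size filter. The sketch below assumes that the restriction enzyme is MspI (cutting C^CGG), which is a common choice but is not stated above; the function names, the toy sequence and the size window are hypothetical, and a real library would use a window of tens to hundreds of base pairs.

```python
import re


def mspi_fragments(genome):
    """Split a sequence at every CCGG site, cutting after the first C.

    MspI is assumed as the enzyme here; only one strand is modelled.
    """
    genome = genome.upper()
    cut_sites = [match.start() + 1 for match in re.finditer("CCGG", genome)]
    bounds = [0] + cut_sites + [len(genome)]
    return [genome[start:end] for start, end in zip(bounds, bounds[1:])]


def size_select(fragments, min_len, max_len):
    """Keep fragments whose length lies in [min_len, max_len]."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


toy_genome = "AACCGGTTTTCCGGAA"
fragments = mspi_fragments(toy_genome)
print(fragments)
print(size_select(fragments, min_len=6, max_len=10))
```

Because every retained internal fragment starts and ends within a CCGG site, each one carries at least two CpG di-nucleotides, which is the intuition behind the CpG enrichment.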
In a recent study, the coverage, resolution, cost and concordance of four sequencing-based methylation profiling methodologies were evaluated. The concordance between these technologies was quite high when comparing methods based on sequencing of bisulfite-converted DNA (up to 82% and 99% for mC in CG and non-CG sequence context, respectively), and when comparing enrichment methods (99%), while regions assessed by all four methods were 97% concordant. The authors also showed the power of integrating two complementary methods: enriching for hyper- and hypo-methylated regions, along with histone methylation, RNA, and SNP data, could allow for assessment of allele-specific epigenetic states.
The choice of the best-suited method inevitably depends on the required data resolution, cost, size of the genome and throughput in the number of samples.
In prokaryotes a methyl group can be added to both cytosines and adenines in palindromic target sequences, while in eukaryotes DNA methylation is restricted to cytosines. The most common sequence context where this epigenetic mark is found is the CpG (or CG) di-nucleotide (mCpG or mCG). This di-nucleotide is under-represented in eukaryotic genomes because of the mutagenic effect of this modification. In fact, the few regions that are GC and CpG rich, named CpG Islands (CGI), are usually depleted of 5meC. CGI are often found within upstream gene regulatory regions, and their presence, altered methylation status, and the overall CpG content of the promoter regions are highly predictive of the transcriptional potential of the downstream gene.
In several eukaryotic organisms 5meC can also be found in other sequence contexts. In general, this type of modification is named non-CG methylation. In particular, 5meC was found in the CHG and CHH sequence contexts (mCHG and mCHH, respectively; H being A, C or T). Recently, pervasive non-CG methylation was found in the human genome, although only in restricted differentiation stages [19, 20] (see following sections).
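The three sequence contexts can be made concrete with a short sketch that classifies a cytosine by the two bases that follow it. It considers the forward strand only, and the function name and the example sequence are hypothetical.

```python
def cytosine_context(seq, i):
    """Classify the cytosine at 0-based index i as 'CG', 'CHG' or 'CHH'.

    H stands for A, C or T. Returns None when the context cannot be
    determined (sequence end or an ambiguous base such as N).
    Only the forward strand is considered in this sketch.
    """
    seq = seq.upper()
    if seq[i] != "C":
        raise ValueError("position %d is not a cytosine" % i)
    following = seq[i + 1:i + 3]
    if len(following) >= 1 and following[0] == "G":
        return "CG"
    if len(following) < 2:
        return None  # too close to the end of the sequence
    first_is_h = following[0] in ("A", "C", "T")
    if first_is_h and following[1] == "G":
        return "CHG"
    if first_is_h and following[1] in ("A", "C", "T"):
        return "CHH"
    return None


example = "CGCAGCTTC"
for position, base in enumerate(example):
    if base == "C":
        print(position, cytosine_context(example, position))
```

Cytosines on the reverse strand would be classified in the same way after taking the reverse complement of the sequence.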
Certain genomic regions are enriched in 5meC within specific sequence contexts. For example, mCHH is highly enriched in A. thaliana transposons. Indeed, in some organisms different enzymes and pathways are responsible for these alternative types of methylation, while for others this is still open to debate.
Recently, the presence of an additional DNA methylation modification, 5-hydroxy-methylcytosine (5hmC), was found in mouse Purkinje neurons and the brain. This base can be detected using thin-layer chromatography, while standard RE and bisulfite conversion based methods appear not to be able to reveal it. Importantly, TET proteins can oxidize 5meC to 5hmC, which is poorly recognized by DNMT1 and can be converted to cytosine, providing a possible pathway for passive de-methylation.
Most species where this epigenetic mark is present have two general types of enzymes: de novo and maintenance DNA methyltransferases (DNMTs). The mechanism by which methylation patterns are established, maintained and inherited has not been completely clarified in any species, though. The common paradigm is that de novo DNMTs (DNMT3A and DNMT3B in human) establish methylation patterns early in embryonic development. These enzymes have the same activity on both hemi- and un-methylated DNA and are down-regulated but still expressed after cell differentiation. Maintenance enzymes (DNMT1 in human) then copy the newly established DNA methylation patterns through each cellular division. DNMT1 has a preference for hemi-methylated DNA while showing some de novo activity [29, 30]. The picture seems to be more complex though, since genetic experiments with gene knock-outs for these enzymes suggest that the concerted activity of de novo and maintenance enzymes is required for the complete and correct establishment of methylation profiles. Nevertheless, their expression varies greatly during the differentiation processes, suggesting that the requirement for their activity is not constant. Moreover, these enzymes appear not to be freely available in the cells but rather associated with chromatin complexes or at the DNA replication fork. Finally, these enzymes require accessory proteins and show a complex interplay with nucleosome positioning and particular histone marks. These findings help to explain why not all cytosines are methylated. A revised model points to the possibility that the overall methylation level of a region might be copied rather than the exact methylation status of each individual cytosine.
In general, the complex cross-talk of these enzymes with other components of the transcriptional machinery has to be considered. In fact, as discussed in the following sections, despite DNA methylation being usually considered a repressive mark, its role can vary greatly in different genomic contexts and in association with other regulatory mechanisms.
The role of DNA methylation in different genomic contexts can be remarkably different, and alternative mechanisms are available for establishing 5meC patterns in different genomic regions. The density of possible methylation sites, as well as the distribution of 5meC, is not uniform in the genome. In particular, control of methylation levels in promoters, gene bodies, regulatory features, and transposable and repetitive elements appears to be critical.
Recently, base-resolution sequencing of several eukaryotic methylomes has provided initial insights into the evolutionary history of DNA methylation and its distribution in these key genomic regions. Currently, the complete methylomes for 23 organisms are available (10 animals, 8 plants and 5 fungi; Fig. 1) [13, 19, 32, 33]. In general the DNA methylation landscape can be either continuous along the genome, or constituted by a series of heavily methylated DNA domains interspersed with domains that are methylation free. When considering the data available for a limited set of loci in an even wider set of organisms, the current perspective is that the evolutionary history of DNA methylation in gene bodies and transposons may be independent. These genomic regions are present in both plants and vertebrates; however, transposon methylation is only conserved in fungi while gene body methylation only occurs in invertebrates. Gene body methylation seems to be a property inherited from ancient genomes, while transposon methylation appears to be related to the degree of sexual outcrossing. In general, for those organisms where the methylation of certain genomic elements was present, their methylation pattern is quite conserved. Non-CG methylation, however, is less common and, when present, is always at a lower level than mCG [13, 19, 32, 33].
Base-resolution global maps of DNA methylation in humans remained elusive for a long time because of the difficulties of performing comprehensive analysis on a multiple-gigabase genome. Recent improvements in the throughput of sequencing technologies, a simultaneous reduction in cost, and the coupling of bisulfite conversion with cutting-edge sequencing methods have now enabled acquisition of complete human methylomes for several cell types: human embryonic stem cells (hESC), fetal and neonatal fibroblasts, and a fibroblastic differentiated derivative of hESC [19, 20]. The choice of these particular cell types is related to the relevance of DNA methylation in the onset and regulation of cellular differentiation processes. These maps demonstrated for the first time the feasibility of applying such genome-wide methods to genomes of this size, and now provide reference maps against which other maps of healthy and diseased cells can be compared.
A striking finding from the first human methylomes is the relative abundance of non-CG methylation in hESC [19, 20]. Indeed, one quarter of the 5meC was present in the CHG or CHH sequence context, with some preference for CA di-nucleotides in both sequence contexts. Moreover, non-CG methylation was lost during the differentiation process but can be re-established at the same loci upon generation of iPSCs. While the re-appearance of non-CG methylation in iPSCs was shown for several loci, more comprehensive experiments will be necessary to fully evaluate how the DNA methylome is restored in these cells compared to ESCs. Non-CG methylation was distributed non-randomly in the genome and was particularly enriched in gene bodies, with increasing levels corresponding to more highly transcribed genes. Non-CG methylation was also particularly enriched in genes related to the processing of mRNA and in genes having higher pre-mRNA levels. These findings point to the potential involvement of DNA methylation with the splicing machinery [19, 20]. This hypothesis is also supported by the patterns of CG methylation at the exon-intron boundaries. Enzymes responsible for the deposition of mCG were shown to have a preference for targeting CG di-nucleotides with a relative spacing of 8 bp. Interestingly, the same result was observed for 5meC in CHG and CHH sequence contexts. These findings suggest that the same enzymes may be responsible for non-CG methylation, even if patterns for 5meC at relative distances that are multiples of 8 bp are not found as clearly as for mCG. Interestingly, methylation at non-CG sites was at a much lower level than mCG, as only 25% of the sequencing reads for a given non-CG residue were methylated, compared to 80-90% for mCG. Finally, mCHH was slightly more enriched on the antisense strand in gene bodies, and the potentially symmetric mCHG was 98% hemi-methylated.
For all of these reasons, it is necessary to acquire deep sequence coverage for those samples where non-CG methylation is expected. Finally, further research will be necessary to clarify the role of non-CG methylation and its relevance for pluripotency.
Biophysical studies of DNA methylation reveal that this base modification plays an important role in repressing accessibility of the transcriptional machinery to the DNA. Indeed, some transcription factors (TF) such as Sp1 are known to be methylation sensitive, even if this seems to be dependent on the condition or tissue considered [36, 37]. Moreover, not all TFs have methylation sites in their binding sites (TFBS), and for some TFs methylation does not affect binding.
Recently, binding of TFs and other proteins important for general control of transcription (TAF1 and P300) or for stem cell biology (SOX2, OCT4, NANOG and KLF4) was profiled by ChIP-seq in both hESC and fetal fibroblasts. The regions immediately surrounding these TFBS showed depletion of non-CG 5meC in human ESCs, while mCG was much less depleted for many factors.
Enhancer regions, as defined by H3K4me1 and H3K27ac ChIP-seq sites, were also profiled and these elements also showed depletion of non-CG 5meC in human ESCs. Interestingly, the enhancers present in fibroblasts showed mCpG depletion, while enhancers shared by both cell types showed depletion of 5meC in the CpG and non-CG context in differentiated and un-differentiated cells, respectively. These findings would suggest that cells in different states of differentiation use different DNA methylation mechanisms to mark these important regulatory features.
While non-CG methylation was absent in differentiated cells, widespread differences were found in the distribution and levels of CG methylation. Hundreds of differentially methylated regions (DMRs) hyper-methylated in fibroblasts have been identified, many associated with genes important for stem cell functions. On the other hand, large hypo-methylated regions have also been identified when comparing differentiated cells to human ESCs. These have been termed Partially Methylated Domains (PMDs). PMDs might be expected on chromosome X, given that the cells examined were derived from a female and that dosage compensation occurs on the sex chromosomes. In fact, PMDs cover 80% of chromosome X. Nonetheless, surprisingly, almost 40% of the autosomes were found in the PMD state. Genes in these domains were down-regulated compared to human ESCs, and PMDs were also enriched in repressive histone marks. Interestingly, large blocks of H3K9me2 were present in differentiated cells but not in embryonic cells in the mouse. These same regions were found to overlap human PMDs and shown to be lost in certain cancer cells [38, 39].
CG DNA methylation is usually considered a repressive mark. Indeed, methylation in the context of promoters is inversely related with the transcriptional level of the downstream gene. However, CG methylation in gene bodies is positively correlated with the gene transcriptional level, indicating that the meaning of this epigenetic mark is rather complex and context dependent. Genome-wide studies showed that gene-body non-CG methylation is also clearly positively correlated with gene expression. Rather, gene bodies in human ESCs are always highly methylated, even in poorly expressed genes. All of these findings suggest that the positive correlation between gene-body CG methylation and transcriptional expression could be re-interpreted as loss of DNA methylation upon differentiation and the formation of repressed chromatin blocks, rather than expecting a positive effect of gene-body methylation on gene expression. In agreement, there is evidence in A. thaliana for association between gene body methylation and transcriptional elongation, suggesting a scenario where transcription contributes to maintaining or enhancing DNA methylation levels.
There are specific pitfalls related to the analysis of DNA methylation data. Some of these issues are specific to the methodology chosen for detecting 5meC. For example, a problem for enrichment methods is that the enrichment signal is not linearly related to the actual methylation level (Fig. 2A). Fortunately, several methods were developed to correct for this bias [10, 42, 43]. When analyzing base-resolution data, the challenges are in dealing with non-uniform coverage within a sample and comparing samples with different overall sequencing depth (Fig. 2B). This is particularly critical for the detection of non-CG methylation. In fact, as discussed in a previous section, the level of methylation for 5meC in this sequence context is rather low, and only 25% of the sequencing reads for a given non-CG residue are methylated on average. In contrast, alignment of bisulfite-converted reads and bisulfite conversion errors do not represent a serious issue, since a high proportion of reads can be mapped, covering most of the genome.
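The dependence of non-CG detection on sequencing depth can be illustrated with a small calculation. The sketch below assumes that reads covering a site are independent and that each one is methylated with a fixed probability (a binomial model, which is a simplification); the function name and the chosen depths are hypothetical, and the 25% level is the average quoted above.

```python
from math import comb


def prob_at_least(k, depth, level):
    """Probability of seeing at least k methylated reads at a site.

    depth: number of reads covering the cytosine.
    level: fraction of molecules methylated at that site (0 to 1).
    Assumes independent reads (binomial sampling).
    """
    return sum(
        comb(depth, j) * level**j * (1 - level) ** (depth - j)
        for j in range(k, depth + 1)
    )


# Chance of observing at least 3 methylated reads at a site that is
# methylated in 25% of the molecules, for a few illustrative depths.
for depth in (5, 10, 20, 30):
    print(depth, round(prob_at_least(3, depth, 0.25), 3))
```

Under this model the probability rises steadily with depth, which is why shallow coverage tends to miss sites with low methylation levels.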
Once the final set of 5meC calls, or an estimate derived from the enrichment level, is available, one has to consider both the absolute and relative methylation level. The absolute level represents the density of 5meC. On the other hand, this information has to be considered with respect to the total number of available methylation sites, and in general the CG content of the loci under consideration. As an example, a promoter containing a hypo-methylated CGI can have the same 5meC density as a fully methylated low CpG content promoter, in absolute terms, but their functional consequences are remarkably different (Fig. 2C). Indeed, hypo-methylation at CGI is likely to be associated with transcriptional expression of the downstream gene. In general, it was shown that transcription of genes downstream of low CG content promoters does not correlate with their upstream-region DNA methylation level. However, the methylation status of high and especially intermediate CpG content promoters is critical for transcriptional repression.
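The distinction between absolute and relative levels can be made explicit with a toy calculation. The two promoters below are invented for illustration (they are not data from any study), and the function names are hypothetical: the absolute level is taken as 5meC per base pair and the relative level as the fraction of available CpG sites that are methylated.

```python
def absolute_level(n_methylated, region_length):
    """Density of 5meC: methylated cytosines per base pair of the region."""
    return n_methylated / region_length


def relative_level(n_methylated, n_sites):
    """Fraction of the available methylation sites that are methylated."""
    if n_sites == 0:
        return float("nan")  # no site available, level is undefined
    return n_methylated / n_sites


# Two hypothetical 1 kb promoters with the same number of 5meC.
promoters = {
    "CGI promoter, hypo-methylated": {"length": 1000, "cpg_sites": 50, "methylated": 5},
    "low-CpG promoter, fully methylated": {"length": 1000, "cpg_sites": 5, "methylated": 5},
}

for name, p in promoters.items():
    print(
        name,
        absolute_level(p["methylated"], p["length"]),
        relative_level(p["methylated"], p["cpg_sites"]),
    )
```

Both promoters have the same density of 5meC, yet only one tenth of the available sites are methylated in the first and all of them in the second.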
A substantial difference between enrichment and bisulfite-sequencing based methods is that for the former the count of reads for given loci determines the methylation level, while for the latter the read depth is related to the library preparation, sequencing and mapping process. Also, the ability to call a 5meC is directly related to the sequencing depth in bisulfite sequencing-based experiments (Fig. 2B). Relatedly, the methylation level (the proportion of reads with 5meC over the set of reads covering a specific base) is a measure whose variability decreases with increasing depth. Finally, since the methylation level is a finite-scale β-distributed measure, the variance of measurements with a mean near the mid-range can be much larger than the variance of measurements with a mean close to the limits (0 and 1; Fig. 2D). These are important pitfalls to keep in mind when determining differential methylation levels.
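Both effects can be sketched with two textbook variance formulas. The first assumes binomial sampling of reads at a single site; the second uses the variance of a Beta distribution parameterised by its mean and a concentration a + b, held fixed here purely for illustration. The function names and parameter values are hypothetical.

```python
def sampling_variance(level, depth):
    """Variance of the estimated methylation level at one site.

    Assumes the methylated read count is Binomial(depth, level), so the
    estimate (methylated reads / depth) has variance level*(1-level)/depth.
    """
    return level * (1 - level) / depth


def beta_variance(mean, concentration):
    """Variance of a Beta(a, b) variable with a/(a+b) = mean and a+b = concentration."""
    return mean * (1 - mean) / (concentration + 1)


# Variability shrinks as depth grows ...
for depth in (5, 10, 40):
    print("depth", depth, sampling_variance(0.5, depth))

# ... and, at a fixed concentration, is largest for mid-range means.
for mean in (0.05, 0.5, 0.95):
    print("mean", mean, beta_variance(mean, concentration=9))
```

A practical consequence is that the same absolute difference in methylation level is harder to call as significant near 0.5 than near 0 or 1, and harder at low depth than at high depth.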
In general, few tools are currently available for the analysis of epigenomics data, in particular high-throughput DNA methylation data. Pipelines for both low- and high-level analyses must be developed. Low-level analysis can be defined as read mapping algorithms and 5meC calling, taking into account bisulfite conversion level, sequencing errors and multiple testing issues. Higher-level analysis involves the determination of absolute and relative methylation levels across genomic regions based on genome annotation, clustering, visualization and integration with other heterogeneous data types. Particularly important is the determination of differentially methylated regions, since DNA methylomes vary between different cell types, as a function of differentiation stage, age, and as a result of disease states. Methods for the identification of differentially methylated regions must also take into account that genomic regions of very different scale might exist, spanning from megabase-size PMDs to small DMRs that comprise only a few differentially methylated cytosine bases.
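One possible form of the low-level 5meC calling step is sketched below: a one-sided binomial test against a background error rate, followed by a Benjamini-Hochberg correction for multiple testing. This is an illustrative approach under stated assumptions, not a description of any specific published pipeline; the error rate of 0.005 (combining non-conversion and sequencing errors), the function names and the read counts are hypothetical.

```python
from math import comb


def binomial_pvalue(methylated_reads, depth, error_rate):
    """P(X >= methylated_reads) for X ~ Binomial(depth, error_rate).

    error_rate is the assumed chance that an unmethylated cytosine is
    read as C (non-conversion plus sequencing error).
    """
    return sum(
        comb(depth, j) * error_rate**j * (1 - error_rate) ** (depth - j)
        for j in range(methylated_reads, depth + 1)
    )


def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans, True where the site passes FDR control."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            max_rank = rank  # largest rank meeting the BH threshold
    passed = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_rank:
            passed[i] = True
    return passed


def call_sites(counts, error_rate=0.005, alpha=0.05):
    """counts: list of (methylated_reads, depth) pairs, one per cytosine."""
    pvalues = [binomial_pvalue(k, n, error_rate) for k, n in counts]
    return benjamini_hochberg(pvalues, alpha)


# Three hypothetical cytosines: strongly methylated, unmethylated, and a
# single methylated read that is compatible with the background rate.
print(call_sites([(8, 10), (0, 10), (1, 20)]))
```

With these toy counts only the first site is called, because a single methylated read out of 20 is not distinguishable from the assumed background once the multiple-testing correction is applied.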
Technologies for high-throughput detection of the sites of DNA methylation in genomes are constantly being improved, with equally significant reductions in cost. Three important limitations might be overcome in the near future. Single cell sequencing analysis would eliminate many problems of interpreting DNA methylomes, which currently are the product of a mixed population of cells/chromosomes, each with heterogeneous methylation status [44, 45]. Similarly, “amplification-free” methods would greatly reduce the problem of the uneven distribution of bisulfite-converted reads [44, 45]. Finally, direct detection of 5meC, without use of bisulfite conversion, would eliminate the issue of the degradation and loss of material following the chemical conversion [44-47] as well as increase the ability to map the sequencing reads onto the genome.
Several aspects regarding the prevalence and role(s) of DNA methylation must be clarified. A comprehensive list of DNA-binding proteins that are affected by, or insensitive to, cytosine methylation should be developed. This is particularly critical for TFs, given that they bind upstream gene regulatory regions, where DNA methylation has always been considered a critical factor. Also, the relationship between the methylation of specific regions and the transcriptional potential of nearby genes is still unclear. DNA methylation is only one epigenetic control point. Future studies must attempt to integrate this key mark with all other epigenetic and regulatory mechanisms. For example, there is strong evidence linking promoter DNA methylation to transcriptional repression of the downstream gene. Induced de-methylation can restore the expression of silenced genes, and re-establishment of high methylation levels can suppress them. However, in other cases, variation in the level of DNA methylation may simply be a consequence of closed chromatin structure, as has been hypothesized for PMDs, where embryonic methylation levels might be lost after differentiation, resulting in accumulation of repressive chromatin marks.
Few examples of allelic methylation are available, and this important phenomenon requires more comprehensive analysis. This will require the availability of matched genomes/methylomes, and possibly the use of methods that are not dependent on bisulfite conversion, as this complicates the evaluation of differential methylation at C-T SNPs.
Finally, a critical area is the application of base-resolution methylome analysis to clinical studies, where DNA methylomes of diseased cells can be generated and compared to healthy cells, possibly using matched samples. These studies will also be necessary to better understand the effect of drugs that target the epigenome, such as the demethylating agent Decitabine. These drugs are used to de-methylate tumor suppressor genes that have been silenced by DNA methylation, but their genome-wide effect is poorly understood.
MP is supported by a Catharina Foundation postdoctoral fellowship; work in our laboratory is supported by grants from NIH (U01 ES017166), DOE (FG02-04ER15517), NSF (MCB-0929402) and by the Mary K. Chapman Foundation (JRE).
List of abbreviations: 5meC, MeDIP, RE, HELP, mCpG, CGI, mCHG, mCHH, 5hmC, DNMT, ESC, hESC, IPS, TF, TFBS, ChIP, DMR, PMD