The role of DNA methylation in human diseases has sparked interest in genome-scale methods for DNA methylation profiling [1]. Among an array of protocols for measuring DNA methylation, bisulfite sequencing stands out for its ability to quantify the DNA methylation status of essentially all non-repetitive regions in the genome at single-nucleotide resolution [2]. We recently developed reduced representation bisulfite sequencing (RRBS) as an accurate yet cost-efficient method for genome-scale DNA methylation analysis [3,4]. Here, we show that RRBS is highly appropriate for DNA methylation profiling of human disease cohorts, and we address four obstacles that hamper epigenome mapping in clinical samples:
(i) High input DNA requirements. Methods such as MeDIP-seq [5], MBD-seq [6], Methyl-seq [7] and CHARM [8] consume micrograms of genomic DNA, which is infeasible for many clinical samples such as tumors obtained by laser capture microdissection or rare stem cell populations.
(ii) Inability to analyze FFPE samples. We are not aware of a genome-scale method for DNA methylation mapping that works well on formalin-fixed, paraffin-embedded (FFPE) clinical samples, rendering many of the best-annotated patient cohorts inaccessible for epigenome studies.
(iii) Incomplete bisulfite conversion. Whole-genome bisulfite sequencing cannot use specific primers to enrich for fully converted DNA, such that incomplete bisulfite conversion is likely to result in measurement artifacts.
(iv) Lack of data analysis tools. Few statistical methods or bioinformatic tools exist that would allow sensitive detection of DNA methylation alterations that distinguish disease case and control samples.
The RRBS protocol combines digestion of DNA by a methylation-insensitive restriction enzyme with size selection to obtain a reproducible subset of the genome [3,4]. This ‘reduced representation’ is bisulfite-sequenced and its DNA methylation profile compared between disease cases and control samples. To translate the RRBS protocol from mouse to human, we initially performed in silico digestions, confirming that MspI digestion and a size selection of 40–220 base pairs enriches for CpG islands and promoter regions (data not shown). We tested this protocol on two fresh-frozen clinical samples, a colon tumor and adjacent normal tissue from the same patient. We obtained 8.7 and 5.3 million high-quality aligned reads for the two samples, yielding DNA methylation data for more than 1 million unique CpGs. Highly quantitative data with more than 25 individual CpG measurements were obtained for 65% of core promoters, 50% of CpG islands and 17% of putative regulatory elements. Furthermore, we observed coverage of a sizable number of CpG island ‘shores’ [9], enhancers, exons, 3′ UTRs, and repetitive elements (see http://rrbs-techdev.computational-epigenetics.org for details). This coverage constitutes a slight improvement over that previously reported for RRBS in mouse samples [4].
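The in silico digestion mentioned above can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the code used in this study: it treats the genome as a plain string, relies on the known MspI recognition site (C^CGG), and keeps only fragments flanked by MspI sites on both ends. The function names `mspi_fragments` and `cpg_count` are hypothetical.

```python
import re

MSPI_SITE = "CCGG"  # MspI recognition sequence, cut as C^CGG
CUT_OFFSET = 1      # the cut falls after the first C of the site


def mspi_fragments(seq, min_len=40, max_len=220):
    """Return (start, end) half-open coordinates of MspI-MspI fragments
    whose length lies within [min_len, max_len]."""
    seq = seq.upper()
    cuts = [m.start() + CUT_OFFSET for m in re.finditer(MSPI_SITE, seq)]
    fragments = []
    # Only fragments between two consecutive cut sites are considered;
    # the terminal pieces of the sequence are ignored.
    for start, end in zip(cuts, cuts[1:]):
        if min_len <= end - start <= max_len:
            fragments.append((start, end))
    return fragments


def cpg_count(seq, start, end):
    """Count CpG dinucleotides fully contained in seq[start:end]."""
    return seq.upper().count("CG", start, end)


if __name__ == "__main__":
    # Toy sequence: one 54-bp fragment (kept) and one 304-bp fragment (rejected).
    toy = "AAACCGG" + "A" * 50 + "CCGG" + "T" * 300 + "CCGG" + "AA"
    for start, end in mspi_fragments(toy):
        print(start, end, end - start, cpg_count(toy, start, end))
```

Applied per chromosome, the selected coordinates could then be intersected with CpG island and promoter annotations to estimate enrichment.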
Table 1. Summary of reduced representation bisulfite sequencing experiments
For the analysis of clinical samples, three aspects of the RRBS protocol were specifically optimized. First, we reduced the input DNA requirement so that small tissue samples and FACS-sorted cell populations can be processed. In two successive rounds of optimization we reduced the amount of input DNA from 1 μg to 300 ng and from 100 ng to 30 ng, observing Pearson correlation coefficients of 0.97 and 0.96, respectively, calculated over all CpGs with at least 25-fold sequencing coverage. This analysis was performed on DNA from mouse ES cells rather than on human material to minimize the number of potential confounding factors. To confirm that the low-input protocol works well for human disease samples, we performed RRBS on two human blood samples using 30 ng of input DNA, and we observed a correlation of 0.96 between the two samples (Supplementary Table 1).
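The coverage-filtered correlation used for these comparisons can be illustrated as follows. This is a sketch under assumptions rather than the analysis code of this study: each sample is assumed to be a dictionary mapping a CpG identifier to a pair of (methylated reads, total reads), and the names `methylation_levels` and `pearson_shared` are hypothetical.

```python
from math import sqrt


def methylation_levels(counts, min_coverage=25):
    """Map each CpG to its methylation fraction, keeping only CpGs
    covered by at least min_coverage reads."""
    return {
        cpg: meth / total
        for cpg, (meth, total) in counts.items()
        if total >= min_coverage
    }


def pearson_shared(sample_a, sample_b, min_coverage=25):
    """Pearson correlation of methylation levels over CpGs that pass
    the coverage filter in both samples."""
    a = methylation_levels(sample_a, min_coverage)
    b = methylation_levels(sample_b, min_coverage)
    shared = sorted(a.keys() & b.keys())
    if len(shared) < 2:
        raise ValueError("need at least two shared CpGs")
    xs = [a[c] for c in shared]
    ys = [b[c] for c in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

Requiring the coverage threshold in both samples keeps CpGs with noisy, low-count methylation estimates out of the comparison.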
Second, we optimized RRBS analysis for DNA extracted from FFPE tissue slices. Focusing on two matched colon samples that had been stored in FFPE format since 2001, we observed the characteristic DNA degradation pattern of FFPE samples (Supplementary Fig. 1a). To avoid degradation products in the selected size range (40–220 bp), we size-selected DNA fragments greater than 500 base pairs before digesting the genomic DNA with MspI. Our protocol resulted in high-quality RRBS libraries (Supplementary Fig. 1b), and the sequencing yield was comparable to that of fresh-frozen samples. We also observed high overall agreement between the FFPE samples and the fresh-frozen samples in terms of genomic coverage and DNA methylation measurements (Supplementary Fig. 2). Specifically, the correlation of DNA methylation levels at CpGs with at least 25-fold sequencing coverage was 0.87 between the fresh-frozen and the FFPE colon tumor and 0.88 between the fresh-frozen and the FFPE normal colon tissues (Supplementary Table 1).
Third, we optimized bisulfite treatment in order to maximize conversion of unmethylated cytosines while minimizing loss of input DNA due to bisulfite-induced degradation. Across multiple experiments in clinical samples and mouse ES cells, we found that a conversion protocol with two consecutive 5-hour bisulfite treatments [10] was more effective than our previously used single-step 14-hour protocol (conversion rate >99% in all experiments). We also performed RRBS on in vitro methylated and in vivo demethylated DNA from a single cell line. This experiment confirmed that the overall level of DNA methylation does not have a visible effect on the bisulfite conversion rate. Finally, we used the EpiGRAPH web service [11] to compare DNA sequence properties (sequence composition, structural features, repeat content, etc.) between regions that exhibited comparatively low and comparatively high levels of bisulfite conversion. No consistent correlation with the bisulfite conversion rates could be identified (data not shown), suggesting that systematic bisulfite conversion bias is not a problem when applying RRBS to human disease samples.
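One common way to estimate a bisulfite conversion rate is sketched below. The text above does not state how the conversion rate was computed, so this example rests on an assumption: cytosines outside the CpG context are taken to be unmethylated, and the fraction of them read as T is used as the estimate. Reads are assumed to be ungapped and on the same strand as the reference; `conversion_counts` and `conversion_rate` are hypothetical names.

```python
def conversion_counts(reference, read):
    """Count converted and unconverted non-CpG cytosines for one ungapped,
    same-strand alignment in which read[i] aligns to reference[i]."""
    reference = reference.upper()
    read = read.upper()
    converted = unconverted = 0
    for i, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base != "C":
            continue
        next_base = reference[i + 1] if i + 1 < len(reference) else None
        # Skip CpG cytosines and cytosines whose context is unknown.
        if next_base is None or next_base == "G":
            continue
        if read_base == "T":
            converted += 1
        elif read_base == "C":
            unconverted += 1
        # Any other base is treated as a sequencing error and ignored.
    return converted, unconverted


def conversion_rate(alignments):
    """Estimate the conversion rate over an iterable of (reference, read)
    pairs. Assumes non-CpG cytosines are unmethylated."""
    converted = unconverted = 0
    for reference, read in alignments:
        c, u = conversion_counts(reference, read)
        converted += c
        unconverted += u
    total = converted + unconverted
    if total == 0:
        raise ValueError("no informative non-CpG cytosines")
    return converted / total
```

Because genuine non-CpG methylation exists in some cell types (for example ES cells), an estimate of this kind is a lower bound on the true conversion rate.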
As an additional validation, we performed DNA methylation analysis of the fresh-frozen colon tumor sample using the Infinium HumanMethylation27 platform, which combines bisulfite conversion with a genotyping microarray to measure DNA methylation in promoter regions [12]. For 1,027 CpGs both methods yielded high-confidence measurements, and we observed a correlation of 0.88 between Infinium and RRBS. Furthermore, when we allowed for up to 100 base pairs distance between the CpGs assayed by Infinium and RRBS, the high-confidence overlap between both methods increased to 7,324 CpGs, while the correlation between the two assays remained high (Pearson’s r = 0.77). This observation is consistent with high autocorrelation of DNA methylation levels in the CpG-rich regions of the human genome [13,14] and provides justification for measuring DNA methylation at a subset of indicator CpGs, rather than at every single CpG within a given region.
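The distance-tolerant matching of CpGs between the two assays can be sketched as follows. This illustration assumes that each array CpG is paired with the single nearest RRBS CpG on the same chromosome, which the text above does not specify; the data layout (chromosome mapped to a dictionary of position and methylation level) and the names `nearest_within` and `match_cpgs` are likewise hypothetical.

```python
from bisect import bisect_left


def nearest_within(sorted_positions, query, max_distance):
    """Return the position in sorted_positions closest to query, or None
    if no position lies within max_distance."""
    i = bisect_left(sorted_positions, query)
    # Only the neighbors immediately left and right of the insertion
    # point can be the nearest position.
    candidates = sorted_positions[max(i - 1, 0): i + 1]
    best = min(candidates, key=lambda p: abs(p - query), default=None)
    if best is None or abs(best - query) > max_distance:
        return None
    return best


def match_cpgs(array_cpgs, rrbs_cpgs, max_distance=100):
    """Pair each array CpG with the nearest RRBS CpG on the same chromosome.

    Both inputs map chromosome -> {position: methylation level}.
    Returns a list of (array_level, rrbs_level) tuples.
    """
    pairs = []
    for chrom, probes in array_cpgs.items():
        rrbs = rrbs_cpgs.get(chrom, {})
        positions = sorted(rrbs)
        for pos, level in sorted(probes.items()):
            hit = nearest_within(positions, pos, max_distance)
            if hit is not None:
                pairs.append((level, rrbs[hit]))
    return pairs
```

Setting `max_distance=0` restricts the comparison to CpGs measured by both assays at exactly the same position; the resulting pairs can be passed to any correlation routine.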
To complement the experimental optimizations described above, we developed a bioinformatic data analysis pipeline that is designed to identify subtle alterations of DNA methylation in genomic regions with putative gene-regulatory potential (Supplementary Note). This pipeline builds upon a comprehensive set of pre-annotated genomic regions (which includes promoters, CpG islands and many other genomic features). For each region it performs a statistical test for differential DNA methylation, and it calculates p-values without having to introduce any arbitrary threshold parameters. Multiple-testing correction is performed by controlling the false discovery rate. Importantly, restricting the analysis to a relevant subset of the genome increases the statistical power for detecting subtle alterations in gene-regulatory regions, because the p-values are not diluted by multiple-testing correction for regions that are a priori unlikely to be differentially methylated.
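The region-wise testing strategy can be illustrated with a small sketch. The specific test used by the pipeline is described in the Supplementary Note and is not reproduced here; as a stand-in, this example assumes a two-sided Fisher's exact test on pooled methylated and unmethylated read counts per region, followed by the Benjamini-Hochberg procedure for false discovery rate control. All function names and the input layout are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables at least as extreme as observed.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def differential_regions(regions, alpha=0.05):
    """regions maps a region name to
    ((meth_case, unmeth_case), (meth_control, unmeth_control)).
    Returns {name: adjusted p-value} for regions significant at alpha."""
    names = list(regions)
    pvals = [fisher_exact_two_sided(*regions[n][0], *regions[n][1]) for n in names]
    qvals = benjamini_hochberg(pvals)
    return {n: q for n, q in zip(names, qvals) if q <= alpha}
```

Because the adjustment scales each p-value by the number of tests, restricting `regions` to pre-annotated regulatory elements directly reduces the correction burden, which is the power argument made above.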
To illustrate the features of the bioinformatic analysis pipeline, we compared the DNA methylation profile of the colon tumor with matched normal colon tissue. We observed tumor-specific hypermethylation at 52 gene promoters, 114 CpG islands and hundreds of additional genomic regions. Affected genes include GATA5 (Supplementary Fig. 3) and SOX17, which are known targets of hypermethylation in colon cancer [15,16]. However, classical targets such as APC and MGMT were unmethylated in this particular tumor. To corroborate the observation that the tumor exhibits hypermethylation at a relatively small number of genes, we assessed whether the tumor classifies as CpG island methylator phenotype (CIMP) based on a recently published biomarker [17]. CIMP is a characteristic property of a subset of colon cancers exhibiting widespread DNA methylation at a large number of CpG island promoters. We inspected the promoters of five genes that have been identified as predictive of CIMP [17], and the RRBS data clearly identify the tumor as CIMP-negative. In addition to hypermethylation at a small but significant number of gene promoters, we also observed cases of tumor-specific hypomethylation. An example is HNF4A, a hepatic transcription factor that has an essential role in colon development [18].
The RRBS method’s deep coverage of gene promoters plus selective sampling of all other types of genomic regions makes it most useful for detecting novel epigenetic alterations, for example in the context of biomarker discovery [19]. Compared to truly genome-wide bisulfite sequencing, its focus on a reduced representation of the genome translates into a substantial cost advantage and the ability to screen larger patient cohorts. On the other hand, padlock-targeted bisulfite sequencing and epigenotyping microarrays currently achieve substantially lower genomic coverage, making these technologies more suitable for validating findings than for initial discovery. In terms of sample quality and input DNA requirements, RRBS is more forgiving than any other method for epigenome profiling that we are aware of. It is thus possible to run RRBS as an add-on for essentially all ongoing tumor genomics initiatives, and to generate genome-wide methylation profiles of some of the most interesting and best-annotated sample collections. Finally, with ever-decreasing sequencing costs RRBS will readily scale to more comprehensive genomic coverage, for example, by using additional restriction enzymes or widening the size-selection window.