|Home | About | Journals | Submit | Contact Us | Français|
We describe the direct detection of DNA methylation, without bisulfite conversion, through single-molecule real-time (SMRT) sequencing. In SMRT sequencing, DNA polymerases catalyze the incorporation of fluorescently labeled nucleotides into complementary nucleic acid strands. The arrival times and durations of the resulting fluorescence pulses yield information about polymerase kinetics and allow direct detection of modified nucleotides in the DNA template, including N6-methyladenosine, 5-methylcytosine, and 5-hydroxymethylcytosine. Measurement of polymerase kinetics is an intrinsic part of SMRT sequencing and does not adversely affect determination of the primary DNA sequence. The various modifications affect polymerase kinetics differently, allowing discrimination between them. We utilize these kinetic signatures to identify adenosine methylation in genomic samples and show that, in combination with circular consensus sequencing, they can enable single-molecule identification of epigenetic modifications with base-pair resolution. This method is amenable to long read lengths and will likely enable mapping of methylation patterns within even highly repetitive genomic regions.
DNA methylation, in its various forms, has been implicated in the regulation of a variety of biological processes across virtually every branch of the taxonomic tree. For example, in certain bacteria, N6-methyladenosine (mA) appears primarily within GATC sequence contexts and helps regulate replication, the mismatch repair pathway, and the expression of certain genes1. In plants, 5-methylcytosine (mC) appears in multiple sequence contexts, each controlled by separate genetic mechanisms2, 3. 5-methylcytosine within vertebrates usually occurs at CG dinucleotides, which often cluster in regions called CpG islands that are at or near transcription start sites4-6. Methylation within these islands regulates gene expression within cells7 and can also confer epigenetic heritability in offspring8, 9. Changes in mC patterns play a crucial role in development10, 11 and have been associated with cancer12, 13 and other diseases14. Abundant cytosine methylation in non-CG contexts was recently found in human embryonic stem cells but not in differentiated cells, suggesting that it is a distinct type of methylation involved in the maintenance of the pluripotent state15. Finally, 5-hydroxymethylcytosine (hmC) is a newly identified epigenetic mark whose biological function is not yet understood, found thus far in mouse Purkinje neurons16 and embryonic stem cells17.
Because of the key role it plays in human health and disease, cytosine methylation is the most widely studied of the DNA modifications described above, and there is much interest in mapping genome-wide mC patterns across different cell types and in response to various environmental influences18. Currently, the most common technique for studying cytosine methylation involves bisulfite treatment (which transforms epigenetic information to genetic information by converting cytosine, but not methylcytosine, to uracil) followed by massively parallel DNA sequencing18. Using this approach, researchers have recently constructed single-base resolution methylation maps for the Arabidopsis thaliana genome2, 3, for a subset of the mouse genome19, and for both fibroblasts and embryonic stem cells throughout the majority of the human genome15. Despite these advances, enabled by bisulfite sequencing, there remain several drawbacks to the technique. For example, the sample preparation steps associated with bisulfite sequencing can be costly and time-consuming, and the harsh reaction conditions necessary for complete conversion can degrade DNA. In addition, the reduction of complexity in converted genomes constrains primer design for subsequent PCR amplification20 and also complicates alignment to a reference genome6. Finally, discrimination between C, mC, and hmC cannot be accomplished with bisulfite sequencing16, 17, 21, 22.
Direct detection of methylation is possible for nucleotides in solution using techniques such as thin layer chromatography16, 17, high performance liquid chromatography16, 23, mass spectrometry16, 17, 23, and nanopore amperometry24. No high-throughput method, however, has been demonstrated that allows the determination of primary sequence at the same time as methylation status. In this paper we present a method to directly detect DNA methylation during single-molecule, real-time (SMRT) DNA sequencing, an emerging technique for studying nucleic acid sequence and structure25. In this technique, single DNA polymerase molecules are observed in real time while catalyzing the incorporation of fluorescently labeled nucleotides complementary to a template nucleic acid strand. These reactions are measured simultaneously within thousands of arrayed zero-mode waveguides (ZMWs)26, nanophotonic structures that reduce background fluorescence, thereby enabling use of the high concentrations of labeled nucleotides necessary to support fast and processive DNA sequencing-by-synthesis. Incorporation of a nucleotide is detected as a pulse of fluorescence whose color identifies that nucleotide. The pulse ends when the fluorophore, linked to the nucleotide's terminal phosphate, is cleaved by the polymerase before translocation to the next base in the DNA template. Typical polymerase synthesis rates in SMRT sequencing are currently 1-3 bases per second25.
Fluorescence pulses in SMRT sequencing are characterized not only by their emission spectra, but also by their duration and by the interval between successive pulses25. These metrics, defined here as pulse width (PW) and interpulse duration (IPD), add valuable information about DNA polymerase kinetics. PW is a function of all kinetic steps after nucleotide binding and up to fluorophor release, while IPD is determined by the kinetics of nucleotide binding and polymerase translocation27. We have previously demonstrated that SMRT sequencing polymerase synthesis rates are sensitive to DNA primary and secondary structure25. Therefore, we hypothesized that methylated bases in a DNA template might be detected directly on the principle that their presence affects polymerase kinetics during SMRT sequencing (Fig. 1). Curiously, to our knowledge the kinetics of nucleotide incorporation against methylated templates have not been studied previously, even in bulk, despite evidence that other types of modified nucleotides do, in fact, alter DNA polymerase kinetics28.
In order to test this hypothesis, we designed several synthetic DNA templates that were identical except for their methylation status at specific sites. One template served as a control and contained no methylation, while the other templates contained several mA, mC, or hmC bases. Because the methylated bases could, in principle, affect the kinetics of DNA synthesis over a range of several nearby bases, we separated them by no less than 11 bases29. In all cases, mA was located within a GATC context, while mC and hmC were located within CG contexts. We sequenced these templates and then compared the average IPD in each of the methylated templates to the average IPD in the control template by computing their ratio at every template position. For all three methylation types, there is a clear excursion in polymerase synthesis kinetics in the vicinity of the methylated bases (Fig. 2). The methylated base is in contact with the polymerase for several bases prior and subsequent to occupying the active site29. Consistent with this model, the kinetic impact of methylation is not restricted to the nucleotide incorporation opposite the modified base. In addition, the IPD ratio patterns differ between the two methylated positions. As the only source of difference between these two loci in all three template types is the local sequence context (see Supplementary Note for template sequences), we conclude that, in general, the kinetic signatures of methylation will be sequence context dependent.
There are, however, several features in common between the two instances of each methylation type, suggesting that there may be universal interactions between the methylated nucleobase and specific sites of the polymerase. For mA (Fig. 2a), the ratio of these IPDs is largest (ranging between 5-6) opposite the methylated positions themselves. The N6 position is involved in hydrogen bonding during complementary base pairing, and therefore it is possible that the methyl group of mA directly modifies nucleotide binding kinetics. Another characteristic common between the two mA positions is an excursion in the IPD ratio 5 bases after incorporation opposite the methylated base.
The templates containing mC (Fig. 2b) display considerable IPD increases 2, 3, and 6 bases after both methylated positions. The templates containing hmC (Fig. 2c) exhibit common IPD signals at 2 and 6, but not 3, bases following the methylated positions. Plots of PW ratios display excursions that are more pronounced for hmC than for mC (Supplementary Fig. 1). In fact, because each modification displays a unique IPD and PW signature within a given context (see position 73 in Fig. 2b-c and Supplementary Fig. 1), this approach opens up the intriguing possibility of directly distinguishing between C, mC, and hmC during real-time sequencing. To this end, we used principal component analysis to find the combination of weights for the IPD and PW signals at the various template positions near the putative modification that optimizes the resolution (Supplementary Table 1). The separation between projections of the kinetic signatures for each template onto the first two principal components (Fig. 3) demonstrates the discrimination amongst these three cytosine nucleobase types by utilizing information from multiple kinetic parameters at multiple template positions.
While the previous experiments establish the principle of methylation detection in populations of identical molecules, in practice individual positions in a genomic sample might be methylated in only a fraction of the molecules present. When a genome position is not always methylated, the aggregate kinetic data over an ensemble will be a linear superposition of the kinetic signatures for the methylated and unmethylated cases. Partial methylation can be quantitated by fitting the data to a two-component model, but for the highly overlapping IPD distributions that result from a single, rate-limiting step25, it can be preferable to sequence a smaller number of molecules, but multiple times each. To enable reading of individual molecules multiple times, we exploited the circular topology of our DNA templates, achieved by the ligation of hairpin adaptors to both ends of a double-stranded DNA insert (Fig. 4a). A strand-displacing DNA polymerase can carry out multiple laps of DNA synthesis around such a DNA template and enable repeated, or circular consensus, sequencing of the same DNA molecule. Repeated measurements (which we call circular subreads) of IPD at a particular DNA template position yield a mean IPD at that position that follows a gamma distribution, which is narrower than the underlying exponential distribution. As more circular subreads are collected, the distributions of mean IPD for methylated and unmethylated bases become better separated (Fig. 4b). This substantially improves the discrimination between them, as shown by receiver operating characteristics (ROC) curves for calling A versus mA (Fig. 4c) within a particular context. The normalized area under the ROC curve is 0.80 after the first circular subread but increases to 0.92 and 0.96 after three and five circular subreads, respectively. In fact, after five subreads, >85% of mA bases can be detected at this template position with a false positive rate of only ~5%, and additional subreads enabled by longer read lengths would yield even better discrimination. The discrimination can be better still if all of the template positions affected by the methylated site are taken into consideration (as seen with mC and hmC in Fig. 3). Interestingly, there is a similarity between using multiple observations of the same template position through circular consensus and using multiple affected positions from the same subread. Both lead to transitions from highly overlapping exponential distributions to better separated modal distributions. The combination of circular consensus sequencing and methods (such as principal component analysis) to combine all available information will greatly aid in the extension of this technique to quantitation of variably methylated genomic sites with base-pair resolution.
To demonstrate the application of direct methylation detection to genomic DNA, we mapped the dependence of IPD on sequence context for a C. elegans fosmid, isolated from a DNA adenosine methyltransferase positive (dam+) E. coli strain. To provide an unmethylated control template, a portion of the sample was subjected to whole genome amplification (WGA), which is expected to erase any methylation signatures (Supplementary Fig. 2). The sequencing kinetics of a 3.7-kb section of this fosmid were examined in detail (Fig. 5, Supplementary Fig. 3, Supplementary Data). Over a range of sequence contexts with varying GC content (Fig. 5a), the dam+ samples have average IPDs at GATC positions that are generally greater than those at non-GATC positions (Fig. 5b). In contrast, average IPDs were similar at all template positions within the WGA samples (Fig. 5c). The resulting ratio of average IPDs between the two samples (Fig. 5d) demonstrates that polymerase kinetics are altered substantially by adenosine methylation in a wide variety of surrounding sequence contexts (Table 1). The IPD ratio increase is similar over different GC-content levels for the range represented in this sample (Supplementary Fig. 4). The increase in average IPD caused by adenosine methylation in E. coli was consistent with the range of IPD ratios measured in the synthetic mA templates. Average IPDs over all possible 4-mer sequence contexts in the entire fosmid sample (48 kb including the vector, see Supplementary Fig. 5) show notable context dependence, evident as a non-random profile in the dam+ and WGA heat maps, highlighting the sensitivity of the method for studying DNA polymerase kinetics. The high degree of similarity between the two maps demonstrates the robustness of SMRT sequencing IPD measurements. The notable exception to their similarity is the sequence context GATC, which has a mean IPD ~4× larger in the dam+ samples than in the WGA samples. Extension to much larger genomic samples will be straightforward using future commercial versions of SMRT sequencing instrumentation that have ~102× greater throughput than the prototype instrument25, 30 used in these experiments.
In the experiments described in this paper, both the methylated and control DNA were sequenced for each experiment, but in resequencing applications unmethylated kinetic reference data could be collected just once and tabulated for all subsequent studies of the same species. In the future, de novo detection of methylation may also be possible through tabulation of expected kinetics over a suitable number of contexts or by taking advantage of heuristics that embody the observed trends in SMRT sequencing kinetics.
In SMRT DNA sequencing, measurement of polymerase kinetics occurs directly alongside primary sequence determination and does not require any additional sample preparation steps. We have shown that several forms of methylation in the DNA template, namely mA, mC, and hmC, all alter incorporation kinetics. Based on the same principle, we expect that other epigenetic modifications, as well as various forms of DNA damage, may also be detected by this method. Unique kinetic signatures displayed by each modification will permit discrimination between them within the same DNA sample. By enabling repeated interrogation of individual molecules, circular consensus sequencing allows base-pair resolution and single-molecule sensitivity for detection of mA. For mC and hmC, enhancements of kinetic sensitivity will likely be required. Such improvements could come from optimized solution conditions, polymerase mutations, and algorithmic approaches that take advantage of the kinetic signatures' spread over multiple template positions, while deconvolution techniques will help resolve neighboring mC bases (Supplementary Fig. 6). The long read lengths of SMRT sequencing will likely permit methylation profiling in highly-repetitive genomic regions, where a substantial fraction of mC residues resides6. Combined with single-molecule sensitivity, these long reads will also allow phasing of methylation status between different genomic positions. As we continue to refine this technique, de novo methylation profiling may become possible.
Sets of ~35-base ssDNA oligonucleotides (Figs. 1--44 and Supplementary Fig. 1) were purchased (Trilink Biotechnologies, San Diego, CA). Presence of base modifications within these single-stranded oligonucleotides was verified by mass spectrometry. After hybridization and ligation, each end of the resulting dsDNA oligonucleotides was ligated to a hairpin oligonucleotide. Samples were treated with exonucleases to remove any molecules that were not covalently closed. Sequences for the resulting DNA templates, which were 199 bases in length and consisted of a central 84-bp double-stranded region with single-stranded loops at each end, are shown (Supplementary Note).
For sequencing of the full fosmid (Supplementary Figs. 2 and 5), a fosmid clone (clone id: WRM0639cE06) containing an ~40 kb C. elegans genomic insert was obtained from Geneservice (Cambridge, UK, http://www.geneservice.co.uk/products/clones/Celegans_Fos.jsp) in dam+ E. coli strain EPI300, and cultured and amplified using the inducible origin (CopyControl system, Epicentre, Madison, WI). Fosmid DNA was purified using standard methods. DNA templates were then created directly from fosmid DNA or from whole genome amplified (WGA) fosmid DNA. For WGA libraries, 25 ng of fosmid DNA was amplified using the manufacturer recommended conditions in the GenomiPhi HY DNA Amplification Kit (GE Healthcare, U.K.).
For sequencing of the subsection of the fosmid (Fig. 5 and Supplementary Figs. 3 and 4), an ~3.7 kb segment (corresponding to positions 12797-16484 within the fosmid) containing 13 instances of the GATC sequence context was PCR amplified from the fosmid using Phusion High-Fidelity DNA Polymerase (New England Biolabs, Ipswich, MA) and the following primers: Forward 5′-AGTCCTGATGCTTTCACCAAAT-3′; Reverse 5′-ATTTAGATTGCCAAAGCCGTAA-3′. PCR products were cloned into the pCR-Blunt vector using the Zero Blunt PCR Cloning Kit (Invitrogen, Carlsbad, CA) and propagated in the dam+ E. coli strain TOP10 (Invitrogen, Carlsbad, CA). Approximately 25 ng of the DNA was amplified using the REPLI-g Mini Kit (QIAGEN, Valencia, CA) for the generation of the unmethylated control sample.
Fosmid DNA, or an equivalent quantity of WGA fosmid DNA, was sheared to a mean size of 500 bp (Fig. 5 and Supplementary Figs. 3 and 4) or 200 bp (Supplementary Fig. 5) using an ultrasonicator (Covaris Inc, Woburn, MA). Sheared DNA was then end-repaired with a cocktail of T4 DNA polymerase and T4 polynucleotide kinase, purified, and subjected to 3′ A-tailing with Klenow(exo-). The A-tailed fragments were ligated to hairpin oligonucleotides that contained a single 3′ T overhang and 5′ phosphate. Samples were treated with a mixture of exonucleases to remove any molecules that were not covalently closed. The resulting DNA templates were purified using SPRI magnetic beads (AMPure, Agencourt Bioscience, Beverly, MA) and annealed to a two-fold molar excess of a sequencing primer (5′-GGAGGAGGAGGA -3′) that specifically bound to the single-stranded loop region of the hairpin adapters.
DNA polymerases were generated as described24, 33. Phospholinked dNTPs were generated as described24, 33, with the exception of replacing Alexa Fluor 660 with a modified Cy5.5 fluorophore. Additional modifications included permuting the nucleobases associated with each dye to the following configuration: A555-dT, A568-dG, A647-dA, Cy5.5-dC. The excitation laser lines used were the same as described24, 33. Protocols for DNA polymerase/template complex formation, complex immobilization on the ZMW array, and sequencing reactions were similar to those described previously.
Data collection was performed on a highly parallel confocal fluorescence detection instrument, as previously described24, 30. Pulse calling, which utilized a threshold algorithm on the dye-weighted intensities of fluorescence emissions, and read alignments, achieved using a Smith-Waterman algorithm, have been described24. Reads were filtered after alignment to remove low quality sequences derived from doubly-loaded ZMWs. Interpulse duration (IPD) values were tabulated from consecutive pairs of correctly aligning template positions and were assigned to the second template position in the pair. Pulse width (PW) values were computed as the duration of the pulses associated with correctly aligning base calls. To avoid outlier effects, the smallest and largest five percent of IPDs and PWs at each position were excluded from all analyses.
Bar plot error bars (Fig. 2) represent an estimate of the standard error of the mean IPD ratio, computed by bootstrapping 10 randomly selected subsamples of 10% of the data. It can be seen that the error bars underestimate the error somewhat, as small excursions do occur at positions far away from any modification, where differences from the control are not expected. Molecular coverage varied between template, with an average coverage of 346, 504, and 393 for the 10% subsamples of the mA, mC, and hmC templates, respectively. No bootstrapping was performed for creation of the PW plots (Supplementary Fig. 1), for which all molecules were used.
Standard principal component analysis34 was carried out using the prcomp function from the Stats Package of the statistical computing program, R35. Input variables were scaled to have zero mean and unit variance, and the resulting first and second principal components were determined from the entire data set (see Supplementary Table 1). To generate the principal component scatter plot (Fig. 3), 500 subsets of 20% of the data for each template were first projected onto these first two principal components. These values were then converted into a z-score by subtracting the mean and dividing by the standard deviation of all 1500 data points for each principal component.
IPD distributions (Fig. 4) were determined by averaging multiple IPD measurements at the same template position within single molecules. All molecules from the mA experiment were used. The corresponding ROC curves were generated by sliding a threshold value across the full range of observed average IPDs. The true positive rate was computed for each threshold as the fraction of methylated observations with an average IPD larger than the threshold. Similarly, the false positive rate was determined by the fraction of non-methylated observations with an average IPD larger than the threshold.
For the fosmid experiments (Fig. 5, Supplementary Figs. 3-5), GATC positions are defined as those positions at which a T is incorporated opposite a template A that is within a template GATC context. Non-GATC positions correspond to all other positions. IPD ratios at each position were normalized by the ratio of the average IPD over all dam+ reads to the ratio of the average IPD over all WGA reads. Average sequencing coverage for the 3,688-bp fosmid region analyzed in this figure was 121-fold for the dam+ sample and 91-fold for the WGA sample. This coverage was obtained using nine ZMW arrays (SMRT™ Cells) and a total of ~1.5 hours of sequencing for each sample (dam+ and WGA).
Local sequence context (Supplementary Fig. 5) was determined using the standard Smith-Waterman alignment algorithm. The ‘local context’ of detected bases was defined as the two bases previously detected (shown on the left axis), the detected base itself and the next base detected afterwards (both shown on the bottom axis). For example, in the local context 5′-GATC-3′, the mean IPD reported describes the average duration between the detected A (complementary to a T in the template DNA) and the detected T (complementary to an A or mA in the template). On average, 1890 observations (remaining after removal of the smallest and largest five percent) were used to compute the mean IPD for each of the 256 possible 4-mer contexts, corresponding to 10-fold coverage of the entire fosmid.
We thank the entire staff at Pacific Biosciences, in particular J. Londry and D. Kolesnikov for sample preparation; E. Mollova, M. Berhe, and J. Yen for running sequencing experiments; J. Sorenson, J. Chin, A. Kislyuk, and D. Holden for help with data analysis; and E. Schadt and J. Eid for helpful discussions. Supported by National Human Genome Research Institute grant 1RC2HG005618-01.
Author Contributions: B.A.F., K.T., J.K., J.L., and S.W.T. designed the experiments. E.O. and T.A.C. prepared fosmid library constructs. B.A.F. conducted the sequencing experiments. D.W. and B.A.F. performed data analysis. B.A.F., J.K., S.W.T., E.O, D.W., and T.A.C. wrote the manuscript.
Competing Interests Statement: The authors declare competing financial interests: all of the authors are employees of Pacific Biosciences, a private company commercializing DNA sequencing technology.