Parallel target capture with padlock probes
Padlock probes were previously used for exon capture and resequencing15
. As in the eMIP method, our approach to targeted bisulfite sequencing involves the in situ
synthesis of long (~150 nt) oligonucleotides on programmable microarrays, followed by their cleavage and enzymatic conversion into padlock probes. A library of padlock probes is annealed to the template DNA, circularized, and amplified by PCR before shotgun sequencing ().
Figure 1 Targeted bisulfite sequencing with padlock probes. (a) Each padlock probe has a common linker sequence flanked by two target-specific capturing arms (H1 and H2). H1 and H2 are melting temperature–normalized, and a spacer sequence is included to (more ...)
There are, however, two major challenges in performing padlock capture for bisulfite sequencing. First, bisulfite treatment converts all unmethylated cytosines into uracils, resulting in marked reduction of sequence complexity. Achieving specific target capture on bisulfite-converted DNA is more difficult than on native genomic DNA. Second, we initially observed a low capturing sensitivity, high bias and random losses of alleles with the eMIP method15
. Obtaining accurate and efficient quantification of DNA methylation was not possible with the existing protocol, especially with the presence of allelic drop-outs.
We designed 10,582 padlock probes, each capturing a 175- to 225-bp region, including 9,350 probes covering 2,020 CpG islands (Supplementary Table 1
online) on human chromosomes 12 and 20, 705 probes covering 237 promoters in eight ENCODE (the Encyclopedia of DNA Elements) regions, and 527 probes targeting 4-kb regions centered on the transcription start sites (TSS) of 26 genes related to development or pluripotency (Supplementary Table 2
online). The total size of captured fragments was 2.1 Mbp, representing 0.064% of the human genome. Because some probes contain CpG sites within the capturing arms, we iterated all possible C/T combinations on these CpG sites, and synthesized a total of 30,000 non-degenerate probes. We chose to perform the proof-of-concept study focusing mostly on CpG islands primarily because they represent a relatively well-defined set of genomic features in the human genome annotation. To increase the sensitivity and reduce bias, we increased the probe/target ratio by more than tenfold, extended the reaction time and added an additional five cycles of circularization compared with the published protocol15
. To integrate construction of sequencing libraries with padlock capture, we developed a new method that uses a combination of uracil-specific excision reagent (USER) enzymes and S1 nuclease to create fragments with random ends ().
We first validated the targeted bisulfite sequencing method by capturing bisulfite-converted Jurkat cell genomic DNA with all 30,000 padlock probes in a single-tube reaction. PCR amplicons from the circularization reactions were template specific, and PAGE analysis showed the expected size distribution (). We estimated the specificity of the capturing reaction by ligating captured DNA fragments to a sequencing vector, cloning these into Escherichia coli, and sequencing 96 clones. Of 89 high-quality Sanger sequencing reads obtained, 80 were from the targeted regions, indicating a specificity of 90%. We then used an Illumina Genome Analyzer to sequence the ends of the captured fragments. We mapped 5.5 millions reads to 10,364 of 10,582 targets, which translate to a sensitivity of 98%. These results indicate that padlock probes can specifically extract a large set of genomic targets for single-molecule bisulfite sequencing.
Normalization of padlock-captured DNA fragments
Although 98% of the targets were observed at least once in the end-sequencing analysis, the abundance of different captured fragments varied across a 10,000-fold range. Analysis of variance revealed that the bias resulted from a combination of factors, including GC content and length of the ligation arms, and the size of the targets to be captured (Supplementary Fig. 1 online). To normalize the relative abundance among different DNA fragments, we used a combination of two strategies: ‘subsetting’ and ‘suppressor oligos’ (). All 30,000 padlock probes were ranked based on the capturing efficiency determined by end sequencing, and divided into four subsets, two containing 5,000, and two containing 10,000, oligos. The three less efficient subsets were resynthesized. For each DNA sample, four capturing reactions were performed separately using probes from the original set of 30,000 and the three resynthesized subsets. The PCR amplicons from the capturing reactions were pooled in equal molar ratios before constructing a shotgun sequencing library (). This subsetting strategy increased the relative abundance of less efficient targets by orders of magnitude. For reasons that we still do not fully understand, a very small number of probes were extremely efficient. For example, the top 48 (0.016%) most efficient probes account for 13.3% of mappable reads in the end-sequencing analysis.
Figure 2 Normalization of padlock-capturing efficiency. (a) The ‘subsetting’ strategy. The 30,000 probes were divided into four sets (5 k, 10 k, 10 k, 5 k). The three less efficient sets were resynthesized. We reused the original 30,000-probe set (more ...)
Although the subsetting strategy allowed us to adjust relative abundance among several relatively large subsets of probes, we also needed a method to specifically reduce the efficiency of a small number of probes in a library. For this, we designed a set of 48 suppressor oligos, which contained chimeric sequences: the 5′ region was reverse complementary to the extension arm H2, and the 3′ region contained a short sequence unrelated to the ligation arm H1. When these suppressor oligos were mixed with padlock probes in a high molar ratio (100-fold molar excess of suppressor oligos), the 48 most efficient probes tended to anneal to the suppressor oligos, were extended from the 3′ ends and yielded linear-extended sequences that were removed in the subsequent exonuclease digestion (). We tested this normalization strategy on the same bisulfite-converted Jurkat cell DNA, performed end sequencing on the captured DNA fragments and obtained 2.2 million mappable reads. The effect of normalization was obvious: the fraction of probes with at least half of the average abundance increased from 31% to 49%; the average efficiency for the 48 most abundant probes was reduced by fivefold ().
Accuracy of methylation quantification
To validate the measurement accuracy of our method, we took advantage of the built-in redundancy in our probe design. Each CpG island was covered by multiple probes targeting partially overlapping DNA fragments on alternating strands (). The CpG sites in the overlapping regions were captured independently from two DNA strands with different probes. Because the sequencing reads were mapped in a strand-specific manner and CpG methylation is symmetric on the two DNA strands13
, the accuracy of the assay can be determined by comparing the methylation level of these CpG sites on the two strands. For 2,697 such CpG sites that were covered by >50 sequencing reads, the Pearson correlation coefficient (R
) was 0.987 (Supplementary Fig. 2a
online). To confirm the measurement accuracy with an independent method, we quantified the methylation levels of 182 randomly selected CpG sites with conventional bisulfite Sanger sequencing. Correlation between the two assays was high (R
= 0.975; Supplementary Fig. 2b
). Finally, we compared the methylation measurements from two batches of IMR90 fibroblast cultures. Excellent correlation was observed between the biological replicates (R
= 0.970; Supplementary Fig. 2c
). Taken together, these three validation experiments indicate that our assay is highly robust.
The CpG sites characterized in this study were based on the human reference genome (UCSC hg18). A small fraction of such sites could be different in our samples owing to the presence of genetic variations. Particularly, C → T transitions at CpG sites cannot be distinguished from unmethylated CpG sites after bisulfite conversion. Such transitions are detectable, however, if the sequences of the reverse complementary strand are available. We would expect to see asymmetric dinucleotides: TG on one strand and TA on the other. We searched for sites with such an asymmetric pattern in the subset of regions that were captured and sequenced from both strands, and found from 11 to 38 transitions in each sample (Supplementary Table 3 online). The majority (57–90%) of such transitions are known C/T polymorphisms in the NCBI single-nucleotide polymorphism (SNP) database. Because Hues6-BJ-Hybrid1 is a tetraploid, the number of transitions in this hybrid line is roughly twice as many as in other cell lines. As C → T transitions represent only 0.13–0.32% of CpG sites, the resulting artifacts are negligible.
Padlock capture yields relatively short products (insert sizes of 175–225 bp). As such small DNA fragments cannot be further fragmented using conventional DNA shearing methods, such as nebulization or hydra-shearing, we developed an enzymatic fragmentation method for constructing shotgun sequencing libraries. To validate this method, we compared enzymatic fragmentation with adaptive focused sonication16
. Two sequencing libraries were made from the same padlock-captured DNA (Hues6-BJ-Hybrid1) using the two methods, and sequenced. The methylation levels are highly consistent between the two libraries (Pearson R
= 0.994 for the 22,937 CpG sites covered with>50 sequencing reads in both libraries; Supplementary Fig. 3a
online). We also characterized the average sequencing coverage across the capturing targets. Both methods exhibit bias toward the center of the sequences, which was commonly observed in fragmentation of DNA with defined sizes. However, our enzymatic fragmentation protocol produced more even coverage across the captured DNA fragments (Supplementary Fig. 3b
Single-molecule analysis of DNA methylation in iPS cells
To demonstrate the utility of targeted bisulfite sequencing, we characterized the changes of chromosome-wide methylation status during reprogramming of human fibroblasts to pluripotent cells. We performed the methylation assay on three sets of fibroblasts and iPS cells from three laboratories: IMR/IMR90-iPS17
reprogrammed with four factors (Oct4, Sox2, Nanog and Lin28); hFib2/hFib2-iPS18
reprogrammed with a different set of factors (Oct4, Sox2, Klf4 and Myc); BJ/BJ-iPS11/BJ-iPS12 (ref. 19
) reprogrammed with the five factors (Oct4, Sox2, Nanog, Klf4 and Myc) controlled by an inducible promoter. We also characterized a line of hybrid stem cells (BJHues6-Hybrid1)20
, which were reprogrammed by fusing the human fibroblasts (BJ) with hES cells (Hues6), as well as three hES cell lines (Hues12, Hues42, Hues63). We performed bisulfite conversion, padlock capture and construction of shotgun sequencing libraries on each DNA sample. We sequenced each library in one lane in the flow cell of Illumina Genome Analyzer, and obtained 2–3 million reads that were mapped to the targeted regions (). The bisulfite conversion rates were >98.5%. To avoid stochastic sampling drift, we removed CpG sites that were covered by <10 reads from the following analyses.
Targeted bisulfite sequencing of three human fibroblast lines, four iPS cell lines, one hybrid stem cell line and three hES cell lines.
DNA methylation in 2,020 CpG islands
The global methylation patterns in all 12 samples (11 cell lines plus a biological replicate on IMR90 fibroblasts) were visualized using the UCSC Genome Browser. The chromosome-wide patterns of CpG island methylation were highly similar among all the cell lines (). Globally, the methylation level of CpG dinucleotides followed similar bimodal distribution: 67% were weakly methylated (<20% methylation), 22% were highly methylated (>80% methylation) and the remaining 11% have intermediate levels of methylation (Supplementary Fig. 4 online). To distinguish CpG islands with different methylation patterns, we generated a histogram (bin size = 0.05) for the distribution of methylation on all CpG sites within a CpG island. Treating such histograms as 20-component vectors, we performed hierarchical clustering to partition the CpG islands, and divided all CpG islands into three clusters based on the similarity of distribution between pluripotent and fibroblast lines (Supplementary Fig. 5 online): in cluster 1, the CpG islands (1,451; 77.3%) have similar distributions in the two cell types (R > 0.5); in cluster 2, CpG islands (252; 13.4%) have less similar distributions (0.5 ≥ R > 0); in cluster 3, CpG islands (173; 9.2%) are anti-correlated (R ≤ 0). Therefore, only a small fraction of CpG islands show cell-type-specific methylation.
Figure 3 Patterns of CpG methylation in fibroblasts and pluripotent cells. (a,b) Patterns of CpG island methylation on chromosomes 12 and 20 in twelve samples. Red indicates highly methylated CpG, and green indicates weakly methylated CpG. (c) Correlation between (more ...)
Because CpG islands are not defined in a functional manner, we divided the CpG islands into three categories. The first comprises CpG islands in the regions from 2 kb upstream to 500 bp downstream of TSS. These ‘upstream regions’ often include promoter regions. The second class (‘gene body CpG islands’) comprises CpG islands in the regions from 500 bp downstream of TSS to the ends of the last exons. The final category comprises CpG islands outside of gene body and promoter regions. CpG islands in each category were further divided into three groups according to CpG density. Consistent with previous findings, most (91.8%) CpG islands in promoter regions were weakly methylated (<20% methylation), 3.4% were highly methylated (>80% methylation) and the remaining 4.8% showed an intermediate level of methylation (20–80% methylation). The distributions are quite similar among the three groups with different CpG densities (Supplementary Fig. 6a online). In contrast, only 45.2% of CpG islands in the gene body were weakly methylated, whereas roughly one-third of them (37.7%) were highly methylated. Methylated CpGs tend to locate in islands with low CpG density (Supplementary Fig. 6b). In regions outside of gene body and promoter regions, we found more weakly methylated CpG islands (58.9%) than highly methylated CpG islands (26.6%) (Supplementary Fig. 6c). Similarly, CpG islands with low CpG density were more methylated. There were 80 genes in our data set that contained both promoter-region CpG islands and gene-body CpG islands. Sixty-two of these genes were weakly methylated in promoter regions. Among these, 48.4% were highly methylated in the gene body and 29.0% displayed weak gene-body methylation.
CpG island methylation and gene expression
As methylation at CpG dinucleotides is considered an important transcriptional regulatory mechanism, we sought to characterize the effects of methylation at different CpG sites on gene expression. We used the Illumina HumanRef BeadArray to profile gene expression of nine cell lines (hFib, hFib-iPS, BJ, BJ-iPS12, IMR90, IMR90-iPS, Hues12, Hues42, Hues63) and then investigated the distribution of functional methylation sites relative to gene structures. For both chromosomes 12 and 20, we computed the Spearman’s rank correlation between the average methylation level of all CpG sites in 500-bp windows and the expression level of its corresponding genes. The 500-bp windows were moved relative to the TSS of the corresponding genes with a step size of 200 bp from 10 kb upstream to 10 kb downstream, excluding genes with multiple TSS. Windows that overlap with adjacent genes in the upstream or downstream regions were also removed from the analysis. To exclude artifacts resulting from uneven sampling of CpG sites only from CpG islands, we permutated the gene expression values to establish the empirical distribution of background correlation coefficients. These were then used to estimate the significance of correlation between DNA methylation and gene expression. We found that methylation in a 2.4-kb region (TSS upstream 1.0 kb to downstream 1.4 kb, P
) showed the strongest negative correlation with gene expression (). The methylation level and gene expression became positively correlated in the 2- to 10-kb region downstream of TSS, suggesting that active genes tend to be unmethylated around the TSS, but methylated in the gene body. This is consistent with the observation of gene body methylation in actively transcribed genes21–23
. We also observed significant correlation (P
< 0.01) between gene expression and methylation in other regions, which could be related to the chromatin structure of active genes or anti-sense gene expressions ().
Differentially methylated regions associated with reprogramming
We next sought to identify regions that exhibit methylation differences between fibroblasts and pluripotent cells. The weakly or negatively correlated CpG island clusters identified above were based on the overall distribution of CpG methylation in CpG islands. The single-nucleotide resolution data obtained in this study allowed us to search for regions with specific changes independent of the definition of CpG island. We started by examining the methylation levels in the TSS flanking regions associated with the 26 selected genes. Most of these genes’ overall methylation frequencies were similar between pluripotent cells and fibroblasts but there were a few genes (e.g., OCT4 (also known as POU5F1), DNMT3B, and NANOG) for which methylation frequencies were noticeably different between the two cell types (). We found that these genes contained differentially methylated regions (DMRs) that were separated by stretches of CpG sites with little methylation difference (). To search for DMRs with similar patterns in all the regions included in this study, we performed K-mean clustering (k = 2) on the 12 data sets based on the methylation level of all CpG sites in 400-bp windows sliding along the chromosomes with the step size of 100 bp. Excluding windows that contain <3 CpG sites or >30% missing values, we examined a total of 18,434 windows. We reasoned that, for a true DMR, the eight data sets from pluripotent lines and four sets from fibroblast lines should be distinguished based on the methylation of all CpG sites in the region. Therefore, windows in which partitioning is consistent with the phenotypic difference were considered as candidate DMRs. We also required that the difference of median methylation should be at least 0.1 between stem cells and fibroblasts. A total of 1,273 partially overlapping DMRs were found.
To evaluate the robustness of this procedure, we randomly shuffled the 12 data sets, and applied the same algorithm on the permutated data where at least three samples were placed into incorrect categories. We found an average of 6.8 DMRs in 100 permutations, suggesting a false-positive rate of 0.53%. After combining multiple overlapping DMRs, we found 288 discontinuous DMRs (44 from ENCODE promoters, 16 from selected genes, 228 from CpG islands) that were associated with reprogramming. To identify associated genes, we mapped these DMRs to regions from 10 kb upstream to 10 kb downstream of the TSS of RefSeq genes. We mapped 132 DMRs to 155 genes, including 10 from the 26 selected genes, 49 in the ENCODE regions, and 96 on chromosomes 12 and 20 (Supplementary Table 4
online). We then looked for the enrichment of genes in certain functional categories. Because some targets were chosen based on known functions, we focused on a subset of 125 genes on chromosomes 12 and 20, and five ENCODE regions (ENm004, ENm005, ENm007, ENr123, ENr333) to avoid sampling bias. We performed gene ontology analysis using DAVID tools24
. At a false-discovery rate of < 0.05, we found six enriched functional categories related to ion transport for the 94 genes associated with an elevated level of methylation in pluripotent cells. The other 31 genes associated with reduced methylation in pluripotent cells were enriched for eight functional categories related to transcription, metabolic and developmental regulations (Supplementary Table 5
hES cells have unique gene expression and methylation signatures3,25
. To test whether we can distinguish pluripotent cells based on the methylation profiles established in this study, we performed hierarchical clustering on the methylation level on all CpG sites with at least 50× sequencing coverage. Fibroblasts and iPS/ES cell lines were clearly grouped into two different clades (). In addition, the iPS and hES cell lines had subtle yet noticeable differences, with the Hues6-BJ hybrid line more closely resembling hES cells. Interestingly, the similarity between hES cell lines and fibroblasts were slightly higher than that between iPS cell lines and fibroblasts (see Supplementary Table 6
online for the similarity matrix). Globally, both iPS and hES cell lines were more methylated than fibroblasts; and iPS cell lines were more methylated than hES cell lines (Supplementary Fig. 7
online). Locally, key pluripotent genes were less methylated in pluripotent cells (). These results suggest that iPS cells were reprogrammed into an epigenetic state that made them less similar to fibroblasts than hES cells are. Possibly, the persistent overexpression of transcription factors from integrated viral transgenes accounts for the more substantial shift in the overall epigenetic state (Supplementary Fig. 8
online). Alternatively, such a difference could be due to the minor variations in iPS derivation and hES culture among different laboratories. Further characterization of a larger panel of iPS/hES cell lines with similar passage numbers under standard culture conditions is required to reach a generalized conclusion on whether the epigenetic states of iPS cells differ systematically from those of hES cells.