Cytosine (C) DNA methylation plays a crucial role in various biological processes such as gene expression, chromatin accessibility, and imprinting, as well as in many diseases including cancer. Over the decades, bisulfite sequencing [1
] has remained the gold standard for DNA methylation analysis. Bisulfite treatment of DNA followed by PCR amplification leads to a chemical conversion of unmethylated Cs to Ts without affecting As, Gs, Ts or methylated Cs. This C to T conversion results in non-complementarity in the two strands of DNA (Figure ). Following strand-specific and locus-specific PCR amplification, direct- or pyro-sequencing is used to determine the methylation ratio of any given C locus as the proportion of remaining Cs in all the sequencing reads. This PCR-based procedure is very labor intensive and time-consuming, and therefore inappropriate for high throughput studies.
Figure 1 Pipeline of bisulfite sequencing. 1) Denaturation: separating Watson and Crick strands; 2) Bisulfite treatment: converting un-methylated cytosines (blue) to uracils; methylated cytosines (red) remain unchanged; 3) PCR amplification of bisulfite-treated (more ...)
Tremendous progress has been achieved in the past two years in the development of massively parallel sequencing such as Illumina/Solexa and Applied Biosystems/SOLiD. Tens of millions of short tags (36–75 bases) can now be simultaneously sequenced at less than 1% of the cost of traditional Sanger methods. Without the locus-specific PCR step, bisulfite treatment coupled with next-generation shotgun sequencing (BS-seq) [2
] has become a powerful technique with the potential to quantitatively detect the methylation pattern of every C in the genome. Nevertheless, the mapping of millions of bisulfite reads to the reference genome remains a computational challenge.
First, the searching space is significantly increased relative to the original reference sequence. Unlike normal sequencing, the Watson and Crick strands of bisulfite-treated sequences are not complementary to each other because the bisulfite conversion only occurs on Cs. As a result, there will be four distinct strands after PCR amplification: BSW (bisulfite Watson), BSWR (reverse complement of BSW), BSC (bisulfite Crick), and BSCR (reverse complement of BSC) (Figure ). During shotgun sequencing, a bisulfite read is almost equally likely to be derived from any of the four strands.
Second, sequence complexity is reduced. In the mammalian genome, although ~19% of the bases are Cs and another 19% are Gs, only ~1.8% of dinucleotides are CpG dinucleotides. Because C methylation occurs almost exclusively at CpG dinucleotide, the vast majority of Cs in BSW and BSC strands will be converted to Ts. Therefore, most reads from the above two strands will be C-poor. However, PCR amplification will transcribe all Gs as Cs in BSWR and BSCR strands, so reads from those two strands are typically G-poor and have a normal C content. As a result, we expect the overall C content of bisulfite reads to be reduced by ~50%.
Third, C to T mapping is asymmetric. The T in the bisulfite reads could be mapped to either C or T in the reference but not vice versa. This phenomenon not only increases the search space for mapping but also complicates the matching process (Figure ). Efficient implementation of such asymmetric C/T matching is critical for mapping high-throughput bisulfite reads to the reference genome and is still lacking in current short read alignment software.
Figure 2 Mapping of bisulfite reads. 1) Increased search space due to the cytosine-thymine conversion in the bisulfite treatment. 2) Mapping asymmetry: thymines in bisulfite reads can be aligned with cytosines in the reference (illustrated in blue) but not the (more ...)
A common approach to overcome these issues is to convert all Cs to Ts and map the converted reads to the converted reference; then, the alignment results are post-processed to count false-positive bisulfite C/T alignments as mismatches, where a C in the BS-read is aligned to a T in the reference [2
]. Although this all-inclusive C/T conversion is effective for reads derived from the C-poor strands, it is not appropriate for reads derived from the G-poor strands, where all the Cs are actually transcribed from Gs by PCR amplification and thus could not be converted to Ts during bisulfite treatment. During shotgun sequencing, however, a bisulfite read is almost equally likely to be derived from either the C-poor or the G-poor strands. There is no precise way to determine the original strand a bisulfite read is derived from. Furthermore, by ignoring the C/T mapping asymmetry, this strategy generates a large number of false-positive bisulfite mappings and greatly increases the computational load in a quadratic manner with an increase in the size of the reference sequence. In order to accurately extract the true bisulfite mappings in the post-processing stage, all mapping locations have to be recorded, even the non-unique mappings. Therefore, this approach is only practical for small reference sequences, where only the C-poor strands are sequenced. For example, Meissner et al. used this mapping strategy for reduced representation bisulfite sequencing (RRBS) [2
], where the genomic DNA was digested by the Mspl restriction enzyme and 40–220 bp segments were selected for sequencing. The reference sequence (~27 M nt) is only about 1% of the whole mouse genome, covering 4.8% of the total CpG dinucleotides.
Lister et al. used another bisulfite mapping strategy in their study of DNA methylation in Arabidopsis
]. Bisulfite reads were aligned to 3 reference sequences, namely the original genome and the two bisulfite-converted genomes with Cs converted to Ts for the forward strand and Gs converted to As for the reverse strand. The mapping results for all three references were combined to generate methylation information. Unlike the strategy used by Meissner et al. [2
], this approach does not change the bisulfite read but rather captures possible bisulfite C/T alignments by allowing them to be mismatches. The biggest drawback of this approach is that the number of bisulfite C/T alignments that can be detected in a read is bounded by the number of mismatches allowed by the mapping software; this number might be further reduced by true mismatches such as SNPs. In this study, the bisulfite read length was 32 bp and 2 mismatches were allowed, so reads with more than 2 bisulfite C/T alignments, most of which were derived from CpG islands, were not detectable. This strategy could substantially compromise the mapping sensitivity.
Naturally, it is desirable to map bisulfite reads directly to the reference sequence. Cokus et al. [4
] used a custom-made tool, CokusAlignment, to map bisulfite reads to the Arabidopsis
genome. This method is based on a tree searching algorithm, which is both computationally intensive and memory demanding. CokusAlignment runs at a moderate speed (~25 reads/sec/CPU) with a relatively small genome (~120 M nt) by applying many project-specific optimizations, which might not be applicable to larger genomes or longer bisulfite reads. In addition, CokusAlignment does not support basic alignment functions such as gapped or pair-end alignment. From a practical point of view, this method is not suitable for general purpose bisulfite sequence mapping due to its slow speed, lack of functions, and excessive hardware requirements, especially for large genomes. To our knowledge, an efficient, multifunctional, general-purpose bisulfite sequence mapping software is not yet available, and the lack of such a tool has become a major bottleneck for whole-genome DNA methylation profiling using bisulfite sequencing.
We present here a general-purpose Bisulfite Sequence MAPping program, BSMAP, which addresses all the above issues. We used the general premise that all the C positions in the genome, where the asymmetric C/T transition can occur, are already known and can be used to guide the mapping of bisulfite reads. BSMAP masks Ts in the bisulfite reads as Cs (i.e., reverse bisulfite conversion) only at C positions in the original reference while keeping all other Ts in the bisulfite reads unchanged. BSMAP then maps the masked BS read directly to the reference. The asymmetric C/T conversion is achieved through position-specific bitwise masking of the bisulfite reads; this method is extremely fast. In addition, BSMAP is based on the more efficient HASH table seeding algorithm, which indexes the reference for all possible k-mers, called seeds, and only searches the locations where the seeds are perfectly matched with part of the read. By looking up the seed table, the majority of non-mapping positions are skipped, and the searching efficiency is greatly improved. The seeding length and patterns can also be adjusted to allow for a different number of mismatches. Because of these advantages, the seeding algorithm has been incorporated into most short read mapping software, including SOAP [5
], ELAND, MAQ [6
], and RMAP [7
]. By combining fast seeding and bitwise masking (Figure ), BSMAP offers a great improvement in performance as well as flexibility over the existing bisulfite mapping approaches.
Figure 3 BSMAP algorithm. A) Bisulfite seed table, using the original seed and bisulfite variants as keys and corresponding coordinates in the reference genome as values. Each read was looked up in the seed table for potential mapping positions. B) A positional (more ...)