Cytosine methylation of DNA plays an important role in mammalian gene regulation, chromatin structure and imprinting during normal development and the development of pathological conditions such as cancer. With the dramatic increase in throughput made possible by next-generation DNA sequencing technologies, sodium bisulfite conversion followed by massively parallel sequencing (Bisulfite-seq) has become an increasingly popular method for investigating epigenetic profiles in the human genome (reviewed in [
1]). Several sequencing strategies have been applied that vary in terms of cost and the regions of the genome covered. Reduced Representation Bisulfite-Seq (RRBS [
2]) uses restriction fragment size selection to select a portion of the genome enriched for CpG Islands and gene regulatory sequences. Bisulfite Padlock Probes (BSPP [
3]) or solution-based hybridization capture (Agilent, Inc., Santa Clara, CA, USA) can be designed for customizable selection of hundreds of thousands of regions throughout the genome. Whole-Genome Bisulfite-Seq (WGBS [
4]) is the most comprehensive technique, covering more than 90% of cytosines in the human genome. Bisulfite-seq is well-suited to the investigation of epigenetic changes from clinical tissue samples [
5,
6], and can be applied to very small quantities of DNA [
7] including formalin-fixed samples [
8]. WGBS and RRBS data have been used to profile a number of cell lines and human tissues by large sequencing consortia including the ENCODE project [
9], the NIH Epigenomics Roadmap, and The Cancer Genome Atlas (TCGA), and these datasets are publicly available for download.
Bisulfite treatment of DNA converts unmethylated cytosines to uracils, which are replaced by thymines during amplification. This dramatic change to sequence composition necessitates specialized software for almost all sequence analysis tasks. Typically, the first step in processing high-throughput sequencing data is to map and align each read to the correct location in the reference genome (genome mapping), and a number of powerful tools have been developed to map bisulfite-converted reads (reviewed in [
10]). The next step is to identify differences between the reference genome and the sample genome, including single-nucleotide polymorphisms (SNPs) and insertion/deletion events (indels). The identification of SNPs has been an active area of research and a number of powerful statistical tools have been developed for SNP calling of non-bisulfite sequencing data [
11-
13]. SNP calling of bisulfite sequencing data has significant complications. First, reads from the two genomic strands are not complementary, and this assumption of complementarity is made by all SNP calling algorithms. Second, true (evolutionary) C>T SNPs in the sample cannot be distinguished from C>T substitutions that are caused by bisulfite conversion, and can thus be misidentified as unmethylated Cs. Consequently, identification of such SNPs is important for accurate quantification of methylation levels, especially so given the fact that C>T is the most common substitution in the human population (65% of all SNPs in dbSNP) and these usually occur in the CpG context [
14].
Accurate SNP calling at the positions immediately surrounding a cytosine is equally important. Those nucleotides lying one or two positions 3' of the cytosine are particularly critical, as they are subject to the specificity of particular methyltransferases. These methyltransferase-specific context positions can be organism or cell type specific. In mammals, CpG dinucleotides are often highly methylated in most cell types, while CpA dinucleotides have much lower methylation levels and are cell type restricted [
4,
15]. In plants, by contrast, CHG trinucleotides are often methylated [
16,
17]. Other sequences within a slightly wider genomic neighborhood can also have strong
cis effects on methylation, perhaps due to the presence of key regulatory motifs [
18]. Heterozygous SNPs in proximity to cytosines can be used to reveal widespread allele-specific methylation patterns [
19] and important regulatory changes such as loss of imprinting [
20-
22].
Despite the great interest in Bisulfite-seq and the availability of a number of tools for genomic mapping, no adequate software exists for SNP calling [
10]. In order to overcome the difficulty in identifying SNPs in bisulfite-treated sequences, some groups have relied on matched non-bisulfite sequencing data in the same sample [
23-
25]. Others have used non-bisulfite SNP microarrays [
26,
27], or used study designs relying on isogenic mouse strains with known parental genotypes [
22,
24].
A key property of some bisulfite-related protocols is that G nucleotides on the strand opposing a C are not affected by conversion. This strand-specificity principle has been exploited in order to distinguish bisulfite conversion from C>T SNPs [
28]. The Illumina-based protocol currently being used in most Bisulfite-seq studies has this important property, and thus it has been classified as a
directional bisulfite-seq protocol [
10].
Non-directional protocols (those that also result in G>A substitutions) have been used [
17], but have not been widely adopted. Figure illustrates the directional protocol, where approximately half the reads at a given cytosine position (those mapping to the 'C-strand') can be used for methylation quantification but cannot distinguish C>T SNPs. The other half (those mapping to the 'G-strand', boxed in Figure ) yield no methylation information but can be used to identify C>T SNPs. When these C>T SNPs are heterozygous, they can be used in the analysis of allele specific methylation (Additional File
1).
The inherent directionality of Illumina Bisulfite-seq has thus far been used only in a limited and
ad hoc way. The Salk Institute group filtered out cytosines which did not have one or more unconverted Cs on the C-strand, but this approach can result in lost information about completely unmethylated cytosines (which play a crucial role in gene regulation) [
4,
29]. Our own group filtered out reference Cs if opposing reads contained As, but the number of such A reads required was somewhat arbitrary [
6]. A third group removed all C/T reads on the C-strand, and called SNPs by requiring a minimum number of reads containing two different alleles [
30]. Importantly, none of these so-called 'k-allele' approaches took advantage of base calling quality scores, which have been shown to be extremely important for distinguishing true SNPs from sequencing errors [
31]. Others used various methods that did not attempt to identify C/T or other SNPs occurring at cytosines [
3,
20,
21]. Such methods may be useful for analyzing allele-specific patterns in a limited way, but do not address the need to improve methylation quantification by identifying SNPs.
Here, we describe a probabilistic SNP caller,
Bis-SNP, that is based on methods that have proven successful in non-bisulfite SNP calling [
12,
13]. Bis-SNP uses Bayesian inference to evaluate a model of strand-specific base calls and base call quality scores, along with prior information on population SNP frequencies, experiment-specific bisulfite conversion efficiency, and site-specific DNA methylation estimates. It also takes advantage of base call quality score recalibration, an addition that has greatly improved SNP calling in the non-bisulfite context [
12]. Bis-SNP is open-source and based on the GATK framework [
32], which takes advantage of the parallel Map-Reduce computation strategy and provide practical execution times. Bis-SNP accepts either single-end or paired-end mapped Bisulfite-seq data in the form of BAM files, and outputs SNP and methylation information using standard file formats. We show that Bis-SNP is a practical tool that can both (1) improve DNA methylation calling accuracy by detecting SNPs at cytosines and adjacent positions, and (2) identify heterozygous SNPs that can be used to investigate mono-allelic DNA methylation and polymorphisms in cis-regulatory sequences.