New methods that assay epigenetic modifications across the whole genome promise to reveal insights into cell differentiation and development [1]–[15]. Moreover, incorporating genome-scale epigenetic data into case-control studies is now becoming feasible and has the potential to be a powerful tool in the study of disease [16]. Recent evidence suggests that epigenetic variation is heritable and may underlie phenotypic variation in humans ([17]; our own observations with the human and chimpanzee methylomes). Such comparative studies rely both on the ability to obtain genome-scale epigenomic information cheaply and efficiently, and on the availability of methods for analyzing the data produced.
Cytosine methylation, which in vertebrates is mostly confined to CG dinucleotides, cooperates with other epigenetic modifications to suppress transcription initiation [3], [18]. (In this paper we denote a cytosine that is followed by a guanine as CG rather than CpG; similarly, CCGG is equivalent to CpCpGpG. We leave the notation for CpG islands unchanged.) In vertebrates, most CGs are methylated. However, early experiments with the methylation-sensitive restriction enzyme HpaII showed that unmethylated CGs are clustered in “HpaII Tiny Fragment Islands” [19]. These unmethylated islands are frequently active promoter elements. Methods used to annotate them on a genomic scale have been based only on sequence composition, because until recently genome-scale assessment of HpaII fragments was not practicable. The methylation status of these regions, known as CpG islands, has therefore not been considered in their annotation and is generally unknown. A genome-scale survey of the methylation status of CGs would enable CpG islands to be annotated by their methylation states rather than by their sequence. Patterns of unmethylated islands differ among tissues, and changes in the methylation states of certain regions are associated with disease, particularly cancer [2], [3], [20]–[22].
High-throughput sequencing technologies have catalyzed the development of new methods for measuring DNA methylation. These methods can be broadly classified as methyltyping versus methylome sequencing, in analogy with genotyping versus genome sequencing for DNA. Methyltyping technologies assess genome-scale methylation patterns, emphasizing low cost at the expense of high resolution. Assays based on sequencing avoid problems associated with hybridization to arrays. Examples include MethylSeq, which is based on digestion with a methylation-sensitive enzyme and is the focus of this paper, and RRBS, which is based on digestion with a methylation-insensitive enzyme followed by bisulfite sequencing [3], [23]. In contrast to methyltyping, whole-genome bisulfite sequencing can measure absolute levels of DNA methylation at single-nucleotide resolution [7], [24], [25], but it is expensive because it requires sequencing whole genomes. The trade-off between cost and coverage is complicated by several other issues. In the case of bisulfite sequencing, conversion may not always be complete. Also, the analysis requirements of the different assays vary in difficulty. For these reasons, there has been a proliferation of methods whose pros and cons change constantly as sequencing technologies change. A recent analysis (in [26]) suggested that MethylSeq is the method with the most favorable profile of pros and cons with respect to the measures chosen for comparison. Table 1 summarizes characteristics of MethylSeq and of the most commonly used alternative methods. MethylSeq retrieves information spanning more of the genome than RRBS because HpaII produces a more favorable profile of fragment sizes than MspI (see the discussion of size selection bias below and the Methods section).
| Table 1. Features of different methyltyping techniques. |
| Table 2. Counts of the SUMIs annotated in the four human neutrophil samples. |
MethylSeq is a convenient methyltyping strategy because it is cost-effective, requires only small amounts of material, and avoids bisulfite conversion. Briefly, the assay digests DNA with a methylation-sensitive enzyme (HpaII) that cuts CCGG sites whose CG is unmethylated. Subsequent sequencing and mapping to the genome reveals unmethylated CGs. Although the experiment is relatively simple, interpretation of the sequencing data is confounded because read depth at a given site depends on the methylation status of neighboring sites. This has limited the use of MethylSeq; previous studies either pointed out the need for a method of site-specific normalization [1], or attempted to deal with the bias by removing problematic HpaII sites from the analysis [5] (resulting in the loss of more than 19% of HpaII sites in CpG islands; see Methods).
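The dependence on neighboring sites can be illustrated with a toy in-silico digest. The following is a minimal sketch, not part of MetMap: the function names (`find_sites`, `digest`), the example sequence, and the simplification that a methylated site is never cut and an unmethylated site is always cut are all assumptions made for illustration. The cut position (after the first C of CCGG) is treated as a parameter.

```python
import re

HPAII_SITE = "CCGG"  # recognition site named in the text


def find_sites(seq, site=HPAII_SITE):
    """Return 0-based start positions of every occurrence of `site` in `seq`."""
    # A lookahead match also reports overlapping occurrences, if any exist.
    pattern = "(?=" + re.escape(site) + ")"
    return [m.start() for m in re.finditer(pattern, seq.upper())]


def digest(seq, methylated_sites=frozenset(), cut_offset=1):
    """Cut `seq` at every unmethylated site; return fragments as (start, end).

    Assumptions (illustrative only): digestion is complete, methylated sites
    are fully protected, and the cut falls `cut_offset` bases into the site.
    """
    cuts = [s + cut_offset for s in find_sites(seq) if s not in methylated_sites]
    bounds = [0] + cuts + [len(seq)]
    return [(a, b) for a, b in zip(bounds, bounds[1:]) if b > a]


def fragment_lengths(fragments):
    """Lengths of the (start, end) fragments returned by `digest`."""
    return [end - start for start, end in fragments]


# Hypothetical sequence with three CCGG sites, at positions 3, 11 and 19.
example = "AAACCGGTTTTCCGGAAAACCGGTT"
all_unmethylated = digest(example)
middle_methylated = digest(example, methylated_sites={11})
```

In this toy case, methylating only the middle site merges two fragments into one longer fragment, so the fragments ending at the two flanking unmethylated sites change length even though their own methylation state is unchanged. If fragments are size-selected before sequencing, such a change can move a fragment into or out of the selected range, which is one way read depth at a site comes to depend on its neighbors.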
In order to make effective use of MethylSeq for genome-scale methyltyping we developed a freely available program, called MetMap, that infers methylation at individual CGs by modeling biases inherent in MethylSeq experiments. An additional important feature of MetMap is the annotation of strongly unmethylated islands (SUMIs) which, as opposed to the current definition of CpG islands, incorporate information from both a reference sequence and genome-scale methylation data. We have validated MetMap's site-specific analysis, as well as its unmethylated-island annotation, with bisulfite sequencing of specific sites.
We demonstrate the use of MethylSeq with MetMap by methyltyping four male human individuals and annotating their unmethylated islands. We show that this analysis is sufficient to survey methylation states across the genome, and that it gives significant insight into the methylome of each specimen, inside and outside of CpG islands, at site-specific resolution. We also show evidence that the mean extent of methylation of an island is more informative than the methylation states of its individual sites: the correlation between the methylation states of any two samples improves when island means are considered. MetMap identifies numerous unmethylated regions, of varying lengths, which have not previously been annotated as CpG islands and are associated with other features indicative of transcriptional function. We conclude that MetMap leverages the cost-efficiency and practical ease of MethylSeq to produce informative genome-scale methylation annotations (methyltypes) that are suitable for both region- and site-specific comparative and case-control studies.
The remainder of this paper is organized as follows. We begin by explaining in detail the significant biases present in MethylSeq experiments. We then describe the MetMap framework, which is designed to correct for such biases, starting with MetMap's graphical model and continuing with the software's different outputs. Next, we describe the validation of MetMap's procedure using the methyltypes of four human individuals, and our discovery of new unmethylated regions in the human neutrophil genome, found by applying MetMap to MethylSeq data. Finally, we discuss the advantages of using MetMap with MethylSeq to generate and analyze large numbers of samples, and outline our plans for extending MetMap's framework.