DNA methylation is a chemical modification of DNA that plays a key role in the regulation of gene expression (Fig. 1). As an “epigenetic” mark, it encodes an additional layer of heritable information on top of DNA without changing the underlying genetic sequence. While all cell types in an organism share nearly the same genome sequence, their DNA methylation patterns can be markedly different (Song and others, 2005; Meissner and others, 2008). DNA methylation marks help encode tissue-specific transcriptional programs in diverse cell types and allow these gene expression patterns to be passed down to daughter cells. Chemically, DNA methylation involves the modification of a cytosine (C) base to form methyl-cytosine. These methylation marks are recognized by specialized proteins that bind the methylated DNA and inhibit the expression of neighboring genes (Bird, 2002). In adult cells of mammals, this modification occurs almost exclusively at cytosines that are immediately followed by a guanine (G) in the 5′ to 3′ direction, denoted “CpG.”
Fig. 1. DNA (black strand) is wrapped around histone proteins (gray spheres). Unmethylated DNA (left) tends to be loosely packed. Genes in such regions are accessible to the cell's transcriptional machinery and can be expressed. DNA methylation involves the addition …
The health implications of deciphering the DNA methylation code have recently received much attention both in the scientific literature and in the media (Issa, 2007; Cloud, 2010; Schübeler, 2009). Work in the rapidly evolving field of stem cell biology, for example, has shown that DNA methylation can contribute to the cellular memory mechanism used by the stem cells to retain their pluripotent state during repeated cell divisions (Sen and others, 2010). In cancer biology, it is clear that aberrations in DNA methylation almost universally accompany the initiation and progression of cancers (Feinberg and Tycko, 2004). Much of the excitement surrounding epigenetics relates to the promise of therapies that alter the epigenetic code, activating or silencing disease-related genes. While the majority of such treatments are still hypothetical or experimental, 2 epigenetic drugs that reactivate tumor suppressor genes by removing methylation marks have recently received U.S. Food and Drug Administration approval (Sharma and others, 2010; Kaminskas and others, 2005). These studies and therapies highlight the medical promise of mapping and understanding the role of DNA methylation.
Fully describing the methylation profile of a given cell requires measuring the methylation state of every CpG. However, current practical laboratory protocols do not permit single-cell methylation measurements, and because the cell populations studied are known to be heterogeneous, methylation measurements are expected to be continuous rather than binary. Therefore, for any given cell type, we aim to measure the percentage of methylated cells at each CpG site. These measurements can be made by treating DNA with sodium bisulfite, which selectively converts unmethylated cytosine (C) to uracil (U) while leaving methylated C as is, followed by DNA amplification and sequencing (Clark and others, 2006; Frommer and others, 1992). Although considered a gold standard, this procedure comes at significant cost when applied genome-wide because of the amount of sequencing coverage required. Therefore, this technology is not yet suitable for affordable genome-wide measurements. The methods presented in this paper are motivated by the demand for high-throughput measurements necessary to construct genome-wide methylation profiles.
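As a minimal sketch of how a percentage is read off bisulfite sequencing data, consider the base calls from all reads covering one CpG cytosine. After conversion and amplification, a methylated cytosine still reads as C, whereas an unmethylated cytosine (converted to uracil) reads as T, so the fraction of C calls estimates the percentage of methylated cells. The function name `percent_methylation` and the example reads are hypothetical, and the sketch ignores practical issues such as incomplete conversion and strand handling.

```python
from typing import Optional, Sequence


def percent_methylation(base_calls: Sequence[str]) -> Optional[float]:
    """Estimate percent methylation at one CpG from bisulfite read base calls.

    'C' is counted as methylated and 'T' as unmethylated; any other call
    (for example 'N') is treated as uninformative and ignored.
    Returns None when no informative read covers the site.
    """
    methylated = sum(1 for b in base_calls if b.upper() == "C")
    unmethylated = sum(1 for b in base_calls if b.upper() == "T")
    informative = methylated + unmethylated
    if informative == 0:
        return None
    return 100.0 * methylated / informative


if __name__ == "__main__":
    # Made-up calls: 6 reads show C, 2 show T, 1 is uninformative.
    calls = ["C", "C", "T", "C", "C", "T", "C", "C", "N"]
    print(percent_methylation(calls))
```

Because each read contributes a single binary observation, the precision of the estimate at a site depends on how many reads cover it, which is one way to see why genome-wide coverage is costly.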
Recent advances in microarray technology and laboratory protocols provide an alternative high-throughput platform for assessing DNA methylation. Since methylation of adjacent cytosines in small regions of a few hundred base pairs tends to be highly correlated (Eckhardt and others, 2006), lower resolution strategies based on methylated DNA enrichment provide a cost-effective alternative to bisulfite sequencing. These approaches employ proteins that selectively bind (affinity purification) (Weber and others, 2005) or cut (restriction enzymes) DNA depending on its methylation status. Following a procedure that enriches for either methylated or unmethylated DNA, microarrays are used to detect the DNA fragments. Coupled with suitable analytical tools, these strategies can provide accurate genome-wide methylation profiles. A recent comparison of methods found that the restriction enzyme McrBC, which selectively cuts methylated DNA (Sutherland and others, 1992; Ordway and others, 2006), has higher sensitivity than the commonly used methylated DNA immunoprecipitation (MeDIP) affinity purification protocol in regions of lower CpG density (Irizarry and others, 2008). While analytical methods have been developed for MeDIP (Down and others, 2008; Pelizzola and others, 2008), tools are currently lacking for McrBC DNA methylation data.
The microarray measurements produced by the procedures described above present new statistical challenges. We have developed an empirical Bayes estimation strategy that, when combined with appropriately normalized McrBC-enriched DNA microarray data, produces accurate percentage methylation estimates with a mean error of 10% compared to bisulfite sequencing estimates. As in applications of empirical Bayes methods to other microarray data settings (Efron and others, 2001; Smyth, 2004; Irizarry and others, 2003), we take advantage of the massively parallel structure of the data to borrow information across the ensemble of probes. Since accurate methylation estimates are highly dependent on suitable preprocessing to remove systematic biases, we also present a novel normalization strategy. In so doing, we demonstrate that well-established methods, developed largely in the context of gene expression analysis, are often inappropriate for DNA methylation data. While these methods have been widely and successfully used in a variety of other microarray applications, certain key assumptions underlying the strategies are violated in the DNA methylation setting, leading to inaccurate estimates.
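The idea of borrowing information across probes can be illustrated with a generic normal-normal empirical Bayes shrinkage. This is only a textbook sketch and not the estimator developed in this article: it assumes each probe yields a noisy measurement with a known, strictly positive noise variance, estimates the prior mean and variance from the whole ensemble by the method of moments, and then pulls each measurement toward the ensemble mean in proportion to its noise. The function name `eb_shrink` and the numbers are hypothetical.

```python
from statistics import fmean, pvariance
from typing import List, Sequence


def eb_shrink(y: Sequence[float], s2: Sequence[float]) -> List[float]:
    """Generic normal-normal empirical Bayes shrinkage (illustration only).

    Assumed model: y[i] ~ N(theta[i], s2[i]) and theta[i] ~ N(mu, tau2),
    with every s2[i] > 0. mu and tau2 are estimated from the ensemble.
    """
    mu = fmean(y)
    # Method of moments: Var(y) is roughly tau2 + mean(s2); truncate at zero.
    tau2 = max(pvariance(y, mu) - fmean(s2), 0.0)
    # Posterior mean: weight tau2 / (tau2 + s2[i]) on the probe's own value.
    return [mu + tau2 / (tau2 + v) * (yi - mu) for yi, v in zip(y, s2)]


if __name__ == "__main__":
    # Made-up measurements on a 0 to 1 scale; the third probe is the noisiest.
    print(eb_shrink([0.1, 0.5, 0.9], [0.01, 0.01, 0.09]))
```

In this toy example the noisiest probe is moved furthest toward the ensemble mean, while the precisely measured probes change little.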
This article is organized as follows. We begin with a description of our example data sets in Section 2. Section 3 lays out limitations of existing methods for preprocessing DNA methylation data. In Section 4, we describe our normalization strategy and the empirical Bayesian estimator of percentage methylation. Results demonstrating the utility of our methods are presented in Section 5. We conclude with a discussion in Section 6. Derivations and practical issues including microarray data quality control are included in the supplementary material available at Biostatistics.