It is now understood that virtually all human cancer types result from the accumulation of both genetic and epigenetic changes. DNA methylation is a molecular modification of DNA that is crucial for normal development. Genes that are rich in CpG dinucleotides are usually not methylated in normal tissues, but are frequently hypermethylated in cancer. With the advent of high-throughput platforms, the large-scale structure of genomic methylation patterns is available through genome-wide scans, and a tremendous amount of DNA methylation data has recently been generated. However, sophisticated statistical methods to handle complex DNA methylation data are very limited. Here we developed a likelihood-based Uniform-Normal-mixture model to select differentially methylated loci between case and control groups using Illumina arrays. The idea is to model the data as three types of methylation loci: one unmethylated, one completely methylated, and one partially methylated. A three-component mixture model with two Uniform distributions and one truncated normal distribution was used to model the three types. The mixture probabilities and the mean of the normal distribution were used to make inferences about differentially methylated loci. Through extensive simulation studies, we demonstrated the feasibility and power of the proposed method. An application to a recently published study on ovarian cancer identified several methylation loci that were missed by the existing method.
DNA methylation; mixture model; case-control designs
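The three-component mixture described in the abstract above can be illustrated with a minimal sketch. The abstract does not give the supports of the two Uniform components or the truncation bounds, so the cut points `low_cut` and `high_cut`, the truncation to [0, 1], and all function names here are assumptions made for illustration only, not the authors' implementation.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def truncated_normal_pdf(x, mu, sigma, lo=0.0, hi=1.0):
    # Normal density renormalized to the interval [lo, hi] (assumed bounds).
    if x < lo or x > hi:
        return 0.0
    mass = normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)
    return normal_pdf(x, mu, sigma) / mass

def uniform_pdf(x, lo, hi):
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0

def mixture_density(x, weights, mu, sigma, low_cut=0.1, high_cut=0.9):
    # weights = (unmethylated, partially methylated, fully methylated).
    # low_cut and high_cut are hypothetical supports for the Uniform parts.
    w_unmeth, w_partial, w_meth = weights
    return (w_unmeth * uniform_pdf(x, 0.0, low_cut)
            + w_partial * truncated_normal_pdf(x, mu, sigma)
            + w_meth * uniform_pdf(x, high_cut, 1.0))

def log_likelihood(values, weights, mu, sigma, low_cut=0.1, high_cut=0.9):
    # Log-likelihood of methylation values in [0, 1] at a single locus.
    return sum(math.log(mixture_density(x, weights, mu, sigma, low_cut, high_cut))
               for x in values)
```

Under these assumptions, one way to compare groups would be to maximize `log_likelihood` separately for cases and controls and contrast the fitted mixture probabilities and normal mean; the abstract does not specify the exact test, so that step is left out of the sketch.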
DNA methylation is a key component of mammalian gene regulation and the most classical example of an epigenetic mark. DNA methylation patterns are mitotically heritable and stable over time, but they undergo considerable changes in response to cell differentiation, diseases and environmental influences. Several methods have been developed for DNA methylation profiling on a genomic scale. Here, we benchmark four of these methods on two sample pairs, comparing their accuracy and power to detect DNA methylation differences. The results show that all evaluated methods (MeDIP-seq: methylated DNA immunoprecipitation, MethylCap-seq: methylated DNA capture by affinity purification, RRBS: reduced representation bisulfite sequencing, and the Infinium HumanMethylation27 assay) produce accurate DNA methylation data. However, these methods differ in their ability to detect differentially methylated regions between pairs of samples. We highlight strengths and weaknesses of the four methods and give practical recommendations for the design of epigenomic case-control studies.
Epigenome profiling; epigenetics; sequencing; differentially methylated regions; molecular diagnostics; biomarker discovery; cancer
There is an increasing demand for accurate biomarkers for early non-invasive colorectal cancer detection. We employed a genome-scale marker discovery method to identify and verify candidate DNA methylation biomarkers for blood-based detection of colorectal cancer.
We used DNA methylation data from 711 colorectal tumors, 53 matched adjacent-normal colonic tissue samples, 286 healthy blood samples and 4,201 tumor samples of 15 different cancer types. DNA methylation data were generated by the Illumina Infinium HumanMethylation27 and the HumanMethylation450 platforms, which determine the methylation status of 27,578 and 482,421 CpG sites respectively. We first performed a multistep marker selection to identify candidate markers with high methylation across all colorectal tumors while harboring low methylation in healthy samples and other cancer types. We then used pre-therapeutic plasma and serum samples from 107 colorectal cancer patients and 98 controls without colorectal cancer, confirmed by colonoscopy, to verify candidate markers. We selected two markers for further evaluation: methylated THBD (THBD-M) and methylated C9orf50 (C9orf50-M). When tested on clinical plasma and serum samples, these markers outperformed carcinoembryonic antigen (CEA) serum measurement and yielded highly sensitive and specific test performance for early colorectal cancer detection.
Our systematic marker discovery and verification study for blood-based DNA methylation markers resulted in two novel colorectal cancer biomarkers, THBD-M and C9orf50-M. THBD-M in particular showed promising performance in clinical samples, justifying its further optimization and clinical testing.
Global loss of DNA methylation and locus/gene-specific gain of DNA methylation are two distinct hallmarks of carcinogenesis. Aberrant DNA methylation is implicated in smoking-related lung cancer. In this study, we have comprehensively investigated the modulation of DNA methylation consequent to chronic exposure to a prototype smoke-derived carcinogen, benzo[a]pyrene diol epoxide (B[a]PDE), in genomic regions of significance in lung cancer, in normal human cells. We have used a pulldown assay for enrichment of the CpG methylated fraction of cellular DNA combined with microarray platforms, followed by extensive validation through conventional bisulfite-based analysis. Here, we demonstrate strikingly similar patterns of DNA methylation in non-transformed B[a]PDE-treated cells vs control using high-throughput microarray-based DNA methylation profiling confirmed by conventional bisulfite-based DNA methylation analysis. The absence of aberrant DNA methylation in our model system within a timeframe that precedes cellular transformation suggests that following carcinogen exposure, other as yet unknown factors (secondary to carcinogen treatment) may help initiate global loss of DNA methylation and region-specific gain of DNA methylation, which can, in turn, contribute to lung cancer development. Unveiling the initiating events that cause aberrant DNA methylation in lung cancer has tremendous public health relevance, as it can help define future strategies for early detection and prevention of this highly lethal disease.
DNA methylation is an indispensable epigenetic modification of mammalian genomes. Consequently there is great interest in strategies for genome-wide/whole-genome DNA methylation analysis, and immunoprecipitation-based methods have proven to be a powerful option. Such methods are rapidly shifting the bottleneck from data generation to data analysis, necessitating the development of better analytical tools. Until now, a major analytical difficulty associated with immunoprecipitation-based DNA methylation profiling has been the inability to estimate absolute methylation levels. Here we report the development of a novel cross-platform algorithm, Bayesian Tool for Methylation Analysis (Batman), for analyzing Methylated DNA Immunoprecipitation (MeDIP) profiles generated using arrays (MeDIP-chip) or next-generation sequencing (MeDIP-seq). The latter is an approach we have developed to elucidate the first high-resolution whole-genome DNA methylation profile (DNA methylome) of any mammalian genome. MeDIP-seq/MeDIP-chip combined with Batman represent robust, quantitative, and cost-effective functional genomic strategies for elucidating the function of DNA methylation.
DNA methylation plays a very important role in the silencing of tumor suppressor genes in various tumor types. In order to gain a genome-wide understanding of how changes in methylation affect tumor growth, the differential methylation hybridization (DMH) protocol has been developed and large amounts of DMH microarray data have been generated. However, it is still unclear how to preprocess this type of microarray data and how different background correction and normalization methods used for two-color gene expression arrays perform for the methylation microarray data. In this paper, we demonstrate our discovery of a set of internal control probes that have log ratios (M) theoretically equal to zero according to this DMH protocol. With the aid of this set of control probes, we propose two LOESS (or LOWESS, locally weighted scatter-plot smoothing) normalization methods that are novel and unique for DMH microarray data. Combining with other normalization methods (global LOESS and no normalization), we compare four normalization methods. In addition, we compare five different background correction methods.
We study 20 different preprocessing methods, which are the combinations of five background correction methods and four normalization methods. In order to compare these 20 methods, we evaluate their performance in identifying known methylated and un-methylated housekeeping genes based on two statistics. Comparison details are illustrated using breast cancer cell line and ovarian cancer patient methylation microarray data. Our comparison results show that the different background correction methods perform similarly; however, the four normalization methods perform very differently. In particular, all three LOESS normalization methods perform better than no normalization.
It is necessary to do within-array normalization, and the two LOESS normalization methods based on specific DMH internal control probes produce more stable and relatively better results than the global LOESS normalization method.
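The idea of normalizing against internal control probes whose log ratio M should be zero can be sketched as follows. The abstract does not describe the two control-probe LOESS variants in detail, so this is only one plausible reading: fit a local linear (LOESS-style) curve of M against average intensity A on the control probes and subtract it from every probe. The function names, the tricube kernel, and the `span` default are assumptions for illustration.

```python
import math

def tricube(u):
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def loess_predict(x0, xs, ys, span=0.5):
    # Local linear fit at x0, weighting the nearest span*n points with a tricube kernel.
    n = len(xs)
    k = min(n, max(2, int(math.ceil(span * n))))
    h = sorted(abs(x - x0) for x in xs)[k - 1]  # bandwidth = k-th nearest distance
    if h == 0.0:
        h = 1e-12
    w = [tricube((x - x0) / h) for x in xs]
    sw = sum(w)
    if sw == 0.0:
        return sum(ys) / n  # degenerate neighbourhood: fall back to the plain mean
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 1e-12 else 0.0
    return my + slope * (x0 - mx)

def normalize_with_controls(A, M, control_idx, span=0.5):
    # Estimate the intensity-dependent bias from control probes only
    # (expected M = 0), then remove it from all probes on the array.
    ctrl_A = [A[i] for i in control_idx]
    ctrl_M = [M[i] for i in control_idx]
    return [m - loess_predict(a, ctrl_A, ctrl_M, span) for a, m in zip(A, M)]
```

Fitting on control probes only, rather than on all probes as in global LOESS, avoids assuming that most probes are unchanged between channels, which may not hold for methylation arrays.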
DNA methylation plays a vital role in normal cellular function, with aberrant methylation signatures being implicated in a growing number of human pathologies and complex human traits. Methods based on the modification of genomic DNA with sodium bisulfite are considered the 'gold-standard' for DNA methylation profiling on genomic DNA; however, they require relatively large amounts of DNA and may be prohibitively expensive when used on the large sample sizes necessary to detect small effects. We propose that a high-throughput DNA pooling approach will facilitate the use of emerging methylomic profiling techniques in large samples.
Compared with data generated from 89 individual samples, our analysis of 205 CpG sites spanning nine independent regions of the genome demonstrates that DNA pools can be used to provide an accurate and reliable quantitative estimate of average group DNA methylation. Comparison of data generated from the pooled DNA samples with results averaged across the individual samples comprising each pool revealed highly significant correlations for individual CpG sites across all nine regions, with an average overall correlation across all regions and pools of 0.95 (95% bootstrapped confidence intervals: 0.94 to 0.96).
In this study we demonstrate the validity of using pooled DNA samples to accurately assess group DNA methylation averages. Such an approach can be readily applied to the assessment of disease phenotypes reducing the time, cost and amount of DNA starting material required for large-scale epigenetic analyses.
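The comparison between pooled measurements and per-site averages of individual samples can be sketched as a Pearson correlation with a percentile bootstrap interval. The abstract does not state how the bootstrap was performed, so resampling CpG sites with replacement is an assumption, and the function names are hypothetical.

```python
import math
import random

def pearson(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def bootstrap_ci(xs, ys, n_boot=2000, alpha=0.05, seed=0):
    # Percentile bootstrap over CpG sites (assumed resampling unit).
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        if len(set(idx)) < 2:
            continue  # skip degenerate resamples with a single distinct site
        stats.append(pearson([xs[i] for i in idx], [ys[i] for i in idx]))
    stats.sort()
    m = len(stats)
    lo = stats[int((alpha / 2) * m)]
    hi = stats[min(m - 1, int((1 - alpha / 2) * m))]
    return lo, hi
```

Here `xs` would hold the methylation averaged across the individuals in a pool and `ys` the value measured on the pooled DNA, one pair per CpG site; the sketch assumes the values at different sites are distinct so that no resample has zero variance.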
Summary: Methyl-Analyzer is a python package that analyzes genome-wide DNA methylation data produced by the Methyl-MAPS (methylation mapping analysis by paired-end sequencing) method. Methyl-MAPS is an enzymatic-based method that uses both methylation-sensitive and -dependent enzymes covering >80% of CpG dinucleotides within mammalian genomes. It combines enzymatic-based approaches with high-throughput next-generation sequencing technology to provide whole genome DNA methylation profiles. Methyl-Analyzer processes and integrates sequencing reads from methylated and unmethylated compartments and estimates CpG methylation probabilities at single base resolution.
Availability and implementation: Methyl-Analyzer is available at http://github.com/epigenomics/methylmaps. Sample dataset is available for download at http://epigenomicspub.columbia.edu/methylanalyzer_data.html.
Supplementary information: Supplementary data are available at Bioinformatics online.
Recent progress in high-throughput technologies has greatly contributed to the development of DNA methylation profiling. Although several reports describe methylome detection by whole genome bisulfite sequencing, its high cost and heavy demand on bioinformatics analysis prevent its extensive application. Thus, current strategies for the study of mammalian DNA methylomes are still based primarily on genome-wide methylated DNA enrichment combined with DNA microarray detection or sequencing. Methylated DNA enrichment is a key step in microarray-based genome-wide methylation profiling studies, and will remain so for future high-throughput sequencing-based methylome analysis.
In order to evaluate the sensitivity and accuracy of methylated DNA enrichment, we investigated and optimized a number of important parameters to improve the performance of several enrichment assays, including differential methylation hybridization (DMH), microarray-based methylation assessment of single samples (MMASS), and methylated DNA immunoprecipitation (MeDIP). With advantages and disadvantages unique to each approach, we found that assays based on methylation-sensitive enzyme digestion and those based on immunoprecipitation detected different methylated DNA fragments, indicating that they are complementary in their relative ability to detect methylation differences.
Our study provides the first comprehensive evaluation of widely used methodologies for methylated DNA enrichment, and could be helpful for developing a cost-effective approach for DNA methylation profiling.
Evidence supports a role for epigenetic mechanisms in the pathogenesis of late-onset Alzheimer’s disease (LOAD), but little has been done on a genome-wide scale to identify potential sites involved in disease. This study investigates human postmortem frontal cortex genome-wide DNA methylation profiles between 12 LOAD and 12 cognitively normal age- and gender-matched subjects. Quantitative DNA methylation is determined at 27,578 CpG sites spanning 14,475 genes via the Illumina Infinium HumanMethylation27 BeadArray. Data are analyzed using parallel linear models adjusting for age and gender with empirical Bayes standard error methods. Gene-specific technical and functional validation is performed on an additional 13 matched pair samples, encompassing a wider age range. Analysis reveals 948 CpG sites representing 918 unique genes as potentially associated with LOAD disease status pending confirmation in additional study populations. Across these 948 sites the subtle mean methylation difference between cases and controls is 2.9%. The CpG site with a minimum false discovery rate located in the promoter of the gene Transmembrane Protein 59 (TMEM59) is 7.3% hypomethylated in cases. Methylation at this site is functionally associated with tissue RNA and protein levels of the TMEM59 gene product. The TMEM59 gene identified from our discovery approach was recently implicated in amyloid-β protein precursor post-translational processing, supporting a role for epigenetic change in LOAD pathology. This study demonstrates widespread, modest discordant DNA methylation in LOAD-diseased tissue independent from DNA methylation changes with age. Identification of epigenetic biomarkers of LOAD risk may allow for the development of novel diagnostic and therapeutic targets.
DNA methylation; epigenetics; late onset Alzheimer’s disease; prefrontal cortex
Affinity capture of DNA methylation combined with high-throughput sequencing strikes a good balance between the high cost of whole genome bisulfite sequencing and the low coverage of methylation arrays. We present BayMeth, an empirical Bayes approach that uses a fully methylated control sample to transform observed read counts into regional methylation levels. In our model, inefficient capture can readily be distinguished from low methylation levels. BayMeth improves on existing methods, allows explicit modeling of copy number variation, and offers computationally efficient analytical mean and variance estimators. BayMeth is available in the Repitools Bioconductor package.
DNA methylation profiles differ among disease types and, therefore, can be used in disease diagnosis. In addition, large-scale whole genome DNA methylation data offer tremendous potential in understanding the role of DNA methylation in normal development and function. However, due to the unique feature of the methylation data, powerful and robust statistical methods are very limited in this area.
In this paper, we proposed and examined a new statistical method to detect differentially methylated loci for case control designs that is fully nonparametric and does not depend on any assumption for the underlying distribution of the data. Moreover, the proposed method adjusts for the age effect that has been shown to be highly correlated with DNA methylation profiles. Using simulation studies and a real data application, we have demonstrated the advantages of our method over existing commonly used methods.
Compared to existing methods, our method improved the detection power for differentially methylated loci for case control designs and controlled the type I error well. Its applications are not limited to methylation data; it can be extended to many other case–control studies.
Nonparametric method; One-sided test; Combining p-value
Motivation: DNA methylation is a molecular modification of DNA that plays crucial roles in the regulation of gene expression. In particular, CpG-rich regions are frequently hypermethylated in cancer tissues, but not methylated in normal tissues. However, there are few methodological studies of case-control association for high-dimensional DNA methylation data, compared with those for microarray gene expression. One key feature of DNA methylation data is a grouped structure among CpG sites from a gene that are possibly highly correlated. In this article, we proposed a penalized logistic regression model for correlated DNA methylation CpG sites within genes from high-dimensional array data. Our regularization procedure is based on a combination of the l1 penalty and a squared l2 penalty on degree-scaled differences of coefficients of CpG sites within one gene, so it induces both sparsity and smoothness with respect to the correlated regression coefficients. We combined the penalized procedure with a stability selection procedure so that a selection probability is provided for each regression coefficient, which helps us make a stable and confident selection of methylation CpG sites that are possibly truly associated with the outcome.
Results: Using simulation studies we demonstrated that the proposed procedure outperforms existing mainstream regularization methods such as lasso and elastic-net when data are correlated within a group. We also applied our method to identify important CpG sites and corresponding genes for ovarian cancer from over 20 000 CpGs generated from the Illumina Infinium HumanMethylation27K Beadchip. Some of the genes identified are potentially associated with cancers.
Supplementary data are available at Bioinformatics online.
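The penalty described in the Motivation paragraph above, an l1 term plus a squared l2 term on degree-scaled differences of coefficients within a gene, can be sketched as a plain function. The abstract does not give the exact weighting or the within-gene link structure, so this sketch assumes that all CpG sites in a gene are linked to each other and that the two terms are mixed by hypothetical parameters `lam` and `alpha` in elastic-net style.

```python
import math

def grouped_penalty(beta, genes, lam=1.0, alpha=0.5):
    # beta:  list of regression coefficients, one per CpG site.
    # genes: list of lists of indices into beta, one list per gene.
    # lam, alpha: assumed tuning parameters (overall strength, l1 share).
    l1 = sum(abs(b) for b in beta)
    smooth = 0.0
    for sites in genes:
        d = len(sites) - 1  # degree of each site if all sites in a gene are linked
        if d < 1:
            continue
        scale = math.sqrt(d)
        for a in range(len(sites)):
            for b in range(a + 1, len(sites)):
                diff = beta[sites[a]] / scale - beta[sites[b]] / scale
                smooth += diff * diff
    return lam * (alpha * l1 + (1.0 - alpha) * smooth)
```

With this form, two sites in the same gene with equal coefficients incur only the l1 cost, while coefficients of opposite sign are penalized further, which is the smoothing behaviour the abstract describes.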
CpG methylation is a key component of the epigenome architecture that is associated with changes in gene expression without a change to the DNA sequence. Since the first reports on deregulation of DNA methylation, in diseases such as cancer, and the initiation of the Human Epigenome Project, an increasing need has arisen for a detailed, high-throughput and quantitative method of analysis to discover and validate normal and aberrant DNA methylation profiles in large sample cohorts. Here we present an improved protocol using base-specific fragmentation and MALDI-TOF mass spectrometry that enables a sensitive and high-throughput method of DNA methylation analysis, quantitative to 5% methylation for each informative CpG residue. We have determined the accuracy, variability and sensitivity of the protocol, implemented critical improvements in experimental design and interpretation of the data and developed a new formula to accurately measure CpG methylation. Key innovations now permit determination of differential and allele-specific methylation, such as in cancer and imprinting. The new protocol is ideally suitable for detailed DNA methylation analysis of multiple genomic regions and large sample cohorts that is critical for comprehensive profiling of normal and diseased human epigenomes.
Patterns of genome-wide methylation vary between tissue types. For example, cancer tissue shows markedly different patterns from those of normal tissue. In this paper we propose a beta-mixture model to describe genome-wide methylation patterns based on probe data from methylation microarrays. The model takes dependencies between neighbour probe pairs into account and assumes three broad categories of methylation, low, medium and high. The model is described by 37 parameters, which reduces the dimensionality of a typical methylation microarray significantly. We used methylation microarray data from 42 colon cancer samples to assess the model.
Based on data from colon cancer samples we show that our model captures genome-wide characteristics of methylation patterns. We estimate the parameters of the model and show that they vary between different tissue types. Further, for each methylation probe the posterior probability of a methylation state (low, medium or high) is calculated and the probability that the state is correctly predicted is assessed. We demonstrate that the model can be applied to classify cancer tissue types accurately and that the model provides accessible and easily interpretable data summaries.
We have developed a beta-mixture model for methylation microarray data. The model substantially reduces the dimensionality of the data. It can be used for further analysis, such as sample classification or to detect changes in methylation status between different samples and tissues.
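The posterior probability of a methylation state for a single probe can be illustrated with a simplified three-component beta mixture. This sketch ignores the neighbour-probe dependencies that the model above includes, and the component weights and shape parameters in the usage example are made-up values, not estimates from the paper.

```python
import math

def beta_pdf(x, a, b, eps=1e-9):
    # Beta density evaluated via log-gamma; x is clipped away from 0 and 1.
    x = min(max(x, eps), 1.0 - eps)
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x))

def state_posteriors(x, components):
    # components maps a state name to (mixture weight, beta shape a, beta shape b).
    joint = {s: w * beta_pdf(x, a, b) for s, (w, a, b) in components.items()}
    total = sum(joint.values())
    return {s: v / total for s, v in joint.items()}

# Hypothetical parameters for the low, medium and high methylation states.
example_components = {
    "low": (0.4, 2.0, 12.0),
    "medium": (0.2, 6.0, 6.0),
    "high": (0.4, 12.0, 2.0),
}
posterior = state_posteriors(0.05, example_components)
most_likely = max(posterior, key=posterior.get)
```

The largest posterior gives the predicted state, and its value gives a per-probe measure of how confidently that state is assigned.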
A number of empirical Bayes models (each with different statistical distribution assumptions) have now been developed to analyze differential DNA methylation using high-density oligonucleotide tiling arrays. However, it remains unclear which model performs best. For example, for analysis of differentially methylated regions for conservative and functional sequence characteristics (e.g., enrichment of transcription factor-binding sites (TFBSs)), the sensitivity of such analyses, using various empirical Bayes models, remains unclear. In this paper, five empirical Bayes models were constructed, based on either a gamma distribution or a log-normal distribution, for the identification of differentially methylated loci and their cell division- (1, 3, and 5) and drug treatment- (cisplatin) dependent methylation patterns. While differential methylation patterns generated by log-normal models were enriched with numerous TFBSs, we observed almost no TFBS-enriched sequences using gamma assumption models. Statistical and biological results suggest that the log-normal, rather than the gamma, empirical Bayes model provides a highly accurate and precise method for differential methylation microarray analysis. In addition, we presented one of the log-normal models for differential methylation analysis and tested its reproducibility by simulation study. We believe this research to be the first extensive comparison of statistical modeling for the analysis of differential DNA methylation, an important biological phenomenon that precisely regulates gene transcription.
DNA methylation patterns have been shown to significantly correlate with different tissue types and disease states. High-throughput methylation arrays enable large-scale DNA methylation analysis to identify informative DNA methylation biomarkers. The identification of disease-specific methylation signatures is of fundamental and practical interest for risk assessment, diagnosis, and prognosis of diseases.
Using published high-throughput DNA methylation data, a two-stage feature selection method was developed to select a small optimal subset of DNA methylation features to precisely classify two sample groups. With this approach, a small number of CpG sites were highly sensitive and specific in distinguishing lung cancer tissue samples from normal lung tissue samples.
This study shows that it is feasible to identify DNA methylation biomarkers from high-throughput DNA methylation profiles and that a small number of signature CpG sites can suffice to classify two groups of samples. The computational method we developed in the study is efficient to identify signature CpG sites from disease samples with complex methylation patterns.
Crohn disease (CD) and ulcerative colitis (UC) are common forms of inflammatory bowel diseases (IBD). Monozygotic (MZ) twin discordance rates and epidemiologic data suggest that environmental changes and epigenetic factors may play a pathogenic role in IBD. DNA methylation (the methylation of cytosines within CpG dinucleotides) is an epigenetic modification that can respond to environmental influences. We investigated whether DNA methylation might be connected with IBD in peripheral blood leukocyte (PBL) DNA by utilizing genome-wide microarrays.
Two different high-throughput microarray based methods for genome wide DNA methylation analysis were employed. First, DNA isolated from MZ twin pairs concordant (CD: 4; UC: 3) and discordant (CD: 4; UC: 7) for IBD was interrogated by a custom made methylation specific amplification microarray (MSAM). Second, the recently developed Illumina Infinium HumanMethylation450 BeadChip arrays were used on 48 samples of PBL DNA from discordant MZ twin pairs (CD:3; UC:3) and treatment naive pediatric cases of IBD (CD:14; UC:8), as well as controls (n=14). The microarrays were validated with bisulfite pyrosequencing.
The Methylation BeadChip approach identified a single DNA methylation association of IBD at TEPP (testis, prostate and placenta-expressed protein) when DNA isolated selectively from peripheral blood mononuclear cells was analyzed (8.6% increase in methylation between CD and control, FDR=0.0065).
Microarray interrogation of IBD dependent DNA methylation from PBLs appears to have limited ability to detect significant disease associations. More detailed and/or selective approaches may be useful for the elucidation of connections between the DNA methylome and IBD in the future.
inflammatory bowel disease; DNA methylation; peripheral blood; twin; TEPP
DNA methylation has been linked to genome regulation and dysregulation in health and disease respectively, and methods for characterizing genomic DNA methylation patterns are rapidly emerging. We have developed/refined methods for enrichment of methylated genomic fragments using the methyl-binding domain of the human MBD2 protein (MBD2-MBD) followed by analysis with high-density tiling microarrays. This MBD-chip approach was used to characterize DNA methylation patterns across all non-repetitive sequences of human chromosomes 21 and 22 at high-resolution in normal and malignant prostate cells.
Examining this data using computational methods that were designed specifically for DNA methylation tiling array data revealed widespread methylation of both gene promoter and non-promoter regions in cancer and normal cells. In addition to identifying several novel cancer hypermethylated 5' gene upstream regions that mediated epigenetic gene silencing, we also found several hypermethylated 3' gene downstream, intragenic and intergenic regions. The hypermethylated intragenic regions were highly enriched for overlap with intron-exon boundaries, suggesting a possible role in regulation of alternative transcriptional start sites, exon usage and/or splicing. The hypermethylated intergenic regions showed significant enrichment for conservation across vertebrate species. A sampling of these newly identified promoter (ADAMTS1 and SCARF2 genes) and non-promoter (downstream or within DSCR9, C21orf57 and HLCS genes) hypermethylated regions were effective in distinguishing malignant from normal prostate tissues and/or cell lines.
Comparison of chromosome-wide DNA methylation patterns in normal and malignant prostate cells revealed significant methylation of gene-proximal and conserved intergenic sequences. Such analyses can be easily extended for genome-wide methylation analysis in health and disease.
DNA methylation; prostate cancer; tiling microarray; epigenetics; methylated DNA binding domain; MBD-chip; ADAMTS1; SCARF2; DSCR9; HLCS
Repetitive elements represent a large portion of the human genome and contain much of the CpG methylation found in normal human postnatal somatic tissues. Loss of DNA methylation in these sequences might account for most of the global hypomethylation that characterizes a large percentage of human cancers that have been studied. There is widespread interest in correlating the genomic 5-methylcytosine content with clinical outcome, dietary history, lifestyle, etc. However, a high-throughput, accurate and easily accessible technique that can be applied even to paraffin-embedded tissue DNA is not yet available. Here, we report the development of quantitative MethyLight assays to determine the levels of methylated and unmethylated repeats, namely, Alu and LINE-1 sequences and the centromeric satellite alpha (Satα) and juxtacentromeric satellite 2 (Sat2) DNA sequences. Methylation levels of Alu, Sat2 and LINE-1 repeats were significantly associated with global DNA methylation, as measured by high performance liquid chromatography, and the combined measurements of Alu and Sat2 methylation were highly correlative with global DNA methylation measurements. These MethyLight assays rely only on real-time PCR and provide surrogate markers for global DNA methylation analysis. We also describe a novel design strategy for the development of methylation-independent MethyLight control reactions based on Alu sequences depleted of CpG dinucleotides by evolutionary deamination on one strand. We show that one such Alu-based reaction provides a greatly improved detection of DNA for normalization in MethyLight applications and is less susceptible to normalization errors caused by cancer-associated aneuploidy and copy number changes.
Significance: Methylation of cytosine in DNA is linked with gene regulation, and this has profound implications in development, normal biology, and disease conditions in many eukaryotic organisms. A wide range of methods and approaches exist for its identification, quantification, and mapping within the genome. While the earliest approaches were nonspecific and were at best useful for quantification of total methylated cytosines in the chunk of DNA, this field has seen considerable progress and development over the past decades. Recent Advances: Methods for DNA methylation analysis differ in their coverage and sensitivity, and the method of choice depends on the intended application and desired level of information. Potential results include global methyl cytosine content, degree of methylation at specific loci, or genome-wide methylation maps. Introduction of more advanced approaches to DNA methylation analysis, such as microarray platforms and massively parallel sequencing, has brought us closer to unveiling the whole methylome. Critical Issues: Sensitive quantification of DNA methylation from degraded and minute quantities of DNA and high-throughput DNA methylation mapping of single cells still remain a challenge. Future Directions: Developments in DNA sequencing technologies as well as the methods for identification and mapping of 5-hydroxymethylcytosine are expected to augment our current understanding of epigenomics. Here we present an overview of methodologies available for DNA methylation analysis with special focus on recent developments in genome-wide and high-throughput methods. While the application focus relates to cancer research, the methods are equally relevant to broader issues of epigenetics and redox science in this special forum. Antioxid. Redox Signal. 18, 1972–1986.
Aberrant DNA methylation of CpG islands, CpG island shores and first exons is known to play a key role in the altered gene expression patterns in all human cancers. To date, a systematic study on the effect of DNA methylation on gene expression using high resolution data has not been reported. In this study, we conducted an integrated analysis of MethylCap-sequencing data and Affymetrix gene expression microarray data for 30 breast cancer cell lines representing different breast tumor phenotypes. As well-developed methods for the integrated analysis do not currently exist, we created a series of four different analysis methods. On the computational side, our goal is to develop methylome data analysis protocols for the integrated analysis of DNA methylation and gene expression data on the genome scale. On the cancer biology side, we present comprehensive genome-wide methylome analysis results for differentially methylated regions and their potential effect on gene expression in 30 breast cancer cell lines representing three molecular phenotypes, luminal, basal A and basal B. Our integrated analysis demonstrates that methylation status of different genomic regions may play a key role in establishing transcriptional patterns in molecular subtypes of human breast cancer.
To discover cancer specific DNA methylation markers, large-scale screening methods are widely used. The pharmacological unmasking expression microarray approach is an elegant method to enrich for genes that are silenced and re-expressed during functional reversal of DNA methylation upon treatment with demethylation agents. However, such experiments are performed in in vitro (cancer) cell lines, mostly with poor relevance when extrapolating to primary cancers. To overcome this problem, we incorporated data from primary cancer samples in the experimental design. A strategy to combine and rank data from these different data sources is essential to minimize the experimental work in the validation steps.
The aim of this study was to apply a new relaxation ranking algorithm to enrich for DNA methylation markers in cervical cancer.
The application of a new sorting methodology allowed us to sort high-throughput microarray data from both cervical cancer cell lines and primary cervical cancer samples. The performance of the sorting was analyzed in silico. Pathway and gene ontology analysis was performed on the top selection and gave a strong indication that the ranking methodology is able to enrich for genes that might be methylated. Terms like regulation of progression through cell cycle, positive regulation of programmed cell death, organ development and embryonic development are overrepresented. Combined with the highly enriched number of imprinted and X-chromosome located genes, and the increased prevalence of known methylation markers from cervical cancer (the highest-ranking known gene is CCNA1) as well as from other cancer types, the ranking algorithm appears to be powerful in enriching for methylated genes.
Verification of the DNA methylation state of the 10 highest-ranking genes revealed that 7/9 (78%) gene promoters showed DNA methylation in cervical carcinomas. Of these 7 genes, 3 (SST, HTRA3 and NPTX1) are not methylated in normal cervix tissue.
The application of this new relaxation ranking methodology allowed us to significantly enrich for methylated genes in cancer. This enrichment is shown both in silico and by experimental validation, and it revealed novel methylation markers as proof of concept that might be useful in early cancer detection in cervical scrapings.
DNA methylation profiling reveals important differentially methylated regions (DMRs) of the genome that are altered during development or that are perturbed by disease. To date, few programs exist for regional analysis of enriched or whole-genome bisulfite conversion sequencing data, even though such data are increasingly common. Here, we describe an open-source, optimized method for determining empirically based DMRs (eDMR) from high-throughput sequence data that is applicable to enriched whole-genome methylation profiling datasets, as well as other globally enriched epigenetic modification data.
Here we show that our bimodal distribution model and weighted cost function for optimized regional methylation analysis provide accurate boundaries of regions harboring significant epigenetic modifications. Our algorithm takes the spatial distribution of CpGs into account for the enrichment assay, allowing for optimization of the definition of empirical regions for differential methylation. Combined with the dependent adjustment for regional p-value combination and DMR annotation, we provide a method that may be applied to a variety of datasets for rapid DMR analysis. Our method classifies both the directionality of DMRs and their genome-wide distribution, and we have observed that it shows clinical relevance through correct stratification of two Acute Myeloid Leukemia (AML) tumor sub-types.
Our weighted optimization algorithm eDMR for calling DMRs extends an established DMR R pipeline (methylKit) and provides a needed resource in epigenomics. Our method enables an accurate and scalable way of finding DMRs in high-throughput methylation sequencing experiments. eDMR is available for download at http://code.google.com/p/edmr/.
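The two ideas named above, defining regions from the spacing of CpGs and combining per-CpG p-values into one regional p-value, can be illustrated with a minimal Python sketch. This is not the eDMR implementation: the function names `group_by_gap` and `fisher_combine` are hypothetical, the distance cutoff is supplied by hand rather than estimated from a bimodal distribution model, and Fisher's method is swapped in as a simple combination rule that assumes independent p-values, whereas the abstract describes a dependence-adjusted combination.

```python
import math


def group_by_gap(positions, max_gap):
    """Group sorted CpG positions into candidate regions (hypothetical helper).

    A new region starts whenever the distance to the previous CpG exceeds
    max_gap. Returns a list of regions, each a list of indices into positions.
    """
    if not positions:
        return []
    regions, current = [], [0]
    for i in range(1, len(positions)):
        if positions[i] - positions[i - 1] > max_gap:
            regions.append(current)
            current = []
        current.append(i)
    regions.append(current)
    return regions


def fisher_combine(pvalues):
    """Combine p-values in (0, 1] with Fisher's method (assumes independence).

    The statistic X = -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom under the null. For even degrees of freedom the
    survival function has the closed form
    exp(-X/2) * sum_{i<k} (X/2)^i / i!, so no external library is needed.
    """
    k = len(pvalues)
    half_x = -sum(math.log(p) for p in pvalues)
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return min(1.0, math.exp(-half_x) * total)


if __name__ == "__main__":
    # Illustrative values only, not data from the study.
    positions = [100, 150, 180, 1000, 1040, 5000]
    pvals = [0.01, 0.03, 0.02, 0.40, 0.60, 0.05]
    for region in group_by_gap(positions, max_gap=200):
        combined = fisher_combine([pvals[i] for i in region])
        print(positions[region[0]], positions[region[-1]], round(combined, 4))
```

Because neighbouring CpGs are usually correlated, an independence-based rule such as this one tends to give overly small regional p-values, which is the motivation for a dependence adjustment.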
Microarrays are widely used for examining differential gene expression, identifying single nucleotide polymorphisms, and detecting methylation loci. Multiple testing methods in microarray data analysis aim at controlling both Type I and Type II error rates; however, real microarray data do not always fit their distribution assumptions. Smyth's ubiquitous parametric method, for example, inadequately accommodates violations of normality assumptions, resulting in inflated Type I error rates. The Significance Analysis of Microarrays, another widely used microarray data analysis method, is based on a permutation test and is robust to non-normally distributed data; however, the fold change criteria of the Significance Analysis of Microarrays method are problematic and can critically alter the conclusion of a study, as a result of compositional changes of the control data set in the analysis. We propose a novel approach, combining resampling with empirical Bayes methods: the Resampling-based empirical Bayes Methods. This approach not only reduces false discovery rates for non-normally distributed microarray data, but it is also impervious to the fold change threshold since no control data set selection is needed. Through simulation studies, sensitivities, specificities, total rejections, and false discovery rates are compared across Smyth's parametric method, the Significance Analysis of Microarrays, and the Resampling-based empirical Bayes Methods. Differences in false discovery rate control between the approaches are illustrated through a preterm delivery methylation study. The results show that the Resampling-based empirical Bayes Methods offer significantly higher specificity and lower false discovery rates compared to Smyth's parametric method when data are not normally distributed. The Resampling-based empirical Bayes Methods also offer higher statistical power than the Significance Analysis of Microarrays method when the proportion of significantly differentially expressed genes is large for both normally and non-normally distributed data. Finally, the Resampling-based empirical Bayes Methods are generalizable to next generation sequencing RNA-seq data analysis.
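The resampling side of such an approach can be illustrated with a minimal Python sketch, under stated assumptions. It covers only a plain two-group permutation test per locus followed by Benjamini-Hochberg adjustment; it does not include the empirical Bayes component and is not the authors' procedure. The function names `permutation_pvalue` and `benjamini_hochberg`, the number of permutations, and the example values are all hypothetical.

```python
import random


def permutation_pvalue(case, control, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    Group labels are shuffled n_perm times; the p-value is the share of
    shuffles whose absolute mean difference is at least the observed one,
    with the usual +1 correction so that it is never exactly zero.
    """
    rng = random.Random(seed)
    n_case = len(case)
    pooled = list(case) + list(control)
    n_control = len(pooled) - n_case
    observed = abs(sum(case) / n_case - sum(control) / n_control)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_case]) / n_case - sum(pooled[n_case:]) / n_control)
        if diff >= observed - 1e-12:  # small tolerance for floating-point ties
            hits += 1
    return (hits + 1) / (n_perm + 1)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Illustrative methylation values for two loci, not data from the study.
    loci = {
        "locus_a": ([0.90, 0.80, 0.85, 0.95, 0.90], [0.10, 0.20, 0.15, 0.10, 0.05]),
        "locus_b": ([0.50, 0.55, 0.45, 0.50, 0.52], [0.50, 0.48, 0.53, 0.51, 0.49]),
    }
    raw = [permutation_pvalue(case, control) for case, control in loci.values()]
    for name, p, q in zip(loci, raw, benjamini_hochberg(raw)):
        print(name, round(p, 4), round(q, 4))
```

A permutation test makes no normality assumption, which is the property the abstract relies on for non-normally distributed data; a moderated test statistic could replace the raw mean difference inside the same resampling loop.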