1.  Prediction of epigenetically regulated genes in breast cancer cell lines 
BMC Bioinformatics  2010;11:305.
Methylation of CpG islands within the DNA promoter regions is one mechanism that leads to aberrant gene expression in cancer. In particular, the abnormal methylation of CpG islands may silence associated genes. Therefore, using high-throughput microarrays to measure CpG island methylation will lead to better understanding of tumor pathobiology and progression, while revealing potentially new biomarkers. We have examined a recently developed high-throughput technology for measuring genome-wide methylation patterns called mTACL. Here, we propose a computational pipeline for integrating gene expression and CpG island methylation profiles to identify epigenetically regulated genes for a panel of 45 breast cancer cell lines, which is widely used in the Integrative Cancer Biology Program (ICBP). The pipeline (i) reduces the dimensionality of the methylation data, (ii) associates the reduced methylation data with gene expression data, and (iii) ranks methylation-expression associations according to their epigenetic regulation. Dimensionality reduction is performed in two steps: (i) methylation sites are grouped across the genome to identify regions of interest, and (ii) methylation profiles are clustered within each region. Associations between the clustered methylation and the gene expression data sets generate candidate matches within a fixed neighborhood around each gene. Finally, the methylation-expression associations are ranked through a logistic regression, and their significance is quantified through permutation analysis.
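The permutation step described above can be illustrated with a small sketch. It is not the paper's pipeline: a plain Pearson correlation stands in for the logistic regression, and the function names, the eight sample values, and the number of permutations are all assumptions made for illustration.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def permutation_pvalue(methylation, expression, n_perm=2000, seed=0):
    """One-sided permutation p-value for a negative association.

    Expression values are shuffled across samples; the p-value is the
    fraction of shuffles whose correlation is at least as negative as
    the observed one (with the usual +1 correction).
    """
    rng = random.Random(seed)
    observed = pearson(methylation, expression)
    shuffled = list(expression)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pearson(methylation, shuffled) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical data: one methylation cluster and one gene across 8 cell lines.
methylation = [0.90, 0.80, 0.85, 0.20, 0.10, 0.15, 0.70, 0.30]
expression = [1.0, 1.2, 0.9, 5.1, 6.0, 5.5, 2.0, 4.0]

observed_r, p_value = permutation_pvalue(methylation, expression)
print(f"r = {observed_r:.3f}, permutation p = {p_value:.4f}")
```

Shuffling sample labels keeps the marginal distributions of both variables intact, so the null distribution reflects only the loss of pairing between methylation and expression.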
Our two-step dimensionality reduction compressed 90% of the original data, reducing 137,688 methylation sites to 14,505 clusters. Methylation-expression associations produced 18,312 correspondences, which were used to further analyze epigenetic regulation. Logistic regression was used to identify 58 genes from these correspondences that showed a statistically significant negative correlation between methylation profiles and gene expression in the panel of breast cancer cell lines. Subnetwork enrichment of these genes has identified 35 common regulators with 6 or more predicted markers. In addition to identifying epigenetically regulated genes, we show evidence of differentially expressed methylation patterns between the basal and luminal subtypes.
Our results indicate that the proposed computational protocol is a viable platform for identifying epigenetically regulated genes. Our protocol has generated a list of predictors including COL1A2, TOP2A, TFF1, and VAV3, genes whose key roles in epigenetic regulation are documented in the literature. Subnetwork enrichment of these predicted markers further suggests that epigenetic regulation of individual genes occurs in a coordinated fashion and through common regulators.
PMCID: PMC2903569  PMID: 20525369
2.  Integrative DNA Methylation and Gene Expression Analyses Identify DNA Packaging and Epigenetic Regulatory Genes Associated with Low Motility Sperm 
PLoS ONE  2011;6(6):e20280.
In previous studies using candidate gene approaches, low sperm count (oligospermia) has been associated with altered sperm mRNA content and DNA methylation in both imprinted and non-imprinted genes. We performed a genome-wide analysis of sperm DNA methylation and mRNA content to test for associations with sperm function.
Methods and Results
Sperm DNA and mRNA were isolated from 21 men with a range of semen parameters presenting to a tertiary male reproductive health clinic. DNA methylation was measured with the Illumina Infinium array at 27,578 CpG loci. Unsupervised clustering of methylation data differentiated the 21 sperm samples by their motility values. Recursively partitioned mixture modeling (RPMM) of methylation data resulted in four distinct methylation profiles that were significantly associated with sperm motility (P = 0.01). Linear models of microarray analysis (LIMMA) was performed based on motility and identified 9,189 CpG loci with significantly altered methylation (Q<0.05) in the low motility samples. In addition, the majority of these disrupted CpG loci (80%) were hypomethylated. Of the aberrantly methylated CpGs, 194 were associated with imprinted genes and were almost equally distributed into hypermethylated (predominantly paternally expressed) and hypomethylated (predominantly maternally expressed) groups. Sperm mRNA was measured with the Human Gene 1.0 ST Affymetrix GeneChip Array. LIMMA analysis identified 20 candidate transcripts as differentially present in low motility sperm, including HDAC1 (NCBI 3065), SIRT3 (NCBI 23410), and DNMT3A (NCBI 1788). There was a trend among altered expression of these epigenetic regulatory genes and RPMM DNA methylation class.
Using integrative genome-wide approaches we identified CpG methylation profiles and mRNA alterations associated with low sperm motility.
PMCID: PMC3107223  PMID: 21674046
3.  Model-based clustering of DNA methylation array data: a recursive-partitioning algorithm for high-dimensional data arising as a mixture of beta distributions 
BMC Bioinformatics  2008;9:365.
Epigenetics is the study of heritable changes in gene function that cannot be explained by changes in DNA sequence. One of the most commonly studied epigenetic alterations is cytosine methylation, which is a well recognized mechanism of epigenetic gene silencing and often occurs at tumor suppressor gene loci in human cancer. Arrays are now being used to study DNA methylation at a large number of loci; for example, the Illumina GoldenGate platform assesses DNA methylation at 1505 loci associated with over 800 cancer-related genes. Model-based cluster analysis is often used to identify DNA methylation subgroups in data, but it is unclear how to cluster DNA methylation data from arrays in a scalable and reliable manner.
We propose a novel model-based recursive-partitioning algorithm to navigate clusters in a beta mixture model. We present simulations that show that the method is more reliable than competing nonparametric clustering approaches, and is at least as reliable as conventional mixture model methods. We also show that our proposed method is more computationally efficient than conventional mixture model approaches. We demonstrate our method on the normal tissue samples and show that the clusters are associated with tissue type as well as age.
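As a rough illustration of why beta distributions suit methylation proportions, the sketch below fits one beta distribution per group and compares log-densities for a new value. It is a deliberate simplification and not the recursively partitioned algorithm itself: method-of-moments estimates replace the iterative mixture fitting, there is no recursion, and the two groups and their values are hypothetical.

```python
import math
import statistics


def fit_beta_moments(values):
    """Method-of-moments estimates (a, b) for Beta-distributed values in (0, 1)."""
    m = statistics.fmean(values)
    v = statistics.variance(values)
    common = m * (1.0 - m) / v - 1.0  # requires v < m * (1 - m)
    return m * common, (1.0 - m) * common


def beta_logpdf(x, a, b):
    """Log-density of Beta(a, b) at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x)


# Hypothetical methylation proportions at one locus in two sample groups.
low_group = [0.05, 0.08, 0.10, 0.07, 0.12]
high_group = [0.80, 0.85, 0.90, 0.78, 0.88]

low_params = fit_beta_moments(low_group)
high_params = fit_beta_moments(high_group)

# A new sample is assigned to whichever component gives the higher log-density.
new_value = 0.83
assigned = "high" if beta_logpdf(new_value, *high_params) > beta_logpdf(new_value, *low_params) else "low"
print(assigned)
```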
Our proposed recursively-partitioned mixture model is an effective and computationally efficient method for clustering DNA methylation data.
PMCID: PMC2553421  PMID: 18782434
4.  Aberrant DNA Methylation of OLIG1, a Novel Prognostic Factor in Non-Small Cell Lung Cancer 
PLoS Medicine  2007;4(3):e108.
Lung cancer is the leading cause of cancer-related death worldwide. Currently, tumor, node, metastasis (TNM) staging provides the most accurate prognostic parameter for patients with non-small cell lung cancer (NSCLC). However, the overall survival of patients with resectable tumors varies significantly, indicating the need for additional prognostic factors to better predict the outcome of the disease, particularly within a given TNM subset.
Methods and Findings
In this study, we investigated whether adenocarcinomas and squamous cell carcinomas could be differentiated based on their global aberrant DNA methylation patterns. We performed restriction landmark genomic scanning on 40 patient samples and identified 47 DNA methylation targets that together could distinguish the two lung cancer subgroups. The protein expression of one of those targets, oligodendrocyte transcription factor 1 (OLIG1), significantly correlated with survival in NSCLC patients, as shown by univariate and multivariate analyses. Furthermore, the hazard ratio for patients negative for OLIG1 protein was significantly higher than the one for those patients expressing the protein, even at low levels.
Multivariate analyses of our data confirmed that OLIG1 protein expression significantly correlates with overall survival in NSCLC patients, with a relative risk of 0.84 (95% confidence interval 0.77–0.91, p < 0.001) along with T and N stages, as indicated by a Cox proportional hazard model. Taken together, our results suggest that OLIG1 protein expression could be utilized as a novel prognostic factor, which could aid in deciding which NSCLC patients might benefit from more aggressive therapy. This is potentially of great significance, as the addition of postoperative adjuvant chemotherapy in T2N0 NSCLC patients is still controversial.
Christopher Plass and colleagues find that OLIG1 expression correlates with survival in lung cancer patients and suggest that it could be used in deciding which patients are likely to benefit from more aggressive therapy.
Editors' Summary
Lung cancer is the commonest cause of cancer-related death worldwide. Most cases are of a type called non-small cell lung cancer (NSCLC). Like other cancers, treatment of NSCLC depends on the “TNM stage” at which the cancer is detected. Staging takes into account the size and local spread of the tumor (its T classification), whether nearby lymph nodes contain tumor cells (its N classification), and whether tumor cells have spread (metastasized) throughout the body (its M classification). Stage I tumors are confined to the lung and are removed surgically. Stage II tumors have spread to nearby lymph nodes and are treated with a combination of surgery and chemotherapy. Stage III tumors have spread throughout the chest, and stage IV tumors have metastasized around the body; patients with both of these stages are treated with chemotherapy alone. About 70% of patients with stage I or II lung cancer, but only 2% of patients with stage IV lung cancer, survive for five years after diagnosis.
Why Was This Study Done?
TNM staging is the best way to predict the likely outcome (prognosis) for patients with NSCLC, but survival times for patients with stage I and II tumors vary widely. Another prognostic marker—maybe a “molecular signature”—that could distinguish patients who are likely to respond to treatment from those whose cancer will inevitably progress would be very useful. Unlike normal cells, cancer cells divide uncontrollably and can move around the body. These behavioral changes are caused by alterations in the pattern of proteins expressed by the cells. But what causes these alterations? The answer in some cases is “epigenetic changes” or chemical modifications of genes. In cancer cells, methyl groups are aberrantly added to GC-rich gene regions. These so-called “CpG islands” lie near gene promoters (sequences that control the transcription of DNA into mRNA, the template for protein production), and their methylation stops the promoters working and silences the gene. In this study, the researchers have investigated whether aberrant methylation patterns vary between NSCLC subtypes and whether specific aberrant methylations are associated with survival and can, therefore, be used prognostically.
What Did the Researchers Do and Find?
The researchers used “restriction landmark genomic scanning” (RLGS) to catalog global aberrant DNA methylation patterns in human lung tumor samples. In RLGS, DNA is cut into fragments with a restriction enzyme (a protein that cuts at specific DNA sequences), end-labeled, and separated using two-dimensional gel electrophoresis to give a pattern of spots. Because methylation stops some restriction enzymes cutting their target sequence, normal lung tissue and lung tumor samples yield different patterns of spots. The researchers used these patterns to identify 47 DNA methylation targets (many in CpG islands) that together distinguished between adenocarcinomas and squamous cell carcinomas, two major types of NSCLCs. Next, they measured mRNA production from the genes with the greatest difference in methylation between adenocarcinomas and squamous cell carcinomas. OLIG1 (the gene that encodes a protein involved in nerve cell development) had one of the highest differences in mRNA production between these tumor types. Furthermore, three-quarters of NSCLCs had reduced or no expression of OLIG1 protein and, when the researchers analyzed the association between OLIG1 protein expression and overall survival in patients with NSCLC, reduced OLIG1 protein expression was associated with reduced survival.
What Do These Findings Mean?
These findings indicate that different types of NSCLC can be distinguished by examining their aberrant methylation patterns. This suggests that the establishment of different DNA methylation patterns might be related to the cell type from which the tumors developed. Alternatively, the different aberrant methylation patterns might reflect the different routes that these cells take to becoming tumor cells. This research identifies a potential new prognostic marker for NSCLC by showing that OLIG1 protein expression correlates with overall survival in patients with NSCLC. This correlation needs to be tested in a clinical setting to see if adding OLIG1 expression to the current prognostic parameters can lead to better treatment choices for early-stage lung cancer patients and ultimately improve these patients' overall survival.
Additional Information.
Please access these Web sites via the online version of this summary at
Patient and professional information on lung cancer, including staging (in English and Spanish), is available from the US National Cancer Institute
The MedlinePlus encyclopedia has pages on non-small cell lung cancer (in English and Spanish)
Cancerbackup provides patient information on lung cancer
CancerQuest, provided by Emory University, has information about how cancer develops (in English, Spanish, Chinese and Russian)
Wikipedia pages on epigenetics (note that Wikipedia is a free online encyclopedia that anyone can edit)
The Epigenome Network of Excellence gives background information and the latest news about epigenetics (in several European languages)
PMCID: PMC1831740  PMID: 17388669
5.  Non-specific filtering of beta-distributed data 
BMC Bioinformatics  2014;15:199.
Non-specific feature selection is a dimension reduction procedure performed prior to cluster analysis of high dimensional molecular data. Not all measured features are expected to show biological variation, so only the most varying are selected for analysis. In DNA methylation studies, DNA methylation is measured as a proportion, bounded between 0 and 1, with variance a function of the mean. Filtering on standard deviation biases the selection of probes to those with mean values near 0.5. We explore the effect this has on clustering, and develop alternate filter methods that utilize a variance stabilizing transformation for Beta distributed data and do not share this bias.
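The bias described above can be seen on a toy example. The sketch below ranks two hypothetical probes by standard deviation, first on the raw proportions and then after an arcsine square-root transform, a standard variance-stabilizing transformation for proportions. The transform is used here as a stand-in, since the exact filter statistic developed in the paper may differ, and the probe names and values are assumptions.

```python
import math
import statistics


def asin_sqrt(beta):
    """Arcsine square-root transform of a proportion in [0, 1]."""
    return math.asin(math.sqrt(beta))


def rank_probes(data, transform=None):
    """Return probe names sorted by decreasing standard deviation.

    If a transform is given, it is applied to every value first.
    """
    def spread(values):
        if transform is not None:
            values = [transform(v) for v in values]
        return statistics.stdev(values)

    return sorted(data, key=lambda name: spread(data[name]), reverse=True)


# Hypothetical methylation proportions for two probes across six samples.
probes = {
    # Noisy probe with mean near 0.5 and no subgroup structure.
    "mid_noise": [0.40, 0.60, 0.45, 0.55, 0.42, 0.58],
    # Normally unmethylated probe with one aberrantly methylated sample.
    "cimp_like": [0.02, 0.03, 0.02, 0.04, 0.03, 0.20],
}

raw_ranking = rank_probes(probes)
vst_ranking = rank_probes(probes, transform=asin_sqrt)
print("raw SD ranking:", raw_ranking)
print("transformed SD ranking:", vst_ranking)
```

On these values the raw filter prefers the probe near 0.5, while the transformed filter prefers the probe whose variation comes from a single aberrant sample, which mirrors the behaviour the abstract attributes to the two kinds of filter.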
We compared results for 11 different non-specific filters on eight Infinium HumanMethylation data sets, selected to span a variety of biological conditions. We found that for data sets having a small fraction of samples showing abnormal methylation of a subset of normally unmethylated CpGs, a characteristic of the CpG island methylator phenotype in cancer, a novel filter statistic that utilized a variance-stabilizing transformation for Beta distributed data outperformed the common filter of using standard deviation of the DNA methylation proportion, or its log-transformed M-value, in its ability to detect the cancer subtype in a cluster analysis. However, the standard deviation filter always performed among the best for distinguishing subgroups of normal tissue. The novel filter and standard deviation filter tended to favour features in different genome contexts; for the same data set, the novel filter always selected more features from CpG island promoters and the standard deviation filter always selected more features from non-CpG island intergenic regions. Interestingly, despite selecting largely non-overlapping sets of features, the two filters did find sample subsets that overlapped for some real data sets.
We found two different filter statistics that tended to prioritize features with different characteristics; each performed well for identifying clusters of cancer and non-cancer tissue, and for identifying a cancer CpG island hypermethylation phenotype. Since cluster analysis is for discovery, we would suggest trying both filters on any new data sets, evaluating the overlap of features selected and clusters discovered.
PMCID: PMC4230495  PMID: 24943962
6.  An integrative characterization of recurrent molecular aberrations in glioblastoma genomes 
Nucleic Acids Research  2013;41(19):8803-8821.
Glioblastoma multiforme (GBM) is the most common and malignant primary brain tumor in adults. Decades of investigations and the recent effort of the Cancer Genome Atlas (TCGA) project have mapped many molecular alterations in GBM cells. Alterations on DNAs may dysregulate gene expressions and drive malignancy of tumors. It is thus important to uncover causal and statistical dependency between ‘effector’ molecular aberrations and ‘target’ gene expressions in GBMs. A rich collection of prior studies attempted to combine copy number variation (CNV) and mRNA expression data. However, systematic methods to integrate multiple types of cancer genomic data—gene mutations, single nucleotide polymorphisms, CNVs, DNA methylations, mRNA and microRNA expressions and clinical information—are relatively scarce. We proposed an algorithm to build ‘association modules’ linking effector molecular aberrations and target gene expressions and applied the module-finding algorithm to the integrated TCGA GBM data sets. The inferred association modules were validated by six tests using external information and datasets of central nervous system tumors: (i) indication of prognostic effects among patients; (ii) coherence of target gene expressions; (iii) retention of effector–target associations in external data sets; (iv) recurrence of effector molecular aberrations in GBM; (v) functional enrichment of target genes; and (vi) co-citations between effectors and targets. Modules associated with well-known molecular aberrations of GBM—such as chromosome 7 amplifications, chromosome 10 deletions, EGFR and NF1 mutations—passed the majority of the validation tests. Furthermore, several modules associated with less well-reported molecular aberrations—such as chromosome 11 CNVs, CD40, PLXNB1 and GSTM1 methylations, and mir-21 expressions—were also validated by external information. 
In particular, modules constituting trans-acting effects with chromosome 11 CNVs and cis-acting effects with chromosome 10 CNVs manifested strong negative and positive associations with survival times in brain tumors. By aligning the information of association modules with the established GBM subclasses based on transcription or methylation levels, we found each subclass possessed multiple concurrent molecular aberrations. Furthermore, the joint molecular characteristics derived from 16 association modules had prognostic power not explained away by the strong biomarker of CpG island methylator phenotypes. Functional and survival analyses indicated that immune/inflammatory responses and epithelial-mesenchymal transitions were among the most important determining processes of prognosis. Finally, we demonstrated that certain molecular aberrations uniquely recurred in GBM but were relatively rare in non-GBM glioma cells. These results justify the utility of an integrative analysis on cancer genomes and provide testable characterizations of driver aberration events in GBM.
PMCID: PMC3799430  PMID: 23907387
7.  DNA methylation subgroups and the CpG island methylator phenotype in gastric cancer: a comprehensive profiling approach 
BMC Gastroenterology  2014;14:55.
Methylation-induced silencing of promoter CpG islands in tumor suppressor genes plays an important role in human carcinogenesis. In colorectal cancer, the CpG island methylator phenotype (CIMP) is defined as widespread and elevated levels of DNA methylation and CIMP+ tumors have distinctive clinicopathological and molecular features. In contrast, the existence of a comparable CIMP subtype in gastric cancer (GC) has not been clearly established. To further investigate this issue, in the present study we performed comprehensive DNA methylation profiling of a well-characterised series of primary GC.
The methylation status of 1,421 autosomal CpG sites located within 768 cancer-related genes was investigated using the Illumina GoldenGate Methylation Panel I assay on DNA extracted from 60 gastric tumors and matched tumor-adjacent gastric tissue pairs. Methylation data was analysed using a recursively partitioned mixture model and investigated for associations with clinicopathological and molecular features including age, Helicobacter pylori status, tumor site, patient survival, microsatellite instability and BRAF and KRAS mutations.
A total of 147 genes were differentially methylated between tumor and matched tumor-adjacent gastric tissue, with HOXA5 and hedgehog signalling being the top-ranked gene and signalling pathway, respectively. Unsupervised clustering of methylation data revealed the existence of 6 subgroups under two main clusters, referred to as L (low methylation; 28% of cases) and H (high methylation; 72%). Female patients were over-represented in the H tumor group compared to the L group (36% vs 6%; P = 0.024); however, no other significant differences in clinicopathological or molecular features were apparent. CpG sites that were hypermethylated in group H were more frequently located in CpG islands and marked for polycomb occupancy.
High-throughput methylation analysis implicates genes involved in embryonic development and hedgehog signaling in gastric tumorigenesis. GC is comprised of two major methylation subtypes, with the highly methylated group showing some features consistent with a CpG island methylator phenotype.
PMCID: PMC3986689  PMID: 24674026
8.  Epigenomic diversity of colorectal cancer indicated by LINE-1 methylation in a database of 869 tumors 
Molecular Cancer  2010;9:125.
Genome-wide DNA hypomethylation plays a role in genomic instability and carcinogenesis. LINE-1 (L1 retrotransposon) constitutes a substantial portion of the human genome, and LINE-1 methylation correlates with global DNA methylation status. LINE-1 hypomethylation in colon cancer has been strongly associated with poor prognosis. However, whether LINE-1 hypomethylators constitute a distinct cancer subtype remains uncertain. Recent evidence for concordant LINE-1 hypomethylation within synchronous colorectal cancer pairs suggests the presence of a non-stochastic mechanism influencing tumor LINE-1 methylation level. Thus, it is of particular interest to examine whether its wide variation can be attributed to clinical, pathologic or molecular features.
Utilizing a database of 869 colorectal cancers in two prospective cohort studies, we constructed multivariate linear and logistic regression models for LINE-1 methylation (quantified by Pyrosequencing). Variables included age, sex, body mass index, family history of colorectal cancer, smoking status, tumor location, stage, grade, mucinous component, signet ring cells, tumor infiltrating lymphocytes, CpG island methylator phenotype (CIMP), microsatellite instability, expression of TP53 (p53), CDKN1A (p21), CTNNB1 (β-catenin), PTGS2 (cyclooxygenase-2), and FASN, and mutations in KRAS, BRAF, and PIK3CA.
Tumoral LINE-1 methylation ranged from 23.1 to 90.3 on a 0-100 scale (mean 61.4; median 62.3; standard deviation 9.6), and was distributed approximately normally except for extreme hypomethylators [LINE-1 methylation < 40; N = 22 (2.5%), far more than would be expected under a normal distribution]. LINE-1 extreme hypomethylators were significantly associated with younger patients (p = 0.0058). Residual plot by multivariate linear regression showed that LINE-1 extreme hypomethylators clustered as one distinct group, separate from the main tumor group. The multivariate linear regression model could explain 8.4% of the total variability of LINE-1 methylation (R-square = 0.084). Multivariate logistic regression models for binary LINE-1 hypomethylation outcomes (cutoffs of 40, 50 and 60) showed at most fair predictive ability (area under the receiver operating characteristic curve < 0.63).
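The claim that 22 extreme hypomethylators exceed the normal expectation can be checked with a short calculation. The sketch below uses only the summary statistics quoted above (869 tumors, mean 61.4, standard deviation 9.6, cutoff 40) and assumes an exactly normal distribution with those parameters; the variable names are illustrative.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


n_tumors = 869
mean, sd = 61.4, 9.6
cutoff = 40.0
observed = 22

# Probability that a single tumor falls below the cutoff under normality.
tail_prob = normal_cdf(cutoff, mean, sd)
# Expected number of tumors below the cutoff among all 869.
expected = n_tumors * tail_prob

print(f"tail probability = {tail_prob:.4f}, expected count = {expected:.1f}, observed = {observed}")
```

Under these assumptions the expected count is about 11, so the observed 22 is roughly twice what a normal distribution would predict.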
LINE-1 extreme hypomethylators appear to constitute a previously-unrecognized, distinct subtype of colorectal cancers, which needs to be confirmed by additional studies. Our tumor LINE-1 methylation data indicate enormous epigenomic diversity of individual colorectal cancers.
PMCID: PMC2892454  PMID: 20507599
9.  A-clustering: a novel method for the detection of co-regulated methylation regions, and regions associated with exposure 
Bioinformatics  2013;29(22):2884-2891.
Motivation: DNA methylation is a heritable modifiable chemical process that affects gene transcription and is associated with other molecular markers (e.g. gene expression) and biomarkers (e.g. cancer or other diseases). Current technology measures methylation at hundreds of thousands or millions of CpG sites throughout the genome. It is evident that neighboring CpG sites are often highly correlated with each other, and current literature suggests that clusters of adjacent CpG sites are co-regulated.
Results: We develop the Adjacent Site Clustering (A-clustering) algorithm to detect sets of neighboring CpG sites that are correlated with each other. To detect methylation regions associated with exposure, we propose an analysis pipeline for high-dimensional methylation data in which CpG sites within regions identified by A-clustering are modeled as multivariate responses to environmental exposure using a generalized estimating equation approach that assumes exposure equally affects all sites in the cluster. We develop a correlation-preserving simulation scheme, and study the proposed methodology via simulations. We study the clusters detected by the algorithm on a high-dimensional dataset of peripheral blood methylation of pesticide applicators.
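The idea of grouping correlated neighboring sites can be sketched with a greedy pass along the genome. This is a simplified stand-in and not the A-clustering algorithm or the Aclust package: consecutive sites are merged whenever they are close together and their profiles are strongly correlated. The function names, the thresholds, and the example data are all assumptions.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def adjacent_clusters(positions, profiles, min_corr=0.7, max_gap=1000):
    """Greedily group consecutive CpG sites into clusters of site indices.

    positions: genomic coordinates, sorted in increasing order.
    profiles:  one list of methylation values (across samples) per site.
    A site joins the current cluster if it lies within max_gap bases of
    the previous site and correlates with it at min_corr or higher.
    """
    if not positions:
        return []
    clusters = [[0]]
    for i in range(1, len(positions)):
        close = positions[i] - positions[i - 1] <= max_gap
        similar = pearson(profiles[i - 1], profiles[i]) >= min_corr
        if close and similar:
            clusters[-1].append(i)
        else:
            clusters.append([i])
    return clusters


# Hypothetical data: five sites measured in four samples.
positions = [100, 250, 400, 5000, 5100]
profiles = [
    [0.10, 0.50, 0.90, 0.30],
    [0.15, 0.55, 0.85, 0.35],
    [0.20, 0.50, 0.95, 0.30],
    [0.90, 0.20, 0.10, 0.60],
    [0.85, 0.25, 0.15, 0.65],
]

clusters = adjacent_clusters(positions, profiles)
print(clusters)
```

Comparing only consecutive sites keeps the pass linear in the number of sites, at the cost of ignoring correlations between non-adjacent sites within a region.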
Availability: We provide the R package Aclust that efficiently implements the A-clustering and the analysis pipeline, and produces analysis reports. The package is found on
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3810849  PMID: 23990415
10.  Bis-class: a new classification tool of methylation status using bayes classifier and local methylation information 
BMC Genomics  2014;15(1):608.
Whole-genome sequencing of bisulfite-converted DNA (‘methylC-seq’) provides comprehensive information on DNA methylation. An important application of these whole genome methylation maps is classifying each position as a methylated versus non-methylated nucleotide. A widely used current method for this purpose, the so-called binomial method, is intuitive and straightforward, but lacks power when the sequence coverage and the genome-wide methylation level are low. These problems present a particular challenge when analyzing sparsely methylated genomes, such as those of many invertebrates and plants.
We demonstrate that the number of sequence reads per position from methylC-seq data displays a large variance and can be modeled as a shifted negative binomial distribution. We also show that DNA methylation levels of adjacent CpG sites are correlated, and this similarity in local DNA methylation levels extends several kilobases. Taking these observations into account, we propose a new method based on Bayesian classification to infer DNA methylation status while considering the neighborhood DNA methylation levels of a specific site. We show that our approach has higher sensitivity and better classification performance than the binomial method via multiple analyses, including computational simulations, Area Under Curve (AUC) analyses, and improved consistencies across biological replicates. This method is especially advantageous in the analyses of sparsely methylated genomes with low coverage.
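For context, the binomial method that this approach is compared against can be sketched in a few lines. The sketch shows only the baseline test, not the Bayesian classifier proposed here: a site is called methylated when the number of methylated reads is unlikely under a background error rate. The error rate of 0.01, the significance threshold, and the function names are assumptions, and a real analysis would normally also correct for multiple testing.

```python
import math


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def call_methylated(k, n, error_rate=0.01, alpha=0.01):
    """Call a site methylated if k methylated reads out of n are unlikely
    to arise from the background error rate alone."""
    return binomial_tail(k, n, error_rate) < alpha


# Well-covered site: 5 methylated reads out of 20.
high_coverage_call = call_methylated(5, 20)
# Poorly covered site: 1 methylated read out of 3.
low_coverage_call = call_methylated(1, 3)

print(high_coverage_call, low_coverage_call)
```

The second site has a higher methylated fraction than the first but is not called, because with three reads a single methylated read is still compatible with the assumed error rate. This is the loss of power at low coverage that the abstract describes.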
Our method improves the existing binomial method for binary methylation calls by utilizing a posterior odds framework and incorporating local methylation information. This method should be widely applicable to the analyses of methylC-seq data from diverse sparsely methylated genomes. Bis-Class and example data are provided at a dedicated website (
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-15-608) contains supplementary material, which is available to authorized users.
PMCID: PMC4117951  PMID: 25037738
11.  Single-CpG-resolution methylome analysis identifies clinicopathologically aggressive CpG island methylator phenotype clear cell renal cell carcinomas 
Carcinogenesis  2012;33(8):1487-1493.
To clarify the significance of DNA methylation alterations during renal carcinogenesis, methylome analysis using single-CpG-resolution Infinium array was performed on 29 normal renal cortex tissue (C) samples, 107 non-cancerous renal cortex tissue (N) samples obtained from patients with clear cell renal cell carcinomas (RCCs) and 109 tumorous tissue (T) samples. DNA methylation levels at 4830 CpG sites were already altered in N samples compared with C samples. Unsupervised hierarchical clustering analysis based on DNA methylation levels at the 801 CpG sites, where DNA methylation alterations had occurred in N samples and were inherited by and strengthened in T samples, clustered clear cell RCCs into Cluster A (n = 90) and Cluster B (n = 14). Clinicopathologically aggressive tumors were accumulated in Cluster B, and the cancer-free and overall survival rates of patients in this cluster were significantly lower than those of patients in Cluster A. Clear cell RCCs in Cluster B were characterized by accumulation of DNA hypermethylation on CpG islands and considered to be CpG island methylator phenotype (CIMP)-positive cancers. DNA hypermethylation of the CpG sites on the FAM150A, GRM6, ZNF540, ZFP42, ZNF154, RIMS4, PCDHAC1, KHDRBS2, ASCL2, KCNQ1, PRAC, WNT3A, TRH, FAM78A, ZNF671, SLC13A5 and NKX6-2 genes became hallmarks of CIMP in RCCs. On the other hand, Cluster A was characterized by genome-wide DNA hypomethylation. These data indicated that DNA methylation alterations at precancerous stages may determine tumor aggressiveness and patient outcome. Accumulation of DNA hypermethylation on CpG islands and genome-wide DNA hypomethylation may each underlie distinct pathways of renal carcinogenesis.
Abbreviations: BAMCA, bacterial artificial chromosome array-based methylated CpG island amplification; C, normal renal cortex tissue obtained from patients without any primary renal tumor; CIMP, CpG island methylator phenotype; HCC, hepatocellular carcinoma; N, non-cancerous renal cortex tissue obtained from patients with clear cell renal cell carcinomas; NCBI, National Center for Biotechnology Information; RCC, renal cell carcinoma; T, tumorous tissue; TNM, Tumor-Node-Metastasis
PMCID: PMC3418891  PMID: 22610075
12.  The Honey Bee Epigenomes: Differential Methylation of Brain DNA in Queens and Workers 
PLoS Biology  2010;8(11):e1000506.
Using genome-wide methylation profiles in honey bee queen and worker brains to understand how contrasting organismal outputs are generated from the same genotype.
In honey bees (Apis mellifera) the behaviorally and reproductively distinct queen and worker female castes derive from the same genome as a result of differential intake of royal jelly and are implemented in concert with DNA methylation. To determine if these very different diet-controlled phenotypes correlate with unique brain methylomes, we conducted a study to determine the methyl cytosine (mC) distribution in the brains of queens and workers at single-base-pair resolution using shotgun bisulfite sequencing technology. The whole-genome sequencing was validated by deep 454 sequencing of selected amplicons representing eight methylated genes. We found that nearly all mCs are located in CpG dinucleotides in the exons of 5,854 genes showing greater sequence conservation than non-methylated genes. Over 550 genes show significant methylation differences between queens and workers, revealing the intricate dynamics of methylation patterns. The distinctiveness of the differentially methylated genes is underscored by their intermediate CpG densities relative to drastically CpG-depleted methylated genes and to CpG-richer non-methylated genes. We find a strong correlation between methylation patterns and splicing sites including those that have the potential to generate alternative exons. We validate our genome-wide analyses by a detailed examination of two transcript variants encoded by one of the differentially methylated genes. The link between methylation and splicing is further supported by the differential methylation of genes belonging to the histone gene family. We propose that modulation of alternative splicing is one mechanism by which DNA methylation could be linked to gene regulation in the honey bee. Our study describes a level of molecular diversity previously unknown in honey bees that might be important for generating phenotypic flexibility not only during development but also in the adult post-mitotic brain.
Author Summary
The queen honey bee and her worker sisters do not seem to have much in common. Workers are active and intelligent, skillfully navigating the outside world in search of food for the colony. They never reproduce; that task is left entirely to the much larger and longer-lived queen, who is permanently ensconced within the colony and uses a powerful chemical influence to exert control. Remarkably, these two female castes are generated from identical genomes. The key to each female's developmental destiny is her diet as a larva: future queens are raised on royal jelly. This specialized diet is thought to affect a particular chemical modification, methylation, of the bee's DNA, causing the same genome to be deployed differently. To document differences in this epigenomic setting and hypothesize about its effects on behavior, we performed high-resolution bisulphite sequencing of whole genomes from the brains of queen and worker honey bees. In contrast to the heavily methylated human genome, we found that only a small and specific fraction of the honey bee genome is methylated. Most methylation occurred within conserved genes that provide critical cellular functions. Over 550 genes showed significant methylation differences between the queen and the worker, which may contribute to the profound divergence in behavior. How DNA methylation works on these genes remains unclear, but it may change their accessibility to the cellular machinery that controls their expression. We found a tantalizing clue to a mechanism in the clustering of methylation within parts of genes where splicing occurs, suggesting that methylation could control which of several versions of a gene is expressed. Our study provides the first documentation of extensive molecular differences that may allow honey bees to generate different phenotypes from the same genome.
PMCID: PMC2970541  PMID: 21072239
13.  Recursively partitioned mixture model clustering of DNA methylation data using biologically informed correlation structures 
DNA methylation is a well-recognized epigenetic mechanism that has been the subject of a growing body of literature typically focused on the identification and study of profiles of DNA methylation and their association with human diseases and exposures. In recent years, a number of unsupervised clustering algorithms, both parametric and non-parametric, have been proposed for clustering large-scale DNA methylation data. However, most of these approaches do not incorporate known biological relationships of measured features, and in some cases, rely on unrealistic assumptions regarding the nature of DNA methylation. Here, we propose a modified version of a recursively partitioned mixture model (RPMM) that integrates information related to the proximity of CpG loci within the genome to inform correlation structures from which subsequent clustering analysis is based. Using simulations and four methylation data sets, we demonstrate that integrating biologically informative correlation structures within RPMM resulted in improved goodness-of-fit, clustering consistency, and the ability to detect biologically meaningful clusters compared to methods which ignore such correlation. Integrating biologically-informed correlation structures to enhance modeling techniques is motivated by the rapid increase in resolution of DNA methylation microarrays and the increasing understanding of the biology of this epigenetic mechanism.
PMCID: PMC4007267  PMID: 23468465
14.  Genome-Wide DNA Methylation Analysis Predicts an Epigenetic Switch for GATA Factor Expression in Endometriosis 
PLoS Genetics  2014;10(3):e1004158.
Endometriosis is a gynecological disease defined by the extrauterine growth of endometrial-like cells that cause chronic pain and infertility. The disease is limited to primates that exhibit spontaneous decidualization, and diseased cells are characterized by significant defects in the steroid-dependent genetic pathways that typify this process. Altered DNA methylation may underlie these defects, but few regions with differential methylation have been implicated in the disease. We mapped genome-wide differences in DNA methylation between healthy human endometrial and endometriotic stromal cells and correlated this with gene expression using an interaction analysis strategy. We identified 42,248 differentially methylated CpGs in endometriosis compared to healthy cells. These extensive differences were not unidirectional, but were focused intragenically and at sites distal to classic CpG islands where methylation status was typically negatively correlated with gene expression. Significant differences in methylation were mapped to 403 genes, which included a disproportionally large number of transcription factors. Furthermore, many of these genes are implicated in the pathology of endometriosis and decidualization. Our results tremendously improve the scope and resolution of differential methylation affecting the HOX gene clusters, nuclear receptor genes, and intriguingly the GATA family of transcription factors. Functional analysis of the GATA family revealed that GATA2 regulates key genes necessary for the hormone-driven differentiation of healthy stromal cells, but is hypermethylated and repressed in endometriotic cells. GATA6, which is hypomethylated and abundant in endometriotic cells, potently blocked hormone sensitivity, repressed GATA2, and induced markers of endometriosis when expressed in healthy endometrial cells. 
The unique epigenetic fingerprint in endometriosis suggests DNA methylation is an integral component of the disease, and identifies a novel role for the GATA family as key regulators of uterine physiology–aberrant DNA methylation in endometriotic cells correlates with a shift in GATA isoform expression that facilitates progesterone resistance and disease progression.
Author Summary
Women develop endometriosis when endometrial tissue with altered sensitivity to ovarian hormones grows outside the uterus. The persistent survival of these cells results in chronic pelvic pain and infertility. Although the origin of the disease remains a mystery, it only occurs in women and menstruating primates, suggesting that the unique evolution behind primate uterine development and menstruation are linked to the disease. Epigenetic defects affecting the uterine physiological response to ovarian hormones are also involved in endometriosis, and several genes implicated in disease progression are differentially methylated. Here we compared DNA methylation with gene expression in endometriosis using large-scale arrays. By comparing healthy and diseased cells treated with or without hormones to mimic part of the menstrual cycle, we uncovered many differentially methylated genes with defective expression in endometriosis that also regulate the hormone-dependent aspects of menstruation. In addition to expanding our understanding of how methylation affects endometriosis many fold, this also led us to propose an epigenetic switch that permits GATA6 expression in endometriosis instead of GATA2, and this switch promotes the aberrant expression of many of the genes seen in endometriosis. Our work provides novel unifying insight into the cause and development of endometriosis.
PMCID: PMC3945170  PMID: 24603652
15.  Method to Detect Differentially Methylated Loci with Case-Control Designs using Illumina Arrays 
Genetic epidemiology  2011;35(7):686-694.
It is now understood that virtually all human cancer types are the result of the accumulation of both genetic and epigenetic changes. DNA methylation is a molecular modification of DNA that is crucial for normal development. Genes that are rich in CpG dinucleotides are usually not methylated in normal tissues, but are frequently hypermethylated in cancer. With the advent of high-throughput platforms, the large-scale structure of genomic methylation patterns is available through genome-wide scans, and a tremendous amount of DNA methylation data has recently been generated. However, sophisticated statistical methods to handle complex DNA methylation data are very limited. Here we developed a likelihood-based Uniform-Normal mixture model to select differentially methylated loci between case and control groups using Illumina arrays. The idea is to model the data as three types of methylation loci: unmethylated, completely methylated, and partially methylated. A three-component mixture model with two uniform distributions and one truncated normal distribution was used to model the three types. The mixture probabilities and the mean of the normal distribution were used to make inferences about differentially methylated loci. Through extensive simulation studies, we demonstrated the feasibility and power of the proposed method. An application to a recently published study on ovarian cancer identified several methylation loci that were missed by the existing method.
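The three-component density described in this abstract can be illustrated with a minimal sketch. The abstract does not give the supports of the two uniform components or any parameter values, so the cut-offs `low_cut` and `high_cut`, the function names, and the example weights below are all assumptions made for illustration only; this is not the authors' implementation.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of an untruncated normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def truncated_normal_pdf(x, mu, sigma, lo=0.0, hi=1.0):
    """Normal density renormalised to the interval [lo, hi]."""
    if x < lo or x > hi:
        return 0.0
    mass = normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)
    return normal_pdf(x, mu, sigma) / mass


def uniform_pdf(x, lo, hi):
    """Density of a uniform distribution on [lo, hi]."""
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0


def mixture_pdf(x, weights, mu, sigma, low_cut=0.1, high_cut=0.9):
    """Uniform-Normal-Uniform mixture density for a methylation proportion x.

    weights = (unmethylated, partially methylated, fully methylated).
    low_cut and high_cut are hypothetical supports for the two uniforms.
    """
    w_un, w_part, w_full = weights
    return (w_un * uniform_pdf(x, 0.0, low_cut)
            + w_part * truncated_normal_pdf(x, mu, sigma)
            + w_full * uniform_pdf(x, high_cut, 1.0))


def log_likelihood(values, weights, mu, sigma):
    """Log-likelihood of observed methylation proportions under the mixture."""
    return sum(math.log(mixture_pdf(v, weights, mu, sigma)) for v in values)
```

Under these assumptions, fitting the model separately to cases and controls and comparing the estimated weights and `mu` would correspond to the inference step the abstract mentions; the estimation procedure itself is not sketched here.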
PMCID: PMC3197755  PMID: 21818777
16.  The Dynamics of DNA Methylation Covariation Patterns in Carcinogenesis 
PLoS Computational Biology  2014;10(7):e1003709.
Recently it has been observed that cancer tissue is characterised by an increased variability in DNA methylation patterns. However, how the correlative patterns in genome-wide DNA methylation change during the carcinogenic progress has not yet been explored. Here we study genome-wide inter-CpG correlations in DNA methylation, in addition to single site variability, during cervical carcinogenesis. We demonstrate how the study of changes in DNA methylation covariation patterns across normal, intra-epithelial neoplasia and invasive cancer allows the identification of CpG sites that indicate the risk of neoplastic transformation in stages prior to neoplasia. Importantly, we show that the covariation in DNA methylation at these risk CpG loci is maximal immediately prior to the onset of cancer, supporting the view that high epigenetic diversity in normal cells increases the risk of cancer. Consistent with this, we observe that invasive cancers exhibit increased covariation in DNA methylation at the risk CpG sites relative to normal tissue, but lower levels relative to pre-cancerous lesions. We further show that the identified risk CpG sites undergo preferential DNA methylation changes in relation to human papilloma virus infection and age. Results are validated in independent data including prospectively collected samples prior to neoplastic transformation. Our data are consistent with a phase transition model of carcinogenesis, in which epigenetic diversity is maximal prior to the onset of cancer. The model and algorithm proposed here may allow, in future, network biomarkers predicting the risk of neoplastic transformation to be identified.
Author Summary
DNA methylation is a covalent modification of DNA which can regulate how active genes are. DNA methylation is altered at many genomic loci in cancer cells, leading to widespread functional disruption. Importantly, DNA methylation alterations across the genome are seen even in early carcinogenesis. Although the pattern of DNA methylation change during carcinogenesis has been studied at individual genomic loci, no study has yet analysed how these patterns change at a systems level, specifically how DNA methylation patterns at pairs of genomic sites change during disease progression. Doing so can shed light on how the epigenetic diversity of cell populations changes during the carcinogenic process. This study performs a systems-level analysis of the dynamic changes in DNA methylation correlation pattern during cervical carcinogenesis, demonstrating that epigenetic diversity is maximal just prior to the onset of cancer. Importantly, this supports the view that the risk of cancer development is closely related to an increase in epigenetic diversity in apparently healthy cells. In addition, the study provides a computational algorithm which successfully identifies the altered genomic sites conferring the risk of cervical cancer.
PMCID: PMC4091688  PMID: 25010556
17.  A Beta-mixture model for dimensionality reduction, sample classification and analysis 
BMC Bioinformatics  2011;12:215.
Patterns of genome-wide methylation vary between tissue types. For example, cancer tissue shows markedly different patterns from those of normal tissue. In this paper we propose a beta-mixture model to describe genome-wide methylation patterns based on probe data from methylation microarrays. The model takes dependencies between neighbour probe pairs into account and assumes three broad categories of methylation, low, medium and high. The model is described by 37 parameters, which reduces the dimensionality of a typical methylation microarray significantly. We used methylation microarray data from 42 colon cancer samples to assess the model.
Based on data from colon cancer samples we show that our model captures genome-wide characteristics of methylation patterns. We estimate the parameters of the model and show that they vary between different tissue types. Further, for each methylation probe the posterior probability of a methylation state (low, medium or high) is calculated and the probability that the state is correctly predicted is assessed. We demonstrate that the model can be applied to classify cancer tissue types accurately and that the model provides accessible and easily interpretable data summaries.
We have developed a beta-mixture model for methylation microarray data. The model substantially reduces the dimensionality of the data. It can be used for further analysis, such as sample classification or to detect changes in methylation status between different samples and tissues.
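The posterior probability of a methylation state mentioned above can be sketched with Bayes' rule over three beta components. This simplified example ignores the dependencies between neighbouring probes that the model in the abstract accounts for, and the weights and shape parameters in `STATES` are invented for illustration; they are not the 37 fitted parameters of the paper.

```python
import math


def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


# Hypothetical mixture: state name -> (weight, a, b). Values are assumptions.
STATES = {
    "low": (0.4, 1.0, 8.0),
    "medium": (0.2, 5.0, 5.0),
    "high": (0.4, 8.0, 1.0),
}


def posterior(x, states=STATES):
    """Posterior probability of each state given one methylation value x."""
    joint = {name: w * beta_pdf(x, a, b) for name, (w, a, b) in states.items()}
    total = sum(joint.values())
    return {name: value / total for name, value in joint.items()}


def most_likely_state(x, states=STATES):
    """State with the highest posterior probability for x."""
    post = posterior(x, states)
    return max(post, key=post.get)
```

With these illustrative parameters, a probe value close to 0 is assigned to the low state and a value close to 1 to the high state, which matches the qualitative behaviour the abstract describes.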
PMCID: PMC3126746  PMID: 21619656
18.  Quantitation of DNA methylation by melt curve analysis 
BMC Cancer  2009;9:123.
Methylation of DNA is a common mechanism for silencing genes, and aberrant methylation is increasingly being implicated in many diseases such as cancer. There is a need for robust, inexpensive methods to quantitate methylation across a region containing a number of CpGs. We describe and validate a rapid, in-tube method to quantitate DNA methylation using the melt data obtained following amplification of bisulfite modified DNA in a real-time thermocycler.
We first describe a mathematical method to normalise the raw fluorescence data generated by heating the amplified bisulfite modified DNA. From this normalised data the temperatures at which melting begins and finishes can be calculated, which reflect the less and more methylated template molecules present respectively. Also the T50, the temperature at which half the amplicons are melted, which represents the summative methylation of all the CpGs in the template mixture, can be calculated. These parameters describe the methylation characteristics of the region amplified in the original sample.
For validation we used synthesized oligonucleotides and DNA from fresh cells and formalin fixed paraffin embedded tissue, each with known methylation. Using our quantitation we could distinguish between unmethylated, partially methylated and fully methylated oligonucleotides mixed in varying ratios. There was a linear relationship between T50 and the dilution of methylated into unmethylated DNA. We could quantitate the change in methylation over time in cell lines treated with the demethylating drug 5-aza-2'-deoxycytidine, and the differences in methylation associated with complete, clonal or no loss of MGMT expression in formalin fixed paraffin embedded tissues.
We have validated a rapid, simple in-tube method to quantify methylation which is robust and reproducible, utilizes easily designed primers and does not need proprietary algorithms or software. The technique does not depend on any operator manipulation or interpretation of the melt curves, and is suitable for use in any laboratory with a real-time thermocycler. The parameters derived provide an objective description and quantitation of the methylation in a specimen, and can be used for statistical comparisons of methylation between specimens.
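As a rough illustration of the T50 parameter, the sketch below normalises a melt curve and interpolates the temperature at which half of the signal has been lost. The abstract does not specify the normalisation, so simple min-max scaling is assumed here, and the function names are hypothetical; the published method may differ.

```python
def normalise(fluorescence):
    """Scale raw fluorescence to the range 0..1 (assumed min-max scaling)."""
    lo, hi = min(fluorescence), max(fluorescence)
    return [(f - lo) / (hi - lo) for f in fluorescence]


def t50(temps, fluorescence):
    """Temperature at which the normalised signal first falls through 0.5.

    temps must be increasing; the crossing is found by linear interpolation
    between the two neighbouring readings.
    """
    signal = normalise(fluorescence)
    points = list(zip(temps, signal))
    for (t0, s0), (t1, s1) in zip(points, points[1:]):
        if s0 >= 0.5 > s1:
            return t0 + (s0 - 0.5) * (t1 - t0) / (s0 - s1)
    raise ValueError("normalised signal never crosses 0.5")
```

For readings of 100, 100, 80, 20, 0, 0 taken at 70, 72, 74, 76, 78 and 80 degrees, this sketch places T50 at 75 degrees, halfway between the 74 and 76 degree readings.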
PMCID: PMC2679043  PMID: 19393074
19.  Polycomb group genes are targets of aberrant DNA methylation in renal cell carcinoma 
Epigenetics  2011;6(6):703-709.
The combined effects of genetic and epigenetic aberrations are well recognized as causal in tumorigenesis. Here, we defined profiles of DNA methylation in primary renal cell carcinomas (RCC) and assessed the association of these profiles with the expression of genes required for the establishment and maintenance of epigenetic marks. A bead-based methylation array platform was used to measure methylation of 1,413 CpG loci in ∼800 cancer-associated genes and three methylation classes were derived by unsupervised clustering of tumors using recursively partitioned mixture modeling (RPMM). Quantitative RT-PCR was performed on all tumor samples to determine the expression of DNMT1, DNMT3B, VEZF1 and EZH2. Additionally, methylation at LINE-1 and AluYb8 repetitive elements was measured using bisulfite pyrosequencing. Associations between methylation class and tumor stage (p = 0.05), LINE-1 (p < 0.0001) and AluYb8 (p < 0.0001) methylation, as well as EZH2 expression (p < 0.0001) were noted following univariate analyses. A multinomial logistic regression model controlling for potential confounders revealed that AluYb8 (p < 0.003) methylation and EZH2 expression (p < 0.008) were significantly associated with methylation class membership. Because EZH2 is a member of the Polycomb repressive complex 2 (PRC2), we next analyzed the distribution of Polycomb group (PcG) targets among methylation classes derived by clustering the 1,413 array CpG loci using RPMM. PcG target genes were significantly enriched (p < 0.0001) in methylation classes with greater differential methylation between RCC and non-diseased kidney tissue. This work contributes to our understanding of how repressive marks on DNA and chromatin are dysregulated in carcinogenesis, knowledge that might aid the development of therapies or preventive strategies for human malignancies.
PMCID: PMC3230543  PMID: 21610323
20.  A joint finite mixture model for clustering genes from independent Gaussian and beta distributed data 
BMC Bioinformatics  2009;10:165.
Cluster analysis has become a standard computational method for gene function discovery as well as for more general exploratory data analysis. A number of different approaches have been proposed for that purpose, among which mixture models provide a principled probabilistic framework. Cluster analysis is nowadays increasingly supplemented with multiple data sources, and these heterogeneous information sources should be used as efficiently as possible.
This paper presents a novel Beta-Gaussian mixture model (BGMM) for clustering genes based on Gaussian distributed and beta distributed data. The proposed BGMM can be viewed as a natural extension of the beta mixture model (BMM) and the Gaussian mixture model (GMM). The proposed BGMM method differs from other mixture model based methods in its integration of two different data types into a single and unified probabilistic modeling framework, which provides a more efficient use of multiple data sources than methods that analyze different data sources separately. Moreover, BGMM provides an exceedingly flexible modeling framework since many data sources can be modeled as Gaussian or beta distributed random variables, and it can also be extended to integrate data that have other parametric distributions as well, which adds even more flexibility to this model-based clustering framework. We developed three types of estimation algorithms for BGMM, the standard expectation maximization (EM) algorithm, an approximated EM and a hybrid EM, and propose to tackle the model selection problem by well-known model selection criteria, for which we test the Akaike information criterion (AIC), a modified AIC (AIC3), the Bayesian information criterion (BIC), and the integrated classification likelihood-BIC (ICL-BIC).
Performance tests with simulated data show that combining two different data sources into a single mixture joint model greatly improves the clustering accuracy compared with either of its two extreme cases, GMM or BMM. Applications with real mouse gene expression data (modeled as Gaussian distribution) and protein-DNA binding probabilities (modeled as beta distribution) also demonstrate that BGMM can yield more biologically reasonable results compared with either of its two extreme cases. One of our applications has found three groups of genes that are likely to be involved in Myd88-dependent Toll-like receptor 3/4 (TLR-3/4) signaling cascades, which might be useful to better understand the TLR-3/4 signal transduction.
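The model selection criteria named in this abstract can be computed from a fitted model's log-likelihood. The sketch below uses the commonly cited textbook forms (AIC and AIC3 penalise each parameter by 2 and 3 respectively, BIC by the log of the sample size, and ICL-BIC adds twice the classification entropy to BIC); whether the paper uses exactly these forms is an assumption, and the function name is hypothetical.

```python
import math


def information_criteria(log_lik, n_params, n_obs, posteriors=None):
    """Return AIC, AIC3 and BIC, plus ICL-BIC when posteriors are supplied.

    posteriors is an optional list of rows, one per observation, holding the
    posterior membership probabilities of that observation in each cluster.
    Lower scores indicate a preferred model.
    """
    scores = {
        "AIC": -2.0 * log_lik + 2.0 * n_params,
        "AIC3": -2.0 * log_lik + 3.0 * n_params,
        "BIC": -2.0 * log_lik + n_params * math.log(n_obs),
    }
    if posteriors is not None:
        # Classification entropy: large when cluster assignments are fuzzy.
        entropy = -sum(p * math.log(p) for row in posteriors for p in row if p > 0)
        scores["ICL-BIC"] = scores["BIC"] + 2.0 * entropy
    return scores
```

In practice one would fit the mixture for several candidate numbers of clusters and keep the model with the lowest score under the chosen criterion.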
PMCID: PMC2717092  PMID: 19480678
21.  Identification of New Differentially Methylated Genes That Have Potential Functional Consequences in Prostate Cancer 
PLoS ONE  2012;7(10):e48455.
Many differentially methylated genes have been identified in prostate cancer (PCa), primarily using candidate gene-based assays. Recently, several global DNA methylation profiles have been reported in PCa; however, each of these has weaknesses in terms of ability to observe global DNA methylation alterations in PCa. We hypothesize that there remains unidentified aberrant DNA methylation in PCa, which may be identified using higher resolution assay methods. We used the newly developed Illumina HumanMethylation450 BeadChip in PCa (n = 19) and adjacent normal tissues (n = 4) and combined these with gene expression data for identifying new DNA methylation that may have functional consequences in PCa development and progression. We also confirmed our methylation results in an independent data set. Two aberrant DNA methylation genes were validated among an additional 56 PCa samples and 55 adjacent normal tissues. A total of 28,735 CpG sites showed significant differences in DNA methylation (FDR adjusted P<0.05), defined as a mean methylation difference of at least 20% between PCa and normal samples. Furthermore, a total of 122 genes had more than one differentially methylated CpG site in their promoter region and a gene expression pattern that was inverse to the direction of change in DNA methylation (e.g. decreased expression with increased methylation, and vice-versa). Aberrant DNA methylation of two genes, AOX1 and SPON2, was confirmed via bisulfite sequencing, with most of the respective CpG sites showing significant differences between tumor samples and normal tissues. The AOX1 promoter region showed hypermethylation in 92.6% of 54 tested PCa samples in contrast to only three out of 53 tested normal tissues. This study used a new BeadChip combined with gene expression data in PCa to identify novel differentially methylated CpG sites located within genes. The newly identified differentially methylated genes may be used as biomarkers for PCa diagnosis.
PMCID: PMC3485209  PMID: 23119026
22.  Genomic Distribution and Inter-Sample Variation of Non-CpG Methylation across Human Cell Types 
PLoS Genetics  2011;7(12):e1002389.
DNA methylation plays an important role in development and disease. The primary sites of DNA methylation in vertebrates are cytosines in the CpG dinucleotide context, which account for roughly three quarters of the total DNA methylation content in human and mouse cells. While the genomic distribution, inter-individual stability, and functional role of CpG methylation are reasonably well understood, little is known about DNA methylation targeting CpA, CpT, and CpC (non-CpG) dinucleotides. Here we report a comprehensive analysis of non-CpG methylation in 76 genome-scale DNA methylation maps across pluripotent and differentiated human cell types. We confirm non-CpG methylation to be predominantly present in pluripotent cell types and observe a decrease upon differentiation and near complete absence in various somatic cell types. Although no function has been assigned to it in pluripotency, our data highlight that non-CpG methylation patterns reappear upon iPS cell reprogramming. Intriguingly, the patterns are highly variable and show little conservation between different pluripotent cell lines. We find a strong correlation of non-CpG methylation and DNMT3 expression levels while showing statistical independence of non-CpG methylation from pluripotency associated gene expression. In line with these findings, we show that knockdown of DNMT3A and DNMT3B in hESCs results in a global reduction of non-CpG methylation. Finally, non-CpG methylation appears to be spatially correlated with CpG methylation. In summary these results contribute further to our understanding of cytosine methylation patterns in human cells using a large representative sample set.
Author Summary
Epigenetic modifications including DNA methylation at the position 5 of the cytosine base provide regulatory information to the genome sequence. The primary target of cytosine methylation in mammals is the CpG dinucleotide. However, previous studies in the mouse and more recent work in humans have highlighted the presence of non-CpG methylation in pluripotent cells. Currently, little is known about the role of this type of DNA methylation. We sought to further characterize non-CpG methylation by employing a comprehensive data set of genome-scale methylation maps across various human cell types. Our analysis reveals that non-CpG methylation varies dramatically between pluripotent cells and is closely linked to CpG methylation. Moreover, we show that depletion of the de novo DNA methyltransferases results in a global reduction of non-CpG methylation levels. Taken together, these findings further advance our understanding of cytosine methylation and describe its distribution among a large number of human cell types.
PMCID: PMC3234221  PMID: 22174693
23.  Widespread Hypomethylation Occurs Early and Synergizes with Gene Amplification during Esophageal Carcinogenesis 
PLoS Genetics  2011;7(3):e1001356.
Although a combination of genomic and epigenetic alterations is implicated in the multistep transformation of normal squamous esophageal epithelium to Barrett esophagus, dysplasia, and adenocarcinoma, the combinatorial effect of these changes is unknown. By integrating genome-wide DNA methylation, copy number, and transcriptomic datasets obtained from endoscopic biopsies of neoplastic progression within the same individual, we are uniquely able to define the molecular events associated with progression of Barrett esophagus. We find that the previously reported global hypomethylation phenomenon in cancer has its origins at the earliest stages of epithelial carcinogenesis. Promoter hypomethylation synergizes with gene amplification and leads to significant upregulation of a chr4q21 chemokine cluster and other transcripts during Barrett neoplasia. In contrast, gene-specific hypermethylation is observed at a restricted number of loci and, in combination with hemi-allelic deletions, leads to downregulation of selected transcripts during multistep progression. We also observe that epigenetic regulation during epithelial carcinogenesis is not restricted to traditionally defined “CpG islands,” but may also occur through a mechanism of differential methylation outside of these regions. Finally, validation of novel upregulated targets (CXCL1 and 3, GATA6, and DMBT1) in a larger independent panel of samples confirms the utility of integrative analysis in cancer biomarker discovery.
Author Summary
The incidence of esophageal adenocarcinoma (EA) is increasing at an alarming pace in the United States. Distinct pathological stages of Barrett's metaplasia and low- and high-grade dysplasia can be seen preceding malignant transformation. These precursor lesions provide a unique in vivo model for deepening our understanding of the early steps in human neoplasia. By integrating genome-wide DNA methylation, copy number, and transcriptomic datasets obtained from endoscopic biopsies of neoplastic progression within the same individual, we are uniquely able to define the molecular events associated with progression of Barrett esophagus. We show that the predominant change during this process is loss of DNA methylation. We show that this global hypomethylation occurs very early during the process and is seen even in preinvasive lesions. This loss of DNA methylation drives carcinogenesis by cooperating with gene amplifications in upregulating proteins during this process. Finally, we uncovered proteins that are upregulated by loss of methylation or gene amplification (CXCL1 and 3, GATA6, and DMBT1) and show their relevance by validating their levels in a larger independent panel of samples, thus confirming the utility of integrative analysis in cancer biomarker discovery.
PMCID: PMC3069107  PMID: 21483804
24.  CMS: A Web-Based System for Visualization and Analysis of Genome-Wide Methylation Data of Human Cancers 
PLoS ONE  2013;8(4):e60980.
DNA methylation of promoter CpG islands is associated with gene suppression, and its unique genome-wide profiles have been linked to tumor progression. With high-throughput sequencing technologies, genome-wide methylation profiles in cancer cells can now be determined efficiently. In addition, experimental and computational technologies make it possible to find the functional relationship between cancer-specific methylation patterns and their clinicopathological parameters.
Methodology/Principal Findings
Cancer methylome system (CMS) is a web-based database application designed for the visualization, comparison and statistical analysis of human cancer-specific DNA methylation. Methylation intensities were obtained from MBDCap-sequencing, pre-processed and stored in the database. 191 patient samples (169 tumor and 22 normal specimens) and 41 breast cancer cell-lines are deposited in the database, comprising about 6.6 billion uniquely mapped sequence reads. This provides comprehensive and genome-wide epigenetic portraits of human breast cancer and endometrial cancer to date. Two views are proposed for users to better understand methylation structure at the genomic level or systemic methylation alteration at the gene level. In addition, a variety of annotation tracks are provided to cover genomic information. CMS includes important analytic functions for interpretation of methylation data, such as the detection of differentially methylated regions, statistical calculation of global methylation intensities, multiple gene sets of biologically significant categories, and interactivity with UCSC via custom-track data. We also present examples of discoveries utilizing the framework.
CMS provides visualization and analytic functions for cancer methylome datasets. A comprehensive collection of datasets, a variety of embedded analytic functions and extensive applications with biological and translational significance make this system powerful and unique in cancer methylation research. CMS is freely accessible at:
PMCID: PMC3632540  PMID: 23630576
25.  On the potential of models for location and scale for genome-wide DNA methylation data 
BMC Bioinformatics  2014;15:232.
With the help of epigenome-wide association studies (EWAS), increasing knowledge on the role of epigenetic mechanisms such as DNA methylation in disease processes is obtained. In addition, EWAS aid the understanding of behavioral and environmental effects on DNA methylation. In terms of statistical analysis, specific challenges arise from the characteristics of methylation data. First, methylation β-values represent proportions with skewed and heteroscedastic distributions. Thus, traditional modeling strategies assuming a normally distributed response might not be appropriate. Second, recent evidence suggests that not only mean differences but also variability in site-specific DNA methylation associates with diseases, including cancer. The purpose of this study was to compare different modeling strategies for methylation data in terms of model performance and performance of downstream hypothesis tests. Specifically, we used the generalized additive models for location, scale and shape (GAMLSS) framework to compare beta regression with Gaussian regression on raw, binary logit and arcsine square root transformed methylation data, with and without modeling a covariate effect on the scale parameter.
Using simulated and real data from a large population-based study and an independent sample of cancer patients and healthy controls, we show that beta regression does not outperform competing strategies in terms of model performance. In addition, Gaussian models for location and scale showed an improved performance as compared to models for location only. The best performance was observed for the Gaussian model on binary logit transformed β-values, referred to as M-values. Our results further suggest that models for location and scale are specifically sensitive towards violations of the distribution assumption and towards outliers in the methylation data. Therefore, a resampling procedure is proposed as a mode of inference and shown to diminish type I error rate in practically relevant settings. We apply the proposed method in an EWAS of BMI and age and reveal strong associations of age with methylation variability that are validated in an independent sample.
Models for location and scale are promising tools for EWAS that may help to understand the influence of environmental factors and disease-related phenotypes on methylation variability and its role during disease development.
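The transformations compared in this abstract are short enough to write out. The sketch below converts β-values to M-values with the base-2 logit, M = log2(β / (1 − β)), and also shows the arcsine square root transform. The clipping constant `eps`, which keeps β away from exactly 0 or 1, is an assumption added here for numerical safety and is not taken from the paper.

```python
import math


def beta_to_m(beta, eps=1e-6):
    """Binary (base-2) logit of a methylation beta-value, i.e. its M-value."""
    b = min(max(beta, eps), 1.0 - eps)  # avoid log of zero at the boundaries
    return math.log2(b / (1.0 - b))


def m_to_beta(m):
    """Inverse transform: recover the beta-value from an M-value."""
    return 2.0 ** m / (2.0 ** m + 1.0)


def arcsine_sqrt(beta):
    """Arcsine square root transform of a proportion in [0, 1]."""
    return math.asin(math.sqrt(beta))
```

A β-value of 0.5 maps to an M-value of 0, and the transform stretches values near 0 and 1, which is one reason a Gaussian model can fit M-values better than raw β-values.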
PMCID: PMC4227139  PMID: 24994026
