Advances in whole genome profiling have revolutionized the cancer research field, but at the same time have raised new bioinformatics challenges. For next generation sequencing (NGS), these include data storage, computational costs, sequence processing and alignment, delineating appropriate statistical measures, and data visualization. The NGS application MethylCap-seq involves the in vitro capture of methylated DNA and subsequent analysis of enriched fragments by massively parallel sequencing. Here, we present a scalable, flexible workflow for MethylCap-seq Quality Control, secondary data analysis, tertiary analysis of multiple experimental groups, and data visualization. This workflow and its suite of features will assist biologists in conducting methylation profiling projects and facilitate meaningful biological interpretation.
next generation sequencing; DNA methylation; epigenetics; cancer; data analysis; data visualization
DNA methylation is an important epigenetic mark and dysregulation of DNA methylation is associated with many diseases including cancer. Advances in next-generation sequencing now allow unbiased methylome profiling of entire patient cohorts, greatly facilitating biomarker discovery and presenting new opportunities to understand the biological mechanisms by which changes in methylation contribute to disease. Enrichment-based sequencing assays such as MethylCap-seq are a cost effective solution for genome-wide determination of methylation status, but the technical reliability of methylation reconstruction from raw sequencing data has not been well characterized.
We analyze three MethylCap-seq data sets and perform two different analyses to assess data quality. First, we investigate how data quality is affected by excluding samples that do not meet quality control cutoff requirements. Second, we consider the effect of additional reads on enrichment score, saturation, and coverage. Lastly, we verify a method for the determination of the global amount of methylation from MethylCap-seq data by comparing to a spiked-in control DNA of known methylation status.
We show that rejection of samples based on our quality control parameters leads to a significant improvement of methylation calling. Additional reads beyond ~13 million unique aligned reads improved coverage, modestly improved saturation, and did not impact enrichment score. Lastly, we find that a global methylation indicator calculated from MethylCap-seq data correlates well with the global methylation level of a sample as obtained from a spike-in DNA of known methylation level.
We show that with appropriate quality control MethylCap-seq is a reliable tool, suitable for cohorts of hundreds of patients, that provides reproducible methylation information on a feature by feature basis as well as information about the global level of methylation.
Non-small cell lung carcinoma (NSCLC) is a complex malignancy that owing to its heterogeneity and poor prognosis poses many challenges to diagnosis, prognosis and patient treatment. DNA methylation is an important mechanism of epigenetic regulation involved in normal development and cancer. It is a very stable and specific modification and therefore in principle a very suitable marker for epigenetic phenotyping of tumors. Here we present a genome-wide DNA methylation analysis of NSCLC samples and paired lung tissues, where we combine MethylCap and next generation sequencing (MethylCap-seq) to provide comprehensive DNA methylation maps of the tumor and paired lung samples. The MethylCap-seq data were validated by bisulfite sequencing and methyl-specific polymerase chain reaction of selected regions.
Analysis of the MethylCap-seq data revealed a strong positive correlation between replicate experiments and between paired tumor/lung samples. We identified 57 differentially methylated regions (DMRs) present in all NSCLC tumors analyzed by MethylCap-seq. While hypomethylated DMRs did not correlate to any particular functional category of genes, the hypermethylated DMRs were strongly associated with genes encoding transcriptional regulators. Furthermore, subtelomeric regions and satellite repeats were hypomethylated in the NSCLC samples. We also identified DMRs that were specific to two of the major subtypes of NSCLC, adenocarcinomas and squamous cell carcinomas.
Collectively, we provide a resource containing genome-wide DNA methylation maps of NSCLC and their paired lung tissues, and comprehensive lists of known and novel DMRs and associated genes in NSCLC.
DNA Methylation; Epigenetics; MethylCap; Next generation sequencing; Non-small cell lung Cancer
Cisplatin resistance is one of the major reasons for the high death rate of ovarian cancer. Methyl-Capture sequencing (MethylCap-seq), which combines precipitation of methylated DNA by the recombinant methyl-CpG binding domain of the MBD2 protein with NGS, enables global and unbiased analysis of DNA methylation patterns. We applied MethylCap-seq to analyze the genome-wide DNA methylation profile of the cisplatin-sensitive ovarian cancer cell line A2780 and its isogenic resistant derivative A2780CP. We obtained 21,763,035 raw reads for the drug-resistant cell line A2780CP and 18,821,061 reads for the sensitive cell line A2780. We identified 1224 hypermethylated and 1216 hypomethylated differentially methylated regions (DMRs) in A2780CP compared to A2780. Our MethylCap-seq data on this cisplatin-resistant ovarian cancer model provide a resource for the research community. We also found that A2780CP, compared to A2780, has lower observed to expected methylated CpG ratios, suggesting a lower global CpG methylation in A2780CP cells. Methylation specific PCR and bisulfite sequencing confirmed hypermethylation of PTK6, PRKCE and BCL2L1 in A2780 compared with A2780CP. Furthermore, treatment with the demethylation reagent 5-aza-dC in A2780 cells demethylated the promoters and restored the expression of PTK6, PRKCE and BCL2L1.
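An observed-to-expected CpG ratio is commonly computed per sequence as (number of CpG dinucleotides × sequence length) / (number of C × number of G). The Python sketch below assumes that common definition; the abstract does not state the exact formula used for methylated CpGs, and the function name is hypothetical.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio for one DNA sequence.

    Assumed definition: (#CpG * N) / (#C * #G), where N is the sequence length.
    Returns 0.0 when the sequence contains no C or no G.
    """
    s = seq.upper()
    n = len(s)
    c = s.count("C")
    g = s.count("G")
    cpg = s.count("CG")  # "CG" cannot overlap itself, so count() is exact
    if c == 0 or g == 0:
        return 0.0
    return cpg * n / (c * g)
```

Applied to the sequences underlying enriched regions of two samples, a lower ratio in one sample would be consistent with capture of fewer CpG-dense fragments, which is the kind of comparison the abstract describes.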
Two cost-efficient genome-scale methodologies to assess DNA-methylation are MethylCap-seq and Illumina’s Infinium HumanMethylation450 BeadChips (HM450). Objective information regarding the best-suited methodology for a specific research question is scant. Therefore, we performed a large-scale evaluation on a set of 70 brain tissue samples, i.e. 65 glioblastoma and 5 non-tumoral tissues. As MethylCap-seq coverages were limited, we focused on the inherent capacity of the methodology to detect methylated loci rather than a quantitative analysis. MethylCap-seq and HM450 data were dichotomized and performances were compared using a gold standard free Bayesian modelling procedure. While conditional specificity was adequate for both approaches, conditional sensitivity was systematically higher for HM450. In addition, genome-wide characteristics were compared, revealing that HM450 probes identified substantially fewer regions compared to MethylCap-seq. Although results indicated that the latter method can detect more potentially relevant DNA-methylation, this did not translate into the discovery of more differentially methylated loci between tumours and controls compared to HM450. Our results therefore indicate that both methodologies are complementary, with a higher sensitivity for HM450 and a far larger genome-wide coverage for MethylCap-seq, but also that a more comprehensive character does not automatically imply more significant results in biomarker studies.
Monoallelic gene expression is typically initiated early in the development of an organism. Dysregulation of monoallelic gene expression has already been linked to several non-Mendelian inherited genetic disorders. In humans, DNA-methylation is deemed to be an important regulator of monoallelic gene expression, but only a few examples are known. One important reason is that current, cost-affordable truly genome-wide methods to assess DNA-methylation are based on sequencing post-enrichment. Here, we present a new methodology based on classical population genetic theory, i.e. the Hardy–Weinberg theorem, that combines methylomic data from MethylCap-seq with associated SNP profiles to identify monoallelically methylated loci. Applied to 334 MethylCap-seq samples of very diverse origin, this resulted in the identification of 80 genomic regions characterized by monoallelic DNA-methylation. Of these 80 loci, 49 are located in genic regions, of which 25 have already been linked to imprinting. Further analysis revealed statistically significant enrichment of these loci in promoter regions, further establishing the relevance and usefulness of the method. Additional validation was done using both 14 whole-genome bisulfite sequencing data sets and 16 mRNA-seq data sets. Importantly, the developed approach can be easily applied to other enrichment-based sequencing technologies, like the ChIP-seq-based identification of monoallelic histone modifications.
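One way to picture the Hardy–Weinberg reasoning: at a SNP inside a locus that is methylated on only one allele, the enriched reads from each sample carry a single allele, so apparent heterozygotes are depleted relative to the 2pq expected under Hardy–Weinberg equilibrium. The Python sketch below tests for such a deficit with a one-degree-of-freedom chi-square statistic. It is an illustrative simplification, not the published procedure; the function name and the genotype-count inputs are assumptions.

```python
import math

def hwe_het_deficit(n_aa: int, n_ab: int, n_bb: int):
    """Chi-square test of Hardy-Weinberg proportions for one SNP.

    Inputs are the numbers of samples whose enriched reads show only allele A,
    both alleles, or only allele B (hypothetical input format).
    Returns (chi2, p_value, heterozygote_deficit).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotyped samples")
    p = (2 * n_aa + n_ab) / (2 * n)   # allele A frequency
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value, n_ab < expected[1]
```

A locus would be flagged only when the deviation is significant and in the direction of too few heterozygotes; excess heterozygosity points to other artifacts.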
Learning and memory formation are known to require dynamic CpG (de)methylation and gene expression changes. Here, we aimed at establishing a genome-wide DNA methylation map of the zebra finch genome, a model organism in neuroscience, as well as identifying putatively epigenetically regulated genes. RNA- and MethylCap-seq experiments were performed on two zebra finch cell lines in presence or absence of 5-aza-2′-deoxycytidine induced demethylation. First, the MethylCap-seq methodology was validated in zebra finch by comparison with RRBS-generated data. To assess the influence of (variable) methylation on gene expression, RNA-seq experiments were performed as well. Comparison of RNA-seq and MethylCap-seq results showed that at least 357 of the 3,457 AZA-upregulated genes are putatively regulated by methylation in the promoter region, for which a pathway analysis showed remarkable enrichment for neurological networks. A subset of genes was validated using Exon Arrays, quantitative RT-PCR and CpG pyrosequencing on bisulfite-treated samples. To our knowledge, this study provides the first genome-wide DNA methylation map of the zebra finch genome as well as a comprehensive set of genes of which transcription is under putative methylation control.
DNA-methylation is an important epigenetic feature in health and disease. Methylated sequence capturing by Methyl Binding Domain (MBD) based enrichment followed by second-generation sequencing provides the best combination of sensitivity and cost-efficiency for genome-wide DNA-methylation profiling. However, existing implementations are numerous, and quality control and optimization require expensive external validation. Therefore, this study has two aims: 1) to identify a best performing kit for MBD-based enrichment using independent validation data, and 2) to evaluate whether quality evaluation can also be performed solely based on the characteristics of the generated sequences. Five commercially available kits for MBD enrichment were combined with Illumina GAIIx sequencing for three cell lines (HCT15, DU145, PC3). Reduced representation bisulfite sequencing data (all three cell lines) and publicly available Illumina Infinium BeadChip data (DU145 and PC3) were used for benchmarking. Consistent large-scale differences in yield, sensitivity and specificity between the different kits could be identified, with Diagenode's MethylCap kit as overall best performing kit under the tested conditions. This kit could also be identified with the Fragment CpG-plot, which summarizes the CpG content of the captured fragments, implying that the latter can be used as a tool to monitor data quality. In conclusion, there are major quality differences between kits for MBD-based capturing of methylated DNA, with the MethylCap kit performing best under the used settings. The Fragment CpG-plot is able to monitor data quality based on inherent sequence data characteristics, and is therefore a cost-efficient tool for experimental optimization, but also to monitor quality throughout routine applications.
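The idea of summarizing the CpG content of captured fragments can be sketched as a histogram of CpG counts per fragment. The Python example below is a minimal illustration under that assumption; the abstract does not give the exact construction of the Fragment CpG-plot, and the function name is hypothetical.

```python
from collections import Counter

def fragment_cpg_histogram(fragments):
    """Map each CpG count to the number of fragments that contain that many CpGs."""
    counts = Counter(frag.upper().count("CG") for frag in fragments)
    return dict(sorted(counts.items()))
```

With successful enrichment, one would expect the histogram to shift toward fragments with more CpGs; a distribution dominated by fragments without CpGs would suggest non-specific capture.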
Assessment of DNA promoter methylation markers in cervical scrapings for the detection of cervical intraepithelial neoplasia (CIN) and cervical cancer is feasible, but finding methylation markers with both high sensitivity as well as high specificity remains a challenge. In this study, we aimed to identify new methylation markers for the detection of high-grade CIN (CIN2/3 or worse, CIN2+) by using innovative genome-wide methylation analysis (MethylCap-seq). We focused on diagnostic performance of methylation markers with high sensitivity and high specificity considering any methylation level as positive.
MethylCap-seq of normal cervices and CIN2/3 revealed 176 differentially methylated regions (DMRs) comprising 164 genes. After verification and validation of the 15 best discriminating genes with methylation-specific PCR (MSP), 9 genes showed significant differential methylation in an independent cohort of normal cervices versus CIN2/3 lesions (p < 0.05). For further diagnostic evaluation, these 9 markers were tested with quantitative MSP (QMSP) in cervical scrapings from 2 cohorts: (1) cervical carcinoma versus healthy controls and (2) patients referred from population-based screening with an abnormal Pap smear in whom also HPV status was determined. Methylation levels of 8/9 genes were significantly higher in carcinoma compared to normal scrapings. For all 8 genes, methylation levels increased with the severity of the underlying histological lesion in scrapings from patients referred with an abnormal Pap smear. In addition, the diagnostic performance was investigated, using these 8 new genes and 4 genes (previously identified by our group: C13ORF18, JAM3, EPB41L3, and TERT). In a triage setting (after a positive Pap smear), sensitivity for CIN2+ of the best combination of genes (C13ORF18/JAM3/ANKRD18CP) (74 %) was comparable to hrHPV testing (79 %), while specificity was significantly higher (76 % versus 42 %, p ≤ 0.05). In addition, in hrHPV-positive scrapings, sensitivity and specificity for CIN2+ of this best-performing combination was comparable to the population referred with abnormal Pap smear.
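The sensitivity and specificity of a marker panel scored with "any methylation level is positive" can be computed as sketched below in Python. The input format, the function names, and the rule that a panel is positive when at least one marker is positive are assumptions made for illustration; the numbers in any call are toy values, not study data.

```python
def panel_positive(levels):
    """A sample is called positive if any marker shows a methylation level above zero."""
    return any(level > 0 for level in levels)

def sensitivity_specificity(samples):
    """samples: iterable of (marker_levels, has_disease) pairs.

    Returns (sensitivity, specificity) of the panel call against disease status.
    """
    tp = fn = tn = fp = 0
    for levels, has_disease in samples:
        call = panel_positive(levels)
        if has_disease:
            tp += call
            fn += not call
        else:
            fp += call
            tn += not call
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one diseased and one non-diseased sample")
    return tp / (tp + fn), tn / (tn + fp)
```

Adding markers to such a panel can only keep or raise sensitivity and keep or lower specificity, which is why the choice of combination matters in a triage setting.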
We identified new CIN2/3-specific methylation markers using genome-wide DNA methylation analysis. The diagnostic performance of our new methylation panel shows higher specificity, which should result in prevention of unnecessary colposcopies for women referred with abnormal cytology. In addition, these newly found markers might be applied as a triage test in hrHPV-positive women from population-based screening. The next step before implementation in primary screening programs will be validation in population-based cohorts.
Cervical cancer screening; Cervical precancerous lesions; Human papillomavirus (HPV); Cervical scraping; MethylCap-seq; DNA methylation; Quantitative methylation-specific PCR (QMSP)
DNA methylation is a key component of mammalian gene regulation and the most classical example of an epigenetic mark. DNA methylation patterns are mitotically heritable and stable over time, but they undergo considerable changes in response to cell differentiation, diseases and environmental influences. Several methods have been developed for DNA methylation profiling on a genomic scale. Here, we benchmark four of these methods on two sample pairs, comparing their accuracy and power to detect DNA methylation differences. The results show that all evaluated methods (MeDIP-seq: methylated DNA immunoprecipitation, MethylCap-seq: methylated DNA capture by affinity purification, RRBS: reduced representation bisulfite sequencing, and the Infinium HumanMethylation27 assay) produce accurate DNA methylation data. However, these methods differ in their ability to detect differentially methylated regions between pairs of samples. We highlight strengths and weaknesses of the four methods and give practical recommendations for the design of epigenomic case-control studies.
Epigenome profiling; epigenetics; sequencing; differentially methylated regions; molecular diagnostics; biomarker discovery; cancer
There is a need to supplement or supplant the conventional diagnostic tools, namely, cystoscopy and B-type ultrasound, for bladder cancer (BC). We aimed to identify novel DNA methylation markers for BC through genome-wide profiling of BC cell lines and subsequent methylation-specific PCR (MSP) screening of clinical urine samples.
The methyl-DNA binding domain (MBD) capture technique, methylCap/seq, was performed to screen for specific hypermethylated CpG islands in two BC cell lines (5637 and T24). The top one hundred hypermethylated targets were sequentially screened by MSP in urine samples to gradually narrow the target number and optimize the composition of the diagnostic panel. The diagnostic performance of the obtained panel was evaluated in different clinical scenarios.
A total of 1,627 hypermethylated promoter targets in the BC cell lines were identified by Illumina sequencing. The top 104 hypermethylated targets were reduced to eight genes (VAX1, KCNV1, ECEL1, TMEM26, TAL1, PROX1, SLC6A20, and LMX1A) after the urine DNA screening in a small sample of 8 normal controls and 18 BC subjects. Validation in an independent sample of 212 BC patients enabled the optimization of five methylation targets, including VAX1, KCNV1, TAL1, PPOX1, and CFTR, which was obtained in our previous study, for BC diagnosis with a sensitivity and specificity of 88.68% and 87.25%, respectively. In addition, the methylation of VAX1 and LMX1A was found to be associated with BC recurrence.
We identified a promising diagnostic marker panel for early non-invasive detection and subsequent BC surveillance.
DNA methylation and histone modifications are epigenetic marks implicated in the complex regulation of vertebrate embryogenesis. The cross-talk between DNA methylation and Polycomb-dependent H3K27me3 histone mark has been reported in a number of organisms, and both marks are known to be required for proper developmental progression. Here we provide genome-wide DNA methylation (MethylCap-seq) and H3K27me3 (ChIP-seq) maps for three stages (dome, 24 hpf and 48 hpf) of zebrafish (Danio rerio) embryogenesis, as well as all analytical and methodological details associated with the generation of this dataset. We observe a strong antagonism between the two epigenetic marks present in CpG islands and their compatibility throughout the bulk of the genome, as previously reported in mammalian ESC lines (Brinkman et al., 2012). Next generation sequencing data linked to this project have been deposited in the Gene Expression Omnibus (GEO) database under accession numbers GSE35050 and GSE70847.
DNA methylation; Polycomb; Embryogenesis; Zebrafish
Methyl-binding domain (MBD) enrichment followed by deep sequencing (MBD-seq) is a robust and cost-efficient approach for methylome-wide association studies (MWAS). MBD-seq has been demonstrated to be capable of identifying differentially methylated regions, detecting previously reported robust associations and producing findings that replicate with other technologies such as targeted pyrosequencing of bisulfite converted DNA. There are several kits commercially available that can be used for MBD enrichment. Our previous work has involved MethylMiner (Life Technologies, Foster City, CA, USA) that we chose after careful investigation of its properties. However, in a recent evaluation of five commercially available MBD-enrichment kits the performance of the MethylMiner was deemed poor. Given our positive experience with MethylMiner, we were surprised by this report. In an attempt to reproduce these findings we here have performed a direct comparison of MethylMiner with MethylCap (Diagenode Inc, Denville, NJ, USA), the best performing kit in that study. We find that both MethylMiner and MethylCap are well-performing MBD-enrichment kits. However, MethylMiner shows somewhat better enrichment efficiency and lower levels of background “noise”. In addition, for the purpose of MWAS where we want to investigate the majority of CpGs, we find MethylMiner to be superior as it allows tailoring the enrichment to the regions where most CpGs are located. Using targeted bisulfite sequencing we confirmed that sites where methylation was detected by either MethylMiner or by MethylCap indeed were methylated.
Diminished ovarian function occurs early and is a primary cause for age-related decline in female fertility; however, its underlying mechanism remains unclear. This study investigated the roles that genome and epigenome structure play in age-related changes in gene expression and ovarian function, using human ovarian granulosa cells as an experimental system. DNA methylomes were compared between two groups of women with distinct age-related differences in ovarian functions, using both Methylated DNA Capture followed by Next Generation Sequencing (MethylCap-seq) and Reduced Representation Bisulfite Sequencing (RRBS); their transcriptomes were investigated using mRNA-seq. Significant, non-random changes in transcriptome and DNA methylome features are observed in human ovarian granulosa cells as women age and their ovarian functions deteriorate. The strongest correlations between methylation and the age-related changes in gene expression are not confined to the promoter region; rather, high densities of hypomethylated CpG-rich regions spanning the gene body are preferentially associated with gene down-regulation. This association is further enhanced where CpG regions are localized near the 3ʹ-end of the gene. Such features characterize several genes crucial in age-related decline in ovarian function, most notably the AMH (Anti-Müllerian Hormone) gene. The genome-wide correlation between the density of hypomethylated intragenic and 3ʹ-end regions and gene expression suggests previously unexplored mechanisms linking epigenome structure to age-related physiology and pathology.
DNA methylation; transcription end site; fertility; ovarian granulosa cell; transcriptome
Next-generation sequencing (NGS) has revolutionized systems-based analysis of cellular pathways. The goals of this study are to compare NGS-derived retinal transcriptome profiling (RNA-seq) to microarray and quantitative reverse transcription polymerase chain reaction (qRT–PCR) methods and to evaluate protocols for optimal high-throughput data analysis.
Retinal mRNA profiles of 21-day-old wild-type (WT) and neural retina leucine zipper knockout (Nrl−/−) mice were generated by deep sequencing, in triplicate, using Illumina GAIIx. The sequence reads that passed quality filters were analyzed at the transcript isoform level with two methods: Burrows–Wheeler Aligner (BWA) followed by analysis of variance (ANOVA) and TopHat followed by Cufflinks. qRT–PCR validation was performed using TaqMan and SYBR Green assays.
Using an optimized data analysis workflow, we mapped about 30 million sequence reads per sample to the mouse genome (build mm9) and identified 16,014 transcripts in the retinas of WT and Nrl−/− mice with BWA workflow and 34,115 transcripts with TopHat workflow. RNA-seq data confirmed stable expression of 25 known housekeeping genes, and 12 of these were validated with qRT–PCR. RNA-seq data had a linear relationship with qRT–PCR for more than four orders of magnitude and a goodness of fit (R²) of 0.8798. Approximately 10% of the transcripts showed differential expression between the WT and Nrl−/− retina, with a fold change ≥1.5 and p value <0.05. Altered expression of 25 genes was confirmed with qRT–PCR, demonstrating the high degree of sensitivity of the RNA-seq method. Hierarchical clustering of differentially expressed genes uncovered several as yet uncharacterized genes that may contribute to retinal function. Data analysis with BWA and TopHat workflows revealed a significant overlap yet provided complementary insights in transcriptome profiling.
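The differential expression criterion (fold change ≥1.5 and p value <0.05) can be expressed as a simple filter. The Python sketch below assumes the fold change is evaluated in either direction (up or down) on positive mean expression values; the abstract does not spell this out, and the function and parameter names are hypothetical.

```python
def is_differential(mean_a: float, mean_b: float, p_value: float,
                    min_fold: float = 1.5, alpha: float = 0.05) -> bool:
    """True if expression differs by at least min_fold in either direction and p < alpha."""
    if mean_a <= 0 or mean_b <= 0:
        return False  # fold change is undefined without positive expression in both groups
    fold = max(mean_a / mean_b, mean_b / mean_a)
    return fold >= min_fold and p_value < alpha
```

Both conditions must hold: a large fold change with a non-significant p value, or a significant but small change, is not counted as differential expression.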
Our study represents the first detailed analysis of retinal transcriptomes, with biologic replicates, generated by RNA-seq technology. The optimized data analysis workflows reported here should provide a framework for comparative investigations of expression profiles. Our results show that NGS offers a comprehensive and more accurate quantitative and qualitative evaluation of mRNA content within a cell or tissue. We conclude that RNA-seq based transcriptome characterization would expedite genetic network analyses and permit the dissection of complex biologic functions.
RNA sequencing (RNA-seq), a next-generation sequencing technique for transcriptome profiling, is being increasingly used, in part driven by the decreasing cost of sequencing. Nevertheless, the analysis of the massive amounts of data generated by large-scale RNA-seq remains a challenge. Multiple algorithms pertinent to basic analyses have been developed, and there is an increasing need to automate the use of these tools so as to obtain results in an efficient and user friendly manner. Increased automation and improved visualization of the results will help make the results and findings of the analyses readily available to experimental scientists.
By combining the best open source tools developed for RNA-seq data analyses and the most advanced web 2.0 technologies, we have implemented QuickRNASeq, a pipeline for large-scale RNA-seq data analyses and visualization. The QuickRNASeq workflow consists of three main steps. In Step #1, each individual sample is processed, including mapping RNA-seq reads to a reference genome, counting the numbers of mapped reads, quality control of the aligned reads, and SNP (single nucleotide polymorphism) calling. Step #1 is computationally intensive, and can be processed in parallel. In Step #2, the results from individual samples are merged, and an integrated and interactive project report is generated. All analysis results in the report are accessible via a single HTML entry webpage. Step #3 is the data interpretation and presentation step. The rich visualization features implemented here allow end users to interactively explore the results of RNA-seq data analyses, and to gain more insights into RNA-seq datasets. In addition, we used a real world dataset to demonstrate the simplicity and efficiency of QuickRNASeq in RNA-seq data analyses and interactive visualizations. The seamless integration of automated capabilities with interactive visualizations in QuickRNASeq is not available in other published RNA-seq pipelines.
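The structure of the first two steps, independent per-sample processing followed by a merge, can be sketched generically in Python. This is not QuickRNASeq code: the function names are hypothetical, and the per-sample step is a placeholder that only counts reads where a real pipeline would run mapping, counting, quality control and SNP calling.

```python
from concurrent.futures import ThreadPoolExecutor

def process_sample(sample_id, reads):
    """Step 1 placeholder: stands in for mapping, read counting, QC and SNP calling."""
    return {"sample": sample_id, "n_reads": len(reads)}

def merge_results(per_sample):
    """Step 2: combine per-sample summaries into one project-level table."""
    return {r["sample"]: r["n_reads"] for r in per_sample}

def run_project(samples, workers=4):
    """Process every sample independently (in parallel), then merge the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda item: process_sample(*item), samples.items()))
    return merge_results(results)
```

Because samples do not depend on one another in Step #1, the same pattern maps onto cluster job arrays; only the merge step needs all samples to be finished.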
The high degree of automation and interactivity in QuickRNASeq leads to a substantial reduction in the time and effort required prior to further downstream analyses and interpretation of the analyses findings. QuickRNASeq advances primary RNA-seq data analyses to the next level of automation, and is mature for public release and adoption.
RNA-seq; Pipeline; Workflow; Automation; Visualization; Batch processing; High-performance computing; Large-scale data analysis; D3; jQuery
While next-generation sequencing (NGS) technologies are rapidly advancing, an area that lags behind is the development of efficient and user-friendly tools for preliminary analysis of massive NGS data. As an effort to fill this gap to keep up with the fast pace of technological advancement and to accelerate data-to-results turnaround, we developed a novel software package named SeqAssist ("Sequencing Assistant" or SA).
SeqAssist takes NGS-generated FASTQ files as the input, employs the BWA-MEM aligner for sequence alignment, and aims to provide a quick overview and basic statistics of NGS data. It consists of three separate workflows: (1) the SA_RunStats workflow generates basic statistics about an NGS dataset, including numbers of raw, cleaned, redundant and unique reads, redundancy rate, and a list of unique sequences with length and read count; (2) the SA_Run2Ref workflow estimates the breadth, depth and evenness of genome-wide coverage of the NGS dataset at a nucleotide resolution; and (3) the SA_Run2Run workflow compares two NGS datasets to determine the redundancy (overlapping rate) between the two NGS runs. Statistics produced by SeqAssist or derived from SeqAssist output files are designed to inform the user: whether, what percentage, how many times and how evenly a genomic locus (i.e., gene, scaffold, chromosome or genome) is covered by sequencing reads, how redundant the sequencing reads are in a single run or between two runs. These statistics can guide the user in evaluating the quality of a DNA library prepared for RNA-Seq or genome (re-)sequencing and in deciding the number of sequencing runs required for the library. We have tested SeqAssist using a synthetic dataset and demonstrated its main features using multiple NGS datasets generated from genome re-sequencing experiments.
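The kinds of statistics described above can be illustrated with a few lines of Python. This sketch is not SeqAssist code and makes several assumptions: "unique" means distinct read sequences, the redundancy rate is 1 − unique/total, breadth is the fraction of positions covered at least once, and evenness is represented by the coefficient of variation of per-base depth (lower is more even). All function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

def read_stats(reads):
    """Total reads, distinct sequences, and redundancy rate for one run."""
    total = len(reads)
    unique = len(Counter(reads))
    redundancy = 1 - unique / total if total else 0.0
    return {"total": total, "unique": unique, "redundancy_rate": redundancy}

def coverage_stats(depths):
    """Breadth, mean depth and coefficient of variation from per-base depths."""
    if not depths:
        raise ValueError("empty depth vector")
    breadth = sum(1 for d in depths if d > 0) / len(depths)
    avg = mean(depths)
    cv = pstdev(depths) / avg if avg > 0 else float("inf")
    return {"breadth": breadth, "mean_depth": avg, "cv": cv}
```

A library with high redundancy and low breadth gains little from additional sequencing runs, which is the kind of decision these statistics are meant to inform.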
SeqAssist is a useful and informative tool that can serve as a valuable "assistant" to a broad range of investigators who conduct genome re-sequencing, RNA-Seq, or de novo genome sequencing and assembly experiments.
SeqAssist; next generation sequencing (NGS); sequencing data analysis; genome-wide coverage; breadth; depth; evenness; genome (re-)sequencing; RNA-Seq; FASTQ; BWA-MEM.
The advent of massively parallel sequencing (MPS) technology has led to the development of assays which facilitate the study of epigenomics and genomics at the genome-wide level. However, the computational burden resulting from the need to store and process the gigabytes of data streaming from sequencing machines, in addition to collecting metadata and returning data to users, is becoming a major issue for both sequencing cores and users alike. We present WASP, a LIMS system designed to automate MPS data pre-processing and analysis. WASP integrates a user-friendly MediaWiki front end, a network file system (NFS) and MySQL database for recording experimental data and metadata, plus a multi-node cluster for data processing. The workflow includes capture of sample submission information to the database using web forms on the wiki, recording of core facility operations on samples and linking of samples to flowcells in the database followed by automatic processing of sequence data and running of data analysis pipelines following the sequence run. WASP currently supports MPS using the Illumina GAIIx. For epigenomics applications we provide a pipeline for our novel HpaII-tiny fragment enrichment by ligation-mediated PCR (HELP)-tag method which enables us to quantify the methylation status of ∼1.8 million CpGs located in 70% of the HpaII sites (CCGG) in the human genome. We also provide ChIP-seq analysis using MACS, which is also applicable for methylated DNA immunoprecipitation (MeDIP) assays, in addition to miRNA and mRNA analyses using custom pipelines. Output from the analysis pipelines is automatically linked to a user's wiki-space and the data generated can be immediately viewed as tracks in a local mirror of the UCSC genome browser. WASP also provides capabilities for automated billing and keeping track of facility costs. We believe WASP represents a suitable model on which to develop LIMS systems for supporting MPS applications.
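Since the HELP-tag assay is anchored on HpaII recognition sites (CCGG), the set of sites it can interrogate in a reference sequence can be enumerated with a simple scan. The Python sketch below is illustrative only and is not part of WASP; the function name is hypothetical and only one strand needs to be scanned because CCGG is its own reverse complement.

```python
def hpaii_sites(seq: str):
    """Return 0-based start positions of every CCGG site in the sequence."""
    s = seq.upper()
    sites = []
    i = s.find("CCGG")
    while i != -1:
        sites.append(i)
        i = s.find("CCGG", i + 1)
    return sites
```

Each site contains one CpG (the central CG), so the number of sites bounds the number of CpGs such an assay can report for that sequence.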
Extensive reprogramming and dysregulation of DNA methylation is an important characteristic of pancreatic cancer (PC). Our study aimed to characterize the genomic methylation patterns in various genomic contexts of PC. The methyl capture sequencing (methylCap-seq) method was used to map differentially methylated regions (DMRs) in pooled samples from ten PC tissues and ten adjacent non-tumor (PN) tissues. A selection of DMRs was validated in an independent set of PC and PN samples using methylation-specific PCR (MSP), bisulfite sequencing PCR (BSP), and methylation sensitive restriction enzyme-based qPCR (MSRE-qPCR). The mRNA and expressed sequence tag (EST) expression of the corresponding genes was investigated using RT-qPCR.
A total of 1,131 PC-specific and 727 PN-specific hypermethylated DMRs were identified in association with CpG islands (CGIs), including gene-associated CGIs and orphan CGIs; 2,955 PC-specific and 2,386 PN-specific hypermethylated DMRs were associated with gene promoters, including promoters containing or lacking CGIs. Moreover, 1,744 PC-specific and 1,488 PN-specific hypermethylated DMRs were found to be associated with CGIs or CGI shores. These results suggested that aberrant hypermethylation in PC typically occurs in regions surrounding the transcription start site (TSS). The BSP, MSP, MSRE-qPCR, and RT-qPCR data indicated that the aberrant DNA methylation in PC tissue and in PC cell lines was associated with gene (or corresponding EST) expression.
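The assignment of DMRs to genomic contexts described above can be sketched in Python as an interval-overlap test. This is an illustrative sketch only: the function name, the half-open coordinate convention, and the 2 kb shore width are assumptions (the abstract does not state the shore definition used), and the coordinates in the usage example are invented.

```python
def classify_dmr(dmr, cgis, shore_width=2000):
    """Classify a DMR as 'CGI', 'CGI shore', or 'other' on one chromosome.

    dmr  : (start, end) half-open interval
    cgis : list of (start, end) half-open CpG island intervals
    shore_width : flank size treated as shore (assumed value, not from the source)
    """
    start, end = dmr
    in_shore = False
    for cgi_start, cgi_end in cgis:
        # Direct overlap with the island takes priority over shore overlap.
        if start < cgi_end and end > cgi_start:
            return "CGI"
        # Overlap with the island extended by the shore width on both sides.
        if start < cgi_end + shore_width and end > cgi_start - shore_width:
            in_shore = True
    return "CGI shore" if in_shore else "other"


if __name__ == "__main__":
    islands = [(10000, 11000)]  # hypothetical CpG island
    for region in [(10500, 10700), (11500, 11800), (20000, 20500)]:
        print(region, classify_dmr(region, islands))
```

Checking every island before returning a shore label ensures that a DMR touching one island's shore and another island's body is still reported as a CGI hit.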
Our study characterized the genome-wide DNA methylation patterns in PC and identified DMRs that were distributed among various genomic contexts that might influence the expression of corresponding genes or transcripts to promote PC. These DMRs might serve as diagnostic biomarkers or therapeutic targets for PC.
CGI shore; DNA methylation; genome-wide; methyl capture sequencing; orphan CGI; pancreatic adenocarcinoma
Many tools exist for the analysis of bacterial RNA sequencing (RNA-seq) transcriptional profiling experiments to identify differentially expressed genes between experimental conditions. Generally, the workflow includes quality control of reads, mapping to a reference, counting transcript abundance, and statistical tests for differentially expressed genes. In spite of the numerous tools developed for each component of an RNA-seq analysis workflow, easy-to-use bacterially oriented workflow applications to combine multiple tools and automate the process are lacking. With many tools to choose from for each step, the task of identifying a specific tool, adapting the input/output options to the specific use-case, and integrating the tools into a coherent analysis pipeline is not a trivial endeavor, particularly for microbiologists with limited bioinformatics experience.
To make bacterial RNA-seq data analysis more accessible, we developed a Simple Program for Automated reference-based bacterial RNA-seq Transcriptome Analysis (SPARTA). SPARTA is a reference-based bacterial RNA-seq analysis workflow application for single-end Illumina reads. SPARTA is turnkey software that simplifies the process of analyzing RNA-seq data sets, making bacterial RNA-seq analysis a routine process that can be undertaken on a personal computer or in the classroom. The easy-to-install, complete workflow processes whole transcriptome shotgun sequencing data files by trimming reads and removing adapters, mapping reads to a reference, counting gene features, calculating differential gene expression, and, importantly, checking for potential batch effects within the data set. SPARTA outputs quality analysis reports, gene feature counts and differential gene expression tables and scatterplots.
SPARTA provides an easy-to-use bacterial RNA-seq transcriptional profiling workflow to identify differentially expressed genes between experimental conditions. This software will enable microbiologists with limited bioinformatics experience to analyze their data and integrate next generation sequencing (NGS) technologies into the classroom. The SPARTA software and tutorial are available at sparta.readthedocs.org.
Bioinformatics; Data analysis; Transcriptomics; Microbiology; Next-generation sequencing; High-throughput sequencing
An important model of hepatocellular carcinoma (HCC) that has been described in southeast Asia includes the transition from chronic hepatitis B infection (CHB) to liver cirrhosis (LC) and, finally, to HCC. The genome-wide methylation profiling of plasma cell-free DNA (cfDNA) has not previously been used to assess HCC development. Using MethylCap-seq, we analyzed the genome-wide cfDNA methylation profiles by separately pooling healthy control (HC), CHB, LC and HCC samples and independently validating the library data for the tissue DNA and cfDNA by MSP, qMSP and Multiplex-BSP-seq.
The dynamic features of cfDNA methylation coincided with the natural course of HCC development. Data mining revealed the presence of 240, 272 and 286 differentially methylated genes (DMGs) corresponding to the early, middle and late stages of HCC progression, respectively. The validation of the DNA and cfDNA results in independent tissues identified three DMGs, including ZNF300, SLC22A20 and SHISA7, with the potential for distinguishing between CHB and LC as well as between LC and HCC. The area under the curve (AUC) ranged from 0.65 to 0.80, and the odds ratio (OR) values ranged from 5.18 to 14.2.
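For readers unfamiliar with the two statistics reported above, the following Python sketch shows how an odds ratio and an area under the ROC curve can be computed from first principles. The function names and all numbers are invented for illustration; they are not the study's data, and the study's own statistical procedure is not described in the abstract.

```python
def odds_ratio(cases_pos, cases_neg, controls_pos, controls_neg):
    """Odds ratio from a 2x2 table of marker-positive/negative cases and controls."""
    return (cases_pos * controls_neg) / (cases_neg * controls_pos)


def auc(case_scores, control_scores):
    """AUC as the probability that a random case outscores a random control.

    Ties contribute 0.5, which matches the Mann-Whitney U formulation.
    """
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


if __name__ == "__main__":
    # Hypothetical counts: 20 of 30 cases and 5 of 20 controls are methylated.
    print("OR :", odds_ratio(20, 10, 5, 15))
    # Hypothetical methylation scores for three cases and three controls.
    print("AUC:", auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2]))
```

The pairwise formulation is quadratic in sample size but makes the interpretation of the AUC explicit; rank-based implementations give the same value more efficiently.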
Our data revealed highly dynamic cfDNA methylation profiles in support of HBV-related HCC development. We have identified a panel of DMGs that are predictive for the early, middle and late stages of HCC development, and these are potential markers for the early detection of HCC as well as the screening of high-risk populations.
Electronic supplementary material
The online version of this article (doi:10.1186/1868-7083-6-30) contains supplementary material, which is available to authorized users.
Plasma; Cell-free DNA; HBV; HCC development; Genome-wide; DNA methylation
The ability to assay genome-scale methylation patterns using high-throughput sequencing makes it possible to carry out association studies to determine the relationship between epigenetic variation and phenotype. While bisulfite sequencing can determine a methylome at high resolution, cost inhibits its use in comparative and population studies. MethylSeq, based on sequencing of fragment ends produced by a methylation-sensitive restriction enzyme, is a method for methyltyping (survey of methylation states) and is a site-specific and cost-effective alternative to whole-genome bisulfite sequencing. Despite its advantages, the use of MethylSeq has been restricted by biases in MethylSeq data that complicate the determination of methyltypes. Here we introduce a statistical method, MetMap, that produces corrected site-specific methylation states from MethylSeq experiments and annotates unmethylated islands across the genome. MetMap integrates genome sequence information with experimental data, in a statistically sound and cohesive Bayesian Network. It infers the extent of methylation at individual CGs and across regions, and serves as a framework for comparative methylation analysis within and among species. We validated MetMap's inferences with direct bisulfite sequencing, showing that the methylation status of sites and islands is accurately inferred. We used MetMap to analyze MethylSeq data from four human neutrophil samples, identifying novel, highly unmethylated islands that are invisible to sequence-based annotation strategies. The combination of MethylSeq and MetMap is a powerful and cost-effective tool for determining genome-scale methyltypes suitable for comparative and association studies.
In the vertebrates, methylation of cytosine residues in DNA regulates gene activity in concert with proteins that associate with DNA. Large-scale genomewide comparative studies that seek to link specific methylation patterns to disease will require hundreds or thousands of samples, and thus economical methods that assay genomewide methylation. One such method is MethylSeq, which samples cytosine methylation at site-specific resolution by high-throughput sequencing of the ends of DNA fragments generated by methylation-sensitive restriction enzymes. MethylSeq's low cost and simplicity of implementation enable its use in large-scale comparative studies, but biases inherent to the method inhibit interpretation of the data it produces. Here we present MetMap, a statistical framework that first accounts for the biases in MethylSeq data and then generates an analysis of the data that is suitable for use in comparative studies. We show that MethylSeq and MetMap can be used together to determine methylation profiles across the genome, and to identify novel unmethylated regions that are likely to be involved in gene regulation. The ability to conduct comparative studies of sufficient scale at a reasonable cost promises to reveal new insights into the relationship between cytosine methylation and phenotype.
Aberrant DNA methylation often occurs in colorectal cancer (CRC). In our study we applied a genome-wide DNA methylation analysis approach, MethylCap-seq, to map the differentially methylated regions (DMRs) in 24 tumors and matched normal colon samples. In total, 2687 frequently hypermethylated and 468 frequently hypomethylated regions were identified, which include potential biomarkers for CRC diagnosis. Hypermethylation in the tumor samples was enriched at CpG islands and gene promoters, while hypomethylation was distributed throughout the genome. Using epigenetic data from human embryonic stem cells, we show that frequently hypermethylated regions coincide with bivalent loci in human embryonic stem cells. DNA methylation is commonly thought to lead to gene silencing; however, integration of publicly available gene expression data indicates that 75% of the frequently hypermethylated genes were most likely already lowly or not expressed in normal tissue. Collectively, our study provides genome-wide DNA methylation maps of CRC and comprehensive lists of DMRs, and gives insights into the role of aberrant DNA methylation in CRC formation.
DNA methylation; colorectal cancer; biomarkers; H3K27me3; gene expression; Illumina sequencing
Chromatin immunoprecipitation (ChIP) followed by high-throughput sequencing (ChIP-seq) or ChIP followed by genome tiling array analysis (ChIP-chip) have become standard technologies for genome-wide identification of DNA-binding protein target sites. A number of algorithms have been developed in parallel that allow identification of binding sites from ChIP-seq or ChIP-chip datasets and subsequent visualization in the University of California Santa Cruz (UCSC) Genome Browser as custom annotation tracks. However, summarizing these tracks can be a daunting task, particularly if there are a large number of binding sites or the binding sites are distributed widely across the genome.
We have developed ChIPpeakAnno as a Bioconductor package within the statistical programming environment R to facilitate batch annotation of enriched peaks identified from ChIP-seq, ChIP-chip, cap analysis of gene expression (CAGE) or any experiments resulting in a large number of enriched genomic regions. The binding sites annotated with ChIPpeakAnno can be viewed easily as a table, a pie chart or plotted in histogram form, i.e., the distribution of distances to the nearest genes for each set of peaks. In addition, we have implemented functionalities for determining the significance of overlap between replicates or binding sites among transcription factors within a complex, and for drawing Venn diagrams to visualize the extent of the overlap between replicates. Furthermore, the package includes functionalities to retrieve sequences flanking putative binding sites for PCR amplification, cloning, or motif discovery, and to identify Gene Ontology (GO) terms associated with adjacent genes.
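The nearest-gene annotation described above can be illustrated with a small Python sketch. ChIPpeakAnno itself is an R/Bioconductor package; the code below does not use or reproduce its API. The function names, the single-chromosome simplification, the use of peak midpoints, and the coordinates are all assumptions made for illustration.

```python
from bisect import bisect_left


def nearest_tss_distance(position, sorted_tss):
    """Signed distance from a position to the nearest TSS (negative = peak lies upstream
    in coordinate terms). Assumes sorted_tss is a non-empty ascending list."""
    if not sorted_tss:
        raise ValueError("at least one TSS is required")
    i = bisect_left(sorted_tss, position)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = sorted_tss[max(i - 1, 0): i + 1]
    nearest = min(candidates, key=lambda tss: abs(position - tss))
    return position - nearest


def annotate_peaks(peaks, tss_positions):
    """Return the signed distance from each peak midpoint to its nearest TSS."""
    sorted_tss = sorted(tss_positions)
    return [nearest_tss_distance((start + end) // 2, sorted_tss)
            for start, end in peaks]


if __name__ == "__main__":
    tss = [1000, 5000, 9000]                          # hypothetical TSS coordinates
    peaks = [(900, 1100), (3900, 4100), (8000, 8200)]  # hypothetical peak intervals
    print(annotate_peaks(peaks, tss))
```

The resulting list of distances is the kind of input that would then be binned to draw the histogram of distances to the nearest genes; strand-aware distances would need gene orientation as an additional input.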
ChIPpeakAnno enables batch annotation of the binding sites identified from ChIP-seq, ChIP-chip, CAGE or any technology that results in a large number of enriched genomic regions within the statistical programming environment R. Flexibility comes from allowing users to pass their own annotation data, such as a different chromatin immunoprecipitation (ChIP) preparation or a dataset from the literature, or existing annotation packages such as GenomicFeatures and BSgenome. Tight integration with the biomaRt package enables up-to-date annotation retrieval from the BioMart database.
Exploration of DNA methylation and its impact on various regulatory mechanisms has become a very active field of research. At the same time, there is a growing need for tools to process and analyse the data, together with statistical investigation and visualisation.
MethVisual is a new application that enables exploratory analysis and intuitive visualization of DNA methylation data as typically generated by bisulfite sequencing. The package allows the import of DNA methylation sequences, aligns them and performs a quality control comparison. It comprises basic analysis steps such as lollipop visualization, co-occurrence display of methylation of neighbouring and distant CpG sites, summary statistics on methylation status, clustering and correspondence analysis. The package has been developed for methylation data but can also be used for other data types for which binary coding can be inferred. The application of the package and its workflow, based on two datasets, as well as a comparison to existing DNA methylation analysis tools, is presented in this paper.
The R package MethVisual offers various analysis procedures for data that can be binarized, in particular for bisulfite-sequenced methylation data. R/Bioconductor has become one of the most important environments for statistical analysis of various types of biological and medical data. Therefore, any data analysis within R that allows the integration of various data types from different technological platforms is convenient. It is the first, and so far the only, package dedicated to DNA methylation analysis, in particular of bisulfite-sequenced data, available in the R/Bioconductor environment. The package is available for free at http://methvisual.molgen.mpg.de/ and from the Bioconductor Consortium http://www.bioconductor.org.
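The binary coding that underlies these analysis steps can be sketched in Python as follows. MethVisual is an R package and this sketch does not reproduce its functions; the function names, the 1/0 encoding of methylated/unmethylated CpGs, and the toy clone data are assumptions for illustration only.

```python
def methylation_fraction_per_site(clones):
    """Fraction of clones methylated at each CpG site (summary statistic)."""
    n_clones = len(clones)
    n_sites = len(clones[0])
    return [sum(clone[j] for clone in clones) / n_clones for j in range(n_sites)]


def cooccurrence(clones, site_i, site_j):
    """Fraction of clones in which both CpG sites are methylated."""
    both = sum(1 for clone in clones if clone[site_i] == 1 and clone[site_j] == 1)
    return both / len(clones)


def render_lollipop(clone):
    """Text version of a lollipop row: filled circle = methylated, open = unmethylated."""
    return "".join("●" if state == 1 else "○" for state in clone)


if __name__ == "__main__":
    # Hypothetical data: 4 sequenced clones, 3 CpG sites each (1 = methylated).
    clones = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 1]]
    for clone in clones:
        print(render_lollipop(clone))
    print("Per-site methylation:", methylation_fraction_per_site(clones))
    print("Co-occurrence of sites 0 and 1:", cooccurrence(clones, 0, 1))
```

Because every step operates on a plain 0/1 matrix, the same functions apply to any other data type that can be binarized, which is the generality the paragraph above refers to.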