Advances in whole genome profiling have revolutionized the cancer research field, but at the same time have raised new bioinformatics challenges. For next generation sequencing (NGS), these include data storage, computational costs, sequence processing and alignment, delineating appropriate statistical measures, and data visualization. The NGS application MethylCap-seq involves the in vitro capture of methylated DNA and subsequent analysis of enriched fragments by massively parallel sequencing. Here, we present a scalable, flexible workflow for MethylCap-seq Quality Control, secondary data analysis, tertiary analysis of multiple experimental groups, and data visualization. This workflow and its suite of features will assist biologists in conducting methylation profiling projects and facilitate meaningful biological interpretation.
next generation sequencing; DNA methylation; epigenetics; cancer; data analysis; data visualization
Non-small cell lung carcinoma (NSCLC) is a complex malignancy that, owing to its heterogeneity and poor prognosis, poses many challenges to diagnosis, prognosis and patient treatment. DNA methylation is an important mechanism of epigenetic regulation involved in normal development and cancer. It is a very stable and specific modification and therefore, in principle, a very suitable marker for epigenetic phenotyping of tumors. Here we present a genome-wide DNA methylation analysis of NSCLC samples and paired lung tissues, in which we combine MethylCap and next generation sequencing (MethylCap-seq) to provide comprehensive DNA methylation maps of the tumor and paired lung samples. The MethylCap-seq data were validated by bisulfite sequencing and methyl-specific polymerase chain reaction of selected regions.
Analysis of the MethylCap-seq data revealed a strong positive correlation between replicate experiments and between paired tumor/lung samples. We identified 57 differentially methylated regions (DMRs) present in all NSCLC tumors analyzed by MethylCap-seq. While hypomethylated DMRs did not correlate to any particular functional category of genes, the hypermethylated DMRs were strongly associated with genes encoding transcriptional regulators. Furthermore, subtelomeric regions and satellite repeats were hypomethylated in the NSCLC samples. We also identified DMRs that were specific to two of the major subtypes of NSCLC, adenocarcinomas and squamous cell carcinomas.
Collectively, we provide a resource containing genome-wide DNA methylation maps of NSCLC and their paired lung tissues, and comprehensive lists of known and novel DMRs and associated genes in NSCLC.
DNA Methylation; Epigenetics; MethylCap; Next generation sequencing; Non-small cell lung Cancer
DNA methylation is an important epigenetic mark and dysregulation of DNA methylation is associated with many diseases including cancer. Advances in next-generation sequencing now allow unbiased methylome profiling of entire patient cohorts, greatly facilitating biomarker discovery and presenting new opportunities to understand the biological mechanisms by which changes in methylation contribute to disease. Enrichment-based sequencing assays such as MethylCap-seq are a cost effective solution for genome-wide determination of methylation status, but the technical reliability of methylation reconstruction from raw sequencing data has not been well characterized.
We analyze three MethylCap-seq data sets and perform two different analyses to assess data quality. First, we investigate how data quality is affected by excluding samples that do not meet quality control cutoff requirements. Second, we consider the effect of additional reads on enrichment score, saturation, and coverage. Lastly, we verify a method for the determination of the global amount of methylation from MethylCap-seq data by comparing to a spiked-in control DNA of known methylation status.
We show that rejection of samples based on our quality control parameters leads to a significant improvement of methylation calling. Additional reads beyond ~13 million unique aligned reads improved coverage, modestly improved saturation, and did not impact enrichment score. Lastly, we find that a global methylation indicator calculated from MethylCap-seq data correlates well with the global methylation level of a sample as obtained from a spike-in DNA of known methylation level.
We show that with appropriate quality control MethylCap-seq is a reliable tool, suitable for cohorts of hundreds of patients, that provides reproducible methylation information on a feature by feature basis as well as information about the global level of methylation.
Cisplatin resistance is a major contributor to the high death rate of ovarian cancer. Methyl-Capture sequencing (MethylCap-seq), which combines precipitation of methylated DNA by the recombinant methyl-CpG binding domain of the MBD2 protein with NGS, enables global and unbiased analysis of DNA methylation patterns. We applied MethylCap-seq to analyze the genome-wide DNA methylation profiles of the cisplatin-sensitive ovarian cancer cell line A2780 and its isogenic resistant derivative A2780CP. We obtained 21,763,035 raw reads for the drug-resistant cell line A2780CP and 18,821,061 reads for the sensitive cell line A2780. We identified 1224 hypermethylated and 1216 hypomethylated differentially methylated regions (DMRs) in A2780CP compared to A2780. Our MethylCap-seq data on this cisplatin-resistant ovarian cancer model provide a resource for the research community. We also found that A2780CP, compared to A2780, has lower observed-to-expected methylated CpG ratios, suggesting lower global CpG methylation in A2780CP cells. Methylation-specific PCR and bisulfite sequencing confirmed hypermethylation of PTK6, PRKCE and BCL2L1 in A2780 compared with A2780CP. Furthermore, treatment of A2780 cells with the demethylating reagent 5-aza-dC demethylated the promoters and restored the expression of PTK6, PRKCE and BCL2L1.
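The abstract does not state how the observed-to-expected CpG ratio was computed. As a minimal sketch, assuming the commonly used definition (number of CpG dinucleotides times sequence length, divided by the product of the C and G counts), the ratio for a single sequence could be calculated as follows. The function name `cpg_obs_exp` is hypothetical, and applying it to captured reads or regions rather than to a whole genome is an assumption.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio of one DNA sequence.

    Assumed definition: (#CpG * N) / (#C * #G), where N is the sequence length.
    Returns 0.0 when the sequence contains no C or no G.
    """
    s = seq.upper()
    n = len(s)
    c = s.count("C")
    g = s.count("G")
    cpg = s.count("CG")  # "CG" cannot overlap itself, so count() is exact
    if c == 0 or g == 0:
        return 0.0
    return cpg * n / (c * g)


if __name__ == "__main__":
    # Illustrative sequences only, not data from the study.
    for example in ("ACGT", "CGCG", "CCGGAATT"):
        print(example, cpg_obs_exp(example))
```

A lower ratio in one sample than in another, computed over the captured fragments, would be read in the same direction as the comparison described above.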
DNA methylation is a key component of mammalian gene regulation and the most classical example of an epigenetic mark. DNA methylation patterns are mitotically heritable and stable over time, but they undergo considerable changes in response to cell differentiation, diseases and environmental influences. Several methods have been developed for DNA methylation profiling on a genomic scale. Here, we benchmark four of these methods on two sample pairs, comparing their accuracy and power to detect DNA methylation differences. The results show that all evaluated methods (MeDIP-seq: methylated DNA immunoprecipitation, MethylCap-seq: methylated DNA capture by affinity purification, RRBS: reduced representation bisulfite sequencing, and the Infinium HumanMethylation27 assay) produce accurate DNA methylation data. However, these methods differ in their ability to detect differentially methylated regions between pairs of samples. We highlight strengths and weaknesses of the four methods and give practical recommendations for the design of epigenomic case-control studies.
Epigenome profiling; epigenetics; sequencing; differentially methylated regions; molecular diagnostics; biomarker discovery; cancer
Aberrant DNA methylation often occurs in colorectal cancer (CRC). In our study we applied a genome-wide DNA methylation analysis approach, MethylCap-seq, to map the differentially methylated regions (DMRs) in 24 tumors and matched normal colon samples. In total, 2687 frequently hypermethylated and 468 frequently hypomethylated regions were identified, which include potential biomarkers for CRC diagnosis. Hypermethylation in the tumor samples was enriched at CpG islands and gene promoters, while hypomethylation was distributed throughout the genome. Using epigenetic data from human embryonic stem cells, we show that frequently hypermethylated regions coincide with bivalent loci in these cells. DNA methylation is commonly thought to lead to gene silencing; however, integration of publicly available gene expression data indicates that 75% of the frequently hypermethylated genes were most likely already expressed at low levels or not at all in normal tissue. Collectively, our study provides genome-wide DNA methylation maps of CRC and comprehensive lists of DMRs, and gives insights into the role of aberrant DNA methylation in CRC formation.
DNA methylation; colorectal cancer; biomarkers; H3K27me3; gene expression; Illumina sequencing
There is a need to supplement or supplant the conventional diagnostic tools, namely, cystoscopy and B-type ultrasound, for bladder cancer (BC). We aimed to identify novel DNA methylation markers for BC through genome-wide profiling of BC cell lines and subsequent methylation-specific PCR (MSP) screening of clinical urine samples.
The methyl-DNA binding domain (MBD) capture technique, methylCap/seq, was performed to screen for specific hypermethylated CpG islands in two BC cell lines (5637 and T24). The top one hundred hypermethylated targets were sequentially screened by MSP in urine samples to gradually narrow the target number and optimize the composition of the diagnostic panel. The diagnostic performance of the obtained panel was evaluated in different clinical scenarios.
A total of 1,627 hypermethylated promoter targets in the BC cell lines were identified by Illumina sequencing. The top 104 hypermethylated targets were reduced to eight genes (VAX1, KCNV1, ECEL1, TMEM26, TAL1, PROX1, SLC6A20, and LMX1A) after the urine DNA screening in a small sample of 8 normal control and 18 BC subjects. Validation in an independent sample of 212 BC patients enabled the optimization of a panel of five methylation targets (VAX1, KCNV1, TAL1, PPOX1, and CFTR, the last obtained in our previous study) for BC diagnosis, with a sensitivity and specificity of 88.68% and 87.25%, respectively. In addition, the methylation of VAX1 and LMX1A was found to be associated with BC recurrence.
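For readers less familiar with the two performance measures quoted above, the following minimal sketch shows how sensitivity and specificity are derived from the four cells of a confusion matrix. The counts in the usage example are hypothetical and are not taken from the study.

```python
def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float]:
    """Return (sensitivity, specificity) from confusion-matrix counts.

    sensitivity = TP / (TP + FN): fraction of true cases the panel detects.
    specificity = TN / (TN + FP): fraction of controls the panel correctly rejects.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case and one control")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    sens, spec = sensitivity_specificity(tp=90, fn=10, tn=80, fp=20)
    print(f"sensitivity={sens:.2%} specificity={spec:.2%}")
```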
We identified a promising diagnostic marker panel for early non-invasive detection and subsequent BC surveillance.
In previous work, we designed a modified aptamer-free SELEX-seq protocol (afSELEX-seq) for the discovery of transcription factor binding sites. Here, we present original software, TFAST, designed to analyze afSELEX-seq data, validated against our previously generated afSELEX-seq dataset and a model dataset. TFAST is designed with a simple graphical interface (Java) so that it can be installed and executed without extensive expertise in bioinformatics. TFAST completes analysis within minutes on most personal computers.
Once afSELEX-seq data are aligned to a target genome, TFAST identifies peaks and, uniquely, compares peak characteristics between cycles. TFAST generates a hierarchical report of graded peaks, their associated genomic sequences, binding site length predictions, and dummy sequences.
Including additional cycles of afSELEX-seq improved TFAST's ability to selectively identify peaks, leading to 7,274, 4,255, and 2,628 peaks identified in two-, three-, and four-cycle afSELEX-seq. Inter-round analysis by TFAST identified 457 peaks as the strongest candidates for true binding sites. Separating peaks by TFAST into classes of worst, second-best and best candidate peaks revealed a trend of increasing significance (e-values 4.5 × 10^−12, 2.9 × 10^−46, and 1.2 × 10^−73) and informational content (11.0, 11.9, and 12.5 bits over 15 bp) of discovered motifs within each respective class. TFAST also predicted a binding site length (28 bp) consistent with non-computational experimentally derived results for the transcription factor PapX (22 to 29 bp).
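The abstract reports motif information content in bits but does not say how it was computed. A minimal sketch, assuming the standard Shannon-based definition for DNA (2 bits minus the column entropy, summed over positions, with no small-sample correction), is shown below. The function names and the example counts are hypothetical and do not come from TFAST.

```python
import math


def column_information(counts: dict[str, int]) -> float:
    """Information content (bits) of one motif column from base counts.

    Assumed definition: log2(4) + sum(p * log2(p)) over observed bases.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("column has no observations")
    bits = 2.0  # maximum for a four-letter alphabet
    for n in counts.values():
        if n > 0:
            p = n / total
            bits += p * math.log2(p)
    return bits


def motif_information(columns: list[dict[str, int]]) -> float:
    """Total information content of a motif: the sum over its columns."""
    return sum(column_information(col) for col in columns)


if __name__ == "__main__":
    # Hypothetical three-column motif for illustration only.
    motif = [
        {"A": 10, "C": 0, "G": 0, "T": 0},  # fully conserved
        {"A": 5, "C": 5, "G": 0, "T": 0},   # two equally likely bases
        {"A": 5, "C": 5, "G": 5, "T": 5},   # uninformative
    ]
    print(motif_information(motif))
```

Under this definition a 15 bp motif can carry at most 30 bits, which gives a scale for the values of 11.0 to 12.5 bits quoted above.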
TFAST offers a novel and intuitive approach for determining DNA binding sites of proteins subjected to afSELEX-seq. Here, we demonstrate that TFAST, using afSELEX-seq data, rapidly and accurately predicted sequence length and motif for a putative transcription factor's binding site.
Although several tools for the analysis of ChIP-seq data have been published recently, there is a growing demand, in particular in the plant research community, for computational resources with which such data can be processed, analyzed, stored, visualized and integrated within a single, user-friendly environment. To accommodate this demand, we have developed PRI-CAT (Plant Research International ChIP-seq analysis tool), a web-based workflow tool for the management and analysis of ChIP-seq experiments. PRI-CAT is currently focused on Arabidopsis, but will be extended with other plant species in the near future. Users can directly submit their sequencing data to PRI-CAT for automated analysis. A QuickLoad server compatible with genome browsers is implemented for the storage and visualization of DNA-binding maps. Submitted datasets and results can be made publicly available through PRI-CAT, a feature that will enable community-based integrative analysis and visualization of ChIP-seq experiments. Secondary analysis of data can be performed with the aid of GALAXY, an external framework for tool and data integration. PRI-CAT is freely available at http://www.ab.wur.nl/pricat. No login is required.
High-throughput sequencing technologies enable direct approaches to catalog and analyze snapshots of the total small RNA content of living cells. Characterization of high-throughput sequencing data requires bioinformatic tools offering a wide perspective of the small RNA transcriptome. Here we present SeqBuster, a highly versatile and reliable web-based toolkit to process and analyze large-scale small RNA datasets. The high flexibility of this tool is illustrated by the multiple choices offered in the pre-analysis for mapping purposes and in the different analysis modules for data manipulation. To overcome the storage capacity limitations of the web-based tool, SeqBuster offers a stand-alone version that permits annotation against any custom database. SeqBuster integrates multiple analysis modules in a single platform and constitutes the first bioinformatic tool offering a deep characterization of miRNA variants (isomiRs). The application of SeqBuster to small-RNA datasets of human embryonic stem cells revealed that most miRNAs present different types of isomiRs, some of which are associated with stem cell differentiation. The exhaustive description of the isomiRs provided by SeqBuster could help to identify miRNA variants that are relevant in physiological and pathological processes. SeqBuster is available at http://estivill_lab.crg.es/seqbuster.
Summary: A combination of bisulfite treatment of DNA and high-throughput sequencing (BS-Seq) can capture a snapshot of a cell's epigenomic state by revealing its genome-wide cytosine methylation at single base resolution. Bismark is a flexible tool for the time-efficient analysis of BS-Seq data which performs both read mapping and methylation calling in a single convenient step. Its output discriminates between cytosines in CpG, CHG and CHH context and enables bench scientists to visualize and interpret their methylation data soon after the sequencing run is completed.
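The three sequence contexts named above can be illustrated with a short sketch. This is not Bismark's code; it is a minimal forward-strand classifier, assuming the usual convention that H stands for A, C or T. The function name `cytosine_context` is hypothetical.

```python
H_BASES = {"A", "C", "T"}  # IUPAC "H": any base except G


def cytosine_context(seq: str, i: int) -> str:
    """Classify the cytosine at 0-based index i as CpG, CHG, CHH or unknown.

    Forward strand only. "unknown" is returned when the downstream bases
    run off the end of the sequence or are ambiguous (for example N).
    """
    s = seq.upper()
    if s[i] != "C":
        raise ValueError(f"position {i} is not a cytosine")
    first = s[i + 1:i + 2]   # empty string if past the end
    second = s[i + 2:i + 3]
    if first == "G":
        return "CpG"
    if first in H_BASES and second == "G":
        return "CHG"
    if first in H_BASES and second in H_BASES:
        return "CHH"
    return "unknown"


if __name__ == "__main__":
    seq = "ACGTCAGCTTC"
    for pos, base in enumerate(seq):
        if base == "C":
            print(pos, cytosine_context(seq, pos))
```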
Availability and implementation: Bismark is released under the GNU GPLv3+ licence. The source code is freely available from www.bioinformatics.bbsrc.ac.uk/projects/bismark/.
Supplementary information: Supplementary data are available at Bioinformatics online.
Reduced representation bisulfite sequencing (RRBS), which couples bisulfite conversion and next generation sequencing, is an innovative method that specifically enriches genomic regions with a high density of potential methylation sites and enables investigation of DNA methylation at single-nucleotide resolution. Recent advances in the Illumina DNA sample preparation protocol and sequencing technology have vastly improved sequencing throughput capacity. Although the new Illumina technology is now widely used, the unique challenges associated with multiplexed RRBS libraries on this platform have not been previously described. We have made modifications to the RRBS library preparation protocol to sequence multiplexed libraries on a single flow cell lane of the Illumina HiSeq 2000. Furthermore, our analysis incorporates a bioinformatics pipeline specifically designed to process bisulfite-converted sequencing reads and evaluate the output and quality of the sequencing data generated from the multiplexed libraries. We obtained an average of 42 million paired-end reads per sample for each flow-cell lane, with a high unique mapping efficiency to the reference human genome. Here we provide a roadmap of the modifications, strategies, and troubleshooting approaches we implemented to optimize sequencing of multiplexed libraries on an RRBS background.
Next Generation Sequencing (NGS) technology generates tens of millions of short reads for each DNA/RNA sample. A key step in NGS data analysis is the short read alignment of the generated sequences to a reference genome. Although storing alignment information in the Sequence Alignment/Map (SAM) or Binary SAM (BAM) format is now standard, biomedical researchers still have difficulty accessing this information.
We have developed a Graphical User Interface (GUI) software tool named SAMMate. SAMMate allows biomedical researchers to quickly process SAM/BAM files and is compatible with both single-end and paired-end sequencing technologies. SAMMate also automates some standard procedures in DNA-seq and RNA-seq data analysis. Using either standard or customized annotation files, SAMMate allows users to accurately calculate the short read coverage of genomic intervals. In particular, for RNA-seq data SAMMate can accurately calculate the gene expression abundance scores for customized genomic intervals using short reads originating from both exons and exon-exon junctions. Furthermore, SAMMate can quickly calculate a whole-genome signal map at base-wise resolution allowing researchers to solve an array of bioinformatics problems. Finally, SAMMate can export both a wiggle file for alignment visualization in the UCSC genome browser and an alignment statistics report. The biological impact of these features is demonstrated via several case studies that predict miRNA targets using short read alignment information files.
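The idea of a base-wise signal map and of read coverage over a genomic interval can be sketched in a few lines. This is not SAMMate's implementation; it is a minimal illustration that assumes reads are already reduced to 0-based, half-open (start, end) coordinates on one chromosome. All function names are hypothetical.

```python
def basewise_coverage(reads: list[tuple[int, int]], length: int) -> list[int]:
    """Per-base read depth over a region of the given length.

    Uses a difference array: +1 where a read starts, -1 where it ends,
    then a running sum gives the depth at each base.
    """
    diff = [0] * (length + 1)
    for start, end in reads:
        s, e = max(start, 0), min(end, length)  # clip to the region
        if s < e:
            diff[s] += 1
            diff[e] -= 1
    coverage, depth = [], 0
    for delta in diff[:length]:
        depth += delta
        coverage.append(depth)
    return coverage


def mean_interval_coverage(coverage: list[int], start: int, end: int) -> float:
    """Mean depth over the half-open interval [start, end)."""
    window = coverage[start:end]
    return sum(window) / len(window) if window else 0.0


if __name__ == "__main__":
    # Illustrative reads only.
    cov = basewise_coverage([(0, 4), (2, 6)], length=8)
    print(cov)
    print(mean_interval_coverage(cov, 2, 6))
```

The per-base list corresponds to the kind of signal that a wiggle file stores, and the interval mean is one simple way to summarize coverage for an annotated feature.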
With just a few mouse clicks, SAMMate will provide biomedical researchers easy access to important alignment information stored in SAM/BAM files. Our software is constantly updated and will greatly facilitate the downstream analysis of NGS data. Both the source code and the GUI executable are freely available under the GNU General Public License at http://sammate.sourceforge.net.
In a single experiment, chromatin immunoprecipitation combined with high throughput sequencing (ChIP-seq) provides genome-wide information about a given covalent histone modification or transcription factor occupancy. However, time-efficient bioinformatics resources for extracting biological meaning from these gigabyte-scale datasets are often a limiting factor for data interpretation by biologists. We created an integrated, portable ChIP-seq data interpretation platform called seqMINER, with performance optimized for efficient handling of multiple genome-wide datasets. seqMINER allows comparison and integration of multiple ChIP-seq datasets and extraction of qualitative as well as quantitative information. seqMINER can handle the biological complexity of most experimental situations and proposes methods to the user for data classification according to the analysed features. In addition, through multiple graphical representations, seqMINER allows visualization and modelling of general as well as specific patterns in a given dataset. To demonstrate the efficiency of seqMINER, we have carried out a comprehensive analysis of genome-wide chromatin modification data in mouse embryonic stem cells to understand the global epigenetic landscape and its change through cellular differentiation.
Liver fibrosis is caused by chemicals or viral infection. The progression of liver fibrosis results in hepatocellular carcinogenesis in later stages. Recent studies have revealed the importance of DNA hypermethylation in the progression of liver fibrosis to hepatocellular carcinoma (HCC). However, the importance of DNA methylation in the early-stage liver fibrosis remains unclear.
To address this issue, we used a pathological mouse model of early-stage liver fibrosis that was induced by treatment with carbon tetrachloride (CCl4) for 2 weeks and performed a genome-wide analysis of DNA methylation status. This global analysis of DNA methylation was performed using a combination of methyl-binding protein (MBP)-based high throughput sequencing (MBP-seq) and the bioinformatic tools IPA and Oncomine. To confirm the functional aspects of the MBP-seq data, we additionally used biochemical methods, such as bisulfite modification and in vitro methylation assays.
The genome-wide analysis revealed that DNA methylation status was reduced throughout the genome because of CCl4 treatment in the early-stage liver fibrosis. Bioinformatic and biochemical analyses revealed that a gene associated with fibrosis, secreted phosphoprotein 1 (Spp1), which induces inflammation, was hypomethylated and its expression was up-regulated. These results suggest that DNA hypomethylation of the genes responsible for fibrosis may precede the onset of liver fibrosis. Moreover, Spp1 is also known to enhance tumor development. Using the web-based database, we revealed that Spp1 expression is increased in HCC.
Our study suggests that hypomethylation is crucial for the onset of liver fibrosis and for its progression to HCC. The elucidation of this change in methylation status from the onset of fibrosis through subsequent progression to HCC may lead to a new approach to clinical diagnosis.
Chromatin immunoprecipitation followed by massively parallel next-generation sequencing (ChIP-seq) is a valuable experimental strategy for assaying protein-DNA interaction over the whole genome. Many computational tools have been designed to find the peaks of the signals corresponding to protein binding sites. In this paper, three computational methods used in ChIP-seq data analysis, the ChIP-seq processing pipeline (spp), PeakSeq and CisGenome, are reviewed. We also compare how they agree and disagree in finding peaks, using the publicly available Signal Transducers and Activators of Transcription protein 1 (STAT1) and RNA polymerase II (PolII) datasets with corresponding negative controls.
ChIP-seq analysis; Next-generation sequencing; comparative analysis; bioinformatics
The rapid expansion in the quantity and quality of RNA-Seq data requires the development of sophisticated high-performance bioinformatics tools capable of rapidly transforming this data into meaningful information that is easily interpretable by biologists. Currently available analysis tools are often not easily installed by the general biologist and most of them lack inherent parallel processing capabilities widely recognized as an essential feature of next-generation bioinformatics tools. We present here a user-friendly and fully automated RNA-Seq analysis pipeline (R-SAP) with built-in multi-threading capability to analyze and quantitate high-throughput RNA-Seq datasets. R-SAP follows a hierarchical decision making procedure to accurately characterize various classes of transcripts and achieves a near linear decrease in data processing time as a result of increased multi-threading. In addition, RNA expression level estimates obtained using R-SAP display high concordance with levels measured by microarrays.
Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is rapidly replacing chromatin immunoprecipitation combined with genome-wide tiling array analysis (ChIP-chip) as the preferred approach for mapping transcription-factor binding sites and chromatin modifications. The state of the art for analyzing ChIP-seq data relies on using only reads that map uniquely to a relevant reference genome (uni-reads). This can lead to the omission of up to 30% of alignable reads. We describe a general approach for utilizing reads that map to multiple locations on the reference genome (multi-reads). Our approach is based on allocating multi-reads as fractional counts using a weighted alignment scheme. Using human STAT1 and mouse GATA1 ChIP-seq datasets, we illustrate that incorporation of multi-reads significantly increases sequencing depths, leads to detection of novel peaks that are not otherwise identifiable with uni-reads, and improves detection of peaks in mappable regions. We investigate various genome-wide characteristics of peaks detected only by utilization of multi-reads via computational experiments. Overall, peaks from multi-read analysis have similar characteristics to peaks that are identified by uni-reads except that the majority of them reside in segmental duplications. We further validate a number of GATA1 multi-read only peaks by independent quantitative real-time ChIP analysis and identify novel target genes of GATA1. These computational and experimental results establish that multi-reads can be of critical importance for studying transcription factor binding in highly repetitive regions of genomes with ChIP-seq experiments.
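The abstract describes allocating multi-reads as fractional counts but does not give the weighting scheme. The sketch below is a simplified stand-in, not the authors' method: it assumes each multi-read is split across its candidate bins in proportion to the uni-read count already in each bin, plus a pseudocount. The function name, the bin labels and the pseudocount parameter are all hypothetical.

```python
from collections import defaultdict


def allocate_multireads(
    uni_counts: dict[str, int],
    multi_reads: list[list[str]],
    pseudocount: float = 1.0,
) -> dict[str, float]:
    """Add fractional multi-read counts to per-bin uni-read counts.

    Each multi-read is a list of candidate bins. Its single unit of count
    is divided among those bins with weights (uni-read count + pseudocount),
    so the total number of reads is preserved.
    """
    counts: defaultdict[str, float] = defaultdict(
        float, {b: float(n) for b, n in uni_counts.items()}
    )
    for candidates in multi_reads:
        # Weights depend only on uni-reads, so read order does not matter.
        weights = [uni_counts.get(b, 0) + pseudocount for b in candidates]
        total = sum(weights)
        for b, w in zip(candidates, weights):
            counts[b] += w / total
    return dict(counts)


if __name__ == "__main__":
    # Illustrative counts only.
    result = allocate_multireads({"bin1": 3, "bin2": 0}, [["bin1", "bin2"]])
    print(result)
```

Because every multi-read contributes exactly one unit in total, sequencing depth increases by the number of multi-reads, which is the effect on depth that the abstract describes.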
Annotating repetitive regions of genomes experimentally is a challenging task. Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) provides valuable data for characterizing repetitive regions of genomes in terms of transcription factor binding. Although ChIP-seq technology has been maturing, available ChIP-seq analysis methods and software rely on discarding sequence reads that map to multiple locations on the reference genome (multi-reads), thereby missing an opportunity to assess transcription factor binding to highly repetitive regions of genomes. We develop a computational algorithm that takes multi-reads into account in ChIP-seq analysis. We show with computational experiments that multi-reads lead to a significant increase in sequencing depth and to the identification of binding regions that are not identifiable when only reads that map uniquely to the reference genome (uni-reads) are used. In particular, we show that the number of binding regions identified can increase by up to 36%. We support our computational predictions with independent quantitative real-time ChIP validation of binding regions identified only when multi-reads are incorporated in the analysis of a mouse GATA1 ChIP-seq experiment.
High-throughput next generation sequencing (HT-NGS) technologies, which can produce over 100 times more data than the most sophisticated capillary sequencers based on the Sanger method, are currently the hottest topic in human and animal genomics research. With the ongoing development of high-throughput sequencing machines and the advancement of modern bioinformatics tools at an unprecedented pace, the goal of sequencing individual genomes of living organisms at a cost of $1,000 each seems realistically feasible in the near future. In the relatively short time frame since 2005, HT-NGS technologies have been revolutionizing human and animal genome research through analysis of chromatin immunoprecipitation coupled to DNA microarray (ChIP-chip) or sequencing (ChIP-seq), RNA sequencing (RNA-seq), whole genome genotyping, genome-wide structural variation, de novo assembly and re-assembly of genomes, mutation detection and carrier screening, detection of inherited disorders and complex human diseases, DNA library preparation, paired ends and genomic captures, sequencing of the mitochondrial genome, and personal genomics. In this review, we address the important features of HT-NGS, including first generation DNA sequencers, the birth of HT-NGS, second generation HT-NGS platforms, third generation HT-NGS platforms (including the single molecule Heliscope™, SMRT™ and RNAP sequencers, and Nanopore), the Archon Genomics X PRIZE foundation, a comparison of second and third generation HT-NGS platforms, and the applications, advances and future perspectives of sequencing technologies in human and animal genome research.
ChIP-chip; ChIP-seq; De novo assembling; High-throughput next generation sequencing; Personal genomics; Re-sequencing; RNA-seq
Motivation: The avalanche of data arriving since the development of NGS technologies has prompted the need for fast, accurate and easily automated bioinformatic tools capable of dealing with massive datasets. Among the most productive applications of NGS technologies is the sequencing of cellular RNA, known as RNA-Seq. Although RNA-Seq provides a dynamic range similar or superior to that of microarrays at similar or lower cost, the lack of standard and user-friendly pipelines is a bottleneck preventing RNA-Seq from becoming the standard for transcriptome analysis.
Results: In this work we present a pipeline for processing and analyzing RNA-Seq data, that we have named Grape (Grape RNA-Seq Analysis Pipeline Environment). Grape supports raw sequencing reads produced by a variety of technologies, either in FASTA or FASTQ format, or as prealigned reads in SAM/BAM format. A minimal Grape configuration consists of the file location of the raw sequencing reads, the genome of the species and the corresponding gene and transcript annotation.
Grape first runs a set of quality control steps, and then aligns the reads to the genome, a step that is omitted for prealigned read formats. Grape next estimates gene and transcript expression levels, calculates exon inclusion levels and identifies novel transcripts.
Grape can be run on a single computer or in parallel on a computer cluster. It is distributed with specific mapping and quantification tools, but given its modular design, any tool supporting popular data interchange formats can be integrated.
Availability: Grape can be obtained from the Bioinformatics and Genomics website at: http://big.crg.cat/services/grape.
Summary: Methyl-Analyzer is a Python package that analyzes genome-wide DNA methylation data produced by the Methyl-MAPS (methylation mapping analysis by paired-end sequencing) method. Methyl-MAPS is an enzyme-based method that uses both methylation-sensitive and methylation-dependent enzymes, covering >80% of CpG dinucleotides within mammalian genomes. It combines enzyme-based approaches with high-throughput next-generation sequencing technology to provide whole genome DNA methylation profiles. Methyl-Analyzer processes and integrates sequencing reads from methylated and unmethylated compartments and estimates CpG methylation probabilities at single base resolution.
Availability and implementation: Methyl-Analyzer is available at http://github.com/epigenomics/methylmaps. Sample dataset is available for download at http://epigenomicspub.columbia.edu/methylanalyzer_data.html.
Supplementary information: Supplementary data are available at Bioinformatics online.
The ability to assay genome-scale methylation patterns using high-throughput sequencing makes it possible to carry out association studies to determine the relationship between epigenetic variation and phenotype. While bisulfite sequencing can determine a methylome at high resolution, cost inhibits its use in comparative and population studies. MethylSeq, based on sequencing of fragment ends produced by a methylation-sensitive restriction enzyme, is a method for methyltyping (survey of methylation states) and is a site-specific and cost-effective alternative to whole-genome bisulfite sequencing. Despite its advantages, the use of MethylSeq has been restricted by biases in MethylSeq data that complicate the determination of methyltypes. Here we introduce a statistical method, MetMap, that produces corrected site-specific methylation states from MethylSeq experiments and annotates unmethylated islands across the genome. MetMap integrates genome sequence information with experimental data, in a statistically sound and cohesive Bayesian Network. It infers the extent of methylation at individual CGs and across regions, and serves as a framework for comparative methylation analysis within and among species. We validated MetMap's inferences with direct bisulfite sequencing, showing that the methylation status of sites and islands is accurately inferred. We used MetMap to analyze MethylSeq data from four human neutrophil samples, identifying novel, highly unmethylated islands that are invisible to sequence-based annotation strategies. The combination of MethylSeq and MetMap is a powerful and cost-effective tool for determining genome-scale methyltypes suitable for comparative and association studies.
In vertebrates, methylation of cytosine residues in DNA regulates gene activity in concert with proteins that associate with DNA. Large-scale genome-wide comparative studies that seek to link specific methylation patterns to disease will require hundreds or thousands of samples, and thus economical methods that assay genome-wide methylation. One such method is MethylSeq, which samples cytosine methylation at site-specific resolution by high-throughput sequencing of the ends of DNA fragments generated by methylation-sensitive restriction enzymes. MethylSeq's low cost and simplicity of implementation enable its use in large-scale comparative studies, but biases inherent to the method complicate interpretation of the data it produces. Here we present MetMap, a statistical framework that first accounts for the biases in MethylSeq data and then generates an analysis of the data that is suitable for use in comparative studies. We show that MethylSeq and MetMap can be used together to determine methylation profiles across the genome, and to identify novel unmethylated regions that are likely to be involved in gene regulation. The ability to conduct comparative studies of sufficient scale at a reasonable cost promises to reveal new insights into the relationship between cytosine methylation and phenotype.
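To make the idea of inferring a methylation state from fragment-end counts concrete, the following is a minimal sketch of a two-state Bayesian update for a single restriction site. It is not MetMap's Bayesian network, whose structure is not described here. The Poisson read model, the function names, and the default values for `rate_unmeth`, `rate_meth` and `prior_unmeth` are all assumptions chosen for illustration.

```python
import math


def poisson_pmf(k: int, rate: float) -> float:
    """Probability of observing k events under a Poisson model with the given rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def posterior_unmethylated(
    read_count: int,
    rate_unmeth: float = 8.0,   # assumed mean fragment-end count at a cut (unmethylated) site
    rate_meth: float = 0.5,     # assumed background count at an uncut (methylated) site
    prior_unmeth: float = 0.3,  # assumed prior probability that a site is unmethylated
) -> float:
    """Posterior probability that a site is unmethylated, given its fragment-end count.

    A methylation-sensitive enzyme cuts only unmethylated sites, so many
    fragment ends at a site are evidence that it is unmethylated.
    """
    weight_unmeth = poisson_pmf(read_count, rate_unmeth) * prior_unmeth
    weight_meth = poisson_pmf(read_count, rate_meth) * (1.0 - prior_unmeth)
    return weight_unmeth / (weight_unmeth + weight_meth)


if __name__ == "__main__":
    for count in (0, 2, 5, 10):
        print(count, round(posterior_unmethylated(count), 4))
```

In this toy model the posterior rises monotonically with the read count. A real analysis would also have to model the method-specific biases mentioned above, which this sketch ignores.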
Cancer cells undergo massive alterations to their DNA methylation patterns that result in aberrant gene expression and malignant phenotypes. However, the mechanisms that underlie methylome changes are not well understood, nor is the genomic distribution of DNA methylation changes well characterized.
Here, we performed methylated DNA immunoprecipitation combined with high-throughput sequencing (MeDIP-seq) to obtain whole-genome DNA methylation profiles for eight human breast cancer cell (BCC) lines and for normal human mammary epithelial cells (HMEC). The MeDIP-seq analysis generated non-biased DNA methylation maps by covering almost the entire genome with sufficient depth and resolution. The most prominent feature of the BCC lines compared to HMEC was a massively reduced methylation level particularly in CpG-poor regions. While hypomethylation did not appear to be associated with particular genomic features, hypermethylation preferentially occurred at CpG-rich gene-related regions independently of the distance from transcription start sites. We also investigated methylome alterations during epithelial-to-mesenchymal transition (EMT) in MCF7 cells. EMT induction was associated with specific alterations to the methylation patterns of gene-related CpG-rich regions, although overall methylation levels were not significantly altered. Moreover, approximately 40% of the epithelial cell-specific methylation patterns in gene-related regions were altered to those typical of mesenchymal cells, suggesting a cell-type specific regulation of DNA methylation.
This study provides the most comprehensive analysis to date of the methylome of human mammary cell lines and has produced novel insights into the mechanisms of methylome alteration during tumorigenesis and the interdependence between DNA methylome alterations and morphological changes.
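As an illustration of how hypo- and hypermethylation relative to a reference sample might be scored from enrichment-based sequencing data, the sketch below compares depth-normalized read counts in fixed genomic bins. It is not the analysis used in this study. The binning, the reads-per-million normalization, the pseudocount, and the function names are assumptions made for the example.

```python
import math
from typing import List, Sequence


def reads_per_million(bin_counts: Sequence[int]) -> List[float]:
    """Scale per-bin read counts so that each sample sums to one million."""
    total = sum(bin_counts)
    if total == 0:
        raise ValueError("sample has no reads")
    return [count * 1e6 / total for count in bin_counts]


def log2_fold_change(
    sample_counts: Sequence[int],
    reference_counts: Sequence[int],
    pseudocount: float = 1.0,  # assumed value; avoids division by zero in empty bins
) -> List[float]:
    """Per-bin log2 ratio of normalized coverage, sample versus reference.

    Negative values suggest lower enrichment (hypomethylation) in the sample,
    positive values suggest higher enrichment (hypermethylation).
    """
    if len(sample_counts) != len(reference_counts):
        raise ValueError("both samples must use the same bins")
    sample_rpm = reads_per_million(sample_counts)
    reference_rpm = reads_per_million(reference_counts)
    return [
        math.log2((s + pseudocount) / (r + pseudocount))
        for s, r in zip(sample_rpm, reference_rpm)
    ]


if __name__ == "__main__":
    # Hypothetical counts for four bins in a cancer cell line and in HMEC.
    print(log2_fold_change([5, 40, 0, 55], [25, 25, 25, 25]))
```

Enrichment-based read counts also depend on local CpG density, so a ratio like this is only a first-pass signal and not a calibrated methylation level.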
Summary: We present an R-based pipeline, ArrayExpressHTS, for pre-processing, expression estimation and data quality assessment of high-throughput sequencing transcriptional profiling (RNA-seq) datasets. The pipeline starts from raw sequence files and produces standard Bioconductor R objects containing gene or transcript measurements for downstream analysis, along with web reports for data quality assessment. It may be run locally on a user's own computer or remotely on a distributed R-cloud farm at the European Bioinformatics Institute. It can be used to analyse a user's own datasets or public RNA-seq datasets from the ArrayExpress Archive.
Availability: The R package is available at www.ebi.ac.uk/tools/rcloud with online documentation at www.ebi.ac.uk/Tools/rwiki/, also available as supplementary material.
Supplementary information: Supplementary data are available at Bioinformatics online.
A method for de novo genome annotation using high-throughput cDNA sequencing data.
Next generation technologies enable massive-scale cDNA sequencing (so-called RNA-Seq). Mainly because of the difficulty of aligning short reads across exon-exon junctions, no attempts have been made so far to use RNA-Seq for building gene models de novo, that is, in the absence of a set of known genes and/or splicing events. We present G-Mo.R-Se (Gene Modelling using RNA-Seq), an approach aimed at building gene models directly from RNA-Seq data, and demonstrate its utility on the grapevine genome.
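One plausible first step when building gene models from read alignments alone is to collect contiguous stretches of the genome that are covered by reads, which can then serve as candidate exons. The sketch below shows that step only. The abstract does not describe G-Mo.R-Se's actual procedure, so the function name, the `min_depth` threshold, and the half-open interval convention are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def covered_regions(depth: Sequence[int], min_depth: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals where per-base depth >= min_depth.

    `depth` holds the read depth at each position of one sequence, and
    positions are 0-based.
    """
    regions: List[Tuple[int, int]] = []
    start = None  # start of the region currently being extended, if any
    for pos, d in enumerate(depth):
        if d >= min_depth:
            if start is None:
                start = pos
        elif start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        # The last region runs to the end of the sequence.
        regions.append((start, len(depth)))
    return regions


if __name__ == "__main__":
    toy_depth = [0, 3, 4, 0, 0, 2, 2, 2]
    print(covered_regions(toy_depth))               # [(1, 3), (5, 8)]
    print(covered_regions(toy_depth, min_depth=3))  # [(1, 3)]
```

Turning such regions into gene models would still require identifying splice junctions between them, which is the difficulty the text highlights and which this sketch does not attempt.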