Related Articles

1.  A Scalable, Flexible Workflow for MethylCap-Seq Data Analysis 
Advances in whole genome profiling have revolutionized the cancer research field, but at the same time have raised new bioinformatics challenges. For next generation sequencing (NGS), these include data storage, computational costs, sequence processing and alignment, delineating appropriate statistical measures, and data visualization. The NGS application MethylCap-seq involves the in vitro capture of methylated DNA and subsequent analysis of enriched fragments by massively parallel sequencing. Here, we present a scalable, flexible workflow for MethylCap-seq Quality Control, secondary data analysis, tertiary analysis of multiple experimental groups, and data visualization. This workflow and its suite of features will assist biologists in conducting methylation profiling projects and facilitate meaningful biological interpretation.
doi:10.1109/GENSiPS.2011.6169426
PMCID: PMC3320741  PMID: 22484542
next generation sequencing; DNA methylation; epigenetics; cancer; data analysis; data visualization
2.  Genome-wide DNA methylation profiling of non-small cell lung carcinomas 
Background
Non-small cell lung carcinoma (NSCLC) is a complex malignancy that owing to its heterogeneity and poor prognosis poses many challenges to diagnosis, prognosis and patient treatment. DNA methylation is an important mechanism of epigenetic regulation involved in normal development and cancer. It is a very stable and specific modification and therefore in principle a very suitable marker for epigenetic phenotyping of tumors. Here we present a genome-wide DNA methylation analysis of NSCLC samples and paired lung tissues, where we combine MethylCap and next generation sequencing (MethylCap-seq) to provide comprehensive DNA methylation maps of the tumor and paired lung samples. The MethylCap-seq data were validated by bisulfite sequencing and methyl-specific polymerase chain reaction of selected regions.
Results
Analysis of the MethylCap-seq data revealed a strong positive correlation between replicate experiments and between paired tumor/lung samples. We identified 57 differentially methylated regions (DMRs) present in all NSCLC tumors analyzed by MethylCap-seq. While hypomethylated DMRs did not correlate to any particular functional category of genes, the hypermethylated DMRs were strongly associated with genes encoding transcriptional regulators. Furthermore, subtelomeric regions and satellite repeats were hypomethylated in the NSCLC samples. We also identified DMRs that were specific to two of the major subtypes of NSCLC, adenocarcinomas and squamous cell carcinomas.
Conclusions
Collectively, we provide a resource containing genome-wide DNA methylation maps of NSCLC and their paired lung tissues, and comprehensive lists of known and novel DMRs and associated genes in NSCLC.
doi:10.1186/1756-8935-5-9
PMCID: PMC3407794  PMID: 22726460
DNA Methylation; Epigenetics; MethylCap; Next generation sequencing; Non-small cell lung Cancer
3.  Enrichment-based DNA methylation analysis using next-generation sequencing: sample exclusion, estimating changes in global methylation, and the contribution of replicate lanes 
BMC Genomics  2012;13(Suppl 8):S6.
Background
DNA methylation is an important epigenetic mark and dysregulation of DNA methylation is associated with many diseases including cancer. Advances in next-generation sequencing now allow unbiased methylome profiling of entire patient cohorts, greatly facilitating biomarker discovery and presenting new opportunities to understand the biological mechanisms by which changes in methylation contribute to disease. Enrichment-based sequencing assays such as MethylCap-seq are a cost effective solution for genome-wide determination of methylation status, but the technical reliability of methylation reconstruction from raw sequencing data has not been well characterized.
Methods
We analyze three MethylCap-seq data sets and perform two different analyses to assess data quality. First, we investigate how data quality is affected by excluding samples that do not meet quality control cutoff requirements. Second, we consider the effect of additional reads on enrichment score, saturation, and coverage. Lastly, we verify a method for the determination of the global amount of methylation from MethylCap-seq data by comparing to a spiked-in control DNA of known methylation status.
Results
We show that rejection of samples based on our quality control parameters leads to a significant improvement of methylation calling. Additional reads beyond ~13 million unique aligned reads improved coverage, modestly improved saturation, and did not impact enrichment score. Lastly, we find that a global methylation indicator calculated from MethylCap-seq data correlates well with the global methylation level of a sample as obtained from a spike-in DNA of known methylation level.
Conclusions
We show that with appropriate quality control MethylCap-seq is a reliable tool, suitable for cohorts of hundreds of patients, that provides reproducible methylation information on a feature by feature basis as well as information about the global level of methylation.
doi:10.1186/1471-2164-13-S8-S6
PMCID: PMC3535705  PMID: 23281662
4.  Global Analysis of DNA Methylation by Methyl-Capture Sequencing Reveals Epigenetic Control of Cisplatin Resistance in Ovarian Cancer Cell 
PLoS ONE  2011;6(12):e29450.
Cisplatin resistance is one of the major reasons leading to the high death rate of ovarian cancer. Methyl-Capture sequencing (MethylCap-seq), which combines precipitation of methylated DNA by the recombinant methyl-CpG binding domain of MBD2 protein with NGS, provides a global and unbiased analysis of DNA methylation patterns. We applied MethylCap-seq to analyze the genome-wide DNA methylation profile of the cisplatin-sensitive ovarian cancer cell line A2780 and its isogenic derivative resistant line A2780CP. We obtained 21,763,035 raw reads for the drug-resistant cell line A2780CP and 18,821,061 reads for the sensitive cell line A2780. We identified 1224 hypermethylated and 1216 hypomethylated DMRs (differentially methylated regions) in A2780CP compared to A2780. Our MethylCap-seq data on this ovarian cancer cisplatin-resistant model provided a good resource for the research community. We also found that A2780CP, compared to A2780, has lower observed to expected methylated CpG ratios, suggesting a lower global CpG methylation in A2780CP cells. Methylation specific PCR and bisulfite sequencing confirmed hypermethylation of PTK6, PRKCE and BCL2L1 in A2780 compared with A2780CP. Furthermore, treatment with the demethylation reagent 5-aza-dC in A2780 cells demethylated the promoters and restored the expression of PTK6, PRKCE and BCL2L1.
doi:10.1371/journal.pone.0029450
PMCID: PMC3245283  PMID: 22216282
5.  Genome-wide mapping of DNA methylation: a quantitative technology comparison 
Nature biotechnology  2010;28(10):1106-1114.
DNA methylation is a key component of mammalian gene regulation and the most classical example of an epigenetic mark. DNA methylation patterns are mitotically heritable and stable over time, but they undergo considerable changes in response to cell differentiation, diseases and environmental influences. Several methods have been developed for DNA methylation profiling on a genomic scale. Here, we benchmark four of these methods on two sample pairs, comparing their accuracy and power to detect DNA methylation differences. The results show that all evaluated methods (MeDIP-seq: methylated DNA immunoprecipitation, MethylCap-seq: methylated DNA capture by affinity purification, RRBS: reduced representation bisulfite sequencing, and the Infinium HumanMethylation27 assay) produce accurate DNA methylation data. However, these methods differ in their ability to detect differentially methylated regions between pairs of samples. We highlight strengths and weaknesses of the four methods and give practical recommendations for the design of epigenomic case-control studies.
doi:10.1038/nbt.1681
PMCID: PMC3066564  PMID: 20852634
Epigenome profiling; epigenetics; sequencing; differentially methylated regions; molecular diagnostics; biomarker discovery; cancer
6.  Comparative genome-wide DNA methylation analysis of colorectal tumor and matched normal tissues 
Epigenetics  2012;7(12):1355-1367.
Aberrant DNA methylation often occurs in colorectal cancer (CRC). In our study we applied a genome-wide DNA methylation analysis approach, MethylCap-seq, to map the differentially methylated regions (DMRs) in 24 tumors and matched normal colon samples. In total, 2687 frequently hypermethylated and 468 frequently hypomethylated regions were identified, which include potential biomarkers for CRC diagnosis. Hypermethylation in the tumor samples was enriched at CpG islands and gene promoters, while hypomethylation was distributed throughout the genome. Using epigenetic data from human embryonic stem cells, we show that frequently hypermethylated regions coincide with bivalent loci in human embryonic stem cells. DNA methylation is commonly thought to lead to gene silencing; however, integration of publicly available gene expression data indicates that 75% of the frequently hypermethylated genes were most likely already lowly or not expressed in normal tissue. Collectively, our study provides genome-wide DNA methylation maps of CRC and comprehensive lists of DMRs, and gives insights into the role of aberrant DNA methylation in CRC formation.
doi:10.4161/epi.22562
PMCID: PMC3528691  PMID: 23079744
DNA methylation; colorectal cancer; biomarkers; H3K27me3; gene expression; Illumina sequencing
7.  Integrated analysis of genome-wide DNA methylation and gene expression profiles in molecular subtypes of breast cancer 
Nucleic Acids Research  2013;41(18):8464-8474.
Aberrant DNA methylation of CpG islands, CpG island shores and first exons is known to play a key role in the altered gene expression patterns in all human cancers. To date, a systematic study on the effect of DNA methylation on gene expression using high resolution data has not been reported. In this study, we conducted an integrated analysis of MethylCap-sequencing data and Affymetrix gene expression microarray data for 30 breast cancer cell lines representing different breast tumor phenotypes. As well-developed methods for the integrated analysis do not currently exist, we created a series of four different analysis methods. On the computational side, our goal is to develop methylome data analysis protocols for the integrated analysis of DNA methylation and gene expression data on the genome scale. On the cancer biology side, we present comprehensive genome-wide methylome analysis results for differentially methylated regions and their potential effect on gene expression in 30 breast cancer cell lines representing three molecular phenotypes, luminal, basal A and basal B. Our integrated analysis demonstrates that methylation status of different genomic regions may play a key role in establishing transcriptional patterns in molecular subtypes of human breast cancer.
doi:10.1093/nar/gkt643
PMCID: PMC3794600  PMID: 23887935
8.  Comprehensive methylome analysis of ovarian tumors reveals hedgehog signaling pathway regulators as prognostic DNA methylation biomarkers 
Epigenetics  2013;8(6):624-634.
Women with advanced stage ovarian cancer (OC) have a five-year survival rate of less than 25%. OC progression is associated with accumulation of epigenetic alterations and aberrant DNA methylation in gene promoters acts as an inactivating "hit" during OC initiation and progression. Abnormal DNA methylation in OC has been used to predict disease outcome and therapy response. To globally examine DNA methylation in OC, we used next-generation sequencing technology, MethylCap-sequencing, to screen 75 malignant and 26 normal or benign ovarian tissues. Differential DNA methylation regions (DMRs) were identified, and the Kaplan-Meier method and Cox proportional hazard model were used to correlate methylation with clinical endpoints. Functional role of specific genes identified by MethylCap-sequencing was examined in in vitro assays. We identified 577 DMRs that distinguished (p < 0.001) malignant from non-malignant ovarian tissues; of these, 63 DMRs correlated (p < 0.001) with poor progression free survival (PFS). Concordant hypermethylation and corresponding gene silencing of sonic hedgehog pathway members ZIC1 and ZIC4 in OC tumors was confirmed in a panel of OC cell lines, and ZIC1 and ZIC4 repression correlated with increased proliferation, migration and invasion. ZIC1 promoter hypermethylation correlated (p < 0.01) with poor PFS. In summary, we identified functional DNA methylation biomarkers significantly associated with clinical outcome in OC and suggest our comprehensive methylome analysis has significant translational potential for guiding the design of future clinical investigations targeting the OC epigenome. Methylation of ZIC1, a putative tumor suppressor, may be a novel determinant of OC outcome.
doi:10.4161/epi.24816
PMCID: PMC3857342  PMID: 23774800
DNA methylation; Hedgehog pathway; ZIC1; ZIC4; ovarian cancer
9.  Methylcap-Seq Reveals Novel DNA Methylation Markers for the Diagnosis and Recurrence Prediction of Bladder Cancer in a Chinese Population 
PLoS ONE  2012;7(4):e35175.
Purpose
There is a need to supplement or supplant the conventional diagnostic tools, namely, cystoscopy and B-type ultrasound, for bladder cancer (BC). We aimed to identify novel DNA methylation markers for BC through genome-wide profiling of BC cell lines and subsequent methylation-specific PCR (MSP) screening of clinical urine samples.
Experimental Design
The methyl-DNA binding domain (MBD) capture technique, methylCap/seq, was performed to screen for specific hypermethylated CpG islands in two BC cell lines (5637 and T24). The top one hundred hypermethylated targets were sequentially screened by MSP in urine samples to gradually narrow the target number and optimize the composition of the diagnostic panel. The diagnostic performance of the obtained panel was evaluated in different clinical scenarios.
Results
A total of 1,627 hypermethylated promoter targets in the BC cell lines were identified by Illumina sequencing. The top 104 hypermethylated targets were reduced to eight genes (VAX1, KCNV1, ECEL1, TMEM26, TAL1, PROX1, SLC6A20, and LMX1A) after the urine DNA screening in a small sample of 8 normal control and 18 BC subjects. Validation in an independent sample of 212 BC patients enabled the optimization of five methylation targets, including VAX1, KCNV1, TAL1, PPOX1, and CFTR, which was obtained in our previous study, for BC diagnosis with a sensitivity and specificity of 88.68% and 87.25%, respectively. In addition, the methylation of VAX1 and LMX1A was found to be associated with BC recurrence.
Conclusions
We identified a promising diagnostic marker panel for early non-invasive detection and subsequent BC surveillance.
doi:10.1371/journal.pone.0035175
PMCID: PMC3328468  PMID: 22529986
10.  Differential Programming of B Cells in AID Deficient Mice 
PLoS ONE  2013;8(7):e69815.
The Aicda locus encodes the activation induced cytidine deaminase (AID) and is highly expressed in germinal center (GC) B cells to initiate somatic hypermutation (SHM) and class switch recombination (CSR) of immunoglobulin (Ig) genes. Besides these Ig specific activities in B cells, AID has been implicated in active DNA demethylation in non-B cell systems. We here determined a potential role of AID as an epigenetic eraser and transcriptional regulator in B cells. RNA-Seq on different B cell subsets revealed that Aicda−/− B cells are developmentally affected. However as shown by RNA-Seq, MethylCap-Seq, and SNP analysis these transcriptome alterations may not relate to AID, but alternatively to a CBA mouse strain derived region around the targeted Aicda locus. These unexpected confounding parameters provide alternative, AID-independent interpretations on genotype-phenotype correlations previously reported in numerous studies on AID using the Aicda−/− mouse strain.
doi:10.1371/journal.pone.0069815
PMCID: PMC3726761  PMID: 23922811
11.  Fast and accurate read alignment for resequencing 
Bioinformatics  2012;28(18):2366-2373.
Motivation: Next-generation sequence analysis has become an important task both in laboratory and clinical settings. A key stage in the majority of sequence analysis workflows, such as resequencing, is the alignment of genomic reads to a reference genome. The accurate alignment of reads with large indels is a computationally challenging task for researchers.
Results: We introduce SeqAlto as a new algorithm for read alignment. For reads longer than or equal to 100 bp, SeqAlto is up to 10× faster than existing algorithms, while retaining high accuracy and the ability to align reads with large (up to 50 bp) indels. This improvement in efficiency is particularly important in the analysis of future sequencing data where the number of reads approaches many billions. Furthermore, SeqAlto uses less than 8 GB of memory to align against the human genome. SeqAlto is benchmarked against several existing tools with both real and simulated data.
Availability: Linux and Mac OS X binaries free for academic use are available at http://www.stanford.edu/group/wonglab/seqalto
Contact: whwong@stanford.edu
doi:10.1093/bioinformatics/bts450
PMCID: PMC3436849  PMID: 22811546
12.  A Novel Approach for Transcription Factor Analysis Using SELEX with High-Throughput Sequencing (TFAST) 
PLoS ONE  2012;7(8):e42761.
Background
In previous work, we designed a modified aptamer-free SELEX-seq protocol (afSELEX-seq) for the discovery of transcription factor binding sites. Here, we present original software, TFAST, designed to analyze afSELEX-seq data, validated against our previously generated afSELEX-seq dataset and a model dataset. TFAST is designed with a simple graphical interface (Java) so that it can be installed and executed without extensive expertise in bioinformatics. TFAST completes analysis within minutes on most personal computers.
Methodology
Once afSELEX-seq data are aligned to a target genome, TFAST identifies peaks and, uniquely, compares peak characteristics between cycles. TFAST generates a hierarchical report of graded peaks, their associated genomic sequences, binding site length predictions, and dummy sequences.
Principal Findings
Including additional cycles of afSELEX-seq improved TFAST's ability to selectively identify peaks, leading to 7,274, 4,255, and 2,628 peaks identified in two-, three-, and four-cycle afSELEX-seq. Inter-round analysis by TFAST identified 457 peaks as the strongest candidates for true binding sites. Separating peaks by TFAST into classes of worst, second-best and best candidate peaks revealed a trend of increasing significance (e-values 4.5×10^−12, 2.9×10^−46, and 1.2×10^−73) and informational content (11.0, 11.9, and 12.5 bits over 15 bp) of discovered motifs within each respective class. TFAST also predicted a binding site length (28 bp) consistent with non-computational experimentally derived results for the transcription factor PapX (22 to 29 bp).
Conclusions/Significance
TFAST offers a novel and intuitive approach for determining DNA binding sites of proteins subjected to afSELEX-seq. Here, we demonstrate that TFAST, using afSELEX-seq data, rapidly and accurately predicted sequence length and motif for a putative transcription factor's binding site.
doi:10.1371/journal.pone.0042761
PMCID: PMC3430675  PMID: 22956994
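The "bits over 15 bp" figures quoted in the TFAST abstract are motif information content. As a rough illustration only, the sketch below computes the standard Shannon information content of a DNA motif from per-position base counts (2 bits minus the column entropy, summed over positions). It is not TFAST's code: the abstract does not say how the values were obtained, and the function name, input layout, and the absence of any small-sample or background correction are all assumptions made for this example.

```python
import math


def information_content(columns):
    """Total information content (bits) of a DNA motif.

    `columns` is a hypothetical input layout: one dict per motif position,
    mapping a base ("A", "C", "G", "T") to its observed count.
    Assumes a uniform background and applies no small-sample correction.
    """
    total_bits = 0.0
    for counts in columns:
        n = sum(counts.values())
        if n == 0:
            continue  # an empty column contributes nothing
        entropy = 0.0
        for count in counts.values():
            if count > 0:
                p = count / n
                entropy -= p * math.log2(p)
        # A fully conserved DNA position carries log2(4) = 2 bits.
        total_bits += 2.0 - entropy
    return total_bits


# Three positions: fully conserved, two equally likely bases, uniform.
motif = [
    {"A": 10},
    {"A": 5, "C": 5},
    {"A": 1, "C": 1, "G": 1, "T": 1},
]
print(information_content(motif))
```

Under this definition a 15 bp motif can carry at most 30 bits, so values of 11.0 to 12.5 bits correspond to partially conserved positions.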
13.  PRI-CAT: a web-tool for the analysis, storage and visualization of plant ChIP-seq experiments 
Nucleic Acids Research  2011;39(Web Server issue):W524-W527.
Although several tools for the analysis of ChIP-seq data have been published recently, there is a growing demand, in particular in the plant research community, for computational resources with which such data can be processed, analyzed, stored, visualized and integrated within a single, user-friendly environment. To accommodate this demand, we have developed PRI-CAT (Plant Research International ChIP-seq analysis tool), a web-based workflow tool for the management and analysis of ChIP-seq experiments. PRI-CAT is currently focused on Arabidopsis, but will be extended with other plant species in the near future. Users can directly submit their sequencing data to PRI-CAT for automated analysis. A QuickLoad server compatible with genome browsers is implemented for the storage and visualization of DNA-binding maps. Submitted datasets and results can be made publicly available through PRI-CAT, a feature that will enable community-based integrative analysis and visualization of ChIP-seq experiments. Secondary analysis of data can be performed with the aid of GALAXY, an external framework for tool and data integration. PRI-CAT is freely available at http://www.ab.wur.nl/pricat. No login is required.
doi:10.1093/nar/gkr373
PMCID: PMC3125775  PMID: 21609962
14.  Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications 
Bioinformatics  2011;27(11):1571-1572.
Summary: A combination of bisulfite treatment of DNA and high-throughput sequencing (BS-Seq) can capture a snapshot of a cell's epigenomic state by revealing its genome-wide cytosine methylation at single base resolution. Bismark is a flexible tool for the time-efficient analysis of BS-Seq data which performs both read mapping and methylation calling in a single convenient step. Its output discriminates between cytosines in CpG, CHG and CHH context and enables bench scientists to visualize and interpret their methylation data soon after the sequencing run is completed.
Availability and implementation: Bismark is released under the GNU GPLv3+ licence. The source code is freely available from www.bioinformatics.bbsrc.ac.uk/projects/bismark/.
Contact: felix.krueger@bbsrc.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr167
PMCID: PMC3102221  PMID: 21493656
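The CpG, CHG and CHH contexts named in the Bismark abstract are defined by the one or two bases downstream of a cytosine, where H stands for A, C or T. The sketch below shows that classification for the forward strand of a plain sequence string. It is a minimal illustration and not Bismark's implementation; the function name and the "unknown" label for cytosines at the sequence end or next to ambiguous bases are assumptions made for this example.

```python
def cytosine_contexts(seq):
    """Return (position, context) for every forward-strand cytosine in `seq`.

    Context is "CpG", "CHG" or "CHH" (H = A, C or T). Cytosines too close to
    the end of the sequence, or followed by an ambiguous base such as N, are
    labelled "unknown" (a convention chosen for this sketch).
    """
    seq = seq.upper()
    contexts = []
    for i, base in enumerate(seq):
        if base != "C":
            continue
        downstream = seq[i + 1 : i + 3]  # up to two bases after the cytosine
        if len(downstream) >= 1 and downstream[0] == "G":
            context = "CpG"
        elif len(downstream) == 2 and downstream[0] in "ACT" and downstream[1] == "G":
            context = "CHG"
        elif len(downstream) == 2 and downstream[0] in "ACT" and downstream[1] in "ACT":
            context = "CHH"
        else:
            context = "unknown"
        contexts.append((i, context))
    return contexts


print(cytosine_contexts("ACGTCAGCTTC"))
# [(1, 'CpG'), (4, 'CHG'), (7, 'CHH'), (10, 'unknown')]
```

Reverse-strand cytosines can be handled the same way by running the function on the reverse complement of the sequence.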
15.  Technical Considerations for Reduced Representation Bisulfite Sequencing with Multiplexed Libraries 
Reduced representation bisulfite sequencing (RRBS), which couples bisulfite conversion and next generation sequencing, is an innovative method that specifically enriches genomic regions with a high density of potential methylation sites and enables investigation of DNA methylation at single-nucleotide resolution. Recent advances in the Illumina DNA sample preparation protocol and sequencing technology have vastly improved sequencing throughput capacity. Although the new Illumina technology is now widely used, the unique challenges associated with multiplexed RRBS libraries on this platform have not been previously described. We have made modifications to the RRBS library preparation protocol to sequence multiplexed libraries on a single flow cell lane of the Illumina HiSeq 2000. Furthermore, our analysis incorporates a bioinformatics pipeline specifically designed to process bisulfite-converted sequencing reads and evaluate the output and quality of the sequencing data generated from the multiplexed libraries. We obtained an average of 42 million paired-end reads per sample for each flow-cell lane, with a high unique mapping efficiency to the reference human genome. Here we provide a roadmap of modifications, strategies, and troubleshooting approaches we implemented to optimize sequencing of multiplexed libraries on an RRBS background.
doi:10.1155/2012/741542
PMCID: PMC3495292  PMID: 23193365
16.  SeqBuster, a bioinformatic tool for the processing and analysis of small RNAs datasets, reveals ubiquitous miRNA modifications in human embryonic cells 
Nucleic Acids Research  2009;38(5):e34.
High-throughput sequencing technologies enable direct approaches to catalog and analyze snapshots of the total small RNA content of living cells. Characterization of high-throughput sequencing data requires bioinformatic tools offering a wide perspective of the small RNA transcriptome. Here we present SeqBuster, a highly versatile and reliable web-based toolkit to process and analyze large-scale small RNA datasets. The high flexibility of this tool is illustrated by the multiple choices offered in the pre-analysis for mapping purposes and in the different analysis modules for data manipulation. To overcome the storage capacity limitations of the web-based tool, SeqBuster offers a stand-alone version that permits the annotation against any custom database. SeqBuster integrates multiple analyses modules in a unique platform and constitutes the first bioinformatic tool offering a deep characterization of miRNA variants (isomiRs). The application of SeqBuster to small-RNA datasets of human embryonic stem cells revealed that most miRNAs present different types of isomiRs, some of them being associated to stem cell differentiation. The exhaustive description of the isomiRs provided by SeqBuster could help to identify miRNA-variants that are relevant in physiological and pathological processes. SeqBuster is available at http://estivill_lab.crg.es/seqbuster.
doi:10.1093/nar/gkp1127
PMCID: PMC2836562  PMID: 20008100
17.  SAMMate: a GUI tool for processing short read alignments in SAM/BAM format 
Background
Next Generation Sequencing (NGS) technology generates tens of millions of short reads for each DNA/RNA sample. A key step in NGS data analysis is the short read alignment of the generated sequences to a reference genome. Although storing alignment information in the Sequence Alignment/Map (SAM) or Binary SAM (BAM) format is now standard, biomedical researchers still have difficulty accessing this information.
Results
We have developed a Graphical User Interface (GUI) software tool named SAMMate. SAMMate allows biomedical researchers to quickly process SAM/BAM files and is compatible with both single-end and paired-end sequencing technologies. SAMMate also automates some standard procedures in DNA-seq and RNA-seq data analysis. Using either standard or customized annotation files, SAMMate allows users to accurately calculate the short read coverage of genomic intervals. In particular, for RNA-seq data SAMMate can accurately calculate the gene expression abundance scores for customized genomic intervals using short reads originating from both exons and exon-exon junctions. Furthermore, SAMMate can quickly calculate a whole-genome signal map at base-wise resolution allowing researchers to solve an array of bioinformatics problems. Finally, SAMMate can export both a wiggle file for alignment visualization in the UCSC genome browser and an alignment statistics report. The biological impact of these features is demonstrated via several case studies that predict miRNA targets using short read alignment information files.
Conclusions
With just a few mouse clicks, SAMMate will provide biomedical researchers easy access to important alignment information stored in SAM/BAM files. Our software is constantly updated and will greatly facilitate the downstream analysis of NGS data. Both the source code and the GUI executable are freely available under the GNU General Public License at http://sammate.sourceforge.net.
doi:10.1186/1751-0473-6-2
PMCID: PMC3027120  PMID: 21232146
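The base-wise signal map and interval coverage described in the SAMMate abstract can be pictured with a small sketch. The code below assumes reads have already been reduced to 0-based, half-open (start, end) intervals on a single reference sequence, and builds a per-base depth array with a difference array. It is not SAMMate's code, it does not parse SAM/BAM files, and the function names and input layout are hypothetical.

```python
def basewise_coverage(reads, length):
    """Per-base read depth over a reference of `length` bases.

    `reads` is an iterable of (start, end) pairs, 0-based and half-open.
    Intervals are clipped to the reference; empty intervals are ignored.
    """
    diff = [0] * (length + 1)
    for start, end in reads:
        start = max(start, 0)
        end = min(end, length)
        if start < end:
            diff[start] += 1  # depth rises where the read begins
            diff[end] -= 1    # and falls just past its last base
    coverage = []
    depth = 0
    for i in range(length):
        depth += diff[i]
        coverage.append(depth)
    return coverage


def mean_coverage(coverage, start, end):
    """Mean depth over the half-open genomic interval [start, end)."""
    window = coverage[start:end]
    return sum(window) / len(window) if window else 0.0


reads = [(0, 4), (2, 6), (5, 8)]
cov = basewise_coverage(reads, 8)
print(cov)                       # [1, 1, 2, 2, 1, 2, 1, 1]
print(mean_coverage(cov, 2, 6))  # 1.75
```

The difference array keeps the cost proportional to the number of reads plus the reference length, rather than to the total number of aligned bases.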
18.  iMir: An integrated pipeline for high-throughput analysis of small non-coding RNA data obtained by smallRNA-Seq 
BMC Bioinformatics  2013;14:362.
Background
Qualitative and quantitative analysis of small non-coding RNAs by next generation sequencing (smallRNA-Seq) represents a novel technology increasingly used to investigate with high sensitivity and specificity RNA population comprising microRNAs and other regulatory small transcripts. Analysis of smallRNA-Seq data to gather biologically relevant information, i.e. detection and differential expression analysis of known and novel non-coding RNAs, target prediction, etc., requires implementation of multiple statistical and bioinformatics tools from different sources, each focusing on a specific step of the analysis pipeline. As a consequence, the analytical workflow is slowed down by the need for continuous interventions by the operator, a critical factor when large numbers of datasets need to be analyzed at once.
Results
We designed a novel modular pipeline (iMir) for comprehensive analysis of smallRNA-Seq data, comprising specific tools for adapter trimming, quality filtering, differential expression analysis, biological target prediction and other useful options by integrating multiple open source modules and resources in an automated workflow. As statistics is crucial in deep-sequencing data analysis, we devised and integrated in iMir tools based on different statistical approaches to allow the operator to analyze data rigorously. The pipeline created here proved to be more efficient and time-saving than currently available methods and, in addition, flexible enough to allow the user to select the preferred combination of analytical steps. We present here the results obtained by applying this pipeline to analyze simultaneously 6 smallRNA-Seq datasets from either exponentially growing or growth-arrested human breast cancer MCF-7 cells, which led to the rapid and accurate identification, quantitation and differential expression analysis of ~450 miRNAs, including several novel miRNAs and isomiRs, as well as identification of the putative mRNA targets of differentially expressed miRNAs. In addition, iMir also allowed the identification of ~70 piRNAs (piwi-interacting RNAs), some of which were differentially expressed in proliferating vs growth-arrested cells.
Conclusion
The integrated data analysis pipeline described here is based on a reliable, flexible and fully automated workflow, useful to rapidly and efficiently analyze high-throughput smallRNA-Seq data, such as those produced by the most recent high-performance next generation sequencers. iMir is available at http://www.labmedmolge.unisa.it/inglese/research/imir.
doi:10.1186/1471-2105-14-362
PMCID: PMC3878829  PMID: 24330401
Next generation sequencing; SmallRNA-Seq; Data analysis pipeline; Breast cancer; Small non-coding RNA; microRNA; Piwi-interacting RNA
19.  MOABS: model based analysis of bisulfite sequencing data 
Genome Biology  2014;15(2):R38.
Bisulfite sequencing (BS-seq) is the gold standard for studying genome-wide DNA methylation. We developed MOABS to increase the speed, accuracy, statistical power and biological relevance of BS-seq data analysis. MOABS detects differential methylation with 10-fold coverage at single-CpG resolution based on a Beta-Binomial hierarchical model and is capable of processing two billion reads in 24 CPU hours. Here, using simulated and real BS-seq data, we demonstrate that MOABS outperforms other leading algorithms, such as Fisher’s exact test and BSmooth. Furthermore, MOABS analysis can be easily extended to differential 5hmC analysis using RRBS and oxBS-seq. MOABS is available at http://code.google.com/p/moabs/.
doi:10.1186/gb-2014-15-2-r38
PMCID: PMC4054608  PMID: 24565500
20.  Systematic evaluation of spliced alignment programs for RNA-seq data 
Nature methods  2013;10(12):1185-1191.
High-throughput RNA sequencing is an increasingly accessible method for studying gene structure and activity on a genome-wide scale. A critical step in RNA-seq data analysis is the alignment of partial transcript reads to a reference genome sequence. To assess the performance of current mapping software, we invited developers of RNA-seq aligners to process four large human and mouse RNA-seq data sets. In total, we compared 26 mapping protocols based on 11 programs and pipelines and found major performance differences between methods on numerous benchmarks, including alignment yield, basewise accuracy, mismatch and gap placement, exon junction discovery and suitability of alignments for transcript reconstruction. We observed concordant results on real and simulated RNA-seq data, confirming the relevance of the metrics employed. Future developments in RNA-seq alignment methods would benefit from improved placement of multimapped reads, balanced utilization of existing gene annotation and a reduced false discovery rate for splice junctions.
doi:10.1038/nmeth.2722
PMCID: PMC4018468  PMID: 24185836
21.  Methyl-Analyzer—whole genome DNA methylation profiling 
Bioinformatics  2011;27(16):2296-2297.
Summary: Methyl-Analyzer is a Python package that analyzes genome-wide DNA methylation data produced by the Methyl-MAPS (methylation mapping analysis by paired-end sequencing) method. Methyl-MAPS is an enzymatic-based method that uses both methylation-sensitive and -dependent enzymes covering >80% of CpG dinucleotides within mammalian genomes. It combines enzymatic-based approaches with high-throughput next-generation sequencing technology to provide whole genome DNA methylation profiles. Methyl-Analyzer processes and integrates sequencing reads from methylated and unmethylated compartments and estimates CpG methylation probabilities at single base resolution.
Availability and implementation: Methyl-Analyzer is available at http://github.com/epigenomics/methylmaps. Sample dataset is available for download at http://epigenomicspub.columbia.edu/methylanalyzer_data.html.
Contact: fgh3@columbia.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr356
PMCID: PMC3150045  PMID: 21685051
22.  CAP-miRSeq: a comprehensive analysis pipeline for microRNA sequencing data 
BMC Genomics  2014;15(1):423.
Background
miRNAs play a key role in normal physiology and various diseases. miRNA profiling through next generation sequencing (miRNA-seq) has become the main platform for biological research and biomarker discovery. However, analyzing miRNA sequencing data is challenging as it needs a significant amount of computational resources and bioinformatics expertise. Several web-based analytical tools have been developed, but they are limited to processing one sample or a pair of samples at a time and are not suitable for a large-scale study. Lack of flexibility and reliability are also common issues with these web applications.
Results
We developed a Comprehensive Analysis Pipeline for microRNA Sequencing data (CAP-miRSeq) that integrates read pre-processing, alignment, mature/precursor/novel miRNA detection and quantification, data visualization, variant detection in miRNA coding regions, and more flexible differential expression analysis between experimental conditions. Depending on their computational infrastructure, users can install the package locally or deploy it in Amazon Cloud to run samples sequentially or in parallel, allowing speedy analysis of a large number of samples. In either case, summary and expression reports for all samples are generated for easier quality assessment and downstream analyses. Using well characterized data, we demonstrated the pipeline’s superior performance, flexibility, and practical use in research and biomarker discovery.
Conclusions
CAP-miRSeq is a powerful and flexible tool for users to process and analyze miRNA-seq data, scalable from a few to hundreds of samples. The results are presented in a convenient way for investigators or analysts to conduct further investigation and discovery.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-15-423) contains supplementary material, which is available to authorized users.
doi:10.1186/1471-2164-15-423
PMCID: PMC4070549  PMID: 24894665
miRNA sequencing; Analysis pipeline; Differential expression; Variant detection
23.  [No title available] 
RNA sequencing is a rich assay for delineating the transcriptome but few RNA-Seq standard data sets exist to help quantification of gene or splice form expression. Moreover, each next-generation sequencing (NGS) platform has unique aspects of library synthesis, sequencing, alignment, and data processing. Little is known about cross-site reproducibility, technical variance and interoperability of NGS platforms for RNA-Seq.
The goals of the ABRF-NGS study are to evaluate the performance of NGS platforms and to identify optimal methods and best practices. The study includes five ABRF Research Groups and over 20 core facility laboratories. To address RNA-Seq issues, we performed sequencing on five NGS platforms at multiple sites using two standardized RNA samples with synthetic RNA spike-ins. Platforms tested included Illumina HiSeq 2000/2500, Roche 454 GS FLX, Life Technology Ion PGM and Proton, and PacBio. We evaluated a wide range of variables, including varying input amount (1-1000 ng), alternate library preparation methods, specific size fractionation (1, 2, and 3 kb), and performance on degraded RNA (using heat, sonication, and RNase A). We used a set of 18,250 rt-PCR reactions as an orthogonal tool to gauge the linear and dynamic range of the RNA-Seq results.
Our results show that unique transcripts and isoforms are revealed by each method and NGS platform. We found that the majority of the human transcriptome can be found with each method and platform. We also discovered thousands of transcriptionally active regions (TARs) beyond existing gene annotations, which demonstrate that conservative annotation sets are inappropriate for analysis, versus larger annotation sets. Moreover, while we see high correlation of RNA-Seq within sites, we observed that “site effect” is the largest variance factor outside of biological sources. Additionally, we observed that the “bioinformatics noise” of aligners and annotations contributes substantial variance, underscoring the need for data provenance for long-term studies.
PMCID: PMC3635248
24.  The ABRF-Next Generation Sequencing Study: A Five-Platform, Cross-site, Cross-Protocol Examination of RNA Sequencing 
RNA sequencing is a rich assay for delineating the transcriptome but few RNA-Seq standard data sets exist to help quantification of gene or splice form expression. Moreover, each next-generation sequencing (NGS) platform has unique aspects of library synthesis, sequencing, alignment, and data processing. Little is known about cross-site reproducibility, technical variance and interoperability of NGS platforms for RNA-Seq.
The goals of the ABRF-NGS study are to evaluate the performance of NGS platforms and to identify optimal methods and best practices. The study includes five ABRF Research Groups and over 20 core facility laboratories. To address RNA-Seq issues, we performed sequencing on five NGS platforms at multiple sites using two standardized RNA samples with synthetic RNA spike-ins. Platforms tested included Illumina HiSeq 2000/2500, Roche 454 GS FLX, Life Technology Ion PGM and Proton, and PacBio. We evaluated a wide range of variables, including varying input amount (1-1000 ng), alternate library preparation methods, specific size fractionation (1, 2, and 3 kb), and performance on degraded RNA (using heat, sonication, and RNase A). We used a set of 18,250 rt-PCR reactions as an orthogonal tool to gauge the linear and dynamic range of the RNA-Seq results.
Our results show that unique transcripts and isoforms are revealed by each method and NGS platform. We found that the majority of the human transcriptome can be found with each method and platform. We also discovered thousands of transcriptionally active regions (TARs) beyond existing gene annotations, which demonstrate that conservative annotation sets are inappropriate for analysis, versus larger annotation sets. Moreover, while we see high correlation of RNA-Seq within sites, we observed that “site effect” is the largest variance factor outside of biological sources. Additionally, we observed that the “bioinformatics noise” of aligners and annotations contributes substantial variance, underscoring the need for data provenance for long-term studies.
PMCID: PMC3635422
25.  seqMINER: an integrated ChIP-seq data interpretation platform 
Nucleic Acids Research  2010;39(6):e35.
In a single experiment, chromatin immunoprecipitation combined with high throughput sequencing (ChIP-seq) provides genome-wide information about a given covalent histone modification or transcription factor occupancy. However, time-efficient bioinformatics resources for extracting biological meaning out of these gigabyte-scale datasets are often a limiting factor for data interpretation by biologists. We created an integrated portable ChIP-seq data interpretation platform called seqMINER, with optimized performance for efficient handling of multiple genome-wide datasets. seqMINER allows comparison and integration of multiple ChIP-seq datasets and extraction of qualitative as well as quantitative information. seqMINER can handle the biological complexity of most experimental situations and proposes methods to the user for data classification according to the analysed features. In addition, through multiple graphical representations, seqMINER allows visualization and modelling of general as well as specific patterns in a given dataset. To demonstrate the efficiency of seqMINER, we have carried out a comprehensive analysis of genome-wide chromatin modification data in mouse embryonic stem cells to understand the global epigenetic landscape and its change through cellular differentiation.
doi:10.1093/nar/gkq1287
PMCID: PMC3064796  PMID: 21177645
