There are two main technologies for transcriptome profiling, namely, tiling microarrays and high-throughput sequencing. Recently there has been a tremendous amount of excitement about the latter because of the advent of next-generation sequencing technologies and their promise. Consequently, the question of the moment is how these two technologies compare. Here we attempt to develop an approach for a fair comparison of transcripts identified from tiling microarray and MPSS sequencing data.
This comparison is a challenging task because the sequencing data are discrete while the tiling array data are continuous. We use the published rice and Arabidopsis datasets, which provide the best-matched sets of array and sequencing experiments currently available, using a slightly earlier generation of sequencing, the MPSS tag sequencing technology. After scoring the arrays consistently in both organisms, a first-pass comparison reveals a surprisingly small overlap in transcripts of 22% and 66% in rice and Arabidopsis, respectively. However, a more detailed analysis shows that this is an underestimate. In particular, when we map the probe intensities onto the sequencing tags and examine their intensity distribution, we see that it is very similar to that of exons. Furthermore, restricting our comparison to only protein-coding gene loci revealed a very good overlap between the two technologies.
Our approach to compare genome tiling microarray and MPSS sequencing data suggests that there is actually a reasonable overlap in transcripts identified by the two technologies. This overlap is distorted by the scoring and thresholding in the tiling array scoring procedure.
Tiling-arrays are applicable to multiple types of biological research questions. Due to its advantages (high sensitivity, high resolution, and lack of bias), the technology is often employed in genome-wide investigations. A major challenge in the analysis of tiling-array data is to define regions-of-interest, i.e., runs of contiguous probes with increased signal intensity (as a result of hybridization of labeled DNA). Currently, no standard criteria are available to define these regions-of-interest: there is no single probe intensity cut-off level, and different regions-of-interest can contain different numbers of probes and vary in genomic width. Furthermore, the chromosomal distance between neighboring probes can vary across the genome and among different arrays.
We have developed Hypergeometric Analysis of Tiling-arrays (HAT), and first evaluated its performance for tiling-array datasets from a Chromatin Immunoprecipitation study on chip (ChIP-on-chip) for the identification of genome-wide DNA binding profiles of transcription factor Cebpa (used for method comparison). Using this assay, we can refine the detection of regions-of-interest by illustrating that regions detected by HAT are more highly enriched for expected motifs in comparison with an alternative detection method (MAT). Subsequently, data from a retroviral insertional mutagenesis screen were used to examine the performance of HAT among different applications of tiling-array datasets. In both studies, detected regions-of-interest have been validated with (q)PCR.
We demonstrate that HAT has increased specificity for analysis of tiling-array data in comparison with the alternative method, and that it accurately detects regions-of-interest in two different applications of tiling-arrays. HAT has several advantages over previous methods: i) as there is no single cut-off level for probe-intensity, HAT can detect regions-of-interest at various thresholds, ii) it can detect regions-of-interest of any size, iii) it is independent of probe-resolution across the genome, and across tiling-array platforms and iv) it employs a single user defined parameter: the significance level. Regions-of-interest are detected by computing the hypergeometric-probability, while controlling the Family Wise Error. Furthermore, the method does not require experimental replicates, common regions-of-interest are indicated, a sequence-of-interest can be examined for every detected region-of-interest, and flanking genes can be reported.
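The hypergeometric probability mentioned above can be illustrated with a minimal sketch. It assumes the probes have already been split at some intensity cut-off into "high" and "low". The function names `hypergeom_tail` and `is_region_of_interest` and all numbers are hypothetical. A Bonferroni correction is used here only as a simple stand-in for Family Wise Error control; HAT's actual region search and correction procedure are not reproduced.

```python
from math import comb


def hypergeom_tail(N, K, n, k):
    """Return P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: total probes on the array, K: probes above the intensity cut-off,
    n: probes in the candidate region, k: region probes above the cut-off.
    """
    total = comb(N, n)
    upper = min(n, K)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible draws contribute nothing to the sum.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


def is_region_of_interest(p_value, alpha, n_tests):
    """Bonferroni stand-in for Family Wise Error control (an assumption)."""
    return p_value < alpha / n_tests


# Hypothetical numbers: 1000 probes, 100 above the cut-off,
# a 10-probe window containing 8 high probes, 100 windows tested.
p = hypergeom_tail(1000, 100, 10, 8)
print(p, is_region_of_interest(p, alpha=0.05, n_tests=100))
```

In this sketch the significance level `alpha` is the only tuning parameter, in line with the single user-defined parameter described above.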
Alternative splicing (AS) is a process which generates several distinct mRNA isoforms from the same gene by splicing different portions out of the precursor transcript. Due to the (patho-)physiological importance of AS, a complete inventory of AS is of great interest. While this is in reach for human and mammalian model organisms, our knowledge of AS in plants has remained more incomplete. Experimental approaches for monitoring AS are either based on transcript sequencing or rely on hybridization to DNA microarrays. Among the microarray platforms facilitating the discovery of AS events, tiling arrays are well-suited for identifying intron retention, the most prevalent type of AS in plants. However, analyzing tiling array data is challenging, because of high noise levels and limited probe coverage.
In this work, we present a novel method to detect intron retentions (IR) and exon skips (ES) from tiling arrays. While statistical tests have typically been proposed for this purpose, our method instead utilizes support vector machines (SVMs) which are appreciated for their accuracy and robustness to noise. Existing EST and cDNA sequences served for supervised training and evaluation. Analyzing a large collection of publicly available microarray and sequence data for the model plant A. thaliana, we demonstrated that our method is more accurate than existing approaches. The method was applied in a genome-wide screen which resulted in the discovery of 1,355 IR events. A comparison of these IR events to the TAIR annotation and a large set of short-read RNA-seq data showed that 830 of the predicted IR events are novel and that 525 events (39%) overlap with either the TAIR annotation or the IR events inferred from the RNA-seq data.
The method developed in this work expands the scarce repertoire of analysis tools for the identification of alternative mRNA splicing from whole-genome tiling arrays. Our predictions are highly enriched with known AS events and complement the A. thaliana genome annotation with respect to AS. Since all predicted AS events can be precisely attributed to experimental conditions, our work provides a basis for follow-up studies focused on the elucidation of the regulatory mechanisms underlying tissue-specific and stress-dependent AS in plants.
The complexity of mammalian transcriptomes is compounded by alternative splicing which allows one gene to produce multiple transcript isoforms. However, transcriptome comparison has been limited to differential analysis at the gene level instead of the individual transcript isoform level. High-throughput sequencing technologies and high-resolution tiling arrays provide an unprecedented opportunity to compare transcriptomes at the level of individual splice variants. However, sequence read coverage or probe intensity at each position may represent a family of splice variants instead of one single isoform. Here we propose a hierarchical Bayesian model, BASIS (Bayesian Analysis of Splicing IsoformS), to infer the differential expression level of each transcript isoform in response to two conditions. A latent variable was introduced to perform direct statistical selection of differentially expressed isoforms. Model parameters were inferred based on an ergodic Markov chain generated by our Gibbs sampler. BASIS has the ability to borrow information across different probes (or positions) from the same genes and different genes. BASIS can handle the heteroskedasticity of probe intensity or sequence read coverage. We applied BASIS to a human tiling-array data set and a mouse RNA-seq data set. Some of the predictions were validated by quantitative real-time RT–PCR experiments.
High-density tiling microarrays are a powerful tool for the characterization of complete genomes. The two major computational challenges associated with custom-made arrays are design and analysis. Firstly, several genome dependent variables, such as the genome's complexity and sequence composition, need to be considered in the design to ensure a high quality microarray. Secondly, since tiling projects today very often exceed the limits of conventional array-experiments, researchers cannot use established computer tools designed for commercial arrays, and instead have to redesign previous methods or create novel tools.
Here we describe the multiple aspects involved in the design of tiling arrays for transcriptome analysis and detail the normalisation and analysis procedures for such microarrays. We introduce a novel design method to make two 280,000 feature microarrays covering the entire genome of the bacterial species Escherichia coli and Neisseria meningitidis, respectively, as well as the use of multiple copies of control probe-sets on tiling microarrays. Furthermore, a novel normalisation and background estimation procedure for tiling arrays is presented along with a method for array analysis focused on detection of short transcripts. The design, normalisation and analysis methods have been applied in various experiments and several of the detected novel short transcripts have been biologically confirmed by Northern blot tests.
Tiling-arrays are becoming increasingly applicable in genomic research, but researchers still lack both the tools for custom design of arrays, as well as the systems and procedures for analysis of the vast amount of data resulting from such experiments. We believe that the methods described herein will be a useful contribution and resource for researchers designing and analysing custom tiling arrays for both bacteria and higher organisms.
Probing protein-deoxyribonucleic acid (DNA) interactions is gaining popularity as it sheds light on the molecular mechanisms that regulate the expression of genes. Currently, tiling-arrays and next-generation sequencing technology can be used to measure these interactions. Both methods generate a signal over the genome in which contiguous regions of peaks represent the presence of an interacting molecule. Many methods exist to identify functional regions of interest (ROIs) on the genome. However, the detection of ROIs is often not the end-point of a research question, and data must therefore be moved between tools to relate the ROIs to information present in databases, such as gene ontology, pathway information, or enrichment of certain genomic content. We introduce hypergeometric analysis of tiling-array and sequence data (HATSEQ), a powerful tool that accurately identifies functional ROIs on the genome where a genomic signal significantly deviates from the general genome-wide behavior. HATSEQ also includes a number of built-in post-analyses with which biological meaning can be attached to the detected ROIs in terms of gene pathways and de-novo motif analysis, and provides different visualizations and statistical summaries for the detected ROIs. In addition, HATSEQ has an intuitive graphical user interface that lowers the barrier for researchers to analyze their data without the need for scripting languages. We compared the results of HATSEQ against two other popular chromatin immunoprecipitation sequencing (ChIP-Seq) methods and observed overlap in the detected ROIs, but HATSEQ is more specific in delineating the peak boundaries. We also discuss the versatility of HATSEQ by using a Signal Transducer and Activator of Transcription 1 (STAT1) ChIP-Seq data-set, and show that the detected ROIs are highly specific for the expected STAT1 binding motif. HATSEQ is freely available at: http://hema13.erasmusmc.nl/index.php/HATSEQ.
bioinformatics; NGS analysis; ChIP-Seq; peak detection
Short-read RNA sequencing in mouse and human tissues shows that most transcripts are encoded within or near known genes and that most of the genome is not transcribed.
A series of reports over the last few years have indicated that a much larger portion of the mammalian genome is transcribed than can be accounted for by currently annotated genes, but the quantity and nature of these additional transcripts remains unclear. Here, we have used data from single- and paired-end RNA-Seq and tiling arrays to assess the quantity and composition of transcripts in PolyA+ RNA from human and mouse tissues. Relative to tiling arrays, RNA-Seq identifies many fewer transcribed regions (“seqfrags”) outside known exons and ncRNAs. Most nonexonic seqfrags are in introns, raising the possibility that they are fragments of pre-mRNAs. The chromosomal locations of the majority of intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and terminator-associated transcripts, or new alternative exons; indeed, reads that bridge splice sites identified 4,544 new exons, affecting 3,554 genes. Most of the remaining seqfrags correspond to either single reads that display characteristics of random sampling from a low-level background or several thousand small transcripts (median length = 111 bp) present at higher levels, which also tend to display sequence conservation and originate from regions with open chromatin. We conclude that, while there are bona fide new intergenic transcripts, their number and abundance is generally low in comparison to known exons, and the genome is not as pervasively transcribed as previously reported.
The human genome was sequenced a decade ago, but its exact gene composition remains a subject of debate. The number of protein-coding genes is much lower than initially expected, and the number of distinct transcripts is much larger than the number of protein-coding genes. Moreover, the proportion of the genome that is transcribed in any given cell type remains an open question: results from “tiling” microarray analyses suggest that transcription is pervasive and that most of the genome is transcribed, whereas new deep sequencing-based methods suggest that most transcripts originate from known genes. We have addressed this discrepancy by comparing samples from the same tissues using both technologies. Our analyses indicate that RNA sequencing appears more reliable for transcripts with low expression levels, that most transcripts correspond to known genes or are near known genes, and that many transcripts may represent new exons or aberrant products of the transcription process. We also identify several thousand small transcripts that map outside known genes; their sequences are often conserved and are often encoded in regions of open chromatin. We propose that most of these transcripts may be by-products of the activity of enhancers, which associate with promoters as part of their role as long-range gene regulatory sites. Overall, however, we find that most of the genome is not appreciably transcribed.
To demonstrate the benefits of RNA-Seq over microarray in transcriptome profiling, both RNA-Seq and microarray analyses were performed on RNA samples from a human T cell activation experiment. In contrast to other reports, our analyses focused on the difference, rather than similarity, between RNA-Seq and microarray technologies in transcriptome profiling. A comparison of data sets derived from RNA-Seq and Affymetrix platforms using the same set of samples showed a high correlation between gene expression profiles generated by the two platforms. However, it also demonstrated that RNA-Seq was superior in detecting low abundance transcripts, differentiating biologically critical isoforms, and allowing the identification of genetic variants. RNA-Seq also demonstrated a broader dynamic range than microarray, which allowed for the detection of more differentially expressed genes with higher fold-change. Analysis of the two datasets also showed the benefit derived from avoidance of technical issues inherent to microarray probe performance such as cross-hybridization, non-specific hybridization and limited detection range of individual probes. Because RNA-Seq does not rely on a pre-designed complement sequence detection probe, it is devoid of issues associated with probe redundancy and annotation, which simplified interpretation of the data. Despite the benefits of RNA-Seq, microarrays are still the more common choice of researchers when conducting transcriptional profiling experiments. This is likely because RNA-Seq technology is new to most researchers and more expensive than microarrays, its data storage is more challenging, and its analysis is more complex. We expect that once these barriers are overcome, the RNA-Seq platform will become the predominant tool for transcriptome analysis.
Statistical analysis of tiling array data is extremely challenging due to the astronomically large number of sequence probes, high noise levels of individual probes and limited number of replicates in these data. To overcome these difficulties, we first developed statistical error estimation and weighted ANOVA modeling approaches to high-density tiling array data, especially the former based on an advanced error-pooling method to accurately obtain heterogeneous technical error of small-sample tiling array data. Based on these approaches, we analyzed the high-density tiling array data of the temporal replication patterns during cell-cycle S phase of synchronized HeLa cells on human chromosomes 21 and 22. We found many novel temporal replication patterns, identifying about 26% of over 1 million tiling array sequence probes with significant differential replication during the four 2-h time periods of S phase. Among these differentially replicated probes, 126 941 sequence probes were matched to 417 known genes. The majority of these genes were found to be replicated within one or two consecutive time periods, while the others were replicated at two non-consecutive time periods. Also, coding regions were found to be more differentially replicated in particular time periods than noncoding regions in the gene-poor chromosome 21 (25% differentially replicated among genic probes versus 18.6% among intergenic probes), while such a phenomenon was less prominent in gene-rich chromosome 22. Rigorous statistical testing for local proximity of differentially replicated genic and intergenic probes was performed to identify significant stretches of differentially replicated sequence regions. From this analysis, we found that adjacent genes were frequently replicated at different time periods, potentially implying the existence of quite dense replication origins.
Evaluating the conditional probability significance of identified gene ontology terms on chromosomes 21 and 22, we detected some over-represented molecular functions and biological processes among these differentially replicated genes, such as the ones relevant to hydrolase, transferase and receptor-binding activities. Some of these results were confirmed showing >70% consistency with cDNA microarray data that were independently generated in parallel with the tiling arrays. Thus, our improved analysis approaches specifically designed for high-density tiling array data enabled us to reliably and sensitively identify many novel temporal replication patterns on human chromosomes.
High-density oligonucleotide microarray is an appropriate technology for genomic analysis, and is particularly useful in the generation of transcriptional maps, ChIP-on-chip studies and re-sequencing of the genome. Transcriptome analysis of tiling microarray data facilitates the discovery of novel transcripts and the assessment of differential expression in diverse experimental conditions. Although new technologies such as next-generation sequencing have appeared, microarrays might still be useful for the study of small genomes or for the analysis of genomic regions with custom microarrays due to their lower price and good accuracy in expression quantification.
Here, we propose a novel wavelet-based method, named ZCL (zero-crossing lines), for the combined denoising and segmentation of tiling signals. The denoising is performed with the classical SUREshrink method and the detection of transcriptionally active regions is based on the computation of the Continuous Wavelet Transform (CWT). In particular, the detection of the transitions is implemented as the thresholding of the zero-crossing lines. The algorithm described has been applied to the public Saccharomyces cerevisiae dataset and it has been compared with two well-known algorithms: pseudo-median sliding window (PMSW) and the structural change model (SCM). As a proof-of-principle, we applied the ZCL algorithm to the analysis of the custom tiling microarray hybridization results of a S. aureus mutant deficient in the sigma B transcription factor. The challenge was to identify those transcripts whose expression decreases in the absence of sigma B.
The proposed method achieves the best performance in terms of positive predictive value (PPV) while its sensitivity is similar to that of the other algorithms used for the comparison. The computation time needed to process the transcriptional signals is low compared with model-based methods and in the same range as that of filter-based methods. Automatic parameter selection has been incorporated and, moreover, the method can be easily adapted to a parallel implementation. We can conclude that the proposed method is well suited for the analysis of tiling signals, in which transcriptional activity is often hidden in the noise. Finally, the quantification and differential expression analysis of the S. aureus dataset have demonstrated the utility of this novel tool for the biological analysis of the S. aureus transcriptome.
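The two evaluation measures named above can be made concrete with a small sketch. It assumes a probe-level evaluation in which each probe carries a predicted label and a reference label (transcribed or not). The function name `ppv_and_sensitivity` and the toy labels are hypothetical, and the study's own evaluation unit (probes, segments, or transcripts) may differ.

```python
def ppv_and_sensitivity(predicted, truth):
    """Return (PPV, sensitivity) for two equal-length sequences of 0/1 labels.

    PPV = TP / (TP + FP); sensitivity = TP / (TP + FN).
    """
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    pairs = list(zip(predicted, truth))
    tp = sum(1 for p, t in pairs if p and t)
    fp = sum(1 for p, t in pairs if p and not t)
    fn = sum(1 for p, t in pairs if not p and t)
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    return ppv, sensitivity


# Toy labels for six probes (hypothetical data).
predicted = [1, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 1, 1, 1]
print(ppv_and_sensitivity(predicted, truth))
```

A method with high PPV but only comparable sensitivity, as reported here, calls fewer false transcribed regions without finding more of the true ones.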
Recent sequencing technologies that allow massive parallel production of short reads are the method of choice for transcriptome analysis. Particularly, digital gene expression (DGE) technologies produce a large dynamic range of expression data by generating short tag signatures for each cell transcript. These tags can be mapped back to a reference genome to identify new transcribed regions that can be further covered by RNA-sequencing (RNA-Seq) reads. Here, we applied an integrated bioinformatics approach that combines DGE tags, RNA-Seq, tiling array expression data and species-comparison to explore new transcriptional regions and their specific biological features, particularly tissue expression or conservation. We analysed tags from a large DGE data set (designated as ‘TranscriRef’). We then annotated 750 000 tags that were uniquely mapped to the human genome according to Ensembl. We retained transcripts originating from both DNA strands and categorized tags corresponding to protein-coding genes, antisense, intronic- or intergenic-transcribed regions and computed their overlap with annotated non-coding transcripts. Using this bioinformatics approach, we identified ∼34 000 novel transcribed regions located outside the boundaries of known protein-coding genes. As demonstrated using sequencing data from human pluripotent stem cells for biological validation, the method could be easily applied for the selection of tissue-specific candidate transcripts. DigitagCT is available at http://cractools.gforge.inria.fr/softwares/digitagct.
Array comparative genomic hybridization is a fast and cost-effective method for detecting, genotyping, and comparing the genomic sequence of unknown bacterial isolates. This method, as with all microarray applications, requires adequate coverage of probes targeting the regions of interest. An unbiased tiling of probes across the entire length of the genome is the most flexible design approach. However, such a whole-genome tiling requires that the genome sequence is known in advance. For the accurate analysis of uncharacterized bacteria, an array must query a fully representative set of sequences from the species' pan-genome. Prior microarrays have included only a single strain per array or the conserved sequences of gene families. These arrays omit potentially important genes and sequence variants from the pan-genome.
This paper presents a new probe selection algorithm (PanArray) that can tile multiple whole genomes using a minimal number of probes. Unlike arrays built on clustered gene families, PanArray uses an unbiased, probe-centric approach that does not rely on annotations, gene clustering, or multi-alignments. Instead, probes are evenly tiled across all sequences of the pan-genome at a consistent level of coverage. To minimize the required number of probes, probes conserved across multiple strains in the pan-genome are selected first, and additional probes are used only where necessary to span polymorphic regions of the genome. The viability of the algorithm is demonstrated by array designs for seven different bacterial pan-genomes and, in particular, the design of a 385,000 probe array that fully tiles the genomes of 20 different Listeria monocytogenes strains with overlapping probes at greater than twofold coverage.
PanArray is an oligonucleotide probe selection algorithm for tiling multiple genome sequences using a minimal number of probes. It is capable of fully tiling all genomes of a species on a single microarray chip. These unique pan-genome tiling arrays provide maximum flexibility for the analysis of both known and uncharacterized strains.
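The conserved-first idea described above can be sketched as a simple greedy selection. This is a deliberately simplified illustration and not the PanArray algorithm itself: it targets single (1x) coverage, treats probes as exact k-mers, and ignores probe quality, melting temperature, and cross-hybridization. The function names `select_probes` and `is_fully_tiled` and the toy sequences are hypothetical.

```python
def select_probes(genomes, k):
    """Greedily pick k-mers so that every base of every genome is covered.

    K-mers shared by more strains are considered first, so conserved
    regions are covered once rather than once per strain.
    """
    occurrences = {}
    for g, seq in enumerate(genomes):
        for i in range(len(seq) - k + 1):
            occurrences.setdefault(seq[i:i + k], []).append((g, i))
    covered = [[False] * len(seq) for seq in genomes]
    # Most widely conserved k-mers first; ties broken alphabetically.
    order = sorted(occurrences,
                   key=lambda m: (-len({g for g, _ in occurrences[m]}), m))
    chosen = []
    for kmer in order:
        hits = occurrences[kmer]
        adds_coverage = any(not covered[g][j]
                            for g, i in hits for j in range(i, i + k))
        if adds_coverage:
            chosen.append(kmer)
            for g, i in hits:
                for j in range(i, i + k):
                    covered[g][j] = True
    return chosen


def is_fully_tiled(genomes, probes, k):
    """Check that every base of every genome lies under at least one probe."""
    probe_set = set(probes)
    for seq in genomes:
        covered = [False] * len(seq)
        for i in range(len(seq) - k + 1):
            if seq[i:i + k] in probe_set:
                for j in range(i, i + k):
                    covered[j] = True
        if not all(covered):
            return False
    return True


# Two toy "strains" differing at a single position.
strains = ["ACGTACGTAA", "ACGTTCGTAA"]
probes = select_probes(strains, k=4)
print(len(probes), is_fully_tiled(strains, probes, k=4))
```

The greedy order does not guarantee a minimal probe set; it only shows why starting from conserved probes reduces redundancy across strains.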
The Microarray Core Facility (MCF) at Baylor College of Medicine provides investigators with access to a variety of state-of-the-art technologies and approaches that will enhance discovery for their genomic research. We house instrumentation supporting Affymetrix, Agilent, NimbleGen, Luminex, and Illumina platforms. The MCF provides expertise in the following applications: gene expression, array comparative genomic hybridization (aCGH), SNP genotyping, and next-generation sequencing. In addition, our lab offers services for sample quality check and a cDNA clone repository, for those who are interested in verifying results from gene expression experiments or any other application requiring cDNA clones. The MCF specializes in RNA applications that enable researchers to monitor genome-wide expression profiles through Affymetrix, Agilent and NimbleGen expression arrays. Agilent's aCGH and Affymetrix SNP Arrays are also offered, providing detection of copy number variations across the genome. Other related services include: tiling arrays, ChIP-on-chip arrays, SuperArray, Promoter Arrays, and Panomics. Due to the increased demand for rapid DNA sequencing, the facility now provides massively parallel “next generation” sequencing on the Illumina Genome Analyzer II. Our core lab has established a workflow involving: project consultation, sample quality check, sample preparation and data generation for each sequencing project. Illumina's sequencing platform provides high-quality data in the following applications: gene expression and alternative splicing (mRNA-Seq), protein-nucleic acid association profiling and epigenetics (ChIP-Seq), sequencing targeted genomic regions, small RNA discovery (small RNA-Seq) and de novo sequencing. The MCF offers investigators access to an array of emerging technologies while assisting in experimental design and data analysis.
Existing statistical methods for tiling array transcriptome data either focus on transcript discovery in one biological or experimental condition or on the detection of differential expression between two conditions. Increasingly often, however, biologists are interested in time-course studies, studies with more than two conditions or even multiple-factor studies. As these studies are currently analyzed with the traditional microarray analysis techniques, they do not exploit the genome-wide nature of tiling array data to its full potential.
We present an R Bioconductor package, waveTiling, which implements a wavelet-based model for analyzing transcriptome data and extends it towards more complex experimental designs. With waveTiling the user is able to discover (1) group-wise expressed regions, (2) differentially expressed regions between any two groups in single-factor studies and in (3) multifactorial designs. Moreover, for time-course experiments it is also possible to detect (4) linear time effects and (5) a circadian rhythm of transcripts. By considering the expression values of the individual tiling probes as a function of genomic position, effect regions can be detected regardless of existing annotation. Three case studies with different experimental set-ups illustrate the use and the flexibility of the model-based transcriptome analysis.
The waveTiling package provides the user with a convenient tool for the analysis of tiling array transcriptome data for a multitude of experimental set-ups. Regardless of the study design, the probe-wise analysis allows for the detection of transcriptional effects in exonic, intronic and intergenic regions alike, without prior consultation of existing annotation.
The Affymetrix GeneChip technology uses multiple probes per gene to measure its expression level. Individual probe signals can vary widely, which hampers proper interpretation. This variation can be caused by probes that do not properly match their target gene or that match multiple genes. To determine the accuracy of Affymetrix arrays, we developed an extensive verification protocol for mouse arrays, incorporating the NCBI RefSeq, NCBI UniGene Unique, NIA Mouse Gene Index, and UCSC mouse genome databases.
Applying this protocol to Affymetrix Mouse Genome arrays (the earlier U74Av2 and the newer 430 2.0 array), the number of sequence-verified probes with perfect matches was no less than 85% and 95%, respectively; and for 74% and 85% of the probe sets all probes were sequence verified. The latter percentages increased to 80% and 94% after discarding one or two unverifiable probes per probe set, and even further to 84% and 97% when, in addition, allowing for one or two mismatches between probe and target gene. Similar results were obtained for other mouse arrays, as well as for human and rat arrays. Based on these data, refined chip definition files for all arrays are provided online. Researchers can choose the version appropriate for their study to (re)analyze expression data.
The accuracy of Affymetrix probe sequences is higher than previously reported, particularly on newer arrays. Yet, refined probe set definitions have clear effects on the detection of differentially expressed genes. We demonstrate that the interpretation of the results of Affymetrix arrays is improved when the new chip definition files are used.
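The verification criteria described above (perfect matches, a tolerance of one or two mismatches, and discarding one or two unverifiable probes per probe set) can be sketched as follows. This is a toy illustration under stated assumptions, not the published protocol: it compares probes against a single target string on one strand only, uses short 5-mers instead of real probe lengths, and does not consult any of the databases named above. The function names and sequences are hypothetical.

```python
def count_mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def probe_matches(probe, target, max_mismatches=0):
    """True if the probe aligns somewhere in the target within the tolerance."""
    k = len(probe)
    return any(count_mismatches(probe, target[i:i + k]) <= max_mismatches
               for i in range(len(target) - k + 1))


def verify_probe_set(probes, target, max_mismatches=0, max_discard=0):
    """Return (verified, unverified_probes) for one probe set.

    The set counts as verified if at most `max_discard` probes fail to match.
    """
    unverified = [p for p in probes
                  if not probe_matches(p, target, max_mismatches)]
    return len(unverified) <= max_discard, unverified


# Hypothetical target transcript and probe set.
target = "ACGTACGTTAGC"
probes = ["ACGTA", "GTTAG", "ACGAA", "CCCCC"]

print(verify_probe_set(probes, target))                    # strict
print(verify_probe_set(probes, target, max_mismatches=1))  # tolerate 1 mismatch
print(verify_probe_set(probes, target, max_mismatches=1, max_discard=1))
```

Relaxing either criterion can only increase the number of verified probe sets, which mirrors the direction of the percentages reported above.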
High-throughput RNA sequencing (RNA-seq) has become an instrumental assay for the analysis of multiple aspects of an organism's transcriptome. Further, the analysis of a biological specimen's associated microbiome can also be performed using RNA-seq data and this application is gaining interest in the scientific community. There are many existing bioinformatics tools designed for analysis and visualization of transcriptome data. Despite the availability of an array of next generation sequencing (NGS) analysis tools, the analysis of RNA-seq data sets poses a challenge for many biomedical researchers who are not familiar with command-line tools. Here we present RNA CoMPASS, a comprehensive RNA-seq analysis pipeline for the simultaneous analysis of transcriptomes and metatranscriptomes from diverse biological specimens. RNA CoMPASS leverages existing tools and parallel computing technology to facilitate the analysis of even very large datasets. RNA CoMPASS has a web-based graphical user interface with intrinsic queuing to control a distributed computational pipeline. RNA CoMPASS was evaluated by analyzing RNA-seq data sets from 45 B-cell samples. Twenty-two of these samples were derived from lymphoblastoid cell lines (LCLs) generated by the infection of naïve B-cells with the Epstein Barr virus (EBV), while another 23 samples were derived from Burkitt's lymphomas (BL), some of which arose in part through infection with EBV. Appropriately, RNA CoMPASS identified EBV in all LCLs and in a fraction of the BLs. Cluster analysis of the human transcriptome component of the RNA CoMPASS output clearly separated the BLs (which have a germinal center-like phenotype) from the LCLs (which have a blast-like phenotype) with evidence of activated MYC signaling and lower interferon and NF-kB signaling in the BLs. Together, this analysis illustrates the utility of RNA CoMPASS in the simultaneous analysis of transcriptome and metatranscriptome data. 
RNA CoMPASS is freely available at http://rnacompass.sourceforge.net/.
Motivation: Individual probes on an Affymetrix tiling array usually behave differently. Modeling and removing these probe effects are critical for detecting signals from the array data. Current data processing techniques either require control samples or use probe sequences to model probe-specific variability, such as with MAT. Although the MAT approach can be applied without control samples, residual probe effects continue to distort the true biological signals.
Results: We propose TileProbe, a new technique that builds upon the MAT algorithm by incorporating publicly available data sets to remove tiling array probe effects. By using a large number of these readily available arrays, TileProbe robustly models the residual probe effects that the MAT model cannot explain. When applied to analyzing ChIP-chip data, TileProbe performs consistently better than MAT across a variety of analytical conditions. This shows that TileProbe resolves the issue of probe-specific effects more completely.
Supplementary information: Supplementary data are available at Bioinformatics online.
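The idea of learning residual probe effects from many existing arrays can be illustrated with a minimal sketch. This is not the published TileProbe implementation. It assumes that each reference array has already been corrected by a MAT-like model, and it uses a per-probe median and scaled median absolute deviation as one possible robust summary. The function names `fit_probe_model` and `standardize` are hypothetical.

```python
import statistics

def fit_probe_model(reference_arrays):
    """Estimate a per-probe (center, scale) pair from many reference arrays.

    reference_arrays: list of arrays, each a list of MAT-corrected values
    with one value per probe (assumed input format, for illustration only).
    """
    n_probes = len(reference_arrays[0])
    model = []
    for p in range(n_probes):
        values = [array[p] for array in reference_arrays]
        center = statistics.median(values)
        mad = statistics.median([abs(v - center) for v in values])
        # 1.4826 rescales the MAD to match the SD of a normal distribution;
        # fall back to 1.0 when the probe shows no variation at all.
        scale = 1.4826 * mad if mad > 0 else 1.0
        model.append((center, scale))
    return model

def standardize(array, model):
    """Remove the learned probe-specific center and scale from a new array."""
    return [(v - center) / scale for v, (center, scale) in zip(array, model)]

if __name__ == "__main__":
    references = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
    model = fit_probe_model(references)
    print(standardize([2.0, 12.0], model))
```

A robust center and scale are used here so that a few reference arrays with true biological signal at a probe do not dominate its estimated background behavior.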
RNA-Seq exploits the rapid generation of gigabases of sequence data by Massively Parallel Nucleotide Sequencing, allowing for the mapping and digital quantification of whole transcriptomes. Whilst previous comparisons between RNA-Seq and microarrays have been performed at the level of gene expression, in this study we adopt a more fine-grained approach. Using RNA samples from a normal human breast epithelial cell line (MCF-10a) and a breast cancer cell line (MCF-7), we present a comprehensive comparison between RNA-Seq data generated on the Applied Biosystems SOLiD platform and data from Affymetrix Exon 1.0ST arrays. The use of Exon arrays makes it possible to assess the performance of RNA-Seq in two key areas: detection of expression at the granularity of individual exons, and discovery of transcription outside annotated loci.
We found a high degree of correspondence between the two platforms in terms of exon-level fold changes and detection. For example, over 80% of exons detected as expressed in RNA-Seq were also detected on the Exon array, and 91% of exons flagged as changing from Absent to Present on at least one platform had fold-changes in the same direction. The greatest detection correspondence was seen when the read count threshold at which to flag exons Absent in the SOLiD data was set to t<1, suggesting that the background error rate is extremely low in RNA-Seq. We also found RNA-Seq to be more sensitive than the Exon array in detecting differentially expressed exons, reflecting the wider dynamic range achievable on the SOLiD platform. In addition, we found significant evidence of novel protein coding regions outside known exons, 93% of which map to Exon array probesets, and were able to infer the presence of thousands of novel transcripts through the detection of previously unreported exon-exon junctions.
By focusing on exon-level expression, we present the most fine-grained comparison between RNA-Seq and microarrays to date. Overall, our study demonstrates that data from a SOLiD RNA-Seq experiment are sufficient to generate results comparable to those produced from Affymetrix Exon arrays, even using only a single replicate from each platform, and when presented with a large genome.
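The exon-level detection calls and the fold-change direction check described above can be sketched as follows. This is an illustrative sketch only: the function names, the input format (plain lists of per-exon read counts and log fold changes), and the handling of zero fold changes are assumptions, not the study's pipeline.

```python
def detection_call(read_count, threshold=1):
    """Flag an exon Absent ('A') when its read count is below the threshold.

    With threshold=1 this mirrors the t<1 rule: only exons with zero
    reads are called Absent; everything else is Present ('P').
    """
    return "A" if read_count < threshold else "P"

def same_direction_fraction(log_fc_seq, log_fc_array):
    """Fraction of exons whose log fold changes share a sign on both platforms.

    Exons with a fold change of exactly zero on either platform are
    skipped (an assumption made here to keep the comparison well defined).
    """
    pairs = [(a, b) for a, b in zip(log_fc_seq, log_fc_array) if a != 0 and b != 0]
    if not pairs:
        raise ValueError("no exons with non-zero fold change on both platforms")
    agree = sum(1 for a, b in pairs if (a > 0) == (b > 0))
    return agree / len(pairs)

if __name__ == "__main__":
    print([detection_call(c) for c in [0, 3, 1]])
    print(same_direction_fraction([1.0, -2.0, 0.5, -0.1], [0.8, -1.5, -0.2, -0.3]))
```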
Advantages of RNA-Seq over array-based platforms are quantitative gene expression and discovery of expressed single nucleotide variants (eSNVs) and fusion transcripts from a single platform, but the sensitivity for each of these characteristics is unknown. We measured gene expression in a set of manually degraded RNAs and in nine pairs of matched fresh-frozen and FFPE RNA isolated from breast tumors, using the hybridization-based NanoString nCounter (226-gene panel) and whole-transcriptome RNA-Seq with RiboZeroGold ScriptSeq V2 library preparation kits. We performed correlation analyses of gene expression between samples and across platforms. We then specifically assessed whole-transcriptome expression of lincRNA and discovery of eSNVs and fusion transcripts in the FFPE RNA-Seq data. For gene expression in the manually degraded samples, we observed Pearson correlations of >0.94 and >0.80 with NanoString and ScriptSeq protocols, respectively. Gene expression data for matched fresh-frozen and FFPE samples yielded mean Pearson correlations of 0.874 and 0.783 for NanoString (226 genes) and ScriptSeq whole-transcriptome protocols, respectively (p < 2 × 10⁻¹⁶). Specifically for lincRNAs, we observed a very high Pearson correlation (0.988) between matched fresh-frozen and FFPE pairs. FFPE samples across NanoString and RNA-Seq platforms gave a mean Pearson correlation of 0.838. In FFPE libraries, we detected 53.4% of high-confidence SNVs and 24% of high-confidence fusion transcripts. Sensitivity of fusion transcript detection was not improved by an increase in depth of sequencing of up to 3-fold (from ~56 to ~159 million reads). Both NanoString and ScriptSeq RNA-Seq technologies yield reliable gene expression data for degraded and FFPE material. The high degree of correlation between NanoString and RNA-Seq platforms suggests that discovery-based whole-transcriptome studies from FFPE material will produce reliable expression data.
The RiboZeroGold ScriptSeq protocol performed particularly well for lincRNA expression from FFPE libraries, but detection of eSNV and fusion transcripts was less sensitive.
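The between-sample and cross-platform comparisons above rest on the Pearson correlation of per-gene expression values. A minimal sketch of that computation follows. The log2(x + 1) transform and the helper names `log2p1` and `pearson` are assumptions for illustration, since the abstract does not state how the values were transformed.

```python
import math

def log2p1(values):
    """Log-transform expression values; the +1 pseudocount avoids log(0)."""
    return [math.log2(v + 1) for v in values]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)

if __name__ == "__main__":
    # Hypothetical per-gene values for one matched fresh-frozen/FFPE pair.
    fresh_frozen = [120.0, 15.0, 980.0, 0.0, 47.0]
    ffpe = [95.0, 22.0, 870.0, 3.0, 30.0]
    print(pearson(log2p1(fresh_frozen), log2p1(ffpe)))
```

In a study like this one, the function would be applied once per matched pair and the resulting coefficients averaged to give a mean correlation.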
Currently, most RNA-seq experiments are performed on the Illumina platform, but other companies are competing for market share. In this highly competitive environment, cross-platform comparisons and/or validations are becoming increasingly critical. Results of several comparisons in which the same samples were studied using Illumina and Ion Torrent RNA-seq and different microarray-based approaches are presented. To prepare the libraries, the RNA samples were processed using the Illumina TruSeq protocol (a protocol capturing polyadenylated RNA) and sequenced on an Illumina HiSeq 2500, producing 100x100-nt paired-end reads. The same samples were processed using the Ion Torrent Total RNA-Seq V2 protocol, which is capable of capturing non-coding RNA and preserves strand specificity. These libraries were sequenced on the Ion Proton using the P1 chip and produced reads of up to 200 nt. The data obtained with both platforms were compared for quality, alignment statistics, error rates, evenness and continuity of coverage, RNA biotype representation, and accuracy for expression profiling. Additionally, a detailed comparison of technical aspects including input amount, throughput, experimental time and reagent costs is presented. Lastly, the same samples were interrogated using Agilent V2 Human Whole Genome arrays, Affymetrix Gene arrays ST (1.0 and 2.0) and newly commercialized Affymetrix Human Transcriptome Arrays. There was a significant correlation between the Illumina and Ion Torrent RNA-Seq gene expression data and microarray data generated from the same samples; however, RNA-Seq detects additional transcripts whose expression was either not interrogated or not detected by microarrays.
C57BL/6J (B6) and DBA/2J (D2) are two of the most commonly used inbred mouse strains in neuroscience research. However, the only currently available mouse genome is based entirely on the B6 strain sequence. Consequently, oligonucleotide microarray probes are based solely on this B6 reference sequence, making their application for gene expression profiling comparisons across mouse strains dubious due to allelic sequence differences between strains, including single nucleotide polymorphisms (SNPs). The emergence of next-generation sequencing (NGS) and the RNA-Seq application provides a clear alternative to oligonucleotide arrays for detecting differential gene expression without the problems inherent to hybridization-based technologies. Using RNA-Seq, an average of 22 million short sequencing reads were generated per sample for 21 samples (10 B6 and 11 D2), and these reads were aligned to the mouse reference genome, allowing 16,183 Ensembl genes to be queried in striatum for both strains. To determine differential expression, ‘digital mRNA counting’ was applied based on reads that map to exons. The current study compares RNA-Seq (Illumina GA IIx) with two microarray platforms (Illumina MouseRef-8 v2.0 and Affymetrix MOE 430 2.0) to detect differential striatal gene expression between the B6 and D2 inbred mouse strains. We show that, by using stringent data processing requirements, differential expression as determined by RNA-Seq is concordant with both the Affymetrix and Illumina platforms in more instances than it is concordant with only a single platform, and that instances of discordance with respect to direction of fold change were rare. Finally, we show that additional information is gained from RNA-Seq compared to hybridization-based techniques, as RNA-Seq detects more genes than either microarray platform.
The majority of genes differentially expressed in RNA-Seq were only detected as present in RNA-Seq, which is important for studies with smaller effect sizes where the sensitivity of hybridization-based techniques could bias interpretation.
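The 'digital mRNA counting' step, in which expression is summarized as the number of reads that map to a gene's exons, can be sketched as below. This is a simplified illustration under stated assumptions: all coordinates lie on a single chromosome, intervals are closed (start and end inclusive), a read is counted at most once per gene, and the function name `count_reads_per_gene` is hypothetical. It is not the counting procedure used in the study.

```python
def count_reads_per_gene(exons_by_gene, reads):
    """Count aligned reads that overlap at least one exon of each gene.

    exons_by_gene: dict mapping gene id -> list of (start, end) exon intervals.
    reads: list of (start, end) aligned read intervals.
    """
    counts = {gene: 0 for gene in exons_by_gene}
    for read_start, read_end in reads:
        for gene, exons in exons_by_gene.items():
            # Two closed intervals overlap when each starts before the other ends.
            if any(read_start <= exon_end and exon_start <= read_end
                   for exon_start, exon_end in exons):
                counts[gene] += 1
    return counts

if __name__ == "__main__":
    exons = {"geneA": [(100, 200), (300, 400)], "geneB": [(1000, 1100)]}
    reads = [(150, 185), (190, 225), (250, 285), (1090, 1125), (395, 430)]
    print(count_reads_per_gene(exons, reads))
```

The nested loop is quadratic and only suitable for small examples; real data sets with millions of reads would need an interval index or a dedicated counting tool.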
The genome-wide distribution patterns of the ‘6th base’ 5-hydroxymethylcytosine (5hmC) in many tissues and cells have recently been revealed by hydroxymethylated DNA immunoprecipitation (hMeDIP) followed by high throughput sequencing or tiling arrays. However, it has been challenging to directly compare different data sets and samples using data generated by this method. Here, we report a new comparative hMeDIP-seq method, which involves barcoding different input DNA samples at the start and then performing hMeDIP-seq for multiple samples in one hMeDIP reaction. This approach extends the barcode technology from simply multiplexing the DNA deep sequencing outcome and provides significant advantages for quantitative control of all experimental steps, from unbiased hMeDIP to deep sequencing data analysis. Using this improved method, we profiled and compared the DNA hydroxymethylomes of mouse ES cells (ESCs) and mouse ESC-derived neural progenitor cells (NPCs). We identified differentially hydroxymethylated regions (DHMRs) between ESCs and NPCs and uncovered an intricate relationship between the alteration of DNA hydroxymethylation and changes in gene expression during neural lineage commitment of ESCs. Presumably, the DHMRs between ESCs and NPCs uncovered by this approach may provide new insight into the function of 5hmC in gene regulation and neural differentiation. Thus, this newly developed comparative hMeDIP-seq method provides a cost-effective and user-friendly strategy for direct genome-wide comparison of DNA hydroxymethylation across multiple samples, lending significant biological, physiological and clinical implications.
As a powerful tool for whole-genome analysis, the tiling array has been widely used to answer many genomic questions. It can now also serve as a capture device for library preparation in high-throughput sequencing experiments. Thus, a flexible and efficient tiling array design approach is still needed and could assist in transcriptomic experiments of various types and scales.
In this paper, we address issues and challenges in designing probes suitable for tiling array applications and targeted sequencing. In particular, we define the penalized uniqueness score, which serves as a controlling criterion to eliminate potential cross-hybridization, and a flexible tiling array design pipeline. Unlike BLAST or simple suffix array based methods, computing and using our uniqueness measurement can be more efficient for large-scale design and require less memory. The parameters provided could assist in various types of genomic tiling tasks. In addition, using both commercial array data and experimental data, we show that, contrary to previous claims, palindromic sequences exhibit relatively lower uniqueness.
Our proposed penalized uniqueness score could serve as a better indicator for cross hybridization with higher sensitivity and specificity, giving more control of expected array quality. The flexible tiling design algorithm incorporating the penalized uniqueness score was shown to give higher coverage and resolution. The package to calculate the penalized uniqueness score and the described probe selection algorithm are implemented as a Perl program, which is freely available at http://www1.fbn-dummerstorf.de/en/forschung/fbs/fb3/paper/2012-yang-1/OTAD.v1.1.tar.gz.
Tiling array; Targeted sequencing; Probe design; Penalized uniqueness score
RNA sequencing (RNA-Seq) is often used for transcriptome profiling as well as the identification of novel transcripts and alternative splicing events. Typically, RNA-Seq libraries are prepared from total RNA using poly(A) enrichment of the mRNA (mRNA-Seq) to remove ribosomal RNA (rRNA); however, this method fails to capture non-poly(A) transcripts or partially degraded mRNAs. Hence, an mRNA-Seq protocol is not compatible with RNAs from Formalin-Fixed and Paraffin-Embedded (FFPE) samples.
To address the desire to perform RNA-Seq on FFPE materials, we evaluated two different library preparation protocols that could be compatible for use with small RNA fragments. We obtained paired Fresh Frozen (FF) and FFPE RNAs from multiple tumors and subjected these to different gene expression profiling methods. We tested 11 human breast tumor samples using: (a) FF RNAs by microarray, mRNA-Seq, Ribo-Zero-Seq and DSN-Seq (Duplex-Specific Nuclease) and (b) FFPE RNAs by Ribo-Zero-Seq and DSN-Seq. We also performed these different RNA-Seq protocols using 10 TCGA tumors as a validation set.
The data from paired RNA samples showed high concordance in transcript quantification across all protocols and between FF and FFPE RNAs. In both FF and FFPE, Ribo-Zero-Seq removed rRNA with efficiency comparable to that of mRNA-Seq, and it provided equivalent or less biased coverage of gene 3′ ends. Compared to mRNA-Seq, where 69% of bases were mapped to the transcriptome, DSN-Seq and Ribo-Zero-Seq contained significantly fewer reads mapping to the transcriptome (20-30%); in these RNA-Seq protocols, many if not most reads mapped to intronic regions. Approximately 14 million reads in mRNA-Seq and 45–65 million reads in Ribo-Zero-Seq or DSN-Seq were required to achieve the same gene detection levels as a standard Agilent DNA microarray.
Our results demonstrate that compared to mRNA-Seq and microarrays, Ribo-Zero-Seq provides equivalent rRNA removal efficiency, coverage uniformity, genome-based mapped reads, and consistently high quality quantification of transcripts. Moreover, Ribo-Zero-Seq and DSN-Seq have consistent transcript quantification using FFPE RNAs, suggesting that RNA-Seq can be used with FFPE-derived RNAs for gene expression profiling.
Electronic supplementary material
The online version of this article (doi: 10.1186/1471-2164-15-419) contains supplementary material, which is available to authorized users.
RNA sequencing; FFPE; RNA depletion; Ribo-zero; Gene expression; Microarray
RNA-seq is a promising technology to re-sequence protein coding genes for the identification of single nucleotide variants (SNV), while simultaneously obtaining information on structural variations and gene expression perturbations. We asked whether RNA-seq is suitable for the detection of driver mutations in T-cell acute lymphoblastic leukemia (T-ALL). These leukemias are caused by a combination of gene fusions, over-expression of transcription factors and cooperative point mutations in oncogenes and tumor suppressor genes. We analyzed 31 T-ALL patient samples and 18 T-ALL cell lines by high-coverage paired-end RNA-seq. First, we optimized the detection of SNVs in RNA-seq data by comparing the results with exome re-sequencing data. We identified known driver genes with recurrent protein altering variations, as well as several new candidates including H3F3A, PTK2B, and STAT5B. Next, we determined accurate gene expression levels from the RNA-seq data through normalizations and batch effect removal, and used these to classify patients into T-ALL subtypes. Finally, we detected gene fusions, of which several can explain the over-expression of key driver genes such as TLX1, PLAG1, LMO1, or NKX2-1; and others result in novel fusion transcripts encoding activated kinases (SSBP2-FER and TPM3-JAK2) or involving MLLT10. In conclusion, we present novel analysis pipelines for variant calling, variant filtering, and expression normalization on RNA-seq data, and successfully applied these for the detection of translocations, point mutations, INDELs, exon-skipping events, and expression perturbations in T-ALL.
The quest for somatic mutations underlying oncogenic processes is a central theme in today's cancer research. High-throughput genomics approaches including amplicon re-sequencing, exome re-sequencing, full genome re-sequencing, and SNP arrays have contributed to cataloguing driver genes across cancer types. Thus far transcriptome sequencing by RNA-seq has been mainly used for the detection of fusion genes, while few studies have assessed its value for the combined detection of SNPs, INDELs, fusions, gene expression changes, and alternative transcript events. Here we apply RNA-seq to 49 T-ALL samples and perform a critical assessment of the bioinformatics pipelines and filters to identify each type of aberration. By comparing to exome re-sequencing, and by exploiting the catalogues of known cancer drivers, we identified many known and several novel driver genes in T-ALL. We also determined an optimal normalization strategy to obtain accurate gene expression levels and used these to identify over-expressed transcription factors that characterize different T-ALL subtypes. Finally, by PCR, cloning, and in vitro cellular assays we uncover new fusion genes that have consequences at the level of gene expression, oncogenic chimaeras, and tumor suppressor inactivation. In conclusion, we present the first RNA-seq data set across T-ALL patients and identify new driver events.