There are two main technologies for transcriptome profiling, namely, tiling microarrays and high-throughput sequencing. Recently there has been a tremendous amount of excitement about the latter because of the advent of next-generation sequencing technologies and their promise. Consequently, the question of the moment is how these two technologies compare. Here we attempt to develop an approach for a fair comparison of transcripts identified from tiling microarray and MPSS sequencing data.
This comparison is a challenging task because the sequencing data are discrete while the tiling array data are continuous. We use the published rice and Arabidopsis datasets, which provide the currently best-matched sets of array and sequencing experiments, using a slightly earlier generation of sequencing, the MPSS tag sequencing technology. After scoring the arrays consistently in both organisms, a first-pass comparison reveals a surprisingly small overlap in transcripts of 22% and 66% in rice and Arabidopsis, respectively. However, a more detailed analysis shows that this is an underestimate. In particular, when we map the probe intensities onto the sequencing tags and then look at their intensity distribution, we see that it is very similar to that of exons. Furthermore, restricting our comparison to only protein-coding gene loci revealed a very good overlap between the two technologies.
Our approach to compare genome tiling microarray and MPSS sequencing data suggests that there is actually a reasonable overlap in transcripts identified by the two technologies. This overlap is distorted by the scoring and thresholding in the tiling array scoring procedure.
Alternative splicing (AS) is a process which generates several distinct mRNA isoforms from the same gene by splicing different portions out of the precursor transcript. Due to the (patho-)physiological importance of AS, a complete inventory of AS is of great interest. While this is in reach for human and mammalian model organisms, our knowledge of AS in plants has remained more incomplete. Experimental approaches for monitoring AS are either based on transcript sequencing or rely on hybridization to DNA microarrays. Among the microarray platforms facilitating the discovery of AS events, tiling arrays are well-suited for identifying intron retention, the most prevalent type of AS in plants. However, analyzing tiling array data is challenging, because of high noise levels and limited probe coverage.
In this work, we present a novel method to detect intron retentions (IR) and exon skips (ES) from tiling arrays. While statistical tests have typically been proposed for this purpose, our method instead utilizes support vector machines (SVMs) which are appreciated for their accuracy and robustness to noise. Existing EST and cDNA sequences served for supervised training and evaluation. Analyzing a large collection of publicly available microarray and sequence data for the model plant A. thaliana, we demonstrated that our method is more accurate than existing approaches. The method was applied in a genome-wide screen which resulted in the discovery of 1,355 IR events. A comparison of these IR events to the TAIR annotation and a large set of short-read RNA-seq data showed that 830 of the predicted IR events are novel and that 525 events (39%) overlap with either the TAIR annotation or the IR events inferred from the RNA-seq data.
The method developed in this work expands the scarce repertoire of analysis tools for the identification of alternative mRNA splicing from whole-genome tiling arrays. Our predictions are highly enriched with known AS events and complement the A. thaliana genome annotation with respect to AS. Since all predicted AS events can be precisely attributed to experimental conditions, our work provides a basis for follow-up studies focused on the elucidation of the regulatory mechanisms underlying tissue-specific and stress-dependent AS in plants.
Tiling-arrays are applicable to multiple types of biological research questions. Owing to its advantages (high sensitivity, high resolution, and unbiased coverage), the technology is often employed in genome-wide investigations. A major challenge in the analysis of tiling-array data is to define regions-of-interest, i.e., contiguous probes with increased signal intensity (as a result of hybridization of labeled DNA) in a region. Currently, no standard criteria are available to define these regions-of-interest, as there is no single probe intensity cut-off level and different regions-of-interest can contain various numbers of probes and can vary in genomic width. Furthermore, the chromosomal distance between neighboring probes can vary across the genome and among different arrays.
We have developed Hypergeometric Analysis of Tiling-arrays (HAT), and first evaluated its performance for tiling-array datasets from a Chromatin Immunoprecipitation study on chip (ChIP-on-chip) for the identification of genome-wide DNA binding profiles of transcription factor Cebpa (used for method comparison). Using this assay, we can refine the detection of regions-of-interest by illustrating that regions detected by HAT are more highly enriched for expected motifs in comparison with an alternative detection method (MAT). Subsequently, data from a retroviral insertional mutagenesis screen were used to examine the performance of HAT among different applications of tiling-array datasets. In both studies, detected regions-of-interest have been validated with (q)PCR.
We demonstrate that HAT has increased specificity for analysis of tiling-array data in comparison with the alternative method, and that it accurately detects regions-of-interest in two different applications of tiling-arrays. HAT has several advantages over previous methods: i) as there is no single cut-off level for probe-intensity, HAT can detect regions-of-interest at various thresholds, ii) it can detect regions-of-interest of any size, iii) it is independent of probe-resolution across the genome, and across tiling-array platforms and iv) it employs a single user defined parameter: the significance level. Regions-of-interest are detected by computing the hypergeometric-probability, while controlling the Family Wise Error. Furthermore, the method does not require experimental replicates, common regions-of-interest are indicated, a sequence-of-interest can be examined for every detected region-of-interest, and flanking genes can be reported.
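The idea of scoring a run of probes with a hypergeometric probability under family-wise error control can be illustrated with a minimal sketch. This is not the HAT implementation: the fixed window size, the pre-computed "enriched" flags, the Bonferroni correction and the function names `hypergeom_upper_tail` and `significant_windows` are all assumptions made for illustration, whereas HAT itself is described as working across thresholds and region sizes.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n probes are drawn from N probes, K of which are enriched."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(max(k, 0), min(n, K) + 1))
    return hits / total


def significant_windows(enriched, window, alpha):
    """Start indices of fixed-size windows that pass a Bonferroni cutoff.

    `enriched` is a list of booleans, one per probe in genomic order
    (hypothetical input: True means the probe exceeds some intensity cutoff).
    Bonferroni is used here only as one simple way to control the
    family-wise error; it is an assumption, not HAT's documented procedure.
    """
    N, K = len(enriched), sum(enriched)
    n_windows = N - window + 1
    if n_windows < 1:
        raise ValueError("window is longer than the probe list")
    cutoff = alpha / n_windows
    starts = []
    for s in range(n_windows):
        k = sum(enriched[s:s + window])
        if hypergeom_upper_tail(k, window, K, N) <= cutoff:
            starts.append(s)
    return starts


if __name__ == "__main__":
    # Toy data: 100 probes with one run of 8 enriched probes at positions 40..47.
    flags = [False] * 40 + [True] * 8 + [False] * 52
    print(significant_windows(flags, window=8, alpha=0.05))
```

In this toy setting the only user-facing statistical parameter is `alpha`, mirroring the single significance-level parameter mentioned above; everything else in the sketch is a simplification.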
The complexity of mammalian transcriptomes is compounded by alternative splicing which allows one gene to produce multiple transcript isoforms. However, transcriptome comparison has been limited to differential analysis at the gene level instead of the individual transcript isoform level. High-throughput sequencing technologies and high-resolution tiling arrays provide an unprecedented opportunity to compare transcriptomes at the level of individual splice variants. However, sequence read coverage or probe intensity at each position may represent a family of splice variants instead of one single isoform. Here we propose a hierarchical Bayesian model, BASIS (Bayesian Analysis of Splicing IsoformS), to infer the differential expression level of each transcript isoform in response to two conditions. A latent variable was introduced to perform direct statistical selection of differentially expressed isoforms. Model parameters were inferred based on an ergodic Markov chain generated by our Gibbs sampler. BASIS has the ability to borrow information across different probes (or positions) from the same genes and different genes. BASIS can handle the heteroskedasticity of probe intensity or sequence read coverage. We applied BASIS to a human tiling-array data set and a mouse RNA-seq data set. Some of the predictions were validated by quantitative real-time RT–PCR experiments.
Short-read RNA sequencing in mouse and human tissues shows that most transcripts are encoded within or nearby known genes and that most of the genome is not transcribed.
A series of reports over the last few years have indicated that a much larger portion of the mammalian genome is transcribed than can be accounted for by currently annotated genes, but the quantity and nature of these additional transcripts remains unclear. Here, we have used data from single- and paired-end RNA-Seq and tiling arrays to assess the quantity and composition of transcripts in PolyA+ RNA from human and mouse tissues. Relative to tiling arrays, RNA-Seq identifies many fewer transcribed regions (“seqfrags”) outside known exons and ncRNAs. Most nonexonic seqfrags are in introns, raising the possibility that they are fragments of pre-mRNAs. The chromosomal locations of the majority of intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and terminator-associated transcripts, or new alternative exons; indeed, reads that bridge splice sites identified 4,544 new exons, affecting 3,554 genes. Most of the remaining seqfrags correspond to either single reads that display characteristics of random sampling from a low-level background or several thousand small transcripts (median length = 111 bp) present at higher levels, which also tend to display sequence conservation and originate from regions with open chromatin. We conclude that, while there are bona fide new intergenic transcripts, their number and abundance is generally low in comparison to known exons, and the genome is not as pervasively transcribed as previously reported.
The human genome was sequenced a decade ago, but its exact gene composition remains a subject of debate. The number of protein-coding genes is much lower than initially expected, and the number of distinct transcripts is much larger than the number of protein-coding genes. Moreover, the proportion of the genome that is transcribed in any given cell type remains an open question: results from “tiling” microarray analyses suggest that transcription is pervasive and that most of the genome is transcribed, whereas new deep sequencing-based methods suggest that most transcripts originate from known genes. We have addressed this discrepancy by comparing samples from the same tissues using both technologies. Our analyses indicate that RNA sequencing appears more reliable for transcripts with low expression levels, that most transcripts correspond to known genes or are near known genes, and that many transcripts may represent new exons or aberrant products of the transcription process. We also identify several thousand small transcripts that map outside known genes; their sequences are often conserved and are often encoded in regions of open chromatin. We propose that most of these transcripts may be by-products of the activity of enhancers, which associate with promoters as part of their role as long-range gene regulatory sites. Overall, however, we find that most of the genome is not appreciably transcribed.
High-density oligonucleotide microarrays are an appropriate technology for genomic analysis, and are particularly useful in the generation of transcriptional maps, ChIP-on-chip studies and re-sequencing of the genome. Transcriptome analysis of tiling microarray data facilitates the discovery of novel transcripts and the assessment of differential expression in diverse experimental conditions. Although new technologies such as next-generation sequencing have appeared, microarrays might still be useful for the study of small genomes or for the analysis of genomic regions with custom microarrays due to their lower price and good accuracy in expression quantification.
Here, we propose a novel wavelet-based method, named ZCL (zero-crossing lines), for the combined denoising and segmentation of tiling signals. The denoising is performed with the classical SUREshrink method and the detection of transcriptionally active regions is based on the computation of the Continuous Wavelet Transform (CWT). In particular, the detection of the transitions is implemented as the thresholding of the zero-crossing lines. The algorithm described has been applied to the public Saccharomyces cerevisiae dataset and it has been compared with two well-known algorithms: pseudo-median sliding window (PMSW) and the structural change model (SCM). As a proof-of-principle, we applied the ZCL algorithm to the analysis of the custom tiling microarray hybridization results of a S. aureus mutant deficient in the sigma B transcription factor. The challenge was to identify those transcripts whose expression decreases in the absence of sigma B.
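To make the notion of locating transitions at zero-crossings concrete, the following is a minimal single-scale sketch. It is not the ZCL algorithm: it uses one fixed scale of a Ricker (Mexican hat) wavelet rather than a full Continuous Wavelet Transform, omits the SUREshrink denoising step, and the function names, the `sigma` value and the `min_jump` threshold are assumptions chosen for illustration.

```python
from math import exp


def ricker_kernel(sigma, half_width):
    """Discrete Ricker (Mexican hat) wavelet, shifted to have zero mean."""
    raw = [(1 - (t / sigma) ** 2) * exp(-(t * t) / (2 * sigma * sigma))
           for t in range(-half_width, half_width + 1)]
    mean = sum(raw) / len(raw)
    return [v - mean for v in raw]


def wavelet_response(signal, sigma=3.0):
    """Correlate the signal with the wavelet at one scale (edges are replicated)."""
    half = int(4 * sigma)
    kernel = ricker_kernel(sigma, half)
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(signal))]


def zero_crossings(response, min_jump):
    """Indices where the response changes sign with a jump of at least min_jump.

    The threshold discards sign flips caused by near-zero numerical noise
    in flat stretches of the signal.
    """
    return [i + 1 for i in range(len(response) - 1)
            if response[i] * response[i + 1] < 0
            and abs(response[i + 1] - response[i]) >= min_jump]


if __name__ == "__main__":
    # Toy tiling signal: inactive for 50 probes, then transcriptionally active.
    signal = [0.0] * 50 + [1.0] * 50
    print(zero_crossings(wavelet_response(signal), min_jump=0.1))
```

A multi-scale version would repeat the same computation for several values of `sigma` and follow the crossings from scale to scale, which is closer in spirit to the zero-crossing lines described above.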
The proposed method achieves the best performance in terms of positive predictive value (PPV), while its sensitivity is similar to that of the other algorithms used for the comparison. The computation time needed to process the transcriptional signals is low compared with model-based methods and in the same range as that of filter-based methods. Automatic parameter selection has been incorporated and, moreover, the method can be easily adapted to a parallel implementation. We can conclude that the proposed method is well suited for the analysis of tiling signals, in which transcriptional activity is often hidden in the noise. Finally, the quantification and differential expression analysis of the S. aureus dataset have demonstrated the value of this novel tool for the biological analysis of the S. aureus transcriptome.
Statistical analysis of tiling array data is extremely challenging due to the astronomically large number of sequence probes, the high noise levels of individual probes and the limited number of replicates in these data. To overcome these difficulties, we first developed statistical error estimation and weighted ANOVA modeling approaches for high-density tiling array data, the former based on an advanced error-pooling method to accurately estimate the heterogeneous technical error of small-sample tiling array data. Based on these approaches, we analyzed the high-density tiling array data of the temporal replication patterns during cell-cycle S phase of synchronized HeLa cells on human chromosomes 21 and 22. We found many novel temporal replication patterns, identifying about 26% of over 1 million tiling array sequence probes with significant differential replication during the four 2-h time periods of S phase. Among these differentially replicated probes, 126 941 sequence probes were matched to 417 known genes. The majority of these genes were found to be replicated within one or two consecutive time periods, while the others were replicated at two non-consecutive time periods. Also, coding regions were found to be more differentially replicated in particular time periods than noncoding regions in the gene-poor chromosome 21 (25% differentially replicated among genic probes versus 18.6% among intergenic probes), while such a phenomenon was less prominent in gene-rich chromosome 22. A rigorous statistical test for local proximity of differentially replicated genic and intergenic probes was performed to identify significant stretches of differentially replicated sequence regions. From this analysis, we found that adjacent genes were frequently replicated at different time periods, potentially implying the existence of quite dense replication origins.
Evaluating the conditional probability significance of identified gene ontology terms on chromosomes 21 and 22, we detected some over-represented molecular functions and biological processes among these differentially replicated genes, such as the ones relevant to hydrolase, transferase and receptor-binding activities. Some of these results were confirmed showing >70% consistency with cDNA microarray data that were independently generated in parallel with the tiling arrays. Thus, our improved analysis approaches specifically designed for high-density tiling array data enabled us to reliably and sensitively identify many novel temporal replication patterns on human chromosomes.
Currently, most RNA-seq experiments are performed on the Illumina platform, but other companies are competing for market share. In this highly competitive environment, cross-platform comparisons and/or validations are becoming increasingly critical. Results of several comparisons in which the same samples were studied using Illumina and Ion Torrent RNA-seq and different microarray-based approaches are presented. To prepare the libraries, the RNA samples were processed using the Illumina TruSeq protocol (a protocol capturing polyadenylated RNA) and sequenced on the Illumina HiSeq 2500, producing 100x100-nt paired-end reads. The same samples were processed using the Ion Torrent Total RNA-Seq V2 protocol, which is capable of capturing non-coding RNA and preserves strand specificity. These libraries were sequenced on the Ion Proton using the P1 chip and produced reads of up to 200 nt. The data obtained with both platforms were compared for quality, alignment statistics, error rates, evenness and continuity of coverage, RNA biotype representation, and accuracy for expression profiling. Additionally, a detailed comparison of technical aspects including input amount, throughput, experimental time and reagent costs is presented. Lastly, the same samples were interrogated using Agilent V2 Human Whole Genome arrays, Affymetrix Gene arrays ST (1.0 and 2.0) and newly commercialized Affymetrix Human Transcriptome Arrays. There was a significant correlation between the Illumina and Ion Torrent RNA-Seq gene expression data and microarray data generated from the same samples; however, RNA-Seq detects additional transcripts whose expression was either not interrogated or not detected by microarrays.
High-density tiling microarrays are a powerful tool for the characterization of complete genomes. The two major computational challenges associated with custom-made arrays are design and analysis. Firstly, several genome dependent variables, such as the genome's complexity and sequence composition, need to be considered in the design to ensure a high quality microarray. Secondly, since tiling projects today very often exceed the limits of conventional array-experiments, researchers cannot use established computer tools designed for commercial arrays, and instead have to redesign previous methods or create novel tools.
Here we describe the multiple aspects involved in the design of tiling arrays for transcriptome analysis and detail the normalisation and analysis procedures for such microarrays. We introduce a novel design method to make two 280,000 feature microarrays covering the entire genome of the bacterial species Escherichia coli and Neisseria meningitidis, respectively, as well as the use of multiple copies of control probe-sets on tiling microarrays. Furthermore, a novel normalisation and background estimation procedure for tiling arrays is presented along with a method for array analysis focused on detection of short transcripts. The design, normalisation and analysis methods have been applied in various experiments and several of the detected novel short transcripts have been biologically confirmed by Northern blot tests.
Tiling-arrays are becoming increasingly applicable in genomic research, but researchers still lack both the tools for custom design of arrays, as well as the systems and procedures for analysis of the vast amount of data resulting from such experiments. We believe that the methods described herein will be a useful contribution and resource for researchers designing and analysing custom tiling arrays for both bacteria and higher organisms.
Recent sequencing technologies that allow massive parallel production of short reads are the method of choice for transcriptome analysis. Particularly, digital gene expression (DGE) technologies produce a large dynamic range of expression data by generating short tag signatures for each cell transcript. These tags can be mapped back to a reference genome to identify new transcribed regions that can be further covered by RNA-sequencing (RNA-Seq) reads. Here, we applied an integrated bioinformatics approach that combines DGE tags, RNA-Seq, tiling array expression data and species-comparison to explore new transcriptional regions and their specific biological features, particularly tissue expression or conservation. We analysed tags from a large DGE data set (designated as ‘TranscriRef’). We then annotated 750 000 tags that were uniquely mapped to the human genome according to Ensembl. We retained transcripts originating from both DNA strands and categorized tags corresponding to protein-coding genes, antisense, intronic- or intergenic-transcribed regions and computed their overlap with annotated non-coding transcripts. Using this bioinformatics approach, we identified ∼34 000 novel transcribed regions located outside the boundaries of known protein-coding genes. As demonstrated using sequencing data from human pluripotent stem cells for biological validation, the method could be easily applied for the selection of tissue-specific candidate transcripts. DigitagCT is available at http://cractools.gforge.inria.fr/softwares/digitagct.
The Microarray Core Facility (MCF) at Baylor College of Medicine provides investigators with access to a variety of state-of-the-art technologies and approaches that will enhance discovery for their genomic research. We house instrumentation supporting Affymetrix, Agilent, NimbleGen, Luminex, and Illumina platforms. The MCF provides expertise in the following applications: gene expression, array comparative genomic hybridization (aCGH), SNP genotyping, and next-generation sequencing. In addition, our lab offers services for sample quality check and a cDNA clone repository for those who are interested in verifying results from gene expression experiments or any other application requiring cDNA clones. The MCF specializes in RNA applications that enable researchers to monitor genome-wide expression profiles through Affymetrix, Agilent and NimbleGen expression arrays. Agilent's aCGH and Affymetrix SNP Arrays are also offered, providing detection of copy number variations across the genome. Other related services include: tiling arrays, ChIP-on-chip arrays, SuperArray, Promoter Arrays, and Panomics. Due to the increased demand for rapid DNA sequencing, the facility now provides massively parallel “next generation” sequencing on the Illumina Genome Analyzer II. Our core lab has established a workflow involving project consultation, sample quality check, sample preparation and data generation for each sequencing project. Illumina's sequencing platform provides high-quality data in the following applications: gene expression and alternative splicing (mRNA-Seq), protein-nucleic acid association profiling and epigenetics (ChIP-Seq), sequencing of targeted genomic regions, small RNA discovery (small RNA-Seq) and de novo sequencing. The MCF offers investigators access to an array of emerging technologies while assisting in experimental design and data analysis.
As a powerful tool in whole-genome analysis, tiling arrays have been widely used to answer many genomic questions. They can now also serve as capture devices for library preparation in high-throughput sequencing experiments. Thus, a flexible and efficient tiling array design approach is still needed and could assist in transcriptomic experiments of various types and scales.
In this paper, we address issues and challenges in designing probes suitable for tiling array applications and targeted sequencing. In particular, we define the penalized uniqueness score, which serves as a controlling criterion to eliminate potential cross-hybridization, and a flexible tiling array design pipeline. Compared with BLAST or simple suffix-array-based methods, computing and using our uniqueness measurement can be more efficient for large-scale design and requires less memory. The parameters provided could assist in various types of genomic tiling tasks. In addition, using both commercial array data and experimental data, we show, contrary to previous claims, that palindromic sequences exhibit relatively lower uniqueness.
Our proposed penalized uniqueness score could serve as a better indicator for cross hybridization with higher sensitivity and specificity, giving more control of expected array quality. The flexible tiling design algorithm incorporating the penalized uniqueness score was shown to give higher coverage and resolution. The package to calculate the penalized uniqueness score and the described probe selection algorithm are implemented as a Perl program, which is freely available at http://www1.fbn-dummerstorf.de/en/forschung/fbs/fb3/paper/2012-yang-1/OTAD.v1.1.tar.gz.
Tiling array; Targeted sequencing; Probe design; Penalized uniqueness score
RNAi screens via pooled short hairpin RNAs (shRNAs) have recently become a powerful tool for the identification of essential genes in mammalian cells. In the past years, several pooled large-scale shRNA screens have identified a variety of genes involved in cancer cell proliferation. All of those studies employed microarray analysis, utilizing either the shRNA's half hairpin sequence or an additional shRNA-associated 60 nt barcode sequence as a molecular tag. Here we describe a novel method to decode pooled RNAi screens, namely barcode tiling array analysis, and demonstrate how this approach can be used to precisely quantify the abundance of individual shRNAs from a pool.
We synthesized DNA microarrays with six overlapping 25 nt long tiling probes complementary to each unique 60 nt molecular barcode sequence associated with every shRNA expression construct. By analyzing dilution series of expression constructs we show how our approach allows quantification of shRNA abundance from a pool and how it clearly outperforms the commonly used analysis via the shRNA's half hairpin sequences. We further demonstrate how barcode tiling arrays can be used to predict anti-proliferative effects of individual shRNAs from pooled negative selection screens. Out of a pool of 305 shRNAs, we identified 28 candidate shRNAs to fully or partially impair the viability of the breast carcinoma cell line MDA-MB-231. Individual validation of a subset of eleven shRNA expression constructs with potential inhibitory, as well as non-inhibitory, effects on the cell line proliferation provides further evidence for the accuracy of the barcode tiling approach.
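The probe layout described above can be sketched in a few lines. The abstract gives only the counts (six overlapping 25 nt probes per 60 nt barcode); the even spacing, the reverse-complement orientation and the names `tile_starts`, `reverse_complement` and `barcode_probes` are assumptions made for illustration. Under even spacing the step works out to (60 - 25) / (6 - 1) = 7 nt.

```python
def tile_starts(target_len, probe_len, n_probes):
    """Evenly spaced probe start offsets covering the whole target (assumed layout)."""
    if n_probes < 1 or probe_len > target_len:
        raise ValueError("probes must fit inside the target")
    if n_probes == 1:
        return [0]
    span = target_len - probe_len
    return [round(i * span / (n_probes - 1)) for i in range(n_probes)]


_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def barcode_probes(barcode, probe_len=25, n_probes=6):
    """Probes complementary to overlapping windows of one barcode."""
    starts = tile_starts(len(barcode), probe_len, n_probes)
    return [reverse_complement(barcode[s:s + probe_len]) for s in starts]


if __name__ == "__main__":
    print(tile_starts(60, 25, 6))  # [0, 7, 14, 21, 28, 35]
    barcode = "ACGT" * 15          # hypothetical 60 nt barcode, not a real construct
    for probe in barcode_probes(barcode):
        print(probe)
```

With a 7 nt step, adjacent probes share 18 nt, so each barcode position away from the ends is read by several probes, which is one plausible reason a tiled design gives a steadier abundance estimate than a single probe.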
In summary, we present an improved method for the rapid, quantitative and statistically robust analysis of pooled RNAi screens. Our experimental approach, coupled with commercially available lentiviral vector shRNA libraries, has the potential to greatly facilitate the discovery of putative targets for cancer therapy as well as sensitizers of drug toxicity.
Array comparative genomic hybridization is a fast and cost-effective method for detecting, genotyping, and comparing the genomic sequence of unknown bacterial isolates. This method, as with all microarray applications, requires adequate coverage of probes targeting the regions of interest. An unbiased tiling of probes across the entire length of the genome is the most flexible design approach. However, such a whole-genome tiling requires that the genome sequence is known in advance. For the accurate analysis of uncharacterized bacteria, an array must query a fully representative set of sequences from the species' pan-genome. Prior microarrays have included only a single strain per array or the conserved sequences of gene families. These arrays omit potentially important genes and sequence variants from the pan-genome.
This paper presents a new probe selection algorithm (PanArray) that can tile multiple whole genomes using a minimal number of probes. Unlike arrays built on clustered gene families, PanArray uses an unbiased, probe-centric approach that does not rely on annotations, gene clustering, or multi-alignments. Instead, probes are evenly tiled across all sequences of the pan-genome at a consistent level of coverage. To minimize the required number of probes, probes conserved across multiple strains in the pan-genome are selected first, and additional probes are used only where necessary to span polymorphic regions of the genome. The viability of the algorithm is demonstrated by array designs for seven different bacterial pan-genomes and, in particular, the design of a 385,000 probe array that fully tiles the genomes of 20 different Listeria monocytogenes strains with overlapping probes at greater than twofold coverage.
PanArray is an oligonucleotide probe selection algorithm for tiling multiple genome sequences using a minimal number of probes. It is capable of fully tiling all genomes of a species on a single microarray chip. These unique pan-genome tiling arrays provide maximum flexibility for the analysis of both known and uncharacterized strains.
Motivation: Individual probes on an Affymetrix tiling array usually behave differently. Modeling and removing these probe effects are critical for detecting signals from the array data. Current data processing techniques either require control samples or use probe sequences to model probe-specific variability, such as with MAT. Although the MAT approach can be applied without control samples, residual probe effects continue to distort the true biological signals.
Results: We propose TileProbe, a new technique that builds upon the MAT algorithm by incorporating publicly available data sets to remove tiling array probe effects. By using a large number of these readily available arrays, TileProbe robustly models the residual probe effects that the MAT model cannot explain. When applied to analyzing ChIP-chip data, TileProbe performs consistently better than MAT across a variety of analytical conditions. This shows that TileProbe resolves the issue of probe-specific effects more completely.
Supplementary information: Supplementary data are available at Bioinformatics online.
Advantages of RNA-Seq over array-based platforms are quantitative gene expression and discovery of expressed single nucleotide variants (eSNVs) and fusion transcripts from a single platform, but the sensitivity for each of these characteristics is unknown. We measured gene expression in a set of manually degraded RNAs and in nine pairs of matched fresh-frozen and FFPE RNA isolated from breast tumor, using the hybridization-based NanoString nCounter (226-gene panel) and whole-transcriptome RNA-Seq with RiboZeroGold ScriptSeq V2 library preparation kits. We performed correlation analyses of gene expression between samples and across platforms. We then specifically assessed whole-transcriptome expression of lincRNA and discovery of eSNVs and fusion transcripts in the FFPE RNA-Seq data. For gene expression in the manually degraded samples, we observed Pearson correlations of >0.94 and >0.80 with the NanoString and ScriptSeq protocols, respectively. Gene expression data for matched fresh-frozen and FFPE samples yielded mean Pearson correlations of 0.874 and 0.783 for the NanoString (226 genes) and ScriptSeq whole-transcriptome protocols, respectively (p < 2 × 10^-16). Specifically for lincRNAs, we observed a very high Pearson correlation (0.988) between matched fresh-frozen and FFPE pairs. FFPE samples across NanoString and RNA-Seq platforms gave a mean Pearson correlation of 0.838. In FFPE libraries, we detected 53.4% of high-confidence SNVs and 24% of high-confidence fusion transcripts. The limited sensitivity of fusion transcript detection was not overcome by an increase in sequencing depth of up to 3-fold (from ~56 to ~159 million reads). Both NanoString and ScriptSeq RNA-Seq technologies yield reliable gene expression data for degraded and FFPE material. The high degree of correlation between NanoString and RNA-Seq platforms suggests that discovery-based whole-transcriptome studies from FFPE material will produce reliable expression data.
The RiboZeroGold ScriptSeq protocol performed particularly well for lincRNA expression from FFPE libraries, but detection of eSNV and fusion transcripts was less sensitive.
Genomic tiling arrays, cDNA sequencing and, more recently, RNA-Seq have provided initial insights into the extent and depth of transcribed sequence across human and other genomes. These methods have led to greatly improved annotations of protein-coding genes, but have also identified transcription outside of annotated exons. One resultant issue that has aroused dispute is the balance of transcription of known exons against transcription outside of known exons. While non-genic ‘dark matter’ transcription was found by tiling arrays to be pervasive, it was seen to contribute only a small percentage of the polyadenylated transcriptome in some RNA-Seq experiments. This apparent contradiction has been compounded by a lack of clarity about what exactly constitutes a protein-coding gene. It remains unclear, for example, whether or not all transcripts that overlap on either strand within a genomic locus should be assigned to a single gene locus, including those that fail to share promoters, exons and splice junctions. The inability of tiling arrays and RNA-Seq to count transcripts, rather than exons or exon pairs, adds to these difficulties. While there is agreement that thousands of apparently non-coding loci are present outside of protein-coding genes in the human genome, there is vigorous debate of what constitutes evidence for their functionality. These issues will only be resolved upon the demonstration, or otherwise, that organismal or cellular phenotypes frequently result when non-coding RNA loci are disrupted.
Probing protein-deoxyribonucleic acid (DNA) interactions is gaining popularity as it sheds light on molecular mechanisms that regulate the expression of genes. Currently, tiling arrays and next-generation sequencing technology can be used to measure these interactions. Both methods generate a signal over the genome in which contiguous regions of peaks represent the presence of an interacting molecule. Many methods exist to identify functional regions of interest (ROIs) on the genome. However, detecting ROIs is often not the end-point of a research question, so data must be moved between tools to relate the ROIs to information present in databases, such as gene ontology, pathway information, or enrichment of certain genomic content. We introduce hypergeometric analysis of tiling-array and sequence data (HATSEQ), a powerful tool that accurately identifies functional ROIs on the genome where a genomic signal significantly deviates from the general genome-wide behavior. HATSEQ also includes a number of built-in post-analyses with which biological meaning can be attached to the detected ROIs in terms of gene pathways and de novo motif analysis, and provides different visualizations and statistical summaries for the detected ROIs. In addition, HATSEQ has an intuitive graphical user interface that lowers the barrier for researchers to analyze their data without the need for scripting languages. We compared the results of HATSEQ against two other popular chromatin immunoprecipitation sequencing (ChIP-Seq) methods and observed overlap in the detected ROIs, although HATSEQ was more specific in delineating the peak boundaries. We also illustrate the versatility of HATSEQ using a Signal Transducer and Activator of Transcription 1 (STAT1) ChIP-Seq data set, and show that the detected ROIs are highly specific for the expected STAT1 binding motif. HATSEQ is freely available at: http://hema13.erasmusmc.nl/index.php/HATSEQ.
bioinformatics; NGS analysis; ChIP-Seq; peak detection
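The abstract above names a hypergeometric analysis but does not spell out the computation, so the following is only a generic sketch of a one-sided hypergeometric enrichment test of the kind commonly used for pathway or gene-ontology enrichment. It is not HATSEQ's implementation; the function name `hypergeom_upper_tail` and all numbers in the usage example are assumptions made for illustration.

```python
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) for X ~ Hypergeometric(population, successes, draws).

    population: total number of genes considered (hypothetical background)
    successes:  genes in the background annotated to the pathway
    draws:      genes associated with detected ROIs
    k:          ROI-associated genes that are annotated to the pathway
    """
    if not (0 <= successes <= population and 0 <= draws <= population):
        raise ValueError("successes and draws must lie within the population")
    if k < 0:
        raise ValueError("k must be non-negative")
    upper = min(successes, draws)
    total = comb(population, draws)
    # math.comb returns 0 when the second argument exceeds the first,
    # so impossible outcomes contribute nothing to the sum.
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


if __name__ == "__main__":
    # Hypothetical numbers: 20000 background genes, 150 in the pathway,
    # 400 ROI-associated genes, 12 of which are in the pathway.
    p_value = hypergeom_upper_tail(12, 20000, 150, 400)
    print(f"enrichment p-value: {p_value:.3g}")
```

Exact integer binomial coefficients are used here instead of a floating-point approximation, which is slower for large inputs but avoids rounding problems in the tail.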
Short-read high-throughput DNA sequencing technologies provide new tools to answer biological questions. However, high cost and low throughput limit their widespread use, particularly in organisms with smaller genomes such as S. cerevisiae. Although ChIP-Seq in mammalian cell lines is replacing array-based ChIP-chip as the standard for transcription factor binding studies, ChIP-Seq in yeast is still underutilized compared to ChIP-chip. We developed a multiplex barcoding system that allows simultaneous sequencing and analysis of multiple samples using Illumina's platform. We applied this method to analyze the chromosomal distributions of three yeast DNA binding proteins (Ste12, Cse4 and RNA PolII) and a reference sample (input DNA) in a single experiment and demonstrate its utility for rapid and accurate results at reduced costs.
We developed a barcoding ChIP-Seq method for the concurrent analysis of transcription factor binding sites in yeast. Our multiplex strategy generated high-quality data that were indistinguishable from data obtained with non-barcoded libraries. None of the barcoded adapters induced differences relative to a non-barcoded adapter when applied to the same DNA sample. We used this method to map the binding sites for Cse4, Ste12 and PolII throughout the yeast genome, and found 148 binding targets for Cse4, 823 targets for Ste12 and 2508 targets for PolII. Cse4 was strongly bound to all yeast centromeres, as expected, and the remaining non-centromeric targets correspond to genes that are highly expressed in rich media. The presence of Cse4 non-centromeric binding sites has not been reported previously.
We designed a multiplex short-read DNA sequencing method to perform efficient ChIP-Seq in yeast and other small genome model organisms. This method produces accurate results with higher throughput and reduced cost. Given constant improvements in high-throughput sequencing technologies, increasing multiplexing will be possible to further decrease costs per sample and to accelerate the completion of large consortium projects such as modENCODE.
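The sample-assignment step implied by a multiplex barcoding strategy can be sketched as follows. This is a minimal illustration that assumes each read begins with a fixed-length barcode and requires an exact match; the barcode sequences, the function name `demultiplex`, and the "unassigned" bin are hypothetical and do not reproduce the adapters or software used in the study.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Split reads into per-sample bins by their leading barcode.

    reads:    iterable of read sequences (strings)
    barcodes: dict mapping barcode sequence -> sample name;
              all barcodes must have the same length
    Returns a dict mapping sample name -> list of reads with the
    barcode trimmed off. Reads with no exact barcode match are kept
    untrimmed under the key "unassigned".
    """
    lengths = {len(b) for b in barcodes}
    if len(lengths) != 1:
        raise ValueError("all barcodes must have the same length")
    (barcode_len,) = lengths

    bins = defaultdict(list)
    for read in reads:
        sample = barcodes.get(read[:barcode_len])
        if sample is None:
            bins["unassigned"].append(read)
        else:
            bins[sample].append(read[barcode_len:])
    return dict(bins)


if __name__ == "__main__":
    # Hypothetical 4-nt barcodes for the four libraries named in the text.
    example_barcodes = {
        "ACGT": "Ste12",
        "TGCA": "Cse4",
        "GATC": "PolII",
        "CTAG": "input",
    }
    example_reads = ["ACGTTTGACC", "CTAGGGATCA", "NNNNACGTAC"]
    for sample, seqs in demultiplex(example_reads, example_barcodes).items():
        print(sample, seqs)
```

Exact matching is the simplest policy; a production pipeline would typically also tolerate sequencing errors in the barcode, which this sketch deliberately leaves out.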
The genome-wide distribution patterns of the ‘6th base’ 5-hydroxymethylcytosine (5hmC) in many tissues and cells have recently been revealed by hydroxymethylated DNA immunoprecipitation (hMeDIP) followed by high throughput sequencing or tiling arrays. However, it has been challenging to directly compare different data sets and samples using data generated by this method. Here, we report a new comparative hMeDIP-seq method, which involves barcoding different input DNA samples at the start and then performing hMeDIP-seq for multiple samples in one hMeDIP reaction. This approach extends the barcode technology from simply multiplexing the DNA deep sequencing outcome and provides significant advantages for quantitative control of all experimental steps, from unbiased hMeDIP to deep sequencing data analysis. Using this improved method, we profiled and compared the DNA hydroxymethylomes of mouse ES cells (ESCs) and mouse ESC-derived neural progenitor cells (NPCs). We identified differentially hydroxymethylated regions (DHMRs) between ESCs and NPCs and uncovered an intricate relationship between the alteration of DNA hydroxymethylation and changes in gene expression during neural lineage commitment of ESCs. Presumably, the DHMRs between ESCs and NPCs uncovered by this approach may provide new insight into the function of 5hmC in gene regulation and neural differentiation. Thus, this newly developed comparative hMeDIP-seq method provides a cost-effective and user-friendly strategy for direct genome-wide comparison of DNA hydroxymethylation across multiple samples, lending significant biological, physiological and clinical implications.
Transcriptomic studies in clinical research are essential tools for deciphering the functional elements of the genome and unraveling underlying disease mechanisms. Various technologies have been developed to deduce and quantify the transcriptome including hybridization and sequencing-based approaches. Recently, high density exon microarrays have been successfully employed for detecting differentially expressed genes and alternative splicing events for biomarker discovery and disease diagnostics. The field of transcriptomics is currently being revolutionized by high throughput DNA sequencing methodologies to map, characterize, and quantify the transcriptome.
In an effort to understand the merits and limitations of each of these tools, we undertook a study of the transcriptome in sickle cell disease, a monogenic disease, comparing the Affymetrix Human Exon 1.0 ST microarray (Exon array) and Illumina’s deep sequencing technology (RNA-seq) on whole blood clinical specimens.
Analysis indicated a strong concordance (R = 0.64) between Exon array and RNA-seq data at both gene level and exon level transcript expression. The magnitude of differential expression was found to be generally higher in RNA-seq than in the Exon microarrays. We also demonstrate for the first time the ability of RNA-seq technology to discover novel transcript variants and differential expression in previously unannotated genomic regions in sickle cell disease. In addition to detecting expression level changes, RNA-seq technology was also able to identify sequence variation in the expressed transcripts.
Our findings suggest that microarrays remain useful and accurate for transcriptomic analysis of clinical samples with low input requirements, while RNA-seq technology complements and extends microarray measurements for novel discoveries.
Sickle cell disease; RNA-Seq; Exon arrays; Transcriptome; Clinical genomics
Dose-dependent differential gene expression provides critical information required for regulatory decision-making. The lower costs associated with RNA-Seq have made it the preferred technology for transcriptomic analysis. However, concordance between RNA-Seq and microarray analyses in dose response studies has not been adequately vetted.
We compared the hepatic transcriptome of C57BL/6 mice following gavage with sesame oil vehicle, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, or 30 μg/kg TCDD every 4 days for 28 days using Illumina HiSeq RNA-Sequencing (RNA-Seq) and Agilent 4×44 K microarrays using the same normalization and analysis approach. RNA-Seq and microarray analysis identified a total of 18,063 and 16,403 genes, respectively, that were expressed in the liver. RNA-Seq analysis for differentially expressed genes (DEGs) varied dramatically depending on the P1(t) cut-off while microarray results varied more based on the fold change criteria, although responses strongly correlated. Verification by WaferGen SmartChip QRTPCR revealed that RNA-Seq had a false discovery rate of 24% compared to 54% for microarray analysis. Dose–response modeling of RNA-Seq and microarray data demonstrated similar point of departure (POD) and ED50 estimates for common DEGs.
There was a strong correspondence between RNA-Seq and Agilent array transcriptome profiling when using the same samples and analysis strategy. However, RNA-Seq provided superior quantitative data, identifying more genes and DEGs, as well as qualitative information regarding identity and annotation for dose response modeling in support of regulatory decision-making.
Electronic supplementary material
The online version of this article (doi:10.1186/s12864-015-1527-z) contains supplementary material, which is available to authorized users.
RNA-Seq; Microarray; Comparison; TCDD; Dose–response; Mouse; Liver
RNA-seq is a powerful tool used to obtain in-depth information on expression profiling, gene annotation, and transcript discovery. With the growing popularity of RNA sequencing, new library preparation techniques are becoming commercially available. These techniques are improvements on the classic poly-A selection and rRNA reduction methods, and in some cases are sensitive enough to analyze the transcriptome of a single cell. However, limited information is available on comparative analysis of these methods and their appropriate application for transcriptome studies. We utilized Illumina's HiSeq technology to compare the merits of four commercial sample preparation kits: NuGen's Ovation RNA-seq system v2, Illumina's TruSeq RNA Sample Preparation kit v2, Epicentre's ScriptSeq RNA-seq kit v2 and Clontech's SMARTer Ultra Low RNA kit. We found that the quality of input RNA was critical for optimum performance of the SMARTer Ultra Low RNA kit. The Ovation and ScriptSeq kits, on the other hand, also worked well with moderate-quality input RNA. Based on analysis of the sequencing data, 12% of reads from ScriptSeq mapped to mitochondrial genes, compared to 24% of reads from Ovation. The library complexity and the percentage of reads aligning to non-exonic regions were similar between the two kits. However, 28% of reads aligned to the coding region for ScriptSeq versus 18% for Ovation. While the TruSeq and SMARTer kits are designed for poly-A-containing RNAs only, the ScriptSeq and Ovation kits provide a more global analysis of the transcriptome. Analyzing the differences between these methods provides a better understanding of the specific advantages of each. This information is especially useful for sequencing core facilities, which need to recommend and apply appropriate methods to different transcriptome studies.
To demonstrate the benefits of RNA-Seq over microarray in transcriptome profiling, both RNA-Seq and microarray analyses were performed on RNA samples from a human T cell activation experiment. In contrast to other reports, our analyses focused on the difference, rather than similarity, between RNA-Seq and microarray technologies in transcriptome profiling. A comparison of data sets derived from RNA-Seq and Affymetrix platforms using the same set of samples showed a high correlation between gene expression profiles generated by the two platforms. However, it also demonstrated that RNA-Seq was superior in detecting low abundance transcripts, differentiating biologically critical isoforms, and allowing the identification of genetic variants. RNA-Seq also demonstrated a broader dynamic range than microarray, which allowed for the detection of more differentially expressed genes with higher fold-change. Analysis of the two datasets also showed the benefit derived from avoidance of technical issues inherent to microarray probe performance such as cross-hybridization, non-specific hybridization and limited detection range of individual probes. Because RNA-Seq does not rely on a pre-designed complement sequence detection probe, it is devoid of issues associated with probe redundancy and annotation, which simplified interpretation of the data. Despite the superior benefits of RNA-Seq, microarrays are still the more common choice of researchers when conducting transcriptional profiling experiments. This is likely because RNA-Seq sequencing technology is new to most researchers, more expensive than microarray, data storage is more challenging and analysis is more complex. We expect that once these barriers are overcome, the RNA-Seq platform will become the predominant tool for transcriptome analysis.
Antisense transcription is a pervasive phenomenon, but its source and functional significance are largely unknown. We took an expression-based approach to explore microRNA (miRNA)-related antisense transcription by computational analyses of published whole-genome tiling microarray transcriptome and deep sequencing small RNA (smRNA) data. Statistical support for greater abundance of antisense transcription signatures and smRNAs was observed for miRNA targets than for paralogous genes with no miRNA cleavage site. Antisense smRNAs were also found associated with MIRNA genes. This suggests that miRNA-associated “transitivity” (production of small interfering RNAs through antisense transcription) is more common than previously reported. High-resolution (3 nt) custom tiling microarray transcriptome analysis was performed with probes 400 bp 5′ upstream and 3′ downstream of the miRNA cleavage sites (direction relative to the mRNA) for 22 select miRNA target genes. We hybridized labeled RNAs from smRNA pathway mutants, including hen1-1, dcl1-7, hyl1-2, rdr6-15, and sgs3-14. Results showed that antisense transcripts associated with miRNA targets were mainly elevated in hen1-1 and, to a lesser extent, in sgs3-14, and were somewhat reduced in the dcl1-7, hyl1-2, and rdr6-15 mutants. This was corroborated by semi-quantitative reverse transcription PCR; however, a direct correlation of antisense transcript abundance in MIR164 gene knockouts was not observed. Our overall analysis reveals a more widespread role for miRNA-associated transitivity with implications for functions of antisense transcription in gene regulation. HEN1 and SGS3 may be links for miRNA target entry into different RNA processing pathways.
Antisense transcription is a pervasive but poorly understood phenomenon in a wide variety of organisms. We have found evidence for a novel source of antisense transcription in Arabidopsis thaliana associated with miRNA targets via computational analyses of published whole-genome tiling microarray data and deep sequencing smRNA datasets, and via custom high-resolution (3 nt) tiling microarray analysis. Our data show increased antisense transcription for select miRNA targets in hua enhancer1-1 (hen1-1), a smRNA methyltransferase mutant, and in the suppressor of gene silencing3-14 (sgs3-14) mutant, which affects post-transcriptional gene silencing and leaf development. Additional results suggest that miRNA targets and MIRNA genes are subject to the activities of both the miRNA and RNA silencing pathways, in which HEN1 and SGS3 may represent associated nodes. The analysis of sense–antisense transcripts using high-resolution tiling microarrays and genetic mutants provides a precise and sensitive means to study epigenetic activities. Our method of mining expression data of plant miRNA targets and smRNAs is potentially applicable to the identification of epigenetic targets in metazoans, where computational methods for prediction of miRNAs and their targets lack power because of sequence degeneracy, and to the identification of loci producing antisense transcripts by triggers other than miRNA-directed cleavage.