We demonstrated that most gene expression levels are well correlated between RNA-Seq and tiling arrays. There are some outliers, which are generally called highly expressed by tiling array and poorly expressed by RNA-Seq. In previous studies, similar outliers were also found and when analyzed with qPCR, it was evident that their profile was masked by cross-hybridization [23
]. To further bolster this conclusion, we note that a substantially greater number of inactive pseudogenes are called expressed by tiling arrays (Table ) when compared with their paralogous parent genes. Furthermore, TARs tiled by probes that are highly similar to their nearest neighbors also tend to be called expressed by tiling arrays and not by RNA-Seq, strong evidence of cross-hybridization.
We have demonstrated that a simple similarity score threshold for tiling array probes can identify potentially unreliable regions (Figures , ). To immediately aid researchers conducting tiling array analysis on C. elegans
, we provide our manually compiled list of such "black list" regions (Additional file 2
). It is important to note, however, that these unreliable regions are dependent on the design of the tiling array and possibly on other factors such as hybridization conditions; an analysis like ours would need to be re-run in other scenarios.
Besides cross-hybridization, another drawback of tiling arrays is the limited dynamic range of detection [9
]. Previous work has presented RNA-Seq data with a dynamic range varying over 5 orders of magnitude [8
]. Consistent with this, we note that ~40% more genes are called differentially expressed by RNA-Seq between two distinct subpopulations of C. elegans
, even when using a conservative statistical test. It is also clear that the fold difference of differential expression is greater for RNA-Seq (Figure ). Bloom et al. deliberately used fewer reads and a high number of array replicates in their comparison of differential expression in S. cerevisiae
]. This defines "fair comparison" in an alternative way, one where the cost of experimentation is similar, and they find that arrays better distinguish differential expression for low abundance transcripts. This is because those transcripts' specific probes will still exhibit hybridization while the few reads may not be picked up in sequencing.
Yet another drawback of tiling arrays is the comparative lack of exon boundary resolution. Not unexpectedly, the median absolute deviation of aggregated exon boundaries is much smaller for RNA-Seq than for tiling arrays, reflecting the size of the oligonucleotide probes in our tiling arrays. This distinct difference between the two technologies is especially important when sequencing unannotated transcriptomes and detecting alternative splicing; these results accentuate why RNA-Seq has been so successful at both types of analysis [15
Given the superiority of RNA-Seq using these metrics, our strategy of using RNA-Seq as a gold standard set for guiding tiling array analysis may be useful for calibrating experiments where large numbers of tiling array runs are required. It is conceivable that one or two "pilot" RNA-Seq experiments could guide a series of microarrays. Indeed, a variant of this strategy was used successfully when validating the de novo
assembly of the Glanville fritillary butterfly's transcriptome [45
]. It may also prove useful for probing the transcriptomes of organisms with poor transcriptome annotation. This general strategy has the potential to be expanded from the maxgap/minrun algorithm to other methods, such as hidden Markov segmentation. We find that the false positive rate determined for tiling array TARs decreases by an average of ~10% when using the RNA-Seq data as a gold standard set instead of the high-confidence WormBase annotation. Even though using the annotation as a gold standard set is not optimal, because not all annotated transcripts are necessarily expressed, it is satisfying to observe that the effect is relatively small. Given the wealth of expression data coming from both RNA-Seq and tiling array analyses, it is often difficult to understand how to interpret cross-platform results. As a first step, we examined the relationship between transcriptome coverage and quality of called TARs. Furthermore, we determined the approximate number of reads required to yield a sensitivity comparable to that of tiling arrays at a given FPR. This is of practical importance to researchers employing RNA-Seq, since the cost of sequencing is generally proportional to the number of reads obtained. It is important to note, however, that the transcriptome is dynamic--expression can vary widely between different life stages and growth environments. For the L2-poly(A) C. elegans
transcriptome, we find that 4 million reads are necessary to achieve a similar sensitivity to tiling arrays. Importantly, because of its single nucleotide resolution, the FPR of RNA-Seq at this sequencing depth is >5x greater than that of tiling arrays.
In order to extend this conclusion to other organisms, we outline a simple method of approximating transcriptome coverage. In principle, the coverage of the transcriptome could be calculated if we knew the exact number of base pairs of RNA present at a given point in time. Since this is difficult to measure, we can approximate this number for organisms whose transcripts are well annotated, by assuming that the total number of base pairs of RNA in the cell is proportional to the total number of base pairs of annotated transcripts by some constant c
. This approximation makes the assumption that varying transcript expression levels averages out across the transcriptome. Thus,
where L is the number of annotated exon base pairs, including isoforms to account for complexity of transcription, N is the total number of reads within annotated exons, and R is the average read length. It is reasonable to assume that c should be relatively constant across organisms, and so this coverage value may be meaningful for organisms other than C. elegans.
Although almost all of our analyses have indicated otherwise, there are some drawbacks for RNA-Seq. "Cross-mapping" is an analogous problem to cross-hybridization, and has been addressed in complex organisms [8
] particularly because it poses a problem for genomes with many repetitive regions. We included only high quality mapped reads, but allowing greater mismatches, which could be beneficial for detecting additional transcription, would lead to decreased confidence in the transcriptional activity of regions with high sequence similarity. Our analysis was less affected by this issue since the C. elegans
genome does not contain many repetitive regions (~87% is non-repetitive; [46
]), and also because we included a rigorous pre-processing step that left less than 3% of reads mapping to multiple locations in the C. elegans
]. In principle, however, it is important for users of RNA-Seq to consider the repetitiveness of the genome they are analyzing, to determine how many reads map to multiple locations, and understand how they are dealt with. Importantly, different software packages deal with ambiguous reads differently; for example, MAQ assigns these reads randomly, whereas cross_match gives information about the alternative mapping sites. Ironically, it is evident that tiling arrays are not immune to the problem of ambiguous mapping; indeed ~6% of tiling array probes used in this study map to multiple locations in the C. elegans
genome (Additional file 1
: Table S2).
As read lengths continue to get longer, however, the problem of ambiguous read mapping will certainly become less of an obstacle. Indeed, the tantalizing possibility of obtaining kilobase long reads may completely eliminate this altogether. However, it has recently been demonstrated that transcript length affects differential expression analysis [47
]. Furthermore, the problem of rRNA and tRNA overloading the reads often forces RNA-Seq users to purify RNA over a poly-dT column, potentially losing RNA species of interest. This problem is currently being bypassed with the increased availability of kits for specific removal of rRNA from total RNA samples (Ambion, Invitrogen).
A less tangible disadvantage of RNA-Seq is the requirement for "big data," which can cause problems in storage, portability, and processing time [48
]. For example, just the sequences from the L2-poly(A) RNA-Seq dataset take up ~13 gigabytes. For genomes larger than C. elegans
, which require more reads, this number can rapidly increase. Larger data is simultaneously more costly to archive and easier to corrupt. Furthermore, these large datasets can often strain computational resources with respect to both processing time and memory usage. Although great strides have been made, as RNA-Seq grows in popularity, it is imperative that highly efficient RNA-Seq software pipelines and data formats be developed.