We have shown that priming using random hexamers biases the nucleotide content of RNA sequencing reads and that this bias also affects the uniformity of the locations of the reads along expressed transcripts. Despite this bias, we believe that priming using random hexamers is preferable to using oligo(dT) priming, as the latter is highly biased toward the 3′-end of the expressed transcripts.
Mamanova
et al. recently described an alternative protocol for sequencing RNA using the Illumina Genome Analyzer (
18), in which reverse transcription takes place directly on the flow cell and which yields stranded reads and avoids polymerase chain reaction amplification. RNA is not converted to dscDNA using random priming, instead sequencing adapters are ligated directly onto RNA fragments. The ligated RNA library is then reverse transcribed on the flow cell. Data from their study does not show the nucleotide biases reported here (
Supplementary Figure S9).
As an alternative to poly(A) enrichment, Armour
et al. describe a protocol for ribosomal depletion, where only a subset of the 4096 possible hexamers are used to prime the reverse transcription (
19). There are currently no publicly available data sets generated using this protocol.
A natural question is whether the read biases observed with the Illumina Genome Analyzer also occur with other sequencing platforms. We have begun investigating SOLiD data and our preliminary results based on data from (
20) indicate that similar, but smaller random priming biases may be present in addition to more important SOLiD-specific biases.
We have provided a method for adjusting the nucleotide bias and have shown that this method makes the start positions of the reads closer to being uniformly distributed across the transcriptome. The method is implemented in the R package Genominator available from Bioconductor (
5).
During the review of the present manuscript, we became aware of recent related work by Li
et al. (‘Modeling non-uniformity in short-read rates in RNA-Seq data’, submitted for publication). These authors consider a different approach for handling the non-uniformity of the read distribution, which involves modeling the base-level read counts of a given gene using a Poisson model with parameter that is a function of the neighboring nucleotides. Compared to our reweighting procedure, the Li
et al. method relies on gene annotation and does not relate to the library preparation protocol (i.e. random hexamer priming). In a similar spirit, Bullard (
21) considers generalized linear models for base-level read counts in a simulation study assessing the performance of the differential allele-specific expression method proposed in Bullard
et al. (
22). However, preliminary attempts in the explicit modeling of base-level read counts have had limited success and more research and benchmarking is needed before one can confidently derive expression measures based on such models.
Alleviating the positional bias induced by random hexamer priming is one step toward enabling comparisons within samples, between distinct genomic regions. Such comparisons are crucial to several aims of transcriptome sequencing, including the study of alternative splicing, transcript assembly and allele specific expression.