Taking advantage of rapidly advancing sequencing technology, researchers are now using high-throughput sequencers to measure gene expression with a protocol called RNA-Seq (Nagalakshmi et al.
; Wang et al.
). RNA-Seq is the transcriptome analog to whole-genome shotgun sequencing (Staden, 1979
), with a key difference being that RNA-Seq is primarily used for estimating the copy number of transcripts in a sample. The main steps in the protocol are (i) RNA is isolated from a sample, (ii) RNA is converted to cDNA fragments via reverse-transcription and fragmentation, (iii) a high-throughput sequencer [such as those from Illumina (San Diego, CA, USA), Applied Biosystems (Foster City, CA, USA) and Roche 454 Life Science (Branford, CT, USA)] is used to generate millions of reads from the cDNA fragments, (iv) reads are mapped to a reference genome or transcript set with an alignment tool and (v) counts of reads mapped to each gene are used to estimate expression levels. Because the primary outputs of RNA-Seq are counts of reads, they are referred to as ‘digital’ gene expression measurements, as opposed to the ‘analog’ fluorescence intensities from microarrays.
RNA-Seq is a promising replacement for microarrays as initial studies have shown that RNA-Seq expression estimates are highly reproducible (Marioni et al.
) and often more accurate, based on quantitative PCR (qPCR) and spike-in experiments (Mortazavi et al.
; Nagalakshmi et al.
). Although still a young technology, RNA-Seq has matured enough to be used in studies of transcription in yeast (Nagalakshmi et al.
(Lister et al.
), mouse (Cloonan et al.
; Mortazavi et al.
), and human (Marioni et al.
; Morin et al.
). The primary advantages of RNA-Seq are its large dynamic range (spanning five orders of magnitude), low background noise, requirement of less sample RNA and ability to detect novel transcripts, even in the absence of a sequenced genome (Wang et al.
). To allow the technology to reach its full potential, a number of experimental and computational challenges need to be addressed, including the handling of read mapping uncertainty, sequencing non-uniformity, estimation of potentially novel isoform (alternatively spliced transcript) expression levels and efficient storage and alignment of RNA-Seq reads.
In this article, we present our work in addressing the computational issue of read mapping uncertainty. Because RNA-Seq reads do not span entire transcripts, the transcripts from which they are derived are not always uniquely determined. Paralogous gene families, low-complexity sequence and high sequence similarity between alternatively spliced isoforms of the same gene are primary factors contributing to mapping uncertainty. In addition, polymorphisms, reference sequence errors and sequencing errors require that mismatches and indels be allowed in read alignments and further contribute to lower our confidence in mappings. Due to these factors, a significant number of reads are multireads
: reads that have high-scoring alignments to multiple positions in a reference genome or transcript set (Mortazavi et al.
). We will refer to reads that map to multiple genes as gene multireads
and reads that map to a single gene but multiple isoforms as isoform multireads
. The fraction of mappable reads that are gene multireads varies and depends on the transcriptome and read length. For the datasets we analyzed, this fraction ranged between 17% (mouse) and 52% (maize), representing a significant proportion of RNA-Seq data.
Two strategies have been previously used for handling gene multireads. The first simply discards them, keeping only uniquely mapping reads for expression estimation (Marioni et al.
; Nagalakshmi et al.
). A slightly more sophisticated method using only uniquely mapping reads adjusts exon coverage by the fraction of exon positions that would give rise to uniquely mapping reads (Morin et al.
). A second strategy has been to ‘rescue’ multireads by allocating fractions of them to genes in proportion to coverage by uniquely mapping reads (Faulkner et al.
; Mortazavi et al.
). These rescue strategies have been shown to give expression estimates that are in better agreement with microarrays than those from only using uniquely mapping reads (Mortazavi et al.
). A more recent method handles isoform multireads by explicitly estimating isoform expression levels but does not handle gene multireads (Jiang and Wong, 2009
We present a method for estimating expression in the presence of multireads that treats mapping uncertainty in a statistically rigorous manner. Using a generative model of RNA-Seq data, we unify the notions of gene and isoform multireads through latent random variables representing the true mappings. Model parameters correspond to isoform expression levels, read distributions across transcripts and sequencing error. We estimate maximum likelihood (ML) expression levels using an Expectation–Maximization (EM) algorithm and show that previous rescue methods are roughly equivalent to one iteration of EM. Our inference method can be thought of as the RNA-Seq analog of methods for correcting for SAGE sequencing errors (Beissbarth et al.
) and microarray cross-hybridization (Kapur et al.
Like the statistical method of Jiang and Wong (2009
), our model can be used to estimate individual isoform expression levels. In contrast with their method, our method incorporates gene multireads into a statistical model and does not require knowledge of which isoforms share exonic sequence. In addition, our model can accommodate non-uniform read distributions across transcripts (e.g. 3′ biases), which may be simultaneously learned from the data.
Results from simulations with parameters derived from real data indicate that our method gives more accurate gene expression estimates than those using only unique reads or rescue strategies. We show that estimation of gene expression levels as the sum of estimated isoform expression levels improves gene-level accuracy. We found a slight 3′ bias in the read distributions of a real dataset and determined that taking such non-uniformities into account can improve accuracy, depending on the strength of the bias. Last, through simulations with varying read lengths, we show that the optimal length for RNA-Seq is around 20–25 bases for the mouse and maize transcriptomes when our method is used to handle multireads.
1.1 Problem statement
The goal of expression analysis is to estimate a transcriptome
: the set of all expressed transcripts and their frequencies in a cell at a given time. By itself, RNA-Seq data allow us to estimate the relative
expression levels of isoforms in a sample. Combined with accurate physical sample size or spike-in measurements, absolute expression may be estimated (Mortazavi et al.
), but that is a separate issue that we will not discuss here.
There are two natural measures of relative expression: the fraction of transcripts
and the fraction of nucleotides
of the transcriptome made up by a given gene or isoform. For isoform i
, we will denote these two quantities by τi
, respectively. At the isoform level, these quantities are related by the equations
is the length, in nucleotides, of isoform i
. At the gene level, expression is simply the sum of the expression of possible isoforms. For ease of notation, we give expression levels in terms of nucleotides per million
(NPM) and transcripts per million
(TPM), which are obtained by multiplying ν and τ by 106
The problem addressed in this article is that of using a set of RNA-Seq data to estimate the ν and τ values for a set of previously identified isoforms and genes. We assume that the RNA-Seq data comes in the form of N sequence reads, each of length L. Reference sequences, but not necessarily genomic coordinates, are assumed to be available for all M isoforms. Sequence clustering or genomic coordinates may be used to group isoforms into genes, if desired.
The fundamental assumption of RNA-Seq is that the fraction of reads derived from isoform i is a function of νi. Assuming uniformly distributed reads, all of which can be assigned to isoforms, and poly(A)+mRNA, we have that ci/N approaches νi as N→∞, where ci is the number of reads from isoform i. Even if reads are not uniformly distributed along the lengths of isoforms, so long as they are sampled in proportion to νi, this result still holds.
1.1.1 Comparison to RPKM estimation
We compare the problem of estimating ν and τ values with that of computing expression in terms of reads per kilobase per million mapped reads
(RPKM; Mortazavi et al.
). The measured level of isoform i
, in RPKM, is defined as 109
), where ci
is the number of reads mapping to isoform i
is the total number of mappable reads and i
is the length of isoform i
. Under the assumption of uniformly distributed reads, we note that RPKM measures are estimates of 109
, which is an unnormalized value of τi
. The normalization factor is
which is proportional to the mean length of a transcript in the transcriptome. When the mean expressed transcript length is 1 kb, 1 TPM is equivalent to 1 RPKM, which corresponds to roughly one transcript per cell in mouse (Mortazavi et al.
Because the mean expressed transcript length may vary between samples, we prefer the use of ν and τ over RPKM measures. For example, an isoform with the same fraction of transcripts in two samples will have different RPKM values if the expression of other genes changes such that the mean expressed transcript length differs. In addition, τ values are comparable between two species even if mRNA lengths tend to be larger in one of the species.