PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of plosonePLoS OneView this ArticleSubmit to PLoSGet E-mail AlertsContact UsPublic Library of Science (PLoS)
 
PLoS One. 2012; 7(1): e29175.
Published online 2012 January 6. doi:  10.1371/journal.pone.0029175
PMCID: PMC3253133

IQSeq: Integrated Isoform Quantification Analysis Based on Next-Generation Sequencing

Andrey Rzhetsky, Editor

Abstract

With the recent advances in high-throughput RNA sequencing (RNA-Seq), biologists are able to measure transcription with unprecedented precision. One problem that can now be tackled is that of isoform quantification: here one tries to reconstruct the abundances of isoforms of a gene. We have developed a statistical solution for this problem, based on analyzing a set of RNA-Seq reads, and a practical implementation, available from archive.gersteinlab.org/proj/rnaseq/IQSeq, in a tool we call IQSeq (Isoform Quantification in next-generation Sequencing). Here, we present theoretical results which IQSeq is based on, and then use both simulated and real datasets to illustrate various applications of the tool. In order to measure the accuracy of an isoform-quantification result, one would try to estimate the average variance of the estimated isoform abundances for each gene (based on resampling the RNA-seq reads), and IQSeq has a particularly fast algorithm (based on the Fisher Information Matrix) for calculating this, achieving a speedup of An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e001.jpg times compared to brute-force resampling. IQSeq also calculates an information theoretic measure of overall transcriptome complexity to describe isoform abundance for a whole experiment. IQSeq has many features that are particularly useful in RNA-Seq experimental design, allowing one to optimally model the integration of different sequencing technologies in a cost-effective way. In particular, the IQSeq formalism integrates the analysis of different sample (i.e. read) sets generated from different technologies within the same statistical framework. It also supports a generalized statistical partial-sample-generation function to model the sequencing process. This allows one to have a modular, “plugin-able” read-generation function to support the particularities of the many evolving sequencing technologies.

Introduction

The concepts of genes and isoforms have evolved and become more complex [1]: the discovery of splicing [2][4] revealed that the gene was a series of exons, coding for, in some cases, discrete protein domains, and separated by long noncoding stretches called introns. With alternative splicing, one genetic locus could code for multiple different mRNA transcripts (isoform transcripts). This discovery complicated the concept of the gene radically. For instance, as of 2007, the GENCODE annotation [5] contained on average An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e002.jpg transcripts per locus.

With the recent development of high-throughput RNA sequencing (RNA-Seq) technology, it is possible for biologists to measure transcription with unprecedented precision. One problem that can now be tackled is that of isoform quantification, where one tries to reconstruct the abundances of similar isoforms based on a set of RNA-Seq reads. Various methods have been developed to solve this problem. In previous work, researchers proposed different statistical frameworks to solve this problem. Xing et al. [6] proposed a maximum likelihood problem, an expectation maximization solution, and a Fisher information measurement for performance estimation; Jiang et al. [7], based on Poisson model assumption, formulated a maximum likelihood problem and its numerical solution, and also utilized the observed Fisher information matrix to sample the posterior distribution of isoform quantity; Trapnell et al. [8] used variable read-length model (normal distribution by default) and a sampling method similar to [7] to derive the posterior distribution of isoform quantity; Richard et al. [9] with a Poisson model, also used bootstrapping to study the robustness of their method against non-uniform sequencing effects; Lacroix et al. [10] studied the conditions under which the problem can be solved, revealing that although neither single nor paired-end sequencing guarantee a unique solution, paired-end reads may be sufficient to solve the vast majority of the transcript variants in practice.

These studies, however, have not fully addressed the problem of isoform quantification in a couple of respects: First of all, they usually assume that only one sequencing technique is used in an experiment, and that the reads are uniformly sampled along the transcripts. These are not necessarily good approximations to real data. Second, while some theoretical results have been presented on estimating the accuracy (e.g. average variance) of quantification results, there does not yet exist a method to efficiently compute these measurements other than using brute-force simulation, which is computationally infeasible in large scale expriments involving tens of thousands of genes and millions of sequencing reads. On the other hand, fast estimation of quantification accuracy would not only enable researchers to better understand the analysis results being obtained, but also will be useful in RNA-Seq experiment design to optimally integrate different sequencing technologies in a cost-efficient way.

In order to fill in these gaps, we have developed a generalized statistical solution for the problem of isoform quantification, and a practical implementation in a tool we call IQSeq (Isoform Quantification in next-generation SEQuencing). IQSeq has the following features which represent improvements over previous work in isoform quantification in the following aspects:

  1. It has a generalized statistical read generation function during the sequencing process (i.e. a customizable function describing how reads are randomly sampled from isoforms). This provide a flexible way to incorporate characteristics of different sequencing technologies (e.g. 3′ end sequencing bias of transcripts).
  2. It integrates the analysis of different sample sets generated from different sampling technologies (e.g. long and short reads).
  3. It has a fast algorithm for estimating the average variance of the results provided by our expectation maximization based solution.
  4. Given the estimated isoform abundance output, IQSeq also provides an information theoretical method to measure the overall transcriptome complexity.

In this paper, we will first introduce a mathematical definition of the generalized partial sampling and distribution estimation problem (which IQSeq is based on), and provide a expectation maximization based iterative solution. Then we discuss in detail on how to estimate the performance of this solution using Fisher information based heuristics, and present fast algorithms that implement the computation of these heuristics. Finally, we show results of applying our methods to both simulated and real-world data, illustrating scenarios where such integrated analysis can be the most informative.

Methods

First, we formally define the isoform quantification with multiple sequencing technologies as a generalized statistical partial sampling problem, and present a computational solution based on maximum likelihood estimation and expectation maximization. We then show both analytical results and practical fast algorithms to estimate the average variance of the solution on isoform quantification, and compare their computational complexity against brute-force methods. We present the main theoretical results in this section, and detailed derivations can be found in Text S1.

Problem Definition

We start by defining the generalized process of batch partial sampling, which represents the sequencing process in RNA-Seq experiments, and the relationships between partial samples and the objects being sampled.

Definition 1

(Batch Partial Sampling) Let An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e003.jpg be all the possible isoforms for a given gene, with relative abundances An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e004.jpg, where An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e005.jpg. We assume that there are An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e006.jpg different partial sampling methods (sequencing techniques with difference characteristics, e.g. long/medium/short, single/paired end): An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e007.jpg, and let An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e008.jpg denote all the samples (reads): An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e009.jpg from An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e010.jpg. We also define An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e011.jpg (partial sample (read) An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e012.jpg is compatible with An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e013.jpg), where An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e014.jpg is the indicator function. There are in total An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e015.jpg samples, where An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e016.jpg is the total number of partial samples from An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e017.jpg.

Here we assume a two-step sampling process: First, a sampling method An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e018.jpg chooses an isoform instance An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e019.jpg according to An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e020.jpg. Second, the sampling method generates a partial sample An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e021.jpg according to a local partial sample generation model (the read generation function) An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e022.jpg (generating An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e023.jpg).

Definition 2

(Distribution Estimation based on Batch Partial Samples) Given An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e024.jpg, and An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e025.jpg as defined in Definition 1, estimate An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e026.jpg.

As shown in Figure 1, An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e027.jpg are the isoforms with different relative abundances An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e028.jpg, and An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e029.jpg are the single- and paired-end reads whose sequences align with part of this gene region. Some of these reads (e.g. read An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e030.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e031.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e032.jpg) are compatible with multiple isoforms. The ultimate problem is to estimate An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e033.jpg based on An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e034.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e035.jpg, i.e., reconstructing a distribution based on partial observations.

Figure 1
Reads (partial samples) in the isoform quantification problem.

In the remaining part of this paper, we will use two notations to describe a partial sample An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e036.jpg: An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e037.jpg is the An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e038.jpgth sample from An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e039.jpg; and An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e040.jpg stands for a partial sample from An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e041.jpg, starting (inclusive) from position An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e042.jpg and ending (exclusive) at An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e043.jpg in that isoform. We also define exons as those nodes in the splicing graph of a gene, so that there are no exons that overlap with each other (i.e. an exon in a transcript may be a combination of multiple nodes of the splicing graph). We have included in our software package a preprocessing tool for grouping transcripts into gene clusters and formulating corresponding splicing graphs.

Maximum Likelihood Estimation (MLE)

Definition 2 does not give an explicit criterion for a “good” estimation of An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e044.jpg. Since the problem is defined in a statistical sampling framework, it is natural to consider using Maximum Likelihood as such a criterion.

Definition 3

(Maximum-Likelihood Distribution Estimation based on Batch Partial Samples) Given An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e045.jpg, and An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e046.jpg as defined in Definition 1, find An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e047.jpg such that:

equation image
(1)

By plugging in the partial samples An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e049.jpgs and An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e050.jpgs, we can rewrite the formula above as follows:

equation image
(2)

In the next subsection, we demonstrate how this problem can be solved by introducing a hidden variable An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e052.jpg and using the technique of Expectation Maximization [11].

Applying the Expectation Maximization Method

We define An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e053.jpg is from An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e054.jpg, which are the hidden variables in this problem. Since Expectation Maximization gives an iterative solution, we denote the estimation for An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e055.jpg in the An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e056.jpgth step as An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e057.jpg, and further define An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e058.jpg, which is the expectation of An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e059.jpg given An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e060.jpg (the estimated paramters at the An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e061.jpgth step) and the reads An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e062.jpg.

equation image
(3)
equation image
(4)

where An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e065.jpg is generated by An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e066.jpg.

By performing an E step that computes

equation image
(5)
equation image
(6)

and a M step which maximizes An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e069.jpg with constraint: An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e070.jpg, we have:

equation image
(7)

as the new estimation for An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e072.jpg.

The iterative estimation in Equation 7 is intuitively consistent with the case of estimating a distribution based on full samples: consider the scenario in which for each An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e073.jpg, there is only one An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e074.jpg satisfying An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e075.jpg, the right hand side of Equation 7 thus becomes An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e076.jpg, which is exactly how the distribution estimation problem with traditional full samples can be solved. In the case of partial samples, our solution provides a way to adjust the “weight” each sample An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e077.jpg contributes to the An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e078.jpgs of different objects.

Analyzing the Performance of Estimation

Given An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e079.jpg obtained from the MLE solution presented in the previous section, we would like to understand how much this estimate will deviate from the “true” An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e080.jpg on average. Here we focus on the variance of the An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e081.jpg, which describes how stable the MLE result is over many different partial sample sets (obtained via additional experiments or re-sampling) drawn from the same isoform set:

equation image
(8)

As we will show later, although brute-force simulation can be performed to obtain a relatively accurate estimation of this measurement, it is may become computationally intractable when there are too many reads and genes to be considered. We thus propose to use a Fisher information based heuristic for estimating An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e083.jpg, and present a fast algorithm to compute the exact value of this heuristic.

We first introduce the Fisher information matrix [12], [13] as a basis for further discussion. The Fisher information is a way of measuring the amount of information that the random samples An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e084.jpg carries about the unknown parameter An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e085.jpg upon which the likelihood function of An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e086.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e087.jpg, depends. An important use of the Fisher information matrix in statistical analyses is its contribution to the calculation of the covariance matrices of estimates of parameters fitted by maximium likelihood.

Let An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e088.jpg be the free parameters, and An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e089.jpg.

Definition 4

(Observed Fisher information matrix).

equation image
(9)
equation image
(10)

Definition 5

(Expected Fisher information matrix).

equation image
(11)

Covariance matrix of the maximum likelihood estimator

Let An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e093.jpg, and An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e094.jpg. The Cramér-Rao bound [13] states that:

equation image
(12)

where An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e096.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e097.jpg; An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e098.jpg.

We then estimate An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e099.jpg by An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e100.jpg, and use the bound above to estimate the covariance matrix:

equation image
(13)

This means that we only need An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e102.jpg in order to estimate the performance of our MLE with different sampling method combinations.

Heuristic for MLE performance estimation

In order to provide a single value measure for the expected performance of Maximum Likelihood estimation, we propose to use the following heuristic to estimate the average variance of An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e103.jpg:

equation image
(14)

This heuristic avoids the potential computational intensive and numerically unstable computation of the inverse of An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e105.jpg, and is consistent with the theoretical result on the lower-bound of An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e106.jpg in one dimensional case:

equation image
(15)

which is a specialization of the result in the previous subsection. In other words, the precision to which we can estimate An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e108.jpg is fundamentally limited by the Fisher information.

In order to compute this heuristic, all we need is An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e109.jpg itself. However, the brute-force computation (according to Definition 4 and 5) of this matrix will be time-consuming since its time complexity is proportional to the total number of possible sample sets (which in turn grows exponentially with the number of samples). In the next section, we will present algorithms that can compute this matrix in a more efficient fashion.

Efficient Computation of An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e110.jpg

First of all, we can decompose An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e111.jpg in the following way:

equation image
(16)

where

equation image
(17)

is the expected Fisher information matrix of a single partial sample based on An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e114.jpg. Thus we need to be able to compute An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e115.jpg in order to obtain An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e116.jpg.

Further decomposing An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e117.jpg

equation image
(18)

where

equation image
(19)

is the Fisher information matrix of a partial sample An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e120.jpg from An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e121.jpg at An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e122.jpg in An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e123.jpg.

A brute-force algorithm for computing An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e124.jpg can thus be described as follows: Algorithm 1

Algorithm 1
BruteForceFIM (I,Θ,Sampm,p,q)

In Algorithm 4, if An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e125.jpg is the length of a given sequence An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e126.jpg, then the whole algorithm consists of An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e127.jpg length(Ik) computations of An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e128.jpg.

Equivalent partial samples

In order to continue our discussion on faster algorithms to compute An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e129.jpg, we introduce the concept of equivalent partial samples below (the relavant proofs can be found in Text S1):

Definition 6

Two partial samples An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e130.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e131.jpg are equivalent w.r.t. An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e132.jpg if and only if An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e133.jpg.

Lemma 1

If An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e134.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e135.jpg, then An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e136.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e137.jpg are equivalent w.r.t. An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e138.jpg.

Definition 7

A set of partial samples An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e139.jpg is an equivalent sample set w.r.t. An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e140.jpg if and only if An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e141.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e142.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e143.jpg are equivalent w.r.t. An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e144.jpg.

Lemma 2

Given an isoform An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e145.jpg and a sampling method An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e146.jpg, if we divide all its possible partial samples into An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e147.jpg non-overlapping equivalent sample sets An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e148.jpg, then:

equation image
(20)

Results from a simple shotgun read generation model

In this subsection, we consider a simplified partial sample generation model:

Definition 8

A simple shotgun sampling method An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e150.jpg generates samples with fixed read length An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e151.jpg. When sampling from an isoform An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e152.jpg with length An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e153.jpg, there are in total An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e154.jpg different samples An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e155.jpg, where An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e156.jpg; and An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e157.jpg. Each of these samples has equal probability of being generated from An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e158.jpg: An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e159.jpg.

Figure 2 illustrates simple shotgun sampling process and its corresponding per-base coverage on the isoform being sampled.

Figure 2
A simple shotgun read generation model.

Lemma 3

Given the sample generation model An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e160.jpg above, if two samples An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e161.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e162.jpg generated by this method are compatible with the same set of isoforms, i.e. An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e163.jpg, then An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e164.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e165.jpg are equivalent w.r.t. An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e166.jpg.

Theorem 1

Given the sample generation model An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e167.jpg above, if two samples An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e168.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e169.jpg generated by this method overlap with all the junctions in the same set of connected exons An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e170.jpg, then An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e171.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e172.jpg are equivalent w.r.t. An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e173.jpg.

For example, in Figure 3, where the reads are generated from a simple shotgun sampling process, the equivalent partial samples are -readAn external file that holds a picture, illustration, etc.
Object name is pone.0029175.e174.jpg, readAn external file that holds a picture, illustration, etc.
Object name is pone.0029175.e175.jpg, readAn external file that holds a picture, illustration, etc.
Object name is pone.0029175.e176.jpg}, -readAn external file that holds a picture, illustration, etc.
Object name is pone.0029175.e177.jpg, readAn external file that holds a picture, illustration, etc.
Object name is pone.0029175.e178.jpg}. Also, if we consider a paired-end read as a long shotgun read with its gap filled, the samples readAn external file that holds a picture, illustration, etc.
Object name is pone.0029175.e179.jpg and readAn external file that holds a picture, illustration, etc.
Object name is pone.0029175.e180.jpg are also (approximately) equivalent, if their insert sizes are close to each other. However, readAn external file that holds a picture, illustration, etc.
Object name is pone.0029175.e181.jpg is not equivalent to these reads, since its shotgun version overlaps with a different exon junction set (with an addition exon).

Figure 3
Equivalent samples in a simple shotgun read generation model.

Algorithms for efficiently computing An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e182.jpg

Based on Definition 8 and Theorem 1, we can design the following algorithm for efficiently computing An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e183.jpg by combining the computation of this value for equivalent partial samples from each isoform. Algorithm 2

Algorithm 2
FasterShotgunFIM (I,Θ,Sampm,p,q)

In Algorithm 2, An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e184.jpg identifies the connected exons set in An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e185.jpg that overlaps with a given partial sample An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e186.jpg, and can be implemented with An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e187.jpg time complexity by pre-computing an exon-position index table for the isoforms.

We can further reduce the number of times of computing An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e188.jpg by combining equivalent partial samples from across isoforms: when an equivalent sample set from an isoform has been identified, all the same samples from other isoforms can be recorded in lists of intervals to avoid redundant computation of their An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e189.jpgs. The algorithm is shown below: Algorithm 3

Algorithm 3
FasterShotgunFIM (I,Θ,Sampm,p,q)

In Algorithm 3, An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e190.jpg finds the minimum position An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e191.jpg that is outside a given interval list An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e192.jpg; An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e193.jpg returns the partial sample An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e194.jpg from An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e195.jpg covering all the exon junctions in An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e196.jpg with a minimum An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e197.jpg, and can be implemented with a worst-case An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e198.jpg time complexity by using a pre-computed exon position index table for the isoforms.

Complexity analysis

Given a set of An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e199.jpg possible isoforms An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e200.jpg, with lengths An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e201.jpg, respectively, and a shotgun sampling method An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e202.jpg with sample read length An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e203.jpg as described in Definition 8, Algorithm 1 requires An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e204.jpg steps of computing An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e205.jpg. Thus computing An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e206.jpg using this brute-force algorithm requires An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e207.jpg operations of calculating An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e208.jpg. If we assume that the average length of an isoform is An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e209.jpg, this corresponds to An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e210.jpg computations of An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e211.jpg.

Suppose that on average an isoform can be divided into An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e212.jpg equivalent sample sets by Algorithm 2, this algorithm will then require An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e213.jpg steps of computing An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e214.jpg to obtain the Fisher information matrix An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e215.jpg for the given sampling method, thus being more efficient than Algorithm 1 by a ratio of An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e216.jpg. Algorithm 3 will obviously be even more efficient by avoiding the redundant computation of some of the equivalent sample sets in Algorithm 2.

Using more complex An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e217.jpg function in Algorithm 3

The sequencing technology being used in an RNA-Seq experiment is usually more complicated than the simplified An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e218.jpg function described in Definition 8, which assumes equal sample-length and uniform generative probability. In reality, a typical An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e219.jpg usually involves reads with different lengths within a certain range, and also biased sample generation probability at different locations of a full-length isoform. Although once such a An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e220.jpg is defined, our MLE solution can treat it in the same way as it does for simplified versions, Algorithm 3 no longer works “out of the box” due to its dependency on Definition 8 to find equivalent partial samples. We discuss briefly in this subsection on how to handle more complex features.

When the assumption of uniform sample generation still holds, it is straightforward to handle samples with different lengths in FIM computation. We can treat one sampling method as a combination of multiple simplified methods as in Definition 8, with different sample lengths An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e221.jpg:

equation image
(21)
equation image
(22)

where An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e224.jpg represents the probability of generating a sample with length An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e225.jpg in sampling method An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e226.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e227.jpg is a sample with length An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e228.jpg, and An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e229.jpg is the simplified sample generation probability as in Definition 8, with sample length An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e230.jpg.

In the case of non-uniform sample generation along the isoform, if An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e231.jpg is a step function (piece-wise constant function) for sample An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e232.jpg along isoform An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e233.jpg, we will still be able to find equivalent sample sets as described in Definition 7, based on both the isoform structures and the intervals in An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e234.jpg. If, however, very few such constant components exist in An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e235.jpg, we will need to relax the definition of equivalent partial samples to satisfying An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e236.jpg only. With this relaxed definition, we can find samples An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e237.jpg with equivalent structural similarities to all the isoforms. In this case, if the isoforms contain regions where any An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e238.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e239.jpg from it satisfy An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e240.jpg with a constant An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e241.jpg for all An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e242.jpg, we still have An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e243.jpg according to Equation 19, and the An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e244.jpg can thus be efficiently computed using a variant of Algorithm 3 by combining the computation for such equivalent partial samples. For more complex An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e245.jpg functions, however, approximation algorithms may have to be introduced for fast computation of An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e246.jpg.

Results

Simulation Results

Here we present our results on a set of simulated datasets. In order to demonstrate the accuracy and efficiency of the methods we developed, we first use simulations to show the performance of our approach on simplified gene models and a real gene. These simulations are useful in designing optimal sequencing experiments for isoform quantification.

Simulation on simplified genes

Due to the complexity of real gene structure, we apply our methods to three artificially constructed genes with simplified isoform structures, so as to better illustrate how different characteristics of the gene structures can affect the outcome of the isoform quantification analysis.

As shown in Figure 4A, each of these genes has two different isoforms, with the more abundant one shown in a darker color. Two sampling techniques, short single and short paired-end (PE), are used to generate reads from them, with a fixed total cost of $An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e247.jpg (roughly corresponding to An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e248.jpg medium length reads with average size of An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e249.jpg bp (An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e250.jpg× coverage on the simplied genes), or An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e251.jpg short reads with average length of An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e252.jpg bp (An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e253.jpgx)). The per-base costs of these sampling techniques are based on [14]. Different cost combinations, e.g. different percentage of the total cost being assigned to a certain sampling technique, are illustrated by the An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e254.jpg-axis in Figure 4B–D. For each of these cost combinations, we randomly generate An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e255.jpg read sets, and use the MLE solution to estimate An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e256.jpg, based on which An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e257.jpg are computed (solid lines in Figure 4B–D). We also use Algorithm 3 to estimate the same quantity, and plot the estimations using dashed lines in the same figure for comparison. The results show that the FIM estimation of An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e258.jpg are close to the direct simulation results, and also correctly predicts the trend in how this value changes with different cost combinations. Also, different gene structures have noticeable impact on the MLE accuracy, mostly due to the ability of sampling techniques to distinguish isoforms from each other with different gene structures.

Figure 4
Results on simplified genes.

Not only can the FIM based heuristic correctly approximate how the performance of MLE changes with regard to different sampling technique combinations, it is also able to dramatically shorten the time of computation, as shown in Table 1. This is mainly because while the computation of brute-force simulation depends heavily on the number of reads being generated and the number of trials needed to obtain a relatively stable estimation of An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e259.jpg, the core computation taken by the FIM based heuristic is the evaluation of individual FIMs for the sampling techniques involved, which can be efficiently computed using Algorithm 3, and then combined based on Equation 16 to estimate An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e260.jpg under different cost combinations. Being able to do these simulations fast is useful in designing optimal expriments.

Table 1
Total time used by brute-force simulation vs. FIM based heuristic to estimate An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e261.jpg in simplified genes.

Efficiently Estimating quantification error: Application on a typical gene

We have developed a Fisher information matrix (FIM) based fast algorithm (Algorithm 3 in Methods section) for estimating the quantification error in An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e267.jpg, and compared its speed with two other benchmark algorithms. Here we consider the gene An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e268.jpg, which has An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e269.jpg known isoforms shown in Figure 5A. Figure 5B shows its corresponding splicing graph [6], [15], with An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e270.jpg exon blocks, and An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e271.jpg possible isoforms, which are all the possible paths from node “START” to node “END” in the splicing graph. Figure 5C shows the brute-force way and Fisher information based method to estimate An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e272.jpg.

Figure 5
Computations on gene TCF7.

When computing the expected Fisher information matrix An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e273.jpg, a brute-force algorithm (Algorithm 1) requires An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e274.jpg computations of the observed Fisher information matrix An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e275.jpg, while an improved algorithm developed by us (Algorithm 2) involves An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e276.jpg such computations, and the number for our final algorithm (Algorithm 3) is An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e277.jpg, achieving a An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e278.jpg times speedup compared to the brute-force method. A summary of the speedups is shown in Figure 6. Note that theoretical speedup is calculated based on the number of key computational steps (per-read FIM computation), while the actual speedup depends on the software implementation of all steps in the algorithm.

Figure 6
Speedup in FIM computation for gene TCF7.

Integrated analysis with multiple sequencing technologies: Simulation on a typical gene

We present in this section the application of the FIM based heuristic on a real gene, and compared the results to the ones obtained from direct simulations. We pick TCF7 again as a typical example gene with multiple isoforms. Similarly to our procedure in the section on simplified genes, two sampling techniques, medium and short shotgun sequencing (with average length of An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e281.jpg bp and An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e282.jpg bp, $An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e283.jpg and $An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e284.jpg per An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e285.jpg million base cost respectively), are used to generate reads from them, with a fixed total cost of $An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e286.jpg, with An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e287.jpg trials being conducted for each cost combination. Two different sets of results are shown in Figure 7, one using all the An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e288.jpg possible isoforms deduced from its splicing graph, and the other just using its An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e289.jpg known isoforms. As in the previous section, the results here show that the FIM estimation of An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e290.jpg are close to the direct simulation results, and also correctly predicts the trend in how this value changes with different cost combinations.

Figure 7
Simulation results on TCF7.

Figure 8 presents a more detailed simulation focusing on short paired-end reads. The An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e291.jpg value reflects the expectation of the variance in insert size for such experiments: a An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e292.jpg value means that all the paired-end reads are expected to have exactly the same insert size; the higher the An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e293.jpg is, the more relaxed are we on the insert size variation (i.e. if the distance of the mapped positions of the two ends of a read on a transcript is within the expected insert size An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e294.jpg a “tolerated” ratio, the paired-end read will be considered “compatible” with the transcript). As we can see from this figure, the higher the An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e295.jpg, the larger An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e296.jpg becomes, corresponding to a worse expected performance of MLE. This can be explained by the fact that a higher An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e297.jpg makes the sampling method less capable of distinguishing highly similar isoforms from each other based on a single paired-end read (e.g. GeneA in Figure 4A). The FIM based heuristic is again able to correctly depict the different trends of MLE performance under different cost combinations and An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e298.jpg settings.

Figure 8
Simulation results on TCF7 with paired-end reads.

We also show the computation time used by brute-force simulation and FIM based heuristic in Table 2. Note that the brute-force simulation is even more computational consuming, mainly because more isoforms are involved in the MLE process. Given the fact that there exist more than An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e303.jpg genes in the human genome and that the simulation has to be rerun for every new experiment to adjust its read counts (the number of reads attributed to a gene region in the experiment), using the FIM based heuristic instead for the purpose of estimating isoform quantification accuracy is obviously a more computationally tractable choice.

Table 2
Total time used by brute-force simulation vs. FIM based heuristic to estimate An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e304.jpg in TCF7.

Application to a model-organism (worm) dataset

To illustrate how we can interpret the An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e310.jpg values output, we further apply our MLE solution to a worm dataset [16][18], which is a well-studied model organism. which is a well-studied model organism. The worm has intermediate complexity in isoform structures. It has isoforms but they are significantly simpler in structure than in human, leading to interpretability in the results. This dataset includes multiple developmental stages, and we were able to compare the results on a same set of isoforms under different conditions. The worm genome contains An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e311.jpg K genes, and the transcripts from each stage are sequenced with An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e312.jpg M short Solexa reads. This dataset is particularly useful for isoform comparison since it contains multiple stages of splicing events that are not overly complex.

Dataset description

Whole transcriptome sequencing data for worm L2, L3, L4 and Young Adult stages, each stage with on average 50 M reads. The annotation set (derived from the modENCODE project, [17], [18]) has 21774 total genes. Of these, 12875 genes has multiple isoforms, with an average of 4.344 isoforms per gene.

Comparison of isoform composition between stages

We first present the isoform quantification results on individual genes in two different stages, early embryo (EE) and late embryo (LE), to briefly illustrate the fact that different genes have different isoform composition differences between stages. Here we use the following formula to measure the difference in isoform composition of the same gene in two different stages:

equation image
(23)

where An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e314.jpg is the total number of isoforms in gene An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e315.jpg.

Figure 9 shows two examples of zero and non-zero Diff values. The reads are plotted below the isoforms, and the numbers associated with the isoforms are their estimated relative abundances based on MLE. Furthermore, if we compute such values for all the genes in these two stages, we can get a histogram of isoform composition differences as illustrated in Figure 10, which characterizes the general isoform composition difference between stages. The distribution of differences in relative isoform composition for genes is shown: Isoform quantification was applied to RNA-Seq data in 4 developmental stages in worm (L2, L3, L4, YA) and Diff score was calculated for each gene in all 6 pairwise comparisons. Because isoform quantification is noisy for genes expressed at very low level, we plotted the distribution for genes that have at least an RPKM value of An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e316.jpg here (RPKM for a gene is the sum of RPKMs of all its isoforms). Red bars represent the average number of genes within the respective Diff score range, while error bars indicate the maximum and minimum numbers. Diff scores close to An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e317.jpg indicate big changes in isoform composition, or the relative expression level of isoforms between stages. The histogram indicates that only a few genes (An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e318.jpg) show dramatic differences in isoform expression between stages. (The number An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e319.jpg is derived from a cutoff of An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e320.jpg on the Diff score.) In Table 3, we include a classification of the structural difference (5′ UTR, 3′ UTR, alternative exon, etc.), between the dominant transcripts in such genes with different isoform compositions. When the different dominant isoforms from a gene differ in two aspects, we assign An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e321.jpg to each category. As shown in this table, many of the structural differences are due to either Distinct 5′ UTR or Overlapping 3′ UTR. We have also included in the supplementary website (http://archive.gersteinlab.org/proj/rnaseq/IQSeq) the genes with stage-wise isoform composition differences, ranked by their FIM based estimation variances, and with a thresholded on Diff score and RPKM at An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e322.jpg.

Figure 9
Isoform composition in two stages.
Figure 10
Diff of all genes across four stages.
Table 3
Classification of different isoform composition between stages.

The effect of different isoform sets on MLE result

We also investigate how different isoform sets (e.g. with a major/minor isoform missing, with an additional “dummy” isoform) will affect the MLE result, especially in terms of the maximized likelihood value. We pick gene No. An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e323.jpg as a base isoform set, using the same set of reads and the per-read average maximized likelihood value An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e324.jpg to measure the goodness of fitting:

equation image
(24)

As we can see from Figure 11, the An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e326.jpg value always decreases when we modify the “true” isoform set in an unfavorable fashion. This shows that the likelihood score is an effective metric for ranking isoform sets for a particular gene. We observe that the An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e327.jpg value decreases the most in all cases when the dominant isoform is removed from the isoform set Figure 11b), which indicates that a more important element has a larger contribution in explaining the generated reads. Correspondingly, when a low-probability or dummy isoform (that is not similar to the dominant one) is added to the input isoform set, the An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e328.jpg value decreases less significantly (about half of the case with dominant isoform removal), and also the isoform quantification results remain almost unchanged for the other transcripts in the isoform set. In practice, this characteristic can also be useful to eliminate non-existing isoforms - any isoform that has little effect on either quantification result or An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e329.jpg score can be considered “not important” for explaining the observed reads, and can thus be removed from the isoform set when analyzing a particular dataset.

Figure 11
Gene 7649: Leave out one isoform, or add a “dummy” isoform.

Use of empirical An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e330.jpg function

To illustrate how non-uniform An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e331.jpg function works, we modeled the bias of RNA-Seq data by aggregating signal of mapped reads along annotated transcripts. A signal map of the first base of mapped reads was generated. The signal was subsequently mapped onto the transcript and aggregated for all genes with signal isoform. An aggregation plot of such a signal map for Young Adult is shown in Text S1. In this aggregation, each transcript is divided evenly into 100 bins, with the signals normalized by the sum of signals across all the bins. The normalized signal at each bin thus represents the probability that a read is generated at certain position of the transcript. These non-uniform probabilities gave more realistic estimation of how the reads are generated, and were plugged into the EM calculations. We compared the quantification results for Young Adult worm with uniform and non-uniform G function. The Pearson correlation score for relative abundance is An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e332.jpg, and score for absolute abundance is An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e333.jpg. The results are similar for other stages. For the majority of genes, the isoform quantification is largely dependent upon whether reads are compatible with the different isoforms of the gene, while the subtle differences in start position probabilities have little influences on final estimation results. Only for a few genes where the isoform structures are highly similar to each other, the quantification results are different.

Also, there have been some recent works [19], [20] studying the sequencing biases in RNA-Seq data, with more sophisticated modeling utilizing local sequence composition at different positions along the transcript. Based on different assumptions on sequencing bias, their results can be plugged into the An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e334.jpg function to derive more realistic quantification results.

Comparison with existing tools

In order to understand the performance of our method compared with other existing tools, we have conducted additional computational analysis by applying IQSeq and Cufflinks on An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e335.jpg samples from MAQC-3 data [21]. We summarize our result in Figure 12. The genes are categorized by their number of isoforms, and Pearson correlations of the estimated isoform level RPKMs (in logarithmic scale) from the two methods are calculated for each category in each sample. The overall correlation of the isoform quantification results from these two methods is An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e336.jpg across all samples, which indicates a similar characteristic with a near uniform read-generation assumption. Also, both outputs from IQSeq and Cufflinks have An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e337.jpg correlation with the Taqman assay [22] (an qRT-PCR technique, which can be considered as a gold-standard here). These results confirm consistency of our method with previous work. Note, however, that with our method one can readily “plug-in” more practical read generation models as illustrated in the previous section, making it a more flexible tool to handle and integrate data from different sequencing technologies. Also, we compared the isoform quantification variances between replicates with the FIM based variance estimations, and their logarithmic values have a correlation of An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e338.jpg (Figure 3 in Text S1).

Figure 12
Comparison between IQSeq and Cufflinks.

Discussion

In this paper we explore the problem of integrating different sequencing techniques to quantify the relative abundance of different isoform transcripts, which can be generalized to the problem of estimating the distribution based on partial samples from different sampling techniques. We first introduce a statistical framework to model the generative process of the partial samples, using a “plugin-able” function An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e339.jpg to allow flexible incorporation of different sampling characteristics, and then present the original problem as a maximum likelihood estimation (MLE) problem, with an iterative solution based on expectation maximization, which guarantees a locally optimum answer. This provides a solution to the question of estimating a distribution based on partial samples.

In order to further investigate the problem involving partial samples, we introduce a heuristic based on the Fisher information matrix (FIM) to estimate the variance of the previously presented MLE solution. Also, in order to accelerate the computation of this measurement, we introduce the concept of equivalent partial samples and develop a fast algorithm, Algorithm 3, to accurately calculate FIM, achieving a speedup of An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e340.jpg times compared to the brute-force method. Simulation results on both hypothetical and real gene models also show that our FIM-based heuristic gives a good approximation to the value of An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e341.jpg, and accurately predicts the numeric order of this value under different conditions. With this metric, we are also able to demonstrate examples of how to efficiently find low-cost combinations of different sampling techniques to best estimate the isoform compositions in RNA-Seq experiments. Although we are only using individual genes as examples, once we have good assumptions of expression levels of different genes, this procedure can be generalized to all the genes for the low-cost design of actual whole genome RNA-Seq experiments.

What is more, by applying the MLE method to a worm RNA-Seq dataset, we illustrate how we can compare the differential isoform composition between different developmental stages, and how different isoform sets (e.g. with a major/minor isoform missing, with an additional ‘dummy” isoform) will affect the MLE result, especially in terms of the maximized likelihood value, showing that the likelihood score is an effective tool for ranking the “fitness” of isoform sets for a particular gene.

Since IQSeq estimates isoform quantity within a probabilistic framework, it does not directly determine the existence of a certain isoform transcript in the data, but rather gives probability measures (An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e342.jpg) and corresponding RPKM values. The result of a secondary experiment with high precision, e.g. qPCR, on a smaller set of genes, can be used as a gold standard dataset to assist answering such existence questions, with either a simple RPKM value threshold that maximizes the prediction accuracy on the gold standard (training) dataset, or more sophisticated classification techniques that takes multiple characteristics (e.g. An external file that holds a picture, illustration, etc.
Object name is pone.0029175.e343.jpg, overall gene expression, FIM-based variance estimation) into account.

The FIM-based variance we are trying to estimate in the proposed algorithm focuses mainly on the expected estimation variance based on different read sets of similar on a same sample, and is a measurement of estimation accuracy. In the case of read sets from different biological replicates, the variances of interest there are usually the actual differences in isoform composition of particular genes between/among the replicates, and analysis on such differences can be generally conducted as a downstream procedure after the isoform quantification calculation.

As sequencing technologies constantly evolve, IQSeq will remain able to provide integrated analysis of different datasets with their own sequencing characteristics, and provide guidelines for optimal RNA-Seq experiment design.

Supporting Information

Text S1

Supplementary material including additional derivation of formulas, proof of lemmas and theorems, and analysis results.

(PDF)

Footnotes

Competing Interests: The authors have declared that no competing interests exist.

Funding: These authors have no support or funding to report.

References

1. Gerstein MB, Bruce C, Rozowsky JS, Zheng D, Du J, et al. What is a gene, post-encode? history and updated definition. Genome Res. 2007;17:669–681. [PubMed]
2. Berget SM, Moore C, Sharp PA. Spliced segments at the 5′ terminus of adenovirus 2 late mrna. Proc Natl Acad Sci U S A. 1977;74:3171–3175. [PubMed]
3. Chow LT, Gelinas RE, Broker TR, Roberts RJ. An amazing sequence arrangement at the 5′ ends of adenovirus 2 messenger rna. Cell. 1977;12:1–8. [PubMed]
4. Gelinas RE, Roberts RJ. One predominant 5′-undecanucleotide in adenovirus 2 late messen- ger rnas. Cell. 1977;11:533–544. [PubMed]
5. Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, et al. Gencode: producing a reference annotation for encode. Genome Biol. 2006;7(Suppl 1):S4.1–S4.9. [PMC free article] [PubMed]
6. Xing Y, Yu T, Wu YN, Roy M, Kim J, et al. An expectation-maximization algorithm for probabilistic reconstructions of full-length isoforms from splice graphs. Nucleic Acids Res. 2006;34:3150–3160. [PMC free article] [PubMed]
7. Jiang H, Wong WH. Statistical inferences for isoform expression in rna-seq. Bioinformatics. 2009;25:1026–1032. [PMC free article] [PubMed]
8. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, et al. Transcript assembly and quantification by rna-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 2010 [PMC free article] [PubMed]
9. Richard H, Schulz MH, Sultan M, Nrnberger A, Schrinner S, et al. Prediction of alternative isoforms from exon expression levels in rna-seq experiments. Nucleic Acids Res. 2010;38:e112. [PMC free article] [PubMed]
10. Lacroix V, Sammeth M, Guigo R, Bergeron A. WABI '08: Proceedings of the 8th international workshop on Algorithms in Bioinformatics. Berlin, Heidelberg: Springer-Verlag; 2008. Exact transcriptome reconstruction from short sequence reads. pp. 50–63.
11. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society Series B (Methodological) 1977;39:1–38.
12. Schervish MJ. Theory of Statistics. Springer; 1995.
13. Van der Vaart AW. Asymptotic Statistics. Cambridge University Press; 1998.
14. Du J, Bjornson RD, Zhang ZD, Kong Y, Snyder M, et al. Integrating sequencing technologies in personal genomics: optimal low cost reconstruction of structural variants. PLoS Comput Biol. 2009;5:e1000432. [PMC free article] [PubMed]
15. Heber S, Alekseyev M, Sze SH, Tang H, Pevzner PA. Splicing graphs and est assembly problem. Bioinformatics. 2002;18(Suppl 1):S181–S188. [PubMed]
16. Hillier LW, Reinke V, Green P, Hirst M, Marra MA, et al. Massively parallel sequencing of the polyadenylated transcriptome of c. elegans. Genome Res. 2009;19:657–666. [PubMed]
17. Celniker SE, Dillon LAL, Gerstein MB, Gunsalus KC, Henikoff S, et al. Unlocking the secrets of the genome. Nature. 2009;459:927–930. [PMC free article] [PubMed]
18. Gerstein MB, Lu ZJ, Nostrand ELV, Cheng C, Arshinoff BI, et al. Integrative analysis of the caenorhabditis elegans genome by the modencode project. Science. 2010;330:1775–1787. [PMC free article] [PubMed]
19. Li J, Jiang H, Wong WH. Modeling non-uniformity in short-read rates in rna-seq data. Genome Biol. 2010;11:R50. [PMC free article] [PubMed]
20. Hansen KD, Brenner SE, Dudoit S. Biases in illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res 2010 [PMC free article] [PubMed]
21. Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normal- ization and differential expression in mrna-seq experiments. BMC Bioinformatics. 2010;11:94. [PMC free article] [PubMed]
22. Shi L, Reid LH, Jones WD, Shippy R, et al. Consortium MAQC. The microarray quality control (maqc) project shows inter- and intraplatform reproducibility of gene expression measure- ments. Nat Biotechnol. 2006;24:1151–1161. [PMC free article] [PubMed]

Articles from PLoS ONE are provided here courtesy of Public Library of Science