The concepts of genes and isoforms have evolved and become more complex: the discovery of splicing revealed that the gene was a series of exons, coding for, in some cases, discrete protein domains, and separated by long noncoding stretches called introns. With alternative splicing, one genetic locus could code for multiple different mRNA transcripts (isoform transcripts). This discovery complicated the concept of the gene radically. For instance, as of 2007, the GENCODE annotation 
contained on average
transcripts per locus.
With the recent development of high-throughput RNA sequencing (RNA-Seq) technology, biologists can measure transcription with unprecedented precision. One problem that can now be tackled is isoform quantification, in which one tries to reconstruct the abundances of similar isoforms from a set of RNA-Seq reads. Several statistical frameworks have been proposed to solve this problem. Xing et al. 
proposed a maximum likelihood problem, an expectation maximization solution, and a Fisher information measurement for performance estimation; Jiang et al. 
, assuming a Poisson model, formulated a maximum likelihood problem and its numerical solution, and also used the observed Fisher information matrix to sample the posterior distribution of isoform quantity; Trapnell et al. 
used a variable read-length model (normal distribution by default) and a sampling method similar to 
to derive the posterior distribution of isoform quantity; Richard et al. 
with a Poisson model, also used bootstrapping to study the robustness of their method against non-uniform sequencing effects; Lacroix et al. 
studied the conditions under which the problem can be solved, revealing that although neither single-end nor paired-end sequencing guarantees a unique solution, paired-end reads may be sufficient to resolve the vast majority of transcript variants in practice.
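To make the solvability question concrete, the following toy sketch checks whether isoform abundances can be recovered from expected read counts. It is only an illustration under simplifying assumptions, not the analysis of the cited work: the locus, the feature rows, and the helper `matrix_rank` are all hypothetical, and identifiability is reduced to the question of whether the isoform-by-feature matrix has full column rank. The locus has two independent exon-skipping events (A and B), giving four isoforms.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank of a small matrix by exact Gauss-Jordan elimination (hypothetical helper)."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below the current rank with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

# Columns: isoforms (A+B, A only, B only, neither).
# Rows: features a read can support; 1 means the isoform can produce that read.
single_end_features = [
    [1, 1, 1, 1],  # constitutive exon
    [1, 1, 0, 0],  # exon A included
    [0, 0, 1, 1],  # exon A skipped
    [1, 0, 1, 0],  # exon B included
    [0, 1, 0, 1],  # exon B skipped
]
# Assumed paired-end feature: one mate in exon A, the other in exon B.
paired_end_features = single_end_features + [[1, 0, 0, 0]]

n_isoforms = 4
rank_single = matrix_rank(single_end_features)
rank_paired = matrix_rank(paired_end_features)
print("single-end identifiable:", rank_single == n_isoforms)
print("paired-end identifiable:", rank_paired == n_isoforms)
```

In this toy locus the single-end features only report the inclusion rate of each exon separately, so different isoform mixtures can give the same expected counts; a read pair linking the two events removes that ambiguity.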
These studies, however, have not fully addressed the problem of isoform quantification in two respects. First, they usually assume that only one sequencing technique is used in an experiment, and that the reads are uniformly sampled along the transcripts. These are not necessarily good approximations to real data. Second, while some theoretical results have been presented on estimating the accuracy (e.g. average variance) of quantification results, there does not yet exist a method to compute these measurements efficiently other than brute-force simulation, which is computationally infeasible in large-scale experiments involving tens of thousands of genes and millions of sequencing reads. Fast estimation of quantification accuracy would not only enable researchers to better understand the analysis results being obtained, but would also be useful in RNA-Seq experiment design, to integrate different sequencing technologies in a cost-efficient way.
To fill these gaps, we have developed a generalized statistical solution for the problem of isoform quantification, and a practical implementation in a tool we call IQSeq (Isoform Quantification in next-generation SEQuencing). IQSeq improves on previous work in isoform quantification in the following aspects:
- It has a generalized statistical read generation function for the sequencing process (i.e. a customizable function describing how reads are randomly sampled from isoforms). This provides a flexible way to incorporate characteristics of different sequencing technologies (e.g. 3′ end sequencing bias of transcripts).
- It integrates the analysis of different sample sets generated from different sampling technologies (e.g. long and short reads).
- It has a fast algorithm for estimating the average variance of the results provided by our expectation maximization based solution.
- Given the estimated isoform abundances, IQSeq also provides an information-theoretic method to measure the overall transcriptome complexity.
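The features above can be illustrated with a minimal sketch, which is not the IQSeq implementation. It assumes that each read, whatever its technology, has already been scored by some read generation function, giving P(read | isoform), that all read sets share one abundance vector, and that Shannon entropy stands in for the complexity measure. The names `em_isoform_abundance`, `shannon_entropy`, `uniform_short_read`, and `long_read`, and the two-isoform locus, are hypothetical.

```python
import math

def em_isoform_abundance(read_likelihoods, n_iter=200, tol=1e-10):
    """EM estimate of isoform proportions.

    read_likelihoods[i][k] = P(read i | isoform k), produced by any
    read generation function (assumed to be supplied by the caller).
    """
    n_isoforms = len(read_likelihoods[0])
    theta = [1.0 / n_isoforms] * n_isoforms
    for _ in range(n_iter):
        totals = [0.0] * n_isoforms
        for probs in read_likelihoods:
            # E-step: posterior probability that this read came from each isoform.
            weighted = [t * p for t, p in zip(theta, probs)]
            norm = sum(weighted)
            if norm == 0.0:
                continue  # read is incompatible with every isoform
            for k, w in enumerate(weighted):
                totals[k] += w / norm
        total = sum(totals)
        if total == 0.0:
            break
        # M-step: new proportions are the normalized expected read counts.
        new_theta = [t / total for t in totals]
        converged = max(abs(a - b) for a, b in zip(theta, new_theta)) < tol
        theta = new_theta
        if converged:
            break
    return theta

def shannon_entropy(theta):
    """Entropy in bits of an abundance vector (assumed complexity measure)."""
    return -sum(t * math.log2(t) for t in theta if t > 0)

# Hypothetical locus: isoform 0 includes exon E2, isoform 1 skips it.
isoforms = [{"E1", "E2", "E3"}, {"E1", "E3"}]

def uniform_short_read(exon):
    # Assumed model: a short read falls uniformly on one exon of its isoform.
    return [1.0 / len(iso) if exon in iso else 0.0 for iso in isoforms]

def long_read(exons):
    # Assumed model: a long read spans the whole transcript.
    return [1.0 if set(exons) == iso else 0.0 for iso in isoforms]

short_reads = ["E1"] * 4 + ["E2"] * 2 + ["E3"] * 4
long_reads = [["E1", "E2", "E3"], ["E1", "E3"]]
likelihoods = [uniform_short_read(e) for e in short_reads]
likelihoods += [long_read(r) for r in long_reads]

theta = em_isoform_abundance(likelihoods)
print("estimated proportions:", theta)
print("entropy (bits):", shannon_entropy(theta))
```

Pooling the per-read likelihoods is what lets short and long reads inform the same estimate; swapping in a different read generation function (for example, one with 3′ end bias) changes only how the likelihood rows are computed, not the EM loop.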
In this paper, we first introduce a mathematical definition of the generalized partial sampling and distribution estimation problem (on which IQSeq is based), and provide an expectation maximization based iterative solution. We then discuss in detail how to estimate the performance of this solution using Fisher information based heuristics, and present fast algorithms that implement the computation of these heuristics. Finally, we show results of applying our methods to both simulated and real-world data, illustrating scenarios where such integrated analysis can be the most informative.