In large-scale clinical studies, most existing data sets focus on gene expression profiles; however, human transcriptome is undoubtedly more complex. While human genome encodes only 20,000~25,000 genes
[1], alternative splicing allows a single gene to produce multiple transcripts and subsequently multiple proteins that greatly increases protein diversity and their functions
[2]. In addition, more than 90% of genes are shown to undergo alternative splicing
[3],
[4], and many disease-causing mutations introduce alternative mRNA transcripts
[5]. It is, therefore, of great importance to effectively measure the levels of gene isoforms in human health and diseases. High-throughput RNA sequencing (RNA-Seq) provides unparalleled dynamic ranges and specificity for transcriptome analysis, while its sample throughput and cost are being improved continuously. An emerging approach for large-scale clinical studies is, therefore, to first sequence with a sufficient depth to comprehensively identify the mRNA transcriptome of the disease, followed by the design of customized microarrays targeting these transcripts as well as by high-throughput screening of thousands of patient samples using the arrays
[6].
In addition to providing essential references for array design, the discovery and annotation of new transcripts are also critical to the analysis of both array and RNA-Seq data. The successful deconvolution of isoform expression levels of a gene from microarray data requires a ‘complete’ exon structure not missing any major isoforms of the gene
[7]. Similarly statistical inference of isoform expression and their changes from RNA-Seq data also relies on prior annotations of gene transcripts
[8]. Currently, the UCSC genome browser includes annotations of human mRNA transcripts from RefSeq
[9], UCSC Known Genes
[10], and Ensembl
[11] (35,971, 77,614, and 143,123 respectively as of June 21, 2010). While each database uses different but overlapping criteria to curate existing sequencing evidences, many new gene isoforms remain to be discovered and catalogued from high-throughput sequencing data.
While generating long-read expressed sequence tags (ESTs) using Sanger sequencing was traditionally the major experimental approach to discover new isoforms of mRNA transcripts
[12], the second generation sequencing technologies produce many millions of short reads from mRNA, making it possible to comprehensively analyze the entire transcriptome
[13]. Data sets of RNA-Seq have been rapidly accumulated for multiple mammalian tissues
[4],
[14], various cancers
[15], and individual patients
[16], which brings the opportunity to discover new mRNA transcripts, i.e. gene isoforms, to a genome-wide scale.
However, currently the reads generated by either Illumina or ABI platforms are typically 50~100 bases long, much shorter than the full length of mRNA transcripts. For example, a survey of the NCBI Sequence Read Archive (Dec. 1, 2011) indicates that among 4,653 RNA-Seq experiments 84% have read lengths shorter than 100 bp and 76% are single-end reads
[17]. In addition, gene transcripts are often not fully covered by uniquely mapped reads of RNA-Seq data because of either the low abundance of the transcripts, the low amplification efficiency of the transcript regions in sample preparation, the insufficient depth of typical sequencing runs, or the high sequence homology. Therefore, the discovery of new transcript isoforms from RNA-Seq data becomes a bioinformatic problem: how to reconstruct a full-length mRNA transcript from short reads by inferring exons and exon-exon junctions that are incompletely covered or completely missing in the sequencing.
The limitations of read length and coverage completeness make it difficult to apply conventional assembly or reconstruction methods to RNA-Seq data. Many short read assemblers for second generation sequencing data construct contigs by extending consensus of overlapping sequencing reads
[18],
[19],
[20],
[21]. However, the incomplete coverage of RNA-Seq data significantly limits the length of the reconstructed transcripts, and these short read assemblers commonly target to build sequence contigs rather than full-length transcripts. Other constructors such as ExonWalk
[22],
[23], which was extensively utilized in the design of exon and junction arrays
[6],
[7], build full-length transcripts by walking through exons, which is more suitable for long sequencing reads such as ESTs that completely cover several consecutive exons of the transcripts.
While the objective survey of transcriptome by RNA-Seq provides valuable and yet incomplete information on previously unknown transcripts, prior knowledge on annotated transcripts from public databases can compensate the missing information in RNA-Seq data. For example, if sequencing reads alone only identify one of the two junctions of an exon, the other one can be defined by previously known exons sharing the same junction.
In this paper, we propose a computational approach using previously developed SpliceMap
[24] as well as ExonMap and JunctionWalk algorithms proposed here to reconstruct new transcripts from RNA-Seq data and prior knowledge on annotated transcripts from existing databases. The proposed method was demonstrated on an RNA-Seq data set of 203 million sequencing reads of 58 bases from human reference muscle tissue
[6]. Using strict filtering criteria to control false positives, we identified 2,973 new junctions, 7,471 new exons, and 7,571 new transcripts that were not annotated in RefSeq. These new findings can be included in the future array design and the analysis of alternative splicing in large-scale transcriptome studies
[6].