The advent of second generation sequencing (SGS) has opened up a new era of genome-wide and transcriptome-wide research. Currently, a single lane of an SGS instrument such as the HiSeq® instrument from Illumina can generate 10^8 short reads (SRs) of length up to 200 bp with a low error rate (2%) [1]. While the high read count of SGS allows for accurate quantitative analysis, the relatively short length of the reads greatly reduces the utility of SGS in tasks such as de novo genome assembly and full length mRNA isoform reconstruction. Given RNA-seq data with SRs only, the reconstruction of gene isoforms must rely on assumptions. For example, SLIDE [2] uses statistical modeling under a sparsity assumption, and Cufflinks [3] imposes solution constraints. These assumptions are not supported by direct experimental evidence and may induce many false positives.
Recently, a third generation sequencing (TGS) technology capable of much longer reads has become available. The PacBio RS can yield reads with an average length of over 2,500 bp, and some reads can reach 10,000 bp [4]. These continuous long reads (CLRs) can capture large isoform fragments or even full length isoform transcripts. These CLRs tend to have a high error rate (up to 15%). The sequencing accuracy can be greatly improved by an approach called “circular consensus sequencing” (CCS), which uses the additional information from multiple passes across the insert to achieve higher intramolecular accuracy. However, the requirement of two or more full passes across the insert for CCS read generation limits the insert size to <1.5 Kb for Pacific Biosciences’ C2 chemistry and sequencing mode, which does not allow the interrogation of extremely long transcripts by CCS reads. Furthermore, the number of reads from the PacBio RS is only in the range of 50,000 per run. This relatively modest throughput makes it difficult to obtain full sampling of the transcriptome.
Some researchers have attempted to combine PacBio long reads and SGS short reads, for example in the genome assembler Allpaths-LG [5]. In this paper, we introduce an error correction approach that combines the strengths of SGS and TGS for the task of isoform assembly from RNA-seq data. In particular, we use a homopolymer compression (HC) transformation as a means to allow accurate alignment of SRs to long reads (LRs). The HC transformation has previously been shown to be useful in seeking possible alignment matches of pyrosequencing reads (454 platform) [6]. Since the SRs have a lower sequencing error rate, each LR can be modified based on information from the aligned SRs to form a “corrected” LR with a much lower error rate than that of the original LR.
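To make the HC transformation concrete, the following is a minimal Python sketch. It assumes the usual meaning of homopolymer compression, in which each run of identical consecutive bases is collapsed into a single base. The function names and the example reads are hypothetical illustrations and are not taken from the implementation described in this paper.

```python
from itertools import groupby


def homopolymer_compress(seq):
    """Collapse each run of identical bases into a single base.

    Returns the compressed sequence together with the length of the run
    that each compressed base replaced, so the original can be restored.
    """
    bases = []
    run_lengths = []
    for base, run in groupby(seq):
        bases.append(base)
        run_lengths.append(sum(1 for _ in run))
    return "".join(bases), run_lengths


def homopolymer_expand(compressed, run_lengths):
    """Invert homopolymer_compress using the stored run lengths."""
    return "".join(base * n for base, n in zip(compressed, run_lengths))


# Hypothetical example: the long read differs from the short read only in
# the lengths of two homopolymer runs (CC -> C and TTT -> TTTT).
short_read = "ACCGTTTA"
long_read = "ACGTTTTA"

sr_hc, sr_runs = homopolymer_compress(short_read)
lr_hc, lr_runs = homopolymer_compress(long_read)

# Both reads map to the same string in HC space, so they align exactly
# there even though the raw sequences differ.
print(sr_hc, lr_hc)
```

In this sketch the two reads become identical after compression, which illustrates why errors that only change the length of a homopolymer run no longer hinder alignment in HC space. The stored run lengths allow the original sequence to be recovered after alignment.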