An accurate description of the ensemble of RNA molecules encoded by a genome is an integral step toward understanding the biology of that genome. Usually it takes years until a relatively complete, cell type-specific transcriptome description becomes available. Here, we show that long-read sequencing technologies can provide a very large part of such a description almost instantaneously and without further input information.
The accuracy of this information is evident from the fact that the vast majority of the partial gene structures we find are already part of the GENCODE v7 annotation, although a large number of novel structures are also found. For lncRNA genes we find a particularly high fraction of novel isoforms, which demonstrates that long-read sequencing can further the understanding of lncRNAs. When comparing our partial gene structures to gene structures predicted from short-read sequencing, however, we find that many of our mappings cannot be reconstructed from short-read cDNA sequencing. Hence, sequencing longer reads adds information to short-read analysis of transcriptomes. Nevertheless, one should note that there are many more short-read-predicted transcripts than long-read-predicted transcripts, which enables statements about genes that the long-read approach did not cover.
A clear advantage of short-read sequencing is that its low cost allows very deep sequencing, thus enabling quantitative statements about gene or isoform expression and exon inclusion. We therefore explored whether one example of such an analysis, cell type-specific exon inclusion, can be performed using long reads. We show that, despite the limited sequencing depth and other biases affecting 454 sequencing, such an analysis can be performed in an annotation-independent manner. The cell type-specific alternative exons defined in this way are highly reliable, as 97% of them coincide with already known exons. However, 30% of these alternative exons were not alternative according to the annotation; hence, our approach can complete the annotation in a cell type-specific manner. The defined alternative exons show characteristics that correspond perfectly to what is known about alternative exons: They show weaker splice sites, are shorter, and, above all, keep the reading frame of the encoded protein in roughly two-thirds of the cases.
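The reading-frame property mentioned above can be illustrated with a minimal sketch. It assumes that an alternative exon lies within the coding sequence and that it "keeps the reading frame" when its length is a multiple of three; the function names and the 1-based, inclusive coordinate convention are illustrative assumptions, not the procedure used in this study.

```python
from typing import Iterable, Tuple

# Hypothetical representation: an exon as (start, end), 1-based and inclusive.
Exon = Tuple[int, int]


def preserves_reading_frame(exon: Exon) -> bool:
    """Return True if including or skipping the exon leaves the downstream
    codon phase unchanged, i.e. the exon length is divisible by three."""
    start, end = exon
    length = end - start + 1
    return length % 3 == 0


def frame_preserving_fraction(exons: Iterable[Exon]) -> float:
    """Fraction of exons whose length is a multiple of three (0.0 if empty)."""
    exon_list = list(exons)
    if not exon_list:
        return 0.0
    kept = sum(preserves_reading_frame(exon) for exon in exon_list)
    return kept / len(exon_list)


# Toy coordinates, invented for illustration only: lengths 90, 60 and 50 nt.
toy_exons = [(101, 190), (201, 260), (301, 350)]
toy_fraction = frame_preserving_fraction(toy_exons)
```

If exon lengths were distributed uniformly over the three possible remainders, about one-third of exons would satisfy this test by chance, which is why a fraction near two-thirds is informative.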
The dataset we present here comprises millions of reads that span multiple exon-exon junctions. However, it falls short of sequencing entire transcripts. Therefore, the question of whether one can accurately predict full-length transcripts from these reads becomes important. We show that, based on our 454 reads, we can predict ~9500 full-length transcripts and that for more than half of these an annotated transcript with identical introns exists. In contrast, <25% of all transcripts predicted from short reads fulfill this criterion. Furthermore, despite the extreme sequencing depth of short reads, more than 2300 annotated full-length transcripts can only be predicted correctly (that is, with every single splice site correct) using 454 reads. Conversely, there are also almost 1900 annotated full-length structures in the regions covered by 454 reads whose exact exon-intron structures are only found based on short-read sequencing. This shows that the read length of long-read sequencing and the depth of short-read sequencing complement each other and that their combination can be used to obtain an accurate description of transcriptomes.
Importantly, new advances in sequencing technology could help reduce the cost of long-read sequencing in comparison to the approach employed here. Coupling isolation of ~450-bp cDNA fragments with 250-bp paired-end MiSeq sequencing could lead to ~450-bp reads and to results that come close to ours. Assuming that Illumina continues to increase read length (from 25 or 36 bp initially to 150-bp paired-end at the time of writing), this could be true for that platform as well. Finally, using the Pacific Biosciences platform, we could obtain much longer reads. Because of the lower throughput of this platform, it might be worthwhile to couple it with capture techniques in order to represent all genes independently of their expression level.
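The criterion used above, that a predicted full-length transcript has an annotated counterpart with identical introns, can be sketched as a comparison of intron chains. This is a minimal illustration under stated assumptions: all transcripts are taken to lie on the same chromosome and strand, exons are 1-based inclusive intervals, and the function names are hypothetical rather than taken from the pipeline used here.

```python
from typing import Iterable, List, Tuple

# Hypothetical representation: a transcript as a list of (start, end) exons,
# 1-based and inclusive, all on one chromosome and strand.
Exon = Tuple[int, int]
Intron = Tuple[int, int]


def intron_chain(exons: List[Exon]) -> Tuple[Intron, ...]:
    """Derive the ordered introns lying between consecutive exons."""
    ordered = sorted(exons)
    return tuple(
        (left_end + 1, right_start - 1)
        for (_, left_end), (right_start, _) in zip(ordered, ordered[1:])
    )


def has_identical_introns(
    predicted: List[Exon], annotated: Iterable[List[Exon]]
) -> bool:
    """True if some annotated transcript has exactly the same intron chain.

    Transcript start and end positions are ignored; only splice sites count.
    Single-exon predictions have no introns and are reported as unmatched.
    """
    chain = intron_chain(predicted)
    if not chain:
        return False
    return any(intron_chain(transcript) == chain for transcript in annotated)


# Toy coordinates, invented for illustration only.
annotation = [[(90, 200), (300, 400), (500, 650)]]
same_introns = has_identical_introns([(100, 200), (300, 400), (500, 600)], annotation)
shifted_site = has_identical_introns([(100, 200), (310, 400), (500, 600)], annotation)
```

Comparing intron chains rather than exon coordinates tolerates differences at transcript ends, while a single shifted splice site is enough to make two structures count as different.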
In summary, these results show that long-read cDNA sequencing is ideal for transcriptome analysis in species for which an annotation is not available or in cases in which relying on an annotation can introduce biases. In species for which an annotation is available, long-read cDNA analysis can successfully complete the annotation and complement short-read sequencing analysis.