Transcriptome sequencing with massively parallel, next-generation sequencing technologies provides a high-throughput method for rapidly generating millions of reads for expressed sequences from different RNA samples [3]. These reads can be used to verify predicted genes and discover new transcription products, and expression levels can be quantified by directly counting the reads that map to different genes and transcript variants. In addition, single nucleotide polymorphisms (SNPs) in both the coding and noncoding regions of expressed genes can be identified by comparing the sequences of multiple matching reads with reference sequences [8]. In the past year these new technologies have already provided radically new insights into the transcriptional complexity of a number of different organisms, from bacteria [9] to yeast [10] and man [5]. As the new sequencing technologies strive toward the capability of sequencing the entire human genome for $1000, these approaches will become routine tools for every facet of gene expression analysis.
In this paper we have focused on the quantification of gene expression levels and have compared the results with DNA microarrays and quantitative RT-PCR (QRTPCR) using the Microarray Quality Control (MAQC) reference RNA samples [1]. We have generated 3.6 million reads of average length 250 bp on Roche's 454 GS FLX for the MAQC A and B reference RNA samples. Using the ExpressSeq pipeline, these reads can be easily mapped to the human RefSeq genes on a Windows desktop computer, and gene expression levels can be conservatively determined by simply counting the numbers of unique hits with e-values ≤ 10^-20.
Using the MAQC quality metrics for evaluating the reproducibility, sensitivity, specificity, and accuracy of the gene expression analysis, we have demonstrated that these ExpressSeq results already compare favourably with the results for DNA microarrays and QRTPCR in the MAQC studies. Moreover, a simple Poisson error model for the random shotgun sequencing process describes how these metrics systematically improve with increased numbers of reads.
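To illustrate why the metrics improve with sequencing depth, the following is a minimal sketch of a generic Poisson counting model, not a reproduction of the exact formulation used in this work. It assumes that the read count for a gene is Poisson distributed, so that a count n has standard deviation sqrt(n), and it uses a first-order (delta method) approximation for the error of a log2 expression ratio between two samples. The function names and the example counts are hypothetical.

```python
import math


def poisson_cv(count):
    """Relative counting error (coefficient of variation) of a Poisson count.

    For a Poisson variable the variance equals the mean, so the relative
    error of a count n is sqrt(n) / n = 1 / sqrt(n).
    """
    return 1.0 / math.sqrt(count)


def log2_ratio_with_error(n_a, n_b, total_a, total_b):
    """Log2 expression ratio of sample A over sample B with its standard error.

    Counts are normalised by the total number of mapped reads in each sample.
    The standard error is a delta-method approximation that treats n_a and
    n_b as independent Poisson counts (an assumption of this sketch).
    """
    ratio = (n_a / total_a) / (n_b / total_b)
    log2_ratio = math.log2(ratio)
    std_err = math.sqrt(1.0 / n_a + 1.0 / n_b) / math.log(2)
    return log2_ratio, std_err


if __name__ == "__main__":
    # Quadrupling the reads on a gene halves the relative counting error.
    for n in (25, 100, 400):
        print(n, round(poisson_cv(n), 3))

    # Hypothetical gene with 200 reads in A and 100 reads in B,
    # each sample having one million mapped reads.
    print(log2_ratio_with_error(200, 100, 1_000_000, 1_000_000))
```

Under this model the relative error of a count falls as the inverse square root of the number of reads, so genes with few reads dominate the uncertainty and benefit most from deeper sequencing.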
The presence of repeated sequences arising from homologous gene families and from LINE, SINE, and Alu elements in untranslated regions of the transcriptome poses a significant challenge to the accurate quantification of gene expression by read counts. This problem is well appreciated by the DNA microarray and QRTPCR communities, which have tried to avoid it by carefully designing gene probes and primers. Here we have tried to minimize the problem by counting each read only once. However, quantitative errors from the misassignment of such reads are inevitable. Better strategies for dealing with reads with multiple alignments are required. For example, reads with multiple equivalent alignments to the genome could be eliminated altogether, or reads could be counted only when they map to gene coding regions or to known exons [13].
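The two strategies discussed above, counting each read once and eliminating reads with multiple equivalent alignments, can be sketched as follows. This is an illustration under stated assumptions, not the ExpressSeq implementation: the input is assumed to be a list of hypothetical (read_id, gene_id, e_value) tuples, "equivalent" alignments are taken to be those tied for the best e-value, and the tie-break used when ambiguous reads are kept is an arbitrary placeholder.

```python
from collections import Counter

EVALUE_CUTOFF = 1e-20  # threshold quoted in the text


def count_reads(hits, drop_ambiguous=True):
    """Count reads per gene, using each read at most once.

    hits: iterable of (read_id, gene_id, e_value) tuples (hypothetical format).
    drop_ambiguous: if True, discard reads whose best e-value is shared by
        more than one gene; if False, assign such reads to a single gene
        chosen arbitrarily (alphabetically first), which can misassign them.
    """
    best = {}  # read_id -> (best e-value, set of genes achieving it)
    for read_id, gene_id, e_value in hits:
        if e_value > EVALUE_CUTOFF:
            continue  # hit is not significant enough to be counted
        if read_id not in best or e_value < best[read_id][0]:
            best[read_id] = (e_value, {gene_id})
        elif e_value == best[read_id][0]:
            best[read_id][1].add(gene_id)

    counts = Counter()
    for _, genes in best.values():
        if len(genes) == 1:
            counts[next(iter(genes))] += 1
        elif not drop_ambiguous:
            counts[min(genes)] += 1  # placeholder tie-break, see docstring
    return counts


if __name__ == "__main__":
    example_hits = [
        ("read1", "GENE_A", 1e-50),  # clear best hit
        ("read1", "GENE_B", 1e-30),
        ("read2", "GENE_A", 1e-40),  # tied between two homologous genes
        ("read2", "GENE_B", 1e-40),
        ("read3", "GENE_C", 1e-5),   # fails the e-value cutoff
    ]
    print(count_reads(example_hits))
    print(count_reads(example_hits, drop_ambiguous=False))
```

Dropping ambiguous reads trades sensitivity for specificity: members of homologous gene families lose counts, but no read is credited to the wrong gene.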
We have also used the AceView alignment software [18] to co-align these 3.6 million reads for the A and B samples to the human genome and identify over 20,000 new exon junctions in the human transcriptome. Although these candidate exon junctions may be less common than the well-annotated RefSeq junctions, they may still play an important role in determining the diversity of biological phenotypes. These results indicate that there is still a significant amount of biology to be uncovered using transcriptome sequencing.
At present, genomics researchers can choose among several next-generation sequencing platforms for transcriptome sequencing that generate either long reads (250 bp–400 bp; Roche GS FLX) or short reads (~30 bp; Illumina Genome Analyzer and ABI SOLiD). The larger numbers of reads generated by the short-read platforms provide greater sequencing depth and should provide better reproducibility, sensitivity, specificity, and accuracy for the measurement of differential gene expression. However, the longer reads are easier to map and to assemble. Consequently, the long reads should be better for the discovery and assembly of novel genes and splice variants.
For example, three papers have recently appeared with deep sequencing results for transcriptomes of different human cell lines [21] and tissues [22] using the Illumina Genome Analyzer. Although these studies generated between 16 million [21] and 435 million [22] reads for expressed transcripts, they identified only between 4,096 [21] and 11,099 [23] new exon junctions not contained in the human Ensembl and RefSeq databases by aligning the short, 32 bp reads to databases of predicted splice junctions. In particular, the largest study [22], with more than 400 million reads, discovered only 114 new "isolated" cassette exons in all of the 15 human tissues and cell lines examined, as compared with the 912 new cassette exons found here by aligning 3.6 million long reads for the MAQC samples directly to the human genome.
A remaining challenge is the proper assembly of the transcribed exons into the full-length alternative transcripts [24]. This will require preservation of the identity of the transcribed strand to distinguish reads from overlapping transcripts (which can be accomplished using a modification of the TSEQ protocol), and longer read lengths and/or paired-end reads will be necessary to link all of the pieces together. Fortunately, the new GS FLX Titanium upgrade now provides reads averaging 400 bp, and all of the new sequencing platforms have protocols for paired-end reads to bridge across the full-length transcripts. As read lengths and throughput continue to increase and new sequencing platforms [25] and software tools [28] for mapping the reads emerge, the combination of the long- and short-read technologies may be most effective for exploring the complexity of the transcriptome, with the long reads used for gene discovery and assembly and the short reads for confirmation and quantification [22].