Next Generation Transcriptome sequencing of Arabidopsis floral tissue
A half plate of GS20 sequencing from an Arabidopsis random-primed cDNA library generated 134,791 reads totalling 13.8 MB, with an average read length of 102.2 bp. The reads were assembled into 82,281 unigenes, which included 8,188 contigs with an average length of 147 bp and 74,093 singleton reads (Table ). We mapped 122,344 (90.8%) reads to the TAIR 7 Arabidopsis genome annotation (Table ; see Methods). Of the total mapped reads, 88.7% were located within 15,539 genic regions and 2.1% were located in intergenic regions. Within the genic regions, 119,518 (88.7%) reads mapped exactly to known exons, while 1,117 (0.8%) and 11,524 (8.6%) reads mapped to introns and intron/exon boundaries, respectively. Also, 3,066 (2.3%) of the reads included in the genic regions extended current boundaries of known genes, while 302 reads combined two annotated genes or marked areas of the genome with overlapping genes. There were 12,447 (6.7%) reads that did not have a significant BLASTn match to any location within the genome. There were 1,085 genes that had more than 20 reads per locus, and the 10 most highly expressed genes (Table ) included two subunits of the photosynthetic protein RuBisCO, as well as TASTY, TGG1, and PDF1. These "top ten" transcripts had read counts ranging from 190 to 586 reads, with the RuBisCO small subunit 1A being most highly represented. At this shallow sequencing depth, two non-overlapping contigs, with lengths of 357 and 240 bp, mapped to the RuBisCO small subunit 1A gene.
Sequencing statistics of analyzed libraries.
Arabidopsis 454 reads mapped to the annotated genome.
Top 10 most frequently detected unigenes in 454 cDNA libraries of Arabidopsis, Eschscholzia, and Persea.
Despite low overall transcriptome coverage, one-half plate of Arabidopsis GS20 sequence data returned 27 fully sequenced cDNAs, as well as 292, 628, and 1,008 genes at 90%, 80%, and 70% coverage, respectively. These results demonstrate that nominal amounts of 454 sequencing can generate complete or nearly complete sequences for an appreciable number of genes, especially those that are small and highly expressed. Another very promising result is the improved annotation of genes for both model and non-model species. For example, although the Arabidopsis genome has been largely sequenced since 2000 [26], the half plate of GS20 sequence extended the untranslated regions (UTRs) of roughly 3,066 genes and mapped new transcript boundaries in 8,662 genic regions. These regions are possibly new splice variants of previously annotated genes. Finally, 2,826 transcripts were mapped to 2,096 unique intergenic regions. These transcripts might represent un-annotated protein-coding genes or non-coding RNA sequences that have not previously been sampled in traditional cDNA libraries.
Next Generation transcriptome simulation study
A primary goal of large-scale transcriptome sequencing is to identify and obtain full-length sequences of all of the expressed genes in an organism or tissue. A researcher will typically begin with RNAs isolated from a tissue of interest or a collection of tissues from the entire organism. The researcher may use tissue from a particular developmental stage or assay gene expression under a range of experimental conditions (e.g., light/temperature/water/nutrient stress, gene knockout). Each of the new NG technologies (e.g., 454-GS20/FLX, Solexa) produces data with characteristics that can be evaluated and compared to one another and to traditional capillary sequencing.
In order to predict the expected outcomes of varied amounts of sequencing effort using a blend of technologies, we developed a predictive model based on the simulation engine of ESTstat [24]. Inputs to the model include four distribution profiles that reflect information about the cDNA library or sequencing technology: 1) the transcript abundance profile, a transcriptome-specific frequency distribution of the number of tags of different genes in the entire transcriptome; 2) the distribution of cDNA lengths; 3) the distribution of sequencing start sites; and 4) the distribution of read lengths after removal of vector and low-quality data. The first three of these reflect library-specific features, while the fourth is mostly dependent upon the sequencing technology. The ESTstat simulation model has been tested under a variety of situations and found to robustly predict the outcomes of future sequencing experiments. Although ESTstat can estimate and correct assembly errors in silico without reference to a known genome sequence, we were able to map each read to its known location on the Arabidopsis genome to assess and correct assembly error.
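To make the role of these four input distributions concrete, the following Python sketch shows one way such a sampling simulation could be organized. It is not the ESTstat implementation; the distribution shapes, gene count, read-length parameters, and abundance profile below are illustrative placeholders only.

# Minimal sketch (not ESTstat) of a Monte Carlo transcriptome-sequencing simulation
# driven by the four input distributions described in the text.
import random
from collections import defaultdict

def simulate_run(n_reads, abundances, cdna_lengths, read_len_sampler, start_site_sampler):
    """Sample reads and return per-gene sets of covered positions.

    abundances        : per-gene relative expression weights (distribution 1)
    cdna_lengths      : per-gene cDNA lengths in bp (distribution 2)
    start_site_sampler: callable(cdna_len) returning a 5' start position (distribution 3)
    read_len_sampler  : callable returning a trimmed read length in bp (distribution 4)
    """
    genes = range(len(abundances))
    coverage = defaultdict(set)                      # gene -> covered positions
    picks = random.choices(genes, weights=abundances, k=n_reads)
    for g in picks:
        length = cdna_lengths[g]
        start = start_site_sampler(length)
        end = min(length, start + read_len_sampler())
        coverage[g].update(range(start, end))
    return coverage

# Illustrative inputs: 1,000 genes, power-law-like abundances,
# ~1.5 kb mean cDNA length, ~100 bp GS20-like reads, uniform start sites.
random.seed(1)
n_genes = 1000
abund = [1.0 / (i + 1) for i in range(n_genes)]
lengths = [max(300, int(random.gauss(1500, 400))) for _ in range(n_genes)]
cov = simulate_run(
    n_reads=50_000,
    abundances=abund,
    cdna_lengths=lengths,
    read_len_sampler=lambda: max(50, int(random.gauss(100, 15))),
    start_site_sampler=lambda L: random.randrange(L),
)
covered_bases = sum(len(p) for p in cov.values())
print(f"genes tagged: {len(cov)}, transcriptome coverage: {covered_bases / sum(lengths):.1%}")

Repeating such runs over increasing read counts, and over abundance profiles flattened to mimic normalization, yields curves analogous to those reported below.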
We used the results from our GS20 sequencing to simulate different levels of sequencing coverage for each of the NG and capillary technologies. For each technology, we considered both non-normalized and perfectly normalized libraries, in which the expression level of every gene is made identical. Actual normalization experiments should therefore fall somewhere between non-normalized and perfectly normalized, depending on the normalization method, RNA quality, and success of the normalization procedure (see Materials and Methods for more detail). We used the following parameters to help evaluate the different sequencing platforms: transcriptome coverage, percentage of all expressed genes that were tagged, percentage of singletons, number of unigenes, mean unigene length, and the percentage of all expressed genes that were sequenced completely (i.e. 100% covered; Figures , and ).
Figure 3. Simulation results for different Next Generation sequencing technologies, illustrating predicted outcomes for different transcriptome sequencing technologies with a complex library expressing ca. 18,000 genes.
Transcriptome coverage (Figure ) is a direct indicator of the sequencing depth and breadth of sequence data relative to the sample transcriptome. We define the transcriptome coverage as the total non-redundant number of bases from sampled genes that are included in at least one EST, divided by the sum of cDNA lengths for all expressed genes (including both detected and undetected genes in the transcriptome). In this study, the 15,276 detected genes and randomly sampled 3,007 undetected genes (estimated using ESTstat; see Materials and Methods) sum to 18,283 genes, with an expected total cDNA length of 29.8 MB. The transcriptome coverage, as a function of the total number of sequenced bases (MB), differs only slightly for all technologies. However, when the amount of sequence is low (1–500 MB), the transcriptome coverage is greater in the normalized libraries (dashed lines) than in the non-normalized libraries (solid lines) for each technology. Theoretically, perfect normalization will equalize the level of expression for all genes, without any other impact on library quality, and thus will increase the coverage of genes that are randomly sampled. Using the distributions of cDNA length, read length, and sequencing start sites obtained in these experiments, we estimate that traditional 5' capillary sequencing of a non-normalized library will cover approximately 14%, 52%, and 82% of the transcriptome with 6.25, 50, and 200 MB of sequencing, respectively. For a normalized library, the percentage increases to 18%, 69%, and 95% with the same amounts of sequence. The same pattern was observed for the NG technologies but with higher levels of transcriptome coverage. For example, the GS20 technology is estimated to cover 15%, 54%, and 88% of the transcriptome for a non-normalized library and 18.2%, 72%, and 98% of the transcriptome for a normalized library at 6.25, 50, and 200 MB of sequencing. The lower coverage of capillary-based EST sequencing given the same number of sequenced bases is attributed to biases implicit in the cDNA cloning process. The FLX is estimated to cover 15%, 54%, and 88% for the non-normalized library and 18%, 72%, and 98% for a normalized library at the same intervals. Finally, the Solexa platform is estimated to cover 55% and 87% for the non-normalized library and 75% and 98% for the normalized library at 50 and 200 MB, respectively. Given that one plate of sequence data from the Solexa platform is estimated at 1,000 MB, we chose 50 MB (1/20 of a plate) as the first interval to be simulated, and we excluded all intervals less than 50 MB.
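The coverage definition above reduces to a simple ratio, spelled out in the short snippet below. The denominator uses the totals reported in the text (18,283 genes, 29.8 MB of expected cDNA), while the covered-base figure is a hypothetical value chosen purely for illustration.

# Transcriptome coverage = non-redundant bases covered by at least one EST,
# divided by the total cDNA length of all expressed genes (detected + undetected).
total_cdna_bp = 29.8e6          # 18,283 genes, from the text
covered_bp = 4.2e6              # hypothetical non-redundant covered bases
coverage = covered_bp / total_cdna_bp
print(f"transcriptome coverage: {coverage:.1%}")   # ~14% in this made-up case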
Transcriptome coverage differs substantially among the various technologies at the same cost. However, the cost used in this analysis refers only to the actual sequencing costs and not the pre-processing costs such as library preparation and normalization. The Solexa platform rapidly approaches 100% coverage primarily because the cost of sequencing is substantially lower per MB (simulations for Solexa were based on $4,000/plate at 1,000 MB/plate). Solexa is followed by GS20, FLX, and conventional EST sequencing. It is estimated that traditional capillary sequencing would reach 100% transcriptome coverage at more than 200 MB and at a cost of over $200,000. While Solexa sequencing is the most economical technology for deep coverage of transcriptomes, de novo assembly of short Solexa sequences for non-model species remains an unresolved challenge.
A second indicator of the depth of transcriptome sequencing is the percentage of genes tagged (Figure ). A gene is considered tagged if it has been sampled with at least one read. The percentage of genes tagged increases with both the amount of sequencing and its cost. For a non-normalized traditional library, we estimate that 27%, 75%, and 96% of the genes will be tagged in our sample transcriptome with 6.25, 50, and 200 MB of sequencing. For a normalized library, the percentage increases to 39%, 98%, and 100% with the same amounts of sequence. As expected, this percentage increases when the sequencing is done with any of the NG technologies. The cost of gene tagging also differs substantially among the various sequencing technologies. The Solexa platform tags essentially 100% of the expressed genes with less than one plate of sequence ($4,000). Solexa is followed by GS20, FLX, and conventional EST sequencing. Capillary sequencing would approach 100% of genes tagged only at more than 200 MB of sequence and a cost of over $200,000.
The number of unigenes (Figure ), including singletons and contigs, has typically been used to estimate the number of transcribed genes in a tissue. With small amounts of sequencing, the number of unigenes is similar to the number of sequences, but with more sequencing multiple reads are observed for each gene (increasing redundancy), and the rate of discovery for new genes falls off. At a particular point in the sequencing process (peaks in Figure ), the number of unigenes will begin to decrease as disconnected reads coalesce into contigs covering entire genes, and eventually the unigene number approaches the number of genes expressed in the library. The rate at which multiple reads for a gene coalesce into a single contig is a function of read length. With capillary technology, each read is long compared to the NG reads. With a non-normalized library similar to the model library, the peak unigene number is not reached until more than 200 MB of sequencing. With a normalized library, the peak is reached at approximately 100 MB and the unigene count decreases gradually over an additional 100 MB of sequence. However, even then the unigene number does not converge on the estimated 18,000 genes expressed in the Arabidopsis floral library. For the FLX technology, the maximum number of unigenes occurs at roughly 100 MB and 50 MB for the non-normalized and normalized libraries, respectively. However, because the FLX sequences are two to three times shorter than the traditional sequences, the peak is reached with roughly double the number of unigenes (38,000 and 46,000, respectively). For the GS20 platform, the peaks occur at nearly the same levels (approximately 100 MB) as the FLX platform, but since these reads are half as long as FLX reads, the GS20 produces more than twice the number of unigenes (92,000 and 115,000 for the non-normalized and normalized libraries, respectively). The Solexa platform produces many more unigenes at all levels of sequencing, and the peak occurs at approximately 200 MB for both library types (1.3 and 1.7 million unigenes).
The mean unigene length (Figure ) is an important statistic if the goal of the transcriptome sequencing is to perform multi-gene phylogenetic or molecular evolutionary analyses. In this case, researchers would like full-length sequences for many expressed genes, not just small fragments of expressed genes. In the Arabidopsis genome, the average transcript length is approximately 1,500 bp (1,436 bp for all transcripts and 1,628 bp for only the transcripts predicted to be expressed in this library). Therefore, a researcher would like to sequence enough of a library that the assembled unigenes approach the average length of the genes in the library. We calculated the unigene length in two different ways. First, we used the mean length of all unigenes, although the abundance of short unigenes produced by the NG technologies lowers this mean. Second, we calculated the mean length of only the longest unigene for each gene (Figure ). All NG technology and library type combinations require greater depth of sequencing to reach the same level as their traditional counterparts. When we examine the mean unigene length in relation to price, traditional sequencing produces the longest unigenes until approximately $5,000 worth of sequencing. This is approximately 4–5 MB of capillary sequencing and 6,000–8,000 reads. At this point, the NG technologies begin to generate enough sequences to assemble longer unigenes at a lower cost.
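The two unigene-length summaries described above can be made explicit with a small example. The gene identifiers and unigene lengths in the snippet are hypothetical and serve only to show how the two means diverge when many short unigenes map to the same gene.

# Two ways of summarizing unigene length, following the text:
# (1) mean over all unigenes, (2) mean over only the longest unigene per gene.
from statistics import mean

# gene id -> lengths (bp) of the unigenes that mapped to it (hypothetical data)
unigenes_by_gene = {
    "AT1G01010": [540, 220, 180],
    "AT2G02020": [1310],
    "AT3G03030": [760, 410],
}

all_lengths = [L for lengths in unigenes_by_gene.values() for L in lengths]
longest_per_gene = [max(lengths) for lengths in unigenes_by_gene.values()]

print(f"mean of all unigenes:     {mean(all_lengths):.0f} bp")
print(f"mean of longest per gene: {mean(longest_per_gene):.0f} bp")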
The percentage of singleton reads (Figure ) reflects sequencing depth and the likelihood that a given read will assemble to form a contig with other reads. A singleton is defined as a read that does not overlap sufficiently with any other read from the same transcribed gene to be assembled into a contig. The percentage of singletons is also inversely related to the level of redundancy in the library. Therefore, additional sequencing usually reduces the percentage of singletons. This is the case for capillary sequencing, where the percentages of singletons are 73%, 40%, and 16% for non-normalized and 81%, 23%, and 4% for normalized libraries at the 6.25, 50, and 200 MB levels, respectively. For the GS20, these values change to 76%, 48%, and 25% for non-normalized libraries and 80%, 34%, and 7% for normalized libraries at the same levels. For the FLX, the percentage of singletons changes to 74%, 44%, and 22% for non-normalized and to 78%, 29%, and 5% for normalized libraries at the same levels. Finally, for Solexa, the percentage of singletons is predicted to be around 68%, 47%, and 25% for non-normalized and 67%, 32%, and 7% for normalized libraries at the 50, 200, and 1,000 MB sequence intervals, respectively.
The final parameter used to evaluate and compare the technologies is the percentage of genes with 100% coverage (Figure ). As with mean unigene length, gene coverage can be calculated using all of the unigenes per gene or by using only the longest unigene. The shorter reads from the NG technologies may collectively cover every region within a gene; however, many of the reads for a gene will not overlap sufficiently to assemble into a single contiguous sequence. Although we calculated both estimates, we use the percentage of gene coverage based on the longest unigene for comparisons among platforms. In relation to the amount of sequencing (MB), the capillary, GS20, and FLX technologies have similar percentages. The Solexa platform requires more data (MB of sequencing) to fully sequence a similar number of genes. For example, the FLX generates unigenes that completely cover roughly 18% and 58% of the total genes with 200 MB and 1,000 MB of sequence data. The same amounts of Solexa sequencing would fully sequence 4% and 25% of the genes. However, the FLX experiment would cost approximately $18,000 and $90,000, whereas the Solexa data could be generated for roughly $800 and $4,000. Finally, with capillary sequencing, 200 MB would need to be sequenced, at a cost of approximately $250,000, to fully cover 25% of the genes.
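As a final illustration, the longest-unigene definition of full gene coverage amounts to the simple check sketched below; the transcript and unigene lengths are invented for the example.

# A gene counts as fully sequenced when its longest unigene spans the whole
# transcript. Lengths (bp) below are hypothetical.
transcripts = {"geneA": 1500, "geneB": 900, "geneC": 2100}
longest_unigene = {"geneA": 1500, "geneB": 620, "geneC": 1980}

fully_covered = [g for g, L in transcripts.items() if longest_unigene.get(g, 0) >= L]
pct = 100 * len(fully_covered) / len(transcripts)
print(f"genes with 100% coverage: {pct:.0f}%")   # 33% in this toy example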