Because of the large number of reads it yields, 454 DNA sequencing is effective in revealing the expression of a large number of genes and has great potential for discovering rare or novel transcripts [21], including in non-model organisms for which few EST sequences are available [16]. Combining pyrosequencing of pooled samples, derived from the tissues or conditions to be analyzed in detail, with the generation of a specific microarray based on the resulting sequence information therefore has the potential to allow large-scale expression analysis of the majority of genes expressed in those tissues or conditions, also in non-model organisms [15].
To date, random pyrosequencing of cDNAs is still unable to accomplish de novo assembly for a solid gene reconstruction and transcriptome characterization, and data produced by this approach have so far been used largely in sequenced genomes to refine annotated gene structures or to propose novel gene models [14]. It has been suggested that 3'-cDNA 454 sequencing can enable resolution of a catalog of unique transcripts, eliminating the overestimation associated with shotgun sequencing of multiple non-overlapping 454 ESTs per transcript [20]. We therefore produced pooled libraries enriched for 3'-cDNA ends in order to limit the number of contigs per transcript and, consequently, the redundancy of the probe sets. The specificity of 3'-UTR-based sequence reads should also facilitate unambiguous gene assignment and thus has the potential to allow the identification and analysis of nearly identical paralogous genes, as previously demonstrated [25]. With this approach, the proportion of 454-derived unigenes that mapped to a unique location on the grape genome was very high, similar to or even better than that of the TCs comprised in the VvGI. It was also clearly higher than for unigenes identified by random 454 sequencing of cDNA [24]. Furthermore, as the 454 reads were derived from only one strand, the resulting sequences have a known orientation.
A challenge for any EST project is obtaining sufficient coverage of less abundant transcripts [24]. As the aim of this study was to maximize the number of genes represented in the 454-derived EST catalog, we evaluated the potential advantage of cDNA library normalization. In previous work, normalization was applied to both model and non-model organisms, and it was recently reported that normalization may have little influence on the efficiency of gene discovery when working with thousands of reads from a single tissue type [17]. However, normalization has so far not been performed on 3'-cDNA libraries used in 454 sequencing, as those studies always aimed to assess relative gene expression as well. We observed here that normalization increased the number of contigs assembled and reduced the average number of reads per contig, clearly limiting the over-representation of abundant transcripts. Furthermore, normalization dramatically improved the sampling of rare transcripts, as revealed by the higher number of contaminant fungal sequences found in the N library.
To demonstrate the high quality of the information obtained by this approach, and that it can be used successfully to build a microarray, we compared the effectiveness of the 454-derived unigene sets for oligonucleotide probe design with that of the 33,638 TCs assembled from the 347,879 Sanger-based ESTs included in release 6.0 of the Vitis vinifera Gene Index. With just half of a 454 sequencing run of a normalized 3'-cDNA library, we could develop a microarray that can recognize 21,846 genes (15,606 for the non-normalized library). By comparison, the microarray designed on the extensive collection of ESTs from the latest release of the VvGI can recognize only 19,398 genes. It should be noted that the 454-derived microarrays also carry a number of probes targeting previously unknown genes, which are neither represented in the VvGI nor predicted from the assembled grape genome, thus revealing a high coverage depth of the grape transcriptome.
In fact, the microarray designed on the unigenes from the normalized library proved to be more informative than one of the most comprehensive grape microarrays available to date, the GrapeArray 1.2 developed by the Italian-French Public Consortium for Grapevine Genome Characterization. This was demonstrated by comparing the performance of the GrapeArray 1.2 with that of the two microarrays designed on the NN and N unigene sets in detecting the expression of genes during grape berry maturation, a process that we are studying extensively by cDNA-AFLP [26], microarray and deep sequencing analyses (unpublished), and that we have adopted as a reference for comparing the expression profiling methodologies currently available. Hybridization with a pool of RNAs from grape berries revealed that the GrapeArray 1.2, which carries 24,562 probes, could detect the expression of 15,556 genes. By comparison, the microarray carrying 17,843 probes designed on the NN unigenes detected the expression of 14,115 genes, 1,251 of which were previously unknown. Strikingly, the microarray carrying 29,393 probes designed on the N unigenes detected the expression of 19,609 genes, 3,098 of which were novel. These data confirm the effectiveness of cDNA normalization in increasing the number of genes that can be identified, and show that the proposed method allows genome-wide microarray analyses also in species for which very limited gene information, if any, is available. We anticipate that adaptation to the Titanium upgrade of the 454 platform, which extends the average read length to about 400 bp and increases the number of reads per run to 1.2 million, will further strengthen the power of this approach.
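For readers who wish to compare the three arrays on a common scale, the hybridization counts reported above can be reduced to simple ratios. The following is only an illustrative sketch: the counts are taken from the text, whereas the class and field names (`ArrayResult`, `detected_per_probe`, `novel_share`) are hypothetical, and the ratio of detected genes to probes assumes that probes and genes correspond roughly one to one, which the text does not state. No count of novel genes is given for the GrapeArray 1.2, so that field is left empty.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ArrayResult:
    """Hybridization summary for one microarray (counts from the text)."""
    name: str
    probes: int                  # probes carried by the array
    detected: int                # genes whose expression was detected
    novel: Optional[int] = None  # detected genes previously unknown, if reported

    @property
    def detected_per_probe(self) -> float:
        # Assumption: roughly one probe per gene, so this is only a
        # rough indicator of how informative the probe set is.
        return self.detected / self.probes

    @property
    def novel_share(self) -> Optional[float]:
        # Fraction of detected genes that were previously unknown.
        if self.novel is None:
            return None
        return self.novel / self.detected


RESULTS = [
    ArrayResult("GrapeArray 1.2", probes=24_562, detected=15_556),
    ArrayResult("NN unigenes", probes=17_843, detected=14_115, novel=1_251),
    ArrayResult("N unigenes", probes=29_393, detected=19_609, novel=3_098),
]


def summarize(results):
    """Return one formatted line per array."""
    lines = []
    for r in results:
        novel = "n/a" if r.novel_share is None else f"{r.novel_share:.1%}"
        lines.append(
            f"{r.name}: {r.detected_per_probe:.1%} detected per probe, "
            f"novel share {novel}"
        )
    return lines


if __name__ == "__main__":
    print("\n".join(summarize(RESULTS)))
```

Under these assumptions, the N array detects about 26% more genes than the GrapeArray 1.2 (19,609 versus 15,556), and roughly 16% of the genes it detects are novel, compared with about 9% for the NN array.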