We have shown here that array-based normalization of RACE reactions is a very efficient strategy for discovering previously unknown transcript variants of protein coding loci. Our experiment yielded about one novel transcript variant per 10 clones sequenced. Although the RACEfrags assayed here for the nine tested genes were selected from a subpopulation of high-confidence RACEfrags, a large fraction (60%) of the loci interrogated have at least one such RACEfrag assigned. We do not know of any other strategy that is able to survey the transcriptional diversity of protein coding loci with this level of detail and efficiency. We have compared the results of hybridizing RACE reactions to tiling arrays with those from other high-throughput transcript interrogation surveys using distinct technologies: CAGE tags [22], GIS PET ditags [11] and ESTs (Supplementary Table 2). While there is significant overlap between our RACEfrags and the sites of transcription detected by these other technologies, a substantial number of RACEfrags (about 17% of those detected across all the experiments performed here) are not detected by the alternative methods.
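The comparison with other technologies described above amounts to an interval-overlap test. The following is a minimal sketch of such a test, assuming that RACEfrags and the features from each other technology are available as (chromosome, start, end) tuples in half-open coordinates. The function names and the toy coordinates are hypothetical and are not taken from the study.

```python
from collections import defaultdict

def build_index(features):
    """Group (chrom, start, end) features by chromosome."""
    index = defaultdict(list)
    for chrom, start, end in features:
        index[chrom].append((start, end))
    return index

def overlaps_any(frag, index):
    """True if the fragment overlaps at least one indexed feature."""
    chrom, start, end = frag
    # Half-open intervals [start, end) overlap when each starts before the other ends.
    return any(s < end and start < e for s, e in index.get(chrom, []))

def fraction_undetected(racefrags, other_feature_sets):
    """Fraction of RACEfrags that overlap no feature from any other technology."""
    indexes = [build_index(features) for features in other_feature_sets]
    undetected = [
        frag for frag in racefrags
        if not any(overlaps_any(frag, idx) for idx in indexes)
    ]
    return len(undetected) / len(racefrags)

# Toy data (hypothetical coordinates, for illustration only).
racefrags = [("chr21", 100, 200), ("chr21", 500, 600),
             ("chr22", 50, 80), ("chr22", 300, 350)]
cage_tags = [("chr21", 150, 170)]
pet_ditags = [("chr22", 70, 400)]
ests = [("chr21", 600, 700)]  # abuts, but does not overlap, chr21:500-600

print(fraction_undetected(racefrags, [cage_tags, pet_ditags, ests]))
```

The linear scan per fragment is adequate for a handful of loci. A genome-scale comparison would normally use sorted intervals or an interval tree instead.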
The RACEarray normalization strategy relies on fairly well-established molecular techniques and can be performed by any basic molecular biology laboratory. It is cost-effective, since the total cost of interrogating a single locus amounts to less than $1,000 (assuming $225-$400 for the array, $50 for RACE reaction and array hybridization/labeling, and the remaining $650 for RT-PCR and sequencing of RT-PCR products). The approach can be scaled to whole-genome, large-scale transcript discovery. Indeed, our results indicate that, by interrogating a relatively small number of properly spaced exons from annotated loci in a relatively small number of tissues and cell types, it is possible to recover a substantial fraction of the transcript diversity associated with a given locus. Given the large space that protein coding loci seem to span, however, high pooling density of RACE reactions prior to hybridization can only be achieved when the hybridization experiments are performed multiple times in different conditions or cell types, using multiple primers from the same locus. Indeed, the pattern of co-occurrence of primers and RACEfrags across the different conditions provides additional information about their connectivity. Pooling of RACE reactions could reduce the array cost by about two orders of magnitude. “Next generation” sequencing platforms could, in turn, be used to sequence thousands of clones in parallel. If these have been pooled judiciously, short read sequences can be unambiguously assembled into their respective full-length contigs, decreasing the cost of sequencing by several orders of magnitude as well, and making genome-wide RACEarray exploration feasible.
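The per-locus cost estimate can be checked with a few lines of arithmetic. The sketch below simply sums the figures quoted in the text. The variable names are illustrative, and the pooling factor of exactly 100 is an assumption standing in for "about two orders of magnitude".

```python
# Per-locus cost figures quoted in the text (US dollars).
ARRAY_COST_RANGE = (225, 400)  # tiling array, low and high estimate
RACE_AND_HYB_COST = 50         # RACE reaction plus array hybridization/labeling
RTPCR_AND_SEQ_COST = 650       # RT-PCR and sequencing of RT-PCR products

def per_locus_cost(array_cost, pooling_factor=1):
    """Total cost per locus; pooling is assumed to divide the array cost only."""
    return array_cost / pooling_factor + RACE_AND_HYB_COST + RTPCR_AND_SEQ_COST

# Assumption: pooling_factor=100 approximates "two orders of magnitude".
unpooled = [per_locus_cost(c) for c in ARRAY_COST_RANGE]
pooled = [per_locus_cost(c, pooling_factor=100) for c in ARRAY_COST_RANGE]

print("unpooled:", unpooled)
print("pooled:  ", pooled)
```

With these figures the components sum to $925 at the lower array price and $1,100 at the upper one, so the under-$1,000 figure holds toward the lower end of the array price range. Once reactions are pooled, RT-PCR and sequencing account for almost all of the remaining cost, which is why cheaper sequencing is the next lever.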
The question arises as to why so many transcripts of otherwise well-annotated protein coding loci have been systematically missed by the large unbiased cDNA and EST sequencing projects. Several factors may have contributed. First, cDNA libraries are not exhaustive. If a library is not normalized, random clone selection will mostly yield high-copy-number variants. This is compounded by the fact that many existing cDNA libraries are obtained from tissues characterized by a complex and heterogeneous transcript complement. In this respect, targeted interrogation of a single locus is intrinsically more sensitive in discovering the full complexity of transcripts at that locus than shotgun sequencing of the entire library. Second, normalization based on hybridization may have the undesirable effect of decreasing the likelihood of sampling low- or medium-copy alternative splice forms or long chimeric transcripts [16,23], since these can be selectively eliminated together with the higher-copy variants with which they may share a large fraction of their sequence. Third, cDNA and EST libraries are often obtained through oligo-dT primed reverse transcriptase reactions. The single short reads obtained in this way might not be long enough to reach the 5′ ends of long mRNA sequences, or the junctions between exons from different loci, that we are predominantly discovering here.
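The first point, that random clone selection from a non-normalized library favors abundant variants, can be made concrete with a simple sampling model. The sketch below assumes that clones are picked independently and at random, so that a variant with relative abundance p is seen at least once among n clones with probability 1 - (1 - p)^n. The function names and the example abundance of 1 in 10,000 are hypothetical and not taken from the study.

```python
import math

def prob_seen(abundance, n_clones):
    """Probability that a variant with the given relative abundance
    appears at least once among n_clones independent random picks."""
    return 1.0 - (1.0 - abundance) ** n_clones

def clones_needed(abundance, target=0.95):
    """Smallest number of random clones giving at least `target`
    probability of sampling the variant at least once."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - abundance))

# Hypothetical example: a variant making up 1 in 10,000 clones of the library.
rare = 1e-4
for n in (100, 1_000, 10_000):
    print(n, round(prob_seen(rare, n), 3))
print("clones for 95% chance:", clones_needed(rare))
```

Under this model, a variant at 1 in 10,000 needs roughly 30,000 randomly picked clones for a 95% chance of being sampled even once, which illustrates why targeted interrogation of a single locus is more sensitive than shotgun sequencing of a whole library.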
The many novel transcript variants discovered here are not necessarily present in low copy number. In fact, while novel RACEfrags show a more restricted expression pattern than annotated exons (6.8 vs. 13.0 tissues on average, respectively; see Supplementary Results, Supplementary Figure 7), a substantial fraction of them (58%) seem to be expressed in more than one tissue (compared with 62% of the annotated exons; see Supplementary Results). Unfortunately, because of the many amplification steps involved, RACEarray normalization does not provide a good estimate of the expression levels of the identified transcripts, which is a drawback of our approach. We have, however, attempted to reconstruct the expression level of RACEfrags by using transcriptional maps obtained within the ENCODE project. Our results (see Supplementary Results, Supplementary Figure 8) indicate that novel RACEfrags are in general expressed at lower levels than exonic RACEfrags, but that for about a third of the loci interrogated in this study there are no significant differences between the expression levels of exonic and novel RACEfrags. In any case, whether or not low copy number can be taken as an indication of lack of functionality, we would like to stress that our method is a transcript surveying tool (as, for instance, EST and CAGE sequencing are) and, like these methods, does not attempt to provide evidence of functionality.
In summary, RACEarray normalization can be used to efficiently explore how the transcript complement of loci changes under different cellular conditions, or varies between different cell types or individuals. With the appropriate experimental design, the strategy can be effectively multiplexed by high-density pooling of RACE reactions, and can therefore be used for genome-scale transcript discovery.