RACE (Rapid Amplification of cDNA Ends) is a widely used approach for transcript identification. Random clone selection from the RACE mixture, however, is an ineffective sampling strategy if the dynamic range of transcript abundances is large. Here, we describe a strategy that uses array hybridization to improve sampling efficiency of human transcripts. The products of the RACE reaction are hybridized onto tiling arrays, and the exons detected are used to delineate a series of RT-PCR reactions, through which the original RACE mixture is segregated into simpler RT-PCR reactions. These are independently cloned, and randomly selected clones are sequenced. This approach is superior to direct cloning and sequencing of RACE products: it specifically targets novel transcripts, and often results in overall normalization of transcript abundances. We show theoretically and experimentally that this strategy leads indeed to efficient sampling of novel transcripts, and we investigate multiplexing it by pooling RACE reactions from multiple interrogated loci prior to hybridization.
Determining the RNA complement of a cell is a prerequisite for fully understanding its biology, and for translating this understanding to technical applications in medicine, agriculture, and biotechnology. Systematic sequencing of cDNA libraries has been the main approach for transcript characterization. One popular strategy is based on "single pass" sequencing of random cDNA clones, which yields short partial transcript sequences known as Expressed Sequence Tags1 (ESTs). ESTs can then be used to identify clones suitable for full-length sequencing, as in the MGC initiative2. More recently, methods for full-length isolation and sequencing of random clones from cDNA libraries have also been developed3,4. The wide dynamic range of mRNA abundances in cells, however, makes random clone selection inefficient for discovering relatively rare transcripts, because predominantly the abundant cDNAs will be sequenced. To overcome this limitation, procedures that increase the likelihood of sampling rare or tissue-specific transcripts, such as normalization and subtraction, have been developed5,6. But, while these procedures may be effective in normalizing the abundance of representative transcripts from different genes, they are less effective in normalizing the abundances of alternative transcripts within a given gene, which usually share a substantial fraction of sequence. Methods have been developed to enrich cDNA libraries for alternative splice transcripts7,8, but they preferentially target internal exons, whereas a large part of transcript variability resides at the 5’ and 3’ ends of the genes9.
Recent data, obtained through a variety of approaches, strongly suggest the existence of a wealth of transcripts, which had so far escaped detection through systematic sequencing of cDNA libraries10–15. In particular, experiments in which the products of RACE reactions originating from primers anchored in annotated genes are hybridized onto genome tiling arrays have uncovered many previously undetected exons16. Here, building on this approach (see also Kapranov et al.17), we develop an array-based normalization strategy of RACE (Rapid Amplification of cDNA Ends18) reactions, which is very efficient for targeted discovery of novel transcripts. RACE amplifies all transcript sequences in a given RNA sample that include an index exon within which the RACE primer has been designed. Following amplification, RACE products are typically cloned, and random clones are subsequently sequenced. However, the dynamic range in the RACE reaction may still be very large, and random clone sequencing may predominantly yield high-copy number variants—the most likely to have been already identified. To overcome this limitation, we introduce an intermediate step in which the RACE mixture is hybridized onto high-density tiling arrays. This serves to highlight putative novel exons that are used to delineate a set of conventional RT-PCR reactions, from which clones to be sequenced are randomly selected. Here, we show both theoretically and experimentally that this strategy leads to the specific amplification of novel transcripts and to the homogenization of their relative abundances, and that it is indeed very efficient in sampling novel transcript species. We also show that, under the appropriate conditions, it can be efficiently multiplexed by interrogating multiple loci simultaneously.
Figure 1 schematizes the RACEarray strategy (Supplementary Figs 1, 2, Supplementary Methods). Given a locus, we first select the exons in which the RACE primers will be designed (Supplementary Fig 3), and carry out the RACE reactions. Second, we hybridize the RACE products onto tiling arrays, and we build the sites of transcription, the so-called RACEfrags (RACE positive fragments)16, from the probe hybridization intensities. A number of filters can be applied to RACEfrags to account for highly expressed genes in the original RNA source, or to identify RACEfrags produced by the amplification of non-targeted loci (Supplementary Fig 4). If RACE reactions from multiple primers have been pooled together before hybridization, complex assignment procedures may need to be employed to assign RACEfrags to interrogated primers. Third, we use the resulting RACEfrags (Supplementary Fig 5) to delineate RT-PCR reactions. One of the primers for each of the reactions is the original RACE primer, and the second primer is designed within each novel RACEfrag. Strategies based on the pattern of co-occurrence of RACEfrags across different assayed conditions can be designed to select the subset of RACEfrags maximizing transcript discovery (Supplementary Fig. 6). Fourth, we clone the products of the resulting RT-PCR into “mini-pools”: each mini-pool contains the amplified transcripts connecting an index exon with a novel RACEfrag. Finally, we randomly select clones from these pools for sequencing.
Through this process, the original RACE population (likely to be dominated by the most abundant transcripts) is segregated into a number of RT-PCR populations, each one designed to include at least one (but probably more) novel transcripts. Sampling from the RT-PCR subpopulations increases the probability of selecting at least one clone of each novel transcript variant, in contrast to directly sampling from the original RACE population. By modeling random clone selection as a multinomial process, we show that the probability of obtaining at least one clone of each novel species after randomly sampling a number of clones increases as the probabilities of the known mRNA species decrease (Supplementary Methods). We can also show that, when sampling from the population of novel transcripts, the probability of obtaining at least one clone from each novel species increases as the probabilities of the novel transcripts approach homogeneity. We have evidence that RACEarray normalization will often, though not necessarily, lead to the overall homogenization of transcript abundances (since transcripts within an RT-PCR subpopulation are likely to share a larger number of exons, and may in consequence have more similar abundances). We have also carried out extensive simulations (see Supplementary Methods), which show that when the transcript abundances within the segregated subpopulations are homogeneous, sampling from them is in general more efficient than sampling from the original population.
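The multinomial argument can be illustrated with a small calculation. The following is a minimal sketch, not the formalization given in Supplementary Methods: it assumes independent draws with fixed species probabilities and uses inclusion-exclusion over the sets of novel species that could be missed. The function name and all abundance values are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def prob_all_novel_sampled(novel_probs, n_clones):
    """Probability that every novel species is drawn at least once in
    n_clones independent draws (inclusion-exclusion over missed subsets).

    novel_probs lists the per-draw probability of each novel species;
    the remaining probability mass belongs to the known species.
    """
    total = 0.0
    for size in range(len(novel_probs) + 1):
        for subset in combinations(novel_probs, size):
            # Probability that none of the species in `subset` is drawn.
            total += (-1) ** size * (1.0 - sum(subset)) ** n_clones
    return total


# Hypothetical abundances, 32 clones sampled in every case.
# Known species dominate the mixture (95% of the mass):
dominated = prob_all_novel_sampled([0.02, 0.02, 0.01], 32)
# Same novel species in the same ratios, known species removed:
enriched = prob_all_novel_sampled([0.4, 0.4, 0.2], 32)
# Novel species only, homogeneous versus skewed abundances:
uniform = prob_all_novel_sampled([1 / 3, 1 / 3, 1 / 3], 32)
skewed = prob_all_novel_sampled([0.8, 0.15, 0.05], 32)
```

With these illustrative numbers, `dominated` is much smaller than `enriched`, and `skewed` is smaller than `uniform`, in line with the two qualitative statements above. The enumeration is exponential in the number of novel species, so it is only practical for small toy cases.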
As a proof of concept, we used RACEarray normalization to interrogate a single gene: MECP2. Mutations in this gene cause Rett syndrome. MECP2 has two known transcript variants: the longer form has four exons, and the shorter form skips the second exon (Figure 2a). We performed 3’ and 5’ RACE from exon 3 in 16 different tissues (see Methods). We additionally performed 5’ RACE from exons 2, 3 and 4 in fetal brain. RACE reactions were separately hybridized on the ENCODE arrays containing the region ENm006, in which the MECP2 gene resides. The raw hybridization data appear in Figure 2a. Seventy-one RACEfrags were detected. Eight 5’ RACEfrags were selected for RT-PCR verification. All eight gave at least one RT-PCR product, which was either cloned or sequenced directly. In total, 15 novel isoforms including 14 novel exons were discovered in this way. Most of these isoforms are partial, since many of the novel RACEfrags interrogated are likely to correspond to internal exons. The majority of them use canonical splice sites and a few could be coding for proteins. Therefore, through a limited exploration using the RACEarray normalization strategy, we have discovered many novel isoforms for an important disease gene.
To further test the utility of our approach and to demonstrate that it may be effective in exploring protein coding genes, we tested 10 novel RACEfrags connecting to index exons in 9 different genes. The RACEfrags were randomly selected from a subpopulation of RACEfrags flanked by good splice sites and within 500 Kb from the RACE index exon (see Methods). We performed the RT-PCR reactions using total RNA from two pools of tissues. Positive RT-PCRs were cloned into a mini-pool, and 32 clones were randomly selected in each case for sequencing (Supplementary Table 1). Thirty-four novel variants were uncovered for these 9 loci, compared with 59 previously known. Some of these variants correspond to complex transcriptional events including long-range exon sharing (Fig. 2b). Virtually all cases were positive in the two pools, and nearly all sequences aligned to the genome with canonical splice sites. The novel transcripts discovered did not result in novel ORFs for the interrogated genes, being either non-coding variants or variants extending the UTRs.
While the strategy designed here is particularly useful to exhaustively characterize the transcript complement of individual loci, a few steps can be efficiently multiplexed, allowing for the simultaneous interrogation of multiple loci. This is mostly accomplished by pooling together RACE reactions from different loci prior to the hybridization onto the array. However, given the long extensions previously observed in the ENCODE regions16, pooling together RACE reactions originating from index genes in close proximity in the genome sequence might confound assignment of RACEfrags to RACE index exons. It is therefore helpful to estimate the range of genomic distances spanned by primary transcripts of protein coding genes in order to design an optimal pooling strategy. In addition, both the number and combination of tissues on which the original RACE reactions are performed, and the number and distribution of primers along the interrogated loci, influence the ability to survey transcript diversity.
We performed both 5’ and 3’ RACE of 12 genes mapping on human chromosomes 21 and 22 on polyA+ RNA of 48 cell types (see Table 1 and Methods). Both 5’ and 3’ RACE reactions for three widely spaced genes per chromosome were pooled and hybridized onto a high-density tiling array of human chromosomes 21 and 22 with 17-nucleotide interrogation resolution. Detailed results are provided in Supplementary Results; Figure 3 summarizes the main findings. Figure 3a plots the genomic coverage of RACEfrags as a function of the tissue in which the RACE reaction was performed. Not surprisingly, tissues exhibit, in general, higher transcriptional diversity than cell lines, but large variations in the amount of transcribed bases are observed among both tissues and cell lines, consistent with previous results19. Figure 3b plots the cumulative genomic coverage as a function of the combination of tissues. As shown, a combination of about 16 cell types already captures about 90% of all detected transcribed nucleotides.
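A cumulative coverage curve of the kind described for Figure 3b can be computed as sketched below. This is only an illustration under stated assumptions: the text does not say how cell types were ordered, so the greedy ordering (always adding the cell type that contributes the most new positions) is our assumption, and the function name, cell type names and covered positions are hypothetical toy data.

```python
def greedy_cumulative_coverage(coverage_by_tissue):
    """Return (tissue, cumulative fraction) pairs in greedy order.

    coverage_by_tissue maps a tissue or cell line name to the set of
    genomic positions covered by RACEfrags detected in it. The fraction
    is relative to all positions detected in any tissue.
    """
    all_positions = set().union(*coverage_by_tissue.values())
    if not all_positions:
        return []
    remaining = dict(coverage_by_tissue)
    covered = set()
    curve = []
    while remaining:
        # Pick the tissue adding the most new positions; sorting the
        # names first makes tie-breaking deterministic.
        best = max(sorted(remaining),
                   key=lambda name: len(remaining[name] - covered))
        covered |= remaining.pop(best)
        curve.append((best, len(covered) / len(all_positions)))
    return curve


# Hypothetical toy data: positions stand in for transcribed nucleotides.
toy = {"brain": {1, 2, 3, 4}, "liver": {3, 4, 5}, "HeLa": {1, 2}}
curve = greedy_cumulative_coverage(toy)
```

Reading the curve from left to right gives the smallest greedy combination of cell types that reaches a given fraction of the detected transcribed positions.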
We carried out 5’ RACE on 10 exons, evenly distributed 3’ to 5’, of 44 genes, each gene mapping to a different ENCODE region20,21 (Table 1). We used polyA+ RNA from 12 human tissues, and the RACE reactions were pooled before being hybridized to the ENCODE arrays16,17. Each pool contained 44 RACE reactions, each one originating from one exon from a different gene, and thus from a different ENCODE region. Detailed results are presented in Supplementary Results. Figure 4 displays the proportion of all RACEfrags that originate from primers in exons ordered from 3’ to 5’. The cumulative distribution, although inconclusive, suggests an optimal interrogation strategy, in which RACE of the most 5’ and 3’ exons is likely to give rise to a larger number of novel RACEfrags, compared with RACE of internal exons (see Supplementary Figure 3).
We conducted 5’ RACE on 96 genes in human chromosomes 21 and 22 (Table 1). Reactions were carried out individually on polyA+ RNA from 12 different tissues and subsequently pooled. RACE reactions from different tissues originating from genes each separated by 10 Mb were pooled in groups of 6 on the same chip. Results (Supplementary Results and Figure 5) show that transcripts may span very large genomic space, with about 50% of the RACEfrags more than 3 Mb away from the index gene. These results need further validation, but they could potentially challenge our current understanding of the structure and organization of transcripts encoded in the human genome, suggesting that distal regions may be connected into individual transcripts more often than previously expected. They also make the delineation of an effective pooling strategy very challenging, since only a very sparse pooling appears to guarantee a robust assignment of RACEfrags to primers.
We have shown here that array-based normalization of RACE reactions is a very efficient strategy for discovering previously unknown transcript variants of protein coding loci. Our experiment yielded about one novel transcript variant per 10 clones sequenced, and while the RACEfrags assayed here for the nine tested genes were selected from a subpopulation of high-confidence RACEfrags, a large fraction (60%) of the loci interrogated have been assigned at least one such RACEfrag. We do not know of any other strategy that is able to survey the transcriptional diversity of protein coding loci with this level of detail and efficiency. We have compared the results of the hybridization of RACE reactions to tiling arrays with those from other high-throughput transcript interrogation surveys using distinct technologies: CAGE tags22, GIS PET ditags11 and ESTs (Supplementary Table 2). While there is significant overlap between our RACEfrags and the sites of transcription detected by these other technologies, there is still a substantial number of RACEfrags (about 17% of those detected through all the experiments performed here) that are not detected by alternative methods.
The RACEarray normalization strategy relies on fairly well-established molecular techniques and can be performed by any basic molecular biology laboratory. It is cost-effective, since the total cost of interrogating a single locus amounts to less than $1,000 (assuming $225-$400 for the array, $50 for RACE reaction and array hybridization/labeling, and the remaining $650 for RT-PCR and sequencing of RT-PCR products). The approach can be scaled to whole genome large-scale transcript discovery. Indeed, our results indicate that, through the interrogation of a relatively small number of properly spaced exons from annotated loci in a relatively small number of tissues and cellular types, it is possible to recover a substantial fraction of the transcript diversity associated with a given locus. Given the large space that protein coding loci seem to span, however, high pooling density of RACE reactions prior to hybridization can only be achieved when the hybridization experiments are performed multiple times in different conditions or cell types, using multiple primers from the same locus. Indeed, the pattern of co-occurrence of primers and RACEfrags across the different conditions provides additional information about their connectivity. Pooling of RACE reactions could reduce the array cost by about two orders of magnitude. “Next generation” sequencing platforms could, in turn, be used to sequence thousands of clones in parallel. If these have been pooled judiciously, short read sequences can be unambiguously assembled into their respective full-length contigs, also decreasing the cost of sequencing by several orders of magnitude, and making genome-wide RACEarray exploration feasible.
The question arises as to why so many transcripts of otherwise well-annotated protein coding loci have been systematically missed by the large unbiased cDNA and EST sequencing projects. Several factors may have contributed. First, cDNA libraries are not exhaustive. If they are not normalized, random clone selection will predominantly yield high-copy number variants. This is compounded by the fact that many existing cDNA libraries are obtained from tissues characterized by a complex and heterogeneous transcript complement. In this respect, targeted interrogation of a single locus is intrinsically more sensitive in discovering the full complexity of transcripts at that locus than shotgun sequencing of the entire library. Second, normalization based on hybridization may have the undesirable effect of decreasing the likelihood of sampling low or medium copy alternative splice forms or long chimeric transcripts16,23, since these can be selectively eliminated together with the higher copy variants with which they may share a large fraction of their sequence. Third, cDNA and EST libraries are often obtained through oligo-dT primed reverse transcriptase reactions. The single short reads obtained in this way might not be long enough to reach the 5’ ends of long mRNA sequences, or the junctions between exons from different loci, that we are predominantly discovering here.
The many novel transcript variants discovered here are not necessarily in low copy number. In fact, while novel RACEfrags show a restricted expression pattern when compared to annotated exons (6.8 vs. 13.0 tissues on average, respectively, see Supplementary Results, Supplementary Figure 7), a substantial fraction of them (58%) seem to be expressed in more than one tissue (compared with 62% of the annotated exons; see Supplementary Results). Unfortunately, because of the many amplification steps involved, and as a drawback of our approach, RACEarray normalization does not provide a good estimate of the expression levels of the identified transcripts. We have, however, attempted to reconstruct the expression level of RACEfrags by using transcriptional maps obtained within the ENCODE project. Our results (see Supplementary Results, Supplementary Figure 8) indicate that novel RACEfrags are in general expressed at lower levels than exonic RACEfrags, but that for about a third of the loci interrogated in this study there are no significant differences between the expression levels of exonic and of novel RACEfrags. In any case, whether or not low copy number can be taken as an indication of lack of functionality, we would like to stress that our method is a transcript surveying tool (as, for instance, EST and CAGE sequencing are) and, like these methods, does not attempt to provide evidence of functionality.
In summary, RACEarray normalization can be used to efficiently explore how the transcript complement of loci changes under different cellular conditions, or varies between different cell types or individuals. With the appropriate experimental design, the strategy can be effectively multiplexed by high-density pooling of RACE reactions, and therefore can be used for genome scale transcript discovery.
See Supplementary Methods for the tissues and cell lines used, as well as for the conditions under which the RACE reactions were carried out.
The products of RACE reactions were pooled and purified by ethanol precipitation. The pooled amplicons were then fragmented, and subsequently labeled for direct array hybridization (see Supplementary Methods).
The maps of probe intensities versus the genomic positions were generated using Tiling array Software (TAS; http://www.affymetrix.com/support/developer/downloads/TilingArrayTools/index.affx, see Supplementary Methods). RACEfrags were built using the 99.7th percentile of the probe intensity values as a threshold. Two probes are included in the same RACEfrag if they are less than 25 nucleotides apart (maxgap), and the minimum length of a RACEfrag (minrun) is 25 nucleotides. RACEfrags were filtered using a RACEarray in silico simulator that aims at reducing the RACEfrag false-positive rate due to RACE mis-priming, as well as array cross-hybridization (see Supplementary Methods). Surviving RACEfrags were assigned to the closest interrogated primer in genomic space.
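The thresholding, merging and assignment steps described above can be sketched as follows. This is a minimal illustration and not the TAS implementation: the function names are hypothetical, the percentile uses a simple nearest-rank rule, probes at or above the threshold are treated as positive, the length of a fragment is measured between its first and last positive probe positions, and the distance to a primer is measured from the fragment midpoint. All of these are assumptions made for the example, as are the toy probe positions and primer names.

```python
import math


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of values (assumed rule)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def build_racefrags(probes, threshold, maxgap=25, minrun=25):
    """Merge positive probes into fragments, returned as (start, end) pairs.

    probes is an iterable of (position, intensity). Consecutive positive
    probes less than maxgap nucleotides apart join the same fragment;
    fragments spanning fewer than minrun nucleotides are discarded.
    """
    positive = sorted(pos for pos, intensity in probes if intensity >= threshold)
    frags = []
    for pos in positive:
        if frags and pos - frags[-1][1] < maxgap:
            frags[-1][1] = pos  # extend the current fragment
        else:
            frags.append([pos, pos])  # start a new fragment
    return [(start, end) for start, end in frags if end - start >= minrun]


def assign_to_closest_primer(frag, primer_positions):
    """Return the name of the primer closest to the fragment midpoint."""
    midpoint = (frag[0] + frag[1]) / 2
    return min(primer_positions,
               key=lambda name: abs(primer_positions[name] - midpoint))


# Hypothetical probes spaced 17 nucleotides apart; the threshold is given
# explicitly here, whereas on a real array it could be obtained with
# nearest_rank_percentile(all_intensities, 99.7).
toy_probes = [(0, 10), (17, 900), (34, 950), (51, 920), (68, 12),
              (85, 15), (102, 880), (119, 9), (136, 11)]
frags = build_racefrags(toy_probes, threshold=500)
primers = {"geneA_exon3": 500, "geneB_exon1": 10000}
owners = [assign_to_closest_primer(frag, primers) for frag in frags]
```

In this toy example the three adjacent positive probes are merged into one fragment, while the isolated positive probe at position 102 is dropped because it does not reach the minimum run length.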
Verification of RACEfrag/index exon connectivity was carried out in a nested RT-PCR experiment on 10 cases. These cases corresponded to the experiments described in the section “Pooling of RACE reactions-genomic extent of loci”. We considered only RACEfrags with very good putative donor sites (with score over 2.4 as computed by the geneid program24) in the vicinity (within −10 to +30 nucleotides) of the 3’ end of the RACEfrag, and within 500 Kb from the index RACE exon. The 189 RACEfrags surviving these criteria corresponded to 58 loci, and the 10 test cases were randomly selected from this population.
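The selection criteria above amount to a simple filter followed by random sampling, which can be sketched as below. The record layout, field names and example values are hypothetical, coordinates are assumed to be on the forward strand, and the donor scores are assumed to have been computed beforehand (for instance with geneid); the sketch does not call geneid itself.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    frag_end: int         # coordinate of the RACEfrag 3' end
    donor_pos: int        # coordinate of the best predicted donor site
    donor_score: float    # precomputed donor site score
    index_exon_pos: int   # coordinate of the RACE index exon


def passes(c, min_score=2.4, window=(-10, 30), max_dist=500_000):
    """Apply the three criteria: donor score, donor offset, genomic distance."""
    offset = c.donor_pos - c.frag_end
    return (c.donor_score > min_score
            and window[0] <= offset <= window[1]
            and abs(c.frag_end - c.index_exon_pos) <= max_dist)


def pick_test_cases(candidates, k=10, seed=0):
    """Randomly select up to k candidates among those passing the filter."""
    kept = [c for c in candidates if passes(c)]
    return random.Random(seed).sample(kept, min(k, len(kept)))


# Hypothetical candidates: only the first one satisfies all criteria.
cands = [
    Candidate("a", 1000, 1005, 3.1, 200_000),
    Candidate("b", 1000, 1005, 2.0, 200_000),   # donor score too low
    Candidate("c", 1000, 1050, 3.0, 200_000),   # donor too far from 3' end
    Candidate("d", 1000, 1005, 3.0, 900_000),   # too far from index exon
]
selected = pick_test_cases(cands)
```

A fixed seed is used only to make the toy selection reproducible.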
For the test cases, the left primers were designed on the RACEfrag sequence, whereas the right primers were selected within the corresponding index exon. The external right primers were chosen to be the same as in the RACEarray experiment. For the positive controls, nested RT-PCR primers were designed in the 60 5’-most and 60 3’-most nucleotides of the target full-length mRNA. In all cases, the primer3 program25 was used to pick primers.
RT-PCR, Cloning and Sequencing (Supplementary Methods)
The list of genes, exons, and RACE primers used in these experiments is available at http://genome.imim.es/datasets/racearrays2007. Processed and unprocessed micro-array data, as well as RT-PCR primers and the sequences of the resulting RT-PCR products are also available at this address. Primary array data will be deposited in the National Center for Biotechnology Information’s Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/). The sequence data of the RT-PCR products from this study have been submitted to DDBJ/GenBank/EMBL under accession numbers FE530175 - FE530855, with the exception of the MECP2 sequences, which are in the process of being submitted. RACEfrag data and maps are being deposited to the UCSC browser.
In Supplementary Methods we describe in detail the mathematical formalization of the RACEarray sampling problem and the computer simulations, as well as the detailed protocols for the RACE, array hybridization, RT-PCR, cloning and sequencing experiments.
The project at IMIM, CRG, the Universities of Lausanne and Geneva, and Affymetrix is supported by grants U01HG003150 and U01HG003147 from the National Human Genome Research Institute, NIH. Institut Municipal d’Investigació Mèdica and Center for Genomic Regulation (CRG) have also been funded by grant BIO2006-03380 from the Spanish Ministry of Education and Science and by the European BioSapiens Consortium. The Universities of Lausanne and Geneva have also been funded by the Swiss National Science Foundation, the EU AnEUploidy project, and the NCCR Frontiers in Genetics. Affymetrix has also received funds from the National Cancer Institute, NIH, under contract no. N01-CO-12400 and from Affymetrix, Inc. The portion of this work carried out at Center for Cancer Systems Biology was funded by a grant from the Ellison Foundation (awarded to MV) and as Institute Sponsored Research from the Dana Farber Cancer Institute Strategic Initiative. We gratefully acknowledge Dr. J.M. Oller from the University of Barcelona for his review of the probabilistic results, and Robert Castelo from the University Pompeu Fabra, Cédric Howald from the University of Lausanne, and David Martin from the CRG, for useful suggestions.