We have demonstrated that SLIP is an efficient and effective method for screening plasmid cDNA libraries. In screens for 153 Drosophila transcription factor genes known to be represented at relatively low levels in our cDNA libraries, we recovered high-quality, full-length cDNAs with complete ORFs for 72 genes and compromised cDNAs for another 32 genes. The six cDNAs compromised by nucleotide discrepancies, and one clone with a 2 bp deletion resulting from a SLIP artifact, could be repaired by site-directed mutagenesis to produce high-quality, full-length cDNAs. Three of the co-ligated cDNAs encode complete ORFs suitable for cloning into expression systems. Thus, by a more liberal standard, full-length cDNAs were recovered for 82 genes. SLIP is simpler to perform than the similar MACH-2 method, and both methods are considerably more efficient than the traditional hybridization-based library screening approach.
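The counts reported above can be cross-checked with a short tally. The sketch below uses only the figures stated in this paragraph; the variable names are illustrative and are not part of the SLIP protocol.

```python
# Cross-check of the screening tallies quoted in the text.
# All numbers come from the paragraph above; names are illustrative only.
targets = 153                 # transcription factor genes screened
high_quality = 72             # full-length cDNAs with complete ORFs
compromised = 32              # genes yielding only compromised cDNAs

# Genes for which no target-specific clone was recovered.
no_target_clone = targets - high_quality - compromised

# Compromised clones that could still give usable full-length cDNAs.
repairable_discrepancies = 6  # nucleotide discrepancies, fixable by mutagenesis
repairable_slip_deletion = 1  # 2 bp deletion from a SLIP artifact
usable_co_ligated = 3         # co-ligated cDNAs that encode complete ORFs

liberal_full_length = (
    high_quality
    + repairable_discrepancies
    + repairable_slip_deletion
    + usable_co_ligated
)

strict_rate = round(100 * high_quality / targets, 1)
liberal_rate = round(100 * liberal_full_length / targets, 1)

print(no_target_clone)      # 49
print(liberal_full_length)  # 82
print(strict_rate)          # 47.1
print(liberal_rate)         # 53.6
```

Under these figures, the strict standard corresponds to roughly 47% of targets and the more liberal standard to roughly 54%.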
Because PCR tends to amplify shorter products more efficiently, SLIP likely has a bias toward recovery of shorter clones. Thus, if a cDNA library contained clones of various lengths for a target gene, SLIP might recover only short cDNAs with incomplete ORFs. Many of the short clones in cDNA libraries are missing the 5′ end of the transcript. We took two measures to improve the recovery of full-length clones: we designed PCR primers within the first 500 bases of each annotated transcript model, and we performed PCR with an extension time sufficient to amplify cDNAs with inserts of at least 4 kb in our 1.6 kb cloning vector pOT2. We recovered relatively long full-length cDNA clones, but we did not recover full-length clones for target genes with ORFs longer than 3 kb. Comparison to EST sequencing results from the same cDNA libraries shows that the two approaches recovered full-length cDNAs at a similar rate. This suggests that full-length cDNAs for target genes with long ORFs are rare in the cDNA libraries used in this study. However, we cannot exclude the possibility that some long transcripts were not recovered in our screens due to the PCR conditions used.
Modifications to the SLIP protocol are likely to improve recovery of long cDNAs. Techniques for PCR amplification of large fragments, including increasing the number of cycles of amplification, increasing the extension time and employing DNA polymerases optimized for ‘long PCR’, could be incorporated to reduce size bias due to the PCR step and recover long cDNAs. PCR amplification of fragments at least 20 kb in length from complex templates such as the human genome is a routine procedure (35), and kits for this purpose are available from several commercial suppliers. Libraries containing full-length cDNAs for very long transcripts are also necessary for recovery of long cDNAs, and methods for constructing such libraries have been developed (36). In addition, PCR products could be size-selected by excision from agarose gels before the self-ligation step. Although we have not demonstrated recovery of very long cDNAs using SLIP, we see no reason the method should be significantly limited by the lengths of transcripts or cloning vectors.
The success of SLIP screening was not significantly correlated with whether the target gene was named, a common surrogate for confidence in the target gene annotation, nor with the presence of ESTs in our collection. This suggests that recovery of a cDNA clone for a target gene depends primarily on the presence of a cDNA clone in the library. Because we diluted the cDNA library pool 500-fold for the first round of screening experiments, library complexity seemed likely to be a limiting factor. To test this, we performed a second screen for 69 target genes, including 56 targets that failed to yield specific clones in the first round of screening, using a 10-fold higher concentration of library pool (50-fold dilution). An additional twelve genes yielded specific clones in this second screen. The effect of library concentration was not dramatic, however, which suggests that most of the complexity of the library pool was represented in each sample in the initial round of screening. Statistical analysis of the results indicates that the additional successes in the second round of screening are consistent with the increase expected from selecting additional isolates for sequence analysis, with the underlying screening success rate identical for both library dilutions (data not shown). Note that these cDNA libraries had already been extensively sampled by EST sequencing, and this had not yielded clones for 127 of the 153 genes targeted in this study. To use this screening method to recover cDNAs for the transcription factor genes that are still not represented in our collection, new cDNA libraries with higher complexity and from additional tissues and developmental stages would seem to be required.
Since PCR primers were designed based on Release 3.1 annotated genes, including many for which no molecular evidence currently exists, our success in recovering clones depended upon the accuracy of the gene predictions. In 29 cases, the clones recovered in the screen provide evidence that the corresponding gene models should be modified. For three of the failed library screening experiments, the revised Release 4.1 gene models do not include the Release 3.1 exons used to design the PCR primers. This provides a trivial explanation for these failures. Further examination of the PCR primer sequences and the gene models they were designed to target may suggest other ways of improving the success rate.
The 49 genes that did not yield target-specific clones probably failed due to absence of clones from the cDNA library aliquot. Most of the failed screens yielded clones representing genes that were not targets. These non-target clones probably arise by mis-priming during PCR in the absence of target-specific cDNAs. Another potential explanation for the recovery of non-target clones is incomplete DpnI digestion of the library template DNA. However, in many cases the sequence traces from non-target clones include sequences complementary to one or both of the corresponding PCR primers. Thus, mis-priming appears to be the primary failure mode.
Our results suggest ways of optimizing the screening procedure. One of the easily adjusted parameters is the number of isolates selected for sequence analysis. Based on a retrospective analysis, we estimate that by characterizing four isolates per target instead of three, we have increased our screening success rate by ~12%. Similarly, characterizing four isolates per target yields ~32% more screening successes than two isolates, and 88% more screening successes than a single isolate. We estimate that selecting more than four isolates will result in a maximum increase of 5% in the number of successes, and this needs to be balanced against the increase in costs of characterizing additional clones. Another parameter that may be adjusted is the number of isolates selected for full-insert sequencing. While in most cases all of the characterized isolates were identical (based on analysis of the three initial sequence reads), there were cases in which different clones were recovered. These may indicate alternative transcription start sites or alternative splicing, rather than incomplete cDNAs. Another area for optimization is in the automated analysis of the initial sequence reads to determine which clones should be considered for full-insert sequencing. Analysis of the finished sequences from these experiments, largely gained through manual examination, suggests that a useful criterion for clone selection would be 50% or greater sequence identity of the clone and the corresponding gene model over at least half the length of the sequence data generated from the clone.
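The two adjustable parameters discussed above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis pipeline used in this study: the function names are hypothetical, and the criterion is interpreted here as requiring an alignment that spans at least half of the clone's sequence data with at least 50% identity within that alignment. The relative success rates are derived only from the retrospective percentages quoted in the text.

```python
def passes_selection(matching_bases, aligned_bases, read_bases,
                     min_identity=0.5, min_coverage=0.5):
    """Return True if a clone meets the suggested selection criterion.

    Assumed interpretation: identity is matching bases divided by aligned
    bases, and coverage is aligned bases divided by the total sequence
    data generated from the clone.
    """
    if read_bases <= 0 or aligned_bases <= 0:
        return False
    identity = matching_bases / aligned_bases
    coverage = aligned_bases / read_bases
    return identity >= min_identity and coverage >= min_coverage


# Retrospective estimates from the text: fractional gain in screening
# successes from characterizing four isolates instead of n isolates.
GAIN_OF_FOUR_OVER = {3: 0.12, 2: 0.32, 1: 0.88}


def relative_success(n_isolates):
    """Success with n isolates as a fraction of success with four."""
    if n_isolates == 4:
        return 1.0
    return 1.0 / (1.0 + GAIN_OF_FOUR_OVER[n_isolates])


# Hypothetical clones: (matching bases, aligned bases, read bases).
print(passes_selection(450, 600, 1000))  # good identity and coverage
print(passes_selection(200, 300, 1000))  # alignment covers too little
print(passes_selection(200, 600, 1000))  # identity too low

for n in (1, 2, 3, 4):
    print(n, round(relative_success(n), 2))
```

Expressed this way, a single isolate per target would be expected to yield a little over half of the successes obtained with four isolates, which illustrates why the number of isolates is a sensitive parameter at low counts and much less so beyond four.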
The success of these directed library screens raises the question of when a project to produce a non-redundant cDNA collection should switch from an EST-based approach to a directed approach. At the end of our EST sequencing project, the final 10 000 EST sequences identified cDNAs representing just 96 (1%) new genes not previously represented in the collection. At that point, it was decided that additional EST sequencing was not warranted. If an efficient directed method had been available, we might have switched from EST sequencing to directed library screening at an earlier stage in the DGC project.
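The diminishing return of EST sequencing can be made concrete with the two figures given above. The following is a simple worked calculation; the variable names are illustrative.

```python
# Yield of new genes at the end of the EST project, from the text above.
final_est_reads = 10_000   # final batch of EST sequences
new_genes = 96             # genes not previously in the collection

yield_percent = 100 * new_genes / final_est_reads
reads_per_new_gene = final_est_reads / new_genes

print(round(yield_percent, 2))    # 0.96
print(round(reads_per_new_gene))  # 104
```

In other words, at that stage roughly one hundred EST reads were needed for each gene newly added to the collection.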
In our view, the results described here justify a larger scale SLIP screen for cDNA clones representing the remaining annotated genes and alternative transcripts that are not yet represented by cDNA clones in the DGC. We assert that cDNAs obtained by library screening can be more informative and valuable than RT–PCR products. The principal advantage of cDNAs over RT–PCR products is that cDNAs can recover sequences at the 5′ and 3′ ends of transcripts that are not represented in annotated gene models. In our screens, we recovered full-length cDNA clones that extend the ORF of the annotated gene model in the 5′ (10 genes) or the 3′ (3 genes) direction, that reveal a dicistronic transcript (one gene-pair), and that fuse gene models (three gene models into one gene). For these 15 genes, RT–PCR experiments based on the ORFs in the annotated gene models would have amplified cDNA products representing incomplete ORFs encoding truncated protein sequences. Furthermore, such RT–PCR data would appear to validate the incomplete gene models. In addition, we recovered five full-length cDNAs classified as exon variants that have alternative 5′- or 3′-terminal coding sequences that are not present in the genome annotation. Because the termini of these ORFs are not present in the current genome annotation, they would not be recovered in annotation-based RT–PCR experiments. The 5′ and 3′ ends of transcripts can be recovered by RACE, but this approach does not lead directly to full-length cDNA clones. Thus, because it involves fewer assumptions based on predicted transcript structures, we consider directed cDNA library screening to be a more conservative and informative approach than RT–PCR.
RT–PCR is likely to be more sensitive than directed cDNA library screening for the recovery of sequences of transcripts with extremely low expression levels because it does not involve library construction steps, which inevitably reduce the complexity of the sample. RT–PCR is also likely to be more effective for recovery of long transcripts, since it constrains amplified transcript sequences to include the 5′ and 3′ ends defined by the PCR primers. Thus, we do not assert that directed cDNA library screening is better than RT–PCR. Instead, we maintain that SLIP can be more informative than RT–PCR and that the two approaches are complementary, each with distinct advantages and disadvantages.
In a pilot study evaluating the use of RT–PCR to generate cDNA clones for the Mammalian Gene Collection, acceptable full-ORF clones were recovered for 67% of 384 well-characterized human genes that had sequences in the RefSeq database but that were not yet represented by cDNAs in the collection (37). In the study, RT–PCR was performed on a series of RNA templates representing different human tissues until a PCR product of the expected size was obtained for each target gene. Multiple bands were observed in many of the RT–PCR reactions, so bands of the expected size were purified by excision from agarose gels before cloning. Twelve or more cloned isolates were end sequenced for each target; 4718 clones were sequenced to recover acceptable clones for 259 genes. In our study, the targets include many uncharacterized predicted genes, the cDNA libraries were pooled into a single PCR template, no agarose gel analysis or purification was performed, and four clones were analyzed per target (although 67 targets were subjected to two rounds of screening). The target gene sets, the tissue sampling approaches, and the work expended per target are quite different in the two studies, making their direct comparison difficult.
A productive and rigorous strategy for cloning and characterizing a eukaryotic transcriptome might involve successive phases of EST sequencing, directed cDNA library screening using SLIP, RT–PCR amplification of annotated ORFs and RACE experiments to recover uncaptured UTRs and coding sequences and to precisely define transcription start sites. A strategy based purely on RT–PCR and RACE could also be effective, particularly if advances in genome annotation approaches lead to significant improvements in gene prediction.
Finally, cDNA libraries are often constructed from RNA isolated from particular tissues or developmental stages, so EST and cDNA sequences can provide data on when and where a transcript is expressed. We pooled cDNA libraries into a mixed template to improve the efficiency of our screens, resulting in the loss of this spatial and temporal expression information. In Drosophila, large datasets on RNA expression have been produced in microarray studies (38) and embryonic in situ hybridization experiments (40), and these data have much higher resolution and reliability than data from cDNA library associations. If cDNA libraries constructed with library-specific sequence tags were used, as in the rat EST project (41), then the library source information for cDNAs amplified from a pooled template would be retained.
In summary, SLIP is an effective method for increasing the representation of genes and transcripts in comprehensive cDNA collections, such as those currently under construction by the NIH Mammalian Gene Collection project for several model organisms and the human (7). We have used it to recover full-length cDNA clones for 72 genes with relatively low expression levels. Our results also demonstrate that SLIP can be used to screen for cDNAs representing alternatively spliced transcripts. By designing PCR primers in predicted isoform-specific exonic sequence, cDNAs containing the alternatively spliced sequences can be targeted. Finally, the utility of SLIP is not limited to genomic applications. The method is simple and should be useful in any project requiring the isolation of cDNA clones. The main limitation is the availability of high-quality plasmid cDNA libraries representing organisms, tissues and developmental stages of interest.