In light of the increasing interest in the generation of high-quality and complete cDNA-sequence resources, we developed a robust and scalable transposon-based strategy for the full-insert sequencing of cDNA clones. The specific pipeline described here was tailored to a high-throughput environment that emphasizes the shotgun sequencing of genomic clones and that is associated with a relatively low per-sequence-read cost. As such, the goal was to minimize the effort required upstream and downstream of the main sequence production (see Fig. ), even at the cost of generating a small proportion of extra sequence reads. The upstream transposon-insertion step is simply used to generate suitable plasmid templates with common primer sequences inserted at random positions within the cloned DNA, which can then traverse the main sequence-production pipeline in an efficient fashion.
While each cDNA clone is handled individually, it is batch processed in one of three groups (i.e. those to be sequenced using 0, 12 or 24 transposon subclones). Our approach has proven highly effective, routinely yielding sequence data with error rates of less than one in 50 000 bp. Indeed, ~90% of cDNA clones are finished to this accuracy level after generation of the transposon-derived sequence reads, requiring no additional custom sequence reads. It is interesting to note that the overall effort (as measured in total reads per kb) for sequencing cDNA clones using the described pipeline is quite similar to that required for sequencing large-insert genomic clones by a shotgun-sequencing strategy (1).
It is important to emphasize that the transposon-based strategy described can be readily adapted for use in other sequencing environments, including smaller facilities where the per-sequence-read costs might be considerably higher. In the latter situations, one might consider initially sizing each cDNA insert and then titrating the number of transposon subclones sequenced based on each insert’s size. Similarly, a smaller number of random transposon-derived sequence reads might be generated, which would then likely require a larger proportion of custom finishing reads. Regardless of the precise implementation scheme, our general strategy provides a robust and efficient path for sequencing cDNA clones in a variety of environments.
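The idea of titrating the number of transposon subclones to the insert size can be illustrated with a minimal sketch. The read length, target depth, reads per subclone and number of vector-primed end reads below are illustrative assumptions, not values taken from this study, and the function name `subclones_needed` is hypothetical.

```python
import math


def subclones_needed(insert_bp, read_len_bp=500, target_depth=6.0,
                     reads_per_subclone=2, end_reads=2):
    """Estimate how many transposon subclones to sequence for one cDNA insert.

    All default values are assumptions chosen for illustration only.
    """
    if insert_bp <= 0:
        raise ValueError("insert_bp must be positive")
    # Total reads needed to reach the requested average depth.
    reads_needed = math.ceil(target_depth * insert_bp / read_len_bp)
    # Vector-primed end reads from the parent clone count toward the total.
    transposon_reads = max(0, reads_needed - end_reads)
    # Assumed: each subclone yields reads from both ends of the transposon.
    return math.ceil(transposon_reads / reads_per_subclone)


if __name__ == "__main__":
    for size in (100, 2000, 4000):
        print(size, subclones_needed(size))
```

Under these assumed defaults, very short inserts need no transposon subclones at all, while the estimate grows roughly linearly with insert size. A facility with higher per-read costs could lower `target_depth` and accept more custom finishing reads.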
Traditionally, an inherent limitation of transposon-based sequencing has been the insertion of transposons into irrelevant regions of the target DNA, such as the cloning vector. Various solutions to this problem have been devised, such as mapping the insertions and then selecting those subclones harboring transposons at desirable positions within the cloned insert (16). The pipeline described here was initially optimized using cDNA clones containing the small pOTB7 vector and employing double-antibiotic selection. In generalizing the strategy to include cDNA clones containing larger vectors, we sought to avoid the addition of labor-intensive steps, such as those involving mapping of transposon-insertion sites. Fortunately, we were able to integrate a single, simple step in the pipeline to address the problem (see Fig. ). Specifically, following transposon insertion, the cDNA inserts are subjected to Gateway-mediated transfer (27) to a new recipient vector, which yields collections of subclones with transposons exclusively within the cDNA inserts. Of course, this adaptation requires the presence of suitable Gateway-recombination signals in the cDNA clones; fortunately, these are present in most clones being generated and sequenced by the MGC program (12).
Our sequencing of a large set of cDNA clones provided the opportunity to rigorously assess the randomness of Tn5 insertions. Quantitative analysis of the Tn5-insertion sites across 3.59 Mb of finished cDNA sequence revealed nearly random behavior. There was little to no evidence for hot spots, cold spots or a preference for particular local GC content. Although a few inter-insertion intervals were too long or too short to be deemed truly random, the overall spacing was sufficient to produce adequate sequence coverage in most cases. This generally random behavior is consistent with our finding that the sequence for the bulk of cDNA clones is completed after generation of the transposon-based sequence reads. Note that such randomness is not always encountered with genomic DNA. In fact, we have found a handful of clear cold spots for Tn5 insertion during the finishing of large-insert genomic clones.
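One simple way to ask whether insertion positions within a single clone depart from random placement is sketched below. It is not the quantitative analysis performed in this study. It computes the inter-insertion intervals and a Kolmogorov-Smirnov distance between the observed positions and a uniform distribution over the insert. The function names and the example coordinates are hypothetical, and the 1.36/sqrt(n) threshold is the standard large-sample approximation for a 5% significance level, so it is only rough for small numbers of insertions.

```python
import math


def inter_insertion_intervals(positions):
    """Distances between consecutive insertion sites (sorted first)."""
    xs = sorted(positions)
    return [b - a for a, b in zip(xs, xs[1:])]


def ks_uniform_statistic(positions, length):
    """Kolmogorov-Smirnov distance to a uniform distribution on [0, length]."""
    if not positions:
        raise ValueError("need at least one insertion position")
    xs = sorted(positions)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = x / length  # expected fraction of insertions at or before x
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


def looks_uniform(positions, length, coeff=1.36):
    """Approximate 5%-level test; coeff/sqrt(n) is a large-sample threshold."""
    d = ks_uniform_statistic(positions, length)
    return d <= coeff / math.sqrt(len(positions))


if __name__ == "__main__":
    spread = [100, 300, 500, 700, 900]   # made-up coordinates in a 1000-bp insert
    clustered = [10, 20, 30, 40, 50]     # made-up hot spot near one end
    print(inter_insertion_intervals(spread))
    print(looks_uniform(spread, 1000), looks_uniform(clustered, 1000))
```

Pooling many clones would require normalizing positions to each insert's length before combining them, which this sketch does not attempt.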
In the accompanying paper by Butterfield et al. (24), an analogous transposon-based strategy for cDNA sequencing is described. Several notable differences from the current study deserve special mention. First, their approach employs the transposon Mu; interestingly, this transposon was also found to provide sufficiently random insertions to support the routine sequencing of cDNA clones. Second, these investigators sequence the cDNA clones in large pools rather than individually. Such an adaptation has the potential to be highly efficient by more accurately fine-tuning the number of sequence reads generated per clone based on insert size. However, it requires that each cDNA insert be sized in advance, that at least one high-quality end read be available from each clone for identification purposes, and that the sequence data emanating from the pooled clones be disambiguated. Nonetheless, their approach has also proven effective for the high-throughput sequencing of cDNA clones.
While the general behavior of the transposon Tn5 is sufficiently random to facilitate the sequencing of cDNA clones, its insertion is associated with certain sequence preferences. Indeed, previous studies have sought to define such consensus sequences for various transposons [e.g. Tn3 (33), Tn5 (30), Tn7 (34), Tn9 (30) and γδ (30)]. The data generated here for the transposon Tn5 and those reported in the accompanying paper for the transposon Mu (24) provide almost unprecedentedly large data sets for examining the preferred sequence for transposon insertion. Based on examining more than 24 000 Tn5 insertions, a reasonable consensus sequence emerged (Table and Fig. ). Careful scrutiny of these data suggests the consensus could be extended further than previously recognized, possibly to an 11- or 13-bp sequence. In addition, the symmetry observed in Figure spans further, suggesting that recognition of the target sequence by the enzyme–transposon complex may span two complete helix turns. Finally, there is a pyrimidine–purine symmetry at this site that seems to strengthen the microfilament model proposed by Goryshin et al.
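The general procedure for deriving a consensus from many insertion sites can be sketched as follows, assuming that equal-length sequence windows centered on each insertion site have already been extracted and oriented. The function names, the 50% threshold and the toy windows are hypothetical and do not represent the Tn5 data or consensus reported here.

```python
from collections import Counter


def position_counts(windows):
    """Count bases at each position across equal-length site windows."""
    if not windows:
        raise ValueError("no windows given")
    width = len(windows[0])
    if any(len(w) != width for w in windows):
        raise ValueError("all windows must have the same length")
    return [Counter(w[i].upper() for w in windows) for i in range(width)]


def consensus(windows, min_fraction=0.5):
    """Most common base per position, or 'N' if none reaches min_fraction."""
    out = []
    for counts in position_counts(windows):
        base, n = counts.most_common(1)[0]
        out.append(base if n / len(windows) >= min_fraction else "N")
    return "".join(out)


if __name__ == "__main__":
    # Made-up windows for illustration only.
    toy = ["ACGTA", "ACGTT", "ACCTA", "TCGTA"]
    print(consensus(toy))
```

With tens of thousands of sites, the per-position counts themselves (rather than a single consensus string) are usually the more informative output, since weak preferences at flanking positions only become visible as modest shifts in base frequencies.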
In summary, the studies reported here provide convincing evidence that a transposon Tn5-based strategy can be used for the systematic sequencing of cDNA clones. By deploying this approach for sequencing more than 4200 mammalian cDNA clones, important insight has been gained about the effectiveness of our general paradigm for cDNA sequencing and some of the fundamental biological characteristics of Tn5.