Since its introduction, the benefits of 454 sequencing (1) have been exploited for an increasing number of applications, including genomic sequencing (2), cDNA sequencing (3) and ultra-deep amplicon sequencing (4). Despite its power in generating a great number of sequences from few samples, 454 sequencing is at present unsuitable for studies requiring targeted sequence data from many different samples. This limitation is particularly pertinent to population and medical genetics studies. The 454 sequencing plate can be physically divided into a maximum of sixteen regions, each of which yields an average of 0.63 Mb per run on the GS20 and 2.88 Mb per run on the new GS FLX sequencing platform. If shotgun libraries from 16 human mtDNA genomes (about 16.6 kb each) are sequenced in a single run, each genome will on average be covered roughly 40-fold on the GS20 and 170-fold on the GS FLX. Despite the low cost per sequenced nucleotide, the high coverage and cost per run make this approach impractical. Furthermore, the physical separation of the plate reduces the total number of sequences retrieved from one run to roughly half and consumes more time and material, as each sample must be processed separately. Thus, even further physical separation of the plate would not solve these inherent problems.
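The per-genome coverage is the yield of one plate region divided by the genome length. The following is a minimal sketch of that arithmetic, assuming a human mtDNA length of 16,569 bp (the length of the revised Cambridge Reference Sequence, a value not stated in the text) and one genome per sixteenth-plate region; the names `MTDNA_LENGTH_BP` and `fold_coverage` are illustrative only.

```python
# Rough per-genome coverage when each of 16 plate regions holds one genome.
# Assumption (not from the text): human mtDNA length of 16,569 bp (rCRS).
MTDNA_LENGTH_BP = 16_569

# Average yield per sixteenth-plate region, in bases, as quoted in the text.
YIELD_PER_REGION_BP = {"GS20": 0.63e6, "GS FLX": 2.88e6}


def fold_coverage(yield_bp: float, genome_length_bp: int) -> float:
    """Average fold coverage = sequenced bases / genome length."""
    return yield_bp / genome_length_bp


if __name__ == "__main__":
    for platform, yield_bp in YIELD_PER_REGION_BP.items():
        coverage = fold_coverage(yield_bp, MTDNA_LENGTH_BP)
        print(f"{platform}: about {coverage:.0f}-fold per genome")
```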
One approach to overcoming these limitations is to barcode samples with sample-specific sequence tags. Samples are pooled prior to 454 sequencing and are identified after sequencing by their unique sequence tags. This approach has been used to barcode cDNA libraries with tagged primers for reverse transcription (5). Recently, Binladen et al. used 5′-tagged PCR primers to distinguish amplicon sequences derived from different sources. However, this method requires the synthesis of sample-specific primers for each target under study, which is time-consuming and cost-prohibitive when dealing with large sample sizes. Currently, no method exists for barcoding small genomic libraries or amplicons derived from untagged PCR primers. The resources of 454 sequencing therefore cannot be fully exploited for applications that require low-coverage sequencing of many different samples.
Whatever the potential benefits of a barcoding method for 454 sequencing, any proposed technique must ensure efficient use of sequencing resources and high data reliability. Incomplete reactions and sequencing errors can result in background sequences without a sequence tag, heterogeneous sequence representation among samples, and false assignment of sequences to their sample of origin. We have developed a method called parallel tagged sequencing (PTS) that largely alleviates these problems.
PTS is based on a ligation strategy analogous to the one utilized in the standard 454 library preparation procedure (1), the use of barcoding adapters and a restriction system that excludes background sequences. An outline of the method is displayed in Figure 1A. In separate reactions, DNA molecules from different samples are blunt-end repaired and phosphorylated. Subsequently, sample-specific barcoding adapters are ligated to both ends of the molecules, and the resulting nicks are removed by a strand-displacing polymerase. Each barcoding adapter consists of a single self-hybridized palindromic oligonucleotide containing an SrfI restriction site flanked by complementary sequence tags (Figure 1B). This setup requires only a single oligonucleotide to be synthesized in order to barcode each sample. The barcoded samples are then quantified, pooled in a ratio reflecting the proportion of sequences desired from each sample and treated with phosphatase to remove residual 5′ phosphates of unligated ends. Such unligated ends may arise during the adapter fill-in step in molecules containing single-strand nicks caused by nebulization. Half of the adapter is then cut off by SrfI, which is a rare cutter in mammalian genomes (7). SrfI leaves blunt ends with 5′ phosphates, which allows the pooled samples to be processed directly using the standard 454 library preparation. Phosphatase treatment in conjunction with SrfI digestion effectively reduces background sequences when starting from nebulized DNA, because untagged molecule ends are prevented from being ligated to the universal 454 adapters during library preparation.
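The adapter layout described above can be sketched in a few lines. This is a minimal illustration, assuming the oligonucleotide consists of nothing but the SrfI site (GCCCGGGC, cut blunt between GCCC and GGGC) flanked by a tag and its reverse complement; the oligonucleotides actually used may differ, and the tag `AGTCTA` and all function names are hypothetical.

```python
# Sketch of a palindromic barcoding adapter: revcomp(tag) + SrfI site + tag.
# Assumption: the oligo contains only the site and the two tag copies.
SRFI_SITE = "GCCCGGGC"  # SrfI recognition site, blunt cut after GCCC
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def build_adapter_oligo(tag: str) -> str:
    """Build a self-complementary oligo carrying `tag` on its 3' side."""
    return reverse_complement(tag) + SRFI_SITE + tag.upper()


def is_self_complementary(oligo: str) -> bool:
    """True if the oligo can hybridize to a second copy of itself."""
    return reverse_complement(oligo) == oligo.upper()


def retained_half(oligo: str) -> str:
    """Half of the adapter left on the sample molecule after SrfI digestion.

    The oligo is symmetric, so the blunt cut falls exactly at its midpoint.
    """
    return oligo[len(oligo) // 2:]


if __name__ == "__main__":
    oligo = build_adapter_oligo("AGTCTA")  # hypothetical tag
    print(oligo, is_self_complementary(oligo), retained_half(oligo))
```

Because the oligonucleotide equals its own reverse complement, the same retained half (GGGC followed by the tag) is read from either end of a tagged molecule, which is why one oligonucleotide per sample suffices.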
Figure 1. Workflow for barcoding and sequencing double-stranded DNA samples. (A) The DNA is blunt end repaired (I) and sample-specific self-complementary oligos are self-hybridized to form double-stranded barcoding adapters (II), which are subsequently ligated ...
The sequence output from barcoded libraries begins with the sequence key TCAG, which originates from the universal 454 adapters, followed by four non-informative nucleotides GGGC from the SrfI site. Since the 454 technique uses sequencing-by-synthesis, all G's are incorporated in the same nucleotide flow as the last key base, and therefore do not decrease the read length. The origin of the sequence is identified through the adjacent sequence tag. We developed a tag design that is particularly robust to sequencing errors associated with homopolymers, a well-known problem in 454 sequencing (1). At a length of 6 bp, it allows the pooling of a maximum of 72 samples into a single sequencing library (Figure 1B). Since the 454 sequencing plate is always split into at least two regions, this configuration allows for the parallel processing of up to 144 samples in a single run.
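The read layout described above (key, SrfI remnant, tag, insert) suggests a simple way to assign reads to samples. The sketch below is illustrative only: it uses exact matching and does not reproduce the error-tolerant tag design, it assumes the key is still present at the start of each read, and the tags, sample names and the function `assign_read` are hypothetical.

```python
from typing import Dict, Optional, Tuple

KEY = "TCAG"           # sequence key from the universal 454 adapters
SRFI_REMNANT = "GGGC"  # half of the SrfI site left after digestion
TAG_LENGTH = 6         # tag length in bp, as stated in the text


def assign_read(read: str, tag_to_sample: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (sample, insert sequence), or None for background reads.

    A read is treated as background if it lacks the expected prefix or
    carries a tag that is not in `tag_to_sample` (exact match only).
    """
    read = read.upper()
    prefix = KEY + SRFI_REMNANT
    if not read.startswith(prefix):
        return None
    tag_start = len(prefix)
    tag = read[tag_start:tag_start + TAG_LENGTH]
    sample = tag_to_sample.get(tag)
    if sample is None:
        return None
    return sample, read[tag_start + TAG_LENGTH:]


if __name__ == "__main__":
    tags = {"AGTCTA": "sample_1", "CATGCA": "sample_2"}  # hypothetical tags
    print(assign_read("TCAGGGGCAGTCTAACGTACGT", tags))
    print(assign_read("TCAGACGTACGTACGT", tags))
```

Exact matching is the simplest choice here; a production pipeline would also need to tolerate miscalled homopolymer lengths in the run of G's that precedes the tag.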
We demonstrate the power of parallel tagged 454 sequencing by shotgun sequencing six human mtDNA genomes on two 16th plate regions of the GS20 platform. Published Sanger sequences for these samples allow a comparison of the two sequencing approaches. Additionally, we verified the reproducibility of the method by independently sequencing a second set of six human mtDNA genomes.