Design of barcoded primers
We have devised a barcoding approach to substantially enhance the scope and capacity of multiplexed high-throughput pyrosequencing. Specifications for the barcodes are based on details of pyrosequencing error characteristics. Sets of barcodes have been independently designed for the forward PCR primers (FORWARD barcodes) and the reverse PCR primers (REVERSE barcodes) used in library construction (). Length of the barcode was chosen to allow for definitive assignment of barcodes, while minimizing the number of base reads used per barcode. The barcode design () facilitated unambiguous assignment, with care taken to avoid homopolymeric runs that are particularly problematic in pyrosequencing technologies. All forward barcodes were of the type HIJJK LLMNO, while reverse barcodes were of the type HIJKK LMMNO. H, I, J, K and L, M, N, O are uniquely partnered with one of four nucleotides: A, C, G and T. The barcodes thus contained two instances of a 1–2 base run of each nucleotide (one from the H, I, J, K set, and the other from the L, M, N, O set), with no more than two successive occurrences of the same nucleotide (). This design avoids significant problems due to error rate at homopolymeric runs of nucleotides. The forward and reverse barcodes varied in dinucleotide positioning along the length of the barcode. Positions 3, 4 (nucleotide J) and 6, 7 (nucleotide L) were homopolymeric dinucleotides in the forward barcode, while positions 4, 5 (nucleotide K) and 7, 8 (nucleotide M) were homopolymeric dinucleotides in the reverse barcode.
There were several restrictions at specific positions in the barcode. The start position of the barcode (position H) was required to differ from the last nucleotide of the preceding adapter to avoid homodinucleotides at the Adapter::Barcode junctions (). The terminal nucleotide of the barcode (position ‘O’) was constrained to prevent identity with the first nucleotide of both the F-Cloning-Linker and the R-Cloning-Linker. To ensure identical nucleotide possibilities at ‘O’ for all forward and reverse barcodes, priming sequences that correspond to the F-Cloning-Linker and to the R-Cloning-Linker were adjusted as necessary so that they started with the same base (refer to Methods section). These adjustments avoided any unnecessary restrictions on the last base of the barcode (‘O’), created independent sets of restrictions for positions H and O, eliminated the possibility for a dinucleotide at the Barcode::Cloning-Linker junctions and at the Adapter::Barcode junctions, and increased the total number of forward–reverse barcode pairs.
Sixty-four barcode pairs were thus designed, with any two forward or reverse barcodes differing in at least 4 nt: one doublet + two singlets (). These serve as strong unique identifiers for each of the forward and reverse barcode sets. Due to the terminal nucleotide restriction (on position ‘O’), only 48 barcodes may be used with a single 5′–3′ pair of linkers used in cloning. We make one additional note on the choice of barcode sequences: in applications where there is a need for fully optimized read length, an additional preference may be imposed for barcodes whose sequences best utilize the sequencing order (TACG for the GS20) so that a few more sequencing cycles may be available for the DNA of interest.
Testing barcoded primers
The barcoded primers were 45–46 nt in length (Supplementary ), had long overhangs (29–30 nt), and short regions of complementarity to the template (16–17 nt; did not span the entire Cloning-Linker sequence). Despite the extended length of the barcoded primers, complex pools of cDNA containing 19–27 nt small RNA inserts were efficiently amplified to yield PCR products that were 115–123 nt long.
Deciphering the nature of products amplified from complex pools of cDNA
Construction and sequencing of libraries from complex mixtures of RNA templates involves numerous steps that may skew the relationship between ratios of molecules in the initial pool and obtained ratios of corresponding sequences. Differential efficiencies of RNA extraction, linker addition and reverse transcription may skew the initial cDNA pool. Differences in amplification efficiencies of individual sequences in the cDNA pool, however slight, may bias the library population over many rounds of PCR. Sequence-specific differences in efficiency or fidelity of sequencing may distort the final proportions of assigned DNA sequences. These biases have been extensively discussed in literature; although potentially confounding, their effects may be controlled for to some extent, so that sequence representations in the cDNA and subsequent PCR-amplified libraries at least provide for an initial assessment of sequence incidence in the original small RNA populations.
We note one additional concern that is highly relevant to preparation of libraries for high-throughput sequencing. This concern specifically relates to purification and handling of pooled DNA products following amplification of complex cDNA libraries. Amplified DNA pools generally consist of a mixture of single strands, perfect duplexes and partial duplexes. The overall structural composition of the amplified DNA pool is determined by the extent of amplification in the last round of PCR. DNA molecules that were substrates for polymerase extension during the last round of PCR will be double-stranded. Molecules that fail to extend in the last round of PCR may remain single-stranded, may form heteroduplexes (if they anneal to a distinct molecule during the last PCR cycle; ), or may form perfect duplexes (if they find and anneal to an exact complement during the last PCR cycle). This heterogeneity will be more prominent in situations where extension has become inefficient (e.g. PCR taken beyond a point of saturation), resulting in the majority of molecules being single stranded or heteroduplex during thermal cycling. Only abundant single-stranded species would be more likely to reanneal to form homoduplexes (c). The low probability for rare single-stranded species to find perfect partner strands pushes them towards reannealing with non-identical PCR products (i.e. with library elements that have distinct inserts), resulting in the formation of heteroduplex structures ().
Figure 2. A diagrammatic representation of heteroduplexes that may be formed in an amplified pool. (A) Heteroduplexes formed between molecules from the same sample that have the same barcode, but different RNA inserts (X and Y). (B) Heteroduplexes formed between (more ...)
The heterogenous mixture of single strands, perfect duplexes and heteroduplexes skews the sequence incidence only if there is a step in the procedure that enriches for double-stranded molecules post-PCR. This is the case, however, with many gel isolation procedures that are employed to size-select for PCR products in a pre-defined size range. Biased selection in these protocols is attributed to a combination of factors: (i) differences in mobility between single-stranded, double-stranded and heteroduplex molecules on the gel, (ii) differential recovery from gel-isolation procedures, which may enrich for segments that have either formed perfect duplexes, or are capable of forming perfect duplexes.
Our observations confirm that the overall nature of the DNA pool number was influenced by the extent of amplification, i.e. by the number of PCR cycles (), since a migration shift on the agarose gel was evident upon the addition of two to four PCR cycles (with all other PCR conditions being constant). This shift is consistent with a transition between duplex structures (a relatively tight band), and a mixture of heteroduplex and single-stranded material (a more diffuse band starting above the presumed duplex band). In order to maximize the accuracy of sequence representation for each individual library, we took samples from PCR cycles prior to saturation. Products migrating with the expected double-stranded DNA population were then isolated from an agarose gel, recovered (by extraction without denaturation) and quantitated, before they were used for sequencing. Our samples were found to consist of approximately equal numbers of double-stranded and single-stranded DNA molecules by the Quant-iT PicoGreen dsDNA assay (Invitrogen), and the RNA BioChip assay (Agilent 2100 Bioanalyzer, Agilent Technologies), respectively (data not shown). Our efforts to avoid saturation of the PCR amplification, although not perfect, helped to maximize the proportion of homoduplexes formed in the last cycle of PCR, thus facilitating the recovery of both rare and common species in the library pool. All these observations stress the importance of close monitoring of complex DNA pools to ensure that overall quality is not compromised through excessive amplification of seed populations, and through subsequent manipulations of amplified populations.
Figure 3. Visualizing the nature of amplified products from various cycles of PCR. With an increase in cycle number (twenty cycles of an initial round of PCR followed by six, eight or ten cycles of a second round of PCR), there is an evident shift in mobility of (more ...)
Analysis of data from two independent pyrosequencing runs
DNA from two independent sets of 25 barcoded libraries (prepared with 19–27 nts RNAs from different in vitro and in vivo virus-infected systems) were pooled for sequencing runs, and the data was used to evaluate the efficacy of the barcoding technology. Both 237 639 and 271 777 sequences from runs I and II, respectively, passed the GS20's quality control filters. Since six of the forward barcodes are exact reverse complements of six reverse barcodes, we added a number of adjoining nucleotides in the Cloning-Linker (0–17) to the barcode query to reduce ambiguity while segregating the sequences into independent datasets. These barcode + linker sequences are referred to herein as ‘search motifs’ or ‘barcode motifs’. Using 13-nt search motifs, Barsort and Barverify identified that 94.53% (run I) and 94.08% (run II) of the sequences had perfect forward and reverse barcode motifs, 2.90% (run I) and 3.24% (run II) of the sequences had perfect barcode motifs only at the 5′ end of the read, and 2.44% (run I) and 2.51% (run II) had perfect barcode motifs only at the 3′ end of the read (for data and terminology, refer ).
Summary of sequence distribution for both pyrosequencing runs as a function of barcode motif length
Sequences that have a perfect match to only one of the expected motifs have been further subdivided into three classes based on whether the motif at the other end has: (i) perfect identity to an unexpected motif, resulting in an unassigned 5′–3′ motif combination (perfect, unexpected occurrence), (ii) a high-scoring partial match to the expected motif for that library (imperfect, expected occurrence) and (iii) no strong match to any barcode motif (imperfect, ambiguous occurrence). Perfect, ‘unexpected’ occurrences may be attributed to motifs with rare coincidental sequence matches (due to multiple fortuitous errors during barcode synthesis or sequencing), to contamination in PCR primer stocks, or to cross-contamination of template DNAs in the second round of PCR. The imperfect matches may arise due to errors in sequencing [particularly at the starts and ends of the sequencing reads; (1
)], due to errors during conventional and/or emulsion PCR, or due to imperfections during the chemical synthesis of barcoded primers. With the barcode design used in this work (i.e. no homopolymer beyond 2 nt), we expected to minimize errors related to pyrosequencing. Indeed, the experimentally observed spectrum of errors did not match the types of homopolymer-related errors most frequently observed for pyrosequencing. Instead, the observed error spectrum appeared most closely to match that expected as a result of known imprecisions in the chemical synthesis of DNA primers used in standard PCR reactions (Supplementary ). Adding to any potential protocol-related possibility of sample misassignment, we note that the highly sensitive nature of high-throughput sequencing needs to be considered in all analyses. In particular, any possible source for sample contamination upstream (or during) library construction needs to be carefully monitored.
For bioinformatic segregation of a barcoded dataset, a decision needs to be made as to how much of the (fixed) linker sequence will be required in addition to the (variable) barcode for definitive sample assignment. The false-discovery rate, defined by the percentage of ‘perfect and unexpected’ occurrences, was observed to decrease to <0.005% (), if three bases of the linker were used in addition to the 10-base barcode to assign sample identities. Thus a search motif of 13 nt was determined to be ideal for our studies in minimizing false-discovery rate, while maximizing the number of sequences with error-free hits identified by the program. Sequences segregated using motifs of this length were used for all further analyses. The low false-discovery rate for perfect matches strengthens the argument that any sequence with both 5′ and 3′ barcode motifs will be placed with reasonable accuracy into the correct bin. For our purposes, all sequences with a match to one of the two expected sequence motifs were considered as true members of the corresponding dataset. Run I (0.13%) and run II (0.17%) of sequences had no perfect matches to any of the sequence motifs (forward or reverse), and were discarded. Thus, data from two independent sequencing runs on pools of barcoded material from a total of 50 sample libraries (25 sample libraries per run) show similar patterns with respect to sequence recovery, binning and rate of errors that result in false discovery and misassignment.
Distribution of sequences across samples
The barcoding approach requires being able to establish a relationship between ratios of material from various libraries, and the ratios of obtained sequences that correspond to those libraries. Our pooling of equimolar ratios of DNA from independent libraries based on visual quantification was evidently less than perfect, as there was variance in the representation of individual libraries in the overall pool (Supplementary Tables 4D and 5D). If the pooling was indeed equimolar, the expected mean number of sequences per independent pool would be 9596 (run I) and 10 852 (run II), with expected binomial SD of 95.98 (run I) and 102.07 (run II). Instead, in run I, we observed a SD of 2909.93 from the mean, and the number of sequences per set varied from 2950 to 17 922. In run II, we observed a SD of 2801.64 from the mean, and the number of sequences per set varied from 5347 to 15 901. The observed fluctuations between samples are not statistical in nature and may primarily be due to inconsistencies in ethidium bromide (EtBr)-based quantitation of DNA. Much of the difference may result from subjectivity in visual quantification (2-fold differences or less are considered appropriate for this method). In addition, EtBr will fluoresce more effectively with double-stranded species than with heteroduplex or single-stranded species. Thus, ratio of single-stranded to duplex species may influence the intensity of the band during gel electrophoresis, and may contribute to the observed variation. Finally, the fraction of heteroduplexes and non-annealed single strands present in the pool has an impact on the compactness of the band, and this in turn may affect precise concentration approximation. Denaturing materials from individual samples, and measuring their concentrations using Agilent's 2100 BioAnalyzer (before pooling samples together) may overcome these drawbacks of visually estimating DNA concentrations.