The following sections outline the algorithms used by barcrawl to design primers and bartab to post-process multiplexed DNA data sets. These are followed by an experimental test-case that outlines the design and validation of a set of 96 barcoded primers for broad-range PCR amplification of bacterial large subunit ribosomal RNA genes (LSU rDNA).
Barcrawl Algorithm and Operation
In overview, barcrawl creates a list of all possible DNA sequences of a specified length and then progressively culls sequences that may interfere with primary PCR amplification and/or sequencing steps (i.e. homopolymers, hairpins, and/or heteroduplex formation). The primary structures of forward and reverse PCR primers used to barcode amplicons are schematized in Fig. . Regions labeled "Sequencing" in Fig. are not required for barcrawl function, but are included in this example because they encode sequences necessary to perform direct pyrosequencing of amplicons on the 454 GS-FLX. At a minimum, the barcode must lie upstream of the specificity region of the oligonucleotide and in a position that can be sequenced by a particular platform. Well-designed barcode sequences must be robust to both mutation during amplification and errors in base-calling during sequencing. To this end, barcrawl performs an exhaustive pairwise comparison of all potential barcodes and discards those that differ in sequence compared to other barcodes by less than a user-specified cutoff value. The default value of 3 base differences insures that multiple substitution events or base-calling errors must occur for one barcode to be mis-classified as another.
Figure 1 Barcoded primer structures. Design of barcoded oligonucleotides. "Barcode" indicates the location of sequences unique to each PCR primer. "Template-specificity" indicates sequences required for PCR amplification. In this example, template-specificity (more ...)
Barcrawl is invoked on the command line by typing barcrawl [options]. The default implementation requires no input from the user, although several options can be set on the command line (Table ). Specific primer sequences to which the barcodes will be linked (either 5' or 3' to the barcode) along with a reverse primer sequence can be designated on the command line. The algorithm used by barcrawl to create and evaluate potential barcode sequences in the context of other primer sequences is as follows (Fig. ):
Summary of barcrawl command-line options.
Figure 2 Barcrawl and bartab algorithms. Rectangles specify input and output. Diamonds designate filters applied to data. Barcodes (barcrawl) or sequences (bartab) that do not follow the outlined rules are discarded. Arrows diverging from the central flow chart (more ...)
1. For barcodes of length l (default = 8), generate 4∧f l strings encoding all permutations of the four standard DNA nucleobases (G, A, T, and C). Store strings in a list.
2. Remove from list all barcodes with G+C content > cutoff value (default > 70%).
3. Remove from list all barcodes with G+C content < cutoff value (default < 30%).
4. Remove from list all barcodes that contain homopolymers of length > = cutoff value (default = 3).
5. Concatenate potential barcode(s) to 5' and 3' sequences (if specified) to form composite forward primer. Remove from list all barcodes in which 5' base of barcode is identical to 3' base of upstream primer ("Sequencing" region in Fig. ).
6. Remove from list all barcodes in which 3' base of barcode is identical to 5' base of downstream primer ("Template-specificity" region in Fig. ).
7. Remove from list all barcodes for which composite forward primers potentially form intramolecular hairpins of length >= cutoff value (default = 7 basepairs).
8. Remove from list all barcodes for which composite forward primers potentially form heteroduplexes with reverse primer of length >= cutoff value (default = 7 basepairs).
9. For all pairwise combinations of barcodes, i and j, remaining in list (i.e., those retained after steps 2-8), compute d, the number of base differences between barcodes i and j. Retain barcodes i and j if d >= cutoff value (default = 3 base differences), otherwise remove barcode j from list.
10. For all pairwise combinations of barcodes, i and j, remaining in list (i.e., those retained after steps 2-9), determine whether any deletion within barcode j creates a string that is an exact match to a substring of i. Determine whether any deletion within barcode i creates a string that is an exact match to a substring of j. If an exact match is found in either search, then remove barcode j from list.
11. Sort list of remaining barcodes by number of nucleotide flows (sensu 454 GS-FLX system) required to sequence through a primer.
12. Output list of barcodes, corresponding composite forward primers, and required pyrosequencing flows to tab-delimited file.
Barcrawl outputs the results of a search in two files: 1) A tab-delimited list of barcoded forward primers, the corresponding barcode sequences by themselves, and the number of pyrosequencing nucleotide flows required to sequence each barcode using the 454 GS-FLX system and 2) A logfile that summarizes the command options for a barcrawl run, statistics for number of barcodes output, distribution of flows for the set of barcodes, and elapsed execution time.
Bartab Algorithm and Operation
The process of multiplexed high-throughput sequencing typically requires physical mixing of barcoded PCR libraries. Consequently, sequence reads generated with different barcoded primers must be identified and sorted into barcode-specific groups in order to associate sequences with the appropriate libraries. The command line software bartab provides a set of post-processing tools for checking sequence quality, identifying barcodes, annotating sequence reads, parsing sequences into barcode-specific groups, and de-replicating sequence sets (Fig. ). While designed to complement the function of barcrawl, the two software programs function independently of one another.
Bartab is invoked on the command line by typing bartab [options]. Command line options are summarized in Table . As input, bartab requires a FASTA formatted DNA sequence file and a corresponding file of quality scores in FASTA format. Optionally, a tab-delimited file that lists valid barcode sequences (one barcode per line) can be provided as input (example shown in Fig. ); sequences that do not encode a valid barcode are discarded if this file is present. The decision to discard sequences with invalid barcodes, rather than to infer the most likely barcode for a sequence, is based on the premise that sequences with mutated barcodes may encode other, non-correctable mutations in coding regions. This conservative approach does not unduly affect the total number of reads obtained from a sequencer run; we typically discard ca. 2% of sequences on the basis of invalid barcodes.
Summary of bartab command-line options.
Figure 3 Addition of Metadata through Barcode File. A. Format of barcode file. Column headings act as keys for values in column cells. Strings specifying keys and values must consist of ASCII characters without whitespace. Values in a row are separated by tabs. (more ...)
Fig. presents a flow chart of the main bartab functions. In overview, bartab applies a user-configurable series of filters to check the quality of each sequence read. These include filters based on minimum quality scores (default is an average Q-score of 20 across a window of 5 nts), presence of ambiguous base calls (default is to reject sequences with any ambiguities), removal of primer and barcode sequences, and minimum polished length above a threshold (default = 200 nt.).
In addition to validating barcodes, inclusion of a barcode file provides a ready means for both sorting sequences and annotating them with metadata that is keyed by barcode sequences. For instance, if unique barcodes are assigned to individual PCR libraries (e.g. from individual samples), then metadata specific to a library can be assigned to all of its constituent sequences. For this purpose, additional tab-delimited columns of data are included in the barcode file (Fig. ) in order to associate barcodes with metadata. Column headers specify keys (e.g., "barcode", "library", and "sample" in Fig. ), and data elements falling under a given column specify particular values to be associated with the barcode listed on the same row. A sequence can then be annotated with all key-value pairs associated with its barcode sequence. For ease of subsequent parsing, metadata are listed in FASTA definition lines using the format "key:value" (Fig. ). Such annotation is not mandatory. If a barcode file is not specified on the command line, bartab simply skips the barcode validation and sequence annotation steps (Fig. ).
Polished sequences are output in FASTA format, as are corresponding quality scores. In addition, sequences (and quality scores) that are rejected based on their failure to satisfy the conditions of the filter steps, are discarded into a separate file (designated "<filename>.rej.fa" and "<filename>.rej.qual"). The default behaviour is to aggregate all acceptable, polished sequences into a single file. Alternatively, sequences assigned to different groups on the basis of barcode metadata can be split into multiple files by setting the "-spl <string>" option on the command line, where "<string>" denotes the name of a column in the barcode file. For instance, "-spl barcode" creates a separate output file for each unique barcode and then directs the output of polished sequences to the appropriate files based on their barcodes. Other annotation keys in the barcode file could also be selected as the basis of sorting. Analogously, selection of the rename option "-rnm <string>" replaces sequence names with a base name determined by the barcode file and a unique, sequential integer value for each sequence in the group (Fig. ).
Because high-throughput sequencing libraries may include multiple identical sequences, identifying, denoting, and removing such replicate sequences, can bring significant performance improvements to downstream applications. Toggling on the sort option ("-rep") causes bartab to de-replicate groups of like sequences. For each primary FASTA file that is output, Bartab uses a hash function to identify unique sequences (termed "sequence tags") and subsequently output the representative sequences to a FASTA file with the suffix "<filename>.rep.fa", for representative sequences. In order to preserve information about the relative abundances of sequences, bartab adds metadata to each representative sequence that specifies the number of sequences in the group (e.g. a barcoded library), designated by the key "xsrep_count" (Fig. ).
Lastly, bartab compiles detailed summary statistics, written either to a log file or the terminal, that include histograms and cumulative distributions of sequence length before and after trimming, as well as polished sequence counts for each barcode. Bartab can be executed in a dry run mode ("-dry") to tabulate these statistics without actually modifying sequence data. This feature allows the user to adjust input parameters if desired.
Case Study: Application of Software to rRNA Metagenomics
In order to validate, refine, and demonstrate the use of barcrawl
, a set of 96 barcoded oligonucleotides were designed and analyzed in a real world application. Barcoded-forward and reverse primers were constructed for broad-range PCR amplification of the bacterial LSU rDNA, using the highly-conserved LSU130F and LSU559R sequences ([9
]; numbering relative to Escherichia coli
23S rRNA gene). The design goal of this exercise was to construct a bank of 192 barcoded LSU559R primers, while minimizing the number of pyrosequencing flows required to sequence each barcode.
To explore the sequence space of possible barcodes in the context of the required GS-FLX and LSU sequences, multiple barcrawl analyses were performed under systematic permutation of the program's parameters. Results and execution times for a subset of the runs are summarized in Table . As expected, barcode length was the primary determinant of the size of final barcode pools. Regardless of the parameters, the number of barcodes in an output set typically was ~1% of the size of the starting pool, indicating that most potential barcodes were rejected. For example, the default values produced a pool of 2774 barcodes out of a starting pool of 262,144 sequences (Table , run 6). Relative to the default settings, enhancing the stringency of the analysis either by increasing the minimum distance between barcodes (Table , run 8) or decreasing the permissible homopolymer length (Table , run 13) reduced the final pool sizes by 4-fold or 3-fold, respectively. In comparison, modifying the permissible GC-contents, hairpin lengths, or heteroduplex lengths produced only marginal changes in the extents of final set sizes.
Barcode sequence space(s) defined by program parameters.
Fig. presents cumulative distribution plots of barcodes as a function of pyrosequencing nucleotide flow, for 8-mers (Table , run 5), 9-mers (Table , run 6), and 10-mers (Table , run 16). As demonstrated in this plot, for pyrosequencing flows greater than 10, the cumulative number of barcodes available at a given nucleotide flow increased with barcode length. For example, in the case of either 9-mer or 10-mer barcodes, the desired sets of 192 barcodes could be selected from barcodes that required <= 12 nucleotide flows. In contrast, a comparable set of 8-mer barcodes would have required selection of barcodes requiring 13 nucleotide flows. Although somewhat counterintuitive, this result indicates that the choice of longer barcode lengths (e.g., 9-mers, 10-mers) can be used to minimize overall sequencing effort (i.e., average number of nucleotide flows) relative to shorter barcodes. In other words, longer barcode lengths can in some circumstances decrease the average number of sequencing flows required to sequence a set of barcoded oligonucleotides.
Figure 4 Cumulative distribution of barcodes as function of pyrosequencing nucleotide flows. Barcrawl analyses were performed for range of barcode lengths. The cumulative sums of selected barcodes are plotted vs. pyrosequencing nucleotide flows. The inset box (more ...)
The default parameters produced a set of 2774 potential barcodes of length 9 nt. Inspection of these results indicated that cutoff values for maximum hairpin and heteroduplex lengths could be made more stringent (i.e. <= 4 basepairs) without compromising the design goal of 192 barcodes. Therefore the parameters specified in Table , Run 10 were used to select a set of LSU559R barcoded primers. In this example, barcrawl was invoked with the following command:
barcrawl -l 9 -m 3 -p 3 -a 4 -b 4 -c 30 -g 70 -f5 GCCTCCCTCGCGCCATCAG -f3 NNCATTMTACAAAAGGYACGC -r GCCTTGCCAGCCCGCTCAGNNCCGAATGGGGVAACCC -j5 -j3 -d
The starting pool consisted of 262,144 oligonucleotides. Of these, 260,736 oligonucleotides were discarded by application of the GC-content (47,104 oligos culled), homopolymer (58,816), hairpin (65,624), join (14,115), deletion (15,007), and minimum base difference (59,710) filters. The final pool of 1768 acceptable barcodes was sorted in ascending order of 454 GS-FLX flows and oligonucleotides were synthesized for the top 96 barcodes.
The relative abilities of the barcoded LSU559R oligonucleotides to prime broad-range PCR were assessed by setting up parallel PCR reactions that used the same cocktail of PCR reagents, metagenomic DNA, and reverse primer. Real-time PCR reactions were conducted and Ct scores measured for each oligonucleotide (Fig. ). As shown in Figs. and , the distribution of Ct scores was approximately normal. In this experiment, Ct values ranged from 26.5 to 29.1 cycles (mean = 27.6 ± 0.48 cycles). Under perfectly efficient exponential PCR amplification, this would translate to a 6-fold difference (22.6) in yield across the range of Ct scores.
Figure 5 Functional assessment of barcoded primers by quantitative PCR and pyrosequencing analysis. A. Distribution of Ct scores for 96 barcoded LSU559 primers tested under identical reaction conditions. Bars represent relative frequency of occurrence of Ct scores. (more ...)
To assess whether the observed differences in amplification within the set of primers was due to factors intrinsic to the barcode sequences or to normal experimental variability, the experiment was duplicated using different metagenomic DNA and newly diluted sets of LSU559R oligos. Fig. presents a comparison of the Ct values obtained for each primer in duplicate experiments. We interpret the Pearson correlation coefficient for these data (r2 = 0.48), as indicating that only 48% of the variability could be attributed to intrinsic differences in the set of barcoded oligonucleotides. This could represent sequence-specific factors or variability in the concentrations of oligonucleotide stocks provided by the manufacturer.
Finally, barcoded LSU559R oligos were tested for their ability to support multiplexed pyrosequencing on the Roche GS-FLX platform. Pre-sequencing PCR amplification, clean-up, and pyrosequencing steps were performed as described in Materials and Methods. A selection of 48 barcoded primers was assayed in order to balance sequencing costs against the need to ensure an adequate distribution of sequences across oligonucleotides; for the purposes of this experiment, we predicted that a 1/8 platform run would provide sufficient sequencing coverage of all 48 oligos (>400 sequences on average).
To mitigate the effects of experimental variability, each barcoded LSU559R primer was tested in triplicate sequencing reactions. Despite this redundancy, one well of the PCR plates (H01) was systematically prone to evaporation during the triplicate experiments. Although earlier experiments demonstrated that the barcoded primer in this well of the microtitre plate was functional, only minimal PCR product was obtained prior to sequencing in this set of experiments. Consequently, the results for this oligo were censored.
Pyrosequencing data were provided in two files containing either FASTA formatted sequences (lsu559R.fa) or their corresponding quality scores (lsu559R.qual). These, along with a file specifying sequences of the 47 barcodes under investigation (lsu559R.bar), were provided as input to bartab. Sequence reads were trimmed and barcodes identified using the bartab system call: bartab -in lsu599R.fa -qin lsu599R.qual -bar lsu559R.bar -min 175 -out lsu559Rout"). A total of 29,173 raw reads were generated in the 1/8 plate GS-FLX run. From this starting pool, bartab output 22,332 polished sequences and rejected 5557 sequences (19%) with length < 175 nt., 814 sequences (2.7%) with ambiguous base calls, and 470 sequences (1.6%) with no valid barcode.
Each of the 47 barcoded LSU559R primers was well-represented in the sequence data set, with a range of 219-693 sequences recovered per barcode. As shown in Figs. and the distribution of sequence counts obtained for each barcode sequence was approximately normal (median = 500.0, mean = 474.7). A scatterplot of the number of sequences generated by barcoded primers vs previously measured QPCR Ct values for the primers showed no correlation between PCR and sequencing efficiencies (Pearson correlation coefficient = -0.01). Thus, no bias was observed in the ability of a primer to support high-throughput sequencing under the protocols employed. We conclude that this set of LSU specific barcoded primers is validated for multiplexed sequencing projects.
Software Performance Issues
As a rough guide to system requirements, benchmark comparisons for common barcrawl and bartab tasks are summarized in Tables and , respectively. Barcrawl stores all processed data in RAM, so performance likely will be limited as much by memory as by raw processor speeds. Barcrawl operates in approximately exponential time, O(cn), proportional to the length of barcodes. Selection of the 2774 LSU559R barcodes by barcrawl required 416 seconds on a Linux workstation (Table ). Performance was most critically affected by the filter that screened for robustness of barcodes to deletion ("-del" option).
Performance measures for bartab:
Bartab operates in approximately linear time, O(n), proportional to the number of sequences to be processed. In contrast to barcrawl, bartab does not store extensive data in RAM; only the file dereplication step stores multiple (unique) sequences in RAM at any given time. A Linux workstation required 40 seconds to trim, annotate, and export 400,000 sequences (Table ), a comparable number of sequences to that generated in a full run on a GS-FLX pyrosequencer. Dereplication of output files required little additional computational time. In contrast, the overhead required to sort sequences into barcode-specific files was considerable (2× elapsed time) for the Mac OSX operating system. To enhance performance, sequence data can be spread across multiple files. Alternatively, more complex data storage strategies, such as the use of application-specific databases may be implemented in the future.