Multiple-alignment informs primer design
A high-information multiple sequence alignment covering a region of interest (ROI) must be supplied by the user (Figure ). The ROI is the DNA region that contains the de novo
DNA information in the samples to be analyzed. These could be, for example, sites related to drug resistance, CTL epitopes, or other features. It is important that the alignment contains examples of the expected genetic variation in the population to be analyzed, for instance, that sufficient diversity in hyper-variable regions to be representative is included. This will allow the design of primers that both avoid highly variable regions, and that address the diversity found in the more conserved regions that are good candidates for primer localization. To allow localization of the best primer sites, the ROI should be flanked by sequence on either side. Ideally, such flanking sequences should be long enough to allow most sequence reads to extend the full length of the ROI, and maximize the range of possibilities for primer selection. Formally, therefore, flanking sequence length should equal the maximum sequencing read-length minus the ROI-length on either side: e.g., if the ROI is 100 bp and maximum sequencing read length is 500 bp, the alignment should ideally be (500–100)
900 bp long. While this may not always be possible, it is helpful as a guideline; at minimum, there must always be enough sequence on either side of the ROI to identify a suitable place for a primer. When studies of within-subject viral diversity are undertaken, having a set of sequences spanning the ROI from the subject is desirable to inform primer design; when sequences will be obtained from multiple subjects, population diversity needs to be considered. While our primer design methods will work using alignments that come from any variable organism, however for HIV we provide many premade alignments that cover different levels of genetic variation or that are made on the fly from database searches at the HIV database site http://www.hiv.lanl.gov
Figure 1 The region of interest (ROI) defines the de novo target DNA sequence. Ideally a multiple alignment should have ample space for primers on either side of the ROI, considering desired primer lengths and sequence read length (which depends on the sequencing (more ...)
User and sequencing method restrictions constrain primer design
When primers are designed for re-sequencing, including NGS-based deep sequencing, the user already has information about the ROI. Based on the alignment, the user specifies the ROI by its alignment-specific start and end coordinates. The user also can place constraints on the primers in terms of minimum and maximum length (10
40 bp), and maximum allowed differences in melting temperatures (Tm
) in °C to match the primers to experimental conditions and desired de novo
lengths. Importantly, primers will be designed to include genetic variation down to a user specified detection limit d
, which, if necessary, will include degenerate nucleotide states at some primer positions. The user can also restrict the design by requiring primer sites to have G/C characters at 3′- and/or 5′-ends.
In addition to user specified constraints, the sequencing method also puts restrictions upon the design process. For instance, if the user specifies 454 Titanium as adaptors, this automatically sets restrictions on the barcode generation as well as possible primer sites, to avoid di-nucleotides of the same state at junctions and the potential for dimerization. Another important method-specific restriction is the total length of the fragment to be sequenced: this is set automatically based on published sequencing lengths for different sequencing methods, but the user can change this by specifying “my adaptor” and a desired de novo length.
All user constraints are input on a simple web form (Figure ), and method-specific constraints are automatically enforced based on user information. The resulting primer-constructs are presented in a table format that easily can be transferred to a spreadsheet or database. Up to 5 alternative primer-construct pairs will be presented. In addition, appropriate barcodes will be listed for each primer-construct pair. Such DNA-based barcodes (aka. “tags”) can be used to identify different sample sources, e.g., different patients, longitudinal samples, tissue compartments, etc. This table of primer-construct pairs may become very large, as 1,000's of bar codes can be generated for use with a primer-construct. Importantly, this also makes it possible for a user to design template-specific barcodes [16
] and to investigate and predict problems related to barcode misreading or loss of sequence fragments due to barcodes acting as alternative primers.
Figure 2 Input web page for PrimerDesign. Alignments can be either pasted into the input window or uploaded as a file from the user’s computer. A ‘sample input’ is provided so users can test run the software and familiarize themselves with (more ...)
Entropy informs about primer locations
In order to find good locations for the primers within the alignment, the algorithm first finds the nucleotide frequencies at each position of the alignment. It then excludes nucleotides whose frequency at any position falls below a desired detection limit d
(given by the user), and calculates a multi-state consensus character (IUPAC code) representative of the remaining nucleotides at each position. The remaining variability is then estimated by the Shannon entropy for each site [25
]. Next, the locations of all forward-reverse primer pairs are found around the region of interest (ROI) with primer lengths between minimum and maximum primer length (given by the user) sorted by their entropy scores. Lower entropies are preferred. These potential primer pairs are then filtered by Tm
and dimerization risk.
Estimating Tm for matching to experimental conditions
The primer melting temperature Tm
is estimated by the empirical nearest-neighbor model [27
], assuming a sodium concentration of 0.15 M and primer concentration at 30 μM, and dinucleotide entropies and enthalpies according to published empirical values [27
]. Because the consensus sequence may contain multistate characters (aka. degenerate sites), each such site is deconvoluted into possible A, C, G, and T characters and all possible primer sequences are generated. The complexity measures how many individual primers composed of only A, C, G, and T characters exist for a given primer (and across the entire alignment). For each of the deconvoluted primers the Tm
is calculated, and the average is presented for each potential multistate-containing primer. If the difference in Tm
between a forward and a reverse primer is within the maximum allowable limit (given by the user), then that primer-pair is appended to the list of potential constructs and sent to the dimerization risk estimation.
Bio-barcode tags label sequences
The user can optimize the bio-barcode tags based on 1) a desired number of unique tags, 2) a certain length of the tags, or 3) a minimum edit distance. Currently, we have restricted the number of tags to 200,000 and a tag length of 18 nucleotides. The edit distance (Levenshtein distance), i.e., the possible minimum number of “mutations” required to modify one tag into another depends on tag length, can be selected to be up to 10 for long tags. A higher edit distance makes downstream bioinformatic sorting more robust, translating into fewer lost sequences due to ambigious reads, and also fewer misclassifications of barcodes. Figure shows the current limitations on the number of possible tags for different lengths and edit distances. Tag numbers grow exponentially as a function of length, and at lower numbers stochastic limitations due to starting values influence the actual number of possible tags.
Figure 3 Current limitations in the number of possible bio-barcode tags for different lengths and edit distances. These bio-barcodes are stored in a database for fast selection according to user-specified criteria. Two examples (dashed lines) are indicated, representing (more ...)
To avoid time-consuming generation of barcodes every time the software is run we have constructed prior lists that PrimerDesign accesses during the primer-tag-adaptor design. There are two sets of potential barcodes. The first set contains all tags of lengths between 3 and 18 nucleotides. This list was generated recursively. Thus, for tag length n
, nucleotides were appended to the tags of length n
-1. During the generation of the tags, no tags with repeated nucleotides at adjacent sites were allowed as this is known to cause potential misreading in pyrosequencing [29
]. The second set of tags was derived from the first set where tags were filtered out when the edit distance was less than a user-specified minimum. All sets with minimum edit distances of 3–10 were pre-calculated. In detail, to effectively generate the second set the algorithm checks for the longest common substring of two tags. If the difference between the tag length and the common-substring length was less than the specified edit distance, then the Levenshtein distance between the two tags must also be less than the edit distance. Because the lists are recursive, this procedure led to a great reduction of the number of calculations for longer tag lengths.
Automated dimerization risk filters design constructs
Because primer-dimers may cause serious problems in experiments, we search for potential dimerization risk between all primer constructs (primer-tag-adaptor oligomers) that will be in the same reaction. Both potential homo-dimer (e.g., hairpins) and hetero-dimer structures are investigated. This algorithmic step can become very computationally intensive, especially if large numbers of tags and/or a high degree of complexity are included in the design. Dimerization risk is evaluated in two ways (Figure ): 1) in a user-defined window moved along all potential primer-primer interactions, and 2) as a user-specified fraction of bonds in an interaction. To reduce the software running time we have removed redundant calculations by hierarchically dividing and testing the different parts of the primer-tag-adaptor constructs. Specifically, the first step is to check for dimers amongst only the adaptors, since all constructs in one reaction have the same adaptors. The second step involves estimating dimer risk in the additional adaptor-primer parts since each primer set will have the same forward-reverse primers. Finally, the third step involves the remaining adaptor-primer-tag parts, as each construct in each primer set will have a different tag. This allows estimating the dimerization risk early, thus reducing the number of calculations and potentially long algorithmic loops that otherwise would slow down the design process. Potential primer-dimers detected during the design are logged in a file that can be downloaded for quality control analyses.
Figure 4 Prediction of dimerization risk. The dimerization risk is assessed by virtually sliding one primer (A) across another (B), reversed. Each potential overlap (1…n…N) is examined by i) a sliding window that tests if the two primers have an (more ...)
Hence, all unique primer constructs are checked against each other, and if an interaction shows more than the user-specified number or ratio of matches, then that set is discarded. The default values (window
0.9) are conservative values based on recently published estimates of dimerization risk [31
PrimerDesign suggests primer pairs
Based on user specifications, minimization of sequence entropy and complexity, Tm-matching, and checking for potential dimerization risks, the five best primer pair constructs are identified graphically in a plot that corresponds to the user-provided alignment on which the design was based (Figure ). The ROI is marked (in grey) and the five best forward and reverse primer pairs are indicated on either side as arrows. The graph also shows the Shannon entropy (in this example for a hypothetical primer 20 bp long), the complexity as defined by the number of degenerate sites (informed by d), and the estimated mean Tm. These curves allow the user to interact with the design, e.g., to identify primer sites with less complexity that would be created if the restriction of G/C in the 3′-end were relaxed. If desired, the user can go back and reevaluate the original design. For instance, Illumina’s error rate rises along the length of its reads; therefore, it may not be desirable to place the region of interest at the end of the sequencing read, which experiences the highest error rate. Of course, this can be controlled by the user, by simply reducing the expected read length, therefore placing the primers closer to the ROI.
Figure 5 Example output web page from PrimerDesign. The output includes a list of the design parameters specified by the user, a graphical illustration showing primer locations (arrows), the ROI (in grey), and curves for entropy, complexity and Tm, followed by (more ...)
In addition to the graphical output, we also provide a simple table with the primer constructs and Tm, which can be used for primer synthesis and experimental setup, and a log file with additional quality control and iteration data.
The execution speed of the program depends on a number of factors, particularly the size of the input multiple alignment, the level of genetic variation, and the number of barcodes requested. The level of genetic variation affects several steps; as the variability increases, the number of degenerate sites may also increase, which adds to 1) the complexity and the number of primers to average Tm over, and 2) the number of potential primer-dimers. The level of complexity to consider is adjusted by the user-specified detection limit d. It is noteworthy in this context that adjusting target Tm, and the allowable matching intervals can make primers designed with less complexity/lower detection limit still be sufficiently promiscuous to amplify all genetic variants in the target population. At the web site we provide a “sample input”; this imports an alignment and several parameter values, with no adaptors, de novo length of 200 bp, and no barcodes; it finds optimal primers in about 25 seconds, depending on server load and internet speed. When calculations take long times, i.e., when web browsers may time out before the design is completed, we provide an email notification option. This sends an email to the user with a link to the results.
Primers used in successful 454 sequencing of HIV-1 populations
This PrimerDesign algorithm was used to design primers to investigate CD8 T-cell driven HIV-1 immune escape over time, starting with acute infection samples from 3 human patients [9
]. We were also interested in using 454 pyrosequencing to better estimate the number of HIV-1 strains that established the primary infection in these subjects [9
]. The env
V3 region is a highly variable region of the HIV-1 genome, and was therefore, as in a previous HIV-1 deep-sequencing study of long-term infections [14
], sequenced as a reference region in all study subjects to examine non-T-cell-selected diversification. Thus for each patient the ROIs were chosen to encompass the env
V3 region, along with additional regions spanning one or two verified CD8+ T cell epitope(s) recognized early in infection. These regions were located in nef
in one subject, rev
in the second, and env
in the third [9
]. We were interested in detecting the emergence of low frequency variants in each patient, and tracking the progression of immune escape over time. In addition to the 3 acutely infected patients, we had access to one sample from a chronically infected donor who was expected to have a very diverse virus population. Based on alignments of previous Sanger sequences from these 4 patients [32
], the previous 454 study of chronic patients [14
], and additional sequences from the HIV database (http://www.hiv.lanl.gov
), we designed 30 unique adaptor-tag-primer constructs: 2 patients
2 regions (a T-cell epitope region and the V3 variable region)
4 samples per patient (3 longitudinal samples
16, plus 1 patient
3 regions (2 T-cell epitope region and the V3 variable region)
4 samples per patient (3 longitudinal samples
12, plus 2 regions for the single time-point for the chronically infected donor. Each of these constructs successfully revealed accumulating genetic variation in the investigated regions over time, and extensive within-epitope diversity. The number of sequences retrieved varied in the patients, but was generally similar between regions within the same patient [9
]. Furthermore, the variation found in the HIV-1 sequencing experiments agrees very well with results from Sanger sequencing by single-genome amplification (SGA), with 95% confidence intervals for variant frequency detected by SGA overlapping frequencies from 454, and increased sensitivity to detect rare variants [12
To further investigate PrimerDesign’s performance in identifying effective primers for revealing genetic variation in different regions of the HIV-1 genome, we designed primer sets to provide overlapping short amplicons across the HIV-1 genome of the HIV-infected subject CH40. We aimed at finding 454 sequencing primers every 250 bp that would work in both directions, using full-length genome sequences obtained from this subject as the an input alignment for primer design [12
]. As the expected read-length was about 500 bp, this resulted in a tiled primer design of 30
30 primers. Figure shows the aligned 454 sequences covering the entire HIV-1 genome. Critically, most primers generated similar numbers of de novo
sequences, hence displaying comparable levels of genetic detail across the genome. A few primers around genome positions 6500–7000 gave more data, interestingly in both directions, and none of the primers failed. Because the genetic variation of HIV-1 is highly non-uniform across the genome, this shows that our PrimerDesign algorithm can find good primers regardless of whether a highly variable gene (such as env
) or a less variable gene (such as pol
) was the ROI, and that the actual sequencing depth was comparable across the genome.
Figure 6 A visual representation of a portion of a 1.4-gigabase, 4.5-million-sequence nucleotide alignment, derived from overlapping 454 reads covering nearly the complete HIV-1 genome, from 35 individual PCR amplicons from a single patient at 5 time-points during (more ...)
The generation of primers across the entire HIV-1 genome was done with an in-house semi-automated version of PrimerDesign, but it could be easily done manually using multiple ROIs in separate runs. We may add this as an automatic feature in the future if requested by users, which is customary for LANL HIV database bioinformatic tools. Similarly, we welcome other user requests for further improvements, and we will implement such requests if feasible.
PrimerDesign compared to other software
The biggest strength of PrimerDesign compared to the majority of previous primer design programs is PrimerDesign’s comprehensive approach. Specifically, PrimerDesign uses a multiple alignment to address deep-sequencing needs of primer localization in relatively conserved regions of highly variable DNA targets, allows the integrated design of bio-barcode tags and adaptors, and allows primer-dimer evaluation based on experimental data. Most primer design software tools base their design on a single sequence (e.g., Primer3 [1
]), which could result in selective or biased amplification from a diverse population by inadequate coverage of variants, or inadvertent selection of primers in relatively diverse regions, when conserved regions could be used instead, thereby risking missed coverage due to biological variability in primer regions. Other programs that use multiple alignments (e.g., GeneFisher [5
]) lack the ability to simultaneously evaluate primers appropriately, for instance, by matching Tm’
s and checking for dimerization risk. In these comparisons, it is important to point out that many previous primer design programs have focused on specific needs of certain experimental protocols, and therefore comparisons are difficult. Another advantage with PrimerDesign is its automatic and flexible tag generation, which is becoming more important with the increasing usage of multiplexed and next-generation sequencing. Tag generation, together with the possibility of adding DNA adaptors, and testing the entire construct’s dimerization risks, are to our knowledge unique features of PrimerDesign. Primers designed using PrimerDesign could readily be used in conjunction with the Primer ID strategy, where a random sequence tag is introduced so that each template receives a unique primer ID [16
]. The goal of the primer ID is to minimize errors in estimating sequence frequencies resulting from initial sequence resampling and amplification biases, and template consensus sequences, and to minimize the impact of recombination during amplification and misincorporation/sequencing errors [16
]. The introduction of such tags may introduce some primer issues, as these tags are by definition random, and added at the DNA synthesis stage, and so will not be designed to avoid homopolymer strings, dimerization issues, and some bias may be introduced by the chemistry of synthesis.