The degenerate primer design pipeline was used to design primers for multiple viral projects across 15 different DNA and RNA viruses, including both segmented and non-segmented viruses. The results for 8 different non-segmented viruses are presented here. These genomes ranged in size from 9 kb to 15.6 kb and varied in their degree of sampled genomic sequence variation and in GC content. The summary statistics for the primers designed for the 8 non-segmented viral genomes using the two PCR protocols, described as the “standard” and “high GC” protocols, are presented in Table . Both protocols are described briefly in the Materials and Methods section.
Summary statistics for targeted viruses and sequencing results
Table contains the actual sequencing success rates for each targeted viral consensus sequence and the relevant information regarding each consensus sequence’s construction. The median amplicon coverage for all viruses was approximately 3x; however, in many cases the ends of the genomes had a lower minimum coverage (i.e., 1x or 2x) because primers could not be selected beyond the end of the available sequence. Dynamic tiling produced an even distribution of amplicon coverage across all genomes.
Primers designed using the standard protocol targeted organisms with less than 50% GC content. These included the human parainfluenza virus (HPIV-1 and HPIV-3), measles virus (MeV), mumps virus (MuV), human respiratory syncytial virus (HRSV-A and HRSV-B) and human metapneumovirus (HMPV-A and HMPV-B). For organisms with GC content exceeding 50%, e.g., rubella virus (RUBV-1 and RUBV-2), the high GC protocol was used.
Consensus sequence construction
Amplicons were designed to cover the entire genome for all viruses, so only full-length complete sequences available from NCBI’s Viral Genomes [5] were used to generate the consensus sequences. The number of sequences used to generate the consensus sequence and the percent of degenerate bases across the constructed consensus sequences are described in Table . The number of sequences collected to construct each consensus sequence ranged from 3 (for HRSV-A and HRSV-B) to 32 (for MuV).
For each viral type, a single consensus file was generated with a target of less than 10% degenerate bases. With two exceptions, the resultant percent ambiguity across the constructed consensus sequence ranged from 4.12% for HRSV-B to 9.28% for MuV. If a single consensus could not be generated with less than 10% ambiguity, multiple consensus sequences were constructed based on sequence similarity-based clustering results. The two exceptions were measles virus (MeV) and rubella virus (RUBV-G2) which had percent ambiguities exceeding 10%.
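The percent ambiguity figures above can be illustrated with a minimal sketch. It assumes the input sequences are already aligned to equal length and uses standard IUPAC ambiguity codes; the function names and the `min_freq` filter are hypothetical and are not taken from the pipeline itself.

```python
from collections import Counter

# IUPAC nucleotide codes, keyed by the set of bases seen in an alignment column.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def degenerate_consensus(aligned, min_freq=0.0):
    """Build an IUPAC consensus from equal-length aligned sequences.

    min_freq is a hypothetical filter: alleles whose frequency in a column
    is below it are dropped before the ambiguity code is chosen.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    consensus = []
    for column in zip(*aligned):
        # Count only unambiguous bases; gaps and other symbols are ignored.
        counts = Counter(b for b in (c.upper() for c in column) if b in "ACGT")
        total = sum(counts.values())
        if total == 0:
            consensus.append("-")
            continue
        kept = {b for b, n in counts.items() if n / total >= min_freq}
        if not kept:
            # Every allele fell below the filter: keep the most common one.
            kept = {counts.most_common(1)[0][0]}
        consensus.append(IUPAC[frozenset(kept)])
    return "".join(consensus)


def percent_ambiguity(consensus):
    """Percent of non-gap consensus positions that carry a degenerate code."""
    bases = [b for b in consensus if b != "-"]
    degenerate = sum(b not in "ACGT" for b in bases)
    return 100.0 * degenerate / len(bases)
```

With five input sequences, an allele seen once has a frequency of 0.2, so any `min_freq` at or below 0.2 retains it in the consensus, while a higher value filters it out.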
For measles virus, the allele frequencies for multi-allelic positions were not filtered to remove the less dominant allele frequencies, even when the percentage of degenerate bases across the constructed consensus sequence exceeded 10%. Even though the less dominant allele frequencies were below the expected threshold for sequencing error, because the percent degeneracy across the consensus sequence was very close to 10%, it was decided that stratifying the sequences and generating multiple consensus sequences may not be cost effective. The total success rate was not expected to be significantly impacted.
For the rubella virus, the input sequences were stratified into two genotypes because the total percentage of degenerate bases in the initially constructed consensus sequence was 21% (see Figure ). After computational stratification, two sets of sequences were used as input to generate consensus sequences for genotype 1 (G1) and genotype 2 (G2), based on 11 and 5 sequences, respectively. This reduced the percent degeneracy across the consensus sequence to 6.07% for G1 and 13.13% for G2. G2 contained many multi-allelic positions that were not filtered. At an allelic frequency of 20% (1 out of 5), it was not clear whether these should be attributed to sequencing error within the input sequences, so the variations were retained as a conservative choice. Degenerate primers were computed independently for the G1 and G2 consensus sequences, and primer pairs that were redundant between the two design runs were then removed.
Figure 1 Dendrogram representing the relationship between sequenced Rubella Virus (RUBV) genomes. The Rubella sequences were divided into two groups, RUBV-G1 (green) and RUBV-G2 (blue), to reduce the percent ambiguity across their consensus sequences.
For human respiratory syncytial virus (HRSV), separate sets of primers were designed for HRSV-A and HRSV-B since there was more than 20% sequence dissimilarity across the entire genome with 24% ambiguity (see Figure ). Splitting the sequences into two subsets ensured that the sequences in each subset were less than 10% dissimilar. The sequences in the subset HRSV-A differed by 5.5% and those in HRSV-B differed by 4.5%, with a percent ambiguity of 6.8% for HRSV-A and 4.1% for HRSV-B.
Figure 2 Dendrogram based on sequence similarity for Human respiratory syncytial virus (HRSV) genome. The 6 HRSV genomes were divided based on sequence similarity into two clades, HRSV-A (blue) and HRSV-B (green). Two consensus sequences were constructed.
For human metapneumovirus (HMPV), the sequences formed two distinct clades that were approximately 20% dissimilar from each other. The initial sequences were therefore split into two clades, HMPV-A with 9.25% ambiguity and HMPV-B with 7.95% ambiguity. Separate primer pairs were then designed for each clade independently.
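The stratification of RUBV, HRSV and HMPV into subsets can be sketched as follows. This is a simplified stand-in, not the clustering used to produce the dendrograms: it assumes pre-aligned sequences and groups them by single linkage at a fixed threshold. The function names and the 10% default are illustrative assumptions.

```python
from itertools import combinations


def percent_dissimilarity(a, b):
    """Percent of aligned, non-gap positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable positions")
    return 100.0 * sum(x != y for x, y in pairs) / len(pairs)


def single_linkage_groups(seqs, threshold=10.0):
    """Group sequence names so that linked pairs differ by less than threshold.

    seqs maps a name to an aligned sequence. Two names share a group when a
    chain of pairwise links, each below the threshold, connects them.
    """
    parent = {name: name for name in seqs}

    def find(name):
        # Follow parent pointers to the group root, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(seqs, 2):
        if percent_dissimilarity(seqs[a], seqs[b]) < threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())
```

Single linkage does not guarantee that every pair inside a group falls under the threshold, so the within-group dissimilarity and the resulting consensus ambiguity should still be checked after splitting.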
For all viral genomes, amplicons were designed with an intended coverage depth of 2x or 3x, with the exception of HMPV-A and HMPV-B, which had a coverage depth of only 1x. Multiple depths of coverage increase the likelihood of successfully amplifying the targeted region at least once, because when a second pair of candidate primers is selected, the primer design pipeline ensures that the second primer pair will not overlap any of the previously selected primer pairs. In a high-throughput sequencing environment, all wells in a plate are processed, so additional amplicon coverage was generated if any empty wells remained. If the number of amplicons designed exceeded the number of wells on a plate, the amplicons were computationally reduced to a smaller subset in which every targeted region was still covered by at least one amplicon. Projects also differed in their sequencing requirements, so amplicon numbers and coverage depth varied by virus type.
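The text does not state how the amplicon set was reduced. One way to do it is a standard greedy interval cover, sketched below. Amplicons are assumed to be half-open `(start, end)` coordinate pairs on the target, and the function name is hypothetical.

```python
def reduce_to_single_coverage(amplicons, target_start, target_end):
    """Pick a small subset of amplicons covering [target_start, target_end).

    Greedy rule: among amplicons that start at or before the first uncovered
    position, keep the one that extends furthest to the right.
    """
    remaining = sorted(amplicons)
    chosen = []
    covered_to = target_start
    i = 0
    while covered_to < target_end:
        best = None
        # Consume every amplicon that could extend the current coverage.
        while i < len(remaining) and remaining[i][0] <= covered_to:
            if best is None or remaining[i][1] > best[1]:
                best = remaining[i]
            i += 1
        if best is None or best[1] <= covered_to:
            raise ValueError(f"no amplicon covers position {covered_to}")
        chosen.append(best)
        covered_to = best[1]
    return chosen
```

For example, six overlapping amplicons tiled across a 1,500-base target at roughly 2x depth reduce to three amplicons that still cover every position once. A gap in the input raises an error instead of silently leaving a region uncovered.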
Calculating success rates
Determining the success of degenerate primer design is more complicated than for standard non-degenerate primer design. Not only does the degenerate primer design algorithm need to accurately model and predict the outcome of PCR with a heterogeneous population of primer pairs, but the constructed degenerate consensus sequence also needs to sufficiently represent the targeted genome’s population. The latter can be detrimentally affected by a lack of available sequence information, or by heavily biased sequencing that favours strains studied by one or a few laboratories.
The success rates provided in Table represent the average success rates for each virus type as a percentage of all sequencing reactions performed. However, because the effect of the isolate is confounded within the average success rate, a graded approach was taken to gain a deeper understanding of primer pair success across isolates. This was necessary because a primer pair that succeeded in only a low percentage of samples may have failed because the primers did not match the genotype of the isolate, rather than because the primer design algorithm selected a poor primer pair.
For each primer pair and isolate, sequencing was performed in both the forward and reverse directions. If a sequence was recovered for an isolate and primer pair (under standard expectations of high quality values, length, etc.), then that sequencing direction was considered a success. Success rates were tallied for the forward and reverse directions and then averaged across both. These per-primer-pair success rates were then graded at cutoffs of greater than 25, 50, 75, 85 and 90 percent of isolates (see Table ). This cumulative, graded approach made it possible to distinguish amplification failures that were isolate specific from those that were target region specific.
Note that these are actual success rates calculated from laboratory experiments, not predicted success rates based on in silico computations. Furthermore, because of the fairly uniform amplicon coverage across all genomes (e.g., Figure ), the reported success rates should not be inflated by any redundant stacking of amplicons over easy-to-amplify loci. Table contains the forward, reverse, and average success rates for each viral consensus sequence. For each directional set of statistics, the sequencing success rate was computed across all isolates. For example, for HRSV-A, 94% of all primers designed were successfully sequenced in greater than 75% of the isolates.
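The graded tally can be expressed as a short sketch. The data layout (one `(forward_ok, reverse_ok)` tuple per isolate for each primer pair) and the function names are assumptions made for illustration. Only the cutoffs come from the text.

```python
# Cutoffs, in percent of isolates, as listed in the text.
CUTOFFS = (25, 50, 75, 85, 90)


def pair_rates(results):
    """Per-primer-pair success rates, in percent of isolates.

    results maps a primer pair name to a list of (forward_ok, reverse_ok)
    booleans, one tuple per isolate (hypothetical layout).
    """
    rates = {}
    for pair, outcomes in results.items():
        n = len(outcomes)
        forward = 100.0 * sum(f for f, _ in outcomes) / n
        reverse = 100.0 * sum(r for _, r in outcomes) / n
        rates[pair] = {
            "forward": forward,
            "reverse": reverse,
            "average": (forward + reverse) / 2,
        }
    return rates


def graded(rates, direction, cutoffs=CUTOFFS):
    """Percent of primer pairs whose success rate is strictly above each cutoff."""
    values = [r[direction] for r in rates.values()]
    return {c: 100.0 * sum(v > c for v in values) / len(values) for c in cutoffs}
```

Read this way, a statement such as "94% at greater than 75% of isolates" corresponds to `graded(rates, "average")[75]` returning 94.0 for that virus.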
Designed primer pair success statistics by cumulative isolate success rate
Figure 3 Dynamically tiled layout of the amplicons across the targeted genome. This is an example of the tiling output from a single primer design run performed on RUBV-G1. Each green rectangle represents an amplicon. The long black bar represents the targeted ...
To separate out PCR failure bias due to the GC content of the genome alone, the overall success rates of primer pairs were calculated separately for the high GC and standard protocols. For the standard protocol, the overall forward, reverse and averaged sequencing success rates of primer pairs were 83%, 82% and 82%, respectively, for over 95% of the isolates. However, at greater than 75% of the isolates, there was a consistent 95% primer pair success rate. For mumps virus, most of the primer failures were recorded in the V/P protein region, which undergoes RNA editing (insertion of non-templated G residues), and in the variable region of the SH gene. A consistent pattern in the plate location of the primer failures was also observed, which may reflect laboratory conditions rather than the design of the primers.
For the high GC protocol, all three sequencing success rates of primer pairs were lower than expected. For greater than 75% of isolates, success rates were 77%, 73% and 70% for the overall forward, reverse and total sequencing success rate, respectively. There did not appear to be a relationship between the percentage of ambiguity in the consensus sequence and the success rates of the designed primers (simple linear regression, adjusted R² = 0.167). This lack of correlation could result from the degenerate consensus sequences being built from too few sequences to represent the isolates accurately.
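For reference, the adjusted R² of a simple linear regression with one predictor can be computed as in the sketch below. The function name is hypothetical, and the study data are not reproduced here.

```python
def adjusted_r_squared(x, y):
    """Adjusted R^2 for a least-squares fit of y on a single predictor x."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    r_squared = sxy * sxy / (sxx * syy)
    # With one predictor, the residual degrees of freedom are n - 2.
    return 1 - (1 - r_squared) * (n - 1) / (n - 2)
```

Here `x` would hold the percent ambiguity of each consensus sequence and `y` the corresponding primer success rate. With so few consensus sequences, the adjustment for sample size is substantial.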
Designed primer pair sequences and success rates are available as Additional file 1