PCR is used in many NGS workflows but has the potential to increase false positive and false negative allele calls. False positive allele calls result from nucleotide misincorporations that occur during the early cycles of PCR. False negative allele calls result from unequal amplification of two alleles. This situation is exacerbated by low template concentration.
We found that different input masses into an iPCR reaction resulted in similar numbers of reads. This result is an artefact of the Roche 454 library preparation, because samples are added to emulsion PCR reactions at different concentrations to maximize the number of reads (GS FLX Titanium emPCR Method Manual). Despite this limitation, counters were sensitive to input copy number, and the number of returned counters had a linear relationship with the input mass. Since the generation and ligation of DBRs is random, we had to use probabilistic methods to infer the actual number of molecules sequenced. This analysis shows that the number of observed counters is most likely equal to the number of input molecules when it is lower than the square root of the number of possible counters. Above this threshold, the probability of two molecules being tagged with the same DBR becomes large enough to make the relationship non-linear. A maximum likelihood estimate of the number of input molecules can still be inferred until the counters are saturated, at which point all counter sequences are observed. To eliminate the effects of saturation, the degeneracy and number of bases included in the DBR can be altered to provide a greater number of potential counter sequences. This approach can, at least partially, overcome the problem of collisions when the number of molecules of the same type input into a PCR reaction is high. However, for many applications it is only necessary to quantify low numbers of template molecules, where miscalls can occur, in which case the number of DBRs can be set appropriately.
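The collision argument above can be illustrated with a minimal sketch of the classical occupancy (birthday problem) model. It assumes that every molecule is tagged uniformly at random and independently, and that every tagged molecule is observed. The function names and the counter space of 4^8 sequences are hypothetical, chosen for illustration only, and the estimator shown simply inverts the expected number of distinct counters; it is not necessarily the maximum likelihood procedure used in this study.

```python
import math


def expected_distinct(n_molecules, n_counters):
    """Expected number of distinct counters seen when n_molecules are
    each tagged uniformly at random from n_counters possible sequences."""
    return n_counters * (1.0 - (1.0 - 1.0 / n_counters) ** n_molecules)


def collision_probability(n_molecules, n_counters):
    """Exact probability that at least two molecules share a counter."""
    if n_molecules > n_counters:
        return 1.0
    # log of P(no collision) = sum of log(1 - i/N) for i = 0 .. n-1
    log_p_none = sum(math.log1p(-i / n_counters) for i in range(n_molecules))
    return 1.0 - math.exp(log_p_none)


def estimate_molecules(observed, n_counters):
    """Estimate the number of input molecules from the number of distinct
    counters observed, by inverting expected_distinct()."""
    if observed >= n_counters:
        raise ValueError("counters are saturated; no finite estimate")
    return math.log1p(-observed / n_counters) / math.log1p(-1.0 / n_counters)


if __name__ == "__main__":
    N_COUNTERS = 4 ** 8  # hypothetical counter space, not the study's design
    for n in (50, 256, 5000, 50000):
        k = expected_distinct(n, N_COUNTERS)
        print(n, round(k, 1), round(collision_probability(n, N_COUNTERS), 3),
              round(estimate_molecules(k, N_COUNTERS), 1))
```

Under this model the observed count tracks the input closely while the input is well below the square root of the counter space, and the correction grows as the counter space fills up.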
Given sufficient sequencing depth, there is a good correlation between the allele frequency in a sample and its estimated allele frequency [this study and (14)]. However, counters improve the estimates of the number of molecules input into the iPCR, particularly at lower sequencing depths. This improves genotyping accuracy, allows us to assign statistical confidence to variant sites and reduces overall sequencing costs. In addition, the counters improve detection of polymerase or sequencing errors and hence reduce false positive variants. For example, simulations in which 10 counters were sampled from the sequencing data show an error rate of 30% when SNPs are called using read numbers, whereas using counter numbers instead reduces this error rate to 0%. The reduction of false positive calls is important because previous studies have not been able to distinguish particular classes of variation, such as insertion or deletion polymorphisms, from sequencing errors (14).
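The difference between counting reads and counting counters can be sketched as follows. This is an illustrative example only: the function name, the input layout (pairs of counter sequence and allele) and the toy data are assumptions, and taking the majority allele for each counter is one simple way of suppressing errors, not necessarily the rule applied in this study.

```python
from collections import Counter, defaultdict


def allele_counts(reads):
    """Return (read_counts, molecule_counts) per allele.

    reads: iterable of (counter_sequence, allele) pairs.
    Reads sharing a counter are treated as copies of one template
    molecule, and that molecule is assigned its majority allele.
    """
    reads = list(reads)
    read_counts = Counter(allele for _, allele in reads)

    # Group the alleles seen for each counter sequence.
    by_counter = defaultdict(Counter)
    for counter_seq, allele in reads:
        by_counter[counter_seq][allele] += 1

    # One vote per counter: the most frequent allele among its reads.
    molecule_counts = Counter(
        alleles.most_common(1)[0][0] for alleles in by_counter.values()
    )
    return read_counts, molecule_counts


if __name__ == "__main__":
    # Hypothetical data: two template molecules, one amplified far more
    # than the other, plus a single erroneous read on the first counter.
    toy = [("ACGT", "A")] * 90 + [("TTGA", "G")] * 9 + [("ACGT", "G")]
    by_reads, by_molecules = allele_counts(toy)
    print(dict(by_reads), dict(by_molecules))
```

In the toy data the read counts suggest a strongly skewed allele ratio, whereas the counter counts recover one molecule per allele and absorb the single discordant read.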
NGS sample preparation kits from major manufacturers including Illumina, Life Technologies and Roche 454 all require adaptor ligation (2, 15, 16). Adaptors that include counter sequences can, therefore, be incorporated into existing protocols at no extra cost in time and little extra cost in adaptors. However, counter sequences can potentially increase the cost of sequencing since the counter sequence itself must be read along with the genomic insert. This is an important issue for short read platforms but can be mitigated by additional index, or barcode, sequencing reads (for example, using Illumina's TruSeq DNA Sample Prep Kits or Life Technologies' SOLiD System barcodes).
The counter sequence presented here is incorporated in the adaptor sequence and is therefore present in the template for PCR amplification. In an alternative approach, a counter sequence could be incorporated in the 5′ tail of a PCR primer sequence. However, at each PCR cycle new counters would be randomly associated with each newly synthesized molecule, thus obscuring the number of template molecules. Instead, a two-step PCR reaction, analogous to multiplex PCR (17, 18), that consists of a limited number of cycles of priming with a counter-containing primer followed by cycles of universal priming could allow accurate counting during PCR.
Multiplex identifiers (MIDs) are commonly designed with error-correcting codes (19). For example, a minimum edit distance of two allows detection of MIDs with a single error. However, the counters described here do not have a minimum edit distance because each base is degenerate. This means that a single polymerase or read error within a DBR can associate a single genomic sequence with different counter sequences, and therefore increase the probability of a false positive allele call. However, this effect can be minimized by careful design of the DBR to remove sequences, such as homopolymers, that are prone to sequencing errors and by discarding DBRs with incorrect base positions (for example, an A at a B position).
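The filtering described above can be sketched as a check of each observed DBR against its degenerate design, using the standard IUPAC nucleotide codes (for example, B permits C, G or T but not A). The design string, the homopolymer limit and the function names below are hypothetical and serve only to illustrate the idea; they are not the DBR design used in this study.

```python
from itertools import groupby

# IUPAC nucleotide codes and the bases each one permits.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches_design(dbr, design):
    """True if every base of dbr is permitted by the IUPAC code at the
    same position of design; length mismatches are rejected."""
    if len(dbr) != len(design):
        return False
    return all(base in IUPAC[code] for base, code in zip(dbr, design))


def longest_homopolymer(seq):
    """Length of the longest run of identical bases in seq."""
    return max((len(list(run)) for _, run in groupby(seq)), default=0)


def keep_counter(dbr, design, max_run=2):
    """Keep a counter only if it fits the design and has no long
    homopolymer run (max_run is an assumed, adjustable limit)."""
    return matches_design(dbr, design) and longest_homopolymer(dbr) <= max_run


if __name__ == "__main__":
    DESIGN = "BDHVBDHV"  # hypothetical degenerate design
    for observed in ("CATACATA", "AATACATA", "CATACAT"):
        print(observed, keep_counter(observed, DESIGN))
```

A read with an A at the first (B) position, or with the wrong length, is discarded, which removes some counters created by polymerase or sequencing errors rather than by ligation.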
Because counters are effective at identifying relative biases, such as allelic bias, they may also prove useful in detecting representational bias of different molecules within a sample. For example, counters could help correct biases caused by GC composition in standard library preparations (1) or copy number variation (20). In addition, a counter attached to molecules by RNA ligation, or during first- or second-strand cDNA synthesis (21), could be used to quantify the relative levels of different transcripts or transcript isoforms, such as those derived from alternative splicing (22–24). Further applications include sequencing of heterogeneous populations such as multiplexed samples (manuscript submitted), viral quasispecies (25), pathogen populations (26), environmental samples (27) and tumour samples, where rare sequence variants, present in a subpopulation of cells, must be distinguished from sequencing errors (28).