The massively parallel pyrosequencing of emulsion PCR (emPCR)-based templates holds great promise for revolutionizing high-throughput sequencing. However, there is concern over the potentially high error rate, particularly for applications that cannot rely on the consensus of large assemblies. Consensus-based projects would also benefit from a lower error rate, as fewer sequences would be required to build a reliable consensus. The original description of the 454 Life Sciences system reported an error rate for shotgun library reads of four bases per hundred nucleotide positions. The test fragment data in that publication had much lower error rates, even though the test fragments contain extensive homopolymers and are designed to be difficult to read correctly. This discrepancy indicates that the basic method of pyrosequencing (luminescence detection and flow intensity resolution) is sound, and suggests that the higher error with experimental data may come from the experimental manipulation of the sequences prior to pyrosequencing. Margulies et al. suggested that this may be due to multiple sequences binding to an individual bead prior to the emPCR amplification, resulting in a heterogeneous amplification pool. The GS20 quality filters will eliminate sequences from beads that contain two highly divergent DNA templates, but the software will attempt to interpret flowgrams from a single bead that contains two similar but non-identical sequences. Unlike shotgun genomic data, V6-tag data may have large numbers of highly similar sequences. It is, therefore, even more important in V6-tag and metagenomic sequencing to remove reads that may result from multi-templated beads.
We conducted an in-depth analysis of experimentally generated GS20 reads by sequencing an amplicon library made from a set of clones of known sequence. We found an error rate (incorrect bases/total number of expected nucleotides) of 0.49%, considerably lower than that reported by Margulies et al. but still higher than they or we found for test fragment data. Significantly, we found that the errors in our experimental reads were not randomly distributed across all reads: 86% of the reads contained no errors, while reads that differed from the reference sequence by more than 4% contained nearly 50% of the errors (Figure ). In contrast, errors were much more randomly distributed in our test fragments, where 50% of the errors were from those fragments that differed by less than 1% from the reference. A multi-templated bead would frequently have multiple bases at a position, which could cause indeterminate flows, with neither base having ample luminescence to register clearly. The convergence of the error rates of the two separate sequencing runs when the reads containing Ns were removed is consistent with multi-templated beads as the primary source of error. The error distribution across reads and the similarity of error rates for reads with no Ns are consistent with a high general accuracy of the pyrosequencing method and poor resolution of a small number of beads with a heterogeneous amplification population.
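The error-rate definition used here (incorrect bases divided by the total number of expected nucleotides) and the two distribution statistics can be sketched as follows. This is a minimal illustration, not the analysis code used in the study: it assumes each read has already been aligned to its reference clone and reduced to an error count and an expected length, and the names `ReadErrors`, `overall_error_rate`, `fraction_error_free`, and `error_share_above` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReadErrors:
    """Error tally for one read, assumed to come from a prior alignment."""
    errors: int        # incorrect bases found in the read
    expected_len: int  # nucleotides expected from the reference sequence


def overall_error_rate(reads):
    """Incorrect bases divided by total expected nucleotides, over all reads."""
    total_errors = sum(r.errors for r in reads)
    total_expected = sum(r.expected_len for r in reads)
    return total_errors / total_expected


def fraction_error_free(reads):
    """Share of reads that contain no errors at all."""
    return sum(1 for r in reads if r.errors == 0) / len(reads)


def error_share_above(reads, threshold):
    """Share of all errors contributed by reads whose own error rate exceeds threshold."""
    total_errors = sum(r.errors for r in reads)
    if total_errors == 0:
        return 0.0
    high = sum(r.errors for r in reads if r.errors / r.expected_len > threshold)
    return high / total_errors


# Toy tallies invented for illustration only (not data from the study).
toy_reads = [ReadErrors(0, 100)] * 8 + [ReadErrors(1, 100), ReadErrors(9, 100)]
print(overall_error_rate(toy_reads))
print(fraction_error_free(toy_reads))
print(error_share_above(toy_reads, 0.04))
```

Calling `error_share_above(reads, 0.04)` corresponds to asking what share of the errors comes from reads that differ from the reference by more than 4%; a value far above the share of such reads indicates that errors are concentrated in a few poor reads rather than spread evenly.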
If heterogeneous templates on a single bead were a major contributor to the errors observed in low-quality reads, we would expect a disproportionate number of errors to occur in sequences that correspond to low-abundance templates in the original emPCR reactions: if a low-frequency strand shares a bead with another sequence, the other sequence is likely to be different. In contrast, a high-frequency sequence is more likely to share its bead with an identical sequence. Our data match this pattern. The removal of the bulk of the errors via the removal of reads with ambiguous bases is also consistent with multi-templated beads. All of our reads shared the same proximal primer and would, therefore, sequence with few errors in the primer, even on multi-templated beads. Other experiments with heterogeneous primers may find primer fidelity to be more useful for identifying low-quality reads.
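The frequency argument can be made explicit under a simplifying assumption that is ours and is not stated in the text: templates bind to a bead independently, each with probability equal to its relative abundance $p_i$ in the emPCR reaction. For a bead that carries template $i$ plus one additional template:

```latex
\[
P(\text{second template differs} \mid \text{first is } i) = 1 - p_i ,
\qquad
P(\text{second template is identical} \mid \text{first is } i) = p_i .
\]
```

With illustrative values, a template at $p_i = 0.01$ shares its bead with a different sequence 99% of the time, whereas a template at $p_i = 0.5$ shares it with an identical sequence half of the time, in which case the mixed bead still gives an unambiguous signal.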
A significant decrease in the heterogeneous amplification population (HAP) between the originally reported experiment and ours is likely, given the improvements in the protocol developed by 454 Life Sciences. Unfortunately, this suggests that the largest single source of error in an emPCR-based pyrosequencing experiment may vary from experiment to experiment. The evidence from our two separate sequencing runs, however, is that the removal of reads with Ns is a good surrogate for the removal of reads from multi-templated beads. Nonsynchronized extension of fragments will also produce Ns, and these reads will likewise be culled when reads with Ns are removed. Advances in pyrosequencing that reduce the occurrence of multi-templated beads, reduce nonsynchronized extension, or better identify these errors in the base-calling software could significantly improve the overall accuracy of the technology.
Our analysis of the distribution of the types of error in pyrosequencing of emPCR libraries suggests ways of identifying and removing these HAP-hazards: reads with a disproportionately large number of errors are disproportionately likely to contain ambiguous bases (Ns) and to be aberrantly long or short. Short reads arise from short fragments on emPCR beads, but also, and perhaps more often, from the GS20 software, which successively trims bases presumed to be in error from the end of reads. The more bases trimmed, the more likely the entire read is of dubious quality. These errors may arise from multi-templated beads of similar sequence or from nonsynchronized extension of the templates. Once fragments lose synchrony, they will accumulate successively more errors as the read extends. A reduced threshold for removing reads that have bases trimmed from the end might remove many of these poor-quality short reads. Many sequencing projects cannot judge an appropriate read length, although it is always possible to detect and remove short reads. Reads of aberrant length represented only a small fraction, approximately 1%, of our data, and most of these reads (more than 60%) also contained Ns.
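The two screening criteria described above, ambiguous bases and aberrant length, can be sketched as a simple read filter. This is a minimal illustration under stated assumptions rather than the procedure used in the study: reads are plain sequence strings, the function name `filter_reads` is hypothetical, and the length bounds `min_len` and `max_len` are parameters the user must choose for their own amplicon, since no specific thresholds are given here.

```python
def has_ambiguous_base(seq):
    """True if the read contains at least one ambiguous base call (N)."""
    return "N" in seq.upper()


def filter_reads(reads, min_len=None, max_len=None):
    """Split reads into (kept, removed).

    A read is removed if it contains an N, or if its length falls outside
    the optional bounds. Either bound may be None when a project cannot
    judge an appropriate length on that side.
    """
    kept, removed = [], []
    for seq in reads:
        too_short = min_len is not None and len(seq) < min_len
        too_long = max_len is not None and len(seq) > max_len
        if has_ambiguous_base(seq) or too_short or too_long:
            removed.append(seq)
        else:
            kept.append(seq)
    return kept, removed


# Toy sequences invented for illustration only.
toy = ["ACGTACGT", "ACGNACGT", "ACG", "ACGTACGTACGTACGT"]
kept, removed = filter_reads(toy, min_len=5, max_len=10)
print(kept)
print(removed)
```

Leaving `max_len` as `None` reflects the case where only short reads can be detected reliably; the N filter then does most of the work, in line with the observation that most reads of aberrant length also contained Ns.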