High-throughput sequencing involves sequencing millions of DNA fragments in parallel. Generally, these fragments are sequenced one base at a time, and, at each step or cycle, the current base is determined through fluorescent detection. For a review, see Holt and Jones [11]. Although sequencing platform chemistries differ, in all cases care must be taken to avoid introducing bias at this early stage.
On the Illumina Genome Analyzer platform, base-call errors are not randomly distributed across the cycle positions in sequenced reads [12]. Although not as extensively studied, similar biases have been observed on other sequencing platforms, and low-level signal correction methods have been developed for them [13].
Incorrect base calls can have a deleterious impact downstream, both in aligning reads to the reference genome (resulting in fewer or incorrect alignments) and in variant detection (contributing to false-positive variant calls). In experiments aimed at detecting variants in genomic DNA, concern about false positives may lead researchers to employ stringent filtering criteria. Many researchers hypothesize that the discovery of rare variants will be a crucial next step in understanding the genetic causes of complex diseases [14], and overly strict filtering criteria may eliminate exactly the variants of most interest and impact. Improving the quality of nucleotide calls, either through better base calling or through error correction, will make more accurate variant calls possible.
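As a minimal illustration of what a cycle-dependent error bias looks like in practice, the sketch below tabulates the fraction of mismatched base calls at each cycle position. It assumes gap-free alignments in which the true reference sequence for every read is known; the function name `per_cycle_error_rate` and the toy reads are hypothetical and are not taken from any of the cited tools.

```python
def per_cycle_error_rate(reads, references):
    """Return the fraction of mismatched calls at each cycle position.

    reads[i] is a called read and references[i] is the reference sequence
    it aligned to (assumed gap-free and in the same orientation).
    """
    n_cycles = max(len(read) for read in reads)
    mismatches = [0] * n_cycles
    totals = [0] * n_cycles
    for read, ref in zip(reads, references):
        for cycle, (called, truth) in enumerate(zip(read, ref)):
            totals[cycle] += 1
            if called != truth:
                mismatches[cycle] += 1
    # Cycles with no observations are reported as 0.0 rather than dividing by zero.
    return [m / t if t else 0.0 for m, t in zip(mismatches, totals)]


# Toy data (hypothetical): four reads from the same locus, with errors
# concentrated in the later cycles.
toy_refs = ["ACGTACGT"] * 4
toy_reads = ["ACGTACGT", "ACGTACGA", "ACGTACCT", "ACGTTCGA"]

rates = per_cycle_error_rate(toy_reads, toy_refs)
for cycle, rate in enumerate(rates, start=1):
    print(f"cycle {cycle}: error rate {rate:.2f}")
```

If errors were randomly distributed, the per-cycle rates would be roughly flat; in the toy data they rise toward the final cycles, which is the kind of pattern that a single uniform filtering threshold handles poorly.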
Alternative base-calling methods that reduce the cycle-related bias in error rates have been developed (Figure 1) [15]. Numerous error correction methods have also been developed to remove errors from reads after base calls have been made [17]. Because base calling requires the raw intensity files, which many laboratories never receive from sequencing centers, re-calling bases is logistically burdensome, and error correction provides a potential alternative.
Figure 1. Effect of base-calling improvements on error bias. This figure is based on figures from Bravo and Irizarry. Choosing a site that was a false-positive variant as determined by MAQ, the authors examined the pattern of nucleotide calls according ...