The rapid development of new DNA sequencing technologies is transforming biology by allowing individual investigators to sequence volumes previously requiring a major genome center. Before the development of next-generation sequencing platforms, Sanger biochemistry was the basis of sequencing production. Sanger sequencing, or conventional sequencing has been fine-tuned to achieve read-lengths of up to ~1,000 bp and per-base accuracies as high as 99.999%
[
1]. However, given several bottlenecks in conventional sequencing that restrict its parallelism, the optimization in throughput and cost has reached a plateau. Several alternative sequencing strategies have been proposed in the past several years to reduce sequencing time and cost.
One category of such alternatives is cyclic-array sequencing, referred to as second-generation or next-generation sequencing, which has been made available commercially, and includes products such as the 454 Genome Sequencers (Roche Applied Science), the Illumina (Solexa) platform (Illumina), the SOLiD platform (Applied Biosystems) and the HeliScope Single Molecule Sequencer (Helicos). Because of the much higher degree of parallelism and much smaller reaction volumes, next-generation sequencing achieves much higher throughput with dramatically lower cost. The disadvantages are shorter reads and higher error rates compared to Sanger sequencing. Next-generation sequencing has been applied in many areas of biology, including quantification of gene expression and alternative splicing, polymorphism and mutation discovery, microRNA profiling, and genome-wide mapping of protein-DNA interactions. A detailed review of sequencing technologies can be found in
[
1].
Aware of the large impact of sequencing quality on downstream analysis, several groups have attempted to detect, quantify and understand errors that arise from next-generation sequencing pipelines. At the nucleotide level, base-calling algorithms developed by manufacturers (for example, Bustard by Illumina) and independent investigators
[
2-
5] provide per-base phred-like
[
6] quality scores as a byproduct. These methods require fluorescence intensity measurements from sequencing runs as input, and the per-base quality scores produced need to be further summarized to assess sequencing quality at higher levels. At the technology level, there have been efforts to characterize error patterns associated with different platforms, which include Dohm
et al. and Hansen
et al.
[
7,
8] for the Illumina platform, and Huse
et al.
[
9] for the 454 Genome Sequencers. These studies are very important in facilitating our understanding of the quality characteristics, however, they do not provide methods to assess the quality of sequencing data that are produced everyday in individual laboratories.
Sequencing quality also needs to be evaluated and analyzed in light of different applications. For example, Bullard
et al.
[
10] conducted a study of statistical methods for normalization and differential expression (DE) analysis of Illumina transcriptome sequencing data. They evaluated how DE results are affected by varying gene lengths, base-calling calibration methods, and library preparation effects. They obtained the number of uniquely mapped reads with 0 (U0), 1 (U1) or 2 (U2) mismatches and used the ratio (U1+U2)/(U0+U1+U2) to estimate the per-read sequencing error rate, which is the proportion of reads that contain at least one error. We refer to this method of counting the number of mismatches to the reference genome as the
mismatch counting method. This builds on the assumption that the perfect-matching reads contain no errors and that U1 and U2 contain sequencing errors but not SNPs. It also requires mapping to a reference genome, a step that may be problematic in applications such as threat detection, as the genome sequence of interest may not be known.
Tools also exist to correct errors from next generation sequencers
[
11-
17] and error rates can be estimated as a by-product. These methods are based on
k-mer or substring frequencies, or finding overlaps between reads, which are very computationally intensive, require a large amount of memory, and are difficult to work with large genomes.
In this work we develop a simple and efficient method to estimate the error rates in any sequencing pipelines based on the observation that there is a linear relationship between the number of reads sequenced and the number of reads containing errors. We refer to this proposed method as shadow regression, and show that it works well in applications with moderate to high sequencing depth, such as mRNA-seq, re-sequencing and SAGE.
As with any high throughput experiments, it is important to monitor the quality of next-generation sequencing data at the level of individual experiments. Currently the percentage of reads mapped is used as a quality indicator but it does not directly address the fundamental question of how much error is present in the reads obtained from a sample. A reference genome may not be available at all. Even if the reference genome is available, we show, using simulated data and real data on PhiX, that mapping reads to the reference genome can introduce biases even at relatively modest polymorphism rates. Furthermore, having an estimate of the error rates gives one the opportunity to improve analyses and inferences in many applications of next-generation sequencing data. For example, error rates are useful in understanding the fidelity of SNP or mutation calls as a function of coverage.