As sequencing costs drop, it is becoming cost-effective to sequence even whole genomes to a sufficient depth that random errors become insignificant. However, systematic sequencing errors (SSEs) and biases remain problematic even at high sequencing depths, so recent research has started to focus on understanding these SSEs and biases 
. In this work, we focus on SSEs rather than coverage biases, where SSEs are systematic errors in sample preparation and sequencing processes that cause base call errors to accumulate preferentially at certain base positions in the genome, and coverage biases are biases in the number of reads covering certain genomic regions such as GC-bias 
. Examples of SSEs, as well as random errors, are portrayed in . Compensating for these SSEs is critical for applications in which a variant might be expected to be in only a small fraction of the reads, such as samples containing RNA-editing 
, cancer tissues and circulating tumor cells 
, fetal DNA in mother’s blood 
, mixtures of bacterial strains 
, mitochondrial heteroplasmy 
, mosaic disorders 
, and pooled samples 
. Since the causes of many SSEs are not well understood and may vary due to batch effects in a run-specific manner, compensating for them requires training data sets. The two previously proposed approaches either use a separate data set with special characteristics (e.g., SysCall uses overlapping paired-end reads 
) or use the data set itself excluding regions known to have variants (e.g., Genome Analysis Toolkit, or GATK, base quality score recalibration 
). Here, we combine the advantages of these approaches by using DNA or RNA spike-in standards without homology to almost all biological organisms.
Systematic errors and base quality score recalibration.
The first approach, SysCall, used a methyl-Seq dataset that had overlapping paired-end reads to detect SSEs depending on sequencing direction for the Illumina sequencer 
. The region in which the reads overlap can be used to find systematic errors that preferentially occur on one DNA strand compared to the other strand. To improve variant calls, the SysCall method uses a separate dataset with overlapping reads to train a logistic regression model that accounts for SSEs correlated with several covariates: (1) the 2 preceding bases + the base in question (each base independently), (2) directionality bias of the errors, the proportion of non-reference reads, and (3) a comparison of the quality scores of the error base to the next base. Most sequencing runs do not contain overlapping paired reads, so SysCall assumes the SSEs in a training data set are the same as the SSEs in other independent sequencing runs without overlapping reads. However, even when comparing SSEs from two sequencing runs on the same sample, significant differences were found between the runs 
. SSEs may vary even more due to constantly evolving sample preparation reagents and protocols, sequencing reagents and technologies, and bioinformatics methods. In this respect, SysCall is at a disadvantage compared to methods that perform run-specific SSE correction, such as GATK.
The second approach, GATK, is a suite of tools used for variant calling based on the map-reduce framework 
. A widely used tool implemented in GATK performs base quality score recalibration (BQSR, see ). BQSR recalibrates base qualities using pre-specified and/or user-defined covariates related to the properties of the read. The covariates in BQSR are factors that are suspected to be correlated with SSEs. The most commonly used covariates implemented in GATK-BQSR are read group, base quality score, machine cycle, and dinucleotide context. BQSR calculates the empirical quality score for each combination of covariates based on the proportion of base call errors observed. Then, it compares the empirical to the reported quality scores, and recalibrates the quality scores of each base in each read depending on its corresponding covariate values. After recalibration, the base quality scores were much more predictive of the empirical quality scores 
One attractive feature of the GATK BQSR package is that it compensates for biases in the machine’s estimation of base calls and their quality scores by lowering the base qualities of SSEs independently in each sample, so that it can correct run- or batch-specific systematic sequencing errors. This appears to contradict established metrological practice for quantity (or numerical) values (see Metrology for “nominal properties”
in Methods S1
), where results are corrected for bias, and measurement uncertainty adjusted for propagation of error arising from the bias correction. In the absence of mature metrological guidance for nominal properties (such as sequence), adjustment of the quality score is a practical solution with the desirable property of straightforward propagation into bioinformatics pipelines such as variant calling.
In order to find run-specific SSEs, GATK BQSR typically assumes that the sample contains neither mis-mapped/mis-aligned reads nor variants (except at the bases defined by dbSNP). Since many rare variants are not in dbSNP and some SSEs might be in dbSNP 
, BQSR will count many rare variants as SSEs and may miss some SSEs that are included in dbSNP (see ), resulting in a bias against variants not in dbSNP. This problem is compounded for the many non-human species for which comprehensive SNP databases are not available, so many real variants will be counted as SSEs for these species. For RNA-seq, recalibrating based on dbSNP will also be biased against detecting any possible RNA editing that occurs at non-dbSNP locations. For DNA and RNA sequencing, certain motifs that are highly mutable (e.g., CpG dinucleotides) are likely to receive lower recalibrated quality scores due to incomplete dbSNP databases, resulting in variant bases receiving lower weight in the precise locations where real variants are more likely to be found. In addition, mapping errors due to closely related sequences or complex variants could cause both false positive SSEs when accurately sequenced bases are incorrectly mapped to mismatching bases, as well as false negative SSEs when inaccurately sequenced bases are incorrectly mapped to matching bases.
A few previous methods have used external spike-ins to calibrate base or variant calls. The Ibis base caller for Illumina uses the PhiX spike-in provided by Illumina to achieve more accurate base calls and quality scores by correcting machine cycle biases 
. This method has the advantage of using the raw fluorescent intensity values in its base calling method, but the small size of the PhiX genome (≈5 kb) limits the number of covariates that can be calibrated. Another approach used short 1–2 kb external spike-in plasmids with known indels to both recalibrate quality scores and tune their parameters for detecting rare indels in large cohorts 
. Their algorithm, called SPLINTER, used the base and the two previous bases to calibrate error rates, but they did not investigate whether the length of their short plasmids was sufficient for statistically robust calibration.
Here, we set out to test whether synthetic spike-in DNA or RNA standards with well-characterized sequence purities can be used in the GATK BQSR framework to improve detection and correction of SSEs in any sample without the need for a comprehensive and accurate SNP database.