Here we describe a three-part conceptual framework ():
- Phase 1: raw read data with platform-dependent biases is transformed into a single, generic representation with well-calibrated base error estimates, mapped to their correct genomic origin, and aligned consistently with respect to one another. Mapping algorithms place reads with an initial alignment on the reference genome, either generated in, or converted to, the technology-independent SAM/BAM reference file format 25. Next, molecular duplicates are eliminated (Suppl. Mats), initial alignments are refined by local realignment, and then an empirically accurate per-base error model is determined.
- Phase 2: the analysis-ready SAM/BAM files are analyzed to discover all sites with statistical evidence for an alternate allele present, among the samples including SNPs, short indels, and CNVs. CNV discovery and genotyping methods, though part of this conceptual framework, are described elsewhere 26.
- Phase 3: technical covariates, known sites of variation, genotypes for individuals, linkage disequilibrium, and family and population structure are integrated with the raw variant calls from phase 2 to separate true polymorphic sites from machine artifacts, and at these sites high-quality genotypes are determined for all samples.
All components after initial mapping and duplicate marking are instantiated in the Genome Analysis ToolKit (GATK) 27
Applying the analysis pipeline to HiSeq data at ˜60× of NA12878
2.72B bases (~96%) of the 2.83B non-N bases in the autosomal regions and chromosome X of the human reference genome have sufficient coverage to call variants in the 101bp paired-ended HiSeq data (). Even though the HiSeq reads were aligned with the gap-enabled BWA, more than 15% of the reads that span known homozygous indels in NA12878 are misaligned (Supplemental Table 1
). Realignment corrects 6.6M of 2.4B total reads in 950K regions covering 21Mb in the HiSeq data, eliminating 1.8M loci with significant accumulation of mismatching bases (Supplemental Table 2
). The initial data processing steps (Phase 1) eliminate ~300K SNP calls, more than one fifth of the raw novel calls, with quality metrics consistent with more than 90% of these SNPs being false positives ().
Table 2 Raw to recalibrated, imputed SNP calls HiSeq, Exome, and 61 sample low-pass data sets. Part one of each section summarizes the impact of local realignment and base quality recalibration by comparing SNP calls on reads with raw quality scores and alignments (more ...)
The initial 4.2M confidently called non-reference sites include 99.7% and 99.5% of the HapMap3 and 1KG Trio sites genotyped as non-reference in NA12878; at these variant sites the sequencing and genotyping calls are concordant 99.9% of the time (). Variant quality score recalibration of these initial calls identifies a tranche of SNPs with estimated FDR of <1% containing 3.2M known variants and 362K novel variants, a 90% dbSNP rate, and Ti/Tv ratios of 2.15 and 2.05, respectively, consistent with our genome-wide expectations (Box 1). While the variant recalibrator removed ~595K total variants with a Ti/Tv ratio of ~1.2, it retained 99% and 97.3% of the HapMap3 and 1KG Trio non-reference sites. The discordant sites have 100× higher genotype discrepancy rates, suggesting that the sites themselves may be problematic. Almost all of the variants in the 1% tranche are already present in the even higher stringency 0.1% FDR tranche, while analysis of the 10% FDR tranche suggest that some more variants could be obtained, at the cost of many more false positives ().
Figure 4 (a) Relationship in the HiSeq call set between strand bias and quality by depth, for genomic locations in HapMap3 (red) and dbSNP (yellow) used for training the variant quality score recalibrator (left) and the same annotations applied to differentiate (more ...)
Applying the analysis pipeline to 28Mb exome capture at ˜150× of NA12878
The raw data processing tools here eliminated ~450 novel call sites from the pre-MSA/pre-recal call set, representing more than 20% of all the novel calls, with a Ti/Tv of 0.30 - fully consistent with all being false positives - while adding several sites present in HapMap3 and the 1KG Trio. The raw whole exome data call set, at ~150× coverage (), includes >99% of both the HapMap3 and 1KG Trio non-reference sites within the 28Mb exome target region, with >99.8% genotype concordance at these sites. As with HiSeq, even with recalibration and local realignment, however, the Ti/Tv ratio of the novel sites in the initial SNP calls indicates that more than 50% of these calls are false positives. Variant quality score recalibration, using only ~5400 SNPs for training, identifies a high-quality subset of calls that capture >98% of the HapMap3 and 1KG Trio sites in the target regions. The value of the tranches is more pronounced in the whole exome (), where 900 of the 1039 novel calls come from tranches with FDRs under 1%, despite needing to reach into the 10% FDR tranche to include most true positive SNPs.
The Hiseq WGS and exome capture datasets differ drastically in their sequencing protocols (WGS vs. hybrid capture), the sequencing machines (HiSeq vs. GA), and the initial alignment tools (BWA vs. MAQ). Neverthless, the exome call set is remarkably consistent the subset of calls from HiSeq that overlap the target regions of the hybrid capture protocol. 94% of the HiSeq calls are also called in the final exome set sliced at 10% FDR (data not shown), and at these sites the non-reference discrepancy rate is extremely low (<0.4%). Mapping differences between the aligners used for HiSeq (BWA) and exome (MAQ) data sets account for vast the majority of these discordant calls, with the remainder of the differences due to limited coverage in the exome, and only a small minority of sites due to differential SNP calling or variant quality score recalibration. Overall, despite the technical differences in the capture and sequencing protocols of the HiSeq and Exome data sets, the data processing pipeline presented here uncovers a remarkably consistent set of SNPs in exomes with excellent genotyping accuracy.
Applying the analysis pipeline to low-pass (4×) sequencing of NA12878 with 60 unrelated CEPH individuals
Multi-sample low-pass resequencing poses a major challenge for variant discovery and genotyping because there is so little evidence at any particular locus in the genome for any given sample (). Consequently, it is in precisely this situation where there is little signal from true SNPs that our data processing tools are most valuable, as can be seen from the progression of call sets in . Local realignment and base quality recalibration eliminate ~650K false positive SNPs among 13M sites, 4× more sites than in the HiSeq data set, with an aggregate Ti/Tv of 0.7. The initial low-pass CEU set includes over 13M called sites among all individuals, of which nearly 7M are novel. NA12878 herself has 2.9M variants, of which 430K are novel. The 4× average coverage limits the sensitivity and concordance of this call set, with only 84% and 80% of HapMap3 and 1KG Trio sites assigned a non-reference genotype in the NA12878 sample, both with a ~20% NRD rate.
The variant quality recalibrator identifies from the 13M potential variants ~6M known and 1.5M novel sites in tranches from 0.1% to 10% FDR. highlights several key features of the data: the allele frequency distribution of these calls closely matches the population genetics expectation and the vast majority of HapMap3 and 1000 Genomes official CEU call sites are recovered, with the proportion nearing 100% for more common variant sites (). Although we selected a 0.1% FDR tranche for analysis here, which contains the bulk of HapMap3, 1KG Trio, and HiSeq sites, there are another ~700K true sites can be found in the 1 and 10% FDR tranche, albeit among many more false positives. This highest quality tranche includes nearly all variants observed more than 5 times in the samples and 1.4M novels, with the SNPs in the tranches at 1% and 10% generally occupying the lower alternate allele frequency range (). The overall picture is clear: calling multiple samples simultaneously, even with only a handful of reads spanning a SNP for any given sample, enables one to detect the vast majority of common variant sites present in the cohort with a high degree of sensitivity.
Figure 5 Variation discovered among 60 individuals from the CEPH population from 1000 Genomes pilot phase plus low-pass NA12878. (a) Discovered SNPs by non-reference allele count in the 61 CEPH cohort, colored by known (light blue, striped) and novel (dark blue, (more ...)
While the bulk properties of the 61-sample call set are good, we expect the low-pass 4× design to limit variation discovery and genotyping in each sample relative to deep re-sequencing. In the 61 sample call set we discover ~80% of the non-reference sites in NA12878 according to HapMap3, 1KG Trio, and HiSeq call sets (). The ~20% of the missed variant sites from these three data sets had little to no coverage in the NA12878 sample in the low-pass data and, therefore, could not be assigned a genotype using only the NGS data, a general limitation of the low-pass sequencing strategy (, ). The multi-sample discovery design, however, affords us the opportunity to apply imputation to refine and recover genotypes at sites with little or no sequencing data. Applying genotype-likelihood based imputation with Beagle 28
to the 61 sample call set recovers an additional 15–20% of the non-reference sites in NA12878 that had insufficient coverage in the sequencing data () as well as vastly improving genotyping accuracy ().
We further characterize the quality of our low-pass call set as a function of the number of samples included during the discovery process in addition to NA12878 herself. Increasing the number of samples in the cohort rapidly improves both sensitivity and specificity of the call set. As evidence mounts with more samples that a particular site is polymorphic, our confidence in the call increases and the site is more likely to be called (). Distinguishing true positive variants from sequencing and data processing artifacts is more difficult with few samples and, consequently, low aggregated coverage; adding more reads empowers the error covariates to identify sites as errors by the variant recalibrator ().
Figure 6 Sensitivity and specificity of multi-sample discovery of variation in NA12878 with increasing cohort size for low-pass NA12878 read sets processed with N additional CEPH samples. (a) Receiver operating characteristic (ROC) curves for SNP calls relating (more ...)
The combination of multi-sample SNP calling, variant quality recalibration using error covariates, and imputation allows one to achieve a high-quality call set, both in aggregate and per-sample, with astoundingly little data. The aggregated 61-sample set at 4× coverage includes only four times as much sequencing data as the HiSeq data, yet we discover 3.2M polymorphic sites in NA12878, which includes 97%, 91%, and 87% of the variants in HapMap3, 1000 Genomes Trio, and HiSeq call sets, respectively, while also finding ~5M additional variants among the 60 other samples.
Comparison of hard filtering to variant quality score recalibration
Supplemental Table 3
lists the quality of call sets derived using our previous filtering approaches on all three data sets relative to the adaptive recalibrator described here. In all cases the adaptive approach outperforms the manually optimized hard filtering previously developed for this calling system for the 1000 Genomes pilot data. This highlights two important points – first, that a principled integration of all covariates (which may have a complex correlation structure) should and does outperform single manually defined thresholds on covariates independently, with the added benefit of not requiring human intervention; second, that an accurate ranking of discovered putative variants by the probability that each represents a true site permits the definition of tranches for specificity or sensitivity () as appropriate to the needs of the specific project. Although the most permissive tranche includes almost all sites that have any chance of being true polymorphisms – critical for projects looking for single large effect mutations – the vast majority of true polymorphisms are present in the highest quality tranche of data (not shown).
Comparison of this calling pipeline to Crossbow
To calibrate the additional value of the tools described here we contrast our results with SNPs called on our raw NA12878 exome data using Crossbow 29
, a package combining bowtie, a gapless read mapping tool based on the Burrows-Wheeler transformation 30
and SoapSNP for SNP detection 16
. We chose to perform this analysis on the exome data because its wide range of read depths and complex error modes make SNP calling a challenge, especially given the small number of novel variants (~1000 per sample) expected in this 28Mb target. In Supplemental Table 4
the high-level results of the GATK and Crossbow calling pipelines are compared and contrasted. Key metrics such as the number of novel SNP calls, their Ti/Tv ratio, the number of calls not seen in either the 1000G trio or the HiSeq data, and the high nonsense/read-through rates indicate that the Crossbow call set has lower specificity than the GATK pipeline. This is the case despite applying a aggressive P-value threshold (P < 0.01) for the base quality rank sum test 16
to filter false positive variants, which reduces the sensitivity to HM3, 1000G, and the HiSeq call sets by >3%. As usual, the intersection set between GATK and Crossbow is more specific but less sensitive than the calls unique to each pipeline (), a clear sign that despite the advances presented here significant work remains in perfecting calling in data sets like single sample exome capture. Although the value of the data processing and error modeling presented here is also clear, applying local realignment and base quality score recalibration -- publicly available, easy-to-use modules in the GATK -- are likely to improve the results of the Crossbow pipeline.