We first defined the basic performance parameters of the data, in terms of read length, base accuracy and error profile, and sequence composition-based representational bias. Using a dataset of 4 runs of data (47,638 reads) from human amplicons (see methods), we profiled read length, showing an average of 700 bases over a wide distribution, including 5% of reads >2500 bases (Figure 1a). Errors were characterized by comparison to the known reference sequences (see methods), showing that the primary error mode was insertions, at 12%, followed by deletions at 2% and apparent mismatch errors at 1% (Figure 1b). Further, errors were randomly distributed across the read, rather than increasing at distal positions as is true for other data types (Figure 1c). This property, by which read length and base position do not affect the quality of the bases produced, greatly facilitates alignment of reads despite their relatively high indel rate. Finally, since extreme base composition is a source of representational bias in many sequence data types, we evaluated the performance of the Pacific Biosciences system using a dataset (12 runs, 319,090 reads) derived from 3 microbial genomes that span a much wider range of base composition than the human genome (Plasmodium falciparum, 19% GC; Escherichia coli, 50% GC; Rhodobacter sphaeroides, 69% GC). The average relative coverage across GC content levels fell into ranges narrow enough to have little impact on data utility: between 53% and 110% of mean coverage for P. falciparum, between 79% and 140% for E. coli, and between 64% and 120% for R. sphaeroides.
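The relative-coverage figures above can be reproduced by binning reference windows by their GC percentage and comparing each bin's mean coverage to the genome-wide mean. The sketch below is a hypothetical illustration of that calculation (the function name, bin width and toy data are our own, not part of the published analysis pipeline).

```python
from collections import defaultdict

def relative_coverage_by_gc(windows, bin_width=5):
    """Group reference windows by GC percentage and report each bin's
    mean coverage relative to the overall mean (1.0 = no bias).

    `windows` is an iterable of (gc_percent, coverage) pairs, e.g. one
    entry per fixed-size reference window.
    """
    windows = list(windows)
    overall = sum(cov for _, cov in windows) / len(windows)
    bins = defaultdict(list)
    for gc, cov in windows:
        bins[int(gc // bin_width) * bin_width].append(cov)
    return {b: (sum(covs) / len(covs)) / overall
            for b, covs in sorted(bins.items())}

# Toy data: coverage dips modestly in the most AT-rich windows,
# mimicking the narrow bias ranges reported in the text.
windows = [(18, 40), (19, 45), (50, 100), (51, 110), (69, 95), (70, 90)]
print(relative_coverage_by_gc(windows))
```

A value of 0.53 in the lowest-GC bin would correspond to the 53%-of-mean lower bound reported for P. falciparum.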
Figure 1 Characterization of Pacific Biosciences data. a) Base error mode rate for deletions, insertions and mismatches. b) Length distribution of reads in the Pacific Biosciences discovery dataset (here some raw reads are as long as 5,000 bases). c) Pacific Biosciences …
Base quality scores are a critical tool for accurate SNP calling, and are used by most analysis algorithms to help distinguish true variation from artifacts. Indeed, the accuracy of the reported base quality scores has a significant impact on the correctness of variation detection [12]. We have found that Pacific Biosciences data underreport true base qualities by, on average, 10 PHRED-scaled points. For this reason we applied a method to generate empirical base quality scores for Pacific Biosciences data using the base quality score recalibration framework in the Genome Analysis Toolkit (GATK) [10]. Empirical quality scores generated with GATK are on average Q20-Q24 for bases that are not insertions or deletions. We observed no relationship between the empirical quality scores and position in the read, consistent with the position-independent accuracy of the base calls.
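To make the 10-point underreporting concrete: PHRED scores are logarithmic, so a 10-point difference corresponds to a tenfold difference in error probability. The conversion is standard (Q = −10·log10(p)); the helper function names below are our own.

```python
import math

def phred_to_error(q):
    """Convert a PHRED-scaled quality score to an error probability."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Convert an error probability back to a PHRED-scaled score."""
    return -10 * math.log10(p)

# A base reported at Q10 that empirically behaves like Q20:
# the true error probability is 1%, not the 10% implied by the
# reported score -- a tenfold difference.
print(phred_to_error(10))  # reported: 0.1
print(phred_to_error(20))  # empirical: 0.01
```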
Alignment is the process of mapping each read individually to the region of the reference sequence where it aligns with the minimum number of mismatches and gap openings (insertions and deletions). Because of the high base insertion and deletion rates, alignment of Pacific Biosciences data challenges standard approaches. To account for this, we set the gap open penalty lower than the base mismatch penalty (see methods) in order to maximize alignment performance. While this aligns most of the reads successfully, it has the side effect that the aligner will sometimes prefer to “hide” a true SNP inside an insertion. The result is that mapping is accurate, but the aligned sequence is biased toward the reference at some indel positions (see Figure 2a and b), although the “hidden” bases remain available for subsequent analysis. It is important to note, however, that reference bias is an artifact of the alignment process, rather than of the data, and can be greatly reduced by locally realigning the reads based on the reference and the data. Presently, the available software for local realignment is not compatible with the length and the high indel rate of Pacific Biosciences data, but we anticipate the development of new tools.
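The mechanism behind this reference bias can be demonstrated with a minimal global aligner. The sketch below (a textbook Needleman-Wunsch with a linear gap penalty, not the production aligner used in the study) shows that when gaps are penalized less than mismatches, a read carrying a true SNP is aligned with an insertion/deletion pair instead of a substitution, so the variant base never appears opposite the reference position.

```python
def align(read, ref, match=1, mismatch=-4, gap=-1):
    """Global (Needleman-Wunsch) alignment with a linear gap penalty.

    When `gap` is cheaper than `mismatch`, the traceback prefers an
    insertion/deletion pair over a substitution, which can "hide" a
    true SNP inside an indel -- the reference bias described above.
    """
    n, m = len(read), len(ref)
    # Fill the dynamic-programming score matrix.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = S[i-1][j-1] + (match if read[i-1] == ref[j-1] else mismatch)
            S[i][j] = max(sub, S[i-1][j] + gap, S[i][j-1] + gap)
    # Trace back to recover one optimal alignment.
    a, b, i, j = [], [], n, m
    while i or j:
        diag = S[i-1][j-1] + (match if i and j and read[i-1] == ref[j-1] else mismatch)
        if i and j and S[i][j] == diag:
            a.append(read[i-1]); b.append(ref[j-1]); i -= 1; j -= 1
        elif i and S[i][j] == S[i-1][j] + gap:
            a.append(read[i-1]); b.append('-'); i -= 1      # insertion in read
        else:
            a.append('-'); b.append(ref[j-1]); j -= 1       # deletion from ref
    return ''.join(reversed(a)), ''.join(reversed(b))

# The read carries a true SNP (G vs. the reference A), but because the
# gap penalty is below the mismatch penalty, the aligner represents it
# as an insertion plus a deletion, hiding the variant.
print(align("ACGT", "ACAT"))  # -> ('AC-GT', 'ACA-T')
```

Raising the gap penalty above the mismatch penalty would instead place the G opposite the A as a substitution, which is why the gap/mismatch penalty ratio directly controls this artifact.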
Figure 2 Error profile of Pacific Biosciences data. a) A chart showing the number of observations of the alternate allele at all heterozygous sites and how reference bias pulls the median significantly below the expected 0.5. This combination creates multiple possible …
For all projects aimed at discovery of mutations or variation, it is critical to follow up with validation of the variants called, either by sampling to determine the rate at which calls are correct, or by confirming which exact changes are real rather than artifactual. Extension is an approach similar to validation, but carried out as a discovery process on samples that have not yet been sequenced, to determine rapidly and at relatively low cost the extent to which mutations are present at the same base or in the same gene in a large number of samples. Today, variants called by massively parallel sequencing are commonly validated using either Sequenom genotyping, Sanger sequencing or resequencing on an alternate massively parallel technology (e.g. variants called from Illumina HiSeq data are validated with 454 sequencing); for examples see [1]. Each of these options has limitations. Sequenom genotyping can validate only the specific mutation called, and so lacks the ability to identify base differences at sites that were not discovered (i.e. it is unable to validate mutation load assays or gene burden tests), and is thus much less effective for extension. Further, accurate interpretation of Sequenom data requires manual review, with the potential to introduce human error. Sanger sequencing operates at low throughput and also requires human interpretation of the sequencing results, making it suitable only for small validation assays. Using alternate sequencing platforms may not provide the fast turnaround necessary for validation studies (Illumina HiSeq) or may suffer from other context-specific error modes (SOLiD, 454) that often further complicate the validation analysis. Sequencing with an alternate technology is, nevertheless, the most statistically correct approach for validation, provided the alternate sequencing methodology employs a different approach from the original technology, with a different spectrum of biases, error modes or other limitations.
We designed two targeted sequencing experiments to evaluate the power of the Pacific Biosciences data type as a tool for validation and extension and as a tool for variant discovery. For both experiments we used the GATK to discover sites (see methods for details on site selection).
To evaluate Pacific Biosciences data for validation/extension, we sequenced PCR amplicons spanning 98 variant calls based on Illumina GA1 and GA2 data from the 1000G project. These had been previously validated either as true de novo mutations (38 sites) or false call artifacts (60 sites), using SOLiD, 454 and Sanger sequencing chemistry [13]. Individual PCR amplicons spanning the test set variants were generated, sequenced with both the Pacific Biosciences RS and the Illumina MiSeq, and the data were used to call SNPs. Performance of data from the two platforms was similar by several metrics. Pacific Biosciences data showed 97% sensitivity and 98% specificity by correctly genotyping 96 of the 98 sites, with a negative predictive value of 98% and a positive predictive value of 97%. Illumina MiSeq data provided 100% sensitivity and 91% specificity, genotyping 93 of the 98 sites correctly, with 100% negative predictive value and 88% positive predictive value (see Tables 1 and 2). The miscalled sites from each dataset were then manually inspected. One site was wrongly called monomorphic with the reference allele from Pacific Biosciences data due to reference bias. The one site wrongly called polymorphic from Pacific Biosciences data was similarly miscalled in MiSeq and in our gold standard HiSeq whole genome dataset (see methods). Of the 5 sites miscalled using data from Illumina MiSeq, 2 were in agreement with Pacific Biosciences (one listed above and one not called in Pacific Biosciences due to reference bias), and 3 sites were called polymorphic in error due to noise in the MiSeq data (see Figure 2c for an example). Pacific Biosciences RS data performed well by all metrics, and at a similar quality to Illumina data, demonstrating that the RS is a powerful tool for follow-up validation or extension.
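The four metrics quoted above follow directly from the 2x2 confusion table implied by the text (38 true sites, 60 artifacts; one false negative and one false positive for Pacific Biosciences, five false positives for MiSeq). The function below is our own illustrative helper, not part of the study's pipeline.

```python
def validation_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, PPV and NPV from 2x2 confusion counts."""
    return {
        "sensitivity": tp / (tp + fn),   # fraction of true sites recovered
        "specificity": tn / (tn + fp),   # fraction of artifacts rejected
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
    }

# Counts implied by the text: 38 true de novo sites, 60 artifacts.
pacbio = validation_metrics(tp=37, fp=1, tn=59, fn=1)   # 96/98 correct
miseq = validation_metrics(tp=38, fp=5, tn=55, fn=0)    # 93/98 correct
print({k: round(v, 2) for k, v in pacbio.items()})
print({k: round(v, 2) for k, v in miseq.items()})
```

Rounded to two digits these reproduce the reported 97%/98%/97%/98% for Pacific Biosciences and 100%/92%/88%/100% for MiSeq (the text rounds specificity down to 91%).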
Validation calls for Pacific Biosciences and Illumina MiSeq
Validation metrics for Pacific Biosciences and Illumina MiSeq
To evaluate Pacific Biosciences data for variant discovery, we sequenced 177 kb in 61 amplicons from regions across human chromosome 20 using both the Pacific Biosciences RS and the Illumina MiSeq. These amplicons contained 225 SNPs that had been validated in our gold standard deep whole genome call set (see methods), which includes 43 previously validated as high-confidence SNPs from HapMap 3.3. A single HapMap site was called with very high confidence by amplicon-based data from both technologies to have a different alternate allele than HapMap, and was similarly called in the whole genome shotgun dataset; thus, it likely reflects either a single base change in our DNA sample source or an error in HapMap, rather than a miscalled SNP. With Pacific Biosciences data we called 197 of the 225 gold standard sites (including 38 of the 43 HapMap sites); with MiSeq data we called 222/225 (and 43/43). We manually inspected the sites of all the discordant calls, which fell into two classes: 1) reference bias, where the correct call was obscured by reference bias in the alignment, and the alternate allele was present in the data but hidden by the alignment inside insertions; and 2) low sequence coverage, where insufficient data were obtained to make a confident call, likely due to poor PCR amplification of the target. For the Pacific Biosciences data, the 28 sites missed from the gold standard call set included 16 missed due to reference bias and 12 due to lack of coverage. For MiSeq data, the three sites missed were all due to low sequence coverage.