We have demonstrated in this paper the simultaneous acquisition of hundreds of thousands of sequence reads, 80–120 bases long, at 96% average accuracy, in a single run of the instrument, using a newly developed in vitro sample preparation methodology and sequencing technology. With phred 20 as a cutoff, we show that our instrument is able to produce over 47 million bases from test fragments and 25 million bases from genomic libraries. We used test fragments to decouple our sample preparation methodology from our sequencing technology. The decrease in single-read accuracy from 99.4% for test fragments to 96% for genomic libraries is primarily due to a lack of clonality in a fraction of the genomic templates in the emulsion, and is not an inherent limitation of the sequencing technology. Most of the remaining errors result from a broadening of signal distributions, particularly for large homopolymers (7 or more), leading to ambiguous base calls. Recent work on the sequencing chemistry and on algorithms that correct for crosstalk between wells suggests that the signal distributions will narrow, with an attendant reduction in errors and increase in read lengths. In preliminary experiments with genomic libraries that also include improvements in the emulsion protocol, we are able to achieve, using 84 cycles, read lengths of 200 bp with accuracies similar to those demonstrated here for 100 bp. On occasion, at 168 cycles, we have generated individual reads that are 100% accurate over more than 400 bp.
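For reference, the phred scale relates a quality score Q to an estimated per-base error probability p by Q = -10 log10(p), so a phred 20 cutoff admits bases with an estimated error of at most 1%. The following minimal Python sketch converts between the two. The function names are illustrative and are not part of the instrument software, and expressing the average read accuracies quoted above as phred-equivalent values assumes a uniform per-base error rate, which is a simplification.

```python
import math


def phred_to_error(q: float) -> float:
    """Estimated per-base error probability for phred score q."""
    return 10 ** (-q / 10)


def accuracy_to_phred(accuracy: float) -> float:
    """Phred-equivalent score for an accuracy given as a fraction in [0, 1).

    Assumes a uniform per-base error rate (an illustrative simplification).
    """
    if not 0 <= accuracy < 1:
        raise ValueError("accuracy must be in [0, 1)")
    return -10 * math.log10(1 - accuracy)


if __name__ == "__main__":
    # Error probability at the phred 20 cutoff.
    print(f"phred 20 error probability: {phred_to_error(20):.4f}")
    # Accuracies quoted in the text, expressed on the phred scale.
    for acc in (0.96, 0.994):
        print(f"accuracy {acc:.3f} ~ phred {accuracy_to_phred(acc):.1f}")
```

Under this simplification, the 96% average accuracy for genomic libraries corresponds to a phred-equivalent value of about 14, and the 99.4% accuracy for test fragments to about 22.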
Using M. genitalium, we demonstrate that short fragments a priori do not prohibit the de novo assembly of bacterial genomes. In fact, the larger oversampling afforded by the throughput of our system resulted in a draft sequence having fewer contigs than with Sanger reads, with substantially less effort. By taking advantage of the oversampling, consensus accuracies greater than 99.96% were achieved for this genome. With further quality filtering of the assembly, a consensus sequence can be selected with accuracy exceeding 99.99%, while incurring only a minor loss of genome coverage. Comparable results were seen when we shotgun sequenced and de novo assembled the 2.1 Mbp genome of Streptococcus pneumoniae (ref. 15; Supplementary Table 4). The de novo assembly of genomes more complex than bacteria, including mammalian genomes, may require the development of methods, similar to those developed for Sanger sequencing, to prepare and sequence paired end libraries that can span repeats in these genomes. To facilitate the use of paired end libraries, we have developed methods to sequence, in an individual well, from both ends of a genomic template, and plan to add paired end read capabilities to our assembler (Supplementary Methods: Double Ended Sequencing).
Future increases in throughput, and a concomitant reduction in cost per base, may come from the continued miniaturization of the fibreoptic reactors, allowing more sequence to be produced per unit area, a scaling characteristic similar to that which enabled the prediction of significant improvements in the integrated circuit at the start of its development cycle (ref. 19).