Low-cost whole-genome sequencing technologies promise to revolutionize biomedical research and usher in an era of personalized medicine. The reductions in cost and time needed to make routine genome sequencing practical will require new methodologies, the most developed of which are “sequencing-by-synthesis” (SBS) methods that involve direct detection of polymerase-mediated synthesis.[1] Helicos BioSciences (Cambridge, MA) has commercialized a single-molecule SBS platform in which spatially separated oligonucleotides are sequenced in parallel, using total internal reflection illumination microscopy to detect the addition of deoxynucleotide triphosphates bearing fluorophores attached via cleavable linkers (dNTP-Fl; Figure 1a and S1 in Supporting Information).[2] After incorporation and detection of the modified dNTP, the disulfide bond of the linker is cleaved to release the fluorophore, and after capping with iodoacetamide, a spectroscopically dark primer is regenerated that is ready for an additional round of dNTP-Fl incorporation. Removal of the fluorophore leaves behind a portion of the linker, referred to as a ‘scar’ (Figure 1b), and sequential rounds of sequencing eventually result in a run of modified nucleotides at the primer terminus. Ultimately, polymerase recognition of modified nucleotides, both during incorporation of the dNTP-Fl and after incorporation as part of a scarred primer terminus, limits this and other SBS methods.
Figure 1 SBS substrates and selection system. a) Modified dUTP used in sequencing (R=H) or selection experiments (R=CH2NHCO-biotin). b) Representative scarred nucleotide. R1 = H at primer terminus and DNA when in remainder of primer. c) Activity-based phage display …
A variety of approaches have been pursued for developing polymerases with novel activities, including screening variants produced by rational design[3] or random mutagenesis,[4] and selections based on in vitro[5] or phage display.[6] In previous work,[6] we demonstrated that polymerase mutants with specific, non-natural catalytic properties can be isolated from large libraries of mutants through the use of an activity-based selection system that relies on the co-display of DNA polymerase mutants and their oligonucleotide substrate on M13 phage (Figure 1c). Phage production is optimized such that each phage particle displays one or zero polymerase mutants via fusion to a phagemid-encoded pIII, and four to five “acidic peptides” via fusions to the phage genome-encoded pIII. The displayed acidic peptides are used to attach oligonucleotide primers to the surface of the phage particle via a covalently linked “basic peptide.” Because all of the pIII proteins are localized to one end of the phage particle, a displayed polymerase preferentially extends the primers that are attached to the same phage particle. Biotinylation of the dNTP substrate, natural or modified, enables selective recovery of the active polymerases and their respective genes using streptavidin beads. Thus, libraries can be enriched in mutants possessing a desired activity, such as the recognition of modified substrates. Using this directed evolution approach, we previously evolved polymerases that can synthesize RNA,[6a] DNA containing C2’-O-methyl modified nucleotides,[6b] or DNA containing unnatural hydrophobic nucleobase analogs.[6c] While the selection system is compatible with either Klenow fragment (Kf, the N-terminal truncation of E. coli polymerase I) or Stoffel fragment (Sf, the N-terminal truncation of Taq) DNA polymerases, all previous successes have been with Sf, which may be related to a link between thermostability and evolvability.[7] In addition, initial efforts to evolve Kf mutants tailored for the Helicos nucleotides failed; thus, our attention turned to the evolution of Sf.
A library of Sf mutants was generated by synthetic shuffling,[8] which involved assembly PCR with degenerate oligonucleotides encoding residues found in homologous polymerases from six organisms: Thermus aquaticus; Thermus thermophilus (91% amino acid identity to T. aquaticus); Thermus caldophilus (86%); Thermus filiformis (81%); Spirochaeta thermophila (54%); and Thermomicrobium roseum (54%). Mutations were restricted to twenty-one residues within 14 Å of the incoming dNTP (based on a ternary structure of Taq, PDB ID 1qsy[9]). This approach allowed many mutations to be introduced and, because every mutation is found in nature, simultaneously minimized the chance that any mutation would compromise the structure or basic activity of the enzyme. The 14 Å library thus contains mutations at 21 unique sites within the fingers and palm regions, which are prominently involved in dNTP binding and incorporation,[10] giving a final library of 10⁸ chimeric Sf variants (for details, see Supporting Information).
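The combinatorics behind a synthetically shuffled library can be illustrated with a short calculation: if each mutated site independently allows a small set of residues, the number of distinct protein variants is the product of the per-site counts. The sketch below is only an illustration. The per-site residue counts are hypothetical (the actual residue sets are given in the Supporting Information, not here), and the function name `library_diversity` is introduced for this example. It is also an assumption that theoretical diversity is the relevant quantity; the reported library size may instead reflect the number of clones obtained.

```python
from math import prod


def library_diversity(residues_per_site):
    """Theoretical number of protein variants for a site-restricted library.

    residues_per_site: one integer per mutated position, giving how many
    different amino acids the degenerate oligonucleotides allow there.
    """
    if any(n < 1 for n in residues_per_site):
        raise ValueError("each site must allow at least one residue")
    return prod(residues_per_site)


# HYPOTHETICAL example: 21 mutated sites, 12 of which allow two residues
# and 9 of which allow three. These counts are not taken from the paper.
hypothetical_sites = [2] * 12 + [3] * 9
diversity = library_diversity(hypothetical_sites)
print(f"{len(hypothetical_sites)} sites -> {diversity:.2e} variants")
```

Even with only two or three residues allowed per position, 21 sites reach a diversity on the order of 10⁷ to 10⁸, which shows why restricting mutations to residues found in natural homologs keeps the library within a size that phage display can sample.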
To optimize selection pressure, we measured the steady-state rates at which Sf extends a natural primer by incorporation of each dNTP-Fl against its cognate base in the template (see Table S1 in Supporting Information), and we found that the incorporation of dUTP-Fl opposite dA is the least efficient. Thus, to apply a selection pressure for this step, we synthesized biotinylated dUTP-Fl (Figure 1a). 10¹¹ phage bearing both a polymerase mutant and a primer-template duplex containing dA at the first templating position were prepared as described previously.[6] Four rounds of selection were performed in which phage immobilization required the more efficient extension of the primer with the biotinylated dUTP-Fl.
From a preliminary screen of 300 members of the enriched library, mutants were selected based on their ability to recognize dUTP-Fl. Six mutants were further characterized based on their ability to recognize each modified dNTP under both steady-state and sequencing-like conditions using a scarred primer (Table S2 in Supporting Information). The three most active polymerase mutants, Sf168, Sf197, and Sf267, showed an approximately 10- to 50-fold increased efficiency for dUTP-Fl incorporation and a 7- to 80-fold increased efficiency for incorporation of the other three modified dNTPs.
Steady-state rate of dUTP-Fl incorporation by Sf wt and Sf mutants[a]
Sf has relatively low affinity for DNA,[11] so its practical use for sequencing is limited by the need for prohibitively high concentrations of enzyme. Thus, the mutations found in Sf168, Sf197, and Sf267 (see Table S3) were cloned into full-length Taq, which has a higher affinity for DNA.[11] As expected, the concentration of each Taq mutant needed to saturate the primer-template decreased more than 25-fold, making it likely that the selected mutants will be suitable for practical applications. A preliminary survey of dNTP-Fl incorporation kinetics indicated that Taq197 (corresponding to the mutations from Sf197) was better optimized for the modified substrates than were Taq168 or Taq267 (corresponding to Sf168 and Sf267, respectively); thus, our focus turned to the further characterization of Taq197. Using pre-steady state kinetics, we found that Taq197 incorporates each dNTP-Fl 48- to 377-fold more efficiently into scarred primer termini than wild type Taq, with no apparent bias toward the identity of the dNTP-Fl or the sequence of the primer. At least for dUTP-Fl incorporation, the data also suggest that GC-rich sequences, which are commonly difficult to sequence, are not problematic.
Rate of incorporation (kpol) of dNTP-Fl against cognate base (N3) by Taq and Taq197 onto scarred primer termini[a]
To examine the fidelity of Taq197, we characterized the misincorporation of the three incorrect dNTP-Fls opposite dA in the template under pre-steady state conditions. Importantly, as with wild type Taq, Taq197 does not measurably synthesize any mispairs, even after 90 minutes. Thus, based on the detection limit of the assay (see Supporting Information), we set an upper limit of 5.6 × 10⁻⁴ for the rate of mispair formation, making correct incorporation of modified nucleotides more than 5,000-fold more efficient than mispair formation. These data suggest that the fidelity of Taq197 has not been significantly compromised and that it should be sufficient for sequencing applications.
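The arithmetic linking the detection limit to the quoted discrimination can be made explicit. Because no mispair was observed, the assay provides only an upper bound on the mispair rate, so dividing a measured correct-incorporation rate by that bound gives a lower bound on discrimination. The sketch below assumes that the upper limit and the correct-incorporation rates are expressed in the same units (the units are not restated here), and the function names are introduced for illustration only.

```python
def discrimination_lower_bound(correct_rate, mispair_rate_upper_limit):
    """Lower bound on (correct rate) / (mispair rate).

    The true mispair rate is at most the detection limit, so the true
    discrimination is at least this ratio.
    """
    if correct_rate <= 0 or mispair_rate_upper_limit <= 0:
        raise ValueError("rates must be positive")
    return correct_rate / mispair_rate_upper_limit


def min_correct_rate(fold_discrimination, mispair_rate_upper_limit):
    """Smallest correct-incorporation rate consistent with a given fold bound."""
    if fold_discrimination <= 0 or mispair_rate_upper_limit <= 0:
        raise ValueError("inputs must be positive")
    return fold_discrimination * mispair_rate_upper_limit


# Values quoted in the text; units ASSUMED to match those of the rate data.
upper_limit = 5.6e-4
implied_rate = min_correct_rate(5000, upper_limit)
print(f"5,000-fold discrimination implies a correct rate of at least {implied_rate:.1f}")
```

Under that assumption, a discrimination of more than 5,000-fold corresponds to correct-incorporation rates above roughly 2.8 in the same units as the upper limit.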
Finally, we evaluated the performance of Taq and Taq197 in single molecule sequencing reactions using a set of 30 oligonucleotides derived from the sequence of the M13 phage genome (Figure 2). After 32 base addition cycles (a total of 8 additions of each of the four dNTP-Fls), Taq produced a median strand length of two nucleotides. In contrast, Taq197 produced a median strand length of six nucleotides, with substantially longer lengths also observed. The conditions employed in these reactions were optimized for mesophilic polymerases; thus, optimization for Taq197 will likely increase the absolute read length considerably. Regardless, the data reveal that the improved ability of Taq197 to accept the modified nucleotides translates into significantly improved performance in single molecule sequencing.
Figure 2 Performance of Taq197 (black) compared to wild type Taq (gray) in single molecule DNA sequencing after 32 base addition cycles (for details, see Supporting Information).
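To make the link between per-cycle incorporation efficiency and median strand length concrete, cyclic base addition can be sketched as a toy simulation. This is not the analysis used in the study. The flow order `CTAG`, the rule of at most one incorporation per cycle, the efficiency values, and the template sequences are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import random
from statistics import median

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def simulate_strand(template, flow_order="CTAG", cycles=32,
                    efficiency=1.0, rng=None):
    """Return the number of bases added to one primer.

    Each cycle offers a single nucleotide. If it pairs with the next
    templating base, it is incorporated with probability `efficiency`.
    ASSUMPTION: at most one incorporation per cycle.
    """
    if rng is None:
        rng = random.Random(0)
    position = 0
    for cycle in range(cycles):
        if position >= len(template):
            break  # template fully copied
        offered = flow_order[cycle % len(flow_order)]
        pairs = COMPLEMENT[template[position]] == offered
        if pairs and rng.random() < efficiency:
            position += 1
    return position


def median_length(templates, efficiency, seed=0, **kwargs):
    """Median strand length over a set of templates at a given efficiency."""
    rng = random.Random(seed)
    lengths = [simulate_strand(t, efficiency=efficiency, rng=rng, **kwargs)
               for t in templates]
    return median(lengths)


# HYPOTHETICAL templates and efficiencies, for illustration only.
templates = ["GATCGGATTCAGCATG", "ACGTTGCAAGCTTGCA", "TTGACCGATAGGCTAA"]
for eff in (0.2, 0.8):
    print(f"efficiency {eff}: median length {median_length(templates, eff)}")
```

In this toy model, 32 cycles are 8 passes through the four nucleotides, and a polymerase that frequently fails to incorporate the offered nucleotide falls behind on every pass, which is one way to picture why a more efficient enzyme gives a longer median strand.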
Taq197 has 14 mutations relative to its wild type progenitor (Figure 3 and Supporting Information). Based on the crystal structure of the ternary complex of the wild type enzyme and natural substrates,[9] it seems likely that at least one of the mutations, T664A, alters direct interactions with the modified substrates. Thr664 is located in the developing major groove of the DNA, 6.1 Å from the site where the linker is attached to the incoming dNTP (Figure 3a); mutation to the smaller Ala residue may enable the polymerase to better tolerate the bulky linker. In addition to these direct interactions, there are a number of more subtle changes: five amino acids in the O-helix and three in the N-helix, which packs on the O-helix, are mutated to other hydrophobic residues (Figure 3). These mutations appear to participate in three clusters of packing interactions that likely contribute to improved positioning of the O-helix, which has been shown to close over, and make specific contacts with, the incoming dNTP during DNA synthesis.[9] Thus, while additional experiments are required to fully deconvolute the specific contributions of individual mutations to the improved activities of Taq197, the polymerase appears to have acquired an expanded substrate repertoire by optimizing both direct and indirect contacts with the modified substrates.
Figure 3 Taq polymerase (PDB ID 1qsy) showing residues mutated in Taq197. The N- and O-helices are labeled, the DNA strands are shown in black, and the incoming dNTP is rendered as sticks. a) Thr664 is located 6.1 Å (dashed line) from the major groove of …
Taq197 is the first example of a DNA polymerase optimized by directed evolution for next-generation sequencing. Because many of the most promising next-generation sequencing methods rely on DNA polymerase recognition of modified substrates and would be aided by polymerase optimization, the methods detailed here should be broadly applicable to other next-generation sequencing platforms. Beyond sequencing, many other emerging technologies, including DNA labeling and SELEX, would be greatly facilitated by an increased ability to replicate DNA containing similarly modified nucleotides, and Taq197, or a further optimized progeny, is likely to facilitate these technologies as well.[13]