The 2008 ABRF DNA Sequencing Research Group (DSRG) difficult template sequencing study was designed to identify a general set of guidelines that would constitute the best approaches for sequencing difficult templates. This was a continuation of previous DSRG difficult template studies performed in 1996, 1997, and 2003. The distinguishing factors in the present study were the number of DNA templates used, the number of different types of difficult regions tested, and the inclusion of a follow-up phase of the study to identify optimal protocols for each type of difficult template. DNA templates with associated sequencing primers were distributed to participating laboratories and each laboratory returned their sequencing results along with descriptions of the experimental conditions used. The data were analyzed and the best protocols were identified for each difficult template. This information was subsequently distributed to the participating laboratories for a second round of sequencing to evaluate the general applicability of the optimized protocols. The average improvements in sequencing results were 11% overall, with a range of −25% to +43% using the optimized protocols. The full results from this study are presented here and they demonstrate that general experimental protocols and common additives can be used to improve the sequencing success for many difficult templates.
The “classical” Sanger DNA sequencing technique1 is a well-established and mature technology used very successfully in many core facilities and large sequencing centers. Until 2005, when the first of the new, highly parallel sequencing instruments were introduced,2 the Sanger method was the dominant sequencing technology used globally. Throughout the 31 years since its introduction, almost every step in the DNA sequencing process has been optimized, and re-optimized, as technology changed from radioactivity to fluorescence, and from slab gels to capillary-based systems.3–12 With all of these improvements, it is now routinely possible to obtain over 900 Q ≥ 20 bases13,14 for most typical DNA templates. However, this is not necessarily true when encountering a difficult region, operationally defined as one where sequencing is impeded under the standard ABI protocol.3 The complexity lies in the fact that there are many types of difficult regions,15 and each situation requires a unique treatment.15,16 Previously published papers have addressed either single situations or a narrow range of templates,17,18–22 and therefore have limited applicability to the broader range of difficult templates. A further confounding factor is that individual laboratories are not standardized in terms of reaction conditions, cleanup methods, instrumentation, and laboratory protocols.
This study was designed to address some of the shortcomings of the previous studies, and to assist core facilities, commercial laboratories, and other units encountering such situations to deal more effectively with a variety of nonstandard templates. Although the next generation sequencing technologies are making tremendous strides in all aspects of sequencing applications, the Sanger methodology will remain viable for many years to come, and the ability to effectively sequence a variety of difficult templates will be of great and lasting importance to the success of any sequencing project.
The DNA Sequencing Research Group (DSRG) designed this study to identify a general set of guidelines that would constitute the best approaches for sequencing of difficult templates. This was a continuation of previous DSRG research group studies performed in 1996, 1997, and 2003.23,24 The distinguishing factors in the present study were the number of DNA templates tested, the number of different types of difficult regions tested, and the inclusion of a follow-up phase in the study. This follow-up phase involved the generation of consensus “optimal protocols” for each difficult template.
The DSRG distributed a set of 8 templates containing a variety of difficult regions along with a control DNA (pGem3zf) to each participating laboratory (refer to Table 1 for characteristics of these DNAs). Participants were requested to sequence each template, in triplicate, employing as many different conditions as they wished. The resultant electropherograms were collected along with the associated conditions and formulations used by individual laboratories. This was designated phase I of the study. The data from phase I were analyzed and the two best protocols were identified for each difficult template and control sample. In phase II of the study, this information was distributed to participating laboratories, and each laboratory was requested to re-sequence the samples in a second round to evaluate the general applicability of the optimized protocols. Results from this second round (phase II) were then collected and analyzed. In both phase I and phase II, the participants were asked to record a number of parameters, including the amounts of DNA and primers, reaction conditions (volumes and dilutions of reagents and additives), cycling parameters, thermocycler used, cleanup methods, and sequencing instrument. For phase II of this study, very specific polymerase chain reaction instruments and cleanup methods were recommended for each optimized protocol; however, we were aware that it was unreasonable to expect that laboratories could follow all specific instructions, due to inherent restrictions in availability of instrumentation or technologies. We assumed that polymerase chain reaction instruments and cleanup protocols would have only negligible effects on the overall quality and read length of sequencing data, and that the most critical components would be the sequencing chemistry (mix of various dye-terminators and other additives) as well as precycling steps and cycling conditions.
To ensure the uniformity of all DNAs, a single laboratory prepared all the sequencing templates and distributed aliquots to participating laboratories. The samples were transformed using electrocompetent TOP10 cells (Invitrogen, Carlsbad, CA, One Shot TOP10 Electrocomp E. coli, catalog no. C4040-52) and large-scale DNA preps were performed using the High Purity Plasmid Maxiprep System (Marligen Biosciences, Ijamsville, MD, catalog no. 11452-026). The DNA concentration and A260/280 ratio were measured using a Nanodrop ND-1000 spectrophotometer (Nanodrop, Wilmington, DE). All DNAs had A260/280 ratios > 1.8, indicating a high quality of DNA.25 In addition, aliquots of approximately 200 ng of all DNAs were run on a 1% agarose gel26 to visually assess their quality and integrity (Fig. 1) using a low DNA mass ladder and a supercoiled DNA ladder (Invitrogen, Carlsbad, CA, catalog nos. 10068-013 and 15622-012, respectively) for comparison.
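The purity criterion above can be expressed as a short check. This is only a sketch: the 1.8 cutoff comes from the text, while the function names, the example absorbance readings, and the use of the standard conversion of 50 ng/µL per A260 unit for double-stranded DNA (1 cm path length) are assumptions added for illustration and are not taken from the study's records.

```python
# Standard conversion for double-stranded DNA at a 1 cm path length
# (an assumption for this sketch; not stated in the study).
DSDNA_NG_PER_UL_PER_A260 = 50.0


def dsdna_concentration(a260: float, dilution_factor: float = 1.0) -> float:
    """Estimate dsDNA concentration in ng/uL from an A260 reading."""
    return a260 * DSDNA_NG_PER_UL_PER_A260 * dilution_factor


def purity_ratio(a260: float, a280: float) -> float:
    """Return the A260/280 ratio used as a purity indicator."""
    if a280 <= 0:
        raise ValueError("A280 must be positive")
    return a260 / a280


def passes_purity_check(a260: float, a280: float, min_ratio: float = 1.8) -> bool:
    """True if the A260/280 ratio exceeds the cutoff (1.8 in the text)."""
    return purity_ratio(a260, a280) > min_ratio


if __name__ == "__main__":
    # Hypothetical readings, not data from the study.
    print(dsdna_concentration(2.0))       # estimated ng/uL
    print(passes_purity_check(1.9, 1.0))  # ratio 1.9, above the cutoff
```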
The Q ≥ 20 values were calculated using Sequence Scanner v1.0 (Applied Biosystems; Foster City, CA). We also evaluated signal strength (using the same software), but these data were of limited value because the various BigDye dilutions and cleanup protocols used by participants made signal strength comparisons uninformative. We also evaluated the electropherograms using the “contiguous read length” metric in Sequence Scanner. This is defined as the longest contiguous read length with quality higher than a specified limit, and we observed no substantial difference between the Q ≥ 20 and contiguous read length parameters. The data also were analyzed using other software and algorithms including the KB basecaller (Applied Biosystems), LongTrace (Nucleics, Bendigo VIC, Australia), and PHRED.13,14 The Q ≥ 20 values obtained with these tools were not significantly different; therefore, we report the data as Q ≥ 20 values using the Sequence Scanner software package. The Q ≥ 20 values are an accepted measure of quality for sequencing traces for standard DNAs.13,14 However, visual inspection of the chromatograms for all templates used in this study quite often indicated that the usable trace region was shorter than the reported Q ≥ 20 value.
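The two metrics described above can be illustrated with a minimal sketch that operates on a list of per-base quality values. The function names and the example quality values are hypothetical, and the per-base comparison used here is a simplification: Sequence Scanner may compute contiguous read length differently (for example, over a sliding window), so this should not be read as a reimplementation of that software.

```python
from typing import Sequence


def q20_count(quals: Sequence[int], threshold: int = 20) -> int:
    """Count the bases whose quality value is at least `threshold`."""
    return sum(1 for q in quals if q >= threshold)


def contiguous_read_length(quals: Sequence[int], threshold: int = 20) -> int:
    """Length of the longest unbroken run of bases with quality >= threshold."""
    best = 0
    run = 0
    for q in quals:
        if q >= threshold:
            run += 1
            best = max(best, run)
        else:
            run = 0  # a low-quality base ends the current run
    return best


if __name__ == "__main__":
    # Hypothetical quality values for a short trace, not data from the study.
    example = [10, 25, 30, 15, 22, 40, 35, 20, 5]
    print(q20_count(example))               # bases with Q >= 20
    print(contiguous_read_length(example))  # longest run with Q >= 20
```

The count includes isolated high-quality bases anywhere in the trace, whereas the contiguous metric does not, which is why the two can diverge on traces that recover after a difficult region.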
In addition to BigDye v3.1 (Applied Biosystems, catalog no. 4337455) used for cycle sequencing, common additives were dGTP v3.0 (Applied Biosystems, catalog no. 4390229), betaine (available as a 5 M solution from various distributors, such as catalog no. B-0300 from Sigma Aldrich, St. Louis, MO, or catalog no. 77507 from USB, Cleveland, OH), and DMSO (Sigma-Aldrich, catalog no. D2650).
Table 1 shows the characteristics of the DNA templates and the range of Q ≥ 20 values for phase I and phase II of this study. In phase I, over 50 different protocols were tested by 21 different laboratories, and the participants submitted data with a wide range of Q ≥ 20 values (zero to greater than 1000 bases). The protocols producing the best results were identified and selected for phase II. Table 2 shows the compilation of the 10 best protocols submitted by the study participants, and these protocols are assigned to specific templates in Table 3. Often, multiple protocols differed only in the number of cycles, so the protocol with a median number of cycles was chosen as the representative for the group. These protocols were distributed to the participants so that each laboratory could re-sequence the templates using one or both of the optimized protocols.
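The selection of a representative protocol by median cycle number can be sketched as follows. The dictionary layout, the field names (`mix`, `cycles`), and the example entries are hypothetical, and the use of the lower median when a group has an even number of members is an assumption, since the text does not say how ties were handled.

```python
from collections import defaultdict
from statistics import median_low
from typing import Dict, List


def pick_representatives(protocols: List[Dict]) -> List[Dict]:
    """Group protocols that differ only in 'cycles'; keep the median one per group."""
    groups = defaultdict(list)
    for p in protocols:
        # Everything except the cycle count identifies the group.
        key = tuple(sorted((k, v) for k, v in p.items() if k != "cycles"))
        groups[key].append(p)

    representatives = []
    for members in groups.values():
        # median_low always returns a value present in the data (assumption:
        # the lower of the two middle values is used for even-sized groups).
        target = median_low(m["cycles"] for m in members)
        representatives.append(next(m for m in members if m["cycles"] == target))
    return representatives


if __name__ == "__main__":
    # Hypothetical protocols, not the entries of Table 2.
    submitted = [
        {"mix": "A", "cycles": 25},
        {"mix": "A", "cycles": 35},
        {"mix": "A", "cycles": 30},
        {"mix": "B", "cycles": 40},
    ]
    for rep in pick_representatives(submitted):
        print(rep)
```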
Somewhat surprising was the wide spread in read length for control DNA distributed with each set of difficult templates. This may indicate that there were factors not controlled for in this study, perhaps unknown experimental procedures that affected the sequencing results.
Of the original 21 laboratories, 12 submitted results for phase II (see Table 4 for the results from these laboratories for both phase I and phase II). Most results (90%) from phase II showed improvements over phase I results, but in 18 of the 180 different results analyzed, the participating laboratories produced better data using an in-house protocol than the recommended protocols distributed in phase II (Table 4, blue). The Q ≥ 20 scores from 36 of the 180 different results improved significantly using the protocols provided for phase II (Table 4, red). It is worth noting that, for the most part, those laboratories that submitted poor or average data in phase I demonstrated the most improvement in phase II (Table 4), and that most laboratories already included various combinations of dGTP, betaine, and DMSO in their initial formulations. Overall, the average Q ≥ 20 scores for the 12 participating laboratories improved by 11% from phase I to phase II (Fig. 2). The maximum improvement was observed for the DNA1:forward primer combination, which showed an average improvement of 43% in Q ≥ 20 score. Only one template:primer combination (DNA5:reverse) showed a decrease in average Q ≥ 20 score in the phase II results (−25%). What was unexpected, however, was the wide range of scores also observed for the data submitted in phase II. This individual variation likely arose because most laboratories, although following a standard set of reaction conditions, could not standardize every aspect of the protocols (e.g., individual labs used the cleanup protocols available to them, often not the protocol identified as best), as noted above.
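The percentages quoted above follow from two simple calculations: a relative change between phase averages, and a share of the 180 results. The sketch below illustrates both. The function names are hypothetical, and the phase averages passed to `percent_change` are invented round numbers chosen only to show the arithmetic; they are not values from Table 4. The counts 18, 36, and 180 are taken from the text.

```python
def percent_change(phase1_mean: float, phase2_mean: float) -> float:
    """Relative change in average Q >= 20 score from phase I to phase II, in percent."""
    if phase1_mean <= 0:
        raise ValueError("phase I average must be positive")
    return 100.0 * (phase2_mean - phase1_mean) / phase1_mean


def share_of_results(count: int, total: int) -> float:
    """Percentage of results falling into a category."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * count / total


if __name__ == "__main__":
    # Hypothetical averages (illustration only): an increase and a decrease.
    print(percent_change(400.0, 572.0))
    print(percent_change(600.0, 450.0))
    # Counts reported in the text: 18 and 36 of 180 results.
    print(share_of_results(18, 180))
    print(share_of_results(36, 180))
```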
Figure 3 shows an average (A) and the best (B) chromatogram for DNA 1 (very GC-rich template). Figure 3C shows different, potentially sequencing-impeding motifs in this DNA, including a region that is greater than 95% GC, CCG trinucleotide repeats, and dinucleotide nonrepeats. The mixed dye-terminator chemistry (BigDye v3.1/dGTP v3.0/5 M betaine: 1.5 μL/0.5 μL/2 μL) and the standard ABI cycling protocol offered the best conditions for sequencing this template. Figure 4 shows an average (A) and the best (B) chromatogram for DNA 3 (containing a strong hairpin structure, as seen in Fig. 4C). Again, the protocol used for DNA 1 was the best. The sequencing of this DNA, and of similar templates with strong hairpins, can be tricky, as one can often get clean but incorrect data. Li et al.27 described a detailed analysis of this phenomenon. Using a protocol with the Sequence Resolver kit28 produced very clean and correct data (J. Kieleczawa, data not shown; see also ref. 15). DNA 5 contained a very long (456 bases) stretch of C/T dinucleotide nonrepeat, as shown in Figure 5C. With typical sequencing conditions, the average read length was less than 400 bases (Fig. 5A), whereas under the best conditions (protocol 1) the read length exceeded 600 bases (Fig. 5B). DNA 8 has an Alu and inverted repeat (Fig. 6C). On average, participants were able to obtain sequence with read lengths of approximately 600 bases (Fig. 6A), and the best data (Fig. 6B) were obtained using protocol 1. In each case, adding a heat denaturation step to the best protocol15 improved the data quality (J. Kieleczawa, data not shown; see also ref. 15). A few years ago a new and powerful biochemical tool28 was developed to help sequence through many types of difficult templates (Sequence Resolver Kit from GE Healthcare), but it seems that its use is currently limited.
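When the mixed dye-terminator formulation quoted above is prepared for several reactions at once, the per-reaction volumes can be scaled as in the following sketch. The three per-reaction volumes are taken from the text; the function name and the 10% pipetting overage are assumptions for illustration. Template, primer, buffer, and water are deliberately left out because their amounts are not given in this passage.

```python
from typing import Dict

# Per-reaction volumes in microliters, as quoted in the text.
PER_REACTION_UL: Dict[str, float] = {
    "BigDye v3.1": 1.5,
    "dGTP v3.0": 0.5,
    "5 M betaine": 2.0,
}


def master_mix(n_reactions: int, overage: float = 0.10) -> Dict[str, float]:
    """Scale the per-reaction volumes to n reactions plus a fractional overage.

    The 10% default overage is an assumption, not a recommendation from the study.
    """
    if n_reactions < 1:
        raise ValueError("need at least one reaction")
    if overage < 0:
        raise ValueError("overage cannot be negative")
    scale = n_reactions * (1.0 + overage)
    return {name: round(vol * scale, 2) for name, vol in PER_REACTION_UL.items()}


if __name__ == "__main__":
    for reagent, volume in master_mix(8).items():
        print(f"{reagent}: {volume} uL")
```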
In this study we have evaluated many different protocols used in DNA sequencing centers for their ability to effectively sequence various difficult templates. Given that the characteristics of each difficult template are ultimately determined by the primary structure of the DNA molecule and may appear in a nearly infinite variety of combinations, it is not surprising that we were unable to identify a single protocol that would be effective for all types of difficult regions. However, the results of this study can be used as general guidelines and approaches to serve as starting points for troubleshooting difficult template sequencing. For example, protocol 1 in Table 2 had the widest application and should be considered as the initial procedure for resolving the sequencing of troublesome regions. In addition, these data also demonstrated the validity of incorporating additives and reagents to aid in sequencing through the most difficult templates.
With the renewed commitments from Applied Biosystems (M. Rosoff, S. Santhanam, personal communication, Salt Lake City 2008) to the continued support of capillary-based systems, and the development of new dye-terminator chemistries,29 it is reasonable to expect that further and significant progress will soon be possible. In addition, bioinformatics tools30 used to predict various sequencing-impeding DNA structures may enable more effective resolution of most nonstandard sequences. Such tools cannot, however, predict problems that may arise from templates of unknown sequence. Currently there are very few data on the effectiveness of next generation sequencing technologies for sequencing through difficult templates. We hope to explore this subject in the near future.
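As a rough illustration of the kind of sequence-based screening mentioned above, the sketch below flags GC-rich windows and candidate hairpin stems in a known sequence. It is not the tool cited in ref. 30; the function names, window size, GC cutoff, stem length, and loop limits are all assumptions chosen for demonstration, and exact-match stem detection ignores mismatches and thermodynamic stability.

```python
from typing import List, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in the sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_rich_windows(seq: str, window: int = 50, min_gc: float = 0.8) -> List[int]:
    """Start positions of windows whose GC fraction is at least min_gc."""
    return [
        i
        for i in range(len(seq) - window + 1)
        if gc_fraction(seq[i:i + window]) >= min_gc
    ]


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_hairpin_stems(
    seq: str, stem: int = 8, min_loop: int = 3, max_loop: int = 20
) -> List[Tuple[int, int, int]]:
    """Return (left_start, right_start, loop_length) for exact inverted repeats."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2 * stem - min_loop + 1):
        left_rc = reverse_complement(seq[i:i + stem])
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop
            if j + stem > len(seq):
                break
            if seq[j:j + stem] == left_rc:
                hits.append((i, j, loop))
    return hits


if __name__ == "__main__":
    # Hypothetical sequence: an 8-base stem, a 4-base loop, and the matching arm.
    demo = "AA" + "GCGGCCGA" + "TTTT" + "TCGGCCGC" + "AA"
    print(find_hairpin_stems(demo, stem=8, min_loop=3, max_loop=10))
    print(gc_rich_windows("ATAT" + "GCGCGCGC" + "ATAT", window=8, min_gc=1.0))
```

A screen of this kind only applies when the template sequence is already known, which is the limitation noted in the text.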
We would like to thank Kim Marquette, Michelle Mader, and Erica Mazaika of Wyeth Research, Cambridge, MA for the sample preparation and distribution efforts. Without the dedication and great effort of all participating laboratories, we would not have succeeded in providing the community with the best currently available protocols.