Discovery and initial characterization of a low allelic fraction artifact
In a deep coverage exome study of 221 tumors and matched normal tissues from melanoma, we observed an unexpectedly high number of variants at allelic fractions <20% (). These variants were not consistent with the expected dominance of C > T transitions in melanoma owing to ultraviolet damage as previously reported (19
). Instead, the analysis uncovered thousands of apparent C > A/G > T variants in these samples with a specific sequence context of CCG > CAG, and most strikingly were not restricted to tumor material. They were also found at a similarly high rate when mutation calling was reversed to call ‘variants’ off the normal using the tumor as the reference, further indicating these base changes were likely not due to a disease mechanism (). Further, we noted that the artifacts had a specific strand orientation in that G > T errors always presented in the first Illumina HiSeq instrument read, whereas the C > A errors were always found in the second HiSeq read. Lastly, these variants were not supported by matching RNA sequence data obtained from either the tumor or normal DNA of affected samples (data not shown). Closer inspection of deep coverage exome data from other cancer types also uncovered these same C > A/G > T transversions at low allelic fractions in subsets of samples, including cancers with known lower mutation rates such as neuroblastoma (21
) and chronic lymphocytic leukemia (23–25
) (data not shown). Combined, these characteristics led us to believe that these variants were not biological in nature. Rather, we hypothesized that these base changes were caused by some artifact induced in the sample collection, extraction, library preparation or sequencing processes.
Figure 1. Distribution of mutation calls across a variety of base motifs and allelic fractions in Melanoma samples. Lego plots consisting of counts of all single base mutations for given bases and contexts at various allele fractions (AF) for 221 melanoma samples. (more ...)
Figure 2. Presence of CCG > CAG mutations in both tumor and normal samples. CCG > CAG mutations can also be detected at an abnormally high rate in this set of neuroblastoma samples when mutation calling was flipped to call variants off normal samples (more ...)
To better understand the full scope of the artifact’s presence in exome data, we developed a sensitive and accurate metric using the unique context, strand specificity and read orientation characteristics of the artefact that could quickly process large data sets to measure the rate of C > A/G > T artifacts. We started by looking at all high quality (>Q20) base observations at C/G sites in the genome and classifying them into the following: (i) reference basecalls; (ii) alternative basecalls (A/T) that are consistent with the artifact characteristics; (iii) alternative basecalls (A/T) that are inconsistent with the artifact characteristics; and (iv) all other bases. As in aggregate there should be no correlated bias between C > A and G > T error modes and instrument read order for true biological variation, we assume that the errors from sources other than that of the artifact should be symmetric and arrive at the following definition of ArtQ:
This yields a phred scaled error rate attributable to G > T/A > C transversions with the specific artifact characteristics we had observed in our data. An ArtQ score >30 means that less than 1 in a thousand bases are attributable to artifact error. For the purposes of our analysis, we considered a sample with an ArtQ score >30 to be ‘unaffected’ by this artifact.
Determining the origin of the artifact
We calculated the ArtQ metric for nearly all human targeted capture and whole genome projects sequenced at our institute since 2009, encompassing >50 000 libraries from hundreds of different initiatives and disease projects (). Immediately, it became clear that the prevalence of the artifact, though always there, had increased in frequency during the previous year, and that there was a large amount of project-to-project variation in the artifact prevalence (see Discussion). We began to investigate possible sources of error in the laboratory, and we were first able to rule out that the Illumina HiSeq and cluster amplification chemistry was not inducing these base changes by sequencing the same affected libraries on Illumina HiSeq V2, V3 and MiSeq chemistries, as well as on Life Technologies Ion Torrent PGM chemistry. The Illumina sequencing versions tested generated no significant differences in ArtQ scores for each library (). Although ArtQ scores could not be calculated in the same fashion for Ion data owing to the lack of pairing information, C > A base changes found in the samples at given base positions in Illumina data were also found in Ion PGM data for each library tested (data not shown). These observations strongly indicated that the base changes had occurred before the sequencing step.
ArtQ metric over time for Broad’s Targeted Capture pipeline. ArtQ by library creation date.
We then observed that whole genome samples contained little to no evidence of significant artifactual C > A/G > T transversions compared with targeted capture samples processed during the same period (Supplementary Figure S1
). This finding then led us to investigate our automated SureSelect targeted enrichment process itself. We sequenced a pool of 370 pre-exome enrichment libraries along with their 370 corresponding post-exome enrichment captured libraries. We found that for all samples, the ArtQ scores for both were highly correlated (), indicating that the base changes were already been induced upstream of targeted capture. Because we had streamlined our laboratory processes, the production protocols used to create libraries destined for either targeted capture or whole genome processing were nearly identical. The one exception was the acoustic shearing protocol used to fragment the DNA for either the exome process (a high powered 150-bp fragmentation) or whole genome (a lower powered 500-bp fragmentation) (see Methods).
Figure 4. ArtQ for Pre- versus Post-targeted capture. For a set of 370 samples, both the pre- and post-exome enrichment libraries were sequenced. ArtQ was well correlated (R2 = 0.9957), indicating that the artifactual base changes had already been introduced before (more ...)
To determine whether the 150-bp shearing protocol could be introducing the C > A/G > T artifacts, we took DNA samples that when sequenced previously had low ArtQ scores and made new 150 bp and 500 bp libraries. All other steps in the library preparation protocol were kept the same, and the libraries were sequenced immediately after PCR without any further processing or size selection steps. The sequencing results show a clear and significant increase in the prevalence of the artifact, as measured by ArtQ, when the same samples were sheared using the 150-bp protocol as compared with the 500-bp protocol (). These data provided the first proof that a major contributing source of the artifact was the higher powered 150-bp acoustic shearing protocol used to create these targeted capture libraries.
Figure 5. Comparison of 150-bp versus 500-bp shearing conditions. Average ArtQ scores post-sequencing for the same set of six samples sheared with using both 150-bp and 500-bp Covaris protocols. The 150-bp shear protocol had significantly lower ArtQ values (P < (more ...)
Interaction of shearing protocol with incoming genomic sample quality
Although the 150-bp shearing step was demonstrated to induce these artifacts, we saw that less than a half of all exome samples receiving that 150-bp shearing protocol were significantly affected (), and for samples that traveled together on 96-well plates throughout the entire process, often only a subset of samples would be affected. Moreover, as described in the previous section, we consistently observed that new sequencing libraries made from the same source DNA as a previously affected library were also highly affected at similar ArtQ scores. Taken together, these observations strongly suggested that the 150-bp shearing protocol alone was not sufficient to cause the artifact. Rather, some inherent property of the incoming sample made it more or less susceptible to damage during the 150-bp shearing process.
Within and between each disease project, the rate of artifact prevalence varied widely without any immediately obvious patterns or correlations (). However, for a given disease project, we often receive pre-extracted DNA samples from multiple collaborators or institutions, which we term as separate ‘collection sites’ within a project. These separate laboratories do not necessarily use the same DNA extraction or handling methods. By further separating disease projects into their constituent collections, we observed that the artifact’s prevalence was clustering by DNA collection site (). This finding added credence to our hypothesis that varied upstream extraction methods may be causing a subset of incoming samples to be more susceptible to artifact generation during shearing.
Figure 6. ArtQ by project and collection site within a project. ArtQ scores vary significantly between disease projects, despite consistent protocols and automation used during the targeted capture preparation process. For a subset of six projects shown here receiving (more ...)
Confirmation of DNA oxidation mediated mutation during DNA shearing
We next set out to determine the molecular basis for these artifacts. The fact that these were C > A/G > T transversions and that they occur most frequently in the context of CCG → CAG led us to hypothesis that the cause was oxidation of DNA, specifically the conversion of guanine to 8-oxoG. 8-oxoG is a common lesion in DNA generated via oxidation, and it is known to pair with both cytosine and adenosine during PCR leading to C > A/G > T transversions (26
). Oxidation of guanine to 8-oxogG has also been demonstrated to exhibit a distinct sequence context preference (28–30
) in which the likelihood of a base being targeted for oxidation is highly dependent on both the 5′ and 3′ bases surrounding it, with the CCG/GGC we observed here being the context with the highest oxidation potential in demonstrated in these previous studies. Oxidation of DNA can come from a variety of commonly encountered sources, including DNA extraction methods, long-term storage of DNA in aqueous buffers, heat, exposure to trace metals and sonication (31–37
). During the 150-bp shearing protocol, we observed that the contents of the shearing tube increased in temperature from 10°C to ~30°C, despite being submerged in a 10°C water bath. Conversely, the temperature did not increase during 500-bp shear protocol. The presence of both powerful acoustic sonication energy and heat accumulation provided further indications that the 150-bp shearing protocol could be oxidizing DNA.
To confirm the presence of 8-oxog in affected samples, we performed an 8-oxoG specific ELISA assay (Enzo Biosciences) on remaining DNA from six melanoma that were highly affected in previous sequencing (ArtQ < 20) and six samples that were relatively unaffected (ArtQ > 30). Each sample was split in three equal aliquots and sheared with the 150-bp protocol, 500-bp protocol or left unsheared and then assayed via ELISA (a). The results clearly show significantly elevated levels of 8-oxoG (P < 0.05) were present only in the previously highly affected melanoma samples when sheared with the 150-bp shearing protocol. The levels of 8-oxoG generated following the lower powered 500-bp shear were not significant even in known susceptible samples, which was consistent with previous observations of lower artifact prevalence in 500-bp whole genome samples. The difference in 8-oxoG levels between affected and unaffected samples further confirmed that shearing alone is not enough to induce oxidation, but that there was some contaminant in some samples that leads to increased oxidation activity during shearing. To confirm this, we compared 8-oxoG levels on samples sheared with or without buffer exchange using Ampure XP SPRI beads (b). The results show a significant decrease (P < 0.05) in the presence of 8-oxog in the samples that underwent buffer exchange, further confirming that there were contaminants in the DNA buffers that were contributing to oxidation.
Figure 7. 8-oxoG ELISA results. (a) Bar chart of mean ng/ml of 8-oxoG for affected and unaffected samples processed with different shearing conditions. The ‘Highly Affected’ samples with high artifact rate sheared to 150 bp had a significantly higher (more ...)
These ELISA results provided confirmation for the shearing-induced oxidation hypothesis and demonstrated that affected DNA samples contain some contaminants in their source buffers that when exposed to the strong acoustic energy and/or heat generated during the 150-bp shear create a highly oxidative environment. These 8-oxoG bases then persist throughout the NGS protocol to the PCR enrichment step, where the C > A and G > T base changes occur owing to Hoogsteen base pairing of 8-oxoG and A (27
). In addition, this mechanism fits with the sequencer read specificity we observed in our data: the G > T base change is always observed on the read 1 strand, and the C > A is always on the opposite read 2 strand. If the artifact is being induced in shearing, this is before ligation of the forked Illumina adapters. Because of the nature of these palindromic adapters and subsequent PCR reaction mechanics, the original DNA strands containing the 8-oxoG lesion containing the G > T error will always end up on the read 1 side of the final library fragment, and the strand generated during PCR enrichment (containing the C > A error) will be read during read 2 (Supplementary Figure S2
Development of methods to reduce DNA oxidation during shearing
We next explored the addition of antioxidants to samples before shearing as an attempt to both rescue susceptible samples and also gain insight into the reaction mechanism. Additives tested included two metal chelators, 1 mM EDTA and 100 uM DFAM, and a phenolic antioxidant and free radical scavenger, 100 uM BHT (38
). Susceptible melanoma sample material (previous ArtQ = 22) was diluted in triplicate in our standard 10 mM Tris–HCl pH 8 buffer without or with these additives, alone and in combination, before 150-bp shearing (see Materials and Methods). Following standard library construction and sequencing, the ArtQ metrics were used to determine the effectiveness of each additive or combination of additives at preventing the oxidation artifact (, bottom pane). The results of this experiment demonstrated a significant reduction in oxidation artifacts as measured by ArtQ for samples when either of the chelators, EDTA or DFAM was included in the shearing buffer. The hydroxyl radical scavenger BHT appeared to have little protective effect. The success of the chelators strongly suggests that some form of metal ions present in samples after DNA extraction may be involved in the oxidation mechanism during shearing. However, as DFAM significantly reduced yields from our library construction process (, top), only EDTA showed a clear benefit without risking, reducing the robustness of our library preparation process.
Figure 8. Antioxidant additives in shearing. Bottom pane: Mean ArtQ score for oxidation susceptible melanoma samples (n = 3 sample per condition) sheared with antioxidant conditions compared with Tris–HCl alone, showing significant improvement in ArtQ scores (more ...)
These results in combination with the results from the previously described buffer exchange experiments have allowed us to develop a protocol to protect DNA from oxidation during shearing. All incoming DNA samples are buffer exchanged into Tris-EDTA (TE) buffer (10 mM Tris, 1 mM EDTA) as a preventative measure to remove any possible contaminants present in the source buffer and to add EDTA to protect against shearing induced oxidation. Although metal-catalysed oxidation appears to be the primary mechanism of artifact induction for this particular subset of samples tested, it is highly likely that other oxidation mechanisms could exist. We are continuing to evaluate more samples from a wider variety of sources and extraction methods, as addition of metal chelation with EDTA may not be a cure all for shearing induced oxidation in all samples.
Development of a filter-based method for removing oxidation artifacts from sequencing data during mutation calling
The presence of these artifactual C > A/G > T transversions in sequencing data could lead to obvious issues in somatic mutation calling. Reprocessing the samples with low ArtQ scores through library preparation using the previously detailed laboratory process improvements is preferred, but for affected data sets that cannot be reprocessed in the laboratory, we have developed a post-processing filtering method that can be used to screen out oxidation-induced artifacts in sequencing data with high confidence to improve the fidelity of mutation calling at low allelic fractions. As the artifact was originally found at low allelic fractions, we have discovered that a universal threshold cannot be applied to throw out possible artifacts, as bona fide somatic mutations can be found at similar or even lower allele fractions. In particular, in high mutation rate tumors such as lung adenocarcinomas C > A artifacts are likely to co-mingle with previously characterized smoking-induced C > A mutations (Supplementary Figure S3
Mutation calls from the program used here, MuTect (Cibulskis et al.
in preparation, http://confluence.broadinstitute.org/display/CGATools/Home
), depend on well-calibrated base qualities to distinguish real mutations from sequencing errors, but in the case of the CCG > CAG artifacts reported here, the base qualities do not reflect the probability that a non-reference base call is the result of a true mutation. As MuTect already applies post-processing filters to remove various types of previously known sequencing artifacts from the final list of somatic point mutations, we designed a new filter to specifically remove CCG > CAG oxidation artifacts based on their unique properties that are inconsistent with real mutations including the base context, sequencing read orientation and characteristic low allelic fraction prevalence. Artifact base changes are seen as G > T only in read 1 and C > A only in read 2 as previously described, which we represent in the filter as the fraction of alternate allele supporting reads comprising G > T changes on read 1 and C > A changes on read (‘FoxoG’). Artifactual base changes should have high FoxoG ratios as compared with real somatic C > A mutations that are not read orientation specific. To determine the FoxoG value for the non- C > A or G > T mutations, we adopted a convention that ‘A’ and ‘G’ alternate alleles on read 2 and ‘C’ and ‘T’ alternate alleles on read 1 are counted in the numerator of the FoxoG value. To capture the characteristic low allele fraction of 8-oxoG artifact for purposes of artifact filtering, we quantified this property in terms of the MuTect ‘tumor_lod’ score, which is the estimated log odds that the observed number of alternate allele reads from the tumor sample could have arisen from a reference allele. The filter therefore takes into account both the FoxoG and tumor_lod measures to determine whether C > A base changes are most likely artifactual in nature and should be excluded.
To train the filter, we used a set of 31 samples from seven different tumor types prepared and sequenced at different periods during 2011 and early 2012 (Supplementary Table S1
). Here, we defined ‘full’ context mutations as C > A base changes in the context of both a proceeding C and a trailing G base (CCG > CAG). ‘Partial’ context mutations are defined as having either a preceding C or a trailing G, but not both (NCG > NAG or CCN > CAN), and ‘no’ context C > A mutations lack both the preceding C and trailing G bases. All non- C > A or G > T mutations served as a null model for non-oxidation-induced base changes. Two-dimensional histograms of FoxoG versus the tumor_lod for C > A and G > T mutations at each of the three levels of sequence context () showed an excess enrichment for high FoxoG and low tumor_lod mutations (lower right corner area of plots) as compared with the non- C > A or G > T mutation distribution. As expected, the proportion of artifact prevalence increased with sequence context specificity from ‘No’ to ‘Partial’ to ‘Full’ as the oxidation potential of the G base in question increased. Based on the distribution of context-specific low allelic fraction and high FoxoG C > A/G > T mutations seen in the training set, we determined the filter threshold to be applied (represented by the dashed line in the plots) to be the following: tumor_lod > −10 + (100/3) FoxoG. Once applied, the filter was able to successfully eliminate the excess number of CCG > CAG mutations that demonstrated artifactual characteristics (b). Using the null model from non- C > A or G > T mutations, we estimated that the fraction of true biological C > A mutations removed by the filter was only 1.4% (1.2–1.6%, 95% confidence interval), whereas the fraction of passed mutations that could be attributed to oxidation artifacts was reduced to 0.1% (0–1.6%, 95% confidence interval) in this test set. Application of this filter to MuTect mutation calls therefore removed nearly all artifactual C > A mutations with high specificity, high confidence and a low chance of false positive mutation calls attributed to oxidative base changes in shearing.
Figure 9. Training of CCG > CAG artifact mutation call filter. (a) Two dimensional histograms showing the filter criteria as distributions of FoxoG (the fraction of alternate allele reads in the oxoG artifact configuration, horizontal axis) and Tumor_lod (more ...)