A UDG/endoVIII repair scheme
In library preparation protocols for high-throughput sequencing of ancient DNA, PNK and
T4 DNA polymerase are used to generate ends amenable to DNA ligation (
17). This is achieved by phosphorylation of non-phosphorylated 5′-ends by PNK, and removal of 3′-overhangs and fill-in of 5′-overhangs by
T4 polymerase, producing blunt ended, 5′-phosphorylated molecules. A large proportion of nucleotide misincorporations generated from ancient DNA libraries are caused by uracils present in short 5′-overhangs in ancient DNA fragments, which are filled in during this end-repair reaction (
6). We reasoned that if DNA is incubated with UDG and endoVIII prior to
T4 DNA polymerase treatment, this would result in the removal of uracil residues from DNA by UDG, and cleavage on the 5′- and 3′-sides of the resulting abasic sites by endoVIII. PNK and
T4 polymerase would then remove 3′-phosphate groups and generate blunt ends (), thus avoiding the loss of these molecules from the library.
Repair of oligonucleotides
In order to test the ability of UDG and endoVIII to repair ancient DNA, we designed four double-stranded oligonucleotides that carry short overhanging ends (). Whereas oligos A and B contain exclusively the four natural DNA bases, C and D carry uracil bases in their overhanging ends as predicted to frequently occur in ancient DNA (
6). Oligos A and C carry 5′-overhangs whereas oligos B and D carry 3′-overhangs. Oligo C carries a uracil base in the second position of the 5′-overhangs and is therefore the type of fragment that the UDG/endoVIII protocol is designed to repair. Oligo D carries a uracil base in the second position of the 3′-overhangs. In 3′-overhangs, deaminated cytosine should not lead to nucleotide misincorporations in ancient DNA sequences because
T4 polymerase removes 3′-overhangs before adaptor ligation. However, we wanted to investigate if
T4 polymerase activity may be blocked by the abasic sites left after UDG treatment or the 3′-phosphates left by endoVIII treatment (
26) ().
The three oligonucleotides were treated in four different ways: (i) With PNK to phosphorylate 5′-ends, followed by T4 polymerase to generate blunt ends; (ii) as 1 with the addition of UDG together with PNK; (iii) as 2 with the addition of endoVIII with the UDG and PNK and (iv) as 3 except that PNK was not used. Subsequently, 454 sequencing adaptors were ligated to the oligos. The ligation reactions were visualized on ethidium-stained agarose gels, and analyzed by quantitative (q) PCR with primers specific for the 454 adaptors, either with or without prior treatment with UDG to gauge the amount of uracil residues in the ligation product ().
For oligos A and B, which contain no uracils, gel and qPCR results show that adaptor ligation was similarly efficient for treatments 1–3 (). This indicates that UDG and endoVIII do not interfere with the generation of blunt ends in non-damaged DNA. The observed level of between-treatment variation is expected given that two spin column purification steps were performed during the procedure.
Oligo C, which contains uracil residues in 5′-overhanging ends, showed high ligation efficiency following treatment 1. However, the ligated product showed greatly reduced copy number in qPCR if treated with UDG prior to amplification. This is expected given that the ligated product should contain one uracil on each strand. Ligation efficiency was very low after treatment 2, consistent with UDG creating abasic sites and preventing T4 polymerase fill-in of ends. After treatment 3, ligation efficiency was high, and in this case the product was insensitive to UDG treatment prior to qPCR. This indicates that the uracils had been removed prior to ligation, as predicted by the scheme in .
Oligo D, which contains uracil residues in 3′-overhanging ends, showed a pattern similar to oligos A, B and C in that ligation efficiency was high after treatments 1–3. This indicates that uracil-containing 3′-overhangs can be successfully removed by T4 polymerase after UDG and endoVIII treatment. The products were largely insensitive to UDG treatment prior to qPCR, as expected since uracils in 3′-overhangs should be removed before ligation. The apparent ~40% sensitivity to UDG after treatment 1 is surprising, but may be due to uracil misincorporation during manufacture of this oligo. Since oligos are synthesized in the 3′- to 5′-direction, uracil was added in the first coupling step and carry over to later synthesis steps would lead to UDG sensitivity of the ligated product.
When PNK is omitted (treatment 4), only oligo C shows a full-length ligation product. This demonstrates that the UDG/endoVIII repair reaction generates ligatable phosphorylated 5′-ends after removal of the uracils and cleavage of phosphodiester bonds by endoVIII ().
To confirm that UDG/endoVIII repair of oligo C removes uracil-derived errors from DNA sequences as predicted (), we sequenced oligo C products of treatments 1 and 3. As expected, after treatment 1, the oligo C product is full length and carries nucleotide misincorporations where uracil residues have been replaced by thymine residues. In contrast, after treatment 3, the oligo C product has been shortened by two bases at both ends, removing the uracil positions from the sequence ().
Repair of mammoth DNA
We next tested the UDG/endoVIII protocol on DNA extracted from a ~43 000-year-old woolly mammoth bone. The extract was subjected to various repair reactions in duplicate, followed by 454 adaptor ligation, qPCR quantification and determination of between 783 and 20 935 sequences per sample on the 454 platform. The results were analyzed with respect to numbers of molecules in the 454 libraries generated as well as the percent of sequences in the libraries that show similarity to the elephant genome and are thus assumed to be of mammoth origin. In most ancient remains, only a small proportion of sequences retrieved are from the organism under study. For this extract, it is ~3%, the rest presumably stemming from microorganisms that have colonized the sample after the death of the individual. If the endogenous mammoth DNA carried more DNA modifications than the environmental DNA, the relative amount of mammoth DNA might be changed by the repair reactions. Other relevant aspects such as the length of the mammoth sequences and the frequency and type of nucleotide misincorporations were also analyzed.
One issue that potentially complicates the application of the UDG/endoVIII protocol to ancient DNA extracts is that DNA modifications other than deaminated cytosines may be present (
27,
28). Therefore, we first tested if mammoth DNA retrieval was affected by the longer incubation (3 h) at a higher temperature (37°C) that differs from the standard library preparation (30 min at 25°C) and thus could conceivably cause DNA degradation by affecting some unknown modifications. Neither in terms of numbers of molecules in the resultant 454 library, the percent of mammoth sequences, the length of mammoth sequences or patterns of misincorporations did the incubation conditions strongly influence the results (
Supplementary Figure S3).
We next tested the three treatments previously performed on the oligonucleotides, i.e. (i) PNK + T4 polymerase; (ii) PNK + T4 polymerase + UDG; (iii) PNK + T4 polymerase + UDG + endoVIII. In a second series of experiments, the mammoth DNA extract was dephosphorylated with CIP, purified, and then subjected to the same three treatments except that PNK was never used. Dephosphorylation should prevent the DNA from successful 454 adaptor ligation except in cases where UDG and endoVIII generates phosphorylated 5′-ends where uracil bases exist close to the 5′-ends (c.f. oligo B in ). Thus, these experiments tested if endoVIII is active on the mammoth DNA.
Total library yield, which includes mammoth and microbial sequences, is reduced by ~30% upon UDG treatment (a), consistent with uracil removal from damaged DNA fragments and blockage of T4 polymerase fill-in or Taq polymerase amplification by the resulting abasic sites. When endoVIII is added, library copy numbers increase by ~40% relative to the standard condition. This is unexpected since endoVIII is expected to rescue molecules carrying abasic sites generated by UDG but not to increase copy numbers relative to the standard treatment. We repeated this experiment with more replicates per condition, and again observed a reduction in yield upon UDG treatment but an increase in yield after UDG/endoVIII treatment only to approximately the no repair condition (data not shown). We conclude that UDG reduces copy number of a mammoth DNA library relative to the no repair condition, while UDG together with endoVIII does not, suggesting the repair of uracil-containing DNA fragments. None of the treatments differed in the percent of mammoth DNA sequence, suggesting that uracil or any other unknown modifications that might be affected by the treatments did not differ substantially in frequency between the mammoth DNA and the environmental DNA.
When the DNA was dephosphorylated prior to the experiments, a small amount of amplifiable material was present after treatments 1 and 2, probably representing both incomplete dephosphorylation of naturally phosphorylated 5′-ends by CIP and 454 adaptor artifacts. This is supported by the observation that only a very small proportion of the sequences determined after this treatment are of mammoth origin (b). In contrast, treatment 3 yielded ~3.5 times more amplifiable library molecules than treatments 1 and 2, consistent with direct selection of DNA fragments that had uracil near the 5′-ends of both strands. Notably, after treatment 3, the proportion of mammoth sequences was 10–12%, four times higher than in the libraries made from non-CIP treated mammoth DNA in the presence of PNK. This suggests that the mammoth DNA carried more uracil residues on average than the environmental DNA, although this is probably a modest difference because selecting for damage at both fragment ends will multiply the effect of any difference in damage between sequence populations. Thus, the observed ~4-fold enrichment of mammoth DNA suggests a maximum 2-fold difference between the fraction of mammoth DNA strands and environmental DNA strands that carry uracil residues close to their 5′-ends. This is a small enough effect to be compatible with the fact that no obvious differences in mammoth DNA percentage were observed among the non-CIP-treated libraries where PNK was used.
The 5′-start points of the mammoth sequences represent the terminal bases of the ancient DNA fragments at the point of adaptor ligation (
6). We plotted the base composition of the elephant reference sequence aligned to the mammoth sequences for 10 bases on either side of the mammoth 5′-ends. UDG/endoVIII treatment should lead to an increase in cytosine frequency at the base position immediately preceding 5′-ends (i.e. the –1 position in c), as DNA fragments containing deaminated cytosine will be truncated to the 3′-side of such positions (). Consistent with this prediction, we observed in the non-CIP-treated libraries an increased frequency of cytosine at position –1 in the UDG/endoVIII condition relative to the no-repair and UDG-treated conditions. This effect was much stronger in the CIP-treated, UDG/endoVIII repaired sample, where 84% of 5′-ends are preceded by cytosines. This is predicted since this library should predominantly contain fragments where ligatable ends were generated by excision and repair at positions of deaminated cytosine.
The average length of mammoth sequences is around 70–80 nt (
Supplementary Figure S4), as is typical of ancient DNA (
29). After UDG treatment, fragments are ~10 nt shorter. This is expected since longer fragments are more likely to contain uracil and thus to be left with an abasic site after UDG treatment. When endoVIII is added, length seems to be intermediate, perhaps representing rescue of some longer fragments by endoVIII treatment. If the DNA is dephosphorylated before UDG and endoVIII treatment, length is ~10 nt shorter than after UDG and endoVIII treatment without CIP-treatment, as expected if all of the ligated fragments in this condition have been truncated at both ends due to UDG/endoVIII repair, as opposed to the non-CIP-treated libraries where only a fraction of the fragments will have been truncated.
We investigated the effect of the repair protocol on nucleotide misincorporations, by plotting for each library the frequency of each of the 12 possible mismatches observed in the mammoth-to-elephant alignments. Ancient DNA sequences generally show an excess of C→T and G→A substitutions due to uracil-derived nucleotide misincorporations (
4,
6,
7). As expected, this pattern is seen when the sample is not subjected to UDG treatment (
Supplementary Figure S4). By taking the excess of C/G→T/A substitutions relative to T/A→C/G substitutions as an estimate of the amount of uracil-derived misincorporations in each sample, we found that uracil-derived misincorporations were reduced 4- to 10-fold in samples where UDG was used (
Supplementary Figure S4).
In summary, UDG–endoVIII treatment allows more mammoth DNA sequences to be retrieved than UDG treatment alone, while generating much lower rates of nucleotide misincorporations than if UDG is not used.
Repair of Neandertal DNA
Mammoth and African elephant DNA diverged from each other more than ~7 million years ago (
30,
31). Therefore genuine sequence differences will dominate in pairwise alignments, reducing the ability to study patterns of nucleotide misincorporations. In contrast, Neandertals and humans diverged much more recently. Furthermore, in contrast to the elephant genome the human genome is of finished quality allowing better resolution of nucleotide misincorporations. We therefore applied the UDG/endoVIII repair protocol to DNA extracted from a 38 000-year-old Neandertal bone from Croatia and performed deep sequencing of the libraries. An additional advantage of this specimen is that its complete mitochondrial (mt) sequence is known (
23), allowing detailed analysis of sequence errors in mtDNA. This may be particularly interesting as mtDNA does not carry 5-methylcytosine (5′-m-C), a naturally occurring modification in vertebrate nuclear DNA which when deaminated is converted to thymine (
32). Since UDG will not remove thymine from DNA, deamination-derived misincorporations may still occur at sites of cytosine methylation after UDG treatment of the ancient DNA.
Between 10 and 11 million raw DNA sequence reads were generated on the Illumina
GAII platform from four Neandertal DNA libraries prepared in the standard condition or with the UDG/endoVIII protocol (each in two replicates). In order to improve sequencing accuracy, which decreases substantially toward the 3′-end of Illumina reads (
25), we sequenced into both ends of each molecules and merged the paired reads, requiring overlaps of at least 11 bases, and discarded all sequences that could not be merged to their paired read. Since the single read length was 48 bases, this creates a maximum read length of 85 bp. However, previous work has shown that the majority of Neandertal fragments in this bone are shorter than this (
33).
Sequences were aligned to the previously published complete mtDNA sequence of this Neandertal (
23), and to the human nuclear genome (hg18) using a custom mapping program (ANFO) (
22). In each library, 0.007–0.009% of sequences aligned well to the mtDNA and 2.0–2.2% to the nuclear genome. Variation in these proportions was as high within as between treatments. Data from replicates were then pooled for each treatment and patterns of nucleotide mismatches between the sequences and the references were analyzed.
When analyzing Neandertal DNA sequences, contamination of experiments with contemporary human DNA is a potential problem (
10,
34). However, the level of such contamination in a Neandertal DNA library can be assessed by counting the ratio of Neandertal versus contaminant fragments at nucleotide positions where Neandertals differ from all or almost all present-day humans (
33). The mtDNA of this Neandertal carries 133 such diagnostic positions (
23). The ‘no repair’ dataset yielded 139 mtDNA fragments that overlapped such positions; 138 carried the Neandertal base while one matched modern human mtDNA. The UDG/endoVIII treated dataset yielded 128 informative fragments, of which all were the Neandertal type. Thus, the mtDNA in all libraries was almost completely free of contamination by modern human mtDNA, even after treatment with UDG and endoVIII. Since the ratio of mitochondrial to nuclear DNA may differ between the contaminating and the Neandertal DNA, this estimate is strictly applicable only to the mtDNA (
33). However, the estimate of mtDNA contamination in these libraries is low enough that within even a few-fold variation in mtDNA:nuclear DNA ratios between the Neandertal and contaminating DNA, sequences aligning to the human nuclear genome will be predominantly of Neandertal origin.
To analyze patterns of nucleotide misincorporations, we plotted the frequency of each of the 12 possible nucleotide differences in alignments as a function of distance from the ends of DNA fragments for each dataset (, top panels). In the ‘no repair’ results, the transitions C→T and G→A are drastically elevated toward the 5′- and 3′-ends ends of fragments, respectively, with up to 40–50% error frequencies at the fragment ends. This occurs in both mtDNA and nuclear DNA fragments, as shown previously (
6,
23). The G→A substitutions at 3′-ends of fragments originate during the fill-in of 5′-overhangs containing deaminated cytosines (
6,
7). After UDG/endoVIII treatment C/G→T/A differences were drastically reduced, consistent with the removal of uracils by UDG. In mtDNA this removal seems to be almost complete, whereas a small amount remains in nuclear DNA (see below).
Neandertal CpG methylation
In vertebrates, DNA methylation occurs at the 5′-position of cytosine residues at ~80% of CpG dinucleotides in somatic tissues (
35). When 5′-m-C is deaminated it becomes a thymine residue that is not recognized as an unnatural base by UDG in vertebrate cells. As a consequence, CpG dinucleotides are preferentially lost from vertebrate genomes with the result that the frequency with which CpG dinucleotides occur genome wide is ~1%, while 4–5% would be expected from the base composition of the genome (
36). In some regions of the genome, CpG content is less reduced and can be 60% or more of the statistically expected frequency. Such ‘CpG islands’ represent areas where methylation is reduced or absent. About half of CpG islands overlap transcriptional start sites where their methylation in a tissue is positively correlated with suppression of transcription.
We noted that after UDG/endoVIII treatment ~5% of C/G→T/A substitutions remained at the first and last positions of Neandertal DNA fragments in nuclear DNA whereas this did not seem to be the case in the mtDNA (). Since cytosine methylation does not occur in mtDNA we speculated that this might be due to cytosine methylation at CpG dinucleotides. We therefore analyzed the rates of C→T mismatches seen along the DNA molecules separately for the four dinucleotides that carry cytosine at their first position (, lower panels). Without UDG/endoVIII treatment, C→T differences to the human nuclear genome are common in all four dinucleotide contexts, but slightly more common in CpG dinucleotides (e.g. 52.4% at position 1 of DNA fragments) than in the other dinucleotides (38%–40% at position 1). This is likely due to the higher deamination susceptibility of 5′-m-C relative to unmethylated cytosine (
32). With UDG/endoVIII treatment, C→T substitutions in the three non-CpG dinucleotides are greatly reduced (to 2–5% at position 1 of DNA fragments), whereas they remain abundant in CpG dinucleotides (40.6% at position 1 of DNA fragments). Thus, at CpG sites, ~80% of C→T substitutions are resistant to UDG/endoVIII treatment, consistent with findings that ~80% of CpG sites are methylated (
35). At CpG sites in mtDNA, no such resistance of substitutions to UDG/endoVIII treatment is seen. This is consistent with the absence of CpG methylation in mitochondria.
In order to further investigate if the UDG/endoVIII protocol can be used to detect CpG methylation present in the Neandertal individual when alive, we analyzed CpG→TpG mismatch rates in CpG islands, where methylation is reduced relative to the genome-wide average (
37) (). As described, across the genome 52.4% of CpGs at 5′-ends of fragments are read as TpG without UDG/endoVIII treatment, and this is reduced to 40.6% upon UDG/endoVIII treatment. At CpG islands, 51.7% of 5′-terminal CpGs are read as TpG without UDG/endoVIII treatment and this is reduced to 23.4% with UDG/endoVIII treatment. Thus, deaminated CpG sites in CpG islands are more susceptible to UDG/endoVIII treatment than deaminated CpG sites elsewhere in the genome. This is consistent with a reduced frequency of cytosine methylation in CpG islands. Together with the observation that no resistance to UDG treatment is observed at deaminated CpG sites in mtDNA, we conclude that
in vivo Neandertal DNA methylation patterns have survived over 38 000 years in the bone, and can be detected by the UDG/endoVIII protocol.
Neandertal DNA sequence accuracy
Although UDG/endoVIII treatment of ancient DNA clearly reduces nucleotide misincorporations, a remaining limitation to obtaining maximal sequence accuracy from ancient DNA molecules is the high rate of machine sequencing errors of current high-throughput platforms. The ‘merged-paired-end’ sequencing approach used here helps to decrease machine errors by requiring that the 3′-ends of fragments, where errors are most common (
25), are sequenced in two directions. However, near-elimination of machine sequencing errors should be possible if libraries are first amplified (
24,
29), as they were here, and sequenced deeply enough so that multiple copies of each original molecule are sequenced, allowing a consensus to be generated from such replicate sequences, which can be identified by their strand orientation, start and end points (
6,
13).
To investigate the accuracy, that can be achieved when combining a deep sequencing approach with UDG/endoVIII repair, we determined the overall rates of each Neandertal nucleotide misassignment in (i) all sequences from the ‘no-repair’ dataset; (ii) all sequences from the UDG/endoVIII dataset; and (iii) all ‘multi-pass’ sequences from the UDG/endoVIII dataset, which we generated by taking consensus sequences of all distinct fragments that were sequenced more than once. shows that for mitochondrial sequences, overall error rate per base () was 2.20% for ‘no repair’ sequences; 0.40% for UDG/endoVIII sequences and 0.09% for multi-pass UDG/endoVIII-treated sequences. Thus, UDG/endoVIII treatment alone results in a 5.5-fold reduction in error rates while deep sequencing results in an additional 4.4-fold reduction. In combination this results in a 22-fold reduction in errors. In nuclear DNA, for which we removed CpG sites from the analysis due to the effect of methylation described above, a similar pattern is observed () although the true error rate at these low levels cannot be accurately calculated due to genuine sequence differences between this Neandertal and the human reference.