|Home | About | Journals | Submit | Contact Us | Français|
A complete mitochondrial (mt) genome sequence was reconstructed from a 38,000-year-old Neandertal individual using 8,341 mtDNA sequences identified among 4.8 Gb of DNA generated from ~0.3 grams of bone. Analysis of the assembled sequence unequivocally establishes that the Neandertal mtDNA falls outside the variation of extant human mtDNAs and allows an estimate of the divergence date between the two mtDNA lineages of 660,000±140,000 years. Of the 13 proteins encoded in the mtDNA, subunit 2 of cytochrome c oxidase of the mitochondrial electron transport chain has experienced the largest number of amino acid substitutions in human ancestors since the separation from Neandertals. There is evidence that purifying selection in the Neandertal mtDNA was reduced compared to other primate lineages suggesting that the effective population size of Neandertals was small.
Although it is well established that Neandertals are the hominid form most closely related to present-day humans, their exact relationship with modern humans remains a topic of debate (Hublin and Pääbo, 2006; Soficaru et al., 2006; Harvati et al., 2007). Molecular genetic data first spoke to this issue in 1997 when a 379-base-pair section of the hypervariable I region (HVRI) of the mitochondrial genome (mtDNA) was determined from the Neandertal type specimen found in 1856 in Neander Valley, near Düsseldorf, Germany (Krings et al., 1997). Since then, a total of 15 complete or partial Neandertal HVRI sequences as well as two HVRII sequences (Krings et al., 1999; Krings et al., 2000) have been described. Phylogenetic analyses of these suggest that Neandertal mtDNAs fall outside the variation of modern human mtDNAs. Since the mtDNA genome is maternally inherited without recombination, these results indicate that Neandertals made no lasting contribution to the modern human mtDNA gene pool (Krings et al., 1997; Currat and Excoffier, 2004; Serre et al., 2004).
High-throughput 454 sequencing techniques have recently been applied to ancient DNA (Green et al., 2006; Poinar et al., 2006; Stiller et al., 2006). These methods open new possibilities for the retrieval of ancient DNA that has hitherto relied either on the cloning of random molecules in bacteria (Higuchi et al., 1984; Pääbo, 1985; Noonan et al., 2005; Noonan et al., 2006) or on the PCR amplification of individual DNA sequences of interest (Pääbo and Wilson, 1988; Pääbo et al., 2004). The main benefit of the 454 sequencing technique is the sheer volume of sequence data that make it practical to undertake genome-scale ancient DNA sequencing projects. This is particularly feasible for mitochondrial genomes (Gilbert et al., 2007), given their smaller size relative to the nuclear genome and their abundance in cells, where typically several hundred mtDNAs per nuclear genome exist.
454 sequence data from ancient DNA have also allowed an increased understanding of DNA diagenesis, i.e. how DNA is modified during deposition in a burial context. In particular, they have allowed a quantitative model of how DNA degradation and chemical modification occurs and how the effects of these processes interact with the molecular manipulations used to generate sequencing libraries (Briggs et al., 2007). Notably, although it was previously known that a high rate of cytosine deamination occurs in ancient DNA (Hofreiter et al., 2001) it has become clear that this is particularly prevalent in the ends of the ancient molecules presumably because these are often single-stranded (Briggs et al., 2007). Deamination of cytosine residues results in uracil residues that are read as thymine by DNA polymerases, leading to a high rate of C to T transitions. A high rate of G to A transitions observed near the 3’-ends of sequence reads is thought to be caused by deaminated cytosine residues on the complementary strands used as templates during the fill-in reaction to create blunt ends when sequencing libraries are constructed (Briggs et al., 2007).
By 454 sequencing, we have generated 34.9-fold coverage of the Neandertal mtDNA genome from a Neandertal bone (Vindija bone 33.16) excavated in 1980 from Vindija Cave, Croatia (Malez, 1982). It has been dated to 38,310+/−2,130 years before present (Serre et al., 2004). Previously, the mtDNA HVRI sequence of this bone has been determined (Serre et al., 2004) as well as 2,414 bp of mtDNA sequences by 454 sequencing (Green et al., 2006). Here, we present its complete mtDNA sequence as well as the insights it allows into recent human and Neandertal mtDNA evolution.
Three DNA extracts, each from 100–200 mg of a Neandertal bone (Vindija 33.16) were prepared in our clean room facility where several precautions against DNA contamination are implemented (Experimental Procedures). These include complete separation from other parts of the laboratories, direct delivery of all equipment and reagents to the facility, positive pressure generated with filtered air that excludes particles larger than 0.2 µm, and UV irradiation and bleach treatment of all surfaces. The bone surface was removed prior to extraction. However, the interior of bones are also often contaminated with modern human DNA, presumably due to past washing and other treatments of Neandertal bones. Thus, we analyzed each extract for contamination by extant human mtDNA by PCR using primers flanking positions in the HVRI that distinguish extant humans from Neandertals (Green et al., 2006) and amplify both types of mtDNA with similar efficiencies. Following amplification, we cloned the PCR product and sequenced 103–112 clones to determine the ratio of Neandertal to extant human mtDNA. The contamination rate in the three extracts ranged from 0% to 0.9% (Supp. Fig. 1).
From these extracts, we generated a total of nine 454 libraries in the clean room facility using 454-adapters with a Neandertal-specific sequence key that is unique to this project and thus unequivocally identifies each sequence determined as derived from the extract of a Neandertal bone (Briggs et al., 2007). This allows detection of any contamination that may be introduced in subsequent handling and sequencing steps outside the clean room. To maximize the library and sequence yield, we incorporate two modifications to the standard 454 protocol that reduce the need to perform titration runs of libraries (Meyer et al., 2008) and allows more molecules to be retrieved during library preparation (Maricic and Pääbo, unpublished results). From these libraries we generated a total of 39 million sequence reads by 147 runs on the GS FLX sequencing platform. Bases were called using the standard 454 signal threshold and filtering criteria with minor modifications tailored for short, ancient sequence reads (Meyer et al., 2008)
Neandertal sequences were identified within each run as described previously (Green et al., 2006) with the chief criterion being sequence similarity to a primate genome. MtDNA sequences were identified from these using further criteria (see Experimental Procedures) including similarity to the human mtDNA at least as great as to any nuclear DNA sequence. While the total fraction of sequences that are identified as Neandertal varied across libraries and library pools, the ratio of putative Neandertal nuclear DNA sequences to mtDNA sequences varied little among these libraries and averages 171. This corresponds to approximately 2,100 mtDNA molecules per cell. In total, 8,341 mtDNA sequences were identified. They are of an average length of 69.4 bp (standard deviation=26.4), with the shortest fragment identified being 30 bp (limited by the length cut-off in the analysis pipeline) and the longest fragment being 278 bp (limited by the flow cycles performed on the GS FLX instrument).
Ancient DNA sequences present a challenge for DNA sequence assembly since they are typically short and exhibit high rates of nucleotide misincorporation. A further complication is that pyrosequencing (Ronaghi et al., 1998), for example as performed on the GS FLX platform, calls long polymers of the same base with reduced accuracy. With these issues in mind, we designed an assembly procedure for ancient DNA. In short, each sequence identified as mtDNA was aligned over its entire length to the human reference mtDNA sequence (UCSC build hg18). These alignments were then merged and each alignment column was examined to determine the majority base, yielding an assembled mtDNA sequence. Homopolymer lengths at positions where the reference human carries ≥5 identical bases were determined by analysis of the raw signal distributions as described in Supp. Data.
Following this, some problematic regions remained in the assembly. These include four regions of a total length of 20 bp where no sequence coverage existed, eight other regions amounting to a total of 117 bp covered by only single reads, nine positions where no majority base existed due to low coverage, and 31 homopolymers for which the data were not sufficient to determine their length. These regions were amplified by PCR from a Vindija 33.16 bone extract in two-step multiplex PCRs (Krause et al., 2006), cloned and sequenced by Sanger technology in order to complete the assembly (Experimental Procedures).
We then reapplied our mtDNA fragment detection pipeline to all Neandertal DNA sequences determined from the bone but compared them to the assembled Neandertal mtDNA instead of the reference human mtDNA. This resulted in the detection of an additional 721 mtDNA sequences that were initially missed since they have a higher similarity to the human nuclear genome than human mtDNA and thus could only be identified using the assembled Neandertal mtDNA. Interestingly, 522 of these sequences were similar to a single nuclear mtDNA insertion on chromosome 1 (hg18 position 554,327–560,165). In total, 8,341 mtDNA fragments totaling 578,733 nt yielded a final assembly of 16,565 nt where no position has less than 9-fold sequence coverage.
In order to estimate the level of human contamination among the mtDNA sequences determined from these libraries, we realigned the individual sequences to the assembled mtDNA and used a number of differences between the putative Neandertal and human mtDNAs as diagnostic markers. We then examined each alignment to determine if it contained a position that allowed it to be classified as of either Neandertal or extant human origin.
First, we used three human-Neandertal differences in the HVR1: an A to T and a C to A transversion and an insertion of an A where all Neandertal sequences determined to date, including Vindija 33.16 (Serre et al., 2004), differ from all 1,865 contemporary human HVRI sequences available in the mtDB database (Ingman and Gyllensten, 2006). Forty three sequences overlapped these HVRI positions and all were of the Neandertal type (Fig. 1). Second, we used four transversions between the putative Neandertal mtDNA and human mtDNAs that occur outside the HVR1 and are not adjacent to homopolymers and not polymorphic among the 1,865 human mtDNA sequences. Among 192 sequences that overlapped these positions, 186 carried the base of the Neandertal assembly (Supp. Fig. 2). Of the 6 sequences that did not, five were cases where the observed base was T, the assembly base was C, and the modern human base was G or A. Thus, these are likely to be the result of deaminated cytosine residues that are common in ancient DNA (Hofreiter et al., 2001; Briggs et al., 2007; Brotherton et al., 2007). The alternative that these are contamination by a previously unknown modern human mtDNA sequence is unlikely because of the large number of modern human sequences available. One single sequence matches the human base at a diagnostic position. Interestingly, this sequence shows features commonly associated with ancient DNA (Supp. Fig. 2) in that it begins with a C to T mismatch to the assembly, ends with two G to A mismatches, and is 65 nucleotides long and thus relatively short. In spite of this, it may represent a contaminant since modern DNA sequences retrieved from ancient specimens can carry such features (Sampietro et al., 2006). In order to estimate contamination levels, we thus disregard the 5 C to T mismatches as uninformative and count the last sequence as a contaminant. This yields a total of 229 out of 230 sequences carrying Neandertal diagnostic positions and an estimate of the contamination rate of 0.4% with a 95% confidence interval of 0.01% to 2.4%
In order to maximize our power to detect contamination, we also used all positions in the Neandertal assembly that differ from extant humans and for which at least 99% of a panel of 311 worldwide modern human mtDNA sequences do not differ, ignoring positions that could be due to cytosine deamination. One hundred thirty-three positions fulfilled these criteria and allowed 1,963 fragments to be classified. Nine were of extant human origin yielding a contamination estimate of 0.5% (95% confidence interval 0.21% to 0.87%)
We conclude that so few of the mtDNA sequences determined derive from extant humans that they will not compromise the assembly which has a 35-fold average coverage. The assembled mtDNA sequence therefore represents a reliable reconstruction of the mtDNA that this Neandertal individual carried when alive.
Alignment of the 16,565 nt Neandertal mtDNA to the 16,568 nt human revised Cambridge reference mtDNA sequence (rCRS) (Andrews et al., 1999) revealed 206 differences (195 transitions and 11 transversions). In the non-coding control region, the Neandertal sequence contains a deletion of four base pairs (CACA) at rCRS position 514 and a previously known insertion of one base pair (Serre et al., 2004) following position 16,263. The 13 protein-coding genes, the 22 tRNA genes and two rRNA genes in the Neandertal mtDNA lack notable structural differences when compared to the human and chimpanzee mtDNAs (Supp. Table 2).
Figure 2A shows the distribution of pairwise sequence differences among 53 humans from around the world (Ingman et al., 2000), between these and the Neandertal, and between the modern humans and the chimpanzee mtDNAs. Among the humans, the number of differences ranges from 2 to 118 and is bimodal. The peak with a mode around 99 differences contains comparisons that involve at least one member of deep clades containing sub-Saharan African mtDNAs. The peak with a mode around 44 differences involves comparisons between individuals outside these clades. By contrast, the number of differences between human mtDNAs and the Neandertal mtDNA ranges from 201 to 234 and is unimodal. Thus, the Neandertal mtDNA falls outside the variation of extant humans even when nucleotide differences, uncorrected for multiple substitutions, are counted. Notably, when the comparisons are restricted to only the HVRI or HVRII, the distributions of pairwise differences among extant humans and between humans and the Neandertal overlap illustrating that multiple substitutions complicate the reconstruction of mtDNA relationships when only these regions are studied (Figure 2B, C).
To further explore the evolutionary relationship between Neandertal and extant human mtDNAs, we estimated a phylogenetic tree using the complete mtDNA of Neandertal, 53 human mtDNAs, the human rCRS, and chimpanzee and bonobo mtDNAs as outgroups. Neighbor-joining, maximum likelihood, parsimony, and a Bayesian approach (Supp. Data) all indicated that the 54 extant humans formed a monophyletic group to the exclusion of the Neandertal with complete support for each tree-building measure (100% bootstrap support; 1.0 posterior probability). Unsurprisingly, this topology is also found when analysis is restricted to various subsets of the sequence such as protein-coding sequences and RNA-coding sequences (Supp. Data).
In order to estimate the date of the divergence of the Neandertal and extant human mtDNAs, we compared the Neandertal mtDNA to 10 divergent human mtDNAs (Fig. 3A) and tested if we could reject a molecular clock assuming the topology of the Bayesian tree (Supp. Data). This was not the case indicating no significant heterogeneity in evolutionary rates among the sequences. To allow divergences of mtDNA sequences to be transformed to calendar years, we assumed - based on the fossil record - that humans and chimpanzees diverged between six (Galik et al., 2004) to eight million years ago (Brunet et al., 2002; Lebatard et al., 2008). This results in an estimate of the mean divergence time between Neandertal and extant human mtDNAs of 660,000 with a 95% credibility interval of 520,000 to 800,000 years ago (Fig. 3B) (see also Supp. Data) which overlaps with previous estimates based on HVRI and II sequences (Krings et al., 1997; Krings et al., 1999). Since the width of the divergence credibility interval increases almost linearly with the posterior mean of the divergence estimate (Supp. Fig. 5), further sequence data would be unlikely to decrease the width of the credibility interval (Yang and Rannala, 2006). However, if the estimated date of the divergence between humans and chimpanzees or current assumptions about how the mtDNA evolves would be incorrect, the estimates in calendar years of the divergence of the Neandertal and human mtDNAs would need to be revised.
Next we estimated the substitutions that occurred in each of the 13 mtDNA protein-coding genes on the Neandertal and the human lineages using parsimony with the chimpanzee as an outgroup (Table 1). The total number of silent substitutions assigned to the Neandertal lineage, 44, is lower than the number assigned to the human lineage, 57. An apparent shortening of the Neandertal lineage relative to the human branch is also observed when all substitutions in the mtDNA are analyzed (Supp. Fig. 3). This dearth of Neandertal lineage differences is not clearly concentrated in any particular region of the mtDNA (Supp. Fig. 4A) or class of substitutions (Supp. Fig. 4B,C) and is not statistically significant (Supp. Data). It may reflect stochastic variation in the amount of substitutions on the two lineages, perhaps in combination with the fact that the Neandertal mtDNA is ~38,000 years older than the human mtDNA sequence. However, for amino acid substitutions, we observe a similar number on the Neandertal (20) and human (18) lineages. Thus, relative to silent substitutions more amino acid replacements occurred on the Neandertal lineage.
To explicitly test for an increase in amino acid replacements on the Neandertal lineage, we evaluated the evolution of the 12 protein-coding genes (excluding ND6, see Supp. Data) across seven primates (human, Neandertal, chimpanzee, bonobo, gorilla, orangutan, and baboon) using a maximum likelihood framework. The ratio of amino acid replacements per replacement site (dN) to silent substitutions per silent site (dS) can reveal the influence of natural selection on protein evolution with low dN/dS (1) indicating purifying selection and high dN/dS (>1) indicating positive selection. Overall, dN/dS was quite low across the entire phylogeny (0.051–0.087; Supp. Table 3), consistent with strong purifying selection on mitochondrial proteins in primates (Hasegawa et al., 1998). However, dN/dS varied significantly among branches in the phylogeny (all P<0.0001; Supp. Table 3). While lineage-specific estimates of dN/dS were sensitive to both the underlying model of sequence evolution and the number of outgroups considered (see Supp. Fig. 6 and Supp. Table 3), we consistently observed the highest dN/dS on the Neandertal lineage and this value was significantly higher than background rates in all comparisons (all P < 0.01; Supp. Table 3).
To determine if any of the 13 proteins encoded in the mtDNA show an unusual pattern of evolution in humans since the divergence from the Neandertal, we contrasted the ratio of nucleotide polymorphisms in the 54 humans to fixed differences to Neandertal at synonymous and non-synonymous sites (Table 2). Under a standard neutral model the ratio of diversity to divergence should be the same for both classes of sites (McDonald and Kreitman, 1991). Whereas this is the case for 12 of the protein-coding genes, COX2 has an excess of amino acid divergence (P=0.021), consistent with the action of positive directional selection. However, when corrected for multiple tests of 13 genes, this yields only suggestive evidence for adaptive evolution of COX2 (Bonferroni α =0.0038).
COX2 has experienced four amino acid substitutions on the human mtDNA lineage after its divergence from the Neandertal lineage about 660,000 years ago. In order to see if any of these amino acids vary among humans today, we analyzed human mtDNA sequences in mtDB (Ingman and Gyllensten, 2006). At one of the four positions (rCRS pos. 7,868), one of 2,704 individuals carries the same base as the Neandertal. Because mtDNA is inherited without recombination and because the Neandertal mtDNA falls outside the variation of modern human mtDNA, this single modern human observation represents a reversion to the ancestral state seen in Neandertals and chimpanzees. Thus, these four amino acid substitutions occurred in the relatively short period after the divergence of Neandertal and extant human mtDNAs and before the most recent common ancestor of current human mtDNAs. The observation of four non-synonymous substitutions on the modern human lineage and no amino acid changes on the Neandertal lineage stands in contrast to the overall trend of more non-synonymous evolution in Neandertal protein-coding genes (Table 1) and deserves consideration.
COX2 encodes subunit 2 of the cytochrome c oxidase, or complex IV, of the inner mitochondrial membrane. Complex IV catalyzes the reduction of molecular oxygen to water using electrons from cytochrome c and protons from within the mitochondrial matrix. By acting as a proton pump (Belevich et al., 2006) it maintains a proton gradient across the mitochondrial inner membrane which drives the phosphorylation of ADP to ATP (Mitchell, 1966; Reid et al., 1966). It has previously been noted that genes encoding various components of the electron transport chain evolve quickly in primates (Doan et al., 2004; Grossman et al., 2004).
To determine what functional effects, if any, these amino acid substitutions may have, we examined the crystal structure of the bovine cytochrome c oxidase complex (Figure 4) (Muramoto et al., 2007) All four substitutions are spatially far from functionally critical parts of the protein such as the cytochrome c binding area and the bimetallic CuA center that passes electrons from cytochrome c to COX1. However, three of them interact with the nuclear encoded COX5a and COX6c subunits, which are thought to perform regulatory functions. It has been noted that COX5a evolves fast in primates (Uddin et al., 2008). However, the two COX5a sites that differ between humans and chimpanzees are not in close proximity to COX2. It may also be of relevance that two of the four amino acid residues that are specific to modern humans relative to Neandertals are variable among apes and all four are variable when Old World monkeys are considered (Figure 4).
The 8,341 mtDNA sequences comprising the Neandertal mtDNA assembly allow us to gauge a number of factors of relevance to the state of preservation of mtDNA in this late Pleistocene individual. These estimates have an advantage over previous ones (e.g. Briggs et al. 2007) that were based on inferences from alignments to human and chimpanzee genome sequences and were thus potentially affected by alignment uncertainties or uncertain inferences about evolutionary processes.
One feature of interest is that the retrieval of DNA across the mtDNA genome is more uneven than expected under a model of random sampling (Lander and Waterman, 1988) (Figure 5A). The variation in DNA retrieval is largely explained by a positive correlation (r=0.49) between GC content and the number of recovered fragments (Figure 5C). Furthermore, for shorter fragments, GC content is negatively correlated with length (Figure 5B). This may be due to denaturation of short, AT-rich fragments during library preparation.
Because the GS FLX read length of about 250 nucleotides is much longer than the average sequence length of 69 nucleotides, fragmentation points can be inferred for nearly all sequences. As previously reported (Briggs et al., 2007), purines are over-represented before, and pyrimidines after, fragmentation points (Supp. Table 4). This is consistent with that depurination in ancient DNA which would induce strand breaks. When dinucleotides made up of the bases immediately 5’ and 3’ of fragmentation points are analyzed (Supp. Table 4), the most over-represented bases are G 5’ of breaks and T 3’ of breaks, where breaks occur ~2.9 times more often than would be expected if fragmentation was random and ~2.1 times more often than predicted by the individual contributions of G and T separately. Further investigations are necessary to understand the biochemical basis for this.
When we compare the observed rates of nucleotide misincorporations to maximum likelihood estimates for the expected numbers of misincorporations based on a model we previously developed (Briggs et al., 2007), we find the model fits the overall data well (Supp. Fig. 7). However, longer fragments have higher misincorporation rates (and by inference higher deamination rates), and shorter fragments have lower misincorporation rates than predicted by the model. This could be due to a reduced power to detect and align fragments that are both short and contain nucleotide misincorporations. Alternatively, it could represent a difference between shorter and longer fragments, for example if longer fragments have, on average, longer single-stranded overhanging ends.
The complete Neandertal mtDNA genome confirms and extends previous insights into the genetic history of Neandertals. Firstly, it confirms that the Neandertal mtDNA falls outside the variation of extant human mtDNA variation. Second, it shows that the Neandertal mtDNA diverged from the extant human mtDNA lineage on the order of 660,000 years ago. Thus, the most recent common ancestor of human and Neandertal mtDNA lived more than two or three times as long ago as the most recent common ancestor of extant human mtDNAs.
A striking observation from the analysis of the 13 protein-coding genes in the mtDNA is that the ratio of non-synonymous to synonymous evolutionary rates is significantly higher on the Neandertal lineage. One plausible explanation is that Neandertals had a smaller effective population size and thus less effective purifying selection than humans. Previous studies have reported that mtDNA dN/dS ratios tend to be higher within than between species of great apes, including extant humans (Nachman et al., 1996; Templeton, 1996; Hasegawa et al., 1998; Kivisild et al., 2006). These results suggest that slightly deleterious amino acid variants segregate within populations and that differences in the intensity of purifying selection may affect mtDNA dN/dS ratios. Previous estimates based on mean pairwise differences (MPD) within the mtDNA HVRI suggested that Neandertals (MPD=5.5) had an effective population size similar to modern Europeans (MPD=4.0) or Asians (MPD=6.3), but lower than in modern Africans (MPD=8.1) (Krause et al., 2007b). Recent population genetic analyses have revealed a higher mtDNA amino acid substitution rate (Elson et al., 2004) and relatively more deleterious autosomal nuclear variants (Lohmueller et al., 2008) in Europeans than in Africans, presumably due to the smaller effective population size of Europeans. Thus, it seems plausible that Neandertals had a long-term effective population size smaller than that of modern humans. Population reductions caused by recurrent glaciations in Eurasia during Neandertals’ ~400,000 years of existence may have contributed to this. Further work will reveal if a small effective population size is compatible with the extent of nucleotide diversity seen in the Neandertal nuclear genome.
A tantalizing finding is that when all substitutions in the mtDNA are analyzed, the Neandertal mtDNA lineage is shorter than the human lineage by approximately 20% (Supp. Fig. 3B). It is tempting to speculate that this is due to the fact that the Neandertal mtDNA is ~38,000 years older than the extant human mtDNA sequences to which it is compared. However, under the assumption that the evolutionary dates are reasonably accurate, the reduction in length is about three times as large as would be expected if it was entirely due to the age of the fossil. It should be noted that for small evolutionary distances such as these, there is a large stochastic component to phylogenetic branch length. Thus, although the evolutionary dates are clearly dependent on many tenuous assumptions, it seems reasonable to assume that the majority of the discrepancy in length between the Neandertal and extant human mtDNA lineages is due to stochastic differences in the amounts of substitutions that have come to fixation on the two lineages.
Another interesting observation is that COX2 stands out among proteins encoded in the mitochondrial genome as having experienced four amino acid substitutions on the modern human mtDNA lineage. Further work is warranted to elucidate the functional consequences of these amino acid substitutions. However, all these substitutions are in regions of the protein that based on the crystal structure do not have any obvious function, and they are variable among primates. Hence, they may represent either minor adaptive advantages, perhaps of regulatory relevance, or have no significant functional consequences for mitochondrial function. Unless other evidence for their importance becomes available, we see no need to invoke positive selection to account for the evolution of COX2 on the human lineage.
Two features of ancient DNA in general and Neandertal DNA in particular may cause errors in nucleotide sequences in excess of what is expected from modern DNA. First, chemical modification in the ancient DNA may cause nucleotide misincorporations, and, second, the unintended presence in the experiments of DNA from extant humans may be mistaken for Neandertal DNA. In the case of the Neandertal mtDNA genome sequences presented here, high average coverage of the random sequence reads in combination with amplification and sequencing of positions where coverage is low or where longer nucleotide homopolymers may cause base calling problems make us confident that the error rates from both these sources are low. Thus, the amount of data now accumulated for the Neandertal mtDNA allows extrapolations to what will be required to arrive at a reasonably reliable nuclear DNA sequences from the Neandertal.
With respect to nucleotide misincorporations, the current data show that errors causing a transition occur at positions carrying either a C or a G base in ~2.7% of sequence reads. By contrast, all other errors occur in less than 0.1% of reads (Supp. Fig. 8). Thus, C to T and G to A misincorporations are the predominant sequencing errors that need to be considered in terms of an ancient genome assembly. Since these misincorporations are drastically accumulated towards the 5’ and 3’ ends of molecules (Supp. Fig. 7), respectively, and since there is a tendency for strand breaks in the ancient DNA to occur 3’ of purines and in particular at GT dinucleotides (Supp. Table 4), this might conceivably result in an accumulation of misincorporations at certain nucleotide positions where even multiple retrieval of the same sequence may then not always allow the correct base to be determined. However, since GT dinucleotides cause only an approximately 3-fold increase in the likelihood of a break compared to random positions in the genome, already rather low sequence coverage makes this effect unlikely to cause errors in the assembled sequence. In fact, although it has been argued that C to T misincorporations are accumulated at particular sites in ancient mtDNA molecules (Gilbert et al., 2003), no positions in the mtDNA assembly have a mixture of observed bases that would confuse assembly (Supp. Fig. 9) and higher relative amounts of mismatches correlate with lower sequence coverage, suggesting that stochastic events among few recovered sequences rather than hotspots for DNA modifications are the basis of these observations. For example, the least supported position of the Neandertal mtDNA assembly (pos. 5,476) is supported by 8 C observations and 3 T observations (Supp. Fig. 9). Thus, we conclude that nucleotide misincorporations will not impede the determination of reliable Pleistocene genome sequences provided that reasonable sequence coverage is achieved.
Contamination with extant human DNA is the other dominant source of erroneous Neandertal sequences. Given the high coverage and the fact that the best estimate of the contamination rate here is 0.5% (with an upper 95% confidence limit of 0.87%), we do not expect contamination to affect the mtDNA sequence assembly to any appreciable level. Under the assumption that the Neandertal mtDNA sequence is reliable, it is a useful tool for gauging contamination when sequencing the Neandertal nuclear genome. Previously, assays to determine contamination within Neandertal fossil extracts were limited to the HVRI which carry few positions where extant humans differ from Neandertals. By contrast, the complete Neandertal mtDNA now offers 133 such positions. This enables a reliable estimation of mtDNA contamination by analyzing sequence reads from 454 libraries rather than by PCR-based assays of the DNA extracts. For example, when we do this in a small preliminary data set initially published from this fossil (Green et al., 2006), 10 of 10 sequences are classified as Neandertal. However, in further unpublished sequencing runs from that library, 8 out of 75 diagnostic sequences derive from extant human mtDNA, suggesting a contamination rate of ~11% (C.I. 4.7–20%). This is in agreement with the suggestion (Wall and Kim, 2007) that contamination occurred in that experiment. That library was constructed outside our clean room facility and before the introduction of the Neandertal-specific key which is crucial for the detection of contamination by other 454 libraries and was therefore not used for the subsequent Neandertal genome sequencing project (Briggs et al., 2007). However, with the help of the mtDNA presented here, such levels of contamination are now easily detectable from single 454 sequencing runs. Nevertheless, since the ratios of nuclear to mtDNA may differ between endogenous and contaminating DNA, it is of importance to identify nuclear DNA sequences diagnostic for Neandertals. Data collected from Vindija bone 33.16 now begin to make this possible (Krause et al., 2007a).
Irrespective of the fact that the amino acid substitutions in the human COX2 protein may not be of functional significance, they illustrate the power of Neandertal DNA sequences for finding accumulations of recent evolutionary changes in human genes. A genome-wide analysis of human and Neandertal genes holds the promise to identify many such cases for nuclear genes. However, whereas the mtDNAs of extant humans are monophyletic with respect to Neandertals, this is not expected to be the case for most nuclear genes (Pääbo, 1999). This fact will be helpful for the identification of recent positive selection in humans. If, for example, a completed selective sweep affected a gene in human ancestors after their separation from their common ancestors with Neandertals, extant human diversity in that region of the genome would coalesce to the exclusion of Neandertal alleles. Consequently, the combination of the accumulation of recent human substitutions in coding genes or conserved sequence elements along with reduced human diversity that excludes Neandertals may be a hallmark for genes and genomic regions that have been important in the emergence of fully modern humans.
Technically, the Neandertal mtDNA presented here is a useful forerunner for the sequencing of the Neandertal nuclear genome. First, the mtDNA assembly and the sequences of which it is comprised can be used to tune the parameters of models of DNA diagenesis. These parameters, in turn, can then be used to properly detect and align ancient Neandertal DNA sequence to the human genome. This will be particularly crucial before multifold coverage of the Neandertal nuclear genome is achieved. Second, the high coverage allows a first tentative estimate of how deeply a Neandertal genome would need to be sequenced in order to arrive at a reasonable accuracy of the sequence given the biased retrieval of sequences (Fig. 5), the misincorporation patterns (Supp. Fig. 8) and other features of the DNA extracted from Neandertal bones. By resampling our sequences to simulate different depths of coverage, we find that to achieve an error-rate of one in ten thousand, 12-fold coverage would be required (Supp. Fig. 10). Although all factors that effect ancient DNA error rates are unlikely to be identical between mitochondrial and nuclear DNA, for example homopolymer lengths and DNA methylation, this lends us confidence that a reliable Neandertal genome sequence will be achievable.
DNA was extracted as described (Serre et al., 2004). After the first DNA extraction from 100–200 mg of bone a second DNA extraction (“re-extract”) was carried out from the remaining bone pellet as described (Rohland and Hofreiter, 2007a; Rohland and Hofreiter, 2007b) except that incubation was over night at 56°C. One extract of the Vindija 33.16 bone (Vi80 [33.16] 120402, Supp. Fig. 1) as well as the two re-extracts (Vi80 [33.16] 120402re, Vi80 [33.16] 031201re, Supp. Fig. 1) were used for 454 library preparation and high throughput sequencing. For PCR verification of the initial sequence assembly, DNA was extracted from 90 mg of the bone as described (Rohland and Hofreiter, 2007b).
MtDNA sequences were identified using megablast (Zhang et al., 2000) and was required to be ≥30 nucleotides long and show ≥90% identity to the reference human mtDNA (GI: 17981852) or a version of this sequence carrying the Vindija HVRI. MtDNA alignments, which had to be at least as high-scoring as the best alignment against the human nuclear genome, were merged and regions where the assembly was problematic were amplified by PCR. While nuclear insertions of mtDNA could in theory confound the assembly, this is unlikely for several reasons (see Supp. Data). The Neandertal mtDNA sequence is available under accession number AM948965.
We aligned Neandertal, human, ape and monkey mtDNAs (identifiers in Supp. Data) and estimated trees using neighbor-joining, maximum parsimony, maximum likelihood and Bayesian approaches. Support for nodes was assessed using bootstrap replicates and a likelihood ratio test was used to test the clock assumption. To date mtDNA divergences, we estimated the posterior distribution of divergence times assuming that chimpanzees and humans diverged 6–8 million years ago. For protein analyses, we concatenated 12 genes, excluding ND6 and regions with overlapping reading frames and codons with insertion-deletion variation, and performed McDonald-Kreitman tests (McDonald and Kreitman, 1991)
We thank the Croatian Academy of Sciences and Arts, the Berlin-Brandenburg Academy of Sciences and the Presidential Innovation Fund of the Max Plank Society for making this work possible. We are indebted to Mark Stoneking, Weiwei Zhai, John Huelsenbeck, Jérôme Fuchs, Mario Mörl, Nick Patterson, David Reich and Timothy White for helpful discussions. We would also like to thank Bruce Rannala and Ziheng Yang for help with mcmctree, James Hudson Bullard for help with figures, and Christine Green for editing. AS, PFLJ, and MS are supported by NIH grant R01-GM40282. PLFJ is also supported by a Chang-Lin Tien scholarship.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.