|Home | About | Journals | Submit | Contact Us | Français|
The settlement of the many island groups of Remote Oceania occurred relatively late in prehistory, beginning approximately 3,000 years ago when people sailed eastwards into the Pacific from Near Oceania, where evidence of human settlement dates from as early as 40,000 years ago. Archeological and linguistic analyses have suggested the settlers of Remote Oceania had ancestry in Taiwan, as descendants of a proposed Neolithic expansion that began approximately 5,500 years ago. Other researchers have suggested that the settlers were descendants of peoples from Island Southeast Asia or the existing inhabitants of Near Oceania alone. To explore patterns of maternal descent in Oceania, we have assembled and analyzed a data set of 137 mitochondrial DNA (mtDNA) genomes from Oceania, Australia, Island Southeast Asia, and Taiwan that includes 19 sequences generated for this project. Using the MinMax Squeeze Approach (MMS), we report the consensus network of 165 most parsimonious trees for the Oceanic data set, increasing by many orders of magnitude the numbers of trees for which a provable minimal solution has been found. The new mtDNA sequences highlight the limitations of partial sequencing for assigning sequences to haplogroups and dating recent divergence events. The provably optimal trees found for the entire mtDNA sequences using the MMS method provide a reliable and robust framework for the interpretation of evolutionary relationships and confirm that the female settlers of Remote Oceania descended from both the existing inhabitants of Near Oceania and more recent migrants into the region.
Exploring the timing and pathways of Oceanic settlement is a multidisciplinary endeavor, with the direct evidence of prehistoric populations obtained from archeological investigations increasingly supplemented by linguistic and biological studies of their present-day descendants (Hurles et al. 2003). The region of Near Oceania (fig. 1) encompasses New Guinea, the Bismarck Archipelago, Bougainville, and the northern Solomon Islands and delineates the extent of early migrations into Oceania, with the first evidence of human settlement dating from at least 44,000 years ago (O'Connell and Allen 2004). In contrast, the settlement of the many islands of Remote Oceania began just over 3,000 years ago, following the appearance in Near Oceania of an archeological horizon known as the Lapita Cultural Complex. Similar assemblages appear in the first settlement sites in Remote Oceania to the south and east from about 3,100 years ago, and a rapid settlement sequence of Remote Oceania follows, with migrants reaching the 3 points of the Polynesian triangle at Hawaii, Easter Island, and New Zealand from ~1,500 to 800 years ago (Kirch 2000).
In a recent review, Green (2003) summarizes the many models proposed for the development of the Lapita Cultural Complex in Near Oceania, describing this and the subsequent settlement of Remote Oceania as “one set of human migrations among the many that have occurred throughout Oceania during the last 40,000 and perhaps 50,000 years.” Much attention has been focused on this later phase of Oceanic prehistory, and current debate centers on whether the Lapita sites represent an intrusive migration of peoples into Near Oceania and, if so, the geographic origin of these migrants and the extent of interactions between the newcomers and the existing inhabitants of Near Oceania.
All of the languages of Remote Oceania and many from Near Oceania belong to the Oceanic subgroup of the Austronesian language family. The high level of diversity among the many non-Austronesian (Papuan) languages spoken in parts of Near Oceania suggests that they have developed over a much greater time frame than the Austronesian languages, consistent with the early settlement dates in this region. Phylogenetic analysis of the Austronesian language family, widely spoken today in Oceania, Island Southeast Asia, and Taiwan has demonstrated that the language tree fits an “out-of-Taiwan” sequence of expansion (Gray and Jordan 2000). This model proposes a migration of proto–Austronesian-speaking peoples into Island Southeast Asia and Oceania from Taiwan, beginning from about 5,500 years ago (Bellwood 1991, 2001). Under this model, the Lapita sites in Near Oceania are viewed as evidence of an intrusive “Austronesian” settlement. Although early descriptions of this model presented the sequence of settlement as so rapid that it allowed very little interaction between settlers and the existing inhabitants of Near Oceania (often described as the “Express Train” model; Diamond 1988), more recent formulations emphasize integration between the 2 groups (Green 2003).
This report focuses on the inferences for Oceanic prehistory preserved in phylogenies of maternally inherited mitochondrial DNA (mtDNA), using entire genome sequences from individuals from Oceania, Australia, Island Southeast Asia, and Taiwan. Previous analyses of the hypervariable region-I (HVR-I) of the control region of mtDNA (Sykes et al. 1995; Lum et al. 1998; Murray-McIntosh et al. 1998; Hagelberg et al. 1999; Friedlaender et al. 2002) have found a general pattern of few, closely related lineages present among the populations of Remote Oceania, particularly in Polynesia, contrasting with a large number of diverse haplotypes in Near Oceanic populations.
One HVR-I haplotype has become known as the “Polynesian motif” due to its high frequency among Polynesian peoples and analyses of the distribution of this haplotype, and its immediate precursors in Oceania, Island Southeast Asia, and mainland Asia have been interpreted as supporting an out-of-Taiwan model of migration into the Pacific (Melton et al. 1995; Redd et al. 1995). However, one molecular dating estimate on this region of the mtDNA control region suggested that the time to the most recent common ancestor (TMRCA) for this motif in eastern Indonesia is ~17,000 years, beyond the ~5,500 year time frame for migration from Taiwan (Richards et al. 1998). This has lent some support to a model of origins of Lapita migrants in Island Southeast Asia, not Taiwan (Richards et al. 1998; Oppenheimer 2004); however, see Cox (2005) for a reassessment with a larger sample set and also Penny (2005) for a discussion of issues surrounding molecular dating of events in the recent past.
A recent study of entire mtDNA sequences from Taiwanese individuals with the ancestral Polynesian motif has revealed a close relationship between the Oceanic Polynesian motif haplotypes and the Taiwanese sequences (Trejaut et al. 2005). Entire mtDNA sequences from Australian and Near Oceanic populations are also beginning to reveal a number of deep lineages found only in this part of the world, which descend from the M, N, and N/R macrohaplogroups, reflecting the complexity expected of a region with >40,000 years of human prehistory (Ingman and Gyllensten 2003; Friedlaender et al. 2005; Merriwether et al. 2005).
To examine patterns of prehistoric migrations in Oceania, we have sequenced 19 mtDNA genomes from Taiwan and Oceania (table 1) and analyzed these together with over 100 other sequences from Taiwan, Island Southeast Asia, and Australia and Oceania. Steel and Penny (2004) have shown that for “ample” data (sequences that can be connected with steps of size 1) maximum parsimony and most parsimonious likelihood are equivalent. This suggests that maximum parsimony may be an appropriate criterion for analyzing densely sampled population data. The large number of unique sequences in our data set means it is not feasible to explore all possible trees and evaluate their parsimony score, and a heuristic approach is required. However, in many cases, we can guarantee that a particular tree, or set of trees, is globally optimal under the maximum parsimony criterion by using the recently described MinMax Squeeze Approach (MMS) (Holland, Huber, et al. 2005). The method takes the length of the shortest tree found by heuristic search as an upper bound and derives a lower bound by summing the parsimony scores of partitions of columns of the data set. Heuristic search is used to optimize the partition to give the highest possible lower bound. A tree or set of trees is guaranteed optimal if the upper and lower bounds meet. The MinMax Squeeze method was updated for this study (see Materials and Methods), and we report a large improvement in the size of data set that can be handled. After omitting 2 highly homoplasious sites, we are able to find provably optimal trees for our 127 taxon data set.
Some previous analyses of entire mtDNA data sets have used distance-based analyses, constructing Neighbor-Joining trees (Ingman et al. 2000; Ruiz-Pesini et al. 2004). This is fast computationally, but it is undesirable to use a method that reports a single bifurcating tree (with ties broken arbitrarily) for population data when multifurcations are expected, and as has been noted by Bandelt et al. (1995), homoplasy means that there are typically many equally parsimonious trees. Similar to the median-joining network approach (Bandelt et al. 1995), we want to display all the equally parsimonious trees in a single graph. Here we accomplish this by summarizing the trees in a consensus network (Holland and Moulton 2003; Holland, Delsuc, et al. 2005) that displays all the edges (splits) in the most parsimonious trees.
The 19 mitochondrial genomes sequenced in this study are from Taiwan (n = 2), Near Oceania (n = 4), and Remote Oceania (n = 13). A sample from a New Zealander of European maternal descent was also sequenced as a methodological control. The locations and sources of the samples are summarized in table 1. An initial long polymerase chain reaction (PCR) was used to amplify the entire mtDNA in 2 overlapping fragments using the Expand Long Template System (Roche Applied Science, Mannheim, Germany) following the manufacturer's instructions; using primer sets 1F,11R and 11F,1R described by Rieder et al. (1998). Twelve ~2,000-bp internal fragments were subsequently amplified from the large fragments, again using combinations of primers from Rieder et al. (1998). PCR products were purified either by digestion with Exo/Sap or by filtration with PCR Cleanup Plates (Millipore, Billerica, MA) and sequenced in forward and reverse directions using the complete set of 24 overlapping primers. BigDye Terminator chemistry (version 3.1, Applied Biosystems, Foster City, CA) was used to generate sequencing products, and capillary separation was performed on an ABI3730 Genetic Analyzer (Applied Biosystems) by the Allan Wilson Centre Genome Service (Palmerston North, New Zealand). Electropherograms were edited and sequences assembled using Sequencher (Version 4.2.2, Gene Codes Corporation, Ann Arbor, MI).
The 20 sequences from this study were manually aligned using SE-AL (Rambaut 1996) with sequences from previous studies that included Oceanic and East Asian samples (Ingman et al. 2000; Maca-Meyer et al. 2001; Ingman and Gyllensten 2003; Kong et al. 2003; Tanaka et al. 2004; Friedlaender et al. 2005; Macaulay et al. 2005; Merriwether et al. 2005; Starikovskaya et al. 2005; Thangaraj et al. 2005; Trejaut et al. 2005; Kivisild et al. 2006). A 9-bp deletion of 1 copy of a tandem repeat in an intergenic region at nt8270–nt8294 was further encoded in the data set by adding a transition where it occurred.
From this alignment, a subset of 137 mtDNA sequences was selected for the Pacific phylogenetic reconstruction. This Oceanic data set contained an African L3 sequence, (AF347014; Ingman et al. 2000); all sequences from Taiwan, Island Southeast Asia, Oceania, and Australia currently available on public databases; and 19 of the 20 sequences generated by this study (details given in table S1, Supplementary Material online). The noncoding control region (nt16024–nt576) was excluded from this data set reducing the number of unique haplotypes from 137 to 127.
Separate data sets were constructed for haplogroups N/R/B4a, N/R/B5a, M/M7 with M/M22, M/M28 with M/M27, N/R/P with N/R/21, and M/Q with M/M29, each containing the same African L3 sequence as the Oceanic data set. Whereas the P, Q, and M28 haplogroups are autochthonous to Oceania, M7bc, B4a, and B5a mtDNA haplotypes are found in populations outside of this region. Relevant mainland East Asian and Japanese sequences from the original alignment were included in these subsets. The number of haplotypes in these data sets ranged from 9 for B5a to 46 for B4a and the entire mtDNA sequence was analyzed, including the noncoding control region. As the Kivisild et al. (2006) sequences do not include the control region, these were not included in the parsimony analyses. All data sets are available from http://awcmee.massey.ac.nz/downloads.htm.
The MMS (Holland, Huber, et al. 2005) was used to determine optimality of the set of most parsimonious trees found by heuristic analysis. For each data set, the upper bound on the parsimony score was found by heuristic searches carried out using PAUP* 4.0b10 (Swofford 2003) excluding gapped characters (branch swapping = Tree Bisection-Reconnection, stepwise addition = simple). The number of steps required for parsimony-informative characters was calculated using PAUP* and averaged over the sets of most parsimonious trees. A lower bound on the parsimony score was calculated using the MinMax Squeeze program. Exact search is not feasible for such a large data set, but the most parsimonious trees found by heuristic search can be proved optimal if the upper bound found by the PAUP search meets the lower bound found using the MinMax Squeeze program. There is no guarantee that the set of most parsimonious trees found by heuristic search are the only optimal trees, other trees with equal score may exist. In order to be effective on the large data set analyzed here, the MinMax Squeeze software has been improved in 2 main ways from the implementation described in Holland, Huber, et al. 2005. The program was rewritten in the C++ language for improved speed. It uses a new heuristic search routine for finding partitions that takes account of information in the tree used to generate the upper bound; this works by avoiding changes to those parts of the partition where the parsimony score found on the tree and lower bound already agree. The tree used to guide the heuristic search for a good partition can be binary or multifurcating, labels at internal nodes are also allowed. The updated MinMax Squeeze program is available from http://awcmee.massey.ac.nz/downloads.htm.
Consensus networks (Holland and Moulton 2003; Holland, Delsuc, et al. 2005) of the sets of most parsimonious trees found for each data set were constructed using a Python script (available from B.R.H.) in combination with Spectronet 1.27 (Huber et al. 2002). All splits (edges) in the equally parsimonious trees are shown in the consensus networks. Sequencher was used to generate lists of substitutions for each sequence compared with the rCRS (Andrews et al. 1999) (with historic base numbering retained by maintaining the 3107C insertion). Base-labeled phylogenies of the haplogroup subsets were reconstructed from the consensus networks or trees, weighting coding-region changes over control-region polymorphisms when resolving conflicts. Changes in protein-coding genes were determined as synonymous or nonsynonymous using the MitoAnalyzer tool (2000) (http://www.cstl.nist.gov/biotech/strbase/mitoanalyzer.html).
Estimations of the TMRCA were calculated using 3 different rates from previously published studies. The first is calculated solely from synonymous substitutions in the protein-coding genes (Kivisild et al. 2006), whereas the second is derived using all of the mtDNA sequence excluding the control region (Mishmar et al. 2003). Both of these rates are calibrated by comparison to chimpanzee sequences, estimating the most recent common ancestor (MRCA) of human and chimpanzee mtDNA at 6.5 Myr. The third rate is calculated for transitions over a portion of the control region, from nt16090 to nt16365. It is estimated from Native American Eskimo and Na-Dene sequences and calibrated with a date of expansion related to the end of the Younger Dryas glacial relapse (Forster et al. 1996). For each of the 3 rates, the number of relevant substitutions from the vertex of interest to its descendants was averaged (the rho statistic) and the variance (σ2) calculated as described in Saillard et al. (2000).
The initial heuristic search on the Oceanic data set excluding the control region found 582,624 most parsimonious trees and provided an upper bound for the MinMax Squeeze (MMS) of 412 (from 282 parsimony-informative characters). The lower bound reached by MMS was 410. Thus, we cannot exclude the possibility that trees requiring 1 fewer mutation (411) or 2 fewer mutations (410) could be found. The consensus network (fig. 2a) shows that most of the differences between the trees found involve the branching order from the M and N/R vertices indicating that the tree is not fully resolved at those points. Averaging the number of steps required for each character over all trees identified 5 characters requiring 5 or more steps: nt709 (6.8), nt1598 (6.2), nt1719 (5), nt10398 (5), and nt15924 (5). In several M lineages (M27, M28a, M29, and M42), there is a recurring transition from G to A in the 12S rRNA gene at nt1598. A similar pattern is found at the N/R vertex, where several lineages descending from the N/R vertex appear to have a back mutation at nt10398 from A to G (a nonsynonymous transition at nt10398 in the ND3 gene is 1 of 5 substitutions that define the N macrohaplogroup).
When the 2 sites nt1598 and nt10398 were excluded from the parsimony analysis, 165 trees were found by heuristic search (parsimony-informative characters = 280, search score = 399), and this upper bound of 399 was met by the MMS program, guaranteeing the parsimony score optimal for this reduced data set. Clearly, the 2 excluded sites caused most of the increase in the number of possible trees (165–582,624). The consensus network of the 165 most parsimonious trees is shown in figure 2b. Knowing that the number of mutations on this network is minimal allows more definite statements about the early evolution among some lineages.
The exclusion of the 2 characters results in a clear multifurcation at the M vertex and greatly reduces the branching possibilities at the N/R vertex. The consensus network is largely tree-like, with just 1 area of uncertainty within the M/Q haplogroup and another surrounding the branching from N/R of single individuals representing the R12 and R21 haplogroups, an Australian N/R/P sequence, and the remainder of the N/R/P haplogroup. The area of uncertainty in the Q haplogroup involves transitions at nt15172 and nt9254 in the Q3 subhaplogroup (fig. S5, Supplementary Material online); control-region sequence data for the DQI12898 sequence (Kivisild et al. 2006) may help to resolve this part of the network.
The conflicting branching hypotheses at the N/R vertex involve 2 shared coding-region transitions between the R21 and P7 sequences at nt12361 and nt15613, a single shared substitution at nt11404 between the R12 and R21 sequences, and the P-defining substitution at nt15607 that is not found in the R12 and R21 sequences (fig. S4, Supplementary Material online). The link between the Malaysian R21 sequence and the Australian P7 sequence is intriguing and for explanation requires 1) parallel substitutions at 2 sites (nt12361G and nt15613) with a MRCA at the N/R vertex or 2) if it reflects shared ancestry, the P-defining nt15607G transition to have arisen independently in the P7 sequence or reverted to nt15607A in the R21 sequence. Additional sequences from these haplogroups are required to clarify this interesting association.
All 6 haplogroup subsets—N/R/B4a (fig. 3), N/R/B5a, M/M7 with M/M22, M/M28 with M/M27, N/R/P with N/R/21, and M/Q with M/M29 (figs. S1–S5, Supplementary Material online)—were proved minimal using the MMS approach. These data sets ranged in size from 9 (N/R/B5a) to 46 (N/R/B4) haplotypes. With these relatively small numbers of sequences, it was possible to include the control region of the mtDNA in the MMS analyses. Three of the analyses (N/R/P with N/R/21, M/M28 with M/M27, and N/R/B5a) resulted in a single most parsimonious tree, and the M/Q with M/M29 analysis found just 2 most parsimonious trees. The N/R/B4 and M/M7 with M/M22 phylogenies were considerably more complex: 1,274 and 74,390 most parsimonious trees were found, respectively. A single tree with branch-labeled nucleotide changes was reconstructed from the minimal tree or consensus network for each of the data sets.
The estimated ages of selected vertices in the haplogroup subset–labeled trees were calculated according to 3 previously described rates based on synonymous changes only (Kivisild et al. 2006), all coding-region changes (Mishmar et al. 2003), and HVR-I changes (Forster et al. 1996) (table 2). The dates estimated from the synonymous changes only are consistently lower than those calculated from all changes in the coding region, whereas the HVR-I estimates are, in several instances, much greater than the coding-region dates. Unless indicated otherwise, we refer to the synonymous substitution rate estimates in the following discussion as these changes are more likely to be selectively neutral than others in the coding region (Penny 2005; Kivisild et al. 2006).
This analysis demonstrates the effectiveness of the MMS for analyzing population data sets. In most cases, it was able to prove that a set of equally parsimonious trees was optimal under the parsimony criterion. Holland, Huber, et al. (2005) found that the performance of the MMS at finding the minimum bound was affected by the degree of homoplasy in the data set, and our results support this. Although the entire coding-region data set was not guaranteed optimal, exclusion of 2 of the 5 characters that required a large number of steps in the trees (nt1598 and nt10398) resulted in optimality being provable. For “tip-labeled binary trees,” there are (2n – 5)!! trees (Semple and Steel 2003), where n is the number of taxa and the double factorial notation (!!) is multiplying by every second number (1×3×5 2n – 5). For 127 taxa (the number of unique sequences in the Oceanic data set), there are ≈4 × 10245 trees. This is far more than that for 53 taxa, which until now was the largest data set for which a tree had been proven minimal (Holland, Huber, et al. 2005); here there were ≈3 × 1080 tip-labeled binary trees. Thus, we have now shown a major increase in the power of the MinMax Squeeze program.
The use of consensus networks allows for a concise summary of all of the edges contained within the equally parsimonious trees—in particular, highlighting those areas where the trees disagree. We find that combining the MMS and consensus network approaches provides a useful alternative to direct construction of the median network that is practical for large data sets. One advantage of this approach over median networks is that it is not necessary to preprocess the alignment by recoding sites with more than 2 states nor is it necessary to remove sites with ambiguous characters. An improvement to the consensus network method that would make the graphs more representative of population data would be to allow input trees with labeled internal nodes and, hence, to produce networks with internal labels.
Four of the 7 branches from the M vertex in the Oceanic consensus network combine haplogroups that have been previously reported as direct descendants of M. In 3 of these instances, a single shared character accounts for the link between the named haplogroups. M31 and M32 are found in samples from the Andaman Islands, and the basal link here is a transition from A to G at base nt1524 in the 12S rRNA gene. All of the individuals in M31 and 1 subgroup of 2 of the 5 M32 individuals have this transition. As this variant is found only in the 7 Andaman samples (from 2,423 global sequences in mtDB—Human Mitochondrial Genome Database [Ingman and Gyllensten 2006], 1/02/06), it seems probable that it is a basal polymorphism for a single M haplogroup encompassing all Andaman sequences.
Haplogroups Q and M29 share a synonymous transition in the ND5 gene at nt13500. Merriwether et al. (2005) have suggested that more sequences, particularly from M29, are required to assess whether this is in fact a basal change or independently acquired in each haplogroup as the transition at nt13500 occurs in parallel in several other global lineages. Similarly, as the transition linking M27 and M28 (at nt1719 in the 16S rRNA gene) has arisen several times independently (e.g., in L3e, M/M8/C, M/D, N/R/B4b, and N/R/P), it does not provide strong support for an ancestral link between M27 and M28.
The fourth grouping of 2 M haplogroups previously described independently is between the M7 sequences and M22, represented by a single sequence from Malaysia. These are grouped by 2 shared polymorphisms: a synonymous transition from A to G at nt5351 in the ND2 gene and a nonsynonymous transition from A to G at nt15236 in the cytochrome B gene. Both substitutions are present in the M22 sequence, whereas the nt5231 transition is found in the M7b Taiwanese sequences and the nt15236 polymorphism in the Micronesian and Taiwanese M7c sequences. When other M7 sequences from outside of the Pacific are examined, and the control region included in the analysis (fig. S2, Supplementary Material online), it seems likely that these shared polymorphisms arose independently in M22 and M7.
There is marked geographic clustering within the Oceanic consensus network (fig. 2b). The N/R/P haplogroup contains haplotypes from Australia and Near Oceania, haplogroups M/M42, N/O, and N/S are currently found only in Australia, and M/Q, M/M29, M/M27, and M/M28 are present only in Oceania. Haplogroups M/M31 and M/M32 are found in samples from the Andaman Islands and M/M22, M/M21, N/N21, N/N22, and N/R/R21 only in Malaysian Aboriginal populations. By contrast, N/R/B4, N/R/B5, N/R/R9, N/R/U, M/M7, and M/M9 haplotypes are found in populations outside of the area covered in this analysis, and their presence in Oceanic populations may represent later migrations into areas initially settled by the ancestors of the present-day carriers of the autochthonous haplotypes.
Eight of the sequences reported here belong to the autochthonous Oceanic haplogroups M/Q, M/M28, and N/R/P. Four M/Q haplotypes are from Polynesia and Vanuatu and are the first reported from Remote Oceania. They form 2 geographic subgroups within the Q1 subclade (fig. S5, Supplementary Material online). The Vanuatu and Polynesian Q1 sequences appear to have diverged well before the settlement of Remote Oceania as they are not closely related to each other or to the other Q1 sequences available at present from Papua New Guinea and Bougainville; the age estimate of the Q1 vertex is 22,862 ± 4,464 years. The Polynesian sequences contain an HVR-I signature, which has been reported from throughout Polynesia; tracking a likely geographic source in Near Oceania of this Q1 variant will require more sampling from the region.
The 2 M/M28 sequences from Remote Oceania (DQ372879 and DQ372883) provide a closer link to known lineages in Near Oceania (fig. S3, Supplementary Material online); the TMRCA of these Vanuatu sequences and 2 from New Britain is 10,146 ± 4,126 years. Two sequences from the Trobriand Islands in Near Oceania (DQ372870 and DQ372872) fall within the N/R/P/P2 subhaplogroup (fig. S4, Supplementary Material online). The P sequences in general are very diverse in HVR-I haplotypes: there is only a single coding-region substitution defining the haplogroup and consequently each of the 7 subhaplogroups has distinct HVR-I sequences. It is interesting to note that the 2 sequences from the Trobriand Islands have an identical HVR-I haplotype to the N/R/HV/H reference sequence (Andrews et al. 1999); if typed to haplogroup solely by HVR-I sequencing, these sequences could be misinterpreted as European-derived haplotypes.
The 2 sequences from Taiwan described here (DQ372868 and DQ372869) belong to haplogroups M/M7c and N/R/B5a, both of which have been reported from HVR-I data to be present in Oceania and Island Southeast Asia. A Micronesian sample from the Marshall Islands (DQ372876) is closely related to the Taiwanese M7c sequence, and these together with a sequence from the Philippines and 1 from Mongolia form a subclade of M7c separate from other sequences from China and Japan (fig. S2, Supplementary Material online). The Taiwanese B5a sequence is notable for its distance from the B5a sequences from the Austro-Asiatic language–speaking Nicobarese (TMRCA estimate 27,732 ± 5,005 years); however, HVR-I sequences from aboriginal Taiwanese suggest that B5a lineages more closely related to the Nicobarese haplotypes may also exist in Taiwan (Trejaut et al. 2005).
The remaining 8 new sequences from Oceania belong to the B4a1a subgroup of haplogroup N/R/B4, bringing the total number of haplotypes in this group to 24. B4a1a HVR-I sequences are the most common type found in Polynesia, and an out-of-Taiwan model predicts the observed pattern of shared ancestry between Oceanic and Taiwanese B4a sequences. The complete mtDNA sequences reveal the phylogeny to be considerably more complex than indicated from the HVR-I sequences: several coding-region substitutions occur between the vertices defining the pre-Polynesian motif (16217C and 16261T) and the full Polynesian motif (16217C, 16247G, and 16261T) (fig. 3).
All of the sequences in the B4a1a subclade to date are from Oceania or Taiwan. The expansion of haplotypes from the B4a1a vertex has occurred recently; a sample from the Trobriand Islands (DQ372871) retains the ancestral sequence, whereas another from Taiwan (AJ842749) differs only by a single control-region transition. The TMRCA estimates for the B4a1a1 and Polynesian motif vertices (table 2) reflect the recent divergence of these sequences and highlight the limitations of molecular dating at the tips of human mtDNA phylogenies and, in particular, estimations based on the noncoding HVR-I. The dates for the B4a1a haplogroup from the entire mtDNA sequences certainly do not exclude the possibility of ancestry of this subhaplogroup in Taiwan, consistent with the out-of-Taiwan model. However, the current lack of entire mtDNA B4a1a sequences from mainland and Island Southeast Asia leaves an important gap in our understanding of the prehistory of this haplogroup.
The haplotypes present in Oceanic populations reinforce the orthodox model of recent settlement of Remote Oceania from Near Oceania, with the maternal lineages in Remote Oceania descending from both the Pleistocene and probable Holocene-era settlers of Near Oceania. The MMS approach is an effective means of analyzing large intraspecific data sets such as this, making it possible to prove that trees found by heuristic search are minimal, and the use of the consensus network method to display all minimal trees found in one graph provides a clear overview of the results, which is easy to interpret. The mitochondrial genome sequences refine existing phylogenies inferred from control-region haplotypes and provide a framework for future investigations: coding-region polymorphisms can be targeted in future studies aimed at resolving the histories of each haplogroup without the need for entire mtDNA sequencing.
Details of the Oceanic data set and results of the haplogroup analyses provided as supplementary table S1 and supplementary figures S1–S5 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
We are grateful to P.A. McLenachan, W.Schievenhovel, and J. Clegg for providing samples for this project. Thanks are also due to Glenn Conner for implementing the changes to the MMS program and to Lisa Matisoo-Smith for valuable discussions and guidance on issues in Pacific prehistory and scholarship support to M.J.P. This work was supported by a grant from the Marsden Fund.
Several of the sequences from Andaman Islanders (Thangaraj et al. 2005) have been recently updated (AY950291.2–AY9650300.2, revised 1/06/06). Among the revisions are changes in the sequences at nt1594, which result in the M31 and M32 haplogroups branching independently from the M vertex in contrast to the shared descent shown in figure 2.