Since private alleles have proven useful in investigating population structure and migration patterns (Calafell et al.; Neel and Thompson, 1978; Schroeder et al.), we now provide a detailed example to illustrate various ways in which our generalized private allelic richness approach can be used in data analysis.
We employ a dataset from human populations (Rosenberg et al.) containing genotypes of 1048 individuals, the H1048 collection of individuals (Rosenberg, 2006), at 783 microsatellite loci. We also consider the genotypes for the H952 subset of the full H1048 dataset, a group of 952 individuals that contains no known first or second degree relatives (Rosenberg, 2006). The individuals were classified as belonging to one of five major geographic regions: sub-Saharan Africa, Eurasia (Europe, Central/South Asia, and the Middle East, including North Africa), East Asia, Oceania and the Americas. We treat each of these regions as a ‘population’ in the computations that follow.
We used Equations (2) and (3) to compute allelic richness and private allelic richness for each of the five geographic regions, and we used Equation (4) to compute generalized private allelic richness for various combinations of regions. The computation was performed for individual loci for values of g from 2 up to the maximum possible value for the dataset, and for each g the mean was taken across loci. For a given locus, the smallest number of observations in one of the population groupings under consideration specifies the largest value of g possible to use for private allelic richness and generalized private allelic richness computations at that locus. Because missing data can reduce this maximal g, in our example we used the locus filtering feature in the ADZE computer program to restrict our attention to 721 loci for which each geographic region had a missing data rate ≤15% (similar results are obtained when using all 783 loci, with a lower maximal g). With this collection of loci in the H952 dataset, every locus had a sample size of at least 48 observations in each of the five geographic regions. The same collection of 721 loci was used in analyses that employed the full collection of 1048 individuals.
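The rarefaction computation described above can be sketched as follows. Equations (2) to (4) are not reproduced in this section, so the sketch assumes the standard rarefaction form: for an allele with `n_allele` copies among `n_total` sampled gene copies in a region, the probability that a subsample of size g contains no copy is C(n_total − n_allele, g) / C(n_total, g). The function names, the dictionary layout of the counts and the toy data are illustrative assumptions, not part of the ADZE program.

```python
from math import comb

def prob_absent(n_total, n_allele, g):
    """Probability that g gene copies drawn without replacement miss the allele."""
    return comb(n_total - n_allele, g) / comb(n_total, g)

def max_g(counts):
    """Largest usable g at one locus: the smallest regional sample size."""
    return min(sum(region.values()) for region in counts.values())

def allelic_richness(region_counts, g):
    """Expected number of distinct alleles in a subsample of size g from one region."""
    n = sum(region_counts.values())
    return sum(1.0 - prob_absent(n, c, g) for c in region_counts.values())

def generalized_private_richness(counts, combo, g):
    """Expected number of alleles found in every region of `combo` and in no
    other region, when g gene copies are subsampled from each region.

    counts: dict mapping region -> dict mapping allele -> copy count (one locus).
    combo:  set of region names; a single-region set gives private allelic richness.
    """
    alleles = set().union(*(region.keys() for region in counts.values()))
    total = 0.0
    for allele in alleles:
        term = 1.0
        for name, region in counts.items():
            n = sum(region.values())
            q = prob_absent(n, region.get(allele, 0), g)
            # Present in regions of the combination, absent everywhere else.
            term *= (1.0 - q) if name in combo else q
        total += term
    return total

def mean_over_loci(loci, combo, g):
    """Mean of the per-locus statistic, as done for each g in the text."""
    values = [generalized_private_richness(counts, combo, g) for counts in loci]
    return sum(values) / len(values)

# Hypothetical toy locus with two regions (not data from the study).
toy = {"A": {"a1": 2, "a2": 2}, "B": {"a1": 4}}
print(max_g(toy), generalized_private_richness(toy, {"A"}, 2))
```

Under these assumptions, the values over all non-empty combinations of regions sum to the expected number of distinct alleles in the pooled subsamples, which is what allows the 31 values to be read as a partition of alleles.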
There are 31 combinations of one or more of the five geographic regions, and we computed generalized private allelic richness for each combination. For comparison, we also partitioned alleles among the 31 possible geographic distributions without correcting for sample size. Considering all loci, each distinct allele can be private to a single region, present in two regions, present in three regions, present in four regions or present in all five regions. For each of the 31 geographic distributions, we determined the fraction of alleles in the dataset that had the specified distribution.
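As an illustration of the uncorrected partition, the following minimal sketch enumerates the 31 non-empty combinations of five regions and assigns each distinct allele to the exact set of regions in which it was observed. The region abbreviations follow those used for the combinations in the text (Af, Eu, Ea, Oc, Am); the input layout and the toy alleles are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

REGIONS = ("Af", "Eu", "Ea", "Oc", "Am")

def all_combinations(regions):
    """All non-empty subsets of the regions (2**5 - 1 = 31 for five regions)."""
    return [frozenset(c)
            for k in range(1, len(regions) + 1)
            for c in combinations(regions, k)]

def uncorrected_fractions(loci):
    """Fraction of distinct alleles having each geographic distribution.

    loci: list of dicts mapping region -> set of alleles observed at that locus.
    No correction for sample size is applied: an allele counts as present in a
    region if it was observed there at least once.
    """
    tally = Counter()
    for locus in loci:
        for allele in set().union(*locus.values()):
            present = frozenset(r for r, seen in locus.items() if allele in seen)
            tally[present] += 1
    total = sum(tally.values())
    return {combo: tally[combo] / total for combo in all_combinations(REGIONS)}

# Hypothetical toy locus: allele 120 is in all regions, 124 is private to Af,
# and 128 is found only in Af and Eu.
toy_loci = [{
    "Af": {120, 124, 128},
    "Eu": {120, 128},
    "Ea": {120},
    "Oc": {120},
    "Am": {120},
}]
fractions = uncorrected_fractions(toy_loci)
print(len(fractions), fractions[frozenset({"Af"})])
```

Because larger samples are more likely to contain rare alleles, this tally favors heavily sampled regions, which is the bias the rarefaction-based quantities are meant to remove.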
We performed a simulation study to assess the extent to which our estimates of the proportions of alleles in various combinations of geographic regions reflect the true proportions. First, for each of the 783 loci in our dataset we treated the sample allele frequencies in each geographic region from the H952 subset of individuals as the true allele frequencies. For each locus and each geographic region, we sampled 250 diploid individuals (with replacement) to create a simulated dataset. Repeating this sampling, we produced 100 simulated datasets, each consisting of 250 diploid individuals per region at each of the 783 loci. For each simulated dataset, each locus, and each value of g from 10 to 500, we calculated the generalized private allelic richness for each of the 31 combinations of one or more of the five regions. We then divided each of the 31 values by their sum to determine the fraction of alleles present in each of these 31 categories. Similarly, continuing to treat the sample frequencies in the H952 dataset as true frequencies, we tallied the true number of distinct alleles in each of the 31 combinations of regions in the H952 dataset and divided by the total number of distinct alleles worldwide to obtain the true proportion of private alleles for each of the 31 combinations of regions. We then calculated $\sum_{i=1}^{31} (\mathrm{sim}_i - \mathrm{true}_i)^2$, where $\mathrm{sim}_i$ and $\mathrm{true}_i$ denote the simulated and true proportions of alleles private to geographic combination $i$, respectively. The mean of this statistic was taken over the 100 replicate simulated datasets, and the resulting quantity was then plotted in Figure 1.
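Two steps of this simulation, the resampling and the distance statistic, can be sketched as below. The sketch assumes that sampling 250 diploid individuals amounts to drawing 500 gene copies independently from the regional allele frequencies; the text does not specify how genotypes were formed, so this is a simplification. All function names and the toy frequencies are hypothetical.

```python
import random

def resample_counts(freqs, n_individuals, rng):
    """Draw 2 * n_individuals gene copies with replacement from allele frequencies.

    freqs: dict mapping allele -> frequency, treated as the true frequencies.
    Assumption: gene copies are drawn independently of one another.
    """
    alleles = list(freqs)
    weights = [freqs[a] for a in alleles]
    counts = {}
    for allele in rng.choices(alleles, weights=weights, k=2 * n_individuals):
        counts[allele] = counts.get(allele, 0) + 1
    return counts

def to_fractions(values):
    """Divide each category's value by the sum over all categories."""
    total = sum(values.values())
    return {key: value / total for key, value in values.items()}

def squared_distance(sim, true):
    """Sum over categories of (sim_i - true_i)**2; missing categories count as 0."""
    keys = set(sim) | set(true)
    return sum((sim.get(k, 0.0) - true.get(k, 0.0)) ** 2 for k in keys)

def mean_distance(replicates, true):
    """Mean of the distance statistic over replicate simulated datasets."""
    return sum(squared_distance(sim, true) for sim in replicates) / len(replicates)

# Hypothetical frequencies for one region at one locus.
rng = random.Random(0)
counts = resample_counts({"a1": 0.7, "a2": 0.25, "a3": 0.05}, 250, rng)
print(sum(counts.values()))  # 500 gene copies, matching the largest g used
```

In a full replicate, the resampled counts for every region and locus would be passed through the generalized private allelic richness computation for each of the 31 combinations before `to_fractions` and `squared_distance` are applied.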
Fig. 1. The distance between simulated and true values of the proportions of alleles with specific geographic distributions, summed across distributions and plotted as a function of standardized sample size g from 10 to 500. Results shown represent the mean across ...
As the standardized sample size g increases, the distance between simulated and true values decreases considerably, so that for large g, our generalized private allelic richness measures provide a close approximation to the true values in the setting of the simulation (Fig. 1). Because this simulation is based on our human microsatellite dataset, its results suggest that it is reasonable to make interpretations about allelic distributions in human populations using our method applied to the data in our example.
Figure 2 shows the generalized private allelic richness at g = 40 for each of the 31 combinations of geographic regions, as a fraction of the sum of the 31 values. Examining the percentages of alleles having a given geographic distribution at g = 40, the average absolute difference across geographic distributions is 0.12% between the computations including and excluding relatives. Because of the similarity in results including and excluding relatives, our subsequent analyses use only one of the two datasets (the H952 subset excluding relatives).
Fig. 2. The number of alleles private to various combinations of geographic regions as a fraction of the total at a standardized sample size of 40. The outer circle corresponds to the entire dataset including known first and second degree relatives. The inner ...
Figure 3 and the accompanying table compare the fractions of alleles having each of the 31 geographic distributions, for four values of g (10, 20, 30 and 40) as well as uncorrected for sample size. Notable in the figure and table is the emergence of alleles that were found in various combinations of two, three and four major regions when correcting for sample size, but that did not appear in the uncorrected calculations. Additionally, we see that the uncorrected analysis produces a rather different view of the allelic distribution compared with the analyses that correct for sample size. For example, considering the distribution of private alleles across the major geographic regions, the uncorrected calculations indicate that Eurasia contains the most private alleles, followed by Africa, East Asia, the Americas and Oceania. However, when we correct for sample size differences using g = 40, Africa has the largest number of private alleles, followed by Eurasia, Oceania, East Asia and the Americas. Similarly, in the uncorrected calculations the region with the largest number of missing alleles (alleles private to four of the five regions) is Oceania (AfEuEaAm) followed by the Americas (AfEuEaOc); in the corrected calculations (standardized sample size of g = 40) missing alleles are most numerous for the Americas (AfEuEaOc) followed by Oceania (AfEuEaAm).
Fig. 3. The number of alleles private to various combinations of geographic regions as a fraction of the total, using a subset of the data excluding relatives. The innermost circle corresponds to calculations uncorrected for sample size variation. Moving outward ...
Percentages of 8516 total alleles private to various combinations of geographic regions
For each geographic region, the mean number of distinct alleles per locus and the mean number of private alleles per locus are shown in Figure 4A and B as functions of standardized sample size g. From these plots, we see that Africa has both the highest number of distinct alleles and the highest number of private alleles, and that the smallest values in both categories occur in the Americas.
Fig. 4. The mean number of (A) distinct alleles per locus and (B) private alleles per locus, as functions of standardized sample size for five major geographic regions (excluding known relatives).
The numbers of alleles private to combinations of regions are plotted in Figure 5. Figure 5A shows the mean number of alleles per locus private to pairs of major regions, demonstrating that the combination of Africa and Eurasia has the largest number of private alleles. The smallest number is observed in the combination of Oceania and the Americas. The highest number of alleles private to three regions is seen in the combination of Africa, Eurasia and East Asia, followed closely by the combination of Africa, Eurasia and Oceania (Fig. 5B). In the plot for the number of missing alleles (Fig. 5C), we see that the Americas have by far the largest number, followed by Oceania and Africa. Figure 5D, which shows the mean number of alleles simultaneously present in all regions, illustrates that the number of alleles found in all regions considerably exceeds the number private to any one region or any combination of two, three or four regions.
Fig. 5. The mean number of alleles per locus private to combinations of k of five major geographic regions as a function of standardized sample size (excluding known relatives). (A) k=2, (B) k=3, (C) k=4 and (D) k=5. For geographic region abbreviations refer ...
3.4 Out of Africa and the peopling of Oceania
We can interpret the patterns of private allelic richness in Figures 4 and 5 in relation to our expectations based on various perspectives about the history of human migrations. The larger numbers of alleles and private alleles in Africa, and the smaller numbers in the Americas, match the pattern expected for models of human evolution that begin from an African origin and reach the Americas only after a series of founder events (Ramachandran et al.). The pair of regions with the largest number of alleles is the combination of the geographically connected regions of Africa and Eurasia; the group of three regions with the largest number is the combination of Africa, Eurasia and East Asia; and the group of four regions with the largest number is the combination of Africa, Eurasia, East Asia and Oceania. These results each fit the prediction of African-origin models that include serial founder effects during outward migrations, as many alleles in the founding population would only have migrated along part of the path outside of Africa.
One set of results that offers the potential to distinguish among competing hypotheses about human migrations concerns alleles found in combinations of geographic regions that include Oceania. The initial peopling of near Oceania (which includes the islands of New Guinea and Bougainville, from where our samples originate) involves the first demonstrable human sea crossing (Derricourt, 2005). Fossil evidence of the presence of anatomically modern humans in Sahul (the ancient landmass of Australia and New Guinea separated by sea from Asia) dates to at least 42 000–45 000 years before the present (BP) (Gillespie, 2002; O'Connell and Allen, 2004), and earlier dates (~60 000 BP) have also been proposed (O'Connell and Allen, 2004; Thorne et al.). Several migration waves have entered Oceania since the initial colonization, creating a complex mixture of ancestries in many parts of the region (Friedlaender et al.; Matisoo-Smith, 2007).
A theory of a single main migration out of Africa ultimately reaching Oceania proposes a recent dispersal of modern humans from sub-Saharan Africa into Eurasia, replacing earlier archaic humans. There are at least two plausible out-of-Africa routes of dispersal towards eastern Asia: a northern inland route through the Middle East and a southern coastal route via Arabia and India (Bulbeck, 2007; Cavalli-Sforza et al.; Field et al.; Quintana-Murci et al.). Because the existence of multiple routes suggests the possibility that two or more major migrations taking different paths may have occurred, it is of interest to examine whether an additional main out-of-Africa event, distinct from the events responsible for most of the peopling of Asia and Europe, might have been responsible for the peopling of Sahul.
To investigate the possibility of a separate migration wave from Africa to Oceania, we can consider three simplified scenarios concerning human dispersal from Africa to Oceania that have the potential to be distinguished based on multilocus population-genetic data (Fig. 6). The first scenario, Model 1, corresponds to a single primary out-of-Africa migration through the Middle East and East Asia before reaching Oceania. This hypothesis predicts that variation in Oceania is largely a subset of East Asian variation. The second scenario, Model 2, postulates a peopling of Oceania from Eurasia. In this model, following a migration out of Africa into western Asia, the migration that carried human populations into Oceania was separate from the migration into East Asia and left a negligible genetic trace along the path to Oceania. Under this hypothesis, variation in Oceania would largely be a subset of variation in Eurasia. Finally, the third scenario, Model 3, suggests an early peopling of Oceania, perhaps by a southern route out of Africa via the Arabian peninsula, the Indian sub-continent and Southeast Asia. In this scenario, populations in Asia along the migration path would have only a small or negligible fraction of ancestry from the time of the initial colonization of Oceania, and would descend largely from later out-of-Africa migrations. Variation in Oceania would then be a subset of variation in Africa but not of variation in Eurasia or East Asia.
Fig. 6. Three hypothetical migration patterns to Oceania out of Africa. For geographic region abbreviations refer to the legend.
Comparisons of the numbers of alleles with various geographic distributions can assist in distinguishing these alternative hypotheses (see the accompanying table). Figure 7 displays the rarefaction curves for the four pairs of geographic regions that include Oceania, a subset of the pairs shown in Figure 5A. The Africa/Oceania combination has more private alleles than the other three pairs, and the East Asia/Oceania and Eurasia/Oceania pairs have nearly equal numbers of private alleles. These observations are compatible with Model 3, in which Oceania would retain many ancestrally African alleles not found elsewhere. They are also compatible with Model 1, as the relatively high number of alleles each non-African group shares with Africa could be a consequence of the particularly high level of African variation. The similar numbers of alleles private to Eurasia/Oceania and East Asia/Oceania would then result from the opposing effects of a higher level of variation in Eurasia than in East Asia and a higher degree of relationship with Oceania in East Asia than in Eurasia. The observations, however, are not compatible with Model 2, which would have been expected to produce an excess number of alleles private to the combination of Eurasia and Oceania compared with the number private to the combination of East Asia and Oceania.
Comparisons of numbers of private alleles for pairs of geographic regions that would support a given migration model if observed
Fig. 7. The mean number of alleles per locus private to the combination of Oceania and another major geographic region as a function of standardized sample size (excluding known relatives). Error bars represent SEM across loci. For geographic region abbreviations ...
Further support for Model 3 can be found in various additional comparisons in Figure 5A. Under Model 1, Oceania and the Americas both derive from East Asian ancestry, and therefore, the combinations Africa/Oceania and Africa/America, Eurasia/Oceania and Eurasia/America, and East Asia/Oceania and East Asia/America are directly comparable. In each of these three cases, the pair including Oceania has more alleles than the pair including the Americas, consistent with the higher allelic richness in Oceania compared to the Americas. However, the amount by which the number of alleles private to the combination of Africa and Oceania exceeds the number of alleles private to the combination of Africa and the Americas is considerably greater than the corresponding excess for the other two comparisons. Moreover, with the exception of Africa/Eurasia, the Africa/Oceania combination has more alleles than any other pair of regions, including the combination of Africa and East Asia. These observations, which are compatible with Model 3, are more difficult to reconcile with Model 1.
Examination of combinations of three regions in Figure 5B produces suggestive evidence for Model 3 similar to that obtained from combinations of two regions in Figure 5A and Figure 7. Except for the combination of Africa, Eurasia and East Asia, the combination of Africa, Eurasia and Oceania has more private alleles than any other three-region combination. Although this observation could potentially be explained by any of the three models, the amount by which the number of private alleles for the Africa/Eurasia/Oceania combination exceeds that of other combinations is least compatible with Model 1, which has several groups of three regions that might have been expected to have numbers of private alleles close to that of Africa/Eurasia/Oceania.