Microsatellite length mutations are often modeled using the generalized stepwise mutation process, which is a type of random walk. If this model is sufficiently accurate, one can estimate the coalescence time between alleles of a locus after a mathematical transformation of the allele lengths. When large-scale microsatellite genotyping first became possible, there was substantial interest in using this approach to make inferences about time and demography, but that interest has waned because it has not been possible to empirically validate the clock by comparing it with data in which the mutation process is well understood. We analyzed data from 783 microsatellite loci in human populations and 292 loci in chimpanzee populations, and compared them with up to one gigabase of aligned sequence data, where the molecular clock based upon nucleotide substitutions is believed to be reliable. We empirically demonstrate a remarkable linearity (r² > 0.95) between the microsatellite average square distance statistic and sequence divergence. We demonstrate that microsatellites are accurate molecular clocks for coalescent times of at least 2 million years (My). We apply this insight to confirm that the African populations San, Biaka Pygmy, and Mbuti Pygmy have the deepest coalescent times among populations in the Human Genome Diversity Project. Furthermore, we show that microsatellites support unbiased estimates of population differentiation (FST) that are less subject to ascertainment bias than single nucleotide polymorphism (SNP) FST. These results raise the prospect of using microsatellite data sets to determine parameters of population history. When genotyped along with SNPs, microsatellite data can also be used to correct for SNP ascertainment bias.
To be useful as a molecular clock, a polymorphic genetic locus needs to accumulate mutations in a predictable way, so that with an appropriate statistical transformation, the differences between two alleles present in the population can be used to obtain an unbiased estimate of the time that has elapsed since their last common genetic ancestor (Zuckerkandl and Pauling 1962). When loci dispersed throughout the genome are combined, this molecular clock can in principle provide accurate estimates of genetic divergence times and, with further analysis, can also estimate ancestral population sizes and population migration histories.
Microsatellites (or short tandem repeats) are simple repetitive sections of DNA of typically 2–5-bp motifs (e.g., CACACACACA). They possess several features suitable for a molecular clock. First, microsatellites are widely dispersed throughout the genome. In humans, an estimated 150,000 informative (sufficiently polymorphic) loci exist, of which tens of thousands have been genotyped (Weber and Broman 2001). Second, in humans, the mutation rate at these markers is estimated to be around 10⁻³ to 10⁻⁴ per locus per generation (Ellegren 2000), which is orders of magnitude larger than the genome-wide average nucleotide mutation rate of around 10⁻⁸ per base per generation. The higher mutation rate means that a much smaller fraction of the genome needs to be sampled to make inferences with microsatellite data than with sequence data. Third, microsatellites are largely free of ascertainment bias compared with single nucleotide polymorphisms (SNPs) (Conrad et al. 2006). The extraordinarily high mutation rate of microsatellites means that they are primarily discovered not based on their polymorphism pattern in any one population (they are essentially guaranteed to be polymorphic) but instead based on their sequence. Thus, the population in which they are first studied is not expected to substantially bias inferences based on the data. By contrast, SNP allele frequency in the population in which it is discovered has a dramatic influence on the probability that it will be included in a study, and thus, SNP data sets are deeply affected by ascertainment bias (Clark et al. 2005). The majority of SNPs on human genome-wide scanning arrays have been ascertained in a complex way that is difficult to model, confounding the interpretation of allele frequency distributions for inferences about history.
The technology to efficiently genotype microsatellites—using polymerase chain reaction followed by length separation on gel—has sparked an enormous amount of effort on using them to make inferences about genetic variation. They have been extensively analyzed in the context of constructing genetic linkage maps in a wide range of species, from humans to zebra fish to wheat (Dib et al. 1996; Roder et al. 1998; Shimoda et al. 1999). Using linkage maps and family-based linkage analysis, microsatellites have been used to discover regions of identity by descent in related individuals, which in turn have been used to localize the search for disease genes.
Initially, there was great interest in using microsatellites to make inferences about history, not only in humans but also in other species (Bowcock et al. 1994; Paetkau et al. 1997). The idea that inferences about history were possible using these markers was based on preliminary evidence that microsatellites mutate approximately according to a random walk, whereby alleles undergo length changes during DNA replication due to polymerase slippage (Levinson and Gutman 1987; Ellegren 2004). The simplest model was the single-step symmetric stepwise mutation model (SMM) (Ohta and Kimura 1973; Valdes et al. 1993), whereby microsatellites mutate to one motif length shorter or longer with equal probability. In the generalized stepwise mutation model (GSMM) (Kimmel and Chakraborty 1996), the length changes can also be multi-step (Di Rienzo et al. 1994) and involve directional asymmetry (Amos and Rubinsztein 1996). Assuming that the GSMM holds, the average square distance (ASD) (Goldstein et al. 1995a) between orthologous microsatellites of two individuals provides an unbiased estimate of the coalescence time averaged across the genome, also known as the average time to the most recent common ancestor (tMRCA) (Slatkin 1995). The establishment of the microsatellite molecular clock using the GSMM led researchers to infer average coalescent times (Goldstein et al. 1995a, 1995b; Goldstein and Pollock 1997; Zhivotovsky 2001), population differentiation (FST for microsatellites) (Slatkin 1995), and patterns of population size expansion and contraction (Kimmel et al. 1998; Reich and Goldstein 1998).
Despite the initial excitement in using microsatellites to make inferences about history, this interest has waned because experimental evidence has revealed instances where the GSMM is violated. In the context of boundary constraints on microsatellite allele lengths, for example, ASD can lose accuracy for separations beyond 10,000 generations (assuming the range of alleles is constrained to 20 repeats) (Feldman et al. 1997), which is well within the depth of human genetic variation. Researchers have also explored more complex models of microsatellite evolution that include boundary constraints (Nauta and Weissing 1996; Feldman et al. 1997) and length-dependent mutation rates (Di Rienzo et al. 1994; Kruglyak et al. 1998; Xu et al. 2000; Sainudiin et al. 2004), where ASD is also inappropriate. Perhaps the greatest obstacle to using microsatellites as molecular clocks is the concern that each locus would have to be characterized experimentally and modeled individually.
Due to doubts about the ability to accurately model the microsatellite mutation process, recent studies have eschewed the use of microsatellite data to infer parameters of human history, though there are some important exceptions (Ramachandran et al. 2008; Szpiech et al. 2008). Thus, although large-scale microsatellite data sets have recently been collected in many human populations—in particular ~700 microsatellite loci were genotyped in approximately 3,000 individuals from 147 populations, including the Human Genome Diversity Panel (HGDP) (Rosenberg et al. 2002, 2005; Zhivotovsky et al. 2003), South Asians (Rosenberg et al. 2006), Native Americans (Wang et al. 2007), Latinos (Wang et al. 2008), and Pacific Islanders (Friedlaender et al. 2008)—only two of eight studies (Zhivotovsky et al. 2003; Becquet et al. 2007) attempted to make time inferences with these data. Most studies have instead focused on using microsatellite data to detect and analyze population structure.
In this study, we revisit the hypothesis that reliable inferences about history can be obtained using microsatellite data. To do this, we use newly available genome sequencing data sets that permit empirical assessments of the microsatellite molecular clock. More specifically, we compare ASD with genomic sequence divergence using data sets from both humans and chimpanzees and show that, despite the known presence of deviations from the GSMM at many individual loci, the averaged microsatellite clock over all loci applies with remarkable accuracy to time depths that are about 10-fold greater than previous simulations. Next, we show that the microsatellite FST is accurate when compared to SNP FST, and we perform coalescent simulations to show that SNP ascertainment bias is a plausible explanation for discrepancies between the two FST measures. It is likely that the microsatellite molecular clock can be useful to the analysis of population history for many populations and closely related species, beyond the humans and chimpanzees analyzed here.
It is important to note that microsatellite ASD, like sequence divergence between two samples (the number of nucleotide differences per base pair), is expected to be proportional to tMRCA averaged across the genome, and does not provide any direct information about population split times. We focus on ASD here because we can directly plot it against average sequence divergence for population pairs and test whether the molecular clock holds, without making any assumptions about demographic history. Only after having demonstrated that ASD is an accurate molecular clock do we discuss its potential applications in estimating population split times, historical population sizes, and historical migrations, which are more complicated inferences that can only be done with appropriate population genetics modeling.
For humans, we used 783 autosomal microsatellites from Rosenberg et al. (2005). From this set, we found that two loci were almost perfectly correlated and removed the locus (D2S1334) with more missing data. We used Rosenberg's H952 set of individuals, who are expected to be less related than second cousins (Rosenberg 2006). To match individuals to the sequence data sets, we pooled individuals according to population (supplementary table S1, Supplementary Material online). For chimpanzees, we used the 292 autosomal microsatellites generated by Becquet et al. (2007). We only used chimpanzees (supplementary table S1, Supplementary Material online) that have no population ambiguity based on geographic and genetic clustering information.
We used three sequence data sets (table 1): The first was generated by Keinan et al. (2008), who used whole genome shotgun sequencing (WGS) (Weber and Myers 1997) to sequence four East Asians (Han Chinese and Japanese), five North Europeans, five West Africans (Yoruba), and one Biaka Pygmy. The second data set was experimentally generated in our own laboratory using a reduced representation shotgun (RRS) library (Altshuler et al. 2000) to sequence one San, one Australian aborigine, and one Mbuti Pygmy. This data set has not been previously published. Unlike WGS, which fragments the genome at random, RRS produces fragments cut by specific restriction enzymes, constraining sequences to specific regions of the genome (see details of RRS sequencing below). WGS data from Yoruba, Europeans, and East Asians were aligned to the sequence from the three RRS individuals, allowing for a larger number of pairwise comparisons across populations than was possible with WGS alone. The third data set was generated by Caswell et al. (2008) and consisted of WGS sequence data from one Bonobo, three Western Chimpanzees (including “Clint,” the individual used to generate the chimpanzee reference sequence 2005), three Central Chimpanzees, and one Eastern Chimpanzee. We converted divergence values from Caswell et al. into absolute units of substitutions per kilobase (kb) by assuming that the Western–Western chimpanzee divergence is approximately equal to WGS European–European divergence (Patterson, Price, and Reich 2006; Patterson, Richter, et al. 2006).
We used restriction enzymes PmeI (5′-GTTTAAAC-3′) and EcoRI (5′-GAATTC-3′) to fully digest DNA extracted from cell lines of five diverse human DNA samples, using an RRS protocol similar to that described in Altshuler et al. (2000). We ran the products of the two restriction enzyme digests on a gel and cut out a 2–3-kb band, which is expected to isolate the same subset of the genome in each of the samples. Finally, we cloned the fragments into a pUC19 vector flanked by a PmeI overhang on one side and an EcoRI overhang on the other.
We calculated that the same ~30 Mb, or ~1% of the genome, would be isolated in the five samples by this experimental protocol. Given the human genome GC content of 41%, PmeI sites are expected to occur every 36 kb (0.205⁻² × 0.295⁻⁶ bp, because the site contains two G/C and six A/T bases), for a total of ~86,000 fragments, and EcoRI sites are expected to occur every 3.1 kb (0.205⁻² × 0.295⁻⁴ bp), for a total of ~1,000,000 fragments. Given the human genome size of 3.1 Gb, and assuming a Poisson distribution of restriction sites, approximately 2 × 86,000 × (1,000,000 − 86,000)/(1,000,000) = 157,000 fragments flanked by PmeI on one side and EcoRI on the other are expected in the genome. We carried out an integral to infer that the proportion of these fragments expected to be in the 2–3-kb range is ~15%, which translates to an expectation of ~23,000 fragments of 2–3 kb for sequencing in each sample. Because each fragment we analyzed was sequenced from both ends with an expected 500–800 bp per read, the total amount of sequence that we expected in our “reduced representation” of the genome was about 23,000 × 1.3 kb = 30 Mb. The advantage of RRS over WGS is that with deterministic fragmentation of the genome, the sequences that we obtained in distinct individuals were expected to overlap with greatly increased probability, so that we required substantially less sequencing to obtain genome overlaps from different samples.
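The arithmetic in this paragraph can be retraced with a short script. This is only a sketch: it assumes independent bases with G = C = 0.205 and A = T = 0.295, the function and variable names are illustrative, and the ~15% size-band proportion is taken from the text as given rather than recomputed.

```python
# Expected restriction-site spacing under independent bases, 41% GC content.
GC = 0.41
P_GC = GC / 2          # probability of G (or of C): 0.205
P_AT = (1 - GC) / 2    # probability of A (or of T): 0.295
GENOME_BP = 3.1e9      # genome size used in the text

def expected_spacing(site: str) -> float:
    """Mean distance in bp between occurrences of a recognition site."""
    prob = 1.0
    for base in site:
        prob *= P_GC if base in "GC" else P_AT
    return 1.0 / prob

pme_spacing = expected_spacing("GTTTAAAC")  # about 36 kb
eco_spacing = expected_spacing("GAATTC")    # about 3.1 kb
n_pme = GENOME_BP / pme_spacing             # about 86,000 sites
n_eco = GENOME_BP / eco_spacing             # about 1,000,000 sites

# Mixed PmeI/EcoRI fragments, using the rounded counts quoted in the text.
mixed_fragments = 2 * 86_000 * (1_000_000 - 86_000) / 1_000_000

# Proportion in the 2-3 kb band (~15%) is the text's value, not derived here.
band_fragments = 0.15 * mixed_fragments
total_sequence_bp = 23_000 * 1_300          # two reads per fragment, ~1.3 kb total

print(round(pme_spacing), round(eco_spacing), round(mixed_fragments))
```

With these inputs the mixed-fragment count is 157,208, which the text rounds to 157,000, and the expected sequence yield is 29.9 Mb.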
We carried out RRS sequencing on two San male samples from HGDP (HGDP_988 and HGDP_991), two Mbuti Pygmy females from the Coriell Cell Repositories (NA10493 and NA10496), and one Australian Aborigine female from the European Collection of Cell Cultures (ECCAC_9118). We attempted to sequence 15,360 reads (7,680 paired ends) from each sample, and then aligned the reads to the reference human genome sequence, NCBI Build 35, using ssahaSNP (Ning et al. 2001) with stringent neighborhood quality score (NQS) parameters of Qsnp ≥ 40, Qflank ≥ 15, Nflank = 5, maxFlankDiff = 1, and maxSNPs/kb < 15. Reads that mapped to multiple places in the genome with nearly identical scores were removed from further analysis. After alignment and filtering, we had data from 11,687 reads in HGDP_998 (5,656,804 bp meeting NQS thresholds), 11,500 reads in HGDP_991 (5,359,356 bp), 11,848 reads in NA10493 (5,702,532 bp), 11,905 reads in NA10496 (5,486,017 bp), and 12,193 reads in ECCAC_9118 (6,034,676 bp).
We note that in this study we do not examine overlaps of RRS libraries, even though such comparisons were the original intent of the RRS data collection strategy. This is because we found that if the same section of the genome passes through the RRS process in two or more chromosomes, they are in practice biased to be too closely related to each other in time (the inferred tMRCA was systematically lower than the value obtained based on microsatellite ASD). We hypothesize that this reflects the fact that to enable a comparison between two RRS libraries, two haplotypes must be identical at both the PmeI (8 bp) and EcoRI (6 bp) restriction cut sites, which requires identity for each of the 14 = 8 + 6 bases. By requiring that pairs of haplotypes match for each of the 14 bases, we are biasing the haplotypes that we analyze to be ones with fewer mutations separating them, and thus to be more closely related to each other (in time) than the average pair of sequences in the genome. It is straightforward to show that this generates an appreciable (if small) downward bias in the divergence time estimate, which we in fact observed.
We used the HGDP autosomal 650K SNPs (Li et al. 2008).
For microsatellites, we computed the unbiased sample statistic of ASD, which is theoretically proportional to tMRCA assuming that the GSMM is valid (Goldstein et al. 1995a). It is important to realize that the average tMRCA across the genome can be estimated directly from genetic data (using either microsatellite ASD or per base pair sequence divergence). It is a property of the samples that are being analyzed and can be estimated empirically without making any assumptions about the demographic history of populations.
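For a single locus, ASD works as follows: Suppose we have population A with nA individuals (2nA alleles) and population B with nB individuals (2nB alleles). We take an allele from each population, perform a subtraction, and square the result. The single-locus ASD is then the average over all such allele pairs. Writing $a_i$ and $b_j$ for the allele lengths (in repeat units) sampled from A and B,

$$\mathrm{ASD} = \frac{1}{(2n_A)(2n_B)} \sum_{i=1}^{2n_A} \sum_{j=1}^{2n_B} (a_i - b_j)^2 .$$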
It can be shown (see below) that ASD is very similar to the total variance of all samples between two populations. Furthermore, the within-population ASD (not explicitly shown) is equal to twice the variance of the sampled population.
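A small sketch of the single-locus calculation may help make the two variance relationships concrete. The repeat counts below are hypothetical, and the within-population version (averaging over distinct allele pairs only) is an assumption about what "unbiased" means here, not a form confirmed by the text.

```python
from itertools import combinations
from statistics import mean, pvariance, variance

def asd_between(alleles_a, alleles_b):
    """Average squared length difference over all between-population allele pairs."""
    return mean((a - b) ** 2 for a in alleles_a for b in alleles_b)

def asd_within(alleles):
    """Average squared difference over distinct allele pairs within one population."""
    return mean((x - y) ** 2 for x, y in combinations(alleles, 2))

# Hypothetical repeat counts at one locus (3 diploid individuals per population).
pop_a = [10, 11, 11, 12, 13, 10]
pop_b = [14, 15, 13, 14, 16, 15]

between = asd_between(pop_a, pop_b)
# Between-population ASD decomposes into two variances plus a squared mean difference.
decomposed = pvariance(pop_a) + pvariance(pop_b) + (mean(pop_a) - mean(pop_b)) ** 2

# Within-population ASD over distinct pairs equals twice the unbiased sample variance.
within = asd_within(pop_a)
twice_var = 2 * variance(pop_a)

print(between, decomposed, within, twice_var)
```

The decomposition holds exactly for any allele lists, which is why between-population ASD tracks the total variance of the combined samples.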
Next, we averaged ASD over multiple loci. We assumed that the microsatellite loci are independent because they were selected for the purpose of linkage analysis to be distantly spaced across the genome. Thus, the standard error is simply the standard deviation of ASD across all loci divided by the square root of the number of loci. We did not correct for mutation rate heterogeneities across loci, because their empirical values were unknown. More importantly, we did not normalize across loci to equalize the tMRCA of each locus, because biologically, tMRCA are different for each locus due to different gene genealogies (Rosenberg 2002).
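As a sketch of the multi-locus step, the genome-wide ASD and its standard error can be computed from per-locus values as follows. The input values are hypothetical, and the use of the sample standard deviation with an n − 1 denominator is an assumption, since the text does not specify it.

```python
from math import sqrt
from statistics import mean, stdev

def asd_mean_and_se(per_locus_asd):
    """Mean ASD across loci and its standard error, treating loci as independent."""
    k = len(per_locus_asd)
    return mean(per_locus_asd), stdev(per_locus_asd) / sqrt(k)

# Hypothetical per-locus ASD values for one population pair.
per_locus = [4.0, 6.0, 5.0, 9.0, 6.0]
asd_hat, asd_se = asd_mean_and_se(per_locus)
print(asd_hat, asd_se)
```

No per-locus weighting or normalization is applied, in keeping with the choices described above.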
To compute genetic distances for pairwise aligned sequences, we simply counted nucleotide differences to obtain sequence divergences. Assuming that the molecular clock hypothesis is true for sequence divergence (i.e. the genome-average nucleotide substitution rate is constant since human–chimpanzee speciation), then sequence divergence is strictly proportional to tMRCA. Because of linkage disequilibrium, nearby divergent sites are dependent, and standard errors of sequence divergence were computed via a block jackknife approach (Keinan et al. 2007).
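The block jackknife idea can be sketched as below. This is a plain delete-one-block jackknife on hypothetical per-block counts; the cited implementation (Keinan et al. 2007) may differ, for example by weighting blocks of unequal size, so this should be read as an illustration of the principle only.

```python
from math import sqrt

def block_jackknife_divergence(diffs, bases):
    """Divergence (differences per base) and delete-one-block jackknife SE.

    diffs[i] and bases[i] are the mismatch count and aligned length of block i.
    """
    g = len(diffs)
    total_d, total_b = sum(diffs), sum(bases)
    theta = total_d / total_b
    # Recompute the estimate with each block left out in turn.
    leave_one_out = [(total_d - d) / (total_b - b) for d, b in zip(diffs, bases)]
    loo_mean = sum(leave_one_out) / g
    var = (g - 1) / g * sum((t - loo_mean) ** 2 for t in leave_one_out)
    return theta, sqrt(var)

# Hypothetical contiguous genomic blocks of equal aligned length.
block_diffs = [10, 12, 8, 11, 9]
block_bases = [10_000] * 5
divergence, divergence_se = block_jackknife_divergence(block_diffs, block_bases)
print(divergence, divergence_se)
```

Because whole blocks are dropped together, correlation among nearby sites inside a block does not deflate the standard error the way a per-site calculation would.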
Although there are multiple methods to compute FST, our goal is to have an unbiased FST statistic for microsatellites that is also coherent with SNP FST. FST is defined as

$$F_{ST} = \frac{H_T - H_S}{H_T}.$$
HS is the average heterozygosity across all populations. HT is the heterozygosity of all populations pooled together. Slatkin (1995) showed that in the context of the GSMM, heterozygosity is simply the variance of the allelic distribution at a particular locus. However, we do not use his sample statistic verbatim because he requires equal sample sizes, and instead use one that we derived that allows for unequal sample sizes.
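Suppose we have two populations whose allelic distributions are described by random variables A and B. HS is trivial:

$$H_S = \tfrac{1}{2}\left[\operatorname{Var}(A) + \operatorname{Var}(B)\right].$$

HT is found using the law of total variance applied to an equal-weight pool of the two populations, yielding

$$H_T = \tfrac{1}{2}\left[\operatorname{Var}(A) + \operatorname{Var}(B)\right] + \tfrac{1}{4}\left(\operatorname{E}[A] - \operatorname{E}[B]\right)^2 .$$

Combining terms, we have an FST estimator:

$$F_{ST} = \frac{H_T - H_S}{H_T} = \frac{\left(\operatorname{E}[A] - \operatorname{E}[B]\right)^2}{2\left[\operatorname{Var}(A) + \operatorname{Var}(B)\right] + \left(\operatorname{E}[A] - \operatorname{E}[B]\right)^2} .$$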
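SNP loci are biallelic, and hence, random variables A and B are Bernoulli distributed with minor allele frequency (MAF) parameters pA and pB, so that Var(A) = pA(1 − pA) and Var(B) = pB(1 − pB). SNP FST becomes

$$F_{ST} = \frac{(p_A - p_B)^2}{2\left[p_A(1 - p_A) + p_B(1 - p_B)\right] + (p_A - p_B)^2} = \frac{d^2}{P(1 - P)} .$$

This is a classical definition for SNP FST, where P is the MAF of the two populations combined, and d is the difference between the MAF of a population and P:

$$P = \frac{p_A + p_B}{2}, \qquad d = p_A - P = \frac{p_A - p_B}{2} .$$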
Hence, SNP FST is just a special case of microsatellite FST.
We compute unbiased sample statistics (which we refer to using a “hat” notation) separately for the numerator and denominator, then calculate the ratio.
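Given sample sizes and unbiased sample statistics for mean and variance, the numerator becomes:

$$\hat{N} = \left(\hat{\mu}_A - \hat{\mu}_B\right)^2 - \frac{\hat{\sigma}_A^2}{2n_A} - \frac{\hat{\sigma}_B^2}{2n_B},$$

where $\hat{\mu}$ and $\hat{\sigma}^2$ are the sample mean and unbiased sample variance of the allele lengths over the 2nA and 2nB sampled alleles. The subtracted terms remove the sampling variance of the two sample means, so that $\hat{N}$ is unbiased for $(\operatorname{E}[A] - \operatorname{E}[B])^2$. Similarly, the denominator becomes

$$\hat{D} = 2\left(\hat{\sigma}_A^2 + \hat{\sigma}_B^2\right) + \hat{N},$$

which is unbiased for $2[\operatorname{Var}(A) + \operatorname{Var}(B)] + (\operatorname{E}[A] - \operatorname{E}[B])^2$.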
All discussion so far has been for a single microsatellite locus. For K loci, we first compute K unbiased sample statistics, each for the numerator and denominator. Then we separately average the numerator and denominator and finally compute the ratio. This strategy avoids numerical instability issues of averaging ratios (namely, when denominators are small at certain loci).
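The ratio-of-averages strategy can be sketched as follows. The per-locus terms used here are an assumed reading of the estimator: the numerator estimates (E[A] − E[B])² with the sampling variance of the means subtracted, and the denominator estimates 2[Var(A) + Var(B)] + (E[A] − E[B])². The function names and data are hypothetical.

```python
from statistics import mean, variance

def fst_terms(alleles_a, alleles_b):
    """Bias-corrected numerator and denominator of FST for one locus."""
    m_a, m_b = len(alleles_a), len(alleles_b)      # numbers of sampled alleles
    v_a, v_b = variance(alleles_a), variance(alleles_b)
    # Squared mean difference minus the sampling variance of the two means.
    num = (mean(alleles_a) - mean(alleles_b)) ** 2 - v_a / m_a - v_b / m_b
    den = 2 * (v_a + v_b) + num
    return num, den

def fst_multilocus(loci):
    """Average numerators and denominators separately, then take one ratio."""
    terms = [fst_terms(a, b) for a, b in loci]
    return mean(n for n, _ in terms) / mean(d for _, d in terms)

# Hypothetical allele lengths; sample sizes may differ between populations.
locus_1 = ([10, 10, 12, 12], [14, 14, 16, 16])
locus_2 = ([20, 21, 22, 21], [21, 22, 22, 23, 21, 22])
fst_one = fst_multilocus([locus_1])
fst_two = fst_multilocus([locus_1, locus_2])
print(fst_one, fst_two)
```

Taking a single ratio at the end means a locus with a near-zero denominator contributes little, rather than producing an extreme per-locus ratio that dominates the average.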
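FST and ASD are closely related. From the above, it is clear that FST is a function of first- and second-order moments of allelic distributions. Furthermore, it is known (Goldstein et al. 1995a) that the ASD estimator is

$$\mathrm{ASD} = \operatorname{Var}(A) + \operatorname{Var}(B) + \left(\operatorname{E}[A] - \operatorname{E}[B]\right)^2 .$$

Define X as the sum of intrapopulation variances, $X = \operatorname{Var}(A) + \operatorname{Var}(B)$. Define Y as interpopulation variance, taken here as the squared difference of means, $Y = (\operatorname{E}[A] - \operatorname{E}[B])^2$. Then

$$\mathrm{ASD} = X + Y, \qquad F_{ST} = \frac{Y}{2X + Y} .$$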
Now the relationship between FST and ASD is clear. ASD closely resembles the total variance of allelic distributions of populations A and B combined. FST is the ratio of interpopulation variance to total variance.
To test empirically whether the microsatellite ASD statistic (Goldstein et al. 1995a) can be an unbiased estimate of tMRCA, we used genomic sequence divergence as a “gold standard,” and assessed how closely the microsatellite inferences matched this number. We restricted our analysis to pairs of populations for which we had both extensive genome sequence alignments and large-scale microsatellite data. We first used sequence data sets to compute autosomal sequence divergence, which was assumed to be proportional to the average tMRCA. This formed our gold-standard molecular clock. For the same pairs of populations, we then computed ASD using microsatellite data. Comparing sequence divergence to ASD provided a metric for the accuracy of the microsatellite molecular clock, assessed in terms of linearity (correlation coefficient) and standard errors.
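The linearity assessment can be sketched as a Pearson correlation with a confidence interval. The text does not say how the intervals were obtained; the Fisher z-transformation used below is an assumption, and the data points are hypothetical.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist, mean

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for r via Fisher's z-transformation."""
    half_width = NormalDist().inv_cdf(0.5 + level / 2) / sqrt(n - 3)
    z = atanh(r)
    return tanh(z - half_width), tanh(z + half_width)

# Hypothetical (sequence divergence, ASD) values for five population pairs.
divergence = [1, 2, 3, 4, 5]
asd = [2, 4, 5, 4, 5]
r = pearson_r(divergence, asd)
ci_low, ci_high = fisher_ci(r, len(divergence))
print(r, ci_low, ci_high)
```

With few population pairs the interval is wide and asymmetric around r, which is the pattern seen in intervals of this kind when r is close to 1.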
Figure 1 plots sequence divergence against microsatellite ASD. For WGS humans (Panel A), the correlation coefficient is r = 0.989 (P = 4.9e−7, 95% confidence interval [CI] 0.946–0.998). For RRS humans (Panel B), r = 0.983 (P = 5.3e−11, 95% CI 0.949–0.995). For chimpanzees (Panel C), r = 0.986 (P = 2.7e−4, 95% CI 0.877–0.999). Figure 1 suggests the following:
Although these results demonstrate microsatellites’ usefulness in estimating tMRCA, there is a nonzero y-intercept (supplementary fig. S1, Supplementary Material online), oddly suggesting that zero sequence divergence (tMRCA = 0) is associated with a positive ASD. We used simulations to investigate the possibility that microsatellite genotyping error caused the elevated ASD relative to its true value. Assuming a typical genotype error rate of 1% with error being randomly distributed at ±1 repeat length (Weber and Broman 2001), we can only explain 10% of the offset. It is possible, however, that the most pertinent error in microsatellite genotyping is not miscalling microsatellite lengths by a single repeat length, but instead, miscalling heterozygous genotypes as homozygous, which can easily occur with microsatellites (Weber and Broman 2001). Missing of heterozygotes would have the effect of generating false multi-step mutations, which would result in a much larger inflation in the ASD (due to the squaring of the difference in allele lengths) and could plausibly explain our significantly nonzero y-intercept. Alternatively, the relationship between ASD and tMRCA could be globally nonlinear but easily linearizable in our time window. Whatever the cause for our observations, these results indicate that for population genetic analysis, it is important to use a calibration curve (such as fig. 1) to convert ASD to sequence divergence, correcting for the inflated estimate of divergence time from microsatellite ASD.
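The size of the ±1-repeat error effect can be illustrated without simulation by computing the expected ASD exactly. This sketch assumes each allele is independently miscalled with probability 1%, split evenly between +1 and −1 repeats; it is an exact-expectation substitute for the simulations mentioned above, and the allele values are hypothetical.

```python
from itertools import product

def asd(alleles_a, alleles_b):
    """Mean squared length difference over all between-population allele pairs."""
    pairs = list(product(alleles_a, alleles_b))
    return sum((a - b) ** 2 for a, b in pairs) / len(pairs)

def expected_asd_with_error(alleles_a, alleles_b, err=0.01):
    """Expected ASD when each allele call is shifted by +1 or -1 with probability err."""
    shifts = {-1: err / 2, 0: 1 - err, 1: err / 2}
    total = 0.0
    for a, b in product(alleles_a, alleles_b):
        for (shift_a, prob_a), (shift_b, prob_b) in product(shifts.items(), repeat=2):
            total += prob_a * prob_b * ((a + shift_a) - (b + shift_b)) ** 2
    return total / (len(alleles_a) * len(alleles_b))

true_a = [10, 11, 11, 12]
true_b = [14, 15, 13, 14]
inflation = expected_asd_with_error(true_a, true_b) - asd(true_a, true_b)
print(inflation)
```

Under this error model the inflation is twice the error rate (0.02 squared repeat units per locus) regardless of the true alleles, which is small and fits the observation that such errors account for only a minor share of the offset.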
The microsatellite data show that the San, Biaka Pygmy, and Mbuti Pygmy Africans are more diverged in their pairwise tMRCA from non-African populations than are Yoruba West Africans. These results are consistent with an analysis of microsatellite data by Zhivotovsky et al. (2003) but strengthen their result because microsatellite and sequence divergence concur (fig. 1A and B). It was already known based on mitochondrial DNA and Y chromosome data that the San and Mbuti contain deeply diverged lineages, but our results and those of Zhivotovsky et al. using autosomal microsatellite data show definitively that these populations are outgroups to all other populations.
An immediate application of the regressions from figure 1 is to infer sequence divergences for the remaining HGDP populations in which we lack sequence data. Figure 2 is a matrix plot showing the inferred divergences (hence inferred tMRCA). In this plot, the San and Pygmy Africans are the only populations equidistant to all other populations, further suggesting that these populations are the most deeply diverged.
FST measures the degree of differentiation between populations. Given genetic diversity data for two populations, FST (a quantity between 0 and 1) is the ratio of interpopulation variance to total variance. When FST is appropriately transformed (Slatkin 1991; Patterson unpublished), one can infer the genetic drift that occurred between two populations since they split. In particular, one can estimate the population split time (tpop) in units of 2N, where N is the effective population size, under the assumption that populations have been constant in size since their divergence. We note that in human populations, tpop and tMRCA are different by an order of magnitude: For Africans versus non-Africans, the average tMRCA is thought to be ~500,000 years ago, whereas tpop is thought to be 40,000–80,000 years ago (Keinan et al. 2008). As we have shown that the microsatellite molecular clock works for time depths of at least 2 My, we can be confident that it also works for time separations that are an order of magnitude less.
FST is usually estimated based on SNP and sequencing data when available, because uncertainties of the complex microsatellite mutation process confound the interpretation of a microsatellite FST in terms of history. Assuming the GSMM of microsatellite evolution, however, Slatkin derived a microsatellite-based FST estimator (Slatkin called it RST) (Slatkin 1995) that should be identical to SNP-based FST. The empirical analyses using Slatkin's estimator have been encouraging. For example, based on <300 SNPs (Fischer et al. 2006) and <300 microsatellites in four chimpanzee populations, Becquet et al. (2007) showed that the SNP FST and microsatellite FST were concordant.
As of today, the richest data sets with both genomewide SNPs and large numbers of microsatellites are those from HGDP (Rosenberg et al. 2002; Li et al. 2008). We computed and compared FST based on SNPs and microsatellites in these samples. An important distinction between the comparison we present here and that of the previous section (where we examined ASD) is that we do not treat SNP-based FST as a gold standard.
Figure 3A plots SNP FST on the horizontal axis and microsatellite FST on the vertical axis. There are 53 populations in HGDP and hence 1,378 data points (53 choose 2) with standard errors. The linearity is clear and the regression lines intersect the origin. However, there are two distinct lines for FST > 0.1. The 1,035 pairwise comparisons of non-African populations (46 choose 2) have a regression line slope of 0.91 and correlation coefficient r = 0.983 (95% CI 0.982–0.986). The African versus non-African comparisons have a distinctly smaller slope of 0.73 and r = 0.969 (95% CI 0.962–0.975). It is evident that for FST > 0.1, SNP-based quantities are larger than microsatellite quantities when Africans are involved. We next investigate the possible reasons for this discrepancy.
To investigate whether SNP ascertainment bias can explain the phenomena in figure 3A, we simulated SNP ascertainment as follows:
For demographic model 1, we denoted population A (the one with the larger effective population size) as “Africans” and population B as “non-Africans.” The simulation results are shown in figure 3B. Without ascertainment, both FST are identical. Ascertainment using two Africans showed negligible bias. Ascertainment using two non-Africans negatively biased SNP FST. Ascertainment using one sample from each population positively biased SNP FST. Compared with the real HGDP data (fig. 3A), ascertaining from one African and one non-African generated the same directional effect. This result is reasonable, because SNPs on medical genetics arrays were discovered as differences between a non-African chromosome and the reference human genome. The reference human genome sequence has a substantial amount of African ancestry because RPCI-11, the Bacterial Artificial Chromosome library that has contributed ~74% of the human genome reference sequence (International Human Genome Sequencing Consortium 2001), is likely to be derived from an African American (Reich et al. 2009).
We applied the one African one non-African ascertainment scheme to demographic model 2. There are four populations in the model, producing six FST values in total (four choose two). As shown in figure 3C, the non-African versus non-African comparisons show little bias. The African versus non-African comparisons show a positively biased SNP FST. Thus, we have demonstrated that SNP ascertainment bias can generate the discrepancy in figure 3A.
Having established the accuracy of both microsatellite ASD and FST, we next show a 2D view of HGDP microsatellite data that highlights important historical events.
Just as sequence variation data contains information on both divergence time and genetic drift, it can be shown (Materials and Methods) that microsatellite ASD and FST are functions of two independent quantities: interpopulation variance and intrapopulation variance. Using the HGDP microsatellite data as previously described, in figure 4 we projected the data onto the two orthogonal statistics: interpopulation variance (horizontal axis) and intrapopulation variance (vertical axis). Again we have 1,378 data points, and lines of constant ASD and FST are marked. Above the thick black line are Africans versus all populations, and below are non-Africans versus non-Africans. This figure suggests the following:
The fact that microsatellites are useful as molecular clocks has immediate applications: First, as described above (and in supplementary fig. S3, Supplementary Material online), we were able to use the clocklike nature of microsatellites to provide clear evidence that the San, Biaka, and Mbuti Pygmy branch off near the root of the tree of human populations, with all other populations (including West Africans) forming a clade. Note that all of our analyses are restricted to population average coalescent time, a quantity distinctly different and much more ancient than population split time. Second, we can use microsatellite data to correct inferences about FST based on high density SNP array data. SNP FST values can be precise, but they are affected by ascertainment bias. Potentially, we can use microsatellite FST to correct most of this bias. For example, based on figure 3, we estimate that all pairwise autosomal FST's between African and non-African populations in the Li et al. HGDP data (Li et al. 2008) are too large by a factor of 1.25 for FST values >0.1. By deflating all these FST values by this factor, we can obtain a pairwise FST matrix that is likely to be more accurate.
We finally note that our results are intriguing because in principle, they offer a way to obtain a direct estimate of the human per nucleotide mutation rate for sequence divergence data. To date, it has been impossible to obtain a direct estimate of the human per base pair mutation rate because the rate is too low (about 2 × 10⁻⁸ per nucleotide per generation) to permit direct observation from pedigree data. However, the microsatellite mutation rate is sufficiently high (10⁻³ to 10⁻⁴ per generation) that novel mutations are frequently directly observed in families (Weber and Wong 1993). By directly estimating the microsatellite mutation rate and mutation process in families, and then extrapolating to sequence divergence, we should be able to estimate the human per base pair mutation rate and infer the dates of important historical events, like the divergence time of humans and chimpanzees, without using fossil records for calibration.
We thank Alon Keinan for his suggestions about the design of the SNP ascertainment bias simulations. D.R. was supported by a Burroughs Wellcome Career Development Award in the Biomedical Sciences. J.S. was supported by the Bioinformatics and Integrative Genomics Ph.D. training grant by NIH. J.C.M. was supported by the Intramural Research Program of the National Human Genome Research Institute, NIH. We are grateful to Nicole Stange-Thomann and Julie Neubauer for preparing the Reduced Representation Shotgun data.