|Home | About | Journals | Submit | Contact Us | Français|
Motivation: Y-chromosome short tandem repeats (Y-STRs) are widely used for population studies, forensic purposes and, potentially, the study of disease, therefore knowledge of their mutation rate is valuable. Here we show a novel method for estimation of site-specific Y-STR mutation rates from partial phylogenetic information, via the maximum likelihood framework.
Results: Given Y-STR data classified into haplogroups, we de-scribe the likelihood of observed data, and develop optimization strategies for deriving maximum likelihood estimates of mutation rates. We apply our method to Y-STR data from two recent papers. We show that our estimates are comparable, often more accurate than those obtained in familial studies, although our data sample is much smaller, and was not collected specifically for our study. Furthermore, we obtain mutation rate estimates for DYS388, DYS426, DYS457, three STRs for which there were no mutation rate measures until now.
Microsatellites or Short Tandem Repeats (STRs) are repetitive stretches of DNA made of short sequence motifs (2–6 bp), repeating a variable number of times. STRs are very common in eukaryotic genomes, and are highly mutable, with changes in repeat count occurring with much higher mutation rates compared to other polymorphisms [as high as 2 × 10−3 in Y-chromosome STRs (Y-STR) as estimated by Heyer et al. (1997)], leading to allelic polymorphism. These properties have made them an efficient tool for identification in forensics and paternity tests (Hammer and Redd, 2006; Kayser and Sajantila, 2001) as well as for studying demographic history and population structure, especially Y-STRs (Contu et al., 2008). This is because most of the Y-chromosome does not undergo recombination, hence population polymorphism originates only from mutations, and individuals can be placed on a common phylogenetic tree, whose branches are marked by mutations. Besides their use in forensics and demographic studies, some autosomal microsatellites are also known to be involved in disease: expansion of specific microsatellites beyond a certain threshold has been known to cause diseases such as Fragile X syndrome, myotonic dystrophy and Huntington's disease (Ashley and Warren, 1995; Rubinsztein, 1999). Microsatellites are also known to hypermutate in some cancers (Thibodeau et al., 1993).
In order to understand microsatellites mutation mechanisms, reliable rate estimates for STRs, in general, and Y-STR, specifically, have long been considered of scientific interest. Different approaches have been utilized for the estimation of Y-STR mutation rates, some of which use direct counting such as counting mutation events in deep routed pedigrees with known history (Heyer et al., 1997), or in father–son pairs (Ge et al., 2009; Kayser et al., 2000), and also in sperm (Holtkemper et al., 2001). Other methods try to estimate the mutation rates indirectly; for example, Zhivotovsky et al., 2004, use the diversity of Y-STRs in modern-day population samples with documented population founding or population splitting events in the last 1000 years (Gypsies and Pacific Islanders). They use TD, an estimator for a population divergence time based on inter- and intra-population variance of STR repeat number (Zhivotovsky, 2001), to estimate the average STR mutation rates. For the eight Y-STRs they consider they get an estimate of 6.9 × 10−4±5.9 × 10−4 mutations per generation.
All of the different approaches observed significant variation in the mutation rates of different STRs. See for example the comparable mean and standard error for mutation rates as estimated by Zhivotovsky et al. (2004). However, there is a roughly 3-fold gap between the rates estimated from geneologies and those estimated from historical or phylogenetic data (Zhivotovsky et al., 2004, 2006). This has generated extensive attention in the literature, with some explanations offered by Zhivotovsky et al. (2006), but in our view it largely remains unresolved.
In this article, we propose a new approach for estimating Y-STR mutation rates. As opposed to previous approaches, which require extensive collection of data specifically for this purpose (father–son pairs, genealogies, populations with documented history, etc.), our approach takes advantage of data collected in population genetic studies. There is a long list of such studies which make use of Y-STR data (Contu et al., 2008; Hammer et al., 2009; Quintana-Murci et al., 2010), and they often also sequence a collection of unique event polymorphism markers, usually single nucleotide polymorphisms (SNPs), which place each sample in a well-defined region of the human Y-chromosome phylogenetic tree, referred to as a haplogroup (Hg). We show below how this partial knowledge of the phylogenetic relationship between samples affects the likelihood of the observed Y-STR lengths, and demonstrate how the resulting optimization problems can be solved to obtain maximum likelihood (ML) estimates of Y-STR mutation rates. In the simplest cases, the resulting ML estimation problem is generalized linear model (GLM), with a non-standard complementary log–log (CLL) link function. This has been previously demonstrated and applied on mitochondrial DNA (mtDNA) data by Rosset et al. (2008). However, as we show, in Y-STR data we can extract more detailed information from the data. We formulate and solve the resulting ML maximization as general convex optimization problems.
We apply our approach to a combined dataset of haplogroup-associated Y-STR data from two recent papers (Hammer et al., 2009; Quintana-Murci et al., 2010), comprising in total 3780 samples in 66 haplogroups. We then compare our estimates to the ones published in the Y-Chromosome Haplotype Reference Database, YHRD (Willuweit and Roewer, 2007). The YHRD estimates are based on simple counting of mutations in very large numbers of meioses on known geneologies (between 10 000 and 25 000 meioses for each Y-STR) and are based on genealogical data collected specifically for this purpose. We demonstrate that our estimates are comparable to those derived from genealogical data, but more reliable (as reflected by having tighter confidence intervals). We are also able to obtain estimates for Y-STRs not contained in YHRD. Our approach can easily be applied to much larger datasets by augmenting our current study with additional datasets collected from the population genetics literature, with no proactive data collection effort required.
Our data comes from two recent population genetics studies. The first (Hammer et al., 2009) is a study of the Y-chromosome landscape of Jewish Cohanim, compared to the general Jewish population and to gentiles within the Middle East and other regions. The authors genotyped 75 SNPs on the Y-chromosomes of 3674 individuals. This classified them into 64 unique Y-chromosome Hgs, according to the accepted nomenclature. They also genotyped 12 Y-STRs in all individuals: DYS19, DYS385a, DYS385b, DYS388, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393, DYS426 and DYS439. An additional two Y-STRs (DYS438, DYS457) were genotyped in most individuals.
The second study (Quintana-Murci et al., 2010) deals with a very different population: the Colored people of South Africa, who are an intriguing genetic mixture of African, European and Southeast Asian ancestries. In this article, the authors used SNPs to divide the 228 males sampled into 21 Hgs, and they typed 14 Y-STRs on 226 of them: DYS19, DYS385a, DYS385b, DYS388, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393, DYS426, DYS437, DYS438 and DYS439.
We use the combined dataset from both studies for our analyses, where we rely on the Hg inference by the authors of these two papers, and use the Y-STR data they generated as input to our method. We removed DYS385a and DYS385b from the analysis, since their values cannot be uniquely determined, and replaced DYS389II by DYS389B = DYS389II-DYS389I, as is typically done (Goff and Athey, 2006; Mohyuddin et al., 2001). Our pre-processing of the data further included merging the datasets, specifically combining data for the 10 Hgs that appeared in both datasets; removing Hgs which had less than two samples in the combined dataset; removing samples which had less than 11 Y-STRs observed.
After this process, we were left with data on 66 Hgs, with a varying number of observations on each Y-STR, as detailed in Table 1.
Denote by T the phylogenetic tree containing all 3780 samples. We do not have the full tree structure, but rather a haplogroup view of that tree, that is, the samples are grouped into haplogroups or paragroups, here referred to as Hgs. These Hgs represent terminal subtrees of the full tree T, but we are not given their internal structure (Fig. 1). For each sample we are given the number of subunit repeats of all, or part of the 13 STRs mentioned bellow. We assume that:
Let t(T) be the total time of all branches of our phylogenetic tree T, then, according to our assumptions, the number of mutations on this tree in STR i in total time t(T) is distributed Poisson(λit(T)). In the same manner let T1,…,TK represent the K Hg terminal subtrees of T, whose total time length of all the branches, t1,…, tk and inner structure are not known. Thus, mik, the number of mutations of STR i, in Hg k, is distributed Poisson(λitk). If we had the internal structure of each subtree k (of Hg k), then we could directly count mik, and hence formulate the total log-likelihood of the data and estimate the parameters, using Poisson regression, through the usual ML framework. However, as aforementioned, we do not, so we do not observe the mik's, but only observe the state (number of subunit repeats) of STR i in all samples (leaves) of Hg k.
Here we go about this problem using an extension of a method proposed by Rosset et al. (2008) for SNP rate estimation in mtDNA. Briefly, given all states of STR i, in the leaves of Hg k, if the STR state is not identical in all leaves, we know for certain that mik > 0, i.e. STR i has mutated at least once somewhere on the phylogenetic tree describing haplgroup k samples. If all of the samples in Hg k have the same number of subunits in STR i, we can conclude with almost absolute confidence that this site has not mutated anywhere on the Hg's phylogenetic tree, i.e. mik = 0. To demonstrate that our approach can properly capture whether a mutation did occur in a specific site, consider a simple phylogenetic tree like the one in Figure 2, where we assume a mutation from red triangle to black circle has occurred on the top-right branch. The shapes at the bottom describe the states of the leaves (observed samples), if no other mutations have occurred at this site. Now assume we want all the leaves of the tree to have the same number of subunits (all triangles or all circles) at this STR. This would clearly require that either the mutation reverted back from circle to triangle on a cut of the subtree below the original mutation (such as both branches marked with **) or the same exact mutation (triangle to circle) simultaneously happened on a set of branches completing a cut of the full tree (such as the branch marked with X). If none of these highly unlikely events (requiring multiple ‘coordinated’ mutations) occur, all leaves would not have the same state at this site, given the shown triangle to circle mutation. We can illustrate the low probability of missing a mutation in our approach, by comparing it to another probability that of not observing a mutation on a coalescent tree because it has mutated back on the same link and thus is completely unobservable. Assuming for simplicity that all polymorphisms are binary, consider for example the two links marked with **, assume they both have length t. It is easily seen that the probability that STR i mutated and reverted on either one of them is 2 exp(−2λit)(λit)2/2+O((λit)3). The probability that the triangle to circle mutation reverted back on both of them simultaneously is similarly exp (−2λit)(λit)2/2+O((λit)4), i.e. slightly smaller. If we do not assume both links have the same length, then the first probability is potentially much bigger than the second. Therefore, under reasonable assumptions, that reversion back is most likely on binary splits, our total chance of setting mik = 0 when the true value is mik > 0, is on the same order of magnitude as twice the chance that the coalescent tree contains mutations that reverted back on the same link, which are inherently unobservable (Rosset et al., 2008).
Thus, while we could not observe the Poisson mutation counts mik, we observe the binary variables:
These variables are distributed as bik ~ Bernoulli(exp(−λitk)). After formulating the partial log-likelihood [Equation (1)] of the observed data b, the ML estimation of the parameters is a binomial regression with a CLL link function. This is therefore a GLM problem (MacCullagh and Nelder, 1991), and can be solved using the standard GLM framework (Rosset et al., 2008). Note that we use observed data only on mik, whose distribution depends only on the total branch length of each Hg subtree rather than on a specific internal structure. Hence the rate estimates will not depend on the internal structure as long as the total branch length is the same. We refer to this estimation procedure as MAL1 (ML estimation with information on at least one mutation).
However, using this method we do not use all data available, specifically how many different STR states are present in each haplogroup. Using the same reasoning as above, if two different states indicate that at least one mutation occurred, three different states indicate the presence of at least two mutations, etc. (with the maximal number of STR states per Hg in our data being 9). Since we assume that the number of mutation events of STR i in Hg k, mik ~ Poisson(λitk), the probabilities in each case can therefore be written as:
Let yik be the observed number of states of STR i in Hg k. We can formulate the log-likelihood of the data y:
We refer to this estimation procedure as MAL8 (ML estimation with information on at least eight mutations). This does not fit the GLM framework, but is a convex function, and therefore has a global maximum, which can be found using standard optimization tools. Here we used the Matlab function ‘fmincon’ to find the optimum solution.
MAL1 and MAL8 procedures yield ML estimates of both the Hg tree lengths tk; k = 1,…, K, and the STR-specific mutation rates λi; i = 1,…, I. However, note that this ML solution is defined only up to a multiplication of all the λis by a constant and division of all the tks by the same constant. Thus, to complete the estimation we need to resolve this remaining degree of freedom. Here we calibrate our rates to the YHRD, setting the sum of STR rates that are in common equal. The STRs that are missing in the YHRD were multiplied by the same constant that was used for the calibration of the others.
It so happens that some of the Hgs in our data are saturated, that is, for some Hg k all the STRs have more than one state: yik > 1∀i. This happens especially for Hgs that contain many individuals. In this case tk is not estimable in our methodology (that is, the ML estimate is not finite). In order to reduce the amount of saturation we created different datasets by sub-sampling 50% of the samples belonging only to the ‘problematic’ Hgs, multiple times (n = 10). We then applied our methodology to the different datasets and generated a ‘distribution’ of estimates. Our rate estimates proved to be very robust and changed very little in the different sets of data sampled. For each rate parameter, λi, the mean of the distribution was taken to be our final estimate.
In order to assess how reliable our mutation rate estimates are, we use the Non-Parametric bootstrap (Efron and Tibshirani, 1993). Namely, we resample our data (with replacement) over and over again, 750 times, and run our estimation procedure (subsampling 10 times to get a distribution of estimates and taking the mean as the estimate λi*b) on each bootstrap sample to get a distribution of bootstrap estimates. The central assertion of the bootstrap method is that the relative frequency distribution of these λi*b's is an estimate of the sampling distribution of the true λi. Hence, we can use this distribution to estimate the bias and variance of our estimate, and consequently for constructing confidence intervals.
In order to obtain estimation of Y-STR mutation rates, we used previously published datasets comprised of Y-STR lengths of 13 Y-STRs of >3780 individuals, assigned to one of 66 haplogroups (see Section 2). By using the length distribution of Y-STRs in each haplogroup, we apply the proposed MAL1 and MAL8 likelihood maximizations (see Section 2) to obtain uncalibrated mutation rate and branch length estimates. The estimations were then calibrated according to the YHRD.
Table 2 gives estimates of the 13 Y-STR mutation rates obtained using MAL1 and MAL8. Both methods give similar results, (Fig. 3), however, since the MAL8 method uses more information, we would expect it to give better results. This is indeed the case as both the SD and the bias estimated by the bootstrap are smaller for MAL8 compared to MAL1 in 12 out of the 13 STRs. Both the bias and SD give a significant P-value of 0.006 using a sign test (Wackerly et al., 2002), indicating that MAL8 is indeed more accurate. Based on these observations we proceed using MAL8 for the estimation and inference of Y-STR rates.
Table 3 and Figure 4 show the rate estimates and confidence intervals of our method and the YHRD for each of the 13 STRs. Confidence intervals were calculated using 750 bootstrap samples, as described in the Section 2. While the rates vary considerably between the different STRs (ranging from 7 × 10−5 mutations per meiosis for DYS426 to about 3 × 10−3 for DYS19, 439 and 389B), the MAL8 predictions are similar to the YHRD measured rates, the only exception being DYS391. In seven out of 10 STRs, the confidence intervals obtained by MAL8 are a tighter than the YHRD confidence intervals.
In addition to the 10 STRs with known rates, we obtained predictions for three more Y-STRs: DYS388, DYS426, DYS457 with mutation rates of 1 × 10−3 ± 1.59 × 10−4, 7.5 × 10−5 ± 3.7 × 10−5, 7.3 × 10−4± 2.29 × 10−4 mutations per meiosis, respectively.
It has been previously suggested that STR mutation rate increases with the number of subunits (Calabrese and Sainudiin, 2005). Figure 5 shows each STR's mutation rate plotted against its modal length (the length appearing in the largest number of the samples). As can be seen there is a positive correlation, measured by Spearman's rho of 0.6713, with a significant P-value of 0.012. Thus, the estimated rates support this assertion.
In this study we show a new method for estimating Y-STR mutation rates. This method (MAL8) is based on ML estimation using partial phylogenetic information. Using data for 10 Y-STRs with previously estimated mutation rates, MAL8 obtains rates highly similar to the previously measured rates described in the YHRD (Willuweit and Roewer, 2007). This is noteworthy, as both the estimation method and the datasets used are very different.
Classical mutation rate estimation is based on counting mutations in father son pairs (or other known geneologies). This requires both a specialized dataset, and, due to the small rate of mutations, thousands of samples (YHRD uses 10 000–25 000 meioses for each STR). The MAL8 method used in this study was applied to a small, previously published dataset, consisting of only <4000 samples, which were collected for other uses and not customized for Y-STR mutation rate estimation in any way. Even on such a small dataset, the MAL8 method obtained tighter confidence intervals in most STRs, including DYS437 for which we had significantly fewer samples than the other Y-STRs (~230 samples). This emphasizes an important aspect of our approach, whose performance largely depends on the level of detail in Hg classification, more than on the actual number of samples used. Thus, our approach is expected to do well in the presence of a detailed Hg phylogeny, even with relatively small sample sizes. Note that an incorrect classification into the Hgs might cause a contradiction to the assumption of independent occurrences of mutations, but this is not a real concern, since Hg classification is done with very high confidence according to SNPs (Hammer et al., 2009; Karafet et al., 2008; The Y Chromosome Consortium, 2002). Importantly, the rate estimates depend only on polymorphisms observed, which in turn depend on the total branch lengths of each Hg subtree, rather than on a specific internal structure.
We obtained mutation rate estimates for three STRs for which, to the best of our knowledge, there is no measured mutation rate: DYS388, DYS426 and DYS457. The estimated mutation rate of DYS426 (0.0007) is significantly slower than the other estimates. Indeed, we observed in this study, as well as others in previous works (Willuweit and Roewer, 2007), that there is the large difference in mutation rates between different STRs. This has been suggested to originate from differences in STR repeat counts (a connection which we have verified above exists in our data), the length of the subunit itself, or its nucleotide composition. Obtaining better statistics on the mutation rates in different STRs may enable to shed more light on this problem. While currently >400 Y-STRs have been described (Hanson and Ballantyne, 2006; Kayser et al., 2004), only <20 Y-STRs have measured rates (Willuweit and Roewer, 2007). Since the dataset used for the MAL8 is publicly available and continues to grow, obtaining sufficient data for additional STRs may enable estimation of their mutation rates. In addition, a current limiting factor for more accurate estimation is the resolution of the haplogroup classification. Studies with detailed Hg classification based on unique event polymorphisms like SNPs would be most useful for generating accurate estimates of Y-STR rates.
A fundamental flaw of all Y-STR mutation rate estimation approaches, including our own, is the limiting assumption they make about the nature of Y-STR mutation processes. The mutation count is assumed to be a symmetric random walk, so that the probability of change in the repeat count for each STR is fixed, independently of whether the change is an increase or decrease, and of the current repeat count. These models are unrealistic as they do not allow a stationary distribution of repeat counts (Calabrese and Sainudiin, 2005), thus, eventually leading to STR length which is infinite or zero—both possibilities are obviously unreasonable. Interestingly, it has been demonstrated that use of such simplistic models is consistent with a decrease in ‘observed’ mutation rates as distance between samples increases (Calabrese and Sainudiin, 2005). We believe that this should be further investigated as a possible factor in the 3-fold gap between the genealogical and evolutionary rates (Zhivotovsky et al., 2004).
Hence, we plan future extensions for our modeling approach to allow for asymmetric rates and length dependencies. Note that this is much more complex than our current approach, since knowledge of the total mutation count mik no longer suffices to write the likelihood. At the very least, this approach requires prior knowledge about ancestral STR length in every Hg, or, possibly, a way to infer it.
Special Thanks to Amnon Amir for his helpful ideas throughout this work.
Funding: European Union (grant MIRG-CT-2007-208019 to O.R.-A. and S.R., in part); Israeli Science Foundation (grant 1227/09 to O.R.-A. and S.R., in part).
Conflict of Interest: none declared.