|Home | About | Journals | Submit | Contact Us | Français|
Genomic sequence data may be used to test hypotheses about the process of species formation. In this paper, I implement a likelihood ratio test of variable species divergence times over the genome, which may be considered a test of the null model of allopatric speciation without gene flow against the alternative model of parapatric speciation with gene flow. Two models are implemented in the likelihood framework, which accommodate coalescent events in the ancestral populations in a phylogeny of three species. One model assumes a constant species divergence time over the genome, whereas another allows it to vary. Computer simulation shows that the test has acceptable false positive rate but to achieve reasonable power, hundreds or even thousands of genomic loci may be necessary. The test is applied to genomic data from the human, chimpanzee, and gorilla.
Genomic sequence data provide information not only about population demographic processes of modern species (Wilson et al. 2003; Heled and Drummond 2008) but also about such processes in extinct ancestral species (Rannala and Yang 2003) and even about the mode and timing of the speciation process itself. Takahata (1986) pointed out that sequences from multiple genomic regions of two closely related extant species can be used to estimate the population size of their common ancestor, relying on the fact that the coalescent time in the ancestral population fluctuates over loci at random, in proportion to the ancestral population size. The sequence distance between two species is comprised of two parts, due to the evolution since the time of species separation (τ) and to the evolution during the coalescent time t in the common ancestor. Although τ is constant over the whole genome, t varies over genomic regions according to the exponential distribution with both the mean and the standard deviation (SD) equal to 2N generations, where N is the effective population size of the ancestor. Takahata et al. (1995) extended this analysis to three species, using maximum likelihood to account for uncertainties in the gene tree topology and coalescent times. The past few years have seen considerable improvements in the statistical methodology for analyzing multiple-species multiple-loci data sets, particularly concerning reconstruction of species phylogenies in presence of gene tree conflicts (for reviews, see Rannala and Yang 2008; Liu et al. 2009).
Genomic data may also shed light on the mode and timing of the process of species formation (Patterson et al. 2006; Burgess and Yang 2008). Wu and Ting (2004) argue that while the species divergence time τ may be constant over genomic regions if speciation is allopatric, with gene flow ceasing immediately at the time of species separation, τ should vary if speciation is parapatric and reproductive isolation develops gradually over a period of time. Osada and Wu (2005; see also Zhou et al. 2007) explored this idea to develop a likelihood ratio test (LRT) of the null hypothesis that τ is constant between two kinds of loci against the alternative that τ is variable. With only two species in the comparison, the test may have low power and may be very sensitive to variable mutation rates among loci. The information about variable τs over loci comes mostly from the variation, among loci, in sequence divergence between the two species. However, a large variation in sequence divergence can be explained by any of the following reasons: variable mutation rates, a large ancestral population size, and variable species divergence times. The simple model of speciation without gene flow with a large ancestral θ may explain the sequence data nearly as well as the more complex model of speciation with gene flow, so that the test will likely lack power.
The problem may be alleviated somewhat by inclusion of a close outgroup species. With three species (fig. 1), the gene tree can differ from the species tree, and such conflicts between the gene tree and the species tree provide information about the ancestral population size. The species-tree gene-tree mismatch probability is, or the probability that the sequences from species 1 and 2 do not coalesce in the common ancestor of species 1 and 2 (fig. 1) (Hudson 1983). Furthermore, the outgroup species may provide information about the relative mutation rate at the locus, so that the test may become less sensitive to mutation rate variation. For example, a large between-species distance d12 can be due to a long coalescent time in the ancestor or a high mutation rate at the locus, but if d23 and d31 are small at the locus, the former explanation becomes more likely. Of course, the gene tree topology and branch lengths involve substantial uncertainties due to lack of information in the alignment at each locus, but such uncertainties can be dealt with properly in a standard likelihood approach. Indeed, Yang (2002) implemented a maximum likelihood method for the case of three species under the simple allopatric speciation model (fig. 1). The JC model (Jukes and Cantor 1969) was used to correct for multiple hits. This is an extension of the maximum likelihood method of Takahata et al. (1995), which assumes the infinite sites mutation model. The likelihood calculation involves 2D integrals, which were calculated using Mathematica.
In this paper, I improve the computational algorithm of Yang (2002) so that it can be used for larger data sets with more loci. Numerical integration using Mathematica is slow, so I use Gaussian quadrature method instead. I then implement a new model that allows the species divergence time to vary among loci at random. The new model is compared with the old model to formulate an LRT of constant species divergence time τ1 (fig. 1). This may be interpreted as a test of the null model of speciation without gene flow against the alternative model of speciation with gene flow. Although gene flow at the early stages of allopatric speciation is imaginable, parapatric and sympatric speciation appears to be the more natural scenario of speciation with gene flow. Thus, the test may also be considered a test of the null model of allopatric speciation against the alternative model of parapatric (and sympatric) speciation. Computer simulations are conducted to assess the sampling errors in parameter estimates and to examine the false positive rate and power of the test. The method is then applied to a data set of genomic sequences from the human, chimpanzee, and gorilla (Burgess and Yang 2008).
I briefly describe the model of Yang (2002) to introduce the notation and to discuss the computational issues involved. The species tree ((1, 2), 3) is assumed known (fig. 1a), and the two ancestral species are referred to as 12 and 123. There are four parameters in the model: θ0 = 4N0μ for the ancestor 123, θ1 = 4N1μ for the ancestor 12, and two species divergence times τ0 and τ1. Here, μ is the mutation rate, N0 and N1 are the two ancestral (effective) population sizes, whereas τ0 and τ1 are species divergence times multiplied by the mutation rate.
The data consist of DNA sequences from multiple neutral loci, with one sequence from each species at each locus. It is assumed that there is no recombination within a locus and free recombination between loci. Each population is assumed to be random mating, and there is no gene flow since species separation.
Under the Jukes and Cantor (1969) mutation model, the sequence alignments at any locus i can be summarized as the counts of sites, Di = , for five site patterns xxx, xxy, yxx, xyx, and xyz, where x, y, and z are any different nucleotides. Sites with ambiguities and alignment gaps are removed. We define branch lengths b0 and b1 as the lengths of branches AB and B1, respectively, in gene tree G1 (fig. 1). Branch lengths in other gene trees are defined similarly. Given the gene tree G1 (or G0) and branch lengths b0 and b1, the probabilities of observing the five site patterns are
(Yang 1994). The conditional probabilities of data at locus i given the gene tree and branch lengths are given by the multinomial distribution as
The unconditional probability of data Di at locus i is an average over the gene trees and branch lengths (i.e., over coalescent times t0 and t1)
(Yang 2002: eq. 8). The first term in the equation corresponds to gene tree G0 and the second to the three gene trees G1, G2, and G3 (fig. 1). Note that with time measured in 2N generations, the coalescent time t has an exponential distribution (with mean 1 for two lineages or mean 1/3 for three lineages) and contributes a mutational distance of .
Finally, the likelihood is a product over all the L loci
Parameters θ0, θ1, τ0, and τ1 are estimated by numerical maximization of the log likelihood = . The numerical optimization routine used here (Yang 1997) deals with lower and upper bounds but not general linear inequality constraints such as τ1 < τ0. Thus, the transformation x1 = τ1/τ0 is used instead of τ1, with 0 < x1 < 1.
Each evaluation of the likelihood function (4) requires calculation of 2L 2D integrals. Yang (2002) used Mathematica to calculate them numerically. This was found to be reliable but quite slow. In this paper, I apply Gaussian quadrature, using the Gauss-Legendre rule (e.g., Kincaid and Cheney 2002, p. 492–501), by which a 1-D integral is approximated using a sum of K terms
where the points xi and weights wi are given according to the Gauss-Legendre rule. Note that the number of points K is not a parameter in the model but affects the computation, with a larger K producing more accurate but computationally more expensive approximation. Two-dimensional integrals can be calculated by repeated use of this approximation, with the computation proportional to K2. Tests using the data of Chen and Li (2001), which include L = 53 loci each of about 500 bp, in comparison with the results of Yang (2002), suggest that K = 8 or 16 provide adequate approximation under this model. Maximum likelihood iteration for the data set takes about 30 min using the old algorithm and ~5 s using the new one on the same PC.
The probabilities P(Di|Gk, b0, b1) of equation (2) are very small and vary over many orders of magnitude depending on b0 and b1. To avoid underflows and overflows, the highest log likelihood at the locus, max, calculated at the maximum likelihood tree topology and branch lengths, is used for scaling: the integrands of equation (3) are divided by before they are summed up (Yang 2006: eq. 9.9). Tests suggest that with this scaling, the algorithm is feasible for up to 10 kb at each locus.
This model allows the divergence time τ1 to vary among loci. Species 3 is considered an outgroup and its divergence time (τ0) from the common ancestor of species 1 and 2 is assumed to be constant. No theory appears to exist to predict how τ1 should vary among loci under a model of parapatric speciation with gene flow, so my choice here is somewhat arbitrary (see Discussion). One may use the gamma distribution but the truncation (so that τ1 < τ0) makes it awkward to interpret the model parameters. The beta distribution appears to be quite flexible and is implemented here. The density is
Here τ0, p, and q are the parameters of the distribution. The model is equivalent to assuming that the transformed variable x1 = τ1/τ0 has the familiar two-parameter beta distribution: x1 ~ beta(p, q) with 0 < x1 < 1. The distribution is uniform if p = 1 and q = 1, has a single mode if p > 1 and q > 1, and can take a variety of shapes depending on p and q. The mean of the distribution is and the variance is . For easy comparison with model M0, I use and q instead of p and q as parameters of the model, with , and 0 < q < ∞. Thus, model M1 involves five parameters: θ0, θ1, τ0, =1/τ0, and q. With this formulation, parameter q is inversely related to the variance in τ1, and the null model of constant τ1 is represented by q = ∞.
The probability of data at a locus is then
where f(Di|θ0, θ1, τ0, x1τ0) is given by equation (3) with τ1 = x1τ0, is the beta density. Under this model, the integrals are 3D, so that the computation involved in Gaussian quadrature is proportional to K3.
To let the algorithm focus on the region where the integrand is large, the integral limits in equation (7) are changed to max(0, – 5s) and min(1, + 5s), where s is the SD of the beta distribution. For the same K, the approximation to the 3-D integrals under M1 is poorer than the approximation to the 2-D integrals under M0. Furthermore, the approximation is poorer for small qs than for large qs (fig. 2). Tests suggest that K = 16 provides adequate approximation: this value is used in the simulation and analysis in this paper.
When q = ∞, model M1 reduces to the simple model of a constant τ1. The two models are thus nested and can be compared using an LRT. Let the test statistic be 2Δ = 2(1 – 0), where 0 and 1 are the log likelihood values under the two models. Because q = ∞ is at the boundary of the parameter space of model M1, the standard approximation breaks down. Instead, the null distribution is the 50:50 mixture of point mass 0 and (Self and Liang 1987). The critical values are 2.71 at 5% and 5.41 at 1% (as opposed to 3.84 for 5% and 6.63 for 1% for . The P value for the mixture is half the P value from for the same test statistic.
The information concerning ancestral θs and possible variation in divergence time τ comes mostly from the variation in the gene tree topology and branch lengths among loci. As different mutation rates can cause such variation as well, rate variation among loci may be a serious concern. Although rates may be nearly constant among neutral loci (such as the hominoid genomic data analyzed later in this paper), they may vary considerably over functional regions or protein-coding genes. Because different genes are under different selective constraints, they have different proportions of neutral mutations and different neutral mutation rates.
Following Yang (2002), an outgroup species may be used to estimate the relative rates for the loci, which may be used as constants in the likelihood calculation. If the rate for locus i is ri, the branch lengths in equations (2) and (3) are simply multiplied by ri. As the relative rates are scaled to have mean 1, parameters (θs and τs) are all defined using the average rate across all loci.
Three simulations are conducted to examine the sampling errors of the maximum likelihood estimates (MLEs) and the type-I and type-II errors of the LRT. The first simulates data under model M0 to examine the sampling errors in the MLEs of model parameters. Two sets of parameter values are used in the simulation, roughly based on estimates from the hominoids: θ0 = 0.005, θ1 = 0.005, τ0 = 0.006, τ1 = 0.004 (Burgess and Yang 2008) and from the mangroves: θ0 = 0.01, θ1 = 0.01, τ0 = 0.02, τ1 = 0.01 (Zhou et al. 2007). The JC69 mutation model, with constant rate among loci, is used both to simulate and to analyze the data. Given the parameter values, the probabilities of the five site patterns are calculated using equation (1) and the counts of sites at each locus (ni0, ni1, ni2, ni3, ni4) are generated by sampling from the multinomial distribution. Each locus has 500 sites. Each replicate data set consists of L loci, which are analyzed to obtain the MLEs of the parameters under model M0. The number of replicates is 1,000.
The means and SDs of the parameter estimates under model M0 are listed in table 1. For the hominoid parameter set, estimates of θ0 and θ1 are quite poor with L = 10 loci, although τ1 is well estimated. Estimates of θ1 have a positive bias. The fact that θ1 is more poorly estimated than θ0 may seem counterintuitive as one might expect it to be easier to estimate parameters for recent ancestors (such as θ1) than for ancient ancestors (such as θ0). Nevertheless, this expectation may not be correct. For the hominoid parameter set, the two speciation times are close, so that there was little chance for coalescent events to occur during that time interval, which would provide information about θ1. With 100 or 1,000 loci, all parameters are well estimated.
For the mangrove set, the parameters are greater so that the sequences are more informative. Indeed, even with L = 10 loci, all parameters except θ0 are well estimated. The difference in the overall performance of the method between the two parameter sets appears to be mainly due to the different mutation rates (i.e., larger values of θ and τ for the mangrove set). The more accurate estimation of θ1 for the mangrove set may also be due to the larger time interval between the two speciation events and thus more chances for coalescent events during that time interval: the probability of gene tree G0 is for the mangrove set and 0.86 for the hominoid set. For both sets, the results are consistent with the expectation that a 10-fold increase in the number of loci leads to -fold reduction in the SD.
The second simulation examines the type-I error rate of the LRT implemented in this paper. Data are simulated under model M0 using the two sets of parameter values (for hominoids and mangroves). Each locus has 500 bp. The number of replicates is 200. Each replicate data set is analyzed using models 0 and 1 to calculate the test statistic 2Δ = 2(1 – 0). The results are shown in table 2, with the significance level set at 5%. The test appears to be conservative, with the false positive rate <5%, when the data contain little information (i.e., when L = 10 or 100 for the hominoid set and when L = 10 for the mangrove set). With more loci or with a higher mutation rate, the false positive rate becomes close to the nominal 5%.
The third simulation examines the power of the LRT. Data are simulated under model M1, using q = 1.2 (which is the estimate from the hominoid data; see below). As before, two sets of parameter values for θ0, θ1, τ0, and τ1 are used. Again each locus has 500 sites, and the number of replicates is 200. The results are shown in table 2. For the hominoid set, the test has virtually no power (<5%) with L = 10 or 100 loci and moderate power (52%) when L = 1,000. For the mangrove set, the power is quite high (78%) with 100 loci and reaches 100% when L = 1,000. The large difference between the two parameter sets lies mainly in the near 2-fold difference in mutation rate and the information content in the sequence data. Longer sequences in each alignment are expected to improve the power just like a higher mutation rate (Felsenstein 2005), but this effect is not evaluated here.
Here, I apply the LRT to the genomic sequences of the human, chimpanzee, and gorilla from Burgess and Yang (2008). These data are an updated version of the data of Patterson et al. (2006), updated and recurated by Burgess and Yang (2008) to incorporate more recent genome assembly sequences and to generate high-quality alignments of genomic regions instead of single variable sites. Filters were applied to remove the error-prone ends of whole-genome shotgun reads, as well as coding regions, repeats, RNA genes, and low-complexity regions. As the model assumes free recombination between loci and no recombination within locus, the data were filtered so that each locus (genomic region) was at least 1 kb away from known genes, and every two loci had a minimum separation of 10 kb. The resulting “neutral” data set comprised 14,663 autosomal loci and 783 X-linked loci for five species: human (H), chimpanzee (C), gorilla (G), orangutan (O), and macaque (M). The mean locus length was 508 bp. The likelihood method of this paper can analyze three species only, so the human, chimpanzee, and gorilla sequences are used. To test the impact of mutation rate variation among loci, the orangutan sequence is used as the outgroup to calculate relative mutation rates for the loci (Yang 2002). Thus, some loci at which the orangutan sequence is missing are excluded in the analysis, leaving 9,861 autosomal loci and 510 X loci. The data for the 22 human autosomal chromosomes are analyzed separately and are then combined in one analysis. Sites with alignment gaps and ambiguity nucleotides are removed. Although Burgess and Yang (2008) modeled sequencing errors and violations of the molecular clock, those factors were found to have only minor impact on estimation of parameters concerning the human, chimpanzee, and gorilla in the analysis of the curated data (compare tables 2 and 5 in Burgess and Yang 2008). In this paper, sequencing errors are ignored and the molecular clock is assumed.
The results of the LRT are shown in table 3. When the mutation rate is assumed to be constant among loci, the test is significant at 3 out of the 22 autosomes. Using the relative rates calculated from comparison with the orangutan, the test is significant at 6 out of the 22 autosomes, as well as for the X chromosome. If the apparent variation in τHC among loci is due to mutation rate variation, accounting for variable mutation rates among loci should lead to a reduction in the number of significant results. Thus, there seems to be little evidence for variable rates among loci in those data (for the similarity of parameter estimates under the basic model and the variable-rates models, see also Burgess and Yang 2008; table 2), and the LRT is not misled by possible rate variation among loci in this analysis. The average rates for the 22 autosomes, as indicated by the average JC69 distance between HCG and the orangutan, are very homogeneous (table 3), indicating little rate differences among the chromosomes.
When all the 9,861 autosomal loci are used in the same analysis, the LRT is highly significant whether the mutation rate is assumed to be constant or variable across loci. There is thus evidence for variable τ1 over the genome. This model, although not so extreme as the large-scale hybridization model envisaged by Patterson et al. (2006), is incompatible with the simple model of a constant τ over the genome. The evidence should perhaps not be considered overwhelming, given the huge number of loci used in the test. It has been suggested that the LRT tends to reject the null model too often in large data sets and that the Bayesian method may provide a more accurate assessment of the evidence in the data concerning the models (e.g., Schwarz 1978). A Bayesian implementation of the same test would make it possible to compare the different methodologies using the same data.
Maximum likelihood estimates of the parameters under model M0 obtained from the analysis of all the autosomal loci are as follows: = 0.00358 ± 0.00008, = 0.00431 ± 0.00025, = 0.00661 ± 0.00004, = 0.00432 ± 0.00007. The standard errors (SEs) are very small due to the large size of the data. The estimates of τHC and τHCG are very similar to those of Burgess and Yang (2008), although those of θHC and θHCG are more different (table 3). I analyzed the same data using the Bayesian program of Rannala and Yang (2003), using the gamma prior G(2, 2,000) for all θs and τHCG ~ G(2, 300) with mean 0.0067. The means and SDs of the posterior distribution are = 0.00360 ± 0.00008, = 0.00419 ± 0.00027, = 0.00660 ± 0.00004, = 0.00435 ± 0.00008. These are virtually identical to the MLEs and SEs obtained in the likelihood analysis. Further tests (supplementary table S1, Supplementary Material online) suggest that the differences between the MLEs of this paper and the Bayesian estimates of Burgess and Yang (2008) are not due to different estimation methods or to removal of some loci or of sites with ambiguous nucleotides: instead they are due to exclusion of orangutan and macaque in the present data set. The posterior mean of θHC is about 0.0042 in the HCG data sets but 0.0060 in the HCGO data sets and 0.0064 in the HCGOM data sets. The reasons for those differences are unclear. The molecular clock assumption is most likely violated when the macaque is included in the analysis. However, accommodating the higher rate in the macaque lineages was found to have very minor impact on estimates of θHC, clearly insufficient to explain the differences observed here (Burgess and Yang 2008: table 2e). Estimates of the other parameters are all very similar in the different data sets and analyses, with the posterior means to be in the range 0.0033–0.0036 for θHCG, 0.0063–0.0068 for τHCG, and 0.0039–0.0043 for τHC (supplementary table S1, Supplementary Material online). Similar patterns are noted for the X chromosome loci (supplementary table S2, Supplementary Material online). Estimates of θHC was 0.0014–0.0016 in the HCG data sets but 0.0023 in HCGO and 0.0026 in HCGOM data sets. Inclusion of orangutan and macaque also caused the estimates of θHCG to become smaller, with the posterior means to be 0.0029–0.0030 in the HCG, 0.0024 in the HCGO, and 0.0020–0.0022 in the HCGOM data sets.
Table 4 shows the correlations between parameter estimates in the analysis of the hominoid autosomal loci. Estimates of θHCG and τHCG are strongly correlated, as are those of θHC and τHC. As in the simulated data sets (table 1), θHCG is more precisely estimated than θHC.
The estimates under model M1 from analysis of all the autosomal loci are = 0.00367 ± 0.00008, = 0.00137 ± 0.00014, = 0.00657 ± 0.00004, = 0.00530 ± 0.00006, and = 1.189 ± 0.068 for the beta model of τHC variation. The estimated q is rather small, consistent with the rejection of the null model M0 by the LRT. The estimated beta distribution beta(5.0, 1.2) is shown in figure 3, which implies that for most genomic regions, τHC is near τHCG or if there were migrations at the time of separation of the human and chimpanzee, the gene flow had ceased a long time ago. As expected, estimates of q and θHC are strongly negatively correlated (table 4).
Here, we discuss several factors that may cause the species divergence time τ to vary over the genome and speculate on their implications to the LRT developed in this paper. First, as discussed in Introduction, gene flow during parapatric or sympatric speciation can cause variation in τ over genomic regions. Similarly, variable τs can be caused by introgression (secondary contact) following allopatric speciation in which reproductive isolation is established without gene flow. It appears very difficult to distinguish between those two scenarios, especially if introgression occurred soon after the initial speciation. No distinction is made between the two in the LRT of this paper. Thus, caution should be exercised in the interpretation of the LRT, as the statistical evidence for variable τs over the genome is compatible with both parapatric speciation with gene flow and allopatric speciation without gene flow followed by secondary contact. In their evaluation of the IM program (Hey and Nielsen 2004), Becquet and Przeworski (2009) considered secondary contact as a version of the null hypothesis of allopatric speciation without gene flow (see their fig. 1E) and regarded the detection of gene flow by IM as a false positive error. Here, both parapatric speciation and introgression are considered the alternative hypothesis of speciation with gene flow or different scenarios of the complex speciation model (Patterson et al. 2006).
It is not so clear how τ should vary across the genome when speciation is parapatric and in presence of gene flow. Different models exist that predict the accumulation of genomic incompatibilities over time after one ancestral population splits into two (for reviews, see Turelli et al. 2001; Coyne and Orr 2004; Gourbière and Mallet 2010). Incompatibilities may involve a single locus (i.e., heterozygous disadvantage) or multiple loci. The latter type is known as Dobzhansky–Muller (D-M) incompatibility, which reduces hybrid fitness due to epistatic effects of independent substitutions in different genes since the separation of the two populations. This appears likely to be more important than single-locus incompatibilities. The “snowball” model (Orr 1995; Orr and Turelli 2001) predicts that D-M incompatibilities accumulate at least as fast as τ2, where τ is the species separation time. The prediction, however, is based on the assumption that many genes are involved in D-M incompatibility and that any pair of genes might interact to create an incompatibility. Different dynamics such as linear accumulation of incompatibilities over time may result from different model assumptions (Kondrashov 2003; Kirkpatrick and Barton 2006; Gourbière and Mallet 2010). In addition to the different predictions of the accumulation of incompatibilities, it is unclear how incompatibilities affect fitness and how the linear or quadratic accumulation of incompatibilities should be translated into a reduction of migration rate and gene exchange over time and to a probability density function f(τ), which describes the variation of divergence time between the two species across the genome. Intuitively f(τ) should be single-moded if incompatibilities accumulate gradually, leading to gradual reduction of the migration rate: if t0 is the inception of species separation and t1 is the time when gene flow has completely ceased, the density f(τ) should be >0 in the interval t0 < τ < t1 only. It may be noted here that the model of variable τs over loci is only a heuristic approximation to the model of speciation with gene flow, as the process cannot simply be described by variable τs among loci. However, the null model in the LRT is correctly formulated, so that the heuristic nature of the alternative model should affect the power of the LRT but should not cause excessive false positives. A more accurate formulation of the model should consider migrations between the two populations, perhaps with the migration rates changing over time, as well as coalescent events within the two populations and their common ancestor. A Bayesian Markov chain Monte Carlo (MCMC) algorithm appears necessary to implement such a model, by extending the work of Rannala and Yang (2003) and Hey and Nielsen (2004).
In the case of introgression following initial allopatric speciation, as envisaged by Patterson et al. (2006) in their complex speciation model, one should expect τ to have a bimodal distribution. Even though the beta distribution cannot accommodate two modes, it appears appropriate to apply the LRT of this paper to test for introgression against the null hypothesis of a constant τ across the genome.
Another important factor that can cause variable species divergence times over the genome is natural selection. In this regard, it should be noted that the model developed here assumes neutral evolution of gene sequences and may not be suitable for analysis of gene loci under selection. If a locus is under the same purifying selection in different species and the effect is simply to remove strongly deleterious mutations, the strict neutral model may be a reasonable approximation of the evolutionary process at the locus although with a reduced neutral mutation rate. Most housekeeping genes appear to fit this description as they perform the same function in closely related species and are under similar selective constraints. Use of such genes in the analysis appears justifiable (Ebersberger et al. 2007). The same may apply to neutral loci undergoing background selection because of their linkage to genes under purifying selection (Charlesworth et al. 1993; Nordborg et al. 1996). If the strength of background selection and the recombination rates are similar across species, background selection will have similar effects in different lineages, reducing both diversity and divergence, and the overall effect will be similar to a reduction of mutation rate at the neutral locus.
Although purifying and background selection may have similar effects in different species and thus not cause serious problems to the LRT, positive selection often operates in different ways in different species. For example, ecological adaptations may be highly species specific (Swanson and Vacquier 2002b; Orr et al. 2004). The method developed here is not suitable for analyzing genes under positive selection or genes that cause reproductive isolation or are otherwise involved in the speciation process (Orr et al. 2004; Wu and Ting 2004). Studies of such genes my provide great insights into the speciation process, but their analysis requires different molecular evolutionary tools, such as methods for measuring and testing the strength of positive Darwinian selection (Yang et al. 2000; Swanson and Vacquier 2002a).
Another factor that may cause violations of model assumptions made in the LRT is the population demographic process. Population subdivision in the ancestor may be expected to lead to an increased effective ancestral population size (i.e., large estimates of θ1) rather than variation in τ and thus may not cause excessive false positives in the LRT. This was the result found by Becquet and Przeworski (2009: fig. 1C) in their evaluation of the IM program, and the LRT of this paper may be expected to behave in similar ways. The impact of population size fluctuation such as bottlenecks in the ancestor is less clear: it may likely affect the ancestral population size (θ1) rather than causing τ to vary among loci.
It may be noted that the conceptual framework of the model of variable τ among loci implemented in this paper is similar to the test of simultaneous species divergences across pairs of sister species, due to a particular geological event, such as the forming of the Isthmus of Panama (Hickerson et al. 2006; Hurt et al. 2009). Such analyses have to overcome similar difficulties such as the confounding effects of variable mutation rates among loci and the strong correlation between the divergence time of the species pair and the ancestral population size. In addition, the ancestral populations of the different sister species have different sizes and separate parameters may have to be used for them. Violation of the molecular clock (i.e., variable rates between the species pairs rather than within each species pair) may complicate the analysis even further. Data of multiple loci from multiple individuals appear necessary to address this problem, although Hickerson et al. (2006) analyzed only one mitochondrial locus and were much more optimistic.
In an analysis of variable sites in the genomes of the human (H), chimpanzee (C), gorilla (G), orangutan (O), and macaque (M), Patterson et al. (2006) suggested that the human–chimpanzee speciation process might have been complex and have involved introgression after the initial separation of the two species. This controversial hypothesis was based on two major pieces of evidence: the large fluctuation of H-C sequence divergence throughout the genome and a dramatic reduction in H-C sequence divergence on the X chromosome. Here, we discuss the implications of the results of this paper to that controversy (see also Barton 2006; Burgess and Yang 2008; Wakeley 2008).
The large fluctuation of H-C divergence could be explained by a large ancestral population size (or large θHC) (Barton 2006). Indeed, Burgess and Yang (2008) estimated the HC ancestral population to be ~10 times as large as the modern human population, consistent with early estimates (e.g., Takahata and Nei 1985; Hobolth et al. 2007). More generally, θ estimates for ancestral species have been noted to be much larger than for modern species in many species groups (e.g., Satta et al. 2004; Won et al. 2005; Zhou et al. 2007). A number of authors have suggested that population subdivision in the ancestors may have generated the large effective population sizes (e.g., Osada and Wu 2005; Becquet and Przeworski 2007; Zhou et al. 2007). However, there does not appear to be any evidence that most ancestral species were subdivided, whereas modern species are not. Thus, those large estimates of ancestral θs may be a methodological artifact, due to, for example, gene flow around the time of speciation, as suggested by the LRT of this paper for the hominoid data. If the speciation process is often “unclean,” the exchange of migrants would cause large variations in the sequence divergence times, leading to large estimates of ancestral θs under models that do not accommodate gene flow.
Yet another explanation is the differential reduction of diversity at neutral loci due to background selection (Charlesworth et al. 1993). McVicker et al. (2009) found that both diversity within the human population and divergence between the human and chimpanzee are reduced at putative neutral sites close to exons and other conserved elements, with greater reduction at sites closer to exons. The authors estimated a 19–26% reduction in human diversity at neutral sites due to background selection. However, background selection may not be very important to the hominoid data analyzed here and by Burgess and Yang (2008) because these data were filtered so that every locus is >1 kb away from known genes, whereas McVicker et al. (2009) included putatively neutral sites that are often very close to exons. In another study where sites near genes (within 5 kb of transcripts and within 1 kb of exons) were excluded, the estimated reduction in diversity was small (6%) (Cai et al. 2009). Furthermore, the background selection considered by McVicker et al. (2009) should reduce both diversity and divergence, so that its effect should be similar to reduction of the mutation rate for the neutral locus. This effect has been considered and found to be unimportant for the hominoid data by Burgess and Yang (2008).
A second major observation that may be inconsistent with a simple model of human–chimpanzee speciation is the extreme reduction in the H-C sequence divergence (but not in the H-G divergence) on the X chromosome. Note that many factors can contribute to this reduction. 1) The mutation rate is higher in males than in females (Haldane 1935; Li et al. 2002; Ellegren 2007), resulting in a mutation rate difference between the X chromosome and the autosomes (μx/μA < 1). 2) The X and A loci have different effective population sizes, with Nx/NA = ¾ for a 1:1 sex ratio. 3) Processes such as introgression may have caused the H-C species divergence time to differ between the X and autosomal loci. The analysis of Burgess and Yang (2008), under models of constant τHC across the genome, suggested that the reduction in H-C sequence divergence on the X was mostly due to a reduced population size (θHC) rather than a reduced species divergence time (τHC).
To explore the contributions of the various factors to the reduced H-C divergence on the X, we calculated the X/A ratios of θ and τ estimates for the different ancestors (supplementary table S3, Supplementary Material online), following Burgess and Yang (2008). The sensitivity of estimates of parameters such as θHC to the inclusion or exclusion of orangutan and macaque is intriguing and causes the X/A ratios of θ and τ estimates to depend on the data sets as well. Furthermore, the number of X loci is relatively small, so that parameter estimates for the X chromosome involve considerable sampling errors. One may expect that the HCG and the HCGO data sets are less affected by violations of the molecular-clock assumption or by genomic rearrangements that may alter the neutral mutation rate. For example, the structure of the X chromosome appears to be conserved in all the great apes (Muller and Wienberg 2001; Stanyon et al. 2008). Thus, we focus on the HCG and HCGO data sets and on the large data sets with smaller sampling errors. The τx/τA ratio was very consistent in the different analyses and data sets, being 0.85 for the HCG ancestor and 0.82 for all others, so the estimate 0.83 used by Burgess and Yang (2008) appears reliable. Note from μx/μA = 0.82 and 0.84, one obtains the male/female mutation rate ratio α = μM/μF = 3.3 and 2.8, respectively. This consistency implies that the male/female mutation rate most likely stayed constant among the hominoid ancestors (cf. Wakeley 2008). The θx/θA ratio for the HCG ancestor varies among the data sets used, at about 0.80 in the HCG data sets, 0.70 in the HCGO data sets, and 0.65 in the HCGOM data sets. Divided by the rate ratio μx/μA = 0.83, these θ ratios translate to the population size ratios Nx/NA = 0.96, 0.84, 0.78, all higher than the expected ¾. The θx/θA ratio for HC is about 0.38–0.40, which implies Nx/NA = 0.40–0.48, much lower than ¾. Those calculations are affected by the limited number of loci on the X chromosome and the large sampling errors in the θ estimates for the X. Future studies may benefit from including more X loci and from the analysis of the genomic sequences for the Y chromosome from the chimpanzee (Hughes et al. 2010) and other great apes.
Presgraves and Yi (2009) suggest that the variation in male mutation rate among the great apes caused by different mating systems and different intensities of sperm competition may explain the data. Sperm competition is expected to be weak or absent in gorillas and orangutans but intense in chimpanzees with humans to be intermediate. The authors estimated α to be about 2.8–3.6 for humans and 5.1–5.8 for chimpanzees in a data set involving HCGM, consistent with the sperm-competition hypothesis, where estimates of a are 3.3–4.1 for humans and 3.0–3.8 for chimpanzees. Estimates of α for the gorilla and orangutan were around 1.2–1.7 and 1.6–1.8, respectively. However, the authors’ estimation procedure is somewhat simplistic. It fixes θs for the different ancestors at the same values and does not account for variation in gene genealogies across the genome. If the hypothesis of sperm competition is true, one would expect the X loci to evolve at more homogeneous rates among lineages, whereas the molecular clock should be violated at the autosomal loci, with the chimpanzee having the highest rate and the gorilla the lowest rate. However, those expectations are not supported by the average sequence distances between those species calculated by Burgess and Yang (2008: table 1). The gorilla had the largest distance (or highest rate) compared with the human and chimpanzee at both the A and X loci, apparently because of the high sequence errors in the gorilla sequence. The human and chimpanzee distances were very close at both the autosomal and X loci (dHO = 0.0346 vs. dCO = 0.0348 for autosomal loci and dHO = 0.0278 vs. dCO = 0.0277 for the X loci).
Pool and Nielsen (2007) showed that demographic processes such as population bottlenecks may have disproportional effects on the diversity of autosomal and X-linked loci and that as a result, the Nx/NA ratio may deviate from the expected 3/4. Thus, a bottleneck in the HC ancestor could cause Nx/NA to be smaller than ¾. However, this effect appears small for parameter values reasonable for the hominoids. Furthermore, although bottlenecks are generally acknowledged to have occurred in modern humans when humans migrated out of Africa, there is yet no known evidence for bottlenecks in the HC ancestor, and instead, the large θHC estimates for the autosomal loci appear inconsistent with such bottlenecks.
In sum, the process of human–chimpanzee speciation remains poorly understood. The large ancestral population sizes (large θs) may reflect biological reality such as population subdivision in the ancestral species but may also be an artifact of the estimation procedure because of model violations. One important such violation is gene flow around the time of speciation, which elevates the variance in the H-C sequence divergence times and leads to large estimates of ancestral θs. The severely reduced H-C divergence on the X chromosome is intriguing, as is the sensitivity of estimates of certain parameters to the inclusion or exclusion of the orangutan and macaque sequences. Analysis of more data from the X chromosome and of the Y genomic data may shed light on the issue.
The use of maximum likelihood without the need for priors may be considered an advantage of the method. Nevertheless, the current implementation is limited to three species, with one sequence from each. The likelihood computation involves 3-D integrals under model M1, which seems near the limit of computational feasibility. Every additional sequence would mean an extra dimension in the integral. This “curse of dimension” makes it difficult to extend the present model to more species or more sequences. In this regard, Bayesian MCMC methods offer a clear advantage, and it should be straightforward to implement the same model in the framework of Rannala and Yang (2003).
This paper originated from discussions with Chung-I Wu and benefited from discussions with Jim Mallet. I thank Jim Mallet, David Reich, and two anonymous referees for many constructive comments. Z.Y. is a Royal Society Wolfson Merit Award holder.