PLoS Genet. 2008 September; 4(9): e1000198.

Published online 2008 September 19. doi: 10.1371/journal.pgen.1000198

PMCID: PMC2529407

Gil McVean, Editor

University of Oxford, United Kingdom

* E-mail: jjensen/at/ucsd.edu

Conceived and designed the experiments: JDJ KRT PA. Performed the experiments: JDJ KRT PA. Analyzed the data: JDJ KRT PA. Wrote the paper: JDJ KRT PA.

Received 2008 January 24; Accepted 2008 August 13.

Copyright Jensen et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

This article has been cited by other articles in PMC.

The recurrent fixation of newly arising, beneficial mutations in a species reduces levels of linked neutral variability. Models positing frequent weakly beneficial substitutions or, alternatively, rare, strongly selected substitutions predict similar average effects on linked neutral variability, if the product of the rate and strength of selection is held constant. We propose an approximate Bayesian (ABC) polymorphism-based estimator that can be used to distinguish between these models, and apply it to multi-locus data from *Drosophila melanogaster*. We investigate the extent to which inference about the strength of selection is sensitive to assumptions about the underlying distributions of the rates of substitution and recombination, the strength of selection, heterogeneity in mutation rate, as well as the population's demographic history. We show that assuming fixed values of selection parameters in estimation leads to overestimates of the strength of selection and underestimates of the rate. We estimate parameters for an African population of *D. melanogaster* (*ŝ*~2E−03, ) and compare these to previous estimates. Finally, we show that surveying larger genomic regions is expected to lend much more discriminatory power to the approach. It will thus be of great interest to apply this method to emerging whole-genome polymorphism data sets in many taxa.

Understanding the process of adaptive evolution requires quantifying the extent to which beneficial mutations contribute to differences between species. However, fundamental parameters of adaptation, such as the rate and strength of beneficial mutations, are poorly understood and have historically been difficult to estimate from data. In particular, distinguishing a high rate of weakly selected substitutions from a low rate of strongly selected substitutions has been problematic. Here, we introduce a new method to estimate the parameters of adaptive evolution from multi-locus population genetic data. We conduct simulations to show that this method is able to discriminate the rare/strong model from the frequent/weak model. Applying this method to an African population sample of *Drosophila melanogaster*, we estimate selection parameters and find that recurrent adaptive evolution has reduced genome variability by ~50% on average. The availability of genome-scale population genetic data will lend considerable discriminatory power to the approach. Thus, this new approach represents an important step towards characterizing the nature of adaptive evolution in natural populations.

The fixation of beneficial mutations can strongly reduce levels of closely linked neutral variation – the so-called genetic hitchhiking effect [1]. This prediction has been used to search for positive selection by looking for regions of the genome with reduced variability [*e.g.*, 2]. The hitchhiking model most often used is of a single selective sweep, where the location and timing of selection are assumed to be known [3]. This single sweep model has been of great value in understanding the effect that a single selective event has on patterns of polymorphism, as a function of the strength of selection and location of the beneficial mutation [*e.g*., 1],[4],[5]. However, this model is somewhat disconnected from the problem of detecting selective sweeps in the genome, for which locations and timings are not known *a priori*, and should be treated as random variables.

Kaplan *et al.* (1989) described a “recurrent hitch-hiking” (RHH) model, where the expected number of sweeps (per base pair, per 2*N* generations) is *2Nλ* with sweeps occurring at random locations in the genome [6]. The RHH model is most commonly considered for the case of genic selection on new mutations entering the population [*e.g.*, 6]–[8]. Under this model, several patterns expected under the single sweep model no longer apply. For example, the single sweep model predicts coalescent histories with long internal branches, as some lineages may escape the recent coalescent event via recombination. This results in the widely employed prediction of an excess of high-frequency derived alleles flanking the fixed site [5]. Under RHH models however, the probability of such a history is small, as sweeps are on average old and high frequency derived mutations have thus likely drifted to fixation [9].

Wiehe and Stephan (1993) showed that under a RHH model, for a given recombination rate, the expected level of heterozygosity at linked sites relative to neutral expectations is dependent upon the compound parameter (*s*)(2*Nλ*), where 2*Nλ* is the rate of fixation of beneficial mutations and *s* is the average strength of selection [7]. This result implies that the two parameters are confounded (much like the effective population size, *N _{e}*, and mutation rate,

A number of methods have been proposed for quantifying *s* and 2*Nλ* (separately) using divergence and polymorphism data [*e.g.*, 11]–[12], [14]–[17]. These approaches typically make strong assumptions regarding the possible distribution of selection coefficients, the number of adaptive substitutions between species, or the timing of selection. For example, Li and Stephan (2006) examined 250 non-coding regions from an East African population of *D. melanogaster* [18]. Using a likelihood approach, they estimate that approximately 160 beneficial mutations have fixed in this population over the last ~60,000 years (corresponding to ), with mean selection coefficient *ŝ*~0.002. This inference is achieved by effectively assuming that the timing of all sweeps is known (and the time since the sweep, *τ*=0). Under a recurrent sweep model, this assumption may bias the estimation of *s* and *2Nλ*. Additionally, as this method relies on first fitting a demographic model to non-coding DNA polymorphisms, it is possible that the effects of purifying selection on the site frequency spectrum of non-coding DNA [19]–[20] may strongly affect the estimates.

Using synonymous polymorphism data in *D. melanogaster*, and divergence to *D. simulans*, at 137 X-linked loci, Andolfatto (2007) employed a maximum likelihood approach to estimate the joint parameter *2Nλs*, followed by a McDonald-Kreitman-based method to separately estimate *2Nλ* and *s* [11]. Based on these calculations, Andolfatto estimated that most beneficial amino acid substitutions are very weakly advantageous on average (with average *ŝ*~1.2E−5 and ). Macpherson *et al.* (2007), using polymorphism data from *D. simulans* (and divergence to *D. melanogaster*), propose a method to infer the rate and strength of selection from the spatial scale of variation in polymorphism and divergence [12]. In contrast to Andolfatto's estimates, Macpherson *et al.* estimate a much stronger average selection coefficient (*ŝ*~0.01) and less frequent selection (). However, they note that their method is more likely to detect strong selection, so the effects of many weakly beneficial mutations may be missed.

By evaluating a wide array of recurrent selection models across a variety of sampling schemes, with parameters relevant for both Drosophila and human populations, we demonstrate here that there are differences in the predictions of weak and strong selection models, both in the spatial distribution of variability levels and the distribution of polymorphism frequencies (also called the site frequency spectrum, hereafter SFS). We propose a polymorphism-based approximate Bayesian (ABC) estimator that is most closely allied to the approach of Macpherson *et al.* (2007), but is also applicable to sub-genomic multi-locus data of the kind that has most often been collected [*e.g.*, 11], [21]–[22], and incorporates more information from the data. Fundamentally, this estimation procedure is based on the principle that while models may predict the same average effects, the variance of many common summary statistics varies greatly between models. We show that highly accurate estimation will be possible with large-scale genome polymorphism data, and that the approach is robust to both mutation and recombination rate heterogeneity.

As pointed out by Macpherson *et al.* (2007), there is reason to anticipate that region size may be key in uncoupling the strength of selection (*s*) from the rate of beneficial fixation (2*Nλ*) (see Table 1 for a summary of terms). Intuitively, because only a very strong sweep is capable of severely reducing variation across larger regions (on the order of 100 kb, for instance), regions may be observed with very little variation under a strong selection model. However, because strong selection is rare, other regions will appear close to neutral. Conversely, weak selection serves to homogenize variation as it occurs with much greater frequency. For example, for an effective population size of 10^{6} and *ρ*=4*Nr*=0.1/bp, the expected waiting time between sweeps is 68,000 generations, for *s*=1E−04 and 2*Nλ*=5E−04, for a region size of 10^{4} base pairs. For the same population parameters, but *s*=0.01 and 2*Nλ*=5E−06, the expected waiting time between sweeps is 532,000 generations. Considering that most signatures of selection are dissipated by 400,000 generations for these parameters [9],[23], this demonstrates that if selection is strong and rare on average, there will likely be a large variance across the genome, from strongly swept to essentially neutral looking regions (Figure 1). Capturing this variance is dependent upon the size of the sampled region: while many values of *s* may reduce variation in a 500 bp region, for instance, only large selection coefficients are capable of reducing variation across a 100 kb region, suggesting that larger region sizes should afford greater discriminatory power.
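The waiting-time comparison can be made concrete with a short calculation. The sketch below uses only the values quoted in this paragraph, and it additionally assumes that waiting times between sweeps are exponentially distributed (a Poisson process), which is a simplification introduced here for illustration. The function name `prob_recent_sweep` is hypothetical.

```python
import math

# Values quoted in the text (N_e = 1e6, rho = 0.1/bp, 10 kb region).
models = {
    "weak/frequent": {"s": 1e-4, "two_n_lambda": 5e-4, "wait": 68_000},
    "strong/rare": {"s": 1e-2, "two_n_lambda": 5e-6, "wait": 532_000},
}
DISSIPATION = 400_000  # generations after which most sweep signatures are gone


def prob_recent_sweep(mean_wait, window):
    """P(at least one sweep in the last `window` generations).

    Assumes exponentially distributed waiting times between sweeps,
    which is an assumption of this sketch, not a statement from the text.
    """
    return 1.0 - math.exp(-window / mean_wait)


for name, m in models.items():
    product = m["s"] * m["two_n_lambda"]  # compound parameter 2N*lambda*s
    p = prob_recent_sweep(m["wait"], DISSIPATION)
    print(f"{name}: 2N*lambda*s = {product:.1e}, P(recent sweep) = {p:.3f}")
```

Under this assumption, both models share the same product 2*Nλs*, yet nearly every 10 kb region would carry a detectable sweep signature in the weak/frequent case, against only about half of the regions in the strong/rare case. That difference is what generates the larger variance among regions.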

In order to more precisely determine this ‘region size’ effect, we examined 500 bp, 1 kb, 2 kb, 5 kb, 10 kb, 25 kb, 50 kb, and 100 kb regions using simulated data (Figure 2A). First examining *L*=500 bp regions (matching existing empirical datasets, *e.g.*, [11],[21]), we observe that there is relatively little difference in the coefficient of variation (CV) of *π* between RHH models of strong and weak selection (Figure 2), consistent with previous observations that *s* and *2Nλ* are difficult to estimate separately with data of this kind [13].

Examining larger regions, the CV is essentially unchanged under weak selection models once regions larger than 25 kb have been sequenced. Conversely, the CV continues to grow rapidly under a strong selection model, producing a four-fold difference in the CV at 50 kb of sequence relative to weak selection models, and over a five-fold difference at 100 kb, for Drosophila-like parameters (*θ*=0.01/site; *ρ/θ*=10). The difference between strong and weak selection models in Figure 2 does not appear to be attributable to the total amount of surveyed sequence between the 100 kb and 500 bp regions. By comparing the distribution observed when considering ten 100 kb regions vs. two thousand 500 bp regions (and thus the same number of segregating sites on average) we still observe a large difference in CV at the scale of 100 kb, and little difference between models at the scale of 500 bp (results not shown).
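For readers who want to reproduce the statistic itself, the following is a minimal sketch of the coefficient of variation of *π* across regions. The per-region values are hypothetical and chosen only so that both sets have the same mean; the use of the sample standard deviation (n−1 denominator) is an assumption, as the text does not specify which form is used.

```python
import statistics


def coefficient_of_variation(values):
    """CV = standard deviation / mean, computed across regions."""
    mean = statistics.fmean(values)
    return statistics.stdev(values) / mean  # sample SD (n - 1 denominator)


# Hypothetical per-region diversity values, both sets with mean 0.006.
homogeneous = [0.006, 0.0055, 0.0065, 0.006, 0.0062, 0.0058]  # frequent weak sweeps
patchy = [0.010, 0.0005, 0.010, 0.0095, 0.0005, 0.0055]       # rare strong sweeps

cv_homogeneous = coefficient_of_variation(homogeneous)
cv_patchy = coefficient_of_variation(patchy)
print(cv_homogeneous, cv_patchy)
```

The two sets have identical mean diversity, so a mean-based statistic cannot separate them, while the CV of the patchy set is more than ten times larger.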

We found that the relative point at which the region size benefit plateaus is a function of *θ*, *ρ*/*θ*, 2*Nλ* and *s*. We examined the effect of doubling the recombination rate (such that *ρ/θ*=20), and find that the CV is reduced under all models relative to *ρ/θ*=10, and that the models begin to differentiate at smaller region sizes (Figure 2B). These effects are a result of the fact that the expected size of the swept region will decrease as the recombination rate increases [6]. Additionally, using human-like parameters (*θ*=0.002/site, *ρ/θ*=1), we find that the pattern of an increasing CV with region size is still observed to some extent. However, the CV is much larger on average even under neutrality when *ρ/θ*=1, and the models are more similar to one another with human-like parameters (Figure 2C) than with Drosophila-like parameters (Figures 2A and B). This implies that weak and strong selection models will be more difficult to distinguish in humans.

It is noteworthy that for large surveyed regions, more strongly negative values of Fay and Wu's *H*-statistic (*i.e.*, SFS skewed towards high-frequency derived alleles) and Tajima's *D*-statistic (*i.e.*, SFS skewed towards rare alleles) are observed under strong selection models (Figure 3), suggesting that differences in the polymorphism site frequency spectrum may also be used to distinguish between models if large enough regions are surveyed. Though this differs qualitatively from the conclusions of Przeworski (2002), simulations demonstrate that this is attributable to a modeling difference (results not shown), as we here allow sweeps within the sampled region (following [24]). This discrepancy between modeling approaches will thus only become greater as region sizes increase.
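The frequency-spectrum statistics mentioned here can be computed directly from an unfolded SFS. The sketch below uses the standard definitions of *π*, Watterson's *θ*, *θ_H*, Tajima's *D* and Fay and Wu's *H* (*H* = *π* − *θ_H*). The function name `sfs_statistics` and the example spectrum are hypothetical and are not taken from the data analyzed in this study.

```python
import math


def sfs_statistics(sfs):
    """Summaries from an unfolded site frequency spectrum.

    sfs[i - 1] is the number of sites at which the derived allele is
    carried by i of the n sampled chromosomes, for i = 1 .. n - 1.
    """
    n = len(sfs) + 1
    pairs = n * (n - 1)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    seg = sum(sfs)  # number of segregating sites, S
    pi = sum(2.0 * x * i * (n - i) for i, x in enumerate(sfs, start=1)) / pairs
    theta_h = sum(2.0 * x * i * i for i, x in enumerate(sfs, start=1)) / pairs
    theta_w = seg / a1
    # Variance constants from Tajima (1989)
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    var = e1 * seg + e2 * seg * (seg - 1)
    tajima_d = (pi - theta_w) / math.sqrt(var) if var > 0 else float("nan")
    return {"S": seg, "pi": pi, "theta_W": theta_w, "theta_H": theta_h,
            "tajima_D": tajima_d, "fay_wu_H": pi - theta_h}


# Hypothetical n = 12 spectrum with excess rare and high-frequency derived alleles.
sweep_like = sfs_statistics([20, 2, 1, 0, 0, 0, 0, 0, 1, 2, 6])
# Neutral expectation: counts proportional to 1/i (here theta = 10).
neutral = sfs_statistics([10.0 / i for i in range(1, 12)])
print(sweep_like["tajima_D"], sweep_like["fay_wu_H"])
```

For the sweep-like spectrum both statistics are negative, whereas under the neutral expectation the three estimators of *θ* coincide and both statistics are zero.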

The above results suggest that focusing on variability across loci may distinguish models of strong, rare sweeps from those of frequent, weak sweeps. Thus, we here implement an approximate Bayesian (ABC) approach to estimate the strength of selection (*s*), the rate of fixation of beneficial mutations (2*Nλ*) and the neutral population mutation rate (*θ=4N _{e}u*) under a recurrent hitchhiking model. We begin by employing the observed mean and standard deviation of heterozygosity (

We find that this *π*-based estimation performs reasonably well, particularly when the size of surveyed regions is large and selection is strong. For 500 bp regions, MAP estimates are accurate within an order of magnitude. However, distributions of MAP estimates are typically widely dispersed, particularly when selection is weak (Figure S1; Table S1). Additionally, estimation of *s*, 2*Nλ*, and *θ* is generally upwardly biased. Under the best conditions - large region sizes and strong selection - the performance of the estimator is greatly improved (RMSE(*ŝ*)=0.179, and the relative bias, RB(*ŝ*)=−0.281).
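As a reading aid for the performance measures, the sketch below computes a root mean square error and a relative bias for a set of MAP estimates. The conventional definitions are assumed here; the scaling used for the values reported in Table S1 may differ. The estimates are hypothetical.

```python
import math


def rmse(estimates, truth):
    """Root mean square error of the estimates around the true value."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))


def relative_bias(estimates, truth):
    """(mean estimate - truth) / truth; positive values indicate overestimation."""
    return (sum(estimates) / len(estimates) - truth) / truth


# Hypothetical MAP estimates of s from five replicate datasets; true s = 0.01.
map_estimates = [0.008, 0.012, 0.015, 0.009, 0.011]
print(rmse(map_estimates, 0.01), relative_bias(map_estimates, 0.01))
```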

Given the computational efficiency of ABC, it is straightforward to explore multiple combinations of test statistics, in order to determine whether incorporating additional information from the site frequency spectrum or spatial distribution of sites may significantly improve the accuracy of estimation. We found that the incorporation of the mean and variance of several common summary statistics did not significantly improve or alter estimation, owing to correlations with *π* (results not shown). However, other statistics such as *θ _{H}* [25], and

This intuition appears to be accurate. The addition of the mean and SD of *ZnS* and *θ _{H}* particularly, and the number of segregating sites (

Though the parameters *s*, 2*Nλ*, and *ρ* are fixed in the above simulations, these parameters likely vary among genomic regions in real data. While it is attractive to assume a fixed parameter model given its simplicity, if the true model is in fact one in which parameters are drawn from distributions, this may lead to a bias in estimation owing to misspecification of the model. We consider a variety of examples: those in which *s* and 2*Nλ* are drawn from exponentials, and *ρ* is drawn from an exponential or normal. When comparing fixed and distributed models, the mean of the distribution is set equal to the fixed value used previously (*i.e.*, if in the fixed model *s*=0.01, the distribution model to which it would be compared would have *s* exponentially distributed with mean 0.01). Figure S2 documents the effect of modeling parameters drawn from distributions on the relative CV of π (compare to Figure 2). As expected, the relative CV is inflated compared to the fixed parameter model, which may lead to biases in estimation if unaccounted for.
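The matching of fixed and distributed models can be sketched as follows. The code draws per-locus values of *s* from an exponential distribution whose mean equals the fixed value, and *ρ* from a normal distribution. It assumes that N(*ρ*, *ρ*/2) denotes a mean and a standard deviation, and it redraws non-positive recombination rates; both choices are assumptions made for this illustration.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is reproducible

FIXED_S = 0.01
FIXED_RHO = 0.1
N_LOCI = 10_000


def draw_locus_parameters(rng, mean_s, mean_rho):
    """One locus: s ~ Exponential(mean = mean_s), rho ~ Normal(mean_rho, mean_rho / 2)."""
    s = rng.expovariate(1.0 / mean_s)  # expovariate takes the rate, i.e. 1 / mean
    rho = rng.gauss(mean_rho, mean_rho / 2.0)
    while rho <= 0:  # redraw non-positive rates (an assumption of this sketch)
        rho = rng.gauss(mean_rho, mean_rho / 2.0)
    return s, rho


draws = [draw_locus_parameters(rng, FIXED_S, FIXED_RHO) for _ in range(N_LOCI)]
s_values = [s for s, _ in draws]
rho_values = [rho for _, rho in draws]

mean_s = statistics.fmean(s_values)
cv_s = statistics.stdev(s_values) / mean_s
print(mean_s, cv_s)
```

An exponential distribution has a coefficient of variation of 1, so the distributed model keeps the same mean *s* as the fixed model while adding substantial locus-to-locus variance. This is consistent with the inflated relative CV described above.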

In order to consider the effect of model misspecification on parameter estimation, datasets are simulated under a model where parameters are drawn from distributions, yet priors are constructed assuming that these parameters have fixed values. Misspecification of the model in this way leads to an upward bias in the estimate of selection coefficients, and a downward bias in the estimated rate of selection (Figure 5). To account for this misspecification, the priors must be appropriately constructed, by allowing each locus within a given replicate dataset to also be drawn from distributions (see Methods). As shown in Figure 5, while the distributions of MAP estimates are more widely dispersed when compared with Figure 4 (*e.g.*, under a fixed model the RMSE(*ŝ*)=7.9E−06 for strong selection and large regions, and under a distributed model the RMSE(*ŝ*)=1.11), the mean of the distribution nonetheless accurately reflects the means of *s*, 2*Nλ*, and *θ* (for the above two models, the RB(*ŝ*) are 0.12 and 0.57, respectively; Table S1). Additionally, for all estimated parameters, the relative bias is reduced for 50 kb relative to 500 bp regions.

For comparison, an alternate distributed parameter model was considered. As opposed to *s* being drawn from an exponential distribution for each locus, we model *s* being drawn from an exponential distribution for each selective event. Results between the two models are similar, though this case results in consistently smaller RMSEs (results under this alternative model, mirroring Figure 5, are given in Table S1). This result suggests that this alternative distribution model is intermediate between the two extreme cases examined here - fixed models and distributed locus-by-locus models. Despite the overall improvement gained by modeling distributed parameters in general, an important limitation is the assumption that the shape of the underlying distribution of each parameter is known.

The above simulations, however, continue to assume a constant mutation rate among regions. In reality, the mutation rate may vary among loci, which may be a potential source of bias for the method [11]–[12]. Thus, in order to consider the possible effects of mutation rate variation, the distribution of variation at synonymous sites among loci in the Andolfatto (2007) dataset (see below) was taken as a proxy for mutation rate variation. We estimated the parameters for a Γ-distribution using the distribution of synonymous site divergence estimates across loci. Modeling this observed distribution with simulated data (*i.e.*, Γ(200,2.5); Figure S3), we found that the estimation was not affected and results resemble those of a fixed *θ* model (Figure S1, Figures 4–5). This result suggests that the variation in mutation rate observed in *D. melanogaster* is not widely dispersed enough to impact estimation, and is thus not likely to be biasing our estimated parameter values.
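To see how narrow the fitted distribution is, its moments can be computed directly. The sketch assumes that Γ(a,b) is written as (shape, scale), which the text does not state. Under that reading, Γ(200,2.5) and the more dispersed Γ(10,50) considered next share the same mean and differ only in spread, which is what makes the comparison informative.

```python
import math


def gamma_summary(shape, scale):
    """Mean, SD and CV of a Gamma(shape, scale) distribution."""
    mean = shape * scale
    sd = math.sqrt(shape) * scale
    return mean, sd, sd / mean


observed_like = gamma_summary(200, 2.5)  # fitted to synonymous divergence
dispersed = gamma_summary(10, 50)        # deliberately over-dispersed case
print(observed_like, dispersed)
```

Under the assumed parameterization, the coefficient of variation of the mutation rate is about 0.07 for Γ(200,2.5) and about 0.32 for Γ(10,50).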

As there is relatively little variance at synonymous sites observed among regions in the Andolfatto (2007) dataset, data were simulated in which *θ* is much more widely dispersed (*i.e.*, Γ(10,50)), in order to determine the possible bias introduced by more extreme mutation rate variation. Importantly, under this model, estimation based upon the mean and SD of *π* becomes strongly biased in the direction of estimating larger selection coefficients, as heterogeneity in mutation rate artificially inflates the variance among loci (Figure S3). However, when estimation is based upon the means and SDs of *π*, *S*, *θ _{H}*, and

In summary, we propose that our estimator of recurrent hitchhiking model parameters that incorporates information from multiple summary statistics performs reasonably well. This method is preferable to a *π*-based approach both because it is more accurate and more robust to variation in mutation rate. The overall performance of the method will be greatly improved by the availability of genome-scale polymorphism data. An important point relevant to all of these models is that relatively simple adaptive models have been considered, and additional complexities such as recently increased or decreased rates of adaptation, variation in dominance of beneficial mutations, or selection from standing variation, have yet to be incorporated.
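The general logic of an ABC rejection estimator built on across-locus means and SDs can be sketched as below. This is not the estimator used in this study: the simulator is a deliberately crude toy (no coalescent, no recombination), the two parameters (`theta`, `sweep_prob`), the fixed `REDUCTION`, the uniform priors and the point estimate (a mean of accepted draws rather than a MAP) are all hypothetical choices made only to show the accept/reject structure.

```python
import random
import statistics

REDUCTION = 0.9  # toy assumption: a swept locus retains 10% of its diversity


def simulate_loci(rng, theta, sweep_prob, n_loci=50):
    """Toy stand-in for a recurrent-sweep simulator (NOT a coalescent)."""
    values = []
    for _ in range(n_loci):
        swept = rng.random() < sweep_prob
        level = theta * (1.0 - REDUCTION) if swept else theta
        values.append(max(0.0, rng.gauss(level, 0.1 * theta)))  # sampling noise
    return values


def summaries(values):
    """Summary statistics: mean and SD of diversity across loci."""
    return statistics.fmean(values), statistics.stdev(values)


def abc_rejection(observed, rng, n_draws=5000, keep=100):
    """Keep the prior draws whose simulated (mean, SD) lie closest to the data."""
    obs_mean, obs_sd = summaries(observed)
    scored = []
    for _ in range(n_draws):
        # Uniform priors over hypothetical ranges.
        params = (rng.uniform(0.005, 0.02), rng.uniform(0.0, 1.0))
        sim_mean, sim_sd = summaries(simulate_loci(rng, *params))
        dist = ((sim_mean - obs_mean) / obs_mean) ** 2 \
            + ((sim_sd - obs_sd) / obs_sd) ** 2
        scored.append((dist, params))
    scored.sort(key=lambda item: item[0])
    return [params for _, params in scored[:keep]]


rng = random.Random(7)
observed = [0.01] * 25 + [0.001] * 25  # hypothetical data: half the loci swept
accepted = abc_rejection(observed, rng)
theta_hat = statistics.fmean([p[0] for p in accepted])
sweep_hat = statistics.fmean([p[1] for p in accepted])
print(theta_hat, sweep_hat)
```

The mean alone constrains only a combination of the two toy parameters; it is the SD across loci that separates them, mirroring the role that among-locus variance plays in the estimator described here.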

Here we apply our approach to the multi-locus data set of Andolfatto (2007), who surveyed 137 X-linked regions from an East African population of *D. melanogaster* [11]. Though our performance evaluation of the method suggests that regions of this size are not ideal for estimation (the average region length in this dataset is 680 bp), it indicates at least the possibility of distinguishing weak from strong selection models, though such small regions cannot assure accurate parameter estimation. We estimated selection parameters both from 1) priors where these parameters are drawn from distributions (exp(*s*), exp(2*Nλ*) and N(*ρ*, *ρ*/2)), and 2) in order to compare to previous estimation methods, priors that assume fixed values of *s*, *2Nλ* and *ρ*. The strength of selection for each sweep, *s*, is drawn from an exponential distribution (see Methods). We ignore variation in *θ* among loci as we have shown that this is not expected to significantly impact estimation (see above).

Shown in Figure 6 are marginal posterior distributions for selection parameters (assuming distributed parameters, *ŝ*=2E−03, , and per site). Consistent with simulated data, parameter estimation assuming fixed values leads to considerably larger estimates of *ŝ*, and reduced estimates of (Figure 6, *ŝ*=0.01, , and per site). It is thus important to emphasize that estimation will be sensitive to the underlying models chosen for the priors. Given that we expect these parameters to vary among loci, we consider the former estimate to perhaps be better, with the caveat that we lack precise knowledge of how these parameters are actually distributed (see Methods for more details). Interestingly, the large estimate of compared to previous studies [11]–[12] suggests a stronger mean reduction in genome variation due to hitchhiking (~50%). Finally, it is additionally noteworthy that estimation does not necessarily need to be performed using the marginal posteriors as we have implemented here. For example, Figure S4 compares estimation between joint and marginal posteriors for our empirical dataset, and shows that while the estimates are similar, they are not identical. Understanding these differences, and better determining if estimation based upon joint posteriors may have any advantages, is a topic of future investigation.

An important consideration we have not addressed thus far is the impact of non-equilibrium demography, which may closely resemble sweep-like patterns of variation and may be expected to bias the estimator [*e.g.*, 27]–[28]. For instance, a strong population bottleneck exhibits many characteristics of a selection model – greatly increasing the variance of summary statistics, and specifically producing very negative values of the *H*-statistic [29]–[32]. In order to assess the potential bias induced by demography on the proposed estimator, we model two simple bottleneck models (BN1 and BN2) and a growth model (see Methods). BN1 and the growth model were fit to match the observed mean *π* and Tajima's *D*. BN2 was chosen specifically to match the observed CV(π). Under all three models, the posterior distributions are localized around weaker selection coefficients, and larger rates, than we estimate from the observed data, with estimation based upon distributed priors (MAP estimates given in Table 2).

This result suggests that, while the estimator is obviously sensitive to non-equilibrium demography, our empirical data are not easily explained by any of the demographic models considered (with the empirical estimates falling outside of the 95% CIs for the demographic models considered). This is particularly encouraging given that one of the bottleneck models, BN2, was chosen specifically to match the CV(π) that was observed for this dataset. Clearly, to minimize demographic effects, populations should be carefully chosen when possible. The dataset we have analyzed is from a putatively ancestral East African population that is believed to have been relatively demographically stable compared to non-African populations, which show signatures of a recent and severe bottleneck [18], [31]–[32]. Characterizing biases induced from a wider range of demographic models is a topic of future study, and will be important before performing estimation in other populations and species. One promising direction will likely take advantage of the observed correlation between *π _{s}* and

Several other studies have attempted to estimate parameters under a recurrent hitchhiking model, and a discussion of how our estimates compare with those studies is of considerable interest. As previous studies assumed fixed values of *s*, 2*Nλ* and *ρ*, it is most appropriate to first compare these estimates with our “fixed value” estimation. Li and Stephan (2006) employed a sliding window likelihood ratio test using multi-locus polymorphism data and estimate that *ŝ*~0.002 and [18], which is similar to our estimates (Table 3). Their approach has a number of notable differences with ours: they co-estimate a growth model within their estimation procedure, use non-coding DNA rather than synonymous sites, and assume that all detectable sweeps have fixed immediately prior to sampling (*i.e.*, *τ*=0). Given that our values of 2*Nλs* are quite similar, so is the expected level of reduction in genome variability (Table 3). Macpherson *et al.* (2007) used large-scale polymorphism data from six lines of *D. simulans* and estimate a strong average selection coefficient (*ŝ*~0.01) [12], which is identical to our fixed value estimate. The bigger difference is in our estimates of 2*Nλ*, with our estimate being ~4× larger. However, given that the dataset examined here is from *D. melanogaster*, there is no reason to necessarily anticipate that these estimates should match.

It is noteworthy that our estimated selection coefficient is an order of magnitude smaller (and our estimate of the rate an order of magnitude larger) when we assume that *s*, 2*Nλ* and *ρ* are drawn from distributions rather than taking fixed values. Despite this, our estimated selection coefficient under the distributed model is still almost two orders of magnitude larger than Andolfatto's (2007) estimate [11]. Andolfatto's estimates of *s* and 2*Nλ* are particularly relevant, as we here examine the same dataset and arrive at quite different conclusions. The discord between estimates may arise from the fact that Andolfatto's estimate of *s* relies on estimating 2*Nλ* using the McDonald-Kreitman statistical framework [33]–[34]. However, we note that with short surveyed fragments, our estimator of *s* is somewhat upwardly biased (Figure 5) so it will be interesting to apply our method to larger genomic regions when that data becomes available.

Additionally, while Andolfatto (2007) and Macpherson *et al.* (2007) estimate a 20% average reduction in genome-wide variability, we estimate a considerably larger reduction (50%), which is more consistent with the estimate of 2*Nλs* of Li and Stephan (2006). This may to some extent explain Andolfatto's observation that the observed Tajima's *D* at synonymous sites is more negative than predicted by his estimates of *s* and 2*Nλ*. When we simulate a recurrent hitchhiking model with our estimated parameters, the average Tajima's *D* is −0.3, which is close to the observed average (−0.28). While a negative mean Tajima's *D* is usually interpreted in the context of demographic models (such as population growth, see for example [18]), it may instead imply that recurrent hitchhiking is having a larger genome-wide impact than previously appreciated.

While common/weak and rare/strong recurrent positive selection result in similar average levels of genome variation (for 2*Nλs*=constant), rare/strong selection greatly increases the variance of common summary statistics relative to common/weak selection. We demonstrate, using an ABC approach based upon this observation, that the rate and the strength of selection may accurately be estimated jointly. Though there is some power to differentiate parameters using existing data, our results strongly suggest that genome scale data will afford much better discriminatory power. Our study also highlights that learning more about how parameters such as *s*, 2*Nλ* and *ρ* are distributed among loci will be crucial for accurate parameter estimation.
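The confounding of rate and strength in the mean can be illustrated numerically. The sketch assumes an approximation of the Wiehe and Stephan type, E[*π*]/*θ* = *r*/(*r* + *κγλ*) with *γ* = 2*N_{e}s* and *κ* ≈ 0.075. The exact form and constants of the expression used in this study may differ, so the function below should be read as an assumption. The two parameter pairs are the strong and weak cases used in the performance simulations.

```python
import math

KAPPA = 0.075  # constant quoted in the Methods


def relative_diversity(s, two_n_lambda, rho=0.1, n_e=1e6):
    """E[pi] / theta under the assumed form r / (r + kappa * gamma * lambda)."""
    r = rho / (4.0 * n_e)             # recombination rate per bp per generation
    lam = two_n_lambda / (2.0 * n_e)  # sweeps per bp per generation
    gamma = 2.0 * n_e * s             # scaled selection coefficient
    return r / (r + KAPPA * gamma * lam)


strong = relative_diversity(s=1e-2, two_n_lambda=2e-5)
weak = relative_diversity(s=1e-4, two_n_lambda=2e-3)
print(strong, weak)
```

Because *s* and *λ* enter this expression only through their product, the two models give the same expected diversity. Any information separating them must therefore come from the variance among loci or from the frequency spectrum.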

We use the recurrent selective sweep coalescent simulation machinery described in [24], with a modification to account for the stochastic trajectories of positively selected mutations in finite populations [11], [35]–[36]. Briefly, sweeps are occurring in the genome at a rate determined by 2*Nλ*=*Λ*, where λ is the rate of sweeps per generation [6],[8]. Following [24], selective sweeps are allowed both within the sampled region, as well as at linked sites. This distinction is significant, because for large simulated regions the probability of a sweep within the region may not be negligible for large *Λ*. The rate of sweeps within a region is thus M*Λ*, and as each sweep may affect up to *s/r _{bp}* (from [6],[37]; which is equivalent to 4

For the purposes of testing the proposed estimator, we evaluated models for *N _{e}*=10

(1)

where *θ* is the neutral population mutation rate, *r* is the unscaled recombination rate in Morgans per base pair per generation, *κ* is a constant ~0.075, *γ*=2*N _{e}s* (where

When simulating distributed rather than fixed values of *s*, 2*Nλ*, *θ*, and *ρ*, values for each region are drawn from a distribution (exp(*s*), exp(2*Nλ*), N(*ρ*, *ρ*/2) or exp(*ρ*)). Thus, the value is fixed for an individual locus, but varies among loci. An alternative model was additionally examined, in which *s* is not fixed per locus, but rather is drawn from an exponential distribution for each selective event. These two separate models were chosen for two distinct purposes: 1) an exp(*s*) per locus is chosen for the performance simulations as it results in a large variance between loci. Thus, alongside the fixed parameter model, these comparisons represent two extremes; 2) an exp(*s*) per sweep is chosen when analyzing the empirical and demographic data, as we believe it better approximates biological reality (representing a model first introduced by Fisher). While the true underlying distributions are unknown, there is some biological data to draw from. For instance, observed *K _{a}* among genes [11] is nearly exponentially distributed, implying that an exp(2

In order to consider the performance of our method under non-equilibrium demographic models, we fit a simple bottleneck and growth model to the empirical data based on observed values of mean *π* and the average Tajima's *D* (0.025/site and −0.28, respectively). Under both models, simulation parameters are thus scaled to mimic the observed values of these two statistics. As above, *n*=12, *ρ*=0.1, *θ*=0.01 and *N _{e}*=10

To estimate the parameters *s*, 2*Nλ*, and *θ*, we relied upon their relationship with the means and standard deviations of common summary statistics. We take an approximate Bayesian (ABC) approach [41]–[44] to obtain marginal posterior distributions (estimation is also possible using joint posterior distributions, an example of this is discussed in the Results and given as a Supplement). Calculating our summary statistics (the means and SDs of *π*, *S*, *θ _{H}* and

In order to determine the optimal combination of information, estimation was performed using all combinations of the mean and standard deviations of *π*, the number of segregating sites (*S*), *θ _{H}*, Tajima's

We use the 137 X-linked coding loci surveyed in [11]; GenBank accession numbers EU216760-EU218523. All loci were surveyed in 12 lines of *D. melanogaster* from a Zimbabwe population. For this analysis, only synonymous sites were considered. We summarized the mean pairwise diversity, mean(π), its standard deviation, SD(π), and the coefficient of variation, CV(π), as well as the means and SDs of the number of segregating sites, *S*, *θ _{H}* [25], Tajima's
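As an illustration of the per-locus statistics named here, the following sketch computes *π*, *S*, and Tajima's *D* [45] for one aligned sample, and the mean, SD, and CV of a statistic across loci. The function names and the toy alignment are hypothetical; *π* is reported per locus rather than per site, and returning 0 for *D* when *S*=0 is a convention assumed for this example.

```python
from itertools import combinations
from math import sqrt
import statistics


def pi_and_s(haplotypes):
    """Average pairwise differences (pi) and segregating sites (S)."""
    n = len(haplotypes)
    length = len(haplotypes[0])
    seg = sum(1 for j in range(length)
              if len({h[j] for h in haplotypes}) > 1)
    diffs = sum(sum(a != b for a, b in zip(h1, h2))
                for h1, h2 in combinations(haplotypes, 2))
    return diffs / (n * (n - 1) / 2), seg


def tajimas_d(pi, seg, n):
    """Tajima's D from pi, S and sample size n (Tajima 1989)."""
    if seg == 0:
        return 0.0  # D is undefined without variation; 0 by convention
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - seg / a1) / sqrt(e1 * seg + e2 * seg * (seg - 1))


def across_loci(values):
    """Mean, SD and coefficient of variation of a per-locus statistic."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return mean, sd, sd / mean


if __name__ == "__main__":
    sample = ["AAAA", "AAAT", "AATT", "ATTT"]  # toy alignment, n = 4
    pi, seg = pi_and_s(sample)
    print("pi =", pi, "S =", seg, "D =", tajimas_d(pi, seg, len(sample)))
    print("mean, SD, CV:", across_loci([1.2, 0.8, 2.5, 1.9]))
```

Applying `across_loci` to the per-locus values of each statistic yields the means and SDs that serve as summaries for estimation.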

Approximate Bayesian estimation of the strength and rate of selection as well as the neutral *θ*, when estimation is based upon the mean and SD of *π*. The model is one in which *s* and 2*Nλ* are fixed. For the strong selection case *s*=1.0E−02 and 2*Nλ*=2.0E−05, for weak selection *s*=1.0E−04, and 2*Nλ*=2.0E−03. *ρ*=0.1/site and *θ*=0.01/site. Shown are the distributions of 1000 MAP estimates. The dotted lines indicate the true values. The distributions for 10 50 kb region datasets are given in black, and for 1000 500 bp datasets in gray. As shown, the former affords more accurate estimation, and estimation is improved in general as *s* becomes large (see also Table S1).

(0.2 MB TIF)


The ratio of CV to CV(equilibrium neutrality) for four values of *s*. The product *2Nλs*=5E−07 for all panels. (A–D) Drosophila-like parameters: *ρ*/*θ*=10 (*ρ*=0.1/site, *θ*=0.01/site), *ρ*=constant or Normal(0.1, 0.05). (E–H) Human-like parameters: *ρ*/*θ*=1 (*ρ*=0.002/site, *θ*=0.002/site), *ρ*=constant or Exponential(0.1). (A, E) Exponential(*s*), Exponential(2*Nλ*), and *ρ*=Normal(0.1, 0.05). (B, F) Exponential(2*Nλ*), *s*=constant. (C, G) Exponential(*s*), 2*Nλ*=constant. (D, H) *ρ*=distributed, *s*=constant, 2*Nλ*=constant. The choice of exponentially distributed *ρ* for human-like parameters is motivated by evidence for greater heterogeneity in *ρ* relative to Drosophila [39]. Importantly, these models represent only one possible way of modeling distributions of *s* and 2*Nλ*, and alternative models may result in differing conclusions.

(0.2 MB TIF)


Approximate Bayesian estimation of the strength and rate of selection as well as the neutral *θ*, when estimation is based upon the means and SDs of *π*, *S*, *θ _{H}* and

(0.2 MB TIF)


Joint posterior distributions of *s* and 2*Nλ*, for the 137-locus dataset of [11], when estimation is based upon the means and SDs of *π*, *S*, *θ _{H}* and

(0.6 MB TIF)


The authors acknowledge Doris Bachtrog, Yuseob Kim, Molly Przeworski and members of the Aquadro lab for helpful comments and discussion.

The authors have declared that no competing interests exist.

JDJ was supported by a National Science Foundation Biological Informatics postdoctoral fellowship. KRT was supported in part by setup funds.

1. Maynard Smith J, Haigh J. The hitchhiking effect of a favourable gene. Genet Res. 1974;23:23–35. [PubMed]

2. Harr B, Kauer M, Schlotterer C. Hitchhiking mapping: a population-based fine-mapping strategy for adaptive mutations in *Drosophila melanogaster*. Proc Natl Acad Sci USA. 2002;99:12949–12954. [PubMed]

3. Stephan W, Wiehe THE, Lenz MW. The effect of strongly selected substitutions on neutral polymorphisms: analytical results based on diffusion theory. Theor Popul Biol. 1992;41:137–154.

4. Simonsen KL, Churchill GA, Aquadro CF. Properties of statistical tests of neutrality for DNA polymorphism data. Genetics. 1995;141:413–29. [PubMed]

5. Fay JC, Wu C-I. Hitchhiking under positive Darwinian selection. Genetics. 2000;155:1405–13. [PubMed]

6. Kaplan NL, Hudson RR, Langley CH. The “hitchhiking effect” revisited. Genetics. 1989;120:819–829. [PubMed]

7. Wiehe THE, Stephan W. Analysis of a genetic hitchhiking model, and its application to DNA polymorphism data from *Drosophila melanogaster*. Mol Biol Evol. 1993;10:842–54. [PubMed]

8. Braverman JM, Hudson RR, Kaplan NL, Langley CH, Stephan W. The hitchhiking effect on the site frequency spectrum of DNA polymorphism. Genetics. 1995;140:783–796. [PubMed]

9. Przeworski M. The signature of positive selection at randomly chosen loci. Genetics. 2002;160:1179–89. [PubMed]

10. Begun DJ, Aquadro CF. Levels of naturally occurring DNA polymorphism correlate with recombination rate in *D. melanogaster*. Nature. 1992;356:519–520. [PubMed]

11. Andolfatto P. Hitchhiking effects of recurrent beneficial amino acid substitutions in the *Drosophila melanogaster* genome. Genome Res. 2007;17:1755–62. [PubMed]

12. Macpherson JM, Sella G, Davis JC, Petrov DA. Genomewide spatial correspondence between nonsynonymous divergence and neutral polymorphism reveals extensive adaptation in Drosophila. Genetics. 2007;177:2083–99. [PubMed]

13. Kim Y. Allele frequency distribution under recurrent selective sweeps. Genetics. 2006;172:1967–78. [PubMed]

14. Sawyer SA, Hartl DL. Population genetics of polymorphism and divergence. Genetics. 1992;132:1161–76. [PubMed]

15. Smith NG, Eyre-Walker A. Adaptive protein evolution in Drosophila. Nature. 2002;415:1022–4. [PubMed]

16. Eyre-Walker A. The genomic rate of adaptive evolution. Trends Ecol Evol. 2006;21:569–75. [PubMed]

17. Sawyer SA, Parsch J, Zhang Z, Hartl DL. Prevalence of positive selection among nearly neutral amino acid replacements in Drosophila. Proc Natl Acad Sci USA. 2007;104:6504–10. [PubMed]

18. Li H, Stephan W. Inferring the demographic history and rate of adaptive substitutions in Drosophila. PLoS Genet. 2006;2:e166. [PubMed]

19. Andolfatto P. Adaptive evolution of non-coding DNA in Drosophila. Nature. 2005;437:1149–52. [PubMed]

20. Bachtrog D, Andolfatto P. Selection, recombination and demographic history in *Drosophila miranda*. Genetics. 2006;174:2045–59. [PubMed]

21. Ometto L, Glinka S, De Lorenzo D, Stephan W. Inferring the effects of demography and selection on *Drosophila melanogaster* populations from a chromosome-wide scan of DNA variation. Mol Biol Evol. 2005;22:2119–30. [PubMed]

22. Wright SI, Bi IV, Schroeder SG, Yamasaki M, Doebley JF, et al. The effects of artificial selection on the maize genome. Science. 2005;308:1310–1314. [PubMed]

23. Kim Y, Stephan W. Detecting a local signature of genetic hitchhiking along a recombining chromosome. Genetics. 2002;160:765–777. [PubMed]

24. Jensen JD, Thornton K, Bustamante CD, Aquadro CF. On the utility of linkage disequilibrium as a statistic for identifying targets of positive selection in non-equilibrium populations. Genetics. 2007;176:2371–2379. [PubMed]

25. Fu Y-X. New statistical tests of neutrality for DNA samples from a population. Genetics. 1996;143:557–570. [PubMed]

26. Kelly JK. A test of neutrality based on interlocus associations. Genetics. 1997;146:1179–1206. [PubMed]

27. Jensen JD, Kim Y, Bauer DuMont V, Aquadro CF, Bustamante CD. Distinguishing between selective sweeps and demography using DNA polymorphism data. Genetics. 2005;170:1401–1410. [PubMed]

28. Thornton KR, Jensen JD, Becquet C, Andolfatto P. Progress and prospects in mapping recent selection in the genome. Heredity. 2007;98:340–8. [PubMed]

29. McVean GA. A genealogical interpretation of linkage disequilibrium. Genetics. 2002;162:987–91. [PubMed]

30. Lazzaro BP, Clark AG. Molecular population genetics of inducible antibacterial peptide genes in *Drosophila melanogaster*. Mol Biol Evol. 2003;20:914–23. [PubMed]

31. Haddrill P, Thornton K, Andolfatto P, Charlesworth B. Multilocus patterns of nucleotide variability and the demographic and selection history of *Drosophila melanogaster* populations. Genome Res. 2005;15:790–799. [PubMed]

32. Thornton KR, Andolfatto P. Approximate Bayesian inference reveals evidence for a recent, severe bottleneck in a Netherlands population of *Drosophila melanogaster*. Genetics. 2006;172:1607–19. [PubMed]

33. McDonald JH, Kreitman M. Adaptive protein evolution at the Adh locus in Drosophila. Nature. 1991;351:652–4. [PubMed]

34. Bierne N, Eyre-Walker A. The genomic rate of adaptive amino acid substitutions in Drosophila. Mol Biol Evol. 2004;21:1350–60. [PubMed]

35. Coop G, Griffiths RC. Ancestral inference on gene trees under selection. Theor Popul Biol. 2004;66:219–32. [PubMed]

36. Przeworski M, Coop G, Wall JD. The signature of positive selection on standing genetic variation. Evolution Int J Org Evolution. 2005;59:2312–23. [PubMed]

37. Durrett R, Schweinsberg J. Approximating selective sweeps. Theor Popul Biol. 2004;66:129–138. [PubMed]

38. Cirulli ET, Kliman RM, Noor MA. Fine-scale crossover rate heterogeneity in *Drosophila pseudoobscura*. J Mol Evol. 2007;64:129–35. [PubMed]

39. Coop G, Przeworski M. An evolutionary view of human recombination. Nat Rev Genet. 2007;8:23–34. [PubMed]

40. Hudson RR. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics. 2002;18:337–8. [PubMed]

41. Pritchard JK, Seielstad MT, Perez-Lezaun A, Feldman MW. Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Mol Biol Evol. 1999;16:1791–1798. [PubMed]

42. Beaumont MA, Zhang W, Balding DJ. Approximate Bayesian computation in population genetics. Genetics. 2002;162:2025–2035. [PubMed]

43. Marjoram P, Molitor J, Plagnol V, Tavare S. Markov chain Monte Carlo without likelihoods. Proc Natl Acad Sci USA. 2003;100:15324–15328. [PubMed]

44. Przeworski M. Estimating the time since the fixation of a beneficial allele. Genetics. 2003;164:1667–76. [PubMed]

45. Tajima F. Statistical methods for testing the neutral mutation hypothesis. Genetics. 1989;123:437–460.

Articles from PLoS Genetics are provided here courtesy of **Public Library of Science**
