The fixation of beneficial mutations can strongly reduce levels of closely linked neutral variation – the so-called genetic hitchhiking effect. This prediction has been used to search for positive selection by looking for regions of the genome with reduced variability [e.g., 2]. The hitchhiking model most often used is of a single selective sweep, where the location and timing of selection are assumed to be known. This single sweep model has been of great value in understanding the effect that a single selective event has on patterns of polymorphism, as a function of the strength of selection and location of the beneficial mutation [e.g., 1]. However, this model is somewhat disconnected from the problem of detecting selective sweeps in the genome, for which locations and timings are not known a priori and should be treated as random variables.
Kaplan et al. (1989) described a “recurrent hitch-hiking” (RHH) model, where the expected number of sweeps (per base pair, per 2N generations) is 2Nλ, with sweeps occurring at random locations in the genome. The RHH model is most commonly considered for the case of genic selection on new mutations entering the population [e.g., 6]. Under this model, several patterns expected under the single sweep model no longer apply. For example, the single sweep model predicts coalescent histories with long internal branches, as some lineages may escape the recent coalescent event via recombination. This results in the widely employed prediction of an excess of high-frequency derived alleles flanking the fixed site. Under RHH models, however, the probability of such a history is small, as sweeps are on average old and high-frequency derived mutations have thus likely drifted to fixation.
Wiehe and Stephan (1993) showed that under an RHH model, for a given recombination rate, the expected level of heterozygosity at linked sites relative to neutral expectations depends on the compound parameter (2Nλs), where 2Nλ is the rate of fixation of beneficial mutations and s is the average strength of selection. This result implies that the two parameters are confounded (much like the effective population size, Ne, and mutation rate, μ, in θ), as their effect on expected levels of diversity depends on their product. In D. melanogaster and D. simulans, lower than expected levels of nucleotide diversity are observed in regions of reduced recombination and in the coding sequences of rapidly evolving proteins. These findings are compatible with either strong but infrequent positive selection (i.e., large s and small 2Nλ) or weak but common positive selection (i.e., small s and large 2Nλ).
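The confounding of rate and strength can be illustrated with a short numerical sketch. It assumes, purely for illustration, that relative diversity takes the form r / (r + κ · 2Nλ · s); this functional form, the constant `KAPPA`, and all parameter values are assumptions introduced here, not quantities given in the text. Any expression that depends on 2Nλ and s only through their product would behave the same way.

```python
import math

# Assumed constant for the illustrative formula below (not from the text).
KAPPA = 0.075


def relative_diversity(r, two_n_lambda, s, kappa=KAPPA):
    """Expected diversity relative to neutrality under an assumed RHH form.

    r            -- recombination rate (illustrative units)
    two_n_lambda -- rate of beneficial fixations, 2N*lambda
    s            -- average selection coefficient
    The result depends on two_n_lambda and s only through their product.
    """
    return r / (r + kappa * two_n_lambda * s)


r = 1e-8  # hypothetical recombination rate

# Strong but infrequent selection: large s, small 2N*lambda.
strong_rare = relative_diversity(r, two_n_lambda=1e-6, s=1e-2)

# Weak but common selection: small s, large 2N*lambda.
weak_common = relative_diversity(r, two_n_lambda=1e-4, s=1e-4)

# Both regimes share the same product, so mean diversity cannot separate them.
print(strong_rare, weak_common, math.isclose(strong_rare, weak_common))
```

Because the two regimes give the same expected reduction, separating them requires information beyond the mean level of diversity.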
A number of methods have been proposed for quantifying 2Nλ and s (separately) using divergence and polymorphism data [e.g., 11]. These approaches typically make strong assumptions regarding the possible distribution of selection coefficients, the number of adaptive substitutions between species, or the timing of selection. For example, Li and Stephan (2006) examined 250 non-coding regions from an East African population of D. melanogaster. Using a likelihood approach, they estimate that approximately 160 beneficial mutations have fixed in this population over the last ~60,000 years (corresponding to […]), with mean selection coefficient ŝ ~0.002. This inference is achieved by effectively assuming that the timing of all sweeps is known (and the time since the sweep, τ = 0). Under a recurrent sweep model, this assumption may bias the estimation of s. Additionally, as this method relies on first fitting a demographic model to non-coding DNA polymorphisms, it is possible that the effects of purifying selection on the site frequency spectrum of non-coding DNA may strongly affect the estimates.
Using synonymous polymorphism data in D. melanogaster, and divergence to D. simulans, at 137 X-linked loci, Andolfatto (2007) employed a maximum likelihood approach to estimate the joint parameter 2Nλs, followed by a McDonald-Kreitman-based method to separately estimate 2Nλ and s. Based on these calculations, Andolfatto estimated that most beneficial amino acid substitutions are very weakly advantageous on average (with average ŝ […]). Macpherson et al. (2007), using polymorphism data from D. simulans (and divergence to D. melanogaster), propose a method to infer the rate and strength of selection from the spatial scale of variation in polymorphism and divergence. In contrast to Andolfatto's estimates, Macpherson et al. estimate a much stronger average selection coefficient (ŝ ~0.01) and less frequent selection ([…]). However, they note that their method is more likely to detect strong selection, so the effects of many weakly beneficial mutations may be missed.
By evaluating a wide array of recurrent selection models across a variety of sampling schemes, with parameters relevant for both Drosophila and human populations, we demonstrate here that the predictions of weak and strong selection models differ, both in the spatial distribution of variability levels and in the distribution of polymorphism frequencies (also called the site frequency spectrum, hereafter SFS). We propose a polymorphism-based approximate Bayesian computation (ABC) estimator that is most closely allied to the approach of Macpherson et al. (2007), but is also applicable to sub-genomic multi-locus data of the kind that has most often been collected [e.g., 11], and incorporates more information from the data. Fundamentally, this estimation procedure is based on the principle that while models may predict the same average effects, the variance of many common summary statistics varies greatly between models. We show that highly accurate estimation will be possible with large-scale genome polymorphism data, and that the approach is robust to both mutation and recombination rate heterogeneity.
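The general logic of an ABC estimator that uses both the mean and the variance of a statistic across loci can be sketched with a generic rejection sampler. This is a toy illustration, not the estimator developed in this paper: `simulate_summaries` is a hypothetical stand-in for a recurrent-sweep simulator (it simply draws exponential values), and the uniform prior, tolerance, and sample sizes are all assumptions chosen for the example.

```python
import math
import random
import statistics


def simulate_summaries(rate, n_loci, rng):
    """Toy simulator (hypothetical stand-in for a recurrent-sweep simulator).

    Draws one value per locus and returns the mean and the variance
    of those values across loci as the summary statistics.
    """
    values = [rng.expovariate(rate) for _ in range(n_loci)]
    return statistics.fmean(values), statistics.variance(values)


def abc_rejection(observed, n_loci, n_draws, tolerance, rng, prior=(0.5, 5.0)):
    """Keep prior draws whose simulated summaries fall close to the observed ones."""
    accepted = []
    for _ in range(n_draws):
        rate = rng.uniform(*prior)               # draw a parameter from the prior
        simulated = simulate_summaries(rate, n_loci, rng)
        if math.dist(simulated, observed) < tolerance:
            accepted.append(rate)                # approximate posterior sample
    return accepted


rng = random.Random(1)
true_rate = 2.0  # value used to generate the pseudo-observed data
observed = simulate_summaries(true_rate, 500, rng)
accepted = abc_rejection(observed, n_loci=500, n_draws=2000, tolerance=0.1, rng=rng)
posterior_mean = statistics.fmean(accepted)
print(len(accepted), posterior_mean)
```

In a real application the toy simulator would be replaced by coalescent simulations under the recurrent selection model, and the summaries by statistics computed from polymorphism data; including the variance alongside the mean is what lets models with the same average effect be told apart.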