Home | About | Journals | Submit | Contact Us | Français |

**|**PLoS One**|**v.5(8); 2010**|**PMC2914027

Formats

Article sections

Authors

Related links

PLoS One. 2010; 5(8): e11913.

Published online 2010 August 2. doi: 10.1371/journal.pone.0011913

PMCID: PMC2914027

Magnus Rattray, Editor^{}

University of Manchester, United Kingdom

* E-mail: rf.arni.yuoj@reituag.ueihtam

Conceived and designed the experiments: MG JLF. Performed the experiments: MG JLF. Analyzed the data: MG. Contributed reagents/materials/analysis tools: MG TDH JLF. Wrote the paper: MG TDH JLF.

Received 2010 June 3; Accepted 2010 July 12.

Copyright Gautier et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

The recent advent of high-throughput SNP genotyping technologies has opened new avenues of research for population genetics. In particular, a growing interest in the identification of footprints of selection, based on genome scans for adaptive differentiation, has emerged.

The purpose of this study is to develop an efficient model-based approach to perform Bayesian exploratory analyses for adaptive differentiation in very large SNP data sets. The basic idea is to start with a very simple model for neutral loci that is easy to implement under a Bayesian framework and to identify selected loci as outliers via Posterior Predictive P-values (PPP-values). Applications of this strategy are considered using two different statistical models. The first one was initially interpreted in the context of populations evolving respectively under pure genetic drift from a common ancestral population while the second one relies on populations under migration-drift equilibrium. Robustness and power of the two resulting Bayesian model-based approaches to detect SNP under selection are further evaluated through extensive simulations. An application to a cattle data set is also provided.

The procedure described turns out to be much faster than former Bayesian approaches and also reasonably efficient especially to detect loci under positive selection.

The recent advent of high-throughput Single Nucleotide Polymorphism (SNP) genotyping technologies has opened new avenues of research for population genetics. In particular, a growing interest in the identification of footprints of selection, based on genome scans for adaptive differentiation, has emerged. Indeed, such approaches early proposed in the population genetics literature [1]–[5], look particularly relevant when studying populations belonging to a same species but adapted to different environmental conditions. However, their application to whole genome scan data mainly relied on the analysis of simple descriptive summary statistics, generally related to standard estimators of marker-specific *F _{ST}*, used to investigate the variability of allele frequencies at different loci across and within populations. Markers affected by selection are then expected to display an unexpectedly high or low value relative to the null distribution of

Alternatively, using simple demographic models, likelihood-based approaches allowing a full use of the information contained in the data sets have also been developed to distinguish, among the evolutionary forces shaping differences in allele frequency, those pertaining to population-specific factors (*e.g.* migration or drift) from those due to locus-specific factors (such as selection). Hence, relying on an infinite Wright island model with drift and migration at equilibrium, Beaumont and Balding [10] proposed a Bayesian modeling of allele frequencies involving both a “locus” and a “population” effects on genetic differentiation. Using this model, the efficiency to detect non-neutral loci has recently been further investigated through model choice strategies via Reversible Jump Markov Chain Monte Carlo (RJ-MCMC) algorithms [11], or by introducing locus-specific selection variables [12], [13]. Although the application of the latter approach to a large data set comprising 36,320 SNPs genotyped on 9 West African cattle populations illustrated its feasibility [12], the estimation of the posterior distributions for the parameters of interest remains computationally intensive due to model complexity and to no parallelizable Markov Chain Monte Carlo (MCMC) algorithms. Similarly and more recently, Guo *et al.* [14] investigated Bayesian hierarchical models to estimate locus-specific effects on *F _{ST}*, statistical outliers being detected based on the Kullback-Leibler divergence measure between the posterior distributions of locus-specific effects and the common

The purpose of this study is to develop an efficient likelihood-based approach to perform Bayesian exploratory analyses for adaptive differentiation for very large SNP data sets. The basic idea is to start from a very simple model for neutral loci that will be easy to implement under a Bayesian framework and to identify selected loci as outliers via posterior predictive P-values (PPP-values) [15], [16]. We investigate two different statistical models: *i)* a model interpreted in the context of populations evolving under pure genetic drift from a common ancestral population [17] and *ii)* a model interpreted in the context of populations under migration-drift equilibrium [10], [18]. Robustness and power of both resulting classifiers are further evaluated and compared to a previous well described classifier [10]–[13] through extensive simulations and an application to the cattle data set mentioned previously.

Let *x _{ij}* be the observed reference allele (defined arbitrarily for instance by randomly choosing it as the ancestral or derived allele) count in population

(1)

Note that 1) implicitly assumes that populations are in Hardy-Weinberg Equilibrium (HWE) or equivalently their respective inbreeding coefficients (*F _{IS}*) are null. Non null

In the first model considered (denoted model 1 hereafter) and according to Nicholson *et al.* [17], the second step assumes that the *α _{ij}* are sampled from a truncated Gaussian distribution on the (0,1) segment

(2)

plus additional probability masses at 0 and 1.

This distribution was proposed by Nicholson *et al.* [17] in the context of a pure-drift demographic model. In (2), the parameter *π _{i}* stands for the allele frequency in the population ancestral to the

(3)

(4)

where and is the density of the standard Gaussian.

The last level of the hierarchy corresponds to the distributions of *π _{i}* and . These two hyper-parameters are classically assumed to be sampled from

(5)

(6)

In practice, the model was found to be robust to the parameter values of these *Beta* distributions [17]. We thus chose *a _{π}=b_{π}*=0.7 and

Note that the introduction of the truncation (equations 2, 3 and 4) leads to some difficulty in the implementation of a MCMC algorithm in particular when defining the proposal distribution for the *α _{ij}*. As suggested by G. Nicholson (personal communication) model 1 was considered as equivalent to the following one in which the first two levels are modified as

(7)

(8)

In fact, the hierarchical model in (7) and (8) can be implemented equivalently by using a proxy variable distributed as a regular Gaussian distribution with the relationship .

The second model considered (model 2 hereafter) is similar to model 1 except that it assumes the *α _{ij}* are sampled from a

(9)

where .

Note that model 2 does not consider the possibility of allele fixation since the Beta probability density function is either null or not defined in 0 and 1. Hence, alleles fixed in some (or all) populations are interpreted as being the result of a binomial sampling with a probability parameter close (but not equal) to 0 or 1. Demographic interpretation of this distribution on allele frequencies relies on an infinite Wright island model involving drift and migration at its equilibrium state [10], [18]. Under both model 1 (if we neglect truncature) and model 2, *c _{j}* represents a scale parameter of the allele frequency variance and might thus be interpreted as a population specific

For each model, we implemented a Metropolis-Hastings within Gibbs sampler to estimate the posterior distributions of the parameters of interest (Text S1). Program executables are available upon request from the first author. To check each program, we initially analyzed data sets simulated under the corresponding inference model (Figure S1). In addition, several data sets (including a real one) were also analyzed using a mirror version of the algorithms programmed in BUGS code and implemented in the OpenBUGS software [20]. For each model, results obtained with the two implementations were found in almost perfect agreement (data not shown).

Under the assumption of exchangeability among SNPs (*i.e.* neutrality), *c _{j}* parameters are expected to be the same over SNPs within each population. Therefore, non neutral loci might be viewed as outliers with respect to the null model. One simple way to identify such loci thus consists of evaluating a local assessment of the null model (either model 1 or model 2) at each locus using Posterior Predictive Check tools. This can be easily accomplished by computing PPP-values which are the Bayesian counterparts of the frequentist P-values [15].

The PPP value for SNP *i* over the *J* populations is defined as

(10)

where with is the (reference) allele frequency of SNP *i* in population *j* and is a discrepancy criterion applied to replicated (*rep*) and observed (*obs*) data respectively given the values of model parameters with . Notice that the probability that takes into account both variability in the replicates and uncertainty in the unknown parameters by integrating out with respect to these two sources of the variation via the distributions of and of given all data observed.

A first issue is to choose the discrepancy criterion which, contrarily to usual statistics, depends generally both on data and parameters. Here, we relied on a Chi-square type criterion ([21], formula 6.4, page 175)

(11)

with .

It can be shown (Text S2) that for both models: and . The null hypothesis being primarily based on the exchangeability assumptions between loci made at the second level of the hierarchy (formulae 2 and 9), we used here the moments of the marginal distribution of the 's as our measure of discrepancy between the data and the model. Letting the indicator variable be equal to 1 if and 0 otherwise, then the corresponding is simply the posterior expectations of and can be easily computed from the Gibbs sampling outputs. The replicated data are generated at each iteration from the predictive distribution of given each current values of .

Extreme probabilities at a given locus will indicate that the data at this locus are inconsistent with the model. Actually, small values correspond to positive selection and large values to balancing selection.

Under this model, referred hereafter as model 3, allele count data are modeled according to the reparameterized extension, recently proposed by Riebler *et al.* [13], of the Bayesian hierarchical model developed by Beaumont and Balding [10]. Model 3 is actually identical in its first levels to model 2 described above. Nevertheless, the differentiation parameter (*c _{ij}*) is considered as both locus and population specific. In that respect, model 2 might be viewed as model 3 under the null hypothesis of neutrality (locus exchangeability). Hence, model 3 assumes the

The *τ _{ij}*'s are subsequently modeled via a linear model on the logistic transformation of the

where *α _{i}* is a locus effect,

The posterior distributions of the different parameters of interest were estimated via MCMC procedures as previously described [13] from 2,000 post burn-in samples (with a burn-in period of 2,500 iterations) and a thinning interval of k=25. The decision rule to identify loci subjected to selection was based on a Bayes Factor (*BF*) derived at each locus from the posterior distribution of the *δ _{i}* [12]. To make interpretation of the

Four different demographic scenarios were investigated to evaluate the distribution of the PPP-values for neutrally evolving SNPs. In the first and second scenarios, allele count data were simulated for *L*=1,000 independent bi-allelic neutral SNPs in *P* random mating populations evolving under a pure-drift Wright-Fisher demographic model over *T* non overlapping discrete generations from a common ancestral population. Under this model we thus expect the population specific *F _{ST}* for a population with a constant (diploid) size

Finally, to investigate a more complex (and realistic) demographic scenario, we simulated a data set under the calibrated model featuring the best-fitting conditions for four human populations (Europeans, Africans, Asians and African Americans) using the cosi package [22]. The demographic history of the populations corresponded to an Out-of-Africa model of an ancestral population that splits into Africans and non-Africans, and then into Europeans and Asians. African Americans are modeled as a recent admixture of the African and European populations. Fifty 250 kb long (autosomal) segments were simulated using heterogeneous recombination rates (picked from the empirical distribution of the deCODE genetic map [23]) leading to a total of 107,158 SNPs.

Allele count data were simulated for *L* independent bi-allelic SNPs in *P* random mating populations evolving over T discrete non overlapping discrete generations from a common ancestral population (star shaped phylogeny). We first considered a simple pure-drift Wright-Fisher model (Text S3) in which the current populations are derived from an ancestral one in complete isolation (*i.e.* without migration between populations). We also simulated data under a simple Wright Fisher model with migration (Appendix III2) using a simplified version of previously described algorithms [10], [13]. In both models, selection was further introduced in the model by attaching a selective coefficient ( for a neutral locus) and a selection type (either positive or balancing) to each SNP *i*.

In all the simulations we did not consider mutation. In addition, SNP fixed for the same allele in all populations were discarded from further analysis. This might somewhat mimic part of the ascertainment bias expected in real data sets, since monomorphic SNPs are not expected to be genotyped or analyzed [12].

Under the hypothesis of exchangeability of the SNP (*i.e.* SNP neutrality), PPP-values for neutral loci are expected to be close to 0.5. However statistical noise introduces some dispersion around this value and thus some thresholds are necessary to define outliers. In addition, both models of differentiation could only be related to very simple demographic models and thus departure from the true demographic history might also affect the PPP-value distribution. Thus, in order to evaluate the robustness of both PPP-value classifiers based on the two alternative Bayesian hierarchical models 1 and 2, we first investigated the PPP-values distribution for data sets simulated under various neutral demographic scenarios (see Materials and Methods).

The first scenario (PDN1) is a simple Wright-Fisher pure drift model with 1,000 (neutral) SNPs segregating in 10 different populations of constant size originating from a common ancestral one *T* generations ago. As mentioned above, the statistical model 1 was proposed to deal with this latter kind of non-equilibrium demographic scenarios [17], [18]. Several values of *T* were considered to evaluate the effect of the level of differentiation: from *T=10* (*F _{ST}*=0.01) to

Under such pure-drift divergence models, a strong (and implicit) hypothesis which might often be violated concerns the star shaped phylogeny relating the different populations. As an attempt to evaluate consequences of departure from such simple phylogeny, we thus analyzed data sets simulated under a pure-drift demographic scenario with a more complex history (PDN2) starting with 4 populations at *T*=0, two of which giving rise to 2 populations at *T*=30 and *T*=50 generations respectively. This resulted in an increase of PPP-values dispersion soon after the population split (*e.g. T*=35 and *T*=55 in Table 1). This could be directly related to previous observations since population splits lead to a clear underestimation of population-specific *F _{ST}* for the newly arisen populations ( tending to 0 at early time after the split). Overall, although bias in the estimation of the persisted, the dispersion decreased as the number of generations (since the split), thus rendering the PPP-value approach relatively robust.

In a third demographic scenario (MDN), we simulated 1,000 SNPs segregating in 10 populations under a migration-drift equilibrium which corresponds to the inference model 2 (see Methods). Although, no clear bias was observed in the estimation of the with both models (*e.g.* Figure S2), PPP-value dispersion increased as the level of differentiation decreased. Nevertheless and as expected, model 1 appeared less robust than model 2 at low level of differentiation (*F _{ST}*<0.05). In addition, the departure of the average PPP-values toward values lower than (the expected) 0.5 appeared more pronounced as the differentiation increased.

We finally explored with coalescent simulations a more realistic scenario (COA) consisting in the calibrated Out-of-Africa model featuring the best-fitting conditions for four human populations (Europeans, Africans, Asians, and African Americans modelled as a recent admixture of Africans and Europeans) [22]. Note that because, 50 independent 250 kb segments were simulated, some SNPs were not independent. Based on the complete resulting data sets, three different population groups were analyzed (Table 1). Results were overall consistent with those reported above for more simple scenarios. Hence, for the EuroAfriAsia group, almost no SNP displayed PPP-values below 0.2 or above 0.8. As expected from previous observations on PDN2 simulated data sets, introducing the African American recently admixed populations lead to a higher dispersion of PPP-values (more pronounced for the analysis of the EuroAfAmAfri group) together with a low estimated for this population (<0.015 when analyzing the four simulated populations and <0.001 when analyzing the EuroAfAmAfri group). Nevertheless, only a small proportion of the SNPs (<1%) displayed PPP-values lower than 0.1 or higher than 0.9.

On the basis of these different simulations, the distribution of PPP-values appears robust to various demographic scenarios provided that the global estimated population differentiation (average) is not too small (as a rule of thumb *F _{ST}*>0.05). In addition and as previously mentioned [12], [17], the two models considered in this study remain almost insensitive to the prior distribution of the which might be (demographically) interpreted as the allele frequency in the ancestral population (under a pure-drift model) or in the gene pool (under a migration-drift model at equilibrium). As a result, the models are expected to be robust to the chosen SNP ascertainment scheme. Hence, for the first three scenarios investigated above, the distribution of PPP-values appeared almost unchanged when keeping all the SNPs in the analysis, even those fixed in all populations (data not shown). Similarly, results reported in Table 1 for a different SNP ascertainment scheme applied on the COA simulations which consisted in keeping only those SNPs segregating (MAF>0.01) in at least two populations suggested that the influence of ascertainment bias on the PPP-value distribution is small.

In order to evaluate to what extent selection causes an outlier PPP-value for the underlying SNPs, we analyzed several simulated data sets with some SNPs subjected both to balancing and positive selection (see Materials and Methods) and three different selection coefficients as representative of low (*s=0.02*), moderate (*s=0.05*) and high (*s=0.10*) selection intensity.

We first considered a pure-drift demographic scenario similar to the PDN1 one described previously. We herein report results obtained with a data set consisting of eight populations with a constant haploid size of N=500 deriving from a common ancestral one and genotyped for 10,000 SNPs among which 8,500 were neutral (*s _{i}=0*), 750 were subjected to positive selection (250 with

Interestingly, estimates of the ancestral reference allele frequency (mean of the posterior distribution of ) were remarkably consistent with their corresponding simulated values for neutral SNPs although precision decreased with increased level of divergence (Figure S5). Indeed, the correlation between simulated and estimated ancestral allele frequencies was always above 0.98 with both models, while the Root Relative Mean Square Error (RRMSE) ranged from 3.25% (*T=10*) to 10.7% (*T=100*) with model 1 and from 3.24% (*T=10*) to 9.90% (*T=100*) with model 2. However, these estimates were biased for SNPs under selection (Figure S5), the bias increasing with divergence and intensity of selection. More precisely, the RRMSE ranged from 4.04% (*T=10* and *s=0.02*) to 40.9% (*T=100* and *s=0.10*) with model 1 and from 4.04% to 39.5% with model 2 for SNPs under balancing selection. Similarly, for SNPs subjected to positive selection, the RRMSE ranged from 3.24% (*T=10* and *s=0.02*) to 41.4% (*T=100* and *s=0.10*) with model 1 and from 3.23% to 40.8% with model 2.

For these five simulated data sets the mean of the different PPP-value distributions were always close to 0.5 (from 0.480 to 0.489 for model 1 results and from 0.501 to 0.528 for model 2 results) while the standard deviation decreased with level of divergence (from 0.235 when T=10 to 0.106 when T=100 for model 1 results and from 0.231 when T=10 to 0.0958 when T=100 for model 2 results). As expected and shown in Figure 1 for two of these simulated data sets (*T=10* and *T=100*), the PPP-value median (and mean) remained close to 0.5 for neutral SNPs while tending to 0 (respectively 1) for SNPs subjected to positive (respectively balancing) selection. Moreover, this trend was more pronounced as the selective coefficient and differentiation increased and for SNPs under positive selection. As a result, the tails were more enriched in SNPs under selection (Table S1), model 2 showing an increased power of discrimination compared to model 1 (at least for SNPs under positive selection). For instance, while 7.5% of the simulated SNPs were subjected to positive selection, this proportion ranged from 39.2% (*T=10*) to 73.6% (when *T=100*) for the 250 SNPs with the lowest PPP-values obtained with model 1 and from 40.4% (*T=10*) to 100% (*T=75* and *T=100*) with model 2 (a vast majority of these SNPs being those with high value of *s*). Discrimination based on PPP-values appeared nevertheless far less efficient in identifying SNPs under balancing selection (Table S1). Hence, while 7.5% of the simulated SNPs were subjected to balancing selection, this proportion ranged from 15.6% (*T=10*) to 64.0% (when *T=75* and *T=100*) for the 250 SNPs with the lowest PPP-values obtained with model 1 and from 9.20% (*T=10*) to 56.0% (*T=100*) with model 2.

Although we expected lower (resp. upper) tails to be enriched in SNPs under positive (resp. balancing) selection, identifying outliers on the observed PPP-value distribution makes it impossible, in practice, to control for False Discovery Rate (FDR) or False Negative Rate. We thus further investigated the power and robustness of each model, based on the simulated data sets, by computing FDR and recording the number of SNPs properly identified as subjected to selection for different PPP-value threshold (Table 2). For a given threshold the FDR but also the power decreased as the number of simulated generations increased, which was expected since this also resulted in sharpening the overall PPP-value distribution due in particular to an increase of the *c _{j}* and thus the allele frequency variance. Similarly, the power was always higher when considering SNPs subjected to stronger selection (see above). Model 2 appeared far more efficient than model 1 mainly because PPP-value estimates were more extreme for SNPs under selection (

We further evaluated both statistical models on data sets simulated under a migration drift demographic model (see Material and Methods). Note that assuming populations have reached (migration-drift) equilibrium, model 2 is consistent with such demographic scenarios (see Material and Methods). As in the previous section, we herein reported results obtained with data sets consisting of eight populations with a constant haploid size of *N=500* which, according to a Wright island demographic model and during 250 generations, exchanged migrant alleles through a common gene pool (Text S3). For each data set, 10,000 SNPs were simulated among which 8,500 were neutral (*s _{i}=0*), 750 were subjected to positive selection (250 with

In each case, the values of obtained under model 1 and model 2 for each of the eight simulated populations were found very close to the corresponding simulated *F _{ST}*: for data set simulated with

Consequently, this simple migration drift demographic scenario close to equilibrium appeared more favorable than the previous one to identify SNPs subjected to selection based on the PPP-value criterion (Figure 2). As previously, the mean of the PPP-values was close to 0.5 for both models (from 0.453 to 0.502 with model 1 and from 0.514 to 0.538 with model 2) while standard deviations were lower and decreased with the level of differentiation (from 0.131 to 0.191 with model 1 and from 0.129 to 0.183 with model 2). Likewise, the proportion of SNPs under selection in the tails of the distribution was higher (Table S1). For instance, among the 250 SNPs with the lowest PPP-values, from 92.8% to 98.4% (with model 1) and 100% (with model 2) were subjected to positive selection, while among the 250 SNPs with the highest PPP-values from 29.3% to 45.6% (with model 1) and from 27.3% to 42.6% (with model 2) were under balancing selection. Compared to previous simulation demographic scenarios, model 1 estimates of the PPP-values for SNPs subjected to positive selection deviate to a lesser extent from those estimated with model 2 (Figure 2).

As a consequence, for a given PPP-value threshold, power and robustness to detect SNPs under selection were greatly improved (Table 2). Hence, when looking for SNPs subjected to positive selection, at the 0.2 threshold, FDR was very close to 0 for both models while almost all the SNPs with *s=0.1* were detected. In general, FDR tended to decrease with differentiation while model 2 performed better than model 1.

To compare the power and robustness of our decision criterion based on PPP-values to identify SNPs under selection with a previously reported approach, we further analyzed the above simulated data sets under the model initially proposed by Beaumont and Balding [10]. As detailed in the Methods section, this model relies on the estimation, through a logistic regression, of both a “locus effect” and a “population effect” on genetic differentiation. Based on an extension proposed by Riebler *et al.* [13], we recently proposed to derive for each SNP a Bayes Factor (BF) which provides a straightforward decision criterion to decide whether the SNP is subjected to selection [12]. Indeed, in agreement with the Jeffreys' rule [25], we showed that a threshold of 15 (respectively 10) on such BF (expressed in Deciban units) appeared optimal to detect SNPs under positive (respectively balancing) selection. This model and its extensions [10]–[13] might thus be considered as the state of the art Bayesian approach to identify SNPs under selection although it requires far more computational efforts than for models considered in the present study (see Discussion).

We first assessed the power of the three different classifiers (based on PPP-values estimated under models 1 and 2 and BF estimated under model 3) by generating receiver operating characteristic (ROC) curves [26] which plot the power *vs* (1-specificity) for a binary classification system whereby the cutoff value is varied. Curves resulting from the analysis of nine different simulated data sets are reported in Figure 3 for each of the three classifiers and distinguishing SNPs subjected to positive (in red) and balancing (in blue) selection (irrespectively of the intensities of selection). As expected from previous observations, ROC curves for SNPs subjected to positive selection were always better than ROC curves for SNPs under balancing selection. Similarly, power to detect SNPs subjected to selection under a pure-drift demographic model increased with differentiation but remained lower than under a migration-drift demographic model. Interestingly, ROC curves of both PPP-value classifiers were generally above ROC curves for the BF classifier while the PPP-value classifier based on model 2 clearly outperformed the PPP-value classifier based on model 1 for positive selection. Nevertheless, the definition of an optimal threshold value for the PPP-value classifier strongly depends on the level of differentiation (see above). Table 3 reports the comparisons of the power and robustness of the analyses performed with model 2 and model 3. As a matter of expedience we chose a PPP-value threshold of 0.2 (respectively 0.8) to declare SNPs as subjected to positive (respectively balancing) selection and a threshold of 15 on BF was chosen when analyzing data with model 3. At such thresholds, the FDR were generally similar among the two different analyses except for low level of differentiation (*F _{ST}≤0.05*). In these cases FDR (and thus power) was substantially higher for model 2 than for model 3. Note that a great proportion of false positives originated from the upper tail of the PPP-value distribution (see for instance results for the MDM data sets simulated with

Finally, as shown in Table 3 and Figure 4, results were in good agreement between the two models since a good relationship appeared between estimated PPP-values and BF (the more extreme PPP-value towards 0 or 1, the higher the BF). As an example, in the MDM simulated data set with *F _{ST}*=0.15, respectively 338 and 328 SNPs were (correctly) identified as under positive selection with analysis based on model 2 and model 3 (Table 3). Among these, 313 SNPs were shared by both procedures.

We finally analyzed a data set consisting of 9 West-African cattle populations genotyped for 36,320 SNPs we previously used to perform of whole genome scan for adaptive divergence [12]. Overall, estimates of population specific differentiation were in good agreement with those previously reported (Table S2) especially, and as expected, when using model 2. Nevertheless, for highly differentiated populations (*F _{ST}*>0.2), model 1 resulted in higher values. The

We further studied the distribution of PPP-values estimated for each SNPs. As with simulated data, the average PPP-value was close to 0.5 with both models (0.480 with model 1 and 0.521 with model 2) with an almost equal standard deviation (0.173 with model 1 and 0.172 with model 2). When comparing these results to those previously obtained (Figure 4C), an overall good agreement was observed. Indeed, the higher the SNPs were differentiated (plotted in blue in the Figure 4C) and the higher the BF, the lower the estimated PPP-values. Conversely, the lower the SNPs were differentiated (plotted in red in the Figure 4C) and the higher the BF, the higher the estimated PPP-values. However, although the estimates from the two models lead to qualitatively similar results with a global correlation of 0.928 between them, the dispersion was higher for those obtained with model 2 (from 0.008 to 0.924) than model 1 (from 0.012 to 0.902). In particular, in the tails of the distributions PPP-values were generally more extreme with model 2 than model 1. This observation was actually in good agreement with results obtained on simulated data sets (see above). As a consequence, model 2 lead to a decision regarding SNP selection status more in agreement with the one based on the BF [12]. For instance, we initially identified 2,054 SNPs with a BF>15 as candidates to be subjected to selection, among which 537 were most likely under balancing selection (*F _{ST}*<0.011) while 1,517 were most likely under positive selection (

In this study, we implement a Bayesian model-based strategy to scan for adaptive differentiation on large SNP data sets. The rationale of such approaches consists of evaluating a local adjustment of the null model to data at each locus by computing PPP-values [15]. We then investigated two different models (model 1 and model 2), which could respectively be interpreted in the context of pure-drift divergence [17] and an infinite Wright island model involving drift and migration at its equilibrium state [10], [18]. An important feature of both models was the possibility to derive population-specific differentiation estimates [18]. These latter parameters are then expected to be constant over SNPs within each population under the neutral hypothesis corresponding to SNP exchangeability. Consequently, SNPs displaying either high or low PPP-values (*i.e.* outliers) were interpreted as loci possibly under selection. From a theoretical point of view, such likelihood-based strategy permits full use of the data, relying explicitly on a model rather than opting for the identification of outliers based on an empirical or simulated distribution of summary statistics [7]–[9]. Conversely, in our approach, the integration over the posterior distribution of parameters, given the complete data vector in the calculation of PPP-values, results in a rather conservative procedure (double use of the data). However, and as pointed out by Bayarri and Castellanos [27], a small (or large) PPP-value “can safely be interpreted as incompatibility with the null model”. Nevertheless, compressing the data into summary statistics and further replacing likelihood computation by data simulations under an Approximated Bayesian Computation (ABC) framework was recently proven efficient to identify loci subjected to selection while allowing complex genetic model to be considered [28].

Evaluation of both models on simulated data sets revealed that they performed equally in estimating differentiation when considering a migration-drift scenario while model 2 was surprisingly more efficient under the pure-drift divergence scenario. For the latter scenario, nevertheless, they together resulted in underestimation of differentiation at low level of pure-drift divergence (*F _{ST}*<0.05) while model 1 implementation became fairly inaccurate (

As expected from previous studies [10]–[13], estimated PPP-values allowed more accurate identification of SNPs under positive rather than balancing selection. Moreover, the accuracy increased with the intensity of selection. Importantly, the dispersion of PPP-values was confirmed to be highly dependent on the level of differentiation since this is directly related to the variance in allele frequency. In particular, for the case of low level of pure-drift divergence (*F _{ST}*<0.05) for which we observed an underestimation of differentiation, this might contribute to an increased FDR. More generally and from a desired practical point of view, this made it difficult to propose standard thresholds on the distribution of PPP-values to decide whether or not a SNP is under selection, although more sophisticated approaches could have been envisioned such as clustering techniques based on mixture models. However, at this stage, we wanted to keep the decision rule as simple as possible. Thus, as a matter of expedience, to detect SNP candidates to be subjected to positive (balancing) selection, a conservative threshold of 0.1 (0.9) might be recommended since in most cases investigated through simulation it lead to a FDR close to 0 (although it increases when differentiation decreases) and an optimal power (which conversely decreases as differentiation increases). When analyzing the two types of simulated data sets, it clearly appeared that both models were less powerful when dealing with completely isolated populations. However and more strikingly, model 2 undoubtedly outperformed model 1 in identifying outliers probably because it resulted in more extreme PPP-value estimates for SNPs localized in either tail of the distribution. This trend was confirmed when comparing results with those based on the alternative and more complex modeling represented by model 3 [10]–[13]. More generally, both approaches were shown to have similar performances. This might be related to the high similarity between the two underlying statistical models. Note however that while the PPP-value is essentially a measure of the departure from the null hypothesis (SNP neutrality), under the approach based on model 3, an explicit modeling of the alternative hypothesis is performed through the introduction of a SNP effect in a logistic regression of the differentiation.

For real data sets, the SNP ascertainment process might also be expected to affect robustness of the approaches. Hence, approaches to perform a separate modeling of the demographic and ascertainment processes have recently been proposed [29]. However, as previously discussed [12] and suggested by our simulations, the different models appeared rather robust to such a bias. This might result from their insensitiveness to the prior distribution on the . As a consequence, the main difficulties might be more related to the simple assumptions on which demographic scenarios relied. Except for some special cases (*e.g.* artificial selection experiments based on the development of divergent lines from a common founder population), assuming star shaped phylogenies, as in a pure-drift model, remains highly unrealistic and leads to an underestimation of the variance in allele frequency when a hierarchical structure exists among the populations studied [30]. Due to the dependency of PPP-values on this crucial parameter, we might have expected a higher rate of false positives in these situations. For instance, it was recently shown that such population relationships greatly increased the rate of false positives in tests of selection based on *F _{ST}* which use a null distribution generated under a simple island model of differentiation [31]. Nevertheless, simulations under more complex scenarios suggested that the approach was relatively robust to departure from simple demographic assumptions provided the level of differentiation was not too low. Interestingly, hierarchical structure among populations as introduced from recent admixture or population splitting appeared as limiting cases. Indeed, we observed a high underestimation of the population specific differentiation parameter for recently admixed (or split) populations, leading to an increase of the PPP-values dispersion. Owing to the flexibility of Bayesian hierarchical modeling, it might be straightforward to include additional levels in models 1 and 2 to incorporate prior information on relationships among the populations surveyed. Alternatively, in the context of high throughput genotyping data sets, results from different neighbor SNPs might be empirically combined to identify regions in the genome displaying an unexpectedly high proportion of outliers [12]. Such regional information was overlooked in both our models since we considered all SNPs as exchangeable. Introduction of a spatial structure among SNPs was recently investigated, to take Linkage Disequilibrium into account in a model extending model 2 [14]. However, when considered on a whole genome basis, such approaches might add significant computational burden.

Finally, the model-based strategy presented in this study was chiefly operational and this might be viewed as an efficient way to perform a first exploratory analysis of large data sets. Hence running the MCMC algorithm underlying model 1 for 250,000 iterations on a data set containing 10,000 SNPs genotyped on 10 populations needed approximately 2 hours on a PC equipped with a 2.1 GHz processor. In contrast, the analysis of the cattle data set with model 3 [10], [12], [13] took more than 40 hours (*i.e.* approximately 20 times slower).

Description of the MCMC algorithms.

(0.09 MB PDF)

Click here for additional data file.^{(91K, pdf)}

Derivation of E(f_{ij}|θ_{ij}) and Var(f_{ij}|θ_{ij}).

(0.01 MB PDF)

Click here for additional data file.^{(15K, pdf)}

Composition of the tails of the PPP-values distribution. Results for eight different data sets are reported, five of which were simulated under a pure drift demographic model (PDM) with a varying number T of simulated generations and three were simulated under a migration drift model (MDM) with a varying level of migration controlled by the simulated FST. For each simulated data sets and analysis (either with model 1 or model 2), the proportions of SNPs subjected to selection within the considered tail of the distribution which belong to the different classes of selection type (N=neutral, B=balancing or P=positive) and selection coefficients (s=0.02, s=0.05 and s=0.10) are reported.

(0.04 MB XLS)

Click here for additional data file.^{(36K, xls)}

Estimates of differentiation for the nine populations from the cattle data set obtained with model 1, model 2 and the alternative model used in the initial report [12].

(0.01 MB XLS)

Click here for additional data file.^{(14K, xls)}

95% equal tail Credible Interval for the differentiation parameter c obtained after analyzing two data sets DS1 and DS2 simulated respectively under inference model 1 and 2. The two simulated data sets consist of 1,000 SNPs and 12 populations with the following simulated value of c : c1=c2=0.01, c3=c4=0.025, c5=c6=0.05, c7=c8=0.075, c9=c10=0.1 and c11=c12=0.15. A) Data set DS1 analyzed with model 1, B) Data set DS1 analyzed with model 2, C) Data set DS2 analyzed with model 2, D) Data set DS2 analyzed with model 2.

(0.03 MB PDF)

Click here for additional data file.^{(29K, pdf)}

Estimates of c for 17 data sets simulated under a pure-drift demographic model. Allele counts for 1,000 (neutral) SNPs were simulated for 10 populations and for 23 different times (measured in number of discrete generations) after divergence (T=10, T=20, T=30, T=40, T=50, T=60, T=70, T=80, T=90, T=100, T=125, T=150, T=175, T=200, T=250, T=300, T=400, T=500, T=600, T=700, T=800, T=900 and T=1000). The resulting data sets were analyzed using both model 1 (A) and model 2 (B). Resulting estimates (mean of the posterior distribution) are plotted against the corresponding simulated time (the different number representing the population label) and are connected by a line. The grey dashed line represents the expected FST value (see Methods).

(0.07 MB PDF)

Click here for additional data file.^{(69K, pdf)}

Allele frequency distribution within a population of constant (haploid) effective size (Ne=500) evolving during T discrete generations (T=log(1−c)/log(1−1/Ne) where c is a measure a population differentiation) as a function of the initial allele frequency (Pi). For each case investigated, a histogram of 1,000,000 simulated values is plotted together with the corresponding densities from model 1 (truncated Gaussian in blue with probability masses in 0 and 1) and model 2 (Beta distribution). Note that an exact derivation of the corresponding distribution has been derived using a forward-time diffusion approach [24].

(0.32 MB PDF)

Click here for additional data file.^{(316K, pdf)}

Estimates of c for five data sets simulated under a pure-drift demographic model. Allele counts for 10,000 SNPs (8,500 neutral SNPs, 750 subjected to positive selection and 750 to balancing selection) were simulated for 8 populations and for 5 different times (measured in number of discrete generations) after divergence (T=10, T=25, T=50, T=75 and T=100). The five resulting data sets were analyzed using both model 1 (A) and model 2 (B). Resulting estimates (mean of the posterior distribution) are plotted against the corresponding simulated time (the different number representing the population label) and are connected by a line. The grey dashed line represents the expected FST value (see Methods).

(0.01 MB PDF)

Click here for additional data file.^{(13K, pdf)}

Robustness of the estimates (mean of the posterior distribution) of the ancestral (reference) allele frequency πi. Two data sets (T=10 and T=100) consisting in genotyping data for 10,000 SNPs (8,500 neutral SNPs, 750 subjected to positive selection and 750 subjected to balancing selection) on 8 populations were analyzed with model 1 and model 2 (see text). For each data set, three plots are shown: i) estimates obtained under model 1 against (true) simulated values (A with T=10 and D with T=100), ii) estimates obtained under model 2 against (true) simulated values (B with T=10 and E with T=100) and iii) estimates obtained under model 1 against estimates obtained under model 2 (C with T=10 and E with T=100). Neutral SNPs are plotted in grey while SNPs subjected to positive (respectively balancing) selection are plotted in red (respectively blue). In addition, the simulated coefficients of selection are represented by a triangle (s=0.02), a circle (s=0.05) or a square (s=0.10).

(1.06 MB JPG)

Click here for additional data file.^{(1.0M, jpg)}

Robustness of the estimates (mean of the posterior distribution) of the (reference) allele frequency πi in the gene pool. Two data sets (FST=0.05 and FST=0.15) consisting in genotyping data for 10,000 SNPs (8,500 neutral SNPs, 750 subjected to positive selection and 750 subjected to balancing selection) on 8 populations were analyzed with model 1 and model 2 (see text). For each data set, three plots are represented: i) estimates obtained under model 1 against (true) simulated values (A with FST=0.05 and D with FST=0.15), ii) estimates obtained under model 2 against (true) simulated values (B with FST=0.05 and E with FST=0.15) and iii) estimates obtained under model 1 against estimates obtained under model 2 (C with FST=0.05 and E with FST=0.15). Neutral SNPs are plotted in grey while SNPs subjected to positive (respectively balancing) selection are plotted in red (respectively blue). In addition, the simulated coefficients of selection are represented by a triangle (s=0.02), a circle (s=0.05) or a square (s=0.10).

(1.51 MB JPG)

Click here for additional data file.^{(1.4M, jpg)}

We wish to thank Gilles Celeux (INRIA) for helpful discussions and support for this project and George Nicholson (University of Oxford, UK) for providing further insights into the hierarchical model proposed by his group [17]. We also thank two anonymous referees for helpful corrections and suggestions.

**Competing Interests: **The authors have declared that no competing interests exist.

**Funding: **This work was supported by the French National Institute of Agronomic Research (INRA). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

1. Beaumont M, Nichols RA. Evaluating loci for use in the genetic analysis of population structure. Proc R Soc Lond B. 1996;263:1619–1626.

2. Lewontin RC, Krakauer J. Distribution of gene frequency as a test of the theory of the selective neutrality of polymorphisms. Genetics. 1973;74:175–195. [PubMed]

3. Nielsen R. Molecular signatures of natural selection. Annu Rev Genet. 2005;39:197–218. [PubMed]

4. Storz JF. Using genome scans of DNA polymorphism to infer adaptive population divergence. Mol Ecol. 2005;14:671–688. [PubMed]

5. Vitalis R, Dawson K, Boursot P. Interpretation of variation across marker loci as evidence of selection. Genetics. 2001;158:1811–1823. [PubMed]

6. Bowcock AM, Kidd JR, Mountain JL, Hebert JM, Carotenuto L, et al. Drift, admixture, and selection in human evolution: a study with DNA polymorphisms. Proc Natl Acad Sci U S A. 1991;88:839–843. [PubMed]

7. Akey JM, Zhang G, Zhang K, Jin L, Shriver MD. Interrogating a high-density SNP map for signatures of natural selection. Genome Res. 2002;12:1805–1814. [PubMed]

8. Flori L, Fritz S, Jaffrezic F, Boussaha M, Gut I, et al. The genome response to artificial selection: a case study in dairy cattle. PLoS One. 2009;4:e6595. [PMC free article] [PubMed]

9. Weir BS, Cardon LR, Anderson AD, Nielsen DM, Hill WG. Measures of human population structure show heterogeneity among genomic regions. Genome Res. 2005;15:1468–1476. [PubMed]

10. Beaumont MA, Balding DJ. Identifying adaptive genetic divergence among populations from genome scans. Mol Ecol. 2004;13:969–980. [PubMed]

11. Foll M, Gaggiotti O. A genome-scan method to identify selected Loci appropriate for both dominant and codominant markers: a bayesian perspective. Genetics. 2008;180:977–993. [PubMed]

12. Gautier M, Flori L, Riebler A, Jaffrezic F, Laloe D, et al. A whole genome Bayesian scan for adaptive genetic divergence in West African cattle. BMC Genomics. 2009;10:550. [PMC free article] [PubMed]

13. Riebler A, Held L, Stephan W. Bayesian variable selection for detecting adaptive genomic differences among populations. Genetics. 2008;178:1817–1829. [PubMed]

14. Guo F, Dey DK, Holsinger KE. A Bayesian hierarchical model for analysis of SNP diversity in multilocus, multipopulation samples. J Am Stat Assoc. 2009;104:142–154. [PMC free article] [PubMed]

15. Gelman A, Meng XL, Stern H. Posterior Predictive Assessment of Model Fitness via Realized Discrepancies. Statistica sinica. 1996;6:733–807.

16. Rubin DB. Bayesianly justifiable and relevant frequency calculations for the applied statistician. Ann Statist. 1984;12:1151–1172.

17. Nicholson G, Smith AV, Jonsson F, Gustafsson O, Stefansson K, et al. Assessing population differentiation and isolation from single-nucleotide polymorphism data. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2002;64:695–715.

18. Balding DJ. Likelihood-based inference for genetic correlation coefficients. Theor Popul Biol. 2003;63:221–230. [PubMed]

19. Foll M, Beaumont MA, Gaggiotti O. An approximate Bayesian computation approach to overcome biases that arise when using amplified fragment length polymorphism markers to study population structure. Genetics. 2008;179:927–939. [PubMed]

20. Thomas A, O Hara B, Ligges U, Sturtz S. Making BUGS Open. R News. 2006:12–17.

21. Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis, 2nd edition. Chapman & Hall/CRC; 2004.

22. Schaffner SF, Foo C, Gabriel S, Reich D, Daly MJ, et al. Calibrating a coalescent simulation of human genome sequence variation. Genome Res. 2005;15:1576–1583. [PubMed]

23. Kong A, Gudbjartsson DF, Sainz J, Jonsdottir GM, Gudjonsson SA, et al. A high-resolution recombination map of the human genome. Nat Genet. 2002;31:241–247. [PubMed]

24. Crow JF, Kimura M. An Introduction to Population Genetics Theory. New York: Harper an Row; 1970.

25. Jeffreys H. Theory of Probability, 3rd edition. Oxford University Press; 1961.

26. Fawcett T. An introduction to ROC analysis. Pattern Recogn Lett. 2006;27:861–874.

27. Bayarri MJ, Castellanos ME. Bayesian Checking of the Second Level of Hierarchical Models (Discussion Paper). Statistical Science. 2007;22:322–343.

28. Bazin E, Dawson KJ, Beaumont MA. Likelihood-free Inference of Population Structure and Local Adaptation in a Bayesian Hierarchical Model. Genetics. 2010 in press. [PubMed]

29. Guillot G, Foll M. Correcting for ascertainment bias in the inference of population structure. Bioinformatics. 2009;25:552–554. [PubMed]

30. Robertson A. Gene frequency distributions as a test of selective neutrality. Genetics. 1975;81:775–785. [PubMed]

31. Excoffier L, Hofer T, Foll M. Detecting loci under selection in a hierarchically structured population. Heredity. 2009;103:285–298. [PubMed]

Articles from PLoS ONE are provided here courtesy of **Public Library of Science**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |