|Home | About | Journals | Submit | Contact Us | Français|
Domesticated Asian rice (Oryza sativa) is one of the oldest domesticated crop species in the world, having fed more people than any other plant in human history. We report the patterns of DNA sequence variation in rice and its wild ancestor, O. rufipogon, across 111 randomly chosen gene fragments, and use these to infer the evolutionary dynamics that led to the origins of rice. There is a genome-wide excess of high-frequency derived single nucleotide polymorphisms (SNPs) in O. sativa varieties, a pattern that has not been reported for other crop species. We developed several alternative models to explain contemporary patterns of polymorphisms in rice, including a (i) selectively neutral population bottleneck model, (ii) bottleneck plus migration model, (iii) multiple selective sweeps model, and (iv) bottleneck plus selective sweeps model. We find that a simple bottleneck model, which has been the dominant demographic model for domesticated species, cannot explain the derived nucleotide polymorphism site frequency spectrum in rice. Instead, a bottleneck model that incorporates selective sweeps, or a more complex demographic model that includes subdivision and gene flow, are more plausible explanations for patterns of variation in domesticated rice varieties. If selective sweeps are indeed the explanation for the observed nucleotide data of domesticated rice, it suggests that strong selection can leave its imprint on genome-wide polymorphism patterns, contrary to expectations that selection results only in a local signature of variation.
Domesticated Asian rice is one of the oldest and most important crops in the world. Two main rice evolutionary lineages have been identified, and are thought to have been independently domesticated in Asia. We have examined patterns of DNA sequence variation in the genomes of rice and its wild ancestor to make inferences about the origin of domesticated rice. Population bottlenecks (a reduction in the size of the founding population) in the evolutionary transition from wild to cultivated species has long been thought to be the dominant force shaping patterns of molecular evolution during domestication. We find that the nucleotide variation patterns in rice are inconsistent with a simple bottleneck model. Rice genetic variation, however, can be explained by either a model that incorporates both a bottleneck and migration among rice variety groups, or a model that incorporates a bottleneck and multiple rounds of artificial selection on rice. Selection by humans is believed to have played an important role during crop domestication, and these results may suggest that strong, recurrent selection can leave a signal that can be observed throughout the genomes of domesticated species.
Domestication is a complex, cumulative evolutionary process in which human use of organisms leads to morphological and/or behavioral changes distinguishing domesticated species from their wild ancestors [1,2]. Beginning with Charles Darwin [3,4], there has been strong interest in the study of domestication of crop species as a means of understanding the nature of selection. Moreover, domestication and the development of agriculture are arguably the most important technological innovations in human history . Crop plant domestication was the linchpin of the Neolithic Revolution 10,000–12,000 years ago, in which hunter-gatherer groups transitioned into sedentary agricultural societies that gave rise to current human cultures . With domestication came the availability of food surpluses, and this agricultural development led to craft specializations, art, religious and social hierarchies, writing, urbanization, and the origin of the state .
One of the earliest domesticated crop species is cultivated Asian rice, Oryza sativa L., which has become the world's most widely grown crop and has also assumed the stature of a key model system in plant biology. Rice consumption constitutes about 20% of the world's caloric intake, and in Asian countries, where over half of the world's population lives, rice often represents over 50% of the calories consumed . Because of its small genome size, rice has been the first crop plant to have its whole genome sequenced [8–10].
A wealth of morphological, physiological, and ecological variation exists within cultivated Asian rice, reflected in the large number of recognized cultivars or strains [11,12]. Two main rice varietal groups, O. sativa indica and O. sativa japonica, have been recognized since ancient China . Although phenotypic distinctions between these groups is not always straightforward, indica varieties tend to be found throughout the tropical regions of Asia and are primarily grown in lowland conditions, while japonica types are differentiated into tropical japonica, distributed in upland tropical regions, and temperate japonica, a recently derived group cultivated in temperate regions [11,13,14]. Additional variety groups include aus, drought-tolerant rice from Bangladesh and West Bengal, and aromatic, fragrant rice from the Himalayan range [14,15]. All rice varieties have a predominantly self-fertilizing mating system . Both morphological and isozyme data have established that O. rufipogon Griff., a partially outcrossing species native to southern Asia, is the wild ancestor of domesticated rice .
In this paper, we describe the levels and patterns of DNA sequence polymorphism across the rice genome and that of its wild ancestor, O. rufipogon. To our knowledge this is the first genome-wide characterization of sequence variation in domesticated Asian rice, and we show that rice contains a unique pattern of excess high-frequency derived single nucleotide polymorphisms (SNPs) that has not been reported in other species. We develop four models to explain patterns of genetic variation in O. sativa and O. rufipogon, including a simple selectively neutral bottleneck model that has been previously thought to be the dominant demographic force shaping levels of nucleotide variation in crop species. We demonstrate that this simple bottleneck model is inadequate to explain the origin of domesticated rice. We conclude that either positive selection has made a significant impact on genomic polymorphism patterns, or that domestication involved an extremely severe bottleneck (~99.5% reduction) coupled with gene flow among modern varieties and between domesticated rice and its wild ancestor.
To assess levels and patterns of polymorphism in the rice genome, we sequenced one hundred eleven randomly chosen gene fragments (sequence-tagged sites or STS) in a diverse panel of Oryza accessions, including 72 from O. sativa and 21 from O. rufipogon (Tables S1 and S2). Average silent (synonymous and noncoding) site nucleotide diversity (θπ) across all sampled loci in O. sativa is approximately 3.20 × 10−3 (Table 1). Levels of polymorphism in the wild ancestral species, O. rufipogon, are predictably higher than rice, with a mean silent θπ of 5.19 × 10−3 (Table 1). These levels of polymorphism are lower than those observed for maize, a domesticated outcrossing species , and Arabidopsis thaliana, a selfing, wild species [17,18].
To determine if any genetic differentiation due to population structure among rice groups is evident in these STS sequences, we used the Bayesian clustering program STRUCTURE . The highest likelihood obtained was with a model specifying K = 7 groups (Figure 1; Table S1). Five groups occur within O. sativa and correspond to the traditional variety designations, as described previously . Evidence of some limited geographical population structure is also observed in O. rufipogon (Figure 1; Table S1). Neighbor-joining analysis of the concatenated STS sequences (Figure S1) revealed two distinct clusters within cultivated rice; one comprises a tropical japonica, temperate japonica, and aromatic rice lineage, and another consists of aus and indica rice. The apparent monophyly of these major groups is consistent with at least two domestication events in rice [14,20–24]. The nesting of the aromatic and the temperate japonica variety groups within tropical japonica suggests the first two groups originated from secondary divergence events from the latter, although the lack of support for tropical japonica branches does not exclude other possible divergence scenarios (Figure S1). Indica and aus relationships, on the other hand, are consistent with rapid divergence after domestication or separate domestication events from the same ancestral gene pool. Within-group SNP levels of cultivated rice are lower than those of the whole species (Table 1), with subpopulations harboring between 19% (temperate japonica) and 43% (indica) of the polymorphism of O. rufipogon. Assuming separate domestication events, the japonica clade contains 42% and the indica clade contains 48% of the diversity levels found in O. rufipogon.
Because of the strong population structure evident in our rice sample, it is necessary to assess patterns of variation separately for each group when making inferences about the evolutionary dynamics of domestication. Indica and tropical japonica represent the most widely grown cultivars for each of the separate domestication events, and we limited our characterization of polymorphism patterns to these two groups. We examined the frequency spectrum of segregating sites within loci using Tajima's D , and found that O. rufipogon and the two main rice subspecies show an excess of rare alleles, as evidenced by the biased distribution of Tajima's D toward negative values (Figure S2; Table 1). Crops are expected to have gone through a population bottleneck during domestication, as only a limited number of founding individuals were brought into cultivation. The distribution of Tajima's D in the domesticated rice varieties is inconsistent with a recent bottleneck, however, as these should reduce levels of low-frequency variants and bias measures of Tajima's D toward positive values. It is possible that subsequent population expansion, due to the spread of rice agriculture, could be responsible for the over-representation of rare alleles segregating in domesticated rice varieties, or selection may have played a role.
We further examined the derived site-frequency spectrum across SNPs (i.e., the fraction of derived polymorphisms present at various frequencies within a group) in indica and tropical japonica. To infer ancestral alleles for each SNP, we used as an outgroup O. meridionalis, a species believed to have diverged from O. sativa ~2 million years ago . In each O. sativa variety we observed a large number of high-frequency derived mutations (i.e., derived SNPs above 70% frequency in the population) leading to a U-shaped frequency distribution (Figure 2); this type of pattern has not been reported at the genomic level in any other species.
Possible explanations for the excess of high-frequency derived SNPs in O. sativa include the misidentification of ancestral states due to shared polymorphism with O. meridionalis, or the occurrence of multiple mutations at given sites since divergence from O. meridionalis. However, both misidentification of derived alleles and multiple hits would be expected to also affect the site-frequency spectrum of O. rufipogon, which is not observed (Figure 2). This suggests that the O. sativa derived site-frequency distribution is a result of the domestication process. Furthermore, derived alleles at high frequency in the O. sativa varieties occur primarily at low to intermediate frequency in O. rufipogon, suggesting that such alleles have only recently increased in frequency (Figure S3).
We also checked the ancestral state calls in O. sativa using the African wild rice O. barthii. Although O. barthii is more closely related to O. sativa than is O. meridionalis, if we assume that both wild species share ancestral polymorphisms with domesticated rice, the possibility that we always identified the same alternative allele as derived in our sample should be low. Using this approach, we find that 88% of our ancestral SNP calls in indica and 86% in tropical japonica matched in O. barthii and O. meridionalis. Even when using only the matched calls (which is a very conservative criterion, since it does not take into account drift and/or fixation processes in O. barthii), the site frequency spectrum in O. sativa varieties remains U-shaped.
An excess of high-frequency derived SNPs is often interpreted as a result of genetic hitchhiking during recent selective sweeps . Because the site-frequency spectrum in rice varieties is observed from randomly selected loci, and the loci contributing high-frequency derived SNPs are distributed across the genome (Figure S4), this pattern suggests that strong linkage to positively selected mutations occurred within most of the genome. However, demographic forces may have also played a role in shaping the rice genomes. We developed several demographic models and a multiple selective sweeps model to test which evolutionary processes may best explain the observed patterns of polymorphism in rice.
The most widely accepted demographic model for crop domestication is a neutral bottleneck model [27–29]. In this model, rice domestication is assumed to be a result of recent population divergence, with one of the two daughter populations experiencing a reduction in population size at divergence associated with the founder effect at the time of domestication, followed by population growth as cultivation of the crop increases. To fit this model to our data, we used a diffusion-based approach [30–32] to predict the pattern of allele frequencies in domestic and ancestral populations under selective neutrality.
Details of the inference procedure can be found in the Materials and Methods section. The composite-likelihood function we employed uses the reduction in diversity observed in either of the domesticated rice subspecies and the shift in allele frequency distribution to estimate four parameters: the time back until the start of domestication (τ1), duration of the bottleneck (τ2), ratio of current population to ancestral population size (ν2), and relative size of the bottleneck population to the ancestral population (νb). The duration of the bottleneck was assumed to be 25% of the time back until domestication (τ2 = 0.25 × τ1), which is consistent with archeological data suggesting it took ~3,000 y from the time of initial cultivation (~12,000 y ago) until the appearance of domesticated rice grains [33,34].
Bottleneck parameter estimates for indica and tropical japonica are broadly comparable, with a slightly more severe bottleneck in tropical japonica (Table 2). Assuming the time back to the beginning of domestication for both variety groups was ~12,000 y , we can independently derive estimates of the current O. rufipogon effective population size, Nrufi, using the relationship τ1 × 2Nrufi = 12,000 (because τ1 is scaled by 2Nrufi). From the indica analyses, Nrufi is equal to 12,000/(2 × 0.1044) = 57,471, and from the tropical japonica analyses is equal to 12,000/(2 × 0.0508) = 118,110 (this exact value of Nrufi is important in scaling all of the estimated parameters into years and number of individuals). The indica-derived Nrufi estimate implies bottleneck and current estimated population size (Ne) for indica of (νb × Nrufi) = 1,413 and (ν2 × Nrufi) = 40,229 respectively. The second estimate suggests a bottleneck and current Ne sizes for tropical japonica of (νb × Nrufi) = 1,334 and (ν2 × Nrufi) = 46,889, respectively.
The differences in estimates of Nrufi from each analysis could be attributable to differences in the founding population of each variety group or differences in the timing of each domestication event. We note, however, that a bottleneck model conditioned on coincident domestication for indica and tropical japonica (equal τ1 values) differs only by 1.8 log likelihood units (unpublished data), suggesting that equal timing of domestication is likely to have occurred. An independent estimate of Nrufi can be found by using the estimated scaled population silent mutation rates (θW = 4Nrufi μ = 5.42 × 10−3 per bp; Table 1) and the observation that the O. rufipogon site-frequency spectrum is consistent with that of a population of long-term constant size (Figure 2). Assuming a neutral mutation rate of 10−8 per bp, yields a point estimate of Nrufi = 135,500, which is slightly higher, but close to the estimates found by conditioning on the start of domestication.
It is important to note that population bottlenecks alone would not generate the strong excess of high-frequency derived alleles and strong U-shaped site-frequency spectrum observed in O. sativa (Figure 2) . In order to explain this aspect of the data, we considered several demographic models that included ancient subdivision in the ancestor of rice, a bottleneck at the time of domestication for each domesticated varietal group, and limited gene flow between the independently domesticated rice groups indica and tropical japonica. Ancient, strong subdivision is not evident in our O. rufipogon sample (Figure 1); Fst between Chinese and non-Chinese O. rufipogon is low, about 0.16, and no interior modes are evident in the site-frequency spectrum of O. rufipogon, as expected under subdivision. However, it is possible for limited gene flow in O. rufipogon to lead to some differentiation of allele frequency between groups, but not so much that it would have a strong effect on a combined O. rufipogon sample. Furthermore, the population bottlenecks induced by independent domestication events could amplify any allele frequency differentiation between indica and tropical japonica, and limited gene flow between these two groups could introduce ancestral alleles into each population, causing mutations previously fixed in one group to be observed as high-frequency derived alleles in the other.
To test the effect of ancestral population substructure within O. rufipogon prior to the domestication of the two O. sativa groups, we fit the parameters of a complex demographic model to our data using a composite likelihood technique (see Materials and Methods). We began by exploring a model with seven demographic parameters, which consists of O. rufipogon being subdivided into two demes of equal size, sharing on average MR migrants per generation. Current-day indica varieties are descended from one of these demes, while tropical japonica varieties descend from the other. During the domestication process, each population underwent a bottleneck that began τ1 generations ago (in units 2Nrufi) and had severity νb (the ratio of the reduced population size to the ancestral size). After τ2 = 0.25 × τ1 generations (~3,000 y), both indica and tropical japonica partially recovered, instantaneously reaching a fraction υI and υJ of the ancestral size, respectively. Contemporary gene flow (since domestication) between tropical japonica, O. rufipogon, and indica is captured by the last parameter, the average number of migrants per generation between these demes (MS). This model was conceived because it incorporates key demographic features of rice or crop domestication (e.g., bottlenecks, two domestication events) and could conceptually generate the observed derived SNP site frequency spectrum.
In preliminary analyses, we found that the migration rate (MR) between the two ancestral O. rufipogon demes was very large, with the marginal likelihood surface for this parameter near its maximum value whenever MR > 7. This is consistent with our observations of limited population structure in O. rufipogon (above), and we therefore discarded ancestral population structure as a main contributor to the patterns observed in our dataset, and simplified the demographic model to consider only a single ancestral population from with both indica and tropical japonica derive (with migration rates among the three remaining demes, MS = 4Nrufim). This assumption reduced the computational complexity, so that the remaining parameters could be estimated via a grid search using an initial size of over 2,000 points with 1,000,000 coalescent simulations per point. The resulting model (which we refer to as the bottleneck plus migration model) has five free parameters with composite maximum likelihood estimates of MS = 7.0 (migration between demes), νb = 0.0055 (domestication bottleneck size), νI = 0.27 (ratio of indica to O. rufipogon Ne), νJ = 0.12 (ratio of tropical japonica to O. rufipogon Ne), and τ1 = 0.04 (start of domestication in units of 2Nrufi) (Table 2). It is important to note that coalescent simulations scale the migration based on population size, so the number of migrants entering into the tropical japonica population is smaller (0.5 × M × νJ = 0.42), than into indica (0.5 × M × νI = 0.945), and O. rufipogon (0.5 × M = 3.5).
In Figure 3, we report the profile composite-likelihood contours for the three key demographic parameters in the bottleneck plus migration model: migration rate, start of the bottleneck, and severity. The figure is constructed by holding two parameters fixed at a given point in the (x,y) plane, optimizing over the third parameter, and reporting the maximum likelihood attained for the (x,y) point (due to computational limitations the figure was constructed holding the ratio of current-day indica and tropical japonica populations at their maximum composite-likelihood estimates). We note that the three parameters are moderately to strongly correlated, but only a restricted set of values in high dimensional space is consistent with the data. These solutions all include: a very strong bottleneck (>99% reduction), high rates of migration within and between domesticated and wild populations of Asian rice (M > 5), and current-day effective population sizes for cultivated rice that are substantially smaller than those seen in the ancestral population. We also note that the model solutions show a positive correlation between size of bottleneck population and timing of the bottleneck, a negative correlation between size of the bottleneck and migration, and a negative correlation between migration and timing (consistent with the ~2-fold difference in the estimated time of the bottleneck between the model with migration and the model without).
As can be seen in Figure 4, the expected site-frequency spectrum under the best fitting bottleneck plus migration model matches the observed frequency distributions fairly well for both O. rufipogon as well as indica, but not as well for tropical japonica. As expected, the total number of SNPs in each of the three populations is predicted quite well by the model. We quantified the fit of the model to the observed data using a modified Pearson Chi-square goodness-of-fit (GOF) statistic, and found that the best-fitting complex demographic model is an excellent fit to the marginal indica (GOFI = 20.26, p = 0.72) and O. rufipogon site-frequency spectra (GOFR = 7.57; p = 0.99), and an adequate fit to the tropical japonica site-frequency spectrum (GOFT = 37.83, p = 0.22). One interesting observation is that the demographic model underpredicts the excess of high-frequency derived alleles observed in tropical japonica—a potential indication of recent positive selection. Given that artificial selection was probably quite strong and frequent during and after domestication, we further explored models that incorporate selection during the domestication process of O. sativa.
Since strong selection is known to accompany crop domestication, we developed two alternative models incorporating multiple selective sweeps to explain the unusual polymorphism patterns in indica and tropical japonica. In a neutral locus linked to a single, recent selective sweep, let fi be the probability of observing a neutral mutation segregating at frequency i in a sample of size n, conditional on the locus being variable. An expression for fi has been derived  and further extended [37,38], and includes the genomic distance d (measured in bp) between neutral and selected loci, a compound parameter α, which represents the combined contributions of recombination, selection, and population size, and the “background” allele frequency distribution (i.e., the expected site-frequency spectrum for loci unlinked to a selected site).
These results for a single sweep can be used to predict the site-frequency spectrum at randomly chosen loci if multiple sweeps have recently occurred. Assuming that selective sweeps occur at random positions in the genome at a density of κ sweeps per bp, the distance between a random neutral locus and the nearest sweep will be approximately exponentially distributed with mean 1/(2κ). Define the function i(d, α, κ) to be the probability of observing i copies of a neutral mutation in a sample of n chromosomes, given that a sweep occurred at a distance d bp away with compound parameter α , and background site-frequency spectrum q. By integrating over the distance between the sampled locus and the unknown target of the sweep, the marginal probability, Pi, of observing a randomly chosen SNP at frequency i in a sample of n chromosomes is a function of κ, α, and q :
This probability can be used to calculate the composite likelihood of the data and estimate the parameters κ and α (see Materials and Methods). It should be noted that this equation assumes that the neutral locus is affected only by the nearest selective sweep.
We considered two distinct models. The first is a model in which strong selection is the only force that has acted in domesticated rice populations, and uses the normalized O. rufipogon site-frequency spectrum as the background frequency distribution. The second, a bottleneck plus sweeps model, allows multiple selective sweeps to affect patterns of variation immediately following a population size change. The background site-frequency spectrum in the latter case can be approximated using the predictions of a simplified neutral bottleneck model. The bottleneck plus sweeps model incorporates the sweep density κ, the compound parameter α (the combined contributions of recombination, selection, and population size), and a bottleneck severity parameter ν.
The likelihood surfaces for both the pure selection and the bottleneck plus sweeps model in rice each contains a long ridge where different parameter combinations have almost equally high likelihoods, implying that a model with high sweep density and relatively weak selection is just as likely as a model with low sweep density and strong selection (Figure 5). For both models, the ridge of maximum likelihood is shifted to the right in tropical japonica, indicating that for a given value of the selection severity parameter α, the sweep density in tropical japonica is estimated to be twice that in indica.
Sweep density is confounded with selection strength due to the effect of a mating system change on recombination rate. In domesticated rice, the transition to selfing likely occurred simultaneously with the sweeps, making it difficult to disentangle the recombination rate and selfing parameters. Under a recent selective sweep in a randomly mating population, the compound parameter α ≈ rs−1 ln(2N), where r is the per-basepair recombination rate, s is the selection coefficient and N is the population size . In a partially selfing population such as domesticated rice, however, both effective recombination rate and population size are affected by selfing rate. While the rate of coalescence (and hence the effective population size) is at most doubled by the rate of selfing, the rate of recombination can be radically altered. An expression for effective recombination rate is r(1 − σ/[2 − σ]), where σ is the selfing rate . For domesticated rice, estimates of selfing rates are typically ~0.99 , resulting in a reduced recombination rate by approximately 10−3. If we assume 400 selective sweeps occurred in the rice genome since domestication (κ = 10−6), we estimate that α = 2 × 10−12 for indica. With r = 10−9 recombination events per generation per base pair and ln(2N) ≈ 10, this estimate of α corresponds to an unreasonably high estimate of a 5,000-fold fitness advantage. Substituting an effective recombination rate of 10−12 (corresponding to a reduced effective rate due to selfing), we find more reasonable values for the strength of selection for the selective sweeps, with s ≈ 5. This example illustrates how high selfing rates can amplify the signal of selection and contribute to the pattern of polymorphism in the rice genome.
Visually, it appears both the bottleneck plus sweeps model and the bottleneck plus migration model predict the site-frequency spectrum of domesticated rice better than the bottleneck model alone (Figure 4) or the pure selection model (unpublished data). To compare likelihoods and determine which model best fits the data, we used the Akaike information criterion (AIC) . Since SNPs in our dataset are linked, we used a composite likelihood function and simulations to assign p-values to the observed AIC statistic (see Materials and Methods).
For indica, the bottleneck plus sweeps model is significantly better than the neutral bottleneck model (Λ = −17.18, p < 0.05) as is the bottleneck plus migration model (Λ = −14.19, p < 0.05). For tropical japonica, we also reject the neutral bottleneck model in favor of both the bottleneck plus sweeps model (Λ = −56.88, p < 0.01) and the bottleneck plus migration model (Λ = −53.60, p < 0.01). For both rice variety groups, the AIC for the bottleneck plus sweeps model was slightly lower than for the bottleneck plus migration models (Λ = −2.26, indica; Λ = −3.28, japonica), but this difference is likely not statistically meaningful given the various assumptions made. A separate (but not independent) assessment is comparing the fit of the predictions of each model to the data. The bottleneck plus sweeps model fits the marginal site-frequency spectrum of indica quite well (GOF = 13.86; p = 0.92), and does a slightly better job explaining the site-frequency spectrum of tropical japonica than does the complex demographic model incorporating bottlenecks plus migration (GOFsweeps + bottleneck = 31.21, p = 0.33; GOFbottlenecks + migration = 37.83; p = 0.22). These results underscore the importance of jointly modeling demographic and selective effects when considering the evolution of domesticated crop species.
Population bottlenecks are believed to be the primary demographic event associated with crop species origins, and are the accepted mechanism to explain observed genome-wide polymorphism levels among these taxa. There have been concerted efforts to model the impact of population bottlenecks on domesticated species genomes [27–29,42–44]. It appears from our results, however, that a population bottleneck alone is inadequate to explain the observed nucleotide polymorphism patterns in rice, one of the oldest and the most predominant food crop species in the world.
A more complex demographic scenario involving very strong bottlenecks that led to the fixation of alternate alleles during the two rice domestication events, with concurrent gene flow between variety groups, can explain the site-frequency spectrum of indica and O. rufipogon. However, this pure demography model requires a bottleneck 4-fold stronger in indica and twice as strong in tropical japonica relative to the model that incorporates selection (Figure 5; Table 2), and a relatively high migration rate between domesticated rice and wild O. rufipogon populations. It is also important to note that the model is a poor fit to the observed frequency distribution of alleles in tropical japonica.
Domestication, however, is characterized by strong directional selection on a suite of traits that lead to the establishment of cultivated species as distinct entities from their wild progenitors within agricultural settings. We show that, in contrast to the complex demographic model, a simple bottleneck with sweeps model fits data from both tropical japonica and indica well without requiring an extremely strong domestication bottleneck. Since domesticated Asian rice has been subject to artificial selection, the selection plus demography model is a very plausible explanation for the observed strong excess of high-frequency derived alleles in domesticated rice varieties, and is consistent with recent reports about domestication genes in rice [45,46].
Positive selection on specific genes results in reductions in variation within a genome through selective sweeps [47,48]. Unlike bottlenecks, however, selection is thought to have largely localized effects on genome variation. Our results suggest that a model that incorporates selection can explain patterns of nucleotide variation in a set of genome-wide markers. We suggest two reasons why selective sweeps during domestication could cause a genome-wide effect in O. sativa and not in other cereal crop species such as maize. First, the origin of domesticated Asian rice is associated with a transition to self-fertilization, which results in a low effective recombination rate and greatly increases the genomic distance affected by selection. Second, O. sativa possesses such a small genome (<400 Mb) that it is likely that a few dozen to hundreds of selective sweeps could leave a genome-wide imprint.
Interestingly, under the bottleneck plus selective sweeps model, the dynamics of domestication appear to differ in significant ways between indica and tropical japonica. Despite the fact that these two variety groups were domesticated from the same species and both have contributed significantly to Asian agriculture, it appears that the number of selective events and/or the bottleneck severity differs between them. It is possible that the two subspecies would diverge from each other in the demographic patterns associated with domestication, given that they were established by different cultures. If this is correct, then tropical japonica appears to have undergone a more severe bottleneck associated with domestication. Alternatively, it may be that the establishment of tropical japonica, which includes landraces that expanded to upland growing areas, may be associated with stronger selection pressures on a larger number of traits.
The process of domestication is one of recent, rapid species evolution, and studies on the dynamics of this process inform our understanding of the origins and diversification of new species. Simple demographic scenarios that have been employed in the past may not fully capture the domestication process of some crop species such as Asian rice. Our models indicate that selection and population bottlenecks together, or more complex scenarios that invoke very strong bottlenecks and current gene flow, could be responsible for determining genome-wide variation in the rice genome, a finding that has not been described in other domesticated species. Domesticated crop species are particularly suitable subjects in which to study the interaction between demographic events and selection in shaping species characteristics, and exploring the relative contributions of these forces require developing predictions for patterns of DNA polymorphism using models that allow selection to vary in timing (i.e., both during and after population bottlenecks) and strength. Nevertheless, our findings do underscore the possible role that selection may play in shaping genomic variation in domesticated species, reinforcing our appreciation of the foresight showed by Charles Darwin nearly a century-and-a-half ago  when he sought to illustrate the power of selection by drawing on the lessons learned from the evolution of domesticated species.
A panel of 72 O. sativa accessions was chosen to represent the diversity found within the species. These include representatives of five major subpopulations identified in a previous study , including 21 indica, 18 tropical japonica, 21 temperate japonica, six aus, and six aromatic accessions (Table S1). Most accessions are landraces, but five accessions studied correspond to modern cultivars. Also included in the panel were 21 accessions of the wild progenitor of rice, O. rufipogon, along with one sample each of O. nivara (a close relative of O. rufipogon not believed to have contributed to the ancestry of cultivated rice) and the outgroup species O. barthii and O. meridionalis (Table S1).
DNA was extracted from single plants as described in  with minor modifications. All O. sativa and one O. rufipogon accession (International Rice Germplasm Collection [http://www.irri.org/grc/] #105491) were self-fertilized for two generations prior to initiating the study. Seeds from O. rufipogon from Nepal were collected in the field by H. J. Koh and colleagues (Seoul National University); all other seeds were obtained from germplasm repositories as summarized in Table S1.
A total of 121 approximately 400–600 bp gene regions across the rice genome were chosen at random for sequencing from a set of 6,591 ESTs . Four fragments were also selected from genes coding for well-known allozymes, including: catalase, acid phosphatase, pgi-a, and Adh. Primers were designed from the Nipponbare genomic sequence available from Gramene using Primer3 . Primers were designed in exons, and attempts were made to include both exon and intron sequence within each fragment. DNA sequencing was carried out in Genaissance's sequencing facilities (New Haven, Connecticut, United States) as described in . Amplification and sequencing were successful for 111 fragments referred to as STS (Table S2). Approximately 54 kbp per accession were sequenced, composed of, on average, 55% coding and 45% noncoding sequence.
Base-pair calls, quality score assignment, and construction of contigs were carried out using the Phred and Phrap programs (Codon Code). Sequence alignment and editing were carried out with BioLign Version 2.09.1 (Tom Hall, North Carolina State University, Raleigh, North Carolina, United States). Heterozygous sites were identified with Polyphred (Deborah Nickerson, University of Washington, Seattle, Washington, United States) and by visually inspecting chromatograms for double peaks. Heterozygous sites were rare for O. sativa. For heterozygous O. sativa and O. rufipogon sequences, heterozygous sites were labeled with ambiguity codes. For all analyses, the published sequence of Nipponbare was included.
To assess the sequencing error rate, 18 randomly chosen STS fragments were resequenced in a single direction for four Oryza accessions. Only three discordant base pairs within a single individual in a single fragment sequence were observed. This corresponds to three errors in 33,193 resequenced bp, or a sequencing error rate of less than 0.01%.
Population structure among O. sativa and O. rufipogon accessions was evaluated with STRUCTURE 2.1  using an admixture model with no linkage. To limit the effect of correlation between SNPs due to linkage, one SNP per fragment (the SNP with the highest minor allele frequency across the entire accession set) was used in the analysis. O. sativa is primarily selfing, and most accessions exist as homozygotes; thus, SNP data were considered haploid for this species. O. rufipogon is partially outcrossing, a condition that cannot be adequately represented by considering each locus as diploid; thus, SNP data for O. rufipogon were also considered haploid. Because alternate alleles could occur at a given site in heterozygous O. rufipogon accessions, ten datasets were created with randomly chosen alternative base pairs in heterozygous individuals. Analyses were carried out for all ten datasets. All analyses had a burn-in length of 50,000 iterations and a run length of 100,000 iterations. Three replicates at each value of K (population number) were carried out. Simulations were run with uncorrelated allele frequencies. Results were entirely consistent among replicate runs within datasets and among datasets; the results from one run are presented in Figure 1 and Table S1.
To assess relationships among Oryza accessions, all STS fragment alignments were concatenated to form a single dataset. Relationships were estimated with a neighbor-joining analysis as implemented in PAUP* version 4.0 b3 . Distances were calculated using the Kimura two-parameter model. Branch bootstrap estimates were obtained from 1,000 replicates.
Perl scripts were written to assess levels of nucleotide variation (θW) and nucleotide diversity (θπ) and Tajima's D across rice groups for all STS fragments, and to calculate the frequency distributions of derived SNPs across the genome. For O. sativa accessions, where heterozygotes were rare, all measures were calculated considering each accession as contributing a single haplotype; for O. rufipogon population measures, each accession was considered to contribute two haplotypes, except for one accession (International Rice Germplasm Collection [http://www.irri.org/grc/] #105491) from Malaysia, which had been selfed for several generations prior to this study.
Under a neutral bottleneck model, the history of rice domestication is represented by recent population divergence, with one of the two daughter populations experiencing a size bottleneck at divergence associated with the founder effect at the time of domestication. We use the sample frequencies of variable noncoding and synonymous nucleotides in the STS alignments (i.e., the site-frequency spectrum of putatively neutral SNPs) to infer the parameters of the bottleneck model. Our analytical approach makes use of standard Wright-Fisher population genetic theory within a Poisson random field setting [54–57]. The assumptions of this model include independence among SNPs, no selection, an underlying Poisson process governing mutations, and a piecewise constant population of large size amenable to modeling using diffusion approximations.
The model we employ is an extension of Williamson et al. , where we present the relevant population and statistical inference theory for modeling a population experiencing a recent size change. The key addition to our previous model is a second size change event, corresponding to the post-bottleneck growth phase. This amounts to modeling the components of the site-frequency spectrum (X1, X2, . . ., Xn) as independent Poisson random variables with mean:
where θ is the genome-wide mutation rate, x represents the (unknown) population frequencies of mutations, and f(x;Θ) is the distribution of mutation frequencies given demographic history parameters Θ = (ν,τ1,τ2}. These parameters are: the time back until the start of domestication (τ1), duration of the bottleneck (τ2), ratio of current population to ancestral population size (ν2), and relative size of the bottleneck population to the ancestral population (νb). The duration of the bottleneck was assumed to be 25% of the time back until domestication (τ2 = 0.25 × τ1), which is consistent with archaeological data suggesting domestication took 3,000 y and began 12,000 y ago. The mutation rate, θ, was estimated from the number of synonymous and noncoding segregating SNPs assuming O. rufipogon represented a population of constant size. This assumption is quite reasonable given the excellent concordance between the O. rufipogon and the predictions of the standard neutral model (Figure 4), and is equivalent to using Watterson's (1975) estimator of θ. In order to account for missing data, we fitted the population bottleneck model using the projected site-frequency spectrum for a sample of n = 16 chromosomes.
We considered alternative demographic scenarios, in which ancestral population subdivision, followed by gene flow between rufipogon, indica, and tropical japonica, led to an excess of high-frequency derived alleles in domesticated rice groups, as well as a simpler model that has no ancestral substructure. For these models, the composite likelihood function was based on the marginal site-frequency spectrum of each of the three groups analyzed. For ease of notation, let Sind, Sjap, and Sruf be the number of SNPs for which we could distinguish ancestral from derived alleles using the outgroup (223, 172, and 636, respectively). Let y denote the set of derived allele counts for each SNP, with y•ind, y•jap, and y•ruf referring to set of SNPs for indica, tropical japonica, and O. rufipogon (with lengths Sind, Sjap, and Sruf , respectively). To account for missing data, let n refer to the number of chromosomes sequenced at each SNP, with n•ind, n•jap, and n•ruf the vector for each group (again with lengths Sind, Sjap, and Sruf, respectively). For a given demographic model discussed above (the parameters of which we collectively denote Θ), the composite likelihood function is written as
where Pr(S•|Θ) is assumed to follow a Poisson probability of observing S• SNPs in a given population under the demographic model Θ assuming the population scaled mutation rate θ = 148.6 (estimated using the observed number of SNPs in O. rufipogon), and is the probability of observing a SNP configuration in a given population under the demographic model. It is important to note that the inference scheme assumes the allele frequency distributions, conditional on the observed number of segregating sites and demographic parameters, are independent among populations. This composite-likelihood function (like all composite-likelihood functions) must, therefore, be taken as an approximation of the true likelihood function since it ignores dependencies among SNPs due to linkage and among populations due to shared variation. To account for missing data at an arbitrary SNP k in population x, we set
where Pz(Θ,Nx) is the expected proportion of SNPs at a frequency z in a sample of Nx chromosomes under the demographic model Θ, and the fraction within the summation represents the hypergeometric probability of sampling derived alleles in a subsample of chromosomes if the unknown frequency of the SNP were j out of Nx (summed over all possible underlying SNP frequencies, j). Details on calculating the expected number of SNPs in each population as well as Pz(Θ,Nx) are described below.
For a given set of parameters, Θ, we determine the expected site-frequency spectra for all three populations (O. rufipogon, indica, and tropical japonica) using 100,000 iterations of the coalescent simulation program ms  conditional on the observed genome-wide estimate of θ for O. rufipogon. To generate data under this model, we used the following code:
ms 80 200000 –t 148.6487 –r 148.6487 111 –I 3 21 18 41 M –en 0.5*0.75*τ1 1 νB –en 0.5*0.75*τ12νB –ej 0.5*τ1 1 3 –ej τ1 2 3 −em 0.5*τ1 3 1 0 –em 0.5*τ1 3 2 0 −n 1 νI –n 2 νJ.
Note that the factor 0.75 enters from the assumption that the bottleneck lasted 3,000 y of the 12,000 y time since domestication began, and 0.5 enters since ms scales time in units of 4N generations.
To optimize the three- and five-dimensional likelihood surface, we used an iterative technique, whereby a very coarse grid is initially chosen for each parameter, followed by successively tighter intervals containing the previous iteration's maximum likelihood estimates. Because we were pooling data across 111 STS loci, we generated our expected site-frequency spectrum accordingly. Although recombination within or between STS loci will not affect the expected number of segregating sites or the expected site frequency spectrum under a neutral demographic model, it does impact the rate at which simulations will approach them. We therefore assumed 111 mostly independent loci of equal size when generating our expectations.
In order to compare the fit of the demographic model to the observed data accounting for missing genotypes and partial selfing, we considered a projection of the observed and predicted site-frequency spectra into a sample of size n = 16 chromosomes from each of the three populations using the hypergeometric distribution. The “observed” data can be thought of as the predicted SFS in a subsample of n =16 based on the actual SNP data assuming each of the O. sativa accessions contributes one chromosome to the observed allele frequency spectrum, and each of the O. rufipogon accessions contributes two, with the exception of one accession that was known to have been purified. The “expected” data are the predicted marginal site-frequency spectrum at the maximum composite-likelihood estimates of the parameters from the complex demographic model that includes bottlenecks in the two domesticated populations, migration within domesticated populations, and migration between domesticated and ancestral populations. There were 45 observed data points (15 segregating site-frequency spectrum components multiplied by three populations), and the GOF statistic for a given population was tabulated as . In order to assign a p-value, we simulated 10,000 datasets each containing 111 independent loci with no recombination within loci under the best-fitting demographic model. For each dataset, we then calculated the GOF test statistic using the expected site-frequency spectrum from Figure 4 scaled to the observed number of segregating sites within each of the subpopulations. Ideally, one would re-estimate the demographic parameters in order to fully mimic the inference procedure we used. Unfortunately, estimation of the demographic parameters was extremely computationally intensive for each dataset; the single observed STS data point analyzed here, for example, took over a week of computer time on a dedicated 100-node computing cluster.
Conditioning on the observed number of segregating sites in the dataset, the site-frequency spectrum is multinomially distributed with frequency probabilities according to Equation 1. For the pure selection model, the composite likelihood is:
where qr is the normalized site-frequency spectrum of O. rufipogon. For the bottleneck plus multiple sweeps model, the composite likelihood is:
where qν is the predicted spectrum from a neutral bottleneck model with severity ν. Equations 5 and 6 can be maximized to quantify the number and strength of selective sweeps in domestic rice, and the optimization of Equation 5 provides an estimate of the severity of the population bottleneck that preceded the selective sweep.
The bottleneck plus sweeps model assumes that a short bottleneck (representing to the founding of domestic populations) precedes the selective sweeps. To calculate the background site-frequency spectrum at the end of the bottleneck and the beginning of the selective sweeps, we again used numerical methods to solve the one-population diffusion equation with population size changes:
In this case, the recovery time, τ1, was set to 0, corresponding to the assumption that new mutations since the bottleneck do not make a strong contribution to the observed SFS. Because the bottleneck duration, τ2, and the severity, ν, are confounded parameters, we set τ2 = 0.01 and allow ν to vary. With f(q,τ2) as the numerical solution to Equation 7 evaluated at time τ2, we calculate the background site-frequency spectrum qν as:
To properly interpret differences in AIC between models, we simulated 10,000 datasets of 111 nonrecombining loci under the null hypothesis of the best-fitting neutral bottleneck model using the ms coalescent simulation program . Because we did not allow recombination within loci, these simulations conservatively account for the effects of linkage. For each simulated dataset, we found the maximum composite likelihoods under each model (bottleneck, bottleneck plus migration, multiple sweeps, and bottleneck plus sweeps) and calculated the AIC value. The AIC statistic of model i is defined as: AICi = −2(lmaxi − ki) where lmaxi is the maximum likelihood under model i and ki is the number of free parameters in model i. We used Λ = AIC1 − AIC2 as a test statistic for comparing the bottleneck and alternative models using a one-tailed test: the p-value was estimated as the proportion of simulations under the null distribution with Λ > Λobs.
Numbers by branches are bootstraps of 1,000 replicates. Only branches with a bootstrap value higher than 60% for major clades (five or more accessions) are labeled. The monophyly of each rice variety group is well supported, with the exception of tropical japonica. The accession M202-new is an elite temperate japonica line that has been subjected to possible crosses with other groups, perhaps explaining its inclusion within the tropical japonica.
(409 KB EPS)
(373 KB EPS)
Notably, most alleles are at low to intermediate frequency in O. rufipogon, consistent with multiple selective sweeps in O. sativa, and discounting the possibility of misidentification of ancestral alleles or interspecific introgression being responsible for the pattern observed in rice. HFD, high-frequency derived SNPs.
(390 KB EPS)
In each group, high-frequency derived SNPs occur in ten of 12 rice chromosomes. Fragments containing high-frequency derived SNPs comprise a large portion of fragments containing any variation at all in each O. sativa group. In both rice groups, the sample of STS fragments used to construct the site-frequency spectrum is slightly lower than 111 due to missing data in O. meridionalis.
(433 KB EPS)
(45 KB XLS)
(133 KB XLS)
The National Center for Biotechnology Information GenBank (http://www.ncbi.nlm.nih.gov/Genbank) ID numbers for the sequences and alignments discussed in this article are EF000002–EF010509.
We are grateful to two anonymous reviewers for suggestions that much improved the manuscript.
¤a Current address: Department of Biology, University of Massachusetts, Amherst, Massachusetts, United States of America
¤b Current address: Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
¤c Current address: Department of Biology, Washington University, St. Louis, Missouri, United States of America
¤d Current address: Centre for Bioinformatics, University of Copenhagen, Denmark
A previous version of this article appeared as an Early Online Release on August 6, 2007 (doi:10.1371/journal.pgen.0030163.eor).
Author contributions. RN, SRM, CDB, and MDP conceived the experiments. ALC, KMO, and MDP designed the experiments. ALC collected the data. ALC, SHW, RDH, and CDB analyzed the data. AB, AFA, NRP, TLY, and SRM and contributed materials/analysis tools. ALC, SHW, CDB, and MDP wrote the paper.
Funding. This work was funded by the US National Science Foundation Plant Genome Research Program..
Competing interests. The authors have declared that no competing interests exist.