Expectations of diversity in an evolving bacterial population
The E. coli populations in the long-term evolution experiment were each founded from a genetically homogeneous clone. No plasmids, viruses, or other mechanisms for horizontal gene exchange are present in the populations, and they evolve in a strictly clonal (asexual) manner. How quickly do we expect measurable diversity to accumulate in one of these populations, and what evolutionary forces will impact the patterns of variation within a population over time?
shows a numerical simulation of the spread of beneficial mutations in a population with parameters similar to the E. coli
long-term evolution experiment (Woods 2005
). Many beneficial mutations will be lost to drift when they are rare (Lenski et al. 1991
), with only those that achieve substantial frequency visible in the evolving frequency distribution. Among those that survive drift, some will eventually fix, but others will be eliminated by clonal interference (Gerrish and Lenski 1998
). In some cases, the winning lineage may be decided not by the effects of single beneficial mutations, but rather by the combined effects of multiple mutations that accumulate before any one lineage is fixed (Fogle et al. 2008
). Note that diversity does not increase monotonically, but rather it waxes and wanes as new beneficial mutations take hold and their fates are resolved by selection.
Expected dynamics in an evolving bacterial population
As beneficial mutations spread, they will perturb the frequencies of neutral and deleterious mutations owing to linkage disequilibrium in an asexual population. These perturbations are the classic signature of periodic selection (Atwood et al. 1951
). However, we do not expect to detect many neutral or deleterious mutations in the evolution experiment. Under a pure drift process, the number of generations required for a neutral mutation to drift to fixation is on the order of the population size (Kimura 1983
), which is many millions even after accounting for the bottlenecks during serial transfers (Lenski et al. 1991
). Selective sweeps, however, reduce the effective size so that a rare neutral mutation may hitchhike to fixation much faster than it can spread by pure drift, although the vast majority of neutral mutations will be purged by these sweeps. In this case, the expected number of neutral mutations that fix equals the product of the genomic mutation rate, the proportion of neutral sites, and the number of generations (Kimura 1983
). For E. coli
, the genomic mutation rate is on the order of 10−4
per generation (Lenski et al. 2003
), and perhaps 50% of sites are neutral, so we expect it to take on the order of 2,000–20,000 generations for even one neutral mutation to fix in the population. It is unlikely, therefore, that many neutral mutations would reach high frequency in the first few thousand generations of evolution. Deleterious mutations will fare even worse. Mutations causing extreme fitness defects will be rapidly be lost from the population. A pool of many slightly deleterious mutations may accumulate and persist at mutation-selection balance, but these mutations are less likely to fix or reach high frequency than neutral ones.
If the genomic mutation rate were much higher, however, then neutral and weakly deleterious alleles could spread more easily and more would potentially reach high frequency. Several populations in the long-term experiment evolved mutator phenotypes, leading to mutation rates roughly two orders of magnitude higher than the ancestral rate (Sniegowski et al. 1997
). In fact, a mutT
mutator phenotype evolved in the population studied here, making its first appearance by generation 26,500 and becoming numerically dominant by generation 29,000 (Barrick et al. 2009
Other polymorphisms may evolve and be maintained by negative-frequency-dependent interactions, in which some genotype has a selective advantage when rare but is disadvantaged at high frequency. Acetate and short chain fatty acids are byproducts of glucose fermentation by E. coli
. These compounds are normally excreted during growth on glucose, then reabsorbed and used after the glucose is depleted. Mutants that are better competitors for acetate have been observed to evolve and persist via frequency-dependent selection in some chemostat experiments with E. coli
(Rosenzweig et al. 1994
). The low concentration of glucose and serial-transfer regime used in the long-term experiment lead to low cell densities and correspondingly low levels of excreted metabolites, so that cross-feeding genotypes should be rarer and harder to detect. Indeed, a sustained cross-feeding interaction is only known to have evolved in one of the long-term populations (Rozen and Lenski 2000
; Rozen et al. 2005
; Rozen et al. 2009
), and not the one that is the focus of our study, although there appear to be weaker frequency-dependent interactions in some other populations (Elena and Lenski 1997
Mixed-population sequence datasets
We examined the genetic diversity over time in one experimental line from the long-term E. coli
evolution experiment (designated Ara-1) by sequencing whole-population samples from 2,000 (M2K), 5,000 (M5K), 10,000 (M10K), 15,000 (M15K), 20,000 (M20K), 30,000 (M30K), and 40,000 (M40K) generation time points. Clones that were the subject of a previous study (Barrick et al. 2009
) serve as controls for the mixed-population analysis. These clones were isolated from the same population at 2,000 (C2K), 5,000 (C5K), 10,000 (C10K), 15,000 (C15K), 20,0000 (C20K), and 40,000 (C40K) generations. We also include the founder of the Ara+1 experimental population that differs by two point mutations from the ancestor of the Ara-1 line (C0K).
Clones and mixed-population samples were sequenced, one per lane, in two separate runs of the Genome Analyzer system. Alignment of the resulting 36-base reads to the ancestral sequence yielded 40- to 60-fold average coverage outside of repeat regions for each genome (). Positions with zero coverage are not counted in these estimates, as they almost always proved to represent true deletions relative to the ancestral sequence in clone genomes (Barrick et al. 2009
). The ancestral clone (C0K) has at least 10-fold read coverage at 99.9% of the positions in the reference genome. The decrease in the number of sites with coverage at later generations in both the clone and mixed-population samples is consistent with sizable deletions becoming fixed in the population.
Dataset statistics and SNP prediction summary
If the sampling of reads from different locations in the genome were perfectly random, the number of sites with a given coverage depth would fit a Poisson distribution with equal mean and variance. We find that the index of dispersion (the variance divided by the mean) for the coverage distribution is much greater than unity in these genomes, ranging from 3.1 to 5.5 (), with the mixed samples showing slightly more dispersion. A maximum likelihood fit to a negative binomial distribution, which is commonly used to model over-dispersed count data, reproduces most of the observed coverage structure (). Higher coverage within GC-rich regions has been reported for Genome Analyzer sequence data, possibly due to more efficient processing of these fragments during library preparation on account of their greater duplex stability (Dohm et al. 2008
). This bias may contribute to the over-dispersion we observe and could systematically affect the recovery of polymorphisms in specific chromosomal regions.
Example coverage distribution and base error rates
There are hundreds of thousands to millions of base mismatches in the reads with unique best alignments to the ancestral genome in each dataset (). When constructing a model for the base error rate, we verified that bases assigned high quality scores by the re-sequencing analysis software usually have fewer mismatches to the ancestral sequence (). A majority of the bases in each run were assigned high quality scores. However, there were fewer overall errors, and bases with higher quality scores had fewer errors, in the mixed-population dataset. For example, 50% of the base calls have quality scores corresponding to error rates of roughly 0.02% or lower per base, and 75% have error rates below 0.04%, in the 2K mixed-population sample. By comparison, 68% of bases in the 2K clone data have quality scores with error rates below 0.04%, but only about 1% have error rates below 0.02%.
Distinguishing SNPs from sequencing errors
Our aim is to determine what diversity in a set of a whole-population genome sequences is due to biological variation, as opposed to confounding mechanical errors and biases introduced during DNA preparation and sequencing. We restrict our analysis here to single nucleotide polymorphisms (SNPs), representing new mutations that have risen to a measurable frequency, but not fixed, in a bacterial population at the time of sampling. While there is information about deletions, insertions, and rearrangements in genome re-sequencing data, it is more difficult to interpret in terms of population frequencies, and so we have not yet attempted to analyze these other polymorphisms. Roughly 2/3 of the changes found in a more detailed analysis of the 20K clone from this population were point mutations (Barrick et al. 2009
After aligning the reads in each dataset to the reference genome, we employed a likelihood ratio test to determine whether there was evidence of a SNP at each site. This test compares the likelihood of observing the collection of bases at a site under the null hypothesis of no genetic variation (i.e., all mismatches due to sequencing errors) to the maximum likelihood possible under the alternative hypothesis that there is a mixture of two alleles in the population. A much greater probability of the data given the alternative hypothesis indicates that the population from which DNA fragments were sampled consisted of subpopulations with different bases at this position. We report an E-value for each SNP prediction that is an estimate of its genome-wide significance, i.e., the likelihood ratio test p-value at a given site corrected for multiple testing. An E-value thus also represents the approximate number of false-positive predictions expected in a genome at a given significance level by chance.
Owing to the stochastic nature of both sequencing errors and sampling DNA fragments from different individuals, a true polymorphism that has a 50% frequency in the population is far more likely to achieve a significant E-value than one at 5%. We used simulated data with the same coverage and quality score distributions as the 2K mixed-population sample to estimate the chances of discovering polymorphisms at various frequencies in the population by our procedure (). At an E-value threshold of 1, we expect to recover nearly all of the polymorphisms with frequencies of 20–80%, roughly 50% of the polymorphisms present in 5% of the individuals, and only 1.6% of the polymorphisms at a frequency of 1%. Lowering the E-value cutoff to 0.01 reduces the sensitivity by factors of 1.6 and 5.4 for finding polymorphisms at frequencies of 5% and 1%, respectively.
Sensitivity of SNP prediction procedure
In light of ever-improving technologies, we also investigated how better coverage and error rates would affect the discovery of SNPs at very low frequencies in a population (). We performed further simulations to address this issue, with a simplified model that assumes uniform coverage and the same rate for all base errors (i.e., no differences in base quality). At an E-value cutoff of 1, the threshold frequency for a 50% chance of SNP discovery drops from 8.9% to 0.63% as coverage increases from 30- to 1000-fold. Reducing the error rate by an order of magnitude to 0.01%, does not affect the recovery of SNPs at 30-fold coverage and only slightly improves the frequency for 50% detection probability to 0.36% at 1000-fold coverage. This sensitivity analysis therefore predicts that increasing coverage would be more effective for improving rare SNP detection than reducing the base error rate by a similar factor.
We chose to examine SNP predictions below a relatively permissive E-value cutoff of 1 in hopes of identifying real polymorphisms that were at low frequencies in the mixed-population samples. We first discovered that there were many more SNP predictions at this significance level than the average of 1 expected in each of the clone datasets, with the values ranging from 22 to 53 per clone (). Many of these predictions appear highly significant: 61 have E-values ≤ 0.01. During the outgrowth of a single cell it is highly unlikely that even a single polymorphism will reach a frequency of >1%, as a mutation would have had to occur within the first seven generations (i.e., 27 = 128 cell divisions). Furthermore, if mutations that arose while culturing these samples after picking a clone were responsible for these SNP predictions, we would expect many more in the 40K clone because it is a mutator with a ~100-fold elevated mutation rate, yet we see about the same number in this clone as in any other.
Instead, the unexpectedly high rate of false-positive predictions in the clones appears to result from sequencing or alignment errors that are outside the scope of our statistical model. Certain genomic sites appear to be especially prone to these errors, as many of the exact same SNPs are predicted in multiple samples and in sequence datasets from both the clone and mixed-population runs. Fortunately, many of these spurious predictions can be recognized by two kinds of biases. Base calls supporting the putative mutated base often have consistently lower qualities than those supporting the reference base for these polymorphisms, and reads supporting these SNPs are often derived largely or even exclusively from one strand of the genomic sequence. We developed a bias filtering step to reject putative SNPs with these error signatures. It reduces the number of predictions in clones to at most 8 per genome () and removes all but five clone predictions with E-values ≤ 0.01. This filter does not, however, eliminate any SNP predictions in the mixed-population samples thought to be real (see below).
There are many more highly significant predictions in the mixed-population samples than in the clones after bias filtering (). Every population dataset has at least one predicted SNP with an E-value < 10−5, whereas the best prediction in any clone has an E-value almost two orders of magnitude higher. Even at 20K, where the mixed sample has fewer predicted SNPs than the paired clone, one of the mixed-population SNPs is very highly significant. Our detailed knowledge of the long-term experiment allows us to further evaluate these SNP predictions. We believe that 49 of the 57 predicted SNPs in the 2K to 20K population samples () are probably both accurate and biologically important for the reasons presented below.
SNPs of particular biological interest
(1) Elevated dN/dS ratio
We expect most alleles that reach a high enough frequency in the population to be detected as SNPs during the first 20,000 generations will be beneficial mutations. Synonymous substitutions are likely to be neutral, and so an elevated ratio of non-synonymous to synonymous mutations, dN/dS, provides evidence of positive selection. In the ancestral genome there is a 20.4% chance that a random base substitution in a protein-coding region is synonymous. There are 37 non-synonymous and 3 synonymous mutations in the predicted SNPs from the pooled set of 2K to 20K mixed-population samples, which is a significantly higher dN/dS ratio than expected by chance (one-tailed binomial test, p = 0.03). In contrast, taking all putative SNPs in the 7 clone datasets together, there is no evidence that their dN/dS ratio is elevated (p = 0.79).
(2) Mutator phenotype
We expect an increase the amount of genetic variation in this population after a mutator phenotype evolved. Indeed, there is a dramatic increase in the number of predicted SNPs, from an average of 11.4 in the 2K to 20K population samples to 364 and 1062, in the 30K and 40K samples, respectively. The mutT defect that evolved specifically elevates the rate of A·T→C·G transversions, and so we also expect the SNPs in the 30K and 40K samples to exhibit almost exclusively this sequence signature. In the 2K to 20K mixed-population samples 26.3% of the predicted SNPs are A→C or T→G changes. As expected, there is an extremely significant shift in this fraction to 98.1% (one-tailed Fisher's exact test, p = 1.0 × 10−37) and 91.5% (p = 1.3 × 10−29) in the 30K and 40K population samples, respectively.
(3) Selective sweeps reaching fixation
If our procedure finds true polymorphisms, then we would expect some predicted SNPs to be mutations that were rising in frequency during a selective sweep that would ultimately reach fixation. In fact, nearly half (26/57) of the predicted SNPs in the 2K to 20K population samples are mutations that were later found at 100% frequency in the population. By contrast, none of the suspect SNP predictions from clonal samples correspond to mutations that were fixed in the population or observed in other clones.
(4) Unsuccessful mutations in genes where other alleles fixed
If our procedure for SNP discovery is accurate, then we expect to find evidence for selective sweeps that failed due to clonal interference. Consistent with that expectation, 5 predicted SNPs in the 2K to 15K population samples are in genes where a different mutation fixed by 20,000 generations (). These transient polymorphisms probably represent alternative beneficial alleles at genes under strong selection in the long-term experiment. Given that E. coli has ~4,000 genes and that 26 predicted SNPs from this period did not reach fixation, there is only a small chance (one-tailed Binomial test, p = 8 × 10−7) of picking 5 or more SNPs at random in the 27 genes with mutations that later fixed. Of the putative SNPs in clones, only 1 in 42 impinges on this same set of 27 genes, which is not unlikely by chance (p = 0.25).
Among these unsuccessful mutations, the 10K mixed sample included two different SNPs at adjacent bases upstream of the ompF gene. The ompF allele that eventually fixed is also in the promoter region and appears as a SNP at 15K. Surprisingly, it has a highly deleterious effect when moved alone into the ancestral chromosome (D. Schneider and R.E.L, unpublished data). The finding of two similar contending mutations provides compelling evidence that these ompF mutations are actually beneficial in a genetic background that had become common by 10,000 generations. There are also transient SNPs affecting the pykF and iclR genes, each of which eventually fixed a different allele.
(5) Other unsuccessful beneficial mutations
We would also expect some unsuccessful lineages to have beneficial mutations in other genes. Two predicted SNPs in the 10K population (hsdM
) are clearly real because they were also present in the 10K clone genome, although this clone was off the main line of descent. Though we have no direct evidence that other transient SNPs are biologically significant, it seems plausible that at least 15 of them are beneficial in this environment. Eleven transient SNPs in the 2K to 20K mixed-population samples occur in genes involved in processes thought to be key targets of selection () including cell wall synthesis, respiration, ribosomal function, and gene regulation (Philippe et al. 2007
; Philippe et al. 2009
). For example, mrdA
are two genes in the same operon involved in cell wall synthesis; a transient SNP in mrdB
occurs in the 2K population sample, while a mutation in mrdA
has fixed in every population sample from 5K onward.
(6) Cross-feeding adaptations
The 4 remaining transient SNPs in the 2K to 20K mixed-population samples, which occur in genes related to acetate and short chain fatty acids (SCFA) metabolism (), may be cross-feeding adaptations. Three of these mutations were rare and all of them were ultimately lost from this population. One is in the promoter region of the acs gene, which encodes an enzyme for acetate utilization. A second is a non-synonymous change in yaaH, which is predicted to have a role in acetate transport. The third and fourth cause amino-acid substitutions in atoS and atoC, which together regulate an operon involved in SCFA degradation. Two early transient SNPs in the iclR gene, which encodes a repressor for glyoxylate bypass enzymes that are induced when E. coli grows on acetate or SCFAs, may also promote cross-feeding interactions, even though a different mutation in iclR was eventually adopted by the dominant lineage and fixed in the population by 20,000 generations.
It is possible that cross-feeding genotypes off the main line of descent may have persisted in this population at low levels below our detection limit for SNPs, with occasional increases in frequency, perhaps in association with other beneficial mutations. The presence of such cross-feeding genotypes could explain the weaker frequency-dependent interactions observed in populations other than the one with the stable polymorphism (Elena and Lenski 1997
). None of the potential cross-feeding alleles detected as SNPs remain at detectable frequencies in successive samples, so we suspect that they were evolutionary dead ends in this population. However, new cross-feeding genotypes could periodically re-evolve from the main lineage to exploit that niche and, in turn, later go extinct (Rozen et al. 2005
Changes in genetic diversity over time
summarizes the genetic diversity observed in mixed-population samples over time and the tempo with which mutations were fixed in the population. The top panel shows the origin and eventual fate of all the point mutations discovered in the 2K to 40K population samples, while the bottom panel provides a visual summary of the main patterns in these data.
Mutational diversity in an evolving E. coli population
There was a great deal of allelic diversity in the 10K and 15K samples that was lost by 20K, with the majority of SNPs at 15K becoming fixed in the 20K sample. This pattern seems to indicate a deep branching between two main competing lineages, one of which prevailed and the other of which went extinct. Additional support for this scenario comes from the clone sequences. In particular, the 10K clone carried six mutations that were off the line of descent, including two SNPs that reached intermediate frequencies of ~33% in the 10K sample (hsdM and maeB/talA). Meanwhile, five other mutations on the line of descent were present in 40–60% of the 10K population sample. This diversity was slow to disappear, as these five mutations that would eventually fix were still only at ~90% frequency in the 15K sample. After that lineage finally prevailed, however, there was very little diversity in the 20K population sample.
The mutational dynamics in this population changed dramatically after the mutator phenotype evolved. SNP diversity and the rate at which point mutations fixed both increased dramatically by 30,000 generations. Adaptation had already slowed substantially by this point in the long-term evolution experiment (Lenski and Travisano 1994
; Cooper and Lenski 2000
). The rate of fitness improvement might have reaccelerated slightly with the emergence of the mutator – that remains to be determined – but previous experiments indicate that any such acceleration is likely to be rather small given the fairly large population size and correspondingly short waiting-time for new beneficial mutations (Sniegowski et al. 1997
; de Visser et al. 1999
). Although we have population samples at only two time points after the mutator phenotype evolved, the data indicate a dramatic and on-going increase in the standing genetic diversity, and it will be interesting to see whether it continues to increase.