Comprehensive analysis of diversity in complex systems has been limited by the inability to deeply sequence any of the constituent populations. The low species richness of the AMD microbial communities has allowed for nearly complete genome reconstruction for the dominant populations. In particular,
Leptospirillum group II was sampled an average of 20 times at each genomic locus. Whereas the vast majority of sequencing reads originating from this population have been placed into one genomic path through manual curation of the assembly, we uncovered evidence for considerable population-level diversity in the form of strain variation, in part mediated by integrated elements of putative phage, plasmid, or transposon origin. Although these phenomena have been previously reported for complex environments [
3,
11,
16], here we quantified the distribution of the variability in gene sequence and content across large regions of the genome. The high resolution of our data allowed us to test hypotheses regarding the processes that shaped this natural population.
Recombination and Mutation
There are two possible sources of new intrapopulation sequence variation: mutation and/or immigration of new variant types, followed by recombination with members of the preexisting population. The variation may have adaptive value, may be deleterious, or may be neutral. These alternative hypotheses predict distinct patterns of variation within the Leptospirillum group II population.
Recombination has been widely documented in natural microbial populations [
18,
21,
28,
41–
44]. We observed the signatures of recombination events among two or more closely related 5-way CG
Leptospirillum group II strains, as well as recombination between the 5-way CG population and more distantly related
Leptospirillum group II genome types (). Distinct strains defined by shared SNP patterns and frequencies were apparent even when the recombination breakpoints were not directly observable (). We were not able to manually determine recombination block length and frequency (as we did in a previous study with metagenomic data, [
28]) because extremely low sequence divergence between strains prevented confident identification of transition points.
The similarity between the read depths of the observed strains (
Figure S1) raises the question of how many distinct haplotypes are present in the population. It is tempting to conclude that because strains in different regions of the genome are present at similar frequencies, they must therefore belong to a single haplotype present at that frequency in the population. Short read lengths and low overall SNP density prevent us from directly linking distinct strains within the assembly, however. It will be necessary to examine whether the frequencies of these strains remain correlated in multiple samples across space and time to determine the number of distinct haplotypes present in the population.
Notably, the value of the per-site population mutation parameter θ in the 5-way
Leptospirillum group II population is smaller than most observations of θ obtained from multi locus sequence typing (MLST) data from bacterial pathogens [
45,
46]. Per-site MLST estimates of θ averaged over multiple loci ranged from a low of 5 × 10
−3 for
Neisseria gonorrhoeae to 0.1 for
Haemophilus influenzae, probably reflecting the more diverse populations from which alleles for MLST were drawn and the limited sampling of individual loci compared to our genome-wide estimate. The value of θ that we observe using an identical formula (2 × 10
−4) is indicative of the overall low levels of variation in the
Leptospirillum group II population, perhaps suggesting a relatively recent population expansion or purifying selection. We also calculated an estimate of θ using a model for metagenomic data that explicitly accounts for sequencing error in estimation, and tested whether the data fit a model of constant population size or exponential growth [
37]; the fit of these models was not significantly different. Contig-based estimates of θ were consistent for a model of constant population size (0.05 ± 0.02). The only other available estimate of θ for microbial metagenomic data is for a population from an activated sludge bioreactor [
37], where θ* was estimated to be 0.015 under a model of population growth. The activated sludge population was nearly clonal and had undergone recent population expansion prior to sampling. The slightly higher value of θ we observe in
Leptospirillum group II using this method is indicative of a larger amount of standing variation than was present in the sludge population.
Selection on Closely Related Strains
Given the clear evidence that multiple closely related strains coexist in the
Leptospirillum group II population, the key question is whether this variation results in differential fitness between strains. If strains are maintained because each is maximally fit in a different ecological niche, we expect to see evidence of positive or balancing selection in variant regions. This expectation underlies the ecotype or periodic selection concept, which posits that distinct sequence clusters in microbial populations correspond to ecologically separate units [
47]. The “Stable Ecotype Model” [
5] suggests that diversity within separate clusters is periodically purged by selection, and distinct clusters persist due to the low frequency of intercluster recombination. Permanent divergence only arises if a mutation allows an individual of one ecotype to colonize a new niche [
5]. This model suggests that selection maintains coexisting clusters, and that within-cluster diversity is low. Under this scenario, we expect to see an excess of fixed nonsynonymous differences between strains, as these amino acid changes would indicate niche-specific adaptations. Alternatively, if sequence variation is not maintained by positive selection, we expect to find evidence for neutral or negative selection in these regions. We tested these alternative hypotheses for the coexistence of closely related strains using the MK test for polymorphism and divergence [
36]. This test posits that under a model of neutral evolution, the ratio of nonsynonymous (replacement) to synonymous substitutions within a population is the same as the ratio of replacement to synonymous fixed differences between populations. An excess of replacement fixed differences indicates positive selection, whereas a dearth indicates negative selection.
Our results from MK tests on regions of strain variation indicate strongly that positive selection on amino acid substitutions does not separate coexisting strains. We observe instead a significant deficiency of fixed replacements relative to polymorphic replacements, indicating that negative (purifying) selection is acting to remove deleterious nonsynonymous mutations differentiating closely related strains. This result is robust to the effect of sequencing error in SNP estimation and the use of low-frequency (nonreplicated) polymorphisms (
Tables S2 and
S4, and
Text S1). These additional analyses clearly demonstrate that an excess of low-frequency polymorphisms due to sequencing error is not masking evidence of genome-wide positive selection. Recent theoretical results indicate that an abundance of slightly deleterious polymorphisms can substantially mask the presence of adaptive evolution [
48]. However, in that case, one would expect the average allele frequency of nonsynonymous polymorphisms to be significantly lower than that of synonymous polymorphisms [
48], which we do not observe (average synonymous site frequency in dominant strain, 0.14; average nonsynonymous site frequency, 0.16). Our results therefore indicate that this population does not fit the predictions of the stable ecotype model, and suggests that strains of 5-way CG
Leptospirillum group II detected by sequence variation may occupy a single niche within the biofilm environment. These results are also consistent with the strong purifying selection acting on most genes in the coexisting archaeal genus
Ferroplasma, despite a far more extensive mosaic genome structure than that of
Leptospirillum group II [
15].
Our results also rule out an alternative to the stable ecotype model, in which high recombination rates prevent periodic selective sweeps from purging diversity within strains. In this scenario, strain variant regions contain beneficial mutations decoupled from the remainder of the genome. This hypothesis—that recombination speeds adaptation by spreading beneficial variants throughout the population—again posits that these variant regions are under positive selection. Thus, it predicts a detectable signature of excess fixed amino acid replacements in recombinant regions and distinct patterns of polymorphism frequency between regions containing adaptive variants and the remainder of the genome. Results from MK tests rule out positive selection on strain variant regions. The observed patterns of polymorphism density also do not match the second prediction of the high recombination model. We do not observe a significant difference in polymorphism density between reads within substrain blocks, reads within the main strain, and reads in interstrain regions, as would be expected if the action of selection on these regions was decoupled due to high recombination rates.
Although the substrains within the 5-way CG population do not appear to be under selection themselves, it is plausible that that linkage between them is absent due to selective sweeps eliminating variation in intersubstrain regions. This hypothesis predicts a lower polymorphism density in intersubstrain regions than within substrains, since a selective sweep would carry hitchhiking neutral variants to fixation. Again, the observation that polymorphism density within substrain and interstrain regions is not significantly different is strong evidence against this hypothesis. The interblock selective sweep scenario therefore requires a higher mutation rate in intersubstrain regions, which would have allowed polymorphisms to accumulate after a selective sweep to the same levels observed in adjacent substrain regions. Such extreme spatial variation in genomic mutation rate is unlikely, and hence, we can also rule out the “interstrain” sweep model.
An alternative explanation for the observed patterns of polymorphism and divergence is that a few sites within each substrain have experienced positive or balancing selection, while all other neutral polymorphisms in the substrain are hitchhiking to higher frequency along with the selected sites. Such a signal might not be detectable using MK polymorphism-divergence tests. This scenario requires that recombination has not yet separated the adaptive site from linked neutral sites. If the mutation occurred in the distant past, the recombination rate would have to have been extremely low relative to the mutation rate to preserve the well-defined substrain blocks observed here. Because there is evidence for significant recombination, this scenario seems unlikely. If the positively selected mutation occurred recently, it would be in the process of sweeping to fixation along with linked neutral polymorphisms. Under this selective sweep scenario, we expect low levels of polymorphism in the region associated with the adaptive mutation relative to the intersubstrain region (). This would manifest as a lower polymorphism density in substrain sequences relative to interstrain sequences, which was not observed.
Of the three deeply sampled recombinant regions between CG-type and other Leptospirillum group II types (), two do not show significant evidence for either positive or negative selection. The third shows strong evidence for purifying selection, consistent with our results for recombinant 5-way CG strains. These regions do not, therefore, appear to be an exception to the general pattern of purifying selection in recombinant blocks.
The most likely explanation for the existence of blocks of closely related variant sequence is geographic separation of two or more ancestral strains, followed by subsequent mixing and recombination. The mosaic genome structure of
Ferroplasma is also consistent with this model [
15]. The high read depth and connectivity of the dominant strain compared to the smaller fragments of alternate sequence type () suggest that a small number of individuals from one or more strains migrated into a preexisting population. Genetic exchange likely occurred between the immigrants and the existing strain, as shown by the incorporation of 1- to 10-kb fragments, but was insufficient to erase the genomic signature of the existing strain. It is possible that this strain was better adapted to its environment than immigrant strains, and retained a selective advantage, which explains its higher frequency in the present population. For a migration/recombination model to be plausible, however, we need to determine whether the observed divergence between the dominant strain and immigrant substrain fragments is consistent with a geographic separation scenario, given the geologic history of the Richmond Mine.
Assuming that the observed divergence between strains is typical of ancestral haplotypes, and a mutation rate of 10
−9 per site per generation [
49], the ancestral strains had to be separated for at least 8 × 10
5 generations to accumulate the observed divergence. This translates into a separation time of at least 1,400 y, given a minimum generation time of 15 h [
50]. The Richmond ore body was exposed to weathering for 780,000 y prior to the onset of mining operations around 150 y ago [
51]. A physical separation of the strains of order 10
3 y is therefore possible. Assuming that the two strains were reunited 150 years ago (at most about 100,000 generations), the observed level of variation would be maintained, provided that the effective population size (
Ne) is sufficiently large that the original variation would not all drift to fixation in this time (
Ne at least the same order of magnitude as the number of generations). If we postulate that the effective population size of
Leptospirillum group II is at least of this order of magnitude (plausible, given 10
8 cells per cm
3 in the biofilm), it is certainly possible that the two strains could have comingled as the result of increased connectivity within the mountain due to mining.
In summary, the combination of assumptions required to fit alternative explanations to the data makes the simpler explanation of geographic separation and subsequent remixing more likely.
Genome-Wide Spatial Patterning of Recombination
Substrains are distributed unevenly around the genome in a mosaic of 1–10-kb fragments (). Both SNP density and minor allele frequency appear to be higher on both sides of the origin of replication () rather than the origin of transfer, suggesting the preferential incorporation of novel genetic material in this region. A similar pattern was observed in an environmental population of
Ferroplasma acidarmanus, fer1 and fer2, in which 77% of interpopulation recombination events clustered within 500 genes of the origin of replication [
28].
The mechanism responsible for this symmetry is not known, but would seem to indicate a higher probability of recombination events associated with newly initiated replication forks proceeding bidirectionally from the origin of replication. It is well known that recombinational intermediates can be converted into functional replication forks (reviewed in [
52]). Replication forks resulting from recombination of introduced fragments could interact with replication forks initiated at
oriC, resulting in the duplication of a chromosome containing the introduced fragment, but this process is not well understood [
52]. Another example of genomic symmetry across the replication axis involves large- and small-scale inversions in closely related bacterial species [
53,
54]. The reason for this is thought to be selection against disruption of polarized motifs clustered symmetrically near the terminus of replication and important for chromosome replication and segregation [
55]. Perhaps recombination events closer to the terminus of replication are counterselected due to interference with these motifs, resulting in an apparent clustering of events around the origin. A breakdown in synteny between related strains further from the origin is evoked as an explanation for symmetry around the origin of replication in
Ferroplasma [
28], but the lack of major genome rearrangement events observed in our assembly makes it less likely as an explanation for the 5-way CG
Leptospirillum group II population.
The presence of a large integrated plasmid within the Leptospirillum group II population suggests conjugation as a possible mechanism of genetic transfer between individuals. This plasmid region contains genes coding for proteins in the tra and trb loci, which are known to be involved in formation of a bridge between two cells through which chromosomal DNA transfer can occur. In Escherichia coli Hfr strains, transfer is initiated at the oriT site and proceeds in the 5′ to 3′ direction. Marker genes closer to the oriT site are transferred more frequently due to the increased probability of breakage of conjugation bridges with time. Therefore, if conjugation were the primary mechanism of transfer, we would expect to see more incorporation of novel fragments in 5-way CG Leptospirillum group II close to, and on one side of, the putative oriT (sequence and polarity not known). Despite uncertainty about how the block containing the origin of replication is linked to the block containing the oriT, we can evaluate both of these considerations to some extent. The incidence of recombined substrain blocks is higher clockwise from the integrated plasmid region than counterclockwise from it, except immediately adjacent to the oriT. Selection against recombined regions carrying variants of conserved proteins (e.g., a large region encoding ribosomal proteins) may have obscured this pattern to some extent.
The short recombinant fragments observed both in this study and in
Ferroplasma [
28] are strikingly similar to those observed among strains of
E. coli. Milkman and colleagues [
56–
58] showed that
E. coli chromosomes typically contain distinct sequence blocks (size range around 1 kb) embedded in larger regions that have overall high similarity to other strains. A subsequent statistical comparison of six whole
E coli genome sequences also supports a mosaic structure, albeit with shorter fragment size (50% < 1 kb, 80% < 2 kb [
59]). These segments are an order of magnitude less than the average length observed in individual conjugation (48–105 kb) or transduction (10–32 kb) events [
57]. The degree of fragmentation of donor DNA was determined by differences in donor–recipient restriction-modification (RM) systems [
57,
60].
The clustered pattern of small segments that we observe is consistent with the idea that large pieces of DNA are transferred by conjugation or transduction between closely related strains, fragmented, and then incorporated as a series of small segments [
56]. The fragmentation likely occurred through the action of restriction enzymes. It is interesting that most genes annotated as part of RM pathways in
Leptospirillum group II are found in strain variant regions, suggesting that restriction specificity varies among closely related organisms (
Table S1). A similar observation was made for an environmental
Ferroplasma population [
15]. Thus, the diversity of RM systems within the 5-way CG population could explain the generation of small DNA fragments from DNA introduced from closely related strains.
Gene Content Variation
In addition to variation in sequence composition, variation in gene content could lead to an increase in fitness, be neutral, or be deleterious. As shown from between-strain genomic comparisons of isolates [
2,
7–
13] and population analyses [
15], a high fraction of individuals in a population contain unique regions due to phage, plasmid, or transposon insertion. Community genomics identifies the laterally transferred regions that are present population-wide in addition to those present at low frequency at the time and conditions of sampling, and also provides information about the degree of gene content variation in these regions.
Our data suggest that the majority of the genes on strain variant paths are present in low copy number in the population. Most observed alternate genome paths are short (a single read length, or about 1 kb; ) but incomplete. This could be in part because the paths are undersampled and in part because the insertions are on the scale of a single gene. The shallow coverage, but high density, of these variants suggests that there is a large pool of low-frequency genotypes from which the observed variants are drawn. Given that 118 alternate paths (an average length of 1 kb spanned by approximately two reads per variant) are observed in a 500-kb segment, and that there are on the order of 42,000 reads in the main genome path, the observed frequency of variant reads is approximately 3%. The variants thus span approximately 20% of this 500-kb region. If the genome of every cell in the biofilm was 20% different from every other cell, and all were present at the same frequency, we would expect the frequency of unique reads to be 20%. Therefore, a few types are present at high frequency, while the majority of variants are present at very low frequencies. The minimum number of genome types occurs when each of these low-frequency types only occurs in one cell (combinatorial variation would greatly increase the number of possible genome types). Based on the observed frequency of variant reads, low-frequency variants comprise approximately 15% of the total population. Based on earlier estimates of approximately 10
8 Leptospirillum group II cells in the sample used for sequencing [
21], there are a minimum of approximately 10
7 genome types present. The true frequency is probably somewhat higher as not all possible strain paths were counted in the assembly process. This finding of a very heterogeneous genome pool suggests that the population has not been strongly shaped by recent pervasive selective sweeps unless the genotypic variants are generated at a rate that is high compared to the frequency of such sweeps.
The high estimated number of coexisting genome types is fairly consistent with estimates based on genotyping of isolates from a complex system [
11]. Importantly, however, the majority of the cells in the population share the majority of their genomes, which is what makes reconstruction of a composite genome possible. Based on their low frequency, the functional significance of these highly variable regions is unclear. Tracking of strain population structure over time may help to resolve the extent to which different genotypes represent distinct adaptations.
It is notable that many of the strain variant regions present at higher abundance in the population (reconstructed alternative genome paths listed near the end of
Table S1) encode genes typical of plasmids (e.g., trwB, traA, traB, and mobD) or viruses (e.g., integrases). Of those integrated elements present at high frequency, some encode genes of potential functional significance. A case in point involves a region where most individuals carry one or more genes of the LuxIR system, involved in quorum sensing [
61], whereas only a minor part of the population encodes the complete pathway ().
High-frequency variants, which are often associated with marked variation in gene content (such as the LuxIR region), could be the result of multiple processes: (1) a mobile element encoding multiple genes inserts into one genome, and that type increases in frequency; (2) the variable loss of genes from mobile elements results in different individuals with differences in gene content; (3) a mobile element integrates into different individuals at the same site, localized by some genomic feature (e.g., tRNAs); and (4) a mobile element integrates at a single site in one individual, followed by an increase in frequency of that genome type, followed by recombination of related but distinct elements in or near the insertion site. For example, closely related viruses derived from a diverse viral population could use the initial virus insertion as a locus for homologous recombination. This could result in a high diversity of genome types at the altered locus.
High-frequency variants also result from recombination between related individuals followed by an increase in frequency of individuals carrying the recombinant block. A notable example of functionally significant genes on a recombinant block is the block of UBA genes containing a CRISPR system, including CRISPR-associated (
cas) genes (). Recent experimental work with
Streptococcus thermophilus demonstrated that CRISPRs confer resistance to viral infection [
35]. The CRISPR locus contains spacers approximately 30 bp long that match predominantly viral sequence, separated by repeats of similar length and adjacent
cas genes inferred to be the protein machinery of the acquired resistance system (recently reviewed in [
62]).
Results of the current study suggest that a block containing CRISPR genes recombined from the UBA population into the 5-way CG population in a single event (A) and subsequently increased in frequency. The region of this UBA-type block containing the
cas genes has risen to 100% frequency in the population (genes 28–36 in C), whereas the first half of the UBA block co-occurs with cells containing sequence of the 5-way type (genes 1–26 in C). A region containing phage/transposon insertions separates the two halves of the recombination block. The absence of any fixed differences between the
cas locus in the UBA genome and the UBA-type integrated
cas locus in the 5-way CG population suggests that this was either a very recent event and/or that strong purifying selection is acting to eliminate variation. Evidence from previous work indicating very rapid diversification of the spacer complement supports the first explanation but does not rule out the second [
63]). The observation that the genome type carrying the CRISPR locus has apparently rapidly risen to high frequency in the population suggests that viral resistance conferred a selective advantage to the recipient types.
Conclusions
Deep sequencing of the 5-way CG
Leptospirillum group II population has given us both an unprecedented catalog of within-population variation and the opportunity to test hypotheses relating to the origin and maintenance of this variation. This opens a route for incorporating population genomic evidence into reconstruction of the natural history of the Iron Mountain microbial ecosystem. Very few metagenomic studies of microbial communities have applied the tools of population genetics to understand the origin and maintenance of sequence-level variation (e.g., [
15,
37]). In the present study, we translate community genomic data into a form suitable for one such test [
36] by separation and comparison of genomic regions varying in their SNP composition. We show that, contrary to many models of microbial population structure, extensive strain variation in
Leptospirillum group II is not maintained by positive selection for adaptive variants. Instead, we detected strong evidence for purifying selection in strain variant regions. The data support a model of geographic isolation, mixing, and extensive recombination, a scenario consistent with the geological history of Iron Mountain. We believe that the application of population genetics to metagenomic data, still a nascent field, has the potential for many more such insights into the processes governing microbial population structure.
In addition to variation at the nucleotide level, our data present a snapshot of ongoing genetic exchange in a natural population, including recombination and the rapid uptake and loss of plasmid and phage-derived genes. Most gene-content variants are present at low frequency, consistent with a very large number of coexisting genotypes. This suggests a dynamic process involving the continual generation and loss of gene-content variants. Very few variant regions have reached high frequency in the population, but two examples documented here involve regions of potential functional importance (LuxIR, involved in quorum sensing, and CRISPR, involved in viral resistance). Despite the high number of variants, the genome structure of the majority of population members is sufficiently similar to allow for assembly of composite genome paths. The functional significance of this extensive gene-content variation remains unclear. To provide an answer to this question, it will be important to examine the expression levels of gene variants across space and time, as well as examine signatures of selection in genes present in regions with high gene-content variation (such as the LuxIR region discussed above).
Due to the effect of strain variation on downstream postgenomic approaches such as proteomics [
64], a thorough documentation of the population-level diversity in community genomic datasets is of great importance. Current technologies and methods have progressed to strain-level resolution, allowing the discrimination of closely related protein variants [
18]. Databases that include population-level strain variant sequences will allow us to track the presence and activity of these variants over time and as a function of geochemical conditions. These types of studies will allow us to determine the functional importance of variation within natural populations.