Innate immune systems with built-in self/non-self recognition mechanisms have long been known to protect prokaryotic genomes against insertions of foreign DNA. For example, well-studied restriction-modification systems often preserve genomic integrity by methylating prokaryotic DNA, enabling prokaryotes to recognize and cleave unmethylated foreign DNA. Yet, the foreign DNA attacking prokaryotes includes the most abundant and rapidly diversifying members of the biosphere, viruses. With viruses quickly evolving counter-strategies against prokaryotic immune systems, prokaryotes require immunological plasticity to keep pace. Here we computationally predict and directly document the evolution of an adaptive immune system that enables prokaryotes to serially acquire new immunities against diversifying viruses and plasmids. Importantly, the prokaryotic adaptive immune system is genomically encoded (i.e., hereditable) and acquires new immune elements unidirectionally, making this adaptive immune system distinct from its eukaryotic analogues.
The microbial adaptive immune system is mediated by a genomic locus termed Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR). CRISPR loci have been found in approximately 45% of sequenced bacteria and over 90% of sequenced archaea. Utilizing adjacently encoded CRISPR-associated (Cas) proteins, CRISPR loci incorporate short 21–72 base-pair sequences from targeted regions in invading viruses and plasmids. Once transcribed and processed into CRISPR RNAs, these viral and plasmid-derived sequences confer sequence-specific immunity by binding and cleaving cognate viral and plasmid regions during subsequent genomic invasions.
The viral and plasmid binding sequences incorporated into host CRISPR loci are termed ‘spacers,’ reflecting their placement between highly similar 23–47 base-pair sequences termed ‘repeats’. Correspondingly, the targeted viral and plasmid sequences are known as ‘proto-spacers’. Because spacer immunity is specific to a matching proto-spacer sequence, viruses can escape CRISPR targeting by mutating their proto-spacers or by mutating nearby proto-spacer adjacent motifs (PAMs), regions that likely act as recognition sites for the CRISPR/Cas machinery. Natural selection favors the emergence of viruses with mutations in CRISPR-targeted regions, leading to a coevolutionary arms race as hosts incorporate new spacers to combat viral adaptations. Coevolutionary arms races have been well documented in other virus-microbe systems. Yet, unlike previously studied coevolutionary wars, CRISPR-recorded arms races naturally differentiate current host adaptations from previous host adaptations. This is because new spacers are added unidirectionally, adjacent to a leader sequence at a single end of the locus termed the ‘leader-end.’ Previously acquired spacers are also commonly maintained, leaving a cassette-like recording of current (i.e., spacers closest to the leader-end) and past (i.e., spacers farther from the leader-end) adaptations. Partial timelines of coevolution can thus be constructed for host and viral species refractory to laboratory challenge experiments.
Previously, we described one CRISPR recording through metagenomic reconstructions of the CRISPR loci sampled from floating microbial biofilms in an acid mine drainage (AMD) system. The prime advantage of probing these generally closed, acidophilic environments is that they are dominated by relatively few species. Our AMD research targeted the extremophilic archaeon I-plasma. Growing in an AMD biofilm matrix at temperatures ranging from approximately 30° to 48° Celsius and pHs ranging from approximately 0.3 to 1.2, I-plasma is one of around 12 species in the acidophilic order Thermoplasmatales. Reconstructing the CRISPR loci of I-plasma, we noted that the newest, leader-end spacers emerged highly diverse and cell-specific. In contrast, the trailer-end spacers (i.e., the oldest spacers found farthest from the leader sequence) were highly clonal population-wide, matching earlier observations of trailer-end clonality in acidophilic Leptospirillum and more recent observations in bacterial Escherichia coli and archaeal Sulfolobus islandicus.
Surprisingly, I-plasma's trailer-end spacers were conserved despite apparently providing no immunity against current viruses (Figure S1). In reconstructions (~20-fold coverage) of the I-plasma locus in the AMD biofilm, only newly acquired leader-end CRISPR spacers matched currently sampled viruses, implying that previously targeted viral sequences had since evolved or disappeared. Similarly, laboratory challenge experiments document rapid viral evolution in the face of CRISPR targeting.
Here we sought to understand why trailer-end spacers are often conserved despite failing to confer immunity against current viruses. Trailer-end conservation is especially surprising in light of the genomic compactness of Bacteria and Archaea, whose genomes rarely exceed 13 Mb. Prokaryotes have also been shown to delete genetic material approximately ten times as frequently as they insert it. Given this bias toward genomic deletions, we hypothesized that bacteria and archaea would only preserve CRISPR's genetic material if natural selection favored it.
To find and probe the selection pressure driving the preservation of CRISPR trailer-ends, we combined metagenomic reconstructions of CRISPR loci across a multi-year period with a population-genetic mathematical model of virus-CRISPR dynamics in a natural system. Three previous studies have constructed mathematical models of virus-host dynamics in the CRISPR system, but none were built to explain why CRISPR loci emerge with both trailer-end clonality and trailer-end conservation. Building a model in which CRISPR locus length is an emergent property of the model parameters, we probe whether tuning parameters to increase trailer-end conservation increases prokaryotic fitness even when viruses mutate rapidly. We further capture the dynamics through which the trailer-ends of CRISPR loci are purged of spacer diversity.
A population-genetic model (see Text S1 for the full algorithm) was built to analyze how the intracellular processes of CRISPR and virus mutation drive the long-term development of natural CRISPR loci captured via metagenomic analysis. For simplicity, the model restricts its study of host and viral genomes to monitoring host spacers and viral proto-spacers. All other elements in the genomes are ignored. Host and viral populations are then divided into ‘strains’: all hosts sharing the same ordered set of spacers are assigned to a single host strain while all viruses with identical proto-spacers are assigned to a single viral strain (Figure S2). Each strain's cumulative frequency is tracked across thousands of iterations, as mutations alter host immunity and viral infectivity.
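The strain bookkeeping described above can be illustrated with a minimal Python sketch. It is not the authors' algorithm (see Text S1 for that); the function name `strain_frequencies` and the spacer labels are hypothetical. Each genome is reduced to its ordered list of spacers (or proto-spacers), and identical lists are pooled into one strain.

```python
from collections import Counter

def strain_frequencies(genomes):
    """Pool genomes into strains and return each strain's frequency.

    Each genome is a sequence of spacer (or proto-spacer) labels.
    Converting to a tuple preserves order, so two hosts carrying the
    same spacers in a different order count as different strains.
    """
    counts = Counter(tuple(g) for g in genomes)
    total = sum(counts.values())
    return {strain: n / total for strain, n in counts.items()}

# Hypothetical host population; index 0 is the leader-end spacer.
hosts = [
    ["s3", "s2", "s1"],
    ["s3", "s2", "s1"],
    ["s2", "s1"],
    ["s2", "s3", "s1"],
]
host_strains = strain_frequencies(hosts)
```

In this toy population the first two hosts form one strain at frequency 0.5, while the other two hosts each form their own strain.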
The iterations of the model are not directly dependent on time. Each iteration is instead defined to be the period of variable duration in which a large, preset number of virus-host interactions occurs. During each virus-host interaction, one of two outcomes generally occurs. If the host and viral strains share a spacer, the host survives and the virus is cleared. Conversely, if no spacer is shared, the virus kills the host and the virus survives. The model allows exceptions to both rules. Hosts are given a small probability of surviving even when lacking spacers against an invading virus. Further, CRISPR is given a small probability of failing to provide immunity even when a host spacer matches an infecting virus's proto-spacer. This failure rate has been measured in viral plaquing assays conducted by two independent groups.
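As a rough illustration of the interaction rule just described, the sketch below decides the outcome of a single virus-host encounter. The names `host_survives`, `p_escape` (probability that an unprotected host survives) and `p_fail` (probability that CRISPR fails despite a match) are hypothetical labels introduced here, not the paper's parameter symbols.

```python
import random

def host_survives(host_spacers, viral_protospacers, p_escape, p_fail, rng=random):
    """Return True if the host survives one interaction, False if the virus wins.

    p_escape and p_fail are illustrative names for the two small
    exception probabilities described in the text.
    """
    matched = not set(host_spacers).isdisjoint(viral_protospacers)
    if matched:
        # A shared spacer normally clears the virus, but CRISPR
        # fails with probability p_fail.
        return rng.random() >= p_fail
    # Without a shared spacer the host normally dies, but it
    # survives with probability p_escape.
    return rng.random() < p_escape
```

Setting both probabilities to zero recovers the strict rule: a shared spacer always saves the host, and a missing one always kills it.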
Table of parameters used in model.
With a large number of interactions per iteration, virus-host interactions are assumed to be well-mixed and distributed according to strain frequencies. Since viruses are most likely to encounter high-frequency host strains, this selects for the viral lines that can kill the dominant hosts, resulting in negative frequency-dependent selection, a process termed ‘kill the winner’ in microbial ecology. During some interactions, stochastic mutations create new host and viral strains, as hosts unidirectionally add spacers and viruses mutate random proto-spacers. Old host and viral strains are simultaneously depressed in frequency and driven extinct when no longer immune and infective, respectively. At the end of an iteration, the model takes a metagenomic snapshot of the surviving host and viral populations. We analyzed these snapshots across model iterations to capture patterns of CRISPR-driven immunity as they emerge.
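The well-mixed encounters and the two mutation types can be sketched as follows. This is a simplified illustration under stated assumptions, not the published algorithm: strains are tuples with the leader-end spacer at index 0, and the function names and allele labels are hypothetical.

```python
import random

def sample_interaction(host_freqs, virus_freqs, rng=random):
    """Draw one host strain and one viral strain in proportion to frequency."""
    host = rng.choices(list(host_freqs), weights=list(host_freqs.values()))[0]
    virus = rng.choices(list(virus_freqs), weights=list(virus_freqs.values()))[0]
    return host, virus

def add_spacer(host, new_spacer):
    """Return a new host strain with new_spacer added at the leader end (index 0)."""
    return (new_spacer,) + tuple(host)

def mutate_protospacer(virus, new_allele, rng=random):
    """Return a new viral strain with one randomly chosen proto-spacer replaced."""
    i = rng.randrange(len(virus))
    return tuple(virus[:i]) + (new_allele,) + tuple(virus[i + 1:])

# Example: a host acquires spacer "s3"; a virus mutates one proto-spacer.
new_host = add_spacer(("s2", "s1"), "s3")
new_virus = mutate_protospacer(("p1", "p2", "p3"), "p9", random.Random(0))
```

Because sampling is weighted by frequency, abundant host strains are encountered most often, which is what exposes dominant hosts to ‘kill the winner’ selection.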
Here we describe the main assumptions of the model; a more in-depth analysis of each model assumption can be found in the Supplementary Information (Text S2). First, the model assumes that virus and host populations do not go irreversibly extinct. With host and viral populations continually extant, in each iteration the model can simply wait until any preset number of virus-host interactions occurs. We can thus define iterations to be the variable-duration period in which such a preset number of interactions occurs. Empirical support for assuming the long-run coexistence of virus and host in natural environments comes from two metagenomic studies. In the first study, Rodriguez-Brito et al. recovered consistently high amounts of virus and host genomes in four aquatic regions across a year-long period. Similarly, in the experimental part of our study, we reconstructed the relative abundances of CRISPR loci and viruses in an acid mine drainage system across the last two years of our six-year metagenomic time series experiment. In each sampling, both host and viral genomes were recovered.
Large microbial population sizes limit the effect of sampling noise in modulating the frequencies (genetic drift) of established strains in our model. But since new mutants arise at low frequencies, we incorporated demographic stochasticity in their ability to establish (i.e., avoid extinction due to a low initial frequency). We did so by allowing new mutants randomly distributed ‘emergence periods’ during which they were not subject to the model's clearance of low-frequency strains. All strains, excluding new mutants in their randomly-sized emergence periods, are cleared when their frequencies drop below a threshold, effecting mutation-selection balance and preventing the model from accumulating an uncontrollable number of strains as new mutants are created. Thus, without the randomness component, the emergence period allows new mutants a chance to reach ‘establishment frequencies,’ after which each mutant can compete in the model solely via its CRISPR-determined fitness.
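One way to picture the emergence period and the low-frequency clearance is the sketch below. It rests on assumptions that go beyond the text: the emergence period is drawn from an exponential distribution (the text says only that it is randomly distributed), it is counted in whole iterations, and the names `new_mutant`, `clear_strains`, `grace`, and `threshold` are hypothetical.

```python
import random

def new_mutant(frequency, mean_emergence, rng=random):
    """Create a strain record with a randomly drawn emergence period."""
    # Assumption: exponentially distributed length, at least one iteration.
    grace = int(rng.expovariate(1.0 / mean_emergence)) + 1
    return {"freq": frequency, "grace": grace}

def clear_strains(strains, threshold):
    """Drop strains below threshold unless they are still in their emergence period.

    Each call represents the end of one iteration, so every surviving
    strain's remaining emergence period shrinks by one.
    """
    kept = {}
    for name, s in strains.items():
        if s["grace"] > 0 or s["freq"] >= threshold:
            kept[name] = {"freq": s["freq"], "grace": max(s["grace"] - 1, 0)}
    return kept

strains = {
    "established": {"freq": 0.9, "grace": 0},
    "rare_old": {"freq": 0.001, "grace": 0},
    "mutant": {"freq": 0.001, "grace": 2},
}
after_one = clear_strains(strains, threshold=0.01)
```

Here the rare old strain is cleared immediately, while the equally rare mutant is protected for two iterations and is cleared afterwards only if it has not risen above the threshold.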
By increasing the rate at which viable mutants establish, the emergence period increases competition between distinct spacer-adding lines (clonal interference). This promotes ‘kill the winner’ dynamics, making it harder for individual lines to sweep. Despite this increase in competition among beneficial mutants, below we capture losses of trailer-end diversity and rapid selective sweeps. To assure that these results also occur without the emergence period, we tested the model without an emergence period and found both trailer-end clonality and stochastic sweeps (Figure S3