CRISPR is a widely distributed family of repeats in prokaryotes [1–3, 5, 7, 15]. Preliminary insight into their biology came with the discovery that four different protein families occur in prokaryotes only if CRISPRs are present. These proteins are always near a set of these repeats and always include Cas1 [3]. In the current study, we built on these prior findings and established a number of HMM-defined Cas protein families. These protein families have been found to form conserved clusters across multiple genomes, which allowed us to create rules for the identification of specific subtypes of CRISPR/Cas system.
From the study presented here, it is apparent that CRISPR/Cas systems are far more complex than previously appreciated. Forty-five distinct protein families associated with CRISPRs have been identified among the first 200 completed prokaryotic genomes. These are currently represented by 53 HMMs ( and ). These models are sensitive, in that they unambiguously identify genes, and are also selective, in that they do not identify genes in organisms lacking CRISPR/cas loci. The subtype-specific models accurately discriminate between the subtypes but may, infrequently, identify genes in novel CRISPR/cas contexts that, given sufficient additional genomes, would warrant the status of separate subtypes.
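As an illustration of the selectivity criterion described above, the following minimal Python sketch checks whether a model reports hits only in genomes that carry CRISPRs. It is not the procedure used in this study: the function name `check_model`, the genome labels, the scores, and the cutoff value are all hypothetical and chosen only for the example.

```python
def check_model(hits_by_genome, crispr_genomes, trusted_cutoff):
    """Summarize where one HMM finds hits at or above a score cutoff.

    hits_by_genome: dict mapping genome name -> list of HMM bit scores
    crispr_genomes: set of genome names known to carry CRISPR repeats
    trusted_cutoff: hypothetical score threshold for accepting a hit

    Returns (genomes_with_hits, false_positive_genomes).
    """
    genomes_with_hits = {
        genome
        for genome, scores in hits_by_genome.items()
        if any(score >= trusted_cutoff for score in scores)
    }
    # A selective model has no accepted hits in genomes lacking CRISPRs.
    false_positive_genomes = genomes_with_hits - crispr_genomes
    return genomes_with_hits, false_positive_genomes


# Made-up scores for three hypothetical genomes.
hits = {"genome_A": [310.2, 12.5], "genome_B": [8.1], "genome_C": [295.0]}
crispr = {"genome_A", "genome_C"}

found, false_pos = check_model(hits, crispr, trusted_cutoff=100.0)
print(sorted(found))
print(sorted(false_pos))
```

Under these assumptions, an empty set of false-positive genomes corresponds to the sense of "selective" used in the text.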
Previous work by Makarova and colleagues [14], conducted on a smaller set of available microbial genomes and without knowledge of the associated CRISPRs, identified some 20 gene families (COGs) proposed to act in DNA repair, many of which contain genes identified by our HMMs. The relationship between these two sets of families is uneven, with some of our HMMs spanning multiple COGs, some COGs spanning multiple HMMs, and some COGs including genes we believe to be unrelated to CRISPRs. For example, COG0640 includes eight putative transcriptional regulators in A. fulgidus and five in M. jannaschii, but only MJ0379 and AF1869, one locus in each species, are CRISPR-associated; they encode the Csa3 protein of the Apern-type CRISPR system. These differences are not unexpected, considering the different clustering methods and search algorithms applied to unequal datasets. Their work also introduced the RAMP superfamily [14], to which a number of Cas protein families belong. The proposed helicase, nuclease, and other domains for DNA repair metabolism may instead or in addition act in the processes of CRISPR physiology: mobilization, maintenance, processing, and addition of new spacer elements. To reflect this change in interpretation, we propose renaming the RAMP superfamily from repair-associated to repeat-associated mysterious protein, thus preserving the acronym currently in use.
The groups of gene families that comprise the CRISPR/Cas subtypes appear to have traveled together through evolutionary time as discrete units. Even the core cas genes appear to have the same evolutionary history as their partner subtype-specific genes (). The reasonable hypothesis that the Cas proteins interact with the repeats in their DNA or expressed RNA form (e.g., bind to, stabilize, regulate the expression of, cleave, modify, or degrade them) is supported by the observation of subtype-specific characteristics of the repeats, such as repeat periodicity. Although, as demonstrated in this study, CRISPR/cas loci of different subtypes can coexist within the same genome, phylogenetic reconstructions of Cas core proteins do not provide any evidence of switching between subtypes having repeat periodicities of 60 or 61 and those with longer periodicities (). The RAMP module and the RAMP-like Mtube subtype would appear to deviate from this pattern, showing varying degrees of independence from dedicated cas core genes and their associated repeat periodicities.
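To make the notion of repeat periodicity concrete, the sketch below computes it from the start coordinates of successive repeats in one array. It assumes that periodicity means the distance from the start of one repeat to the start of the next (repeat length plus spacer length); that definition, the function names, and the coordinates are assumptions made for illustration and are not taken from this study's methods.

```python
from collections import Counter


def repeat_periodicities(repeat_starts):
    """Return start-to-start distances between consecutive repeats."""
    starts = sorted(repeat_starts)
    return [later - earlier for earlier, later in zip(starts, starts[1:])]


def typical_periodicity(repeat_starts):
    """Return the most common periodicity in an array, or None if undefined."""
    periods = repeat_periodicities(repeat_starts)
    if not periods:
        return None  # fewer than two repeats: no spacing to measure
    return Counter(periods).most_common(1)[0][0]


# Hypothetical repeat start coordinates within a single CRISPR array.
starts = [1000, 1061, 1122, 1182, 1243]
print(repeat_periodicities(starts))
print(typical_periodicity(starts))
```

Because spacer lengths vary slightly within an array, summarizing by the most common spacing is one simple choice; a median or a range would serve equally well for comparing arrays across subtypes.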
It has been previously suggested that cas genes have undergone LGT events based on phylogenetic analyses and conservation of gene order [14], anomalous nucleotide frequencies [8], and the presence of multiple chromosomal CRISPR loci [3]. Our finding that the core cas genes belong to multiple CRISPR subtypes, each with its own sporadic distribution, indicates that this conclusion should be reexamined and reconfirmed. Indeed we have observed several lines of evidence that support the LGT hypothesis: (1) CRISPR/cas loci representing five different subtypes are found on plasmids (subtype Ypest in Legionella pneumophila Lens, subtype Dvulg in D. vulgaris, subtype Hmari in H. marismortui, subtype Ecoli in Photobacterium profundum, and both subtypes Mtube and Ecoli in T. thermophilus HB8). In the case of L. pneumophila Lens, a second, nearly identical copy of the locus is found on the chromosome. (2) In L. pneumophila Paris, by contrast, there is no trace of any gene with homology to the Ypest subtype genes or repeats found in the Lens strain, while an entirely different (untyped) CRISPR locus is found. Differences in CRISPR locus content have been observed between closely related strains of S. pyogenes, Listeria monocytogenes, and T. thermophilus. (3) Comparison of the Ecoli subtype loci from E. coli K12-MG1655 and E. coli O157:H7 EDL933 shows that while Cas1, Cas2, and the surrounding genomic region are nearly identical between K12-MG1655 and O157:H7 EDL933, this similarity does not extend to the rest of the Cas proteins in the cluster. For K12-MG1655, these proteins are most similar to those in Geobacter sulfurreducens, while for O157:H7 EDL933 they are most similar to those of Photorhabdus luminescens. (4) Cas1 proteins found in Porphyromonas gingivalis W83, Vibrio vulnificus YJ016 and Nostoc sp. PCC 7120 are fusion proteins, having a C-terminal Cas1 domain but also a reverse transcriptase domain similar to that found in group II introns. This may represent one mechanism used for mobilization in a subset of CRISPR loci.
Clusters of cas genes and their associated repeats must maintain themselves in prokaryotic populations by reproducing and mobilizing themselves as fast as they are degraded. We see a number of degenerate CRISPR/cas systems as well as profound differences in cas gene content between closely related species or strains. This is significant, because it implies that the process of replenishing genomes with intact CRISPR loci is frequent. We are inclined to believe that CRISPR/cas loci may, under certain circumstances, confer selective advantages on their host cells and that, in these cases, selection stabilizes the loci against degradation. We have yet to observe a single instance of a duplicated cas gene cluster on the chromosome(s) of any species. This is in contrast to selfish genetic elements such as transposons, which persist in a given lineage largely through redundancy.
Plasticity with respect to the number of repeat copies, as well as the extensive differences in the spacers between repeats, is observed in CRISPR loci [2, 12, 15, 23]. The finding that spacer sequences derive from foreign DNA, such as phage and transposons, suggests a defensive capacity for at least some instances of CRISPR system [12, 13], but roles in replicon partitioning in the Archaea [1] and regulation of fruiting body development in M. xanthus [19, 22] are also suggested. Correlation of repeat expression with CRISPR subtype is in order. Apern subtype repeats are expressed and processed in A. fulgidus and S. solfataricus [11]. Also expressed, in addition to their neighboring cas genes, are the Nmeni repeats of Streptococcus agalactiae (H. Tettelin and J. Dunning Hotopp, personal communication), the Mtube repeats of Staphylococcus epidermidis (S. Gill, personal communication), and the Hmari/Mtube/RAMP module region repeats of T. maritima (data not shown). Five separate markers from the Ecoli-type CRISPR array of G. sulfurreducens were up-regulated 2- to 3-fold when cells were grown with Fe(III) versus fumarate as electron acceptor [24].
We have characterized multiple distinct subtypes of CRISPR/cas loci and demonstrated profound differences in CRISPR system content between closely related strains and species. Beneficial roles may include defense of the host against foreign DNA [12, 13] and regulation of the fruiting body development cycle by the DevR and DevS cas genes in the special case of M. xanthus [19, 22]. These findings support an emerging model of CRISPR/cas systems. They appear to be portable adaptation modules for their host genomes. They are sufficiently unstable that degenerate forms are often seen and sufficiently mobile that multiple instances of LGT are apparent. Their repeat arrays consistently are among the most rapidly evolving loci seen in strain comparison studies, such that they are the basis of “spoligotyping” [23, 25, 26]. Both cas gene and repeat expression can be differentially regulated. They can be co-opted by their hosts for new regulatory systems, as seen for a pathway unique to Myxococcus in the interaction between the non-Cas protein FruA and Cas protein DevR. The adaptations they enable may be supplanted later by the evolution of more stable regulatory systems, but in the meantime they may be superbly useful in rapid adaptation, such as in the invasion of a new biological niche.