The recently discovered CRISPR-Cas adaptive immune system is present in almost all archaea and many bacteria. It consists of cassettes of CRISPR repeats that incorporate spacers homologous to fragments of viral or plasmid genomes that are employed as guide RNAs in the immune response, along with numerous CRISPR-associated (cas) genes that encode proteins possessing diverse, only partially characterized activities required for the action of the system. Here, we investigate the evolution of the cas genes and show that they evolve under purifying selection that is typically much weaker than the median strength of purifying selection affecting genes in the respective genomes. The exceptions are the cas1 and cas2 genes that typically evolve at levels of purifying selection close to the genomic median. Thus, although these genes are implicated in the acquisition of spacers from alien genomes, they do not appear to be directly involved in an arms race between bacterial and archaeal hosts and infectious agents. These genes might possess functions distinct from and additional to their role in the CRISPR-Cas-mediated immune response. Taken together with evidence of the frequent horizontal transfer of cas genes reported previously and with the wide-spread microscale recombination within these genes detected in this work, these findings reveal the highly dynamic evolution of cas genes. This conclusion is in line with the involvement of CRISPR-Cas in antiviral immunity that is likely to entail a coevolutionary arms race with rapidly evolving viruses. However, we failed to detect evidence of strong positive selection in any of the cas genes.
CRISPR-Cas is a recently discovered adaptive immune system that is present in almost all archaea and many bacteria (7, 10, 37). A striking feature of the CRISPR-Cas system is that it can “remember” the identity of infectious agents (such as viruses and plasmids) by incorporating DNA sequences derived from the genomes of such agents (and possibly alien DNA in general) into the genome of a prokaryotic host. The CRISPR-Cas system thus allows prokaryotic cells to acquire information about the external environment (or more precisely, alien DNA present in the environment), incorporate this information into the host genome, and thereby transmit it to the progeny. Thus, CRISPR-Cas clearly exemplifies the principle of Lamarckian inheritance (28).
The CRISPR-Cas module is in the genome of archaea and bacteria in two parts, namely, arrays of repeat sequences known as clustered, regularly interspaced, short palindromic repeats (CRISPRs) and genes encoding CRISPR-associated (Cas) proteins (1, 26, 38). The operation of the CRISPR-Cas immune system can be divided into three functionally distinct stages, namely, adaptation, expression, and interference, each carried out through interactions between CRISPRs, their transcripts, Cas proteins, and foreign DNA. At the adaptation stage, DNA sequences of about 30 bp (called spacers) that are homologous to certain regions (called protospacers) in the genomes of infectious agents are incorporated into a CRISPR locus (7). The incorporation of a spacer is accompanied by the duplication of a similarly sized, CRISPR-constitutive repeat sequence, which joins the incoming spacer to an existing spacer, thereby elongating the CRISPR cassette by one unit. At the expression stage, CRISPR loci are transcribed, and the transcripts are processed into small RNA molecules (called crRNAs), which bind to an enzymatic complex consisting of Cas proteins, known as CASCADE (10, 13, 24). At the interference stage, a crRNA directs the bound CASCADE complex along with an additional Cas protein (Cas3) to destroy the foreign DNA or in some cases RNA after the crRNA forms a duplex with the cognate protospacer sequence (10, 23, 27, 64).
The acquisition of spacers at the adaptation stage is the critical step at which the distinction between self and nonself is made (also see reference 40). Otherwise, autoimmunity would ensue through the incorporation of the host's own DNA into the CRISPR loci; such self targeting indeed has been detected but appears to be extremely rare (58). The selection of protospacers in foreign DNA sequences is nonrandom: protospacers often are located adjacent to short, conserved motifs called protospacer adjacent motifs (PAMs), which are implicated in the selection of protospacers (14, 45). Moreover, PAMs are necessary for the recognition of foreign DNA sequences during the interference stage (14).
The CRISPR-Cas systems show remarkable diversity of protein sequences and genomic organization of the cas operons. At least 45 distinct protein families have been identified in association with CRISPR loci in various bacterial and archaeal genomes (22). Further analyses involving more sensitive methods of sequence and structure comparison supplemented by the analysis of cas operon architectures have revealed distant homologous relationships between many Cas protein families (33, 34). The recently developed classification divides CRISPR-Cas systems into three distinct types (I, II, and III) (35). All of these systems contain two universal genes: cas1, a metal-dependent DNase that is implicated, with no sequence specificity, in the integration of protospacers into CRISPR cassettes (39, 65); and cas2, a metal-dependent endoribonuclease that also appears to be involved in the adaptation stage (8). Apart from the conservation of cas1 and cas2, the three types of CRISPR-Cas systems substantially differ in their sets of constituent genes, and each is characterized by a unique signature gene (35). The signature genes for the three types are cas3 (a superfamily 2 helicase containing an N-terminal HD superfamily nuclease domain) (57), cas9 (a large protein containing a predicted RuvC-like and HNH nuclease domains), and cas10 (a protein containing a domain homologous to the palm domain of nucleic acid polymerases and nucleotide cyclases), respectively (35). Within the three major types, CRISPR-Cas systems can be further classified into subtypes based on a number of criteria, which include distinct signature genes along with the phylogeny of the universal cas1 gene (35). The Cas proteins known as RAMPs (for repeat-associated mysterious proteins) are present in several copies in both type I and III systems. Some of the RAMP proteins have been shown to possess sequence- or structure-specific RNase activity that is involved in the processing of pre-crRNA transcripts (10, 11, 24). 
The crystal structures of several RAMPs have been solved and shown to contain one or two RNA recognition motif (RRM) domains that show substantial structural variations in different Cas proteins (24, 32, 34, 53, 63).
The CRISPR-Cas modules could be expected to undergo rapid evolution in natural environments because of recurrent selection pressure exerted by coevolving viruses (20, 49). This expectation appears to be consistent with the extreme diversity of Cas protein sequences and structures. Moreover, in accord with the prediction of rapid evolution, it has been shown that the spacer composition of CRISPRs in biofilm-forming, acidophilic archaea evolves rapidly in a natural environment (62), turning over on a time scale of months (4). In addition, evidence for the horizontal gene transfers (HGTs) of CRISPR-Cas modules has accumulated from comparative sequence analyses focusing on various taxonomic ranks ranging from phyla to strains, indicating that the CRISPR-Cas modules undergo HGT on various evolutionary timescales (12, 19, 61, 62). However, in contrast to these observations, which are compatible with the rapid evolution of CRISPR-Cas modules, it has been shown that the spacer compositions of CRISPRs in Escherichia coli and Salmonella enterica evolve at a much slower rate, remaining unchanged for 10^3 to 10^5 years (60, 61). Because such slow evolution is at odds with the expectation for an active immune system interacting with evolving viruses, this finding led to the suggestion that, at least in some organisms, the CRISPR-Cas system could perform functions other than defense against infectious agents (60), a case in point being the reported involvement of Cas1 in DNA repair (6). Given these contrasting findings on the pace of CRISPR evolution, we sought to investigate the microevolution of the cas genes to gain further insights into tempo and mode in the evolution of different variants of the CRISPR-Cas system and potentially into the functions of cas genes.
In this work, we systematically examined the nature and intensity of selection pressure that affects different cas genes by estimating the ratio of nonsynonymous to synonymous substitutions (dN/dS), the generally accepted gauge of the type and strength of selection in the evolution of genes and individual amino acid sites. The results indicate that cas genes generally are subject to purifying selection, the intensity of which, however, varies greatly depending on the gene family and significantly differs between the stages of CRISPR immunity in which the genes are involved. Most of the cas genes evolve under much weaker selection pressure than the average selection pressure exerted on genes in the respective bacterial and archaeal genomes. However, we did not detect evidence of strong positive selection in any of the cas genes.
The completely sequenced genomes of 1,164 bacteria and archaea were downloaded from the NCBI FTP site (ftp://ftp.ncbi.nih.gov/genomes/Bacteria) in August 2010. The profiles of 52 Cas proteins (or domains) reported by Makarova et al. (35) (ftp://ftp.ncbi.nih.gov/pub/wolf/_suppl/CRISPRclass/index.html) were obtained from Pfams (17) and TIGRFAMs (22, 55).
The Cas profiles were searched against the genomes using PSI-BLAST (3) (E value, 10^-6), with the consensus sequence of each profile used as the master sequence. To remove false positives, the hits were searched against the Conserved Domains Database (CDD) using RPS-BLAST (36) (E value, 10^-6) with hits to the NCBI Protein Clusters Database discarded. A PSI-BLAST hit was considered a bona fide Cas protein if a set of nonoverlapping, best-match profiles obtained as RPS-BLAST hits included (i) one of the profiles used in the previous PSI-BLAST search and/or (ii) the profile of a Cluster of Orthologous Groups of protein (COG) (59) describing this Cas protein (35). To increase the sample size, the obtained Cas sequences were searched against the genomes using BLASTP (2) with a stringent E value cutoff of 10^-10, and the hits were considered additional true Cas sequences. All Cas sequences were pooled to remove redundancy and clustered with the BLASTClust program (ftp://ftp.ncbi.nlm.nih.gov) (similarity, >70%; bidirectional coverage, >90%). The clusters were categorized into groups, each representing a single Cas protein or a concatenation of multiple Cas proteins (domains), as follows. Every sequence of a cluster was searched against the profiles used in the previous PSI-BLAST search and those of COGs describing a single Cas protein with RPS-BLAST (E value, 10^-1). If the union of nonoverlapping, best-match profiles consisted of multiple profiles, the cluster was considered to belong to a group representing the concatenation of the respective proteins (domains); if it consisted of a single profile, the assignment was done straightforwardly.
For each cluster, the protein sequences were aligned with the MUSCLE program (16). DNA sequences were aligned based on the protein sequence alignments with the tranalign program from EMBOSS (52). The DNA sequences containing frame shifts were discarded together with the respective protein sequences. Protein sequences that were identical to each other were removed, except for one sequence, together with the respective DNA sequences. Phylogenetic trees were estimated from the protein alignments with the PhyML program (21). The trees were approximately rooted by the least-square distance method of Wolf et al. (66). For each tree, every monophyletic group of genes that reside in the same genome was collapsed into one operational taxonomic unit (OTU) (inparalogs). Subsequently, every monophyletic group of genes, each of which resides in a distinct genome belonging to the same genus, was considered a group of orthologous genes. To ensure a uniform level of sequence divergence within every orthologous group, genomes belonging to the same Alignable Tight Genomic Cluster (ATGC) (46) were considered to belong to the same genus (in the current data set, Salmonella, Citrobacter, Shigella, and Escherichia belonged to a single ATGC, and so did Nostoc and Anabaena). Otherwise, taxonomic classification was obtained from the NCBI Taxonomy Database (54). The procedure of orthology assignment described above was based on the assumption that the divergence time between genomes was too short to allow duplication followed by differential loss below the level of a genus (i.e., HGTs within a genus were ignored).
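The collapsing of inparalogs described above can be illustrated with a minimal sketch. It assumes a rooted tree represented as nested Python lists with a hypothetical Leaf record at the tips; neither the data structure nor the function name comes from the original analysis, which operated on PhyML trees.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Leaf:
    """Hypothetical tip record: one OTU and the genome it resides in."""
    genes: Tuple[str, ...]   # gene identifiers merged into this OTU
    genome: str              # genome that carries the gene(s)


# An internal node is simply a list of subtrees (assumed representation).
Tree = Union[Leaf, List["Tree"]]


def collapse_inparalogs(node: Tree) -> Tree:
    """Collapse every clade whose tips all come from one genome into one OTU."""
    if isinstance(node, Leaf):
        return node
    # Collapse the subtrees first, so that a single-genome clade below this
    # node has already been reduced to a single Leaf.
    children = [collapse_inparalogs(child) for child in node]
    single_genome = (
        all(isinstance(c, Leaf) for c in children)
        and len({c.genome for c in children}) == 1
    )
    if single_genome:
        merged = tuple(g for c in children for g in c.genes)
        return Leaf(genes=merged, genome=children[0].genome)
    return children
```

Because the recursion works from the tips toward the root, each maximal single-genome clade ends up as exactly one OTU, while clades that mix genomes keep their internal structure for the subsequent orthology assignment.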
The aligned DNA sequences were examined for recombination signals with the RDP3 software (42). Recombination signals were accepted if at least 5 different methods detected statistically significant (P < 0.05) evidence of recombination (9, 18, 41, 43, 44, 48, 51). Based on the description in RDP3, MaxChi and Chimaera were considered the same method, and so were Chimaera and 3SEQ. MaxChi and 3SEQ were, however, considered different methods (no transitivity assumed). If a recombination signal was detected, all sequences involved in it (i.e., potential parental sequences and recombinants) were removed from a cluster, and the remaining sequences were examined with RDP3 again. A cycle of recombination detection and sequence removal was repeated until no recombination signals were detected in every alignment consisting of 3 or more DNA sequences (RDP3 cannot examine alignments consisting of 2 sequences because it is difficult, if not impossible, to detect recombination from 2 sequences). The sequences with recombination signals were discarded to improve the quality of the dN/dS ratio estimation (5, 56).
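The detect-and-remove cycle described above can be sketched as a simple loop. The detector is passed in as a hypothetical callable standing in for an RDP3 run that applies the five-method rule; RDP3 itself is not invoked here, and all names are illustrative assumptions.

```python
from typing import Callable, Iterable, Set


def purge_recombinants(
    sequences: Iterable[str],
    detect: Callable[[Set[str]], Set[str]],
) -> Set[str]:
    """Repeat detection and removal until no recombination signal remains.

    `detect` is a hypothetical stand-in for the recombination scan: given the
    current sequence identifiers, it returns those involved in an accepted
    signal (potential parents and recombinants), or an empty set.
    """
    remaining = set(sequences)
    # Alignments of fewer than 3 sequences are not examined.
    while len(remaining) >= 3:
        involved = detect(set(remaining)) & remaining
        if not involved:
            break
        remaining -= involved
    return remaining
```

The loop terminates because every pass either removes at least one sequence or stops.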
The dN/dS ratios were estimated with the maximum-likelihood method implemented in the PAML program (67) by comparing all possible pairs of DNA sequences within each cluster. Estimations yielding the number of synonymous substitutions per synonymous site (dS) that fell outside the range of 0.25 ≤ dS ≤ 1.5 were discarded to improve the quality of estimation (when the value of dS was too small, the dN/dS ratio seemed to be overestimated because of the inflation of a quotient by a small denominator; conversely, when the value of dS was too large, the estimation of dN/dS is not reliable because of saturation in synonymous sites). To compare the dN/dS ratios for different classes of genes, three statistical methods were applied: Mann-Whitney U test with the Holm-Bonferroni correction (25), t test with the Holm-Bonferroni correction, and the Dunnett T3 procedure of the Tukey-Kramer test (15).
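The dS filter and the Holm-Bonferroni correction mentioned above can be sketched in a few lines. This is an illustration under stated assumptions, not the code used in the study: the PairEstimate record and both function names are hypothetical, and the pairwise dN and dS values are assumed to have been estimated beforehand (for example, with PAML).

```python
from typing import List, NamedTuple, Sequence


class PairEstimate(NamedTuple):
    """Hypothetical record of one pairwise estimate."""
    dn: float   # nonsynonymous substitutions per nonsynonymous site
    ds: float   # synonymous substitutions per synonymous site


def filter_by_ds(pairs: Sequence[PairEstimate],
                 lo: float = 0.25, hi: float = 1.5) -> List[float]:
    """Keep dN/dS only for pairs whose dS lies within [lo, hi]."""
    return [p.dn / p.ds for p in pairs if lo <= p.ds <= hi]


def holm_bonferroni(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Holm step-down procedure: returns a reject flag for each hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest P value (k = rank + 1) is compared with
        # alpha / (m - k + 1); testing stops at the first failure.
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject
```

Pairs with very small dS are excluded because a small denominator inflates the ratio, and pairs with very large dS are excluded because synonymous sites are saturated, as explained in the text.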
To map the dN/dS ratios of cas genes onto the genomic distributions of dN/dS ratios, dN/dS ratios were estimated for each gene from the respective pairs of genomes. The pairs of genomes were initially selected such that at least one of the pairs contained the cas genes for which the dN/dS ratios were estimated in the previous step. For each pair of genomes, a reciprocal BLASTP (2) search was done, and orthology was assigned to genes according to the bidirectional best-hit criterion (29). The dN/dS ratios of the orthologous genes were estimated with PAML (67) as described above; in this case, a search for recombination events was not performed because all subsequent analysis used only the median of the genomic distributions that would not be substantially affected by a small fraction of genes with detectable recombination (the median, in general, is robust to extreme values). If the genomic median of the dS values fell outside the range of 0.25 ≤ dS ≤ 1.5, such a pair of genomes was discarded to improve the quality of the estimation. Because many pairs fell outside this range, the scope of selections was extended to genomes belonging to the same genus (54) as that of the discarded genomes, assuming that organisms within the same genus have similar genomic distributions of dN/dS ratios (at least with respect to median values). Consequently, 39 genomic distributions of dN/dS ratios were obtained with the following characteristics. The median ranged from 0.025 to 0.11 with a mean (± standard deviation [SD]) value of 0.065 ± 0.021. The scaled median absolute deviation (MAD) ranged from 0.023 to 0.062 with a mean (±SD) value of 0.044 ± 0.011 (a scaling factor of 1.48 was used, because 1.48 MAD ≈ SD if the population distribution is Gaussian). The medians of the dN/dS ratio distributions were used to scale the dN/dS ratios of cas genes.
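The scaled MAD and the scaling of cas gene ratios by the genomic median can be written out as a short sketch. The function names are hypothetical; only the factor of 1.48 and the use of the median are taken from the text.

```python
from statistics import median
from typing import Sequence

MAD_SCALE = 1.48  # scaling factor quoted in the text (1.48 MAD ~ SD for Gaussian data)


def scaled_mad(values: Sequence[float]) -> float:
    """Median absolute deviation from the median, multiplied by MAD_SCALE."""
    center = median(values)
    return MAD_SCALE * median([abs(v - center) for v in values])


def scale_by_genomic_median(cas_ratio: float,
                            genomic_ratios: Sequence[float]) -> float:
    """Express a cas gene dN/dS ratio in units of the genomic median dN/dS."""
    return cas_ratio / median(genomic_ratios)
```

A scaled value above 1 then means that the cas gene has a higher dN/dS ratio than the median gene of the same genome pair. Both statistics are insensitive to a few extreme genes, which is the robustness argument made above for skipping the recombination screen at the genome level.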
Site-specific dN/dS ratios were estimated for the clusters consisting of at least 4 sequences with PAML (67) using 4 models, namely, M1a, M2a, M7, and M8, and unrooted phylogenetic trees estimated in the previous step. The likelihood ratio test was done between M1a and M2a and between M7 and M8 to detect evidence of positive selection. Sites were considered positively selected if the posterior probability for a site to be under positive selection was above 0.95 (with no correction for multiple comparisons).
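The likelihood ratio test and the posterior-probability cutoff can be sketched as follows. The sketch assumes that twice the log-likelihood difference is compared with a chi-square distribution with 2 degrees of freedom, the conventional choice for the M1a versus M2a and M7 versus M8 comparisons; the degrees of freedom are not stated in the text, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def lrt_chi2_df2(lnl_null: float, lnl_alt: float) -> Tuple[float, float]:
    """Return (statistic, P value) for a likelihood ratio test with 2 df.

    For 2 degrees of freedom the chi-square survival function has the
    closed form exp(-x / 2), so no external library is needed.
    """
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return stat, math.exp(-stat / 2.0)


def positively_selected_sites(posteriors: Sequence[float],
                              cutoff: float = 0.95) -> List[int]:
    """1-based positions whose posterior probability of selection exceeds cutoff."""
    return [i + 1 for i, p in enumerate(posteriors) if p > cutoff]
```

For instance, log-likelihoods of -1000.0 under the null model and -997.0 under the alternative give a statistic of 6.0 and a P value of exp(-3), just below 0.05. Because no correction for multiple comparisons is applied across sites, the site list is best read as a set of candidates.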
The dN/dS ratio is a gauge of selection affecting proteins under the assumption that synonymous sites in protein-coding sequences evolve neutrally (31, 68). The dN/dS ratios of all examined cas genes were less than 1 (Fig. 1), indicating that the cas genes generally evolve under purifying selection. However, the ratios varied greatly among the cas genes, covering a range between 0.05 and 0.3. This range spanned roughly 4 to 10 times the scaled median absolute deviation (MAD) of the genomic (gene by gene) distribution of dN/dS ratios: in the genomic distributions estimated from each of the 39 analyzed pairs of genomes, the scaled MAD ranged from 0.023 to 0.062 with a mean (±SD) value of 0.044 ± 0.011 (see Materials and Methods). This large variability of the dN/dS ratio among the cas genes likely reflects their diverse functions during different stages of CRISPR-Cas immune response and possibly roles beyond the immune response, such as the apparent involvement of Cas1 in DNA repair (6).
The cas genes were classified into various groups on the basis of functional as well as structural features (Fig. 1) (33). The genes first were divided into four groups: (i) genes involved in the adaptation stage, (ii) genes involved in the interference stage, (iii) genes encoding CASCADE subunits, and (iv) genes encoding predicted transcription factors (regulation genes). The CASCADE gene group, the largest of the four groups, was further classified in two ways. The first classification included five gene groups: (i) genes for the large subunit of the CASCADE complex (also known as the CRISPR polymerase), (ii) genes for the small subunit of the CASCADE complex, and (iii to v) three groups of RAMP proteins, namely, the cas5, cas6, and cas7 families. In the second classification, the CASCADE subunits were partitioned into genes from type I CRISPR-Cas systems and genes from type III CRISPR-Cas systems (35). Finally, the RAMP genes (cas5, cas6, and cas7 groups) were reclassified according to the presence and absence of demonstrated or predicted enzymatic activity: demonstrated nucleases with a catalytic His (RAMP w/H*), predicted nucleases with a highly conserved His (RAMP w/H), and proteins without conserved His (RAMP w/o H). The cas9 gene was not included in any of these groups, because it shows no detectable similarity to any other cas gene (33).
These different groups of cas genes displayed significant variations in dN/dS ratios (Fig. 2). The adaptation genes, especially cas1 and cas2, had significantly lower dN/dS ratios than the genes in all other groups (Fig. 2A) (P < 1.4 × 10^-7 with the Mann-Whitney U test, P < 1.9 × 10^-4 with the t test, and P = 1.2 × 10^-6 with the Dunnett T3 test) (see Materials and Methods). These low dN/dS ratios indicate the slower evolution of protein sequences and accordingly stronger purifying selection (assuming that the effect of positive selection is negligible when considered on a whole-gene level). Therefore, this result indicates that cas1, cas2, and cas4 are under the strongest purifying selection of all cas genes. Although cas1, cas2, and cas4 appear to be involved in the adaptation stage of the CRISPR-Cas immune response, the finding of relatively strong purifying selection is poorly compatible with the hypothesis that these genes are directly involved in coevolution with infectious agents. Rather, there is a parallel between the universality of cas1 and cas2 in various types of CRISPR-Cas and stronger purifying selection exerted on them, in that both features, albeit from distinct angles and on different evolutionary scales, point to the strong evolutionary conservation of these genes (see the discussion of recombination below). This finding is compatible with the general trend of positive correlation between genes' propensity for loss and rate of sequence evolution (30).
The genes encoding the small subunit of the CASCADE complex had significantly greater dN/dS ratios than the cas5 and cas7 groups of the RAMP genes (P < 1.1 × 10^-3 with the Mann-Whitney U test, P < 8.8 × 10^-3 with the t test, and P = 0.044 with the Dunnett T3 test) and had the greatest median value among the five groups of the CASCADE group of cas genes (Fig. 2B). The elevated gene-wide dN/dS ratio might reflect positive selection on a subset of amino acid sites in the protein, and this interpretation seems to be compatible with the prediction that the small subunit recognizes PAMs in the foreign elements during the interference stage (33). However, given that this protein is small and mostly alpha-helical (33), an alternative and perhaps more plausible explanation is that the elevated dN/dS ratio reflects relaxed structural constraints (and hence weak purifying selection) on this protein. In addition, the genes encoding the large subunit of the CASCADE complex also had significantly greater dN/dS ratios than the cas5 and cas7 groups (P < 2.8 × 10^-5 with the Mann-Whitney U test, P < 1.5 × 10^-4 with the t test, and P = 0.021 with the Dunnett T3 test) (Fig. 2B). After removing the outliers with dN/dS ratios greater than 0.4, which all corresponded to csy1, the difference remained significant between the large subunit and the cas5 group (P = 1.1 × 10^-5 with the Mann-Whitney U test, P = 0.013 with the t test, and P = 9.2 × 10^-4 with the Dunnett T3 test). Taken together, these results suggest that the RAMP genes are under stronger gene-wide purifying selection than the non-RAMP components of the CASCADE complex (the small and large subunits).
The genes encoding the CASCADE subunits from type III CRISPR-Cas systems had significantly greater dN/dS ratios than the CASCADE genes from type I CRISPR-Cas systems (P < 3.9 × 10^-6 with the Mann-Whitney U test, and P < 1.8 × 10^-4 with the t test; the T3 procedure was not applicable) (Fig. 2C), suggesting that the CASCADE subunits are subject to significantly stronger purifying selection in type I systems than in type III systems. The biological underpinning of this difference is unclear. It might be relevant that type III CRISPR-Cas modules often co-occur with CRISPR-Cas modules of other types in prokaryotic genomes, and those that do co-occur in some cases lack cas1 and cas2, the two genes that are otherwise present in all CRISPR-Cas systems (33, 35). Moreover, in the phylogenetic tree of Cas1 proteins, type III systems appear as polyphyletic groups, in contrast to type I and type II systems, which appear as monophyletic groups (35). These findings suggest that type III systems often are horizontally transferred to genomes that already encode other types of CRISPR-Cas systems. Thus, there is a parallel between this enhanced mobility and the faster evolution of the CASCADE subunits in the type III CRISPR-Cas due to relaxed purifying selection.
The RAMP genes that encode predicted nucleases with a highly conserved histidine had significantly greater dN/dS ratios than the genes encoding presumably noncatalytic RAMPs that lack a conserved histidine (Fig. 2D) (P = 4.8 × 10^-6 with the Mann-Whitney U test, P = 4 × 10^-6 with the t test, and P = 5.6 × 10^-3 with the Dunnett T3 test). This difference persisted when the comparison was done between the noncatalytic RAMPs and experimentally characterized RAMP nucleases, although in this case the difference was only marginally significant (P = 0.05 with the Mann-Whitney U test, P = 0.10 with the t test, and P = 0.25 with the Dunnett T3 test). This finding appears counterintuitive, because in general enzymes would be expected to evolve under stronger purifying selection than homologous but inactive proteins. A possible interpretation that does not contradict this expectation is that the (predicted) catalytically active RAMPs are subject to positive selection in a small subset of their amino acid sites (see, however, the estimation of site-specific dN/dS ratios below).
The intensity of selection pressure exerted on the genomes of prokaryotes is highly variable among different groups of bacteria and archaea, as indicated by at least an order-of-magnitude variation in the medians of the genomic (gene by gene) distributions of the dN/dS ratios (47). This variability is thought to reflect the diverse environments prokaryotes inhabit and their variable effective population sizes. To reduce the variation in the dN/dS ratios attributable to such taxon-dependent biases, the dN/dS ratios of cas genes were scaled by the median of the dN/dS ratios of the respective genomes (see Materials and Methods). This scaling showed that the dN/dS ratios of cas genes generally were greater than the genomic median in the respective genera, with several notable exceptions of cas1 and cas2, the two most slowly evolving cas genes (Fig. 3). Thus, most cas genes evolve under relatively relaxed purifying selection and/or relatively intensified positive selection compared to that of the genomic average. These two causes for the elevated levels of dN/dS ratios are difficult to discriminate, because dN/dS ratios reflect the superposition of both effects. However, it is relevant to note that among the genes that display clear evidence of positive selection (i.e., dN/dS > 1), the majority are involved in immune processes or in evasion thereof (68). Thus, it is tempting to draw a parallel between the high dN/dS ratios that we detected for the cas genes and this general trend. Among the cas genes, the relatively strong purifying selection on cas1 and cas2 remains apparent even after removing the taxonomic biases, which is in agreement with the findings described above.
The scaled dN/dS ratios of cas genes did not show substantial variations among the taxonomic groups of bacteria and archaea (Fig. 4), indicating that the cas genes occupy similar places on the genomic spectrum of selection pressure, largely independent of the genomes to which the genes belong. Of particular note is the apparent absence of atypical values in the taxonomic group that includes E. coli and S. enterica. The spacer composition of CRISPRs in these species evolves slowly, remaining unchanged for 10^3 to 10^5 years, suggesting that in these bacteria the CRISPR-Cas system does not function as a typical immune system (60, 61). Although not in direct conflict with this hypothesis, the absence of atypical dN/dS values in the cas genes from these species reported here appears to be compatible neither with the interpretation that these genes are on the verge of degradation nor with the possibility that their functions are completely different from those of other CRISPR-Cas systems.
Given the high dN/dS ratios of the cas genes relative to that of the genomic median, we searched for evidence of positive selection in these genes through the estimation of site-specific dN/dS ratios (see Materials and Methods). The results, however, did not reveal any consistent evidence of positive selection (although statistically significant evidence of positive selection was detected in two clusters of cas9, the sites predicted to be under positive selection did not overlap between the two clusters; moreover, an additional analysis using Datamonkey did not detect any statistically significant evidence either). These observations suggest that, the immune functions of CRISPR-Cas notwithstanding, there are no sites under strong positive selection in cas genes. This result, however, should be taken with caution, because the size of the samples generally was small (e.g., the number of sequences in the two cas9 clusters was effectively four, the bare minimum required for statistical significance according to PAML). Thus, the further analysis of larger data sets is required to characterize potential effects of positive selection in cas genes.
Recombination signals were detected in 22 of the 4,130 clusters from 15 cas genes that belonged to various groups defined above: adaptation genes, interference genes, RAMPs, the large and small subunits of the CASCADE complex, and cas9 (Table 1). Notably, recombination was not predicted for cas1, although this ubiquitous gene represents the largest fraction of all cas genes in the analyzed data set (689 of 6,079).
Several studies have presented ample evidence of HGT in CRISPR-Cas modules, thus revealing the high evolutionary mobility of the cas genes (12, 19, 62). Moreover, there are indications that entire cas gene cassettes have been transferred between the identical loci associated with CRISPRs in the genomes of E. coli and S. enterica, suggesting that the genomic regions associated with CRISPRs are hot spots of recombination (61). Extending these results, our findings indicate that recombination also occurs in a wide variety of cas genes. The only major exception to this trend is the absence of evidence of microscale recombination in cas1. This finding parallels the other lines of evidence on the strong conservation of cas1 that is manifest both in the ubiquitous representation of this gene in CRISPR-Cas modules and in the low dN/dS ratios described above.
We report here that cas genes evolve under purifying selection that is typically much weaker than the median strength of purifying selection affecting genes in the respective genomes. The exceptions are the cas1 and cas2 genes, which evolve at levels of purifying selection close to the genomic median. Taken together with the evidence of frequent HGT in the cas genes reported previously and wide-spread microscale recombination in the genes described here, these findings reveal the dynamic evolution of cas genes. This conclusion is in line with the involvement of CRISPR-Cas in antiviral immunity that is likely to entail a coevolutionary arms race with rapidly evolving viruses. However, we failed to detect evidence of strong positive selection in any of the cas genes.
Additionally, two notable observations were made regarding the biological correlates of the selection intensity estimated for different cas genes. The genes that are implicated in the adaptation stage of the CRISPR-Cas process (spacer acquisition), in particular cas1 and cas2, were found to be subject to the strongest purifying selection among all cas genes. This finding is compatible with the (near) ubiquity of these genes in CRISPR-Cas systems and, in the case of cas1, with the absence of evidence of microscale recombination within this gene. These results do not seem to support the possibility that cas1 and cas2 are directly engaged in the coevolutionary arms race, although they are likely to physically interact with foreign genetic elements. A potentially important factor underlying the relatively strong purifying selection that affects cas1 and cas2 could be the additional involvement of these genes in processes distinct from the CRISPR-Cas-mediated immunity, as suggested by experiments implicating cas1 in DNA repair functions (6). Another notable observation is that the RAMPs containing a predicted catalytic histidine had higher dN/dS ratios (weaker purifying selection) than those observed for predicted noncatalytic RAMPs. This result is unexpected, because within the same protein family stronger purifying selection generally would be predicted for enzymatically active proteins. One interpretation of this observation is that the (predicted) catalytic RAMPs experience positive selection in a small subset of amino acid sites. Although the estimation of site-specific dN/dS ratios did not reveal convincing evidence of positive selection in these genes, further analysis of expanded data sets is required to clarify this issue.
We thank David Kristensen for help with data analysis and Igor Rogozin for helpful discussions.
This research was supported by the Intramural Research Program of the NIH, National Library of Medicine.
Published ahead of print 16 December 2011