The CRISPR–Cas (clustered regularly interspaced short palindromic repeats–CRISPR-associated proteins) modules are adaptive immunity systems that are present in many archaea and bacteria. These defence systems are encoded by operons that have an extraordinarily diverse architecture and a high rate of evolution for both the cas genes and the unique spacer content. Here, we provide an updated analysis of the evolutionary relationships between CRISPR–Cas systems and Cas proteins. Three major types of CRISPR–Cas system are delineated, with a further division into several subtypes and a few chimeric variants. Given the complexity of the genomic architectures and the extremely dynamic evolution of the CRISPR–Cas systems, a unified classification of these systems should be based on multiple criteria. Accordingly, we propose a `polythetic' classification that integrates the phylogenies of the most common cas genes, the sequence and organization of the CRISPR repeats and the architecture of the CRISPR–cas loci.
The CRISPR–Cas (clustered regularly interspaced short palindromic repeats–CRISPR-associated proteins) modules are adaptive immunity systems that are encoded by most archaea and many bacteria and that act against invading genetic elements1–6, such as viruses and plasmids (Supplementary information S1 (table)). Distinct arrays of short repeats interspersed with unique spacers have been recognized in bacterial and archaeal genomes for years, and although it was proposed that these repeat arrays could have an important common function7, the nature of that function has been elucidated only recently. Independently, Cas proteins that are encoded by putative operons adjacent to CRISPR sequences were analysed in detail with computational methods and found to contain domains that are characteristic of several nucleases, a helicase, a polymerase and various RNA-binding proteins8. It was initially speculated that these proteins constitute a novel DNA repair system9, but the observation that some of the unique CRISPR spacers are almost identical to fragments of virus and plasmid genes led to the hypothesis that CRISPR–Cas systems might be involved in defence against selfish elements10–12. On the basis of these findings and a comprehensive computational re-analysis of the Cas proteins13,14, a model was proposed14 that drew an analogy between the CRISPR–Cas system of archaea and bacteria and the RNA interference (RNAi) mechanisms of eukaryotes15. However, unlike the eukaryotic RNAi systems, the CRISPR–Cas system integrates a small piece of DNA derived from foreign nucleic acid into the CRISPR locus of the host genome as the first step in the series of events that leads to immunity against the invader14. 
The hypothesis that the CRISPR–Cas system plays a part in defence against invading DNA has been validated by the demonstration that integration of a short phage-specific sequence into the CRISPR locus of the lactic acid bacterium Streptococcus thermophilus conferred resistance to the cognate phage16. In these experiments, resistance to the phage was abrogated by as little as a single mismatch between the CRISPR insert (referred to as the spacer) and the target phage sequence16, although recent studies with archaeal CRISPR–Cas systems revealed a lower stringency of spacer–target complementarity17,18.
The CRISPR–Cas systems mediate immunity to invading genetic elements via a three-stage process — adaptation, expression and interference (FIG. 1) — that can be divided into two distinct, quasi-independent subsystems: the highly conserved `information processing' subsystem, which includes the adaptation stage, and the `executive' subsystem, which includes the expression and interference stages. Whereas the proteins involved in the information processing subsystem (Cas1 and Cas2) are likely to be highly conserved, the proteins of the executive subsystem vary greatly between different organisms1–3,6,19.
During the adaptation stage, short pieces of DNA homologous to virus or plasmid sequences are integrated into the CRISPR loci16,20,21. Viral challenge typically triggers insertion of a single virus-derived resistance-conferring spacer, with a characteristic length of approximately 30 bp, at the leader side of a CRISPR locus; acquisition of multiple spacers from the same phage is less frequent, as are internal insertions. Each integration event is accompanied by the duplication of a repeat and thus creates a new spacer-repeat unit. The selection of spacer precursors (proto-spacers) from the invading DNA appears to be determined by the recognition of proto-spacer-adjacent motifs (PAMs) (FIG. 1); PAMs are usually only several nucleotides long and differ between variants of the CRISPR–Cas system22,23. There is currently no direct evidence for a mechanism of spacer acquisition, although the most highly conserved Cas proteins, Cas1 and Cas2, are the prime candidates for proteins with key roles in this process16,24.
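The bookkeeping of an integration event, as described above, can be illustrated with a minimal sketch. It is not a model of the molecular mechanism, for which there is currently no direct evidence; it only records that a new spacer enters at the leader side and that each insertion adds one spacer–repeat unit. The class name `CrisprArray` and its methods are hypothetical, and the convention that an empty array holds a single repeat is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CrisprArray:
    """Toy CRISPR locus: leader, then alternating repeats and spacers."""
    repeat: str
    # Index 0 is the leader-proximal (most recently acquired) spacer.
    spacers: List[str] = field(default_factory=list)

    def acquire(self, protospacer: str) -> None:
        """Insert a new spacer at the leader side of the array.

        The accompanying repeat duplication is implicit: units() always
        emits one more repeat than there are spacers.
        """
        self.spacers.insert(0, protospacer)

    def units(self) -> List[str]:
        """Return the locus as repeat, spacer, repeat, ..., repeat."""
        out = [self.repeat]
        for spacer in self.spacers:
            out.extend([spacer, self.repeat])
        return out


# Usage: a locus with two older spacers acquires a third.
locus = CrisprArray(repeat="R", spacers=["old1", "old2"])
locus.acquire("new")
```

After the call, `locus.spacers[0]` is the newly acquired spacer and the array contains one more repeat than before, mirroring the statement that each integration creates a new spacer–repeat unit.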
The second stage in CRISPR–Cas-mediated immunity is expression (FIG. 1), during which the long primary transcript of a CRISPR locus (pre-crRNA) is generated and processed into short crRNAs. The processing step is catalysed by endoribonucleases that operate either as a subunit of a larger complex (such as the CRISPR-associated complex for antiviral defence (Cascade) in Escherichia coli) (FIG. 1) or as a single enzyme (such as Cas6 in the archaeon Pyrococcus furiosus). Recently, an intriguing variant was discovered in Streptococcus pyogenes in which a trans-encoded small RNA (tracrRNA) acts as a guide for the processing of pre-crRNA, which in this organism is catalysed by RNase III in the presence of Csn1 (also known as Cas9; see below)25. In the case of the Cascade complex of type I CRISPR–Cas systems24,26, the mature crRNA remains associated with the complex after the initial endonuclease cleavage (FIG. 1), whereas in P. furiosus the crRNA, processed by Cas6, is passed on to a distinct Cas protein complex (the Cascade complex of type III systems, Cmr-type; see below), where it is processed further at the 3′ end by unknown nucleases27–29.
The third step is interference (FIG. 1), during which the foreign DNA or RNA is targeted and cleaved within the proto-spacer sequence6,20,21. The crRNAs guide the respective complexes of Cas proteins, such as the E. coli Cascade complex, to the complementary virus or plasmid target sequences that match the spacers. In E. coli, the cleavage is probably catalysed by the HD endonuclease domain of the Cas3 protein24. The PAMs also seem to play an important part in the interference process23,30. In S. thermophilus and E. coli, targeting either strand of the phage DNA confers immunity to the cognate phage, an observation that is best compatible with DNA being the target16,24,26. Furthermore, insertion of a self-splicing intron into the proto-spacer sequence of the target gene renders the corresponding plasmid resistant to CRISPR-mediated immunity in Staphylococcus epidermidis, indicating that it is the invading DNA rather than the corresponding mRNA that is targeted in this species31. In addition, the hyperthermophilic archaeon Sulfolobus solfataricus targets the DNA of Sulfolobus spindle-shaped virus 1 (SSV1), as CRISPR-mediated immunity does not depend on transcription of the target gene18. However, in vitro experiments with the CRISPR–Cas system from P. furiosus showed that in this species the crRNA targets the foreign mRNA instead28. These findings emphasize the remarkable mechanistic and functional diversity of CRISPR–Cas systems, although the full range of their activities remains to be determined. Various Cas proteins might participate in either one stage or multiple stages of CRISPR–Cas system action, most probably as protein complexes6.
Several Cas proteins have been shown to possess RNase and/or DNase activity, often in agreement with the bioinformatic predictions. This includes the two universal core Cas proteins: Cas1, a metal-dependent DNase that has no sequence specificity and has been proposed to be involved in the integration of the spacer DNA into the CRISPR cassette32, and Cas2, a metal-dependent endoribonuclease for which the role in the CRISPR–Cas mechanism remains unclear33. Repeat-associated mysterious proteins (RAMPs) (see below), which form a large superfamily of Cas proteins, contain at least one RNA recognition motif (RRM; also known as a ferredoxin-fold domain) and a characteristic glycine-rich loop14. Some of the RAMPs have been shown to possess sequence- or structure-specific RNase activity that is involved in the processing of pre-crRNA transcripts24,26,27.
Extensive bioinformatic analyses have shown that the genomes of various CRISPR-containing organisms encode approximately 65 distinct sets of orthologous Cas proteins, which can be classified into 23–45 families, depending on the classification criteria13,14. Furthermore, eight distinct subtypes of the CRISPR–Cas system (CASS1–CASS8) have been delineated on the basis of the composition and architecture of the cas operons and on Cas1 phylogeny13,14.
The diversity of CRISPR–Cas systems identified in newly sequenced genomes is rapidly increasing1,4 — in a representative set of 703 archaeal and bacterial genomes, 310 (44%) encode one or more CRISPR–Cas modules (TABLE 1; Supplementary Information S1 (table)) — hence, an urgent need exists for a unified classification and nomenclature of the cas genes. In this Opinion article, we summarize the shortcomings of the existing classifications and nomenclature of the CRISPR–Cas systems and propose a new, `polythetic' classification that combines information from phylogenetic and comparative genomic analyses.
The original, widely used classification proposed in 2005 by Haft et al. was based on an analysis of 40 bacterial and archaeal genomes, the topology of the Cas1 phylogenetic tree, and generalized cas operon organizations typified by the CRISPR–Cas systems that are present in eight genomes13. The names of four core cas genes were adopted as originally proposed by Jansen et al. in 2002 (REF. 8). Two other core genes, cas5 and cas6, were then added using the same principle, and names for genes encoding proteins specific to each of the eight CRISPR systems were proposed13. For example, the unique genes found in the E. coli system were denoted cse1 (CRISPR system of E. coli gene number 1), cse2, cse3, cse4 and cas5e (elsewhere, these E. coli genes were also labelled casA, casB, casE, casC and casD, respectively, which added to the confusion)24.
Although the original approach13 offered attractive simplicity, it did not take into account the distant relationships that have been shown to exist between many Cas proteins. For example, the proteins of COG1857 (see the clusters of orthologous groups of proteins (COGs) database34), which are present in the majority of CRISPR–Cas systems and are clearly orthologous14, have been given at least five different names: Cse4, Csd2, Csh2, Cst2 and Csa2 (TABLE 2). Furthermore, the currently used classification does not account for the complexity of the evolutionary relationships between the CRISPR–Cas systems in diverse bacteria and archaea. For example, the Ecoli and Ypest systems (named after E. coli str. K12 sub-str. MG1655 and various strains of Yersinia pestis, in which they are the only CRISPR–Cas systems found) are clearly related, as indicated by the similarity of their operon organizations, the absence of cas4 and the phylogenetic clustering of Cas1, whereas the Apern, Tneap–Hmari and Dvulg systems (the only systems found in Aeropyrum pernix, Thermotoga neapolitana DSM 4359 and Haloarcula marismortui str. ATCC 43049, and Desulfovibrio vulgaris str. Hildenborough, respectively) are also related, as they share a common gene of the BH0338 family14. Conversely, extensive recombination within CRISPR–Cas operons has resulted in hybrid systems that cannot be assigned to any of the proposed groups despite the fact that they contain typical cas genes. The linkage between CRISPR–Cas groups and particular organisms can be misleading owing to the presence of multiple CRISPR–Cas systems in the same genome, the presence of different systems in different strains of a single species and the occurrence of hybrid systems.
The inconsistencies between the nomenclature of the CRISPR–Cas systems and the names of Cas proteins are rapidly growing. In particular, many of these proteins are currently classified into families that do not have systematic names pointing to their involvement with a CRISPR–Cas system (such as the BH0338 family, the CXXC-CXXC family and the GSU0053 family, among many others).
Taken together, these problems substantially complicate the use of the current classification and nomenclature of CRISPR–Cas systems and motivate the effort behind the creation of a new, unifying, internally consistent and flexible classification scheme.
Here, we propose a new, polythetic classification of CRISPR–Cas systems in which the cas1 and cas2 genes constitute the core of three distinct types of system (FIG. 2; TABLE 2). Cas1 and Cas2 are present in all CRISPR–Cas systems that are predicted to be active, and are thought to be the information-processing subsystem that is involved in spacer integration during the adaptation stage.
Typical type I loci contain the cas3 gene, which encodes a large protein with separate helicase and DNase activities35, in addition to genes encoding proteins that probably form Cascade-like complexes with different compositions24,26. These complexes contain numerous proteins that have been included in the RAMP superfamily, which encompasses the large Cas5 and Cas6 families, on the basis of extensive sequence and structure comparisons14 (see TABLE 2 for the available structures). Furthermore, the Cas7 (COG1857) proteins represent another distinct, large family within the RAMP superfamily, as detected by the HHPred method, which can detect distant sequence and structure similarities between proteins36 (Supplementary Information S2 (figure)). In addition, the complexes involved in the CRISPR–Cas function may contain large proteins such as Cse1 and BH0338-like families, as well as small α-helical proteins such as Cse2, or other, less conserved subunits.
In the Cascade complex, a RAMP protein with RNA endonuclease activity has been identified as the main enzyme that catalyses the processing of the long spacer–repeat-containing transcript into a mature crRNA24,26. In most cases, the catalytic RAMP proteins (Cas6, Cas6e and Cas6f; see TABLE 2) do not belong to the most prevalent Cas5 or Cas7 families of RAMPs and are often encoded in the periphery of the respective operon. However, the subtype I-C system (also known as Dvulg or CASS1) (FIG. 2; TABLE 2) might be an exception in which either Cas5 or Cas7 possesses RNase activity. The type I CRISPR–Cas systems seem to target DNA; target cleavage is catalysed by the HD nuclease domains of Cas3 (REF. 35). As the RecB nuclease domain of Cas4 is fused to Cas1 in several type I CRISPR–Cas systems, Cas4 could potentially play a part in spacer acquisition instead.
The type II systems include the `HNH'-type system (Streptococcus-like; also known as the Nmeni subtype, for Neisseria meningitidis serogroup A str. Z2491, or CASS4), in which Cas9, a single, very large protein, seems to be sufficient for generating crRNA and cleaving the target DNA, in addition to the ubiquitous Cas1 and Cas2. Cas9 contains at least two nuclease domains, a RuvC-like nuclease domain near the amino terminus and the HNH (or McrA-like) nuclease domain in the middle of the protein, but the function of these domains remains to be elucidated. However, as the HNH nuclease domain is abundant in restriction enzymes and possesses endonuclease activity37,38, it is likely to be responsible for target cleavage. Furthermore, for the S. thermophilus type II CRISPR–Cas system, targeting of plasmid and phage DNA has been demonstrated in vivo20 and inactivation of Cas9 has been shown to abolish interference16.
Type II systems cleave the pre-crRNA through an unusual mechanism that involves duplex formation between a tracrRNA and part of the repeat in the pre-crRNA; the first cleavage in the pre-crRNA processing pathway subsequently occurs in this repeat region. This cleavage is catalysed by the housekeeping, double-stranded RNA-specific RNase III in the presence of Cas925.
The type III CRISPR–Cas systems contain polymerase and RAMP modules in which at least some of the RAMPs seem to be involved in the processing of the spacer–repeat transcripts, analogous to the Cascade complex. Type III systems can be further divided into subtypes III-A (also known as Mtube or CASS6) and III-B (also known as the polymerase–RAMP module). Subtype III-A systems can target plasmids, as has been demonstrated in vivo for S. epidermidis31, and it seems plausible that the HD domain of the polymerase-like protein encoded in this subtype (COG1353) might be involved in the cleavage of target DNA. There is strong evidence that, at least in vitro, the type III-B CRISPR–Cas systems can target RNA, as shown with a subtype III-B system from P. furiosus28. It is intriguing that these two type III systems seem to target different nucleic acids, and this finding will require further study.
The only identified ribonucleases in the type III CRISPR–Cas systems, apart from the universal Cas2 protein, are RAMP proteins. Type III systems include at least two RAMPs in addition to Cas6, which is involved in CRISPR transcript processing. In many organisms, type III CRISPR–cas operons lack the cas1–cas2 gene pair; in all these cases, an additional CRISPR locus (of either type I or type II) is also present in the respective genome, indicating that Cas1 and Cas2 are probably provided in trans. In other organisms, the polymerase–RAMP modules are present in a single operon with cas1 and cas2, forming a module with the typical architecture in S. epidermidis and Mycobacterium tuberculosis (a type III-A module) and forming a distinct version in Halorhodospira halophila (a type III-B module). In these organisms, the type III operon is the only CRISPR–cas locus, suggesting that the polymerase–RAMP module forms a fully functional, autonomous type III system when combined with Cas1 and Cas2, which are likely to be involved in the incorporation of new spacers.
Most of the CRISPR–cas loci can be readily classified into the proposed three types and their subtypes according to the presence of type-specific and subtype-specific signature genes (TABLE 2). However, for the loci that cannot be classified even at the type level, such as the CRISPR–Cas system in Acidithiobacillus ferrooxidans str. ATCC 23270 (discussed further below), we propose the name type U.
The three types of CRISPR systems show a distinctly non-uniform distribution among the major lineages of the Archaea and the Bacteria (TABLE 1). In particular, the type II systems have been found exclusively in the Bacteria so far, whereas type III systems are more common in the Archaea. The previously observed trend of over-representation of CRISPR in the Archaea compared to the Bacteria still holds14,39 (TABLE 1). Moreover, the majority of archaeal genomes carry more than one CRISPR–Cas system; typically, different modules within the same genome are unrelated.
On the basis of the gene composition and architecture of the respective cas operons, the three basic types of CRISPR–Cas system can be further classified into subtypes that largely agree with the previously delineated variants13,14. Each of the subtypes contains a signature gene or genes that are represented almost exclusively in the given subtype and can be used to identify the subtype (FIG. 2; TABLE 2). To facilitate classification, a single signature gene was chosen for each subtype: in cases with several candidates, the longest gene was selected, as longer genes are typically more easily detectable in sequence searches than shorter genes. In addition, we introduce subtypes I-U, II-U and III-U for systems that lack currently defined subtype-specific signature genes but either might fit one of the established subtypes on the basis of further structure and sequence analysis, or potentially could become founders of new subtypes.
The ubiquitous, highly conserved Cas1 protein can be used as a scaffold to investigate the evolution of the CRISPR–Cas system (the other universal protein, Cas2, is too small to yield a well resolved tree). The phylogenetic tree of Cas1 includes several well-resolved branches that generally agree with the classification of CRISPR–Cas systems into subtypes I-A (Apern or CASS5), I-B (Tneap–Hmari or CASS7), I-C (Dvulg or CASS1), I-E (Ecoli or CASS2), I-F (Ypest or CASS3) and III-A (Mtube or CASS6), and type II (Nmeni or CASS4)14, with a few notable exceptions (FIG. 3; see Supplementary information S3 (box) for data to construct the complete tree). In particular, Cas1 proteins associated with the polymerase–RAMP module (the type III systems) appear in several unrelated positions in the tree (FIG. 3), suggesting that this module can operate with a variety of cas1 and cas2 genes both in cis and in trans.
The CRISPR repeats can be classified into at least 12 groups on the basis of sequence similarity40. Four groups of CRISPR repeats clearly correspond to distinct CRISPR–Cas subtypes: group 2 corresponds to subtype I-E systems, group 3 corresponds to subtype I-C systems, group 4 corresponds to subtype I-F systems and group 10 corresponds to type II systems. These four variants of CRISPR–Cas systems have the most stable operon organizations; by contrast, subtypes I-A, I-B and I-D and type III systems seem to be prone to recombination between different types and subtypes. Structural characteristics of the CRISPR repeats of these four groups could potentially be used for classification, in addition to phylogenetic data and signature genes. The other eight groups of repeats cannot be unequivocally associated with particular CRISPR–Cas system subtypes.
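The correspondences stated above, between the proposed labels, the earlier organism-derived and CASS names, and the four repeat groups that map cleanly onto a subtype or type, can be collected into small lookup tables. The sketch below is illustrative only: the function names are hypothetical, and only the pairings given in the text are included, so any other name or repeat group returns `None` rather than a guess.

```python
from typing import Optional

# Proposed label -> (organism-derived name, CASS name), as listed in the text.
LEGACY_NAMES = {
    "I-A": ("Apern", "CASS5"),
    "I-B": ("Tneap-Hmari", "CASS7"),
    "I-C": ("Dvulg", "CASS1"),
    "I-E": ("Ecoli", "CASS2"),
    "I-F": ("Ypest", "CASS3"),
    "II": ("Nmeni", "CASS4"),
    "III-A": ("Mtube", "CASS6"),
}

# Repeat sequence group -> label, for the four unambiguous groups only.
REPEAT_GROUP_TO_LABEL = {2: "I-E", 3: "I-C", 4: "I-F", 10: "II"}


def label_from_legacy(name: str) -> Optional[str]:
    """Translate an earlier system name into the proposed label, if listed."""
    # Accept either an en dash or a hyphen in 'Tneap-Hmari'.
    key = name.strip().replace("\u2013", "-").lower()
    for label, aliases in LEGACY_NAMES.items():
        if key in (alias.lower() for alias in aliases):
            return label
    return None


def label_from_repeat_group(group: int) -> Optional[str]:
    """Return the label for a repeat group, or None if it is ambiguous."""
    return REPEAT_GROUP_TO_LABEL.get(group)
```

Returning `None` for the remaining eight repeat groups reflects the statement that they cannot be unequivocally associated with a particular subtype.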
Integration of all the above considerations into a dendrogram reflects our present understanding of the evolutionary history of CRISPR–Cas systems (FIG. 2). Subtypes of the type I system are grouped according to their operon organizations and the phylogeny of the respective Cas1 proteins.
We propose to retain the well-established names for core genes of the CRISPR–Cas systems: the ubiquitous cas1 and cas2 (found in all three types), cas3 (type I), cas4 (types I and II), cas5 (type I) and cas6 (types I and III). In the cases for which orthology can be confidently traced, we extend the usage of these six cas gene names; for example, cmx5 of subtype I-C is renamed cas5, and cmx6 is renamed cas6. In cases for which significant sequence similarity between Cas proteins is observed but orthologous relationships cannot be definitively assigned, a letter derived from the subtype label is added; hence, cse3 and csy4 in the former nomenclature become cas6e and cas6f, respectively, as they are likely to be extremely divergent derivatives of cas6 (TABLE 2).
In type I systems, there are two additional genes for which orthology is readily detectable between different subtypes. We refer to these genes as cas7 and cas8 (which can be further divided into cas8a, cas8b and cas8c); both encode subunits of the Cascade complex (TABLE 2). The cas8a, cas8b and cas8c genes are the signature genes for subtypes I-A, I-B and I-C, respectively. In type II and type III systems, the respective signature genes are designated cas9 and cas10 (formerly cmr2, csm1 and csx11).
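How signature genes could drive a first-pass assignment can be sketched as follows. This is a deliberately incomplete illustration, not the classification procedure itself: it uses only the signature genes named in the text (cas8a, cas8b and cas8c for subtypes I-A, I-B and I-C; cas9 for type II; cas10 for type III), treats cas3 as a type I marker because typical type I loci contain it, and treats co-occurrence of cas3 and cas10 in one locus as the hybrid subtype I-D. The latter two rules are simplifying assumptions, and the function name `classify_locus` is hypothetical. Subtypes whose signature genes appear only in TABLE 2 are left unresolved (`None`).

```python
from typing import Iterable, Optional, Tuple

# Subtype signature genes stated in the text; others are listed in TABLE 2.
SUBTYPE_SIGNATURES = {"cas8a": "I-A", "cas8b": "I-B", "cas8c": "I-C"}


def classify_locus(genes: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Return (type, subtype) for the genes of a single CRISPR-cas locus.

    The subtype is None when the genes given here do not resolve it.
    Loci with no recognized signature, or with conflicting ones, are 'U'.
    """
    present = {g.strip().lower() for g in genes}
    has_i = "cas3" in present or any(g in present for g in SUBTYPE_SIGNATURES)
    has_ii = "cas9" in present
    has_iii = "cas10" in present

    if has_ii and not has_i and not has_iii:
        return ("II", None)
    # Assumption: cas3 together with cas10 marks the hybrid subtype I-D.
    if "cas3" in present and has_iii and not has_ii:
        return ("I", "I-D")
    if has_i and not has_ii and not has_iii:
        for gene, subtype in SUBTYPE_SIGNATURES.items():
            if gene in present:
                return ("I", subtype)
        return ("I", None)
    if has_iii and not has_i and not has_ii:
        return ("III", None)
    return ("U", None)
```

For example, `classify_locus(["cas1", "cas2", "cas3", "cas8c"])` yields `("I", "I-C")`, whereas a locus carrying only genes such as csf1 to csf4 falls through to type U. A single-criterion rule of this kind is exactly what a polythetic scheme is meant to supplement with Cas1 phylogeny and operon architecture.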
When a gene is clearly a fusion or fission of established genes, we propose an ad hoc nomenclature indicating the relationship of this variant to the `canonical' forms. Thus, cas2–cas3 in subtype I-F systems is a fusion of cas2 and cas3, whereas cas3′ and cas3″ denote the genes that encode only the helicase domain or only the HD domain of Cas3, respectively.
For less common genes that have been named previously13, the `legacy' nomenclature can be retained. As the Cas protein sequences are highly diverged, it is expected that, with the increasing representation of sequences and structures, many of these genes will eventually be incorporated into existing families. We propose to continue assigning further `numerical' names to newly merged cas gene families in the future (such as cas11, cas12, and so on).
For the remaining CRISPR-associated genes, we propose to assign interim gene names (csx1, in which `x' indicates an unclassified family), with an indication of the family or superfamily where known (such as csx1, COG1517 family, or csx10, RAMP superfamily).
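The renaming rules proposed above lend themselves to a simple translation table for annotation pipelines. The sketch below includes only the renamings and aliases stated in this article (the E. coli casA to casE aliases, the five earlier names of the COG1857 family now called cas7, and the genes merged into cas5, cas6, cas6e, cas6f and cas10); the function name `proposed_name` is hypothetical, and names not listed are returned unchanged apart from lower-casing.

```python
# Alternative E. coli labels -> names in the earlier nomenclature.
ECOLI_ALIASES = {
    "casa": "cse1",
    "casb": "cse2",
    "case": "cse3",
    "casc": "cse4",
    "casd": "cas5e",
}

# Earlier names -> names proposed in this article.
RENAMED = {
    "cmx5": "cas5",
    "cmx6": "cas6",
    "cse3": "cas6e",
    "csy4": "cas6f",
    "cmr2": "cas10",
    "csm1": "cas10",
    "csx11": "cas10",
    # COG1857 family members, unified as cas7.
    "cse4": "cas7",
    "csd2": "cas7",
    "csh2": "cas7",
    "cst2": "cas7",
    "csa2": "cas7",
}


def proposed_name(name: str) -> str:
    """Map a legacy cas gene name to the proposed name (lower-case)."""
    key = name.strip().lower()
    key = ECOLI_ALIASES.get(key, key)  # resolve E. coli aliases first
    return RENAMED.get(key, key)       # unlisted names pass through
```

Resolving the alias before the renaming means that `proposed_name("casE")` returns `"cas6e"` by way of cse3, which keeps the two tables independent and easy to extend as further families are merged.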
The phylogenetic tree of Cas1 reproduces most of the previously established groups fairly well, with the exception of the type III systems (FIG. 3). However, for the deep branches, assigning a subtype can be problematic. In many cases, detailed analysis of the gene orders reveals a more complicated picture with different arrangements of cas genes in the operons, potentially owing to frequent horizontal gene transfer and recombination involving the CRISPR–cas loci. In particular, a notable recombinant CRISPR–Cas system is present in approximately 30 archaeal and bacterial genomes, including cyanobacteria (such as the region spanning the loci slr7010–ssr7072 in Synechocystis sp. PCC 6803). In this CRISPR–Cas system, the type I-C system has combined with a distinct type III gene arrangement encoding the polymerase–RAMP module, containing cas3, cas10 (which is predicted to be an inactivated polymerase with an HD domain), csc2 (from the COG1337 family and the RAMP superfamily), csc1 (from the RAMP superfamily), cas6, cas4, cas1 and cas2. This hybrid system thus contains signature genes for both type I and type III systems. As it is likely to be functional, we have classified it as subtype I-D (FIG. 2).
Another interesting CRISPR–Cas system, typified by A. ferrooxidans str. ATCC 23270 (loci AFE_1037-AFE_1040), has been detected in only four genomes to date. This CRISPR–cas locus seems to possess a distinct gene content and could potentially contribute to our understanding of the functions and evolution of CRISPR–Cas systems in general. This system contains neither of the two ubiquitous core genes (cas1 or cas2) nor any other signature genes of the three CRISPR–Cas types or the ten subtypes. The A. ferrooxidans system consists of four genes denoted csf1, csf2, csf3 and csf4 (TIGRFAMs entries TIGR03114, TIGR03115, TIGR03116 and TIGR03117, respectively), which encode a Zn-finger domain-containing protein, a protein containing two RAMP domains, another distinct RAMP protein and a DinG-like helicase of the XPD family, respectively39. According to the CRISPRdb database41, a CRISPR array is present in the vicinity of these four genes in all of the respective genomes, although the architecture of these arrays is unique in each genome. Thus, this system might function in conjunction with different CRISPR arrays and does not require a distinct repeat signature. Indeed, three of the four genomes containing this system possess cas1 and cas2 genes that are located in other parts of the genome and are associated with type I CRISPR–Cas systems. It remains unclear whether this is a self-sufficient system or rather a defective system that captures and utilizes pre-existing CRISPR arrays that are generated by other, Cas1-containing CRISPR–Cas systems. More data are needed to classify this novel system as a separate CRISPR–Cas type, but this finding illustrates the diversity of CRISPR–Cas systems and the challenges that are associated with their classification.
Many cas genes, in particular genes that encode RAMP proteins, seem to evolve at exceptionally high rates. CRISPR–Cas systems can contain genes that encode highly divergent proteins which may not fall into a known Cas protein family after the structure is solved. For such genes and proteins, family assignment is extremely complicated. For example, a CRISPR system very similar to subtype I-F, as determined by Cas1 similarity, is present in Photobacterium profundum and several other bacteria. This system includes two proteins, PBPRB1992 and PBPRB1993, that show no significant sequence similarity to any Cas proteins. However, analyses of the sequence motifs that are conserved in these proteins, the predicted secondary structure of the proteins, and the length and position of the corresponding genes in the operon strongly suggest that they belong to the Cas7 and Cas5 families of RAMPs, respectively. Another example is the CRISPR–Cas system of Geobacter sulfurreducens: according to the phylogeny of Cas1, this system should be assigned to subtype I-C. The operon for this system encodes three uncharacterized proteins, GSU0052, GSU0053 and GSU0054; the last two of these proteins contain several motifs that are similar to the characteristic motifs of the RAMP superfamily and thus might be RAMP homologues (TABLE 2). However, none of these proteins could be linked to known Cas families, even using the most sensitive of the available methods for the detection of remote sequence similarity36,42,43. Therefore, only a comparison of the solved structures might shed light on the relationships of these and other highly diverged Cas proteins with known Cas families. In such cases, assignment of new gene names seems to be premature because these proteins are likely to eventually assume already existing names. Therefore, it is proposed that these genes are given temporary csx names.
Many CRISPR–cas loci belong to `islands' that contain various `high-mobility' genes such as toxins–antitoxins, transposases and components of other defence systems44. Some of these genes can be erroneously linked to CRISPR–Cas systems, so caution should be exercised in the classification and naming of genes as cas or even csx before functional connections with CRISPR–Cas systems are convincingly established.
An additional challenge to the nomenclature is presented by the variable domain architectures of some of the Cas proteins, including the domain fusions and fissions discussed above for Cas3. Other notable fusions include the fusion of cas2 and cas3 (in subtype I-F systems), of cas1 and cas4 (such as is found in GSU0057 from G. sulfurreducens), of cas1 and a DEDDh family exonuclease (for example, LBUL_0800 from Lactobacillus delbrueckii subsp. bulgaricus) and of cas1 and a reverse transcriptase (for example, VVA1544 from Vibrio vulnificus).
In several genomes, homologues of some cas genes also appear in contexts other than CRISPR–Cas systems. These proteins might represent distinct antivirus defence systems or components thereof, or they could be involved in other functions such as DNA repair. The latter possibility is emphasized by the recent demonstration that cas1 mutants of E. coli have DNA repair-deficient phenotypes45. Homologues of Cas proteins that probably function in processes other than adaptive immunity include RAMPs of the COG5551 subfamily and the COG1517, COG1468 and COG3513 families. In cases such as these, classification and labelling of the genes as cas should be avoided.
The CRISPR arrays contain few stop codons and, accordingly, are often erroneously translated into hypothetical proteins. Unfortunately, these artefacts then enter the databases and tend to be amplified during the analysis of new genomes, so there are currently at least two Pfam entries that consist of non-existent `pseudo-Cas proteins' (PF11194 and PF11664). Care should be taken during the annotation of new genome sequences to avoid further proliferation of such errors.
Given the complexity and the highly dynamic mode of evolution of the CRISPR–Cas systems, it would be counterproductive to attempt classification on the basis of any single criterion — for instance, the phylogeny of Cas1. Thus, we propose a polythetic classification that integrates the phylogenies of the conserved cas genes, the sequences of and structural similarities between other Cas proteins, and the composition and organization of the known and putative operons. It should be emphasized that a robust family classification of the Cas proteins, many of which diverge rapidly, is not only a matter of convenient description but also a basis for experimental validation of the respective functional predictions. Therefore, it is important that this classification be continuously updated and revised when necessary, using new sequence and structure information combined with state-of-the-art computational methods. The classification described here is available at the NCBI CRISPR/Cas website, along with tools for the identification of Cas proteins. In the future, a fine-grained classification of the CRISPR–Cas systems should become feasible on the basis of phylogenies and structures of Cas proteins, the operon organizations of cas genes and the architectures of CRISPR repeats.
The authors thank M. Terns for critical reading of the manuscript and useful discussions. K.S.M., Y.I.W. and E.V.K. are supported by the intramural funds of the US Department of Health and Human Services (National Library of Medicine); D.H.H. is supported by a US National Institutes of Health grant (1 R01 HG004881); E.C. acknowledges funding from Umeå University, Sweden, and the Swedish Research Council; S.M. acknowledges funding from the Natural Sciences and Engineering Research Council of Canada (the Discovery programme); F.J.M.M. acknowledges support from the University of Alicante, Spain (Vicerrectorado de Investigacion, and Desarrollo e Innovacion), for the use of its research technical services; A.F.Y. is supported by the Government of Canada through Genome Canada and the Ontario Genomics Institute (grant 2009-OGI-ABC-1405); S.J.B. and J.O. are supported by Veni and TOP grants from the Netherlands Organization for Scientific Research (NWO).
Competing interests statement: The authors declare no competing financial interests.