To obtain a set of CRISPR arrays we ran the PILER-CR program [14] on the 439 bacterial and archaeal genomes currently available in IMG version 1.50 [15]. We found 561 arrays, ranging in size from 3 to 220 repeats, in 195 genomes (44% of the genomes tested). These results agree with those of Godde et al. [7], who found CRISPR arrays in 40% of the genomes they tested. Overall, our set of CRISPRs contained 561 repeat sequences (one per array, as repeats are generally identical within an array) and 13,372 spacers.
Repeats were first noticed to be palindromic by Mojica et al. [16], a feature that was subsequently incorporated into the acronym CRISPR [2]. We hypothesized that the palindromic signature might be indicative of a functional RNA secondary structure within the repeat. This hypothesis is supported by the experimental demonstration that CRISPRs are transcribed and processed into non-messenger RNAs in several Archaea [17], indicating that they are active through an RNA intermediate.
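To make the notion of a palindromic repeat concrete, the sketch below searches a DNA sequence for its longest perfect inverted repeat, that is, two arms that are reverse complements of each other and could in principle pair as a stem around a loop. It is illustrative only: the function names, the minimum loop size of 3 and the example sequence are assumptions introduced here, not part of the analysis described in the text.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
WATSON_CRICK = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def longest_inverted_repeat(seq: str, min_loop: int = 3) -> tuple[int, int, int]:
    """Return (stem_length, left_start, right_end) for the longest perfect
    inverted repeat whose two arms are separated by at least `min_loop` bases.
    Positions are 0-based; (0, 0, 0) means no stem was found."""
    seq = seq.upper()
    best = (0, 0, 0)
    for i in range(len(seq)):
        for j in range(i + min_loop + 1, len(seq)):
            k = 0
            # Pair i with j, i+1 with j-1, ... while the loop stays long enough.
            while i + k < j - k - min_loop and (seq[i + k], seq[j - k]) in WATSON_CRICK:
                k += 1
            if k > best[0]:
                best = (k, i, j)
    return best


if __name__ == "__main__":
    # Hypothetical repeat: a 5-base arm, a 4-base loop, and the arm's reverse complement.
    arm = "GCCGC"
    toy_repeat = "AA" + arm + "AAAA" + reverse_complement(arm) + "AA"
    print(toy_repeat, longest_inverted_repeat(toy_repeat))
```

A perfect inverted repeat is only a rough proxy for RNA structure, since it ignores G:U pairs, mismatches and thermodynamic stability.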
To assess the possibility that CRISPR repeats form stable RNA secondary structures, we used the RNAfold software [18] (see Materials and methods) to predict the intramolecular RNA structure for each of the repeats in our set. This software provides a bit-score that reflects the stability of each secondary structure. We compared the stability of the predicted secondary structure of repeats and spacers to that of similarly sized sequences selected randomly from bacterial genomes (Figure ). We found that the folding-score distribution of repeats deviates from the scores for random sequences, indicating a tendency of repeats to form stable secondary structure.
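The comparison against randomly selected, similarly sized genomic sequences can be sketched as follows. This is not RNAfold: as a loudly simplified stand-in, the sketch scores each sequence by the maximum number of nested base pairs it can form (a Nussinov-style dynamic programme), which ignores the thermodynamic energy model that RNAfold uses. The function names, the minimum loop length and the toy genome are all assumptions made for illustration.

```python
import random

# Canonical pairs plus the G:U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs in `seq` (Nussinov-style DP).
    A crude structure score; NOT a free-energy calculation."""
    rna = seq.upper().replace("T", "U")
    n = len(rna)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i + 1][j]  # case 1: base i stays unpaired
            for k in range(i + min_loop + 1, j + 1):  # case 2: i pairs with k
                if (rna[i], rna[k]) in PAIRS:
                    inside = dp[i + 1][k - 1]
                    outside = dp[k + 1][j] if k < j else 0
                    best = max(best, 1 + inside + outside)
            dp[i][j] = best
    return dp[0][n - 1]


def random_windows(genome: str, length: int, count: int, rng: random.Random) -> list[str]:
    """Sample `count` windows of `length` bases from random genome positions."""
    starts = (rng.randrange(len(genome) - length + 1) for _ in range(count))
    return [genome[s:s + length] for s in starts]


if __name__ == "__main__":
    rng = random.Random(0)
    toy_genome = "".join(rng.choice("ACGT") for _ in range(5000))  # hypothetical genome
    toy_repeat = "AAGCCGCAAAAGCGGCAA"                              # hypothetical repeat
    background = [max_base_pairs(w) for w in random_windows(toy_genome, len(toy_repeat), 200, rng)]
    print("repeat score:", max_base_pairs(toy_repeat))
    print("mean background score:", sum(background) / len(background))
```

Matching the window length to the repeat length matters because any pair-count or energy score grows with sequence length; comparing distributions rather than single values is what allows a shift in the repeat set to be seen.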
The trimodal pattern of the RNA folding distribution for CRISPR repeats (Figure ) suggests that they are not homogeneous, and that a large subset form stable secondary structures, in contrast to spacers and random sequences. To identify repeat subtypes we first attempted to align each of the 561 repeats in our set to all other repeats using the Smith-Waterman algorithm [19]. The sequence similarity results were then clustered using the MCL algorithm [20] (see Materials and methods). This procedure generated 33 clusters, 12 of which contained 10 or more members, with the largest cluster (cluster 1) containing 94 repeat sequences. Some clusters contained repeats from organisms as distantly related as Archaea and Bacteria, supporting the inference that CRISPR/CAS systems can be horizontally transferred between microorganisms [5-7].
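The two-step procedure, all-against-all local alignment followed by graph clustering, can be sketched in a few dozen lines. The code below is a minimal pure-Python illustration, not the implementation used in this work: the scoring parameters (match 2, mismatch -1, gap -2), the score threshold, the inflation value and the self-loop weighting are all assumptions, and the MCL routine is a bare-bones version of the expansion/inflation iteration without the pruning used by production software.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a vs b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


def similarity_matrix(repeats, min_score=20):
    """Symmetric matrix of pairwise scores; scores below `min_score` become 0."""
    n = len(repeats)
    sim = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            score = smith_waterman(repeats[i], repeats[j])
            if score >= min_score:
                sim[i][j] = sim[j][i] = score
    return sim


def normalise_columns(m):
    """Scale every column of a square matrix to sum to 1."""
    n = len(m)
    totals = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / totals[j] for j in range(n)] for i in range(n)]


def mcl(sim, inflation=2.0, iterations=50):
    """Simplified Markov clustering; returns clusters as sorted index lists."""
    n = len(sim)
    m = [[float(sim[i][j]) for j in range(n)] for i in range(n)]
    for i in range(n):
        # Self-loop weighted like the node's strongest edge (an assumption).
        m[i][i] = max(sim[i]) or 1.0
    m = normalise_columns(m)
    for _ in range(iterations):
        # Expansion: two steps of the random walk (matrix squaring).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
        # Inflation: strengthen strong transitions, then renormalise.
        m = normalise_columns([[x ** inflation for x in row] for row in m])
    # Rows that keep non-zero mass belong to attractors; their support is a cluster.
    clusters = {tuple(j for j in range(n) if m[i][j] > 1e-6) for i in range(n)}
    return sorted(list(c) for c in clusters if c)


if __name__ == "__main__":
    # Hypothetical repeats, two families of three sequences each.
    toy = ["ACCACAACCACAACCACAAC", "ACCACAACCACCACCACAAC", "ACCACAACAACAACCACAAC",
           "GTTGTGGTTGTGGTTGTGGT", "GTTGTGGTTGTTGTTGTGGT", "GTTGTGGTGGTGGTTGTGGT"]
    print(mcl(similarity_matrix(toy)))
```

Raising the inflation parameter generally yields more, smaller clusters, so the number of clusters obtained depends on that choice as well as on the similarity threshold.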
As an independent measure for the validity of the clustering, we examined the RNA stability scores in each of the MCL-defined clusters (note that RNA stability was not taken into account in the clustering procedure). As seen in Figure , clusters 2 and 3 comprise repeats with consistently high folding scores, indicating pronounced secondary structure. By contrast, clusters 1, 6, 7, 9, 10 and 11 contain repeats with consistently poor folding scores. Clusters 4, 5, 8 and 12 show intermediate folding scores, suggesting they have weaker secondary structures. Together, these groups explain the trimodal distribution observed in Figure . The homogeneity of RNA structure stability scores within each cluster, along with the dramatic difference in scores between clusters, suggests that our clustering method is valid.
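The homogeneity check amounts to grouping the folding scores by cluster and comparing the spread within each cluster to the differences between cluster means. A minimal sketch, assuming the cluster assignments and folding scores are available as two dictionaries keyed by a repeat identifier (the names and the toy values are hypothetical):

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarise_by_cluster(cluster_of: dict, score_of: dict) -> dict:
    """Map each cluster to (mean score, population standard deviation)."""
    grouped = defaultdict(list)
    for repeat_id, cluster in cluster_of.items():
        grouped[cluster].append(score_of[repeat_id])
    return {cluster: (mean(scores), pstdev(scores)) for cluster, scores in grouped.items()}


if __name__ == "__main__":
    # Hypothetical identifiers and scores, for illustration only.
    cluster_of = {"r1": 2, "r2": 2, "r3": 1, "r4": 1}
    score_of = {"r1": 12.0, "r2": 10.0, "r3": 1.0, "r4": 3.0}
    for cluster, (avg, sd) in sorted(summarise_by_cluster(cluster_of, score_of).items()):
        print(f"cluster {cluster}: mean={avg:.1f} sd={sd:.1f}")
```

A small within-cluster standard deviation relative to the gaps between cluster means is the pattern described above; because the scores played no part in the clustering, that pattern is an independent check rather than a circular one.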
To further explore the observation that repeats form stable RNA secondary structures, we examined sequence alignments of the repeat clusters. CRISPR repeats are generally considered to be highly dissimilar to each other [7], except for similar repeats in strains of the same species or in closely related species [1]. However, repeats within the clusters we generated, although often containing sequences from vastly different phylogenetic groups, were generally more similar to each other and hence alignable. Figure presents a multiple alignment of a subset of the repeats in cluster 3. A highly stable stem-loop structure was consistently predicted for repeats in this cluster by RNAfold [18] (Figure ). Notably, substitutions in the predicted stem structure are consistently accompanied by compensatory changes that preserve the base pairing (Figure ). This mutational pattern, together with the presence of G:U base pairs (Figure ), is typical of conserved RNA secondary structures and highlights the importance of the stem-loop in the repeats for the functionality of CRISPRs.
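Compensatory changes can be read off an alignment once a consensus structure is given: for every pair of stem columns, each sequence should still present a pairable combination (A:U, G:C or G:U), even where the bases themselves differ between sequences. The sketch below assumes a gap-free alignment and a dot-bracket consensus structure; the function names and the three toy sequences are hypothetical, and the report is a simple tabulation, not a statistical test of covariation.

```python
RNA_PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def paired_columns(structure: str) -> list[tuple[int, int]]:
    """Convert a dot-bracket string into a sorted list of paired positions."""
    stack, pairs = [], []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            pairs.append((stack.pop(), pos))
    return sorted(pairs)


def covariation_report(alignment: list[str], structure: str) -> dict:
    """For each paired column pair, list the base combinations observed,
    whether all of them can pair, and whether more than one pairable
    combination occurs (a compensatory change)."""
    report = {}
    for i, j in paired_columns(structure):
        seen = {(seq[i] + seq[j]).upper().replace("T", "U") for seq in alignment}
        report[(i, j)] = {
            "pairs": seen,
            "all_complementary": seen <= RNA_PAIRS,
            "covaries": len(seen & RNA_PAIRS) > 1,
        }
    return report


if __name__ == "__main__":
    # Hypothetical aligned repeats sharing a 3 bp stem and a 5-base loop.
    toy_alignment = ["GCCAAAAAGGC", "GUCAAAAAGAC", "GCCAAAAAGGU"]
    for columns, info in covariation_report(toy_alignment, "(((.....)))").items():
        print(columns, info)
```

In the toy input, columns 1 and 9 switch from C:G to U:A together, which is the kind of paired substitution the text refers to, and columns 0 and 10 include a G:U wobble pair.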
A summary of the repeat similarity space is presented in Figure . As with cluster 3 (Figure ), repeats in other clusters with high and intermediate folding scores also form stem-loop structures (Figure ) and display compensatory mutations, suggesting stable structures. While the stem-loop motif is seen in all of these clusters, the actual sequence, the length of the stem, its position relative to the unstructured region, and the size of the unstructured sequence all vary between clusters. For example, while the stem in cluster 4 is typically 5 bp long and is found in the middle of the repeat, the stem in cluster 3 is typically 7 bp long, and is found towards the 5' end of the repeat (Figures and ). The difference in calculated folding scores between clusters with high and intermediate scores is likely to be due to the stem length and the frequency of GC as opposed to AT base pairings. Consistent with previous reports [7], many repeat clusters have a conserved 3' terminus of GAAA(C/G), possibly acting as a binding site for one of the conserved CAS proteins.
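The conserved 3' terminus GAAA(C/G) translates directly into an anchored pattern match. A minimal sketch, assuming each repeat is given as a 5'-to-3' string on the transcribed strand (an assumption; for repeats of unknown orientation the reverse complement would need to be checked as well). The function names and example sequences are hypothetical.

```python
import re

# GAAA followed by C or G, anchored at the 3' end of the repeat.
TERMINUS = re.compile(r"GAAA[CG]$")


def has_conserved_terminus(repeat: str) -> bool:
    """True if the repeat ends in GAAAC or GAAAG."""
    return bool(TERMINUS.search(repeat.upper()))


def fraction_with_terminus(repeats: list[str]) -> float:
    """Fraction of repeats carrying the conserved 3' terminus."""
    if not repeats:
        return 0.0
    return sum(has_conserved_terminus(r) for r in repeats) / len(repeats)


if __name__ == "__main__":
    toy_repeats = ["AAGAAAC", "CCGAAAG", "GGGAAAT", "TTTT"]  # hypothetical sequences
    print(fraction_with_terminus(toy_repeats))
```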
Two recent studies identified between 20 and 45 gene families of CASs [5,6]. Based on the tendency of CAS genes to appear together, Haft et al. [5] defined eight CAS subtypes (named Ecoli, Ypest, Nmeni, Dvulg, Tneap, Hmari, Apern and Mtube). We sought to determine whether our CRISPR repeat clusters corresponded to particular CAS subtypes. For this, we searched 20 kb of sequence flanking each side of the repeat array for CAS genes, using the TIGRFAM hidden Markov models (HMMs) for the 45 CAS families defined by Haft et al. [5].
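The flanking-region search reduces to an interval query once genes have been assigned to CAS families. The sketch below covers only that windowing step; the HMM search itself is not shown, and the `family` field is assumed to have been filled in beforehand by whatever HMM search tool is used. The `Gene` record, the function name and the coordinates are hypothetical, and counting a gene that only partly overlaps the window is an assumption made here.

```python
from typing import NamedTuple, Optional


class Gene(NamedTuple):
    name: str
    start: int              # 1-based start coordinate on the replicon
    end: int                # inclusive end coordinate
    family: Optional[str]   # CAS family assigned by an HMM search, or None


def cas_genes_near_array(genes: list[Gene], array_start: int, array_end: int,
                         flank: int = 20_000) -> list[Gene]:
    """CAS-family genes overlapping the array or its +/- `flank` bp surroundings."""
    lo, hi = array_start - flank, array_end + flank
    return [g for g in genes
            if g.family is not None and g.end >= lo and g.start <= hi]


if __name__ == "__main__":
    toy_genes = [
        Gene("geneA", 1_000, 2_000, "cas1"),     # too far upstream
        Gene("geneB", 30_000, 31_000, None),     # in range, but not a CAS gene
        Gene("geneC", 79_000, 81_000, "cas2"),   # straddles the window edge
        Gene("geneD", 90_000, 91_000, "cas3"),   # too far downstream
    ]
    print(cas_genes_near_array(toy_genes, array_start=40_000, array_end=60_000))
```

This sketch treats the replicon as linear; on a circular chromosome, an array close to the origin would need the window to wrap around.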
We found that the Ecoli CAS subtype genes appear exclusively in the proximity of structured repeat cluster 2, and, similarly, the Dvulg and Ypest CAS subtypes correspond strictly to our structured clusters 3 and 4, respectively (Table and Table S1 in Additional data file 1). Presumably, specific and different sets of genes are needed in order to recognize, bind and process the different repeat types. Despite the overall pronounced correspondence between the CAS subtypes and repeat clusters, particularly for structured clusters, there are notable exceptions. For example, the reported frequent co-occurrence of the Mtube subtype with other CAS subtypes [5] is consistent with its promiscuous association with numerous repeat clusters (Table ). Another interesting exception is the co-occurrence of the Tneap and Apern subtypes in the Thermococcus kodakaraensis genome with cluster 6, which is apparently due to a fusion of the Tneap and Apern subtypes (Figure S1 and Table S1 in Additional data file 1). This genome contains three CRISPR arrays, all with identical repeat sequences classified as cluster 6 (Table S1 in Additional data file 1). In some cases the CAS subtype for one or more repeat cluster members differs from the consensus for that cluster (Table S1 in Additional data file 1), suggesting that the association between CRISPR repeat subtypes and CAS subtypes is somewhat flexible.
Table 1. Occurrence of CAS subtypes in the proximity (± 20 kb) of the 12 largest repeat clusters
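A cluster-by-subtype occurrence table of this kind can be tallied from per-array annotations. The sketch below assumes each array has already been assigned a repeat cluster and a set of CAS subtypes found in its flanking region; the function names are hypothetical, and the toy input only mimics the kinds of association described in the text and is not the data behind Table 1.

```python
from collections import Counter, defaultdict


def cooccurrence(arrays) -> Counter:
    """Count arrays per (repeat cluster, CAS subtype) combination.
    `arrays` is an iterable of (cluster_id, set_of_subtypes) pairs."""
    counts = Counter()
    for cluster, subtypes in arrays:
        for subtype in subtypes:
            counts[(cluster, subtype)] += 1
    return counts


def exclusive_subtypes(counts: Counter) -> dict:
    """Subtypes observed next to exactly one repeat cluster, mapped to that cluster."""
    clusters_of = defaultdict(set)
    for (cluster, subtype) in counts:
        clusters_of[subtype].add(cluster)
    return {s: next(iter(c)) for s, c in clusters_of.items() if len(c) == 1}


if __name__ == "__main__":
    # Toy annotations (hypothetical), one entry per CRISPR array.
    toy_arrays = [(2, {"Ecoli"}), (2, {"Ecoli", "Mtube"}), (3, {"Dvulg"}),
                  (3, {"Dvulg", "Mtube"}), (6, {"Tneap", "Apern"})]
    table = cooccurrence(toy_arrays)
    print(table)
    print(exclusive_subtypes(table))
```

A subtype such as Mtube in the toy input, which appears next to several clusters, drops out of the exclusive list, which is one simple way to flag promiscuous associations.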
We also identified a repeat cluster (cluster 5) that is not associated with any of the recognized CAS subtypes. We found that it is associated with most of the core CASs (cas1-4 and cas6), but lacks any of the additional type-defining genes. Cluster 5 occurs exclusively in genomes that contain other CRISPR repeat subtypes and it is possible that it employs at least part of their CAS machinery.