We identified and selected 64 known CRISPRs—including the streptococcal CRISPR—from complete and draft bacterial genomes and 86 novel CRISPRs from the 751 HMP whole-metagenome assemblies, using metaCRT and CRISPRAlign (see Methods
). For each selected CRISPR, we then applied the targeted assembly approach (for each CRISPR, first pool the reads that contain the repeat, and then assemble the pooled reads only; see Methods
for a validation of the targeted assembly approach using simulated datasets) to achieve a more comprehensive characterization of the CRISPR loci in the human microbiome shotgun datasets. Below we provide detailed analysis of the targeted assembly approach, and the resulting CRISPR loci (listed in and Tables S1
List of selected CRISPRs discussed in the paper.
Targeted assembly improves the characterization of CRISPRs
We first asked if our targeted assembly strategy helps to identify CRISPR elements from metagenomic datasets, and found that it greatly improved detection (see comparison in ). The improvements are twofold. First, the targeted assembly approach identifies known CRISPRs in more human microbiome datasets, as compared to the annotation of CRISPRs using whole-metagenome assemblies. Second, targeted assembly resulted in longer CRISPR arrays, from which we can extract many more diverse spacers for analyzing the evolution of the CRISPRs and other purposes. Here we use three examples to demonstrate the performance of the targeted assembly.
Comparison of CRISPR identification using whole-metagenome assembly and targeted assembly.
The first example is the streptococcal CRISPR SmutaL36 (see ), a CRISPR that is conserved in streptococcal species such as Streptococcus mutans
. This CRISPR was observed in only 38 of 751 datasets when using contigs from whole-metagenome assembly, whereas our targeted CRISPR assembly identified instances of CRISPR SmutaL36 in ~10 times more (386) datasets. Consistent with the distribution of Streptococcus
across body sites, most of the 386 datasets are from oral samples: 120 of 128 supragingival plaques (94%), 128 of 135 tongue dorsum samples (95%), and 97 of 121 buccal mucosa samples (80%) (see ). CRISPR SmutaL36 was found in only a small proportion of samples from other body locations, where Streptococcus
rarely exists (e.g.
, 4 of 148 stool samples, and none of the posterior fornix datasets). shows the details of targeted assembly of this CRISPR in two datasets.
Distribution of selected CRISPRs across body sites.
The other two examples are GhaemL36 and SRS018394L37 (see details in ). CRISPR GhaemL36 was initially identified from the genome of Gemella haemolysans ATCC 10379 using metaCRT. Targeted assembly further identified instances of this CRISPR in 258 oral-associated samples. The longest contig—of 3121 bases—was assembled from the SRS019071 dataset. This CRISPR array has even more repeats (48; i.e., 47 spacers) than the CRISPR array in the Gemella haemolysans reference genome, which has 29 repeats. CRISPR SRS018394L37 (currently not yet associated with a host genome) was initially identified from the whole-metagenome assembly of SRS018394, but targeted assembly reveals the presence of this CRISPR in 238 oral-associated microbiomes. The contig that was assembled in SRS049389 is the longest one (2014 bps), containing 25 spacers.
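The relationship between repeat and spacer counts used above (48 repeats flank 47 spacers) can be illustrated with a minimal sketch. It assumes exact copies of the repeat and uses a hypothetical helper, `extract_spacers`, that is not part of the pipeline described here. Real arrays often contain slightly divergent repeat copies, which this sketch ignores.

```python
def extract_spacers(contig, repeat):
    """Return the spacers lying between consecutive exact copies of `repeat`.

    Assumption: repeats are exact and non-overlapping. The flanking
    sequence before the first repeat and after the last one is discarded,
    so n repeats yield n - 1 spacers.
    """
    pieces = contig.split(repeat)
    # pieces[0] is the leading flank and pieces[-1] the trailing flank.
    return pieces[1:-1]
```

For a contig that contains three exact repeat copies, the function returns the two sequences between them. A contig without the repeat yields an empty list.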
In most cases we have tested, targeted assembly dramatically improves the identification of CRISPRs in the HMP datasets: for 142 CRISPRs (out of 150), targeted assembly resulted in CRISPR identification in more HMP samples as compared to using whole-metagenome assemblies, and for 36 CRISPRs, targeted assembly identified instances of the corresponding CRISPR in at least 10 times more datasets (see Table S1
). This suggests that specifically designed assembly approaches, such as the targeted assembly approach presented here, are important for characterizing functionally important repetitive elements that may otherwise be poorly assembled in a whole-metagenome assembly (which tends to be confused by repeats). Such comprehensive identification is in turn important for deriving an unbiased distribution of these functional elements across body sites and individuals.
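The read-pooling step of targeted assembly (pool the reads that contain the repeat, then assemble only those reads) can be sketched as follows. This is a simplified illustration, not the implementation used in this study: it assumes exact matching of the repeat on either strand, and the function names `reverse_complement` and `pool_reads` are hypothetical. The pooled reads would then be passed to an assembler, which is not shown.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence over the alphabet A, C, G, T, N."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def pool_reads(reads, repeat):
    """Keep only reads containing the repeat on either strand.

    Assumption: exact substring matching. A real pipeline would
    typically tolerate a few mismatches in the repeat.
    """
    forward = repeat.upper()
    reverse = reverse_complement(forward)
    return [r for r in reads if forward in r.upper() or reverse in r.upper()]
```

Restricting assembly to the pooled reads removes most of the unrelated sequence that competes with the repeat-rich CRISPR locus in a whole-metagenome assembly.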
Novel CRISPRs are found in human microbiome samples
In order to identify novel CRISPR loci, with which to seed further targeted assemblies, we set out to find loci based simply on the structural patterns of CRISPR loci, using the program metaCRT, which we modified from CRT (see Methods
). As a result, we found and selected 86 different types of novel CRISPR repeats in metagenomic samples, which could not be found in reference genomes, for further targeted assembly (see Methods
). lists selected examples of novel CRISPRs that we identified in HMP datasets (see for naming conventions). A full list of CRISPRs (including the number of CRISPR contigs assembled in each metagenomic dataset) is available as Table S1
. In this section, we highlight two examples of novel CRISPRs.
CRISPR SRS012279L38 was identified from a whole-metagenome assembly contig of dataset SRS012279 (derived from a tongue dorsum sample; see ). The identified CRISPR contig has 6 copies of a 38-bp repeat (the last copy is incomplete; see for the consensus sequence of the repeats). De novo
gene prediction by FragGeneScan 
reveals 10 protein-coding genes in this contig, among which 9 share similarities with cas
genes from other genomes, including Leptotrichia buccalis
DSM 1135 (NC_013192, an anaerobic, gram-negative species, which is a constituent of normal oral flora 
) and Fusobacterium mortiferum
ATCC 9817, by BLASTP search using the predicted protein sequences as queries (see ). (By contrast, a BLASTX search of this contig against the nr database annotated only 7 genes.) In addition, similarity searches revealed a single identical copy of this repeat in the genome of Leptotrichia buccalis
DSM 1135 (from 1166729 to 1166764; de novo
CRISPR prediction shows that this genome has several CRISPR arrays, including an array that has 84 copies of a 29-bp repeat, but none of the CRISPRs have the same repeat sequence as SRS012279L38). These two lines of evidence (similar cas
genes, and an identical region in the genome) suggest that the SRS012279L38 CRISPR we found in the human microbiomes could have evolved from Leptotrichia buccalis
or a related species.
A potentially novel CRISPR array identified in a contig (9848 bases) from sample SRS012279.
Targeted assembly of this novel CRISPR (SRS012279L38) in HMP datasets resulted in 278 contigs from 97 datasets, confirming the presence of this CRISPR in human microbiomes. In particular, the CRISPR fragments (407 bps) identified from the whole-metagenome assembly of SRS012279 were assembled into a longer CRISPR contig (890 bps) by targeted assembly. A total of 14 unique but related repeat sequences were identified from 278 CRISPR contigs, and two of them (which differ at 3 positions) are dominant, constituting 71% of the repeats in the CRISPR contigs. Notably, all the repeats could be clustered into a single consensus sequence with an identity threshold of 88%. By contrast, the spacer sequences are very diverse across different samples. For example, we obtained a total of 352 unique spacer sequences, which were clustered into 345 consensus sequences with an identity threshold of 90%. Among 352 unique spacers, 114 spacer sequences were shared by multiple samples—a single spacer was shared by at most eight samples.
The second example is CRISPR SRS023604L36, initially identified in a whole-metagenome assembly contig of dataset SRS023604 (derived from posterior fornix), which has 5 copies of a 36 bp repeat (see consensus sequence of the CRISPR repeat in ). Targeted assembly of this CRISPR across all HMP metagenomic datasets revealed further instances of this CRISPR in several other datasets, including two from stool, and two from posterior fornix. Moreover, the CRISPR contig was assembled into a longer contig of 778 bps containing 12 copies of the CRISPR repeat. BLAST search of the CRISPR repeat against the nr database did not reveal any significant hits.
Expanding the CRISPR space by human microbiomes
To investigate how much the CRISPRs identified in the HMP datasets expand the CRISPR space (the sequence space of CRISPR repeats), we built a network of CRISPRs based on the sequence similarity between CRISPR repeats. Each node represents a CRISPR repeat, and an edge between two nodes indicates that one repeat can be transformed into the other by at most 10 operations (including mutations, insertions, and deletions). Since it is difficult to determine the direction of CRISPR repeats 
(especially for the CRISPR arrays that have incomplete structures), given two repeats, we calculated two edit distances—one is the distance between the two repeats, and the other one is between one repeat and the reverse complement of the other—and used the smaller value as the edit distance between the two repeats. The global network (; see Figure S1
with node labels) shows that most of the novel CRISPRs identified in the human microbiomes are remotely related to ones identified in complete (or draft) genomes. Still, there are small clusters that contain only novel HMP CRISPRs, indicating that these CRISPRs are substantially different from ones identified in the reference genomes. In , we have colored nodes by body site: while specific CRISPR repeats can be highly specific to body site (see below), the larger families of repeats shown in do not appear to cluster based on body site.
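The edge rule described above (the smaller of two edit distances, one against the repeat and one against its reverse complement, with an edge when that value is at most 10) can be sketched as follows. This is an illustrative sketch under stated assumptions: it uses a standard Levenshtein distance with unit costs, and the function names and the `repeats` dictionary layout are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence over the alphabet A, C, G, T, N."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def edit_distance(a, b):
    """Levenshtein distance with unit cost per substitution, insertion, deletion."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # match or substitution
        prev = cur
    return prev[-1]


def strand_aware_distance(r1, r2):
    """Smaller of the distances to r2 and to the reverse complement of r2."""
    return min(edit_distance(r1, r2),
               edit_distance(r1, reverse_complement(r2)))


def build_edges(repeats, max_ops=10):
    """Connect two named repeats when their strand-aware distance is <= max_ops."""
    names = sorted(repeats)
    return [(u, v)
            for i, u in enumerate(names)
            for v in names[i + 1:]
            if strand_aware_distance(repeats[u], repeats[v]) <= max_ops]
```

Taking the minimum over both orientations makes the network independent of the strand on which each repeat happened to be reported.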
Visualizations of the CRISPR network of 150 CRISPRs, each represented as a node.
We further studied the sequence patterns of the repeats in each group. Our results show 1) distinct patterns among the groups, and 2) high conservation around the stem and the start/end positions in the CRISPR repeats of each group (see sequence logos for the large groups in Figure S2
). The consensus sequences revealed by the logos are consistent with the results of a previous study, which used a similar approach, based on alignments of CRISPR repeats, to classify CRISPR repeats 
CRISPRs have diverse distributions across human body sites and individuals
Overall, the distributions of CRISPRs are largely body-site specific (see and Figure S3
; the name of CRISPR and the number of samples in which the CRISPR was found are listed in Table S3
). For example, CRISPRs AhydrL30 and BcoprL32 are found only in stool samples (see ). Exceptions include two CRISPRs that were found in a substantial number of both gut- and oral-associated samples: Neis_t014_L28 was found in 51 gut samples and 92 oral-associated samples, and FalocL36, identified from Filifactor alocis
ATCC 35896, was found in 63 gut samples and 72 oral-associated samples, including 50 tongue dorsum samples (see ).
Distribution of CRISPRs across body sites.
The first 50 CRISPRs shown in are mainly found in stool samples. AshahL36, which was initially identified from Alistipes shahii WAL 8301, was found in more than half of the gut-related samples (96 out of 147). On the other hand, 99 CRISPRs are mainly found in oral samples, in particular tongue dorsum, supragingival plaque, and buccal mucosa. We found 5 CRISPRs that exist in more than half of the 417 oral-associated samples: SmutaL36, KoralL32 from Kingella oralis ATCC 51147, Veil_sp3_1_44_L36 and Veil_sp3_1_44_L35 from Veillonella sp. 3_1_44, and SoralL35 from Streptococcus oralis ATCC 35037. Four CRISPRs are mostly found in vaginal samples (AlactL29, LjensL36, LjassL36, and LcrisL29). One CRISPR (PacneL29) is skin-specific, found mainly in skin samples. Below we discuss the body-site distributions of a few examples.
Neis_t014_L28 and Neis_t014_L36 are inferred from a single genome, Neisseria sp.
oral taxon 014 str. F0314, but these two CRISPRs show distinct presence/absence profiles across body sites (see ). In stool samples, only Neis_t014_L28 was found (in 51 datasets), whereas Neis_t014_L36 was absent. Conversely, Neis_t014_L36 is more prevalent than Neis_t014_L28 in oral-associated samples. The different body-site distributions can be explained by the fact that these two CRISPRs are found in different sets of genomes (although both occur in one common genome, Neisseria sp.
oral taxon 014 str. F0314). Neis_t014_L36 has been identified in multiple Neisseria
genomes, including Neisseria meningitidis
ATCC 13091, Neisseria meningitidis
8013 (so Neis_t014_L36 belongs to the Nmeni subtype among the 8 subtypes defined by Haft et al 
), Neisseria flavescens
SK114, and Actinobacillus minor
NM305. Neis_t014_L28, however, was only found in Neisseria sp.
oral taxon 014 str. F0314. On the other hand, even though we could not find any CRISPRs containing exactly the same repeat as Neis_t014_L28 in complete/draft genomes other than Neisseria sp.
oral taxon 014 str. F0314, many CRISPRs with similar repeats (allowing a few mismatches) were found in diverse genomes (for example, Crenothrix polyspora
, Legionella pneumophila
2300/99 Alcoy, and Thioalkalivibrio sp.
K90mix) from environmental samples.
Four CRISPRs (AlactL29, LjensL36, LjassL36, and LcrisL29) exist mostly in vaginal samples. AlactL29, initially identified from the Anaerococcus lactolyticus
genome, was found in only 3 vaginal samples. Notably, LjensL36 was found in 28 vaginal samples (43% of the vaginal samples collected) and 1 skin sample. This observation is consistent with a previous study showing that Lactobacillus
constitutes over 70% of all bacteria sampled from vaginas of healthy, fertile women, and Lactobacillus jensenii
is one of the major genomes 
. Interestingly, we could find evidence of adaptation in the LjensL36 spacer against Lactobacillus
phage Lv-1 (NC_011801) (see below). LjassL36 was found in 33 vaginal samples by targeted assembly. We confirmed that it is in different Lactobacillus
genomes, such as Lactobacillus gasseri
and Lactobacillus crispatus
, by BLAST search. CRISPR LcrisL29, which was identified in the Lactobacillus crispatus
genome, was found in 31 vaginal samples, and we found the same repeat sequence in the Lactobacillus helveticus
PacneL29 was the only skin-specific CRISPR we found in the HMP datasets. Interestingly, instances of PacneL29 are found in Propionibacterium acnes HL110PA4 and Propionibacterium acnes J139, but not in other P. acnes isolates (including KPA171202, SK137, J165, and SK187). This indicates a potential application of CRISPRs in characterizing specific strains of a species in human microbiomes.
CRISPRs have very diverse spacers
The HMP datasets enable us to explore the diversity of streptococcal CRISPRs (and others) at a much broader scale (751 samples from 104 healthy individuals). The CRISPRs that we identified in human microbiomes exhibited substantial sequence diversity in their spacers among subjects. Targeted assembly of the streptococcal CRISPR (SmutaL36) in HMP datasets resulted in a total of 15,662 spacers identified from 386 samples, among which 7,815 were unique (clustering of the spacers at 80% identity resulted in a non-redundant collection of 7,436 sequences). See Figure S4
for the sharing of the spacers in streptococcal CRISPRs among all individuals, which shows several large clusters of spacers that are shared by multiple individuals (for clarity, we only keep spacers that were shared by more than eight samples in this figure). In particular, the most common spacer is shared by 25 individuals (in 32 samples).
More importantly, we could check the sharing of CRISPR spacers across different body sites and sub-body sites (e.g.
, multiple oral sites) using HMP datasets (Pride et al.
examined streptococcal CRISPRs in saliva samples from 4 individuals 
). shows the spacer sharing among 6 selected individuals, each of whom has multiple samples with identified streptococcal CRISPRs from multiple body sites (see Figure S5
for the spacer sharing with spacers clustered at 80% sequence identity). By examining the distribution of the spacers across samples, we observed that samples re-sampled from the same individual and oral site shared the most spacers, different oral sites from the same individual shared significantly fewer, and different individuals had almost no common spacers, indicating the impact of subtle niche differences and histories on the evolution of CRISPRs. Our observation is largely consistent with the conclusion from Pride et al.
. But our study showed that different samples from the same oral site of the same person, even samples collected many months apart, could still share a significant number of spacers (e.g.
, the supragingival plaque samples from individual 1 in visit 1 and visit 2, with 238 days between the two visits, and the tongue dorsum samples from individual 5 in visit 1 and visit 3, with 336 days between the two visits; as shown in ). Our study also showed that although the different oral sites of the same individual share similar spacers, this sharing (e.g.
, between the supragingival plaque sample and the buccal mucosa sample for individual 1) is minimal, as compared to the spacer sharing between samples collected in different visits but from the same oral site (e.g.
, between the supragingival plaque samples from visit 1 and visit 2 for individual 1). Finally, our study shows that the spacer turnover varies among individuals—for the 6 selected individuals, individual 3 shows significantly higher turnover of the spacers between visits, as compared to other individuals.
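The pairwise comparison of spacer content between samples can be illustrated with a short sketch. It assumes that spacers are compared by exact sequence identity (the analysis above also considers spacers clustered at 80% identity, which is not reproduced here), and the function name and sample labels are hypothetical.

```python
from itertools import combinations


def shared_spacer_counts(spacers_by_sample):
    """Count spacers shared by each pair of samples.

    `spacers_by_sample` maps a sample label to a set of spacer sequences.
    Assumption: two spacers are "shared" only if they are identical strings.
    """
    return {
        (a, b): len(spacers_by_sample[a] & spacers_by_sample[b])
        for a, b in combinations(sorted(spacers_by_sample), 2)
    }
```

Grouping the resulting pairs by individual, body site, and visit gives the kind of comparison discussed above: same site across visits, different sites within an individual, and different individuals.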
Sharing of streptococcal CRISPR spacers among samples from 6 individuals.
We also checked the spacer diversity for the CRISPR KoralL32, since it and its variants are among the most abundant CRISPRs. This CRISPR was assembled from 339 samples: 327 from oral sites and 2 from gut. The targeted assembly of KoralL32 found 7282 unique spacers, among which the most commonly shared spacer is shared by 35 individuals (in 58 samples). Figure S6
shows the sharing of the spacers among the individuals for this CRISPR, which shows similar spacer-sharing patterns as those found in the streptococcal CRISPRs.
The similarity between spacers from the same individual suggests that we may still be able to trace the evolution of CRISPRs, especially in the same body site of the same individual, even though the CRISPR loci tend to have extremely high turnover of their spacers.
CRISPR spacer sequences can be used to trace the viral exposure of microbial communities
As a consequence of CRISPR adaptation, the spacer contents in CRISPR arrays reflect diverse phages and plasmids that have passed through the host genome 
. However, previous studies have shown that only 2% of the spacer sequences have matches in GenBank, probably because bacteriophages and plasmids are still poorly represented in databases 
. Similarity searches of identified spacers against viral genomes enable identification of the viral sources of the spacers (i.e.
, proto-spacers) captured in each CRISPR locus. For example, similarity searches of the 7,815 unique spacers in the streptococcal CRISPR against viral genomes revealed similarities between streptococcal spacers and 22 viral genomes (species names and accession IDs are listed in Table S4
), and the two most prevalent viruses are Streptococcus
phage PH10 (NC_012756) and Streptococcus
phage Cp-1 (NC_001825) (see ). suggests that the potential proto-spacers are rather evenly distributed along the phage genomes (except for a few regions, including a region that encodes for an integrase, which is highlighted in red in ). Although the positional distribution of the proto-spacers is close to random, the sequences adjoining the proto-spacers for streptococcal CRISPR we identified in the virus genomes showed conserved short sequence motifs (GG) (see Figure S7
for the sequence logo), which is also the most common proto-spacer adjacent motif (PAM) shared by several CRISPR groups, as reported in 
Traces of viral sequences in the streptococcal CRISPRs in human microbiomes.
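The search for conserved motifs adjoining proto-spacers can be sketched as follows. This is a simplified illustration only: it assumes exact spacer matches on the forward strand of the phage genome and tallies the bases immediately downstream of each match, whereas the analysis described here relied on similarity searches and sequence logos. The choice of the downstream side, the function name, and the `flank` parameter are assumptions made for the example.

```python
from collections import Counter


def flanking_motifs(genome, spacers, flank=2):
    """Tally the `flank` bases immediately downstream of each exact spacer match.

    Assumptions: forward strand only, exact matches only, and motifs
    that would run past the end of the genome are skipped.
    """
    counts = Counter()
    for spacer in spacers:
        start = genome.find(spacer)
        while start != -1:
            end = start + len(spacer)
            if end + flank <= len(genome):
                counts[genome[end:end + flank]] += 1
            # Continue searching after the current match start.
            start = genome.find(spacer, start + 1)
    return counts
```

A motif that dominates the resulting counts is a candidate proto-spacer adjacent motif, to be confirmed with a sequence logo over the full flanking region.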
Another example is CRISPR PacneL29, which is mainly found in skin-associated microbiomes. BLAST search of the identified spacers against the virus genome dataset revealed similarity between the spacers and several regions in Propionibacterium
phage PA6 (NC_009541). We also found evidence of adaptation in LjensL36 against Lactobacillus
phage Lv-1 (NC_011801): BLAST search shows significant matches to a total of 38 regions in the phage genome. Overall we found 23 CRISPRs that have spacers with high sequence similarities (≥90% over 30 bps) with virus genomes collected from the NCBI ftp site (Table S5
We also searched the spacers against plasmid sequences (collected in the IMG database). For example, matches were found between the detected streptococcal CRISPR spacers and more than 10 plasmid sequences (including Streptococcus thermophilus
plasmid pER35, pER36, pSMQ308, and pSMQ173b; Bacillus subtilis
plasmid pTA1040; and Streptococcus pneumoniae
plasmids pSMB1, pDP1 and pSpnP1). See Table S6
for a summary of the plasmids that share high homology with the CRISPR spacers.
The CRISPR spacers can also be used to identify viral contigs in metagenome assemblies that contain proto-spacers. As an example, similarity searches of the identified streptococcal CRISPR spacers against the HMP assemblies revealed 37 potential viral contigs (of lengths from 2,134 to 56,413 bp); these contigs show high homology (>80% sequence similarity) with known viral genomes. The largest contig (56,413 bp) is similar to the genome of Streptococcus phage Dp-1 (NC_015274), with 88% sequence identity, and covers almost the entire viral genome (59,241 bp). A future paper will fully explore this approach.
Conserved CRISPR repeat sequences can be used to reveal rare species in human microbiome
Because of the large number of repeats that many CRISPR loci contain, CRISPR repeats of rare species with low sequence coverage in a community can still be found. It was reported that repeat-based classification 
corresponds to a cas
gene-based classification of CRISPRs 
, which revealed several subtypes of CRISPRs largely constrained within groups of evolutionarily related species (e.g.
, the Ecoli subtype). As such, we may use the presence of the repeats of a particular CRISPR as a first indication of the presence of related genome(s) in a microbiome, even though CRISPR loci have been found to be transferred horizontally as a complete package among genomes 
We use CRISPR PpropL29 as an example to demonstrate this potential application, as PpropL29 was identified in only a small proportion of the HMP samples (11 datasets): 7 supragingival plaque samples (out of 125) and 4 tongue dorsum samples (out of 138). All the PpropL29-related repeats identified in these samples can be clustered into 7 unique sequences. To find the most likely reference genomes for these 7 unique repeat sequences, we searched them with BLAST against the human microbiome reference genomes and found 100% identity matches in the Lautropia mirabilis genome. To investigate the overall coverage of this genome by the reads (not only the CRISPR regions), we mapped the entire collection of reads from four samples: SRS019980 and SRS021477 (both from supragingival plaque, each with a 100% identity match to the CRISPR repeat in the Lautropia mirabilis genome); SRS019974 (from tongue dorsum, with a slightly different CRISPR repeat sequence that differs at 3 positions); and SRS019906 (which does not contain any CRISPR repeats similar to PpropL29 and was used as a control). The mapping results show that the reads from SRS019980 and SRS021477 each cover ~80% of the Lautropia mirabilis genome, which is strong evidence that these two microbiomes include Lautropia mirabilis. By contrast, the other two samples have only a limited number of reads mapped to the genome (e.g., only 3089 reads in SRS019906 were mapped to Lautropia mirabilis). This contrast suggests that identification of CRISPRs by targeted assembly could provide significant evidence for the existence of certain rare genomes.
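The genome coverage figure used above (the fraction of the reference genome covered by mapped reads) can be computed from alignment coordinates as in the following sketch. It assumes that each mapped read is summarized as a half-open interval `(start, end)` on the reference. The function name is hypothetical, and the read mapper that produces the intervals is not shown.

```python
def covered_fraction(alignments, genome_length):
    """Fraction of reference positions covered by at least one alignment.

    `alignments` is an iterable of half-open (start, end) intervals.
    Overlapping or touching intervals are merged before counting, so
    positions covered by several reads are counted once.
    """
    covered = 0
    cur_start, cur_end = None, None
    for start, end in sorted(alignments):
        if cur_end is None or start > cur_end:
            # Close the previous merged interval and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / genome_length
```

Unlike a raw count of mapped reads, this breadth-of-coverage value distinguishes reads spread across the genome from reads piled onto a few conserved regions.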