L. casei genome features
Genome features of the 17
L.
casei strains included in the study, which provide a broad representation of genetic, ecological, and geographical diversity in the species, are presented in Tables

and . The 12 new draft sequences had a range of coverage between 17X and 133X (mean = 31.6X) and ranged in total contig number from 28–167 (mean = 99 contigs) (Table

). Using a set of only three closed genome references (ATCC 334, BL23, and Zhang), we were able to determine orientation and order the contigs of each draft. The contig numbers listed in Table

were obtained after contigs were ordered and oriented in Mauve and the draft was aligned end-to-end based on the advanced order, which reduced the number of contigs obtained from the
de novo assembly by an average of 20% (data not shown). Each of the genomes was assembled into scaffolds based on the placed and unplaced contigs from Mauve, with the unplaced contigs being ordered by name at the end of the scaffold.
| Table 1Genome features of the 17 L. casei strains used in the study1 |
| Table 2Orthologous clusters in the L. casei supragenome1 |
Although the final contig order was not independently validated for all 12 draft genomes, an optical restriction map [
47] was used as a reference to validate assembly of the Lpc-37 genome by this approach (see Additional file
1: Figure S1 in supplementary online material). The overall arrangement of the contig order for Lpc-37 was done using progressive Mauve, and alignment of the contigs to the optical map demonstrates that the method of assembly and ordering yielded a well ordered draft, with only a few major gaps. The optical map gave an estimated genome size of 3,014,302 bp with a total concatenated ordered draft length of 2,916,119 bp (~97% coverage). Based on the alignment between optical map and draft assembly, the size difference is likely derived from 3 major sequence gaps approximately 30, 70, and 10kb in size (Additional file
1: Figure S1).
The number of predicted CDS features in each genome ranged from 2,643 to 3,262, with an overall GC content of 46.1-46.6% (Table

). Comparative genomics of the resulting architecture revealed synteny was relatively high across the
L.
casei genomes, with several large blocks of highly conserved gene content across all 17
L.
casei strains (Figure

). These data also indicate that genome size differences were not due to major chromosomal insertions, deletions or re-arrangements.
Sequence homology was determined even more accurately when strains were aligned based on gene content (Figure

), so that most of the diversity is observed as indels. As expected, the number of Locally Collinear Blocks (LCBs) detected using the progressive Mauve alignment increased with the number of genomes compared. Hierarchical clustering of strains based on overall gene content yielded a dendrogram with 6 clusters, designated A-F (Figure

). When members of each cluster were aligned to each other or to their closest reference genome, the number of LCBs was reduced, and the overall sequence homology detected between the strains was apparent based on the height of the similarity profile within each LCB (Figure

). No major genomic rearrangements were detected within any of the draft sequences. Content conservation was even higher within clusters (Figure

), most notably for clusters E and F. Even within clusters B and C, which contain 7 and 3 strains, respectively, overall synteny was high with mostly localized polymorphic content (Figure

).
Comparison of the gene content-based dendrogram (Figure

) to an MLST-based phylogenetic tree for the 17
L.
casei strains examined in this study (Figure

) revealed similar clustering for strains in gene content clusters A, E, F, and parts of B and C. However, MLST-derived relationships among the remaining strains did not resemble those derived from overall gene content. The basis for this observation is well understood; MLST-based phylogeny reflects relatively slow genome evolution caused by point mutations and selective pressure, whereas the gene content dendrogram captures more rapid (and unpredictable) large-scale insertion and deletion events. Thus, variations between gene content- and MLST-based dendrograms are expected in bacterial species like
L.
casei that display frequent intra-species recombination [
22,
25,
28,
30].
Strain-specific features of the L. casei supragenome
Protein homology searches with strain-specific accessory gene products (i.e., unique to a particular strain) revealed 8-87% (mean = 54%) of these proteins had highest similarity to gene products found in other strains of
L.
casei or
Lactobacillus paracasei (see Additional file
2: Table S1 in the online supplementary material). The lowest fractions (<50%) were found in the two wine isolates (A2-362 and UCD174) and three dairy strains (Lc-10, ATCC 334, and Zhang). Expansion of the analysis to include gene products with greatest homology to proteins from other species of lactobacilli accounted for 45-95% (mean = 77%) of the strain-specific accessory genes products. For example, different proteins with very high amino acid identity (E value < 1e
-80) to orthologs in the closely related species
Lactobacillus rhamnosus were found in all
L.
casei strains except three dairy isolates (M36, BD-II, and BL23). Additionally, strain-specific accessory genes encoding orthologs with high homology to proteins from a broad range of other
Lactobacillus species including
L.
plantarum,
L.
fermentum,
L.
brevis,
L.
buchneri,
L.
coryniformis,
L.
coleohominis,
L.
farciminis,
L.
helveticus,
L.
hilgardii,
L.
jensenii,
L.
kefiranofaciens,
L.
ois,
L.
pentosus, and
L.
salivarius were sporadically distributed among the 17
L.
casei strains examined (Additional file
2: Table S1). Many of these lactobacilli are only distantly related to
L.
casei, but all share at least one ecological niche with this species and theoretically might have contributed to the diversity of the
L.
casei supragenome. There were, for example, clear relationships between ecological co-localization with particular
Lactobacillus species and the unique accessory gene content of individual
L.
casei strains.
L.
coryniformis, for example, is commonly found in fermented plant material (e.g., silage) [
2] and accessory gene products with very high homology scores to various proteins from this species were found within each of the five silage and wine isolates, but not in any of the human or dairy strains (Additional file
2: Table S1). Similarly,
L.
casei strains of dairy origin had a greater prevalence of accessory genes whose products gave very high homology scores to different proteins from
L.
fermentum (Additional file
2: Table S1), a species commonly found in milk products [
2].
Overall,
L.
casei plant isolates showed the most diverse repertoire of strain-specific accessory genes. Genes encoding orthologs with high amino acid identity to different proteins from
L.
plantarum,
L.
buchneri, and
L.
pentosus, for example, were most prevalent in plant isolates, as were proteins with orthologs in distantly related bacteria such as
Enterococcus,
Streptococcus,
Bacillus, and
Clostridium species (Additional file
2: Table S1). Finally, accessory genes with very high homology scores to various proteins from
Listeria sp. were detected in each of the human
L.
casei isolates and in the three dairy strains, but not in any of the plant isolates, even though
Listeria are commonly recovered from vegetation [
49].
If niche-associated gene exchange contributes to the composite nature of the
L.
casei supragenome, and this process is important to
L.
casei strain evolution and lifestyle adaptation, then one might expect to find evidence of relatively recent events among different strains. Indeed, several insertion sequence elements in
L.
casei strains shared at least 99% nucleotide sequence identity with elements that have been identified in the ge nomes of
L.
rhamnosus,
L.
brevis,
L.
buchneri,
L.
fermentum,
Oenococcus oeni and even
Listeria innocua (see Additional file
3: Table S2 in the online supplementary material). More interestingly,
L.
casei strains UCD174, BD-II, and UW1 each possess a unique polycistronic region encoding features associated with lifestyle adaptation [
28] that share very high (98-99%) nucleotide sequence identity with genomic regions in
L.
plantarum or
L.
brevis (see Additional file
3: Table S2 and Additional file
4: Figure S2 in the supplementary online material). Only one strain of
L.
brevis has been sequenced to date, but nucleotide BLAST searches showed each of the clusters with homology to
L.
plantarum was common among sequenced strains of that species.
The four-gene cluster found in
L.
casei UW1 is virtually identical to a locus in
L.
brevis ATCC 367 that encodes an ABC sugar transport system of unknown specificity (Additional file
4: Figure S2A). Tests on subset of nine stains, including UW1, for the ability to grow in CDAA supplemented with one of 60 different substrates did not reveal any capability that was unique to strain UW1 (see Additional file
5: Figure S3 in supplementary online material), so the function of this gene cluster remains unclear. In both species, the cluster is flanked on one side by a gene for transposase (Additional file
4: Figure S2A), and the G+C content of the cluster (ORFs range from 37-39%) is considerably lower than that of either species' genome (46%) (Table

and Additional file
3: Table S2), suggesting the cluster may have been acquired by one or both species from a third donor.
The polycistronic cluster identified in
L.
casei BD-II is plasmid borne, and carries five of the six genes that comprise the
L.
plantarum lar operon [
50] (Additional file
4: Figure S2B). Goffin and coworkers [
50] showed
lar is required for lactate racemization activity in
L.
plantarum, but the function of most
lar-encoded proteins, or even whether all six genes are required for this activity, remains unknown. D-lactate is a component of the cell wall in
L.
plantarum and
L.
casei, and Goffin et al. [
50] suggested Lar activity may provide the cell with a rescue pathway for D-lactate production under conditions that inhibit D-lactate dehydrogenase activity. The
lar genes of
L.
casei BD-II show 98-99% nucleotide sequence identity to their counterparts in
L.
plantarum, and the BD-II locus is flanked on each side by transposase genes (Additional file
4: Figure S2B). There are no transposase genes in the immediate vicinity of the
lar operon in
L.
plantarum. Collectively, these observations provide good evidence that the
lar locus in
L.
casei BD-II was acquired via HGT from
L.
plantarum, but its role, if any, in lifestyle adaptation must yet be determined.
The third and most compelling example of lifestyle evolution via HGT was found in the wine isolate
L.
casei UCD174. This bacterium contains a polycistronic cluster for L(+)-tartrate catabolism and malate transport that was previously thought to be one of three defined
L.
plantarum-specific gene clusters [
51]. Tartaric and malic acids are the primary acids in grapes and therefore the strongest acids in wine. Tartrate dehydratase allows cells to convert tartrate to oxaloacetic acid, an important metabolic intermediate, and malate transport and metabolism are known to enhance acid tolerance in
L.
casei and other lactic acid bacteria [
52-
54]. Thus, acquisition of this cluster by UCD174 very likely promoted survival and adaptation of this strain to the acid environment of wine. Features of the UCD174 tartrate dehydratase cluster, including 99% nucleotide sequence identity over the cluster and associated
aroAB genes with the corresponding
L.
plantarum locus, and the presence of flanking transposase genes in UCD174 (Additional file
4: Figure S2C), provide strong evidence that the locus was acquired by HGT from
L.
plantarum.
Finally, we have noted
L.
rhamnosus might be an important source of genetic diversity in
L.
casei, but also found evidence that the reverse may be true. Of the nine publicly available complete or draft genome sequences for
L.
rhamnosus, only two strains, GG (ATCC 53103) and LMS2-1, contain a genomic region that encodes three secreted LPXTG-like pilin proteins (SpaCBA) plus a dedicated sortase for their export [
55]. Functional genomic studies showed the SpaCBA pilus promotes adhesion to intestinal epithelial cells, and may function to modulate interleukin-8 expression that is induced by lipoteichoic acid or other surface molecules [
56]. While the
spaCBA locus and associated sortase are clearly uncommon among
L.
rhamnosus strains, it is fully conserved in
L.
casei strains ATCC 334, BL23, Zhang, and T71499, and present but probably inactive (due to frameshifts or deletions) in
L.
casei strains 21/1, M36, CRF28, UW4, A2-362, 32G, Lc-10, and Lpc-37. Collectively, these strains span the major MLST-defined genetic lineages for
L.
casei[
28,
30], which suggests the genes were not recently acquired. The
spaCBA and sortase genes of
L.
casei ATCC 334 show 95-99% nucleotide sequence identity to their counterparts in
L.
rhamnosus, whose locus is also flanked by transposase genes that are virtually identical to elements found in
L.
casei. There are no transposase genes in the immediate vicinity of the
spaCBA cluster in the
L.
casei strains sequenced to date. Collectively, these observations provide compelling evidence that the
spaCBA locus in
L.
rhamnosus GG and LMS2-1 may have originally been acquired via HGT from
L.
casei.
Overall, the composite nature of the
L.
casei strain-specific accessory gene pool and the presence of gene clusters in some strains that appear to have been recently acquired support our hypothesis that evolution of the
L.
casei supragenome has been heavily influenced by ecological co-localization with other bacterial species. Placed within the greater context of the DGH, we propose that
L.
casei, and probably other ecologically flexible species, have access to a supragenome whose composition is not exclusive to the species, but instead might be viewed as a subset of the microbial metagenome within a particular niche. While the primary mechanism(s) for supragenome access by
L.
casei are unknown, natural transformation has never been demonstrated in lactobacilli, and the prevalence of IS elements and plasmid-linked traits among genes that appear to have been recently acquired (see Additional file
3: Table S2) suggest that conjugation may be a key driver of HGT in
L.
casei. However, the widespread distribution of phage-related proteins among the
L.
casei accessory gene pool (Additional file
2: Table S1) suggest transduction could also be an important mechanism for genome evolution in this species.
Adaptive immunity against invasive DNA
Although conjugation and transduction may be important mechanisms for HGT in
L.
casei, these and other bacteria have also acquired sophisticated mechanisms to combat invasive bacteriophage and plasmid DNA. One key example is the CRISPR-Cas adaptive immunity system, which consists of clustered regularly interspaced short palindromic repeats (CRISPR) adjacent to
cas (CRISPR-associated) genes. The CRISPR loci are comprised of partially palindromic repeats separated by short stretches of "spacer" DNA that are acquired from invasive bacteriophage or plasmid DNA. Once present, these spacer sequences allow cells to recognize and cleave invasive DNA that contains those sequences [
57-
60]. Two distinct types of CRISPR loci were identified in the
L.
casei genomes. These two loci are typically characterized by idiosyncratic CRISPR repeats: 5’-GTCTCAGGTAGATGTCGAATCAATCAGTTCAAGAGC-3’ for the Type II-A (Lsal1 family) locus, and 5’-GTTTTCCCCGCACATGCGGGGGTGATCCC-3’ for the Type I-E (Ldbu1 family) locus. Such repeats have been previously identified in a variety of lactobacilli, including
L.
salivarius,
L.
casei and
L.
rhamnosus for Lsal1, and
L.
casei,
L.
delbrueckii,
L.
fermentum,
L.
acidophilus and
L.
brevis for Ldbu1 [
61]. Overall, CRISPR repeats were highly conserved, with >97% typical repeats across Lsal1, and >96% typical repeats for Ldbu1 (data not shown).
Lsal1-type CRISPR loci were identified in 11 strains, while Ldbu1-type loci were identified in 3 strains, with both types present in the M36 genome (Figure

A). Only four strains (32G, A2-362, 12A, and UW4) did not possess CRISPR loci. The
cas content for the Lsal1 loci is typical of Type II-A systems [
62], with the universal
cas1 and
cas2, in combination with the
cas9 signature gene. Also, a tracrRNA homologous to those found in Type II systems [
63] was identified in the intergenic region between
cas9 and
cas1. Though universal genes were highly conserved, up to 19% nucleotide polymorphism was observed for the
cas9 signature gene. The genomic location of these loci across the 13 strains was consistent, and occurred immediately following LCABL_23790 homologs (Figure

A). The
cas content for Ldbu1 is typical of Type I-E systems [
62], with the universal
cas1 and
cas2, in combination with the
cas3 signature gene, and Cascade-encoding genes, notably
cas6e[
62,
64].
As expected, the spacer content was hypervariable across CRISPR loci from different strains, with spacer numbers ranging from 4 to 44 for Lsal1, and between 21 and 60 for Ldbu1 (Figure

B). Of note, identical sets of spacers were conserved across cluster F strains (BL23, BD-II and LC2W) indicating very close genetic relatedness, which is also reflected in strain clustering based on overall gene content and MLST analysis (Figures

and ). Several sets of contiguous spacers were also conserved between cluster F strains and strains Lc-10 and Zhang, which suggest common ancestry or HGT inheritance of this CRISPR locus. In contrast, spacer content showed more typical hypervariability across the other strains, with only 1 spacer shared between UW1 and UCD174 (Figure

B). For Ldbu1, however, several sets of contiguous spacers were conserved between the dairy strains Lpc-37 and M36 (Figure

B).
Analysis of spacer sequences revealed that numerous spacers show homology to
Lactobacillus phages (Lc-Nu, A2, Lrm1, and J1) and plasmids (pYIT356, pREN, and pLgLA39) (Figure

). Analysis of sequence conservation in the direct vicinity of proto-spacers that showed similarity or identity to CRISPR spacers revealed the presence of proto-spacer associated motifs (PAM) [
65-
67], namely, a conserved TGAAA immediately downstream of the Lsal1 protospacers, and AAY immediately upstream of the Ldbu1 protospacers (Figure

). The TGAAA pentamer is homologous to the AGAAW PAM previously identified downstream of protospacers in
Streptococcus thermophilus Type II systems [
58,
65,
66].
Overall, CRISPR locus hypervariability across strains, in terms of occurrence, locus type and spacer content illustrates their functional value in response to environmental pressure, notably in providing resistance against phages. This polymorphism indicates CRISPR loci are desirable targets for typing of
L.
casei strains, in disagreement with a previous report [
68]. The critical role that CRISPR-Cas systems play in resistance to viruses has been documented in dairy cultures [
57,
58,
65,
66,
69], as well as environmental samples [
60,
70-
74]. The occurrence of numerous
L.
casei CRISPR spacers with homology to
Lactobacillus phages (notably Lc-Nu, Lrm1, A2 and phi AT3) that prey upon closely related species (Figure

) combined with the larger numbers of spacer sequences in strains isolated from commercial cheese production environments (Lc-10, Lpc-37 and M36) further underscores the selective pressure against phage infection that exists in industrial dairy manufacturing environments. The propensity of hypervariable CRISPR loci for HGT [
75] is also illustrated within
L.
casei by the co-occurrence of various IS elements for both types of loci (Figure

A), the sharing of contiguous spacer sets across strains that belong to different phylogenetic clusters (Figures

and B), and the skewed GC content of
cas genes (50-63% for Ldbu1
cas genes versus 46.5% genome-wide content). Overall, these results highlight the reliance of
L.
casei strains on CRISPR-Cas systems to provide immunity against invasive elements, as previously shown in bacteria and archaea [
57,
59,
60,
76,
77].
Evolution via genome decay
Our results and previous studies [
21,
28] have indicated HGT is a dominant force in genome evolution of
L.
casei, but a prior CGH experiment also provided evidence for a genetically distinct and geographically distributed cluster of
L.
casei cheese specialists whose evolution was accompanied by extensive decay of genes involved in carbohydrate utilization and transcriptional regulation [
28]. This hypothesis is supported by the fact that energy production in
L.
casei is primarily derived from carbohydrate fermentation, so niche adaptation should be heavily predicated by the ability of strains to utilize available carbohydrate. Fermenting plant material, for example, can contain a diverse array of simple and complex carbohydrates as well as sugar alcohols [
27], and many of these substrates will also be encountered in the GI tract as a consequence of diet. Thus, the ability to utilize C5 sugars and certain C5 and C6 sugar alcohols is more prevalent in
L.
casei isolated from plant material and the human GI tract than in cheese isolates [
30]. The overlap in carbohydrate availability and use by
L.
casei associated with plants or the gastrointestinal tract also supports the hypothesis that many strains from these environments should be viewed as ecological generalists, whereas adaptation to cheese has been accompanied by extensive genome decay that, ultimately, resulted in niche specialization [
28].
To explore the role of genome decay in the relationship between niche adaptation and substrate utilization, we tested a subset of nine stains (ATCC 334, 21/1, 32G, M36, CFR28, T71499, 12A, UW1 and UW4) distributed across the major MLST-defined
L.
casei lineages [
28,
30] for the ability to grow in CDAA supplemented with one of 60 different substrates associated with plant, gut, or dairy niches. Growth was detected on a total of 30 substrates, with individual strains able to utilize as few as 17 to as many as 26 different substrates (Additional file
5: Figure S3). Results showed the two cheese specialists UW1 and UW4, which share the same MLST lineage (Figure

), had the most restricted substrate profile with growth on 18 and 17 different substrates, respectively. Each of the other strains was able to utilize at least 20 different carbohydrates, while two of the corn silage isolates (32G and 12A) each grew on 26 substrates, although their profiles were not identical (Additional file
5: Figure S3).
The genetic basis for utilization of many of these substrates is unknown, and efforts to detail the impact of genome decay on different substrate profiles is further challenged by the fact that most of
L.
casei genome sequences used for this study are incomplete. Nonetheless, evidence for genome decay in the evolution of the cheese specialist strains UW1 and UW4 was observed in regard to genes for inulin, sucrose and cellobiose utilization. In
L.
casei, the ability to ferment sucrose and inulin is encoded by an operon for fructooligosaccharide utilization (
fos) [
78,
79] that is present in the other seven strains tested, which were all sucrose- and inulin-positive, but completely absent in UW1 and UW4.
In contrast to the single
fos operon, we identified nine distinct gene clusters among the 17
L.
casei genomes studied here that may function in cellobiose utilization (Figure

). Cellobiose is a disaccharide formed by enzymatic or acid hydrolysis of cellulose that, like sucrose and inulin, may be encountered in plant material or in the human gastrointestinal tract but not in milk or cheese. Each of the nine strains analyzed in this part of the study possessed single copies of one (UW1) to eight (21/1 and 12A) of these gene clusters (Figure

), and all but UW1 were able to ferment cellobiose (Additional file
5: Figure S3). While the function of each cluster in cellobiose utilization (as opposed to other β-glucosides) has not been demonstrated, data for strains ATCC 334 and UW4 show at least two of these clusters must enable cellobiose fermentation in
L.
casei (Figures

and S3). Cross comparison of the distribution of clusters 1 to 6 across the MLST lineages for the nine
L.
casei strains tested (Figures

and ) provides clear evidence of genome decay; many clusters are entirely absent, presumably due to deletion events, and all but 12A contained at least one cluster that was predicted to be nonfunctional due to frameshift mutations (Figure

). The single cluster in the cellobiose-negative strain UW1, for example, was predicted to be nonfunctional (Figure

). Finally, examination of all 17 genomes included in this study confirmed the cheese specialists UW1 and UW4 had the fewest total cellobiose clusters, and were the only strains lacking cluster 5.