|Home | About | Journals | Submit | Contact Us | Français|
Conceived and designed the experiments: DP CB MLDM PLB. Performed the experiments: DP CB MC. Analyzed the data: DP YQ PR MLDM PLB. Wrote the paper: DP YQ PR MLDM PLB.
The population structure and diversity of Lactococcus lactis subsp. lactis, a major industrial bacterium involved in milk fermentation, was determined at both gene and genome level. Seventy-six lactococcal isolates of various origins were studied by different genotyping methods and thirty-six strains displaying unique macrorestriction fingerprints were analyzed by a new multilocus sequence typing (MLST) scheme. This gene-based analysis was compared to genomic characteristics determined by pulsed-field gel electrophoresis (PFGE).
The MLST analysis revealed that L. lactis subsp. lactis is essentially clonal with infrequent intra- and intergenic recombination; also, despite its taxonomical classification as a subspecies, it displays a genetic diversity as substantial as that within several other bacterial species. Genome-based analysis revealed a genome size variability of 20%, a value typical of bacteria inhabiting different ecological niches, and that suggests a large pan-genome for this subspecies. However, the genomic characteristics (macrorestriction pattern, genome or chromosome size, plasmid content) did not correlate to the MLST-based phylogeny, with strains from the same sequence type (ST) differing by up to 230 kb in genome size.
The gene-based phylogeny was not fully consistent with the traditional classification into dairy and non-dairy strains but supported a new classification based on ecological separation between “environmental” strains, the main contributors to the genetic diversity within the subspecies, and “domesticated” strains, subject to recent genetic bottlenecks. Comparison between gene- and genome-based analyses revealed little relationship between core and dispensable genome phylogenies, indicating that clonal diversification and phenotypic variability of the “domesticated” strains essentially arose through substantial genomic flux within the dispensable genome.
The massively increasing amount of genomic data becoming available is raising questions about the classical view of bacterial species, particularly in terms of gene content. Beginning with the pioneering observation of Lan & Reeves , it is now established that sequencing a single genome fails to describe the genetic content of the species, and intraspecies variation needs to be considered to gain insight into the full “species genome”. This genome, alternatively named the pan-genome, is composed of a core genome made up of genes ubiquitously present in all strains of a given species, and a dispensable genome containing genes found only in single strains or particular lineages. Depending on the species and the number of strains sequenced, the core genome only represents from 40% to 80% of a single genome, and the pan-genome may be almost 4 times the size of a genome in a single strain –. Understanding the extent of the genetic diversity within a species should help the choice of strains to be sequenced for pan-genome characterization. A powerful method for population genetic studies is multilocus sequence typing (MLST) , a method based on the sequencing of a limited number (generally five to seven) genes of the core genome. MLST outperforms restriction- or other PCR-based typing methods, because it provides information about key features of the evolutionary history, the population structure, and long-term epidemiology of bacterial species , . Although MLST has been principally used to study the major bacterial pathogens, several recent MLST schemes have been developed for lactic acid bacteria (LAB), the most important group of microorganisms used for food processing, including the species Lactobacillus plantarum , Lactobacillus casei , , Oenococcus oeni , and Streptococcus thermophilus .
Lactococcus lactis is the major LAB species used in milk fermentation, a preservation process probably first developed in the Early Neolithic . This type of fermentation involving natural starters has been used empirically at a small scale for thousands of years, through the practice of back-slopping. Industrial scale fermentation started in the early-20th century with the use of defined single- and multiple-strain commercial starters . L. lactis is a microorganism that is generally recognized as safe (GRAS), and is also now used as a cell factory for production of recombinant proteins , and as a therapeutic drug delivery vector , . Taxonomically, it is a mesophilic Gram-positive species related to the Streptococcaceae ; it is subdivided into three subspecies, L. lactis subsp. hordniae, L. lactis subsp. lactis (including the biovar diacetylactis), and L. lactis subsp. cremoris. The two latter subspecies differ by less than 0.7% in their 16S rDNA sequences  but display an average of only 85% DNA identity at the genome level , a value slightly higher than that between Escherichia coli and Salmonella typhimurium . L. lactis subsp. lactis is found in various environments including animal sources, dairy products and plant surfaces , , whereas the subspecies cremoris is only isolated from raw milk and dairy products , , with few exceptions , . This ability of L. lactis subsp. lactis to colonize a larger ecological niche is associated to a greater genomic diversity, as revealed by DNA-fingerprinting analyses including random amplification of polymorphic DNA (RAPD) and pulsed-field gel electrophoresis (PFGE): subspecies lactis strains are spread across many clusters whereas subspecies cremoris strains are grouped in a small number of closely related clusters , , . Dairy strains of both subspecies tend to display lower diversity than non-dairy strains , , . The sole MLST scheme reported to date for L. lactis analyzed the nucleotide variability at five genetic loci of 89 L. lactis subsp. lactis and L. lactis subsp. cremoris isolates  and substantiated these previous observations. However, this study gave little information about the population structure and gene diversity/evolution of the species.
We report an analysis of the diversity of L. lactis subsp. lactis strains at both gene (MLST) and genome (PFGE) levels. Seventy-six lactococcal isolates were analyzed by various molecular typing methods to validate a collection of 36 strains. A new MLST scheme was constructed using a rational “top-down” approach, and the population structure, the genetic diversity, and a gene evolution model were estimated for this subspecies. Genome characteristics, including chromosome size and plasmid content, and genomic relatedness were estimated by PFGE analysis. The findings are informative about the origin and the ecology of the strains analyzed.
MLST is currently considered to be the gold standard method for studying strain relationships and the population structure of bacteria . Though dozens of MLST schemes have been developed to date, many of them do not follow the good practices for the rational development of MLST schemes , such as appropriate initial population sampling, rational choice of the genetic loci to be characterized, and use of suitable statistics (referred in this study as the quantity calculated from a set of data) for a robust estimation of genetic diversity.
Exploration of bacterial diversity by MLST requires the use of an appropriate strain collection validated by different genotyping methods. We therefore obtained 76 isolates from several public and industrial collections and used various phenotypic (ability to grow in milk or to utilize lactose) and genotypic (ribotyping, ARDRA, and partial 16S rDNA sequencing) methods to study them. Eighteen strains were discarded from the analysis because they presented ARDRA and/or ribotyping results that were either ambiguous or inconsistent with an affiliation to the L. lactis subsp. lactis group. Indeed, partial 16S rDNA sequencing (data not shown) revealed that these strains belonged either to the cremoris subspecies or to genera other than Lactococcus (Enterococcus faecalis, Enterococcus pseudoavium, Lactobacillus casei, and Leuconostoc citreum). The remaining 57 strains were subjected to SmaI-macrorestriction analysis by PFGE; an additional 21 strains were thereby excluded because they displayed a macrorestriction fingerprint identical to that of strains already selected for the collection. This observation enlightens strain redundancy that may exist in laboratory collections, probably because these collections are mostly constituted from phenotypic characterizations. Moreover, the PFGE analysis not only confirmed the close genomic relationships between known pairs of strains (IL594 and its plasmid-free derivative IL1403 , S86 and its [Lac]- spontaneous derivative S86-B), but also identified unexpected close relatedness between some other pairs of strains, such as LD01/LD02 and UCMA5713/UCMA5733. As strains of each pair differed by only one SmaI fragment identified as the lactose plasmid (by Southern-hybridization against the lacE gene carried on the lactose plasmid ), the pairs presumably correspond to [Lac] variants of the same strains. Thus, the bacterial collection validated for this study consisted of 36 strains displaying different SmaI macrorestriction patterns (pulsotype). Twenty-three of the strains originated from dairy environments, such as milk, fermented products, or starter strains, and 13 strains had been isolated from various non-dairy environments, including plants, animal skin, and sourdough bread (Table S1, provided as supplementary material).
Some of the criteria for the choice of the gene set have changed since the first proposals of MLST schemes. For instance, targeting only housekeeping genes emerged as optional , whereas choosing loci that follow the same evolutionary route (i.e. displaying congruent tree topology) may be essential to minimize noise when extracting phylogenetic signals from concatenated sequences . According to good practice for the rational development of a MLST scheme , we developed a new lactococcal MLST scheme using a four-step “top-down” approach. First, 33 loci were evaluated by in silico analysis using publicly available lactococcal nucleotide sequences, with emphasis on DNA polymorphism, chromosomal distribution, and gene paralogy. These loci were either markers commonly used in other eubacterial MLST schemes, including the five loci used for the previous L. lactis MLST scheme (atpA, bcaT, pepN, pepXP, and rpoA) , or indicators of the overall rate of genome divergence between bacterial species (recN, glyA, and metS[metG]) .
Fourteen loci (bcaT, glyA, pgk, dprA, pfk, comX, metS, mutX, rpoA, recN, tkt, pepXP, pdp, and xerS) fulfilling the above criteria were selected for the second step of the analysis: the determination of the entire DNA sequences of these genes in a subset of 13 strains of the collection displaying various levels of genomic diversity as assessed by PFGE analysis (data not shown). Five loci were rejected following this analysis: three loci (comX, mutX, and xerS) gave sequence data of too short length or of poor quality, one (rpoA) provided only a weak phylogenetic signal with only three SNPs among the 13 strains, and one (metS[metG]) did not clearly separate the lactis and cremoris subspecies. The findings for the metS[metG] marker support the recent observation that some genes encoding aminoacid-tRNA synthetases, though belonging to the core of the minimum bacterial gene set , , may be horizontally transferred between species or subspecies , , .
The third step of the MLST scheme design consisted of selecting the most polymorphic region of ≈500 bp in length for each of the nine loci selected, and determining the sequence of this region in each strain of the entire validated collection, excluding one strain from each parent/derivative pair (Table S1). For each locus, the quality of the phylogenetic signal was investigated by split decomposition analysis , a method allowing the visualization of conflicting signals in phylogenetic studies by representing incompatibilities between data as networks. Five loci (pepXP, recN, pdp, pgk, and glyA) gave classical tree-like structures, whereas three loci (dprA, pfk, and bcaT) gave little network-like structures (Fig. S1, provided as Supporting Information). In contrast, the tkt locus displayed a split-network arrangement typical of phylogenetic incompatibilities within data (Fig. S1) and was rejected from the scheme. Comparative analysis with four different lactococcal genomes revealed that tkt flanked a genomic island (data not shown), a chromosomal position prone to intragenic recombination leading to phylogenetic incongruence among Escherichia coli strains .
Finally, the MLST scheme was optimized to give the best compromise between a small number of loci to sequence and a large number of sequence types (STs) generated. With 10 to 13 alleles per locus, the combination of the eight loci selected allowed 26 STs to be distinguished among the 32 strains analyzed. As the five loci displaying no conflicting trees were uniformly distributed on the three sequenced genomes (data not shown), they were used as the backbone for the MLST scheme. This five-locus scheme generated 23 STs and addition of dprA, pfk, or bcaT loci individually generated 24, 25, and 25 STs respectively, whereas only the simultaneous addition of pfk and bcaT increased the number of STs to 26. These observations allowed the rejection of the dprA locus from the scheme. In addition, the pfk and bcaT loci are located only 51 kbp apart in the three sequenced genomes, and the phylogenetic tree generated by the concatenated sequence from the six-locus scheme did not change either its topology or its robustness relative to the seven-loci tree (data not shown); consequently, the pfk locus was removed from the scheme.
In conclusion, the new MLST scheme targeted six loci uniformly distributed along the chromosome (Fig. S2, provided as Supporting Information) and displaying little phylogenetic inconsistency: three housekeeping genes (glyA, pgk, and pdp), two catabolic genes (bcaT, and pepXP) genes, and one gene of the SOS regulon (recN). Note that two of these loci (recN and glyA) belong to the gene set identified as the best predictors of whole-genomes relatedness , whereas only two loci described in the previous lactococcal MLST scheme , bcaT and pepXP, were retained. The new MLST scheme allowed 25 ST to be distinguished among the 32 strains analyzed.
Survey of MLST studies showed that several statistics are used, sometimes with redundancy, to estimate the bacterial genetic diversity. In addition, many studies compare the level of genetic diversity between bacterial species although the loci selected for each MLST scheme are generally different and may have diverse evolutionary rates. Therefore, selecting which statistic to use for estimation of bacterial gene diversity level is not a trivial task since no comparative study has been performed to date. We analyzed the robustness of two statistics most commonly used in MLST studies -the percentage of variable sites and the nucleotide diversity (π, )- using concatenated DNA sequences obtained from MLST data of several bacterial species (Table 1). The maximal nucleotide diversity (πMAX), defined as the number of nucleotide differences per site between the two most divergent sequences within the population, was also included (Table 1). The sensitivity of these statistics to the sample size was estimated by calculating their values both from all available STs and then from a random sample of 25 STs (the size of our ST sample). As isolate redundancy within each ST cannot be exclude in absence of complementary genotypic characterization, this analysis was performed using only one sample from each ST (non redundant STs).
As expected , the percentage of variable sites was found to be very sensitive to the sample size and it rapidly reached large values, close to site saturation and without biological meaning, as the sample size increased (n>500). This behavior illustrates how the use of this statistic to estimate DNA polymorphism in MLST studies may be misleading. By contrast, the nucleotide diversity (π) was only slightly affected by the sample size. However, it was found strongly affected by the set of loci selected, as illustrated by the significant (p<0.0001, Welch's test) differences between π values found when comparing the two MLST schemes developed for the Acinetobacter calcoaceticus - A. baumannii (Acb) complex , . This indicates that π is inappropriate for comparing genetic diversity between bacterial species, unless using the same MLST scheme. In addition, π displayed high standard deviation values for some species, especially when the sample size was small (see for instance the values computed for 25 STs in Enterococcus faecium or Acb complex, Table 1). This statistic gives a global characterization of gene diversity and does not reveal sampling biases, such as errors in datasets (e.g. chimeric sequences or taxonomically misclassified isolates) or non-uniform population structures (e.g. the existence of independent genetic lineages within a species). These sampling biases were easily revealed by the maximal nucleotide diversity (πMAX), which is not directly sensitive to sampling size but only to the extreme values of sequence divergence, and by calculating the maximal to average pairwise nucleotide differences ratio (πMAX/π). For most species, this ratio was between 1.97 and 5.62, even for species known to contain several genetic lineages, for example Listeria monocytogenes . In contrast, four species (Streptococcus pneumoniae, Staphylococcus aureus, Enterococcus faecalis, and E. faecium) displayed ratios of between 8.25 and 22.65, with π values generally higher than 10%. Phylogenetic trees computed with ST sequences from these species (data not shown) revealed that only few STs (less than 4% of the population) contributed to these high values, such that the values fell to become similar to those for other species after removal of these “outlier” STs. For instance, πMAX value for E. faecalis dropped from 7.53% (πMAX/π ratio =10.31) to 1.71% (πMAX/π ratio =2.34) after removal of ST80, a chimeric ST made of E. faecalis and E. faecium sequences (data not shown). Similar chimeric STs also explained the aberrant values found for E. faecium (data not shown). This analysis led us to conclude that only π with its standard deviation, and πMAX statistics were appropriate for estimating intraspecific genetic diversity of data samples and for detecting particular population structures or sample biases. In addition, it led us to assume that πMAX might be suitable for comparing genetic diversity between bacterial species, even if estimated from different MLST schemes.
Twenty-two of the 25 STs included only single strains and the other three STs contained between two and seven strains (Table S1), with the most represented ST (ST6) including the reference strain IL1403 (with its parent IL594), LD01 (with its derivative, LD02), LD61 , LD90, and LD42. Genetic lineages in L. lactis subsp. lactis were identified by eBURST analysis , with clonal complex defined as group of STs sharing five of the six loci, and ancestor ST of each clonal complex defined as the ST with the highest number of neighboring STs (single locus variants, SLV). The 25 STs were distributed in 14 unique ST (singletons) and two clonal complexes (Fig. 1). The major clonal complex (CC1) included nine STs (corresponding to 20 strains) with ST15 identified as the ancestor genotype, whereas the second complex (CC2) comprised only two STs. A good correspondence was observed between strain origin and ST clustering, since the two clonal complexes (CC1 and CC2) contained only strains involved in milk processing (isolated from fermented products, or used as starters), with the exception of strains UCMA5713 and its [Lac]- variant UCMA5733. A dairy origin was, however, strongly suspected for these two UCMA strains because both were isolated from grassland close to a dairy factory (N. Desmasure, personal communication); also, strain UCMA5713 rapidly ferments milk (data not shown) and contains both lacE and prtP genes (Table S1). In contrast, eleven of the 14 singletons corresponded to non-dairy strains. Relaxing the parameters for lineage definition to double-locus-variants (DLV, defined as STs sharing 4/6 loci) resulted in the merging of only three STs into a single group (ST11, ST13 and ST19, Fig. 1). The remaining 11 STs differ from each other by three to six loci, suggesting high level of genetic diversity among the corresponding isolates.
The small contribution of homologous recombination to gene and genome evolution (see below) allowed strain relatedness to be assessed by classical tree-based phylogenetic analysis . We used the neighbor-joining method  with concatenated sequences (2,934-bp) of the six loci (Fig. 2a). This analysis revealed that STs corresponding mainly to strains involved in milk processing (i.e. STs from CC1, CC2, and ST12) formed a genetic lineage distinct from other STs (bootstrap value 90%) and split in two clusters (G1 and G2, bootstrap value 99%). Except for the two strains isolated from animal skin (ST13 and ST19), which grouped together with one strain isolated from milk (ST11), the remaining STs corresponding to strains isolated from plants or raw milk showed no tendency to cluster. However, the genetic distance within the subspecies was far below the distance observed between lactis and cremoris subspecies, as assessed by including the corresponding 2,934-bp sequences of the two sequenced cremoris strains, MG1363  and SK11  (Fig. 2b). This strongly supports the notion that subspecies lactis and cremoris indeed constitute two distinct genetic lineages that presumably diverged a long time ago , .
The genome relatedness of the 36 strains was estimated by computing Dice coefficients (SD) from pairwise comparisons of SmaI-macrorestriction patterns obtained by PFGE (Fig. S3a, provided as Supporting Information). Two-thirds of the strains (23/36) displayed values (SD<0.6) typical of unrelated strains , , with 55% of these values being between 0.11 and 0.35, the range observed when comparing the subspecies cremoris MG1363 strain to any subspecies lactis strain in the collection (yellow, Fig. S3a). This large diversity in genome fingerprints impeded robust UPGMA-based strain clustering, as no internal node was found to be significant when performing a bootstrap analysis (data not shown). Nevertheless, strains involved in milk processing were clearly separated from the other strains (compare Fig. 3 and Fig. S3b). However, strains belonging to the same ST were not necessarily clustered together, as fingerprints displayed unexpectedly low SD values (0.44<SD<0.76 for strains belonging to ST6, SD<0.48 for strains belonging to ST15, and SD=0.4 for those from ST9, Fig. S3a). Since PFGE essentially monitors genome rearrangements rather than mutations, such SD values strongly suggest high variability either in genome organization (for instance through rearrangements such as large inversions), or in genome content (through insertions/deletions of mobile genetic elements such as phages, ICEs, genomic islands etc.) within the subspecies lactis.
We measured intergenic recombination by estimating the linkage disequilibrium between the six loci, using the standardized index of association statistic, IAS . To minimize linkage disequilibrium introduced by sampling bias or recent expansion of adaptive genotypes , only one sample from each ST was analyzed. A significant linkage disequilibrium was found when considering either the 25 STs of the collection (IAS=0.387, p<0.001) or the 14 singletons (IAS=0.1214, p<0.01), but not when grouping the STs from CC1 and CC2 (IAS=0.055, p=0.198); this indicates that L. lactis subsp. lactis is essentially clonal. The intragenic recombination was estimated by empirical calculation of the per site ratio of recombination to mutation (r/m) statistic, which gives the relative probability that an individual nucleotide site will change by recombination or mutation . Briefly, this method compares allelic variation from the ancestral ST to the SLV belonging to the clonal complex. If the variant allele differs by one SNP from the ancestral sequence, with this SNP not found in other ST, the nucleotide difference is counted as a point mutation (m), and if the variant allele either differs from the ancestral sequence by several SNPs, or is found in unrelated ST(s), these different nucleotides are considered originating from a recombination event (r). Three loci (bcaT, pgk, and recN) displayed allelic variation within the CC1 (Table S1), with eight allelic changes from the ancestor sequence (ST15) corresponding to 14 SNPs (Fig. S4). Four SNPs could be assigned to point mutations whereas ten SNPs were considered to have occurred by recombination, giving a per site r/m ratio of 2.5:1, a low value for a bacterial species , . The main contributor for recombination events was pgk. Note that alleles 7 and 8 of this locus are closer to alleles 11 and 13 present in CC2 than to any allele present in CC1 (Fig. S4). Consequently, genetic variations at this locus may be responsible for the discrepancies in strain classifications observed between the allele- (eBURST, Fig. 1) and nucleotide-based (phylogenetic tree, Fig. 2a) methods. Both inter- and intragenic recombination tests, as well as the observation that only two loci (tkt and metS) displayed phylogenetic incompatibilities, among the 10 loci selected for the MLST scheme design, suggest that recombination may have happened, but has not played a major role in L. lactis subsp. lactis evolution.
We calculated the nucleotide diversity at each locus in L. lactis subsp. lactis (Table 2); it was from 0.66% for bcaT and pepXP, to 1.1% for glyA, with πMAX ranging from 1.87% (pgk) to 3.07% (recN). These values confirmed the different evolution rates of the genes used in the new MLST scheme. In addition, the π values for glyA and recN, two loci whose variability is strongly correlated to the overall genome pair variability , were very close to the value obtained for concatenated sequences (see below). This validated the new MLST scheme as representative of the core genome relatedness within the subspecies lactis. We computed π and πMAX from the concatenated sequences of the six loci: they were 0.82% (±0.1%) and 2.01%, respectively (Table 2). This πMAX value falls within the range of values calculated for several species including S. aureus and some Streptococci (Table 1). However, the diversity was distributed unequally between strains of different origins, with strains involved in milk processing (cluster G1+G2) displaying almost fivefold lower diversity than other strains (Table 2). Thus, the non-dairy strains are the essential contributors to the genetic diversity within the subspecies. Inclusion into the analysis of DNA sequences from the subsp. cremoris strains MG1363 and SK11 not only raised the π and πMAX values to 2.44% and 12.4% respectively, but also increased the π standard deviation to 0.98%, a value considered to be characteristic of highly divergent sequences (Table 1). These results strongly support the idea of the early separation of subspecies lactis and cremoris into two independent genetic lineages as suggested by the phylogenetic tree (Fig. 1b). Indeed, they indicate that the two subspecies should be analyzed separately in MLST studies.
A gene evolution model is generally established from MLST studies by computing the dN/dS ratio (ratio of the number of non-synonymous changes per non-synonymous site to the number of synonymous changes per synonymous sites). However, it has been demonstrated that the dN/dS ratio is not appropriate for inferring selection pressures from single bacterial populations, in which most differences between sequences represent segregating polymorphism rather than fixed substitutions, as assumed by the model . Therefore, we developed a gene evolution model using the less controversial statistical tests of neutrality, such as the Tajima's D test , and the coalescent-based Fu & Li's D and F tests . As substantial evidence indicates the separation of cremoris and lactis subspecies into two genetic lineages, the requirement of Fu & Li's D and F test for an outgroup sequence  could be fulfilled using DNA sequences from the subsp. cremoris strain MG1363. All three tests gave values that did not significantly deviate from zero (p>0.05, Table 2), indicating that the six loci evolved by random genetic drift.
Although the first lactococcal genome sequence available was of a strain belonging to subspecies lactis , little is known about genome variability in this subspecies. In addition, previous analyses essentially focused on dairy strains ,  and no information is available for strains of other origins. To estimate the extent of genome size differences between strains, and the contribution of plasmids and the chromosome to this genome size variation, various PFGE analyses were performed. Each strain contained from one to ten plasmids ranging from 2.2 kb to 120 kb in size (data not shown), making up 1% (35 kb, strain NCDO2118) to 12% (329 kb, strain UCMA5713) of the total genome (Table S1). Plasmid genetic markers determining metabolic properties important for dairy product manufacture, such as lactose (lacE gene) or casein catabolism (prtP gene), and citrate utilization (citP gene), were assigned to particular plasmids by Southern hybridization (Fig. 3, and Table S1). The two strains isolated from animal skin (NCDO2633 and NCDO2146) contained a copy of the lacE gene, but all other non-dairy strains had none of these markers. Among the dairy strains, nearly half (10/22) contained prtP and lacE, and the others either contained prtP (2/22) or lacE (6/22) alone, or neither (4/22). Therefore, although these two markers were generally associated with dairy strains, neither allowed unambiguous identification of strain origin. Strains containing the citP gene were grouped in three STs (ST6, ST15, and ST16) of the clonal complex CC1 (Fig. 3, and Table S1). This strongly supports the view that biovar diacetylactis may be distinguished by differences in chromosomal sequences not uniquely related to citrate utilization , and corresponds to a close genetic lineage among L. lactis subsp. lactis dairy strains.
The mean chromosome size was 2475 kb overall with about 15% difference between the smallest (2,304±41 kb in strain S188) and the largest (2,725±72 kb in strain A12) (Fig. 3 and Table S1). This is a larger range than found for streptococcal species assumed to contain an “open” genome . Indeed, this size spread ranks the subspecies lactis amongst bacterial species with high genome diversity (Fig. 4). The mean genome size for the different isolates (the sum of chromosome and all plasmids) was 2,619 kb, with about 20% difference between the extremes (2,359 kb in strain S188 and 2,930 kb in strain A12) (Fig. 3). Although the plasmid content significantly contributed to the genome size (Spearman ρ=0.69, p<10−5), no correlation between chromosome and plasmid sizes was detected (Spearman ρ=0.1, p=0.57). In addition, no correlation was found between strain origin and the size of its plasmid content (Mann-Whitney test, p=0.948), or its plasmid profile (data not shown). This was also true of other genomic characteristics (Fig. 3): there was no significant difference between dairy and non-dairy strains as concerns mean chromosome size (Mann-Whitney's test, p=0.616), or mean genome size (Mann-Whitney test, p=0.766). Lastly, the genomic features did not correlate with the MLST-based phylogenetic relationships between strains: some strains belonging to the same ST differed by up to 230 kb in total genome size (see for instance strains IL594 and LD61, Fig. 3), whereas some unrelated strains had similar chromosome sizes and plasmid contents (see for instance strains Co1 and LD02, Fig. 3).
To contribute to the characterization of the natural variability of L. lactis, we report a comparative evaluation of the genetic and genomic diversity of a collection of 36 strains isolated from different ecological sources and geographical areas. The various analyses revealed unexpectedly high variability within the subspecies lactis at both gene and genome levels, and gave clues about its population structure and evolution. These findings were not entirely coherent with the traditional division into dairy (i.e. isolated from dairy substrates) and non-dairy (i.e. isolated from other sources) strains, but rather support a new classification based on ecological separation between several ecotypes  corresponding to “domesticated” and “environmental” strains.
At the gene level, MLST analysis revealed two clonal complexes (CC1 and CC2) and 14 singletons. This genetic structure clearly clustered strains involved in milk processing, a human activity, and isolated from dairy starters (13 strains, Table S1) or fermented product (3 strains). These “domesticated” strains were almost exclusively found in the two clonal complexes, whereas “environmental” strains, isolated from various sources such as plant and animals (including raw milk), were scattered into unique STs. This demarcation was also observed in the phylogenetic tree built using the concatenated sequences, with “domesticated” strains clustering as a single clade that could be further decomposed into two genetic groups, G1 and G2 (with G1 including most strains from the biovar diacetylactis). In contrast, “environmental” strains were spread evenly across the phylogenetic tree and constitute the major contributors to the genetic diversity observed within the subspecies. The allelic distribution of all loci used in the MLST scheme strongly supported this opposition between the two ecotypes. This type of evolutionary pattern appears to be a general trend among the subspecies lactis and is not due to geographic sampling bias, because a similar separation has been observed when examining the phylogenetic tree produced by the alternative MLST scheme for lactococcal strains of other geographical origins .
The phylogenetic trees from both studies, rooted with strains from the subspecies cremoris, indicate that “environmental” strains appeared first, and that “domesticated” strains emerged only recently from a single founder event. It is assumed that high genetic diversity of “environmental” strains explains their ubiquitous presence in various natural environments (plants, animals and milk), but allows only poor growth during milk processing where they become a subdominant population. Such strains are expected to be only infrequently isolated from fermented products by standard bacteriological methods, as in the case of strains LD98 (Table S1 and Fig. 3) or ATCC 19435T . This hypothesis is further supported by the identification of numerous “environmental” strains in raw milks from different areas (data not shown). The emergence of the “domesticated” strains through a single founder event suggests acquisition of adaptative mutations that allowed the descendant of this lineage to become the dominant lactis subspecies population during milk processing. Possibly, the founder event was acquisition of the plasmid-encoded genes involved in casein or lactose catabolism, because both genes i) are highly prevalent in the “domesticated” strains, ii) are undoubtedly advantageous for rapid growth in milk as strains containing both functions are able to ferment milk, and iii) the lactococcal plasmids are known to carry other functions of adaptive value in milk , . However, in view of the versatility of such extra-chromosomal elements , , illustrated in this study by the complex distribution pattern of prtP and lacE genes with nearly half “domesticated” strains lacking one or both (Fig. 3), it appears more likely that there were several independent events, involving plasmid acquisition and loss. Indeed, the instability suggests that genes brought by these plasmids are not the key features responsible for the emergence of “domesticated” strains, and that only their presence in a subsample of the bacterial complex is essential, an assumption supported by the fact that artisanal whey, sourdoughs, and even defined commercial starters, are generally composed of several L. lactis strains.
In the absence of reliable universal molecular clock in bacteria , it is difficult to infer divergence times for the different evolutionary steps. Nevertheless, empirical cheese production at local scale by spontaneous fermentation or back-slopping over thousands of years would presumably have allowed the emergence of several independent genotypes adapted to milk processing; consequently, the uniqueness of the origin of “domesticated” strains, and low DNA polymorphism, are inconsistent with early lactococcal domestication of the order of 10,000 years ago. A simpler explanation would be that “domesticated” strains originate from a bottleneck event caused by the sampling of a very limited number of strains isolated from natural starters in the early 20th century, when defined commercial starters were first used for standardized cheese production . Subsequently, the emergence of the genetic group G1 is presumably associated with a second bottleneck allowing the emergence of fast acid-producing strains (corresponding to “modern” industrial strains) more adapted to the large-scale cheese production developed 40–50 years ago. These successive founder effects associated with human subsampling are supported by the different tests of neutrality, all of which indicate that each locus of the MLST scheme evolved by random genetic drift.
In contrast to the gene phylogeny, the macrorestriction analysis by PFGE did not allow robust strain clustering, except for few “modern” industrial strains of biovar diacetylactis belonging to the major ST (including the sequenced strain IL1403). In addition, the low SD values within this ST revealed unexpectedly high genome plasticity within the subspecies, with most macrorestriction fingerprints being as disparate within the subspecies as between subspecies lactis and cremoris. As 84% of the SmaI restriction sites found in the KF147 chromosome  are also present in the IL1403 chromosome (data not shown), the low SD values corresponded mostly to genome rearrangements such as inversions and insertion/excision of mobile genetic elements. This genome variability was also evident in the substantial variation in chromosome and total genome sizes (15% and 20%, respectively), indicating high fluctuation in strain-to-strain coding capacity. This range of genome size variability indicates that the pan-genome is as large as generally observed for species inhabiting diverse ecological niches, such as Lactobacillus sakei , Pseudomonas aeruginosa , and E. coli . As most of the strains analyzed (28/36) have a chromosome larger than the IL1403 chromosome, this strain cannot be considered as representative of the coding capacities of the subspecies. In addition, genome characteristics (chromosome size, plasmid content size, plasmid profile, and total genome size) did not correlate with strain origin or with MLST-based phylogeny, with strains indistinguishable by MLST displaying up to 230 kb differences in genome size. This suggests that clonal diversification and phenotypic variability of the “domesticated” strains are largely the consequences of substantial genomic flux within the dispensable genome. Although large differences between the sizes of genomes of closely related strains has been suggested to be common in prokaryotes , this has been reported to date for only few proteobacteria, notably Vibrio splendidus , Sinorhizobium meliloti  and E. coli .
In conclusion, the core genome-based phylogeny substantiates early separation of the L. lactis subspecies lactis and cremoris, leads to the proposal of a new strain classification within the subspecies lactis, and suggests that there have been several genetic bottlenecks in the evolutionary history of strains involved in milk processing. The use of MLST will be of great help in defining the ecological and phylogenetic status of new lactococcal strains, and may be more informative than other genotyping methods. The high genome variability suggests a large pan-genome for the subspecies. However, this variability correlated with neither the strain origin nor the gene-based phylogeny, so numerous strains from the different ecotypes will need to be sequenced to characterize the lactococcal pan-genome.
Lactococcus lactis strains were obtained from various laboratory and industrial collections (LMGM-Toulouse, France for Sx strains; LMA-Caen, France for UCMAx strains; LBAE-Auch, France for the A12 strain; SOREDAB-La Boissiere Ecole, France for LLx and LDx strains). NCDO strains were obtained from the collection held at INRA (Jouy-en-Josas, France). Bacteria were grown at 30°C on M17-broth (Merck KGaA, Darmstadt, Germany) supplemented with 5 g.l−1 (w/v) of lactose or glucose. The lactose fermentation test was performed on milk-citrate BCP agar medium . Strains are listed in Table S1 (provided as Supporting Information).
Genomic DNA was extracted using the “DNeasy™ tissue” kit according to the manufacturer's instructions (Qiagen, Hilden, Germany). DNA probes corresponding to genetic markers of important industrial traits (lacE, encoding the lactose-specific Enzyme II of the PTS system; prtP, encoding the cell envelope-associated serine proteinase; and citP, encoding the membrane bound citrate permease involved in citrate uptake) were obtained by PCR amplification, and radiolabeled with dATP-32P using the “Megaprime™ DNA labeling system” (GE Healthcare Europe, GmbH). Restriction enzymes were purchased from New England Biolabs (Ipswich, USA). The automated RiboPrinter® (DuPont Qualicon, Wilmington, USA) device was used for EcoRI-ribotyping, according to the manufacturer's instructions. The V1-V4 region of the 16S DNA was amplified and double-strand sequenced (Eurofins MWG operon, Ebersberg, Germany), using primers E8_F (5′-AGAGTTTGATCCTGGCTCAG-3′) and E807_R (5′-TGGACTACCAGGGTATCTAATC-3′). Internal fragments of each of the six loci, pepXP (X-prolyl-dipeptidyl aminopeptidase), recN (ATPase involved in DNA repair), pdp (pyrimidine-nucleoside phosphorylase), pgk (phosphoglycerate kinase), glyA (serine hydroxymethyltransferase), and bcaT (branched-chain-amino-acid aminotransferase), were amplified and double-strand sequenced (Eurofins MWG operon, Ebersberg, Germany) using the primers listed in Table S2 (Supporting Information). Primers were designed by standard procedures using Clone Manager version 9.0 software (Sci-Ed Software). PCR conditions were: initial denaturation at 94°C for 3 min; 30 cycles at 94°C for 45 s, 55°C for 1 min, 72°C for 1 min using a MJ Mini thermocycler (Bio-Rad, Hercules, USA) in a 50 µl-mixture containing 10 ng of genomic DNA, 200 µM of each dNTP, 0.2 µM of each primer, 2.5 U Taq polymerase in 1x thermopol buffer (New England Biolabs). PCR products were purified using the “QIAquick PCR” Purification Kit (Qiagen). The quality of every sequence chromatogram was checked manually and each SNP was considered as correct if present on both DNA strands.
Preparation of lactococcal DNA embedded in agarose matrix, digestion of DNA, pulsed field gel electrophoresis (PFGE), and Southern-blot with dried agarose gels were performed as previously described . The size of each digested restriction fragment was estimated manually by comparison with either λ DNA concatemers (lambda ladder PFG marker, New England Biolabs) or L. lactis IL1403 SmaI restriction fragments , with PFGE conditions optimized for optimal resolution (pulse times of 2 to 210 s, depending on the fragment size to be determined). The size of the chromosome in each strain was estimated by averaging the sum of restriction fragment sizes calculated from either single SmaI digestion or double I-CeuI/NotI digestions. SmaI endonuclease has previously been used to estimate the genome size of various lactococcal strains ,  but generally cuts large plasmids in one or two fragments, leading to a slight overestimation of the chromosome size. In contrast, I-CeuI and NotI do not cut lactococcal plasmids but generate a very large chromosomal fragment (>1.5 Mb) whose size is difficult to determine accurately by electrophoresis . When applied to the IL1403 chromosome, this averaging method gave a value (2,411±14 kb) close to the size (2,365 kb) calculated from the chromosome sequence . Plasmid DNA was linearized by S1 nuclease digestion . Briefly, DNA embedded in agarose matrix was incubated at 37°C for 40 min with 2.5 units of S1 nuclease in 200 µl of 1x S1 buffer (Promega, Madison, USA). The reaction was stopped by adding 1 ml of TE 10/50 (Tris-Cl pH 8, 10 mM; EDTA 50 mM) and samples were kept on ice until PFGE electrophoresis.
The genomic relatedness of bacterial strains was estimated from pairwise comparisons of PFGE SmaI-macrorestriction patterns, and a matrix of binary data was constructed based on the presence/absence of each band. Dice coefficients (SD) for each pairwise comparison and corresponding genomic distances (1-SD) were calculated from the matrix using the WINBOOT program . A UPGMA dendrogram was constructed with the NEIGHBOR program in the PHYLIP package v3.69 . Bootstrap analysis of the UPGMA tree was performed using the WINBOOT program with 1000 pseudoreplications. For MLST analysis, forward and reverse DNA sequences were trimmed, aligned, and analyzed using MEGA4 v4.1 . Conflicting phylogenetic signals were analyzed by split decomposition using SplitsTree v4.10 . Allele and isolate dataset creation, arbitrary allele numbering, and Sequence Type (ST) assignation were done using mlstdbNet software . STs clustering into clonal complexes (CC) and founder assignation were performed using eBURST . Neighbor-joining trees (bootstrap 1000 using the Kimura two-parameter model ) were established with MEGA4 v4.1. The number of segregating sites (S), nucleotide diversity (π), Tajima's D, and Fu & Li's D and F values were calculated using DnaSP v5.10 . The standardized index of association (IAS) was calculated using LIAN 3.5 (http://gump.auburn.edu/cgi-bin/lian/lian.cgi.pl). The MLST data from several bacterial species were downloaded from different MLST web sites (http://www.pasteur.fr/recherche/genopole/PF8/mlst/, http://www.mlst.net/, http://pubmlst.org/). Sequences with missing data were removed from the database by manual inspection using MEGA4 v4.1, and redundant sequences were removed using the NRDB program (http://pubmlst.org/perl/mlstanalyse/mlstanalyse.pl?site=pubmlst&page=nrdb&referer=pubmlst.org). πMAX values were extracted from the squared similarity matrix calculated with the DNADIST program (D option set to “similarity table”) in the PHYLIP v3.69 package .
Allele sequences of the six MLST loci were deposited in Genbank/EMBL under the accession numbers HM597775 to HM597845. Sequence data are also available through our MLST web site (http://www-mlst.biotoul.fr/).
L. lactis subsp lactis characteristics of the strains used in this study.
Primers used for the MLST.
Split decomposition analysis of the different alleles at each individual locus. The conflicting phylogenetic tree topologies are illustrated by interconnected network. Numbers indicate allele number.
Location of loci used in the MLST scheme on the chromosome of IL1403 strain.
a) Matrix of SD values for all pairwise comparisons of PFGE fingerprints. b) UPGMA dendrogram derived from the SD values.
Polymorphic nucleotide sites found in the 32 L. lactis subsp. lactis strains at the six MLST genes. Only polymorphic sites are shown, with numbering starting at the beginning of the aligned sequence portion of each gene.
This work is dedicated to Dr. Pascal Le Bourgeois' friend and colleague Jean-Marc Reyrat (deceased 10-28-2009). We thank Véronique Monnet, Emmanuelle Maguin, Nathalie Desmasure, and Catherine Faucher for providing bacterial strains, Stephane Chailloux and Emmanuelle Maguin for helpful discussions.
Competing Interests: The authors have declared that no competing interests exist.
Funding: This work was carried out with the financial support of the « ANR- Agence Nationale de la Recherche - The French National Research Agency » under the « Programme National de Recherche en Alimentation et nutrition humaine », project « ANR-05-PNRA-20, Génoferment». D. Passerini was supported by a fellowship from the Ministère de l'Enseignement Supérieur et de la Recherche. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.