The comparative analysis presented here of 81 bacterial genomes, covering 6 genera and 43 different species, was performed by grouping their genes into gene families and comparing the core and pan-genomes of various subsets of genomes. The findings frequently confirmed taxonomic relationships but did not identify common signatures, in terms of gene content, for all non-pathogenic bacteria included in the analysis. This finding is surprising, as all these species occupy a similar niche. Conserved genes were compared by means of a consensus tree, while variably present genes were analyzed by cluster analysis. The latter indicated that Leuconostoc genomes share a considerable number of variable genes with Lactobacillus. Functional analysis of the proteins encoded by the genes comprising a genus’ core genome identified relatively strong conservation of information storage genes; this was observed for all genera analyzed. When all genomes were divided into a pathogenic and a non-pathogenic group, the pan-genomes of the two groups showed a surprisingly similar COG distribution; however, their core genomes differed considerably. In the core genome of the non-pathogenic genomes, genes for post-translational modification and chaperones were enriched.
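To make the terms concrete, the following minimal Python sketch shows how a pan-genome and a core genome can be derived once genes have been grouped into families. It assumes that each genome is already represented as a set of gene-family identifiers. The genome names and family labels are hypothetical, and the strict "present in every genome" definition of the core is an assumption for illustration, not necessarily the exact criterion applied in this study.

```python
def pan_genome(genomes):
    """Return every gene family seen in at least one genome (set union)."""
    pan = set()
    for families in genomes.values():
        pan |= set(families)
    return pan


def core_genome(genomes):
    """Return the gene families present in every genome (set intersection)."""
    family_sets = [set(families) for families in genomes.values()]
    if not family_sets:
        return set()
    return set.intersection(*family_sets)


# Hypothetical toy data: three genomes, gene families labelled fam1..fam6.
toy_genomes = {
    "genome_A": {"fam1", "fam2", "fam3", "fam4"},
    "genome_B": {"fam1", "fam2", "fam5"},
    "genome_C": {"fam1", "fam2", "fam3", "fam6"},
}

pan = pan_genome(toy_genomes)
core = core_genome(toy_genomes)
print(len(pan), "gene families in the pan-genome")
print(sorted(core), "form the core genome")
```

In this toy case the pan-genome grows with every added genome while the core can only shrink, which is the behaviour that the pan- and core genome comparisons rely on.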
A simultaneous comparison of the pan- and core genomes of publicly available genomes of Lactobacillus, as was performed here, has not been published before, but similar analyses have been published for smaller selections of organisms. Canchaya and co-workers [2] performed comparative genomics of the then five available Lactobacillus genomes from different species and commented on the high variability within this genus. Schleifer and Ludwig [23] stated that “It is widely recognized that the taxonomy of this genus is unsatisfactory due to the highly heterogeneous nature of its members”. Indeed, data presented here illustrate the diversity within Lactobacillus. However, the heterogeneity of this genus is not larger than that of other bacteria. Using the same comparison criteria as applied here, the pan-genome of 53 E. coli genomes was found to comprise 13,000 gene families, even within this single species [18]. Similarly, an analysis of 27 genomes from 7 Vibrio species produced a pan-genome of nearly 15,000 gene families for this genus [31], and 38 genomes of 5 Burkholderia species contained as many as 26,000 gene families [28]. Thus, the diversity in gene content within the genus Lactobacillus, based on the genome sequences currently available, is not exceptional in the bacterial world.
Our analyses are mainly based on core genomes, an approach that others followed as well [2]. Those authors had defined a core genome for Lactobacillus whose size is similar to our findings. However, the fraction of identified orthologous genes in the pairwise comparisons performed by those authors ranges from 52.3% to 68.9%, which is much higher than our findings of between 12% and 42%, shown in the BLAST Matrix of Fig. . The difference may be due to the way these percentages were calculated. Whereas we express these as the fraction of gene families found in the reciprocal pan-genome of the pair of analyzed genomes, their calculations are different, and they do not state the cut-off used to recognize orthologous genes as such. In view of their limited reported range, we believe our way of expressing pairwise homology is more useful, as it gives a more sensitive measure. In a subsequent publication, comparative genomics was performed with a larger set of 12 Lactobacillus
]. Inclusion of 7 more genomes reduced their core genome to 141 genes, which indicates that they used stricter inclusion criteria than the 50–50 rule we applied. Similar to our analysis, these authors compared the COG classes of the core genes they had identified, and they also reported that the largest class consisted of genes involved in translation, followed by replication.
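How strongly a reported pairwise percentage depends on its denominator can be illustrated with a short Python sketch. It reads "fraction of gene families found in the reciprocal pan-genome of the pair" as shared families divided by the union of families of the two genomes, which is our interpretation for the purpose of illustration. The second function, which normalizes by the smaller genome, is a purely hypothetical alternative and is not claimed to be the calculation used by the other authors. All family labels are invented.

```python
def shared_fraction(families_a, families_b):
    """Shared gene families as a fraction of the pair's combined pan-genome."""
    a, b = set(families_a), set(families_b)
    pair_pan = a | b
    if not pair_pan:
        return 0.0
    return len(a & b) / len(pair_pan)


def shared_fraction_of_smaller(families_a, families_b):
    """Hypothetical alternative: shared families relative to the smaller genome."""
    a, b = set(families_a), set(families_b)
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 0.0
    return len(a & b) / smaller


# Hypothetical toy genomes: 2 shared families, 8 families in total.
genome_x = {"fam1", "fam2", "fam3", "fam4"}
genome_y = {"fam1", "fam2", "fam5", "fam6", "fam7", "fam8"}

print(shared_fraction(genome_x, genome_y))             # 2 / 8
print(shared_fraction_of_smaller(genome_x, genome_y))  # 2 / 4
```

The same pair of genomes yields 25% under the first definition and 50% under the second, so percentages from different studies are only comparable when the denominator and the orthology cut-off are both stated.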
Comparative genomics of both Lactobacillus and Bifidobacterium was presented in a review [30], which mentioned the ability of Bifidobacterium to “synthesize at least 19 amino acids and (…) all of the enzymes that are needed for the biosynthesis of pyrimidine and purine nucleotides”. These authors further emphasized the importance of carbohydrate metabolism for Bifidobacterium, with its capability to degrade complex sugars. Indeed, top-level metabolism genes form a major part of the Bifidobacterium core genome (Fig. ), with class E (amino acid metabolism) as the largest fraction within that category. When we compare this core genome with that of Lactobacillus (Fig. ), our analysis shows that class F genes (nucleotide metabolism) comprise the largest metabolism gene fraction in the Lactobacillus core genome. Ventura and co-workers [30] used a known physiological characteristic (Bifidobacterium species are known for their prototrophy) and looked for evidence of this in the genomes. In contrast, we performed a neutral analysis of pan- and core genome COG class representation and then compared this between genera. Our approach reveals novel insights that would remain unnoticed when known phenotypes are taken as a starting point, for instance the conservation of COG class O genes, involved in post-translational modification and chaperones, in both of these genera.
The authors of a recent review on Bifidobacterium
] pointed out that most Bifidobacterium genomes have been sequenced from organisms that have a long history of culture outside their natural habitat, the gut, with the exception of B. longum DJO10A. There is good evidence that the genome of Bifidobacterium is subject to gene reduction to adapt to prolonged culture conditions. This could potentially bias our comparison of Bifidobacterium genomes with those of the other probiotic organisms.
The term ‘lactic acid bacteria’ (LAB) is commonly used to describe bacteria used as starter cultures and for the fermentation of foodstuffs. LAB can refer to species from the genera Lactobacillus
or all of the Lactobacillales, and sometimes includes Bifidobacterium as well. However, there are good reasons why these bacteria have been placed into different genera and phyla. The analyses presented here support their current taxonomic positions and stress their differences in gene content. The term LAB incorrectly suggests that all these organisms are somehow related, a view that is still being presented in the literature [15]. The term is misleading, as the genetic content of these various genera differs significantly. Moreover, some of the genera within LAB comprise only non-pathogenic species (Leuconostoc), whereas other genera are a mixture of pathogenic and non-pathogenic species and strains (Streptococcus). It would be better to refrain from using the term LAB, as there is no common denominator, other than the production of lactic acid (which is not restricted to these organisms), to collectively describe all species and strains supposedly included in this diverse group of organisms.
An extensive comparative study of Enterococcus genomes could not be found in the literature. Most studies concentrate on the pathogenicity of E. faecalis. Vebø and co-workers [29] compared probiotic and (uro-)pathogenic E. faecalis genomes; however, those comparisons were not based on sequence data. The Enterococcus genomes we have included were mostly from pathogenic organisms (only two non-pathogenic E. faecalis strains whose sequences were nearing completion were publicly available at the time of analysis), which limits the strength of this analysis, as it cannot be used to compare and contrast multiple non-pathogenic with pathogenic Enterococcus genomes. The 11 genomes included represent only 4 species, giving a pan-genome of nearly 8,000 gene families. The first four species of Lactobacillus genomes in the pan-genome plots of Fig. produce smaller pan-genomes, which suggests that the diversity of Enterococcus could be at least as extensive as that of Lactobacillus. The pairwise BLAST comparison within this genus resulted in homology varying from 24% to 84%, again indicating extensive intra-genus diversity.
are frequently considered closely related, but the BLAST Matrix comparing all included genomes (Supplementary Fig. S1) does not support this. Instead, somewhat surprisingly, the observed homology between Leuconostoc genomes is slightly higher than that between Streptococcus. On the other hand, Lc. lactis was positioned in between these two genera in the tree based on variable gene content. A shared gene pool between these genera can be hypothesized. Based on the conserved core genes, however, Enterococcus is more closely related to Streptococcus, while Lactococcus is more distinct.
A small comparative study of Streptococcus genomes combined with MLST suggested that S. thermophilus is a relatively young clone, evolved by genome reduction which removed or inactivated Streptococcus virulence genes [13]. It is possible, however, that the reduced genomes observed are the result of prolonged use as starter cultures, as no fresh human isolates have been sequenced to date. In a short review, Delorme [5] states that “S. thermophilus is related to Lactococcus lactis …”. Indeed, from the all-against-all BLAST Matrix, a similarity between 17.3% and 20.2% is recorded between genomes of these two species, which is higher than that between S. thermophilus and any other non-streptococcal genome. However, Lc. lactis also shares 16.0% to 18.0% of reciprocal genes with S. suis, so these overlapping percentages of gene similarity are no indicator of similarity in (probiotic) phenotype. Within the Streptococcus genus, the stated similarity of S. thermophilus with Streptococcus sanguinis (the only member of the viridans group for which a genome sequence is available) is confirmed in our BLAST Matrix, but an even higher similarity is found with Streptococcus gordonii.
The COG analysis of the core genomes of separate genera identified both similarities and differences. The three top-level functional COG groups are fairly evenly distributed over the functionally annotated pan-genes of all species, but their core genomes differ. Notably, Lactobacillus and Leuconostoc both have a smaller fraction of metabolism core genes than the other four genera and a larger information storage gene fraction. Information storage genes are essential, but redundancy allows so much variation between organisms that they are not all maintained in a core genome of diverse species. In the approach presented here, we first identified the core genomes of groups of bacteria and then sorted the genes in these core genomes into top-level COG categories. As a consequence, genes that were insufficiently conserved in sequence to be maintained in the core genome are removed, despite their possible functional conservation. Using this approach, we found no correlation between the diversity within a genus (using the difference between its pan- and core genome as a measure) and the fraction of its information storage COG genes. This lack of correlation is illustrated by the core genomes of Bifidobacterium (724, or 10% of its pan-genome) and Leuconostoc (1,164, or 40% of its pan-genome). These two core genomes contain 34% and 31% information storage genes, respectively, despite a huge difference in the degree of variation in these two genera.
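The sorting step described above can be sketched in Python as follows. The sketch assumes that every core gene family carries a single one-letter COG class and that the letters are grouped into the three top-level functional groups of the standard COG scheme. Families without a functional class (here marked "-", and also classes R and S) are skipped, which is an assumption about how "functionally annotated" is defined. The family names and assignments are hypothetical.

```python
from collections import Counter

# Standard top-level grouping of one-letter COG classes (functional groups only).
TOP_LEVEL = {
    "information storage": set("JAKLB"),
    "cellular processes": set("DYVTMNZWUO"),
    "metabolism": set("CGEFHIPQ"),
}


def top_level_fractions(cog_by_family):
    """Fraction of functionally annotated families in each top-level COG group."""
    counts = Counter()
    for cog_class in cog_by_family.values():
        for group, letters in TOP_LEVEL.items():
            if cog_class in letters:
                counts[group] += 1
                break  # each family is counted once
    total = sum(counts.values())
    if total == 0:
        return {}
    return {group: counts[group] / total for group in TOP_LEVEL}


# Hypothetical core genome: family -> COG class ("-" means no functional class).
toy_core = {
    "fam1": "J", "fam2": "J", "fam3": "L", "fam4": "K",
    "fam5": "E", "fam6": "F", "fam7": "O", "fam8": "M",
    "fam9": "-",
}

for group, fraction in top_level_fractions(toy_core).items():
    print(f"{group}: {fraction:.0%}")
```

Because the fractions are computed only over genes that survived the core genome filter, a functionally conserved gene that diverged too far in sequence never enters the tally, which is the limitation noted in the text.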
Of particular interest is the COG analysis where all genomes were divided into a pathogenic and a non-pathogenic group. Virulence genes are not a separate COG category, but from the comparison of the core genome of the pathogenic group with that of the non-pathogenic group, we can hypothesize that genes belonging to COG categories M (cell wall/membrane biosynthesis) and O (post-translational modification, chaperones) would mostly contribute to virulence. Conversely, it could be assumed that genes highly overrepresented in the core genome of the non-pathogenic group (compared to the core genome of the pathogenic group) most likely contribute to their probiotic or fermentative lifestyle. We observe enrichment for genes belonging to COG class J (translation, ribosomal structure and biogenesis) and again O (post-translational modification and chaperones). The finding that core genes of the non-pathogenic isolates are more frequently information storage genes and less frequently metabolic genes than the core genes of pathogens is counter-intuitive. It is generally accepted that commensals and probiotic strains are most adequately equipped to live in the intestine, which would imply that they share a large number of (conserved) metabolic genes to do so. Instead, the reduced metabolism gene fraction in their core genome suggests that there is large variation within these genes, which reflects the diversity of the various commensal, fermentative and probiotic isolates. The vast enrichment for information storage genes in the core genome of the non-pathogenic organisms is possibly a reflection of the relatively poor conservation of all other functional classes in this group, an effect that appears to be less pronounced in the (ecologically more diverse) pathogenic group. The fact that Bifidobacterium genomes are not present in the pathogenic group may have skewed these results slightly.
A more accurate prediction of conserved genes with an important role in bacteria with a non-pathogenic lifestyle may become possible in the future, when more non-pathogenic Enterococcus genomes become available, which will allow comparison of gene content within a genus or even a species.