As more bacterial genome sequences become available and are compared, it has become clear that the genetic variation within bacterial species is greater than previously predicted [1, 2]. Rapid and reliable sub-typing of bacterial pathogens is important for identifying outbreaks and monitoring trends, for establishing population structure, and for studying the evolution of bacterial genomes, especially within and between outbreak strains. Today, the most widely used typing methods for bacterial genomes include multilocus sequence typing (MLST), pulsed-field gel electrophoresis (PFGE), sequencing of 16S rRNA genes, and multilocus variable-number tandem-repeat analysis (MLVA).
PFGE and MLVA have major benefits, but they are time-consuming and their results are difficult to standardize [3]. Other typing methods rely on one or a few ubiquitous genes, such as the 16S rRNA gene or the set of housekeeping genes used in MLST. These methods can classify isolates at the species level and sometimes at the subspecies level, but the biological information in such a narrow selection of genes is rarely sufficient to clearly distinguish between closely related strains, such as several isolates of the same serotype [4-6]. Thus, more of the genome content should be considered rather than just one or a few genes [4].
The price and time for whole genome sequencing (WGS) will soon be in the same range as for the traditional typing methods mentioned above. Genome sequencing can be a powerful method in epidemiological and evolutionary investigations [7-9]. To date, however, it has only been used in more limited epidemiological investigations, in which isolates suspected to be part of the same outbreak have been compared to a reference genome. In the future, it is likely that WGS will become a routine tool for identification and characterization of bacterial isolates, as hinted at by the first 'real-time' sequencing during the E. coli O104 outbreak in Germany in the summer of 2011 [10] and the Vibrio cholerae outbreak in Haiti in October 2010 [11]. Such routine use requires standard procedures for identifying variation and for analyzing similarities and differences.
Conserved genes are present across bacterial genomes of the same species (or genus). The subset of these genes that is conserved in all (or most) of the genomes of a given bacterial taxonomic group is called the 'core-genome' of that group. The core-genome can be identified at either the genus or the species level [3] and can be used to identify the variable genes in a given genome [12]. In addition, conserved genes in general appear to evolve more slowly and can be used for determining relationships among bacterial isolates [13].
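As a minimal sketch of the definition above, the following Python snippet derives a core-genome and the variable genes of one genome from per-genome sets of gene families. The genome names, gene family labels and the `min_fraction` threshold are hypothetical and chosen only for illustration; in particular, how "most" genomes is quantified is an assumption, not a value taken from this study.

```python
from typing import Dict, Set


def core_genome(gene_sets: Dict[str, Set[str]], min_fraction: float = 1.0) -> Set[str]:
    """Return the gene families present in at least `min_fraction` of the genomes.

    `gene_sets` maps a genome name to the set of gene families found in it.
    `min_fraction` = 1.0 gives a strict core (present in every genome);
    lower values give a relaxed core (an illustrative assumption).
    """
    if not gene_sets:
        return set()
    n_genomes = len(gene_sets)
    counts: Dict[str, int] = {}
    for genes in gene_sets.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1
    return {gene for gene, count in counts.items() if count / n_genomes >= min_fraction}


# Hypothetical toy data: three genomes described by placeholder family labels.
genomes = {
    "genome_A": {"fam1", "fam2", "fam3", "fam4"},
    "genome_B": {"fam1", "fam2", "fam3"},
    "genome_C": {"fam1", "fam2", "fam5"},
}

strict_core = core_genome(genomes)            # families found in all three genomes
relaxed_core = core_genome(genomes, 2 / 3)    # families found in at least two of three

# Variable genes of a single genome: everything outside the strict core.
variable_in_A = genomes["genome_A"] - strict_core
```

Real analyses would first have to cluster genes into families (for example by sequence similarity), a step that this sketch takes as given.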
Currently there are more than a hundred bacterial species for which sufficient genomic data are available to estimate the species core-genome (that is, at least three genomes from the same species have been sequenced) [14]. Among these, Salmonella enterica is a good candidate species for conserved gene identification because its genomes are quite similar [15]. Moreover, S. enterica is one of the most important food-borne pathogens and is responsible for global outbreaks [16], which makes internationally standardized typing procedures of major importance for global comparisons [17]. The Salmonella genus has only two species with sequenced genomes: Salmonella bongori and Salmonella enterica. In turn, S. enterica is divided into six subspecies: enterica, salamae, arizonae, diarizonae, houtenae and indica. Presently, S. enterica is classified into more than 2,500 serotypes [18].
Characterizing Salmonella isolates from genome data is a crucial step in investigating an outbreak caused by Salmonella. Salmonella genomes are highly similar, particularly within subspecies enterica, where little variation exists between genomes [15]. This high similarity presents a challenge for typing and classification.
In their pioneering work, Tettelin et al. [1] defined the core genes of a species as those genes present in (nearly) all known members of the species. Since then, others have studied core and pan genomes at the genus level or even at the kingdom level [19], but for our purposes the original species-level definition is suitable. In this work we identify the core genes within S. enterica genomes and determine the variation between the available genomes, both in terms of sequence and in terms of the presence/absence of non-core genes; in the latter case we use a method originally published by Snipen & Ussery [20]. We evaluate different approaches for classifying isolates in epidemiological settings and compare our findings to currently used sequencing methods, both in long-term trend analysis and in outbreak investigations.
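To illustrate what comparing genomes by presence/absence of genes can look like, the sketch below counts the gene families that two genomes do not share, which equals the Manhattan distance between their binary presence/absence profiles. This is a simple stand-in chosen for illustration; this section does not specify the details of the method of [20], and the genome names and family labels are hypothetical.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def gene_content_distance(a: Set[str], b: Set[str]) -> int:
    """Number of gene families present in exactly one of the two genomes.

    For binary presence/absence vectors this equals the Manhattan distance.
    """
    return len(a ^ b)  # symmetric difference


def pairwise_distances(gene_sets: Dict[str, Set[str]]) -> Dict[Tuple[str, str], int]:
    """Gene-content distance for every unordered pair of genomes."""
    return {
        (x, y): gene_content_distance(gene_sets[x], gene_sets[y])
        for x, y in combinations(sorted(gene_sets), 2)
    }


# Hypothetical toy data with placeholder gene family labels.
profiles = {
    "isolate_1": {"fam1", "fam2", "fam3"},
    "isolate_2": {"fam1", "fam2", "fam4"},
    "isolate_3": {"fam1", "fam2", "fam3", "fam5"},
}

distances = pairwise_distances(profiles)
```

A distance matrix of this kind could then be passed to a standard hierarchical clustering routine to group isolates by gene content.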