Salmonella enterica serovar Typhimurium is a Gram-negative pathogen that causes gastroenteritis in humans and a typhoid-like disease in mice and is often used as a model for the disease promoted by the human-adapted S. enterica serovar Typhi. Despite its health importance, the only S. Typhimurium strain for which the complete genomic sequence has been determined is the avirulent LT2 strain, which is extensively used in genetic and physiologic studies. Here, we report the complete genomic sequence of the S. Typhimurium strain 14028s, as well as those of its progenitor and two additional derivatives. Comparison of these S. Typhimurium genomes revealed differences in the patterns of sequence evolution and the complete inventory of genetic alterations incurred in virulent and avirulent strains, as well as the sequence changes accumulated during laboratory passage of pathogenic organisms.
The genomes of related bacteria can differ in three ways: (i) gene content, where one bacterial species or strain harbors genes absent from the other organism; (ii) nucleotide substitutions within largely conserved DNA sequences, which can result in amino acid changes in orthologous proteins, form pseudogenes, and promote distinct expression patterns of genes present in the two organisms; and (iii) changes in gene arrangement, caused by inversions and translocations. These differences have been observed not only across bacterial species but also among strains belonging to the same species. Recent genomic analyses have revealed that many bacterial pathogens of humans are virtually monomorphic (1) and exhibit very limited sequence diversity, raising questions about the nature of the genetic changes governing distinct behaviors. Furthermore, several bacterial pathogens that have been subjected to extensive passage in the laboratory display altered virulence characteristics, but the genetic basis for these alterations remains largely unknown. Here, we address both of these questions by determining and analyzing the genome sequences of closely related isolates of Salmonella enterica serovar Typhimurium, a Gram-negative pathogen that has been used as a preeminent model to investigate basic genetic mechanisms (2, 8, 46, 59), as well as the interaction between bacterial pathogens and mammalian hosts (11, 41).
The genus Salmonella is divided into two species: Salmonella bongori and Salmonella enterica, which together comprise over 2,300 serovars differing in host specificity and the disease conditions they promote in various hosts. For example, S. enterica serovar Typhi is human restricted and causes typhoid fever, whereas serovar Typhimurium is a broad-host-range organism that causes gastroenteritis in humans and a typhoid-like disease in mice. Although the complete genome sequences of 15 Salmonella enterica strains are available, there is only a single representative of S. Typhimurium—strain LT2 (31). Despite its wide application in genetic analysis, strain LT2 is highly attenuated for virulence in both in vitro and in vivo assays (52, 56), leading many investigators to use other S. Typhimurium isolates to examine the genetic basis for bacterial pathogenesis (11, 14, 16).
Over 300 virulence genes (3, 5, 47) have already been identified in Salmonella enterica serovar Typhimurium 14028 (now termed S. enterica subsp. enterica serovar Typhimurium ATCC 14028), which is a descendant of CDC 60-6516, a strain isolated in 1960 from pools of hearts and livers of 4-week-old chickens (P. Fields, personal communication). Whereas strain 14028 has been typed as LT2, a designation based on phage sensitivity (27), the two strains were isolated from distinct sources decades apart, which makes their genealogy and exact relationship obscure. A derivative of the original 14028 strain with a rough colony morphology (due to changes in O-antigen expression) was designated 14028r to distinguish it from the original smooth strain, renamed 14028s, and was used in a genetic screen for Salmonella virulence genes because it retained lethality for mice and the ability to survive within murine macrophages. Strain 14028 was also used for the identification of Salmonella genes that were specifically expressed during infection of a mammalian host (30). Both 14028 and LT2 possess a 90-kb virulence plasmid promoting intracellular replication and systemic disease (14), but they differ in their prophage contents, as is often the case among S. Typhimurium strains (12, 13).
To identify the individual changes that differentiate S. Typhimurium strains and to assess the nature of variation that arises during laboratory storage and passage, we determined the genome sequence of strain 14028s. This genome was then used as a reference for sequencing its progenitors, including the original source strain CDC 60-6516 and the earliest smooth and rough variants. Our analysis uncovered the genomic differences that arose during the past decades of laboratory cultivation and showed that derivatives with different virulence potentials can follow distinct patterns of sequence evolution.
Sequences were generated for the genomes of four strains of S. Typhimurium. First, a complete and annotated sequence was produced for the genome of a contemporary isolate of strain 14028s. Freezer stocks of this strain have been repeatedly passaged and subcultured and used in experimental settings for more than 2 decades. Using the completed 14028s sequence as a reference, we subsequently obtained the genomic sequences of three archival stocks of the following strains. (i) CDC 60-6516, a pathogenic strain, was initially isolated from samples of hearts and livers of 4-week-old chickens and served as the source from which the original 14028 strain was derived. This strain was stored in stab culture at room temperature from 1960 to 2003 and subsequently as a glycerol stock at −70°C from 2003 to 2009. (ii) ATCC 14028r is the original rough-colony strain described previously (11). The strain derives from a frozen glycerol stock stored on 18 February 1984. (iii) ATCC 14028s was obtained from a frozen glycerol stock stored on 21 February 1985. To discriminate it from the contemporary reference strain, this strain is referred to here as 14028s-o. Note that although this strain was deposited later, it is not derived from ATCC 14028r, but rather represents the original lineage from which the rough strain evolved. The majority of Salmonellae form smooth colonies, but spontaneous mutations to rough colony morphology occur at a frequency of >10⁻⁵. Comparisons of the allelic variants of these strains with those already typed by MLST assigned them to Salmonella enterica sequence type (ST) 19.
To derive the complete nucleotide sequence of strain 14028s, we employed a mixed strategy consisting of producing: (i) random reads that averaged 220 nucleotides (nt) and sampled each nucleotide, on average, 17 times (i.e., 17× coverage), obtained with a Roche 454-FLX pyrosequencer; (ii) 30× coverage of reads averaging 35 nt obtained with an Illumina-Solexa sequencer; (iii) 100× mate pair coverage by ABI SOLiD, in which 22-nt reads are derived from both ends of randomly sheared 500-bp fragments; and (iv) 2× mate pair coverage by ABI-Sanger sequencing, in which >900-nt reads are derived from both ends of clones produced from 2- to 3-kb cloned fragments.
Reads obtained using the 454 and Sanger methods were loaded together into Roche's GSassembler using default parameters, and all contigs corresponding to a single read were discarded. All remaining contigs were aligned with MUMmer and then compared to the published sequence of S. Typhimurium (strain LT2; accession no. NC_003197) using BLASTN. Contigs aligning with more than one location on the reference genome were tagged as repeat elements. Mate pair data produced by ABI-SOLiD were used to predict the order and orientation of these 454/Sanger contigs, thereby confirming that no inversions or translocations were present relative to the LT2 genome. However, because most of the breaks in the assembly occurred in repeat regions, ABI-SOLiD reads were not informative for closing any gaps.
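The repeat-tagging rule described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the record layout, the contig names, and the `min_separation` threshold are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def tag_repeat_contigs(hits, min_separation=1000):
    """Return the ids of contigs that align to more than one reference location.

    hits: iterable of (contig_id, ref_start, ref_end) alignment records,
          e.g. parsed from tabular BLASTN output (layout assumed here).
    min_separation: hypothetical threshold (bp) above which two hit starts
          are treated as distinct locations.
    """
    starts = defaultdict(list)
    for contig_id, ref_start, ref_end in hits:
        # Reverse-strand hits report start > end, so take the leftmost coordinate.
        starts[contig_id].append(min(ref_start, ref_end))

    repeats = set()
    for contig_id, positions in starts.items():
        positions.sort()
        if any(b - a >= min_separation for a, b in zip(positions, positions[1:])):
            repeats.add(contig_id)
    return repeats

# Made-up example records, not data from the study.
hits = [
    ("contig_A", 10_000, 12_500),
    ("contig_B", 250_000, 251_400),
    ("contig_B", 3_100_000, 3_101_400),
    ("contig_C", 3_000_500, 3_000_000),  # single reverse-strand hit
]
repeat_contigs = tag_repeat_contigs(hits)
print(sorted(repeat_contigs))
```

Only contigs with hits at two or more well-separated reference positions are tagged; a contig with a single hit on either strand is treated as unique sequence.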
Because GSassembler (and other sequence assemblers) can collapse repeated elements by underestimating the diversity or number of tandemly repeated units, we initially treated all repetitive sequences as gaps. Each of these gaps in the 14028s genome was filled by designing PCR primers to regions of unique sequence adjacent to the gap, amplifying across the unknown region, and sequencing the PCR products via conventional methods. Whenever the resulting PCR products exceeded the length of a single Sanger sequencing read, complete double-stranded sequences were obtained by primer walking. In cases where the gaps were too large to amplify from unique flanking sequences (e.g., rRNA operons), we designed forward and reverse primers based on repeated sequences believed to reside in the middle of the gap. These primers were used in conjunction with unique flanking primers to generate two PCR products that together spanned the gap. Only one repeated element, corresponding to an ~20-kb region common to two prophages, was too large to resolve by these methods, and in this case, the consensus sequence derived from GSassembler was included in the finished genome in both locations (1304985 to 1325378 and 2789731 to 2810125).
Initial comparisons of the complete, gap-free 14028s genome to the published LT2 genome revealed a number of 1-bp indels that resulted in frameshifts in annotated genes. Many of these indels were situated in homopolymeric sequences, which are known to be prone to errors in 454 pyrosequencing technology. To verify indels in stretches of homopolymers, we manually examined the Sanger reads spanning each homopolymeric region that contained an indel and corrected the sequence to match the Sanger read if it did not match the consensus sequence. Although data obtained by Illumina/Solexa were not integrated in our genome assembly, these reads were also used to confirm or correct homopolymeric indels. For 15 homopolymeric indels for which Sanger or Solexa reads were either absent or inconsistent, the DNA sequences were checked by PCR amplification and sequencing.
Wherever possible, the 14028s genome annotation followed that of the S. Typhimurium LT2 genome. For sequences not present in that genome, candidate genes were identified in Prodigal (http://compbio.ornl.gov/prodigal) and searched with BLASTP against the nonredundant protein sequence database. Because numerous other Salmonella genome sequences have become available since LT2 was first published (31), BLASTX searches of all remaining intergenic regions were also performed against the nonredundant database. All potential coding regions ≥30 amino acids in length and annotated in any other Salmonella genome were added to the 14028s annotation. Due to the different start codons listed for a number of genes, we also updated our annotation to reflect the longer gene product whenever the reading frame was conserved.
After completing the 14028s genome, we obtained ~400-nt reads via Roche 454-Titanium pyrosequencing of strains 14028s-o, 14028r, and 60-6516 to coverage depths of 21×, 20×, and 32×, respectively. Each genome was assembled with GSassembler using default parameters, and the resulting contigs were aligned with the finished 14028s genome using BLASTN. Apparent differences between the resequenced genomes and the 14028s genome were filtered to exclude the unreliable set of 1-bp indels in homopolymers of >4 bp, as well as those 1-bp indels and base substitutions marked as low confidence by GSassembler. All remaining base substitutions and indels, including all homopolymeric indels observed in more than one genome, were checked by Sanger sequencing of a PCR corresponding to the region.
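As an illustration of the homopolymer filter described above, the following Python sketch flags 1-bp indels that fall within a homopolymer longer than 4 bp. It is a simplified stand-in rather than the procedure actually applied: the function names, the 0-based coordinate convention, and the example sequence are assumptions, and the low-confidence flags produced by GSassembler are not modeled.

```python
def homopolymer_run_length(seq, pos):
    """Length of the run of identical bases that contains 0-based index pos."""
    base = seq[pos]
    left = pos
    while left > 0 and seq[left - 1] == base:
        left -= 1
    right = pos
    while right < len(seq) - 1 and seq[right + 1] == base:
        right += 1
    return right - left + 1

def is_unreliable_indel(seq, pos, indel_len, max_run=4):
    """True for a 1-bp indel located in a homopolymer longer than max_run bp."""
    return indel_len == 1 and homopolymer_run_length(seq, pos) > max_run

# Hypothetical reference fragment: a run of six A's and a run of three C's.
reference = "ACGTAAAAAAGTCCCGT"

print(is_unreliable_indel(reference, 6, 1))   # inside the six-A run
print(is_unreliable_indel(reference, 13, 1))  # inside the three-C run
```

Under these assumptions, an indel in the six-base run would be set aside as unreliable, whereas one in the three-base run would be retained for verification by PCR and Sanger sequencing.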
The complete S. Typhimurium 14028s genome has been deposited in GenBank under accession numbers CP001363 for the bacterial chromosome and CP001362 for the plasmid.
Using four sequencing platforms (ABI-Sanger, Roche-454, Illumina/Solexa, and ABI-SOLiD), we assembled the complete sequence of a contemporary isolate of strain 14028s, which was, in turn, used as a reference for sequencing the genomes of three progenitor strains (Fig. 1). Due to the very short read lengths and/or high error frequencies, data generated by ABI-SOLiD and Illumina/Solexa were not included in the initial genome assembly of strain 14028s. The 454 run yielded a total of 372,340 reads (82,619,725 bases; 17× coverage), and the Sanger runs yielded 10,752 reads (9,958,592 bases; 2× coverage). The combined assembly based on these two data sets formed 259 contigs, 106 of which corresponded to individual reads and were removed from subsequent analyses. The remaining 153 contigs served as the starting point for genome assembly and were aligned with the published LT2 genome. Several of the 14028s contigs represented repeat elements not present in the published LT2 genome, and alternatively, several regions of the published genome had no counterpart in 14028s.
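The read statistics reported above can be cross-checked with a short Python calculation. The read and base counts are taken from the text; the genome size of roughly 4.9 Mb is an assumption introduced for this sketch and is not stated in this section.

```python
# Assumption for illustration only: approximate genome size in base pairs.
ASSUMED_GENOME_BP = 4_900_000

# Read and base counts as reported in the text.
runs = {
    "454": {"reads": 372_340, "bases": 82_619_725},
    "Sanger": {"reads": 10_752, "bases": 9_958_592},
}

def mean_read_length(run):
    """Average read length in nucleotides."""
    return run["bases"] / run["reads"]

def coverage(run, genome_bp):
    """Average number of times each genome position is sampled."""
    return run["bases"] / genome_bp

summary = {
    name: (mean_read_length(run), coverage(run, ASSUMED_GENOME_BP))
    for name, run in runs.items()
}
for name, (length, depth) in summary.items():
    print(f"{name}: mean read {length:.0f} nt, ~{depth:.0f}x coverage")
```

With the assumed genome size, the 454 data give a mean read length of about 222 nt at roughly 17× coverage and the Sanger data about 926 nt at roughly 2×, in line with the figures given in the text.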
After all gaps in the 14028s genome were closed, comparison with the LT2 genome uncovered 55 single-base indels leading to frameshifts in annotated genes. Although individual Salmonella strains are each expected to contain some unique pseudogenes, 454 sequencing technologies are prone to errors in long homopolymeric runs, thereby generating indels. By reexamining each of these sites using sequencing technologies other than 454, we found that 25 of the 55 single-base indels were attributable to sequencing errors. Not all such errors occurred in long homopolymers: six were in runs of three or fewer mononucleotides, and two occurred within runs of nonidentical nucleotides. Additionally, data from the three resequenced genomes revealed 25 candidate errors (primarily intergenic homopolymers) in our original 14028s assembly that were each checked by PCR and Sanger sequencing.
The 14028s and LT2 genomes are over 98% identical in sequence, with the greatest difference between the two strains attributable to the distribution of four prophages (Fig. 2). Two intact LT2 prophages (Fels-1 and Fels-2) are not present in 14028s; however, there are remnants of several Fels-2 genes at the corresponding positions in the 14028s genome, indicating that the prophage is ancestral to both strains but was subsequently eliminated from 14028s. In contrast, strain 14028s contains two prophages not present in LT2: one, integrated within tRNASer, is 40,146 bp in length and >99% identical to the previously characterized S. enterica phage ST64B, and the other, inserted near the 3′ end of the icdA gene, spans 51,101 bp. Most of this 51-kb prophage is virtually identical to a prophage in S. enterica serovar Newport; however, the remaining portions, constituting about 30% of its length, display significant identity to prophages in other bacteria only at the amino acid level. This phage, referred to as Gifsy-3, contains two known virulence genes, sspH1 and pagJ (12).
In addition to the prophage insertions and deletions, numerous other classes of insertion/deletion events differentiate the 14028s and LT2 genomes (Table 1). Even though four IS200 elements in 14028s and LT2 reside in identical locations in the two strains, there are eight others (six in 14028s and two in LT2) that are confined to only one of the strains. One of the IS200 insertions unique to 14028s disrupts STM1228, encoding a putative periplasmic protein, whereas the other five occur within intergenic regions. In addition to changes in genome contents attributable to phage and transposable elements, there were 11 other indels >100 bp in length, three of which occurred within protein-coding regions, including a 5.1-kb deletion that removed at least four genes (STM3256 to STM3259) from the 14028s genome. Aside from genes encoded within prophages, there are no other gene-size insertions into 14028s or large deletions in LT2.
When the 14028s and LT2 genomes were compared, there were a total of 142 indels under 100 bp in length, mostly due to single-base insertions or deletions (see below). Although a few of these single-base indels could have arisen from sequencing errors (only those indels that occurred within coding regions were initially confirmed by PCR and Sanger sequencing), the vast majority were subsequently verified by ABI-SOLiD and/or Illumina/Solexa reads. Considering only unambiguous indels that were verified by PCR and/or alternate sequencing technologies, there were approximately equal numbers of insertions and deletions in the two strains.
We detected a total of 962 base substitutions between LT2 and 14028s. Due to our postassembly finishing of all repeated elements in the 14028s genome, base substitutions (and small indels) were elevated by nearly an order of magnitude in multicopy regions (red bars in Fig. 2). In all likelihood, the sequences of these regions in the LT2 genome reflect overcollapsed contigs in which merged sequences from nonidentical repeats were assembled in duplicated regions of the published genome. Due to the potential that these multicopy regions of the published LT2 genome (which include insertion sequence [IS] elements, rRNA operons, and prophages) are subject to sequencing artifacts, we excluded all such regions from the subsequent analyses, leaving a total of 540 base substitutions separating LT2 and 14028s (Table 2).
Overall, substitutions in intergenic regions and synonymous sites occur at about the same frequency (1.58 × 10⁻⁴ per intergenic site versus 1.61 × 10⁻⁴ per synonymous site), as expected if these sites are neutral or nearly so or if both types contain the same proportion of neutral sites. The Ka/Ks ratio (i.e., the ratio of nonsynonymous to synonymous site substitutions) based on substitution differences within annotated genes is 0.531, which is over 10 times that observed in comparisons of homologous genes from Escherichia coli and S. enterica (Ka/Ks = 0.036). The inflated Ka/Ks ratio observed between the two closely related Salmonella strains is due to the shorter duration of purifying selection acting on slightly deleterious mutations.
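The two comparisons made in this paragraph can be restated as simple ratios. The Python sketch below uses only the values quoted in the text; the variable names are illustrative.

```python
# Per-site substitution frequencies quoted in the text.
intergenic_rate = 1.58e-4
synonymous_rate = 1.61e-4

# Ka/Ks values quoted in the text.
ka_ks_14028s_vs_lt2 = 0.531
ka_ks_ecoli_vs_salmonella = 0.036

# Intergenic and synonymous sites change at nearly the same frequency.
rate_ratio = intergenic_rate / synonymous_rate

# Fold elevation of Ka/Ks in the within-serovar comparison.
fold_elevation = ka_ks_14028s_vs_lt2 / ka_ks_ecoli_vs_salmonella

print(f"intergenic/synonymous = {rate_ratio:.2f}")
print(f"Ka/Ks fold elevation  = {fold_elevation:.1f}")
```

The intergenic-to-synonymous ratio is close to 1 (about 0.98), and the Ka/Ks ratio between 14028s and LT2 is roughly 15-fold higher than that in the E. coli-S. enterica comparison, consistent with the "over 10 times" stated above.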
The availability of the genome sequences of several outside reference strains (32, 40, 55) allowed us to infer the ancestral states of most of the observed base substitutions and to establish in which of the two strains each substitution occurred. In all, 272 base substitutions (and 50 indels) could be unequivocally assigned to the 14028s lineage versus 244 base substitutions (and 42 indels) assigned to the LT2 lineage (Fig. 3; absolute numbers of each class of substitution are given in Table S1 in the supplemental material). For three classes of substitutions, there are large differences between the two strains. (i) There are more TA-to-CG substitutions in intergenic regions of 14028s. (ii) The complementary substitution, CG-to-TA, is prevalent at nonsynonymous sites of 14028s (72 versus 48) and was the most common type of substitution detected. (iii) In contrast to the asymmetry between the strains in the frequency of TA/CG transitions, CG-to-GC transversions in LT2 significantly (P = 0.001) outnumbered those in 14028s. The ability to recognize the direction of each substitution also allowed us to identify specific trends in base-compositional bias. For those sites that could be assigned, there was a decrease in overall G+C content in both strains, which was most pronounced at synonymous sites.
When all single-base substitutions and small indels were considered (while those few coding genes that occurred in multiple copies were ignored), 13 of the 14028s genes contained potentially inactivating mutations that resulted in considerably shortened coding regions, which, when combined with those genes disrupted by large insertions or deletions, yielded a total of 19 pseudogenes in 14028s that were intact in LT2 (Table 3). In addition to fimbrial protein (lpfD) and virulence factor (ratB) genes, which are at least 20% shorter than their LT2 counterparts, there are several other slightly truncated genes in 14028s, including fumarase (fumA), ribonuclease Z (elaC), and secreted effector protein (avrA) genes that have been shown to be distributed variably among Salmonella strains (see Table S2 in the supplemental material). Moreover, as discussed above, the serU tRNA gene contains a phage insertion in 14028s.
Of the 39 pseudogenes that were originally annotated in LT2, 38 are present in 14028s (and the other is located in a prophage absent from 14028s), and in no case is it evident that any of the shared pseudogenes were inactivated independently in the two strains. In addition, there are four genes in LT2 that were originally considered to be functional but whose homologs, due to a base substitution or frameshift, were found to encode longer proteins in 14028s. In three of these cases (host specificity protein J and two putative arylsulfate sulfotransferases), the open reading frame (ORF) in 14028s encompassed two adjacent genes in the LT2 genome, representing examples of recent LT2 pseudogenes (or potential sequencing errors) that are full length in other sequenced Salmonella genomes.
In addition to those genes truncated by the presence of frameshifting or nonsense mutations, there are numerous genes that have incurred nonsynonymous substitutions that could potentially affect protein function, as well as numerous frameshifts that have altered the sequence at one end of the ORF, even if they do not substantially shorten it (see Table S2 in the supplemental material). With respect to mutations that might be responsible for the difference in virulence potential of 14028s and LT2, there are amino acid substitutions in BigA, a surface-exposed virulence protein, in the putative virulence factors MviM and SrfB, as well as in several regulatory proteins, most notably RpoS and SlyA, which are both known to modulate Salmonella virulence (9, 26). For example, the start codon for the rpoS gene is ATG in 14028s but TTG in LT2, which results in lower RpoS levels in the latter strain. In the case of the SlyA protein, 14028s and LT2 differ at two positions, one at which Asp is changed conservatively to Glu and one where the LT2 Ala98 located within a predicted alpha helix adjacent to the DNA binding domain is replaced by a helix breaker Pro in 14028s (38).
We applied 454-Titanium technology to recover the genomic sequences of three strains that form the lineage leading to our contemporary isolate of 14028s. We obtained coverage depths of 32×, 20×, and 21× for strains 60-6516, 14028r, and 14028s-o, respectively, and in each case, automated assembly yielded approximately the same number of contigs (~200) as in the original 14028s assembly. These alignments revealed which contigs in the assemblies of the three resequenced strains matched multiple locations on the genome, reflecting the presence of repetitive elements. The repeated regions included prophages, IS elements, rRNA operons, and several other multicopy genes (notably the oadAB, dcoAB, and ccmGH operons), which together constitute ~3% of each genome (Fig. 2). Because the sequences of these potentially overcollapsed repeats were not individually resolved, as was done for the 14028s genome sequence, they were excluded from subsequent analyses.
Two structural changes in genome architecture were evident in the nonrepetitive regions of the three resequenced genomes. In the 14028r genome, the hin invertase region, responsible for flagellar phase variation, is in an orientation opposite to that in the other genomes. Strain 60-6516, the putative ancestor of all 14028 strains, lacks a 2,351-bp region including four genes: dcuA (an anaerobic C4-dicarboxylate transporter), aspA (aspartate ammonia lyase), fxsA (F exclusion of bacteriophage T7), and STM14_5203, encoding a hypothetical protein of unknown function. The presence of this region in the three 14028 strains, and in all other S. enterica strains sequenced to date, points to a deletion that occurred in culture or during recent cultivation of strain 60-6516. Aside from these changes, we did not detect any cases of gene amplification or translocation, as observed in some archival strains of LT2 that were maintained in stab culture (42).
After all ambiguous polymorphisms (i.e., low-confidence substitutions and indels and those proved false by PCR and Sanger sequencing) were removed and each position was compared to its inferred ancestral state based on the LT2 sequence, there were seven changes in 60-6516, two changes in 14028r, one change in 14028s-o, and three changes in 14028s. The last represent the changes that have accumulated during laboratory passage and consist of three nonsynonymous substitutions, one in nuoL, one in the transcriptional regulator gene cytR, and one in locus STM14_1964 (encoding a putative cytoplasmic protein). Of the mutations found in the three progenitor strains, three occur in genes of known function: a frameshift mutation in rpoS, which codes for the starvation-induced sigma factor (10, 17), differentiates 60-6516 from the other strains; a frameshift mutation in rfbJ, which codes for CDP-abequose synthase (22), is responsible for the rough colony phenotype of 14028r; and a 12-bp deletion in 14028s-o affects rbsR, which codes for the repressor of the ribose regulon (24).
By determining the complete nucleotide sequence of S. Typhimurium 14028s, we were able to (i) identify the specific alterations that are responsible for the phenotype traits, particularly the virulence attributes, specific to this strain; (ii) search for genomic signatures of strain domestication; and (iii) distinguish lineage-specific changes in the rates and patterns of mutations. We should first note that the initial comparison of our sequence to the published LT2 genome revealed that base substitutions and indels were much more frequent in regions representing repeated DNA, including rRNA operons, multicopy elements, and duplicated genes. The type, magnitude, and distribution of these changes suggest that amalgamation and overcollapsing of sequences in the repeated regions of the LT2 genome are responsible for the prevalence of substitution in repeated sequences. The phenomenon of contig overcollapse is fairly common in genomic sequences reported for eukaryotes with high densities of repeated elements. Because it affected the assembly of ~3% of the LT2 genome, we were forced to exclude all polymorphisms detected in repeated regions. As a result, we removed nearly half of the base substitutions and indels from our data set in order to ascertain unambiguously the differences that exist between the compared Salmonella strains.
When the base substitutions that separate strains LT2 and 14028s were considered, CG-to-TA transitions (likely originating from deamination events) appeared at the highest frequency and constituted over 50% of the point-mutational differences between the two strains. They were followed by the complementary TA-to-CG mutation, which occurred at less than one-third of the frequency. Several studies have experimentally assessed the spectrum of mutations in enteric bacteria (6, 15, 28, 29, 49, 51). However, the relative frequencies and rank order of base substitutions in these analyses (even in those not employing mutator strains) do not mirror those detected in our genome comparisons (see Fig. S1 in the supplemental material), indicating that the rates and patterns of mutations can vary greatly under different natural and laboratory conditions (35, 45). This difference was also observed in the resequenced, laboratory-stored strains of 14028s: those few base substitutions that we detected were almost evenly split between CG-to-TA and TA-to-AT mutations, the latter of which was the rarest mutation observed between 14028s and LT2.
The disparities between the spectra of mutations observed under natural and laboratory-based conditions might not only be caused by differences in the occurrence of each class of mutation but could also result from selective processes. Nonsynonymous sites constitute the largest target in bacterial genomes, and certain mutations at these sites could potentially be deleterious and subsequently be removed from the genome. The Ka/Ks values (i.e., the ratio of the rate of nonsynonymous substitutions to the rate of synonymous substitutions) for the 14028s-LT2 comparisons are less than 1.0, indicating that some proportion of mutations at nonsynonymous sites have been removed. Overall, the per-site substitution rate at nonsynonymous sites was about half that at synonymous and intergenic sites (4.3 × 10⁻⁵ versus 8.0 × 10⁻⁵), and in particular, the number of CG-to-GC mutations was significantly higher (P < 0.0005) at nonsynonymous sites. Naturally, most indels within genes cause frameshifts and are highly deleterious, but the ratio of single-base indels to single-base substitutions in noncoding spacer regions was 1:3, identical to that reported for the aphid endosymbiont Buchnera aphidicola (34).
Several methods can be used to estimate the date of divergence between 14028s and LT2. Applying the experimentally observed mutation rate of 0.9 × 10⁻¹⁰ per base per replication calculated for Salmonella (21) and an estimate of 100 to 200 generations per year in the wild yields a divergence date of 3,000 to 6,000 years ago. Alternatively, given a substitution frequency of 8.0 × 10⁻⁵ at synonymous sites and the previously calibrated rate of 0.9% per million years based on the estimated E. coli-Salmonella divergence (36, 37), the split between 14028s and LT2 is estimated to have occurred approximately 9,000 years ago. Finally, our resequencing data, which recognized nine substitutions in the 50 years separating the 14028s and 60-6516 lineages, suggest that the divergence between 14028 and LT2 occurred about 3,000 years ago. It is common for divergence times based on laboratory-derived mutation rates and on sequence comparisons to differ by an order of magnitude (35); however, it appears that over the short term the estimates are fairly similar.
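Two of these estimates can be retraced with simple arithmetic, as in the Python sketch below. The first calculation uses only values given in the text. The second assumes that the nine substitutions observed over 50 years are scaled against the 540 nonrepeat substitutions separating LT2 and 14028s; that choice of numerator is an assumption made here because it reproduces the reported figure, and it is not stated explicitly in the text. The function names are illustrative.

```python
def years_from_calibrated_rate(subs_per_site, rate_per_site_per_myr):
    """Divergence time (years) from a per-site divergence and a calibrated rate."""
    return subs_per_site / rate_per_site_per_myr * 1_000_000

def years_from_observed_rate(total_subs, subs_observed, years_observed):
    """Divergence time (years) from a substitution rate observed in stored strains."""
    subs_per_year = subs_observed / years_observed
    return total_subs / subs_per_year

# Synonymous-site frequency of 8.0e-5 and a rate of 0.9% per million years.
calibrated = years_from_calibrated_rate(8.0e-5, 0.009)

# Nine substitutions in 50 years; 540 total substitutions is an ASSUMED numerator.
resequencing = years_from_observed_rate(540, 9, 50)

print(f"calibrated-rate estimate: ~{calibrated:,.0f} years")
print(f"resequencing estimate:    ~{resequencing:,.0f} years")
```

Under these assumptions the calibrated-rate calculation gives about 8,900 years and the resequencing-based calculation about 3,000 years, matching the approximate values quoted above.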
Since its divergence from the attenuated LT2 strain, the highly virulent 14028s accumulated 10% more base substitutions, primarily at nonsynonymous sites. Examination of the genomic locations of these nonsynonymous substitutions indicates that they occur in genes with varied functions (as opposed to affecting a particular class of genes, such as those involved in host interactions). This genome-wide distribution suggests that these substitutions have been fixed through a nonadaptive process, i.e., genetic drift, which occurs in lineages that experience population bottlenecks (4). Since 14028s is pathogenic and LT2 is not and the 14028s population sizes are more apt to fluctuate, it is not surprising that, like other host-associated bacteria, the 14028s lineage has accumulated a high number of mildly deleterious mutations, as reflected in an increased number of nonsynonymous substitutions (23).
Comparisons of the genome contents of multiple Salmonella serovars have shown that several virulence genes with sporadic distributions are phage associated and that host-restricted serovars experience significant gene loss (32, 43, 44, 53). Although strains 14028s and LT2 are more closely related than any pair of Salmonella strains examined in these studies, there are differences in their genome contents associated with the insertion and deletion of prophages. As previously reported (12), LT2, but not 14028s, harbors the Fels-1 and Fels-2 phages, whereas 14028s uniquely carries Gifsy-3 and an ST64B-like phage. In addition, we have detected several phage-related genes that are restricted to only one of the strains (Fig. 2). Although genes encoded on Gifsy-2 are essential for virulence in mice, this is not the case for Gifsy-3, suggesting that the virulence effects of the Gifsy-3 genes (sspH1 and pagJ) are limited to the enteric phase of the disease (33). Because the genome contents of strains LT2 and 14028s are very similar, differences in the functional status of genes and/or activities of the encoded gene products shared by the strains are most likely the source of the observed difference in virulence properties. Few, if any, of the pseudogenes unique to LT2 are likely candidates to explain its reduced virulence, suggesting that polymorphisms in any of several genes with known roles in virulence, such as bigA, shdA, avrA, sseI, mviM, srfB, sipA, slyA, or rpoS, may be responsible for their distinct pathogenic properties.
Numerous studies have examined the accumulation of substitutions in laboratory-propagated strains of bacteria, in which organisms are typically kept under continuous-growth conditions (in either serial culture or chemostats) and the mutations conferring a selective advantage are identified (18, 25, 57). In such situations, not only do mutations and genomic rearrangements accumulate, but salient characteristics that are not under selection are also often lost over time. For example, it is common for cultivated bacteria to become less virulent (20, 50), lose flagella (48), change colony morphology (7), develop auxotrophies, and eliminate extrachromosomal elements. Distinguishing all of the genetic differences among evolved or closely related bacterial strains has recently become a relatively straightforward process (18, 19, 39, 54, 58). Through comparative genomic analyses, we have shown that strains obtained from natural sources and subsequently used for experimental studies display characteristic rates and patterns of mutations, and we have identified the potential source of the phenotypic differences among strains.
This research was supported in part by grants from the National Institutes of Health (GM56120 and GM74738 to H.O. and AI49561 and AI42236 to E.A.G.). E.A.G. is an investigator of the Howard Hughes Medical Institute.
We thank Becky Nankivell for preparing the figures and Leigh Riley at NCBI for her queries and assistance in submitting the sequence and in assigning accession numbers.
Published ahead of print on 6 November 2009.
†Supplemental material for this article may be found at http://jb.asm.org.