Genome features. The genome of
Y. pestis KIM consists of a single circular chromosome of 4,600,755 bp with an average G+C content of 47.64%. Features of the sequence are shown on the map in Fig. . The origin and the terminus of replication were assigned by several criteria (these have not been experimentally determined). The polarity-switching point of C-G skew was used in both cases. For the origin, the locations of DnaA boxes were used, and for the terminus,
dif (the site in the center of the terminus where chromosome dimer resolution occurs) and the
terC site were used. Base pair 1 of the genome was assigned between the
mioC gene and the DnaA boxes within the origin of replication. As in
Escherichia coli, the putative origin and terminus of replication divide the genome into two replichores, and replication presumably proceeds bidirectionally. The DNA strands that are replicated continuously in the direction from origin to terminus are leading strands. Their complementary strands are lagging strands. The DNA sequence in Fig. represents the leading strand in replichore 1 and the lagging strand in replichore 2. The sequence contains many highly skewed oligomers, occurring preferentially on leading strands. The most statistically significant skewed oligomer is Chi (GCTGGTGG) (Fig. ), which stimulates DNA repair by homologous recombination in an orientation-dependent manner (
25,
30). The possibilities that Chi sites may be involved in normal DNA replication (
4,
8) and in the rescue of stalled replication forks have been noted (
25). Another family of octamers, with consensus RRNAGGGS (
9), is highly skewed in each of the replichores, confirming our identification of both the origin and the terminus. The orientation of the map in Fig. was chosen to match that of
E. coli K-12 by the organization of rRNA operons and the gene content, gene orientations, and relative positions in the two replichores.
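The two sequence-based criteria described above, the polarity switch of the nucleotide skew and the strand preference of an oligomer such as Chi, can be illustrated with a minimal Python sketch. It is not the procedure used for the KIM analysis; the function names are hypothetical, the skew is taken as G minus C, and the toy sequences are invented for illustration.

```python
# Illustrative sketch only; names and toy data are assumptions, not the authors' code.

def cumulative_gc_skew(seq):
    """Running total of (G count - C count) along the given strand."""
    total, running = 0, []
    for base in seq.upper():
        if base == "G":
            total += 1
        elif base == "C":
            total -= 1
        running.append(total)
    return running


def skew_switch_points(seq):
    """1-based positions of the cumulative-skew minimum and maximum.

    If the leading strand is G rich, the minimum marks an origin-like
    switch and the maximum a terminus-like switch.
    """
    running = cumulative_gc_skew(seq)
    idx = range(len(running))
    lowest = min(idx, key=running.__getitem__) + 1
    highest = max(idx, key=running.__getitem__) + 1
    return lowest, highest


_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.upper().translate(_COMPLEMENT)[::-1]


def count_overlapping(seq, motif):
    """Count occurrences of motif in seq, allowing overlaps."""
    count, start = 0, seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count


def strand_counts(seq, motif):
    """Occurrences of motif on the given strand and on its complement."""
    seq = seq.upper()
    return (count_overlapping(seq, motif),
            count_overlapping(seq, reverse_complement(motif)))


toy_genome = "C" * 20 + "G" * 50 + "C" * 30        # invented sequence
print(skew_switch_points(toy_genome))               # (20, 70)
chi_toy = "AAGCTGGTGGAACCACCAGCAAGCTGGTGG"          # invented sequence
print(strand_counts(chi_toy, "GCTGGTGG"))           # (2, 1)
```

On a real genome the same counts would be made separately for each replichore, so that an excess on the leading strand of both replichores can be distinguished from a simple excess on one strand of the published sequence.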
Sequence analysis revealed 4,198 predicted ORFs with an average size of 940 bp, covering 86% of the genome. Predicted ORFs smaller than 50 amino acids (aa) were annotated only if they had a convincing database match. There are 2,385 ORFs on the leading strands and 1,813 ORFs on the lagging strands; this excess of leading-strand ORFs reflects the preference for encoding proteins in the same direction as replication.
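The leading- and lagging-strand tallies depend on which replichore an ORF lies in. A minimal Python sketch of that bookkeeping follows, assuming (as for the KIM sequence) that base pair 1 is at the origin and that the published strand is the leading strand in replichore 1. The terminus coordinate and the ORF list are invented for illustration, and the function names are hypothetical.

```python
# Illustrative sketch only; the terminus coordinate and ORFs are invented.

def replichore(position, terminus):
    """Replichore 1 runs from the origin (bp 1) to the terminus; 2 is the rest."""
    return 1 if position <= terminus else 2


def is_leading(position, strand, terminus):
    """True if an ORF on `strand` ('+' or '-') is encoded on the leading strand.

    The '+' strand is the leading strand in replichore 1 and the lagging
    strand in replichore 2, so the two tests must agree.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    return (strand == "+") == (replichore(position, terminus) == 1)


def strand_tally(orfs, terminus):
    """Return (leading, lagging) counts for (position, strand) pairs."""
    leading = sum(is_leading(pos, strand, terminus) for pos, strand in orfs)
    return leading, len(orfs) - leading


toy_orfs = [(100, "+"), (200, "-"), (600, "-"), (700, "+"), (800, "-")]
print(strand_tally(toy_orfs, terminus=500))   # (3, 2)

# Proportion reported in the text: 2,385 of 4,198 ORFs on leading strands.
print(round(2385 / (2385 + 1813), 3))          # 0.568
```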
Comparison with Y. pestis CO92. We compared the KIM genome sequence with that of
Y. pestis strain CO92, recently published (
17), and found that more than 95% of the sequence is shared by the two genomes. The CO92 genome is ~50 kb larger than the KIM genome, the result of an 11-kb insertion and many smaller insertions in CO92 relative to KIM. About 27 kb of the difference is due to insertion sequence (IS) elements, which are more numerous in CO92. CO92 also has one fewer rRNA operon (see below). At the protein level, 3,672 of 4,198 total ORFs match CO92 ORFs (>90% amino acid identity and >60% of each protein length in the alignment). Of the remaining 526 unmatched KIM ORFs, 318 encode 100 aa or fewer. Most of the unmatched ORFs encode hypothetical proteins. Although the genome sequences are highly conserved, extensive rearrangements are seen. Figure shows the alignment of the two genomes. For the purpose of comparison, both genomes are divided into 27 segments, each of which matches its counterpart in either direct or reverse orientation. There are three regions where multiple inversions appear to have taken place. For each region, we calculated the most parsimonious series of inversions that could account for the organizational differences between the two genomes, plotted in Fig. . In most steps it was possible to identify a sequence homology, mostly IS elements, that could have given rise to the proposed inversion. The most complicated region is the one that spans the replication origin, which has 12 segments and requires a minimum of nine inversions to produce the observed rearrangement. Any intrareplichore inversion causes the DNA sequences involved to switch from leading strand to lagging strand or vice versa, changing the C-G skew in that region; the change will still be detectable if the event was relatively recent and amelioration has not yet taken place (visible in Fig. and ).
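The idea of a most parsimonious series of inversions can be illustrated on a toy scale. In the Python sketch below, each segment is a signed integer (the sign gives its orientation), and a breadth-first search finds the smallest number of inversions that converts one arrangement into another. This is a brute-force illustration for a handful of segments on a linear map, not the method used for the 27-segment comparison, and the function names are hypothetical.

```python
# Illustrative brute-force sketch; practical only for a few segments.
from collections import deque


def invert(order, i, j):
    """Invert segments i..j (inclusive): reverse their order and flip signs."""
    flipped = tuple(-seg for seg in reversed(order[i:j + 1]))
    return order[:i] + flipped + order[j + 1:]


def min_inversions(start, target):
    """Smallest number of inversions turning `start` into `target`.

    Both arguments are sequences of signed segment labels on a linear map.
    Returns None if the target cannot be reached (different segment sets).
    """
    start, target = tuple(start), tuple(target)
    steps = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == target:
            return steps[current]
        for i in range(len(current)):
            for j in range(i, len(current)):
                nxt = invert(current, i, j)
                if nxt not in steps:
                    steps[nxt] = steps[current] + 1
                    queue.append(nxt)
    return None


print(min_inversions((1, -3, -2, 4), (1, 2, 3, 4)))   # 1
print(min_inversions((2, 1), (1, 2)))                  # 3
```

The second example shows why orientation matters: exchanging two segments without changing their orientation takes three inversions, not one. The search space grows as 2^n n! for n segments, so a region of 12 segments is far beyond this exhaustive approach.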
rRNA operons. Rearrangement of the
Y. pestis genome might also have led to the difference in the number of rRNA operons in the two strains. Of the seven rRNA operons in KIM, two are in the very highly conserved regions and the other five are in the multiple-inversion region I that is around the replication origin. As demonstrated in Fig. , segment 7 has three rRNA operons, with one at the end of the segment. Both segments 24 and 1 have an rRNA operon at one end. In step 4 of the rearrangement process, 24 becomes adjacent to 7, and in step 5, it becomes adjacent to 1. Both steps bring rRNA operons together in the genome, forming large tandem repeats and providing opportunities for recombination. This could result in deletion of one of the tandem rRNA operons during the process of evolution without the loss of essential genes in between. A process of this type could explain the loss of an rRNA operon in CO92. Variant rRNA patterns (ribotypes) have been reported among
Y. pestis biovar Orientalis strains isolated over the last 65 years (
18), but there is no evidence of rRNA-specific rearrangements in KIM. Indeed, the conserved locations of the rRNA operons with respect to the origin of replication are remarkable in light of the level of rearrangement observed for other backbone genes (see below).
Comparison with E. coli K-12. It has always been surprising that
E. coli and
Salmonella have such consistent gene order (synteny), considering their evolutionary divergence time of 110 million years. In our previous analysis of
E. coli O157:H7 (
34), we found by comparison with K-12 that a large proportion of both genomes is shared and colinear and that in both cases the shared regions are punctuated by islands of unique sequence, apparently acquired by horizontal transfer. In general, the shared regions include genes of central metabolism and basic conserved physiology of the
Enterobacteriaceae. The islands, by contrast, frequently contained genes associated with pathogenicity or survival in the mammalian intestine, as well as many of unknown function, and some contained evidence of mobility, e.g., phage genes or IS elements. We named the shared regions “backbone” to denote the common framework in which the specialized islands are inserted. We examined KIM to determine whether this organization is detectable in
Y. pestis and, if so, to what extent. In the case of the two
E. coli strains, backbone regions matched at >98% nucleotide identity. The KIM and K-12 genomes do not match at this level; only 20% of randomly tested sequence reads matched K-12 at better than 60% nucleotide identity. In a random sample of KIM proteins, 45.5% matched K-12 proteins at better than 60% amino acid identity. When context is also taken into account, the extent of locally colinear segments emerges.
Our standard criteria for inference of orthology in the E. coli-versus-E. coli comparison were >90% amino acid identity and >60% of each protein length in the alignment. For KIM predicted proteins, we required a match of >40% amino acid identity with the K-12 protein and an alignment that included >60% of both genes. This was justified when the match to K-12 was clearly the best, or the only, match, and in addition, when the adjacent encoded protein(s) also had best matches to adjacent genes in K-12. Despite their distance in the phylogenetic tree, KIM and K-12 do in fact share many homologous genes. By these criteria, roughly 53.7% of KIM genes (2,254 of 4,198) are identified as backbone genes, with an average amino acid identity with the K-12 ortholog of 72.9% and with an average of 96.2% of the Y. pestis ORF length and 97.4% of the K-12 ORF length in the alignment. The most interesting finding is not only that so many genes are shared but that the arrangements of these backbone genes are quite similar (73% of backbone genes are in segments that are locally colinear with K-12). Figure shows the synteny of orthologous genes of both genomes. We define ori distance as the distance of each gene from the origin on its replichore. An interreplichore inversion causes neighboring genes to switch from one replichore to the other, but the gene order with respect to the origin is not changed. If such an inversion happens asymmetrically around the replication origin, the ori distances of genes outside the inversion are changed. An intrareplichore inversion, however, inverts the gene order. When the offsets of the ori distances of orthologous genes from the replication origin are plotted against their average distances, a clear pattern of the arrangements of backbone genes in both genomes emerges (Fig. ).
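The ori distance defined above, and the offset and average that are plotted for each pair of orthologs, can be written out explicitly. The Python sketch below is a minimal illustration on an invented circular genome of 1,000 units; the function names, coordinates, and the choice to report the replichore alongside the distance are assumptions made for the example.

```python
# Illustrative sketch only; all coordinates below are invented.

def ori_distance(position, origin, terminus, length):
    """Return (replichore, distance from the origin along that replichore).

    Coordinates are on a circular genome of size `length`. Replichore 1
    runs from the origin to the terminus in the direction of increasing
    coordinate; replichore 2 is the other arc.
    """
    from_origin = (position - origin) % length
    terminus_offset = (terminus - origin) % length
    if from_origin <= terminus_offset:
        return 1, from_origin
    return 2, length - from_origin


def offset_and_average(distance_a, distance_b):
    """Offset and mean of the ori distances of one ortholog pair."""
    return distance_a - distance_b, (distance_a + distance_b) / 2


# A gene moved by a symmetrical interreplichore inversion keeps its
# ori distance but changes replichore.
print(ori_distance(100, origin=0, terminus=500, length=1000))   # (1, 100)
print(ori_distance(900, origin=0, terminus=500, length=1000))   # (2, 100)

# The origin need not be at coordinate 0 in the published sequence.
print(ori_distance(50, origin=800, terminus=300, length=1000))  # (1, 250)

print(offset_and_average(100, 140))                              # (-40, 120.0)
```

Under this definition, orthologs that differ only by symmetrical interreplichore inversions have offsets near zero, whereas asymmetrical inversions shift the offsets of genes outside the inverted segment.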
While the gene order of KIM appears at first glance to be totally scrambled relative to that of K-12, we found that the orientations and the ori distances of corresponding backbone genes are in fact highly conserved (±400 kb) (Fig. ). This form of colinearity is to be expected if the principal rearrangements in evolution occur by a series of approximately symmetrical interreplichore inversions which exchange genes from one replichore to the other without moving them very much relative to the origin (
11). This type of inversion also preserves the sizes of replichores, as well as the leading strand of replication, and hence does not disrupt the strand biases characteristic of the replichore. It has happened so frequently that almost half of the backbone genes have switched to the other replichore (Fig. ). Only near the origin are there a few examples of intrareplichore inversions (Fig. ) between K-12 and KIM. The conservation of backbone structure is also observed from the relative locations of rRNA operons (five rRNA operons on replichore 1 and two on replichore 2). The finding of substantial colinearity in this sense between the backbone genomes of
E. coli and
Y. pestis KIM is remarkable, considering they may have been separated by as much as 500 million years of evolution from a presumed common ancestor. We estimated their time of divergence from the variability of three shared housekeeping genes, using the divergence time of 110 million years for
E. coli and
Salmonella (
32) as a calibration, resulting in a weighted average of 375 million years with a standard error of 145 million years.
Our observations are generally in agreement with the work of Segall et al. (
42), who drew attention to the conservation of gene order among members of the family
Enterobacteriaceae and demonstrated that intrareplichore inversions could not be obtained experimentally by homologous recombination, though once constructed by other techniques, such inversions are stable.
These phenomena seem more easily interpretable when we consider the possible effect nucleoid segregation during replication may have on the availability of DNA for inversional recombination (Fig. ). If DNA exchanges are limited to rather short exposed regions near replication forks, interreplichore inversions will tend to have endpoints that are about equidistant from the origin, given that the two forks move at approximately equal rates (
24). On the other hand there is little scope for intrareplichore inversion, since very little distance along a single replichore will be exposed at any given time. An exception may occur during replication initiation, since all the intrareplichore inversions between K-12 and KIM are concentrated near the origin. Larger regions of the chromosome may also be exposed during other processes, such as DNA damage repair.
The codon usage of backbone genes and KIM-specific genes was analyzed. Some rare codons, such as AGA and AGG for arginine, are used differently in backbone and KIM-specific genes. KIM-specific genes use up to threefold more AGA and AGG codons for arginine than backbone genes. They also use the rare codon AUA for isoleucine twice as often as backbone genes. This difference is also seen in the K-12 genome between backbone and strain-specific genes. Codons are also used differently in ORFs on the leading strands and ORFs on the lagging strands. The leading-strand ORFs contain slightly more G and T than the lagging-strand ORFs; therefore, they are C-G and A-T skewed. The C-G skew for the leading-strand ORFs is particularly strong in the third codon position (data not shown).
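A comparison of rare-codon usage between two gene sets reduces to counting in-frame codons. The Python sketch below computes the fraction of arginine codons that are AGA or AGG in each set and the ratio between the sets. It is a minimal illustration on invented ORFs, not the analysis behind the values quoted above, and the function names are hypothetical.

```python
# Illustrative sketch only; the ORF sequences are invented.
from collections import Counter

ARG_CODONS = {"CGT", "CGC", "CGA", "CGG", "AGA", "AGG"}   # all six arginine codons
RARE_ARG_CODONS = {"AGA", "AGG"}


def codon_counts(orf):
    """Count in-frame codons of one ORF (DNA or RNA letters accepted)."""
    orf = orf.upper().replace("U", "T")
    usable = len(orf) - len(orf) % 3       # ignore any trailing partial codon
    return Counter(orf[i:i + 3] for i in range(0, usable, 3))


def rare_arg_fraction(orfs):
    """Fraction of arginine codons in `orfs` that are AGA or AGG."""
    totals = Counter()
    for orf in orfs:
        totals.update(codon_counts(orf))
    arginine = sum(totals[c] for c in ARG_CODONS)
    rare = sum(totals[c] for c in RARE_ARG_CODONS)
    return rare / arginine if arginine else 0.0


specific = ["ATGAGACGTAGGCGCTAA"]    # invented "strain-specific" ORF
backbone = ["ATGCGTCGCCGAAGATAA"]    # invented "backbone" ORF
print(rare_arg_fraction(specific))                                 # 0.5
print(rare_arg_fraction(backbone))                                 # 0.25
print(rare_arg_fraction(specific) / rare_arg_fraction(backbone))   # 2.0
```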
Phage, ISs, and other repeats. Four types of IS elements were found in the KIM genome. IS1541A is the most abundant (49 complete, 6 partial, 3 interrupted by IS100). There are 35 copies of IS100, 19 copies of IS285, and 8 intact and 2 partial copies of IS1661. All of the partial IS1541A elements have an adjacent IS100, suggesting that loss of part of IS1541A was initiated by IS100 insertion. Two other complex groups consist of IS100 flanked on one side by a partial IS1541A and on the other by a partial IS1661, suggesting an insertion by IS100 into IS1541A and IS1661, followed by recombination between the two IS100 elements. These observations, and the instability of the pigmentation locus, pgm, also bounded by IS100 elements, show that this IS is very actively mobile in KIM. The IS content of KIM is slightly lower than that of CO92, but the same types of elements occur in both.
Some E. coli strains elaborate a heat-stable enterotoxin, EAST1. The coding sequence is homologous to a central region of IS285 on the opposite strand from the transposase. In KIM, 18 of the 19 IS285 copies are identical in this region, having two in-frame stop codons preventing expression, as previously observed (47). The nonidentical copy has 133 single-nucleotide differences and a 10-bp deletion but also has both stop codons.
The KIM genome contains six regions resembling phage (Table ). The most complete is a cryptic lambdoid prophage of 41 kb, located inside a 46-kb island with ORFs similar to those for most of the lambda head and tail proteins and bounded by a 31-bp direct repeat. The integrase gene is disrupted by IS100. An ORF in the Q-lysis interval is positioned for potential transcriptional control by Q, reminiscent of the Shiga toxin phage in E. coli EDL933 (39). However, this ORF has no characteristic that suggests a function in pathogenesis. No tRNAs are encoded in the prophage. Two genes encoding phage holin and endolysin are found between the tca and tcc genes of the insecticidal toxin subunits. A single-stranded prophage observed in the CO92 genome is not present in KIM.
TABLE 1. Coordinates and characteristics of phage and the largest island regions
We searched the genome for REP (BIME) elements and other repeat features characterized in the
E. coli genome. If KIM contains such elements, they are not sufficiently similar to the
E. coli sequences for recognition by a consensus search, with one exception. Two Rhs elements are present in
Y. pestis. These elements are highly conserved regions of ~10 kb containing several ORFs, including the very large “core” ORF with a repeat peptide motif (
21). Though the products encoded by the element are thought to be associated with the cell surface, their functions are unknown.
E. coli strains contain five to seven Rhs elements. Their G+C contents suggest that they originated outside
E. coli, and similar elements have been found in several other species.
Islands. Y. pestis-specific regions are interspersed among the colinear backbone segments. These islands are of all sizes, and some include many ORFs of unknown function, as well as gene groups with putative functions but uncertain substrates, such as transporters. Other islands contain well-characterized segments, such as the yersiniabactin region of the high pathogenicity island (
6,
16,
20). Table shows the extents and contents of some of the larger island regions, including phage regions. Some islands show characteristics of classical pathogenicity islands (distinct G+C content, integrase, and insertion near a tRNA), and the yersiniabactin region is an example. Inserted near tRNA-Asn with a CP4-57-like integrase adjacent, the region has a consistently higher G+C content than flanking areas (59.6% versus a 47.6% average for the whole genome). A 10-nucleotide repeat of the end of the tRNA was found about 35 kb distant, just after the end of the section with high G+C content. In contrast, in the area in which the insecticidal toxins are encoded, none of these features were found, except for large fluctuations in G+C content. The region containing the type II secretion genes has a strikingly lower G+C content than average for the KIM genome (35 versus 47.6%), but no tRNA or recombinase is associated with this region.
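One of the island characteristics mentioned above, a G+C content distinct from the genome average, can be screened for with a sliding window. The Python sketch below flags windows whose G+C fraction departs from the whole-sequence value by at least a chosen threshold. The window size, step, threshold, and toy sequence are invented for illustration, and the function names are hypothetical; this is not the procedure used to delimit the islands in the table.

```python
# Illustrative sketch only; window, step, and threshold are arbitrary choices.

def gc_fraction(seq):
    """Fraction of bases in `seq` that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def atypical_windows(seq, window, step, threshold):
    """Windows whose G+C fraction differs from the sequence average.

    Returns (start, end, gc) tuples with 0-based, end-exclusive coordinates
    for every window deviating from the overall G+C fraction by at least
    `threshold` (expressed as a fraction, e.g. 0.10 for 10 percentage points).
    """
    overall = gc_fraction(seq)
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        gc = gc_fraction(seq[start:start + window])
        if abs(gc - overall) >= threshold:
            hits.append((start, start + window, gc))
    return hits


# Invented sequence: an AT-rich background with a GC-rich insert.
toy = "AT" * 10 + "GC" * 5 + "AT" * 10
print(gc_fraction(toy))                                            # 0.2
print(atypical_windows(toy, window=10, step=10, threshold=0.5))    # [(20, 30, 1.0)]
```

As the text notes for the insecticidal toxin and type II secretion regions, atypical G+C content alone does not establish a classical pathogenicity island; an integrase and insertion near a tRNA are separate lines of evidence.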
Genes and functions. (i) Energy metabolism. In general, the Y. pestis genome contains energy genes typical of a member of the family Enterobacteriaceae, with a few exceptions. Most genes are intact and have a high level of identity with their E. coli homologs. Y. pestis uses respiration (aerobic) and fermentation (anaerobic) to produce energy. Hydrogenases are widespread in bacteria and catalyze both the production and consumption of hydrogen gas. Three distinct multisubunit hydrogenases (nickel enzymes) of E. coli and the ancillary enzymes for utilization of the nickel cofactor are absent from the Y. pestis genome. There is no high-affinity nickel transporter like nikABCDE of E. coli. It is possible that low-affinity transporters may be able to import nickel.
The formate dehydrogenases H and O are present. In fdoG, as in E. coli, an opal stop codon is presumably translated as selenocysteine. This exceptional translation event requires the proteins encoded by selA, selB, and selD, as well as the selC tRNA. In the KIM genome, the entire set of sel genes is present and presumably functional, but in CO92, selB is interrupted by a +1 frameshift. In both genomes, fdoG has the opal codon, so that gene is presumably nonfunctional in CO92. Interestingly, in both Y. pestis genomes, the opal codon in fdhF is replaced by a cysteine codon, which should be a functional substitution. This may be a second example of adaptation to an environment in which micronutrients are unavailable. Other electron donors are present, but no glucose dehydrogenase gene was found. Glycerol fermentation is a defining phenotype of biovar Mediaevalis; both glpD and glpABC systems are present. Genes for biosynthesis of quinones are present. Electron acceptors, such as fumarate and nitrite reductases, are present, though not all of those found in E. coli. The nar nitrate reductase systems of E. coli are missing, and the nap homologs, though present, are inactivated by a frameshift mutation in napA, accounting for the characteristic inability of biovar Mediaevalis to reduce nitrate to nitrite. All enzymes involved in the anaerobic dissimilation of pyruvate in E. coli are present in KIM.
Curiously, a KIM locus, hpa, encodes a pathway in aromatic catabolism characterized in E. coli W but not found in K-12. The proteins encoded by this locus, also in Salmonella enterica serovar Dublin and other bacteria, degrade 4-hydroxyphenylacetic acid. In several more steps, compounds are produced that feed back into energy metabolism. Three of the enzymes in this pathway are also similar to ORFs in E. coli C.
(ii) Transport systems. Mechanisms of iron acquisition have been studied in
Y. pestis (
16,
35,
36). The genome sequence revealed eight intact putative transport systems for iron and two for heme. Five of these loci are newly discovered. Two previously unknown multidomain “factory proteins,” putative nonribosomal peptide synthetases (NRPS) were found, in addition to the HMWP1 and HMWP2 proteins, whose roles in yersiniabactin synthesis have been elucidated (
16). One NRPS system is encoded adjacent to a putative iron-siderophore transport system. Several redundant transport systems found in
E. coli are apparently absent from KIM, for example, the high-affinity nickel transport system (NikA to -E; see above). The AqpZ aquaporin is also missing.
(iii) DNA replication and translation. As noted above, the structure of the replication origin is similar to that of oriC, and all the expected replication genes are present and intact, with the exception of dnaC, of which there is no trace. DnaC guides the DnaB helicase onto the DNA-replisome complex in E. coli. Although it is dispensable in some plasmid systems, it is not clear whether it is essential for genomic replication. We note that Y. pestis grows with a generation time of 1.25 h even under the best conditions. Thus, it probably does not need to initiate multiple replication forks between cell divisions as seen in E. coli at a high growth rate. It is possible that this difference changes the requirement for DnaC or that the Y. pestis functional equivalent has no sequence similarity with DnaC of E. coli. Genes for translation are similar to those in K-12 and are contained in the backbone. There are no ORFs that match the E. coli proteins PrfH (a peptide chain release factor), RimL (an acetyltransferase for ribosomal subunit protein L), YebU (a putative nucleoid protein), and YgcA (a putative RNA methyl transferase). Only one lysine tRNA synthetase is found in KIM, most similar to LysS (E. coli has LysS and -U).
(iv) Motility and chemotaxis. An essentially complete motility and chemotaxis system is present in the KIM genome, despite the fact that Y. pestis is nonmotile. Two sets of flagellar genes were found. One is similar to the systems of Salmonella and E. coli but has a truncated FlhD, a transcriptional activator for the flagellar genes; this truncation probably accounts for the lack of motility. The second gene set is incomplete and much less similar to any characterized system. In addition to the flagellar operons, six putative chemotaxis-transducing proteins were found besides Tsr and Tap.
(v) Secretion systems. Searches for similarities to known protein secretion mechanisms revealed that, as expected, KIM has an intact Sec system, components of a signal recognition particle, and components that could specify a twin arginine transfer mechanism for secretion of folded proteins with redox cofactors. Similarly to CO92, KIM has a nearly complete type II secretion system: all the ORFs except the GspM homolog are present. GspM interacts with GspE, -F, and -L to form an inner membrane structure (
40). While this protein is required in some systems, it is not known whether it is essential for type II secretion in
Y. pestis. Such mechanisms are often used to secrete degradative enzymes, and it will be of interest to learn whether the substrates of this system have primarily nutritional or virulence roles. No obvious type IV secretion mechanism was found.
Type III secretion systems translocate effector proteins from the bacterium directly into the mammalian host target cell. Genes of a type III system are present in the KIM chromosome, but they are more closely related to the
Salmonella genes located on island SPI2 than to the
Yersinia enterocolitica chromosomal locus. Not all of the
Salmonella genes in this group are represented in KIM. The SsaJ homolog, a lipoprotein, is disrupted by a frameshift in KIM but is intact in CO92. The type III gene complement in KIM suggests that all of the essential components are complete for secretion but not translocation. A functional type III system is encoded by the
ysc genes on plasmid pCD1 in KIM, as previously described (
38). Of the 22 ORFs in the chromosomally encoded locus, only 6 are significantly similar to proteins of the plasmid system. These are orthologs of SsaRSTU/YscRSTU, which form an inner membrane complex, and also YscN, a cytoplasmic ATPase, and YscC/SepC, the outer membrane component.
We examined the genome for potential adhesins that could be important for virulence or maintenance in the flea vector. There is a
tadABCDEFG locus in
Y. pestis KIM, as previously described for strain CO92 (
23). As this locus may be an accessory for secretion and assembly of fibers that could function in biofilm formation (
3), we examined the ca. 6 kb upstream of this locus for similarities to a secretin-like protein or to pilins. Although there is homology to a secretin, this ORF is split in strain KIM and, interestingly, is absent in strain CO92 and hence is unlikely to be functional in either strain of
Y. pestis.
There are two ORFs (in addition to the previously reported disrupted
invA) that could encode large proteins (1,050 and 3,013 aa) with significant similarity to invasin of
Yersinia pseudotuberculosis and
Y. enterocolitica. Two unlinked tandem pairs of potential proteins have weak similarity to YadA of the enteropathogenic yersiniae, and at least one of each pair has a potentially cleavable signal sequence and could be surface located. Two homologs of Ail of the enteropathogenic yersiniae are present in the
Y. pestis KIM chromosome. There are at least 10 predicted proteins that resemble autotransporters in strain KIM. An additional autotransporter-like protein, homologous to YapB in strain CO92, is truncated and hence may not be functional in
Y. pestis KIM. Both
Y. pestis strains have two large (3,295 and 2,579 aa in strain KIM) predicted surface proteins with similarities to hemagglutinins and hemolysins. Two additional ORFs encode proteins with similarity to known or putative adhesins in
E. coli O157:H7. One of these is more similar to the Iha adhesin (
45) (54% identical over 686 aa) than it is to its closest counterpart in strain CO92 (34% identical over 683 aa). The
rscBAC locus that affects systemic dissemination of
Y. enterocolitica (
31) is present in
Y. pestis KIM, but the predicted RscA homolog, which is similar to HmwA of
Haemophilus influenzae, is encoded by a broken ORF. This might have the effect of enhancing the disseminative character of
Y. pestis KIM, in analogy with the effect of deletion of RscA in
Y. enterocolitica. Interestingly,
rscA is intact in
Y. pestis CO92.
(vi) Gene regulation. Among the gene expression regulatory systems in KIM, the absence of SoxRS was surprising.
Y. pestis survives phagocytosis by macrophages in vitro and certainly encounters oxidative and nitric oxide stress, two of the host's most effective defenses. SoxRS play a central role in the ability of
E. coli to survive and adapt to various adverse conditions. SoxR is the redox-sensing activator of SoxS, which is a global regulator of several response genes, including Mn-superoxide dismutase. However,
soxS mutants in
Salmonella enterica serovar Typhimurium neither were attenuated for virulence in mice nor displayed increased sensitivity to macrophage killing (
10,
12,
46). KIM does possess
oxyRS, the regulator of a second set of stress response genes which include
katG, encoding a catalase or peroxidase, also present in KIM.
This account presents the most unexpected and important features of the KIM genome and how it compares with the genomes of CO92 and E. coli. While it is by no means a complete analysis of the genome contents, it should provide the foundation for many future experiments aimed at fully characterizing the pathogenicity and lifestyle of Y. pestis.