We have sequenced the genome of Shigella flexneri serotype 2a, the most prevalent species and serotype that causes bacillary dysentery or shigellosis in man. The whole genome is composed of a 4 607 203 bp chromosome and a 221 618 bp virulence plasmid, designated pCP301. While the plasmid shows minor divergence from that sequenced in serotype 5a, striking characteristics of the chromosome have been revealed. The S.flexneri chromosome has, astonishingly, 314 IS elements, more than 7-fold the number possessed by its close relatives, the non-pathogenic K12 strain and enterohemorrhagic O157:H7 strain of Escherichia coli. There are 13 translocations and inversions compared with the E.coli sequences; all involve a segment larger than 5 kb, and most are associated with deletions or acquired DNA sequences, of which several are likely to be bacteriophage-transmitted pathogenicity islands. Furthermore, S.flexneri, resembling another human-restricted enteric pathogen, Salmonella typhi, also has hundreds of pseudogenes compared with the E.coli strains. All of these could be subjected to investigations towards novel preventative and treatment strategies against shigellosis.
Shigella species are Gram-negative, non-sporulating, facultative anaerobes that cause bacillary dysentery or shigellosis in man, with an estimated 160 million episodes and 1.1 million deaths annually, most of them in children under 5 years old in developing countries (1). In China, more than 10 million cases are estimated per annum, of which 50–70% are caused by Shigella flexneri serotype 2a and most are associated with epidemic and pandemic shigellosis (2). Shigella are highly invasive in the colon and the rectum, and are able to proliferate in the host cell cytoplasm, triggering an inflammatory reaction. The clinical manifestations of Shigella infection vary from short-lasting watery diarrhea to acute inflammatory bowel disease characterized by fever, intestinal cramps and bloody diarrhea with mucopurulent feces (1). Since the current preventive and treatment strategies are found to be inadequate, the World Health Organization has placed an anti-Shigella vaccine as a priority (3).
Shigella was recognized as the etiologic agent for bacillary dysentery in the 1890s, and was adopted as a genus in the 1950s and subgrouped into four species: S.dysenteriae, S.flexneri, S.boydii and S.sonnei (4). However, a recent genetic study argues that Shigella emerged from multiple independent origins of Escherichia coli 35 000–270 000 years ago and may not constitute a genus (5). Genes on a virulence plasmid encode the primary virulence determinants, including the invasion plasmid antigens (Ipa) and their devoted Mxi-Spa type III secretion apparatus, but many chromosomal loci are also required for virulence (6). Thus, the defining point for Shigella to evolve from E.coli must be the acquisition of the precursor of the current-day virulence plasmid carrying genes necessary for the bacteria to invade and access the host cell cytoplasm. This is a niche unique amongst the enteric pathogens with the exception of enteroinvasive E.coli that also possesses the virulence plasmid causing similar pathogenic characteristics (4). Subsequent evolution of the chromosome, however, enables the full expression of virulence. Hence, despite the fact that plasmid sequences from serotype 5a have become available (7,8), we felt it necessary to sequence the whole genome of S.flexneri 2a, the most prevalent species and serotype. Particularly, the expression of virulence depends on a complex regulation mechanism that involves dialog between the chromosome and the virulence plasmid (9), and a better understanding of this requires the availability of the whole genome sequence. Indeed, though the virulence plasmid from serotype 2a has minor divergence from that of serotype 5a, we have revealed the highly volatile and dynamic nature of the Shigella chromosome in comparison with the genomes of the non-pathogenic K12 strain and the enterohemorrhagic O157:H7 strain of E.coli (10,11).
Furthermore, we have uncovered many chromosomal loci that potentially contribute to virulence in addition to those identified by the classic genetic studies (6).
Shigella flexneri strain 301 (abbreviated Sf301), which we sequenced, was isolated from a patient with severe acute clinical manifestations of shigellosis in the Changping District, Beijing, in 1984, and has since been used as a reference strain for S.flexneri in China. The strain was routinely grown at 37°C overnight on tryptic soy agar containing 0.01% Congo red. Red colonies were inoculated into tryptic soy broth and grown to stationary phase at 37°C for isolating plasmid and chromosomal DNAs.
The plasmid and the chromosomal libraries were separately constructed using pBluescript KS(–) (Stratagene) as vectors. Approximately 48 000 clones were sequenced from both ends using the big-dye kit (ABI) and ABI377 or ABI3700 automated sequencers, giving rise to ~10-fold coverage of the genome.
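The stated coverage can be checked with simple arithmetic. The following Python sketch uses only figures given in the text (about 48 000 clones read from both ends, the chromosome and plasmid sizes, and roughly 10-fold coverage) and solves for the mean usable read length that these figures imply. The read length is not reported in this section, so the result is an inference from the other numbers, not a measured value.

```python
# Back-of-the-envelope check of the shotgun coverage reported in the text.
CHROMOSOME_BP = 4_607_203   # Sf301 chromosome
PLASMID_BP = 221_618        # pCP301
CLONES = 48_000             # approximate number of clones sequenced
READS_PER_CLONE = 2         # each clone was sequenced from both ends
COVERAGE = 10               # reported fold coverage

genome_bp = CHROMOSOME_BP + PLASMID_BP
reads = CLONES * READS_PER_CLONE


def implied_read_length(genome_bp: int, reads: int, coverage: float) -> float:
    """Mean read length needed for `reads` reads to give `coverage`-fold depth."""
    return coverage * genome_bp / reads


mean_read_bp = implied_read_length(genome_bp, reads, COVERAGE)
print(f"genome: {genome_bp} bp, reads: {reads}, "
      f"implied mean read length: {mean_read_bp:.0f} bp")
```

An implied mean of roughly 500 bp per read is plausible for the sequencers named, which suggests the clone count and coverage figures are mutually consistent.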
Sequences were assembled initially using the phred/phrap program (12) when the sequence coverage was ~4-fold over the estimated size of the genome. The program was run with optimized parameters and the quality score was set to ≥20. Further assembly was carried out repeatedly using the same program when more sequences were obtained. When 100 500 sequences were assembled into 318 contigs, the Consed program was used for sequence finishing (13). Gaps among contigs were closed either by primer walking on selected clones, which were identified by analysis on the forward and the reversed links between contigs using the perl/Tk algorithm, or by sequencing the DNA amplicons generated by polymerase chain reaction (PCR).
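For readers unfamiliar with the quality threshold, a phred score Q corresponds to an estimated base-call error probability of 10^(-Q/10), so Q ≥ 20 means at most a 1% chance that a base is wrong. The Python sketch below illustrates that relation and a simple per-read summary. The helper names and the example scores are hypothetical, and how the threshold was actually applied within phred/phrap is not detailed in the text.

```python
def phred_to_error_prob(q: float) -> float:
    """Convert a phred quality score to the estimated base-call error probability."""
    return 10 ** (-q / 10)


def high_quality_fraction(quals, threshold=20):
    """Fraction of bases in a read whose quality score meets the threshold."""
    if not quals:
        return 0.0
    return sum(q >= threshold for q in quals) / len(quals)


# Made-up quality scores for a short read, for illustration only.
example_quals = [8, 15, 22, 35, 40, 41, 19, 30]

print(phred_to_error_prob(20))
print(high_quality_fraction(example_quals))
```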
Glimmer 2.0, a program that searches for protein coding regions, was used to identify those ORFs possessing more than 30 consecutive codons (14). Overlapping and closely clustered ORFs were manually inspected. Predicted polypeptide sequences were used to search the non-redundant protein database with BLASTP, and the clusters of orthologous groups of proteins (COGs) database was used to identify families to which predicted proteins are related (15).
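The length criterion alone can be illustrated with a naive six-frame scan. The Python sketch below is not Glimmer, which scores candidate genes with interpolated Markov models; it only shows what "more than 30 consecutive codons" means in practice. It assumes ATG as the only start codon (bacteria also use GTG and TTG), and the function names are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 31):
    """Yield (strand, frame, start, end, n_codons) for ATG..stop ORFs.

    Coordinates are 0-based on the scanned strand; `end` is exclusive and
    includes the stop codon. `n_codons` counts the start codon but not the stop.
    The default of 31 encodes "more than 30 consecutive codons".
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i            # open a candidate ORF
                elif start is not None and codon in STOPS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        yield strand, frame, start, i + 3, n_codons
                    start = None         # close it and keep scanning


# Toy sequences: one ORF of 41 codons and one of 11 codons.
long_orf = "ATG" + "GCT" * 40 + "TAA"
short_orf = "ATG" + "GCT" * 10 + "TAA"
print(list(find_orfs(long_orf)))
print(list(find_orfs(short_orf)))
```

Only the longer toy sequence passes the filter; a real gene finder would additionally weigh coding potential, overlaps and alternative starts, which is why the overlapping and clustered ORFs were inspected manually.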
Mobile elements and repetitive sequences were identified using pair-wise comparison. tRNA sequences were identified by the program tRNAscan-SE (16). Repetitive regions were defined as those that have at least 200 bp with the significance of e–10 by BLASTN against the Sf301 genome itself and known IS databases. Sequence annotation and graphs of the circular and linear genomic maps were prepared using a newly developed Perl-Script tool kit (available at ftp://ftp.chgb.org.cn/pub/).
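The repeat definition above amounts to a two-condition filter on BLASTN hits. The Python sketch below applies it to tab-separated hit lines, assuming the standard 12-column tabular layout (query, subject, % identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score) and reading "significance of e–10" as an E-value of at most 1e-10. Both are assumptions, as are the function name and the example hits, which are invented for illustration. The trivial alignment of the genome against itself is excluded.

```python
def is_repeat_hit(line: str, min_len: int = 200, max_evalue: float = 1e-10) -> bool:
    """Return True if a tabular BLASTN hit meets the repeat criteria."""
    fields = line.rstrip("\n").split("\t")
    qseqid, sseqid = fields[0], fields[1]
    length = int(fields[3])
    qstart, qend, sstart, send = map(int, fields[6:10])
    evalue = float(fields[10])
    # A sequence always matches itself at identical coordinates; skip that hit.
    trivial_self = qseqid == sseqid and (qstart, qend) == (sstart, send)
    return length >= min_len and evalue <= max_evalue and not trivial_self


# Invented example hits (not real Sf301 data).
hits = [
    "Sf301\tSf301\t100.00\t4607203\t0\t0\t1\t4607203\t1\t4607203\t0.0\t8.5e+06",
    "Sf301\tSf301\t99.20\t768\t6\t0\t10500\t11267\t250300\t251067\t0.0\t1380",
    "Sf301\tIS1\t98.00\t150\t3\t0\t500\t649\t1\t150\t1e-60\t270",
    "Sf301\tIS2\t85.00\t210\t30\t2\t900\t1109\t5\t214\t1e-05\t50",
]

repeats = [h for h in hits if is_repeat_hit(h)]
print(len(repeats))
```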
Whole genomic comparison with E.coli K12 MG1655 (accession no. U00096) and O157 EDL933 (accession no. AE005174) was performed using the GenomeComp program (J.Yang, J.Wang, Q.Jin, Y.Shen, Z.Yao and R.Chen, manuscript in preparation).
The accession numbers for Sf301 chromosome and plasmid pCP301 are AE005674 and AF386526, respectively, in GenBank.
The primary features of the Sf301 genome are summarized in Table 1 and graphically viewed in Figures 1 and 2. The whole genome of Sf301 is composed of a 4 607 203 bp chromosome and a 221 618 bp virulence plasmid, designated pCP301. The chromosome shares a common ‘backbone’ sequence ~3.9 Mb with those of E.coli K12 (MG1655) (10) and O157 (EDL933) (11), which is essentially collinear. However, the backbone sequence is interrupted by numerous segments of K12-, O157- and Shigella-specific DNA, designated ‘K-islands’ (KIs), ‘O-islands’ (OIs) and ‘S-islands’ (SIs), respectively (Fig. 1, circle 1). The co-linearity is also broken by numerous inversions and translocations compared with the E.coli sequences, 13 of which involve DNA segments >5 kb and are all bordered by IS elements and mostly associated with deletions or SIs (Fig. 2). All of these were confirmed by subsequent PCR sequencing of the junctions of each of the translocations and inversions. In the case of EDL933, there is only one inversion near the replication terminus with respect to K12 as noted previously (Fig. 2) (11). The dynamic gene shifts of the Sf301 chromosome are in contrast to the conserved genetic maps of E.coli K12, Salmonella typhimurium, other E.coli strains, other Salmonella spp., Klebsiella pneumoniae and many other enterics (17). However, there is no evidence for gene drift mediated by recombination between rRNA operons as observed in S.typhi (18) and in some Shigella strains (19). All the rRNA operons of Sf301 fall in approximately the same loci as those of E.coli (10,11). Natural selection that optimizes all promoters has operated to conserve genetic maps among enterics (17).
A gradient of gene dosage generated from rapid chromosomal replication constrains genes to certain locations relative to the replication origin, and actively transcribed genes have a strong bias to be transcribed away from the origin, whereas weakly transcribed genes are evenly orientated (20). Genetic rearrangements alter the locations, orientations, and the coding strand (in the cases of inversions) with respect to the origin of replication, possibly changing the amount of transcription of many genes. This in turn may affect their dosage, and in some cases impair growth (21,22). Hence, the changed genetic map suggests that S.flexneri may have re-optimized its promoters to cope with selection pressures in the unique intracellular or ex vivo environments.
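How an inversion changes a gene's orientation relative to replication can be made concrete with a toy model. The Python sketch below treats the chromosome as a circle with a replication origin and terminus, classifies a gene as co-directional with the replication fork or not, and then maps the gene through an inversion. The coordinates are invented (the positions of the Sf301 origin and terminus are not given in this section) and all function names are hypothetical.

```python
def codirectional_with_fork(pos, strand, ori, ter, genome_len):
    """True if a gene is transcribed in the direction the replication fork moves.

    On the replichore running from `ori` to `ter` in increasing coordinates the
    fork moves forward, so '+' genes are co-directional; on the other replichore
    the fork moves backward, so '-' genes are co-directional.
    """
    forward_len = (ter - ori) % genome_len
    offset = (pos - ori) % genome_len
    on_forward_replichore = offset < forward_len
    return (strand == "+") == on_forward_replichore


def invert(pos, strand, seg_start, seg_end):
    """Map a gene through an inversion of [seg_start, seg_end]; others are unchanged."""
    if seg_start <= pos <= seg_end:
        return seg_start + seg_end - pos, "-" if strand == "+" else "+"
    return pos, strand


# Toy chromosome: 1000 units long, origin at 0, terminus at 500.
GENOME_LEN, ORI, TER = 1000, 0, 500

gene = (100, "+")
before = codirectional_with_fork(*gene, ORI, TER, GENOME_LEN)
moved = invert(*gene, 50, 300)   # inversion lying within a single replichore
after = codirectional_with_fork(*moved, ORI, TER, GENOME_LEN)
print(gene, before, "->", moved, after)
```

In this model an inversion confined to one replichore both shifts the gene's distance from the origin and reverses its orientation relative to the fork, which are the two effects on gene dosage and transcription discussed above.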
Sf301 has in total 64 SIs with sizes >1 kb, all of which are numbered and detailed in the ‘linear map 1’ (Supplementary Material). Among them, several, including the previously identified pathogenicity islands (PAIs) SHE-1 (23) and SHE-2 (24), have implications in virulence. Strikingly, there are seven ipaH genes, five of which are located in five large SIs, designated as ipaH islands 1–5 (Fig. 2). All five ipaH genes in the islands are next to the genes that potentially encode proteins sharing 73–76% identity with a 188 amino acid hypothetical protein of unknown function from Salmonella bacteriophage P27 (accession no. NP_543109). The majority of the remaining genes in the ipaH islands share homologies with genes of different phages including those identified in the genome of O157 EDL933. But the overall gene contents and organizations in all 5 ipaH islands have little similarity. This suggests that the chromosomal ipaH genes were originally linked with phage P27 and subsequently transmitted to S.flexneri by different phages. The plasmid, pCP301, carries five ipaH genes, termed ipaH9.8, ipaH7.8, ipaH4.5, ipaH2.5 and ipaH1.4, at approximately the same loci as those in pWR100 and pWR501 from serotype 5a (7,8). None of these is next to the genes of the phage P27 paralogs, suggesting that they came from different sources or, alternatively, were transmitted to the plasmid via different vehicles. The pWR501-borne ipaH7.8 is involved in the escape of Shigella from phagocytic vacuoles in the macrophages (25), but other ipaH genes have not been assigned a function. However, there is evidence that S.flexneri expresses more IpaH9.8 within host cells, and the proteins penetrate the host cell nuclei (26). This, and the fact that all IpaH proteins have a leucine-rich repeat region found in a diverse group of proteins from bacteria and eukaryotes (27), implies that IpaH might be involved in manipulating host gene expression.
Alignment of all IpaH proteins indicates that they have identical C-terminal, but variable N-terminal, halves (Fig. 3). This suggests that they may interact with different host substrates, but exert similar functions.
In ipaH island 2, four consecutive genes, similar to the Salmonella sitABCD and the Yersinia yfe operons (28), may encode proteins required for iron uptake. Since the Salmonella sitABCD can complement the growth in iron-restricted medium of an enterobactin-deficient E.coli, a role of the Salmonella sit-like (SSL) system in iron uptake is implicated. Iron uptake mechanisms have undergone complicated adjustments in S.flexneri. On the one hand, the enterobactin system is impaired due to the presence of stop codons in fepE, fhuE and entC genes (see Table 3), and on the other hand, the SSL system has been introduced, and additionally, SHE-2 encodes an aerobactin system (24). In some strains the E.coli fec enterobactin system is re-introduced along with the so-called Shigella resistance locus PAI within a multiple resistance deletable element (MRDE) (29). However, MRDE is not present in Sf301.
Two other SIs are worthy of mention, the sci and the SfII islands (Fig. 2). The sci island is 22 789 bp in length and possesses the typical structure of a PAI: it is inserted at an asp-tRNA and ends with an IS629 on the other side. It carries paralogs of the Salmonella sciCDEFF operon (accession no. AJ320483) of unknown function and of phage P22 and HK620, suggesting that it is possibly another phage-transmitted PAI. SfII has been demonstrated to be a lysogenic phage in which two genes, bgt encoding a bactoprenol glucosyl transferase and gtrII encoding a glucosyl transferase, are required for expression of the type II antigen (30). Thus, phage-mediated horizontal DNA transfer appears to be one of the major routes by which S.flexneri gains virulence determinants.
The IS elements identified in the Sf301 genome are listed in Table 2. In the chromosome, there are astonishingly 247 complete and 67 partial IS elements, which makes it the most IS-rich chromosome among enterics. The predominant species is IS1, followed by IS600, IS2 and IS4. They all are frequently associated with SIs, inversions and translocations, deletions and insertional gene inactivation (see ‘linear map 1’ in Supplementary Material). The IS elements are, therefore, probably the major cause of the dynamics of the Sf301 chromosome. Indeed, the presence of IS91 at two ends of MRDE (mentioned above) allows the precise acquisition and excision of the entire 99 kb segment (29). Furthermore, IS1 and other IS elements have also been shown to be able to mediate various genetic rearrangements (31,32), and IS1 in particular can cause inversions and deletions (32). It is plausible that the IS elements will mediate further evolution of the chromosome. Similarly, pCP301 has large numbers of IS elements, sharing similar composition with pWR501 from serotype 5a (Table 2), indicating that the virulence plasmids are also volatile and dynamic. One difference between pCP301 and pWR501 is that the former has two copies of iso-IS10R that may be transposed from the Sf301 chromosome (Table 2). But, it remains to be seen whether this IS element is present in the genome of serotype 5a. If not, it might be used as a marker for epidemiological studies.
With respect to the Sf301 chromosome, MG1655 and EDL933 possess two kinds of islands. One kind is formed owing to the deletions of the corresponding E.coli DNA segments from the Sf301 chromosome, which is hardly surprising given the dynamics of the genome. These include the so-called ‘Black Hole’ harboring cadA responsible for converting lysine to cadaverine that adversely affects virulence (33) and the kcp locus harboring ompT that inhibits the induction of guinea pig keratoconjunctivitis (34) (arrows in Fig. 2). It remains to be investigated how many such ‘Black Holes’ have deletions of genes that would otherwise inhibit full expression of virulence.
The other kind of island is apparently formed by laterally acquired DNA sequences, of which the large ones are evident in Figure 2 with the scales used. A FASTA query of these groups of OIs and KIs against the Sf301 genome reveals no significant homologous sequence, and a query of all the SIs against EDL933 and MG1655 genomes reveals no homologous sequence either. Thus, O157 and S.flexneri appear to have acquired their island DNA from different sources and have evolved from ancestral E.coli strains through unrelated paths. Furthermore, all the SIs, OIs and KIs have no duplicated copies, indicating that none of them is mobile.
We must point out that we do not define sequences shared by paired strains (EDL933 or MG1655 with Sf301) as islands, though these may appear to be ‘islands’ with respect to the third genome. These sequences may reflect genetic properties of the ancestral E.coli strain that Sf301 evolved from. An example is the set of rfa/waa genes involved in LPS biogenesis (Fig. 4). Sf301 and EDL933 have identical numbers of genes that share 99% identity in each case, whereas MG1655 has an equivalent functional operon with more genes and poor homology with the former (Fig. 4). Studies into this type of shared sequence may shed more light on strain diversity and evolution.
Apart from deletions of corresponding E.coli DNA segments, the formation of pseudogenes through introduction of stop codons, frame shifts, truncations and insertions in the coding regions appears to also play a major part in losing unwanted genes in S.flexneri. Pseudogenes with known functions according to the E.coli protein database are listed in Table 3. Answers to many of the phenotypic characteristics of Shigella, such as the loss of motility and utilization of lactose, maltose and xylose, etc., can be found here. It is noted that 90% of these pseudogenes are intact in O157 EDL933. To this end, S.flexneri resembles S.typhi, another enteric pathogen restricted to humans. The presence of large numbers of pseudogenes has been postulated to be one of the main reasons that S.typhi evolved from the rest of the Salmonella species to become a solely human pathogen (35). Likewise, the originally closely linked O157 and Shigella have evolved in diverse directions. Strain O157 became a successful pathogen with broad host range mainly by acquiring DNA (Table 1 and Fig. 2), whereas Shigella also became a successful pathogen but restricted to humans only, by acquiring, as well as losing, DNA.
Like previously sequenced virulence plasmids (pWR100 and pWR501) from serotype 5a strains (7,8), pCP301 is a mosaic of potential virulence-related genes, IS elements, maintenance genes and functionally unknown ORFs. All the previously identified virulence genes are present in pCP301. These include the primary invasion genes ipa and mxi-spa (encoding the invasion plasmid antigens and the type III secretion system, respectively), virG/IcsA (required for polymerizing host actin to provide propelling force for intra- and inter-cellular spread) and virF (necessary for regulating virulence gene expression). The replication origin (R100-like) ori and G site (single-strand initiation site) in pCP301 are identical to those of pWR501 and pWR100. pCP301 also has maintenance genes, repA, copA and copB, for replication; parA and parB for partitioning; and ccdA and ccdB for post-segregation killing. The noticeable difference between pCP301 and the plasmids from serotype 5a is the presence of more IS-related DNA in pCP301, making its size close to pWR501 (221 851 bp) which is larger than pWR100 because of a Tn501 (8360 bp) insertion (8). So, both Shigella serotypes most likely acquired the ancestral virulence plasmid from the same source. One other minor divergence is that the ipa-mxi-spa loci in pWR501 and pCP301 are in the same orientation, whereas in pMYSH6000, the virulence plasmid from another 2a strain, they are in inverse orders (36). This indicates that the divergence of the plasmids does not necessarily correlate with serotypes. A detailed comparison of pCP301 with pWR501 is available in the Supplementary Material (‘linear map 2’).
Comparison of the S.flexneri genome with that of E.coli supports the previous genetic study (5) in suggesting that S.flexneri is closely related to E.coli and may turn out to belong to the same genus. The global gene content (Table 1) and alignments (Fig. 2) indicate that S.flexneri is more closely related to the non-pathogenic K12 strain MG1655 than to the pathogenic O157 strain EDL933. This is in agreement with the suggestions that O157 and K12 last shared an ancestor ~4.5 million years ago (37), whereas Shigella evolved from multiple E.coli strains much later, correlating with the appearance of early man in the paleolithic (5). All these studies argue strongly for reclassifying Shigella species as members of E.coli.
To meet the demand of its unique pathogenic lifestyle, the S.flexneri chromosome has evolved distinctive characteristics after acquisition of the large virulence plasmid. Most importantly, there are several potential bacteriophage-transmitted PAIs, many translocations, inversions and deletions of the corresponding E.coli DNA segments, and numerous pseudogenes. These findings provide an invaluable genetic basis for future studies into understanding bacterial evolution, as well as pathogenicity, and the development of novel preventive and treatment strategies against shigellosis.
Supplementary Material is available at NAR Online.
We thank Paul Langford, Janine Bosse and Alick Stephens for critical reading of the manuscript, and the following for expert technical assistance: Shu Wang, Tianjing Cai, Yingying Xi, Xinyu Tan, Yanrui Jiang, Shitao Zhuang, Xinfeng Zhou, Li Rong, Tao Lu, Wei Liu and Lihong Chen from National Center of Human Genome Research, Beijing, and Shuiyun Lan, Yueqing Zhang and Minjie Chen from Fudan University. This project was supported by the State Key Basic Research Program (grant no. G1999054103) and High Technology Project (grant no. Z19-02-05-01) from the Ministry of Science and Technology of China.
DDBJ/EMBL/GenBank accession nos AE005674, AF386526