Microbial Ecology
Microb Ecol. 2010 November; 60(4): 708–720.
Published online 2010 July 11. doi:  10.1007/s00248-010-9717-3
PMCID: PMC2974192

Comparison of 61 Sequenced Escherichia coli Genomes


Escherichia coli is an important component of the biosphere and is an ideal model for studies of processes involved in bacterial genome evolution. Sixty-one publicly available E. coli and Shigella spp. sequenced genomes are compared, using basic methods to produce phylogenetic and proteomics trees, and to identify the pan- and core genomes of this set of sequenced strains. A hierarchical clustering of variable genes allowed clear separation of the strains into clusters, including known pathotypes; clinically relevant serotypes can also be resolved in this way. In contrast, when in silico MLST was performed, many of the strains appeared jumbled and less well resolved. The predicted pan-genome comprises 15,741 gene families, and only 993 (6%) of the families are represented in every genome, comprising the core genome. The variable or ‘accessory’ genes thus make up more than 90% of the pan-genome and about 80% of a typical genome; some of these variable genes tend to be co-localized on genomic islands. The diversity within the species E. coli, and the overlap in gene content between this and related species, suggest a continuum rather than sharp species borders in this group of Enterobacteriaceae.


The availability of complete genome sequences from multiple isolates of a given species has opened up a whole new range of research strategies. By far the best-studied bacterial species is Escherichia coli, and the highest number of individual genome sequences is available for this species, which has been the workhorse of bacteriology for as long as the discipline has existed. Numerous basic molecular processes were first characterized and extensively studied in E. coli, leading to insights that could subsequently be applied to other bacteria [47]. Despite the vast amount of knowledge already available for E. coli, based on decades of experimental research, genetic manipulation and, more recently, observations based on single or multiple genome sequences, comparison of a large number of E. coli genome sequences can still provide novel insights, such as the presence of genomic islands that are present in some pathogenicity groups but missing in others. At the time of writing, more than 100 E. coli genome sequence projects have been reported, many of which have been deposited in GenBank. Here, we compare 61 publicly available genome sequences of E. coli and Shigella spp. isolates.

Escherichia spp. and Shigella spp. are Gram-negative, facultatively anaerobic, intestinal bacteria belonging to the Enterobacteriaceae, which are taxonomically placed within the gamma subdivision of the Proteobacteria phylum. Although Shigella spp. isolates have been awarded their own genus, which is divided into several species (representing different serogroups), its separation from Escherichia spp. is mainly historical. For example, in Bergey’s Manual of Systematic Bacteriology, the section on Shigella phylogeny begins with the following sentence: “Scientific evidence accumulated to date strongly supports that view that Shigella species are biotypes/pathotypes or clones of E. coli” [39]. More than 50 years ago it was observed that Shigella spp. and E. coli have the same fertility system [25]; in 1972, Brenner et al. [3] found, based on DNA/DNA hybridization, that Shigella spp. and E. coli are the same species. Experiments with multilocus enzyme electrophoresis concluded that nearly all of the Shigella species are clones from within the E. coli species [35]. Further, analysis of 16S rRNA sequence alignments places Shigella spp. within E. coli [6]. Thus, all current evidence indicates that Shigella spp. should be classified as E. coli [23, 36]. Both genera contain highly diverse species, although Shigella spp. are as related to E. coli as they are to each other. E. coli is a ubiquitous component of the intestinal flora of animals, including humans, and can survive and multiply in abiotic environments as well. The species comprises both benign and pathogenic variants, whilst Shigella spp. are all enteropathogens in mammals.

E. coli isolates have in the past been divided into subgroups in various ways. Based on established pathogenicity towards the human host, pathogenic versus commensal E. coli have been recognized, although it is acknowledged that 'pathogenic' E. coli strains may colonize other animal species asymptomatically. Pathogenic E. coli have further been subdivided according to their typical site of infection and clinical manifestations in humans, for instance enteropathogenic, uropathogenic, or extra-intestinal pathogenic E. coli, or based on their virulence mechanisms, such as enterohemorrhagic (EHEC), enterotoxigenic, enteroinvasive, and enteroaggregative E. coli [1, 13]. Other divisions that are frequently used are based on serology (e.g., serotypes O127:H7 or K12) or, mainly for population genetic purposes, on phylogenetic properties of particular housekeeping genes, as established by multilocus enzyme electrophoresis and later by multilocus sequence typing (MLST) [26]. Finally, some isolates are described simply by their source of isolation, such as environmental isolates or avian pathogenic E. coli.

All these subdivisions have been applied more or less frequently to group isolates that share particular features. We were interested to see if any of these groupings would hold when isolates were compared based on their complete genome sequences, considering some or all of their genes. Isolates from some groups (based on whatever grouping) have been more frequently sequenced than from others, and complete information on all characteristics of interest (pathogenicity, source of isolation, serotype) is not available for all sequenced isolates. Despite these recognized shortcomings in sampling bias and recorded information, comparison of these 61 genome sequences revealed that neither the 16S gene, nor gene fragments usually used for MLST, provides biologically meaningful information on the relatedness of the sequenced isolates. The best way to analyze this is by taking into account all the genomic content, rather than looking at one or a few individual genes. The E. coli core genome has been previously reported to be less than half the genes [13], with more than half the E. coli genes in any given genome being found in some strains, but missing in others. Many of these variable genes can be clustered to specific regions, located on genomic islands in an E. coli chromosome.

Materials and Methods

Bacterial Genomes and Gene Annotations

Sixty-one bacterial genomes of E. coli and Shigella spp. were used in this study (Table 1). Of these, 39 fully sequenced genomes and 19 genomes for which the sequence was still in progress at the time of extraction were obtained from GenBank (1). The sequence of E. coli O103 Oslo was obtained from the Norwegian Veterinary Institute, and sequences of strains LANL ECA and LANL ECF were obtained from Los Alamos National Lab. Genome sequences of Escherichia albertii, Escherichia fergusonii, and Salmonella enterica Typhimurium LT2 were included for comparison (Table 1). The ‘quality score’ for each genome is given in Table 1, based on the scale suggested by Chain et al. [4]. A completely sequenced genome that has been deposited in GenBank is given a score of ‘1’, with the only exception being E. coli O157:H7 isolate EDL933, which currently has more than 4,000 “N’s” in the DNA sequence of the GenBank file, representing unfilled gaps along the chromosomal sequence; hence, this genome is given a lower score of ‘2’. Higher scores represent lower quality (and often more contigs, or pieces of the DNA, although sequence quality is not measured only by this, as described in [4]).

Table 1
Genomes used in this study

16S Ribosomal RNA Analysis

The sequences encoding 16S ribosomal RNA were extracted from the analyzed genomes using RNAmmer [22]; sequences with an RNAmmer score above 1,400 were considered reliable and were kept for analysis. From every genome, the gene with the highest similarity to rrsH of E. coli K12 MG1655 was selected, and these sequences were aligned using ClustalX [24]. A phylogenetic tree was generated by ClustalX using the bootstrapped neighbor-joining method, showing the bootstrap values at branch points, and was visualized with NJPlot [34].

In Silico MLST

The alleles for seven housekeeping genes used for MLST of various species were analyzed. These were fragments of adk, fumC, icd, gyrB, mdh, purA, and recA. The DNA sequences were extracted from the genome sequences, concatenated, and phylogenetically analyzed as described above. Alignments were not manually adjusted, to avoid subjective interpretation of the outcome.
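
The concatenation step can be sketched as follows. This is only an illustration under stated assumptions, not the pipeline used in the study: the function name `concatenate_mlst` and the input layout (a dictionary mapping each genome to its extracted gene fragments) are hypothetical, and alignment and tree building are left to an external tool such as ClustalX.

```python
# The seven MLST gene fragments named in the text, in a fixed order.
MLST_GENES = ["adk", "fumC", "icd", "gyrB", "mdh", "purA", "recA"]


def concatenate_mlst(fragments, genes=MLST_GENES):
    """Join the MLST fragments of each genome in a fixed gene order.

    fragments: dict genome -> dict gene -> DNA sequence (hypothetical layout).
    Returns (concatenated, skipped); genomes lacking any gene are skipped.
    """
    concatenated, skipped = {}, []
    for genome, seqs in fragments.items():
        if all(gene in seqs for gene in genes):
            concatenated[genome] = "".join(seqs[gene] for gene in genes)
        else:
            # A genome missing one of the seven genes cannot be typed.
            skipped.append(genome)
    return concatenated, skipped
```

Keeping the gene order fixed matters because the concatenated sequences are aligned as single units; a genome that lacks one fragment is reported rather than silently padded.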

Predicted Proteome Analysis

The predicted proteomes comprising all protein-coding genes were extracted from the GenBank files for the published genomes. For unpublished genomes, they were predicted using EasyGene [30]. All predicted proteomes were compared by reciprocal pairwise BLASTP comparison. Two genes were attributed to a single gene family and considered 'conserved' when they shared at least 50% amino acid identity over at least 50% of the length of the longer of the two genes.
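
The 50%/50% criterion can be written as a small predicate. This is a minimal sketch: the function name and argument names are hypothetical, the identity is assumed to be a percentage and the alignment length a residue count taken from a BLASTP hit, and how reciprocal hits were reconciled is not stated in the text.

```python
def same_gene_family(identity_pct, aligned_len, len_a, len_b,
                     min_identity=50.0, min_coverage=0.5):
    """Return True if a pairwise hit meets the 50%/50% family criterion.

    identity_pct: percent amino acid identity of the alignment.
    aligned_len:  number of aligned residues.
    len_a, len_b: lengths of the two proteins being compared.
    """
    longest = max(len_a, len_b)  # coverage is measured on the longer protein
    return (identity_pct >= min_identity
            and aligned_len >= min_coverage * longest)
```

Measuring coverage against the longer protein prevents a short fragment from being assigned to the family of a much longer protein on the strength of a local match alone.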

A hierarchical clustering was performed for the complete pan-genome as described by Snipen et al. [38]. Briefly, a pan-genome matrix was constructed consisting of 1s and 0s, where each row corresponds to a gene family, as described above, and each column to a genome. Cell (i,j) in the matrix is 1 if gene family i is present in genome j, or 0 if it is absent. Manhattan distances were calculated and used for hierarchical clustering to generate the tree. The plotted distance between two genomes shows the proportion of gene families where their present/absent status differs. Thus, pan-genome hierarchical clustering analyses genes that are not conserved, but vary in their presence or absence between genomes. Shorter distances represent genomes with more gene families in common. Genes only occurring in a single genome (singletons) were not included in the analysis. Bootstrap values (per mil) were computed for each inner node by re-sampling the rows of the matrix.
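
The matrix and distance calculation described above can be sketched in plain Python. The function names and the input layout are hypothetical, and dividing by the number of non-singleton families is an assumption about how the proportion was normalized; the linkage method used for the clustering itself is not stated here, so the sketch stops at the distance table.

```python
from itertools import combinations


def pan_matrix(families, genomes):
    """Build presence/absence rows, one per gene family.

    families: dict family_id -> set of genomes containing that family.
    Families found in a single genome (singletons) are dropped.
    """
    rows = {}
    for fam, members in families.items():
        if len(members) < 2:
            continue
        rows[fam] = [1 if g in members else 0 for g in genomes]
    return rows


def manhattan_proportions(rows, genomes):
    """Pairwise Manhattan distance, scaled by the number of families."""
    n = len(rows)
    dist = {}
    for a, b in combinations(range(len(genomes)), 2):
        # For 0/1 data the Manhattan distance counts differing families.
        diff = sum(abs(row[a] - row[b]) for row in rows.values())
        dist[(genomes[a], genomes[b])] = diff / n if n else 0.0
    return dist
```

The resulting table can be passed to any hierarchical clustering routine; bootstrap support would then come from re-sampling the rows and repeating the calculation.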

A pan- and core genome plot was constructed according to [12]. The order of genomes was chosen based on the pan-genome tree, starting with the largest E. coli O157 genome. For the pan-genome curve, all cumulative BLAST hits found in the genomes were plotted as a running total, which increases as more genomes are added. The number of gene families with at least one representative in every genome was plotted for the core genome and this slowly decreases with the addition of more genomes, as these genomes may lack genes from gene families that had been conserved in the previously plotted genomes.
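
The running totals behind such a plot can be sketched with set operations. This is a simplified illustration, not the published procedure: it assumes each genome has already been reduced to a set of gene family identifiers (the text describes the pan-genome curve in terms of cumulative BLAST hits), and the function name is hypothetical.

```python
def pan_core_curves(genome_families, order):
    """Compute pan-genome, core genome, and novel-family counts.

    genome_families: dict genome -> set of gene family ids.
    order: list of genome names in the order they are added.
    """
    pan, core = set(), None
    pan_sizes, core_sizes, novel = [], [], []
    for genome in order:
        fams = genome_families[genome]
        novel.append(len(fams - pan))   # families not seen before
        pan |= fams                     # union grows with each genome
        core = set(fams) if core is None else core & fams  # intersection shrinks
        pan_sizes.append(len(pan))
        core_sizes.append(len(core))
    return pan_sizes, core_sizes, novel
```

Because the pan-genome is a union and the core an intersection, the first curve can only rise and the second can only fall, whatever order is chosen; only the per-genome count of novel families depends on the order.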

A BLAST atlas was constructed as described by Hallin et al. [14].

Results and Discussion

A number of characteristics of each of the 61 genomes are summarized in Table 1, such as their size, their number of recognized protein genes, and their gene density. Their GC content varies around 50% for all genomes (not shown), but their size and number of genes vary extensively. The smallest E. coli genome included is that of strain BL21 (DE3) sequenced by the Korean consortium, which is only 4.56 Mbp, and the smallest Shigella genome is that of Shigella dysenteriae Sd197, with 4.56 Mbp. The longest of the completed genomes so far belongs to E. coli O157:H7 strain EC4115, with 5.70 Mbp. Longer genomes are listed in Table 1, but since those sequences are still in multiple contigs, it is possible that their stated length is overestimated. These size differences mean that around one million nucleotides (approximately 20% of a genome) can be absent in one E. coli or Shigella isolate and present in another. These 'extra' sequences are not void, as indicated by the variation in number of genes: the longest E. coli genome has 1,158 more predicted genes than the shortest E. coli genome (5,315 genes for strain EC4115 and 4,157 genes for BL21). Further, the observed gene density is relatively constant, at 0.911 ± 0.04 genes per 1,000 base pairs. It should be noted that published proteomes have been defined using different gene prediction programs and definitions, so that the observed slight variation in gene density might be explained by non-standardized gene identification.
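
The figures quoted for the largest and smallest completed E. coli genomes can be checked with a few lines of arithmetic. The sketch below uses only the sizes and gene counts given in the text; the helper name `gene_density` is hypothetical.

```python
# Sizes (Mbp) and predicted gene counts quoted in the text.
GENOMES = {
    "EC4115": {"size_mbp": 5.70, "genes": 5315},
    "BL21(DE3)": {"size_mbp": 4.56, "genes": 4157},
}


def gene_density(genes, size_mbp):
    """Genes per 1,000 base pairs."""
    return genes / (size_mbp * 1000)


large, small = GENOMES["EC4115"], GENOMES["BL21(DE3)"]
extra_genes = large["genes"] - small["genes"]
extra_fraction = (large["size_mbp"] - small["size_mbp"]) / large["size_mbp"]

for name, g in GENOMES.items():
    print(name, round(gene_density(g["genes"], g["size_mbp"]), 3))
print("extra genes:", extra_genes)
print("extra sequence as fraction of larger genome:", round(extra_fraction, 2))
```

Both densities fall within the reported range, which is consistent with the extra sequence carrying genes at roughly the same density as the rest of the chromosome.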

Phylogeny of 16S Ribosomal RNA and MLST Genes

A phylogenetic tree based on the 16S ribosomal RNA sequences extracted from a representative set of 20 Enterobacteriaceae genomes is shown in panel a of Fig. 1, which is in agreement with the known phylogeny of the family. The tree for the full set of the 61 E. coli and Shigella strains, including two additional species of Escherichia and one of S. enterica, is shown in Fig. 1b. From this figure, it is obvious that the phylogeny of the 16S rRNA gene does not resolve well within the genus level, as is known, because the rRNA operons are so similar. Although some of the tree nodes are predicted with uncertainty, clearly the genera Shigella and Escherichia are not separated, nor are E. coli genes separated from those of E. fergusonii or E. albertii. This finding was expected, considering the close relatedness between Escherichia spp. and Shigella spp. In general, 16S sequences are not suitable for analyzing inter-strain relationships within a species or between closely related species, as illustrated with this set of Enterobacteriaceae genes. This calls into question the reliability of using 16S as an indicator of the species to which unknown sequenced DNA belongs [45].

Figure 1
Phylogenetic tree based on extracted 16S rRNA sequences. a Comparison of 20 different Enterobacteriaceae, based on extracted 16S rRNA sequences from the GenBank sequence files. E. coli and Shigella are shown in green. b Tree of 61 sequenced E. coli ( ...

Next, it was investigated whether conserved housekeeping genes, frequently assessed for MLST, provide a better representation of the relatedness of the investigated genomes. Various MLST schemes are in use for E. coli [9, 28] or Shigella spp. [35], but these are not standardized and the genes assessed in these schemes are not conserved in all genomes. We used the combination of seven housekeeping genes that has been applied to a number of bacterial species [26]. Since S. enterica lacks fumC (an observation that somewhat weakens the general applicability of this MLST gene set), that genome was not included in the analysis. The resulting tree, shown in Fig. 2, still mixes E. coli with Shigella species, and does not separate all pathogenic strains from commensal strains. Some of the phylogroups previously defined by multilocus enzyme electrophoresis are clearly separated, such as the E cluster containing all O157 strains, the A/B cluster of commensal K12 and B strains, and the B2 cluster containing some of the uropathogenic strains, in accordance with comparisons carried out by others [40]. Other authors concluded that the O157 serotype of EHEC probably evolved in successive evolutionary events [9]; however, that conclusion is not supported by the MLST tree. And although the B phylogroup is known for its commensal isolates, one of which was used by Delbrück and Luria for their famous phage work, this branch also contains the enteroaggregative strain 101-1 (Fig. 2). Moreover, the two S. dysenteriae strains are widely separated from each other. Pupo et al. [36], who used a different set of MLST genes, also found that isolates of the three species Shigella flexneri, Shigella boydii, and S. dysenteriae could not always be grouped together nor separated from E. coli. Various enteroinvasive E. coli serotypes have been suggested as ancestral to the different Shigella serogroups [23], which could explain the lack of differentiating power of MLST in this case.
Apparently, neither MLST gene set is suitable for grouping these Enterobacteriaceae organisms in a meaningful way. The performance of MLST could in theory be improved by selecting different genes, for instance using a set of genes specifically chosen to produce the desired grouping. However, the strength of MLST analysis should be that a conserved set of genes is able to identify phylogenetic relationships in any collection of isolates from one species. If one has to select a 'standard' gene set specifically for the species under investigation, it weakens the general application of MLST considerably.

Figure 2
Phylogenetic tree of concatenated MLST gene alleles (adk, fumC, icd, gyrB, mdh, purA, recA), extracted from the genome sequences. Color use is the same as in Fig. 1

Pan-Genome Comparisons

MLST analyzes allelic differences in genes whose presence has to be conserved in all genomes. However, we hypothesized that genes that are variably present could provide useful information as to the true relatedness of the analyzed genomes. Since the variable fraction contains genes that are present in some genomes and absent in others, a sequence-based phylogenetic analysis cannot capture all of this information. Figure 3 displays a pan-genome clustering tree, based on the gene families that are variably present in the analyzed genomes (gene families comprising singletons were excluded). The hierarchical clustering obtained by this analysis correctly separates the Shigella spp. and S. Typhimurium from Escherichia spp. and, within the latter genus, separates E. coli from the other Escherichia spp. (Fig. 3). Moreover, all E. coli O157:H7 genomes now cluster together, as do the K12 derivatives (W3110, MG1655, DH1, BW2952, DH10B, and ATCC8739). The strains belonging to phylogenetic group B are also positioned in one cluster, to which the non-pathogenic commensal strain HS also seems to belong. All of these are avirulent isolates, and it is notable that they are positioned close together in the tree. We conclude that this analysis of variable genes identifies inter-strain relationships that can be correlated to the lifestyle of the organisms.

Figure 3
Pan-genome clustering of E. coli (black) and related species (colored), based on the alignment of their variable gene content. The genomes now cluster according to species and a relatedness between E. coli K12 derivatives (green block) and group B isolates ...

The contribution of every genome to the complete pan-genome of E. coli and related organisms is demonstrated in Fig. 4, where the pan-genome and core genome of the analyzed sequences, as defined by other authors [40], are plotted. The number of novel gene families for every added genome is also shown. As can be seen, all genomes contribute to the increase of the pan-genome. This increase is less strong when similar genomes are added (for instance all four K12 genomes, or the B strains). The addition of Shigella spp. genomes does not alter the shape of the pan-genome curve, but addition of the other Escherichia genomes causes a sharp increase. The contribution of E. fergusonii to the pan-genome has been noted before [42].

Figure 4
Pan- and core genome plot of the analyzed genomes. The blue pan-genome curve connects the cumulative number of gene families present in the analyzed genomes. The red core genome curve connects the conserved number of gene families. The gray bars show ...

The core genome reduces in size as more genomes are added, with an expected significant drop when the shorter genomes are assessed (starting with E. coli K12 DH10B, at position 18 in Fig. 4). The core genome reaches 1,472 gene families conserved in 53 E. coli genomes, which is further reduced to 993 gene families if Shigella spp. are considered as well. The bars show how many novel gene families each genome contributes to the growing pan-genome. It should be noted that, except for singletons, the number of novel gene families reported for a genome depends on the order in which the genomes are analyzed. When novel genes are considered, instead of novel gene families, the findings can be even more dramatic. For instance, six novel E. coli genome sequences identified approximately 10,000 novel genes [42]. Previous work has estimated a core genome of 1,976 genes for 20 E. coli genomes and a pan-genome of 17,831 genes. Our analysis of 53 E. coli genomes identified 1,472 conserved gene families and 13,296 gene families comprising the pan-genome. We prefer to report these findings as gene families, instead of individual genes, using clearly defined criteria for inclusion of genes into a gene family (described in the “Materials and Methods” section).

Where are all these variable genes located in a genome? Gene order is not strongly conserved between the analyzed genomes, so that gene location depends on which genome is considered. Nevertheless, visualizing where a gene whose presence can vary is located on a single reference genome provides further information, and this can be done in a BLAST atlas [14]. In the BLAST atlas of Fig. 5, it becomes apparent that the variable gene content is not evenly distributed over the reference genome, but appears to be concentrated in various islands. The chromosome of E. coli O157:H7 EC4115 was chosen as the reference, as it is the largest chromosome for which a complete sequence is currently available. Around this, all other genomes are plotted, whereby lack of color indicates that a particular gene from EC4115 is missing in the genome shown. The strong conservation of gene presence within the O157 serotype (in green) contrasts with the multiple 'gaps' seen in the other lanes. Every gap represents multiple genes in strain EC4115, illustrating that gene variation is not evenly distributed along the genome, but located in islands.

Figure 5
BLAST atlas. In the middle, a genome atlas of E. coli O157:H7 strain EC4115 is shown, around which BLAST lanes are shown. Every lane corresponds to a genome, with the following colors (going outwards): green E. coli O157:H7 (15 lanes); light blue E. coli ...

Concluding Remarks

“This gene is not found in E. coli” is an expression often heard in discussions about novel genes in various organisms, and when people are looking for functional matches in databases. It is sobering to realize that any given sequenced E. coli genome will have only roughly 20% of its genes in the E. coli core, while the remaining 80% are not found in all other E. coli genomes. After comparing the diversity across many sequenced E. coli genomes, it has become clear that such a statement can only be valid when it is specified which E. coli genome sequence has been searched. Of the predicted pan-genome comprising about 16,000 gene families, the core (slightly less than a thousand gene families) is found to be only about a fifth of a typical E. coli genome, which contains around 5,000 genes. Many of the accessory or variable genes, making up more than 90% of the pan-genome and roughly four fifths of a typical genome, are found co-localized on genomic islands. The diversity within the species E. coli, and the overlap in gene content between this and related species, is far greater than many had anticipated, and represents a broad set of functions for adapting to many different environments. The comparative methods used here are generally applicable to genomes of related species, and are considered a valuable tool to evaluate current insights into species relatedness and evolutionary history.
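
The proportions quoted in this section follow directly from the counts reported earlier. The short calculation below is a worked check using only numbers from the text; treating a typical genome as 5,000 genes is the approximation given above, and the variable names are illustrative.

```python
PAN_FAMILIES = 15741      # gene families in the predicted pan-genome
CORE_FAMILIES = 993       # families present in every genome
TYPICAL_GENOME = 5000     # approximate gene count of a typical E. coli genome

core_share_of_pan = 100 * CORE_FAMILIES / PAN_FAMILIES
core_share_of_genome = 100 * CORE_FAMILIES / TYPICAL_GENOME
variable_share_of_pan = 100 - core_share_of_pan
variable_share_of_genome = 100 - core_share_of_genome

print(f"core / pan-genome:         {core_share_of_pan:.1f}%")
print(f"core / typical genome:     {core_share_of_genome:.1f}%")
print(f"variable / pan-genome:     {variable_share_of_pan:.1f}%")
print(f"variable / typical genome: {variable_share_of_genome:.1f}%")
```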


We would like to thank the Danish Research Councils and the DTU Globalization funds for financial support.

Open Access This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.


1. Anjum MF, Lucchini S, Thompson A, Hinton JCD, Woodward MJ. Comparative genomic indexing reveals the phylogenomics of Escherichia coli pathogens. Infect Immun. 2003;71:4674–4683. doi: 10.1128/IAI.71.8.4674-4683.2003. [PMC free article] [PubMed] [Cross Ref]
2. Blattner FR, Plunkett G3, Bloch CA, Perna NT, Burland V, Riley M, Collado-Vides J, Glasner JD, Rode CK, Mayhew GF, Gregor J, Davis NW, Kirkpatrick HA, Goeden MA, Rose DJ, Mau B, Shao Y. The complete genome sequence of Escherichia coli K-12. Science. 1997;277:1453–1462. doi: 10.1126/science.277.5331.1453. [PubMed] [Cross Ref]
3. Brenner DJ, Fanning GR, Skerman FJ, Falkow S. Polynucleotide sequence divergence among strains of Escherichia coli and closely related organisms. J. Bact. 1972;109:953–965. [PMC free article] [PubMed]
4. Chain PS, Grafham DV, Fulton RS, Fitzgerald MG, Hostetler J, Muzny D, Ali J, Birren B, Bruce DC, Buhay C, Cole JR, Ding Y, Dugan S, Field D, Garrity GM, Gibbs R, Graves T, Han CS, Harrison SH, Highlander S, Hugenholtz P, Khouri HM, Kodira CD, Kolker E, Kyrpides NC, Lang D, Lapidus A, Malfatti SA, Markowitz V, Metha T, Nelson KE, Parkhill J, Pitluck S, Qin X, Read TD, Schmutz J, Sozhamannan S, Sterk P, Strausberg RL, Sutton G, Thomson NR, Tiedje JM, Weinstock G, Wollam A, Genomic Standards Consortium Human Microbiome Project Jumpstart Consortium. Detter JC. Genomics. Genome project standards in a new era of sequencing. Science. 2009;326:236–237. doi: 10.1126/science.1180614. [PMC free article] [PubMed] [Cross Ref]
5. Chen SL, Hung C, Xu J, Reigstad CS, Magrini V, Sabo A, Blasiar D, Bieri T, Meyer RR, Ozersky P, Armstrong JR, Fulton RS, Latreille JP, Spieth J, Hooton TM, Mardis ER, Hultgren SJ, Gordon JI. Identification of genes subject to positive selection in uropathogenic strains of Escherichia coli: a comparative genomics approach. Proc Natl Acad Sc USA. 2006;103:5977–5982. doi: 10.1073/pnas.0600938103. [PubMed] [Cross Ref]
6. Cilia V, Lafay B, Christen R. Sequence heterogeneities among 16S ribosomal RNA sequences, and their effect on phylogenetic analyses at the species level. Mol. Biol. Evol. 1996;13:451–461. [PubMed]
7. Dobrindt U, Blum-Oehler G, Nagy G, Schneider G, Johann A, Gottschalk G, Hacker J. Genetic structure and distribution of four pathogenicity islands (PAI I(536) to PAI IV(536)) of uropathogenic Escherichia coli strain 536. Infect Immun. 2002;70:6365–6372. doi: 10.1128/IAI.70.11.6365-6372.2002. [PMC free article] [PubMed] [Cross Ref]
8. Durfee T, Nelson R, Baldwin S, Plunkett G3, Burland V, Mau B, Petrosino JF, Qin X, Muzny DM, Ayele M, Gibbs RA, Csörgo B, Pósfai G, Weinstock GM, Blattner FR. The complete genome sequence of Escherichia coli DH10B: insights into the biology of a laboratory workhorse. J Bacteriol. 2008;190:2597–2606. doi: 10.1128/JB.01695-07. [PMC free article] [PubMed] [Cross Ref]
9. Feng PCH, Monday SR, Lacher DW, Allison L, Siitonen A, Keys C, Eklund M, Nagano H, Karch H, Keen J, Whittam TS. Genetic diversity among clonal lineages within Escherichia coli O157:H7 stepwise evolutionary model. Emerging Infect Dis. 2007;13:1701–1706. [PMC free article] [PubMed]
10. Ferenci T, Zhou Z, Betteridge T, Ren Y, Liu Y, Feng L, Reeves PR, Wang L. Genomic sequencing reveals regulatory mutations and recombinational events in the widely used MC4100 lineage of Escherichia coli K-12. J Bacteriol. 2009;191:4025–4029. doi: 10.1128/JB.00118-09. [PMC free article] [PubMed] [Cross Ref]
11. Fricke WF, Wright MS, Lindell AH, Harkins DM, Baker-Austin C, Ravel J, Stepanauskas R. Insights into the environmental resistance gene pool from the genome sequence of the multidrug-resistant environmental isolate Escherichia coli SMS-3-5. J Bacteriol. 2008;190:6779–6794. doi: 10.1128/JB.00661-08. [PMC free article] [PubMed] [Cross Ref]
12. Friis C, Wassenaar TM, Javed MA, Snipen L, Lagersen K, Hallin PF, Newell DG, Manning G, Ussery DW (Submitted for publication) Genomic characterization of Campylobacter jejuni M1 [PMC free article] [PubMed]
13. Fukiya S, Mizoguchi H, Tobe T, Mori H. Extensive genomic diversity in pathogenic Escherichia coli and Shigella strains revealed by comparative genomic hybridization microarray. J Bacteriol. 2004;186:3911–3921. doi: 10.1128/JB.186.12.3911-3921.2004. [PMC free article] [PubMed] [Cross Ref]
14. Hallin PF, Binnewies TT, Ussery DW. The genome BLASTatlas-a GeneWiz extension for visualization of whole-genome homology. Mol Biosyst. 2008;4:363–371. doi: 10.1039/b717118h. [PubMed] [Cross Ref]
15. Hayashi T, Makino K, Ohnishi M, Kurokawa K, Ishii K, Yokoyama K, Han CG, Ohtsubo E, Nakayama K, Murata T, Tanaka M, Tobe T, Iida T, Takami H, Honda T, Sasakawa C, Ogasawara N, Yasunaga T, Kuhara S, Shiba T, Hattori M, Shinagawa H. Complete genome sequence of enterohemorrhagic Escherichia coli O157:H7 and genomic comparison with a laboratory strain K-12. DNA Res. 2001;8:11–22. doi: 10.1093/dnares/8.1.11. [PubMed] [Cross Ref]
16. Hayashi K, Morooka N, Yamamoto Y, Fujita K, Isono K, Choi S, Ohtsubo E, Baba T, Wanner BL, Mori H, Horiuchi T. Highly accurate genome sequences of Escherichia coli K-12 strains MG1655 and W3110. Mol Syst Biol. 2006;2:2006.0007. doi: 10.1038/msb4100049. [PMC free article] [PubMed] [Cross Ref]
17. Iguchi A, Thomson NR, Ogura Y, Saunders D, Ooka T, Henderson IR, Harris D, Asadulghani M, Kurokawa K, Dean P, Kenny B, Quail MA, Thurston S, Dougan G, Hayashi T, Parkhill J, Frankel G. Complete genome sequence and comparative genome analysis of enteropathogenic Escherichia coli O127:H6 strain E2348/69. J Bacteriol. 2009;191:347–354. doi: 10.1128/JB.01238-08. [PMC free article] [PubMed] [Cross Ref]
18. Itoh Y, Nagano I, Kunishima M, Ezaki T. Laboratory investigation of enteroaggregative Escherichia coli O untypeable:H10 associated with a massive outbreak of gastrointestinal illness. J Clin Microbiol. 1997;35:2546–2550. [PMC free article] [PubMed]
19. Jeong H, Barbe V, Lee CH, Vallenet D, Yu DS, Choi S, Couloux A, Lee S, Yoon SH, Cattolico L, Hur C, Park H, Ségurens B, Kim SC, Oh TK, Lenski RE, Studier FW, Daegelen P, Kim JF. Genome sequences of Escherichia coli B strains REl606 and Bl21(DE3) J Mol Biol. 2009;394:644–652. doi: 10.1016/j.jmb.2009.09.052. [PubMed] [Cross Ref]
20. Jin Q, Yuan Z, Xu J, Wang Y, Shen Y, Lu W, Wang J, Liu H, Yang J, Yang F, Zhang X, Zhang J, Yang G, Wu H, Qu D, Dong J, Sun L, Xue Y, Zhao A, Gao Y, Zhu J, Kan B, Ding K, Chen S, Cheng H, Yao Z, He B, Chen R, Ma D, Qiang B, Wen Y, Hou Y, Yu J. Genome sequence of Shigella flexneri 2a: insights into pathogenicity through comparison with genomes of Escherichia coli K12 and O157. Nucleic Acids Res. 2002;30:4432–4441. doi: 10.1093/nar/gkf566. [PMC free article] [PubMed] [Cross Ref]
21. Johnson TJ, Kariyawasam S, Wannemuehler Y, Mangiamele P, Johnson SJ, Doetkott C, Skyberg JA, Lynne AM, Johnson JR, Nolan LK. The genome sequence of avian pathogenic Escherichia coli strain O1:K1:H7 shares strong similarities with human extraintestinal pathogenic E. coli genomes. J Bacteriol. 2007;189:3228–3236. doi: 10.1128/JB.01726-06. [PMC free article] [PubMed] [Cross Ref]
22. Lagesen K, Hallin P, Rødland EA, Staerfeldt H, Rognes T, Ussery DW. Rnammer: consistent and rapid annotation of ribosomal rna genes. Nucleic Acids Res. 2007;35:3100–3108. doi: 10.1093/nar/gkm160. [PMC free article] [PubMed] [Cross Ref]
23. Lan R, Alles MC, Donohoe K, Martinez MB, Reeves PR. Molecular evolutionary relationships of enteroinvasive Escherichia coli and Shigella spp. Infect Immun. 2004;72:5080–5088. doi: 10.1128/IAI.72.9.5080-5088.2004. [PMC free article] [PubMed] [Cross Ref]
24. Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG. Clustal W and Clustal X version 2.0. Bioinformatics. 2007;23:2947–2948. doi: 10.1093/bioinformatics/btm404. [PubMed] [Cross Ref]
25. Luria SE, Burrous JW. Hybridization between Escherichia coli and Shigella. J. Bacteriology. 1957;74:461–476. doi: 10.1002/path.1700740226. [PMC free article] [PubMed] [Cross Ref]
26. Maiden MC, Bygraves JA, Feil E, Morelli G, Russell JE, Urwin R, Zhang Q, Zhou J, Zurth K, Caugant DA, Feavers IM, Achtman M, Spratt BG. Multilocus sequence typing: a portable approach to the identification of clones within populations of pathogenic microorganisms. Proc Natl Acad Sci USA. 1998;95:3140–3145. doi: 10.1073/pnas.95.6.3140. [PubMed] [Cross Ref]
27. McClelland M, Sanderson KE, Spieth J, Clifton SW, Latreille P, Courtney L, Porwollik S, Ali J, Dante M, Du F, Hou S, Layman D, Leonard S, Nguyen C, Scott K, Holmes A, Grewal N, Mulvaney E, Ryan E, Sun H, Florea L, Miller W, Stoneking T, Nhan M, Waterston R, Wilson RK. Complete genome sequence of Salmonella enterica serovar Typhimurium LT2. Nature. 2001;413:852–856. doi: 10.1038/35101614. [PubMed] [Cross Ref]
28. Moura RA, Sircili MP, Leomil L, Matté MH, Trabulsi LR, Elias WP, Irino K, Pestana de Castro AF. Clonal relationship among atypical enteropathogenic Escherichia coli strains isolated from different animal species and humans. Appl Environ Microbiol. 2009;75:7399–7408. doi: 10.1128/AEM.00636-09. [PMC free article] [PubMed] [Cross Ref]
29. Nie H, Yang F, Zhang X, Yang J, Chen L, Wang J, Xiong Z, Peng J, Sun L, Dong J, Xue Y, Xu X, Chen S, Yao Z, Shen Y, Jin Q. Complete genome sequence of Shigella flexneri 5b and comparison with Shigella flexneri 2a. BMC Genomics. 2006;7:173. doi: 10.1186/1471-2164-7-173. [PMC free article] [PubMed] [Cross Ref]
30. Nielsen P, Krogh A. Large-scale prokaryotic gene prediction and comparison to genome annotation. Bioinformatics. 2005;21:4322–4329. doi: 10.1093/bioinformatics/bti701. [PubMed] [Cross Ref]
31. Ogura Y, Ooka T, Iguchi A, Toh H, Asadulghani M, Oshima K, Kodama T, Abe H, Nakayama K, Kurokawa K, Tobe T, Hattori M, Hayashi T. Comparative genomics reveal the mechanism of the parallel evolution of O157 and non-O157 enterohemorrhagic Escherichia coli. Proc Natl Acad Sci USA. 2009;106:17939–17944. doi: 10.1073/pnas.0903585106. [PubMed] [Cross Ref]
32. Oshima K, Toh H, Ogura Y, Sasamoto H, Morita H, Park S, Ooka T, Iyoda S, Taylor TD, Hayashi T, Itoh K, Hattori M. Complete genome sequence and comparative analysis of the wild-type commensal Escherichia coli strain SE11 isolated from a healthy adult. DNA Res. 2008;15:375–386. doi: 10.1093/dnares/dsn026. [PMC free article] [PubMed] [Cross Ref]
33. Perna NT, Plunkett G3, Burland V, Mau B, Glasner JD, Rose DJ, Mayhew GF, Evans PS, Gregor J, Kirkpatrick HA, Pósfai G, Hackett J, Klink S, Boutin A, Shao Y, Miller L, Grotbeck EJ, Davis NW, Lim A, Dimalanta ET, Potamousis KD, Apodaca J, Anantharaman TS, Lin J, Yen G, Schwartz DC, Welch RA, Blattner FR. Genome sequence of enterohaemorrhagic Escherichia coli O157:H7. Nature. 2001;409:529–533. doi: 10.1038/35054089. [PubMed] [Cross Ref]
34. Perrière G, Gouy M. WWW-query: an on-line retrieval system for biological sequence banks. Biochimie. 1996;78:364–369. doi: 10.1016/0300-9084(96)84768-7. [PubMed] [Cross Ref]
35. Pupo GM, Karaolis DK, Lan R, Reeves PR. Evolutionary relationships among pathogenic and nonpathogenic Escherichia coli strains inferred from multilocus enzyme electrophoresis and mdh sequence studies. Infect Immun. 1997;65:2685–2692. [PMC free article] [PubMed]
36. Pupo GM, Lan R, Reeves PR. Multiple independent origins of Shigella clones of Escherichia coli and convergent evolution of many of their characteristics. Proc Natl Acad Sci USA. 2000;97:10567–10572. doi: 10.1073/pnas.180094797. [PubMed] [Cross Ref]
37. Rasko DA, Rosovitz MJ, Myers GSA, Mongodin EF, Fricke WF, Gajer P, Crabtree J, Sebaihia M, Thomson NR, Chaudhuri R, Henderson IR, Sperandio V, Ravel J. The pan-genome structure of Escherichia coli: comparative genomic analysis of E. coli commensal and pathogenic isolates. J Bacteriol. 2008;190:6881–6893. doi: 10.1128/JB.00619-08. [PMC free article] [PubMed] [Cross Ref]
38. Snipen L, Ussery DW. Standard operating procedure for comparing pan-genome trees. Standards Genomic Sciences. 2010;2:135–141. doi: 10.4056/sigs.38923. [PMC free article] [PubMed] [Cross Ref]
39. Strockbine NA, Maurelli AT. Genus XXXV. Shigella. In: Bergey’s manual of systematic bacteriology. 2nd ed. New York: Springer Publishing Company; 2005. p. 812.
40. Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, Angiuoli SV, Crabtree J, Jones AL, Durkin AS, Deboy RT, Davidsen TM, Mora M, Scarselli M, Margarit y Ros I, Peterson JD, Hauser CR, Sundaram JP, Nelson WC, Madupu R, Brinkac LM, Dodson RJ, Rosovitz MJ, Sullivan SA, Daugherty SC, Haft DH, Selengut J, Gwinn ML, Zhou L, Zafar N, Khouri H, Radune D, Dimitrov G, Watkins K, O'Connor KJB, Smith S, Utterback TR, White O, Rubens CE, Grandi G, Madoff LC, Kasper DL, Telford JL, Wessels MR, Rappuoli R, Fraser CM. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proc Natl Acad Sci USA. 2005;102:13950–13955. doi: 10.1073/pnas.0506758102. [PubMed] [Cross Ref]
41. Toh H, Oshima K, Toyoda A, Ogura Y, Ooka T, Sasamoto H, Park S, Iyoda S, Kurokawa K, Morita H, Itoh K, Taylor TD, Hayashi T, Hattori M. Complete genome sequence of the wild-type commensal Escherichia coli strain SE15 belonging to phylogenetic group B2. J Bacteriol. 2010;192:1165–1166. doi: 10.1128/JB.01543-09. [PMC free article] [PubMed] [Cross Ref]
42. Touchon M, Hoede C, Tenaillon O, Barbe V, Baeriswyl S, Bidet P, Bingen E, Bonacorsi S, Bouchier C, Bouvet O, Calteau A, Chiapello H, Clermont O, Cruveiller S, Danchin A, Diard M, Dossat C, Karoui ME, Frapy E, Garry L, Ghigo JM, Gilles AM, Johnson J, Bouguénec C, Lescat M, Mangenot S, Martinez-Jéhanne V, Matic I, Nassif X, Oztas S, Petit MA, Pichon C, Rouy Z, Ruf CS, Schneider D, Tourret J, Vacherie B, Vallenet D, Médigue C, Rocha EPC, Denamur E. Organised genome dynamics in the Escherichia coli species results in highly diverse adaptive paths. PLoS Genet. 2009;5:e1000344. doi: 10.1371/journal.pgen.1000344. [PMC free article] [PubMed] [Cross Ref]
43. Wei J, Goldberg MB, Burland V, Venkatesan MM, Deng W, Fournier G, Mayhew GF, Plunkett G3, Rose DJ, Darling A, Mau B, Perna NT, Payne SM, Runyen-Janecky LJ, Zhou S, Schwartz DC, Blattner FR. Complete genome sequence and comparative genomics of Shigella flexneri serotype 2a strain 2457T. Infect Immun. 2003;71:2775–2786. doi: 10.1128/IAI.71.5.2775-2786.2003. [PMC free article] [PubMed] [Cross Ref]
44. Welch RA, Burland V, Plunkett G3, Redford P, Roesch P, Rasko D, Buckles EL, Liou S, Boutin A, Hackett J, Stroud D, Mayhew GF, Rose DJ, Zhou S, Schwartz DC, Perna NT, Mobley HLT, Donnenberg MS, Blattner FR. Extensive mosaic structure revealed by the complete genome sequence of uropathogenic Escherichia coli. Proc Natl Acad Sci USA. 2002;99:17020–17024. doi: 10.1073/pnas.252529799. [PubMed] [Cross Ref]
45. Woo PCY, Lau SKP, Teng JLL, Tse H, Yuen K. Then and now: use of 16S rDNA gene sequencing for bacterial identification and discovery of novel bacteria in clinical microbiology laboratories. Clin Microbiol Infect. 2008;14:908–934. doi: 10.1111/j.1469-0691.2008.02070.x. [PubMed] [Cross Ref]
46. Yang F, Yang J, Zhang X, Chen L, Jiang Y, Yan Y, Tang X, Wang J, Xiong Z, Dong J, Xue Y, Zhu Y, Xu X, Sun L, Chen S, Nie H, Peng J, Xu J, Wang Y, Yuan Z, Wen Y, Yao Z, Shen Y, Qiang B, Hou Y, Yu J, Jin Q. Genome dynamics and diversity of Shigella species, the etiologic agents of bacillary dysentery. Nucleic Acids Res. 2005;33:6445–6458. doi: 10.1093/nar/gki954. [PMC free article] [PubMed] [Cross Ref]
47. Zimmer C. Microcosm: E. coli and the new science of life. New York: Pantheon Books; 2008.
