With the goal of solving the whole-cell problem with Escherichia coli K-12 as a model cell, highly accurate genomes were determined for two closely related K-12 strains, MG1655 and W3110. Completion of the W3110 genome and comparison with the MG1655 genome revealed differences at 267 sites, including 251 sites with short, mostly single-nucleotide, insertions or deletions (indels) or base substitutions (totaling 358 nucleotides), in addition to 13 sites with an insertion sequence element or defective prophage in only one strain and two sites for the W3110 inversion. Direct DNA sequencing of PCR products for the 251 regions with short indel and base disparities revealed that only eight sites are true differences. The other 243 discrepancies were due to errors in the original MG1655 sequence, including 79 frameshifts, one amino-acid residue deletion, five amino-acid residue insertions, 73 missense, and 17 silent changes within coding regions. Errors in the original MG1655 sequence (<1 per 13 000 bases) were mostly within portions sequenced with out-dated technology based on radioactive chemistry.
From the dawn of modern biology, the intestinal bacterium Escherichia coli has been the most intensively studied organism. Many basic molecular processes, best understood in E. coli, are universal throughout the natural world. The wealth of information on E. coli makes it an ideal test bed for pushing forward the limits of our ability to understand a cell through computational modeling (Wanner et al, 2005). As a first step of an E. coli systems biology project in Japan (Mori, 2004), we undertook to determine highly accurate E. coli K-12 genomes, which are key for precisely defining the cell parts.
We present back-to-back manuscripts on more accurate E. coli K-12 genomes (this paper) and new resources (Baba et al, 2006) of value for both basic biology and systems-level research on E. coli K-12. A key tenet of postgenomic science is that the cell parts must be accurately appraised. Here, we describe determination of highly accurate genome sequences of two common ‘wild-type’ K-12 strains. Knowledge of E. coli gene sequences, products, and functions is of value not only to E. coli cell biologists but also to others who rely on E. coli information for understanding of processes in diverse cells having conserved genes, proteins, RNAs, or motifs. Elsewhere, we describe a community effort for re-annotation of these more accurate genomes (Riley et al, 2006). Postgenomic sciences can be accelerated by development and sharing of biological resources. In the accompanying paper, we describe construction of mutants that have in-frame, single-gene knockouts of nearly all nonessential E. coli protein-encoding genes (Baba et al, 2006) by use of a now standard method for direct modification of chromosomal genes (Datsenko and Wanner, 2000).
The complete E. coli K-12 genome was among the first targets for systematic whole-genome sequencing. From 1989 to 1997, projects led by T Yura and A Ishihama, by K Mizobuchi, and by T Horiuchi and H Mori in Japan and by F Blattner, by G Church, and by R Davis in the USA reported many long continuous sequence segments (contigs) of the E. coli K-12 genome (Daniels et al, 1992; Yura et al, 1992; Burland et al, 1993, 1995; Plunkett et al, 1993; Fujita et al, 1994; Sofia et al, 1994; Aiba et al, 1996; Itoh et al, 1996; Oshima et al, 1996; Yamamoto et al, 1997). Long contigs from the Church, Davis, and Mizobuchi projects were also deposited to GenBank™ or DNA Data Bank of Japan (DDBJ) over this period, but these results remain unpublished. The complete E. coli genome sequence (Blattner et al, 1997) has provided a wealth of information on the gene products, gene organization, and chromosome structure.
All groups had chosen E. coli K-12 for whole-genome sequencing because more was known about it than about any other organism. The ancestral strain had been isolated from the stool of a convalescent diphtheria patient in 1922 and given the designation ‘K-12’ when deposited in a strain collection at Stanford in 1925 (Bachmann, 1996). In the early 1940s, EL Tatum, who was then at Stanford, acquired E. coli K-12. Because it was prototrophic, easy to grow in a defined medium, and had a short generation time, he used it in his seminal studies of biochemical genetics (Tatum, 1959). In 1946, J Lederberg and EL Tatum demonstrated sexual recombination in E. coli K-12 (Lederberg and Tatum, 1946), a property requiring the F+ ‘fertility factor’, which was later found to be rare among E. coli isolates from nature. Mating occurred between different K-12 derivatives because particular descendants had lost the F+ factor, which otherwise leads to incompatibility. In 1950, E Lederberg reported that the original Lederberg and Tatum K-12 strain was lysogenic for phage λ (Lederberg, 1950). Derivatives that had lost λ acted as sensitive hosts for λ released from lysogenic E. coli K-12 (Lederberg and Lederberg, 1953). Shortly thereafter, phage P1 (Bertani, 1951) was shown to carry out generalized transduction in E. coli (Lennox, 1955). Largely because of these early studies, E. coli K-12 became the primary source of basic information on innumerable biochemical and molecular processes over the past 60 years.
Owing to its widespread use, a huge number of E. coli K-12 derivatives now exist (Bachmann, 1996). In an effort to get away from the early heavily mutagenized Stanford strains, E. coli K-12 W3110 (λ−, F−) was extensively used as an ancestral stock (Bachmann, 1972). The first physical map of the whole E. coli chromosome was created using a W3110 genomic library (Kohara et al, 1987). Subsequently, groups in Japan chose W3110 for whole-genome sequencing (Yura et al, 1992), while the Blattner group chose MG1655 (Guyer et al, 1981), which is more closely related to ancestral E. coli K-12 (EMG2 or WG1), except for loss of the F+ factor and λ prophage (Figure 1).
Determination of the complete W3110 genome and comparison with that of MG1655 (GenBank™ U00096, 1998 submission) revealed differences at 282 locations. These included 13 sites where an insertion sequence (IS) or defective phage exists in only one strain, two sites due to the W3110 inversion (Hill and Harnish, 1981), and 267 sites with sequence conflicts (Figure 2). To determine how many of the latter are true differences, these regions were PCR amplified from both strains and directly sequenced. Only eight are true differences. In all, 16 of the 267 sites with conflicts were due to errors in the W3110 sequence. These differences (totaling 17 nucleotides (nt); Supplementary Table 1) were due to errors in cloning (5 nt), sequencing (6 nt), or assembly (6 nt).
The remaining 243 sites (totaling 358 nt; Supplementary Table 2) were errors in the original MG1655 GenBank™ deposit. These included 104 sites with 1-, 2-, or multiple (short) nt substitutions and 134 sites with 1-, 2-, or 4-nt indels (Table I). The resequenced MG1655 segments were deposited to DDBJ in January 2004 (Accession numbers AG613214–AG613378) and incorporated into a new MG1655 GenBank™ release (U00096.2; June 2004 version).
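The bookkeeping behind these counts can be checked with a few lines of arithmetic. The sketch below is only an illustration: the numbers are those reported in the text, while the variable and function names are ours and do not correspond to any analysis software used in this work.

```python
# Site counts as reported in the text; all names here are illustrative.
IS_OR_PHAGE_SITES = 13      # IS element or defective phage in only one strain
INVERSION_SITES = 2         # endpoints of the W3110 inversion
TRUE_DIFFERENCES = 8        # confirmed by direct sequencing of PCR products
W3110_ERROR_SITES = 16      # errors in the W3110 sequence
MG1655_ERROR_SITES = 243    # errors in the original MG1655 deposit

# Nucleotides affected by W3110 errors, grouped by the source of the error.
W3110_ERROR_NT = {"cloning": 5, "sequencing": 6, "assembly": 6}


def conflict_sites() -> int:
    """Sites with short sequence conflicts between the two genomes."""
    return TRUE_DIFFERENCES + W3110_ERROR_SITES + MG1655_ERROR_SITES


def total_locations() -> int:
    """All locations at which the two genome sequences initially differed."""
    return IS_OR_PHAGE_SITES + INVERSION_SITES + conflict_sites()


if __name__ == "__main__":
    print("conflict sites:", conflict_sites())
    print("total locations:", total_locations())
    print("W3110 error nt:", sum(W3110_ERROR_NT.values()))
```

The three categories of conflict sites add up to the 267 sites with sequence conflicts, and adding the IS, phage, and inversion sites gives the 282 locations reported above.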
In total, 13 sites have an IS element or defective phage in only E. coli K-12 W3110 or MG1655 (Figure 3). Of these, 11 sites have an IS element only in W3110. One defective phage (CPZ-55) is only in MG1655. One site has an IS5 element in W3110 and an IS1 element in MG1655. Locations of all IS elements and defective phages in MG1655 and W3110 and the W3110 inversion are shown in Supplementary Figure 1.
The finding that the complete genome sequences of MG1655 and W3110 are nearly alike gives high confidence in the assembly. Resolution of discrepancies showed that the original MG1655 genome sequence was highly accurate (<1 error per 13 000 nt). Independent cloning and sequencing and reconciliation of differences have provided a pair of highly accurate E. coli K-12 genomes.
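The error rate quoted above can be reproduced approximately from the counts in this paper. The sketch below assumes a genome length of 4,639,221 bp for the original MG1655 sequence; this length is not stated in the text and is an assumption made here for illustration, as are all of the names in the code.

```python
# Assumed length of the original MG1655 genome sequence in base pairs.
# This value is not given in the text; it is an assumption for illustration.
MG1655_GENOME_LENGTH_BP = 4_639_221

# Counts reported in the text for errors in the original MG1655 deposit.
MG1655_ERROR_SITES = 243
MG1655_ERROR_NT = 358


def bases_per_error(genome_length_bp: int, error_count: int) -> float:
    """Average number of genome bases per error."""
    if error_count <= 0:
        raise ValueError("error_count must be positive")
    return genome_length_bp / error_count


if __name__ == "__main__":
    per_nt = bases_per_error(MG1655_GENOME_LENGTH_BP, MG1655_ERROR_NT)
    per_site = bases_per_error(MG1655_GENOME_LENGTH_BP, MG1655_ERROR_SITES)
    print(f"one erroneous nucleotide per {per_nt:,.0f} bases")
    print(f"one erroneous site per {per_site:,.0f} bases")
```

Under this assumption, counting erroneous nucleotides gives roughly one error per 13 000 bases, while counting erroneous sites gives roughly one per 19 000 bases.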
Most (ca. 88%) of the E. coli K-12 genome encodes proteins. As expected, the majority of the 1-, 2-, and 4-nt indel corrections (78 of 134) lie within coding regions; these 78 corrections resulted in frame shifting of 77 different open reading frames (orfs) (Table I). One multiple nt substitution changed adjacent residues; another changed the reading frame. Five indel corrections resulted in one 1-codon deletion, three 1-codon insertions, and one 2-codon insertion. Accordingly, 84 corrections dramatically alter protein coding regions by frame shifting or otherwise changing lengths of orfs. Of the 79 frameshifts, 23 resulted in fusing adjacent or overlapping orfs into a single orf, two led to fission of orfs into two, and one led to recognition of a conserved coding sequence on the opposite strand to that previously predicted, that is, an inversion with respect to the predicted coding region. Examples are illustrated elsewhere (Riley et al, 2006). Other corrections in coding regions included 73 amino-acid switches and 17 silent changes. It is more difficult to assess effects of corrections in intergenic regions (73 corrections) or RNA genes (two corrections).
E. coli K-12 W3110 has been widely used as a wild-type strain in Japan, the USA, and elsewhere since 1956. Because both MG1655 and W3110 are descendants of W1485 (Figure 1), they diverged more than 50 years ago. Yet, they have few differences. Further, only two of the 12 W3110-specific IS insertions are in common among stocks of W3110 from nine different laboratories in Japan. Two others are in the majority of these stocks. Eight are only in the Kohara stock that was used for genome sequencing (unpublished data). Because transposition of IS elements occurs in resting E. coli K-12 (Naas et al, 1994), the additional IS copies in W3110 Kohara probably arose during storage in stabs. The finding of so few differences is consistent with these strains having been stored as lyophilized or frozen cultures during much of the interim (Barratt and Tatum, 1950). Presumably, the defective CPZ-55 phage in MG1655 is in ancestral K-12 and was lost in the line leading to W3110.
The eight site (9 nt) differences between MG1655 and W3110 include seven in orfs and one in an rRNA gene (Table II). Two (rpoS and dcuA) are nonfunctional alleles in W3110. Because progenitor E. coli probably has the E33 (GAG) allele, and W3110 (like ancestral EMG2) has the Stop33 (TAG) allele (Figure 1), pseudoreversion to Q33 (CAG) apparently arose in MG1655 (Table II). Different stocks of W3110 have also been shown to carry different rpoS alleles (Jishage and Ishihama, 1997).
In addition to nonfunctional rpoS and dcuA alleles, W3110 has ISs disrupting four genes of known function (gatA for galactitol PTS enzyme II; dcuC for aerobic and anaerobic C4-dicarboxylate transporters; rcsC for a hybrid sensory kinase controlling capsule biosynthesis; and tnaB for a low-affinity tryptophan permease in the tryptophanase operon). These are likely to affect metabolism such as growth on galactitol (gatA) or succinate (dcuA and dcuC), polysaccharide biosynthesis (rcsC), or use of tryptophan as a carbon and nitrogen source (tnaB). These illustrate the breadth of phenotypic differences that can arise among isolates of a single species maintained separately for several decades.
Four true differences between MG1655 and W3110 are missense changes; one is silent. Whether the missense changes have functional consequences has not been determined. The T29K change in crp affects a surface-exposed residue not involved in interactions with cAMP, DNA, or RNA polymerase. Substitutions of this residue are likely to be neutral (RH Ebright, personal communication). MG1655 has the ancestral (EMG2 and W1485) allele for four; W3110 has the ancestral allele for three (Figure 1; Table II).
The creation of highly accurate E. coli K-12 genome sequences provided the impetus for a cooperative re-annotation of both MG1655 and W3110 (Riley et al, 2006). The complete W3110 genome with the latest annotation has the Accession number DDBJ AP009048. These highly accurate E. coli K-12 genomes were used in the design of a collection of in-frame, gene knockout mutants (the Keio collection), whose construction is described in the accompanying manuscript (Baba et al, 2006).
In all, 60% (2.6 Mb) of the E. coli K-12 W3110 genome had been previously completely determined and deposited in DDBJ (Yura et al, 1992; Aiba et al, 1996; Itoh et al, 1996; Oshima et al, 1996; Yamamoto et al, 1997). Most of the remainder, together with uncertain regions, was completely determined in this work by using a set of λ clones (Kohara et al, 1987). Initially, each chromosomal segment was amplified by long-range PCR, fragmented by sonication, cloned into an M13 vector and sequenced (Aiba et al, 1996; Itoh et al, 1996; Oshima et al, 1996; Yamamoto et al, 1997). Later, 20 continuous λ clones were separately amplified, mixed, fragmented, cloned, and sequenced, and the sequences were assembled into 100–200 kbp continuous regions. The remaining 10% was determined by insertion of two rare I-SceI restriction sites into the genome within the fadB-yicN and hflX-thrA intervals. The intervening regions were recovered by digestion, fragmented, cloned, and sequenced as described (Blattner et al, 1997). Ancestral alleles were determined by sequencing the respective PCR-amplified regions from EMG2 and its immediate descendant W1485. Chromosomal DNAs for W3110 and MG1655 were from Yuji Kohara and the National Institute of Genetics (Shizuoka, Japan), respectively. Strains EMG2 and W1485F+ were from Mary Berlyn. Automated sequencing was carried out with an ABI 3100 sequencer.
Supplementary Table 1
Supplementary Figure 1
Supplementary Table 2
We thank Yuji Kohara for strain W3110, Yukiko Yamazaki for sequence analysis, Naomi Ishine, Masami Inagaki, Kayo Shirai, and Mineko Shimizu for technical assistance, Nicole Perna and Guy Plunkett III for helpful discussions and sharing unpublished data, Mary Berlyn for information on K-12 pedigrees, and our many collaborators for helpful discussions at the E. coli re-annotation meetings. This work was supported by CREST, JST (Japan Science and Technology) to TH and HM. BLW is supported by NIH GM62662.