Complete genome sequences for the two strains of E.coli K-12 allow comparison of the current sequence for the MG1655 genome with the original 1997 version. They also allow comparison of the gene content of two K-12 strains that have had different histories since their isolation in the early 1950s in the laboratory of Joshua and Esther Lederberg at the University of Wisconsin. Their common ancestor was an isolate of the original K-12 cured of lambda and F. MG1655 was in storage for most of the time before it was sequenced in 1997; cultures were maintained variously lyophilized, frozen and on stab. In contrast, W3110 would have undergone many more generations over this period, as it was used actively for research and passed from laboratory to laboratory. For more detail on these histories see Ref. (4
Inspection of the two genomic sequences and consultation resulted in changes to many designated start codons, which led to the elimination of some old genes and the formation of some new ones. Compared with the content of GenBank™ entry U00096.2, there were 682 changes in start codon assignments of previously identified genes, 31 old genes were eliminated (Supplementary Table 6) and 66 new ones, mostly of unknown function, were recognized (48 CDSs, 17 pseudogenes and 1 RNA). Even small differences in start sites affect important matters such as the design of probes for microarray experiments, the measurement of distances to upstream regulatory elements, and the design of primers for gene amplification and gene deletion, e.g. as used for the construction of a complete set of E.coli K-12 in-frame, single-gene knockout mutants (38
Several corrections had dramatic consequences for one or more genes. In addition to changing the reading frame, 2 frameshift corrections led to fissions (Figure 1A and B), 23 led to fusions of adjacent or overlapping ORFs into single proteins, as shown for hdfR in Figure 1C, and 1 led to an inversion, i.e. recognition of a conserved protein encoded on the opposite strand (Figure 1). Other corrections led to missense changes (62), were silent (17), or fell in intergenic regions (73) or RNA genes (2).
Figure 1 Gene fissions, fusions and an inversion resulting from 1 nt indel corrections. Of 78 frameshift corrections, two 1 nt indels led to fissions (splitting) of genes (A and B), 23 resulted in gene fusions, similar to the example in (C), and 1 led to an inversion.
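The effect of a 1 nt indel correction on a reading frame can be illustrated with a toy example. The sketch below is only an illustration under stated assumptions: both sequences are invented, the function name `orf_codons` is hypothetical, and the logic is a naive single-frame scan, not the procedure used for the annotation. It shows how restoring one missing base removes a spurious in-frame stop codon and extends the ORF, which is the kind of change that underlies a gene fusion.

```python
STOPS = {"TAA", "TAG", "TGA"}


def orf_codons(seq: str) -> list:
    """Return codons from the first ATG up to (not including) the first
    in-frame stop codon. If no stop is found, codons run to the end."""
    start = seq.find("ATG")
    if start < 0:
        return []
    codons = []
    for i in range(start, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in STOPS:
            break
        codons.append(codon)
    return codons


# Invented sequence with one base missing (a 1 nt deletion error).
uncorrected = "ATGAAAGCCTGACCGAATAA"
# Hypothetical correction: restore the missing G after position 7.
corrected = uncorrected[:7] + "G" + uncorrected[7:]

short_orf = orf_codons(uncorrected)  # stops early at an in-frame TGA
long_orf = orf_codons(corrected)     # reads through to the final TAA

print(len(short_orf), short_orf)
print(len(long_orf), long_orf)
```

In the uncorrected sequence the scan ends after three codons; after the correction it continues for six codons, so sequence that previously lay downstream of the apparent stop becomes part of the same ORF.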
Inspecting and comparing the sequence data for both MG1655 and W3110, we can ask how the genomes of these isolates, both K-12 strains, differ. Owing to extra copies of IS elements, there are 17 more genes for IS element proteins (more IS1, IS2 and IS5 genes) in W3110 than in MG1655. Conversely, there are 11 genes in MG1655, 9 encoding the CPZ-55 prophage and 2 encoding IS1 proteins, that are absent in W3110 (see the table ‘Genes not common to strains MG1655 and W3110’). As both IS elements and temperate phages are horizontally transmitted genetic elements, differences of this kind in the two E.coli genomes are not unexpected. The presence of more IS elements in W3110 could reflect its role as a prime experimental strain that has experienced more exposure and more generations than has MG1655.
Genes not common to strains MG1655 and W3110
Predicting pseudogenes requires an extremely high level of nucleotide accuracy. Pseudogenes are caused by frameshifts, in-frame stops, or insertions or deletions which divide a gene into fragments. Most of the pseudogenes are broken into two fragments (18 ancestral genes that are now 36 pseudogene fragments), a few are broken into three fragments (3 ancestral genes that are now 9 pseudogene fragments) and several exist as single fragments (41) in both strains. In addition, W3110 contains six genes with an IS insertion resulting in either split genes (four ancestral genes that are now eight pseudogenes) or truncated genes (two pseudogenes). Hence the number of pseudogenes differs between the two strains.
We do not know the full phenotypic consequences of the genetic differences between the two K-12 isolates. The functions of the four genes that are pseudogenes in W3110, split by insertion, are known. These are genes for the galactitol PTS enzyme II (GatA), aerobic and anaerobic C4-dicarboxylate transporters (DcuC), a hybrid sensory kinase (RcsC), and a low-affinity tryptophan permease in the tryptophanase operon (TnaB). Each of these may affect metabolism; for instance, growth on galactitol or succinate would be affected unless redundant systems are present. The use of tryptophan as a carbon and nitrogen source may also be affected. These testable characteristics illustrate the breadth of phenotypic difference possible between isolates of one strain of a single bacterial species maintained separately for several decades.
With the updated annotation in hand, we can ask how much we have learned about the biology of the E.coli cell in terms of the functions of its gene products. How many genes encode enzymes, and how many encode a transporter function or regulator function, or have other cellular roles? Surveying the content that the two genomes have in common (4453 genes), we list the numbers of gene products of different types in our snapshot in the accompanying table.
Numbers and types of known and predicted gene products of E.coli K-12
Comparing the number of genes in the table above with earlier counts, we find that in 1993, before the genome sequence was known, only 1700 genes were listed (41). Upon completion of the genome sequence in 1997, the number of MG1655 genes was 4289 (3), a number that is close to today's total of 4464 (the 4453 genes in common plus the 11 MG1655-specific genes listed in the table of genes not common to the two strains). The increase is due in large part to identifying small proteins and small RNAs (42
We looked at the proportions of types of molecular functions of the genes and compared these values with assessments of the same kind collected at earlier stages of knowledge of the genome. One needs to be aware that gene products can serve more than one cell role, thus choosing to identify a gene with a single category is sometimes arbitrary and can shift between assessments. In spite of this potential variability, we see that over a period of 12 years the proportion of enzymes, transporters, regulators and undefined membrane proteins has remained remarkably stable at ~33, 13, 9 and 6%, respectively. The proportion of the genome occupied by phage and IS genes also has remained steady at ~7%. Changes in other categories reflect new discoveries and/or redefinitions of a role category. The category called ‘factors’, although a small category, has increased in size over 10-fold from the earliest assessment because of discoveries of new factors such as transcription and translation factors and chaperones. An increase in size of another small category, ‘carriers’, results in large part from redefining the category ‘carriers’ to include specialized electron-carrying proteins and specialized electron-carrying subunits of enzymes. We drew an arbitrary line defining cytochrome and iron–sulfur proteins and subunits as ‘carriers’, but retaining definition of NAD(P)H-binding proteins and flavoproteins as ‘enzymes’ as the latter often have the catalytic site in the same polypeptide chain. Finally, numbers of known RNA genes have risen from 104 reported in 1993 through 116 reported in 2004, to 156 today. The increase in the numbers results from the identification of new ‘small RNAs’ many of which have regulatory function. Future experimental characterizations of the cellular functions of presently unknown genes will complete the picture of the contents and proportions of all types of macromolecules in an E.coli cell.
Beyond the annotation activities, a third aim was to produce a gene identification system for E.coli K-12 genes that is consistent between the two strains over the vast regions where they are essentially identical, while also making accessible those genes that are strain specific or have different map locations. Owing to the use of slightly different coordinate systems, more copies of IS elements in W3110, a defective phage present only in MG1655, and the large W3110 inversion (44), there is no simple formula relating the positions of corresponding nucleotides in the two K-12 genomes. The problem is that the genomes do not have the same length, and the gene order is reversed within the inversion. Consequently, consistent sequential numbering of sequence and features is impossible.
Our solution was to provide a tripartite system of identifiers for each annotated feature: ‘b’ numbers for MG1655, ‘JW’ numbers for W3110, and ‘ECK’ (E.coli K-12) numbers for reference to E.coli K-12 as a composite strain. The b and JW numbers are indexed to the nucleotide sequences of the respective genomes and ECK numbers point to the corresponding b and/or JW numbers depending on whether the gene exists in one or both genomes. In updating the MG1655 genome, we retained the original b numbers if the gene was not substantially changed. Otherwise, the original b number was permanently retired and a new number was taken from the end of the series. The JW numbers were similarly styled. We chose this approach over one that would introduce decimal extensions to existing numbers as a process more easily applied in cases of future changes. Single ECK numbers were assigned for each unique CDS of an IS element, resulting in a one to many mapping for these CDSs. We limited the ‘one to many’ nomenclature to mobile elements so, for example, ribosomal RNA genes are each assigned separate ECK numbers. Genes interrupted by an IS element or frameshift were given unique b and JW numbers for each gene segment and the same ECK number for all gene segments. The ECK unique identifiers are numbered sequentially in the order of the MG1655 map beginning with thrL.
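The tripartite identifier scheme described above can be pictured as a small lookup structure. The following is a minimal sketch under stated assumptions: the class `EckEntry`, the function `build_reverse_index` and every identifier and gene name in the sample data are invented for illustration and do not correspond to real entries in the annotation tables. It shows how one ECK number can point to zero, one or several b and JW numbers, covering genes present in both strains, strain-specific genes, genes split in one strain, and the one-to-many mapping used for IS element CDSs.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EckEntry:
    """One composite K-12 feature and its strain-specific identifiers."""
    eck: str
    name: str
    b_numbers: List[str] = field(default_factory=list)   # MG1655 identifiers
    jw_numbers: List[str] = field(default_factory=list)  # W3110 identifiers

    def strains(self) -> List[str]:
        """Strains in which this feature is annotated."""
        present = []
        if self.b_numbers:
            present.append("MG1655")
        if self.jw_numbers:
            present.append("W3110")
        return present


def build_reverse_index(entries: List[EckEntry]) -> Dict[str, str]:
    """Map every b or JW number back to its ECK number."""
    index = {}
    for entry in entries:
        for ident in entry.b_numbers + entry.jw_numbers:
            index[ident] = entry.eck
    return index


# All identifiers below are invented placeholders.
entries = [
    EckEntry("ECK9001", "abcA", ["b9001"], ["JW9001"]),            # in both strains
    EckEntry("ECK9002", "abcB", ["b9002"], []),                    # MG1655 only
    EckEntry("ECK9003", "abcC", ["b9003"], ["JW9003", "JW9004"]),  # split in W3110
    EckEntry("ECK9004", "insX", ["b9004", "b9005"],
             ["JW9005", "JW9006", "JW9007"]),                      # IS CDS, one to many
]

index = build_reverse_index(entries)
print(entries[1].strains())
print(index["JW9004"])
```

Keeping the strain-specific numbers as lists is one way to accommodate both split genes and multi-copy IS CDSs without changing the shape of the record.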
The E.coli community uses Demerec format (45) for gene names, consisting of a unique three-letter abbreviation intended to suggest a function, followed by a capital letter to distinguish different genes related to the same function. ‘Official’ gene names are managed by the Coli Genetic Stock Center (CGSC) (11). The ‘y gene’ system (46) follows the Demerec format, with names beginning with the letter ‘y’, as a way to name genes of unknown function. Although intended only for temporary use until a function was unraveled, y gene names have been retained in the literature for many genes whose function is now well understood. We updated the nomenclature in two ways: (i) Mary Berlyn of the CGSC at Yale University provided new Demerec names, from the literature and personal communications, to replace y gene names for genes whose functions have now been discovered, and resolved conflicts and redundancies resulting from multiple names assigned to a single gene or class of genes, or the same name assigned to multiple genes; (ii) Kenn Rudd assigned y gene names for some newly delineated genes of unknown function. In all cases, both the canonical name and synonyms are in Supplementary Table 1. For some genes, informal names that do not comply with the Demerec rules are also given as locus names. These include names for fragmented pseudogenes (each fragment named by adding on ‘_1’, ‘_2’ and so on, numbering from the N-terminal end of the full-length protein) and multiple copies of IS proteins (each copy assigned an extension of ‘-1’, ‘-2’, ‘-3’ and so on, depending on its chromosomal location).
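The naming rules described above lend themselves to simple programmatic checks. The sketch below is an illustration only: the function names and the example base names `abcD` and `insX` are hypothetical, and the pattern covers only the form stated in the text (three lowercase letters followed by one capital letter), not every name in use.

```python
import re

# Three lowercase letters followed by one capital letter, as described in the text.
DEMEREC = re.compile(r"[a-z]{3}[A-Z]")


def is_demerec(name: str) -> bool:
    """True if the name has the three-letter-plus-capital form."""
    return DEMEREC.fullmatch(name) is not None


def is_y_gene(name: str) -> bool:
    """True for Demerec-form names that begin with 'y'."""
    return is_demerec(name) and name.startswith("y")


def fragment_names(base: str, count: int) -> list:
    """Locus names for pseudogene fragments, numbered from the N-terminal end."""
    return [f"{base}_{i}" for i in range(1, count + 1)]


def is_copy_names(base: str, count: int) -> list:
    """Locus names for multiple copies of an IS protein gene."""
    return [f"{base}-{i}" for i in range(1, count + 1)]


print(is_demerec("hdfR"), is_y_gene("hdfR"))
print(fragment_names("abcD", 2))   # 'abcD' is a made-up base name
print(is_copy_names("insX", 3))    # 'insX' is a made-up base name
```

The fragment and copy names deliberately fail `is_demerec`, which matches the statement that these informal locus names do not comply with the Demerec rules.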
Some genes are clearly inactivated by deletion, frameshift or IS element insertion. In an attempt to connect terminology with the genetic nomenclature of eukaryotes, we refer to these as pseudogenes and pseudogene fragments. Individual fragments of divided pseudogenes are given the same ECK identifier, but locus names are modified as described above. In addition to the entries specifying the individual fragments, a further entry under the same ECK identifier provides the range of nucleotides of the entire (ancestral) pseudogene. Unique locus identifiers have been assigned only to the predicted ancestral pseudogenes in MG1655.
The output data
The main table, Supplementary Table 1, has a row for each gene or gene fragment and 44 data columns. Because this table has empty spaces where a property does not apply to a particular gene type, separate more compact tables are provided for enzymes (Supplementary Table 2), transport proteins (Supplementary Table 3), regulatory proteins (Supplementary Table 4) and the remainder (Supplementary Table 5). All five tables are provided in both spreadsheet (Microsoft Excel) and text formats. The text format offers a seldom-seen advantage in the presentation of genomic data in that the information is not presented one gene at a time, but the information can be addressed as a whole. This format lends itself to importation into relational or other database management systems and to exploration using query languages.
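As an illustration of the last point, a tab-delimited text table can be loaded into a relational database and queried as a whole. This is a minimal sketch under stated assumptions: the three column names, the sample rows and the function names are invented and do not reflect the actual 44 columns of Supplementary Table 1. It uses an in-memory SQLite database from the Python standard library.

```python
import csv
import io
import sqlite3

# Invented tab-delimited excerpt; real column names and values will differ.
SAMPLE = (
    "eck\tname\tproduct_type\n"
    "ECK9001\tabcA\tenzyme\n"
    "ECK9002\tabcB\ttransporter\n"
    "ECK9003\tabcC\tenzyme\n"
    "ECK9004\tyzzA\tunknown\n"
)


def load_table(text: str) -> sqlite3.Connection:
    """Load tab-delimited text into an in-memory SQLite table named 'genes'."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE genes (eck TEXT PRIMARY KEY, name TEXT, product_type TEXT)"
    )
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    conn.executemany(
        "INSERT INTO genes VALUES (:eck, :name, :product_type)", rows
    )
    return conn


def count_by_type(conn: sqlite3.Connection) -> list:
    """Count genes per product type, sorted by type name."""
    cursor = conn.execute(
        "SELECT product_type, COUNT(*) FROM genes "
        "GROUP BY product_type ORDER BY product_type"
    )
    return cursor.fetchall()


conn = load_table(SAMPLE)
counts = count_by_type(conn)
print(counts)
```

A single grouped query of this kind addresses the whole gene set at once, rather than one gene at a time.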
The information in the data columns is summarized in a table (vide supra), which describes the type of information in each column and the major sources used in the annotation process. Text notes with definitions and explanations of the types of data in the table, and descriptions of how they were generated, are in Supplementary Document 1 (Explanatory Notes). Table contents are not exhaustive; most entries could be expanded. For instance, only the coarsest-granularity terms available in the Gene Ontology (GO) system were applied to each gene product. Time did not permit taking proper advantage of the rich detail of the ontology. Application of fine detail awaits future work by member(s) of the E.coli community.
We can ask where we stand in having definite facts about every gene in the organism. Figure 2 summarizes how many gene products have functions that have been demonstrated experimentally, how many have functions that can be predicted by similarity to known genes and how many are still of unknown function. Unknown gene products were divided into those that are conserved, in the sense of having similarity to the sequence of at least one other protein in current databases, and those that are not. Of the least known, there is useful information for some, such as the presence of a predicted domain within the sequence. Only 5.3% of E.coli K-12 genes remain totally unknown, without even a predicted domain and with no clue to their identity or function at this time. The larger category of unknowns for which some information is available constitutes an additional 8.6%. These CDSs require proper characterization to learn the identity and function of their gene products. It seems likely that the number of genes of unknown function of various kinds will continue to fall as experimental findings accumulate.
Figure 2 Status of annotation of E.coli gene products. The total number of gene products present in both MG1655 and W3110, 4452 excluding oriC, are categorized according to their function assignment. Evidence code and gene type assignments are available in the supplementary material.