Genome-scale genetic footprinting in E. coli.
Genetic footprinting was first introduced for analysis of gene essentiality in S. cerevisiae. A modification of this technique using a Tn5-based in vitro transposome system (11) in E. coli was previously described, and gene essentiality within three cofactor biosynthetic pathways has been analyzed (9). Here we have extended this pilot analysis to the whole-genome level by using the same standardized growth conditions. The general experimental scheme is illustrated in the supplementary data. Briefly, following transposon mutagenesis, a population of ~2 × 10^5 independent mutants was grown aerobically for 23 doublings in enriched LB medium supplemented with kanamycin. Genomic DNA was isolated from the whole population and used to map individual transposon inserts with a nested PCR approach.
Distribution of the 1.8 × 10^4 distinct insert locations detected along the E. coli chromosome is illustrated in Fig. . The densities of transposon insertion events are randomly distributed, with two notable exceptions: an overall maximum around the origin of replication (oriC) and a minimum around the terminus (dif). This may reflect increased target copy number at the origin of replication in the actively dividing bacterial population used in this experiment. The overall insertion density is 3.218/kb, without appreciable variation between coding (3.221/kb) and noncoding (3.193/kb) regions.
Assessment of conditional gene essentiality based on genetic footprinting data.
Unambiguous essentiality assessments were made for 3,746 (or 87% of the total) E. coli protein-encoding genes or ORFs (Table ). Of these, 620 (14%) were asserted to be essential, and 3,126 (73%) were asserted to be nonessential (dispensable) based on the occurrence of transposon inserts within each ORF and the overall insertion density in the local environment, as described in the supplementary data. The complete essentiality list is reported in the supplementary data (see Table S1). No assertions could be made for 327 genes for technical reasons, such as limited efficiency of PCRs in certain regions of the E. coli chromosome or nonspecific primer annealing in areas of DNA repeats. For 218 genes, we considered the evidence to be insufficient for a specific conclusion about essentiality. These genes were systematically called ambiguous, according to the criteria listed in the supplementary data. For example, ORFs shorter than 240 bp (<80 aa) and with no inserts were consistently classified as ambiguous rather than essential. In certain cases, relatively long ORFs (>900 bp) containing only a single transposon were designated ambiguous rather than nonessential.
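The calling rules quoted above can be sketched as a small decision function. This is a simplified illustration, not the procedure from the supplementary data: the function name `call_essentiality` is hypothetical, only ORF length and insert count are considered, and the local insertion density that the actual criteria also weigh is omitted. The 240-bp and 900-bp thresholds and the gene counts are taken from the text.

```python
def call_essentiality(orf_length_bp: int, n_inserts: int) -> str:
    """Simplified, hypothetical version of the calling rules described in the text."""
    if n_inserts == 0:
        # Short ORFs may lack inserts by chance, so they are not called essential.
        return "ambiguous" if orf_length_bp < 240 else "essential"
    if n_inserts == 1 and orf_length_bp > 900:
        # The text applies this rule only "in certain cases"; here it is applied uniformly.
        return "ambiguous"
    return "nonessential"


# Gene counts reported in the text.
counts = {"essential": 620, "nonessential": 3126, "ambiguous": 218, "no_data": 327}
total = sum(counts.values())
percent = {k: round(100 * v / total) for k, v in counts.items()}
```

Summing the four reported counts gives the denominator behind the quoted percentages, so the 14% and 73% figures can be checked directly from `percent`.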
Distribution of essential and nonessential genes and average ERIs in selected functional categories
Our results are generally consistent with previously published data on individual genes and with data from currently available collections of systematic gene deletions in E. coli. For example, of the 1,379 individual gene deletion mutants listed at the University of Wisconsin E. coli Genome Project website (http://www.genome.wisc.edu/functional/tnmutagenesis.htm), only 12% produced apparently conflicting designations of genes as essential (for a detailed list of the discrepancies, see Table S2 in the supplementary data). Although we have not attempted to reconcile each individual case, several reasons for discrepancies can be envisioned. Most importantly, the term essential, which intuitively suggests an absolute requirement for cell viability, also applies to any gene that imparts a substantial fitness advantage. Thus, mutants lacking gene products necessary for maintaining vigorous growth fall into the same category as those with “true lethal” mutations. Therefore, certain genes may be classified as essential by genetic footprinting, yet corresponding viable deletion mutants may be obtained. In addition, differences in medium compositions, aeration levels, temperatures, and cell densities may account for many inconsistencies. Surprisingly, polar effects, in which transposon insertion into dispensable genes disrupts transcription of essential genes, are relatively rare in genetic footprinting. This may be due to the presence of weakly active promoter-like sequences within the transposon used in these experiments (9). Most examples of polar effects are associated with genes that may require high levels of expression to sustain rapid growth rates.
Discrepancies resulting from inserts detected in the genes otherwise considered to be essential also occur. In some cases, single inserts occur close to protein termini or in interdomain boundary regions in multidomain proteins. For proteins consisting of two or more independently functioning domains, inserts may be tolerated within the 3′ portion of the gene if the C-terminal domain of the protein it encodes is associated with a dispensable function. This can occur even when a function associated with the N-terminal domain (from the 5′ region of the gene) is genuinely essential (as with ftsX). Small, localized chromosomal duplications may account for inserts in genes otherwise recognized as essential (2). In this scenario, one copy of a duplicated gene provides the essential function while the other copy containing the transposon is stabilized by selection for kanamycin resistance. Large genes with only a small number of inserts may fall into this category since the total number of specific duplications within the population prior to transformation is probably very small (25).
Functional context analyses of essentiality data.
The interpretation of genomic essentiality data can be approached in a number of alternative ways, such as by using chromosomal (positional), functional (system level), or phylogenetic (evolutionary) context analysis. In addition to refining initial essentiality assignments and reconciling apparent discrepancies with existing knowledge, such analyses can improve and expand existing understanding of the systemic behavior of the cell at various levels. Without attempting a comprehensive analysis, we have limited the scope of our efforts to (i) prototyping and illustrating such analysis by using selected examples from various functional systems, (ii) evaluating the internal consistency of our data, and (iii) developing preliminary observations at the system level, as presented below.
Initially, we analyzed the data in a functional context, which involved dividing the overall physiology of the organism into smaller, internally coherent subsystems such as amino acid biosynthesis, nucleotide metabolism, and other broad functional categories (Table ). This approach mirrors the standard didactic subdivision of microbial biochemistry and physiology. It also provides an organizational framework with which to analyze total genomic data and allows specific metabolic questions to be addressed.
For consistency, our functional analysis is based exclusively on SWISS-PROT functional annotations (8). Each of the 1,849 gene products with specific SWISS-PROT annotations and defined biochemical functions supported by solid experimental evidence was placed into one of the 12 functional categories (Table and supplementary data [see Table S1]). Among the remaining 2,242 uncategorized protein-encoding genes, many have been tentatively annotated in SWISS-PROT and other databases, but most of these annotations either fall short of giving a specific testable function or have not been confirmed by direct experiments. As expected, the ratios of essential genes within various functional categories are rather uneven (Table ). Categories that include gene products involved with key aspects of cellular metabolism (such as nucleic acid and protein metabolism) contain a substantially higher percentage of essential genes (28 and 48%, respectively) than the average for the entire genome (14%). The percentages of essential genes in categories such as signaling, motility, and chemotaxis (8%) and membrane transport (8%) are substantially below the whole-genome benchmark. The average essentiality for the subset of 2,242 uncategorized genes (11%) is substantially lower than the average for the subset of categorized genes (19%). Several representative metabolic and nonmetabolic systems (7 of 12 functional categories) were selected for use as examples of functional context analysis and for evaluation of the internal consistency of the data. Here we describe one such analysis, with additional detailed interpretations presented in the supplementary data.
Amino acid metabolism: lysine biosynthesis.
Most of the genes responsible for biosynthesis of various amino acids were expected to be nonessential since the medium contains most of the amino acids required for growth. With a few notable exceptions, this expectation was confirmed by our results. Of the 91 genes with specific SWISS-PROT annotations indicating involvement in amino acid biosynthesis, only 16 appear to be essential (Fig. ). Six of these genes are involved in lysine biosynthesis. E. coli produces lysine from aspartate via the nine-step pathway (Fig. ). Although lysine is available in the growth medium, its immediate precursor, diaminopimelate (DAP), which is required for cell wall biosynthesis, is not. The lysA gene encoding the enzyme that converts DAP to lysine at the last step of this pathway is dispensable. Analysis of DAP-lysine biosynthesis provides an example of refining pathway reconstruction and individual functional assignments based on genome-scale essentiality data. Genes (asd, and dapF) encoding most of the enzymes leading to DAP production are essential. The first gene in this pathway (lysC), encoding aspartokinase III, is dispensable due to the functional redundancy of the additional aspartokinase isozymes (encoded by metL). In contrast, the asd genes involved with the second and the third steps of DAP-lysine biosynthesis are essential in spite of the existence of apparent paralogs. Proteins encoded by the yjhH functionally uncharacterized genes are often annotated as potential dihydrodipicolinate synthases based on their high sequence similarities with the dapA gene product (BLAST E scores of 4e−33, respectively). However, genetic footprinting data suggest that under our experimental conditions neither is capable of complementing loss of the essential dapA function. The opposite situation is observed with succinyl-DAP aminotransferase (encoded by argD), which is firmly defined as dispensable in our data. This apparent inconsistency can be resolved by assuming functional complementation by the argM gene product. The argM gene is known to encode succinyl-ornithine transaminase, which is primarily involved in arginine biosynthesis. However, this enzyme is closely related to succinyl-DAP aminotransferase by sequence, and the aminotransferases are known to possess rather broad substrate specificities, especially for structurally similar substrates (such as succinyl-DAP and succinyl-ornithine). Overexpression of the argM gene has been demonstrated to suppress an argD mutation in E. coli.
FIG. 2. Essentiality of genes controlling amino acid biosynthesis in E. coli. (A) Functional overview of amino acid biosynthesis. Each block represents one or more pathways leading to production of a particular amino acid or its key intermediates.
Phylogenetic analysis of essentiality data within functional groups.
To assess the data set from an evolutionary perspective, we examined the distribution of conditionally essential and dispensable E. coli genes with respect to the occurrence of putative orthologs across a broad range of diverse bacterial genomes. Putative orthologs within a reference set of 32 complete bacterial genomes chosen to represent maximum phylogenetic diversity were identified based on protein families, supplemented by bidirectional best hits (see Materials and Methods). For this analysis we introduce a simple parameter: an ERI computed for each E. coli gene as the fraction of genomes from the reference set containing a putative ortholog of the gene. ERI values varying from 0 (for genes unique to E. coli) to 1.0 (for omnipresent genes) are provided in the supplementary data (see Table S1). In a recent study, the Profiling of E. coli Chromosome data (http://www.shigen.nig.ac.jp/ecoli/pec) were used to demonstrate a remarkable tendency of essential gene sequences to be more evolutionarily conserved than those of nonessential genes (19). In our analysis, we used ERI values to focus on occurrence of essential and nonessential genes (preservation of orthologs) rather than on conservation of their respective sequences.
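The ERI definition given above (the fraction of reference genomes containing a putative ortholog) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `eri` and the genome labels are hypothetical, and ortholog detection itself (protein families plus bidirectional best hits) is assumed to have been done upstream.

```python
def eri(gene_orthologs: set[str], reference_genomes: set[str]) -> float:
    """Fraction of reference genomes that contain a putative ortholog of the gene."""
    if not reference_genomes:
        raise ValueError("reference set must not be empty")
    # Only genomes that belong to the reference set are counted.
    return len(gene_orthologs & reference_genomes) / len(reference_genomes)


# Hypothetical labels standing in for the 32 reference genomes.
reference = {f"genome_{i:02d}" for i in range(32)}
unique_gene = set()                                   # no orthologs outside E. coli
omnipresent_gene = set(reference)                     # ortholog in every reference genome
partial_gene = {f"genome_{i:02d}" for i in range(8)}  # ortholog in 8 of 32 genomes
```

With a 32-genome reference set the ERI can only take values in steps of 1/32, which is worth keeping in mind when thresholds such as 0.1 or 0.8 are read as genome counts.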
Figure depicts the overall number of E. coli genes in decreasing order over the range of ERI values. An initial sharp decrease in the number of preserved genes (~40%) occurs over a rather small phylogenetic distance of less than four genomes in our reference set (ERI ≤ 0.1). Further decay occurs at much lower rates, and orthologs of ~10% of E. coli genes are preserved in at least 25 diverse genomes (ERI ≥ 0.8). This reflects a nonrandom ortholog preservation pattern, characterized by a highly conserved core group of genes. This core is highly enriched in genes identified as essential in our study. The tendency of essential genes to be evolutionarily preserved is also reflected in Fig. , demonstrating a significantly positive correlation (0.5240) between essentiality (Fig. ) and ERIs (Fig. ) along the E. coli chromosome. Similarly, plotting the fraction of essential genes at different ERI values demonstrates that the relationship between the two parameters has the form $y = y_0 + ae^{bx}$, where $x$ is the ERI and $y$ is the fraction of essential genes. This implies that the essentiality of genes with a given ERI is due partly to a very strong tendency of essential genes to be retained by evolution (the exponential behavior dominant above an ERI of 0.6) and partly to an essential gene fraction of ~10% that is present among genes within any ERI value group (Fig. ).
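The two-component relationship described above, a constant baseline plus an exponential term in the ERI, can be written as a short function. This is a sketch only: the parameter values below are illustrative placeholders chosen so that the baseline is near 10%, not fitted values from this study, and the function name `essential_fraction` is hypothetical.

```python
import math


def essential_fraction(eri: float, y0: float, a: float, b: float) -> float:
    """Model y = y0 + a * exp(b * x), with x the ERI and y the essential fraction."""
    return y0 + a * math.exp(b * eri)


# Illustrative (hypothetical) parameters, NOT fitted to the data in this study.
Y0, A, B = 0.10, 0.001, 6.5

# Evaluate the model across the ERI range in steps of 0.1.
curve = [(x / 10, essential_fraction(x / 10, Y0, A, B)) for x in range(11)]
```

At low ERI the exponential term is negligible and the curve sits near the baseline `y0`; the exponential term only becomes dominant at high ERI, which is the qualitative behavior the text describes.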
FIG. 3. Distribution of E. coli genes as a function of ERIs. (A) Total number of genes with an ERI above the threshold plotted versus the ERI threshold. Color coding within bars includes fractions of essential (red), nonessential (green), and ambiguous (yellow) genes.
Comparison of average essentiality and ERI values between different functional categories reveals significant correlation (Table ). Functional categories including highly specialized proteins such as transporters, regulators, and signaling molecules are characterized by average ERI values close to the average for the whole genome (~0.3). Average essentiality within these groups also does not exceed the overall whole-genome level (~14%). The least essential group of all, uncategorized proteins with historically elusive functions, has the lowest average ERI, ~0.2. Therefore, many of these proteins are likely to be specific to the environmental and phylogenetic niches of E. coli. On the other hand, the bulk of cellular intermediary metabolism (categories AAM, CHM, NCM, LPC, and MSM [Table ]) is associated with ERI values of 0.4 to 0.5. Essentiality within these metabolic categories varies depending on the levels of functional redundancy of their constituents in rich medium. Not surprisingly, the highest ERI values (up to 0.7) as well as the highest ratio of essential genes (up to 48%) occur in functional categories that include replication, transcription, and translation, i.e., cellular processes that are conserved and unconditionally essential in most organisms.
Figure illustrates the changes in distribution of essential genes between functional categories depending on their tendencies to be evolutionarily preserved. An initial bias in distribution of all categorized essential genes towards those involved with synthesis and processing of informational macromolecules increases dramatically at higher ERI values. The fraction of all essential genes contributed jointly by the functional categories PMS and NAM (Table ) (~30%) increases almost twofold (up to ~60%) for a subset of essential genes with ERIs of >0.8, ultimately exceeding 90% as the ERI approaches 1.0.
FIG. 4. Distribution of essential genes among functional categories as a function of ERI thresholds. Functional categories are color coded and specified by three-letter designations as in Table .
This analysis reveals two distinct classes of essential genes, which may be referred to as broadly preserved essential genes and species-specific essential genes. A subset of fewer than 180 genes (~4% of the genome) with ERIs of >0.8 accounts for ~25% of all of the essential genes revealed in this study, and it appears to provide an approximation of broadly preserved essential genes. Functional content analysis of this subset (Fig. ) strongly supports the expectation that these genes represent universally and unconditionally essential constituents of cellular central machinery. This notion is in good agreement with available complete and partial gene essentiality datasets for Mycoplasma pneumoniae and Mycoplasma genitalium, Haemophilus influenzae, Staphylococcus aureus, and Streptococcus pneumoniae. The overwhelming majority (70 to 87%) of assigned genes in these data, which correspond to E. coli genes listed in Fig. , appear to be essential (see Table S5 in the supplementary data for details). Of note, many of these broadly preserved essential genes, including those with yet undefined functions, may be considered potential broad-spectrum anti-infective drug targets (9).
FIG. 5. E. coli genes found to be essential and preserved in over 80% of diverse bacterial genomes (ERI > 0.8). These universal essential genes are grouped by functional categories (described in Table ). NTP, nucleotide triphosphate.
In contrast, more than 75% of genes within the set of species-specific essential genes (those with ERI values of <0.1, which account for ~30% of all essential E. coli genes) encode uncategorized proteins with poorly defined or completely unknown functions. Many of the genes with known functions within this class are related to transcription regulation, membrane transport, signaling, and other cellular processes whose essentiality is either strictly condition dependent or limited to a set of very specific needs of E. coli and closely related species.
Among the 263 essential genes marked in our analysis as uncategorized (see Table S1 in the supplementary data), 19 genes have specific functions assigned to them while 73 genes have putative assignments (according to SWISS-PROT and other public archives). Those include assignments indicating just an element of possible function, such as “probable GTP-binding protein” (ychF). For the remaining 171 genes, we were unable to find any reliable functional assignments. These genes may be described as essential unknowns (at least at the time when this analysis was performed). The list of these genes along with their respective ERI values is provided in the supplementary data (see Table S6). Only 10 (yciL, yjeE, ybeY, yebC, yjgF, ydeE, yoaB, yqgF, ycdK, and yhbC) of the essential unknowns (<6%) are broadly conserved in bacteria (ERIs of 0.8 to 1). In contrast, more than 60% of genes in this set are poorly conserved across our reference set of diverse genomes (108 genes with ERIs of 0 to 0.1). Fewer than half of them (42 genes) are conserved in most Enterobacteriaceae, while others are present only in E. coli and some closely related species.
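The counts in the preceding paragraph can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the text; the variable names are illustrative.

```python
# Breakdown of the 263 uncategorized essential genes, as stated in the text.
specific, putative, unknown = 19, 73, 171
uncategorized_essential = specific + putative + unknown

# Conservation of the 171 "essential unknowns".
broadly_conserved = 10    # ERIs of 0.8 to 1
poorly_conserved = 108    # ERIs of 0 to 0.1
in_enterobacteriaceae = 42

share_broad = broadly_conserved / unknown   # about 0.058, i.e. <6%
share_poor = poorly_conserved / unknown     # about 0.63, i.e. >60%
```

The three subgroups sum to the stated total of 263, and the quoted "<6%" and "more than 60%" shares both follow when the 171 essential unknowns are used as the denominator.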
System level analysis of essentiality data within topologic modules of E. coli metabolism.
It is widely recognized that the thousands of components of a living cell are dynamically interconnected, so that cellular functional properties are a result of the complex intracellular web of molecular interactions within the cell (14). This is perhaps most evident with intermediary metabolism, in which hundreds of metabolic substrates are densely integrated through biochemical reactions (17). Metabolic networks are organized into many small, highly connected topologic modules that combine in a hierarchical manner into larger, less cohesive units, with their numbers and degrees of clustering following a power law, as previously demonstrated for 43 reference organisms (28). Within E. coli, hierarchical modularity closely overlaps with known metabolic functions (28).
To comprehend the results of individual gene essentiality in the context of cellular system level functional organization, we projected the essentiality phenotype of metabolic enzymes onto a global topologic representation of the E. coli metabolic network (28). As shown in Fig. , the overall essentiality ratio of metabolic enzymes within the full metabolic network is relatively low, with essential enzymes limited to a subset of modules. Visual inspection of the figure indicates that while many metabolic modules are almost entirely nonessential, at the lowest hierarchical level several branches corresponding to small topologic modules appear to be essential, i.e., they are composed of biochemical reactions catalyzed by predominantly essential enzymes. Of these, the largest fractions are within the topologic modules related to nucleotide, coenzyme, and lipid metabolism. The pyrimidine metabolic module appears to contain the highest level of essential reactions.
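The per-module essentiality ratio discussed above amounts to counting essential enzymes within each topologic module. The following is a minimal sketch with toy data: the module names, the enzyme assignments, and the function name `module_essentiality` are all hypothetical, and the assignment of enzymes to modules is assumed to come from the network analysis cited in the text.

```python
from collections import defaultdict


def module_essentiality(enzymes):
    """Return {module: fraction of essential enzymes}.

    `enzymes` is an iterable of (module_name, is_essential) pairs.
    """
    totals = defaultdict(int)
    essential = defaultdict(int)
    for module, is_essential in enzymes:
        totals[module] += 1
        essential[module] += int(is_essential)
    return {module: essential[module] / totals[module] for module in totals}


# Toy input, for illustration only.
toy_enzymes = [
    ("module_A", True), ("module_A", True), ("module_A", False),
    ("module_B", False), ("module_B", False),
]
ratios = module_essentiality(toy_enzymes)
```

Ranking modules by this ratio is one simple way to single out the small, predominantly essential branches that the text identifies by visual inspection.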
FIG. 6. The evolutionary retention and essentiality ratio of enzymes in the topologic modules of E. coli metabolism. The hierarchical tree is derived from the topologic overlap matrix of E. coli metabolism, which quantifies the relation between the various modules.
A significant correlation between essentiality and ERI values is apparent within metabolic modules, and many of the highly essential modules also contain metabolic enzymes with the highest ERI values (Fig. ). Generally, essentiality and evolutionary retention of metabolic enzymes correlate, although exceptions are also evident as illustrated in detail for the pyrimidine module (supplementary data [see Fig. S3]). Pyrimidine metabolism, however, represents a special case in E. coli MG1655, since the rph-1 mutation in this strain depresses expression of the downstream pyrE. This strain is prototrophic for pyrimidines but grows significantly better in uracil-supplemented media. Although our studies were performed with rich media containing significant amounts of exogenous pyrimidines, the low level of pyrE transcription may have affected the ability of cells to efficiently adjust the relative levels of the pyrimidine nucleotides. This may explain the relatively high level of gene essentiality within the pyrimidine-related topologic module. These observations, however, may also reflect a hypothesized generic feature of metabolic networks: their limited ability to fully compensate for perturbations by reorganization of metabolic fluxes within evolutionarily conserved topologic modules.