To expose the significance of the PHX gene classes, we plotted B(g|C) versus B(g|RP) traversing all individual genes g (≥100 codons in length). The plots are given in Fig. for each of the four rapidly growing bacteria. The distribution of points reveals two horns. The left horn effectively corresponds to the PHX genes. The right horn we refer to as putative alien genes. It consists of genes that significantly differ in their codon usages from the four classes C, RP, CH, and TF and will be discussed in a separate publication. If we replace the horizontal axis B(g|RP) with the coordinates of B(g|TF) or B(g|CH), the plots in Fig. remain largely unchanged (data not shown).
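The quantities B(g|C) and B(g|RP) plotted here, and the E(g) values quoted throughout, are not defined in this section. The sketch below assumes the commonly stated form of the method: codon frequencies are normalized within each amino acid family, absolute differences are weighted by the gene's amino acid frequencies, and E(g) is the ratio of B(g|C) to a weighted mean of B(g|RP), B(g|CH), and B(g|TF) with weights 1/2, 1/4, and 1/4. The function names, the data layout, and the toy frequencies are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict

# Codon frequencies are stored per amino acid and sum to 1 within each
# amino acid family, e.g. {"K": {"AAA": 0.7, "AAG": 0.3}}.
CodonFreqs = Dict[str, Dict[str, float]]


def codon_usage_difference(aa_freqs: Dict[str, float],
                           gene: CodonFreqs,
                           reference: CodonFreqs) -> float:
    """B(g|S): amino-acid-weighted sum of absolute codon frequency differences.

    Assumed form (not stated in this section):
        B(g|S) = sum_a p_a(g) * sum_{codons of a} |g(codon) - s(codon)|
    """
    total = 0.0
    for aa, weight in aa_freqs.items():
        gene_family = gene.get(aa, {})
        ref_family = reference.get(aa, {})
        codons = set(gene_family) | set(ref_family)
        diff = sum(abs(gene_family.get(c, 0.0) - ref_family.get(c, 0.0))
                   for c in codons)
        total += weight * diff
    return total


def expression_measure(b_c: float, b_rp: float, b_ch: float, b_tf: float) -> float:
    """E(g) as B(g|C) divided by a weighted mean of B(g|RP), B(g|CH), B(g|TF).

    The weights 1/2, 1/4, 1/4 are an assumption, not taken from this section.
    """
    return b_c / (0.5 * b_rp + 0.25 * b_ch + 0.25 * b_tf)


# Toy two-amino-acid example (made-up frequencies, for illustration only).
gene = {"K": {"AAA": 0.9, "AAG": 0.1}, "E": {"GAA": 0.8, "GAG": 0.2}}
all_genes = {"K": {"AAA": 0.6, "AAG": 0.4}, "E": {"GAA": 0.6, "GAG": 0.4}}
rp_genes = {"K": {"AAA": 0.8, "AAG": 0.2}, "E": {"GAA": 0.8, "GAG": 0.2}}
aa = {"K": 0.5, "E": 0.5}

b_c = codon_usage_difference(aa, gene, all_genes)   # distance from the average gene
b_rp = codon_usage_difference(aa, gene, rp_genes)   # distance from the RP class
```

Under these assumptions, a gene whose codon usage is far from that of the average gene (large B(g|C)) but close to the RP, CH, and TF classes yields E(g) well above 1, which is the sense in which the left horn of the plot corresponds to PHX genes.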
Top 20 PHX genes. The distribution of PHX genes among the four fast-growing bacteria is displayed in Table . The highest E(g) value exceeds 2 in all four genomes. Such high values are rare among the completely sequenced genomes (cf. reference 21). These four bacteria have a substantial number of PHX genes, ranging from 142 to 306.
| TABLE 1. Distribution of PHX genes in four fast-growing bacteria |
Table presents the 20 genes with the highest predicted expression levels in the genomes of E. coli, V. cholerae, H. influenzae, and B. subtilis. In those few instances when the homologous genes in the other genomes are not PHX, their E(g) values are shown in parentheses. The genes are segregated into functional categories. Almost all ribosomal proteins attain high expression levels in all rapidly growing bacteria (Tables and ). The S1 ribosomal protein gene (exceeding 500 codons in length in most bacteria) in B. subtilis is found at the diminished length of 327 codons but is still PHX, with an E(g) value of 1.20. Ribosomal protein genes are present in single copies, in contrast to rRNA genes, and are predominantly of a high expression level, presumably conforming with stoichiometric requirements for ribosome formation between proteins and RNA and among the proteins themselves.
| TABLE 3. Predicted expression levels for ribosomal protein genes among four fast-growing bacteria^a |
The major (eubacterial) chaperone/degradation proteins HSP70 (DnaK) and HSP60 (GroEL) and the mRNA degradation protein polynucleotide phosphorylase (Pnp) are prominently PHX. Pnp in B. subtilis, however, is not PHX [E(g) = 0.79]. The corresponding genes in H. influenzae achieve E(g) values of 1.29, 1.47, and 1.72, respectively. The enolase gene (eno), listed under energy metabolism as part of the glycolysis pathway, is potently PHX. Enolase is also a component of the mRNA degradosome in a multifunctional capacity (27), which may further account for its high predicted expression level.
Processing factors for protein synthesis are outstandingly PHX, especially the ATP-dependent DNA-directed RNA polymerase units RpoB and RpoC and the elongation factors EF-G (fus), EF-Tu (tuf), and EF-Ts (tsf). The elongation factor EF-Tu often is present in two copies, both dramatically PHX. B. subtilis has but one copy, and it is PHX. EF-G (fusA) is present in two copies in V. cholerae, with E(g) values of 2.02 and 0.96. The DNA helicase DeaD is PHX in E. coli, V. cholerae, and B. subtilis but not in H. influenzae, nor is the second copy in B. subtilis PHX. It will be interesting to see if there are functional differences between the two copies of DeaD and between the two copies of EF-G. DeaD box proteins protect mRNA from endonucleases (27).
Many glycolysis genes are among the top PHX genes; they include genes for pyruvate kinase (pykA and pykF), fructose-1,6-bisphosphate aldolase (fba), phosphoglycerate kinase (pgk), enolase (eno), and glyceraldehyde-3-phosphate dehydrogenase (gap).
PHX genes contributing to anaerobic fermentation include the alcohol/acetaldehyde dehydrogenase gene (adhE). Other PHX genes of energy metabolism include several, but significantly not all, genes of the tricarboxylic acid (TCA) cycle. Several subunits of the pyruvate dehydrogenase complex genes are among the top PHX genes. These include genes for multiple copies of the three enzymatic components, pyruvate dehydrogenase E1 (aceE), except in B. subtilis, dihydrolipoamide acetyltransferase E2 (aceF), and lipoamide dehydrogenase E3 (lpdA), all part of the pyruvate oxidation pathway. Genes contributing to proton gradient-driven ATP synthesis (namely, the genes for the two major subunits of the ATP synthase catalytic domain, atpA and atpD) are potently PHX. The PHX gene for adenylosuccinate synthetase (purA) stands out, except in B. subtilis. It participates in the de novo biosynthesis pathway of purine nucleotides and in the first step of AMP biosynthesis. However, the genes for the other enzymes of that pathway are not PHX.
Several porin genes of E. coli, H. influenzae, and V. cholerae are PHX. These are absent from B. subtilis, which is a gram-positive bacterium, lacking the distinctive gram-negative outer membrane. The peptidoglycan-associated lipoprotein (Pal) attached to the outer membrane by a lipid anchor is PHX in gram-negative bacteria. Several lipid biosynthesis PHX genes are among the top 20. The first enzyme of the glyoxylate shunt pathway, isocitrate lyase (AceA), is PHX in the moderately fast-growing Deinococcus radiodurans (90-min average doubling time) and the slow-growing Mycobacterium tuberculosis (24 to 36 h). It exists in E. coli and V. cholerae but is not PHX and has not been detected for most prokaryotic genomes. Isocitrate lyase is widespread in plant and fungal organisms. There is an open reading frame (ORF) (yeiM) of unknown function but possibly encoding a nucleoside transporter, with an E(g) value of 2.00, in V. cholerae, and there is a homolog with an E(g) value of 1.07 in H. influenzae but not PHX in E. coli and B. subtilis. Differences among genes in predicted expression levels present challenging questions for experimentation.
PHX genes in H. influenzae parallel PHX genes in E. coli. These include genes for mainstream glycolysis and TCA enzymes and genes for detoxification and DNA damage control, such as the sodA and catalase genes. The highest E(g) value is 2.01, attained by the elongation factor EF-G (fusA). The heat shock proteins GroEL and DnaK are among the most highly expressed. The ribosome release factor (Rrf) is the top PHX protein in H. influenzae. Rrf is responsible for the release of ribosomes from mRNA at the termination of protein synthesis (37). Rrf is present and generally highly expressed in all eubacterial organisms with completely sequenced genomes but is absent from archaea (35).
Comparison of predicted levels of expression in E. coli with 2D gel patterns. For many E. coli proteins, two-dimensional (2D) gel electrophoresis data for their abundances during growth in minimal medium are available. We compared the molar abundances of 96 proteins (with lengths of ≥100 amino acids [aa] [45, 46]) with the set of PHX genes (Table ). Among the 20 most abundant of the 96 proteins, 17 were identified as PHX by our method. Among the 20 least abundant proteins of the 96, only 7 qualified as PHX. Of the remaining 56 proteins, which have intermediate molar abundances on 2D gels, 28 were identified as PHX. This agreement between high 2D gel abundances and high E(g) values supports naming the genes “highly expressed.”
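The agreement described above can be restated as the fraction of proteins called PHX in each abundance tier. The short sketch below uses only the counts quoted in the text; the variable and function names are illustrative.

```python
# Counts quoted in the text: (number of proteins in tier, number called PHX).
groups = {
    "20 most abundant": (20, 17),
    "56 intermediate": (56, 28),
    "20 least abundant": (20, 7),
}


def phx_fraction(n_total: int, n_phx: int) -> float:
    """Fraction of proteins in a tier whose genes are predicted highly expressed."""
    return n_phx / n_total


fractions = {name: phx_fraction(n, k) for name, (n, k) in groups.items()}
total_proteins = sum(n for n, _ in groups.values())
total_phx = sum(k for _, k in groups.values())

for name, frac in fractions.items():
    print(f"{name}: {frac:.0%} PHX")
print(f"overall: {total_phx}/{total_proteins} PHX")
```

The PHX fraction falls from 85% in the most abundant tier to 50% in the intermediate tier and 35% in the least abundant tier, with 52 of the 96 proteins PHX overall.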
| TABLE 4. Comparison of 2D gel expression measurements (45) and predicted E(g) values |
Three exceptions to the good agreement between high protein molar abundances and PHX status are MetE, FolA, and IlvE, which are involved in amino acid biosynthesis and methylation. These proteins are among the most abundant in 2D gel determinations but do not qualify as PHX. The enzymatic turnover rate for MetE, determined by kinetic studies, is low but is compensated for with a high molar abundance (12). In E. coli, the methionine biosynthesis pathway includes MetK, with a very high E(g) value, 2.21, whereas MetE has an E(g) value of 0.69 and MetH has an E(g) value of 0.60. MetE and MetH offer strict alternative pathways for L-methionine synthesis. MetK acts on homocysteine to produce S-adenosylmethionine, which serves as a methyl donor for a broad range of metabolites, lipids, and vitamins (41). It has been conjectured that the metE gene or the entire Met operon in E. coli, because of its codon usage, may be a newly laterally transferred gene analogous to the Cob operon of Salmonella enterica serovar Typhimurium (24). FolA (dihydrofolate reductase) registers high 2D gel assessments but has a low E(g) value, 0.60.
Hecker and colleagues (e.g., reference 3) have conducted extensive 2D gel assessments of B. subtilis proteins. Consulting their 2D database (http://microbio2.biologie.uni-greiswald.de:8880), we compared the brightest spots on their gels with the E(g) values for the corresponding proteins: RpS2, 1.84; SerA, 0.62; IlvC, 1.03; AroA, 0.93; Gap, 1.80; PdhC, 2.05; CitC, 1.33; TufA, 1.97; Fus, 2.34; YwjH, 1.01; RpL10, 1.46; ClpP, 1.05; SodA, 1.64; and CitH, 0.81. Most of these proteins are PHX, and several achieve an E(g) value of >1.8. Thus, there is a good correlation of PHX proteins with high 2D gel abundances in B. subtilis, as in E. coli.
Classes of PHX genes. Tables and through compare, for the four fast-growing bacteria, the predicted levels of expression of all ribosomal protein genes, of the genes for the major transcription/translation processing factors, of the chaperone/degradation protein genes, and of the major energy metabolism genes. The extended repair gene repertoire of the four genomes and the vitamin biosynthesis genes of E. coli are evaluated in terms of E(g) levels (Tables and ). Each class is discussed in turn.
| TABLE 5. Predicted expression levels for translation/transcription processing genes among four fast-growing bacteria^a |
| TABLE 9. Major energy metabolism genes of the four fast-growing bacteria and their predicted E(g) values^a |
| TABLE 11. Major vitamin biosynthesis genes of E. coli |
(i) Ribosomal protein genes (Table ). Ribosomes of the four fast-growing bacteria have practically the same numbers of small- and large-subunit proteins. However, among all prokaryotic genomes, that number ranges from 50 to 65, while in eukaryotes, the number is constant at 79 (except in yeast, 78) (48, 50). This information suggests a greater range of variation in the patterns of protein synthesis among prokaryotes, consistent with the constrained phylogenetic origin of eukaryotic cells compared with the less constrained origin of prokaryotic species.
Thirty-five RP genes are shown in Table (only those ≥100 codons long). Unlike those of yeast and Drosophila, many of the bacterial RP genes are concatenated to form a large operon encompassing 20 to 40% of all RP genes. Genes for some of the major translation/transcription processing factors, including tuf, fus, rpoA, rpoB, and rpoC, are within or near the large RP operon. Other RP operons typically consist of two to five genes. In E. coli, the cluster of L7/L12, L10, L1, L11, rpoB, and rpoC is noteworthy. B. subtilis possesses an RP cluster that effectively combines the two largest clusters of E. coli. In these fast-growing bacteria, most of the eubacterial RP genes are positioned near the origin of replication, oriC. It is evident from Table that virtually all RP genes are PHX. The EF-Tu gene is often duplicated, with both copies being PHX and incorporated near or in an RP cluster. groEL, rpoB, and rpoC also tend to localize to the vicinity of the main RP cluster. Many eukaryotic and eubacterial ribosomal proteins are multifunctional (50).
The “giant” RP (labeled S1 or RpsA, generally exceeding 500 amino acids in length) has a remarkable phylogeny. It is recognized in most eubacteria but is not part of an RP operon, and its gene is generally among the most highly expressed. In B. subtilis, there is an S1 homolog, but it is only 327 codons long, and the S1 gene is entirely missing from the three current completely sequenced mycoplasma genomes. The S1 gene is essential in E. coli, where it is thought to contribute to the initiation of polypeptide synthesis. The absence of a full-length S1 protein in B. subtilis can possibly be compensated for by a strong ribosome binding site (34). The evolutionarily deep branching bacterium Aquifex aeolicus has a giant S1 gene. Thermotoga maritima, allowing for a frameshift, also has an S1 homolog. None of the archaeal genomes has an S1 homolog, and eukaryotic genomes also lack an S1 homolog.
The origin of replication (oriC) for E. coli is identified within the 232-bp interval from 3923372 to 3923603. The major RP cluster is proximal to oriC at 3436600 to 3476134 and contains, in addition to RP genes, genes for the elongation factors EF-Tu and EF-G and two flanking chaperones of the peptidyl-prolyl cis-trans isomerase (PPIase) family. Proximity to oriC implies a higher-than-average gene copy number per rapidly growing cell. A second RP cluster occurs proximally on the other side of oriC and includes genes for a duplicate copy of EF-Tu (tufB) and the DNA-directed RNA polymerase units rpoB and rpoC. The E(g) values for RP genes (≥100 codons long) in E. coli range from 2.44 to 1.13. All but one of the RP genes are PHX; the single exception is L9 in B. subtilis. The majority have E(g) values exceeding 1.50. The correlations of E(g) values among the RP genes of E. coli, V. cholerae, and H. influenzae are high (Table ).
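The coordinates in this paragraph can be checked with a few lines of arithmetic. The sketch below confirms the 232-bp length of the oriC interval and estimates how far the major RP cluster lies from oriC on the circular chromosome. The chromosome length used (4,639,221 bp, the originally published E. coli K-12 MG1655 sequence) is an assumption that is not stated in the text, and the function names are illustrative.

```python
# Assumption: chromosome length of E. coli K-12 MG1655; not given in the text.
GENOME_LENGTH_BP = 4_639_221

# Coordinates quoted in the text (bp).
ORIC = (3_923_372, 3_923_603)
MAJOR_RP_CLUSTER = (3_436_600, 3_476_134)


def interval_length(start: int, end: int) -> int:
    """Length in bp of an interval with inclusive end coordinates."""
    return end - start + 1


def circular_gap(a: int, b: int, genome_length: int) -> int:
    """Shortest distance in bp between two positions on a circular chromosome."""
    d = abs(a - b) % genome_length
    return min(d, genome_length - d)


oric_length = interval_length(*ORIC)
cluster_length = interval_length(*MAJOR_RP_CLUSTER)
# Distance from the oriC-proximal edge of the cluster to the start of oriC.
gap_to_oric = circular_gap(MAJOR_RP_CLUSTER[1], ORIC[0], GENOME_LENGTH_BP)

print(oric_length, cluster_length, gap_to_oric)
```

Under the assumed chromosome length, the nearer edge of the roughly 39.5-kb major RP cluster lies about 447 kb from oriC, that is, within about a tenth of the chromosome.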
Does stoichiometry matter? For example, among the RP genes, why aren't all 50S units PHX at the same expression level? A partial answer may be that not all ribosomal proteins play an exclusive role in determining ribosome structure. Some may have a regulatory role (e.g., S1 is proposed to function in translation initiation) (M. Nomura, personal communication) (34). The acidic ribosomal protein component P0 is PHX in archaea but is absent from eubacteria. L7/L12 is also acidic and is thought to act in adapting mRNA chains to the ribosome. Actually, L7/L12 forms dimers with an elongated shape. Two dimers associate with a copy of L10 to form a very strong complex (4). Very relevant is that several ribosomal proteins are multifunctional (50). For example, S9 provides ancillary utility in certain repair activities (49); S16, in part, acts as an endonuclease (31).
(ii) Genes for transcription/translation processing factors (Table ). The majority of protein synthesis factors are PHX over all prokaryotic genomes. Expression levels correlate highly across species (Table , footnote a). As with the ribosomal proteins, the E(g) values cover a wide range. Elongation factor EF-G (fus) is distinctive, with an E(g) value exceeding 2 for each genome. The highest expression levels in E. coli occur for the RpoB and RpoC subunits of the core RNA polymerase. RpoA is PHX in B. subtilis but not in E. coli, V. cholerae, and H. influenzae. Why are the predicted expression levels for the RpoB and RpoC subunits higher than that for RpoA? Based on the RNA polymerase stoichiometry (one copy of RpoB, one copy of RpoC, but two RpoA units), should one expect elevated expression levels for RpoA compared to RpoB and RpoC? A possible explanation relates to the differences in protein sizes, RpoB and RpoC being larger proteins than RpoA. It has been observed for E. coli that codon choices in long genes tend to be more biased than those in short genes (10). Interestingly, Mycoplasma genitalium, its relative Ureaplasma urealyticum, and the spirochete Treponema pallidum feature PHX RpoA but not RpoB and RpoC.
(iii) Chaperone/degradation protein genes (Table ). Among the top PHX genes in most eubacterial genomes are those for the major chaperone protein archetypes, DnaK and GroEL. These reach E(g) values exceeding 1.3 (>2 in E. coli). The gene for the multifunctional enzyme Pnp, fundamental in RNA processing and mRNA degradation, attains the highest predicted E(g) value, 2.66, among all E. coli genes. Pnp is PHX in many eubacterial genomes but not in B. subtilis.
Thioredoxin (trxA) assists protein folding by catalyzing the formation or disruption of disulfide bonds. The eukaryotic thioredoxin homolog is protein disulfide isomerase, operating in the endoplasmic reticulum. It has been verified experimentally that protein disulfide isomerase assists protein folding (7, 15, 47). The highest E(g) values for thioredoxin occur in B. subtilis (1.35) and then in other fast-growing bacteria in the order D. radiodurans (1.23) (data not shown), V. cholerae (1.21), H. influenzae (1.11), and E. coli (1.06).
Peptidyl-prolyl cis-trans isomerases (PPIases) accelerate the proper folding of proteins by promoting the cis-trans isomerization of imide bonds in proline within oligopeptides. E. coli has at least nine PPIases defined by sequence similarity. One of these, the survival protein SurA, enhances the folding of periplasmic and outer membrane proteins. As expected, SurA does not exist in gram-positive B. subtilis, which has neither compartment. Trigger factor (Tig) is a ribosome-associated chaperone that can complement DnaK (8). Tig and DnaK cooperate in the folding of newly synthesized proteins. Simultaneous deletion of Tig and DnaK is lethal under usual growth conditions (43). Tig is broadly PHX for eubacterial genomes but is not found for archaeal genomes. Expression levels of Tig in fast-growing bacteria are quite similar (Table ).
DegP is a chaperone folding factor that is significantly PHX, with an E(g) value of 1.26; it acts primarily in degrading misfolded proteins in the periplasm. Also associated with periplasmic and cytoplasmic chaperones are several PPIases, including PpiC [E(g) = 1.02], PpiB (1.53), FkpA (1.40), SlyD (2.08), PpiA (0.95), PpiD (1.11), SurA (1.10), FhlB (0.85), and YaaD (0.77); four are active in the periplasm, and five are active in the cytoplasm. Another relevant chaperone protein is disulfide oxidase (DsbA), which is marginally PHX, with an E(g) value of ≈1.02; it senses misfolded proteins in the periplasm.
Correlations among the fast-growing bacteria for levels of expression of major chaperone genes are generally significantly high (Table , footnote a). However, E. coli and B. subtilis are marginally correlated (0.3). In E. coli, degradation proteins are mostly PHX, but this is not consistently the case for the other fast-growing bacteria. Why are the major chaperone genes so often PHX? Chaperone/degradation proteins are vitally needed both during rapid growth and in stationary phase. In normal cell physiology, these proteins have multiple functions: they contribute decisively in ensuring correct protein folding, in remedying misfolded structures, in directing protein trafficking, and in coordinating protein secretion. Chaperone proteins also contribute to conformational changes and to minimizing protein damage during stress.
| TABLE 6. Predicted expression levels for chaperone/degradation genes among four fast-growing bacteria^a |
(iv) Levels of expression of aminoacyl-tRNA synthetases (Table ). There are 19 PHX tRNA synthetase polypeptides in E. coli, including two subunits of phenylalanyl-tRNA synthetase (PheS-α and PheT-β) and two subunits of glycyl-tRNA synthetase (GlyQ-α and GlyS-β). However, there are only eight in V. cholerae, seven in H. influenzae, and three in B. subtilis. IleS is missing from H. influenzae, and GlnS is missing from B. subtilis, which uses amidotransferase modifications to produce Gln-tRNA^Gln from Glu-tRNA^Glu synthetase. Actually, the GlnS gene is absent from most prokaryotic genomes (14).
Expression level correlations for the tRNA synthetase genes among the three rapidly dividing gram-negative genomes are generally positive but low. On the other hand, the corresponding relationship of B. subtilis with E. coli is uncorrelated (−0.04) and that of B. subtilis with V. cholerae is modestly negatively correlated (−0.24). LysS is the only PHX tRNA synthetase for all four genomes.
There are three aminoacyl-tRNA synthetases in E. coli which occur at only moderate predicted expression levels: CysS, with an E(g) of 0.89; TrpS, with an E(g) of 0.91; and HisS, with an E(g) of 0.74. The average amino acid usage frequencies for E. coli genes correlate positively with the predicted expression levels for tRNA synthetases. Interestingly, the three lowest amino acid usage frequencies in E. coli are for Cys (1.2%), Trp (1.5%), and His (2.3%) (Table ).
| TABLE 8. Relationship between aminoacyl-tRNA synthetase expression levels and amino acid frequencies in E. coli proteins^a |
| TABLE 7. Predicted expression levels for aminoacyl-tRNA synthetase genes among four fast-growing bacteria^a |
(v) Levels of expression of major energy metabolism genes (Table ). Enzymes of major catabolic pathways can be divided into four groups: glycolysis, pyruvate metabolism, the pentose phosphate pathway, and the TCA cycle. The glycolysis genes are predominantly PHX in all four fast-growing bacteria, with very high E(g) values, >2.00, for several of these genes in E. coli. Hexokinase and glucokinase are prominent glycolysis proteins in most eukaryotes, but the former is not found in most prokaryotes, including the four fast-growing bacteria under analysis in this study. Why? In glycolysis, hexokinase converts glucose to glucose-6-phosphate. However, glucose-6-phosphate arises from other hexoses and from glucose transported into the cell via the phosphotransferase system. Perhaps the multiplicity of sources means that glucokinase need not be PHX. Glucokinase occurs in many (but not all) eubacteria, normally at low to moderate E(g) values, 0.3 to 0.8.
The genes for pyruvate dehydrogenase are commonly PHX in the four genomes. The TCA genes are generally PHX in E. coli but generally not PHX in H. influenzae and B. subtilis. In B. subtilis, two TCA genes are PHX and the others cover the range 0.4 to 1.0. Many prominent TCA genes appear to be absent from H. influenzae. Why are TCA genes in B. subtilis mostly not PHX? The TCA cycle, apart from energy (ATP) production, can contribute in myriad ways to cellular needs, especially in making precursors and intermediates to macromolecules, e.g., in amino acid, vitamin, and heme biosyntheses (see Discussion). The order of actions in the TCA cycle is as follows: citrate synthase (GltA; in B. subtilis, there are two versions, designated CitZ and CitA), aconitate hydratase (AcnA/AcnB), isocitrate dehydrogenase (Icd), 2-oxoglutarate dehydrogenase (SucA), succinyl coenzyme A (succinyl-CoA) synthetase (SucD and SucC), succinate dehydrogenase (SdhB, SdhC, and SdhD), fumarate hydratase (FumA, FumB, FumC, or CitG), and malate dehydrogenase (Mdh/CitH). The initial enzymes of the TCA pathway in E. coli are all PHX, with E(g) values ≥1.29, whereas those beyond succinyl-CoA synthetase (except for Mdh) all have E(g) values ≤1.10, and most are not PHX. Apart from the differences in the expression levels among the TCA cycle genes, correlations among genomes for energy metabolism gene expression levels across all four fast-growing bacteria are high, suggesting similar uses for this set of enzymes (Table , footnote a).
Certain gene groups generally not PHX. Specific regulatory proteins and proteins that respond to special demands and are used only infrequently, as in the highly specialized DNA repair processes, are not expected to be PHX. Likewise, specific transcription proteins and DNA replication proteins tend not to be PHX, because the cell assembles few replication machines.
(i) Genomic repair proteins. Table reports predicted expression levels for the main collection of repair proteins for the four genomes. Only two repair proteins of E. coli reach PHX levels: RecA and Ssb (single-stranded DNA binding protein) [E(g) for both, 1.48]. Two other repair proteins are borderline PHX: Dut (deoxyuridine 5′-triphosphate nucleotide hydrolase) and HepA [E(g) = 0.97 and 0.99, respectively]. Other repair proteins have low to moderate predicted expression levels, the E(g) values almost always in the range from 0.35 to 0.80. These evaluations parallel those for D. radiodurans, in which RecA [E(g), 2.04] has a dramatically high predicted expression level and MutT (gene no. DR2358) reaches an E(g) of 1.29, these being the only two proteins qualifying as PHX (22). The other repair proteins of D. radiodurans have E(g) values in the range 0.40 to 0.80.
(ii) Vitamin biosynthesis proteins (Table ). Pathways to the synthesis of vitamins, of which only small amounts are needed to provide adequate cofactor function, have largely low predicted expression levels, with E(g) values of about 0.40 to 0.75. In E. coli, the genes acting in the synthesis of six vitamin cofactors, biotin, thiamine, riboflavin, lipoate, pyridoxal, and cobalamin, were examined. Only RibH, which participates in riboflavin biosynthesis, is PHX in E. coli. Although the enzymes of the biosynthetic pathways are poorly expressed, some of the enzymes that utilize the vitamins as cofactors are highly expressed, for example, biotin carboxylase (a subunit of E. coli acetyl-CoA carboxylase). In B. subtilis, RibE, which is not PHX, in the same pathway forms an oligomer complex with RibH in which the structural union (RibE-RibH) combines 3 units of RibE with 60 units of RibH (23). This anomalous stoichiometry makes it likely that RibH furnishes structural support and, for this reason, is PHX; in this guise, RibH may be used in other capacities. Paradoxically, RibH is not PHX in B. subtilis.
Interestingly, M. tuberculosis features nine PHX proteins among the vitamin biosynthesis pathways. Synechocystis and A. aeolicus each have three PHX vitamin biosynthesis genes, Borrelia burgdorferi has one, Archaeoglobus fulgidus has two, T. pallidum has one, and D. radiodurans has one. The biotin carboxylase protein is PHX in the E. coli, H. influenzae, V. cholerae, Helicobacter pylori, Synechocystis, Chlamydia trachomatis, and A. fulgidus genomes.
(iii) Genes of signal transduction pathways. In Table 8 of reference 21, the predicted expression levels for several two-component sensor genes (histidine kinases) of E. coli and B. subtilis are reported. In all of those examples, the predicted expression levels were low, the E(g) values ranging from 0.30 to 0.70.
One particular example is the Cpx regulon of the sensor kinase/phosphatase periplasmic family, which encompasses the genes encoding CpxA and CpxR (components of a histidine kinase), CpxP (down regulates the Cpx pathway), and NlpE (membrane lipoprotein), believed to eliminate abnormal proteins in the periplasm and to recover amino acids during nitrogen starvation (32). These proteins regulate a hierarchy of σ factors, including σ^32 and σ^E, active in autoregulation and repression. The predicted expression levels are low [for CpxA, E(g) = 0.70; for CpxR, E(g) = 0.57; for CpxP, E(g) = 0.62; and for NlpE, E(g) = 0.61], as is common with specific regulatory proteins. Cpx is a sensor kinase acting in the periplasm. The Cpx pathway apparently also monitors pilus assembly during infection of tissues by uropathogenic E. coli (17).
(iv) Principal starvation genes of E. coli and their predicted levels of expression (Table ). The genes shown in Table are associated with starvation states, as discussed in the review (26). Three genes in this category are PHX: dps, also labeled pexB [E(g), 1.13], which provides protection from oxidative radicals; rpoH, which encodes σ^32 [E(g), 1.46]; and the survival protein, SurA [E(g), 1.10], a chaperone which is a member of the PPIase family. We expect these proteins, by virtue of their codon usage patterns, to be capable of high levels of expression, especially when induced by starvation. Other starvation proteins (Table ) have low to moderate E(g) values. The σ^E factor, which regulates the activity of other periplasmic proteins, is not PHX, and the same is true for σ^54 and σ^38, which respond to nitrogen and carbon starvation, respectively. However, σ^32 (rpoH), the principal chaperone sigma factor, pervasively registers as PHX, presumably to establish high levels of chaperone production.
| TABLE 12. Genes induced under starvation conditions in E. coli |
Homologous PHX genes among the fast-growing bacteria. Table compares the numbers of homologous PHX gene families among the four rapidly dividing bacteria. There are 60 gene families common to the four fast-growing bacteria, with each member PHX. Thirty-two of these are families of RP genes, eight are families of TF genes, and nine are families of genes essential for energy metabolism. Twenty-three gene families distinguish E. coli with PHX representatives, but these are not PHX in the other three fast growers, including five CH genes and five TF genes.
| TABLE 13. Families of homologous genes among the four fast-growing bacteria with at least one PHX gene |
E. coli and V. cholerae share 124 homologous genes that are both PHX and in total 236 homologous genes with one or both genes being PHX; the respective values for E. coli and H. influenzae are 105 and 226, and the values for V. cholerae and H. influenzae are 94 and 156. Paired PHX genes between fast-growing bacteria and non-fast-growing bacteria are fewer in numbers (Table ). Of homologous genes among genomes with at least one PHX gene, the expression levels for E. coli versus archaeal genomes and E. coli versus H. pylori and M. genitalium genomes are uncorrelated or negatively correlated (Table ). Similarly, V. cholerae, H. influenzae, and B. subtilis expression levels correlate negatively with homologous genes of archaeal genomes, possibly reflecting differences in lifestyles, habitats, and energy sources.
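To put the pair counts above on a common scale, the fraction of homologous pairs in which both genes are PHX, out of all pairs with at least one PHX gene, can be computed directly. The sketch uses only the counts quoted in the text; the names are illustrative.

```python
# (pairs with both genes PHX, pairs with one or both genes PHX), from the text.
pair_counts = {
    ("E. coli", "V. cholerae"): (124, 236),
    ("E. coli", "H. influenzae"): (105, 226),
    ("V. cholerae", "H. influenzae"): (94, 156),
}


def both_phx_fraction(both: int, one_or_both: int) -> float:
    """Share of PHX-containing homologous pairs in which both members are PHX."""
    return both / one_or_both


shared = {pair: both_phx_fraction(*counts) for pair, counts in pair_counts.items()}

for (a, b), frac in shared.items():
    print(f"{a} vs {b}: {frac:.1%} of pairs are PHX in both genomes")
```

By this measure, roughly 53%, 46%, and 60% of the PHX-containing pairs are PHX in both genomes for the three comparisons, respectively.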
| TABLE 14. Numbers of pairs of homologous^a genes with one or both genes PHX and correlations between their E(g) values^b |
Codon usages along the gene and expression levels. For relatively long genes (≥600 codons long), we determined expression levels with the gene length divided into three equal parts (5′, middle, and 3′ parts). The pairwise correlations among the three parts of the E. coli genes are high (0.86, 0.85, and 0.88), indicating that expression levels calculated from codon biases are effectively the same for the three parts of genes.
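The comparison of gene thirds can be sketched as two steps: split each long gene into 5′, middle, and 3′ parts, then correlate the per-part expression values across genes. The code below is a minimal illustration under that reading. The per-part expression values are hypothetical placeholders, and the text does not say how codons left over after division by three were handled, so assigning them to the 3′ part is an assumption.

```python
from math import sqrt
from typing import List, Sequence, Tuple


def split_into_thirds(codons: Sequence[str]) -> Tuple[Sequence[str], Sequence[str], Sequence[str]]:
    """Split a codon list into 5', middle, and 3' parts.

    The first two parts have len(codons) // 3 codons each; any remainder
    (at most two codons) goes to the 3' part (an assumption).
    """
    third = len(codons) // 3
    return codons[:third], codons[third:2 * third], codons[2 * third:]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical per-part expression values for four long genes (illustration only).
five_prime: List[float] = [1.9, 0.6, 1.2, 0.8]
middle: List[float] = [2.1, 0.5, 1.1, 0.9]
r_5p_mid = pearson(five_prime, middle)
```

With real data, each list would hold one value per gene of ≥600 codons, computed separately on the corresponding third, and the three pairwise coefficients would be compared with those reported above.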
Independent of gene size, we observed (20) that the middle and 3′ end of the genes show quite similar codon frequencies, whereas the 5′ third-codon ensemble possesses somewhat different codon frequencies. This finding may reflect differences in translation initiation versus later stages of translation elongation. A prominent example concerns encoding of arginine with major codons (CGN) versus minor codons (AGR). The AGR codons are scarce in E. coli genes and are restricted mostly to the 5′ end of the genes (especially to the initial 30 bp), whereas CGN codons are preferred elsewhere in the genes (6).
PHX ORFs shared by the four fast-growing genomes. Genes are considered homologous if their SSPA (significant segment pair alignment) score (percent similarity; see reference 5) is ≥40%. Examples include three ORFs (yaaH, yajC, and yeeX) common to E. coli and V. cholerae, three similar ORFs (yfiD, yjjK, and yebC) present in the genomes of E. coli, V. cholerae, and H. influenzae, respectively, and one ORF (ybaB) common to E. coli and B. subtilis. These PHX genes of unknown function offer attractive candidates for mutagenesis and knockout studies to determine their functions.
Distributions of PHX genes over the chromosomes. Clusters of PHX genes are displayed in Table . Statistical significance was assessed using the r-scan analysis protocol described elsewhere (18).
The PHX genes in each cluster generally possess the same transcription orientation, mostly that of the leading strand. However, E. coli features the PHX fumarate reductase operon genes (kb 4380 → 4376) frdD, frdB, and frdA untypically located in the lagging strand (the direction of transcription is indicated by the arrow). The genes encoding the principal units of NADH dehydrogenase I, N, L, I, G, F, and C cover positions 2402 → 2387 (about a 5-kb extent) on the leading strand.
The PHX gene clusters of E. coli, apart from the segments at kb 450 → 447 and kb 4380 → 4376 of the cytochrome o ubiquinol oxidase operon and the fumarate reductase operon, respectively, are all located in the leading strand. Note that the two RP clusters near oriC (kb 3476 → 3437 and kb 4174 → 4183) include a number of TF genes and some PPIase genes. There are no extended intervals devoid of PHX genes in the E. coli genome.
The V. cholerae large chromosome contains two significantly long segments, at kb 43 to 327 and kb 1657 to 1985, each devoid of PHX genes and positioned antipodal in the chromosome. The main PHX clusters correspond to long RP operons located in the leading strand. These descriptions indicate that PHX genes are irregularly distributed in the V. cholerae chromosomes. The V. cholerae genome has two chromosomes (chromosome I, 2.96 Mb, and chromosome II, 1.07 Mb) containing 138 PHX genes and 14 PHX genes, respectively. The PHX genes in the large chromosome comprise 7% of its genes. V. cholerae has a single PHX RP gene on chromosome II.
In H. influenzae, the PHX clusters are of RP genes and protein synthesis genes.
B. subtilis contains a PHX cluster which features a conglomerate of 27 RP genes (kb 118 → 154) intermeshed with the protein synthesis genes rpoB, rpoC, fus, tuf, and rpoA. A compact operon of PHX genes distinguishes five glycolysis genes (kb 3482 → 3475), enolase (eno), phosphoglycerate mutase (pgm), triosephosphate isomerase (tpi), phosphoglycerate kinase (pgk), and glyceraldehyde-3-phosphate dehydrogenase (gap), located in the leading strand. The cluster at kb 3475 → 3482 ostensibly renders the main glycolysis genes highly efficient, putatively making it less important to express many respiration genes. All clusters are located in the leading strand. B. subtilis also has a 245-kb stretch devoid of PHX genes, at kb 35 to 280.
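Stretches devoid of PHX genes, such as those noted for V. cholerae and B. subtilis, can be located by sorting PHX gene positions and measuring the gaps between neighbors, including the gap that wraps around the circular chromosome. The sketch below does only that; it is not the r-scan significance test cited above. The positions and chromosome length in the example are made up, chosen so that the longest gap mirrors the kb 35 to 280 stretch described for B. subtilis.

```python
from typing import List, Tuple


def phx_free_gaps(positions_kb: List[float],
                  chromosome_kb: float) -> List[Tuple[float, float, float]]:
    """Gaps between consecutive PHX gene positions on a circular chromosome.

    Returns (start, end, length) tuples sorted from longest to shortest.
    The wrap-around gap runs from the last position through the coordinate
    origin to the first position.
    """
    if not positions_kb:
        return [(0.0, chromosome_kb, chromosome_kb)]
    pts = sorted(positions_kb)
    gaps = [(left, right, right - left) for left, right in zip(pts, pts[1:])]
    gaps.append((pts[-1], pts[0], chromosome_kb - pts[-1] + pts[0]))
    return sorted(gaps, key=lambda gap: gap[2], reverse=True)


# Toy example: hypothetical PHX gene positions (kb) on a 500-kb circle.
toy_positions = [10, 35, 280, 300, 400]
longest = phx_free_gaps(toy_positions, 500)[0]
print(longest)
```

In the toy data the longest PHX-free stretch runs from kb 35 to kb 280, a length of 245 kb. Whether a gap of that size is statistically unusual depends on the number of PHX genes and the chromosome length, which is the question the r-scan analysis addresses.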