The genome sequence of the solvent-producing bacterium Clostridium acetobutylicum ATCC 824 has been determined by the shotgun approach. The genome consists of a 3.94-Mb chromosome and a 192-kb megaplasmid that contains the majority of genes responsible for solvent production. Comparison of C. acetobutylicum to Bacillus subtilis reveals significant local conservation of gene order, which has not been seen in comparisons of other genomes with similar, or, in some cases closer, phylogenetic proximity. This conservation allows the prediction of many previously undetected operons in both bacteria. However, the C. acetobutylicum genome also contains a significant number of predicted operons that are shared with distantly related bacteria and archaea but not with B. subtilis. Phylogenetic analysis is compatible with the dissemination of such operons by horizontal transfer. The enzymes of the solventogenesis pathway and of the cellulosome of C. acetobutylicum comprise a new set of metabolic capacities not previously represented in the collection of complete genomes. These enzymes show a complex pattern of evolutionary affinities, emphasizing the role of lateral gene exchange in the evolution of the unique metabolic profile of the bacterium. Many of the sporulation genes identified in B. subtilis are missing in C. acetobutylicum, which suggests major differences in the sporulation process. Thus, comparative analysis reveals both significant conservation of the genome organization and pronounced differences in many systems that reflect unique adaptive strategies of the two gram-positive bacteria.
The Clostridia are a diverse group of gram-positive, rod-shaped anaerobes that include several toxin-producing pathogens (notably Clostridium difficile, Clostridium botulinum, Clostridium tetani, and Clostridium perfringens) and a large number of terrestrial species that produce acetone, butanol, ethanol, isopropanol, and organic acids through fermentation of a variety of carbon sources (38, 72, 73, 86). Isolates of Clostridium acetobutylicum were first identified between 1912 and 1914, and these were used to develop an industrial starch-based acetone, butanol, and ethanol (ABE) fermentation process, to produce acetone for gunpowder production, by Chaim Weizmann during World War I (13, 34, 82, 87). During the 1920s and 1930s, increased demand for butanol led to the establishment of large fermentation factories and a more efficient molasses-based process (20, 34). However, the establishment of more cost-effective petrochemical processes during the 1950s led to the abandonment of the ABE process in all but a few countries. The rise in oil prices during the 1970s stimulated renewed interest in the ABE process and in the genetic manipulation of C. acetobutylicum and related species to improve the yield and purity of solvents from a broader range of fermentation substrates (52, 59, 87). This has developed into an active research area over the past two decades.
The type strain, Clostridium acetobutylicum ATCC 824, was isolated in 1924 from garden soil in Connecticut (83) and is one of the best-studied solventogenic clostridia. Strain relationships among solventogenic clostridia have been analyzed (11, 32, 33), and the ATCC 824 strain was shown to be closely related to the historical Weizmann strain. The ATCC 824 strain has been characterized from a physiological point of view and used in a variety of molecular biology and metabolic engineering studies in the United States and in Europe (3, 14, 22–24, 47, 56, 57, 79). This strain is known to utilize a broad range of monosaccharides, disaccharides, starches, and other substrates, such as inulin, pectin, whey, and xylan, but not crystalline cellulose (5, 6, 42, 52, 53). Physical mapping of the genome demonstrated that this strain has a 4-Mb chromosome with 11 ribosomal operons (9) and harbors a large plasmid, about 200 kb in size, which carries the genes involved in solvent formation, hence the name pSOL1 (10). Much work has been done to elucidate the metabolic pathways by which solvents are produced and to isolate solvent-tolerant or solvent-overproducing strains (8, 21, 35, 62, 69, 71, 80). Genetic systems have been developed that allow genes to be manipulated in C. acetobutylicum ATCC 824 and related organisms (25, 48–52, 84), and these have been used to develop modified strains with altered solventogenic properties (25, 28, 54, 60).
Knowledge of the complete genome sequence of C. acetobutylicum ATCC 824 is expected to facilitate the further design and optimization of genetic engineering tools and the subsequent development of novel, industrially useful organisms. The sequence also offers the opportunity to compare two moderately related, gram-positive bacterial genomes (C. acetobutylicum and Bacillus subtilis) and to examine the gene repertoire of a mesophilic anaerobe with metabolic capacities that were not previously represented in the collection of complete genomes.
The genome of C. acetobutylicum ATCC 824 was sequenced by the whole genome shotgun approach (18), using a combination of fluorescence-based and multiplex sequencing approaches (70). The finishing phase involved exhaustive gap closure and quality enhancement work using a variety of biochemical methods and computational tools. Clones from a plasmid library made with randomly sheared 2.0- to 2.5-kb inserts were sequenced from both ends. The sequences were preprocessed and base called with Phred (15), and low-quality reads were removed (multiplex or short-run dye terminator reads with fewer than 100 Phred Q-30 bases [error rate of 10^-3], and long-run dye terminator reads with fewer than 175 Q-30 bases). This resulted in 4.9 Mb of multiplex reads and 21.3 Mb of ABI dye-terminator reads (8.3-fold sequence coverage; 51,624 reads in all). The data were assembled using Phrap (University of Washington; http://bozeman.mbt.washington.edu/phrap.docs/phrap.html), which produced 551 contigs spanning a total of 4.03 Mb. A total of 0.76-fold coverage in paired reads from lambda clones was generated from two genomic lambda libraries (one provided by G. Bennett and one constructed at GTC). These data, together with data from primer-directed sequence walks across all captured gaps (sequence gaps with a bridging clone insert), and second-attempt sequences corresponding to missing mates at the ends of the contigs were reassembled with the original shotgun data to produce a final Phrap assembly. This assembly contained 108 contigs and 88 supercontigs. Further primer-directed sequencing efforts, using plasmid and PCR-generated templates, resulted in the eventual closure of the remaining captured gaps.
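The read-filtering rule described above can be illustrated with a minimal sketch. It is not the pipeline used in the project: the function names and the read-type labels are hypothetical, and only the thresholds (100 and 175 Q-30 bases) and the standard Phred relation Q = -10 log10(P) are taken as given.

```python
def phred_to_error(q: float) -> float:
    """Convert a Phred quality score to an error probability (Q = -10 log10 P)."""
    return 10 ** (-q / 10)


def count_high_quality(quals, min_q=30):
    """Count the bases in a read whose Phred score is at least min_q."""
    return sum(1 for q in quals if q >= min_q)


# Minimum number of Q-30 bases per read type, as stated in the text.
# The dictionary keys are illustrative labels, not names from the source.
MIN_Q30_BASES = {
    "multiplex": 100,
    "short_dye_terminator": 100,
    "long_dye_terminator": 175,
}


def keep_read(quals, read_type):
    """Keep a read unless it has fewer Q-30 bases than its type requires."""
    return count_high_quality(quals) >= MIN_Q30_BASES[read_type]
```

Under this reading, a Q-30 base corresponds to an expected error rate of about 1 in 1,000, which matches the bracketed error rate in the text.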
Uncaptured gaps were closed using one of the following methods. The lambda libraries were screened with PCR products designed from the ends of contigs and labeled during the amplification process with digoxigenin. Positive clones from the chemiluminescence screening (Boehringer Mannheim kit) were sequenced from both ends and used as templates for additional primer walks. This resulted in 28 contig joins. Direct genomic sequencing was used to walk into gaps wherever unique primers could be specified near the end of a contig. Primers were identified using GTC's PrimerPicker software and were matched back to the genomic assembly using cross_match (University of Washington; http://bozeman.mbt.washington.edu/phrap.docs/phrap.html). Unique primers were added to 2X Big Dye reactions with 2.5 μg of total genomic DNA; these reactions yielded an 85% success rate with average Q-30 scores of 190. This procedure resulted in 13 contig joins, and multiple walks were performed in many cases. Combinatorial PCR was also used. Initially, a matrix of PCR primers representing all possible combinations in pools of 10 was used to reduce the number of PCRs that had to be performed. This was followed with a “2 × 2” approach using all possible combinations of pairs from the ends of the remaining contigs. Wherever products were observed, the primers contained in the original reaction would be used in combinations of 2 to determine which contig ends belonged together. These primers were then used to amplify the genomic DNA bridging the gap, and the products were used for primer walks. This procedure proved successful in bridging and closing the remaining gaps. The contigs that constituted pSOL1 were identified and linked at an early stage in the project; further work allowed us to produce a finished sequence for the plasmid of 192,000 bp.
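The saving from pooling primers in combinatorial PCR can be made concrete with a short calculation. This is only a sketch under assumptions: the text does not say how the pools were paired, so the code assumes that every pool is tested against every other pool and once against itself. The function names are hypothetical.

```python
from math import ceil, comb


def pairwise_reactions(n_primers: int) -> int:
    """Reactions needed to test every pair of contig-end primers individually."""
    return comb(n_primers, 2)


def pooled_reactions(n_primers: int, pool_size: int = 10) -> int:
    """Reactions needed when primers are first grouped into pools.

    Assumption: each pool is combined with every other pool, plus one
    reaction per pool on its own to cover pairs that fall inside a pool.
    """
    n_pools = ceil(n_primers / pool_size)
    return comb(n_pools, 2) + n_pools
```

For example, 100 contig-end primers would need 4,950 individual pairwise reactions but only 55 pooled reactions under these assumptions; each positive pool is then resolved with pairwise reactions among its primers.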
Sequence reads from the above efforts were incorporated into the contigs by means of the custom GTC incremental assembly tools: Inc_Asm, Contig_Merge, CM_calc, CM_auto, and Update_Overlaps. Misassemblies were identified through aberrant coverage or clone tiling and by inappropriate juxtaposition of restriction sites compared to the physical map (9). Each case was successfully resolved using PCR and sequence confirmation. The genome contained 11 rDNA operons, 6 of which occurred in two triplets, approximately 18 kb in length. Each ribosomal operon was independently amplified by PCR utilizing the flanking unique sequences, and the resulting products were sequenced. The operons were then incorporated into the genome assembly at the correct positions. Assembly of the final 13 contigs had to be done manually because of the repetitive elements and the limitations of Phrap and Contig_Merge.
The genome sequence was screened for regions of low sequence quality, and 2,883 ‘quality gaps’ were identified. Of these, 2,769 were improved by resequencing of the plasmid template with an alternate chemistry (e.g., energy-transfer dye primer; AP Biotech, Piscataway, N.J.). The remaining quality gaps were improved by means of primer walks. Based on the consensus quality scores generated by Phrap and Contig_Merge, and on the results of systematic quality checks on the lower-quality regions in the final contigs (when it was no longer possible to use the assembly tools because of repeats), we estimate the overall error rate to be substantially less than 1 error in 10,000 bases.
The genome was analyzed and annotated in context with a large number of finished bacterial and archaeal genomes. Custom Perl scripts were used to automate the execution of similarity search algorithms, and additional scripts were used to filter the results and to create tab-delimited tables and Web pages to summarize the most biologically and functionally relevant information. The program uniorf (a wrapper around ExractOrfs5; GTC) was used to identify open reading frames (ORFs). The coding ORFs were identified using one or more of the three criteria: significant BLASTP2 hit, C. acetobutylicum-specific dicodon usage, or a length of ≥400 residues. Start codons were predicted by their proximity to ribosome binding sequences (67) and by compatibility with BLAST alignment data that minimized or eliminated overlaps. The predicted protein sequences were individually analyzed using sensitive profile-based methods for database searching, including PSI-BLAST (1, 2), IMPALA (64), and SMART (65, 66). All potential frameshifts identified during the analysis phase were investigated in the final sequence assembly. Corrections were made in every case where a probable sequence error could account for the apparent frameshift. In a few cases, genomic PCR amplification and product sequencing was undertaken to evaluate the potential frameshifts. The program tRNAscan was used to identify tRNA genes.
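The three-criterion rule for calling an ORF coding can be written as a small predicate. This is an illustrative sketch, not the uniorf implementation: the class and field names are hypothetical, and how a BLASTP hit is judged significant or dicodon usage is scored is left to the caller because the text does not specify it.

```python
from dataclasses import dataclass

# Length criterion stated in the text.
MIN_LENGTH_RESIDUES = 400


@dataclass
class OrfEvidence:
    """Evidence collected for one candidate ORF (field names are illustrative)."""
    length_residues: int
    has_significant_blastp_hit: bool
    has_specific_dicodon_usage: bool


def is_coding(orf: OrfEvidence) -> bool:
    """An ORF is treated as coding if at least one of the three criteria holds."""
    return (
        orf.has_significant_blastp_hit
        or orf.has_specific_dicodon_usage
        or orf.length_residues >= MIN_LENGTH_RESIDUES
    )
```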
Paralogous families of proteins were identified by comparing the complete set of predicted C. acetobutylicum proteins to itself (after filtering for low-complexity regions with the SEG program [88]) using the PSI-BLAST program, which was run for three iterations, and clustering proteins by single linkage (clustering threshold e value, 0.001) using the GROUPER program (81). Assignment of predicted proteins to clusters of orthologous groups (COGs) (78) was based on the results of the COGNITOR program (78), with manual verification. The functional assignments embedded in the COG database were also used for reconstruction of metabolic pathways and other functional systems in C. acetobutylicum in conjunction with the KEGG (37) and WIT (55) databases. Analysis of the phyletic distribution of the database hits reported by the BLASTP program was performed using the TAX_COLLECTOR program of the SEALS package (81). This was followed by phylogenetic tree construction for selected individual cases. Multiple alignments for phylogenetic reconstruction were generated using the ClustalW program (29) and, when necessary, further adjusted on the basis of the PSI-BLAST search outputs. Phylogenetic trees were constructed using the neighbor-joining method with 1,000 bootstrap replications as implemented in the NEIGHBOR program of the PHYLIP package (16). Evolutionary distance matrices for neighbor-joining tree construction were generated using the PROTDIST program of the PHYLIP package, with Kimura's correction for multiple substitutions.
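Single-linkage clustering at an e-value threshold can be sketched with a union-find structure. This is a generic illustration and not the GROUPER program: the function name and input format are hypothetical, and treating a hit at exactly the threshold as a link is an assumption.

```python
def single_linkage(proteins, hits, threshold=1e-3):
    """Cluster proteins by single linkage.

    proteins: iterable of protein identifiers.
    hits: iterable of (protein_a, protein_b, e_value) tuples from a
          self-comparison of the proteome.
    Two proteins share a cluster if a chain of hits with
    e_value <= threshold connects them.
    """
    proteins = list(proteins)
    parent = {p: p for p in proteins}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, e_value in hits:
        if e_value <= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), set()).add(p)
    # Largest clusters first; ties broken alphabetically for a stable order.
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))
```

Because linkage is transitive, two proteins with no direct significant hit still fall into one family when a third protein links them, which is the defining property (and the main caveat) of single-linkage clustering.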
The sequence of the C. acetobutylicum strain ATCC 824 genome is available in GenBank under the accession number AE001437, and that of the megaplasmid pSOL1 is available under accession number AE001438. Graphical representations of the genome with detailed annotation are available at http://www.ncbi.nlm.nih.gov and http://www.genomecorp.com/programs/sequence_data_clost.shtml.
The C. acetobutylicum ATCC 824 genome consists of 3,940,880 bp. Genes are distributed fairly evenly, with ~51.5% being transcribed from the forward strand and ~49.5% from the complementary strand. A total of 3,740 polypeptide-encoding ORFs and 107 stable RNA genes were identified, accounting for 88% of the genomic DNA, with intergenic regions averaging ~121 bp. A putative replication origin (base 1) and terminus were identified by GC and AT skew analysis (45); the origin marks a strong inflection point in the coding strand and contains several DnaA boxes, as well as gyrA, gyrB, and dnaA genes that are adjacent to the replication origin in many other bacteria. Another strong inflection in the coding strand occurs at the diametrically opposed putative replication terminus (reminiscent of the Mycoplasma genitalium genome [19]) (Fig. 1). The 11 ribosomal operons are clustered in general proximity to the origin of replication and are all oriented in the same direction as the leading replication fork. The megaplasmid, pSOL1, consists of 192,000 bp and appears to encode 178 polypeptides. The single obvious skew inflection was placed at the origin (base 1), although there is no other support for a replication origin at this position (a repA homolog resides ~2.2 kb away). In contrast to the genome, there is no obvious coding strand bias in the plasmid.
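GC skew analysis of the kind mentioned above can be sketched as follows. This is a generic illustration, not the procedure of reference 45: the window size, the function names, and the use of non-overlapping windows are assumptions. In many bacterial chromosomes the minimum of the cumulative curve is taken as a candidate origin and the maximum as a candidate terminus.

```python
def cumulative_gc_skew(seq: str, window: int):
    """Cumulative GC skew, (G - C) / (G + C), over non-overlapping windows."""
    seq = seq.upper()
    total = 0.0
    curve = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # A window without G or C contributes no skew.
        total += (g - c) / (g + c) if (g + c) else 0.0
        curve.append(total)
    return curve


def skew_extremes(seq: str, window: int):
    """Return (index of minimum, index of maximum) of the cumulative curve.

    Indices refer to windows; multiply by the window size for a coordinate.
    """
    curve = cumulative_gc_skew(seq, window)
    min_index = min(range(len(curve)), key=curve.__getitem__)
    max_index = max(range(len(curve)), key=curve.__getitem__)
    return min_index, max_index
```

Because the chromosome is circular, the position of the extremes depends on where the linear sequence starts, so in practice the curve is usually inspected together with other evidence such as DnaA boxes.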
There appear to be two unrelated cryptic prophages in the genome. The first of these spans approximately 90 kb and includes approximately 85 genes (CAC1113 to CAC1197), with 11 phage-related genes, 3 XerC and XerD recombinase-related genes, and a number of DNA processing enzymes. This region contains a strong coding-strand inflection point near its center and has lower-than-average GC content. The second apparent prophage appears to span approximately 60 kb and displays similar coding characteristics in approximately 79 genes (CAC1878 to CAC1957; slightly higher than average in GC content). Genes for three distinct insertion sequence-related proteins (CAC0248, CAC3531, and CAC0656-57) are present on the chromosome. Only one of these is intact; another is a fragment, and the third has a frameshift. Another frameshifted gene coding for a TnpA-related transposase resides on pSOL1 (CAP0095-96). Thus, it appears that there are no active insertion sequence elements in the C. acetobutylicum genome.
There are 73 tRNA genes. The isoleucine tRNA could not be identified using standard search methods; this correlates with the displacement of the typical bacterial form of isoleucyl-tRNA with the eukaryotic version, although for other similarly displaced aminoacyl-tRNA synthetases (see below), the cognate tRNAs were readily identified.
The genome of C. acetobutylicum provides us with at least two unique opportunities: (i) to compare, for the first time, two large and moderately related gram-positive bacterial genomes, those of C. acetobutylicum and B. subtilis (41); and (ii) to investigate the genes that underlie the diverse set of metabolic capabilities so far not represented in the collection of complete genomes.
The median level of sequence similarity (26) between probable orthologs in C. acetobutylicum and B. subtilis was greater than between C. acetobutylicum and any other bacterium, but only by a rather small margin, indicating significant divergence (Table 1). Compared to the other pairs of evolutionarily relatively close genomes, the Clostridium-Bacillus pair is more distant than the species within the gamma-proteobacterial lineage (Escherichia coli, Haemophilus influenzae, Vibrio cholerae, and Pseudomonas aeruginosa) or Helicobacter pylori and Campylobacter jejuni; in contrast, the level of divergence between C. acetobutylicum and B. subtilis is comparable to that between the two spirochetes, Treponema pallidum and Borrelia burgdorferi (Table 1). The comparative analysis of the spirochete genomes has proved to be highly informative for elucidating the functions of many of their genes and predicting previously undetected aspects of the physiology of these pathogens (76).
A taxonomic breakdown of the closest homologs for the C. acetobutylicum proteins immediately reveals the specific relationship with the low-GC gram-positive bacteria, with the reliable best hits for 31% of the C. acetobutylicum protein sequences being to this bacterial lineage (Fig. 2). However, nearly as many proteins produced clear best hits to homologs from other taxa (Fig. 2), which emphasizes the likely major role for lateral gene transfer, a hallmark of microbial evolution.
The same trends appear even more notable when the genome organizations of C. acetobutylicum and other bacteria are compared. Gene order is, in general, poorly conserved in the bacteria, with no extended synteny detected even among relatively close genomes, such as those of E. coli and P. aeruginosa or H. influenzae. In contrast, a genomic dot plot comparison of C. acetobutylicum with B. subtilis revealed several regions of colinearity (Fig. 3A and B). Thus, at least some bacterial genomes separated by a moderate evolutionary distance, as exemplified by C. acetobutylicum and B. subtilis, appear to retain the memory of parts of the ancestral gene order. A systematic mapping of conserved gene strings (many of which form known or predicted operons) on the C. acetobutylicum genome shows the clear preponderance of gene clusters shared with B. subtilis but also considerable complementary coverage by conserved operons from other bacterial and even archaeal genomes (Fig. 3C; see supplementary material at ftp://ncbi.nlm.nih.gov/pub/koonin/Clostridium). Altogether, 1,243 Clostridium genes (32% of the total predicted number of genes and 40% of the genes with detectable homologs) belong to conserved gene strings; 779 of these are in 271 predicted operons shared with B. subtilis (Fig. 3C; see supplementary material at ftp://ncbi.nlm.nih.gov/pub/koonin/Clostridium).
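The idea of a conserved gene string can be illustrated with a simplified sketch. It is not the method used for Fig. 3C: it assumes that each gene has already been assigned a unique ortholog label shared between the two genomes, ignores strand and small insertions, and uses hypothetical function and parameter names.

```python
def conserved_gene_strings(order_a, order_b, min_len=2):
    """Find runs of genes that are adjacent in genome A and also in genome B.

    order_a, order_b: lists of ortholog labels in chromosomal order,
    with each label occurring at most once per genome (an assumption).
    Returns the runs, in genome A order, that contain at least min_len genes.
    """
    position_b = {gene: i for i, gene in enumerate(order_b)}
    runs, current = [], []

    def flush():
        if len(current) >= min_len:
            runs.append(list(current))
        current.clear()

    for gene in order_a:
        if gene not in position_b:
            # No ortholog in genome B: any open run ends here.
            flush()
            continue
        if current and abs(position_b[gene] - position_b[current[-1]]) == 1:
            current.append(gene)  # still neighbours in genome B
        else:
            flush()
            current.append(gene)
    flush()
    return runs
```

In the example used in the test, the ribosomal protein genes form one string and the two nif genes a second, inverted one, even though the two strings are rearranged relative to each other; this mirrors the pattern of local conservation without long-range synteny.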
The genome region that shows the greatest level of gene order conservation between C. acetobutylicum and B. subtilis spans ~200 genes and consists primarily of (predicted) operons encoding central cellular functions, such as translation and transcription (Fig. 3C). The multiple genome alignment for this region clearly shows numerous rearrangements of gene clusters, with large-scale colinearity seen only between C. acetobutylicum and B. subtilis. The intermediate conservation of gene order seen between C. acetobutylicum and B. subtilis is likely to be particularly informative in terms of complementing functional predictions based on direct sequence conservation. For example, the predicted large “superoperon,” which contains genes for several components of the translation machinery (def, encoding N-formylmethionyl-tRNA deformylase; fmt, encoding methionyl-tRNA formyl transferase; and fmu, encoding a predicted rRNA methylase), transcription, and replication, additionally includes the genes yloO (CAC1727), yloP (CAC1728), and yloQ (CAC1729). These genes encode a predicted protein phosphatase, a serine-threonine protein kinase, and a GTPase, respectively. Based on the operon context, the readily testable predictions can be made that yloQ is a previously uncharacterized translation factor, whereas yloO and yloP are likely to play a role in the regulation of translation and/or transcription.
The mosaic picture of operon conservation can be explained by a combination of the processes of horizontal operon transfer, gene (operon) loss, and operon disruption (rearrangement). Distinguishing between these phenomena is, in many cases, difficult, but in certain extreme situations, one of the evolutionary routes is clearly preferable. A striking example is the conservation of the nitrogen fixation operon (six genes in a row) between C. acetobutylicum and another nitrogen fixator, the archaeon Methanobacterium thermoautotrophicum (Fig. 4A). This particular gene organization so far has not been seen in any other genome except for that of another clostridial species, C. pasteurianum, in which, interestingly, two genes of the operon are deleted (Fig. 4A). Similarly, the aromatic amino acid biosynthesis operon is conserved, albeit with local rearrangements, in C. acetobutylicum, Thermotoga maritima, and partially in Chlamydia (Fig. 4B). In these and similar cases, it is hard to imagine an evolutionary scenario that does not involve horizontal mobility of these operons, along with operon disruption in some of the bacterial and archaeal lineages.
In general, C. acetobutylicum carries the typical complement of genes that are conserved in most bacteria. The only gene that is present in all other bacteria (and, in fact, in all genomes sequenced to date) but is missing in C. acetobutylicum is that for thymidylate kinase.
A differential genome display analysis for C. acetobutylicum and B. subtilis, which was performed using the COG system (78), revealed 186 conserved protein families (COGs) that are represented in C. acetobutylicum but not in B. subtilis. Many of these proteins are involved in redox chains that are characteristic of the anaerobic metabolism of Clostridia as opposed to the aerobic metabolism of B. subtilis, as well as oxidation and reduction that are required for assimilation of nitrogen and hydrogen. Another group of enzymes belongs to biosynthetic pathways that are present in C. acetobutylicum but not in B. subtilis, primarily those for certain coenzymes, for example, cyanocobalamin (see supplementary material at ftp://ncbi.nlm.nih.gov/pub/koonin/Clostridium). Conversely, 335 COGs were detected in which B. subtilis was represented, whereas C. acetobutylicum was not. An obvious part of this set consists of genes coding for components of aerobic redox chains, such as cytochromes and proteins involved in the assembly of cytochrome complexes. Also missing are a variety of membrane transporters and the glycine cleavage system, which is present in the majority of bacteria. Several metabolic pathways are incomplete; for example, a considerable part of the tricarboxylic acid (TCA) cycle and molybdopterin biosynthesis is missing. The TCA cycle is incomplete in many prokaryotes, but in most of these cases, the chain of reactions producing three key precursors, 2-oxoglutarate, succinyl-CoA, and fumarate, can proceed in either the oxidative or the reductive direction (30). In C. acetobutylicum, citrate synthase, aconitase, and isocitrate dehydrogenase are missing. It appears, however, that what remains of the TCA cycle could function in the reductive (counterclockwise in Fig. 5) direction. The counterparts of enzymes involved in succinyl-CoA and 2-oxoglutarate formation in other organisms are missing in C. acetobutylicum. However, the genome encodes an acetoacetyl:acyl CoA-transferase that catalyzes butyryl-CoA formation in solventogenesis (CAP0163-0164) and might also utilize succinate for the synthesis of succinyl-CoA, as well as a 2-oxoacid:ferredoxin oxidoreductase (CAC2458-2459) that could catalyze 2-oxoglutarate formation from succinyl-CoA (Fig. 5). Succinate dehydrogenase/fumarate reductase, the enzyme that normally catalyzes the reduction of fumarate to succinate, seems to be missing in C. acetobutylicum. However, this reaction is linked to the electron transfer chain and might be supported by another dehydrogenase whose identity could not be easily determined.
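At its core, the differential genome display comparison is a pair of set differences over the COG families represented in each genome. The sketch below is a minimal illustration with a hypothetical function name and placeholder family identifiers; it does not reproduce the COG database queries used in the analysis.

```python
def differential_display(cogs_a, cogs_b):
    """Split protein families by presence in two genomes.

    cogs_a, cogs_b: iterables of family identifiers represented in
    genome A and genome B, respectively.
    """
    a, b = set(cogs_a), set(cogs_b)
    return {
        "only_a": a - b,   # represented in A but not in B
        "only_b": b - a,   # represented in B but not in A
        "shared": a & b,   # represented in both genomes
    }
```

Applied to the two genomes discussed here, the sizes of the two "only" sets would correspond to the 186 and 335 families reported in the text.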
The repertoires of transcriptional regulators in B. subtilis (27) and C. acetobutylicum are very similar. In particular, of the 17 sigma factors predicted in C. acetobutylicum, 11 have readily detectable orthologs in B. subtilis. C. acetobutylicum also encodes numerous predicted specific transcriptional regulators, including 28 members of the AcrR/TetR family, 22 members of the MarR/EmrRs family, 14 members of the LysR family, 14 members of the Xre family, 9 members of the LacI family, and also several smaller sets of paralogous regulators. One-to-one orthologous relationships could be established only for a minority of these proteins (data not shown), and in some cases, such as that of the MarR/EmrRs family, part of the observed diversity seems to be due to independent family expansion.
The set of sporulation genes in C. acetobutylicum surprisingly differs from the set that has been well studied in B. subtilis (75). The number and diversity of detectable sporulation genes in Clostridium are much smaller. The most dramatic difference was observed among the SpoV genes. C. acetobutylicum does not have orthologs of the spoVF, spoVK, and spoVM genes, the disruption of which in B. subtilis leads to formation of immature spores that are sensitive to heat, organic solvents, and lysozyme (75). The phosphorelay system that functions in phase 0 of sporulation in B. subtilis (7, 31) appears to be missing in C. acetobutylicum, as indicated by the absence of orthologs of Spo0B (phosphotransferase B) and Spo0F (a response regulator). In contrast, C. acetobutylicum encodes an apparent ortholog of the Spo0A (CAC2071) signaling protein that consists of a CheY domain and DNA-binding HTH domain and three proteins homologous to the ambiactive transcription repressors and activators AbrB and Abh (CAC1941, CAC0310, and CAC3647), also involved in phase 0 in B. subtilis. Interestingly, the Spo0A gene has been shown to control solventogenesis in solvent-forming Clostridia (60). In B. subtilis, sporulation is regulated by opposing activities of a distinct family of histidine kinases, KinA to KinE, and the Rap family phosphatases; orthologs of these genes were not detected in C. acetobutylicum.
B. subtilis has 22 cot genes that are responsible for coat biosynthesis; only 14 of these genes are conserved in C. acetobutylicum. Similarly, B. subtilis has 21 ger genes, 7 of which are represented by orthologs in Clostridium. Many of the missing ger genes encode various receptors of germination, which appear to be different in these bacteria. Furthermore, C. acetobutylicum does not have an ortholog of the cell-division-initiation gene divIC (75), which is essential in B. subtilis, suggesting differences in the mechanism of septum formation.
B. subtilis has a large set of competence genes which are involved in DNA uptake (12). The majority of these genes are represented by orthologs in C. acetobutylicum, but the proteins encoded by these genes in B. subtilis and C. acetobutylicum typically are not the most closely related members of the respective clusters of orthologs (data not shown). Operon disruption and rearrangements are also observed, suggesting a significant functional difference between the two gram-positive bacteria.
Many of the clostridial genes that are missing in B. subtilis seem to show distinct evolutionary affinities and probably have been acquired via horizontal transfer. In particular, a significant number of clostridial genes are conserved in all archaea whose genomes have been sequenced to date but are present in bacteria only sporadically (Table 2). Many of these genes encode various redox proteins, which reflects the similarity between the anaerobic redox chains in archaea and clostridia. For most of these “archaeal” genes found in bacteria, the probable evolutionary model is a single entry into the bacterial world by horizontal transfer from the Archaea, followed by dissemination among the Bacteria. In several cases, however, direct gene transfer from archaea into the clostridial genome seems likely; examples include the genes for a metal-dependent hydrolase of the metallo-beta-lactamase superfamily (CAC0535), a calcineurin-like phosphatase which has undergone duplication in C. acetobutylicum, probably subsequent to the acquisition of an archaeal gene (CAC1010 and CAC1078), and a predicted DNA-binding protein (CAC3166). Another group of clostridial genes includes probable eukaryotic acquisitions (Table 2). As with archaeal genes, the scenario of a single entry into the bacterial world followed by horizontal dissemination is likely for many of these genes, for example, that for the FHA domain discussed below. However, about 50 genes in C. acetobutylicum could have been directly hijacked from eukaryotes (Table 2). An interesting example is the nucleotide pyrophosphatase, which is encoded within one of the gene clusters including genes for FHA-containing proteins (Fig. 6) and therefore may be also implicated in signaling. As noticed previously, lateral acquisition of some of the aminoacyl-tRNA synthetases from eukaryotes, accompanied by displacement of the original copies, seems to have occurred repeatedly in bacterial evolution (85). C. acetobutylicum is no exception, with its arginyl-tRNA synthetase showing a clear eukaryotic affinity. In these cases, horizontal gene transfer from eukaryotes to specific bacterial lineages appears more likely than horizontal transfer in the opposite direction, bacteria to eukaryotes. The latter interpretation would require independent gene loss in multiple bacterial lineages accompanied by multiple instances of nonorthologous displacement.
Most of the essential functions in C. acetobutylicum and B. subtilis are associated with readily detectable orthologs, but there are also notable cases of nonorthologous gene displacement (Table 3). Examples include glycyl-tRNA synthetase, which is represented by the typical bacterial, two-subunit form in B. subtilis and by the one-subunit archaeal-eukaryotic version in C. acetobutylicum, and uracil-DNA glycosylase, similarly represented by the classical bacterial enzyme (ortholog of E. coli Ung) and by the archaeal version in C. acetobutylicum (Table 3). In many cases, while an apparent orthologous relationship was detected between a clostridial protein and its counterpart from B. subtilis, there was nevertheless a clear difference in the domain architectures (Table 2). Notable examples of unusual domain organizations from C. acetobutylicum include the FtsK ATPase, which is fused to the FHA domain (see below), a Pkn2 family protein kinase fused to tetratricopeptide repeats (CAC0404), and another ATPase fused to a LexA-like DNA-binding domain (CAC1793). The evolution of another set of genes seems to have involved xenologous gene displacement whereby a gene in one of the compared genomes (C. acetobutylicum or B. subtilis) is displaced by the ortholog from a distant branch of the phylogenetic tree, e.g., eukaryotes (Table 3). Characteristically, this evolutionary pattern was detected for three aminoacyl-tRNA synthetases, those for isoleucine, arginine, and histidine; in each of these cases, C. acetobutylicum possesses the archaeal-eukaryotic version as opposed to the typical bacterial versions found in B. subtilis. Another interesting example of xenologous displacement involves the two forms of clostridial ribonucleotide reductase, neither of which groups with the counterparts from B. subtilis in phylogenetic trees. One of the ribonucleotide reductase genes in B. subtilis contains the single intein in that organism; C. acetobutylicum has no inteins, however. These observations show that there had been a significant horizontal exchange of genes between the Clostridium lineage and certain archaea and/or eukaryotes subsequent to its divergence from the Bacillus lineage.
The results of systematic analysis of protein families that are specifically expanded in C. acetobutylicum are largely compatible with the current knowledge of the physiology of the bacterium (Table 4). For example, distinct families of proteins involved in sporulation, anaerobic energy conversion, and carbohydrate degradation were identified (Table 4). A so far unique feature is the presence of four diverged copies of the single-stranded DNA-binding protein, an essential component of the replication machinery that is present in one or two copies in all other sequenced bacterial genomes. In addition, this analysis revealed remarkable aspects of the signal transduction system in this bacterium. Of particular interest is the proliferation of the phosphopeptide-specific, protein-protein interaction module, the FHA domain, which is generally rare in the Bacteria (44). C. acetobutylicum encodes five FHA-domain-containing proteins, which is comparable to the number of these domains in other bacteria with versatile Ser/Thr-phosphorylation-based signaling, namely Mycobacterium tuberculosis (10) and Synechocystis sp. (7); most of the other bacteria do not encode FHA domains or possess just one copy (58). Four of the genes coding for FHA-domain-containing proteins in C. acetobutylicum belong to two partially similar gene clusters that are unique to C. acetobutylicum and additionally include genes for other phosphorylation-dependent signaling proteins, namely predicted protein kinases and phosphatases (Fig. 6). The fusion of the FHA domain with the FtsK ATPase, which is involved in chromosome segregation, and the presence, in one of the clusters, of an ATPase of the MinD family, also involved in chromosome partitioning, suggest previously unsuspected regulation of cell division in C. acetobutylicum via reversible protein phosphorylation. 
The fifth FHA-domain-containing protein seems to belong to yet another predicted operon that is potentially involved in cell division as indicated by the presence of genes for a penicillin-binding protein and another membrane protein implicated in cell division in other bacteria (Fig. 6). These observations are compatible with the hypothesis on the role of phosphorylation in the regulation of this process in C. acetobutylicum. Another signaling system that is predicted to play a prominent role in C. acetobutylicum on the basis of protein family expansion analysis includes the so-called HD-GYP domains (name based on the one-letter code for characteristic amino acids) that are suspected to possess cyclic diguanylate phosphoesterase activity (Table 4); the only comparable expansion of the HD-GYP domain is seen in T. maritima. The HD-GYP proteins could play a major role in sensing the redox state of the environment in C. acetobutylicum (M. Y. Galperin, D. A. Natale, L. Aravind, and E. V. Koonin, Letter, J. Mol. Microbiol. Biotechnol. 1:303–305, 1999).
The solventogenesis pathways of C. acetobutylicum involve the formation of acetone, acetate, butanol, butyrate, and ethanol from acetyl-CoA (52). Two mechanisms of butanol formation have been identified in C. acetobutylicum, one of which is associated with solventogenesis (production of butanol, ethanol, and acetone) and the other with alcohologenesis (production of butanol and ethanol only). The genes involved in solventogenesis have been previously identified on the megaplasmid and sequenced (Galperin et al., letter), but the genes responsible for alcohologenesis were unknown. The genome sequencing allows the identification of a second alcohol-aldehyde dehydrogenase (CAP0035), a pyruvate decarboxylase (CAP0025), and an ethanol dehydrogenase (CAP0059) that are probably involved in this alcohologenic metabolism (Fig. 5) and, interestingly, are also carried by the megaplasmid. The enzymes involved in the final steps of solvent formation show variable phylogenetic profiles, and in particular, several of them appear to be specifically related to the homologs from the archaeon Archaeoglobus fulgidus (Fig. 5). In contrast, the genes for the two subunits of another key enzyme of the acetone pathway, acetoacetyl-CoA:acyl-CoA transferase, show a clear proteobacterial affinity. Together with the fact that a significant subset of the solventogenesis enzymes is encoded on the clostridial megaplasmid, these observations suggest that these pathways could have evolved via a complex sequence of gene/operon acquisition events. The megaplasmid also carries second copies of genes involved in PTS-type sugar transport (CAP0066-68), glycolysis (aldolase, CAP0064) and central metabolism (thiolase, CAP0078). 
It would be interesting to determine the expression profiles of the plasmid-encoded and chromosomal copies of these genes to investigate (i) whether these genes and the solventogenic genes are regulated or coregulated and (ii) whether metabolic complementarity exists between the chromosome and the plasmid in C. acetobutylicum.
The cellulosome, the macromolecular complex for cellulose degradation, has been genetically and biochemically characterized in four Clostridium species (C. thermocellum, C. cellulovorans, C. cellulolyticum, and C. josui) but not in C. acetobutylicum, which is able to hydrolyze carboxymethyl cellulose but not amorphous or crystalline cellulose (68). The proteins of the cellulosome contain a C-terminal Ca2+-binding dockerin domain, which is required for the binding to the cohesin domains of a scaffolding protein (36, 40). Genome sequence analysis revealed at least 11 proteins that are confidently identified as cellulosome components (Fig. 5A). Most of these genes are organized in an operon-like cluster (CAC910 to CAC919) with a gene order similar to that of those in mesophilic C. cellulolyticum and C. cellulovorans, as distinct from the more dispersed organization in the thermophile C. thermocellum (4, 77). The large glycohydrolase CAC3469 is the homolog of EngE of C. cellulovorans, which is also encoded away from the main cellulosome gene cluster. Unlike EngE, CAC3469 possesses an additional cell adhesion domain (Fig. 5A). This protein contains S-layer homology domains and cell adhesion domains similar to those of SlpA, one of the anchoring proteins of C. thermocellum. The presence of the short cohesin domain protein CAC914 suggests a role in cellulosome function related to that of the HbpA protein of C. cellulovorans (77). The other dockerin-domain-containing proteins, those of the GH48, GH5, and GH9 families, might interact with CAC910, the ortholog of the scaffolding protein CbpA. Generally, although the cellulosome has not been detected in C. acetobutylicum, the number of relevant proteins and domains would seem sufficient to encode the various combinations of cellulose-binding and hydrolytic proteins found in this complex. 
An interaction between CAC3469 and CAC910 could be speculatively proposed as a means of anchoring a potential cellulosome-like structure to the peptidoglycan.
In work analyzing the cellulolytic activities of C. acetobutylicum strains, it was found that NRRL B 527 could hydrolyze Avicel and acid-swollen cellulose but C. acetobutylicum ATCC 824 could not (42). The subsequent taxonomic and historical analyses of these strains (32, 33) indicate a close relationship and suggest that further investigation of the cluster from strain B 527 would be informative in elucidating the reason for the different cellulolytic activities of the two strains. Further work is required to resolve these issues and to determine the exact functions of the cellulosome subunits in C. acetobutylicum.
In addition to the known cellulosome components, C. acetobutylicum encodes numerous other proteins that are predicted to be involved in the degradation of xylan, levan, pectin, starch, and other polysaccharides. Altogether, there seem to be over 90 genes encoding proteins implicated in these processes, including representatives of at least 14 distinct families of glycosyl hydrolases. In particular, a predicted operon located on the C. acetobutylicum megaplasmid (CAP0114 to CAP0120) consists mostly of genes encoding xylan degradation enzymes. Similarly to the cellulosome components, these enzymes possess complex domain architectures, with the oligosaccharide-binding ricin domain (74) typically present at the C terminus; the addition of the ricin domain is (so far) a unique feature of this postulated novel system for xylan degradation in Clostridium (Fig. 5B). Two of the putative xylanases presumably correspond to previously reported enzymes of xylan degradation isolated from C. acetobutylicum ATCC 824 (43).
A number of sugar PTS transport system genes, as well as the corresponding regulatory system analogs (e.g., Hpr, ptsK, and CcpA), which couple transport signals to genetic regulation of degradative operons (61, 63), have been found. Non-PTS-mediated uptake of certain sugars, especially pentoses, has been found in several clostridial species (52). Many primary active transporters, including ABC-type transporters and P-type ATPases, electrochemical potential-driven transporters, channels and pores, and uncharacterized transporters were detected among the gene products of C. acetobutylicum (Fig. 5; see details in the figure legend). There is, however, no ortholog of the glucose facilitator of B. subtilis (17).
Along with previously characterized molecular complexes involved in extracellular hydrolysis of organic polymers, a novel system possibly related to these processes was detected. The signature of this system is a previously undetected domain with a distinct repetitive structure, which we designated as “ChW repeats” (clostridial hydrophobic, with a conserved W, tryptophan) (Fig. 7B). So far, the only nonclostridial protein containing similar repeats was detected in Streptomyces coelicolor (Fig. 7B). All proteins containing ChW repeats contain confidently predicted signal peptides at their N termini and do not contain predicted transmembrane helices, which suggests that all of them are secreted (Fig. 7A). Some of the ChW-repeat proteins contain additional enzymatic domains, such as glycosyl hydrolases or proteases, which implicates them in the degradation of polysaccharides and proteins. Several ChW-repeat proteins also contain domains that are involved in cell interactions, such as the cell adhesion domain (39) and the leucine-rich repeat (internalin) domain (46) (Fig. 7A). The internalin domain has been shown to play a critical role in the host cell invasion by the bacterial pathogen Listeria monocytogenes (46). In C. acetobutylicum, these domains might be responsible for interactions with plant cells. ChW repeats also could function in either substrate-binding or protein-protein interactions. The specific expansion of this domain in C. acetobutylicum suggests the existence of a novel molecular system, which partially resembles the cellulosome and could also form structurally distinct multisubunit complexes involved in polymer degradation and interaction with the environment. Elucidation of the function of this system is expected to shed light on the unique physiology of C. acetobutylicum.
The extreme diversity of the domain architectures of the proteins that comprise the cellulosome and other predicted polymer degradation systems suggests that such complexes are highly dynamic not only in terms of the subunit stoichiometry (68) but also with respect to the genetic organization, with horizontal gene transfer, domain shuffling, and nonorthologous gene displacement playing pivotal roles in their evolution. C. acetobutylicum is the first sequenced bacterial genome with such a remarkable abundance of polymer degradation systems, which makes it a model for future studies on other bacteria with similar lifestyles. In addition, the sequencing of the C. acetobutylicum genome will offer perspectives in future comparative genomic studies concerning pathogenic bacteria, e.g., C. difficile, C. tetani, and C. perfringens, which are currently being sequenced by other groups.
This work was supported by research grants DE-FG02-95ER-61967 (D.R.S.), DE-FG02-00ER629 (M.J.D.), NSF BES0001288 (G.N.B.), and USDA 00-35504-9269 (G.N.B.).
We are grateful to Guy Plunkett (University of Wisconsin) for performing the skew analyses and to John Reeve for helpful discussions and for suggesting C. acetobutylicum as a target for genomic sequencing.