Caldicellulosiruptor bescii DSM 6725 utilizes various polysaccharides and grows efficiently on untreated high-lignin grasses and hardwood at an optimum temperature of ~80°C. It is a promising anaerobic bacterium for studying high-temperature biomass conversion. Its genome contains 2666 protein-coding sequences organized into 1209 operons. Expression of 2196 genes (83%) was confirmed experimentally. At least 322 genes appear to have been obtained by lateral gene transfer (LGT). Putative functions were assigned to 364 conserved/hypothetical protein (C/HP) genes. The genome contains 171 and 88 genes related to carbohydrate transport and utilization, respectively. Growth on cellulose led to the up-regulation of 32 carbohydrate-active (CAZy), 61 sugar transport, 25 transcription factor and 234 C/HP genes. Some C/HPs were overproduced on cellulose or xylan, suggesting their involvement in polysaccharide conversion. A unique feature of the genome is enrichment with genes encoding multi-modular, multi-functional CAZy proteins organized into one large cluster, the products of which are proposed to act synergistically on different components of plant cell walls and to aid the ability of C. bescii to convert plant biomass. The high duplication of CAZy domains coupled with the ability to acquire foreign genes by LGT may have allowed the bacterium to rapidly adapt to changing plant biomass-rich environments.
Lignocellulosic plant biomass is the most abundant renewable alternative to petroleum as a source of fuel (1). It consists mainly of cellulose and hemicellulose in combination with up to 20% lignin. Biological conversion of this chemically and physically complex material represents a major challenge (2,3). Expensive thermal and chemical pretreatments are needed to decrease its recalcitrance and expose the polysaccharides to carbohydrate-active enzymes (CAZy) and carbohydrate-binding modules (CBMs) that help destroy the plant cell walls (4,5). Despite intensive studies, many aspects of microbial and enzymatic biomass-to-biofuel conversion are still not understood. Thermophilic anaerobic bacteria hold great promise as they display higher bioconversion rates, minimize the risk of contamination, facilitate product recovery and synthesize highly thermostable enzymes (6,7). However, only a relatively small number of anaerobic thermophiles are able to convert crystalline cellulose into soluble fermentable sugars, and only a few of them are able to metabolize simultaneously the hexose and pentose sugars that are produced from cellulose and hemicellulose, respectively (1,8).
One of the best studied of the cellulolytic microbes is Clostridium thermocellum, which grows optimally at 60°C (9). It produces ethanol and is being used for the consolidated bioprocessing of plant biomass (6,8–10). Its cellulolytic system is a large multi-protein complex called the cellulosome, the enzymatic components of which act synergistically to degrade crystalline cellulose (9–11). The recent availability of genetic systems in C. thermocellum and a related thermophile (12) provides a much needed tool to investigate the mechanisms of cellulose degradation. Several members of the genus Caldicellulosiruptor are able to degrade cellulose at even higher temperatures (up to 90°C) and they also utilize pentose sugars (13–18). The genomes of C. saccharolyticus DSM 8903 (19) and C. bescii DSM 6725 have been sequenced (20) and some CAZy enzymes have been purified and characterized from both species (21,22). Representatives of this genus have potential utility in biomass-to-sugars conversion processes but more comprehensive studies are needed to understand the degradative mechanisms involved.
Caldicellulosiruptor bescii grows at temperatures up to 90°C and is the most thermophilic bacterium capable of growth on crystalline cellulose (16). It also utilizes xylan, pectin and starch and is also able to grow efficiently on untreated plant biomass with high lignin content (14,16). The bacterium is capable of using cellulose and xylan simultaneously. Its ability to grow on the hardwood poplar is of particular interest as this hardwood can be genetically manipulated to potentially decrease recalcitrance (23). For example, one transgenic poplar line overexpressing xyloglucanase is less recalcitrant to cellulolytic enzymes (24). In the present article we analyze the genome of C. bescii with a particular focus on genes encoding enzymes involved in plant biomass conversion. We also present transcriptomic and proteomic data and compare its genome with those of other anaerobic thermophiles, including its close relative C. saccharolyticus. This analysis will contribute to our understanding of plant biomass conversion at extreme temperatures and will provide a genetic basis for the plant biomass-degrading properties displayed by this remarkable organism.
Caldicellulosiruptor bescii strain DSM 6725 was obtained from the German Culture Collection (www.dsmz.de/index.htm). The organism was grown in the 516 medium as previously described (14) except that vitamin and trace mineral solutions were modified. Medium composition and growth conditions are given in the Supplementary Data.
RNA extraction and purification were carried out as described previously (26). RNA samples were converted to fluorescence-labeled cDNA and hybridized to a whole-genome C. bescii microarray according to the procedures previously described (27). Additional information is provided in the Supplementary Data.
Extracellular protein (ExtP) fractions were prepared from 1l cultures grown for 24h on different substrates. The residual insoluble substrates (if present) were removed by decantation. The cells and ExtP fractions were separated by centrifugation. The ExtP was filtered through a 0.2µm membrane, concentrated using a 10kDa membrane and dialyzed against 50mM NH4HCO3, pH 8.0. To obtain intracellular protein (IntP) and membrane protein (MemP) fractions, samples were prepared from the sedimented cells (see Supplementary Data for more details).
Samples for tandem mass-spectrometry were prepared as described earlier (28). Fragmentation spectra (MS/MS) obtained from each sample were searched against the DSM 6725 proteome using SEQUEST (29) and filtered using DTASelect (30). Filter levels were set at +1s 1.8, +2s 2.5 and +3s 3.5 to obtain a false-discovery rate of <5% at protein level. Additionally, a minimum of two unique peptides per locus were required in order to identify the protein (see Supplementary Data for more details).
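The two filtering criteria described above (charge-dependent score thresholds and a minimum of two unique peptides per locus) can be illustrated with a short sketch. This is not the DTASelect implementation. It assumes that the quoted filter levels are SEQUEST cross-correlation (XCorr) lower bounds for +1, +2 and +3 precursor ions, that the bounds are inclusive, and that each peptide-spectrum match is available as a simple tuple; the record layout and the locus tags in the usage note are hypothetical.

```python
from collections import defaultdict

# Score thresholds per precursor charge state, as quoted in the text.
# Interpreting them as inclusive XCorr lower bounds is an assumption.
XCORR_MIN = {1: 1.8, 2: 2.5, 3: 3.5}
MIN_UNIQUE_PEPTIDES = 2


def identified_loci(psms):
    """Return the loci supported by enough unique, passing peptides.

    `psms` is an iterable of (locus, peptide, charge, xcorr) tuples;
    this record layout is hypothetical and chosen for the example.
    """
    peptides_by_locus = defaultdict(set)
    for locus, peptide, charge, xcorr in psms:
        threshold = XCORR_MIN.get(charge)
        # Matches with an unlisted charge state are discarded.
        if threshold is not None and xcorr >= threshold:
            peptides_by_locus[locus].add(peptide)
    return {
        locus
        for locus, peptides in peptides_by_locus.items()
        if len(peptides) >= MIN_UNIQUE_PEPTIDES
    }
```

For example, a locus matched by two different peptides that both pass their charge-specific threshold is retained, whereas a locus matched twice by the same peptide sequence is not, because only unique peptides are counted.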
The genome sequence of C. bescii was determined by the Joint Genome Institute (JGI) (20) and the annotated version was downloaded from the NCBI genome database (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi). Clusters of orthologous genes (COG)-based functional assignment (http://www.ncbi.nlm.nih.gov/COG/old/xognitor.html), extracellular proteins, transmembrane helices (http://www.cbs.dtu.dk/services/TMHMM/) and insertion sequence (IS) elements (31) were predicted as described in the Supplementary Data.
Operons were predicted by our previously published method (32), which was ranked as the best available operon prediction program by an independent study (33). Carbohydrate-active enzymes were searched for using BLAST- and HMM-based tools and sequence libraries used for the updates of the CAZy database [http://www.cazy.org; (34)]. Further details are included in the Supplementary Data. Prediction of transporters and transcription factors (TFs) was carried out as described in the Supplementary Data.
The KEGG assignment (http://www.genome.jp/kegg/) and the COG groups (http://www.ncbi.nlm.nih.gov/COG/old/xognitor.html) were downloaded from the databases. Upon analysis, we found that 71–88% of operonic gene pairs belong to the same KEGG pathways, whereas 48–75% of gene pairs that are not predicted to be in the same operon but are separated by an intergenic distance of <100bp are in the same KEGG pathway (Supplementary Table S1). Our analysis suggested that operonic gene pairs and gene pairs with a short intergenic distance are more likely to function in the same pathway. Using this approach, we assigned an associated function to a hypothetical protein if it is predicted to be in the same operon as genes assigned to a KEGG pathway, or if its gene is near another gene with an annotated function.
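The pair-wise comparison and the resulting function transfer can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used to produce Supplementary Table S1: gene pairs in which either gene lacks a pathway assignment are skipped, "same pathway" is taken to mean that the two genes share at least one pathway identifier, and all gene names and pathway identifiers in the usage note are hypothetical.

```python
def shared_pathway_fraction(pairs, pathways):
    """Fraction of annotated gene pairs sharing at least one pathway.

    pairs: iterable of (gene_a, gene_b) tuples.
    pathways: dict mapping gene -> set of pathway identifiers.
    Pairs in which either gene has no pathway are not scored.
    """
    scored = 0
    shared = 0
    for gene_a, gene_b in pairs:
        path_a = pathways.get(gene_a)
        path_b = pathways.get(gene_b)
        if not path_a or not path_b:
            continue
        scored += 1
        if path_a & path_b:
            shared += 1
    return shared / scored if scored else 0.0


def infer_pathways(gene, operon, pathways):
    """Pathways suggested for `gene` by its annotated operon partners."""
    inferred = set()
    for partner in operon:
        if partner != gene:
            inferred |= pathways.get(partner, set())
    return inferred
```

Given `pathways = {"g1": {"p1"}, "g2": {"p1", "p2"}, "g3": {"p2"}, "g4": set()}`, the pairs `("g1", "g2")` and `("g1", "g3")` give a shared fraction of 0.5, and the unannotated gene `g4` in an operon with `g2` and `g3` would inherit the candidate pathways `p1` and `p2`.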
To identify the genes expressed in C. bescii cells grown on glucose or cellulose (filter paper), we compared the gene expression profiles of C. bescii genes and their homologs in Escherichia coli K12. The data set of E. coli K12 grown on various carbon sources (GSE2037) was downloaded from NCBI, and we identified 278 genes whose expression was reduced when not growing on glucose, using SAM with the cut-off P-value of 0.05. Mapping this gene set to the C. bescii genome (cut-off e-value of 1e-20) results in 206 homologs termed glucose-related genes. The log-likelihood ratio at intensity i was calculated as ln(f1i/fwi), where f1i and fwi are the respective frequencies of glucose-related genes and all genes having the probe intensity i. The cut-off intensity to consider that a gene is expressed was chosen so that ln(f1i/fwi)=0.
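The cut-off selection can be sketched as below. The sketch assumes that probe intensities are grouped into bins, that f1i and fwi are the fractions of glucose-related genes and of all genes falling in bin i, and that the cut-off is taken as the lower edge of the first bin where ln(f1i/fwi) changes from negative to non-negative. The binning scheme and the edge-based cut-off are illustrative choices, not details given in the text.

```python
import math


def bin_frequencies(values, edges):
    """Fraction of values falling in each half-open bin [edges[k], edges[k+1])."""
    counts = [0] * (len(edges) - 1)
    for value in values:
        for k in range(len(edges) - 1):
            if edges[k] <= value < edges[k + 1]:
                counts[k] += 1
                break
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside the bin edges")
    return [count / total for count in counts]


def log_likelihood_ratios(reference, all_genes, edges):
    """ln(f1/fw) per bin; None where either frequency is zero."""
    f1 = bin_frequencies(reference, edges)
    fw = bin_frequencies(all_genes, edges)
    return [
        math.log(a / b) if a > 0 and b > 0 else None
        for a, b in zip(f1, fw)
    ]


def expression_cutoff(llr, edges):
    """Lower edge of the first bin where the ratio turns non-negative."""
    previous = None
    for k, value in enumerate(llr):
        if value is None:
            continue
        if previous is not None and previous < 0 <= value:
            return edges[k]
        previous = value
    return None
```

Genes whose probe intensity is at or above the returned value would then be treated as expressed. A finer binning or interpolation between bin centres would give a smoother estimate of the point where the ratio equals zero.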
The C. bescii genome contains a 2919718bp circular chromosome with 35.2% GC content and is slightly smaller than the size of the average bacterial genome (3.3Mb; www.ncbi.nlm.nih.gov/genomes/lproks.cgi). It contains two native circular plasmids termed AX710673 pBAL (8294bp, 38.5% GC) and AX710687 pBAS2 (3653bp, 42.9% GC), both of which were isolated and sequenced previously (35). The sequence of AX710687 pBAS2 reported here is identical but that of AX710673 pBAL has 8 deletions, 11 insertions and 7 mismatches (Supplementary Figure S2). AX710673 and AX710687 encode eight and four open reading frames (ORFs), respectively. AX710687 encodes exclusively uncharacterized proteins, while AX710673 encodes two putative regulators and two proteins involved in nucleic acid metabolism.
The chromosome is predicted to contain 2654 protein coding genes. Their arrangement on the two strands suggests that it has two equal replicores with a positive correlation between the direction of transcription and replication (Figure 1). The 16S RNA sequences confirmed that what was formerly termed Anaerocellum thermophilum is a member of the phylum Firmicutes, class Clostridia, order Clostridiales and that it should be classified in the genus Caldicellulosiruptor (15). C. hydrothermalis and C. kronotskiensis are the closest known relatives of C. bescii and C. saccharolyticus DSM 8903 is the closest relative with a sequenced genome (15,19).
We compared the general features of the C. bescii genome to those of five anaerobic thermophiles containing significant numbers of CAZy-related genes potentially involved in plant biomass degradation (Table 1): C. saccharolyticus DSM 8903, C. thermocellum ATCC 27405, Thermotoga maritima MSB8, Thermoanaerobacter pseudethanolicus ATCC 33223 and T. tengcongensis MB4. The genome size of C. bescii is similar to that of C. saccharolyticus (2.97Mb) (19). Both utilize crystalline cellulose and xylan and are very closely related, with over 2300 C. bescii genes having their top Blast hit in the C. saccharolyticus sequence. The C. bescii genome is smaller than that of the cellulolytic but not xylanolytic C. thermocellum ATCC 27405 (Topt 60°C, 3.8Mb) (36), and larger than the genome of the xylanolytic but not cellulolytic T. maritima MSB8 (Topt 80°C, 1.86Mb) (37). By 16S rRNA analysis, C. bescii is closely related to the Thermoanaerobacter genus. Its genome is most similar in size to that of T. tengcongensis MB4 (Topt 75°C, 2.7Mb) (38) and slightly larger than that of T. pseudethanolicus ATCC 33223 (formerly T. ethanolicus strain 39E, Topt 65°C, 2.4Mb). Both Thermoanaerobacter species grow on polysaccharides such as starch but do not utilize cellulose or xylan.
Using the COG approach to predict gene function (39), we analyzed the genomes of 41 anaerobic thermophiles (Supplementary Tables S2 and S3). The genome of C. bescii is significantly enriched in genes encoding proteins involved with cell motility and secretion (COG group N) and cell division (group D), in agreement with the SignalP result suggesting that C. bescii has a large number of secreted proteins. Interestingly, the cellulolytic C. thermocellum ATCC 27405 has a significantly higher than average number of genes responsible for DNA replication, recombination or repair, which may have facilitated the development of a genetic system for this organism (28). On the other hand, the C. thermocellum genome appears to have fewer genes involved in intracellular trafficking and in defense mechanisms. The genome of C. bescii has a lower number of uncharacterized genes (groups R and S) and a higher percentage of genes not assigned to COG categories.
Of the 2666 proteins encoded by the C. bescii genome, 394 (14.8%) and 344 (12.9%) are predicted to have signal peptides and transmembrane helices, respectively. Using a previously developed program (32), it was found that the 2666 genes in C. bescii are predicted to be organized into 1209 transcriptional units, 577 of which are multi-gene. The 259 genes with functions related to carbohydrate metabolism and sugar transport were predicted to form 180 transcriptional units, 111 of which are multi-gene and 69 are single-gene operons (Supplementary Table S4). Furthermore, C. bescii (and its close relative C. saccharolyticus) is predicted to contain 14 sigma factors, 8 anti-sigma modulators, 97 putative transcription regulators and 18 histidine kinases. Among the 18 putative histidine kinases, 11 are predicted to be membrane-associated. The high number of putative regulators suggests that C. bescii is highly responsive to changing environmental conditions and nutrient availability. This is in line with a recent report showing that the composition of the C. thermocellum cellulosome varies with the growth substrate (40).
Insertion sequences (IS) have been found to be actively involved in the genomic recombination and horizontal gene transfer events in prokaryotic genomes (41). The coding region of an IS is flanked by fixed-length non-coding terminal regions, which are essential in mediating transposition and genomic recombination (31,42–44). Unfortunately, in many cases genome annotations include only the potential coding sequences carried by the elements and ignore their terminal regions. The statistics of IS elements in six genomes of anaerobic thermophiles are summarized in Table 1 (see also Supplementary Table S5). Thermotoga maritima harbors the smallest number of IS elements, with only 3 full copies and 22 partial copies. Caldicellulosiruptor bescii also harbors far fewer full IS copies than the other four bacteria, especially compared to its closest relative C. saccharolyticus. Full copies of IS elements are typically the results of recent proliferation. In light of these data, the genome of C. bescii is probably more stable than the other genomes, except for that of T. maritima.
The presence of IS elements suggests that all of the organisms listed in Table 1 likely have a history of horizontal gene transfer events (Supplementary Table S5). Accordingly, the C. bescii genome has multiple sequences that are much more closely related to those in other genomes than they are to those in C. saccharolyticus, suggesting that these regions are the results of such events. As shown in Supplementary Figure S3 and Table S6, these include three of the thermophilic organisms listed in Table 1, C. thermocellum (23 genes), T. tengcongensis (21 genes) and T. pseudethanolicus (18 genes), as well as Petrotoga mobilis (11 genes), Thermoanaerobacter sp. X514 (11 genes) and Dictyoglomus thermophilum (14 genes). In addition, eleven C. bescii genes show the highest similarity to those in C. phytofermentans, a mesophilic anaerobe that, like C. bescii, is both cellulolytic and xylanolytic. These ‘horizontally transferred’ genes in C. bescii are predicted to encode ABC transporters, carbohydrate-active enzymes (CAZy), mobile-element related proteins, signal transducers and DNA-binding proteins (all containing a helix-turn-helix motif: 15), and genes encoding domains of unknown function like conserved domain UPF0236, KWG leptospira repeat and a radical SAM domain, many of which may be involved in various catabolic and anabolic pathways (45).
The distribution of CAZy genes (http://www.cazy.org) related to plant biomass degradation within the genomes of the 41 anaerobic thermophiles is shown in Supplementary Tables S2 and S7. Glycoside transferases were not considered in this group as they are mainly involved in the biosynthesis of polysaccharides. Among these thermophiles, the 16 genomes of the archaea encode very few CAZy proteins. They do not contain polysaccharide lyases (PLs) and 13 of the 16 genomes do not encode CBMs, which are critical for degradation of insoluble polysaccharides. Six of the archaeal species grow on starch, although three of them are not predicted to contain genes that encode CBMs. Three of the genomes encode CBMs, glycoside hydrolases (GHs) and carbohydrate esterases (CEs). Two of them, Pyrococcus furiosus and Thermococcus kodakaraensis, do not grow on cellulose or xylan, but do grow on starch, while the other, Thermofilum pendens, does not grow on any polysaccharide that has been examined although its genome encodes several GHs, CBMs and CEs. The presence of two GH13s, the recombinant forms of which are amylolytic enzymes, suggests that this organism can grow on starch or cyclodextrins (46).
In contrast to the anaerobic thermophilic archaea, all of the genomes of the 25 anaerobic thermophilic bacteria encode CBMs, GHs and CEs (Supplementary Tables S2 and S7). However, in many cases growth of these organisms on components of plant biomass has not been reported. Starch is the most common polysaccharide to be used by this group and 14 of them, including C. bescii, have this ability and all contain α-amylase-type enzymes (GH13). PLs are identified in eight of these bacteria, and five of them have been shown to grow on pectin, including C. bescii, C. saccharolyticus, C. thermocellum, T. lettingae and T. maritima. The genome of C. saccharolyticus does not contain PLs but it does encode two GH28s that are putatively involved in hydrolysis of the pectin backbone. Based on the numbers of CBMs and GHs that they contain, these anaerobic thermophilic bacteria (Supplementary Table S2) can be classified into three groups wherein (i) both CBMs and GHs are low (7 genomes); (ii) CBMs are low but GHs are high (15 genomes) and (iii) both CBMs and GHs are high (3 genomes). The latter category includes C. bescii, C. saccharolyticus and C. thermocellum. All representatives of the Caldicellulosiruptor genus grow on cellulose, xylan, pectin and starch (13,16). C. thermocellum does not grow on xylan as it cannot consume xylose; however, it depolymerizes xylan into xylose, xylobiose and xylooligosaccharides (9). Consequently, there is a clear correlation between the number of representatives of CAZy genes in a genome and the plant biomass-degrading abilities of a microorganism.
The modular architecture of the 88 CAZy genes in C. bescii is shown in Supplementary Table S8. There is a comparable number of such genes in C. saccharolyticus . Other common characteristics include (i) a similar module arrangement for CAZy-related proteins that do not contain CBMs, (ii) all proteins containing CBMs are predicted to be extracellular based on the presence of signal peptides (extracellular CAZy proteins in C. bescii are shown in Table 3); and (iii) the major CBM3s present in the enzymes from both organisms are derived from subfamilies 3a and 3b. CBM3a/3b bind tightly to crystalline cellulose and thus enhance the access of cellulases to their substrate relative to other cellulose-directed CBMs (47,48). In this respect, these two Caldicellulosiruptor species are similar to cellulolytic clostridia that produce cellulosomes, where CBM3 plays a pivotal role in substrate targeting of their respective cellulase complexes (10,11). The clostridial enzymes generally contain additional CBMs that direct the cellulose-tethered complex to specific regions of the cell wall, consistent with the activity of the enzyme containing these additional targeting modules (49,50). In contrast, C. bescii and C. saccharolyticus contain fewer of these additional, non-crystalline cellulose-binding CBM families (Supplementary Tables S2 and S8). The most significant of these are five CBM22s, and one CBM36 that likely targets xylan (Supplementary Table S8). Within this context it should be noted that CBM22s bind tightly to isolated xylan chains but not to hemicellulose within the plant cell wall (51). Thus, CBM22-containing enzymes likely target xylans that have been released from the plant cell wall. It appears, therefore, that CBM3s work in both of these bacteria as the primary mechanism for the attachment of enzymes to plant polysaccharides. Furthermore, the majority of CBM3-containing enzymes contain multiple CBM3s. 
These are likely to confer extremely tight binding to cellulose to offset the dissociation promoted by elevated temperatures. Indeed, it has been suggested that there is a general correlation between the temperature at which an organism grows and the frequency of finding enzymes with multiple CBM copies (52).
Notably, the CBM3s in the genomes of both C. bescii and C. saccharolyticus are concentrated only in one gene cluster and this encodes mainly CAZy proteins (Cbes_1853-_1867 and Csac_1076-_1085: see Figure 2 and Supplementary Figure S4). However, there is a significant difference in the arrangement of these gene clusters. In C. bescii this cluster is enriched in CBM3s, which are present as double or triple modules within one gene product, in comparison to the cluster of C. saccharolyticus (16 versus 10). Specifically, C. bescii contains three genes encoding PLs of different families that are absent from C. saccharolyticus. Moreover, of all thermophilic anaerobes, only C. bescii has PLs of three different families (Supplementary Table S2). The C. bescii cluster also contains three GH48s versus one in C. saccharolyticus. The GH48s are key enzymes in crystalline cellulose hydrolysis and are uniquely arranged in C. bescii. There is no other known example of three modules of this type in combination with a second catalytic module of different CAZy activity (Figure 2A and B). Interestingly, a deletion mutant of C. thermocellum lacking two GH48s was able to completely hydrolyze crystalline cellulose, albeit at a slower rate than the wild-type (28). This cluster in C. bescii also has three GH5 mannanases (versus one gene in C. saccharolyticus) and six genes encoding multifunctional CAZy proteins (versus three genes in C. saccharolyticus), each containing two catalytic modules of different hydrolytic activity separated by double or triple CBM3s.
Consequently, this CAZy-enriched gene cluster in C. bescii uniquely contains CBM3s that potentially mediate the binding of 13 catalytic modules to the insoluble substrate, while in C. saccharolyticus there are only eight catalytic modules attached to CBM3s. The C. bescii gene cluster also contains a GH74 module, which is a putative xyloglucanase (Table 3). This enzyme has an important role in biomass degradation as it hydrolyzes xyloglucan networks (53). In C. bescii the GH74 enzyme is part of a multi-modular protein with two CBM3s and GH48 (Cbes_1860), the combination of which is predicted to display synergism by binding the two catalytic modules to xyloglucan, hydrolyzing xyloglucan and releasing and hydrolyzing cellulose. In C. saccharolyticus the corresponding gene is truncated to GH74-CBM3 (Figure 2A and B). In Cbes_1867, GH48 is combined with GH9 via a triplet of CBM3s. The combination of GH9 (endoglucanase) and GH48 (exoglucanase) suggests synergy in the hydrolysis of amorphous and crystalline parts of cellulose. Similarly, Cbes_1857 has the modular structure GH10-CBM3-CBM3-GH48 where GH10 is a xylanase, and the catalytic modules can act in concert on mixed-type xylan/cellulose substrates.
In general, the C. bescii gene cluster encodes a powerful set of CAZy enzymes active against major components of plant cell walls (cellulose, xylan, xyloglucan, pectin and mannan). In contrast, the analogous cluster in C. saccharolyticus is significantly truncated and lacks genes encoding some important biomass-related activities. All but one of the CBM3s in the C. bescii gene cluster have >99% identity, the three GH48s are 100% identical, and there are also three GH5s with a high degree of sequence identity. Such gene duplication in the main CAZy-containing cluster in C. bescii suggests that both the diversity of CAZys and the ‘dosage’ of individual CAZys are important for this bacterium to adapt to new growth substrates, including various polysaccharides and related materials derived from plant biomass (54).
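Modular structures written in the form GH10-CBM3-CBM3-GH48 lend themselves to simple parsing, which makes it easy to count the CBM3 copies and catalytic modules in each gene product. The sketch below is an illustration only: the hyphen-separated notation is taken from the text, while the function names, the set of classes treated as catalytic (GH, PL and CE) and the definition of "bifunctional" as two or more distinct catalytic modules are assumptions made for the example.

```python
import re

# A module token is a class prefix followed by a family number, e.g. GH48.
MODULE_RE = re.compile(r"^([A-Z]+)(\d+)$")

# Classes treated as catalytic in this sketch (an assumption).
CATALYTIC_CLASSES = {"GH", "PL", "CE"}


def parse_architecture(architecture):
    """Split 'GH10-CBM3-CBM3-GH48' into [('GH', 10), ('CBM', 3), ...]."""
    modules = []
    for token in architecture.split("-"):
        match = MODULE_RE.match(token)
        if match is None:
            raise ValueError(f"unrecognised module: {token!r}")
        modules.append((match.group(1), int(match.group(2))))
    return modules


def summarize(architecture):
    """Report catalytic modules, CBM3 copy number and bifunctionality."""
    modules = parse_architecture(architecture)
    catalytic = [
        f"{cls}{family}" for cls, family in modules if cls in CATALYTIC_CLASSES
    ]
    cbm3_count = sum(
        1 for cls, family in modules if cls == "CBM" and family == 3
    )
    return {
        "catalytic": catalytic,
        "cbm3_count": cbm3_count,
        "bifunctional": len(set(catalytic)) >= 2,
    }
```

Applied to the Cbes_1857 structure GH10-CBM3-CBM3-GH48, `summarize` reports two catalytic modules (GH10 and GH48) joined by two CBM3s, whereas the truncated C. saccharolyticus structure GH74-CBM3 is reported as a single catalytic module with one CBM3.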
Our analysis of the CAZy-related genes in C. bescii revealed that the NCBI annotation of several of these sequences is incorrect and/or incomplete. For example, Cbes_1853 is annotated as cellulose 1,4-β-cellobiosidase, Cbes_1857, _1860 and _1867 are annotated as glycoside hydrolases family 48 and Cbes_1865 is annotated as a glycoside hydrolase family 9. Based on comparisons with other sequences in the CAZy database, we propose that Cbes_1853 is a rhamnogalacturonan lyase; Cbes_1857, _1860, _1865 and _1867 are bifunctional enzymes containing GH10/GH48, GH74/GH48, GH9/GH5 and GH9/GH48, respectively. These changes are listed in Supplementary Table S8. Our new annotations suggest that these genes contain multiple domains, and such combination of multiple domains could be the key to biomass degradation.
A total of 257 genes in the C. bescii genome are predicted to encode transporters including 171 involved in sugar transport (Supplementary Table S9). Cellular transport systems can be classified into seven main classes (http://www.chem.qmul.ac.uk/iubmb/mtp/). Although the total number of transporter genes is similar in the genomes of the two Caldicellulosiruptor species, C. saccharolyticus contains 18 more genes of family 3.A.1 that transport organic and inorganic molecules of various sizes, while C. bescii has 11 more genes of family 2.A.1 that transport molecules of small sizes including lactose (Supplementary Table S10).
ABC transporters in bacterial genomes are composed of an inner membrane component (IMC) and an ATPase component. In the C. bescii genome (Supplementary Table S9) in most cases the IMCs are paired and encoded by one operon suggesting that the ABC transporter system is tetrameric (two IMCs and two ATPases). The ATPases are typically not linked and are located remotely from the IMCs, suggesting that one ATPase serves multiple IMCs (55,56). Multiple solute-binding proteins (SBPs) were also identified in both genomes. They are generally located close to IMCs, but often are predicted in separate operons. Many SBPs belong to functional category COG1653 that includes putative proteins transporting various oligosaccharides and simple sugars. In many cases sugar transport systems are found in the same operon or in the vicinity of genes encoding CAZy related proteins. This observation suggests that these transporters are involved in the transport of sugars released by the corresponding enzymes encoded by these CAZy related operons or genes. In particular, the gene cluster Cbes_0050-_0063 contains ABC transporters and four glycosyl transferases of families GT2 and GT4 that transfer mannosyl, rhamnosyl, N-acetyl-glucosaminyl, β-galactosaminyl and galactosyl, glucosyl, mannosyl or xylosyl groups, respectively. It seems likely that ABC transporter elements located in the same operon are involved in transport of related sugars. The neighboring operons, Cbes_0174-_0181 and Cbes_0182-_0187 encode elements of ABC transporters and glycoside hydrolases GH43, GH39 and GH10, which encode xylanase, xylosidase and arabinofuranosidases, respectively. Transporter operon Cbes_1107-_1112 is located close to genes Cbes_1103 (GH51 with putative activities endoglucanase or arabinofuranosidase) and Cbes_1104 (GH4 displays activities of α-glucosidase, α-galactosidase and α-glucuronidase) (Supplementary Table S9). 
These observations imply that genes encoding sugar transport and sugar metabolism are typically closely associated.
In comparing the pathways present in C. bescii and C. saccharolyticus assigned by the KEGG database, we found that both genomes are similar in terms of the number of genes present in assigned pathways, as shown in Supplementary Table S11. However, there is one pathway present in C. bescii only. Its genome includes four genes essential for the biosynthesis of deoxythymidine-diphosphate rhamnose (dTDP-l-rhamnose) from glucose-1-phosphate, which is produced from cellobiose by cellobiose phosphorylase (Supplementary Table S11 and Figure S5). This is of particular interest as the activated sugar donor, glucose-1-phosphate, could be an energy source or could participate in the glycosylation of extracellular proteins and flagella biosynthesis (57,58), particularly since the genome of C. bescii is enriched in genes related to secretion and motility. In some bacteria, arabinogalactan is attached to peptidoglycan via a rhamnose-N-acetylglucosamine disaccharide linker unit (59) so it is not clear whether this pathway in C. bescii is essential for conversion of components of plant biomass. There is also a difference between the two Caldicellulosiruptor species in alanine metabolism. In particular, C. bescii and C. saccharolyticus contain eight and one copy, respectively, of homologs of alanine racemase (EC 5.1.1.1), which reversibly converts l-alanine to d-alanine. However, they both contain only a single copy of d-alanine-d-alanine ligase (EC 6.3.2.4), which converts d-alanine to d-alanyl-d-alanine, an enzyme involved in peptidoglycan metabolism in Gram-positive bacteria. The consequences of this are not clear at present.
Seventeen CAZy genes in the genome of C. bescii do not have their closest relatives in C. saccharolyticus (based on Blast analysis; see Supplementary Table S12). These genes were probably acquired from thermophilic and mesophilic microorganisms. Fourteen of these microorganisms degrade polysaccharides and three of them produce ethanol, but they also include three methanogenic archaea, which are not known to degrade polysaccharides. Ten of the 17 C. bescii genes are organized into three clusters that contain multiple CAZy-related proteins: Cbes_0052-_0061 and Cbes_0154-_0157 are composed entirely of GTs transferred from mesophiles, whereas Cbes_1853-_1855 encodes PL enzymes acquired from a thermophile and two mesophiles and was incorporated into a region containing multiple GH- and CBM-containing genes. The latter gene cluster is discussed further below.
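The criterion used here, a gene whose closest relative is not in C. saccharolyticus, can be expressed as a simple top-hit filter. The following is a minimal sketch rather than the analysis behind Supplementary Table S12: it assumes that Blast hits have already been reduced to (gene, organism, bit score) tuples, that the "closest relative" is the hit with the highest bit score, and that ties are resolved in favour of the first hit encountered. The gene names in the usage note are hypothetical.

```python
REFERENCE = "Caldicellulosiruptor saccharolyticus"


def top_hit_organisms(hits):
    """Map each gene to the organism of its highest-scoring hit.

    `hits` is an iterable of (gene, organism, bitscore) tuples; this
    layout is an assumption made for the example.
    """
    best = {}
    for gene, organism, bitscore in hits:
        # Strict comparison keeps the first hit when scores are tied.
        if gene not in best or bitscore > best[gene][1]:
            best[gene] = (organism, bitscore)
    return {gene: organism for gene, (organism, _) in best.items()}


def transfer_candidates(hits, reference=REFERENCE):
    """Genes whose top hit lies outside the reference organism."""
    return {
        gene: organism
        for gene, organism in top_hit_organisms(hits).items()
        if organism != reference
    }
```

A gene whose best hit is in C. saccharolyticus is excluded, while a gene whose best hit is in, for example, C. phytofermentans is returned together with that organism name. A top hit in another organism is only suggestive of lateral transfer, so candidates flagged in this way would normally need further support, such as the clustering and phylogenetic context discussed in the text.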
There are also 25 genes related to ABC transporters that do not have their closest relatives in C. saccharolyticus (Supplementary Table S13). It is assumed that these were acquired by lateral gene transfer but in this case only from bacteria. The closest relatives of the 25 genes are found in 17 bacteria, many of which are capable of metabolizing polysaccharides with some generating ethanol as an end product. The ABC transporter genes appear to have been acquired predominantly  from mesophiles. Interestingly, 17 of the 25 genes are organized into 5 gene clusters, 3 of which are adjacent to 3 of the CAZy-gene clusters discussed above. In particular, cluster Cbes_2371-_2376 has four of its top Blast hits in C. phytofermentans, an organism that is capable of producing high concentrations of ethanol during cellulose fermentation. The same cluster encodes 3 ABC transporter genes, a GH43, a histidine kinase and a response regulator, suggesting that this six-gene cluster was horizontally transferred more or less intact from a Clostridium species, and may play a significant role in biomass degradation. Similarly, the Cbes_2076-_2094 cluster contains two ABC transporters (six genes), a GH2 and two integrase-related genes. A large number of genes in this cluster have their top Blast hits in two species, B. subtilis and D. thermophilum, indicating that this region could be a hot spot for DNA integration or genome rearrangement in C. bescii.
These data suggest that the exchange of genetic information has had a significant impact on the metabolic capabilities of C. bescii, and that this exchange has occurred between very different microorganisms, including (i) archaea and bacteria, (ii) aerobes and anaerobes, (iii) Gram-positive and Gram-negative bacteria and (iv) (hyper)thermophiles and mesophiles (and even psychrophiles). Moreover, these observations provide conclusive evidence for the divergent evolution of what appear to be two very closely related species, C. bescii and C. saccharolyticus.
In some cellulolytic microorganisms such as C. thermocellum, the strong interactions between the cells and the insoluble polysaccharide substrate are mediated by the cellulosome. Genome analysis of C. bescii shows that it does not produce a cellulosome complex, as no dockerin- or cohesin-like domains of either type I or type II were identified. In addition, genes encoding extracellular CAZy enzymes did not contain similar domains of unknown function that might encode new types of dockerins. However, microscopy studies show that C. bescii cells directly attach to xylan and switchgrass (Figure 3). The attachment is dynamic as many cells are also planktonic, enabling cell densities to be used as a measure of cell growth (14,15). Although the mechanism is not known, analysis of the genome of C. bescii reveals many genes that are predicted to encode modules that could be involved in such cell–substrate interactions (Table 4). They include surface-layer homology (SLH) domains, which are known to mediate the binding of proteins to cell surfaces (60), fibronectin type 3-like (Fn3) domains containing binding sites for the cell surface (http://pfam.sanger.ac.uk), and lysine motif (LysM) domains found in a variety of enzymes involved in bacterial cell wall degradation that may have a general peptidoglycan-binding function (61). In addition, C. bescii contains Fn3-like domains that have sequence similarity to so-called ‘X’ domains, which have been shown to bind carbohydrates (62).
Specifically, Cbes_0594 has an SLH domain combined with GH5 and CBM28. Binding of C. stercorarium xylanase to the cell wall via its SLH domains has been demonstrated (63). Cbes_0174 and Cbes_0181 contain bacterial solute-binding domains (SBPs), which are typically attached to an outer membrane and are components of sugar transport systems (64). Cbes_0174 has an N-terminal module and Cbes_0181 a C-terminal module with BLAST hits to CBM6 and pfam CBM_4_9, respectively (designated here as CBM_X, Table 4). All three proteins are candidates for binding to both the cell (by SLH, SBP) and polysaccharides (by CBM28, CBM_X). There are also many modules of unknown biological function listed in Table 4 (modules designated as ‘X’, LysM, Fn3, RHS, etc.) that contain signal peptides, could potentially be presented on the cell surface of C. bescii and may display novel catalytic and/or binding functions.
According to the NCBI annotation, the C. bescii genome contains a total of 826 ORFs of unknown function that are annotated as encoding either hypothetical (723 HP) or conserved hypothetical proteins (103 CHP: Supplementary Table S14). We have now assigned a putative function to 46 of them via the KEGG and COG databases, and using the CAZy database one further CHP is now annotated as a GT4 (Cbes_1572). In order to obtain some insight into the likely function of some of the other C/HPs, we utilized the fact that genes transcribed in the same operon or gene cluster are often functionally related (65,66). A gene cluster is defined here as a set of genes encoded on the same DNA strand with intergenic distances between adjacent genes of <300 bp (66). We found that 17 C/HPs are in the same operon with, or located adjacent to, CAZy genes; they are therefore predicted to be functionally related to carbohydrate metabolism and potentially plant biomass conversion. As an example, Supplementary Figure S6 shows genes encoding a CHP (Cbes_0178) associated with genes encoding sugar transporters, suggesting that this CHP is likely involved in the same function. Consequently, using the KEGG and COG annotations together with operon and gene cluster prediction analyses, putative functions can be assigned to a total of 295 HPs (41%, 428 remain unassigned) and 44 CHPs (43%, 59 remain unassigned: Supplementary Table S15).
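The gene cluster definition used above (same strand, intergenic distance between adjacent genes of <300 bp) can be illustrated with a minimal Python sketch. This is not the pipeline used in this study: the `Gene` record, the `cluster_genes` function, the coordinates and the locus names are all hypothetical, and the sketch assumes that "adjacent" means consecutive in genome order, so that an intervening gene on the opposite strand ends a cluster.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    locus: str   # hypothetical locus name
    start: int   # 1-based start coordinate
    end: int     # inclusive end coordinate
    strand: str  # "+" or "-"


def cluster_genes(genes, max_gap=300):
    """Group genes into clusters of consecutive same-strand genes
    whose intergenic distance is below max_gap (in bp)."""
    clusters = []
    for gene in sorted(genes, key=lambda g: g.start):
        if clusters:
            prev = clusters[-1][-1]
            gap = gene.start - prev.end - 1  # bases between the two genes
            if gene.strand == prev.strand and gap < max_gap:
                clusters[-1].append(gene)
                continue
        clusters.append([gene])  # start a new cluster
    return clusters


# Hypothetical coordinates, for illustration only.
genes = [
    Gene("geneA", 100, 900, "+"),
    Gene("geneB", 1000, 1800, "+"),  # 99 bp gap, same strand: joins geneA
    Gene("geneC", 1900, 2500, "-"),  # strand change: new cluster
    Gene("geneD", 3000, 3600, "-"),  # 499 bp gap: new cluster
]
clusters = cluster_genes(genes)
print([[g.locus for g in c] for c in clusters])
# [['geneA', 'geneB'], ['geneC'], ['geneD']]
```

Under this definition, a C/HP gene that falls in the same cluster as a CAZy or transporter gene becomes a candidate for a related function, which is the inference applied to Cbes_0178 above.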
A total of 1429 (54%) of 2666 predicted protein-coding sequences (PCSs) were confirmed by proteomic analyses (Supplementary Table S16) and 1790 (67.1%) PCSs were confirmed by transcriptomic data (Supplementary Tables S17 and S18). Therefore, a total of 2196 (83%) of the annotated PCSs were confirmed, including 46.6% by both methods, 18.5% by proteomics only and 34.9% by transcriptomics only. Among 88 genes annotated as CAZy-related genes, 59 (67.0%) are confirmed experimentally, including 28.8% by both methods. Among 826 PCSs that were annotated as encoding C/HPs, 613 (74%) were expressed on different substrates according to the transcriptomic and proteomic results. These data also allowed us to correct putative transcription unit (TU or operon) boundaries for 18 gene pairs (or 5% of the gene pairs with proteomic data: Supplementary Table S19). This leads to the splitting and merging of 20 TUs into 30 TUs. The 257 genes predicted to encode sugar transporters and CAZys are organized into 180 TUs (Supplementary Table S4). Of the 171 transporter genes predicted to be sugar-related and the 88 CAZy genes, expression at the RNA or protein level was detected for 136 (79%; Supplementary Table S9) and 84 (Supplementary Table S8), respectively.
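The breakdown of the 2196 confirmed PCSs follows from the three counts given above by inclusion-exclusion. The short Python check below uses only numbers stated in the text; the variable names and the `pct` helper are illustrative.

```python
# Counts taken from the text above.
proteomics = 1429       # PCSs confirmed by proteomic analyses
transcriptomics = 1790  # PCSs confirmed by transcriptomic data
confirmed = 2196        # PCSs confirmed by at least one method

# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
both = proteomics + transcriptomics - confirmed
proteomics_only = proteomics - both
transcriptomics_only = transcriptomics - both


def pct(part, whole):
    """Percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


print(both, proteomics_only, transcriptomics_only)  # 1023 406 767
print(pct(both, confirmed))                  # 46.6
print(pct(proteomics_only, confirmed))       # 18.5
print(pct(transcriptomics_only, confirmed))  # 34.9
```

The three percentages are fractions of the 2196 confirmed PCSs, not of all 2666 predicted PCSs, and the proteomic and transcriptomic figures refer to PCSs confirmed by that method alone.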
When C. bescii was grown on crystalline cellulose (filter paper) versus glucose, a total of 1203 genes had a significant change in expression level, as shown in Supplementary Table S17. These included 64 CAZys (32 down- and 32 up-regulated: Supplementary Table S8), 90 transporters (29 down- and 61 up-regulated: Supplementary Table S9) and 358 C/HPs (124 down- and 234 up-regulated: Supplementary Table S14). Among the 21 primary CAZy genes (encoding proteins with signal peptides and CBMs) (Supplementary Table S8), 16 were up-regulated on cellulose and 14 proteins were identified on both cellulose and xylan (Table 3). Of 21 genes putatively related to cell–substrate adhesion (Table 4), 12 genes were up-regulated on cellulose and 8 proteins were identified on cellulose and xylan.
More detailed analyses were conducted with operons/gene clusters (Supplementary Table S20, see also Tables S21 and S22). Among the gene clusters whose expression is up-regulated during growth on cellulose, six are of potential interest. These include (i) Cbes_1856-Cbes_1864 encoding the majority of CAZy multi-modular multifunctional enzymes discussed above, as well as two HPs; (ii) Cbes_2371-Cbes_2375 encoding GH43 and two membrane components of ABC transporters, a gene cluster that is missing in C. saccharolyticus; (iii) Cbes_2413-Cbes_2421, Cbes_2494-Cbes_2500 and Cbes_0261-Cbes_0265, all of which encode HPs; and (iv) Cbes_2591-Cbes_2595 encoding an α-amylase, a DNA repair protein and three HPs. The up-regulation of these gene clusters on cellulose suggests that they are involved in plant cell wall conversion. Five clusters, including genes encoding proteins of different metabolic pathways, were down-regulated on cellulose, indicating the plasticity of transcription regulation upon changing growth conditions. Upon switching from glucose to cellulose, four genes of the Cbes_1853-Cbes_1864 cluster are down-regulated while five other genes of the same cluster are up-regulated. This differential regulation supports our operon prediction that this cluster contains multiple transcription units.
Among 42 predicted TFs with significant changes in gene expression, 17 and 25 were down- and up-regulated, respectively, when cells were grown on crystalline cellulose versus glucose (Supplementary Table S23). It was previously suggested (67) that the level of expression of a TF is proportional to the number of operons that it regulates. This observation was used to predict the number of operons regulated by the TFs. Of 13 TFs with >4-fold changes, 7 and 6 were down- and up-regulated, respectively. In particular, TF Cbes_1856, a component of the major CAZy gene cluster, Cbes_1856-Cbes_1864, is up-regulated 3.4-fold, suggesting that it is involved in plant biomass conversion. Cbes_2264 is up-regulated 17-fold. This TF is part of an operon encoding sugar transporters (Cbes_2265-Cbes_2266), which are also up-regulated. These data suggest that these transporters utilize soluble oligomeric products of cellulose hydrolysis rather than glucose. In contrast, TF Cbes_2033 is down-regulated >8-fold, and it is located upstream of gene cluster Cbes_2029-Cbes_2031, which contains a predicted sugar transporter that is up-regulated 4-fold. This cluster is presumably not involved in cellulose metabolism. TF Cbes_1901 is down-regulated >9-fold, although the adjacent HP gene is up-regulated <2-fold, supporting the prediction that this TF regulates multiple operons (Supplementary Table S23).
It is also evident that the production of some CAZys, sugar transporters and C/HPs is sugar-specific (Supplementary Tables S8 and S16). Two of them were found only when cells were grown on xylan: Cbes_0618 (CBM22-CBM22-GH10) and Cbes_0152 (CE7). This is in accord with the CAZy annotation, as the CBM22 domain binds xylan, GH10 is an endo-xylanase and CE7 is an acetyl–xylan esterase. These proteins are assumed to play a pivotal role in hemicellulose degradation. Other proteins were detected only upon growth on cellulose and cellobiose, but not on xylan. They include Cbes_0097 (GH30) and Cbes_0458 (GH1), which are potential β-glucosidases related to cellulose degradation, Cbes_0468 (GH36, potential α-galactosidase) and Cbes_0609 (CBM41-CBM48-GH13-CBM20 with CBMs binding to α-linked polysaccharides and α-amylase). Production of the latter two proteins during growth on cellulose suggests that the stereospecificity of the sugar linkage is not important for the regulation of the respective genes. Cbes_0459 and Cbes_0460 are putative cellobiose/cellodextrin phosphorylases (GH94) detected in much higher amounts than β-glucosidases. This is consistent with the energetics of cellulose degradation, as cellobiose/cellodextrin phosphorylases provide an advantage for anaerobic cellulolytic microorganisms. They convert cellobiose/cellodextrins into glucose and glucose-1-phosphate without utilizing valuable ATP, which can be conserved for energy-consuming reactions. In contrast, β-glucosidase hydrolyses cellobiose into two glucose molecules, which must be phosphorylated with ATP before they can be utilized. Hence, in general, interpretation of the microarray and proteomic data is consistent with the CAZy database classification. Five and three sugar transporters were identified only after growth on xylan or on cellulose/cellobiose, respectively, consistent with the specificity of these proteins for certain oligosaccharides.
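The energetic advantage of phosphorolysis over hydrolysis described above can be made concrete with simple ATP bookkeeping. The sketch below rests on assumptions that are not measurements from this study: each free glucose costs one ATP to phosphorylate, glucose-1-phosphate enters glycolysis without ATP expenditure (via phosphoglucomutase), a cellodextrin of n glucose units is cleaved completely, and transport costs are ignored. The function names are illustrative.

```python
def atp_cost_hydrolysis(n_glucose_units):
    """ATP needed to phosphorylate the products of complete hydrolysis
    (beta-glucosidase route): n free glucose, one ATP each."""
    if n_glucose_units < 2:
        raise ValueError("a cellodextrin has at least two glucose units")
    return n_glucose_units


def atp_cost_phosphorolysis(n_glucose_units):
    """ATP needed after complete phosphorolysis (phosphorylase route):
    n - 1 glucose-1-phosphate (no ATP) plus one free glucose (one ATP)."""
    if n_glucose_units < 2:
        raise ValueError("a cellodextrin has at least two glucose units")
    return 1


def atp_saved(n_glucose_units):
    """ATP conserved by phosphorolysis relative to hydrolysis."""
    return atp_cost_hydrolysis(n_glucose_units) - atp_cost_phosphorolysis(n_glucose_units)


for n in (2, 3, 4, 5):  # cellobiose up to cellopentaose
    print(n, atp_cost_hydrolysis(n), atp_cost_phosphorolysis(n), atp_saved(n))
# 2 2 1 1
# 3 3 1 2
# 4 4 1 3
# 5 5 1 4
```

Under these assumptions the saving grows with chain length, from one ATP per cellobiose to n - 1 ATP per cellodextrin of n units.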
Furthermore, 8 and 16 C/HPs were detected on xylan and cellulose/cellobiose, respectively. Two of the xylan-specific proteins (Cbes_2729 and _2368), and 2 cellulose/cellobiose specific proteins (Cbes_2630 and _1288) were detected at relatively high levels. These data suggest that these previously uncharacterized proteins play important roles in hemicellulose/cellulose metabolism, even though they have no recognizable CAZy domains.
Caldicellulosiruptor bescii is the most thermophilic anaerobic bacterium capable of utilizing cellulose as well as multiple polysaccharides and unprocessed plant biomass. From an analysis of its genome, coupled with transcriptomic and proteomic data, we suggest that not one particular feature but a combination of properties that act in synergy enables the bacterium to degrade various polysaccharides and plant biomass.
Currently there is an increased interest in members of the Caldicellulosiruptor genus that display the ability to degrade multiple polysaccharides as well as plant biomass. Like the prototypical cellulose-degrader, C. thermocellum, these bacteria have a high potential for use in efficient two-step biomass-sugar–biofuel conversion processes. The data presented here are a valuable source of information that can be utilized for further characterization of the Caldicellulosiruptor species that will lead to a deeper understanding of the mechanisms of the non-cellulosomal plant biomass conversion process.
Supplementary Data are available at NAR Online.
This work was supported by the Bioenergy Science Center (BESC), Oak Ridge National Laboratory, a US Department of Energy Bioenergy Research Center supported by the Office of Biological and Environmental Research in the DOE Office of Science (contract no. DE-PS02-06ER64304) (DOE 4000063512); the University of California, Lawrence Berkeley National Laboratory (contract no. DE-AC02-05CH11231); Lawrence Livermore National Laboratory (contract no. DE-AC52-07NA27344); Los Alamos National Laboratory (contract no. DE-AC02-06NA25396); Agence Nationale de la Recherche, e-TRICEL (grant no. AANR-07-BIOE-006, to B.H.); National Science Foundation (DEB-0830024, DBI-0542119). Funding for open access charge: US Department of Energy (DE-AC05-00OR22725).
Conflict of interest statement. None declared.