New lignocellulolytic enzymes are needed that maintain optimal activity under the harsh conditions present during industrial enzymatic deconstruction of biomass, including high temperatures, the absence of free water, and the presence of inhibitors from the biomass. Enriching lignocellulolytic microbial communities under these conditions provides a source of microorganisms that may yield robust lignocellulolytic enzymes tolerant to the extreme conditions needed to improve the throughput and efficiency of biomass enzymatic deconstruction. Identification of promising enzymes from these systems is challenging due to complex substrate-enzyme interactions and requirements to assay for activity. In this study, metatranscriptomes from compost-derived microbial communities enriched on rice straw under thermophilic and mesophilic conditions were sequenced and analyzed to identify lignocellulolytic enzymes overexpressed under thermophilic conditions. To determine differential gene expression across mesophilic and thermophilic treatments, a method was developed which pooled gene expression by functional category, as indicated by Pfam annotations, since microbial communities performing similar tasks are likely to have overlapping functions even if they share no specific genes.
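The pooling step described above can be sketched as follows. This is a minimal illustration rather than the study's pipeline: the gene identifiers, read counts, and family labels are hypothetical toy values, and the rule that a multi-domain gene contributes its full count to each annotated family is an assumption. The statistical test used for differential expression is not shown.

```python
from collections import defaultdict


def pool_by_pfam(gene_counts, gene_to_pfams):
    """Sum per-gene read counts into per-family totals.

    Assumption: a gene annotated with several domains adds its full
    count to every one of those families.
    """
    pooled = defaultdict(int)
    for gene, count in gene_counts.items():
        for pfam in gene_to_pfams.get(gene, ()):
            pooled[pfam] += count
    return dict(pooled)


# Hypothetical toy data: the two communities share no gene identifiers.
thermo_counts = {"t_gene1": 120, "t_gene2": 30, "t_gene3": 50}
thermo_annot = {"t_gene1": ["GH48", "CBM2"], "t_gene2": ["CBM2"], "t_gene3": ["GH5"]}
meso_counts = {"m_gene1": 10, "m_gene2": 40}
meso_annot = {"m_gene1": ["CBM2"], "m_gene2": ["GH5"]}

thermo_pooled = pool_by_pfam(thermo_counts, thermo_annot)
meso_pooled = pool_by_pfam(meso_counts, meso_annot)

# Families become the common axis for comparison, even without shared genes.
families = sorted(set(thermo_pooled) | set(meso_pooled))
comparison = {f: (thermo_pooled.get(f, 0), meso_pooled.get(f, 0)) for f in families}
```

Because the comparison is keyed on protein family rather than on gene, a family expressed by entirely different genes in each community still lines up in a single row.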
Differential expression analysis identified enzymes from glycoside hydrolase family 48, carbohydrate binding module family 2, and carbohydrate binding module family 33 domains as significantly overexpressed in the thermophilic community. Overexpression of these protein families in the thermophilic community resulted from expression of a small number of genes not currently represented in any protein database. Genes in overexpressed protein families were predominantly expressed by a single Actinobacteria genus, Micromonospora.
Coupling measurements of deconstructive activity with comparative analyses to identify overexpressed enzymes in lignocellulolytic communities provides a targeted approach for discovery of candidate enzymes for more efficient biomass deconstruction. Glycoside hydrolase family 48 cellulases and carbohydrate binding module family 33 polysaccharide monooxygenases with carbohydrate binding module family 2 domains may improve saccharification of lignocellulosic biomass under high-temperature and low-moisture conditions relevant to industrial biofuel production.
Lignocellulose deconstruction; Solid-state culture; Microbial communities; Biofuels; Cellulase; Glycoside hydrolase family 48; Carbohydrate binding module family 2; Carbohydrate binding module family 33
The development of advanced biofuels from lignocellulosic biomass will require the use of both efficient pretreatment methods and new biomass-deconstructing enzyme cocktails to generate sugars from lignocellulosic substrates. Certain ionic liquids (ILs) have emerged as a promising class of compounds for biomass pretreatment and have been demonstrated to reduce the recalcitrance of biomass for enzymatic hydrolysis. However, current commercial cellulase cocktails are strongly inhibited by most of the ILs that are effective biomass pretreatment solvents. Fortunately, recent research has shown that IL-tolerant cocktails can be formulated and are functional on lignocellulosic biomass. This study sought to expand the list of known IL-tolerant cellulases to further enable IL-tolerant cocktail development by developing a combined in vitro/in vivo screening pipeline for metagenome-derived genes.
Thirty-seven predicted cellulases derived from a thermophilic switchgrass-adapted microbial community were screened in this study. Eighteen of the twenty-one enzymes that expressed well in E. coli were active in the presence of the IL 1-ethyl-3-methylimidazolium acetate ([C2mim][OAc]) at concentrations of at least 10% (v/v), with several retaining activity in the presence of 40% (v/v), which is currently the highest reported tolerance to [C2mim][OAc] for any cellulase. In addition, the optimum temperatures of the enzymes ranged from 45 to 95°C and the pH optima ranged from 5.5 to 7.5, indicating that these enzymes can be used to construct cellulase cocktails that function under a broad range of temperatures, pH values and IL concentrations.
This study characterized in detail twenty-one cellulose-degrading enzymes derived from a thermophilic microbial community and found that 70% of them were [C2mim][OAc]-tolerant. A comparison of optimum temperature and [C2mim][OAc]-tolerance demonstrates that a positive correlation exists between these properties for those enzymes with an optimum temperature >70°C, further strengthening the link between thermotolerance and IL-tolerance for lignocellulolytic glycoside hydrolases.
Cellulase; Ionic liquid; Thermophilic; Biofuel
High-solids incubations were performed to enrich for microbial communities and enzymes that decompose rice straw under mesophilic (35°C) and thermophilic (55°C) conditions. Thermophilic enrichments yielded a community that was 7.5 times more metabolically active on rice straw than mesophilic enrichments. Extracted xylanase and endoglucanase activities were also 2.6 and 13.4 times greater, respectively, for thermophilic enrichments. Metagenome sequencing was performed on enriched communities to determine community composition and mine for genes encoding lignocellulolytic enzymes. Proteobacteria were found to dominate the mesophilic community while Actinobacteria were most abundant in the thermophilic community. Analysis of protein family representation in each metagenome indicated that cellobiohydrolases containing carbohydrate binding module 2 (CBM2) were significantly overrepresented in the thermophilic community. Micromonospora, a member of Actinobacteria, primarily housed these genes in the thermophilic community. In light of these findings, Micromonospora and other closely related Actinobacteria genera appear to be promising sources of thermophilic lignocellulolytic enzymes for rice straw deconstruction under high-solids conditions. Furthermore, these discoveries warrant future research to determine if exoglucanases with CBM2 represent thermostable enzymes tolerant to the process conditions expected to be encountered during industrial biofuel production.
Thermophilic bacteria are a potential source of enzymes for the deconstruction of lignocellulosic biomass. However, the complement of proteins used to deconstruct biomass and the specific roles of different microbial groups in thermophilic biomass deconstruction are not well-explored. Here we report on the metagenomic and proteogenomic analyses of a compost-derived bacterial consortium adapted to switchgrass at elevated temperature with high levels of glycoside hydrolase activities. Near-complete genomes were reconstructed for the most abundant populations, which included composite genomes for populations closely related to sequenced strains of Thermus thermophilus and Rhodothermus marinus, and for novel populations that are related to thermophilic Paenibacilli and an uncultivated subdivision of the little-studied Gemmatimonadetes phylum. Partial genomes were also reconstructed for a number of lower abundance thermophilic Chloroflexi populations. Identification of genes for lignocellulose processing and metabolic reconstructions suggested Rhodothermus, Paenibacillus and Gemmatimonadetes as key groups for deconstructing biomass, and Thermus as a group that may primarily metabolize low molecular weight compounds. Mass spectrometry-based proteomic analysis of the consortium was used to identify >3000 proteins in fractionated samples from the cultures, and confirmed the importance of Paenibacillus and Gemmatimonadetes to biomass deconstruction. These studies also indicate that there are unexplored proteins with important roles in bacterial lignocellulose deconstruction.
In the future, we may be faced with the need to provide treatment for an emergent biological threat against which existing vaccines and drugs have limited efficacy or availability. To prepare for this eventuality, our objective was to use a metabolic network-based approach to rapidly identify potential drug targets and prospectively screen and validate novel small-molecule antimicrobials. Our target organism was the fully virulent Francisella tularensis subspecies tularensis Schu S4 strain, a highly infectious intracellular pathogen that is the causative agent of tularemia and is classified as a category A biological agent by the Centers for Disease Control and Prevention. We proceeded with a staggered computational and experimental workflow that used a strain-specific metabolic network model, homology modeling and X-ray crystallography of protein targets, and ligand- and structure-based drug design. Selected compounds were subsequently filtered based on physiologically based pharmacokinetic modeling, and we selected a final set of 40 compounds for experimental validation of antimicrobial activity. We began screening these compounds in whole bacterial cell-based assays in biosafety level 3 facilities in the 20th week of the study and completed the screens within 12 weeks. Six compounds showed significant growth inhibition of F. tularensis, and we determined their respective minimum inhibitory concentrations and mammalian cell cytotoxicities. The most promising compound had a low molecular weight, was non-toxic, and abolished bacterial growth at 13 µM, with putative activity against pantetheine-phosphate adenylyltransferase, an enzyme involved in the biosynthesis of coenzyme A, encoded by gene coaD. The novel antimicrobial compounds identified in this study serve as starting points for lead optimization, animal testing, and drug development against tularemia.
Our integrated in silico/in vitro approach had an overall 15% success rate in terms of active versus tested compounds over an elapsed time period of 32 weeks, from pathogen strain identification to selection and validation of novel antimicrobial compounds.
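The two summary figures follow directly from the numbers reported above, as the short check below shows. The variable names are illustrative, and treating the elapsed time as the screening start week plus the screening duration is an assumption about how the 32-week figure was reached.

```python
# Values taken from the text: 40 compounds tested, 6 active,
# screening started in week 20 and took 12 weeks.
compounds_tested = 40
compounds_active = 6
screen_start_week = 20
screen_duration_weeks = 12

# Success rate as active versus tested compounds.
success_rate_percent = 100 * compounds_active / compounds_tested

# Assumed accounting: elapsed time = start week + screening duration.
elapsed_weeks = screen_start_week + screen_duration_weeks
```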
Tropical forest soils decompose litter rapidly with frequent episodes of anoxia, making it likely that bacteria using alternate terminal electron acceptors (TEAs) such as iron play a large role in supporting decomposition under these conditions. The prevalence of many types of metabolism in litter deconstruction makes these soils useful templates for improving biofuel production. To investigate how iron availability affects decomposition, we cultivated feedstock-adapted consortia (FACs) derived from iron-rich tropical forest soils accustomed to experiencing frequent episodes of anaerobic conditions and frequently fluctuating redox. One consortium was propagated under fermenting conditions, with switchgrass as the sole carbon source in minimal media (SG only FACs), and the other consortium was treated the same way but received poorly crystalline iron as an additional terminal electron acceptor (SG + Fe FACs). We sequenced the metagenomes of both consortia to a depth of about 150 Mb each, resulting in a coverage of 26× for the more diverse SG + Fe FACs, and 81× for the relatively less diverse SG only FACs. Both consortia were able to quickly grow on switchgrass, and the iron-amended consortium exhibited significantly higher microbial diversity than the unamended consortium. We found evidence of higher stress in the unamended FACs and increased sugar transport and utilization in the iron-amended FACs. This work provides metagenomic evidence that supplementation of alternative TEAs may improve feedstock deconstruction in biofuel production.
Anaerobic decomposition; switchgrass; Panicum virgatum; tropical forest soil; feedstock-adapted consortia; bacteria; archaea; metagenomics
The Deepwater Horizon oil spill in the Gulf of Mexico is the deepest and largest offshore spill in United States history, and its impacts on marine ecosystems are largely unknown. Here, we showed that the microbial community functional composition and structure were dramatically altered in a deep-sea oil plume resulting from the spill. A variety of metabolic genes involved in both aerobic and anaerobic hydrocarbon degradation were highly enriched in the plume compared with outside the plume, indicating a great potential for intrinsic bioremediation or natural attenuation in the deep sea. Various other microbial functional genes that are relevant to carbon, nitrogen, phosphorus, sulfur and iron cycling, metal resistance and bacteriophage replication were also enriched in the plume. Together, these results suggest that the indigenous marine microbial communities could have a significant role in biodegradation of oil spills in deep-sea environments.
oil spill; deep-sea plume; microbial community; metagenomics; functional gene arrays; GeoChip
Metagenomics approaches provide access to environmental genetic diversity for biotechnology applications, enabling the discovery of new enzymes and pathways for numerous catalytic processes. Discovery of new glycoside hydrolases with improved biocatalytic properties for the efficient conversion of lignocellulosic material to biofuels is a critical challenge in the development of economically viable routes from biomass to fuels and chemicals.
Twenty-two putative ORFs (open reading frames) were identified from a switchgrass-adapted compost community based on sequence homology to related gene families. These ORFs were expressed in E. coli and assayed for predicted activities. Seven of the ORFs were demonstrated to encode active enzymes, encompassing five classes of hemicellulases. Four enzymes were overexpressed in vivo, purified to homogeneity and subjected to detailed biochemical characterization. Their pH optima ranged from 5.5 to 7.5 and they exhibited moderate thermostability up to ~60-70°C.
Seven active enzymes, comprising five different hemicellulase activities, were identified from this set of ORFs. These enzymes have been shown to have useful properties, such as moderate thermal stability and broad pH optima, and may serve as the starting points for future protein engineering towards the goal of developing efficient enzyme cocktails for biomass degradation under diverse process conditions.
Generation of biofuels from sugars in lignocellulosic biomass is a promising alternative to liquid fossil fuels, but efficient and inexpensive bioprocessing configurations must be developed to make this technology commercially viable. One of the major barriers to commercialization is the recalcitrance of plant cell wall polysaccharides to enzymatic hydrolysis. Biomass pretreatment with ionic liquids (ILs) enables efficient saccharification of biomass, but residual ILs inhibit both saccharification and microbial fuel production, requiring extensive washing after IL pretreatment. Pretreatment itself can also produce biomass-derived inhibitory compounds that reduce microbial fuel production. Therefore, there are multiple points in the process from biomass to biofuel production that must be interrogated and optimized to maximize fuel production. Here, we report the development of an IL-tolerant cellulase cocktail by combining thermophilic bacterial glycoside hydrolases produced by a mixed consortium with recombinant glycoside hydrolases. This enzymatic cocktail saccharifies IL-pretreated biomass at higher temperatures and in the presence of much higher IL concentrations than commercial fungal cocktails. Sugars obtained from saccharification of IL-pretreated switchgrass using this cocktail can be converted into biodiesel (fatty acid ethyl-esters or FAEEs) by a metabolically engineered strain of E. coli. During these studies, we found that this biodiesel-producing E. coli strain was sensitive to ILs and inhibitors released by saccharification. This cocktail will enable the development of novel biomass to biofuel bioprocessing configurations that may overcome some of the barriers to production of inexpensive cellulosic biofuels.
In microbial communities, extracellular polymeric substances (EPS), also called the extracellular matrix, provide the spatial organization and structural stability during biofilm development. One of the major components of EPS is protein, but it is not clear what specific functions these proteins contribute to the extracellular matrix or to microbial physiology. To investigate this in biofilms from an extremely acidic environment, we used shotgun proteomics analyses to identify proteins associated with EPS in biofilms at two developmental stages, designated DS1 and DS2. The proteome composition of the EPS was significantly different from that of the cell fraction, with more than 80% of the cellular proteins underrepresented or undetectable in EPS. In contrast, predicted periplasmic, outer membrane, and extracellular proteins were overrepresented by 3- to 7-fold in EPS. Also, EPS proteins were more basic by ∼2 pH units on average and about half the length. When categorized by predicted function, proteins involved in motility, defense, cell envelope, and unknown functions were enriched in EPS. Chaperones, such as histone-like DNA binding protein and cold shock protein, were overrepresented in EPS. Enzymes, such as protein peptidases, disulfide-isomerases, and those associated with cell wall and polysaccharide metabolism, were also detected. Two of these enzymes, identified as β-N-acetylhexosaminidase and cellulase, were confirmed in the EPS fraction by enzymatic activity assays. Compared to the differences between EPS and cellular fractions, the relative differences in the EPS proteomes between DS1 and DS2 were smaller and consistent with expected physiological changes during biofilm development.
Thioalkalivibrio sp. K90mix is an obligately chemolithoautotrophic, natronophilic sulfur-oxidizing bacterium (SOxB) belonging to the family Ectothiorhodospiraceae within the Gammaproteobacteria. The strain was isolated from a mixture of sediment samples obtained from different soda lakes located in the Kulunda Steppe (Altai, Russia) based on its extreme potassium carbonate tolerance as an enrichment method. Here we report the complete genome sequence of strain K90mix and its annotation. The genome was sequenced within the Joint Genome Institute Community Sequencing Program, because of its relevance to the sustainable removal of sulfide from wastewater and gas streams.
natronophilic; sulfide; thiosulfate; sulfur-oxidizing bacteria; soda lakes
In an effort to discover anaerobic bacteria capable of lignin degradation, we isolated “Enterobacter lignolyticus” SCF1 on minimal media with alkali lignin as the sole source of carbon. This organism was isolated anaerobically from tropical forest soils collected from the Short Cloud Forest site in the El Yunque National Forest in Puerto Rico, USA, part of the Luquillo Long-Term Ecological Research Station. At this site, the soils experience strong fluctuations in redox potential and are net methane producers. Because of its ability to grow on lignin anaerobically, we sequenced the genome. The genome of “E. lignolyticus” SCF1 is 4.81 Mbp with no detected plasmids, and includes a relatively small arsenal of lignocellulolytic carbohydrate active enzymes. Lignin degradation was observed in culture, and the genome revealed two putative laccases, a putative peroxidase, and a complete 4-hydroxyphenylacetate degradation pathway encoded in a single gene cluster.
Anaerobic lignin degradation; tropical forest soil isolate; facultative anaerobe
Sequencing of bacterial and archaeal genomes has revolutionized our understanding of the many roles played by microorganisms [1]. There are now nearly 1,000 completed bacterial and archaeal genomes available [2], most of which were chosen for sequencing on the basis of their physiology. As a result, the perspective provided by the currently available genomes is limited by a highly biased phylogenetic distribution [3–5]. To explore the value added by choosing microbial genomes for sequencing on the basis of their evolutionary relationships, we have sequenced and analysed the genomes of 56 culturable species of Bacteria and Archaea selected to maximize phylogenetic coverage. Analysis of these genomes demonstrated pronounced benefits (compared to an equivalent set of genomes randomly selected from the existing database) in diverse areas including the reconstruction of phylogenetic history, the discovery of new protein families and biological properties, and the prediction of functions for known genes from other organisms. Our results strongly support the need for systematic ‘phylogenomic’ efforts to compile a phylogeny-driven ‘Genomic Encyclopedia of Bacteria and Archaea’ in order to derive maximum knowledge from existing microbial genome data as well as from genome sequences to come.
Development of cellulosic biofuels from non-food crops is currently an area of intense research interest. Tailoring depolymerizing enzymes to particular feedstocks and pretreatment conditions is one promising avenue of research in this area. Here we added a green-waste compost inoculum to switchgrass (Panicum virgatum) and simulated thermophilic composting in a bioreactor to select for a switchgrass-adapted community and to facilitate targeted discovery of glycoside hydrolases. Small-subunit (SSU) rRNA-based community profiles revealed that the microbial community changed dramatically between the initial and switchgrass-adapted compost (SAC), with some bacterial populations being enriched over 20-fold. We obtained 225 Mbp of 454-titanium pyrosequence data from the SAC community and conservatively identified 800 genes encoding glycoside hydrolase domains that were biased toward depolymerizing grass cell wall components. Of these, ∼10% were putative cellulases mostly belonging to families GH5 and GH9. We synthesized two SAC GH9 genes with codon optimization for heterologous expression in Escherichia coli and observed activity for one on carboxymethyl cellulose. The active GH9 enzyme has a temperature optimum of 50°C and a pH range of 5.5 to 8, consistent with the composting conditions applied. We demonstrate that microbial communities adapt to switchgrass decomposition under simulated composting conditions and that full-length genes can be identified from complex metagenomic sequence data, synthesized and expressed, resulting in active enzyme.
Slackia heliotrinireducens (Lanigan 1983) Wade et al. 1999 is of phylogenetic interest because of its location in a genomically yet uncharted section of the family Coriobacteriaceae, within the deep branching Actinobacteria. Strain RHS 1T was originally isolated from the ruminal flora of a sheep. It is a proteolytic anaerobic coccus, able to reductively cleave pyrrolizidine alkaloids. Here we describe the features of this organism, together with the complete genome sequence, and annotation. This is the first complete genome sequence of the genus Slackia, and the 3,165,038 bp long single replicon genome with its 2798 protein-coding and 60 RNA genes is part of the Genomic Encyclopedia of Bacteria and Archaea project.
Gram-positive coccus; anaerobic; asaccharolytic; pyrrolizidine alkaloids; Coriobacteriaceae
Sanguibacter keddieii is the type species of the genus Sanguibacter, the only genus within the family of Sanguibacteraceae. Phylogenetically, this family is located in the neighborhood of the genus Oerskovia and the family Cellulomonadaceae within the actinobacterial suborder Micrococcineae. The strain described in this report was isolated from blood of apparently healthy cows. Here we describe the features of this organism, together with the complete genome sequence, and annotation. This is the first complete genome sequence of a member of the family Sanguibacteraceae, and the 4,253,413 bp long single replicon genome with its 3735 protein-coding and 70 RNA genes is part of the Genomic Encyclopedia of Bacteria and Archaea project.
blood isolate; aerobic; facultative anaerobic; Sanguibacteraceae; Micrococcineae
Cryptobacterium curtum Nakazawa et al. 1999 is the type species of the genus, and is of phylogenetic interest because of its very distant and isolated position within the family Coriobacteriaceae. C. curtum is an asaccharolytic, opportunistic pathogen with a typical occurrence in the oral cavity, involved in dental and oral infections like periodontitis, inflammations and abscesses. Here we describe the features of this organism, together with the complete genome sequence, and annotation. This is the first complete genome sequence of the actinobacterial family Coriobacteriaceae, and this 1,617,804 bp long single replicon genome with its 1364 protein-coding and 58 RNA genes is part of the Genomic Encyclopedia of Bacteria and Archaea project.
oral infections; opportunistic pathogenic; periodontitis; non-spore-former; anaerobic; asaccharolytic; Coriobacteriaceae
Halogeometricum borinquense Montalvo-Rodríguez et al. 1998 is the type species of the genus, and is of phylogenetic interest because of its distinct location between the halobacterial genera Haloquadratum and Halosarcina. H. borinquense requires extremely high salt (NaCl) concentrations for growth. It can not only grow aerobically but also anaerobically using nitrate as electron acceptor. The strain described in this report is a free-living, motile, pleomorphic, euryarchaeon, which was originally isolated from the solar salterns of Cabo Rojo, Puerto Rico. Here we describe the features of this organism, together with the complete genome sequence, and annotation. This is the first complete genome sequence of the halobacterial genus Halogeometricum, and this 3,944,467 bp long six replicon genome with its 3937 protein-coding and 57 RNA genes is part of the Genomic Encyclopedia of Bacteria and Archaea project.
halophile; free-living; non-pathogenic; aerobic; pleomorphic cells; euryarchaeon
Saccharomonospora viridis (Schuurmans et al. 1956) Nonomura and Ohara 1971 is the type species of the genus Saccharomonospora, which belongs to the family Pseudonocardiaceae. S. viridis is of interest because it is a Gram-negative organism classified among the usually Gram-positive actinomycetes. Members of the species are frequently found in hot compost and hay, and its spores can cause farmer’s lung disease, bagassosis, and humidifier fever. Strains of the species S. viridis have been found to metabolize the xenobiotic pentachlorophenol (PCP). The strain described in this study was isolated from a peat bog in Ireland. Here we describe the features of this organism, together with the complete genome sequence, and annotation. This is the first complete genome sequence of the family Pseudonocardiaceae, and the 4,308,349 bp long single replicon genome with its 3906 protein-coding and 64 RNA genes is part of the Genomic Encyclopedia of Bacteria and Archaea project.
thermophile; hot compost; Gram-negative actinomycete; farmer’s lung disease; bagassosis; humidifier fever; pentachlorophenol metabolism; Pseudonocardiaceae
Brachybacterium faecium Collins et al. 1988 is the type species of the genus, and is of phylogenetic interest because of its location in the Dermabacteraceae, a rather isolated family within the actinobacterial suborder Micrococcineae. B. faecium is known for its rod-coccus growth cycle and the ability to degrade uric acid. It grows aerobically or weakly anaerobically. The strain described in this report is a free-living, nonmotile, Gram-positive bacterium, originally isolated from poultry deep litter. Here we describe the features of this organism, together with the complete genome sequence, and annotation. This is the first complete genome sequence of a member of the actinobacterial family Dermabacteraceae, and the 3,614,992 bp long single replicon genome with its 3129 protein-coding and 69 RNA genes is part of the Genomic Encyclopedia of Bacteria and Archaea project.
mesophile; free-living; non-pathogenic; aerobic; rod-coccus growth cycle; uric acid degradation; Dermabacteraceae
Kytococcus sedentarius (ZoBell and Upham 1944) Stackebrandt et al. 1995 is the type strain of the species, and is of phylogenetic interest because of its location in the Dermacoccaceae, a poorly studied family within the actinobacterial suborder Micrococcineae. Kytococcus sedentarius is known for the production of oligoketide antibiotics as well as for its role as an opportunistic pathogen causing valve endocarditis, hemorrhagic pneumonia, and pitted keratolysis. It is strictly aerobic and can only grow when several amino acids are provided in the medium. The strain described in this report is a free-living, nonmotile, Gram-positive bacterium, originally isolated from a marine environment. Here we describe the features of this organism, together with the complete genome sequence, and annotation. This is the first complete genome sequence of a member of the family Dermacoccaceae and the 2,785,024 bp long single replicon genome with its 2639 protein-coding and 64 RNA genes is part of the Genomic Encyclopedia of Bacteria and Archaea project.
mesophile; free-living; marine; aerobic; opportunistic pathogenic; Dermacoccaceae
Motivation: Microbial phenotypes are typically due to the concerted action of multiple gene functions, yet the presence of each gene may have only a weak correlation with the observed phenotype. Hence, it may be more appropriate to examine co-occurrence between sets of genes and a phenotype (multiple-to-one) instead of pairwise relations between a single gene and the phenotype. Here, we propose an efficient class association rule mining algorithm, netCAR, in order to extract sets of COGs (clusters of orthologous groups of proteins) associated with a phenotype from COG phylogenetic profiles and a phenotype profile. netCAR takes into account the phylogenetic co-occurrence graph between COGs to restrict hypothesis space, and uses mutual information to evaluate the biconditional relation.
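The multiple-to-one idea can be illustrated with a small sketch. It is not the netCAR implementation: the presence/absence profiles are hypothetical toy data, the rule that a COG set counts as present only when all of its members are present is an assumption, and the co-occurrence graph used to restrict the hypothesis space is omitted.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Mutual information in bits between two equal-length discrete profiles."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * log2(c * n / (px[a] * py[b]))
    return mi


def set_profile(profiles):
    """Assumed rule: a COG set is present in an organism only if every member is."""
    return [int(all(column)) for column in zip(*profiles)]


# Hypothetical presence/absence profiles across 8 organisms.
cog_a = [1, 1, 1, 1, 1, 1, 0, 0]
cog_b = [1, 1, 1, 1, 0, 0, 1, 1]
phenotype = [1, 1, 1, 1, 0, 0, 0, 0]

mi_single_a = mutual_information(cog_a, phenotype)
mi_single_b = mutual_information(cog_b, phenotype)
mi_pair = mutual_information(set_profile([cog_a, cog_b]), phenotype)
```

In this toy case each COG alone carries only partial information about the phenotype, while the pair taken together matches it exactly, which is the situation the multiple-to-one formulation is meant to capture.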
Results: We examined the mining capability of pairwise and multiple-to-one association by using netCAR to extract COGs relevant to six microbial phenotypes (aerobic, anaerobic, facultative, endospore, motility and Gram negative) from 11 969 unique COG profiles across 155 prokaryotic organisms. With the same level of false discovery rate, multiple-to-one association can extract about 10 times more relevant COGs than one-to-one association. We also reveal various topologies of association networks among COGs (modules) from extracted multiple-to-one correlation rules relevant to the six phenotypes, including a well-connected network for motility, a star-shaped network for aerobic and intermediate topologies for the other phenotypes. netCAR outperforms a standard CAR mining algorithm, CARapriori, while requiring several orders of magnitude less computational time for extracting 3-COG sets.
Availability: Source code of the Java implementation is available as Supplementary Material at the Bioinformatics online website, or upon request to the author.
Supplementary information: Supplementary data are available at Bioinformatics online.
Transcription regulation has been responsible for organismal complexity and diversity in the course of biological evolution and adaptation, and it is determined largely by the context-dependent behavior of cis-regulatory elements (CREs). Therefore, understanding principles underlying CRE behavior in regulating transcription constitutes a fundamental objective of quantitative biology, yet these remain poorly understood. Here we present a deterministic mathematical strategy, the motif expression decomposition (MED) method, for deriving principles of transcription regulation at the single-gene resolution level. MED operates on all genes in a genome without requiring any a priori knowledge of gene cluster membership, or manual tuning of parameters. Applying MED to Saccharomyces cerevisiae transcriptional networks, we identified four functions describing four different ways that CREs can quantitatively affect gene expression levels. Three of these functions have extrema in different positions in the gene promoter (short-, mid-, and long-range), whereas the fourth depends on the motif orientation; all four are validated by expression data. We illustrate how nature could use these principles as an additional dimension to amplify the combinatorial power of a small set of CREs in regulating transcription.
computational method; matrix factorization; MED; principles of transcription regulation; transcriptional regulatory networks; yeast
Intracellular signal transduction is achieved by networks of proteins and small molecules that transmit information from the cell surface to the nucleus, where they ultimately effect transcriptional changes. Understanding the mechanisms cells use to accomplish this important process requires a detailed molecular description of the networks involved.
We have developed a computational approach for generating static models of signal transduction networks which utilizes protein-interaction maps generated from large-scale two-hybrid screens and expression profiles from DNA microarrays. Networks are determined entirely by integrating protein-protein interaction data with microarray expression data, without prior knowledge of any pathway intermediates. In effect, this is equivalent to extracting subnetworks of the protein interaction dataset whose members have the most correlated expression profiles.
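One way to picture the integration described above is the following sketch, which keeps only those interaction edges whose two proteins have well-correlated expression profiles and then reads off the connected groups. It is an illustration under stated assumptions, not the published method: the protein names and expression values are hypothetical, and the use of Pearson correlation, a fixed threshold, and connected components are all choices made here for the example.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation of two equal-length numeric profiles."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su and sv else 0.0


def correlated_subnetworks(edges, expression, threshold):
    """Keep edges whose endpoints correlate at or above `threshold`,
    then return the connected components of the filtered graph."""
    adjacency = {}
    for a, b in edges:
        if pearson(expression[a], expression[b]) >= threshold:
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical interaction edges and expression profiles.
edges = [("P1", "P2"), ("P2", "P3"), ("P3", "P4")]
expression = {
    "P1": [1, 2, 3, 4],
    "P2": [2, 4, 6, 8],
    "P3": [1, 2, 3, 5],
    "P4": [4, 3, 2, 1],
}
components = correlated_subnetworks(edges, expression, threshold=0.9)
```

Here the P3 to P4 interaction is discarded because the two expression profiles are anti-correlated, so P4 drops out and the remaining three proteins form a single candidate subnetwork.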
We show that our technique accurately reconstructs MAP Kinase signaling networks in Saccharomyces cerevisiae. This approach should enhance our ability to model signaling networks and to discover new components of known networks. More generally, it provides a method for synthesizing molecular data, either individual transcript abundance measurements or pairwise protein interactions, into higher level structures, such as pathways and networks.