experimental validation; hypothetical proteins; crowdsourcing; high-throughput; traceability
Glioblastoma (GBM) is thought to be driven by a sub-population of cancer stem cells (CSCs) that self-renew and recapitulate tumor heterogeneity, yet remain poorly understood. Here we present a comparative histone modification analysis of GBM CSCs that reveals widespread activation of genes normally held in check by Polycomb repressors. These activated targets include a large set of developmental transcription factors (TFs) whose coordinated activation is unique to the CSCs. We demonstrate that a critical factor in the set, ASCL1, activates Wnt signaling by repressing the negative regulator DKK1. We show that ASCL1 is essential for maintenance and in vivo tumorigenicity of GBM CSCs. Genomewide binding profiles for ASCL1 and the Wnt effector LEF1 provide mechanistic insight and suggest widespread interactions between the TF module and the signaling pathway. Our findings demonstrate regulatory connections between ASCL1, Wnt signaling and collaborating TFs that are essential for the maintenance and tumorigenicity of GBM CSCs.
Experimental data exists for only a vanishingly small fraction of sequenced microbial genes. This community page discusses the progress made by the COMBREX project to address this important issue using both computational and experimental resources.
Flux balance analysis and constraint-based modeling have been successfully used in the past to elucidate the metabolism of single-celled organisms. However, limited work has been done with multicellular organisms and even less with humans. The focus of this paper is to present a novel use of this technique by investigating human nutrition, a challenging field of study. Specifically, we present a steady-state constraint-based model of skeletal muscle tissue to investigate the effect of amino acid supplementation on protein synthesis. We implement several in silico supplementation strategies to study whether amino acid supplementation might be beneficial for increasing muscle contractile protein synthesis. Consistent with published data on the effect of amino acid supplementation on protein synthesis in a post-resistance-exercise state, our results suggest that increasing bioavailability of methionine, arginine, and the branched-chain amino acids can increase the flux of contractile protein synthesis. The study also suggests that a common commercial supplement, glutamine, is not an effective supplement in the context of increasing protein synthesis and thus muscle mass. As with any study in a model organism, the computational modeling in this research has some limitations. Overall, this paper introduces the prospect of using systems biology as a framework to formally investigate how supplementation and nutrition can affect human metabolism and physiology.
The functional characterization of Open Reading Frames (ORFs) from sequenced genomes remains a bottleneck in our effort to understand microbial biology. In particular, the functional characterization of proteins with only remote sequence homology to known proteins can be challenging, as there may be few clues to guide initial experiments. Affinity enrichment of proteins from cell lysates, and a global perspective of protein function as provided by COMBREX, afford an approach to this problem. We present here the biochemical analysis of six proteins from Helicobacter pylori ATCC 26695, a focus organism in COMBREX. Initial hypotheses were based upon affinity capture of proteins from total cellular lysate using derivatized nano-particles, and subsequent identification by mass spectrometry. Candidate genes encoding these proteins were cloned and expressed in Escherichia coli; the recombinant proteins were purified and characterized biochemically, and their biochemical parameters were compared with those of the native proteins. These proteins include a guanosine triphosphate (GTP) cyclohydrolase (HP0959), an ATPase (HP1079), an adenosine deaminase (HP0267), a phosphodiesterase (HP1042), and an aminopeptidase (HP1037); in addition, new substrates were characterized for a peptidoglycan deacetylase (HP0310). Generally, characterized enzymes were active at acidic to neutral pH (4.0–7.5) with temperature optima ranging from 35 to 55°C, although some exhibited outstanding characteristics.
The dramatic reduction in the cost of sequencing has allowed many researchers to join in the effort of sequencing and annotating prokaryotic genomes. Annotation methods vary considerably and may fail to identify some genes. Here we draw attention to a large number of likely genes missing from annotations using common tools such as Glimmer and BLAST.
By analyzing 1,474 prokaryotic genome annotations in GenBank, we identify 13,602 likely missed genes that are homologs to non-hypothetical proteins, and 11,792 likely missed genes that are homologs only to hypothetical proteins, yet have supporting evidence of their protein-coding nature from COMBREX, a newly created gene function database. We also estimate the likelihood that each potential missing gene found is a genuine protein-coding gene using COMBREX.
Our analysis of the causes of missed genes suggests that larger annotation centers tend to produce annotations with fewer missed genes than smaller centers, and many of the missed genes are short genes <300 bp. Over 1,000 of the likely missed genes could be associated with phenotype information available in COMBREX. 359 of these genes, found in pathogenic organisms, may be potential targets for pharmaceutical research. The newly identified genes are available on COMBREX’s website.
This article was reviewed by Daniel Haft, Arcady Mushegian, and M. Pilar Francino (nominated by David Ardell).
The oral microbiome, the complex ecosystem of microbes inhabiting the human mouth, harbors several thousands of bacterial types. The proliferation of pathogenic bacteria within the mouth gives rise to periodontitis, an inflammatory disease known to also constitute a risk factor for cardiovascular disease. While much is known about individual species associated with pathogenesis, the system-level mechanisms underlying the transition from health to disease are still poorly understood. Through the sequencing of the 16S rRNA gene and of whole community DNA we provide a glimpse at the global genetic, metabolic, and ecological changes associated with periodontitis in 15 subgingival plaque samples, four from each of two periodontitis patients, and the remaining samples from three healthy individuals. We also demonstrate the power of whole-metagenome sequencing approaches in characterizing the genomes of key players in the oral microbiome, including an unculturable TM7 organism. We reveal the disease microbiome to be enriched in virulence factors, and adapted to a parasitic lifestyle that takes advantage of the disrupted host homeostasis. Furthermore, diseased samples share a common structure that was not found in completely healthy samples, suggesting that the disease state may occupy a narrow region within the space of possible configurations of the oral microbiome. Our pilot study demonstrates the power of high-throughput sequencing as a tool for understanding the role of the oral microbiome in periodontal disease. Despite a modest level of sequencing (∼2 lanes Illumina 76 bp PE) and high human DNA contamination (up to ∼90%) we were able to partially reconstruct several oral microbes and to preliminarily characterize some systems-level differences between the healthy and diseased oral microbiomes.
Type 2 diabetes and obesity are increasingly affecting human populations around the world. Our goal was to identify early molecular signatures predicting genetic risk to these metabolic diseases using two strains of mice that differ greatly in disease susceptibility.
RESEARCH DESIGN AND METHODS
We integrated metabolic characterization, gene expression, protein-protein interaction networks, RT-PCR, and flow cytometry analyses of adipose, skeletal muscle, and liver tissue of diabetes-prone C57BL/6NTac (B6) mice and diabetes-resistant 129S6/SvEvTac (129) mice at 6 weeks and 6 months of age.
At 6 weeks of age, B6 mice were metabolically indistinguishable from 129 mice; however, adipose tissue showed a consistent gene expression signature that differentiated between the strains. In particular, immune system gene networks and inflammatory biomarkers were upregulated in adipose tissue of B6 mice, despite a low normal fat mass. This was accompanied by increased T-cell and macrophage infiltration. The expression of the same networks and biomarkers, particularly those related to T-cells, further increased in adipose tissue of B6 mice, but only minimally in 129 mice, in response to weight gain promoted by age or high-fat diet, further exacerbating the differences between strains.
Insulin resistance in mice with differential susceptibility to diabetes and metabolic syndrome is preceded by differences in the inflammatory response of adipose tissue. This phenomenon may serve as an early indicator of disease and contribute to disease susceptibility and progression.
COMBREX (http://combrex.bu.edu) is a project to increase the speed of the functional annotation of new bacterial and archaeal genomes. It consists of a database of functional predictions produced by computational biologists and a mechanism for experimental biochemists to bid for the validation of those predictions. Small grants are available to support successful bids.
Complete and accurate annotation of gene function is an essential starting point for genome interpretation and a host of systems and synthetic biology endeavors. Detecting errors in existing annotation now has an important new tool.
Methylthiotransferases (MTTases) are a closely related family of proteins that perform both radical-S-adenosylmethionine (SAM) mediated sulfur insertion and SAM-dependent methylation to modify nucleic acid or protein targets with a methyl thioether group (–SCH3). Members of two of the four known subgroups of MTTases have been characterized, typified by MiaB, which modifies N6-isopentenyladenosine (i6A) to 2-methylthio-N6-isopentenyladenosine (ms2i6A) in tRNA, and RimO, which modifies a specific aspartate residue in ribosomal protein S12. In this work, we have characterized the two MTTases encoded by Bacillus subtilis 168 and find that, consistent with bioinformatic predictions, ymcB is required for ms2i6A formation (MiaB activity), and yqeV is required for modification of N6-threonylcarbamoyladenosine (t6A) to 2-methylthio-N6-threonylcarbamoyladenosine (ms2t6A) in tRNA. The enzyme responsible for the latter activity belongs to a third MTTase subgroup, no member of which has previously been characterized. We performed domain-swapping experiments between YmcB and YqeV to narrow down the protein domain(s) responsible for distinguishing i6A from t6A and found that the C-terminal TRAM domain, putatively involved with RNA binding, is likely not involved with this discrimination. Finally, we performed a computational analysis to identify candidate residues outside the TRAM domain that may be involved with substrate recognition. These residues represent interesting targets for further analysis.
To characterize the hormonal milieu and adipose gene expression in response to catch-up growth (CUG), a growth pattern associated with obesity and diabetes risk, in a mouse model of low birth weight (LBW).
RESEARCH DESIGN AND METHODS
ICR mice were food restricted by 50% from gestational days 12.5–18.5, reducing offspring birth weight by 25%. During the suckling period, dams were either fed ad libitum, permitting CUG in offspring, or food restricted, preventing CUG. Offspring were killed at age 3 weeks, and gonadal fat was removed for RNA extraction, array analysis, RT-PCR, and evaluation of cell size and number. Serum insulin, thyroxine (T4), corticosterone, and adipokines were measured.
At age 3 weeks, LBW mice with CUG (designated U-C) had body weight comparable with controls (designated C-C); weight was reduced by 49% in LBW mice without CUG (designated U-U). Adiposity was altered by postnatal nutrition, with gonadal fat increased by 50% in U-C and decreased by 58% in U-U mice (P < 0.05 vs. C-C mice). Adipose expression of the lipogenic genes Fasn, AccI, Lpin1, and Srebf1 was significantly increased in U-C compared with both C-C and U-U mice (P < 0.05). Mitochondrial DNA copy number was reduced by >50% in U-C versus U-U mice (P = 0.014). Although cell numbers did not differ, mean adipocyte diameter was increased in U-C and reduced in U-U mice (P < 0.01).
CUG results in increased adipose tissue lipogenic gene expression and adipocyte diameter but not increased cellularity, suggesting that catch-up fat is primarily associated with lipogenesis rather than adipogenesis in this murine model.
Aberrant activation of signaling pathways drives many of the fundamental biological processes that accompany tumor initiation and progression. Inappropriate phosphorylation of intermediates in these signaling pathways is a frequently observed molecular lesion that accompanies the undesirable activation or repression of pro- and anti-oncogenic pathways. Therefore, methods that directly query signaling pathway activation via phosphorylation assays in individual cancer biopsies are expected to provide important insights into the molecular “logic” that distinguishes cancer and normal tissue on one hand, and enables personalized intervention strategies on the other.
We first document the largest available set of tyrosine phosphorylation sites that are, individually, differentially phosphorylated in lung cancer, thus providing an immediate set of drug targets. Next, we develop a novel computational methodology to identify pathways whose phosphorylation activity is strongly correlated with the lung cancer phenotype. Finally, we demonstrate the feasibility of classifying lung cancers based on multi-variate phosphorylation signatures.
Highly predictive and biologically transparent phosphorylation signatures of lung cancer provide evidence for the existence of a robust set of phosphorylation mechanisms (captured by the signatures) present in the majority of lung cancers, and that reliably distinguish each lung cancer from normal. This approach should improve our understanding of cancer and help guide its treatment, since the phosphorylation signatures highlight proteins and pathways whose phosphorylation should be inhibited in order to prevent unregulated proliferation.
Motivation: Type 2 diabetes is a chronic metabolic disease that involves both environmental and genetic factors. To understand the genetics of type 2 diabetes and insulin resistance, the DIabetes Genome Anatomy Project (DGAP) was launched to profile gene expression in a variety of related animal models and human subjects. We asked whether these heterogeneous models can be integrated to provide consistent and robust biological insights into the biology of insulin resistance.
Results: We perform integrative analysis of the 16 DGAP data sets that span multiple tissues, conditions, array types, laboratories, species, genetic backgrounds and study designs. For each data set, we identify differentially expressed genes compared with control. Then, for the combined data, we rank genes according to the frequency with which they were found to be statistically significant across data sets. This analysis reveals RetSat as a widely shared component of mechanisms involved in insulin resistance and sensitivity and adds to the growing importance of the retinol pathway in diabetes, adipogenesis and insulin resistance. Top candidates obtained from our analysis have been confirmed in recent laboratory studies.
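The recurrence ranking described above can be illustrated with a minimal sketch. It assumes the per-data-set lists of differentially expressed genes have already been computed; the function name, the data set labels and the gene names below are hypothetical placeholders, not values from the DGAP analysis.

```python
from collections import Counter


def rank_by_recurrence(significant_by_dataset):
    """Rank genes by the number of data sets in which they were called significant."""
    counts = Counter()
    for genes in significant_by_dataset.values():
        # Count each gene at most once per data set.
        counts.update(set(genes))
    # Sort by descending count, then by name so that ties have a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy input: data set label -> genes called significant in it.
toy = {
    "dataset_1": ["gene_a", "gene_b", "gene_c"],
    "dataset_2": ["gene_a", "gene_c"],
    "dataset_3": ["gene_a", "gene_d"],
}
ranking = rank_by_recurrence(toy)
```

In this toy input the gene that is significant in all three data sets is ranked first, which is the sense in which a gene is "widely shared" across models.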
Motivation: There is a growing interest in improving the cluster analysis of expression data by incorporating into it prior knowledge, such as the Gene Ontology (GO) annotations of genes, in order to improve the biological relevance of the clusters that are subjected to subsequent scrutiny. The structure of the GO is another source of background knowledge that can be exploited through the use of semantic similarity.
Results: We propose here a novel algorithm that integrates semantic similarities (derived from the ontology structure) into the procedure of deriving clusters from the dendrogram constructed during expression-based hierarchical clustering. Our approach can handle the multiple annotations, from different levels of the GO hierarchy, which most genes have. Moreover, it treats annotated and unannotated genes in a uniform manner. Consequently, the clusters obtained by our algorithm are characterized by significantly enriched annotations. In both cross-validation tests and when using an external index such as protein–protein interactions, our algorithm performs better than previous approaches. When applied to human cancer expression data, our algorithm identifies, among others, clusters of genes related to immune response and glucose metabolism. These clusters are also supported by protein–protein interaction data.
Supplementary information: Supplementary data are available at Bioinformatics online.
The traditional approach to studying complex biological networks is based on the identification of interactions between internal components of signaling or metabolic pathways. By comparison, little is known about interactions between higher order biological systems, such as biological pathways and processes.
We propose a methodology for gleaning patterns of interactions between biological processes by analyzing protein-protein interactions, transcriptional co-expression and genetic interactions. At the heart of the methodology are the concept of Linked Processes and the resultant network of biological processes, the Process Linkage Network (PLN).
We construct, catalogue, and analyze different types of PLNs derived from different data sources and different species. When applied to the Gene Ontology, many of the resulting links connect processes that are distant from each other in the hierarchy, even though the connection makes eminent sense biologically. Some others, however, carry an element of surprise and may reflect mechanisms that are unique to the organism under investigation. In this aspect our method complements the link structure between processes inherent in the Gene Ontology, which by its very nature is species-independent.
As a practical application of the linkage of processes we demonstrate that it can be effectively used in protein function prediction, having the power to increase both the coverage and the accuracy of predictions, when carefully integrated into prediction methods.
Our approach constitutes a promising new direction towards understanding the higher levels of organization of the cell as a system which should help current efforts to re-engineer ontologies and improve our ability to predict which proteins are involved in specific biological processes.
Single nucleotide polymorphisms (SNPs) have been used extensively in genetics and epidemiology studies. Traditionally, SNPs that did not pass the Hardy-Weinberg equilibrium (HWE) test were excluded from these analyses. Many investigators have addressed possible causes for departure from HWE, including genotyping errors, population admixture and segmental duplication. Recent large-scale surveys have revealed abundant structural variations in the human genome, including copy number variations (CNVs). This suggests that a significant number of SNPs must be within these regions, which may cause deviation from HWE.
We performed a Bayesian analysis on the potential effect of copy number variation, segmental duplication and genotyping errors on the behavior of SNPs. Our results suggest that copy number variation is a major factor of HWE violation for SNPs with a small minor allele frequency, when the sample size is large and the genotyping error rate is 0–1%.
Our study provides the posterior probability that a SNP falls in a CNV or a segmental duplication, given the observed allele frequency of the SNP, sample size and the significance level of HWE testing.
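For readers unfamiliar with the HWE test referred to above, the following is a minimal sketch of the standard one-degree-of-freedom chi-square test on genotype counts for a biallelic SNP. It is not the Bayesian analysis of the study; the function name and the example counts are hypothetical.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic for departure from Hardy-Weinberg equilibrium.

    Arguments are the observed counts of the three genotypes (AA, AB, BB).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotype must be observed")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    # Skip classes with zero expectation (monomorphic SNP).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Hypothetical counts: the first set is exactly in equilibrium,
# the second has no heterozygotes at all.
chi2_in_hwe = hwe_chi_square(25, 50, 25)
chi2_no_het = hwe_chi_square(50, 0, 50)
```

A statistic above 3.84 corresponds to rejection at the 0.05 level with one degree of freedom; the second set of counts exceeds that threshold by a wide margin.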
In embryonic stem (ES) cells, bivalent chromatin domains with overlapping repressive (H3 lysine 27 tri-methylation) and activating (H3 lysine 4 tri-methylation) histone modifications mark the promoters of more than 2,000 genes. To gain insight into the structure and function of bivalent domains, we mapped key histone modifications and subunits of Polycomb-repressive complexes 1 and 2 (PRC1 and PRC2) genomewide in human and mouse ES cells by chromatin immunoprecipitation, followed by ultra high-throughput sequencing. We find that bivalent domains can be segregated into two classes—the first occupied by both PRC2 and PRC1 (PRC1-positive) and the second specifically bound by PRC2 (PRC2-only). PRC1-positive bivalent domains appear functionally distinct as they more efficiently retain lysine 27 tri-methylation upon differentiation, show stringent conservation of chromatin state, and associate with an overwhelming number of developmental regulator gene promoters. We also used computational genomics to search for sequence determinants of Polycomb binding. This analysis revealed that the genomewide locations of PRC2 and PRC1 can be largely predicted from the locations, sizes, and underlying motif contents of CpG islands. We propose that large CpG islands depleted of activating motifs confer epigenetic memory by recruiting the full repertoire of Polycomb complexes in pluripotent cells.
Polycomb-group (PcG) proteins play essential roles in the epigenetic regulation of gene expression during development. PcG proteins are repressors that catalyze lysine 27 tri-methylation on histone H3. They are antagonized by trithorax-group proteins that catalyze lysine 4 tri-methylation. Recent studies of ES cells revealed a novel chromatin pattern consisting of overlapping lysine 27 and lysine 4 tri-methylation. Genomic regions with these opposing modifications were termed “bivalent domains” and proposed to silence developmental regulators while keeping them “poised” for alternate fates. However, our understanding of PcG regulation and bivalent domains remains limited. For instance, bivalent domains affect over 2,000 promoters with diverse functions, which suggests that they may function in diverse cellular processes. Moreover, the mechanisms that underlie the targeting of PcG complexes to specific genomic regions remain completely unknown. To gain insight into these issues, we used ultra high-throughput sequencing to map PcG complexes and related modifications genomewide in human and mouse ES cells. The data identify two classes of bivalent domains with distinct regulatory properties. They also reveal striking relationships between genome sequence and chromatin state that suggest a prominent role for the DNA sequence in dictating the genomewide localization of PcG complexes and, consequently, bivalent domains in ES cells.
In the current climate of high-throughput computational biology, the inference of a protein's function from related measurements, such as protein-protein interaction relations, has become a canonical task. Most existing technologies pursue this task as a classification problem, on a term-by-term basis, for each term in a database, such as the Gene Ontology (GO) database, a popular rigorous vocabulary for biological functions. However, ontology structures are essentially hierarchies, with certain top-to-bottom annotation rules that protein function predictions should in principle follow. Currently, the most common approach to imposing these hierarchical constraints on network-based classifiers is to apply transitive closure to the predictions.
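As an illustration of the transitive-closure post-processing mentioned above, here is a minimal sketch that closes a set of predicted terms upward through a hierarchy, so that every predicted term is accompanied by all of its ancestors. The child-to-parents mapping and the term names are hypothetical, and this is the baseline post-processing step, not the probabilistic framework proposed in the paper.

```python
def propagate_to_ancestors(predicted_terms, parents):
    """Return the predicted terms together with all of their ancestors.

    `parents` maps each term to the terms directly above it in the hierarchy.
    """
    closed = set()
    stack = list(predicted_terms)
    while stack:
        term = stack.pop()
        if term in closed:
            continue  # already visited through another path
        closed.add(term)
        stack.extend(parents.get(term, ()))
    return closed


# Hypothetical toy hierarchy: one leaf with two parents that share a root.
toy_parents = {
    "leaf": ["mid_1", "mid_2"],
    "mid_1": ["root"],
    "mid_2": ["root"],
}
closed = propagate_to_ancestors({"leaf"}, toy_parents)
```

After this step a prediction of the leaf term implies both intermediate terms and the root, which is the consistency that a term-by-term classifier does not guarantee on its own.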
We propose a probabilistic framework to integrate information in relational data, in the form of a protein-protein interaction network, and a hierarchically structured database of terms, in the form of the GO database, for the purpose of protein function prediction. At the heart of our framework is a factorization of local neighborhood information in the protein-protein interaction network across successive ancestral terms in the GO hierarchy. We introduce a classifier within this framework, with computationally efficient implementation, that produces GO-term predictions that naturally obey a hierarchical 'true-path' consistency from root to leaves, without the need for further post-processing.
A cross-validation study, using data from the yeast Saccharomyces cerevisiae, shows that our method offers substantial improvements over both standard 'guilt-by-association' (i.e., Nearest-Neighbor) and more refined Markov random field methods, whether in their original form or when post-processed to artificially impose 'true-path' consistency. Further analysis of the results indicates that these improvements are associated with increased predictive capabilities (i.e., increased positive predictive value), and that this increase holds uniformly across GO-term depths. Additional in silico validation on a collection of new annotations recently added to GO confirms the advantages suggested by the cross-validation study. Taken as a whole, our results show that a hierarchical approach to network-based protein function prediction that exploits the ontological structure of protein annotation databases in a principled manner can offer substantial advantages over the successive application of 'flat' network-based methods.
A systematic analysis of the relationship between the neoplastic and developmental transcriptome provides an outline of global trends in cancer gene expression.
In recent years, the molecular underpinnings of the long-observed resemblance between neoplastic and immature tissue have begun to emerge. Genome-wide transcriptional profiling has revealed similar gene expression signatures in several tumor types and early developmental stages of their tissue of origin. However, it remains unclear whether such a relationship is a universal feature of malignancy, whether heterogeneities exist in the developmental component of different tumor types and to which degree the resemblance between cancer and development is a tissue-specific phenomenon.
We defined a developmental landscape by summarizing the main features of ten developmental time courses and projected gene expression from a variety of human tumor types onto this landscape. This comparison demonstrates a clear imprint of developmental gene expression in a wide range of tumors and with respect to different, even non-cognate developmental backgrounds. Our analysis reveals three classes of cancers with developmentally distinct transcriptional patterns. We characterize the biological processes dominating these classes and validate the class distinction with respect to a new time series of murine embryonic lung development. Finally, we identify a set of genes that are upregulated in most cancers and we show that this signature is active in early development.
This systematic and quantitative overview of the relationship between the neoplastic and developmental transcriptome spanning dozens of tissues provides a reliable outline of global trends in cancer gene expression, reveals potentially clinically relevant differences in the gene expression of different cancer types and represents a reference framework for interpretation of smaller-scale functional studies.
The transcriptional program induced by growth factor stimulation is classically described in two stages: the rapid protein synthesis-independent induction of immediate-early genes, followed by the subsequent protein synthesis-dependent induction of secondary response genes. In the current study, we obtained a comprehensive view of this transcriptional program. As expected, we identified both rapid and delayed gene inductions. Surprisingly, however, a large fraction of genes induced with delayed kinetics did not require protein synthesis and therefore represented delayed primary rather than secondary response genes. Of 133 genes induced within 4 hours of growth factor stimulation, 49 (37%) were immediate-early genes, 58 (44%) were delayed primary response genes, and 26 (19%) were secondary response genes. Comparison of immediate-early and delayed primary response genes revealed functional and regulatory differences. Whereas many immediate-early genes encoded transcription factors, transcriptional regulators were not prevalent amongst the delayed primary response genes. The lag in induction of delayed primary response compared to immediate-early mRNAs was due to delays in both transcription initiation and subsequent stages of elongation and processing. Consistent with increased abundance of RNA polymerase II at their promoters, immediate-early genes were characterized by over-representation of transcription factor binding sites and high affinity TATA boxes. Immediate-early genes also had short primary transcripts with few exons, whereas delayed primary response genes more closely resembled other genes in the genome. These findings suggest that genomic features of immediate-early genes, in contrast to the delayed primary response genes, are selected for rapid induction, consistent with their regulatory functions.
Type 2 diabetes mellitus is a complex disorder associated with multiple genetic, epigenetic, developmental, and environmental factors. Animal models of type 2 diabetes differ based on diet, drug treatment, and gene knockouts, and yet all display the clinical hallmarks of hyperglycemia and insulin resistance in peripheral tissue. The recent advances in gene-expression microarray technologies present an unprecedented opportunity to study type 2 diabetes mellitus at a genome-wide scale and across different models. To date, a key challenge has been to identify the biological processes or signaling pathways that play significant roles in the disorder. Here, using a network-based analysis methodology, we identified two sets of genes, associated with insulin signaling and a network of nuclear receptors, which are recurrent in a statistically significant number of diabetes and insulin resistance models and transcriptionally altered across diverse tissue types. We additionally identified a network of protein–protein interactions between members from the two gene sets that may facilitate signaling between them. Taken together, the results illustrate the benefits of integrating high-throughput microarray studies, together with protein–protein interaction networks, in elucidating the underlying biological processes associated with a complex disorder.
Type 2 diabetes mellitus currently affects millions of people. It is clinically characterized by insulin resistance in addition to an impaired glucose response and associated with numerous complications including heart disease, stroke, neuropathy, and kidney failure, among others. Accurate identification of the underlying molecular mechanisms of the disease or its complications is an important research problem that could lead to novel diagnostics and therapy. The main challenge stems from the fact that insulin resistance is a complex disorder and affects a multitude of biological processes, metabolic networks, and signaling pathways. In this report, the authors develop a network-based methodology that appears to be more sensitive than previous approaches in detecting deregulated molecular processes in a disease state. The methodology revealed that both insulin signaling and nuclear receptor networks are consistently and differentially expressed in many models of insulin resistance. The positive results suggest such network-based diagnostic technologies hold promise as potentially useful clinical and research tools in the future.
Despite its central role in cell survival and proliferation, the transcriptional program controlled by GSK-3 is poorly understood. We have employed a systems level approach to characterize gene regulation downstream of PI 3-kinase/Akt/GSK-3 signaling in response to growth factor stimulation of quiescent cells. Of 31 immediate-early genes whose induction was dependent on PI 3-kinase signaling, 12 were induced directly by inhibition of GSK-3. Most of the GSK-3 regulated genes encoded transcription factors, growth factors and signaling molecules. Binding sites for CREB were highly over-represented in the upstream regions of these genes, with 9 genes containing CREB sites that were conserved in mouse orthologs. Binding sites predicted in 6 genes were confirmed by CREB chromatin immunoprecipitation and forskolin induction of CBP binding. Moreover, CREB siRNA substantially blocked induction of 5 genes by forskolin and of 3 genes following inhibition of GSK-3. These results indicate that GSK-3 actively represses gene expression in quiescent cells, with inhibition of CREB playing a key role in this transcriptional response.
Dramatic improvements in high-throughput sequencing technologies have led to a staggering growth in the number of predicted genes. However, a large fraction of these newly discovered genes lack a functional assignment. Fortunately, a variety of novel high-throughput genome-wide functional screening technologies provide important clues that shed light on gene function. The integration of heterogeneous data to predict protein function has been shown to improve the accuracy of automated gene annotation systems. In this paper, we propose and evaluate a probabilistic approach for protein function prediction that integrates protein-protein interaction (PPI) data, gene expression data, protein motif information, mutant phenotype data, and protein localization data. First, functional linkage graphs are constructed from PPI data and gene expression data, in which an edge between nodes (proteins) represents evidence for functional similarity. The assumption here is that graph neighbors are more likely to share protein function than proteins that are not neighbors. The functional linkage graph model is then used in concert with protein domain, mutant phenotype and protein localization data to produce a functional prediction. Our method is applied to the functional prediction of Saccharomyces cerevisiae genes, using Gene Ontology (GO) terms as the basis of our annotation. In a cross-validation study we show that, at 50% precision, the integrated model increases recall by 18% compared to using PPI data alone. We also show that the integrated predictor is significantly better than each individual predictor. However, the observed improvement over PPI data alone depends on both the new source of data and the functional category to be predicted. Surprisingly, in some contexts integration hurts overall prediction accuracy. Lastly, we provide a comprehensive assignment of putative GO terms to 463 proteins that currently have no assigned function.
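The neighbor assumption behind a functional linkage graph can be illustrated with a minimal sketch. This is a simple neighbor-voting heuristic, not the probabilistic model evaluated in the paper; the function name `predict_functions`, the toy graph, and the protein labels are all hypothetical, and the GO identifiers serve only as placeholder labels.

```python
from collections import Counter

def predict_functions(graph, annotations, protein, top_k=3):
    """Score candidate terms for `protein` by the fraction of its
    annotated graph neighbors that carry each term (neighbor voting)."""
    neighbors = graph.get(protein, set())
    # Only neighbors with at least one known term can vote.
    annotated = [n for n in neighbors if annotations.get(n)]
    if not annotated:
        return []
    counts = Counter(term for n in annotated for term in annotations[n])
    scored = [(term, count / len(annotated)) for term, count in counts.items()]
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]

# Toy functional linkage graph (hypothetical): P1 is unannotated.
graph = {
    "P1": {"P2", "P3", "P4"},
    "P2": {"P1"},
    "P3": {"P1"},
    "P4": {"P1"},
}
annotations = {
    "P2": {"GO:0006412"},
    "P3": {"GO:0006412", "GO:0006457"},
    "P4": set(),  # neighbor with no known function; it does not vote
}

predictions = predict_functions(graph, annotations, "P1")
print(predictions)
```

In this toy case both annotated neighbors of P1 share one term and only one carries the other, so the first term receives the higher score. An integrated predictor of the kind described above would combine such graph evidence with domain, phenotype, and localization data rather than rely on votes alone.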
Machine learning approaches offer the potential to systematically identify transcriptional regulatory interactions from a compendium of microarray expression profiles. However, experimental validation of the performance of these methods at the genome scale has remained elusive. Here we assess the global performance of four existing classes of inference algorithms using 445 Escherichia coli Affymetrix arrays and 3,216 known E. coli regulatory interactions from RegulonDB. We also developed and applied the context likelihood of relatedness (CLR) algorithm, a novel extension of the relevance networks class of algorithms. CLR demonstrates an average precision gain of 36% relative to the next-best performing algorithm. At a 60% true positive rate, CLR identifies 1,079 regulatory interactions, of which 338 were in the previously known network and 741 were novel predictions. We tested the predicted interactions for three transcription factors with chromatin immunoprecipitation, confirming 21 novel interactions and verifying our RegulonDB-based performance estimates. CLR also identified a regulatory link providing central metabolic control of iron transport, which we confirmed with real-time quantitative PCR. The compendium of expression data compiled in this study, coupled with RegulonDB, provides a valuable model system for further improvement of network inference algorithms using experimental data.
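The abstract does not spell out how CLR scores a candidate interaction. The sketch below assumes the commonly described formulation, in which the mutual information between two genes is converted to a z-score against each gene's own background distribution of mutual information values and the two z-scores are combined; the function name `clr_scores` and the toy matrix are hypothetical, and mutual information is taken as precomputed.

```python
import math
from statistics import mean, pstdev

def clr_scores(mi):
    """Background-corrected scores from a symmetric mutual information
    matrix `mi` (list of lists). Assumed formulation, not taken from
    the abstract: score = sqrt(z_i^2 + z_j^2), negative z clipped to 0."""
    n = len(mi)
    # Background for gene i: its MI values with every other gene.
    stats = []
    for i in range(n):
        row = [mi[i][j] for j in range(n) if j != i]
        stats.append((mean(row), pstdev(row)))

    def z(value, mu, sigma):
        # A flat background (sigma == 0) carries no information.
        return max(0.0, (value - mu) / sigma) if sigma > 0 else 0.0

    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            zi = z(mi[i][j], *stats[i])
            zj = z(mi[i][j], *stats[j])
            scores[i][j] = scores[j][i] = math.hypot(zi, zj)
    return scores

# Toy mutual information matrix for four genes (hypothetical values).
mi = [
    [0.0, 0.9, 0.1, 0.1],
    [0.9, 0.0, 0.1, 0.2],
    [0.1, 0.1, 0.0, 0.1],
    [0.1, 0.2, 0.1, 0.0],
]
scores = clr_scores(mi)
for row in scores:
    print([round(value, 3) for value in row])
```

The point of the background correction is that a pair stands out only when its mutual information is high relative to what both genes typically show, which is why the pair of genes 0 and 1 scores highest here while pairs involving the uniformly low gene 2 score zero.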
Organisms can adapt to changing environments—becoming more virulent, for example, or activating stress responses—thanks to a flexible gene expression program controlled by the dynamic interactions of hundreds of transcriptional regulators. To unravel this regulatory complexity, multiple computational algorithms have been developed to analyze gene expression profiles and detect dependencies among genes over different conditions. It has been difficult to judge whether these algorithms can generate accurate global maps of regulatory interactions, however, because of the absence of a model organism with both a compendium of gene expression data and a corresponding network of experimentally determined regulatory interactions. To address this issue, we assembled 445 Escherichia coli microarrays, applied four classes of inference algorithms to the dataset, and validated the predictions against 3,216 experimentally determined E. coli interactions. The top-performing algorithm identifies 1,079 regulatory interactions at a confidence level of 60% or higher. Of these predicted interactions, 741 are novel and illuminate the regulation of amino acid biosynthesis, flagella biosynthesis, osmotic stress response, antibiotic resistance, and iron regulation. By defining the capabilities and limitations of network inference algorithms for large-scale mapping of prokaryotic regulatory networks, our work should facilitate their application to the mapping of novel microbes.
A novel machine-learning method is developed to predict transcriptional regulatory interactions from microarray data. One of the identified interactions appears to be important for the control of iron transport.