SOX2 is a master regulator of both pluripotent embryonic stem cells (ESCs) and multipotent neural progenitor cells (NPCs); however, we currently lack a detailed understanding of how SOX2 controls these distinct stem cell populations. Here we show by genome-wide analysis that, while SOX2 bound to a distinct set of gene promoters in ESCs and NPCs, the majority of regions coincided with unique distal enhancer elements, important cis-acting regulators of tissue-specific gene expression programs. Notably, SOX2 bound the same consensus DNA motif in both cell types, suggesting that additional factors contribute to target specificity. We found that, similar to its association with OCT4 (Pou5f1) in ESCs, the related POU family member BRN2 (Pou3f2) co-occupied a large set of putative distal enhancers with SOX2 in NPCs. Forced expression of BRN2 in ESCs led to functional recruitment of SOX2 to a subset of NPC-specific targets and to precocious differentiation toward a neural-like state. Further analysis of the bound sequences revealed differences in the distances of SOX and POU peaks in the two cell types and identified motifs for additional transcription factors. Together, these data suggest that SOX2 controls a larger network of genes than previously anticipated through binding of distal enhancers and that transitions in POU partner factors may control tissue-specific transcriptional programs. Our findings have important implications for understanding lineage specification and somatic cell reprogramming, where SOX2, OCT4, and BRN2 have been shown to be key factors.
In mammals, a few thousand transcription factors regulate the differential expression of more than 20,000 genes to specify ∼200 functionally distinct cell types during development. How this is accomplished has been a major focus of biology. Transcription factors bind non-coding DNA regulatory elements, including proximal promoters and distal enhancers, to control gene expression. Emerging evidence indicates that transcription factor binding at distal enhancers plays an important role in the establishment of tissue-specific gene expression programs during development. Moreover, combinatorial binding among groups of transcription factors can further increase the diversity and specificity of regulatory modules. Here, we report the genome-wide binding profile of the HMG-box containing transcription factor SOX2 in mouse embryonic stem cells (ESCs) and neural progenitor cells (NPCs), and we show that SOX2 occupied a distinct set of binding sites with POU homeodomain family members, OCT4 in ESCs and BRN2 in NPCs. Thus, transitions in SOX2-POU partners may control tissue-specific gene networks. Ultimately, a global analysis detailing the combinatorial binding of transcription factors across all tissues is critical to understand cell fate specification in the context of the complex mammalian genome.
Polycomb repressive complexes (PRCs) play key roles in developmental epigenetic regulation. Yet the mechanisms that target PRCs to specific loci in mammalian cells remain incompletely understood. In this study, we show that Bmi1, a core component of Polycomb Repressive Complex 1 (PRC1), binds directly to the Runx1/CBFβ transcription factor complex. Genome-wide studies in megakaryocytic cells demonstrate significant chromatin occupancy overlap between the PRC1 core component Ring1b and Runx1/CBFβ, and functional regulation of a considerable fraction of commonly bound genes. Bmi1/Ring1b and Runx1/CBFβ deficiency generate partial phenocopies of one another in vivo. We also show that Ring1b occupies key Runx1 binding sites in primary murine thymocytes and that this occurs via Polycomb Repressive Complex 2 (PRC2) independent mechanisms. Genetic depletion of Runx1 results in reduced Ring1b binding at these sites in vivo. These findings provide evidence for site-specific PRC1 chromatin recruitment by core binding transcription factors in mammalian cells.
Cellular signal transduction generally involves cascades of post-translational protein modifications that rapidly catalyze changes in protein-DNA interactions and gene expression. High-throughput measurements are improving our ability to study each of these stages individually, but do not capture the connections between them. Here we present an approach for building a network of physical links among these data that can be used to prioritize targets for pharmacological intervention. Our method recovers the critical missing links between proteomic and transcriptional data by relating changes in chromatin accessibility to changes in expression and then uses these links to connect proteomic and transcriptome data. We applied our approach to integrate epigenomic, phosphoproteomic and transcriptome changes induced by the variant III mutation of the epidermal growth factor receptor (EGFRvIII) in a cell line model of glioblastoma multiforme (GBM). To test the relevance of the network, we used small molecules to target highly connected nodes implicated by the network model that were not detected by the experimental data in isolation and we found that a large fraction of these agents alter cell viability. Among these are two compounds, ICG-001, targeting CREB binding protein (CREBBP), and PKF118–310, targeting β-catenin (CTNNB1), which have not been tested previously for effectiveness against GBM. At the level of transcriptional regulation, we used chromatin immunoprecipitation sequencing (ChIP-Seq) to experimentally determine the genome-wide binding locations of p300, a transcriptional co-regulator highly connected in the network. Analysis of p300 target genes suggested its role in tumorigenesis. We propose that this general method, in which experimental measurements are used as constraints for building regulatory networks from the interactome while taking into account noise and missing data, should be applicable to a wide range of high-throughput datasets.
The ways in which cells respond to changes in their environment are controlled by networks of physical links among the proteins and genes. The initial signal of a change in conditions rapidly passes through these networks from the cytoplasm to the nucleus, where it can lead to long-term alterations in cellular behavior by controlling the expression of genes. These cascades of signaling events underlie many normal biological processes. As a result, being able to map out how these networks change in disease can provide critical insights for new approaches to treatment. We present a computational method for reconstructing these networks by finding links between the rapid short-term changes in proteins and the longer-term changes in gene regulation. This method brings together systematic measurements of protein signaling, genome organization and transcription in the context of protein-protein and protein-DNA interactions. When used to analyze datasets from an oncogene expressing cell line model of human glioblastoma, our approach identifies key nodes that affect cell survival and functional transcriptional regulators.
Heat-Shock Factor 1 (HSF1), master regulator of the heat-shock response, facilitates malignant transformation, cancer cell survival and proliferation in model systems. The common assumption is that these effects are mediated through regulation of heat-shock protein (HSP) expression. However, the transcriptional network that HSF1 coordinates directly in malignancy and its relationship to the heat-shock response have never been defined. By comparing cells with high and low malignant potential alongside their non-transformed counterparts, we identify an HSF1-regulated transcriptional program specific to highly malignant cells and distinct from heat shock. Cancer-specific genes in this program support oncogenic processes: cell-cycle regulation, signaling, metabolism, adhesion and translation. HSP genes are integral to this program; however, many are uniquely regulated in malignancy. This HSF1 cancer program is active in breast, colon and lung tumors isolated directly from human patients and is strongly associated with metastasis and death. Thus, HSF1 rewires the transcriptome in tumorigenesis, with prognostic and therapeutic implications.
HSP90; HSP70; ChIP-Seq; genome-wide; outcome signature; Nurses’ Health Study; immunohistochemistry
In Huntington’s disease (HD), polyglutamine expansions in the huntingtin (Htt) protein cause subtle changes in cellular functions that, over time, lead to neurodegeneration and death. Studies have indicated that activation of the heat shock response can reduce many of the effects of mutant Htt in disease models, suggesting that the heat shock response is impaired in the disease. To understand the basis for this impairment, we have used genome-wide chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-Seq) to examine the effects of mutant Htt on the master regulator of the heat shock response, HSF1. We find that, under normal conditions, HSF1 function is highly similar in cells carrying either wild-type or mutant Htt. However, polyQ-expanded Htt severely blunts the HSF1-mediated stress response. Surprisingly, we find that the HSF1 targets most affected upon stress are not directly associated with proteostasis, but with cytoskeletal binding, focal adhesion and GTPase activity. Our data raise the intriguing hypothesis that the accumulated damage from life-long impairment in these stress responses may contribute significantly to the etiology of Huntington’s disease.
Huntington Disease; heat shock transcription factor; Heat-Shock Response; Chromatin Immunoprecipitation; cDNA Microarrays; Deep Sequencing
High-throughput technologies including transcriptional profiling, proteomics and reverse genetics screens provide detailed molecular descriptions of cellular responses to perturbations. However, it is difficult to integrate these diverse data to reconstruct biologically meaningful signaling networks. Previously, we have established a framework for integrating transcriptional, proteomic and interactome data by searching for the solution to the prize-collecting Steiner tree problem. Here, we present a web server, SteinerNet, to make this method available in a user-friendly format for a broad range of users with data from any species. At a minimum, a user only needs to provide a set of experimentally detected proteins and/or genes and the server will search for connections among these data from the provided interactomes for yeast, human, mouse, Drosophila melanogaster and Caenorhabditis elegans. More advanced users can upload their own interactome data as well. The server provides interactive visualization of the resulting optimal network and downloadable files detailing the analysis and results. We believe that SteinerNet will be useful for researchers who would like to integrate their high-throughput data for a specific condition or cellular response and to find biologically meaningful pathways. SteinerNet is accessible at http://fraenkel.mit.edu/steinernet.
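As a rough illustration of the prize-collecting Steiner tree idea mentioned above, the following minimal sketch scores candidate trees on a toy network and picks the cheapest one by exhaustive search. It assumes the common textbook form of the objective (edge costs of the tree plus a penalty `beta` times the prizes of omitted nodes). The node names, prizes, costs and the function names `pcst_objective` and `brute_force_pcst` are all hypothetical, and the exhaustive search is only practical for tiny graphs; it is not the solver used by the web server.

```python
from itertools import combinations

def is_tree(nodes, edges):
    """True if `edges` connect every node in `nodes` without forming a cycle."""
    if len(edges) != len(nodes) - 1:
        return False
    adjacency = {n: set() for n in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for neighbour in adjacency[stack.pop()]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen == set(nodes)

def pcst_objective(tree_nodes, tree_edges, prizes, costs, beta):
    """Edge costs paid by the tree plus beta times the prizes left out of it."""
    omitted_prizes = sum(p for n, p in prizes.items() if n not in tree_nodes)
    edge_costs = sum(costs[e] for e in tree_edges)
    return beta * omitted_prizes + edge_costs

def brute_force_pcst(prizes, costs, beta=1.0):
    """Enumerate every tree in a tiny graph and return (score, nodes, edges)."""
    # The empty solution pays the penalty for every prize.
    best = (pcst_objective(set(), (), prizes, costs, beta), set(), ())
    # Single-node trees use no edges.
    candidates = [({n}, ()) for n in prizes]
    edges = list(costs)
    for k in range(1, len(edges) + 1):
        for subset in combinations(edges, k):
            nodes = {n for e in subset for n in e}
            if is_tree(nodes, subset):
                candidates.append((nodes, subset))
    for nodes, subset in candidates:
        score = pcst_objective(nodes, subset, prizes, costs, beta)
        if score < best[0]:
            best = (score, nodes, subset)
    return best

# Hypothetical toy data: A and C are strong experimental hits, D is a weak hit,
# and B is an undetected (zero-prize) protein that links A and C cheaply.
prizes = {"A": 5.0, "B": 0.0, "C": 5.0, "D": 1.0}
costs = {("A", "B"): 1.0, ("B", "C"): 1.0, ("A", "C"): 3.0, ("C", "D"): 2.0}
score, nodes, edges = brute_force_pcst(prizes, costs, beta=1.0)
print(score, sorted(nodes), edges)
```

In this toy case the best tree routes through the undetected node B and leaves out the weak hit D, which is the kind of trade-off the objective is meant to capture.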
The growing epidemic of obesity and metabolic diseases calls for a better understanding of adipocyte biology. The regulation of transcription in adipocytes is particularly important, as it is a target for several therapeutic approaches. Transcriptional outcomes are influenced by both histone modifications and transcription factor binding. Although the epigenetic states and binding sites of several important transcription factors have been profiled in the mouse 3T3-L1 cell line, such data are lacking in human adipocytes. In this study, we identified H3K56 acetylation sites in human adipocytes derived from mesenchymal stem cells. H3K56 is acetylated by CBP and p300, and deacetylated by SIRT1, all of which are proteins with important roles in diabetes and insulin signaling. We found that while almost half of the genome shows signs of H3K56 acetylation, the highest level of H3K56 acetylation is associated with transcription factors and proteins in the adipokine signaling and Type II Diabetes pathways. In order to discover the transcription factors that recruit acetyltransferases and deacetylases to sites of H3K56 acetylation, we analyzed DNA sequences near H3K56 acetylated regions and found that the E2F recognition sequence was enriched. Using chromatin immunoprecipitation followed by high-throughput sequencing, we confirmed that genes bound by E2F4, as well as those bound by HSF-1 and C/EBPα, have higher than expected levels of H3K56 acetylation, and that the transcription factor binding sites and acetylation sites are often adjacent but rarely overlap. We also discovered a significant difference between bound targets of C/EBPα in 3T3-L1 and human adipocytes, highlighting the need to construct species-specific epigenetic and transcription factor binding site maps. This is the first genome-wide profile of H3K56 acetylation, E2F4, C/EBPα and HSF-1 binding in human adipocytes, and will serve as an important resource for better understanding adipocyte transcriptional regulation.
We have used a simple and efficient method to identify condition-specific transcriptional regulatory sites in vivo to help elucidate the molecular basis of sex-related differences in transcription, which are widespread in mammalian tissues and affect normal physiology, drug response, inflammation, and disease. To systematically uncover transcriptional regulators responsible for these differences, we used DNase hypersensitivity analysis coupled with high-throughput sequencing to produce condition-specific maps of regulatory sites in male and female mouse livers and in livers of male mice feminized by continuous infusion of growth hormone (GH). We identified 71,264 hypersensitive sites, with 1,284 showing robust sex-related differences. Continuous GH infusion suppressed the vast majority of male-specific sites and induced a subset of female-specific sites in male livers. We also identified broad genomic regions (up to ∼100 kb) showing sex-dependent hypersensitivity and similar patterns of GH responses. We found a strong association of sex-specific sites with sex-specific transcription; however, a majority of sex-specific sites were >100 kb from sex-specific genes. By analyzing sequence motifs within regulatory regions, we identified two known regulators of liver sexual dimorphism and several new candidates for further investigation. This approach can readily be applied to mapping condition-specific regulatory sites in mammalian tissues under a wide variety of physiological conditions.
Cellular response to stimuli is typically complex and involves both regulatory and metabolic processes. Large-scale experimental efforts to identify components of these processes often comprise genetic screening and transcriptomic profiling assays. We previously established that in yeast genetic screens tend to identify response regulators, while transcriptomic profiling assays tend to identify components of metabolic processes. ResponseNet is a network-optimization approach that integrates the results from these assays with data of known molecular interactions. Specifically, ResponseNet identifies a high-probability sub-network, composed of signaling and regulatory molecular interaction paths, through which putative response regulators may lead to the measured transcriptomic changes. Computationally, this is achieved by formulating a minimum-cost flow optimization problem and solving it efficiently using linear programming tools. The ResponseNet web server offers a simple interface for applying ResponseNet. Users can upload weighted lists of proteins and genes and obtain a sparse, weighted, molecular interaction sub-network connecting their data. The predicted sub-network and its gene ontology enrichment analysis are presented graphically or as text. Consequently, the ResponseNet web server enables researchers who were previously limited to separate analysis of their distinct, large-scale experiments to meaningfully integrate their data and substantially expand their understanding of the underlying cellular response. ResponseNet is available at http://bioinfo.bgu.ac.il/respnet.
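For readers unfamiliar with minimum-cost flow, the generic textbook linear program can be written as below. This is only a sketch under stated assumptions and not necessarily the exact objective solved by the server. All symbols are introduced here for illustration: $S$ is an auxiliary source attached to the putative regulators, $T$ an auxiliary sink attached to the differentially expressed genes, $w_{ij}$ the confidence of interaction $(i,j)$, $u_{ij}$ its capacity, and $F$ the total flow pushed through the network. Choosing $c_{ij} = -\log w_{ij}$ is one natural way to make low-cost paths correspond to high-probability paths.

```latex
\begin{aligned}
\min_{f}\quad & \sum_{(i,j)\in E} c_{ij}\, f_{ij}, \qquad c_{ij} = -\log w_{ij} \\
\text{s.t.}\quad & \sum_{j:(i,j)\in E} f_{ij} \;-\; \sum_{j:(j,i)\in E} f_{ji} = 0
    \quad \forall\, i \in V\setminus\{S,T\}, \\
& \sum_{j:(S,j)\in E} f_{Sj} = F, \\
& 0 \le f_{ij} \le u_{ij} \quad \forall\, (i,j)\in E .
\end{aligned}
```

Edges that carry non-zero flow in the optimum form the reported sub-network. Because both the objective and the constraints are linear in $f$, standard linear programming solvers apply.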
The transcriptional regulatory networks that specify and maintain human tissue diversity are largely uncharted. To gain insight into this circuitry, we used chromatin immunoprecipitation combined with promoter microarrays to identify systematically the genes occupied by the transcriptional regulators HNF1α, HNF4α, and HNF6, together with RNA polymerase II, in human liver and pancreatic islets. We identified tissue-specific regulatory circuits formed by HNF1α, HNF4α, and HNF6 with other transcription factors, revealing how these factors function as master regulators of hepatocyte and islet transcription. Our results suggest how misregulation of HNF4α can contribute to type 2 diabetes.
Foxp3+CD4+CD25+ regulatory T (Treg) cells are essential for the prevention of autoimmunity [1,2]. Treg cells have an attenuated cytokine response to T-cell receptor stimulation, and can suppress the proliferation and effector function of neighbouring T cells [3,4]. The forkhead transcription factor Foxp3 (forkhead box P3) is selectively expressed in Treg cells, is required for Treg development and function, and is sufficient to induce a Treg phenotype in conventional CD4+CD25− T cells [5–8]. Mutations in Foxp3 cause severe, multi-organ autoimmunity in both human and mouse [9–11]. FOXP3 can cooperate in a DNA-binding complex with NFAT (nuclear factor of activated T cells) to regulate the transcription of several known target genes [12]. However, the global set of genes regulated directly by Foxp3 is not known and consequently, how this transcription factor controls the gene expression programme for Treg function is not understood. Here we identify Foxp3 target genes and report that many of these are key modulators of T-cell activation and function. Remarkably, the predominant, although not exclusive, effect of Foxp3 occupancy is to suppress the activation of target genes on T-cell stimulation. Foxp3 suppression of its targets appears to be crucial for the normal function of Treg cells, because overactive variants of some target genes are known to be associated with autoimmune disease.
DNA-binding transcriptional regulators interpret the genome's regulatory code by binding to specific sequences to induce or repress gene expression [1]. Comparative genomics has recently been used to identify potential cis-regulatory sequences within the yeast genome on the basis of phylogenetic conservation [2–6], but this information alone does not reveal if or when transcriptional regulators occupy these binding sites. We have constructed an initial map of yeast's transcriptional regulatory code by identifying the sequence elements that are bound by regulators under various conditions and that are conserved among Saccharomyces species. The organization of regulatory elements in promoters and the environment-dependent use of these elements by regulators are discussed. We find that environment-specific use of regulatory elements predicts mechanistic models for the function of a large population of yeast's transcriptional regulators.
Biomolecular pathways are built from diverse types of pairwise interactions, ranging from physical protein-protein interactions and modifications to indirect regulatory relationships. One goal of systems biology is to bridge three aspects of this complexity: the growing body of high-throughput data assaying these interactions; the specific interactions in which individual genes participate; and the genome-wide patterns of interactions in a system of interest. Here, we describe methodology for simultaneously predicting specific types of biomolecular interactions using high-throughput genomic data. This results in a comprehensive compendium of whole-genome networks for yeast, derived from ∼3,500 experimental conditions and describing 30 interaction types, which range from general (e.g. physical or regulatory) to specific (e.g. phosphorylation or transcriptional regulation). We used these networks to investigate molecular pathways in carbon metabolism and cellular transport, proposing a novel connection between glycogen breakdown and glucose utilization supported by recent publications. Additionally, 14 specific predicted interactions in DNA topological change and protein biosynthesis were experimentally validated. We analyzed the systems-level network features within all interactomes, verifying the presence of small-world properties and enrichment for recurring network motifs. This compendium of physical, synthetic, regulatory, and functional interaction networks has been made publicly available through an interactive web interface for investigators to utilize in future research at http://function.princeton.edu/bioweaver/.
To maintain the complexity of living biological systems, many proteins must interact in a coordinated manner to integrate their unique functions into a cooperative system. Pathways are typically constructed to capture modular subsets of this dynamic network, each made up of a collection of biomolecular interactions of diverse types that together carry out a specific cellular function. Deciphering these pathways at a global level is a crucial step for unraveling systems biology, aiding at every level from basic biological understanding to translational biomarker and drug target discovery. The combination of high-throughput genomic data with advanced computational methods has enabled us to infer the first genome-wide compendium of biomolecular pathway networks, comprising 30 distinct biomolecular interaction types. We demonstrate that this interaction network compendium, derived from ∼3,500 experimental conditions, can be used to direct a range of biomedical hypothesis generation and testing. We show that our results can be used to predict novel protein interactions and new pathway components, and also that they enable system-level analysis to investigate the network characteristics of cell-wide regulatory circuits. The resulting compendium of biological networks is made publicly available through an interactive web interface to enable future research in other biological systems of interest.
Cellular signaling and regulatory networks underlie fundamental biological processes such as growth, differentiation, and response to the environment. Although there are now various high-throughput methods for studying these processes, knowledge of them remains fragmentary. Typically, the vast majority of hits identified by transcriptional, proteomic, and genetic assays lie outside of the expected pathways. These unexpected components of the cellular response are often the most interesting, because they can provide new insights into biological processes and potentially reveal new therapeutic approaches. However, they are also the most difficult to interpret. We present a technique, based on the Steiner tree problem, that uses previously reported protein-protein and protein-DNA interactions to determine how these hits are organized into functionally coherent pathways, revealing many components of the cellular response that are not readily apparent in the original data. Applied simultaneously to phosphoproteomic and transcriptional data for the yeast pheromone response, it identifies changes in diverse cellular processes that extend far beyond the expected pathways.
The transcription factor GATA-1 is required for terminal erythroid maturation and functions as an activator or repressor depending on gene context. Yet its in vivo site selectivity and ability to distinguish between activated versus repressed genes remain incompletely understood. In this study, we performed GATA-1 ChIP-seq in erythroid cells and compared it to GATA-1 induced gene expression changes. Bound and differentially expressed genes contain a greater number of GATA binding motifs, a higher frequency of palindromic GATA sites, and closer occupancy to the transcriptional start site versus non-differentially expressed genes. Moreover, we show that the transcription factor Zbtb7a occupies GATA-1 bound regions of some direct GATA-1 target genes, that the presence of SCL/TAL1 helps distinguish transcriptional activation versus repression, and that Polycomb Repressive Complex 2 (PRC2) is involved in epigenetic silencing of a subset of GATA-1 repressed genes. These data provide insights into GATA-1 mediated gene regulation in vivo.
GATA-1; Polycomb; Zbtb7a; erythroid; ChIP-seq
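The comparison of GATA motif counts between gene sets described in the GATA-1 abstract can be illustrated with a short motif-counting sketch. It assumes the widely used consensus WGATAR (W = A or T, R = A or G) and scans both strands. The function name `count_gata_motifs` and the example sequence are hypothetical, and the study itself may have used a different motif model or scanning tool.

```python
import re

# Zero-width lookaheads let overlapping occurrences be counted.
FORWARD = re.compile(r"(?=[AT]GATA[AG])")   # WGATAR on the given strand
REVERSE = re.compile(r"(?=[CT]TATC[AT])")   # reverse complement, YTATCW

def count_gata_motifs(sequence):
    """Count WGATAR occurrences on both strands of a DNA sequence."""
    sequence = sequence.upper()
    return len(FORWARD.findall(sequence)) + len(REVERSE.findall(sequence))

# Hypothetical sequence with one site on each strand.
example = "CCAGATAAGGCTTATCTGG"
print(count_gata_motifs(example))
```

Counts like these could then be compared between differentially expressed and non-differentially expressed bound genes, for example after normalizing by the length of the region scanned.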
Understanding the mechanistic basis of transcriptional regulation has been a central focus of molecular biology since its inception. New high-throughput chromatin immunoprecipitation experiments have revealed that most regulatory proteins bind thousands of sites in mammalian genomes. However, the functional significance of these binding sites remains unclear. We present a quantitative model of transcriptional regulation that suggests the contribution of each binding site to tissue-specific gene expression depends strongly on its position relative to the transcription start site. For three cell types, we show that, by considering binding position, it is possible to predict relative expression levels between cell types with an accuracy approaching the level of agreement between different experimental platforms. Our model suggests that, for the transcription factors profiled in these cell types, a regulatory site's influence on expression falls off almost linearly with distance from the transcription start site in a 10 kilobase range. Binding to both evolutionarily conserved and non-conserved sequences contributes significantly to transcriptional regulation. Our approach also reveals the quantitative, tissue-specific role of individual proteins in activating or repressing transcription. These results suggest that regulator binding position plays a previously unappreciated role in influencing expression and blurs the classical distinction between proximal promoter and distal binding events.
Gene expression is controlled, in large part, by regulatory proteins called transcription factors that bind specific sites in the genome. A major focus of molecular biology has been understanding how these transcription factors interact with the cell's transcriptional machinery, the genome, and with each other to turn genes' expression on and off in various physiological contexts. Previous approaches for modeling transcriptional regulation have focused on the complex combinatorial interactions between groups of transcription factors at regulatory sites, or on the specific activating or repressive functions of individual proteins. In this work, we present a new modeling framework and demonstrate that an equally important, and previously overlooked, consideration in predicting the effect that a regulatory site has on gene expression is simply its location relative to the transcription start site of nearby genes. Our results show that, in general, the closer a binding event is to a gene's transcription start site, the more it influences expression. We also show that considering the particular proteins bound at a regulatory site helps predict the expression of nearby genes. However, considering the sequence conservation level of these sites does not lead to more accurate predictions.
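The position-dependent idea described above, in which a site's influence falls off almost linearly with distance from the transcription start site within a 10 kilobase range, can be sketched as follows. This is an illustrative toy and not the authors' fitted model: the function names, the per-factor effect values and the example coordinates are assumptions introduced here.

```python
def site_weight(distance_bp, window_bp=10_000):
    """Linear decay: weight 1 at the TSS, 0 at or beyond `window_bp`."""
    return max(0.0, 1.0 - abs(distance_bp) / window_bp)

def regulatory_score(sites, tss, factor_effects, window_bp=10_000):
    """Sum per-factor effects, each scaled by its distance-based weight.

    sites: iterable of (genomic_position, factor_name) binding events.
    factor_effects: hypothetical activating (+) or repressing (-) strengths.
    """
    return sum(
        factor_effects[factor] * site_weight(position - tss, window_bp)
        for position, factor in sites
    )

# Hypothetical gene with its TSS at position 50,000 and three binding events.
effects = {"activator": 2.0, "repressor": -1.0}
binding = [(49_000, "activator"), (55_000, "activator"), (70_000, "repressor")]
score = regulatory_score(binding, tss=50_000, factor_effects=effects)
print(score)
```

Here the site 1 kb from the start site counts at 0.9 of its full effect, the site 5 kb away at 0.5, and the site 20 kb away contributes nothing.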
Cells respond to stimuli by changes in various processes, including signaling pathways and gene expression. Efforts to identify components of these responses increasingly depend on mRNA profiling and genetic library screens, yet the functional roles of the genes identified by these assays often remain enigmatic. By comparing the results of these two assays across various cellular responses, we found that they are consistently distinct. Moreover, genetic screens tend to identify response regulators, while mRNA profiling frequently detects metabolic responses. We developed an integrative approach that bridges the gap between these data using known molecular interactions, thus highlighting major response pathways. We harnessed this approach to reveal cellular pathways related to alpha-synuclein, a small lipid-binding protein implicated in several neurodegenerative disorders including Parkinson disease. For this we screened an established yeast model for alpha-synuclein toxicity to identify genes that when overexpressed alter cellular survival. Application of our algorithm to these data and data from mRNA profiling provided functional explanations for many of these genes and revealed novel relations between alpha-synuclein toxicity and basic cellular pathways.
Characterizing the DNA-binding specificities of transcription factors is a key problem in computational biology that has been addressed by multiple algorithms. These usually take as input sequences that are putatively bound by the same factor and output one or more DNA motifs. A common practice is to apply several such algorithms simultaneously to improve coverage at the price of redundancy. In interpreting such results, two tasks are crucial: clustering of redundant motifs, and attributing the motifs to transcription factors by retrieval of similar motifs from previously characterized motif libraries. Both tasks inherently involve motif comparison. Here we present a novel method for comparing and merging motifs, based on Bayesian probabilistic principles. This method takes into account both the similarity in positional nucleotide distributions of the two motifs and their dissimilarity to the background distribution. We demonstrate the use of the new comparison method as a basis for motif clustering and retrieval procedures, and compare it to several commonly used alternatives. Our results show that the new method outperforms other available methods in accuracy and sensitivity. We incorporated the resulting motif clustering and retrieval procedures in a large-scale automated pipeline for analyzing DNA motifs. This pipeline integrates the results of various DNA motif discovery algorithms and automatically merges redundant motifs from multiple training sets into a coherent annotated library of motifs. Application of this pipeline to recent genome-wide transcription factor location data in S. cerevisiae successfully identified DNA motifs in a manner that is as good as semi-automated analysis reported in the literature. Moreover, we show how this analysis elucidates the mechanisms of condition-specific preferences of transcription factors.
Regulation of gene expression plays a central role in the activity of living cells and in their response to internal (e.g., cell division) or external (e.g., stress) stimuli. Key players in determining gene-specific regulation are transcription factors that bind sequence-specific sites on the DNA, modulating the expression of nearby genes. To understand the regulatory program of the cell, we need to identify these transcription factors, when they act, and on which genes. Transcription regulatory maps can be assembled by computational analysis of experimental data, by discovering the DNA recognition sequences (motifs) of transcription factors and their occurrences along the genome. Such an analysis usually results in a large number of overlapping motifs. To reconstruct regulatory maps, it is crucial to combine similar motifs and to relate them to transcription factors. To this end we developed an accurate fully-automated method, termed BLiC, based upon an improved similarity measure for comparing DNA motifs. By applying it to genome-wide data in yeast, we identified the DNA motifs of transcription factors and their putative target genes. Finally, we analyze motifs of transcription factors that alter their target genes under different conditions, and show how cells adjust their regulatory program in response to environmental changes.
WebMOTIFS provides a web interface that facilitates the discovery and analysis of DNA-sequence motifs. Several studies have shown that the accuracy of motif discovery can be significantly improved by using multiple de novo motif discovery programs and using randomized control calculations to identify the most significant motifs or by using Bayesian approaches. WebMOTIFS makes it easy to apply these strategies. Using a single submission form, users can run several motif discovery programs and score, cluster and visualize the results. In addition, the Bayesian motif discovery program THEME can be used to determine the class of transcription factors that is most likely to regulate a set of sequences. Input can be provided as a list of gene or probe identifiers. Used with the default settings, WebMOTIFS accurately identifies biologically relevant motifs from diverse data in several species. WebMOTIFS is freely available at http://fraenkel.mit.edu/webmotifs.
Computational methods for discovery of sequence elements that are enriched in a target set compared with a background set are fundamental in molecular biology research. One example is the discovery of transcription factor binding motifs that are inferred from ChIP–chip (chromatin immuno-precipitation on a microarray) measurements. Several major challenges in sequence motif discovery still require consideration: (i) the need for a principled approach to partitioning the data into target and background sets; (ii) the lack of rigorous models and of an exact p-value for measuring motif enrichment; (iii) the need for an appropriate framework for accounting for motif multiplicity; (iv) the tendency, in many of the existing methods, to report presumably significant motifs even when applied to randomly generated data. In this paper we present a statistical framework for discovering enriched sequence elements in ranked lists that resolves these four issues. We demonstrate the implementation of this framework in a software application, termed DRIM (discovery of rank imbalanced motifs), which identifies sequence motifs in lists of ranked DNA sequences. We applied DRIM to ChIP–chip and CpG methylation data and obtained the following results. (i) Identification of 50 novel putative transcription factor (TF) binding sites in yeast ChIP–chip data. The biological function of some of them was further investigated to gain new insights on transcription regulation networks in yeast. For example, our discoveries enable the elucidation of the network of the TF ARO80. Another finding concerns a systematic TF binding enhancement to sequences containing CA repeats. (ii) Discovery of novel motifs in human cancer CpG methylation data. Remarkably, most of these motifs are similar to DNA sequence elements bound by the Polycomb complex that promotes histone methylation. Our findings thus support a model in which histone methylation and CpG methylation are mechanistically linked. 
Overall, we demonstrate that the statistical framework embodied in the DRIM software tool is highly effective for identifying regulatory sequence elements in a variety of applications ranging from expression and ChIP–chip to CpG methylation data. DRIM is publicly available at http://bioinfo.cs.technion.ac.il/drim.
A computational problem with many applications in molecular biology is to identify short DNA sequence patterns (motifs) that are significantly overrepresented in a target set of genomic sequences relative to a background set of genomic sequences. One example is a target set that contains DNA sequences to which a specific transcription factor protein was experimentally measured as bound while the background set contains sequences to which the same transcription factor was not bound. Overrepresented sequence motifs in the target set may represent a subsequence that is molecularly recognized by the transcription factor. An inherent limitation of the above formulation of the problem lies in the fact that in many cases data cannot be clearly partitioned into distinct target and background sets in a biologically justified manner. We describe a statistical framework for discovering motifs in a list of genomic sequences that are ranked according to a biological parameter or measurement (e.g., transcription factor to sequence binding measurements). Our approach circumvents the need to partition the data into target and background sets using arbitrarily set parameters. The framework is implemented in a software tool called DRIM. The application of DRIM led to the identification of novel putative transcription factor binding sites in yeast and to the discovery of previously unknown motifs in CpG methylation regions in human cancer cell lines.
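The ranked-list idea can be illustrated with a short sketch. The abstract does not spell out the statistic, so this example assumes a minimum-hypergeometric-style score: for every cutoff in the ranked list, it computes the hypergeometric tail probability of seeing at least the observed number of motif occurrences above the cutoff, then keeps the smallest value. The function names and the toy input are hypothetical, and the returned score is not itself a p-value, because taking the minimum over many cutoffs requires a further correction that is omitted here.

```python
from math import comb


def hypergeom_tail(N, B, n, b):
    """P(X >= b) for X ~ Hypergeometric(N items, B marked, n drawn)."""
    total = comb(N, n)
    upper = min(n, B)
    return sum(comb(B, k) * comb(N - B, n - k) for k in range(b, upper + 1)) / total


def min_hypergeom_score(occurrence):
    """Scan all cutoffs of a ranked 0/1 occurrence vector.

    occurrence[i] is 1 if the motif occurs in the sequence ranked i-th
    (best first), else 0. Returns (smallest tail probability, cutoff).
    The score is uncorrected for the multiple cutoffs that were tested.
    """
    N = len(occurrence)
    B = sum(occurrence)
    best_score, best_cutoff = 1.0, 0
    hits = 0
    for n in range(1, N + 1):
        hits += occurrence[n - 1]
        tail = hypergeom_tail(N, B, n, hits)
        if tail < best_score:
            best_score, best_cutoff = tail, n
    return best_score, best_cutoff


# Toy ranked list: the motif occurs only in the three top-ranked sequences.
ranked_hits = [1, 1, 1, 0, 0, 0, 0, 0]
score, cutoff = min_hypergeom_score(ranked_hits)
print(score, cutoff)
```

Because the cutoff is chosen by the data rather than fixed in advance, no arbitrary threshold is needed to split the list into target and background sets, which is the point made in the paragraph above.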
We mapped the transcriptional regulatory circuitry for six master regulators in human hepatocytes using chromatin immunoprecipitation and high-resolution promoter microarrays. The results show that these regulators form a highly interconnected core circuitry, and reveal the local regulatory network motifs created by regulator–gene interactions. Autoregulation was a prominent theme among these regulators. We found that hepatocyte master regulators tend to bind promoter regions combinatorially and that the number of transcription factors bound to a promoter corresponds with observed gene expression. Our studies reveal portions of the core circuitry of human hepatocytes.
autoregulation; hepatocyte; transcriptional regulation; regulatory hierarchy; chromatin immunoprecipitation