Chromatin immunoprecipitation (ChIP) experiments allow the location of transcription factors to be determined across the genome. Subsequent analysis of the sequences of the identified regions allows binding to be localized at a higher resolution than can be achieved by current high-throughput experiments without sequence analysis, and may provide important insight into the regulatory programs enacted by the protein of interest. In this chapter we review the tools, workflow, and common pitfalls of such analyses, and recommend strategies for effective motif discovery from these data.
motif discovery; sequence motifs; chromatin immunoprecipitation; ChIP-seq; ChIP-chip; transcriptional regulation
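As an illustration of how sequence analysis can localize binding within a ChIP-identified region, the following is a minimal sketch of scanning a sequence with a log-odds position weight matrix. The matrix values, the uniform background, and the function names are hypothetical and are not taken from the chapter. Only the forward strand is scanned, and the sequence is assumed to contain only A, C, G and T and to be at least as long as the motif.

```python
import math

# Hypothetical position frequency matrix for a 4-bp motif (one dict per position).
pfm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]
BACKGROUND = 0.25  # assumed uniform base composition


def log_odds(window):
    """Sum of per-position log2(motif frequency / background frequency)."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(pfm, window))


def best_hit(sequence):
    """Return (score, offset) of the highest-scoring window on the forward strand."""
    width = len(pfm)
    scores = [
        (log_odds(sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scores)


score, offset = best_hit("CCAGTACC")
print(offset, round(score, 3))
```

In this toy sequence the best window starts at offset 2 (AGTA), which narrows the putative binding site from the whole region to four bases.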
Signaling and regulatory networks are essential for cells to control processes such as growth, differentiation, and response to stimuli. Although many “omic” data sources are available to probe signaling pathways, these data are typically sparse and noisy. Thus, it has been difficult to use these data to discover the causes of disease and to propose new therapeutic strategies. We overcome these problems and use “omic” data to reconstruct simultaneously multiple pathways that are altered in a particular condition by solving the prize-collecting Steiner forest problem. To evaluate this approach, we use the well-characterized yeast pheromone response. We then apply the method to human glioblastoma data, searching for a forest of trees, each of which is rooted in a different cell-surface receptor. This approach discovers both overlapping and independent signaling pathways that are enriched in functionally and clinically relevant proteins, which could provide the basis for new therapeutic strategies. Although the algorithm was not provided with any information about the phosphorylation status of receptors, it identifies a small set of clinically relevant receptors among hundreds present in the interactome.
multiple network reconstruction; prize-collecting Steiner forest; signaling pathways
Signaling and transcription are tightly integrated processes that underlie many cellular responses to the environment. A network of signaling events, often mediated by post-translational modification on proteins, can lead to long-term changes in cellular behavior by altering the activity of specific transcriptional regulators and consequently the expression level of their downstream targets. As many high-throughput, “-omics” methods are now available that can simultaneously measure changes in hundreds of proteins and thousands of transcripts, it should be possible to systematically reconstruct cellular responses to perturbations in order to discover previously unrecognized signaling pathways.
This chapter describes a computational method for discovering such pathways that aims to compensate for the varying levels of noise present in these diverse data sources. Based on the concept of constrained optimization on networks, the method seeks to achieve two conflicting aims: (1) to link together many of the signaling proteins and differentially expressed transcripts identified in the experiments (“constraints”) using previously reported protein-protein and protein-DNA interactions, while (2) keeping the resulting network small and ensuring it is composed of the highest confidence interactions (“optimization”). A further distinctive feature of this approach is the use of transcriptional data as evidence of upstream signaling events that drive changes in gene expression, rather than as proxies for downstream changes in the levels of the encoded proteins.
We recently demonstrated that by applying this method to phosphoproteomic and transcriptional data from the pheromone response in yeast, we were able to recover functionally coherent pathways and to reveal many components of the cellular response that are not readily apparent in the original data. Here we provide a more detailed description of the method, explore the robustness of the solution to the noise level of input data and discuss the effect of parameter values.
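The trade-off between the two conflicting aims described above can be sketched with one common form of the prize-collecting Steiner tree objective: a penalty for each experimentally detected node left out of the network, plus the cost of every interaction included. The toy interactome, the prizes, the weighting parameter beta and the function name are hypothetical, and the sketch only scores candidate edge sets; it does not search for the optimum or check that the edges form a tree.

```python
# Hypothetical toy interactome: edge -> cost (low cost = high-confidence interaction).
edge_cost = {
    ("A", "B"): 0.2,
    ("B", "C"): 0.3,
    ("C", "D"): 0.9,
}
# Hypothetical prizes for nodes detected in the experiments.
prize = {"A": 1.0, "C": 1.0, "D": 0.5}


def pcst_objective(edges, beta=1.0):
    """beta * (prizes of omitted nodes) + (costs of included edges)."""
    covered = {node for edge in edges for node in edge}
    missed = sum(p for node, p in prize.items() if node not in covered)
    cost = sum(edge_cost[e] for e in edges)
    return beta * missed + cost


small = [("A", "B"), ("B", "C")]               # omits prize node D
large = [("A", "B"), ("B", "C"), ("C", "D")]   # pays for a low-confidence edge

for beta in (1.0, 3.0):
    print(beta, pcst_objective(small, beta), pcst_objective(large, beta))
```

With a low beta the smaller network scores better because reaching D is not worth the expensive edge; raising beta makes omitted prizes costlier, so the larger network wins.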
The rapid development of high-throughput biotechnologies has led to an onslaught of data describing genetic perturbations and changes in mRNA and protein levels in the cell. Because each assay provides a one-dimensional snapshot of active signaling pathways, it has become desirable to perform multiple assays (e.g. mRNA expression and phospho-proteomics) to measure a single condition. However, as experiments expand to accommodate various cellular conditions, proper analysis and interpretation of these data have become more challenging. Here we introduce a novel approach called SAMNet, for Simultaneous Analysis of Multiple Networks, that is able to interpret diverse assays over multiple perturbations. The algorithm uses a constrained optimization approach to integrate mRNA expression data with upstream genes, selecting edges in the protein-protein interaction network that best explain the changes across all perturbations. The result is a putative set of protein interactions that succinctly summarizes the results from all experiments, highlighting the network elements unique to each perturbation. We evaluated SAMNet in both yeast and human datasets. The yeast dataset measured the cellular response to seven different transition metals, and the human dataset measured cellular changes in four different lung cancer models of Epithelial-Mesenchymal Transition (EMT), a crucial process in tumor metastasis. SAMNet was able to identify canonical yeast metal-processing genes unique to each commodity in the yeast dataset, as well as human genes such as β-catenin and TCF7L2/TCF4 that are required for EMT signaling but escaped detection in the mRNA and phospho-proteomic data. Moreover, SAMNet also highlighted drugs likely to modulate EMT, identifying a series of less canonical genes known to be affected by the BCR-ABL inhibitor imatinib (Gleevec), suggesting a possible influence of this drug on EMT.
We demonstrate that the binding sites for highly conserved transcription factors vary extensively between human and mouse. We mapped the binding of four tissue-specific transcription factors (FOXA2, HNF1A, HNF4A, HNF6) to 4,000 orthologous gene pairs in hepatocytes purified from human and mouse livers. Despite the conserved function of these factors, from 41% to 89% of their binding events appear to be species-specific. When the same protein binds the promoters of orthologous genes, approximately two-thirds of the binding sites do not align.
The mitogen-activated protein kinase (MAPK) ERK2 is ubiquitously expressed in mammalian tissues and is involved in a wide range of biological processes. Although MAPKs have been intensely studied, identification of their substrates remains challenging. We have optimized a chemical genetic system using analog-sensitive ERK2, a form of ERK2 engineered to utilize an analog of ATP, to tag and isolate ERK2 substrates in vitro. This approach identified 80 proteins phosphorylated by ERK2, 13 of which are known ERK2 substrates. The 80 substrates are associated with diverse cellular processes, including regulation of transcription and translation, and mRNA processing, as well as regulation of the activity of the Rho-family guanosine triphosphatases. We found that one of the newly identified substrates, ETV3 (a member of the E-twenty six family of transcriptional regulators), was extensively phosphorylated on sites within canonical and non-canonical ERK motifs. Phosphorylation of ETV3 regulated transcription by preventing its binding to DNA at promoters for several thousand genes, including some involved in negative feedback regulation of itself and of upstream signals.
SOX2 is a master regulator of both pluripotent embryonic stem cells (ESCs) and multipotent neural progenitor cells (NPCs); however, we currently lack a detailed understanding of how SOX2 controls these distinct stem cell populations. Here we show by genome-wide analysis that, while SOX2 bound to a distinct set of gene promoters in ESCs and NPCs, the majority of regions coincided with unique distal enhancer elements, important cis-acting regulators of tissue-specific gene expression programs. Notably, SOX2 bound the same consensus DNA motif in both cell types, suggesting that additional factors contribute to target specificity. We found that, similar to its association with OCT4 (Pou5f1) in ESCs, the related POU family member BRN2 (Pou3f2) co-occupied a large set of putative distal enhancers with SOX2 in NPCs. Forced expression of BRN2 in ESCs led to functional recruitment of SOX2 to a subset of NPC-specific targets and to precocious differentiation toward a neural-like state. Further analysis of the bound sequences revealed differences in the distances of SOX and POU peaks in the two cell types and identified motifs for additional transcription factors. Together, these data suggest that SOX2 controls a larger network of genes than previously anticipated through binding of distal enhancers and that transitions in POU partner factors may control tissue-specific transcriptional programs. Our findings have important implications for understanding lineage specification and somatic cell reprogramming, where SOX2, OCT4, and BRN2 have been shown to be key factors.
In mammals, a few thousand transcription factors regulate the differential expression of more than 20,000 genes to specify ∼200 functionally distinct cell types during development. How this is accomplished has been a major focus of biology. Transcription factors bind non-coding DNA regulatory elements, including proximal promoters and distal enhancers, to control gene expression. Emerging evidence indicates that transcription factor binding at distal enhancers plays an important role in the establishment of tissue-specific gene expression programs during development. Moreover, combinatorial binding among groups of transcription factors can further increase the diversity and specificity of regulatory modules. Here, we report the genome-wide binding profile of the HMG-box containing transcription factor SOX2 in mouse embryonic stem cells (ESCs) and neural progenitor cells (NPCs), and we show that SOX2 occupied a distinct set of binding sites with POU homeodomain family members, OCT4 in ESCs and BRN2 in NPCs. Thus, transitions in SOX2-POU partners may control tissue-specific gene networks. Ultimately, a global analysis detailing the combinatorial binding of transcription factors across all tissues is critical to understand cell fate specification in the context of the complex mammalian genome.
Polycomb repressive complexes (PRCs) play key roles in developmental epigenetic regulation. Yet the mechanisms that target PRCs to specific loci in mammalian cells remain incompletely understood. In this study, we show that Bmi1, a core component of Polycomb Repressive Complex 1 (PRC1), binds directly to the Runx1/CBFβ transcription factor complex. Genome-wide studies in megakaryocytic cells demonstrate significant chromatin occupancy overlap between the PRC1 core component Ring1b and Runx1/CBFβ, and functional regulation of a considerable fraction of commonly bound genes. Bmi1/Ring1b and Runx1/CBFβ deficiency generate partial phenocopies of one another in vivo. We also show that Ring1b occupies key Runx1 binding sites in primary murine thymocytes and that this occurs via Polycomb Repressive Complex 2 (PRC2) independent mechanisms. Genetic depletion of Runx1 results in reduced Ring1b binding at these sites in vivo. These findings provide evidence for site-specific PRC1 chromatin recruitment by core binding transcription factors in mammalian cells.
Cellular signal transduction generally involves cascades of post-translational protein modifications that rapidly catalyze changes in protein-DNA interactions and gene expression. High-throughput measurements are improving our ability to study each of these stages individually, but do not capture the connections between them. Here we present an approach for building a network of physical links among these data that can be used to prioritize targets for pharmacological intervention. Our method recovers the critical missing links between proteomic and transcriptional data by relating changes in chromatin accessibility to changes in expression and then uses these links to connect proteomic and transcriptome data. We applied our approach to integrate epigenomic, phosphoproteomic and transcriptome changes induced by the variant III mutation of the epidermal growth factor receptor (EGFRvIII) in a cell line model of glioblastoma multiforme (GBM). To test the relevance of the network, we used small molecules to target highly connected nodes implicated by the network model that were not detected by the experimental data in isolation and we found that a large fraction of these agents alter cell viability. Among these are two compounds, ICG-001, targeting CREB binding protein (CREBBP), and PKF118–310, targeting β-catenin (CTNNB1), which have not been tested previously for effectiveness against GBM. At the level of transcriptional regulation, we used chromatin immunoprecipitation sequencing (ChIP-Seq) to experimentally determine the genome-wide binding locations of p300, a transcriptional co-regulator highly connected in the network. Analysis of p300 target genes suggested its role in tumorigenesis. We propose that this general method, in which experimental measurements are used as constraints for building regulatory networks from the interactome while taking into account noise and missing data, should be applicable to a wide range of high-throughput datasets.
The ways in which cells respond to changes in their environment are controlled by networks of physical links among the proteins and genes. The initial signal of a change in conditions rapidly passes through these networks from the cytoplasm to the nucleus, where it can lead to long-term alterations in cellular behavior by controlling the expression of genes. These cascades of signaling events underlie many normal biological processes. As a result, being able to map out how these networks change in disease can provide critical insights for new approaches to treatment. We present a computational method for reconstructing these networks by finding links between the rapid short-term changes in proteins and the longer-term changes in gene regulation. This method brings together systematic measurements of protein signaling, genome organization and transcription in the context of protein-protein and protein-DNA interactions. When used to analyze datasets from an oncogene expressing cell line model of human glioblastoma, our approach identifies key nodes that affect cell survival and functional transcriptional regulators.
Heat-Shock Factor 1 (HSF1), master regulator of the heat-shock response, facilitates malignant transformation, cancer cell survival and proliferation in model systems. The common assumption is that these effects are mediated through regulation of heat-shock protein (HSP) expression. However, the transcriptional network that HSF1 coordinates directly in malignancy and its relationship to the heat-shock response have never been defined. By comparing cells with high and low malignant potential alongside their non-transformed counterparts, we identify an HSF1-regulated transcriptional program specific to highly malignant cells and distinct from heat shock. Cancer-specific genes in this program support oncogenic processes: cell-cycle regulation, signaling, metabolism, adhesion and translation. HSP genes are integral to this program; however, many are uniquely regulated in malignancy. This HSF1 cancer program is active in breast, colon and lung tumors isolated directly from human patients and is strongly associated with metastasis and death. Thus, HSF1 rewires the transcriptome in tumorigenesis, with prognostic and therapeutic implications.
HSP90; HSP70; ChIP-Seq; genome-wide; outcome signature; Nurses’ Health Study; immunohistochemistry
In Huntington’s disease (HD), polyglutamine expansions in the huntingtin (Htt) protein cause subtle changes in cellular functions that, over time, lead to neurodegeneration and death. Studies have indicated that activation of the heat shock response can reduce many of the effects of mutant Htt in disease models, suggesting that the heat shock response is impaired in the disease. To understand the basis for this impairment, we have used genome-wide chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-Seq) to examine the effects of mutant Htt on the master regulator of the heat shock response, HSF1. We find that, under normal conditions, HSF1 function is highly similar in cells carrying either wild-type or mutant Htt. However, polyQ-expanded Htt severely blunts the HSF1-mediated stress response. Surprisingly, we find that the HSF1 targets most affected upon stress are not directly associated with proteostasis, but with cytoskeletal binding, focal adhesion and GTPase activity. Our data raise the intriguing hypothesis that the accumulated damage from life-long impairment in these stress responses may contribute significantly to the etiology of Huntington’s disease.
Huntington Disease; heat shock transcription factor; Heat-Shock Response; Chromatin Immunoprecipitation; cDNA Microarrays; Deep Sequencing
High-throughput technologies including transcriptional profiling, proteomics and reverse genetics screens provide detailed molecular descriptions of cellular responses to perturbations. However, it is difficult to integrate these diverse data to reconstruct biologically meaningful signaling networks. Previously, we have established a framework for integrating transcriptional, proteomic and interactome data by searching for the solution to the prize-collecting Steiner tree problem. Here, we present a web server, SteinerNet, to make this method available in a user-friendly format for a broad range of users with data from any species. At a minimum, a user only needs to provide a set of experimentally detected proteins and/or genes and the server will search for connections among these data from the provided interactomes for yeast, human, mouse, Drosophila melanogaster and Caenorhabditis elegans. More advanced users can upload their own interactome data as well. The server provides interactive visualization of the resulting optimal network and downloadable files detailing the analysis and results. We believe that SteinerNet will be useful for researchers who would like to integrate their high-throughput data for a specific condition or cellular response and to find biologically meaningful pathways. SteinerNet is accessible at http://fraenkel.mit.edu/steinernet.
The growing epidemic of obesity and metabolic diseases calls for a better understanding of adipocyte biology. The regulation of transcription in adipocytes is particularly important, as it is a target for several therapeutic approaches. Transcriptional outcomes are influenced by both histone modifications and transcription factor binding. Although the epigenetic states and binding sites of several important transcription factors have been profiled in the mouse 3T3-L1 cell line, such data are lacking in human adipocytes. In this study, we identified H3K56 acetylation sites in human adipocytes derived from mesenchymal stem cells. H3K56 is acetylated by CBP and p300, and deacetylated by SIRT1, all of which are proteins with important roles in diabetes and insulin signaling. We found that while almost half of the genome shows signs of H3K56 acetylation, the highest level of H3K56 acetylation is associated with transcription factors and proteins in the adipokine signaling and Type II Diabetes pathways. In order to discover the transcription factors that recruit acetyltransferases and deacetylases to sites of H3K56 acetylation, we analyzed DNA sequences near H3K56 acetylated regions and found that the E2F recognition sequence was enriched. Using chromatin immunoprecipitation followed by high-throughput sequencing, we confirmed that genes bound by E2F4, as well as those bound by HSF-1 and C/EBPα, have higher than expected levels of H3K56 acetylation, and that the transcription factor binding sites and acetylation sites are often adjacent but rarely overlap. We also discovered a significant difference between bound targets of C/EBPα in 3T3-L1 and human adipocytes, highlighting the need to construct species-specific epigenetic and transcription factor binding site maps. This is the first genome-wide profile of H3K56 acetylation, E2F4, C/EBPα and HSF-1 binding in human adipocytes, and will serve as an important resource for better understanding adipocyte transcriptional regulation.
We have used a simple and efficient method to identify condition-specific transcriptional regulatory sites in vivo to help elucidate the molecular basis of sex-related differences in transcription, which are widespread in mammalian tissues and affect normal physiology, drug response, inflammation, and disease. To systematically uncover transcriptional regulators responsible for these differences, we used DNase hypersensitivity analysis coupled with high-throughput sequencing to produce condition-specific maps of regulatory sites in male and female mouse livers and in livers of male mice feminized by continuous infusion of growth hormone (GH). We identified 71,264 hypersensitive sites, with 1,284 showing robust sex-related differences. Continuous GH infusion suppressed the vast majority of male-specific sites and induced a subset of female-specific sites in male livers. We also identified broad genomic regions (up to ∼100 kb) showing sex-dependent hypersensitivity and similar patterns of GH responses. We found a strong association of sex-specific sites with sex-specific transcription; however, a majority of sex-specific sites were >100 kb from sex-specific genes. By analyzing sequence motifs within regulatory regions, we identified two known regulators of liver sexual dimorphism and several new candidates for further investigation. This approach can readily be applied to mapping condition-specific regulatory sites in mammalian tissues under a wide variety of physiological conditions.
Cellular response to stimuli is typically complex and involves both regulatory and metabolic processes. Large-scale experimental efforts to identify components of these processes often comprise genetic screening and transcriptomic profiling assays. We previously established that in yeast genetic screens tend to identify response regulators, while transcriptomic profiling assays tend to identify components of metabolic processes. ResponseNet is a network-optimization approach that integrates the results from these assays with data of known molecular interactions. Specifically, ResponseNet identifies a high-probability sub-network, composed of signaling and regulatory molecular interaction paths, through which putative response regulators may lead to the measured transcriptomic changes. Computationally, this is achieved by formulating a minimum-cost flow optimization problem and solving it efficiently using linear programming tools. The ResponseNet web server offers a simple interface for applying ResponseNet. Users can upload weighted lists of proteins and genes and obtain a sparse, weighted, molecular interaction sub-network connecting their data. The predicted sub-network and its gene ontology enrichment analysis are presented graphically or as text. Consequently, the ResponseNet web server enables researchers who were previously limited to separate analysis of their distinct, large-scale experiments to meaningfully integrate their data and substantially expand their understanding of the underlying cellular response. ResponseNet is available at http://bioinfo.bgu.ac.il/respnet.
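To give a feel for how a high-probability path between a putative regulator and a measured gene can be found by cost minimization, here is a minimal sketch of the single-unit special case of minimum-cost flow: a shortest-path search in which each edge cost is assumed to be the negative log of its interaction probability. This is not the ResponseNet implementation, which solves a full flow problem by linear programming; the network, the probabilities and the function name are hypothetical.

```python
import heapq
import math

# Hypothetical directed interaction probabilities.
prob = {
    "regulator": {"kinase": 0.9, "adaptor": 0.5},
    "kinase": {"tf": 0.8},
    "adaptor": {"tf": 0.9},
    "tf": {"gene": 0.7},
    "gene": {},
}


def most_probable_path(source, target):
    """Dijkstra on -log(probability) costs; returns (path, path probability)."""
    best = {source: 0.0}
    parent = {}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            break
        if cost > best.get(node, math.inf):
            continue  # stale heap entry
        for nxt, p in prob[node].items():
            new_cost = cost - math.log(p)  # assumed cost: -log(probability)
            if new_cost < best.get(nxt, math.inf):
                best[nxt] = new_cost
                parent[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))
    if target not in best:
        return None, 0.0
    path = [target]
    while path[-1] != source:
        path.append(parent[path[-1]])
    path.reverse()
    return path, math.exp(-best[target])


path, p = most_probable_path("regulator", "gene")
print(path, round(p, 3))
```

Because the costs add while the probabilities multiply, minimizing the summed cost is equivalent to maximizing the product of edge probabilities along the path.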
The transcriptional regulatory networks that specify and maintain human tissue diversity are largely uncharted. To gain insight into this circuitry, we used chromatin immunoprecipitation combined with promoter microarrays to identify systematically the genes occupied by the transcriptional regulators HNF1α, HNF4α, and HNF6, together with RNA polymerase II, in human liver and pancreatic islets. We identified tissue-specific regulatory circuits formed by HNF1α, HNF4α, and HNF6 with other transcription factors, revealing how these factors function as master regulators of hepatocyte and islet transcription. Our results suggest how misregulation of HNF4α can contribute to type 2 diabetes.
Foxp3+CD4+CD25+ regulatory T (Treg) cells are essential for the prevention of autoimmunity [1,2]. Treg cells have an attenuated cytokine response to T-cell receptor stimulation, and can suppress the proliferation and effector function of neighbouring T cells [3,4]. The forkhead transcription factor Foxp3 (forkhead box P3) is selectively expressed in Treg cells, is required for Treg development and function, and is sufficient to induce a Treg phenotype in conventional CD4+CD25− T cells [5–8]. Mutations in Foxp3 cause severe, multi-organ autoimmunity in both human and mouse [9–11]. FOXP3 can cooperate in a DNA-binding complex with NFAT (nuclear factor of activated T cells) to regulate the transcription of several known target genes [12]. However, the global set of genes regulated directly by Foxp3 is not known and consequently, how this transcription factor controls the gene expression programme for Treg function is not understood. Here we identify Foxp3 target genes and report that many of these are key modulators of T-cell activation and function. Remarkably, the predominant, although not exclusive, effect of Foxp3 occupancy is to suppress the activation of target genes on T-cell stimulation. Foxp3 suppression of its targets appears to be crucial for the normal function of Treg cells, because overactive variants of some target genes are known to be associated with autoimmune disease.
DNA-binding transcriptional regulators interpret the genome's regulatory code by binding to specific sequences to induce or repress gene expression [1]. Comparative genomics has recently been used to identify potential cis-regulatory sequences within the yeast genome on the basis of phylogenetic conservation [2–6], but this information alone does not reveal if or when transcriptional regulators occupy these binding sites. We have constructed an initial map of yeast's transcriptional regulatory code by identifying the sequence elements that are bound by regulators under various conditions and that are conserved among Saccharomyces species. The organization of regulatory elements in promoters and the environment-dependent use of these elements by regulators are discussed. We find that environment-specific use of regulatory elements predicts mechanistic models for the function of a large population of yeast's transcriptional regulators.
Biomolecular pathways are built from diverse types of pairwise interactions, ranging from physical protein-protein interactions and modifications to indirect regulatory relationships. One goal of systems biology is to bridge three aspects of this complexity: the growing body of high-throughput data assaying these interactions; the specific interactions in which individual genes participate; and the genome-wide patterns of interactions in a system of interest. Here, we describe methodology for simultaneously predicting specific types of biomolecular interactions using high-throughput genomic data. This results in a comprehensive compendium of whole-genome networks for yeast, derived from ∼3,500 experimental conditions and describing 30 interaction types, which range from general (e.g. physical or regulatory) to specific (e.g. phosphorylation or transcriptional regulation). We used these networks to investigate molecular pathways in carbon metabolism and cellular transport, proposing a novel connection between glycogen breakdown and glucose utilization supported by recent publications. Additionally, 14 specific predicted interactions in DNA topological change and protein biosynthesis were experimentally validated. We analyzed the systems-level network features within all interactomes, verifying the presence of small-world properties and enrichment for recurring network motifs. This compendium of physical, synthetic, regulatory, and functional interaction networks has been made publicly available through an interactive web interface for investigators to utilize in future research at http://function.princeton.edu/bioweaver/.
To maintain the complexity of living biological systems, many proteins must interact in a coordinated manner to integrate their unique functions into a cooperative system. Pathways are typically constructed to capture modular subsets of this dynamic network, each made up of a collection of biomolecular interactions of diverse types that together carry out a specific cellular function. Deciphering these pathways at a global level is a crucial step for unraveling systems biology, aiding at every level from basic biological understanding to translational biomarker and drug target discovery. The combination of high-throughput genomic data with advanced computational methods has enabled us to infer the first genome-wide compendium of biomolecular pathway networks, comprising 30 distinct biomolecular interaction types. We demonstrate that this interaction network compendium, derived from ∼3,500 experimental conditions, can be used to direct a range of biomedical hypothesis generation and testing. We show that our results can be used to predict novel protein interactions and new pathway components, and also that they enable system-level analysis to investigate the network characteristics of cell-wide regulatory circuits. The resulting compendium of biological networks is made publicly available through an interactive web interface to enable future research in other biological systems of interest.
Cellular signaling and regulatory networks underlie fundamental biological processes such as growth, differentiation, and response to the environment. Although there are now various high-throughput methods for studying these processes, knowledge of them remains fragmentary. Typically, the vast majority of hits identified by transcriptional, proteomic, and genetic assays lie outside of the expected pathways. These unexpected components of the cellular response are often the most interesting, because they can provide new insights into biological processes and potentially reveal new therapeutic approaches. However, they are also the most difficult to interpret. We present a technique, based on the Steiner tree problem, that uses previously reported protein-protein and protein-DNA interactions to determine how these hits are organized into functionally coherent pathways, revealing many components of the cellular response that are not readily apparent in the original data. Applied simultaneously to phosphoproteomic and transcriptional data for the yeast pheromone response, it identifies changes in diverse cellular processes that extend far beyond the expected pathways.
The transcription factor GATA-1 is required for terminal erythroid maturation and functions as an activator or repressor depending on gene context. Yet its in vivo site selectivity and ability to distinguish between activated versus repressed genes remain incompletely understood. In this study, we performed GATA-1 ChIP-seq in erythroid cells and compared it to GATA-1 induced gene expression changes. Bound and differentially expressed genes contain a greater number of GATA binding motifs, a higher frequency of palindromic GATA sites, and closer occupancy to the transcriptional start site versus non-differentially expressed genes. Moreover, we show that the transcription factor Zbtb7a occupies GATA-1 bound regions of some direct GATA-1 target genes, that the presence of SCL/TAL1 helps distinguish transcriptional activation versus repression, and that Polycomb Repressive Complex 2 (PRC2) is involved in epigenetic silencing of a subset of GATA-1 repressed genes. These data provide insights into GATA-1 mediated gene regulation in vivo.
GATA-1; Polycomb; Zbtb7a; erythroid; ChIP-seq
Understanding the mechanistic basis of transcriptional regulation has been a central focus of molecular biology since its inception. New high-throughput chromatin immunoprecipitation experiments have revealed that most regulatory proteins bind thousands of sites in mammalian genomes. However, the functional significance of these binding sites remains unclear. We present a quantitative model of transcriptional regulation that suggests the contribution of each binding site to tissue-specific gene expression depends strongly on its position relative to the transcription start site. For three cell types, we show that, by considering binding position, it is possible to predict relative expression levels between cell types with an accuracy approaching the level of agreement between different experimental platforms. Our model suggests that, for the transcription factors profiled in these cell types, a regulatory site's influence on expression falls off almost linearly with distance from the transcription start site in a 10 kilobase range. Binding to both evolutionarily conserved and non-conserved sequences contributes significantly to transcriptional regulation. Our approach also reveals the quantitative, tissue-specific role of individual proteins in activating or repressing transcription. These results suggest that regulator binding position plays a previously unappreciated role in influencing expression and blurs the classical distinction between proximal promoter and distal binding events.
Gene expression is controlled, in large part, by regulatory proteins called transcription factors that bind specific sites in the genome. A major focus of molecular biology has been understanding how these transcription factors interact with the cell's transcriptional machinery, the genome, and with each other to turn genes' expression on and off in various physiological contexts. Previous approaches for modeling transcriptional regulation have focused on the complex combinatorial interactions between groups of transcription factors at regulatory sites, or on the specific activating or repressive functions of individual proteins. In this work, we present a new modeling framework and demonstrate that an equally important, and previously overlooked, consideration in predicting the effect that a regulatory site has on gene expression is simply its location relative to the transcription start site of nearby genes. Our results show that, in general, the closer a binding event is to a gene's transcription start site, the more it influences expression. We also show that considering the particular proteins bound at a regulatory site helps predict the expression of nearby genes. However, considering the sequence conservation level of these sites does not lead to more accurate predictions.
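The position dependence described above can be illustrated with a minimal sketch in which each binding site contributes a weight that decays linearly with distance from the transcription start site and reaches zero at 10 kilobases. The exact functional form, the fitted parameters and the per-protein terms of the published model are not given here, so the weight function, the window size used as a default and the function names are assumptions for illustration only.

```python
def site_weight(distance_bp, window_bp=10_000):
    """Linearly decaying weight: 1 at the TSS, 0 at or beyond window_bp."""
    return max(0.0, 1.0 - abs(distance_bp) / window_bp)


def regulatory_score(site_distances_bp, window_bp=10_000):
    """Sum of position weights over all binding sites near one gene."""
    return sum(site_weight(d, window_bp) for d in site_distances_bp)


# Hypothetical gene with sites at the TSS, 5 kb upstream and 12 kb downstream.
print(regulatory_score([0, -5_000, 12_000]))
```

Under this assumed weighting, a site at the start site counts twice as much as one 5 kb away, and a site beyond the window contributes nothing.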