Related Articles

1.  Genome-scale Analysis of Escherichia coli FNR Reveals Complex Features of Transcription Factor Binding 
PLoS Genetics  2013;9(6):e1003565.
FNR is a well-studied global regulator of anaerobiosis, which is widely conserved across bacteria. Despite the importance of FNR and anaerobiosis in microbial lifestyles, the factors that influence its function on a genome-wide scale are poorly understood. Here, we report a functional genomic analysis of FNR action. We find that FNR occupancy at many target sites is strongly influenced by nucleoid-associated proteins (NAPs) that restrict access to many FNR binding sites. At a genome-wide level, only a subset of predicted FNR binding sites were bound under anaerobic fermentative conditions and many appeared to be masked by the NAPs H-NS, IHF and Fis. Similar assays in cells lacking H-NS and its paralog StpA showed increased FNR occupancy at sites bound by H-NS in WT strains, indicating that large regions of the genome are not readily accessible for FNR binding. Genome accessibility may also explain our finding that genome-wide FNR occupancy did not correlate with the match to consensus at binding sites, suggesting that significant variation in ChIP signal was attributable to cross-linking or immunoprecipitation efficiency rather than differences in binding affinities for FNR sites. Correlation of FNR ChIP-seq peaks with transcriptomic data showed that less than half of the FNR-regulated operons could be attributed to direct FNR binding. Conversely, FNR bound some promoters without regulating expression presumably requiring changes in activity of condition-specific transcription factors. Such combinatorial regulation may allow Escherichia coli to respond rapidly to environmental changes and confer an ecological advantage in the anaerobic but nutrient-fluctuating environment of the mammalian gut.
Author Summary
Regulation of gene expression by transcription factors (TFs) is key to adaptation to environmental changes. Our comprehensive, genome-scale analysis of a prototypical global TF, the anaerobic regulator FNR from Escherichia coli, leads to several novel and unanticipated insights into the influences on FNR binding genome-wide and the complex structure of bacterial regulons. We found that binding of NAPs restricts FNR binding at a subset of sites, suggesting that the bacterial genome is not freely accessible for FNR binding. Our finding that less than half of the predicted FNR binding sites were occupied in vivo further challenges the utility of using bioinformatic searches alone to predict regulon structure, reinforcing the need for experimental determination of TF binding. By correlating the occupancy data with transcriptomic data, we confirm that FNR serves as a global signal of anaerobiosis, but expression of some operons in the FNR regulon requires other regulators sensitive to alternative environmental stimuli. Thus, FNR binding and regulation appear to depend on both the nucleoprotein structure of the chromosome and on combinatorial binding of FNR with other regulators. Both of these phenomena are typical of TF binding in eukaryotes; our results establish that they are also features of bacterial TF binding.
PMCID: PMC3688515  PMID: 23818864
2.  PhyloGibbs: A Gibbs Sampling Motif Finder That Incorporates Phylogeny 
PLoS Computational Biology  2005;1(7):e67.
A central problem in the bioinformatics of gene regulation is to find the binding sites for regulatory proteins. One of the most promising approaches toward identifying these short and fuzzy sequence patterns is the comparative analysis of orthologous intergenic regions of related species. This analysis is complicated by various factors. First, one needs to take the phylogenetic relationship between the species into account in order to distinguish conservation that is due to the occurrence of functional sites from spurious conservation that is due to evolutionary proximity. Second, one has to deal with the complexities of multiple alignments of orthologous intergenic regions, and one has to consider the possibility that functional sites may occur outside of conserved segments. Here we present a new motif sampling algorithm, PhyloGibbs, that runs on arbitrary collections of multiple local sequence alignments of orthologous sequences. The algorithm searches over all ways in which an arbitrary number of binding sites for an arbitrary number of transcription factors (TFs) can be assigned to the multiple sequence alignments. These binding site configurations are scored by a Bayesian probabilistic model that treats aligned sequences by a model for the evolution of binding sites and “background” intergenic DNA. This model takes the phylogenetic relationship between the species in the alignment explicitly into account. The algorithm uses simulated annealing and Monte Carlo Markov-chain sampling to rigorously assign posterior probabilities to all the binding sites that it reports. In tests on synthetic data and real data from five Saccharomyces species our algorithm performs significantly better than four other motif-finding algorithms, including algorithms that also take phylogeny into account. Our results also show that, in contrast to the other algorithms, PhyloGibbs can make realistic estimates of the reliability of its predictions. 
Our tests suggest that, running on the five-species multiple alignment of a single gene's upstream region, PhyloGibbs on average recovers over 50% of all binding sites in S. cerevisiae at a specificity of about 50%, and 33% of all binding sites at a specificity of about 85%. We also tested PhyloGibbs on collections of multiple alignments of intergenic regions that were recently annotated, based on ChIP-on-chip data, to contain binding sites for the same TF. We compared PhyloGibbs's results with the previous analysis of these data using six other motif-finding algorithms. For 16 of 21 TFs for which all other motif-finding methods failed to find a significant motif, PhyloGibbs did recover a motif that matches the literature consensus. In 11 cases where there was disagreement in the results we compiled lists of known target genes from the literature, and found that running PhyloGibbs on their regulatory regions yielded a binding motif matching the literature consensus in all but one of the cases. Interestingly, these literature gene lists had little overlap with the targets annotated based on the ChIP-on-chip data. The PhyloGibbs code can be downloaded from or The full set of predicted sites from our tests on yeast are available at
Computational discovery of regulatory sites in intergenic DNA is one of the central problems in bioinformatics. Up until recently motif finders would typically take one of the following two general approaches. Given a known set of co-regulated genes, one searches their promoter regions for significantly overrepresented sequence motifs. Alternatively, in a “phylogenetic footprinting” approach one searches multiple alignments of orthologous intergenic regions for short segments that are significantly more conserved than expected based on the phylogeny of the species.
In this work the authors present an algorithm, PhyloGibbs, that combines these two approaches into one integrated Bayesian framework. The algorithm searches over all ways in which an arbitrary number of binding sites for an arbitrary number of transcription factors can be assigned to arbitrary collections of multiple sequence alignments while taking into account the phylogenetic relations between the sequences.
The authors perform a number of tests on synthetic data and real data from Saccharomyces genomes in which PhyloGibbs significantly outperforms other existing methods. Finally, a novel anneal-and-track strategy allows PhyloGibbs to make accurate estimates of the reliability of its predictions.
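The sampling idea behind motif finders of this kind can be illustrated with a deliberately simplified sketch. The code below is not PhyloGibbs: it is a plain Gibbs site sampler that places one site per sequence, assumes a uniform background, and ignores phylogeny, alignments, multiple TFs and the anneal-and-track step. The function name, parameters and toy sequences are assumptions made for illustration only.

```python
import random

ALPHABET = "ACGT"

def gibbs_site_sampler(seqs, width, iters=100, seed=0, pseudo=0.5):
    """Sample one motif start per sequence (simplified, non-phylogenetic)."""
    rng = random.Random(seed)
    # Start from a random site in every sequence.
    pos = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iters):
        for i, s in enumerate(seqs):
            # Column counts from the current sites of all OTHER sequences.
            counts = [{b: pseudo for b in ALPHABET} for _ in range(width)]
            for j, t in enumerate(seqs):
                if j != i:
                    for c in range(width):
                        counts[c][t[pos[j] + c]] += 1
            total = (len(seqs) - 1) + pseudo * len(ALPHABET)
            # Weight of each candidate start: motif likelihood over a
            # uniform background (0.25 per base, an assumption).
            weights = []
            for start in range(len(s) - width + 1):
                w = 1.0
                for c in range(width):
                    w *= (counts[c][s[start + c]] / total) / 0.25
                weights.append(w)
            # Resample this sequence's site in proportion to the weights.
            pos[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos

# Toy sequences (hypothetical) that each contain the word TGACTC.
toy_seqs = [
    "ACGTTGACTCAGGT",
    "GGTGACTCTTACGA",
    "CCATAGTGACTCAA",
    "TGACTCGGATCCAT",
]
toy_positions = gibbs_site_sampler(toy_seqs, width=6, iters=50, seed=1)
```

Because the sampler is stochastic, the returned positions depend on the seed; a fuller implementation would run several chains and, as the abstract describes for PhyloGibbs, track how often each site is sampled in order to assign posterior probabilities.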
PMCID: PMC1309704  PMID: 16477324
3.  The phosphoproteome of toll-like receptor-activated macrophages 
First global and quantitative analysis of phosphorylation cascades induced by toll-like receptor (TLR) stimulation in macrophages identifies nearly 7000 phosphorylation sites and shows extensive and dynamic up-regulation and down-regulation after lipopolysaccharide (LPS). In addition to the canonical TLR-associated pathways, mining of the phosphorylation data suggests an involvement of ATM/ATR kinases in signalling and shows that the cytoskeleton is a hotspot of TLR-induced phosphorylation. Intersecting transcription factor phosphorylation with bioinformatic promoter analysis of genes induced by LPS identified several candidate transcriptional regulators that were previously not implicated in TLR-induced transcriptional control.
Toll-like receptors (TLR) are a family of pattern recognition receptors that enable innate immune cells to sense infectious danger. Recognition of microbial structures, like lipopolysaccharide (LPS) of Gram-negative bacteria by TLR4, causes within hours substantial re-programming of macrophage gene expression, including up-regulation of chemokines driving inflammation, anti-microbial effector molecules and cytokines directing adaptive immune responses. TLR signalling is initiated by the adapter protein Myd88 and leads to the activation of kinase cascades that result in activation of the MAPK and NFkB pathways. Phosphorylation has an essential role in these early steps of TLR signalling, and in addition regulates critical transcription factors (TFs). Although TLR signalling has been extensively studied, a comprehensive analysis of phosphorylation events in TLR-activated macrophages is lacking. It is therefore unknown whether the canonical MAPK and NFkB pathways comprise the main phosphorylation events and which other molecular functions and processes are regulated by phosphorylation after stimulation with LPS.
Recent progress in mass spectrometry-based proteomics has opened the possibility to quantitatively investigate global changes in protein abundance and post-translational modifications. Stable isotope labelling with amino acids in cell culture (SILAC) allows highly accurate quantification, and has proved especially useful for direct comparison of phosphopeptide abundance in time-course or treatment analyses.
Here, we adapted SILAC to primary mouse macrophages, and performed a global, quantitative and kinetic analysis of the macrophage phosphoproteome after LPS stimulation. Bioinformatic analyses were used to identify kinases, pathways and biological processes enriched in the LPS-regulated phosphoproteome. To connect TF phosphorylation with transcription, we generated a parallel dataset of nascent RNA and used in silico promoter analysis to identify transcriptional regulators with binding site enrichment among the LPS-regulated gene set.
After establishing SILAC conditions for efficient labelling of primary bone marrow-derived macrophages, 1850 phosphoproteins with a total of 6956 phosphorylation sites were reproducibly identified in two independent experiments. Phosphoproteins were detected from all cellular compartments, with a clear enrichment for nuclear and cytoskeleton-associated proteins. LPS caused major regulation of a large fraction of phosphopeptides, with 24% of all sites up-regulated and 9% down-regulated after stimulation (Figure 3A and B). These changes were highly dynamic, as the majority of the regulated phosphopeptides were up-regulated or down-regulated transiently or in a delayed manner (Figure 3C). Overall, the extent of changes in the phosphoproteome was comparable to the transcriptional re-programming, underscoring the importance of phosphorylation cascades in TLR signalling. Our parallel transcriptome data also showed that widespread phosphorylation precedes massive transcriptional changes.
To obtain footprints of kinase activation in response to TLR ligation, we searched phosphopeptide sequences for known linear sequence motifs of 33 kinases and identified kinase motifs enriched among LPS-regulated phosphorylation sites (compared to non-regulated phosphorylation sites) (Table I). Motif ERK/MAPK was highly enriched, in accordance with the essential role of the MAPK module in TLR signalling. Other kinases with motif enrichment have also recently been linked to TLR signalling (e.g. PKD; AKT and its targets GSK3 and mTOR). However, the DNA damage-activated kinases ATM/ATR and the cell cycle-associated kinases AURORA and CHK1/2 have not been associated with the macrophage response to TLR activation yet. These findings shed new light on older data on the effect of TLR on macrophage proliferation in response to macrophage colony stimulating factor. Of interest, in follow-up experiments using pharmacological inhibitors of the kinases with motif enrichment, we observed that inhibition of ATM kinase activity caused increased LPS-induced expression of several cytokines and chemokines, suggesting that this pathway regulates inflammatory responses.
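The comparison of motif frequency in regulated versus non-regulated sites can be sketched as a one-sided hypergeometric (Fisher-type) test. This is an illustration only: the text above does not say which statistic was used, and the simplified proline-directed pattern `[ST]P`, the toy peptide windows and the function names are assumptions made for the example, not data or code from the study.

```python
import re
from math import comb

def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) when drawing `draws` items without replacement from a
    population that contains `successes` motif-bearing items."""
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return tail / total

def motif_enrichment(regulated, unregulated, pattern):
    """Return (hits in regulated set, one-sided enrichment p-value)."""
    motif = re.compile(pattern)
    # Each window is assumed to start at the phosphorylated residue.
    hits_reg = sum(1 for w in regulated if motif.match(w))
    hits_unreg = sum(1 for w in unregulated if motif.match(w))
    population = len(regulated) + len(unregulated)
    p = hypergeom_sf(hits_reg, population, hits_reg + hits_unreg, len(regulated))
    return hits_reg, p

# Hypothetical toy windows, not real phosphopeptides.
regulated_sites = ["SPKA", "TPLR", "SPEE", "SAKL"]
unregulated_sites = ["SGKA", "TALR", "SPEE", "YDKL", "SLLL", "TKKK"]
hits, p_value = motif_enrichment(regulated_sites, unregulated_sites, r"[ST]P")
```

With 33 kinase motifs tested, a real analysis would also correct the resulting p-values for multiple testing.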
In further bioinformatic analyses, the Gene Ontology and signalling pathway annotations of phosphoproteins were used to identify signalling pathways and cellular processes targeted by TLR4-controlled phosphorylation (Table II). Among the expected hits, based on the known TLR pathways, were TLR signalling, MAPK and AKT as well as mTOR signalling. Of interest, the annotation terms ‘Rho GTPase cycle' and ‘cytoskeleton' were significantly enriched among LPS-regulated phosphoproteins, indicating a more prominent role for cytoskeletal proteins in the transduction of TLR signals or in the biological response to it.
We were especially interested in the phosphorylation of TFs and its regulation by LPS (Figure 6A). We hypothesised that functionally important TFs should have an increased frequency of binding sites in the promoters of LPS-regulated genes (Figure 6B). To identify transcriptionally regulated genes with high sensitivity, we isolated nascent RNA after metabolic labelling (Figure 6C–E). In silico promoter scanning using Genomatix software for binding sites for all 50 TF families with phosphorylated members was used to test for enrichment in transcriptionally induced genes (Figure 6F). At the early time point, binding site enrichment for the canonical TLR-associated TF NFkB was detected, and in addition we found that several other TF families with an established role in the transcription of individual LPS-target genes showed binding site enrichment (CEBP, MEF2, NFAT and HEAT). In addition, enrichment for OCT and HOXC binding sites at the early time point and SORY matrices later after stimulation indicated an involvement of the phosphorylated members of the respective TF families in the execution of TLR-induced transcriptional responses. An initial test of the function for a few of these candidate transcriptional regulators was performed using siRNA knockdown in primary macrophages. These experiments suggested that knockdown of the SORY binding phosphoprotein Capicua homolog (Cic) and to a lesser extent of the CREB family member Atf7 selectively attenuates LPS-induced expression of Il1a and Il1b.
In summary, this study provides a novel and global perspective on innate immune activation by TLR signalling (Figure 5). We quantitatively detected a large number of previously unknown site-specific phosphorylation events, which are now publicly available through the Phosida database. By combining different data mining approaches, we consistently identified canonical and newly implicated TLR-activated signalling modules. In particular, the PI3K/AKT and the related mTOR pathway were highlighted; furthermore, DNA damage–response associated ATM/ATR kinases and the cytoskeleton emerged as unexpected hotspots for phosphorylation. Finally, weaving together corresponding phosphoproteome and nascent transcriptome datasets through the loom of in silico promoter analysis, we identified TFs with a likely role in mediating TLR-induced gene expression programmes.
Recognition of microbial danger signals by toll-like receptors (TLR) causes re-programming of macrophages. To investigate kinase cascades triggered by the TLR4 ligand lipopolysaccharide (LPS) on systems level, we performed a global, quantitative and kinetic analysis of the phosphoproteome of primary macrophages using stable isotope labelling with amino acids in cell culture, phosphopeptide enrichment and high-resolution mass spectrometry. In parallel, nascent RNA was profiled to link transcription factor (TF) phosphorylation to TLR4-induced transcriptional activation. We reproducibly identified 1850 phosphoproteins with 6956 phosphorylation sites, two thirds of which were not reported earlier. LPS caused major dynamic changes in the phosphoproteome (24% up-regulation and 9% down-regulation). Functional bioinformatic analyses confirmed canonical players of the TLR pathway and highlighted other signalling modules (e.g. mTOR, ATM/ATR kinases) and the cytoskeleton as hotspots of LPS-regulated phosphorylation. Finally, weaving together phosphoproteome and nascent transcriptome data by in silico promoter analysis, we implicated several phosphorylated TFs in primary LPS-controlled gene expression.
PMCID: PMC2913394  PMID: 20531401
macrophage; nascent RNA; phosphoproteome; SILAC; toll-like receptors
4.  Combinatorial Regulation by a Novel Arrangement of FruA and MrpC2 Transcription Factors during Myxococcus xanthus Development 
Journal of Bacteriology  2009;191(8):2753-2763.
Myxococcus xanthus is a gram-negative soil bacterium that undergoes multicellular development upon nutrient limitation. Intercellular signals control cell movements and regulate gene expression during the developmental process. C-signal is a short-range signal essential for aggregation and sporulation. C-signaling regulates the fmgA gene by a novel mechanism involving cooperative binding of the response regulator FruA and the transcription factor/antitoxin MrpC2. Here, we demonstrate that regulation of the C-signal-dependent fmgBC operon is under similar combinatorial control by FruA and MrpC2, but the arrangement of binding sites is different than in the fmgA promoter region. MrpC2 was shown to bind to a crucial cis-regulatory sequence in the fmgBC promoter region. FruA was required for MrpC and/or MrpC2 to associate with the fmgBC promoter region in vivo, and expression of an fmgB-lacZ fusion was abolished in a fruA mutant. Recombinant FruA was shown to bind to an essential regulatory sequence located slightly downstream of the MrpC2-binding site in the fmgBC promoter region. Full-length FruA, but not its C-terminal DNA-binding domain, enhanced the formation of complexes with fmgBC promoter region DNA, when combined with MrpC2. This effect was nearly abolished with fmgBC DNA fragments having a mutation in either the MrpC2- or FruA-binding site, indicating that binding of both proteins to DNA is important for enhancement of complex formation. These results are similar to those observed for fmgA, where FruA and MrpC2 bind cooperatively upstream of the promoter, except that in the fmgA promoter region the FruA-binding site is located slightly upstream of the MrpC2-binding site. Cooperative binding of FruA and MrpC2 appears to be a conserved mechanism of gene regulation that allows a flexible arrangement of binding sites and coordinates multiple signaling pathways.
PMCID: PMC2668394  PMID: 19201804
5.  Evidence of association between Nucleosome Occupancy and the Evolution of Transcription Factor Binding Sites in Yeast 
Divergence of transcription factor binding sites is considered to be an important source of regulatory evolution. The associations between transcription factor binding sites and phenotypic diversity have been investigated in many model organisms. However, the understanding of other factors that contribute to it is still limited. Recent studies have elucidated the effect of chromatin structure on molecular evolution of genomic DNA. Though the profound impact of nucleosome positions on gene regulation has been reported, their influence on transcriptional evolution is still less explored. With the availability of genome-wide nucleosome map in yeast species, it is thus desirable to investigate their impact on transcription factor binding site evolution. Here, we present a comprehensive analysis of the role of nucleosome positioning in the evolution of transcription factor binding sites.
We compared the transcription factor binding site frequency in nucleosome occupied regions and nucleosome depleted regions in promoters of old (orthologs among Saccharomycetaceae) and young (Saccharomyces specific) genes; and in duplicate gene pairs. We demonstrated that nucleosome occupied regions accommodate greater binding site variations than nucleosome depleted regions in young genes and in duplicate genes. This finding was confirmed by measuring the difference in evolutionary rates of binding sites in sensu stricto yeasts at nucleosome occupied regions and nucleosome depleted regions. The binding sites at nucleosome occupied regions exhibited a consistently higher evolution rate than those at nucleosome depleted regions, corroborating the difference in the selection constraints at the two regions. Finally, through site-directed mutagenesis experiment, we found that binding site gain or loss events at nucleosome depleted regions may cause more expression differences than those in nucleosome occupied regions.
Our study indicates the existence of different selection constraint on binding sites at nucleosome occupied regions than at the nucleosome depleted regions. We found that the binding sites have a different rate of evolution at nucleosome occupied and depleted regions. Finally, using transcription factor binding site-directed mutagenesis experiment, we confirmed the difference in the impact of binding site changes on expression at these regions. Thus, our work demonstrates the importance of composite analysis of chromatin and transcriptional evolution.
PMCID: PMC3124427  PMID: 21627806
6.  The ets-Related Transcription Factor GABP Directs Bidirectional Transcription 
PLoS Genetics  2007;3(11):e208.
Approximately 10% of genes in the human genome are distributed such that their transcription start sites are located less than 1 kb apart on opposite strands. These divergent gene pairs have a single intergenic segment of DNA, which in some cases appears to share regulatory elements, but it is unclear whether these regions represent functional bidirectional promoters or two overlapping promoters. A recent study showed that divergent promoters are enriched for consensus binding sequences of a small group of transcription factors, including the ubiquitous ets-family transcription factor GA-binding protein (GABP). Here we show that GABP binds to more than 80% of divergent promoters in at least one cell type. Furthermore, we demonstrate that GABP binding is correlated and associated with bidirectional transcriptional activity in a luciferase transfection assay. In addition, we find that the addition of a strict consensus GABP site into a set of promoters that normally function in only one direction significantly increases activity in the opposite direction in 67% of cases. Our findings demonstrate that GABP regulates the majority of divergent promoters and suggest that bidirectional transcriptional activity is mediated through GABP binding and transactivation at both divergent and nondivergent promoters.
Author Summary
Surveys of the locations of genes in the human genome have revealed that a surprising number of genes, greater than 10%, have transcription start sites within 1 kb of one another on opposite strands. These divergent gene pairs, sometimes referred to as bidirectional genes, are common in organisms such as bacteria and yeast, but it is unknown why such an arrangement exists in large, mammalian genomes. Recently, it has become apparent that the promoters of these divergent genes are regulated by a subset of transcription factors, and we have focused on one of these, GA-binding protein (GABP). We find that it regulates a large number of human genes, including the majority of divergent genes, and that its binding is associated with, correlated with, and sufficient for bidirectional transcriptional activity. Although clearly GABP is a major regulator of divergent genes, which carry out a variety of roles critical for the function and survival of the cell, these data also propose novel roles for GABP as a transcription factor. For example, the ability of GABP to promote bidirectional transcription may prove to be biologically relevant in generating many of the transcripts that have been observed outside of protein coding genes.
PMCID: PMC2077898  PMID: 18020712
7.  A Graphical Modelling Approach to the Dissection of Highly Correlated Transcription Factor Binding Site Profiles 
PLoS Computational Biology  2012;8(11):e1002725.
Inferring the combinatorial regulatory code of transcription factors (TFs) from genome-wide TF binding profiles is challenging. A major reason is that TF binding profiles significantly overlap and are therefore highly correlated. Clustered occurrence of multiple TFs at genomic sites may arise from chromatin accessibility and local cooperation between TFs, or binding sites may simply appear clustered if the profiles are generated from diverse cell populations. Overlaps in TF binding profiles may also result from measurements taken at closely related time intervals. It is thus of great interest to distinguish TFs that directly regulate gene expression from those that are indirectly associated with gene expression. Graphical models, in particular Bayesian networks, provide a powerful mathematical framework to infer different types of dependencies. However, existing methods do not perform well when the features (here: TF binding profiles) are highly correlated, when their association with the biological outcome is weak, and when the sample size is small. Here, we develop a novel computational method, the Neighbourhood Consistent PC (NCPC) algorithms, which deal with these scenarios much more effectively than existing methods do. We further present a novel graphical representation, the Direct Dependence Graph (DDGraph), to better display the complex interactions among variables. NCPC and DDGraph can also be applied to other problems involving highly correlated biological features. Both methods are implemented in the R package ddgraph, available as part of Bioconductor. Applied to real data, our method identified TFs that specify different classes of cis-regulatory modules (CRMs) in Drosophila mesoderm differentiation. 
Our analysis also found depletion of the early transcription factor Twist binding at the CRMs regulating expression in visceral and somatic muscle cells at later stages, which suggests a CRM-specific repression mechanism that so far has not been characterised for this class of mesodermal CRMs.
Author Summary
Transcription factors (TFs) are proteins that bind to DNA and regulate gene expression. Recent technological advances make it possible to map TF binding patterns across the whole genome. Multiple single-gene studies showed that combinatorial binding of multiple transcription factors determines the gene transcriptional output. A common naive assumption is that correlated binding profiles may indicate combinatorial binding. However, it has been found that many TFs bind to distinct hotspots whose role is currently unclear. It is thus of great interest to find transcription factor combinations whose correlated binding is causally most immediate to gene expression. Building upon theories of statistical dependence and causality, we develop novel graphical model-based algorithms that handle highly correlated transcription factor binding profiles more efficiently and reliably than existing algorithms do. These algorithms can also be applied to other biological areas involving highly correlated variables, such as the analysis of high-throughput gene knock-down experiments.
PMCID: PMC3493460  PMID: 23144600
8.  Whole-Genome Cartography of Estrogen Receptor α Binding Sites 
PLoS Genetics  2007;3(6):e87.
Using a chromatin immunoprecipitation-paired end diTag cloning and sequencing strategy, we mapped estrogen receptor α (ERα) binding sites in MCF-7 breast cancer cells. We identified 1,234 high confidence binding clusters of which 94% are projected to be bona fide ERα binding regions. Only 5% of the mapped estrogen receptor binding sites are located within 5 kb upstream of the transcriptional start sites of adjacent genes, regions containing the proximal promoters, whereas the vast majority of the sites are mapped to intronic or distal locations (>5 kb from 5′ and 3′ ends of adjacent transcript), suggesting transcriptional regulatory mechanisms over significant physical distances. Of all the identified sites, 71% harbored putative full estrogen response elements (EREs), 25% bore ERE half sites, and only 4% had no recognizable ERE sequences. Genes in the vicinity of ERα binding sites were enriched for regulation by estradiol in MCF-7 cells, and their expression profiles in patient samples segregate ERα-positive from ERα-negative breast tumors. The expression dynamics of the genes adjacent to ERα binding sites suggest a direct induction of gene expression through binding to ERE-like sequences, whereas transcriptional repression by ERα appears to be through indirect mechanisms. Our analysis also indicates a number of candidate transcription factor binding sites adjacent to occupied EREs at frequencies much greater than by chance, including the previously reported FOXA1 sites, and demonstrates the potential involvement of one such putative adjacent factor, Sp1, in the global regulation of ERα target genes. Unexpectedly, we found that only 22%–24% of the bona fide human ERα binding sites were overlapping conserved regions in whole genome vertebrate alignments, which suggests limited conservation of functional binding sites. Taken together, this genome-scale analysis suggests complex but definable rules governing ERα binding and gene regulation.
Author Summary
Estrogen receptors (ERs) play key roles in facilitating the transcriptional effects of hormone functions in target tissues. To obtain a genome-wide view of ERα binding sites, we applied chromatin immunoprecipitation coupled with a cloning and sequencing strategy using chromatin immunoprecipitation pair end-tagging technology to map ERα binding sites in MCF-7 human breast cancer cells. We identified 1,234 high quality ERα binding sites in the human genome and demonstrated that the binding sites are frequently adjacent to genes significantly associated with breast cancer disease status and outcome. The mapping results also revealed that ERα can influence gene expression across distances of up to 100 kilobases or more, that genes that are induced or repressed utilize sites in different regions relative to the transcript (suggesting different mechanisms of action), and that ERα binding sites are only modestly conserved in evolution. Using computational approaches, we identified potential interactions with other transcription factor binding sites adjacent to the ERα binding elements. Taken together, these findings suggest complex but definable rules governing ERα binding and gene regulation and provide a valuable dataset for mapping the precise control nodes for one of the most important nuclear hormone receptors in breast cancer biology.
PMCID: PMC1885282  PMID: 17542648
9.  Wide-Scale Analysis of Human Functional Transcription Factor Binding Reveals a Strong Bias towards the Transcription Start Site 
PLoS ONE  2007;2(8):e807.
Transcription factors (TFs) regulate expression by binding to specific DNA sequences. A binding event is functional when it affects gene expression. Functionality of a binding site is reflected in conservation of the binding sequence during evolution and in overrepresented binding in gene groups with coherent biological functions. Functionality is governed by several parameters such as the TF-DNA binding strength, distance of the binding site from the transcription start site (TSS), DNA packing, and more. Understanding how these parameters control functionality of different TFs in different biological contexts is a must for identifying functional TF binding sites and for understanding regulation of transcription.
Methodology/Principal Findings
We introduce a novel method to screen the promoters of a set of genes with shared biological function (obtained from the functional Gene Ontology (GO) classification) against a precompiled library of motifs, and find those motifs which are statistically over-represented in the gene set. More than 8,000 human (and 23,000 mouse) genes were assigned to one of 134 GO sets. Their promoters were searched (from 200 bp downstream to 1000 bp upstream of the TSS) for 414 known DNA motifs. We optimized the sequence similarity score threshold, independently for every location window, taking into account nucleotide heterogeneity along the promoters of the target genes. The method, combined with binding sequence and location conservation between human and mouse, identifies with high probability functional binding sites for groups of functionally-related genes. We found many location-sensitive functional binding events and showed that they clustered close to the TSS. Our method and findings were tested experimentally.
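The idea of a motif being "statistically over-represented in the gene set" can be sketched with a standard upper-tail hypergeometric test. This is only an illustration of the general technique: the abstract does not say which statistic the authors used, and the function name, parameter names, and counts below are hypothetical.

```python
from math import comb

def enrichment_p_value(total_genes, genes_with_motif, set_size, set_hits):
    """Upper-tail hypergeometric probability P(X >= set_hits).

    total_genes      -- number of genes in the background
    genes_with_motif -- background genes whose promoter contains the motif
    set_size         -- number of genes in the functional (e.g. GO) set
    set_hits         -- genes in the set whose promoter contains the motif
    """
    max_hits = min(genes_with_motif, set_size)
    denom = comb(total_genes, set_size)
    tail = sum(
        comb(genes_with_motif, k)
        * comb(total_genes - genes_with_motif, set_size - k)
        for k in range(set_hits, max_hits + 1)
    )
    return tail / denom

# Hypothetical counts, chosen only for illustration: 8,000 background genes,
# 400 with the motif, and a 60-gene set in which 12 promoters carry it.
p = enrichment_p_value(8000, 400, 60, 12)
print(f"p = {p:.3g}")
```

In a screen over many motifs and gene sets, such p-values would additionally need a multiple-testing correction; that step is left out of this sketch.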
We identified reliably functional TF binding sites. This is an essential step towards constructing regulatory networks. The promoter region proximal to the TSS is of central importance for regulation of transcription in human and mouse, just as it is in bacteria and yeast.
PMCID: PMC1950076  PMID: 17726537
10.  Evolution and Selection in Yeast Promoters: Analyzing the Combined Effect of Diverse Transcription Factor Binding Sites 
Comparative genomics jointly analyzes evolutionarily related species in order to identify conserved and diverged sequences and to infer their function. While such studies enabled the detection of conserved sequences in large genomes, the evolutionary dynamics of regulatory regions as a whole remain poorly understood. Here we present a probabilistic model for the evolution of promoter regions in yeast, combining the effects of regulatory interactions of many different transcription factors. The model expresses explicitly the selection forces acting on transcription factor binding sites in the context of a dynamic evolutionary process. We develop algorithms to compute likelihood and to learn de novo collections of transcription factor binding motifs and their selection parameters from alignments. Using the new techniques, we examine the evolutionary dynamics in Saccharomyces species promoters. Analyses of an evolutionary model constructed using all known transcription factor binding motifs and of a model learned from the data automatically reveal relatively weak selection on most binding sites. Moreover, according to our estimates, strong binding sites are constraining only a fraction of the yeast promoter sequence that is under selection. Our study demonstrates how complex evolutionary dynamics in noncoding regions emerge from the formalization of the evolutionary consequences of known regulatory mechanisms.
Author Summary
Cells use sophisticated regulation to transform static genomic information into flexible function. We are still far from understanding how such regulation evolves. Short DNA sequences that physically bind transcription factors in promoter areas near target genes play an important role in gene regulation and are directly subject to mutation and selection. In this work, we develop a methodology for studying the evolution of promoter sequences under the effect of multiple regulatory interactions. We present a model that describes the evolutionary process at each genomic locus, taking into account a random flux of mutations that occur in it and the effects of the gain or loss of transcription factor binding sites. Our model accounts for dependencies (or epistasis) between adjacent loci that contribute to the same regulatory interactions: mutation in one such locus immediately changes the effect of mutations in the other. Using our model, we characterize the evolution of promoters in yeast, showing that many regulatory interactions that were discovered experimentally or computationally are evolutionarily unstable. The dynamic nature of transcriptional interactions may be explained if the regulatory phenotype is achieved through multiple interactions at different levels of specificity, and if only relatively few of these interactions are essential on their own.
PMCID: PMC2186363  PMID: 18193940
11.  Redundancy and the Evolution of Cis-Regulatory Element Multiplicity 
PLoS Computational Biology  2010;6(7):e1000848.
The promoter regions of many genes contain multiple binding sites for the same transcription factor (TF). One possibility is that this multiplicity evolved through transitional forms showing redundant cis-regulation. To evaluate this hypothesis, we must disentangle the relative contributions of different evolutionary mechanisms to the evolution of binding site multiplicity. Here, we attempt to do this using a model of binding site evolution. Our model considers binding sequences and their interactions with TFs explicitly, and allows us to cast the evolution of gene networks into a neutral network framework. We then test some of the model's predictions using data from yeast. Analysis of the model suggested three candidate nonadaptive processes favoring the evolution of cis-regulatory element redundancy and multiplicity: neutral evolution in long promoters, recombination and TF promiscuity. We find that recombination rate is positively associated with binding site multiplicity in yeast. Our model also indicated that weak direct selection for multiplicity (partial redundancy) can play a major role in organisms with large populations. Our data suggest that selection for changes in gene expression level may have contributed to the evolution of multiple binding sites in yeast. We conclude that the evolution of cis-regulatory element redundancy and multiplicity is impacted by many aspects of the biology of an organism: both adaptive and nonadaptive processes, both changes in cis to binding sites and in trans to the TFs that interact with them, both the functional setting of the promoter and the population genetic context of the individuals carrying them.
Author Summary
TFs regulate gene expression by binding to specific sequences in the promoter regions of their target genes. Promoters often contain multiple copies of the same TF binding sites. How does this multiplicity evolve? One possibility is that individuals with multiple, redundant binding sites have higher fitness. However, nonadaptive processes are also likely to be important. Here, we develop a mathematical model of the evolution of TF binding sites to help us disentangle how different evolutionary mechanisms contribute to the evolution of binding site redundancy and multiplicity. We show that recombination is expected to promote the evolution of multiple binding sites. This prediction is corroborated by genome-wide data from yeast. Another important factor in the evolution of multiplicity predicted in our analysis is TF promiscuity, that is, the ability of a TF to bind to multiple sequences. In addition, our analysis indicated that direct selection can have large effects on the evolution of redundancy and multiplicity. Data from yeast identified selection for changes in expression level as a candidate mechanism for the evolution of multiple binding sites. We conclude that, although selection may play a major role in the evolution of multiplicity in regulatory regions, nonadaptive forces can also lead to high levels of multiplicity.
PMCID: PMC2900288  PMID: 20628617
12.  Evolutionary conservation of zinc finger transcription factor binding sites in promoters of genes co-expressed with WT1 in prostate cancer 
BMC Genomics  2008;9:337.
Gene expression analyses have led to a better understanding of growth control of prostate cancer cells. We and others have identified the presence of several zinc finger transcription factors in the neoplastic prostate, suggesting a potential role for these genes in the regulation of the prostate cancer transcriptome. One of the transcription factors (TFs) identified in the prostate cancer epithelial cells was the Wilms tumor gene (WT1). To rapidly identify coordinately expressed prostate cancer growth control genes that may be regulated by WT1, we used an in silico approach.
Evolutionarily conserved transcription factor binding sites (TFBS) recognized by WT1, EGR1, SP1, SP2, AP2 and GATA1 were identified in the promoters of 24 differentially expressed prostate cancer genes from eight mammalian species. To test the relationship between sequence conservation and function, chromatin of LNCaP prostate cancer and kidney 293 cells was tested for TF binding using chromatin immunoprecipitation (ChIP). Multiple putative TFBS in gene promoters of placental mammals were found to be shared with those in human gene promoters and some were conserved between genomes that diverged about 170 million years ago (i.e., primates and marsupials), therefore implicating these sites as candidate binding sites. Among those genes coordinately expressed with WT1 was the kallikrein-related peptidase 3 (KLK3) gene commonly known as the prostate specific antigen (PSA) gene. This analysis located several potential WT1 TFBS in the PSA gene promoter and led to the rapid identification of a novel putative binding site confirmed in vivo by ChIP. Conversely, for two prostate growth control genes, androgen receptor (AR) and vascular endothelial growth factor (VEGF), known to be transcriptionally regulated by WT1, regulatory sequence conservation was observed and TF binding in vivo was confirmed by ChIP.
Overall, this targeted approach rapidly identified important candidate WT1-binding elements in genes coordinately expressed with WT1 in prostate cancer cells, thus enabling a more focused functional analysis of the most likely target genes in prostate cancer progression. Identifying these genes will help to better understand how gene regulation is altered in these tumor cells.
PMCID: PMC2515153  PMID: 18631392
13.  Niche adaptation by expansion and reprogramming of general transcription factors 
Experimental analysis of TFB family proteins in a halophilic archaeon reveals complex environment-dependent fitness contributions. Gene conversion events among these proteins can generate novel niche adaptation capabilities, a process that may have contributed to archaeal adaptation to extreme environments.
Evolution of archaeal lineages correlates with duplication events in the TFB family. Each TFB is required for adaptation to multiple environments. The relative fitness contributions of TFBs change with environmental context. Changes in the regulation of duplicated TFBs can generate new adaptation capabilities.
The evolutionary success of an organism depends on its ability to continually adapt to changes in the patterns of constant, periodic, and transient challenges within its environment. This process of ‘niche adaptation’ requires reprogramming of the organism's environmental response networks by reorganizing interactions among diverse parts including environmental sensors, signal transducers, and transcriptional and post-transcriptional regulators. Gene duplications have been discovered to be one of the principal strategies in this process, especially for reprogramming of gene regulatory networks (GRNs). Whereas eukaryotes require dozens of factors for recruitment of RNA polymerase, archaea require just two general transcription factors (GTFs) that are orthologous to eukaryotic TFIIB (TFB in archaea) and TATA-binding protein (TBP) (Bell et al, 1998). Both of these GTFs have expanded extensively in nearly 50% of all archaea whose genomes have been fully sequenced. The phylogenetic analysis presented in this study reveals lineage-specific expansions of TFBs, suggesting that they might encode functionally specialized gene regulatory programs for the unique environments to which these organisms have adapted. This hypothesis is particularly appealing when we consider that the greatest expansion is observed within the group of halophilic archaea whose habitats are associated with routine and dynamic changes in a number of environmental factors including light, temperature, oxygen, salinity, and ionic composition (Rodriguez-Valera, 1993; Litchfield, 1998).
We have previously demonstrated that variations in the expanded set of TFBs (a through e) in Halobacterium salinarum NRC-1 manifest at the level of physical interactions within and across the two families, their DNA-binding specificity, their differential regulation in varying environments, and, ultimately, in the large-scale segregation of transcription of all genes into overlapping yet distinct sets of functionally related groups (Facciotti et al, 2007). We have extended findings from this earlier study with a systematic survey of the fitness consequences of perturbing the TFB network of H. salinarum NRC-1 across 17 environments. Notably, each TFB conferred fitness in two or more environmental conditions tested, and the relative fitness contributions (see Table I) of the five TFBs varied significantly by environment. From an evolutionary perspective, the relationships among these fitness landscapes reveal that two classes of TFBs (c/g- and f-type) appear to have played an important role in the evolution of halophilic archaea by overseeing regulation of core physiological capabilities in these organisms. TFBs of the other clades (b/d and a/e) seem to have emerged much more recently through gene duplications or horizontal gene transfers (HGTs) and are being utilized for adaptation to specialized environmental conditions.
We also investigated higher-order functional interactions and relationships among the duplicated TFBs by performing competition experiments and by mapping genetic interactions in different environments. This demonstrated that depending on environmental context, the TFBs have strikingly different functional hierarchies and genetic interactions with one another. This is remarkable as it makes each TFB essential albeit at different times in a dynamically changing environment.
In order to understand the process by which such gene family expansions shape architecture and functioning of a GRN, we performed integrated analysis of phylogeny, physical interactions, regulation, and fitness landscapes of the seven TFBs in H. salinarum NRC-1. This revealed that evolution of both their protein-coding sequence and their promoter has been instrumental in the encoding of environment-specific regulatory programs. Importantly, the convergent and divergent evolution of regulation and binding properties of TFBs suggested that, aside from HGT and random mutations, a third plausible (and perhaps most interesting) mechanism for acquiring a novel TFB variant is through gene conversion. To test this hypothesis, we synthesized a novel TFBx by transferring TFBa/e clade-specific residues to a TFBd backbone, transformed this variant under the control of either the TFBd or the TFBe promoter (PtfbD or PtfbE) into three different host genetic backgrounds (Δura3 (parent), ΔtfbD, and ΔtfbE), and analyzed fitness and gene expression patterns during growth at 25 and 37°C. This showed that gene conversion events spanning the coding sequence and the promoter, environmental context, and genetic background of the host are all extremely influential in the functional integration of a TFB into the GRN. Importantly, this analysis suggested that altering the regulation of an existing set of expanded TFBs might be an efficient mechanism to reprogram the GRN to rapidly generate novel niche adaptation capability. We have confirmed this experimentally by increasing fitness merely by moving tfbE to PtfbD control, and by generating a completely novel phenotype (biofilm-like appearance) by overexpression of tfbE.
Altogether this study clearly demonstrates that archaea can rapidly generate novel niche adaptation programs by simply altering regulation of duplicated TFBs. This is significant because expansions in the TFB family are widespread in archaea, a class of organisms that not only represent 20% of biomass on earth but are also known to have colonized some of the most extreme environments (DeLong and Pace, 2001). This strategy for niche adaptation is further expanded through interactions of the multiple TFBs with members of other expanded TF families such as TBPs (Facciotti et al, 2007) and sequence-specific regulators (e.g. Lrp family (Peeters and Charlier, 2010)). This is analogous to combinatorial solutions for other complex biological problems such as recognition of pathogens by Toll-like receptors (Roach et al, 2005), generation of antibody diversity by V(D)J recombination (Early et al, 1980), and recognition and processing of odors (Malnic et al, 1999).
Numerous lineage-specific expansions of the transcription factor B (TFB) family in archaea suggest an important role for expanded TFBs in encoding environment-specific gene regulatory programs. Given the characteristics of hypersaline lakes, the unusually large numbers of TFBs in halophilic archaea further suggest that they might be especially important in rapid adaptation to the challenges of a dynamically changing environment. Motivated by these observations, we have investigated the implications of TFB expansions by correlating sequence variations, regulation, and physical interactions of all seven TFBs in Halobacterium salinarum NRC-1 to their fitness landscapes, functional hierarchies, and genetic interactions across 2488 experiments covering combinatorial variations in salt, pH, temperature, and Cu stress. This systems analysis has revealed an elegant scheme in which completely novel fitness landscapes are generated by gene conversion events that introduce subtle changes to the regulation or physical interactions of duplicated TFBs. Based on these insights, we have introduced a synthetically redesigned TFB and altered the regulation of existing TFBs to illustrate how archaea can rapidly generate novel phenotypes by simply reprogramming their TFB regulatory network.
PMCID: PMC3261711  PMID: 22108796
evolution by gene family expansion; fitness; niche adaptation; reprogramming of gene regulatory network; transcription factor B
14.  The Fitness Landscapes of cis-Acting Binding Sites in Different Promoter and Environmental Contexts 
PLoS Genetics  2010;6(7):e1001042.
The biophysical nature of the interaction between a transcription factor and its target sequences in vitro is sufficiently well understood to allow for the effects of DNA sequence alterations on affinity to be predicted. But even in relatively simple in vivo systems, the complexities of promoter organization and activity have made it difficult to predict how altering specific interactions between a transcription factor and DNA will affect promoter output. To better understand this, we measured the relative fitness of nearly all Escherichia coli −35 binding sites in different promoter and environmental contexts by competing four randomized promoter libraries controlling the expression of the tetracycline resistance gene (tet) against each other in increasing concentrations of drug. We sequenced populations after competition to determine the relative enrichment of each −35 sequence. We observed a consistent relationship between the frequency of recovery of each −35 binding site and its predicted affinity, a relationship that varied depending on the sequence context of the promoter and drug concentration. Overall the relative fitness of each promoter could be predicted by a simple thermodynamic model of transcriptional regulation, in which the rate of transcriptional initiation (and hence fitness) is dependent upon the overall stability of the initiation complex, which in turn is dependent upon the energetic contributions of all sites within the complex. As implied by this model, a decrease in the free energy of association at one site could be compensated for by an increase in the binding energy at another to produce a similar output. Furthermore, these data show that a large and continuous range of transcriptional outputs can be accessed by merely changing the −35 sequence, suggesting that evolved or engineered mutations at this site could allow for subtle and precise control over gene expression.
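The compensation argument in the thermodynamic model described above can be illustrated with a small sketch. It assumes, purely for illustration, that the relative initiation rate follows a Boltzmann weight of the summed per-site free energies; the abstract does not give the model's functional form, and the energy values, the constant, and the function name below are hypothetical.

```python
import math

# Thermal energy R*T in kcal/mol at roughly 25 degrees C (assumed value).
RT_KCAL_PER_MOL = 0.593

def relative_initiation_rate(site_energies):
    """Relative rate from summed free energies of association (kcal/mol).

    Convention assumed here: more negative energy means a more stable
    complex, so the rate grows as the total becomes more negative.
    """
    total = sum(site_energies)
    return math.exp(-total / RT_KCAL_PER_MOL)

# Two hypothetical promoters: the first has a stronger site paired with a
# weaker one, the second has two intermediate sites. The totals match,
# so the predicted outputs match.
strong_weak = relative_initiation_rate([-5.0, -3.0])
balanced = relative_initiation_rate([-4.0, -4.0])
print(math.isclose(strong_weak, balanced))
```

Because only the sum of the energies enters the exponent, weakening one site by a given amount and strengthening another by the same amount leaves the predicted output unchanged in this sketch.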
Author Summary
A major challenge in molecular genetics has been to understand how cis-regulatory information is integrated to determine the amount of transcript generated. The difficulty has been that there are a large number of variables (known and unknown) that combine through an extensive array of possible mechanisms. Differences in the affinity of a binding site for its cognate binder within the initiation complex are known to account for significant differences in promoter output, but data for the activity of binding site variants in vivo have been limited. Here, we were able to map the fitness of nearly all E. coli binding sites in multiple promoter and environmental contexts using a novel method that utilizes the sequencing power of a next generation DNA sequencer. These data for the first time show the phenotypic range and continuity of a nearly complete set of possible binding targets in vivo, and they improve our ability to understand the mechanism, evolution, and designability of gene regulation.
PMCID: PMC2912393  PMID: 20686658
15.  Comprehensive Annotation of Bidirectional Promoters Identifies Co-Regulation among Breast and Ovarian Cancer Genes 
PLoS Computational Biology  2007;3(4):e72.
A “bidirectional gene pair” comprises two adjacent genes whose transcription start sites are neighboring and directed away from each other. The intervening regulatory region is called a “bidirectional promoter.” These promoters are often associated with genes that function in DNA repair, with the potential to participate in the development of cancer. No connection between these gene pairs and cancer has been previously investigated. Using the database of spliced-expressed sequence tags (ESTs), we identified the most complete collection of human transcripts under the control of bidirectional promoters. A rigorous screen of the spliced EST data identified new bidirectional promoters, many of which functioned as alternative promoters or regulated novel transcripts. Additionally, we show a highly significant enrichment of bidirectional promoters in genes implicated in somatic cancer, including a substantial number of genes implicated in breast and ovarian cancers. The repeated use of this promoter structure in the human genome suggests it could regulate co-expression patterns among groups of genes. Using microarray expression data from 79 human tissues, we verify regulatory networks among genes controlled by bidirectional promoters. Subsets of these promoters contain similar combinations of transcription factor binding sites, including evolutionarily conserved ETS factor binding sites in ERBB2, FANCD2, and BRCA2. Interpreting the regulation of genes involved in co-expression networks, especially those involved in cancer, will be an important step toward defining molecular events that may contribute to disease.
Author Summary
Promoters are regulatory regions that control transcription of genes. A special class of promoters, known as bidirectional promoters, regulates expression of two genes instead of one. These promoters are situated between two adjacent genes whose transcription start sites are physically within 1,000 bp and oriented in opposite directions. Bidirectional promoters are found repeatedly in the genome, suggesting an important biological significance for this regulatory configuration. We developed an algorithm to map bidirectional promoters using data from a comprehensive list of transcribed sequences known as expressed sequence tags, or ESTs. This approach increased the number of previously characterized bidirectional promoters by 300%. Included in the new data are bidirectional promoters that regulate expression of genes implicated in somatic cancers. For instance, ten well-recognized genes implicated in breast and ovarian cancers were identified as having bidirectional promoters. Three of the genes are further related by having duplicate copies of the same binding site for a transcription factor within their bidirectional promoters. These binding sites are conserved among species, providing greater evidence that they are functionally important. This example, in which similar regulatory structures are used to control genes involved in cancer, illustrates how data can be mined from the comprehensive set of bidirectional promoters. Within this manuscript, we show statistical evidence that many cancer genes are regulated by bidirectional promoters. These promoters will be a valuable dataset for studying the role of gene regulation in tumor development.
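The definition given above (adjacent genes, transcription start sites within 1,000 bp, transcribed away from each other) can be turned into a minimal sketch. This is not the authors' EST-based algorithm; the `Gene` record, the function name, and the coordinates are hypothetical, and adjacency is approximated by sorting TSS positions on each chromosome.

```python
from typing import NamedTuple

class Gene(NamedTuple):
    name: str
    chrom: str
    strand: str  # "+" or "-"
    tss: int     # transcription start site coordinate

def bidirectional_pairs(genes, max_gap=1000):
    """Return (minus_gene, plus_gene) name pairs that diverge within max_gap bp."""
    by_chrom = {}
    for g in genes:
        by_chrom.setdefault(g.chrom, []).append(g)

    pairs = []
    for chrom_genes in by_chrom.values():
        chrom_genes.sort(key=lambda g: g.tss)
        for left, right in zip(chrom_genes, chrom_genes[1:]):
            # Divergent arrangement: the left gene is transcribed leftward
            # (minus strand) and the right gene rightward (plus strand).
            if (left.strand == "-" and right.strand == "+"
                    and right.tss - left.tss <= max_gap):
                pairs.append((left.name, right.name))
    return pairs

# Hypothetical coordinates for illustration only.
example = [
    Gene("geneA", "chr1", "-", 10_000),
    Gene("geneB", "chr1", "+", 10_400),   # diverges from geneA, 400 bp apart
    Gene("geneC", "chr1", "+", 50_000),
    Gene("geneD", "chr1", "-", 50_300),   # convergent with geneC, not counted
]
print(bidirectional_pairs(example))
```

A fuller treatment would also need to handle alternative transcripts and overlapping genes, which this sketch ignores.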
PMCID: PMC1853124  PMID: 17447839
16.  Target Gene Analysis by Microarrays and Chromatin Immunoprecipitation Identifies HEY Proteins as Highly Redundant bHLH Repressors 
PLoS Genetics  2012;8(5):e1002728.
HEY bHLH transcription factors have been shown to regulate multiple key steps in cardiovascular development. They can be induced by activated NOTCH receptors, but other upstream stimuli mediated by TGFβ and BMP receptors may elicit a similar response. While the basic and helix-loop-helix domains exhibit strong similarity, large parts of the proteins are still unique and may serve divergent functions. The striking overlap of cardiac defects in HEY2 and combined HEY1/HEYL knockout mice suggested that all three HEY genes fulfill overlapping functions in target cells. We therefore sought to identify target genes for HEY proteins by microarray expression and ChIPseq analyses in HEK293 cells, cardiomyocytes, and murine hearts. HEY proteins were found to modulate expression of their target genes to a rather limited extent, but with striking functional interchangeability between HEY factors. Chromatin immunoprecipitation revealed a much greater number of potential binding sites that again largely overlap between HEY factors. Binding sites are clustered in the proximal promoter region especially of transcriptional regulators or developmental control genes. Multiple lines of evidence suggest that HEY proteins primarily act as direct transcriptional repressors, while gene activation seems to be due to secondary or indirect effects. Mutagenesis of putative DNA binding residues supports the notion of direct DNA binding. While class B E-box sequences (CACGYG) clearly represent preferred target sequences, there must be additional and more loosely defined modes of DNA binding since many of the target promoters that are efficiently bound by HEY proteins do not contain an E-box motif. These data clearly establish the three HEY bHLH factors as highly redundant transcriptional repressors in vitro and in vivo, which explains the combinatorial action observed in different tissues with overlapping expression.
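As a small illustration of the class B E-box consensus mentioned above, the motif CACGYG can be scanned for with a regular expression, reading Y as the IUPAC code for a pyrimidine (C or T). The function name and the example sequence are hypothetical, and only the given strand is scanned.

```python
import re

# CACGYG with Y = C or T; the lookahead lets finditer report every start.
EBOX = re.compile(r"(?=(CACG[CT]G))")

def find_eboxes(seq):
    """Return (0-based position, matched hexamer) for each E-box in seq."""
    return [(m.start(), m.group(1)) for m in EBOX.finditer(seq.upper())]

# Hypothetical promoter fragment: two matches and one near-miss (CACGAG).
print(find_eboxes("ttCACGTGaaCACGCGggCACGAG"))
```

Scanning the reverse complement as well would be needed to catch CACGCG sites on the opposite strand, since only CACGTG is palindromic.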
Author Summary
NOTCH signaling is a central developmental pathway that influences a multitude of cell fate decisions and differentiation steps as well as later tissue homeostasis and regeneration. The three HEY genes encode basic helix-loop-helix transcription factors that are critical effectors to convey signaling by NOTCH receptors and similar signaling systems. This is underscored by the multitude of developmental defects observed in HEY single- and double-mutant mice. The mode of action of HEY proteins remained largely unexplored, however. By gene expression analysis and chromatin immunoprecipitation we have now identified a large set of HEY target genes. While only 500–2,000 mRNAs are regulated by HEY1 or HEY2, there are around 10,000 binding sites in the genome. HEY proteins act as transcriptional repressors that bind close to transcriptional start sites in all cases tested. In contrast, gene activation seems to be mediated by indirect/secondary mechanisms. The extent of regulation is rather limited, implicating HEY genes in modulating rather than switching on or off target gene expression. All our in vitro and in vivo data point to a high degree of redundancy between the three HEY genes, suggesting that tissue specific patterns and expression levels determine the final outcome of HEY induced cellular responses.
PMCID: PMC3355086  PMID: 22615585
17.  Genome-Wide Analysis of KAP1 Binding Suggests Autoregulation of KRAB-ZNFs 
PLoS Genetics  2007;3(6):e89.
We performed a genome-scale chromatin immunoprecipitation (ChIP)-chip comparison of two modifications (trimethylation of lysine 9 [H3me3K9] and trimethylation of lysine 27 [H3me3K27]) of histone H3 in Ntera2 testicular carcinoma cells and in three different anatomical sources of primary human fibroblasts. We found that in each of the cell types the two modifications were differentially enriched at the promoters of the two largest classes of transcription factors. Specifically, zinc finger (ZNF) genes were bound by H3me3K9 and homeobox genes were bound by H3me3K27. We have previously shown that the Polycomb repressive complex 2 is responsible for mediating trimethylation of lysine 27 of histone H3 in human cancer cells. In contrast, there is little overlap between H3me3K9 targets and components of the Polycomb repressive complex 2, suggesting that a different histone methyltransferase is responsible for the H3me3K9 modification. Previous studies have shown that SETDB1 can trimethylate H3 on lysine 9, using in vitro or artificial tethering assays. SETDB1 is thought to be recruited to chromatin by complexes containing the KAP1 corepressor. To determine if a KAP1-containing complex mediates trimethylation of the identified H3me3K9 targets, we performed ChIP-chip assays and identified KAP1 target genes using human 5-kb promoter arrays. We found that a large number of genes of ZNF transcription factors were bound by both KAP1 and H3me3K9 in normal and cancer cells. To expand our studies of KAP1, we next performed a complete genomic analysis of KAP1 binding using a 38-array tiling set, identifying ~7,000 KAP1 binding sites. The identified KAP1 targets were highly enriched for C2H2 ZNFs, especially those containing Krüppel-associated box (KRAB) domains. Interestingly, although most KAP1 binding sites were within core promoter regions, the binding sites near ZNF genes were greatly enriched within transcribed regions of the target genes. 
Because KAP1 is recruited to the DNA via interaction with KRAB-ZNF proteins, we suggest that expression of KRAB-ZNF genes may be controlled via an auto-regulatory mechanism involving KAP1.
Author Summary
Methylation of lysines 9 or 27 of histone H3 (H3me3K9 or H3me3K27, respectively) has been associated with silenced chromatin. However, a comprehensive comparison of the regions of the genome bound by these two types of modified histone H3 has not been performed. Therefore, we compared the binding patterns of H3me3K9 and H3me3K27 at ~26,000 human promoters in four different cell populations. Our studies indicated that the two marks segregate differentially with the two most common types of transcriptional regulators; H3me3K27 is highly enriched at homeobox genes and H3me3K9 is highly enriched at zinc-finger genes (ZNFs). We showed that many of the promoters bound by H3me3K9 are also bound by the corepressor KAP1. A genome-wide screen for KAP1 target genes revealed a difference in the location of KAP1 binding sites in the ZNF genes versus other targets. In general, KAP1 binding sites were localized to core promoter regions. However, KAP1 binding sites associated with ZNF genes are near the 3′ end of the coding region. Our results suggest that the KRAB-ZNF family members participate in an autoregulatory loop involving binding of the KAP1 protein to the 3′ end of the ZNF target genes, resulting in trimethylation of H3K9 and transcriptional repression.
PMCID: PMC1885280  PMID: 17542650
18.  Structure and function of the zeta-globin upstream regulatory element. 
Nucleic Acids Research  1996;24(24):4978-4986.
The human zeta-globin promoter contains a strong positive regulatory element in the 5' flanking region, designated the zeta-globin upstream regulatory element (URE). In this study, we define the minimal sequences required for URE function and characterize the associated protein-DNA interactions. Deletion experiments show that the URE spans a 60 bp region located between 220 and 279 bp 5' to the transcription start site. Further subdivision of this region shows that multiple cis acting sequences are present. Electrophoretic mobility shift assays demonstrate that the erythroid transcription factor GATA-1 binds a site at -230, and Sp1 and an unidentified factor bind a CCACC site at -240. The unidentified CCACC factor is distinct from two other CCACC factors, EKLF and BKLF/TEF-2. A third complex contains a novel DNA-binding activity that interacts with a site in the -269 to -255 region, designated URE binding factor (URE-BF). This factor is present in K562 cells that express zeta-globin, but is absent in the OCIM1 cell line, a human erythroid cell line that does not express zeta-globin. URE-BF appears to interact with a GATA factor, since formation of the URE-BF complex can be prevented by the presence of unlabeled oligonucleotides containing GATA sites. Finally, increasing the distance from the -230 GATA site to the two upstream sites causes a progressive decrease in zeta-globin promoter activity. There is no indication of a requirement for GATA-1 to be on the same side of the DNA helix as the other upstream factors. These results show that zeta-globin promoter function is highly dependent on a 60 bp region to which at least three different factors bind. Two of these factors may represent DNA-binding proteins not previously identified as important for regulation of globin gene expression. It is likely that these factors interact physically to create a functional regulatory unit.
PMCID: PMC146349  PMID: 9016669
19.  Understanding Variation in Transcription Factor Binding by Modeling Transcription Factor Genome-Epigenome Interactions 
PLoS Computational Biology  2013;9(12):e1003367.
Despite explosive growth in genomic datasets, the methods for studying epigenomic mechanisms of gene regulation remain primitive. Here we present a model-based approach to systematically analyze the epigenomic functions in modulating transcription factor-DNA binding. Based on the first principles of statistical mechanics, this model considers the interactions between epigenomic modifications and a cis-regulatory module, which contains multiple binding sites arranged in any configuration. We compiled a comprehensive epigenomic dataset in mouse embryonic stem (mES) cells, including DNA methylation (MeDIP-seq and MRE-seq), DNA hydroxymethylation (5-hmC-seq), and histone modifications (ChIP-seq). We discovered correlations of transcription factors (TFs) for specific combinations of epigenomic modifications, which we term epigenomic motifs. Epigenomic motifs explained why some TFs appeared to have different DNA binding motifs derived from in vivo (ChIP-seq) and in vitro experiments. Theoretical analyses suggested that the epigenome can modulate transcriptional noise and boost the cooperativity of weak TF binding sites. ChIP-seq data suggested that epigenomic boost of binding affinities in weak TF binding sites can function in mES cells. We showed in theory that the epigenome should suppress the TF binding differences on SNP-containing binding sites in two people. Using personal data, we identified strong associations between H3K4me2/H3K9ac and the degree of personal differences in NFκB binding in SNP-containing binding sites, which may explain why some SNPs introduce much smaller personal variations on TF binding than other SNPs. In summary, this model presents a powerful approach to analyze the functions of epigenomic modifications. This model was implemented into an open source program APEG (Affinity Prediction by Epigenome and Genome).
Author Summary
We developed a model-based approach to systematically analyze the epigenomic functions in modulating transcription factor-DNA binding. We postulated the existence of TF-specific epigenomic motifs, which could explain why some TFs appeared to have different DNA binding motifs derived from in vivo and in vitro experiments. The theoretical results suggested that the epigenome can modulate transcriptional noise and boost the cooperativity of weak TF binding sites. A preliminary analysis of the existing data suggested that epigenomic boost of binding affinities in weak TF binding sites could be a widespread regulatory mechanism in mES cells. Moreover, using personal data, we identified strong associations between H3K4me2/H3K9ac and the degree of individual differences in NFκB binding in SNP-containing binding sites, suggesting the theoretical mechanism for epigenome to attenuate the TF binding differences on SNP-containing binding sites in two individuals may contribute to link genomic variation to phenotypic variation. Thus, this model presents a powerful approach to analyze the functions of epigenomic modifications.
PMCID: PMC3854512  PMID: 24339764
20.  CisMiner: Genome-Wide In-Silico Cis-Regulatory Module Prediction by Fuzzy Itemset Mining 
PLoS ONE  2014;9(9):e108065.
Eukaryotic gene control regions are known to be spread throughout non-coding DNA sequences which may appear distant from the gene promoter. Transcription factors are proteins that coordinately bind to these regions at transcription factor binding sites to regulate gene expression. Several tools allow the detection of significant co-occurrences of closely located binding sites (cis-regulatory modules, CRMs). However, these tools present at least one of the following limitations: 1) their scope is limited to promoter or conserved regions of the genome; 2) they cannot identify combinations involving more than two motifs; 3) they require prior information about target motifs. In this work we present CisMiner, a novel methodology to detect putative CRMs by means of a fuzzy itemset mining approach able to operate at genome-wide scale. CisMiner performs a blind search of CRMs without any prior information about target CRMs or any limitation on the number of motifs. CisMiner tackles the combinatorial complexity of genome-wide cis-regulatory module extraction using a natural representation of motif combinations as itemsets and applying the Top-Down Fuzzy Frequent-Pattern Tree algorithm to identify significant itemsets. Fuzzy technology allows CisMiner to better handle the imprecision and noise inherent to regulatory processes. Results obtained for a set of well-known binding sites in the S. cerevisiae genome show that our method yields highly reliable predictions. Furthermore, CisMiner was also applied to putative in-silico predicted transcription factor binding sites to identify significant combinations in S. cerevisiae and D. melanogaster, proving that our approach can be further applied genome-wide to more complex genomes. CisMiner is freely accessible online; it can be queried for the results presented in this work and can also perform a customized cis-regulatory module prediction on a query set of transcription factor binding sites provided by the user.
PMCID: PMC4182448  PMID: 25268582
21.  Nucleosome organization in the vicinity of transcription factor binding sites in the human genome 
BMC Genomics  2014;15(1):493.
The binding of transcription factors (TFs) to specific DNA sequences is an initial and crucial step of transcription. In eukaryotes, this process is highly dependent on the local chromatin state, which can be modified by recruiting chromatin remodelers. However, previous studies have focused mainly on nucleosome occupancy around the TF binding sites (TFBSs) of a few specific TFs. Here, we investigated the nucleosome occupancy profiles around computationally inferred binding sites, based on 519 TF binding motifs, in human GM12878 and K562 cells.
Although high nucleosome occupancy is intrinsically encoded at TFBSs in vitro, nucleosomes are generally depleted at TFBSs in vivo, and approximately a quarter of TFBSs showed well-positioned in vivo nucleosomes on both sides. RNA polymerase near the transcription start site (TSS) has a large effect on the nucleosome occupancy distribution around the binding sites located within one kilobase of the nearest TSS; fuzzier nucleosome positioning was thus observed around these sites. In addition, in contrast to yeast, repressors, rather than activators, were more likely to bind to nucleosomal DNA in the human cells, and nucleosomes around repressor sites were better positioned in vivo. Genes with repressor sites exhibiting well-positioned nucleosomes on both sides, and genes with activator sites occupied by nucleosomes, had significantly lower expression, suggesting that actions of activators and repressors are associated with the nucleosome occupancy around their binding sites. It was also interesting to note that most of the binding sites that were not in the DNase I-hypersensitive regions were cell-type specific, and higher in vivo nucleosome occupancy was observed at these binding sites.
This study demonstrated that RNA polymerase and the functions of bound TFs affected the local nucleosome occupancy around TFBSs, and nucleosome occupancy patterns around TFBSs were associated with the expression levels of target genes.
Electronic supplementary material
The online version of this article (doi: 10.1186/1471-2164-15-493) contains supplementary material, which is available to authorized users.
PMCID: PMC4073502  PMID: 24942981
Nucleosome occupancy; Transcription factor binding site; Clustering
22.  Hepatocyte nuclear factor 1 and C/EBP are essential for the activity of the human apolipoprotein B gene second-intron enhancer. 
Molecular and Cellular Biology  1992;12(3):1134-1148.
The tissue-specific transcriptional enhancer of the human apolipoprotein B gene contains multiple protein-binding sites spanning 718 bp. Most of the enhancer activity is found in a 443-bp fragment (+621 to +1064) that is located entirely within the second intron of the gene. Within this fragment, a 147-bp region (+806 to +952) containing a single 97-bp DNase I footprint exhibits significant enhancer activity. We now report that this footprint contains four distinct protein-binding sites that have the potential to bind nine distinct liver nuclear proteins. One of these proteins was identified as hepatocyte nuclear factor 1 (HNF-1), which binds with relatively low affinity to the 5' half of a 20-bp palindrome located at the 5' end of the large footprint. A binding site for C/EBP (or one of the related proteins that recognize similar sequences) was identified in the center of the 97-bp footprint. This binding site is coincident or overlaps with the binding sites for five other proteins, two of which appear to be distinct from the C/EBP-related family of proteins. The binding site for a nuclear factor designated protein I is located between the HNF-1 and C/EBP binding sites. Finally, the 3'-most 15 bp of the footprinted sequence contain a binding site for another nuclear protein, which we have called protein II. Mutations that abolish the binding of either HNF-1, protein II, or the C/EBP-related proteins severely reduce enhancer activity. However, deletion experiments demonstrated that neither the HNF-1-binding site alone, nor the combination of binding sites for HNF-1, protein I, and C/EBP, nor the C/EBP-binding site plus the protein II-binding site is sufficient to enhance transcription from a strong apolipoprotein B promoter. Rather, HNF-1 and C/EBP act synergistically with protein II to enhance transcription of the apolipoprotein B gene.
PMCID: PMC369545  PMID: 1545795
23.  A new myocyte-specific enhancer-binding factor that recognizes a conserved element associated with multiple muscle-specific genes. 
Molecular and Cellular Biology  1989;9(11):5022-5033.
Exposure of skeletal myoblasts to growth factor-deficient medium results in transcriptional activation of muscle-specific genes, including the muscle creatine kinase gene (mck). Tissue specificity, developmental regulation, and high-level expression of mck are conferred primarily by a muscle-specific enhancer located between base pairs (bp) -1350 and -1048 relative to the transcription initiation site (E. A. Sternberg, G. Spizz, W. M. Perry, D. Vizard, T. Weil, and E. N. Olson, Mol. Cell. Biol. 8:2896-2909, 1988). To begin to define the regulatory mechanisms that mediate the selective activation of the mck enhancer in differentiating muscle cells, we have further delimited the boundaries of this enhancer and analyzed its interactions with nuclear factors from a variety of myogenic and nonmyogenic cell types. Deletion mutagenesis showed that the region between 1,204 and 1,095 bp upstream of mck functions as a weak muscle-specific enhancer that is dependent on an adjacent enhancer element for strong activity. This adjacent activating element does not exhibit enhancer activity in single copy but acts as a strong enhancer when multimerized. Gel retardation assays combined with DNase I footprinting and diethyl pyrocarbonate interference showed that a nuclear factor from differentiated C2 myotubes and BC3H1 myocytes recognized a conserved A + T-rich sequence within the peripheral activating region. This myocyte-specific enhancer-binding factor, designated MEF-2, was undetectable in nuclear extracts from C2 or BC3H1 myoblasts or several nonmyogenic cell lines. MEF-2 was first detectable within 2 h after exposure of myoblasts to mitogen-deficient medium and increased in abundance for 24 to 48 h thereafter. The appearance of MEF-2 required ongoing protein synthesis and was prevented by fibroblast growth factor and type beta transforming growth factor, which block the induction of muscle-specific genes. 
A myoblast-specific factor that is down regulated within 4 h after removal of growth factors was also found to bind to the MEF-2 recognition site. A 10-bp sequence, which was shown by DNase I footprinting and diethyl pyrocarbonate interference to interact directly with MEF-2, was identified within the rat and human mck enhancers, the rat myosin light-chain (mlc)-1/3 enhancer, and the chicken cardiac mlc-2A promoter. Oligomers corresponding to the region of the mlc-1/3 enhancer, which encompasses this conserved sequence, bound MEF-2 and competed for its binding to the mck enhancer. These results thus provide evidence for a novel myocyte-specific enhancer-binding factor, MEF-2, that is expressed early in the differentiation program and is suppressed by specific polypeptide growth factors. The ability of MEF-2 to recognize conserved activating elements associated with multiple muscle-specific genes suggests that this factor may participate in the coordinate regulation of genes during myogenesis.
PMCID: PMC363654  PMID: 2601707
24.  Functional analysis of transcription factor binding sites in human promoters 
Genome Biology  2012;13(9):R50.
The binding of transcription factors to specific locations in the genome is integral to the orchestration of transcriptional regulation in cells. To characterize transcription factor binding site function on a large scale, we predicted and mutagenized 455 binding sites in human promoters. We carried out functional tests on these sites in four different immortalized human cell lines using transient transfections with a luciferase reporter assay, primarily for the transcription factors CTCF, GABP, GATA2, E2F, STAT, and YY1.
In each cell line, between 36% and 49% of binding sites made a functional contribution to the promoter activity; the overall rate for observing function in any of the cell lines was 70%. Transcription factor binding resulted in transcriptional repression in more than a third of functional sites. When compared with predicted binding sites whose function was not experimentally verified, the functional binding sites had higher conservation and were located closer to transcriptional start sites (TSSs). Among functional sites, repressive sites tended to be located further from TSSs than were activating sites. Our data provide significant insight into the functional characteristics of YY1 binding sites, most notably the detection of distinct activating and repressing classes of YY1 binding sites. Repressing sites were located closer to, and often overlapped with, translational start sites and presented a distinctive variation on the canonical YY1 binding motif.
The genomic properties that we found to associate with functional TF binding sites on promoters -- conservation, TSS proximity, motifs and their variations -- point the way to improved accuracy in future TFBS predictions.
PMCID: PMC3491394  PMID: 22951020
25.  Comprehensive Human Transcription Factor Binding Site Map for Combinatory Binding Motifs Discovery 
PLoS ONE  2012;7(11):e49086.
Knowing the map between transcription factors (TFs) and their binding sites is essential to reverse engineer the regulation process. Only about 10%–20% of the transcription factor binding motifs (TFBMs) have been reported. This lack of data hinders understanding of gene regulation. To address this drawback, we propose a computational method that exploits previously unused TF properties to discover the missing TFBMs and their sites in all human gene promoters. The method starts by predicting a dictionary of regulatory “DNA words.” From this dictionary, it distills 4098 novel predictions. To disclose the crosstalk between motifs, an additional algorithm extracts TF combinatorial binding patterns, creating a collection of TF regulatory syntactic rules. Using these rules, we narrowed down a list of 504 novel motifs that appear frequently in syntax patterns. We tested the predictions against 509 known motifs, confirming that our system can reliably predict ab initio motifs with an accuracy of 81%, far higher than previous approaches. We found that on average, 90% of the discovered combinatorial binding patterns target at least 10 genes, suggesting that supplementary regulatory mechanisms are required to control smaller gene sets independently. Additionally, we discovered that the new TFBMs and their combinatorial patterns convey biological meaning, targeting TFs and genes related to developmental functions. Thus, among all the possible available targets in the genome, the TFs tend to regulate other TFs and genes involved in developmental functions. We provide a comprehensive resource for regulation analysis that includes a dictionary of “DNA words,” newly predicted motifs and their corresponding combinatorial patterns. Combinatorial patterns are a useful filter to discover TFBMs that play a major role in orchestrating other factors and thus are likely to lock/unlock cellular functional clusters.
PMCID: PMC3509107  PMID: 23209563
