The unfolded protein response (UPR) in eukaryotes upregulates factors that restore ER homeostasis upon protein folding stress and in yeast is activated by a non-conventional splicing of the HAC1 mRNA. The spliced HAC1 mRNA encodes an active transcription factor that binds to UPR-responsive elements in the promoter of UPR target genes. Overexpression of the HAC1 gene of S. cerevisiae can reportedly lead to increased production of heterologous proteins. To further such studies in the biotechnology favored yeast Pichia pastoris, we cloned and characterized the P. pastoris HAC1 gene and the splice event.
We identified the HAC1 homologue of P. pastoris and its splice sites. Surprisingly, we could not find evidence for the non-spliced HAC1 mRNA when P. pastoris was cultivated in a standard growth medium without any endoplasmic reticulum stress inducers, indicating that the UPR is constitutively active to some extent in this organism. After identification of the sequence encoding active Hac1p we evaluated the effect of its overexpression in Pichia. The KAR2 UPR-responsive gene was strongly upregulated. Electron microscopy revealed an expansion of the intracellular membranes in Hac1p-overexpressing strains. We then evaluated the effect of inducible and constitutive UPR induction on the production of secreted, surface-displayed and membrane proteins. Wherever Hac1p overexpression affected heterologous protein expression levels, this effect was always stronger when Hac1p expression was inducible rather than constitutive. Depending on the heterologous protein, co-expression of Hac1p increased, decreased or had no effect on expression level. Moreover, α-mating factor prepro signal processing of a G-protein coupled receptor was more efficient with Hac1p overexpression, resulting in a significantly improved homogeneity.
Overexpression of P. pastoris Hac1p can be used to increase the production of heterologous proteins but needs to be evaluated on a case by case basis. Inducible Hac1p expression is more effective than constitutive expression. Correct processing and thus homogeneity of proteins that are difficult to express, such as GPCRs, can be increased by co-expression with Hac1p.
Motivation: Chromatin states are the key to gene regulation and cell identity. Chromatin immunoprecipitation (ChIP) coupled with high-throughput sequencing (ChIP-Seq) is increasingly being used to map epigenetic states across genomes of diverse species. Chromatin modification profiles are frequently noisy and diffuse, spanning regions ranging from several nucleosomes to large domains of multiple genes. Much of the early work on the identification of ChIP-enriched regions for ChIP-Seq data has focused on identifying localized regions, such as transcription factor binding sites. Bioinformatic tools to identify diffuse domains of ChIP-enriched regions have been lacking.
Results: Based on the biological observation that histone modifications tend to cluster to form domains, we present a method that identifies spatial clusters of signals unlikely to appear by chance. This method pools together enrichment information from neighboring nucleosomes to increase sensitivity and specificity. By using genomic-scale analysis, as well as the examination of loci with validated epigenetic states, we demonstrate that this method outperforms existing methods in the identification of ChIP-enriched signals for histone modification profiles. We demonstrate the application of this unbiased method in important issues in ChIP-Seq data analysis, such as data normalization for quantitative comparison of levels of epigenetic modifications across cell types and growth conditions.
Supplementary information: Supplementary data are available at Bioinformatics online.
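The pooling idea described in the Results above can be illustrated with a minimal sketch. It only shows how enriched windows separated by short gaps might be merged into clusters; the function name `find_islands`, the fixed read-count threshold and the `max_gap` parameter are assumptions made for illustration, and the statistical scoring of clusters against a random background that the abstract refers to is not reproduced here.

```python
from typing import List, Tuple


def find_islands(counts: List[int], threshold: int, max_gap: int) -> List[Tuple[int, int, int]]:
    """Merge enriched windows into clusters ("islands").

    counts    : read count per fixed-size genomic window (hypothetical input)
    threshold : minimum count for a window to be considered enriched
    max_gap   : maximum number of consecutive non-enriched windows that may
                be bridged inside one island

    Returns (first_window, last_window, pooled_reads) per island, with
    last_window inclusive. pooled_reads sums every window in the span,
    including bridged gap windows.
    """
    islands = []
    start = None  # first enriched window of the island being built
    last = None   # most recent enriched window of that island
    for i, c in enumerate(counts):
        if c < threshold:
            continue
        if start is None:
            start = last = i
        elif i - last - 1 <= max_gap:
            last = i  # gap is short enough: extend the current island
        else:
            islands.append((start, last, sum(counts[start:last + 1])))
            start = last = i
    if start is not None:
        islands.append((start, last, sum(counts[start:last + 1])))
    return islands


if __name__ == "__main__":
    window_counts = [0, 5, 6, 0, 7, 0, 0, 0, 8, 0]
    print(find_islands(window_counts, threshold=5, max_gap=1))
```

Pooled counts of this kind would then have to be compared with what a background model predicts before an island is called significant; that step depends on the published method and is left out.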
HacA/Xbp1 is a conserved bZIP transcription factor in eukaryotic cells which regulates gene expression in response to various forms of secretion stress and as part of secretory cell differentiation. In the present study, we replaced the endogenous hacA gene of an Aspergillus niger strain with a gene encoding a constitutively active form of the HacA transcription factor (HacACA). The impact of constitutive HacA activity during exponential growth was explored in bioreactor controlled cultures using transcriptomic analysis to identify affected genes and processes.
Transcription profiles for the wild-type strain (HacAWT) and the HacACA strain were obtained using Affymetrix GeneChip analysis of three replicate batch cultures of each strain. In addition to the well known HacA targets such as the ER resident foldases and chaperones, GO enrichment analysis revealed up-regulation of genes involved in protein glycosylation, phospholipid biosynthesis, intracellular protein transport, exocytosis and protein complex assembly in the HacACA mutant. Biological processes over-represented in the down-regulated genes include those belonging to central metabolic pathways, translation and transcription. A remarkable transcriptional response in the HacACA strain was the down-regulation of the AmyR transcription factor and its target genes.
The results indicate that the constitutive activation of HacA leads to a coordinated regulation of the folding and secretion capacity of the cell, but with consequences on growth and fungal physiology to reduce secretion stress.
HacA; Unfolded protein response; Secretion stress; RESS; XBP1; Aspergillus niger; Protein secretion
We have used a human artificial chromosome (HAC) to manipulate the epigenetic state of chromatin within an active kinetochore. The HAC has a dimeric α-satellite repeat containing one natural monomer with a CENP-B binding site, and one completely artificial synthetic monomer with the CENP-B box replaced by a tetracycline operator (tetO). This HAC exhibits normal kinetochore protein composition and mitotic stability. Targeting of several tet-repressor (tetR) fusions into the centromere had no effect on kinetochore function. However, altering the chromatin state to a more open configuration with the tTA transcriptional activator or to a more closed state with the tTS transcription silencer caused missegregation and loss of the HAC. tTS binding caused the loss of CENP-A, CENP-B, CENP-C, and H3K4me2 from the centromere accompanied by an accumulation of histone H3K9me3. Our results reveal that a dynamic balance between centromeric chromatin and heterochromatin is essential for vertebrate kinetochore activity.
Chromatin immunoprecipitation followed by high throughput sequencing (ChIP-Seq) has been successfully used for genome-wide profiling of transcription factor binding sites, histone modifications, and nucleosome occupancy in many model organisms and humans. Because the compact genomes of prokaryotes harbor many binding sites separated by only a few base pairs, applications of ChIP-Seq in this domain have not reached their full potential. Applications in prokaryotic genomes are further hampered by the fact that well studied data analysis methods for ChIP-Seq do not achieve the resolution required for deciphering the locations of nearby binding events. We generated single-end tag (SET) and paired-end tag (PET) ChIP-Seq data for factor in Escherichia coli (E. coli). Direct comparison of these datasets revealed that although the PET assay enables higher resolution identification of binding events, standard ChIP-Seq analysis methods are not equipped to utilize PET-specific features of the data. To address this problem, we developed dPeak as a high resolution binding site identification (deconvolution) algorithm. dPeak implements a probabilistic model that accurately describes the ChIP-Seq data generation process for both the SET and PET assays. For SET data, dPeak outperforms or performs comparably to the state-of-the-art high-resolution ChIP-Seq peak deconvolution algorithms such as PICS, GPS, and GEM. When coupled with PET data, dPeak significantly outperforms SET-based analysis with any of the current state-of-the-art methods. Experimental validations of a subset of dPeak predictions from PET ChIP-Seq data indicate that dPeak can estimate locations of binding events with as high as to resolution. Applications of dPeak to ChIP-Seq data in E. coli under aerobic and anaerobic conditions reveal closely located promoters that are differentially occupied and further illustrate the importance of high resolution analysis of ChIP-Seq data.
Chromatin immunoprecipitation followed by high throughput sequencing (ChIP-Seq) is widely used for studying in vivo protein-DNA interactions genome-wide. Current state-of-the-art ChIP-Seq protocols utilize single-end tag (SET) assay which only sequences ends of DNA fragments in the library. Although paired-end tag (PET) sequencing is routinely used in other applications of next generation sequencing, it has not been much adapted to ChIP-Seq. We illustrate both experimentally and computationally that PET sequencing significantly improves the resolution of ChIP-Seq experiments and enables ChIP-Seq applications in compact genomes like Escherichia coli (E. coli). To enable efficient identification using PET ChIP-Seq data, we develop dPeak as a high resolution binding site identification algorithm. dPeak implements probabilistic models for both SET and PET data and facilitates efficient analysis of both data types. Applications of dPeak to deeply sequenced E. coli PET and SET ChIP-Seq data establish significantly better resolution of PET compared to SET sequencing.
The nuclear receptor peroxisome proliferator activator receptor γ (PPARγ) is the target of antidiabetic thiazolidinedione drugs, which improve insulin resistance but have side effects that limit widespread use. PPARγ is required for adipocyte differentiation, but it is also expressed in other cell types, notably macrophages, where it influences atherosclerosis, insulin resistance, and inflammation. A central question is whether PPARγ binding in macrophages occurs at genomic locations the same as or different from those in adipocytes. Here, utilizing chromatin immunoprecipitation and high-throughput sequencing (ChIP-seq), we demonstrate that PPARγ cistromes in mouse adipocytes and macrophages are predominantly cell type specific. In thioglycolate-elicited macrophages, PPARγ colocalizes with the hematopoietic transcription factor PU.1 in areas of open chromatin and histone acetylation, near a distinct set of immune genes in addition to a number of metabolic genes shared with adipocytes. In adipocytes, the macrophage-unique binding regions are marked with repressive histone modifications, typically associated with local chromatin compaction and gene silencing. PPARγ, when introduced into preadipocytes, bound only to regions depleted of repressive histone modifications, where it increased DNA accessibility, enhanced histone acetylation, and induced gene expression. Thus, the cell specificity of PPARγ function is regulated by cell-specific transcription factors, chromatin accessibility, and histone marks. Our data support the existence of an epigenomic hierarchy in which PPARγ binding to cell-specific sites not marked by repressive marks opens chromatin and leads to local activation marks, including histone acetylation.
We cloned by phenotypic complementation a novel Saccharomyces cerevisiae multicopy suppressor of the Schizosaccharomyces pombe cdc10-129 mutant, which we call HAC1, an acronym of 'homologous to ATF/CREB 1'. It encodes a bZIP (basic-leucine zipper) protein of 230 amino acids with close homology to the mammalian ATF/CREB transcription factor, and gel-retardation assays showed that it binds specifically to the CRE motif. HAC1 is not essential for viability. However, the hac1 disruptant becomes caffeine sensitive, which is suppressed by multicopy expression of the yeast PDE2 (Phosphodiesterase 2) gene. Although the mRNA level of HAC1 is almost constitutive throughout the cell cycle, it fluctuates during meiosis. The upstream region of the HAC1 gene contains a T4C site, a URS (upstream repression sequence) and a TR (T-rich) box-like sequence, which reside upstream of many meiotic genes. These results suggest that HAC1 may also be one of the meiotic genes.
Motivation: Although chromatin immunoprecipitation coupled with high-throughput sequencing (ChIP-seq) or tiling array hybridization (ChIP-chip) is increasingly used to map genome-wide binding sites of transcription factors (TFs), it still remains difficult to generate a quality ChIPx (i.e. ChIP-seq or ChIP-chip) dataset because of the tremendous amount of effort required to develop effective antibodies and efficient protocols. Moreover, most laboratories are unable to easily obtain ChIPx data for one or more TF(s) in more than a handful of biological contexts. Thus, standard ChIPx analyses primarily focus on analyzing data from one experiment, and the discoveries are restricted to a specific biological context.
Results: We propose to enrich this existing data analysis paradigm by developing a novel approach, ChIP-PED, which superimposes ChIPx data on large amounts of publicly available human and mouse gene expression data containing a diverse collection of cell types, tissues and disease conditions to discover new biological contexts with potential TF regulatory activities. We demonstrate ChIP-PED using a number of examples, including a novel discovery that MYC, a human TF, plays an important functional role in pediatric Ewing sarcoma cell lines. These examples show that ChIP-PED increases the value of ChIPx data by allowing one to expand the scope of possible discoveries made from a ChIPx experiment.
Supplementary data are available at Bioinformatics
Deep sequencing approaches, such as chromatin immunoprecipitation by sequencing (ChIP-seq), have been successful in detecting transcription factor-binding sites and histone modification in the whole genome. An approach for comparing two different ChIP-seq datasets would be beneficial for predicting unknown functions of a factor. We propose a model to represent co-localization of two different ChIP-seq datasets. We showed that a meaningful overlapping signal and a meaningless background signal can be separated by this model. We applied this model to compare ChIP-seq data of RNA polymerase II C-terminal domain (CTD) serine 2 phosphorylation with a large amount of peak-called data, including ChIP-seq and other deep sequencing data in the Encyclopedia of DNA Elements (ENCODE) project, and then extracted factors that were related to RNA polymerase II CTD serine 2 in HeLa cells. We further analyzed RNA polymerase II CTD serine 7 phosphorylation, whose function is still unclear in HeLa cells. Our results were characterized by the similarity of localization for transcription factor/histone modification in the ENCODE data set, and this suggests that our model is appropriate for understanding ChIP-seq data for factors whose function is unknown.
Motivation: Identifying the target genes regulated by transcription factors (TFs) is the most basic step in understanding gene regulation. Recent advances in high-throughput sequencing technology, together with chromatin immunoprecipitation (ChIP), enable mapping TF binding sites genome wide, but it is not possible to infer function from binding alone. This is especially true in mammalian systems, where regulation often occurs through long-range enhancers in gene-rich neighborhoods, rather than proximal promoters, preventing straightforward assignment of a binding site to a target gene.
Results: We present EMBER (Expectation Maximization of Binding and Expression pRofiles), a method that integrates high-throughput binding data (e.g. ChIP-chip or ChIP-seq) with gene expression data (e.g. DNA microarray) via an unsupervised machine learning algorithm for inferring the gene targets of sets of TF binding sites. Genes selected are those that match overrepresented expression patterns, which can be used to provide information about multiple TF regulatory modes. We apply the method to genome-wide human breast cancer data and demonstrate that EMBER confirms a role for the TFs estrogen receptor alpha, retinoic acid receptors alpha and gamma in breast cancer development, whereas the conventional approach of assigning regulatory targets based on proximity does not. Additionally, we compare several predicted target genes from EMBER to interactions inferred previously, examine combinatorial effects of TFs on gene regulation and illustrate the ability of EMBER to discover multiple modes of regulation.
Availability: All code used for this work is available at http://dinner-group.uchicago.edu/downloads.html
Supplementary Information: Supplementary data are available at Bioinformatics online.
Mapping genome-wide binding sites of all transcription factors (TFs) in all biological contexts is a critical step toward understanding gene regulation. The state-of-the-art technologies for mapping transcription factor binding sites (TFBSs) couple chromatin immunoprecipitation (ChIP) with high-throughput sequencing (ChIP-seq) or tiling array hybridization (ChIP-chip). These technologies have limitations: they are low-throughput with respect to surveying many TFs. Recent advances in genome-wide chromatin profiling, including development of technologies such as DNase-seq, FAIRE-seq and ChIP-seq for histone modifications, make it possible to predict in vivo TFBSs by analyzing chromatin features at computationally determined DNA motif sites. This promising new approach may allow researchers to monitor the genome-wide binding sites of many TFs simultaneously. In this article, we discuss various experimental design and data analysis issues that arise when applying this approach. Through a systematic analysis of the data from the Encyclopedia Of DNA Elements (ENCODE) project, we compare the predictive power of individual and combinations of chromatin marks using supervised and unsupervised learning methods, and evaluate the value of integrating information from public ChIP and gene expression data. We also highlight the challenges and opportunities for developing novel analytical methods, such as resolving the one-motif-multiple-TF ambiguity and distinguishing functional and non-functional TF binding targets from the predicted binding sites.
Electronic Supplementary Material
The online version of this article (doi:10.1007/s12561-012-9066-5) contains supplementary material, which is available to authorized users.
Transcription factor binding sites; DNase-seq; ChIP-seq; FAIRE-seq; Next-generation sequencing; Motif
Motivation: Chromatin immunoprecipitation (ChIP) experiments followed by array hybridization, or ChIP-chip, is a powerful approach for identifying transcription factor binding sites (TFBS) and has been widely used. Recently, massively parallel sequencing coupled with ChIP experiments (ChIP-seq) has been increasingly used as an alternative to ChIP-chip, offering cost-effective genome-wide coverage and resolution up to a single base pair. For many well-studied TFs, both ChIP-seq and ChIP-chip experiments have been applied and their data are publicly available. Previous analyses have revealed substantial technology-specific binding signals despite strong correlation between the two sets of results. Therefore, it is of interest to see whether the two data sources can be combined to enhance the detection of TFBS.
Results: In this work, hierarchical hidden Markov model (HHMM) is proposed for combining data from ChIP-seq and ChIP-chip. In HHMM, inference results from individual HMMs in ChIP-seq and ChIP-chip experiments are summarized by a higher level HMM. Simulation studies show the advantage of HHMM when data from both technologies co-exist. Analysis of two well-studied TFs, NRSF and CCCTC-binding factor (CTCF), also suggests that HHMM yields improved TFBS identification in comparison to analyses using individual data sources or a simple merger of the two.
Availability: Source code for the software ChIPmeta is freely available for download at http://www.umich.edu/∼hwchoi/HHMMsoftware.zip, implemented in C and supported on linux.
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: ChIP-seq data are enriched in binding sites for the protein immunoprecipitated. Some sequences may also contain binding sites for a coregulator. Biologists are interested in knowing which coregulatory factor motifs may be present in the sequences bound by the protein ChIP'ed.
Results: We present a finite mixture framework with an expectation–maximization algorithm that considers two motifs jointly and simultaneously determines which sequences contain both motifs, either one or neither of them. Tested on 10 simulated ChIP-seq datasets, our method performed better than repeated application of MEME in predicting sequences containing both motifs. When applied to a mouse liver Foxa2 ChIP-seq dataset involving ~ 12 000 400-bp sequences, coMOTIF identified co-occurrence of Foxa2 with Hnf4a, Cebpa, E-box, Ap1/Maf or Sp1 motifs in ~6–33% of these sequences. These motifs are either known as liver-specific transcription factors or have an important role in liver function.
Availability: Freely available at http://www.niehs.nih.gov/research/resources/software/comotif/.
Supplementary Information: Supplementary data are available at Bioinformatics online.
Boundary elements partition eukaryotic chromatin into active and repressive domains, and can also block regulatory interactions between domains. Boundary elements act via diverse mechanisms making accurate feature-based computational predictions difficult. Therefore, we developed an unbiased algorithm that predicts the locations of human boundary elements based on the genomic distributions of chromatin and transcriptional states, as opposed to any intrinsic characteristics that they may possess. Application of our algorithm to ChIP-seq data for histone modifications and RNA Pol II-binding data in human CD4+ T cells resulted in the prediction of 2542 putative chromatin boundary elements genome wide. Predicted boundary elements display two distinct features: first, position-specific open chromatin and histone acetylation that is coincident with the recruitment of sequence-specific DNA-binding factors such as CTCF, EVI1 and YYI, and second, a directional and gradual increase in histone lysine methylation across predicted boundaries coincident with a gain of expression of non-coding RNAs, including examples of boundaries encoded by tRNA and other non-coding RNA genes. Accordingly, a number of the predicted human boundaries may function via the synergistic action of sequence-specific recruitment of transcription factors leading to non-coding RNA transcriptional interference and the blocking of facultative heterochromatin propagation by transcription-associated chromatin remodeling complexes.
High throughput signature sequencing holds many promises, one of which is the ready identification of in vivo transcription factor binding sites, histone modifications, changes in chromatin structure and patterns of DNA methylation across entire genomes. In these experiments, chromatin immunoprecipitation is used to enrich for particular DNA sequences of interest and signature sequencing is used to map the regions to the genome (ChIP-Seq). Elucidation of these sites of DNA-protein binding/modification is proving instrumental in reconstructing networks of gene regulation and chromatin remodelling that direct development, response to cellular perturbation, and neoplastic transformation.
Here we present a package of algorithms and software that makes use of control input data to reduce false positives and estimate confidence in ChIP-Seq peaks. Several different methods were compared using two simulated spike-in datasets. Use of control input data and a normalized difference score were found to more than double the recovery of ChIP-Seq peaks at a 5% false discovery rate (FDR). Moreover, both a binomial p-value/q-value and an empirical FDR were found to predict the true FDR within 2–3 fold and are more reliable estimators of confidence than a global Poisson p-value. These methods were then used to reanalyze Johnson et al.'s neuron-restrictive silencer factor (NRSF) ChIP-Seq data without relying on extensive qPCR validated NRSF sites and the presence of NRSF binding motifs for setting thresholds.
The methods developed and tested here show considerable promise for reducing false positives and estimating confidence in ChIP-Seq data without any prior knowledge of the ChIP target. They are part of a larger open source package freely available from http://useq.sourceforge.net/.
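The binomial p-value mentioned above can be illustrated for a single window. This is a minimal sketch under stated assumptions rather than the package's implementation: it assumes that, under the null hypothesis, each read in a window comes from the ChIP sample with probability `p_chip` (0.5 when the two libraries are scaled to equal depth), and the function name `binomial_upper_tail` is hypothetical.

```python
from math import comb


def binomial_upper_tail(chip: int, control: int, p_chip: float = 0.5) -> float:
    """P(X >= chip) for X ~ Binomial(chip + control, p_chip).

    chip, control : read counts in one window for the ChIP and control samples
    p_chip        : assumed null probability that a read comes from the ChIP
                    sample (0.5 corresponds to equal library sizes)
    """
    n = chip + control
    return sum(
        comb(n, k) * p_chip ** k * (1.0 - p_chip) ** (n - k)
        for k in range(chip, n + 1)
    )


if __name__ == "__main__":
    # A window with 30 ChIP reads against 10 control reads
    print(binomial_upper_tail(30, 10))
```

Converting such per-window p-values into q-values or an empirical FDR requires a multiple-testing step that is not shown here.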
Motivation: Chromatin immunoprecipitation followed by genome tiling array hybridization (ChIP-chip) is a powerful approach to identify transcription factor binding sites (TFBSs) in target genomes. When multiple related ChIP-chip datasets are available, analyzing them jointly allows one to borrow information across datasets to improve peak detection. This is particularly useful for analyzing noisy datasets.
Results: We propose a hierarchical mixture model and develop an R package JAMIE to perform the joint analysis. The genome is assumed to consist of background and potential binding regions (PBRs). PBRs have context-dependent probabilities to become bona fide binding sites in individual datasets. This model captures the correlation among datasets, which provides basis for sharing information across experiments. Real data tests illustrate the advantage of JAMIE over a strategy that analyzes individual datasets separately.
Availability: JAMIE is freely available from http://www.biostat.jhsph.edu/∼hji/jamie
Supplementary information: Supplementary data are available at Bioinformatics online.
Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is rapidly replacing chromatin immunoprecipitation combined with genome-wide tiling array analysis (ChIP-chip) as the preferred approach for mapping transcription-factor binding sites and chromatin modifications. The state of the art for analyzing ChIP-seq data relies on using only reads that map uniquely to a relevant reference genome (uni-reads). This can lead to the omission of up to 30% of alignable reads. We describe a general approach for utilizing reads that map to multiple locations on the reference genome (multi-reads). Our approach is based on allocating multi-reads as fractional counts using a weighted alignment scheme. Using human STAT1 and mouse GATA1 ChIP-seq datasets, we illustrate that incorporation of multi-reads significantly increases sequencing depths, leads to detection of novel peaks that are not otherwise identifiable with uni-reads, and improves detection of peaks in mappable regions. We investigate various genome-wide characteristics of peaks detected only by utilization of multi-reads via computational experiments. Overall, peaks from multi-read analysis have similar characteristics to peaks that are identified by uni-reads except that the majority of them reside in segmental duplications. We further validate a number of GATA1 multi-read only peaks by independent quantitative real-time ChIP analysis and identify novel target genes of GATA1. These computational and experimental results establish that multi-reads can be of critical importance for studying transcription factor binding in highly repetitive regions of genomes with ChIP-seq experiments.
Annotating repetitive regions of genomes experimentally is a challenging task. Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) provides valuable data for characterizing repetitive regions of genomes in terms of transcription factor binding. Although ChIP-seq technology has been maturing, available ChIP-seq analysis methods and software rely on discarding sequence reads that map to multiple locations on the reference genome (multi-reads), thereby generating a missed opportunity for assessing transcription factor binding to highly repetitive regions of genomes. We develop a computational algorithm that takes multi-reads into account in ChIP-seq analysis. We show with computational experiments that multi-reads lead to significant increase in sequencing depths and identification of binding regions that are otherwise not identifiable when only reads that uniquely map to the reference genome (uni-reads) are used. In particular, we show that the number of binding regions identified can increase up to 36%. We support our computational predictions with independent quantitative real-time ChIP validation of binding regions identified only when multi-reads are incorporated in the analysis of a mouse GATA1 ChIP-seq experiment.
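Allocating multi-reads as fractional counts can be sketched as follows. The weighting used here, proportional to the uni-read count at each candidate position plus a pseudocount, is an assumption chosen for illustration and is not necessarily the weighted alignment scheme of the published method; the name `allocate_multireads` and its parameters are likewise hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def allocate_multireads(
    uni_counts: Dict[int, float],
    multireads: Iterable[List[int]],
    pseudocount: float = 1.0,
) -> Dict[int, float]:
    """Distribute each multi-read over its candidate positions.

    uni_counts  : uni-read count per position (or bin)
    multireads  : one list of candidate positions per multi-read
    pseudocount : keeps positions without uni-reads from receiving zero weight

    Every multi-read contributes a total of exactly 1.0 count, split in
    proportion to (uni-read count + pseudocount) at each candidate.
    """
    totals = defaultdict(float, uni_counts)
    for candidates in multireads:
        weights = [uni_counts.get(pos, 0.0) + pseudocount for pos in candidates]
        norm = sum(weights)
        for pos, w in zip(candidates, weights):
            totals[pos] += w / norm
    return dict(totals)


if __name__ == "__main__":
    uni = {10: 3.0, 20: 0.0}
    reads = [[10, 20], [30, 40]]
    print(allocate_multireads(uni, reads))
```

Because each multi-read adds exactly one count in total, the overall depth increases by the number of multi-reads, which matches the gain in sequencing depth described above.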
The unfolded protein response is a mechanism to cope with endoplasmic reticulum stress. In Saccharomyces cerevisiae, Ire1 senses the stress and mediates a signaling cascade to upregulate responsive genes through an unusual HAC1 mRNA splicing. The splicing requires interconnected activity (kinase and endoribonuclease) of Ire1 to cleave HAC1 mRNA at the non-canonical splice sites before translation into Hac1 transcription factor. Analysis of the truncated kinase domain from Ire1 homologs revealed that this domain is highly conserved. Characterization by domain swapping indicated that a functional ATP/ADP binding domain is minimally required. However, the overall domain compatibility is critical for eliciting its full endoribonuclease function.
Unfolded protein response; Ire1; Domain swapping; HAC1 splicing; protein kinase; endoribonuclease
Dramatic progress in the development of next-generation sequencing technologies has enabled accurate genome-wide characterization of the binding sites of DNA-associated proteins. This technique, baptized as ChIP-Seq, uses a combination of chromatin immunoprecipitation and massively parallel DNA sequencing. Other published tools that predict binding sites from ChIP-Seq data use only positional information of mapped reads. In contrast, our algorithm MICSA (Motif Identification for ChIP-Seq Analysis) combines this source of positional information with information on motif occurrences to better predict binding sites of transcription factors (TFs). We proved the greater accuracy of MICSA with respect to several other tools by running them on datasets for the TFs NRSF, GABP, STAT1 and CTCF. We also applied MICSA on a dataset for the oncogenic TF EWS-FLI1. We discovered >2000 binding sites and two functionally different binding motifs. We observed that EWS-FLI1 can activate gene transcription when (i) its binding site is located in close proximity to the gene transcription start site (up to ∼150 kb), and (ii) it contains a microsatellite sequence. Furthermore, we observed that sites without microsatellites can also induce regulation of gene expression—positively as often as negatively—and at much larger distances (up to ∼1 Mb).
Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) allows researchers to determine the genome-wide binding locations of individual transcription factors (TFs) at high resolution. This information can be interrogated to study various aspects of TF behaviour, including the mechanisms that control TF binding. Physical interaction between TFs comprises one important aspect of TF binding in eukaryotes, mediating tissue-specific gene expression. We have developed an algorithm, spaced motif analysis (SpaMo), which is able to infer physical interactions between the given TF and TFs bound at neighbouring sites at the DNA interface. The algorithm predicts TF interactions in half of the ChIP-seq data sets we test, with the majority of these predictions supported by direct evidence from the literature or evidence of homodimerization. High resolution motif spacing information obtained by this method can facilitate an improved understanding of individual TF complex structures. SpaMo can assist researchers in extracting maximum information relating to binding mechanisms from their TF ChIP-seq data. SpaMo is available for download and interactive use as part of the MEME Suite (http://meme.nbcr.net).
Chromatin immunoprecipitation coupled with high-throughput DNA sequencing (ChIP-Seq) has emerged as a powerful tool for genome-wide profiling of the binding sites of proteins associated with DNA, such as histones and transcription factors. However, no peak calling program has gained consensus acceptance by the scientific community as the preferred tool for ChIP-Seq data analysis. Analyzing the large data sets generated by ChIP-Seq studies remains highly challenging for most molecular biology laboratories.
Here we profile H3K27me3 enrichment sites in rice young endosperm using the ChIP-Seq approach and analyze the data using four peak calling algorithms (FindPeaks, PeakSeq, USeq, and MACS). Comparison of the four algorithms reveals that these programs produce very different peaks in terms of peak size, number, and position relative to genes. We verify the peak predictions using ChIP-PCR to evaluate the accuracy of peak prediction of the four algorithms. We discuss the approach of each algorithm and compare similarities and differences in the results. Despite their differences in the peaks identified, all of the programs reach similar conclusions about the effect of H3K27me3 on gene expression. Its presence either upstream or downstream of a gene is predominantly associated with repression of the gene. Additionally, GO analysis finds that a substantially higher proportion of genes associated with H3K27me3 are involved in multicellular organism development, signal transduction, response to external and endogenous stimuli, and secondary metabolic pathways than in the rest of the rice genome.
Direct binding by a transcription factor (TF) to the proximal promoter of a gene is strong evidence that the TF regulates the gene. Assaying the genome-wide binding of every TF in every cell type and condition is currently impractical. Histone modifications correlate with tissue/cell/condition-specific (‘tissue specific’) TF binding, so histone ChIP-seq data can be combined with traditional position weight matrix (PWM) methods to make tissue-specific predictions of TF–promoter interactions.
Results: We use supervised learning to train a naïve Bayes predictor of TF–promoter binding. The predictor's features are the histone modification levels and a PWM-based score for the promoter. Training and testing use sets of promoters labeled using TF ChIP-seq data, and we use cross-validation on 23 such datasets to measure the accuracy. A PWM+histone naïve Bayes predictor using a single histone modification (H3K4me3) is substantially more accurate than a PWM score or a conservation-based score (phylogenetic motif model). The naïve Bayes predictor is more accurate (on average) at all sensitivity levels, and makes only half as many false positive predictions at sensitivity levels from 10% to 80%. On average, it correctly predicts 80% of bound promoters at a false positive rate of 20%. Accuracy does not diminish when we test the predictor in a different cell type (and species) from training. Accuracy is barely diminished even when we train the predictor without using TF ChIP-seq data.
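A naïve Bayes predictor over two promoter features (a PWM-based score and an H3K4me3 level) can be sketched as follows. This is a minimal illustration, not the Dr Gene code. The Gaussian likelihood, the variance floor and all function names are assumptions, since the abstract does not state how the features are modelled.

```python
import math


def fit_gaussian_nb(rows, labels):
    """Fit per-class priors, feature means and variances.

    rows: list of (pwm_score, histone_level) tuples (hypothetical features).
    labels: list of 1 (bound) or 0 (unbound), e.g. from TF ChIP-seq peaks.
    """
    model = {}
    for c in set(labels):
        members = [r for r, y in zip(rows, labels) if y == c]
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # A small floor keeps the variance positive for constant features.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-6
            for col, m in zip(columns, means)
        ]
        model[c] = (n / len(rows), means, variances)
    return model


def log_posterior(model, row, c):
    """Unnormalised log posterior of class c, features treated as independent."""
    prior, means, variances = model[c]
    lp = math.log(prior)
    for x, m, v in zip(row, means, variances):
        lp += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
    return lp


def predict_bound(model, row):
    """Return the class label with the highest log posterior."""
    return max(model, key=lambda c: log_posterior(model, row, c))
```

Because the two features contribute independent terms to the posterior, a promoter with a strong motif match but no histone signal can still be scored as unbound, which is the behaviour a PWM score alone cannot provide.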
Availability: Our tissue-specific predictor of promoters bound by a TF is called Dr Gene and is available at http://bioinformatics.org.au/drgene.
Supplementary information: Supplementary data are available at Bioinformatics online.
Chromatin immunoprecipitation combined with next-generation DNA sequencing technologies (ChIP-seq) has become a key approach for detecting genome-wide sets of genomic sites bound by proteins, such as transcription factors (TFs). Several methods and open-source tools have been developed to analyze ChIP-seq data. However, most of them are designed for detecting TF binding regions instead of accurately locating transcription factor binding sites (TFBSs). It is still challenging to pinpoint TFBSs directly from ChIP-seq data, especially in regions with closely spaced binding events.
To pinpoint TFBSs at high resolution, we propose a novel method named SeqSite, which implements a two-step strategy: first detecting tag-enriched regions, then pinpointing binding sites within the detected regions. The second step models the tag density profile, locates TFBSs on each strand with a least-squares model fitting strategy, and merges the detections from the two strands. Experiments on simulated data show that SeqSite can locate most binding sites that lie more than 40 bp apart. Applications to three human TF ChIP-seq datasets demonstrate the advantage of SeqSite, namely its higher resolution in pinpointing binding sites compared with existing methods.
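The least-squares step can be illustrated with a generic template fit on a single strand. This sketch is not SeqSite's model, which the abstract does not specify. The triangular template, the exhaustive search over centres and the function names are assumptions made only to show how a least-squares fit can localise a peak in a tag density profile.

```python
def triangle(center, half_width, length):
    """Triangular template of height 1 centred at `center` (assumed shape)."""
    return [max(0.0, 1.0 - abs(i - center) / half_width) for i in range(length)]


def fit_peak_position(density, half_width):
    """Find the template centre and amplitude minimising squared error.

    density: per-base tag density on one strand within an enriched region.
    Returns (center, amplitude, residual_sum_of_squares), or None if no
    centre gives a positive amplitude.
    """
    n = len(density)
    best = None
    for center in range(n):
        t = triangle(center, half_width, n)
        tt = sum(v * v for v in t)  # always >= 1 because t[center] == 1
        # Closed-form least-squares amplitude for a fixed template.
        amp = sum(y * v for y, v in zip(density, t)) / tt
        if amp <= 0:
            continue
        rss = sum((y - amp * v) ** 2 for y, v in zip(density, t))
        if best is None or rss < best[2]:
            best = (center, amp, rss)
    return best
```

Running the fit separately on forward-strand and reverse-strand densities yields one position estimate per strand, which could then be merged into a single site estimate, for example by taking the midpoint.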
We have developed a computational tool named SeqSite, which can pinpoint both closely spaced and isolated binding sites, and consequently improves the resolution of TFBS detection from ChIP-seq data.
Context-dependent transcription factor (TF) binding is one reason for differences in gene expression patterns between different cellular states. Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) identifies genome-wide TF binding sites for one particular context—the cells used in the experiment. But can such ChIP-seq data predict TF binding in other cellular contexts and is it possible to distinguish context-dependent from ubiquitous TF binding?
We compared ChIP-seq data on TF binding for multiple TFs in two different cell types and found that on average only a third of ChIP-seq peak regions are common to both cell types. As expected, common peaks occur more frequently in certain genomic contexts, such as CpG-rich promoters, whereas chromatin differences characterize cell-type specific TF binding. We also find, however, that genotype differences between the cell types can explain differences in binding. Moreover, ChIP-seq signal intensity and peak clustering are the strongest predictors of common peaks. Compared with strong peaks located in regions containing peaks for multiple transcription factors, weak and isolated peaks are less common between the cell types and are less associated with data that indicate regulatory activity.
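The fraction of peaks shared between two cell types can be computed with a simple interval-overlap check. The sketch below is only an illustration of that comparison. It assumes peaks are given as half-open (chromosome, start, end) tuples and counts any overlap of at least one base as shared; the abstract does not state the overlap criterion actually used.

```python
def overlap(a, b):
    """True if two half-open (chrom, start, end) intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def common_fraction(peaks_a, peaks_b):
    """Fraction of peaks in cell type A that overlap any peak in cell type B."""
    if not peaks_a:
        return 0.0
    shared = sum(1 for a in peaks_a if any(overlap(a, b) for b in peaks_b))
    return shared / len(peaks_a)
```

The pairwise scan is quadratic in the number of peaks, which is adequate for a sketch; genome-scale peak sets would normally be sorted per chromosome or indexed with an interval tree first.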
Together, the results suggest that experimental noise is prevalent among weak peaks, whereas strong and clustered peaks represent high-confidence binding events that often occur in other cellular contexts. Nevertheless, 30-40% of the strongest and most clustered peaks show context-dependent regulation. We show that by combining signal intensity with additional data—ranging from context independent information such as binding site conservation and position weight matrix scores to context dependent chromatin structure—we can predict whether a ChIP-seq peak is likely to be present in other cellular contexts.
Chromatin immunoprecipitation combined with high-throughput sequencing (ChIP-Seq) is the most frequently used method to identify the binding sites of transcription factors. Active binding sites can be seen as peaks in enrichment profiles when the sequencing reads are mapped to a reference genome. However, the profiles are normally noisy, making it challenging to identify all significantly enriched regions in a reliable way and with an acceptable false discovery rate.
We present the Triform algorithm, an improved approach to automatic peak finding in ChIP-Seq enrichment profiles for transcription factors. The method uses model-free statistics to identify peak-like distributions of sequencing reads, taking advantage of improved peak definition in combination with known characteristics of ChIP-Seq data.
Triform outperforms several existing methods in the identification of representative peak profiles in curated benchmark data sets. We also show that Triform in many cases is able to identify peaks that are more consistent with biological function, compared with other methods. Finally, we show that Triform can be used to generate novel information on transcription factor binding in repeat regions, which represents a particular challenge in many ChIP-Seq experiments. The Triform algorithm has been implemented in R, and is available via http://tare.medisin.ntnu.no/triform.
ChIP-Seq; Peak finding; Benchmark; Repeats