Accurately modeling the DNA sequence preferences of transcription factors (TFs), and using these models to predict in vivo genomic binding sites for TFs, are key pieces in deciphering the regulatory code. These efforts have been frustrated by the limited availability and accuracy of TF binding site motifs, usually represented as position-specific scoring matrices (PSSMs), which may match large numbers of sites and produce an unreliable list of target genes. Recently, protein binding microarray (PBM) experiments have emerged as a new source of high resolution data on in vitro TF binding specificities. PBM data has been analyzed either by estimating PSSMs or via rank statistics on probe intensities, so that individual sequence patterns are assigned enrichment scores (E-scores). This representation is informative but unwieldy because every TF is assigned a list of thousands of scored sequence patterns. Meanwhile, high-resolution in vivo TF occupancy data from ChIP-seq experiments is also increasingly available. We have developed a flexible discriminative framework for learning TF binding preferences from high resolution in vitro and in vivo data. We first trained support vector regression (SVR) models on PBM data to learn the mapping from probe sequences to binding intensities. We used a novel k-mer based string kernel called the di-mismatch kernel to represent probe sequence similarities. The SVR models are more compact than E-scores, more expressive than PSSMs, and can be readily used to scan genomic regions to predict in vivo occupancy. Using a large data set of yeast and mouse TFs, we found that our SVR models can better predict probe intensity than the E-score method or PBM-derived PSSMs. Moreover, by using SVRs to score yeast, mouse, and human genomic regions, we were better able to predict genomic occupancy as measured by ChIP-chip and ChIP-seq experiments.
Finally, we found that by training kernel-based models directly on ChIP-seq data, we greatly improved in vivo occupancy prediction, and by comparing a TF's in vitro and in vivo models, we could identify cofactors and disambiguate direct and indirect binding.
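To make the idea of a k-mer based string kernel concrete, the following is a minimal sketch of the plain spectrum kernel, which compares two sequences through their k-mer counts. It is a deliberately simplified stand-in and not the di-mismatch kernel itself, whose definition is not given here; the function names and the default k are assumptions chosen for illustration.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq (case-insensitive)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(seq_a, seq_b, k=3):
    """Inner product of the two k-mer count vectors.

    Illustrative simplification: exact k-mer matches only,
    no mismatch or dinucleotide handling.
    """
    a, b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    return sum(count * b[kmer] for kmer, count in a.items())


def normalized_kernel(seq_a, seq_b, k=3):
    """Cosine-style normalization so that K(x, x) equals 1."""
    denom = (spectrum_kernel(seq_a, seq_a, k)
             * spectrum_kernel(seq_b, seq_b, k)) ** 0.5
    return spectrum_kernel(seq_a, seq_b, k) / denom if denom else 0.0
```

A kernel matrix built this way over probe sequences could in principle be passed to any regression method that accepts precomputed kernels, which is the general pattern the abstract describes for learning probe intensities.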
Transcription factors (TFs) are proteins that bind sites in the non-coding DNA and regulate the expression of targeted genes. Being able to predict the genome-wide binding locations of TFs is an important step in deciphering gene regulatory networks. Historically, there was very limited experimental data on the DNA-binding preferences of most TFs. Computational biologists used known sites to estimate simple binding site motifs, called position-specific scoring matrices, and scan the genome for additional potential binding locations, but this approach often led to many false positive predictions. Here we introduce a machine learning approach to leverage new high resolution data on the binding preferences of TFs, namely, protein binding microarray (PBM) experiments which measure the in vitro binding affinities of TFs with respect to an array of double-stranded DNA probes, and chromatin immunoprecipitation experiments followed by next generation sequencing (ChIP-seq) which measure in vivo genome-wide binding of TFs in a given cell type. We show that by training statistical models on high resolution PBM and ChIP-seq data, we can more accurately represent the subtle DNA binding preferences of TFs and predict their genome-wide binding locations. These results will enable advances in the computational analysis of transcriptional regulation in mammalian genomes.
Chromatin immunoprecipitation followed by high throughput sequencing (ChIP-Seq) has been successfully used for genome-wide profiling of transcription factor binding sites, histone modifications, and nucleosome occupancy in many model organisms and humans. Because the compact genomes of prokaryotes harbor many binding sites separated by only a few base pairs, applications of ChIP-Seq in this domain have not reached their full potential. Applications in prokaryotic genomes are further hampered by the fact that well studied data analysis methods for ChIP-Seq do not provide the resolution required for deciphering the locations of nearby binding events. We generated single-end tag (SET) and paired-end tag (PET) ChIP-Seq data for factor in Escherichia coli (E. coli). Direct comparison of these datasets revealed that although the PET assay enables higher resolution identification of binding events, standard ChIP-Seq analysis methods are not equipped to utilize PET-specific features of the data. To address this problem, we developed dPeak as a high resolution binding site identification (deconvolution) algorithm. dPeak implements a probabilistic model that accurately describes the ChIP-Seq data generation process for both the SET and PET assays. For SET data, dPeak outperforms or performs comparably to the state-of-the-art high-resolution ChIP-Seq peak deconvolution algorithms such as PICS, GPS, and GEM. When coupled with PET data, dPeak significantly outperforms SET-based analysis with any of the current state-of-the-art methods. Experimental validations of a subset of dPeak predictions from PET ChIP-Seq data indicate that dPeak can estimate locations of binding events with as high as to resolution. Applications of dPeak to ChIP-Seq data in E. coli under aerobic and anaerobic conditions reveal closely located promoters that are differentially occupied and further illustrate the importance of high resolution analysis of ChIP-Seq data.
Chromatin immunoprecipitation followed by high throughput sequencing (ChIP-Seq) is widely used for studying in vivo protein-DNA interactions genome-wide. Current state-of-the-art ChIP-Seq protocols use the single-end tag (SET) assay, which only sequences ends of DNA fragments in the library. Although paired-end tag (PET) sequencing is routinely used in other applications of next generation sequencing, it has rarely been adopted for ChIP-Seq. We illustrate both experimentally and computationally that PET sequencing significantly improves the resolution of ChIP-Seq experiments and enables ChIP-Seq applications in compact genomes like Escherichia coli (E. coli). To enable efficient identification using PET ChIP-Seq data, we develop dPeak as a high resolution binding site identification algorithm. dPeak implements probabilistic models for both SET and PET data and facilitates efficient analysis of both data types. Applications of dPeak to deeply sequenced E. coli PET and SET ChIP-Seq data establish the significantly better resolution of PET compared to SET sequencing.
Insufficient protein-folding capacity in the endoplasmic reticulum (ER) induces the unfolded protein response (UPR). In the ER lumen, accumulation of unfolded proteins activates the transmembrane ER-stress sensor Ire1 and drives its oligomerization. In the cytosol, Ire1 recruits HAC1 mRNA, mediating its non-conventional splicing. The spliced mRNA is translated into Hac1, the key transcription activator of UPR target genes that mitigate ER-stress. In this study, we report that oligomeric assembly of the ER-lumenal domain is sufficient to drive Ire1 clustering. Clustering facilitates Ire1's cytosolic oligomeric assembly and HAC1 mRNA docking onto a positively charged motif in Ire1's cytosolic linker domain that tethers the kinase/RNase to the transmembrane domain. By the use of a synthetic bypass, we demonstrate that mRNA docking per se is a pre-requisite for initiating Ire1's RNase activity and, hence, splicing. We posit that such step-wise engagement between Ire1 and its mRNA substrate contributes to selectivity and efficiency in UPR signaling.
Proteins are built based on instructions in template molecules called messenger RNAs (or mRNAs), which are copied from the DNA of genes. As they are made, proteins must fold into a specific three-dimensional shape and some proteins pass into a compartment in the cell, called the endoplasmic reticulum, in which they fold. So-called molecular chaperone proteins assist this folding process. From the endoplasmic reticulum, most proteins travel to other destinations within or outside of the cell.
If the molecular chaperones in the endoplasmic reticulum are overwhelmed by their protein folding task, unfolded proteins accumulate; a situation that can be harmful to the cell. In eukaryotic cells including yeast, a sensor protein called Ire1 detects when unfolded proteins build up in the endoplasmic reticulum. As a result, the Ire1 sensor proteins join together to form clusters and an mRNA molecule called HAC1 is specifically recruited to the Ire1 clusters. The portions of the Ire1 protein that extend out from the endoplasmic reticulum into the cell proper then bind to HAC1 mRNA and cut a piece out of it. This edited mRNA encodes the instructions to build a protein that in turn boosts the expression of various components—including the appropriate molecular chaperones—that are needed to alleviate the stress caused by an excess of unfolded proteins.
Within clusters, individual Ire1 proteins interact through the portions of the protein found on the inside of the endoplasmic reticulum. Now, van Anken et al. show that these interactions are sufficient for forming and maintaining clusters. The interactions between the portions of the Ire1 proteins outside of the endoplasmic reticulum are needed for editing the HAC1 mRNA but not for forming and maintaining the clusters or for recruiting the HAC1 mRNA molecule to bind to Ire1. Instead, van Anken et al. discovered an mRNA binding site on the Ire1 clusters, which is separate from the part of the Ire1 protein that cuts the mRNA molecules. The Ire1 protein needs to first bind the HAC1 mRNA molecule at this binding site before it can cut it; van Anken et al. suggest that this two-step process helps ensure accurate and efficient editing of the HAC1 mRNA by Ire1. This process could also help to minimize the chance of other mRNA molecules being edited by mistake.
It will be of interest to investigate if similar safety measures are key for endoplasmic reticulum stress signaling mechanisms in humans, and whether these newly discovered steps can be targeted by drugs to treat disease.
stress signaling; endoplasmic reticulum; unfolded protein response; mRNA targeting; mRNA processing; S. cerevisiae
The unfolded protein response (UPR) in eukaryotes upregulates factors that restore ER homeostasis upon protein folding stress and in yeast is activated by non-conventional splicing of the HAC1 mRNA. The spliced HAC1 mRNA encodes an active transcription factor that binds to UPR-responsive elements in the promoters of UPR target genes. Overexpression of the HAC1 gene of S. cerevisiae can reportedly lead to increased production of heterologous proteins. To further such studies in the biotechnologically favored yeast Pichia pastoris, we cloned and characterized the P. pastoris HAC1 gene and the splice event.
We identified the HAC1 homologue of P. pastoris and its splice sites. Surprisingly, we could not find evidence for the non-spliced HAC1 mRNA when P. pastoris was cultivated in a standard growth medium without any endoplasmic reticulum stress inducers, indicating that the UPR is constitutively active to some extent in this organism. After identification of the sequence encoding active Hac1p we evaluated the effect of its overexpression in Pichia. The KAR2 UPR-responsive gene was strongly upregulated. Electron microscopy revealed an expansion of the intracellular membranes in Hac1p-overexpressing strains. We then evaluated the effect of inducible and constitutive UPR induction on the production of secreted, surface displayed and membrane proteins. Wherever Hac1p overexpression affected heterologous protein expression levels, this effect was always stronger when Hac1p expression was inducible rather than constitutive. Depending on the heterologous protein, co-expression of Hac1p increased, decreased or had no effect on expression level. Moreover, α-mating factor prepro signal processing of a G-protein coupled receptor was more efficient with Hac1p overexpression, resulting in significantly improved homogeneity.
Overexpression of P. pastoris Hac1p can be used to increase the production of heterologous proteins but needs to be evaluated on a case by case basis. Inducible Hac1p expression is more effective than constitutive expression. Correct processing and thus homogeneity of proteins that are difficult to express, such as GPCRs, can be increased by co-expression with Hac1p.
Human artificial chromosome (HAC)-based vectors represent an alternative technology for gene delivery and expression with a potential to overcome the problems caused by the use of viral-based vectors. The recently developed alphoidtetO-HAC has an advantage over other HAC vectors because it can be easily eliminated from cells by inactivation of the HAC kinetochore via binding of tTS chromatin modifiers to its centromeric tetO sequences. This provides unique control for phenotypes induced by genes loaded into the alphoidtetO-HAC. However, inactivation of the HAC kinetochore requires transfection of cells by a retrovirus vector, a step that is potentially mutagenic. Here, we describe an approach to re-engineering the alphoidtetO-HAC that allows verification of phenotypic changes attributed to expression of genes from the HAC without a transfection step. In the new HAC vector, a tTS-EYFP cassette is inserted into a gene-loading site along with a gene of interest. Expression of the tTS generates a self-regulating fluctuating heterochromatin on the alphoidtetO-HAC that induces fast silencing of the genes on the HAC without significant effects on HAC segregation. This silencing of the HAC-encoded genes can be readily recovered by adding doxycycline. The newly modified alphoidtetO-HAC-based system has multiple applications in gene function studies.
Several human adrenocortical cell lines have been used as model systems for aldosterone production. However, these cell lines have not been directly compared with each other.
Human adrenal cell lines SW13, CAR47, the NCI-H295 and its sub-strains and sub-clones were compared with regard to aldosterone production and aldosterone synthase (CYP11B2) expression. Culture media was collected 48 h after incubation, aldosterone secretion was measured and the data were normalized to the amount of cell protein. RNA was isolated for microarray analysis and quantitative RT-PCR (qPCR). The cell lines with the highest aldosterone production were further tested with regard to angiotensin II (Ang II) stimulation.
Neither aldosterone nor CYP11B2 transcript was detected in SW13 or CAR47 cells. Aldosterone production by the NCI-H295, H295A, H295R-S1, H295R-S2, H295R-S3, HAC13, HAC15 and HAC50 cells was 119, 1, 6, 826, 18, 139, 412, and 1334 pmol/mg protein/48 h, respectively. H295A and H295R-S1 expressed less CYP11B2 than the commonly used H295R-S3 cells, while NCI-H295, H295R-S2, HAC13, HAC15 and HAC50 expressed 24-, 14-, 3-, 10- and 35-fold higher CYP11B2 than the H295R-S3 cells. When treated with Ang II, NCI-H295, H295R-S2, HAC13, HAC15 and HAC50 showed significantly higher aldosterone production than the basal level (p<0.05).
A comparison of the available human adrenal cell lines indicates that the H295R-S2 and the clonal cell lines, HAC13, HAC15 and HAC50 produced the highest levels of aldosterone and responded well to Ang II.
Aldosterone; adrenocortical carcinoma; steroidogenesis
Human artificial chromosomes (HACs) are vectors that offer advantages of capacity and stability for gene delivery and expression. Several studies have even demonstrated their use for gene complementation in gene-deficient recipient cell lines and animal transgenesis. Recently, we constructed an advanced HAC-based vector, alphoidtetO-HAC, with a conditional centromere. In this HAC, a gene-loading site was inserted into a centrochromatin domain critical for kinetochore assembly and maintenance. While by definition this domain is permissive for transcription, there have been no long-term studies on transgene expression within centrochromatin. In this study, we compared the effects of three chromatin insulators, cHS4, gamma-satellite DNA, and tDNA, on the expression of an EGFP transgene inserted into the alphoidtetO-HAC vector. Insulator function was essential for stable expression of the transgene in centrochromatin. In two analyzed host cell lines, a tDNA insulator composed of two functional copies of tRNA genes showed the highest barrier activity. We infer that proximity to centrochromatin does not protect genes lacking chromatin insulators from epigenetic silencing. Barrier elements that prevent gene silencing in centrochromatin would thus help to optimize transgenesis using HAC vectors.
Electronic supplementary material
The online version of this article (doi:10.1007/s00018-013-1362-9) contains supplementary material, which is available to authorized users.
Insulator; tDNA-gamma-satellite; cHS4; Human artificial chromosome-HAC
Transcription factor (TF)-DNA binding loci are explored by analyzing massive datasets generated with Chromatin Immuno-Precipitation (ChIP)-based high-throughput sequencing technologies. These datasets suffer from a bias in the information about binding loci availability, sample incompleteness and diverse sources of technical and biological noise. Therefore, adequate mathematical models of ChIP-based high-throughput assay(s) and statistical tools are required for a robust identification of specific and reliable TF binding sites (TFBS), a precise characterization of the TFBS avidity distribution and a plausible estimation of the total number of specific TFBS for a given TF in the genome for a given cell type.
We developed an exploratory mixture probabilistic model for specific and non-specific transcription factor-DNA (TF-DNA) binding. Within ChIP-seq data sets, the statistics of specific and non-specific DNA-protein binding are defined by a mixture of sample size-dependent skewed functions described by the Kolmogorov-Waring (K-W) function (Kuznetsov, 2003) and an exponential function, respectively. Using available ChIP-seq data for eleven TFs essential for self-maintenance and differentiation of mouse embryonic stem cells (SC) (Nanog, Oct4, Sox2, Klf4, STAT3, E2F1, Tcfcp2l1, ZFX, n-Myc, c-Myc and Esrrb) reported in Chen et al (2008), we estimated (i) the specificity and the sensitivity of the ChIP-seq binding assays and (ii) the number of specific binding sites (BSs) in the genome of mouse embryonic stem cells that were not identified in the current experiments. Motif finding analysis applied to the identified c-Myc TFBSs supports our results and allowed us to predict many novel c-Myc target genes.
We provide a novel methodology for estimating the specificity and the sensitivity of TF-DNA binding in massively parallel ChIP sequencing (ChIP-seq) binding assays. Goodness-of-fit analysis of K-W functions suggests that a large fraction of low- and moderate-avidity TFBSs cannot be identified by the ChIP-based methods. Thus the task of determining the binding sensitivity of a TF cannot yet be technically resolved by current ChIP-seq, compared to former experimental techniques. Considering our improvement in measuring the sensitivity and the specificity of the TFs obtained from the ChIP-seq data, the models of transcriptional regulatory networks in embryonic cells and other cell types derived from the given ChIP-seq data should be carefully revised.
Finding where transcription factors (TFs) bind to the DNA is of key importance to decipher gene regulation at a transcriptional level. Classically, computational prediction of TF binding sites (TFBSs) is based on basic position weight matrices (PWMs) which quantitatively score binding motifs based on the observed nucleotide patterns in a set of TFBSs for the corresponding TF. Such models make the strong assumption that each nucleotide participates independently in the corresponding DNA-protein interaction and do not account for flexible length motifs. We introduce transcription factor flexible models (TFFMs) to represent TF binding properties. Based on hidden Markov models, TFFMs are flexible, and can model both position interdependence within TFBSs and variable length motifs within a single dedicated framework. The availability of thousands of experimentally validated DNA-TF interaction sequences from ChIP-seq allows for the generation of models that perform as well as PWMs for stereotypical TFs and can improve performance for TFs with flexible binding characteristics. We present a new graphical representation of the motifs that conveys properties of position interdependence. TFFMs have been assessed on ChIP-seq data sets coming from the ENCODE project, revealing that they can perform better than both PWMs and the dinucleotide weight matrix extension in discriminating ChIP-seq from background sequences. Under the assumption that ChIP-seq signal values are correlated with the affinity of the TF-DNA binding, we find that TFFM scores correlate with ChIP-seq peak signals. Moreover, using available TF-DNA affinity measurements for the Max TF, we demonstrate that TFFMs constructed from ChIP-seq data correlate with published experimentally measured DNA-binding affinities. Finally, TFFMs allow for the straightforward computation of an integrated TF occupancy score across a sequence.
These results demonstrate the capacity of TFFMs to accurately model DNA-protein interactions, while providing a single unified framework suitable for the next generation of TFBS prediction.
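The independence assumption behind classical PWMs can be illustrated with a short sketch: the score of a window is simply the sum of per-position log-odds terms, so no term can depend on a neighbouring base. The example sites, the pseudocount and the uniform background below are hypothetical values chosen for illustration, and this is a generic PWM scorer rather than the TFFM method.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds matrix: one dict of base -> score per motif position.

    Assumes all sites have equal length and a uniform background.
    """
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in BASES
        })
    return pwm


def score_window(pwm, window):
    """Sum of per-position scores: each position contributes independently."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_hit(pwm, seq):
    """Return (offset, score) of the highest-scoring window on the given strand."""
    width = len(pwm)
    scored = [(score_window(pwm, seq[i:i + width]), i)
              for i in range(len(seq) - width + 1)]
    score, offset = max(scored)
    return offset, score


# Hypothetical toy sites resembling an E-box; not taken from any data set.
toy_sites = ["CACGTG", "CACGTG", "CACGTG", "CACGTA"]
toy_pwm = build_pwm(toy_sites)
```

Because `score_window` is a plain sum over positions, dependencies between adjacent nucleotides or variable spacing cannot be expressed in this representation, which is the limitation the abstract attributes to PWMs.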
Transcription factors are critical proteins for sequence-specific control of transcriptional regulation. Finding where these proteins bind to DNA is of key importance for global efforts to decipher the complex mechanisms of gene regulation. Greater understanding of the regulation of transcription promises to improve human genetic analysis by specifying critical gene components that have eluded investigators. Classically, computational prediction of transcription factor binding sites (TFBS) is based on models giving weights to each nucleotide at each position. We introduce a novel statistical model for the prediction of TFBS tolerant of a broader range of TFBS configurations than can be conveniently accommodated by existing methods. The new models are designed to address the confounding properties of nucleotide composition, inter-positional sequence dependence and variable lengths (e.g. variable spacing between half-sites) observed in the more comprehensive experimental data now emerging. The new models generate scores consistent with DNA-protein affinities measured experimentally and can be represented graphically, retaining desirable attributes of past methods. This demonstrates the capacity of the new approach to accurately assess DNA-protein interactions. With the rich experimental data generated from chromatin immunoprecipitation experiments, a greater diversity of TFBS properties has emerged that can now be accommodated within a single predictive approach.
Motivation: Modelling the regulation of gene expression can provide insight into the regulatory roles of individual transcription factors (TFs) and histone modifications. Ouyang et al. (2009) recently modelled gene expression levels in mouse embryonic stem (mES) cells using in vivo ChIP-seq measurements of TF binding. ChIP-seq TF binding data, however, are tissue-specific and relatively difficult to obtain. This limits the applicability of gene expression models that rely on ChIP-seq TF binding data.
Results: In this study, we build regression-based models that relate gene expression to the binding of 12 different TFs, 7 histone modifications and chromatin accessibility (DNase I hypersensitivity) in two different tissues. We find that expression models based on computationally predicted TF binding can achieve similar accuracy to those using in vivo TF binding data and that including binding at weak sites is critical for accurate prediction of gene expression. We also find that incorporating histone modification and chromatin accessibility data results in additional accuracy. Surprisingly, we find that models that use no TF binding data at all, but only histone modification and chromatin accessibility data, can be as (or more) accurate than those based on in vivo TF binding data.
Availability and implementation: All scripts, motifs and data presented in this article are available online at http://research.imb.uq.edu.au/t.bailey/supplementary_data/McLeay2011a.
Supplementary data are available at Bioinformatics online.
Motivation: Protein phosphorylation is critical for regulating cellular activities by controlling protein activities, localization and turnover, and by transmitting information within cells through signaling networks. However, predictions of protein phosphorylation and signaling networks remain a significant challenge, lagging behind predictions of transcriptional regulatory networks into which they often feed.
Results: We developed PhosphoChain to predict kinases, phosphatases and chains of phosphorylation events in signaling networks by combining mRNA expression levels of regulators and targets with a motif detection algorithm and optional prior information. PhosphoChain correctly reconstructed ∼78% of the yeast mitogen-activated protein kinase pathway from publicly available data. When tested on yeast phosphoproteomic data from large-scale mass spectrometry experiments, PhosphoChain correctly identified ∼27% more phosphorylation sites than existing motif detection tools (NetPhosYeast and GPS2.0), and predictions of kinase–phosphatase interactions overlapped with ∼59% of known interactions present in yeast databases. PhosphoChain provides a valuable framework for predicting condition-specific phosphorylation events from high-throughput data.
Availability: PhosphoChain is implemented in Java and available at http://virgo.csie.ncku.edu.tw/PhosphoChain/ or http://aitchisonlab.com/PhosphoChain
Supplementary data are available at Bioinformatics online.
HACS1 is a Src homology 3 and sterile alpha motif domain–containing adaptor that is preferentially expressed in normal hematopoietic tissues and malignancies including myeloid leukemia, lymphoma, and myeloma. Microarray data showed HACS1 expression is up-regulated in activated human B cells treated with interleukin (IL)-4, CD40L, and anti–immunoglobulin (Ig)M and clustered with genes involved in signaling, including TNF receptor–associated protein 1, signaling lymphocytic activation molecule, IL-6, and DEC205. Immunoblot analysis demonstrated that HACS1 is up-regulated by IL-4, IL-13, anti-IgM, and anti-CD40 in human peripheral blood B cells. In murine spleen B cells, Hacs1 can also be up-regulated by lipopolysaccharide but not IL-13. Induction of Hacs1 by IL-4 is dependent on Stat6 signaling and can also be impaired by inhibitors of phosphatidylinositol 3-kinase, protein kinase C, and nuclear factor κB. HACS1 associates with tyrosine-phosphorylated proteins after B cell activation and binds in vitro to the inhibitory molecule paired Ig-like receptor B. Overexpression of HACS1 in murine spleen B cells resulted in a down-regulation of the activation marker CD23 and enhancement of CD138 expression, IgM secretion, and Xbp-1 expression. Knock down of HACS1 in a human B lymphoma cell line by small interfering ribonucleic acid did not significantly change IL-4–stimulated B cell proliferation. Our study demonstrates that HACS1 is up-regulated by B cell activation signals and is a participant in B cell activation and differentiation.
B lymphocytes; interleukin-4; signaling; gene expression; adaptor protein
Mapping genome-wide binding sites of all transcription factors (TFs) in all biological contexts is a critical step toward understanding gene regulation. The state-of-the-art technologies for mapping transcription factor binding sites (TFBSs) couple chromatin immunoprecipitation (ChIP) with high-throughput sequencing (ChIP-seq) or tiling array hybridization (ChIP-chip). These technologies have limitations: they are low-throughput with respect to surveying many TFs. Recent advances in genome-wide chromatin profiling, including development of technologies such as DNase-seq, FAIRE-seq and ChIP-seq for histone modifications, make it possible to predict in vivo TFBSs by analyzing chromatin features at computationally determined DNA motif sites. This promising new approach may allow researchers to monitor the genome-wide binding sites of many TFs simultaneously. In this article, we discuss various experimental design and data analysis issues that arise when applying this approach. Through a systematic analysis of the data from the Encyclopedia Of DNA Elements (ENCODE) project, we compare the predictive power of individual and combinations of chromatin marks using supervised and unsupervised learning methods, and evaluate the value of integrating information from public ChIP and gene expression data. We also highlight the challenges and opportunities for developing novel analytical methods, such as resolving the one-motif-multiple-TF ambiguity and distinguishing functional and non-functional TF binding targets from the predicted binding sites.
Electronic Supplementary Material
The online version of this article (doi:10.1007/s12561-012-9066-5) contains supplementary material, which is available to authorized users.
Transcription factor binding sites; DNase-seq; ChIP-seq; FAIRE-seq; Next-generation sequencing; Motif
Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is rapidly replacing chromatin immunoprecipitation combined with genome-wide tiling array analysis (ChIP-chip) as the preferred approach for mapping transcription-factor binding sites and chromatin modifications. The state of the art for analyzing ChIP-seq data relies on using only reads that map uniquely to a relevant reference genome (uni-reads). This can lead to the omission of up to 30% of alignable reads. We describe a general approach for utilizing reads that map to multiple locations on the reference genome (multi-reads). Our approach is based on allocating multi-reads as fractional counts using a weighted alignment scheme. Using human STAT1 and mouse GATA1 ChIP-seq datasets, we illustrate that incorporation of multi-reads significantly increases sequencing depths, leads to detection of novel peaks that are not otherwise identifiable with uni-reads, and improves detection of peaks in mappable regions. We investigate various genome-wide characteristics of peaks detected only by utilization of multi-reads via computational experiments. Overall, peaks from multi-read analysis have similar characteristics to peaks that are identified by uni-reads except that the majority of them reside in segmental duplications. We further validate a number of GATA1 multi-read only peaks by independent quantitative real-time ChIP analysis and identify novel target genes of GATA1. These computational and experimental results establish that multi-reads can be of critical importance for studying transcription factor binding in highly repetitive regions of genomes with ChIP-seq experiments.
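The idea of allocating multi-reads as fractional counts can be sketched as follows. The abstract does not specify the weighting scheme, so this sketch assumes one simple choice: each multi-read is split across its candidate loci in proportion to the uni-read counts already observed there, plus a pseudocount. The function name, the locus labels and the pseudocount are all assumptions for illustration.

```python
def allocate_multireads(uni_counts, multi_reads, pseudocount=1.0):
    """Split each multi-read across its candidate loci.

    uni_counts:  dict mapping locus -> number of uniquely mapped reads.
    multi_reads: list of lists; each inner list holds the candidate loci
                 of one multi-read.
    Returns a dict mapping locus -> uni-read count plus fractional
    multi-read counts. Weights are proportional to local uni-read
    support (an assumed scheme, not necessarily the authors').
    """
    totals = {locus: float(n) for locus, n in uni_counts.items()}
    for candidates in multi_reads:
        weights = [uni_counts.get(locus, 0) + pseudocount for locus in candidates]
        norm = sum(weights)
        for locus, weight in zip(candidates, weights):
            # Each multi-read contributes exactly one read in total.
            totals[locus] = totals.get(locus, 0.0) + weight / norm
    return totals
```

With this scheme every multi-read still adds exactly one count to the data set overall, so sequencing depth increases by the number of multi-reads, while loci with stronger uni-read support receive the larger share.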
Annotating repetitive regions of genomes experimentally is a challenging task. Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) provides valuable data for characterizing repetitive regions of genomes in terms of transcription factor binding. Although ChIP-seq technology has been maturing, available ChIP-seq analysis methods and software rely on discarding sequence reads that map to multiple locations on the reference genome (multi-reads), thereby missing an opportunity to assess transcription factor binding to highly repetitive regions of genomes. We develop a computational algorithm that takes multi-reads into account in ChIP-seq analysis. We show with computational experiments that multi-reads lead to a significant increase in sequencing depths and to identification of binding regions that are otherwise not identifiable when only reads that uniquely map to the reference genome (uni-reads) are used. In particular, we show that the number of binding regions identified can increase by up to 36%. We support our computational predictions with independent quantitative real-time ChIP validation of binding regions identified only when multi-reads are incorporated in the analysis of a mouse GATA1 ChIP-seq experiment.
DNA sequence and local chromatin landscape act jointly to determine transcription factor (TF) binding intensity profiles. To disentangle these influences, we developed an experimental approach, called protein/DNA binding followed by high-throughput sequencing (PB–seq), that allows the binding energy landscape to be characterized genome-wide in the absence of chromatin. We applied our methods to the Drosophila Heat Shock Factor (HSF), which inducibly binds a target DNA sequence element (HSE) following heat shock stress. PB–seq involves incubating sheared naked genomic DNA with recombinant HSF, partitioning the HSF–bound and HSF–free DNA, and then detecting HSF–bound DNA by high-throughput sequencing. We compared PB–seq binding profiles with ones observed in vivo by ChIP–seq and developed statistical models to predict the observed departures from idealized binding patterns based on covariates describing the local chromatin environment. We found that DNase I hypersensitivity and tetra-acetylation of H4 were the most influential covariates in predicting changes in HSF binding affinity. We also investigated the extent to which DNA accessibility, as measured by digital DNase I footprinting data, could be predicted from MNase–seq data and the ChIP–chip profiles for many histone modifications and TFs, and found GAGA element associated factor (GAF), tetra-acetylation of H4, and H4K16 acetylation to be the most predictive covariates. Lastly, we generated an unbiased model of HSF binding sequences, which revealed distinct biophysical properties of the HSF/HSE interaction and a previously unrecognized substructure within the HSE. These findings provide new insights into the interplay between the genomic sequence and the chromatin landscape in determining transcription factor binding intensity.
Transcription factors (TFs) bind DNA to modulate levels of gene expression. TF binding sites change throughout development, in response to environmental stimuli, and different tissues have distinct TF binding profiles. The mechanism by which TFs discriminate between binding sites in a context dependent manner is an area of active research, but it is clear that the chromatin environment in which potential binding sites reside strongly influences binding. This study used the Heat Shock TF (HSF) to study the effect chromatin has upon induced HSF binding. We implemented an experimental technique to quantify all potential HSF binding sites in the genome. These data were incorporated into a modeling framework along with chromatin landscape information prior to HSF binding to accurately predict the intensities of inducible HSF binding sites. DNase I hypersensitivity and tetra-acetylation of H4 were the most influential covariates in the model. The binding data enabled the development of a more complete HSF/DNA interaction model, providing insight into the biophysical interaction of HSF trimer subunits and target DNA pentamers.
The global effort to annotate the non-coding portion of the human genome relies heavily on chromatin immunoprecipitation data generated with high-throughput DNA sequencing (ChIP-seq). ChIP-seq is generally successful in detailing the segments of the genome bound by the immunoprecipitated transcription factor (TF); however, almost all datasets contain genomic regions devoid of the canonical motif for the TF. It remains to be determined whether these regions are related to the immunoprecipitated TF or whether, despite the use of controls, a portion of the peaks can be attributed to other causes.
Analyses across hundreds of ChIP-seq datasets generated for sequence-specific DNA binding TFs reveal a small set of TF binding profiles for which predicted TF binding site motifs are repeatedly observed to be significantly enriched. Grouping related binding profiles, the set includes: CTCF-like, ETS-like, JUN-like, and THAP11 profiles. These frequently enriched profiles are termed ‘zingers’ to highlight their unanticipated enrichment in datasets for which they were not the targeted TF, and their potential impact on the interpretation and analysis of TF ChIP-seq data. Peaks with zinger motifs and lacking the ChIPped TF’s motif are observed to compose up to 45% of a ChIP-seq dataset. There is substantial overlap of zinger motif containing regions between diverse TF datasets, suggesting a mechanism that is not TF-specific for the recovery of these regions.
Based on the zinger regions' proximity to cohesin-bound segments, a loading station model is proposed. Further study of zingers will advance understanding of gene regulation.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-014-0412-4) contains supplementary material, which is available to authorized users.
Chromatin immunoprecipitation (ChIP) followed by next-generation sequencing (ChIP-Seq) has been widely used to identify genomic loci of transcription factor (TF) binding and histone modifications. ChIP-Seq data analysis involves multiple steps from read mapping and peak calling to data integration and interpretation. It remains challenging and time-consuming to process large amounts of ChIP-Seq data derived from different antibodies or experimental designs using the same approach. To address this challenge, there is a need for a comprehensive analysis pipeline with flexible settings to accelerate the utilization of this powerful technology in epigenetics research.
We have developed a highly integrative pipeline, termed HiChIP, for systematic analysis of ChIP-Seq data. HiChIP incorporates several open source software packages selected based on internal assessments and published comparisons. It also includes a set of tools developed in-house. This workflow enables the analysis of both paired-end and single-end ChIP-Seq reads, with or without replicates, for the characterization and annotation of both punctate and diffuse binding sites. The main functionality of HiChIP includes: (a) read quality checking; (b) read mapping and filtering; (c) peak calling and peak consistency analysis; and (d) result visualization. In addition, this pipeline contains modules for generating binding profiles over selected genomic features, de novo motif finding from transcription factor (TF) binding sites, and functional annotation of peak-associated genes.
HiChIP is a comprehensive analysis pipeline that can be configured to analyze ChIP-Seq data derived from varying antibodies and experiment designs. Using public ChIP-Seq data we demonstrate that HiChIP is a fast and reliable pipeline for processing large amounts of ChIP-Seq data.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2105-15-280) contains supplementary material, which is available to authorized users.
ChIP-Seq; Next-generation sequencing; Peak calling; Duplicate filtering; Irreproducible discovery rate
Chromatin immunoprecipitation (ChIP) coupled to high-throughput sequencing (ChIP-Seq) techniques can reveal DNA regions bound by transcription factors (TF). Analysis of the ChIP-Seq regions is now a central component in gene regulation studies. The need remains strong for methods to improve the interpretation of ChIP-Seq data and the study of specific TF binding sites (TFBS).
We introduce a set of methods to improve the interpretation of ChIP-Seq data, including the inference of mediating TFs based on TFBS motif over-representation analysis and the subsequent study of the spatial distribution of TFBSs. TFBS over-representation analysis applied to ChIP-Seq data is used to detect which TFBSs arise more frequently than expected by chance. Visualization of over-representation analysis results with new composition-bias plots reveals systematic bias in over-representation scores. We introduce the BiasAway background generating software to resolve the problem. A heuristic procedure based on topological motif enrichment relative to the ChIP-Seq peaks’ local maxima highlights peaks likely to be directly bound by a TF of interest. The results suggest that on average two-thirds of a ChIP-Seq dataset’s peaks are bound by the ChIP’d TF; the origin of the remaining peaks is undetermined. Additional visualization methods allow for the study of both inter-TFBS spatial relationships and motif-flanking sequence properties, as demonstrated in case studies for TBP and ZNF143/THAP11.
Topological properties of TFBS within ChIP-Seq datasets can be harnessed to better interpret regulatory sequences. Using GC content corrected TFBS over-representation analysis, combined with visualization techniques and analysis of the topological distribution of TFBS, we can distinguish peaks likely to be directly bound by a TF. The new methods will empower researchers for exploration of gene regulation and TF binding.
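To make the notion of a TFBS arising "more frequently than expected by chance" concrete, the sketch below computes an upper-tail hypergeometric p-value for motif hits in a foreground (peak) set against a background set. This test is a common choice for over-representation analysis and is used here as an assumption; it is not claimed to be the statistic used in the work summarized above. The function name and the example counts are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(fg_hits, fg_total, bg_hits, bg_total):
    """Upper-tail hypergeometric p-value for motif over-representation.

    Returns the probability of observing at least `fg_hits` motif-containing
    sequences among `fg_total` foreground sequences when the motif-containing
    sequences are spread at random over foreground plus background.
    """
    n_all = fg_total + bg_total   # all sequences
    k_all = fg_hits + bg_hits     # all sequences with a motif hit
    denom = comb(n_all, fg_total)
    upper = min(k_all, fg_total)
    tail = sum(
        comb(k_all, k) * comb(n_all - k_all, fg_total - k)
        for k in range(fg_hits, upper + 1)
    )
    return tail / denom

# Invented counts: 3 of 3 peaks contain the motif, 0 of 3 background sequences do.
p_value = hypergeom_enrichment_p(3, 3, 0, 3)
```

The choice of background matters as much as the test: if the background differs from the peaks in GC content, motifs with a matching composition look enriched for that reason alone, which is the bias that a composition-matched background is meant to remove.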
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-15-472) contains supplementary material, which is available to authorized users.
Chromatin immunoprecipitation; ChIP-Seq; Motif prediction; Over-representation analysis; Regulation; Sequence analysis; Transcription factor; Transcription factor binding site; Visualization
We have used a human artificial chromosome (HAC) to manipulate the epigenetic state of chromatin within an active kinetochore. The HAC has a dimeric α-satellite repeat containing one natural monomer with a CENP-B binding site, and one completely artificial synthetic monomer with the CENP-B box replaced by a tetracycline operator (tetO). This HAC exhibits normal kinetochore protein composition and mitotic stability. Targeting of several tet-repressor (tetR) fusions into the centromere had no effect on kinetochore function. However, altering the chromatin state to a more open configuration with the tTA transcriptional activator or to a more closed state with the tTS transcription silencer caused missegregation and loss of the HAC. tTS binding caused the loss of CENP-A, CENP-B, CENP-C, and H3K4me2 from the centromere accompanied by an accumulation of histone H3K9me3. Our results reveal that a dynamic balance between centromeric chromatin and heterochromatin is essential for vertebrate kinetochore activity.
Motivation: Chromatin immunoprecipitation (ChIP) followed by array hybridization, or ChIP-chip, is a powerful approach for identifying transcription factor binding sites (TFBS) and has been widely used. Recently, massively parallel sequencing coupled with ChIP experiments (ChIP-seq) has been increasingly used as an alternative to ChIP-chip, offering cost-effective genome-wide coverage and resolution up to a single base pair. For many well-studied TFs, both ChIP-seq and ChIP-chip experiments have been applied and their data are publicly available. Previous analyses have revealed substantial technology-specific binding signals despite strong correlation between the two sets of results. Therefore, it is of interest to see whether the two data sources can be combined to enhance the detection of TFBS.
Results: In this work, a hierarchical hidden Markov model (HHMM) is proposed for combining data from ChIP-seq and ChIP-chip. In the HHMM, inference results from individual HMMs in ChIP-seq and ChIP-chip experiments are summarized by a higher-level HMM. Simulation studies show the advantage of the HHMM when data from both technologies co-exist. Analysis of two well-studied TFs, NRSF and CCCTC-binding factor (CTCF), also suggests that the HHMM yields improved TFBS identification in comparison to analyses using individual data sources or a simple merger of the two.
Availability: Source code for the software ChIPmeta is freely available for download at http://www.umich.edu/~hwchoi/HHMMsoftware.zip, implemented in C and supported on Linux.
Supplementary information: Supplementary data are available at Bioinformatics online.
Direct binding by a transcription factor (TF) to the proximal promoter of a gene is strong evidence that the TF regulates the gene. Assaying the genome-wide binding of every TF in every cell type and condition is currently impractical. Histone modifications correlate with tissue/cell/condition-specific (‘tissue specific’) TF binding, so histone ChIP-seq data can be combined with traditional position weight matrix (PWM) methods to make tissue-specific predictions of TF–promoter interactions.
Results: We use supervised learning to train a naïve Bayes predictor of TF–promoter binding. The predictor's features are the histone modification levels and a PWM-based score for the promoter. Training and testing use sets of promoters labeled using TF ChIP-seq data, and we use cross-validation on 23 such datasets to measure the accuracy. A PWM+histone naïve Bayes predictor using a single histone modification (H3K4me3) is substantially more accurate than a PWM score or a conservation-based score (phylogenetic motif model). The naïve Bayes predictor is more accurate (on average) at all sensitivity levels, and makes only half as many false positive predictions at sensitivity levels from 10% to 80%. On average, it correctly predicts 80% of bound promoters at a false positive rate of 20%. Accuracy does not diminish when we test the predictor in a different cell type (and species) from training. Accuracy is barely diminished even when we train the predictor without using TF ChIP-seq data.
Availability: Our tissue-specific predictor of promoters bound by a TF is called Dr Gene and is available at http://bioinformatics.org.au/drgene.
Supplementary information: Supplementary data are available at Bioinformatics online.
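The PWM+histone naïve Bayes predictor described above can be sketched as a two-feature Gaussian naïve Bayes classifier. This is a minimal illustration under stated assumptions: the Gaussian likelihood, the variance floor, the function names, and the toy promoters are all hypothetical and are not taken from Dr Gene.

```python
import math

def fit_gaussian_nb(X, y):
    """Estimate a class prior plus per-feature mean and variance for each class."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Small variance floor (an assumption) keeps the likelihood finite.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-6
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(y), means, variances)
    return model

def log_posterior(model, x):
    """Unnormalized log posterior of each class, treating features as independent."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for v, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        scores[label] = score
    return scores

def predict(model, x):
    scores = log_posterior(model, x)
    return max(scores, key=scores.get)

# Toy promoters (invented): [PWM score, H3K4me3 level]; label 1 = bound by the TF.
X = [[8.1, 5.0], [7.5, 4.2], [7.9, 4.8], [7.8, 0.3], [3.0, 0.5], [2.5, 4.5]]
y = [1, 1, 1, 0, 0, 0]
model = fit_gaussian_nb(X, y)
```

In this toy data the fourth promoter has a strong PWM score but little H3K4me3 and is unbound, which is the kind of case where a PWM score alone would produce a false positive.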
By phenotypic complementation, we cloned a novel Saccharomyces cerevisiae multicopy suppressor of the Schizosaccharomyces pombe cdc10-129 mutant, which we call HAC1, an acronym of 'homologous to ATF/CREB 1'. It encodes a bZIP (basic-leucine zipper) protein of 230 amino acids with close homology to the mammalian ATF/CREB transcription factor, and gel-retardation assays showed that it binds specifically to the CRE motif. HAC1 is not essential for viability. However, the hac1 disruptant becomes caffeine sensitive, which is suppressed by multicopy expression of the yeast PDE2 (Phosphodiesterase 2) gene. Although the mRNA level of HAC1 is almost constitutive throughout the cell cycle, it fluctuates during meiosis. The upstream region of the HAC1 gene contains a T4C site, a URS (upstream repression sequence) and a TR (T-rich) box-like sequence, which reside upstream of many meiotic genes. These results suggest that HAC1 may also be one of the meiotic genes.
Despite explosive growth in genomic datasets, the methods for studying epigenomic mechanisms of gene regulation remain primitive. Here we present a model-based approach to systematically analyze the epigenomic functions in modulating transcription factor-DNA binding. Based on the first principles of statistical mechanics, this model considers the interactions between epigenomic modifications and a cis-regulatory module, which contains multiple binding sites arranged in any configuration. We compiled a comprehensive epigenomic dataset in mouse embryonic stem (mES) cells, including DNA methylation (MeDIP-seq and MRE-seq), DNA hydroxymethylation (5-hmC-seq), and histone modifications (ChIP-seq). We discovered correlations of transcription factors (TFs) for specific combinations of epigenomic modifications, which we term epigenomic motifs. Epigenomic motifs explained why some TFs appeared to have different DNA binding motifs derived from in vivo (ChIP-seq) and in vitro experiments. Theoretical analyses suggested that the epigenome can modulate transcriptional noise and boost the cooperativity of weak TF binding sites. ChIP-seq data suggested that epigenomic boost of binding affinities in weak TF binding sites can function in mES cells. We showed in theory that the epigenome should suppress the TF binding differences on SNP-containing binding sites in two people. Using personal data, we identified strong associations between H3K4me2/H3K9ac and the degree of personal differences in NFκB binding in SNP-containing binding sites, which may explain why some SNPs introduce much smaller personal variations on TF binding than other SNPs. In summary, this model presents a powerful approach to analyze the functions of epigenomic modifications. This model was implemented into an open source program APEG (Affinity Prediction by Epigenome and Genome, http://systemsbio.ucsd.edu/apeg).
We developed a model-based approach to systematically analyze the epigenomic functions in modulating transcription factor-DNA binding. We postulated the existence of TF-specific epigenomic motifs, which could explain why some TFs appeared to have different DNA binding motifs derived from in vivo and in vitro experiments. The theoretical results suggested that the epigenome can modulate transcriptional noise and boost the cooperativity of weak TF binding sites. A preliminary analysis of the existing data suggested that epigenomic boost of binding affinities in weak TF binding sites could be a widespread regulatory mechanism in mES cells. Moreover, using personal data, we identified strong associations between H3K4me2/H3K9ac and the degree of individual differences in NFκB binding in SNP-containing binding sites, suggesting that the theoretical mechanism by which the epigenome attenuates TF binding differences at SNP-containing binding sites between two individuals may help link genomic variation to phenotypic variation. Thus, this model presents a powerful approach to analyze the functions of epigenomic modifications.
Pressure ulcers (PU) are considered harmful conditions that are reasonably preventable if accepted standards of care are followed. They became subject to the payment adjustment for hospital-acquired conditions (HACs) beginning October 1, 2008. We examined several aspects of the accuracy of coding for pressure ulcers under the Medicare Hospital-Acquired Condition Present on Admission (HAC–POA) Program. We used the “4010” claim format as a basis of reference to show some of the issues of the old format, such as the underreporting of pressure ulcer stages on pressure ulcer claims and how the underreporting varied by hospital characteristics. We then used the rate of Stage III and IV pressure ulcer HACs reported in the Healthcare Cost and Utilization Project State Inpatient Databases data to look at the sensitivity of PU HAC–POA coding to the number of diagnosis fields.
We examined Medicare claims data for FYs 2009 and 2010 to determine the degree to which stage codes were underreported on pressure ulcer claims. We selected all claims with a secondary diagnosis code of pressure ulcer site (ICD-9 diagnosis codes 707.00–707.09) that were not reported as POA (POA of “N” or “U”). We then created a binary indicator for the presence of any pressure ulcer stage diagnosis code. We examined the percentage of claims with a diagnosis of a pressure ulcer site code with no accompanying pressure ulcer stage code.
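The selection and indicator steps described above can be sketched in a few lines. The claim layout (a list of code and POA-flag pairs with the principal diagnosis first), the function name, and the example claims are hypothetical; the stage code range 707.20–707.25 is an assumption about the ICD-9 stage codes and is not stated in the text.

```python
# Pressure ulcer site codes 707.00-707.09, as given in the text.
SITE_CODES = {f"707.0{i}" for i in range(10)}
# Assumed ICD-9 pressure ulcer stage codes 707.20-707.25 (not stated in the text).
STAGE_CODES = {f"707.2{i}" for i in range(6)}

def share_missing_stage(claims):
    """Fraction of selected claims that carry no pressure ulcer stage code.

    A claim is selected when a secondary diagnosis is a pressure ulcer site code
    that was not reported as present on admission (POA flag "N" or "U").
    """
    selected = 0
    missing = 0
    for claim in claims:
        secondary = claim["diagnoses"][1:]  # skip the principal diagnosis
        has_site = any(
            code in SITE_CODES and poa in {"N", "U"} for code, poa in secondary
        )
        if not has_site:
            continue
        selected += 1
        # Binary indicator: does any diagnosis field hold a stage code?
        has_stage = any(code in STAGE_CODES for code, _ in claim["diagnoses"])
        if not has_stage:
            missing += 1
    return missing / selected if selected else float("nan")

# Invented claims: (diagnosis code, POA flag), principal diagnosis listed first.
example_claims = [
    {"diagnoses": [("486", "Y"), ("707.03", "N")]},                   # site, no stage
    {"diagnoses": [("486", "Y"), ("707.05", "U"), ("707.23", "N")]},  # site and stage
    {"diagnoses": [("486", "Y"), ("707.03", "Y")]},                   # POA, not selected
]
missing_share = share_missing_stage(example_claims)
```

The number of diagnosis fields matters in such a calculation: if a claim format truncates the diagnosis list, a stage code recorded late in the list is lost and the claim is counted as missing a stage.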
Our results point to underreporting of PU stages under the “4010” format and show that the reporting of stage codes varied across hospital type and location. Further, our results indicate that under the “5010” format, a higher number of pressure ulcer HACs can be expected to be reported, and a larger percentage of pressure ulcers can be expected to be incorrectly coded as POA.
The combination of the capture of 25 diagnosis codes under the new “5010” format and the change from ICD-9 to ICD-10 will likely alleviate the observed underreporting of pressure ulcer HACs. However, as long as coding guidelines direct that Stage III and IV pressure ulcers be coded as POA, if a lower stage pressure ulcer was POA and progressed to a higher stage pressure ulcer during the admission, the acquisition of Stage III and IV pressure ulcers in the hospital will be underreported.
Hospital Acquired Conditions; Pressure Ulcer Coding; Health policy; politics; law; regulation; Medicare
Identifying master regulators of biological processes and mapping their downstream gene networks are key challenges in systems biology. We developed a computational method, called iRegulon, to reverse-engineer the transcriptional regulatory network underlying a co-expressed gene set using cis-regulatory sequence analysis. iRegulon implements a genome-wide ranking-and-recovery approach to detect enriched transcription factor motifs and their optimal sets of direct targets. We increase the accuracy of network inference by using very large motif collections of up to ten thousand position weight matrices collected from various species, and linking these to candidate human TFs via a motif2TF procedure. We validate iRegulon on gene sets derived from ENCODE ChIP-seq data with increasing levels of noise, and we compare iRegulon with existing motif discovery methods. Next, we use iRegulon on more challenging types of gene lists, including microRNA target sets, protein-protein interaction networks, and genetic perturbation data. In particular, we over-activate p53 in breast cancer cells, followed by RNA-seq and ChIP-seq, and could identify an extensive up-regulated network controlled directly by p53. Similarly we map a repressive network with no indication of direct p53 regulation but rather an indirect effect via E2F and NFY. Finally, we generalize our computational framework to include regulatory tracks such as ChIP-seq data and show how motif and track discovery can be combined to map functional regulatory interactions among co-expressed genes. iRegulon is available as a Cytoscape plugin from http://iregulon.aertslab.org.
Gene regulatory networks control developmental, homeostatic, and disease processes by governing precise levels and spatio-temporal patterns of gene expression. Determining their topology can provide mechanistic insight into these processes. Gene regulatory networks consist of interactions between transcription factors and their direct target genes. Each regulatory interaction represents the binding of the transcription factor to a specific DNA binding site near its target gene. Here we present a computational method, called iRegulon, to identify master regulators and direct target genes in a human gene signature, i.e. a set of co-expressed genes. iRegulon relies on the analysis of the regulatory sequences around each gene in the gene set to detect enriched TF motifs or ChIP-seq peaks, using databases of nearly 10,000 TF motifs and 1000 ChIP-seq data sets or “tracks”. Next, it associates enriched motifs and tracks with candidate transcription factors and determines the optimal subset of direct target genes. We validate iRegulon on ENCODE data, and use it in combination with RNA-seq and ChIP-seq data to map a p53 downstream network with new predicted co-factors and targets. iRegulon is available as a Cytoscape plugin, supporting human, mouse, and Drosophila genes, and provides access to hundreds of cancer-related TF-target subnetworks or “regulons”.
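A ranking-and-recovery analysis of the kind described above can be sketched as follows: genes are ranked genome-wide by their score for one motif, and the motif is judged by how early the genes of the input set are recovered along that ranking. The sketch summarizes recovery as a normalized area under the recovery curve. The function name, the default `top_fraction`, and the normalization are assumptions made for illustration and should not be read as iRegulon's actual implementation.

```python
def recovery_auc(ranking, gene_set, top_fraction=0.03):
    """Normalized area under the recovery curve for one motif.

    ranking:      genes ordered from best to worst motif score (genome-wide).
    gene_set:     the co-expressed genes whose recovery is being measured.
    top_fraction: share of the ranking that is inspected (assumed default).
    Returns 1.0 when every inspected position up to the set size is a set member.
    """
    cutoff = max(1, int(len(ranking) * top_fraction))
    targets = set(gene_set)
    recovered = 0
    area = 0
    for gene in ranking[:cutoff]:
        if gene in targets:
            recovered += 1
        area += recovered  # height of the recovery curve at this rank
    # Best achievable area: all set members sit at the very top of the ranking.
    best = sum(min(i, len(targets)) for i in range(1, cutoff + 1))
    return area / best if best else 0.0

# Toy ranking (invented gene names), inspected in full for clarity.
good_motif = recovery_auc(["g1", "g2", "g3", "g4"], {"g1", "g2"}, top_fraction=1.0)
poor_motif = recovery_auc(["g4", "g3", "g2", "g1"], {"g1", "g2"}, top_fraction=1.0)
```

Comparing this score across a large motif collection identifies the enriched motifs, and the rank at which recovery stops improving gives a natural cutoff for selecting the candidate direct targets.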