The unfolded protein response (UPR) in eukaryotes upregulates factors that restore ER homeostasis upon protein folding stress and in yeast is activated by a non-conventional splicing of the HAC1 mRNA. The spliced HAC1 mRNA encodes an active transcription factor that binds to UPR-responsive elements in the promoter of UPR target genes. Overexpression of the HAC1 gene of S. cerevisiae can reportedly lead to increased production of heterologous proteins. To further such studies in the biotechnology favored yeast Pichia pastoris, we cloned and characterized the P. pastoris HAC1 gene and the splice event.
We identified the HAC1 homologue of P. pastoris and its splice sites. Surprisingly, we could not find evidence for the non-spliced HAC1 mRNA when P. pastoris was cultivated in a standard growth medium without any endoplasmic reticulum stress inducers, indicating that the UPR is constitutively active to some extent in this organism. After identification of the sequence encoding active Hac1p we evaluated the effect of its overexpression in Pichia. The KAR2 UPR-responsive gene was strongly upregulated. Electron microscopy revealed an expansion of the intracellular membranes in Hac1p-overexpressing strains. We then evaluated the effect of inducible and constitutive UPR induction on the production of secreted, surface-displayed and membrane proteins. Wherever Hac1p overexpression affected heterologous protein expression levels, this effect was always stronger when Hac1p expression was inducible rather than constitutive. Depending on the heterologous protein, co-expression of Hac1p increased, decreased or had no effect on expression level. Moreover, α-mating factor prepro signal processing of a G-protein coupled receptor was more efficient with Hac1p overexpression, resulting in a significantly improved homogeneity.
Overexpression of P. pastoris Hac1p can be used to increase the production of heterologous proteins but needs to be evaluated on a case by case basis. Inducible Hac1p expression is more effective than constitutive expression. Correct processing and thus homogeneity of proteins that are difficult to express, such as GPCRs, can be increased by co-expression with Hac1p.
Accurately modeling the DNA sequence preferences of transcription factors (TFs), and using these models to predict in vivo genomic binding sites for TFs, are key pieces in deciphering the regulatory code. These efforts have been frustrated by the limited availability and accuracy of TF binding site motifs, usually represented as position-specific scoring matrices (PSSMs), which may match large numbers of sites and produce an unreliable list of target genes. Recently, protein binding microarray (PBM) experiments have emerged as a new source of high resolution data on in vitro TF binding specificities. PBM data has been analyzed either by estimating PSSMs or via rank statistics on probe intensities, so that individual sequence patterns are assigned enrichment scores (E-scores). This representation is informative but unwieldy because every TF is assigned a list of thousands of scored sequence patterns. Meanwhile, high-resolution in vivo TF occupancy data from ChIP-seq experiments is also increasingly available. We have developed a flexible discriminative framework for learning TF binding preferences from high resolution in vitro and in vivo data. We first trained support vector regression (SVR) models on PBM data to learn the mapping from probe sequences to binding intensities. We used a novel k-mer based string kernel called the di-mismatch kernel to represent probe sequence similarities. The SVR models are more compact than E-scores, more expressive than PSSMs, and can be readily used to scan genomic regions to predict in vivo occupancy. Using a large data set of yeast and mouse TFs, we found that our SVR models can better predict probe intensity than the E-score method or PBM-derived PSSMs. Moreover, by using SVRs to score yeast, mouse, and human genomic regions, we were better able to predict genomic occupancy as measured by ChIP-chip and ChIP-seq experiments.
Finally, we found that by training kernel-based models directly on ChIP-seq data, we greatly improved in vivo occupancy prediction, and by comparing a TF's in vitro and in vivo models, we could identify cofactors and disambiguate direct and indirect binding.
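As a rough illustration of how a k-mer based string kernel compares two probe sequences, the sketch below implements the standard (k, 1)-mismatch kernel, in which each k-mer also contributes to every k-mer that differs from it at one position. This is a simpler stand-in chosen for illustration and is not the di-mismatch kernel described above, whose definition is not given in this abstract; the function names and the choice k = 3 are assumptions.

```python
from collections import Counter

ALPHABET = "ACGT"

def neighbors(kmer):
    """Yield the k-mer itself and every k-mer that differs at exactly one position."""
    yield kmer
    for i, original in enumerate(kmer):
        for base in ALPHABET:
            if base != original:
                yield kmer[:i] + base + kmer[i + 1:]

def mismatch_features(seq, k):
    """Feature vector: each k-mer in seq adds one count to all of its neighbors."""
    features = Counter()
    for i in range(len(seq) - k + 1):
        for n in neighbors(seq[i:i + k]):
            features[n] += 1
    return features

def mismatch_kernel(s, t, k=3):
    """Inner product of the two mismatch feature vectors."""
    fs, ft = mismatch_features(s, k), mismatch_features(t, k)
    return sum(count * ft[kmer] for kmer, count in fs.items())

if __name__ == "__main__":
    print(mismatch_kernel("ACGTACGT", "ACGTTCGT"))
```

A kernel of this kind can be passed as a precomputed Gram matrix to a support vector regression solver, so sequences never need to be expanded into explicit feature vectors by the learner itself.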
Transcription factors (TFs) are proteins that bind sites in the non-coding DNA and regulate the expression of targeted genes. Being able to predict the genome-wide binding locations of TFs is an important step in deciphering gene regulatory networks. Historically, there was very limited experimental data on the DNA-binding preferences of most TFs. Computational biologists used known sites to estimate simple binding site motifs, called position-specific scoring matrices, and scan the genome for additional potential binding locations, but this approach often led to many false positive predictions. Here we introduce a machine learning approach to leverage new high resolution data on the binding preferences of TFs, namely, protein binding microarray (PBM) experiments which measure the in vitro binding affinities of TFs with respect to an array of double-stranded DNA probes, and chromatin immunoprecipitation experiments followed by next generation sequencing (ChIP-seq) which measure in vivo genome-wide binding of TFs in a given cell type. We show that by training statistical models on high resolution PBM and ChIP-seq data, we can more accurately represent the subtle DNA binding preferences of TFs and predict their genome-wide binding locations. These results will enable advances in the computational analysis of transcriptional regulation in mammalian genomes.
Chromatin immunoprecipitation followed by high throughput sequencing (ChIP-Seq) has been successfully used for genome-wide profiling of transcription factor binding sites, histone modifications, and nucleosome occupancy in many model organisms and humans. Because the compact genomes of prokaryotes harbor many binding sites separated by only a few base pairs, applications of ChIP-Seq in this domain have not reached their full potential. Applications in prokaryotic genomes are further hampered by the fact that well studied data analysis methods for ChIP-Seq do not achieve the resolution required for deciphering the locations of nearby binding events. We generated single-end tag (SET) and paired-end tag (PET) ChIP-Seq data for factor in Escherichia coli (E. coli). Direct comparison of these datasets revealed that although the PET assay enables higher resolution identification of binding events, standard ChIP-Seq analysis methods are not equipped to utilize PET-specific features of the data. To address this problem, we developed dPeak as a high resolution binding site identification (deconvolution) algorithm. dPeak implements a probabilistic model that accurately describes the ChIP-Seq data generation process for both the SET and PET assays. For SET data, dPeak outperforms or performs comparably to the state-of-the-art high-resolution ChIP-Seq peak deconvolution algorithms such as PICS, GPS, and GEM. When coupled with PET data, dPeak significantly outperforms SET-based analysis with any of the current state-of-the-art methods. Experimental validations of a subset of dPeak predictions from PET ChIP-Seq data indicate that dPeak can estimate locations of binding events with as high as to resolution. Applications of dPeak to ChIP-Seq data in E. coli under aerobic and anaerobic conditions reveal closely located promoters that are differentially occupied and further illustrate the importance of high resolution analysis of ChIP-Seq data.
Chromatin immunoprecipitation followed by high throughput sequencing (ChIP-Seq) is widely used for studying in vivo protein-DNA interactions genome-wide. Current state-of-the-art ChIP-Seq protocols utilize single-end tag (SET) assay which only sequences ends of DNA fragments in the library. Although paired-end tag (PET) sequencing is routinely used in other applications of next generation sequencing, it has not been much adapted to ChIP-Seq. We illustrate both experimentally and computationally that PET sequencing significantly improves the resolution of ChIP-Seq experiments and enables ChIP-Seq applications in compact genomes like Escherichia coli (E. coli). To enable efficient identification using PET ChIP-Seq data, we develop dPeak as a high resolution binding site identification algorithm. dPeak implements probabilistic models for both SET and PET data and facilitates efficient analysis of both data types. Applications of dPeak to deeply sequenced E. coli PET and SET ChIP-Seq data establish significantly better resolution of PET compared to SET sequencing.
We cloned by phenotypic complementation a novel Saccharomyces cerevisiae multicopy suppressor of the Schizosaccharomyces pombe cdc10-129 mutant, which we call HAC1, an acronym of 'homologous to ATF/CREB 1'. It encodes a bZIP (basic-leucine zipper) protein of 230 amino acids with close homology to the mammalian ATF/CREB transcription factor, and gel-retardation assays showed that it binds specifically to the CRE motif. HAC1 is not essential for viability. However, the hac1 disruptant becomes caffeine sensitive, which is suppressed by multicopy expression of the yeast PDE2 (Phosphodiesterase 2) gene. Although the mRNA level of HAC1 is almost constitutive throughout the cell cycle, it fluctuates during meiosis. The upstream region of the HAC1 gene contains a T4C site, a URS (upstream repression sequence) and a TR (T-rich) box-like sequence, which reside upstream of many meiotic genes. These results suggest that HAC1 may also be one of the meiotic genes.
Finding where transcription factors (TFs) bind to the DNA is of key importance to decipher gene regulation at a transcriptional level. Classically, computational prediction of TF binding sites (TFBSs) is based on basic position weight matrices (PWMs) which quantitatively score binding motifs based on the observed nucleotide patterns in a set of TFBSs for the corresponding TF. Such models make the strong assumption that each nucleotide participates independently in the corresponding DNA-protein interaction and do not account for flexible length motifs. We introduce transcription factor flexible models (TFFMs) to represent TF binding properties. Based on hidden Markov models, TFFMs are flexible, and can model both position interdependence within TFBSs and variable length motifs within a single dedicated framework. The availability of thousands of experimentally validated DNA-TF interaction sequences from ChIP-seq allows for the generation of models that perform as well as PWMs for stereotypical TFs and can improve performance for TFs with flexible binding characteristics. We present a new graphical representation of the motifs that conveys properties of position interdependence. TFFMs have been assessed on ChIP-seq data sets coming from the ENCODE project, revealing that they can perform better than both PWMs and the dinucleotide weight matrix extension in discriminating ChIP-seq from background sequences. Under the assumption that ChIP-seq signal values are correlated with the affinity of the TF-DNA binding, we find that TFFM scores correlate with ChIP-seq peak signals. Moreover, using available TF-DNA affinity measurements for the Max TF, we demonstrate that TFFMs constructed from ChIP-seq data correlate with published experimentally measured DNA-binding affinities. Finally, TFFMs allow for the straightforward computation of an integrated TF occupancy score across a sequence.
These results demonstrate the capacity of TFFMs to accurately model DNA-protein interactions, while providing a single unified framework suitable for the next generation of TFBS prediction.
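For readers unfamiliar with the classical baseline that TFFMs are compared against, the following minimal sketch builds a log-odds PWM from a handful of aligned sites and scans a sequence with it. It illustrates the independence assumption: the score of a window is simply the sum of per-position terms. The function names, the pseudocount of 1, the uniform background of 0.25, and the toy sites are all assumptions made for illustration, not values from the work summarized here.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log2-odds matrix from equal-length aligned sites; one dict per position."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm

def score(pwm, window):
    """Additive score: each position contributes independently of the others."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_hit(pwm, seq):
    """Return (score, start) of the highest-scoring window in seq."""
    width = len(pwm)
    return max((score(pwm, seq[i:i + width]), i) for i in range(len(seq) - width + 1))

if __name__ == "__main__":
    toy_sites = ["ACGT", "ACGT", "ACGA"]  # made-up aligned sites
    pwm = build_pwm(toy_sites)
    print(best_hit(pwm, "TTACGTTT"))
```

Because the score is a plain sum over positions, a matrix of this kind cannot express dependencies between neighbouring nucleotides or variable spacing, which is the limitation the hidden Markov formulation is meant to address.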
Transcription factors are critical proteins for sequence-specific control of transcriptional regulation. Finding where these proteins bind to DNA is of key importance for global efforts to decipher the complex mechanisms of gene regulation. Greater understanding of the regulation of transcription promises to improve human genetic analysis by specifying critical gene components that have eluded investigators. Classically, computational prediction of transcription factor binding sites (TFBS) is based on models giving weights to each nucleotide at each position. We introduce a novel statistical model for the prediction of TFBS tolerant of a broader range of TFBS configurations than can be conveniently accommodated by existing methods. The new models are designed to address the confounding properties of nucleotide composition, inter-positional sequence dependence and variable lengths (e.g. variable spacing between half-sites) observed in the more comprehensive experimental data now emerging. The new models generate scores consistent with DNA-protein affinities measured experimentally and can be represented graphically, retaining desirable attributes of past methods. These results demonstrate the capacity of the new approach to accurately assess DNA-protein interactions. With the rich experimental data generated from chromatin immunoprecipitation experiments, a greater diversity of TFBS properties has emerged that can now be accommodated within a single predictive approach.
Human artificial chromosomes (HACs) are vectors that offer advantages of capacity and stability for gene delivery and expression. Several studies have even demonstrated their use for gene complementation in gene-deficient recipient cell lines and animal transgenesis. Recently, we constructed an advanced HAC-based vector, alphoidtetO-HAC, with a conditional centromere. In this HAC, a gene-loading site was inserted into a centrochromatin domain critical for kinetochore assembly and maintenance. While by definition this domain is permissive for transcription, there have been no long-term studies on transgene expression within centrochromatin. In this study, we compared the effects of three chromatin insulators, cHS4, gamma-satellite DNA, and tDNA, on the expression of an EGFP transgene inserted into the alphoidtetO-HAC vector. Insulator function was essential for stable expression of the transgene in centrochromatin. In two analyzed host cell lines, a tDNA insulator composed of two functional copies of tRNA genes showed the highest barrier activity. We infer that proximity to centrochromatin does not protect genes lacking chromatin insulators from epigenetic silencing. Barrier elements that prevent gene silencing in centrochromatin would thus help to optimize transgenesis using HAC vectors.
Electronic supplementary material
The online version of this article (doi:10.1007/s00018-013-1362-9) contains supplementary material, which is available to authorized users.
Insulator; tDNA-gamma-satellite; cHS4; Human artificial chromosome-HAC
Transcription factor (TF)-DNA binding loci are explored by analyzing massive datasets generated by chromatin immunoprecipitation (ChIP)-based high-throughput sequencing technologies. These datasets suffer from bias in the available information about binding loci, sample incompleteness and diverse sources of technical and biological noise. Therefore, adequate mathematical models of ChIP-based high-throughput assays and statistical tools are required for a robust identification of specific and reliable TF binding sites (TFBS), a precise characterization of the TFBS avidity distribution and a plausible estimation of the total number of specific TFBS for a given TF in the genome for a given cell type.
We developed an exploratory probabilistic mixture model for specific and non-specific transcription factor-DNA (TF-DNA) binding. Within ChIP-seq data sets, the statistics of specific and non-specific DNA-protein binding are defined by a mixture of sample size-dependent skewed functions described by the Kolmogorov-Waring (K-W) function (Kuznetsov, 2003) and an exponential function, respectively. Using available ChIP-seq data for eleven TFs essential for self-maintenance and differentiation of mouse embryonic stem cells (SC) (Nanog, Oct4, Sox2, Klf4, STAT3, E2F1, Tcfcp2l1, ZFX, n-Myc, c-Myc and Esrrb) reported in Chen et al (2008), we estimated (i) the specificity and the sensitivity of the ChIP-seq binding assays and (ii) the number of specific binding sites (BSs) in the genome of mouse embryonic stem cells that were not identified in the current experiments. Motif finding analysis applied to the identified c-Myc TFBSs supports our results and allowed us to predict many novel c-Myc target genes.
We provide a novel methodology for estimating the specificity and the sensitivity of TF-DNA binding in massively parallel ChIP sequencing (ChIP-seq) binding assays. Goodness-of-fit analysis of K-W functions suggests that a large fraction of low- and moderate-avidity TFBSs cannot be identified by the ChIP-based methods. Thus the task to identify the binding sensitivity of a TF cannot be technically resolved yet by current ChIP-seq, compared to former experimental techniques. Considering our improvement in measuring the sensitivity and the specificity of the TFs obtained from the ChIP-seq data, the models of transcriptional regulatory networks in embryonic cells and other cell types derived from the given ChIP-seq data should be carefully revised.
HACS1 is a Src homology 3 and sterile alpha motif domain–containing adaptor that is preferentially expressed in normal hematopoietic tissues and malignancies including myeloid leukemia, lymphoma, and myeloma. Microarray data showed HACS1 expression is up-regulated in activated human B cells treated with interleukin (IL)-4, CD40L, and anti–immunoglobulin (Ig)M and clustered with genes involved in signaling, including TNF receptor–associated protein 1, signaling lymphocytic activation molecule, IL-6, and DEC205. Immunoblot analysis demonstrated that HACS1 is up-regulated by IL-4, IL-13, anti-IgM, and anti-CD40 in human peripheral blood B cells. In murine spleen B cells, Hacs1 can also be up-regulated by lipopolysaccharide but not IL-13. Induction of Hacs1 by IL-4 is dependent on Stat6 signaling and can also be impaired by inhibitors of phosphatidylinositol 3-kinase, protein kinase C, and nuclear factor κB. HACS1 associates with tyrosine-phosphorylated proteins after B cell activation and binds in vitro to the inhibitory molecule paired Ig-like receptor B. Overexpression of HACS1 in murine spleen B cells resulted in a down-regulation of the activation marker CD23 and enhancement of CD138 expression, IgM secretion, and Xbp-1 expression. Knock down of HACS1 in a human B lymphoma cell line by small interfering ribonucleic acid did not significantly change IL-4–stimulated B cell proliferation. Our study demonstrates that HACS1 is up-regulated by B cell activation signals and is a participant in B cell activation and differentiation.
B lymphocytes; interleukin-4; signaling; gene expression; adaptor protein
Motivation: Modelling the regulation of gene expression can provide insight into the regulatory roles of individual transcription factors (TFs) and histone modifications. Recently, Ouyang et al. (2009) modelled gene expression levels in mouse embryonic stem (mES) cells using in vivo ChIP-seq measurements of TF binding. ChIP-seq TF binding data, however, are tissue-specific and relatively difficult to obtain. This limits the applicability of gene expression models that rely on ChIP-seq TF binding data.
Results: In this study, we build regression-based models that relate gene expression to the binding of 12 different TFs, 7 histone modifications and chromatin accessibility (DNase I hypersensitivity) in two different tissues. We find that expression models based on computationally predicted TF binding can achieve similar accuracy to those using in vivo TF binding data and that including binding at weak sites is critical for accurate prediction of gene expression. We also find that incorporating histone modification and chromatin accessibility data results in additional accuracy. Surprisingly, we find that models that use no TF binding data at all, but only histone modification and chromatin accessibility data, can be as (or more) accurate than those based on in vivo TF binding data.
Availability and implementation: All scripts, motifs and data presented in this article are available online at http://research.imb.uq.edu.au/t.bailey/supplementary_data/McLeay2011a.
Supplementary data are available at Bioinformatics online.
Nutrient response networks are likely to have been among the first response networks to evolve, as the ability to sense and respond to the levels of available nutrients is critical for all organisms. Although several forward genetic screens have been successful in identifying components of plant sugar-response networks, many components remain to be identified. Toward this end, a reverse genetic screen was conducted in Arabidopsis thaliana to identify additional components of sugar-response networks. This screen was based on the rationale that some of the genes involved in sugar-response networks are likely to be themselves sugar regulated at the steady-state mRNA level and to encode proteins with activities commonly associated with response networks. This rationale was validated by the identification of hac1 mutants that are defective in sugar response. HAC1 encodes a histone acetyltransferase. Histone acetyltransferases increase transcription of specific genes by acetylating histones associated with those genes. Mutations in HAC1 also cause reduced fertility, a moderate degree of resistance to paclobutrazol and altered transcript levels of specific genes. Previous research has shown that hac1 mutants exhibit delayed flowering. The sugar-response and fertility defects of hac1 mutants may be partially explained by decreased expression of AtPV42a and AtPV42b, which are putative components of plant SnRK1 complexes. SnRK1 complexes have been shown to function as central regulators of plant nutrient and energy status. Involvement of a histone acetyltransferase in sugar response provides a possible mechanism whereby nutritional status could exert long-term effects on plant development and metabolism.
histone acetyltransferase; chromatin modification; SnRK1; sugar signaling; sugar response; Arabidopsis; fertility; sucrose response
Histone acetyltransferases (HATs) play an important role in eukaryotic transcription. Eight HATs identified in rice (OsHATs) can be organized into four families, namely the CBP (OsHAC701, OsHAC703, and OsHAC704), TAFII250 (OsHAF701), GNAT (OsHAG702, OsHAG703, and OsHAG704), and MYST (OsHAM701) families. The biological functions of HATs in rice remain unknown, so a comprehensive protein sequence analysis of the HAT families was conducted to investigate their potential functions. In addition, the subcellular localization and expression patterns of the eight OsHATs were analyzed.
On the basis of a phylogenetic and domain analysis, monocotyledonous CBP family proteins can be subdivided into two groups, namely Group I and Group II. Similarly, dicotyledonous CBP family proteins can be divided into two groups, namely Group A and Group B. High similarities of protein sequences, conserved domains and three-dimensional models were identified among OsHATs and their homologs in Arabidopsis thaliana and maize. Subcellular localization predictions indicated that all OsHATs might localize in both the nucleus and cytosol. Transient expression in Arabidopsis protoplasts confirmed the nuclear and cytosolic localization of OsHAC701, OsHAG702, and OsHAG704. Real-time quantitative polymerase chain reaction analysis demonstrated that the eight OsHATs were expressed in all tissues examined with significant differences in transcript abundance, and their expression was modulated by abscisic acid and salicylic acid as well as abiotic factors such as salt, cold, and heat stresses.
Both monocotyledonous and dicotyledonous CBP family proteins can be divided into two distinct groups, which suggests the possibility of functional diversification. The high similarities of protein sequences, conserved domains and three-dimensional models among OsHATs and their homologs in Arabidopsis and maize suggested that OsHATs have multiple functions. OsHAC701, OsHAG702, and OsHAG704 were localized in both the nucleus and cytosol in transient expression analyses with Arabidopsis protoplasts. OsHATs were expressed constitutively in rice, and their expression was regulated by exogenous hormones and abiotic stresses, which suggested that OsHATs may play important roles in plant defense responses.
Histone acetyltransferase; Hormone; Phylogenetic tree; Subcellular localization; Rice; Stress
Endoplasmic reticulum (ER) stress is a condition in which the protein folding capacity of the ER becomes overwhelmed by an increased demand for secretion or by exposure to compounds that disrupt ER homeostasis. In yeast and other fungi, the accumulation of unfolded proteins is detected by the ER-transmembrane sensor IreA/Ire1, which responds by cleaving an intron from the downstream cytoplasmic mRNA HacA/Hac1, allowing for the translation of a transcription factor that coordinates a series of adaptive responses that are collectively known as the unfolded protein response (UPR). Here, we examined the contribution of IreA to growth and virulence in the human fungal pathogen Aspergillus fumigatus. Gene expression profiling revealed that A. fumigatus IreA signals predominantly through the canonical IreA-HacA pathway under conditions of severe ER stress. However, in the absence of ER stress IreA controls dual signaling circuits that are both HacA-dependent and HacA-independent. We found that a ΔireA mutant was avirulent in a mouse model of invasive aspergillosis, which contrasts the partial virulence of a ΔhacA mutant, suggesting that IreA contributes to pathogenesis independently of HacA. In support of this conclusion, we found that the ΔireA mutant had more severe defects in the expression of multiple virulence-related traits relative to ΔhacA, including reduced thermotolerance, decreased nutritional versatility, impaired growth under hypoxia, altered cell wall and membrane composition, and increased susceptibility to azole antifungals. In addition, full or partial virulence could be restored to the ΔireA mutant by complementation with either the induced form of the hacA mRNA, hacAi, or an ireA deletion mutant that was incapable of processing the hacA mRNA, ireAΔ10. Together, these findings demonstrate that IreA has both HacA-dependent and HacA-independent functions that contribute to the expression of traits that are essential for virulence in A. fumigatus.
Aspergillus fumigatus is the predominant mold pathogen of humans, responsible for life-threatening infections in patients with depressed immunity. The fungus is highly adapted for secretion, a feature that it uses to extract nutrients from the host environment. High rates of protein secretion can overwhelm the protein folding capacity of the endoplasmic reticulum (ER). The resulting ER stress is alleviated by the unfolded protein response (UPR), a signaling pathway that is triggered by the ER-membrane sensor IreA and executed by the downstream transcription factor HacA. This paper uncovers a novel role for IreA in the expression of multiple adaptive traits that allow the fungus to cope with stress conditions that are encountered during infection. Gene expression profiling of ΔireA and ΔhacA mutants revealed that IreA signals predominantly through the canonical IreA-HacA UPR pathway under extreme conditions of ER stress, but has unexpected HacA-dependent and HacA-independent functions even in the absence of ER stress. These findings establish IreA as an important regulator of A. fumigatus pathogenicity and suggest that therapeutic targeting of the dual functions of this protein could be an effective antifungal strategy.
Motivation: Direct binding by a transcription factor (TF) to the proximal promoter of a gene is strong evidence that the TF regulates the gene. Assaying the genome-wide binding of every TF in every cell type and condition is currently impractical. Histone modifications correlate with tissue/cell/condition-specific (‘tissue specific’) TF binding, so histone ChIP-seq data can be combined with traditional position weight matrix (PWM) methods to make tissue-specific predictions of TF–promoter interactions.
Results: We use supervised learning to train a naïve Bayes predictor of TF–promoter binding. The predictor's features are the histone modification levels and a PWM-based score for the promoter. Training and testing use sets of promoters labeled using TF ChIP-seq data, and we use cross-validation on 23 such datasets to measure the accuracy. A PWM+histone naïve Bayes predictor using a single histone modification (H3K4me3) is substantially more accurate than a PWM score or a conservation-based score (phylogenetic motif model). The naïve Bayes predictor is more accurate (on average) at all sensitivity levels, and makes only half as many false positive predictions at sensitivity levels from 10% to 80%. On average, it correctly predicts 80% of bound promoters at a false positive rate of 20%. Accuracy does not diminish when we test the predictor in a different cell type (and species) from training. Accuracy is barely diminished even when we train the predictor without using TF ChIP-seq data.
Availability: Our tissue-specific predictor of promoters bound by a TF is called Dr Gene and is available at http://bioinformatics.org.au/drgene.
Supplementary information: Supplementary data are available at Bioinformatics online.
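The idea of combining a PWM-based promoter score with a histone modification level in a naïve Bayes predictor can be sketched as follows. This is a minimal Gaussian naïve Bayes written for illustration only; the Gaussian likelihood, the function names, and the toy feature values are assumptions and do not reproduce the Dr Gene implementation or its data.

```python
import math
from statistics import mean, pstdev

def fit(rows, labels):
    """Estimate a class prior and per-feature (mean, sd) for each class."""
    model = {}
    for cls in set(labels):
        subset = [row for row, label in zip(rows, labels) if label == cls]
        stats = [(mean(col), max(pstdev(col), 1e-6)) for col in zip(*subset)]
        model[cls] = (len(subset) / len(rows), stats)
    return model

def log_gauss(x, mu, sd):
    """Log density of a normal distribution with mean mu and standard deviation sd."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mu) ** 2 / (2 * sd * sd)

def predict(model, row):
    """Pick the class with the largest log prior plus summed log likelihoods."""
    def joint(cls):
        prior, stats = model[cls]
        return math.log(prior) + sum(
            log_gauss(x, mu, sd) for x, (mu, sd) in zip(row, stats)
        )
    return max(model, key=joint)

if __name__ == "__main__":
    # Made-up toy promoters: (PWM score, H3K4me3 level); label 1 = bound, 0 = unbound.
    rows = [(8.0, 5.0), (7.5, 4.5), (9.0, 6.0), (3.0, 0.5), (2.5, 1.0), (8.5, 0.4)]
    labels = [1, 1, 1, 0, 0, 0]
    model = fit(rows, labels)
    print(predict(model, (8.2, 5.2)), predict(model, (8.4, 0.6)))
```

In the toy data the last unbound promoter has a strong motif match but a low histone signal, which is the kind of case where a PWM score alone would produce a false positive.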
Despite explosive growth in genomic datasets, the methods for studying epigenomic mechanisms of gene regulation remain primitive. Here we present a model-based approach to systematically analyze the epigenomic functions in modulating transcription factor-DNA binding. Based on the first principles of statistical mechanics, this model considers the interactions between epigenomic modifications and a cis-regulatory module, which contains multiple binding sites arranged in any configuration. We compiled a comprehensive epigenomic dataset in mouse embryonic stem (mES) cells, including DNA methylation (MeDIP-seq and MRE-seq), DNA hydroxymethylation (5-hmC-seq), and histone modifications (ChIP-seq). We discovered correlations of transcription factors (TFs) with specific combinations of epigenomic modifications, which we term epigenomic motifs. Epigenomic motifs explained why some TFs appeared to have different DNA binding motifs derived from in vivo (ChIP-seq) and in vitro experiments. Theoretical analyses suggested that the epigenome can modulate transcriptional noise and boost the cooperativity of weak TF binding sites. ChIP-seq data suggested that epigenomic boost of binding affinities in weak TF binding sites can function in mES cells. We showed in theory that the epigenome should suppress the TF binding differences on SNP-containing binding sites in two people. Using personal data, we identified strong associations between H3K4me2/H3K9ac and the degree of personal differences in NFκB binding in SNP-containing binding sites, which may explain why some SNPs introduce much smaller personal variations on TF binding than other SNPs. In summary, this model presents a powerful approach to analyze the functions of epigenomic modifications. This model was implemented in an open source program APEG (Affinity Prediction by Epigenome and Genome, http://systemsbio.ucsd.edu/apeg).
We developed a model-based approach to systematically analyze the epigenomic functions in modulating transcription factor-DNA binding. We postulated the existence of TF-specific epigenomic motifs, which could explain why some TFs appeared to have different DNA binding motifs derived from in vivo and in vitro experiments. The theoretical results suggested that the epigenome can modulate transcriptional noise and boost the cooperativity of weak TF binding sites. A preliminary analysis of the existing data suggested that epigenomic boost of binding affinities in weak TF binding sites could be a widespread regulatory mechanism in mES cells. Moreover, using personal data, we identified strong associations between H3K4me2/H3K9ac and the degree of individual differences in NFκB binding in SNP-containing binding sites. This suggests that the theoretical mechanism by which the epigenome attenuates TF binding differences on SNP-containing binding sites in two individuals may help link genomic variation to phenotypic variation. Thus, this model presents a powerful approach to analyze the functions of epigenomic modifications.
Mapping genome-wide binding sites of all transcription factors (TFs) in all biological contexts is a critical step toward understanding gene regulation. The state-of-the-art technologies for mapping transcription factor binding sites (TFBSs) couple chromatin immunoprecipitation (ChIP) with high-throughput sequencing (ChIP-seq) or tiling array hybridization (ChIP-chip). These technologies have limitations: they are low-throughput with respect to surveying many TFs. Recent advances in genome-wide chromatin profiling, including development of technologies such as DNase-seq, FAIRE-seq and ChIP-seq for histone modifications, make it possible to predict in vivo TFBSs by analyzing chromatin features at computationally determined DNA motif sites. This promising new approach may allow researchers to monitor the genome-wide binding sites of many TFs simultaneously. In this article, we discuss various experimental design and data analysis issues that arise when applying this approach. Through a systematic analysis of the data from the Encyclopedia Of DNA Elements (ENCODE) project, we compare the predictive power of individual and combinations of chromatin marks using supervised and unsupervised learning methods, and evaluate the value of integrating information from public ChIP and gene expression data. We also highlight the challenges and opportunities for developing novel analytical methods, such as resolving the one-motif-multiple-TF ambiguity and distinguishing functional and non-functional TF binding targets from the predicted binding sites.
Electronic Supplementary Material
The online version of this article (doi:10.1007/s12561-012-9066-5) contains supplementary material, which is available to authorized users.
Transcription factor binding sites; DNase-seq; ChIP-seq; FAIRE-seq; Next-generation sequencing; Motif
Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is rapidly replacing chromatin immunoprecipitation combined with genome-wide tiling array analysis (ChIP-chip) as the preferred approach for mapping transcription-factor binding sites and chromatin modifications. The state of the art for analyzing ChIP-seq data relies on using only reads that map uniquely to a relevant reference genome (uni-reads). This can lead to the omission of up to 30% of alignable reads. We describe a general approach for utilizing reads that map to multiple locations on the reference genome (multi-reads). Our approach is based on allocating multi-reads as fractional counts using a weighted alignment scheme. Using human STAT1 and mouse GATA1 ChIP-seq datasets, we illustrate that incorporation of multi-reads significantly increases sequencing depths, leads to detection of novel peaks that are not otherwise identifiable with uni-reads, and improves detection of peaks in mappable regions. We investigate various genome-wide characteristics of peaks detected only by utilization of multi-reads via computational experiments. Overall, peaks from multi-read analysis have similar characteristics to peaks that are identified by uni-reads except that the majority of them reside in segmental duplications. We further validate a number of GATA1 multi-read only peaks by independent quantitative real-time ChIP analysis and identify novel target genes of GATA1. These computational and experimental results establish that multi-reads can be of critical importance for studying transcription factor binding in highly repetitive regions of genomes with ChIP-seq experiments.
Annotating repetitive regions of genomes experimentally is a challenging task. Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) provides valuable data for characterizing repetitive regions of genomes in terms of transcription factor binding. Although ChIP-seq technology has been maturing, available ChIP-seq analysis methods and software rely on discarding sequence reads that map to multiple locations on the reference genome (multi-reads), thereby generating a missed opportunity for assessing transcription factor binding to highly repetitive regions of genomes. We develop a computational algorithm that takes multi-reads into account in ChIP-seq analysis. We show with computational experiments that multi-reads lead to significant increase in sequencing depths and identification of binding regions that are otherwise not identifiable when only reads that uniquely map to the reference genome (uni-reads) are used. In particular, we show that the number of binding regions identified can increase up to 36%. We support our computational predictions with independent quantitative real-time ChIP validation of binding regions identified only when multi-reads are incorporated in the analysis of a mouse GATA1 ChIP-seq experiment.
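The idea of allocating multi-reads as fractional counts can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: each multi-read is split across its candidate positions in proportion to the number of uni-reads found within a window around each candidate, plus a pseudocount. The weighting rule, the `window` and `pseudocount` parameters, and the single-chromosome integer coordinates are all hypothetical choices.

```python
from collections import defaultdict

def allocate_multireads(uni_counts, multi_reads, window=100, pseudocount=1.0):
    """Distribute each multi-read over its candidate positions as fractional counts.

    uni_counts:  dict mapping position -> number of uniquely mapped reads
    multi_reads: list of lists; each inner list holds the candidate
                 alignment positions of one multi-read
    """
    def local_support(pos):
        # Uni-read evidence near a candidate position (assumed weighting rule).
        return pseudocount + sum(
            c for p, c in uni_counts.items() if abs(p - pos) <= window
        )

    counts = defaultdict(float)
    for pos, c in uni_counts.items():
        counts[pos] += c
    for candidates in multi_reads:
        weights = [local_support(p) for p in candidates]
        total = sum(weights)
        for p, w in zip(candidates, weights):
            counts[p] += w / total   # fractions for one read sum to 1
    return dict(counts)

# One multi-read that aligns near a uni-read cluster and in an empty region.
print(allocate_multireads({1000: 3}, [[1010, 5000]]))
```

Because the fractions for each read sum to one, total sequencing depth increases by exactly the number of multi-reads, while most of each read's weight goes to the location already supported by unique evidence.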
Human artificial chromosome (HAC)-based vectors represent an alternative technology for gene delivery and expression with a potential to overcome the problems caused by the use of viral-based vectors. The recently developed alphoidtetO-HAC has an advantage over other HAC vectors because it can be easily eliminated from cells by inactivation of the HAC kinetochore via binding of tTS chromatin modifiers to its centromeric tetO sequences. This provides unique control for phenotypes induced by genes loaded into the alphoidtetO-HAC. However, inactivation of the HAC kinetochore requires transfection of cells by a retrovirus vector, a step that is potentially mutagenic. Here, we describe an approach to re-engineering the alphoidtetO-HAC that allows verification of phenotypic changes attributed to expression of genes from the HAC without a transfection step. In the new HAC vector, a tTS-EYFP cassette is inserted into a gene-loading site along with a gene of interest. Expression of the tTS generates a self-regulating fluctuating heterochromatin on the alphoidtetO-HAC that induces fast silencing of the genes on the HAC without significant effects on HAC segregation. This silencing of the HAC-encoded genes can be readily recovered by adding doxycycline. The newly modified alphoidtetO-HAC-based system has multiple applications in gene function studies.
DNA sequence and local chromatin landscape act jointly to determine transcription factor (TF) binding intensity profiles. To disentangle these influences, we developed an experimental approach, called protein/DNA binding followed by high-throughput sequencing (PB–seq), that allows the binding energy landscape to be characterized genome-wide in the absence of chromatin. We applied our methods to the Drosophila Heat Shock Factor (HSF), which inducibly binds a target DNA sequence element (HSE) following heat shock stress. PB–seq involves incubating sheared naked genomic DNA with recombinant HSF, partitioning the HSF–bound and HSF–free DNA, and then detecting HSF–bound DNA by high-throughput sequencing. We compared PB–seq binding profiles with ones observed in vivo by ChIP–seq and developed statistical models to predict the observed departures from idealized binding patterns based on covariates describing the local chromatin environment. We found that DNase I hypersensitivity and tetra-acetylation of H4 were the most influential covariates in predicting changes in HSF binding affinity. We also investigated the extent to which DNA accessibility, as measured by digital DNase I footprinting data, could be predicted from MNase–seq data and the ChIP–chip profiles for many histone modifications and TFs, and found GAGA element associated factor (GAF), tetra-acetylation of H4, and H4K16 acetylation to be the most predictive covariates. Lastly, we generated an unbiased model of HSF binding sequences, which revealed distinct biophysical properties of the HSF/HSE interaction and a previously unrecognized substructure within the HSE. These findings provide new insights into the interplay between the genomic sequence and the chromatin landscape in determining transcription factor binding intensity.
Transcription factors (TFs) bind DNA to modulate levels of gene expression. TF binding sites change throughout development, in response to environmental stimuli, and different tissues have distinct TF binding profiles. The mechanism by which TFs discriminate between binding sites in a context dependent manner is an area of active research, but it is clear that the chromatin environment in which potential binding sites reside strongly influences binding. This study used the Heat Shock TF (HSF) to study the effect chromatin has upon induced HSF binding. We implemented an experimental technique to quantify all potential HSF binding sites in the genome. These data were incorporated into a modeling framework along with chromatin landscape information prior to HSF binding to accurately predict the intensities of inducible HSF binding sites. DNase I hypersensitivity and tetra-acetylation of H4 were the most influential covariates in the model. The binding data enabled the development of a more complete HSF/DNA interaction model, providing insight into the biophysical interaction of HSF trimer subunits and target DNA pentamers.
We have used a human artificial chromosome (HAC) to manipulate the epigenetic state of chromatin within an active kinetochore. The HAC has a dimeric α-satellite repeat containing one natural monomer with a CENP-B binding site, and one completely artificial synthetic monomer with the CENP-B box replaced by a tetracycline operator (tetO). This HAC exhibits normal kinetochore protein composition and mitotic stability. Targeting of several tet-repressor (tetR) fusions into the centromere had no effect on kinetochore function. However, altering the chromatin state to a more open configuration with the tTA transcriptional activator or to a more closed state with the tTS transcription silencer caused missegregation and loss of the HAC. tTS binding caused the loss of CENP-A, CENP-B, CENP-C, and H3K4me2 from the centromere accompanied by an accumulation of histone H3K9me3. Our results reveal that a dynamic balance between centromeric chromatin and heterochromatin is essential for vertebrate kinetochore activity.
Classically, models of DNA-transcription factor binding sites (TFBSs) have been based on relatively few known instances and have treated them as sites of fixed length using position weight matrices (PWMs). Various extensions to this model have been proposed, most of which take account of dependencies between the bases in the binding sites. However, some transcription factors are known to exhibit some flexibility and bind to DNA in more than one possible physical configuration. In some cases this variation is known to affect the function of binding sites. With the increasing volume of ChIP-seq data available it is now possible to investigate models that incorporate this flexibility. Previous work on variable length models has been constrained by: a focus on specific zinc finger proteins in yeast using restrictive models; a reliance on hand-crafted models for just one transcription factor at a time; and a lack of evaluation on realistically sized data sets.
We re-analysed binding sites from the TRANSFAC database and found motivating examples where our new variable length model provides a better fit. We analysed several ChIP-seq data sets with a novel motif search algorithm and compared the results to one of the best standard PWM finders and a recently developed alternative method for finding motifs of variable structure. All the methods performed comparably in held-out cross validation tests. Known motifs of variable structure were recovered for p53, Stat5a and Stat5b. In addition our method recovered a novel generalised version of an existing PWM for Sp1 that allows for variable length binding. This motif improved classification performance.
We have presented a new gapped PWM model for variable length DNA binding sites that is neither too restrictive nor over-parameterised. Our comparison with existing tools shows that, on average, it does not have better predictive accuracy than existing methods. However, it does provide more interpretable models of motifs of variable structure that are suitable for follow-up structural studies. To our knowledge, we are the first to apply variable length motif models to eukaryotic ChIP-seq data sets and consequently the first to show their value in this domain. The results include a novel motif for the ubiquitous transcription factor Sp1.
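To make the notion of a gapped PWM concrete, here is a minimal sketch assuming the simplest possible form: two fixed-length PWM blocks separated by a spacer whose length may vary over a small set of allowed values. The representation (a list of per-position score dictionaries), the function names, and the absence of any gap penalty are assumptions for illustration and do not reproduce the model described above.

```python
def score_block(block, seq, start):
    """Sum the per-position scores of one PWM block aligned at `start`."""
    return sum(col[seq[start + i]] for i, col in enumerate(block))

def best_gapped_site(left, right, seq, gaps=(0, 1, 2)):
    """Return (score, start, gap) for the best-scoring site in seq, or None.

    left, right: PWM blocks, each a list of dicts mapping base -> score
    gaps:        allowed spacer lengths between the two blocks (assumed)
    """
    best = None
    for start in range(len(seq)):
        for gap in gaps:
            right_start = start + len(left) + gap
            if right_start + len(right) > len(seq):
                continue  # site would run off the end of the sequence
            s = score_block(left, seq, start) + score_block(right, seq, right_start)
            if best is None or s > best[0]:
                best = (s, start, gap)
    return best

def column(base):
    """Toy PWM column that rewards one base and penalises the others."""
    return {b: (1.0 if b == base else -1.0) for b in "ACGT"}

left = [column("A"), column("C")]
right = [column("G"), column("T")]
print(best_gapped_site(left, right, "TTACAGTT"))
```

The sketch expects sequences over the alphabet A, C, G, T only. Ambiguity codes such as N would need an explicit score in each column.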
HacA/Xbp1 is a conserved bZIP transcription factor in eukaryotic cells which regulates gene expression in response to various forms of secretion stress and as part of secretory cell differentiation. In the present study, we replaced the endogenous hacA gene of an Aspergillus niger strain with a gene encoding a constitutively active form of the HacA transcription factor (HacACA). The impact of constitutive HacA activity during exponential growth was explored in bioreactor controlled cultures using transcriptomic analysis to identify affected genes and processes.
Transcription profiles for the wild-type strain (HacAWT) and the HacACA strain were obtained using Affymetrix GeneChip analysis of three replicate batch cultures of each strain. In addition to the well known HacA targets such as the ER resident foldases and chaperones, GO enrichment analysis revealed up-regulation of genes involved in protein glycosylation, phospholipid biosynthesis, intracellular protein transport, exocytosis and protein complex assembly in the HacACA mutant. Biological processes over-represented in the down-regulated genes include those belonging to central metabolic pathways, translation and transcription. A remarkable transcriptional response in the HacACA strain was the down-regulation of the AmyR transcription factor and its target genes.
The results indicate that constitutive activation of HacA leads to a coordinated regulation of the folding and secretion capacity of the cell, but with consequences for growth and fungal physiology that reduce secretion stress.
HacA; Unfolded protein response; Secretion stress; RESS; XBP1; Aspergillus niger; Protein secretion
Transcription factor-DNA interactions, central to cellular regulation and control, are commonly described by position weight matrices (PWMs). These matrices are frequently used to predict transcription factor binding sites in regulatory regions of DNA to complement and guide further experimental investigation. The DNA sequence preferences of transcription factors, encoded in PWMs, are dictated primarily by select residues within the DNA binding domain(s) that interact directly with DNA. Therefore, the DNA binding properties of homologous transcription factors with identical DNA binding domains may be characterized by PWMs derived from different species. Accordingly, we have implemented a fully automated domain-level homology searching method for identical DNA binding sequences.
By applying the domain-level homology search to transcription factors with existing PWMs in the JASPAR and TRANSFAC databases, we were able to significantly increase coverage in terms of the total number of PWMs associated with a given species, assign PWMs to transcription factors that did not previously have any associations, and increase the number of represented species with PWMs over an order of magnitude. Additionally, using protein binding microarray (PBM) data, we have validated the domain-level method by demonstrating that transcription factor pairs with matching DNA binding domains exhibit comparable DNA binding specificity predictions to transcription factor pairs with completely identical sequences.
The increased coverage achieved herein demonstrates the potential for more thorough species-associated investigation of protein-DNA interactions using existing resources. The PWM scanning results highlight the challenging nature of transcription factors that contain multiple DNA binding domains, as well as the impact of motif discovery on the ability to predict DNA binding properties. The method is additionally suitable for identifying domain-level homology mappings to enable utilization of additional information sources in the study of transcription factors. The domain-level homology search method, resulting PWM mappings, web-based user interface, and web API are publicly available at http://dodoma.systemsbiology.net.
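The core of a domain-level homology mapping can be sketched in a few lines, assuming each transcription factor has a single DNA binding domain represented by its amino-acid sequence. The function and argument names are hypothetical, and the sketch does not reflect the actual search pipeline or its web API. It simply groups factors by identical domain sequence and shares the known PWM identifiers within each group.

```python
from collections import defaultdict

def transfer_pwms(dbd_seqs, known_pwms):
    """Assign PWM identifiers to TFs that share an identical DBD sequence.

    dbd_seqs:   dict mapping TF name -> DNA-binding-domain amino-acid sequence
    known_pwms: dict mapping TF name -> set of PWM identifiers already assigned
    Returns a dict mapping TF name -> set of PWM identifiers, including
    those inherited from other TFs with the same domain sequence.
    """
    by_domain = defaultdict(set)
    for tf, pwms in known_pwms.items():
        if tf in dbd_seqs:
            by_domain[dbd_seqs[tf]].update(pwms)
    return {
        tf: set(by_domain[seq])
        for tf, seq in dbd_seqs.items()
        if seq in by_domain
    }

# Toy data: two species share a domain, a third factor has no match.
domains = {"human_TF": "RKRK", "mouse_TF": "RKRK", "fly_TF": "AAAA"}
print(transfer_pwms(domains, {"human_TF": {"PWM1"}}))
```

Factors with several DNA binding domains, which the text flags as the difficult case, would need a per-domain mapping rather than the single sequence assumed here.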
Motivation: Chromatin immunoprecipitation (ChIP) experiments followed by array hybridization, or ChIP-chip, is a powerful approach for identifying transcription factor binding sites (TFBS) and has been widely used. Recently, massively parallel sequencing coupled with ChIP experiments (ChIP-seq) has been increasingly used as an alternative to ChIP-chip, offering cost-effective genome-wide coverage and resolution up to a single base pair. For many well-studied TFs, both ChIP-seq and ChIP-chip experiments have been applied and their data are publicly available. Previous analyses have revealed substantial technology-specific binding signals despite strong correlation between the two sets of results. Therefore, it is of interest to see whether the two data sources can be combined to enhance the detection of TFBS.
Results: In this work, a hierarchical hidden Markov model (HHMM) is proposed for combining data from ChIP-seq and ChIP-chip. In the HHMM, inference results from individual HMMs in ChIP-seq and ChIP-chip experiments are summarized by a higher level HMM. Simulation studies show the advantage of the HHMM when data from both technologies co-exist. Analysis of two well-studied TFs, NRSF and CCCTC-binding factor (CTCF), also suggests that the HHMM yields improved TFBS identification in comparison to analyses using individual data sources or a simple merger of the two.
Availability: Source code for the software ChIPmeta is freely available for download at http://www.umich.edu/~hwchoi/HHMMsoftware.zip, implemented in C and supported on Linux.
Supplementary information: Supplementary data are available at Bioinformatics online.
DNA microarrays are regarded as a valuable tool for basic and applied research in microbiology. However, for many industrially important microorganisms the lack of commercially available microarrays still hampers physiological research. For example, our understanding of protein folding and secretion in the yeast Pichia pastoris currently depends largely on conclusions drawn by analogy to Saccharomyces cerevisiae. To close this gap for a yeast species employed for its high capacity to produce heterologous proteins, we developed full genome DNA microarrays for P. pastoris and analyzed the unfolded protein response (UPR) in this yeast species, as compared to S. cerevisiae.
By combining the partially annotated gene list of P. pastoris with de novo gene finding, a list of putative open reading frames was generated, for which an oligonucleotide probe set was designed using the probe design tool TherMODO (a thermodynamic model-based oligoset design optimizer). To evaluate the performance of the novel array design, microarrays carrying the oligo set were hybridized with samples from cells treated with dithiothreitol (DTT) or from a strain overexpressing the UPR transcription factor HAC1, both compared with a wild type strain in normal medium as untreated control. The DTT treatment results were compared with literature data for S. cerevisiae and revealed similarities, but also important differences, between the two yeast species. Overexpression of HAC1, the most direct control for UPR genes, yielded significant new insight into this important regulatory pathway in P. pastoris, and in yeasts generally.
The differences observed between P. pastoris and S. cerevisiae underline the importance of DNA microarrays for industrial production strains. P. pastoris reacts to DTT treatment mainly by the regulation of genes related to chemical stimulus, electron transport and respiration, while the overexpression of HAC1 induced many genes involved in translation, ribosome biogenesis, and organelle biosynthesis, indicating that the regulatory events triggered by DTT treatment only partially overlap with the reactions to overexpression of HAC1. The high reproducibility of the results achieved with two different oligo sets is a good indication for their robustness, and underlines the importance of less stringent selection of regulated features, in order to avoid a large number of false negative results.
Activation of the unfolded protein response (UPR) in eukaryotes involves the splicing of an unconventional intron from the mRNA encoding the transcriptional activator of the pathway. In Saccharomyces cerevisiae a 252-nucleotide (nt) unconventional intron is spliced out of the transcript of HAC1, changing the 3′ end of the HAC1 open reading frame and relieving the transcript from a translational block in a single step. The translational block is caused by the base pairing of part of the unconventional intron with the 5′-untranslated region (5′UTR). In Aspergillus niger and other aspergilli, the unconventional intron in hacA mRNA is only 20 nt long. Since this intron is part of a stable stem-loop structure, base pairing with the 5′UTR, in contrast to the case with yeast HAC1, is not possible. However, analysis of the hacA mRNA revealed a GC-rich inverted repeat (18 base pairings). Upon the activation of the UPR, the 5′UTR of hacA mRNA is truncated by 230 nt, removing the left part of this inverted repeat. This implies a similar release of a translational block as in the case of S. cerevisiae HAC1 but in two steps. The mechanism behind the 5′ truncation, which does not take place in either yeast HAC1 or mammalian xbp1 mRNA, has been hitherto unknown. Here we show that during secretion stress in A. niger, hacA transcription starts from a new start site closer to the ATG, relieving the transcript from translational attenuation. This transcriptional switch is mediated by HacA itself and the unfolded protein response element 2 (UPRE2) in the hacA promoter.