Related Articles

1.  Epigenetic regulation of human cis-natural antisense transcripts 
Nucleic Acids Research  2012;40(4):1438-1445.
Mammalian genomes encode numerous cis-natural antisense transcripts (cis-NATs). The extent to which these cis-NATs are actively regulated and ultimately functionally relevant, as opposed to transcriptional noise, remains a matter of debate. To address this issue, we analyzed the chromatin environment and RNA Pol II binding properties of human cis-NAT promoters genome-wide. Cap analysis of gene expression data were used to identify thousands of cis-NAT promoters, and profiles of nine histone modifications and RNA Pol II binding for these promoters in ENCODE cell types were analyzed using chromatin immunoprecipitation followed by sequencing (ChIP-seq) data. Active cis-NAT promoters are enriched with activating histone modifications and occupied by RNA Pol II, whereas weak cis-NAT promoters are depleted for both activating modifications and RNA Pol II. The enrichment levels of activating histone modifications and RNA Pol II binding show peaks centered around cis-NAT transcriptional start sites, and the levels of activating histone modifications at cis-NAT promoters are positively correlated with cis-NAT expression levels. Cis-NAT promoters also show highly tissue-specific patterns of expression. These results suggest that human cis-NATs are actively transcribed by RNA Pol II and that their expression is epigenetically regulated, prerequisites for a functional potential for many of these non-coding RNAs.
PMCID: PMC3287164  PMID: 22371288
2.  Genome-wide mapping of RNA Pol-II promoter usage in mouse tissues by ChIP-seq 
Nucleic Acids Research  2010;39(1):190-201.
Alternative promoters that are differentially used in various cellular contexts and tissue types add to the transcriptional complexity of mammalian genomes. Identifying alternative promoters and annotating their activity in different tissues is one of the major challenges in understanding the transcriptional regulation of mammalian genes and their isoforms. To determine the use of alternative promoters in different tissues, we performed ChIP-seq experiments using an antibody against RNA Pol-II in five adult mouse tissues (brain, liver, lung, spleen and kidney). Our analysis identified 38 639 Pol-II promoters, including 12 270 novel promoters, for both protein-coding and non-coding mouse genes. Of these, 6384 promoters are tissue specific and CpG poor, and we find that only 34% of the novel promoters are located in CpG-rich regions, suggesting that novel promoters are mostly tissue specific. By identifying the Pol-II bound promoter(s) of each annotated gene in a given tissue, we found that 37% of the protein-coding genes use alternative promoters in the five mouse tissues. The promoter annotations and ChIP-seq data presented here will aid ongoing efforts to characterize gene regulatory regions in mammalian genomes.
PMCID: PMC3017616  PMID: 20843783
3.  Comparative study on ChIP-seq data: normalization and binding pattern characterization 
Bioinformatics  2009;25(18):2334-2340.
Motivation: Antibody-based chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is a relatively new method for studying the binding patterns of specific protein molecules over the entire genome. ChIP-seq technology allows scientists to obtain more comprehensive results in a shorter time. Here, we present a non-linear normalization algorithm and a mixture modeling method for comparing ChIP-seq data from multiple samples and characterizing genes based on their RNA polymerase II (Pol II) binding patterns.
Results: We apply a two-step non-linear normalization method based on a locally weighted regression (LOESS) approach to compare ChIP-seq data across multiple samples and model the difference using an Exponential-NormalK mixture model. The fitted model is used to identify genes associated with differential binding sites based on the local false discovery rate (fdr). These genes are then standardized and hierarchically clustered to characterize their Pol II binding patterns. As a case study, we apply the analysis procedure to compare the normal breast cancer cell line (MCF7) with the tamoxifen-resistant (OHT) cell line. We find enriched regions that are associated with cancer (P < 0.0001). Our findings also imply that there may be a dysregulation of cell cycle and gene expression control pathways in the tamoxifen-resistant cells. These results show that the non-linear normalization method can be used to analyze ChIP-seq data across multiple samples.
Availability: Data are available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2800347  PMID: 19561022
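The abstract above names LOESS-based normalization but gives no implementation details. As a rough illustration only, the sketch below fits a local linear regression with tricube weights to the log-ratio (M) of two samples against their mean log-intensity (A) and subtracts the fitted trend. The function names `loess_fit` and `normalize_pair`, the `frac` and `pseudocount` parameters, and the single-pass M-versus-A design are assumptions made for this example; they are not the authors' two-step procedure.

```python
from math import log2


def loess_fit(xs, ys, frac=0.5):
    """Local linear fit with tricube weights, evaluated at each x in xs."""
    n = len(xs)
    k = min(n, max(3, round(frac * n)))  # neighbourhood size
    fitted = []
    for x0 in xs:
        dists = [abs(x - x0) for x in xs]
        h = sorted(dists)[k - 1]  # distance to the k-th nearest point
        if h > 0:
            w = [(1 - (d / h) ** 3) ** 3 if d < h else 0.0 for d in dists]
        else:
            w = [1.0 if d == 0 else 0.0 for d in dists]
        sw = sum(w)
        swx = sum(wi * x for wi, x in zip(w, xs))
        swy = sum(wi * y for wi, y in zip(w, ys))
        swxx = sum(wi * x * x for wi, x in zip(w, xs))
        swxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:
            # Degenerate neighbourhood: fall back to the weighted mean.
            fitted.append(swy / sw)
        else:
            slope = (sw * swxy - swx * swy) / denom
            fitted.append((swy - slope * swx) / sw + slope * x0)
    return fitted


def normalize_pair(sample_a, sample_b, frac=0.5, pseudocount=0.5):
    """Return log2 ratios of two count vectors with the LOESS trend removed."""
    a = [v + pseudocount for v in sample_a]
    b = [v + pseudocount for v in sample_b]
    m = [log2(x / y) for x, y in zip(a, b)]         # log-ratio per gene/bin
    avg = [0.5 * log2(x * y) for x, y in zip(a, b)]  # mean log-intensity
    trend = loess_fit(avg, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]
```

If one sample is simply a scaled copy of the other, the fitted trend absorbs the scaling and the normalized values are close to zero, which is the behaviour a between-sample normalization should have.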
4.  RNA Polymerase II Pausing Downstream of Core Histone Genes Is Different from Genes Producing Polyadenylated Transcripts 
PLoS ONE  2012;7(6):e38769.
Recent genome-wide chromatin immunoprecipitation coupled with high-throughput sequencing (ChIP-seq) analyses performed in various eukaryotic organisms have analysed RNA Polymerase II (Pol II) pausing around the transcription start sites of genes. In this study we have further investigated genome-wide binding of Pol II downstream of the 3′ end of the annotated genes (EAGs) by ChIP-seq in human cells. At almost all expressed genes we observed Pol II occupancy downstream of the EAGs, suggesting that Pol II pausing 3′ from the transcription units is a rather common phenomenon. Downstream of EAGs Pol II transcripts can also be detected by global run-on and sequencing, suggesting the presence of functionally active Pol II. Based on Pol II occupancy downstream of EAGs we could distinguish distinct clusters of Pol II pause patterns. On core histone genes, coding for non-polyadenylated transcripts, Pol II occupancy drops quickly after the EAG. In contrast, on genes whose transcripts undergo polyA tail addition [poly(A)+], Pol II occupancy downstream of the EAGs can be detected up to 4–6 kb. Inhibition of polyadenylation significantly increased Pol II occupancy downstream of EAGs at poly(A)+ genes, but not at the EAGs of core histone genes. The differential genome-wide Pol II occupancy profiles 3′ of the EAGs have also been confirmed in mouse embryonic stem (mES) cells, indicating that Pol II pauses genome-wide downstream of the EAGs in mammalian cells. Moreover, in mES cells the sharp drop of Pol II signal at the EAG of core histone genes seems to be independent of the phosphorylation status of the C-terminal domain of the large subunit of Pol II. Thus, our study uncovers a potential link between different mRNA 3′ end processing mechanisms and consequent Pol II transcription termination processes.
PMCID: PMC3372504  PMID: 22701709
5.  A signal processing approach for enriched region detection in RNA polymerase II ChIP-seq data 
BMC Bioinformatics  2012;13(Suppl 2):S2.
RNA polymerase II (PolII) is essential in gene transcription and ChIP-seq experiments have been used to study PolII binding patterns over the entire genome. However, since PolII enriched regions in the genome can be very long, existing peak finding algorithms for ChIP-seq data are not adequate for identifying such long regions.
Here we propose an enriched region detection method for ChIP-seq data to identify long enriched regions by combining a signal denoising algorithm with a false discovery rate (FDR) approach. The binned ChIP-seq data for PolII are first denoised using a non-local means (NL-means) algorithm. Then, an FDR approach is developed to determine the threshold for marking enriched regions in the binned histogram.
We first test our method using a public PolII ChIP-seq dataset and compare our results with published results obtained using the algorithm HPeak. Our results show a high consistency with the published results (80-100%). Then, we apply our proposed method to PolII ChIP-seq data generated in our own study on the effects of hormone on the breast cancer cell line MCF7. The results demonstrate that our method can effectively identify long enriched regions in ChIP-seq datasets. Specifically, in MCF7 control samples we identified 5,911 segments with a length of at least 4 Kbp (maximum 233,000 bp), and in E2-treated MCF7 samples we identified 6,200 such segments (maximum 325,000 bp).
We demonstrated the effectiveness of this method in studying binding patterns of PolII in cancer cells which enables further deep analysis in transcription regulation and epigenetics. Our method complements existing peak detection algorithms for ChIP-seq experiments.
PMCID: PMC3375632  PMID: 22536865
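The general shape of the pipeline described above (denoise binned counts, threshold, report long runs of enriched bins) can be sketched briefly. The example below is a simplification under stated assumptions: a plain moving average stands in for the NL-means denoiser, and the threshold is passed in by the caller rather than derived from an FDR calculation. The names `moving_average` and `enriched_regions` are hypothetical and do not come from the paper.

```python
def moving_average(values, half_window):
    """Smooth binned counts with a centred moving average (stand-in for NL-means)."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def enriched_regions(bin_counts, bin_size, threshold, half_window=1):
    """Merge consecutive bins whose smoothed signal reaches the threshold.

    Returns half-open (start, end) coordinates in base pairs.
    """
    smoothed = moving_average(bin_counts, half_window)
    regions = []
    start = None
    for i, value in enumerate(smoothed):
        if value >= threshold:
            if start is None:
                start = i  # open a new region
        elif start is not None:
            regions.append((start * bin_size, i * bin_size))
            start = None
    if start is not None:  # region runs to the end of the chromosome
        regions.append((start * bin_size, len(smoothed) * bin_size))
    return regions


# Eight 1 kb bins with a 4 kb block of signal in the middle.
print(enriched_regions([0, 0, 9, 9, 9, 9, 0, 0], bin_size=1000, threshold=5))
```

Merging adjacent bins, rather than calling a summit per bin, is what lets long PolII-occupied stretches be reported as single segments.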
6.  POLYPHEMUS: R package for comparative analysis of RNA polymerase II ChIP-seq profiles by non-linear normalization 
Nucleic Acids Research  2011;40(4):e30.
Chromatin immunoprecipitation coupled with massive parallel sequencing (ChIP-seq) is increasingly used to map protein–chromatin interactions at a global scale. The comparison of ChIP-seq profiles for RNA polymerase II (PolII) established in different biological contexts, such as specific developmental stages or specific time-points during cell differentiation, provides not only information about the presence/accumulation of PolII at transcription start sites (TSSs) but also about functional features of transcription, including PolII stalling, pausing and transcript elongation. However, annotation and normalization tools for comparative studies of multiple samples are currently missing. Here, we describe the R-package POLYPHEMUS, which integrates TSS annotation with PolII enrichment over TSSs and coding regions, and normalizes signal intensity profiles. POLYPHEMUS thereby facilitates the extraction of information about global PolII action to reveal changes in the functional state of genes. We validated POLYPHEMUS using a kinetic study on retinoic acid-induced differentiation and a publicly available data set from a comparative PolII ChIP-seq profiling in Caenorhabditis elegans. We demonstrate that POLYPHEMUS corrects the data sets by normalizing for technical variation between samples and reveal the potential of the algorithm in comparing multiple data sets to infer features of transcription regulation from dynamic PolII binding profiles.
PMCID: PMC3287170  PMID: 22156059
7.  A New Exhaustive Method and Strategy for Finding Motifs in ChIP-Enriched Regions 
PLoS ONE  2014;9(1):e86044.
ChIP-seq, which combines chromatin immunoprecipitation (ChIP) with next-generation parallel sequencing, allows for the genome-wide identification of protein-DNA interactions. This technology poses new challenges for the development of novel motif-finding algorithms and methods for determining exact protein-DNA binding sites from ChIP-enriched sequencing data. State-of-the-art heuristic, exhaustive search algorithms have limited application for the identification of short (, ) motifs (, ) contained in ChIP-enriched regions. In this work we have developed a more powerful exhaustive method (FMotif) for finding long (, ) motifs in DNA sequences. In conjunction with our method, we have adopted a simple ChIP-enriched sampling strategy for finding these motifs in large-scale ChIP-enriched regions. Empirical studies on synthetic samples and applications using several ChIP data sets including 16 TF (transcription factor) ChIP-seq data sets and five TF ChIP-exo data sets have demonstrated that our proposed method is capable of finding these motifs with high efficiency and accuracy. The source code for FMotif is available at
PMCID: PMC3901781  PMID: 24475069
8.  Extracting transcription factor targets from ChIP-Seq data 
Nucleic Acids Research  2009;37(17):e113.
ChIP-Seq technology, which combines chromatin immunoprecipitation (ChIP) with massively parallel sequencing, is rapidly replacing ChIP-on-chip for the genome-wide identification of transcription factor binding events. Identifying bound regions from the large number of sequence tags produced by ChIP-Seq is a challenging task. Here, we present GLITR (GLobal Identifier of Target Regions), which accurately identifies enriched regions in target data by calculating a fold-change based on random samples of control (input chromatin) data. GLITR uses a classification method to identify regions in ChIP data that have a peak height and fold-change which do not resemble regions in an input sample. We compare GLITR to several recent methods and show that GLITR has improved sensitivity for identifying bound regions closely matching the consensus sequence of a given transcription factor, and can detect bona fide transcription factor targets missed by other programs. We also use GLITR to address the issue of sequencing depth, and show that sequencing biological replicates identifies far more binding regions than re-sequencing the same sample.
PMCID: PMC2761252  PMID: 19553195
9.  A clustering approach for identification of enriched domains from histone modification ChIP-Seq data 
Bioinformatics  2009;25(15):1952-1958.
Motivation: Chromatin states are the key to gene regulation and cell identity. Chromatin immunoprecipitation (ChIP) coupled with high-throughput sequencing (ChIP-Seq) is increasingly being used to map epigenetic states across genomes of diverse species. Chromatin modification profiles are frequently noisy and diffuse, spanning regions ranging from several nucleosomes to large domains of multiple genes. Much of the early work on the identification of ChIP-enriched regions for ChIP-Seq data has focused on identifying localized regions, such as transcription factor binding sites. Bioinformatic tools to identify diffuse domains of ChIP-enriched regions have been lacking.
Results: Based on the biological observation that histone modifications tend to cluster to form domains, we present a method that identifies spatial clusters of signals unlikely to appear by chance. This method pools together enrichment information from neighboring nucleosomes to increase sensitivity and specificity. By using genomic-scale analysis, as well as the examination of loci with validated epigenetic states, we demonstrate that this method outperforms existing methods in the identification of ChIP-enriched signals for histone modification profiles. We demonstrate the application of this unbiased method in important issues in ChIP-Seq data analysis, such as data normalization for quantitative comparison of levels of epigenetic modifications across cell types and growth conditions.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2732366  PMID: 19505939
10.  Empirical methods for controlling false positives and estimating confidence in ChIP-Seq peaks 
BMC Bioinformatics  2008;9:523.
High throughput signature sequencing holds many promises, one of which is the ready identification of in vivo transcription factor binding sites, histone modifications, changes in chromatin structure and patterns of DNA methylation across entire genomes. In these experiments, chromatin immunoprecipitation is used to enrich for particular DNA sequences of interest and signature sequencing is used to map the regions to the genome (ChIP-Seq). Elucidation of these sites of DNA-protein binding/modification is proving instrumental in reconstructing networks of gene regulation and chromatin remodelling that direct development, response to cellular perturbation, and neoplastic transformation.
Here we present a package of algorithms and software that makes use of control input data to reduce false positives and estimate confidence in ChIP-Seq peaks. Several different methods were compared using two simulated spike-in datasets. Use of control input data and a normalized difference score were found to more than double the recovery of ChIP-Seq peaks at a 5% false discovery rate (FDR). Moreover, both a binomial p-value/q-value and an empirical FDR were found to predict the true FDR within 2–3 fold and are more reliable estimators of confidence than a global Poisson p-value. These methods were then used to reanalyze Johnson et al.'s neuron-restrictive silencer factor (NRSF) ChIP-Seq data without relying on extensive qPCR validated NRSF sites and the presence of NRSF binding motifs for setting thresholds.
The methods developed and tested here show considerable promise for reducing false positives and estimating confidence in ChIP-Seq data without any prior knowledge of the ChIP target. They are part of a larger open source package freely available from
PMCID: PMC2628906  PMID: 19061503
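To make the idea of a binomial p-value/q-value against control input concrete, here is a minimal sketch. It assumes a simple null model in which each read in a window comes from the ChIP library with probability equal to the ChIP share of total sequenced reads, and it converts p-values to q-values with the standard Benjamini-Hochberg procedure. The abstract does not give the package's exact definitions, so the function names and the null model are assumptions for illustration.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def window_p_value(chip_reads, control_reads, chip_total, control_total):
    """One-sided p-value that a window holds more ChIP reads than expected."""
    n = chip_reads + control_reads
    # Null: a read falls in the ChIP library in proportion to library size.
    p = chip_total / (chip_total + control_total)
    return binomial_upper_tail(chip_reads, n, p)


def benjamini_hochberg(p_values):
    """Return q-values (BH-adjusted p-values) in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values
```

Calling peaks at a q-value cutoff of 0.05 then corresponds to the 5% FDR operating point mentioned in the abstract, under the assumed null model.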
11.  SIOMICS: a novel approach for systematic identification of motifs in ChIP-seq data 
Nucleic Acids Research  2013;42(5):e35.
The identification of transcription factor binding motifs is important for the study of gene transcriptional regulation. Chromatin immunoprecipitation (ChIP) followed by massively parallel sequencing (ChIP-seq) provides an unprecedented opportunity to discover binding motifs. Computational methods have been developed to identify motifs from ChIP-seq data, but they encounter several problems. For example, existing methods are often not scalable to the large number of sequences obtained from ChIP-seq peak regions. Some methods rely heavily on well-annotated motifs even though the number of known motifs is limited. To simplify the problem, de novo motif discovery methods often neglect underrepresented motifs in ChIP-seq peak regions. To address these issues, we developed a novel approach called SIOMICS to discover motifs de novo from ChIP-seq data. Tested on 13 ChIP-seq data sets, SIOMICS identified motifs of many known and new cofactors. Tested on 13 simulated random data sets, SIOMICS discovered no motif in any data set. Compared with two recently developed methods for motif discovery, SIOMICS shows advantages in terms of speed, the number of known cofactor motifs predicted in experimental data sets and the number of false motifs predicted in random data sets. The SIOMICS software is freely available at∼xiaoman/SIOMICS/SIOMICS.html.
PMCID: PMC3950686  PMID: 24322294
12.  Cell-type and transcription factor specific enrichment of transcriptional cofactor motifs in ENCODE ChIP-seq data 
BMC Genomics  2013;14(Suppl 5):S2.
Cell type and TF specific interactions between Transcription Factors (TFs) and cofactors are essential for transcriptional regulation through recruitment of general transcription machinery to gene promoter regions, and their identification is heavily reliant on protein interaction assays.
Using TF targeted chromatin immunoprecipitation coupled with massively parallel sequencing (ChIP-seq) data from the Encyclopedia of DNA Elements (ENCODE), we report cell type and TF specific TF-cofactor interactions captured in vivo through enrichments of non-target cofactor binding site motifs within ChIP-seq peaks. We observe enrichments in both known and novel cofactor motifs.
Given the regulatory implications which TF and cofactor interactions have on a cell's phenotype, their identification is necessary but challenging. Here we present the findings of our analyses of TF-cofactor interactions encoded within TF ChIP-seq peaks. The novel cofactor binding site enrichments observed provide valuable insight into TF and cell type specific interactions driving TF interactions.
PMCID: PMC3852067  PMID: 24564528
13.  A Widespread Distribution of Genomic CeMyoD Binding Sites Revealed and Cross Validated by ChIP-Chip and ChIP-Seq Techniques 
PLoS ONE  2010;5(12):e15898.
Identifying transcription factor binding sites genome-wide using chromatin immunoprecipitation (ChIP)-based technology is becoming an increasingly important tool in addressing developmental questions. However, technical problems associated with factor abundance and suitable ChIP reagents are common obstacles to these studies in many biological systems. We have used two completely different, widely applicable methods to determine by ChIP the genome-wide binding sites of the master myogenic regulatory transcription factor HLH-1 (CeMyoD) in C. elegans embryos. The two approaches, ChIP-seq and ChIP-chip, yield strongly overlapping results revealing that HLH-1 preferentially binds to promoter regions of genes enriched for E-box sequences (CANNTG), known binding sites for this well-studied class of transcription factors. HLH-1 binding sites were enriched upstream of genes known to be expressed in muscle, consistent with its role as a direct transcriptional regulator. HLH-1 binding was also detected at numerous sites unassociated with muscle gene expression, as has been previously described for its mouse homolog MyoD. These binding sites may reflect several additional functions for HLH-1, including its interactions with one or more co-factors to activate (or repress) gene expression or a role in chromatin organization distinct from direct transcriptional regulation of target genes. Our results also provide a comparison of ChIP methodologies that can overcome limitations commonly encountered in these types of studies while highlighting the complications of assigning in vivo functions to identified target sites.
PMCID: PMC3012110  PMID: 21209968
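The E-box consensus CANNTG mentioned above is easy to scan for directly. The sketch below, with a made-up example sequence, uses a regular expression with a lookahead so that overlapping sites are also reported. Because the reverse complement of CANNTG again matches CANNTG, scanning one strand is enough to find every site. The function name `find_eboxes` is hypothetical.

```python
import re

# E-box consensus CANNTG, where N is any base. The lookahead makes the
# match zero-width so that overlapping sites are not skipped.
EBOX = re.compile(r"(?=(CA[ACGT]{2}TG))")


def find_eboxes(sequence):
    """Return (0-based start, matched site) for every E-box in the sequence."""
    seq = sequence.upper()
    return [(m.start(), m.group(1)) for m in EBOX.finditer(seq)]


# Made-up sequence for illustration; contains CAGCTG and CACGTG.
print(find_eboxes("ttCAGCTGaaCACGTGcc"))
# prints [(2, 'CAGCTG'), (10, 'CACGTG')]
```

Counting such hits inside ChIP peaks versus background regions is one simple way to check the kind of motif enrichment the abstract reports.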
14.  MM-ChIP enables integrative analysis of cross-platform and between-laboratory ChIP-chip or ChIP-seq data 
Genome Biology  2011;12(2):R11.
The ChIP-chip and ChIP-seq techniques enable genome-wide mapping of in vivo protein-DNA interactions and chromatin states. The cross-platform and between-laboratory variation poses a challenge to the comparison and integration of results from different ChIP experiments. We describe a novel method, MM-ChIP, which integrates information from cross-platform and between-laboratory ChIP-chip or ChIP-seq datasets. It improves both the sensitivity and the specificity of detecting ChIP-enriched regions, and is a useful meta-analysis tool for driving discoveries from multiple data sources.
PMCID: PMC3188793  PMID: 21284836
15.  A short survey of computational analysis methods in analysing ChIP-seq data 
Human Genomics  2011;5(2):117-123.
Chromatin immunoprecipitation followed by massively parallel next-generation sequencing (ChIP-seq) is a valuable experimental strategy for assaying protein-DNA interaction over the whole genome. Many computational tools have been designed to find the peaks of the signals corresponding to protein binding sites. In this paper, three computational methods used in ChIP-seq data analysis, ChIP-seq processing pipeline (spp), PeakSeq and CisGenome, are reviewed. There is also a comparison of how they agree and disagree on finding peaks using the publicly available Signal Transducers and Activators of Transcription protein 1 (STAT1) and RNA polymerase II (PolII) datasets with corresponding negative controls.
PMCID: PMC3525234  PMID: 21296745
CHIP-Seq analysis; Next-generation sequencing; comparative analysis; bioinformatics
16.  Cell-type specificity of ChIP-predicted transcription factor binding sites 
BMC Genomics  2012;13:372.
Context-dependent transcription factor (TF) binding is one reason for differences in gene expression patterns between different cellular states. Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) identifies genome-wide TF binding sites for one particular context—the cells used in the experiment. But can such ChIP-seq data predict TF binding in other cellular contexts and is it possible to distinguish context-dependent from ubiquitous TF binding?
We compared ChIP-seq data on TF binding for multiple TFs in two different cell types and found that on average only a third of ChIP-seq peak regions are common to both cell types. As expected, common peaks occur more frequently in certain genomic contexts, such as CpG-rich promoters, whereas chromatin differences characterize cell-type specific TF binding. We also find, however, that genotype differences between the cell types can explain differences in binding. Moreover, ChIP-seq signal intensity and peak clustering are the strongest predictors of common peaks. Compared with strong peaks located in regions containing peaks for multiple transcription factors, weak and isolated peaks are less common between the cell types and are less associated with data that indicate regulatory activity.
Together, the results suggest that experimental noise is prevalent among weak peaks, whereas strong and clustered peaks represent high-confidence binding events that often occur in other cellular contexts. Nevertheless, 30-40% of the strongest and most clustered peaks show context-dependent regulation. We show that by combining signal intensity with additional data—ranging from context independent information such as binding site conservation and position weight matrix scores to context dependent chromatin structure—we can predict whether a ChIP-seq peak is likely to be present in other cellular contexts.
PMCID: PMC3574057  PMID: 22863112
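The "fraction of peaks common to both cell types" statistic can be illustrated with a small interval-overlap calculation. This is a sketch under simple assumptions: peaks are (chromosome, start, end) tuples with half-open coordinates, any overlap of at least one base counts as "common", and the quadratic pairwise comparison is only suitable for small inputs. The abstract does not state the overlap criterion used in the study, and the function names are hypothetical.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def fraction_common(peaks_a, peaks_b):
    """Fraction of peaks in peaks_a that overlap at least one peak in peaks_b."""
    if not peaks_a:
        return 0.0
    common = sum(1 for a in peaks_a if any(overlaps(a, b) for b in peaks_b))
    return common / len(peaks_a)


# Made-up peak lists for two cell types.
cell_type_1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
cell_type_2 = [("chr1", 150, 250), ("chr2", 300, 400)]
print(fraction_common(cell_type_1, cell_type_2))
```

For genome-scale peak sets, sorting the intervals per chromosome and sweeping through both lists once would replace the pairwise loop.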
17.  A global change in RNA polymerase II pausing during the Drosophila midblastula transition 
eLife  2013;2:e00861.
Massive zygotic transcription begins in many organisms during the midblastula transition when the cell cycle of the dividing egg slows down. A few genes are transcribed before this stage but how this differential activation is accomplished is still an open question. We have performed ChIP-seq experiments on tightly staged Drosophila embryos and show that massive recruitment of RNA polymerase II (Pol II) with widespread pausing occurs de novo during the midblastula transition. However, ∼100 genes are strongly occupied by Pol II before this timepoint and most of them do not show Pol II pausing, consistent with a requirement for rapid transcription during the fast nuclear cycles. This global change in Pol II pausing correlates with distinct core promoter elements and associates a TATA-enriched promoter with the rapid early transcription. This suggests that promoters are differentially used during the zygotic genome activation, presumably because they have distinct dynamic properties.
eLife digest
Fertilized eggs—zygotes—develop into embryos via several distinct stages. In many animals, the zygote initially undergoes rapid rounds of genome replication; however, this hectic activity is not controlled by the zygote itself. Instead, the mother deposits RNA molecules in the egg as it forms inside her, and after the egg has been fertilized, these RNA molecules are translated into proteins that guide the development of the early embryo. Only at a stage called midblastula transition does the zygote take over control by transcribing its own RNA molecules.
Fruit flies start to transcribe their own genes en masse after completing thirteen rounds of DNA replication. However, some genes are already transcribed during the rapid cycles of DNA replication earlier in development. How these early genes are transcribed, and how the embryo shifts to more widespread transcription during the midblastula transition, are not well understood. In particular, it is not known if the molecular machinery needed to transcribe the genes is recruited a long time before transcription starts, or if it is recruited ‘just in time’. Here, Chen et al. explore how genes are switched on in the fruit fly zygote.
Genes are transcribed by a protein complex called RNA polymerase, which binds to DNA sequences, called promoters, within the genes. Chen et al. used a technique called ChIP-Seq to determine how much RNA polymerase was bound to the DNA before, during and after the midblastula transition. Before the transition—from about eight rounds of DNA replication onward—RNA polymerase was bound to only about 100 genes, and was active in most of these cases. In contrast, after the transition, RNA polymerase had been recruited to the promoters of around 4000 genes (fruit flies have a total of about 14,000 genes). However, it was often found in a paused, rather than active, form, at these genes, which is thought to help ensure that their transcription can occur on a precise schedule.
Chen et al. then used computer analyses to test the theory that differences in the DNA sequences of the gene promoters might determine which genes the RNA polymerase bound to, and whether or not the polymerase underwent pausing or became active immediately. Strikingly, there were clear differences in the sequence motifs that recruited RNA polymerase to the promoters of genes that were transcribed immediately and those that showed pausing of the polymerase. Moreover, genes that were transcribed before the midblastula transition were shorter, on average, than those transcribed after. This suggests that transcription during the rapid genome replication cycles has to occur quickly and therefore lacks pausing. Together, these findings present a biological rationale for differences in how genes are first transcribed during fruit fly development.
PMCID: PMC3743134  PMID: 23951546
transcription; ChIP-seq; promoters; chromatin; zygotic genome activation; RNA polymerase pausing; D. melanogaster
18.  DDR complex facilitates global association of RNA Polymerase V to promoters and evolutionarily young transposons 
The plant-specific DNA-dependent RNA polymerase V (Pol V) evolved from Pol II to function in an RNA-directed DNA methylation pathway. Here, we have identified targets of Pol V in Arabidopsis thaliana on a genome-wide scale using ChIP-seq of NRPE1, the largest catalytic subunit of Pol V. We found that Pol V is enriched at promoters and evolutionarily recent transposons. This localization pattern is highly correlated with Pol V-dependent DNA methylation and small RNA accumulation. We also show that genome-wide chromatin association of Pol V is dependent on all members of a putative chromatin-remodeling complex termed DDR. Our study presents the first genome-wide view of Pol V occupancy and sheds light on the mechanistic basis of Pol V localization. Furthermore, these findings suggest a role for Pol V and RNA-directed DNA methylation in genome surveillance and in responding to genome evolution.
PMCID: PMC3443314  PMID: 22864289
19.  Integrative genome-wide chromatin signature analysis using finite mixture models 
BMC Genomics  2012;13(Suppl 6):S3.
Regulation of gene expression has been shown to involve not only the binding of transcription factors at target gene promoters but also the characteristics of the histones around which DNA is wrapped. Some histone modifications, for example di-methylated histone H3 at lysine 4 (H3K4me2), have been shown to be associated with promoters and the activation of target genes. However, no clear pattern has been shown to predict human promoters. This paper proposes a novel quantitative approach to characterize patterns of promoter regions and predict novel and alternative promoters. We utilized high-throughput data generated using chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-seq) on RNA Polymerase II (Pol-II) and H3K4me2. Common patterns of promoter regions are modeled using a mixture model involving double-exponential and uniform distributions. The fitted model is then used to search the entire genome for regions displaying similar patterns in order to find novel and alternative promoters. Regions with high correlations with the common patterns are identified as putative novel promoters. We used this proposed algorithm, RNA-seq data and several transcript databases to find alternative promoters in the MCF7 (normal breast cancer) cell line. We found 7,235 high-confidence regions that display the identified promoter patterns. Of these, 4,167 regions (58%) can be mapped to RefSeq regions. 2,444 regions are in a gene body or overlap with transcripts (non-coding RNAs, ESTs, and transcripts that are predicted by RNA-seq data). Some of these may be potential alternative promoters. We also found 193 regions that map to enhancer regions (represented by androgen and estrogen receptor binding sites) and other regulatory regions such as CTCF (CCCTC binding factor) sites and CpG islands.
Around 5% (431 regions) of these correlated regions do not overlap with any transcripts or regulatory regions, suggesting that these might be potential new promoters or markers for other, currently undiscovered annotations.
PMCID: PMC3481451  PMID: 23134707
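The pattern-matching idea in the abstract above (a double-exponential plus uniform mixture used as a promoter template, then correlated against candidate regions) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the parameter values, and the use of a plain Pearson correlation as the similarity score are all assumptions made for illustration.

```python
import math

def mixture_profile(positions, center, scale, window_width, peak_weight):
    """Template signal: peak_weight * Laplace(center, scale) + (1 - peak_weight) * Uniform(window).

    All parameter names are hypothetical; the abstract does not give the
    parameterization used in the paper.
    """
    uniform_density = 1.0 / window_width
    profile = []
    for x in positions:
        # Double-exponential (Laplace) density centered on the putative start site.
        laplace_density = math.exp(-abs(x - center) / scale) / (2.0 * scale)
        profile.append(peak_weight * laplace_density
                       + (1.0 - peak_weight) * uniform_density)
    return profile

def pearson(a, b):
    """Pearson correlation of two equal-length signals; 0.0 if either is constant."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

# Illustrative usage: score a candidate window against the template.
positions = list(range(-1000, 1001, 50))
template = mixture_profile(positions, center=0, scale=200,
                           window_width=2000, peak_weight=0.8)
candidate = mixture_profile(positions, center=600, scale=200,
                            window_width=2000, peak_weight=0.8)
similarity = pearson(template, candidate)
```

Under this sketch, a region would be flagged as a putative promoter when `similarity` exceeds a chosen cutoff; the cutoff itself is not stated in the abstract.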
20.  NELF and GAGA Factor Are Linked to Promoter-Proximal Pausing at Many Genes in Drosophila 
Molecular and Cellular Biology  2008;28(10):3290-3300.
Recent analyses of RNA polymerase II (Pol II) revealed that Pol II is concentrated at the promoters of many active and inactive genes. NELF causes Pol II to pause in the promoter-proximal region of the hsp70 gene in Drosophila melanogaster. In this study, genome-wide location analysis (chromatin immunoprecipitation-microarray chip [ChIP-chip] analysis) revealed that NELF is concentrated at the 5′ ends of 2,111 genes in Drosophila cells. Permanganate genomic footprinting was used to determine if paused Pol II colocalized with NELF. Forty-six of 56 genes with NELF were found to have paused Pol II. Pol II pauses 30 to 50 nucleotides downstream from transcription start sites. Analysis of DNA sequences in the vicinity of paused Pol II identified a conserved DNA sequence that probably associates with TFIID but detected no evidence of RNA secondary structures or other conserved sequences that might directly control elongation. ChIP-chip experiments indicate that GAGA factor associates with 39% of the genes that have NELF. Surprisingly, NELF associates with almost one-half of the most highly expressed genes, indicating that NELF is not necessarily a repressor of gene expression. NELF-associated pausing of Pol II might be an obligatory but sometimes transient checkpoint during the transcription cycle.
PMCID: PMC2423147  PMID: 18332113
21.  Evaluation of Algorithm Performance in ChIP-Seq Peak Detection 
PLoS ONE  2010;5(7):e11471.
Next-generation DNA sequencing coupled with chromatin immunoprecipitation (ChIP-seq) is revolutionizing our ability to interrogate whole genome protein-DNA interactions. Identification of protein binding sites from ChIP-seq data has required novel computational tools, distinct from those used for the analysis of ChIP-Chip experiments. The growing popularity of ChIP-seq spurred the development of many different analytical programs (at last count, we noted 31 open source methods), each with some purported advantage. Given that the literature is dense and empirical benchmarking challenging, selecting an appropriate method for ChIP-seq analysis has become a daunting task. Herein we compare the performance of eleven different peak calling programs on common empirical, transcription factor datasets and measure their sensitivity, accuracy and usability. Our analysis provides an unbiased critical assessment of available technologies, and should assist researchers in choosing a suitable tool for handling ChIP-seq data.
PMCID: PMC2900203  PMID: 20628599
22.  A Poisson mixture model to identify changes in RNA polymerase II binding quantity using high-throughput sequencing technology 
BMC Genomics  2008;9(Suppl 2):S23.
We present a mixture model-based analysis for identifying differences in the distribution of RNA polymerase II (Pol II) in transcribed regions, measured using ChIP-seq (chromatin immunoprecipitation followed by massively parallel sequencing). The statistical model assumes that the number of Pol II-targeted sequences contained within each genomic region follows a Poisson distribution. A Poisson mixture model was then developed to distinguish Pol II binding changes in transcribed regions using an empirical approach, with an expectation-maximization (EM) algorithm developed for estimation and inference. In order to achieve a global maximum in the M-step, a particle swarm optimization (PSO) was implemented. We applied this model to Pol II binding data generated from hormone-dependent MCF7 breast cancer cells and antiestrogen-resistant MCF7 breast cancer cells before and after treatment with 17β-estradiol (E2). We determined that in the hormone-dependent cells, ~9.9% (2527) of genes showed significant changes in Pol II binding after E2 treatment. However, only ~0.7% (172) of genes displayed significant Pol II binding changes in E2-treated antiestrogen-resistant cells. These results show that a Poisson mixture model can be used to analyze ChIP-seq data.
PMCID: PMC2559888  PMID: 18831789
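As a rough illustration of the approach in the abstract above, the following sketch fits a two-component Poisson mixture to per-region read counts with EM. It is a simplification under stated assumptions: the number of components, the starting values, and the function names are hypothetical, and the M-step uses the standard closed-form weighted-mean update rather than the particle swarm optimization the authors describe.

```python
import math

def poisson_logpmf(k, rate):
    """Log of the Poisson probability of observing count k at the given rate."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)

def fit_poisson_mixture(counts, rate_init=(1.0, 10.0), n_iter=200, tol=1e-9):
    """EM for a two-component Poisson mixture (illustrative, not the paper's model)."""
    rates = list(rate_init)
    weights = [0.5, 0.5]
    prev_loglik = -math.inf
    for _ in range(n_iter):
        # E-step: posterior probability that each count came from each component.
        resp = []
        loglik = 0.0
        for k in counts:
            logs = [math.log(weights[j]) + poisson_logpmf(k, rates[j])
                    for j in range(2)]
            top = max(logs)
            log_norm = top + math.log(sum(math.exp(v - top) for v in logs))
            loglik += log_norm
            resp.append([math.exp(v - log_norm) for v in logs])
        # M-step: closed-form updates (assumption: replaces the PSO step).
        for j in range(2):
            total = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = max(total / len(counts), 1e-12)
            weighted_sum = sum(r[j] * k for r, k in zip(resp, counts))
            rates[j] = max(weighted_sum / total, 1e-12)
        if abs(loglik - prev_loglik) < tol:
            break
        prev_loglik = loglik
    return weights, rates

# Illustrative usage on made-up counts: a low-occupancy and a high-occupancy group.
counts = [0, 1, 2, 1, 0, 2, 1, 20, 22, 19, 21, 18]
weights, rates = fit_poisson_mixture(counts)
```

The small floor values (`1e-12`) only keep the logarithms defined if a component collapses; they are a numerical convenience, not part of the model.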
23.  ChIPnorm: A Statistical Method for Normalizing and Identifying Differential Regions in Histone Modification ChIP-seq Libraries 
PLoS ONE  2012;7(8):e39573.
The advent of high-throughput technologies such as ChIP-seq has made possible the study of histone modifications. A problem of particular interest is the identification of regions of the genome where different cell types from the same organism exhibit different patterns of histone enrichment. This problem turns out to be surprisingly difficult, even in simple pairwise comparisons, because of the significant level of noise in ChIP-seq data. In this paper we propose a two-stage statistical method, called ChIPnorm, to normalize ChIP-seq data, and to find differential regions in the genome, given two libraries of histone modifications of different cell types. We show that the ChIPnorm method removes most of the noise and bias in the data and outperforms other normalization methods. We correlate the histone marks with gene expression data and confirm that histone modifications H3K27me3 and H3K4me3 act as a repressor and an activator of genes, respectively. Compared to what was previously reported in the literature, we find that a substantially higher fraction of bivalent marks in ES cells for H3K27me3 and H3K4me3 move into a K27-only state. We find that most of the promoter regions in protein-coding genes have differential histone-modification sites. The software for this work can be downloaded from
PMCID: PMC3411705  PMID: 22870189
24.  Utilizing gene pair orientations for HMM-based analysis of promoter array ChIP-chip data 
Bioinformatics  2009;25(16):2118-2125.
Motivation: Array-based analysis of chromatin immunoprecipitation (ChIP-chip) data is a powerful technique for identifying DNA target regions of individual transcription factors. The identification of these target regions from comprehensive promoter array ChIP-chip data is challenging. Here, three approaches for the identification of transcription factor target genes from promoter array ChIP-chip data are presented. We compare (i) a standard log-fold-change analysis (LFC); (ii) a basic method based on a Hidden Markov Model (HMM); and (iii) a new extension of the HMM approach to an HMM with scaled transition matrices (SHMM) that incorporates information about the relative orientation of adjacent gene pairs on DNA.
Results: All three methods are applied to different promoter array ChIP-chip datasets of the yeast Saccharomyces cerevisiae and the important model plant Arabidopsis thaliana to compare the prediction of transcription factor target genes. In the context of the yeast cell cycle, common target genes bound by the transcription factors ACE2 and SWI5, and ACE2 and FKH2 are identified and evaluated using the Saccharomyces Genome Database. Regarding A. thaliana, target genes of the seed-specific transcription factor ABI3 are predicted and evaluated based on publicly available gene expression profiles and transient assays performed in wet laboratory experiments. The application of the novel SHMM to these two different promoter array ChIP-chip datasets leads to an improved identification of transcription factor target genes in comparison to the two standard approaches LFC and HMM.
Availability: The software of LFC, HMM and SHMM, the ABI3 ChIP–chip dataset, and Supplementary Material can be downloaded from
PMCID: PMC2722995  PMID: 19401402
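The basic HMM approach named in the abstract above can be sketched as a two-state model (background vs. target) decoded with the Viterbi algorithm over per-promoter log-fold-change values. This is a generic illustration only: the Gaussian emissions, the parameter values, and the function names are assumptions, and the sketch does not include the scaled transition matrices that distinguish the authors' SHMM.

```python
import math

def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution with the given mean and standard deviation."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - (x - mean) ** 2 / (2.0 * sd * sd)

def viterbi(obs, log_start, log_trans, emit_params):
    """Most likely state path for a Gaussian-emission HMM (illustrative sketch).

    log_trans[p][s] is the log probability of moving from state p to state s;
    emit_params[s] is a hypothetical (mean, sd) pair for state s.
    """
    if not obs:
        return []
    n_states = len(log_start)
    score = [log_start[s] + gaussian_logpdf(obs[0], *emit_params[s])
             for s in range(n_states)]
    back = []
    for x in obs[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            # Best predecessor for state s at this position.
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log_trans[p][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + gaussian_logpdf(x, *emit_params[s]))
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Illustrative usage with made-up log-fold-change values along adjacent promoters.
log_fold_changes = [0.1, -0.2, 0.0, 2.1, 1.9, 2.3, 0.1, -0.1]
log_start = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.9), math.log(0.1)],
             [math.log(0.1), math.log(0.9)]]
emit_params = [(0.0, 0.5), (2.0, 0.5)]  # state 0: background, state 1: target
path = viterbi(log_fold_changes, log_start, log_trans, emit_params)
```

Promoters assigned to state 1 in `path` would be reported as predicted targets under these assumed parameters.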
25.  A High-Resolution Whole-Genome Map of Key Chromatin Modifications in the Adult Drosophila melanogaster 
PLoS Genetics  2011;7(12):e1002380.
Epigenetic research has been focused on cell-type-specific regulation; less is known about common features of epigenetic programming shared by diverse cell types within an organism. Here, we report a modified method for chromatin immunoprecipitation and deep sequencing (ChIP–Seq) and its use to construct a high-resolution map of the Drosophila melanogaster key histone marks, heterochromatin protein 1a (HP1a) and RNA polymerase II (polII). These factors are mapped at 50-bp resolution genome-wide and at 5-bp resolution for regulatory sequences of genes, which reveals fundamental features of chromatin modification landscape shared by major adult Drosophila cell types: the enrichment of both heterochromatic and euchromatic marks in transposons and repetitive sequences, the accumulation of HP1a at transcription start sites with stalled polII, the signatures of histone code and polII level/position around the transcriptional start sites that predict both the mRNA level and functionality of genes, and the enrichment of elongating polII within exons at splicing junctions. These features, likely conserved among diverse epigenomes, reveal general strategies for chromatin modifications.
Author Summary
Just as a genome sequence map is indispensable to genetic studies, an epigenome map is crucial for epigenetic research. This is especially true for a sophisticated genetic model such as Drosophila melanogaster, where the wealth of information on genetics and developmental biology awaits systematic epigenetic interpretation on a whole-genome scale. In this manuscript, we report a high-resolution map of key chromatin modifications in the Drosophila genome constructed by the ChIP–Seq approach. This map is derived from all cell types in the adult Drosophila weighted by their natural abundance. It contains key histone marks, HP1a and RNA polymerase II, mapped at 50-bp resolution throughout the genome and at 5-bp resolution for regulatory sequences of genes. It reveals striking features of chromatin modification and transcriptional regulation shared by major adult Drosophila cell types. We anticipate that this map and the salient chromatin modification landscapes revealed by this map should have broad utility to the fields of epigenetics, developmental biology, and stem cell biology.
PMCID: PMC3240582  PMID: 22194694
