PMCC PMCC

Search tips
Search criteria

Advanced
Results 1-25 (1049195)

Clipboard (0)
None

Related Articles

1.  MPromDb: an integrated resource for annotation and visualization of mammalian gene promoters and ChIP-chip experimental data 
Nucleic Acids Research  2005;34(Database issue):D98-D103.
We have developed Mammalian Promoter Database (MPromDb), a novel database that integrates gene promoters with experimentally supported annotation of transcription start sites, cis-regulatory elements, CpG islands and chromatin immunoprecipitation microarray (ChIP-chip) experimental results with intuitively designed presentation. Release 1.0 of MPromDb currently contains 36 407 promoters and first exons (19 170 from human, 15 953 from mouse and 1284 from rat), 3739 transcription factor (TF)-binding sites (2027 from human, 1181 mouse and 531 rat) and 224 TFs with links to PubMed and GenBank references. Target promoters of TFs that have been identified by ChIP-chip assay are integrated into the database. MPromDb serves as a portal for genome-wide promoter analysis of data generated by ChIP-chip experimental studies. MPromDb can be accessed from .
doi:10.1093/nar/gkj096
PMCID: PMC1347458  PMID: 16381984
2.  Annotation of gene promoters by integrative data-mining of ChIP-seq Pol-II enrichment data 
BMC Bioinformatics  2010;11(Suppl 1):S65.
Background
Use of alternative gene promoters that drive widespread cell-type, tissue-type or developmental gene regulation in mammalian genomes is a common phenomenon. Chromatin immunoprecipitation methods coupled with DNA microarray (ChIP-chip) or massive parallel sequencing (ChIP-seq) are enabling genome-wide identification of active promoters in different cellular conditions using antibodies against Pol-II. However, these methods produce enrichment not only near the gene promoters but also inside the genes and other genomic regions due to the non-specificity of the antibodies used in ChIP. Further, the use of these methods is limited by their high cost and strong dependence on cellular type and context.
Methods
We trained and tested different state-of-art ensemble and meta classification methods for identification of Pol-II enriched promoter and Pol-II enriched non-promoter sequences, each of length 500 bp. The classification models were trained and tested on a bench-mark dataset, using a set of 39 different feature variables that are based on chromatin modification signatures and various DNA sequence features. The best performing model was applied on seven published ChIP-seq Pol-II datasets to provide genome wide annotation of mouse gene promoters.
Results
We present a novel algorithm based on supervised learning methods to discriminate promoter associated Pol-II enrichment from enrichment elsewhere in the genome in ChIP-chip/seq profiles. We accumulated a dataset of 11,773 promoter and 46,167 non-promoter sequences, each of length 500 bp, generated from RNA Pol-II ChIP-seq data of five tissues (Brain, Kidney, Liver, Lung and Spleen). We evaluated the classification models in building the best predictor and found that Bagging and Random Forest based approaches give the best accuracy. We implemented the algorithm on seven different published ChIP-seq datasets to provide a comprehensive set of promoter annotations for both protein-coding and non-coding genes in the mouse genome. The resulting annotations contain 13,413 (4,747) protein-coding (non-coding) genes with single promoters and 9,929 (1,858) protein-coding (non-coding) genes with two or more alternative promoters, and a significant number of unassigned novel promoters.
Conclusion
Our new algorithm can successfully predict the promoters from the genome wide profile of Pol-II bound regions. In addition, our algorithm performs significantly better than existing promoter prediction methods and can be applied for genome-wide predictions of Pol-II promoters.
doi:10.1186/1471-2105-11-S1-S65
PMCID: PMC3009539  PMID: 20122241
3.  Long Non-Coding RNA and Alternative Splicing Modulations in Parkinson's Leukocytes Identified by RNA Sequencing 
PLoS Computational Biology  2014;10(3):e1003517.
The continuously prolonged human lifespan is accompanied by increase in neurodegenerative diseases incidence, calling for the development of inexpensive blood-based diagnostics. Analyzing blood cell transcripts by RNA-Seq is a robust means to identify novel biomarkers that rapidly becomes a commonplace. However, there is lack of tools to discover novel exons, junctions and splicing events and to precisely and sensitively assess differential splicing through RNA-Seq data analysis and across RNA-Seq platforms. Here, we present a new and comprehensive computational workflow for whole-transcriptome RNA-Seq analysis, using an updated version of the software AltAnalyze, to identify both known and novel high-confidence alternative splicing events, and to integrate them with both protein-domains and microRNA binding annotations. We applied the novel workflow on RNA-Seq data from Parkinson's disease (PD) patients' leukocytes pre- and post- Deep Brain Stimulation (DBS) treatment and compared to healthy controls. Disease-mediated changes included decreased usage of alternative promoters and N-termini, 5′-end variations and mutually-exclusive exons. The PD regulated FUS and HNRNP A/B included prion-like domains regulated regions. We also present here a workflow to identify and analyze long non-coding RNAs (lncRNAs) via RNA-Seq data. We identified reduced lncRNA expression and selective PD-induced changes in 13 of over 6,000 detected leukocyte lncRNAs, four of which were inversely altered post-DBS. These included the U1 spliceosomal lncRNA and RP11-462G22.1, each entailing sequence complementarity to numerous microRNAs. Analysis of RNA-Seq from PD and unaffected controls brains revealed over 7,000 brain-expressed lncRNAs, of which 3,495 were co-expressed in the leukocytes including U1, which showed both leukocyte and brain increases. Furthermore, qRT-PCR validations confirmed these co-increases in PD leukocytes and two brain regions, the amygdala and substantia-nigra, compared to controls. This novel workflow allows deep multi-level inspection of RNA-Seq datasets and provides a comprehensive new resource for understanding disease transcriptome modifications in PD and other neurodegenerative diseases.
Author Summary
Long non-coding RNAs (lncRNAs) comprise a novel, fascinating class of RNAs with largely unknown biological functions. Parkinson's-disease (PD) is the most frequent motor disorder, and Deep-brain-stimulation (DBS) treatment alleviates the symptoms, but early disease biomarkers are still unknown and new future genetic interference targets are urgently needed. Using RNA-sequencing technology and a novel computational workflow for in-depth exploration of whole-transcriptome RNA-seq datasets, we detected and analyzed lncRNAs in sequenced libraries from PD patients' leukocytes pre and post-treatment and the brain, adding this full profile resource of over 7,000 lncRNAs to the few human tissues-derived lncRNA datasets that are currently available. Our study includes sample-specific database construction, detecting disease-derived changes in known and novel lncRNAs, exons and junctions and predicting corresponding changes in Polyadenylation choices, protein domains and miRNA binding sites. We report widespread transcript structure variations at the splice junction and exons levels, including novel exons and junctions and alteration of lncRNAs followed by experimental validation in PD leukocytes and two PD brain regions compared with controls. Our results suggest lncRNAs involvement in neurodegenerative diseases, and specifically PD. This comprehensive workflow will be of use to the increasing number of laboratories producing RNA-Seq data in a wide range of biomedical studies.
doi:10.1371/journal.pcbi.1003517
PMCID: PMC3961179  PMID: 24651478
4.  Genome-wide mapping of RNA Pol-II promoter usage in mouse tissues by ChIP-seq 
Nucleic Acids Research  2010;39(1):190-201.
Alternative promoters that are differentially used in various cellular contexts and tissue types add to the transcriptional complexity in mammalian genome. Identification of alternative promoters and the annotation of their activity in different tissues is one of the major challenges in understanding the transcriptional regulation of the mammalian genes and their isoforms. To determine the use of alternative promoters in different tissues, we performed ChIP-seq experiments using antibody against RNA Pol-II, in five adult mouse tissues (brain, liver, lung, spleen and kidney). Our analysis identified 38 639 Pol-II promoters, including 12 270 novel promoters, for both protein coding and non-coding mouse genes. Of these, 6384 promoters are tissue specific which are CpG poor and we find that only 34% of the novel promoters are located in CpG-rich regions, suggesting that novel promoters are mostly tissue specific. By identifying the Pol-II bound promoter(s) of each annotated gene in a given tissue, we found that 37% of the protein coding genes use alternative promoters in the five mouse tissues. The promoter annotations and ChIP-seq data presented here will aid ongoing efforts of characterizing gene regulatory regions in mammalian genomes.
doi:10.1093/nar/gkq775
PMCID: PMC3017616  PMID: 20843783
5.  ChIPBase: a database for decoding the transcriptional regulation of long non-coding RNA and microRNA genes from ChIP-Seq data 
Nucleic Acids Research  2012;41(Database issue):D177-D187.
Long non-coding RNAs (lncRNAs) and microRNAs (miRNAs) represent two classes of important non-coding RNAs in eukaryotes. Although these non-coding RNAs have been implicated in organismal development and in various human diseases, surprisingly little is known about their transcriptional regulation. Recent advances in chromatin immunoprecipitation with next-generation DNA sequencing (ChIP-Seq) have provided methods of detecting transcription factor binding sites (TFBSs) with unprecedented sensitivity. In this study, we describe ChIPBase (http://deepbase.sysu.edu.cn/chipbase/), a novel database that we have developed to facilitate the comprehensive annotation and discovery of transcription factor binding maps and transcriptional regulatory relationships of lncRNAs and miRNAs from ChIP-Seq data. The current release of ChIPBase includes high-throughput sequencing data that were generated by 543 ChIP-Seq experiments in diverse tissues and cell lines from six organisms. By analysing millions of TFBSs, we identified tens of thousands of TF-lncRNA and TF-miRNA regulatory relationships. Furthermore, two web-based servers were developed to annotate and discover transcriptional regulatory relationships of lncRNAs and miRNAs from ChIP-Seq data. In addition, we developed two genome browsers, deepView and genomeView, to provide integrated views of multidimensional data. Moreover, our web implementation supports diverse query types and the exploration of TFs, lncRNAs, miRNAs, gene ontologies and pathways.
doi:10.1093/nar/gks1060
PMCID: PMC3531181  PMID: 23161675
6.  Vespucci: a system for building annotated databases of nascent transcripts 
Nucleic Acids Research  2013;42(4):2433-2447.
Global run-on sequencing (GRO-seq) is a recent addition to the series of high-throughput sequencing methods that enables new insights into transcriptional dynamics within a cell. However, GRO-sequencing presents new algorithmic challenges, as existing analysis platforms for ChIP-seq and RNA-seq do not address the unique problem of identifying transcriptional units de novo from short reads located all across the genome. Here, we present a novel algorithm for de novo transcript identification from GRO-sequencing data, along with a system that determines transcript regions, stores them in a relational database and associates them with known reference annotations. We use this method to analyze GRO-sequencing data from primary mouse macrophages and derive novel quantitative insights into the extent and characteristics of non-coding transcription in mammalian cells. In doing so, we demonstrate that Vespucci expands existing annotations for mRNAs and lincRNAs by defining the primary transcript beyond the polyadenylation site. In addition, Vespucci generates assemblies for un-annotated non-coding RNAs such as those transcribed from enhancer-like elements. Vespucci thereby provides a robust system for defining, storing and analyzing diverse classes of primary RNA transcripts that are of increasing biological interest.
doi:10.1093/nar/gkt1237
PMCID: PMC3936758  PMID: 24304890
7.  Most “Dark Matter” Transcripts Are Associated With Known Genes 
PLoS Biology  2010;8(5):e1000371.
Short-read RNA sequencing in mouse and human tissues shows that most transcripts are encoded within or nearby known genes and that most of the genome is not transcribed.
A series of reports over the last few years have indicated that a much larger portion of the mammalian genome is transcribed than can be accounted for by currently annotated genes, but the quantity and nature of these additional transcripts remains unclear. Here, we have used data from single- and paired-end RNA-Seq and tiling arrays to assess the quantity and composition of transcripts in PolyA+ RNA from human and mouse tissues. Relative to tiling arrays, RNA-Seq identifies many fewer transcribed regions (“seqfrags”) outside known exons and ncRNAs. Most nonexonic seqfrags are in introns, raising the possibility that they are fragments of pre-mRNAs. The chromosomal locations of the majority of intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and terminator-associated transcripts, or new alternative exons; indeed, reads that bridge splice sites identified 4,544 new exons, affecting 3,554 genes. Most of the remaining seqfrags correspond to either single reads that display characteristics of random sampling from a low-level background or several thousand small transcripts (median length = 111 bp) present at higher levels, which also tend to display sequence conservation and originate from regions with open chromatin. We conclude that, while there are bona fide new intergenic transcripts, their number and abundance is generally low in comparison to known exons, and the genome is not as pervasively transcribed as previously reported.
Author Summary
The human genome was sequenced a decade ago, but its exact gene composition remains a subject of debate. The number of protein-coding genes is much lower than initially expected, and the number of distinct transcripts is much larger than the number of protein-coding genes. Moreover, the proportion of the genome that is transcribed in any given cell type remains an open question: results from “tiling” microarray analyses suggest that transcription is pervasive and that most of the genome is transcribed, whereas new deep sequencing-based methods suggest that most transcripts originate from known genes. We have addressed this discrepancy by comparing samples from the same tissues using both technologies. Our analyses indicate that RNA sequencing appears more reliable for transcripts with low expression levels, that most transcripts correspond to known genes or are near known genes, and that many transcripts may represent new exons or aberrant products of the transcription process. We also identify several thousand small transcripts that map outside known genes; their sequences are often conserved and are often encoded in regions of open chromatin. We propose that most of these transcripts may be by-products of the activity of enhancers, which associate with promoters as part of their role as long-range gene regulatory sites. Overall, however, we find that most of the genome is not appreciably transcribed.
doi:10.1371/journal.pbio.1000371
PMCID: PMC2872640  PMID: 20502517
8.  Discovering Transcription Factor Binding Sites in Highly Repetitive Regions of Genomes with Multi-Read Analysis of ChIP-Seq Data 
PLoS Computational Biology  2011;7(7):e1002111.
Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is rapidly replacing chromatin immunoprecipitation combined with genome-wide tiling array analysis (ChIP-chip) as the preferred approach for mapping transcription-factor binding sites and chromatin modifications. The state of the art for analyzing ChIP-seq data relies on using only reads that map uniquely to a relevant reference genome (uni-reads). This can lead to the omission of up to 30% of alignable reads. We describe a general approach for utilizing reads that map to multiple locations on the reference genome (multi-reads). Our approach is based on allocating multi-reads as fractional counts using a weighted alignment scheme. Using human STAT1 and mouse GATA1 ChIP-seq datasets, we illustrate that incorporation of multi-reads significantly increases sequencing depths, leads to detection of novel peaks that are not otherwise identifiable with uni-reads, and improves detection of peaks in mappable regions. We investigate various genome-wide characteristics of peaks detected only by utilization of multi-reads via computational experiments. Overall, peaks from multi-read analysis have similar characteristics to peaks that are identified by uni-reads except that the majority of them reside in segmental duplications. We further validate a number of GATA1 multi-read only peaks by independent quantitative real-time ChIP analysis and identify novel target genes of GATA1. These computational and experimental results establish that multi-reads can be of critical importance for studying transcription factor binding in highly repetitive regions of genomes with ChIP-seq experiments.
Author Summary
Annotating repetitive regions of genomes experimentally is a challenging task. Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) provides valuable data for characterizing repetitive regions of genomes in terms of transcription factor binding. Although ChIP-seq technology has been maturing, available ChIP-seq analysis methods and software rely on discarding sequence reads that map to multiple locations on the reference genome (multi-reads), thereby generating a missed opportunity for assessing transcription factor binding to highly repetitive regions of genomes. We develop a computational algorithm that takes multi-reads into account in ChIP-seq analysis. We show with computational experiments that multi-reads lead to significant increase in sequencing depths and identification of binding regions that are otherwise not identifiable when only reads that uniquely map to the reference genome (uni-reads) are used. In particular, we show that the number of binding regions identified can increase up to 36%. We support our computational predictions with independent quantitative real-time ChIP validation of binding regions identified only when multi-reads are incorporated in the analysis of a mouse GATA1 ChIP-seq experiment.
doi:10.1371/journal.pcbi.1002111
PMCID: PMC3136429  PMID: 21779159
9.  A map of the cis-regulatory sequences in the mouse genome 
Nature  2012;488(7409):116-120.
The laboratory mouse is the most widely used mammalian model organism in biomedical research. The 2.6 × 109 bases of the mouse genome possess a high degree of conservation with the human genome1, so a thorough annotation of the mouse genome will be of significant value to understanding the function of the human genome. So far, most of the functional sequences in the mouse genome have yet to be found, and the cis-regulatory sequences in particular are still poorly annotated. Comparative genomics has been a powerful tool for the discovery of these sequences2, but on its own it cannot resolve their temporal and spatial functions. Recently, ChIP-Seq has been developed to identify cis-regulatory elements in the genomes of several organisms including humans, Drosophila melanogaster and Caenorhabditis elegans3–5. Here we apply the same experimental approach to a diverse set of 19 tissues and cell types in the mouse to produce a map of nearly 300,000 murine cis-regulatory sequences. The annotated sequences add up to 11% of the mouse genome, and include more than 70% of conserved non-coding sequences. We define tissue-specific enhancers and identify potential transcription factors regulating gene expression in each tissue or cell type. Finally, we show that much of the mouse genome is organized in to domains of coordinately regulated enhancers and promoters. Our results provide a resource for the annotation of functional elements in the mammalian genome and for the study of mechanisms regulating tissue-specific gene expression.
doi:10.1038/nature11243
PMCID: PMC4041622  PMID: 22763441
10.  Integrative genome-wide chromatin signature analysis using finite mixture models 
BMC Genomics  2012;13(Suppl 6):S3.
Regulation of gene expression has been shown to involve not only the binding of transcription factor at target gene promoters but also the characterization of histone around which DNA is wrapped around. Some histone modification, for example di-methylated histone H3 at lysine 4 (H3K4me2), has been shown to bind to promoters and activate target genes. However, no clear pattern has been shown to predict human promoters. This paper proposed a novel quantitative approach to characterize patterns of promoter regions and predict novel and alternative promoters. We utilized high-throughput data generated using chromatin immunoprecipitation methods followed by massively parallel sequencing (ChIP-seq) technology on RNA Polymerase II (Pol-II) and H3K4me2. Common patterns of promoter regions are modeled using a mixture model involving double-exponential and uniform distributions. The fitted model obtained were then used to search for regions displaying similar patterns over the entire genome to find novel and alternative promoters. Regions with high correlations with the common patterns are identified as putative novel promoters. We used this proposed algorithm, RNA-seq data and several transcripts databases to find alternative promoters in MCF7 (normal breast cancer) cell line. We found 7,235 high-confidence regions that display the identified promoter patterns. Of these, 4,167 regions (58%) can be mapped to RefSeq regions. 2,444 regions are in a gene body or overlap with transcripts (non-coding RNAs, ESTs, and transcripts that are predicted by RNA-seq data). Some of these maybe potential alternative promoters. We also found 193 regions that map to enhancer regions (represented by androgen and estrogen receptor binding sites) and other regulatory regions such as CTCF (CCCTC binding factor) and CpG island. Around 5% (431 regions) of these correlated regions do not overlap with any transcripts or regulatory regions suggesting that these might be potential new promoters or markers for other annotation which are currently undiscovered.
doi:10.1186/1471-2164-13-S6-S3
PMCID: PMC3481451  PMID: 23134707
11.  ChIPpeakAnno: a Bioconductor package to annotate ChIP-seq and ChIP-chip data 
BMC Bioinformatics  2010;11:237.
Background
Chromatin immunoprecipitation (ChIP) followed by high-throughput sequencing (ChIP-seq) or ChIP followed by genome tiling array analysis (ChIP-chip) have become standard technologies for genome-wide identification of DNA-binding protein target sites. A number of algorithms have been developed in parallel that allow identification of binding sites from ChIP-seq or ChIP-chip datasets and subsequent visualization in the University of California Santa Cruz (UCSC) Genome Browser as custom annotation tracks. However, summarizing these tracks can be a daunting task, particularly if there are a large number of binding sites or the binding sites are distributed widely across the genome.
Results
We have developed ChIPpeakAnno as a Bioconductor package within the statistical programming environment R to facilitate batch annotation of enriched peaks identified from ChIP-seq, ChIP-chip, cap analysis of gene expression (CAGE) or any experiments resulting in a large number of enriched genomic regions. The binding sites annotated with ChIPpeakAnno can be viewed easily as a table, a pie chart or plotted in histogram form, i.e., the distribution of distances to the nearest genes for each set of peaks. In addition, we have implemented functionalities for determining the significance of overlap between replicates or binding sites among transcription factors within a complex, and for drawing Venn diagrams to visualize the extent of the overlap between replicates. Furthermore, the package includes functionalities to retrieve sequences flanking putative binding sites for PCR amplification, cloning, or motif discovery, and to identify Gene Ontology (GO) terms associated with adjacent genes.
Conclusions
ChIPpeakAnno enables batch annotation of the binding sites identified from ChIP-seq, ChIP-chip, CAGE or any technology that results in a large number of enriched genomic regions within the statistical programming environment R. Allowing users to pass their own annotation data such as a different Chromatin immunoprecipitation (ChIP) preparation and a dataset from literature, or existing annotation packages, such as GenomicFeatures and BSgenome, provides flexibility. Tight integration to the biomaRt package enables up-to-date annotation retrieval from the BioMart database.
doi:10.1186/1471-2105-11-237
PMCID: PMC3098059  PMID: 20459804
12.  High Resolution Models of Transcription Factor-DNA Affinities Improve In Vitro and In Vivo Binding Predictions 
PLoS Computational Biology  2010;6(9):e1000916.
Accurately modeling the DNA sequence preferences of transcription factors (TFs), and using these models to predict in vivo genomic binding sites for TFs, are key pieces in deciphering the regulatory code. These efforts have been frustrated by the limited availability and accuracy of TF binding site motifs, usually represented as position-specific scoring matrices (PSSMs), which may match large numbers of sites and produce an unreliable list of target genes. Recently, protein binding microarray (PBM) experiments have emerged as a new source of high resolution data on in vitro TF binding specificities. PBM data has been analyzed either by estimating PSSMs or via rank statistics on probe intensities, so that individual sequence patterns are assigned enrichment scores (E-scores). This representation is informative but unwieldy because every TF is assigned a list of thousands of scored sequence patterns. Meanwhile, high-resolution in vivo TF occupancy data from ChIP-seq experiments is also increasingly available. We have developed a flexible discriminative framework for learning TF binding preferences from high resolution in vitro and in vivo data. We first trained support vector regression (SVR) models on PBM data to learn the mapping from probe sequences to binding intensities. We used a novel -mer based string kernel called the di-mismatch kernel to represent probe sequence similarities. The SVR models are more compact than E-scores, more expressive than PSSMs, and can be readily used to scan genomics regions to predict in vivo occupancy. Using a large data set of yeast and mouse TFs, we found that our SVR models can better predict probe intensity than the E-score method or PBM-derived PSSMs. Moreover, by using SVRs to score yeast, mouse, and human genomic regions, we were better able to predict genomic occupancy as measured by ChIP-chip and ChIP-seq experiments. Finally, we found that by training kernel-based models directly on ChIP-seq data, we greatly improved in vivo occupancy prediction, and by comparing a TF's in vitro and in vivo models, we could identify cofactors and disambiguate direct and indirect binding.
Author Summary
Transcription factors (TFs) are proteins that bind sites in the non-coding DNA and regulate the expression of targeted genes. Being able to predict the genome-wide binding locations of TFs is an important step in deciphering gene regulatory networks. Historically, there was very limited experimental data on the DNA-binding preferences of most TFs. Computational biologists used known sites to estimate simple binding site motifs, called position-specific scoring matrices, and scan the genome for additional potential binding locations, but this approach often led to many false positive predictions. Here we introduce a machine learning approach to leverage new high resolution data on the binding preferences of TFs, namely, protein binding microarray (PBM) experiments which measure the in vitro binding affinities of TFs with respect to an array of double-stranded DNA probes, and chromatin immunoprecipitation experiments followed by next generation sequencing (ChIP-seq) which measure in vivo genome-wide binding of TFs in a given cell type. We show that by training statistical models on high resolution PBM and ChIP-seq data, we can more accurately represent the subtle DNA binding preferences of TFs and predict their genome-wide binding locations. These results will enable advances in the computational analysis of transcriptional regulation in mammalian genomes.
doi:10.1371/journal.pcbi.1000916
PMCID: PMC2936517  PMID: 20838582
13.  NPEBseq: nonparametric empirical bayesian-based procedure for differential expression analysis of RNA-seq data 
BMC Bioinformatics  2013;14:262.
Background
RNA-seq, a massive parallel-sequencing-based transcriptome profiling method, provides digital data in the form of aligned sequence read counts. The comparative analyses of the data require appropriate statistical methods to estimate the differential expression of transcript variants across different cell/tissue types and disease conditions.
Results
We developed a novel nonparametric empirical Bayesian-based approach (NPEBseq) to model the RNA-seq data. The prior distribution of the Bayesian model is empirically estimated from the data without any parametric assumption, and hence the method is “nonparametric” in nature. Based on this model, we proposed a method for detecting differentially expressed genes across different conditions. We also extended this method to detect differential usage of exons from RNA-seq data. The evaluation of NPEBseq on both simulated and publicly available RNA-seq datasets and comparison with three popular methods showed improved results for experiments with or without biological replicates.
Conclusions
NPEBseq can successfully detect differential expression between different conditions not only at gene level but also at exon level from RNA-seq datasets. In addition, NPEBSeq performs significantly better than current methods and can be applied to genome-wide RNA-seq datasets. Sample datasets and R package are available at http://bioinformatics.wistar.upenn.edu/NPEBseq.
doi:10.1186/1471-2105-14-262
PMCID: PMC3765716  PMID: 23981227
14.  Update of the FANTOM web resource: from mammalian transcriptional landscape to its dynamic regulation 
Nucleic Acids Research  2010;39(Database issue):D856-D860.
The international Functional Annotation Of the Mammalian Genomes 4 (FANTOM4) research collaboration set out to better understand the transcriptional network that regulates macrophage differentiation and to uncover novel components of the transcriptome employing a series of high-throughput experiments. The primary and unique technique is cap analysis of gene expression (CAGE), sequencing mRNA 5′-ends with a second-generation sequencer to quantify promoter activities even in the absence of gene annotation. Additional genome-wide experiments complement the setup including short RNA sequencing, microarray gene expression profiling on large-scale perturbation experiments and ChIP–chip for epigenetic marks and transcription factors. All the experiments are performed in a differentiation time course of the THP-1 human leukemic cell line. Furthermore, we performed a large-scale mammalian two-hybrid (M2H) assay between transcription factors and monitored their expression profile across human and mouse tissues with qRT-PCR to address combinatorial effects of regulation by transcription factors. These interdependent data have been analyzed individually and in combination with each other and are published in related but distinct papers. We provide all data together with systematic annotation in an integrated view as resource for the scientific community (http://fantom.gsc.riken.jp/4/). Additionally, we assembled a rich set of derived analysis results including published predicted and validated regulatory interactions. Here we introduce the resource and its update after the initial release.
doi:10.1093/nar/gkq1112
PMCID: PMC3013704  PMID: 21075797
15.  HiChIP: a high-throughput pipeline for integrative analysis of ChIP-Seq data 
BMC Bioinformatics  2014;15(1):280.
Background
Chromatin immunoprecipitation (ChIP) followed by next-generation sequencing (ChIP-Seq) has been widely used to identify genomic loci of transcription factor (TF) binding and histone modifications. ChIP-Seq data analysis involves multiple steps from read mapping and peak calling to data integration and interpretation. It remains challenging and time-consuming to process large amounts of ChIP-Seq data derived from different antibodies or experimental designs using the same approach. To address this challenge, there is a need for a comprehensive analysis pipeline with flexible settings to accelerate the utilization of this powerful technology in epigenetics research.
Results
We have developed a highly integrative pipeline, termed HiChIP for systematic analysis of ChIP-Seq data. HiChIP incorporates several open source software packages selected based on internal assessments and published comparisons. It also includes a set of tools developed in-house. This workflow enables the analysis of both paired-end and single-end ChIP-Seq reads, with or without replicates for the characterization and annotation of both punctate and diffuse binding sites. The main functionality of HiChIP includes: (a) read quality checking; (b) read mapping and filtering; (c) peak calling and peak consistency analysis; and (d) result visualization. In addition, this pipeline contains modules for generating binding profiles over selected genomic features, de novo motif finding from transcription factor (TF) binding sites and functional annotation of peak associated genes.
Conclusions
HiChIP is a comprehensive analysis pipeline that can be configured to analyze ChIP-Seq data derived from varying antibodies and experiment designs. Using public ChIP-Seq data we demonstrate that HiChIP is a fast and reliable pipeline for processing large amounts of ChIP-Seq data.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2105-15-280) contains supplementary material, which is available to authorized users.
doi:10.1186/1471-2105-15-280
PMCID: PMC4152589  PMID: 25128017
ChIP-Seq; Next-generation sequencing; Peak calling; Duplicate filtering; Irreproducible discovery rate
16.  Genome-Wide Identification of Bcl11b Gene Targets Reveals Role in Brain-Derived Neurotrophic Factor Signaling 
PLoS ONE  2011;6(9):e23691.
B-cell leukemia/lymphoma 11B (Bcl11b) is a transcription factor showing predominant expression in the striatum. To date, there are no known gene targets of Bcl11b in the nervous system. Here, we define targets for Bcl11b in striatal cells by performing chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) in combination with genome-wide expression profiling. Transcriptome-wide analysis revealed that 694 genes were significantly altered in striatal cells over-expressing Bcl11b, including genes showing striatal-enriched expression similar to Bcl11b. ChIP-seq analysis demonstrated that Bcl11b bound a mixture of coding and non-coding sequences that were within 10 kb of the transcription start site of an annotated gene. Integrating all ChIP-seq hits with the microarray expression data, 248 direct targets of Bcl11b were identified. Functional analysis on the integrated gene target list identified several zinc-finger encoding genes as Bcl11b targets, and further revealed a significant association of Bcl11b to brain-derived neurotrophic factor/neurotrophin signaling. Analysis of ChIP-seq binding regions revealed significant consensus DNA binding motifs for Bcl11b. These data implicate Bcl11b as a novel regulator of the BDNF signaling pathway, which is disrupted in many neurological disorders. Specific targeting of the Bcl11b-DNA interaction could represent a novel therapeutic approach to lowering BDNF signaling specifically in striatal cells.
doi:10.1371/journal.pone.0023691
PMCID: PMC3164671  PMID: 21912641
17.  All and only CpG containing sequences are enriched in promoters abundantly bound by RNA polymerase II in multiple tissues 
BMC Genomics  2008;9:67.
Background
The promoters of housekeeping genes are well-bound by RNA polymerase II (RNAP) in different tissues. Although the promoters of these genes are known to contain CpG islands, the specific DNA sequences that are associated with high RNAP binding to housekeeping promoters has not been described.
Results
ChIP-chip experiments from three mouse tissues, liver, heart ventricles, and primary keratinocytes, indicate that 94% of promoters have similar RNAP binding, ranging from well-bound to poorly-bound in all tissues. Using all 8-base pair long sequences as a test set, we have identified the DNA sequences that are enriched in promoters of housekeeping genes, focusing on those DNA sequences which are preferentially localized in the proximal promoter. We observe a bimodal distribution. Virtually all sequences enriched in promoters with high RNAP binding values contain a CpG dinucleotide. These results suggest that only transcription factor binding sites (TFBS) that contain the CpG dinucleotide are involved in RNAP binding to housekeeping promoters while TFBS that do not contain a CpG are involved in regulated promoter activity. Abundant 8-mers that are preferentially localized in the proximal promoters and exhibit the best enrichment in RNAP bound promoters are all variants of six known CpG-containing TFBS: ETS, NRF-1, BoxA, SP1, CRE, and E-Box. The frequency of these six DNA motifs can predict housekeeping promoters as accurately as the presence of a CpG island, suggesting that they are the structural elements critical for CpG island function. Experimental EMSA results demonstrate that methylation of the CpG in the ETS, NRF-1, and SP1 motifs prevent DNA binding in nuclear extracts in both keratinocytes and liver.
Conclusion
In general, TFBS that do not contain a CpG are involved in regulated gene expression while TFBS that contain a CpG are involved in constitutive gene expression with some CpG containing sequences also involved in inducible and tissue specific gene regulation. These TFBS are not bound when the CpG is methylated. Unmethylated CpG dinucleotides in the TFBS in CpG islands allow the transcription factors to find their binding sites which occur only in promoters, in turn localizing RNAP to promoters.
doi:10.1186/1471-2164-9-67
PMCID: PMC2267717  PMID: 18252004
18.  High-resolution transcriptome and genome-wide dynamics of RNA polymerase and NusA in Mycobacterium tuberculosis 
Nucleic Acids Research  2012;41(2):961-977.
To construct a regulatory map of the genome of the human pathogen, Mycobacterium tuberculosis, we applied two complementary high-resolution approaches: strand-specific RNA-seq, to survey the global transcriptome, and ChIP-seq, to monitor the genome-wide dynamics of RNA polymerase (RNAP) and the anti-terminator NusA. Although NusA does not bind directly to DNA, but rather to RNAP and/or to the nascent transcript, we demonstrate that NusA interacts with RNAP ubiquitously throughout the chromosome, and that its profile mirrors RNAP distribution in both the exponential and stationary phases of growth. Generally, promoter-proximal peaks for RNAP and NusA were observed, followed by a decrease in signal strength reflecting transcriptional polarity. Differential binding of RNAP and NusA in the two growth conditions correlated with transcriptional activity as reflected by RNA abundance. Indeed, a significant association between expression levels and the presence of NusA throughout the gene body was detected, confirming the peculiar transcription-promoting role of NusA. Integration of the data sets pinpointed transcriptional units, mapped promoters and uncovered new anti-sense and non-coding transcripts. Highly expressed transcriptional units were situated mainly on the leading strand, despite the relatively unbiased distribution of genes throughout the genome, thus helping the replicative and transcriptional complexes to align.
doi:10.1093/nar/gks1260
PMCID: PMC3553938  PMID: 23222129
19.  RhesusBase: a knowledgebase for the monkey research community 
Nucleic Acids Research  2012;41(Database issue):D892-D905.
Although the rhesus macaque is a unique model for the translational study of human diseases, currently its use in biomedical research is still in its infant stage due to error-prone gene structures and limited annotations. Here, we present RhesusBase for the monkey research community (http://www.rhesusbase.org). We performed strand-specific RNA-Seq studies in 10 macaque tissues and generated 1.2 billion 90-bp paired-end reads, covering >97.4% of the putative exon in macaque transcripts annotated by Ensembl. We found that at least 28.7% of the macaque transcripts were previously mis-annotated, mainly due to incorrect exon–intron boundaries, incomplete untranslated regions (UTRs) and missed exons. Compared with the previous gene models, the revised transcripts show clearer sequence motifs near splicing junctions and the end of UTRs, as well as cleaner patterns of exon–intron distribution for expression tags and cross-species conservation scores. Strikingly, 1292 exon–intron boundary revisions between coding exons corrected the previously mis-annotated open reading frames. The revised gene models were experimentally verified in randomly selected cases. We further integrated functional genomics annotations from >60 categories of public and in-house resources and developed an online accessible database. User-friendly interfaces were developed to update, retrieve, visualize and download the RhesusBase meta-data, providing a ‘one-stop’ resource for the monkey research community.
doi:10.1093/nar/gks835
PMCID: PMC3531163  PMID: 22965133
20.  Finding the active genes in deep RNA-seq gene expression studies 
BMC Genomics  2013;14:778.
Background
Early application of second-generation sequencing technologies to transcript quantitation (RNA-seq) has hinted at a vast mammalian transcriptome, including transcripts from nearly all known genes, which might be fully measured only by ultradeep sequencing. Subsequent studies suggested that low-abundance transcripts might be the result of technical or biological noise rather than active transcripts; moreover, most RNA-seq experiments did not provide enough read depth to generate high-confidence estimates of gene expression for low-abundance transcripts. As a result, the community adopted several heuristics for RNA-seq analysis, most notably an arbitrary expression threshold of 0.3 - 1 FPKM for downstream analysis. However, advances in RNA-seq library preparation, sequencing technology, and informatic analysis have addressed many of the systemic sources of uncertainty and undermined the assumptions that drove the adoption of these heuristics. We provide an updated view of the accuracy and efficiency of RNA-seq experiments, using genomic data from large-scale studies like the ENCODE project to provide orthogonal information against which to validate our conclusions.
Results
We show that a human cell’s transcriptome can be divided into active genes carrying out the work of the cell and other genes that are likely the by-products of biological or experimental noise. We use ENCODE data on chromatin state to show that ultralow-expression genes are predominantly associated with repressed chromatin; we provide a novel normalization metric, zFPKM, that identifies the threshold between active and background gene expression; and we show that this threshold is robust to experimental and analytical variations.
Conclusions
The zFPKM normalization method accurately separates the biologically relevant genes in a cell, which are associated with active promoters, from the ultralow-expression noisy genes that have repressed promoters. A read depth of twenty to thirty million mapped reads allows high-confidence quantitation of genes expressed at this threshold, providing important guidance for the design of RNA-seq studies of gene expression. Moreover, we offer an example for using extensive ENCODE chromatin state information to validate RNA-seq analysis pipelines.
doi:10.1186/1471-2164-14-778
PMCID: PMC3870982  PMID: 24215113
21.  DUX4 Binding to Retroelements Creates Promoters That Are Active in FSHD Muscle and Testis 
PLoS Genetics  2013;9(11):e1003947.
The human double-homeodomain retrogene DUX4 is expressed in the testis and epigenetically repressed in somatic tissues. Facioscapulohumeral muscular dystrophy (FSHD) is caused by mutations that decrease the epigenetic repression of DUX4 in somatic tissues and result in mis-expression of this transcription factor in skeletal muscle. DUX4 binds sites in the human genome that contain a double-homeobox sequence motif, including sites in unique regions of the genome as well as many sites in repetitive elements. Using ChIP-seq and RNA-seq on myoblasts transduced with DUX4 we show that DUX4 binds and activates transcription of mammalian apparent LTR-retrotransposons (MaLRs), endogenous retrovirus (ERVL and ERVK) elements, and pericentromeric satellite HSATII sequences. Some DUX4-activated MaLR and ERV elements create novel promoters for genes, long non-coding RNAs, and antisense transcripts. Many of these novel transcripts are expressed in FSHD muscle cells but not control cells, and thus might contribute to FSHD pathology. For example, HEY1, a repressor of myogenesis, is activated by DUX4 through a MaLR promoter. DUX4-bound motifs, including those in repetitive elements, show evolutionary conservation and some repeat-initiated transcripts are expressed in healthy testis, the normal expression site of DUX4, but more rarely in other somatic tissues. Testis expression patterns are known to have evolved rapidly in mammals, but the mechanisms behind this rapid change have not yet been identified: our results suggest that mobilization of MaLR and ERV elements during mammalian evolution altered germline gene expression patterns through transcriptional activation by DUX4. Our findings demonstrate a role for DUX4 and repetitive elements in mammalian germline evolution and in FSHD muscular dystrophy.
Author Summary
Transposable elements (TEs) are found in most genomes, and many TEs create extra copies of themselves in new genomic locations by a process called retrotransposition. TEs are often thought of as genomic parasites that must be suppressed, because retrotransposition can cause great harm to their host organism. However, during evolution, the functions encoded by TEs have sometimes been co-opted to the advantage of the host genome as novel genes or as gene regulatory regions. We studied a human transcription factor called DUX4 that is normally expressed in testis and repressed in muscle. Sometimes muscle repression fails, causing the disease facioscapulohumeral muscular dystrophy (FSHD). We find that DUX4 binds many TE types and can activate their transcription. Some activated TEs have been co-opted as novel promoters for human genes. DUX4's activation of these genes via TEs might be important in the biology of normal testis and may contribute to the FSHD disease process. Our findings raise the possibility that DUX4 and TEs co-evolved, as TEs may have hijacked DUX4 to aid their retrotransposition while DUX4 may have utilized TEs to modify its transcriptional network in the evolving germline.
doi:10.1371/journal.pgen.1003947
PMCID: PMC3836709  PMID: 24278031
22.  W-ChIPeaks: a comprehensive web application tool for processing ChIP-chip and ChIP-seq data 
Bioinformatics  2010;27(3):428-430.
Summary: ChIP-based technology is becoming the leading technology to globally profile thousands of transcription factors and elucidate the transcriptional regulation mechanisms in living cells. It has evolved rapidly in recent years, from hybridization with spotted or tiling microarray (ChIP-chip), to pair-end tag sequencing (ChIP-PET), to current massively parallel sequencing (ChIP-seq). Although there are many tools available for identifying binding sites (peaks) for ChIP-chip and ChIP-seq, few of them are available as easy-accessible online web tools for processing both ChIP-chip and ChIP-seq data for the ChIP-based user community. As such, we have developed a comprehensive web application tool for processing ChIP-chip and ChIP-seq data. Our web tool W-ChIPeaks employed a probe-based (or bin-based) enrichment threshold to define peaks and applied statistical methods to control false discovery rate for identified peaks. The web tool includes two different web interfaces: PELT for ChIP-chip, BELT for ChIP-seq, where both were tested on previously published experimental data. The novel features of our tool include a comprehensive output for identified peaks with GFF, BED, bedGraph and .wig formats, annotated genes to which these peaks are related, a graphical interpretation and visualization of the results via a user-friendly web interface.
Availability: http://motif.bmi.ohio-state.edu/W-ChIPeaks/.
Contact: victor.jin@osumc.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btq669
PMCID: PMC3031039  PMID: 21138948
23.  A tissue-specific landscape of sense/antisense transcription in the mouse intestine 
BMC Genomics  2011;12:305.
Background
The intestinal mucosa is characterized by complex metabolic and immunological processes driven highly dynamic gene expression programs. With the advent of next generation sequencing and its utilization for the analysis of the RNA sequence space, the level of detail on the global architecture of the transcriptome reached a new order of magnitude compared to microarrays.
Results
We report the ultra-deep characterization of the polyadenylated transcriptome in two closely related, yet distinct regions of the mouse intestinal tract (small intestine and colon). We assessed tissue-specific transcriptomal architecture and the presence of novel transcriptionally active regions (nTARs). In the first step, signatures of 20,541 NCBI RefSeq transcripts could be identified in the intestine (74.1% of annotated genes), thereof 16,742 are common in both tissues. Although the majority of reads could be linked to annotated genes, 27,543 nTARs not consistent with current gene annotations in RefSeq or ENSEMBL were identified. By use of a second independent strand-specific RNA-Seq protocol, 20,966 of these nTARs were confirmed, most of them in vicinity of known genes. We further categorized our findings by their relative adjacency to described exonic elements and investigated regional differences of novel transcribed elements in small intestine and colon.
Conclusions
The current study demonstrates the complexity of an archetypal mammalian intestinal mRNA transcriptome in high resolution and identifies novel transcriptionally active regions at strand-specific, single base resolution. Our analysis for the first time shows a strand-specific comparative picture of nTARs in two tissues and represents a resource for further investigating the transcriptional processes that contribute to tissue identity.
doi:10.1186/1471-2164-12-305
PMCID: PMC3125268  PMID: 21663663
24.  TISdb: a database for alternative translation initiation in mammalian cells 
Nucleic Acids Research  2013;42(Database issue):D845-D850.
Proper selection of the translation initiation site (TIS) on mRNAs is crucial for the production of desired protein products. Recent studies using ribosome profiling technology uncovered a surprising variety of potential TIS sites in addition to the annotated start codon. The prevailing alternative translation reshapes the landscape of the proteome in terms of diversity and complexity. To identify the hidden coding potential of the transcriptome in mammalian cells, we developed global translation initiation sequencing (GTI-Seq) that maps genome-wide TIS positions at nearly a single nucleotide resolution. To facilitate studies of alternative translation, we created a database of alternative TIS sites identified from human and mouse cell lines based on multiple GTI-Seq replicates. The TISdb, available at http://tisdb.human.cornell.edu, includes 6991 TIS sites from 4961 human genes and 9973 TIS sites from 5668 mouse genes. The TISdb website provides a simple browser interface for query of high-confidence TIS sites and their associated open reading frames. The output of search results provides a user-friendly visualization of TIS information in the context of transcript isoforms. Together, the information in the database provides an easy reference for alternative translation in mammalian cells and will support future investigation of novel translational products.
doi:10.1093/nar/gkt1085
PMCID: PMC3965020  PMID: 24203712
25.  Dual functions of TAF7L in adipocyte differentiation 
eLife  2013;2:e00170.
The diverse transcriptional mechanisms governing cellular differentiation and development of mammalian tissue remains poorly understood. Here we report that TAF7L, a paralogue of TFIID subunit TAF7, is enriched in adipocytes and white fat tissue (WAT) in mouse. Depletion of TAF7L reduced adipocyte-specific gene expression, compromised adipocyte differentiation, and WAT development as well. Ectopic expression of TAF7L in myoblasts reprograms these muscle precursors into adipocytes upon induction. Genome-wide mRNA-seq expression profiling and ChIP-seq binding studies confirmed that TAF7L is required for activating adipocyte-specific genes via a dual mechanism wherein it interacts with PPARγ at enhancers and TBP/Pol II at core promoters. In vitro binding studies confirmed that TAF7L forms complexes with both TBP and PPARγ. These findings suggest that TAF7L plays an integral role in adipocyte gene expression by targeting enhancers as a cofactor for PPARγ and promoters as a component of the core transcriptional machinery.
DOI: http://dx.doi.org/10.7554/eLife.00170.001
eLife digest
The development of a single fertilized egg into a highly complex animal is determined by its genome, with a process called differential gene regulation exerting exquisite control over gene expression to ensure that various specialized cells are generated and that many types of tissue are produced. However, the mechanisms responsible for controlling gene expression and, therefore mammalian development, are poorly understood.
Researchers have developed a number of in vitro cell culture models to elucidate the details of differential gene regulation, and this approach has been used to characterize adipocytes—cells that store energy in the form of fat—for close to two decades. The formation of adipocytes, a process known as adipogenesis, has been extensively studied, but there remain major gaps in our knowledge: for example, the identities of many of the transcriptional regulators that are responsible for the differentiation of mesenchymal stem cells into adipocytes remain a mystery. This task is complicated by the fact that some of these regulators are involved in the differentiation of multiple cell lines, and that some of them also have multiple roles in the generation of a single cell type. In addition to being of fundamental interest, improving our knowledge of the properties and behavior of adipocytes is essential for tackling the increasing prevalence of obesity in the developed world.
Zhou et al. now report that TAF7L—a gene that was previously thought to be involved only in the production of sperm cells—has two roles in the differentiation of stem cells to form adipocytes. Using a combination of cellular, biochemical, genetic and genomic techniques, they show that TAF7L interacts with PPARγ, an important adipocyte transcriptional regulator at enhancer sites on the genome to increase the transcription of genes that are involved in adipogenesis. They also show that TAF7L interacts with a general transcription factor called TBP (short for TATA-binding protein) at promoter sequences, again to increase the expression of genes involved in adipogenesis. Moreover, they show that the expression of TAF7L in myoblasts—precursor cells that usually become muscle cells—can induce the formation of fat cells rather than muscle cells. Furthermore, mice lacking TAF7L are lean compared to their normal littermates. A clearer understanding of the underlying causes of fat cell formation could lead to the development of new approaches for the treatment of obesity and associated diseases.
DOI: http://dx.doi.org/10.7554/eLife.00170.002
doi:10.7554/eLife.00170
PMCID: PMC3539393  PMID: 23326641
ChIP-seq; RNA-seq; adipogenesis; C3H10T½; TAF7L; differentiation; Mouse

Results 1-25 (1049195)