Related Articles

1.  Annotation of gene promoters by integrative data-mining of ChIP-seq Pol-II enrichment data 
BMC Bioinformatics  2010;11(Suppl 1):S65.
Background
Use of alternative gene promoters that drive widespread cell-type, tissue-type or developmental gene regulation in mammalian genomes is a common phenomenon. Chromatin immunoprecipitation methods coupled with DNA microarray (ChIP-chip) or massive parallel sequencing (ChIP-seq) are enabling genome-wide identification of active promoters in different cellular conditions using antibodies against Pol-II. However, these methods produce enrichment not only near the gene promoters but also inside the genes and other genomic regions due to the non-specificity of the antibodies used in ChIP. Further, the use of these methods is limited by their high cost and strong dependence on cellular type and context.
Methods
We trained and tested different state-of-the-art ensemble and meta classification methods for identification of Pol-II enriched promoter and Pol-II enriched non-promoter sequences, each of length 500 bp. The classification models were trained and tested on a benchmark dataset, using a set of 39 different feature variables that are based on chromatin modification signatures and various DNA sequence features. The best performing model was applied to seven published ChIP-seq Pol-II datasets to provide genome-wide annotation of mouse gene promoters.
Results
We present a novel algorithm based on supervised learning methods to discriminate promoter associated Pol-II enrichment from enrichment elsewhere in the genome in ChIP-chip/seq profiles. We accumulated a dataset of 11,773 promoter and 46,167 non-promoter sequences, each of length 500 bp, generated from RNA Pol-II ChIP-seq data of five tissues (Brain, Kidney, Liver, Lung and Spleen). We evaluated the classification models in building the best predictor and found that Bagging and Random Forest based approaches give the best accuracy. We implemented the algorithm on seven different published ChIP-seq datasets to provide a comprehensive set of promoter annotations for both protein-coding and non-coding genes in the mouse genome. The resulting annotations contain 13,413 (4,747) protein-coding (non-coding) genes with single promoters and 9,929 (1,858) protein-coding (non-coding) genes with two or more alternative promoters, and a significant number of unassigned novel promoters.
Conclusion
Our new algorithm can successfully predict the promoters from the genome wide profile of Pol-II bound regions. In addition, our algorithm performs significantly better than existing promoter prediction methods and can be applied for genome-wide predictions of Pol-II promoters.
doi:10.1186/1471-2105-11-S1-S65
PMCID: PMC3009539  PMID: 20122241
2.  MPromDb update 2010: an integrated resource for annotation and visualization of mammalian gene promoters and ChIP-seq experimental data 
Nucleic Acids Research  2010;39(Database issue):D92-D97.
MPromDb (Mammalian Promoter Database) is a curated database that strives to annotate gene promoters identified from ChIP-seq results with the goal of providing an integrated resource for mammalian transcriptional regulation and epigenetics. We analyzed 507 million uniquely aligned RNAP-II ChIP-seq reads from 26 different data sets that include six human cell-types and 10 distinct mouse cell/tissues. The updated MPromDb version consists of computationally predicted (novel) and known active RNAP-II promoters (42 893 human and 48 366 mouse promoters) from various data sets freely available at the NCBI GEO database. We found that 36% and 40% of protein-coding genes have alternative promoters in the human and mouse genomes, respectively, and ∼40% of promoters are tissue/cell specific. The identified RNAP-II promoters were annotated using various known and novel gene models. Additionally, for novel promoters we looked into other evidence: GenBank mRNAs, spliced ESTs, CAGE promoter tags and mRNA-seq reads. Users can search the database based on gene id/symbol, or by specific tissue/cell type and filter results based on any combination of tissue/cell specificity, Known/Novel, CpG/NonCpG, and protein-coding/non-coding gene promoters. We have also integrated the GBrowse genome browser with MPromDb for visualization of ChIP-seq profiles and to display the annotations. The current release of MPromDb can be accessed at http://bioinformatics.wistar.upenn.edu/MPromDb/.
doi:10.1093/nar/gkq1171
PMCID: PMC3013732  PMID: 21097880
3.  Integrative genome-wide chromatin signature analysis using finite mixture models 
BMC Genomics  2012;13(Suppl 6):S3.
Regulation of gene expression has been shown to involve not only the binding of transcription factors at target gene promoters but also the state of the histones around which DNA is wrapped. Some histone modifications, for example di-methylation of histone H3 at lysine 4 (H3K4me2), have been shown to localize to promoters and activate target genes. However, no clear pattern has been shown to predict human promoters. This paper proposes a novel quantitative approach to characterize patterns of promoter regions and predict novel and alternative promoters. We utilized high-throughput data generated using chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-seq) for RNA Polymerase II (Pol-II) and H3K4me2. Common patterns of promoter regions are modeled using a mixture model involving double-exponential and uniform distributions. The fitted model was then used to search the entire genome for regions displaying similar patterns in order to find novel and alternative promoters. Regions that correlate highly with the common patterns are identified as putative novel promoters. We used this proposed algorithm, RNA-seq data and several transcript databases to find alternative promoters in the MCF7 (normal breast cancer) cell line. We found 7,235 high-confidence regions that display the identified promoter patterns. Of these, 4,167 regions (58%) can be mapped to RefSeq regions. A further 2,444 regions are in a gene body or overlap with transcripts (non-coding RNAs, ESTs, and transcripts that are predicted by RNA-seq data). Some of these may be potential alternative promoters. We also found 193 regions that map to enhancer regions (represented by androgen and estrogen receptor binding sites) and other regulatory regions such as CTCF (CCCTC binding factor) sites and CpG islands. Around 5% (431 regions) of these correlated regions do not overlap with any transcripts or regulatory regions, suggesting that these might be potential new promoters or markers for other annotations that are currently undiscovered.
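The mixture-model idea in this abstract can be illustrated with a minimal sketch. The abstract does not give the parameterization, so the mixing weight `w`, the Laplace location `mu` and scale `b`, the window bounds `lo` and `hi`, and the sample values are all assumptions made for illustration. Scoring a candidate region by Pearson correlation against a template profile is one plausible reading of "high correlations with the common patterns", not necessarily the authors' procedure.

```python
import math

def laplace_pdf(x, mu, b):
    """Double-exponential (Laplace) density with location mu and scale b."""
    return math.exp(-abs(x - mu) / b) / (2.0 * b)

def uniform_pdf(x, lo, hi):
    """Uniform density on the closed interval [lo, hi]."""
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0

def mixture_pdf(x, w, mu, b, lo, hi):
    """Mixture: weight w on the Laplace peak, 1 - w on the uniform background."""
    return w * laplace_pdf(x, mu, b) + (1.0 - w) * uniform_pdf(x, lo, hi)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical template: a peak centred on the TSS over a +/-1 kb window.
positions = list(range(-1000, 1001, 100))
template = [mixture_pdf(p, 0.8, 0.0, 200.0, -1000.0, 1000.0) for p in positions]

# Hypothetical candidate region: same shape, different scale and offset.
candidate = [3.0 * t + 0.001 for t in template]
score = pearson(template, candidate)
```

Because correlation ignores scale and offset, a candidate with the same shape as the template scores close to 1 regardless of its read depth.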
doi:10.1186/1471-2164-13-S6-S3
PMCID: PMC3481451  PMID: 23134707
4.  RNA Polymerase II Pausing Downstream of Core Histone Genes Is Different from Genes Producing Polyadenylated Transcripts 
PLoS ONE  2012;7(6):e38769.
Recent genome-wide analyses using chromatin immunoprecipitation coupled with high-throughput sequencing (ChIP-seq), performed in various eukaryotic organisms, have examined RNA Polymerase II (Pol II) pausing around the transcription start sites of genes. In this study we have further investigated genome-wide binding of Pol II downstream of the 3′ end of the annotated genes (EAGs) by ChIP-seq in human cells. At almost all expressed genes we observed Pol II occupancy downstream of the EAGs, suggesting that Pol II pausing 3′ from the transcription units is a rather common phenomenon. Downstream of EAGs Pol II transcripts can also be detected by global run-on and sequencing, suggesting the presence of functionally active Pol II. Based on Pol II occupancy downstream of EAGs we could distinguish distinct clusters of Pol II pause patterns. On core histone genes, coding for non-polyadenylated transcripts, Pol II occupancy drops quickly after the EAG. In contrast, on genes whose transcripts undergo polyA tail addition [poly(A)+], Pol II occupancy downstream of the EAGs can be detected up to 4–6 kb. Inhibition of polyadenylation significantly increased Pol II occupancy downstream of EAGs at poly(A)+ genes, but not at the EAGs of core histone genes. The differential genome-wide Pol II occupancy profiles 3′ of the EAGs have also been confirmed in mouse embryonic stem (mES) cells, indicating that Pol II pauses genome-wide downstream of the EAGs in mammalian cells. Moreover, in mES cells the sharp drop of Pol II signal at the EAG of core histone genes seems to be independent of the phosphorylation status of the C-terminal domain of the large subunit of Pol II. Thus, our study uncovers a potential link between different mRNA 3′ end processing mechanisms and consequent Pol II transcription termination processes.
doi:10.1371/journal.pone.0038769
PMCID: PMC3372504  PMID: 22701709
5.  Most “Dark Matter” Transcripts Are Associated With Known Genes 
PLoS Biology  2010;8(5):e1000371.
Short-read RNA sequencing in mouse and human tissues shows that most transcripts are encoded within or nearby known genes and that most of the genome is not transcribed.
A series of reports over the last few years have indicated that a much larger portion of the mammalian genome is transcribed than can be accounted for by currently annotated genes, but the quantity and nature of these additional transcripts remains unclear. Here, we have used data from single- and paired-end RNA-Seq and tiling arrays to assess the quantity and composition of transcripts in PolyA+ RNA from human and mouse tissues. Relative to tiling arrays, RNA-Seq identifies many fewer transcribed regions (“seqfrags”) outside known exons and ncRNAs. Most nonexonic seqfrags are in introns, raising the possibility that they are fragments of pre-mRNAs. The chromosomal locations of the majority of intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and terminator-associated transcripts, or new alternative exons; indeed, reads that bridge splice sites identified 4,544 new exons, affecting 3,554 genes. Most of the remaining seqfrags correspond to either single reads that display characteristics of random sampling from a low-level background or several thousand small transcripts (median length = 111 bp) present at higher levels, which also tend to display sequence conservation and originate from regions with open chromatin. We conclude that, while there are bona fide new intergenic transcripts, their number and abundance is generally low in comparison to known exons, and the genome is not as pervasively transcribed as previously reported.
Author Summary
The human genome was sequenced a decade ago, but its exact gene composition remains a subject of debate. The number of protein-coding genes is much lower than initially expected, and the number of distinct transcripts is much larger than the number of protein-coding genes. Moreover, the proportion of the genome that is transcribed in any given cell type remains an open question: results from “tiling” microarray analyses suggest that transcription is pervasive and that most of the genome is transcribed, whereas new deep sequencing-based methods suggest that most transcripts originate from known genes. We have addressed this discrepancy by comparing samples from the same tissues using both technologies. Our analyses indicate that RNA sequencing appears more reliable for transcripts with low expression levels, that most transcripts correspond to known genes or are near known genes, and that many transcripts may represent new exons or aberrant products of the transcription process. We also identify several thousand small transcripts that map outside known genes; their sequences are often conserved and are often encoded in regions of open chromatin. We propose that most of these transcripts may be by-products of the activity of enhancers, which associate with promoters as part of their role as long-range gene regulatory sites. Overall, however, we find that most of the genome is not appreciably transcribed.
doi:10.1371/journal.pbio.1000371
PMCID: PMC2872640  PMID: 20502517
6.  Temporal ChIP-on-Chip of RNA-Polymerase-II to detect novel gene activation events during photoreceptor maturation 
Molecular Vision  2010;16:252-271.
Purpose
During retinal development, post-mitotic neural progenitor cells must activate thousands of genes to complete synaptogenesis and terminal maturation. While many of these genes are known, others remain beyond the sensitivity of expression microarray analysis. Some of these elusive gene activation events can be detected by mapping changes in RNA polymerase-II (Pol-II) association around transcription start sites.
Methods
High-resolution (35 bp) chromatin immunoprecipitation (ChIP)-on-chip was used to map changes in Pol-II binding surrounding 26,000 gene transcription start sites during photoreceptor maturation of the mouse neural retina, comparing postnatal age 25 (P25) to P2. Coverage was 10–12 kb per transcription start site, including 2.5 kb downstream. Pol-II-active regions were mapped to the mouse genomic DNA sequence by using computational methods (Tiling Analysis Software-TAS program), and the ratio of maximum Pol-II binding (P25/P2) was calculated for each gene. A validation set of 36 genes (3%), representing a full range of Pol-II signal ratios (P25/P2), were examined with quantitative ChIP assays for transcriptionally active Pol-II. Gene expression assays were also performed for 19 genes of the validation set, again on independent samples. FLT-3 Interacting Zinc-finger-1 (FIZ1), a zinc-finger protein that associates with active promoter complexes of photoreceptor-specific genes, provided an additional ChIP marker to highlight genes activated in the mature neural retina. To demonstrate the use of ChIP-on-chip predictions to find novel gene activation events, four additional genes were selected for quantitative PCR analysis (qRT–PCR analysis); these four genes have human homologs located in unidentified retinal disease regions: Solute carrier family 25 member 33 (Slc25a33), Lysophosphatidylcholine acyltransferase 1 (Lpcat1), Coiled-coil domain-containing 126 (Ccdc126), and ADP-ribosylation factor-like 4D (Arl4d).
Results
ChIP-on-chip Pol-II peak signal ratios >1.8 predicted increased amounts of transcribing Pol-II and increased expression with an estimated 97% accuracy, based on analysis of the validation gene set. Using this threshold ratio, 1,101 genes were predicted to experience increased binding of Pol-II in their promoter regions during terminal maturation of the neural retina. Over 800 of these gene activations were additional to those previously reported by microarray analysis. Slc25a33, Lpcat1, Ccdc126, and Arl4d increased expression significantly (p<0.001) during photoreceptor maturation. Expression of all four genes was diminished in adult retinas lacking rod photoreceptors (Rd1 mice) compared to normal retinas (90% loss for Ccdc126 and Arl4d). For rhodopsin (Rho), a marker of photoreceptor maturation, two regions of maximum Pol-II signal corresponded to the upstream rhodopsin enhancer region and the rhodopsin proximal promoter region.
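The thresholding step reported in these results can be sketched as follows. Only the 1.8 cut-off on the P25/P2 ratio of maximum Pol-II signal comes from the abstract. The gene names and signal values below are hypothetical, and the function names are illustrative rather than taken from the study's software.

```python
def pol2_ratio(p25_max, p2_max):
    """Ratio of maximum Pol-II signal at P25 to that at P2."""
    if p2_max <= 0:
        raise ValueError("P2 signal must be positive to form a ratio")
    return p25_max / p2_max

def predicted_activated(signals, threshold=1.8):
    """Return the sorted names of genes whose P25/P2 ratio exceeds the threshold.

    `signals` maps gene name -> (max signal at P25, max signal at P2).
    """
    return sorted(
        gene
        for gene, (p25_max, p2_max) in signals.items()
        if pol2_ratio(p25_max, p2_max) > threshold
    )

# Hypothetical peak signals for three placeholder genes.
signals = {
    "GeneA": (5.4, 2.0),  # ratio 2.7, above the threshold
    "GeneB": (2.1, 2.0),  # ratio 1.05, below the threshold
    "GeneC": (1.9, 2.0),  # ratio 0.95, below the threshold
}
activated = predicted_activated(signals)
```

In this toy input only `GeneA` passes the cut-off; in the study the same rule applied genome-wide yielded the 1,101 predicted genes.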
Conclusions
High-resolution maps of Pol-II binding around transcription start sites were generated for the postnatal mouse retina; these maps can predict activation increases for a specific gene of interest. Novel gene activation predictions are enriched for biologic functions relevant to vision, neural function, and chromatin regulation. Use of the data set to detect novel activation increases was demonstrated by expression analysis for several genes that have human homologs located within unidentified retinal disease regions: Slc25a33, Lpcat1, Ccdc126, and Arl4d. Analysis of photoreceptor-deficient retinas indicated that all four genes are expressed in photoreceptors. Genome-wide maps of Pol-II binding were developed for visual access in the University of California, Santa Cruz (UCSC) Genome Browser and its eye-centric version EyeBrowse (National Eye Institute-NEI). Single promoter resolution of Pol-II distribution patterns suggests that the Rho enhancer region and the Rho proximal promoter region become closely associated with the activated gene’s promoter complex.
PMCID: PMC2822553  PMID: 20161818
7.  Extragenic Accumulation of RNA Polymerase II Enhances Transcription by RNA Polymerase III 
PLoS Genetics  2007;3(11):e212.
Recent genomic data indicate that RNA polymerase II (Pol II) function extends beyond conventional transcription of primarily protein-coding genes. Among the five snRNAs required for pre-mRNA splicing, only the U6 snRNA is synthesized by RNA polymerase III (Pol III). Here we address the question of how Pol II coordinates the expression of spliceosome components, including U6. We used chromatin immunoprecipitation (ChIP) and high-resolution mapping by PCR to localize both Pol II and Pol III to snRNA gene regions. We report the surprising finding that Pol II is highly concentrated ∼300 bp upstream of all five active human U6 genes in vivo. The U6 snRNA, an essential component of the spliceosome, is synthesized by Pol III, whereas all other spliceosomal snRNAs are Pol II transcripts. Accordingly, U6 transcripts were terminated in a Pol III-specific manner, and Pol III localized to the transcribed gene regions. However, synthesis of both U6 and U2 snRNAs was α-amanitin-sensitive, indicating a requirement for Pol II activity in the expression of both snRNAs. Moreover, both Pol II and histone tail acetylation marks were lost from U6 promoters upon α-amanitin treatment. The results indicate that Pol II is concentrated at specific genomic regions from which it can regulate Pol III activity by a general mechanism. Consequently, Pol II coordinates expression of all RNA and protein components of the spliceosome.
Author Summary
During transcription, RNA polymerases synthesize an RNA copy of a given gene. Human genes are transcribed by either RNA polymerase I, II, or III. Here, we focus on transcription of the U6 gene that encodes a small nuclear RNA (snRNA), a non-coding RNA with unique activities in gene expression. The U6 snRNA is transcribed by RNA polymerase III (Pol III); here we report the surprising finding that RNA polymerase II (Pol II) is important for efficient expression of the U6 snRNA. Interestingly, high concentrations of Pol II have been recently observed on genomic regions that are considered outside of transcribed genes. We localized Pol II to a region upstream of the U6 snRNA gene promoters in living cells. Inhibition of Pol II activity decreased U6 snRNA synthesis and was accompanied by a decrease in Pol II accumulation as well as transcription-activating histone modifications, while Pol III remained bound at U6 genes. Thus, Pol II may promote U6 snRNA transcription by facilitating open chromatin formation. Our results provide insight into the extragenic function of Pol II, which can coordinate the expression of all components of the RNA splicing machinery, including U6 snRNA.
doi:10.1371/journal.pgen.0030212
PMCID: PMC2082468  PMID: 18039033
8.  Inference of RNA Polymerase II Transcription Dynamics from Chromatin Immunoprecipitation Time Course Data 
PLoS Computational Biology  2014;10(5):e1003598.
Gene transcription mediated by RNA polymerase II (pol-II) is a key step in gene expression. The dynamics of pol-II moving along the transcribed region influence the rate and timing of gene expression. In this work, we present a probabilistic model of transcription dynamics which is fitted to pol-II occupancy time course data measured using ChIP-Seq. The model can be used to estimate transcription speed and to infer the temporal pol-II activity profile at the gene promoter. Model parameters are estimated using either maximum likelihood estimation or via Bayesian inference using Markov chain Monte Carlo sampling. The Bayesian approach provides confidence intervals for parameter estimates and allows the use of priors that capture domain knowledge, e.g. the expected range of transcription speeds, based on previous experiments. The model describes the movement of pol-II down the gene body and can be used to identify the time of induction for transcriptionally engaged genes. By clustering the inferred promoter activity time profiles, we are able to determine which genes respond quickly to stimuli and group genes that share activity profiles and may therefore be co-regulated. We apply our methodology to biological data obtained using ChIP-seq to measure pol-II occupancy genome-wide when MCF-7 human breast cancer cells are treated with estradiol (E2). The transcription speeds we obtain agree with those obtained previously for smaller numbers of genes with the advantage that our approach can be applied genome-wide. We validate the biological significance of the pol-II promoter activity clusters by investigating cluster-specific transcription factor binding patterns and determining canonical pathway enrichment. We find that rapidly induced genes are enriched for both estrogen receptor alpha (ER) and FOXA1 binding in their proximal promoter regions.
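As a rough illustration of how a transcription speed can be read off time course data, the sketch below fits a straight line to the time at which the Pol-II wave reaches successive positions along a gene. This is a deliberately simplified stand-in, not the probabilistic model described in the abstract: the least-squares fit, the function name `estimate_speed`, and the arrival times and positions are all assumptions chosen for illustration.

```python
def estimate_speed(times_min, positions_kb):
    """Least-squares slope of position (kb) against arrival time (min).

    The slope is the estimated transcription speed in kb/min.
    """
    n = len(times_min)
    mean_t = sum(times_min) / n
    mean_p = sum(positions_kb) / n
    num = sum((t - mean_t) * (p - mean_p)
              for t, p in zip(times_min, positions_kb))
    den = sum((t - mean_t) ** 2 for t in times_min)
    return num / den

# Hypothetical data: time (min) at which Pol-II occupancy first rises
# in gene segments centred at the given distances (kb) from the promoter.
arrival_times = [5.0, 10.0, 15.0, 20.0]
segment_positions = [10.0, 20.0, 30.0, 40.0]
speed = estimate_speed(arrival_times, segment_positions)
```

A full treatment along the lines of the abstract would replace the point estimate with a likelihood over the whole occupancy time course and, in the Bayesian variant, a prior on the expected range of speeds.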
Author Summary
Cells express proteins in response to changes in their environment so as to maintain normal function. An initial step in the expression of proteins is transcription, which is mediated by RNA polymerase II (pol-II). To understand changes in transcription arising due to stimuli it is useful to model the dynamics of transcription. We present a probabilistic model of pol-II transcription dynamics that can be used to compute RNA transcription speed and infer the temporal pol-II activity at the gene promoter. The inferred promoter activity profile is used to determine genes that are responding in a coordinated manner to stimuli and are therefore potentially co-regulated. Model parameters are inferred using data from high-throughput sequencing assays, such as ChIP-Seq and GRO-Seq, and can therefore be applied genome-wide in an unbiased manner. We apply the method to pol-II ChIP-Seq time course data from breast cancer cells stimulated by estradiol in order to uncover the dynamics of early response genes in this system.
doi:10.1371/journal.pcbi.1003598
PMCID: PMC4022483  PMID: 24830797
9.  Integrated transcriptome analysis of mouse spermatogenesis 
BMC Genomics  2014;15:39.
Background
Differentiation of primordial germ cells into mature spermatozoa proceeds through multiple stages, one of the most important of which is meiosis. Meiotic recombination is in turn a key part of meiosis. To achieve the highly specialized and diverse functions necessary for the successful completion of meiosis and the generation of spermatozoa thousands of genes are coordinately regulated through spermatogenesis. A complete and unbiased characterization of the transcriptome dynamics of spermatogenesis is, however, still lacking.
Results
In order to characterize gene expression during spermatogenesis we sequenced eight mRNA samples from testes of juvenile mice from 6 to 38 days post partum. Using gene expression clustering we defined over 1,000 novel meiotically-expressed genes. We also developed a computational de-convolution approach and used it to estimate cell type-specific gene expression in pre-meiotic, meiotic and post-meiotic cells. In addition, we detected 13,000 novel alternative splicing events, around 40% of which preserve an open reading frame, and found experimental support for 159 computational gene predictions. A comparison of RNA polymerase II (Pol II) ChIP-Seq signals with RNA-Seq coverage shows that gene expression correlates well with Pol II signals, both at promoters and along the gene body. However, we observe numerous instances of non-canonical promoter usage, as well as intergenic Pol II peaks that potentially delineate unannotated promoters, enhancers or small RNA clusters.
Conclusions
Here we provide a comprehensive analysis of gene expression throughout mouse meiosis and spermatogenesis. Importantly, we find over a thousand novel meiotic genes and over 5,000 novel potentially coding isoforms. These data should be a valuable resource for future studies of meiosis and spermatogenesis in mammals.
doi:10.1186/1471-2164-15-39
PMCID: PMC3906902  PMID: 24438502
Spermatogenesis; Meiosis; RNA-Seq; Transcriptome; Deconvolution; RNA Pol II; piRNA
10.  Epigenetic regulation of human cis-natural antisense transcripts 
Nucleic Acids Research  2012;40(4):1438-1445.
Mammalian genomes encode numerous cis-natural antisense transcripts (cis-NATs). The extent to which these cis-NATs are actively regulated and ultimately functionally relevant, as opposed to transcriptional noise, remains a matter of debate. To address this issue, we analyzed the chromatin environment and RNA Pol II binding properties of human cis-NAT promoters genome-wide. Cap analysis of gene expression data were used to identify thousands of cis-NAT promoters, and profiles of nine histone modifications and RNA Pol II binding for these promoters in ENCODE cell types were analyzed using chromatin immunoprecipitation followed by sequencing (ChIP-seq) data. Active cis-NAT promoters are enriched with activating histone modifications and occupied by RNA Pol II, whereas weak cis-NAT promoters are depleted for both activating modifications and RNA Pol II. The enrichment levels of activating histone modifications and RNA Pol II binding show peaks centered around cis-NAT transcriptional start sites, and the levels of activating histone modifications at cis-NAT promoters are positively correlated with cis-NAT expression levels. Cis-NAT promoters also show highly tissue-specific patterns of expression. These results suggest that human cis-NATs are actively transcribed by the RNA Pol II and that their expression is epigenetically regulated, prerequisites for a functional potential for many of these non-coding RNAs.
doi:10.1093/nar/gkr1010
PMCID: PMC3287164  PMID: 22371288
11.  The NSL Complex Regulates Housekeeping Genes in Drosophila 
PLoS Genetics  2012;8(6):e1002736.
MOF is the major histone H4 lysine 16-specific (H4K16) acetyltransferase in mammals and Drosophila. In flies, it is involved in the regulation of X-chromosomal and autosomal genes as part of the MSL and the NSL complexes, respectively. While the function of the MSL complex as a dosage compensation regulator is fairly well understood, the role of the NSL complex in gene regulation is still poorly characterized. Here we report a comprehensive ChIP–seq analysis of four NSL complex members (NSL1, NSL3, MBD-R2, and MCRS2) throughout the Drosophila melanogaster genome. Strikingly, the majority (85.5%) of NSL-bound genes are constitutively expressed across different cell types. We find that an increased abundance of the histone modifications H4K16ac, H3K4me2, H3K4me3, and H3K9ac in gene promoter regions is characteristic of NSL-targeted genes. Furthermore, we show that these genes have a well-defined nucleosome free region and broad transcription initiation patterns. Finally, by performing ChIP–seq analyses of RNA polymerase II (Pol II) in NSL1- and NSL3-depleted cells, we demonstrate that both NSL proteins are required for efficient recruitment of Pol II to NSL target gene promoters. The observed Pol II reduction coincides with compromised binding of TBP and TFIIB to target promoters, indicating that the NSL complex is required for optimal recruitment of the pre-initiation complex on target genes. Moreover, genes that undergo the most dramatic loss of Pol II upon NSL knockdowns tend to be enriched in DNA Replication–related Element (DRE). Taken together, our findings show that the MOF-containing NSL complex acts as a major regulator of housekeeping genes in flies by modulating initiation of Pol II transcription.
Author Summary
Housekeeping genes are required to support basic cellular functions and are therefore expressed constitutively in all tissues. Although the homeostasis of housekeeping gene expression is vital for cell survival, most research on the transcription initiation has been focused on TATA-box-containing promoters of inducible and developmental genes, while regulatory mechanisms at the TATA-less promoters of housekeeping genes have remained poorly understood. Using genome-wide chromatin binding profiles, we find that the NSL complex, a histone acetyltransferase-containing complex, is bound to the majority of constitutively active gene promoters. We show that NSL-bound genes display specific sets of DNA motifs, well-defined nucleosome free regions, and broad transcription initiation patterns. In addition, we show that the NSL complex regulates the recruitment of the basal transcription machinery to target promoters; more specifically, we can pinpoint its role to the early steps of Pol II recruitment. Interestingly, we also see that NSL-bound genes are most susceptible to Pol II loss after depletion of NSLs when they contain the DNA Replication–related Element (DRE). Taken together, we provide a genome-wide analysis of a chromatin-modifying complex that is globally involved in the regulation of housekeeping gene expression.
doi:10.1371/journal.pgen.1002736
PMCID: PMC3375229  PMID: 22723752
12.  Orphan CpG Islands Identify Numerous Conserved Promoters in the Mammalian Genome 
PLoS Genetics  2010;6(9):e1001134.
CpG islands (CGIs) are vertebrate genomic landmarks that encompass the promoters of most genes and often lack DNA methylation. Querying their apparent importance, the number of CGIs is reported to vary widely in different species and many do not co-localise with annotated promoters. We set out to quantify the number of CGIs in mouse and human genomes using CXXC Affinity Purification plus deep sequencing (CAP-seq). We also asked whether CGIs not associated with annotated transcripts share properties with those at known promoters. We found that, contrary to previous estimates, CGI abundance in humans and mice is very similar and many are at conserved locations relative to genes. In each species CpG density correlates positively with the degree of H3K4 trimethylation, supporting the hypothesis that these two properties are mechanistically interdependent. Approximately half of mammalian CGIs (>10,000) are “orphans” that are not associated with annotated promoters. Many orphan CGIs show evidence of transcriptional initiation and dynamic expression during development. Unlike CGIs at known promoters, orphan CGIs are frequently subject to DNA methylation during development, and this is accompanied by loss of their active promoter features. In colorectal tumors, however, orphan CGIs are not preferentially methylated, suggesting that cancer does not recapitulate a developmental program. Human and mouse genomes have similar numbers of CGIs, over half of which are remote from known promoters. Orphan CGIs nevertheless have the characteristics of functional promoters, though they are much more likely than promoter CGIs to become methylated during development and hence lose these properties. The data indicate that orphan CGIs correspond to previously undetected promoters whose transcriptional activity may play a functional role during development.
Author Summary
In the decade since the sequence of the human genome was announced, efforts have been made to annotate all genes with their regulatory sequences. CpG islands are short regions containing the sequence CG at high density that map to regions controlling the expression of most human genes (known as promoters). Using a biochemical method, we have identified and mapped all CpG islands in the human and mouse genomes and find that over half are remote from known gene promoters—so-called “orphans.” Mice, which were thought to possess far fewer CpG islands than humans, turn out to have a very similar number. Surprisingly, orphan CpG islands in both species often mark hitherto unknown promoters. The activity of these novel promoters is particularly dynamic during normal development, as they are often silenced by DNA methylation. In colorectal cancers, however, aberrant DNA methylation affects all CpG islands equally.
doi:10.1371/journal.pgen.1001134
PMCID: PMC2944787  PMID: 20885785
13.  The Transcriptome of the Human Pathogen Trypanosoma brucei at Single-Nucleotide Resolution 
PLoS Pathogens  2010;6(9):e1001090.
The genome of Trypanosoma brucei, the causative agent of African trypanosomiasis, was published five years ago, yet identification of all genes and their transcripts remains to be accomplished. Annotation is challenged by the organization of genes transcribed by RNA polymerase II (Pol II) into long unidirectional gene clusters with no knowledge of how transcription is initiated. Here we report a single-nucleotide resolution genomic map of the T. brucei transcriptome, adding 1,114 new transcripts, including 103 non-coding RNAs, confirming and correcting many of the annotated features and revealing an extensive heterogeneity of 5′ and 3′ ends. Some of the new transcripts encode polypeptides that are either conserved in T. cruzi and Leishmania major or were previously detected in mass spectrometry analyses. High-throughput RNA sequencing (RNA-Seq) was sensitive enough to detect transcripts at putative Pol II transcription initiation sites. Our results, as well as recent data from the literature, indicate that transcription initiation is not solely restricted to regions at the beginning of gene clusters, but may occur at internal sites. We also provide evidence that transcription at all putative initiation sites in T. brucei is bidirectional, a recently recognized fundamental property of eukaryotic promoters. Our results have implications for gene expression patterns in other important human pathogens with similar genome organization (Trypanosoma cruzi, Leishmania sp.) and revealed heterogeneity in pre-mRNA processing that could potentially contribute to the survival and success of the parasite population in the insect vector and the mammalian host.
Author Summary
Identifying genes essential for survival in the host is fundamental to unraveling the biology of human pathogens and understanding mechanisms of pathogenesis. The protozoan parasite Trypanosoma brucei causes devastating diseases in humans and animals in sub-Saharan Africa, and the publication in 2005 of the genome sequence provided the first glance at the coding potential of this organism. Although at present there is a catalogue of predicted protein coding genes, the challenge remains to identify all authentic genes, including their boundaries. We used next generation RNA sequencing (RNA-Seq) to map transcribed regions and RNA polymerase II transcription initiation sites on a genome-wide scale. This approach allowed us to improve and correct the current annotation, to reveal a widespread heterogeneity of RNA processing sites (trans-splicing and polyadenylation) and to estimate that most genes are expressed at levels corresponding to 1 to 10 mRNAs per cell. Our data indicate that different transcript forms representing the same gene are present stochastically within the mRNA population. This unanticipated scenario may contribute to determining gene expression landscapes to adapt to different environments in the parasite life cycle.
doi:10.1371/journal.ppat.1001090
PMCID: PMC2936537  PMID: 20838601
14.  Characteristic bimodal profiles of RNA polymerase II at thousands of active mammalian promoters 
Genome Biology  2014;15(6):R85.
Background
In mammals, ChIP-seq studies of RNA polymerase II (PolII) occupancy have been performed to reveal how recruitment, initiation and pausing of PolII may control transcription rates, but the focus is rarely on obtaining finely resolved profiles that can portray the progression of PolII through sequential promoter states.
Results
Here, we analyze PolII binding profiles from high-coverage ChIP-seq on promoters of actively transcribed genes in mouse and humans. We show that the enrichment of PolII near transcription start sites exhibits a stereotypical bimodal structure, with one peak near active transcription start sites and a second peak 110 base pairs downstream from the first. Using an empirical model that reliably quantifies the spatial PolII signal, gene by gene, we show that the first PolII peak allows for refined positioning of transcription start sites, which is corroborated by mRNA sequencing. This bimodal signature is found both in mouse and humans. Analysis of the pausing-related factors NELF and DSIF suggests that the downstream peak reflects widespread pausing at the +1 nucleosome barrier. Several features of the bimodal pattern are correlated with sequence features such as CpG content and TATA boxes, as well as the histone mark H3K4me3.
Conclusions
We thus show how high coverage DNA sequencing experiments can reveal as-yet unnoticed bimodal spatial features of PolII accumulation that are frequent at individual mammalian genes and reminiscent of transcription initiation and pausing. The initiation-pausing hypothesis is corroborated by evidence from run-on sequencing and immunoprecipitation in other cell types and species.
doi:10.1186/gb-2014-15-6-r85
PMCID: PMC4197824  PMID: 24972996
15.  A map of the cis-regulatory sequences in the mouse genome 
Nature  2012;488(7409):116-120.
The laboratory mouse is the most widely used mammalian model organism in biomedical research. The 2.6 × 10^9 bases of the mouse genome possess a high degree of conservation with the human genome [1], so a thorough annotation of the mouse genome will be of significant value to understanding the function of the human genome. So far, most of the functional sequences in the mouse genome have yet to be found, and the cis-regulatory sequences in particular are still poorly annotated. Comparative genomics has been a powerful tool for the discovery of these sequences [2], but on its own it cannot resolve their temporal and spatial functions. Recently, ChIP-Seq has been developed to identify cis-regulatory elements in the genomes of several organisms including humans, Drosophila melanogaster and Caenorhabditis elegans [3–5]. Here we apply the same experimental approach to a diverse set of 19 tissues and cell types in the mouse to produce a map of nearly 300,000 murine cis-regulatory sequences. The annotated sequences add up to 11% of the mouse genome, and include more than 70% of conserved non-coding sequences. We define tissue-specific enhancers and identify potential transcription factors regulating gene expression in each tissue or cell type. Finally, we show that much of the mouse genome is organized into domains of coordinately regulated enhancers and promoters. Our results provide a resource for the annotation of functional elements in the mammalian genome and for the study of mechanisms regulating tissue-specific gene expression.
doi:10.1038/nature11243
PMCID: PMC4041622  PMID: 22763441
16.  Integrated genome analysis suggests that most conserved non-coding sequences are regulatory factor binding sites 
Nucleic Acids Research  2012;40(16):7858-7869.
More than 98% of a typical vertebrate genome does not code for proteins. Although non-coding regions are sprinkled with short (<200 bp) islands of evolutionarily conserved sequences, the function of most of these unannotated conserved islands remains unknown. One possibility is that unannotated conserved islands could encode non-coding RNAs (ncRNAs); alternatively, unannotated conserved islands could serve as promoter-distal regulatory factor binding sites (RFBSs) like enhancers. Here we assess these possibilities by comparing unannotated conserved islands in the human and mouse genomes to transcribed regions and to RFBSs, relying on a detailed case study of one human and one mouse cell type. We define transcribed regions by applying a novel transcript-calling algorithm to RNA-Seq data obtained from total cellular RNA, and we define RFBSs using ChIP-Seq and DNAse-hypersensitivity assays. We find that unannotated conserved islands are four times more likely to coincide with RFBSs than with unannotated ncRNAs. Thousands of conserved RFBSs can be categorized as insulators based on the presence of CTCF or as enhancers based on the presence of p300/CBP and H3K4me1. While many unannotated conserved RFBSs are transcriptionally active to some extent, the transcripts produced tend to be unspliced, non-polyadenylated and expressed at levels 10 to 100-fold lower than annotated coding or ncRNAs. Extending these findings across multiple cell types and tissues, we propose that most conserved non-coding genomic DNA in vertebrate genomes corresponds to promoter-distal regulatory elements.
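The comparison described above comes down to asking how many conserved islands coincide with each class of annotated interval. The sketch below shows one simple way to count such overlaps on a single chromosome, assuming half-open `(start, end)` coordinates. The function names and the toy coordinates are hypothetical and are not taken from the study.

```python
from bisect import bisect_left


def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals into a sorted list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def count_overlapping(queries, targets):
    """Count query intervals that overlap at least one target interval."""
    merged = merge_intervals(targets)
    starts = [s for s, _ in merged]
    hits = 0
    for q_start, q_end in queries:
        # Last merged target that starts before the query ends
        i = bisect_left(starts, q_end) - 1
        if i >= 0 and merged[i][1] > q_start:
            hits += 1
    return hits


# Toy coordinates for illustration only
islands = [(100, 250), (400, 520), (900, 1000), (1500, 1600)]
rfbs = [(200, 300), (950, 1200), (480, 500)]
ncrna = [(1550, 1700)]

with_rfbs = count_overlapping(islands, rfbs)
with_ncrna = count_overlapping(islands, ncrna)
print(with_rfbs, with_ncrna)
```

Merging the targets first keeps each lookup to a single binary search, which matters once the interval sets reach genome scale.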
doi:10.1093/nar/gks477
PMCID: PMC3439890  PMID: 22684627
17.  Transcription-factor occupancy at HOT regions quantitatively predicts RNA polymerase recruitment in five human cell lines 
BMC Genomics  2013;14:720.
Background
High-occupancy target (HOT) regions are compact genome loci occupied by many different transcription factors (TFs). HOT regions were initially defined in invertebrate model organisms, and we here show that they are a ubiquitous feature of the human gene-regulation landscape.
Results
We identified HOT regions by a comprehensive analysis of ChIP-seq data from 96 DNA-associated proteins in 5 human cell lines. Most HOT regions co-localize with RNA polymerase II binding sites, but many are not near the promoters of annotated genes. At HOT promoters, TF occupancy is strongly predictive of transcription preinitiation complex recruitment and moderately predictive of initiating Pol II recruitment, but only weakly predictive of elongating Pol II and RNA transcript abundance. TF occupancy varies quantitatively within human HOT regions; we used this variation to discover novel associations between TFs. The sequence motif associated with any given TF’s direct DNA binding is somewhat predictive of its empirical occupancy, but a great deal of occupancy occurs at sites without the TF’s motif, implying indirect recruitment by another TF whose motif is present.
Conclusions
Mammalian HOT regions are regulatory hubs that integrate the signals from diverse regulatory pathways to quantitatively tune the promoter for RNA polymerase II recruitment.
doi:10.1186/1471-2164-14-720
PMCID: PMC3826616  PMID: 24138567
Transcription factor; ChIP-seq; HOT region; Gene regulation
18.  Long Non-Coding RNA and Alternative Splicing Modulations in Parkinson's Leukocytes Identified by RNA Sequencing 
PLoS Computational Biology  2014;10(3):e1003517.
The continuously prolonged human lifespan is accompanied by an increase in the incidence of neurodegenerative diseases, calling for the development of inexpensive blood-based diagnostics. Analyzing blood cell transcripts by RNA-Seq is a robust means to identify novel biomarkers that is rapidly becoming commonplace. However, there is a lack of tools to discover novel exons, junctions and splicing events and to precisely and sensitively assess differential splicing through RNA-Seq data analysis and across RNA-Seq platforms. Here, we present a new and comprehensive computational workflow for whole-transcriptome RNA-Seq analysis, using an updated version of the software AltAnalyze, to identify both known and novel high-confidence alternative splicing events, and to integrate them with both protein-domains and microRNA binding annotations. We applied the novel workflow on RNA-Seq data from Parkinson's disease (PD) patients' leukocytes pre- and post- Deep Brain Stimulation (DBS) treatment and compared to healthy controls. Disease-mediated changes included decreased usage of alternative promoters and N-termini, 5′-end variations and mutually-exclusive exons. The PD regulated FUS and HNRNP A/B included prion-like domains regulated regions. We also present here a workflow to identify and analyze long non-coding RNAs (lncRNAs) via RNA-Seq data. We identified reduced lncRNA expression and selective PD-induced changes in 13 of over 6,000 detected leukocyte lncRNAs, four of which were inversely altered post-DBS. These included the U1 spliceosomal lncRNA and RP11-462G22.1, each entailing sequence complementarity to numerous microRNAs. Analysis of RNA-Seq from PD and unaffected control brains revealed over 7,000 brain-expressed lncRNAs, of which 3,495 were co-expressed in the leukocytes including U1, which showed both leukocyte and brain increases. Furthermore, qRT-PCR validations confirmed these co-increases in PD leukocytes and two brain regions, the amygdala and substantia-nigra, compared to controls. This novel workflow allows deep multi-level inspection of RNA-Seq datasets and provides a comprehensive new resource for understanding disease transcriptome modifications in PD and other neurodegenerative diseases.
Author Summary
Long non-coding RNAs (lncRNAs) comprise a novel, fascinating class of RNAs with largely unknown biological functions. Parkinson's-disease (PD) is the most frequent motor disorder, and Deep-brain-stimulation (DBS) treatment alleviates the symptoms, but early disease biomarkers are still unknown and new future genetic interference targets are urgently needed. Using RNA-sequencing technology and a novel computational workflow for in-depth exploration of whole-transcriptome RNA-seq datasets, we detected and analyzed lncRNAs in sequenced libraries from PD patients' leukocytes pre and post-treatment and the brain, adding this full profile resource of over 7,000 lncRNAs to the few human tissues-derived lncRNA datasets that are currently available. Our study includes sample-specific database construction, detecting disease-derived changes in known and novel lncRNAs, exons and junctions and predicting corresponding changes in Polyadenylation choices, protein domains and miRNA binding sites. We report widespread transcript structure variations at the splice junction and exons levels, including novel exons and junctions and alteration of lncRNAs followed by experimental validation in PD leukocytes and two PD brain regions compared with controls. Our results suggest lncRNAs involvement in neurodegenerative diseases, and specifically PD. This comprehensive workflow will be of use to the increasing number of laboratories producing RNA-Seq data in a wide range of biomedical studies.
doi:10.1371/journal.pcbi.1003517
PMCID: PMC3961179  PMID: 24651478
19.  DUX4 Binding to Retroelements Creates Promoters That Are Active in FSHD Muscle and Testis 
PLoS Genetics  2013;9(11):e1003947.
The human double-homeodomain retrogene DUX4 is expressed in the testis and epigenetically repressed in somatic tissues. Facioscapulohumeral muscular dystrophy (FSHD) is caused by mutations that decrease the epigenetic repression of DUX4 in somatic tissues and result in mis-expression of this transcription factor in skeletal muscle. DUX4 binds sites in the human genome that contain a double-homeobox sequence motif, including sites in unique regions of the genome as well as many sites in repetitive elements. Using ChIP-seq and RNA-seq on myoblasts transduced with DUX4 we show that DUX4 binds and activates transcription of mammalian apparent LTR-retrotransposons (MaLRs), endogenous retrovirus (ERVL and ERVK) elements, and pericentromeric satellite HSATII sequences. Some DUX4-activated MaLR and ERV elements create novel promoters for genes, long non-coding RNAs, and antisense transcripts. Many of these novel transcripts are expressed in FSHD muscle cells but not control cells, and thus might contribute to FSHD pathology. For example, HEY1, a repressor of myogenesis, is activated by DUX4 through a MaLR promoter. DUX4-bound motifs, including those in repetitive elements, show evolutionary conservation and some repeat-initiated transcripts are expressed in healthy testis, the normal expression site of DUX4, but more rarely in other somatic tissues. Testis expression patterns are known to have evolved rapidly in mammals, but the mechanisms behind this rapid change have not yet been identified: our results suggest that mobilization of MaLR and ERV elements during mammalian evolution altered germline gene expression patterns through transcriptional activation by DUX4. Our findings demonstrate a role for DUX4 and repetitive elements in mammalian germline evolution and in FSHD muscular dystrophy.
Author Summary
Transposable elements (TEs) are found in most genomes, and many TEs create extra copies of themselves in new genomic locations by a process called retrotransposition. TEs are often thought of as genomic parasites that must be suppressed, because retrotransposition can cause great harm to their host organism. However, during evolution, the functions encoded by TEs have sometimes been co-opted to the advantage of the host genome as novel genes or as gene regulatory regions. We studied a human transcription factor called DUX4 that is normally expressed in testis and repressed in muscle. Sometimes muscle repression fails, causing the disease facioscapulohumeral muscular dystrophy (FSHD). We find that DUX4 binds many TE types and can activate their transcription. Some activated TEs have been co-opted as novel promoters for human genes. DUX4's activation of these genes via TEs might be important in the biology of normal testis and may contribute to the FSHD disease process. Our findings raise the possibility that DUX4 and TEs co-evolved, as TEs may have hijacked DUX4 to aid their retrotransposition while DUX4 may have utilized TEs to modify its transcriptional network in the evolving germline.
doi:10.1371/journal.pgen.1003947
PMCID: PMC3836709  PMID: 24278031
20.  Dual functions of TAF7L in adipocyte differentiation 
eLife  2013;2:e00170.
The diverse transcriptional mechanisms governing cellular differentiation and development of mammalian tissue remain poorly understood. Here we report that TAF7L, a paralogue of TFIID subunit TAF7, is enriched in adipocytes and white fat tissue (WAT) in mouse. Depletion of TAF7L reduced adipocyte-specific gene expression, compromised adipocyte differentiation, and WAT development as well. Ectopic expression of TAF7L in myoblasts reprograms these muscle precursors into adipocytes upon induction. Genome-wide mRNA-seq expression profiling and ChIP-seq binding studies confirmed that TAF7L is required for activating adipocyte-specific genes via a dual mechanism wherein it interacts with PPARγ at enhancers and TBP/Pol II at core promoters. In vitro binding studies confirmed that TAF7L forms complexes with both TBP and PPARγ. These findings suggest that TAF7L plays an integral role in adipocyte gene expression by targeting enhancers as a cofactor for PPARγ and promoters as a component of the core transcriptional machinery.
DOI: http://dx.doi.org/10.7554/eLife.00170.001
eLife digest
The development of a single fertilized egg into a highly complex animal is determined by its genome, with a process called differential gene regulation exerting exquisite control over gene expression to ensure that various specialized cells are generated and that many types of tissue are produced. However, the mechanisms responsible for controlling gene expression and, therefore mammalian development, are poorly understood.
Researchers have developed a number of in vitro cell culture models to elucidate the details of differential gene regulation, and this approach has been used to characterize adipocytes—cells that store energy in the form of fat—for close to two decades. The formation of adipocytes, a process known as adipogenesis, has been extensively studied, but there remain major gaps in our knowledge: for example, the identities of many of the transcriptional regulators that are responsible for the differentiation of mesenchymal stem cells into adipocytes remain a mystery. This task is complicated by the fact that some of these regulators are involved in the differentiation of multiple cell lines, and that some of them also have multiple roles in the generation of a single cell type. In addition to being of fundamental interest, improving our knowledge of the properties and behavior of adipocytes is essential for tackling the increasing prevalence of obesity in the developed world.
Zhou et al. now report that TAF7L—a gene that was previously thought to be involved only in the production of sperm cells—has two roles in the differentiation of stem cells to form adipocytes. Using a combination of cellular, biochemical, genetic and genomic techniques, they show that TAF7L interacts with PPARγ, an important adipocyte transcriptional regulator at enhancer sites on the genome to increase the transcription of genes that are involved in adipogenesis. They also show that TAF7L interacts with a general transcription factor called TBP (short for TATA-binding protein) at promoter sequences, again to increase the expression of genes involved in adipogenesis. Moreover, they show that the expression of TAF7L in myoblasts—precursor cells that usually become muscle cells—can induce the formation of fat cells rather than muscle cells. Furthermore, mice lacking TAF7L are lean compared to their normal littermates. A clearer understanding of the underlying causes of fat cell formation could lead to the development of new approaches for the treatment of obesity and associated diseases.
DOI: http://dx.doi.org/10.7554/eLife.00170.002
doi:10.7554/eLife.00170
PMCID: PMC3539393  PMID: 23326641
ChIP-seq; RNA-seq; adipogenesis; C3H10T½; TAF7L; differentiation; Mouse
21.  Interconversion between active and inactive TATA-binding protein transcription complexes in the mouse genome 
Nucleic Acids Research  2011;40(4):1446-1459.
The TATA binding protein (TBP) plays a pivotal role in RNA polymerase II (Pol II) transcription through incorporation into the TFIID and B-TFIID complexes. The role of mammalian B-TFIID composed of TBP and B-TAF1 is poorly understood. Using a complementation system in genetically modified mouse cells where endogenous TBP can be conditionally inactivated and replaced by exogenous mutant TBP coupled to tandem affinity purification and mass spectrometry, we identify two TBP mutations, R188E and K243E, that disrupt the TBP–BTAF1 interaction and B-TFIID complex formation. Transcriptome and ChIP-seq analyses show that loss of B-TFIID does not generally alter gene expression or genomic distribution of TBP, but positively or negatively affects TBP and/or Pol II recruitment to a subset of promoters. We identify promoters where wild-type TBP assembles a partial inactive preinitiation complex comprising B-TFIID, TFIIB and Mediator complex, but lacking TFIID, TFIIE and Pol II. Exchange of B-TFIID in wild-type cells for TFIID in R188E and K243E mutant cells at these primed promoters completes preinitiation complex formation and recruits Pol II to activate their expression. We propose a novel regulatory mechanism involving formation of a partial preinitiation complex comprising B-TFIID that primes the promoter for productive preinitiation complex formation in mammalian cells.
doi:10.1093/nar/gkr802
PMCID: PMC3287176  PMID: 22013162
22.  Transcriptome profile of a bovine respiratory disease pathogen: Mannheimia haemolytica PHL213 
BMC Bioinformatics  2012;13(Suppl 15):S4.
Background
Computational methods for structural gene annotation have propelled gene discovery but face certain drawbacks with regards to prokaryotic genome annotation. Identification of transcriptional start sites, demarcating overlapping gene boundaries, and identifying regulatory elements such as small RNA are not accurate using these approaches. In this study, we re-visit the structural annotation of Mannheimia haemolytica PHL213, a bovine respiratory disease pathogen. M. haemolytica is one of the causative agents of bovine respiratory disease that results in about $3 billion annual losses to the cattle industry. We used RNA-Seq and analyzed the data using freely-available computational methods and resources. The aim was to identify previously unannotated regions of the genome using RNA-Seq based expression profile to complement the existing annotation of this pathogen.
Results
Using the Illumina Genome Analyzer, we generated 9,055,826 reads (average length ~76 bp) and aligned them to the reference genome using Bowtie. The transcribed regions were analyzed using SAMtools and custom Perl scripts in conjunction with BLAST searches and available gene annotation information. The single nucleotide resolution map enabled the identification of 14 novel protein coding regions as well as 44 potential novel sRNA. The basal transcription profile revealed that 2,506 of the 2,837 annotated regions were expressed in vitro, at 95.25% coverage, representing all broad functional gene categories in the genome. The expression profile also helped identify 518 potential operon structures involving 1,086 co-expressed pairs. We also identified 11 proteins with mutated/alternate start codons.
Conclusions
The application of RNA-Seq based transcriptome profiling to structural gene annotation helped correct existing annotation errors and identify potential novel protein coding regions and sRNA. We used computational tools to predict regulatory elements such as promoters and terminators associated with the novel expressed regions for further characterization of these novel functional elements. Our study complements the existing structural annotation of Mannheimia haemolytica PHL213 based on experimental evidence. Given the role of sRNA in virulence gene regulation and stress response, potential novel sRNA described in this study can form the framework for future studies to determine the role of sRNA, if any, in M. haemolytica pathogenesis.
doi:10.1186/1471-2105-13-S15-S4
PMCID: PMC3439734  PMID: 23046475
23.  High-Resolution Transcriptome Maps Reveal Strain-Specific Regulatory Features of Multiple Campylobacter jejuni Isolates 
PLoS Genetics  2013;9(5):e1003495.
Campylobacter jejuni is currently the leading cause of bacterial gastroenteritis in humans. Comparison of multiple Campylobacter strains revealed a high genetic and phenotypic diversity. However, little is known about differences in transcriptome organization, gene expression, and small RNA (sRNA) repertoires. Here we present the first comparative primary transcriptome analysis based on the differential RNA–seq (dRNA–seq) of four C. jejuni isolates. Our approach includes a novel, generic method for the automated annotation of transcriptional start sites (TSS), which allowed us to provide genome-wide promoter maps in the analyzed strains. These global TSS maps are refined through the integration of a SuperGenome approach that allows for a comparative TSS annotation by mapping RNA–seq data of multiple strains into a common coordinate system derived from a whole-genome alignment. Considering the steadily increasing amount of RNA–seq studies, our automated TSS annotation will not only facilitate transcriptome annotation for a wider range of pro- and eukaryotes but can also be adapted for the analysis among different growth or stress conditions. Our comparative dRNA–seq analysis revealed conservation of most TSS, but also single-nucleotide-polymorphisms (SNP) in promoter regions, which lead to strain-specific transcriptional output. Furthermore, we identified strain-specific sRNA repertoires that could contribute to differential gene regulation among strains. In addition, we identified a novel minimal CRISPR-system in Campylobacter of the type-II CRISPR subtype, which relies on the host factor RNase III and a trans-encoded sRNA for maturation of crRNAs. This minimal system of Campylobacter, which seems active in only some strains, employs a unique maturation pathway, since the crRNAs are transcribed from individual promoters in the upstream repeats and thereby minimize the requirements for the maturation machinery. Overall, our study provides new insights into strain-specific transcriptome organization and sRNAs, and reveals genes that could modulate phenotypic variation among strains despite high conservation at the DNA level.
Author Summary
Many species have evolved into diverse strains with phenotypic and genotypic variations that facilitate adaptation to different ecological niches and, in the case of pathogens, to different hosts. Whereas comparison of genome sequences reveals differences and similarities among strains, the consequences of genomic variations can be tracked by studying the functional output from the genome. RNA sequencing has been revolutionizing transcriptome analyses of both pro- and eukaryotes. However, the bioinformatics-based analysis is still lagging behind, and transcriptome features are often manually annotated, which is laborious and time-consuming. This is even more compounded for the analyses of multiple strains. Here we compared the primary transcriptomes of four isolates of Campylobacter jejuni, the leading cause of bacterial gastroenteritis in humans, and provide genome-wide transcriptional start site (TSS) maps using a novel automated annotation method. Our comparative RNA–seq showed that most TSS are conserved in multiple strains, but we also observed SNP–dependent promoter usage. Furthermore, we identified a novel minimal RNA–based CRISPR immune system as well as strain-specific small RNA repertoires. Our automated, comparative TSS annotation will facilitate and improve transcriptome annotation for a wider range of organisms and provides insights into the contribution of transcriptome differences to phenotypic variation among closely related species.
doi:10.1371/journal.pgen.1003495
PMCID: PMC3656092  PMID: 23696746
24.  Discovering Transcription Factor Binding Sites in Highly Repetitive Regions of Genomes with Multi-Read Analysis of ChIP-Seq Data 
PLoS Computational Biology  2011;7(7):e1002111.
Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is rapidly replacing chromatin immunoprecipitation combined with genome-wide tiling array analysis (ChIP-chip) as the preferred approach for mapping transcription-factor binding sites and chromatin modifications. The state of the art for analyzing ChIP-seq data relies on using only reads that map uniquely to a relevant reference genome (uni-reads). This can lead to the omission of up to 30% of alignable reads. We describe a general approach for utilizing reads that map to multiple locations on the reference genome (multi-reads). Our approach is based on allocating multi-reads as fractional counts using a weighted alignment scheme. Using human STAT1 and mouse GATA1 ChIP-seq datasets, we illustrate that incorporation of multi-reads significantly increases sequencing depths, leads to detection of novel peaks that are not otherwise identifiable with uni-reads, and improves detection of peaks in mappable regions. We investigate various genome-wide characteristics of peaks detected only by utilization of multi-reads via computational experiments. Overall, peaks from multi-read analysis have similar characteristics to peaks that are identified by uni-reads except that the majority of them reside in segmental duplications. We further validate a number of GATA1 multi-read only peaks by independent quantitative real-time ChIP analysis and identify novel target genes of GATA1. These computational and experimental results establish that multi-reads can be of critical importance for studying transcription factor binding in highly repetitive regions of genomes with ChIP-seq experiments.
Author Summary
Annotating repetitive regions of genomes experimentally is a challenging task. Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) provides valuable data for characterizing repetitive regions of genomes in terms of transcription factor binding. Although ChIP-seq technology has been maturing, available ChIP-seq analysis methods and software rely on discarding sequence reads that map to multiple locations on the reference genome (multi-reads), thereby generating a missed opportunity for assessing transcription factor binding to highly repetitive regions of genomes. We develop a computational algorithm that takes multi-reads into account in ChIP-seq analysis. We show with computational experiments that multi-reads lead to significant increase in sequencing depths and identification of binding regions that are otherwise not identifiable when only reads that uniquely map to the reference genome (uni-reads) are used. In particular, we show that the number of binding regions identified can increase up to 36%. We support our computational predictions with independent quantitative real-time ChIP validation of binding regions identified only when multi-reads are incorporated in the analysis of a mouse GATA1 ChIP-seq experiment.
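The idea of allocating multi-reads as fractional counts can be sketched as follows. The weighting rule used here, splitting each multi-read among its candidate bins in proportion to the uni-read count already in each bin plus a pseudocount, is an assumption chosen for illustration. The abstract does not spell out the authors' weighting scheme, and `allocate_multireads`, `pseudocount` and the bin labels are hypothetical names.

```python
from collections import defaultdict


def allocate_multireads(uni_counts, multi_reads, pseudocount=1.0):
    """Distribute each multi-read across its candidate bins as fractional counts.

    uni_counts  : dict mapping bin label -> number of uniquely mapped reads
    multi_reads : list of lists; each inner list holds the candidate bins
                  of one multi-read
    """
    totals = defaultdict(float)
    for bin_id, count in uni_counts.items():
        totals[bin_id] += count
    for candidates in multi_reads:
        # Assumed rule: weight is proportional to local uni-read support
        weights = [uni_counts.get(b, 0) + pseudocount for b in candidates]
        norm = sum(weights)
        for bin_id, w in zip(candidates, weights):
            totals[bin_id] += w / norm  # each read contributes exactly 1 in total
    return dict(totals)


# Toy input: two multi-reads, each mapping to the same two bins
uni = {"chr1:10": 8, "chr5:3": 0}
multi = [["chr1:10", "chr5:3"], ["chr1:10", "chr5:3"]]
counts = allocate_multireads(uni, multi)
print(counts)
```

Because every multi-read adds a total weight of one, the overall read depth rises by exactly the number of multi-reads, which matches the increase in sequencing depth the abstract describes.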
doi:10.1371/journal.pcbi.1002111
PMCID: PMC3136429  PMID: 21779159
25.  A signal processing approach for enriched region detection in RNA polymerase II ChIP-seq data 
BMC Bioinformatics  2012;13(Suppl 2):S2.
Background
RNA polymerase II (PolII) is essential in gene transcription, and ChIP-seq experiments have been used to study PolII binding patterns over the entire genome. However, since PolII enriched regions in the genome can be very long, existing peak finding algorithms for ChIP-seq data are not adequate for identifying such long regions.
Methods
Here we propose an enriched region detection method for ChIP-seq data that identifies long enriched regions by combining a signal denoising algorithm with a false discovery rate (FDR) approach. The binned ChIP-seq data for PolII are first denoised using a non-local means (NL-means) algorithm. Then, an FDR approach is developed to determine the threshold for marking enriched regions in the binned histogram.
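The two steps above (NL-means denoising of the binned signal, then an FDR-based threshold) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the FDR procedure, so an empirical estimate against a null signal (for example a control or shuffled track) is swapped in, and all function names and parameters (`nl_means_1d`, `fdr_threshold`, `enriched_regions`, `patch_radius`, `search_radius`, `h`, `alpha`) are hypothetical.

```python
import math


def nl_means_1d(signal, patch_radius=1, search_radius=5, h=1.0):
    """Denoise a 1D binned signal with a basic non-local means filter."""
    n = len(signal)

    def patch(i):
        # Clamp indices at the edges so every patch has the same length.
        return [signal[min(max(i + k, 0), n - 1)]
                for k in range(-patch_radius, patch_radius + 1)]

    out = []
    for i in range(n):
        p_i = patch(i)
        num = den = 0.0
        for j in range(max(0, i - search_radius), min(n, i + search_radius + 1)):
            p_j = patch(j)
            # Mean squared difference between the two patches.
            d2 = sum((a - b) ** 2 for a, b in zip(p_i, p_j)) / len(p_i)
            w = math.exp(-d2 / (h * h))  # similar patches get larger weights
            num += w * signal[j]
            den += w
        out.append(num / den)
    return out


def fdr_threshold(observed, null, alpha=0.05):
    """Smallest threshold t with (#null bins >= t) / (#observed bins >= t) <= alpha.

    Assumption: an empirical FDR estimate against a null signal.
    Returns None if no threshold satisfies the bound.
    """
    for t in sorted(set(observed)):
        n_obs = sum(1 for v in observed if v >= t)
        n_null = sum(1 for v in null if v >= t)
        if n_obs > 0 and n_null / n_obs <= alpha:
            return t
    return None


def enriched_regions(denoised, threshold):
    """Merge consecutive bins at or above threshold into half-open (start, end) intervals."""
    regions, start = [], None
    for i, v in enumerate(denoised):
        if v >= threshold and start is None:
            start = i
        elif v < threshold and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(denoised)))
    return regions
```

Because thresholding is applied to the denoised signal rather than the raw bins, isolated noisy bins are less likely to split one long enriched region into fragments, which is the motivation the abstract gives for denoising first.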
Results
We first test our method using a public PolII ChIP-seq dataset and compare our results with published results obtained using the algorithm HPeak. Our results show high consistency with the published results (80-100%). Then, we apply our proposed method to PolII ChIP-seq data generated in our own study of the effects of hormone on the breast cancer cell line MCF7. The results demonstrate that our method can effectively identify long enriched regions in ChIP-seq datasets. Specifically, in MCF7 control samples we identified 5,911 segments with a length of at least 4 Kbp (maximum 233,000 bp), and in E2-treated MCF7 samples we identified 6,200 such segments (maximum 325,000 bp).
Conclusions
We demonstrated the effectiveness of this method in studying binding patterns of PolII in cancer cells, which enables further in-depth analysis of transcription regulation and epigenetics. Our method complements existing peak detection algorithms for ChIP-seq experiments.
doi:10.1186/1471-2105-13-S2-S2
PMCID: PMC3375632  PMID: 22536865
