The profiling of small RNAs by high-throughput sequencing (smRNA-Seq) has revealed the complexity of the RNA world. Here, we describe a computational scheme for dissecting the plant smRNAome by integrating smRNA-Seq datasets in Arabidopsis thaliana. Our analytical approach first defines ab initio the genomic loci that produce smRNAs as basic units, then utilizes principal component analysis (PCA) to predict novel miRNAs. Secondary structure prediction of the candidates’ putative precursors uncovered a group of long hairpin double-stranded RNAs (lh-dsRNAs) formed by inverted duplications of decayed coding genes. These gene remnants produce miRNA-like small RNAs that are predominantly 21 and 22 nt long, dependent on DCL1 but independent of RDR2 and DCL2/3/4, and associated with AGO1. Additionally, we found two classes of transcription start site-associated (TSSa) RNAs, located on the sense (+) and antisense (−) strands approximately 100–200 bp downstream of TSSs, that are differentially incorporated into AGO1 and AGO4, respectively.
High-throughput sequencing; small RNAs; Principal component analysis; TSS-associated RNAs
Tissue-specific gene expression requires modulation of nucleosomes, allowing transcription factors to occupy cis elements that are accessible only in selected tissues. Master transcription factors control cell-specific genes and define cellular identities, but it is unclear if they possess special abilities to regulate cell-specific chromatin and if such abilities might underlie lineage determination and maintenance. One prevailing view is that several transcription factors enable chromatin access in combination. The homeodomain protein CDX2 specifies the embryonic intestinal epithelium, through unknown mechanisms, and partners with transcription factors such as HNF4A in the adult intestine. We examined enhancer chromatin and gene expression following Cdx2 or Hnf4a excision in mouse intestines. HNF4A loss did not affect CDX2 binding or chromatin, whereas CDX2 depletion modified chromatin significantly at CDX2-bound enhancers, disrupted HNF4A occupancy, and abrogated expression of neighboring genes. Thus, CDX2 maintains transcription-permissive chromatin, illustrating a powerful and dominant effect on enhancer configuration in an adult tissue. Similar, hierarchical control of cell-specific chromatin states is probably a general property of master transcription factors.
Summary: Transcription and chromatin regulators, and histone modifications play essential roles in gene expression regulation. We have created CistromeMap as a web server to provide a comprehensive knowledgebase of all of the publicly available ChIP-Seq and DNase-Seq data in mouse and human. We have also manually curated metadata to ensure annotation consistency, and developed a user-friendly display matrix for quick navigation and retrieval of data for specific factors, cells and papers. Finally, we provide users with summary statistics of ChIP-Seq and DNase-Seq studies.
Availability: Freely available on the web at http://cistrome.dfci.harvard.edu/pc/
Nuclear receptors (NRs) comprise a superfamily of ligand-activated transcription factors that play important roles in both physiology and diseases including cancer. The technologies of Chromatin ImmunoPrecipitation followed by array hybridization (ChIP-chip) or massively parallel sequencing (ChIP-seq) have been used to map, at an unprecedented rate, the in vivo genome-wide binding (cistrome) of NRs in both normal and cancer cells. We developed a curated database of 88 NR cistrome datasets and other associated high-throughput datasets, including 121 collaborating factor cistromes, 94 epigenomes and 319 transcriptomes. Through integrative analysis of the curated NR ChIP-chip/seq datasets, we discovered novel factor-specific noncanonical motifs that may have important regulatory roles. We also found that NR pioneering factors share a tendency to recognize relatively short, AT-rich motifs. Most NRs bind predominantly to introns and distal intergenic regions, and binding sites closer to transcription start sites (TSSs) were found to be neither stronger nor more evolutionarily conserved. Interestingly, while most NRs appear to be predominantly transcriptional activators, our analysis suggests that the binding of ESR1, RARA and RARG has both activating and repressive effects. Through meta-analysis of different omic data of the same cancer cell line model from multiple studies, we generated consensus cistrome and expression profiles. We further made probabilistic predictions of the NR target genes by integrating cistrome and transcriptome data, and validated the predictions using expression data from tumor samples. The final database, with comprehensive cistrome, epigenome and transcriptome datasets and downstream analysis results, constitutes a valuable resource for the nuclear receptor and cancer community.
Genome-wide ChIP-chip assays of protein–DNA interactions yield large volumes of data requiring effective statistical analysis to obtain reliable results. Successful analysis methods need to be tailored to platform specific characteristics such as probe density, genome coverage, and the nature of the controls. We describe the use of the respective software packages MAT and MA2C for the analysis of ChIP-chip data from one-color Affymetrix and two-color NimbleGen or Agilent tiling microarrays.
ChIP-chip; probe modeling; normalization; peak detection
Fusion of the androgen receptor-regulated (AR-regulated) TMPRSS2 gene with ERG in prostate cancer (PCa) causes androgen-stimulated overexpression of ERG, an ETS transcription factor, but the critical downstream effectors through which ERG mediates PCa development remain to be established. Expression of the SOX9 transcription factor correlated with TMPRSS2:ERG fusion in 3 independent PCa cohorts, and ERG-dependent expression of SOX9 was confirmed by RNAi in the fusion-positive VCaP cell line. SOX9 has been shown to mediate ductal morphogenesis in fetal prostate and maintain stem/progenitor cell pools in multiple adult tissues, and has also been linked to PCa and other cancers. SOX9 overexpression resulted in neoplasia in murine prostate and stimulated tumor invasion, similarly to ERG. Moreover, SOX9 depletion in VCaP cells markedly impaired invasion and growth in vitro and in vivo, establishing SOX9 as a critical downstream effector of ERG. Finally, we found that ERG regulated SOX9 indirectly by opening a cryptic AR-regulated enhancer in the SOX9 gene. Together, these results demonstrate that ERG redirects AR to a set of genes including SOX9 that are not normally androgen stimulated, and identify SOX9 as a critical downstream effector of ERG in TMPRSS2:ERG fusion–positive PCa.
As we come to the end of 2011, Genome Biology has asked some members of our Editorial Board for their views on the state of play in genomics. What was their favorite paper of 2011? What are the challenges in their particular research area? Who has had the biggest influence on their careers? What advice would they give to young researchers embarking on a career in research?
We performed a systematic evaluation of how variations in sequencing depth and other parameters influence interpretation of chromatin immunoprecipitation followed by sequencing (ChIP-seq) experiments. Using Drosophila S2 cells, we generated ChIP-seq datasets for a site-specific transcription factor (Suppressor of Hairy-wing) and a histone modification (H3K36me3). We detected a chromatin state bias: open chromatin regions yielded higher coverage, which led to false positives if not corrected and had a greater effect on detection specificity than any base-composition bias. Paired-end sequencing revealed that single-end data underestimated ChIP library complexity at high coverage. The removal of reads originating at the same base reduced false positives while having little effect on detection sensitivity. Even at a depth of ~1 read/bp coverage of the mappable genome, ~1% of the narrow peaks detected on a tiling array were missed by ChIP-seq. Evaluation of widely used ChIP-seq analysis tools suggests that adjustments or algorithm improvements are required to handle datasets with deep coverage.
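The duplicate-read filter mentioned in the abstract above can be illustrated with a minimal sketch. The read layout (chromosome, 5′ start, strand) and the choice to treat the two strands separately are assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
from typing import Iterable, List, Tuple

# Hypothetical read layout: (chromosome, 5' start position, strand).
Read = Tuple[str, int, str]


def remove_duplicate_starts(reads: Iterable[Read]) -> List[Read]:
    """Keep only the first read seen at each (chromosome, start, strand)."""
    seen = set()
    kept = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            kept.append(read)
    return kept


# Made-up reads: two share the same start on the plus strand.
reads = [
    ("chr2L", 100, "+"),
    ("chr2L", 100, "+"),
    ("chr2L", 100, "-"),
    ("chr2L", 250, "+"),
]
unique_reads = remove_duplicate_starts(reads)
```

Keeping the first read per position preserves input order, which is convenient when the reads are already sorted by coordinate.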
Androgen receptor (AR) is reactivated in castration resistant prostate cancer (CRPC) through mechanisms including marked increases in AR gene expression. We identify an enhancer in the AR second intron contributing to increased AR expression at low androgen levels in CRPC. Moreover, at increased androgen levels the AR binds this site and represses AR gene expression through recruitment of lysine specific demethylase 1 (LSD1) and H3K4me1,2 demethylation. AR similarly represses expression of multiple genes mediating androgen synthesis, DNA synthesis and proliferation, while stimulating genes mediating lipid and protein biosynthesis. Androgen levels in CRPC appear adequate to stimulate AR activity on enhancer elements, but not suppressor elements, resulting in increased expression of AR and AR repressed genes that contribute to cellular proliferation.
prostate cancer; androgen receptor; androgen deprivation therapy; H3K4 methylation; LSD1
STAT5 is a transcription factor essential for hematopoietic physiology. STAT5 functions to transduce signals from cytokines to the nucleus where it regulates gene expression. Although several important transcriptional targets of STAT5 are known, most remain unidentified. To identify novel STAT5 targets, we searched chromosomes 21 and 22 for clusters of STAT5 binding sites contained within regions of interspecies homology. We identified four such regions, including one with tandem STAT5 binding sites in the first intron of the NCAM2 gene. Unlike known STAT5 binding sites, this site is found within a very large intron and resides ~200 kb from the first coding exon of NCAM2. We demonstrate that this region confers STAT5-dependent transcriptional activity. We show that STAT5 binds in vivo to the NCAM2 intron in the NKL natural killer cell line and that this binding is induced by cytokines that activate STAT5. Neither STAT1 nor STAT3 binds to this region, despite sharing a consensus binding sequence with STAT5. Activation of STAT4 and STAT5 causes both of these STATs to accumulate at the NCAM2 regulatory region. Therefore, using an informatics-based approach to identify STAT5 targets, we have identified NCAM2 as both a STAT4- and STAT5-regulated gene, and we show that its expression is regulated by cytokines essential for natural killer cell survival and differentiation. This strategy may be an effective way to identify functional binding regions for transcription factors with known cognate binding sites anywhere in the genome.
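A search for clusters of binding sites, as described in the abstract above, might be sketched as follows. The consensus pattern (TTC, three arbitrary bases, GAA), the 50 bp gap and the two-site minimum are assumptions chosen for illustration; the abstract gives none of these parameters, and the interspecies-homology filter is not modeled here.

```python
import re
from typing import List, Tuple

# Assumed consensus for illustration: TTC, any three bases, GAA.
# The lookahead lets overlapping matches be reported as well.
SITE = re.compile(r"(?=(TTC[ACGT]{3}GAA))")


def find_sites(sequence: str) -> List[int]:
    """Return 0-based start positions of every consensus match."""
    return [m.start() for m in SITE.finditer(sequence.upper())]


def cluster_sites(positions: List[int], max_gap: int = 50,
                  min_sites: int = 2) -> List[Tuple[int, int]]:
    """Group positions whose neighbours lie within max_gap bp.

    Returns (first_start, last_start) for each group with at least
    min_sites members.
    """
    clusters = []
    current: List[int] = []
    for pos in sorted(positions):
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_sites:
                clusters.append((current[0], current[-1]))
            current = []
        current.append(pos)
    if len(current) >= min_sites:
        clusters.append((current[0], current[-1]))
    return clusters


# Made-up sequence: two nearby sites, then an isolated one.
seq = "AAATTCCTGGAAGGGTTCTAAGAATTT" + "A" * 100 + "TTCAAAGAACC"
sites = find_sites(seq)
clusters = cluster_sites(sites)
```

Because the assumed pattern is its own reverse complement, scanning one strand is sufficient in this sketch.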
Endocrine therapies for breast cancer that target the estrogen receptor (ER) are ineffective in the 25-30% of cases that are ER negative (ER−). Androgen receptor (AR) is expressed in 60-70% of breast tumors, independent of ER status. How androgens and AR regulate breast cancer growth remains largely unknown. We find that AR is enriched in ER− breast tumors that over-express HER2. Through analysis of the AR cistrome and androgen-regulated gene expression in ER−/HER2+ breast cancers, we find that AR mediates ligand-dependent activation of Wnt and HER2 signaling pathways through direct transcriptional induction of WNT7B and HER3. Specific targeting of AR, Wnt or HER2 signaling impairs androgen-stimulated tumor cell growth, suggesting potential therapeutic approaches for ER−/HER2+ breast cancers.
Disruption of the circadian clock exacerbates metabolic diseases including obesity and diabetes. Here we show that histone deacetylase 3 (HDAC3) recruitment to the genome displays a circadian rhythm in mouse liver. Histone acetylation is inversely related to HDAC3 binding, and this rhythm is lost when HDAC3 is absent. Although amounts of HDAC3 are constant, its genomic recruitment in liver corresponds to the expression pattern of the circadian nuclear receptor Rev-erbα. Rev-erbα colocalizes with HDAC3 near genes regulating lipid metabolism, and deletion of HDAC3 or Rev-erbα in mouse liver causes hepatic steatosis. Thus, genomic recruitment of HDAC3 by Rev-erbα directs a circadian rhythm of histone acetylation and gene expression required for normal hepatic lipid homeostasis.
Summary: Transcription factor binding events are frequently associated with a pattern of nucleosome occupancy changes in which nucleosomes flanking the binding site increase in occupancy, while those in the vicinity of the binding site itself are displaced. Genome-wide information on enhancer proximal nucleosome occupancy can be readily acquired using ChIP-seq targeting enhancer-related histone modifications such as H3K4me2. Here, we present a software package, BINOCh that allows biologists to use such data to infer the identity of key transcription factors that regulate the response of a cell to a stimulus or determine a program of differentiation.
Availability: The BINOCh open source Python package is freely available at http://liulab.dfci.harvard.edu/BINOCh under the FreeBSD license.
Contact: firstname.lastname@example.org; email@example.com
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: With the rapid development of high-throughput sequencing technologies, the genome-wide profiling of nucleosome positioning has become increasingly affordable. Many future studies will investigate the dynamic behaviour of nucleosome positioning in cells that have different states or that are exposed to different conditions. However, a robust method to effectively identify the regions of differential nucleosome positioning (RDNPs) has not been previously available.
Results: We describe a novel computational approach, DiNuP, that compares nucleosome profiles generated by high-throughput sequencing under various conditions. DiNuP provides a statistical P-value for each identified RDNP based on the difference of read distributions. DiNuP also empirically estimates the false discovery rate as a cutoff when two samples have different sequencing depths and differentiates reliable RDNPs from the background noise. Evaluation of DiNuP showed it to be both sensitive and specific for the detection of changes in nucleosome location, occupancy and fuzziness. RDNPs that were identified using publicly available datasets revealed that nucleosome positioning dynamics are closely related to the epigenetic regulation of transcription.
Availability and implementation: DiNuP is implemented in Python and is freely available at http://www.tongji.edu.cn/~zhanglab/DiNuP.
Supplementary data are available at Bioinformatics online.
Histone modifications play important roles in regulating eukaryotic gene expression and have been used to model expression levels. Here, we present a regression model to systematically infer mRNA stability by comparing transcriptome profiles with ChIP-seq of H3K4me3, H3K27me3 and H3K36me3. The results from multiple human and mouse cell lines show that the inferred unstable mRNAs have significantly longer 3′ untranslated regions (UTRs) and more microRNA binding sites within the 3′UTR than the inferred stable mRNAs. Regression residuals derived from RNA-seq, but not from GRO-seq, are highly correlated with the half-lives measured by pulse-labeling experiments, supporting the rationale of our inference. Whereas the functions enriched in the inferred stable and unstable mRNAs are consistent with those from pulse-labeling experiments, we found that the unstable mRNAs have higher cell-type specificity under functional constraint. We conclude that the systematic use of histone modifications can differentiate non-expressed mRNAs from unstable mRNAs, and distinguish stable mRNAs from highly expressed ones. In summary, we present the first computational model of mRNA stability inference that compares transcriptome and epigenome profiles, and provides an alternative strategy for directing experimental measurements.
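The residual-based inference described above can be illustrated with a deliberately simplified sketch. It fits ordinary least squares with a single histone-mark predictor rather than the three marks named in the abstract, uses made-up values, and assumes that a strongly negative residual (less mRNA than the chromatin signal predicts) points to lower stability. None of these choices is taken from the paper's actual model.

```python
from typing import List, Tuple


def fit_line(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(x: List[float], y: List[float]) -> List[float]:
    """Observed minus fitted value for every gene."""
    slope, intercept = fit_line(x, y)
    return [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]


# Hypothetical log-scale values for five genes.
h3k36me3_signal = [1.0, 2.0, 3.0, 4.0, 5.0]
rna_level = [1.2, 1.9, 2.0, 4.1, 5.0]

res = residuals(h3k36me3_signal, rna_level)
# Index of the gene with the most negative residual.
least_stable_guess = min(range(len(res)), key=res.__getitem__)
```

Extending the sketch to several marks would require a multiple-regression solver, which is omitted here to keep the example dependency-free.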
Cell differentiation requires remodeling of tissue-specific gene loci and activities of key transcriptional regulators, which are recognized for their dominant control over cellular programs. Using epigenomic methods, we characterized enhancer elements specifically modified in differentiating intestinal epithelial cells and found enrichment of transcription factor-binding motifs corresponding to CDX2, a critical regulator of the intestine. Directed investigation revealed surprising lability in CDX2 occupancy of the genome, with redistribution from hundreds of sites occupied only in proliferating cells to thousands of new sites in differentiated cells. Knockout mice confirmed distinct Cdx2 requirements in dividing and mature adult intestinal cells, including responsibility for the active enhancer configuration associated with maturity. Dynamic CDX2 occupancy corresponds with condition-specific gene expression and, importantly, to differential co-occupancy with other tissue-restricted transcription factors such as GATA6 and HNF4A. These results reveal dynamic, context-specific functions and mechanisms of a prominent transcriptional regulator within a cell lineage.
Transcription factors that potently induce cell fate often remain expressed in the induced organ throughout life, but their requirements in adults are uncertain and varied. Mechanistically, it is unclear if they activate only tissue-specific genes or also directly repress heterologous genes. We conditionally inactivated mouse Cdx2, a dominant regulator of intestinal development, and mapped its genome occupancy in adult intestinal villi. Although homeotic transformation, observed in Cdx2-null embryos, was absent in mutant adults, gene expression and cell morphology were vitally compromised. Lethality was significantly accelerated in mice lacking both Cdx2 and its homolog Cdx1, with particular exaggeration of defects in villus enterocyte differentiation. Importantly, Cdx2 occupancy correlated with hundreds of transcripts that fell but not with equal numbers that rose with Cdx loss, indicating a predominantly activating role at intestinal cis-regulatory regions. Integrated consideration of a transcription factor's mutant phenotype and cistrome hence reveals the continued and distinct requirement in adults of a critical developmental regulator that activates tissue-specific genes.
5-hydroxymethylcytosine (5hmC) is a modified base present at low levels in diverse cell types in mammals1–5. 5hmC is generated by the TET family of Fe(II) and 2-oxoglutarate-dependent enzymes through oxidation of 5-methylcytosine (5mC)1,2,4–7. 5hmC and TET proteins have been implicated in stem cell biology and cancer1,4,5,8,9, but information on the genome-wide distribution of 5hmC is limited. Here we describe two novel and specific approaches to profile the genomic localization of 5hmC. The first approach, termed GLIB (glucosylation, periodate oxidation, biotinylation) uses a combination of enzymatic and chemical steps to isolate DNA fragments containing as few as a single 5hmC. The second approach involves conversion of 5hmC to cytosine 5-methylenesulphonate (CMS) by treatment of genomic DNA with sodium bisulphite, followed by immunoprecipitation of CMS-containing DNA with a specific antiserum to CMS5. High-throughput sequencing of 5hmC-containing DNA from mouse embryonic stem (ES) cells showed strong enrichment within exons and near transcriptional start sites. 5hmC was especially enriched at the start sites of genes whose promoters bear dual histone 3 lysine 27 trimethylation (H3K27me3) and histone 3 lysine 4 trimethylation (H3K4me3) marks. Our results indicate that 5hmC has a probable role in transcriptional regulation, and suggest a model in which 5hmC contributes to the ‘poised’ chromatin signature found at developmentally-regulated genes in ES cells.
TET2 is a close relative of TET1, an enzyme that converts 5-methylcytosine (5-mC) to 5-hydroxymethylcytosine (5-hmC) in DNA1,2. The gene encoding TET2 resides at chromosome 4q24, in a region showing recurrent microdeletions and copy-neutral loss of heterozygosity (CN-LOH) in patients with diverse myeloid malignancies3. Somatic TET2 mutations are frequently observed in myelodysplastic syndromes (MDS), myeloproliferative neoplasms (MPN), MDS/MPN overlap syndromes including chronic myelomonocytic leukemia (CMML), acute myeloid leukemias (AML) and secondary AML (sAML)4–12. We show here that TET2 mutations associated with myeloid malignancies compromise TET2 catalytic activity. Bone marrow samples from patients with TET2 mutations displayed uniformly low levels of 5-hmC in genomic DNA compared to bone marrow samples from healthy controls. Moreover, small hairpin RNA (shRNA)-mediated depletion of Tet2 in mouse haematopoietic precursors skewed their differentiation towards monocyte/macrophage lineages in culture. There was no significant difference in DNA methylation between bone marrow samples from patients with high 5-hmC versus healthy controls, but samples from patients with low 5-hmC showed hypomethylation relative to controls at the majority of differentially-methylated CpG sites. Our results demonstrate that TET2 is important for normal myelopoiesis, and suggest that disruption of TET2 enzymatic activity favours myeloid tumorigenesis. Measurement of 5-hmC levels in myeloid malignancies may prove valuable as a diagnostic and prognostic tool, to tailor therapies and assess responses to anti-cancer drugs.
MicroRNAs (miRNAs) are a class of 20–23 nucleotide small RNAs that regulate gene expression post-transcriptionally in animals and plants. Annotation of miRNAs by the miRNA database (miRBase) has largely relied on computational approaches. As a result, many miRBase entries lack experimental validation, and discrepancies between miRBase annotation and actual miRNA sequences are often observed. In this study, we integrated the small RNA sequencing (smRNA-seq) datasets in Caenorhabditis elegans and Drosophila melanogaster and devised an analytical pipeline coupled with detailed manual inspection to curate miRNA annotation systematically in miRBase. Our analysis reveals that 19 (17.0%) and 51 (31.3%) miRNA entries with detectable smRNA-seq reads have mature sequence discrepancies in C. elegans and D. melanogaster, respectively. These discrepancies frequently occur either for conserved miRNA families whose mature sequences were predicted according to their homologous counterparts in other species or for miRNAs whose precursor miRNA (pre-miRNA) hairpins produce an abundance of multiple miRNA isoforms or variants. Our analysis shows that while Drosophila pre-miRNAs, on average, produce less than 60% accurate mature miRNA reads in addition to their 5′ and 3′ variant isoforms, the precision of miRNA processing in C. elegans is much higher, at over 90%. Based on the revised miRNA sequences, we analyzed expression patterns of the more conserved (MC) and less conserved (LC) miRNAs and found that, whereas MC miRNAs are often co-expressed at multiple developmental stages, LC miRNAs tend to be expressed specifically at fewer stages.
microRNA; deep sequencing; database curation
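The processing precision quoted in the preceding abstract can be read as the share of reads from a hairpin arm that match the annotated mature sequence exactly. The sketch below assumes that definition; the read counts are made up and the function name is hypothetical.

```python
from typing import Dict


def processing_precision(read_counts: Dict[str, int], mature: str) -> float:
    """Fraction of reads whose sequence equals the annotated mature miRNA."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads available for this hairpin arm")
    return read_counts.get(mature, 0) / total


# Illustrative counts for one hairpin arm (not real data).
counts = {
    "UGAGGUAGUAGGUUGUAUAGUU": 90,  # matches the annotation exactly
    "UGAGGUAGUAGGUUGUAUAGU": 6,    # 3' variant, one base shorter
    "GAGGUAGUAGGUUGUAUAGUU": 4,    # 5' variant, shifted start
}
precision = processing_precision(counts, "UGAGGUAGUAGGUUGUAUAGUU")
```

With these counts, 90 of 100 reads match the annotation, so the arm would be scored at 90% precision.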
After fertilization the embryonic genome is inactive until transcription is initiated during the maternal-zygotic transition1,2,3. This transition coincides with the formation of pluripotent cells, which in mammals can be used to generate embryonic stem cells. To study the changes in chromatin structure that accompany pluripotency and genome activation, we mapped the genomic locations of histone H3 molecules bearing Lysine trimethylation modifications before and after the maternal-zygotic transition in zebrafish. Trimethylation of Lysine 27, which is repressive, and trimethylation of Lysine 4, which is activating, were not detected before the transition. After genome activation, more than 80% of genes were marked by Lysine 4 trimethylation, including many inactive developmental regulatory genes that were also marked by Lysine 27 trimethylation. Sequential chromatin immunoprecipitation demonstrated that the same promoter regions had both trimethylation marks. Such bivalent chromatin domains also exist in embryonic stem cells and are thought to poise genes for activation while keeping them repressed4,5,6,7,8. In addition, we found many inactive genes that were uniquely marked by Lysine 4 trimethylation. Despite this activating modification, these monovalent genes were neither expressed nor stably bound by RNA polymerase II. Inspection of published datasets revealed similar monovalent domains in embryonic stem cells. Moreover, Lysine 4 trimethylation marks could form in the absence of both sequence-specific transcriptional activators and stable association of RNA pol II, as indicated by the analysis of an inducible transgene. These results suggest that bivalent and monovalent domains might poise embryonic genes for activation and that the chromatin profile associated with pluripotency is established during the maternal-zygotic transition.
Cancer is known to have abundant copy number alterations (CNAs) that greatly contribute to its pathogenesis and progression. Investigation of CNA regions could potentially help identify oncogenes and tumor suppressor genes and infer cancer mechanisms. Although single-nucleotide polymorphism (SNP) arrays have strengthened our ability to identify CNAs with unprecedented resolution, a comprehensive collection of CNA information from SNP array data is still lacking. We developed a web-based CaSNP (http://cistrome.dfci.harvard.edu/CaSNP/) database for storing and interrogating quantitative CNA data, which curated ∼11 500 SNP arrays on 34 different cancer types in 104 studies. With a user input of region or gene of interest, CaSNP will return the CNA information summarizing the frequencies of gain/loss and averaged copy number for each study, and provide links to download the data or visualize it in UCSC Genome Browser. CaSNP also displays the heatmap showing copy numbers estimated at each SNP marker around the query region across all studies for a more comprehensive visualization. Finally, we used CaSNP to study the CNA of protein-coding genes as well as LincRNA genes across all cancer SNP arrays, and found putative regions harboring novel oncogenes and tumor suppressors. In summary, CaSNP is a useful tool for cancer CNA association studies, with the potential to facilitate both basic science and translational research on cancer.
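The per-study summary that CaSNP is described as returning (gain and loss frequencies plus averaged copy number) can be sketched as below. The cutoffs of 2.5 and 1.5 copies, the function name and the sample values are assumptions for illustration only; the abstract does not state how gains and losses are called.

```python
from typing import Dict, List


def summarize_study(copy_numbers: List[float],
                    gain_cutoff: float = 2.5,
                    loss_cutoff: float = 1.5) -> Dict[str, float]:
    """Summarize one study's copy-number estimates at a query locus."""
    n = len(copy_numbers)
    if n == 0:
        raise ValueError("study has no samples")
    gains = sum(1 for c in copy_numbers if c > gain_cutoff)
    losses = sum(1 for c in copy_numbers if c < loss_cutoff)
    return {
        "gain_freq": gains / n,
        "loss_freq": losses / n,
        "mean_copy_number": sum(copy_numbers) / n,
    }


# Made-up copy-number estimates, one value per tumor sample.
studies = {
    "study_A": [2.0, 3.1, 2.9, 1.2],
    "study_B": [2.1, 1.9, 2.0, 2.0],
}
summary = {name: summarize_study(values) for name, values in studies.items()}
```

In this toy input, study_A shows gains in half of its samples and a loss in one, while study_B stays near two copies throughout.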