Despite growing appreciation of the importance of long non-coding RNA (lncRNA) in normal physiology and disease, our knowledge of cancer-related lncRNA remains limited. By repurposing microarray probes, we constructed the expression profile of 10,207 lncRNA genes in approximately 1,300 tumors over four different cancer types. Through integrative analysis of the lncRNA expression profiles with clinical outcome and somatic copy number alteration (SCNA), we identified lncRNA that are associated with cancer subtypes and clinical prognosis, and predicted those that are potential drivers of cancer progression. We validated our predictions by experimentally confirming prostate cancer cell growth dependence on two novel lncRNA. Our analysis provided a resource of clinically relevant lncRNA for development of lncRNA biomarkers and identification of lncRNA therapeutic targets. It also demonstrated the power of integrating publicly available genomic datasets and clinical information for discovering disease-associated lncRNA.
Diversified histone modifications (HMs) are essential epigenetic features. They play important roles in fundamental biological processes including transcription, DNA repair and DNA replication. Chromatin regulators (CRs), which are indispensable in epigenetics, can mediate HMs to adjust chromatin structures and functions. With the development of ChIP-Seq technology, there is an opportunity to study CR and HM profiles at the whole-genome scale. However, no specific resource for the integration of CR ChIP-Seq data or CR-HM ChIP-Seq linkage pairs is currently available. Therefore, we constructed the CR Cistrome database, available online at http://compbio.tongji.edu.cn/cr and http://cistrome.org/cr/, to further elucidate CR functions and CR-HM linkages. Within this database, we collected all publicly available ChIP-Seq data on CRs in human and mouse and categorized the data into four cohorts: the reader, writer, eraser and remodeler cohorts, together with curated introductions and ChIP-Seq data analysis results. For the HM readers, writers and erasers, we provided further ChIP-Seq analysis data for the targeted HMs and schematized the relationships between them. We believe CR Cistrome is a valuable resource for the epigenetics community.
The profiling of small RNAs by high throughput sequencing (smRNA-Seq) has revealed the complexity of the RNA world. Here, we describe a computational scheme for dissecting the plant smRNAome by integrating smRNA-Seq datasets in Arabidopsis thaliana. Our analytical approach first defines ab initio the genomic loci that produce smRNAs as basic units, then utilizes principal component analysis (PCA) to predict novel miRNAs. Secondary structure prediction of candidates’ putative precursors discovered a group of long hairpin double-stranded RNAs (lh-dsRNAs) formed by inverted duplications of decayed coding genes. These gene remnants produce miRNA-like small RNAs which are predominantly 21- and 22-nt long, dependent on DCL1 but independent of RDR2 and DCL2/3/4, and associated with AGO1. Additionally, we found two classes of transcription start site-associated RNAs (TSSa-RNAs), located on the sense (+) and antisense (−) strands approximately 100-200 bp downstream of TSSs, which are differentially incorporated into AGO1 and AGO4, respectively.
High-throughput sequencing; small RNAs; Principal component analysis; TSS-associated RNAs
Tissue-specific gene expression requires modulation of nucleosomes, allowing transcription factors to occupy cis elements that are accessible only in selected tissues. Master transcription factors control cell-specific genes and define cellular identities, but it is unclear if they possess special abilities to regulate cell-specific chromatin and if such abilities might underlie lineage determination and maintenance. One prevailing view is that several transcription factors enable chromatin access in combination. The homeodomain protein CDX2 specifies the embryonic intestinal epithelium, through unknown mechanisms, and partners with transcription factors such as HNF4A in the adult intestine. We examined enhancer chromatin and gene expression following Cdx2 or Hnf4a excision in mouse intestines. HNF4A loss did not affect CDX2 binding or chromatin, whereas CDX2 depletion modified chromatin significantly at CDX2-bound enhancers, disrupted HNF4A occupancy, and abrogated expression of neighboring genes. Thus, CDX2 maintains transcription-permissive chromatin, illustrating a powerful and dominant effect on enhancer configuration in an adult tissue. Similar, hierarchical control of cell-specific chromatin states is probably a general property of master transcription factors.
Nuclear receptors (NRs) comprise a superfamily of ligand-activated transcription factors that play important roles in both physiology and diseases including cancer. The technologies of Chromatin ImmunoPrecipitation followed by array hybridization (ChIP-chip) or massively parallel sequencing (ChIP-seq) have been used to map, at an unprecedented rate, the in vivo genome-wide binding (cistrome) of NRs in both normal and cancer cells. We developed a curated database of 88 NR cistrome datasets and other associated high-throughput datasets, including 121 collaborating factor cistromes, 94 epigenomes and 319 transcriptomes. Through integrative analysis of the curated NR ChIP-chip/seq datasets, we discovered novel factor-specific noncanonical motifs that may have important regulatory roles. We also revealed a common feature of NR pioneering factors to recognize relatively short and AT-rich motifs. Most NRs bind predominantly to introns and distal intergenic regions, and binding sites closer to transcription start sites (TSSs) were found to be neither stronger nor more evolutionarily conserved. Interestingly, while most NRs appear to be predominantly transcriptional activators, our analysis suggests that the binding of ESR1, RARA and RARG has both activating and repressive effects. Through meta-analysis of different omic data of the same cancer cell line model from multiple studies, we generated consensus cistrome and expression profiles. We further made probabilistic predictions of the NR target genes by integrating cistrome and transcriptome data, and validated the predictions using expression data from tumor samples. The final database, with comprehensive cistrome, epigenome, transcriptome datasets, and downstream analysis results, constitutes a valuable resource for the nuclear receptor and cancer community.
We performed a systematic evaluation of how variations in sequencing depth and other parameters influence interpretation of Chromatin immunoprecipitation (ChIP) followed by sequencing (ChIP-seq) experiments. Using Drosophila S2 cells, we generated ChIP-seq datasets for a site-specific transcription factor (Suppressor of Hairy-wing) and a histone modification (H3K36me3). We detected a chromatin state bias: open chromatin regions yielded higher coverage, which led to false positives if not corrected and had a greater effect on detection specificity than any base-composition bias. Paired-end sequencing revealed that single-end data underestimated ChIP library complexity at high coverage. The removal of reads originating at the same base reduced false positives while having little effect on detection sensitivity. Even at a depth of ~1 read/bp coverage of mappable genome, ~1% of the narrow peaks detected on a tiling array were missed by ChIP-seq. Evaluation of widely-used ChIP-seq analysis tools suggests that adjustments or algorithm improvements are required to handle datasets with deep coverage.
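The duplicate-removal step mentioned above (dropping reads that originate at the same base) can be illustrated with a minimal sketch. The `Read` record, the 0-based half-open coordinates, and the choice to keep the first read seen at each 5′ position are assumptions made for illustration; they are not the pipeline used in the study.

```python
from typing import Iterable, List, NamedTuple


class Read(NamedTuple):
    """Hypothetical single-end alignment record (0-based, end-exclusive)."""
    chrom: str
    start: int
    end: int
    strand: str  # "+" or "-"


def five_prime(read: Read) -> int:
    """Return the base at which the sequenced fragment originates."""
    return read.start if read.strand == "+" else read.end - 1


def remove_duplicate_starts(reads: Iterable[Read]) -> List[Read]:
    """Keep only the first read seen for each (chrom, 5' base, strand)."""
    seen = set()
    kept = []
    for read in reads:
        key = (read.chrom, five_prime(read), read.strand)
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


reads = [
    Read("chr2L", 100, 136, "+"),
    Read("chr2L", 100, 140, "+"),  # same 5' base and strand as the first read
    Read("chr2L", 100, 136, "-"),
    Read("chr2L", 105, 136, "-"),  # same 5' base (135) as the previous read
]
kept = remove_duplicate_starts(reads)
```

Keying on the 5′ base rather than the full alignment interval reflects the fact that, for single-end data, only the fragment start is observed; this is also why single-end data can make distinct fragments look like duplicates at high coverage.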
Endocrine therapies for breast cancer that target the estrogen receptor (ER) are ineffective in the 25-30% of cases that are ER negative (ER−). Androgen receptor (AR) is expressed in 60-70% of breast tumors, independent of ER status. How androgens and AR regulate breast cancer growth remains largely unknown. We find that AR is enriched in ER− breast tumors that over-express HER2. Through analysis of the AR cistrome and androgen-regulated gene expression in ER−/HER2+ breast cancers, we find that AR mediates ligand-dependent activation of Wnt and HER2 signaling pathways through direct transcriptional induction of WNT7B and HER3. Specific targeting of AR, Wnt or HER2 signaling impairs androgen-stimulated tumor cell growth, suggesting potential therapeutic approaches for ER−/HER2+ breast cancers.
MicroRNAs (miRNAs) are a class of 20–23 nucleotide small RNAs that regulate gene expression post-transcriptionally in animals and plants. Annotation of miRNAs by the miRNA database (miRBase) has largely relied on computational approaches. As a result, many miRBase entries lack experimental validation, and discrepancies between miRBase annotation and actual miRNA sequences are often observed. In this study, we integrated the small RNA sequencing (smRNA-seq) datasets in Caenorhabditis elegans and Drosophila melanogaster and devised an analytical pipeline coupled with detailed manual inspection to curate miRNA annotation systematically in miRBase. Our analysis reveals that 19 (17.0%) and 51 (31.3%) miRNA entries with detectable smRNA-seq reads have mature sequence discrepancies in C. elegans and D. melanogaster, respectively. These discrepancies frequently occur either for conserved miRNA families whose mature sequences were predicted according to their homologous counterparts in other species or for miRNAs whose precursor miRNA (pre-miRNA) hairpins produce an abundance of multiple miRNA isoforms or variants. Our analysis shows that while Drosophila pre-miRNAs, on average, produce less than 60% accurate mature miRNA reads in addition to their 5′ and 3′ variant isoforms, the precision of miRNA processing in C. elegans is much higher, at over 90%. Based on the revised miRNA sequences, we analyzed expression patterns of the more conserved (MC) and less conserved (LC) miRNAs and found that, whereas MC miRNAs are often co-expressed at multiple developmental stages, LC miRNAs tend to be expressed specifically at fewer stages.
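One way to make the notion of processing precision concrete is to compute, for each hairpin arm, the fraction of reads that exactly match a reference mature sequence, and to flag an entry when the most abundant read differs from the annotation. The sketch below assumes that reads have already been assigned to one arm and collapsed into a sequence-to-count mapping; the function names and the "dominant read" rule are illustrative assumptions, not the curation pipeline described above.

```python
from typing import Mapping


def processing_precision(read_counts: Mapping[str, int], mature: str) -> float:
    """Fraction of reads on one hairpin arm that exactly match `mature`."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads assigned to this hairpin arm")
    return read_counts.get(mature, 0) / total


def dominant_isoform(read_counts: Mapping[str, int]) -> str:
    """Most abundant read sequence (ties broken alphabetically)."""
    return min(read_counts, key=lambda seq: (-read_counts[seq], seq))


def has_discrepancy(read_counts: Mapping[str, int], annotated: str) -> bool:
    """Flag an entry whose dominant read differs from the annotated sequence."""
    return dominant_isoform(read_counts) != annotated


# Hypothetical counts for a let-7-like arm: one main species, two end variants.
counts = {
    "UGAGGUAGUAGGUUGUAUAGUU": 90,
    "GAGGUAGUAGGUUGUAUAGUU": 6,   # 5' variant
    "UGAGGUAGUAGGUUGUAUAGU": 4,   # 3' variant
}
precision = processing_precision(counts, "UGAGGUAGUAGGUUGUAUAGUU")
```

In practice a flagged discrepancy would still need manual inspection, as the abstract notes, because low read depth or multi-mapping reads can make the dominant sequence unreliable.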
microRNA; deep sequencing; database curation
The recent availability of high-density human genome tiling arrays enables biologists to conduct ChIP–chip experiments to locate the in vivo-binding sites of transcription factors in the human genome and explore the regulatory mechanisms. Once genomic regions enriched by transcription factor ChIP–chip are located, genome-scale downstream analyses are crucial but difficult for biologists without strong bioinformatics support. We designed and implemented the first web server to streamline the ChIP–chip downstream analyses. Given genome-scale ChIP regions, the cis-regulatory element annotation system (CEAS) retrieves repeat-masked genomic sequences, calculates GC content, plots evolutionary conservation, maps nearby genes and identifies enriched transcription factor-binding motifs. Biologists can utilize CEAS to retrieve useful information for ChIP–chip validation, assemble important knowledge to include in their publication and generate novel hypotheses (e.g. transcription factor cooperative partner) for further study. CEAS helps the adoption of ChIP–chip in mammalian systems and provides insights towards a more comprehensive understanding of transcriptional regulatory mechanisms. The URL of the server is .
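Of the annotations listed above, GC content is the simplest to illustrate. The sketch below computes GC content for a retrieved region while skipping hard-masked bases; treating `N` as the mask character and ignoring case are assumptions about the input format, and the function is not taken from CEAS itself.

```python
def gc_content(sequence: str) -> float:
    """GC fraction over unmasked bases; 'N' and other symbols are ignored."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no unmasked A/C/G/T bases")
    return gc / (gc + at)


# Hypothetical hard-masked ChIP region: the two N bases are excluded.
region_gc = gc_content("GGCCAATTNN")
```

Excluding masked bases from the denominator keeps heavily repeat-masked regions from appearing artificially AT-rich.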
Next generation sequencing (NGS) technologies have been used in diverse ways to investigate facets of chromatin biology by identifying genomic loci that are bound by transcription factors, occupied by nucleosomes, accessible to nuclease cleavage, or physically interact with remote genomic loci. Reaching sound biological conclusions from such NGS enrichment profiles, however, requires that many potential biases be taken into account. In this Review we discuss common ways in which bias may be introduced into NGS chromatin profiling data, ways in which these biases can be diagnosed, and analytical techniques to mitigate their effect.
Recurrent mutations in histone modifying enzymes imply key roles in tumorigenesis, yet their functional relevance is largely unknown. Here we show that JARID1B, encoding a histone H3 lysine 4 (H3K4) demethylase, is frequently amplified and overexpressed in luminal breast tumors and a somatic mutation in a basal-like breast cancer results in the gain of unique chromatin binding and luminal expression and splicing patterns. Downregulation of JARID1B in luminal cells induces basal gene expression and growth arrest, which is rescued by TGFβ pathway inhibitors. Integrated JARID1B chromatin binding, H3K4 methylation, and expression profiles suggest a key function for JARID1B in luminal cell-specific expression programs. High luminal JARID1B activity is associated with poor outcome in patients with hormone receptor positive breast tumors.
Lineage-restricted transcription factors (TFs) are frequently mutated or overexpressed in cancer and contribute toward malignant behaviors, but the molecular bases of their oncogenic properties are largely unknown. Because TF activities are difficult to inhibit directly with small molecules, the genes and pathways they regulate might represent more tractable targets for drug therapy. We studied GATA6, a TF gene that is frequently amplified or overexpressed in gastric, esophageal, and pancreatic adenocarcinomas. GATA6-overexpressing gastric cancer cell lines cluster in gene expression space, separate from non-overexpressing lines. This expression clustering signifies a shared pathogenic group of genes that GATA6 may regulate through direct cis-element binding. We used chromatin immunoprecipitation and sequencing (ChIP-seq) to identify GATA6-bound genes and considered TF occupancy in relation to genes that respond to GATA6 depletion in cell lines and track with GATA6 mRNA (synexpression groups) in primary gastric cancers. Among other cellular functions, GATA6-occupied genes control apoptosis and govern M-phase of the cell cycle. Depletion of GATA6 reduced levels of the latter transcripts and arrested cells in G2 and M phases of the cell cycle. Synexpression in human tumor samples identified likely direct transcriptional targets substantially better than consideration only of transcripts that respond to GATA6 loss in cultured cells. Candidate target genes responded to loss of GATA6 or its homolog GATA4 and even more to depletion of both proteins. Many GATA6-dependent genes lacked nearby binding sites but several strongly dependent, synexpressed, and GATA6-bound genes encode TFs such as MYC, HES1, RARB, and CDX2. Thus, many downstream effects occur indirectly through other TFs and GATA6 activity in gastric cancer is partially redundant with GATA4.
This integrative analysis of locus occupancy, gene dependency, and synexpression provides a functional signature of GATA6-overexpressing gastric cancers, revealing both limits and new therapeutic directions for a challenging and frequently fatal disease.
transcriptional control of cancer; synexpression groups; somatic copy number alterations; ChIP-seq; GATA transcription factors
Notch signaling has pleiotropic context-specific functions that have essential roles in many processes, including embryonic development and maintenance and homeostasis of adult tissues. Aberrant Notch signaling (both hyper- and hypoactive) is implicated in a number of human developmental disorders and many cancers. Notch receptor signaling is mediated by tightly regulated proteolytic cleavages that lead to the assembly of a nuclear Notch transcription complex, which drives the expression of downstream target genes and thereby executes Notch’s functions. Thus, understanding regulation of gene expression by Notch is central to deciphering how Notch carries out its many activities. Here, we summarize the recent findings pertaining to the complex interplay between the Notch transcriptional complex and interacting factors involved in transcriptional regulation, including co-activators, cooperating transcription factors, and chromatin regulators, and discuss emerging data pertaining to the role of Notch-regulated noncoding RNAs in transcription.
Next generation sequencing was used to identify Notch mutations in a large collection of diverse solid tumors. NOTCH1 and NOTCH2 rearrangements leading to constitutive receptor activation were confined to triple negative breast cancers (TNBC, 6 of 66 tumors). TNBC cell lines with NOTCH1 rearrangements associated with high levels of activated NOTCH1 (N1-ICD) were sensitive to the gamma-secretase inhibitor (GSI) MRK-003, both alone and in combination with paclitaxel, in vitro and in vivo, whereas cell lines with NOTCH2 rearrangements were resistant to GSI. Immunohistochemical staining of N1-ICD in TNBC xenografts correlated with responsiveness, and expression levels of the direct Notch target gene HES4 correlated with outcome in TNBC patients. Activating NOTCH1 point mutations were also identified in other solid tumors, including adenoid cystic carcinoma (ACC). Notably, ACC primary tumor xenografts with activating NOTCH1 mutations and high N1-ICD levels were sensitive to GSI, whereas N1-ICD-low tumors without NOTCH1 mutations were resistant.
Chromatin immunoprecipitation coupled with massively parallel sequencing (ChIP-seq) is a powerful technology to identify the genome-wide locations of DNA binding proteins such as transcription factors or modified histones. As more and more experimental laboratories adopt ChIP-seq to unravel transcriptional and epigenetic regulatory mechanisms, computational analyses of ChIP-seq are also becoming increasingly comprehensive and sophisticated. In this article, we review current computational methodology for ChIP-seq analysis, recommend useful algorithms and workflows, and introduce quality control measures at different analytical steps. We also discuss how ChIP-seq could be integrated with other types of genomic assays, such as gene expression profiling and genome-wide association studies, to provide a more comprehensive view of gene regulatory mechanisms in important physiological and pathological processes.
Canonical Wnt signaling supports the pluripotency of embryonic stem cells (ESCs) but also promotes differentiation of early mammalian cell lineages. To explain these paradoxical observations, we explored the gene regulatory networks at play. Canonical Wnt signaling is intertwined with the pluripotency network comprising Nanog, Oct4, and Sox2 in mouse ESCs. In defined media supporting the derivation and propagation of ESCs, Tcf3 and β-catenin interact with Oct4; Tcf3 binds to the Sox motif within Oct-Sox composite motifs that are also bound by Oct4-Sox2 complexes. Further, canonical Wnt signaling up-regulates the activity of the Pou5f1 distal enhancer via the Sox motif in ESCs. When viewed in the context of published studies on Tcf3 and β-catenin mutants, our findings suggest Tcf3 counters pluripotency by competition with Sox2 at these sites, and Tcf3 inhibition is blocked by β-catenin entry into this complex. Wnt pathway stimulation also triggers β-catenin association at regulatory elements with classic Lef/Tcf motifs associated with differentiation programs. The failure to activate these targets in the presence of a MEK/ERK inhibitor essential for ESC culture suggests MEK/ERK signaling and canonical Wnt signaling combine to promote ESC differentiation.
Mouse embryonic stem cells; Wnt; β-catenin; 2i; pluripotency; differentiation
Cancer cells induce a set of adaptive response pathways to survive in the face of stressors due to inadequate vascularization [1]. One such adaptive pathway is the unfolded protein response (UPR), or endoplasmic reticulum (ER) stress response, mediated in part by the ER-localized transmembrane sensor IRE1 [2] and its substrate XBP1 [3]. Previous studies report UPR activation in various human tumors [4-6], but XBP1's role in cancer progression in mammary epithelial cells is largely unknown. Triple negative breast cancer (TNBC), a form of breast cancer in which tumor cells do not express the genes for estrogen receptor, progesterone receptor, and Her2/neu, is a highly aggressive malignancy with limited treatment options [7, 8]. Here, we report that XBP1 is activated in TNBC and plays a pivotal role in the tumorigenicity and progression of this human breast cancer subtype. In breast cancer cell line models, depletion of XBP1 inhibited tumor growth and tumor relapse and reduced the CD44high/CD24low population. Hypoxia-inducible factor 1α (HIF1α) is known to be hyperactivated in TNBCs [9, 10]. Genome-wide mapping of the XBP1 transcriptional regulatory network revealed that XBP1 drives TNBC tumorigenicity by assembling a transcriptional complex with HIF1α that regulates the expression of HIF1α targets via the recruitment of RNA polymerase II. Analysis of independent cohorts of patients with TNBC revealed a specific XBP1 gene expression signature that was highly correlated with HIF1α and hypoxia-driven signatures and that strongly associated with poor prognosis. Our findings reveal a key function for the XBP1 branch of the UPR in TNBC and imply that targeting this pathway may offer alternative treatment strategies for this aggressive subtype of breast cancer.
Human neurons are functional over an entire lifetime, yet the mechanisms that preserve function and protect against neurodegeneration during aging are unknown. Here we show that induction of the repressor element 1-silencing transcription/neuron-restrictive silencer factor (REST/NRSF) is a universal feature of normal aging in human cortical and hippocampal neurons. REST is lost, however, in mild cognitive impairment (MCI) and Alzheimer’s disease (AD). Chromatin immunoprecipitation with deep sequencing (ChIP-seq) and expression analysis show that REST represses genes that promote cell death and AD pathology, and induces the expression of stress response genes. Moreover, REST potently protects neurons from oxidative stress and amyloid β-protein (Aβ) toxicity, and conditional deletion of REST in the mouse brain leads to age-related neurodegeneration. A functional ortholog of REST, C. elegans SPR-4, also protects against oxidative stress and Aβ toxicity. During normal aging, REST is induced in part by cell non-autonomous Wnt signaling. However, in AD, frontotemporal dementia and dementia with Lewy bodies, REST is lost from the nucleus and appears in autophagosomes together with pathologic misfolded proteins. Finally, REST levels during aging are closely correlated with cognitive preservation and longevity. Thus, the activation state of REST may distinguish neuroprotection from neurodegeneration in the aging brain.
H3K4me2/3, H3K9ac, and H3K27ac investigated by ChIP-Seq showed enrichment in genic regions and around transcription start sites, and were associated with active transcription in rice. They were used to discover unannotated genes and, together with DNase-Seq data, to predict transcription factor binding sites.
While previous studies have shown that histone modifications could influence plant growth and development by regulating gene transcription, knowledge about the relationships between these modifications and gene expression is still limited. This study used chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-Seq) to investigate the genome-wide distribution of four histone modifications: di- and trimethylation of H3K4 (H3K4me2 and H3K4me3) and acetylation of H3K9 and H3K27 (H3K9ac and H3K27ac) in Oryza sativa L. japonica. By analyzing published DNase-Seq data, this study explored DNase-Hypersensitive (DH) sites along the rice genome. The histone marks appeared mainly in genic regions and were enriched around the transcription start sites (TSSs) of genes. This analysis demonstrated that the four histone modifications and the DH sites were all associated with active transcription. Furthermore, the four histone modifications were highly concurrent with transcript regions, a promising feature that was used to predict missing genes in the rice gene annotation. The predictions were further validated by experimentally confirming the transcription of two predicted missing genes. Moreover, a sequence motif analysis of the DH sites was performed, identifying many putative transcription factor binding sites.
bioinformatics; chromatin structure and remodeling; epigenetics; gene regulation; genomics; rice.
The combination of ChIP-seq and transcriptome analysis is a compelling approach to unravel the regulation of gene expression. Several recently published methods combine transcription factor (TF) binding and gene expression for target prediction, but few of them provide an efficient software package for the community. Binding and expression target analysis (BETA) is a software package that integrates ChIP-seq of TFs or chromatin regulators with differential gene expression data to infer direct target genes. BETA has three functions: (i) to predict whether the factor has activating or repressive function; (ii) to infer the factor’s target genes; and (iii) to identify the motif of the factor and its collaborators, which might modulate the factor’s activating or repressive function. Here we describe the implementation and features of BETA to demonstrate its application to several data sets. BETA requires ~1 GB of RAM, and the procedure takes 20 min to complete. BETA is available open source at http://cistrome.org/BETA/.
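The idea of combining binding and expression evidence can be sketched as follows: each gene receives a binding score that sums distance-weighted contributions from nearby peaks, and genes are then ranked by the product of their binding rank and expression rank. The exponential decay form, the 100 kb window, and the function names below are assumptions chosen for illustration; consult the BETA documentation for the scoring it actually implements.

```python
import math
from typing import Dict, Iterable, Mapping


def regulatory_potential(tss: int, peak_centers: Iterable[int],
                         max_distance: int = 100_000) -> float:
    """Sum of distance-decayed peak weights within `max_distance` of the TSS."""
    score = 0.0
    for center in peak_centers:
        distance = abs(center - tss)
        if distance <= max_distance:
            # Assumed decay: nearby peaks contribute far more than distal ones.
            score += math.exp(-(0.5 + 4.0 * distance / max_distance))
    return score


def rank_product(binding: Mapping[str, float],
                 expression: Mapping[str, float]) -> Dict[str, float]:
    """Product of normalized ranks; smaller values mean stronger evidence.

    Both inputs map gene -> score, where a higher score is stronger.
    """
    genes = sorted(set(binding) & set(expression))
    n = len(genes)

    def ranks(scores: Mapping[str, float]) -> Dict[str, int]:
        ordered = sorted(genes, key=lambda g: scores[g], reverse=True)
        return {gene: i + 1 for i, gene in enumerate(ordered)}

    binding_rank, expression_rank = ranks(binding), ranks(expression)
    return {g: (binding_rank[g] / n) * (expression_rank[g] / n) for g in genes}


# Hypothetical genes: A has a peak at its TSS, B has only a distal peak.
binding = {
    "A": regulatory_potential(1_000, [1_000]),
    "B": regulatory_potential(1_000, [90_000]),
}
expression = {"A": 2.0, "B": 0.5}  # e.g. absolute differential-expression statistic
combined = rank_product(binding, expression)
```

Working with ranks rather than raw scores keeps the two evidence types comparable even though they are measured on unrelated scales.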
Early full-term pregnancy is one of the most effective natural protections against breast cancer. To investigate this effect, we have characterized the global gene expression and epigenetic profiles of multiple cell types from normal breast tissue of nulliparous and parous women, and carriers of BRCA1 or BRCA2 mutations. We found significant differences in CD44+ progenitor cells, where the levels of many stem cell-related genes and pathways, including the cell cycle regulator p27, are lower in parous women without BRCA1/BRCA2 mutations. We also noted a significant reduction in the frequency of CD44+p27+ cells in parous women, and showed using explant cultures that parity-related signaling pathways play a role in regulating the number of p27+ cells and their proliferation. Our results suggest that pathways controlling p27+ mammary epithelial cells and the numbers of these cells relate to breast cancer risk, and can be explored for cancer risk assessment and prevention.
DNase-seq is a powerful technique for identifying cis-regulatory elements across the genome. We studied the key experimental parameters to optimize the performance of DNase-seq. We found that sequencing short 50-100 bp fragments that accumulate in long inter-nucleosome linker regions is more efficient for identifying transcription factor binding sites than using longer fragments. We also assessed the potential of DNase-seq to predict transcription factor occupancy through the generation of nucleotide-resolution transcription factor footprints. In modeling the sequence-specific DNaseI cutting bias, we found a surprisingly strong effect that varied over more than two orders of magnitude. This confounds DNaseI footprint analysis to the extent that the nucleotide-resolution cleavage patterns at most transcription factor binding sites are derived from intrinsic DNaseI cleavage bias rather than from specific protein-DNA interactions. In contrast, quantitative comparison of DNaseI hypersensitivity between states can predict transcription factor occupancy associated with particular biological perturbations.
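A sequence-specific cutting bias of the kind described above can be estimated by comparing how often each k-mer surrounds an observed cut site with how often it occurs in the sequence overall. The sketch below works on a single in-memory sequence with a list of cut positions; the hexamer default, the centering of the k-mer on the cut, and the function names are illustrative assumptions rather than the model used in the study.

```python
from collections import Counter
from typing import Dict, Iterable


def kmer_counts_at(sequence: str, positions: Iterable[int], k: int) -> Counter:
    """Count k-mers centered on each position (k // 2 bases lie upstream)."""
    half = k // 2
    counts: Counter = Counter()
    for pos in positions:
        start = pos - half
        if start >= 0 and start + k <= len(sequence):
            counts[sequence[start:start + k]] += 1
    return counts


def cleavage_bias(sequence: str, cut_positions: Iterable[int],
                  k: int = 6) -> Dict[str, float]:
    """Ratio of k-mer frequency at cut sites to its background frequency."""
    observed = kmer_counts_at(sequence, cut_positions, k)
    background = kmer_counts_at(sequence, range(len(sequence)), k)
    observed_total = sum(observed.values())
    background_total = sum(background.values())
    if observed_total == 0:
        raise ValueError("no cut site has a full k-mer window")
    return {
        kmer: (observed[kmer] / observed_total) / (count / background_total)
        for kmer, count in background.items()
    }


# Toy example with dinucleotides: every cut falls on an "AC" context.
bias = cleavage_bias("ACGTACGTACGT", [1, 5, 9], k=2)
```

A ratio above 1 marks a context that is cut more often than expected by chance; dividing observed footprint profiles by such ratios is one way to check whether an apparent footprint is explained by sequence bias alone.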
DNaseI hypersensitivity; DNase-seq; DNaseI footprint; Chromatin dynamics; CTCF; Androgen receptor; Estrogen receptor; Transcription factor binding; Nucleosome
Transcription factor activity and turnover are functionally linked, but the global patterns by which DNA-bound regulators are eliminated remain poorly understood. We established an assay to define the chromosomal location of DNA-associated proteins that are slated for degradation by the ubiquitin-proteasome system. The genome-wide map described here ties proteolysis in mammalian cells to active enhancers and to promoters of specific gene families. Nuclear-encoded mitochondrial genes in particular correlate with protein elimination, which positively affects their transcription. We show that the nuclear receptor corepressor NCoR1 is a key target of proteolysis and physically interacts with the transcription factor CREB. Proteasome inhibition stabilizes NCoR1 in a site-specific manner and restrains mitochondrial activity by repressing CREB-sensitive genes. In conclusion, this functional map of nuclear proteolysis links chromatin architecture with local protein stability and identifies proteolytic derepression as highly dynamic in regulating the transcription of genes involved in energy metabolism.