Long non-coding RNAs (lncRNAs) and microRNAs (miRNAs) represent two classes of important non-coding RNAs in eukaryotes. Although these non-coding RNAs have been implicated in organismal development and in various human diseases, surprisingly little is known about their transcriptional regulation. Recent advances in chromatin immunoprecipitation with next-generation DNA sequencing (ChIP-Seq) have provided methods of detecting transcription factor binding sites (TFBSs) with unprecedented sensitivity. In this study, we describe ChIPBase (http://deepbase.sysu.edu.cn/chipbase/), a novel database that we have developed to facilitate the comprehensive annotation and discovery of transcription factor binding maps and transcriptional regulatory relationships of lncRNAs and miRNAs from ChIP-Seq data. The current release of ChIPBase includes high-throughput sequencing data that were generated by 543 ChIP-Seq experiments in diverse tissues and cell lines from six organisms. By analysing millions of TFBSs, we identified tens of thousands of TF-lncRNA and TF-miRNA regulatory relationships. Furthermore, two web-based servers were developed to annotate and discover transcriptional regulatory relationships of lncRNAs and miRNAs from ChIP-Seq data. In addition, we developed two genome browsers, deepView and genomeView, to provide integrated views of multidimensional data. Moreover, our web implementation supports diverse query types and the exploration of TFs, lncRNAs, miRNAs, gene ontologies and pathways.
It has become increasingly clear that eukaryotic genomes encode thousands of long non-coding RNAs (lncRNAs) and microRNAs (miRNAs) (1–4). Emerging evidence is revealing that lncRNAs and miRNAs serve as the nodes of signaling networks that regulate cancer, apoptosis, proliferation, differentiation and stem cell biology (1,2,5–8). However, the majority of studies that address these types of RNAs focus on defining the regulatory functions of lncRNAs and miRNAs, whereas few investigations are directed toward assessing how the lncRNA and miRNA genes themselves are transcriptionally regulated.
The major limitation in identifying the transcriptional regulatory relationships of lncRNAs and miRNAs has been the high false-positive rates of predictive algorithms for transcription factor binding sites (TFBSs) (9). Recently, chromatin immunoprecipitation with massively parallel DNA sequencing (ChIP-Seq) has provided a powerful way to identify TFBSs. The application of the ChIP-Seq technique has significantly reduced the rate of false-positive predictions of TFBSs (10–12). However, although ChIP-Seq technology can reliably identify TFBSs, few studies have used ChIP-Seq data to explore the transcriptional regulation of lncRNAs and miRNAs. For this reason, a high-quality integrated database that could facilitate the annotation and analysis of the transcriptional regulation of lncRNAs and miRNAs from ChIP-Seq data will be of great utility in the study of both the regulation of lncRNAs and miRNAs by TFs and the roles of this regulation in human diseases.
In this study, we developed ChIPBase to facilitate the integrative and interactive display, as well as the comprehensive annotation and discovery, of TF-lncRNA and TF-miRNA interaction maps from ChIP-Seq data that were generated from diverse tissues and cell lines from six organisms: human, mouse, dog, chicken, Drosophila melanogaster and Caenorhabditis elegans (Figure 1). ChIPBase contains tens of thousands of TF-lncRNA and TF-miRNA regulatory relationships, as well as millions of TFBSs (Table 1). In addition, two novel web servers and two genome browsers were developed to comprehensively explore the relationships between TFs and ncRNAs from ChIP-Seq data.
A total of 543 ChIP-Seq peak data sets for 252 different transcription factors were compiled from multiple related studies and downloaded from the NCBI GEO database (13), the ENCODE (14) and modENCODE (15,16) databases, or the Supplementary Data of the original research articles (Supplementary Table S1). We also manually curated the metadata (such as TF names, refSeq accession numbers, gene symbols, and detailed descriptions and expression patterns of TFs) to ensure annotation consistency. These peak data sets were converted to the latest genome assembly using the liftOver tool from the UCSC genome browser website (17), and peaks whose genomic regions could not be mapped to the latest assembly were discarded. In addition, some data sets in BedGraph format were downloaded to construct peak tracks, which are displayed in our deepView browser to allow users to check TFBSs. In each species, TFBSs from different transcription factors and many different cell lines were combined and sorted according to their genomic positions. The overlapping TFBSs were then grouped into clusters and imported into the database; each cluster includes at least one TFBS. Known transcription factor binding matrices were downloaded from the JASPAR (18), Transfac (19), Cistrome (20) and UniPROBE (21) databases.
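The clustering step described above (combine, sort by genomic position, then group overlapping TFBSs) can be illustrated with a minimal sketch. This is not the code used to build ChIPBase; the `Site` record, the `cluster_sites` function and the half-open BED-style coordinates are assumptions made for illustration, and the article does not specify how sites that merely touch are handled (here they are not merged).

```python
from typing import List, NamedTuple


class Site(NamedTuple):
    """A binding site in half-open, 0-based coordinates (assumed convention)."""
    chrom: str
    start: int
    end: int
    tf: str


def cluster_sites(sites: List[Site]) -> List[List[Site]]:
    """Group sites that overlap on the same chromosome into clusters.

    Sites are sorted by position, then swept once; a site joins the current
    cluster if it starts before the cluster's running end coordinate.
    """
    clusters: List[List[Site]] = []
    current: List[Site] = []
    current_chrom = None
    current_end = -1
    for site in sorted(sites, key=lambda s: (s.chrom, s.start, s.end)):
        if current and site.chrom == current_chrom and site.start < current_end:
            current.append(site)
            current_end = max(current_end, site.end)
        else:
            if current:
                clusters.append(current)
            current = [site]
            current_chrom = site.chrom
            current_end = site.end
    if current:
        clusters.append(current)
    return clusters


# Toy input with made-up coordinates and TF labels.
example = [
    Site("chr1", 300, 400, "TF_C"),
    Site("chr1", 100, 200, "TF_A"),
    Site("chr1", 150, 250, "TF_B"),
    Site("chr2", 100, 200, "TF_A"),
]
for cluster in cluster_sites(example):
    print(cluster[0].chrom, [s.tf for s in cluster])
```

With this sweep, every cluster contains at least one site, which matches the property stated above.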
All of the known lncRNAs or large intergenic non-coding RNAs were downloaded from the Supplementary Data of the six original research articles that addressed these RNAs (22–27) or extracted from Ensembl (28), refSeq (17) and the UCSC bioinformatics website (17). Known functional lncRNAs were downloaded from the lncRNAdb database (29). All of the known miRNAs were downloaded from miRBase [release 17.0, (30)]. miRNA targets were downloaded from the starBase database (31). All of the refSeq genes were downloaded from the UCSC bioinformatics website (17). Other known non-coding RNAs were downloaded from the Ensembl database (28) or the UCSC website (17) or were obtained from the relevant literature. The human (UCSC hg19), mouse (UCSC mm9, NCBI Build 37), dog (UCSC canFam2), chicken (UCSC galGal3), D. melanogaster (UCSC dm3) and C. elegans (WS190) genome sequences were downloaded from the UCSC bioinformatics website (17).
Pre-miRNAs were grouped into transcriptional units. TFs do not necessarily bind only at the proximal promoters of lncRNAs or miRNAs. For protein-coding genes, more than half of the observed binding events are distal events (32). Moreover, the distances between transcription start sites (TSSs) and miRNA genes vary dramatically, ranging from a hundred bases to thousands of bases upstream (33,34). To incorporate both proximal and distal binding events (32), for intergenic miRNAs, the 30-kb region upstream and the 10-kb region downstream of the TSS of the first pre-miRNA in the same transcriptional unit were chosen as the regulatory domain (32) of the examined miRNAs. For intronic miRNAs, the 30-kb region upstream and the 10-kb region downstream of the TSS of the host gene that contains the miRNA were chosen as the regulatory domain of the examined miRNAs. The same strategy was used for lncRNAs; we chose the 30-kb region upstream and the 10-kb region downstream of the TSS of each lncRNA as the regulatory domain (32) of that lncRNA. The 5-kb upstream region and the 1-kb downstream region of each lncRNA and miRNA were chosen as the promoter region (32). In each species, the regulatory domains and promoter regions of each lncRNA and miRNA were intersected with the TFBSs of each data set to identify TFs that regulate the examined ncRNAs. The TFBSs overlapping with regulatory domains and the corresponding lncRNAs were then imported into a MySQL database.
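The window definitions above can be made concrete with a short sketch. It assumes that "upstream" and "downstream" are interpreted relative to the strand of the gene, that coordinates are half-open, and that all sites lie on the same chromosome as the TSS; none of these details is stated explicitly in the text. The function names and the toy sites are hypothetical.

```python
from typing import Iterable, List, Tuple


def regulatory_domain(tss: int, strand: str,
                      upstream: int = 30_000,
                      downstream: int = 10_000) -> Tuple[int, int]:
    """Return the (start, end) window around a TSS, oriented by strand."""
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(0, start), end  # clamp at the chromosome start


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def tfs_in_domain(tss: int, strand: str,
                  sites: Iterable[Tuple[int, int, str]],
                  upstream: int = 30_000,
                  downstream: int = 10_000) -> List[str]:
    """List the TFs with at least one site overlapping the window."""
    start, end = regulatory_domain(tss, strand, upstream, downstream)
    return sorted({tf for s, e, tf in sites if overlaps(start, end, s, e)})


# Toy sites (start, end, TF label) with made-up coordinates.
toy_sites = [
    (69_000, 69_500, "TF_X"),    # just outside the 30-kb upstream limit
    (75_000, 75_200, "TF_A"),    # distal upstream site
    (109_900, 110_100, "TF_B"),  # straddles the 10-kb downstream limit
]
print(regulatory_domain(100_000, "+"))         # 30 kb up, 10 kb down
print(regulatory_domain(100_000, "+", 5_000, 1_000))  # promoter window
print(tfs_in_domain(100_000, "+", toy_sites))
```

The same function covers the promoter region by passing 5 kb and 1 kb as the window sizes, so the regulatory-domain and promoter analyses differ only in their parameters.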
We integrated ~8.7 million TFBSs from 543 ChIP-Seq experiments in various tissues or cell lines to provide comprehensive genome-wide transcription factor binding profiles. To provide more useful information, we generated extensive annotations and analyses for transcription factors and TFBSs. For each TFBS, we identified the nearest/target gene and the distance between the site and the gene, as well as the expression pattern of the TF and its target gene in various tissues or cell lines (Supplementary Figure S1). For each ChIP-Seq experiment, we identified the distribution of TFBSs in the body of the gene and the distribution of the distances of the TFBSs that are associated with the TSSs of the nearest genes, and we provided descriptions of the ChIP-Seq experiments and the expression patterns of the TFs (Supplementary Figure S2). In addition, we offered the chipGO and chipKEGG tools to explore the features of the lists of TF-target interactions that are derived from the ChIP-Seq data (Figure 1 and Supplementary Figure S3).
To directly investigate potential high-occupancy target (HOT) regions on a genome-wide scale, we grouped ~8.7 million TFBSs into ~2 million clusters (Table 1). Each cluster contains between 1 and 74 transcription factors. We designated the genomic locations that were bound by many TFs as HOT regions. For instance, we identified 26664 HOT regions that were bound by ≥15 factors in the human genome. In addition, we generated distribution maps of the numbers of transcription factors in the clusters. The maps are presented in the form of cluster peaks, which are displayed in our deepView genome browser (Figure 1 and Supplementary Figure S4). This display method allows us easily to determine HOT regions of TFs. We also identified the nearest/target genes of these clusters and created a web interface to display this information (Figure 1 and Supplementary Figure S4).
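As an illustration of how HOT regions could be selected from the clusters, the following sketch counts the distinct TFs in each cluster and keeps those at or above a threshold. It is a simplified example rather than the ChIPBase implementation; the function name, the dictionary input and the choice to count each TF once per cluster are assumptions. Only the threshold of 15 factors comes from the text.

```python
from typing import Dict, Iterable, List, Tuple


def find_hot_regions(clusters: Dict[str, Iterable[str]],
                     min_factors: int = 15) -> List[Tuple[str, int]]:
    """Return (region, number of distinct TFs) for clusters that qualify.

    `clusters` maps a region label to the TF names bound in that cluster.
    A TF seen in several experiments is counted once (an assumption).
    """
    hot = []
    for region, tfs in clusters.items():
        n_factors = len(set(tfs))
        if n_factors >= min_factors:
            hot.append((region, n_factors))
    # Most heavily bound regions first; ties broken by region label.
    hot.sort(key=lambda item: (-item[1], item[0]))
    return hot


# Toy clusters with made-up labels; a low threshold is used for the demo.
toy_clusters = {
    "chr1:100-250": ["TF_A", "TF_B", "TF_C", "TF_A"],
    "chr1:300-400": ["TF_A"],
    "chr2:100-900": ["TF_A", "TF_B", "TF_C", "TF_D"],
}
print(find_hot_regions(toy_clusters, min_factors=3))
```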
To investigate TF–lncRNA and TF–miRNA regulatory relationships, the regulatory domains (see the Materials and Methods section for a detailed description) of lncRNAs and miRNAs were intersected with all TFBSs from diverse tissues and cell lines. In total, we identified 848834 TF-lncRNA regulatory relationships between 221 TFs and 38293 lncRNA transcripts, as well as 53233 TF-miRNA regulatory relationships between 249 TFs and 2294 miRNA clusters (Table 1). Because it integrates a large number of high-resolution ChIP-Seq data sets from diverse tissues and cell lines, this analysis provides an enhanced resolution of these regulatory relationships. Moreover, to enable the exploration of the interplay between miRNA transcriptional and posttranscriptional regulation, we integrated the targets of miRNAs from our starBase database (31) into the TF-miRNA networks. Cytoscape (web version) (35) was used to draw and display the TF-miRNA and miRNA-target networks.
The large quantity of TFBSs and high-throughput ChIP-Seq data has increased the demand for visual tools that allow for the rapid visual correlation of different types of information. To enable the user to browse seamlessly along the genome and to zoom effortlessly in a very large set of ChIP-Seq data, our improved deepView genome browser (31,36) was developed. This browser provides an integrated view of TFBSs, lncRNAs, miRNAs, protein-coding genes, TF cluster peaks and TF clusters (Figure 2). In the deepView genome browser, the ‘zoom out’ or ‘zoom in’ button can be used to extend or shrink the width of the displayed coordinate range. A click on a track item (e.g. a miRNA, lncRNA or TFBS) of interest launches a multiple-alignment trace viewer that displays all of the traces that are relevant to the item in question or links to external resources, such as NCBI, UCSC and miRBase, that can be used to obtain more comprehensive information.
To provide the whole-genome-scale visualization of large-scale TFBSs, miRNAs and lncRNAs, a new genome browser, genomeView, was developed in this study (Figure 3). The user of this browser can view data for a single ChIP-Seq experiment across the entire genome in the context of miRNAs or lncRNAs. TFBSs and miRNAs or lncRNAs are displayed for each location in the genome as a profile over the chromosome ideogram. This feature allows the user to quickly observe genome-scale patterns in the regulatory data and identify regions of interest for further visualization in our deepView genome browser.
We provide two web interfaces, LncRNA and MicroRNA, which display the TF-lncRNA and TF-miRNA interaction relationships, respectively (Supplementary Figures S5 and S6). Users can browse the relationships by entering a lncRNA name. When a user starts typing a lncRNA name in the search box, suggested lncRNA names are displayed in the list box. The user can then either choose a lncRNA from the list box or finish typing the full gene name. The user can also select a TF and search for lncRNAs that are regulated by the selected TF. If neither a lncRNA name nor a TF name is entered, the webpage outputs all of the TF-lncRNA interactions. Users can download these interactions to construct more complex networks composed of dozens to hundreds of lncRNAs. The search results are listed in the TF-lncRNA table. In the LncRNA interface, the number of TFBSs for each lncRNA is indicated in the table. The user can click on a number within the table to launch a detailed page that provides further information about the TF-lncRNA interaction in question. The user can also click on a column title of the table to sort the TF-lncRNA interactions according to various features, such as the number of TFBSs, the lncRNA names or the TF names. The detailed information for a TF-lncRNA interaction includes a description of the TF gene and its distance to the start site of the lncRNA (Supplementary Figure S5). The ‘references’ section enables the retrieval of the primary articles that yielded the annotation data; clicking an article title link opens the corresponding record on the NCBI PubMed website.
The MicroRNA interface is organized similarly to the LncRNA interface. The user can select a miRNA and a TF gene from a drop-down menu to explore TF-miRNA interactions. The numbers of upstream and downstream TFBSs, the genomic coordinates and the distance to the start site of the miRNA are all presented in a table (Supplementary Figure S6A and B). In addition, we have constructed a webpage, Networks, that displays TF-miRNA, miRNA-target and TF-target interactions (37) using Cytoscape (web version) (35) by integrating data from our starBase database. Users can select different target genes of the examined miRNA to construct regulatory networks (Supplementary Figure S6C). For example, we recapitulated the published c-Myc, E2F1 and miR-20 network (38) by selecting hsa-miR-20a and its target gene E2F1 on the Networks webpage (Supplementary Figure S6D).
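To show how the three interaction types can be combined into a network, the sketch below searches for three-node loops in which a TF regulates a miRNA, the miRNA targets a gene, and the same TF also binds that gene. This is only an illustration of the idea and not the Networks webpage logic; the function name is hypothetical, and the edges in the toy input were entered by hand for demonstration rather than exported from ChIPBase or starBase.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Edge = Tuple[str, str]


def feed_forward_loops(tf_mirna: Iterable[Edge],
                       mirna_target: Iterable[Edge],
                       tf_target: Iterable[Edge]) -> List[Tuple[str, str, str]]:
    """Return sorted (TF, miRNA, gene) triples supported by all three edge sets."""
    # Index miRNA targets so each TF-miRNA edge is expanded in one lookup.
    targets_of = defaultdict(set)
    for mirna, gene in mirna_target:
        targets_of[mirna].add(gene)
    tf_target_set = set(tf_target)

    loops = set()
    for tf, mirna in tf_mirna:
        for gene in targets_of.get(mirna, ()):
            if (tf, gene) in tf_target_set:
                loops.add((tf, mirna, gene))
    return sorted(loops)


# Hand-entered toy edges, for illustration only.
toy_tf_mirna = [("MYC", "hsa-miR-20a"), ("E2F1", "hsa-miR-20a")]
toy_mirna_target = [("hsa-miR-20a", "E2F1")]
toy_tf_target = [("MYC", "E2F1")]
print(feed_forward_loops(toy_tf_mirna, toy_mirna_target, toy_tf_target))
```

Indexing the miRNA-target edges first keeps the search proportional to the number of TF-miRNA edges times the number of targets per miRNA, which matters when whole-genome edge lists are downloaded.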
An interface for the transcriptional regulation of other ncRNAs (such as snoRNAs, tRNAs and snRNAs) is also provided and is organized similarly to the LncRNA and MicroRNA interfaces. Users can explore the regulatory interactions of these ncRNAs in a similar way.
We also provide the annotatedTool program, which offers a simple and user-friendly interface to annotate transcription factor binding regions (TFBRs). The user is required to select the intended organism and the annotated TSSs of known protein-coding genes, lncRNAs or miRNAs and then to upload TFBRs in the Browser Extensible Data (BED) format. After the user has completed the data submission, a typical run of the annotatedTool program may require several minutes to finish. The output of this program consists of three parts: the distribution of distances between the center of each TFBR and the TSS, information about the nearest gene and a link to the deepView genome browser, which allows the user to view various features of each target region (Supplementary Figure S7).
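The kind of annotation described above (region center, nearest TSS and distance) can be sketched as follows. This is not the annotatedTool source code; the function names, the handling of BED header lines and the use of a signed distance in genomic coordinates (not oriented by gene strand) are assumptions made for illustration.

```python
import bisect
from typing import Dict, Iterable, List, Tuple


def parse_bed(lines: Iterable[str]) -> List[Tuple[str, int, int]]:
    """Read (chrom, start, end) from the first three BED columns."""
    regions = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines and header lines
        chrom, start, end = line.split()[:3]
        regions.append((chrom, int(start), int(end)))
    return regions


def nearest_tss(center: int, tss_list: List[Tuple[int, str]]) -> Tuple[str, int]:
    """Find the closest TSS in a position-sorted, non-empty list.

    Returns (gene name, center minus TSS position).
    """
    positions = [pos for pos, _ in tss_list]
    i = bisect.bisect_left(positions, center)
    candidates = tss_list[max(0, i - 1):i + 1]  # neighbours on either side
    pos, gene = min(candidates, key=lambda t: abs(t[0] - center))
    return gene, center - pos


def annotate(regions: List[Tuple[str, int, int]],
             tss_by_chrom: Dict[str, List[Tuple[int, str]]]):
    """Attach the nearest gene and distance to each region."""
    annotated = []
    for chrom, start, end in regions:
        tss_list = tss_by_chrom.get(chrom)
        if not tss_list:
            continue  # no annotated TSS on this chromosome
        center = (start + end) // 2
        gene, distance = nearest_tss(center, tss_list)
        annotated.append((chrom, start, end, gene, distance))
    return annotated


# Toy BED input and TSS table with made-up coordinates and gene labels.
toy_bed = ["track name=example", "chr1\t4500\t4700\tpeak1", "chr1\t9400\t9600", ""]
toy_tss = {"chr1": [(1000, "gene1"), (5000, "gene2"), (9000, "gene3")]}
for row in annotate(parse_bed(toy_bed), toy_tss):
    print(row)
```

The distances returned for all regions can then be binned to obtain a distribution of center-to-TSS distances.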
In the following section, we will present several example applications of ChIPBase.
Let us assume that we are interested in liver-specific TFs as transcriptional regulators of miR-122, which is also expressed in the liver. We select three liver-enriched TFs (HNF4A, CEBPA and HNF3B/FoxA2) and the miR-122 gene in the microRNA webpage. The results page summarizes all of the query results: (i) there are five ChIP-Seq experiments for these three TFs (Supplementary Figure S8A and B); (ii) there are HNF4A ChIP-Seq data from three different experiments; and (iii) HNF4A and HNF3B/FoxA2 have multiple binding sites in the regulatory domain of miR-122 (Supplementary Figure S8A and B). We navigate to the corresponding deepView genome browser by locating the miR-122 regulatory domain, which opens up a genome browser view that effectively recapitulates the published TFBSs (39) (Figure 4).
To relate a lncRNA gene to the core transcriptional circuitry of embryonic stem (ES) cells, we select nine pluripotency-associated transcription factors (including Oct4, Sox2, Nanog, c-Myc, n-Myc, Klf4, Zfx, E2F1 and Smad1) in the mouse genome at the lncRNA webpage. The results page summarizes the pluripotency-associated transcription factors that bind in the regulatory domains of lncRNAs. A click on linc1428, a known ES-cells-associated lncRNA, launches a deepView genome browser view that also recapitulates the published TFBSs of E2F1, n-Myc and Klf4 (25) (Figure 5).
Ultra-high-throughput next-generation sequencing technology has recently been developed for mapping TFBSs (10,11). In this study, we performed a large-scale integration of public TFBSs that have been generated by high-throughput ChIP-Seq technology and provide the most comprehensive TF data set for various cell types that are available at the present time. We also provide comprehensive transcriptional regulatory maps of lncRNAs and miRNAs by connecting TFs to these non-coding genes.
The transcriptional regulation of the majority of miRNAs and almost all of the discovered lncRNAs is currently unknown. Recent studies have revealed that the deregulation of miRNAs and lncRNAs is correlated with various human cancers and diseases (6,7), and that this deregulation is often due to the aberrant expression of TFs (39). In the current study, we developed the ChIPBase database to decode the transcriptional regulation of lncRNA and miRNA genes from ChIP-Seq data. We can use ChIPBase to recapitulate the known transcriptional regulatory relationships of miRNAs and lncRNAs. For example, ChIPBase can be used to identify that the liver-specific miR-122 is regulated by three liver-enriched TFs (39), and that the linc2048 lncRNA is regulated by embryonic stem cell (ESC)-associated transcription factors (25). In addition, the integration of a large quantity of ChIP-Seq data from diverse tissues and cell lines allows us to provide enhanced resolution and novel findings.
In comparison with other resources for elucidating the transcriptional regulation of lncRNAs and miRNAs, or for storing and analyzing ChIP-Seq data, the distinctive features of our ChIPBase database are as follows. (i) ChIPBase is the first database that provides transcriptional regulation maps for lncRNA genes. (ii) The other databases that are related to the transcriptional regulation of miRNAs, including TransmiR (40) and CircuitsDB (41), only collect computationally predicted or experimentally supported TF-miRNA interactions. By contrast, ChIPBase provides comprehensive TF-miRNA regulatory relationships that have been identified from high-throughput ChIP-Seq data. The entries in the TransmiR database contain only the names of TFs and their corresponding target miRNAs; detailed information about the TFBSs and TFs is not included. In addition, the TransmiR database may not contain all of the target miRNAs of the corresponding TFs. To compare the TransmiR database with ChIPBase, we used two TFs, E2F1 and MYC, whose target miRNAs are the most comprehensive in TransmiR. When only the relationships between TFs and target miRNAs were considered, ChIPBase identified 77% (20/26) of the E2F1–miRNA and 75% (21/28) of the MYC–miRNA relationships. These data indicate that ChIPBase can recover the majority of the TF-miRNA interactions documented in TransmiR. Moreover, our database also identifies tens of novel E2F1–miRNA and MYC–miRNA relationships that are not included in the TransmiR data. In addition, TransmiR does not contain the HNF4A-miR-122 and CEBPA-miR-122 interactions described in our Example Applications section. (iii) hmChIP (42) is a database of genome-wide ChIP data in human (hg18 version) and mouse (mm8 version). It provides only the protein–DNA binding intensities from individual samples for user-provided genomic regions. Currently, hmChIP does not explore the transcriptional regulation of lncRNAs or miRNAs, or even of protein-coding genes.
By contrast, our ChIPBase provides comprehensive transcriptional regulatory relationships of lncRNAs, miRNAs and other ncRNAs, as well as comprehensive annotation of TFBSs from 543 ChIP-Seq data sets from six organisms. (iv) To enable the user to browse seamlessly along the genome and to zoom effortlessly in a very large set of ChIP-Seq data, our improved deepView genome browser was developed to provide an integrated view of TFBSs that have been identified from ChIP-Seq data, predicted TFBSs, ncRNAs, protein-coding genes and TFBS clusters (Figures 2 and 3). (v) We developed two web tools, annotatedTool and genomeViewer, to annotate and discover the transcriptional regulatory relationships of lncRNAs and miRNAs from ChIP-Seq data (Figure 1). (vi) We constructed genome-wide transcription factor binding profiles from ChIP-Seq data. Combinatorial transcription factor interactions that control the transcriptional regulation of lncRNAs and miRNAs can be easily identified by searching for appropriate profiles in the genome browser (Figure 2). (vii) ChIPBase also provides the gene ontology annotation, biological pathways and expression patterns of transcription factor binding targets (Figure 1). This supplementary information may provide valuable insights into the function of each TF, lncRNA and miRNA. Finally, the data and the integrative, interactive and versatile displays that are provided by the ChIPBase database will aid future experimental and computational studies in their attempts to elucidate the regulation of lncRNAs and miRNAs by TFs and assess the roles of these regulatory relationships in human diseases.
As a means of comprehensively integrating ChIP-Seq data, ChIPBase is expected to provide considerable resources to assist researchers who are investigating the TF-lncRNA and TF-miRNA regulatory networks and examining the biological functions of the genes and ncRNAs whose expression levels are controlled by transcription factors. As ChIP-Seq technology is applied to a broader set of species, cell lines, tissues and conditions, ChIPBase will continue to be developed and refined toward the achievement of the following goals: (i) the better integration and cross-comparison of diverse ChIP-Seq data sets and data resources; (ii) the correlation of these diverse ChIP-Seq data with lncRNAs and miRNAs; and (iii) the continued expansion of the storage space and improvement of the performance of our computer servers for storing and analysing these new data, together with the improvement of the database to accept uploads of new data by users. In addition, we intend to integrate the epigenomic data that are generated by ChIP-Seq technology into ChIPBase to improve our understanding of eukaryotic regulatory networks.
ChIPBase is freely available at http://deepbase.sysu.edu.cn/chipbase/. The ChIPBase data files can be freely downloaded and used in accordance with the GNU Public License.
Supplementary Data are available at NAR Online: Supplementary Table 1, Supplementary Figures 1–8 and Supplementary References [15,16,43–98].
Ministry of Science and Technology of China, National Basic Research Program [No. 2011CB811300]; National Natural Science Foundation of China [No. 31230042, 30900820, 81070589]; funds from Guangdong Province [No. S2012010010510]; The project of Science and Technology New Star in ZhuJiang Guangzhou city [No. 2012J2200025]; Fundamental Research Funds for the Central Universities [No. 2011330003161070]; China Postdoctoral Science Foundation [No. 200902348]; Guangdong Province Key Laboratory of Computational Science and the Guangdong Province Computational Science Innovative Research Team (in part). Funding for open access charge: Ministry of Science and Technology of China, National Basic Research Program [No. 2011CB811300].
Conflict of interest statement. None declared.