MPromDb (Mammalian Promoter Database) is a curated database that strives to annotate gene promoters identified from ChIP-seq results, with the goal of providing an integrated resource for mammalian transcriptional regulation and epigenetics. We analyzed 507 million uniquely aligned RNAP-II ChIP-seq reads from 26 different data sets that include six human cell types and 10 distinct mouse cell/tissue types. The updated MPromDb version consists of computationally predicted (novel) and known active RNAP-II promoters (42893 human and 48366 mouse promoters) from various data sets freely available at the NCBI GEO database. We found that 36% and 40% of protein-coding genes have alternative promoters in the human and mouse genomes, respectively, and that ~40% of promoters are tissue/cell specific. The identified RNAP-II promoters were annotated using various known and novel gene models. Additionally, for novel promoters we examined other lines of evidence: GenBank mRNAs, spliced ESTs, CAGE promoter tags and mRNA-seq reads. Users can search the database by gene ID/symbol or by specific tissue/cell type, and filter results by any combination of tissue/cell specificity, Known/Novel, CpG/NonCpG and protein-coding/non-coding gene promoters. We have also integrated the GBrowse genome browser with MPromDb for visualization of ChIP-seq profiles and display of the annotations. The current release of MPromDb can be accessed at http://bioinformatics.wistar.upenn.edu/MPromDb/.
The mammalian transcriptome and proteome are far more diverse than expected from the one gene→one mRNA→one protein paradigm (1). This diversity arises from the generation of multiple transcripts from a gene through alternative transcriptional and splicing events. Alternative transcriptional events, which involve the use of multiple promoters and/or transcriptional termination, produce multiple pre-mRNAs from the same gene; these can then undergo alternative splicing to generate a plethora of transcript variants corresponding to a single gene (2). A gene can therefore yield transcript variants that differ in their regulatory UTRs and/or protein-coding regions, thereby expanding the complexity of mammalian genomes (3–5). In particular, alternative promoter activity is critical in transcriptional regulation, because the precise use of alternative promoters allows balanced expression of the corresponding pre-mRNA variants in different cell and/or developmental contexts. In fact, recent evidence suggests that at least half of mammalian genes use alternative promoters to generate multiple transcript variants (3,5). Identifying all possible gene promoters, their usage and their epigenetic modification states in specific cell populations, tissues, developmental stages and disease conditions is therefore critical to understanding the diversity of physiological processes associated with normal and diseased states.
Several high-throughput technologies, such as cap analysis gene expression (CAGE), chromatin immunoprecipitation (ChIP) followed by microarray analysis (ChIP–chip) (6,7) and, more recently, ChIP coupled with sequencing (ChIP-seq) (8) and sequencing of cDNAs (RNA-seq) (5), are enabling the genome-wide identification of alternative promoters and their patterns of use. However, these high-throughput approaches need to be applied with caution because of the problems inherent in each method (9). In our recent work, we showed that combining ChIP-seq with computational techniques provides a better approach for annotating active promoters (9,10). Although the EPD database (11) provides curated promoter sequences for eukaryotic organisms, it does not provide promoter activity information at the tissue/cell level. In this update of MPromDb, we have removed the ChIP–chip results and added active RNAP-II promoters identified by analyzing ChIP-seq experiments performed with an RNAP-II antibody in six different human cell types and 10 different mouse cell/tissue types. In addition, we have added enrichment profiles of various transcription factors obtained from ChIP-seq data sets. These promoters, along with their annotations, are provided as a user-friendly database in which each known and ChIP-seq promoter is linked to a new interface for visualization of its enrichment profile. Here, we describe the updates to MPromDb, which enable users to study promoter activity at the tissue/cell level for the human and mouse genomes.
In this update, we have added (i) a comprehensive knowledgebase of known and novel promoters, (ii) promoters identified from RNAP-II ChIP-seq experiments, (iii) advanced search and filter options and (iv) visualization of ChIP-seq profiles and promoters using GBrowse (12). The comprehensive promoter knowledgebase was generated from various known gene models (RefSeq, Vega, Ensembl, MGI and UCSC Known genes), predicted gene models (AceView, Tromer, MGC, SGP, SIB, Genscan, Geneid, N-SCAN and Augustus Abinitio), an orthologous gene model (XenoRef), GenBank mRNAs, spliced ESTs, CAGE promoters and mRNA-seq tags (Figure 1). The gene models, mRNAs and spliced ESTs were downloaded from the UCSC Genome Browser database (13), CAGE promoter locations were downloaded from the FANTOM4 project (14) and raw mRNA-seq reads were downloaded from the NCBI GEO database. We have also added promoter regions for a recently discovered class of non-coding genes (lincRNA) transcribed by RNAP-II (15,16). The total number of records in the knowledgebase can be found in Table S1.
The RNAP-II ChIP-seq data sets include data generated in our lab (9) and data sets from various published and unpublished studies freely available at the NCBI GEO database. The human RNAP-II ChIP-seq data sets cover six different cell lines (CD4+T, HeLa S3, K562, NB4, Lymphoblastoid and Jurkat), whereas the mouse samples cover five different tissues and five different cell types: brain, liver, lung, spleen, kidney, Embryonic Stem Cell (V6.5), Mouse Embryonic Fibroblasts B4, Mouse Embryonic Fibroblasts B6, Bone Marrow-derived macrophages and 3T3-L1 (9,17–23). The NCBI GEO accession numbers of the data sets are provided in Table S2. We applied our pipeline (Figure 1) to the downloaded ChIP-seq data sets; it comprises alignment, identification of significantly enriched regions, promoter prediction and annotation. The Bowtie program (24) was used to map reads to the reference genome (mm9 for mouse and hg18 for human), allowing up to two mismatches. Only uniquely mapped reads were considered for further analysis. We obtained 174777943 and 333192049 uniquely mapped reads for the mouse and human genomes, respectively (Table S3). Significant peaks were identified using our three-step procedure as described in (9), at a P-value threshold of 0.01. After identifying significant RNAP-II-bound peaks, we applied our recently published program for prediction of RNAP-II-bound promoters (10). The peak identification and promoter prediction results for each sample are summarized in Table S3. Following promoter prediction, we performed promoter annotation using our reference promoter knowledgebase, as summarized in Figures S1 and S2. Finally, we identified 48366 mouse and 42893 human promoters bound by RNAP-II, of which 39% and 42%, respectively, were annotated as ‘Novel promoters’ (Table 1).
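As a rough illustration of the uniqueness filter in the alignment step, the sketch below keeps only reads that map to exactly one location, and then checks that the per-species read counts quoted above add up to the roughly 507 million reads cited for the whole analysis. The tuple layout and the function name `keep_unique` are assumptions made for illustration; the text does not describe how the filter was implemented.

```python
from collections import Counter

def keep_unique(alignments):
    """Keep alignments whose read ID occurs exactly once.

    `alignments` is a list of (read_id, chrom, pos, strand) tuples;
    this layout is hypothetical, not the pipeline's actual format.
    """
    hits = Counter(read_id for read_id, *_ in alignments)
    return [aln for aln in alignments if hits[aln[0]] == 1]

# Toy input: read r2 aligns to two places and is therefore discarded.
toy = [
    ("r1", "chr1", 100, "+"),
    ("r2", "chr2", 500, "-"),
    ("r2", "chr7", 900, "-"),
    ("r3", "chrX", 42, "+"),
]
unique = keep_unique(toy)

# Uniquely mapped read counts reported in the text (Table S3).
MOUSE_READS = 174_777_943
HUMAN_READS = 333_192_049
total_reads = MOUSE_READS + HUMAN_READS
```

On the toy input only `r1` and `r3` survive, and the two reported counts sum to 507,969,992 reads.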
Predicted ChIP-seq promoters that lie within −1 to +0.5 kb of a known TSS, or within the first exon of a known transcript, are defined as ‘Known promoters’; all others are considered ‘Novel promoters’. It is worth noting that 65% and 90% of novel promoters in mouse and human, respectively, are supported by additional sources (novel gene models, mRNAs, spliced ESTs, CAGE tags and the orthologous gene model) (Table S4).
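The Known/Novel rule can be sketched in a few lines. This is a minimal illustration, not the published annotation code: it assumes each predicted promoter is reduced to a single coordinate, that the −1 to +0.5 kb window is measured relative to the transcript's strand, and that window and exon bounds are inclusive. The `Transcript` class and all function names are hypothetical.

```python
from dataclasses import dataclass

UPSTREAM_BP = 1000    # 1 kb upstream of the TSS
DOWNSTREAM_BP = 500   # 0.5 kb downstream of the TSS

@dataclass(frozen=True)
class Transcript:
    """Hypothetical minimal record for a known transcript."""
    chrom: str
    strand: str            # "+" or "-"
    tss: int               # transcription start site
    first_exon_start: int  # genomic start (smaller coordinate)
    first_exon_end: int    # genomic end (larger coordinate)

def in_tss_window(chrom, pos, tx):
    """True if pos lies within -1 kb to +0.5 kb of the TSS, strand-aware."""
    if chrom != tx.chrom:
        return False
    # Positive offset means downstream in the direction of transcription.
    offset = pos - tx.tss if tx.strand == "+" else tx.tss - pos
    return -UPSTREAM_BP <= offset <= DOWNSTREAM_BP

def in_first_exon(chrom, pos, tx):
    """True if pos falls inside the transcript's first exon."""
    return chrom == tx.chrom and tx.first_exon_start <= pos <= tx.first_exon_end

def classify_promoter(chrom, pos, known_transcripts):
    """Return 'Known' if any known transcript supports the position, else 'Novel'."""
    for tx in known_transcripts:
        if in_tss_window(chrom, pos, tx) or in_first_exon(chrom, pos, tx):
            return "Known"
    return "Novel"

known = [
    Transcript("chr1", "+", 10_000, 10_000, 10_800),
    Transcript("chr1", "-", 20_000, 19_000, 20_000),
]
```

A linear scan over all known transcripts is enough for a sketch; at genome scale an interval index per chromosome would be the more usual choice.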
Furthermore, our analysis identified promoters for 15493 and 14266 protein-coding genes in mouse and human, respectively. A gene is defined as protein coding if it has at least one protein-coding transcript in the RefSeq/Vega gene models; otherwise it is considered a non-coding gene. Note that a protein-coding gene can also generate transcript variants that are non-coding RNAs. We observed that 40% and 36% of protein-coding genes in mouse and human, respectively, are expressed from alternative promoters (Table 2). Surprisingly, 37% of mouse promoters and 43% of human promoters were identified in a single cell/tissue, suggesting that they are cell/tissue-specific promoters. We also analyzed the CpG-richness and bidirectionality of the promoters and found that 51% and 64% of promoters are CpG-rich, and that there are 1801 and 1501 bidirectional promoters, in mouse and human, respectively. In addition, we provide significant enrichment profiles of various factors (mouse: OCT4, CEBPa, CHD7, c-Myc, CTCF, ESRRB, FOXA1, FOXA2, GFP, KLF4, n-Myc, NR5A2, P300, Rbbp5, SETDB1, SIRT1, SOX2, STAT3, STAT4, STAT6, SUZ12, TBP, TBX3, TCFCP2I1, WDR5, ZFX; human: OCT4, CBP, CTCF, ETS1, KLF4, NANOG, P300, PCAF, PHF8, PPARG, RUNX, SOX2, STAT1, TFII, Tip60, ZNF263, SUZ12, MOF, IGF1R, NFkB) calculated from different published and unpublished ChIP-seq data sets (Table S5A and B).
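The two summary statistics above (promoters seen in a single cell/tissue, and genes with more than one promoter) can be computed from a flat list of detections. The sketch below assumes hypothetical `(gene_id, promoter_id, sample)` tuples and, unlike the reported figures, does not restrict the gene-level statistic to protein-coding genes.

```python
from collections import defaultdict

def summarize_promoters(records):
    """Summarize promoter usage from (gene_id, promoter_id, sample) tuples.

    Returns the percentage of promoters detected in exactly one sample and
    the percentage of genes with more than one promoter.
    """
    samples_by_promoter = defaultdict(set)
    promoters_by_gene = defaultdict(set)
    for gene_id, promoter_id, sample in records:
        samples_by_promoter[promoter_id].add(sample)
        promoters_by_gene[gene_id].add(promoter_id)

    if not samples_by_promoter:  # no detections at all
        return {"pct_sample_specific": 0.0, "pct_genes_with_alternative": 0.0}

    specific = sum(1 for s in samples_by_promoter.values() if len(s) == 1)
    alternative = sum(1 for p in promoters_by_gene.values() if len(p) > 1)
    return {
        "pct_sample_specific": 100.0 * specific / len(samples_by_promoter),
        "pct_genes_with_alternative": 100.0 * alternative / len(promoters_by_gene),
    }

# Toy data: geneA has two promoters, one of them active in two tissues.
toy_records = [
    ("geneA", "A_p1", "liver"),
    ("geneA", "A_p1", "brain"),
    ("geneA", "A_p2", "liver"),
    ("geneB", "B_p1", "spleen"),
]
summary = summarize_promoters(toy_records)
```

In the toy data, two of the three promoters are seen in a single sample and one of the two genes has an alternative promoter.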
As a web-based application, MPromDb has several layers: the core application (written in Django), a backend database (MySQL), a visualization component (GBrowse) and a web server (Apache) (see Supplementary File 1). The promoter information for a particular gene can be retrieved from the database using its Entrez gene ID or gene symbol. We also provide additional search and filter options, such as selection of tissue/cell type, tissue/cell-specific promoters, known/novel promoters and coding/non-coding gene promoters. A gene search query returns results at two levels (see Figure 2, Supplementary File 2, Supplementary Tables S6 and S7). The first level provides information (promoter position, CpG type and bidirectional type) on all promoters of the queried gene that are present in the promoter knowledgebase. The second level lists all promoters identified from ChIP-seq data sets for the queried gene. The search results can be downloaded as an Excel file, and each promoter in the results is linked to the visualization module. The complete list of annotated promoters can also be downloaded from the download link.
Visualization of promoter positions and ChIP-seq enrichment profiles is implemented using GBrowse (12), an open-source genome browser platform. GBrowse is a simple but highly configurable web-based genome browser that provides a fast and customizable interface for visualizing data stored in a backend database as well as data uploaded by the user. GBrowse is lighter than the UCSC genome browser and offers several advantages, especially in displaying results and tracks. Some of the features unique to GBrowse are glyphs and balloons to represent different features, deeper organization of features into subcategories, multi-language support, viewing of GenBank, chado and biosql feature databases, and third-party loading.
GBrowse displays the identified promoter locations and the enrichment profiles of the analyzed ChIP-seq data sets (Figure 2D). Users can also search directly in GBrowse by typing genome coordinates or a gene symbol, and can turn individual tracks on or off.
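The combinable search filters offered by the web interface can be pictured as optional criteria applied together, where any criterion left unset is ignored. The sketch below uses plain Python rather than the Django/MySQL stack that MPromDb is built on, and every class, field and function name in it is a hypothetical stand-in; the actual database schema is not described in the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromoterRecord:
    """Hypothetical flattened search record (not the actual MPromDb schema)."""
    gene_symbol: str
    sample: str            # tissue or cell type
    sample_specific: bool  # detected in a single tissue/cell type
    known: bool            # True for Known, False for Novel
    cpg_rich: bool         # True for CpG, False for NonCpG
    protein_coding: bool   # promoter of a protein-coding gene

def search(records, gene_symbol=None, sample=None, sample_specific=None,
           known=None, cpg_rich=None, protein_coding=None):
    """Return the records matching every criterion that is not None."""
    criteria = {
        "gene_symbol": gene_symbol,
        "sample": sample,
        "sample_specific": sample_specific,
        "known": known,
        "cpg_rich": cpg_rich,
        "protein_coding": protein_coding,
    }
    active = {name: value for name, value in criteria.items() if value is not None}
    return [
        rec for rec in records
        if all(getattr(rec, name) == value for name, value in active.items())
    ]

# Toy records with made-up gene symbols.
toy_db = [
    PromoterRecord("GENE1", "liver", True, True, True, True),
    PromoterRecord("GENE1", "brain", False, False, False, True),
    PromoterRecord("GENE2", "liver", True, False, True, False),
]
liver_novel = search(toy_db, sample="liver", known=False)
gene1_all = search(toy_db, gene_symbol="GENE1")
```

Treating `None` as "no constraint" is what lets users mix any subset of filters; note that `False` is a real constraint here (for example, `known=False` selects Novel promoters).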
In the future, we plan to include histone modification profiles identified from ChIP-seq data sets that are currently available at NCBI GEO and to integrate them into our promoter knowledgebase. We will also continue to collect RNAP-II and transcription factor ChIP-seq data sets from a wider variety of tissues and cell types in order to update MPromDb routinely. We further plan to include data sets from other mammals and to add features and search options to the frontend of the database. In conclusion, MPromDb provides integrated transcriptional regulatory information for mammalian genomes in an easily accessible way. We believe that the updates will facilitate large-scale ChIP-seq data analysis and contribute toward the elucidation of mammalian transcriptional regulatory networks.
Supplementary Data are available at NAR Online.
NHGRI/NIH grant (# R01HG003362); American Cancer Society Research Scholar Grant (# RSG-07-097-01 to R.D.); and Philadelphia Healthcare Trust. R.D. holds a Philadelphia Healthcare Trust Endowed Chair Position. Funding for open access charge: National Institutes of Health grant (#R01HG003362 to R.D.).
Conflict of interest statement. None declared.
We thank Sharmistha Pal for reading the manuscript and providing valuable input for developing MPromDb. The use of computational resources in the Centre for Systems and Computational Biology and Bioinformatics Facility of Wistar Cancer Centre (grant # P30CA010815) is gratefully acknowledged.