Statistics of the promoters identified using ChIP-seq data sets
In this update, we have added (i) a comprehensive knowledgebase of known and novel promoters, (ii) promoters identified from RNAP-II ChIP-seq experiments, (iii) advanced search and filter options and (iv) visualization of ChIP-seq profiles and promoters using GBrowse (12). The comprehensive promoter knowledgebase was generated from various known gene models (RefSeq, Vega, Ensembl, MGI and UCSC Known genes), predicted gene models (AceView, Tromer, MGC, SGP, SIB, Genscan, Geneid, N-SCAN and Augustus Abinitio), an orthologous gene model (XenoRef), GenBank mRNAs, spliced ESTs, CAGE promoters and mRNA-seq tags. The gene models, mRNAs and spliced ESTs were downloaded from the UCSC Genome Browser database (13), CAGE promoter locations were downloaded from the FANTOM4 project (14) and mRNA-seq raw reads were downloaded from the NCBI GEO database. We have also added promoter regions of a recently discovered class of non-coding genes (lincRNAs) transcribed by RNAP-II (15, 16). The total number of records in the knowledgebase can be found in Table S1.
The RNAP-II ChIP-seq data sets include the data generated in our lab (9) and data sets from various published and unpublished studies freely available at the NCBI GEO database. The human RNAP-II ChIP-seq data sets include six different cell lines: CD4+ T, HeLa S3, K562, NB4, Lymphoblastoid and Jurkat, whereas the mouse samples include five different tissues and five different cell types: brain, liver, lung, spleen, kidney, Embryonic Stem Cell (V6.5), Mouse Embryonic Fibroblasts B4, Mouse Embryonic Fibroblasts B6, Bone Marrow-derived macrophages and 3T3-L1 (9, 17–23). The NCBI GEO accession numbers of the data sets are provided in Table S2.
We applied our pipeline to the downloaded ChIP-seq data sets; it includes alignment, identification of significantly enriched regions, promoter prediction and annotation. The Bowtie program (24) was used to map reads to the reference genome (mm9 for mouse and hg18 for human), allowing up to two mismatches. Only uniquely mapped reads were considered for further analysis. We obtained 174 777 943 and 333 192 049 uniquely mapped reads for the mouse and human genomes, respectively (Table S3). Significant peaks were identified using our three-step procedure as described in (9) at P-value = 0.01. After identifying significant RNAP-II bound peaks, we applied our recently published program for prediction of RNAP-II bound promoters (10). The peak identification and promoter prediction results for each sample are summarized in Table S3. Following promoter prediction, we performed promoter annotation using our reference promoter knowledgebase as summarized in Figures S1 and S2. Finally, we identified 48 366 mouse and 42 893 human promoters bound by RNAP-II, of which 39% and 42% in mouse and human, respectively, were annotated as ‘Novel promoters’ (Table 1). Predicted ChIP-seq promoters that lie within −1 to 0.5 kb of a known TSS or within the first exon of a known transcript are defined as ‘Known promoters’; otherwise they are considered ‘Novel promoters’. It is worth noting that 65% and 90% of novel promoters in mouse and human, respectively, are supported by additional sources (novel gene models, mRNAs, spliced ESTs, CAGE tags and orthologous gene models) (Table S4).
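
The Known/Novel annotation rule stated above can be illustrated with a minimal sketch. This is not the MPromDb implementation: it assumes that each predicted promoter is reduced to a single genomic coordinate, that the −1 to 0.5 kb window is oriented by transcript strand (1 kb upstream, 0.5 kb downstream of the TSS) and that coordinates are inclusive. The names `Transcript`, `tss_window` and `classify_promoter` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

UPSTREAM_BP = 1000    # -1 kb relative to the TSS (from the text)
DOWNSTREAM_BP = 500   # +0.5 kb relative to the TSS (from the text)


@dataclass(frozen=True)
class Transcript:
    """Hypothetical record for one known transcript."""
    chrom: str
    strand: str            # "+" or "-"
    tss: int               # transcription start site
    first_exon_start: int  # inclusive genomic start of the first exon
    first_exon_end: int    # inclusive genomic end of the first exon


def tss_window(t: Transcript) -> Tuple[int, int]:
    """Genomic interval covering -1 kb to +0.5 kb around the TSS.

    Assumption: upstream/downstream follow the transcript strand.
    """
    if t.strand == "+":
        return t.tss - UPSTREAM_BP, t.tss + DOWNSTREAM_BP
    return t.tss - DOWNSTREAM_BP, t.tss + UPSTREAM_BP


def classify_promoter(chrom: str, position: int,
                      transcripts: Iterable[Transcript]) -> str:
    """Return 'Known' if the position hits any TSS window or first exon."""
    for t in transcripts:
        if t.chrom != chrom:
            continue
        lo, hi = tss_window(t)
        in_window = lo <= position <= hi
        in_first_exon = t.first_exon_start <= position <= t.first_exon_end
        if in_window or in_first_exon:
            return "Known"
    return "Novel"
```

A production pipeline would index transcripts by chromosome (for example with an interval tree) rather than scanning the whole list for every promoter; the linear scan is kept here only for readability.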
Table 1. Summary of RNAP-II bound promoters identified in various tissues/cell types for human and mouse using ChIP-seq data sets
Furthermore, our analysis has identified promoters for 15 493 and 14 266 protein-coding genes in mouse and human, respectively. A gene is defined as protein coding if it has at least one protein-coding transcript in the RefSeq/Vega gene models; otherwise it is a non-coding gene. Please note that a protein-coding gene can generate transcript variants that are non-coding RNAs. We also observed that 40% and 36% of protein-coding genes in mouse and human, respectively, are expressed from alternative promoters (Table 2). Surprisingly, 37% of mouse promoters and 43% of human promoters were identified in a single cell/tissue type, suggesting that they are cell/tissue-specific promoters. Additionally, we analyzed the CpG-richness and bidirectionality of the promoters and found that 51% and 64% of promoters are CpG-rich and that there are 1801 and 1501 bidirectional promoters in mouse and human, respectively. We also provide significant enrichment profiles of various factors (Mouse – OCT4, CEBPa, CHD7, c-Myc, CTCF, ESRRB, FOXA1, FOXA2, GFP, KLF4, n-Myc, NR5A2, P300, Rbbp5, SETDB1, SIRT1, SOX2, STAT3, STAT4, STAT6, SUZ12, TBP, TBX3, TCFCP2L1, WDR5, ZFX; Human – OCT4, CBP, CTCF, ETS1, KLF4, NANOG, P300, PCAF, PHF8, PPARG, RUNX, SOX2, STAT1, TFII, Tip60, ZNF263, SUZ12, MOF, IGF1R, NFkB) calculated from different published and unpublished ChIP-seq data sets (Table S5A and B).
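
As a rough illustration of how the alternative-promoter and single-tissue percentages above could be derived, the sketch below summarizes a list of promoter observations. It assumes that a gene uses alternative promoters when two or more distinct promoters are assigned to it, and that a promoter is cell/tissue-specific when it is observed in exactly one sample type. The function name `summarize_promoter_usage` and the record layout are hypothetical, not taken from the MPromDb code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def summarize_promoter_usage(
    records: Iterable[Tuple[str, str, str]]
) -> Dict[str, float]:
    """Summarize (promoter_id, gene_id, tissue) observations.

    Returns the fraction of promoters seen in exactly one tissue and the
    fraction of genes with two or more distinct promoters.
    """
    tissues_by_promoter = defaultdict(set)
    promoters_by_gene = defaultdict(set)
    for promoter_id, gene_id, tissue in records:
        tissues_by_promoter[promoter_id].add(tissue)
        promoters_by_gene[gene_id].add(promoter_id)

    n_promoters = len(tissues_by_promoter)
    n_genes = len(promoters_by_gene)
    single_tissue = sum(1 for t in tissues_by_promoter.values() if len(t) == 1)
    multi_promoter = sum(1 for p in promoters_by_gene.values() if len(p) > 1)

    return {
        "single_tissue_fraction":
            single_tissue / n_promoters if n_promoters else 0.0,
        "alternative_promoter_gene_fraction":
            multi_promoter / n_genes if n_genes else 0.0,
    }
```

Restricting the input to promoters of protein-coding genes before calling the function would mirror the gene set used for the alternative-promoter figure in the text.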
Table 2. Alternative promoter usage for active protein-coding genes in mouse and human
Database search and visualization
MPromDb, as a web-based application, has several layers: the core application (built with Django), a backend database (MySQL), a visualization component (GBrowse) and a web server (Apache) (see Supplementary File 1). The promoter information corresponding to a particular gene can be retrieved from the database using its Entrez gene ID or gene symbol. We also provide additional search and filter options such as selection of tissue/cell type, tissue/cell-specific promoters, known/novel promoters and coding/non-coding gene promoters. The gene search query returns results at two different levels (see Supplementary File 2, Supplementary Tables S6 and S7). The first level provides information (promoter position, CpG type and bidirectional type) on all promoters of the queried gene that are present in the promoter knowledgebase. The second level lists all promoters identified from ChIP-seq data sets for the queried gene. The search results can be downloaded as an Excel file. Each promoter in the search results is linked to the visualization module. Further, the complete list of annotated promoters can be downloaded from the download link. Visualization of the promoter position and ChIP-seq data enrichment profile is implemented using GBrowse (12), an open-source genome browser platform. GBrowse is a simple but highly configurable web-based genome browser, which provides a fast and customizable interface for visualizing data stored in a backend database as well as data uploaded by the user. GBrowse is lighter than the UCSC Genome Browser and offers many advantages, especially in displaying results and tracks. Some of the features unique to GBrowse are: glyphs and balloons to represent different features, deeper organization of feature subcategories, multi-language support, support for GenBank, Chado and BioSQL feature databases, and third-party loading. GBrowse shows the identified promoter locations and the enrichment profiles of the analyzed ChIP-seq data sets (D). Further, users can directly type genome coordinates or a gene symbol into GBrowse to search. Users can also turn the tracks displayed on the genome browser on or off.
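
The two-level gene search described above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module as a stand-in for the MySQL backend, and the table names, column names and demo rows (`knowledgebase_promoter`, `chipseq_promoter`, `GeneA`, `GeneB`) are invented for the example and do not reflect the actual MPromDb schema or data.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with invented demo rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE knowledgebase_promoter (
            gene_id INTEGER, symbol TEXT, chrom TEXT, position INTEGER,
            cpg_type TEXT, bidirectional INTEGER);
        CREATE TABLE chipseq_promoter (
            gene_id INTEGER, symbol TEXT, chrom TEXT, position INTEGER,
            tissue TEXT, annotation TEXT);
    """)
    conn.executemany(
        "INSERT INTO knowledgebase_promoter VALUES (?, ?, ?, ?, ?, ?)",
        [(1001, "GeneA", "chr1", 10000, "CpG-rich", 0),
         (1001, "GeneA", "chr1", 25000, "CpG-poor", 1),
         (1002, "GeneB", "chr2", 5000, "CpG-rich", 0)])
    conn.executemany(
        "INSERT INTO chipseq_promoter VALUES (?, ?, ?, ?, ?, ?)",
        [(1001, "GeneA", "chr1", 10050, "brain", "Known"),
         (1001, "GeneA", "chr1", 10050, "liver", "Known"),
         (1001, "GeneA", "chr1", 40000, "brain", "Novel"),
         (1002, "GeneB", "chr2", 5100, "lung", "Known")])
    return conn


def search_gene(conn, query, tissue=None, annotation=None):
    """Return (level1, level2) result lists for a gene ID or symbol.

    level1: all knowledgebase promoters of the gene.
    level2: ChIP-seq promoters of the gene, optionally filtered.
    """
    # A purely numeric query is treated as a gene ID, otherwise as a symbol.
    # The column name comes from this fixed pair, never from user input.
    is_id = str(query).isdigit()
    column = "gene_id" if is_id else "symbol"
    value = int(query) if is_id else str(query)

    level1 = conn.execute(
        "SELECT chrom, position, cpg_type, bidirectional "
        f"FROM knowledgebase_promoter WHERE {column} = ? ORDER BY position",
        (value,)).fetchall()

    sql = ("SELECT chrom, position, tissue, annotation "
           f"FROM chipseq_promoter WHERE {column} = ?")
    params = [value]
    if tissue is not None:
        sql += " AND tissue = ?"
        params.append(tissue)
    if annotation is not None:
        sql += " AND annotation = ?"
        params.append(annotation)
    level2 = conn.execute(sql + " ORDER BY position, tissue",
                          params).fetchall()
    return level1, level2
```

For example, `search_gene(build_demo_db(), "GeneA", tissue="brain")` returns every knowledgebase promoter of the demo gene at the first level and only its brain ChIP-seq promoters at the second. In a Django application the same filters would more likely be expressed through the ORM than through raw SQL.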