Alternative promoters that are differentially used in various cellular contexts and tissue types add to the transcriptional complexity of mammalian genomes. Identifying alternative promoters and annotating their activity in different tissues is one of the major challenges in understanding the transcriptional regulation of mammalian genes and their isoforms. To determine the use of alternative promoters in different tissues, we performed ChIP-seq experiments using antibodies against RNA Pol-II in five adult mouse tissues (brain, liver, lung, spleen and kidney). Our analysis identified 38639 Pol-II promoters, including 12270 novel promoters, for both protein-coding and non-coding mouse genes. Of these, 6384 promoters are tissue specific, and these are predominantly CpG poor. We also find that only 34% of the novel promoters are located in CpG-rich regions, suggesting that novel promoters are mostly tissue specific. By identifying the Pol-II bound promoter(s) of each annotated gene in a given tissue, we found that 37% of the protein-coding genes use alternative promoters in the five mouse tissues. The promoter annotations and ChIP-seq data presented here will aid ongoing efforts to characterize gene regulatory regions in mammalian genomes.
Recent analyses of mammalian genomes and microarray data suggest that the majority of mammalian genes generate multiple transcripts and protein isoforms with distinct functional roles. This transcript diversity is generated, in part, through the use of alternative promoters (1) and alternative splicing (2), which produce pre-mRNA and mRNA isoforms, respectively. The use of alternative promoters plays a fundamental role in regulating different gene isoforms, e.g. LEF1, TP73, RUNX1 and MYC, in various mammalian tissues and at different developmental stages. For example, in the case of LEF1, the protein isoforms generated from the two promoters perform opposing biological functions. While the full-length LEF1, transcribed from the upstream promoter, interacts with β-catenin and regulates Wnt target genes, the shorter isoform is incapable of binding β-catenin and suppresses the regulation of Wnt targets through β-catenin (3). Moreover, activation of the upstream promoter and silencing of the internal promoter is observed in most colon cancers (4). Therefore, identifying primary and alternative gene promoters in various normal tissues is critical to understanding a diversity of physiological processes associated with normal and diseased states in different tissues. The advent of high-throughput molecular technologies, and of computational methods to support them, has significantly improved our ability to annotate mammalian gene regulatory regions. High-throughput technologies, such as cap analysis gene expression (CAGE); chromatin immunoprecipitation (ChIP) followed by microarray analysis (ChIP–chip); ChIP coupled with paired-end ditag sequencing analysis (ChIP-PET) (5,6); and, more recently, ChIP coupled with sequencing (ChIP-seq) (7) are enabling the genome-wide identification of alternative promoters and their patterns of use. 
This information will help us to understand the use of alternative promoters in a wide variety of cell/tissue types, different developmental stages and their misuse in disease conditions.
Growing evidence suggests that about half of the mammalian genes have multiple alternative promoters that can span up to thousands of bases (8–12). For example, a comprehensive analysis of 1% of the human genome in 16 diverse human cell lines, using transient transfection reporter assays, demonstrated the presence of functional alternative promoters in >20% of genes (12). Similarly, it has been reported that 35% of 100 human erythroid genes examined have alternative promoters and that 24% of active genes in human fibroblast cells possess multiple promoters (13). These are high percentages of genes showing multiple promoter usage in a single biological process or cell type, suggesting extensive use of multiple promoters by mammalian genes. The knowledge of alternative promoter usage in different mammalian tissues is very limited and cannot be addressed without high-resolution genome-wide mapping of the promoter regions. However, the high-throughput approaches used to annotate promoters at the genome level, such as CAGE (14), deepCAGE (15), ChIP–chip (16,17) or ChIP-seq (7), need to be applied with caution because of the inherent problems with each method. For example, cytoplasmic enzyme complexes can add caps to 5′-monophosphate RNA molecules generated by ribonuclease cleavage (18), and hence CAGE tags could represent 5′ ends of RNAs generated by cleavage and subsequent re-capping (19). CAGE analysis can also capture some non-capped transcripts that may represent cleaved decaying mRNA (20). Furthermore, a large number of CAGE tags are distributed throughout the gene transcripts, rendering CAGE inefficient as a sole source of promoter identification. Previously, we (16) and others (17) have performed ChIP–chip analyses to identify the activity of mammalian promoters across different cell and tissue types. However, ChIP–chip requires the design of genome-wide microarrays to probe the ChIP-bound DNA sequences. 
Additionally, with either ChIP-chip or ChIP-seq technology promoters cannot be identified solely on the presence of Pol-II enrichment on a genomic location because of its enrichment throughout the transcribed genomic region and lack of highly specific antibodies that can distinguish promoter bound Pol-II from elongating Pol-II. In order to overcome these limitations of previous studies, we pursued a combined Pol-II ChIP-seq and bioinformatics promoter prediction approach to identify promoter regions and their activity in five different mouse tissues. We provide a genome-wide catalog of active promoters in five tissues of adult mouse along with tissue-specific promoters that will help future studies of transcriptional regulation in mammalian genomes.
About 1g of freshly dissected brain, kidney, liver, lung or spleen tissue from 2-month-old FBV mice was minced finely and cross-linked with 1% formaldehyde for 10min at room temperature. To stop cross-linking, glycine was added to a final concentration of 0.125M. Next, the tissue sample was treated to isolate individual cells and cross-linked chromatin was fragmented to a size range of 0.2–0.8kb as previously described (21). ChIP was performed using 10µg of Pol-II antibody bound Dynal magnetic beads. The antibodies against Pol-II were purchased from Abcam Inc (ab5408) and Santa Cruz Biotechnology (sc-899X). Following immunoprecipitation, the bound nucleoprotein complexes were extensively washed six times with wash buffer 1 and once with wash buffer 2 [Wash buffer 1: 50mM HEPES-KOH (pH 7.55), 500mM LiCl, 1mM EDTA, 1.0% NP-40 and 0.7% Na-deoxycholate; wash buffer 2: TE containing 50mM NaCl] and the ChIP enriched DNA was eluted and purified by phenol:chloroform:isoamyl alcohol extraction. This purified DNA (10ng) was further processed according to the Illumina Inc. instructions to prepare the library for sequencing ChIP enriched DNA. For ChIP-qPCR, Pol-II ChIP was conducted the same way as described above and same amount of either input or Pol-II enriched DNA was PCR amplified in the presence of specific primers using the SYBR green-based detection (Applied Biosystems, Foster City, CA, USA) as described previously (22). The primers used were as follows: Promoter forward: 5′ gacggttggagaagaaggtg 3′, Promoter reverse: 5′ aggagaggaggaggttttgg 3′ and Control region forward: 5′ gtaacctctgccgttcagga 3′, Control region reverse: 5′ tttctccctttccggagatt 3′.
We adapted an approach similar to that used in previously published studies (7,23) for ChIP-seq data analysis. Briefly, the analysis involves the following steps: (i) Identification of statistically significant sequence read-enriched genomic regions (of length 1kb). A region is considered statistically significant if the number of reads within it is higher than the number expected from random background. We used the Poisson distribution to estimate the background read count at a given significance level P (P≤0.01). (ii) Creation of the read overlap profile for each region identified in step (i), by extending each sequence read from its 5′ end toward its 3′ end up to 400bp (the average length of the ChIP–DNA fragments sequenced on the Solexa GA with the standard Illumina ChIP-seq protocol). (iii) Peak identification, by counting the number of overlapping reads at each nucleotide position and defining the genomic position with the highest count as the peak position within the 1kb significant region.
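The three steps above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function names, the treatment of minus-strand reads (whose 5′ end is taken as their rightmost coordinate) and the tie-breaking rule (leftmost position wins) are assumptions made for illustration, and the background rate `lam` is assumed to be supplied by the caller, for example as the expected number of reads per 1kb window.

```python
import math
from collections import Counter

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), by direct summation of the PMF."""
    cdf = 0.0
    term = math.exp(-lam)          # P(X = 0)
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)      # P(X = i + 1)
    return max(0.0, 1.0 - cdf)

def min_significant_count(lam, alpha=0.01):
    """Smallest read count k with P(X >= k) <= alpha under the background."""
    k = 0
    while poisson_sf(k, lam) > alpha:
        k += 1
    return k

def peak_position(reads, window_start, window_len=1000, fragment_len=400):
    """Extend reads to fragment_len and return (position, coverage) of the
    best-covered base in the window, or None if no read covers it.

    reads: list of (five_prime_position, strand) with strand '+' or '-'.
    """
    coverage = Counter()
    window_end = window_start + window_len
    for pos, strand in reads:
        if strand == "+":
            start, end = pos, pos + fragment_len          # extend rightwards
        else:
            start, end = pos - fragment_len + 1, pos + 1  # extend leftwards
        for x in range(max(start, window_start), min(end, window_end)):
            coverage[x] += 1
    if not coverage:
        return None
    # Highest coverage wins; ties are broken by the leftmost position.
    best = max(coverage, key=lambda x: (coverage[x], -x))
    return best, coverage[best]
```

With a hypothetical background of two reads per window, `min_significant_count(2.0)` gives the minimum read count a window needs to pass the P≤0.01 threshold.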
To identify the Pol-II bound promoters from the ChIP-seq data, we used our recently developed program to discriminate Pol-II enrichment peaks associated with promoter regions from peaks associated with non-promoter regions (24). The method uses DNA sequence composition, physico-chemical and structural properties of DNA sequences, CAGE tags, and Pol-II and H3K4me3 enrichment profiles from ChIP-seq data sets as features for discrimination. In total, 39 features were calculated for each peak, as described in ref. 24. A classification model using these features was constructed with three different state-of-the-art ensemble and meta classifiers: Random forest (25), Bagging (26) and LogitBoost (27–30). The performance of the model was evaluated with the promoter prediction metrics suggested in ref. 31: sensitivity (SN), positive predictive value (PPV), correlation coefficient (CC) and true-positive cost (TPC). The performance measures were calculated for 10-fold cross-validation and for an independent test set.
To annotate the predicted promoters, we referred to gene information tracks from various sources available at the UCSC genome browser. The tracks include protein-coding and non-coding genes from Refseq gene, UCSC gene, Ensembl gene, Vega gene and miRNA. We also downloaded information on the recently discovered large intervening non-coding RNAs (lincRNA) (32) for annotating promoters related to non-coding genes. Gene information for other non-coding RNAs, including snoRNA and snRNA, is part of the Refseq gene, UCSC gene, Ensembl gene and Vega gene models. A non-redundant set of coding and non-coding transcripts was generated after combining the transcript information from the gene models stated above. A total of 42924 coding and 38159 non-coding transcripts were obtained for annotation of Pol-II peaks (Supplementary Table S1A in additional data file 2). The known protein-coding and non-coding genes of organisms other than mouse were also considered for annotation. This was done to identify promoters that are evolutionarily conserved and known in other organisms but still unknown in mouse. The non-mouse gene track was also downloaded from the UCSC genome browser and is referred to as XenoRef Gene. A non-redundant set of promoters for XenoRef Gene track genes (Supplementary Table S1B in additional data file 2) was also generated for annotation.
We divided the results of promoter annotation into three categories: (i) known promoters, (ii) novel promoters and (iii) novel promoters-unassigned. All promoters that overlap with the first exon of a known mouse transcript are categorized as ‘known promoters’. The rest of the promoters are categorized as ‘novel promoters’. Further, the novel promoters are assigned to known genes if they fall inside a transcript of a known mouse gene (or orthologous non-mouse gene) or within −10kb of the 5′ end of a known mouse gene (or orthologous non-mouse gene). The remaining novel promoters are left as ‘novel promoters-unassigned’. Promoter peaks that overlap with the 5′ ends of both protein-coding and non-coding genes are assigned to both.
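The categorization rules above can be sketched in Python as follows. This is an illustrative reading, not the annotation code used here: the transcript record layout (`gene`, `start`, `end`, `strand`, `first_exon`), the closed-interval overlap test and the interpretation of "within −10kb" as the 10kb window immediately upstream of the strand-aware 5′ end are all assumptions.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end

def classify_promoter(prom, transcripts, upstream=10_000):
    """Return (category, sorted gene names) for a promoter interval.

    prom: (start, end) in genomic coordinates.
    transcripts: list of dicts with keys gene, start, end, strand and
                 first_exon (start, end); start < end regardless of strand.
    """
    p_start, p_end = prom

    # Rule 1: overlap with the first exon of any known transcript.
    known = {t["gene"] for t in transcripts
             if overlaps(p_start, p_end, *t["first_exon"])}
    if known:
        return "known", sorted(known)

    # Rule 2: inside a transcript, or within `upstream` bp of its 5' end.
    assigned = set()
    for t in transcripts:
        if t["strand"] == "+":
            lo, hi = t["start"] - upstream, t["end"]
        else:  # the 5' end of a minus-strand transcript is its `end`
            lo, hi = t["start"], t["end"] + upstream
        if overlaps(p_start, p_end, lo, hi):
            assigned.add(t["gene"])
    if assigned:
        return "novel", sorted(assigned)

    # Rule 3: everything else.
    return "novel-unassigned", []
```

Because the gene names are collected in a set, a promoter that overlaps the first exons of both a protein-coding and a non-coding gene is returned with both assignments.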
PCR on mouse genomic DNA was performed with specific primers to amplify randomly selected novel promoter (0.5–1.0kb) and non-promoter (~0.9kb) regions. The genomic coordinates of each cloned promoter/non-promoter region are provided in Supplementary Table S2. As a positive control for the luciferase assay, the promoter of the DLL1 gene was also amplified. Amplified PCR products were cloned in the pCRII vector (Invitrogen) and the clones were confirmed by sequencing. The confirmed clones were subcloned in the promoterless luciferase vector pGL3basic (Promega Inc.). DNA for the pGL3 basic constructs (1.8µg for the calcium chloride method, 0.9µg for Lipofectamine 2000 or Fugene) along with pGL4-renilla-luciferase (0.2µg for the calcium chloride method, 0.1µg for Lipofectamine 2000 or Fugene) was individually transfected in HEK293 (calcium chloride-based transfection), A549, HepG2 (Lipofectamine 2000, Invitrogen Inc.), NIH3T3 and DAOY (Fugene, Roche Inc.) cell lines in triplicate in six-well plates for about 48h. After 48h, cells were washed and lysed in 200µl of passive lysis buffer provided in the dual luciferase assay kit (Promega Inc.). The lysates were cleared by centrifugation and the luciferase assay was performed with 5–20µl of the lysate as per the manufacturer’s instructions (Promega Inc.). Renilla luciferase activity was used to normalize for transfection efficiency, and the fold enrichment of luciferase activity was calculated relative to the vector backbone (pGL3 basic alone).
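The normalization described above can be sketched as a short Python calculation. The function name and the choice to average the per-well firefly/Renilla ratios across the triplicate wells before dividing by the vector-only value are assumptions for illustration; the source does not specify how replicates were aggregated.

```python
from statistics import mean

def fold_activity(construct_wells, vector_wells):
    """Fold luciferase activity of a construct relative to empty vector.

    Each argument is a list of (firefly, renilla) readings, one per well.
    Firefly is first normalized to Renilla within each well to correct for
    transfection efficiency, then averaged across replicate wells.
    """
    def normalized(wells):
        return mean(firefly / renilla for firefly, renilla in wells)

    return normalized(construct_wells) / normalized(vector_wells)
```

For example, three construct wells with a firefly/Renilla ratio of 30 and three vector-only wells with a ratio of 1 give a 30-fold activity.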
We searched for core-promoter elements in each identified promoter by scanning a sequence of length 200bp (–100 to +100 around the Pol-II peak position). The sequences were analyzed with the MATCH program (33) for the five known core-promoter elements (INR, TATA, MTE, BRE and DPE) using the position weight matrices published earlier (34). We used the default parameters for the MATCH search with the following cutoffs for each element (INR-0.85 and 0.8; TATA-0.73 and 0.58; MTE-0.79 and 0.53; BRE-0.70 and 0.65; DPE-0.92 and 0.92). The search was done first for the INR element because it is the most abundant core-promoter element and, if found, that position plus 3 was considered the true TSS for the corresponding promoter. If INR was not found, the rest of the elements (TATA, MTE, BRE and DPE) were searched in that order of importance, and the TSS was then assigned relative to the first element found, by adjusting for the relative distance between the TSS and the corresponding element (34). TATA was given the next priority because, although MTE is the second most abundant core-promoter element, MTE shows high co-occurrence with INR, whereas TATA co-occurs least with the other elements. If more than one instance of an element was identified in a sequence, priority was given to the one with the highest score. Once this assignment was done, we looked for the presence of the remaining core-promoter elements in that promoter. If none of the elements were present, the original peak position was considered the true TSS.
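The priority rule for TSS assignment can be sketched in Python as below. Only the search order and the INR offset of +3 come from the text; the function name, the input layout and the `tss_offset` table for the other elements are hypothetical, since the element-to-TSS distances are taken from ref. 34 and are not listed here. The caller is assumed to supply them.

```python
# Search order stated in the text: INR first, then TATA, MTE, BRE, DPE.
PRIORITY = ("INR", "TATA", "MTE", "BRE", "DPE")

def assign_tss(peak_pos, hits, tss_offset):
    """Return (tss_position, element_used) for one promoter.

    peak_pos:   Pol-II peak position, used when no element is found.
    hits:       dict mapping element name -> list of (position, score)
                matches passing the score cutoffs.
    tss_offset: dict mapping element name -> value added to the match
                position to obtain the TSS (caller-supplied assumption,
                except INR = 3 as stated in the text).
    """
    for element in PRIORITY:
        candidates = hits.get(element, [])
        if candidates:
            # Several matches of the same element: keep the best-scoring one.
            position, _score = max(candidates, key=lambda h: h[1])
            return position + tss_offset[element], element
    return peak_pos, None
```

A call such as `assign_tss(500, {"TATA": [(450, 0.8)]}, offsets)` falls through to TATA only because no INR match is present, which mirrors the ordering described above.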
To identify the active promoter regions in the adult mouse genome, we used the ChIP-seq approach to find genome-wide binding regions of Pol-II in five mouse tissues (brain, kidney, liver, lung, and spleen). Through mapping of the Pol-II binding regions, we seek to investigate the usage of alternative promoters and uncover novel promoters in the mouse genome. Previous studies have indicated that performing two biological replicates for ChIP-seq studies is enough to achieve the sequencing depth required for robust identification of target binding sites (35,36), and hence we performed two replicates of Pol-II ChIP-seq experiment on each tissue and analyzed their overlap. Following the ENCODE consortium standards (36), we analyzed the agreement between our two biological replicates and since our results indicate a good correlation we combined the two datasets for further analysis (Supplementary Figure S1 in additional data file 1). Sequencing of Pol-II enriched DNA from two biological replicates in the five tissues yielded a total of over 102 million sequence reads of 36bp length. Using the ELAND (Illumina, Inc., San Diego, CA, USA) and Bowtie (37) programs and allowing a maximum of two mis-matches, ~62% (63.2 million) of the reads were uniquely mapped back to the mouse reference genome (version mm9). The aligned reads were processed to identify significantly enriched Pol-II binding regions and significant peaks as described in ‘Materials and Methods’ section. Following a three-phase peak identification approach, we identified a total of 335468 significantly Pol-II-enriched peaks across the five tissues with brain showing the maximum number of peaks followed by kidney, lung, liver and spleen (Table 1).
As Pol-II binding is expected to be highly enriched in promoter regions compared to intragenic locations, we looked at the enrichment profile of reads relative to known transcription start sites (TSS). The distribution of read counts per million mapped reads for each tissue indicates an increased enrichment of Pol-II near TSS as compared to intra and inter-genic regions (Figure 1A). To further verify the quality of the ChIP-seq experimental data, we performed ChIP-qPCR experiment on the ubiquitous Polr2a locus. We analyzed the enrichment of Pol-II at the promoter region as well as at a downstream region of Polr2a gene as indicated in Figure 1B. Consistent with our ChIP-seq data, we observed significant Pol-II enrichment only at the promoter region (Figure 1B and C).
The major challenge in identifying promoters based on Pol-II enriched regions/peaks is the presence of the transcribing polymerase throughout the gene and, as a result, all genomic regions bound by Pol-II are enriched in the ChIP-seq experiments, producing significantly large number of enriched peaks after the initial statistical analysis. This is clear from the cumulative distribution of Pol-II peaks around known TSS, which indicates that a significant percentage of Pol-II bound loci are also present outside known TSS/promoter regions (Figure 2A). We have recently developed a computational method that discriminates the Pol-II bound promoter regions from Pol-II associations at non-promoter regions (24). Moreover, in our study, we do not use an IgG control as we have determined that our promoter identification analysis is not influenced by the use of IgG background subtraction (Supplementary Table S3 in additional data file 2). We have performed promoter prediction on Pol-II-enriched DNA from liver tissue with and without the subtraction of IgG bound DNA background of liver tissue. As shown in Supplementary Table S3, there is no major change upon inclusion of IgG control but rather we achieve slightly better results without background subtraction in terms of the number of promoters identified and the overlap of identified promoters with the known first exons from various databases. Using this program, we predicted ~24% (80597) of the significant Pol-II peaks from the five tissues as promoter associated. As shown in Figure 2B, ~20–26% of the Pol-II peaks in brain, kidney, liver and lung are in promoter regions and in spleen almost 34% peaks correspond to promoters. Further analysis suggested that nearly 85% of the non-promoter Pol-II bound peaks were localized in intragenic regions in liver and kidney compared to about 71% of non-promoter-enriched peaks in spleen intragenic locations. 
Next, we compared the coverage of Pol-II enrichment in the entire transcript with the sequence enrichment at the corresponding promoter (Supplementary Figure S2 in additional data file 1). For ~5–7.5% of the promoters in these five adult tissues, we found 4-fold or more enrichment of the read around the promoter region than the rest of the corresponding transcript, and we speculate that these represent the paused promoters in the adult tissues (38).
We analyzed the CpG richness of promoter and non-promoter peaks based on the previously defined criteria (39). As expected, we found that 69% of the predicted promoter peaks were CpG rich, while only 1.1% of non-promoter peaks were localized in CpG-rich regions (Figure 2C). Further, the proportion of CpG rich promoters did not vary significantly across the tissues, with a maximum of 73.8% and a minimum of 67.7% CpG-rich promoters found in spleen and kidney respectively. Next, we analyzed the enrichment of Pol-II on the CpG rich and CpG poor (non CpG) promoters and found higher binding of Pol-II on CpG rich than non-CpG promoters in all tissues except spleen where the reverse is observed (Supplementary Figure S3A in additional data file 1). In order to provide genome-wide annotation of active promoters in the mouse genome, we combined the promoter regions identified in all the five tissues into a master table (additional data file 3). We merged any two consecutive promoters into a single promoter region if the distance between corresponding Pol-II peaks in those promoter regions was <300bp.
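The merging rule for the master table can be sketched in Python as follows. This is an illustration under assumptions: promoters are represented only by their Pol-II peak positions, and merging is applied as a chain along each chromosome, so a run of peaks each <300bp from its neighbour collapses into one region. Whether the original analysis chained merges in this way is not stated.

```python
from collections import defaultdict

def merge_promoters(peaks, max_gap=300):
    """Group Pol-II peak positions into merged promoter regions.

    peaks: iterable of (chrom, position) pooled from all tissues.
    Returns a dict chrom -> list of groups, each a sorted list of peak
    positions whose consecutive distances are all < max_gap.
    """
    by_chrom = defaultdict(list)
    for chrom, pos in peaks:
        by_chrom[chrom].append(pos)

    merged = {}
    for chrom, positions in by_chrom.items():
        groups = []
        for pos in sorted(positions):
            if groups and pos - groups[-1][-1] < max_gap:
                groups[-1].append(pos)   # close to previous peak: same region
            else:
                groups.append([pos])     # start a new promoter region
        merged[chrom] = groups
    return merged
```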
Next, we annotated the identified promoters using a non-redundant set of 42924 known protein-coding transcripts and 38159 known non-coding transcripts as described in ‘Materials and Methods’ section (Supplementary Table S1A in additional data file 2). A schematic representation of the step-wise pipeline that was followed for promoter annotation is shown in Supplementary Figure S4 in additional data file 1. We identified 21926, 20301, 15720, 21599 and 11401 promoters that are active in brain, kidney, liver, lung and spleen, respectively (Table 2). About 8173 (21%) of the promoters were left unassigned to any gene based on our annotation strategy. In order to account for the false-negative predictions of the program (known promoters that were not predicted by the program despite the presence of significantly enriched Pol-II peak), all the Pol-II significantly enriched peaks that were predicted as non-promoters but overlap with the first exons of known transcripts were added into the final promoter annotations presented in Table 2. Using this strategy we had further annotated 5356 (2.1%), and 8374 (3.2%) of Pol-II-enriched peaks to known protein-coding and non-coding genes, respectively (Supplementary Table S4 in additional data file 2). Eventually, we have identified a total of 38639 promoters and annotated 21739 promoters to only protein-coding genes (15503), another 7406 promoters to only non-coding genes (5354) and 1321 promoters were assigned to both protein coding and non-coding genes. The list of all annotated promoters along with the annotation is provided in additional data file 4. Many of these promoters were tissue specific, while others were shared between two or more tissues as shown in the Venn diagram (Supplementary Figure S5 in additional data file 1). 
In particular, our analysis has identified 8727 promoters for Pol-II transcribed non-coding genes with 856, 184, 31 and nine promoters assigned to lincRNA, miRNA, snoRNA and snRNA genes, respectively (Table 3). As promoters are localized around TSS, we analyzed the positioning of the identified promoters relative to the known TSS and found that for both protein coding and non-coding genes, promoters were mostly upstream of the known TSS (Figure 2D). Next, we examined the presence of bidirectional promoters among the identified promoters. We consider a promoter as bidirectional if it is shared between two genes which are in opposite orientation and the promoter region either overlaps with the first exon of the transcripts or is within −1kb of known TSS as previously described (40). We identified 1093, 1029, 989, 1125 and 852 bidirectional promoters in brain, kidney, liver, lung and spleen, respectively. Interestingly, we found that more than 93% of the bidirectional promoters were CpG rich (Supplementary Table S5 in additional data file 2).
One of the major goals of this study was the identification of novel promoters in the mouse genome. We have identified a total of 12270 novel promoters, which represents 32% of all the identified promoters. This suggests that a large number of promoters that are active in at least one of the five tissues were unknown in any of the current genome-wide annotations. The tissue-wise distribution of the novel promoters is presented in Table 4. We observed higher enrichment of Pol-II on the novel promoters than known promoters in spleen, while in brain and lung the reverse was true. In contrast, in kidney and liver the binding of Pol-II near TSS was similar for known and novel promoters (Supplementary Figure S3B in additional data file 1). Additionally, we found that about 34% of the novel promoters were CpG rich and these novel promoter regions show similar level of conservation as known promoters across 30 vertebrate species (Supplementary Figure S6 in additional data file 1). When we analyzed the novel promoters with the known promoters of homologous genes, we found that 671 of these promoters had corresponding known promoters in other organisms. Next, we looked for the overlap of novel promoters from our study with CAGE tag clusters generated by the FANTOM4 project (41). CAGE tag clusters are found at the 5′ end of transcript as well as in other regions including the internal exons, introns, 3′UTR and intergenic regions. Because CAGE analysis relies on 5′ Cap trapper techniques, thus capturing even post-transcriptionally re-capped mRNA, it has inherent deficiencies as a sole tool to identify promoters (20). We have observed that for our promoter peaks, 95–97% are supported by CAGE clusters and surprisingly even 53–59% of non-promoter peaks also show the presence of CAGE clusters (Supplementary Figure S7 and Supplementary Table S6). When we focused on the overlap of novel promoters with CAGE clusters, as expected, we observed a 97% correlation. 
It is worth noting that only 1.4% of all CAGE tag based predicted promoters are actually identified as active promoter by our approach in the five adult tissues (Supplementary Figure S7A). Additionally, we compared non-redundant EST sequences with the novel promoters and found that about 62.4% of the novel promoters overlapped with the 5′ ends of the ESTs. Furthermore, using the published mRNA-seq data from mouse brain and liver, we detected mRNA-seq reads for 68% and 60% of novel promoters identified in brain and liver, respectively (example shown in Supplementary Figure S8 in additional data file 1) (42). Thus, the novel promoters identified by our approach of promoter prediction on Pol-II-enriched ChIP-seq data are supported by other independent experimental methods. To further validate the activity of these novel promoters, we cloned 10 of the randomly selected novel promoters (NP1-10) and two non-promoter regions (Ctrl1, 2) upstream of a promoter-less luciferase gene and measured promoter activity (Supplementary Table S2 in additional data file 2). As shown by the wiggle tracks in Figure 3A, we have identified a new promoter in all five mouse tissues that lies ~16kb upstream of the known mKIAA1632 gene promoter (NP1). The homologous region in humans represents the promoter for the human KIAA1632 gene. Similarly, we have identified a new promoter in three of the mouse tissues for GPM6A that lies ~175kb upstream of the known GPM6A gene promoter in mouse (NP8). The homologous region in humans represents the promoter for the human GPM6A gene (Figure 3B). These promoters either represent the unidentified promoters for KIAA1632 and GPM6A or drive the expression of unknown genes. Figure 3C shows the Pol-II binding in one of the regions that was not predicted as a promoter in our analysis and is considered as non-promoter. 
Using transient transfection experiments, we introduced these constructs in five different cell lines (HEK293, DAOY, A549, HepG2 and NIH3T3) and measured the expression of luciferase gene which is controlled by the novel promoters or non-promoter regions. We observed significant luciferase expression (7-fold for NP5 to 304-fold for NP1) from nine of the 10 selected promoters in at least one of the cell lines (Figure 3D). We did not observe any promoter activity for the novel promoter NP7 which was identified in spleen tissue and it is possible that this is due to the absence of the proper cell system in our luciferase assays or NP7 is a false promoter prediction. Based on our results, we conclude that non-CpG promoters are underrepresented in the current promoter inventory and advances in high-throughput sequencing technology coupled with bioinformatics analysis can help to identify these.
It is well established that many mammalian genes have multiple promoters and that these are differentially used in different cellular contexts. In agreement, we found 23060 promoters for 16330 protein-coding genes and 8727 promoters for 6314 (24%) non-coding genes in the five mouse tissues that were analyzed. To identify the genes that use alternative promoters in these different tissues, we adopted a two-step procedure. The first step involved identification of the promoter(s) with bound Pol-II for each gene in each tissue individually. In the second step, we compared the identified promoters for each gene across the five tissues. If promoters from two or more tissues overlap with each other by at least 300bp, they are considered the same promoter; otherwise they are defined as distinct promoters (additional data file 4). Examples of alternative promoter genes identified by our approach are shown for the Adar1 and Hdgf genes in Figure 4A. In the case of Adar1, there are two active promoters that are differentially used in the five tissues. The upstream promoter P1 is used in brain, kidney and lung, while the downstream promoter P2 has been identified as active in kidney, liver, lung and spleen. Similarly, for Hdgf, four distinct active promoters have been identified that are differentially enriched with Pol-II. Based on this analysis, we determined the distribution of multi-promoter usage in the five mouse tissues (Figure 4B and C). We observed that 37% of the annotated protein-coding genes and 31% of the non-coding genes use alternative promoters in the five mouse tissues. Furthermore, we found that the use of alternative promoters changes the encoded protein in 34.5% of the alternative promoter genes. As our analysis is based on only five tissues, it suggests that a significant number of mouse genes use alternative promoters.
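The second step of this procedure can be sketched in Python. The sketch is an assumption-laden illustration rather than the analysis code: promoters are treated as half-open (start, end) intervals, intervals are processed in sorted order, and each is compared against the growing span of the most recent cluster. The coordinates in the usage note are invented and only mimic the Adar1 pattern described above.

```python
def overlap_len(a, b):
    """Number of bases shared by two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def distinct_promoters(tissue_promoters, min_overlap=300):
    """Collapse per-tissue promoters of one gene into distinct promoters.

    tissue_promoters: dict tissue -> list of (start, end) intervals.
    Returns a list of dicts with the merged 'span' and the set of
    'tissues' in which that promoter is bound by Pol-II.
    """
    items = sorted((iv, tissue)
                   for tissue, intervals in tissue_promoters.items()
                   for iv in intervals)
    clusters = []
    for iv, tissue in items:
        if clusters and overlap_len(clusters[-1]["span"], iv) >= min_overlap:
            last = clusters[-1]
            last["span"] = (last["span"][0], max(last["span"][1], iv[1]))
            last["tissues"].add(tissue)
        else:
            clusters.append({"span": iv, "tissues": {tissue}})
    return clusters
```

A gene would then be counted as using alternative promoters when `distinct_promoters` returns more than one cluster; for instance, intervals near position 1000 in brain, kidney and lung plus intervals near 9000 in kidney, liver, lung and spleen yield two clusters.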
Having identified the promoters that are active in one or more of the five tissues, we further investigated the usage of promoters in a tissue-specific manner. Two different parameters were used for identifying tissue-specific promoters in our study. The first parameter is based on Shannon entropy that was previously employed for identifying tissue-specific promoters from ChIP–chip (17), gene expression and EST data (43). As the second parameter, we define fold change for each promoter (p) as fp=max1/max2, where max1 and max2 are the first and second highest normalized read counts for promoter p among the five tissues, respectively. To define tissue specificity, we have set the maximum cutoff as 1.25 for Shannon entropy and a minimum cutoff of 2.0 for the fold change determinant. Note that while Shannon entropy is inversely correlated, the fold-change parameter is directly correlated with tissue specificity. Using the above-defined parameters, we have identified 6384 tissue-specific promoters across the five mouse tissues (Supplementary Table S7 in additional data file 2 and additional data file 5). These results are further supported by the box plot, which shows as an example for the brain-specific promoters a significantly higher read count in brain compared to the other four tissues in the distribution of normalized read counts (Figure 5A). Similar results were obtained for other tissue-specific promoters (Supplementary Figure S9 in additional data file 1). The highest number of tissue-specific promoters was identified in brain while spleen has the least number of tissue-specific Pol-II-associated promoters. Further analysis revealed that overall only 29% of the tissue-specific promoters were CpG rich, with brain and lung exhibiting highest CpG richness (~41%), while in spleen only 9% of promoters were CpG rich. 
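The two tissue-specificity parameters can be sketched in Python as follows. The cutoffs (entropy at most 1.25, fold change at least 2.0) are taken from the text; the use of base-2 logarithms, the conversion of normalized read counts to relative proportions before computing entropy, and treating both cutoffs as inclusive are assumptions, as the text does not state them.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy (bits) of a promoter's read counts across tissues."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("at least one tissue must have a positive count")
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def fold_change(counts):
    """f_p = max1 / max2: highest over second-highest normalized count."""
    top = sorted(counts, reverse=True)
    if top[1] == 0:
        return math.inf
    return top[0] / top[1]

def is_tissue_specific(counts, max_entropy=1.25, min_fold=2.0):
    """Apply both cutoffs: low entropy and a clear lead for one tissue."""
    return (shannon_entropy(counts) <= max_entropy
            and fold_change(counts) >= min_fold)
```

With five tissues the entropy ranges from 0 (signal in one tissue only) to log2(5) ≈ 2.32 (equal signal in all tissues), which is why low entropy and high fold change both point toward tissue specificity.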
We further studied the relationship of the tissue-specificity parameters, Shannon entropy and normalized read fold change, with CpG richness in promoters of genes (Figure 5B and C). A direct relationship is observed between CpG richness and Shannon entropy, and an inverse relationship is observed between CpG richness and normalized read fold change for promoters, suggesting that globally the tissue-specific promoters are CpG poor compared with ubiquitous promoters. Furthermore, detailed analysis of core promoter elements in the tissue-specific promoters versus ubiquitous promoters shows that TATA (p=1.75e-14) and INR (p=1.69e-12) elements are more enriched in tissue-specific promoters, while BRE (p=4.56e-29) and MTE (p=6.82e-18) elements are significantly enriched in ubiquitous promoters (Supplementary Table S8A and B in additional data file 2; statistical significance was calculated using a proportion test). The DPE element did not show any preference for either class of promoters. Thus, our data suggest that CpG-poor tissue-specific promoters and CpG-rich ubiquitous promoters tend to possess different core promoter compositions (44).
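To illustrate how a comparison of element frequencies between two promoter classes can be tested, the following Python sketch implements a pooled two-proportion z-test. The text only states that a proportion test was used, so the exact variant (for example, with or without continuity correction) is an assumption, and the counts below are hypothetical rather than taken from the supplementary tables.

```python
import math
from typing import Tuple


def two_proportion_z_test(hits_a: int, n_a: int,
                          hits_b: int, n_b: int) -> Tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0:
        raise ValueError("proportions are all 0 or all 1; test undefined")
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts: promoters containing an element in each class.
z_score, p_value = two_proportion_z_test(
    hits_a=300, n_a=1000,  # e.g. tissue-specific promoters with the element
    hits_b=200, n_b=1000,  # e.g. ubiquitous promoters with the element
)
```

A positive z-score here means the element is more frequent in the first class. The same call with the classes swapped gives the same p-value and a z-score of opposite sign.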
As binding of Pol-II precedes transcription, we expected that promoters with high Pol-II occupancy would be transcribed at a higher rate than others. To test this, we studied the correlation between Pol-II recruitment to the promoter and the resulting transcript expression in the mouse tissues. We performed this analysis only for brain and liver, using the publicly available mRNA-seq data (42). We used the Cufflinks software (45) with default parameters to estimate transcript expression from the mRNA-seq data sets. For each tissue, the expression scores from promoters were divided into four quartiles: high, medium, low and very low. Next, we calculated the average Pol-II ChIP-seq read count around the annotated TSS at base pair resolution for promoters in each of the four quartiles and plotted the Pol-II enrichment profile around the annotated TSS (Figure 6). Because no mRNA-seq data are available for kidney, lung and spleen, we performed a similar analysis at the gene level using microarray gene expression data (Supplementary Figure S10). The gene expression data used in our present study for brain, kidney, liver, lung and spleen were downloaded from NCBI (GEO ID: GDS592) (46). As these data contained profiles for eight different brain tissues (frontal cortex, cerebral cortex, substantia nigra, cerebellum, amygdala, hypothalamus, hippocampus and dorsal striatum), the average of these scores for each gene was taken as the expression of the corresponding gene in brain. Altogether, we observe that the promoters driving higher mRNA expression exhibit increased Pol-II recruitment, suggesting that the binding of Pol-II at the promoter is a good signature for global expression from a promoter.
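The quartile-based profile analysis can be sketched as follows. This is an illustrative outline, not the pipeline used in the study: the function names, the way quartile boundaries are computed (Python's `statistics.quantiles` with its default method) and the toy data are all assumptions. Each profile is taken to be a list of Pol-II read counts at fixed positions around the annotated TSS.

```python
from statistics import quantiles
from typing import Dict, List

LABELS = ["very low", "low", "medium", "high"]


def assign_quartiles(expression: Dict[str, float]) -> Dict[str, str]:
    """Label each promoter by the expression quartile it falls into."""
    q1, q2, q3 = quantiles(expression.values(), n=4)
    labels = {}
    for promoter, value in expression.items():
        if value <= q1:
            labels[promoter] = "very low"
        elif value <= q2:
            labels[promoter] = "low"
        elif value <= q3:
            labels[promoter] = "medium"
        else:
            labels[promoter] = "high"
    return labels


def mean_profile_by_quartile(
    expression: Dict[str, float],
    profiles: Dict[str, List[float]],
) -> Dict[str, List[float]]:
    """Average the per-position Pol-II read counts within each quartile."""
    grouped: Dict[str, List[List[float]]] = {label: [] for label in LABELS}
    for promoter, label in assign_quartiles(expression).items():
        grouped[label].append(profiles[promoter])
    return {
        label: [sum(column) / len(column) for column in zip(*rows)]
        for label, rows in grouped.items()
        if rows
    }


# Toy data: eight hypothetical promoters, three positions around the TSS.
toy_expression = {f"p{i}": float(i) for i in range(1, 9)}
toy_profiles = {pid: [e, 2 * e, e] for pid, e in toy_expression.items()}
toy_means = mean_profile_by_quartile(toy_expression, toy_profiles)
```

In this toy setup the Pol-II signal is constructed to scale with expression, so the averaged profile of the "high" quartile lies above that of the "very low" quartile at every position, mirroring the kind of separation the analysis is designed to reveal.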
The identification and annotation of all human and mouse gene promoters that are differentially used in different cell/tissue types, during development, or aberrantly activated in disease conditions remain incomplete, yet they are essential for defining the transcriptome and proteome of the mammalian genome. It is well known that differential gene expression is a characteristic of different tissues; however, little work has been done to characterize the global isoform-specific expression of genes in various tissues (47,48). An important aspect of understanding the regulation of gene expression is the study of all the promoters of a gene. Currently, our promoter knowledge is partial, and our goal in this study was to expand the promoter inventory and to determine the tissues where each promoter is active. To provide a catalog of active promoters in various tissues and identify tissue-specific promoter usage, we used a combination of ChIP-seq and bioinformatics approaches. In this study, we focused on five adult tissues (brain, kidney, liver, lung and spleen) and identified 38639 promoters for both protein-coding and non-coding genes. Our approach identified 12270 novel promoters, including promoters for genes such as Dnmt1, Bmp4, Jmjd3, Cyclin E1 and D1, and MeCP2, which have been associated with tumorigenesis. We were able to annotate a large number of the newly discovered promoters to known genes (60% in the case of brain), and we anticipate that the remaining 40% of unannotated promoters mostly represent the promoters of unknown non-coding genes. Our results also show that about 37% of the protein-coding genes possess alternative promoters. This is lower than the 50–60% of genes expected from other genome-wide analyses, owing to the small number of tissues assayed as well as the sole use of adult tissue in this study (11,44).
This is supported by our analysis in which we compared alternative promoter use in two, three, four and five tissues and observed an increase in alternative promoter usage from 27% (two tissues) to 37% (five tissues). The use of alternative promoters results in different proteins in about 34% of multi-promoter genes, as seen for Adar1 and Hdgf (Figure 4A). In the case of Adar1, the upstream promoter P1 is responsive to interferon and produces a 150-kDa protein, whereas the constitutive promoter P2 produces an N-terminally truncated 110-kDa protein (49). We found that 5–8% of the promoters in these tissues are bidirectional and that 4–6% of the promoters are shared by protein-coding and non-coding genes in each tissue. Our analysis suggests that nearly 17% of the promoters in the mouse genome are used in a tissue-specific manner and that these tissue-restrictive promoters tend to be CpG poor. We found that almost 70% of the known promoters are CpG rich, while 66% of the novel promoters are CpG poor, suggesting that many of these new promoters are tissue specific and not easily identifiable without high-throughput genome-wide analysis. This is supported by the finding that, while one-fourth of the highly tissue-restrictive (active in only one tissue) novel promoters are CpG rich, two-thirds of the novel promoters active in all five tissues show CpG richness. In conclusion, we have identified the active promoters in five mouse tissues, and we plan to expand our study to identify the differential and overlapping use of promoters in normal human tissues and their diseased tissue counterparts.
National Human Genome Research Institute (grant # R01HG003362 to R.D.). R.D. holds a Philadelphia Healthcare Trust Endowed Chair Position and this work is also supported by Philadelphia Healthcare Trust. Funding for open access charge: National Institutes of Health (grant # R01HG003362 to R.D.).
Conflict of interest statement. None declared.
Supplementary Data are available at NAR Online. ChIP-seq data have been deposited in GEO under accession number GSE21773.
The use of the Genomics core facility and computational resources in the Centre for Systems and Computational Biology and Bioinformatics Facility of Wistar Cancer Centre (supported by grant # P30CA010815) is gratefully acknowledged.