Figure shows H3K9 acetylation and gene transcription start sites in ENCODE region ENr333 as an example of how H3K9ac is concentrated around DeepCAGE promoters and gene starts. Indeed, H3K9ac is localized around transcription start sites throughout the entire human genome [
10,
13]. A genome-wide histogram of H3K9ac around DeepCAGE TSS, shown in Figure , illustrates this on a genome-wide scale. H3K9ac has a characteristic bimodal distribution around the TSSs, with one single peak upstream of the TSS, a stronger single peak downstream of the TSS, and depletion right on the TSS. H3K9ac level right on the TSS is low because core promoters are depleted in nucleosomes [
13,
14]. This bimodal distribution has been described in several previous studies [
12,
15,
16].
However, when inspecting the H3K9ac levels around individual promoters, the distribution of acetylation level often does not resemble the average genome-wide situation: around individual promoters the H3K9ac level may be more concentrated upstream (Figure ) or downstream (Figure ) of the promoter, may show a distribution which resembles the genome-wide distribution (Figure ), or have other configurations.
Clustering of DeepCAGE promoters according to their surrounding H3K9ac signal reveals three clearly separated clusters with a comparable number of members
To determine whether there are characteristic groups of promoters having H3K9ac enrichment at different positions relative to the transcription start site, we clustered 4,481 DeepCAGE promoters according to their surrounding H3K9ac level using k-medoids clustering.
Deep sequencing of CAGE tags as well as H3K9ac and RNA polymerase II ChIP-chip experiments were performed on a culture of undifferentiated THP-1 cells (see methods). A second set of DeepCAGE and ChIP-chip data was produced for THP-1 cells which have been treated with phorbol myristate acetate (PMA) to stimulate the cells to differentiate into a macrophage-resembling phenotype. This second dataset has been used for validation.
Before clustering according to the surrounding H3K9ac signal, the DeepCAGE promoters were filtered in order to minimize the effects of two types of confounding factors: missing probes and proximal promoters. The ChIP-chip experiments were performed using genome-wide tiling-arrays with probes of length 25 bp, spaced at 35 bp. However, there are no tiles in the repetitive regions of the genome, which may include promoters. Such missing tiles around the promoter region result in missing data. To address this, we divided the region around each promoter into eleven bins of size 400 bp, in a window of 4,000 bp around the promoter. Each core promoter has one tag starting site defined as its representative position, which contains the majority of tag starts in this promoter (see methods section of [
6]), and this representative position has been chosen as the exact center of the 4,000 bp window. Only promoters with at least one tiling array probe in each bin were retained. A second factor that potentially can confound the analysis is proximal promoters: if there are two or more promoters within a 4,000 bp window the ChIP-chip signal in these bins is a compound signal of all promoters, which can not be unambiguously decomposed. We discarded all such cases of proximal promoters, reducing the initial 14,607 DeepCAGE promoters in the FANTOM 4 dataset to 4,481 promoters retained for analysis after the two-stage filtering.
K-medoids clustering will classify the input (here, a list of promoters) into any predetermined number of clusters (or classes). Those items that are close to each other using a distance measure are classified as belonging to the same cluster. To determine the distance between any two promoters, the average H3K9ac signal strength in each bin was determined and a cumulative normalized distribution across all bins reflecting the strand orientation of the promoter was calculated. For each possible pair of promoters, the Kolmogorov-distance between the corresponding two cumulative distributions of H3K9ac signal strength was taken as the distance between those two promoters. The Kolmogorov distance between two cumulative distributions essentially measures the similarity in shape of the two original graphs. A matrix containing the distances between all possible promoter pairings was then used as input for the k-medoids implementation, to cluster the 4,481 promoters. A variety of different clustering parameters and different sizes of bins were assessed, and we observed that using three classes as the predetermined number for the k-medoids algorithm produced three clearly distinct clusters. Clustering with smaller bins and/or more clusters led to decompositions that had similar shapes as the three clusters but were decomposed into sub-clusters with different acetylation strength. However, they did not show any fundamentally new shapes or any striking new features in terms of peaks or depletions (data not shown). We therefore decided to focus the remainder of the analysis on these three clusters as the separation between them is clear and conceptually simple.
We refer to these three clusters as upstream, centered, and downstream, Figure . Each cluster contained a comparable number of promoters: upstream 1,482, centered 1,733, downstream 1,266 members.
By excluding proximate promoters, we also exclude bidirectional promoters on purpose in this study. Genome-wide histograms of general histone 3 acetylation as reported in [
12] show a typical bimodal distribution as described above. An analysis of the H3ac signal around bidirectional promoters has shown that the weaker, upstream peak of H3ac is in fact often a downstream peak of the corresponding promoter on the antisense strand; i.e. H3ac generally occurs downstream of promoters [
17]. Although we removed bidirectional promoters from our dataset, we obtain one cluster (the centered cluster) that shows a bimodal signal distribution which would be expected to occur from bidirectional promoters. The promoters in the centered cluster may therefore contain cases of bidirectional promoters, where the corresponding antisense promoters have not been detected by DeepCAGE, either because they are lowly expressed or because they belong to uncapped transcripts.
Membership of promoters to clusters of H3K9ac and RNA polymerase II with similar shape
The clustering of promoters according to their surrounding RNA polymerase II binding activity was performed in the same way as for H3K9ac. This clustering also revealed three well-separated clusters to which we refer to again as upstream, centered, and downstream (Figure ). Table shows the concordance between H3K9ac and RNA polymerase II clusters. A substantial part of the promoters which fall into the centered and downstream clusters are co-localized; i.e. the majority of promoters which fall into the centered cluster of the H3K9ac clustering also fall into the centered cluster of the RNA polymerase II clustering, and the majority of promoters in the downstream cluster of H3K9ac also fall into the downstream cluster of RNA polymerase II. Only the promotes in the upstream cluster of H3K9ac are more likely to fall into the downstream, instead of the upstream cluster of RNA polymerase II. Our expectation of a statistically significant, strong correlation between the cluster assignment of promoters in the two datasets could not be confirmed; however, this data hints at that there might be at least a very weak tendency for promoters to fall into clusters with the same shape of H3K9ac/RNA polymerase II signal distribution when comparing the two experiments.
| Table 1Common promoters in clusters with similar signal distribution between H3K9ac and RNA polymerase II |
Genomic features associated with the clusters
We next investigated whether the promoters in different clusters coincide with different genomic sequence features. To investigate this, we checked how many of the promoters in each cluster are single peak or broad promoters, whether they have a TATA-box binding motif upstream of the TSS, whether they overlap with a CpG island, and whether the promoter belongs to an annotated gene. Also, we were interested in how many promoters in each cluster overlap with a repeat element, and if there is a bias for certain types of repeat elements. These features were selected because the association of single peak promoters with TATA-boxes and broad promoters with CpG-islands are typical aspects of DeepCAGE promoter architecture [
8], and the chosen features all have a fundamental relationship to transcription initiation, of which H3K9ac is a well-known epigenetic marker.
Here, we use a modified version of the definition of single peak and broad promoters from FANTOM 3 [
8], adapted to the FANTOM 4 dataset: single peak promoters are defined as promoters that express 50% or more of their total gene expression level from TSSs (level 1 promoters, see methods section) which are contained in a window of no more than 4 nucleotides, while all other promoters were classified as broad. Figure shows the result for all 4,481 promoters: the upstream cluster is enriched in single peak promoters, whereas the centered and downstream clusters are enriched in broad promoters (Figure ). These results suggest that there are several different typical acetylation states, as depicted in Figure : while in all three clusters broad promoters are most prevalent (Figure ), single peak promoters still occur in more than 40% of the cases where the H3K9ac signal is concentrated in the upstream region (Figure ). A concentrated activity of H3K9ac in the centered (Figure ) and downstream regions (Figure ) is more prone to lead to a broad and less defined mode of transcription initiation.
TATA-box binding motifs are located upstream of the RNA polymerase II binding site [
18,
19] and play an important role in the formation of the pre-initiation complex [
20]. The clustered promoters were annotated for the presence of a TATA-box binding motif by searching for a match (with more than 75% confidence score) to the TATA-box position weight matrix from JASPAR [
21] in the region -50 to -15 of the representative position of each core promoter. Single peak promoters in our filtered dataset of 4,481 promoters had highly significant (Fisher's exact test, Bonferroni-corrected p-value < 7.8E
-5) enrichment for the TATA-box binding motif, a confirmation of previous findings [
8]. In connection with the proposed model above, we expected an additional enrichment of TATA-box promoters in the upstream cluster. However, we did not find any statistical significant enrichment of TATA-box promoters in any of the clusters.
CpG islands have previously been shown to be highly acetylated [
12]. Figure shows how the centered and downstream cluster are enriched in CpG islands compared to the upstream cluster. As the centered and downstream clusters are also enriched in broad promoters, this observation is consistent with the findings from [
8] where the association of broad promoters with CpG islands was first noted.
About 16% of the entire 14,607 promoters identified in the THP-1 cell line do not have annotation based on the Entrez gene prediction dataset in the 1 kb-downstream region [
6]. Around half of these un-annotated promoters are evolutionary conserved across mammals and are therefore likely to be promoters of yet undetected genes, including functional non-coding RNA genes.
Large intergenic noncoding RNAs (lincRNAs) [
22,
23] are a group of multi-exonic, functional RNAs that show strong conservation across mammals and are thought to be involved in various cellular processes, including embryonic stem cell pluripotency and differentiation [
23], the establishment of chromatin states and down regulation of gene expression in concert with chromatin modifying enzymes [
24]. To specifically test if the putative novel promoters in our dataset code for lincRNAs, we examined how many out of the 3,289 lincRNAs collected in [
24] start in a window -300/+1,000 bp downstream of the representative position of the core promoters. Only 35 promoters of the full dataset of 14,607 promoters identified in FANTOM 4 have a lincRNA in the region considered here. The lincRNAs identified to date have been determined in cell lines other than THP-1 and represent only a subset of the entire functional noncoding transcriptome; despite the lack of overlap it is still reasonable to assume that many of the un-annotated core promoters belong to ncRNA genes, yet undetected protein-coding genes, or may be alternative promoters of already annotated genes. We therefore consider un-annotated promoters as putative novel promoters.
The centered and downstream cluster are enriched in promoters which belong to known genes, while the upstream cluster is enriched in putative novel promoters (Figure ). This is consistent with the abovementioned observation that the centered and downstream clusters have a stronger enrichment of broad promoters than the upstream cluster; genes with broad promoter architecture have previously been shown to be associated with abundantly expressed, housekeeping genes [
8] which are more likely to be contained in gene annotation datasets like RefSeq than genes which are only expressed in certain tissues.
The association between the clusters and the three tested genomic features promoter architecture (single peak vs. broad), CpG islands and gene annotation is highly significant for these three features shown in Figure (each of the three features has a statistically significant distribution among the three clusters with a Bonferroni-corrected p-value < 5.5E-15, Pearson's Chi-square test).
Repeat elements are widely expressed in mammalian genomes and have global impact on gene expression regulation by acting as alternative promoters, by cis-regulating protein-coding genes, and performing other proposed functional roles [
25]. We assessed how many promoters in each cluster overlap with repeat elements (Figure ), and if there is significant bias for repeat elements in general, or any particular class of repeat element, in any of the clusters. Regarding repeat elements as a general group without distinguishing the particular type of repeat, there is a very slight but still significant (Chi-square test, Bonferroni-corrected p-value <0.02) depletion of repeat elements in the centered cluster. When examining the repeat elements found in all three clusters, simple repeats (i.e. micro-satellites) and low complexity repeats were found to be the two most prevalent groups, but there is no significant bias for any specific type of repeat in any of the clusters.
H3K9ac signal strength corresponds to gene expression level
In order to investigate the relationship between gene expression level and H3K9ac signal shape and strength, we separately examined three subsets of clustered promoters, selected by their gene expression level. Figure shows boxplots visualizing the gene expression level for the three clusters. We sorted the 4,481 filtered promoters by their expression level in the undifferentiated stage (i.e. at the zero hour time point, see methods), and selected the 10% weakest promoters (lowly expressed genes) and 10% strongest promoters (high gene expression), and a third group containing all promoters with an expression level that lies between the lowly and highly expressed genes. Figure shows the H3K9ac signal strength of the three extracted groups. The promoters of lowly expressed genes on average have a weak H3K9ac signal, while the highly expressed promoters have an overall enriched acetylation signal when compared with the clustering for all promoters, confirming a similar finding from [
16]. Apart from the different levels in acetylation strength, there is still a very clear separation between the three clusters. Also, the clusters retain their characteristic shapes.
It is interesting to note that when comparing only the highly expressed to the promoters which have low and medium gene expression level, the acetylation strength of the downstream and centered clusters (maximum peak-level as well as overall distribution) increase to a lesser degree with increasing expression, while the upstream cluster increases dramatically. This suggests a more direct link between H3K9ac and gene expression level in the upstream cluster, than in the centered and downstream clusters. This can be interpreted within the model of three main epigenetic modes of transcription initiation [
26]: genes experiencing initiation and elongation, genes experiencing transcription initiation but not elongation, and genes experiencing neither. The mechanisms of gene-regulation in these three groups may belong to the initiation or elongation phase of transcription, respectively. This model in combination with our observations suggests that genes having the H3K9ac concentration in the centered and downstream region could predominantly be regulated at post-initiation steps. Such post-initiation regulation could be based on two general classes of regulation mechanisms [
26]: in one class, transcriptional pausing of RNA polymerase II, poor processivity, or abortive initiation prevents elongation. In the second class, transcription does take place but is immediately degraded by gene silencing.
Features of subsets filtered by gene expression level
With the extracted subsets of weak and strong promoters, we again performed the correlation analysis between clusters and genomic sequence features. There was no statistically significant enrichment for the distribution of the selected sequence features in any of the subsets of clusters filtered by gene expression level. However, some interesting general aspects could be observed which are valid for the overall subsets, although not for the clusters.
The lowly expressed promoters overall show a lower level of RefSeq annotation compared to the whole clustered dataset. This was to be expected, since lowly expressed genes are difficult to detect and therefore have a tendency to not be contained in gene annotation databases like RefSeq.
The vast majority (>97%) of the highly expressed promoters are of the broad type; broad promoters tend to regulate genes with a higher gene expression level than peak promoters (Figure ), and thus by selecting a subgroup of highly expressed promoters, we could expect this group to contain more broad promoters than the total clustered set. Accordingly, many of the highly expressed promoters (>91%) lie on CpG islands. There is available RefSeq annotation for more than 80% of all promoters in each cluster of this group. The high association of these promoters with annotated genes can be explained by the fact that proteins of highly expressed genes can be expected to be contained in gene prediction data sets.
Repeat elements increase with gene expression level
With increasing promoter expression level we observed an increase in the number of promoters overlapping with repeat elements. Only ~5.8% of all lowly expressed promoters overlap with any of the repeat elements. For medium gene expression, ~7.8% of the promoters overlap with a repeat element, and for the promoters regulating highly expressed genes the result was ~11.8%.
Analysis based on DeepCAGE and ChIP-chip data performed on differentiated THP-1 cells confirms the findings
To confirm our findings, we repeated the entire analysis using core promoters determined by DeepCAGE, and H3K9ac and RNA polymerase II ChIP-chip of THP-1 cells 96 hours after PMA treatment. At this time point, the THP-1 cells have differentiated from monocytes to a phenotype resembling macrophages [
6]. All findings of the study using the 0 hour time point were confirmed. This implies that the found correlations of sequence features and acetylation signal distribution are stable in the two cell phenotypes.