Features of the core promoter sequence greatly differ between S. cerevisiae genes with high and with low maximal expression
In a recent study performed in our lab (to be published elsewhere), 859 native S. cerevisiae
promoters were inserted upstream of a YFP reporter gene, and their promoter activity was accurately measured in 10 different conditions. Many of these genes were constitutively expressed in all conditions, and we will refer to them here as constitutive genes. We will refer to all other genes as regulated genes, as each of them is further induced or repressed in a subset of the conditions. This first classification of genes is related to their mode of regulation. We alternatively classified the genes based on their maximal promoter activity. As an approximation to the real maximal promoter activity (in any possible condition), we used the maximal measured promoter activity, denoted by Emax
, and classified as follows: low (Emax
< 0.1), medium (0.1 ≤ Emax
< 0.4), high (0.4 ≤ Emax
< 1) and very high (Emax
≥ 1). By definition, this approximation is more accurate for genes with high Emax
values, and may be less accurate for genes with low Emax
values, as they may be highly expressed in a condition other than the 10 examined. In total, we had 729 genes out of the above 859 that also had their TSS measured by (32
) (see Supplementary Table S1
). Applying both of the above classification criteria, we partitioned these 729 genes into eight subsets: 171 constitutive–low (both constitutive and low), 104 regulated–low, 190 constitutive–medium, 80 regulated–medium, 122 constitutive–high, 18 regulated–high, 36 constitutive–very high, and 8 regulated–very high.
For each of the above eight gene subsets, we analyzed base content (of mononucleotides and G + C) and hits of the TATA box TATAWAWR consensus (2
) in the region between 200 bp upstream and 100 bp downstream of the main TSS (denoted the [−200, 100] region), as mapped by (32
Strikingly, we found that genes with higher Emax
values have core promoter sequences that are significantly (see rank-sum P
-values in Supplementary Figure S1
) more T rich ( and Supplementary Figure S1A
) upstream of the main TSS (within the [−70, −10] region), and alternately more A rich ( and Supplementary Figure S1B
) at and downstream of the main TSS (within the [0, 50] region). A similar result was shown by (27
) for a smaller set of genes (see above). Here we also show that this signal is highly similar for both constitutive and regulated genes that have similar levels of Emax
Figure 1. Yeast core promoter sequence signals differ between genes with different maximal promoter activity. Mean nucleotide and TATA box content, computed using a sliding window (20 bp long, 10 bp step) over the [−200, 100] region around the main TSS (more ...)
Genes with higher Emax
values (both constitutive and regulated) also tend to have significantly lower G\C content around their main TSS ( and Supplementary Figure S1C
). Interestingly, reduced G\C content around the main TSS of genes with high Emax
is mainly achieved by significantly reduced G content ( and Supplementary Figure S1D
) and not C content (). G\C content is known to be highly correlated with intrinsic nucleosome occupancy (33
) that depends only on the DNA sequence and the concentration of histone proteins. This intrinsic occupancy was shown to be well predicted by a thermodynamic model learned based on in vitro
nucleosome occupancy data (29
). Using this model, we predicted and computed the mean intrinsic nucleosome occupancy around the main TSS (see ‘Materials and Methods’ section) for the eight gene sets defined above, shown in A. Indeed, there is similarity between the G\C content tracks () and the predicted intrinsic nucleosome occupancy tracks, suggesting that lower intrinsic nucleosome occupancy around the main TSS contributes to higher levels of maximal promoter activity. This trend is also evident when examining the mean YPD in vivo
nucleosome occupancy (29
) shown in B, although for regulated genes, many of which not induced in YPD (with glucose available), there are significant differences between the YPD in vivo
and the predicted intrinsic nucleosome occupancies.
Figure 2. Nucleosome occupancy of yeast core promoters differs between genes with different maximal promoter activity. (A) Mean predicted nucleosome occupancy in the [−200, 100] region around the main TSS, for the eight yeast gene subsets (defined in the (more ...)
Consensus TATA boxes are known to be high-affinity binding sites of TBP, with one and two mismatches leading to weaker binding affinities (3
). In accord with (2
), when comparing regulated and constitutive genes that have similar Emax
values, we found a higher frequency of consensus TATA boxes () in core promoters of regulated genes. Still, for both regulated and constitutive genes, consensus TATA box frequency is higher in genes with higher Emax
. This is in support of TBP high-affinity binding contributing to higher expression (26
These results thus identify several specific core promoter sequence signals that are predictive and may thus affect the maximal promoter activity of yeast promoters.
T richness upstream of the main TSS is a predictor of the maximal promoter activity
We next sought to test the extent to which core promoter sequence features can predict Emax levels. To this end, we focused on the T-richness signal upstream of the main TSS (). For each of our sliding windows (20 bp long, 10 bp step) within the [−80, −1] region, we computed per promoter, the T content to A\T content ratio in that window, and took the maximum over these windows to be a T-richness feature value. Taking the T content to A\T content ratio and not the T content itself normalizes out differences in A\T content between constitutive and regulated genes with similar Emax. For each pair of gene subsets (out of the above eight), we then computed an AUC score, quantifying the extent to which our T-richness feature separates between genes of one subset and the other subset. A score of 0.5 is attained by random, while a score of 1 represents a perfect separation (classification). An advantage of reporting AUC scores over P-values (of rank-sum tests, for instance) is that P-values are highly sensitive to the sizes of the compared gene subsets, while the AUC scores are not. The computed AUC scores are shown in in a triangular color matrix. Each intersection of a row and a column holds the AUC score for how well the T-richness feature separates between the subset at the right of the row and the one at the head of the column. For each AUC score, we also tested whether it is significantly different than 0.5 by computing an empirical P-value based on 10 000 random permutations of the feature values. AUC scores with a non-significant P-value (controlled for allowing a false discovery rate of 0.05) were marked by ‘x'.
Figure 3. T richness upstream of the main TSS is a predictor of maximal activity of yeast promoters. AUC scores for seperating genes of one yeast gene subset from another, based on a feature of T richness upstream of the main TSS (top right, see also main text). (more ...)
We found two important observations. First, the T-richness feature does not significantly separate between gene subset pairs (constitutive versus regulated) of the same Emax level. Second, in most cases, the T-richness feature can significantly separate between gene subsets that have different levels of Emax, and the measure of separation (AUC) increases as the difference in Emax levels increases. There are four exceptions to this rule (constitutive–medium versus Constitutive–low, regulated–medium versus regulated–low, regulated–high versus regulated–medium and regulated–very high versus regulated–high), where separation is weak and not significant.
Taken together, these results demonstrate that the T-richness feature defined above is predictive of the Emax level of a gene.
Prediction of maximal promoter activity from core promoter sequence features
Encouraged by our above results, we sought to learn a quantitative model that predicts a gene’s Emax from features of its core promoter sequence, with two goals in mind: to provide a lower bound on how predictive the core promoter region is of the maximal promoter activity, and to elucidate core promoter sequence features that may have a role in determining it.
For each of the 729 genes, we computed a large set of base content and k-mer (k = 1, … ,4) counts and existence features over different windows within the [−200, 50] region. For each 4-mer feature, we computed another version of it with up to 1 mismatch of the 4-mer allowed. Adhering to a 10-fold CV scheme, we defined 10 pairs of training and test gene sets in the following way: we randomly partitioned the 729 genes into 10 subsets of (about) the same size. Each time we took one of the 10 subsets to be the test set, while the training set consisted of all other genes. For each of the 10 training sets, we first pruned sparse features (that had a support of <5% of the genes in the training set), and used the remaining features (>54 K) to learn a linear model that predicts Emax based on a small subset of them (see ‘Materials and Methods’ section). This model was then used to predict the Emax values of the genes in the respective test set. Model performance was evaluated by three measures: the R2 statistic (quantifying the proportion of variance in the data that is explained by the model), the Pearson correlation, r, and the Spearman correlation, ρ.
The mean (over the 10 models) performance measures are shown in A (bar plot). Most importantly, the mean test R2
is 0.254, indicating that the core promoter sequence itself can explain at least 25.4% of the variance in the maximal promoter activity of yeast promoters. The difference between the mean test Pearson correlation (0.527) and the mean test Spearman correlation (0.425) indicates a small bias of the models to better predict high Emax
values. This can be expected in cases such as ours, where the distribution of response values is greatly skewed (Supplementary Figure S3
). A comparison of mean model performance on the training data versus the test data (A, bar plot) shows that there is a degree of overfitting to the training data. This too is expected, as each model was learned on a large set (>54 K) of features. We will therefore focus on features that were included in at least 5 of the 10 models (48 such features), as they are likely to represent real signals. We term these features, robust features.
Figure 4. Core promoter sequence features explain 25% of the variance in maximal promoter activity in yeast. Results of learning linear models (in a 10-fold CV scheme) that predict Emax from features of the [−200, 50] region (relative to the main TSS). (more ...)
A table detailing the robust features is included in A. For each feature (row) we show its sequence element (e.g. the 4-mer ‘TTTT’) and the promoter window (relative to the main TSS) where it was computed. Notably, most features could be related to one of the core promoter sequence signals that we discussed above. Based on that, we partitioned the features to several classes (class definitions appear in the left column), and sorted them within each class by their mean effect size (computed over the 10 models, color coded in the right column).
Robust features 1–11 (serial numbers appear in the gray-shaded column) are of T-rich k-mers in windows within the [−75, −1] region, and have positive effects (toward higher predicted Emax values), in accord with our above observations. Notably, most predictive are T-rich k-mers within the [−20, −11] region, as they are included in both robust features 1 and 2, and their effect size is the sum of effects of the two features. Robust feature 12 is also of a T-rich 4-mer, but at a small window further upstream.
Robust features 13–19 are of T-less k-mers within the [−100, −1] region. Robust features 20–22 are of the maximum A-content to A\T-content ratio, taken over a sliding window (size 10 bp, step 5 bp), within the [−75, −16] region. Robust features 13–22 all represent T-poor elements upstream of the main TSS, and have negative effects.
Robust features 23–29 likely involve TBP binding signals, as they include k-mers that are part of the consensus TATA box, up to 1 mismatch (3
), and are upstream of the main TSS. Interestingly, such signals within the 100 bp upstream of the main TSS (robust features 23–26) have positive effects, while those further upstream (robust features 27–29) have negative effects, suggesting that TBP binding far upstream of the main TSS is less suited for achieving high promoter activity. This is in accord with the pol-II scanning model, as TBP binding far upstream of the main TSS would require longer pol-II scanning, and perhaps even increase the chance of pol-II falling off before initiation.
Robust features 30–41 are in windows overlapping the main TSS or slightly downstream of it. These features have positive effects and include A content (robust feature 33), as well as A-rich k-mers that also contain pyrimidines, and capture signals that have some resemblance to the TSS motif of (19
Robust feature 42 is the occurrence of the ‘AATG’ 4-mer within the 50 bp downstream of the main TSS, and has a positive effect. This feature may in fact represent a translation-related signal, suggesting that a short 5′UTR contributes to higher translation rates (recall that the promoter activity measures were based on YFP measurements), perhaps in relation with the ribosome scanning of the mRNA for the first AUG (34
). To further assess this result, we compared 5′UTR lengths (32
) of four gene subsets: constitutive–high, constitutive–low, regulated–high and regulated–low. shows the cumulative distributions of 5′UTR lengths for these four subsets. For constitutive–high genes alone, 5′UTRs were found to be significantly shorter than those of the other subsets (rank-sum P
-values < 10−5
). This suggests that 5′UTR length may indeed have an effect on expression for constitutive genes, but less so for regulated genes. Our result extends previous results (35
) that showed that constitutive genes tend to have shorter 5′UTRs than other genes.
Figure 5. 5′UTR length may affect expression in constitutive genes. Cumulative distributions of 5′UTR lengths (32) for four gene subsets. 5′UTRs of constitutive–high genes tend to be shorter than those of the other three subsets (more ...)
Robust features 43–48 are of G\C content and of G\C-rich 4-mers in windows around the main TSS, and have negative effects, in accord with above observations.
Thus, by using a quantitative modeling approach, our results demonstrate that core promoter sequence features can explain a large fraction of the variance in maximal promoter activities of held-out genes. We were also able to suggest specific sequence features in different parts of the core promoter as having a role in determining the maximal promoter activity. These features fall into several classes, as illustrated in B.
The effects of the complexity and location of sequence features on prediction
The linear modeling results reported in the previous section were based on features of the core promoter region with sequence complexity ranging from base content and up to 4-mers. An interesting question that arises is whether low sequence complexity features suffice to get similar performance. To test this, we repeated our linear model learning scheme several times, starting with only base content features, and gradually added features of higher sequence complexity, up to features of 5-mers. The mean (over 10 models) test R2 is reported in A for each sequence complexity threshold, demonstrating that using base content features alone can explain (on average) >20% of the variance in the test data. Still, increasing sequence complexity up to 4-mers further contributed to test performance, suggesting that higher complexity sequence features of the core promoter region indeed play an additional non-redundant role in determining the maximal promoter activity.
Figure 6. A comparison of mean test R2 of linear models learned with varying complexity/location constraints on the sequence features. Here, the same 10-fold cross validated linear model learning scheme, described in the main text and in the legend, was (more ...)
Another interesting question is which regions of the core promoter are more predictive of the maximal promoter activity. To shed light on this, we again repeated our linear model learning scheme multiple times (using features of up to 4-mers), each time using only features that fall within a certain window (of length 100 or 50 bp) over the core promoter region. For each such window, its resulting mean test R2 is shown in B, showing that it suffices to take features within the [−50, 49] or the [−75, 24] regions to explain 19.4% of the test variance. Indeed, most of the robust features that were found to have high (absolute) effects (A) were computed over windows that fall within the [−75, 49] region around the TSS.
Core promoter sequence is also indicative of maximal promoter activity in human
Our results above show that core promoter sequence is indicative of maximal promoter activity in yeast. A natural question that arises is whether these results also hold for other organisms.
To explore this, we examined mRNA expression data (37
) of 10 human cell lines (GM12878, GM12892, H1-hESC, HCT116, HeLa-S3, HepG2, HSMM, HUVEC, K562 and MCF7), available at TSS resolution (for each gene, an FPKM mRNA abundance measure is reported for different TSSs, see also Supplementary Information
). For each cell line, there were between two and four replicates of mRNA expression measurements, and for each TSS we conservatively took the minimum replicate value as its mRNA expression level. We then chose only TSSs that were expressed in all cell lines, and were the most highly expressed out of all other TSSs of the same gene. This left us with a set of 8025 TSSs that are constitutively expressed in the above 10 cell lines (see Supplementary Table S2
). We further defined two subsets of these TSSs, based on their maximal mRNA expression (the maximal FPKM over the 10 different cell lines): 1035 TSSs with high maximal mRNA expression (maximal FPKM ≥ 100) and 1218 TSSs with low maximal expression (maximal FPKM < 5). The high maximal expression subset is, by definition, a subset of TSSs with high maximal promoter activity. The low maximal expression subset is an approximation of a subset with low maximal promoter activity, as some of its TSSs may be highly expressed in other cell lines or conditions, or alternatively, their mRNA products may be strongly downregulated posttranscriptionally.
Similar to our analysis in yeast (see above), we analyzed various sequence signals within the [−200, 100] region around the TSSs, including base content (mononucleotides and G + C), CpG and GpC content, as well as the percent of TSSs with TATA box hits, or with hits of 6-mers of the SP1 transcription factor motif consensus (GGGCGG or its reverse complement CCGCCC). For all of these sequence signals (A) there were significant differences (see rank-sum P
-values in Supplementary Figure S4
) between the set of high maximal expression TSSs (A, left column) and the set of low maximal expression TSSs (A, middle column).
Figure 7. Human core promoter sequence signals differ between constitutive TSSs with different maximal expression. (A) Mean nucleotide, k-mers and TATA box content, computed using a sliding window (20 bp long, 10 bp step) over the [−200, 100] region around (more ...)
High maximal expression TSSs tend to have significantly lower A and T content around the TSS than low maximal expression TSSs (Supplementary Figure S4A
), and, conversely, higher C, G, G\C, GpC and CpG content around the TSS (Supplementary Figure S4C–G
). While human core promoters are known to have high G\C, GpC and CpG content compared with flanking regions (38
), here we show that their core promoter content is indicative of the maximal promoter activity. Accordingly, features of G\C, GpC and CpG richness around the TSS were found to significantly separate between the high and the low maximal expression TSSs, with AUC scores of 0.623, 0.64 and 0.682, respectively (Supplementary Figure S5
). Recently, one study showed that high G\C and CpG content promote nucleosome depletion in mammalian promoters, both in vivo
and in vitro
). Thus, higher G\C and CpG content around the TSS would lower its nucleosome occupancy, making it more accessible for PIC formation and hence more highly expressed, in line with our results. Importantly, our focus here on constitutive core promoters removes the possibility that the differences between the high and the low maximal expression TSSs are due to differences between constitutive and tissue-specific core promoters (for instance, constitutive human core promoters are known to be CpG richer (39
Among several TF motifs that are known to be enriched in core promoters of constitutive genes, SP1 motifs are the most abundant (38
). SP1 consensus 6-mers (GGGCGG or its reverse complement CCGCCC) in the 100 bp upstream of the TSS are found in 3355 of the 8025 (41.8%) constitutive core promoters, and are significantly depleted in the low maximal expression subset (found in 304 out of 1218, 25%, P
), suggesting that their existence may contribute to higher levels of maximal promoter activity. Other TF motifs known to be enriched in core promoters include those of NF-Y (the CAAT-box) and ETS (38
). Similar to the SP1 consensus 6-mers, both NF-Y consensus 5-mers (CCAAT or its reverse complement ATTGG) and ETS consensus 6-mers (CCGGAA or its reverse complement TTCCGG) occur more around high maximal expression TSSs than around low maximal expression TSSs (Supplementary Figure S6
While 10–24% of human genes were estimated to have a core promoter containing a TATA box, core promoters of constitutively expressed human genes were shown to be mostly TATA-less (41
). In line with this, consensus TATA boxes (TATAWAWR) that are distanced 50–20 bp upstream of the TSS are found in only 60 of the 8025 constitutive core promoters (0.75%). Albeit the small numbers, they are significantly enriched in the high maximal expression subset (found in 29 out of 1035, 2.8%, P
), and depleted in the low maximal expression subset (found in 3 out of 1218, 0.25%, P
-value 0.013), suggesting that the TATA box may contribute to higher levels of maximal promoter activity, as in yeast (see above).
While for most of the above sequence elements, the mean signal of the high maximal expression TSSs (A, left column) seems relatively similar to that of all constitutive TSSs (A, right column), there are significant differences between the two sets (see Supplementary Figure S4B–E
). Most notably, high maximal expression TSSs tend to have higher C content immediately upstream of the TSS, and lower G content at and downstream of the TSS, compared with the average constitutive TSS.
Out of the above defined set of 8025 constitutive TSSs, 50 were TSSs of ribosomal proteins (RPs). Human RPs share a common core promoter architecture (42
), and most of them are similarly expressed across tissues (44
), suggesting that they are jointly regulated. Many of the 50 constitutive RP TSSs were very highly expressed, with 46 of them included in the above high maximal expression TSSs subset. We partitioned these 50 TSSs to the 25 with higher maximal expression and the 25 with lower maximal expression, and compared their mean base content around the TSS (B). Here, too, we found significant differences between the two subsets. Most notably, RP TSSs with higher maximal expression tend to be C richer and G poorer in the [−10, 9] window around the TSS (rank-sum P
-values 0.02 and 0.032, respectively). Accordingly, C content in that window separates the higher maximal expression RP TSSs from the lower ones with an AUC score of 0.69. Mammalian RP core promoters were shown to have a polypyrimidine initiator (43
). Our results suggest that the pyrimidine content of this initiator element may affect its efficiency.
Finally, following the same 10-fold CV linear model learning scheme as with the yeast data (see above), we learned 10 linear models that predict the maximal expression (log of the maximal FPKM measure) of the 8025 constitutive human TSSs from base content and k-mer features of their core promoters. Mean model performance measures (R2
, Pearson correlation r
, Spearman correlation ρ
) and a table detailing 58 robust features (that were included in at least 9 out of the 10 models) are shown in Supplementary Figure S7
. The linear models could only explain, on average, 7% of the variation in the maximal expression of held-out test TSSs (with corresponding test mean r
= 0.268 and ρ
= 0.238), but consistently so (low standard deviation between models). The low predictive power of these linear models is likely due to the great complexity and diversity of human core promoter architectures. In accord with the results above, features of CpG containing k-mers were found to be the most significant predictors of higher maximal expression, and features of TATA box, NF-Y, ETS and SP1 consensus k-mers were also found to contribute to higher maximal expression.
Taken together, our results show that various human core promoter sequence features are predictive of maximal promoter activity, suggesting that they likely have a causal role in their determination.