Data and post processing
All primary data were downloaded from the ENCODE UCSC browser [49]. Only data labeled as unrestricted (9 months after the release date) were used.
We performed the main analysis on the K562 cell line data and used HUVEC and NHEK data for validation, all from the NCBI36 (HG18) assembly.
We used the UCSC known gene track [49] as gene models and for promoter annotation, unless specifically described otherwise below.
Core Promoter set
All TSSs were derived from the gene track mentioned above. Since alternative transcripts can cause ambiguity when measuring tag expression, we used only transcripts that do not overlap any other transcript in the track, giving a set of 12,872 core promoters. Because we measure mRNA expression by RNA-Seq in the first 1,000 exonic nucleotides downstream (excluding signal from intronic intervals), all transcripts that were not long enough were discarded, resulting in a final set of 9,115 promoters.
ChIP-Seq data of histone modifications and RNAPII
RNAPII as well as histone modification data for H3K4me1, H3K4me2, H3K4me3, H3K9ac, H3K9me1, H3K27ac, H3K27me3, H3K36me3 and H4K20me1 in all three cell lines were downloaded as mapped reads from the 'Broad Histone' tracks [34] in the UCSC browser. We applied the MACS peak finder [52] to these datasets with standard settings except vfold = 10, using the input control ChIP for the respective cell line as background, and used the produced 1 nt resolution wig files of the shifted tags (not the peaks) as our basal data. Replicates were pooled. These tracks are produced by the Broad and Bernstein laboratories and released for public use.
Cap Analysis for Gene Expression (CAGE) data and processing
We used the nuclear CAGE libraries from the K562 cell line from the 'ENCODE RIKEN RNA Subcellular Localization by CAGE Tags' track in the UCSC browser; these tracks are produced by the RIKEN Omics Science Center [53]. The data were transformed so that we count the sum of read 5' ends at each genomic nucleotide, per strand, as in [56]. Tag counts in the immediate vicinity (-75/+75) of the TSS, normalized to tags per million, were used to measure promoter usage.
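As a concrete illustration, a minimal Python sketch of this promoter-usage measure (the paper's analysis was done in R; all function and variable names here are ours):

```python
# Sketch of the CAGE-based promoter-usage measure: per-nucleotide 5'-end
# tag counts are summed in a -75/+75 window around the TSS and normalized
# to tags per million (TPM). Counts are assumed pre-filtered by strand.

def cage_tpm(tss_pos, strand, five_prime_counts, total_library_tags, flank=75):
    """Sum 5'-end tag counts in [tss-flank, tss+flank] and scale to TPM.

    five_prime_counts: dict mapping genomic position -> tag count on `strand`.
    """
    window = range(tss_pos - flank, tss_pos + flank + 1)
    raw = sum(five_prime_counts.get(p, 0) for p in window)
    return raw * 1e6 / total_library_tags

# Toy example: 30 tags at the TSS, 20 just downstream, library of 1M tags;
# the tag at position 2000 lies outside the window and is not counted.
counts = {1000: 30, 1010: 20, 2000: 5}
print(cage_tpm(1000, "+", counts, 1_000_000))  # -> 50.0
```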
RNA-Seq data and processing
We used K562 Tier 1 polyA+ RNA-Seq data produced by the Snyder laboratory [57] from the 'Yale RNA-Seq' track in the UCSC browser. Tag counts from the first 1,000 exonic nt downstream of each core promoter were summed and normalized to the TPM scale, which we used as an estimate of the amount of RNA produced from that promoter. Genes with a total exon length shorter than 1,000 nt were excluded from further analysis. We tested the effect of varying this definition to the first 500 nt or to all exonic nucleotides (Additional file 1, Table S6); this resulted in PCC values of 0.7-0.75 and 0.6-0.67, respectively, both significantly (P < 0.05) lower than with the definition above, which typically gave PCC scores of 0.75 or higher. While the decrease is not large in absolute terms (~0.04-0.1 difference in mean PCCs), it probably reflects the issues discussed above: the shorter definition may be affected by difficulties in detecting exon edges, while including all known exons increases the risk of not capturing the relevant splice form, or of erroneously including contributions of unannotated downstream alternative promoters.
Data for transcription factors
We downloaded c-Myc and NELFe ChIP-seq data for the K562 cell line from the ENCODE 'Open Chromatin' track [60] in the UCSC browser. The total signal in the +/-1,000 nt region around the TSS, normalized to tags per million, was used as a single feature for the predictor.
We downloaded the ENCODE methylation data for K562 from the HudsonAlpha laboratory [62] via the UCSC genome browser. The data contain the methylation status of all CpG regions in the genome. The methylation status of each bin was set to "methylated" if at least one base pair in the bin was methylated, and to "not methylated" otherwise; this binary status was used as a feature in the predictions.
Dinucleotide content and normalized GC-content
We extracted the promoter sequences for all genes used and divided them into bins. For each bin we counted the number of occurrences of each dinucleotide and divided by (bin length - 1). These 16 numbers per bin were used as input features in the prediction. The normalized GC-content was computed as defined by Saxonov et al.
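The dinucleotide frequency computation can be sketched as follows (a Python illustration, assuming an uppercase ACGT sequence per bin):

```python
# Sketch of the dinucleotide features: for one bin, count the 16 possible
# dinucleotides and divide by (bin length - 1), the number of overlapping
# dinucleotide positions in the bin.

from itertools import product

DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]

def dinuc_frequencies(seq):
    """Return a dict of the 16 dinucleotide frequencies for one bin."""
    n = len(seq) - 1
    counts = {d: 0 for d in DINUCS}
    for i in range(n):
        counts[seq[i:i + 2]] += 1
    return {d: c / n for d, c in counts.items()}

freqs = dinuc_frequencies("ACGTACGT")   # 7 overlapping dinucleotide positions
print(freqs["AC"])                      # -> 2/7 ("AC" occurs at offsets 0 and 4)
```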
Overall framework for capturing genomic signals around TSSs
To retain both the positional distribution and the signal strength as inputs, we divided the +/-975 nt region around the TSS into thirteen 150 nt wide bins: starting from a center bin spanning -75/+75 around the TSS, flanking bins were added successively upstream and downstream. Given a bin and a ChIP dataset, we counted the TPM-normalized tags from that dataset mapping to the region. This yields an initial feature set of size (number of bins) × (number of data sets).
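The bin layout described above can be sketched as follows (a Python illustration with hypothetical names):

```python
# Sketch of the binning scheme: a central -75/+75 bin around the TSS plus
# six 150-nt bins on each side, covering -975..+975 (13 bins in total).

def tss_bins(tss, n_flank=6, width=150):
    """Return (start, end) tuples for the 2*n_flank+1 bins around `tss`."""
    half = width // 2
    bins = []
    for k in range(-n_flank, n_flank + 1):
        start = tss - half + k * width
        bins.append((start, start + width))
    return bins

bins = tss_bins(10_000)
print(len(bins))   # -> 13
print(bins[6])     # -> (9925, 10075), the central +/-75 bin
print(bins[0])     # -> (9025, 9175), i.e. TSS-975 .. TSS-825
```

Counting TPM-normalized tags per bin per ChIP dataset then fills the (bins × datasets) feature matrix.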
These thresholds were selected for both biological and practical reasons. The +/-975 nt region was meant to encompass the core promoter as well as its flanking regions. We did not use +/-1,000 nt because the region has to be divisible into 150 nt bins; this bin size was chosen not only for the practical reason that it gives a reasonable number of bins, but also because it roughly corresponds to the area occupied by a nucleosome. However, as described in the main text, changing the number of bins, and thereby their positions, has no substantial impact on the results. Likewise, as shown in Figure , most of the predictive signal resides in the bins around the TSS, so changes to the overall region investigated will have no substantial impact as long as this region is included. Notably, two similarly scoped papers discussed in the main text [20] used much larger regions (+/-2,000 nt around the TSS) and did not report higher correlations when predicting expression.
General description of prediction models
Given the above input feature framework, we constructed two types of predictive models: one for classification (typically between two types of promoters) and one for regression (typically for predicting promoter usage given chromatin signals).
All classifications were made using the Random Forest method [24], as implemented in the randomForest R package [63]. ROC curves were drawn using ROCR [64]. Feature importance in the classification problems was calculated as the mean decrease in accuracy.
For regression we used Random Forest and four versions of linear regression. The linear models included ordinary linear regression, as implemented in the R function lm, and regularized versions of it: ridge, lasso and elastic net regression. These three methods are designed to prevent overfitting and to perform feature selection when the number of predictive variables is large. We fitted these linear models using the glmnet package [33] in R, with the parameter alpha set to 0, 1 and 0.5 to obtain ridge, lasso and elastic net regression, respectively. The regularized models produce a sequence of fits corresponding to different values of the regularization parameter lambda; in each case we chose the model showing the best correlation with the training data. All other parameters were kept at their defaults. Regression with Random Forest was performed with the randomForest package [63] in R using default settings. We used the mean decrease in mean squared error (MSE) to assess feature importance in the Random Forest model. The importance from the multi-fold cross-validation was calculated as the average of the individual values.
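Of the regularized models, the ridge fit (glmnet alpha = 0) has a closed form, sketched here in Python with NumPy on toy data; lasso and elastic net (alpha = 1 and 0.5) require an iterative solver such as glmnet's coordinate descent. This is an illustration, not the paper's code:

```python
# Closed-form ridge regression on synthetic data: the penalty lambda
# shrinks coefficients toward zero, guarding against overfitting when
# predictors are many or correlated.

import numpy as np

def ridge_fit(X, y, lam):
    """Return ridge coefficients (X'X + lam*I)^-1 X'y (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)  # true coefficient: 2.0

coef = ridge_fit(X, y, lam=1.0)
print(round(float(coef[0]), 1))  # close to 2.0, slightly shrunk by the penalty
```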
Classification of promoter activity
We classified active, inactive and randomly selected non-promoter regions from each other using chromatin signals as inputs, as described below.
Definitions of promoter sets for classifications
The active promoter set (5,131 promoters) was defined as the +/-1,000 nt regions containing both CAGE and RNA-Seq tags. We considered only genes that were long enough (exonic length of 1,000 nt or more) for a reliable RNA-Seq density measurement.
The inactive (or silent) promoter set (2,838 promoters) was defined as promoters with neither CAGE tags within +/-75 nt of the TSS nor RNA-Seq tags in the first 1,000 exonic nt. For the random position set, we selected random genomic regions of the same size.
Training and evaluations for classifications
For training and evaluating the classification, we used a hold-out strategy wrapped in 10-fold cross-validation. To minimize bias from the unbalanced sizes of the binary classes, in each cross-validation run we randomly downsampled the larger class to the size of the smaller one. With two equal-sized classes, we then divided the data into 70% for training and 30% for testing. The AUC and feature importance for one fold were evaluated from the performance of the trained model on the test set; after the 10 repeats, the overall AUC and importance were calculated as the means of the per-fold results.
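The balancing and splitting logic can be sketched as follows (a Python illustration; the classifier itself, Random Forest in the paper, is left abstract):

```python
# Sketch of the class balancing and 70/30 hold-out split: in each
# cross-validation run, the larger class is downsampled to the size of the
# smaller one before splitting each class into training and test portions.

import random

def balanced_split(pos, neg, train_frac=0.7, seed=0):
    """Downsample the larger class, then split each class 70/30."""
    rng = random.Random(seed)
    n = min(len(pos), len(neg))
    pos = rng.sample(pos, n)       # random subset (or shuffle) of each class
    neg = rng.sample(neg, n)
    cut = int(n * train_frac)
    train = pos[:cut] + neg[:cut]
    test = pos[cut:] + neg[cut:]
    return train, test

# Toy IDs mirroring the set sizes in the text: 5,131 active vs 2,838 silent.
active = [f"act{i}" for i in range(5131)]
silent = [f"sil{i}" for i in range(2838)]
train, test = balanced_split(active, silent)
print(len(train), len(test))   # -> 3972 1704 (both perfectly class-balanced)
```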
Expression measurements used as responses in regression
For predicting expression levels we considered only the active promoters used in the classification described above. We applied a log2 transformation to make the data more suitable for the regression task. To avoid taking the logarithm of 0, we added a pseudocount of 0.001 to both input features and output.
Training and validation for regression
We assessed prediction performance using a repeated hold-out scheme. At each step we randomly divided the dataset of 5,131 promoters defined above into 10 equally sized parts, trained the model on nine of them, and tested its predictions on the held-out part. We repeated this 10 times until each subset had been used as a test set. For evaluation, we calculated the Pearson correlation coefficient (PCC) between predicted and actual log2 TPM values. This way we could estimate both the regression accuracy and its stability, which is important when comparing results from different methods.
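A minimal Python sketch of this evaluation loop (the paper used R; here a trivial least-squares predictor stands in for the actual models):

```python
# Sketch of the 10-fold hold-out evaluation: each part serves once as test
# set and the Pearson correlation between predicted and observed values is
# recorded per fold, giving both an accuracy estimate and its spread.

import numpy as np

def tenfold_pcc(X, y, fit_predict, k=10, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    pccs = []
    for test_idx in folds:
        train_idx = np.setdiff1d(idx, test_idx)
        pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        pccs.append(np.corrcoef(pred, y[test_idx])[0, 1])
    return pccs

# Stand-in model: ordinary least squares via lstsq.
def ols(X_tr, y_tr, X_te):
    coef, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    return X_te @ coef

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -0.5, 0.2]) + rng.normal(scale=0.3, size=500)
pccs = tenfold_pcc(X, y, ols)
print(len(pccs), round(float(np.mean(pccs)), 2))   # 10 per-fold PCCs
```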
Assessing interactions between input features
To study the influence of interactions on expression level prediction, we used linear models with interaction terms. In our original dataset we divided the promoter region into 13 bins to assess the positional influence of the modifications; however, including interactions between all these features would make the model too large. Since we achieved almost as good results with 1 bin per modification as with 13 in the regression problem, we used only 1 bin per mark when testing interactions. We then tested the model using the cross-validation scheme described above.
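With one bin per mark, the pairwise interaction terms reduce to products of mark signals appended as extra columns; a Python sketch (illustrative, not the paper's R code):

```python
# Sketch of the interaction design matrix: for each pair of marks (i, j),
# the elementwise product of their signal columns is appended as a new
# feature before fitting the linear model.

import numpy as np
from itertools import combinations

def add_interactions(X):
    """Append column products x_i * x_j (i < j) to the feature matrix."""
    cols = [X[:, i] * X[:, j] for i, j in combinations(range(X.shape[1]), 2)]
    return np.column_stack([X] + cols)

X = np.arange(6.0).reshape(2, 3)   # 2 promoters, 3 marks
Xi = add_interactions(X)
print(Xi.shape)                    # -> (2, 6): 3 marks + 3 pairwise products
print(Xi[0])                       # -> [0. 1. 2. 0. 0. 2.]
```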
Regression of the stalling index
The stalling index S was calculated as described in the main text. For the regression of the stalling index, we randomly selected 30% of the 9,115 promoter set as test data and used the rest for training. We then used the same 13-bin prediction framework and methods as in the previous regression problems. In addition, c-Myc and NELFe ChIP-seq signals were used as optional input features.
Thresholds for histone marks
To be able to state with confidence whether a promoter carries a specific histone mark, we need to assess the random expectation of tags from that mark in a genomic region of the same size. We sampled 33,000 random genomic regions and counted the tags for each mark in each region. We set the threshold to the 95th percentile of this random distribution, meaning that 5% of random regions would be considered to have the mark present. Thresholds for RNAPII presence in the promoter region and gene body were defined in the same way. As mentioned in the main text, we also varied the threshold to the 90th percentile, and additionally tested the Poisson-based method also used by Ernst et al [ ], with a significance threshold of P < 0.05 or P
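The empirical thresholding step can be sketched as follows in Python (the paper's analysis was in R; the tag counter below is a toy stand-in with a flat background rate):

```python
# Sketch of the empirical threshold: tag counts for a mark are collected
# from many random regions and the 95th percentile of this null
# distribution serves as the presence cutoff.

import random

def mark_threshold(count_tags, genome_len, region_len, n=33_000, q=0.95, seed=0):
    """q-th percentile of tag counts over `n` random regions."""
    rng = random.Random(seed)
    counts = sorted(
        count_tags(start, start + region_len)
        for start in (rng.randrange(genome_len - region_len) for _ in range(n))
    )
    return counts[int(q * n) - 1]

def toy_counts(start, end):
    # Toy background: one tag at every position divisible by 100,
    # i.e. a flat rate of ~1 tag per 100 nt.
    return (end - 1) // 100 - (start - 1) // 100

thr = mark_threshold(toy_counts, genome_len=1_000_000, region_len=2_000)
print(thr)   # -> 20: every 2,000 nt region holds exactly 20 background tags
```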
Assessing the significance of performance differences between prediction models
To measure the significance of differences between classification tasks, a list of P values for each pairwise classification was computed in every fold of the cross-validation procedure using the Hanley and McNeil test [46], as implemented in the R package MKmisc [40]. Both the original AUCs and the AUCs recomputed on the test set were based on the same posterior probabilities estimated from the corresponding Random Forest models. To evaluate differences between regression tasks, we applied a two-sided t-test to the PCCs obtained from cross-validation.
Determining the presence of TATA box for each promoter
We predicted TATA-boxes in the -50 to -10 region of each TSS using the TATA-box position weight matrix from the JASPAR database [65] and a score threshold of 70% (as described in [38]). If one or more sites were predicted, the promoter was labeled as TATA-box containing.
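The scan can be sketched as follows in Python, using the common convention that the relative score is the fraction of the way from the matrix's minimum to its maximum score; the 4-row toy matrix below is illustrative, not the JASPAR TATA matrix:

```python
# Sketch of the TATA-box scan: a position weight matrix is slid over the
# promoter sequence and the promoter is labeled TATA-containing if any
# window reaches a relative score of at least 70%.

PWM = [  # toy score columns for a "TATAA"-like motif (A, C, G, T scores)
    {"A": 0, "C": -2, "G": -2, "T": 2},
    {"A": 2, "C": -2, "G": -2, "T": 0},
    {"A": 0, "C": -2, "G": -2, "T": 2},
    {"A": 2, "C": -2, "G": -2, "T": 0},
    {"A": 2, "C": -2, "G": -2, "T": 0},
]

def has_tata(seq, pwm=PWM, rel_threshold=0.70):
    """Return True if any window scores >= 70% of the min-to-max range."""
    lo = sum(min(col.values()) for col in pwm)
    hi = sum(max(col.values()) for col in pwm)
    cutoff = lo + rel_threshold * (hi - lo)
    for i in range(len(seq) - len(pwm) + 1):
        score = sum(col[seq[i + j]] for j, col in enumerate(pwm))
        if score >= cutoff:
            return True
    return False

print(has_tata("GGTATAAGG"))   # -> True  (window "TATAA" scores maximally)
print(has_tata("GGGCCCGGG"))   # -> False
```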
We made all plots in R using the ggplot [66] and ROCR [64] packages.