Prediction of enhancers using random forest and multiple chromatin marks
Random forests have recently become a popular machine learning technique in biology 
due to their ability to run efficiently on large datasets without over-fitting, and their inherently non-parametric structure. Since random forests use a single variable at a time, they can give an automatic measure of feature importance 
. Hence, we developed an algorithm based on this random forest technique for the purpose of enhancer prediction. Conventional random forests utilize a single scalar value associated with each feature at each node of the tree. In order to train a random-forest for enhancer prediction we wanted to use histone modification profiles at p300 binding sites. Because the spatial organization of histone modifications along a linear chromosome can be as informative as their actual levels, they are better represented as vectors of binned reads. Inspired by recent modifications to the random-forest approach such as discriminant random forests 
or oblique random forests 
that utilize a linear classifier at each node, we developed a new vector-based random forest algorithm RFECS or Random Forest for Enhancer Identification using Chromatin States (see Methods).
Genome-wide distal p300 binding sites were found using ChIP-seq in H1 and IMR90 cell-lines. We selected p300 binding sites overlapping DNase-I hypersensitive sites and distal to annotated TSS as active p300 binding sites representative of enhancers. We found 5899 such p300 binding sites in H1 and 25109 such sites in IMR90 (Table S1
,S2), and observed several distinct and diverse chromatin states using an unsupervised clustering technique, ChromaSig (). All clusters showed enrichment of H3K4me1 and depletion of H3K4me3 as previously observed 
. However, different clusters were characterized by varying levels of histone acetylation, H3K4me2 or H3K27me3. Clusters with presence or absence of H3K36me3 may represent genic and intergenic enhancers respectively. In order to ensure we represented all these different chromatin states at active p300 binding sites, we selected a relatively large number of these sites (>5000) for training as compared to previous methods.
Histone modification patterns at distal p300 binding sites in H1 and IMR90.
To train the forest, active and distal p300-binding sites (BS) were selected as representative of the enhancer class. As non-enhancer classes, we considered annotated transcription start sites (TSS) that overlap DNase-I, and random 100 bp bins that are distal to known p300 or TSS (see Methods). The confidence of each enhancer prediction is given by the percentage of trees that predict this site to be an enhancer. In general, a genomic region is predicted as an enhancer if it has a background cutoff greater than 0.5 (>50% trees vote in it's favor). At higher cutoffs, confidence of prediction is higher, but fewer enhancers are predicted.
We used Receiver Operating Characteristic (ROC) curves to determine optimal parameters for our classification algorithm 
. In the case of enhancer predictions, we can only obtain an approximate measure of specificity since we can never be certain that the randomly selected elements of the non-p300 class are all true negatives. Hence, in addition to the ROC curves generated using 5-fold cross-validation, we also verified parameter selection by comparing the percentage of predicted enhancers at each cutoff that overlap markers of active enhancers (validation rate) or TSS (misclassification rate). The markers of active enhancers include distal DNase-I hypersensitivity sites (HS), p300 binding sites (excluding those used in training), occupancy by CBP or sequence-specific transcription factors known to act at embryonic stem cell enhancers such as NANOG, OCT4 and SOX2.
In the case of Random forests, the main parameter to be determined is the number of trees. Since the non-enhancer class is assumed to be several times enriched compared to the enhancer class in the genome, we select a greater number of non-p300 training sites as compared to p300 sites and this proportion is also adjusted using the above-described methods. Previous algorithms 
as well as empirical observations showed a width of −1 kb to +1 kb around the p300 binding site as optimal but we further verified this selection by cross-validation in the H1 cell-type (fig. S1A
). The difference in cross-validation curves using a width of 0.5 kb or 1 kb is not obvious on the cross-validation curve while a width of 1.5 kb clearly shows a sharp drop in the area under the ROC curve (fig. S1A
). When we further made enhancer predictions using all three widths (fig. S1B
,C), it can be seen that a width of 1 kb on either side shows best validation and misclassification rates as compared to 0.5 or 1.5 kb widths.
Enhancer predictions in H1 and IMR90 cells
To determine the optimal number of trees for the random-forest, we examined the area under the ROC curve in H1 and IMR90 and found both to be stable beyond 45 trees (). In order to verify this further, we made enhancer predictions using various number of trees such as 45, 65 and 85 and compared the validation and misclassification rates (fig. S2A
–D). While H1 appeared to show no change at all (fig. S2A
,,C) IMR90 showed a slight improvement from 45 to 65 trees (fig. S2B
,D). In the end, we selected 65 trees for training the random forest as it appeared to be optimal for both cases. The training-set ratio of p300 to non-p300 was set at 1
7 since the ROC curve did not appear to change much beyond this ratio. (fig. S2E
Performance of RFECS for enhancer predictions in H1 and IMR90 cells.
In order to estimate the accuracy of the enhancer prediction by RFECS, we applied this algorithm to chromatin profiles of 24 marks obtained in H1 and IMR90. We then calculated the validation rate as the percentage of predicted enhancers overlapping with DNase-I hypersensitivity sites and binding sites of p300 and a few sequence specific transcription factors known to function in each cell type (true positive markers). We also computed the misclassification rate as the percentage of predicted enhancers overlapping with known promoters. These overlaps were computed using a window of −2.5 to +2.5 kb. Incase, both a true positive marker as well as promoter lay within this window, the criteria used to decide if the enhancer was “validated” or “misclassified” is discussed in detail in the Methods section. In H1 cells, we obtained a total of 55382 predicted enhancers at the lowest voting cutoff of 0.5. Over 80% of these predicted enhancers overlap with distal DNase-I hypersensitive sites and the binding sites of p300, NANOG, OCT4 and SOX2. Upon randomly generating enhancer predictions in the H1 genome 100 times, we found the average validation rate to be 18.43% and the actual validation rate of 80% to be highly significant with a one-sided t-test p-value of 10∧-256. Additionally, we found that 5% of them overlap with UCSC TSS, indicating a low misclassification rate of 5% (, in red). A similar high level of validation rate and low misclassification rate were observed when RFECS was applied to IMR90 cells, where 83581 enhancers were predicted with a validation rate of 85%(average random validation rate
2×10∧-279), and misclassification rate of 4% (). Thus, RFECS appears to accurately predict putative enhancer sequences based on chromatin modification state of the genome.
We next tried to assess the linear resolution of RFECS predictions. We calculated the distance between the predicted enhancers and locations of enhancer markers such as DNase-I hypersensitive sites, or p300 binding sites in each cell type, and found that the majority of predicted enhancers are within 200 bp of these sites (fig. S3A
,B). In H1, nearly 62% of enhancers lie within 200 bp of an enhancer marker site (fig. S3A
), while in IMR90 this value is around 70% (fig. S3B
). Thus, the majority of enhancer predictions also show a high distance resolution in terms of proximity to the validation marker.
We also confirmed that our enhancer predictions showed an activation of gene expression in the proximal TSS. In order to do this, we compared RNA-seq datasets (Wei Xie et al., manuscript under revision) in H1 and IMR90 using edgeR 
to identify H1-specific and IMR90-specific TSS. Then we identified enhancer predictions specific to either H1 or IMR90 using a filter distance of 2.5 kb. When we look at the average distribution of H1-specific enhancers they are clearly enriched in the vicinity of H1-specific TSS as compared to either non-specific TSS or IMR90-specific TSS (fig. S3C
) and this enrichment is found to significant at distances up to at least 500 kb using a Wilcoxon test (p-value<10∧-6). Similarly, in the case of IMR90-specific enhancers, we observe them to be more enriched in the proximity of IMR90-specific TSS as compared to H1-specific TSS (fig. S3D
As further evidence that RFECS accurately predicts enhancers, chromatin modifications at the predicted enhancers showed presence of all chromatin states observed in the training sets comprised of a subset of distal p300 binding sites (). In H1, clusters 1,2 and 8 of enhancer predictions (fig. S4
) are similar to clusters 1–3 of the p300 binding sites (), clusters 3–4 appear to correspond to cluster 5 of p300 BS, while clusters 5–6 look like cluster 4 of p300 BS. In IMR90, similar trends could be observed when comparing chromatin states at enhancer predictions (fig. S5
) to those of p300 binding sites (). Further, it can be observed that clusters 3–6 of the enhancer predictions in H1 (fig. S4
) that have weaker acetylation and/or enrichment of H3K27me3 also tend to have lower voting percentage of trees.
In summary, we showed that RFECS accurately predicted enhancers in the two cell lines H1 and IMR90 using a set of 24 chromatin modifications. These enhancers showed high validation rates, low misclassification rates and sharp linear resolution.
Random forest trained on one cell-type can accurately predict enhancers in other cell-types
To make enhancer predictions, our approach requires a construction of a random forest trained on promoter-distal p300 binding sites. It is time-consuming and expensive to create a new training set for enhancer prediction in each new cell type, so it is desirable to use a random forest developed in one cell type to predict enhancers in another. To evaluate the feasibility of such approach, we first trained a random-forest using chromatin modification profiles obtained in H1, and then applied it to the IMR90 cells. Compared to RFECS predictions using IMR90 chromatin profiles as training set, RFECS predictions using H1 training dataset reduces the validation rate by ~5–8% and increases the misclassification rate by ~2% ( black vs red). Similarly, we also developed a random forest using the IMR90 data as the training set and then applied it to H1. This led to an average reduction of 2–3% in validation rate (, black vs red). Therefore, RFECS trained using one cell type may be applied to a different cell type, albeit with slightly lower accuracy.
We sought to examine if this moderate decrease in performance was largely due to cell-type specific differences or was within the limits of technical or biological variability between replicates. To this end, we trained a random forest on one replicate of a cell-type, and made predictions on the other replicate of the same cell type. RFECS trained on IMR90 and then applied to the replicate 1 of the H1 profiles (blue dot vs asterisk) actually showed a higher validation rate and lower misclassification rate than RFECS trained using replicate 2 of H1 (), while similar performance was observed with enhancer predictions on replicate 2 of H1 independent of whether the random-forest was trained on H1 replicate 1 or IMR90 (green dot vs asterisk). Similar trends were observed when comparing predictions made on individual replicates of IMR90 using either H1-training or training on the other replicate (). In conclusion, predicting enhancers using the random forest built from a different cell type exhibits a modest decrease in performance compared to a same-cell training set. However, this decrease in performance is comparable to the decrease that can arise due to variability between two replicates of the same cell-type.
Optimal set of chromatin marks required for enhancer prediction
With the increasing number of histone modifications being discovered and mapped, determination of the relative importance of each mark in defining genomic elements is important. An out-of-bag measure of variable importance is a natural by-product of random forest classification scheme 
wherein the relative importance of each feature is assessed as the increase in classification error upon permutation of feature values across classes. In both H1 and IMR90, the variable importance was assessed for random forests trained on 5 cross-sections of data for each of the 2 sets of replicates individually as well as the set of averaged replicates. Upon ranking histone modifications by variable importance, it is apparent that H3K4me1 and H3K4me3 are the top 2 most robust modifications across replicates and cross-sectional samples in both cell types, followed by H3K4me2 (). This indicates that these 3 modifications maybe the most informative in the prediction of enhancers in any unknown cell type as well.
Out-of-bag variable importance of histone modifications in enhancer prediction.
Beyond the top 3 modifications, there is variability among the cell types. In IMR90, the other modifications appear to contribute almost equally, while in H1 there is a much clearer difference in variable importance. These differences are supported by correlation analyses in H1 and IMR90 (). In H1, several modifications are highly correlated, which could explain the larger differences in variable importance, as only a few variables maybe needed to form a non-redundant set. In IMR90, the correlations are lower and hence each of the modifications may contribute non-redundant information and thus contribute equally to the variable importance. Modifications that cluster together in both H1 and IMR90 (shown in the same non-black colors, ) suggest cell-type independent redundancy.
Having established the relative importance of each histone modification in predicting enhancers, we next examined the accuracy of predictions using different sets of modifications. Validation rates obtained by using the minimal set of H3K4me1-3 is within 2% of that for all 24 modifications in H1 (). Furthermore, this minimal set performs considerably better than the more conventionally selected set of H3K4me1 and H3K4me3 
and at times, H3K27ac 
(, in black and blue). The set of H3K4me1-2-3 is more comparable to H3K4me1-H3K4me3-H3K27ac in IMR90 but does have a slightly lower misclassification rate (). In both cases the use of the minimal set of 3 modifications shows a much closer resemblance in performance to all 24 modifications than to the set of 2 marks H3K4me1 and H3K4me3 ().
Validation rate and Misclassification rate of enhancers predicted using RFECS in H1 and IMR90.
It can also be observed that in conjunction with H3K4me1 and H3K4me3, using H3K4me2 picks up a larger proportion of enhancers with weaker acetylation enrichment as compared to H3K27ac (fig. S4
,S5), supporting our prediction of the minimal set.
We also made enhancer predictions using all possible combinations of 3 modifications in chromosome 1 for replicate 1 and replicate 2 of H1. The average validation rate for a fixed range of enhancers was compared across replicates and it can be seen the set corresponding to H3K4me1, H3K4me2 and H3K4me3 (marked in *), is the highest performing combination common to both replicates (). We also found the performance of the combination of H3K27ac with H3K4me1 and H3K4me3 appears to be comparable in this case (3, ), validating the use of H3K27ac as a feature for enhancer prediction when H3K4me2 is not available. Some of the worst performing combinations include H3K9me3 and H4K20me1 (4 and 5, ), which also show up as variables with least importance in .
In many currently existing datasets, H3K27ac is the more commonly sequenced histone modification as compared to H3K4me2 due to it's perception as a marker of active enhancers. While using H3K4me2 may improve enhancer prediction in some cell-types, use of H3K27ac in addition to H3K4me1 and H3K4me3 marks does show considerable improvement over using just the top 2 marks H3K4me1 and H3K4me3 (). Hence, for many of the currently existing datasets, we could use H3K4me1, H3K4me3 and H3K27ac as the features in our random-forest with satisfactory performance.
Overall, these comparisons indicate the suitability of selecting H3K4me1, H3K4me2 and H3K4me3 as three minimal chromatin marks for purposes of enhancer prediction. Additional chromatin modifications required for improving upon enhancer predictions may depend on cell-type specific characteristics, as indicated by the differences in variable importance between H1 and IMR90 ().
Comparison of RFECS with other enhancer prediction methods
We next asked if our enhancer prediction algorithm performed better than several other current techniques for enhancer prediction – CSIANN, ChromaGenSVM and Chromia 
. In previous studies, CSIANN and ChromaGenSVM were applied on the histone modification dataset in CD4 T-cells 
. In order to make a comparison of performance of our method with previous approaches, we applied RFECS to the CD4+ T cell dataset as well and determined parameters using cross-validation (fig. S6
). Using H3K4me1, H3K4me3, and H3K27ac, CSIANN made 21832 predictions 
and ChromaGenSVM method made 23574 predictions 
. We made enhancer predictions using H3K4me1, H3K4me3 and H3K27ac with RFECS as well as Chromia 
. Cutoffs were selected that yielded a similar number of enhancer predictions for both Chromia (21895) and RFECS (22947) (), so as to make a fair comparison across methods.
Comparison of enhancer predictions using RFECS, ChromaGenSVM, CSIANN and Chromia.
To compare these different sets of enhancer predictions, we computed validation rates by comparing them to TSS-distal DNase-I hypersensitive sites, p300 binding sites, and CBP binding sites and misclassification rates by comparing to known UCSC TSS using a window of −2.5 kb to +2.5 kb as described in the methods. (). The validation rate of RFECS predictions is around 70%, which is considerably higher than the other three methods (57% ChromaGenSVM, 51% CSIANN, 60% Chromia). Further, the misclassification rates of RFECS is less than 7%, much lower than the 27%, 35% and 15% rates of ChromaGenSVM, CSIANN and Chromia, respectively. These results suggested that overall procedure for RFECS, including selection of training set as well as training and prediction using the vector-random-forest, performs better than currently available techniques for enhancer prediction.
In the above comparison, we selected our enhancer-representative training set as p300 peaks called using MACS 
that were distal to known UCSC TSS and overlapped DNase-I locations while CSIANN and ChromaGenSVM used a training-set of p300 peaks called using SICER previously 
. We also wanted to compare the performance of the different algorithms on our own datasets using the same training-set to evaluate the performance of the random-forest based part of the algorithm. To achieve this, we ran the various enhancer prediction methods on H3K4me1, H3K4me2 and H3K4me3 datasets of H1, with help from the author of ChromaGenSVM 
(). We tried to make the pre-processing stages of the various algorithms as consistent as possible by merging several replicates of each histone modification files and input files into single bed files and randomly selecting a smaller subset of p300 peaks for training, since these were the requirements of the other algorithms such as CSIANN and ChromaGenSVM. Incase of CSIANN, the selection of background was hard-coded in the software but in all other cases we used our own background training set as well. In , it can be observed that RFECS shows a maximum validation rate of around 82.8% as compared to 66.8%, 57.7% and 63.3% for ChromaGenSVM, CSIANN and chromia respectively. Further, RFECS showed the lowest misclassification rate of 4.9% as compared to 8.3%, 36,7% and 10.1% rates for the above-mentioned cases. Hence, the improvement in performance due to RFECS cannot be solely attributed to method of selecting the training-set. In summary, RFECS shows considerably improved performance over existing enhancer-prediction algorithms in two very different datasets and hence can be considered an advance in the field.
Prediction of enhancers in multiple human cell-types
Comparing enhancer predictions across diverse cell-types can contribute to understanding differences in regulatory mechanisms between cell-types. The ENCODE dataset is an example of a collection of high-throughput datasets such as histone modifications and transcription factor binding data that are available for multiple cell-types 
. Having a set of high-confidence enhancer predictions in these cell-types would be a valuable resource.
We trained our random forest on the p300 ENCODE data in H1 and made enhancer predictions in 12 ENCODE cell-types using the three marks H3K4me1, H3K4me3 and H3K27ac since these were available for all the cell-types. Validation rates were assessed based on overlap with existing DNAse-I hypersensitivity data while misclassification rates were calculated based on overlap with UCSC TSS. It can be seen that the majority of cell-types show high validation rates between 80 and 95%, while the misclassification rates lie within acceptable levels of 2–7% ().
Enhancer predictions in ENCODE cell-lines using RFECS.
In order to compare enhancers across cell-types, it is preferable to have enhancer predictions with the same level of confidence. To determine the appropriate cutoff for multiple number of cell-types, we calculate a False Discovery rate by randomly permuting 100 bp bins across the genome and computing the ratio of enhancers predicted in permuted data/enhancers predicted in real data for various cutoffs of voting percentages. In , it can be seen that different cell-types show a different relationship with FDR. For example, at an FDR of 5%, the voting percentage for GM12878 (solid dark blue) is 0.74, for Nhek (dashed cyan) 0.64 and for Hsmm (solid yellow) it is 0.56.
Using an FDR of 5%, we obtained a consistent set of high-confidence enhancer predictions in the 12 ENCODE cell-types. In , the numbers of enhancer predictions in each cell type is shown above the bar. The validation rates (in red) are above 90% for all cell-types except H1, Hepg2 and GM12878. In H1 and Hepg2, the numbers of DNase-I hypersensitivity sites are relatively less, i.e. ~150 to 177K as compared to ~230 to 380K in the other cell-lines. This may explain the somewhat lower validation rate in these two cell-types. GM12878 appears to be an outlier and we suspect that enhancer predictions may potentially be improved in this cell line by using a different training set.
In summary, we obtained a high-confidence set of enhancer predictions in multiple ENCODE cell-lines with the same level of confidence. This will enable more rigorous comparisons of regulatory characteristics of these cell-types in the future.