|Home | About | Journals | Submit | Contact Us | Français|
As key epigenetic regulators, polycomb group (PcG) proteins are responsible for the control of cell proliferation and differentiation as well as stem cell pluripotency and self-renewal. Aberrant epigenetic modification by PcG is strongly correlated with the severity and invasiveness of many types of cancers. Unfortunately, the molecular mechanism of PcG-mediated epigenetic regulation remained elusive, partly due to the extremely limited pool of experimentally confirmed PcG target genes. In order to facilitate experimental identification of PcG target genes, here we propose a novel computational method, EpiPredictor, that achieved significantly higher matching ratios with several recent chromatin immunoprecipitation studies than jPREdictor, an existing computational method. We further validated a subset of genes that were uniquely predicted by EpiPredictor by cross-referencing existing literature and by experimental means. Our data suggest that multiple transcription factor networking at the cis-regulatory elements is critical for PcG recruitment, while high GC content and high conservation level are also important features of PcG target genes. EpiPredictor should substantially expedite experimental discovery of PcG target genes by providing an effective initial screening tool. From a computational standpoint, our strategy of modelling transcription factor interaction with a non-linear kernel is original, effective and transferable to many other applications.
Originally discovered in Drosophila as the regulators of homeotic (HOX) genes, polycomb group (PcG) proteins are well-conserved epigenetic modifiers that repress the expression of thousands of target genes in a given genome (1–12). These target genes are essential for many fundamental, evolutionarily conserved processes including development, cell-fate determination, proliferation, stem-cell pluripotency and self-renewal (1,4,7,8,13–16). Mutations of PcG proteins are implicated in defects in stem-cell fates and their abnormal levels exhibit a striking correlation with the severity and invasiveness of a number of cancer types including prostate cancer and breast cancer (1,4,7,8,13–16).
PcG proteins impose gene silencing through their interactions with polycomb response elements (PREs) that are present on the promoter regions of polycomb target genes (31). This interaction is mediated by three types of multiprotein complexes, polycomb repressive complex 1 and 2 (PRC1 and PRC2) and a recently discovered PhoRC that contains the DNA-binding protein Pleiohomeotic (Pho) or Pleiohomeotic-like (PhoL) (17) in Drosophila and Ying and Yang 1 and 2 (YY1 and YY2) in mammals (18–20). The known members of Drosophila PRC1 include Polycomb (PC), Polyhomeotic (PH), Posterior sex combs (PSC) and dRing, whereas Drosophila PRC2 contains at least three core components: Enhancer of zeste (E(z)), Extra sex comb (Esc) and Suppressor of zeste 12 (Su(z)12) (18). Since none of these PRC1 and PRC2 proteins can bind to DNA directly, a hierarchical recruitment model has been proposed stating that DNA-binding transcription factors including Pho and PhoL first bind to PREs on the target genes and recruit the PRC2 complex to trimethylate the lysine 27 residue of histone H3 (H3K27me3) that is later bound by the PRC1 complex for maintenance (21). Besides Pho and PhoL, the best studied Drosophila transcription factors contributing to PRC2 recruitment include GAGA factor (GAF)/Pipsqueak (PSQ) (22,24,26), Zeste (25), Dorsal switch protein (DSP) (23,28), Grainyhead (Grh) (28) and Sp1/KLF (29), (reviewed in 30). In addition, several Drosophila PREs have been identified through both computational and experimental analyses (31–40). More recently, the first two mammalian genomic regions have been discovered to confer PcG responsiveness, one in the human HOXD cluster (41) and the other in the regulatory region of the mouse MafB gene (9).
Recent advances in high-throughput techniques such as chromatin immunoprecipitation in conjunction with microarray (ChIP-on-chip), DNA adenine methyltransferase identification (DamID) and ChIP-sequencing (ChIP-seq), have greatly enriched our knowledge on the scale of genes regulated by PcG (1,4–8,10,11,13–16,42–49). However, the rather low overlaps of target genes identified in separate ChIP studies, at ~30% for three ChIP studies on Drosophila melanogaster (4,14,15), stress the need for additional experimental and computational verifications of individual PcG target genes. Ideally, a powerful computational method that is able to predict/screen, with a reasonable accuracy, PcG target genes in a given genome would drastically expedite experimental verification of these genes.
In the literature, there are considerable efforts in developing computational methods to predict PRE sequences and to locate the genes regulated by PcG based upon their adjacency to PREs. For instance, Ringrose et al. investigated the combinatorial pattern of transcription factors known to be involved in PcG recruitment and assigned to each genomic region of interest a score equalling the weighted sum of the occurrence of every possible transcription factor pairs (40). Fiedler and Rehmsmeier extended this idea and developed jPREdictor for PRE prediction (50). Hauenschild and colleagues used the latest version of jPREdictor to perform a genome-wide prediction on D. melanogaster and predicted 201 PREs together with 243 associated genes (51). They also incorporated the aspect of comparative genomics and expanded their prediction to 285 PREs with 322 associated genes. More recently, Liu et al. integrated data from a ChIP study and transcription factor binding analysis to predict a set of PcG target genes in mouse embryonic stem cells (52). Despite these efforts, however, due to the plasticity of PRE sequences, developing a reliable computational PRE predictor remains a difficult task. For example, the overlaps between the top target genes predicted by jPREdictor and those shown in the three recent ChIP studies in D. melanogaster (4,14,15) are strikingly low (at ~8–20%).
We have addressed this challenge by developing a novel computational approach, EpiPredictor, to predict PcG target genes via the identification of PRE sites. With the incorporation of novel features including the use of a support vector machine (SVM)-based classifier, global sequence information, conservation analysis and comparative genomics, our approach was able to predict PcG target genes in the D. melanogaster genome with substantially improved accuracy. Most of the predicted PcG target genes are transcription factors involved in key biological processes such as development, neurogenesis and cell fate determination. Our results suggest that multiple transcription factor networking at the cis-regulatory elements is critical for PcG recruitment, and high GC content and high conservation level are also important features of PcG target genes.
In Drosophila, several transcription factors responsible for PcG recruitment have been identified, which, together with the consensus sequences of their DNA binding sites, are collectively referred to as motifs hereafter. We used seven motifs corresponding to four transcription factors, GAF (G, G10), Pho (PS, PM, PF), engrailed-1 (EN1) and Zeste (Z), all of which are known to be instrumental for PcG recruitment (Supplementary Table S1). The same motif set was also used in jPREdictor (40,51). Though a few other transcription factors, e.g. DSP, Grh, Sp1/KLF, are also implicated in PcG recruitment in some studies, we did not include them in our current system because doing so did not lead to any performance improvement (data not shown) and also may not allow a fair comparison with jPREdictor.
In order to validate our prediction of PcG target genes in an objective way, we used the gene lists reported in three recent ChIP studies in D. melanogaster, where Schwartz et al. (4) used ChIP-on-chip technique on S2 cultured cell line with antibodies to PC, E(z), PSC and H3K27me3; Tolhuis et al. (15) used DamID approach on Kc cells to identify binding sites of PC, Esc, Sex combs extra (Sce) and H3K27me3; while Schuettengruber et al. (14) applied ChIP-on-chip on Drosophila embryos and employed antibodies to PC, PH and H3K27me3. Different choices of cell lines and antibodies all had an impact on the results of these experiments that differed from one another at varying degrees. Since our in silico PRE prediction is independent of any experimental conditions, we expected that a comparison of our results with these three well-annotated studies, which as a whole investigated a range of antibodies and cell types, would provide a comprehensive evaluation of our system. To ensure that the validation gene lists to be used were as reliable and up-to-date as possible, we performed a post-processing procedure on the published data using the following stringent selection criteria. For all three validation sets, we used the gene lists published by the authors as input and removed duplicates if there was any. We also eliminated the genes that were withdrawn in the newer release of the gene annotation to ensure that the validation gene sets are up-to-date. In particular, in processing Schwartz’s data (4), we only selected the target genes with strong PcG binding signals to all of the four PcG proteins [PC, PSC, PH and Su(z)12] simultaneously as defined by the authors. As a result, we obtained three lists consisting of 176 (Schwartz), 225 (Tolhuis) and 215 (Schuettengruber) predicted PcG target genes, respectively (Supplementary Table S2). Among them, 38 genes appeared in all of the three validation sets, denoted as Intersection, making the degree of overlap in the range of ~17–22%.
Our PRE classifier is a supervised learner. Therefore we needed to provide it with a training set of good quality. This consisted of two steps: (i) construction of a PRE/non-PRE sequence collection and (ii) construction of the training set containing examples of both PRE sites (positive) and non-PRE sites (negative).
First, we constructed a sequence collection containing 12 known PRE sequences and 23 control (non-PRE) sequences. Among them, the 12 PRE sequences and 16 control sequences were the same as those used by Ringrose and colleagues (40,51). The 12 PRE sequences had solid evidence to support the existence of PRE site(s) within, whereas the 16 control sequences included promoters of genes regulated by GAF and Zeste but not by PcG proteins (40). To reflect the most recent progress in the field, we followed the same methodology used by Ringrose (40) and collected seven extra control sequences (Supplementary Table S3) for our training set that were bound by GAF, Pho and Zeste but did not have any enrichment for PC, PH or H3K27me3 in a genome-wide ChIP study (14). They were obtained by examining whether a given locus bound by GAF, Pho and Zeste was in the proximal promoter region of any gene, i.e. −1000 to +1000 base pairs (bps) with respect to the gene’s transcription start site (TSS). If so, we retained the locus and the gene, otherwise, we discarded them. To ensure that our control sequences did interact with GAF, we consulted another list of GAF target genes by an independent study (53). If the genes associated with any retained loci under investigation were not included in the second study, the loci were eliminated from our list. It was evident that, despite the addition of seven new control sequences in our study, the size of the sequence collection remained rather small.
A PRE sequence containing PRE site(s) is much larger than an actual PRE site. Due to the limited resolution of the experimental verification process, most known PRE sequences included in our sequence collection spanned thousands of bps long whereas the core-PRE sites are usually much shorter (<200bps) (37). In other words, in addition to core PRE sites, a known PRE sequence might also contain non-PRE sites. Thus it was prudent to identify the loci that were most likely the bona fide PRE sites. For this purpose, we scanned each PRE/non-PRE sequence in our collection with a sliding window of 200bps that incrementally moved downstream with a constant step of 20bps. For each PRE sequence, we chose the window(s) with the highest sum of motif occurrence (calculated by the Motif Analyzer in the following section) as PRE sites. For every control (non-PRE) sequence, all the windows from scanning a control sequence were kept to ensure that the classifier was to be trained under very stringent condition.
Our system consisted of six primary components including Motif Analyzer, PRE Classifier, GC Analyzer, PRE-to-gene Mapper, Conservation Level Analyzer and Comparative Genomics Analyzer (Figure 1A). With the exception of PRE-to-gene Mapper, which was a utility module, each component rendered a unique perspective of investigating the genomic sequence or gene of interest. The first three units were centred around the prediction of PRE sites (Figure 1A and B), whereas the last three were focused on analyses at the gene level (Figure 1A).
We employed a sliding window of 200 bps and a step size of 20 bps to scan the whole genome where the DNA sequence overlapping with the window was captured and analyzed at any given time by the Motif Analyzer. Using a set of n motifs of transcription factors that were known to be involved in PcG recruitment, denoted by M1, M2,…,Mn, the Motif Analyzer constructed a profile for each window sequence/locus (denoted by Si) and represented it by a feature vector Fi=(fi1, fi2,…,fin), where fij denoted the occurrence frequency of motif Mj in sequence Si. This feature vector was then analyzed by the pre-trained PRE classifier (below) that predicted whether the test window/locus was a PRE or not (Figure 1B).
Ringrose et al. (40) examined the occurrence of paired motifs at the putative PRE sites and observed that the weighted sum of the occurrence frequencies of all possible motif pairs were far more effective than a linear sum up of the occurrence frequency of single motifs. This suggested that the pattern of transcription factor interactions at the PRE sites be combinatorial. In order to abstractly model the multifaceted interactions among transcription factors at PRE sites, we incorporated an SVM-based PRE classifier, which is a powerful supervised learning method for handling classification tasks. SVM has achieved prominent success in a spectrum of biological applications including gene selection (54,55), protein classification (56–58), cancer tissue characterization (59,60), outperforming many other classic machine learning techniques such as neural network, decision tree, k-nearest neighbour (61,62).
There are four basic kernel functions in SVM, including linear, polynomial, radial basis function (RBF) and sigmoid. Given the context of PRE prediction, we provided a further annotation to SVM coupled with some of these kernels. For instance, in the case of a polynomial kernel, the parameter d corresponded to the degree of motif combinations, e.g. when d=1 (equivalent to a linear kernel), only single motif occurrence was noted; when d=2 (quadratic kernel), the occurrence of motif pairs was considered; whereas when d=3 (cubic kernel), the occurrence of motif triplets was analyzed. In the case of the RBF kernel, the data was mapped to an infinite dimensional Hilbert space where intuitively speaking, all the motifs were mapped to a circle/hypersphere. Taken together, we expected the polynomial (d>1) and RBF kernels to be best for modelling transcription factor interaction at the PRE sites. While the windows/loci classified to be non-PREs were discarded, those classified to be PREs had to undergo further scrutiny by GC Analyzer (below).
Previous studies indicated that native DNA sequence features, such as GC content, are associated with epigenetic modification activities such as DNA methylation and PcG binding (63–65). In particular, the work of Ku et al. (5) suggested that CpG islands influence recruitment of PcG. Furthermore, GC-rich sequence elements have been shown to recruit PRC2 in mammalian embryonic stem cells (66) and high-CpG-density promoters are associated with highly regulated key developmental genes and are enriched with the H3K27me3 marks (67). Therefore, we implemented a GC Analyzer to further scrutinize the output from the PRE Classifier. For a sequence window/locus Si that was positively predicted by the PRE classifier, our GC Analyzer compared its GC ratio Ri with a threshold value RT and discarded Si if Ri <RT (Figure 1B). To decide on an appropriate threshold for a region of 200-bps window size that we used, we examined six experimentally verified PRE sequences where short core PRE segments were identified (37). The lowest GC ratio of these core PRE segments was 44%. We then chose this lower bound of the GC ratio as our threshold in order to ensure that all the verified PRE segments satisfy this GC ratio cut-off so that they can pass the GC Analyzer’s scrutiny. We also compared the cut-off values of 44, 42 or 40% on EpiPredictor, and found 44% yielded the best performance. Therefore, we used the 44% threshold in our subsequent analysis. Only the ones that passed the GC content test were considered as the potential PRE loci. Each locus was given a numerical score SCi by the Motif Analyzer that equalled the sum of motif occurrence in the sequence Si, i.e. SCi= (Figure 1B).
To characterize the probability of a predicted PRE site being the real PRE site, we performed a non-parametric analysis on 100 randomly generated genomes whose size and nucleotide distribution (A: 29%, C: 21%, T: 29%, G: 21%) are the same as the D. melanogaster genome. We used our software to predict PRE sites on these random genomes and for a given score s we counted Os that denoted the occurrence of a score that is higher or equal to s. We then calculated Es, i.e. the E-value of score s by Os/100 and the corresponding P-value would be Es/100.
From our genome-wide PRE site prediction results, the PRE-to-gene Mapper first mapped all of the predicted PRE sites to their genomic coordinates on the genome. When several windows/loci adjacent to each other were all predicted to be PRE sites, they were all combined into a longer PRE. The Mapper then analyzed every locus that had a positive PRE score S, located its most adjacent gene G and credited G a score equalling S. If the locus was positioned closely in between two genes and if the second closest gene G2 was within 4000 bps away, the Mapper granted G2 a score equalling S as well.
Due to their roles in regulating key developmental processes, PcG target genes were expected to be evolutionarily conserved. The Conservation Level Analyzer considered six Drosophila genomes that are close to D. melanogaster according to the phylogenetic tree (68), including D. simulans, D. sechellia, D. yakuba, D. erecta, D. pseudoobscura and D. ananassae. For each annotated D. melanogaster gene, it queried the Flybase database (www.flybase.org) to locate its orthologues in any of the six related genomes. If a gene failed to have any orthologue, it was eliminated from the eligible gene list. That is, the Analyzer excluded the genes that did not have any orthologue in any related species from the pool of candidate genes to be considered as PcG targets. Eventually all the remaining genes were ranked according to the genes’ associated PRE scores. The version of our EpiPredictor up to this point was termed as EpiPredictor-Basic.
We investigated the value of incorporating comparative genomics (69–71) into PcG target gene prediction. For this, we constructed a variant version of EpiPredictor, hereafter referred to as EpiPredictor-CG, which integrated analyses on three well-annotated Drosophila organisms (D. simulans, D. yakuba and D. pseudoobscura) that are close to D. melanogaster in the phylogenetic tree (68).
Our tactic in implementing EpiPredictor-CG was to construct an ensemble system that employed the top-ranked genes provided by our original EpiPredictor-Basic as the base set and incorporated the information obtained from our comparative genomics study for rank adjustment when necessary. To be more specific, if our ultimate goal was to retrieve N genes that were most likely the PcG targets, we started our process with a gene list containing the top M genes ranked by EpiPredictor-Basic (M=1.5N) and reordered the genes based upon the scores of the candidate genes’ orthologues in different Drosophila species.
In order to achieve this, we applied EpiPredictor-Basic onto each of these three Drosophila genomes. For each genome, all annotated genes were evaluated and ranked according to their predicted PRE scores. If a gene is orthologous to a D. melanogaster gene, the rank of that gene was linked to its D. melanogaster gene orthologue. Therefore, for any D. melanogaster gene included in the top list, up to four ranks could be obtained, each representing the rank of the gene (or its orthologue) in the respective species, i.e. D. melanogaster, D. simulans, D. yakuba and D. pseudoobscura. A final rank was calculated by averaging all the ranks. The gene list was then re-sorted accordingly.
BART (Bayesian Additive Regression Trees) is a nonparametric regression method that can also be used as a binary classifier. As a comparison to the SVM-based PRE Classifier in our system, we used BART as an alternative classifier to evaluate whether a given locus is actually the PRE site. This was achieved by using the R package (BayesTree) by Hugh Chipman and Robert McCulloch.
The primary computational complexity of our EpiPredictor model came from the component of the SVM-based PRE classifier. During the training phase, the complexity of the SVM was O(Ns3+(Ns2)l+NsdLl) where Ns denoted the number of support vectors, l denoted the number of training points and dL denoted the dimension of input data. During the testing phase, the complexity of the SVM was O(MNs) where M was O(dL). In our experiments, Ns=21.
On a regular Dell desktop (Intel Duo CPU 3.00GHz, 1G memory), our system spent 63ms in training. During the prediction phase, it took about 30min to process the entire D. melanogaster genome of 137 million bps and used around 5 MB memory. Due to the integration of SVM, it was necessary to store a substantial amount of feature vectors onto a text file. This Input/Output process was responsible for the majority of the execution time.
Importantly, our system is an automated program in which the components such as Motif Analyzer, SVM-based PRE Classifier and GC Analyzer were run sequentially requiring no human intervention after the genome sequence under study is input, and is readily scalable. For example, we used our software to predict the PcG target genes on the entire human genome and obtained complete results within three and a half hours on the same PC.
In marked contrast, when we used BART as an alternative to SVM to classify PRE, we noticed that BART required substantial computational resources. It was impossible to complete the prediction of D. melanogaster genome on the same PC. On an Intel Xeon computer cluster which contains 134 SunFire x4150 nodes from Sun Microsystems, the computation took about 33h to complete. The average usage of memory was 16 GB.
Drosophila melanogaster Schneider S2 cells were cultured in 1× Schneider’s medium (Invitrogen) supplemented with 20% fetal bovine serum, 100U/ml Penicillin and 100µg/ml Streptomycin at room temperature. Cells were passaged at 1:4 ratio every two days to keep logarithmic growth. Crosslinking, immunoprecipitation with anti-E(z) antibody and quantitative PCR (qPCR) were done as described previously (72). In brief, 5µg anti-E(z) (Santa Cruz Biotechnology, Inc) or anti-FLAG mock antibody (Sigma) were added to 4×108 crosslinked S2 cells to immunoprecipitate protein/DNA complexes. The antibody–protein/DNA complexes were then purified using 50µl protein A Sephorose 4 Fast Flow Beads (GE Healthcare). DNA was extracted from the purified antibody-protein/DNA complexes by phenol–chlorophorm extraction. Purified DNA was subjected to qPCR using primer pairs designed to amplify DNA of ~250bps using a SYBER green detection mix (Applied Biosystems). All experiments were carried out in triplicates.
In order to identify the most appropriate kernel for the SVM-based PRE classifier, we performed an empirical analysis on the training set to gauge how well a certain kernel distinguished known PRE sequences. This is done using three runs of 10-fold cross validation so as to avoid any potential over-fitting problem. With the default parameters provided by LibSVM (73), the performance of all four basic kernel methods was analyzed by sensitivity and specificity (Table 1). As we expected, the non-linear kernels such as polynomial worked very well in distinguishing PRE sequences from control sequences, further confirming the advantage of modelling the motif interaction in a combinatorial manner. Among them, the polynomial (d=2 and d=3) kernels (also called the quadratic kernel and cubic kernel, respectively) achieved the best results in terms of specificity and sensitivity when both the average and standard deviation are taken into account, implicating that at the PRE sites, multiple transcription factors interact with each other that as a whole serves as the platform for PcG recruitment. Although the cubic kernel did not significantly outperform the quadratic kernel, it is still the best model given all the parameters considered. Therefore, we used the cubic kernel on the SVM throughout our analyses.
To compare our new sequence collection with the original one used by Ringrose and colleagues, we ran independently EpiPredictor-Basic using the modules (a, b, c) on both training sets and the jPREdictor (static) on our new training set (Supplementary Table S4). Also included in Supplementary Table S4 is the result of jPREdictor (static) with Ringrose’s training set as originally reported (51). We found virtually no difference in performance when using different training sets. Therefore, we elected to use our training set throughout the analyses.
We tested our classifier on the D. melanogaster genome that contains roughly 137 million bps and 13000 genes. Each chromosome was scanned with a sliding window of 200bps and a step size of 20bps (parameters determined by empirical analysis), and each window was analyzed by the Motif Analyzer component and represented by a seven-dimensional feature vector (each corresponding to one of the seven motifs we used). The performance of the system was evaluated by the matching ratios between our top predicted genes and those of the three validation sets derived from ChIP studies (4,14,15) together with their intersection set (Intersection) (Table 2). In order to examine whether the performance of our system is sensitive to different window size and step sizes, we also varied the values of these parameters (Supplementary Table S5). It is clear that with different window and step sizes, the performance of our system did vary slightly but the change was not very substantial. Overall, the parameter setting of window size=200, step size=20 produced the best results. Therefore, we used the window size of 200 bps and step size of 20bps throughout.
Our system contains multiple components. The effect of each component was evaluated by sequentially adding each component onto the Baseline system that used only the Motif Analyzer.
To thoroughly evaluate the merit of each component of EpiPredictor, we constructed a baseline system that did not incorporate SVM or any other subsequent component but instead only used the Motif Analyzer that calculated the sum of the motif occurrence frequency. The baseline system achieved a moderate performance, having the matching ratios of 14.20, 5.33, 12.09, 2.63%, with the three validation sets and their intersection, respectively (Table 2). It is noteworthy that to perform a fair comparison with jPREdictor that reported their top 243 genes, we also retrieved the top 243 genes from our system to obtain the aforementioned results.
In order to estimate the merit of SVM, we then integrated SVM into the baseline system. The application of SVM drastically enhanced the performance of our system when compared to the baseline system, with matching ratios of 22.73, 9.78, 19.53, 23.68%, respectively (Table 2).
Subsequently we incorporated the GC Analyzer into our program. The prediction performance of EpiPredictor was further improved to 26.14, 10.22, 25.12, 23.68%, respectively (Table 2), demonstrating that the bona fide PcG sites tend to have relatively high GC content.
The non-parametric tests conducted on 100 random genomes indicated that a PRE score of 12.7 corresponded to a P-value of 0.01. In our prediction, the top 190 predicted PRE sites had a PRE score of higher than 12.7, with the highest score being 39.2. We also corrected the issue of multiple comparisons using Bonferroni correction, and found that a PRE score of 17.3 corresponded to a P-value of 0.0001 (0.01/100). In our prediction, the top 73 predicted PRE sites had a PRE score of 17.3 or higher. Thus these top 73 predicted PRE sites are regarded as predictions with significant confidence, even under such a stringent condition.
The integration of the Conservation Level Analyzer slightly enhanced our system’s performance to 27.27, 10.67, 26.05, 26.32%, respectively (Table 2). At this point, we completed the construction of the basic version of our system, EpiPredictor-Basic. A complete list of the top genes thereby generated is provided in Supplementary Table S6.
It is worth mentioning that an attempt of using the base-by-base evolutionary conservation score compiled on D. melanogaster genome in comparison to 14 insects (71) failed to produce any improvement in the prediction performance (data not shown). Taken together, this suggested that the bona fide PcG target genes be most likely evolutionarily conserved; however, their positions might be more flexible in the course of evolution.
In order to evaluate the performance of EpiPredictor-CG, which integrated the Comparative Genomics Analyzer, we retrieved the top 322 predicted genes, which was the same number as generated by our counterpart jPREdictor (dynamic) (Supplementary Tables S7 and S8). Due to the integration of comparative genomics, some of the genes with lower scores were boosted up into the top list (Supplementary Table S9) and yielded an improved performance of 35.80, 15.11, 33.02, 44.74%, respectively (Table 2), in comparison to the performance of EpiPredictor-Basic in predicting 322 genes: 32.39, 14.22, 30.70, 34.21% (Table 2).
It is worth noting that the Intersection set obtained by intersecting all the three validation sets derived from ChIP studies (4,14,15) did have very high matching ratio with our EpiPredictor-CG prediction (Table 2), consistent with the expectation that it is the highest confidence set of the target genes.
Besides SVM, several other statistical models including BART (74) are also able to capture nonlinear interactions among the sequence features. For instance, Liu et al. (52) used BART to predict polycomb target genes with a good performance. Therefore we compared our system’s performance using SVM-based or BART-based PRE classifier (Table 3). It is clear that the SVM-based classifier consistently outperformed the BART-based counterpart.
We conducted a comparative analysis of EpiPredictor and jPREdictor (Table 4) by using the matching ratios as well as the receiving operating characteristics (ROC) curve as our evaluation metrics. The former metric indicates the overall accuracy of prediction while the latter one depicts the trade-off between sensitivity and specificity, which focuses on evaluating the ranking scheme. In terms of the matching ratio, EpiPredictor-Basic outperformed jPREdictor (static) by 6.25, 2.67, 6.05, 5.27%, respectively, against the three validation sets and their intersection set and the improvement is statistically significant (P<0.05 in one-tailed Students’ t-test). In addition, EpiPredictor-CG surpassed the performance of jPREdictor (dynamic) by 7.96, 2.67, 10.23, 18.42%, respectively (P<0.05). In terms of the area under curve (AUC) of ROC curve, EpiPredictor-Basic achieved comparable results with jPREdictor (static), whereas EpiPredictor-CG outperformed jPREdictor (dynamic) in three out of the four cases (Figure 2). It is worth noting that the AUCs of EpiPredictor-Basic, EpiPredictor-CG and jPREdictor (static) were all significantly larger than 0.5 (random guess) (P<0.05) but it was not the case for jPREdictor (dynamic). Furthermore, using the AUCs as a measure, neither EpiPredictor nor jPREdictor’s advanced version significantly outperformed their basic counterpart.
To reveal the major function enrichment of the genes predicted by our system and jPREdictor, we used the DAVID bioinformatics tool (75) to perform a gene ontology analysis on the genes uniquely predicted by either EpiPredictor-CG or jPREdictor (dynamic), as well as those predicted by both EpiPredictor-CG and jPREdictor (dynamic) (Figure 3). Most of the highly represented gene functions were related to the regulation of transcription, development, pattern specification, morphogenesis and cell-fate commitment, consistent with the expected roles of PcG in regulating key developmental processes of an organism (4,6–8,13–16). The consensus genes predicted by both EpiPredictor-CG and jPREdictor (dynamic) made up about 28% of the top 322 genes and their corresponding gene ontology analysis presented good consistency with experimental studies.
By cross-referencing existing literature, we found experimental evidence for seven genes, which were uniquely identified by EpiPredictor-CG and also matched at least one of the three ChIP studies, of their critical roles in key developmental processes (Table 5). To exemplify, the inv locus was recently found to harbour one PRE site which has been experimentally verified (38) and its role in regulating Drosophila hindgut development is well established (76). The wg locus belongs to the important Wg/Wnt signal transduction pathway that directs a variety of cell fate decisions in developing animal embryos (86). In Drosophila, wg alone directs a wide range of cell fate and patterning decisions (77). The nub locus is involved in embryogenesis and neurogenesis (78–80). The pdm2 locus is responsible for a variety of cell fate decision in the Drosophila development (81). The dac is an essential part of a complex that functions to induce ectopic eye development (82). The Gsc mediates effective repression in Drosophila blastoderm embryos (83). The tup has a key function in the development of imaginal disc (84) and is also a key component in early cardiogenesis (85). Interestingly, a recent ChIP study (7) revealed that the human homologues of wg (WNT1), dac (DACH-1), Gsc (GSC) and tup (ISL1) are all targeted by PcG. In particular, WNT1 is known to be involved in embryogenesis and cancer development (86). The functions of the genes uniquely identified by our system but excluding the abovementioned seven genes are shown by a gene ontology analysis using DAVID (Figure 3F).
To further validate our prediction, we also cross-referenced our gene list with the 27 PcG target genes confirmed by ChIP-qPCR in the work of Ringrose and colleagues (40), of which EpiPredictor-CG correctly predicted 19 genes (70%), exhibiting a good correlation.
In order to experimentally validate EpiPredictor prediction, ChIP-qPCR was used to investigate the enrichment of 15 predicted PRE sites that were randomly selected from the top 150 predictions (Supplementary Table S10) using anti-E(z) antibody. For positive controls we used three known PREs, bxd, iab2, and en_DM, as established in the literature (87) along with four sequences from Ringrose et al. (40), hth, unc-4, idgf4, and cato, for which ChIP-qPCR experiments have been done using anti-PC antibody. Three housekeeping genes with no previous evidence as PRE or of polycomb related activity, hsp22, hsp26 and Pc, were selected as negative controls (40). Primers for qPCR are listed in Supplementary Table S11.
The results of ChIP-qPCR showed that there are more than two-fold enrichments for 12 out of the 15 tested PRE sites (Figure 4). Among them, five showed enrichment greater than the averaged value of 5.66 for the seven positive controls, indicating a higher degree of confidence for their potential as PcG target genes. Our E(z)-ChIP derived data and Ringrose’s PC data are scaled roughly to the same level (Supplementary Table S12) with the exception of idgf4 which exhibited enrichment in our data but not in Ringrose’s (40). However, this discrepancy is not completely unexpected given the fact that on the whole genome scale PC and E(z) do not always align well (4).
By mapping the positively enriched sequences onto their closest genes, we found that all 12 corresponding genes are of crucial importance to Drosophila embryonic development, since the knockout of each of these genes conferred serious body morphological changes. The antp and abd-A are Drosophila HOX genes (88), while bxd is expressed directly upstream of and is known to directly influence the behavior of ubx, another HOX gene (89). Furthermore, both disco and eve regulate the localization or expression of HOX genes, conversely, salm and bab2 are directly regulated by HOX genes (90–93), while unc-4 is a homeobox-containing protein and a paralogue of the HOX genes with similar functions (94). Finally, both noc and pnr are critical for proper eye formation (95,96), grn has importance in multiple organ development (97) and immune response in the midgut (98), and zfh1 is essential to cell differentiation of lateral mesodermal derivative lineages and in neurogenisis (99). The critical importance of these genes and the computational prediction of them being PcG target genes highlight the importance of understanding how sequence influences PcG binding in order to properly understand embryonic development in Drosophila.
Transposons are mobile genetic elements that can cause mutations and change the amount of DNA in the genome (105). Given their critical importance in cellular functions, we predicted that PcG target genes in D. melanogaster should have a minimal presence of transposons, generally termed as transposon-free regions (TFRs). We performed a whole-genome search and identified 1400 TFRs of >10000bps in length, of which 1232 overlapped with at least one gene’s TSS. In the top 322 putative PcG target genes predicted by our system, 319 of them (99%) had TSS overlapping with one or more TFRs. Thus, as we expected, the D. melanogaster PcG target genes are indeed essentially free of identifiable transposon-derived sequences. This is a novel finding in Drosophila and corroborates well with several recent mammalian studies that revealed strong correlations between TFRs and genes encoding developmental regulators (100), as well as the H3K27me3 marks (101).
Given the ambiguity in the consensus sequence of motifs, our system considered different versions of the same motifs (for instance, PS, PM and PF for Pho) as well as allowed the existence of ambiguity codes and mutations (Supplementary Table S1). In addition, by using an SVM with non-linear kernel as a PRE classifier, our program abstractly models how multiple motifs interact with each other at the genomic site of interest. These two considerations are similar to the options of position-specific probability matrices and multi-motifs in jPredictor.
To the best of our knowledge, this is the first application of SVM to PRE prediction. With the integration of a non-linear kernel, our system EpiPredictor succeeded in modelling the spatial relationship and combinatorial interaction among transcription factors that are involved in PcG recruitment. This strategy offers a higher level of abstraction over any other approaches that use a linear function. The fully automated process of constructing the classifier in SVM also reduces the level of bias in the analysis.
Our novel computational strategy also offers new insights into the interactions among transcription factors at the cis-regulatory elements in vivo. The outstanding performance of the non-linear kernels indicates that multiple transcription factors are networking at the cis-regulatory elements for efficient recruitment of PcG proteins. However, the details of such networking remain to be illustrated in future studies.
Among the array of perspectives that we used in EpiPredictor, SVM classifier, high GC content and comparative genomics all led to substantial performance improvements (Table 2). The success of integrating GC analysis suggested that relatively high GC content be an important feature of PcG target genes, consistent with previous studies that hyper-conserved CpG domains underlie polycomb-binding sites (65). In addition, given their critical importance in cellular functions, PcG target genes are not surprisingly highly conserved in evolution.
The gene ontology analysis on the genes predicted by our system revealed that the target genes of PcG are mainly regulators of transcription activities and are crucial for key developmental processes. Some genes uniquely predicted by our system are confirmed by several independent experimental studies to be essential for normal development and patterning. These observations further support the fundamental roles of PcG proteins in development and cellular functions.
Trithorax group (TrxG) proteins methylate histone 3 lysine 4 to reverse the repression imposed by PcG proteins (18,102). There exists substantial evidence that Trithorax response elements (TREs) and PREs co-localize. For example, several major TrxG proteins bind at essentially all known or presumptive PREs, suggesting that the regulatory platforms are switchable (18,103). In mouse embryonic stem cells, large bivalent domains were found to contain chromatin modifications generated by both PcG and TrxG, suggesting the co-presence of PcG and TrxG in developmental genes (101). A recent genetic study on Drosophila also revealed that PcG repression is dynamic and that ASH1 (absent, small or homeotic discs 1), the histone methyltransferase belonging to the TrxG complex, is critical for the active state of Polycomb target genes (102). Taken together, accumulating evidence suggests that the epigenetic regulations mediated by PcG and TrxG are likely to be closely intertwined and that the approach that accurately predicts PcG target genes will also shed new light on TrxG target genes. Thus, it is fully expected that some of the PcG target genes we predicted here will turn out to be TrxG target genes.
Despite a large number of genome-wide ChIP studies of PcG target genes (1,4–8,10,12,14–16) recently appeared in the literature that substantially enriched our knowledge of the scales of PcG-mediated epigenetic modification and their roles in normal cellular functions and in cancer development, our mechanistic understanding of this process remains extremely poor. To exemplify, up to date, there are only two mammalian PREs (9,41) and a dozen of Drosophila PREs (31–40) that have been experimentally verified. In addition, there are only nine Drosophila transcription factors confirmed to be involved in PcG recruitment, among which only two have mammalian homologues (20,104). The extremely limited pools of confirmed PREs and their interacting transcription factors are the main restraints for the relatively mediocre performance of computational methods such as EpiPredictor and jPredictor, with 20–30% matching ratios with genome-wide ChIP data. Although our EpiPredictor has substantially outperformed jPredictor (by up to >10% in matching ratio), we expect a much better performance if we had had more knowledge on PREs and their interacting transcription factors. Thus, the more accurate computational method such as EpiPredictor will provide a very useful tool for initial screening of PcG target genes from ChIP studies so as to identify the most likely candidates for labour-intensive experimental verifications. The enhanced knowledge of PREs will in turn improve the performance of these computational methods, and ultimately leads to a comprehensive understanding of PcG-mediated gene repression in normal cellular functions as well as in epigenetic dysregulation. Thus, our new EpiPredictor program reported in this study represents an important step toward this ultimate goal in the field of epigenetics.
Supplementary Data are available at NAR Online: Supplementary Tables S1–S11.
Collaborative Advances in Biomedical Computing from The John and Ann Doerr Fund for Computational Biomedicine at Rice University (to J.M. and Q.W.); training fellowship from the Keck Center Computational Cancer Biology Training Program of the Gulf Coast Consortia (CPRIT Grant No. RP101489 to J.Z.); training fellowship from the NLM Training Program of the Keck Center of the Gulf Coast Consortia (NIHGrant No.5 T15LM07093-17 to B.D.K.); the National Institutes of Health (R01-GM067801), the National Science Foundation (MCB-0818353); the Welch Foundation (Q-1512) to J.M. Funding for open access charge: Start-up funds from Baylor College of Medicine (to Q.W.).
Conflict of interest statement. None declared.
We are grateful to Drs Leonie Ringrose and Marc Remsmeier for providing valuable details regarding their publications. We also thank Kit Menlove for his input on the project. The source code of EpiPredictor is available upon request.