Our new method for predicting whether a promoter is bound by a TF computes a score that combines the promoter's maximum PWM score with the total ChIP-seq tag count of histone modifications within the promoter in the tissue of interest. A single histone modification, H3K4me3, appears to be as informative about the presence of TF binding as any of the combinations of histone modifications we tested. This makes our approach immediately practical, since genome-wide H3K4me3 ChIP-seq data are already available for many tissues. The software we make available (Dr Gene) can make tissue-specific binding predictions for any TF that has a PWM, provided H3K4me3 ChIP-seq data exist for the tissue of interest. As our results show, prediction accuracy improves if TF ChIP data are available for at least one TF in the tissue of interest, but this is not required: our predictor can be trained and used in tissues for which no TF ChIP data yet exist.
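As a rough illustration of how two per-promoter features can be combined into a single naïve Bayes score, the sketch below fits a Gaussian to each feature in each class and sums the per-feature log-likelihood ratios. The Gaussian class-conditional model, the function names and the toy numbers are all assumptions made for illustration; they are not taken from Dr Gene and may differ from the estimator it uses.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def fit_class(rows):
    """Per-feature (mean, variance) for one class; rows is a list of feature tuples."""
    n = len(rows)
    stats = []
    for col in zip(*rows):
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n + 1e-9  # small floor avoids zero variance
        stats.append((mean, var))
    return stats

def log_odds(features, bound_stats, unbound_stats, prior_bound=0.5):
    """Naive Bayes log-odds of 'bound': prior log-odds plus one log ratio per feature."""
    score = math.log(prior_bound / (1.0 - prior_bound))
    for x, (mb, vb), (mu, vu) in zip(features, bound_stats, unbound_stats):
        score += gaussian_logpdf(x, mb, vb) - gaussian_logpdf(x, mu, vu)
    return score

# Hypothetical training rows: (maximum PWM score, log H3K4me3 tag count) per promoter.
bound_rows = [(9.0, 6.0), (8.0, 5.5), (10.0, 6.5)]
unbound_rows = [(3.0, 1.0), (4.0, 2.0), (2.0, 1.5)]

bound_stats = fit_class(bound_rows)
unbound_stats = fit_class(unbound_rows)

high = log_odds((9.0, 6.0), bound_stats, unbound_stats)  # resembles the bound class
low = log_odds((3.0, 1.5), bound_stats, unbound_stats)   # resembles the unbound class
```

Promoters would then be ranked by this log-odds value, with a positive score favoring ‘bound’ under the assumed model.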
The effectiveness of combining PWM scores with H3K4me3 scores for predicting TF binding is consistent with our previous work (Whitington et al., 2009). There, we filtered out regions with low H3K4me3 scores and then made binding predictions based on PWM scores. A drawback of this earlier approach is that regions with marginal H3K4me3 scores are treated as ‘unbound’ without considering their PWM score at all. Our current method removes this drawback, and also focuses exclusively on predicting bound promoters. The surprising ability of the H3K4me3 score of a promoter to predict whether it is bound by any TF (see the green curve) indicates that this score could function as an extremely good ‘prior’ for predicting TF binding.
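The difference between a hard H3K4me3 filter and a soft ‘prior’ can be sketched as follows. This is only one possible reading of the remark about priors: the cutoff, the prior probability assigned to a marginal H3K4me3 score and the PWM log-likelihood ratio are hypothetical values, and neither function reproduces the published implementations.

```python
import math

def filter_then_score(pwm_score, h3k4me3_score, h3k4me3_cutoff):
    """Hard filter: regions below the H3K4me3 cutoff are called unbound outright."""
    if h3k4me3_score < h3k4me3_cutoff:
        return -math.inf  # the PWM score is never consulted
    return pwm_score

def prior_plus_pwm(pwm_log_ratio, prior_bound):
    """Soft prior: posterior log-odds = prior log-odds + PWM log-likelihood ratio."""
    return math.log(prior_bound / (1.0 - prior_bound)) + pwm_log_ratio

# Hypothetical promoter: H3K4me3 just under the cutoff, but a strong PWM match.
hard = filter_then_score(pwm_score=4.0, h3k4me3_score=9.5, h3k4me3_cutoff=10.0)
soft = prior_plus_pwm(pwm_log_ratio=4.0, prior_bound=0.3)
```

Under these assumed numbers the hard filter discards the promoter, whereas the soft prior lets the strong PWM evidence outweigh the marginal H3K4me3 signal.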
Evaluating in silico methods for predicting TF binding is problematic. We have chosen to consider as ‘bound’ all core promoters that contain the midpoint of a TF ChIP-seq peak. However, some ChIP-seq peaks may represent indirect binding by the TF, so the core promoter may not contain a strong match to the TF's PWM. It is tempting, therefore, to remove promoters lacking strong PWM matches from the list of bound core promoters in a given reference set. We feel that doing so would give an unfairly optimistic estimate of the accuracy of PWM-based prediction methods, including our naïve Bayes score. Reference sets constructed using a given PWM method would be biased in favor of that PWM, since any core promoters that are directly bound by the TF but do not match the (possibly inaccurate) PWM would be eliminated. Our method of evaluation therefore considers as bound all promoters for which TF ChIP-seq data indicate binding within the core promoter, even though such binding may be indirect and thus impossible for the TF's PWM to detect. In addition, we recognize that some directly bound promoters will be labeled as ‘unbound’ in our reference sets because of missing peaks, which arise from limitations in the raw ChIP-seq data (e.g. sequencing depth) and from the limited accuracy of the algorithms used to call ChIP-seq peaks.
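The labeling rule described above (a core promoter is ‘bound’ if it contains the midpoint of a peak) can be sketched as below. The half-open coordinate convention, the data layout and the gene names are assumptions for illustration only.

```python
def label_promoters(promoters, peaks):
    """Label each promoter True if it contains the midpoint of any peak.

    promoters: dict mapping name -> (chrom, start, end), half-open [start, end)
    peaks: list of (chrom, start, end) tuples
    """
    midpoints = {}
    for chrom, start, end in peaks:
        midpoints.setdefault(chrom, []).append((start + end) // 2)
    labels = {}
    for name, (chrom, start, end) in promoters.items():
        labels[name] = any(start <= m < end for m in midpoints.get(chrom, []))
    return labels

# Hypothetical core promoters and ChIP-seq peaks.
promoters = {
    "geneA": ("chr1", 1000, 1200),
    "geneB": ("chr1", 5000, 5200),
    "geneC": ("chr2", 1000, 1200),
}
peaks = [("chr1", 1050, 1250), ("chr1", 4800, 5100)]

labels = label_promoters(promoters, peaks)
```

Here the second peak overlaps geneB's promoter, but its midpoint (4950) lies outside it, so geneB is labeled ‘unbound’. This shows how the midpoint rule is stricter than simple overlap.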
Both conservation-based prediction approaches that we test here are, on average, less accurate than using a PWM alone to predict the core promoters bound by a TF. This does not necessarily mean that conservation-based predictions of TF binding sites are less accurate than PWM-based predictions if one is trying to predict all sites bound in any tissue. However, because we use tissue-specific ChIP-seq data to label core promoters, sites that are not bound in the tissue of interest, but are highly conserved because of function in a different tissue or tissues, will be treated as false positives. This potentially explains the slightly lower accuracy of the conservation-based prediction methods tested here. It is still somewhat unexpected that combining a conservation-based binding score with the H3K4me3 score (Monkey+H3K4me3) is no more accurate than the non-conservation-based PWM+H3K4me3 score. This may be due to characteristics of our reference sets, such as the problems of indirectly bound promoters and missing peaks discussed above, which limit the accuracy attainable by any PWM-based prediction method.
Funding: Australian Postgraduate Award and the Queensland Government Department of Tourism, Regional Development and Industry to R.C.M.; Australian Research Council Centre of Excellence in Bioinformatics, National Institutes of Health Award (RO-1 RR021692) to T.L.B.
Conflict of Interest: none declared.