We have demonstrated in this article how predictive modeling approaches can reveal subtle sequence signals that may influence TF–DNA binding and generate testable hypotheses. Compared with more system-based approaches to gene regulation, such as building a large system of differential equations or inferring a complete Bayesian network, predictive modeling is more intuitive, rests on firmer theoretical ground (many in-depth statistical learning theories have been developed), is more easily validated (by CV), and generates more straightforward testable hypotheses.
Our main goal here is to conduct a comparative study on the effectiveness of several statistical learning tools for combining ChIP-chip/expression and genomic sequence data to tackle the protein–DNA binding problem. Because of the generality of these tools, we are able to include sequence features besides TF motifs, such as background words, GC content, and a measure of cross-species conservation. The finding that these nonmotif features can significantly improve the predictive power of all the tested methods indicates that they play a potentially important yet poorly understood role in TF–DNA interactions. They may help localize a TF on the DNA for precise recognition of its binding sites, or they may function together with chromatin-associated factors and histone-modification activities. It is generally believed that many factors in addition to the sequence specificity of short TFBSs contribute to TF–DNA binding. Along this direction, we have proposed a general framework to explore and characterize potentially influential factors.
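To make the nonmotif features concrete, the following is a minimal sketch of how GC content and background word counts could be computed for a single sequence. It is an illustration only: the function names, the word length `k`, and the feature naming scheme are assumptions rather than the procedure used in this study, and the cross-species conservation measure is omitted because it requires alignment data.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def word_counts(seq, k=2):
    """Counts of all overlapping words of length k (illustrative choice of k)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def nonmotif_features(seq, k=2):
    """Collect GC content and word counts into one feature dictionary."""
    features = {"gc": gc_content(seq)}
    for word, n in word_counts(seq, k).items():
        features["word_" + word] = n
    return features
```

Each sequence then contributes one row of such features, alongside its motif scores, to the design matrix passed to a learning method.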
In both ChIP-chip data sets, we not only unambiguously identified all the binding motifs for the target TFs, Oct4 and Sox2, but also discovered a number of verified cooperative or functional regulators in ESCs, such as Nanog, Klf4, Nfyb and P53. As a principled way to utilize both positive (i.e. binding sequences) and negative (nonbinding) information, the predictive modeling approach provides a powerful alternative for detecting TF-binding motifs to those more popular generative model-based tools (3–6). Noting that the stepwise linear regression methods (Step-Full, Step-SO, and Step-LR) are equivalent to MotifRegressor (29) and MARS is equivalent to MARSMotif (30) with all the known and discovered (Sox–Oct) motifs as input, we have shown that BART and boosting using all three categories of sequence features outperformed MotifRegressor and MARSMotif significantly in all of our examples.
In a generative modeling approach, separate statistical models are fitted to TF-bound (positive) and background (negative) sequences, and discriminant analysis based on the posterior odds ratio or likelihood ratio is then applied to construct prediction rules. In contrast, a predictive modeling approach targets prediction by directly modeling the conditional distribution of TF binding given extensively extracted sequence features. As shown in this article with both real and simulated data sets, modern statistical learning tools such as boosting and BART have made it possible to estimate this conditional distribution quite accurately for the TF-binding problem. The two approaches have their respective advantages. If the underlying data generation process is unclear or difficult to model, predictive approaches have the advantage of constructing a nonparametric conditional distribution from the training data. On the other hand, generative models are usually built with more explicit assumptions that help us understand the underlying science and can capture key characteristics of a biological or physical system. Typical examples of generative models in gene-regulation problems include the mixture modeling of DNA sequence motifs (2) and the graphical model for protein–DNA interaction measured with ChIP-chip data (56).
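The contrast between the two approaches can be sketched on a toy problem with a single numeric feature (for example, a motif score). The generative route fits one Gaussian to each class and classifies by the log posterior odds; the predictive route fits the conditional probability of binding directly, here with a simple logistic model trained by gradient ascent. Both model choices, all function names, and the toy data are assumptions made for illustration; they are not the models used in this study, which relied on methods such as boosting and BART.

```python
import math


def fit_gaussian(xs):
    """Maximum-likelihood mean and variance of a one-dimensional sample."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, max(var, 1e-9)  # guard against zero variance


def log_gaussian(x, mean, var):
    """Log density of a Gaussian with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def generative_log_odds(x, pos, neg):
    """Generative route: fit each class separately, then form posterior odds."""
    mean_p, var_p = fit_gaussian(pos)
    mean_n, var_n = fit_gaussian(neg)
    log_prior_odds = math.log(len(pos) / len(neg))
    log_lik_ratio = log_gaussian(x, mean_p, var_p) - log_gaussian(x, mean_n, var_n)
    return log_lik_ratio + log_prior_odds


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, steps=2000):
    """Predictive route: model P(bound | x) directly by gradient ascent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            residual = y - sigmoid(w * x + b)
            grad_w += residual * x
            grad_b += residual
        w += lr * grad_w / n
        b += lr * grad_b / n
    return w, b


def predictive_prob(x, w, b):
    """Estimated probability of binding for feature value x."""
    return sigmoid(w * x + b)


# Toy data (hypothetical): feature values for bound and unbound sequences.
pos = [2.0, 2.5, 3.0, 3.5]
neg = [0.0, 0.5, 1.0, 1.5]
w, b = fit_logistic(pos + neg, [1] * len(pos) + [0] * len(neg))
```

In this sketch, a positive value of `generative_log_odds(x, pos, neg)` and a value of `predictive_prob(x, w, b)` above 0.5 both indicate predicted binding. The generative route commits to a distributional form for each class, whereas the predictive route only constrains how the binding probability depends on the feature.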
Finally, our study suggests that the Bayesian learning method BART is a good tool for analyzing high-dimensional genomic data because of its high predictive power, its explicit quantification of uncertainty, and its interpretability. First, like boosting, BART is an ensemble learning method, which approximates an unknown relationship by aggregating a large number of simple models (small trees). Second, the Bayesian formulation of BART yields not only the ‘optimal’ model but also a posterior distribution on the space of all possible models, which can be used to predict the response of a new observation by a weighted average of the predictions from all models. This model averaging approach tends to improve the model's predictive power in general. Third, variable selection is a coherently built-in feature of BART and performed quite well in identifying important and relevant sequence features that contribute to TF–DNA interactions in all the examples. With the rapid accumulation of large-scale genomic data, we believe that flexible statistical learning methods such as BART and boosting will be very useful for studying a large class of biological problems, including cis-regulatory analysis.