We studied mechanistic determinants of TF-DNA binding by computationally modeling genomic occupancy from over 40 ChIP data sets obtained from four different stages of embryonic development, in conjunction with over 300 TF motifs and stage-specific DNA accessibility and RNA-SEQ data. Our ultimate goal is to use the insights revealed here, both general and data set-specific, to develop improved computational tools that can quantify functional TF-DNA interactions genome-wide. Such tools can potentially inform models of TF regulatory networks in the same way that ChIP data is beginning to be used today 
. We note that characterizing hundreds of TFs by whole-genome ChIP-SEQ across the vast number of different cellular conditions is not currently feasible. Computational tools therefore offer an attractive alternative, especially if they can be shown to predict cell type-specific occupancy. TF motifs are already being characterized through high-throughput technologies such as Bacterial 1-Hybrid 
, SELEX 
, and Protein-Binding Microarrays 
. Cell type-specific DNA accessibility profiles and TF expression levels only need to be characterized once for a given cell state, and can then be used to predict binding profiles for all TFs. Our work provides initial evidence for the feasibility of this vision. At the same time, we note that the CC values reported here should not be interpreted as correlation coefficients between genome-wide predictions and observed levels of TF binding. The manner in which we chose to evaluate various models, i.e., by examining agreement with ChIP scores on 1000 bound regions and 1000 randomly selected non-peaks, was dictated primarily by the goal of detecting significant influences on primary TF occupancy. We also note that the CC values varied substantially across data sets, from 0.765 for TRL to 0.062 for Dorsal (DL). This variation in model performance may reflect weaknesses of certain data sets or PWMs, or a variable reliance of ChIP scores on the primary TF's binding.
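The evaluation scheme described above can be illustrated with a minimal sketch. It assumes that model predictions and ChIP scores are available as plain lists for the bound regions and for the randomly selected non-peaks. The function names `pearson_cc` and `evaluate_model` and the toy numbers are hypothetical and are not taken from the original analysis pipeline.

```python
import math


def pearson_cc(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def evaluate_model(chip_bound, pred_bound, chip_nonpeak, pred_nonpeak):
    """CC between predictions and ChIP scores on the pooled evaluation set.

    Assumption: bound regions and non-peaks are simply concatenated,
    mirroring the 1000 + 1000 design described in the text.
    """
    predictions = list(pred_bound) + list(pred_nonpeak)
    chip_scores = list(chip_bound) + list(chip_nonpeak)
    return pearson_cc(predictions, chip_scores)


if __name__ == "__main__":
    # Toy values for illustration only (not real ChIP data).
    cc = evaluate_model(
        chip_bound=[5.0, 4.0, 6.0],
        pred_bound=[0.9, 0.7, 0.8],
        chip_nonpeak=[1.0, 0.5, 1.5],
        pred_nonpeak=[0.2, 0.1, 0.3],
    )
    print(f"CC on pooled evaluation set: {cc:.3f}")
```

Because the evaluation set is enriched for bound regions by construction, a CC computed this way is not comparable to a correlation over all genomic windows.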
Despite a general appreciation of the potential role of various determinants of TF binding, there have been very few systematic studies of the extent of their influence across a large number of TFs. We review three such studies that set the stage for our own work and explain the main goals and contributions of our work in the backdrop of these important prior studies.
Kaplan et al. studied ChIP-SEQ data on five TFs in early Drosophila development, and concluded that the TF motif and DNA accessibility are the most informative correlates of TF-DNA binding, as determined by the agreement between measured and predicted occupancy profiles. They also used TF sequence signatures to examine the role of competitive and cooperative interactions with other TFs with similar developmental roles and concluded that these interactions do not play a significant role overall. Their negative finding regarding secondary motifs may be limited to the small number of data sets examined, or be a limitation of the specific methodology adopted in the study (including the use of a more limited set of motifs that were available then). Here, we perform much more extensive tests of the role of the above-mentioned determinants of TF binding, by analyzing 45 TF-ChIP data sets spanning multiple stages of embryonic development in D. melanogaster. We primarily consider the influence of a large number of secondary TFs that are highly expressed in the corresponding developmental stage. In contrast to the earlier findings, we find many cases where the primary TF's binding levels are significantly influenced by the presence or absence of binding sites for other TFs.
In a related study, Pique-Regi et al. 
considered the problem of classifying primary motif matches within ChIP peaks versus those outside of ChIP peaks, in the context of six ChIP-SEQ data sets from two human cell lines. They found accessibility and specific histone modifications to be the most useful features in this classification task, but did not consider the influence of secondary TFs. However, there are fundamental differences in the goals of our study from that of Pique-Regi et al. Their objective was to build a computational tool for annotating TF-bound sites genome-wide, and therefore their algorithm integrates several variables that correlate with binding, including evolutionary conservation, transcription start site proximity, DNA accessibility and histone marks. On the other hand, our focus is on the influence of variables that are expected to be mechanistic determinants of binding, and whose influence can be reasonably understood within an intuitive biophysical framework. We therefore focus specifically on testing whether and how binding sites of secondary TFs shape the primary TF's binding profile. In this pursuit, we rely upon motif, sequence and TF expression data, treating these as the “predictor variables” with which to model ChIP data. We do not include other variables such as evolutionary conservation (which is not a mechanistic determinant) or start site proximity (whose influence cannot be easily modeled biophysically) as predictors in this statistical exercise. DNA accessibility data is used in our analysis, not to improve occupancy prediction per se, but to answer a specific mechanistic question about how secondary TFs influence binding. Also, there is a fundamental technical difference between the data types modeled in the two studies: the variable we propose to model is not tied to TF-DNA interaction at an individual binding site as in 
, but to the aggregate effect of all binding events within a 500 bp window. For simplicity, we ask whether a model can predict the actual ChIP score at a genomic position, rather than whether a putative motif match falls within a significant ChIP peak.
A recent study by Yanez-Cuna et al. 
searched for motif signatures of context specific binding of TFs. In particular, they analyzed ChIP data sets for the same TF from two different cellular conditions and asked if peaks exclusive to either condition could be discriminated on the basis of motif presence. They showed that such motif signatures do exist for the seven TFs examined and that general-purpose machine learning methods such as support vector machines can accurately classify context-specific binding sites using tens of motifs. In the same vein, they showed that bound and non-bound regions of a TF can be discriminated using a combination of tens of motifs, for most of the 21 TF-ChIP data sets examined. Additionally, they performed a closer examination of the binding determinants of one particular TF, twist (TWI), and demonstrated that binding sites for the secondary TFs VFL and TTK significantly affect the correct prediction of many context-specific TWI binding sites. While Yanez-Cuna et al. mostly focused on demonstrating that accessory motif signatures can distinguish
TF-DNA binding regions in different developmental stages, our primary goal was to precisely identify the most influential secondary motifs for each of 45 different TF-ChIP data sets. To this end, we focused largely on quantifying the influence of secondary motifs and assessing their statistical significance rigorously. By performing our analysis over many data sets, we were able to gain more general insights about the widespread or TF-specific roles of particular secondary TFs. In particular, our statistical tests are geared towards explaining the mechanistic basis of such roles: short- versus long-range effects, synergistic versus antagonistic effects, chromatin-mediated versus direct interactions, etc.
The review by Biggin 
uses findings from recent studies to argue that accessibility is more important than the role of secondary TFs in determining primary TF binding levels. However, we do not attempt here to characterize the effect of accessibility as being stronger or weaker than the effect of interacting TFs. Integrating perspectives from Biggin and others 
, DNA accessibility in vivo can be considered the result of multiple factors playing out simultaneously, possibly including innate sequence preferences of nucleosome location, a conglomerate of chromatin remodeling activities, and displacement of nucleosomes by competition with TF binding. Under this view, there are practical limitations in the approach of directly comparing the improvement in occupancy prediction due to accessibility information to that due to secondary motif information alone. Moreover, while it may be possible to make broad statements regarding the influence of accessibility or other chromatin-related information on TF binding, the effects of secondary TFs, due to the combinatorial nature of gene regulation, will be factor-specific and thus detectable only on a few data sets. Accordingly, our goal is to characterize as many of these determinants of TF occupancy as possible from each ChIP data set, rather than assign any one number to the overall influence of, say, interactions between the primary and secondary TFs, which will be factor-dependent by definition.
A related study that examined the effects of secondary TFs on ChIP data is that of Gordan et al. 
who reported on TF-ChIP data sets in yeast where a secondary motif alone was a better correlate of peak location than the primary motif. In some cases, this may be due to a problem with the primary motif (H.N.P. and M.H.B., unpublished results). In other cases, such a situation may reflect indirect binding of the primary TF to the peak, via physical interaction with the bound secondary TF. This suggests an alternative model of ChIP data, where binding is predicted to be a sum or linear combination of the occupancy values of the primary TF (direct binding) and a secondary TF (indirect binding). We have not explored this model here, and believe that it is an important goal for future studies.
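The alternative model mentioned above, which was not explored in this work, could be sketched as an ordinary least-squares fit of ChIP scores to two occupancy tracks. The example below is only an illustration under stated assumptions: the function name `fit_two_track_model`, the absence of an intercept term, and the toy numbers are all hypothetical.

```python
def fit_two_track_model(primary_occ, secondary_occ, chip_scores):
    """Least-squares fit of chip ~ a * primary_occ + b * secondary_occ.

    Assumption: no intercept term; the 2x2 normal equations are solved
    in closed form. Returns the pair (a, b).
    """
    if not (len(primary_occ) == len(secondary_occ) == len(chip_scores)):
        raise ValueError("all three sequences must have the same length")
    s_pp = sum(p * p for p in primary_occ)
    s_ss = sum(s * s for s in secondary_occ)
    s_ps = sum(p * s for p, s in zip(primary_occ, secondary_occ))
    t_p = sum(p * y for p, y in zip(primary_occ, chip_scores))
    t_s = sum(s * y for s, y in zip(secondary_occ, chip_scores))
    det = s_pp * s_ss - s_ps * s_ps
    if det == 0:
        raise ValueError("occupancy tracks are collinear; fit is not unique")
    a = (t_p * s_ss - t_s * s_ps) / det
    b = (t_s * s_pp - t_p * s_ps) / det
    return a, b


if __name__ == "__main__":
    # Toy occupancy values for illustration only.
    primary = [1.0, 0.0, 2.0, 1.0]
    secondary = [0.0, 1.0, 1.0, 3.0]
    chip = [2.0 * p + 0.5 * s for p, s in zip(primary, secondary)]
    a, b = fit_two_track_model(primary, secondary, chip)
    print(f"direct weight a = {a:.3f}, indirect weight b = {b:.3f}")
```

In such a fit, a large weight `b` relative to `a` would be consistent with indirect binding through the secondary TF, although it could equally reflect a poor primary motif.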
Our approach to including accessibility data in the analysis was to use partial correlations to examine secondary TF influences before and after factoring out the effect of accessibility on ChIP scores. Alternative approaches may directly include accessibility data in the occupancy models, as was done by Kaplan et al. 
, who changed prior probabilities of binding in their probabilistic model based on accessibility, and Pique-Regi et al. 
, who included DHS and histone modification data as features in their classifier. Future modifications of our approach will attempt to include accessibility within the biophysical framework of STAP, and may potentially reveal the role of accessibility even more accurately.
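The idea of factoring out accessibility with a partial correlation can be sketched as follows. This is a minimal illustration using the standard first-order partial correlation formula, not the exact procedure used in our analysis. The variable roles (secondary motif score, ChIP score, accessibility), the function names and the toy vectors are assumptions made for the example.

```python
import math


def pearson_cc(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def partial_cc(xs, ys, zs):
    """First-order partial correlation of xs and ys, controlling for zs."""
    r_xy = pearson_cc(xs, ys)
    r_xz = pearson_cc(xs, zs)
    r_yz = pearson_cc(ys, zs)
    denom = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    if denom == 0:
        raise ValueError("partial correlation undefined: perfect correlation with zs")
    return (r_xy - r_xz * r_yz) / denom


if __name__ == "__main__":
    # Toy vectors: both the motif score and the ChIP score depend on
    # accessibility plus independent components.
    accessibility = [1, 1, 1, 1, -1, -1, -1, -1]
    motif_only = [1, 1, -1, -1, 1, 1, -1, -1]
    chip_only = [1, -1, 1, -1, 1, -1, 1, -1]
    motif_score = [a + m for a, m in zip(accessibility, motif_only)]
    chip_score = [a + c for a, c in zip(accessibility, chip_only)]
    print("raw CC:    ", pearson_cc(motif_score, chip_score))
    print("partial CC:", partial_cc(motif_score, chip_score, accessibility))
```

In this toy case the raw correlation between the motif score and the ChIP score arises entirely from their shared dependence on accessibility, so the partial correlation is close to zero. A secondary motif whose correlation with ChIP scores survives this correction would be a candidate for an accessibility-independent influence.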
An intriguing observation from our analyses was the influence of competitive binding by the secondary TF EXD despite there being no correlation between EXD sites and the ChIP scores of the primary TF. This is puzzling because it suggests that the frequency of EXD sites does not differ between peaks and non-peaks, yet these sites somehow make a significant difference to binding predictions. However, it is possible that the frequency of EXD sites overlapping with primary TF sites is different between peaks and non-peaks, and that the advanced model uses the competition for overlapping sites to predict lower occupancy in certain sequences than that predicted by the baseline model, leading to improved agreement with ChIP scores (Supplementary Figure S11).
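The proposed explanation can be made concrete with a toy equilibrium calculation. The sketch assumes a single primary TF site overlapped by a single competitor site, so that at most one of the two factors can be bound at a time. The statistical weights `q_primary` and `q_competitor` (concentration times relative affinity) and the function name are hypothetical, and this is a generic thermodynamic illustration, not the STAP implementation.

```python
def primary_occupancy(q_primary, q_competitor=0.0):
    """Equilibrium occupancy of the primary TF at a site overlapped by a competitor.

    States (mutually exclusive): empty (weight 1), primary bound
    (weight q_primary), competitor bound (weight q_competitor).
    Setting q_competitor to 0 recovers the baseline, competitor-free case.
    """
    if q_primary < 0 or q_competitor < 0:
        raise ValueError("statistical weights must be non-negative")
    return q_primary / (1.0 + q_primary + q_competitor)


if __name__ == "__main__":
    baseline = primary_occupancy(1.0)
    with_competitor = primary_occupancy(1.0, q_competitor=2.0)
    print(f"baseline occupancy:        {baseline:.2f}")
    print(f"occupancy with competitor: {with_competitor:.2f}")
```

Under these assumptions, only competitor sites that overlap a primary site reduce the predicted occupancy, which is why the overall frequency of competitor sites need not correlate with ChIP scores.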
Our work opens up several important directions of future research into TF-DNA interaction on a genomic scale. While the models we explored used at most one secondary motif in one interaction mode, a more realistic model will require integration of more than one underlying mechanism influencing primary TF occupancy. Accessibility information will play a crucial role in the predictive ability of such models. In the longer term, an important goal will be to develop integrative models where sequence, TF gene expression and developmental history are sufficient to predict, at least to a good approximation, both accessibility patterns and TF-DNA binding profiles. With the future availability of large collections of TF motifs, such computational surrogates for cell type-specific ChIP data will enable global studies of gene regulatory networks and provide specific regulatory assignments that can be experimentally confirmed.