Quantitative Models of Transcription Factor Binding
We developed a probabilistic framework to infer the occupancies of one or more transcription factors across any DNA sequence given the concentration of these factors and their DNA binding preferences (Figure S1
). Our model is based on the formalism of generalized hidden Markov models (gHMMs; 
, ), which allow efficient integration of the different forces that may influence transcription factor binding into the model.
This class of models offers several prominent advantages. gHMMs have very few parameters and are therefore easy to optimize. And, unlike most probabilistic graphical models, they offer exact inference of posterior probabilities in linear time, using the forward-backward dynamic programming algorithm 
. Finally, gHMMs are related to thermodynamic equilibrium models, with the ensemble of all possible configurations of bound factors viewed as a Boltzmann distribution and each configuration assigned a weight (or probability) depending on its energetic state. The probability that a factor is bound at a given location is assumed to be the fraction of configurations (weighted by their probabilities) in which it is bound 
The Markovian property of gHMMs prevents them from considering the full context in which binding occurs, and thus they offer only an approximation of the full thermodynamic model. We overcome this limitation with a sampling procedure (see below).
In Vitro and In Vivo DNA Binding Data
To model the DNA binding affinities of the five factors considered in this study (HB, BCD, KR, GT and CAD), we used in vitro specificities (expressed as position weight matrices; PWMs) measured using SELEX-Seq by the Berkeley Drosophila Transcription Network Project (bdtnp.lbl.gov).
For in vivo binding data, we used ChIP-seq measurements of formaldehyde crosslinked HB, BCD, KR, GT and CAD from blastoderm embryos of Drosophila melanogaster
(Oregon R) 
. A range of controls establishes that these data provide a quantitative measure of the relative levels of transcription factor directly bound to different genomic DNA regions 
. In particular, in vitro controls using purified transcription factors and naked DNA, where binding at individual sites is expected to be proportional to the affinity of the factor for that site, show that relative levels of crosslinking closely correlate with relative DNA affinity 
. “Spike in” experiments using ~200 kb BAC DNAs show that the ChIP-seq post immunoprecipitation processing steps accurately preserve the relative enrichment levels of different genomic regions, with the enrichments of DNA samples prior to and post processing showing a correlation of 0.997 and a linear fit of slope of 1.14 (Figure S2
). In vivo UV crosslinking results show that similar data are obtained when protein/protein crosslinking is absent 
. And the genomic regions identified as bound by factors in our ChIP-seq experiments are not preferentially enriched in the crosslinked DNA initially used as input to the immunoprecipitations 
This latter observation is particularly important, as Auerbach et al have reported that some methods of DNA isolation can introduce a bias that adversely affects ChIP-seq data 
. In particular, they found that the sonication of intact nuclei leads to the preferential enrichment of regions of open chromatin. Although the crosslinked DNA used in our ChIP-seq experiments is sonicated only after it has been purified away from non-crosslinked proteins by buoyant density centrifugation 
, and thus is unlikely to suffer the above DNA extraction bias, we nonetheless directly compared the input DNA samples from our ChIP-seq experiments to DNase I-seq genome accessibility data for embryos at the same stage of development (Thomas et al, unpublished data). In contrast to Auerbach et al 
, we found no correlation between ChIP-seq input samples and the DNase I-seq data (mean correlation coefficient of -0.0005).
Thus, while we do not believe ChIP-seq represents an absolutely precise measure of binding to each region of the genome (discussed further below), a wide range of evidence indicates that our ChIP-seq data do not appear to suffer large systematic biases that could interfere with our modeling effort and that these data provide a sufficiently quantitative estimate of binding on which to base attempts to predict it.
Comparing Model Predictions and In Vivo Binding Data
We developed a model-based algorithm to compare the binding probabilities predicted by our gHMM model to high-resolution in vivo binding measurements 
. Using the length distribution of the DNA fragments recovered by ChIP, we simulated the shape of the peak corresponding to a single binding event, as measured by ChIP-seq. We then used that shape to convert the single-nucleotide resolution binding probabilities into an expected ChIP-seq profile (Figure S3
To analyze our predictions, we compiled a list of 21 known target loci of the A-P patterning system, primarily known maternal, gap, and pair rule genes 
, expressed during early stage 5 
. Each gene was expanded by ~10 Kb upstream and downstream of the transcription unit to capture its known regulatory sequences. In each of the analyses presented below, we trained the model parameters to optimize the fit between the predicted and observed ChIP-seq landscapes at a set of six loci spanning over 87 Kb (croc, cnc, slp, kni, hkb, D
), and evaluated the trained model on a set of 15 loci spanning over 280 Kb (prd, h, eve, cad, oc, opa, ftz, gt, hb, Kr, odd, run, fkh, tll, os
). To account for long genomic regions where no binding is observed in vivo, we enhanced the train and test sets by adding three and five control loci, spanning 100 Kb and 221 Kb, respectively (see ).
Genes and coordinates for train and test set loci.
Quantitative Comparisons of Model Predictions to In Vivo Binding Data
We began with the simplest model – a single transcription factor binding to DNA. This required optimizing only a single parameter, P(t), for each factor, corresponding to its effective concentration in nuclei. After training, the correlation coefficients between the model's predicted binding and the training data range from poor (CAD, 0.11) to reasonable (BCD, 0.58) with a total correlation of 0.37. Correlations for the test set were similar (suggesting that the training data were not over fit), and ranged from 0.15 (GT) to 0.66 (BCD), with a total correlation of 0.36 ( and Table S1
). shows the model's predictions and the in vivo data for several test loci.
Factor-specific accuracy at increasing degrees of model complexity.
High-resolution predictions of protein-DNA binding landscape.
Minimal Effect of Factor Competition in Predicting In Vivo Binding
Encouraged by results with single factors, we examined the effect of competition between the five factors on our ability to predict in vivo binding. Overlapping binding sites are known to modulate direct competition between factors for some individual cases 
. Moreover, overlapping sites are often conserved at long evolutionary distances 
, suggesting an important mechanistic role for inter-factor competition. We expanded the gHMM in our model to consider all five transcription factors simultaneously in a probabilistic framework, where the different concentration of each factor t
is modeled by an additional probabilistic term P(t). In the single factor model binding of one factor to a site did not affect the binding of a different factor to the same site. In this new model, however, because the total occupancy at a site cannot exceed 1, factors effectively compete for binding to overlapping target sites. While other early regulators bind many of the same regulatory regions as these five factors 
, we have not included them in our model to allow for a focused examination of the factors that are most closely implicated in functioning together and because of a lack of complete DNA binding specificity data.
To our surprise, the five-factor competitive model gave slightly less accurate predictions than its single factor counterpart. On the test data, we observed decrease in the model predictions from a total correlation of 0.36 to 0.33 (see and for full data).
Prediction accuracy at increasing degrees of model complexity.
Expanding the Model to Three Dimensions with Single Nucleus Resolution
One potential explanation for the lack of improvement for the above competition model could have been that, because we were treating the embryo as a homogenous entity, it allowed competition between factors that are expressed together at high levels in few nuclei (e.g. GT and KR, BCD and CAD, or KR and HB).
Predicting binding in single-nucleus resolution.
We therefore expanded our algorithm to model the binding of all factors in each of the ~6000 nuclei separately. We used single nucleus estimates of protein concentration, based on three dimensional fluorescence microscopy of D. melanogaster
embryos at early stage 5 
to scale the optimized concentration parameters of the five transcription factors. Specifically at each run, we multiplied the concentrations P(t) of every regulator by its protein expression level. We then averaged the predicted binding landscape of all nuclei, to obtain whole-embryo genomic predictions, which were compared to (whole-embryo) in vivo binding measurements (). The results were slightly improved relatively to the whole-embryo predictions (3–4% improvement on the training (0.34) and test (0.34) sets), presumably because some inappropriate competition events had been eliminated from the model.
Further analysis suggests why including binding site competition did not have a major affect on the predictive power of the model. The stronger affinity recognition sites for proteins that are co-expressed in the same cells, i.e. those having the potential to significantly affect net occupancy of a protein in vivo, overlap in only a minority of cases in the training or test genomic regions. Thus while binding site competition between the five factors may well play a key regulatory role at a subset of sites in a subset of cells as previously proposed 
, it is unlikely to have a major effect on net occupancy averaged across all cells and many genomic regions.
DNA Accessibility Data Greatly Improve Binding Predictions
Even if direct competition between the five factors does not play a major role in shaping their binding landscape, other mechanisms, including interactions with nucleosomes and covalent modifications to nearby histones 
, could allow transcription factors to affect binding to neighboring sites.
When comparing our computational binding predictions () to experimental maps of in vivo binding, we found that the majority of highly bound regions in vivo were indeed correctly predicted (133 of 240 peaks were correctly predicted, for all factors, with specific percentages varying from 30% (KR, 14/46 peaks) to 86% (BCD, 36/42)). Nonetheless, our model suffered from many false positive predictions (), a common problem when attempting to predict in vivo binding from DNA sequence 
. It is widely accepted that a major source of such over-prediction is that many potential high-affinity binding sites are found in regions of the genome that are densely packed into chromatin states that limit their accessibility to regulatory proteins 
and that information on the state of the chromatin can be used to improve prediction of transcription factor binding 
. In some cases, the weak sequence preferences of nucleosomes have been used to predict nucleosomes positions and thereby improve binding site or gene expression predictions 
. Alternatively, histone modification data has been used to identify genomic regions with putative regulatory function 
. These previous methods, however, did not make use of direct experimental measurements of DNA accessibility.
Predictions with and without DNA accessibility prior.
To measure the influence of chromatin state on the accuracy of our model's binding predictions, we turned to two complementary perspectives. First, we integrated the exact position of nucleosomes into our model at single nucleotide resolution to enable the competition between transcription factors and nucleosomes in binding DNA to be modeled 
. As there are no direct measurements of nucleosome positions from early Drosophila embryos, we modeled these computationally (see below). Secondly, because chromatin accessibility results from the combined effect of nucleosomes, other chromatin binding proteins and the higher-order 3D packaging of the DNA, we used direct genome-wide measurements of DNA accessibility obtained from DNase I digestion of chromatin in isolated blastoderm embryo nuclei 
. We then quantified the effect of these two ways of assessing chromatin state on predicting the binding landscape in turn.
Predicting Nucleosome Positions Does Not Improve Model Predictions
In analyzing the role of nucleosome positioning, we were particularly interested in the possibility that the binding of one factor would reduce local nucleosome occupancy and therefore increase the occupancy of other factors at adjacent sites, so called indirect cooperative DNA binding 
. To investigate this, we extended the cellular resolution (3D) model by incorporating an additional state to represent the 141 bp-long sequence bound by a single nucleosome.
Due to uncertainty in the literature about the contribution of sequence specificity to in vivo nucleosome positioning 
, we decided to evaluate two different ways of incorporating positioned nucleosomes into our models. First, we used a sequence-specific model of nucleosome binding, that takes into account presumed preferences for certain DNA sequence features 
as an additional state of our generalized hidden Markov models. This addition did not improve the predictions of our model, when comparing our predictions to in vivo measurements, obtaining correlations of 0.35 and 0.33, on the training and test sets, respectively ().
As an alternative, we used a sequence-independent model of nucleosome binding, where nucleosome are viewed as long “space-fillers” that, when present, prevent regulators from binding. This model obtained correlations of 0.37 and 0.36 on the training and test data, respectively, an improvement of 6-7% over the non-nucleosomal 3D model (see and ). Yet, even this model fails to significantly outperform the naïve algorithm of modeling a single-regulator at a time (no competition and no 3D resolution).
Direct Measurements of Chromatin Accessibility
To more directly quantify the effect of chromatin accessibility on in vivo binding, we used genome-wide measurements of DNA accessibility obtained from DNase I digestion of isolated blastoderm embryo nuclei 
, followed by deep sequencing of the short DNA fragments that were released. When comparing these accessibility data to the predictions of our model, we found that genomic loci for which our model correctly predicts DNA binding based on sequence, namely true-positive loci, tend to have higher DNA accessibility (mean DNA accessibility of 0.36, with 25th
percentiles at 0.10 and 0.60, respectively. See Methods
), while false-positive loci (for which we predicted stronger binding than actually measured) typically displayed much lower DNA accessibility (mean DNA accessibility of 0.06, with 25th
percentiles at 0.01 and 0.05, respectively).
Based on this observation, we next partitioned the genome based on DNA accessibility (with a threshold of 0.5), and compared the true positive rates (percent of predicted peaks that match measured ones) at high- and low-accessibility regions. Our model obtained high rates among the accessible regions (50–100% true positive rates for the different factors, with a total average of 77%), compared to much lower accuracy (true positive rates of 15–52%, with a total average of 43%) over the low accessibility regions. This differential accessibility is presented in and Figure S4B
, where most of the false-positive predictions of the model (top-left corner) indeed show low accessibility.
To better quantify the role of accessibility in factor targeting, we leveraged the statistical framework of generalized hidden Markov models to incorporate DNA accessibility into the model as a non-uniform prior probability of regulatory binding along the genome – with regions of low accessibility having a greatly reduced probability of binding. The incorporation of differential DNA accessibility dramatically boosted the model's accuracy by a twofold for both the training set correlations (from 0.37 using the first, single-factor, model to 0.75) and the test set predictions (to a correlation of 0.70), with factor-specific correlation varying from 0.57 (HB) to 0.81 (BCD) (see , and and ).
To further control for any possible experimental biases that might be present in the DNase I accessibility data, we also ran our model with DNase I data from a much later developmental stage, after cellular differentiation (embryonic stage 14). The highly bound genomic regions in our training and test sets are mostly comprised of loci that control early developmental patterning at stage 5 and which have been shown to have dramatically lower DNA accessibility at stage 14 (Thomas et al, unpublished data). Consistent with this, when we used DNase I data from the wrong developmental stage to estimate prior probabilities of transcription factor binding, the model's accuracy deteriorated (correlations of 0.36 and 0.28 on the train and test set, respectively), and was similar to the initial runs, where no DNase I data was used at all. Thus, the DNAse I data for stage 5 largely comprises a measurement of developmentally regulated chromatin accessibility, free from any major constitutive experimental bias.
BAC Spike-Ins Results Set the Highest Obtainable Correlation at 0.92
Our scoring system for the accuracy of our model uses the correlation coefficient between predicted and measured binding. We returned to the BAC spike-in data described earlier (Figure S2
) to assess the experimental noise introduced into the ChIP-seq experiments by the amplification and sequencing of immunoprecipitated DNA as this would allow us to estimate the maximal correlation possible. We compared the binding landscape measured by the original ChIP data with versions of the same data into which noise was artificially introduced in levels following the BAC data (Methods
). This resulted in an average correlation of 0.92 over 100 random perturbations, suggesting that even if our computational model were to perfectly predict the occupancies of proteins in vivo, the maximum obtainable correlation would be ~0.92. Because additional experimental variation is likely to be introduced into the ChIP-seq data by differences in crosslinking efficiency and immunoprecipitation and because the DNase-seq data must also contain noise, we suspect that the true maximal correlation achievable is probably somewhat lower than this.
Thermodynamic Modeling Via Boltzmann Ensembles Captures Cooperative Binding
Although our predictions with DNA accessibility data were good, especially in light of the above estimates of experimental noise, we sought to further refine our model by considering more complex types of factor-factor interactions than the simple direct competition (via overlapping recognition sites) described above. For example, direct physical interactions between transcription factors bound at neighboring recognition sites has often been found to increase the occupancy of one or both proteins on DNA, both for homomeric and heteromeric interactions 
, and to sharpen the regulatory response to changes in transcription factor concentration 
Generalized hidden Markov models, however, have limited ability to model the broader context of binding events, including interactions between neighboring sites. Therefore we added a second, sampling-based, phase to our computational model. In this phase, a large ensemble of binding configurations is sampled, each with a different set of protein-DNA interactions. The probability of each configuration is then estimated based on all pairs of nearby occupied sites (up to 95 bp apart) and the parameterized energetic gain of each pair. Finally, the overall binding probability at each position is quantified as a weighted sum of all sampled configurations ().
Thermodynamic modeling of cooperative interactions.
By adopting a statistical mechanics perspective, the exponential space of protein-DNA binding configurations can be viewed as a canonical ensemble in a thermodynamic equilibrium. Here, the probability of each configuration is directly linked to its energetic state, including direct protein-DNA interactions, steric hindrance constraints and cooperative interactions with neighboring factors 
We extended our model to capture cooperativity using a novel set of 15 parameters (one for each non-redundant pair of the five factors), modeling the energy gain for the nearby binding of every possible pair of the five transcriptional regulators in our model.
To predict binding in this new thermodynamic setting, we first used the generalized hidden Markov model in 3D to analyze the sequence and binding competition, and calculate an approximate map of binding. We then used the predicted binding probabilities to sample likely protein-DNA binding configurations, and re-weighted them to account for additional energetic gain, as modeled via protein-protein cooperative binding interactions 
. Finally, we averaged over these weighted configurations to predict a 3D map of binding. Using this combination of the gHMM followed by importance-weighted sampling, we were able to approximate the full thermodynamic landscape of binding, using a fast framework with few parameters. These cooperativity parameters, as well as their range of effect, were optimized based on the training set of genomic loci, using multiple random runs of a gradient-based trust-region optimization algorithm 
. The optimized set of cooperative binding parameters includes predictions of interactions between many homomeric and heteromeric pairs. However, these cooperativity parameters only improved the predictive power of the model by <5%, giving a correlation of 0.79 over the training set, with high accuracy binding predictions for all factors, ranging from 0.70 (CAD) to 0.9 (BCD). The model's accuracy over test set was 0.70, ranging from ~0.6 (HB) to 0.83 (BCD), a marginal improvement of <1% over the Markovian approach (see for full details). To further establish this result, we reran this procedure, this time allowing for both stabilizing and destabilizing interactions. Although we now identified additional interacting protein-protein pairs, the resulting correlations remained similar. Thus, our model suggests that cooperative interactions have a rather limited contribution in shaping the genomic landscape of in vivo binding.
Quantitative Estimations of Various Determinants of Transcriptional Regulation
We have presented here a series of increasingly complex models. The simplest model, using only the DNA sequence and in vitro binding data, obtained a correlation coefficient of 0.36 (on test data). After adding single-nucleus 3D modeling, as well as modeling binding competition with additional transcription factors and nucleosomes, the model's accuracy did not improve. We then incorporated experimental data of the DNA accessibility into the model, and boosted its accuracy to 0.70. Finally, we added thermodynamic parameters to model cooperative binding of neighboring factors, without a significant improvement in the model's accuracy (see and ).
These estimators suggest that while sequence per se is responsible for approximately half of our predictive power, and DNA accessibility contributes the other half, the effect of modeling both binding competition and cooperative interactions is rather minor, and is estimated at the order of 1% of our predictive power.
To quantify the contribution of various transcriptional determinants on the binding landscape, we ran our model in all 16 possible configurations, using or not using: (1) sequence, (2) 3D competition, (3) DNA accessibility data, and (4) cooperativity (see Table S1
). We then averaged over particular combinations to quantify the direct contribution of each of the four regulatory determinants. Once again, the relative contribution of each determinant was similar. For example, to quantify the importance of sequence per se, we compared the accuracy of the model in eight combinations (spanning all binary combinations 3D competition, accessibility and cooperativity) to the predictions obtained at the same configuration, only with randomly shuffled DNA sequences. The average difference in correlation over eight configuration pairs was 0.33 (±0.03), suggesting that DNA sequence per se does contribute half of our total predictive power. Similarly, the integration of DNA accessibility resulted in an increase of 0.33 (±0.03) in the model's accuracy (or about a half of its predictive power). The consistency of these estimates over different configurations suggests that the different regulatory determinants are largely independent of each other. Thus, the results suggest that binding within open chromatin is mostly controlled by DNA sequence, with a minimal role for direct cooperative or competitive interactions among factors.
Genome-Wide Prediction of Protein Binding
In the work presented so far, our analysis focused on regions of strong, functional binding, in the proximity of 21 known targets of the A-P patterning system and an additional eight control regions. To quantify the accuracy of the model on a larger genomic scale, we used the DNase I as a proxy, and identified the most accessible 2% of the genome (top 170 DNase I peaks, as called by an iterative peak-fitting algorithm 
, and expanded each peak by ~10 Kb to each side in a similar manner to the initial set of 21 A-P loci). When applying the model to these open regions, it obtained an accuracy of 0.44, with various factors ranging from 0.38 (CAD) to 0.57 (BCD). Finally, we applied the model to the entirety of chromosome 2R (which was not included in our train set loci). Here, the model's accuracy was measured at 0.33 (or at 0.1 when no DNase I-based prior was used).
While these genomic scale predictions are not as accurate as the ones for the functional train and test set loci, they still offer a crude estimate to the genomic locations where binding occurs. The known target loci are inevitably biased towards genomic regions that are more accessible and more highly bound than the genome average (compare and Figure S4
). In that sense, the DNase I peaks offer an unbiased proxy to the regions of the genome where chromatin accessibility plays a lesser role in shaping transcription factor binding in vivo. Moreover, part of the reason for the lower apparent performance of the model at the whole chromosome level could, thus, be technical, due to lower signal-to-noise ratios in the empirical data for the poorly bound, less accessible portions of the genome.
To further understand the decrease in the model's accuracy when shifting from an annotated set of known target loci to chromosome-wide applications, we compared the model's accuracy to the expression level of the underlying gene. We applied the model to predict the genome-wide binding across 4,251 genes for which we had gene expression data from the same developmental stage, taking 10 Kb regions around each gene 
. We found a strong relation between the model's ability to accurately predict the binding landscape of transcription factors along the gene, and its expression level (Figure S5
). These results suggest that while the model is applicable for genome-wide applications, it is mostly useful in predicting binding near actively transcribed genes and highly accessible genomic regions.