The core principle applied in this study is that in solution and at equilibrium, the proportion of a particular TFBS sequence that is bound to a transcription factor (the probability of binding) is determined by the dissociation constant between the sequence and transcription factor, and the concentration of free transcription factor. Therefore, if one knows the relative frequencies of different sequences in a mixture prior to binding, one can predict the relative frequencies of those sequences that are bound to the transcription factor. A difficulty arises, in that the free concentration of transcription factor is generally unknown, and if the equilibrium constants are also unknown then there are more unknowns than data points (i.e., ds-oligo sequence counts in the bound solution). This problem can be easily solved by including at least two reference sequences with known dissociation constants in the basic (BBM) system, and if an energy model (GEM) is used, only one reference sequence is required. To ensure accuracy, it is also best if the counts for the reference sequences are relatively high. Although one could first solve for the unknowns using the reference sequence data, we have chosen to use a flexible Bayesian approach that can account for errors and uncertainty in all data points and allows inclusion of as many reference sequences as are available.
One of the main benefits of the Bayesian approach is that it can incorporate information from multiple sources, all with varying degrees of uncertainty. In the simple graphical system that does not model the binding energy (), the sources of information are the counts of sequences before the transcription factor is bound (
) and the counts in the fraction bound to the transcription factor (
). It is assumed that these counts are random draws from a multinomial based on the true underlying proportions of the sequences in the two solutions, respectively
. These counts thus constrain the reasonable range of values for the underlying proportions, with higher counts constraining the proportions more than lower counts. The ds-oligo proportions in the bound solution containing transcription factor are modified from the pre-bound proportions based on the sequence-specific probability of binding the transcription factor. The reference sequences, with known dissociation constants, allow the information about the proportions of sequences in the two solutions to be translated back into information about the dissociation constants, and thus the binding energy of each sequence
through the relationship
for details). Inclusion of at least two reference sequences ensures the identifiability of all unknown parameters.
An interesting illustration of the flexibility of our Bayesian approach is our use of “generative energy models” (GEMs). With these models, information is communicated across ds-oligo sequences to predict the binding energy. In GEMs, the binding energy is considered to be composed of independent position-specific energy components, as well as interaction energy components 
. The energy model may not be perfectly accurate, however, and the error term for the energy model allows for automated weighting of the accuracy of the energy-based predictions and the relative frequency-based predictions in coming to a joint posterior prediction of each sequence-specific dissociation constant. This weighting allows an easy switch from a basic method that relies on reference ds-oligos with known
s and the counts alone, if they are high enough, to the GEM portion of the model, if the error in the GEM portion is low enough.
There are similarities between the GEM and a position weight matrix (PWM) model using binding energies since both are (typically) estimated using independent energy contributions for each position 
. The main difference is that in PWM estimation, the free
concentration is assumed to be low. If this assumption is inaccurate the PWM will return an incorrect energy model. Therefore, an advantage of approaches such as ours that allow
to vary, and to be estimated in the model-fitting process (see also 
), is that a more accurate energy model can be determined without restrictive assumptions. In addition, our method also contrasts with PWM fitting in the use of reversible jumps between models of different dimensions to allow simultaneous estimates of the parameters of interest (the binding energies) while only including those energy components that are statistically well-justified.
We have also proposed a separate “generative sequence model”, or GSM, which allows us to use the frequencies of shorter sub-sequences (polynucleotides) to model the frequency of sequences in solution prior to binding. This is useful in cases (such as the Zif268 Bind-n-Seq data) where there are so many pre-bound sequences that most of them individually have extremely low frequencies and are therefore difficult to count. For example, as we showed for the Bind-n-Seq data 
, the relative frequencies of their “randomly”-generated ds-oligos of even 8 bp in length can differ by many orders of magnitude. The range of frequencies for longer ds-oligos is even greater, and thus if accuracy is a concern it may be impossible to obtain sufficient coverage of all ds-oligos in a highly complex mixture. The sequencing requirements for accurate frequency estimates can easily be far beyond the capacity of even the best modern high-throughput sequencers. The GSM ameliorates this problem to some extent by providing improved frequency estimators for ds-oligos with low sequence counts, and may be further improved by discovering and incorporating other general rules to predict frequency variation. In the meantime, careful experimental design to reduce the number of sequences in the mixture and focus on particular experimental questions is also important if accuracy is a concern.
Two recently proposed approaches similar to the SULDEX experimental system, Bind-n-Seq 
and HT-SELEX 
, also utilize ds-oligos sequenced before and after binding to a transcription factor. The Bind-n-Seq analysis was designed to qualitatively identify binding site preferences, and assumes a simple PWM model with no attempt to model the biophysical energy terms or the relationship between the pre-bound and bound solution counts. As we have noted here, the experimental results are suboptimal for the purpose of measuring binding energies primarily because binding sites are created randomly in the varying context of a much longer random ds-oligonucleotide. This variable context can affect the probability of ds-oligo synthesis and the probability of binding, adding noise and bias to the results. The HT-SELEX approach, in contrast, has a similar aim to the SULDEX approach, and uses the BEEML program to estimate transcription factor concentrations and energy terms using a maximum likelihood approach and a nearly identical biophysical model. Key advantageous features of our approach include: 1) SULDEX uses a Bayesian approach and MCMC rather than a maximum likelihood approach, resulting in posterior distributions and an automatic estimate of uncertainty rather than only a point estimate. Bayesian and MCMC approaches based on biophysical models of transcription factor binding have previously been used to analyze ChIP-chip or PBM data 
, and are particularly useful for parameter-rich models such as these; 2) Unlike HT-SELEX, SULDEX allows estimation of binding energies directly from relative frequencies or in combination with an energy model, and allows incorporation of reference sequences with known binding energies. Thus, while both approaches include an energy model error term, in SULDEX this term allows reduced dependence on the energy model assumptions when binding energies are inferred. Binding energy inference will flexibly depend more on the energy model for low frequency ds-oligos with insufficient sequencing to be estimated by the BBM approach alone; 3) SULDEX uses the multinomial distribution to calculate the probabilities of observed counts given underlying ds-oligo frequencies in solution, while HT-SELEX's BEEML program uses a Gaussian approximation. This is most likely to matter when some counts are small, which will often occur due to the high variance of starting ds-oligo frequencies. BEEML also uses energy level discretization to approximate the partition function; 4) We have presented here a model (GSM) to predict relative ds-oligo frequencies in the pre-bound fraction, and shown that the simple independent nucleotide frequency model used in HT-SELEX/BEEML can be highly inaccurate. The GSM should be further developed to increase accuracy when data from a larger number of pre-bound ds-oligo synthesis experiments become available; 5) the SULDEX method allows simultaneous inference of model complexity and parameter values by incorporating a reversible jump MCMC approach, thus avoiding potential problems of model over-specification that can lead maximum likelihood estimators (such as those in BEEML) to focus on noise and thus reduce prediction accuracy.
We expect that a future application of this and related approaches will be to allow detailed study of how binding energies are influenced by nucleotide variation in the binding site. In particular, the simple additive energy model is insufficient to accurately predict dissociation constants for many proteins, and interactive energy terms will therefore be needed. The number of possible interactive energy terms can be quite large, and their statistical justification and usefulness should be carefully considered. The reversible jump MCMC approach applied here can be used for this purpose and allows detailed biophysical models to be developed, evaluated, and compared without making a large number of assumptions. However, to explore interaction terms, it will be important to implement careful experimental design for generating data. For the datasets currently available (Bind-n-Seq and HT-SELEX), we do not believe that the pre-bound frequencies can be reliably determined, and for the Bind-n-Seq data the binding probabilities are confounded by variable context and possible multiple binding opportunities per ds-oligo. Thus, it seems quite possible that estimation of higher-order interactions in these datasets could be thoroughly confounded by these other effects. For example, in our own focused experiments using mitochondrial transcription factor A protein (mtTFA; unpublished data), which has moderate sequence specificity 
, we found that its multiple binding modes and multimers that form on the ds-oligos, as well as the context of varying sequence well outside the binding site, can deeply confound interpretation. Experiments that therefore focus on a sub-sample of possible ds-oligos close to the consensus or optimal ds-oligo can produce precise estimates of binding energies without relying on a model, and these may then be used to better elucidate reasonable model structures.
Characterizing the binding potential of target binding sites for a transcription factor within a species has immediate benefit for understanding transcriptional regulation for a particular system, but applying this strategy across multiple species may have more widespread impact. In particular, to understand morphological evolution, it is necessary to have a clear idea of the relationship between transcription factors, their strength of binding to a wide range of targets, and how the binding energy relationships change as transcription factors evolve. Future implementations can also include important factors such as cooperative binding and interaction with repressors. We therefore expect that exploiting methods for determining binding energies across species will lead to substantial impact in other fields and in understanding to approach questions relating to disease pathology, evolutionary adaptation, and speciation.