Transcriptional regulation is a central control mechanism for many biological processes. Transcriptional regulation generally involves DNA-binding proteins, transcription factors (TFs), that control gene expression by binding to short regulatory sequence motifs in gene promoters 
. DNA-binding specificities of TFs are encoded in their DNA-binding domains that specialize them to recognize and bind specific types of binding sites. This mechanism is the basis of control in complex transcriptional regulatory networks. Revealing these regulatory mechanisms is one of the key problems in understanding genome-wide transcriptional regulation. Although experimental studies and computational approaches are extending our knowledge of TF binding specificities, relatively little is known about genome-wide binding of TFs to gene promoters. Thus, TF binding prediction remains an important problem in computational biology.
Computational approaches to TF binding site analysis can be divided into two categories, discovery
. Motif discovery
focuses on searching for novel binding motifs from a collection of short sequences that are assumed to contain a common regulatory motif. Several algorithms have been proposed for motif discovery (for a recent review and comparison, see 
). Accurate motif discovery is difficult in general, but incorporating additional information to guide the search for novel sequence signals can improve performance. Such additional data sources include, among others, information about co-regulated genes 
, evolutionary conservation 
, physical binding locations as measured by chromatin immunoprecipitation on chip (ChIP-chip) 
, information on the structural class of TFs 
, and nucleosome occupancies 
TF binding prediction
, in turn, makes use of given DNA-binding specificities to predict putative TF binding sites. The binding preferences can either be the output of a motif discovery algorithm or they can be experimentally measured, such as those reported in curated databases (TRANSFAC 
and JASPAR 
). Regardless of the data source, binding site prediction typically requires some information about binding specificities and is therefore dependent on previous analysis. Current knowledge of binding preferences already allows useful predictions to be made genome-wide. Moreover, several novel measurement techniques to measure DNA-binding specificities have recently been developed 
. For example, Berger et al. 
have developed a protein binding microarray (PBM) technology to measure binding preferences to all k
currently being 10 base pairs. These new techniques are rapidly expanding currently available databases by providing estimates of binding specificities of virtually any TF in a high-throughput manner. Consequently, they also offer an approach for rapid and sensitive identification of all TF binding sites genome-wide. In particular, high-throughput screening of TF binding specificities combined with accurate TF binding prediction provides a viable, condition independent, alternative to somewhat complex ChIP-chip experiments 
. At the same time, however, there is a growing need for accurate TF binding prediction methods.
Although motif discovery methods are relatively well-developed, the TF binding prediction problem has attracted less attention. Most of the previous binding site prediction tools have been formulated as hypothesis testing methods, where a significance value of TF binding at a specific sequence position is obtained by comparing a test statistic to a null distribution 
, and possibly correcting the significance level for multiple testing. Traditional scanning methods for TF binding site prediction are known to perform relatively poorly in that they typically have an excessively high false positive rate (see 
). This reported poor performance is not directly a shortcoming of previous prediction methods but has more to do with the fact that models to represent binding motifs and background sequences alone do not contain sufficient information for accurate binding site detection. This suggests that one possible approach to improve binding site prediction is to develop better motif (and background) models than the currently used position specific frequency model (PSFM) for binding sites and Markovian models for background. For example, observed dependencies between binding site nucleotides 
can be incorporated into motif models 
. However, the use of more complex models, such as general Bayesian networks, is found to be challenging 
. While developing better motif and background models is important, another more general direction aims at making use of several additional information sources, in a similar manner as has been done in the context of motif discovery (see above) to improve TF binding prediction.
Here, we formulate a probabilistic framework for TF binding prediction that differs from the standard hypothesis testing approaches in three important ways. First, the proposed framework is probabilistic in nature and thus outputs a probability of binding (as opposed to a p-value), which directly reflects our belief of gene's promoter having a binding site. We introduce both likelihood and Bayesian inference methods that naturally allow regularization via various prior distributions. Secondly, the proposed method answers the question of whether the whole promoter has a binding site, as opposed to reporting a p-value for every possible position in the sequence. But it is also straightforward to modify our methods to estimate TF binding separately for each base pair position as well, which we also consider. Since we process each promoter as a whole, in addition to assessing physical TF binding at the individual locations, our computational predictions provide insights into the functional role of a TF in the regulatory program of a target gene. The rationale for this is the fact that the higher binding probability anywhere on the promoter (not just in a particular location) implies higher probability of a regulatory relationship. Thirdly, and most importantly, we propose a principled way of combining multiple data sources, such as evolutionary conservation, regulatory potential, CpG islands, nucleosome positioning, DNase hypersensitive sites, ChIP-chip, and other prior knowledge, into a unified probabilistic framework. Moreover, the proposed data fusion framework is extremely versatile and thus, allows incorporating practically any additional (future) information sources that are indicative of TF binding sites at the genome level.
To validate our computational methods, we constructed a test set of annotated binding sites in mouse promoters from existing databases 
. We demonstrate that our probabilistic inference framework significantly improves TF binding predictions. We also test our probabilistic inference method by applying it to all known mouse promoters. These genome-wide results, which are made publicly available, indicate a sparse connectivity between transcriptional regulators and their target promoters. To provide easy access to this method, we have also implemented a web tool, ProbTF, which allows users to analyze their own promoter sequences and additional data sources.
Because our proposed computational framework is based on probabilistic modeling, it provides an intuitive interpretation (i.e., probabilities, not p-values). Our probabilistic formulation also provides regularization to the inference problem via several informative prior distributions, hence further improving performance. Results on a carefully constructed test set show that the proposed computational methods significantly improve performance when compared to previous binding site prediction methods. This is partly due to the fact that each promoter sequence is analyzed as a whole (i.e., TF binding is not assessed independently at each nucleotide position), taking advantage of multiple binding sites. The most important ingredient, however, is the principled incorporation of multiple additional data sources.
We construct our basic probabilistic framework using commonly used models for binding and non-binding sites, although generalizations to more complex models are straightforward. Consequently, our initial formulation is similar to other previously proposed TF binding prediction methods 
and has perhaps even more in common with general motif discovery methods 
. Notably, the differences are that we further extend our basic TF binding prediction method into a Bayesian setting, incorporate multiple motif models, consider combinatorial regulation with multiple TFs, combine both forward and reverse strands into the modeling framework, provide a way to simultaneously estimate binding probabilities to the whole promoter and at single nucleotide resolution and, most importantly, provide a principled statistical integration of multiple data sources.
Although applications that combine binding site prediction with other data sources, especially “phylogenetic footprinting,” are plentiful (see e.g 
, or 
for review), we are not aware of other probabilistic frameworks for TF binding modeling that can combine several additional data sources at the genome level. Related probabilistic data fusion approaches to TF binding prediction are introduced in 
. For example, the method of Beyer et al. 
provides a general, higher-level, naive Bayes approach to integrate ChIP-chip data with several additional evidence sources. Our approach is different in that we integrate multiple data sources at the genome level, which is indeed necessary for incorporating nucleosome information, regulatory potentials, etc. We also note that our probabilistic predictions can be further used in other methods, such as the one of 
. Other related previous methods include, among others, probabilistic methods for combining sequences and microarray data 
, general frameworks for data integration (see 
), and a supervised method for binding site prediction 
. The most closely related previous method is that by Thijs et al. 
, although that was originally introduced in the context of motif discovery and without an option for data fusion. Note that depending on what additional data sources are available, our method can be applied with either zero or any number of additional information sources.
Although our general aim is to integrate as many lines of evidence as possible into TF binding prediction, we restrict our focus to those data sources that contain useful information for TF binding at the genome level. For example, we do not consider functional data sources, such as gene expression or protein level measurements, or descriptive higher-level information, such as gene ontology. Despite the fact that gene or protein expression measurements alone can be informative of transcriptional regulatory relationships and, therefore, indirectly informative of TF binding as well, we will not include those data sources into our modeling framework. For example, proper modeling of gene expression or protein level measurements does inevitably require the use of a predictive network model, such as a (dynamic) Bayesian network 
, a system of ordinary differential equations or a stochastic kinetic model 
. However, even more comprehensive modeling approaches that integrate all sequence level data sources with functional measurements and gene ontology information become much easier to tackle because our TF binding prediction are probabilistic. Probabilistic methods similar to those presented in 
are practically straightforward to apply using our sequence level predictions as a building block (see ‘Discussion’ Section for more discussion). From a more general point of view, our approach has much in common with general strategies to infer transcriptional regulation from multiple data sources (see 
for representative examples). We expect that the method described herein will also prove to be a useful building block in comprehensive transcriptional regulatory modeling.