DNA-binding specificity of transcription factors (TFs) is traditionally viewed as consisting of a direct and an indirect readout component, and the proportion between them differs from one TF to another (1). The direct readout mechanism is well defined and involves recognition of specific DNA bases by amino acids. However, there is no deterministic recognition code for the interaction between DNA and protein sequences, essentially because of the influence of the three-dimensional (3D) structures of both macromolecules. The influence of the structure of the DNA-binding domain of the TF on the direct recognition code has been clearly shown for some TFs (2). If DNA-binding specificity were determined only by direct readout, then a probabilistic approach to TF–DNA recognition would suffice. The direct readout does not, however, fully explain the observed variety of sequence composition and binding affinity of binding sites for a specific TF (3). This is where the indirect readout mechanism comes in. Indirect readout is much less well defined but takes into consideration protein–DNA interactions that depend on base pairs that are not directly contacted by the protein. These protein–DNA interactions essentially reflect the influence of the structure and thermodynamic properties of the DNA before or upon binding by the TF. DNA is flexible and exhibits sequence-dependent deviations from the idealized B-DNA structure: the deviations arise from the stacking interactions of successive dinucleotides (4). These structural details have usually been neglected in the analysis of TF–DNA interactions: a probabilistic approach to direct readout is most commonly used as the sole component for prediction of transcription factor binding sites (TFBSs), with varying degrees of success. Rohs et al. recently emphasized the importance of the 3D structures of both macromolecules. Direct readout and indirect readout were renamed as base readout and shape readout, respectively. Base readout was subdivided according to either the major or the minor groove of the DNA, whereas shape readout was subdivided into global and local shape recognition. It was argued that individual TFs combine multiple readout mechanisms to achieve DNA-binding specificity.
Methods for identifying TFBSs can be classified into two main groups on the basis of the type of data used to model the TF–DNA binding specificity. Sequence-based methods model the binding specificity from a collection of aligned sequences known to bind the TF in vitro or in vivo. Structure-based methods use information from available crystal structures of TF–DNA complexes [reviewed in Ref. (7)]. Most sequence-based methods treat DNA as a uniform static structure that is independent of the nucleotide sequence. For example, the widely used position weight matrix (PWM) method (8) takes into account only the nucleotide frequency at each position of the TFBS and assumes independence between those positions. The assumption that the nucleotides contribute to the binding affinity of TFs independently of each other is called the ‘additivity’ assumption. Based on theoretical concerns and a few experiments for some TFs (9–12), the correctness of this assumption and the quality of the approximation it yields have been debated in recent years (13–15). Recently, thanks to larger amounts of experimental data, it was shown that for most TFs, dependencies exist between nucleotide positions in their binding sites (16). This is to be expected, because it has been suggested that nucleotide positional dependencies observed within TFBSs arise from the structure and biophysical interactions of unbound and TF-bound DNA (15). Nucleotide positional dependencies are symptoms of shape readout rather than base readout. Nowadays, many sequence-based methods try to model nucleotide dependencies between positions, and thus they implicitly recognize the structural aspects of TF–DNA binding. They improve accuracy over the classic PWM method for most TFs [e.g. Refs (17–20)]. A few publications present sequence-based methods that use sequence-dependent structural characteristics explicitly (21–28). Some of these methods, e.g. (25), report higher accuracies than those obtained by methods that model only nucleotide dependencies. Structure-based methods, by definition, take into account at least some structural characteristics of TF–DNA binding. Some of these methods are valuable for comparative modeling and they seem promising for TFBS prediction as well [e.g. (7)]. However, none of the structure-based methods has yet offered a substantial improvement on the PWM method.
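The additivity assumption behind the PWM method described above can be made concrete with a small sketch. The code below is an illustration only, not the implementation used in this work or in Ref. (8): the log-odds formulation, the pseudocount of 1, the uniform background and the function names (`build_pwm`, `score`, `best_window`) are all assumptions chosen for the example, and the binding sites are toy sequences rather than real data.

```python
import math
from collections import Counter

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=None):
    """Build a log2-odds PWM from aligned, equal-length binding sites.

    Assumptions for this sketch: a uniform background (0.25 per base)
    unless one is supplied, and an additive pseudocount per base.
    """
    if background is None:
        background = {b: 0.25 for b in BASES}
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(width):
        # Each column uses only the base counts at this position.
        counts = Counter(site[pos] for site in sites)
        column = {}
        for base in BASES:
            freq = (counts[base] + pseudocount) / total
            column[base] = math.log2(freq / background[base])
        pwm.append(column)
    return pwm


def score(pwm, seq):
    """Additive score: the sum of independent per-position weights."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must match PWM width")
    return sum(column[base] for column, base in zip(pwm, seq))


def best_window(pwm, sequence):
    """Return (score, start) of the highest-scoring window in a sequence."""
    width = len(pwm)
    windows = range(len(sequence) - width + 1)
    return max((score(pwm, sequence[i:i + width]), i) for i in windows)


if __name__ == "__main__":
    # Hypothetical toy sites, not taken from any data set in this article.
    toy_sites = ["TATAAA", "TATAAA", "TATATA", "TATAAG"]
    pwm = build_pwm(toy_sites)
    print(score(pwm, "TATAAA"), score(pwm, "GCGCGC"))
    print(best_window(pwm, "GCGCTATAAAGC"))
```

Because the score is a plain sum over columns, changing the base at one position shifts the total by the same amount regardless of the neighbouring bases. This is exactly the property that models of nucleotide positional dependencies relax.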
In this manuscript, we present a sequence-based method that uses the random forest (RF) algorithm with features that cover either nucleotide positional dependencies or nucleotide sequence-dependent structural characteristics of the TFBS and its flanking sequences. We call the corresponding models the positional dependencies of nucleotides (NPD) model and the structural model. We also let our method combine both models and tried to integrate the PWM score into the combined model. The set of one-type models and combined models presented in this article should be seen as the products of our flexible integrative method, which can easily determine the most appropriate model to use. We measure the accuracy with which our models separate TFBSs from randomly selected genomic sequences, and we compare this accuracy with that of the classic PWM method and of the most recent alternative method, namely CRoSSeD (28). Results are given for five eukaryotic TFs that bind differently to DNA: HIF1 (zipper-type group/Helix–Loop–Helix family), P53 (zinc-coordinating group/Loop–Sheet–Helix family), SP1 (zinc-coordinating group/BetaBetaAlpha-zinc finger family), STAT1 (Stat protein family) and TBP (Beta-sheet group/TATA box-binding family) (30). Our method was also used on seven prokaryotic data sets that were presented along with CRoSSeD (28) and a more recent Fis data set (31