The Cys2His2 ZF proteins represent one of the most studied transcription factor protein families. Their modular structure makes them amenable to statistical and computational approaches for predicting their DNA binding specificities given only their protein sequences. Here, we have presented an SVM-based approach for predicting ZF protein–DNA binding. Whereas most previous computational methods for predicting protein–DNA interactions have used only known binding examples, our approach additionally utilizes examples of proteins known not to bind particular DNA regions. In addition, with a linear SVM, we also use relative binding data in the form of comparative examples.
The canonical model of ZF protein–DNA binding attributes the protein–DNA interaction to only four canonical amino acid-base contacts. This simple model has served well in a number of experimental and theoretical studies and has been confirmed by the majority of co-crystal structures. However, variations in the protein sequence can alter zinc finger binding and can result in reorganization of the DNA-interacting interfaces (Wolfe et al.). Consistent with this, we have found that the polynomial SVM outperforms previous methods, as well as the linear SVM, across a wide range of tests. SVMs with a polynomial kernel map feature vectors into a higher dimensional space, thereby implicitly including higher order interactions not listed in the original canonical model (Luscombe et al.). It is quite possible that certain amino acid residues interact with more than one base in the DNA sequence, thus complicating the sequence recognition pattern. Therefore, the success of the polynomial SVM may indicate a need to adjust the canonical structural model.
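The way a polynomial kernel implicitly includes higher order interactions can be illustrated with a small numerical check. The sketch below assumes a degree-2 kernel of the form K(x, y) = (x · y + 1)^2 and a toy binary vector with one entry per canonical contact; the degree, the constant term, and the feature layout are illustrative assumptions, not the settings used in this study. It shows that the kernel value equals an ordinary dot product in an expanded space that contains a product term for every pair of contacts.

```python
from itertools import combinations
from math import isclose, sqrt


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, y):
    """Degree-2 polynomial kernel K(x, y) = (x . y + 1)^2 (assumed form)."""
    return (dot(x, y) + 1.0) ** 2


def explicit_map(x):
    """Explicit feature map phi such that phi(x) . phi(y) == poly2_kernel(x, y)."""
    feats = [1.0]                                  # constant term
    feats += [sqrt(2.0) * a for a in x]            # single-contact terms
    feats += [a * a for a in x]                    # squared terms
    feats += [sqrt(2.0) * a * b                    # pairwise interaction terms
              for a, b in combinations(x, 2)]
    return feats


# Toy indicator vectors over four hypothetical contact features.
x = [1.0, 0.0, 1.0, 0.0]
y = [1.0, 1.0, 0.0, 0.0]

implicit = poly2_kernel(x, y)
explicit = dot(explicit_map(x), explicit_map(y))
agree = isclose(implicit, explicit)
```

The kernel never builds the expanded vector, yet the classifier behaves as if each pair of contacts had its own weight, which is why pairwise dependencies absent from the canonical model can still influence the prediction.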
Linear SVMs show limited performance when tested on the TRANSFAC database. Most proteins in the high-confidence database used for SVM training were designed on the basis of Zif268. In contrast, the proteins listed in the TRANSFAC database and used for testing are natural ZF proteins and can have sequences significantly different from the Zif268 family. The binding interface of these proteins could therefore differ from that described by the canonical model. This difference may explain the decreased performance of the linear SVM compared with the polynomial model, which implicitly considers alternative contacts. However, the good performance of the linear SVM in cross-validation testing suggests that it can be improved further. In particular, use of the polynomial kernel does not allow the incorporation of relative binding information through comparative examples. If the canonical model were modified to consider higher order interactions explicitly, a linear SVM could again be applied, retaining its advantage of using quantitative and comparative experimental data.
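To make the role of comparative examples concrete, the following minimal sketch shows one common way to fold relative binding data into a linear classifier: a statement that one protein–DNA pair binds more strongly than another becomes a constraint on the difference of their feature vectors. The feature vectors, the subgradient training loop, and all names (`with_bias`, `comparative`, `train_linear`) are hypothetical simplifications introduced for illustration; this is not the solver or the encoding used in this study.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def with_bias(x):
    """Append a constant 1 so that the last weight acts as the bias term."""
    return list(x) + [1.0]


def comparative(x_strong, x_weak):
    """Encode 'x_strong binds better than x_weak' as one labeled vector.

    The bias entries cancel in the difference, so the constraint becomes
    w . (x_strong - x_weak) >= 1, expressed as a +1 example.
    """
    diff = [a - b for a, b in zip(with_bias(x_strong), with_bias(x_weak))]
    return diff, 1.0


def train_linear(examples, epochs=200, lr=0.1, lam=0.01):
    """Subgradient descent on the L2-regularized hinge loss."""
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, y in examples:
            violated = y * dot(w, x) < 1.0   # margin check with current weights
            for i in range(len(w)):
                grad = lam * w[i] - (y * x[i] if violated else 0.0)
                w[i] -= lr * grad
    return w


# Toy contact-indicator vectors (illustrative values, not real binding data).
binder = [1.0, 0.0, 0.0]
non_binder = [0.0, 0.0, 1.0]
better_binder = [1.0, 1.0, 0.0]

examples = [
    (with_bias(binder), 1.0),            # positive example
    (with_bias(non_binder), -1.0),       # negative example
    comparative(better_binder, binder),  # comparative example
]
w = train_linear(examples)


def score(x):
    """Signed score under the learned linear model."""
    return dot(w, with_bias(x))
```

Because the comparative constraint is linear in the weight vector, it fits naturally into a linear SVM. With a polynomial kernel the score is no longer a linear function of the input features, so the difference of two input vectors no longer corresponds to the difference of their scores.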
For the linear SVM, it is possible to examine the learned weights to ascertain which contacts are most important for predicting ZF protein–DNA interactions (see Supplementary Table S3). The contacts originating from the Zif268 protein–DNA complex have large weight vector coordinates, reflecting the prominence of Zif268-derived examples in our training set and suggesting that a likely source of improvement for the linear SVM is the inclusion of data from a more diverse set of ZF proteins. Such data would likely improve the performance of all methods. Interestingly, the Pearson correlation coefficient observed between the linear SVM weight vector coordinates and the weights assigned to the corresponding interactions by other methods is weak; this is also true when considering pairwise relationships between the other methods (data not shown). This suggests that combining different theoretical approaches may lead to better predictions where the methods complement each other.
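The two analyses described above, ranking contacts by learned weight and correlating weight vectors across methods, can be sketched in a few lines. The contact labels and weight values below are placeholders chosen purely for illustration; they are not taken from Supplementary Table S3 or from any of the compared methods.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    if sd_u == 0.0 or sd_v == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_u * sd_v)


def top_contacts(weights, labels, k=3):
    """Return the k (label, weight) pairs with the largest absolute weight."""
    ranked = sorted(zip(labels, weights), key=lambda p: abs(p[1]), reverse=True)
    return ranked[:k]


# Hypothetical contact labels and weights, for illustration only.
labels = ["contact_A", "contact_B", "contact_C", "contact_D"]
svm_weights = [0.9, -0.2, 0.4, 0.1]
other_method_weights = [0.3, 0.5, -0.1, 0.2]

r = pearson(svm_weights, other_method_weights)
most_important = top_contacts(svm_weights, labels, k=2)
```

A coefficient near zero for a pair of methods indicates that they emphasize different contacts, which is the situation in which combining their predictions is most likely to help.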
Significant further challenges remain in developing a complete system for predicting ZF protein–DNA interactions. The relatively poor performance of all methods in predicting the binding of four- and five-finger ZF proteins suggests that, to improve performance for proteins with many zinc fingers, it will be necessary to develop methods for predicting which fingers bind DNA and whether the fingers bind in tandem along the DNA or in several separate regions. Furthermore, it is important to note that all the methods tested here evaluate whether a particular ZF protein can in principle bind a fragment of DNA; they do not evaluate whether this binding occurs in vivo. To better assess whether interactions occur in vivo, these predictions should be used in conjunction with other types of information, such as expression data or cell and tissue type.
In conclusion, we present a new approach for predicting ZF protein–DNA binding based on SVMs. Our approach can utilize a wide range of experimental data, from positive to negative to comparative binding examples. Overall, this methodology makes substantial progress on the problem of predicting a transcription factor's DNA binding sites, and should provide a basis for predicting binding sites at the genome level. While we have described our methodology for predicting ZF–DNA binding, in principle the approach can be applied to any conserved structural interface. Furthermore, as more high-throughput experimental techniques are developed and applied for quantitatively determining DNA binding specificity (Bulyk et al.; Mukherjee et al.), approaches such as the one outlined here will become increasingly important.