Transcription factors are group of proteins that participate in gene regulation by binding to specific short DNA sequences, known as transcription factor binding sites (TFBS). Accurate identification of the TFBSs is the first and perhaps the most critical step in modeling the gene regulatory mechanisms from datasets generated by recent high-throughput approaches, such as ChIP-Seq/chip
[1]. TFBSs are usually short and degenerated at multiple positions. Although numerous computational approaches to predict the TFBSs have been proposed in recent years, the high false positive rate is still a problem. The problem of predicting TFBSs still remains as one of the hard problems in computational biology
[2],
[3],
[4],
[5],
[6]. Depending on the representation of the TFBSs, the computational prediction methods fall into three broad classes: the PWM-based approaches
[7],
[8], consensus sequences-based or regular expressions-based approaches
[9],
[10],
[11] and feature-based methods
[12],
[13].
Generally, PWM-based approaches assume independence between the base positions of the sequence motif and suffer from high false positive rates. However, recent studies have shown that the independent assumption is not true and modeling the dependencies in TFBSs could lead to better predictions
[14]. Examples include feature-based method
[13],
[15], HMM-based method
[16],
[17], Markov Chain based method
[18]. These methods could account for the strong nearest-neighbor (adjacent or local) dependencies, but still fail to incorporate potentially important longer-range interactions.
Long-range dependency in the DNA motif could be important due to the 3-D structure of the TFBS-proteins binding complex
[19]. The 3-D structure of the complex makes the cooperation between non-adjacent nucleotide positions possible. Several approaches have been proposed to incorporate such dependency. Examples include Optimized Markov chain model
[20], MDD
[21], PVLMM
[22], Bayesian Network
[23], Generalized PWM
[24], non-parametric method
[25]. Generalized PWM extends the original PWM model to include pairs of correlated positions and uses MCMC algorithm to sample in the model space. Optimized Markov chain model reorders the nucleotides positions of the motif such that the most significantly dependent positions become pairs of adjacent positions, and then a Markov model is trained using the reordered training sequences.
In addition, the recent experimental studies have shown that a particular transcription factor may have different binding profiles under different condition, for example in the presence of different co-regulators, which suggests that the overall binding profile could be context-dependent or a mixture of different subclasses
[26]. Previous studies showed that for factors which bind to divergent binding sites, mixture of multiple PWMs increase performance of the prediction
[27],
[28]. The potential cluster structure could make the dependent structure of the motif complicated. Thus, there is a growing need to develop methods to model the biological complexity of binding sites sequences beyond a single independent model, especially methods which could efficiently and robustly utilize the huge information provided by ChIP-Seq/chip experiments.
Although the Bayesian network modeling can capture the complicated dependent structure of the TFBS, the structure learning of the network is very complex and time-consuming, and the predicted network is very unlikely to be the true model
[29]. Both MDD and PVLMM can also capture the complicated dependent structure. MDD iteratively splits the training data into a binary tree and different conditional independent models are fit to the leaf nodes of the tree. PVLMM extends the optimized Markov model by introducing variable length Markov models, which allow for different order of dependency in the reordered motif. However, all the existing methods are developed on a set of aligned exact known TFBSs or splice sites. None of these methods are optimized for analysis of large set of TF binding profiles derived in ChIP-seq experiments.
ChIP-Seq experimental methodology combines chromatin immunoprecipitation (ChIP) of a protein with massive parallel sequencing of the retrieved genomic sequences, which are mapped back to the reference genome to obtain significant peaks
[1],
[30]. The sequences within those peaks are expected to be enriched with TFBS of the corresponding TF of interest. ChIP-Seq is a genome scale experiment, which provides a comprehensive analysis of protein-DNA interactions. The sequence enrichment profiles provide excellent opportunity to model the dependent structure of the TFBS motif. However, with ChIP-seq technology, TF bound genomic regions cannot be identified solely on the presence of sequence enrichment on a genomic location, due to the non-specificity of the antibody, indirect binding of TF through protein-protein interactions, sequencing error, etc. Further computational analyses are needed to extract precise TFBS location. Most of existing computational approaches are not designed for processing huge data sets. Applying them on all the sequences are very time consuming. In practice, usually just the top candidate peaks are submitted to those algorithms, rather than using all of the sequence data from significant peaks
[31].
Recently, several groups have started to develop prediction tools utilizing the huge information provided by ChIP-Seq. HMS extends the generalized PWM method to incorporate peak height information to aid motif identification. In order to handle a large number of input sequences and increase computational speed, it uses a novel Gibbs sampling method, where the motif alignment variables are sampled from a small proportion of top sequences, rather than from all sequences
[32]. ChIPMunk is an iterative algorithm which can take into account the peak shape from ChIP-Seq data and extract the single optimal motif from large data sets like ChIP-Seq
[33]. Gapped PWM examines the flexibility to allow variable length motif models utilizing the ChIP-Seq data
[34]. However none of these methods can model the long range dependency between different positions in the binding sites. The huge amount information generated by ChIP-Seq experiments gives an excellent opportunity to refine those PWMs stored in transcription factor databases, like TRANSFAC
[35] or JASPAR
[36]. Usually the stored PWMs come from a limited number of experimentally verified TFBSs and do not truly reflect the general binding affinity of transcription factors. A recent study refined those stored PWMs based on a discriminative approach, however they assume independent motif models and do not utilize the vast information provided by ChIP-Seq experiments
[37]. To this end, we developed a novel Tree-based PWM approach to accurately model the binding profile and also proposed a discriminative approach (TPD) to construct it from the ChIP-Seq data.
The paper is organized as follows. In section 2, we describe how to construct the TPWM and then the detailed implementation of TPD. In section 3, we evaluate the performance of TPD on both simulated and real biological data.