A ChIP-seq dataset should be enriched in binding sites (motifs) for the immunoprecipitated protein. Some of the sequences may also contain binding sites for a transcriptional coregulator that has yet to be identified. To learn about possible coregulatory motifs, we proposed a finite mixture model, fitted with an EM algorithm, that not only identifies a coregulatory motif but also simultaneously determines which sequences contain both motifs, only one of them, or neither.
coMOTIF uses two known PWMs as the starting points for the EM algorithm to elucidate the two motifs. Since the identity of the coregulators may not be known, coMOTIF allows a user to supply a set of candidate PWMs and runs the algorithm on one pair at a time. The set of PWMs could be all the PWMs in a database such as TRANSFAC (Wingender, 2008).
To our knowledge, coMOTIF is unique in considering the joint distribution of the two motifs within a sequence and estimating the proportion of sequences in each of nine states defined by cross-classifying whether each motif is absent, present on the plus strand or present on the reverse complementary strand. Since coMOTIF models the coexistence of two motifs in a sequence jointly and does not allow motif overlaps, intuitively, it should perform better than the one-motif-at-a-time approaches, especially when the two motifs share some resemblance. Our test supports this intuition (section VI in Supplementary Material and Tables S7–S10).
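As an illustration of the nine states described above, the following minimal Python sketch enumerates them by cross-classifying the status of each motif and shows how nine state proportions collapse into the four sequence classes (both motifs, either one, or neither). The labels `absent`, `plus` and `minus` and the function names are illustrative assumptions, not identifiers taken from coMOTIF.

```python
from itertools import product

# Hypothetical labels for the status of one motif within a sequence.
MOTIF_STATUS = ("absent", "plus", "minus")


def enumerate_states():
    """Cross-classify the status of motif 1 and motif 2 (3 x 3 = 9 states)."""
    return list(product(MOTIF_STATUS, repeat=2))


def collapse(proportions):
    """Collapse nine state proportions into four sequence classes.

    `proportions` maps (status_motif1, status_motif2) -> proportion.
    """
    summary = {"both": 0.0, "motif1_only": 0.0, "motif2_only": 0.0, "neither": 0.0}
    for (m1, m2), p in proportions.items():
        has1 = m1 != "absent"
        has2 = m2 != "absent"
        if has1 and has2:
            key = "both"
        elif has1:
            key = "motif1_only"
        elif has2:
            key = "motif2_only"
        else:
            key = "neither"
        summary[key] += p
    return summary
```

Four of the nine states correspond to sequences carrying both motifs (one per strand combination), two to each single-motif class, and one to background-only sequences.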
A simpler mixture model coupled with an EM algorithm was previously proposed for motif discovery by Bailey and Elkan (1994) and implemented in MEME+. However, our method differs fundamentally from MEME+ in that our framework simultaneously considers the joint distribution of two motifs, the presence of either motif alone and the absence of both (background), whereas MEME+ considers only a single motif and background. Consequently, our method allows nine states, whereas MEME+ allows only three. Furthermore, our algorithm works in the sequence space rather than the (overlapping) subsequence space as in MEME+. Nevertheless, our method can be viewed as an extension to the mixture model of MEME+.
Our proposed method is also fundamentally different from the cis-module-based approaches (Gupta and Liu, 2005; Thompson et al., 2003; Zhou and Wong, 2004). The cis-module-based approaches are well suited for sequences that are enriched in multiple transcription factor binding sites, such as promoter sequences of coexpressed genes. We aim to identify simultaneously the motif for the protein that was immunoprecipitated and a coregulatory factor motif in ChIP-seq data. The two motifs must be different and can coexist anywhere in a sequence, with or without a modular structure. Moreover, our framework also allows sequences with just one of the two motifs or neither. We use a single mixture framework to simultaneously estimate all nine proportions. We believe that this framework is well suited for identifying the motifs of a transcription factor and its coregulator in ChIP-seq data.
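To make the simultaneous estimation of the state proportions concrete, the sketch below shows the generic EM update for mixing proportions in a finite mixture. It assumes that the likelihood of each sequence under each state has already been computed (in coMOTIF this would involve the PWMs and the background model, which are not shown here); the function names and the input layout are assumptions made for illustration only.

```python
def e_step(likelihoods, proportions):
    """Posterior probability of each state for each sequence.

    likelihoods[i][k] is the (assumed precomputed) likelihood of sequence i
    under state k; proportions[k] is the current mixing proportion of state k.
    Assumes every sequence has a positive likelihood under at least one state.
    """
    responsibilities = []
    for row in likelihoods:
        weighted = [p * lik for p, lik in zip(proportions, row)]
        total = sum(weighted)
        responsibilities.append([w / total for w in weighted])
    return responsibilities


def m_step_proportions(responsibilities):
    """Updated proportions: the average responsibility of each state."""
    n_sequences = len(responsibilities)
    n_states = len(responsibilities[0])
    return [
        sum(row[k] for row in responsibilities) / n_sequences
        for k in range(n_states)
    ]
```

With nine states, `proportions` would hold nine values summing to one, and alternating the two functions updates all of them in a single pass per iteration rather than estimating each motif's frequency separately.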
Our approach, which finds motifs using a mixture model with a single ChIP-seq dataset, is also distinct from discrimination-based approaches that find motifs by contrasting two different datasets. For example, Mason et al. (2010) developed a contrast motif finder that could be adapted to finding cofactor motifs using pairs of datasets, though they focused on discerning context-dependent motifs for the same transcription factor.
Modeling the joint distribution of two binding sites within a sequence can be computationally challenging, especially for a large dataset. To greatly reduce the computational time, we proposed to use only the highest scoring non-overlapping sites for updating the motif parameters (both PWM and proportions). This option makes our method practical for genome-wide ChIP-seq data analysis.
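One plausible way to select the highest scoring non-overlapping sites is a greedy pass over candidate sites in order of decreasing score, sketched below. This is an assumption about how such a selection could be done, not a description of the coMOTIF implementation; the `(start, width, score)` tuple layout and the function name are hypothetical.

```python
def top_nonoverlapping_sites(sites, max_sites=None):
    """Greedily keep the best-scoring sites that do not overlap each other.

    sites: iterable of (start, width, score) tuples; a site covers the
    half-open interval [start, start + width).
    max_sites: optional cap on the number of sites kept.
    Returns the kept sites sorted by start position.
    """
    chosen = []
    for start, width, score in sorted(sites, key=lambda s: s[2], reverse=True):
        end = start + width
        # Keep the site only if it is disjoint from every site kept so far.
        disjoint = all(
            end <= c_start or start >= c_start + c_width
            for c_start, c_width, _ in chosen
        )
        if disjoint:
            chosen.append((start, width, score))
            if max_sites is not None and len(chosen) == max_sites:
                break
    return sorted(chosen)
```

Because the selection depends on the current scores, it would be recomputed whenever the PWMs change, so the set of retained sites can differ from one EM step to the next.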
With simulated datasets, we demonstrated that the results from using only the non-overlapping sites were comparable to those from using all sites. Intuitively, this technique makes sense since only a small fraction of all overlapping sites are likely binding sites in a typical ChIP-seq sequence. Importantly, this procedure does not prevent the EM algorithm from reaching its local maximum, as the identification of the highest scoring non-overlapping sites is updated at each EM step, as exemplified in Supplementary Table S6. We also showed that coMOTIF is relatively robust to the starting PWMs (section V in Supplementary Material and Table S5).
We investigated the performance of our method on several simulated datasets. To make the ChIP-seq simulations realistic, we used background sequences randomly taken from the mouse genome. In all simulations, the primary motif was present in ~90% of the sequences, whereas the ‘coregulator’ motif was present in 10–50% of the sequences. Both long and conserved coregulators such as Hnf1a and short and AT-rich coregulators such as Foxa2 were considered. In both cases, the primary and cofactor motifs were successfully identified (Supplementary Table S3). We showed that our method was superior to MEME in identifying the coexistence of two motifs in a sequence while it performed comparably in identifying a single motif in a sequence. We also showed with simulated data that cisModule is not well suited to the problem that coMOTIF addresses. In addition, coMOTIF performed better than a simple scanning procedure for estimating the proportions of sequences containing a common primary motif and containing a less abundant coregulatory motif (section VIII in Supplementary Material).
When tested on the mouse liver Foxa2 ChIP-seq dataset, two known liver-specific transcription factor motifs, Hnf4a and Cebpa, were identified. Both motifs are relatively abundant in the Foxa2 ChIP-seq data. However, the majority of the other starting PWMs for the coregulator motifs converged to different motifs. Although this behavior was expected, it demonstrates that motifs with low abundance are difficult to identify. One solution to this problem, which is implemented as an option in the software, is to fix the PWM for the coregulator motif in the EM procedure at its starting value while updating the PWM for the primary motif and all other parameters. When the secondary PWM is not updated and its starting value is poorly specified, the estimated proportion of sequences containing the coregulator motif will be biased. Fortunately, multiple PWMs for the same transcription factor are often available in databases such as TRANSFAC. Comparing the results obtained from different PWMs for the same transcription factor may provide some insight. Knowing which coregulator motifs might be present in which sequences can be useful for generating hypotheses.
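The option of holding the coregulator PWM fixed can be pictured as a switch in the PWM update of the M-step, as in the hedged sketch below. The function names, the `fix_coregulator` flag and the pseudocount are all assumptions introduced for illustration; they are not the interface or the exact update used by the software.

```python
def normalize_counts(counts, pseudocount=0.5):
    """Convert expected base counts into a PWM.

    counts: list of columns, each a list of four expected counts (A, C, G, T).
    The pseudocount is an assumed smoothing choice, not taken from the paper.
    """
    pwm = []
    for column in counts:
        total = sum(column) + 4 * pseudocount
        pwm.append([(c + pseudocount) / total for c in column])
    return pwm


def update_pwms(primary_counts, co_counts, co_start_pwm, fix_coregulator=False):
    """Update both PWMs, optionally keeping the coregulator PWM at its start."""
    primary_pwm = normalize_counts(primary_counts)
    if fix_coregulator:
        # The coregulator PWM is left untouched; only the other parameters move.
        co_pwm = co_start_pwm
    else:
        co_pwm = normalize_counts(co_counts)
    return primary_pwm, co_pwm
```

With the flag set, a low-abundance coregulator motif cannot drift toward a different, more abundant motif, at the cost of the bias noted above when the starting PWM is poorly specified.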
Thanks to new technologies such as protein–DNA microarrays, SELEX and bacterial one-hybrid (B1H) [see Stormo and Zhao (2010) for a recent review], a large amount of protein–DNA interaction data has been generated. Computational methods that consider not only the binding sequences, but also experimental binding affinity have led to PWMs with higher specificity than those based on sequences alone (Zhao et al., 2009). Thanks to the advances in both new technologies and computational methods, high-quality PWMs will increasingly be available in databases such as UniProbe (Newburger and Bulyk, 2008).
In conclusion, we propose a finite mixture framework coupled with an EM algorithm to simultaneously model the joint distribution of two motifs and classify ChIP-seq sequences as containing both motifs, only one of them, or neither. We also propose a procedure to reduce the sampling space in EM so that the method is applicable to large-scale genomic ChIP-seq data for discovering transcription factor and coregulator motifs. Finally, coMOTIF can also take a single PWM and automatically carry out a one-motif analysis, as in MEME. Both functionalities are described in the user manual that is included with the software.