The spatiotemporal patterns of gene expression are controlled by
cis-regulatory sequences
[1], through binding of transcription factors (TFs) to specific sites in these sequences. Numerous studies indicate that the final transcriptional “read-out” is determined not by an individual TF but by the combinatorial interactions of multiple TFs with DNA. Most notably, in developmental genes, multiple binding sites of different TFs are often located close to each other in genomes, forming so-called
cis-regulatory modules (CRMs), and work together to generate precise expression patterns
[2].
Sequence-specific binding of TF molecules to DNA has been well studied, both in theory
[3] and in practice
[4]. In contrast, the interactions between TF molecules that enhance or inhibit their DNA binding affinities or transcriptional effects are not well understood. Although the importance of cooperative interactions among TF molecules in gene regulation has been clearly demonstrated
[5]–
[8], it is not clear, at a quantitative level, what roles such interactions play, and in most systems the identities of interacting TFs remain unknown. In cases where multiple TF molecules do interact, it is generally unknown how the spatial organization of their binding sites affects DNA binding. Some studies suggest that binding sites must be arranged in specific ways, following “grammar-like rules”
[9],
[10] in order for them to interact properly; others provide evidence of a flexible organization of regulatory sequences
[11],
[12]. Knowledge of the roles of TF interactions, and of how TFs interact, will be central to our understanding of gene regulation.
Genome-wide DNA-binding data from chromatin immunoprecipitation followed by either genome tiling array analysis (ChIP-chip) or sequencing (ChIP-seq) provide an opportunity to address the above-mentioned problems quantitatively
[13],
[14]. DNA binding by TFs is a key step in transcriptional regulation; thus, modeling combinatorial TF-DNA interactions will serve as a bridge to understanding the complex transcriptional process. Focusing on ChIP-based data, instead of gene expression data, simplifies the task at hand. Gene expression is often accomplished through an intricate process involving not only TF-DNA interactions, but also chromatin remodeling, epigenetic modifications, communication among multiple enhancers, etc.
[15]. For this reason, several studies have argued for studying combinatorial interactions among TFs using ChIP-based technologies
[16],
[17].
The central task of this work is to build a predictive model of TF binding affinity from DNA sequences, incorporating both TF-DNA and TF-TF interactions. This would allow us to learn how cooperative interactions among TFs may contribute to their DNA binding affinities. By varying the assumptions about TF interactions and observing their effects on the model's predictive performance, one may be able to understand the details of how binding site arrangements affect interactions. Moreover, a model trained on one set of sequences in one situation can be applied to a different setting to make further predictions about TF targets. This extrapolative ability will be useful, for instance, when we have TF binding data for only part of the genome (e.g., only promoters) and want to identify more TF targets (a large portion of regulatory sequences may lie outside the promoter regions in higher organisms). In one of the analyses, we applied the binding models learned from one genome to predict affinities of the orthologous sequences in a related organism. Such predictions facilitate the analysis of the evolution of TF binding even when ChIP-chip or ChIP-seq data are available in only one organism.
A number of computational methods have been proposed to study TF binding profiles
[18],
[19] and the combinatorial aspects of gene regulation through predictive models
[20]. Typically, these methods attempt to extract information from statistical patterns in DNA sequences, e.g., the occurrence of sequence motifs. Various techniques from statistical learning, such as Bayesian networks
[10], multivariate regression
[19],
[21],
[22], decision trees
[20], regression trees
[23], SVM and artificial neural networks
[24], have been applied to extract important features from sequences, using either gene expression or ChIP-chip data. However, these methods do not reflect the underlying physical principles. As such, it is not clear to what extent their assumptions, e.g., additivity of different features, are valid. Additionally, important sequence features, such as interactions among adjacent binding sites, are often not represented in these approaches. Quantitative methods that are not based on predictive modeling are also available for analyzing ChIP-chip or ChIP-seq data for the purpose of identifying binding sites in the data
[25],
[26] or patterns of co-occurrence of motifs
[27],
[28]. These methods serve somewhat different goals and do not offer the benefits of predictive models. Interested readers are referred to recent reviews
[14],
[20].
By directly modeling the underlying processes, a biophysics-based approach can overcome many limitations of the statistical methods mentioned above. Shea and Ackers
[29] and Buchler et al.
[30] pioneered the use of thermodynamic principles in the study of regulatory mechanisms. A number of recent studies applied these principles to model expression data on promoters/enhancers
[6],
[23],
[31]–
[33] or TF-DNA binding data from ChIP-chip experiments
[18],
[19],
[34]. However, these methods have not adequately addressed the interaction of multiple transcription factors with each other and with DNA. Also, most of these studies focused on individual regulatory sequences
[31]–
[33] rather than genome-wide data, while others have taken the route of simulations
[33], or studied artificial promoters
[6], which are by design far simpler than natural systems. In summary, no existing work has provided a quantitative framework for analyzing genome-wide TF-DNA binding data based on realistic biophysical modeling, especially of combinatorial interactions among multiple TFs and their DNA binding sites.
We developed a novel method, called STAP (Sequence To Affinity Prediction), to analyze large-scale TF-DNA binding data. The heart of this method is a thermodynamic model adapted from earlier theoretical studies
[29],
[30]. The key novel feature of STAP is the explicit treatment of cooperative interactions among different TF molecules. Unlike existing thermodynamic models, STAP explicitly expresses the expected number of TFs bound to a regulatory sequence, and thus it is directly applicable to the analysis of binding intensities reflected in whole-genome binding data. In addition, our specially developed computational techniques based on dynamic programming enable the model to be applied efficiently to complex sequences and large-scale data. Another main feature of STAP is the use of genome-wide binding data not only as binary indicators of TF binding regions, as has been done in most existing studies, but also as quantitative measurements of binding strength. Thus, more information from these data is utilized by this new method. STAP was applied to analyze the ChIP-seq data of 12 TFs in mouse embryonic stem cells (ESCs)
[35] and the ChIP-chip data of two TFs involved in fruit fly blastoderm development
[16]. The analysis results demonstrated the effectiveness of the new method in addressing issues in combinatorial gene regulation using genome-wide binding data.
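To make the idea of computing the expected number of bound TFs by dynamic programming concrete, the following is a minimal sketch and not the STAP implementation itself. It assumes a simplified model that the text does not specify: each candidate site carries a precomputed statistical weight (TF concentration times binding constant), overlapping sites cannot be occupied simultaneously, and a single cooperativity factor multiplies the weight of any configuration in which two consecutive bound sites lie within a maximum gap. The names `Site`, `expected_bound`, `coop`, and `max_gap` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Site:
    start: int     # 0-based start position of the site on the sequence
    length: int    # footprint length in base pairs
    weight: float  # statistical weight of the bound state (assumed precomputed)


def expected_bound(sites: List[Site], coop: float = 1.0, max_gap: int = 0) -> float:
    """Expected number of bound molecules under a toy thermodynamic model.

    Assumptions (illustrative only): overlapping sites are mutually exclusive,
    and consecutive bound sites separated by at most `max_gap` bp contribute
    an extra multiplicative factor `coop` to the configuration weight.
    """
    sites = sorted(sites, key=lambda s: s.start)
    n = len(sites)
    # z_end[i]: total weight of configurations whose last bound site is i.
    # n_end[i]: the same sum, with each configuration weighted by its number
    #           of bound sites.
    z_end = [0.0] * n
    n_end = [0.0] * n
    for i, si in enumerate(sites):
        z_prev = 1.0  # site i is the only bound site so far
        n_prev = 0.0
        for j in range(i):
            sj = sites[j]
            gap = si.start - (sj.start + sj.length)
            if gap < 0:
                continue  # footprints overlap, so j and i cannot both be bound
            omega = coop if gap <= max_gap else 1.0
            z_prev += omega * z_end[j]
            n_prev += omega * n_end[j]
        z_end[i] = si.weight * z_prev
        # Binding site i adds one molecule to every configuration it extends.
        n_end[i] = si.weight * (z_prev + n_prev)
    partition = 1.0 + sum(z_end)  # the 1.0 accounts for the empty configuration
    return sum(n_end) / partition
```

This sketch runs in time quadratic in the number of candidate sites, whereas direct enumeration of all occupancy configurations would be exponential. For two adjacent sites with unit weights and `coop=2.0`, the partition function is 1 + 1 + 1 + 2 = 5 and the occupancy-weighted sum is 1 + 1 + 2·2 = 6, so the function returns 6/5.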