Genome sequences encode not only the sequences of RNAs, but also the rates at which these are transcribed under various conditions. This cis-regulatory code is a consequence of the sequence specificity of transcription factors (TFs) and their interactions with other TFs, nucleosomes and other chromatin-associated proteins. The identification of cis-regulatory elements on a genomic scale is complicated by the fact that, although TF sequence specificity is generally well characterized in vitro, functional elements in the genome are much sparser than would be predicted from sequence alone (Gao et al.; Pilpel et al.). There are two types of constraint on the in vivo selection of functional targets by a given TF: those that prevent the TF from binding to DNA, and those that prevent a bound TF from driving transcription. Functional and comparative genomics data are therefore needed in concert with knowledge of TF sequence specificity and DNA sequence to infer regulatory networks. The advantage of the functional genomics approach is that the contribution of sequence to the recruitment of a particular TF, and the association of that TF's binding with transcription, can be quantified. Comparative genomics, on the other hand, can provide evidence for the biological utility of TF binding through the application of evolutionary principles.
Fig. 1. Model for conservation of TF affinity. Promoter-TF affinity may be partitioned three ways: (1) affinity that does not lead to occupancy, (2) affinity that leads to occupancy but not function and (3) affinity that leads to occupancy and function.
Most comparative genomics methods rely on local alignment of orthologous promoters or statistical measures of sequence overrepresentation (Cliften et al.; Kellis et al.; Li and Wong, 2005; Moses et al.; Siddharthan et al.; Sinha et al.), neither of which reflects the evolutionary constraints on regulatory sequence. Local alignment is not well suited to detecting lower affinity binding sites, which may be distant in sequence space yet functional; nor can it capture the rapid turnover of binding sites, which often occurs without conservation of position (Dermitzakis and Clark, 2002; Ludwig, 2002; Tautz, 2000; Wray, 2003). These limitations could be overcome by comparing not the orthologous sequences themselves, but rather their predicted affinities for various TFs. Various biophysically motivated models of promoter-TF affinity have been developed (Bintu et al.; Djordjevic et al.; Liu and Clarke, 2002; Roider et al.; Ronen et al.; Stormo et al.) that allow such a comparison. In this article, we build upon the principles of conservation of promoter-TF affinity across a wide range of interaction strengths (Tanay, 2006) and conservation of the core transcriptional network (Pritsker et al.), and posit that the fraction of a promoter's affinity that is conserved in all species (that is, the minimum total affinity among orthologous promoters) can be used as a proxy for the fraction of the affinity that is functional.
We tested this idea using Saccharomyces cerevisiae and three closely related yeast species. We used the position-specific affinity matrix (PSAM) model (Foat et al.) to predict the affinity of each promoter in each species for a set of TFs with previously characterized sequence specificities (MacIsaac et al.). For each TF, this yields a value for the total affinity at every promoter in S. cerevisiae (NT). The minimum of the four orthologous promoters' affinities defines the conserved promoter affinity (NC) at each promoter, and the unconserved promoter affinity (NU) is calculated by subtracting NC from NT. We find that, compared to the unconserved affinity NU, the conserved affinity NC tends to exhibit greater bias toward Gene Ontology (GO) categories, better explains TF-promoter susceptibilities inferred from expression data, and correlates more strongly with nucleosome depletion. For several TFs, we detect GO category enrichment using the conserved affinity NC when none is observed using the total single-species affinity NT and no function has been reported in the literature.
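The partition of total affinity into conserved and unconserved parts can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the text: promoter affinity is computed as the sum, over all windows on both strands, of the product of per-position relative affinities in the PSAM; the 2-bp matrix `toy_psam` and the four sequences in `orthologs` are made-up toy data; and the function name `promoter_affinity` is hypothetical. Only the last three lines (minimum over orthologs, then subtraction from the S. cerevisiae total) follow directly from the definitions of NC and NU above.

```python
import numpy as np

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def promoter_affinity(seq, psam):
    """Total affinity of one promoter for one TF (assumed scoring scheme).

    psam has shape (motif_width, 4) with columns ordered as BASES and
    holds relative affinities (1.0 for the optimal base at each position).
    Sequences must contain only A, C, G and T.
    """
    psam = np.asarray(psam, dtype=float)
    width = psam.shape[0]
    total = 0.0
    # Score the forward strand and its reverse complement.
    for strand in (seq, seq.translate(COMPLEMENT)[::-1]):
        idx = [BASES.index(base) for base in strand]
        for start in range(len(idx) - width + 1):
            window = idx[start:start + width]
            # Window affinity: product of per-position relative affinities.
            total += float(np.prod(psam[np.arange(width), window]))
    return total

# Toy PSAM whose optimal site is "AC" (illustrative values only).
toy_psam = [[1.0, 0.1, 0.1, 0.1],
            [0.1, 1.0, 0.1, 0.1]]

# One promoter in four species; the first entry plays the role of
# S. cerevisiae. The sequences are invented for illustration.
orthologs = ["TTACTT", "TACTTT", "TTTTTT", "ACACTT"]

n_by_species = np.array([promoter_affinity(s, toy_psam) for s in orthologs])
n_total = n_by_species[0]             # NT: affinity in the reference species
n_conserved = n_by_species.min()      # NC: minimum over the orthologs
n_unconserved = n_total - n_conserved # NU = NT - NC
```

Note that the first two toy sequences carry the "AC" site at different positions yet receive the same total affinity, which is the sense in which an affinity-based comparison needs no alignment. In practice the same three final lines would be applied to every promoter and every TF.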
We also develop a measure of correlation between genome-wide NC landscapes for pairs of TFs (affinity co-conservation). The interactions thus predicted are highly enriched for known physical or functional interactions between TFs. When the same approach is repeated using NT (affinity co-occurrence), no such enrichment is detected.
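As a rough illustration of comparing genome-wide landscapes for pairs of TFs, the sketch below computes a matrix of pairwise correlations between rows of an affinity table. The text does not specify which correlation measure is used, so plain Pearson correlation is substituted here as an assumption; the function name `pairwise_landscape_correlation` and the three TF profiles are hypothetical.

```python
import numpy as np

def pairwise_landscape_correlation(landscapes):
    """Pearson correlation between affinity landscapes.

    landscapes has shape (n_tfs, n_promoters); row i holds one TF's
    affinity (for example NC) at every promoter. Returns an
    (n_tfs, n_tfs) symmetric matrix.
    """
    return np.corrcoef(np.asarray(landscapes, dtype=float))

# Invented NC values for three hypothetical TFs over four promoters.
nc = np.array([
    [3.0, 0.2, 2.5, 0.1],  # TF A
    [2.8, 0.3, 2.0, 0.2],  # TF B: similar landscape to TF A
    [0.1, 2.0, 0.3, 1.9],  # TF C: roughly opposite landscape
])

r = pairwise_landscape_correlation(nc)
# r[0, 1] is strongly positive and r[0, 2] is negative for this toy data.
```

Passing NT values instead of NC values to the same function would give the co-occurrence counterpart of the comparison.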
Our method holds promise for predicting in vivo function when only in vitro TF binding data and an ensemble of closely related genome sequences are available. It is fundamentally different from other methods because it is free of the parameters that govern local alignment, of thresholding between targets and non-targets, and of any distinction between conserved and non-conserved instances of individual binding sites.