|Home | About | Journals | Submit | Contact Us | Français|
Motivation: The identification of transcription factor (TF) binding sites and the regulatory circuitry that they define is currently an area of intense research. Data from whole-genome chromatin immunoprecipitation (ChIP–chip), whole-genome expression microarrays, and sequencing of multiple closely related genomes have all proven useful. By and large, existing methods treat the interpretation of functional data as a classification problem (between bound and unbound DNA), and the analysis of comparative data as a problem of local alignment (to recover phylogenetic footprints of presumably functional elements). Both of these approaches suffer from the inability to model and detect low-affinity binding sites, which have recently been shown to be abundant and functional.
Results: We have developed a method that discovers functional regulatory targets of TFs by predicting the total affinity of each promoter for those factors and then comparing that affinity across orthologous promoters in closely related species. At each promoter, we consider the minimum affinity among orthologs to be the fraction of the affinity that is functional. Because we calculate the affinity of the entire promoter, our method is independent of local alignment. By comparing with functional annotation information and gene expression data in Saccharomyces cerevisiae, we have validated that this biophysically motivated use of evolutionary conservation gives rise to dramatic improvement in prediction of regulatory connectivity and factor–factor interactions compared to the use of a single genome. We propose novel biological functions for several yeast TFs, including the factors Snt2 and Stb4, for which no function has been reported. Our affinity-based approach towards comparative genomics may allow a more quantitative analysis of the principles governing the evolution of non-coding DNA.
Availability: The MatrixREDUCE software package is available from http://www.bussemakerlab.org/software/MatrixREDUCE
Supplementary information: Supplementary data are available at Bioinformatics online.
Genome sequences encode not only the sequences of RNAs, but also the rates at which these are transcribed under various conditions. This cis-regulatory code is a consequence of the sequence specificity of transcription factors (TFs) and their interactions with other TFs, nucleosomes and other chromatin-associated proteins. The identification of cis-regulatory elements on a genomic scale is complicated by the fact that, although TF sequence specificity is generally well-characterized in vitro, functional elements in the genome are in fact much sparser than would be predicted from sequence alone (Gao et al., 2004; Pilpel et al., 2001). There are two types of constraint on the in vivo selection of functional targets by a given TF: those that prevent the TF from binding to DNA, and those that prevent a bound TF from driving transcription (Fig. 1). Functional and comparative genomics data are therefore needed in concert with knowledge of TF sequence specificity and DNA sequence to infer regulatory networks. The advantage of the functional genomics approach is that the contribution of sequence to the recruitment of a particular TF, and the association of that TF's binding with transcription, can be quantified. Comparative genomics, on the other hand, can lend evidence of the biological utility of TF binding through the application of evolutionary principles.
Most comparative genomics methods rely on local alignment of orthologous promoters or statistical measures of sequence overrepresentation (Cliften et al., 2003; Kellis et al., 2003; Li and Wong, 2005; Moses et al., 2004; Siddharthan et al., 2005; Sinha et al., 2004), neither of which reflect the evolutionary constraints on regulatory sequence. Local alignment is not well-suited to detect lower affinity binding sites, which may be distant in sequence space yet functional; nor can it capture the rapid turnover of binding sites, which often occurs without conservation of position (Dermitzakis and Clark, 2002; Ludwig, 2002; Tautz, 2000; Wray, 2003). These limitations could be overcome by not directly comparing orthologous sequences, but rather comparing their predicted affinities for various TFs. Various biophysically motivated models of promoter-TF affinity have been developed (Bintu et al., 2005; Djordjevic et al., 2003; Liu and Clarke, 2002; Roider et al., 2007; Ronen et al., 2002; Stormo et al., 1986) that allow such a comparison. In this article, we build upon the principles of conservation of promoter-TF affinity across a wide range of interaction strengths (Tanay, 2006) and conservation of the core transcriptional network (Pritsker et al., 2004), and posit that the fraction of a promoter's affinity that is conserved in all species—that is, the minimum total affinity among orthologous promoters—can be used as a proxy for the fraction of the affinity that is functional.
We tested this idea using Saccharomyces cerevisiae and three closely related yeast species. We used the position-specific affinity matrix (PSAM) model (Foat et al., 2006) to predict the affinity of each promoter in each species for a set of TFs with previously characterized sequence specificities (MacIsaac et al., 2006). For each TF, this yields a value for the total affinity at every promoter in S. cerevisiae (NT). The minimum of the four orthologous promoters’ affinities defines the conserved promoter affinity (NC) at each promoter, and the unconserved promoter affinity (NU) is calculated by subtracting NC from NT (Fig. 2). We find that compared to the unconserved affinity NU, the conserved affinity NC tends to exhibit greater bias toward Gene Ontology (GO) categories, better explains TF-promoter susceptibilities inferred from expression data, and correlates more strongly with nucleosome depletion. For several TFs, we detect GO category enrichment using the conserved affinity NC when none is observed using the total single-species affinity NT and no function has been reported in the literature.
We also develop a measure of correlation between genome-wide NC landscapes for pairs of TFs (affinity co-conservation). The interactions thus predicted are highly enriched for known physical or functional interactions between TFs. When the same approach is repeated using NT (affinity co-occurrence), no such enrichment is detected.
Our method holds promise for predicting in vivo function when only in vitro TF binding data and an ensemble of closely related genome sequences are available. It is fundamentally different from other methods because it is free of the parameters which govern local alignment, thresholding between targets and non-targets, and any distinction between conserved and non-conserved instances of individual binding sites.
Genome sequences for S. bayanus, S. cerevisiae, S. mikatae and S. paradoxus were obtained from Kellis et al. (2003). Only genes for which the authors defined orthologs in all species were considered. For the correlation and GO analyses, we extracted sequences 500 bp upstream from each start codon and truncated them to exclude any upstream coding sequences. For the co-occurrence and co-conservation analyses, to negate length effects and spurious correlations, we extracted 500 bp upstream sequences without regard for upstream coding sequences; additionally, to avoid any overlap, we considered only the sequence upstream of the gene encoded on the Watson strand at divergent promoters with a length<1 kb.
We used the PSAM model for TF–DNA binding affinity (Foat et al., 2006), consisting of a matrix of parameters wjb. These weights represent the energetic consequences of mutations from the highest affinity sequence Sref to nucleotide j at position b within the binding window, resulting in a mutated sequence Smut with an increased free energy of binding:
As described by Foat et al. (2007), assuming independence between positions, we can multiply the weights within a binding window to arrive at the relative affinity of the TF for any sequence Smut. If we further assume a non-saturating physiological concentration of TF relative to DNA, the occupancy N of a sequence Smut is directly proportional to this association constant Ka and the concentration of the factor [TF]. Finally, we sum the occupancies of each subsequence within a sliding window across a promoter to calculate the predicted occupancy Ng for a promoter sequence Ug of any length i:
We converted a library of position-specific scoring matrices (PSSMs) representing S. cerevisiae TF specificities from MacIsaac et al. (2006) to PSAMs using the transformation previously described by Foat et al. (2007) and Bussemaker et al. (2007b)
where λ is an evolutionary selection parameter (Berg and von Hippel, 1987; Stormo et al., 1986), pb is the background probability of base b, and fjb is the frequency at position j and base b in the PSSM. Using a selection parameter λ=1 and equal background frequencies pb, we converted the log2-likelihood scores sjb=fjb/pb to weights wjb=2sjb. We then normalized each column of the PSAM so that the highest affinity base b at each position j was equal to one, and scaled the weights for the other bases accordingly.
We used the AffinityProfile utility from the MatrixREDUCE software package to calculate the affinity Ngfs for each promoter g for each TF f in each species s. We define the total promoter affinity as the affinity Ngfs for s= S. cerevisiae, and the conserved promoter affinity () of a promoter g for a TF f as the minimum of Ngfs among all four species s. The unconserved promoter affinity is given by
The genome-wide total promoter affinity conservation for each factor was calculated as
We selected those TFs for which both a PSAM and a corresponding TF deletion expression microarray experiment from Hughes et al. (2000) were available and calculated the Pearson correlation between NU for all promoters and the corresponding log2 expression ratios, and between NC for all promoters and the corresponding log2 expression ratios.
Similarly, we considered TFs for which Gao et al. (2004) applied the MA-Networker algorithm, which predicts the regulatory susceptibility of each promoter to a given TF by combining whole-genome chromatin immunoprecipitation (ChIP–chip) data with a compendium of expression profiles. We calculated the Pearson correlation between both NU and NC for each TF and the corresponding promoter–TF coupling T-values from MA-Networker.
To investigate the relationship between conservation and nucleosome occupancy, we used the nucleosome ChIP–chip data from Bernstein et al. (2004). We calculated the Pearson correlation between either NU or NC for each TF on one hand, and the mean log2 ratios across nucleosome ChIP experiments on the other.
We performed GO (Ashburner et al., 2000) analysis using a non-parametric variant of the T-profiler algorithm (Boorsma et al., 2005) described by Scheer et al. (2006) and Bussemaker et al. (2007a). Briefly, using the NT, NC, and NU for each TF, we performed a Mann–Whitney–Wilcoxon test between the affinities for genes within each GO category and affinities for all other genes. We performed a Bonferroni correction on the resulting P-values by multiplying them by the number of unique GO categories. We use the YEAST package from the BioConductor platform (Gentleman et al., 2004) within the R statistical programming environment.
We define the affinity co-occurrence between two TFs as the rank correlation between NT profiles of those TFs beyond what is explained by the similarity between pairs of PSAMs. Affinity co-conservation is defined similarly, but using NC instead of NT. We first calculated the Spearman correlation coefficient ρ using each of the 7626 pairs of TFs. To control for the confounding effect of similarity between PSAMs, we then randomly permuted the sequence of every promoter in each genome, and performed the same analysis. Using 3874 random samples, we obtained a null distribution for ρ for each TF pair. This procedure was performed separately for co-occurrence and co-conservation. These distributions were found to be normal upon examination of a Q–Q plot, which then allowed us to calculate a mean and SD for the null distribution for each TF pair, and assign two P-values to each TF–TF pair which served as our co-occurrence and co-conservation metrics. We also assigned a false-discovery rate (FDR) threshold α to each pair as described by Benjamini and Hochberg (1995). We assigned ranks i to the co-affinity and co-occurrence P-values, and calculated the FDR threshold α=pn/i, where n is the number of pairs (7626).
We compared our co-occurrence and co-conservation scores against four sources of validation: the subset of cofactor pairs reported by Banerjee and Zhang (2003) that had been experimentally validated; physical interactions deposited in the Saccharomyces Genome Database (SGD) (Cherry et al., 1998), including interactions detected by mass spectrometry affinity capture, Western Blot affinity capture, complex reconstitution and yeast two-hybrid; synthetic lethal interactions from SGD; and synthetic rescue interactions from SGD. We used the Mann–Whitney–Wilcoxon P-value and the area under the ROC curve (ROC AUC) to assess the performance of the co-occurrence and co-conservation scores at predicting validated interactions of each type. We used the ROCR package developed by Sing et al. (2005) within the R statistical programming environment for calculating the ROC AUC.
For each TF, we first analyzed the overall conservation of affinity cf, defined as the ratio of the genome-wide sum of conserved promoter affinities NC and the genome-wide sum of total promoter affinities NT. The overall conservation varies greatly between factors (Fig. 3), but in most cases, the majority of TF affinity is unconserved. This finding is consistent with previous observations that TF affinity is not sufficient for TF binding, and that TF binding is not sufficient for function; in fact, previous work has shown that approximately 42% of TF binding is not functional (Gao et al., 2004).
To explore whether the conserved fraction of the affinity at a promoter NC corresponds to the part of the affinity that is functional, and conversely whether the unconserved fraction NU corresponds to non-functional affinity, we analyzed transcriptional response to TF deletion as measured using expression microarrays by Hughes et al. (2000). Using the Pearson correlation between both components of the affinity and the expression log2 ratios shows that, indeed, the NC tends to predict the transcriptional response of a promoter better than the NU (Fig. 4A).
Such microarray experiments are often difficult to interpret, because of the drastic physiological change induced and the lack of distinction between direct and indirect targets. The analysis by Gao et al. (2004) was one of a family of techniques to integrate binding and expression data to estimate the regulatory susceptibility of each S. cerevisiae promoter to each of a variety of TFs. The authors noted that this quantity did not correlate well with sequence-based predictions of affinity due to an abundance of non-functional binding. Again using the Pearson correlation, we found that these susceptibilities are explained by the conserved affinity NC, in contrast to the unconserved affinity NU (Fig. 4B).
There are a variety of reasons why a promoter with high affinity for a TF might not be regulated by that TF and consequently would not display conserved affinity across orthologous promoters. Specifically, (i) the TF may occupy the promoter, yet lack the appropriate cofactors at that promoter, or the preferred binding site within that promoter might be in the wrong position or orientation relative to the transcription start site to recruit or activate polymerase; or (ii) the TF might not bind as predicted by sequence because of competing occupancy by nucleosomes. We can distinguish between these two models using ChIP–chip data that probe genomic occupancy by nucleosomes and by the factors themselves. When we compare NC and NU for TFs with their corresponding promoter occupancies as measured by Harbison et al. (2004), we find modest support for the former mechanism; NC and NU tend to both correlate with binding of the TF, although NC often correlates more strongly (data not shown). However, when we look at nucleosome occupancy data from Bernstein et al. (2004), we observe a strong and consistent relationship between nucleosome depletion and NC, but not NU (Fig. 5). Strikingly, the exceptions to this mechanism are Rap1, Sfp1, and Fhl1, all of which function to regulate ribosomal protein (RP) genes (Lieb et al., 2001; Marion et al., 2004; Yu and Morse, 1999).
Any genome-wide measure of promoter–TF connectivity can be combined with prior classifications of genes such as those curated by the GO consortium (Ashburner et al., 2000) to detect functional bias (Bussemaker et al., 2007a). We tested the utility of calculating the conserved affinities NC by analyzing their bias toward GO categories, and by comparing this bias to that which is detected using unconserved affinities NU. We expected NU to be distributed at random in the genome, and that any functional bias should be restricted to NC. Indeed, we find that in general, the significance of the bias toward the most enriched GO category is much stronger when using NC (Fig. 6). Again, the three exceptions are the factors Rap1, Sfp1, and Fhl1, which all regulate RP genes.
We also find that the number of GO categories within which significant bias is detected is almost always greater when using the conserved affinity NC compared to the unconserved affinity NU (Supplementary Table 1). In some cases, we are able to predict functions for TFs using NC that are not ascertainable from NT; these predictions tend to agree with those provided by the literature, when available (Supplementary Table 1). We are able to formulate several hypotheses about TF functions using NC that have not been reported in the literature:
In the course of our analysis, we made the surprising discovery that the factors Rap1, Sfp1, and Fhl1 do not follow the same pattern as the other studied factors: the unconserved affinity NU for these factors appears to be functionally biased and associated with nucleosome depletion more so than the conserved affinity NC. Each of these TFs is known to regulate RP genes, whose expression constitutes half of the RNA polymerase II transcription in rapidly growing yeast cells (Warner, 1999). Our GO analysis for these factors indicates that the strongest functional bias toward RPs in each case is found by considering NT rather than either NU or NC. One explanation might be that the function of Rap1/Sfp1/Fhl1 affinity is context independent, leading even unconserved affinity to be functional, whereas for other TFs, a uniform background level of NU is found throughout the genome (because it is non-functional outside of the proper genomic context, and therefore not selected against). The association of NU with RP promoters, which one would expect to be exceptionally nucleosome depleted, explains the observed anomalous correlation between NU for these factors and nucleosome depletion.
Co-occurrence and co-conservation of individual TF binding sites has been an extensively studied line of evidence for predicting cofactor pairs (Chiang et al., 2003; Pilpel et al., 2001; Sudarsanam et al., 2002). Pairwise comparison of genome-wide NT and NC distributions constitutes a natural extension to these methods, which we term affinity co-occurrence and affinity co-conservation. Similarity in genome-wide NT and NC profiles between two TFs is governed, to an extent, by the similarity between their sequence specificities (PSAMs); we controlled for this effect by developing a null model for each TF–TF pair based on scrambled genomes. At a FDR threshold of α<0.05, we discovered 29 pairs using co-occurrence and 1530 by co-conservation (reported in full in Supplementary Table 2). We first compared these results to the subset of 11 cofactors predicted by Banerjee and Zhang (2003) that were experimentally validated. None of these validated pairs were among the 29 co-occurring pairs, while 8 of the 11 validated pairs were among the 1530 co-conserved pairs (hypergeometric P=1.9×10−5). At this FDR-level stringency it is difficult to compare the two methods, so we used the ROC AUC (Fig. 7) and Wilcoxon-Mann-Whitney tests to compare their performance at predicting the 11 validated pairs. Strikingly, affinity co-occurrence fails to predict these cofactors by either measure, while affinity co-conservation predicts them with significant strength.
We then compared our predictions to other sets of TF interactions discovered via high-throughput experiments: physical interactions, synthetic lethal interactions, and synthetic rescue interactions (Table 1).
While physical interaction or synthetic lethality between two TFs does not necessarily imply that they act together as cofactors or even target the same genes, we nevertheless find that these types of evidence are associated with significant signals of affinity co-conservation. Synthetic rescue interactions are not associated with co-conservation, which is to be expected because such interactions indicate a complementary rather than synergistic relationship. Again, it is notable that in none of these cases does co-occurrence of affinity in a single genome provide evidence that agrees with experimentally validated pairs.
The pairs of interactions that we detect by co-conservation (Supplementary Table 2) are biologically plausible. The master cell-cycle regulators Fkh1 and Fkh2 have many interaction partners; the cell-cycle regulators Mcm1 and Yox1 are connected; the RP regulators Fhl1, Rap1, and Sfp1 are connected; heme-activator proteins Hap1, Hap2, Hap3, and Hap4 are connected; sulfur amino acid biosynthesis regulators Met4 and Met32 are connected; and glycolysis regulators Gcr1 and Gcr2 are connected. Interestingly, the serum response factor like protein Rlm1 has many interaction partners, suggesting that it has a genome-wide regulatory function that is shared by many factors with diverse functions.
We have exploited the pattern of TF affinity conservation across orthologous promoters to infer the fraction of TF affinity at each promoter that is functional. This method should be broadly applicable to situations where the in vitro binding specificity of a TF is known, but its in vivo function has not been demonstrated experimentally. It may also be especially useful in the analysis of the genomes of higher eukaryotes, in which non-functional binding sites outnumber functional ones to an even higher degree than in yeast. In contrast to phylogenetic footprinting methods, information from the full range of potential binding affinities is incorporated, allowing for more sensitive detection of potential TF–promoter and TF–TF interactions.
The author would like to thank Richard Mann, Gabor Halasz, and Ronald Tepper for critical readings of the manuscript, and members of the Bussemaker laboratory for helpful discussions.
Funding: This work was supported by National Institutes of Health (NIH) grants R24GM074105, T32GM008798, R01HG003008, and U54CA121852.
Conflict of Interest: none declared.