Gene duplication events provide the necessary ‘spare parts’ for evolutionary innovation by facilitating the elaboration of existing biological functions (1–4). Diversification of gene function can proceed along at least two distinct pathways: (i) alteration of the gene expression pattern, and (ii) alteration of the protein’s sequence, structure and, eventually, its interactions and biochemical activity. Alteration of expression patterns is likely to involve changes in the cis regulatory elements. Extensive work has been carried out to identify, model and analyze regulatory evolution and the evolution of novel gene function. Many computational analyses have exploited highly conserved non-coding regions as a proxy for putative functional elements. Such analyses fail to capture the divergent aspects of sequence that might underlie the functional diversification of gene families. Other approaches have attempted to correlate sequence changes in cis regulatory regions with expression divergence in pairs of orthologs. However, not only is it inherently difficult to compare expression across multiple species, but expression profiles aggregated across multiple samples obscure the effects of regulation in specific, individual conditions.
In view of the above remarks, we have investigated the correlated changes between TF-binding sites and the condition-specific expression of paralogous genes in the model organism S. cerevisiae, using rigorous statistical analysis. Moreover, we have characterized the impact of regulatory sequence evolution on expression divergence at the level of paralogous gene families, rather than paralogous pairs of genes as was done previously. Our genome-wide analysis in yeast has revealed several significant correlations between changes in TF-binding scores in the promoters of paralogous genes and their expression values in specific experimental conditions. We have also observed that diverse measures of TF binding appear to capture different aspects of TF-binding site variation and evolution, which underscores the value of incorporating TF-binding data from a variety of sources. In general, incorporating nucleosome occupancy probabilities in the promoters yields additional significant correlations. It is worth emphasizing that, since ChIP-chip captures in vivo binding and thus implicitly incorporates nucleosome occupancy, we observed only a negligible number of additional significant correlations after combining nucleosome occupancy data with ChIP-chip-binding probabilities. The few additional correlations retrieved are most likely due to the fact that ChIP-chip experiments are performed under a specific and limited set of conditions, and so may not capture the effect of nucleosome occupancy in other experimental conditions. This reinforces the value of our analysis, as we are able to gain further insights by utilizing several different experimental samples. Finally, we have highlighted a few specific examples of significant correlations between TF-binding site divergence and expression divergence in specific conditions. Collectively, our findings suggest that, during evolution, alterations in TF-binding sites contribute to condition-specific expression changes among paralogous genes. Our results further suggest that the evolution of nucleosome occupancy within paralogous families potentially underlies the expression divergence among the paralogs, as noted in a recent review (19).
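To make the family-level, condition-specific comparison concrete, the following is a minimal Python sketch and not the procedure used in this study. It assumes that a promoter's TF-binding score can be down-weighted by its nucleosome occupancy probability as score × (1 − occupancy), and that a rank correlation is computed across the members of one paralogous family separately for each expression sample. The function names, the weighting scheme and the toy values are all hypothetical.

```python
from statistics import mean

def rank(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined for a constant vector; report 0
    return cov / (vx * vy) ** 0.5

def accessible_binding(score, occupancy):
    """Assumed weighting: scale a binding score by (1 - nucleosome occupancy)."""
    return score * (1.0 - occupancy)

def per_condition_correlations(binding, occupancy, expression):
    """Correlate weighted binding with expression, one condition at a time.

    binding, occupancy: one value per paralog in the family.
    expression: dict mapping condition name -> per-paralog expression values.
    """
    weighted = [accessible_binding(s, o) for s, o in zip(binding, occupancy)]
    return {cond: spearman(weighted, vals) for cond, vals in expression.items()}

# Toy family of four paralogs (illustrative numbers only).
binding = [0.9, 0.7, 0.4, 0.1]
occupancy = [0.1, 0.2, 0.5, 0.8]
expression = {
    "galactose": [8.0, 6.5, 3.0, 1.0],
    "heat_shock": [2.0, 2.1, 1.9, 2.05],
}
result = per_condition_correlations(binding, occupancy, expression)
```

In this toy family the weighted binding scores track expression in one condition but not in the other, which is the kind of condition-specific signal that a single correlation over pooled samples would blur.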
Although we have identified a number of significant correlations between TF-binding sites and expression, there remain a large number of cases in which we could not detect a significant correlation in any expression sample. Expression divergence is potentially mediated by a multitude of factors, including genomic, epigenomic and transcriptional changes, that are not captured solely by mutations in the proximal promoters of genes. Other limiting factors include the lack of known TF-binding sites and the relatively small size of many gene families, both of which reduce the power of the statistical tests. In addition, our compendium of expression samples is far from comprehensive, so many correlations would go undetected for want of relevant expression samples. Despite these limitations, we have found significant correlations between cis regulatory elements and sample-specific expression, and have even captured the effect of nucleosome positioning on transcriptional evolution.
The statistical challenges involved in performing the analysis presented here should not be understated. Expression values and TF-binding scores in paralogous genes cannot be assumed to be either normally distributed or statistically independent. While many studies neglect or discount such details, we have adopted a conservative but rigorous statistical technique. It is also worth mentioning that, because of stringent multiple testing corrections, a genome-wide application of our technique potentially diminishes the percentage of correlations that reach significance. In view of this observation, we believe that a careful application of our method to specific gene families, perhaps even in additional organisms, using a smaller but more relevant subset of putative TFs and expression samples, will be more fruitful.
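The interplay between distribution-free testing and multiple testing correction can be illustrated with a short Python sketch. This is a generic illustration, not the statistical procedure used in this study: it assumes a permutation test on a rank correlation (which avoids the normality assumption) followed by the Benjamini-Hochberg adjustment. It does not address the dependence among paralogs, which requires additional care. All function names and parameters are hypothetical.

```python
import random

def spearman_no_ties(x, y):
    """Spearman's rho via the rank-difference formula (assumes no tied values)."""
    n = len(x)
    rx = {i: r for r, i in enumerate(sorted(range(n), key=lambda i: x[i]))}
    ry = {i: r for r, i in enumerate(sorted(range(n), key=lambda i: y[i]))}
    d2 = sum((rx[i] - ry[i]) ** 2 for i in range(n))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided p-value for rho, estimated by shuffling y relative to x."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    observed = abs(spearman_no_ties(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point round-off in the comparison.
        if abs(spearman_no_ties(x, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (hits + 1) / (n_perm + 1)

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for steps_from_end, idx in enumerate(reversed(order)):
        rank = m - steps_from_end
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Toy usage: one strongly correlated TF/sample pair among four tests.
x = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
p_strong = permutation_pvalue(x, y)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

Two consequences follow directly from the arithmetic. A family of four paralogs admits only 4! = 24 orderings, so even a perfect rank correlation cannot yield an exact two-sided permutation p-value below 2/24 ≈ 0.083, which shows how small families limit power. Likewise, each adjusted p-value grows with the total number of tests, which is why restricting the analysis to a relevant subset of TFs and samples can recover correlations that a genome-wide scan would discard.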
The methods presented here can help reveal novel functional cis elements that may underlie the expression divergence among paralogs, as illustrated by several known cases. For instance, we found that GAL4 binding may underlie the expression divergence within the sugar permease family in yeast. We were able to capture relevant correlations between TF binding and expression in this family precisely because we investigated sequence-expression relationships at the level of families and in a condition-specific manner. Sugar transporter genes have evolved to perform related but subtly distinct functions. Our analysis suggests that this functional diversification is facilitated, at least in part, by expression divergence, which in turn is mediated by divergence in TF binding. Moreover, by analyzing the entire family, as opposed to one gene at a time, our approach affords greater statistical power. From a broader perspective, genome-wide applications of our method may provide a means of generating testable hypotheses, as well as insights into the evolutionary process by which members of gene families diversify their functions.