Experimental Identification of Zeste-Bound Regions
We isolated regions of the D. melanogaster genome bound by the transcription factor Zeste in stage 11 embryos by immunoprecipitating chromatin with an anti-Zeste antibody. We detected bound fragments by hybridization to Affymetrix whole-genome tiling arrays containing more than 3 million 25-basepair (bp) oligos covering the euchromatic portion of the D. melanogaster genome at an average density of one oligo per 35 bp (see ).
We identified 296 regions bound in vivo by Zeste covering 546,016 bp and containing 306 peaks of signal intensity. For 294 of these bound regions we were able to identify orthologous sequences in
D. simulans, D. yakuba, and
D. erecta, which we aligned using MLAGAN [
29]. All subsequent analyses use these 294 regions.
We constructed a position-weight matrix describing the binding specificity of Zeste from 26 previously characterized Zeste binding sites (A). Using this matrix, we identified 1,406 potential Zeste binding sites in the 294 bound regions. The density of Zeste binding sites in bound regions is roughly 2.5-fold greater than in flanking noncoding sequences (see ).
| Table 1Total Numbers of Predicted Zeste Binding Sites in Bound and Flanking Regions |
Definition of Turnover
We consider a predicted Zeste binding site to be an example of binding-site turnover if it is bound in D. melanogaster but not conserved among the four sequenced species in the melanogaster species group.
Evolutionary Model of Zeste Binding Sites
Our first step towards distinguishing conserved and nonconserved binding sites was to examine the aggregate evolutionary properties of the 1,406 Zeste binding sites. We calculated the average rate of nucleotide change (based on maximum parsimony) at each position across the seven-base Zeste binding site (B).
The observed position-specific variation in evolutionary rates is expected because of the varying degeneracy tolerated by Zeste at different positions in its binding site [
18]: highly degenerate positions change rapidly, while more specific positions change more slowly.
We have previously shown that the position-specific evolutionary properties of functional transcription factor binding sites are accurately described by a model that assumes that binding sites are under constant selective pressure to remain binding sites [
18]. This model, based on the protein-evolution work of Halpern and Bruno [
30] and henceforth referred to as the HB model, generates a distinct probabilistic evolutionary model for every position within a binding site based only on the factor's binding specificity. This model for Zeste will henceforth be referred to as HBZ.
The HBZ model predicts the overall rate of substitution at each position within the Zeste binding site. A comparison of these rates to the observed rates of substitution shows overall good agreement (R
2 = 0.81; B). However, the rates of substitution in the binding sites are faster than those predicted by the model, although slower than would be expected under a background noncoding model (HKY [
31]).
We also compared the observed rates of each type of substitution (e.g., C and D) at each position in the binding site to the corresponding rates predicted by the model [
18]. Although there is overall good agreement, once again the observed rates are faster than those from the HBZ model. Additional comparison with the predicted rates from the background model revealed that the observed rates consistently fall between the predictions of the two models.
Assuming that the HBZ model accurately describes the evolution of functional Zeste binding sites, these observations suggest that our set is a mixture of sites evolving under purifying selection to retain Zeste binding and nonfunctional sites evolving at or near the background rate.
Classification of Sites Based on Conservation
To classify the 1,406 sites according to conservation, we used the HBZ model to test whether the observed pattern of evolution at each position across each site is consistent with it having been under continuous selection to maintain Zeste binding since the divergence of the four analyzed species.
Specifically, we designed a likelihood ratio statistic that compares the pattern of evolution under the binding-site model (HBZ) to that under a background noncoding model (HKY). We define
where
p(
x|
y) is the probability density function of the random variable
x, conditioned on the random variable
y, τ is the evolutionary tree that relates the sequences, and
HBZ and
HKY represent the choice of rate matrices to describe the evolutionary process. We use
X to represent the “reference” species,
D. melanogaster, and
Y…,
Z to represent the other sequences in the alignment. We calculate the conditional probabilities recursively by summing over all the possible ancestral states [
32]. Note that the probabilities are conditioned on X, as we aim to classify patterns of evolution given that we have already observed a binding site in
D. melanogaster. A similar use of conditional probability has been applied in the two-species case [
33].
The T statistic measures whether the observed substitutions in a site are more consistent with the HBZ model (T > 0) or the HKY model (T < 0). Since the species considered here are closely related, and highly conserved sequences are more likely under the HBZ model, an observation that T > 0 provides only weak support for the hypothesis that a site has evolved under purifying selection to retain Zeste binding. In contrast, an observation that T < 0, indicating the presence of substitutions that interfere with Zeste binding, provides strong evidence against the hypothesis that a binding site is conserved.
In using this statistic we are comparing the pattern of evolution under two expectations (i.e., HKY or HBZ) for the evolution of the sequence. It is in principle possible that functional Zeste binding sites evolve under constraints not captured by the HBZ model. As we do not have a set of functional Zeste binding sites known not to be conserved, we cannot directly test the propensity of the statistic to produce errors in classification. Instead, we examined the alignments of the binding sites identified by the T statistic as nonconserved, and found that they consistently contained substitutions that deviated from the Zeste binding motif.
shows examples of previously characterized Zeste binding sites [
34–
37] (obtained from [
38] version 2.0) and their values of the T statistic. The binding sites in the
zeste promoter (A) have few substitutions and thus have T > 0. In contrast, four of the five binding sites in the
Ultrabithorax promoter (B) have T < 0, reflecting the large number of substitutions that have occurred among these four species. Interestingly, we note that in two cases (B, i and ii) there are predicted Zeste binding sites on the other strand in the
D. yakuba, D. erecta lineage, perhaps reflecting compensatory evolution, while the other two cases (B, iii and iv) suggest lineage-specific gain and/or loss.
We can calculate the expected distribution of the T statistic for sites evolving according to the HBZ and HKY models (A). The observed distribution of T statistics for the 1,406 Zeste binding sites (B) shows that they are qualitatively similar to the HBZ distribution. Using the expected distribution, we can calculate the probability that a site has the observed value of the T statistic or smaller, given that it evolved under the HBZ model. We can use this as a p-value to reject the hypothesis that a binding site is conserved (although we note that the true statistical power of the test depends on how closely the HBZ model reflects the true constraints on Zeste binding sites).
We classified sites as “not conserved” if the p-value for their value of the T statistic was less than 0.01. Of the 1,406 binding sites, 215 met this criterion, far more than the 14 that would be expected if all of the sites were conserved and evolving under the HBZ model.
However, before assuming that all of these 215 nonconserved sites represent examples of binding-site turnover, we had to address two potential confounding factors. First, errors in our multispecies alignments could make binding sites that are actually conserved appear to be evolving rapidly. Second, we do not expect all of the predicted binding sites in the Zeste-bound data to be functional.
Alignment Errors Do Not Significantly Impact Our Analyses in Closely Related Species
Alignment algorithms attempt to reconstruct evolutionary history by aligning putatively orthologous bases to each other. However, even the best alignment algorithms are imperfect. As with many models of molecular evolution, ours assume that the DNA sequence alignment is perfect, and alignment errors could result in the erroneous classification of conserved binding sites as nonconserved.
As part of a larger study of alignment error, we have performed a simulation of regulatory sequence evolution, with transcription factor binding sites evolving under the HB model surrounded by randomly chosen D. melanogaster noncoding sequences evolving according to the HKY model. These simulations demonstrate that in alignments containing a branch of greater than 0.6 substitutions per neutral site, many conserved binding sites are no longer perfectly aligned (Pollard et. al, 2006). We therefore limited our analysis to D. melanogaster and its three most closely related species with fully sequenced genomes. The longest branch in the tree relating these four species has fewer than 0.1 substitutions per noncoding site.
The same simulation study also suggested that even when conserved binding sites are not perfectly aligned, they are often overlapping in the alignment. We therefore modified the software we use to identify conserved binding sites [
39] to recursively recover overlapping binding sites from multiple alignments, and assumed that these were orthologous binding sites.
To verify the relevance of these simulations to our Zeste data we performed a similar simulation of the evolution of the 284 bound intervals that contain at least one of the 1,406 Zeste sites described above. We evolved these sites under the HBZ model, and surrounding sequences under the HKY model, along the tree relating these four species. We then realigned the simulated sequences using MLAGAN and searched the alignments for matches to the Zeste matrix as before. Of the 1,406 nonoverlapping matches in D. melanogaster included in the simulation, our heuristic recovered 1,351 (96%) from the alignments. Of these, we found that only 10 (0.7%) showed p-values for the T statistic less than or equal to 0.01, which is close to the expected 1%. This suggests that errors due to alignment contribute negligibly to the analyses presented here.
Comparison to Flanking Sequences Reveals Significant Number of Functional Nonconserved Binding Sites
Our genome-wide ChIP–chip experiments do not have sufficient resolution to detect binding to individual binding sites, leading us to analyze predicted Zeste binding sites found in Zeste-bound regions. With the methods we used to identify these binding sites, we expect to find two predicted binding sites per 1,000 bp in random sequences with the base composition of the D. melanogaster genome. While the sequences analyzed here are more complex than random sequence, many of the predicted Zeste binding sites may be chance matches to the Zeste specificity matrix and not bound by Zeste. It is important that we do not consider these possible nonfunctional and nonconserved sites when evaluating binding-site gain and loss.
To estimate how many of the 1,406 binding sites are nonfunctional, we analyzed predicted binding sites in 425 1,000-bp noncoding fragments located 2–3 kb on either side of the bound intervals. These sequences have generally similar base composition and evolutionary properties to the bound regions. compares the numbers of predicted Zeste binding sites in the bound regions and flanking noncoding regions. If we assume that binding sites predicted outside bound regions are nonfunctional, and that nonfunctional binding sites occur at the same rate in bound and unbound regions, we can place a lower bound on the number of functional binding sites and functional nonconserved binding sites in bound regions.
We find an excess of 806.7 (±56.7) Zeste binding sites within 300 bp of peaks in the bound regions, and an excess of 61.6 (±15.7) nonconserved sites (). Because we used a p-value cutoff of 0.01 to define nonconserved binding sites, we expect 8.7 functional binding sites to have passed this threshold by chance. Correcting for these sites produces an estimate of 53.6 (±14.6) nonconserved functional binding sites in the bound regions. Thus, even with the conservative assumption that nonfunctional sites are found at equal densities in bound and nonbound regions, we estimate that 6.6% (+2.4%, −1.4%) of the approximately 800 functional binding sites in these regions are not conserved.
We note that the large excesses of predicted Zeste binding sites in the bound regions (Z = 33.9 and Z = 38.5 for matches in D. melanogaster and conserved matches respectively, see ) are strong evidence that the high-throughput data is identifying bona fide in vivo Zeste-bound regions. Nevertheless, the data will likely contain false negatives and false positives, which will tend to reduce the functional enrichment. Because we have assumed that all of the predicted Zeste binding sites outside of bound regions are nonfunctional, our estimates for the numbers of functional binding sites are expected to be conservative.
Systematic biases in the array data, such as differences in base composition between the bound and flanking regions, however, could produce unexpected effects. Therefore, as a control for base composition, rate of evolution, or other local sequence effects between the flanking regions and the bound regions, we repeated our analysis using a scrambled version of the Zeste specificity matrix and found no significant differences between the bound regions and the flanking regions in the frequencies of matches in D. melanogaster or conserved matches (unpublished data) or nonconserved matches (). In addition, because the classification is based on the assumption that functional binding sites evolve under HBZ, the correction of 0.01 × 3.6 misclassified binding sites per Kb is not necessarily conservative. For example, if the HBZ model differs from the true evolution of conserved Zeste binding sites such that twice as many conserved sites are passing the threshold (2% instead of 1%), we would estimate there are 45.5 nonconserved binding sites or 5.6% of the total.
Rates of Binding-Site Gain and Loss and the Effects of Selection
Having demonstrated that approximately 50
D. melanogaster binding sites in the Zeste-bound regions have not been conserved since the divergence of the melanogaster subgroup, we next sought to explicitly analyze rates and patterns of binding-site loss and gain [
40]. To do this, we predicted Zeste binding sites in each of the four species, and identified positions in the multiple alignments of the bound and flanking regions where there was a binding site in at least one of the species (see ). We found a total of 1,909 such positions within 300 bp of binding peaks in
D. melanogaster, of which 584 were classified as nonconserved (T statistic
p < 0.01). To infer loss and gain events, we classified each of the 584 nonconserved binding sites according to the species in which the site is present. As above, we estimated the number of functional sites by comparison with flanking regions.
| Table 2Total Number of Predicted Zeste Binding Sites in Any Species in Bound and Flanking Regions |
We were particularly interested in the 426 (73%) of these nonconserved binding sites where we could assign a single likely gain or loss event (A). From these, we estimated the rates of binding-site gain (λ) and loss (μ), and inferred the effects of selection by comparing these estimates in the bound regions with those in the flanking unbound regions.
We defined the rate of binding-site loss as the fraction of binding sites in the ancestor that are not conserved. The ancestral binding-site number was estimated as the number of conserved sites plus the number of nonconserved sites classified as losses (A, v, vi, vii, and viii). We found the rates of binding-site loss in the bound regions and flanking regions to be μb = 0.053 (66 “losses” out of 1243 “ancestral” binding sites) and μi = 0.093 (78 “losses” out of 841 “ancestral” binding sites), respectively, and that these differed significantly (p < 0.005, Fisher's exact test). That the rate of binding-site loss in the bound fragments was 57% of that in the flanking regions suggests that purifying selection has acted to remove many mutations that disrupted functional Zeste binding sites.
We defined the rate of binding-site gain as the fraction of ancestral background sequence that contains a nonconserved binding-site classified as a gain (A, i, ii, iii, and iv). We estimated the length of the ancestral background sequence to be the total number of bp in D. melanogaster minus the ancestral binding sites as defined above. The rates of binding-site gain in the bound and flanking regions were λb = 1.61 (360 “gains” out of 223.499 kb) and λi = 1.41 (602 “gains” out of 424.159 kb), respectively, and this difference was also significant (Z = 2.38, p < 0.01).
Binding-Site Evolution and Functional Evolution
The gain or loss of functional binding sites has the obvious potential to alter gene expression patterns. However, it has been suggested that regulatory sequences may frequently evolve through compensatory gain and loss events that produce little if any functional change [
23]. To evaluate the extent of these two modes of evolution in the Zeste-bound regions, we compared the rates and patterns of binding-site gain and loss along the lineage leading to
D. melanogaster since its most recent common ancestor with
D. simulans, to gains and losses along the other lineages.
Because the bound regions we are evaluating here come from experiments in D. melanogaster, any sequence changes that affect regulatory function should be asymmetrically distributed with respect to D. melanogaster. In particular, if any of the bound regions are unique to D. melanogaster, we might expect to find Zeste binding-site gains in these regions along the D. melanogaster lineage. Conversely, we would not expect to detect many Zeste binding-site losses along the D. melanogaster lineage if those losses impaired binding. Therefore, we might expect to see an excess of binding-site gains and a deficit of binding-site losses along the lineages leading to D. melanogaster.
To examine lineage-specific rates of binding-site gain and loss, we computed the excess (relative to flanking sequences) fraction of nonconserved binding sites that showed a single change along the D. melanogaster lineage (C). Although the melanogaster branch accounts for about 13.5% of the evolutionary distance covered by these species (0.032 of 0.23 total substitutions per site spanned by four species), it accounts for 32% (13.5 of 41.7) of the excess binding-site gain events and only 7% (1.7 of 24.8) of the excess loss events, consistent with the hypothesis of lineage-specific gain and loss of Zeste-bound regions. In addition, if we look at the net gain (gains minus losses) of nonconserved binding sites, there is an excess of 16.9 binding sites. Changes on the melanogaster branch account for 70% (11.8) of these. It is important to note, however, that while we can clearly reject the hypothesis of symmetrically distributed changes, the excesses of these subdivided classes of nonconserved binding sites were not statistically significantly different than the background. The asymmetries that we observed may therefore be caused by any number of heterogeneities in the data, though we did not notice such effects along the melanogaster lineage in the total numbers of predicted Zeste binding sites or the number of nonconserved matches of these types for a scrambled version of the Zeste matrix (unpublished data).
That the net gain in binding sites was small, and seemed to occur mostly on the melanogaster lineage, suggests that many of the changes in functional nonconserved binding sites are compensatory, that is, cases where a binding-site loss was compensated with the gain of a binding site elsewhere in the same bound region. To test this model, we used the initial classification of nonconserved binding sites illustrated in A to evaluate how frequently specific binding-site gain events were matched with compensatory losses (the scenarios in A were grouped as (i) and (v), (ii) and (vi), (iii) and (vii), (iv) and (viii)). In this analysis we also included the binding sites corresponding to the pair of scenarios consistent with one change, but whose direction we could not infer.
We observed 33 instances where complementary binding-site gain and loss events occurred in the same bound region. We compared this number with that observed in permutations in which the total number of binding sites in each region was kept constant, but the evolutionary scenario to which each site corresponded was randomized. The observed co-occurrence value in the bound regions did not fall in the extreme of this distribution (D), failing to provide support for the compensatory change model. We constructed several other test statistics based on similar reasoning and we were unable to find any that provided support for the compensatory turnover model (unpublished data).