ChIP-exo provides a comprehensive and high-resolution (to within a few bp) view of transcription factor-DNA interactions across a genome. It detects low-level binding to the point where typically 2–4 fold more binding locations are discovered. With this precision, cognate DNA binding sequences become unambiguous, thereby revealing the complexity of site-specific DNA recognition. Detection by ChIP-exo is not compromised by the presence of other bound proteins, including histones. Since only a small fraction of proteins become crosslinked to DNA, neighboring proteins are stripped away by stringent SDS detergent washes in the ChIP procedure. ChIP-exo not only resolves adjacent binding events that are indiscernible by other methods, but also resolves multiple crosslinking sites within a given bound location.
ChIP-chip and ChIP-seq Have Substantial False Discovery Rates
For the five proteins examined here, >98% of all peak-pair binding locations determined by ChIP-exo contained a recognizable DNA binding motif. The remaining ~2% had very low occupancy, and may contain highly degenerate motifs. If generally true, then many sequence-specific DNA binding proteins may not make high affinity primary contacts with nonspecific DNA as some studies have suggested. Instead such putative binding events might represent false positives or have substantially degenerate DNA recognition elements that cannot be readily discerned with the current resolution of standard ChIP-chip and ChIP-seq technology. As such, standard methods may obscure the true degeneracy intrinsic to site-specific DNA recognition in vivo. Nevertheless, many regulatory proteins might gain DNA binding specificity through protein-protein bridging interactions in lieu of sequence-specific binding (Welboren et al., 2009; Zhao et al., 2010).
The false positive and false negative rates associated with ChIP-chip and ChIP-seq vary depending upon chromatin fragmentation heterogeneity, ChIP-efficiency, background contamination, limitations imposed by the detection platform, and bioinformatic filtering/thresholding of the data. From our analysis, we suspect that as much as 50% of factor-bound locations determined by ChIP-chip and 30% by ChIP-seq could be false positives. False negatives represent a much higher percentage. Higher data thresholding produces fewer false positives, but more false negatives. While these false discovery rates associated with ChIP-chip and ChIP-seq appear high, they are sufficiently low to draw robust statistical conclusions. However, the validity of any one selected binding location may have substantially more uncertainty, which may be diminished with ChIP-exo.
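The trade-off between false positives and false negatives under thresholding can be sketched with a toy example. The scores and the bound/unbound labels below are invented for illustration and are not data from this study; `count_errors` is a hypothetical helper name.

```python
# Toy illustration only: scores and labels are invented, not data from the study.

def count_errors(scored_sites, threshold):
    """Return (false_positives, false_negatives) when every site with
    score >= threshold is called as bound."""
    false_pos = sum(1 for score, is_bound in scored_sites
                    if score >= threshold and not is_bound)
    false_neg = sum(1 for score, is_bound in scored_sites
                    if score < threshold and is_bound)
    return false_pos, false_neg

# (signal score, truly bound?) pairs; hypothetical values
sites = [(9.0, True), (7.5, True), (6.0, False), (5.5, True),
         (4.0, False), (3.5, True), (2.0, False), (1.5, True)]

for threshold in (2.0, 5.0, 8.0):
    fp, fn = count_errors(sites, threshold)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

With these invented values, raising the threshold from 2.0 to 8.0 lowers the false positives from 3 to 0 while raising the false negatives from 1 to 4.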
Diversity and Complexity in Site Recognition In Vivo
From the five proteins examined here, we observed diverse and complex strategies for achieving sequence-specific DNA binding. These proteins likely represent a small sampling of this diversity. Several predominant well-known themes arise. First, each nucleotide position within a binding site has a characteristic biased usage of the four possible nucleotides. Certain positions may be essentially invariant, whereas others accept any nucleotide with equal frequency. Between these extremes, some positions are biased towards two or three nucleotides. Usage of three of four possible nucleotides at a position might indicate that the fourth causes a negative interaction, rather than the three providing a positive interaction. These position-specific tolerance profiles form the basis of a consensus. Variations from the consensus may serve to alter binding affinity, the magnitude of which may be dependent on the type of nucleotide present at other positions within the site, and/or cooperative or competitive binding with other factors (discussed below).
Since site variation may occur throughout a consensus sequence, any particular deviation from the consensus may be rare. Collectively, however, deviations from the consensus appear to be quite common. As such, there may not be a clear demarcation between a consensus sequence and site variants. It may therefore be useful to think of each position in a site as a four-setting non-linear rheostat, of which some positions provide coarse tuning of affinity whereas others provide fine-tuning.
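The notion of position-specific nucleotide usage can be made concrete with a small sketch that tabulates per-position frequencies from aligned sites and scores each position by its information content (2 bits minus the Shannon entropy). The sequences are hypothetical, the function names are illustrative, and no small-sample correction is applied; this is not the motif analysis used in the study.

```python
import math
from collections import Counter

def position_frequencies(sites):
    """Per-position nucleotide frequencies for equal-length aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    freqs = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        freqs.append({nt: counts.get(nt, 0) / len(sites) for nt in "ACGT"})
    return freqs

def information_content(freq):
    """Bits of information at one position: 2 minus the Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in freq.values() if p > 0)
    return 2.0 - entropy

# Hypothetical aligned sites: position 1 is invariant, position 2 accepts
# two nucleotides, and position 3 accepts all four equally.
aligned_sites = ["ACA", "ACC", "AGG", "AGT"]

for index, freq in enumerate(position_frequencies(aligned_sites), start=1):
    print(f"position {index}: {information_content(freq):.2f} bits")
```

In this toy alignment the three positions score 2.00, 1.00, and 0.00 bits, corresponding to an invariant position, a position biased towards two nucleotides, and a position that accepts any nucleotide.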
A particular level of occupancy may be needed to regulate a set of functionally related genes, in which case a single motif version may be employed. For example, Rap1 regulates a large set of ribosomal protein genes, and these genes selectively utilize one version of the Rap1 consensus. Rap1 is also found at telomeres where a different version of the consensus is employed. The same is seen for Reb1. This phenomenon of selective motif version utilization might explain some of the reported discrepancies in consensus sequences defined in different studies that may have been derived from different subsets of binding locations.
It almost seems paradoxical that a more comprehensive set of bound locations would necessarily yield a more degenerate consensus. However, this finding is consistent with the idea that low affinity binding sites are low affinity because their sequences are farthest from the consensus. Therefore a technique with greater detection sensitivity would, by definition, yield an abundance of low affinity interactions occurring at degenerate motifs.
Physiological Importance of Lowly Occupied Sites
At what point does a sequence impart so little specificity/occupancy that it ceases to be biologically meaningful? Biological networks have been generally thought of as being discrete, meaning that a factor either regulates or does not regulate particular genes in the network. However, an alternative view is of a continuum, where a factor’s regulatory potential on a gene scales with its occupancy level (Li et al., 2008). A continuum of occupancy levels renders the concept of false negatives somewhat meaningless, except in an operational sense. Thus, while protein binding might be detected at more than a thousand locations in a genome, only the binding of the most highly occupied sites might be rationalized. The rest may form a continuum of increasingly more subtle regulation as site occupancy decreases, which would make network definition seemingly less vivid. Thus, even with perfect data, the set of bound locations would not be definable in an absolute sense, but only at a specified occupancy threshold.
The low affinity/occupancy locations reported in this study show evidence of being real (i.e., not false positives) and functional. First, such locations are reproducibly detected in multiple biological replicates. Second, with an uncertainty of less than a few bp, such locations are almost always centered over a sequence with similarity, albeit degenerate, to a high affinity site. Third, peak-pair distances are nearly identical to distances of high occupancy locations. Fourth, and most importantly, such locations are not random in the genome, but instead are concentrated at fixed distances from genomic features. For example, isolated low and high occupancy Reb1 locations are concentrated 95 bp upstream of the TSS, and clustered low and high occupancy locations are concentrated ~40 bp from each other. When Reb1 is bound to the −1 nucleosome, the lowly occupied secondary locations selectively reside in the upstream flanking region bound by the nucleosome. In contrast, the genome is awash with equivalent sites that are intrinsically low affinity, but no binding is detected. Taken together, none of these properties are consistent with the notion of physiological irrelevance or nearby incidental contact due to looping or chromatin compaction. Such weak interactions might have little measurable regulatory potential on gene expression, but may be sufficiently important for fine-tuning to be evolutionarily maintained.
An alternative view of lowly occupied sites is a hit-and-run mechanism (Biddie et al., 2011; Voss et al., 2011), whereby the dwell time of a protein on a DNA site may be rather short, but is sufficiently long to catalyze downstream events (e.g. chromatin remodeling) that may be more long-lived and ultimately functional. As such, low occupancy sites may be functionally important.
Multiple Mechanisms by which Transcription Factors Bind Chromatin
The effective concentration of DNA binding proteins and DNA sites in the nucleus may far exceed the K_D of DNA binding, and as such factors may be DNA-bound (specifically or nonspecifically) most of the time (Lin and Riggs, 1975). This raises the question as to the exact pathway of site-specific DNA binding in vivo, whether factors exist in an unbound pool or are directly transferred from other DNA sites (von Hippel et al., 1974). Our finding that isolated high affinity sites may be lowly occupied in vivo, while many intrinsically low-affinity sites have higher occupancy, suggests that intrinsic affinity is not the sole determinant of occupancy in vivo. Rather a combination of effects, including high local concentrations, direct and indirect cooperativity, and competitive binding derived from other factors including nucleosomes will likely impose additional constraints. The contribution of any constraint may vary from one location to another.
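As a point of reference, the relationship between concentration and occupancy can be written for the simplest textbook case, assuming a single noncooperative site at equilibrium with a free protein concentration [P]. The symbols θ (fractional occupancy) and [P] are introduced here for illustration and do not appear in the original analysis.

```latex
\theta = \frac{[P]}{[P] + K_D}, \qquad \theta \to 1 \ \text{when}\ [P] \gg K_D
```

Under this simplified relation, occupancy would depend only on intrinsic affinity and concentration. The observation that isolated high affinity sites may be lowly occupied in vivo is therefore not captured by it, in keeping with the additional constraints described above.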
For example, Reb1 not only binds in the middle of nucleosome-free regions and has been implicated in NFR formation, but it also binds quite strongly and selectively to nucleosomes. Rap1 binds to nucleosomes as well, but such nucleosomes seem to have low occupancy, which might reflect Rap1 binding followed by nucleosome eviction, rather than simple competitive binding. Certainly, many other sequence-specific binding proteins might recognize their site only after nucleosome eviction, and thus would be mutually competitive.
Each of the yeast proteins examined here had clustered binding locations. Clustered sites had substantially higher occupancy than isolated sites, perhaps owing to mutually cooperative binding through direct or indirect interactions or through cooperative exclusion of competing proteins. Site clustering might also give rise to the perception of non-orthologous site evolution. It is well known that cis regulatory elements have a conserved presence but not necessarily a conserved position in promoter regions (Birney et al., 2007; Dermitzakis and Clark, 2002; Moses et al., 2006). Conceivably, each site in a cluster of sites might evolve back and forth from high affinity (recognizable) to low affinity (unrecognizable). As such, two sites that appear at nonorthologous locations might also have degenerate orthologous equivalents that are undetectable by consensus matching.
Summary
To our knowledge, ChIP-exo is the first technique that has the potential to reveal an essentially comprehensive and unambiguous set of genomic binding locations for a protein at near single bp accuracy. Moreover, improved mapping accuracy and background reduction substantially reduce the number of tags needed to unambiguously identify a bound location, and provide a much greater range of occupancy levels that can be detected. This allows for a more complete assessment of regulatory networks, the repertoire of binding sites, their evolutionary turnover, and the context in which they interact with other factors.