In contrast to the vast diversity of protein function, the elements that regulate gene expression recruit from a shared repertoire of transcription factors, offering the potential for a common regulatory sequence code. The torrent of variants emerging from human resequencing studies – the vast majority of which lie in non-coding regions – coupled with the growing number of common, disease-associated non-coding variants 
has created an urgent need for determining the consequences of variation within regulatory DNA. However, the proportion of variants within regulatory DNA that have reproducible functional consequences on regulatory factor binding is currently unknown, and our ability to predict such outcomes from known rules of protein-DNA interaction is uncertain.
We have described a novel, hypothesis-driven genetic method employing targeted capture and genome-wide in vivo
occupancy profiling to investigate directly the consequences of heritable variation in regulatory sequence. Our results show that individual transcription factor binding sites are surprisingly robust to genetic variation, even at evolutionarily constrained positions. While previous studies have observed differences in transcription factor occupancy among individuals using occupancy profiling alone 
, genome-wide linkage scans 
, or allele-specific occupancy approaches 
, this work is the first systematic analysis of patterns of functional alteration in TF recognition sequences. This study further advances the characterization of heritable variation in TF binding by using highly accurate sequence information throughout a three-generation pedigree.
Our study has revealed a large degree of context dependence for changes to the CTCF recognition sequence. Indeed, even over the core 14 bp motif, only 36% of SNPs affected occupancy (). Our estimate of the percentage of SNPs that affect occupancy in this 14 bp region ranges between 24% to 42% at FDRs 0.1% and 5%, respectively, indicating that the magnitude of this effect cannot be explained by the choice of significance cutoff. We have suggested that buffering is partly mediated by the strength of the binding site, as well as the sequence context at the local CTCF recognition sequence. In addition, buffering might be facilitated by a feedback process that maintains a constant CTCF occupancy despite alterations to the site's inherent affinity. However, while 21% of the SNPs in the region of protein-DNA contact that were significant in our association analysis also exhibited allele-specific occupancy in heterozygous samples, only 3.7% of the non-significant SNPs did, indicating that buffering is not likely to be the consequence of a feedback process. Alterations in DNA methylation might also mask the effect of otherwise significant genetic changes. However, only 30% of polymorphic CTCF sites contain a CpG at positions 1 or 11 of their recognition sequences. Furthermore, the prevalence of CpGs at these positions is the same at sites where a SNP does and does not affect occupancy, limiting the potential scope of methylation to fully explain the observed buffering. As this study was performed on transformed B-lymphoblastoid cells, it is worth noting that the specific CTCF sites that are buffered may not be extrapolated to primary cell types. However, assuming that EBV transformation does not invoke novel cellular mechanisms to regulate protein-DNA interaction, our primary conclusion stands that TF occupancy is strongly modulated by site-dependent effects.
Our results establish a low level of mutational load directly affecting transcription factor occupancy in the 4 founder genomes. Although variants were found at 21% of surveyed CTCF sites, only 0.9% of binding sites exhibited a difference in occupancy due to polymorphism (). Previous studies have identified varying levels of positive and negative selection in transcription factor recognition sequences by estimating changes in binding energy 
. However, our results indicate that 87% of polymorphism observed in the region of protein-DNA contact does not affect binding (Table S4
). This is a higher proportion of silent variation than predicted by binding energy models (Figure S4
), providing evidence that the scope of sequence change consistent with neutral evolution may be larger than previously thought.
Interestingly, we observed that the allele with higher occupancy was the derived allele in 40% of the cases (assuming the chimpanzee allele is ancestral). This indicates that approximately 40% of the functional substitutions in the human lineage increased occupancy, which is surprisingly high given that most mutations might be expected to reduce binding energy.
Previous work studying the power of comparative genomics has predicted a steep increase in the number of sequenced genomes required to obtain nucleotide resolution, particularly in the absence of perfect conservation 
. While genome-wide phylogenetic footprinting approaches have highlighted substantial conservation of transcription factor sequence specificities 
, functional studies of diverged species have uncovered low conservation in occupancy at orthologous sites 
. Any phylogenetic approach is thus a compromise between statistical power gained by sampling more diverged species and the ability to recognize similar functional elements by sequence similarity. The optimum evolutionary distances to sample may be different for assessing functional non-coding elements than for more conserved coding sequence 
. This tradeoff suggests a potential motivation for broad resequencing of natural populations, though even this approach faces the fundamental limitation of ineffective purifying selection in primates and humans 
Gene-based studies have successfully identified causal variants using current methods for prediction of functional non-synonymous protein variants 
. Coding mutations in the CTCF gene affecting its DNA binding specificity have been identified in cancer samples 
, but lesions in its binding sites are harder to interpret. The link between cognate recognition sequence and cellular consequence is complicated by potential influences from the cell-type specific chromatin landscape 
, maintenance of regulatory function despite sequence rearrangement 
, altered association with protein complexes 
, and the lack of binding specificity and occupancy data for common transcription factors. In spite of these caveats, our results indicate that a motif-based classifier of variation in experimentally-identified CTCF binding sites predicts functional variation with a 59% true positive rate and a false positive rate of 20% (). In comparison, current methods for prediction of non-synonymous protein variants achieve a 73% to 92% true positive rate at the same false positive rate 
– not dramatically greater considering the comparatively greater depth of variant databases used to assess coding variation. Given the encouraging performance of a straightforward functional genomics approach, that the majority of variants presently associated with human physiology and pathology lie in non-coding regions 
should be grounds for optimism.
In summary, our results indicate the existence of widespread recognition site-dependent buffering of polymorphism within regulatory DNA regions. A major implication of our work is that the potential for accurately predicting the consequences of variation affecting regulatory factor recognition sequences is severely limited by complex context dependencies, necessitating empirical assessment using functional genomic approaches. The feasibility of approaches such as the one we describe here has recently dramatically increased owing to coupled advances in next-generation sequencing technology and molecular biology, and continuation of this trend should in the near future enable further systematic investigations into the effect of polymorphism on protein-DNA interaction on a routine basis.