HHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal
 
Mech Dev. Author manuscript; available in PMC 2017 August 1.
Published in final edited form as:
PMCID: PMC5441388
NIHMSID: NIHMS795880

Spatial distribution of predicted transcription factor binding sites in Drosophila ChIP peaks

Abstract

In the development of the Drosophila embryo, gene expression is directed by the sequence-specific interactions of a large network of protein transcription factors (TFs) and DNA cis-regulatory binding sites. Once the identity of the typically 8–10 bp binding sites for any given TF has been determined by one of several experimental procedures, the sequences can be represented in a position weight matrix (PWM) and used to predict the location of additional TF binding sites elsewhere in the genome. Often, alignments of large (>200 bp) genomic fragments that have been experimentally determined to bind the TF of interest in Chromatin Immunoprecipitation (ChIP) studies are trimmed under the assumption that the majority of the binding sites are located near the center of all the aligned fragments. In this study, ChIP/chip datasets are analyzed using the corresponding PWMs for the well-studied TFs, CAUDAL, HUNCHBACK, KNIRPS and KRUPPEL, to determine the distribution of predicted binding sites. All four TFs are critical regulators of gene expression along the anterio-posterior axis in early Drosophila development. For all four TFs, the ChIP peaks contain multiple binding sites that are broadly distributed across the genomic region represented by the peak, regardless of the prediction stringency criteria used. This result suggests that ChIP peak trimming may result in the exclusion of functional binding sites from subsequent analyses.

Keywords: transcription factor, Drosophila, ChIP, binding sites

1. Introduction

The specification of the embryonic body plan in Drosophila melanogaster, as in many other organisms, is accomplished by the regulation of gene expression during development through the activity of cis-regulatory sequences in the genome. Protein transcription factors (TFs) bind to such sequences, regulating expression of target genes [1]. The molecular components of this spatio-temporal network have been extensively characterized (for a review, see [2]). At the top of the cascade that controls anterio-posterior patterning, translation of spatially localized maternal mRNAs, deposited in the unfertilized egg cell during oogenesis, establishes the early TF gradients in the embryo [3, 4]. In turn, these maternal TFs bind at target embryonic regulatory sequences for gap genes, directing gap TF expression patterns in the developing embryo [5, 6]. Gap TFs further regulate downstream target genes, such as those for pair-rule and homeotic TFs [7, 8]. At each step in the cascade, gene expression patterns are tightly controlled by the binding of TFs to specific clusters of activator and repressor binding sites within embryonic cis-regulatory modules (CRMs). The transcriptional output is mediated by the specific molecular properties of individual CRMs. Whether a given TF acts as an activator or repressor when it binds to a CRM can be context-dependent [9, 10] and the DNA sequences within CRMs may bind TFs with varying affinities [6, 11]. Therefore, the accurate identification of potential TF binding sites in known and putative CRMs is critically important to further our understanding of the molecular control of gene regulation in the developing embryo.

The binding affinity of a TF’s DNA binding domain for any potential TF binding site is a function of the sequence of base pairs that make up that binding site [12, 13]. There is extensive variation in the type and number of DNA binding domains utilized by TFs that can modulate both the sequence specificity and relative affinity for different binding sites [14–17]. As such, the organization of binding sites at genomic regions known to recruit a specific TF has been the subject of many investigations aimed at determining the influence of DNA sequences on TF binding affinity and the computational prediction of binding site locations.

A multitude of bioinformatic tools have been designed to aid in the process of analyzing and predicting genomic fragments that contain TF binding sites (TFBSs) [18]. The use of position weight matrices (PWMs) has been shown to be very effective in such analyses [18–23]. The datasets required to construct PWM-based models are obtained from protein binding microarrays, yeast and bacterial one-hybrid assays, DNA footprinting assays, and in vitro SELEX experiments [24, 25]. The known binding regions obtained through these experimental procedures are aligned and trimmed to a minimal length of highest conservation (L) and a 4 × L matrix is created containing the frequencies of each nucleotide at each position in the binding site [22]. A PWM is then constructed from this frequency matrix, with the primary underlying assumption that the probabilities for nucleotides at each position are independent of one another. PWMs often also take into consideration the background frequency of the nucleotides in a given genome and are constructed using log likelihoods. The resulting PWM-model can reduce the need to conduct complex in vitro binding assays by predicting the location of unconfirmed binding sites in the genome in silico [19, 22].
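
The construction described above can be illustrated with a minimal Python sketch. It is not the PATSER implementation: the aligned example sites are made up, the pseudocount of 0.5 is an assumption, and the background frequencies assume that the A/T content of 0.56 used in Section 3.2 is split evenly between A and T (and likewise for C and G). The function names are hypothetical.

```python
import math

# Assumed background: A+T = 0.56 and C+G = 0.44, each split evenly.
BACKGROUND = {"A": 0.28, "C": 0.22, "G": 0.22, "T": 0.28}

# Made-up aligned sites of equal length L (illustration only).
EXAMPLE_SITES = ["TTTATGAC", "TTTATGGC", "TTTACGAC", "TTTATGAT"]


def build_pwm(sites, background=BACKGROUND, pseudocount=0.5):
    """Build a log-likelihood PWM (one dict per position) from aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all aligned sites must have the same length")
    pwm = []
    for pos in range(length):
        # Frequency column of the 4 x L matrix, with a pseudocount
        # so that unseen bases do not produce log(0).
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append(
            {b: math.log((counts[b] / total) / background[b]) for b in "ACGT"}
        )
    return pwm


def score(pwm, seq):
    """Sum per-position log-likelihoods (positions assumed independent)."""
    return sum(column[base] for column, base in zip(pwm, seq))


def best_site(pwm, sequence):
    """Return (score, start) of the highest-scoring window in a sequence."""
    width = len(pwm)
    windows = range(len(sequence) - width + 1)
    return max((score(pwm, sequence[i:i + width]), i) for i in windows)
```

Scanning every window of a longer fragment with `best_site` mirrors, in simplified form, how a PWM is used to predict unconfirmed sites in silico; only one strand is scanned here.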

In the last decade Chromatin Immunoprecipitation-sequencing (ChIP-seq) and ChIP/chip datasets have significantly increased the ability to identify regions of DNA that bind TFs in vivo at the whole genome level [26–29]. However, a limitation of these studies is the resolution of the binding regions identified. Experimentally identified ChIP peaks are often in the range of 500–1000 bp [27], while the TF binding sites themselves are usually only 8–10 bp in size [30, 31]. As a result, it is not uncommon for ChIP peaks to be trimmed to the middle 100 bp before using bioinformatic tools to make de novo predictions of TF binding site locations [18, 32]. This approach assumes that the most relevant binding regions for the corresponding TF in vivo are located within this middle 100 bp region of any ChIP peak. This assumption follows from the fact that ChIP peaks are called based on alignments of multiple overlapping immunoprecipitated fragments of randomly sheared chromatin of similar size [28]. The most highly represented regions among the fragments should therefore, in theory, be those that bind the actual TF most often. Thus, assuming a normal distribution of fragment lengths, the most highly represented binding region should be captured in the middle 100 bp (see Fig. 1 and description in Methods).

Figure 1
ChIP peak binding site arrangements and measurements

In this study we attempt to validate this assumption using PWMs for four well-studied TFs (CAUDAL, HUNCHBACK, KNIRPS and KRUPPEL) involved in the early patterning of the anterio-posterior axis in the Drosophila embryo to predict binding site locations in ChIP peaks from four corresponding ChIP/chip datasets. The results of this analysis indicate that predicted binding sites are in fact broadly distributed throughout the ChIP peaks regardless of the stringency of the predictive thresholds applied, suggesting that ChIP peaks should be left untrimmed to ensure inclusion of all possible binding sites in the dataset.

2. Results and discussion

At any genomic region represented by a ChIP peak, it is possible that the region contains either a single binding site or multiple binding sites for the TF of interest. While the TF binding sites themselves are usually only 8–10 bp in size, for the four Drosophila TFs investigated here: CAUDAL (CAD), HUNCHBACK (HB), KNIRPS (KNI) and KRUPPEL (KR), the mean length of a ChIP peak is ~800 ± 280 bp (Table 1). Trimming ChIP peaks to only consider the center 100 bp, based on the assumptions that a) the genomic fragments recovered for a given peak in a ChIP experiment have a normal distribution around a biologically important binding site(s), and b) that the binding site(s) is likely to be centered around the midpoint of the peak and therefore contained in the central region (Fig. 1), is potentially an efficient approach to help narrow the search for DNA sequences that represent the in vivo binding sites. To examine the distribution of predicted TF binding sites in these ChIP peaks we utilized three different measurements (see Methods for detailed description): directional midpoint distance, position ratio and absolute midpoint distance (Fig. 1).

Table 1
Transcription Factor ChIP peak statistics

If we consider the directional midpoint distance for only the top ranked (top 1) binding site in each ChIP peak for each of the four TFs, the average location is very close to the center of the corresponding ChIP peaks, with the mean distance from the midpoint in the narrow range of +1 to +14 bp (Table 2). However, the distribution for the binding sites in each case is surprisingly large (~250 bp) (Fig. 2). Extending this analysis to include the top 2 or 3 scoring sites for each ChIP peak only expands the range of the mean distance from the midpoint to −3 to +14 bp, while the distribution remains broad (Table 2 and Fig. 2).

Figure 2
TF binding site directional midpoint distance graphs
Table 2
TF binding site directional midpoint distances in ChIP peaks

Previous studies examining the distribution of binding sites for three mammalian TFs in ChIP-seq datasets have identified more restricted patterns of binding site distribution [33]. In each case, if only the single highest scoring predicted site (equivalent to the top 1 binding site in our current study) is considered across a trimmed peak region of 500 bp (−250 bp to +250 bp) then 90% of all binding sites fall in the middle 100 bp (see Figure 7 in [33]). Comparable analysis of the top 1, 2 or 3 binding sites in each ChIP peak for the four Drosophila TFs in our study reveals a similar general trend to the Wilbanks et al. study [33], with a mean binding site distance from the middle of the peak of 0 bp (Fig. 3). However, the distribution for the Drosophila TF binding sites is broader: the 25th and 75th percentile values average around −100 bp and +100 bp (Fig. 3), compared with the less than 50 bp that most programs predict for the mammalian TF binding sites [33]. It should be noted that the data in the mammalian studies were generated by ChIP-seq [33], while the Drosophila studies utilized ChIP-chip [28]. The use of different experimental approaches will likely introduce some systematic differences in the overall size of the identified chromatin fragments, which may impact the ability to accurately compare the results.

Figure 3
TF binding site directional midpoint distance graphs for 500 bp trimmed peaks

It is possible that the “Top n” analyses may have limited biological significance because they do not require the top predicted binding sites to meet any threshold relative to the PWM sequence, which should be indicative of their ability to bind the TF. To address this issue, for each TF, thresholds of varying stringency across a broad range were used in separate analyses (see Methods for details). The use of a range of thresholds is paramount to a robust and unbiased analysis because it has been shown that the performance of a PWM can vary significantly across a range of thresholds [32]. Intriguingly, in each case the results largely remain consistent, with a narrow range for the mean distance to midpoint and a wide distribution (−7 to +49 ± ~270 bp) (Table 2 and Fig. 2). It is worth noting that the pattern is maintained despite the fact that at different thresholds there are vastly different numbers of predicted binding sites (Table 3).

Table 3
Number of TF binding sites per ChIP peak

As a control, we also examined the distribution of binding sites for the DORSAL (DL) TF on each of the four ChIP peak datasets. DL is a TF active in the dorsal-ventral patterning network in Drosophila development. Although there is some evidence that DL can bind to a number of genomic regions known to also bind TFs involved in the anterio-posterior patterning of the embryo [28], binding sites for DL may not be expected to be significantly enriched at ChIP peaks for anterio-posterior TFs when compared to the overall background level in the genome. In almost all cases the total number of DL binding sites predicted is considerably lower than that for the corresponding experimental TF, particularly at weaker thresholds (Table S2). The exception is KNI, which has an order of magnitude fewer ChIP peaks than any of the other three experimental TFs and a corresponding decrease in the number of predicted KNI binding sites (Tables 1 and 3). The distribution of the DL binding sites mirrors the pattern observed for the four experimental TFs, but with a larger range at stronger thresholds as a result of the reduced number of sites included in the analysis (Fig. S2 and Table S3).

Analyzing the location of CAD, HB, KNI and KR binding sites using a different measure in the position ratio approach, which allows for normalization of the binding site location relative to the ChIP peak size in each case, reveals a very similar distribution pattern to the directional midpoint approach (Table S1 and Fig. S1). However, an additional feature of the binding site distribution is revealed by measuring the binding site locations using a third approach. Measuring the absolute distance from the midpoint of a ChIP peak reveals a mean of ~220 ± 140 bp (Table 4). Once more this distribution pattern is largely consistent irrespective of which TF or which threshold is analyzed (Fig. 4) and suggests that on average the binding site(s) in any particular ChIP peak are located over 200 bp away from the center of the peak. In combination with the results from the directional and position ratio analyses this indicates that multiple TF binding sites are centered around the midpoint of a given ChIP peak with a broad distribution that may include sites both upstream and downstream of the midpoint.

Figure 4
TF binding site absolute midpoint distance graphs
Table 4
TF binding site absolute midpoint distances in ChIP peaks

To investigate this distribution more carefully we analyzed the number of binding sites contained in incremental 100 bp intervals centered on the middle of the ChIP peaks. For each of the four TFs at both weak and strong “Footprint” thresholds the distribution is very consistent. Only ~12.5% of all binding sites are included in the middle 100 bp and only ~25% are in the middle 200 bp (Table 5). Each additional 100 bp interval expanding out from the middle adds ~10% of the total binding sites. As a result, including 800 bp of sequence around the center of the ChIP peaks (an absolute distance of 400 bp in either direction) accounts for ~85% of all the binding sites (Table 5) and the resulting distribution plots are very broad (Fig. 5). The distribution of binding site locations using the strong “Footprint” thresholds is significantly different from a normal distribution (D values for CAD, HB, and KR are 0.04, 0.05, and 0.1 respectively, all with p-values <0.001, see Methods for details), except in the case of KNI (D value of 0.1), which only has 41 predicted binding sites at this threshold (Table 3).

Figure 5
Distribution of TF binding sites in ChIP peaks
Table 5
TF binding site distribution in ChIP peaks

To address the potential biological significance of the observed genome-wide TF binding site distributions we also analyzed the distribution patterns specifically at two well-characterized CRMs from the Drosophila bithorax complex. The 1 kb IAB5 and 1.7 kb IAB8 CRMs are both embryonic enhancers responsible for driving precise spatio-temporal expression of the Abdominal-B gene in early development [30]. Examining the TF binding profile using Berkeley Drosophila Transcription Network Project ChIP/chip tracks [28] reveals that both CRMs overlap with binding peaks for CAD, HB, KNI and KR, and a number of additional TFs in the anterio-posterior regulatory network (Figure S3). In almost all cases the ChIP peaks are broader than the CRM and many do not align precisely with each other or the center of the defined CRM (Figure S3). In addition, it is clear that genomic regions from within the bithorax complex that do not function as embryonic CRMs [34] can also recruit these TFs, as evidenced by the presence of separate ChIP peaks outside of the annotated CRMs (Figure S3). This profile supports the observation in earlier studies that many of the key TFs in early Drosophila development bind thousands of active and inactive genomic regions [27].

To quantitatively assess the distribution pattern for the binding sites in the ChIP peaks associated with the two IAB CRMs we compared the profile to that in the neighboring upstream and downstream genomic flanking regions of equal size. In each case the neighboring flanking regions on either side of the ChIP peak demonstrate no in vivo TF binding (Figure S3) [35]. The analysis is conducted at the all and weak “Footprint” thresholds, as the paucity of predicted binding sites at more robust thresholds prevents meaningful comparison. Despite the fact that there is no obvious pattern to the total number of predicted binding sites in the ChIP peak when compared to the flanking regions at these thresholds, the broad distribution for these sites remains consistent and in the majority of cases (9 out of 14) does not show any statistically significant difference (p-values >0.05) (Table S5). In only 1 out of the 14 cases investigated is there a statistically significant narrower distribution in the ChIP peak when compared to the flanking genomic regions (p-value <0.05). In 7 out of the 14 cases, including 3 statistically significant cases (p-value <0.05), the distribution pattern is in fact broadest in the ChIP peak (Table S5). Together these results indicate that the binding site distribution in the ChIP peaks associated with the IAB5 and IAB8 CRMs is in many cases broader than neighboring genomic regions (Table S5) and the genome-wide profile observed if all ChIP peaks are considered (Table 4).

2.1 Conclusions

The results from our analysis on four Drosophila developmentally important TFs indicate that the average ChIP peak contains multiple computationally predicted binding sites for a given TF (mean number range = 1.3 – 6.1) if we consider all sites above the defined weak thresholds (Table 3). This raises the question of whether the ChIP peaks may actually represent clusters of binding sites for a TF. If we only consider binding sites above the strong predictive thresholds, the mean number per peak remains greater than one (range = 1.14 – 2.27), indicating the presence of multiple sites in many ChIP peaks and the potential importance of clustering of binding sites. This clustering has been previously recognized as a key feature of regulatory regions in the genomes of a number of different species [36–38], including at cis-regulatory modules in the even-skipped locus that bind the KNI and KR repressors during early Drosophila development [39–41], and may be responsible for mediating homotypic cooperativity of a TF [42, 43] or heterotypic interactions between proteins in ChIP peaks [44]. Indeed, homotypic clustering of binding sites has been shown to be a common feature of cis-regulatory modules active in embryonic development [45]. The conservation of TF binding site organization at a number of cis-regulatory modules in Drosophila and sepsid species [30, 34, 46–48], despite underlying differences in overall genome size [48], suggests that specific spacing of multiple binding sites in a single ChIP region may in fact be an important evolutionary feature. ChIP datasets are currently not very extensive in insect species other than Drosophila melanogaster, preventing a comprehensive genome-wide investigation of this issue. However, the very precise arrangement of binding sites at the few well-studied cis-regulatory modules, including the IAB5 and IAB8 embryonic enhancers [30, 34], indicates that the local architecture of binding site clusters may be critical to their function.

The results also reveal that the binding sites for these TFs within genomic sequences previously identified in ChIP studies as in vivo binding regions are on average centered in the middle of ChIP peak regions (Figs. 2 and 3), but are widely distributed across the peaks (Fig. 5). Comparison to the previously assumed normal distribution for binding sites in ChIP peaks [18, 32] demonstrates that in fact the distribution pattern is significantly broader, with only ~25% of all binding sites contained in the middle 200 bp of an average peak (Table 5). The four Drosophila TFs analyzed in this study therefore demonstrate generally similar binding site profiles to previously studied mammalian TFs in ChIP-seq peaks [33, 49] in their characteristic centering on the middle of the peak. However, given the broad distribution of binding sites it may be wise to exercise caution when restricting any analysis only to the middle 100 bp, as has been previously applied [18, 32]. Rather, the distribution of binding sites observed suggests it may be prudent to consider an extended sequence in a ChIP peak in future studies if the aim is to retain as much information on TF binding sites as possible. In future studies, the application of next-generation ChIP technologies such as ChIP-exo [50], where an exonuclease trims ChIP DNA to a precise distance from the crosslinking site, will be of significant value as this enables the mapping of TF binding sites at single base pair resolution.

3. Experimental Procedures

3.1 ChIP peaks and PWMs

ChIP/chip peaks were obtained from the Berkeley Drosophila Transcription Network Project (BDTNP) for the TFs HUNCHBACK (HB), CAUDAL (CAD), KNIRPS (KNI), and KRUPPEL (KR). The datasets [28] were downloaded from the University of California at Santa Cruz Genome Browser (http://hgdownload.soe.ucsc.edu/goldenPath/dm3/database/). The total number of peaks for each TF is displayed in Table 1. Only those peaks that were equal to or greater than 100 bp in length were included in the analysis because shorter peaks would not be subject to potential information loss when predicting binding sites in the center 100 bp of a peak. The PWMs used for HB, CAD, KNI and KR are as previously published [51]. A PWM for DORSAL (DL) [52] was utilized as a control.

3.2 Bioinformatic Analysis

The spatial arrangement of computationally predicted binding sites was determined using PATSER analysis [22] with the PWM for each TF on the corresponding set of ChIP peaks. In all cases, PATSER was run with a background sequence A/T content of 0.56, C/G content of 0.44 [51, 53] and, with the exception of the threshold criteria, the default settings.

As each individual ChIP peak should, in theory, contain at least one real binding site for the corresponding TF, the initial analysis was performed such that only the top 1, 2 or 3 scoring PATSER-predicted binding sites for each peak were included (“Top n” analysis). The same “Top n” analyses were also performed using the control DL PWM on each ChIP peak set.
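
The “Top n” selection can be sketched in a few lines of Python. The input layout (tuples of peak identifier, site position and score) and the function name are hypothetical and do not reflect the PATSER output format; the sketch also assumes that a larger score means a stronger match, so ranking by raw ln(p-value), where more negative values are stronger, would need the sort order reversed.

```python
from collections import defaultdict


def top_n_sites(predictions, n):
    """Keep the n highest-scoring predicted sites for each peak.

    predictions: iterable of (peak_id, position, score) tuples
    (a hypothetical layout; larger score = stronger match).
    Returns {peak_id: [(position, score), ...]} sorted best-first.
    """
    by_peak = defaultdict(list)
    for peak_id, position, score in predictions:
        by_peak[peak_id].append((position, score))
    return {
        peak_id: sorted(sites, key=lambda site: site[1], reverse=True)[:n]
        for peak_id, sites in by_peak.items()
    }
```

With this layout, a peak that has fewer than n predicted sites simply contributes all of them.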

In addition to the “Top n” analyses, binding site score thresholds of different stringencies were used in separate PATSER runs. The ln(p-value) thresholds were obtained by running each PWM on the set of DNA footprinting assay-derived sequence fragments that had originally been used to create the PWM. This produces a range of ln(p-value) scores across all fragments. The 50th percentile of these scores was used as the “weak” threshold in subsequent analyses and the 75th percentile was used as the “strong” threshold, as previously published [51]. In addition to the “weak” and “strong” thresholds, the 0th percentile of the scores was used as the “all” threshold, as it includes all sequences that score at or above the lowest-scoring footprinted binding site, and the 100th percentile of the scores was used as the “strongest” threshold (“Footprint” threshold analysis). A second set of analogous thresholds was derived using the same approach, but percentiles were defined from the scores of the binding sites PATSER output for each “top 1” analysis on the ChIP peaks (“ChIP” threshold analysis).
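
The percentile-based thresholds can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used with PATSER: the scores are assumed to be oriented so that larger values indicate stronger matches, the percentile uses linear interpolation between ranks (the interpolation rule is not specified in the text), and the function names are hypothetical.

```python
def percentile(values, pct):
    """Percentile with linear interpolation between ranks (an assumption)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no scores supplied")
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def footprint_thresholds(scores):
    """Map threshold names to the 0th, 50th, 75th and 100th percentiles."""
    return {
        "all": percentile(scores, 0),
        "weak": percentile(scores, 50),
        "strong": percentile(scores, 75),
        "strongest": percentile(scores, 100),
    }
```

The same function applied to the scores of the “top 1” sites in the ChIP peaks would give the analogous “ChIP” thresholds.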

Functional thresholds for the control DL PWM were obtained by running an analysis on the 330 bp rhomboid (rho) neuroectodermal enhancer (NEE). The rho NEE has been shown to contain four binding sites for DL and their locations have been accurately determined via DNA footprinting [54]. The lowest-scoring of these binding sites was used as the “weak Footprint” threshold, while the highest scoring was used as the “strong Footprint” threshold. Using percentile cutoffs for thresholds in this case would have been superfluous as there are only four binding sites for DL; however, these thresholds can be compared to those derived from the experimental PWMs [51] because both sets of thresholds were derived from footprint data. A second set of “ChIP” thresholds was derived for DL from the “top 1” analysis using the same percentile cutoff approach employed for the experimental PWMs.

3.3 Statistical Analysis

For each TF, three different measures of computationally predicted binding site location in each ChIP peak were used to assess the distribution of binding sites (see Fig. 1 for a graphical representation). A “directional distance” (in bp) for each binding site position from the midpoint of each peak was calculated for each of the analyses described above. An “absolute distance” was calculated in the same way but did not differentiate between which side of the midpoint the binding site was located. A “position ratio” value was calculated by dividing the binding site position in the ChIP peak by the total length of the peak. In this case, the ratio values 0 and 1 correspond to the ends of the peak and a value of 0.5 represents the center of the peak. Boxplots of the data were generated in MATLAB with default settings for each TF.
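
The three measures can be written out directly. The sketch below assumes that a binding site is represented by a single coordinate measured in bp from the start of the peak (for example, the center of the site); that convention and the function name are assumptions made for illustration.

```python
def site_measures(site_position, peak_length):
    """Return the three location measures for one predicted site.

    site_position: offset of the site from the peak start, in bp
    (assumed convention). peak_length: length of the ChIP peak in bp.
    """
    if peak_length <= 0:
        raise ValueError("peak_length must be positive")
    midpoint = peak_length / 2
    directional = site_position - midpoint  # negative = before the midpoint
    return {
        "directional": directional,
        "absolute": abs(directional),  # ignores side of the midpoint
        "position_ratio": site_position / peak_length,  # 0 and 1 = peak ends
    }
```

For example, a site 200 bp into an 800 bp peak has a directional distance of −200 bp, an absolute distance of 200 bp and a position ratio of 0.25.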

The average number of binding sites per ChIP peak was calculated for each “Footprint” threshold and “ChIP” threshold analysis. In addition, the total number and percentage of weak and strong “Footprint” threshold binding sites in different regions centered on the middle of the ChIP peaks were also calculated. The regions range from the middle 100 bp to the middle 800 bp, in increments of 100 bp.
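
The window counts can be sketched as below. The input is assumed to be a list of directional midpoint distances in bp, and a site is counted as inside a window when its absolute distance is at most half the window width; the treatment of sites exactly on a boundary is an assumption, as is the function name.

```python
def fraction_in_windows(directional_distances, widths=range(100, 900, 100)):
    """Fraction of sites inside each centered window (middle 100 to 800 bp)."""
    total = len(directional_distances)
    if total == 0:
        raise ValueError("no binding sites supplied")
    return {
        width: sum(1 for d in directional_distances if abs(d) <= width / 2) / total
        for width in widths
    }
```

Because each window contains the previous one, the fractions are cumulative, which matches the way the middle 100 bp, 200 bp and wider regions are reported in the text.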

To compare the distribution of predicted binding site locations from the midpoint at the strong “Footprint” threshold to the expected normal distributions, we performed a Kolmogorov-Smirnov test. For each TF, the position of predicted binding sites was tested against the null hypothesis that the positions follow a normal distribution with a mean of 0 and a standard deviation that results in the same percentage of binding sites falling into the middle 800 bp as is predicted using the strong “Footprint” threshold. The statistical analysis was done at a 5% significance level.
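
A minimal sketch of this comparison is given below, using only the Python standard library. It derives the standard deviation of the null normal distribution from the fraction of sites observed in the middle 800 bp, then computes the Kolmogorov-Smirnov D statistic; it does not compute a p-value, for which a library routine such as `scipy.stats.kstest` could be used. The function names are hypothetical, and this is not necessarily how the published values were calculated.

```python
from statistics import NormalDist


def sigma_matching_fraction(fraction_inside, half_width=400.0):
    """Standard deviation of a zero-mean normal distribution that places
    `fraction_inside` of its mass within +/- half_width bp of the midpoint."""
    if not 0.0 < fraction_inside < 1.0:
        raise ValueError("fraction_inside must be strictly between 0 and 1")
    z = NormalDist().inv_cdf((1.0 + fraction_inside) / 2.0)
    return half_width / z


def ks_statistic_vs_normal(distances, sigma):
    """One-sample K-S statistic D against Normal(mean 0, sd sigma)."""
    if not distances:
        raise ValueError("no distances supplied")
    null = NormalDist(mu=0.0, sigma=sigma)
    ordered = sorted(distances)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered, start=1):
        cdf = null.cdf(x)
        # Compare the null CDF with the empirical CDF just after
        # and just before each observation.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d
```

As a check on the first function, a fraction of about 0.6827 inside ±400 bp corresponds to a standard deviation of about 400 bp, since that is the mass of a normal distribution within one standard deviation of its mean.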

To compare the distribution of predicted binding site locations in the IAB5 and IAB8 CRMs to the distributions in the neighboring upstream or downstream genomic flanking regions of equal size, we performed two-sample Kolmogorov-Smirnov tests. For each TF, the position of predicted binding sites within the CRM-associated ChIP peak region was tested first against the null hypothesis that the position of predicted binding sites within the neighboring upstream region came from the same distribution and secondly against the null hypothesis that the position of predicted binding sites within the neighboring downstream region came from the same distribution. The statistical analysis was done at a 5% significance level.
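
The two-sample statistic can be sketched in the same way. The function below returns only the D statistic, the largest gap between the two empirical distribution functions; obtaining a p-value would require an additional step (for example `scipy.stats.ks_2samp`). The function name and the input layout (two lists of site positions) are assumptions for illustration.

```python
def ks_two_sample_statistic(sample_a, sample_b):
    """Two-sample K-S statistic D: the maximum absolute difference
    between the empirical CDFs of the two samples."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value <= x, so ties are
        # handled before the CDFs are compared.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d
```

Applied once to the ChIP peak versus the upstream flank and once versus the downstream flank, this gives the two comparisons described for each TF.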

3.4 Box-and-Whisker Plots

For each of the three different measures of computationally predicted binding site locations, a box-and-whisker plot was generated in MATLAB. In each plot, the central mark in the box represents the median value, while the bottom and top of the box correspond to the 25th and 75th percentiles of the values. The whiskers extend to include all measurement values not considered outliers. Values are defined as outliers if they are greater than y + 1.5 (y – x) or less than x − 1.5 (y – x), where x and y are the 25th and 75th percentile values, respectively.
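
The outlier rule above can be sketched in Python as follows. The quartiles are computed with `statistics.quantiles` using the inclusive method, which may differ slightly from the quantile definition used by MATLAB; the function name is hypothetical.

```python
import statistics


def box_stats(values):
    """Median, quartiles, fences, whisker ends and outliers for a boxplot."""
    # x and y in the text correspond to q1 and q3 here.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    lower_fence = q1 - 1.5 * spread
    upper_fence = q3 + 1.5 * spread
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "whisker_low": min(inside),    # most extreme non-outlier values
        "whisker_high": max(inside),
        "outliers": outliers,
    }
```

The whiskers end at the most extreme values that are not outliers, so they can be shorter than the fences themselves.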

Highlights

  • ChIP datasets are analyzed for CAUDAL, HUNCHBACK, KNIRPS and KRUPPEL to determine the distribution of predicted binding sites.
  • In all four cases the ChIP peaks contain multiple binding sites that are broadly distributed.
  • Trimming of ChIP peaks may result in the exclusion of functional binding sites from subsequent analyses.

Supplementary Material

Figure S1. TF binding site position ratio graphs:

Boxplots for each TF, with the y-axis representing the position ratio of binding sites in the ChIP peak, for “Top n”, “Footprint”, and “ChIP” analyses. The red line indicates the median position ratio and the middle 50% of binding sites are contained within the upper and lower bounds of each box. Whiskers extend to include all positions not marked as outliers.

Table S1. TF binding site position ratio in ChIP peaks:

The mean and standard deviation of the position ratio for binding sites, and the ln(p-value) threshold used in the “Top n”, “Footprint” threshold and “ChIP” threshold analyses are listed. A ‘−’ corresponds to entries that are not applicable.

Table S2. Number of DORSAL binding sites per ChIP peak:

The total number of predicted DORSAL (DL) binding sites and the mean number of sites per ChIP peak are listed for each “Footprint” and “ChIP” threshold analysis.

Table S3. DORSAL binding site directional midpoint distances in ChIP peaks:

The mean and standard deviation of the directional midpoint distances for DORSAL (DL) binding sites, and the ln(p-value) threshold used in the “Top n”, “Footprint” threshold and “ChIP” threshold analyses are listed. A ‘−’ corresponds to entries that are not applicable.

Table S4. DORSAL binding site distribution in ChIP peaks:

The number and percent of total DORSAL (DL) binding sites obtained in the weak and strong “Footprint” analyses in ChIP peak regions ranging from the middle 100 bp to the middle 800 bp in increments of 100 bp is shown for each of the four corresponding anterio-posterior TF peak regions. The number of binding sites over the entire length of the ChIP peaks (all) is also listed.

Figure S2. DORSAL binding site directional midpoint distance graphs:

Boxplots for DORSAL (DL) analysis on each set of ChIP peaks, with the y-axis representing the directional distance of binding sites from the ChIP peak midpoint, for “Top n”, “Footprint”, and “ChIP” analyses. The red line indicates the median directional midpoint distance and the middle 50% of binding sites are contained within the upper and lower bounds of each box. Whiskers extend to include all positions not marked as outliers.

Figure S3. Transcription factor binding profiles at the IAB5 and IAB8 CRMs:

The locations of the characterized CRMs in the bithorax complex are shown as a custom track in the UCSC Genome Browser (http://genome.ucsc.edu/; [35]). The Berkeley Drosophila Transcription Network Project ChIP/chip track [28] shows the location of verified binding for selected anterio-posterior gap/terminal (green) and pair-rule (yellow) transcription factors in stage 4–5 embryos (1% false discovery rate). The center 100 bp of each CRM is indicated (red rectangle).

Table S5. TF binding site predictions and distribution in the IAB5 and IAB8 CRMs:

The total number of predicted binding sites in the corresponding ChIP peak region mapping to each CRM and the ln(p-value) threshold used in the “Footprint” threshold analyses are listed. Color coding for the number of predicted sites indicates that the ChIP peak contains more sites than either of the neighboring upstream or downstream genomic flanking regions of equal size (green), fewer sites than either flanking region (red) or an intermediate number (orange). The mean and standard deviation of the directional midpoint distance and normalized absolute midpoint distances for binding sites are also shown. A ‘−’ corresponds to entries that are not applicable. Color coding for the normalized absolute position ratio indicates that the value is higher (representing a broader distribution) in the ChIP peak than either of the neighboring genomic flanking regions of equal size (green), lower than either flanking region (red) or an intermediate value (orange). Statistical significance of the differences in the binding site distribution patterns between the ChIP peak and the neighboring genomic flanking regions of equal size was assessed with two-sample K-S tests on the directional midpoint distance values. The resulting normalized absolute position ratios are indicated as significantly different from both (**) or one (*) of the flanking regions as appropriate.

Acknowledgments

The research in this paper was funded by National Institutes of Health (GM090167) and National Science Foundation (IOS-0845103) grants to R.A.D., and a National Institutes of Health (GM110571) grant to R.A.D. and J.M.D. K.P.P. was funded as a Fellow from a National Science Foundation Interdisciplinary Training Award to Amherst College (DBI1129152).

Abbreviations

TF: Transcription factor
ChIP: Chromatin immunoprecipitation
PWM: Position weight matrix

Footnotes

Competing interests

The authors report no competing interests.

Authors’ contributions

K.P.P. participated in the design of the study, performed the analysis, and drafted the manuscript. J.M.D. jointly conceived the study, participated in the design of the study and coordination, and drafted the manuscript. R.A.D. jointly conceived the study, participated in its design and coordination, and drafted the manuscript. All authors read and approved the final manuscript.

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

1. Howard ML, Davidson EH. cis-Regulatory control circuits in development. Dev Biol. 2004;271:109–118.
2. Sauer F, Rivera-Pomar R, Hoch M, Jäckle H. Gene regulation in the Drosophila embryo. Philos Trans R Soc Lond B Biol Sci. 1996;351:579–587.
3. Berleth T, Burri M, Thoma G, Bopp D, Richstein S, Frigerio G, Noll M, Nüsslein-Volhard C. The role of localization of bicoid RNA in organizing the anterior pattern of the Drosophila embryo. EMBO J. 1988;7:1749–1756.
4. Steward R, Zusman SB, Huang LH, Schedl P. The dorsal protein is distributed in a gradient in early Drosophila embryos. Cell. 1988;55:487–495.
5. Driever W, Nüsslein-Volhard C. A gradient of bicoid protein in Drosophila embryos. Cell. 1988;54:83–93.
6. Struhl G, Struhl K, Macdonald PM. The gradient morphogen bicoid is a concentration-dependent transcriptional activator. Cell. 1989;57:1259–1273.
7. Qian S, Capovilla M, Pirrotta V. The bx region enhancer, a distant cis-control element of the Drosophila Ubx gene and its regulation by hunchback and other segmentation genes. EMBO J. 1991;10:1415–1425.
8. Stanojevic D, Hoey T, Levine M. Sequence-specific DNA-binding activities of the gap proteins encoded by hunchback and Krüppel in Drosophila. Nature. 1989;341:331–335.
9. Ip YT, Kraut R, Levine M, Rushlow CA. The dorsal morphogen is a sequence-specific DNA-binding protein that interacts with a long-range repression element in Drosophila. Cell. 1991;64:439–446.
10. Small S, Blair A, Levine M. Regulation of two pair-rule stripes by a single enhancer in the Drosophila embryo. Dev Biol. 1996;175:314–324.
11. Jiang J, Kosman D, Ip YT, Levine M. The dorsal morphogen gradient regulates the mesoderm determinant twist in early Drosophila embryos. Genes Dev. 1991;5:1881–1891.
12. Mitchell PJ, Tjian R. Transcriptional regulation in mammalian cells by sequence-specific DNA binding proteins. Science. 1989;245:371–378.
13. Ptashne M, Gann A. Transcriptional activation by recruitment. Nature. 1997;386:569–577.
14. Jolma A, Yan J, Whitington T, Toivonen J, Nitta KR, Rastas P, Morgunova E, Enge M, Taipale M, Wei G, et al. DNA-Binding Specificities of Human Transcription Factors. Cell. 2013;152:327–339.
15. Kadonaga JT. Regulation of RNA polymerase II transcription by sequence-specific DNA binding factors. Cell. 2004;116:247–257.
16. Rebar EJ, Pabo CO. Zinc finger phage: affinity selection of fingers with new DNA-binding specificities. Science. 1994;263:671–673.
17. Sommer RJ, Retzlaff M, Goerlich K, Sander K, Tautz D. Evolutionary conservation pattern of zinc-finger domains of Drosophila segmentation genes. Proc Natl Acad Sci U S A. 1992;89:10782–10786.
18. Weirauch MT, Cote A, Norel R, Annala M, Zhao Y, Riley TR, Saez-Rodriguez J, Cokelaer T, Vedenko A, Talukder S, et al. Evaluation of methods for modeling transcription factor sequence specificity. Nat Biotechnol. 2013;31:126–134.
19. Bailey TL, Bodén M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble WS. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 2009;37:W202–208.
20. Djordjevic M, Sengupta AM, Shraiman BI. A biophysical approach to transcription factor binding site discovery. Genome Research. 2003;13:2381–2390.
21. Hertz GZ, Hartzell GW, 3rd, Stormo GD. Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Computer Applications in the Biosciences. 1990;6:81–92.
22. Hertz GZ, Stormo GD. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics. 1999;15:563–577.
23. Morozov AV, Havranek JJ, Baker D, Siggia ED. Protein-DNA binding specificity predictions with structural models. Nucleic Acids Res. 2005;33:5781–5798.
24. Badis G, Berger MF, Philippakis AA, Talukder S, Gehrke AR, Jaeger SA, Chan ET, Metzler G, Vedenko A, Chen X, et al. Diversity and complexity in DNA recognition by transcription factors. Science. 2009;324:1720–1723.
25. Zhu LJ, Christensen RG, Kazemian M, Hull CJ, Enuameh MS, Basciotta MD, Brasefield JA, Zhu C, Asriyan Y, Lapointe DS, et al. FlyFactorSurvey: a database of Drosophila transcription factor binding specificities determined using the bacterial one-hybrid system. Nucleic Acids Res. 2011;39:D111–117.
26. Gilchrist DA, Fargo DC, Adelman K. Using ChIP-chip and ChIP-seq to study the regulation of gene expression: genome-wide localization studies reveal widespread regulation of transcription elongation. Methods. 2009;48:398–408.
27. Li XY, MacArthur S, Bourgon R, Nix D, Pollard DA, Iyer VN, Hechmer A, Simirenko L, Stapleton M, Luengo Hendriks CL, et al. Transcription factors bind thousands of active and inactive regions in the Drosophila blastoderm. PLoS Biology. 2008;6:e27.
28. MacArthur S, Li XY, Li J, Brown JB, Chu HC, Zeng L, Grondona BP, Hechmer A, Simirenko L, Keränen SV, et al. Developmental roles of 21 Drosophila transcription factors are determined by quantitative differences in binding to an overlapping set of thousands of genomic regions. Genome Biol. 2009;10:R80.
29. Sakabe NJ, Nobrega MA. Beyond the ENCODE project: using genomics and epigenomics strategies to study enhancer evolution. Philos Trans R Soc Lond B Biol Sci. 2013;368:20130022.
30. Ho MC, Johnsen H, Goetz SE, Schiller BJ, Bae E, Tran DA, Shur AS, Allen JM, Rau C, Bender W, Fisher WW, et al. Functional evolution of cis-regulatory modules at a homeotic gene in Drosophila. PLoS Genetics. 2009;5:e1000709.
31. Noyes MB, Christensen RG, Wakabayashi A, Stormo GD, Brodsky MH, Wolfe SA. Analysis of homeodomain specificities allows the family-wide prediction of preferred recognition sites. Cell. 2008;133:1277–1289.
32. Zellers RG, Drewell RA, Dresch JM. MARZ: an algorithm to combinatorially analyze gapped n-mer models of transcription factor binding. BMC Bioinformatics. 2015 In press.
33. Wilbanks EG, Facciotti MT. Evaluation of algorithm performance in ChIP-seq peak detection. PLoS One. 2010;5:e11471.
34. Starr MO, Ho MC, Gunther EJM, Tu Y-K, Shur AS, Goetz SE, Borok MJ, Kang V, Drewell RA. Molecular dissection of cis-regulatory modules at the Drosophila bithorax complex reveals critical transcription factor signature motifs. Developmental Biology. 2011;359:290–302.
35. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. The human genome browser at UCSC. Genome Research. 2002;12:996–1006.
36. Araya CL, Kawli T, Kundaje A, Jiang L, Wu B, Vafeados D, Terrell R, Weissdepp P, Gevirtzman L, Mace D, et al. Regulatory analysis of the C. elegans genome with spatiotemporal resolution. Nature. 2014;512:400–405.
37. Berman BP, Nibu Y, Pfeiffer BD, Tomancak P, Celniker SE, Levine M, Rubin GM, Eisen MB. Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc Natl Acad Sci U S A. 2002;99:757–762.
38. Ng FS, Schutte J, Ruau D, Diamanti E, Hannah R, Kinston SJ, Gottgens B. Constrained transcription factor spacing is prevalent and important for transcriptional control of mouse blood cells. Nucleic Acids Res. 2014;42:13513–13524.
39. Arnosti DN, Barolo S, Levine M, Small S. The eve stripe 2 enhancer employs multiple modes of transcriptional synergy. Development. 1996;122:205–214.
40. Small S, Blair A, Levine M. Regulation of even-skipped stripe 2 in the Drosophila embryo. EMBO J. 1992;11:4047–4057.
41. Struffi P, Corado M, Kaplan L, Yu D, Rushlow C, Small S. Combinatorial activation and concentration-dependent repression of the Drosophila even skipped stripe 3+7 enhancer. Development. 2011;138:4291–4299.
42. Drewell RA, Nevarez MJ, Kurata JS, Winkler LN, Li L, Dresch JM. Deciphering the combinatorial architecture of a Drosophila homeotic gene enhancer. Mech Dev. 2014;131:68–77.
43. Johnson JL, McLachlan A. Novel clustering of Sp1 transcription factor binding sites at the transcription initiation site of the human muscle phosphofructokinase P1 promoter. Nucleic Acids Res. 1994;22:5085–5092.
44. Worsley Hunt R, Wasserman WW. Non-targeted transcription factors motifs are a systemic component of ChIP-seq datasets. Genome Biol. 2014;15:412.
45. Li L, Zhu Q, He X, Sinha S, Halfon MS. Large-scale analysis of transcriptional cis-regulatory modules reveals both common features and distinct subclasses. Genome Biol. 2007;8:R101.
46. Fakhouri WD, Ay A, Sayal R, Dresch J, Dayringer E, Arnosti DN. Deciphering a transcriptional regulatory code: modeling short-range repression in the Drosophila embryo. Mol Syst Biol. 2010;6:341.
47. Hare EE, Peterson BK, Eisen MB. A careful look at binding site reorganization in the even-skipped enhancers of Drosophila and sepsids. PLoS Genetics. 2008;4:e1000268.
48. Hare EE, Peterson BK, Iyer VN, Meier R, Eisen MB. Sepsid even-skipped enhancers are functionally conserved in Drosophila despite lack of sequence conservation. PLoS Genetics. 2008;4:e1000106.
49. Bailey TL, Machanick P. Inferring direct DNA binding from ChIP-seq. Nucleic Acids Res. 2012;40:e128.
50. Rhee HS, Pugh BF. Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell. 2011;147:1408–1419.
51. Stringham JL, Brown AS, Drewell RA, Dresch JM. Flanking sequence context-dependent transcription factor binding in early Drosophila development. BMC Bioinformatics. 2013;14:298.
52. Meng X, Brodsky MH, Wolfe SA. A bacterial one-hybrid system for determining the DNA-binding specificity of transcription factors. Nat Biotechnol. 2005;23:988–994.
53. Herold J, Kurtz S, Giegerich R. Efficient computation of absent words in genomic sequences. BMC Bioinformatics. 2008;9:167. doi: 10.1186/1471-2105-9-167.
54. Ip YT, Park RE, Kosman D, Bier E, Levine M. The dorsal gradient morphogen regulates stripes of rhomboid expression in the presumptive neuroectoderm of the Drosophila embryo. Genes Dev. 1992;6:1728–1739.