|Home | About | Journals | Submit | Contact Us | Français|
Computationally retrieving biologically relevant cis-regulatory modules (CRMs) is not straightforward. Because of the large number of candidates and the imperfection of the screening methods, many spurious CRMs are detected that are as high scoring as the biologically true ones. Using ChIP-information allows not only to reduce the regions in which the binding sites of the assayed transcription factor (TF) should be located, but also allows restricting the valid CRMs to those that contain the assayed TF (here referred to as applying CRM detection in a query-based mode). In this study, we show that exploiting ChIP-information in a query-based way makes in silico CRM detection a much more feasible endeavor. To be able to handle the large datasets, the query-based setting and other specificities proper to CRM detection on ChIP-Seq based data, we developed a novel powerful CRM detection method ‘CPModule’. By applying it on a well-studied ChIP-Seq data set involved in self-renewal of mouse embryonic stem cells, we demonstrate how our tool can recover combinatorial regulation of five known TFs that are key in the self-renewal of mouse embryonic stem cells. Additionally, we make a number of new predictions on combinatorial regulation of these five key TFs with other TFs documented in TRANSFAC.
In eukaryotes, transcriptional regulation is mediated by the concerted action of different transcription factors (TFs) (1). Searching for cis-acting regulatory modules (CRMs), or combinations of motifs that often co-occur in a set of coregulated sequences, helps in unraveling the mode of combinatorial regulation. CRM detection is customarily being applied on a set of intergenic regions located upstream of coexpressed genes; such genes are for example identified by microarray experiments. Except for de novo methods (2,3), most CRM detection methods rely on a motif screening step. In this step, all sites that match given motifs of TFs, are located in the selected sequences (4,5). Subsequently, a combinatorial search is performed to identify a set of motifs (called a CRM), that occur frequently in the given set of intergenic sequences. Usually, a score is assigned to each of the obtained CRMs that assess their statistical significance in a set of background sequences (4,5). Some methods apply a highly structured definition in which a CRM consists of combinations of motifs that need to occur in a specific order, with specific orientations, and within certain distances (6–8). Although the overrepresentation of a structured CRM in a gene set is likely to be biologically relevant (9,10), the degree to which biologically relevant CRMs are structured is still largely unknown (11). Therefore, most methods rely on less constrained CRM models, hereafter referred to as unstructured CRM detection methods (4,5,12–16),
The combinatorial search underlying CRM detection is highly complex. Adding to this complexity, is the fact that often the regions containing the binding sites are not well delineated [e.g. when starting from a coexpressed gene sets, the intergenic sequences typically range several 1000s of base pairs upstream of the transcription start site (TSS)]. Such long intergenic sequences reduce the signal/noise ratio to such extent that in silico CRM detection becomes ineffective. Hence the search for combinatorial regulation is often limited to the proximal promoter region, whereas at least some of the sites responsible for the observed expression behavior might also be located in regions distant from the TSS of the given genes (such as for instance in enhancers). Nowadays chromatin-immunoprecipitation (ChIP)-based techniques are becoming increasingly popular for the genome-wide identification of TF binding sites (17,18). Such techniques make it feasible to locate, at least for the assayed TF, the approximate binding regions. Using ChIP-bound sequences thus allows largely reducing the regions in which the binding sites of the assayed TF should be located (typically 500bp instead of thousands of bp) (19,20) and does not restrict the search for CRMs to the proximal promoter region (7,21–23).
In addition to these obvious advantages, using ChIP-information also allows for a query-based search strategy. Instead of searching for all possible CRMs in the input set as is traditionally being done, we can limit our search to those CRMs that contain the ChIP-assayed TF (7,8). Incorporating knowledge of the assayed TF during CRM detection in such a query-based way allows predicting complex CRMs in which the assayed TF is involved. In this work, we studied the extent to which using ChIP-derived information can help in increasing the performance of CRM detection, compared to the application of CRM detection in a traditional non-query-based setting. To this end, we developed a novel powerful CRM detection method ‘CPModule’, an unstructured CRM detection method based on a constraint programming for itemset mining framework (24). Besides handling the specific challenges of CRM detection on ChIP-Seq based data, CPModule can be used in both a query and non-query-based mode. Its exhaustive search strategy allows making an assessment of the total number of valid CRMs that are present in the input set and of the degree to which a CRM of interest gets prioritized among the total number of candidates.
Applying CPModule on a well-studied ChIP-Seq data set involved in self-renewal of mouse embryonic stem cells (25) showed that using a query-based setting is in most cases the only sensible way to perform CRM detection. Besides recovering well described benchmark CRMs, we also make several novel predictions on the combinatorial regulation of the five key regulators, involved in the process of self-renewal, with other TFs documented in TRANSFAC (26).
The motif models used for screening are position weight matrices (PWMs) from TRANSFAC (26). To remove redundancy among the PWMs, each of them was compared to all the other PWMs using the program MotifComparison (27). MotifComparison calculates the Kullback–Leiber distance between matrices; PWMs showing a mutual distance <0.1 were grouped. All PWMs in a group were removed except for one representative one. This resulted in a final list of 516 PWMs. Motif screening was performed with the tool Clover (28). Given a PWM of W nucleotides and a sequence, it calculates a score for each subsequence of length W. A threshold on the score determines whether the subsequence is considered a potential binding site of the motif, also called a hit or a motif site. The default threshold of 6.0 was used to define a stringent screening, resulting in few hits per motif and sequence on average, while a threshold of 3.0 corresponds to a non-stringent screening resulting in many more hits.
We filtered potential motif sites by removing sites that exceed a given value for the nucleosome occupancy score (NuOS). To calculate the nucleosome occupancy score of a potential motif site, we first assigned a nucleosome occupancy probability to each base pair position, using the prediction model ‘NuPoP’ of Xi et al. 2010 (29). The nucleosome occupancy score was then calculated as the geometric mean of the nucleosome occupancy probabilities at all positions of the potential motif site (30).
To determine the optimal threshold values for the NuOS, we tested the effect of different filtering thresholds on their ability to: (i) reduce the number of motif site predictions per region and per TF, while (ii) not too much compromising the sensitivity of recovering true binding sites (see Supplementary File S1). Based on these tests, predicted motif sites located within a low probability of nucleosome occupancy (<10%) (when using a filtering based on the NuOS) were retained.
CPModule uses as input the motif sites located in the input sequences by motif screening. The result of the screening and filtering step is for each motif M and sequence S a set of motif hits . Motif M has a hit at with ; here is an interval between start position and stop position on the sequence S. corresponds to the length of sequence S.
The combinatorial search problem of finding a motif set is solved by using the constraint programming (CP) for itemset mining framework (24). The core of CPModule enumerates all possible motif sets, where a motif set is defined as a subset of all screened motifs. A CRM is a motif set that is valid given a set of domain-specific constraints (more details provided below): (i) the frequency constraint, which requires that the motif set occurs in a predefined minimal number of sequences from the input set, (ii) the proximity constraint, which requires that hits of motifs in a set should occur in each other’s proximity. The maximal distance of the region in which hits should co-occur is specified by the user and controls the level of proximity, (iii) the redundancy constraint, which requires that the motif set must be non-redundant with respect to its supersets. Optionally, (iv) a query constraint ensures that a given query motif must be part of the motif set found (Figure 1B).
More formally, a motif set will only be considered as a potential CRM in a sequence if and only if its set of hit regions (HR) on that sequence is not empty, where the set of hit regions (HR) consists of those ranges of base pairs in which all selected motifs are present, e.g. they have at least one hit with interval in that range:
Given a set of sequences S, the subset of sequences in which the set of hit regions is not empty is denoted by .
To find the motif sets that fulfill the above requirements, we use a general and principled approach based on ‘constraint programming’. In constraint programming, problems are modeled as constraint satisfaction problem in terms of constraints on variables. The constraint specification is then solved by a general purpose, yet highly efficient algorithm. Consequently, we need to encode our problem as a constraint satisfaction problem. We do so as follows. Motif sets are represented by Boolean variables: there is a Boolean variable for every possible motif, indicating whether this motif is part of the motif set . If a certain , then we say that the motif is in the motif set; otherwise the motif is not in the set. Furthermore, we have a Boolean variable for every genomic sequence, indicating whether the motif set will be considered as a potential CRM in a sequence, i.e. whether . Lastly, we define a Boolean variable for every motif and every sequence . The variable indicates whether motif is in the proximity of all motifs in motif set on sequence .
The ‘constraints’ are imposed on these variables as follows:
The proximity constraint couples the variables to the variables representing the motifs. Formally, we define the variables as follows for every motif on every genomic sequence:
In other words, if in a particular genomic sequence a hit of a particular motif is within a hit region () of the motif set, this motif’s variable for that sequence must be 1. Observe that =1 will hold for all motifs in the motif set , for all sequences that are in ; however, there may be additional motifs that have hits in the proximity of regions in .
The constraint that imposes a minimum size on is formalized as: . Here, the variables are defined in terms of the variables, such to ensure that sequences are only counted () if all selected motifs () occur within each other’s proximity in that sequence ():
The redundancy constraint requires that we cannot add a motif to a motif set without losing one sequence in its corresponding sequence set . This can be enforced as follows on the Boolean variables:
stating that a motif must be part of the set () if on all selected sequences () the motif is within the proximity of the others ().
The query-based constraint requires that each motif set contains at least a given motif. This can be enforced by requiring that the corresponding Boolean variable satisfies the constraint that . Note that the proximity constraint will ensure that only motif sets will be considered with hits close enough to this given motif.
This combination of constraints is solved by a constraint programming system by means of a depth first backtracking search. The search strategy alternates between branching, in which a variable is assigned a value from its domain (Boolean value), and propagation, the process of using a constraint to remove values from the domain of variables. The search strategy is similar to strategies that have been used in itemset mining (24). The main difference with traditional itemset mining is the inclusion of proximity constraints and the inclusion of a redundancy constraint for this type of data. The advantage of using an existing CP system (31) is that additional constraints can be added in a modular and straightforward way, preventing the reimplementation of the itemset mining strategy from scratch. For more details on the implementation of the different constraints we refer to ref. (32).
To assess the significance of the found motif sets (CRMs), we calculate for each an enrichment score (P-value) adapting the strategy proposed in Gallo et al. (33). The main modification is that we use a set of background sequences in the calculation of this score. These background sequences are used to estimate the proportion of sequences in the whole genome that contain the motif set. We compare the number of observed input sequences that contain a particular motif set with the number of sequences that is expected to contain this motif set. The latter is estimated by counting the number of background sequences containing the motif set . This set is obtained after applying exactly the same screening and filtering strategy on the background sequences as was applied on the input sequences. Based on this estimate of the probability to observe a valid motif set in a random set, we calculate a genome-wide enrichment score (P-value) by means of a cumulative binomial distribution:
Where ; is the set of input sequences; is the number of input sequences; is the set of background sequences.
The background sequences were derived by sampling from the mouse genome [Version mm9, NCBI Build 37, UCSC database (34)], a large number of intergenic sequences (2000 background sequences for the synthetic data, 5000 background sequences for each of the ChIP-Seq assays). To exclude the possibility that the composition of the background set would influence the estimated background occurrences of the CRMs, we compiled background sets consisting of either putative promoter sequences, that is sequences located upstream of a gene’s transcription start site, or sets made from background sequences located in putative enhancer regions, that is sequences corresponding to regions bound by the enhancer binding proteins factor CTCF [downloaded from ENCODE (35)]. In our experiments, the composition of the background sets did not influence our final ranking. All results presented in the article use a background set based on proximal promoter regions.
When dealing with ChIP-Seq data, where each sequence is a region around an assayed transcription factor site, we selected the background sequences such that each sequence contains at least one motif site of the assayed TF (which does not overlap with a ChIP-bound region). In this setting, the number of background sequences that qualifies is variable for each data set. To have an equal number of sequences for each background set, we randomly sampled for each data set 5000 sequences from the set of qualifying background sequences (5310 was the maximal number of sequences that could be obtained for the data set with the smallest cognate background set). Note that with this strategy we approximate the P-value calculation in a conservative way as we cannot exclude that the background contains sequences with true sites of the assessed TF that remained unbound under the assessed conditions.
The motif sets are ranked according to their enrichment score [P-value (M), Equation (5)]. Note that with this ranking, for two redundant motif sets that occur in the same sequences, the smaller motif set will never score better than the larger one i.e. if , with (M1,S)=(M2,S), motif set M2 will never score worse than M1. This motivates the use of the redundancy constraint during the combinatorial search as it removes from a set of redundant motif sets only the least interesting sets, which would get a very low rank in any case.
We first use a synthetic data set to compare CPModule with different CRM detection tools. The synthetic data is retrieved from Xie et al. (36). The data consists of 22 genomic sequences each 1000 base pairs in length. In 20 sequences, sites sampled from the TRANSFAC PWMs of respectively OCT4, SOX2 and FOXD3 were inserted in a region of at most 164bp (so the CRM encompasses maximally 164bp). Each PWM was sampled three times per sequence. The last two sequences had no sites inserted.
CPModule was compared with related tools for unstructured CRM detection such as ModuleSearcher, obtained from the author (37) on 6 July 2008; Compo obtained from the author (14) on 14 August 2010; Cister and Cluster-Buster downloaded from the authors website (15,16) on 22 June 2010 and 21 June 2010, respectively; all methods were given the non-redundant motif list described in section ‘Motif screening’.
CPModule was run with a frequency threshold of 60% and a proximity threshold of 165bp (as this was the maximum distance used when generating the data). Nevertheless, CPModule was shown not to be very sensitive to the exact value of the proximity threshold (see Supplementary File S2). For the other CRM detection tools, we similarly used the best parameter values according to the characteristics of the synthetic data (length of the sequences, the distance between two insertion sites, and the maximum size of CRM) and default values otherwise. Supplementary Table S1 lists the non-default parameters for the used tools.
We evaluated the performance of the different CRM tools using the motif (mCC) and nucleotide correlation coefficients (nCC) (5). At the motif level (mCC), a predicted motif for a sequence is a true positive (TP) if that motif was indeed part of the CRM on that sequence, otherwise it is a false positive (FP). If a motif was not predicted to belong to a CRM on that sequence, although it should have been according to the benchmark, it is counted as a false negative (FN), otherwise as a true negative (TN). As the motif level evaluation does not take the predicted sites into account, a solution is also evaluated at the nucleotide level (nCC): for every nucleotide we verify whether it was predicted to be part of the CRM and whether it should have been predicted or not, again resulting in TP, FP, FN and TN counts. These counts are aggregated over all sequences to obtain the total counts of the predicted CRM. Based on these counts, a correlation coefficient (CC) is defined as follows:
The value of this coefficient ranges from −1 to 1. A score of +1 indicates that a prediction corresponds to the correct answer. Random predictions will generally result in CC values close to zero. Ideally, a CRM should score good at both the motif and nucleotide level.
The real data set was derived from genome-wide ChIP data obtained with DNA sequencing (ChIP-Seq) for the KLF4, NANOG, OCT4, SOX2 and STAT3 transcription factors, as described by Chen et al. (25). For each transcription factor, the input set consists of 100 sequences, each corresponding to 500bp centered around one of the top 100 ChIP-binding peaks of the assayed TF. Binding peaks were taken from the GEO file GSE11431 (38).
Screening is performed using the 516 TRANSFAC PWMs described above (section ‘Motif screening’). However, as a KLF4 PWM was missing in TRANSFAC, we added to our list the PWM described by Whitington et al. (39). Whitington et al. (39) constructed the KLF4 PWM using de novo motif detection on a set of sequences involved in the development of mouse embryonic stem cells, derived from a ChIP-chip experiment by Jiang et al. (40) independent from the one used in this study.
We applied our method using three different screening results: (i) high-stringency screening; (ii) low-stringency screening; (iii) and low-stringency screening in combination with NuOS filtering. Proximity thresholds for CPModule were varied stepwisely as mentioned below (section ‘Results’ section). The frequency threshold was set at 60% unless mentioned otherwise.
In contrast to what is often assumed for CRM detection on coexpressed genes, we do not expect that all sequences derived from a ChIP-Seq experiment contain the same CRM. Indeed in the same list of ChIP-bound sequences, different CRMs might be present depending on which other TFs are needed next to the ChIP-assayed one to mediate coregulation of certain subsets of the genes in the list.
Since the same CRM must not be present in all sequences, one cannot simply take the intersection of motifs that appear in the sequences. Instead, all possible combinations of motifs have to be considered as candidate CRM, and validated against the data. This makes CRM detection a combinatorial and computational hard problem. Furthermore, it is not known in advance which TFs are part of the CRM hence one would like to consider all TFs having a known motif model. Using a large number of candidate motifs makes the problem more difficult, as 100 candidate motifs means there are 2100 candidate motif sets; the problem is even more severe when using the entire TRANSFAC database, which contains more than 500 motif models. Additionally, ChIP-Seq derived datasets are large in the number of sequences (usually hundreds of peaks are detected for the assayed TF). For each candidate motif set, all these sequences will have to be processed. The fact that each motif can have multiple hits on a single sequence complicates things even further. Despite the fact that several methods for CRM detection have been developed in the past (4), the aforementioned computational issues are still challenging for CRM tools that find complex CRMs consisting of an arbitrary number of TFs.
To deal with these computation issues, we developed CPModule, a CRM detection method based on constraint programming (CP) for itemset mining (24) (for a full description of the method see ‘Materials and Methods’ section). CPModule searches for combinations of motifs (cis-acting regulatory modules) that are sufficiently specific for a given input set. Using a library of PWMs, a set of coregulated sequences is screened and filtered to obtain a list of hits per motif and sequence (Figure 1A). CPModule then uses this input list to enumerate all possible combinations of motifs (motif sets) that meet a set of predefined constraints (Figure 1B). These constraints define what we consider biologically relevant CRMs: first, a CRM should occur in a minimal number of input sequences (frequency constraint, Figure 1B) to be sufficiently specific for the input set, but it does not necessarily have to cover all sequences. Second, we assume that motif sites of a CRM are more likely to reflect true combinatorial regulation when they occur in each other’s proximity on a single sequence than when they are scattered over long genomic distances. Therefore, the motif sites composing a CRM should occur within a maximal genomic distance from each other (proximity constraint, Figure 1B). This is not guaranteed to always be the case, which is compensated by the frequency constraint that does not require all sequences to contain the CRM. Because motifs can have multiple binding sites on the same sequence, we observed that several of the CRMs found were redundant. We consider a found CRM to be redundant to another one if it contains a subset of the motifs of the other CRM and occurs in exactly the same sequences. In this case, the smaller CRM will have a lower statistical score anyway; hence we can discard such CRMs from consideration. To make the search more effective, we will avoid redundant CRMs during search (redundancy constraint, (Figure 1B). Finally, in ChIP-Seq assayed data, one can use the fact that at least the assayed TF should be part of the CRM. For this purpose, we add a query-based option in which a ‘query motif’ can be provided that has to be part of the CRM. Using this knowledge before and during search (query-based constraint, Figure 1B) can make the problem computationally much more feasible.
When using the constraint programming for itemset mining framework with the above constraints, the result is a large collection of CRMs found to fulfill those constraints. We explicitly chose not to statistically evaluate the CRMs during search as this can quickly become computationally intensive, especially when evaluating the specificity of a CRM using a large collection of background sequences. Instead, we calculate an enrichment score (P-value) in a post-processing step and rank the CRMs according to this score (see Figure 1C and ‘Materials and Methods’ section). The added benefit is that our system does not return one CRM as being the most significant CRM, rather it returns an ordered list which a domain expert can choose from.
Before applying our method on a real ChIP-Seq data set we compared its performance with that of a number of well performing CRM tools, namely ModuleSearcher (37), Compo (14), Cister (16) and Cluster-Buster (15).
Cister (16) and Cluster-Buster (15) are representative single-sequence tools. They scan each sequence individually, searching for potential CRMs that best match a predefined structure as imposed by model parameters (here a hidden Markov model). In contrast to multiple sequences tools, such as CPModule, they do not explicitly test whether the detected CRMs are specific for the input set as a whole. Nevertheless, these tools are very good at detecting CRMs in individual sequences. Because they treat sequences individually, they are computationally more efficient than multiple sequence tools. However, they cannot take advantage of the fact that sequences are coregulated.
Like CPModule, ModuleSearcher (37) and Compo (14) are multiple sequence tools. Both of them come with their own motif screening tool. ModuleSearcher searches for motif sets by using a genetic algorithm, a heuristic search method that maintains of pool of solutions which are modified to find ever better solutions. This type of search is more ad hoc and gives no guarantees that the best CRMs are found. Compo on the other hand uses techniques from itemset mining, as does CPModule. However, they differ in a number of ways: Compo is a specialized algorithm while CPModule uses a generic constraint-based methodology, allowing to incorporate extra constraints such as the redundancy constraint and the query-based constraint in a principled way. Additionally, while Compo has a strong focus on multi-parameter search and optimization, CPModule does exhaustive search and calculates the significance of a CRM using a large collection of background sequences in a second step.
For benchmarking we used the synthetic data constructed by Xie et al. (36). This data set contains intergenic sequences in which ‘true motif sites’ are inserted. The performance of CRM detection was assessed by comparing the best scoring solution of each algorithm with the known ‘true’ solution. The quality of the solutions was evaluated both at the motif and nucleotide level using respectively a motif correlation coefficient (mCC) and a nCC (5) (see ‘Materials and Methods’ section).
Finding CRMs in multiple long sequences using all 516 TRANSFAC PWMs is a challenging task. Table 1 shows a comparison of the different tools on our synthetic benchmark data. The single sequence tools Cister (16) and Cluster-Buster (15) perform rather poorly (<0.25 for both mCC and nCC). Because they screen sequences individually, they predict CRMs with a large number of motifs that differ from sequence to sequence, resulting in bad scores. ModuleSearcher (41) could not handle the large number of PWMs and repeatedly ran into memory problems, even with 2GB of RAM allocated. When limiting the maximum number of motifs in a CRM to 10 or less, Compo (14) found the solution reported in Table 1. It scores very well at the nucleotide level (0.68), meaning that it finds the binding region on the sequences quite accurately. However, at the motif level it performs rather poorly (0.27); it wrongly identifies the motifs that bind in the identified regions. CPModule scores worse (0.55 versus 0.68) at the nucleotide level than Compo, but scores considerably better (0.57 versus 0.27) at the motif level.
To allow for a more thorough comparison of different the tools, we reanalyzed the dataset using each time a subset of the 516 motif models. Starting from the PWMs of the three inserted TFs, we gradually increase the number of total PWMs by sampling them from the set of 513 remaining ones. This results in an increasingly larger input in terms of candidate motifs, which contain increasingly more false motif models. This makes the problem increasingly harder. Figure 2 shows the motif and nucleotide-level correlation coefficients (CC) of the different tools on the data. The number of motif models used is shown on the x-axis (for each sample size, 10 different samples are created and all tools are run using the same sample sets). The single-sequence-based tools Cister (16) and Cluster-Buster (15) perform best in the presence of few sampled PWMs and deteriorate as more PWMs are added. With few PWMs their predictions are accurate, especially regarding the binding region of the CRMs (nucleotide level). However, because they treat sequences independently, different false positive motifs are predicted on each sequence separately. This becomes worse as the number of sampled PWMs increases, leading to decreasing scores. For the multiple sequence tools Compo (14), ModuleSearcher (41) and CPModule this problem is less pronounced. The behavior of Compo changes as the number of motifs increases: at the motif level, the score has a decreasing trend, while at the nucleotide level the score increases. Looking at the CRMs found, we observe that Compo finds CRMs with only one true motif and increasingly more false motifs as the number of PWMs increases, explaining the motif-level behavior. Unexpectedly, adding these false motifs seems to contribute rather than to deteriorate the precision at the nucleotide level. The shorter predicted binding regions obtained by adding more motifs seem to coincidentally better approximate the regions in which also the true CRMs are located. The scores of CPModule and ModuleSearcher score well on both the motif- and nucleotide-level, while being less affected by the number of motifs used. However, ModuleSearcher is unable to scale to more than 400 motifs because of memory issues, while this is not a problem for CPModule.
These results show that CPModule is competitive with state-of-the-art tools in detecting CRMs in sets of coregulated sequences. This, in combination with its capability of handling a large number of sequences, prioritizing CRMs by means of ranked lists and the ability to be used in a query-based mode, make it ideally suited for this study and for the analysis of ChIP-Seq data in general.
To show the effect of using ChIP-Seq information on improving the performance of CRM detection, we relied on publicly available ChIP-Seq experiments conducted by Chen et al. (25). The data consist of ChIP-Seq experiments for five key TFs involved in self-renewal of mouse embryonic stem cells, namely KLF4, NANOG, OCT4, SOX2 and STAT3 for which it is known that combinatorial interactions exist amongst at least some of these five TFs. These previously known interactions, corresponding to nine different CRMs were used as a benchmark (Figure 3). Starting from the data of a ChIP-Seq experiment of a single assayed TF, we used CRM detection to discover, in silico, the other TFs with which the assayed TF constitutes a CRM. We then tested to what extent we could recover the previously described benchmark CRMs using either a query-based or non-query-based setting. The non-query-based setting mimics the traditional way in which CRM detection is being performed, that is trying to prioritize a CRM that is enriched in a set of sequences without using any further prior information. In the query-based setting, only CRMs that contain a motif for the assayed TF are searched for. Prior to the CRM detection, we first optimized screening thresholds to reduce the effect of the screening on the success of the CRM detection.
We started from the top 100 binding peaks identified for each ChIP-Seq-assayed TF (assuming that those represent the most reliable binding sites). It was recently shown that the sites of the assayed TF do not exactly coincide with their binding peaks, but can be located as far as 250bp from the actual peak (42,43). Therefore, we used a sequence region of 500bp centered around each of the top 100 peaks as input sequences.
Ideally, the result of the screening should have a high sensitivity, while containing only few false positive motif sites. Using a stringent threshold would bias towards only finding the most ‘conserved motif sites’. However, true sites do not necessarily correspond to the most conserved ones (44,45). Using a low stringency filtering might contrarily result in the inclusion of too many false positives, possibly deteriorating the CRM detection. Therefore, we used in addition to a screening with either a high or a low stringency threshold, also a low stringency screening combined with a filtering based on nucleosome positioning as nucleosome positioning plays a role in determining the accessibility of a site (39). As the information on condition- and tissue-dependent nucleosome occupancy is not readily available, we relied on the NuOS which has also previously been used in the context of motif detection (30) (see ‘Materials and Methods’ section). Motif sites located in regions that show a high NuOS are considered to be transcriptionally inactive (46) and were therefore filtered out.
One advantage of starting from ChIP-Seq based information is that it allows to approximate, at least for the assayed TF, the effect of the screening/filtering on the recovery rate of its binding sites. This effect on recovering the binding sites of the assayed TF can be seen as representative for the effect of the screening/filtering on recovering sites of any other TF. In Figure 4, we display for each input set (sequences corresponding to the 100 binding regions of each assayed TF) the sensitivity in recovering binding sites of the assayed TF after applying different screening/filtering procedures. The sensitivity is expressed as the percentage of the input set in which a motif site of the assayed TF could be detected. A sensitivity of 100% thus corresponds to retrieving at least one motif site for the assayed TF in each of the 100 high scoring binding regions. As can be expected, a stringent screening results in a rather low sensitivity for most of the binding regions of the assayed TFs (sensitivity <50%). Lowering the screening stringency largely increases this sensitivity. At least 80% of the binding peaks for respectively KLF4 (84%), NANOG (80%), OCT4 (98%), SOX2 (95%) and STAT3 (100%) were found to contain a motif site for their respective TFs. However, this increased sensitivity comes at the expense of also predicting many more potentially false positive sites. The number of false positives that result from a screening/filtering procedure is harder to estimate as we have no clue about the identity or location of the true sites. Therefore, we estimated the false discovery rate by the average number of motif sites per TF and per sequence region retained after applying different screening procedures (Figure 4B). The average number of sites per screened TF should be sufficiently low. Note that this average number does not exclude that some TFs can have multiple motif sites in the same sequence while others might have none. Applying a low stringency screening resulted in each of the analyzed data sets in the detection of, on average, four motif sites per TF and per sequence (Figure 4B). This was much lower in case stringent screening was used.
Combining the non-stringent screening with a filtering based on nucleosome occupancy (that is removing sites predicted to be located in nucleosome occupied regions) seems to offer a good trade-off between the sensitivity and the number of false positive sites as is shown in Figure 4. Compared to stringent screening, applying the filtering after a non-stringent screening largely increases the sensitivity for most of the assayed TFs, while still maintaining the number of predicted sites per TF within a reasonable range. In the remainder of the analysis, filtering was applied on the low stringency predicted sites of all TFs except for those of the assayed one. For the assayed TF, filtering was omitted as the ChIP-derived evidence experimentally supports that each ChIP-bound region contains binding sites of that TF.
CPModule was run using for each assayed TF the sequences of the top 100 peak regions, as well as the motif sites identified by the screening and filtering. For the proximity threshold, we started for each data set from 150bp and step wisely (50bp at the time) extended this value to maximally 400bp. The value of the frequency threshold was set to 60%. Predicted CRMs were ranked according to their P-values. The higher the rank of a benchmark CRM (that is a previously known CRM, see Figure 3), the better the algorithm was able to prioritize the CRM amongst the total number of predicted CRMs.
The results obtained by running CPModule after screening with a stringent threshold (for all TFs except the assayed one) (Supplementary Table S2) shows that a stringent screening threshold lowers the sensitivity of retrieving true binding sites to such extent that none of the benchmark CRMs can be retrieved at the predefined frequency threshold of 60%. Only by subsequently lowering the frequency threshold to 50% allowed recovering a benchmark CRM (1 out of the 9). To calculate benchmark recovery we considered all solutions obtained with all possible proximity thresholds irrespective of their rank. Without filtering, the number of binding sites per motif and sequence became so high that the search for CRMs was computationally prohibitive.
The most informative results and highest coverage of benchmark CRMs was obtained using CPModule with non-stringent screening and filtering based on the NuOS. Seven of nine of the previously described CRMs involving KLF4, NANOG, OCT4, SOX2 or STAT3.
Table 2 shows for each of the assayed TFs and for each proximity threshold, the highest ranking recovered benchmark CRM together with its rank amongst the total number of solutions. In addition to their rank we show for each CRM its support as an additional quality criterion, indicating how many of the 100 given peak regions contained the predicted CRMs (the higher this value, the more specific the detected CRM is for the given dataset).
For all data sets, at the one but lowest proximity (200bp) most of the benchmark CRMs were retrieved (6 of the 9 benchmark CRMs). Further increasing the proximity threshold results in the same benchmark CRMs also obtained with a lower threshold, albeit most of the time at a lower rank and/or in combination with other motifs. Increasing the proximity threshold will increase the number of valid CRMs. For NANOG, one additional benchmark CRMs were retrieved at a higher proximity threshold than the one for which the first CRM containing the assayed TF was found. Most of the TFs involved in the benchmark interactions, therefore, have binding sites in a rather close proximity on the genome.
Predicted CRMs were also validated using the available ChIP-Seq experimental information: if a previously described interaction between the analyzed TF and any of the other four benchmark TFs was predicted by CPModule, the ChIP-Seq data of the other TFs were used to experimentally verify whether their predicted sites in the retrieved CRMs coincided with their binding peaks. Predicted sites of TFs for which experimental data was available, overlapped for at least 10% (and often more) with their corresponding binding peaks (Column ‘Validation’ in Table 2). This indicates that most of the benchmark CRMs predicted by CPModule reflect true CRM signals present in the ChIP-Seq data.
The added value of using ChIP-Seq derived information becomes obvious when comparing the rank obtained for each benchmark CRM in the ‘non-query-based’ setting with the one obtained using the ‘query-based’ setting. The first setting mimics a classical CRM detection set up, i.e. when searching for CRMs that are statistically overrepresented in a set of coregulated sequences. The query-based setting differs from the classical setting by only listing CRMs that contain a site for the ChIP-assayed TF. Table 2 shows that in the query-based setting, the rank of the benchmark CRMs is in many cases much better than the rank in the non-query-based setting. This is especially true for KLF4, NANOG, SOX2 and STAT3. For OCT4, the ranks are more similar for both settings, but in the query-based setting much less candidate CRMs are returned, and hence the computation is much more efficient. Our results show that exploiting ChIP-Seq information, by constraining the candidate CRMs to those that only contain the assayed TF, not only leads to a compact result set with in many cases a better ranking of the true CRMs, but more importantly they also show that in the absence of such information, CRM detection becomes almost infeasible.
Besides the results on the benchmark set, we also displayed for all assayed TFs their top three ranking CRMs predicted by CPModule, for the different proximity thresholds (Supplementary Table S3). Note that those CRMs score in most cases much better than the benchmark CRMs.
Functional analysis (Ingenuity Pathway analysis and literature-based analysis) of the 56 TFs that were involved in the predicted CRMs of respectively KLF4, NANOG, OCT4, SOX2 or STAT3, showed that most the TFs have functions related to development (embryonic development, cellular development, tissue development, organ development, organismal development), cellular growth and proliferation, cancer and tissue morphology all functionalities that could be related to ESC cell growth, death and differentiation (see Supplementary File S3).
For a handful of the predicted CRMs (KLF4-STAT4; OCT4-CDXA; OCT4-PAX2; OCT4-STAT; OCT4-SRY; SOX2-OCT4), it was previously proven that their contributing TFs were involved in combinatorial regulation (see Supplementary Table S3 for a list of references).
For most of the other CRMs we could find indirect literature-based support, suggesting the plausibility of the predicted interactions, for instance for NANOG-FAC1, the putative transcriptional regulator FAC1 has shown to be expressed in embryonic and extra-embryonic tissues of the early mouse conceptus and was shown to be essential for trophoblast differentiation during early mouse development (47), making an interaction between NANOG and FAC1 plausible. Other examples are described in Table 3.
In this work, we developed CPModule, a novel approach for CRM detection with a performance that is competitive to that of other state-of-art tools, while being able to handle larger data sets (such as 100 sequences in combination with a library of 517 PWMs). The advantage of CPModule is that it builds upon a constraint programming for itemset mining framework (24). This approach provides fast search techniques similar to those in itemset mining, while allowing to freely impose additional constraints on the CRMs. These constraints such as the proximity and redundancy constraint can help prioritizing likely CRMs. At this point our system focuses on finding loosely structured CRMs that satisfy multiple constraints, such as a minimum frequency constraint or a constraint on the maximum distance between binding sites. It can easily be extended to find unstructured CRMs under additional constraints. The discovery of structured CRMs, similar to the approaches proposed by Noto and Craven (2006) and by Cartharius et al. (8) in FrameWorker, is outside the scope of the current approach.
The flexible framework also allows us to use CPModule in a query-based setting when dealing with ChIP-Seq derived data, that is, we can search only for CRMs that contain the assayed TF. Because of its enumerative approach, CPModule outputs all valid CRMs as an ordered list rather than returning one CRM as being the most significant CRM. Having an idea of the rank of a true CRM amongst the total number of valid CRMs gives an intuition on the difficulty of computationally retrieving a true CRM in a particular data set. We used this property to assess the contribution of ChIP-Seq data in its ability to prioritize true CRMs. Our results on real datasets showed that in the absence of ChIP-Seq based information, biologically relevant CRM detection is almost infeasible.
The success of CRM detection also depends on the quality of the input data, which is the set of motif sites predicted by screening using motif models such as PWMs. A too dense collection of motif sites (hits) obtained by a non-stringent screening threshold usually results in too many motif combinations, which either make the problem intractable or lower the quality of the outcome; more false positive hits in the screening will also result in the detection of more spurious CRMs. Just increasing the stringency of the screening seems not to be the best option as many true sites, and thus also true CRMs, appear to be missing. Using a lower screening threshold in combination with a filtering procedure based on nucleosome occupancy provided in our case a good trade-off between keeping the number of false positives in a reasonable range and recovering true sites. We applied this filtering to sites of all TFs other than those of the assayed TF. By increasing the recovery rate of sites of the assayed TF, we maximize the chance of finding CRMs and instances of CRMs that contain the assayed TF, as those are the ones we are primarily interested in. Using a more stringent filtering for all other TFs will help reducing the spurious CRMs, but might come at the expense of not being able to detect some of the true interactions between the assayed TF and other TFs in TRANSFAC (which might explain why we couldn’t recover all previously described benchmark CRMs). With more experimental data on condition-dependent nucleosome occupancy and other cell specific features becoming available, filtering will become more reliable and will surely further improve the success of combinatorial CRM detection.
KULeuven Research Council (GOA/08/011, - SymBioSys -CoE EF/05/007, NATAR C1895-PF/10/10); the agency for Innovation by Science and Technology (SBO-BioFrame); Interuniversity Attraction Poles (P6/25-BioMaGNet); Research Foundation - Flanders (IOK-B9725-G.0329.09); the Human Frontier Science Program (RGY0079/2007C); Ghent University (Multidisciplinary Research Partnership ‘M2N’); Institute for the Promotion and Innovation through Science and Technology in Flanders (IWT-Vlaanderen); A Postdoc grant from the Research Foundation-Flanders. Funding for open access charge: KU Leuven Research Council (SymBioSys, CoE EF/05/007).
Conflict of interest statement. None declared.
The authors thank Luc De Raedt for the invaluable discussions and his vision on using constraint programming for itemset mining, which enabled the development of the CPModule system. The authors also thank the anonymous reviewers for many useful comments.