Over the last 15 years, many bioinformatics methods and tools have been developed for cis-regulatory sequence analysis (64). Broadly, they can be divided into two categories. The first category comprises methods for motif discovery on a set of co-regulated sequences, such as MEME-like approaches (dozens of methods and extensions exist). The second category comprises methods for CRM prediction through whole-genome scanning using one or more known motifs as input, often using Hidden Markov Models and sequence conservation cues [see (65) for a review]. A few methods, such as phylCRM/Lever, ModuleMiner and cisTargetX, combine both approaches and show increased motif discovery performance, even when very large upstream regions and introns are included in the analysis (28). The concept of these integrative methods is to apply genome-wide CRM scoring, including comparative genomics cues, for many different models (e.g. PWMs), followed by the identification of those particular models that yield the highest accuracy on a set of co-expressed genes. In this work, we have introduced three important novelties into a new method called i-cisTarget. The first is the a priori determination of the 136K regions to be scored, which increases flexibility. In particular, this partitioning of the genome allows the analysis of both sets of genomic loci (by selecting all of the 136K regions that overlap these loci) and co-expressed gene sets (by selecting all of the 136K regions that fall in the upstream and intronic space of the genes in the set). In this study, we obtained good results with a genome segmentation based on sequence conservation (phastCons) combined with insulator sites, and excluding coding exons. However, we envision that the genome segmentation can be improved, for example by including coding exons (66) or by using a segmentation that is guided by the high-throughput data sets (i.e. the iVEs) themselves. The latter may become practical as more and more data sets with overlapping results are generated, which may ultimately converge to a defined set of regulatory regions. The second novelty is the generalization of regulatory feature discovery, with the possibility to identify not only enriched motifs (as PWMs) but also enriched iVEs such as ChIP peaks and active/repressive chromatin marks. The third novelty is the ability to combine any regulatory features, even across different feature types (e.g. a motif with a ChIP or DHS feature).
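The selection of predefined regions that overlap a set of input loci can be illustrated with a minimal sketch. The function name, the tuple layout and the 0-based half-open coordinate convention are assumptions made for illustration; they are not taken from the i-cisTarget implementation, and the quadratic scan is chosen for readability rather than speed.

```python
from collections import defaultdict


def select_overlapping_regions(regions, loci):
    """Return the IDs of predefined regions that overlap at least one locus.

    regions: iterable of (region_id, chrom, start, end)
    loci:    iterable of (chrom, start, end)
    Coordinates are assumed to be 0-based and half-open: [start, end).
    """
    # Group the input loci by chromosome so each region is only compared
    # with loci on its own chromosome.
    loci_by_chrom = defaultdict(list)
    for chrom, start, end in loci:
        loci_by_chrom[chrom].append((start, end))

    selected = []
    for region_id, chrom, r_start, r_end in regions:
        # Two half-open intervals overlap when each starts before the other ends.
        if any(l_start < r_end and r_start < l_end
               for l_start, l_end in loci_by_chrom.get(chrom, ())):
            selected.append(region_id)
    return selected


# Hypothetical toy data: three candidate regions and one ChIP peak.
toy_regions = [
    ("r1", "chr2L", 0, 100),
    ("r2", "chr2L", 100, 200),
    ("r3", "chr3R", 0, 100),
]
toy_peaks = [("chr2L", 50, 120)]
overlapping = select_overlapping_regions(toy_regions, toy_peaks)
```

For a genome-wide partition, an interval tree or a sorted sweep would be a more practical choice than the pairwise comparison used here.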
Taken together, these features allow the analysis of most kinds of high-throughput data available in Drosophila, and the combination of several analyses of different datasets within the same tool. For example, it is possible to combine the analysis of binding location data for a particular factor (ChIP) with the analysis of the corresponding expression data in mutant conditions for this factor, as we have shown for MEF2 (57) and Zelda (48).
We have applied our tools to various datasets, distinguishing gene sets from sets of genomic loci. For gene sets, we have shown that i-cisTarget identifies the enrichment of the correct motif in most gene sets we investigated; failures to do so might be explained by the specificity of the binding motif to certain conditions or tissues. Enriched iVEs can lead to interesting new hypotheses, such as the co-operation between daughterless, inferred from the PNC set analysis, that resembles the recent discovery of Smad co-operation with master regulators (53); or the prediction of new TF-target and TF-TF interactions across cell types in Drosophila, as was demonstrated for Kenyon cells, pericardial cells and cardioblasts. Moreover, the discovered motifs lead to CRM predictions in the 5 first intron of the input genes that are highly specific for regulatory regions, as was demonstrated on the zelda LOF dataset (56) and the PNC dataset (51). A current limitation of i-cisTarget, when analysing gene sets, is the arbitrary assignment of genomic regions to the gene set. Multiple demarcations are available in the i-cisTarget web tool, for example [5-kb upstream limited to upstream gene, 5′-UTR, and first intron] or [10-kb upstream limited, 5′-UTR, all introns, 3′-UTR and 10-kb downstream limited to downstream gene] (see ‘Materials and Methods’ section). Identifying very distal enhancers and enhancers overlapping the coding sequence of nearby genes (66) remains a future challenge. Simply extending the search space, by including more sequence as well as intronic and exonic sequences from neighbouring genes, will not solve the problem. Indeed, when applying i-cisTarget to 100 kb of sequence upstream and downstream of the TSS (this search space includes 100% of REDfly CRMs), without truncating this sequence at neighbouring genes, the performance drops dramatically (see Supplementary Figure S2).
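A demarcation such as "5-kb upstream limited to upstream gene" can be sketched as follows. This is only an illustration under stated assumptions: the function name, the `neighbour_boundary` parameter and the 0-based half-open coordinates are hypothetical and do not describe how the i-cisTarget web tool computes its search space.

```python
def upstream_window(tss, strand, neighbour_boundary=None, max_len=5000):
    """Return the upstream search space of a gene as a half-open (start, end).

    tss:                0-based position of the transcription start site
    strand:             "+" or "-"
    neighbour_boundary: for "+" genes, the end coordinate of the nearest
                        upstream gene; for "-" genes, the start coordinate
                        of the nearest upstream gene; None if unknown
    max_len:            maximal window length in bp (5 kb here)
    """
    if strand == "+":
        # Upstream lies at lower coordinates: [tss - max_len, tss).
        start = max(0, tss - max_len)
        if neighbour_boundary is not None:
            start = max(start, neighbour_boundary)
        # Clamp so that an overlapping neighbour yields an empty window.
        return (min(start, tss), tss)
    if strand == "-":
        # Upstream lies at higher coordinates: [tss + 1, tss + 1 + max_len).
        start = tss + 1
        end = start + max_len
        if neighbour_boundary is not None:
            end = min(end, neighbour_boundary)
        return (start, max(end, start))
    raise ValueError("strand must be '+' or '-'")


# Hypothetical gene on the plus strand with a neighbour ending 2 kb upstream.
window = upstream_window(10000, "+", neighbour_boundary=8000)
```

Truncating at the neighbouring gene keeps the window from absorbing sequence that is more plausibly assigned to another gene, which is the behaviour the untruncated 100-kb search space lacks.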
We also used several ChIP datasets to investigate the performance of i-cisTarget on sets of genomic loci. Here, as for the gene sets, i-cisTarget performs very well in recovering the expected motif from a comprehensive library of motifs, but also highlights the involvement of other factors, such as Zelda or Trl in embryonic datasets. While motif discovery or enrichment is also performed by several other tools (45), i-cisTarget adds the possibility to search for additional iVEs. We have shown that a TF-binding site (TFBS) does not necessarily correspond to a binding event. While potential binding sites for HSF or MEF2 cannot be distinguished from actual binding events based on motif enrichment alone, adding iVEs clearly selects marks typical of active chromatin as the best discriminant between bound and unbound sites. We emphasize that this result is obtained ab initio, without any prior knowledge of which iVEs are relevant. Hence, additional signals are needed for a TF to bind to a motif sequence, and these are often related to marks of open or active chromatin: DNase hypersensitive sites, or binding of pioneering factors such as Trl or Zelda, whose role as a general precursor of chromatin opening has only very recently been hypothesized (48). Interestingly, while in both the HSF and MEF2 cases the bound motifs are enriched for active features (GAF/Trl, CBP/p300 or DHS), the pattern of enriched features for unbound motifs is quite different. Namely, the unbound MEF2 motifs are enriched for repressive chromatin marks [Su(Hw) or heterochromatin-like features], while the unbound HSF motifs do not present any of these marks, consistent with what was reported by Guertin et al. This might suggest a distinct mechanism of negative regulation through chromatin conformation between developmental processes and stress response pathways.
A feature of our approach that is not found in alternative approaches is the ability to easily combine any number of features to investigate their synergistic effect. Because the scoring is based on ranks, using OS allows an ‘on-the-fly’ re-ranking of the 136K regions for particular combinations. We showed on the PNC and zelda gene sets that combinations of a PWM and an iVE yield higher 1%-AUCs, meaning a much higher specificity in the high-ranking regions. This last result shows that transcriptional regulation is not a linear process, in the sense that the contribution of a combination of regulatory features is more than the sum of the individual contributions, revealing a synergistic mechanism of action. Moreover, the fact that many different regulatory features are found enriched in the datasets we have studied confirms that transcriptional regulation is intrinsically a highly combinatorial process.
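The rank-based re-ranking can be sketched as follows, assuming that OS refers to order statistics and that per-feature ranks are combined with the recursive Q statistic commonly used for rank integration. Whether i-cisTarget uses exactly this formulation is not stated in this section, and the function names, data layout and toy ranks are hypothetical.

```python
from math import factorial


def order_statistic_q(rank_ratios):
    """Combine rank ratios (rank / number of regions) into one Q statistic.

    Uses the recursion V_0 = 1,
    V_k = sum_{i=1..k} (-1)^(i-1) * V_(k-i) / i! * r_(N-k+1)^i,
    Q = N! * V_N, with the ratios sorted in increasing order.
    Smaller Q means the region ranks consistently high across features.
    """
    r = sorted(rank_ratios)
    n = len(r)
    v = [1.0]  # v[k] holds V_k
    for k in range(1, n + 1):
        x = r[n - k]  # r_(N-k+1) in 1-based notation
        v.append(sum((-1) ** (i - 1) * v[k - i] / factorial(i) * x ** i
                     for i in range(1, k + 1)))
    return factorial(n) * v[n]


def rerank_by_combination(feature_ranks, features):
    """Re-rank regions for a chosen combination of features.

    feature_ranks: dict feature name -> dict region ID -> rank (1 = best)
    features:      names of the features to combine
    Returns region IDs ordered from best to worst combined score.
    """
    n_regions = len(feature_ranks[features[0]])
    scores = {
        region: order_statistic_q(
            [feature_ranks[f][region] / n_regions for f in features])
        for region in feature_ranks[features[0]]
    }
    return sorted(scores, key=scores.get)


# Hypothetical ranks of four regions for one motif and one ChIP track.
toy_ranks = {
    "motif": {"a": 1, "b": 3, "c": 2, "d": 4},
    "chip": {"a": 2, "b": 1, "c": 4, "d": 3},
}
combined_order = rerank_by_combination(toy_ranks, ["motif", "chip"])
```

Because only precomputed ranks are needed, any subset of features can be combined without rescoring the underlying sequences, which is what makes re-ranking "on the fly" feasible.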
These two aspects (combinations and synergy) have been extensively described in the context of the enhanceosome model of regulation (68). In particular, in Drosophila, analysis of a collection of curated CRMs showed that they are typically characterized by a combination of different TFBSs (70). This heterotypic model has been shown to be the general rule, while homotypic CRMs are generally restricted to early embryogenesis (71).
However, these descriptions focused on combinatorial regulation by TFs alone. Here, we have confirmed recent evidence that this combinatorial regulation extends to other kinds of regulatory features such as histone modifications, binding of chromatin-modifying proteins or transcriptional co-factors such as CBP. Hence, we propose that the notion of a heterotypic model of regulation should be extended to describe any combination of regulatory features, including motifs and chromatin-related features. Similarly to the CRM finding procedure that consists of finding clusters of TFBSs for different TFs (26), we introduce the search for ‘clusters’ of regulatory features and show that it can improve the predictive power of regulatory sequence analysis.
While our method currently applies to Drosophila, it can in principle be extended to any other organism for which large-scale collections of in vivo datasets are available, and in particular to human. The much greater size of non-coding regions in human, and the lower proportion of functional DNA in the human genome (72), would however require pre-selecting candidate regulatory regions, as using a full partition of the complete non-coding genome would become computationally intractable and would introduce too much noise. We are currently working on implementing i-cisTarget for human, using the collection of ENCODE datasets.