Gene regulatory networks of TFs and their TGs play key roles in development, homeostasis and behavior. The relationship between a TF and its TGs is achieved through “DNA words”, usually 4–12 bp long, acting as binding sites for the TF in question. A collection of such motifs regulating a specific aspect of the expression of a gene defines a CRM. Therefore, to understand network organization, we need to understand the organizational logic of CRMs. Since most networks mediating a specific biological function consist of multiple TFs and their- sometimes overlapping- TGs, the capacity to detect TGs genome-wide and in silico with high accuracy would bring major advantages. In this respect, regulatory bioinformatics faces a few challenges. First, the organizational logic(s) of CRMs is not clear. Do most TFs regulate their TGs via multiple or single TFBSs? How conserved do such TFBSs need to be at the sequence level to be detectable? Can the contextual information around a given TFBS in a CRM be harnessed to improve TG detection? Does the availability of multiple related genomes foster or complicate CRM detection? And finally, do different TFs follow different logics? Developing and assessing methods that can address these issues is clearly a major goal of regulatory bioinformatics.
To begin to address these issues we took advantage of the availability of 12 Drosophila genomes and the Drosophila database of TF-TG relations (FlyReg) to perform a benchmark study on genome-wide in silico TG prediction. Drosophila is ideally suited for such an approach not only because multiple genomes are now available, but also because a few regulatory networks, such as the segmentation network, have been well-established in vivo resulting in several deeply understood TF-TG relations. One challenging problem of target gene prediction is the mapping of an in silico predicted regulatory locus to the gene it potentially regulates. Indeed, several known enhancers are located one or several genes upstream of the actual target gene, for example the cluster of dorsal binding sites that regulates zen is located directly upstream of CG1162. To circumvent this problem in the benchmark analysis and to make direct comparisons at the sequence level possible, we added isolated regulatory regions to sets of negative sequences. This setup has the additional advantage that it requires less computational resources so that more parameters can be varied. We chose the same region size for each enhancer and each transcription factor to exclude confounding effects of the size when comparing across factors (Cluster-Buster can generate higher scores for larger regions). For the cross-validations we arbitrarily set this size to 1 kb. Such a choice is justified because the cross-validations are intended to compare relative performances rather than absolute performances. When a cross-validation procedure is applied on a single factor under study, the region size could be considered as an additional parameter. For the genome-wide scoring, the size of the motif clusters is determined by Cluster-Buster.
Cross-validation tests and subsequent genome-wide TG predictions result in both higher average performances across a data set of more than 30 TFs (of a total of approximately 700 TFs in the
Dmel genome
[44]) as well as determination of the optimal parameters for each of the individual TFs. This is of particular value for molecular geneticists who are likely to be interested in one or a few TFs within a network and for whom average performance across a large dataset is not particularly useful. The difference between the two is highlighted by our finding that two different performance parameters result in highly similar average performance, but radically different single TF performance profiles. As a result of these advantages the number of factors for which TG detection becomes highly performant (AUC values above 0.90) is increased from 5/34 to 19/34.
We attempted to address some of the questions facing regulatory bioinformatics. The major conclusions from our work are as follows. First, there is unlikely to be a single unifying CRM logic, at least at current levels of genome annotation resolution. We find that whereas some TFs perform optimally with single TFBS parameters (data not shown), others use clusters of homotypic TFBSs and still others use heterotypic TFBS clusters. Thus, a methodology that can determine the optimal approach per TF a priori is necessary for successful single TF based TG predictions. Second, the availability of multiple genomes is, in general, extremely useful for genome-wide TG prediction. One exception to this rule is that, contrary to conventional wisdom and several previous reports, additional genomes do not always result in better PWM building. The reason for this could be that PWMs that are built from sufficiently distinct binding sites (e.g., more than 20) already possess enough variation and do not benefit from the addition of more sites. A small amount of PWMs with very few, but highly conserved sites (e.g., HLHm5) do benefit from a phylogenetic extension.
The availability of multiple genomes becomes especially useful in two other ways. First, it improves the training of a CRM model from a set of known target regions, to discover sites that are both conserved and shared across this set. We have assessed whether such
de novo discovered motifs could contribute to a better enhancer model for the TF. Second, it improves the scoring of a CRM model by taking advantage of functional enhancer conservation or TFBS conservation. We found no obvious explanation (e.g., in terms of correlations with TF families, with the conservation of known sites, or with the size of the TG set) why some TFs perform better with network-level conservation and others with motif conservation. When more
cis-regulatory data becomes available as validation sets, for example through open community based annotation
[45], a deeper investigation of this issue may become feasible.
Although several of the tested variables, most importantly the integration of multiple genomes, can result in significantly enhanced TG prediction accuracy, more work is needed to improve on this performance because only a portion of the true target enhancers could be detected. Again, it can be expected that performances will increase further, when more knowledge about regulatory regions emerges. For example, King et al. found recently that in vertebrates some regulatory regions correlate well with phastCons conservation scores (used for our motif conservation), while other regulatory regions correlate better with alignment-based scores that are corrected for background neutral substitution rates
[46]. However, even with more advanced interspecies comparisons, on a genomic scale the true positive TF-TG interactions are spread out across many other high-scoring interactions. At present it is difficult to determine for a certain TF whether the other high scoring genes are also
bona fide TGs, false positive predictions, or- most likely- a mix of both. Several ways can be envisioned to further improve the performance. For example, enhancer predictions can be combined with other data types to filter the ranked enhancers, such as GO terms, as we have shown for the TFs in , and/or large scale expression data, as we tested for the eye determination TF Ey. The benchmark dataset can be used in the future to evaluate novel methods for
de novo motif discovery, enhancer prediction, or target gene prioritization.
In summary, we have tested several strategies and parameters for the computational prediction of TF-TG relations through TFBS and CRM detection. The selection of the best strategy for each individual TF, combined with the extensive use of multiple genomes during both the training and scoring of enhancer models results in a significant step forward in the bioinformatics to solving the architecture of gene regulatory networks.