We propose a supplementary approach for TFBS prediction, entitled SiteGA. It is based on the detection of locally positioned dinucleotides, identified from known sites using a GA with discriminant analysis. We have previously applied this combination for computer analysis of nucleosome formation potentials, RECON [
43,
44], which has been extensively validated and used to investigate a variety of genomic locations [
45-
50]. The approach has also proved for the
Drosophila melanogaster promoter recognition [
32]. The techniques for modelling dependencies between distant positions has been successfully used for the recognition of splicing sites [
51-
53], recombination sites [
54] and genes [
55].
Recently [
33], we introduced the SiteGA method as one among other alternatives to traditional PWM approach. Now we propose a far more comprehensive study of TFBS recognition, which includes a range of modifications/improvements to the SiteGA method and web tool itself: (a) one window system for training instead three-window system was used (in [
33] – whole window was divided into three overlapping regions of equal length of 25–35 nt (center, left flank, right flank); then, a set of locally positioned dinucleotides was sought for each of the regions separately, finally three recognition function combined); (b) three new types of recombination in genetic algorithm were applied and described (section Method, SiteGA, genetic algorithm), in [
33] only one operator was used and described. Moreover, in comparison with the previous paper [
33], the method is more carefully presented here, with more details of the implementation and algorithm. The careful comparison included jackknife (leave-one-out cross-validations) and bootstrap tests for accuracy estimation. These tests were done for all the SiteGA and optimized PWMs recognition models (9 TFBSs). Among other conclusion from these tests, we demonstrated that PWMs may be significantly improved by:
a) dinucleotide statistics (in contrast to mononucleotide statistics, that usually applied; and
b) exhaustive search among different length and location of PWM window.
Chen
et al. [
53] have studied selecting a window size for the acceptor and donor splice site sequence. They suggested an optimal length for the donor and acceptor splice site, i.e. a window from 9 bases upstream to 9 bases downstream for the donor splice site, and a window from 27 bases upstream to 9 bases downstream of acceptor splice site. Thus, a proper window size is among the most important factors for performance improvement.
A similar effect of window size (motif width) on the accuracy was investigated in the comparison of five motif discovery algorithms by Hu
et al. [
56]. From this comparison we may conclude that very short motif width showed the worst results. Finally, they suggested running algorithms multiple times with different motif widths to get the best result. Thus, we followed this advice and performed exhaustive searches of window size for all 9 TFBSs. Though, we should notice that motif discovery is not the same as site recognition, but intrinsically the approaches have many similar features.
We applied the algorithms of Berg and von Hippel's [
2], MATCH [
57] and log-odds [
4,
58], as well as our own variant of the latter and compared these. For many types of TFBSs, we have shown that our matrices perform better (see Figure , Table , Table )
Nine TFBSs have been investigated in this study. E2F is a key regulator of cell cycle. The good recognition performance achieved for this TF (Figure ) may be considered as a consequence of it participation in composite elements and a number of additional context and structural features in the flanking regions [
18]. TFs ISGF3 and STAT1 are strongly inducible TFs [
59]. ISGF3 is activated by interferons type I. STAT1 may be activated by all interferons and some other cytokines. The ISGF3 and STAT1 enhance transcription of many interferon-inducible genes at early stage of induction (1–2 hours), whereas IRF1 ensures enhanced transcription of many interferon-inducible genes for a long time in infected cells [
60,
61]. SF-1 is a key regulator of the steroidogenic genes expression in gonads and adrenals [
62]. Moreover, SF-1 is required for development and differentiation at all the levels of the hypothalamic-pituitary-gonadal and adrenal axis [
62]. There is experimental evidence for the presence of the SF-1 BSs in the regulatory regions of genes functioning within this axis [
40]. NF-κB is a factor involved in regulation of many types of genes, being induced by cytokines, growth factors, and some other stimuli. NF-κB is involved not only in regulation of the immune response, but also of many other processes [
41]. Nevertheless, the significance of interactions between distant positions and their competence for recognition improvement was already confirmed for NF-κB BSs [
63]. The BSs of PPAR were already found earlier as good examples of interactions between distant positions [
22]. Markov model application for large HNF4α set (71 sites) revealed many dependent non-adjacent positions [
64]. Also this discovery confirmed the importance of a large dataset for performance improvement.
There is ample evidence to suggest that the duplex-DNA quaternary structures in the TF-DNA complex and site flanking regions are the main factors that explain the observed differences between accuracy estimates of traditional PWMs and other models which consider distant interactions. Taking into account the relations between TFBSs recognition accuracy, potential involvement of TFs in specific regulation or in processes in wide range of systems and tissues we may suppose that accuracy estimates reflected the hidden context pattern of TFBSs. Most probably duplex-DNA quaternary structures, which we have here interpreted as a set of mutually adjusted locally positioned dinucleotides, may be more strictly predefined if we find a stronger pattern of these context features.
In the current work, the sophisticated method SiteGA was developed for TFBS recognition. In order to evaluate the efficiency of our approach, we developed special techniques for traditional PWMs optimization. Namely, the best of mononucleotide or dinucleotide PWMs and optimal window lengths chosen using jackknife tests were implemented for 9 types of TFBSs. We revealed that for six TFBSs (E2F, ISGF3, IRF1, NF-κB, PPAR and SF-1) performances are better and optimal lengths are longer (Figure ) than for HNF4, SREBP and STAT1. Maybe each of those six TFBSs has a stronger context pattern or they have a more stable set of general co-factors. The latter case may be for example if a quantity of genes, subjected to TF-specific regulation may be roughly functionally restricted.
In comparison with other well-known approaches for weights calculation [
2,
4,
61,
62], the new formula developed here (equation (1d), section Method, PWMs) performed on a par if not better than the best of the others. We further optimized our PWMs through the adjustment of lengths by jackknife tests, which were mainly based on dinucleotide statistics (Table ). Thus we have captured the longest lengths (17–38 nt, Table ) than traditional widely used PWMs (~10–15 nt) [
2,
4,
58] and our PWMs have shown very substantial performance improvement (Figure ).
Performance estimates for all PWM models did not notably depend on the exact type of resampling tests (Figure , jackknife or bootstrap). The same was observed in almost all cases for SiteGA models. The possible exclusions are SiteGA models for IRF1 and PPAR BSs (Figure and ). At least for the former, this may be related with small dataset size (28, Table ). For the latter, this effect is not so notable. In all other cases the differences between jackknife or bootstrap tests for SiteGA models were not observed. Additional sites for training cause the differences between jackknife and bootstrap tests, i.e. this may be interpreted not only as a result of substantially small, but rather not sufficient data. SiteGA algorithm in contrast to PWM's is essentially stochastic, since SiteGA as all other GAs do not guarantee the best solution. Since SiteGA accuracies did not notably depend on the type of resampling tests we may conclude that we achieved sufficient stability for SiteGA algorithm convergence.
Generally, based on 9 types of TFBSs, optimized PWM and SiteGA have shown similar performances (Figure ). By taking into account fuzzy local positioning of the dinucleotide context, one can possibly achieve considerable increase in the recognition accuracy when compared to that for PWMs. Recall that PWM cannot be quite correct since it based on the assumption of additive contribution of site positions to the total score [
5-
8,
20-
25]. Thereby account of dependant contributions of different positions may help to improve the recognition accuracy.
For E2F, PPAR, NF-κB, HNF4 BSs we found other evidence for distant structural interactions [
18,
22,
63,
64]. For first three of them SiteGA successfully competed with performance of PWMs (Figure ). The worse results for HNF4 (Figure ) may be related to a small dataset size (we used only 29 BSs, against 71 declared in [
64]), so even with PWMs we were able to catch the shortest length (Figure ). Also we found SF-1 as a good example in SiteGA comparison with PWMs (Figure ).
Worse results achieved for other TFBSs can probably be attributed to two reasons. The first is a small and heterogenous dataset (this refers to ISGF3 and IRF1, Figure and ). We may suspect that in terms of distant interactions these sets were not representative. The solution space of any complicated model (like SiteGA) is incommensurably greater than that for simple (PWM) models. Hence only a statistically large enough dataset size may allow a complicated model to realise its potential advantages. Whatever the case, the representative data are the result of a compromise between their heterogeneity and their total amount. The ISGF3 and IRF1 are very divergent among other training sets: firstly they are the smallest sets, secondly the total count of dinucleotides exploited by their SiteGA models are very small in comparison with all other SiteGA models (Table , last column).
The second reason for worse accuracy of SiteGA in comparison with PWMs is a weak and "short" context pattern (SREBP and STAT1, Figure and ), since these TFBSs are found among the worst among all PWM models, and among the others they have nearly shortest lengths (Figure ). One may expect that for some TFBSs, distant interactions may be substantially less important than close ones, so that additivity assumption [
8] supported by observed context pattern.
Comparisons between the performance of PWM and SiteGA models estimated by resampling tests (Figure ) lead us to conclude that PWMs perform better when only smaller datasets are available (ISGF3 and IRF1, Table ). SiteGA models need more representative training sets (> 25–30 sites) to achieve better results. The comparison of performances for datasets of different size suggests that a subtle context pattern, which may increase the performance, might be successfully retrieved only from larger training data [
10,
25].
Simulated data analyses have shown that performances of mono- and dinucleotide PWMs, as well as of optimized markov model (OMiM, [
25]) are increased when dataset size grows from 15 to 150 sequences. Moreover, the most substantial growth of performance for OMiM was observed when the dataset size was increased from 15 up to 75 sites. This observation allows us to suppose that nearly all our datasets (Table ) still far from the optimal size.
We realise, that as a quite sophisticated model, SiteGA in comparison with PWMs may be prone to overfitting. In the case of overfitting, the accuracy achieved by the training data should be quite good, but the independent control data do not confirm good performance. This effect should be noticeable for a relatively small dataset and careful resampling tests allow us to appreciate possible overfitting effects. In our paper, we carried out two types of resampling tests (jackknife and bootstrap), thus, the crucial interference of overfitting is excluded.
To conclude, we may note that increasing the number of training sequences makes a sophisticated method like SiteGA less prone to overfitting, thus in the case of larger datasets SiteGA may successfully compete with optimized PWMs.
We may emphasize at least three possible ways for further improvement of SiteGA:
(1) Extension of all current TFBSs datasets up to at least ~40–60 sequences;
(2) Careful adjustment of SiteGA-specific parameters (e.g. number of LPDs);
(3) Analysis of different lengths. That was extremely important for PWMs (Figure ), but still SiteGA models have been implemented with predefined lengths. But there is no guarantee that PWMs and SiteGA have exactly the same setting for optimal window length.
Next our task was to study the hidden context patterns that allow better recognition performance. For example, the SF-1 BS has been described by the motifs CCAAGGTCA [
65], (C/T)CAAGGT(C/T)A [
66] and GTCAAGGTCA, that was derived from our data set [
40]. According to the data extracted from TRRD [
42], the region protected from nuclease digestion in the footprinting experiment was not longer than 20 bp. It followed that the core region of the SF-1 BS was not longer than that. We took local dinucleotides of the SiteGA SF-1 model and studied the distribution of their locations within and outside the core region. We found that most significant SF-1 context features were inherent to the consensus [
10,
19] and footprint (approximately [
5,
24]) regions (Figure ).
The revealed dependencies between locally positioned dinucleotides for all SiteGA models were split into close and distant ones (Figure ), which revealed that:
(1) The most significant correlations are mainly between pairs of close dinucleotides, mostly resided to the core region (most probably these patterns are clearly handled by PWMs);
(2) Larger portions of less significant correlations are mainly between distant dinucleotides;
(3) Total numbers of distant dependencies are substantially higher than numbers of close ones. Since the significance of the distant dependencies is generally lower than for the close ones, the larger dataset is favourable for detailed clarification. We may note that the domination of close dependencies among most significant correlations (Figure , p < 0.001) is indeed the basis for assumption of independence [
8], which was accepted in our case by PWMs.
Finally, a large portion of the current research was devoted to analysis of EPD data, i.e. nearly all human promoters ~10% of genome (genes for which the transcription start site has been determined experimentally). This is a substantial development that is worth bringing to the attention of a wider audience. Firstly we separately used optimized PWM and SiteGA models on human EPD promoters (Figure ) and simulated data analysis (Figure ), and applied stringent thresholds (TP 50%–70%) to reduce the number of FPs. A small number of predicted sites may be considered as indirect evidence of better recognition performance, since we can expect obvious enrichment of total EPD dataset with functional TFBSs. Indeed for both PWMs and SiteGA, the ratios of predicted site frequencies to those for background (random sequences, Markov model 0) varied from ~1 to ~3 at the most permissive stringency (TP 70% for SiteGA and PWMs) depending on TF type.
The comparison of PWM and SiteGA model predictions applied to EPD promoter data (Figure ) as well as for random sets (Figure ) suggests the following:
(1) For E2F, HNF4, NF-κB, SF-1 and STAT1, SiteGA appeared to be better than the respective PWMs. Some deviation from this general rule is not very important. For instance, a small advantage of E2F PWM at TP 50% (Figure ) was suppressed by considerable dominance of SiteGA at two other thresholds, TP 60% and 70%.
(2) Ambiguous situations are found for PPAR and SREBP: the ratios of PWM and SiteGA predictions (Figure , ) may be more or less than unity depending on threshold, but generally it is hard to choose the better model.
(3) For IRF1 and ISGF3, PWMs achieved better results than SiteGA, which may be attributed to their having only small datasets. Another possible explanation is the small total count of LPDs captured by SiteGA models (38 and 36 are two most divergent values in the last column of Table ).
This EPD analysis suggests that SiteGA models are able to outperform the optimized dinucleotide PWMs.
Table shows the genes (identified by the TFBS searches) sorted on the likelihood (from biological knowledge) of them genuinely regulated by the TF concerned. For all TFBSs, the superposition of PWM and SiteGA appeared to reduce FPs substantially, while TPs reduced slower (Figure ). We have shown that among combined predictions indeed there are many known and other probable functional sites (Table ). The combined use of both models is very promising for whole genome searches, since the human genome contains approximately tenfold greater number of genes than is present in EPD now.
Analysis of whole human genome sequences (Figure ) confirmed that SiteGA and optimized PWMs have similar performances and the combination of two different approaches has a substantially reduced FP rate. Obviously, the density of predicted sites for Y chromosome is an outlier among others, most probably explained by differences in base content and gene density [
67-
69]. The average predicted site density for other chromosomes (1–22, X) are 1.60E-04, 1.64E-04 and 3.99E-05 for optimized PWM, SiteGA and their combination, respectively (Figure ). These figures are slightly lower than the random sequence scores for PWMs and SiteGA of 2.0E-04 (see Figure ), which is probably accounted for by minor differences in base composition. Whereas for EPD promoters, there were notably higher predicted-site densities (2.18E-04, 2.19E-04 and 5.56E-05, respectively). This may be just a consequence of closer similarity of nucleotide contents between training data and EPD promoters (they are both gene regulatory regions), than between training data and full-length chromosomes.
Finally, we consider the high number of predicted SF-1 BSs on human chromosomes (Figure ). The first and obvious explanation is the presence of BSs that have similar consensus sequences. For instance, SF-1 BSs closely resemble those of LRH-1 [
70], both TFs function as monomers. In general, closely related TFs share the same consensus (e.g. androgen, progesterone and glucocorticoid receptors that bind DNA as dimers [
71,
72]). Futhermore, the specific location of an predicted site may be non-functional
in vivo for a range of reasons: (a) the position may lie in tightly packaged heterochromatin far from gene regulatory machinery (b) although in putatively regulatory DNA, it also may be not exposed enough for TF binding, and finally (c) in a given tissue or development stage where the TF is present, only a subset of the potential binding sites are available. Hence in genomic DNA, we may separate '
in vitro' and '
in vivo' false positives. That means that the latter can bind TF
in vitro, but
in vivo this interaction is not observed. We have previously confirmed that our computer tools are able to predict functional sites at least
in vitro quite well [
73]. However, the majority of predicted sites on human chromosomes are probably '
in vivo' false positives, which can only filtered out with extra knowledge.