A new computational algorithm of building improved PWMs is presented. As an input this technique requires an initial PWM including cutoff value (or just motif consensus) and a promoter database as a source of putative sites. New 4-row and 16-row matrices with new cutoff values are built. The main idea underlining this algorithm is that the occurrence frequency of sites is not random in the functional window of promoter sequences (see and ) allowing extraction of a set of putative sites with minimal number of random (potentially non-functional) sites.
Our algorithm is an improvement of Staden–Bucher approach (7
). The main formal difference with Bucher's algorithm is the optimization parameter: we use correlation coefficient (see definition in Data and Algorithm) instead of over-representation parameter (20
). Bucher's algorithm is searching for an area (window) inside the set of aligned promoter sequences where the ratio of the number sites inside and outside that area is maximal. It allows locating the potential functional window for the considered element. In this case the number of random sites is not minimized. In contrast, our algorithm minimized the number of random sites by looking for an optimal window inside the functional window (not a functional window itself) where the set of putative sites are as close as possible to the experimentally defined sites and the percentage of the noisy sites is minimal. The proportion of the random sites in the representative dataset is crucial. This can be especially seen for the TFBS with broad spatial distribution of functional positions such as the GC-box. To build a PWM for GC-box, Bucher used the putative sites gathered from promoter sequences at positions from −170 to −5 bp (20
). In contrast, we used a much smaller range of positions, which allows us to considerably reduce the portion of the noisy sites. Thus, to build a 16-row matrix with maximal sensitivity we used 190 sites extracted from the promoter sequences at position from −57 to −52 bp. The proportion of the potentially non-functional sites in the window −57 to −52 bp (2.4%) is very much smaller than in the window −170 to −5 bp (8.9%).
Another essential difference with Bucher algorithm is an additional input, a control set of experimentally verified sites. It allows us to control sensitivity of the new matrices and create multiple optimal PWMs with different ratios of specificity/sensitivity (see and ). Finally, we used randomly generated sequences to find the matrix with highest specificity among all matrices with equal sensitivity.
In order to compare the specificities of PWMs, we made the assumption that the majority of sites found in the randomly generated DNA sequences are false positives. How true is this assumption? First of all, it is known that the majority (if not all) of the existing PWMs find in the genome sequence many orders of magnitude more sites than any reasonable estimate of the functional site number. For example, the PWMs for the TATA box and Initiator from (20
) found 42 and 272 million potential sites, respectively, in the human genome, although the estimated numbers are 3000–5000 for the TATA box and 15
000 for Initiator [this estimate is made based on the statistics of core promoter elements from (38
). One of the intrinsic reasons of ‘bad’ specificity of PWM is the degeneracy of functional sites. In particular, it means that even if we magically find all real functional sites and build a PWM based on this ideal set of sites, the resulting matrix will still find many more false positive sites in genome. So there is a hope that, in the absence of exhaustive experimental data, the average OF in the random sequences is an appropriate parameter to estimate specificity.
As we see from the Results section, the new matrices ( and ) are better than the original one. First, the percentage of recognized experimental sites by the new matrices, i.e. sensitivity, is higher (). Second, the proportion of sites in the randomly generated sequences is lower, i.e. the specificity, is also higher (). Although the main goal of this article is a demonstration of the ability of the proposed algorithm to improve an existing PWM (not necessarily to create an ideal matrix), the new PWMs for the Sp1 binding sites show a reasonably high level of specificity and sensitivity. This conclusion follows from the comparison of the predicted sites and experimental data from chromosomes 21 and 22 (37
). The huge difference between the occurrence frequencies of the GC-box sites in the sequences conserved between mouse and human and in the randomly generated sequence with the same composition (see and from the main text) also indirectly supports this statement.
The analysis of the pseudo GC-box matrix demonstrated that if the database of sequences is not enriched in the sites being modeled, the iterative procedure will not return an improved matrix as assessed by the set of experimental sites which is an important component of the training procedure, yet not required to be especially large. There are at least 200 of known TFBS satisfying these requirements (21
) and, therefore, their matrices could be improved by the described algorithm. Though in the present article we considered in detail only one example, we already have preliminary results showing improvement of matrices for the EGR-1 and EST-1 TFBSs. Note that there are just a few known experimental sites for the later TFBSs (see TRANSFAC database). The number of experimentally defined sites affects the outcome of the program in two ways. An initial matrix is needed to get the process started, but that could be based on only a few sequences—perhaps just a consensus sequence. The other effect comes at the end of the process when different matrices are ranked by their specificity (frequency of sites in random sequences) and their sensitivity. For that purpose one would like have a reasonable estimate of the sensitivity, but 10 to 20 sequences should be sufficient.
Studying the sensitivity-specificity tradeoff for the information theoretic PWM and determining an optimal cutoff according to some criterion is one possible approach. Recently, Djordjevic et al
) proposed an alternative solution to the problem. They formulated a maximum likelihood estimation (MLE) method that utilized the physical TF concentration-dependent binding probabilities. This MLE naturally becomes a variant of logistic regression and gives rise to a set of parameter estimation methods that interpolate between the conventional information theoretic weight matrix at one end and a Support Vector Machine at the other end. The Support Vector Machine, called QPMEME (for Quadratic Programming Method for Energy Matrix Estimation) not only provides a PWM but also a threshold. So this method defines the threshold based on intrinsic properties of sites included in the training set of experimentally defined sites, in contrast to our approach, which maximizes the portion of functional sites among a set of putative sites. These two methods do not contradict but compliment each other.
Given that the ‘additivity assumption’ does not hold in some cases (27
), we can refine the methods and algorithms employed to build higher order matrices [see, for example, the weight array method (29
)] or more generally counting all interdependency between adjusted and non-adjusted nucleotides (30
). The limitations of small experimental datasets have persuaded researchers to use less accurate, but fairly reliable 4-rows matrices (40
). There is no such limitation in our case since we use a large set of putative sites extracted from the sequences of promoter database. Despite the accuracy of ‘additivity assumption’, the 16-row matrix, in general, is more statistically informative than the 4-row matrix, and, therefore, should a priori be more precise. Indeed our results ( and ) confirmed this statement. The results obtained by the recently developed approaches to the problem of TFBS prediction (30
) also confirmed that algorithms counting within-motif dependence in many cases outperform conventional 4-row PWM.
The improvement of PWM quality is beneficial for studying the variety of transcription regulation scenarios. An alternative way of understanding such scenarios is a strict molecular modeling of the processes of the cis–trans
). That approach would utilize the detailed knowledge of the TF structure, in particular those of their DNA-binding domains and of the details of their biochemical interaction with the respective cis
-elements. An experimental study of these processes may also be used for the refinement of the existing PWMs (44
). Even though such a study may be somewhat simplified by taking into account the existence of major types of the TF-binding domains (45
), it still would be very laborious and could deal only with a few TFs and TFBS at a time. Our approach allows for significant improvement of PWMs in a short period of time, thereby presenting an alternative to molecular modeling.