|Home | About | Journals | Submit | Contact Us | Français|
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions/at/oupjournals.org
Position-weight matrices (PWMs) are broadly used to locate transcription factor binding sites in DNA sequences. The majority of existing PWMs provide a low level of both sensitivity and specificity. We present a new computational algorithm, a modification of the Staden–Bucher approach, that improves the PWM. We applied the proposed technique on the PWM of the GC-box, binding site for Sp1. The comparison of old and new PWMs shows that the latter increase both sensitivity and specificity. The statistical parameters of GC-box distribution in promoter regions and in the human genome, as well as in each chromosome, are presented. The majority of commonly used PWMs are the 4-row mononucleotide matrices, although 16-row dinucleotide matrices are known to be more informative. The algorithm efficiently determines the 16-row matrices and preliminary results show that such matrices provide better results than 4-row matrices.
Regulation of gene expression involves the participation of many regulatory transcription factors (TFs). Understanding a regulatory system requires detailed knowledge of both the trans-acting TFs and the respective promoter cis-elements (binding sites). TFs bind to specific DNA sites among a vast excess of structurally similar non-specific sites. These specific sites share common features, consensus base pairs that almost always appear at the same position in every site (1). The consensus pattern can help to identify unknown sites (2). However, DNA–protein binding sites (signals) are often highly degenerate, so it is impractical to use just a consensus to evaluate the presence or absence of a signal in a DNA sequence (3). The position-weight matrix (PWM) technique has been developed to find a signal (4–14). The advantages of using a PWM in the search for transcription factor binding sites (TFBS) compared with motif consensus have been demonstrated (3,4).
PWMs have been used for the prediction of the binding affinity for numerous bacterial (15) and eukaryotic TFs (16–19). Currently PWMs are routinely used, e.g. in the TRANSFAC database (http://www.gene-regulation.com/pub/databases.html#transfac) and in some of the accompanying software (http://www.gene-regulation.com/pub/programs.html). Despite the obvious advantages of PWMs, the majority of existing PWMs provide a low level of both sensitivity and specificity (17).
There are several approaches to building PWMs. The most widely used methods are similar to the one proposed by Staden (7), and are as follows. One takes a collection of aligned TFBS and builds a base frequency table, i.e. counts the number of times each base occurs at each position. The base frequency table has four rows (one row for each letter: A, C, G and T) and the number of columns are equal to the motif length. The weight matrix has the same number of rows and columns with the value at each position being the natural logarithm of the value from the frequency table divided by the number of sequences in the original collection, i.e. the weight matrix contains the estimates of the log-probabilities of each base occurring at each position in true binding sites, based on the sample of known sites. A score for a particular sequence is the sum of the weights that correspond to the sequence which, under some simplifying assumptions, should be equal to the log-probability of seeing that sequence given that it is a binding site. A common alternative is to use log-odds weights which are the logarithm of the ratio of the probability of observing the sequence among a collection of sites compared to observing the sequence in the genome as a whole (3). The resulting matrix allows searching of DNA sequences in order to find sites similar to the original set of known sites, typically using a cutoff value, or score threshold, to predict new binding sites.
The basic methods do not show how to find an optimal cutoff value or how to pick the representative set of sites with which to build PWM; this can be important, since it is not uncommon to have false positives among the ‘known’ sites (3). Bucher presented a method to optimize the cutoff value for a given PWM (20). He applied his method to build PWMs for the most widely used promoter elements such as TATA box, Initiator, and GC-box and these matrices are still in use (see TRANSFAC). Tsunoda and Takagi (21) further refined Bucher's method and calculated the optimal cutoff values for the 205 vertebrate TFs from the TRANSFAC database. Here we describe further improvements to this method.
Besides the biological significance of recognized signals, the statistical significance of PWM matches (22,23) should also be verified. In the absence of experimental validation, the statistical significance allows estimating the expected rate of false-positives. The relationship between sensitivity, specificity and statistical significance of PWM matches were studied by Claverie and Audic (23). Based on several examples, they demonstrated the importance of calculating the statistical significance and showed how to estimate the rate of false-positive matches for a particular PWM. Hence, the statistical significance of a PWM with a given cutoff value is a necessary parameter of the PWM.
The new computational algorithm is a further development of the Staden–Bucher approach in determining a PWM (7,20) that improves the quality of the existing PWM (or in the worst case leaves it unmodified) based on the information in the promoter database. The promoter database, a set of promoter sequences aligned by transcription start site (TSS), is utilized in addition to the experimentally determined sites as a reservoir of sequences expected to be enriched in the binding sites being modeled. We believe that evolution has preserved the sites necessary for promoter regulation and, therefore, their occurrence frequencies in the promoter area are far from random. Since there are always a limited number of sites known for sure (from experiment) as functional, and since the true functional sites are usually highly degenerate, the only way to get more functional sites (without extensive experimentation) is to find them in the promoter sequences. Indeed, the spatial distributions of the sites in the proximal promoter area are very different from the non-promoter area or from randomly generated sequences (24,25).
Based on the preliminary knowledge about the TFBS being modeled, such as a list of experimentally verified sites, an existing PWM, or even just a consensus, the set of putative sites is extracted from the promoter sequences and used to build a new matrix. Some of the extracted sites are potentially non-functional (false positives), and we may miss some functional ones (true positives). But we expect that these putative sites are enriched in functional sites and by an iterative procedure of finding optimal matrices we will converge on an improved PWM for the sites of interest. In some respects this procedure is similar to the Gibbs sampling algorithm for discovering a common motif in a set of functionally related sequences (11), but by starting with a good estimate of the pattern we expect it will converge to the global optimum more frequently. We also optimize a different objective function—the correlation coefficient (CC) (26) that takes into account both the sensitivity and the specificity of the PWM. We assess the improvement on independent datasets.
The majority of practically used PWMs are the 4-row mononucleotide matrices based on the ‘additivity hypothesis’, which considers the contributions from each position of the binding site as independent and additive (1). Some experimental evidence (27,28) and theoretical considerations (29) show that a dinucleotide approach (counting of dependence between adjacent nucleotides of TFBS) could be, in some cases, the more appropriate approximation. Recently developed approaches of modeling dependencies in protein–DNA binding sites (30–32) also conclude that counting of inter-dependence between nucleotides (not necessarily adjacent) gives better results than regular PWM in many cases. Using the same methodology, we built the 16-row dinucleotide matrices. We demonstrate the applicability and advantage of our algorithm by building new 4-row and 16-row matrices for the GC-box element, TFBS for Sp1.
To build a new PWM for a particular TFBS we need (i) an existing motif consensus or a PWM defined, (ii) a database expected to be enriched in the TFBS of interest and (iii) a control set of experimentally defined binding sites. As an initial (original) matrix we use the PWM for GC-box TFBS obtained by Bucher (20). As a control set of sites, we use 122 non-redundant experimentally defined Sp1 binding sites from human genes from the TRANSFAC database (see list of these sites in Table 1 of Supplementary Material 1). For our database enriched in sites we use the Eukaryotic Promoter Database (EPD) release 75 (33) (http://www.epd.isb-sib.ch/), which contains a total of 1871 non-redundant, experimentally verified, human promoter sequences, each 600 bp long (−499 to +100 bp around the TSS). We evaluate the performance of the new PWM using the Database of Transcriptional Start Sites (DBTSS) (34) (http://dbtss.hgc.jp/index.html) that contain 8793 promoter sequences. We also search the entire human genome (http://genome.ucsc.edu/).
The mononucleotide matrix is a table with 4 rows (one row for each letter: A, C, G and T) and a number of columns of motif length l. The dinucleotide matrix is a table with 16 rows (one row for each dinucleotide) and a number of columns of motif length minus one. The formula for weight at the ith position of the motif for 4-row matrices was taken from article (20) with slight changes:
where b is one of 4 nt, nbi is the number of times base b occurs at the ith position of the motif, ci is a constant providing column maximum value to be zero, si is a ‘smoothing’ parameter preventing the logarithm of zero (or too small a value), and is the expected frequency of base b at position i, and L is the length of the sequence (in our case L = 600, the length of promoter area). The choice of si is important, since when nbi equals zero, the value of ln (si) defines the relative importance of different nucleotides. Different approaches were proposed to define the si value [see references and discussion in (23)]. Here, we use the following value for parameter si: si = 0 if the first term under logarithm in Equation 1 is larger than 0.01×n/(4×ebi) and si = 0.01×n/(4×ebi) otherwise, where . To calculate the weight score (S) for a specific sequence we use the formula (20):
Equation 1 will also be used for finding the weights for the 16-row matrices. In this case b is one of 16 dinucleotides, nbi is the number of times dinucleotide b occurs at the ith position of the motif, and ebi is the expected frequency of dinucleotide b at position i; ci and si have the same meaning; si = 0 if the first term under logarithm in Equation 1 is larger than 0.01×n/(16×ebi) and si = 0.01×n/(16×ebi) otherwise, . To calculate the weight score for the 16-row matrices we will use Equation 2 with l − 1 instead of l.
To consider averaged positional distribution of elements along aligned promoter sequences we use the occurence frequency (OF) of the element OFi = ni/Ns, where ni is the number of promoters containing considered element centered at position i and Ns is the number of sequences. We will use the term ‘functional window’ to designate the positions relative to TSS, where the occurrence frequency of considered element (OFreal) is much larger than expected, namely where OFreal − OFrandom ≥ 3×SD; here OFrandom is the occurrence frequency of the respective elements in the randomly generated DNA sequences with the same proportion of 4 nt as in the training set of promoters and SD is the standard deviation. Therefore, we suppose that sites appearing in that window are likely to have a functional (biological) meaning and some of these putative sites could be used to build a new matrix. Note that we will need the functional window only to define approximately the initial range of positions where we will extract the respective sites; the final optimal matrices will not be affected by the parameters of the functional window.
Several different approaches were applied to optimize the cutoff value and motif length. Bucher used a local over-representation parameter [see equation 5 in his article (20)]. We found that the CC is the most appropriate parameter for optimization:
where TP, FP, TN and FN are the number of true positives, false positives, true negatives and false negatives, respectively. The definition of these parameters is not straightforward and depends on the specific task. We define them in the following way (see also Figure 1 for clarification): TP is the number of sites positively identified by the new matrix at expected positions (in the given window) among the sites identified there by the original matrix; FP is the difference between the number (Nnew) of sites positively identified by the new matrix at expected positions at all considered promoter sequences in the same window and TP; FN is the difference between the number (Norig) of positively identified sites by the original matrix at expected positions and TP; TN is total number of all sites in the given functional window with length lw in all considered promoter sequences (Ns) minus TP, FP and FN; so that:
Defining CC in this way we suppose for a moment that the initial matrix is ideal, since the initial matrix is the only information about a particular TFBS we rely on at the beginning. But the sites extracted from the promoter sequences based on the original matrix carry additional data. The new matrix built on these new sites will be different from the original. Maximizing CC parameter we try to be as close as possible to the original matrix, but gradually picking up the additional information (new putative sites) concealed in the promoter sequences. Note that parameters TP, FP, TN and FN defined here to calculate CC could not be used to evaluate the sensitivity and specificity of the final matrix.
The question of how to calculate and compare the sensitivity and specificity of the new and original matrices is crucial. We will define the sensitivity (Sn) simply as a percentage of experimentally confirmed sites recognized by the respective matrix. To compare the specificity of two matrices we will suppose that the majority of sites found by these matrices in the randomly generated DNA sequences are false positives. If this is true, the ratio of the occurrence frequencies found by the new and original matrices is inversely proportional to the ratio of their specificities. Therefore, we will consider the averaged occurrence frequency of sites in the randomly generated sequences as a parameter describing the specificity (Sp) of the PWM.
To construct the random sequence, we calculate the average percentage of each of 4 nucleotides in the DNA sequences of interest (for example, in all sequences of the training set of promoter database or in respective chromosomes of the human genome) and then we generate a four-letter random sequence with the probability of each letter being proportional to its average percentage in those sequences.
The flowchart of the algorithm is depicted in Figure 2. The input parameters are the initial PWM with a given cutoff value, a set of experimentally defined sites, and the given sensitivity (Sn0)—the minimal portion of experimental sites recognized by the matrix. Note that the set of experimentally defined sites was not used for the building of the initial matrix and will not be included in the set of putative sites used to construct new matrices. The first step is the extraction of the dataset of putative binding site sequences for the considered element from the promoter database based on an existing PWM. To distinguish a functional window and to estimate the percentage of the noisy sites we use the randomly generated DNA sequences in addition to the sequences from the promoter database. The procedure to define a functional window includes the following steps: (i) calculate the average percentage of each of 4 nt in the sequences of the training promoter database; (ii) generate a random DNA sequence with the percentage of nucleotides defined in the previous step; (iii) calculate the averaged number and SD of respective TFBS in this randomly generated sequence using the initial matrix; (iv) calculate the positional distributions of TFBS averaged on all sequences from training promoter database. The results from the last two steps allow estimation of the noise level and accurate definition of the functional window.
The next step is the alignment of extracted sequences and construction of PWM using Equation 1. The alignment is straightforward since the length of all sites is the same. The new matrix absorbs information from a larger set of sequences containing putative sites. Presumably, this dataset is more representative than the one used to build the previous version of the PWM. The goal of the following steps is a minimization of the percentage of the noisy sites (to reach the maximal specificity) and maximization of the percentage of recognized experimental sites (to reach the maximal sensitivity). There are two levels of optimization at the beginning: cutoff value and motif length (see flowchart in Figure 2). The CC (Equation 3) is used as the optimization criterion. In the given range of parameters (reasonably chosen) we apply a new matrix to the promoter database every time changing one of the parameters and recalculating CC. First, we find the optimal cutoff value for the given motif length and for the given position and size of the window where we pick up the putative motives. To do this we calculate the CC parameter for every cut-off value in the range, for example, from 0.700 to 0.950 with step 0.001. The cutoff value is considered optimal if the CC value for this cutoff is maximal. Second, we change the length of the motif and find the optimal cutoff value and respective value of CC for each length in the given range for the given window. The maximal value of CC defined the optimal length. The final result of this procedure is a PWM with optimal length l, and cutoff value c. Now the new optimal matrix is utilized as an initial matrix for the next optimization cycle, which repeats all the aforementioned steps. This refinement process is continued up to m cycles, until CCm = 1 (usually it takes from 6 to 12 cycles). Each cycle brings a portion of new sites typical for this particular window and excludes some not typical sites increasing the influence of sites from that window. This influence is strongly limited by the requirement to be as close as possible to the previous matrix expressed by the definition of CC. All aforementioned steps should be repeated for each window from the functional window. As a result we will have a set of optimal matrices, one matrix for each considered window. Each matrix has its own sensitivity and specificity. Now we should choose the best matrix among all optimal matrices using the input set of experimentally defined sites and given sensitivity Sn0. Accordingly, we choose a matrix with sensitivity Sn ≥ Sn0 having the lowest score of occurrence frequency in the randomly generated DNA sequence, thus Sp = max(Spn) (see Figure 2). This final matrix is built on the set of sites extracted from a window which is part of the functional window. We will use the term ‘optimal window’ to designate this window.
To implement the described algorithm and to perform the statistical analysis a set of C++ Window-based programs was created. We also used the software package, Promoter Classifier (35), available at site http://bmi.osu.edu/~ilya/promoter_classifier/.
Figure 3 depicts the spatial distribution of OF of GC-box sites defined by scanning promoter sequences from EPD and DBTSS databases by the original PWM (20). From this picture, one can see that the OF values (dark blue and green curves) are larger than in randomly generated DNA sequences (horizontal line) across the entire promoter region (from −1000 to +200 bp) with an exception in the immediate downstream area. The difference between OF in promoter area and OF in random sequence exceeds 3 SD in that area. The ratio (OF in promoter area)/(OF in random sequence) is ~10 in the proximal upstream area, indicating the essential overrepresentation of the GC-box motif there. Surprisingly, the positions and values of absolute maxima on both OF curves, for the EPD and DBTSS databases, practically coincide and the curves are virtually identical. These features point to the non-random nature of GC-box presence even more than the large OF values. Note that both strands of promoter DNA sequence are overrepresented with GC-box (see magenta and red curves at Figure 3). Most of the human promoters contain CpG islands (36). Since the GC-box motif is rich with C and G nucleotides, the presence of those islands could be an explanation of GC-box overrepresentation. However, the occurrence frequency of GC-box in CpG-less and CpG-containing subsets of promoters exhibit similar behavior (see Figure 1 in Supplementary Material 2) excluding ‘CpG island’ explanation. Thus, the statistical analysis indicates that proximal upstream regions of promoter sequences contain GC-box like motif. It is likely that the majority of those sites have functional (biochemical) meaning as binding sites for Sp1 and, therefore, they could be used for building the PWM for Sp1 TFBS.
We applied our algorithm to refine PWM for the GC-box element using as input the original matrix and a control set of experimentally defined Sp1 binding sites (see subsection Data under Data and Algorithm). Since one of the input parameters, given by the user, is the number (or percentage) of recognized sites from the control set of sites there are multiple optimal matrices (one for each given number). Figure 4 depicts the occurrence frequency of GC-box sites in the random sequence versus sensitivity for the original matrix (circle at the left upper corner) and two sets of new 4-row (squares) and 16-row (diamonds) matrices, respectively. Each new matrix was built based on its own set of sites extracted from the respective optimal window. Note that each matrix has its own optimal cutoff value, which is considered a part of the matrix. We already mentioned that the set of the experimental sites was not used to build the new matrices. However, we used this set on the final stage of the algorithm procedure to choose the best matrix among all optimal matrices with the same (given by the user) sensitivity. Thus, indirectly the set of experimental sites is involved in the process of matrix refinement. Since we use the same set of sites to compare the initial and new matrices we should confirm by cross-validation procedure that the different subsets of the experimental sites lead to the same optimal matrix. To implement it we divided randomly 122 experimental sites on five subsets with ~80% sites in each. We applied our algorithm five times using different subsets of experimental sites and found that indeed every time the comparison of the sensitivity and specificity of all optimal matrices leads us to the same (the best) matrix. For example, the matrix built on the set of putative sites from the window −55 to −47 bp was the matrix with maximal sensitivity among all matrices for each of the 5 subsets of experimental sites. Furthermore, the sensitivity on the remaining 20% of sites is the same, on average, indicating that the procedure does not lead to overfitting to the 80% of sites used to determine the optimal matrices.
Two obvious conclusions follow from Figure 4: (i) there are new matrices with higher levels of both sensitivity and specificity than original matrix; (ii) in all cases, the optimal 16-row matrix has higher specificity than optimal 4-row matrix with the same sensitivity. We consider in more details four new matrices: two (4-row and 16-row) with the highest sensitivity (Tables 1 and and2)2) and two (4-row and 16-row) with sensitivity equal to the original matrix (Tables 1 and 2 of Supplementary Material 3). The sets of putative sites extracted from the promoter sequences and used for the matrix building can be found in Supplementary Materials 4–7.
To compare matrices we applied them to the human genome and two types of randomly generated sequences [type 1 and type 2 are the sequences with the same percentage of each of 4 nt as in the human genome and as in the training set of promoter sequences (EPD database) respectively]. The results are presented in Table 3 and lead to the following conclusions:
The common features of the new 4-row matrices are: (i) the positions 5 and 6 contain only nucleotide G; (ii) there is no G at position 7; (iii) there is no C and T at positions 4 and 8; (iv) there is no C at position 10.
The consensus comparison of the original and two new 4-row matrices shows that the main nucleotides at the core positions from 3 to 12 are identical (see Table 1 here and Table 1 in Supplementary Material 3). The major difference between matrices is due to the weight of the second and third nucleotides in those positions. Indeed, the weights of A and T at position 3 in the original matrix are larger than in the new matrices. The same is for the weight of A at positions 4 and 10. Nucleotide T at positions 11 and 12 has higher weight in the original matrix. The edge positions have more differences. Thus, the main nucleotides in the original matrix are A at positions 1 and 2 and T at positions 13 and 14, in contrast to the new matrices having the main nucleotide G at positions 1 and 2, C at position 13 and G at position 14.
The typical features of GC-box motif followed from the 16-row matrices are (i) position 5 contains only dinucleotide GG; (ii) position 4 contains mainly GG and AG; (iii) position 6, 7 and 8 contains mainly 3 among 16 possible dinucleotides (GC, GA and GT at position 6; CG, AG and TG at position 7; GG, GT and AG at position 8); (iv) edge positions never contain dinucleotides AT at position 1, AC and TC at position 2, AA at position 13 and TT at position 14.
It is obvious that the 16-row matrix contains, in general, more information than the 4-row matrix. There are certain positions where some dinucleotides never occur or their occurrence frequencies are much smaller or much higher than expected from the 4-row matrix. The expected and actual numbers of all possible dinucleotides at all positions are presented at frequency table (Table 2). From this table one can see that there are small differences between predicted and actual numbers for the main dinucleotides such as GG at positions 3–5, 8–11 and CG at position 7. At the same time, the dinucleotides with small frequency number could be greatly different from the predicted number. For example, the frequency of dinucleotide AT at positions 12 and 13 are 0 and 1 (see frequency Table 2, first line), although the expected value calculated base on frequency of mononucleotide A at positions taken from frequency Table 1 are 3.7 and 4.3, respectively. Another example is the frequency of GT at positions 8, 9 and 13 are 17, 1 and 5, in contrast to expected values 7.8, 7.2 and 13 (frequency Table 2, line 12).
The curves on Figure 5 depict the spatial distribution of OF for the GC-box motif defined by scanning of EPD and DBTSS sequences with the 16-row matrix with the best sensitivity (Table 2). The respective curves on Figures 3 and and55 qualitatively coincide, but the values are essentially different. The results from Figure 5 are:
We scanned with the 4-row and 16-row matrices of maximal sensitivity each chromosome of the human genome in two types of sequences: (i) whole chromosome sequences and (ii) only conserved regions between human and mouse sequences. Table 4 shows the average occurrence frequencies of GC-box sites in each chromosome and whole genome. For comparison, we also present the OF values in randomly generated sequences (columns 3 and 6). We see that the average occurrence frequency of GC boxes ranges widely from 238/million base pair in Chr# 4 to 796/million base pair in Chr# 19 reflecting the different density of genes and/or promoters in chromosomes. Indeed, the occurrence frequency of sites in chromosomes is proportional to the occurrence frequency of known genes (see Figure 6). The average OFs in the randomly generated sequences also vary widely, from 25/million base pair in Chr# 13 to 108/million base pair in Chr# 19. This is because of the difference of nucleotide percentages: Chr# 13 has 30.7% of A and T, in contrast, Chr# 19 has 25.8% of A and T. The average OF value in every chromosome and in the whole Genome are 6- to 8-fold larger than in the randomly generated sequences, but are similar to the OF of random promoters sequences (Table 3), indicating the different composition of promoter regions compared to the whole chromosomes.
We compare our predictions with experimental results that were obtained by applying high-density oligonucleotide arrays to all nonrepetitive sequences on human chromosomes 21 and 22 to map in vivo binding sites, in particular, for Sp1 (37). The method used by Cawley et al. (37) allows defining the windows in chromosome sequence (up to a 1000 bp long) containing one or several functional binding sites. They found a total of 353 such windows in both chromosomes. We scanned those 353 sequences by our new matrices (from the Table 1 and and2)2) and found totally 1158 sites in 256 sequences (75% of total number of sequences) for the 4-row matrix and 861 sites in 231 (65%) sequences for the 16-row matrix. These sites that are predicted within the experimentally identified segments are only a fraction, 1–2%, of the total number of sites predicted across the complete chromosomes 21 and 22. This is due to at least two reasons. The authors used very high stringency in identifying the binding sites, thereby missing many of the true sites (37). In addition, we search the entire chromosomes and identify sites that are not available for binding to the Sp1 due to chromatin structure, although presumably many of them would bind if they were available. But we can still compare the predictions to see if the sites predicted by the new matrices are more enriched in the experimentally validated segments. In comparison with the original matrix, the new 4-row matrix with maximum sensitivity (Table 1) predicts ~40% fewer sites across both chromosomes, consistent with its higher specificity. Yet it actually finds ~6% more sites within the experimentally validated segments, so that it has ~50% higher fraction of its predicted sites within those segments than the original matrix. In this case the 16-row matrix with maximum sensitivity (Table 2) is not quite as good, but still shows ~20% higher enrichment in the validated segments than the original matrix.
The algorithm proposed here allows the improvement of any initial matrix based on any database of sequences regardless if this initial matrix and database are ‘good’ (reflects some basic features of a particular TFBS) or ‘bad’. To find the difference between good and bad matrices we applied our algorithm to a pseudo GC-box matrix using a fake promoter database. As an initial matrix we used Bucher GC-box matrix where column 7 replaced column 1, the columns from 1 to 6 shifted right one position and the columns 8–14 stayed as before. So the good and bad matrices have identical columns but rearranged order. As a training set of promoter sequences we used the EPD sequences in the interval of positions from 1000 to 1600 from TSS (instead of −500 to +100). We also used the set of experimental sites for the real GC-box where we altered the respective positions as for the pseudo GC-box matrix. Thus, we used a matrix with no biological sense and a training set of sequences where we do not expect an overrepresentation of sites recognized by the pseudo GC-box matrix. Applying our algorithm we obtained a new pseudo GC-box matrix. The results of scanning the human genome, random sequences of type 1 and 2, and promoter sequences are presented in Table 3 (last two lines) and Figures 1 and 2 of Supplemental Material 8, respectively. We see that the new matrix is ‘improved’ by some criteria: the OF values in random sequences of both types are reduced compared with the initial pseudo GC matrix (Table 3); the OF value found by the new matrix in pseudo-functional window is larger than OF value obtained by the original matrix (Figure 2 of Supplemental Material 8). However, by other criteria the new pseudo matrix is quite different from the new real matrix. First, the reduction in OF between the initial and final matrices for both real genome and random genome sequences is much less for the pseudo matrix than for the real one, indicating a much smaller increase in specificity. Second, the ratios OF(genome)/OF(random) for the new real matrices are larger than the same ratio for the new pseudo matrix (compare column 2 and 5, Table 3), indicating that in the real case a true signal has been identified that distinguishes real promoters from random sequences. And third, the final pseudo matrix loses sensitivity on the pseudo experimental sites: the initial pseudo GC-box matrix recognized 68.8% of pseudo experimental sites and the final optimal matrix with maximal sensitivity recognized only 59.8%. This last result emphasizes the fundamental difference between ‘good’ data and ‘bad’ data. The iterative procedure for obtaining a new matrix is done without regard to the experimental data. If the promoter database contains new real examples of the sites specified by the matrix they will be used to improve the matrix by increasing both its specificity, compared with random sequences, and its sensitivity on known binding sites. In the case of the pseudo matrix, for which there is no enrichment in the pseudo promoter database, the iterative procedure does increase specificity of the matrix slightly, but that is matched with a decrease in sensitivity.
A new computational algorithm of building improved PWMs is presented. As an input this technique requires an initial PWM including cutoff value (or just motif consensus) and a promoter database as a source of putative sites. New 4-row and 16-row matrices with new cutoff values are built. The main idea underlining this algorithm is that the occurrence frequency of sites is not random in the functional window of promoter sequences (see Figures 3 and and5)5) allowing extraction of a set of putative sites with minimal number of random (potentially non-functional) sites.
Our algorithm is an improvement of Staden–Bucher approach (7,20). The main formal difference with Bucher's algorithm is the optimization parameter: we use correlation coefficient (see definition in Data and Algorithm) instead of over-representation parameter (20). Bucher's algorithm is searching for an area (window) inside the set of aligned promoter sequences where the ratio of the number sites inside and outside that area is maximal. It allows locating the potential functional window for the considered element. In this case the number of random sites is not minimized. In contrast, our algorithm minimized the number of random sites by looking for an optimal window inside the functional window (not a functional window itself) where the set of putative sites are as close as possible to the experimentally defined sites and the percentage of the noisy sites is minimal. The proportion of the random sites in the representative dataset is crucial. This can be especially seen for the TFBS with broad spatial distribution of functional positions such as the GC-box. To build a PWM for GC-box, Bucher used the putative sites gathered from promoter sequences at positions from −170 to −5 bp (20). In contrast, we used a much smaller range of positions, which allows us to considerably reduce the portion of the noisy sites. Thus, to build a 16-row matrix with maximal sensitivity we used 190 sites extracted from the promoter sequences at position from −57 to −52 bp. The proportion of the potentially non-functional sites in the window −57 to −52 bp (2.4%) is very much smaller than in the window −170 to −5 bp (8.9%).
Another essential difference with Bucher algorithm is an additional input, a control set of experimentally verified sites. It allows us to control sensitivity of the new matrices and create multiple optimal PWMs with different ratios of specificity/sensitivity (see Figure 4 and Table 3). Finally, we used randomly generated sequences to find the matrix with highest specificity among all matrices with equal sensitivity.
In order to compare the specificities of PWMs, we made the assumption that the majority of sites found in the randomly generated DNA sequences are false positives. How true is this assumption? First of all, it is known that the majority (if not all) of the existing PWMs find in the genome sequence many orders of magnitude more sites than any reasonable estimate of the functional site number. For example, the PWMs for the TATA box and Initiator from (20) found 42 and 272 million potential sites, respectively, in the human genome, although the estimated numbers are 3000–5000 for the TATA box and 15000 for Initiator [this estimate is made based on the statistics of core promoter elements from (38). One of the intrinsic reasons of ‘bad’ specificity of PWM is the degeneracy of functional sites. In particular, it means that even if we magically find all real functional sites and build a PWM based on this ideal set of sites, the resulting matrix will still find many more false positive sites in genome. So there is a hope that, in the absence of exhaustive experimental data, the average OF in the random sequences is an appropriate parameter to estimate specificity.
As we see from the Results section, the new matrices (Tables 1 and and2)2) are better than the original one. First, the percentage of recognized experimental sites by the new matrices, i.e. sensitivity, is higher (Table 3). Second, the proportion of sites in the randomly generated sequences is lower, i.e. the specificity, is also higher (Table 3). Although the main goal of this article is a demonstration of the ability of the proposed algorithm to improve an existing PWM (not necessarily to create an ideal matrix), the new PWMs for the Sp1 binding sites show a reasonably high level of specificity and sensitivity. This conclusion follows from the comparison of the predicted sites and experimental data from chromosomes 21 and 22 (37). The huge difference between the occurrence frequencies of the GC-box sites in the sequences conserved between mouse and human and in the randomly generated sequence with the same composition (see Tables 3 and and44 from the main text) also indirectly supports this statement.
The analysis of the pseudo GC-box matrix demonstrated that if the database of sequences is not enriched in the sites being modeled, the iterative procedure will not return an improved matrix as assessed by the set of experimental sites which is an important component of the training procedure, yet not required to be especially large. There are at least 200 of known TFBS satisfying these requirements (21) and, therefore, their matrices could be improved by the described algorithm. Though in the present article we considered in detail only one example, we already have preliminary results showing improvement of matrices for the EGR-1 and EST-1 TFBSs. Note that there are just a few known experimental sites for the later TFBSs (see TRANSFAC database). The number of experimentally defined sites affects the outcome of the program in two ways. An initial matrix is needed to get the process started, but that could be based on only a few sequences—perhaps just a consensus sequence. The other effect comes at the end of the process when different matrices are ranked by their specificity (frequency of sites in random sequences) and their sensitivity. For that purpose one would like have a reasonable estimate of the sensitivity, but 10 to 20 sequences should be sufficient.
Studying the sensitivity-specificity tradeoff for the information theoretic PWM and determining an optimal cutoff according to some criterion is one possible approach. Recently, Djordjevic et al. (39) proposed an alternative solution to the problem. They formulated a maximum likelihood estimation (MLE) method that utilized the physical TF concentration-dependent binding probabilities. This MLE naturally becomes a variant of logistic regression and gives rise to a set of parameter estimation methods that interpolate between the conventional information theoretic weight matrix at one end and a Support Vector Machine at the other end. The Support Vector Machine, called QPMEME (for Quadratic Programming Method for Energy Matrix Estimation) not only provides a PWM but also a threshold. So this method defines the threshold based on intrinsic properties of sites included in the training set of experimentally defined sites, in contrast to our approach, which maximizes the portion of functional sites among a set of putative sites. These two methods do not contradict but compliment each other.
Given that the ‘additivity assumption’ does not hold in some cases (27,28), we can refine the methods and algorithms employed to build higher order matrices [see, for example, the weight array method (29)] or more generally counting all interdependency between adjusted and non-adjusted nucleotides (30–32). The limitations of small experimental datasets have persuaded researchers to use less accurate, but fairly reliable 4-rows matrices (40). There is no such limitation in our case since we use a large set of putative sites extracted from the sequences of promoter database. Despite the accuracy of ‘additivity assumption’, the 16-row matrix, in general, is more statistically informative than the 4-row matrix, and, therefore, should a priori be more precise. Indeed our results (Figure 4 and Tables 3) confirmed this statement. The results obtained by the recently developed approaches to the problem of TFBS prediction (30–32) also confirmed that algorithms counting within-motif dependence in many cases outperform conventional 4-row PWM.
The improvement of PWM quality is beneficial for studying the variety of transcription regulation scenarios. An alternative way of understanding such scenarios is a strict molecular modeling of the processes of the cis–trans interactions (41–43). That approach would utilize the detailed knowledge of the TF structure, in particular those of their DNA-binding domains and of the details of their biochemical interaction with the respective cis-elements. An experimental study of these processes may also be used for the refinement of the existing PWMs (44). Even though such a study may be somewhat simplified by taking into account the existence of major types of the TF-binding domains (45), it still would be very laborious and could deal only with a few TFs and TFBS at a time. Our approach allows for significant improvement of PWMs in a short period of time, thereby presenting an alternative to molecular modeling.
Supplementary Material is available at NAR Online.
The authors would like to thank anonymous referees for helpful suggestions. GDS was supported by NIH grant HG00249. Funding to pay the Open Access publication charges for this article was provided by OSU.
Conflict of interest statement. None declared.