Clusters of TFBS are an important property of regulatory regions [7
]. To determine if this is a general tendency for PWM matches and all protein coding genes, we have developed a measure that evaluates the correlation between predicted TFBS concentrations and promoter sequences. We then examined the correlation for individual PWMs using an unbiased sequence set. Our results show that not all TFBS are clustered in promoter sequences. We found that TFBS clusters corresponding to 47% of PWMs are positively correlated with promoter sequences, and that TFBS clusters corresponding to around 11% of PWMs are negatively correlated with promoter sequences.
It is important to ascertain the relationship between cluster scores of PWMs and CGI, because CGI are a prominent feature of promoter sequences. The consensus sequences of the top-ranked PWMs (Table ) are, 'ANNGACGCTNN' (WHN_B), 'TTTCSCGC' (E2F1DP1_Q6), 'NSGGGGGGGGMCN' (MAZR_01), and 'GGGGAGGG' (MAZ_Q6), where S represents C or G, M represents A or C, and N represents any bases. The sequence logos of the PWMs are depicted in Fig. . The G+C % of base composition of each matrix is 70%, 56%, 91%, and 86%, respectively. The sequences with high cluster scores appear to be GC-rich. Larsen et al
found that 57% of human genes are associated with CGI, that all housekeeping genes have CGI covering transcription start sites (TSS), and that 40% of tissue specific genes have CGI [25
]. Therefore, the association of PWM-PCP with CGI may be significant, and CGI-related PWMs may play important roles in housekeeping regulation.
To evaluate the relationship between PWM-PCP and CGI, we calculated the partial correlation coefficient for each PWM. In general, if a correlation coefficient rXY
is not small and rXY.Z
(defined in Methods) ≈ 0, the probable hypotheses concerning cause and effect will be either 1) the correlation of X
is a consequence of Z
, or 2) Z
intervenes between X
. For the PWMs in the top circle in Fig. , rIC.P
is high and rPC.I
is approximately zero. This suggests that the correlation between promoters and TFBS clusters is attributable to the presence of the CGI and that while they do not directly correlate they appear to because both independently correlate with CGI. The characteristic PWMs in Table are NMYC_01 (M00055:0.25), AHRARNT_01 (M00235:0.23), and HIF1_Q5 (M00466:0.22), where parentheses include the accession number referred to in TRANSFAC and the recorded rIC.P
(Y-value in Fig. ). Sequence logos are depicted in Fig. . The PWMs in the middle right circle in Fig. have an rIC.P
of approximately zero and a high rPC.I rPC.I
value showing that the cluster is correlated with promoters independent of CGI. The predicted TFBS clusters corresponding to these PWMs could not be explained by the presence of CGI. Some of these PWMs have thresholds T
less than 1.0 indicating that even the single occurrence of a predicted TFBS is more discriminative than clusters. Particular examples with high recorded rPC.I
values and values for rIC.P
< 0.1 are TFIII_Q6 (M000706), MYOGNF1_01 (M00056) and CREL_01 (M00053). Sequence logos are depicted in Fig. . TFIII_Q6 is a matrix associated with a general transcription factor II-I with the consensus sequence RGAGGKAGG, where the K represents G or T. The matrix TFIII_Q6 contains many 'G', and 'C' is allowed only the fourth position with low frequency. MYOGNF1_01 is a matrix associated with myogenin, nuclear factor 1 or related factors, and is therefore involved in the regulation of differentiation. CREL_01 is a matrix associated with the C-Rel proto-oncogene protein (C-Rel protein). An understanding of the function of these factors is important to this study. The PWM groups described above may be involved in tissue-specific gene regulation. If all housekeeping genes have CGI [25
] then genes without CGI can be assumed to be tissue-specific or rarely expressed. Thus, genes with a cluster of predicted TFBS not associated with CGI might be associated with tissue-specific regulation. Further analysis of extractions of tissue specific genes, shown in Results, supports the hypothesis.
Results from this analysis provide a solution to the promoter prediction problem. Hannenhalli et al.
used additional information, including profiles of TF binding sites, for promoter prediction based on CGI [29
], with no significant improvement to prediction performance. The report using 7 manually selected PWMs confirmed that CGI are the most dominant feature. Our results show that Sp1 and ATF have a strong correlation with CGI; a result consistent with their result that information including both PWM did not improve prediction accuracy. This observation is consistent with other PWMs. More stringent selection of PWMs is required for an improved accuracy of promoter prediction. One strategy is to utilise the CGI-independent PWMs identified in this study. Another problem is exemplified by the under-representation of Oct-1 (M00138) in the (-600:600) region of the human promoter and the absence of positional preferences [29
]. This under-representation was not expected but is observed in 10% of known PWMs. OCT1_04 (M00138) is not in the high quality list of TRANSFAC, OCT1_01 (M00135) and OCT1_C (M00210) was found to have minus cluster scores (-0.63 and -2.89) in our table (additional file 1
It is noteworthy that Fig. shows TFBS (AP2_Q6) in non-promoters to occur randomly under a certain distribution. This distribution can be modelled by a binomial probability distribution. A model of Poisson distribution, which is an approximation of binomial probability distribution for a certain condition, was proposed in [11
] as the probability distribution of TFBS density. Although we have not tested the goodness-of-fit, our observation does not contradict the Poisson distribution model.
Title: Distribution of accumulated score C for promoters and non-promoters for AP2_Q6
To assess the robustness of the cluster score, we compared cluster scores for different datasets from chromosomes 20, 21 and 22 (Fig. ). The correlation coefficients were 0.91 (a) and 0.93 (b), proving that the significance would be similar if we utilized the whole human genome dataset in the analysis. The scale of the figures between the Y-axis and X-axis are different because of the different number of sequences taken from each chromosome.