The ability of every living cell to properly respond to diverse stimuli depends on the genetic information encoded in its genome and signaling cascades that activate appropriate transcription factors (TFs) for gene regulation (
1–3). To understand the global network of transcription for controlling diverse cellular responses, it is important to identify the regulatory modules that are responsible for spatial or temporal gene regulation. For this purpose, diverse integrative tools for genomic analysis of DNA sequences, accompanied by information on the transcriptome and interactome, have been actively developed (
4).
High-throughput technologies, such as ChIP-chip, ChIP-PET and ChIP-Seq, allow genome-scale mapping of epigenetic modifications and protein–DNA interactions in particular genomes (
5,
6). Integrating accumulated genome-wide experimental data with DNA sequence information allows the construction of a map of the transcriptional regulatory circuits encoded in a genome, which can eventually lead to the identification of the regulatory modules for gene regulation. However, annotating the functional transcription factor binding sites (TFBSs) in the regulatory modules remains a challenging task (
7). The problem derives mainly from the nature of the DNA sequences that are recognized by transcription factors; they are relatively short and degenerate. Furthermore, transcription factors are known to recognize more than one consensus sequence (
8), and similar DNA sequences can be recognized by different groups of transcription factors (
9).
Because accurate prediction of the putative binding sites of transcription factors is valuable for understanding transcriptional regulatory networks and mechanisms of transcriptional control, numerous computational tools have been developed. The most common method is the pattern matching approach that uses a position weight matrix (PWM) (
10–13) or Hidden Markov Models (HMMs) (
14). However, predicting putative TFBSs with predefined PWMs suffers from a high false-positive rate (
15). To alleviate this problem, integration of heterogeneous information (
16), such as the DNA sequence conservation score (
17,
18), DNase-I hypersensitive score (
19,
20), or nucleosome occupancy (
21) and modification information (
22,
23), has been successfully applied with enhanced prediction performance.
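The PWM pattern matching approach described above can be illustrated with a minimal Python sketch. It is not the implementation of any tool cited here: the count matrix, the pseudocount, the uniform background and the threshold are hypothetical values chosen for illustration, and the function names `log_odds_pwm` and `scan` are assumptions.

```python
import math

BASES = "ACGT"

def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds weights.

    Assumes a uniform background frequency; real tools often use
    genome-specific background frequencies instead.
    """
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + pseudocount * len(BASES)
        pwm.append({
            b: math.log2((column[b] + pseudocount) / total / background)
            for b in BASES
        })
    return pwm

def scan(sequence, pwm, threshold):
    """Slide the PWM along the sequence and keep windows scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing ambiguous bases such as N
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits

# Hypothetical counts for a 4-bp motif resembling TTCC (illustration only).
toy_counts = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 0, "C": 10, "G": 0, "T": 0},
    {"A": 2, "C": 8, "G": 0, "T": 0},
]
toy_pwm = log_odds_pwm(toy_counts)
toy_sequence = "GGTTCCAATTCAGG"

strict_hits = scan(toy_sequence, toy_pwm, threshold=4.0)
lenient_hits = scan(toy_sequence, toy_pwm, threshold=1.0)
for start, window, score in strict_hits:
    print(start, window, round(score, 2))
```

Lowering the threshold in this toy example admits an additional, weaker match, which mirrors on a small scale why short and degenerate motifs tend to produce false positives when scanned across a genome.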
In parallel, approaches that cluster PWMs by sequence similarity have been proposed. In this method, a familial binding profile (FBP) is constructed from multiple PWMs for each family of transcription factors, improving the sensitivity of
de novo motif discovery algorithms (
24,
25). However, an FBP ignores the flanking positions of PWMs that are not aligned but may be important for discriminating false positives; hence, this method can have low specificity in predicting functional binding sites. An alternative approach is to combine overlapping TFBSs predicted by the original PWMs belonging to the same cluster (
15). This program can increase specificity by removing redundant TFBSs, but because it is based on a heuristic scoring system, it is not suitable for comparing the scores of overlapping TFBSs. To overcome these problems, we recently developed a motif-based scanning program (
26). It searches for STAT TFBSs with high affinity scores, using the combined TFBSs predicted by PWMs that show binding specificity similar to that of STAT family members.
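The general idea of combining overlapping TFBSs predicted by PWMs of the same cluster can be sketched as follows. This is an illustrative simplification and not the algorithm of the programs cited above: it assumes that scores from different PWMs have already been placed on a comparable scale, and the `Hit` record, the `merge_overlapping` function and the PWM identifiers are hypothetical.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    start: int    # 0-based start of the predicted site
    end: int      # exclusive end of the predicted site
    score: float  # assumed comparable across PWMs (e.g. normalized)
    pwm_id: str   # hypothetical identifier of the PWM that produced the hit

def merge_overlapping(hits):
    """Group overlapping hits and keep the highest-scoring hit per group."""
    merged = []
    group = []
    group_end = None
    for hit in sorted(hits, key=lambda h: (h.start, h.end)):
        if group and hit.start < group_end:
            # The hit overlaps the current group: extend the group.
            group.append(hit)
            group_end = max(group_end, hit.end)
        else:
            # The hit starts a new group: flush the previous one first.
            if group:
                merged.append(max(group, key=lambda h: h.score))
            group = [hit]
            group_end = hit.end
    if group:
        merged.append(max(group, key=lambda h: h.score))
    return merged

# Hypothetical predictions from two PWMs of the same cluster.
predicted = [
    Hit(10, 19, 0.82, "PWM_A"),
    Hit(12, 21, 0.91, "PWM_B"),
    Hit(40, 49, 0.75, "PWM_A"),
]
non_redundant = merge_overlapping(predicted)
for hit in non_redundant:
    print(hit.start, hit.end, hit.score, hit.pwm_id)
```

Keeping one representative per overlapping group removes redundant predictions, but the choice of representative is only meaningful if the scores being compared are on a common scale, which is the limitation of heuristic scoring noted above.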
In an attempt to construct an efficient computational tool for predicting TFBSs, we applied the motif-based scanning program to other transcription factors with multiple PWMs. A total of 368 transcription factors and 565 PWMs were considered in this study. Finally, TFBS-Scanner was applied to identify the co-occurring
cis-motifs that might function coordinately. The source code of the TFBS-Scanner program is freely available from the supporting webpage,
https://sourceforge.net/p/tfbsscanner.