Home | About | Journals | Submit | Contact Us | Français |

**|**BMC Bioinformatics**|**v.13; 2012**|**PMC3543194

Formats

Article sections

- Abstract
- Background
- Methods
- Results and discussion
- Conclusions
- Competing interests
- Author’s contributions
- Supplementary Material
- References

Authors

Related links

BMC Bioinformatics. 2012; 13: 215.

Published online 2012 August 27. doi: 10.1186/1471-2105-13-215

PMCID: PMC3543194

Chih Lee: ude.nnocu.rgne@eelhihc; Chun-Hsi Huang: ude.nnocu.rgne@gnauh

Received 2012 January 29; Accepted 2012 August 16.

Copyright ©2012 Lee and Huang; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (
http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Computational approaches to transcription factor binding site identification have been actively researched in the past decade. Learning from known binding sites, new binding sites of a transcription factor in unannotated sequences can be identified. A number of search methods have been introduced over the years. However, one can rarely find one single method that performs the best on all the transcription factors. Instead, to identify the best method for a particular transcription factor, one usually has to compare a handful of methods. Hence, it is highly desirable for a method to perform automatic optimization for individual transcription factors.

We proposed to search for transcription factor binding sites in vector spaces. This framework allows us to identify the best method for each individual transcription factor. We further introduced two novel methods, the negative-to-positive vector (NPV) and optimal discriminating vector (ODV) methods, to construct query vectors to search for binding sites in vector spaces. Extensive cross-validation experiments showed that the proposed methods significantly outperformed the ungapped likelihood under positional background method, a state-of-the-art method, and the widely-used position-specific scoring matrix method. We further demonstrated that motif subtypes of a TF can be readily identified in this framework and two variants called the *k* NPV and *k* ODV methods benefited significantly from motif subtype identification. Finally, independent validation on ChIP-seq data showed that the ODV and NPV methods significantly outperformed the other compared methods.

We conclude that the proposed framework is highly flexible. It enables the two novel methods to automatically identify a TF-specific subspace to search for binding sites. Implementations are available as source code at: http://biogrid.engr.uconn.edu/tfbs_search/.

Transcription of genes followed by translation of their transcripts into proteins determines the type and functions of a cell. Expression of certain genes even initiates or suppresses differentiation of stem cells. It is therefore crucial to understand the mechanisms of transcriptional regulation. Among them, transcription factor (TF) binding is the one that has been given considerable attention by computational biologists for the past decade and is still being actively researched. A TF is a protein or protein complex that regulates transcription of one or more genes by binding to the double-stranded DNA. A first step in computational identification of target genes regulated by a TF is to pinpoint its binding sites in the genome. Once the binding sites are found, the putative target genes can be searched and located in flanking regions of the binding sites.

In general, there are two approaches to computational transcription factor binding site (TFBS) identification, motif discovery and TFBS search. The former assumes that a set of sequences is given and each of the sequences may or may not contain TFBSs. An algorithm then predicts the locations and lengths of TFBSs. The term motif refers to the pattern that are shared by the discovered TFBSs. These algorithms rely on no prior knowledge of the motif and hence are known as *de novo* motif discovery algorithms. The latter assumes that, in addition to a set of sequences, the locations and lengths of TFBSs are known. An algorithm then learns from these examples and predicts TFBSs in new sequences. Such algorithms are also called supervised learning algorithms since they are guided by the given sequences with known TFBSs. Plenty of efforts have been devoted to the *de novo* motif discovery problem
[1-11]. Comprehensive evaluation and comparison of the developed tools have been performed
[12,13]. In this study, we focus on the problem of TFBS search. We refer readers interested in the motif discovery problem to the evaluation and review articles
[12-14] and references therein.

A typical TFBS search method searches for the binding sites of a particular transcription factor in the following manner. It scans a target DNA sequence and compare each length *l* sub-sequence (*l* -mer) to the binding site profile of the TF, where *l* is the length of a binding site. Each of the *l* -mer is scored when comparing to the profile. A cut-off score is then set by the method to select candidate TF binding sites. The position-specific scoring matrix (PSSM)
[15] is a widely used profile representation, where the binding sites of a TF are encoded as a 4 ×* l* matrix. Column *i* of the matrix stores the scores of matching the *i*^{th} letter in an *l* -mer to nucleotides A, C, G and T, respectively. Depending on the method of choice, the score of A at position *i* can be the count of A at position *i* in the known TFBSs, the log-transformed probability of observing A at position *i* , or any other reasonable number. Once computed, the scoring matrix of a TF can be stored in a database. These matrices are used by tools
[16-21] to scan sequences for TFBSs.

One assumption the PSSM representation makes is that positions in a binding site are independent, which is often not the case. Osada *et al.*[22] exploited dependence between positions by considering nucleotide pairs in scoring methods. It was shown that incorporating nucleotide pairs significantly improved the performance of a method, meaning that most transcription factors studied demonstrated correlation between positions in a binding site. This result was reinforced in a recent study
[23], in which the authors showed correlations between two nucleotides within a binding site by plotting the mutual information matrix. A novel scoring method called the ungapped likelihood under positional background (ULPB) method was proposed in this study. The ULPB method models a TFBS by two first-order Markov chains and scores a candidate binding site by likelihood ratio produced by the two Markov chains. leave-one-out cross-validation results on 22 TFs with 20 or more binding sites showed that ULPB is superior to the methods compared in their work.

In this work, we approach the TFBS search problem from a different perspective. We propose to search for binding sites in vector spaces. Specifically, *l* -mers are placed in the Euclidean space such that each *l* -mer corresponds to a vector in the space. With known binding sites of a TF, we construct a profile vector for the TF. This profile vector can then be used as a query vector to search for the unknown binding sites in the space given a similarity measure between two vectors. The vector space model has long been used in information retrieval (IR)
[24,25]. Under this model, each document in a collection is embedded in a *t* -dimensional space. That is, each document is represented by a *t* -element vector, where *t* is the number of distinct terms present in the document collection or corpus. To search for documents on a particular topic, a query composed of terms relevant to the topic is constructed. The query can be similarly embedded in the *t* -dimensional space. Similarity between the query and a document can then be measured by measuring the similarity between the two corresponding vectors. In the TFBS search problem, the entire genome or the collection of promoter region sequences corresponds to the corpus, whereas an *l* -mer is analogous to a document in IR. On the other hand, a TF is analogous to a topic, while a TF representation is the analog of a query for the topic.

In this framework, we propose two novel approaches to constructing a query vector for a TF of interests. We compare the proposed methods to a state-of-the-art method, the ULPB method, as well as the widely-used PSSM method. Performance of a method is assessed by cross-validation experiments on two data sets collected from RegulonDB [26] and JASPAR [27], respectively. Independent validation on human ChIP-seq data gives further insights into the proposed methods. Finally, we discuss the advantages of searching for TF binding sites in the proposed framework.

The paper is organized as follows. In Methods, we present the novel negative-to-positive vector and optimal discriminating vector methods, in addition to introducing the existing methods compared in this work. Cross-validation results on prokaryotic and eukaryotic transcription factors are presented and discussed in Results and Discussion. Finally, we give the concluding remarks in Conclusions.

To understand the compared methods in this work, we experimented on prokaryotic as well as eukaryotic transcription factors. The known prokaryotic TF binding sites were collected from from RegulonDB
[26] release 6.8. Considered in
[23], this data source contains binding sites of TFs in the *E. coli* K-12 genome. We considered a data set of 26 TFs with 17 or more known binding sites. The filtering criterion ensures that, for each TF, we have enough examples to learn from. Similar filtering criteria were used in
[23]. This data set is summarized in Table
Table11.

The known eukaryotic TF binding sites were collected from JASPAR CORE database (the 4^{th} release)
[27]. TFs of Homo sapiens and Mus musculus were filtered by two criteria. A TF was kept only if it has at least 20 known binding sites and the length of its binding sites is at least 6 nucleotides. The length criterion, arbitrarily chosen, ensures a TF under consideration is specific enough. This data set is summarized in Table
Table22.

For clarity, we list and define functions and variables used throughout this paper. Please see Additional file 1 for more details.

• *f*_{i}(*u*) denotes the probability of observing letter *u* at position *i* of a TFBS, where *u* {A, C, G, T}.

• *f*_{i,j}(*u*,*v*) denotes the probability of observing letters *u* and *v* at positions *i* and *j* , respectively, where *i* <*j* and *u*,*v* {A, C, G, T}.

• *f*_{i}(*v*|*u*) denotes the position-specific conditional probability of observing *v* at position *i* + 1 given *u* has been seen at position *i* , where *u* ,*v* {A, C, G, T}.

• *f* (*v*|*u*) denotes the background conditional probability of observing *v* given *u* has been observed at the previous position, where *u* ,*v* {A, C, G, T}.

• ${\mathcal{I}}_{u}(\xb7)$ is the indicator function given by

$${\mathcal{I}}_{u}\left(v\right)=\left\{\begin{array}{cc}1& \phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\text{if}v=u,\\ 0& \phantom{\rule{0.3em}{0ex}}\text{otherwise,}\end{array}\right)$$

(1)

where *u* ,*v* {A, C, G, T}.

• ${\mathcal{I}}_{{u}_{1}{u}_{2}}(\xb7)$ is similarly defined as follows:

$${\mathcal{I}}_{{u}_{1}{u}_{2}}\left({v}_{1}{v}_{2}\right)=\left\{\begin{array}{cc}1& \phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\text{if}\phantom{\rule{0.3em}{0ex}}{v}_{1}={u}_{1}\phantom{\rule{0.3em}{0ex}}\text{and}\phantom{\rule{0.3em}{0ex}}{v}_{2}={u}_{2},\\ 0& \phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\text{otherwise,}\end{array}\right)$$

(2)

where *u*_{1},*u*_{2},*v*_{1},*v*_{2}{A, C, G, T}.

• *I**C*_{i }denotes the information content at position *i* of a binding site. Information content is closely related to entropy, a measure of uncertainty in information theory. The entropy at position *i* is given by
${E}_{i}=-\sum _{u\in \left\{\text{A, C, G, T}\right\}}{f}_{i}\left(u\right){\text{log}}_{2}\left[{f}_{i}\left(u\right)\right]$. When
${f}_{i}\left(u\right)=\frac{1}{4}$ for all *u* {A, C, G, T}, *E*_{i }attains the maximal entropy of 2 and we are most uncertain about the letter at position *i* . *I**C*_{i} is simply defined as

$$I{C}_{i}=2-{E}_{i}=2+\sum _{u\in \left\{\text{A, C, G, T}\right\}}{f}_{i}\left(u\right){\text{log}}_{2}\left[{f}_{i}\left(u\right)\right].$$

(3)

• *I**C*_{i,j} denotes the information content of the position pair (*i* ,*j* ) of a binding site. Similarly,

$$I{C}_{i,j}=4+\sum _{u,v\in \left\{\text{A, C, G, T}\right\}}{f}_{i,j}(u,v){\text{log}}_{2}\left[{f}_{i,j}(u,v)\right],$$

(4)

where the maximal entropy of 4 is attained when
${f}_{i,j}(u,v)=\frac{1}{16}$ for all *u* ,*v* {A, C, G, T}.

We describe how a short sequence of *l* nucleotides or an *l* -mer is placed in a vector space. Let *s* be an *l* -mer and *s*_{i} denote its *i*^{th} nucleotide. Each nucleotide in *s* is converted to 4 variables, that is, *s*_{i }is converted to
${w}_{i}{\mathcal{I}}_{\text{A}}\left({s}_{i}\right),{w}_{i}{\mathcal{I}}_{\text{C}}\left({s}_{i}\right),{w}_{i}{\mathcal{I}}_{\text{G}}\left({s}_{i}\right)\phantom{\rule{.3em}{0ex}}\text{and}\phantom{\rule{.3em}{0ex}}{w}_{i}{\mathcal{I}}_{\text{T}}\left({s}_{i}\right)$ for *i* = 1,2,…,*l* . Hence, *s* is converted to *4l* variables, placing *s* in
${\mathbb{R}}^{4l}$. Figure
Figure11 illustrates the conversion of each nucleotide in an *l* -mer to 4 variables when *w*_{i }= 1 for *i* = 1,2,…,*l* .

We further consider nucleotide pair (*s*_{i},*s*_{j}), where *i* <* j* . Only pairs in close proximity are considered in this study. We consider (*s*_{i},*s*_{j}) only if *j* −*i* = 1 or 2, i.e., a pair of nucleotides is considered only if they are adjacent or separated by one nucleotide. Nucleotide pair (*s*_{i},*s*_{j}) is similarly converted to 16 variables,
${w}_{i,j}{\mathcal{I}}_{\mathrm{AA}}\left({s}_{i}{s}_{j}\right),{w}_{i,j}{\mathcal{I}}_{\mathrm{AC}}\left({s}_{i}{s}_{j}\right),\dots ,{w}_{i,j}{\mathcal{I}}_{\mathrm{TT}}\left({s}_{i}{s}_{j}\right)$, as there are 16 possible nucleotide pairs, {AA, AC,…,TT}. We use 32*l* −48 additional variables to encode the pairs since there are *l* −1 adjacent pairs and *l* −2 pairs separated by one nucleotide. Consequently, considering individual nucleotides and nucleotide pairs, each *l* -mer is converted to a (36*l* −48)-element vector.

In this study, we consider two choices of *w*_{i}’s and *w*_{i,j}’s. For the first choice, all the nucleotides and nucleotide pairs are given the same weight, i.e., *w*_{i }= 1 and *w*_{i,j }= 1 for all *i* and *j* . The second one assigns weight to the *i*^{th} nucleotide according to the information content at position *i* . Similarly, it assigns weight to pair *i* ,*j*) according to the information content at this pair of positions. Specifically,
${w}_{i}=\sqrt{I{C}_{i}}$ and
${w}_{i,j}=\sqrt{I{C}_{i,j}}$ for all *i* and *j* .

Given a query vector ** t** in space, we score an

$$\mathrm{Score}\left(s\right)={\mathit{s}}^{\mathrm{T}}\mathit{t},$$

(5)

where ** s **denote the corresponding vector of

As described above, an *l* -mer is converted to a (36*l* −48)-element vector. Hence, we use ** t** to search for binding sites in
${\mathbb{R}}^{(36l-48)}$. Our approach offers great flexibility in that it easily allows searching for binding sites in a lower dimensional subspace. By setting all but the first

We first introduce a simple approach to constructing a query vector. Let *P* be the set of *n*_{+} binding sites and *N* be the set of *n*_{−}non-binding sites of a particular transcription factor. We embed all the *l* -mers in *P* and *N* in
${\mathbb{R}}^{(36l-48)}$. We then find the mean binding site vector

$${\mathit{\mu}}_{+}=\frac{1}{{n}_{+}}\sum _{s\in P}\mathit{s}$$

as well as the mean non-binding site vector

$${\mathit{\mu}}_{-}=\frac{1}{{n}_{-}}\sum _{s\in N}\mathit{s}.$$

The query vector ** t** is found by subtracting

The score of an *l* -mer *s* given by the NPV method is therefore

$$\mathrm{Score}\left(s\right)={\mathit{s}}^{\mathrm{T}}({\mathit{\mu}}_{+}-{\mathit{\mu}}_{-})={\mathit{s}}^{\mathrm{T}}{\mathit{\mu}}_{+}-{\mathit{s}}^{\mathrm{T}}{\mathit{\mu}}_{-}.$$

(6)

We can see that it computes the similarity between *s* and the mean binding site vector as well as the similarity between *s* and the mean non-binding site vector. It then scores *s* by the difference of the two similarity scores. The more similar *s* is to the mean binding site vector, the higher the score. The less similar *s* is to the mean non-binding site vector, the higher the score.

From the perspective of geometry, we note that Score(*s* ) in (5) is proportional to Score(*s* )/||** t**|| , where ||

$${\mathit{s}}^{\mathrm{T}}\mathit{t}=||\mathit{s}||\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}||\mathit{t}||cos\theta ,$$

we know Score(*s* )/||** t**|| equals the orthogonal projection of

We have described the NPV method, which offers a heuristic way of constructing a query vector. We now introduce a way of finding an optimal query vector
$\mathit{\beta}\in {\mathbb{R}}^{(36l-48)}$. Suppose that |*P* |=*n*_{+} and |*N* |=*n*_{−}, that is, there are *n*_{+} binding sites and *n*_{−} non-binding sites for a particular TF. Let *P* ={*s*_{(1)},*s*_{(2)},…,*s*_{(n+)}} and
$N=\{{s}_{({n}_{+}+1)},{s}_{({n}_{+}+2)},\dots ,{s}_{\left(n\right)}\}$ , where *s*_{(i)} denotes the *i*^{th}*l* -mer in the union of the two sets and *n* =*n*_{+} + *n*_{−}. We find the optimal ** β**by solving the following minimization problem:

$$\begin{array}{ll}\underset{\phantom{\rule{0.8em}{0ex}}\phantom{\rule{0.8em}{0ex}}\phantom{\rule{0.8em}{0ex}}\mathit{\beta},b,\mathit{\xi}}{\phantom{\rule{0.8em}{0ex}}\phantom{\rule{0.8em}{0ex}}\phantom{\rule{0.8em}{0ex}}\text{min}}& \phantom{\rule{0.9em}{0ex}}\phantom{\rule{0.8em}{0ex}}\frac{1}{2}\left|\right|\mathit{\beta}|{|}^{2}+\frac{C}{{n}_{+}}\sum _{i=1}^{{n}_{+}}{\xi}_{i}+\frac{C}{{n}_{-}}\sum _{i={n}_{+}+1}^{n}{\xi}_{i}\phantom{\rule{2em}{0ex}}\end{array}$$

(7)

$$\begin{array}{ll}\text{subject to}& \frac{\mathrm{Score}\left({s}_{\left(i\right)}\right)}{\left|\right|\mathit{\beta}\left|\right|}\ge \frac{b+1-{\xi}_{i}}{\left|\right|\mathit{\beta}\left|\right|}\text{for}{s}_{\left(i\right)}\in P,\phantom{\rule{2em}{0ex}}\end{array}$$

(8)

$$\begin{array}{ll}& \phantom{\rule{0.8em}{0ex}}\phantom{\rule{0.8em}{0ex}}\phantom{\rule{0.8em}{0ex}}\phantom{\rule{0.8em}{0ex}}\phantom{\rule{0.8em}{0ex}}\phantom{\rule{0.8em}{0ex}}\frac{\mathrm{Score}\left({s}_{\left(i\right)}\right)}{\left|\right|\mathit{\beta}\left|\right|}\le \frac{b-1+{\xi}_{i}}{\left|\right|\mathit{\beta}\left|\right|}\text{for}{s}_{\left(i\right)}\in N,\phantom{\rule{2em}{0ex}}\end{array}$$

(9)

$$\begin{array}{ll}& \phantom{\rule{0.8em}{0ex}}\phantom{\rule{0.8em}{0ex}}\phantom{\rule{0.8em}{0ex}}\phantom{\rule{0.8em}{0ex}}\phantom{\rule{0.8em}{0ex}}\phantom{\rule{0.8em}{0ex}}{\xi}_{i}\ge 0\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\forall \mathrm{i.}\phantom{\rule{2em}{0ex}}\end{array}$$

(10)

The constraint in (8) ensures that the projection of a TFBS *s*_{(i)} onto the vector ** β**,
$\frac{\mathrm{Score}\left({s}_{\left(i\right)}\right)}{\left|\right|\mathit{\beta}\left|\right|}$, exceeds the threshold
$\frac{b+1}{\left|\right|\mathit{\beta}\left|\right|}$. On the other hand, the constraint in (9) ensures that the projection of a non-TFBS

The optimization problem in (7) is known as a quadratic programming problem with linear inequality constraints specified in (8), (9) and (10). There are *p* + *n* + 1 variables and *2n* constraints, where *p* =36*l* −48 is the dimension of ** β**. We can see that (8) and (9) specify

We briefly describe the ungapped likelihood under positional background (ULPB) method proposed in
[23] and the position-specific scoring matrix (PSSM) method compared therein. We refer readers to section Notation for functions and variables used here. Consider a specific TF with binding sites of length *l* . The PSSM method scores an *l* -mer *s* by

$$\sum _{i=1}^{l}\text{log}\left[{f}_{i}\left({s}_{i}\right)\right],$$

(11)

where *s*_{i} denotes the *i*^{th} letter in *s* . We note that usually the ratio *f*_{i}(*s*_{i})/*f* (*s*_{i}) is used in place of *f*_{i}(*s*_{i}), where *f* (*s*_{i}) is the background probability of *s*_{i}. The simpler form in (11) was compared in
[23] and hence it serves as a baseline method in this study.

The ULPB models a TFBS by a first-order Markov chain and models the background by another first-order Markov chain. The background transition probabilities are estimated using the entire genome of a species and hence the ULPB method uses negative examples implicitly. It scores an *l* -mer *s* by

$$\text{log}\phantom{\rule{0.3em}{0ex}}{f}_{1}\left({s}_{1}\right)+\sum _{i=1}^{l-1}\text{log}\phantom{\rule{0.3em}{0ex}}\left(\frac{{f}_{i}\left({s}_{i+1}\right|{s}_{i})}{f\left({s}_{i+1}\right|{s}_{i})}\right).$$

(12)

Although ULPB does not consider background probability in the first term of (12), the score is approximately the log-likelihood ratio of the two Markov chains.

The main difference between the PSSM method and the NPV, ODV and ULPB methods is that the PSSM method does not score nucleotide pairs nor does it utilize a background distribution. The NPV and ODV methods explicitly take advantage of negative binding sites, while the ULPB method does it implicitly by using a background distribution. The flexibility of the proposed framework allows the NPV and ODV methods to easily search in subspaces, further distinguishing the PSSM and ULPB methods from the proposed ones.

The performance of a TFBS search method is evaluated by *ν* -fold cross-validation (CV). Consider a TF with *n*_{+ }TFBSs of length *l* with flanking regions on both sides. A set of negative examples, *N*_{test}, called the *test negatives* is constructed from the TFBSs of the other TFs with filtering as in
[22]. Another set of negative examples, *N*_{train}, called the *training negatives* is collected from sequences embedding the *n*_{+}binding sites. It is comprised of all the *l* -mers except for the TFBSs and two neighboring *l* -mers of each TFBS.

The *n*_{+} TFBSs are first divided into *ν* sets, each of which contains
$\lfloor \frac{{n}_{+}}{\nu}\rfloor $ or
$\lfloor \frac{{n}_{+}}{\nu}\rfloor $ + 1 TFBSs. At each iteration of *ν* -fold CV, one of the *ν* TFBS sets called the *test TFBS set**P*_{test}is left out. The rest of the TFBSs are therefore called the *training TFBSs* . A scoring function is obtained using the training TFBSs and non-TFBSs randomly sampled from the training negatives, where the ratio of numbers of non-TFBSs to TFBSs is set to 10. The test TFBSs in *P*_{test} along with the non-TFBSs in *N*_{test}are then scored by the scoring function. To score a test sequence, both the forward and reverse strands are scored and, in case the test sequence is longer or shorter than *l* , the *l* -mer producing the highest score is used. For each test TFBS *t* *P*_{test}, we find its rank relative to all the non-TFBSs in *N*_{test}. Formally, the rank of *t* equals 1 + |{*s* * N*_{test}|Score(*s* ) ≥ Score(*t* )}|.

After the *ν* -fold CV, we end up with *n*_{+} ranks, each of which corresponds to a TFBS. To allow comparison of methods, we use the area under the ROC curve (AUC) to gauge the performance of a method on the TF. The ROC curve is a plot of true positive rate (TPR) against false positive rate (FPR), displaying the trade-off between TPR and FPR. We refer readers to
[30] for an introduction to this metric. In this study, *ν* =10 for all the CV experiments. For the NPV and ODV methods, the best weight and subspace combination is obtained at each iteration of the *ν* -fold CV. Specifically, another (*ν* −1)-fold CV is performed on the *ν* −1 sets of TFBSs to search for the best combination.

To understand the behavior of search methods on prokaryotic TF binding sites, we conducted 10-fold cross-validation experiments on the 26-TF RegulonDB data set. The proposed NPV and ODV methods were compared to the ULPB method [23]. The PSSM method, considered in [23], was also included for comparison since it served as a simple baseline method.

Figure
Figure4a4a shows the plot of area under the ROC curve (AUC) across the 26 TFs for each method. We can see that the ODV method has the best AUC on 12 out of 26 TFs and the NPV method has the best AUC on 9 out of 26 TFs whereas the ULPB and PSSM methods have the best AUC on 1 and 4 TFs, respectively. To gauge the relative performance between two methods, statistical tests
[31] were performed on all the 6 pairs of methods. Figure
Figure4b4b shows the *p* -values of the pair-wise comparisons. We first notice that, consistent with the results in
[23], ULPB outperformed PSSM with a slightly larger *p* -value of 0.0679 than the usual 0.05 significance cut-off. As seen in Figure
Figure4b,4b, the NPV and ODV methods are significantly better than the PSSM and ULPB methods. We can see that the ODV method benefited from optimization albeit minimizing the objective function in (7) does not guarantee maximization of the AUC.

Here we compare the proposed NPV and ODV methods to the ULPB and PSSM methods on eukaryotic TF binding sites. As in the previous section, we conducted 10-fold cross-validation experiments on the 28-TF JASPAR data set. Figure Figure5a5a shows the plot of AUC across the 28 TFs for each method. We can see that both the ODV and NPV methods have the best AUC on 13 out of 28 TFs while the ULPB and PSSM methods have the best AUC on 6 and 4 TFs, respectively. All the methods have the best AUC of 1 on MA0149.1 and MA0115, while the ODV, NPV and PSSM methods have the best AUC of 0.999 on MA0137.2.

Similarly, statistical tests [31] were performed on all the 6 pairs of methods. Figure Figure5b5b shows that the NPV and ODV methods are significantly better than the PSSM and ULPB methods. ULPB is significantly better than PSSM, which is again consistent with the results reported in [23]. Overall, performance of the four methods remain unchanged as we shift from prokaryotic transcription factors to eukaryotic ones. This implies that a TFBS search method effective on prokaryotic transcription factors will perform equally well on eukaryotic transcription factors and vice versa.

It has been shown that the binding sites of a TF can be better represented by 2 motif subtypes than by a single motif
[32,33]. In search for new binding sites, two position-specific scoring matrices are used to score an *l* -mer and the higher score of the two is assigned to this *l* -mer. Searching with two PSSMs was shown to be superior to searching with a single PSSM by cross-species conservation statistics in these studies.

We demonstrate that motif subtypes can be readily identified once we embed *l* -mers in a vector space. The purpose here, however, is not to compare motif subtype identification algorithms. We adopted a slightly different approach to motif subtype identification from those in previous work
[32,33], while the idea is similar. As usual, all the *l* -mers were first embedded in a vector space. The known binding sites of a TF were clustered into two subtypes by the *k* -means algorithm
[34]. Immediately, we have a variant of the NPV method called the *k* NPV method, where *k* =2 denotes the number of motif subtypes. The *k* NPV method first computes the mean vectors of these two subtypes, *μ*_{+ 1} and *μ*_{+ 2}, and scores an *l* -mer *s* by

$$\text{Score}\left(s\right)=\text{max}\phantom{\rule{0.3em}{0ex}}\left\{{\mathit{s}}^{\mathrm{T}}\left({\mathit{\mu}}_{+1}-{\mathit{\mu}}_{-}\right),{\mathit{s}}^{\mathrm{T}}\left({\mathit{\mu}}_{+2}-{\mathit{\mu}}_{-}\right)\right\},$$

(13)

where *μ*_{−} is the mean vector of the non-binding sites. Figure
Figure66 illustrates the *k* NPV method.

Similarly, the *k* ODV method scores an *l* -mer *s* by

$$\text{Score}\left(s\right)=\text{max}\phantom{\rule{0.3em}{0ex}}\left\{{\mathit{s}}^{\mathrm{T}}{\mathit{\beta}}_{+1}/\left|\right|{\mathit{\beta}}_{+1}\left|\right|,{\mathit{s}}^{\mathrm{T}}{\mathit{\beta}}_{+2}/\left|\right|{\mathit{\beta}}_{+2}\left|\right|\right\},$$

(14)

where *β*_{+ i} is obtained using TFBSs in cluster *i* , *i* =1,2. Unlike the *k* NPV method, the lengths of *β*_{+ i}’s may be very different and hence *β*_{+ i}’s are scaled to unit vectors so as not to bias the scoring function. We note that the choice of *k* =2 came from previous studies
[32,33]. Generally, *k* can be greater than 2 or even automatically selected
[35]. This however is beyond the scope of this study and may be investigated in the future.

We assessed the *k* NPV and *k* ODV methods by 10-fold cross-validation on both the RegulonDB and JASPAR data sets. Figure
Figure77 shows the results in terms of AUC. We observe in Figure
Figure7a7a that overall introducing motif subtypes into the NPV and ODV methods improves the search performance (*p* -values: 6.41 × 10−^{7} and 8.31 × 10−^{5}, respectively). Results in Figure
Figure7b7b also support this observation (*p* -values: 1.61 × 10−^{3} and 3.04 × 10−^{3}, respectively). The *k* NPV and *k* ODV are comparable on both the RegulonDB and JASPAR data sets (*p* -values: 0.197 and 0.47, respectively). These results are consistent with those reported in
[32,33].

To evaluate the proposed NPV and ODV methods on the whole genome scale, we built TF models using TFBSs in the JASPAR database to scan all the human (build hg19) 1000-base promoter sequences obtained from the UCSC Genome Browser database [36]. ChIP-seq peaks from the ENCODE project were also retrieved [37]. Specifically, the wgEncodeRegTfbsClusteredV2 table of build hg19 was obtained. We checked TFs in Table Table22 against the annotations and found 14 JASPAR TFs, recognized by 17 antibodies present in the ENCODE annotations. The mapping is listed in the first 3 columns of Table Table33.

For the NPV and ODV methods, the best weight and subspace combination was found by 5-fold cross-validation on the JASPAR TFBSs, while flanking genomic sequences of the TFBSs were the sources of negative binding sites. To assess the 4 compared methods, we considered the part of a ROC curve where FPR is at most 0.01 and calculated the AUC scaled to between 0 and 1. This is nearly equivalent to allowing at most 10 false positive hits per promoter on average. As a peak spans about 200 bases, it is considered recalled when it fully contains a predicted binding site. Similarly, a predicted binding site must be fully covered by a peak to be a true positive hit.

In Table
Table3,3, we observe that ODV, NPV, ULPB and PSSM produced the best AUC on 13, 1, 1 and 3 out of 18 tests, respectively. Statistical tests showed that ODV significantly outperformed the other 3 methods (*p* -values ≤ 0.0028), NPV significantly outperformed ULPB and PSSM (*p* -values ≤ 0.0449), and ULPB and PSSM are comparable (*p* -value: 0.433). We notice that both NPV and ODV performed worse than the other two methods on MEF2A. As NPV and ODV both sample negative examples from flanking sequences of TFBSs, we suspect that this is one example where the flanking sequences do not represent well the entire promoters. ODV performed consistently across tests corresponding to the same JASPAR ID such as the three for CTCF. Examining the best weight and subspace, we can see that the subspace agrees on 11 out of 14 TF models, while the weight agrees on only 7 of them. The latter may be because ODV optimizes the ** β** vector and hence is less sensitive to the weight used to embed an

In this work, we proposed to search for transcription factor binding sites in vector spaces. The novel NPV and ODV methods were introduced to construct a query vector to search for binding sites of a TF. We compared our methods to a state-of-the-art method, the ULPB method, and the widely-used PSSM method. Cross-validation experiments revealed that the NPV and ODV methods significantly outperformed the ULPB and PSSM methods on prokaryotic as well as eukaryotic TF binding sties. Independent validation on human ChIP-seq data further verified that the NPV and ODV methods are significantly better than the other compared methods.

One of the advantages of our framework is that it allows one to easily search for binding sites in various subspaces. Hence, one can search in the best subspace for each individual TF since one can hardly find an optimal subspace for all the TFs. Another advantage is that under the proposed framework one can readily identify motif subtypes for a TF. Hence, to exploit this advantage, we introduced the *k* NPV and *k* ODV methods, immediate variants of the NPV and ODV methods. We demonstrated that, consistent with results in previous studies, *k* NPV (*k* ODV) significantly improved NPV (ODV) on the two data sets.

Our future work aims for extending our proposed methods to handling known binding sites of variable lengths. We will seek to approach this problem without resorting to multiple sequence alignment, which is notoriously time-consuming. In the meantime, we will also seek to identify additional promising subspaces to search for TF binding sites in.

Both authors declared that they have no competing interests.

CL and CH conceived the study. CL collected the data, carried out the experiments and drafted the manuscript. CH guided the study and revised the manuscript. Both authors read and approved the final manuscript.

This work was supported in part by National Science Foundation [grant numbers CCF-0755373 and OCI-1156837].

- Vilo J, Brazma A, Jonassen I, Robinson A, Ukkonen E. Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology. San Diego, USA: AAAI Press; 2000. Mining for Putative Regulatory Elements in the Yeast Genome Using Gene Expression Data; pp. 384–394. [PubMed]
- Barash Y, Bejerano G, Friedman N. WABI ’01: Proceedings of the First International Workshop on Algorithms in Bioinformatics. London, UK: Springer-Verlag; 2001. A Simple Hyper-Geometric Approach for Discovering Putative Transcription Factor Binding Sites; pp. 278–293.
- Buhler J, Tompa M. RECOMB ’01: Proceedings of the fifth annual international conference on Computational biology. New York, NY, USA: ACM; 2001. Finding motifs using random projections; pp. 69–76.
- Sinha S. RECOMB ’02: Proceedings of the sixth annual international conference on Computational biology. New York, NY, USA: ACM; 2002. Discriminative motifs; pp. 291–298.
- Takusagawa KT, Gifford DK. Pacific Symposium on Biocomputing. Big Island of Hawaii, USA: World Scientific; 2004. Negative information for motif discovery; pp. 360–371. [PubMed]
- Rajasekaran S, Balla S, Huang CH. Exact Algorithms for Planted Motif Problems. J Comput Biol. 2005;12(8):1117–1128. doi: 10.1089/cmb.2005.12.1117. [PubMed] [Cross Ref]
- Balla S, Thapar V, Verma S, Luong T, Faghri T, Huang CHH, Rajasekaran S, del Campo JJ, Shinn JH, Mohler WA, Maciejewski MW, Gryk MR, Piccirillo B, Schiller SR, Schiller MR. Minimotif Miner: a tool for investigating protein function. Nat methods. 2006;3(3):175–177. doi: 10.1038/nmeth856. [PubMed] [Cross Ref]
- Li N, Tompa M. Analysis of computational approaches for motif discovery. Algorithms for Mol Biol. 2006;1:8. doi: 10.1186/1748-7188-1-8. [PMC free article] [PubMed] [Cross Ref]
- Zaslavsky E, Singh M. A combinatorial optimization approach for diverse motif finding applications. Algorithms for Mol Biol. 2006;1:13. doi: 10.1186/1748-7188-1-13. [PMC free article] [PubMed] [Cross Ref]
- Yanover C, Singh M, Zaslavsky E. M are better than one: an ensemble-based motif finder and its application to regulatory element prediction. Bioinformatics. 2009;25(7):868–874. doi: 10.1093/bioinformatics/btp090. [PMC free article] [PubMed] [Cross Ref]
- Georgiev S, Boyle A, Jayasurya K, Ding X, Mukherjee S, Ohler U. Evidence-ranked motif identification. Genome Biol. 2010;11(2):R19. doi: 10.1186/gb-2010-11-2-r19. [PMC free article] [PubMed] [Cross Ref]
- Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, Makeev VJ, Mironov AA, Noble WSS, Pavesi G, Pesole G, Régnier M, Simonis N, Sinha S, Thijs G, van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, Zhu Z. Assessing computational tools for the discovery of transcription factor binding sites. Nat biotechnol. 2005;23:137–144. doi: 10.1038/nbt1053. [PubMed] [Cross Ref]
- Hu J, Li B, Kihara D. Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res. 2005;33(15):4899–4913. doi: 10.1093/nar/gki791. [PMC free article] [PubMed] [Cross Ref]
- Sandve G, Drablos F. A survey of motif discovery methods in an integrated framework. Biol Direct. 2006;1:11. doi: 10.1186/1745-6150-1-11. [PMC free article] [PubMed] [Cross Ref]
- Staden R. Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Res. 1984;12(1Part2):505–519. doi: 10.1093/nar/12.1Part2.505. [PMC free article] [PubMed] [Cross Ref]
- Schug J. In: Curr Protoc Bioinf. Baxevanis AD, editor. New York: J. Wiley and Sons; 2003. Using TESS to Predict Transcription Factor Binding Sites in DNA Sequence.
- Kel A, Gößling E, Reuter I, Cheremushkin E, Kel-Margoulis O, Wingender E. MATCH™: a tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res. 2003;31(13):3576–3579. doi: 10.1093/nar/gkg585. [PMC free article] [PubMed] [Cross Ref]
- Sandelin A, Wasserman WW, Lenhard B. ConSite: web-based prediction of regulatory elements using cross-species comparison. Nucleic Acids Res. 2004;32(suppl 2):W249–W252. [PMC free article] [PubMed]
- Chekmenev DS, Haid C, Kel AE. P-Match: transcription factor binding site search by combining patterns and weight matrices. Nucleic Acids Res. 2005;33(suppl_2):W432–437. [PMC free article] [PubMed]
- Turatsinze JVV, Thomas-Chollier M, Defrance M, van Helden J. Using RSAT to scan genome sequences for transcription factor binding sites and cis-regulatory modules. Nat Protoc. 2008;3(10):1578–1588. doi: 10.1038/nprot.2008.97. [PubMed] [Cross Ref]
- Zambelli F, Pesole G, Pavesi G. Pscan: finding over-represented transcription factor binding site motifs in sequences from co-regulated or co-expressed genes. Nucleic Acids Res. 2009;37(suppl 2):W247–W252. [PMC free article] [PubMed]
- Osada R, Zaslavsky E, Singh M. Comparative analysis of methods for representing and searching for transcription factor binding sites. Bioinformatics. 2004;20(18):3516–3525. doi: 10.1093/bioinformatics/bth438. [PubMed] [Cross Ref]
- Salama RA, Stekel DJ. Inclusion of neighboring base interdependencies substantially improves genome-wide prokaryotic transcription factor binding site prediction. Nucleic Acids Res. 2010;38(12):e135. doi: 10.1093/nar/gkq274. [PMC free article] [PubMed] [Cross Ref]
- Salton G, Wong A, Yang CS. A vector space model for automatic indexing. Commun ACM. 1975;18:613–620. doi: 10.1145/361219.361220. [Cross Ref]
- Lee DL, Chuang H, Seamons K. Document Ranking and the Vector-Space Model. IEEE Software. 1997;14:67–75.
- Gama-Castro S, Jiménez-Jacinto V, Peralta-Gil M, Santos-Zavaleta A, Peñaloza-Spinola MI, Contreras-Moreira B, Segura-Salazar J, Muñiz-Rascado L, Martínez-Flores I, Salgado H, Bonavides-Martínez C, Abreu-Goodger C, Rodríguez-Penagos C, Miranda-Ríos J, Morett E, Merino E, Huerta AM, Treviño-Quintanilla L, Collado-Vides J. RegulonDB (version 6.0): gene regulation model of Escherichia coli K-12 beyond transcription, active (experimental) annotated promoters and Textpresso navigation. Nucleic Acids Res. 2008;36(suppl 1):D120–D124. [PMC free article] [PubMed]
- Portales-Casamar E, Thongjuea S, Kwon AT, Arenillas D, Zhao X, Valen E, Yusuf D, Lenhard B, Wasserman WW, Sandelin A. JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Res. 2010;38(suppl 1):D105–D110. [PMC free article] [PubMed]
- Bertsekas DP. Nonlinear Programming. Belmont, MA: Athena Scientific; 1999.
- Kroshko DL. OpenOpt 0.36. 2011. http://openopt.org/
- Fawcett T. An introduction to ROC analysis. Pattern Recogn Lett. 2006;27:861–874. doi: 10.1016/j.patrec.2005.10.010. [Cross Ref]
- Wilcoxon F. Individual Comparisons by Ranking Methods. Biometrics Bulletin. 1945;1(6):80–83. doi: 10.2307/3001968. [Cross Ref]
- Hannenhalli S, Wang LS. Enhanced position weight matrices using mixture models. Bioinformatics. 2005;21(suppl_1):i204–212. [PubMed]
- Georgi B, Schliep A. Context-specific independence mixture modeling for positional weight matrices. Bioinformatics. 2006;22(14):e166–e173. doi: 10.1093/bioinformatics/btl249. [PubMed] [Cross Ref]
- de Hoon MJ, Imoto S, Nolan J, Miyano S. Open source clustering software. Bioinformatics. 2004;20(9):1453–1454. doi: 10.1093/bioinformatics/bth078. [PubMed] [Cross Ref]
- Jain AK. Data clustering: 50 years beyond K-means. Pattern Recognit Lett. 2010;31(8):651–666. doi: 10.1016/j.patrec.2009.09.011. [Cross Ref]
- Fujita PA, Rhead B, Zweig AS, Hinrichs AS, Karolchik D, Cline MS, Goldman M, Barber GP, Clawson H, Coelho A, Diekhans M, Dreszer TR, Giardine BM, Harte RA, Hillman-Jackson J, Hsu F, Kirkup V, Kuhn RM, Learned K, Li CH, Meyer LR, Pohl A, Raney BJ, Rosenbloom KR, Smith KE, Haussler D, Kent WJ. The UCSC Genome Browser database: update 2011. Nucleic Acids Res. 2011;39(suppl 1):D876–D882. [PMC free article] [PubMed]
- Rosenbloom KR, Dreszer TR, Pheasant M, Barber GP, Meyer LR, Pohl A, Raney BJ, Wang T, Hinrichs AS, Zweig AS, Fujita PA, Learned K, Rhead B, Smith KE, Kuhn RM, Karolchik D, Haussler D, Kent WJ. ENCODE whole-genome data in the UCSC Genome Browser. Nucleic Acids Res. 2010;38(suppl 1):D620–D625. [PMC free article] [PubMed]

Articles from BMC Bioinformatics are provided here courtesy of **BioMed Central**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |