Transcription of genes followed by translation of their transcripts into proteins determines the type and functions of a cell. Expression of certain genes even initiates or suppresses differentiation of stem cells. It is therefore crucial to understand the mechanisms of transcriptional regulation. Among them, transcription factor (TF) binding is the one that has been given considerable attention by computational biologists for the past decade and is still being actively researched. A TF is a protein or protein complex that regulates transcription of one or more genes by binding to the double-stranded DNA. A first step in computational identification of target genes regulated by a TF is to pinpoint its binding sites in the genome. Once the binding sites are found, the putative target genes can be searched and located in flanking regions of the binding sites.
In general, there are two approaches to computational transcription factor binding site (TFBS) identification, motif discovery and TFBS search. The former assumes that a set of sequences is given and each of the sequences may or may not contain TFBSs. An algorithm then predicts the locations and lengths of TFBSs. The term motif refers to the pattern that are shared by the discovered TFBSs. These algorithms rely on no prior knowledge of the motif and hence are known as de novo
motif discovery algorithms. The latter assumes that, in addition to a set of sequences, the locations and lengths of TFBSs are known. An algorithm then learns from these examples and predicts TFBSs in new sequences. Such algorithms are also called supervised learning algorithms since they are guided by the given sequences with known TFBSs. Plenty of efforts have been devoted to the de novo
motif discovery problem
]. Comprehensive evaluation and comparison of the developed tools have been performed
]. In this study, we focus on the problem of TFBS search. We refer readers interested in the motif discovery problem to the evaluation and review articles
] and references therein.
A typical TFBS search method searches for the binding sites of a particular transcription factor in the following manner. It scans a target DNA sequence and compare each length l
-mer) to the binding site profile of the TF, where l
is the length of a binding site. Each of the l
-mer is scored when comparing to the profile. A cut-off score is then set by the method to select candidate TF binding sites. The position-specific scoring matrix (PSSM)
] is a widely used profile representation, where the binding sites of a TF are encoded as a 4 × l
matrix. Column i
of the matrix stores the scores of matching the ith
letter in an l
-mer to nucleotides A, C, G and T, respectively. Depending on the method of choice, the score of A at position i
can be the count of A at position i
in the known TFBSs, the log-transformed probability of observing A at position i
, or any other reasonable number. Once computed, the scoring matrix of a TF can be stored in a database. These matrices are used by tools
] to scan sequences for TFBSs.
One assumption the PSSM representation makes is that positions in a binding site are independent, which is often not the case. Osada et al.
] exploited dependence between positions by considering nucleotide pairs in scoring methods. It was shown that incorporating nucleotide pairs significantly improved the performance of a method, meaning that most transcription factors studied demonstrated correlation between positions in a binding site. This result was reinforced in a recent study
], in which the authors showed correlations between two nucleotides within a binding site by plotting the mutual information matrix. A novel scoring method called the ungapped likelihood under positional background (ULPB) method was proposed in this study. The ULPB method models a TFBS by two first-order Markov chains and scores a candidate binding site by likelihood ratio produced by the two Markov chains. leave-one-out cross-validation results on 22 TFs with 20 or more binding sites showed that ULPB is superior to the methods compared in their work.
In this work, we approach the TFBS search problem from a different perspective. We propose to search for binding sites in vector spaces. Specifically, l
-mers are placed in the Euclidean space such that each l
-mer corresponds to a vector in the space. With known binding sites of a TF, we construct a profile vector for the TF. This profile vector can then be used as a query vector to search for the unknown binding sites in the space given a similarity measure between two vectors. The vector space model has long been used in information retrieval (IR)
]. Under this model, each document in a collection is embedded in a t
-dimensional space. That is, each document is represented by a t
-element vector, where t
is the number of distinct terms present in the document collection or corpus. To search for documents on a particular topic, a query composed of terms relevant to the topic is constructed. The query can be similarly embedded in the t
-dimensional space. Similarity between the query and a document can then be measured by measuring the similarity between the two corresponding vectors. In the TFBS search problem, the entire genome or the collection of promoter region sequences corresponds to the corpus, whereas an l
-mer is analogous to a document in IR. On the other hand, a TF is analogous to a topic, while a TF representation is the analog of a query for the topic.
In this framework, we propose two novel approaches to constructing a query vector for a TF of interests. We compare the proposed methods to a state-of-the-art method, the ULPB method, as well as the widely-used PSSM method. Performance of a method is assessed by cross-validation experiments on two data sets collected from RegulonDB
] and JASPAR
], respectively. Independent validation on human ChIP-seq data gives further insights into the proposed methods. Finally, we discuss the advantages of searching for TF binding sites in the proposed framework.
The paper is organized as follows. In Methods, we present the novel negative-to-positive vector and optimal discriminating vector methods, in addition to introducing the existing methods compared in this work. Cross-validation results on prokaryotic and eukaryotic transcription factors are presented and discussed in Results and Discussion. Finally, we give the concluding remarks in Conclusions.