Let
TF be one of the transcription factors to be investigated. The binding dataset of the transcription factor
TF, denoted as
BTF, consists of the sequences with low binding
p-value (< 0.001) to the
TF in the ChIP-chip array data [
22]. A sliding window of size
w is used to extract segments of length
w when sliding through each of the sequences in
BTF.
Let STF be the collection of all extracted segments from BTF, M the number of sequences in the binding dataset BTF, Li the length of the ith sequence in the binding dataset BTF, and TTF the total number of segments in STF. Then
To discover the binding motifs of the transcription factor TF, a number of initial candidate motif sets for TF is subsequently built from the collection STF of extracted segments. Note that the contents of segments, called patterns, in STF may not be distinct.
Most of early motif finding algorithms, such as Gibbs sampler [
8] and MEME [
23], have a weakness, where initial candidate motif sets are built by randomly extracting segments from sequences in the binding dataset
BTF (i.e. randomly selecting segments from the set
STF). To improve the deficiency, the binomial probability distribution model is firstly utilized in the establishment of a number of initial candidate motif sets in our algorithm.
Then in the process of iterative sampling in our algorithm to expand and/or trim each of the initial candidate motif sets, the method of dependency graphs and their expanded Bayesian networks [
21] is used to develop a statistical model for the background motif set identified as the union
S =
TFSTF of segments extracted from all transcription factor binding datasets.
The basic procedure to find the binding motifs of the transcription factor TF is as follows:
1. Build N initial candidate motif sets.
(a) Take N distinct patterns from the set STF with the most highest significance scores as the candidates by the binomial distribution model (see the Binomial probability distribution model subsection).
(b) Then for each of the N significant binding site candidates for the transcription factor TF, in view of evolution, collect all segments in STF whose patterns have no more than d Hamming distance matching to the candidate pattern to form an initial candidate motif set.
At this stage, N initial candidate motif sets for the transcription factor TF are built.
2. Iteratively sample through the binding dateset
BTF to expand and/or trim each of the
N initial candidate motif sets so that their approximate maximum a posteriori (AMAP) scores [
1,
24] can keep increasing until the
N candidate motif sets are invariant in
K consecutive iterations (see the Iterative sampling subsection).
(a) In the calculation of AMAP scores in this stage, the background model for the background motif set
S =
TFSTF is established under the method of dependency graphs and their expanded Bayesian networks (see the Method of dependency graphs and their expanded Bayesian networks subsection).
3. Refine each of the N candidate motif sets by re-examining all the segments already included in the motif set. A segment is removed from the motif set if doing so increases the AMAP score.
A simple flowchart for our algorithm is shown in Figure . The following subsections will expatiate on each stage of our algorithm. As an illustration of the dynamics of the PWM and the rank of different candidate motif sets at different stages of our algorithm, a summary of the prediction process for the motif of the transcription factor CBF1 is given in Figure .
Initial motif sets building
Our method begins by enumerating those patterns in STF that appear most often in the binding dataset BTF than in others. What we want to do first is to calculate the appearance probability of a pattern in STF, which is the probability that the pattern appears no less than n times in the binding dataset BTF. If a pattern b appears more often than other patterns in STF and its occurrence probability in a generic intergenic region is comparatively low, the calculated significance score of b would be relatively high. We will take patterns with the most highest significance scores as the candidates to build a number of initial candidate motif sets.
Binomial probability distribution model
The probability to observe exactly j occurrences of pattern b in the collection STF of segments extracted from the binding dataset BTF is estimated by the binomial distribution
where
occ(
b) is the occurrence times of pattern
b in
STF and
f (
b) is the probability that pattern
b occurs in the intergenic region and is estimated as the relative frequency of pattern
b in the union
S =
TFSTF of segments extracted from all transcription factor binding datasets. The probability to observe
n or more occurrences of the pattern
b in
STF is
We define the significance score sigTF (b) of a pattern b to TF as
The less probable pattern b in S appears more than n times in STF, the more probable will it be a binding site candidate for the transcription factor TF. We will take N distinct patterns with the most highest significance scores as the candidates.
For each of the N significant binding site candidates for the transcription factor TF, in view of evolution, collect all segments in STF whose patterns have no more than d Hamming distance matching to the candidate pattern to form an initial candidate motif set. Thus N initial candidate motif sets for the transcription factor TF are built at the end of this stage. As an example, the PWM and the rank of five initial candidate motif sets for the motif prediction of the transcription factor CBF1 are shown in Figure .
Iterative sampling
In this stage, a sampling method is used to expand and/or trim each of the
N initial candidate motif sets
M1,
M2,..,
MN. For our purpose, a false motif set
MN+1 is created by randomly selecting
e0 (
e0 is equal to the maximum size of the
N initial candidate motif sets) segments from the collection
STF such that
Mi ∩
MN+1 =
![[empty]](/corehtml/pmc/pmcents/empty.gif)
, for all
i = 1, 2,...,
N. In addition, let the collection
S =
TFSTF of segments extracted from all transcription factor binding datasets represent the intergenic background and here be denoted as
MBG.
Approximate maximum a posteriori (AMAP) measure
The score

of the approximate maximum a posteriori (AMAP) measure of the candidate motif set
Mi is defined as [
1,
24]
where
ps, j is the frequency of nucleotide
j at base position
s in the candidate motif set
Mi(which can be retrieved from the position specific scoring matrix (PSSM) of
Mi),
ni is the number of segments in
Mi, and
P(
m|
MBG) is the probability of the pattern of segment
m in the motif set
Mi under an expanded Bayesian network (EBN) model [
21] developed from the background motif set
MBG (EBN model will be discussed shortly).
The first part of the AMAP score is a negative entropy, which is higher if there are more similar patterns in the candidate motif set
Mi. A motif set
Mi with all identical patterns has the maximum negative entropy 0, whereas equal nucleotide frequencies at every position in the PSSM of
Mi has the minimum negative entropy. And a segment
m in the candidate motif set
Mi which has a pattern much different from the background motif model built from
MBG would have lower appearance probability
P (
m|
MBG) and hence increases the score

of the AMAP measure of
Mi.
Sampling strategy
In each iteration, there are two steps for the sampler, the S-step and the M-step.
In the S-step, the sampler samples a site by randomly selecting a sequence from
BTF and then randomly picking up a site in the selected sequence to extract a segment
ms of length
w. For 1 ≤
i ≤
N, if the sampled segment
ms appears in
Mi, segment
ms will be removed from
Mi if the AMAP score

of the candidate motif set
Mi increases after its removal; otherwise, segment
ms will be kept in
Mi. Note that the PSSM of the motif model
Mi should be retrained if the sampled segment
ms is removed from
Mi.
Which one of the
N + 1 motif sets would be the best motif set for the sampled segment
ms will depend on the appendant score

that the segment
ms is derived from
Mi [
24,
25]
where
ni is the size of current motif set
Mi,
P(
ni) equals

,
P(
ms|
Mi) and
P(
ms|
MBG) are the probabilities of the content of the sampled segment
ms under the PSSM model developed from the current motif set
Mi and under an EBN model developed from the background
MBG, respectively. The sampled segment
ms will be considered to append into the motif set
Mi with the highest appendant score

. If

is the highest score, then the sampled segment
ms is appended into the false motif set
MN+1 unless
ms is already there and the current iteration stops here. If for some
i, 1 ≤
i ≤
N,

is the highest score, the sampled segment
ms will be further checked in the M-step to see if we really want to append
ms into
Mi unless we have processed
ms for
Mi at the beginning of this S-step as in above and the current iteration stops here.
In the M-step, the sampler has to decide whether the newly sampled segment
ms should be appended into the candidate motif set
Mi or not. The AMAP measure again will be used to evaluate our decision. The sampled segment
ms is appended into the candidate motif set
Mi if and only if the score

of the motif model
Mi is increased once the sampled segment
ms is appended to
Mi. Note that the PSSM of the motif model
Mi should be retrained after the sampled segment
bs is appended to
Mi. Now the M-step is done and the current iteration stops here.
The sampler will iteratively sample through the binding dataset
BTF to expand and/or trim the
N candidate motif sets
M1,
M2,..,
MN so that their AMAP scores

will keep increasing. The
N candidate motif sets will tend to be invariant after a (larger) number of iterations. The stopping criterion of the sampling process is that all the
N candidate motif sets are invariant in
K consecutive iterations. The parameter
K is usually set to be 1% of the size of
STF.
Alternative sampling strategy
There is an alternative sampling strategy as follows.
In the S-step, the new sampler also randomly samples a site from a sequence in
BTF to extract a segment
msof length
w. For 1 ≤
i ≤
N, if the pattern of the sampled segment
ms appears in
Mi, all the segments in
Mi whose pattern is the same as that of
ms will be removed if the AMAP score

of the motif set
Mi increases after their removal. Otherwise, these segments will be kept in
Mi.
Also in the S-step, if

is the highest among all

, 1 ≤
i ≤
N + 1, then all segments in the set
STF having the same pattern as that of the sampled segment
ms will be appended into the false motif set
MN+1 unless these segments are already there and the current iteration stops here. If

is the highest for some
i, 1 ≤
i ≤
N, the sampled segment
ms will be further checked in the M-step to see if we really want to append those segments in the set
STF having the same pattern as that of the sampled segment
ms into
Mi unless we have already processed those segments for
Mi at the beginning of this S-step as in above and the current iteration stops here.
In the M-step, all the segments in the set
STF having the same pattern as that of the sampled segment
ms are decided to append to the candidate motif set
Mi if and only if the AMAP score

of
Mi increases after these segaments are appended into
Mi.
Method of dependency graphs and their expanded Bayesian networks
Considering the binding mechanism of transcription factors to specific DNA sites (motifs), there must be distinctive features for the specific motif regions from other intergenic regions which represent the background DNA sequence. Hence, it is conceivable that we can use a statistical model to capture the feature of a specific DAN site (motif) or a generic DNA intergenic region (background). Since the size of a candidate motif set Mi is often small, a PSSM model is commonly used for Mi instead of any other more sophisticated statistical model. However, the size of the background motif set MBG is usually large enough to be equipped with a more sophisticated one.
As reported in [
21], a dependency graph model is used to fully capture the intrinsic interdependency between base positions in a motif or region. The establishment of dependency between two positions is based on a
χ2-test from known sample data. An edge is established between two nodes (a node represents a base position) in the graph if the two corresponding base positions of the motif or region are dependent. After all dependent edges have being established completely, a dependency graph for the motif or region is constructed. An example of a dependency graph with 7 nodes is shown in Figure .
As reported in [
21], although the dependency graph can fully capture the intrinsic interdependency between base positions in a motif or region, it is difficult, if not impossible, to perform statistical inference based on the dependency graph. To resolve the dilemma, the dependency graph is expanded to form a Bayesian network (which is a directed acyclic graph that facilitates statistical reasoning) by allowing a base position in the dependency graph to appear more than once in the Bayesian network as nominally distinct nodes. Figure shows an example of an expanded Bayesian network of the dependency graph in Figure . For the detailed procedure of constructing an expanded Bayesian network (EBN) from a dependency graph, please see [
21]. In this paper, we use EBNs to model the background motif set
MBG.
Continued with the same example of the motif prediction of the transcription factor CBF1, the PWM and the rank of the five candidate motif sets at the end of the iterative sampling stage are also shown in Figure , together with the final results at the end of the refinement stage.