For motif discovery our model is restricted to motifs that vary in length by one base at most. We target gapped PWMs that have a optional base near their centres. We did not allow more than one optional base as inference becomes increasingly difficult as the hypothesis space grows. In a similar spirit to the popular motif finder MDscan [22
], we combine a maximum likelihood approach with enumerative methods to initialise the model's parameters. Using the Baum-Welch algorithm [38
] we learn the parameters of a hidden HMM that models the background sequence and binding sites on both strands of the DNA. We use the Baum-Welch algorithm as it is the most popular technique for learning the parameters of HMMs and is guaranteed to converge to a local maximum. Viterbi training [38
] and Gibbs sampling [39
] are possible alternative inference techniques. Viterbi training is less popular than the Baum-Welch algorithm and is not guaranteed to converge. Gibbs sampling has been successful in several other TFBS search algorithms [40
] but is not normally used in conjunction with HMMs. The Baum-Welch algorithm is an expectation-maximisation (EM) algorithm. EM methods have been successful in this field [22
]. Hence we had no compelling reason to use Gibbs sampling over the Baum-Welch algorithm.
Our method comprises the following stages (see Figure ):
Figure 2 Search method overview. Overview of search method. The input sequences are converted into a suffix tree which is used to efficiently enumerate over-represented words. These words are tested as possible seeds for a HMM. For each seed we consider a number (more ...)
• Build a suffix tree representing the sequences.
• Find over-represented words in the sequences.
• Test over-represented words together with possible gap positions as candidate seeds for the HMM.
• Train HMM using the most promising seeds.
• Score, rank and filter learnt gapped PWMs.
Finding over-represented words
Unfortunately, in the context of our problem the Baum-Welch algorithm is extremely sensitive to initial conditions. Therefore, we devote some effort to finding several good candidate initial conditions or seeds for the HMM's emission parameters.
We use a suffix tree [48
] to enumerate all the L
-mers in the sequences allowing for reverse complements (L
is a user-specified parameter which defaults to 8). For each L
-mer, we count how many sequences it occurs in and the number of times it occurs across all the sequences. The L
-mers are sorted to determine which are over-represented. The primary sort key is the number of sequences the L
-mer occurs in and the secondary sort key is the total number of occurrences across all the sequences.
Seeds that are close in edit distance to each other are likely to converge to the same gapped PWM when the Baum-Welch algorithm is applied. Therefore, we filter the L-mers using their edit distance from higher ranked L-mers. Our edit distance allows for reverse complements and shifts in the L-mers. Any L-mers that are less than a user-specified edit distance from a previously evaluated L-mer are discarded. The remaining L-mers become our candidate seeds. It should be emphasised that finding over-represented words is a heuristic for seeding the HMM only and has no influence on the final scoring by the HMM.
In addition to the L
-mer we need to choose where to place the optional base in the PWM in order to seed the HMM. For each candidate L
-mer we examine each possible gap position in turn. We do not allow gap positions close to the end of the motif. The first gap is allowed after the base at position L
+ 1 and the last gap is positioned symmetrically at the end of the motif. Each (L
-mer, gap position) pair is scored as follows:
• We generate 2 standard PWMs from the L-mer: one represents binding sites which include the optional base at the given gap position, the other represents sites without the optional base. The PWM without the optional base is given an extra base at the end so both have the same length, L + 1. The user specifies a pseudo-count to smooth both PWMs' distributions and the gap position is given a uniform distribution as is the extra padding base.
• We calculate the log likelihood of each L + 1-mer in every sequence under a background model.
• We score each L + 1-mer with both PWMs, calculating the log likelihood for both strands.
• For each L + 1-mer we calculate the log likelihood ratio between the better PWM (in either orientation) and the background model.
• Each sequence is scored as the maximum of its L + 1-mers' log likelihood ratios. That is, we are looking for the single best binding site on either strand of each sequence explained either by the gapped or ungapped PWM.
• The overall score for the given (L-mer, gap position) pair is the sum of the scores for each sequence. For each sequence we take the maximum ratio over all positions, both strands and both PWMs. The score for the seed is the sum of these maxima over all sequences.
Our scheme is motivated by a desire not to use seeds that are easily explained by a background model and to find seeds that explain sites in as many sequences as possible.
In the above scheme the background is modelled by a HMM with 4 states and Markov order 3. That is, its emissions are conditioned on the previous three observations which gives each state 256 emission parameters. The parameters for transitions out of any given state are tied. The HMM is trained on the input sequences.
Initialisation and training of the HMM
A HMM is a Markov process with unobserved states. These states can be regarded as an underlying process that generates the data. We model background genomic sequence by one set of states and binding sites by another set. Binding sites on the positive strand are generated by a distinct set of states to those on the negative strand. The HMM is parameterised by the transition probabilities between the states and the token emission probabilities for each state. We use the transition probabilities to model the relative scarcity of binding sites. In our context, the output tokens of the HMM are the nucleotide bases of the sequences. Each base in our input sequences is associated with an unobserved state. An example state transition diagram is shown in Figure .
Figure 3 HMM state transitions. An example of a typical HMM state transition diagram. This HMM jointly models background sequence and binding sites from a gapped PWM of length 7. State 0 is the background state. The two arms leading out from state 0 generate binding (more ...)
For each of the highest scoring seeds (L-mer, gap position pairs) we create a HMM and train it. The number of seeds used for this purpose is a tunable parameter. Our experience showed the algorithm was robust to changes in this parameter as the best seeds were invariably amongst the highest scoring. In our tests we used a value of 60. The emission parameters of the states for the positive strand are initialised by the L-mer (with the addition of pseudo-counts). The emission parameters of the states for the negative strand are tied to the emission parameters of the states for the positive strand so that the motif is the same irrespective of the strand the binding site is on. The state transitions are initialised to reflect the gap position. We estimate the initial probability of leaving the background state by the number of occurrences of the initialising L-mer divided by the number of bases in the sequences. This estimate can be scaled by a user-specified parameter. It can be reduced to encourage sharper motifs with fewer binding sites or increased for more prevalent vaguer motifs. The transition probability to the gap base is initialised to 0.5. We train the HMM using the Baum-Welch algorithm. Without a prior on the transitions out of the background state, this invariably results in an extremely vague motif: that is, the model prefers a motif of high entropy with many binding sites over a motif of low entropy with a smaller but more plausible number of occurrences. We place a very strong prior on this transition that effectively fixes it and encourages the Baum-Welch algorithm to learn motifs of higher information content.
The Baum-Welch algorithm terminates when the increase in log likelihood is smaller than some threshold. We use a threshold of .0004 per sequence. If the motif becomes vague during training, we stop training and discard the model. We measure the vagueness by the entropy per base.
Scoring the gapped PWMs
Each seed we use to initialise the HMM results in one gapped PWM. However, we are most interested in PWMs that satisfy the following criteria:
• The PWM has high information content.
• The PWM found a binding site in a high proportion of the sequences in the data set.
• The PWM does not just model lower order features in the sequences.
For each of these properties we score each PWM between 0 and 1. The score for the information content, Sic, is the ratio of the PWM's information content to the maximum possible. Here we calculate the information content in a naïve position-independent sense as we cannot afford the full enumeration over all possible words as described elsewhere in the Methods section. We use an approximation where each position is treated independently but the information content of the optional base is weighted by the frequency with which it occurs. The score for the number of binding sites, Sbs, is simply the fraction of sequences for which the PWM finds at least one binding site. In order to discount PWMs that appear to model lower-order features in the sequences (for example GC rich regions), we calculate the entropy of the first-order distribution defined by the PWM. That is, we take consecutive bases in the PWM and look at their joint distribution. We take the average of these distributions over all consecutive pairs of bases and calculate its entropy. We are looking for PWMs where this first order entropy is high (for example a PWM that represents "GCGCGCGC" would have a very low entropy). Our score, Slo, to discount PWMs representing these lower order features is simply the ratio of the PWM's first order entropy to the maximum possible entropy.
In order to take account of the three criteria above, we score each PWM by the geometric mean of the scores. This mean is biased using weights to make the scales of the different scores comparable. Heuristically, we found suitable weights to be 1.5, 1 and 1 for Slo, Sic and Sbs respectively. For the data sets used in this paper, the top motifs from both methods were clearly the best. In the results presented only the top motif was used for discrimination and we only report one motif per data set in the results. This is certainly an ad hoc scoring scheme, however we found it important to integrate our prior beliefs about motifs into the scoring scheme. It was not easy to encode all these beliefs into a probabilistic model or likelihood function that we could fit or optimise. In particular, the beliefs about the lower-order features were difficult to incorporate in this way. Our ad hoc scoring scheme does capture our beliefs in a straightforward manner and was found to be effective. We found the results were robust to minor variations in the values of these parameters.
Gapped PWMs as distributions over words
We describe the details of the distribution a gapped PWM induces over words in order to make the example in Figure concrete. Suppose we have a gapped PWM of length K
(including the optional character). We treat the gapped PWM as a model of binding sites on both the positive and negative strand of DNA. In other words, we model it as a 50/50 mixture of itself and its reverse complement. Suppose that the optional character occurs in a proportion r
of the binding sites and that the base frequencies of the equivalent standard PWM with and without the optional character are given by
respectively. Note that, as in the example, the frequencies for the case where the optional character is omitted are augmented by an 'N' in the last position. Suppose furthermore that
is the complement of base b
is the k
th position in the reversed PWM, so that
- 1,... Then the distribution the gapped PWM induces over words is given by
where x = x1 ... xK is a word.
Probabilistic models for transcription factor binding sites such as PWMs and gapped PWMs define distributions over words of a certain length, K
. The information content (or information gain or Kullback-Leibler divergence, DKL
) of such a distribution, p
), relative to a background distribution over words, q
), is defined as
where X is the set of all such words. Iseq measures how different the PWM, p(x), is from the background, q(x). In information theory terminology, Iseq is the average message length required to transmit a binding site of the PWM using a code optimised for the background distribution.
In the position independent case when using a 0-order background model and ignoring reverse complements (see below), the sum decomposes into sums over the probabilities of bases at each position.
This leads to the well-known formula for the information content of a standard PWM to be
where pkb is the probability of seeing base b at the k'th position of the PWM and qb is the probability of base b in the background distribution. This decomposition ignores the fact that a PWM is almost always applied to the positive and negative strand of DNA. In light of this and continuing to view a PWM as a distribution over words, the PWM can be seen as a mixture model over two components. In one component it is applied in the positive direction. In the other component it is applied in the negative direction as a reverse complement. Put another way, which words would we score highly under the consensus sequence AACCTT? AACCTT itself of course but also its reverse complement, AAGGTT. As a PWM is almost always used in this mixture model sense, position independencies do not hold. When position independencies do not hold, the decomposition in the above sum does not hold either and in general the calculation of information content requires a sum over all words. This also applies to gapped PWMs and we show an example in Figure .
Figure 4 Information content of gapped motifs. Examples showing how position dependencies induced by gap characters can affect the information content of motifs. Compare the gapped PWM C with the standard PWM D. Here the introduction of a gap has decreased the (more ...)
Each method was evaluated by constructing an HMM from a single background class, (defined by single nucleotide frequencies), the motif from the method and a parameter for the transition probability from the background to the motif. For each sequence, s
, we calculated the expected number of bases, ns
, that have been generated by the motif under this model. For several thresholds, t
, we calculated the proportion, p
, of sequences (in the held out sequences from the original data set) which had ns
, and the corresponding proportion, q
, for a set of negative sequences. The Receiver-Operator-Characteristic (ROC) curve is the plot of p
. The perfect ROC is the one that goes through the point (0,1), and random guessing gives the diagonal line between points (0, 0) and (1, 1). In our analyses, we used three separate negative reference sets: a) shuffled versions of the original positive sequences, b) a set of sequences taken at random from the human genome, assembly NCBI36 and c) sequences from promoter regions of randomly chosen genes starting 1000 bases upstream of the transcription start site. In each case, the sequences of the negative data set were matched in number and length to those of the positive data set. Evaluation of the motif-finders used a five-fold cross validation. The ROCs shown in the text are the accumulation of the ROCs for each of the five folds. For each ROC we calculated the area-under-the-curve (AUC) and AUC50 [49
] statistics. The AUC statistic reflects the performance of the method overall and the AUC50 statistic reflects the performance of the method at high specificity. The AUC50 is the area under the ROC curve generated by discarding all but the 50 highest scoring negative examples. It is a measure of how good the method is at classifying sequences relative to the highest scoring negative examples. It is a useful metric when the user of a method can only afford to follow-up a few of the examples that they test. The details of the calculation are given in Additional file 1
The parameters used for MEME and GLAM2 and the details of the processing of the data sets are given in Additional file 1
. We should note that we used the same data sets to tune the ad hoc parameters of our method, MEME and GLAM2. This may mean that these parameters are slightly overfitted with respect to our data sets but we believe this effect is negligible. In general, we found all the methods robust to minor changes in the parameters.
Investigation of TRANSFAC binding sites
] release 2.0.10 was used to realign the sequences used with TRANSFAC, version 2008.3, [10
] to determine the PWMs. The gap extension penalty was reduced to 7 from the default of 15, the gap extension to 3 from 6.66 and the transitions weighting to 0 from 0.5. Minor manual adjustments were made to the results to reduce the number of locations with optional gaps. The 10 PWMs where TRANSFAC had introduced gaps in order to produce their published PWMs were I$DL_01, V$MYOGNF1_01, IRF-1, V$IRF2_01, V$BRN2_01, V$ARP1_01, P$EMBP1_Q2, V$RSRFC4_Q2, V$LUN1_01, V$DEAF1_02. In all, 510 PWMs were processed through ClustalW2. Of these there were 70 PWMs where ClustalW2 introduced gaps in order to obtain alignment of one or two base-pairs on the edge. These added no significant information and were ignored. There were 58 cases where ClustalW2 introduced a gap for one site (or all but one site) in the centre of the binding region. These cases could be significant but given the small sample size, these were also ignored. There were 26 examples similar to the above where there was more than one gap that was introduced, but still there was only a single instance of each type, so these were ignored as well. There were 159 examples where ClustalW2 introduced one or more gaps involving more than one site. The significance of these examples varies in a continuous spectrum from many probably being of no significance through to the examples given in the Results, which were the two best. Logos and information content were calculated for the core of the sequence where base types were available for more than 50% of the binding sites. Information content for PWMs resulting from the gapped alignments and the standard alignments were calculated as described above.