Sequence motifs are important tools in molecular biology. Sequence motifs can describe and identify features in DNA, RNA and protein sequences such as transcription factor binding sites, splice junctions and protein-protein interaction sites. Numerous algorithms have been developed for discovering motifs, as well as algorithms for scanning databases for matches to a given motif or motifs. Some are specialized for discovery of DNA motifs. These include A-GLAM
, AlignACE 
, BioProspector 
, MDscan 
, RSA Tools 
, Weeder 
and YMF 
. Others, such as MEME 
and Gibbs 
can discover motifs in either protein or DNA sequences. The importance of motifs is further underscored by the numerous databases that have been compiled of known motifs including DNA regulatory motifs in TRANSFAC, JASPAR, SCPD, DBTBS, RegulonDB 
, and protein motifs in ELM, PROSITE, BLOCKS and PRINTS 
It is worth noting that biological motifs fall into at least three somewhat distinct classes. The first comprises short motifs often found at functional sites of biopolymers, such as cleavage sites, binding sites and attachment sites. These short motifs probably arise through convergent evolution as often as not. The second comprises longer protein motifs associated with globular structural domains. These often, if not always, arise through divergent evolution. Finally, recurring motifs can arise from evolutionarily recent duplications, such as DNA transposons. It is not clear that these categories are best tackled by a single motif discovery method. glam2 is primarily aimed at short motifs for functional sites, although it performs respectably for the other categories.
In an ideal world, simple motifs would directly encode biological functions, as is the case with the triplet genetic code for amino acids for example. In reality, protein phosphorylation sites and the like may be encoded in a more complex and dispersed fashion, and in the worst case we would have to understand the full biophysics of the molecule in order to predict its function. Nevertheless, there is often at least a correlation between motifs and functional sites, which is useful. This is illustrated well by the ELM server, which uses protein motifs as a first step in predicting functional sites, and filters the predictions by criteria such as cell compartment and globular domain clash 
. Thus, refining known motifs and discovering new motifs will be useful for identifying functional sites.
Most motif discovery algorithms are limited to gapless motifs. The main reason for this is that the motif discovery process becomes more difficult when gaps are allowed due to an explosion in the number of possible variations. Gapped motifs are ubiquitous in biology, however. Many of the motifs described in protein motif databases such as ELM and PROSITE contain variable length gaps. Transcription factor complexes can have DNA binding motifs with variable width spacers, and some DNA motif discovery algorithms are specialized to finding bipartite motifs – two motifs separated by a single, variable-length spacer.
There are some existing methods for discovering gapped motifs, but they do not appear to be widely used. pratt
discovers gapped motifs, in the form of regular expressions, in protein sequences 
. Regular expressions may have trouble capturing subtle motifs, because they specify exactly which residues and spacers are allowed at each position, and do not allow a better match in one part to compensate for a worse match in another part. Since they make such detailed specifications, it may also be hard to discover accurate regular expressions from small numbers of examples. In any case, glam
2 re-discovers PROSITE motifs more sensitively than pratt
So-called profile hidden Markov models (HMMs) have been used to represent protein structural motifs with gaps, notably in the sam
packages, and HMM training algorithms can be used to discover such motifs 
. It is telling that, while sam
are extremely successful and widely used, they are mainly used for motif scanning, and rarely for ab initio
motif discovery. Recent versions of hmmer
do not even retain the training algorithm. In fact, glam
2 can be regarded as an HMM training method similar to these. A key difference is that, while sam
optimise the HMM parameters (the transition and emission probabilities), glam
2 “integrates out” these parameters (Materials and Methods
, Text S1
), and directly optimises the motif alignment. One consequence is that glam
2 can use a better-characterised heuristic to search for the globally optimum solution: simulated annealing, rather than expectation-maximization with noise injection. (Expectation-maximization alone is well-characterized, but it only finds local optima.) glam
2 actually uses the same stochastic traceback step as hmmer
, but since hmmer
optimises the parameters rather than the alignment, this is not true simulated annealing, as pointed out by its author 
. The yebis
program also discovers gapped motifs, in DNA only, using an ad hoc
HMM training method 
A dynasty of Gibbs sampling algorithms has been developed, which allow for gapped motifs with steadily increasing generality. The original Gibbs sampler only found ungapped motifs 
. The second generation method allowed for discontiguous motifs, where poorly conserved positions within a motif are not considered part of the motif (“turned-off ”) 
. This allows a limited form of insertion, which must be the same size in all motif instances. A successor program named probe
is aimed at protein structural motifs, and it models a motif as multiple separated blocks, where each block may be discontiguous 
. Most recently, Neuwald and Liu extended probe
to allow general insertions and deletions within blocks, using an HMM very similar to the profile HMMs of sam
. Since glam
2 is also an extension of Gibbs sampling to allow general indels, it is somewhat similar to this method, but there are the following important differences:
- Neuwald and Liu use a more complex motif model, designed for protein structural motifs, and much more sophisticated alignment-editing operations and annealing schemes. However, some of their alignment-editing operations are awkward and violate the detailed balance condition of simulated annealing.
- The central step of re-aligning one sequence is carried out differently. glam2 uses the stochastic traceback algorithm to directly sample one alignment according to its score. Neuwald and Liu, in contrast, sample HMM transition and emission probabilities, then obtain the optimal alignment, and finally accept or reject this alignment in the standard Monte Carlo fashion.
- Neuwald and Liu use a simple Dirichlet prior for amino acid frequencies, which lacks information on their tendencies to align with one another, whereas glam2 uses Dirichlet mixtures, which can provide such information. Dirichlet mixtures will be more powerful for small numbers of sequences, but the simpler approach may be sufficient for large numbers of sequences. (glam2 can use either approach.)
- glam2 uses position-specific insertion and deletion probabilities, whereas Neuwald and Liu use universal insertion and deletion probabilities (within blocks). This is important because real motifs tend to concentrate insertions and deletions in a few positions.
This publication aims to make gapped motif discovery as powerful and ubiquitous as gapless motif discovery. We describe the glam
2 algorithm for discovering gapped motifs, and a companion scanning algorithm, glam
. In the following, we first give an overview of glam
2 and glam
, followed by more details on the methods. Full technical details are in Text S1
. We then assess their performance at three different kinds of task: re-discovering PROSITE motifs, refining and then scanning ELM motifs, and aligning BAliBASE sequences. Finally, we give two examples of using these methods to discover kinase substrate motifs and to identify DNA target sites of the LIM-only complex. The results show that glam
2 and glam
are very capable of identifying gapped motifs, especially short linear motifs.