|Home | About | Journals | Submit | Contact Us | Français|
Motif discovery, the process of discovering a meaningful pattern of nucleotides or amino acids that is shared by two or more molecules, is an important part of the study of gene function. In this paper, we propose a hybrid motif discovery approach based upon a combination of Particle Swarm Optimization (PSO) and the Expectation-Maximization (EM) algorithm. In the proposed algorithm, we use PSO to generate a seed for the EM algorithm.
Many amino acid and nucleotide sequences with functional or structural similarities share short contiguous sequences on the order of 10-20 bases known as motifs. Motifs can have a wide variety of purposes including providing structural properties, ligand binding sites, or signaling sites. Depending on the purpose of the motif and the nature of the specificity required, these regions may be highly conserved (close to 100% identical), or may contain a more subtle signal. The detection of these common patterns in a set of similar biological sequences can give insight into the regulation mechanisms involved, whether it be functionally or structurally controlled.
Several motif discovery algorithms have been demonstrated. Expectation Maximization (EM) uses information theory to identify conserved patterns of nucleotides or amino acids that may exist in several unaligned genetic sequences [1;2]. This particular algorithm is likely to identify sequences with a local maximum score, and must be run many times to search for improved scores.
The Gibbs Sampler extends this concept by introducing a stochastic process to exit the local maximum and continue search for a better solution .
Meta-MEME uses hidden Markov models (HMMs) of protein families to predict the motif patterns . These HMMs must be trained with a known set of conserved regions. However, once trained, they can be an effective tool for a searching a large database of genetic sequences.
Particle Swarm Optimization (PSO) is a socially inspired algorithm that has been applied to search in both continuous and discrete search spaces with multiple dimensions [5-8]. The concept is to have various particles “fly” through a search space in search of solutions based upon a given objective function.
In this paper, we use PSO to search for a high value motif and then use that as a seed to begin the EM algorithm to further improve the motif locations.
The problem space consists of N genetic sequences. Sequence si has a length of ki nucleotides. The object is to find a motif of length m, in each sequence, si, with the highest information content based upon EM scoring methods.
To apply PSO, we define a particle, P = [pi], as a vector containing the location of first character of the motif in sequence si. Further, we maintain a velocity vector V = [vi], where vi is the current velocity of the particle within sequence si. Our current implementation requires the user to specify the length of the motif. (We are currently investigating methods to discover the optimal motif length.)
From this point forward, we use a standard implementation of PSO in which a number of particles are instantiated with random pi and vi values. The social factor, individual factor, and maximum velocity are specified, and the particles are allowed to “fly.” After each step, the EM score of each particle is evaluated and the pBest and gBest values are updated. The particles are allowed to fly until they fail to improve the gBest score for x iterations.
Once the termination condition is met, the P matrix for the gBest solution is then used as a seed for the EM algorithm. EM is an iterative algorithm that enumerates all possible motif locations in an effort to find the best fit. It then repeats until no further improvement is found.
Even using PSO to seed the EM algorithm, we observe frequent cases of termination on a local solution. With the current implementation, we reinitialize the PSO and re-run the algorithm until no improvement is observed.
To test our methods, we used a data set previously considered by Stormo and Hartzell . This data presents 18 sequences with each sequence being 105 characters long, and each sequence contains at least one CRP protein binding site. The CRP binding site is 22 characters in length.
The same dataset was analyzed using our PSO/EM algorithm and compared to the results of the Gibbs Sampler and MEME. The results are tabulated in the Appendix.
The PSO/EM algorithm correctly identified the region of the motif in 13 of 18 sequences. This compares favorably with the Gibbs Sampler (12 of 18) and MEME (14 of 18).
Like the other two methods, the PSO/EM tends to have a consistent offset from the actual known motif location. PSO/EM consistently predicts a motif starting location 3 characters to the right of actual compared to Gibbs (2 characters left) and MEME (1 character left). This is believed to be a function of the information content of the specific motif versus the background and the scoring method applied by the algorithms.
This paper introduces the use of PSO into a new problem set, namely motif discovery. Even though motif discovery is primarily a problem in the domain of bioinformatics, it has potential application in pattern matching problems in other domains.
As far as we can ascertain, the technique presented here is the first hybrid utilization of PSO and EM in any problem domain.
This algorithm has only been tested on DNA sequences. The investigator must supply the expected length of the motif. We have implemented a scoring function for sequences of amino acids, but not yet tested it.
ER acknowledges support from the National Center for Research Resources (NCRR) grant 2P20RR016481-04 (Nigel G. F. Cooper, PI).
|Ref||LOCUS||Known Motif Locations||PSO/EM Location (Error)||Gibbs Location (Error)||MEME Location (Error)|
This data set was obtained from GenBank, Release 55. (The tdc gene was not in that release; it was obtained from .
LOCUS is presented in alphabetical order.
Known motif locations were obtained from .
C.T. Hardin, Department of Computer Science and Computer Engineering, University of Louisville, Louisville, KY 40292, Email: ude.ellivsiuol@10drahtc.
Eric C. Rouchka, Department of Computer Engineering and Computer Science, University of Louisville, Louisville, KY 40292, Email: ude.ellivsiuol@10cuorce.