The structural identification of unknown biochemical compounds in complex biofluids continues to be a major challenge in metabolomics research. Using LC/MS there are currently two major options for solving this problem: searching small biochemical databases, which often do not contain the unknown of interest, or searching large chemical databases which include large numbers of non-biochemical compounds. Searching larger chemical databases (larger chemical space) increases the odds of identifying an unknown biochemical compound, but only if non-biochemical structures can be eliminated from consideration. In this paper we present BioSM; a cheminformatics tool that uses known endogenous mammalian biochemical compounds (as scaffolds) and graph matching methods to identify endogenous mammalian biochemical structures in chemical structure space. The results of a comprehensive set of empirical experiments suggest that BioSM identifies endogenous mammalian biochemical structures with high accuracy. In a leave-one-out cross validation experiment, BioSM correctly predicted 95% of 1,388 Kyoto Encyclopedia of Genes and Genomes (KEGG) compounds as endogenous mammalian biochemicals using 1,565 scaffolds. Analysis of two additional biological datasets containing 2,330 human metabolites (HMDB) and 2,416 plant secondary metabolites (KEGG) resulted in biochemical annotations of 89% and 72% of the compounds respectively. When a dataset of 3,895 drugs (DrugBank and USAN) was tested, 48% of these structures were predicted to be biochemical. However, when a set of synthetic chemical compounds (Chembridge and Chemsynthesis databases) were examined, only 29% of the 458,207 structures were predicted to be biochemical. Moreover, BioSM predicted that 34% of 883,199 randomly selected compounds from PubChem were biochemical. We then expanded the scaffold list to 3,927 biochemical compounds and reevaluated the above datasets to determine whether scaffold number influenced model performance. Although there were significant improvements in model sensitivity and specificity using the larger scaffold list, the dataset comparison results were very similar. These results suggest that additional biochemical scaffolds will not further improve our representation of biochemical structure space and that the model is reasonably robust. BioSM provides a qualitative (yes/no) and quantitative (ranking) method for endogenous mammalian biochemical annotation of chemical space, and thus will be useful in the identification of unknown biochemical structures in metabolomics. BioSM is freely available at http://metabolomics.pharm.uconn.edu.
Metabolomics; cheminformatics; structural similarity; graph matching; metabolite identification
Motif searching is an important step in the detection of rare events occurring in a set of DNA or protein sequences. One formulation of the problem is known as (l,d)-motif search or Planted Motif Search (PMS). In PMS we are given two integers l and d and n biological sequences. We want to find all sequences of length l that appear in each of the input sequences with at most d mismatches. The PMS problem is NP-complete. PMS algorithms are typically evaluated on certain instances considered challenging. Despite ample research in the area, a considerable performance gap exists because many state of the art algorithms have large runtimes even for moderately challenging instances.
This paper presents a fast exact parallel PMS algorithm called PMS8. PMS8 is the first algorithm to solve the challenging (l,d) instances (25,10) and (26,11). PMS8 is also efficient on instances with larger l and d such as (50,21). We include a comparison of PMS8 with several state of the art algorithms on multiple problem instances. This paper also presents necessary and sufficient conditions for 3 l-mers to have a common d-neighbor. The program is freely available at http://engr.uconn.edu/~man09004/PMS8/.
We present PMS8, an efficient exact algorithm for Planted Motif Search. PMS8 introduces novel ideas for generating common neighborhoods. We have also implemented a parallel version for this algorithm. PMS8 can solve instances not solved by any previous algorithms.
Identification of DNA/Protein motifs is a crucial problem for biologists. Computational techniques could be of great help in this identification. In this direction, many computational models for motifs have been proposed in the literature.
One such important model is the motif model. In this paper we describe a motif search web tool that predominantly employs this motif model. This web tool exploits the state-of-the art algorithms for solving the motif search problem.
The online tool has been helping scientists identify many unknown motifs. Many of our predictions have been successfully verified as well. We hope that this paper will expose this crucial tool to many more scientists.
Availability and requirements
Project name: PMS - Panoptic Motif Search Tool. Project home page: http://pms.engr.uconn.edu or http://motifsearch.com. Licence: PMS tools will be readily available to any scientist wishing to use it for non-commercial purposes, without restrictions. The online tool is freely available without login.
Efficient tile sets for self assembling rectilinear shapes is of critical importance in algorithmic self assembly. A lower bound on the tile complexity of any deterministic self assembly system for an n × n square is
Ω(log(n)log(log(n))) (inferred from the Kolmogrov complexity). Deterministic self assembly systems with an optimal tile complexity have been designed for squares and related shapes in the past. However designing
Θ(log(n)log(log(n))) unique tiles specific to a shape is still an intensive task in the laboratory. On the other hand copies of a tile can be made rapidly using PCR (polymerase chain reaction) experiments. This led to the study of self assembly on tile concentration programming models. We present two major results in this paper on the concentration programming model. First we show how to self assemble rectangles with a fixed aspect ratio (α:β), with high probability, using Θ(α + β) tiles. This result is much stronger than the existing results by Kao et al. (Randomized self-assembly for approximate shapes, LNCS, vol 5125. Springer, Heidelberg, 2008) and Doty (Randomized self-assembly for exact shapes. In: proceedings of the 50th annual IEEE symposium on foundations of computer science (FOCS), IEEE, Atlanta. pp 85–94, 2009)—which can only self assembly squares and rely on tiles which perform binary arithmetic. On the other hand, our result is based on a technique called staircase sampling. This technique eliminates the need for sub-tiles which perform binary arithmetic, reduces the constant in the asymptotic bound, and eliminates the need for approximate frames (Kao et al. Randomized self-assembly for approximate shapes, LNCS, vol 5125. Springer, Heidelberg, 2008). Our second result applies staircase sampling on the equimolar concentration programming model (The tile complexity of linear assemblies. In: proceedings of the 36th international colloquium automata, languages and programming: Part I on ICALP ’09, Springer-Verlag, pp 235–253, 2009), to self assemble rectangles (of fixed aspect ratio) with high probability. The tile complexity of our algorithm is Θ(log(n)) and is optimal on the probabilistic tile assembly model (PTAM)—n being an upper bound on the dimensions of a rectangle.
Tile assembly model; Concentration programming; Self assembly algorithms; Probabilistic self assembly model (PTAM); Randomized algorithms
Background and objective
Integrating data from multiple sources is a crucial and challenging problem. Even though there exist numerous algorithms for record linkage or deduplication, they suffer from either large time needs or restrictions on the number of datasets that they can integrate. In this paper we report efficient sequential and parallel algorithms for record linkage which handle any number of datasets and outperform previous algorithms.
Our algorithms employ hierarchical clustering algorithms as the basis. A key idea that we use is radix sorting on certain attributes to eliminate identical records before any further processing. Another novel idea is to form a graph that links similar records and find the connected components.
Our sequential and parallel algorithms have been tested on a real dataset of 1 083 878 records and synthetic datasets ranging in size from 50 000 to 9 000 000 records. Our sequential algorithm runs at least two times faster, for any dataset, than the previous best-known algorithm, the two-phase algorithm using faster computation of the edit distance (TPA (FCED)). The speedups obtained by our parallel algorithm are almost linear. For example, we get a speedup of 7.5 with 8 cores (residing in a single node), 14.1 with 16 cores (residing in two nodes), and 26.4 with 32 cores (residing in four nodes).
We have compared the performance of our sequential algorithm with TPA (FCED) and found that our algorithm outperforms the previous one. The accuracy is the same as that of this previous best-known algorithm.
Data Integration; Healthcare Records; Algorithms; Parallel Algorithms; Speedups
The materials discovery process can be significantly expedited and simplified if we can learn effectively from available knowledge and data. In the present contribution, we show that efficient and accurate prediction of a diverse set of properties of material systems is possible by employing machine (or statistical) learning methods trained on quantum mechanical computations in combination with the notions of chemical similarity. Using a family of one-dimensional chain systems, we present a general formalism that allows us to discover decision rules that establish a mapping between easily accessible attributes of a system and its properties. It is shown that fingerprints based on either chemo-structural (compositional and configurational information) or the electronic charge density distribution can be used to make ultra-fast, yet accurate, property predictions. Harnessing such learning paradigms extends recent efforts to systematically explore and mine vast chemical spaces, and can significantly accelerate the discovery of new application-specific materials.
Traditional Associative Classification (AC) algorithms typically search for all possible association rules to find a representative subset of those rules. Since the search space of such rules may grow exponentially as the support threshold decreases, the rules discovery process can be computationally expensive. One effective way to tackle this problem is to directly find a set of high-stakes association rules that potentially builds a highly accurate classifier. This paper introduces AC-CS, an AC algorithm that integrates the clonal selection of the immune system along with deterministic data sampling. Upon picking a representative sample of the original data, it proceeds in an evolutionary fashion to populate only rules that are likely to yield good classification accuracy. Empirical results on several real datasets show that the approach generates dramatically less rules than traditional AC algorithms. In addition, the proposed approach is significantly more efficient than traditional AC algorithms while achieving a competitive accuracy.
We propose a new algorithm, PMS6, for the (l, d)-motif discovery problem in which we are to find all strings of length l that appear in every string of a given set of strings with at most d mismatches. The run time ratio PMS5/PMS6, where PMS5 is the fastest previously known algorithm for motif discovery in large instances, ranges from a high of 2.20 for the (21,8) challenge instances to a low of 1.69 for the (17,6) challenge instances. Both PMS5 and PMS6 require some amount of preprocessing. The preprocessing time for PMS6 is 34 times faster than that for PMS5 for (23,9) instances. When preprocessing time is factored in, the run time ratio PMS5/PMS6 is as high as 2.75 for (13,4) instances and as low as 1.95 for (17,6) instances.
Planted motif search; string algorithms
Motifs are significant patterns in DNA, RNA, and protein sequences, which play an important role in biological processes and functions, like identification of open reading frames, RNA transcription, protein binding, etc. Several versions of the motif search problem have been studied in the literature. One such version is called the Planted Motif Search (PMS)or (l, d)-motif Search. PMS is known to be NP complete. The time complexities of most of the planted motif search algorithms depend exponentially on the alphabet size. Recently a new version of the motif search problem has been introduced by Kuksa and Pavlovic. We call this version as the Motif Stems Search (MSS) problem. A motif stem is an l-mer (for some relevant value of l)with some wildcard characters and hence corresponds to a set of l-mers (without wildcards), some of which are (l, d)-motifs. Kuksa and Pavlovic have presented an efficient algorithm to find motif stems for inputs from large alphabets. Ideally, the number of stems output should be as small as possible since the stems form a superset of the motifs.
In this paper we propose an efficient algorithm for MSS and evaluate it on both synthetic and real data. This evaluation reveals that our algorithm is much faster than Kuksa and Pavlovic’s algorithm.
Our MSS algorithm outperforms the algorithm of Kuksa and Pavlovic in terms of the run time as well as the number of stems output. Specifically, the stems output by our algorithm form a proper (and much smaller)subset of the stems output by Kuksa and Pavlovic’s algorithm.
Single Nucleotide Polymorphisms (SNPs) are sequence variations found in individuals at some specific points in the genomic sequence. As SNPs are highly conserved throughout evolution and within a population, the map of SNPs serves as an excellent genotypic marker. Conventional SNPs analysis mechanisms suffer from large run times, inefficient memory usage, and frequent overestimation. In this paper, we propose efficient, scalable, and reliable algorithms to select a small subset of SNPs from a large set of SNPs which can together be employed to perform phenotypic classification.
Our algorithms exploit the techniques of gene selection and random projections to identify a meaningful subset of SNPs. To the best of our knowledge, these techniques have not been employed before in the context of genotype‐phenotype correlations. Random projections are used to project the input data into a lower dimensional space (closely preserving distances). Gene selection is then applied on the projected data to identify a subset of the most relevant SNPs.
We have compared the performance of our algorithms with one of the currently known best algorithms called Multifactor Dimensionality Reduction (MDR), and Principal Component Analysis (PCA) technique. Experimental results demonstrate that our algorithms are superior in terms of accuracy as well as run time.
In our proposed techniques, random projection is used to map data from a high dimensional space to a lower dimensional space, and thus overcomes the curse of dimensionality problem. From this space of reduced dimension, we select the best subset of attributes. It is a unique mechanism in the domain of SNPs analysis, and to the best of our knowledge it is not employed before. As revealed by our experimental results, our proposed techniques offer the potential of high accuracies while keeping the run times low.
Feature Selection Algorithm (FSA); Gene Selection Algorithm (GSA); Multifactor Dimensionality Reduction (MDR); Random Projection (RP); Single‐Nucleotide Polymorphism (SNP); Support Vector Machine (SVM); Principal Component Analysis (PCA)
Minimotifs are short contiguous segments of proteins that have a known biological function. The hundreds of thousands of minimotifs discovered thus far are an important part of the theoretical understanding of the specificity of protein-protein interactions, posttranslational modifications, and signal transduction that occur in cells. However, a longstanding problem is that the different abstractions of the sequence definitions do not accurately capture the specificity, despite decades of effort by many labs. We present evidence that structure is an essential component of minimotif specificity, yet is not used in minimotif definitions. Our analysis of several known minimotifs as case studies, analysis of occurrences of minimotifs in structured and disordered regions of proteins, and review of the literature support a new model for minimotif definitions that includes sequence, structure, and function.
In this paper1 we present efficient algorithms for sorting on the Parallel Disks Model (PDM). Numerous asymptotically optimal algorithms have been proposed in the literature. However many of these merge based algorithms have large underlying constants in the time bounds, because they suffer from the lack of read parallelism on PDM. The irregular consumption of the runs during the merge affects the read parallelism and contributes to the increased sorting time. In this paper we first introduce a novel idea called the dirty sequence accumulation that improves the read parallelism. Secondly, we show analytically that this idea can reduce the number of parallel I/O’s required to sort the input close to the lower bound of Ω(logMB(NM)). We experimentally verify our dirty sequence idea with the standard R-Way merge and show that our idea can reduce the number of parallel I/Os to sort on PDM significantly.
The low complexity of minimotif patterns results in a high false-positive prediction rate, hampering protein function prediction. A multi-filter algorithm, trained and tested on a linear regression model, support vector machine model, and neural network model, using a large dataset of verified minimotifs, vastly improves minimotif prediction accuracy while generating few false positives. An optimal threshold for the best accuracy reaches an overall accuracy above 90%, while a stringent threshold for the best specificity generates less than 1% false positives or even no false positives and still produces more than 90% true positives for the linear regression and neural network models. The minimotif multi-filter with its excellent accuracy represents the state-of-the-art in minimotif prediction and is expected to be very useful to biologists investigating protein function and how missense mutations cause disease.
Detection of rare events happening in a set of DNA/protein sequences could lead to new biological discoveries. One kind of such rare events is the presence of patterns called motifs in DNA/protein sequences. Finding motifs is a challenging problem since the general version of motif search has been proven to be intractable. Motifs discovery is an important problem in biology. For example, it is useful in the detection of transcription factor binding sites and transcriptional regulatory elements that are very crucial in understanding gene function, human disease, drug design, etc. Many versions of the motif search problem have been proposed in the literature. One such is the -motif search (or Planted Motif Search (PMS)). A generalized version of the PMS problem, namely, Quorum Planted Motif Search (qPMS), is shown to accurately model motifs in real data. However, solving the qPMS problem is an extremely difficult task because a special case of it, the PMS Problem, is already NP-hard, which means that any algorithm solving it can be expected to take exponential time in the worse case scenario. In this paper, we propose a novel algorithm named qPMS7 that tackles the qPMS problem on real data as well as challenging instances. Experimental results show that our Algorithm qPMS7 is on an average 5 times faster than the state-of-art algorithm. The executable program of Algorithm qPMS7 is freely available on the web at http://pms.engr.uconn.edu/downloads/qPMS7.zip. Our online motif discovery tools that use Algorithm qPMS7 are freely available at http://pms.engr.uconn.edu or http://motifsearch.com.
Motivation: Exact-match overlap graphs have been broadly used in the context of DNA assembly and the shortest super string problem where the number of strings n ranges from thousands to billions. The length ℓ of the strings is from 25 to 1000, depending on the DNA sequencing technologies. However, many DNA assemblers using overlap graphs suffer from the need for too much time and space in constructing the graphs. It is nearly impossible for these DNA assemblers to handle the huge amount of data produced by the next-generation sequencing technologies where the number n of strings could be several billions. If the overlap graph is explicitly stored, it would require Ω(n2) memory, which could be prohibitive in practice when n is greater than a hundred million. In this article, we propose a novel data structure using which the overlap graph can be compactly stored. This data structure requires only linear time to construct and and linear memory to store.
Results: For a given set of input strings (also called reads), we can informally define an exact-match overlap graph as follows. Each read is represented as a node in the graph and there is an edge between two nodes if the corresponding reads overlap sufficiently. A formal description follows. The maximal exact-match overlap of two strings x and y, denoted by ovmax(x, y), is the longest string which is a suffix of x and a prefix of y. The exact-match overlap graph of n given strings of length ℓ is an edge-weighted graph in which each vertex is associated with a string and there is an edge (x, y) of weight ω=ℓ−|ovmax(x, y)| if and only if ω≤λ, where |ovmax(x, y)| is the length of ovmax(x, y) and λ is a given threshold. In this article, we show that the exact-match overlap graphs can be represented by a compact data structure that can be stored using at most (2λ−1)(2⌈logn⌉+⌈logλ⌉)n bits with a guarantee that the basic operation of accessing an edge takes O(log λ) time. We also propose two algorithms for constructing the data structure for the exact-match overlap graph. The first algorithm runs in O(λℓnlogn) worse-case time and requires O(λ) extra memory. The second one runs in O(λℓn) time and requires O(n) extra memory. Our experimental results on a huge amount of simulated data from sequence assembly show that the data structure can be constructed efficiently in time and memory.
Availability: Our DNA sequence assembler that incorporates the data structure is freely available on the web at http://www.engr.uconn.edu/~htd06001/assembler/leap.zip
Contact: email@example.com; firstname.lastname@example.org
Recent large scale deployments of health information technology have created opportunities for the integration of patient medical records with disparate public health, human service, and educational databases to provide comprehensive information related to health and development. Data integration techniques, which identify records belonging to the same individual that reside in multiple data sets, are essential to these efforts. Several algorithms have been proposed in the literatures that are adept in integrating records from two different datasets. Our algorithms are aimed at integrating multiple (in particular more than two) datasets efficiently.
Hierarchical clustering based solutions are used to integrate multiple (in particular more than two) datasets. Edit distance is used as the basic distance calculation, while distance calculation of common input errors is also studied. Several techniques have been applied to improve the algorithms in terms of both time and space: 1) Partial Construction of the Dendrogram (PCD) that ignores the level above the threshold; 2) Ignoring the Dendrogram Structure (IDS); 3) Faster Computation of the Edit Distance (FCED) that predicts the distance with the threshold by upper bounds on edit distance; and 4) A pre-processing blocking phase that limits dynamic computation within each block.
We have experimentally validated our algorithms on large simulated as well as real data. Accuracy and completeness are defined stringently to show the performance of our algorithms. In addition, we employ a four-category analysis. Comparison with FEBRL shows the robustness of our approach.
In the experiments we conducted, the accuracy we observed exceeded 90% for the simulated data in most cases. 97.7% and 98.1% accuracy were achieved for the constant and proportional threshold, respectively, in a real dataset of 1,083,878 records.
Minimotifs are short contiguous peptide sequences in proteins that have known functions. At its simplest level, the minimotif sequence is present in a source protein and has an activity relationship with a target, most of which are proteins. While many scientists routinely investigate new minimotif functions in proteins, the major web-based discovery tools have a high rate of false-positive prediction. Any new approach that reduces false-positives will be of great help to biologists.
Methods and Findings
We have built three filters that use genetic interactions to reduce false-positive minimotif predictions. The basic filter identifies those minimotifs where the source/target protein pairs have a known genetic interaction. The HomoloGene genetic interaction filter extends these predictions to predicted genetic interactions of orthologous proteins and the node-based filter identifies those minimotifs where proteins that have a genetic interaction with the source or target have a genetic interaction. Each filter was evaluated with a test data set containing thousands of true and false-positives. Based on sensitivity and selectivity performance metrics, the basic filter had the best discrimination for true positives, whereas the node-based filter had the highest sensitivity. We have implemented these genetic interaction filters on the Minimotif Miner 2.3 website. The genetic interaction filter is particularly useful for improving predictions of posttranslational modifications such as phosphorylation and proteolytic cleavage sites.
Genetic interaction data sets can be used to reduce false-positive minimotif predictions. Minimotif prediction in known genetic interactions can help to refine the mechanisms behind the functional connection between genes revealed by genetic experimentation and screens.
Protein-protein interactions are important to understanding cell functions; however our theoretical understanding is limited. There is a general discontinuity between the well-accepted physical and chemical forces that drive protein-protein interactions and the large collections of identified protein-protein interactions in various databases. Minimotifs are short functional peptide sequences that provide a basis to bridge this gap in knowledge. However, there is no systematic way to study minimotifs in the context of protein-protein interactions or vice versa. Here we have engineered a set of algorithms that can be used to identify minimotifs in known protein-protein interactions and implemented this for use by scientists in Minimotif Miner. By globally testing these algorithms on verified data and on 100 individual proteins as test cases, we demonstrate the utility of these new computation tools. This tool also can be used to reduce false positive predictions in the discovery of novel minimotifs. The statistical significance of these algorithms is demonstrated by an ROC analysis (p = 0.001).
Minimotif Miner; HomoloGene; BLAST; Grb2; SLiM; short linear motifs
Minimotif Miner (MnM available at http://minimotifminer.org or http://mnm.engr.uconn.edu) is an online database for identifying new minimotifs in protein queries. Minimotifs are short contiguous peptide sequences that have a known function in at least one protein. Here we report the third release of the MnM database which has now grown 60-fold to approximately 300 000 minimotifs. Since short minimotifs are by their nature not very complex we also summarize a new set of false-positive filters and linear regression scoring that vastly enhance minimotif prediction accuracy on a test data set. This online database can be used to predict new functions in proteins and causes of disease.
The photochemical and thermal stability of the detergent solubilized blue- and green-absorbing proteorhodpsins, BPR and GPR respectively, are investigated to determine viability of these proteins for photonic device applications. Photochemical stability is studied by using pulsed laser excitation and differential uv-vis spectroscopy to assign the photocyclicity. GPR, with a cyclicity of 7×104 photocycles protein−1, is 4–5 times more stable than BPR (9×103 photocycles protein−1), but is less stable than native bacteriorhodopsin (9×105 photocycles protein−1) or the 4-keto-bacteriorhodopsin analog (1×105 photocycles protein−1). The thermal stabilities are assigned by using differential scanning calorimetry and thermal bleaching experiments. Both proteorhodopsins display excellent thermal stability, with melting temperatures above 85°C, and remain photochemically stable up to 75°C. The biological relevance of our results is also discussed. The lower cyclicity of BPR is found to be adequate for the long-term biological function of the host organism at ocean depths of 50 m or more.
opto-electronic material; cyclicity; differential scanning calorimetry; proteorhodopsin; biophotonic
Motifs are patterns found in biological sequences that are vital for understanding gene function, human disease, drug design, etc. They are helpful in finding transcriptional regulatory elements, transcription factor binding sites, and so on. As a result, the problem of identifying motifs is very crucial in biology.
Many facets of the motif search problem have been identified in the literature. One of them is (ℓ, d)-motif search (or Planted Motif Search (PMS)). The PMS problem has been well investigated and shown to be NP-hard. Any algorithm for PMS that always finds all the (ℓ, d)-motifs on a given input set is called an exact algorithm. In this paper we focus on exact algorithms only. All the known exact algorithms for PMS take exponential time in some of the underlying parameters in the worst case scenario. But it does not mean that we cannot design exact algorithms for solving practical instances within a reasonable amount of time. In this paper, we propose a fast algorithm that can solve the well-known challenging instances of PMS: (21, 8) and (23, 9). No prior exact algorithm could solve these instances. In particular, our proposed algorithm takes about 10 hours on the challenging instance (21, 8) and about 54 hours on the challenging instance (23, 9). The algorithm has been run on a single 2.4GHz PC with 3GB RAM. The implementation of PMS5 is freely available on the web at http://www.pms.engr.uconn.edu/downloads/PMS5.zip.
We present an efficient algorithm PMS5 that uses some novel ideas and combines them with well-known algorithm PMS1 and PMSPrune. PMS5 can tackle the large challenging instances (21, 8) and (23, 9). Therefore, we hope that PMS5 will help biologists discover longer motifs in the futures.
The discovery of patterns in DNA, RNA, and protein sequences has led to the solution of many vital biological problems. For instance, the identification of patterns in nucleic acid sequences has resulted in the determination of open reading frames, identification of promoter elements of genes, identification of intron/exon splicing sites, identification of SH RNAs, location of RNA degradation signals, identification of alternative splicing sites, etc. In protein sequences, patterns have proven to be extremely helpful in domain identification, location of protease cleavage sites, identification of signal peptides, protein interactions, determination of protein degradation elements, identification of protein trafficking elements, etc. Motifs are important patterns that are helpful in finding transcriptional regulatory elements, transcription factor binding sites, functional genomics, drug design, etc. As a result, numerous papers have been written to solve the motif search problem.
Three versions of the motif search problem have been proposed in the literature: Simple Motif Search (SMS), (l, d)-motif search (or Planted Motif Search (PMS)), and Edit-distance-based Motif Search (EMS). In this paper we focus on PMS. Two kinds of algorithms can be found in the literature for solving the PMS problem: exact and approximate. An exact algorithm identifies the motifs always and an approximate algorithm may fail to identify some or all of the motifs. The exact version of PMS problem has been shown to be NP-hard. Exact algorithms proposed in the literature for PMS take time that is exponential in some of the underlying parameters. In this paper we propose a generic technique that can be used to speedup PMS algorithms.
We present a speedup technique that can be used on any PMS algorithm. We have tested our speedup technique on a number of algorithms. These experimental results show that our speedup technique is indeed very effective. The implementation of algorithms is freely available on the web at http://www.engr.uconn.edu/rajasek/PMS4.zip
Assembling genomic sequences from a set of overlapping reads is one of the most fundamental problems in computational biology. Algorithms addressing the assembly problem fall into two broad categories - based on the data structures which they employ. The first class uses an overlap/string graph and the second type uses a de Bruijn graph. However with the recent advances in short read sequencing technology, de Bruijn graph based algorithms seem to play a vital role in practice. Efficient algorithms for building these massive de Bruijn graphs are very essential in large sequencing projects based on short reads. In an earlier work, an O(n/p) time parallel algorithm has been given for this problem. Here n is the size of the input and p is the number of processors. This algorithm enumerates all possible bi-directed edges which can overlap with a node and ends up generating Θ(nΣ) messages (Σ being the size of the alphabet).
In this paper we present a Θ(n/p) time parallel algorithm with a communication complexity that is equal to that of parallel sorting and is not sensitive to Σ. The generality of our algorithm makes it very easy to extend it even to the out-of-core model and in this case it has an optimal I/O complexity of Θ(nlog(n/B)Blog(M/B)) (M being the main memory size and B being the size of the disk block). We demonstrate the scalability of our parallel algorithm on a SGI/Altix computer. A comparison of our algorithm with the previous approaches reveals that our algorithm is faster - both asymptotically and practically. We demonstrate the scalability of our sequential out-of-core algorithm by comparing it with the algorithm used by VELVET to build the bi-directed de Bruijn graph. Our experiments reveal that our algorithm can build the graph with a constant amount of memory, which clearly outperforms VELVET. We also provide efficient algorithms for the bi-directed chain compaction problem.
The bi-directed de Bruijn graph is a fundamental data structure for any sequence assembly program based on Eulerian approach. Our algorithms for constructing Bi-directed de Bruijn graphs are efficient in parallel and out of core settings. These algorithms can be used in building large scale bi-directed de Bruijn graphs. Furthermore, our algorithms do not employ any all-to-all communications in a parallel setting and perform better than the prior algorithms. Finally our out-of-core algorithm is extremely memory efficient and can replace the existing graph construction algorithm in VELVET.
Minimotifs are short contiguous peptide sequences in proteins that are known to have a function in at least one other protein. One of the principal limitations in minimotif prediction is that false positives limit the usefulness of this approach. As a step toward resolving this problem we have built, implemented, and tested a new data-driven algorithm that reduces false-positive predictions.
Certain domains and minimotifs are known to be strongly associated with a known cellular process or molecular function. Therefore, we hypothesized that by restricting minimotif predictions to those where the minimotif containing protein and target protein have a related cellular or molecular function, the prediction is more likely to be accurate. This filter was implemented in Minimotif Miner using function annotations from the Gene Ontology. We have also combined two filters that are based on entirely different principles and this combined filter has a better predictability than the individual components.
Testing these functional filters on known and random minimotifs has revealed that they are capable of separating true motifs from false positives. In particular, for the cellular function filter, the percentage of known minimotifs that are not removed by the filter is ∼4.6 times that of random minimotifs. For the molecular function filter this ratio is ∼2.9. These results, together with the comparison with the published frequency score filter, strongly suggest that the new filters differentiate true motifs from random background with good confidence. A combination of the function filters and the frequency score filter performs better than these two individual filters.
Minimotifs are short peptide sequences within one protein, which are recognized by other proteins or molecules. While there are now several minimotif databases, they are incomplete. There are reports of many minimotifs in the primary literature, which have yet to be annotated, while entirely novel minimotifs continue to be published on a weekly basis. Our recently proposed function and sequence syntax for minimotifs enables us to build a general tool that will facilitate structured annotation and management of minimotif data from the biomedical literature.
We have built the MimoSA application for minimotif annotation. The application supports management of the Minimotif Miner database, literature tracking, and annotation of new minimotifs. MimoSA enables the visualization, organization, selection and editing functions of minimotifs and their attributes in the MnM database. For the literature components, Mimosa provides paper status tracking and scoring of papers for annotation through a freely available machine learning approach, which is based on word correlation. The paper scoring algorithm is also available as a separate program, TextMine. Form-driven annotation of minimotif attributes enables entry of new minimotifs into the MnM database. Several supporting features increase the efficiency of annotation. The layered architecture of MimoSA allows for extensibility by separating the functions of paper scoring, minimotif visualization, and database management. MimoSA is readily adaptable to other annotation efforts that manually curate literature into a MySQL database.
MimoSA is an extensible application that facilitates minimotif annotation and integrates with the Minimotif Miner database. We have built MimoSA as an application that integrates dynamic abstract scoring with a high performance relational model of minimotif syntax. MimoSA's TextMine, an efficient paper-scoring algorithm, can be used to dynamically rank papers with respect to context.