The suffix array is a data structure that finds numerous applications in string processing problems for both linguistic texts and biological data. It has been introduced as a memory efficient alternative for suffix trees. The suffix array consists of the sorted suffixes of a string. There are several linear time suffix array construction algorithms (SACAs) known in the literature. However, one of the fastest algorithms in practice has a worst case run time of O(n2). The problem of designing practically and theoretically efficient techniques remains open.
In this paper we present an elegant algorithm for suffix array construction which takes linear time with high probability; the probability is on the space of all possible inputs. Our algorithm is one of the simplest of the known SACAs and it opens up a new dimension of suffix array construction that has not been explored until now. Our algorithm is easily parallelizable. We offer parallel implementations on various parallel models of computing. We prove a lemma on the ℓ-mers of a random string which might find independent applications. We also present another algorithm that utilizes the above algorithm. This algorithm is called RadixSA and has a worst case run time of O(n log n). RadixSA introduces an idea that may find independent applications as a speedup technique for other SACAs. An empirical comparison of RadixSA with other algorithms on various datasets reveals that our algorithm is one of the fastest algorithms to date. The C++ source code is freely available at http://www.engr.uconn.edu/~man09004/radixSA.zip.
doi:10.1016/j.jda.2014.03.001
PMCID: PMC4101535
PMID: 25045344
suffix array construction algorithm; parallel algorithm; high probability bounds
Since the function of a short contiguous peptide minimotif can be introduced or eliminated by a single point mutation, these functional elements may be a source of human variation and a target of selection. We analyzed the variability of ∼300 000 minimotifs in 1092 human genomes from the 1000 Genomes Project. Most minimotifs have been purified by selection, with a 94% invariance, which supports important functional roles for minimotifs. Minimotifs are generally under negative selection, possessing high genomic evolutionary rate profiling (GERP) and sitewise likelihood-ratio (SLR) scores. Some are subject to neutral drift or positive selection, similar to coding regions. Most SNPs in minimotif were common variants, but with minor allele frequencies generally <10%. This was supported by low substation rates and few newly derived minimotifs. Several minimotif alleles showed different intercontinental and regional geographic distributions, strongly suggesting a role for minimotifs in adaptive evolution. We also note that 4% of PTM minimotif sites in histone tails were common variants, which has the potential to differentially affect DNA packaging among individuals. In conclusion, minimotifs are a source of functional genetic variation in the human population; thus, they are likely to be an important target of selection and evolution.
doi:10.1093/nar/gkv580
PMCID: PMC4513861
PMID: 26068475
Abstract
Protein/peptide microarrays are rapidly gaining momentum in the diagnosis of cancer. High-density and high-throughput peptide arrays are being extensively used to detect tumor biomarkers, examine kinase activity, identify antibodies having low serum titers, and locate antibody signatures. Improving the yield of microarray fabrication involves solving a hard combinatorial optimization problem called the border length minimization problem (BLMP). An important question that remained open for the past 7 years is if the BLMP is tractable or not. We settle this open problem by proving that the BLMP is \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${ \cal NP}$$\end{document}-hard. We also present a hierarchical refinement algorithm that can refine any heuristic solution for the BLMP and prove that the TSP+1-threading heuristic is an O(N)-approximation.
doi:10.1089/cmb.2013.0127
PMCID: PMC4038986
PMID: 24528000
algorithms; combinatorial optimization; DNA self-assembly; genomics; machine learning
Background
Record linkage integrates records across multiple related data sources identifying duplicates and accounting for possible errors. Real life applications require efficient algorithms to merge these voluminous data sources to find out all records belonging to same individuals. Our recently devised highly efficient record linkage algorithms provide best-known solutions to this challenging problem.
Method
We have developed RLT-S, a freely available web tool, which implements our single linkage clustering algorithm for record linkage. This tool requires input data sets and a small set of configuration settings about these files to work efficiently. RLT-S employs exact match clustering, blocking on a specified attribute and single linkage based hierarchical clustering among these blocks.
Results
RLT-S is an implementation package of our sequential record linkage algorithm. It outperforms previous best-known implementations by a large margin. The tool is at least two times faster for any dataset than the previous best-known tools.
Conclusions
RLT-S tool implements our record linkage algorithm that outperforms previous best-known algorithms in this area. This website also contains necessary information such as instructions, submission history, feedback, publications and some other sections to facilitate the usage of the tool.
Availability
RLT-S is integrated into http://www.rlatools.com, which is currently serving this tool only. The tool is freely available and can be used without login. All data files used in this paper have been stored in https://github.com/abdullah009/DataRLATools. For copies of the relevant programs please see https://github.com/abdullah009/RLATools.
doi:10.1371/journal.pone.0124449
PMCID: PMC4420456
PMID: 25942687
Metabolomics is the study of small molecules, called metabolites, of a cell, tissue or organism. It is of particular interest as endogenous metabolites represent the phenotype resulting from gene expression. A major challenge in metabolomics research is the structural identification of unknown biochemical compounds in complex biofluids. In this paper we present an efficient cheminformatics tool, BioSMXpress that uses known endogenous mammalian biochemicals and graph matching methods to identify endogenous mammalian biochemical structures in chemical structure space. The results of a comprehensive set of empirical experiments suggest that BioSMXpress identifies endogenous mammalian biochemical structures with high accuracy. BioSMXpress is 8 times faster than our previous work BioSM without compromising the accuracy of the predictions made. BioSMXpress is freely available at http://engr.uconn.edu/~rajasek/BioSMXpress.zip
doi:10.1186/1471-2105-16-S5-S11
PMCID: PMC4402589
PMID: 25859612
Metabolomics; Metabolite Identification; Chemical Structure Matching; Biochemical Structures
Numerous OLAP queries process selection operations of “top N”, median, “top 5%”, in data warehousing applications. Selection is a well-studied problem that has numerous applications in the management of data and databases since, typically, any complex data query can be reduced to a series of basic operations such as sorting and selection. The parallel selection has also become an important fundamental operation, especially after parallel databases were introduced. In this paper, we present a deterministic algorithm Recursive Sampling Selection (RSS) to solve the exact out-of-core selection problem, which we show needs no more than (2 + ε) passes (ε being a very small fraction). We have compared our RSS algorithm with two other algorithms in the literature, namely, the Deterministic Sampling Selection and QuickSelect on the Parallel Disks Systems. Our analysis shows that DSS is a (2 + ε)-pass algorithm when the total number of input elements N is a polynomial in the memory size M (i.e., N = Mc for some constant c). While, our proposed algorithm RSS runs in (2 + ε) passes without any assumptions. Experimental results indicate that both RSS and DSS outperform QuickSelect on the Parallel Disks Systems. Especially, the proposed algorithm RSS is more scalable and robust to handle big data when the input size is far greater than the core memory size, including the case of N ≫ Mc.
doi:10.1109/ISCC.2013.6755015
PMCID: PMC4217089
PMID: 25374478
Selection; Parallel Disk System; Median; OLAP queries
Sequence assembly from short reads is an important problem in biology. It is known that solving the sequence assembly problem exactly on a bi-directed de Bruijn graph or a string graph is intractable. However, finding a shortest double stranded DNA string (SDDNA) containing all the k-long words in the reads seems to be a good heuristic to get close to the original genome. This problem is equivalent to finding a cyclic Chinese Postman (CP) walk on the underlying unweighted bi-directed de Bruijn graph built from the reads. The Chinese Postman walk Problem (CPP) is solved by reducing it to a general bi-directed flow on this graph which runs in O(|E|2 log2(|V|)) time. In this paper we show that the cyclic CPP on bi-directed graphs can be solved without reducing it to bi-directed flow. We present a Θ(p(|V| + |E|) log(|V|) + (dmaxp)3) time algorithm to solve the cyclic CPP on a weighted bi-directed de Bruijn graph, where p = max{|{υ∣din(υ) − dout(υ) > 0}|, |{υ∣din(υ) − dout(υ) < 0}|} and dmax = max{∣din(υ) − dout(υ)}. Our algorithm performs asymptotically better than the bi-directed flow algorithm when the number of imbalanced nodes p is much less than the nodes in the bi-directed graph. From our experimental results on various datasets, we have noticed that the value of p/|V| lies between 0.08% and 0.13% with 95% probability. Many practical bi-directed de Bruijn graphs do not have cyclic CP walks. In such cases it is not clear how the bi-directed flow can be useful in identifying contigs. Our algorithm can handle such situations and identify maximal bi-directed sub-graphs that have CP walks. A Θ(p(|V| + |E|)) time heuristic algorithm based on these ideas has been implemented for the SDDNA problem. This algorithm was tested on short reads from a plant genome and achieves an approximation ratio of at most 1.0134. We also present a Θ((|V| + |E|) log(V)) time algorithm for the single source shortest path problem on bi-directed de Bruijn graphs, which may be of independent interest.
PMCID: PMC4215799
PMID: 25364472
Sequence assembly algorithms; bioinformatics; Chinese Postman problem; bi-directed graphs
We develop an efficient multicore algorithm, PMS6MC, for the (l, d)-motif discovery problem in which we are to find all strings of length l that appear in every string of a given set of strings with at most d mismatches. PMS6MC is based on PMS6, which is currently the fastest single-core algorithm for motif discovery in large instances. The speedup, relative to PMS6, attained by our multicore algorithm ranges from a high of 6.62 for the (17,6) challenging instances to a low of 2.75 for the (13,4) challenging instances on an Intel 6-core system. We estimate that PMS6MC is 2 to 4 times faster than other parallel algorithms for motif search on large instances.
doi:10.3390/a6040805
PMCID: PMC4193679
PMID: 25309700
Planted motif search; parallel string algorithms; multi-core algorithms
In the next generation sequencing techniques millions of short reads are produced from a genomic sequence at a single run. The chances of low read coverage to some regions of the sequence are very high. The reads are short and very large in number. Due to erroneous base calling, there could be errors in the reads. As a consequence, sequence assemblers often fail to sequence an entire DNA molecule and instead output a set of overlapping segments that together represent a consensus region of the DNA. This set of overlapping segments are collectively called contigs in the literature. The final step of the sequencing process, called scaffolding, is to assemble the contigs into a correct order. Scaffolding techniques typically exploit additional information such as mate-pairs, pair-ends, or optical restriction maps. In this paper we introduce a series of novel algorithms for scaffolding that exploit optical restriction maps (ORMs). Simulation results show that our algorithms are indeed reliable, scalable, and efficient compared to the best known algorithms in the literature.
doi:10.1186/1471-2164-15-S5-S5
PMCID: PMC4120203
PMID: 25081913
Next Generation Sequencing (NGS); Optical Restriction Map (ORM); Human Genome Project (HGP); String Graph Assembler (SGA); Contigs; Scaffolding
The structural identification of unknown biochemical compounds in complex biofluids continues to be a major challenge in metabolomics research. Using LC/MS there are currently two major options for solving this problem: searching small biochemical databases, which often do not contain the unknown of interest, or searching large chemical databases which include large numbers of non-biochemical compounds. Searching larger chemical databases (larger chemical space) increases the odds of identifying an unknown biochemical compound, but only if non-biochemical structures can be eliminated from consideration. In this paper we present BioSM; a cheminformatics tool that uses known endogenous mammalian biochemical compounds (as scaffolds) and graph matching methods to identify endogenous mammalian biochemical structures in chemical structure space. The results of a comprehensive set of empirical experiments suggest that BioSM identifies endogenous mammalian biochemical structures with high accuracy. In a leave-one-out cross validation experiment, BioSM correctly predicted 95% of 1,388 Kyoto Encyclopedia of Genes and Genomes (KEGG) compounds as endogenous mammalian biochemicals using 1,565 scaffolds. Analysis of two additional biological datasets containing 2,330 human metabolites (HMDB) and 2,416 plant secondary metabolites (KEGG) resulted in biochemical annotations of 89% and 72% of the compounds respectively. When a dataset of 3,895 drugs (DrugBank and USAN) was tested, 48% of these structures were predicted to be biochemical. However, when a set of synthetic chemical compounds (Chembridge and Chemsynthesis databases) were examined, only 29% of the 458,207 structures were predicted to be biochemical. Moreover, BioSM predicted that 34% of 883,199 randomly selected compounds from PubChem were biochemical. We then expanded the scaffold list to 3,927 biochemical compounds and reevaluated the above datasets to determine whether scaffold number influenced model performance. Although there were significant improvements in model sensitivity and specificity using the larger scaffold list, the dataset comparison results were very similar. These results suggest that additional biochemical scaffolds will not further improve our representation of biochemical structure space and that the model is reasonably robust. BioSM provides a qualitative (yes/no) and quantitative (ranking) method for endogenous mammalian biochemical annotation of chemical space, and thus will be useful in the identification of unknown biochemical structures in metabolomics. BioSM is freely available at http://metabolomics.pharm.uconn.edu.
doi:10.1021/ci300512q
PMCID: PMC3866231
PMID: 23330685
Metabolomics; cheminformatics; structural similarity; graph matching; metabolite identification
Background
Motif searching is an important step in the detection of rare events occurring in a set of DNA or protein sequences. One formulation of the problem is known as (l,d)-motif search or Planted Motif Search (PMS). In PMS we are given two integers l and d and n biological sequences. We want to find all sequences of length l that appear in each of the input sequences with at most d mismatches. The PMS problem is NP-complete. PMS algorithms are typically evaluated on certain instances considered challenging. Despite ample research in the area, a considerable performance gap exists because many state of the art algorithms have large runtimes even for moderately challenging instances.
Results
This paper presents a fast exact parallel PMS algorithm called PMS8. PMS8 is the first algorithm to solve the challenging (l,d) instances (25,10) and (26,11). PMS8 is also efficient on instances with larger l and d such as (50,21). We include a comparison of PMS8 with several state of the art algorithms on multiple problem instances. This paper also presents necessary and sufficient conditions for 3 l-mers to have a common d-neighbor. The program is freely available at http://engr.uconn.edu/~man09004/PMS8/.
Conclusions
We present PMS8, an efficient exact algorithm for Planted Motif Search. PMS8 introduces novel ideas for generating common neighborhoods. We have also implemented a parallel version for this algorithm. PMS8 can solve instances not solved by any previous algorithms.
doi:10.1186/1471-2105-15-34
PMCID: PMC3924400
PMID: 24479443
Background
Identification of DNA/Protein motifs is a crucial problem for biologists. Computational techniques could be of great help in this identification. In this direction, many computational models for motifs have been proposed in the literature.
Methods
One such important model is the motif model. In this paper we describe a motif search web tool that predominantly employs this motif model. This web tool exploits the state-of-the art algorithms for solving the motif search problem.
Results
The online tool has been helping scientists identify many unknown motifs. Many of our predictions have been successfully verified as well. We hope that this paper will expose this crucial tool to many more scientists.
Availability and requirements
Project name: PMS - Panoptic Motif Search Tool. Project home page: http://pms.engr.uconn.edu or http://motifsearch.com. Licence: PMS tools will be readily available to any scientist wishing to use it for non-commercial purposes, without restrictions. The online tool is freely available without login.
doi:10.1371/journal.pone.0080660
PMCID: PMC3851466
PMID: 24324619
Efficient tile sets for self assembling rectilinear shapes is of critical importance in algorithmic self assembly. A lower bound on the tile complexity of any deterministic self assembly system for an n × n square is
Ω(log(n)log(log(n))) (inferred from the Kolmogrov complexity). Deterministic self assembly systems with an optimal tile complexity have been designed for squares and related shapes in the past. However designing
Θ(log(n)log(log(n))) unique tiles specific to a shape is still an intensive task in the laboratory. On the other hand copies of a tile can be made rapidly using PCR (polymerase chain reaction) experiments. This led to the study of self assembly on tile concentration programming models. We present two major results in this paper on the concentration programming model. First we show how to self assemble rectangles with a fixed aspect ratio (α:β), with high probability, using Θ(α + β) tiles. This result is much stronger than the existing results by Kao et al. (Randomized self-assembly for approximate shapes, LNCS, vol 5125. Springer, Heidelberg, 2008) and Doty (Randomized self-assembly for exact shapes. In: proceedings of the 50th annual IEEE symposium on foundations of computer science (FOCS), IEEE, Atlanta. pp 85–94, 2009)—which can only self assembly squares and rely on tiles which perform binary arithmetic. On the other hand, our result is based on a technique called staircase sampling. This technique eliminates the need for sub-tiles which perform binary arithmetic, reduces the constant in the asymptotic bound, and eliminates the need for approximate frames (Kao et al. Randomized self-assembly for approximate shapes, LNCS, vol 5125. Springer, Heidelberg, 2008). Our second result applies staircase sampling on the equimolar concentration programming model (The tile complexity of linear assemblies. In: proceedings of the 36th international colloquium automata, languages and programming: Part I on ICALP ’09, Springer-Verlag, pp 235–253, 2009), to self assemble rectangles (of fixed aspect ratio) with high probability. The tile complexity of our algorithm is Θ(log(n)) and is optimal on the probabilistic tile assembly model (PTAM)—n being an upper bound on the dimensions of a rectangle.
doi:10.1007/s11047-012-9313-1
PMCID: PMC3848893
PMID: 24311993
Tile assembly model; Concentration programming; Self assembly algorithms; Probabilistic self assembly model (PTAM); Randomized algorithms
Background and objective
Integrating data from multiple sources is a crucial and challenging problem. Even though there exist numerous algorithms for record linkage or deduplication, they suffer from either large time needs or restrictions on the number of datasets that they can integrate. In this paper we report efficient sequential and parallel algorithms for record linkage which handle any number of datasets and outperform previous algorithms.
Methods
Our algorithms employ hierarchical clustering algorithms as the basis. A key idea that we use is radix sorting on certain attributes to eliminate identical records before any further processing. Another novel idea is to form a graph that links similar records and find the connected components.
Results
Our sequential and parallel algorithms have been tested on a real dataset of 1 083 878 records and synthetic datasets ranging in size from 50 000 to 9 000 000 records. Our sequential algorithm runs at least two times faster, for any dataset, than the previous best-known algorithm, the two-phase algorithm using faster computation of the edit distance (TPA (FCED)). The speedups obtained by our parallel algorithm are almost linear. For example, we get a speedup of 7.5 with 8 cores (residing in a single node), 14.1 with 16 cores (residing in two nodes), and 26.4 with 32 cores (residing in four nodes).
Conclusions
We have compared the performance of our sequential algorithm with TPA (FCED) and found that our algorithm outperforms the previous one. The accuracy is the same as that of this previous best-known algorithm.
doi:10.1136/amiajnl-2013-002034
PMCID: PMC3932463
PMID: 24154837
Data Integration; Healthcare Records; Algorithms; Parallel Algorithms; Speedups
The materials discovery process can be significantly expedited and simplified if we can learn effectively from available knowledge and data. In the present contribution, we show that efficient and accurate prediction of a diverse set of properties of material systems is possible by employing machine (or statistical) learning methods trained on quantum mechanical computations in combination with the notions of chemical similarity. Using a family of one-dimensional chain systems, we present a general formalism that allows us to discover decision rules that establish a mapping between easily accessible attributes of a system and its properties. It is shown that fingerprints based on either chemo-structural (compositional and configurational information) or the electronic charge density distribution can be used to make ultra-fast, yet accurate, property predictions. Harnessing such learning paradigms extends recent efforts to systematically explore and mine vast chemical spaces, and can significantly accelerate the discovery of new application-specific materials.
doi:10.1038/srep02810
PMCID: PMC3786293
PMID: 24077117
Traditional Associative Classification (AC) algorithms typically search for all possible association rules to find a representative subset of those rules. Since the search space of such rules may grow exponentially as the support threshold decreases, the rules discovery process can be computationally expensive. One effective way to tackle this problem is to directly find a set of high-stakes association rules that potentially builds a highly accurate classifier. This paper introduces AC-CS, an AC algorithm that integrates the clonal selection of the immune system along with deterministic data sampling. Upon picking a representative sample of the original data, it proceeds in an evolutionary fashion to populate only rules that are likely to yield good classification accuracy. Empirical results on several real datasets show that the approach generates dramatically less rules than traditional AC algorithms. In addition, the proposed approach is significantly more efficient than traditional AC algorithms while achieving a competitive accuracy.
doi:10.1109/CEC.2013.6557966
PMCID: PMC3770136
PMID: 24500504
We propose a new algorithm, PMS6, for the (l, d)-motif discovery problem in which we are to find all strings of length l that appear in every string of a given set of strings with at most d mismatches. The run time ratio PMS5/PMS6, where PMS5 is the fastest previously known algorithm for motif discovery in large instances, ranges from a high of 2.20 for the (21,8) challenge instances to a low of 1.69 for the (17,6) challenge instances. Both PMS5 and PMS6 require some amount of preprocessing. The preprocessing time for PMS6 is 34 times faster than that for PMS5 for (23,9) instances. When preprocessing time is factored in, the run time ratio PMS5/PMS6 is as high as 2.75 for (13,4) instances and as low as 1.95 for (17,6) instances.
doi:10.1109/ICCABS.2012.6182627
PMCID: PMC3744182
PMID: 23959399
Planted motif search; string algorithms
Background
Motifs are significant patterns in DNA, RNA, and protein sequences, which play an important role in biological processes and functions, like identification of open reading frames, RNA transcription, protein binding, etc. Several versions of the motif search problem have been studied in the literature. One such version is called the Planted Motif Search (PMS)or (l, d)-motif Search. PMS is known to be NP complete. The time complexities of most of the planted motif search algorithms depend exponentially on the alphabet size. Recently a new version of the motif search problem has been introduced by Kuksa and Pavlovic. We call this version as the Motif Stems Search (MSS) problem. A motif stem is an l-mer (for some relevant value of l)with some wildcard characters and hence corresponds to a set of l-mers (without wildcards), some of which are (l, d)-motifs. Kuksa and Pavlovic have presented an efficient algorithm to find motif stems for inputs from large alphabets. Ideally, the number of stems output should be as small as possible since the stems form a superset of the motifs.
Results
In this paper we propose an efficient algorithm for MSS and evaluate it on both synthetic and real data. This evaluation reveals that our algorithm is much faster than Kuksa and Pavlovic’s algorithm.
Conclusions
Our MSS algorithm outperforms the algorithm of Kuksa and Pavlovic in terms of the run time as well as the number of stems output. Specifically, the stems output by our algorithm form a proper (and much smaller)subset of the stems output by Kuksa and Pavlovic’s algorithm.
doi:10.1186/1471-2105-14-161
PMCID: PMC3679804
PMID: 23679045
Background
Single Nucleotide Polymorphisms (SNPs) are sequence variations found in individuals at some specific points in the genomic sequence. As SNPs are highly conserved throughout evolution and within a population, the map of SNPs serves as an excellent genotypic marker. Conventional SNPs analysis mechanisms suffer from large run times, inefficient memory usage, and frequent overestimation. In this paper, we propose efficient, scalable, and reliable algorithms to select a small subset of SNPs from a large set of SNPs which can together be employed to perform phenotypic classification.
Methods
Our algorithms exploit the techniques of gene selection and random projections to identify a meaningful subset of SNPs. To the best of our knowledge, these techniques have not been employed before in the context of genotype‐phenotype correlations. Random projections are used to project the input data into a lower dimensional space (closely preserving distances). Gene selection is then applied on the projected data to identify a subset of the most relevant SNPs.
Results
We have compared the performance of our algorithms with one of the currently known best algorithms called Multifactor Dimensionality Reduction (MDR), and Principal Component Analysis (PCA) technique. Experimental results demonstrate that our algorithms are superior in terms of accuracy as well as run time.
Conclusions
In our proposed techniques, random projection is used to map data from a high dimensional space to a lower dimensional space, and thus overcomes the curse of dimensionality problem. From this space of reduced dimension, we select the best subset of attributes. It is a unique mechanism in the domain of SNPs analysis, and to the best of our knowledge it is not employed before. As revealed by our experimental results, our proposed techniques offer the potential of high accuracies while keeping the run times low.
doi:10.1186/1472-6947-13-41
PMCID: PMC3686582
PMID: 23557276
Feature Selection Algorithm (FSA); Gene Selection Algorithm (GSA); Multifactor Dimensionality Reduction (MDR); Random Projection (RP); Single‐Nucleotide Polymorphism (SNP); Support Vector Machine (SVM); Principal Component Analysis (PCA)
Sargeant, David P. | Gryk, Michael R. | Maciejewski, Mark W. | Thapar, Vishal | Kundeti, Vamsi | Rajasekaran, Sanguthevar | Romero, Pedro | Dunker, Keith | Li, Shun-Cheng | Kaneko, Tomonori | Schiller, Martin R.
Minimotifs are short contiguous segments of proteins that have a known biological function. The hundreds of thousands of minimotifs discovered thus far are an important part of the theoretical understanding of the specificity of protein-protein interactions, posttranslational modifications, and signal transduction that occur in cells. However, a longstanding problem is that the different abstractions of the sequence definitions do not accurately capture the specificity, despite decades of effort by many labs. We present evidence that structure is an essential component of minimotif specificity, yet is not used in minimotif definitions. Our analysis of several known minimotifs as case studies, analysis of occurrences of minimotifs in structured and disordered regions of proteins, and review of the literature support a new model for minimotif definitions that includes sequence, structure, and function.
doi:10.1371/journal.pone.0049957
PMCID: PMC3517595
PMID: 23236358
In this paper1 we present efficient algorithms for sorting on the Parallel Disks Model (PDM). Numerous asymptotically optimal algorithms have been proposed in the literature. However many of these merge based algorithms have large underlying constants in the time bounds, because they suffer from the lack of read parallelism on PDM. The irregular consumption of the runs during the merge affects the read parallelism and contributes to the increased sorting time. In this paper we first introduce a novel idea called the dirty sequence accumulation that improves the read parallelism. Secondly, we show analytically that this idea can reduce the number of parallel I/O’s required to sort the input close to the lower bound of Ω(logMB(NM)). We experimentally verify our dirty sequence idea with the standard R-Way merge and show that our idea can reduce the number of parallel I/Os to sort on PDM significantly.
doi:10.1016/j.jpdc.2011.07.004
PMCID: PMC3199586
PMID: 22034549
The low complexity of minimotif patterns results in a high false-positive prediction rate, hampering protein function prediction. A multi-filter algorithm, trained and tested on a linear regression model, support vector machine model, and neural network model, using a large dataset of verified minimotifs, vastly improves minimotif prediction accuracy while generating few false positives. An optimal threshold for the best accuracy reaches an overall accuracy above 90%, while a stringent threshold for the best specificity generates less than 1% false positives or even no false positives and still produces more than 90% true positives for the linear regression and neural network models. The minimotif multi-filter with its excellent accuracy represents the state-of-the-art in minimotif prediction and is expected to be very useful to biologists investigating protein function and how missense mutations cause disease.
doi:10.1371/journal.pone.0045589
PMCID: PMC3459956
PMID: 23029121
Detection of rare events happening in a set of DNA/protein sequences could lead to new biological discoveries. One kind of such rare events is the presence of patterns called motifs in DNA/protein sequences. Finding motifs is a challenging problem since the general version of motif search has been proven to be intractable. Motifs discovery is an important problem in biology. For example, it is useful in the detection of transcription factor binding sites and transcriptional regulatory elements that are very crucial in understanding gene function, human disease, drug design, etc. Many versions of the motif search problem have been proposed in the literature. One such is the -motif search (or Planted Motif Search (PMS)). A generalized version of the PMS problem, namely, Quorum Planted Motif Search (qPMS), is shown to accurately model motifs in real data. However, solving the qPMS problem is an extremely difficult task because a special case of it, the PMS Problem, is already NP-hard, which means that any algorithm solving it can be expected to take exponential time in the worse case scenario. In this paper, we propose a novel algorithm named qPMS7 that tackles the qPMS problem on real data as well as challenging instances. Experimental results show that our Algorithm qPMS7 is on an average 5 times faster than the state-of-art algorithm. The executable program of Algorithm qPMS7 is freely available on the web at http://pms.engr.uconn.edu/downloads/qPMS7.zip. Our online motif discovery tools that use Algorithm qPMS7 are freely available at http://pms.engr.uconn.edu or http://motifsearch.com.
doi:10.1371/journal.pone.0041425
PMCID: PMC3404135
PMID: 22848493
Motivation: Exact-match overlap graphs have been broadly used in the context of DNA assembly and the shortest super string problem where the number of strings n ranges from thousands to billions. The length ℓ of the strings is from 25 to 1000, depending on the DNA sequencing technologies. However, many DNA assemblers using overlap graphs suffer from the need for too much time and space in constructing the graphs. It is nearly impossible for these DNA assemblers to handle the huge amount of data produced by the next-generation sequencing technologies where the number n of strings could be several billions. If the overlap graph is explicitly stored, it would require Ω(n2) memory, which could be prohibitive in practice when n is greater than a hundred million. In this article, we propose a novel data structure using which the overlap graph can be compactly stored. This data structure requires only linear time to construct and and linear memory to store.
Results: For a given set of input strings (also called reads), we can informally define an exact-match overlap graph as follows. Each read is represented as a node in the graph and there is an edge between two nodes if the corresponding reads overlap sufficiently. A formal description follows. The maximal exact-match overlap of two strings x and y, denoted by ovmax(x, y), is the longest string which is a suffix of x and a prefix of y. The exact-match overlap graph of n given strings of length ℓ is an edge-weighted graph in which each vertex is associated with a string and there is an edge (x, y) of weight ω=ℓ−|ovmax(x, y)| if and only if ω≤λ, where |ovmax(x, y)| is the length of ovmax(x, y) and λ is a given threshold. In this article, we show that the exact-match overlap graphs can be represented by a compact data structure that can be stored using at most (2λ−1)(2⌈logn⌉+⌈logλ⌉)n bits with a guarantee that the basic operation of accessing an edge takes O(log λ) time. We also propose two algorithms for constructing the data structure for the exact-match overlap graph. The first algorithm runs in O(λℓnlogn) worse-case time and requires O(λ) extra memory. The second one runs in O(λℓn) time and requires O(n) extra memory. Our experimental results on a huge amount of simulated data from sequence assembly show that the data structure can be constructed efficiently in time and memory.
Availability: Our DNA sequence assembler that incorporates the data structure is freely available on the web at http://www.engr.uconn.edu/~htd06001/assembler/leap.zip
Contact: hdinh@engr.uconn.edu; rajasek@engr.uconn.edu
doi:10.1093/bioinformatics/btr321
PMCID: PMC3129531
PMID: 21636593
Background
Recent large scale deployments of health information technology have created opportunities for the integration of patient medical records with disparate public health, human service, and educational databases to provide comprehensive information related to health and development. Data integration techniques, which identify records belonging to the same individual that reside in multiple data sets, are essential to these efforts. Several algorithms have been proposed in the literatures that are adept in integrating records from two different datasets. Our algorithms are aimed at integrating multiple (in particular more than two) datasets efficiently.
Methods
Hierarchical clustering based solutions are used to integrate multiple (in particular more than two) datasets. Edit distance is used as the basic distance calculation, while distance calculation of common input errors is also studied. Several techniques have been applied to improve the algorithms in terms of both time and space: 1) Partial Construction of the Dendrogram (PCD) that ignores the level above the threshold; 2) Ignoring the Dendrogram Structure (IDS); 3) Faster Computation of the Edit Distance (FCED) that predicts the distance with the threshold by upper bounds on edit distance; and 4) A pre-processing blocking phase that limits dynamic computation within each block.
Results
We have experimentally validated our algorithms on large simulated as well as real data. Accuracy and completeness are defined stringently to show the performance of our algorithms. In addition, we employ a four-category analysis. Comparison with FEBRL shows the robustness of our approach.
Conclusions
In the experiments we conducted, the accuracy we observed exceeded 90% for the simulated data in most cases. 97.7% and 98.1% accuracy were achieved for the constant and proportional threshold, respectively, in a real dataset of 1,083,878 records.
doi:10.1186/1472-6947-12-59
PMCID: PMC3439324
PMID: 22741525