Search tips
Search criteria

Results 1-25 (36)

Clipboard (0)

Select a Filter Below

more »
Year of Publication
1.  An Elegant Algorithm for the Construction of Suffix Arrays 
The suffix array is a data structure that finds numerous applications in string processing problems for both linguistic texts and biological data. It has been introduced as a memory efficient alternative for suffix trees. The suffix array consists of the sorted suffixes of a string. There are several linear time suffix array construction algorithms (SACAs) known in the literature. However, one of the fastest algorithms in practice has a worst case run time of O(n2). The problem of designing practically and theoretically efficient techniques remains open.
In this paper we present an elegant algorithm for suffix array construction which takes linear time with high probability; the probability is on the space of all possible inputs. Our algorithm is one of the simplest of the known SACAs and it opens up a new dimension of suffix array construction that has not been explored until now. Our algorithm is easily parallelizable. We offer parallel implementations on various parallel models of computing. We prove a lemma on the ℓ-mers of a random string which might find independent applications. We also present another algorithm that utilizes the above algorithm. This algorithm is called RadixSA and has a worst case run time of O(n log n). RadixSA introduces an idea that may find independent applications as a speedup technique for other SACAs. An empirical comparison of RadixSA with other algorithms on various datasets reveals that our algorithm is one of the fastest algorithms to date. The C++ source code is freely available at
PMCID: PMC4101535  PMID: 25045344
suffix array construction algorithm; parallel algorithm; high probability bounds
2.  Natural variability of minimotifs in 1092 people indicates that minimotifs are targets of evolution 
Nucleic Acids Research  2015;43(13):6399-6412.
Since the function of a short contiguous peptide minimotif can be introduced or eliminated by a single point mutation, these functional elements may be a source of human variation and a target of selection. We analyzed the variability of ∼300 000 minimotifs in 1092 human genomes from the 1000 Genomes Project. Most minimotifs have been purified by selection, with a 94% invariance, which supports important functional roles for minimotifs. Minimotifs are generally under negative selection, possessing high genomic evolutionary rate profiling (GERP) and sitewise likelihood-ratio (SLR) scores. Some are subject to neutral drift or positive selection, similar to coding regions. Most SNPs in minimotif were common variants, but with minor allele frequencies generally <10%. This was supported by low substation rates and few newly derived minimotifs. Several minimotif alleles showed different intercontinental and regional geographic distributions, strongly suggesting a role for minimotifs in adaptive evolution. We also note that 4% of PTM minimotif sites in histone tails were common variants, which has the potential to differentially affect DNA packaging among individuals. In conclusion, minimotifs are a source of functional genetic variation in the human population; thus, they are likely to be an important target of selection and evolution.
PMCID: PMC4513861  PMID: 26068475
3.  Border Length Minimization Problem on a Square Array 
Journal of Computational Biology  2014;21(6):446-455.
Protein/peptide microarrays are rapidly gaining momentum in the diagnosis of cancer. High-density and high-throughput peptide arrays are being extensively used to detect tumor biomarkers, examine kinase activity, identify antibodies having low serum titers, and locate antibody signatures. Improving the yield of microarray fabrication involves solving a hard combinatorial optimization problem called the border length minimization problem (BLMP). An important question that remained open for the past 7 years is if the BLMP is tractable or not. We settle this open problem by proving that the BLMP is \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${ \cal NP}$$\end{document}-hard. We also present a hierarchical refinement algorithm that can refine any heuristic solution for the BLMP and prove that the TSP+1-threading heuristic is an O(N)-approximation.
PMCID: PMC4038986  PMID: 24528000
algorithms; combinatorial optimization; DNA self-assembly; genomics; machine learning
4.  RLT-S: A Web System for Record Linkage 
PLoS ONE  2015;10(5):e0124449.
Record linkage integrates records across multiple related data sources identifying duplicates and accounting for possible errors. Real life applications require efficient algorithms to merge these voluminous data sources to find out all records belonging to same individuals. Our recently devised highly efficient record linkage algorithms provide best-known solutions to this challenging problem.
We have developed RLT-S, a freely available web tool, which implements our single linkage clustering algorithm for record linkage. This tool requires input data sets and a small set of configuration settings about these files to work efficiently. RLT-S employs exact match clustering, blocking on a specified attribute and single linkage based hierarchical clustering among these blocks.
RLT-S is an implementation package of our sequential record linkage algorithm. It outperforms previous best-known implementations by a large margin. The tool is at least two times faster for any dataset than the previous best-known tools.
RLT-S tool implements our record linkage algorithm that outperforms previous best-known algorithms in this area. This website also contains necessary information such as instructions, submission history, feedback, publications and some other sections to facilitate the usage of the tool.
RLT-S is integrated into, which is currently serving this tool only. The tool is freely available and can be used without login. All data files used in this paper have been stored in For copies of the relevant programs please see
PMCID: PMC4420456  PMID: 25942687
5.  A molecular structure matching approach to efficient identification of endogenous mammalian biochemical structures 
BMC Bioinformatics  2015;16(Suppl 5):S11.
Metabolomics is the study of small molecules, called metabolites, of a cell, tissue or organism. It is of particular interest as endogenous metabolites represent the phenotype resulting from gene expression. A major challenge in metabolomics research is the structural identification of unknown biochemical compounds in complex biofluids. In this paper we present an efficient cheminformatics tool, BioSMXpress that uses known endogenous mammalian biochemicals and graph matching methods to identify endogenous mammalian biochemical structures in chemical structure space. The results of a comprehensive set of empirical experiments suggest that BioSMXpress identifies endogenous mammalian biochemical structures with high accuracy. BioSMXpress is 8 times faster than our previous work BioSM without compromising the accuracy of the predictions made. BioSMXpress is freely available at
PMCID: PMC4402589  PMID: 25859612
Metabolomics; Metabolite Identification; Chemical Structure Matching; Biochemical Structures
6.  A Two-Pass Exact Algorithm for Selection on Parallel Disk Systems 
Numerous OLAP queries process selection operations of “top N”, median, “top 5%”, in data warehousing applications. Selection is a well-studied problem that has numerous applications in the management of data and databases since, typically, any complex data query can be reduced to a series of basic operations such as sorting and selection. The parallel selection has also become an important fundamental operation, especially after parallel databases were introduced. In this paper, we present a deterministic algorithm Recursive Sampling Selection (RSS) to solve the exact out-of-core selection problem, which we show needs no more than (2 + ε) passes (ε being a very small fraction). We have compared our RSS algorithm with two other algorithms in the literature, namely, the Deterministic Sampling Selection and QuickSelect on the Parallel Disks Systems. Our analysis shows that DSS is a (2 + ε)-pass algorithm when the total number of input elements N is a polynomial in the memory size M (i.e., N = Mc for some constant c). While, our proposed algorithm RSS runs in (2 + ε) passes without any assumptions. Experimental results indicate that both RSS and DSS outperform QuickSelect on the Parallel Disks Systems. Especially, the proposed algorithm RSS is more scalable and robust to handle big data when the input size is far greater than the core memory size, including the case of N ≫ Mc.
PMCID: PMC4217089  PMID: 25374478
Selection; Parallel Disk System; Median; OLAP queries
Sequence assembly from short reads is an important problem in biology. It is known that solving the sequence assembly problem exactly on a bi-directed de Bruijn graph or a string graph is intractable. However, finding a shortest double stranded DNA string (SDDNA) containing all the k-long words in the reads seems to be a good heuristic to get close to the original genome. This problem is equivalent to finding a cyclic Chinese Postman (CP) walk on the underlying unweighted bi-directed de Bruijn graph built from the reads. The Chinese Postman walk Problem (CPP) is solved by reducing it to a general bi-directed flow on this graph which runs in O(|E|2 log2(|V|)) time. In this paper we show that the cyclic CPP on bi-directed graphs can be solved without reducing it to bi-directed flow. We present a Θ(p(|V| + |E|) log(|V|) + (dmaxp)3) time algorithm to solve the cyclic CPP on a weighted bi-directed de Bruijn graph, where p = max{|{υ∣din(υ) − dout(υ) > 0}|, |{υ∣din(υ) − dout(υ) < 0}|} and dmax = max{∣din(υ) − dout(υ)}. Our algorithm performs asymptotically better than the bi-directed flow algorithm when the number of imbalanced nodes p is much less than the nodes in the bi-directed graph. From our experimental results on various datasets, we have noticed that the value of p/|V| lies between 0.08% and 0.13% with 95% probability. Many practical bi-directed de Bruijn graphs do not have cyclic CP walks. In such cases it is not clear how the bi-directed flow can be useful in identifying contigs. Our algorithm can handle such situations and identify maximal bi-directed sub-graphs that have CP walks. A Θ(p(|V| + |E|)) time heuristic algorithm based on these ideas has been implemented for the SDDNA problem. This algorithm was tested on short reads from a plant genome and achieves an approximation ratio of at most 1.0134. We also present a Θ((|V| + |E|) log(V)) time algorithm for the single source shortest path problem on bi-directed de Bruijn graphs, which may be of independent interest.
PMCID: PMC4215799  PMID: 25364472
Sequence assembly algorithms; bioinformatics; Chinese Postman problem; bi-directed graphs
8.  PMS6MC: A Multicore Algorithm for Motif Discovery 
Algorithms  2013;6(4):805-823.
We develop an efficient multicore algorithm, PMS6MC, for the (l, d)-motif discovery problem in which we are to find all strings of length l that appear in every string of a given set of strings with at most d mismatches. PMS6MC is based on PMS6, which is currently the fastest single-core algorithm for motif discovery in large instances. The speedup, relative to PMS6, attained by our multicore algorithm ranges from a high of 6.62 for the (17,6) challenging instances to a low of 2.75 for the (13,4) challenging instances on an Intel 6-core system. We estimate that PMS6MC is 2 to 4 times faster than other parallel algorithms for motif search on large instances.
PMCID: PMC4193679  PMID: 25309700
Planted motif search; parallel string algorithms; multi-core algorithms
9.  Efficient and scalable scaffolding using optical restriction maps 
BMC Genomics  2014;15(Suppl 5):S5.
In the next generation sequencing techniques millions of short reads are produced from a genomic sequence at a single run. The chances of low read coverage to some regions of the sequence are very high. The reads are short and very large in number. Due to erroneous base calling, there could be errors in the reads. As a consequence, sequence assemblers often fail to sequence an entire DNA molecule and instead output a set of overlapping segments that together represent a consensus region of the DNA. This set of overlapping segments are collectively called contigs in the literature. The final step of the sequencing process, called scaffolding, is to assemble the contigs into a correct order. Scaffolding techniques typically exploit additional information such as mate-pairs, pair-ends, or optical restriction maps. In this paper we introduce a series of novel algorithms for scaffolding that exploit optical restriction maps (ORMs). Simulation results show that our algorithms are indeed reliable, scalable, and efficient compared to the best known algorithms in the literature.
PMCID: PMC4120203  PMID: 25081913
Next Generation Sequencing (NGS); Optical Restriction Map (ORM); Human Genome Project (HGP); String Graph Assembler (SGA); Contigs; Scaffolding
10.  BioSM: A metabolomics tool for identifying endogenous mammalian biochemical structures in chemical structure space 
The structural identification of unknown biochemical compounds in complex biofluids continues to be a major challenge in metabolomics research. Using LC/MS there are currently two major options for solving this problem: searching small biochemical databases, which often do not contain the unknown of interest, or searching large chemical databases which include large numbers of non-biochemical compounds. Searching larger chemical databases (larger chemical space) increases the odds of identifying an unknown biochemical compound, but only if non-biochemical structures can be eliminated from consideration. In this paper we present BioSM; a cheminformatics tool that uses known endogenous mammalian biochemical compounds (as scaffolds) and graph matching methods to identify endogenous mammalian biochemical structures in chemical structure space. The results of a comprehensive set of empirical experiments suggest that BioSM identifies endogenous mammalian biochemical structures with high accuracy. In a leave-one-out cross validation experiment, BioSM correctly predicted 95% of 1,388 Kyoto Encyclopedia of Genes and Genomes (KEGG) compounds as endogenous mammalian biochemicals using 1,565 scaffolds. Analysis of two additional biological datasets containing 2,330 human metabolites (HMDB) and 2,416 plant secondary metabolites (KEGG) resulted in biochemical annotations of 89% and 72% of the compounds respectively. When a dataset of 3,895 drugs (DrugBank and USAN) was tested, 48% of these structures were predicted to be biochemical. However, when a set of synthetic chemical compounds (Chembridge and Chemsynthesis databases) were examined, only 29% of the 458,207 structures were predicted to be biochemical. Moreover, BioSM predicted that 34% of 883,199 randomly selected compounds from PubChem were biochemical. We then expanded the scaffold list to 3,927 biochemical compounds and reevaluated the above datasets to determine whether scaffold number influenced model performance. Although there were significant improvements in model sensitivity and specificity using the larger scaffold list, the dataset comparison results were very similar. These results suggest that additional biochemical scaffolds will not further improve our representation of biochemical structure space and that the model is reasonably robust. BioSM provides a qualitative (yes/no) and quantitative (ranking) method for endogenous mammalian biochemical annotation of chemical space, and thus will be useful in the identification of unknown biochemical structures in metabolomics. BioSM is freely available at
PMCID: PMC3866231  PMID: 23330685
Metabolomics; cheminformatics; structural similarity; graph matching; metabolite identification
11.  Efficient sequential and parallel algorithms for planted motif search 
BMC Bioinformatics  2014;15:34.
Motif searching is an important step in the detection of rare events occurring in a set of DNA or protein sequences. One formulation of the problem is known as (l,d)-motif search or Planted Motif Search (PMS). In PMS we are given two integers l and d and n biological sequences. We want to find all sequences of length l that appear in each of the input sequences with at most d mismatches. The PMS problem is NP-complete. PMS algorithms are typically evaluated on certain instances considered challenging. Despite ample research in the area, a considerable performance gap exists because many state of the art algorithms have large runtimes even for moderately challenging instances.
This paper presents a fast exact parallel PMS algorithm called PMS8. PMS8 is the first algorithm to solve the challenging (l,d) instances (25,10) and (26,11). PMS8 is also efficient on instances with larger l and d such as (50,21). We include a comparison of PMS8 with several state of the art algorithms on multiple problem instances. This paper also presents necessary and sufficient conditions for 3 l-mers to have a common d-neighbor. The program is freely available at
We present PMS8, an efficient exact algorithm for Planted Motif Search. PMS8 introduces novel ideas for generating common neighborhoods. We have also implemented a parallel version for this algorithm. PMS8 can solve instances not solved by any previous algorithms.
PMCID: PMC3924400  PMID: 24479443
12.  PMS: A Panoptic Motif Search Tool 
PLoS ONE  2013;8(12):e80660.
Identification of DNA/Protein motifs is a crucial problem for biologists. Computational techniques could be of great help in this identification. In this direction, many computational models for motifs have been proposed in the literature.
One such important model is the motif model. In this paper we describe a motif search web tool that predominantly employs this motif model. This web tool exploits the state-of-the art algorithms for solving the motif search problem.
The online tool has been helping scientists identify many unknown motifs. Many of our predictions have been successfully verified as well. We hope that this paper will expose this crucial tool to many more scientists.
Availability and requirements
Project name: PMS - Panoptic Motif Search Tool. Project home page: or Licence: PMS tools will be readily available to any scientist wishing to use it for non-commercial purposes, without restrictions. The online tool is freely available without login.
PMCID: PMC3851466  PMID: 24324619
13.  Self assembly of rectangular shapes on concentration programming and probabilistic tile assembly models 
Natural computing  2012;11(2):10.1007/s11047-012-9313-1.
Efficient tile sets for self assembling rectilinear shapes is of critical importance in algorithmic self assembly. A lower bound on the tile complexity of any deterministic self assembly system for an n × n square is Ω(log(n)log(log(n))) (inferred from the Kolmogrov complexity). Deterministic self assembly systems with an optimal tile complexity have been designed for squares and related shapes in the past. However designing Θ(log(n)log(log(n))) unique tiles specific to a shape is still an intensive task in the laboratory. On the other hand copies of a tile can be made rapidly using PCR (polymerase chain reaction) experiments. This led to the study of self assembly on tile concentration programming models. We present two major results in this paper on the concentration programming model. First we show how to self assemble rectangles with a fixed aspect ratio (α:β), with high probability, using Θ(α + β) tiles. This result is much stronger than the existing results by Kao et al. (Randomized self-assembly for approximate shapes, LNCS, vol 5125. Springer, Heidelberg, 2008) and Doty (Randomized self-assembly for exact shapes. In: proceedings of the 50th annual IEEE symposium on foundations of computer science (FOCS), IEEE, Atlanta. pp 85–94, 2009)—which can only self assembly squares and rely on tiles which perform binary arithmetic. On the other hand, our result is based on a technique called staircase sampling. This technique eliminates the need for sub-tiles which perform binary arithmetic, reduces the constant in the asymptotic bound, and eliminates the need for approximate frames (Kao et al. Randomized self-assembly for approximate shapes, LNCS, vol 5125. Springer, Heidelberg, 2008). Our second result applies staircase sampling on the equimolar concentration programming model (The tile complexity of linear assemblies. In: proceedings of the 36th international colloquium automata, languages and programming: Part I on ICALP ’09, Springer-Verlag, pp 235–253, 2009), to self assemble rectangles (of fixed aspect ratio) with high probability. The tile complexity of our algorithm is Θ(log(n)) and is optimal on the probabilistic tile assembly model (PTAM)—n being an upper bound on the dimensions of a rectangle.
PMCID: PMC3848893  PMID: 24311993
Tile assembly model; Concentration programming; Self assembly algorithms; Probabilistic self assembly model (PTAM); Randomized algorithms
14.  Efficient sequential and parallel algorithms for record linkage 
Background and objective
Integrating data from multiple sources is a crucial and challenging problem. Even though there exist numerous algorithms for record linkage or deduplication, they suffer from either large time needs or restrictions on the number of datasets that they can integrate. In this paper we report efficient sequential and parallel algorithms for record linkage which handle any number of datasets and outperform previous algorithms.
Our algorithms employ hierarchical clustering algorithms as the basis. A key idea that we use is radix sorting on certain attributes to eliminate identical records before any further processing. Another novel idea is to form a graph that links similar records and find the connected components.
Our sequential and parallel algorithms have been tested on a real dataset of 1 083 878 records and synthetic datasets ranging in size from 50 000 to 9 000 000 records. Our sequential algorithm runs at least two times faster, for any dataset, than the previous best-known algorithm, the two-phase algorithm using faster computation of the edit distance (TPA (FCED)). The speedups obtained by our parallel algorithm are almost linear. For example, we get a speedup of 7.5 with 8 cores (residing in a single node), 14.1 with 16 cores (residing in two nodes), and 26.4 with 32 cores (residing in four nodes).
We have compared the performance of our sequential algorithm with TPA (FCED) and found that our algorithm outperforms the previous one. The accuracy is the same as that of this previous best-known algorithm.
PMCID: PMC3932463  PMID: 24154837
Data Integration; Healthcare Records; Algorithms; Parallel Algorithms; Speedups
15.  Accelerating materials property predictions using machine learning 
Scientific Reports  2013;3:2810.
The materials discovery process can be significantly expedited and simplified if we can learn effectively from available knowledge and data. In the present contribution, we show that efficient and accurate prediction of a diverse set of properties of material systems is possible by employing machine (or statistical) learning methods trained on quantum mechanical computations in combination with the notions of chemical similarity. Using a family of one-dimensional chain systems, we present a general formalism that allows us to discover decision rules that establish a mapping between easily accessible attributes of a system and its properties. It is shown that fingerprints based on either chemo-structural (compositional and configurational information) or the electronic charge density distribution can be used to make ultra-fast, yet accurate, property predictions. Harnessing such learning paradigms extends recent efforts to systematically explore and mine vast chemical spaces, and can significantly accelerate the discovery of new application-specific materials.
PMCID: PMC3786293  PMID: 24077117
16.  Integrating Clonal Selection and Deterministic Sampling for Efficient Associative Classification 
Traditional Associative Classification (AC) algorithms typically search for all possible association rules to find a representative subset of those rules. Since the search space of such rules may grow exponentially as the support threshold decreases, the rules discovery process can be computationally expensive. One effective way to tackle this problem is to directly find a set of high-stakes association rules that potentially builds a highly accurate classifier. This paper introduces AC-CS, an AC algorithm that integrates the clonal selection of the immune system along with deterministic data sampling. Upon picking a representative sample of the original data, it proceeds in an evolutionary fashion to populate only rules that are likely to yield good classification accuracy. Empirical results on several real datasets show that the approach generates dramatically less rules than traditional AC algorithms. In addition, the proposed approach is significantly more efficient than traditional AC algorithms while achieving a competitive accuracy.
PMCID: PMC3770136  PMID: 24500504
17.  PMS6: A Fast Algorithm for Motif Discovery* 
We propose a new algorithm, PMS6, for the (l, d)-motif discovery problem in which we are to find all strings of length l that appear in every string of a given set of strings with at most d mismatches. The run time ratio PMS5/PMS6, where PMS5 is the fastest previously known algorithm for motif discovery in large instances, ranges from a high of 2.20 for the (21,8) challenge instances to a low of 1.69 for the (17,6) challenge instances. Both PMS5 and PMS6 require some amount of preprocessing. The preprocessing time for PMS6 is 34 times faster than that for PMS5 for (23,9) instances. When preprocessing time is factored in, the run time ratio PMS5/PMS6 is as high as 2.75 for (13,4) instances and as low as 1.95 for (17,6) instances.
PMCID: PMC3744182  PMID: 23959399
Planted motif search; string algorithms
18.  Efficient algorithms for biological stems search 
BMC Bioinformatics  2013;14:161.
Motifs are significant patterns in DNA, RNA, and protein sequences, which play an important role in biological processes and functions, like identification of open reading frames, RNA transcription, protein binding, etc. Several versions of the motif search problem have been studied in the literature. One such version is called the Planted Motif Search (PMS)or (l, d)-motif Search. PMS is known to be NP complete. The time complexities of most of the planted motif search algorithms depend exponentially on the alphabet size. Recently a new version of the motif search problem has been introduced by Kuksa and Pavlovic. We call this version as the Motif Stems Search (MSS) problem. A motif stem is an l-mer (for some relevant value of l)with some wildcard characters and hence corresponds to a set of l-mers (without wildcards), some of which are (l, d)-motifs. Kuksa and Pavlovic have presented an efficient algorithm to find motif stems for inputs from large alphabets. Ideally, the number of stems output should be as small as possible since the stems form a superset of the motifs.
In this paper we propose an efficient algorithm for MSS and evaluate it on both synthetic and real data. This evaluation reveals that our algorithm is much faster than Kuksa and Pavlovic’s algorithm.
Our MSS algorithm outperforms the algorithm of Kuksa and Pavlovic in terms of the run time as well as the number of stems output. Specifically, the stems output by our algorithm form a proper (and much smaller)subset of the stems output by Kuksa and Pavlovic’s algorithm.
PMCID: PMC3679804  PMID: 23679045
19.  Efficient techniques for genotype‐phenotype correlational analysis 
Single Nucleotide Polymorphisms (SNPs) are sequence variations found in individuals at some specific points in the genomic sequence. As SNPs are highly conserved throughout evolution and within a population, the map of SNPs serves as an excellent genotypic marker. Conventional SNPs analysis mechanisms suffer from large run times, inefficient memory usage, and frequent overestimation. In this paper, we propose efficient, scalable, and reliable algorithms to select a small subset of SNPs from a large set of SNPs which can together be employed to perform phenotypic classification.
Our algorithms exploit the techniques of gene selection and random projections to identify a meaningful subset of SNPs. To the best of our knowledge, these techniques have not been employed before in the context of genotype‐phenotype correlations. Random projections are used to project the input data into a lower dimensional space (closely preserving distances). Gene selection is then applied on the projected data to identify a subset of the most relevant SNPs.
We have compared the performance of our algorithms with one of the currently known best algorithms called Multifactor Dimensionality Reduction (MDR), and Principal Component Analysis (PCA) technique. Experimental results demonstrate that our algorithms are superior in terms of accuracy as well as run time.
In our proposed techniques, random projection is used to map data from a high dimensional space to a lower dimensional space, and thus overcomes the curse of dimensionality problem. From this space of reduced dimension, we select the best subset of attributes. It is a unique mechanism in the domain of SNPs analysis, and to the best of our knowledge it is not employed before. As revealed by our experimental results, our proposed techniques offer the potential of high accuracies while keeping the run times low.
PMCID: PMC3686582  PMID: 23557276
Feature Selection Algorithm (FSA); Gene Selection Algorithm (GSA); Multifactor Dimensionality Reduction (MDR); Random Projection (RP); Single‐Nucleotide Polymorphism (SNP); Support Vector Machine (SVM); Principal Component Analysis (PCA)
20.  Secondary Structure, a Missing Component of Sequence-Based Minimotif Definitions 
PLoS ONE  2012;7(12):e49957.
Minimotifs are short contiguous segments of proteins that have a known biological function. The hundreds of thousands of minimotifs discovered thus far are an important part of the theoretical understanding of the specificity of protein-protein interactions, posttranslational modifications, and signal transduction that occur in cells. However, a longstanding problem is that the different abstractions of the sequence definitions do not accurately capture the specificity, despite decades of effort by many labs. We present evidence that structure is an essential component of minimotif specificity, yet is not used in minimotif definitions. Our analysis of several known minimotifs as case studies, analysis of occurrences of minimotifs in structured and disordered regions of proteins, and review of the literature support a new model for minimotif definitions that includes sequence, structure, and function.
PMCID: PMC3517595  PMID: 23236358
21.  Efficient Out of Core Sorting Algorithms for the Parallel Disks Model⋆ 
In this paper1 we present efficient algorithms for sorting on the Parallel Disks Model (PDM). Numerous asymptotically optimal algorithms have been proposed in the literature. However many of these merge based algorithms have large underlying constants in the time bounds, because they suffer from the lack of read parallelism on PDM. The irregular consumption of the runs during the merge affects the read parallelism and contributes to the increased sorting time. In this paper we first introduce a novel idea called the dirty sequence accumulation that improves the read parallelism. Secondly, we show analytically that this idea can reduce the number of parallel I/O’s required to sort the input close to the lower bound of Ω(logMB(NM)). We experimentally verify our dirty sequence idea with the standard R-Way merge and show that our idea can reduce the number of parallel I/Os to sort on PDM significantly.
PMCID: PMC3199586  PMID: 22034549
22.  Achieving High Accuracy Prediction of Minimotifs 
PLoS ONE  2012;7(9):e45589.
The low complexity of minimotif patterns results in a high false-positive prediction rate, hampering protein function prediction. A multi-filter algorithm, trained and tested on a linear regression model, support vector machine model, and neural network model, using a large dataset of verified minimotifs, vastly improves minimotif prediction accuracy while generating few false positives. An optimal threshold for the best accuracy reaches an overall accuracy above 90%, while a stringent threshold for the best specificity generates less than 1% false positives or even no false positives and still produces more than 90% true positives for the linear regression and neural network models. The minimotif multi-filter with its excellent accuracy represents the state-of-the-art in minimotif prediction and is expected to be very useful to biologists investigating protein function and how missense mutations cause disease.
PMCID: PMC3459956  PMID: 23029121
23.  qPMS7: A Fast Algorithm for Finding (ℓ, d)-Motifs in DNA and Protein Sequences 
PLoS ONE  2012;7(7):e41425.
Detection of rare events happening in a set of DNA/protein sequences could lead to new biological discoveries. One kind of such rare events is the presence of patterns called motifs in DNA/protein sequences. Finding motifs is a challenging problem since the general version of motif search has been proven to be intractable. Motifs discovery is an important problem in biology. For example, it is useful in the detection of transcription factor binding sites and transcriptional regulatory elements that are very crucial in understanding gene function, human disease, drug design, etc. Many versions of the motif search problem have been proposed in the literature. One such is the -motif search (or Planted Motif Search (PMS)). A generalized version of the PMS problem, namely, Quorum Planted Motif Search (qPMS), is shown to accurately model motifs in real data. However, solving the qPMS problem is an extremely difficult task because a special case of it, the PMS Problem, is already NP-hard, which means that any algorithm solving it can be expected to take exponential time in the worse case scenario. In this paper, we propose a novel algorithm named qPMS7 that tackles the qPMS problem on real data as well as challenging instances. Experimental results show that our Algorithm qPMS7 is on an average 5 times faster than the state-of-art algorithm. The executable program of Algorithm qPMS7 is freely available on the web at Our online motif discovery tools that use Algorithm qPMS7 are freely available at or
PMCID: PMC3404135  PMID: 22848493
24.  A memory-efficient data structure representing exact-match overlap graphs with application for next-generation DNA assembly 
Bioinformatics  2011;27(14):1901-1907.
Motivation: Exact-match overlap graphs have been broadly used in the context of DNA assembly and the shortest super string problem where the number of strings n ranges from thousands to billions. The length ℓ of the strings is from 25 to 1000, depending on the DNA sequencing technologies. However, many DNA assemblers using overlap graphs suffer from the need for too much time and space in constructing the graphs. It is nearly impossible for these DNA assemblers to handle the huge amount of data produced by the next-generation sequencing technologies where the number n of strings could be several billions. If the overlap graph is explicitly stored, it would require Ω(n2) memory, which could be prohibitive in practice when n is greater than a hundred million. In this article, we propose a novel data structure using which the overlap graph can be compactly stored. This data structure requires only linear time to construct and and linear memory to store.
Results: For a given set of input strings (also called reads), we can informally define an exact-match overlap graph as follows. Each read is represented as a node in the graph and there is an edge between two nodes if the corresponding reads overlap sufficiently. A formal description follows. The maximal exact-match overlap of two strings x and y, denoted by ovmax(x, y), is the longest string which is a suffix of x and a prefix of y. The exact-match overlap graph of n given strings of length ℓ is an edge-weighted graph in which each vertex is associated with a string and there is an edge (x, y) of weight ω=ℓ−|ovmax(x, y)| if and only if ω≤λ, where |ovmax(x, y)| is the length of ovmax(x, y) and λ is a given threshold. In this article, we show that the exact-match overlap graphs can be represented by a compact data structure that can be stored using at most (2λ−1)(2⌈logn⌉+⌈logλ⌉)n bits with a guarantee that the basic operation of accessing an edge takes O(log λ) time. We also propose two algorithms for constructing the data structure for the exact-match overlap graph. The first algorithm runs in O(λℓnlogn) worse-case time and requires O(λ) extra memory. The second one runs in O(λℓn) time and requires O(n) extra memory. Our experimental results on a huge amount of simulated data from sequence assembly show that the data structure can be constructed efficiently in time and memory.
Availability: Our DNA sequence assembler that incorporates the data structure is freely available on the web at
PMCID: PMC3129531  PMID: 21636593
25.  Efficient algorithms for fast integration on large data sets from multiple sources 
Recent large scale deployments of health information technology have created opportunities for the integration of patient medical records with disparate public health, human service, and educational databases to provide comprehensive information related to health and development. Data integration techniques, which identify records belonging to the same individual that reside in multiple data sets, are essential to these efforts. Several algorithms have been proposed in the literatures that are adept in integrating records from two different datasets. Our algorithms are aimed at integrating multiple (in particular more than two) datasets efficiently.
Hierarchical clustering based solutions are used to integrate multiple (in particular more than two) datasets. Edit distance is used as the basic distance calculation, while distance calculation of common input errors is also studied. Several techniques have been applied to improve the algorithms in terms of both time and space: 1) Partial Construction of the Dendrogram (PCD) that ignores the level above the threshold; 2) Ignoring the Dendrogram Structure (IDS); 3) Faster Computation of the Edit Distance (FCED) that predicts the distance with the threshold by upper bounds on edit distance; and 4) A pre-processing blocking phase that limits dynamic computation within each block.
We have experimentally validated our algorithms on large simulated as well as real data. Accuracy and completeness are defined stringently to show the performance of our algorithms. In addition, we employ a four-category analysis. Comparison with FEBRL shows the robustness of our approach.
In the experiments we conducted, the accuracy we observed exceeded 90% for the simulated data in most cases. 97.7% and 98.1% accuracy were achieved for the constant and proportional threshold, respectively, in a real dataset of 1,083,878 records.
PMCID: PMC3439324  PMID: 22741525

Results 1-25 (36)