Numerous OLAP queries process selection operations of “top N”, median, “top 5%”, in data warehousing applications. Selection is a well-studied problem that has numerous applications in the management of data and databases since, typically, any complex data query can be reduced to a series of basic operations such as sorting and selection. The parallel selection has also become an important fundamental operation, especially after parallel databases were introduced. In this paper, we present a deterministic algorithm Recursive Sampling Selection (RSS) to solve the exact out-of-core selection problem, which we show needs no more than (2 + ε) passes (ε being a very small fraction). We have compared our RSS algorithm with two other algorithms in the literature, namely, the Deterministic Sampling Selection and QuickSelect on the Parallel Disks Systems. Our analysis shows that DSS is a (2 + ε)-pass algorithm when the total number of input elements N is a polynomial in the memory size M (i.e., N = Mc for some constant c). While, our proposed algorithm RSS runs in (2 + ε) passes without any assumptions. Experimental results indicate that both RSS and DSS outperform QuickSelect on the Parallel Disks Systems. Especially, the proposed algorithm RSS is more scalable and robust to handle big data when the input size is far greater than the core memory size, including the case of N ≫ Mc.
doi:10.1109/ISCC.2013.6755015
PMCID: PMC4217089
PMID: 25374478
Selection; Parallel Disk System; Median; OLAP queries
Sequence assembly from short reads is an important problem in biology. It is known that solving the sequence assembly problem exactly on a bi-directed de Bruijn graph or a string graph is intractable. However, finding a shortest double stranded DNA string (SDDNA) containing all the k-long words in the reads seems to be a good heuristic to get close to the original genome. This problem is equivalent to finding a cyclic Chinese Postman (CP) walk on the underlying unweighted bi-directed de Bruijn graph built from the reads. The Chinese Postman walk Problem (CPP) is solved by reducing it to a general bi-directed flow on this graph which runs in O(|E|2 log2(|V|)) time. In this paper we show that the cyclic CPP on bi-directed graphs can be solved without reducing it to bi-directed flow. We present a Θ(p(|V| + |E|) log(|V|) + (dmaxp)3) time algorithm to solve the cyclic CPP on a weighted bi-directed de Bruijn graph, where p = max{|{υ∣din(υ) − dout(υ) > 0}|, |{υ∣din(υ) − dout(υ) < 0}|} and dmax = max{∣din(υ) − dout(υ)}. Our algorithm performs asymptotically better than the bi-directed flow algorithm when the number of imbalanced nodes p is much less than the nodes in the bi-directed graph. From our experimental results on various datasets, we have noticed that the value of p/|V| lies between 0.08% and 0.13% with 95% probability. Many practical bi-directed de Bruijn graphs do not have cyclic CP walks. In such cases it is not clear how the bi-directed flow can be useful in identifying contigs. Our algorithm can handle such situations and identify maximal bi-directed sub-graphs that have CP walks. A Θ(p(|V| + |E|)) time heuristic algorithm based on these ideas has been implemented for the SDDNA problem. This algorithm was tested on short reads from a plant genome and achieves an approximation ratio of at most 1.0134. We also present a Θ((|V| + |E|) log(V)) time algorithm for the single source shortest path problem on bi-directed de Bruijn graphs, which may be of independent interest.
PMCID: PMC4215799
PMID: 25364472
Sequence assembly algorithms; bioinformatics; Chinese Postman problem; bi-directed graphs
We develop an efficient multicore algorithm, PMS6MC, for the (l, d)-motif discovery problem in which we are to find all strings of length l that appear in every string of a given set of strings with at most d mismatches. PMS6MC is based on PMS6, which is currently the fastest single-core algorithm for motif discovery in large instances. The speedup, relative to PMS6, attained by our multicore algorithm ranges from a high of 6.62 for the (17,6) challenging instances to a low of 2.75 for the (13,4) challenging instances on an Intel 6-core system. We estimate that PMS6MC is 2 to 4 times faster than other parallel algorithms for motif search on large instances.
doi:10.3390/a6040805
PMCID: PMC4193679
PMID: 25309700
Planted motif search; parallel string algorithms; multi-core algorithms
In the next generation sequencing techniques millions of short reads are produced from a genomic sequence at a single run. The chances of low read coverage to some regions of the sequence are very high. The reads are short and very large in number. Due to erroneous base calling, there could be errors in the reads. As a consequence, sequence assemblers often fail to sequence an entire DNA molecule and instead output a set of overlapping segments that together represent a consensus region of the DNA. This set of overlapping segments are collectively called contigs in the literature. The final step of the sequencing process, called scaffolding, is to assemble the contigs into a correct order. Scaffolding techniques typically exploit additional information such as mate-pairs, pair-ends, or optical restriction maps. In this paper we introduce a series of novel algorithms for scaffolding that exploit optical restriction maps (ORMs). Simulation results show that our algorithms are indeed reliable, scalable, and efficient compared to the best known algorithms in the literature.
doi:10.1186/1471-2164-15-S5-S5
PMCID: PMC4120203
PMID: 25081913
Next Generation Sequencing (NGS); Optical Restriction Map (ORM); Human Genome Project (HGP); String Graph Assembler (SGA); Contigs; Scaffolding
The structural identification of unknown biochemical compounds in complex biofluids continues to be a major challenge in metabolomics research. Using LC/MS there are currently two major options for solving this problem: searching small biochemical databases, which often do not contain the unknown of interest, or searching large chemical databases which include large numbers of non-biochemical compounds. Searching larger chemical databases (larger chemical space) increases the odds of identifying an unknown biochemical compound, but only if non-biochemical structures can be eliminated from consideration. In this paper we present BioSM; a cheminformatics tool that uses known endogenous mammalian biochemical compounds (as scaffolds) and graph matching methods to identify endogenous mammalian biochemical structures in chemical structure space. The results of a comprehensive set of empirical experiments suggest that BioSM identifies endogenous mammalian biochemical structures with high accuracy. In a leave-one-out cross validation experiment, BioSM correctly predicted 95% of 1,388 Kyoto Encyclopedia of Genes and Genomes (KEGG) compounds as endogenous mammalian biochemicals using 1,565 scaffolds. Analysis of two additional biological datasets containing 2,330 human metabolites (HMDB) and 2,416 plant secondary metabolites (KEGG) resulted in biochemical annotations of 89% and 72% of the compounds respectively. When a dataset of 3,895 drugs (DrugBank and USAN) was tested, 48% of these structures were predicted to be biochemical. However, when a set of synthetic chemical compounds (Chembridge and Chemsynthesis databases) were examined, only 29% of the 458,207 structures were predicted to be biochemical. Moreover, BioSM predicted that 34% of 883,199 randomly selected compounds from PubChem were biochemical. We then expanded the scaffold list to 3,927 biochemical compounds and reevaluated the above datasets to determine whether scaffold number influenced model performance. Although there were significant improvements in model sensitivity and specificity using the larger scaffold list, the dataset comparison results were very similar. These results suggest that additional biochemical scaffolds will not further improve our representation of biochemical structure space and that the model is reasonably robust. BioSM provides a qualitative (yes/no) and quantitative (ranking) method for endogenous mammalian biochemical annotation of chemical space, and thus will be useful in the identification of unknown biochemical structures in metabolomics. BioSM is freely available at http://metabolomics.pharm.uconn.edu.
doi:10.1021/ci300512q
PMCID: PMC3866231
PMID: 23330685
Metabolomics; cheminformatics; structural similarity; graph matching; metabolite identification
Background
Motif searching is an important step in the detection of rare events occurring in a set of DNA or protein sequences. One formulation of the problem is known as (l,d)-motif search or Planted Motif Search (PMS). In PMS we are given two integers l and d and n biological sequences. We want to find all sequences of length l that appear in each of the input sequences with at most d mismatches. The PMS problem is NP-complete. PMS algorithms are typically evaluated on certain instances considered challenging. Despite ample research in the area, a considerable performance gap exists because many state of the art algorithms have large runtimes even for moderately challenging instances.
Results
This paper presents a fast exact parallel PMS algorithm called PMS8. PMS8 is the first algorithm to solve the challenging (l,d) instances (25,10) and (26,11). PMS8 is also efficient on instances with larger l and d such as (50,21). We include a comparison of PMS8 with several state of the art algorithms on multiple problem instances. This paper also presents necessary and sufficient conditions for 3 l-mers to have a common d-neighbor. The program is freely available at http://engr.uconn.edu/~man09004/PMS8/.
Conclusions
We present PMS8, an efficient exact algorithm for Planted Motif Search. PMS8 introduces novel ideas for generating common neighborhoods. We have also implemented a parallel version for this algorithm. PMS8 can solve instances not solved by any previous algorithms.
doi:10.1186/1471-2105-15-34
PMCID: PMC3924400
PMID: 24479443
Background
Identification of DNA/Protein motifs is a crucial problem for biologists. Computational techniques could be of great help in this identification. In this direction, many computational models for motifs have been proposed in the literature.
Methods
One such important model is the motif model. In this paper we describe a motif search web tool that predominantly employs this motif model. This web tool exploits the state-of-the art algorithms for solving the motif search problem.
Results
The online tool has been helping scientists identify many unknown motifs. Many of our predictions have been successfully verified as well. We hope that this paper will expose this crucial tool to many more scientists.
Availability and requirements
Project name: PMS - Panoptic Motif Search Tool. Project home page: http://pms.engr.uconn.edu or http://motifsearch.com. Licence: PMS tools will be readily available to any scientist wishing to use it for non-commercial purposes, without restrictions. The online tool is freely available without login.
doi:10.1371/journal.pone.0080660
PMCID: PMC3851466
PMID: 24324619
Efficient tile sets for self assembling rectilinear shapes is of critical importance in algorithmic self assembly. A lower bound on the tile complexity of any deterministic self assembly system for an n × n square is
Ω(log(n)log(log(n))) (inferred from the Kolmogrov complexity). Deterministic self assembly systems with an optimal tile complexity have been designed for squares and related shapes in the past. However designing
Θ(log(n)log(log(n))) unique tiles specific to a shape is still an intensive task in the laboratory. On the other hand copies of a tile can be made rapidly using PCR (polymerase chain reaction) experiments. This led to the study of self assembly on tile concentration programming models. We present two major results in this paper on the concentration programming model. First we show how to self assemble rectangles with a fixed aspect ratio (α:β), with high probability, using Θ(α + β) tiles. This result is much stronger than the existing results by Kao et al. (Randomized self-assembly for approximate shapes, LNCS, vol 5125. Springer, Heidelberg, 2008) and Doty (Randomized self-assembly for exact shapes. In: proceedings of the 50th annual IEEE symposium on foundations of computer science (FOCS), IEEE, Atlanta. pp 85–94, 2009)—which can only self assembly squares and rely on tiles which perform binary arithmetic. On the other hand, our result is based on a technique called staircase sampling. This technique eliminates the need for sub-tiles which perform binary arithmetic, reduces the constant in the asymptotic bound, and eliminates the need for approximate frames (Kao et al. Randomized self-assembly for approximate shapes, LNCS, vol 5125. Springer, Heidelberg, 2008). Our second result applies staircase sampling on the equimolar concentration programming model (The tile complexity of linear assemblies. In: proceedings of the 36th international colloquium automata, languages and programming: Part I on ICALP ’09, Springer-Verlag, pp 235–253, 2009), to self assemble rectangles (of fixed aspect ratio) with high probability. The tile complexity of our algorithm is Θ(log(n)) and is optimal on the probabilistic tile assembly model (PTAM)—n being an upper bound on the dimensions of a rectangle.
doi:10.1007/s11047-012-9313-1
PMCID: PMC3848893
PMID: 24311993
Tile assembly model; Concentration programming; Self assembly algorithms; Probabilistic self assembly model (PTAM); Randomized algorithms
Background and objective
Integrating data from multiple sources is a crucial and challenging problem. Even though there exist numerous algorithms for record linkage or deduplication, they suffer from either large time needs or restrictions on the number of datasets that they can integrate. In this paper we report efficient sequential and parallel algorithms for record linkage which handle any number of datasets and outperform previous algorithms.
Methods
Our algorithms employ hierarchical clustering algorithms as the basis. A key idea that we use is radix sorting on certain attributes to eliminate identical records before any further processing. Another novel idea is to form a graph that links similar records and find the connected components.
Results
Our sequential and parallel algorithms have been tested on a real dataset of 1 083 878 records and synthetic datasets ranging in size from 50 000 to 9 000 000 records. Our sequential algorithm runs at least two times faster, for any dataset, than the previous best-known algorithm, the two-phase algorithm using faster computation of the edit distance (TPA (FCED)). The speedups obtained by our parallel algorithm are almost linear. For example, we get a speedup of 7.5 with 8 cores (residing in a single node), 14.1 with 16 cores (residing in two nodes), and 26.4 with 32 cores (residing in four nodes).
Conclusions
We have compared the performance of our sequential algorithm with TPA (FCED) and found that our algorithm outperforms the previous one. The accuracy is the same as that of this previous best-known algorithm.
doi:10.1136/amiajnl-2013-002034
PMCID: PMC3932463
PMID: 24154837
Data Integration; Healthcare Records; Algorithms; Parallel Algorithms; Speedups
The materials discovery process can be significantly expedited and simplified if we can learn effectively from available knowledge and data. In the present contribution, we show that efficient and accurate prediction of a diverse set of properties of material systems is possible by employing machine (or statistical) learning methods trained on quantum mechanical computations in combination with the notions of chemical similarity. Using a family of one-dimensional chain systems, we present a general formalism that allows us to discover decision rules that establish a mapping between easily accessible attributes of a system and its properties. It is shown that fingerprints based on either chemo-structural (compositional and configurational information) or the electronic charge density distribution can be used to make ultra-fast, yet accurate, property predictions. Harnessing such learning paradigms extends recent efforts to systematically explore and mine vast chemical spaces, and can significantly accelerate the discovery of new application-specific materials.
doi:10.1038/srep02810
PMCID: PMC3786293
PMID: 24077117
Traditional Associative Classification (AC) algorithms typically search for all possible association rules to find a representative subset of those rules. Since the search space of such rules may grow exponentially as the support threshold decreases, the rules discovery process can be computationally expensive. One effective way to tackle this problem is to directly find a set of high-stakes association rules that potentially builds a highly accurate classifier. This paper introduces AC-CS, an AC algorithm that integrates the clonal selection of the immune system along with deterministic data sampling. Upon picking a representative sample of the original data, it proceeds in an evolutionary fashion to populate only rules that are likely to yield good classification accuracy. Empirical results on several real datasets show that the approach generates dramatically less rules than traditional AC algorithms. In addition, the proposed approach is significantly more efficient than traditional AC algorithms while achieving a competitive accuracy.
doi:10.1109/CEC.2013.6557966
PMCID: PMC3770136
PMID: 24500504
We propose a new algorithm, PMS6, for the (l, d)-motif discovery problem in which we are to find all strings of length l that appear in every string of a given set of strings with at most d mismatches. The run time ratio PMS5/PMS6, where PMS5 is the fastest previously known algorithm for motif discovery in large instances, ranges from a high of 2.20 for the (21,8) challenge instances to a low of 1.69 for the (17,6) challenge instances. Both PMS5 and PMS6 require some amount of preprocessing. The preprocessing time for PMS6 is 34 times faster than that for PMS5 for (23,9) instances. When preprocessing time is factored in, the run time ratio PMS5/PMS6 is as high as 2.75 for (13,4) instances and as low as 1.95 for (17,6) instances.
doi:10.1109/ICCABS.2012.6182627
PMCID: PMC3744182
PMID: 23959399
Planted motif search; string algorithms
Background
Motifs are significant patterns in DNA, RNA, and protein sequences, which play an important role in biological processes and functions, like identification of open reading frames, RNA transcription, protein binding, etc. Several versions of the motif search problem have been studied in the literature. One such version is called the Planted Motif Search (PMS)or (l, d)-motif Search. PMS is known to be NP complete. The time complexities of most of the planted motif search algorithms depend exponentially on the alphabet size. Recently a new version of the motif search problem has been introduced by Kuksa and Pavlovic. We call this version as the Motif Stems Search (MSS) problem. A motif stem is an l-mer (for some relevant value of l)with some wildcard characters and hence corresponds to a set of l-mers (without wildcards), some of which are (l, d)-motifs. Kuksa and Pavlovic have presented an efficient algorithm to find motif stems for inputs from large alphabets. Ideally, the number of stems output should be as small as possible since the stems form a superset of the motifs.
Results
In this paper we propose an efficient algorithm for MSS and evaluate it on both synthetic and real data. This evaluation reveals that our algorithm is much faster than Kuksa and Pavlovic’s algorithm.
Conclusions
Our MSS algorithm outperforms the algorithm of Kuksa and Pavlovic in terms of the run time as well as the number of stems output. Specifically, the stems output by our algorithm form a proper (and much smaller)subset of the stems output by Kuksa and Pavlovic’s algorithm.
doi:10.1186/1471-2105-14-161
PMCID: PMC3679804
PMID: 23679045
Background
Single Nucleotide Polymorphisms (SNPs) are sequence variations found in individuals at some specific points in the genomic sequence. As SNPs are highly conserved throughout evolution and within a population, the map of SNPs serves as an excellent genotypic marker. Conventional SNPs analysis mechanisms suffer from large run times, inefficient memory usage, and frequent overestimation. In this paper, we propose efficient, scalable, and reliable algorithms to select a small subset of SNPs from a large set of SNPs which can together be employed to perform phenotypic classification.
Methods
Our algorithms exploit the techniques of gene selection and random projections to identify a meaningful subset of SNPs. To the best of our knowledge, these techniques have not been employed before in the context of genotype‐phenotype correlations. Random projections are used to project the input data into a lower dimensional space (closely preserving distances). Gene selection is then applied on the projected data to identify a subset of the most relevant SNPs.
Results
We have compared the performance of our algorithms with one of the currently known best algorithms called Multifactor Dimensionality Reduction (MDR), and Principal Component Analysis (PCA) technique. Experimental results demonstrate that our algorithms are superior in terms of accuracy as well as run time.
Conclusions
In our proposed techniques, random projection is used to map data from a high dimensional space to a lower dimensional space, and thus overcomes the curse of dimensionality problem. From this space of reduced dimension, we select the best subset of attributes. It is a unique mechanism in the domain of SNPs analysis, and to the best of our knowledge it is not employed before. As revealed by our experimental results, our proposed techniques offer the potential of high accuracies while keeping the run times low.
doi:10.1186/1472-6947-13-41
PMCID: PMC3686582
PMID: 23557276
Feature Selection Algorithm (FSA); Gene Selection Algorithm (GSA); Multifactor Dimensionality Reduction (MDR); Random Projection (RP); Single‐Nucleotide Polymorphism (SNP); Support Vector Machine (SVM); Principal Component Analysis (PCA)
Sargeant, David P. | Gryk, Michael R. | Maciejewski, Mark W. | Thapar, Vishal | Kundeti, Vamsi | Rajasekaran, Sanguthevar | Romero, Pedro | Dunker, Keith | Li, Shun-Cheng | Kaneko, Tomonori | Schiller, Martin R.
Minimotifs are short contiguous segments of proteins that have a known biological function. The hundreds of thousands of minimotifs discovered thus far are an important part of the theoretical understanding of the specificity of protein-protein interactions, posttranslational modifications, and signal transduction that occur in cells. However, a longstanding problem is that the different abstractions of the sequence definitions do not accurately capture the specificity, despite decades of effort by many labs. We present evidence that structure is an essential component of minimotif specificity, yet is not used in minimotif definitions. Our analysis of several known minimotifs as case studies, analysis of occurrences of minimotifs in structured and disordered regions of proteins, and review of the literature support a new model for minimotif definitions that includes sequence, structure, and function.
doi:10.1371/journal.pone.0049957
PMCID: PMC3517595
PMID: 23236358
In this paper1 we present efficient algorithms for sorting on the Parallel Disks Model (PDM). Numerous asymptotically optimal algorithms have been proposed in the literature. However many of these merge based algorithms have large underlying constants in the time bounds, because they suffer from the lack of read parallelism on PDM. The irregular consumption of the runs during the merge affects the read parallelism and contributes to the increased sorting time. In this paper we first introduce a novel idea called the dirty sequence accumulation that improves the read parallelism. Secondly, we show analytically that this idea can reduce the number of parallel I/O’s required to sort the input close to the lower bound of Ω(logMB(NM)). We experimentally verify our dirty sequence idea with the standard R-Way merge and show that our idea can reduce the number of parallel I/Os to sort on PDM significantly.
doi:10.1016/j.jpdc.2011.07.004
PMCID: PMC3199586
PMID: 22034549
The low complexity of minimotif patterns results in a high false-positive prediction rate, hampering protein function prediction. A multi-filter algorithm, trained and tested on a linear regression model, support vector machine model, and neural network model, using a large dataset of verified minimotifs, vastly improves minimotif prediction accuracy while generating few false positives. An optimal threshold for the best accuracy reaches an overall accuracy above 90%, while a stringent threshold for the best specificity generates less than 1% false positives or even no false positives and still produces more than 90% true positives for the linear regression and neural network models. The minimotif multi-filter with its excellent accuracy represents the state-of-the-art in minimotif prediction and is expected to be very useful to biologists investigating protein function and how missense mutations cause disease.
doi:10.1371/journal.pone.0045589
PMCID: PMC3459956
PMID: 23029121
Detection of rare events happening in a set of DNA/protein sequences could lead to new biological discoveries. One kind of such rare events is the presence of patterns called motifs in DNA/protein sequences. Finding motifs is a challenging problem since the general version of motif search has been proven to be intractable. Motifs discovery is an important problem in biology. For example, it is useful in the detection of transcription factor binding sites and transcriptional regulatory elements that are very crucial in understanding gene function, human disease, drug design, etc. Many versions of the motif search problem have been proposed in the literature. One such is the -motif search (or Planted Motif Search (PMS)). A generalized version of the PMS problem, namely, Quorum Planted Motif Search (qPMS), is shown to accurately model motifs in real data. However, solving the qPMS problem is an extremely difficult task because a special case of it, the PMS Problem, is already NP-hard, which means that any algorithm solving it can be expected to take exponential time in the worse case scenario. In this paper, we propose a novel algorithm named qPMS7 that tackles the qPMS problem on real data as well as challenging instances. Experimental results show that our Algorithm qPMS7 is on an average 5 times faster than the state-of-art algorithm. The executable program of Algorithm qPMS7 is freely available on the web at http://pms.engr.uconn.edu/downloads/qPMS7.zip. Our online motif discovery tools that use Algorithm qPMS7 are freely available at http://pms.engr.uconn.edu or http://motifsearch.com.
doi:10.1371/journal.pone.0041425
PMCID: PMC3404135
PMID: 22848493
Motivation: Exact-match overlap graphs have been broadly used in the context of DNA assembly and the shortest super string problem where the number of strings n ranges from thousands to billions. The length ℓ of the strings is from 25 to 1000, depending on the DNA sequencing technologies. However, many DNA assemblers using overlap graphs suffer from the need for too much time and space in constructing the graphs. It is nearly impossible for these DNA assemblers to handle the huge amount of data produced by the next-generation sequencing technologies where the number n of strings could be several billions. If the overlap graph is explicitly stored, it would require Ω(n2) memory, which could be prohibitive in practice when n is greater than a hundred million. In this article, we propose a novel data structure using which the overlap graph can be compactly stored. This data structure requires only linear time to construct and and linear memory to store.
Results: For a given set of input strings (also called reads), we can informally define an exact-match overlap graph as follows. Each read is represented as a node in the graph and there is an edge between two nodes if the corresponding reads overlap sufficiently. A formal description follows. The maximal exact-match overlap of two strings x and y, denoted by ovmax(x, y), is the longest string which is a suffix of x and a prefix of y. The exact-match overlap graph of n given strings of length ℓ is an edge-weighted graph in which each vertex is associated with a string and there is an edge (x, y) of weight ω=ℓ−|ovmax(x, y)| if and only if ω≤λ, where |ovmax(x, y)| is the length of ovmax(x, y) and λ is a given threshold. In this article, we show that the exact-match overlap graphs can be represented by a compact data structure that can be stored using at most (2λ−1)(2⌈logn⌉+⌈logλ⌉)n bits with a guarantee that the basic operation of accessing an edge takes O(log λ) time. We also propose two algorithms for constructing the data structure for the exact-match overlap graph. The first algorithm runs in O(λℓnlogn) worse-case time and requires O(λ) extra memory. The second one runs in O(λℓn) time and requires O(n) extra memory. Our experimental results on a huge amount of simulated data from sequence assembly show that the data structure can be constructed efficiently in time and memory.
Availability: Our DNA sequence assembler that incorporates the data structure is freely available on the web at http://www.engr.uconn.edu/~htd06001/assembler/leap.zip
Contact: hdinh@engr.uconn.edu; rajasek@engr.uconn.edu
doi:10.1093/bioinformatics/btr321
PMCID: PMC3129531
PMID: 21636593
Background
Recent large scale deployments of health information technology have created opportunities for the integration of patient medical records with disparate public health, human service, and educational databases to provide comprehensive information related to health and development. Data integration techniques, which identify records belonging to the same individual that reside in multiple data sets, are essential to these efforts. Several algorithms have been proposed in the literatures that are adept in integrating records from two different datasets. Our algorithms are aimed at integrating multiple (in particular more than two) datasets efficiently.
Methods
Hierarchical clustering based solutions are used to integrate multiple (in particular more than two) datasets. Edit distance is used as the basic distance calculation, while distance calculation of common input errors is also studied. Several techniques have been applied to improve the algorithms in terms of both time and space: 1) Partial Construction of the Dendrogram (PCD) that ignores the level above the threshold; 2) Ignoring the Dendrogram Structure (IDS); 3) Faster Computation of the Edit Distance (FCED) that predicts the distance with the threshold by upper bounds on edit distance; and 4) A pre-processing blocking phase that limits dynamic computation within each block.
Results
We have experimentally validated our algorithms on large simulated as well as real data. Accuracy and completeness are defined stringently to show the performance of our algorithms. In addition, we employ a four-category analysis. Comparison with FEBRL shows the robustness of our approach.
Conclusions
In the experiments we conducted, the accuracy we observed exceeded 90% for the simulated data in most cases. 97.7% and 98.1% accuracy were achieved for the constant and proportional threshold, respectively, in a real dataset of 1,083,878 records.
doi:10.1186/1472-6947-12-59
PMCID: PMC3439324
PMID: 22741525
Background
Minimotifs are short contiguous peptide sequences in proteins that have known functions. At its simplest level, the minimotif sequence is present in a source protein and has an activity relationship with a target, most of which are proteins. While many scientists routinely investigate new minimotif functions in proteins, the major web-based discovery tools have a high rate of false-positive prediction. Any new approach that reduces false-positives will be of great help to biologists.
Methods and Findings
We have built three filters that use genetic interactions to reduce false-positive minimotif predictions. The basic filter identifies those minimotifs where the source/target protein pairs have a known genetic interaction. The HomoloGene genetic interaction filter extends these predictions to predicted genetic interactions of orthologous proteins and the node-based filter identifies those minimotifs where proteins that have a genetic interaction with the source or target have a genetic interaction. Each filter was evaluated with a test data set containing thousands of true and false-positives. Based on sensitivity and selectivity performance metrics, the basic filter had the best discrimination for true positives, whereas the node-based filter had the highest sensitivity. We have implemented these genetic interaction filters on the Minimotif Miner 2.3 website. The genetic interaction filter is particularly useful for improving predictions of posttranslational modifications such as phosphorylation and proteolytic cleavage sites.
Conclusions
Genetic interaction data sets can be used to reduce false-positive minimotif predictions. Minimotif prediction in known genetic interactions can help to refine the mechanisms behind the functional connection between genes revealed by genetic experimentation and screens.
doi:10.1371/journal.pone.0032630
PMCID: PMC3293834
PMID: 22403687
Rajasekaran, Sanguthevar | Merlin, Jerlin Camilus | Kundeti, Vamsi | Oommen, Aaron | Mi, Tian | Vyas, Jay | Alaniz, Izua | Chung, Keith | Chowdhury, Farah | Deverasatty, Sandeep | Irvey, Tenisha M. | Lacambacal, David | Lara, Darlene | Panchangam, Subhasree | Rathnayake, Viraj | Watts, Paula | Schiller, Martin R.
Protein-protein interactions are important to understanding cell functions; however our theoretical understanding is limited. There is a general discontinuity between the well-accepted physical and chemical forces that drive protein-protein interactions and the large collections of identified protein-protein interactions in various databases. Minimotifs are short functional peptide sequences that provide a basis to bridge this gap in knowledge. However, there is no systematic way to study minimotifs in the context of protein-protein interactions or vice versa. Here we have engineered a set of algorithms that can be used to identify minimotifs in known protein-protein interactions and implemented this for use by scientists in Minimotif Miner. By globally testing these algorithms on verified data and on 100 individual proteins as test cases, we demonstrate the utility of these new computation tools. This tool also can be used to reduce false positive predictions in the discovery of novel minimotifs. The statistical significance of these algorithms is demonstrated by an ROC analysis (p = 0.001).
doi:10.1002/prot.22868
PMCID: PMC2995834
PMID: 20938975
Minimotif Miner; HomoloGene; BLAST; Grb2; SLiM; short linear motifs
Mi, Tian | Merlin, Jerlin Camilus | Deverasetty, Sandeep | Gryk, Michael R. | Bill, Travis J. | Brooks, Andrew W. | Lee, Logan Y. | Rathnayake, Viraj | Ross, Christian A. | Sargeant, David P. | Strong, Christy L. | Watts, Paula | Rajasekaran, Sanguthevar | Schiller, Martin R.
Minimotif Miner (MnM available at http://minimotifminer.org or http://mnm.engr.uconn.edu) is an online database for identifying new minimotifs in protein queries. Minimotifs are short contiguous peptide sequences that have a known function in at least one protein. Here we report the third release of the MnM database which has now grown 60-fold to approximately 300 000 minimotifs. Since short minimotifs are by their nature not very complex we also summarize a new set of false-positive filters and linear regression scoring that vastly enhance minimotif prediction accuracy on a test data set. This online database can be used to predict new functions in proteins and causes of disease.
doi:10.1093/nar/gkr1189
PMCID: PMC3245078
PMID: 22146221
The photochemical and thermal stability of the detergent solubilized blue- and green-absorbing proteorhodpsins, BPR and GPR respectively, are investigated to determine viability of these proteins for photonic device applications. Photochemical stability is studied by using pulsed laser excitation and differential uv-vis spectroscopy to assign the photocyclicity. GPR, with a cyclicity of 7×104 photocycles protein−1, is 4–5 times more stable than BPR (9×103 photocycles protein−1), but is less stable than native bacteriorhodopsin (9×105 photocycles protein−1) or the 4-keto-bacteriorhodopsin analog (1×105 photocycles protein−1). The thermal stabilities are assigned by using differential scanning calorimetry and thermal bleaching experiments. Both proteorhodopsins display excellent thermal stability, with melting temperatures above 85°C, and remain photochemically stable up to 75°C. The biological relevance of our results is also discussed. The lower cyclicity of BPR is found to be adequate for the long-term biological function of the host organism at ocean depths of 50 m or more.
doi:10.1021/jp106633w
PMCID: PMC2987714
PMID: 20964279
opto-electronic material; cyclicity; differential scanning calorimetry; proteorhodopsin; biophotonic
Background
Motifs are patterns found in biological sequences that are vital for understanding gene function, human disease, drug design, etc. They are helpful in finding transcriptional regulatory elements, transcription factor binding sites, and so on. As a result, the problem of identifying motifs is very crucial in biology.
Results
Many facets of the motif search problem have been identified in the literature. One of them is (ℓ, d)-motif search (or Planted Motif Search (PMS)). The PMS problem has been well investigated and shown to be NP-hard. Any algorithm for PMS that always finds all the (ℓ, d)-motifs on a given input set is called an exact algorithm. In this paper we focus on exact algorithms only. All the known exact algorithms for PMS take exponential time in some of the underlying parameters in the worst case scenario. But it does not mean that we cannot design exact algorithms for solving practical instances within a reasonable amount of time. In this paper, we propose a fast algorithm that can solve the well-known challenging instances of PMS: (21, 8) and (23, 9). No prior exact algorithm could solve these instances. In particular, our proposed algorithm takes about 10 hours on the challenging instance (21, 8) and about 54 hours on the challenging instance (23, 9). The algorithm has been run on a single 2.4GHz PC with 3GB RAM. The implementation of PMS5 is freely available on the web at http://www.pms.engr.uconn.edu/downloads/PMS5.zip.
Conclusions
We present an efficient algorithm PMS5 that uses some novel ideas and combines them with well-known algorithm PMS1 and PMSPrune. PMS5 can tackle the large challenging instances (21, 8) and (23, 9). Therefore, we hope that PMS5 will help biologists discover longer motifs in the futures.
doi:10.1186/1471-2105-12-410
PMCID: PMC3269969
PMID: 22024209