Search tips
Search criteria

Results 1-7 (7)

Clipboard (0)
more »
Year of Publication
Document Types
1.  molBLOCKS: decomposing small molecule sets and uncovering enriched fragments 
Bioinformatics  2014;30(14):2081-2083.
Summary: The chemical structures of biomolecules, whether naturally occurring or synthetic, are composed of functionally important building blocks. Given a set of small molecules—for example, those known to bind a particular protein—computationally decomposing them into chemically meaningful fragments can help elucidate their functional properties, and may be useful for designing novel compounds with similar properties. Here we introduce molBLOCKS, a suite of programs for breaking down sets of small molecules into fragments according to a predefined set of chemical rules, clustering the resulting fragments, and uncovering statistically enriched fragments. Among other applications, our software should be a great aid in large-scale chemical analysis of ligands binding specific targets of interest.
Availability and implementation: molBLOCKS is available as GPL C++ source code at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4080744  PMID: 24681908
2.  A practical algorithm for finding maximal exact matches in large sequence datasets using sparse suffix arrays 
Bioinformatics  2009;25(13):1609-1616.
Motivation: High-throughput sequencing technologies place ever increasing demands on existing algorithms for sequence analysis. Algorithms for computing maximal exact matches (MEMs) between sequences appear in two contexts where high-throughput sequencing will vastly increase the volume of sequence data: (i) seeding alignments of high-throughput reads for genome assembly and (ii) designating anchor points for genome–genome comparisons.
Results: We introduce a new algorithm for finding MEMs. The algorithm leverages a sparse suffix array (SA), a text index that stores every K-th position of the text. In contrast to a full text index that stores every position of the text, a sparse SA occupies much less memory. Even though we use a sparse index, the output of our algorithm is the same as a full text index algorithm as long as the space between the indexed suffixes is not greater than a minimum length of a MEM. By relying on partial matches and additional text scanning between indexed positions, the algorithm trades memory for extra computation. The reduced memory usage makes it possible to determine MEMs between significantly longer sequences.
Availability: Source code for the algorithm is available under a BSD open source license at The implementation can serve as a drop-in replacement for the MEMs algorithm in MUMmer 3.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2732316  PMID: 19389736
3.  M are better than one: an ensemble-based motif finder and its application to regulatory element prediction 
Bioinformatics  2009;25(7):868-874.
Motivation: Identifying regulatory elements in genomic sequences is a key component in understanding the control of gene expression. Computationally, this problem is often addressed by motif discovery, where the goal is to find a set of mutually similar subsequences within a collection of input sequences. Though motif discovery is widely studied and many approaches to it have been suggested, it remains a challenging and as yet unresolved problem.
Results: We introduce SAMF (Solution-Aggregating Motif Finder), a novel approach for motif discovery. SAMF is based on a Markov Random Field formulation, and its key idea is to uncover and aggregate multiple statistically significant solutions to the given motif finding problem. In contrast to many earlier methods, SAMF does not require prior estimates on the number of motif instances present in the data, is not limited by motif length, and allows motifs to overlap. Though SAMF is broadly applicable, these features make it particularly well suited for addressing the challenges of prokaryotic regulatory element detection. We test SAMF's ability to find transcription factor binding sites in an Escherichia coli dataset and show that it outperforms previous methods. Additionally, we uncover a number of previously unidentified binding sites in this data, and provide evidence that they correspond to actual regulatory elements.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2660878  PMID: 19223448
4.  SPICi: a fast clustering algorithm for large biological networks 
Bioinformatics  2010;26(8):1105-1111.
Motivation: Clustering algorithms play an important role in the analysis of biological networks, and can be used to uncover functional modules and obtain hints about cellular organization. While most available clustering algorithms work well on biological networks of moderate size, such as the yeast protein physical interaction network, they either fail or are too slow in practice for larger networks, such as functional networks for higher eukaryotes. Since an increasing number of larger biological networks are being determined, the limitations of current clustering approaches curtail the types of biological network analyses that can be performed.
Results: We present a fast local network clustering algorithm SPICi. SPICi runs in time O(V log V+E) and space O(E), where V and E are the number of vertices and edges in the network, respectively. We evaluate SPICi's performance on several existing protein interaction networks of varying size, and compare SPICi to nine previous approaches for clustering biological networks. We show that SPICi is typically several orders of magnitude faster than previous approaches and is the only one that can successfully cluster all test networks within very short time. We demonstrate that SPICi has state-of-the-art performance with respect to the quality of the clusters it uncovers, as judged by its ability to recapitulate protein complexes and functional modules. Finally, we demonstrate the power of our fast network clustering algorithm by applying SPICi across hundreds of large context-specific human networks, and identifying modules specific for single conditions.
Availability: Source code is available under the GNU Public License at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2853685  PMID: 20185405
5.  Predicting DNA recognition by Cys2His2 zinc finger proteins 
Bioinformatics  2008;25(1):22-29.
Motivation: Cys2His2 zinc finger (ZF) proteins represent the largest class of eukaryotic transcription factors. Their modular structure and well-conserved protein-DNA interface allow the development of computational approaches for predicting their DNA-binding preferences even when no binding sites are known for a particular protein. The ‘canonical model’ for ZF protein-DNA interaction consists of only four amino acid nucleotide contacts per zinc finger domain.
Results: We present an approach for predicting ZF binding based on support vector machines (SVMs). While most previous computational approaches have been based solely on examples of known ZF protein–DNA interactions, ours additionally incorporates information about protein–DNA pairs known to bind weakly or not at all. Moreover, SVMs with a linear kernel can naturally incorporate constraints about the relative binding affinities of protein-DNA pairs; this type of information has not been used previously in predicting ZF protein-DNA binding. Here, we build a high-quality literature-derived experimental database of ZF–DNA binding examples and utilize it to test both linear and polynomial kernels for predicting ZF protein–DNA binding on the basis of the canonical binding model. The polynomial SVM outperforms previously published prediction procedures as well as the linear SVM. This may indicate the presence of dependencies between contacts in the canonical binding model and suggests that modification of the underlying structural model may result in further improved performance in predicting ZF protein–DNA binding. Overall, this work demonstrates that methods incorporating information about non-binding and relative binding of protein–DNA pairs have great potential for effective prediction of protein–DNA interactions.
Availability: An online tool for predicting ZF DNA binding is available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2638941  PMID: 19008249
6.  How and when should interactome-derived clusters be used to predict functional modules and protein function? 
Bioinformatics  2009;25(23):3143-3150.
Motivation: Clustering of protein–protein interaction networks is one of the most common approaches for predicting functional modules, protein complexes and protein functions. But, how well does clustering perform at these tasks?
Results: We develop a general framework to assess how well computationally derived clusters in physical interactomes overlap functional modules derived via the Gene Ontology (GO). Using this framework, we evaluate six diverse network clustering algorithms using Saccharomyces cerevisiae and show that (i) the performances of these algorithms can differ substantially when run on the same network and (ii) their relative performances change depending upon the topological characteristics of the network under consideration. For the specific task of function prediction in S.cerevisiae, we demonstrate that, surprisingly, a simple non-clustering guilt-by-association approach outperforms widely used clustering-based approaches that annotate a protein with the overrepresented biological process and cellular component terms in its cluster; this is true over the range of clustering algorithms considered. Further analysis parameterizes performance based on the number of annotated proteins, and suggests when clustering approaches should be used for interactome functional analyses. Overall our results suggest a re-examination of when and how clustering approaches should be applied to physical interactomes, and establishes guidelines by which novel clustering approaches for biological networks should be justified and evaluated with respect to functional analysis.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3167697  PMID: 19770263
7.  Characterization and prediction of residues determining protein functional specificity 
Bioinformatics  2008;24(13):1473-1480.
Motivation: Within a homologous protein family, proteins may be grouped into subtypes that share specific functions that are not common to the entire family. Often, the amino acids present in a small number of sequence positions determine each protein's particular function-al specificity. Knowledge of these specificity determining positions (SDPs) aids in protein function prediction, drug design and experimental analysis. A number of sequence-based computational methods have been introduced for identifying SDPs; however, their further development and evaluation have been hindered by the limited number of known experimentally determined SDPs.
Results: We combine several bioinformatics resources to automate a process, typically undertaken manually, to build a dataset of SDPs. The resulting large dataset, which consists of SDPs in enzymes, enables us to characterize SDPs in terms of their physicochemical and evolution-ary properties. It also facilitates the large-scale evaluation of sequence-based SDP prediction methods. We present a simple sequence-based SDP prediction method, GroupSim, and show that, surprisingly, it is competitive with a representative set of current methods. We also describe ConsWin, a heuristic that considers sequence conservation of neighboring amino acids, and demonstrate that it improves the performance of all methods tested on our large dataset of enzyme SDPs.
Availability: Datasets and GroupSim code are available online at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2718669  PMID: 18450811

Results 1-7 (7)