Search tips
Search criteria

Results 1-3 (3)

Clipboard (0)

Select a Filter Below

more »
Year of Publication
Document Types
1.  A low-polynomial algorithm for assembling clusters of orthologous groups from intergenomic symmetric best matches 
Bioinformatics  2010;26(12):1481-1487.
Motivation: Identifying orthologous genes in multiple genomes is a fundamental task in comparative genomics. Construction of intergenomic symmetrical best matches (SymBets) and joining them into clusters is a popular method of ortholog definition, embodied in several software programs. Despite their wide use, the computational complexity of these programs has not been thoroughly examined.
Results: In this work, we show that in the standard approach of iteration through all triangles of SymBets, the memory scales with at least the number of these triangles, O(g3) (where g = number of genomes), and construction time scales with the iteration through each pair, i.e. O(g6). We propose the EdgeSearch algorithm that iterates over edges in the SymBet graph rather than triangles of SymBets, and as a result has a worst-case complexity of only O(g3log g). Several optimizations reduce the run-time even further in realistically sparse graphs. In two real-world datasets of genomes from bacteriophages (POGs) and Mollicutes (MOGs), an implementation of the EdgeSearch algorithm runs about an order of magnitude faster than the original algorithm and scales much better with increasing number of genomes, with only minor differences in the final results, and up to 60 times faster than the popular OrthoMCL program with a 90% overlap between the identified groups of orthologs.
Availability and implementation: C++ source code freely available for download at
Supplementary information: Supplementary materials are available at Bioinformatics online.
PMCID: PMC2881409  PMID: 20439257
2.  Detection of Biochemical Pathways by Probabilistic Matching of Phyletic Vectors 
PLoS ONE  2009;4(4):e5326.
A phyletic vector, also known as a phyletic (or phylogenetic) pattern, is a binary representation of the presences and absences of orthologous genes in different genomes. Joint occurrence of two or more genes in many genomes results in closely similar binary vectors representing these genes, and this similarity between gene vectors may be used as a measure of functional association between genes. Better understanding of quantitative properties of gene co-occurrences is needed for systematic studies of gene function and evolution. We used the probabilistic iterative algorithm Psi-square to find groups of similar phyletic vectors. An extended Psi-square algorithm, in which pseudocounts are implemented, shows better sensitivity in identifying proteins with known functional links than our earlier hierarchical clustering approach. At the same time, the specificity of inferring functional associations between genes in prokaryotic genomes is strongly dependent on the pathway: phyletic vectors of the genes involved in energy metabolism and in de novo biosynthesis of the essential precursors tend to be lumped together, whereas cellular modules involved in secretion, motility, assembly of cell surfaces, biosynthesis of some coenzymes, and utilization of secondary carbon sources tend to be identified with much greater specificity. It appears that the network of gene coinheritance in prokaryotes contains a giant connected component that encompasses most biosynthetic subsystems, along with a series of more independent modules involved in cell interaction with the environment.
PMCID: PMC2670198  PMID: 19390636
3.  Analyzing Chromatin Remodeling Complexes Using Shotgun Proteomics and Normalized Spectral Abundance Factors 
Methods (San Diego, Calif.)  2006;40(4):303-311.
Mass spectrometry based approaches are commonly used to identify proteins from multiprotein complexes, typically with the goal of identifying new complex members or identifying post translational modifications. However, with the recent demonstration that spectral counting is a powerful quantitative proteomic approach, the analysis of multiprotein complexes by mass spectrometry can be reconsidered in certain cases. Using the chromatography based approach named multidimensional protein identification technology, multiprotein complexes may be analyzed quantitatively using the normalized spectral abundance factor that allows comparison of multiple independent analyses of samples. This study describes an approach to visualize multiprotein complex datasets that provides structure function information that is superior to tabular lists of data. In this method review, we describe a reanalysis of the Rpd3/Sin3 small and large histone deacetylase complexes previously described in a tabular form to demonstrate the normalized spectral abundance factor approach.
PMCID: PMC1815300  PMID: 17101441
Chromatin Remodeling; Protein Complexes; Multidimensional Protein Identification Technology (MudPIT); Tandem Mass Spectrometry; Normalized Spectral Abundance Factor

Results 1-3 (3)