Search tips
Search criteria

Results 1-25 (310470)

Clipboard (0)

Related Articles

1.  Evaluating Ortholog Prediction Algorithms in a Yeast Model Clade 
PLoS ONE  2011;6(4):e18755.
Accurate identification of orthologs is crucial for evolutionary studies and for functional annotation. Several algorithms have been developed for ortholog delineation, but so far, manually curated genome-scale biological databases of orthologous genes for algorithm evaluation have been lacking. We evaluated four popular ortholog prediction algorithms (MultiParanoid; and OrthoMCL; RBH: Reciprocal Best Hit; RSD: Reciprocal Smallest Distance; the last two extended into clustering algorithms cRBH and cRSD, respectively, so that they can predict orthologs across multiple taxa) against a set of 2,723 groups of high-quality curated orthologs from 6 Saccharomycete yeasts in the Yeast Gene Order Browser.
Examination of sensitivity [TP/(TP+FN)], specificity [TN/(TN+FP)], and accuracy [(TP+TN)/(TP+TN+FP+FN)] across a broad parameter range showed that cRBH was the most accurate and specific algorithm, whereas OrthoMCL was the most sensitive. Evaluation of the algorithms across a varying number of species showed that cRBH had the highest accuracy and lowest false discovery rate [FP/(FP+TP)], followed by cRSD. Of the six species in our set, three descended from an ancestor that underwent whole genome duplication. Subsequent differential duplicate loss events in the three descendants resulted in distinct classes of gene loss patterns, including cases where the genes retained in the three descendants are paralogs, constituting ‘traps’ for ortholog prediction algorithms. We found that the false discovery rate of all algorithms dramatically increased in these traps.
These results suggest that simple algorithms, like cRBH, may be better ortholog predictors than more complex ones (e.g., OrthoMCL and MultiParanoid) for evolutionary and functional genomics studies where the objective is the accurate inference of single-copy orthologs (e.g., molecular phylogenetics), but that all algorithms fail to accurately predict orthologs when paralogy is rampant.
PMCID: PMC3076445  PMID: 21533202
2.  Examination of Prokaryotic Multipartite Genome Evolution through Experimental Genome Reduction 
PLoS Genetics  2014;10(10):e1004742.
Many bacteria carry two or more chromosome-like replicons. This occurs in pathogens such as Vibrio cholerea and Brucella abortis as well as in many N2-fixing plant symbionts including all isolates of the alfalfa root-nodule bacteria Sinorhizobium meliloti. Understanding the evolution and role of this multipartite genome organization will provide significant insight into these important organisms; yet this knowledge remains incomplete, in part, because technical challenges of large-scale genome manipulations have limited experimental analyses. The distinct evolutionary histories and characteristics of the three replicons that constitute the S. meliloti genome (the chromosome (3.65 Mb), pSymA megaplasmid (1.35 Mb), and pSymB chromid (1.68 Mb)) makes this a good model to examine this topic. We transferred essential genes from pSymB into the chromosome, and constructed strains that lack pSymB as well as both pSymA and pSymB. This is the largest reduction (45.4%, 3.04 megabases, 2866 genes) of a prokaryotic genome to date and the first removal of an essential chromid. Strikingly, strains lacking pSymA and pSymB (ΔpSymAB) lost the ability to utilize 55 of 74 carbon sources and various sources of nitrogen, phosphorous and sulfur, yet the ΔpSymAB strain grew well in minimal salts media and in sterile soil. This suggests that the core chromosome is sufficient for growth in a bulk soil environment and that the pSymA and pSymB replicons carry genes with more specialized functions such as growth in the rhizosphere and interaction with the plant. These experimental data support a generalized evolutionary model, in which non-chromosomal replicons primarily carry genes with more specialized functions. These large secondary replicons increase the organism's niche range, which offsets their metabolic burden on the cell (e.g. pSymA). Subsequent co-evolution with the chromosome then leads to the formation of a chromid through the acquisition of functions core to all niches (e.g. pSymB).
Author Summary
Rhizobia are free-living bacteria of agricultural and environmental importance that form root-nodules on leguminous plants and provide these plants with fixed nitrogen. Many of the rhizobia have a multipartite genome, as do several plant and animal pathogens. All isolates of the alfalfa symbiont, Sinorhizobium meliloti, carry three large replicons, the chromosome (∼3.7 Mb), pSymA megaplasmid (∼1.4 Mb), and pSymB chromid (∼1.7 Mb). To gain insight into the role and evolutionary history of these replicons, we have ‘reversed evolution’ by constructing a S. meliloti strain consisting solely of the chromosome and lacking the pSymB chromid and pSymA megaplasmid. As the resulting strain was viable, we could perform a detailed phenotypic analysis and these data provided significant insight into the biology and metabolism of S. meliloti. The data lend direct experimental evidence in understanding the evolution and role of the multipartite genome. Specifically the large secondary replicons increase the organism's niche range, and this advantage offsets the metabolic burden of these replicons on the cell. Additionally, the single-chromosome strain offers a useful platform to facilitate future forward genetic approaches to understanding and manipulating the symbiosis and plant-microbe interactions.
PMCID: PMC4207669  PMID: 25340565
3.  Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs 
BMC Bioinformatics  2010;11:560.
Assembling genomic sequences from a set of overlapping reads is one of the most fundamental problems in computational biology. Algorithms addressing the assembly problem fall into two broad categories - based on the data structures which they employ. The first class uses an overlap/string graph and the second type uses a de Bruijn graph. However with the recent advances in short read sequencing technology, de Bruijn graph based algorithms seem to play a vital role in practice. Efficient algorithms for building these massive de Bruijn graphs are very essential in large sequencing projects based on short reads. In an earlier work, an O(n/p) time parallel algorithm has been given for this problem. Here n is the size of the input and p is the number of processors. This algorithm enumerates all possible bi-directed edges which can overlap with a node and ends up generating Θ(nΣ) messages (Σ being the size of the alphabet).
In this paper we present a Θ(n/p) time parallel algorithm with a communication complexity that is equal to that of parallel sorting and is not sensitive to Σ. The generality of our algorithm makes it very easy to extend it even to the out-of-core model and in this case it has an optimal I/O complexity of Θ(nlog(n/B)Blog(M/B)) (M being the main memory size and B being the size of the disk block). We demonstrate the scalability of our parallel algorithm on a SGI/Altix computer. A comparison of our algorithm with the previous approaches reveals that our algorithm is faster - both asymptotically and practically. We demonstrate the scalability of our sequential out-of-core algorithm by comparing it with the algorithm used by VELVET to build the bi-directed de Bruijn graph. Our experiments reveal that our algorithm can build the graph with a constant amount of memory, which clearly outperforms VELVET. We also provide efficient algorithms for the bi-directed chain compaction problem.
The bi-directed de Bruijn graph is a fundamental data structure for any sequence assembly program based on Eulerian approach. Our algorithms for constructing Bi-directed de Bruijn graphs are efficient in parallel and out of core settings. These algorithms can be used in building large scale bi-directed de Bruijn graphs. Furthermore, our algorithms do not employ any all-to-all communications in a parallel setting and perform better than the prior algorithms. Finally our out-of-core algorithm is extremely memory efficient and can replace the existing graph construction algorithm in VELVET.
PMCID: PMC2996408  PMID: 21078174
4.  Efficient large-scale protein sequence comparison and gene matching to identify orthologs and co-orthologs 
Nucleic Acids Research  2011;40(6):e44.
Broadly, computational approaches for ortholog assignment is a three steps process: (i) identify all putative homologs between the genomes, (ii) identify gene anchors and (iii) link anchors to identify best gene matches given their order and context. In this article, we engineer two methods to improve two important aspects of this pipeline [specifically steps (ii) and (iii)]. First, computing sequence similarity data [step (i)] is a computationally intensive task for large sequence sets, creating a bottleneck in the ortholog assignment pipeline. We have designed a fast and highly scalable sort-join method (afree) based on k-mer counts to rapidly compare all pairs of sequences in a large protein sequence set to identify putative homologs. Second, availability of complex genomes containing large gene families with prevalence of complex evolutionary events, such as duplications, has made the task of assigning orthologs and co-orthologs difficult. Here, we have developed an iterative graph matching strategy where at each iteration the best gene assignments are identified resulting in a set of orthologs and co-orthologs. We find that the afree algorithm is faster than existing methods and maintains high accuracy in identifying similar genes. The iterative graph matching strategy also showed high accuracy in identifying complex gene relationships. Standalone afree available from∼kmahmood/afree. EGM2, complete ortholog assignment pipeline (including afree and the iterative graph matching method) available from∼kmahmood/EGM2.
PMCID: PMC3315314  PMID: 22210858
5.  Using OrthoMCL to assign proteins to OrthoMCL-DB groups or to cluster proteomes into new ortholog groups 
OrthoMCL is an algorithm for grouping proteins into ortholog groups based on their sequence similarity. OrthoMCL-DB is a public database that allows users to browse and view ortholog groups that were pre-computed using the OrthoMCL algorithm. Version 4 of this database contained 116,536 ortholog groups clustered from 1,270,853 proteins obtained from 88 eukaryotic genomes, 16 archaeal genomes and 34 bacterial genomes. Future versions of OrthoMCL-DB will include more proteomes as more genomes are sequenced. Here, we describe how you can group your proteins of interest into ortholog clusters using two different means provided by the OrthoMCL system. The OrthoMCL-DB website has a tool for uploading and grouping a set of protein sequences, typically representing a proteome. This method maps the uploaded proteins to existing groups in OrthoMCL-DB. Alternatively, if you have proteins from a set of genomes that need to be grouped, you can download, install and run the standalone OrthoMCL software.
PMCID: PMC3196566  PMID: 21901743
OrthoMCL; ortholog groups; paralog; proteome; Markov clustering; reciprocal best hits; MCL
6.  A fast approach to global alignment of protein-protein interaction networks 
BMC Research Notes  2013;6:35.
Global network alignment has been proposed as an effective tool for computing functional orthology. Commonly used global alignment techniques such as IsoRank rely on a two-step process: the first step is an iterative diffusion-based approach for assigning similarity scores to all possible node pairs (matchings); the second step applies a maximum-weight bipartite matching algorithm to this similarity score matrix to identify orthologous node pairs. While demonstrably successful in identifying orthologies beyond those based on sequences, this two-step process is computationally expensive. Recent work on computation of node-pair similarity matrices has demonstrated that the computational cost of the first step can be significantly reduced. The use of these accelerated methods renders the bipartite matching step as the dominant computational cost. This motivates a critical assessment of the tradeoffs of computational cost and solution quality (matching quality, topological matches, and biological significance) associated with the bipartite matching step. In this paper we utilize the state-of-the-art core diffusion-based step in IsoRank for similarity matrix computation, and couple it with two heuristic bipartite matching algorithms – a matrix-based greedy approach, and a tunable, adaptive, auction-based matching algorithm developed by us. We then compare our implementations against the performance and quality characteristics of the solution produced by the reference IsoRank binary, which also implements an optimal matching algorithm.
Using heuristic matching algorithms in the IsoRank pipeline exhibits dramatic speedup improvements; typically ×30 times faster for the total alignment process in most cases of interest. More surprisingly, these improvements in compute times are typically accompanied by better or comparable topological and biological quality for the network alignments generated. These measures are quantified by the number of conserved edges in the alignment graph, the percentage of enriched components, and the total number of covered Gene Ontology (GO) terms.
We have demonstrated significant reductions in global network alignment computation times by coupling heuristic bipartite matching methods with the similarity scoring step of the IsoRank procedure. Our heuristic matching techniques maintain comparable – if not better – quality in resulting alignments. A consequence of our work is that network-alignment based orthologies can be computed within minutes (as compared to hours) on typical protein interaction networks, enabling a more comprehensive tuning of alignment parameters for refined orthologies.
PMCID: PMC3672019  PMID: 23363457
7.  SYM1 Is the Stress-Induced Saccharomyces cerevisiae Ortholog of the Mammalian Kidney Disease Gene Mpv17 and Is Required for Ethanol Metabolism and Tolerance during Heat Shock 
Eukaryotic Cell  2004;3(3):620-631.
Organisms rapidly adapt to severe environmental stress by inducing the expression of a wide array of heat shock proteins as part of a larger cellular response program. We have used a genomics approach to identify novel heat shock-induced genes in Saccharomyces cerevisiae. The uncharacterized open reading frame (ORF) YLR251W was found to be required for both metabolism and tolerance of ethanol during heat shock. YLR251W has significant homology to the mammalian peroxisomal membrane protein Mpv17, and Mpv17−/− mice exhibit age-onset glomerulosclerosis, deafness, hypertension, and, ultimately, death by renal failure. Expression of Mpv17 in ylr251wΔ cells complements the 37°C ethanol growth defect, suggesting that these proteins are functional orthologs. We have therefore renamed ORF YLR251W as SYM1 (for “stress-inducible yeast Mpv17”). In contrast to the peroxisomal localization of Mpv17, we find that Sym1 is an integral membrane protein of the inner mitochondrial membrane. In addition, transcriptional profiling of sym1Δ cells uncovered changes in gene expression, including dysregulation of a number of ethanol-repressed genes, exclusively at 37°C relative to wild-type results. Together, these data suggest an important metabolic role for Sym1 in mitochondrial function during heat shock. Furthermore, this study establishes Sym1 as a potential model for understanding the role of Mpv17 in kidney disease and cardiovascular biology.
PMCID: PMC420134  PMID: 15189984
8.  SOPRA: Scaffolding algorithm for paired reads via statistical optimization 
BMC Bioinformatics  2010;11:345.
High throughput sequencing (HTS) platforms produce gigabases of short read (<100 bp) data per run. While these short reads are adequate for resequencing applications, de novo assembly of moderate size genomes from such reads remains a significant challenge. These limitations could be partially overcome by utilizing mate pair technology, which provides pairs of short reads separated by a known distance along the genome.
We have developed SOPRA, a tool designed to exploit the mate pair/paired-end information for assembly of short reads. The main focus of the algorithm is selecting a sufficiently large subset of simultaneously satisfiable mate pair constraints to achieve a balance between the size and the quality of the output scaffolds. Scaffold assembly is presented as an optimization problem for variables associated with vertices and with edges of the contig connectivity graph. Vertices of this graph are individual contigs with edges drawn between contigs connected by mate pairs. Similar graph problems have been invoked in the context of shotgun sequencing and scaffold building for previous generation of sequencing projects. However, given the error-prone nature of HTS data and the fundamental limitations from the shortness of the reads, the ad hoc greedy algorithms used in the earlier studies are likely to lead to poor quality results in the current context. SOPRA circumvents this problem by treating all the constraints on equal footing for solving the optimization problem, the solution itself indicating the problematic constraints (chimeric/repetitive contigs, etc.) to be removed. The process of solving and removing of constraints is iterated till one reaches a core set of consistent constraints. For SOLiD sequencer data, SOPRA uses a dynamic programming approach to robustly translate the color-space assembly to base-space. For assessing the quality of an assembly, we report the no-match/mismatch error rate as well as the rates of various rearrangement errors.
Applying SOPRA to real data from bacterial genomes, we were able to assemble contigs into scaffolds of significant length (N50 up to 200 Kb) with very few errors introduced in the process. In general, the methodology presented here will allow better scaffold assemblies of any type of mate pair sequencing data.
PMCID: PMC2909219  PMID: 20576136
9.  ‘Ca. Liberibacter asiaticus’ Proteins Orthologous with pSymA-Encoded Proteins of Sinorhizobium meliloti: Hypothetical Roles in Plant Host Interaction 
PLoS ONE  2012;7(6):e38725.
Sinorhizobium meliloti strain 1021, a nitrogen-fixing, root-nodulating bacterial microsymbiont of alfalfa, has a 3.5 Mbp circular chromosome and two megaplasmids including 1.3 Mbp pSymA carrying nonessential ‘accessory’ genes for nitrogen fixation (nif), nodulation and host specificity (nod). A related bacterium, psyllid-vectored ‘Ca. Liberibacter asiaticus,’ is an obligate phytopathogen with a reduced genome that was previously analyzed for genes orthologous to genes on the S. meliloti circular chromosome. In general, proteins encoded by pSymA genes are more similar in sequence alignment to those encoded by S. meliloti chromosomal orthologs than to orthologous proteins encoded by genes carried on the ‘Ca. Liberibacter asiaticus’ genome. Only two ‘Ca. Liberibacter asiaticus’ proteins were identified as having orthologous proteins encoded on pSymA but not also encoded on the chromosome of S. meliloti. These two orthologous gene pairs encode a Na+/K+ antiporter (shared with intracellular pathogens of the family Bartonellacea) and a Co++, Zn++ and Cd++ cation efflux protein that is shared with the phytopathogen Agrobacterium. Another shared protein, a redox-regulated K+ efflux pump may regulate cytoplasmic pH and homeostasis. The pSymA and ‘Ca. Liberibacter asiaticus’ orthologs of the latter protein are more highly similar in amino acid alignment compared with the alignment of the pSymA-encoded protein with its S. meliloti chromosomal homolog. About 182 pSymA encoded proteins have sequence similarity (≤E-10) with ‘Ca. Liberibacter asiaticus’ proteins, often present as multiple orthologs of single ‘Ca. Liberibacter asiaticus’ proteins. These proteins are involved with amino acid uptake, cell surface structure, chaperonins, electron transport, export of bioactive molecules, cellular homeostasis, regulation of gene expression, signal transduction and synthesis of amino acids and metabolic cofactors. The presence of multiple orthologs defies mutational analysis and is consistent with the hypothesis that these proteins may be of particular importance in host/microbe interaction and their duplication likely facilitates their ongoing evolution.
PMCID: PMC3382624  PMID: 22761700
10.  Integration of sequence-similarity and functional association information can overcome intrinsic problems in orthology mapping across bacterial genomes 
Nucleic Acids Research  2011;39(22):e150.
Existing methods for orthologous gene mapping suffer from two general problems: (i) they are computationally too slow and their results are difficult to interpret for automated large-scale applications when based on phylogenetic analyses; or (ii) they are too prone to making mistakes in dealing with complex situations involving horizontal gene transfers and gene fusion due to the lack of a sound basis when based on sequence similarity information. We present a novel algorithm, Global Optimization Strategy (GOST), for orthologous gene mapping through combining sequence similarity and contextual (working partners) information, using a combinatorial optimization framework. Genome-scale applications of GOST show substantial improvements over the predictions by three popular sequence similarity-based orthology mapping programs. Our analysis indicates that our algorithm overcomes the intrinsic issues faced by sequence similarity-based methods, when orthology mapping involves gene fusions and horizontal gene transfers. Our program runs as efficiently as the most efficient sequence similarity-based algorithm in the public domain. GOST is freely downloadable at
PMCID: PMC3239196  PMID: 21965536
11.  Assessing Performance of Orthology Detection Strategies Applied to Eukaryotic Genomes 
PLoS ONE  2007;2(4):e383.
Orthology detection is critically important for accurate functional annotation, and has been widely used to facilitate studies on comparative and evolutionary genomics. Although various methods are now available, there has been no comprehensive analysis of performance, due to the lack of a genomic-scale ‘gold standard’ orthology dataset. Even in the absence of such datasets, the comparison of results from alternative methodologies contains useful information, as agreement enhances confidence and disagreement indicates possible errors. Latent Class Analysis (LCA) is a statistical technique that can exploit this information to reasonably infer sensitivities and specificities, and is applied here to evaluate the performance of various orthology detection methods on a eukaryotic dataset. Overall, we observe a trade-off between sensitivity and specificity in orthology detection, with BLAST-based methods characterized by high sensitivity, and tree-based methods by high specificity. Two algorithms exhibit the best overall balance, with both sensitivity and specificity>80%: INPARANOID identifies orthologs across two species while OrthoMCL clusters orthologs from multiple species. Among methods that permit clustering of ortholog groups spanning multiple genomes, the (automated) OrthoMCL algorithm exhibits better within-group consistency with respect to protein function and domain architecture than the (manually curated) KOG database, and the homolog clustering algorithm TribeMCL as well. By way of using LCA, we are also able to comprehensively assess similarities and statistical dependence between various strategies, and evaluate the effects of parameter settings on performance. In summary, we present a comprehensive evaluation of orthology detection on a divergent set of eukaryotic genomes, thus providing insights and guides for method selection, tuning and development for different applications. Many biological questions have been addressed by multiple tests yielding binary (yes/no) outcomes but no clear definition of truth, making LCA an attractive approach for computational biology.
PMCID: PMC1849888  PMID: 17440619
12.  SymGRASS: a database of sugarcane orthologous genes involved in arbuscular mycorrhiza and root nodule symbiosis 
BMC Bioinformatics  2013;14(Suppl 1):S2.
The rationale for gathering information from plants procuring nitrogen through symbiotic interactions controlled by a common genetic program for a sustainable biofuel production is the high energy demanding application of synthetic nitrogen fertilizers. We curated sequence information publicly available for the biofuel plant sugarcane, performed an analysis of the common SYM pathway known to control symbiosis in other plants, and provide results, sequences and literature links as an online database.
Sugarcane sequences and informations were downloaded from the nucEST database, cleaned and trimmed with seqclean, assembled with TGICL plus translating mapping method, and annotated. The annotation is based on BLAST searches against a local formatted plant Uniprot90 generated with CD-HIT for functional assignment, rpsBLAST to CDD database for conserved domain analysis, and BLAST search to sorghum's for Gene Ontology (GO) assignment. Gene expression was normalized according the Unigene standard, presented as ESTs/100 kb. Protein sequences known in the SYM pathway were used as queries to search the SymGRASS sequence database. Additionally, antimicrobial peptides described in the PhytAMP database served as queries to retrieve and generate expression profiles of these defense genes in the libraries compared to the libraries obtained under symbiotic interactions.
We describe the SymGRASS, a database of sugarcane orthologous genes involved in arbuscular mycorrhiza (AM) and root nodule (RN) symbiosis. The database aggregates knowledge about sequences, tissues, organ, developmental stages and experimental conditions, and provides annotation and level of gene expression for sugarcane transcripts and SYM orthologous genes in sugarcane through a web interface. Several candidate genes were found for all nodes in the pathway, and interestingly a set of symbiosis specific genes was found.
The knowledge integrated in SymGRASS may guide studies on molecular, cellular and physiological mechanisms by which sugarcane controls the establishment and efficiency of endophytic associations. We believe that the candidate sequences for the SYM pathway together with the pool of exclusively expressed tentative consensus (TC) sequences are crucial for the design of molecular studies to unravel the mechanisms controlling the establishment of symbioses in sugarcane, ultimately serving as a basis for the improvement of grass crops.
PMCID: PMC3548678  PMID: 23368899
13.  SymD webserver: a platform for detecting internally symmetric protein structures 
Nucleic Acids Research  2014;42(Web Server issue):W296-W300.
Internal symmetry of a protein structure is the pseudo-symmetry that a single protein chain sometimes exhibits. This is in contrast to the symmetry with which monomers are arranged in many multimeric protein complexes. SymD is a program that detects proteins with internal symmetry. It proved to be useful for analyzing protein structure, function and modeling. This web-based interactive tool was developed by implementing the SymD algorithm. To the best of our knowledge, SymD webserver is the first tool of its kind with which users can easily study the symmetry of the protein they are interested in by uploading the structure or retrieving it from databases. It uses the Galaxy platform to take advantage of its extensibility and displays the symmetry properties, the symmetry axis and the sequence alignment of the structures before and after the symmetry transformation via an interactive graphical visualization environment in any modern web browser. An Example Run video displays the workflow to help users navigate. SymD webserver is publicly available at
PMCID: PMC4086132  PMID: 24799435
14.  Phylogenetic and Functional Assessment of Orthologs Inference Projects and Methods 
PLoS Computational Biology  2009;5(1):e1000262.
Accurate genome-wide identification of orthologs is a central problem in comparative genomics, a fact reflected by the numerous orthology identification projects developed in recent years. However, only a few reports have compared their accuracy, and indeed, several recent efforts have not yet been systematically evaluated. Furthermore, orthology is typically only assessed in terms of function conservation, despite the phylogeny-based original definition of Fitch. We collected and mapped the results of nine leading orthology projects and methods (COG, KOG, Inparanoid, OrthoMCL, Ensembl Compara, Homologene, RoundUp, EggNOG, and OMA) and two standard methods (bidirectional best-hit and reciprocal smallest distance). We systematically compared their predictions with respect to both phylogeny and function, using six different tests. This required the mapping of millions of sequences, the handling of hundreds of millions of predicted pairs of orthologs, and the computation of tens of thousands of trees. In phylogenetic analysis or in functional analysis where high specificity is required, we find that OMA and Homologene perform best. At lower functional specificity but higher coverage level, OrthoMCL outperforms Ensembl Compara, and to a lesser extent Inparanoid. Lastly, the large coverage of the recent EggNOG can be of interest to build broad functional grouping, but the method is not specific enough for phylogenetic or detailed function analyses. In terms of general methodology, we observe that the more sophisticated tree reconstruction/reconciliation approach of Ensembl Compara was at times outperformed by pairwise comparison approaches, even in phylogenetic tests. Furthermore, we show that standard bidirectional best-hit often outperforms projects with more complex algorithms. First, the present study provides guidance for the broad community of orthology data users as to which database best suits their needs. Second, it introduces new methodology to verify orthology. And third, it sets performance standards for current and future approaches.
Author Summary
The identification of orthologs, pairs of homologous genes in different species that started diverging through speciation events, is a central problem in genomics with applications in many research areas, including comparative genomics, phylogenetics, protein function annotation, and genome rearrangement. An increasing number of projects aim at inferring orthologs from complete genomes, but little is known about their relative accuracy or coverage. Because the exact evolutionary history of entire genomes remains largely unknown, predictions can only be validated indirectly, that is, in the context of the different applications of orthology. The few comparison studies published so far have asssessed orthology exclusively from the expectation that orthologs have conserved protein function. In the present work, we introduce methodology to verify orthology in terms of phylogeny and perform a comprehensive comparison of nine leading ortholog inference projects and two methods using both phylogenetic and functional tests. The results show large variations among the different projects in terms of performances, which indicates that the choice of orthology database can have a strong impact on any downstream analysis.
PMCID: PMC2612752  PMID: 19148271
15.  RNA canonical and non-canonical base pairing types: a recognition method and complete repertoire 
Nucleic Acids Research  2002;30(19):4250-4263.
The problem of systematic and objective identification of canonical and non-canonical base pairs in RNA three-dimensional (3D) structures was studied. A probabilistic approach was applied, and an algorithm and its implementation in a computer program that detects and analyzes all the base pairs contained in RNA 3D structures were developed. The algorithm objectively distinguishes among canonical and non-canonical base pairing types formed by three, two and one hydrogen bonds (H-bonds), as well as those containing bifurcated and C-H...X H-bonds. The nodes of a bipartite graph are used to encode the donor and acceptor atoms of a 3D structure. The capacities of the edges correspond to probabilities computed from the geometry of the donor and acceptor groups to form H-bonds. The maximum flow from donors to acceptors irectly identifies base pairs and their types. A complete repertoire of base pairing types was built from the detected H-bonds of all X-ray crystal structures of a resolution of 3.0 Å or better, including the large and small ribosomal subunits. The base pairing types are labeled using an extension of the nomenclature recently introduced by Leontis and Westhof. The probabilistic method was implemented in MC-Annotate, an RNA structure analysis computer program used to determine the base pairing parameters of the 3D modeling system MC-Sym.
PMCID: PMC140540  PMID: 12364604
16.  Cell Growth Inhibition upon Deletion of Four Toxin-Antitoxin Loci from the Megaplasmids of Sinorhizobium meliloti 
Journal of Bacteriology  2014;196(4):811-824.
Toxin and antitoxin (TA) gene pairs are addiction systems that are present in many microbial genomes. Sinorhizobium meliloti is an N2-fixing bacterial symbiont of alfalfa and other leguminous plants, and its genome consists of three large replicons, a circular chromosome (3.7 Mb) and the megaplasmids pSymA (1.4 Mb) and pSymB (1.7 Mb). S. meliloti carries 211 predicted type II TA genes, each encoding a toxin or an antitoxin. We constructed defined deletion strains that collectively removed the entire pSymA and pSymB megaplasmids except for their oriV regions. Of approximately 100 TA genes on pSymA and pSymB, we identified four whose loss was associated with cell death or stasis unless copies of the genes were supplied in trans. Orthologs of three of these loci have been characterized in other organisms (relB/E [sma0471/sma0473], Fic [DOC] [sma2105], and VapC [PIN] [orf2230/sma2231]), and this report contains the first experimental proof that RES/Xre (smb21127/smb21128) loci can function as a TA system. Transcriptome sequencing (RNA-seq) analysis did not reveal transcriptional differences between the TA systems to account for why deletion of the four “active” systems resulted in cell toxicity. These data suggest that severe cell growth phenotypes result from the loss of a few TA systems and that loss of most TA systems may result in more subtle phenotypes. These four TA systems do not appear to play a direct role in the S. meliloti-alfalfa symbiosis, as strains lacking these TA systems had a symbiotic N2 fixation phenotype that was indistinguishable from the wild type.
PMCID: PMC3911179  PMID: 24317400
17.  Modularity detection in protein-protein interaction networks 
BMC Research Notes  2011;4:569.
Many recent studies have investigated modularity in biological networks, and its role in functional and structural characterization of constituent biomolecules. A technique that has shown considerable promise in the domain of modularity detection is the Newman and Girvan (NG) algorithm, which relies on the number of shortest-paths across pairs of vertices in the network traversing a given edge, referred to as the betweenness of that edge. The edge with the highest betweenness is iteratively eliminated from the network, with the betweenness of the remaining edges recalculated in every iteration. This generates a complete dendrogram, from which modules are extracted by applying a quality metric called modularity denoted by Q. This exhaustive computation can be prohibitively expensive for large networks such as Protein-Protein Interaction Networks. In this paper, we present a novel optimization to the modularity detection algorithm, in terms of an efficient termination criterion based on a target edge betweenness value, using which the process of iterative edge removal may be terminated.
We validate the robustness of our approach by applying our algorithm on real-world protein-protein interaction networks of Yeast, C.Elegans and Drosophila, and demonstrate that our algorithm consistently has significant computational gains in terms of reduced runtime, when compared to the NG algorithm. Furthermore, our algorithm produces modules comparable to those from the NG algorithm, qualitatively and quantitatively. We illustrate this using comparison metrics such as module distribution, module membership cardinality, modularity Q, and Jaccard Similarity Coefficient.
We have presented an optimized approach for efficient modularity detection in networks. The intuition driving our approach is the extraction of holistic measures of centrality from graphs, which are representative of inherent modular structure of the underlying network, and the application of those measures to efficiently guide the modularity detection process. We have empirically evaluated our approach in the specific context of real-world large scale biological networks, and have demonstrated significant savings in computational time while maintaining comparable quality of detected modules.
PMCID: PMC3292542  PMID: 22206604
18.  A new framework for identifying cis-regulatory motifs in prokaryotes 
Nucleic Acids Research  2010;39(7):e42.
We present a new algorithm, BOBRO, for prediction of cis-regulatory motifs in a given set of promoter sequences. The algorithm substantially improves the prediction accuracy and extends the scope of applicability of the existing programs based on two key new ideas: (i) we developed a highly effective method for reliably assessing the possibility for each position in a given promoter to be the (approximate) start of a conserved sequence motif; and (ii) we developed a highly reliable way for recognition of actual motifs from the accidental ones based on the concept of ‘motif closure’. These two key ideas are embedded in a classical framework for motif finding through finding cliques in a graph but have made this framework substantially more sensitive as well as more selective in motif finding in a very noisy background. A comparative analysis shows that the performance coefficient was improved from 29% to 41% by our program compared to the best among other six state-of-the-art prediction tools on a large-scale data sets of promoters from one genome, and also consistently improved by substantial margins on another kind of large-scale data sets of orthologous promoters across multiple genomes. The power of BOBRO in dealing with noisy data was further demonstrated through identification of the motifs of the global transcriptional regulators by running it over 2390 promoter sequences of Escherichia coli K12.
PMCID: PMC3074163  PMID: 21149261
19.  Computational methods for Gene Orthology inference 
Briefings in Bioinformatics  2011;12(5):379-391.
Accurate inference of orthologous genes is a pre-requisite for most comparative genomics studies, and is also important for functional annotation of new genomes. Identification of orthologous gene sets typically involves phylogenetic tree analysis, heuristic algorithms based on sequence conservation, synteny analysis, or some combination of these approaches. The most direct tree-based methods typically rely on the comparison of an individual gene tree with a species tree. Once the two trees are accurately constructed, orthologs are straightforwardly identified by the definition of orthology as those homologs that are related by speciation, rather than gene duplication, at their most recent point of origin. Although ideal for the purpose of orthology identification in principle, phylogenetic trees are computationally expensive to construct for large numbers of genes and genomes, and they often contain errors, especially at large evolutionary distances. Moreover, in many organisms, in particular prokaryotes and viruses, evolution does not appear to have followed a simple ‘tree-like’ mode, which makes conventional tree reconciliation inapplicable. Other, heuristic methods identify probable orthologs as the closest homologous pairs or groups of genes in a set of organisms. These approaches are faster and easier to automate than tree-based methods, with efficient implementations provided by graph-theoretical algorithms enabling comparisons of thousands of genomes. Comparisons of these two approaches show that, despite conceptual differences, they produce similar sets of orthologs, especially at short evolutionary distances. Synteny also can aid in identification of orthologs. Often, tree-based, sequence similarity- and synteny-based approaches can be combined into flexible hybrid methods.
PMCID: PMC3178053  PMID: 21690100
homolog; ortholog; paralog; xenolog; orthologous groups; tree reconciliation; comparative genomics
20.  Genome-Wide Molecular Clock and Horizontal Gene Transfer in Bacterial Evolution 
Journal of Bacteriology  2004;186(19):6575-6585.
We describe a simple theoretical framework for identifying orthologous sets of genes that deviate from a clock-like model of evolution. The approach used is based on comparing the evolutionary distances within a set of orthologs to a standard intergenomic distance, which was defined as the median of the distribution of the distances between all one-to-one orthologs. Under the clock-like model, the points on a plot of intergenic distances versus intergenomic distances are expected to fit a straight line. A statistical technique to identify significant deviations from the clock-like behavior is described. For several hundred analyzed orthologous sets representing three well-defined bacterial lineages, the α-Proteobacteria, the γ-Proteobacteria, and the Bacillus-Clostridium group, the clock-like null hypothesis could not be rejected for ∼70% of the sets, whereas the rest showed substantial anomalies. Subsequent detailed phylogenetic analysis of the genes with the strongest deviations indicated that over one-half of these genes probably underwent a distinct form of horizontal gene transfer, xenologous gene displacement, in which a gene is displaced by an ortholog from a different lineage. The remaining deviations from the clock-like model could be explained by lineage-specific acceleration of evolution. The results indicate that although xenologous gene displacement is a major force in bacterial evolution, a significant majority of orthologous gene sets in three major bacterial lineages evolved in accordance with the clock-like model. The approach described here allows rapid detection of deviations from this mode of evolution on the genome scale.
PMCID: PMC516599  PMID: 15375139
21.  BFL: a node and edge betweenness based fast layout algorithm for large scale networks 
BMC Bioinformatics  2009;10:19.
Network visualization would serve as a useful first step for analysis. However, current graph layout algorithms for biological pathways are insensitive to biologically important information, e.g. subcellular localization, biological node and graph attributes, or/and not available for large scale networks, e.g. more than 10000 elements.
To overcome these problems, we propose the use of a biologically important graph metric, betweenness, a measure of network flow. This metric is highly correlated with many biological phenomena such as lethality and clusters. We devise a new fast parallel algorithm calculating betweenness to minimize the preprocessing cost. Using this metric, we also invent a node and edge betweenness based fast layout algorithm (BFL). BFL places the high-betweenness nodes to optimal positions and allows the low-betweenness nodes to reach suboptimal positions. Furthermore, BFL reduces the runtime by combining a sequential insertion algorim with betweenness. For a graph with n nodes, this approach reduces the expected runtime of the algorithm to O(n2) when considering edge crossings, and to O(n log n) when considering only density and edge lengths.
Our BFL algorithm is compared against fast graph layout algorithms and approaches requiring intensive optimizations. For gene networks, we show that our algorithm is faster than all layout algorithms tested while providing readability on par with intensive optimization algorithms. We achieve a 1.4 second runtime for a graph with 4000 nodes and 12000 edges on a standard desktop computer.
PMCID: PMC2753844  PMID: 19146673
22.  Service and Collaboration: Enhancing Project Potential at the Stowers Institute Molecular Biology Facility 
Journal of Biomolecular Techniques : JBT  2010;21(3 Suppl):S74-S75.
The Stowers Institute for Medical Research houses 24 independent research programs supported by 13 independent core facilities. The Molecular Biology Facility combines four teams in a cutting-edge laboratory dedicated exclusively to supporting internal research by providing access to the latest technology and high quality service. The Sequencing Team offers routine Sanger sequencing, 96-well plasmid preps followed by high throughput sequencing, riboprobe synthesis and pathogen testing. Sanger resequencing for finding mutations is also available.An Illumina Genome Analyzer is utilized for next generation sequencing projects.Current next gen sequencing applications include whole genome resequencing and SNP discovery, RNA-seq, ChIP-Seq and microRNA sequencing. Microarray works with commercially available Affymetrix and Agilent platforms, and spotted arrays made in-house. Experiments include ChIP-chip, array CGH, and expression applications.Microarray also generates ChIP-seq, RNA-seq and microRNA libraries for next gen sequencing applications. The Mutagenesis and Libraries Team's site-directed mutagenesis service is one of our most popular and unique services. Investigators provide the team with a glycerol stock, sequence information and the desired changes (point mutations, insertions, deletions). The team designs the primers, does the mutagenesis reactions and returns positive clones. Currently the team is developing a recombineering service for larger construct designs, including BACs.The group manages and distributes clones from over 27 vector and clone collections for the Institute. The qPCR, Robotics and Tech Development Team collaborates with many of the Institute's PI labs and core facilities.qPCR instrumentsinclude an ABI 7500 for 96-well qPCR and an ABI 7900 with automatic loader for running 384-well plates and Taqman Low Density Arrays (TLDA cards).The team also collaborates with investigators on custom robotic projects using a liquid handling robot (Biomek FX and CAS-4200) and colony manipulation robots (Singer RoTor and Qpix).Most projects require tech development, such as the recent development of high throughput aneuploidy screens for one of the Institute's PI labs.
PMCID: PMC2918111
23.  Genome-Wide Inference of Ancestral Recombination Graphs 
PLoS Genetics  2014;10(5):e1004342.
The complex correlation structure of a collection of orthologous DNA sequences is uniquely captured by the “ancestral recombination graph” (ARG), a complete record of coalescence and recombination events in the history of the sample. However, existing methods for ARG inference are computationally intensive, highly approximate, or limited to small numbers of sequences, and, as a consequence, explicit ARG inference is rarely used in applied population genomics. Here, we introduce a new algorithm for ARG inference that is efficient enough to apply to dozens of complete mammalian genomes. The key idea of our approach is to sample an ARG of chromosomes conditional on an ARG of chromosomes, an operation we call “threading.” Using techniques based on hidden Markov models, we can perform this threading operation exactly, up to the assumptions of the sequentially Markov coalescent and a discretization of time. An extension allows for threading of subtrees instead of individual sequences. Repeated application of these threading operations results in highly efficient Markov chain Monte Carlo samplers for ARGs. We have implemented these methods in a computer program called ARGweaver. Experiments with simulated data indicate that ARGweaver converges rapidly to the posterior distribution over ARGs and is effective in recovering various features of the ARG for dozens of sequences generated under realistic parameters for human populations. In applications of ARGweaver to 54 human genome sequences from Complete Genomics, we find clear signatures of natural selection, including regions of unusually ancient ancestry associated with balancing selection and reductions in allele age in sites under directional selection. The patterns we observe near protein-coding genes are consistent with a primary influence from background selection rather than hitchhiking, although we cannot rule out a contribution from recurrent selective sweeps.
Author Summary
The unusual and complex correlation structure of population samples of genetic sequences presents a fundamental statistical challenge that pervades nearly all areas of population genetics. Historical recombination events produce an intricate network of intertwined genealogies, which impedes demography inference, the detection of natural selection, association mapping, and other applications. It is possible to capture these complex relationships using a representation called the ancestral recombination graph (ARG), which provides a complete description of coalescence and recombination events in the history of the sample. However, previous methods for ARG inference have not been adequately fast and accurate for practical use with large-scale genomic sequence data. In this article, we introduce a new algorithm for ARG inference that has vastly improved scaling properties. Our algorithm is implemented in a computer program called ARGweaver, which is fast enough to be applied to sequences megabases in length. With the aid of a large computer cluster, ARGweaver can be used to sample full ARGs for entire mammalian genome sequences. We show that ARGweaver performs well in simulation experiments and demonstrate that it can be used to provide new insights about both demographic processes and natural selection when applied to real human genome sequence data.
PMCID: PMC4022496  PMID: 24831947
24.  Detecting internally symmetric protein structures 
BMC Bioinformatics  2010;11:303.
Many functional proteins have a symmetric structure. Most of these are multimeric complexes, which are made of non-symmetric monomers arranged in a symmetric manner. However, there are also a large number of proteins that have a symmetric structure in the monomeric state. These internally symmetric proteins are interesting objects from the point of view of their folding, function, and evolution. Most algorithms that detect the internally symmetric proteins depend on finding repeating units of similar structure and do not use the symmetry information.
We describe a new method, called SymD, for detecting symmetric protein structures. The SymD procedure works by comparing the structure to its own copy after the copy is circularly permuted by all possible number of residues. The procedure is relatively insensitive to symmetry-breaking insertions and deletions and amplifies positive signals from symmetry. It finds 70% to 80% of the TIM barrel fold domains in the ASTRAL 40 domain database and 100% of the beta-propellers as symmetric. More globally, 10% to 15% of the proteins in the ASTRAL 40 domain database may be considered symmetric according to this procedure depending on the precise cutoff value used to measure the degree of perfection of the symmetry. Symmetrical proteins occur in all structural classes and can have a closed, circular structure, a cylindrical barrel-like structure, or an open, helical structure.
SymD is a sensitive procedure for detecting internally symmetric protein structures. Using this procedure, we estimate that 10% to 15% of the known protein domains may be considered symmetric. We also report an initial, overall view of the types of symmetries and symmetric folds that occur in the protein domain structure universe.
PMCID: PMC2894822  PMID: 20525292
25.  MSOAR 2.0: Incorporating tandem duplications into ortholog assignment based on genome rearrangement 
BMC Bioinformatics  2010;11:10.
Ortholog assignment is a critical and fundamental problem in comparative genomics, since orthologs are considered to be functional counterparts in different species and can be used to infer molecular functions of one species from those of other species. MSOAR is a recently developed high-throughput system for assigning one-to-one orthologs between closely related species on a genome scale. It attempts to reconstruct the evolutionary history of input genomes in terms of genome rearrangement and gene duplication events. It assumes that a gene duplication event inserts a duplicated gene into the genome of interest at a random location (i.e., the random duplication model). However, in practice, biologists believe that genes are often duplicated by tandem duplications, where a duplicated gene is located next to the original copy (i.e., the tandem duplication model).
In this paper, we develop MSOAR 2.0, an improved system for one-to-one ortholog assignment. For a pair of input genomes, the system first focuses on the tandemly duplicated genes of each genome and tries to identify among them those that were duplicated after the speciation (i.e., the so-called inparalogs), using a simple phylogenetic tree reconciliation method. For each such set of tandemly duplicated inparalogs, all but one gene will be deleted from the concerned genome (because they cannot possibly appear in any one-to-one ortholog pairs), and MSOAR is invoked. Using both simulated and real data experiments, we show that MSOAR 2.0 is able to achieve a better sensitivity and specificity than MSOAR. In comparison with the well-known genome-scale ortholog assignment tool InParanoid, Ensembl ortholog database, and the orthology information extracted from the well-known whole-genome multiple alignment program MultiZ, MSOAR 2.0 shows the highest sensitivity. Although the specificity of MSOAR 2.0 is slightly worse than that of InParanoid in the real data experiments, it is actually better than that of InParanoid in the simulation tests.
Our preliminary experimental results demonstrate that MSOAR 2.0 is a highly accurate tool for one-to-one ortholog assignment between closely related genomes. The software is available to the public for free and included as online supplementary material.
PMCID: PMC2821317  PMID: 20053291

Results 1-25 (310470)