PMCC PMCC

Search tips
Search criteria

Advanced
Results 1-25 (873344)

Clipboard (0)
None

Related Articles

1.  A Tight Link between Orthologs and Bidirectional Best Hits in Bacterial and Archaeal Genomes 
Genome Biology and Evolution  2012;4(12):1286-1294.
Orthologous relationships between genes are routinely inferred from bidirectional best hits (BBH) in pairwise genome comparisons. However, to our knowledge, it has never been quantitatively demonstrated that orthologs form BBH. To test this “BBH-orthology conjecture,” we take advantage of the operon organization of bacterial and archaeal genomes and assume that, when two genes in compared genomes are flanked by two BBH show statistically significant sequence similarity to one another, these genes are bona fide orthologs. Under this assumption, we tested whether middle genes in “syntenic orthologous gene triplets” form BBH. We found that this was the case in more than 95% of the syntenic gene triplets in all genome comparisons. A detailed examination of the exceptions to this pattern, including maximum likelihood phylogenetic tree analysis, showed that some of these deviations involved artifacts of genome annotation, whereas very small fractions represented random assignment of the best hit to one of closely related in-paralogs, paralogous displacement in situ, or even less frequent genuine violations of the BBH–orthology conjecture caused by acceleration of evolution in one of the orthologs. We conclude that, at least in prokaryotes, genes for which independent evidence of orthology is available typically form BBH and, conversely, BBH can serve as a strong indication of gene orthology.
doi:10.1093/gbe/evs100
PMCID: PMC3542571  PMID: 23160176
orthology; bidirectional best hit; genome comparison; synteny
2.  Bidirectional Best Hits Miss Many Orthologs in Duplication-Rich Clades such as Plants and Animals 
Genome Biology and Evolution  2013;5(10):1800-1806.
Bidirectional best hits (BBH), which entails identifying the pairs of genes in two different genomes that are more similar to each other than either is to any other gene in the other genome, is a simple and widely used method to infer orthology. A recent study has analyzed the link between BBH and orthology in bacteria and archaea and concluded that, given the very high consistency in BBH they observed among triplets of neighboring genes, a high proportion of BBH are likely to be bona fide orthologs. However, limited by their analysis setup, the previous study could not easily test the reverse question: which proportion of orthologs are BBH? In this follow-up study, we consider this question in theory and answer it based on conceptual arguments, simulated data, and real biological data from all three domains of life. Our analyses corroborate the findings of the previous study, but also show that because of the high rate of gene duplication in plants and animals, as much as 60% of orthologous relations are missed by the BBH criterion.
doi:10.1093/gbe/evt132
PMCID: PMC3814191  PMID: 24013106
orthology; bidirectional best hit; reciprocal best hit; comparative genomics; evolutionary relationships; in-paralogy
3.  Bidirectional best hit r-window gene clusters 
BMC Bioinformatics  2010;11(Suppl 1):S63.
Background
Conserved gene clusters are groups of genes that are located close to one another in the genomes of several species. They tend to code for proteins that have a functional interaction. The identification of conserved gene clusters is an important step towards understanding genome evolution and predicting gene function.
Results
In this paper, we propose a novel pairwise gene cluster model that combines the notion of bidirectional best hits with the r-window model introduced in 2003 by Durand and Sankoff. The bidirectional best hit (BBH) constraint removes the need to specify the minimum number of shared genes in the r-window model and improves the relevance of the results. We design a subquadratic time algorithm to compute the set of BBH r-window gene clusters efficiently.
Conclusion
We apply our cluster model to the comparative analysis of E. coli K-12 and B. subtilis and perform an extensive comparison between our new model and the gene teams model developed by Bergeron et al. As compared to the gene teams model, our new cluster model has a slightly lower recall but a higher precision at all levels of recall when the results were ranked using statistical tests. An analysis of the most significant BBH r-window gene cluster show that they correspond to known operons.
doi:10.1186/1471-2105-11-S1-S63
PMCID: PMC3009537  PMID: 20122239
4.  Hierarchical clustering algorithm for comprehensive orthologous-domain classification in multiple genomes 
Nucleic Acids Research  2006;34(2):647-658.
Ortholog identification is a crucial first step in comparative genomics. Here, we present a rapid method of ortholog grouping which is effective enough to allow the comparison of many genomes simultaneously. The method takes as input all-against-all similarity data and classifies genes based on the traditional hierarchical clustering algorithm UPGMA. In the course of clustering, the method detects domain fusion or fission events, and splits clusters into domains if required. The subsequent procedure splits the resulting trees such that intra-species paralogous genes are divided into different groups so as to create plausible orthologous groups. As a result, the procedure can split genes into the domains minimally required for ortholog grouping. The procedure, named DomClust, was tested using the COG database as a reference. When comparing several clustering algorithms combined with the conventional bidirectional best-hit (BBH) criterion, we found that our method generally showed better agreement with the COG classification. By comparing the clustering results generated from datasets of different releases, we also found that our method showed relatively good stability in comparison to the BBH-based methods.
doi:10.1093/nar/gkj448
PMCID: PMC1351371  PMID: 16436801
5.  Orthologous Transcription Factors in Bacteria Have Different Functions and Regulate Different Genes 
PLoS Computational Biology  2007;3(9):e175.
Transcription factors (TFs) form large paralogous gene families and have complex evolutionary histories. Here, we ask whether putative orthologs of TFs, from bidirectional best BLAST hits (BBHs), are evolutionary orthologs with conserved functions. We show that BBHs of TFs from distantly related bacteria are usually not evolutionary orthologs. Furthermore, the false orthologs usually respond to different signals and regulate distinct pathways, while the few BBHs that are evolutionary orthologs do have conserved functions. To test the conservation of regulatory interactions, we analyze expression patterns. We find that regulatory relationships between TFs and their regulated genes are usually not conserved for BBHs in Escherichia coli K12 and Bacillus subtilis. Even in the much more closely related bacteria Vibrio cholerae and Shewanella oneidensis MR-1, predicting regulation from E. coli BBHs has high error rates. Using gene–regulon correlations, we identify genes whose expression pattern differs between E. coli and S. oneidensis. Using literature searches and sequence analysis, we show that these changes in expression patterns reflect changes in gene regulation, even for evolutionary orthologs. We conclude that the evolution of bacterial regulation should be analyzed with phylogenetic trees, rather than BBHs, and that bacterial regulatory networks evolve more rapidly than previously thought.
Author Summary
Living organisms use transcription factors (TFs) to control the production of proteins. For example, the bacterium E. coli contains a TF that prevents it from making enzymes that degrade lactose when lactose is absent. Bacterial genomes encode a huge diversity of TFs, and except in a few well-studied organisms, the function of these TFs is not known. To predict the function of a TF, biologists often search for a similar TF, from another organism, that has been characterized. It is generally believed that orthologous TFs—TFs that are derived from the organisms' common ancestor—will have conserved functions. The authors show that a commonly used method to identify orthologous TFs gives misleading results when applied to distantly related bacteria: the “orthologous” TFs are evolutionarily distant, they sense different signals, and they regulate different pathways. Biologists often predict, more specifically, that orthologous TFs will regulate orthologous genes. However, the authors show that even in more closely related bacteria, where the orthologous TFs do have conserved functions, these specific predictions are often incorrect. It seems that gene regulation in bacteria evolves rapidly, and it will be difficult to predict regulation in diverse bacteria from our knowledge of a few well-studied bacteria.
doi:10.1371/journal.pcbi.0030175
PMCID: PMC1971122  PMID: 17845071
6.  Orthologous Transcription Factors in Bacteria Have Different Functions and Regulate Different Genes 
PLoS Computational Biology  2007;3(9):e175.
Transcription factors (TFs) form large paralogous gene families and have complex evolutionary histories. Here, we ask whether putative orthologs of TFs, from bidirectional best BLAST hits (BBHs), are evolutionary orthologs with conserved functions. We show that BBHs of TFs from distantly related bacteria are usually not evolutionary orthologs. Furthermore, the false orthologs usually respond to different signals and regulate distinct pathways, while the few BBHs that are evolutionary orthologs do have conserved functions. To test the conservation of regulatory interactions, we analyze expression patterns. We find that regulatory relationships between TFs and their regulated genes are usually not conserved for BBHs in Escherichia coli K12 and Bacillus subtilis. Even in the much more closely related bacteria Vibrio cholerae and Shewanella oneidensis MR-1, predicting regulation from E. coli BBHs has high error rates. Using gene–regulon correlations, we identify genes whose expression pattern differs between E. coli and S. oneidensis. Using literature searches and sequence analysis, we show that these changes in expression patterns reflect changes in gene regulation, even for evolutionary orthologs. We conclude that the evolution of bacterial regulation should be analyzed with phylogenetic trees, rather than BBHs, and that bacterial regulatory networks evolve more rapidly than previously thought.
Author Summary
Living organisms use transcription factors (TFs) to control the production of proteins. For example, the bacterium E. coli contains a TF that prevents it from making enzymes that degrade lactose when lactose is absent. Bacterial genomes encode a huge diversity of TFs, and except in a few well-studied organisms, the function of these TFs is not known. To predict the function of a TF, biologists often search for a similar TF, from another organism, that has been characterized. It is generally believed that orthologous TFs—TFs that are derived from the organisms' common ancestor—will have conserved functions. The authors show that a commonly used method to identify orthologous TFs gives misleading results when applied to distantly related bacteria: the “orthologous” TFs are evolutionarily distant, they sense different signals, and they regulate different pathways. Biologists often predict, more specifically, that orthologous TFs will regulate orthologous genes. However, the authors show that even in more closely related bacteria, where the orthologous TFs do have conserved functions, these specific predictions are often incorrect. It seems that gene regulation in bacteria evolves rapidly, and it will be difficult to predict regulation in diverse bacteria from our knowledge of a few well-studied bacteria.
doi:10.1371/journal.pcbi.0030175
PMCID: PMC1971122  PMID: 17845071
7.  QuartetS: a fast and accurate algorithm for large-scale orthology detection 
Nucleic Acids Research  2011;39(13):e88.
The unparalleled growth in the availability of genomic data offers both a challenge to develop orthology detection methods that are simultaneously accurate and high throughput and an opportunity to improve orthology detection by leveraging evolutionary evidence in the accumulated sequenced genomes. Here, we report a novel orthology detection method, termed QuartetS, that exploits evolutionary evidence in a computationally efficient manner. Based on the well-established evolutionary concept that gene duplication events can be used to discriminate homologous genes, QuartetS uses an approximate phylogenetic analysis of quartet gene trees to infer the occurrence of duplication events and discriminate paralogous from orthologous genes. We used function- and phylogeny-based metrics to perform a large-scale, systematic comparison of the orthology predictions of QuartetS with those of four other methods [bi-directional best hit (BBH), outgroup, OMA and QuartetS-C (QuartetS followed by clustering)], involving 624 bacterial genomes and >2 million genes. We found that QuartetS slightly, but consistently, outperformed the highly specific OMA method and that, while consuming only 0.5% additional computational time, QuartetS predicted 50% more orthologs with a 50% lower false positive rate than the widely used BBH method. We conclude that, for large-scale phylogenetic and functional analysis, QuartetS and QuartetS-C should be preferred, respectively, in applications where high accuracy and high throughput are required.
doi:10.1093/nar/gkr308
PMCID: PMC3141274  PMID: 21572104
8.  Inferring Hierarchical Orthologous Groups from Orthologous Gene Pairs 
PLoS ONE  2013;8(1):e53786.
Hierarchical orthologous groups are defined as sets of genes that have descended from a single common ancestor within a taxonomic range of interest. Identifying such groups is useful in a wide range of contexts, including inference of gene function, study of gene evolution dynamics and comparative genomics. Hierarchical orthologous groups can be derived from reconciled gene/species trees but, this being a computationally costly procedure, many phylogenomic databases work on the basis of pairwise gene comparisons instead (“graph-based” approach). To our knowledge, there is only one published algorithm for graph-based hierarchical group inference, but both its theoretical justification and performance in practice are as of yet largely uncharacterised. We establish a formal correspondence between the orthology graph and hierarchical orthologous groups. Based on that, we devise GETHOGs (“Graph-based Efficient Technique for Hierarchical Orthologous Groups”), a novel algorithm to infer hierarchical groups directly from the orthology graph, thus without needing gene tree inference nor gene/species tree reconciliation. GETHOGs is shown to correctly reconstruct hierarchical orthologous groups when applied to perfect input, and several extensions with stringency parameters are provided to deal with imperfect input data. We demonstrate its competitiveness using both simulated and empirical data. GETHOGs is implemented as a part of the freely-available OMA standalone package (http://omabrowser.org/standalone). Furthermore, hierarchical groups inferred by GETHOGs (“OMA HOGs”) on >1,000 genomes can be interactively queried via the OMA browser (http://omabrowser.org).
doi:10.1371/journal.pone.0053786
PMCID: PMC3544860  PMID: 23342000
9.  Computational methods for Gene Orthology inference 
Briefings in Bioinformatics  2011;12(5):379-391.
Accurate inference of orthologous genes is a pre-requisite for most comparative genomics studies, and is also important for functional annotation of new genomes. Identification of orthologous gene sets typically involves phylogenetic tree analysis, heuristic algorithms based on sequence conservation, synteny analysis, or some combination of these approaches. The most direct tree-based methods typically rely on the comparison of an individual gene tree with a species tree. Once the two trees are accurately constructed, orthologs are straightforwardly identified by the definition of orthology as those homologs that are related by speciation, rather than gene duplication, at their most recent point of origin. Although ideal for the purpose of orthology identification in principle, phylogenetic trees are computationally expensive to construct for large numbers of genes and genomes, and they often contain errors, especially at large evolutionary distances. Moreover, in many organisms, in particular prokaryotes and viruses, evolution does not appear to have followed a simple ‘tree-like’ mode, which makes conventional tree reconciliation inapplicable. Other, heuristic methods identify probable orthologs as the closest homologous pairs or groups of genes in a set of organisms. These approaches are faster and easier to automate than tree-based methods, with efficient implementations provided by graph-theoretical algorithms enabling comparisons of thousands of genomes. Comparisons of these two approaches show that, despite conceptual differences, they produce similar sets of orthologs, especially at short evolutionary distances. Synteny also can aid in identification of orthologs. Often, tree-based, sequence similarity- and synteny-based approaches can be combined into flexible hybrid methods.
doi:10.1093/bib/bbr030
PMCID: PMC3178053  PMID: 21690100
homolog; ortholog; paralog; xenolog; orthologous groups; tree reconciliation; comparative genomics
10.  IONS: Identification of Orthologs by Neighborhood and Similarity—an Automated Method to Identify Orthologs in Chromosomal Regions of Common Evolutionary Ancestry and its Application to Hemiascomycetous Yeasts 
Comparative sequence analysis is widely used to infer gene function and study genome evolution and requires proper ortholog identification across different genomes. We have developed a program for the Identification of Orthologs in one-to-one relationship by Neighborhood and Similarity (IONS) between closely related species. The algorithm combines two levels of evidence to determine co-ancestrality at the genome scale: sequence similarity and shared neighborhood. The method was initially designed to provide anchor points for syntenic blocks within the Génolevures project concerning nine hemiascomycetous yeasts (about 50,000 genes) and is applicable to different input databases. Comparison based on use of a Rand index shows that the results are highly consistent with the pillars of the Yeast Gene Order Browser, a manually curated database. Compared with SYNERGY, another algorithm reporting homology relationships, our method’s main advantages are its automation and the absence of dataset-dependent parameters, facilitating consistent integration of newly released genomes.
doi:10.4137/EBO.S7465
PMCID: PMC3169350  PMID: 21918595
ortholog; synteny; shared neighborhood; hemiascomycete; yeast
11.  The Ortholog Conjecture Is Untestable by the Current Gene Ontology but Is Supported by RNA Sequencing Data 
PLoS Computational Biology  2012;8(11):e1002784.
The ortholog conjecture posits that orthologous genes are functionally more similar than paralogous genes. This conjecture is a cornerstone of phylogenomics and is used daily by both computational and experimental biologists in predicting, interpreting, and understanding gene functions. A recent study, however, challenged the ortholog conjecture on the basis of experimentally derived Gene Ontology (GO) annotations and microarray gene expression data in human and mouse. It instead proposed that the functional similarity of homologous genes is primarily determined by the cellular context in which the genes act, explaining why a greater functional similarity of (within-species) paralogs than (between-species) orthologs was observed. Here we show that GO-based functional similarity between human and mouse orthologs, relative to that between paralogs, has been increasing in the last five years. Further, compared with paralogs, orthologs are less likely to be included in the same study, causing an underestimation in their functional similarity. A close examination of functional studies of homologs with identical protein sequences reveals experimental biases, annotation errors, and homology-based functional inferences that are labeled in GO as experimental. These problems and the temporary nature of the GO-based finding make the current GO inappropriate for testing the ortholog conjecture. RNA sequencing (RNA-Seq) is known to be superior to microarray for comparing the expressions of different genes or in different species. Our analysis of a large RNA-Seq dataset of multiple tissues from eight mammals and the chicken shows that the expression similarity between orthologs is significantly higher than that between within-species paralogs, supporting the ortholog conjecture and refuting the cellular context hypothesis for gene expression. We conclude that the ortholog conjecture remains largely valid to the extent that it has been tested, but further scrutiny using more and better functional data is needed.
Author Summary
Today's exceedingly high speed of genome sequencing, compared with the generally slow pace of functional assay, means that the functions of most genes identified from genome sequences will be annotated only through computational prediction. The primary source of information for this prediction is the functions of orthologous genes in model organisms, because orthologs are widely believed to be functionally similar, especially when compared with paralogs. This belief, known as the ortholog conjecture, was recently challenged on the basis of experimentally derived Gene Ontology (GO) annotations and microarray gene expression data, because these data revealed greater functional and expressional similarities of paralogs than orthologs. Here we show that GO-based estimates of functional similarities are temporary and unreliable, due to experimental biases, annotation errors, and homology-based functional inferences that are incorrectly labeled as experimental in GO. RNA sequencing (RNA-Seq) is superior to microarray for comparing the expressions of different genes or in different species, and our analysis of a large RNA-Seq dataset provides strong support to the ortholog conjecture for gene expression. We conclude that the ortholog conjecture remains largely valid to the extent that it has been tested, but further scrutiny using more and better functional data is needed.
doi:10.1371/journal.pcbi.1002784
PMCID: PMC3510086  PMID: 23209392
12.  Positional orthology: putting genomic evolutionary relationships into context 
Briefings in Bioinformatics  2011;12(5):401-412.
Orthology is a powerful refinement of homology that allows us to describe more precisely the evolution of genomes and understand the function of the genes they contain. However, because orthology is not concerned with genomic position, it is limited in its ability to describe genes that are likely to have equivalent roles in different genomes. Because of this limitation, the concept of ‘positional orthology’ has emerged, which describes the relation between orthologous genes that retain their ancestral genomic positions. In this review, we formally define this concept, for which we introduce the shorter term ‘toporthology’, with respect to the evolutionary events experienced by a gene’s ancestors. Through a discussion of recent studies on the role of genomic context in gene evolution, we show that the distinction between orthology and toporthology is biologically significant. We then review a number of orthology prediction methods that take genomic context into account and thus that may be used to infer the important relation of toporthology.
doi:10.1093/bib/bbr040
PMCID: PMC3178058  PMID: 21705766
positional orthology; toporthology; homology; synteny; genome alignment
13.  Syntenic gene analysis between Brassica rapa and other Brassicaceae species 
Chromosomal synteny analysis is important in genome comparison to reveal genomic evolution of related species. Shared synteny describes genomic fragments from different species that originated from an identical ancestor. Syntenic genes are orthologs located in these syntenic fragments, so they often share similar functions. Syntenic gene analysis is very important in Brassicaceae species to share gene annotations and investigate genome evolution. Here we designed and developed a direct and efficient tool, SynOrths, to identify pairwise syntenic genes between genomes of Brassicaceae species. SynOrths determines whether two genes are a conserved syntenic pair based not only on their sequence similarity, but also by the support of homologous flanking genes. Syntenic genes between Arabidopsis thaliana and Brassica rapa, Arabidopsis lyrata and B. rapa, and Thellungiella parvula and B. rapa were then identified using SynOrths. The occurrence of genome triplication in B. rapa was clearly observed, many genes that were evenly distributed in the genomes of A. thaliana, A. lyrata, and T. parvula had three syntenic copies in B. rapa. Additionally, there were many B. rapa genes that had no syntenic orthologs in A. thaliana, but some of these had syntenic orthologs in A. lyrata or T. parvula. Only 5,851 genes in B. rapa had no syntenic counterparts in any of the other three species. These 5,851 genes could have originated after B. rapa diverged from these species. A tool for syntenic gene analysis between species of Brassicaceae was developed, SynOrths, which could be used to accurately identify syntenic genes in differentiated but closely-related genomes. With this tool, we identified syntenic gene sets between B. rapa and each of A. thaliana, A. lyrata, T. parvula. Syntenic gene analysis is important for not only the gene annotation of newly sequenced Brassicaceae genomes by bridging them to model plant A. thaliana, but also the study of genome evolution in these species.
doi:10.3389/fpls.2012.00198
PMCID: PMC3430884  PMID: 22969786
synteny; ortholog; Brassica rapa; Arabidopsis thaliana; Arabidopsis lyrata; Thellugiella parvula; Brassicaceae
14.  The other side of comparative genomics: genes with no orthologs between the cow and other mammalian species 
BMC Genomics  2009;10:604.
Background
With the rapid growth in the availability of genome sequence data, the automated identification of orthologous genes between species (orthologs) is of fundamental importance to facilitate functional annotation and studies on comparative and evolutionary genomics. Genes with no apparent orthologs between the bovine and human genome may be responsible for major differences between the species, however, such genes are often neglected in functional genomics studies.
Results
A BLAST-based method was exploited to explore the current annotation and orthology predictions in Ensembl. Genes with no orthologs between the two genomes were classified into groups based on alignments, ontology, manual curation and publicly available information. Starting from a high quality and specific set of orthology predictions, as provided by Ensembl, hidden relationship between genes and genomes of different mammalian species were unveiled using a highly sensitive approach, based on sequence similarity and genomic comparison.
Conclusions
The analysis identified 3,801 bovine genes with no orthologs in human and 1010 human genes with no orthologs in cow, among which 411 and 43 genes, respectively, had no match at all in the other species. Most of the apparently non-orthologous genes may potentially have orthologs which were missed in the annotation process, despite having a high percentage of identity, because of differences in gene length and structure. The comparative analysis reported here identified gene variants, new genes and species-specific features and gave an overview of the other side of orthology which may help to improve the annotation of the bovine genome and the knowledge of structural differences between species.
doi:10.1186/1471-2164-10-604
PMCID: PMC2808326  PMID: 20003425
15.  Improving the specificity of high-throughput ortholog prediction 
BMC Bioinformatics  2006;7:270.
Background
Orthologs (genes that have diverged after a speciation event) tend to have similar function, and so their prediction has become an important component of comparative genomics and genome annotation. The gold standard phylogenetic analysis approach of comparing available organismal phylogeny to gene phylogeny is not easily automated for genome-wide analysis; therefore, ortholog prediction for large genome-scale datasets is typically performed using a reciprocal-best-BLAST-hits (RBH) approach. One problem with RBH is that it will incorrectly predict a paralog as an ortholog when incomplete genome sequences or gene loss is involved. In addition, there is an increasing interest in identifying orthologs most likely to have retained similar function.
Results
To address these issues, we present here a high-throughput computational method named Ortholuge that further evaluates previously predicted orthologs (including those predicted using an RBH-based approach) – identifying which orthologs most closely reflect species divergence and may more likely have similar function. Ortholuge analyzes phylogenetic distance ratios involving two comparison species and an outgroup species, noting cases where relative gene divergence is atypical. It also identifies some cases of gene duplication after species divergence. Through simulations of incomplete genome data/gene loss, we show that the vast majority of genes falsely predicted as orthologs by an RBH-based method can be identified. Ortholuge was then used to estimate the number of false-positives (predominantly paralogs) in selected RBH-predicted ortholog datasets, identifying approximately 10% paralogs in a eukaryotic data set (mouse-rat comparison) and 5% in a bacterial data set (Pseudomonas putida – Pseudomonas syringae species comparison). Higher quality (more precise) datasets of orthologs, which we term "ssd-orthologs" (supporting-species-divergence-orthologs), were also constructed. These datasets, as well as Ortholuge software that may be used to characterize other species' datasets, are available at (software under GNU General Public License).
Conclusion
The Ortholuge method reported here appears to significantly improve the specificity (precision) of high-throughput ortholog prediction for both bacterial and eukaryotic species. This method, and its associated software, will aid those performing various comparative genomics-based analyses, such as the prediction of conserved regulatory elements upstream of orthologous genes.
doi:10.1186/1471-2105-7-270
PMCID: PMC1524997  PMID: 16729895
16.  DODO: an efficient orthologous genes assignment tool based on domain architectures. Domain based ortholog detection 
BMC Bioinformatics  2010;11(Suppl 7):S6.
Background
Orthologs are genes derived from the same ancestor gene loci after speciation events. Orthologous proteins usually have similar sequences and perform comparable biological functions. Therefore, ortholog identification is useful in annotations of newly sequenced genomes. With rapidly increasing number of sequenced genomes, constructing or updating ortholog relationship between all genomes requires lots of effort and computation time. In addition, elucidating ortholog relationships between distantly related genomes is challenging because of the lower sequence similarity. Therefore, an efficient ortholog detection method that can deal with large number of distantly related genomes is desired.
Results
An efficient ortholog detection pipeline DODO (DOmain based Detection of Orthologs) is created on the basis of domain architectures in this study. Supported by domain composition, which usually directly related with protein function, DODO could facilitate orthologs detection across distantly related genomes. DODO works in two main steps. Starting from domain information, it first assigns protein groups according to their domain architectures and further identifies orthologs within those groups with much reduced complexity. Here DODO is shown to detect orthologs between two genomes in considerably shorter period of time than traditional methods of reciprocal best hits and it is more significant when analyzed a large number of genomes. The output results of DODO are highly comparable with other known ortholog databases.
Conclusions
DODO provides a new efficient pipeline for detection of orthologs in a large number of genomes. In addition, a database established with DODO is also easier to maintain and could be updated relatively effortlessly. The pipeline of DODO could be downloaded from http://140.109.42.19:16080/dodo_web/home.htm
doi:10.1186/1471-2105-11-S7-S6
PMCID: PMC2957689  PMID: 21106128
17.  Systematic discovery of unannotated genes in 11 yeast species using a database of orthologous genomic segments 
BMC Genomics  2011;12:377.
Background
In standard BLAST searches, no information other than the sequences of the query and the database entries is considered. However, in situations where two genes from different species have only borderline similarity in a BLAST search, the discovery that the genes are located within a region of conserved gene order (synteny) can provide additional evidence that they are orthologs. Thus, for interpreting borderline search results, it would be useful to know whether the syntenic context of a database hit is similar to that of the query. This principle has often been used in investigations of particular genes or genomic regions, but to our knowledge it has never been implemented systematically.
Results
We made use of the synteny information contained in the Yeast Gene Order Browser database for 11 yeast species to carry out a systematic search for protein-coding genes that were overlooked in the original annotations of one or more yeast genomes but which are syntenic with their orthologs. Such genes tend to have been overlooked because they are short, highly divergent, or contain introns. The key features of our software - called SearchDOGS - are that the database entries are classified into sets of genomic segments that are already known to be orthologous, and that very weak BLAST hits are retained for further analysis if their genomic location is similar to that of the query. Using SearchDOGS we identified 595 additional protein-coding genes among the 11 yeast species, including two new genes in Saccharomyces cerevisiae. We found additional genes for the mating pheromone a-factor in six species including Kluyveromyces lactis.
Conclusions
SearchDOGS has proven highly successful for identifying overlooked genes in the yeast genomes. We anticipate that our approach can be adapted for study of further groups of species, such as bacterial genomes. More generally, the concept of doing sequence similarity searches against databases to which external information has been added may prove useful in other settings.
doi:10.1186/1471-2164-12-377
PMCID: PMC3161974  PMID: 21791067
18.  Bacterial syntenies: an exact approach with gene quorum 
BMC Bioinformatics  2011;12:193.
Background
The automatic identification of syntenies across multiple species is a key step in comparative genomics that helps biologists shed light both on evolutionary and functional problems.
Results
In this paper, we present a versatile tool to extract all syntenies from multiple bacterial species based on a clear-cut and very flexible definition of the synteny blocks that allows for gene quorum, partial gene correspondence, gaps, and a partial or total conservation of the gene order.
Conclusions
We apply this tool to two different kinds of studies. The first one is a search for functional gene associations. In this context, we compare our tool to a widely used heuristic - I-ADHORE - and show that at least up to ten genomes, the problem remains tractable with our exact definition and algorithm. The second application is linked to evolutionary studies: we verify in a multiple alignment setting that pairs of orthologs in synteny are more conserved than pairs outside, thus extending a previous pairwise study. We then show that this observation is in fact a function of the size of the synteny: the larger the block of synteny is, the more conserved the genes are.
doi:10.1186/1471-2105-12-193
PMCID: PMC3121647  PMID: 21605461
19.  Enrichment of homologs in insignificant BLAST hits by co-complex network alignment 
BMC Bioinformatics  2010;11:86.
Background
Homology is a crucial concept in comparative genomics. The algorithm probably most widely used for homology detection in comparative genomics, is BLAST. Usually a stringent score cutoff is applied to distinguish putative homologs from possible false positive hits. As a consequence, some BLAST hits are discarded that are in fact homologous.
Results
Analogous to the use of the genomics context in genome alignments, we test whether conserved functional context can be used to select candidate homologs from insignificant BLAST hits. We make a co-complex network alignment between complex subunits in yeast and human and find that proteins with an insignificant BLAST hit that are part of homologous complexes, are likely to be homologous themselves. Further analysis of the distant homologs we recovered using the co-complex network alignment, shows that a large majority of these distant homologs are in fact ancient paralogs.
Conclusions
Our results show that, even though evolution takes place at the sequence and genome level, co-complex networks can be used as circumstantial evidence to improve confidence in the homology of distantly related sequences.
doi:10.1186/1471-2105-11-86
PMCID: PMC2836305  PMID: 20152020
20.  GeneAlign: a coding exon prediction tool based on phylogenetical comparisons 
Nucleic Acids Research  2006;34(Web Server issue):W280-W284.
GeneAlign is a coding exon prediction tool for predicting protein coding genes by measuring the homologies between a sequence of a genome and related sequences, which have been annotated, of other genomes. Identifying protein coding genes is one of most important tasks in newly sequenced genomes. With increasing numbers of gene annotations verified by experiments, it is feasible to identify genes in the newly sequenced genomes by comparing to annotated genes of phylogenetically close organisms. GeneAlign applies CORAL, a heuristic linear time alignment tool, to determine if regions flanked by the candidate signals (initiation codon-GT, AG-GT and AG-STOP codon) are similar to annotated coding exons. Employing the conservation of gene structures and sequence homologies between protein coding regions increases the prediction accuracy. GeneAlign was tested on Projector dataset of 491 human–mouse homologous sequence pairs. At the gene level, both the average sensitivity and the average specificity of GeneAlign are 81%, and they are larger than 96% at the exon level. The rates of missing exons and wrong exons are smaller than 1%. GeneAlign is a free tool available at .
doi:10.1093/nar/gkl307
PMCID: PMC1538901  PMID: 16845010
21.  Evola: Ortholog database of all human genes in H-InvDB with manual curation of phylogenetic trees 
Nucleic Acids Research  2007;36(Database issue):D787-D792.
Orthologs are genes in different species that evolved from a common ancestral gene by speciation. Currently, with the rapid growth of transcriptome data of various species, more reliable orthology information is prerequisite for further studies. However, detection of orthologs could be erroneous if pairwise distance-based methods, such as reciprocal BLAST searches, are utilized. Thus, as a sub-database of H-InvDB, an integrated database of annotated human genes (http://h-invitational.jp/), we constructed a fully curated database of evolutionary features of human genes, called ‘Evola’. In the process of the ortholog detection, computational analysis based on conserved genome synteny and transcript sequence similarity was followed by manual curation by researchers examining phylogenetic trees. In total, 18 968 human genes have orthologs among 11 vertebrates (chimpanzee, mouse, cow, chicken, zebrafish, etc.), either computationally detected or manually curated orthologs. Evola provides amino acid sequence alignments and phylogenetic trees of orthologs and homologs. In ‘dN/dS view’, natural selection on genes can be analyzed between human and other species. In ‘Locus maps’, all transcript variants and their exon/intron structures can be compared among orthologous gene loci. We expect the Evola to serve as a comprehensive and reliable database to be utilized in comparative analyses for obtaining new knowledge about human genes. Evola is available at http://www.h-invitational.jp/evola/.
doi:10.1093/nar/gkm878
PMCID: PMC2238928  PMID: 17982176
22.  Benchmarking ortholog identification methods using functional genomics data 
Genome Biology  2006;7(4):R31.
A benchmarking of the most popular orthologous identification methods using functional genomics data identifies the two best methods.
Background
The transfer of functional annotations from model organism proteins to human proteins is one of the main applications of comparative genomics. Various methods are used to analyze cross-species orthologous relationships according to an operational definition of orthology. Often the definition of orthology is incorrectly interpreted as a prediction of proteins that are functionally equivalent across species, while in fact it only defines the existence of a common ancestor for a gene in different species. However, it has been demonstrated that orthologs often reveal significant functional similarity. Therefore, the quality of the orthology prediction is an important factor in the transfer of functional annotations (and other related information). To identify protein pairs with the highest possible functional similarity, it is important to qualify ortholog identification methods.
Results
To measure the similarity in function of proteins from different species we used functional genomics data, such as expression data and protein interaction data. We tested several of the most popular ortholog identification methods. In general, we observed a sensitivity/selectivity trade-off: the functional similarity scores per orthologous pair of sequences become higher when the number of proteins included in the ortholog groups decreases.
Conclusion
By combining the sensitivity and the selectivity into an overall score, we show that the InParanoid program is the best ortholog identification method in terms of identifying functionally equivalent proteins.
doi:10.1186/gb-2006-7-4-r31
PMCID: PMC1557999  PMID: 16613613
23.  Elucidating Genome Structure Evolution by Analysis of Isoapostatic Gene Clusters using Statistics of Variance of Gene Distances 
Identifying genomic regions that descended from a common ancestor is important for understanding the function and evolution of genomes. In related genomes, clusters of homologous gene pairs serve as evidence for candidate homologous regions, which make up genomic core. Previous studies on the structural organization of bacterial genomes revealed that basic backbone of genomic core is interrupted by genomic islands. Here, we applied statistics using variance of distances as a measure to classify conserved genes within a set of genomes according to their “isoapostatic” relationship, which keeps nearly identical distances of genes. The results of variance statistics analysis of cyanobacterial genomes including Prochlorococcus, Synechococcus, and Anabaena indicated that the conserved genes are classified into several groups called “virtual linkage groups (VLGs)” according to their positional conservation of orthologs over the genomes analyzed. The VLGs were used to define mosaic domain structure of the genomic core. The current model of mosaic genomic domains can explain global evolution of the genomic core of cyanobacteria. It also visualizes islands of lateral gene transfer. The stability and the robustness of the variance statistics are discussed. This method will also be useful in deciphering the structural organization of genomes in other groups of bacteria.
doi:10.1093/gbe/evp051
PMCID: PMC2839351  PMID: 20333218
comparative genomics; cyanobacteria; gene distance profile; genome core; isoapostatic genes
24.  Ortholog identification in the presence of domain architecture rearrangement 
Briefings in Bioinformatics  2011;12(5):413-422.
Ortholog identification is used in gene functional annotation, species phylogeny estimation, phylogenetic profile construction and many other analyses. Bioinformatics methods for ortholog identification are commonly based on pairwise protein sequence comparisons between whole genomes. Phylogenetic methods of ortholog identification have also been developed; these methods can be applied to protein data sets sharing a common domain architecture or which share a single functional domain but differ outside this region of homology. While promiscuous domains represent a challenge to all orthology prediction methods, overall structural similarity is highly correlated with proximity in a phylogenetic tree, conferring a degree of robustness to phylogenetic methods. In this article, we review the issues involved in orthology prediction when data sets include sequences with structurally heterogeneous domain architectures, with particular attention to automated methods designed for high-throughput application, and present a case study to illustrate the challenges in this area.
doi:10.1093/bib/bbr036
PMCID: PMC3178056  PMID: 21712343
phylogenomics; orthology; promiscuous domains; multi-domain architecture; function prediction; super-ortholog
25.  The nuclear OXPHOS genes in insecta: a common evolutionary origin, a common cis-regulatory motif, a common destiny for gene duplicates 
Background
When orthologous sequences from species distributed throughout an optimal range of divergence times are available, comparative genomics is a powerful tool to address problems such as the identification of the forces that shape gene structure during evolution, although the functional constraints involved may vary in different genes and lineages.
Results
We identified and annotated in the MitoComp2 dataset the orthologs of 68 nuclear genes controlling oxidative phosphorylation in 11 Drosophilidae species and in five non-Drosophilidae insects, and compared them with each other and with their counterparts in three vertebrates (Fugu rubripes, Danio rerio and Homo sapiens) and in the cnidarian Nematostella vectensis, taking into account conservation of gene structure and regulatory motifs, and preservation of gene paralogs in the genome. Comparative analysis indicates that the ancestral insect OXPHOS genes were intron rich and that extensive intron loss and lineage-specific intron gain occurred during evolution. Comparison with vertebrates and cnidarians also shows that many OXPHOS gene introns predate the cnidarian/Bilateria evolutionary split. The nuclear respiratory gene element (NRG) has played a key role in the evolution of the insect OXPHOS genes; it is constantly conserved in the OXPHOS orthologs of all the insect species examined, while their duplicates either completely lack the element or possess only relics of the motif.
Conclusion
Our observations reinforce the notion that the common ancestor of most animal phyla had intron-rich gene, and suggest that changes in the pattern of expression of the gene facilitate the fixation of duplications in the genome and the development of novel genetic functions.
doi:10.1186/1471-2148-7-215
PMCID: PMC2241641  PMID: 18315839

Results 1-25 (873344)