Related Articles

1.  The Ortholog Conjecture Is Untestable by the Current Gene Ontology but Is Supported by RNA Sequencing Data 
PLoS Computational Biology  2012;8(11):e1002784.
The ortholog conjecture posits that orthologous genes are functionally more similar than paralogous genes. This conjecture is a cornerstone of phylogenomics and is used daily by both computational and experimental biologists in predicting, interpreting, and understanding gene functions. A recent study, however, challenged the ortholog conjecture on the basis of experimentally derived Gene Ontology (GO) annotations and microarray gene expression data in human and mouse. It instead proposed that the functional similarity of homologous genes is primarily determined by the cellular context in which the genes act, explaining why a greater functional similarity of (within-species) paralogs than (between-species) orthologs was observed. Here we show that GO-based functional similarity between human and mouse orthologs, relative to that between paralogs, has been increasing in the last five years. Further, compared with paralogs, orthologs are less likely to be included in the same study, causing an underestimation in their functional similarity. A close examination of functional studies of homologs with identical protein sequences reveals experimental biases, annotation errors, and homology-based functional inferences that are labeled in GO as experimental. These problems and the temporary nature of the GO-based finding make the current GO inappropriate for testing the ortholog conjecture. RNA sequencing (RNA-Seq) is known to be superior to microarray for comparing the expressions of different genes or in different species. Our analysis of a large RNA-Seq dataset of multiple tissues from eight mammals and the chicken shows that the expression similarity between orthologs is significantly higher than that between within-species paralogs, supporting the ortholog conjecture and refuting the cellular context hypothesis for gene expression. 
We conclude that the ortholog conjecture remains largely valid to the extent that it has been tested, but further scrutiny using more and better functional data is needed.
Author Summary
Today's exceedingly high speed of genome sequencing, compared with the generally slow pace of functional assay, means that the functions of most genes identified from genome sequences will be annotated only through computational prediction. The primary source of information for this prediction is the functions of orthologous genes in model organisms, because orthologs are widely believed to be functionally similar, especially when compared with paralogs. This belief, known as the ortholog conjecture, was recently challenged on the basis of experimentally derived Gene Ontology (GO) annotations and microarray gene expression data, because these data revealed greater functional and expressional similarities of paralogs than orthologs. Here we show that GO-based estimates of functional similarities are temporary and unreliable, due to experimental biases, annotation errors, and homology-based functional inferences that are incorrectly labeled as experimental in GO. RNA sequencing (RNA-Seq) is superior to microarray for comparing the expressions of different genes or in different species, and our analysis of a large RNA-Seq dataset provides strong support to the ortholog conjecture for gene expression. We conclude that the ortholog conjecture remains largely valid to the extent that it has been tested, but further scrutiny using more and better functional data is needed.
doi:10.1371/journal.pcbi.1002784
PMCID: PMC3510086  PMID: 23209392
2.  Testing the Ortholog Conjecture with Comparative Functional Genomic Data from Mammals 
PLoS Computational Biology  2011;7(6):e1002073.
A common assumption in comparative genomics is that orthologous genes share greater functional similarity than do paralogous genes (the “ortholog conjecture”). Many methods used to computationally predict protein function are based on this assumption, even though it is largely untested. Here we present the first large-scale test of the ortholog conjecture using comparative functional genomic data from human and mouse. We use the experimentally derived functions of more than 8,900 genes, as well as an independent microarray dataset, to directly assess our ability to predict function using both orthologs and paralogs. Both datasets show that paralogs are often a much better predictor of function than are orthologs, even at lower sequence identities. Among paralogs, those found within the same species are consistently more functionally similar than those found in a different species. We also find that paralogous pairs residing on the same chromosome are more functionally similar than those on different chromosomes, perhaps due to higher levels of interlocus gene conversion between these pairs. In addition to offering implications for the computational prediction of protein function, our results shed light on the relationship between sequence divergence and functional divergence. We conclude that the most important factor in the evolution of function is not amino acid sequence, but rather the cellular context in which proteins act.
Author Summary
The use of model organisms in biological research rests upon the assumption that gene and protein functions discovered in one organism are likely to be the same or similar in another organism; hence the assumption that experiments in mouse will tell us about the function of genes in humans. A guiding principle in the assignment of function from one organism to another is that single-copy genes (“orthologs”) are statistically more likely to provide functional information than are multi-copy genes, whether in the same organism or different organisms. Here we have tested this idea by examining genes with known functions in human and mouse. Surprisingly, we find that multi-copy genes are equally or more likely to provide accurate functional information than are single-copy genes. Our results suggest that the organism itself plays at least as large a role in determining the function of genes as does the particular sequence of the gene alone. This insight will benefit the assignment of function to genes whose roles are not yet known by widening the pool of appropriate genes from which function can be inferred.
doi:10.1371/journal.pcbi.1002073
PMCID: PMC3111532  PMID: 21695233
3.  Resolving the Ortholog Conjecture: Orthologs Tend to Be Weakly, but Significantly, More Similar in Function than Paralogs 
PLoS Computational Biology  2012;8(5):e1002514.
The function of most proteins is not determined experimentally, but is extrapolated from homologs. According to the “ortholog conjecture”, or standard model of phylogenomics, protein function changes rapidly after duplication, leading to paralogs with different functions, while orthologs retain the ancestral function. We report here that a comparison of experimentally supported functional annotations among homologs from 13 genomes mostly supports this model. We show that to analyze GO annotation effectively, several confounding factors need to be controlled: authorship bias, variation of GO term frequency among species, variation of background similarity among species pairs, and propagated annotation bias. After controlling for these biases, we observe that orthologs have generally more similar functional annotations than paralogs. This is especially strong for sub-cellular localization. We observe only a weak decrease in functional similarity with increasing sequence divergence. These findings hold over a large diversity of species; notably orthologs from model organisms such as E. coli, yeast or mouse have conserved function with human proteins.
Author Summary
To infer the function of an unknown gene, possibly the most effective way is to identify a well-characterized evolutionarily related gene, and assume that they have both kept their ancestral function. If several such homologs are available, all else being equal, it has long been assumed that those that diverged by speciation (“orthologs”) are functionally closer than those that diverged by duplication (“paralogs”); thus function is more reliably inferred from the former. But despite its prevalence, this model mostly rests on first principles, as for the longest time we have not had sufficient data to test it empirically. Recently, some studies began investigating this question and have cast doubt on the validity of this model. Here, we show that by considering a wide range of organisms and data, and, crucially, by correcting for several easily overlooked biases affecting functional annotations, the standard model is corroborated by the presently available experimental data.
doi:10.1371/journal.pcbi.1002514
PMCID: PMC3355068  PMID: 22615551
4.  Phylogenomics of plant genomes: a methodology for genome-wide searches for orthologs in plants 
BMC Genomics  2008;9:183.
Background
Gene ortholog identification is now a major objective for mining the increasing amount of sequence data generated by complete or partial genome sequencing projects. Comparative and functional genomics urgently need a method for ortholog detection to reduce gene function inference and to aid in the identification of conserved or divergent genetic pathways between several species. As gene functions change during evolution, reconstructing the evolutionary history of genes should be a more accurate way to differentiate orthologs from paralogs. Phylogenomics takes into account phylogenetic information from high-throughput genome annotation and is the most straightforward way to infer orthologs. However, procedures for automatic detection of orthologs are still scarce and suffer from several limitations.
Results
We developed a procedure for ortholog prediction between Oryza sativa and Arabidopsis thaliana. Firstly, we established an efficient method to cluster A. thaliana and O. sativa full proteomes into gene families. Then, we developed an optimized phylogenomics pipeline for ortholog inference. We validated the full procedure using test sets of orthologs and paralogs to demonstrate that our method outperforms pairwise methods for ortholog predictions.
Conclusion
Our procedure achieved a high level of accuracy in predicting ortholog and paralog relationships. Phylogenomic predictions for all validated gene families in both species were easily achieved and we can conclude that our methodology outperforms similarly based methods.
doi:10.1186/1471-2164-9-183
PMCID: PMC2377279  PMID: 18426584
5.  Gene-oriented ortholog database: a functional comparison platform for orthologous loci 
The accumulation of complete genomic sequences increases the need for functional annotation. Transferring existing functional annotation between orthologs can speed up the annotation process and can also be used to check existing annotation. However, current protein sequence-based ortholog databases provide ambiguous and incomplete orthology in eukaryotes, because isoforms derived by alternative splicing (AS) often share high sequence similarity, which interferes with sequence-based ortholog identification. The Gene-Oriented Ortholog Database (GOOD) uses the genomic locations of transcripts to cluster AS-derived isoforms before ortholog delineation, eliminating the interference from AS. In the gene-oriented presentation, isoforms are clearly associated with their genes, which provides comprehensive ortholog information and allows them to be distinguished from paralogs. In addition, displaying clusters of isoforms between orthologous genes can reveal evolutionary variation at the transcription level. Based on orthology, GOOD also includes functional annotation from the Gene Ontology (GO) database. However, the GO database contains redundant annotations, in which both parent and child terms are assigned to the same gene, and this makes it difficult to compare term counts precisely between orthologous genes. Rather than listing term descriptions only, GOOD provides GO graphs that reveal the hierarchical relationships among divergent functionalities. The redundancy of GO terms can therefore be examined, and the context of the compared terms becomes clearer. In sum, GOOD can improve the interpretation of molecular function from experiments in model organisms and provide clear comparative genomic annotation across organisms.
Database URL: http://goods.ibms.sinica.edu.tw/goods/
doi:10.1093/database/baq002
PMCID: PMC2860896  PMID: 20428317
6.  Improving the specificity of high-throughput ortholog prediction 
BMC Bioinformatics  2006;7:270.
Background
Orthologs (genes that have diverged after a speciation event) tend to have similar function, and so their prediction has become an important component of comparative genomics and genome annotation. The gold standard phylogenetic analysis approach of comparing available organismal phylogeny to gene phylogeny is not easily automated for genome-wide analysis; therefore, ortholog prediction for large genome-scale datasets is typically performed using a reciprocal-best-BLAST-hits (RBH) approach. One problem with RBH is that it will incorrectly predict a paralog as an ortholog when incomplete genome sequences or gene loss is involved. In addition, there is an increasing interest in identifying orthologs most likely to have retained similar function.
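The reciprocal-best-hits idea described above, and the way gene loss can mislead it, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the score dictionaries stand in for parsed BLAST output, the gene names and scores are invented, and the function names are hypothetical rather than part of any published tool.

```python
from typing import Dict, Set, Tuple

Scores = Dict[Tuple[str, str], float]  # (query, subject) -> similarity score


def best_hits(scores: Scores) -> Dict[str, str]:
    """Map each query gene to its highest-scoring subject gene."""
    best: Dict[str, Tuple[str, float]] = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Scores, b_vs_a: Scores) -> Set[Tuple[str, str]]:
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}


# Hypothetical complete genomes: a1/b1 and a2/b2 are the true ortholog pairs.
a_vs_b = {("a1", "b1"): 300.0, ("a1", "b2"): 120.0,
          ("a2", "b1"): 110.0, ("a2", "b2"): 280.0}
b_vs_a = {("b1", "a1"): 300.0, ("b1", "a2"): 110.0,
          ("b2", "a1"): 120.0, ("b2", "a2"): 280.0}
complete = reciprocal_best_hits(a_vs_b, b_vs_a)

# Hypothetical gene loss: a1 and b2 are missing, so the paralogs a2 and b1
# become each other's best hit and are reported as an ortholog pair.
lossy = reciprocal_best_hits({("a2", "b1"): 110.0}, {("b1", "a2"): 110.0})
```

In the second call the only surviving pair is a paralogous one, which is the kind of false positive the abstract attributes to incomplete genomes or gene loss.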
Results
To address these issues, we present here a high-throughput computational method named Ortholuge that further evaluates previously predicted orthologs (including those predicted using an RBH-based approach) – identifying which orthologs most closely reflect species divergence and may more likely have similar function. Ortholuge analyzes phylogenetic distance ratios involving two comparison species and an outgroup species, noting cases where relative gene divergence is atypical. It also identifies some cases of gene duplication after species divergence. Through simulations of incomplete genome data/gene loss, we show that the vast majority of genes falsely predicted as orthologs by an RBH-based method can be identified. Ortholuge was then used to estimate the number of false-positives (predominantly paralogs) in selected RBH-predicted ortholog datasets, identifying approximately 10% paralogs in a eukaryotic data set (mouse-rat comparison) and 5% in a bacterial data set (Pseudomonas putida – Pseudomonas syringae species comparison). Higher quality (more precise) datasets of orthologs, which we term "ssd-orthologs" (supporting-species-divergence-orthologs), were also constructed. These datasets, as well as Ortholuge software that may be used to characterize other species' datasets, are available at (software under GNU General Public License).
Conclusion
The Ortholuge method reported here appears to significantly improve the specificity (precision) of high-throughput ortholog prediction for both bacterial and eukaryotic species. This method, and its associated software, will aid those performing various comparative genomics-based analyses, such as the prediction of conserved regulatory elements upstream of orthologous genes.
doi:10.1186/1471-2105-7-270
PMCID: PMC1524997  PMID: 16729895
7.  OrtholugeDB: a bacterial and archaeal orthology resource for improved comparative genomic analysis 
Nucleic Acids Research  2012;41(D1):D366-D376.
Prediction of orthologs (homologous genes that diverged because of speciation) is an integral component of many comparative genomics methods. Although orthologs are more likely to have similar function versus paralogs (genes that diverged because of duplication), recent studies have shown that their degree of functional conservation is variable. Also, there are inherent problems with several large-scale ortholog prediction approaches. To address these issues, we previously developed Ortholuge, which uses phylogenetic distance ratios to provide more precise ortholog assessments for a set of predicted orthologs. However, the original version of Ortholuge required manual intervention and was not easily accessible; therefore, we now report the development of OrtholugeDB, available online at http://www.pathogenomics.sfu.ca/ortholugedb. OrtholugeDB provides ortholog predictions for completely sequenced bacterial and archaeal genomes from NCBI based on reciprocal best Basic Local Alignment Search Tool hits, supplemented with further evaluation by the more precise Ortholuge method. The OrtholugeDB web interface facilitates user-friendly and flexible ortholog analysis, from single genes to genomes, plus flexible data download options. We compare Ortholuge with similar methods, showing how it may more consistently identify orthologs with conserved features across a wide range of taxonomic distances. OrtholugeDB facilitates rapid, and more accurate, bacterial and archaeal comparative genomic analysis and large-scale ortholog predictions.
doi:10.1093/nar/gks1241
PMCID: PMC3531125  PMID: 23203876
8.  RIO: Analyzing proteomes by automated phylogenomics using resampled inference of orthologs 
BMC Bioinformatics  2002;3:14.
Background
When analyzing protein sequences using sequence similarity searches, orthologous sequences (that diverged by speciation) are more reliable predictors of a new protein's function than paralogous sequences (that diverged by gene duplication). The utility of phylogenetic information in high-throughput genome annotation ("phylogenomics") is widely recognized, but existing approaches are either manual or not explicitly based on phylogenetic trees.
Results
Here we present RIO (Resampled Inference of Orthologs), a procedure for automated phylogenomics using explicit phylogenetic inference. RIO analyses are performed over bootstrap resampled phylogenetic trees to estimate the reliability of orthology assignments. We also introduce supplementary concepts that are helpful for functional inference. RIO has been implemented as a Perl pipeline connecting several C and Java programs. It is available at http://www.genetics.wustl.edu/eddy/forester/. A web server is at http://www.rio.wustl.edu/. RIO was tested on the Arabidopsis thaliana and Caenorhabditis elegans proteomes.
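One way to read "estimating the reliability of orthology assignments over bootstrap resampled trees" is as a simple support fraction: how often a candidate sequence is called an ortholog of the query across the resampled trees. The sketch below assumes the per-tree ortholog sets have already been inferred by some other step; the function name, the input layout, and the sequence names are hypothetical and are not taken from the RIO software.

```python
from typing import Sequence, Set


def orthology_support(candidate: str, orthologs_per_tree: Sequence[Set[str]]) -> float:
    """Fraction of bootstrap trees in which `candidate` is inferred
    to be orthologous to the query sequence."""
    if not orthologs_per_tree:
        raise ValueError("need at least one bootstrap tree")
    hits = sum(1 for orthologs in orthologs_per_tree if candidate in orthologs)
    return hits / len(orthologs_per_tree)


# Hypothetical result of four bootstrap replicates: each set lists the
# sequences inferred as orthologs of the query in that replicate's tree.
trees = [{"seqA", "seqB"}, {"seqA"}, {"seqA", "seqB"}, {"seqA"}]
support_a = orthology_support("seqA", trees)
support_b = orthology_support("seqB", trees)
```

Under this reading, seqA is supported in every replicate while seqB is supported in only half of them, so seqA would be the more reliable basis for functional inference.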
Conclusion
The RIO procedure is particularly useful for the automated detection of first representatives of novel protein subfamilies. We also describe how some orthologies can be misleading for functional inference.
doi:10.1186/1471-2105-3-14
PMCID: PMC116988  PMID: 12028595
9.  The other side of comparative genomics: genes with no orthologs between the cow and other mammalian species 
BMC Genomics  2009;10:604.
Background
With the rapid growth in the availability of genome sequence data, the automated identification of orthologous genes between species (orthologs) is of fundamental importance to facilitate functional annotation and studies on comparative and evolutionary genomics. Genes with no apparent orthologs between the bovine and human genome may be responsible for major differences between the species; however, such genes are often neglected in functional genomics studies.
Results
A BLAST-based method was exploited to explore the current annotation and orthology predictions in Ensembl. Genes with no orthologs between the two genomes were classified into groups based on alignments, ontology, manual curation and publicly available information. Starting from a high quality and specific set of orthology predictions, as provided by Ensembl, hidden relationship between genes and genomes of different mammalian species were unveiled using a highly sensitive approach, based on sequence similarity and genomic comparison.
Conclusions
The analysis identified 3,801 bovine genes with no orthologs in human and 1,010 human genes with no orthologs in cow, among which 411 and 43 genes, respectively, had no match at all in the other species. Most of the apparently non-orthologous genes may potentially have orthologs which were missed in the annotation process, despite having a high percentage of identity, because of differences in gene length and structure. The comparative analysis reported here identified gene variants, new genes and species-specific features and gave an overview of the other side of orthology which may help to improve the annotation of the bovine genome and the knowledge of structural differences between species.
doi:10.1186/1471-2164-10-604
PMCID: PMC2808326  PMID: 20003425
10.  Gene duplications in prokaryotes can be associated with environmental adaptation 
BMC Genomics  2010;11:588.
Background
Gene duplication is a normal evolutionary process. If there is no selective advantage in keeping the duplicated gene, it is usually reduced to a pseudogene and disappears from the genome. However, some paralogs are retained. These gene products are likely to be beneficial to the organism, e.g. in adaptation to new environmental conditions. The aim of our analysis is to investigate the properties of paralog-forming genes in prokaryotes, and to analyse the role of these retained paralogs by relating gene properties to life style of the corresponding prokaryotes.
Results
Paralogs were identified in a number of prokaryotes, and these paralogs were compared to singletons of persistent orthologs based on functional classification. This showed that the paralogs were associated with for example energy production, cell motility, ion transport, and defence mechanisms. A statistical overrepresentation analysis of gene and protein annotations was based on paralogs of the 200 prokaryotes with the highest fraction of paralog-forming genes. Biclustering of overrepresented gene ontology terms versus species was used to identify clusters of properties associated with clusters of species. The clusters were classified using similarity scores on properties and species to identify interesting clusters, and a subset of clusters were analysed by comparison to literature data. This analysis showed that paralogs often are associated with properties that are important for survival and proliferation of the specific organisms. This includes processes like ion transport, locomotion, chemotaxis and photosynthesis. However, the analysis also showed that the gene ontology terms sometimes were too general, imprecise or even misleading for automatic analysis.
Conclusions
Properties described by gene ontology terms identified in the overrepresentation analysis are often consistent with individual prokaryote lifestyles and are likely to give a competitive advantage to the organism. Paralogs and singletons dominate different categories of functional classification, where paralogs in particular seem to be associated with processes involving interaction with the environment.
doi:10.1186/1471-2164-11-588
PMCID: PMC3091735  PMID: 20961426
11.  COCO-CL: hierarchical clustering of homology relations based on evolutionary correlations 
Bioinformatics (Oxford, England)  2006;22(7):779-788.
Motivation
Determining orthology relations among genes across multiple genomes is an important problem in the post-genomic era. Identifying orthologous genes can not only help predict functional annotations for newly sequenced or poorly characterized genomes, but can also help predict new protein–protein interactions. Unfortunately, determining orthology relation through computational methods is not straightforward due to the presence of paralogs. Traditional approaches have relied on pairwise sequence comparisons to construct graphs, which were then partitioned into putative clusters of orthologous groups. These methods do not attempt to preserve the non-transitivity and hierarchic nature of the orthology relation.
Results
We propose a new method, COCO-CL, for hierarchical clustering of homology relations and identification of orthologous groups of genes. Unlike previous approaches, which are based on pairwise sequence comparisons, our method explores the correlation of evolutionary histories of individual genes in a more global context. COCO-CL can be used as a semi-independent method to delineate the orthology/paralogy relation for a refined set of homologous proteins obtained using a less-conservative clustering approach, or as a refiner that removes putative out-paralogs from clusters computed using a more inclusive approach. We analyze our clustering results manually, with support from literature and functional annotations. Since our orthology determination procedure does not employ a species tree to infer duplication events, it can be used in situations when the species tree is unknown or uncertain.
doi:10.1093/bioinformatics/btl009
PMCID: PMC1620014  PMID: 16434444
12.  Phyletic Profiling with Cliques of Orthologs Is Enhanced by Signatures of Paralogy Relationships 
PLoS Computational Biology  2013;9(1):e1002852.
New microbial genomes are sequenced at a high pace, allowing insight into the genetics of not only cultured microbes, but a wide range of metagenomic collections such as the human microbiome. To understand the deluge of genomic data we face, computational approaches for gene functional annotation are invaluable. We introduce a novel model for computational annotation that refines two established concepts: annotation based on homology and annotation based on phyletic profiling. The phyletic profiling-based model that includes both inferred orthologs and paralogs—homologs separated by a speciation and a duplication event, respectively—provides more annotations at the same average Precision than the model that includes only inferred orthologs. For experimental validation, we selected 38 poorly annotated Escherichia coli genes for which the model assigned one of three GO terms with high confidence: involvement in DNA repair, protein translation, or cell wall synthesis. Results of antibiotic stress survival assays on E. coli knockout mutants showed high agreement with our model's estimates of accuracy: out of 38 predictions obtained at the reported Precision of 60%, we confirmed 25 predictions, indicating that our confidence estimates can be used to make informed decisions on experimental validation. Our work will contribute to making experimental validation of computational predictions more approachable, both in cost and time. Our predictions for 998 prokaryotic genomes include ∼400000 specific annotations with the estimated Precision of 90%, ∼19000 of which are highly specific—e.g. “penicillin binding,” “tRNA aminoacylation for protein translation,” or “pathogenesis”—and are freely available at http://gorbi.irb.hr/.
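The validation figures quoted above can be checked with a short calculation. The sketch treats Precision as confirmed predictions divided by all predictions tested, which is the standard definition; the variable names are illustrative only.

```python
def precision(confirmed: int, predictions: int) -> float:
    """Precision = confirmed predictions / all predictions tested."""
    if predictions <= 0:
        raise ValueError("predictions must be positive")
    return confirmed / predictions


# Figures from the abstract: 38 predictions at a reported Precision of 60%,
# of which 25 were confirmed experimentally.
observed = precision(25, 38)        # about 0.66
expected_confirmed = 0.60 * 38      # about 22.8 confirmations expected
```

The observed value of roughly 66% is slightly above the reported 60%, consistent with the authors' statement that the model's confidence estimates agree well with the experiments.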
Author Summary
While both the number and the diversity of sequenced prokaryotic genomes grow rapidly, the number of specific assignments of gene functions in the databases remains low and skewed toward the model prokaryote Escherichia coli. To aid in understanding the full set of newly sequenced genes, we created a computational model for assignment of function to prokaryotic genomes. The result is an innovative framework for orthology and paralogy-aware phyletic profiling that provides a large number of computational annotations with high predictive accuracy in train/test evaluations. Our predictions include annotations for 1.3 million genes with the estimated Precision of 90%; these, and many more predictions for 998 prokaryotic genomes are freely available at http://gorbi.irb.hr/. More importantly, we show a proof of principle that our functional annotation model can be used to generate new biological hypotheses: we performed experiments on 38 E. coli knockout mutants and showed that our annotation model provides realistic estimates of predictive accuracy. With this, our work will contribute to making experimental validation of computational predictions more approachable, both in cost and time.
doi:10.1371/journal.pcbi.1002852
PMCID: PMC3536626  PMID: 23308060
13.  A Tight Link between Orthologs and Bidirectional Best Hits in Bacterial and Archaeal Genomes 
Genome Biology and Evolution  2012;4(12):1286-1294.
Orthologous relationships between genes are routinely inferred from bidirectional best hits (BBH) in pairwise genome comparisons. However, to our knowledge, it has never been quantitatively demonstrated that orthologs form BBH. To test this “BBH-orthology conjecture,” we take advantage of the operon organization of bacterial and archaeal genomes and assume that, when two genes in compared genomes are flanked by two BBH and show statistically significant sequence similarity to one another, these genes are bona fide orthologs. Under this assumption, we tested whether middle genes in “syntenic orthologous gene triplets” form BBH. We found that this was the case in more than 95% of the syntenic gene triplets in all genome comparisons. A detailed examination of the exceptions to this pattern, including maximum likelihood phylogenetic tree analysis, showed that some of these deviations involved artifacts of genome annotation, whereas very small fractions represented random assignment of the best hit to one of closely related in-paralogs, paralogous displacement in situ, or even less frequent genuine violations of the BBH–orthology conjecture caused by acceleration of evolution in one of the orthologs. We conclude that, at least in prokaryotes, genes for which independent evidence of orthology is available typically form BBH and, conversely, BBH can serve as a strong indication of gene orthology.
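The triplet test described above can be sketched as follows. This is a simplified reading, not the authors' pipeline: gene order is given as plain lists, the BBH pairs and the set of significantly similar pairs are assumed to be precomputed, every BBH partner is assumed to appear in the second genome's gene order, and all names are hypothetical.

```python
from typing import List, Set, Tuple

Pair = Tuple[str, str]  # (gene in genome A, gene in genome B)


def middle_gene_bbh_fraction(order_a: List[str], order_b: List[str],
                             bbh: Set[Pair], similar: Set[Pair]) -> float:
    """Among syntenic triplets whose flanking genes are BBH and whose middle
    genes are significantly similar, return the fraction of middle-gene
    pairs that are themselves BBH."""
    partner = dict(bbh)  # A gene -> its BBH partner in B
    index_b = {gene: i for i, gene in enumerate(order_b)}
    triplets = confirmed = 0
    for left, middle, right in zip(order_a, order_a[1:], order_a[2:]):
        if left not in partner or right not in partner:
            continue  # both flanking genes must have a BBH partner
        i, j = index_b[partner[left]], index_b[partner[right]]
        if abs(i - j) != 2:
            continue  # partners must flank exactly one gene in genome B
        middle_b = order_b[(i + j) // 2]
        if (middle, middle_b) not in similar:
            continue  # middle genes must be significantly similar
        triplets += 1
        confirmed += (middle, middle_b) in bbh
    if triplets == 0:
        raise ValueError("no syntenic triplets found")
    return confirmed / triplets


# Hypothetical toy genomes with five genes each.
order_a = ["a1", "a2", "a3", "a4", "a5"]
order_b = ["b1", "b2", "b3", "b4", "b5"]
bbh = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a5", "b5")}
similar = {("a2", "b2"), ("a4", "b4")}
fraction = middle_gene_bbh_fraction(order_a, order_b, bbh, similar)
```

In this toy input two triplets qualify, and only one of the two middle-gene pairs is a BBH, so the fraction is one half; the abstract reports more than 95% for real genome comparisons.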
doi:10.1093/gbe/evs100
PMCID: PMC3542571  PMID: 23160176
orthology; bidirectional best hit; genome comparison; synteny
14.  Correlating Traits of Gene Retention, Sequence Divergence, Duplicability and Essentiality in Vertebrates, Arthropods, and Fungi 
Delineating ancestral gene relations among a large set of sequenced eukaryotic genomes allowed us to rigorously examine links between evolutionary and functional traits. We classified 86% of over 1.36 million protein-coding genes from 40 vertebrates, 23 arthropods, and 32 fungi into orthologous groups and linked over 90% of them to Gene Ontology or InterPro annotations. Quantifying properties of ortholog phyletic retention, copy-number variation, and sequence conservation, we examined correlations with gene essentiality and functional traits. More than half of vertebrate, arthropod, and fungal orthologs are universally present across each lineage. These universal orthologs are preferentially distributed in groups with almost all single-copy or all multicopy genes, and sequence evolution of the predominantly single-copy orthologous groups is markedly more constrained. Essential genes from representative model organisms, Mus musculus, Drosophila melanogaster, and Saccharomyces cerevisiae, are significantly enriched in universal orthologs within each lineage, and essential-gene-containing groups consistently exhibit greater sequence conservation than those without. This study of eukaryotic gene repertoire evolution identifies shared fundamental principles and highlights lineage-specific features; it also confirms that essential genes are highly retained and conclusively supports the “knockout-rate prediction” of stronger constraints on essential gene sequence evolution. However, the distinction between sequence conservation of single- versus multicopy orthologs is quantitatively more prominent than between orthologous groups with and without essential genes. The previously underappreciated difference in the tolerance of gene duplications and contrasting evolutionary modes of “single-copy control” versus “multicopy license” may reflect a major evolutionary mechanism that allows extended exploration of gene sequence space.
doi:10.1093/gbe/evq083
PMCID: PMC3030422  PMID: 21148284
orthologs; essential genes; molecular evolution; vertebrates; arthropods; fungi
15.  Selection in the evolution of gene duplications 
Genome Biology  2002;3(2):research0008.1-research0008.9.
Background
Gene duplications have a major role in the evolution of new biological functions. Theoretical studies often assume that a duplication per se is selectively neutral and that, following a duplication, one of the gene copies is freed from purifying (stabilizing) selection, which creates the potential for evolution of a new function.
Results
In search of systematic evidence of accelerated evolution after duplication, we used data from 26 bacterial, six archaeal, and seven eukaryotic genomes to compare the mode and strength of selection acting on recently duplicated genes (paralogs) and on similarly diverged, unduplicated orthologous genes in different species. We find that the ratio of nonsynonymous to synonymous substitutions (Kn/Ks) in most paralogous pairs is <<1 and that paralogs typically evolve at similar rates, without significant asymmetry, indicating that both paralogs produced by a duplication are subject to purifying selection. This selection is, however, substantially weaker than the purifying selection affecting unduplicated orthologs that have diverged to the same extent as the analyzed paralogs. Most of the recently duplicated genes appear to be involved in various forms of environmental response; in particular, many of them encode membrane and secreted proteins.
Conclusions
The results of this analysis indicate that recently duplicated paralogs evolve faster than orthologs with the same level of divergence and similar functions, but apparently do not experience a phase of neutral evolution. We hypothesize that gene duplications that persist in an evolving lineage are beneficial from the time of their origin, due primarily to a protein dosage effect in response to variable environmental conditions; duplications are likely to give rise to new functions at a later phase of their evolution once a higher level of divergence is reached.
PMCID: PMC65685  PMID: 11864370
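The Kn/Ks comparison in entry 15 can be illustrated with a minimal Python sketch. The abstract does not say how Kn and Ks were estimated, so everything here is an assumption: the function names are hypothetical, the counts of sites and differences are taken as already known, and a Jukes-Cantor correction is used as one common way to account for multiple substitutions at the same site.

```python
import math


def jukes_cantor(p):
    """Convert an observed proportion of differing sites into an
    estimated number of substitutions per site (Jukes-Cantor)."""
    if not 0 <= p < 0.75:
        raise ValueError("proportion must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


def kn_ks(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Ratio of nonsynonymous to synonymous substitution rates.

    All four inputs are assumed to be precomputed for one gene pair;
    counting them from a codon alignment is outside this sketch.
    """
    kn = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ks = jukes_cantor(syn_diffs / syn_sites)
    if ks == 0:
        raise ValueError("Ks is zero, so the ratio is undefined")
    return kn / ks


# Illustrative (made-up) counts for a single gene pair.
ratio = kn_ks(nonsyn_diffs=10, nonsyn_sites=700, syn_diffs=30, syn_sites=300)
print(f"Kn/Ks = {ratio:.3f}")
```

A ratio well below 1, as reported for most paralogous pairs in the abstract, is conventionally read as purifying selection; a ratio near 1 would suggest neutral evolution.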
16.  Identification of gene-oriented exon orthology between human and mouse 
BMC Genomics  2012;13(Suppl 1):S10.
Background
Gene orthology has been well studied in evolutionary biology and is thought to have important implications for functional genome annotation. With the accumulation of transcriptomic data, alternative splicing is now taken into account when assigning gene orthologs, and it has been suggested that orthology be considered at the transcript level as well. Whether orthology is defined for genes or for transcripts, exons are the basic units that make up the whole gene structure; however, no study has been reported on how to build exon-level orthology at whole-genome scale. Therefore, it is essential to establish a gene-oriented exon orthology dataset.
Results
Using a customized pipeline, we first build exon orthologous relationships from assigned gene ortholog pairs in two well-annotated genomes: human and mouse. More than 92% of non-overlapping exons have at least one ortholog between human and mouse, and only a small portion of them have more than one ortholog. Exons located in the coding region are more conserved, in the sense that their ortholog counterparts are found more often. Within the untranslated regions, the 5' UTR appears to be more diverse than the 3' UTR according to exon orthology designations. Interestingly, most exons located in the coding region are also conserved in length, but this length conservation drops sharply in untranslated regions. In addition, we allowed multiple assignments in exon orthologs, and a subset of exons with possible fusion/split events was defined after a thorough analysis procedure.
Conclusions
Identification of orthologs at the exon level is essential to provide a detailed way to interrogate gene orthology and splicing analysis. It could be used to extend the genome annotation as well. Besides examining the one-to-one orthologous relationship, we manage the one-to-multi exon pairs to represent complicated exon generation behavior. Our results can be further applied in many research fields studying intron-exon structure and alternative/constitutive exons in functional genomic areas.
doi:10.1186/1471-2164-13-S1-S10
PMCID: PMC3303729  PMID: 22369432
17.  GreenPhylDB v2.0: comparative and functional genomics in plants 
Nucleic Acids Research  2010;39(Database issue):D1095-D1102.
GreenPhylDB is a database designed for comparative and functional genomics based on complete genomes. Version 2 now contains sixteen full genomes of members of the plantae kingdom, ranging from algae to angiosperms, automatically clustered into gene families. Gene families are manually annotated and then analyzed phylogenetically in order to elucidate orthologous and paralogous relationships. The database offers various lists of gene families including plant, phylum and species specific gene families. For each gene cluster or gene family, easy access to gene composition, protein domains, publications, external links and orthologous gene predictions is provided. Web interfaces have been further developed to improve the navigation through information related to gene families. New analysis tools are also available, such as a gene family ontology browser that facilitates exploration. GreenPhylDB is a component of the South Green Bioinformatics Platform (http://southgreen.cirad.fr/) and is accessible at http://greenphyl.cirad.fr. It enables comparative genomics in a broad taxonomy context to enhance the understanding of evolutionary processes and thus tends to speed up gene discovery.
doi:10.1093/nar/gkq811
PMCID: PMC3013755  PMID: 20864446
18.  A phylogenomic gene cluster resource: the Phylogenetically Inferred Groups (PhIGs) database 
BMC Bioinformatics  2006;7:201.
Background
We present here the PhIGs database, a phylogenomic resource for sequenced genomes. Although many methods exist for clustering gene families, very few attempt to create truly orthologous clusters sharing descent from a single ancestral gene across a range of evolutionary depths. Although these non-phylogenetic gene family clusters have been used broadly for gene annotation, errors are known to be introduced by the artifactual association of slowly evolving paralogs and lack of annotation for those more rapidly evolving. A full phylogenetic framework is necessary for accurate inference of function and for many studies that address pattern and mechanism of the evolution of the genome. The automated generation of evolutionary gene clusters, creation of gene trees, determination of orthology and paralogy relationships, and the correlation of this information with gene annotations, expression information, and genomic context is an important resource to the scientific community.
Discussion
The PhIGs database currently contains 23 completely sequenced genomes of fungi and metazoans, containing 409,653 genes that have been grouped into 42,645 gene clusters. Each gene cluster is built such that the gene sequence distances are consistent with the known organismal relationships and in so doing, maximizing the likelihood for the clusters to represent truly orthologous genes. The PhIGs website contains tools that allow the study of genes within their phylogenetic framework through keyword searches on annotations, such as GO and InterPro assignments, and sequence similarity searches by BLAST and HMM. In addition to displaying the evolutionary relationships of the genes in each cluster, the website also allows users to view the relative physical positions of homologous genes in specified sets of genomes.
Summary
Accurate analyses of genes and genomes can only be done within their full phylogenetic context. The PhIGs database and corresponding website address this problem for the scientific community. Our goal is to expand the content as more genomes are sequenced and use this framework to incorporate more analyses.
doi:10.1186/1471-2105-7-201
PMCID: PMC1523372  PMID: 16608522
19.  QuartetS: a fast and accurate algorithm for large-scale orthology detection 
Nucleic Acids Research  2011;39(13):e88.
The unparalleled growth in the availability of genomic data offers both a challenge to develop orthology detection methods that are simultaneously accurate and high throughput and an opportunity to improve orthology detection by leveraging evolutionary evidence in the accumulated sequenced genomes. Here, we report a novel orthology detection method, termed QuartetS, that exploits evolutionary evidence in a computationally efficient manner. Based on the well-established evolutionary concept that gene duplication events can be used to discriminate homologous genes, QuartetS uses an approximate phylogenetic analysis of quartet gene trees to infer the occurrence of duplication events and discriminate paralogous from orthologous genes. We used function- and phylogeny-based metrics to perform a large-scale, systematic comparison of the orthology predictions of QuartetS with those of four other methods [bi-directional best hit (BBH), outgroup, OMA and QuartetS-C (QuartetS followed by clustering)], involving 624 bacterial genomes and >2 million genes. We found that QuartetS slightly, but consistently, outperformed the highly specific OMA method and that, while consuming only 0.5% additional computational time, QuartetS predicted 50% more orthologs with a 50% lower false positive rate than the widely used BBH method. We conclude that, for large-scale phylogenetic and functional analysis, QuartetS and QuartetS-C should be preferred, respectively, in applications where high accuracy and high throughput are required.
doi:10.1093/nar/gkr308
PMCID: PMC3141274  PMID: 21572104
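For readers unfamiliar with the bi-directional best hit (BBH) baseline that entry 19 compares QuartetS against, the idea can be sketched in a few lines of Python. This is a generic illustration, not the QuartetS authors' implementation: the input layout (nested dictionaries of similarity scores, for example BLAST bit scores), the function names, and the toy gene identifiers are all assumptions.

```python
def best_hit(hits):
    """Map each query gene to its highest-scoring target gene.

    hits: {query_gene: {target_gene: score}}
    """
    return {q: max(t, key=t.get) for q, t in hits.items() if t}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hit(a_vs_b)
    best_ba = best_hit(b_vs_a)
    return sorted(
        (a, b) for a, b in best_ab.items() if best_ba.get(b) == a
    )


# Toy similarity scores between two hypothetical genomes A and B.
a_vs_b = {"a1": {"b1": 200, "b2": 150}, "a2": {"b1": 180, "b2": 90}}
b_vs_a = {"b1": {"a1": 210, "a2": 170}, "b2": {"a1": 140, "a2": 95}}

print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

In this toy input only a1 and b1 pick each other, so a2 and b2 are left unpaired. The abstract reports that QuartetS, which additionally infers duplication events from quartet gene trees, predicted more orthologs than BBH at a lower false positive rate.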
20.  DODO: an efficient orthologous genes assignment tool based on domain architectures. Domain based ortholog detection 
BMC Bioinformatics  2010;11(Suppl 7):S6.
Background
Orthologs are genes derived from the same ancestral gene locus by speciation events. Orthologous proteins usually have similar sequences and perform comparable biological functions. Therefore, ortholog identification is useful in annotating newly sequenced genomes. With the rapidly increasing number of sequenced genomes, constructing or updating ortholog relationships between all genomes requires considerable effort and computation time. In addition, elucidating ortholog relationships between distantly related genomes is challenging because of their lower sequence similarity. Therefore, an efficient ortholog detection method that can deal with a large number of distantly related genomes is desired.
Results
In this study, an efficient ortholog detection pipeline, DODO (DOmain based Detection of Orthologs), is created on the basis of domain architectures. Because domain composition is usually directly related to protein function, DODO can facilitate ortholog detection across distantly related genomes. DODO works in two main steps. Starting from domain information, it first assigns proteins to groups according to their domain architectures and then identifies orthologs within those groups, with much reduced complexity. Here DODO is shown to detect orthologs between two genomes in a considerably shorter time than the traditional reciprocal best hits method, and the advantage is greater when a large number of genomes is analyzed. The output of DODO is highly comparable with other known ortholog databases.
Conclusions
DODO provides a new, efficient pipeline for detecting orthologs in a large number of genomes. In addition, a database established with DODO is easier to maintain and can be updated with relatively little effort. The DODO pipeline can be downloaded from http://140.109.42.19:16080/dodo_web/home.htm
doi:10.1186/1471-2105-11-S7-S6
PMCID: PMC2957689  PMID: 21106128
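The first of the two DODO steps described in entry 20, grouping proteins by domain architecture so that orthologs only need to be sought within each group, can be sketched roughly as follows in Python. This is an illustration under stated assumptions, not the DODO code: the record layout, function names, genome labels, and protein identifiers are hypothetical, and the abstract does not specify how orthologs are chosen within a group, so the sketch stops at listing candidate pairs.

```python
from collections import defaultdict


def group_by_architecture(proteins):
    """Group proteins that share the same ordered list of domains.

    proteins: iterable of (genome, protein_id, [domain, ...])
    Returns {architecture_tuple: [(genome, protein_id), ...]}.
    """
    groups = defaultdict(list)
    for genome, protein_id, domains in proteins:
        groups[tuple(domains)].append((genome, protein_id))
    return groups


def candidate_pairs(groups, genome_a, genome_b):
    """List cross-genome protein pairs sharing a domain architecture."""
    pairs = []
    for members in groups.values():
        in_a = [pid for g, pid in members if g == genome_a]
        in_b = [pid for g, pid in members if g == genome_b]
        pairs.extend((x, y) for x in in_a for y in in_b)
    return sorted(pairs)


# Toy annotations; domain order matters, so Q3 lands in its own group.
proteins = [
    ("hs", "P1", ["SH3", "SH2", "Kinase"]),
    ("mm", "Q1", ["SH3", "SH2", "Kinase"]),
    ("hs", "P2", ["Kinase"]),
    ("mm", "Q2", ["Kinase"]),
    ("mm", "Q3", ["SH2", "SH3", "Kinase"]),
]

groups = group_by_architecture(proteins)
print(candidate_pairs(groups, "hs", "mm"))
```

Restricting comparisons to proteins with identical architectures shrinks the search space from all cross-genome pairs to the pairs inside each group, which is consistent with the reduced complexity the abstract describes.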
21.  Benchmarking ortholog identification methods using functional genomics data 
Genome Biology  2006;7(4):R31.
A benchmark of the most popular ortholog identification methods using functional genomics data identifies the two best methods.
Background
The transfer of functional annotations from model organism proteins to human proteins is one of the main applications of comparative genomics. Various methods are used to analyze cross-species orthologous relationships according to an operational definition of orthology. Often the definition of orthology is incorrectly interpreted as a prediction of proteins that are functionally equivalent across species, while in fact it only defines the existence of a common ancestor for a gene in different species. However, it has been demonstrated that orthologs often reveal significant functional similarity. Therefore, the quality of the orthology prediction is an important factor in the transfer of functional annotations (and other related information). To identify protein pairs with the highest possible functional similarity, it is important to qualify ortholog identification methods.
Results
To measure the similarity in function of proteins from different species we used functional genomics data, such as expression data and protein interaction data. We tested several of the most popular ortholog identification methods. In general, we observed a sensitivity/selectivity trade-off: the functional similarity scores per orthologous pair of sequences become higher when the number of proteins included in the ortholog groups decreases.
Conclusion
By combining the sensitivity and the selectivity into an overall score, we show that the InParanoid program is the best ortholog identification method in terms of identifying functionally equivalent proteins.
doi:10.1186/gb-2006-7-4-r31
PMCID: PMC1557999  PMID: 16613613
22.  Developing a community-based genetic nomenclature for anole lizards 
BMC Genomics  2011;12:554.
Background
Comparative studies of amniotes have been hindered by a dearth of reptilian molecular sequences. With the genomic assembly of the green anole, Anolis carolinensis, now available, non-avian reptilian genes can be compared to mammalian, avian, and amphibian homologs. Furthermore, with more than 350 extant species in the genus Anolis, anoles are an unparalleled example of tetrapod genetic diversity and divergence. Because Anolis is an important ecological, genetic and now genomic reference, it is imperative to develop a standardized Anolis gene nomenclature alongside associated vocabularies and other useful metrics.
Results
Here we report the formation of the Anolis Gene Nomenclature Committee (AGNC) and propose a standardized evolutionary characterization code that will help researchers to define gene orthology and paralogy with tetrapod homologs, provide a system for naming novel genes in Anolis and other reptiles, furnish abbreviations to facilitate comparative studies among the Anolis species and related iguanid squamates, and classify the geographical origins of Anolis subpopulations.
Conclusions
This report has been generated in close consultation with members of the Anolis and genomic research communities, and using public database resources including NCBI and Ensembl. Updates will continue to be regularly posted to new research community websites such as lizardbase. We anticipate that this standardized gene nomenclature will facilitate the accessibility of reptilian sequences for comparative studies among tetrapods and will further serve as a template for other communities in their sequencing and annotation initiatives.
doi:10.1186/1471-2164-12-554
PMCID: PMC3248570  PMID: 22077994
23.  Domain architecture conservation in orthologs 
BMC Bioinformatics  2011;12:326.
Background
As orthologous proteins are expected to retain function more often than other homologs, they are often used for functional annotation transfer between species. However, ortholog identification methods do not take into account changes in domain architecture, which are likely to modify a protein's function. By domain architecture we refer to the sequential arrangement of domains along a protein sequence.
To assess the level of domain architecture conservation among orthologs, we carried out a large-scale study of such events between human and 40 other species spanning the entire evolutionary range. We designed a score to measure domain architecture similarity and used it to analyze differences in domain architecture conservation between orthologs and paralogs relative to the conservation of primary sequence. We also statistically characterized the extents of different types of domain swapping events across pairs of orthologs and paralogs.
Results
The analysis shows that orthologs exhibit greater domain architecture conservation than paralogous homologs, even when differences in average sequence divergence are compensated for, for homologs that have diverged beyond a certain threshold. We interpret this as an indication of a stronger selective pressure on orthologs than paralogs to retain the domain architecture required for the proteins to perform a specific function. In general, orthologs as well as the closest paralogous homologs have very similar domain architectures, even at large evolutionary separation.
The most common domain architecture changes observed in both ortholog and paralog pairs involved insertion/deletion of new domains, while domain shuffling and segment duplication/deletion were very infrequent.
Conclusions
On the whole, our results support the hypothesis that function conservation between orthologs demands higher domain architecture conservation than other types of homologs, relative to primary sequence conservation. This supports the notion that orthologs are functionally more similar than other types of homologs at the same evolutionary distance.
doi:10.1186/1471-2105-12-326
PMCID: PMC3215765  PMID: 21819573
24.  Co-evolutionary Rates of Functionally Related Yeast Genes 
Evolutionary knowledge is often used to facilitate computational attempts at gene function prediction. One rich source of evolutionary information is the relative rates of gene sequence divergence, and in this report we explore the connection between gene evolutionary rates and function. We performed a genome-scale evaluation of the relationship between evolutionary rates and functional annotations for the yeast Saccharomyces cerevisiae. Non-synonymous (dN) and synonymous (dS) substitution rates were calculated for 1,095 orthologous gene sets common to S. cerevisiae and six other closely related yeast species. Differences in evolutionary rates between pairs of genes (ΔdN & ΔdS) were then compared to their functional similarities (sGO), which were measured using Gene Ontology (GO) annotations. Substantial and statistically significant correlations were found between ΔdN and sGO, whereas there is no apparent relationship between ΔdS and sGO. These results are consistent with a mode of action for natural selection that is based on similar rates of elimination of deleterious protein coding sequence variants for functionally related genes. The connection between gene evolutionary rates and function was stronger than seen for phylogenetic profiles, which have previously been employed to inform functional inference. The co-evolution of functionally related yeast genes points to the relevance of specific function for the efficacy of natural selection and underscores the utility of gene evolutionary rates for functional predictions.
PMCID: PMC2674680  PMID: 18345352
Functional inference; Co-evolution; natural selection; genome evolution; gene ontology
