Viruses are the most abundant biological entities on earth and encompass a vast amount of genetic diversity. The recent rapid increase in the number of sequenced viral genomes has created unprecedented opportunities for gaining new insight into the structure and evolution of the virosphere. Here, we present an update of the phage orthologous groups (POGs), a collection of 4,542 clusters of orthologous genes from bacteriophages that now also includes viruses infecting archaea and encompasses more than 1,000 distinct virus genomes. Analysis of this expanded data set shows that the number of POGs keeps growing without saturation and that a substantial majority of the POGs remain specific to viruses, lacking homologues in prokaryotic cells, outside known proviruses. Thus, the great majority of virus genes apparently remains to be discovered. A complementary observation is that numerous viral genomes remain poorly, if at all, covered by POGs. The genome coverage by POGs is expected to increase as more genomes are sequenced. Taxon-specific, single-copy signature genes that are not observed in prokaryotic genomes outside detected proviruses were identified for two-thirds of the 57 taxa (those with genomes available from at least 3 distinct viruses), with half of these present in all members of the respective taxon. These signatures can be used to specifically identify the presence and quantify the abundance of viruses from particular taxa in metagenomic samples and thus gain new insights into the ecology and evolution of viruses in relation to their hosts.
Proteomes of thermophilic prokaryotes have been instrumental in structural biology and successfully exploited in biotechnology; however, many proteins required for eukaryotic cell function are absent from bacteria or archaea. Genome sequences of three thermophilic eukaryotes have been published: Chaetomium thermophilum, Thielavia terrestris and Thielavia heterothallica.
Studying the genomes and proteomes of these thermophilic fungi, we found common strategies of thermal adaptation across the different kingdoms of life, including amino acid biases and a reduced genome size. A phylogenetics-guided comparison of thermophilic proteomes with those of other, mesophilic Sordariomycetes revealed consistent amino acid substitutions associated with thermophily that were also present in an independent lineage of thermophilic fungi. The most consistent pattern is the substitution of lysine by arginine, which we could find in almost all lineages but which has not been extensively used in protein stability engineering. By exploiting mutational paths towards the thermophiles, we could predict particular amino acid residues in individual proteins that contribute to thermostability and validated some of them experimentally. By determining the three-dimensional structure of an exemplar protein from C. thermophilum (Arx1), we could also characterise the molecular consequences of some of these mutations.
The comparative analysis of these three genomes not only enhances our understanding of the evolution of thermophily, but also provides new ways to engineer protein stability.
Thermophily; Comparative genomics; Protein engineering; Eukaryotes; Fungi
Complete knowledge of all direct and indirect interactions between proteins in a given cell would represent an important milestone towards a comprehensive description of cellular mechanisms and functions. Although this goal is still elusive, considerable progress has been made—particularly for certain model organisms and functional systems. Currently, protein interactions and associations are annotated at various levels of detail in online resources, ranging from raw data repositories to highly formalized pathway databases. For many applications, a global view of all the available interaction data is desirable, including lower-quality data and/or computational predictions. The STRING database (http://string-db.org/) aims to provide such a global perspective for as many organisms as feasible. Known and predicted associations are scored and integrated, resulting in comprehensive protein networks covering >1100 organisms. Here, we describe the update to version 9.1 of STRING, introducing several improvements: (i) we extend the automated mining of scientific texts for interaction information, to now also include full-text articles; (ii) we entirely re-designed the algorithm for transferring interactions from one model organism to the other; and (iii) we provide users with statistical information on any functional enrichment observed in their networks.
Post-translational modifications (PTMs) are involved in the regulation and structural stabilization of eukaryotic proteins. The combination of individual PTM states is key to modulating cellular functions, as has become evident in a few well-studied proteins. This combinatorial setting, dubbed the PTM code, has been proposed to extend to whole proteomes in eukaryotes. Although we are still far from deciphering such a complex language, thousands of protein PTM sites are being mapped by high-throughput technologies, thus providing sufficient data for comparative analysis. PTMcode (http://ptmcode.embl.de) aims to compile known and predicted PTM associations to provide a framework that would enable hypothesis-driven experimental or computational analysis at various scales. In its first release, PTMcode provides PTM functional associations of 13 different PTM types within proteins in 8 eukaryotes. They are based on five evidence channels: a literature survey, residue co-evolution, structural proximity, PTMs at the same residue and location within PTM highly enriched protein regions (hotspots). PTMcode is presented as a protein-based searchable database with an interactive web interface providing the context of the co-regulation of nearly 75 000 residues in >10 000 proteins.
Summary: Drug versus Disease (DvD) provides a pipeline, available through R
or Cytoscape, for the comparison of drug and disease gene expression profiles from public
microarray repositories. Negatively correlated profiles can be used to generate hypotheses
of drug-repurposing, whereas positively correlated profiles may be used to infer side
effects of drugs. DvD allows users to compare drug and disease signatures with dynamic
access to databases Array Express, Gene Expression Omnibus and data from the Connectivity
Availability and implementation: R package (submitted to Bioconductor) under
GPL 3 and Cytoscape plug-in freely available for download at www.ebi.ac.uk/saezrodriguez/DVD/.
Supplementary data are available at Bioinformatics
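The idea of comparing drug and disease expression profiles by correlation can be illustrated with a minimal sketch. This is not DvD's implementation or API: the profile values are invented toy numbers, the helper names `ranks` and `spearman` are hypothetical, and Spearman rank correlation is an assumed choice of similarity measure.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Invented log fold changes for the same five genes (toy data, not real profiles).
disease = [2.1, 1.4, -0.3, -1.8, 0.6]
drug_a = [-1.9, -1.1, 0.2, 1.5, -0.4]  # reverses the disease signature
drug_b = [1.7, 1.0, -0.1, -2.0, 0.9]   # mimics the disease signature

for name, profile in (("drug_a", drug_a), ("drug_b", drug_b)):
    rho = spearman(disease, profile)
    hint = "repurposing candidate" if rho < 0 else "possible side-effect link"
    print(f"{name}: rho = {rho:+.2f} ({hint})")
```

In this toy setting a strongly negative coefficient flags a drug whose profile opposes the disease profile, while a strongly positive one flags a drug that resembles it.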
MOCAT is a highly configurable, modular pipeline for fast, standardized processing of single or paired-end sequencing data generated by the Illumina platform. The pipeline uses state-of-the-art programs to quality control, map, and assemble reads from metagenomic samples sequenced at a depth of several billion base pairs, and predict protein-coding genes on assembled metagenomes. Mapping against reference databases allows for read extraction or removal, as well as abundance calculations. Relevant statistics for each processing step can be summarized into multi-sheet Excel documents and queryable SQL databases. MOCAT runs on UNIX machines and integrates seamlessly with the SGE and PBS queuing systems, commonly used to process large datasets. The open source code and modular architecture allow users to modify or exchange the programs that are utilized in the various processing steps. Individual processing steps and parameters were benchmarked and tested on artificial, real, and simulated metagenomes resulting in an improvement of selected quality metrics. MOCAT can be freely downloaded at http://www.bork.embl.de/mocat/.
This study is the first large-scale comparative analysis of multiple types of post-translational modifications in different eukaryotic species. The resulting network of co-evolving and functionally associated modifications reveals the global landscape of post-translational regulation.
In all, 115 149 non-redundant post-translational modifications (PTMs) of 13 different types were collected from 8 eukaryotes. Comparison of evolution speed reveals that carboxylation is the most conserved while SUMOylation is the fastest evolving PTM type. Co-evolution of PTM pairs that co-occur within proteins reveals a vastly interconnected global network of functionally associated PTM types in eukaryotes. Central to the network of functionally associated PTM types appear phosphorylation, acetylation, ubiquitination and O-linked glycosylation that control both temporal events and processes that govern protein localization.
Various post-translational modifications (PTMs) fine-tune the functions of almost all eukaryotic proteins, and co-regulation of different types of PTMs has been shown within and between a number of proteins. Aiming at a more global view of the interplay between PTM types, we collected modifications for 13 frequent PTM types in 8 eukaryotes, compared their speed of evolution and developed a method for measuring PTM co-evolution within proteins based on the co-occurrence of sites across eukaryotes. As many sites are still to be discovered, this is a considerable underestimate, yet, assuming that most co-evolving PTMs are functionally associated, we found that PTM types are vastly interconnected, forming a global network that comprises in human alone >50 000 residues in about 6000 proteins. We predict substantial PTM type interplay in secreted and membrane-associated proteins and in the context of particular protein domains and short linear motifs. The global network of co-evolving PTM types implies a complex and intertwined post-translational regulation landscape that is likely to regulate multiple functional states of many if not all eukaryotic proteins.
post-translational modifications; protein regulation; proteomics; PTM code; PTM crosstalk
Here we present a standard developed by the Genomic Standards Consortium (GSC) for reporting marker gene sequences—the minimum information about a marker gene sequence (MIMARKS). We also introduce a system for describing the environment from which a biological sample originates. The ‘environmental packages’ apply to any genome sequence of known origin and can be used in combination with MIMARKS and other GSC checklists. Finally, to establish a unified standard for describing sequence data and to provide a single point of entry for the scientific community to access and learn about GSC checklists, we present the minimum information about any (x) sequence (MIxS). Adoption of MIxS will enhance our ability to analyze natural genetic diversity documented by massive DNA sequencing efforts from myriad ecosystems in our ever-changing biosphere.
A new class of small RNA (~45 bases long) is identified in gram positive and negative bacteria. These tssRNAs are associated with RNA polymerase pausing some 45 bases downstream of the transcription start site and show global changes in expression during the growth cycle.
A new class of bacterial small RNAs has been identified. They are related to eukaryotic tiRNAs in their localization (transcription start sites, TSS) but not in their biogenesis. tssRNAs are generated at the same positions as long transcripts, as well as at independent positions, but both seem to have promoter-like characteristics (Pribnow box). We provide compelling evidence that tssRNAs are neither mRNA degradation products nor abortive transcripts; rather, they are newly synthesized transcripts and require more factors than the basal transcription machinery (i.e., RNA polymerase subunits). tssRNAs show dynamic behavior dependent on the growth phase. We show that RNA polymerase is halted at tssRNA positions, both in bona fide genes and in positions where no long transcript is produced. This indicates that tssRNAs could be generated by RNA polymerase pausing to ensure that no spurious long RNA is generated by random appearance of Pribnow sequences in the genome.
Here, we report the genome-wide identification of small RNAs associated with transcription start sites (TSSs), termed tssRNAs, in Mycoplasma pneumoniae. tssRNAs were also found to be present in a bacterium from a different phylum, Escherichia coli. Similar to the recently identified promoter-associated tiny RNAs (tiRNAs) in eukaryotes, tssRNAs are associated with active promoters. Evidence suggests that these tssRNAs are distinct from previously described abortive transcription RNAs. tssRNAs have an average size of 45 bases and map exactly to the beginning of cognate full-length transcripts and to cryptic TSSs. Expression of bacterial tssRNAs requires factors other than the standard RNA polymerase holoenzyme. We have found that the RNA polymerase is halted at tssRNA positions in vivo, which may indicate that a pausing mechanism exists to prevent transcription in the absence of genes. These results suggest that small RNAs associated with TSSs could be a universal feature of bacterial transcription.
non-coding RNAs; small RNAs; transcription; transcriptomics
Many characterized metabolic enzymes currently lack associated gene and protein sequences. Here, pathway and genomic neighbour data are used to assign genes to these ‘orphan enzymes’, and the predictions are validated with experimental assays and genome-scale metabolic modelling.
A computational method is developed for assigning candidate sequences to orphan enzymes. The method uses metabolic pathway, genomic neighbourhood, genomic co-occurrence, and protein domain information to predict genes that are likely to perform a particular enzymatic function. Benchmarking of the scoring scheme based on the 4 features above revealed that some combinations of parameters yielded greater than 70% accuracy, and that high-confidence predictions could be generated for 131 orphan enzymes. Enzyme assay experiments confirmed the predicted enzymatic activity for two of the high-confidence candidate sequences. Predicted functions can improve the annotation of genomic and metagenomic data, and can reveal putative genes for enzymes with potential biotechnological applications. Incorporating the predicted enzymatic reactions into genome-scale metabolic models changed the flux connectivity and improved their ability to correctly predict gene essentiality, supporting the biological relevance of these predictions.
Despite the current wealth of sequencing data, one-third of all biochemically characterized metabolic enzymes lack a corresponding gene or protein sequence, and as such can be considered orphan enzymes. They represent a major gap between our molecular and biochemical knowledge, and consequently are not amenable to modern systemic analyses. As 555 of these orphan enzymes have metabolic pathway neighbours, we developed a global framework that utilizes the pathway and (meta)genomic neighbour information to assign candidate sequences to orphan enzymes. For 131 orphan enzymes (37% of those for which (meta)genomic neighbours are available), we associate sequences to them using scoring parameters with an estimated accuracy of 70%, implying functional annotation of 16 345 gene sequences in numerous (meta)genomes. As a case in point, two of these candidate sequences were experimentally validated to encode the predicted activity. In addition, we augmented the currently available genome-scale metabolic models with these new sequence–function associations and were able to expand the models by on average 8%, with a considerable change in the flux connectivity patterns and improved essentiality prediction.
genomics; metabolic pathways; metagenomics; neighbourhood information; orphan enzymes
The genome of Mycobacterium tuberculosis (H37Rv) contains 4,019 protein coding genes, of which more than a thousand have been categorized as ‘hypothetical’, implying that for these not even weak functional associations could be identified so far. We here predict reliable functional indications for half of this large hypothetical orfeome: 497 genes can be annotated based on orthology, and another 125 can be linked to interacting proteins via integrated genomic context analysis and literature mining. The assignments include newly identified clusters of interacting proteins, hypothetical genes that are associated with well-known pathways and putative disease-relevant targets. Altogether, we have raised the fraction of the proteome with at least some functional annotation to 88%, which should considerably enhance the interpretation of large-scale experiments targeting this medically important organism.
The effect of kinase, phosphatase and N-acetyltransferase deletions on proteome phosphorylation and acetylation was investigated in Mycoplasma pneumoniae. Bi-directional cross-talk between post-translational modifications suggests an underlying regulatory molecular code in prokaryotes.
Post-translational modifications (PTMs) change the chemical properties of proteins, conferring diversity beyond the amino-acid sequence. Proteins are often modified on multiple sites. A PTM code has been proposed, whereby modifications at specific positions influence further modifications. These regulatory circuits, though, have rarely been studied on a large scale; conservation in prokaryotes remains elusive. Here, we studied two important PTMs, phosphorylation and lysine acetylation, in the small bacterium Mycoplasma pneumoniae. We combined genetics and quantitative mass spectrometry to measure the effect of systematic kinase, phosphatase and N-acetyltransferase deletions on proteome abundance, phosphorylation and lysine acetylation. The data set represents a comprehensive analysis of both phosphorylation and lysine acetylation in a single prokaryote. It reveals that (1) proteins often carry multiple modifications and multiple types of PTMs, reminiscent of the PTM code proposed in eukaryotes, (2) phosphorylation exerts pleiotropic effects on protein abundances and phosphorylation, but also on lysine acetylation, (3) the cross-talk between the two PTMs is bi-directional and (4) PTMs are frequently located at interaction interfaces and in multifunctional proteins, illustrating how PTMs could modulate protein functions by affecting the way they interact. The study provides an unbiased and quantitative view on cross-talk between phosphorylation and lysine acetylation. It suggests that these regulatory circuits are a fundamental principle of regulation that might have evolved before the divergence of prokaryotes and eukaryotes.
Protein post-translational modifications (PTMs) represent important regulatory states that when combined have been hypothesized to act as molecular codes and to generate a functional diversity beyond genome and transcriptome. We systematically investigate the interplay of protein phosphorylation with other post-transcriptional regulatory mechanisms in the genome-reduced bacterium Mycoplasma pneumoniae. Systematic perturbations by deletion of its only two protein kinases and its unique protein phosphatase identified not only the protein-specific effect on the phosphorylation network, but also a modulation of proteome abundance and lysine acetylation patterns, mostly in the absence of transcriptional changes. Reciprocally, deletion of the two putative N-acetyltransferases affects protein phosphorylation, confirming cross-talk between the two PTMs. The measured M. pneumoniae phosphoproteome and lysine acetylome revealed that both PTMs are very common, that (as in Eukaryotes) they often co-occur within the same protein and that they are frequently observed at interaction interfaces and in multifunctional proteins. The results imply previously unreported hidden layers of post-transcriptional regulation intertwining phosphorylation with lysine acetylation and other mechanisms that define the functional state of a cell.
kinase; N-acetyltransferase; network; phosphatase; post-translational modification
Due to the complexity of the protocols and a limited knowledge of the nature of microbial communities, simulating metagenomic sequences plays an important role in testing the performance of existing tools and data analysis methods with metagenomic data. We developed metagenomic read simulators with platform-specific (Sanger, pyrosequencing, Illumina) base-error models, and simulated metagenomes of differing community complexities. We first evaluated the effect of rigorous quality control on Illumina data. Although quality filtering removed a large proportion of the data, it greatly improved the accuracy and contig lengths of resulting assemblies. We then compared the quality-trimmed Illumina assemblies to those from Sanger and pyrosequencing. For the simple community (10 genomes) all sequencing technologies assembled a similar amount and accurately represented the expected functional composition. For the more complex community (100 genomes) Illumina produced the best assemblies and more correctly resembled the expected functional composition. For the most complex community (400 genomes) there was very little assembly of reads from any sequencing technology. However, due to the longer read length the Sanger reads still represented the overall functional composition reasonably well. We further examined the effect of scaffolding of contigs using paired-end Illumina reads. It dramatically increased contig lengths of the simple community and yielded minor improvements to the more complex communities. Although the increase in contig length was accompanied by increased chimericity, it resulted in more complete genes and a better characterization of the functional repertoire. The metagenomic simulators developed for this research are freely available.
Recently duplicated genes are believed to often overlap in function and expression. A priori, they are thus less likely to be essential. Although this was indeed observed in yeast, mouse singletons and duplicates were reported to be equally often essential. This contradiction can only partly be explained by experimental biases. We herein show that older genes (i.e., genes with earlier phyletic origin) are more likely to be essential, regardless of their duplication status. At a given phyletic gene age, duplicates are always less likely to be essential compared with singletons. The “paradoxical” high essentiality among mouse gene duplicates is then caused by different age profiles of singletons and duplicates, with the latter tending to be derived from older genes.
gene essentiality; yeast; mouse; phyletic age; linking genotype to phenotype
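The reasoning about gene age and duplication status is an instance of a stratification effect: within each age class duplicates can be less often essential than singletons, while pooling the age classes hides or reverses the difference. The sketch below illustrates this with invented toy counts; they are not data from the study, and the names `counts` and `fraction_essential` are hypothetical.

```python
# (age class, duplication status) -> (essential genes, total genes).
# Invented toy counts chosen only to illustrate the effect; duplicates are
# skewed towards old genes, as described in the text.
counts = {
    ("old", "singleton"): (50, 100),
    ("old", "duplicate"): (160, 400),
    ("young", "singleton"): (40, 400),
    ("young", "duplicate"): (5, 100),
}


def fraction_essential(status, age=None):
    """Fraction of essential genes for a duplication status, optionally within one age class."""
    essential = total = 0
    for (gene_age, gene_status), (n_essential, n_total) in counts.items():
        if gene_status == status and (age is None or gene_age == age):
            essential += n_essential
            total += n_total
    return essential / total


for age in ("old", "young", None):
    label = age if age is not None else "pooled"
    singles = fraction_essential("singleton", age)
    dups = fraction_essential("duplicate", age)
    print(f"{label:>6}: singletons {singles:.2f}, duplicates {dups:.2f}")
```

With these toy counts, duplicates are less often essential in both age classes, yet appear more often essential once the classes are pooled, because most duplicates fall in the old class where essentiality is high overall.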
Combinatorial therapy is a promising strategy for combating complex disorders due to improved efficacy and reduced side effects. However, screening new drug combinations exhaustively is impractical considering all possible combinations between drugs. Here, we present a novel computational approach to predict drug combinations by integrating molecular and pharmacological data. Specifically, drugs are represented by a set of their properties, such as their targets or indications. By integrating several of these features, we show that feature patterns enriched in approved drug combinations are not only predictive for new drug combinations but also provide insights into mechanisms underlying combinatorial therapy. Further analysis confirmed that among our top ranked predictions of effective combinations, 69% are supported by literature, while the others represent novel potential drug combinations. We believe that our proposed approach can help to limit the search space of drug combinations and provide a new way to effectively utilize existing drugs for new purposes.
The combination of distinct drugs in combinatorial therapy can help to improve therapeutic efficacy by overcoming the redundancy and robustness of pathogenic processes, or by lowering the risk of side effects. However, identification of effective drug combinations is cumbersome, considering the possible search space with respect to the large number of drugs that could potentially be combined. In this work, we explore various molecular and pharmacological features of drugs, and show that by utilizing combinations of such features it is possible to predict new drug combinations. Benchmarking the approach using approved drug combinations demonstrates that these feature combinations are indeed predictive and can propose promising new drug combinations. In addition, the enriched feature patterns provide insights into the mechanisms underlying drug combinations. For example, they suggest that if two drugs share targets or therapeutic effects, they can be independently combined with a third common drug. The ability to efficiently predict drug combinations should facilitate the development of more efficient drug therapies for a broader range of indications including hard-to-treat complex diseases.
The identification of single copy (1-to-1) orthologs in any group of organisms is important for functional classification and phylogenetic studies. The Metazoa are no exception, but only recently has there been a wide-enough distribution of taxa with sufficiently high quality sequenced genomes to gain confidence in the wide-spread single copy status of a gene.
Here, we present a phylogenetic approach for identifying overlooked single copy orthologs from multigene families and apply it to the Metazoa. Using 18 sequenced metazoan genomes of high quality we identified a robust set of 1,126 orthologous groups that have been retained in single copy since the last common ancestor of Metazoa. We found that the phylogenetic procedure increased the number of single copy orthologs found by over a third compared with standard taxon-count approaches. The orthologs represented a wide range of functional categories, expression profiles and levels of divergence.
To demonstrate the value of our set of single copy orthologs, we used them to assess the completeness of 24 currently published metazoan genomes and 62 EST datasets. We found that the annotated genes in published genomes vary in coverage from 79% (Ciona intestinalis) to 99.8% (human) with an average of 92%, suggesting a value for the underlying error rate in genome annotation, and a strategy for identifying single copy orthologs in larger datasets. In contrast, the vast majority of EST datasets with no corresponding genome sequence available are largely under-sampled and probably do not accurately represent the actual genomic complement of the organisms from which they are derived.
The correct identification of single copy (1-to-1) orthologs is crucial for functional classification of genes and for phylogenetic studies of groups of organisms, including the Metazoa. Nevertheless, despite the recent increase in the number of genomes and short sequence read datasets (e.g. ESTs) from the Metazoa, we know little about their completeness and how useful they may be for phylogenetic studies. Here we describe a novel approach for the identification of single copy gene families at any hierarchical level and demonstrate its effectiveness by identifying a set of over one thousand gene families that have been in single copy since the last common ancestor of the Metazoa. By comparing our orthologs to those predicted by other datasets we show that our procedure identifies a significantly larger set of single copy orthologs in the Metazoa. We then use this dataset to assess 24 metazoan genomes and 61 metazoan EST datasets for their completeness. We thus identify the underlying error rate in genome annotation and suggest a mechanism for assessing the quality of genomes and EST datasets in terms of their suitability for phylogenetic studies.
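Assessing completeness against a reference set of single copy orthologs amounts to asking what fraction of the reference groups is recovered in a genome annotation or EST dataset. A minimal sketch of that calculation follows; the group identifiers and the function name `completeness` are hypothetical, and only the reference set size of 1,126 comes from the text.

```python
REFERENCE_SET_SIZE = 1126  # number of single copy orthologous groups reported in the text


def completeness(found_groups, reference_groups):
    """Fraction of reference orthologous groups detected in a dataset."""
    reference = set(reference_groups)
    return len(reference & set(found_groups)) / len(reference)


# Hypothetical identifiers standing in for the real orthologous groups.
reference = [f"OG{i:04d}" for i in range(REFERENCE_SET_SIZE)]

# A made-up annotation that recovers 890 of the reference groups,
# plus one group that is not in the reference set and is therefore ignored.
annotation = reference[:890] + ["OG_unrelated"]

print(f"completeness: {completeness(annotation, reference):.1%}")
```

Groups outside the reference set do not count towards the score, so the measure reflects only how much of the expected single copy complement is present.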
InterPro (http://www.ebi.ac.uk/interpro/) is a database that integrates diverse information about protein families, domains and functional sites, and makes it freely available to the public via Web-based interfaces and services. Central to the database are diagnostic models, known as signatures, against which protein sequences can be searched to determine their potential function. InterPro has utility in the large-scale analysis of whole genomes and meta-genomes, as well as in characterizing individual protein sequences. Herein we give an overview of new developments in the database and its associated software since 2009, including updates to database content, curation processes and Web and programmatic interfaces.
Orthologous relationships form the basis of most comparative genomic and metagenomic studies and are essential for proper phylogenetic and functional analyses. The third version of the eggNOG database (http://eggnog.embl.de) contains non-supervised orthologous groups constructed from 1133 organisms, doubling the number of genes with orthology assignment compared to eggNOG v2. The new release is the result of a number of improvements and expansions: (i) the underlying homology searches are now based on the SIMAP database; (ii) the orthologous groups have been extended to 41 levels of selected taxonomic ranges enabling much more fine-grained orthology assignments; and (iii) the newly designed web page is considerably faster with more functionality. In total, eggNOG v3 contains 721 801 orthologous groups, encompassing a total of 4 396 591 genes. Additionally, we updated 4873 and 4850 original COGs and KOGs, respectively, to include all 1133 organisms. At the universal level, covering all three domains of life, 101 208 orthologous groups are available, while the others are applicable at 40 more limited taxonomic ranges. Each group is amended by multiple sequence alignments and maximum-likelihood trees and broad functional descriptions are provided for 450 904 orthologous groups (62.5%).
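The headline figures quoted for eggNOG v3 can be cross-checked with simple arithmetic. The snippet below uses only numbers stated in the paragraph above; the variable names are illustrative.

```python
total_groups = 721_801      # orthologous groups in eggNOG v3
described_groups = 450_904  # groups with a broad functional description
universal_groups = 101_208  # groups at the universal level

# Share of groups with a functional description, as a percentage.
described_share = 100 * described_groups / total_groups

# Groups that apply only to the 40 more limited taxonomic ranges.
level_specific_groups = total_groups - universal_groups

print(f"described: {described_share:.1f}%")
print(f"level-specific groups: {level_specific_groups}")
```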
OGEE is an Online GEne Essentiality database. Its main purpose is to enhance our understanding of the essentiality of genes. This is achieved by collecting not only experimentally tested essential and non-essential genes, but also associated gene features such as expression profiles, duplication status, conservation across species, evolutionary origins and involvement in embryonic development. We focus on large-scale experiments and complement our data with text-mining results. Genes are organized into data sets according to their sources. Genes with variable essentiality status across data sets are tagged as conditionally essential, highlighting the complex interplay between gene functions and environments. Linked tools allow the user to compare gene essentiality among different gene groups, or compare features of essential genes to non-essential genes, and visualize the results. OGEE is freely available at http://ogeedb.embl.de.
To facilitate the study of interactions between proteins and chemicals, we have created STITCH, an aggregated database of interactions connecting over 300 000 chemicals and 2.6 million proteins from 1133 organisms. Compared to the previous version, the number of chemicals with interactions and the number of high-confidence interactions both increase 4-fold. The database can be accessed interactively through a web interface, displaying interactions in an integrated network view. It is also available for computational studies through downloadable files and an API. As an extension in the current version, we offer the option to switch between two levels of detail, namely whether stereoisomers of a given compound are shown as a merged entity or as separate entities. Separate display of stereoisomers is necessary, for example, for carbohydrates and chiral drugs. Combining the isomers increases the coverage, as interaction databases and publications found through text mining will often refer to compounds without specifying the stereoisomer. The database is accessible at http://stitch.embl.de/.
SMART (Simple Modular Architecture Research Tool) is an online resource (http://smart.embl.de/) for the identification and annotation of protein domains and the analysis of protein domain architectures. SMART version 7 contains manually curated models for 1009 protein domains, 200 more than in the previous version. The current release introduces several novel features and a streamlined user interface resulting in a faster and more comfortable workflow. The underlying protein databases were greatly expanded, resulting in a 2-fold increase in number of annotated domains and features. The database of completely sequenced genomes now includes 1133 species, compared to 630 in the previous release. Domain architecture analysis results can now be exported and visualized through the iTOL phylogenetic tree viewer. ‘metaSMART’ was introduced as a novel subresource dedicated to the exploration and analysis of domain architectures in various metagenomics data sets. An advanced full text search engine was implemented, covering the complete annotations for SMART and Pfam domains, as well as the complete set of protein descriptions, allowing users to quickly find relevant information.
The structure, robustness, and dynamics of ocean plankton ecosystems remain poorly understood due to sampling, analysis, and computational limitations. The Tara Oceans consortium organizes expeditions to help fill this gap at the global level.
Single copy genes, universally distributed across the three domains of life and encoding mostly ancient parts of the translation machinery, are thought to be only rarely subjected to horizontal gene transfer (HGT). Indeed, HGT has been proposed to have occurred in only a few of these genes, implying a rare, probably not advantageous event in which an ortholog displaces the original gene and has to function in a foreign context (orthologous gene displacement, OGD). Here, we have utilised an automatic method to identify HGT based on a conservative statistical approach capable of robustly assigning both donors and acceptors. Applying it to 40 universally distributed single copy genes, we found that as many as 68 HGTs (implying OGDs) have occurred in these genes, a rate of 1.7 per family since the last universal common ancestor (LUCA). We examined a number of factors that have been claimed to be fundamental to HGT in general and tested their validity in the subset of universally distributed single copy genes. We found that differing functional constraints impact rates of OGD and that the more evolutionarily distant the donor and acceptor, the less likely an OGD is to occur. Furthermore, species with larger genomes are more likely to be subjected to OGD. Most importantly, regardless of the trends above, the number of OGDs increases linearly with time, indicating a neutral, constant rate. This suggests that levels of HGT above this rate may be indicative of positively selected transfers that may allow niche adaptation or bestow other benefits to the recipient organism.
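The reported rate follows directly from the two counts given in the abstract, as this short check shows. The function name `transfers_per_family` is illustrative only.

```python
def transfers_per_family(n_transfers, n_families):
    """Average number of detected transfers per gene family."""
    return n_transfers / n_families


# Counts stated in the text: 68 HGTs detected across 40 gene families.
rate = transfers_per_family(68, 40)
print(f"{rate:.1f} HGTs per family since LUCA")
```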