With the increasing availability of various ‘omics data, high-quality orthology assignment is crucial for evolutionary and functional genomics studies. We here present the fourth version of the eggNOG database (available at http://eggnog.embl.de) that derives nonsupervised orthologous groups (NOGs) from complete genomes, and then applies a comprehensive characterization and analysis pipeline to the resulting gene families. Compared with the previous version, we have more than tripled the underlying species set to cover 3686 organisms, keeping pace with genome project completions while prioritizing the inclusion of high-quality genomes to minimize error propagation from incomplete proteome sets. Major technological advances include (i) a robust and scalable procedure for the identification and inclusion of high-quality genomes, (ii) provision of orthologous groups for 107 different taxonomic levels compared with 41 in eggNOGv3, (iii) identification and annotation of particularly closely related orthologous groups, facilitating analysis of related gene families, (iv) improvements of the clustering and functional annotation approach, (v) adoption of a revised tree building procedure based on the multiple alignments generated during the process and (vi) implementation of quality control procedures throughout the entire pipeline. As in previous versions, eggNOGv4 provides multiple sequence alignments and maximum-likelihood trees, as well as broad functional annotation. Users can access the complete database of orthologous groups via a web interface, as well as through bulk download.
STITCH is a database of protein–chemical interactions that integrates many sources of experimental and manually curated evidence with text-mining information and interaction predictions. Available at http://stitch.embl.de, the resulting interaction network includes 390 000 chemicals and 3.6 million proteins from 1133 organisms. Compared with the previous version, the number of high-confidence protein–chemical interactions in human has increased by 45%, to 367 000. In this version, we added features for users to upload their own data to STITCH in the form of internal identifiers, chemical structures or quantitative data. For example, a user can now upload a spreadsheet with screening hits to easily check which interactions are already known. To increase the coverage of STITCH, we expanded the text mining to include full-text articles and added a prediction method based on chemical structures. We further changed our scheme for transferring interactions between species to rely on orthology rather than protein similarity. This improves the performance within protein families, where scores are now transferred only to orthologous proteins, but not to paralogous proteins. STITCH can be accessed with a web-interface, an API and downloadable files.
Nucleo-cytoplasmic large DNA viruses (NCLDVs) constitute a group of eukaryotic viruses that can have crucial ecological roles in the sea by accelerating the turnover of their unicellular hosts or by causing diseases in animals. To better characterize the diversity, abundance and biogeography of marine NCLDVs, we analyzed 17 metagenomes derived from microbial samples (0.2–1.6 μm size range) collected during the Tara Oceans Expedition. The sample set includes ecosystems under-represented in previous studies, such as the Arabian Sea oxygen minimum zone (OMZ) and Indian Ocean lagoons. By combining computationally derived relative abundance and direct prokaryote cell counts, the abundance of NCLDVs was found to be in the order of 10⁴–10⁵ genomes ml⁻¹ for the samples from the photic zone and 10²–10³ genomes ml⁻¹ for the OMZ. The Megaviridae and Phycodnaviridae dominated the NCLDV populations in the metagenomes, although most of the reads classified in these families showed large divergence from known viral genomes. Our taxon co-occurrence analysis revealed a potential association between viruses of the Megaviridae family and eukaryotes related to oomycetes. In support of this predicted association, we identified six cases of lateral gene transfer between Megaviridae and oomycetes. Our results suggest that marine NCLDVs probably outnumber eukaryotic organisms in the photic layer (per given water mass) and that metagenomic sequence analyses promise to shed new light on the biodiversity of marine viruses and their interactions with potential hosts.
eukaryotic viruses; marine NCLDVs; taxon co-occurrence; oomycetes
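The abundance estimate in the NCLDV abstract above combines a sequence-derived relative abundance with direct prokaryote cell counts. The arithmetic can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the exact formula, the function name is hypothetical, the input values are invented, and the sketch assumes one genome per prokaryote cell.

```python
def ncldv_genomes_per_ml(ncldv_to_prokaryote_ratio, prokaryote_cells_per_ml):
    """Scale a relative abundance to an absolute abundance.

    ncldv_to_prokaryote_ratio: NCLDV genomes per prokaryote genome, as
        estimated from metagenomic reads (hypothetical input).
    prokaryote_cells_per_ml: direct cell count for the same sample.
    Assumes one genome copy per prokaryote cell.
    """
    if ncldv_to_prokaryote_ratio < 0 or prokaryote_cells_per_ml < 0:
        raise ValueError("inputs must be non-negative")
    return ncldv_to_prokaryote_ratio * prokaryote_cells_per_ml


# Invented example values: 0.03 NCLDV genomes per prokaryote genome and
# 5e5 prokaryote cells per ml.
estimate = ncldv_genomes_per_ml(0.03, 5e5)
print(f"{estimate:.1e} genomes per ml")
```

With these invented inputs the estimate falls within the order of magnitude reported for the photic zone; real inputs would come from marker-gene read counts and cell counting.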
Our knowledge of the species and functional composition of the human gut microbiome is rapidly increasing, but it is still based on very few cohorts and little is known about their variation across the world. Combining 22 newly sequenced fecal metagenomes of individuals from 4 countries with previously published datasets, we identified three robust clusters (enterotypes hereafter) that are not nation or continent-specific. We also confirmed the enterotypes in two larger published cohorts, suggesting that intestinal microbiota variation is generally stratified, not continuous. This further indicates the existence of a limited number of well-balanced host-microbial symbiotic states that might respond differently to diet and drug intake. The enterotypes are mostly driven by species composition, but abundant molecular functions are not necessarily provided by abundant species, highlighting the importance of a functional analysis for a community understanding. While individual host properties such as body mass index, age, or gender cannot explain the observed enterotypes, data-driven marker genes or functional modules can be identified for each of these host properties. For example, twelve genes significantly correlate with age and three functional modules with the body mass index, hinting at a diagnostic potential of microbial markers.
While large-scale efforts have rapidly advanced the understanding and practical impact of human genomic variation, the latter is largely unexplored in the human microbiome. We therefore developed a framework for metagenomic variation analysis and applied it to 252 fecal metagenomes of 207 individuals from Europe and North America. Using 7.4 billion reads aligned to 101 reference species, we detected 10.3 million single nucleotide polymorphisms (SNPs), 107,991 short indels, and 1,051 structural variants. The average ratio of non-synonymous to synonymous polymorphism rates of 0.11 was more variable between gut microbial species than across human hosts. Subjects sampled at varying time intervals exhibited individuality and temporal stability of SNP variation patterns, despite considerable composition changes of their gut microbiota. This implies that individual-specific strains are not easily replaced and that an individual might have a unique metagenomic genotype, which may be exploitable for personalized diet or drug intake.
The human gastrointestinal tract (GI tract) harbors a complex community of microbes. The microbiota composition varies between different locations in the GI tract, but most studies focus on the fecal microbiota, and that inhabiting the colonic mucosa. Consequently, little is known about the microbiota at other parts of the GI tract, which is especially true for the small intestine because of its limited accessibility. Here we deduce an ecological model of the microbiota composition and function in the small intestine, using complementing culture-independent approaches. Phylogenetic microarray analyses demonstrated that microbiota compositions that are typically found in effluent samples from ileostomists (subjects without a colon) can also be encountered in the small intestine of healthy individuals. Phylogenetic mapping of small intestinal metagenome of three different ileostomy effluent samples from a single individual indicated that Streptococcus sp., Escherichia coli, Clostridium sp. and high G+C organisms are most abundant in the small intestine. The compositions of these populations fluctuated in time and correlated to the short-chain fatty acids profiles that were determined in parallel. Comparative functional analysis with fecal metagenomes identified functions that are overrepresented in the small intestine, including simple carbohydrate transport phosphotransferase systems (PTS), central metabolism and biotin production. Moreover, metatranscriptome analysis supported high level in-situ expression of PTS and carbohydrate metabolic genes, especially those belonging to Streptococcus sp. Overall, our findings suggest that rapid uptake and fermentation of available carbohydrates contribute to maintaining the microbiota in the human small intestine.
ecological model; function; microbiota; phylogeny; small intestine
Mendelian disorders are often caused by mutations in genes that are not lethal but induce functional distortions leading to diseases. Here we study the extent to which gene duplicates might compensate for genes causing monogenic diseases. We provide evidence for pervasive functional redundancy of human monogenic disease genes (MDs) by duplicates by showing that (1) genes involved in human genetic disorders are enriched in duplicates and (2) duplicated disease genes tend to have higher functional similarities with their closest paralogs than duplicated non-disease genes of similar age. We propose that functional compensation by duplication of genes masks the phenotypic effects of deleterious mutations and reduces the probability of purging the defective genes from the human population; this functional compensation could be further enhanced by stronger purifying selection between disease genes and their duplicates, as well as their orthologous counterparts, compared with non-disease genes. However, due to the intrinsic expression stochasticity among individuals, the deleterious mutations could still be present as genetic diseases in some subpopulations where the duplicate copies are expressed at low abundances. Consequently, the defective genes are linked to genetic disorders while they continue propagating within the population. Our results provide insight into the molecular basis underlying the spreading of duplicated disease genes.
Duplicated genes, as opposed to singletons, are genes that have additional copies in a genome due to evolutionary mechanisms such as whole genome duplication, homologous recombination or retrotransposition events. Duplicates can have similar functions and thus mask the phenotypic consequences when one copy is mutated. Conversely, the corresponding phenotypes would manifest themselves when mutations occur in singletons, since functional compensation is rare among non-duplicated genes. It would thus be expected that the primary source of monogenic diseases, diseases caused by mutations within a single gene, is singletons. However, the opposite was found to be true. Additionally, we found the functional similarity of duplicated disease genes to be greater than that of duplicated non-disease genes of an equivalent duplication age. So how could the stronger functional compensation among duplicates increase their likelihood to associate with diseases? We propose that due to functional compensation in duplicates, disease-causing mutations are less likely to be removed from a human population in large scale since the phenotypes are masked; however, the functional compensation could be lost in a subpopulation, perhaps due to intrinsic variation in gene expression, and therefore lead to diseases. As a result, the duplicated disease genes are linked to genetic diseases, yet they continue to spread within the human population.
Drug-induced transcriptional modules (biclusters) were identified and annotated in three human cell lines and rat liver. These were used to assess conservation across systems and to infer and experimentally validate novel drug effects and gene functions.
Biclustering of drug-induced gene expression profiles resulted in modules of drugs and genes, which were enriched in both drug and gene annotations. Identifying drug-induced transcriptional modules separately in three human cell lines and rat liver allows assessment of their conservation across model systems. About 70% of modules are conserved across cell lines, a lower bound of 15% was estimated for their conservation across organisms, and between the in vitro and in vivo systems. Drug-induced transcriptional modules can predict novel gene functions. A conserved module associated with (chole)sterol metabolism revealed novel regulators of cellular cholesterol homeostasis; 10 of them were validated in functional imaging assays. Analysis of drugs clustered into modules can give new insights into their mechanisms of action and provide leads for drug repositioning. We predicted and experimentally validated novel cell cycle inhibitors and modulators of PPARγ, estrogen and adrenergic receptors, with potential for developing new therapies against diabetes and cancer.
In pharmacology, it is crucial to understand the complex biological responses that drugs elicit in the human organism and how well they can be inferred from model organisms. We therefore identified a large set of drug-induced transcriptional modules from genome-wide microarray data of drug-treated human cell lines and rat liver, and first characterized their conservation. Over 70% of these modules were common for multiple cell lines and 15% were conserved between the human in vitro and the rat in vivo system. We then illustrate the utility of conserved and cell-type-specific drug-induced modules by predicting and experimentally validating (i) gene functions, e.g., 10 novel regulators of cellular cholesterol homeostasis and (ii) new mechanisms of action for existing drugs, thereby providing a starting point for drug repositioning, e.g., novel cell cycle inhibitors and new modulators of α-adrenergic receptor, peroxisome proliferator-activated receptor and estrogen receptor. Taken together, the identified modules reveal the conservation of transcriptional responses towards drugs across cell types and organisms, and improve our understanding of both the molecular basis of drug action and human biology.
cell line models in drug discovery; drug-induced transcriptional modules; drug repositioning; gene function prediction; transcriptome conservation across cell types and organisms
Protein–side effects associations are identified by integrating drug–target data with side effects information from drug labels. Benchmarking against the literature and validation with an in vivo mouse model shows that these pairs correspond to causal relations.
For more than half of the investigated side effects, we can predict causal proteins. Off-targets contribute slightly more to the explained side effects than main targets. With the current data, we are most successful in explaining the side effects of drugs that target G protein-coupled receptors. Activation of HTR7 causes hyperesthesia in mice, explaining a side effect of triptan drugs.
Side effect similarities of drugs have recently been employed to predict new drug targets, and networks of side effects and targets have been used to better understand the mechanism of action of drugs. Here, we report a large-scale analysis to systematically predict and characterize proteins that cause drug side effects. We integrated phenotypic data obtained during clinical trials with known drug–target relations to identify overrepresented protein–side effect combinations. Using independent data, we confirm that most of these overrepresentations point to proteins which, when perturbed, cause side effects. Of 1428 side effects studied, 732 were predicted to be predominantly caused by individual proteins, at least 137 of them backed by existing pharmacological or phenotypic data. We prove this concept in vivo by confirming our prediction that activation of the serotonin 7 receptor (HTR7) is responsible for hyperesthesia in mice, which, in turn, can be prevented by a drug that selectively inhibits HTR7. Taken together, we show that a large fraction of complex drug side effects are mediated by individual proteins and create a reference for such relations.
computational biology; drug targets; side effects
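The search for overrepresented protein–side effect combinations described in the abstract above can be illustrated with a one-sided hypergeometric test. This is a minimal sketch, not the authors' pipeline: the abstract does not say which statistic was used, and the function name and all counts below are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_drugs, n_with_target, n_with_effect, n_both):
    """One-sided P(X >= n_both) under the hypergeometric null model.

    n_drugs        total number of drugs considered
    n_with_target  drugs known to bind the protein
    n_with_effect  drugs whose label lists the side effect
    n_both         drugs that bind the protein and list the side effect
    """
    upper = min(n_with_target, n_with_effect)
    total = comb(n_drugs, n_with_effect)
    tail = sum(
        comb(n_with_target, k) * comb(n_drugs - n_with_target, n_with_effect - k)
        for k in range(n_both, upper + 1)
    )
    return tail / total


# Hypothetical counts: 500 drugs, 20 bind the protein, 30 list the side
# effect and 10 do both (expected by chance: 20 * 30 / 500 = 1.2).
p_value = hypergeom_enrichment_p(500, 20, 30, 10)
print(f"p = {p_value:.3g}")
```

In a screen over many protein–side effect pairs, such p-values would additionally need correction for multiple testing.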
The stoichiometry of the human nuclear pore complex is revealed by targeted mass spectrometry and super-resolution microscopy. The analysis reveals that the composition of the nuclear pore and other nuclear protein complexes is remodeled as a function of the cell type.
The human NPC has a previously unanticipated stoichiometry that varies across cell types. Primarily functional Nups are dynamic, while the NPC scaffold is static. Stoichiometries of many complexes are fine-tuned toward cell type-specific needs.
To understand the structure and function of large molecular machines, accurate knowledge of their stoichiometry is essential. In this study, we developed an integrated targeted proteomics and super-resolution microscopy approach to determine the absolute stoichiometry of the human nuclear pore complex (NPC), possibly the largest eukaryotic protein complex. We show that the human NPC has a previously unanticipated stoichiometry that varies across cancer cell types, tissues and in disease. Using large-scale proteomics, we provide evidence that more than one third of the known, well-defined nuclear protein complexes display a similar cell type-specific variation of their subunit stoichiometry. Our data point to compositional rearrangement as a widespread mechanism for adapting the functions of molecular machines toward cell type-specific constraints and context-dependent needs, and highlight the need for deeper investigation of such structural variants.
fluorophore counting; nucleoporin; protein complex-based analysis; super-resolution microscopy; targeted proteomics
Viruses are the most abundant biological entities on earth and encompass a vast amount of genetic diversity. The recent rapid increase in the number of sequenced viral genomes has created unprecedented opportunities for gaining new insight into the structure and evolution of the virosphere. Here, we present an update of the phage orthologous groups (POGs), a collection of 4,542 clusters of orthologous genes from bacteriophages that now also includes viruses infecting archaea and encompasses more than 1,000 distinct virus genomes. Analysis of this expanded data set shows that the number of POGs keeps growing without saturation and that a substantial majority of the POGs remain specific to viruses, lacking homologues in prokaryotic cells, outside known proviruses. Thus, the great majority of virus genes apparently remains to be discovered. A complementary observation is that numerous viral genomes remain poorly, if at all, covered by POGs. The genome coverage by POGs is expected to increase as more genomes are sequenced. Taxon-specific, single-copy signature genes that are not observed in prokaryotic genomes outside detected proviruses were identified for two-thirds of the 57 taxa (those with genomes available from at least 3 distinct viruses), with half of these present in all members of the respective taxon. These signatures can be used to specifically identify the presence and quantify the abundance of viruses from particular taxa in metagenomic samples and thus gain new insights into the ecology and evolution of viruses in relation to their hosts.
Proteomes of thermophilic prokaryotes have been instrumental in structural biology and successfully exploited in biotechnology; however, many proteins required for eukaryotic cell function are absent from bacteria or archaea. With Chaetomium thermophilum, Thielavia terrestris and Thielavia heterothallica, three genome sequences of thermophilic eukaryotes have been published.
Studying the genomes and proteomes of these thermophilic fungi, we found common strategies of thermal adaptation across the different kingdoms of Life, including amino acid biases and a reduced genome size. A phylogenetics-guided comparison of thermophilic proteomes with those of other, mesophilic Sordariomycetes revealed consistent amino acid substitutions associated with thermophily that were also present in an independent lineage of thermophilic fungi. The most consistent pattern is the substitution of lysine by arginine, which we found in almost all lineages but which has not been extensively used in protein stability engineering. By exploiting mutational paths towards the thermophiles, we could predict particular amino acid residues in individual proteins that contribute to thermostability and validated some of them experimentally. By determining the three-dimensional structure of an exemplar protein from C. thermophilum (Arx1), we could also characterise the molecular consequences of some of these mutations.
The comparative analysis of these three genomes not only enhances our understanding of the evolution of thermophily, but also provides new ways to engineer protein stability.
Thermophily; Comparative genomics; Protein engineering; Eukaryotes; Fungi
Complete knowledge of all direct and indirect interactions between proteins in a given cell would represent an important milestone towards a comprehensive description of cellular mechanisms and functions. Although this goal is still elusive, considerable progress has been made—particularly for certain model organisms and functional systems. Currently, protein interactions and associations are annotated at various levels of detail in online resources, ranging from raw data repositories to highly formalized pathway databases. For many applications, a global view of all the available interaction data is desirable, including lower-quality data and/or computational predictions. The STRING database (http://string-db.org/) aims to provide such a global perspective for as many organisms as feasible. Known and predicted associations are scored and integrated, resulting in comprehensive protein networks covering >1100 organisms. Here, we describe the update to version 9.1 of STRING, introducing several improvements: (i) we extend the automated mining of scientific texts for interaction information, to now also include full-text articles; (ii) we entirely re-designed the algorithm for transferring interactions from one model organism to the other; and (iii) we provide users with statistical information on any functional enrichment observed in their networks.
Post-translational modifications (PTMs) are involved in the regulation and structural stabilization of eukaryotic proteins. The combination of individual PTM states is key to modulating cellular functions, as has become evident in a few well-studied proteins. This combinatorial setting, dubbed the PTM code, has been proposed to extend to whole proteomes in eukaryotes. Although we are still far from deciphering such a complex language, thousands of protein PTM sites are being mapped by high-throughput technologies, thus providing sufficient data for comparative analysis. PTMcode (http://ptmcode.embl.de) aims to compile known and predicted PTM associations to provide a framework that would enable hypothesis-driven experimental or computational analysis at various scales. In its first release, PTMcode provides PTM functional associations of 13 different PTM types within proteins in 8 eukaryotes. They are based on five evidence channels: a literature survey, residue co-evolution, structural proximity, PTMs at the same residue and location within PTM highly enriched protein regions (hotspots). PTMcode is presented as a protein-based searchable database with an interactive web interface providing the context of the co-regulation of nearly 75 000 residues in >10 000 proteins.
Summary: Drug versus Disease (DvD) provides a pipeline, available through R or Cytoscape, for the comparison of drug and disease gene expression profiles from public microarray repositories. Negatively correlated profiles can be used to generate hypotheses of drug-repurposing, whereas positively correlated profiles may be used to infer side effects of drugs. DvD allows users to compare drug and disease signatures with dynamic access to databases Array Express, Gene Expression Omnibus and data from the Connectivity
Availability and implementation: R package (submitted to Bioconductor) under GPL 3 and Cytoscape plug-in freely available for download at www.ebi.ac.uk/saezrodriguez/DVD/.
Supplementary data are available at Bioinformatics
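The idea of reading the sign of the correlation between a drug profile and a disease profile can be sketched in a few lines. This is an illustration only and not DvD's actual scoring: it assumes a plain Pearson correlation, the function names and the 0.5 cut-off are hypothetical, and the expression values are invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("profiles must not be constant")
    return cov / sqrt(var_x * var_y)


def interpret(r, threshold=0.5):
    """Map a correlation to a hypothesis type (threshold is arbitrary)."""
    if r <= -threshold:
        return "repurposing hypothesis"
    if r >= threshold:
        return "possible side effect"
    return "no call"


# Invented log fold changes for the same five genes.
disease = [2.0, 1.5, -1.0, -2.0, 0.5]
drug_a = [-1.8, -1.2, 0.9, 2.1, -0.4]  # roughly reverses the disease profile
drug_b = [1.9, 1.4, -1.1, -1.7, 0.6]   # roughly mimics the disease profile

for name, profile in (("drug_a", drug_a), ("drug_b", drug_b)):
    r = pearson(disease, profile)
    print(name, round(r, 2), interpret(r))
```

Real signatures span thousands of genes, and rank-based statistics are a common alternative to a linear correlation.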
MOCAT is a highly configurable, modular pipeline for fast, standardized processing of single or paired-end sequencing data generated by the Illumina platform. The pipeline uses state-of-the-art programs to quality control, map, and assemble reads from metagenomic samples sequenced at a depth of several billion base pairs, and predict protein-coding genes on assembled metagenomes. Mapping against reference databases allows for read extraction or removal, as well as abundance calculations. Relevant statistics for each processing step can be summarized into multi-sheet Excel documents and queryable SQL databases. MOCAT runs on UNIX machines and integrates seamlessly with the SGE and PBS queuing systems, commonly used to process large datasets. The open source code and modular architecture allow users to modify or exchange the programs that are utilized in the various processing steps. Individual processing steps and parameters were benchmarked and tested on artificial, real, and simulated metagenomes resulting in an improvement of selected quality metrics. MOCAT can be freely downloaded at http://www.bork.embl.de/mocat/.
This study is the first large-scale comparative analysis of multiple types of post-translational modifications in different eukaryotic species. The resulting network of co-evolving and functionally associated modifications reveals the global landscape of post-translational regulation.
In all, 115 149 non-redundant post-translational modifications (PTMs) of 13 different types were collected from 8 eukaryotes. Comparison of evolution speed reveals that carboxylation is the most conserved while SUMOylation is the fastest evolving PTM type. Co-evolution of PTM pairs that co-occur within proteins reveals a vastly interconnected global network of functionally associated PTM types in eukaryotes. Central to the network of functionally associated PTM types appear phosphorylation, acetylation, ubiquitination and O-linked glycosylation that control both temporal events and processes that govern protein localization.
Various post-translational modifications (PTMs) fine-tune the functions of almost all eukaryotic proteins, and co-regulation of different types of PTMs has been shown within and between a number of proteins. Aiming at a more global view of the interplay between PTM types, we collected modifications for 13 frequent PTM types in 8 eukaryotes, compared their speed of evolution and developed a method for measuring PTM co-evolution within proteins based on the co-occurrence of sites across eukaryotes. Assuming that most co-evolving PTMs are functionally associated, we found that PTM types are vastly interconnected, forming a global network that comprises, in human alone, >50 000 residues in about 6000 proteins; as many sites are still to be discovered, this is a considerable underestimate. We predict substantial PTM type interplay in secreted and membrane-associated proteins and in the context of particular protein domains and short-linear motifs. The global network of co-evolving PTM types implies a complex and intertwined post-translational regulation landscape that is likely to regulate multiple functional states of many if not all eukaryotic proteins.
post-translational modifications; protein regulation; proteomics; PTM code; PTM crosstalk
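One simple way to picture a co-occurrence-based measure for two modified sites in the same protein is a Jaccard index over the species in which each site is observed. This is a deliberately simplified stand-in, not the method developed in the study above: the function name, the species labels and the site data are all hypothetical.

```python
def cooccurrence_score(species_a, species_b):
    """Jaccard index of the species sets in which two PTM sites occur."""
    a, b = set(species_a), set(species_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical presence patterns for three modified sites of one protein.
site_1 = {"human", "mouse", "fly", "yeast"}
site_2 = {"human", "mouse", "fly"}
site_3 = {"worm"}

print(cooccurrence_score(site_1, site_2))  # sites mostly seen together
print(cooccurrence_score(site_1, site_3))  # sites never seen together
```

A usable measure would also have to account for the phylogenetic relatedness of the species and for incomplete site discovery, both of which this sketch ignores.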
Here we present a standard developed by the Genomic Standards Consortium (GSC) for reporting marker gene sequences—the minimum information about a marker gene sequence (MIMARKS). We also introduce a system for describing the environment from which a biological sample originates. The ‘environmental packages’ apply to any genome sequence of known origin and can be used in combination with MIMARKS and other GSC checklists. Finally, to establish a unified standard for describing sequence data and to provide a single point of entry for the scientific community to access and learn about GSC checklists, we present the minimum information about any (x) sequence (MIxS). Adoption of MIxS will enhance our ability to analyze natural genetic diversity documented by massive DNA sequencing efforts from myriad ecosystems in our ever-changing biosphere.
A new class of small RNA (~45 bases long) is identified in Gram-positive and Gram-negative bacteria. These tssRNAs are associated with RNA polymerase pausing some 45 bases downstream of the transcription start site and show global changes in expression during the growth cycle.
A new class of bacterial small RNAs has been identified. They are related to eukaryotic tiRNAs in their localization (transcription start sites, TSS) but not in their biogenesis. tssRNAs are generated at the same positions as long transcripts, as well as at independent positions, but both seem to have promoter-like characteristics (Pribnow box). We provide compelling evidence that tssRNAs are neither mRNA degradation products nor abortive transcripts; rather, they are newly synthesized transcripts and require more factors than the basal transcription machinery (i.e., RNA polymerase subunits). tssRNAs show dynamic behavior dependent on the growth phase. We show that RNA polymerase is halted at tssRNA positions, both in bona fide genes and in positions where no long transcript is produced. This indicates that tssRNAs could be generated by RNA polymerase pausing to ensure that no spurious long RNA is generated by random appearance of Pribnow sequences in the genome.
Here, we report the genome-wide identification of small RNAs associated with transcription start sites (TSSs), termed tssRNAs, in Mycoplasma pneumoniae. tssRNAs were also found to be present in a bacterium from a different phylum, Escherichia coli. Similar to the recently identified promoter-associated tiny RNAs (tiRNAs) in eukaryotes, tssRNAs are associated with active promoters. Evidence suggests that these tssRNAs are distinct from previously described abortive transcription RNAs. tssRNAs have an average size of 45 bases and map exactly to the beginning of cognate full-length transcripts and to cryptic TSSs. Expression of bacterial tssRNAs requires factors other than the standard RNA polymerase holoenzyme. We have found that the RNA polymerase is halted at tssRNA positions in vivo, which may indicate that a pausing mechanism exists to prevent transcription in the absence of genes. These results suggest that small RNAs associated with TSSs could be a universal feature of bacterial transcription.
non-coding RNAs; small RNAs; transcription; transcriptomics
Many characterized metabolic enzymes currently lack associated gene and protein sequences. Here, pathway and genomic neighbour data are used to assign genes to these ‘orphan enzymes’, and the predictions are validated with experimental assays and genome-scale metabolic modelling.
A computational method is developed for assigning candidate sequences to orphan enzymes. The method uses metabolic pathway, genomic neighbourhood, genomic co-occurrence, and protein domain information to predict genes that are likely to perform a particular enzymatic function. Benchmarking of the scoring scheme based on the 4 features above revealed that some combinations of parameters yielded greater than 70% accuracy, and that high-confidence predictions could be generated for 131 orphan enzymes. Enzyme assay experiments confirmed the predicted enzymatic activity for two of the high-confidence candidate sequences. Predicted functions can improve the annotation of genomic and metagenomic data, and can reveal putative genes for enzymes with potential biotechnological applications. Incorporating the predicted enzymatic reactions into genome-scale metabolic models changed the flux connectivity and improved their ability to correctly predict gene essentiality, supporting the biological relevance of these predictions.
Despite the current wealth of sequencing data, one-third of all biochemically characterized metabolic enzymes lack a corresponding gene or protein sequence, and as such can be considered orphan enzymes. They represent a major gap between our molecular and biochemical knowledge, and consequently are not amenable to modern systemic analyses. As 555 of these orphan enzymes have metabolic pathway neighbours, we developed a global framework that utilizes the pathway and (meta)genomic neighbour information to assign candidate sequences to orphan enzymes. For 131 orphan enzymes (37% of those for which (meta)genomic neighbours are available), we associate sequences to them using scoring parameters with an estimated accuracy of 70%, implying functional annotation of 16 345 gene sequences in numerous (meta)genomes. As a case in point, two of these candidate sequences were experimentally validated to encode the predicted activity. In addition, we augmented the currently available genome-scale metabolic models with these new sequence–function associations and were able to expand the models by on average 8%, with a considerable change in the flux connectivity patterns and improved essentiality prediction.
genomics; metabolic pathways; metagenomics; neighbourhood information; orphan enzymes
The genome of Mycobacterium tuberculosis (H37Rv) contains 4,019 protein coding genes, of which more than a thousand have been categorized as ‘hypothetical’, implying that for these not even weak functional associations could be identified so far. We here predict reliable functional indications for half of this large hypothetical orfeome: 497 genes can be annotated based on orthology, and another 125 can be linked to interacting proteins via integrated genomic context analysis and literature mining. The assignments include newly identified clusters of interacting proteins, hypothetical genes that are associated with well-known pathways and putative disease-relevant targets. Altogether, we have raised the fraction of the proteome with at least some functional annotation to 88%, which should considerably enhance the interpretation of large-scale experiments targeting this medically important organism.
The effect of kinase, phosphatase and N-acetyltransferase deletions on proteome phosphorylation and acetylation was investigated in Mycoplasma pneumoniae. Bi-directional cross-talk between post-translational modifications suggests an underlying regulatory molecular code in prokaryotes.
Post-translational modifications (PTMs) change the chemical properties of proteins, conferring diversity beyond the amino-acid sequence. Proteins are often modified on multiple sites. A PTM code has been proposed, whereby modifications at specific positions influence further modifications. These regulatory circuits, though, have rarely been studied on a large scale; their conservation in prokaryotes remains elusive. Here, we studied two important PTMs, phosphorylation and lysine acetylation, in the small bacterium Mycoplasma pneumoniae. We combined genetics and quantitative mass spectrometry to measure the effect of systematic kinase, phosphatase and N-acetyltransferase deletions on proteome abundance, phosphorylation and lysine acetylation. The data set represents a comprehensive analysis of both phosphorylation and lysine acetylation in a single prokaryote. It reveals that (1) proteins often carry multiple modifications and multiple types of PTMs, reminiscent of the PTM code proposed in eukaryotes, (2) phosphorylation exerts pleiotropic effects on protein abundance and phosphorylation, but also on lysine acetylation, (3) the cross-talk between the two PTMs is bi-directional and (4) PTMs are frequently located at interaction interfaces and in multifunctional proteins, illustrating how PTMs could modulate protein functions by affecting the way they interact. The study provides an unbiased and quantitative view on cross-talk between phosphorylation and lysine acetylation. It suggests that these regulatory circuits are a fundamental principle of regulation that might have evolved before the divergence of prokaryotes and eukaryotes.
Protein post-translational modifications (PTMs) represent important regulatory states that when combined have been hypothesized to act as molecular codes and to generate a functional diversity beyond genome and transcriptome. We systematically investigate the interplay of protein phosphorylation with other post-transcriptional regulatory mechanisms in the genome-reduced bacterium Mycoplasma pneumoniae. Systematic perturbations by deletion of its only two protein kinases and its unique protein phosphatase identified not only the protein-specific effect on the phosphorylation network, but also a modulation of proteome abundance and lysine acetylation patterns, mostly in the absence of transcriptional changes. Reciprocally, deletion of the two putative N-acetyltransferases affects protein phosphorylation, confirming cross-talk between the two PTMs. The measured M. pneumoniae phosphoproteome and lysine acetylome revealed that both PTMs are very common, that (as in Eukaryotes) they often co-occur within the same protein and that they are frequently observed at interaction interfaces and in multifunctional proteins. The results imply previously unreported hidden layers of post-transcriptional regulation intertwining phosphorylation with lysine acetylation and other mechanisms that define the functional state of a cell.
kinase; N-acetyltransferase; network; phosphatase; post-translational modification
Due to the complexity of the protocols and a limited knowledge of the nature of microbial communities, simulating metagenomic sequences plays an important role in testing the performance of existing tools and data analysis methods with metagenomic data. We developed metagenomic read simulators with platform-specific (Sanger, pyrosequencing, Illumina) base-error models, and simulated metagenomes of differing community complexities. We first evaluated the effect of rigorous quality control on Illumina data. Although quality filtering removed a large proportion of the data, it greatly improved the accuracy and contig lengths of resulting assemblies. We then compared the quality-trimmed Illumina assemblies to those from Sanger and pyrosequencing. For the simple community (10 genomes) all sequencing technologies assembled a similar amount and accurately represented the expected functional composition. For the more complex community (100 genomes) Illumina produced the best assemblies and more correctly resembled the expected functional composition. For the most complex community (400 genomes) there was very little assembly of reads from any sequencing technology. However, due to the longer read length the Sanger reads still represented the overall functional composition reasonably well. We further examined the effect of scaffolding of contigs using paired-end Illumina reads. It dramatically increased contig lengths of the simple community and yielded minor improvements to the more complex communities. Although the increase in contig length was accompanied by increased chimericity, it resulted in more complete genes and a better characterization of the functional repertoire. The metagenomic simulators developed for this research are freely available.
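The idea of a read simulator with a base-error model can be illustrated with a minimal sketch. It is not the simulator developed for the study: the function names, the linear position-dependent substitution model and its default rates are all assumptions made for illustration, and real platform-specific models (Sanger, pyrosequencing, Illumina) would also need to cover insertions, deletions and quality scores.

```python
import random

BASES = "ACGT"

def position_error_rate(pos, read_len, start_rate=0.001, end_rate=0.02):
    """Hypothetical error model: substitution rate rises linearly along the read."""
    if read_len == 1:
        return start_rate
    return start_rate + (end_rate - start_rate) * pos / (read_len - 1)

def simulate_reads(community, n_reads, read_len, seed=0):
    """Draw reads from genomes in proportion to abundance and add substitutions.

    community: dict mapping genome name -> (sequence, relative abundance).
    Returns a list of (genome name, start position, read sequence) tuples.
    """
    rng = random.Random(seed)  # seeded so that a simulation can be repeated
    names = list(community)
    weights = [community[name][1] for name in names]
    reads = []
    for _ in range(n_reads):
        # Pick the source genome according to its relative abundance.
        name = rng.choices(names, weights=weights, k=1)[0]
        genome = community[name][0]
        start = rng.randrange(len(genome) - read_len + 1)
        read = list(genome[start:start + read_len])
        # Apply substitution errors position by position.
        for pos, base in enumerate(read):
            if rng.random() < position_error_rate(pos, read_len):
                read[pos] = rng.choice([b for b in BASES if b != base])
        reads.append((name, start, "".join(read)))
    return reads
```

Because each read keeps its source genome and start position, assemblies built from such reads can be checked against the known truth, which is what makes simulated metagenomes useful for benchmarking.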
Recently duplicated genes are believed to often overlap in function and expression. A priori, they are thus less likely to be essential. Although this was indeed observed in yeast, mouse singletons and duplicates were reported to be equally often essential. This contradiction can only partly be explained by experimental biases. We herein show that older genes (i.e., genes with earlier phyletic origin) are more likely to be essential, regardless of their duplication status. At a given phyletic gene age, duplicates are always less likely to be essential compared with singletons. The “paradoxical” high essentiality among mouse gene duplicates is then caused by different age profiles of singletons and duplicates, with the latter tending to be derived from older genes.
gene essentiality; yeast; mouse; phyletic age; linking genotype to phenotype
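The age-profile argument for duplicated genes can be made concrete with a small worked example. The counts below are hypothetical, chosen only so that the arithmetic is easy to follow; they are not data from the study, and the two age classes and the function name are likewise assumptions.

```python
# Hypothetical counts, for illustration only (not data from the study).
# (age class, duplication status) -> (essential genes, total genes)
counts = {
    ("old", "singleton"): (60, 100),
    ("old", "duplicate"): (225, 450),
    ("young", "singleton"): (80, 400),
    ("young", "duplicate"): (55, 550),
}

def essential_fraction(status, age=None):
    """Fraction of essential genes for a duplication status.

    With age=None the age classes are pooled; otherwise only the given
    age class is considered.
    """
    keys = [k for k in counts if k[1] == status and (age is None or k[0] == age)]
    essential = sum(counts[k][0] for k in keys)
    total = sum(counts[k][1] for k in keys)
    return essential / total

for age in (None, "old", "young"):
    label = "pooled" if age is None else age
    print(label,
          "singletons:", essential_fraction("singleton", age),
          "duplicates:", essential_fraction("duplicate", age))
```

In this toy data set, singletons are more often essential than duplicates within each age class, yet the pooled fractions are identical because the duplicates are drawn more heavily from the old class, where essentiality is high overall.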