Bacteriophages have key roles in microbial communities, to a large extent shaping the taxonomic and functional composition of the microbiome, but data on the connections between phage diversity and the composition of communities are scarce. Using taxon-specific marker genes, we identified and monitored 20 viral taxa in 252 human gut metagenomic samples, mostly at the level of genera. On average, five phage taxa were identified in each sample, with up to three of these being highly abundant. The abundances of most phage taxa vary by up to four orders of magnitude between the samples, and several taxa that are highly abundant in some samples are absent in others. Significant correlations exist between the abundances of some phage taxa and human host metadata: for example, ‘Group 936 lactococcal phages' are more prevalent and abundant in Danish samples than in samples from Spain or the United States of America. Quantification of phages that exist as integrated prophages revealed that the abundance profiles of prophages are highly individual-specific and remain unique to an individual over a 1-year time period, and prediction of prophage lysis across the samples identified hundreds of prophages that are apparently active in the gut and vary across the samples, in terms of presence and lytic state. Finally, a prophage–host network of the human gut was established and includes numerous novel host–phage associations.
human gut; metagenomics; phage
Gene content differences in human gut microbes can lead to inter-individual phenotypic variations such as digestive capacity. It is unclear whether gene content variation is caused by differences in microbial species composition or by the presence of different strains of the same species; the extent of gene content variation in the latter is unknown. Unlike pan-genome studies of cultivable strains, the use of metagenomic data can provide an unbiased view of structural variation of gut bacterial strains by measuring them in their natural habitats, the gut of each individual in this case, representing native boundaries between gut bacterial populations. We analyzed publicly available metagenomic data from fecal samples to characterize inter-individual variation in gut bacterial species.
A comparison of 11 abundant gut bacterial species showed that the gene content of strains from the same species differed, on average, by 13% between individuals. This number is based on gene deletions only and represents a lower limit, yet the variation is already in a similar range as observed between completely sequenced strains of cultivable species. We show that accessory genes that differ considerably between individuals can encode important functions, such as polysaccharide utilization and capsular polysaccharide synthesis loci.
Metagenomics can yield insights into gene content variation of strains in complex communities, which cannot be predicted by phylogenetic marker genes alone. The large degree of inter-individual variability in gene content implies that strain resolution must be considered in order to fully assess the functional potential of an individual's human gut microbiome.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-015-0646-9) contains supplementary material, which is available to authorized users.
Metagenomics has become a prominent approach for exploring the role of the gut microbiota in human health. However, the temporal variability of the healthy gut microbiome has not yet been studied in depth using metagenomics and little is known about the effects of different sampling and preservation approaches. We performed metagenomic analysis on fecal samples from seven subjects collected over a period of up to two years to investigate temporal variability and assess preservation-induced variation, specifically, fresh frozen compared to RNALater. We also monitored short-term disturbances caused by antibiotic treatment and bowel cleansing in one subject.
We find that the human gut microbiome is temporally stable and highly personalized at both taxonomic and functional levels. Over multiple time points, samples from the same subject clustered together, even in the context of a large dataset of 888 European and American fecal metagenomes. One exception was observed in an antibiotic intervention case where, more than one year after the treatment, samples did not resemble the pre-treatment state. Clustering was not affected by the preservation method. No species differed significantly in abundance, and only 0.36% of gene families were differentially abundant between preservation methods.
Technical variability is small compared to the temporal variability of an unperturbed gut microbiome, which in turn is much smaller than the observed between-subject variability. Thus, short-term preservation of fecal samples in RNALater is an appropriate and cost-effective alternative to freezing of fecal samples for metagenomic studies.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-015-0639-8) contains supplementary material, which is available to authorized users.
Histone deacetylases (HDACs) and acetyltransferases control the epigenetic regulation of gene expression through modification of histone marks. Histone deacetylase inhibitors (HDACi) are small molecules that interfere with histone tail modification, thus altering chromatin structure and epigenetically controlled pathways. They promote apoptosis in proliferating cells and are promising anticancer drugs. While some HDACi have already been approved for therapy and others are in different phases of clinical trials, the exact mechanism of action of this drug class remains elusive. Previous studies have shown that HDACis cause massive changes in chromatin structure but only moderate changes in gene expression. To what extent these changes manifest at the protein level has never been investigated on a proteome-wide scale. Here, we have studied HDACi-treated cells by large-scale mass spectrometry based proteomics. We show that HDACi treatment affects primarily the nuclear proteome and induces a selective decrease of bromodomain-containing proteins (BCPs), the main readers of acetylated histone marks. By combining time-resolved proteome and transcriptome profiling, we show that BCPs are affected at the protein level as early as 12 h after HDACi treatment and that their abundance is regulated by a combination of transcriptional and post-transcriptional mechanisms. Using gene silencing, we demonstrate that the decreased abundance of BCPs is sufficient to mediate important transcriptional changes induced by HDACi. Our data reveal a new aspect of the mechanism of action of HDACi that is mediated by an interplay between histone acetylation and the abundance of BCPs. Data are available via ProteomeXchange with identifier PXD001660 and NCBI Gene Expression Omnibus with identifier GSE64689.
Identifying all essential genomic components is critical for the assembly of minimal artificial life. In the genome-reduced bacterium Mycoplasma pneumoniae, we found that small ORFs (smORFs; < 100 residues), accounting for 10% of all ORFs, are the most frequently essential genomic components (53%), followed by conventional ORFs (49%). Essentiality of smORFs may be explained by their function as members of protein and/or DNA/RNA complexes. In larger proteins, essentiality applied to individual domains and not entire proteins, a notion we could confirm by expression of truncated domains. The fraction of essential non-coding RNAs (ncRNAs) non-overlapping with essential genes is 5% higher than of non-transcribed regions (0.9%), pointing to the important functions of the former. We found that the minimal essential genome is comprised of 33% (269,410 bp) of the M. pneumoniae genome. Our data highlight an unexpected hidden layer of smORFs with essential functions, as well as non-coding regions, thus changing the focus when aiming to define the minimal essential genome.
minimal genome; non-coding RNAs; small proteins
Several bacterial species have been implicated in the development of colorectal carcinoma (CRC),
but CRC-associated changes of fecal microbiota and their potential for cancer screening remain to be
explored. Here, we used metagenomic sequencing of fecal samples to identify taxonomic markers that
distinguished CRC patients from tumor-free controls in a study population of 156 participants.
Accuracy of metagenomic CRC detection was similar to the standard fecal occult blood test (FOBT) and
when both approaches were combined, sensitivity improved > 45% relative to the FOBT,
while maintaining its specificity. Accuracy of metagenomic CRC detection did not differ
significantly between early- and late-stage cancer and could be validated in independent patient and
control populations (N = 335) from different countries. CRC-associated
changes in the fecal microbiome at least partially reflected microbial community composition at the
tumor itself, indicating that observed gene pool differences may reveal tumor-related
host–microbe interactions. Indeed, we deduced a metabolic shift from fiber degradation in
controls to utilization of host carbohydrates and amino acids in CRC patients, accompanied by an
increase of lipopolysaccharide metabolism.
cancer screening; colorectal cancer; fecal biomarkers; human gut microbiome; metagenomics
The InterPro database (http://www.ebi.ac.uk/interpro/) is a freely available resource that can be used to classify sequences into protein families and to predict the presence of important domains and sites. Central to the InterPro database are predictive models, known as signatures, from a range of different protein family databases that have different biological focuses and use different methodological approaches to classify protein families and domains. InterPro integrates these signatures, capitalizing on the respective strengths of the individual databases, to produce a powerful protein classification resource. Here, we report on the status of InterPro as it enters its 15th year of operation, and give an overview of new developments with the database and its associated Web interfaces and software. In particular, the new domain architecture search tool is described and the process of mapping of Gene Ontology terms to InterPro is outlined. We also discuss the challenges faced by the resource given the explosive growth in sequence data in recent years. InterPro (version 48.0) contains 36 766 member database signatures integrated into 26 238 InterPro entries, an increase of over 3993 entries (5081 signatures), since 2012.
The thermophilic fungus Chaetomium thermophilum holds great promise for structural biology. To increase the efficiency of its biochemical and structural characterization and to explore its thermophilic properties beyond those of individual proteins, we obtained transcriptomics and proteomics data, and integrated them with computational annotation methods and a multitude of biochemical experiments conducted by the structural biology community. We considerably improved the genome annotation of Chaetomium thermophilum and characterized the transcripts and expression of thousands of genes. We furthermore show that the composition and structure of the expressed proteome of Chaetomium thermophilum is similar to its mesophilic relatives. Data were deposited in a publicly available repository and provide a rich source to the structural biology community.
Systematic interrogation of mutation or protein modification data is important to identify sites with functional consequences and to deduce global consequences from large data sets. Mechismo (mechismo.russellab.org) enables simultaneous consideration of thousands of 3D structures and biomolecular interactions to predict rapidly mechanistic consequences for mutations and modifications. As useful functional information often only comes from homologous proteins, we benchmarked the accuracy of predictions as a function of protein/structure sequence similarity, which permits the use of relatively weak sequence similarities with an appropriate confidence measure. For protein–protein, protein–nucleic acid and a subset of protein–chemical interactions, we also developed and benchmarked a measure of whether modifications are likely to enhance or diminish the interactions, which can assist the detection of modifications with specific effects. Analysis of high-throughput sequencing data shows that the approach can identify interesting differences between cancers, and application to proteomics data finds potential mechanistic insights for how post-translational modifications can alter biomolecular interactions.
Accurate orthology prediction is crucial for many applications in the post-genomic era. The lack of broadly accepted benchmark tests precludes a comprehensive analysis of orthology inference. So far, functional annotation between orthologs serves as a performance proxy. However, this violates the fundamental principle of orthology as an evolutionary definition, while it is often not applicable due to limited experimental evidence for most species. Therefore, we constructed high quality "gold standard" orthologous groups that can serve as a benchmark set for orthology inference in bacterial species. Herein, we used this dataset to demonstrate 1) why a manually curated, phylogeny-based dataset is more appropriate for benchmarking orthology than other popular practices and 2) how it guides database design and parameterization through careful error quantification. More specifically, we illustrate how function-based tests often fail to identify false assignments, misjudging the true performance of orthology inference methods. We also examined how our dataset can instruct the selection of a “core” species repertoire to improve detection accuracy. We conclude that including more genomes at the proper evolutionary distances can influence the overall quality of orthology detection. The curated gene families, called Reference Orthologous Groups, are publicly available at http://eggnog.embl.de/orthobench2.
The post-translational regulation of proteins is mainly driven by two molecular events, their modification by several types of moieties and their interaction with other proteins. These two processes are interdependent and together are responsible for the function of the protein in a particular cell state. Several databases focus on the prediction and compilation of protein–protein interactions (PPIs) and no less on the collection and analysis of protein post-translational modifications (PTMs), however, there are no resources that concentrate on describing the regulatory role of PTMs in PPIs. We developed several methods based on residue co-evolution and proximity to predict the functional associations of pairs of PTMs that we apply to modifications in the same protein and between two interacting proteins. In order to make data available for understudied organisms, PTMcode v2 (http://ptmcode.embl.de) includes a new strategy to propagate PTMs from validated modified sites through orthologous proteins. The second release of PTMcode covers 19 eukaryotic species from which we collected more than 300 000 experimentally verified PTMs (>1 300 000 propagated) of 69 types extracting the post-translational regulation of >100 000 proteins and >100 000 interactions. In total, we report 8 million associations of PTMs regulating single proteins and over 9.4 million interplays tuning PPIs.
The many functional partnerships and interactions that occur between proteins are at the core of cellular processing and their systematic characterization helps to provide context in molecular systems biology. However, known and predicted interactions are scattered over multiple resources, and the available data exhibit notable differences in terms of quality and completeness. The STRING database (http://string-db.org) aims to provide a critical assessment and integration of protein–protein interactions, including direct (physical) as well as indirect (functional) associations. The new version 10.0 of STRING covers more than 2000 organisms, which has necessitated novel, scalable algorithms for transferring interaction information between organisms. For this purpose, we have introduced hierarchical and self-consistent orthology annotations for all interacting proteins, grouping the proteins into families at various levels of phylogenetic resolution. Further improvements in version 10.0 include a completely redesigned prediction pipeline for inferring protein–protein associations from co-expression data, an API interface for the R computing environment and improved statistical analysis for enrichment tests in user-provided networks.
SMART (Simple Modular Architecture Research Tool) is a web resource (http://smart.embl.de/) providing simple identification and extensive annotation of protein domains and the exploration of protein domain architectures. In the current version, SMART contains manually curated models for more than 1200 protein domains, with ∼200 new models since our last update article. The underlying protein databases were synchronized with UniProt, Ensembl and STRING, bringing the total number of annotated domains and other protein features above 100 million. SMART's ‘Genomic’ mode, which annotates proteins from completely sequenced genomes was greatly expanded and now includes 2031 species, compared to 1133 in the previous release. SMART analysis results pages have been completely redesigned and include links to several new information sources. A new, vector-based display engine has been developed for protein schematics in SMART, which can also be exported as high-resolution bitmap images for easy inclusion into other documents. Taxonomic tree displays in SMART have been significantly improved, and can be easily navigated using the integrated search engine.
16S ribosomal DNA (rDNA) amplicon sequencing is frequently used to analyse the structure of bacterial communities from oceans to the human microbiota. However, computational power is still a major bottleneck in the analysis of continuously enlarging metagenomic data sets. Analysis is further complicated by the technical complexity of current bioinformatics tools.
Here we present the less operational taxonomic units scripts (LotuS), a fast and user-friendly open-source tool to calculate denoised, chimera-checked, operational taxonomic units (OTUs). These are the basis to generate taxonomic abundance tables and phylogenetic trees from multiplexed, next-generation sequencing data (454, illumina MiSeq and HiSeq). LotuS is outstanding in its execution speed, as it can process 16S rDNA data up to two orders of magnitude faster than other existing pipelines. This is partly due to an included stand-alone fast simultaneous demultiplexer and quality filter C++ program, simple demultiplexer (sdm), which comes packaged with LotuS. Additionally, we sequenced two MiSeq runs with the intent to validate future pipelines by sequencing 40 technical replicates; these are made available in this work.
We show that LotuS analyses microbial 16S data with comparable or even better results than existing pipelines, requiring a fraction of the execution time and providing state-of-the-art denoising and phylogenetic reconstruction. LotuS is available through the following URL: http://psbweb05.psb.ugent.be/lotus.
OTU; 16S rDNA gene; Pipeline; Metagenomics; Demultiplexing
Protein post‐translational modifications (PTMs) allow the cell to regulate protein activity and play a crucial role in the response to changes in external conditions or internal states. Advances in mass spectrometry now enable proteome wide characterization of PTMs and have revealed a broad functional role for a range of different types of modifications. Here we review advances in the study of the evolution and function of PTMs that were spurred by these technological improvements. We provide an overview of studies focusing on the origin and evolution of regulatory enzymes as well as the evolutionary dynamics of modification sites. Finally, we discuss different mechanisms of altering protein activity via post‐translational regulation and progress made in the large‐scale functional characterization of PTM function.
acetylation; evolution; phosphorylation; post‐translational modifications; PTM cross‐talk
With the increasing availability of various ‘omics data, high-quality orthology assignment is crucial for evolutionary and functional genomics studies. We here present the fourth version of the eggNOG database (available at http://eggnog.embl.de) that derives nonsupervised orthologous groups (NOGs) from complete genomes, and then applies a comprehensive characterization and analysis pipeline to the resulting gene families. Compared with the previous version, we have more than tripled the underlying species set to cover 3686 organisms, keeping track with genome project completions while prioritizing the inclusion of high-quality genomes to minimize error propagation from incomplete proteome sets. Major technological advances include (i) a robust and scalable procedure for the identification and inclusion of high-quality genomes, (ii) provision of orthologous groups for 107 different taxonomic levels compared with 41 in eggNOGv3, (iii) identification and annotation of particularly closely related orthologous groups, facilitating analysis of related gene families, (iv) improvements of the clustering and functional annotation approach, (v) adoption of a revised tree building procedure based on the multiple alignments generated during the process and (vi) implementation of quality control procedures throughout the entire pipeline. As in previous versions, eggNOGv4 provides multiple sequence alignments and maximum-likelihood trees, as well as broad functional annotation. Users can access the complete database of orthologous groups via a web interface, as well as through bulk download.
STITCH is a database of protein–chemical interactions that integrates many sources of experimental and manually curated evidence with text-mining information and interaction predictions. Available at http://stitch.embl.de, the resulting interaction network includes 390 000 chemicals and 3.6 million proteins from 1133 organisms. Compared with the previous version, the number of high-confidence protein–chemical interactions in human has increased by 45%, to 367 000. In this version, we added features for users to upload their own data to STITCH in the form of internal identifiers, chemical structures or quantitative data. For example, a user can now upload a spreadsheet with screening hits to easily check which interactions are already known. To increase the coverage of STITCH, we expanded the text mining to include full-text articles and added a prediction method based on chemical structures. We further changed our scheme for transferring interactions between species to rely on orthology rather than protein similarity. This improves the performance within protein families, where scores are now transferred only to orthologous proteins, but not to paralogous proteins. STITCH can be accessed with a web-interface, an API and downloadable files.
To understand the impact of gut microbes on human health and well-being it is crucial to assess their genetic potential. Here we describe the Illumina-based metagenomic sequencing, assembly and characterization of 3.3 million nonredundant microbial genes, derived from 576.7 Gb sequence, from faecal samples of 124 European individuals. The gene set, ~150 times larger than the human gene complement, contains an overwhelming majority of the prevalent microbial genes of the cohort and likely includes a large proportion of the prevalent human intestinal microbial genes. The genes are largely shared among individuals of the cohort. Over 99% of the genes are bacterial, suggesting that the entire cohort harbours between 1000 and 1150 prevalent bacterial species and each individual at least 160 such species, which are also largely shared. We define and describe the minimal gut metagenome and the minimal gut bacterial genome in terms of functions encoded by the gene set.
Nucleo-cytoplasmic large DNA viruses (NCLDVs) constitute a group of eukaryotic viruses that can have crucial ecological roles in the sea by accelerating the turnover of their unicellular hosts or by causing diseases in animals. To better characterize the diversity, abundance and biogeography of marine NCLDVs, we analyzed 17 metagenomes derived from microbial samples (0.2–1.6 μm size range) collected during the Tara Oceans Expedition. The sample set includes ecosystems under-represented in previous studies, such as the Arabian Sea oxygen minimum zone (OMZ) and Indian Ocean lagoons. By combining computationally derived relative abundance and direct prokaryote cell counts, the abundance of NCLDVs was found to be in the order of 104–105 genomes ml−1 for the samples from the photic zone and 102–103 genomes ml−1 for the OMZ. The Megaviridae and Phycodnaviridae dominated the NCLDV populations in the metagenomes, although most of the reads classified in these families showed large divergence from known viral genomes. Our taxon co-occurrence analysis revealed a potential association between viruses of the Megaviridae family and eukaryotes related to oomycetes. In support of this predicted association, we identified six cases of lateral gene transfer between Megaviridae and oomycetes. Our results suggest that marine NCLDVs probably outnumber eukaryotic organisms in the photic layer (per given water mass) and that metagenomic sequence analyses promise to shed new light on the biodiversity of marine viruses and their interactions with potential hosts.
eukaryotic viruses; marine NCLDVs; taxon co-occurrence; oomycetes
Our knowledge on species and function composition of the human gut microbiome is rapidly increasing, but it is still based on very few cohorts and little is known about their variation across the world. Combining 22 newly sequenced fecal metagenomes of individuals from 4 countries with previously published datasets, we identified three robust clusters (enterotypes hereafter) that are not nation or continent-specific. We confirmed the enterotypes also in two published, larger cohorts suggesting that intestinal microbiota variation is generally stratified, not continuous. This further indicates the existence of a limited number of well-balanced host-microbial symbiotic states that might respond differently to diet and drug intake. The enterotypes are mostly driven by species composition, but abundant molecular functions are not necessarily provided by abundant species, highlighting the importance of a functional analysis for a community understanding. While individual host properties such as body mass index, age, or gender cannot explain the observed enterotypes, data-driven marker genes or functional modules can be identified for each of these host properties. For example, twelve genes significantly correlate with age and three functional modules with the body mass index, hinting at a diagnostic potential of microbial markers.
The interactions between proteins and nucleic acids have a fundamental function in many biological processes, including gene transcription, RNA homeostasis, protein translation and pathogen sensing for innate immunity. While our knowledge of the ensemble of proteins that bind individual mRNAs in mammalian cells has been greatly augmented by recent surveys, no systematic study on the non-sequence-specific engagement of native human proteins with various types of nucleic acids has been reported.
We designed an experimental approach to achieve broad coverage of the non-sequence-specific RNA and DNA binding space, including methylated cytosine, and tested for interaction potential with the human proteome. We used 25 rationally designed nucleic acid probes in an affinity purification mass spectrometry and bioinformatics workflow to identify proteins from whole cell extracts of three different human cell lines. The proteins were profiled for their binding preferences to the different general types of nucleic acids. The study identified 746 high-confidence direct binders, 139 of which were novel and 237 devoid of previous experimental evidence. We could assign specific affinities for sub-types of nucleic acid probes to 219 distinct proteins and individual domains. The evolutionarily conserved protein YB-1, previously associated with cancer and drug resistance, was shown to bind methylated cytosine preferentially, potentially conferring upon YB-1 an epigenetics-related function.
The dataset described here represents a rich resource of experimentally determined nucleic acid-binding proteins, and our methodology has great potential for further exploration of the interface between the protein and nucleic acid realms.