1.  Potential of fecal microbiota for early-stage detection of colorectal cancer 
Molecular Systems Biology  2014;10(11):766.
Several bacterial species have been implicated in the development of colorectal carcinoma (CRC), but CRC-associated changes of fecal microbiota and their potential for cancer screening remain to be explored. Here, we used metagenomic sequencing of fecal samples to identify taxonomic markers that distinguished CRC patients from tumor-free controls in a study population of 156 participants. Accuracy of metagenomic CRC detection was similar to the standard fecal occult blood test (FOBT) and when both approaches were combined, sensitivity improved > 45% relative to the FOBT, while maintaining its specificity. Accuracy of metagenomic CRC detection did not differ significantly between early- and late-stage cancer and could be validated in independent patient and control populations (N = 335) from different countries. CRC-associated changes in the fecal microbiome at least partially reflected microbial community composition at the tumor itself, indicating that observed gene pool differences may reveal tumor-related host–microbe interactions. Indeed, we deduced a metabolic shift from fiber degradation in controls to utilization of host carbohydrates and amino acids in CRC patients, accompanied by an increase of lipopolysaccharide metabolism.
doi:10.15252/msb.20145645
PMCID: PMC4299606  PMID: 25432777
cancer screening; colorectal cancer; fecal biomarkers; human gut microbiome; metagenomics
2.  An integrated approach for genome annotation of the eukaryotic thermophile Chaetomium thermophilum 
Nucleic Acids Research  2014;42(22):13525-13533.
The thermophilic fungus Chaetomium thermophilum holds great promise for structural biology. To increase the efficiency of its biochemical and structural characterization and to explore its thermophilic properties beyond those of individual proteins, we obtained transcriptomics and proteomics data, and integrated them with computational annotation methods and a multitude of biochemical experiments conducted by the structural biology community. We considerably improved the genome annotation of Chaetomium thermophilum and characterized the transcripts and expression of thousands of genes. We furthermore show that the composition and structure of the expressed proteome of Chaetomium thermophilum is similar to its mesophilic relatives. Data were deposited in a publicly available repository and provide a rich source to the structural biology community.
doi:10.1093/nar/gku1147
PMCID: PMC4267624  PMID: 25398899
3.  A Phylogeny-Based Benchmarking Test for Orthology Inference Reveals the Limitations of Function-Based Validation 
PLoS ONE  2014;9(11):e111122.
Accurate orthology prediction is crucial for many applications in the post-genomic era. The lack of broadly accepted benchmark tests precludes a comprehensive analysis of orthology inference. So far, functional annotation between orthologs serves as a performance proxy. However, this violates the fundamental principle of orthology as an evolutionary definition, while it is often not applicable due to limited experimental evidence for most species. Therefore, we constructed high quality "gold standard" orthologous groups that can serve as a benchmark set for orthology inference in bacterial species. Herein, we used this dataset to demonstrate 1) why a manually curated, phylogeny-based dataset is more appropriate for benchmarking orthology than other popular practices and 2) how it guides database design and parameterization through careful error quantification. More specifically, we illustrate how function-based tests often fail to identify false assignments, misjudging the true performance of orthology inference methods. We also examined how our dataset can instruct the selection of a “core” species repertoire to improve detection accuracy. We conclude that including more genomes at the proper evolutionary distances can influence the overall quality of orthology detection. The curated gene families, called Reference Orthologous Groups, are publicly available at http://eggnog.embl.de/orthobench2.
doi:10.1371/journal.pone.0111122
PMCID: PMC4219706  PMID: 25369365
5.  LotuS: an efficient and user-friendly OTU processing pipeline 
Microbiome  2014;2:30.
Background
16S ribosomal DNA (rDNA) amplicon sequencing is frequently used to analyse the structure of bacterial communities from oceans to the human microbiota. However, computational power is still a major bottleneck in the analysis of continuously enlarging metagenomic data sets. Analysis is further complicated by the technical complexity of current bioinformatics tools.
Results
Here we present the less operational taxonomic units scripts (LotuS), a fast and user-friendly open-source tool to calculate denoised, chimera-checked, operational taxonomic units (OTUs). These are the basis to generate taxonomic abundance tables and phylogenetic trees from multiplexed, next-generation sequencing data (454, Illumina MiSeq and HiSeq). LotuS is outstanding in its execution speed, as it can process 16S rDNA data up to two orders of magnitude faster than other existing pipelines. This is partly due to an included stand-alone fast simultaneous demultiplexer and quality filter C++ program, simple demultiplexer (sdm), which comes packaged with LotuS. Additionally, we sequenced two MiSeq runs with the intent to validate future pipelines by sequencing 40 technical replicates; these are made available in this work.
Conclusion
We show that LotuS analyses microbial 16S data with comparable or even better results than existing pipelines, requiring a fraction of the execution time and providing state-of-the-art denoising and phylogenetic reconstruction. LotuS is available through the following URL: http://psbweb05.psb.ugent.be/lotus.
doi:10.1186/2049-2618-2-30
PMCID: PMC4179863
OTU; 16S rDNA gene; Pipeline; Metagenomics; Demultiplexing
7.  Evolution and functional cross‐talk of protein post‐translational modifications 
Protein post‐translational modifications (PTMs) allow the cell to regulate protein activity and play a crucial role in the response to changes in external conditions or internal states. Advances in mass spectrometry now enable proteome wide characterization of PTMs and have revealed a broad functional role for a range of different types of modifications. Here we review advances in the study of the evolution and function of PTMs that were spurred by these technological improvements. We provide an overview of studies focusing on the origin and evolution of regulatory enzymes as well as the evolutionary dynamics of modification sites. Finally, we discuss different mechanisms of altering protein activity via post‐translational regulation and progress made in the large‐scale functional characterization of PTM function.
doi:10.1002/msb.201304521
PMCID: PMC4019982  PMID: 24366814
acetylation; evolution; phosphorylation; post‐translational modifications; PTM cross‐talk
8.  eggNOG v4.0: nested orthology inference across 3686 organisms 
Nucleic Acids Research  2013;42(Database issue):D231-D239.
With the increasing availability of various ‘omics data, high-quality orthology assignment is crucial for evolutionary and functional genomics studies. We here present the fourth version of the eggNOG database (available at http://eggnog.embl.de) that derives nonsupervised orthologous groups (NOGs) from complete genomes, and then applies a comprehensive characterization and analysis pipeline to the resulting gene families. Compared with the previous version, we have more than tripled the underlying species set to cover 3686 organisms, keeping track with genome project completions while prioritizing the inclusion of high-quality genomes to minimize error propagation from incomplete proteome sets. Major technological advances include (i) a robust and scalable procedure for the identification and inclusion of high-quality genomes, (ii) provision of orthologous groups for 107 different taxonomic levels compared with 41 in eggNOGv3, (iii) identification and annotation of particularly closely related orthologous groups, facilitating analysis of related gene families, (iv) improvements of the clustering and functional annotation approach, (v) adoption of a revised tree building procedure based on the multiple alignments generated during the process and (vi) implementation of quality control procedures throughout the entire pipeline. As in previous versions, eggNOGv4 provides multiple sequence alignments and maximum-likelihood trees, as well as broad functional annotation. Users can access the complete database of orthologous groups via a web interface, as well as through bulk download.
doi:10.1093/nar/gkt1253
PMCID: PMC3964997  PMID: 24297252
9.  STITCH 4: integration of protein–chemical interactions with user data 
Nucleic Acids Research  2013;42(Database issue):D401-D407.
STITCH is a database of protein–chemical interactions that integrates many sources of experimental and manually curated evidence with text-mining information and interaction predictions. Available at http://stitch.embl.de, the resulting interaction network includes 390 000 chemicals and 3.6 million proteins from 1133 organisms. Compared with the previous version, the number of high-confidence protein–chemical interactions in human has increased by 45%, to 367 000. In this version, we added features for users to upload their own data to STITCH in the form of internal identifiers, chemical structures or quantitative data. For example, a user can now upload a spreadsheet with screening hits to easily check which interactions are already known. To increase the coverage of STITCH, we expanded the text mining to include full-text articles and added a prediction method based on chemical structures. We further changed our scheme for transferring interactions between species to rely on orthology rather than protein similarity. This improves the performance within protein families, where scores are now transferred only to orthologous proteins, but not to paralogous proteins. STITCH can be accessed with a web-interface, an API and downloadable files.
doi:10.1093/nar/gkt1207
PMCID: PMC3964996  PMID: 24293645
10.  A human gut microbial gene catalog established by metagenomic sequencing 
Nature  2010;464(7285):59-65.
To understand the impact of gut microbes on human health and well-being it is crucial to assess their genetic potential. Here we describe the Illumina-based metagenomic sequencing, assembly and characterization of 3.3 million nonredundant microbial genes, derived from 576.7 Gb sequence, from faecal samples of 124 European individuals. The gene set, ~150 times larger than the human gene complement, contains an overwhelming majority of the prevalent microbial genes of the cohort and likely includes a large proportion of the prevalent human intestinal microbial genes. The genes are largely shared among individuals of the cohort. Over 99% of the genes are bacterial, suggesting that the entire cohort harbours between 1000 and 1150 prevalent bacterial species and each individual at least 160 such species, which are also largely shared. We define and describe the minimal gut metagenome and the minimal gut bacterial genome in terms of functions encoded by the gene set.
doi:10.1038/nature08821
PMCID: PMC3779803  PMID: 20203603
11.  Exploring nucleo-cytoplasmic large DNA viruses in Tara Oceans microbial metagenomes 
The ISME Journal  2013;7(9):1678-1695.
Nucleo-cytoplasmic large DNA viruses (NCLDVs) constitute a group of eukaryotic viruses that can have crucial ecological roles in the sea by accelerating the turnover of their unicellular hosts or by causing diseases in animals. To better characterize the diversity, abundance and biogeography of marine NCLDVs, we analyzed 17 metagenomes derived from microbial samples (0.2–1.6 μm size range) collected during the Tara Oceans Expedition. The sample set includes ecosystems under-represented in previous studies, such as the Arabian Sea oxygen minimum zone (OMZ) and Indian Ocean lagoons. By combining computationally derived relative abundance and direct prokaryote cell counts, the abundance of NCLDVs was found to be in the order of 10⁴–10⁵ genomes ml⁻¹ for the samples from the photic zone and 10²–10³ genomes ml⁻¹ for the OMZ. The Megaviridae and Phycodnaviridae dominated the NCLDV populations in the metagenomes, although most of the reads classified in these families showed large divergence from known viral genomes. Our taxon co-occurrence analysis revealed a potential association between viruses of the Megaviridae family and eukaryotes related to oomycetes. In support of this predicted association, we identified six cases of lateral gene transfer between Megaviridae and oomycetes. Our results suggest that marine NCLDVs probably outnumber eukaryotic organisms in the photic layer (per given water mass) and that metagenomic sequence analyses promise to shed new light on the biodiversity of marine viruses and their interactions with potential hosts.
doi:10.1038/ismej.2013.59
PMCID: PMC3749498  PMID: 23575371
eukaryotic viruses; marine NCLDVs; taxon co-occurrence; oomycetes
12.  Enterotypes of the human gut microbiome 
Nature  2011;473(7346):174-180.
Our knowledge on species and function composition of the human gut microbiome is rapidly increasing, but it is still based on very few cohorts and little is known about their variation across the world. Combining 22 newly sequenced fecal metagenomes of individuals from 4 countries with previously published datasets, we identified three robust clusters (enterotypes hereafter) that are not nation or continent-specific. We confirmed the enterotypes also in two published, larger cohorts suggesting that intestinal microbiota variation is generally stratified, not continuous. This further indicates the existence of a limited number of well-balanced host-microbial symbiotic states that might respond differently to diet and drug intake. The enterotypes are mostly driven by species composition, but abundant molecular functions are not necessarily provided by abundant species, highlighting the importance of a functional analysis for a community understanding. While individual host properties such as body mass index, age, or gender cannot explain the observed enterotypes, data-driven marker genes or functional modules can be identified for each of these host properties. For example, twelve genes significantly correlate with age and three functional modules with the body mass index, hinting at a diagnostic potential of microbial markers.
doi:10.1038/nature09944
PMCID: PMC3728647  PMID: 21508958
13.  Experimental characterization of the human non-sequence-specific nucleic acid interactome 
Genome Biology  2013;14(7):R81.
Background
The interactions between proteins and nucleic acids have a fundamental function in many biological processes, including gene transcription, RNA homeostasis, protein translation and pathogen sensing for innate immunity. While our knowledge of the ensemble of proteins that bind individual mRNAs in mammalian cells has been greatly augmented by recent surveys, no systematic study on the non-sequence-specific engagement of native human proteins with various types of nucleic acids has been reported.
Results
We designed an experimental approach to achieve broad coverage of the non-sequence-specific RNA and DNA binding space, including methylated cytosine, and tested for interaction potential with the human proteome. We used 25 rationally designed nucleic acid probes in an affinity purification mass spectrometry and bioinformatics workflow to identify proteins from whole cell extracts of three different human cell lines. The proteins were profiled for their binding preferences to the different general types of nucleic acids. The study identified 746 high-confidence direct binders, 139 of which were novel and 237 devoid of previous experimental evidence. We could assign specific affinities for sub-types of nucleic acid probes to 219 distinct proteins and individual domains. The evolutionarily conserved protein YB-1, previously associated with cancer and drug resistance, was shown to bind methylated cytosine preferentially, potentially conferring upon YB-1 an epigenetics-related function.
Conclusions
The dataset described here represents a rich resource of experimentally determined nucleic acid-binding proteins, and our methodology has great potential for further exploration of the interface between the protein and nucleic acid realms.
doi:10.1186/gb-2013-14-7-r81
PMCID: PMC4053969  PMID: 23902751
14.  Genomic variation landscape of the human gut microbiome 
Nature  2012;493(7430):45-50.
While large-scale efforts have rapidly advanced the understanding and practical impact of human genomic variation, the latter is largely unexplored in the human microbiome. We therefore developed a framework for metagenomic variation analysis and applied it to 252 fecal metagenomes of 207 individuals from Europe and North America. Using 7.4 billion reads aligned to 101 reference species, we detected 10.3 million single nucleotide polymorphisms (SNPs), 107,991 short indels, and 1,051 structural variants. The average ratio of non-synonymous to synonymous polymorphism rates of 0.11 was more variable between gut microbial species than across human hosts. Subjects sampled at varying time intervals exhibited individuality and temporal stability of SNP variation patterns, despite considerable composition changes of their gut microbiota. This implies that individual-specific strains are not easily replaced and that an individual might have a unique metagenomic genotype, which may be exploitable for personalized diet or drug intake.
doi:10.1038/nature11711
PMCID: PMC3536929  PMID: 23222524
15.  The human small intestinal microbiota is driven by rapid uptake and conversion of simple carbohydrates 
The ISME Journal  2012;6(7):1415-1426.
The human gastrointestinal tract (GI tract) harbors a complex community of microbes. The microbiota composition varies between different locations in the GI tract, but most studies focus on the fecal microbiota, and that inhabiting the colonic mucosa. Consequently, little is known about the microbiota at other parts of the GI tract, which is especially true for the small intestine because of its limited accessibility. Here we deduce an ecological model of the microbiota composition and function in the small intestine, using complementing culture-independent approaches. Phylogenetic microarray analyses demonstrated that microbiota compositions that are typically found in effluent samples from ileostomists (subjects without a colon) can also be encountered in the small intestine of healthy individuals. Phylogenetic mapping of small intestinal metagenome of three different ileostomy effluent samples from a single individual indicated that Streptococcus sp., Escherichia coli, Clostridium sp. and high G+C organisms are most abundant in the small intestine. The compositions of these populations fluctuated in time and correlated to the short-chain fatty acids profiles that were determined in parallel. Comparative functional analysis with fecal metagenomes identified functions that are overrepresented in the small intestine, including simple carbohydrate transport phosphotransferase systems (PTS), central metabolism and biotin production. Moreover, metatranscriptome analysis supported high level in-situ expression of PTS and carbohydrate metabolic genes, especially those belonging to Streptococcus sp. Overall, our findings suggest that rapid uptake and fermentation of available carbohydrates contribute to maintaining the microbiota in the human small intestine.
doi:10.1038/ismej.2011.212
PMCID: PMC3379644  PMID: 22258098
ecological model; function; microbiota; phylogeny; small intestine
16.  Human Monogenic Disease Genes Have Frequently Functionally Redundant Paralogs 
PLoS Computational Biology  2013;9(5):e1003073.
Mendelian disorders are often caused by mutations in genes that are not lethal but induce functional distortions leading to diseases. Here we study the extent of gene duplicates that might compensate for genes causing monogenic diseases. We provide evidence for pervasive functional redundancy of human monogenic disease genes (MDs) by duplicates by showing that 1) genes involved in human genetic disorders are enriched in duplicates and 2) duplicated disease genes tend to have higher functional similarities with their closest paralogs in contrast to duplicated non-disease genes of similar age. We propose that functional compensation by duplication of genes masks the phenotypic effects of deleterious mutations and reduces the probability of purging the defective genes from the human population; this functional compensation could be further enhanced by higher purification selection between disease genes and their duplicates as well as their orthologous counterpart compared to non-disease genes. However, due to the intrinsic expression stochasticity among individuals, the deleterious mutations could still be present as genetic diseases in some subpopulations where the duplicate copies are expressed at low abundances. Consequently the defective genes are linked to genetic disorders while they continue propagating within the population. Our results provide insight into the molecular basis underlying the spreading of duplicated disease genes.
Author Summary
Duplicated genes, as opposed to singletons, are genes that have additional copies in a genome due to evolutionary mechanisms such as whole genome duplication, homologous recombination or retrotransposition events. Duplicates can have similar functions and thus mask the phenotypic consequences when one copy is mutated. Conversely, the corresponding phenotypes would manifest themselves when mutations occur in singletons, since functional compensation is rare among non-duplicated genes. It would thus be expected that the primary source of monogenic diseases, diseases caused by mutations within a single gene, is singletons. However, the opposite was found to be true. Additionally, we found the functional similarity of duplicated disease genes to be greater than that of duplicated non-disease genes of an equivalent duplication age. So how could the stronger functional compensation among duplicates increase their likelihood to associate with diseases? We propose that due to functional compensation in duplicates, disease-causing mutations are less likely to be removed from a human population in large scale since the phenotypes are masked; however, the functional compensation could be lost in a subpopulation, perhaps due to intrinsic variation in gene expression, and therefore lead to diseases. As a result, the duplicated disease genes are linked to genetic diseases, yet they continue to spread within the human population.
doi:10.1371/journal.pcbi.1003073
PMCID: PMC3656685  PMID: 23696728
17.  Characterization of drug-induced transcriptional modules: towards drug repositioning and functional understanding 
Drug-induced transcriptional modules (biclusters) were identified and annotated in three human cell lines and rat liver. These were used to assess conservation across systems and to infer and experimentally validate novel drug effects and gene functions.
Biclustering of drug-induced gene expression profiles resulted in modules of drugs and genes, which were enriched in both drug and gene annotations.
Identifying drug-induced transcriptional modules separately in three human cell lines and rat liver allows assessment of their conservation across model systems. About 70% of modules are conserved across cell lines, a lower bound of 15% was estimated for their conservation across organisms, and between the in vitro and in vivo systems.
Drug-induced transcriptional modules can predict novel gene functions. A conserved module associated with (chole)sterol metabolism revealed novel regulators of cellular cholesterol homeostasis; 10 of them were validated in functional imaging assays.
Analysis of drugs clustered into modules can give new insights into their mechanisms of action and provide leads for drug repositioning. We predicted and experimentally validated novel cell cycle inhibitors and modulators of PPARγ, estrogen and adrenergic receptors, with potential for developing new therapies against diabetes and cancer.
In pharmacology, it is crucial to understand the complex biological responses that drugs elicit in the human organism and how well they can be inferred from model organisms. We therefore identified a large set of drug-induced transcriptional modules from genome-wide microarray data of drug-treated human cell lines and rat liver, and first characterized their conservation. Over 70% of these modules were common for multiple cell lines and 15% were conserved between the human in vitro and the rat in vivo system. We then illustrate the utility of conserved and cell-type-specific drug-induced modules by predicting and experimentally validating (i) gene functions, e.g., 10 novel regulators of cellular cholesterol homeostasis and (ii) new mechanisms of action for existing drugs, thereby providing a starting point for drug repositioning, e.g., novel cell cycle inhibitors and new modulators of α-adrenergic receptor, peroxisome proliferator-activated receptor and estrogen receptor. Taken together, the identified modules reveal the conservation of transcriptional responses towards drugs across cell types and organisms, and improve our understanding of both the molecular basis of drug action and human biology.
doi:10.1038/msb.2013.20
PMCID: PMC3658274  PMID: 23632384
cell line models in drug discovery; drug-induced transcriptional modules; drug repositioning; gene function prediction; transcriptome conservation across cell types and organisms
18.  Systematic identification of proteins that elicit drug side effects 
Protein–side effects associations are identified by integrating drug–target data with side effects information from drug labels. Benchmarking against the literature and validation with an in vivo mouse model shows that these pairs correspond to causal relations.
For more than half of the investigated side effects, we can predict causal proteins.
Off-targets contribute slightly more to the explained side effects than main targets.
With the current data, we are most successful in explaining the side effects of drugs that target G protein-coupled receptors.
Activation of HTR7 causes hyperesthesia in mice, explaining a side effect of triptan drugs.
Side effect similarities of drugs have recently been employed to predict new drug targets, and networks of side effects and targets have been used to better understand the mechanism of action of drugs. Here, we report a large-scale analysis to systematically predict and characterize proteins that cause drug side effects. We integrated phenotypic data obtained during clinical trials with known drug–target relations to identify overrepresented protein–side effect combinations. Using independent data, we confirm that most of these overrepresentations point to proteins which, when perturbed, cause side effects. Of 1428 side effects studied, 732 were predicted to be predominantly caused by individual proteins, at least 137 of them backed by existing pharmacological or phenotypic data. We prove this concept in vivo by confirming our prediction that activation of the serotonin 7 receptor (HTR7) is responsible for hyperesthesia in mice, which, in turn, can be prevented by a drug that selectively inhibits HTR7. Taken together, we show that a large fraction of complex drug side effects are mediated by individual proteins and create a reference for such relations.
doi:10.1038/msb.2013.10
PMCID: PMC3693830  PMID: 23632385
computational biology; drug targets; side effects
19.  Cell type-specific nuclear pores: a case in point for context-dependent stoichiometry of molecular machines 
The stoichiometry of the human nuclear pore complex is revealed by targeted mass spectrometry and super-resolution microscopy. The analysis reveals that the composition of the nuclear pore and other nuclear protein complexes is remodeled as a function of the cell type.
The human NPC has a previously unanticipated stoichiometry that varies across cell types.
Primarily functional Nups are dynamic, while the NPC scaffold is static.
Stoichiometries of many complexes are fine-tuned toward cell type-specific needs.
To understand the structure and function of large molecular machines, accurate knowledge of their stoichiometry is essential. In this study, we developed an integrated targeted proteomics and super-resolution microscopy approach to determine the absolute stoichiometry of the human nuclear pore complex (NPC), possibly the largest eukaryotic protein complex. We show that the human NPC has a previously unanticipated stoichiometry that varies across cancer cell types, tissues and in disease. Using large-scale proteomics, we provide evidence that more than one third of the known, well-defined nuclear protein complexes display a similar cell type-specific variation of their subunit stoichiometry. Our data point to compositional rearrangement as a widespread mechanism for adapting the functions of molecular machines toward cell type-specific constraints and context-dependent needs, and highlight the need of deeper investigation of such structural variants.
doi:10.1038/msb.2013.4
PMCID: PMC3619942  PMID: 23511206
fluorophore counting; nucleoporin; protein complex-based analysis; super-resolution microscopy; targeted proteomics
20.  Orthologous Gene Clusters and Taxon Signature Genes for Viruses of Prokaryotes 
Journal of Bacteriology  2013;195(5):941-950.
Viruses are the most abundant biological entities on earth and encompass a vast amount of genetic diversity. The recent rapid increase in the number of sequenced viral genomes has created unprecedented opportunities for gaining new insight into the structure and evolution of the virosphere. Here, we present an update of the phage orthologous groups (POGs), a collection of 4,542 clusters of orthologous genes from bacteriophages that now also includes viruses infecting archaea and encompasses more than 1,000 distinct virus genomes. Analysis of this expanded data set shows that the number of POGs keeps growing without saturation and that a substantial majority of the POGs remain specific to viruses, lacking homologues in prokaryotic cells, outside known proviruses. Thus, the great majority of virus genes apparently remains to be discovered. A complementary observation is that numerous viral genomes remain poorly, if at all, covered by POGs. The genome coverage by POGs is expected to increase as more genomes are sequenced. Taxon-specific, single-copy signature genes that are not observed in prokaryotic genomes outside detected proviruses were identified for two-thirds of the 57 taxa (those with genomes available from at least 3 distinct viruses), with half of these present in all members of the respective taxon. These signatures can be used to specifically identify the presence and quantify the abundance of viruses from particular taxa in metagenomic samples and thus gain new insights into the ecology and evolution of viruses in relation to their hosts.
doi:10.1128/JB.01801-12
PMCID: PMC3571318  PMID: 23222723
21.  Consistent mutational paths predict eukaryotic thermostability 
Background
Proteomes of thermophilic prokaryotes have been instrumental in structural biology and successfully exploited in biotechnology; however, many proteins required for eukaryotic cell function are absent from bacteria or archaea. With Chaetomium thermophilum, Thielavia terrestris and Thielavia heterothallica, three genome sequences of thermophilic eukaryotes have been published.
Results
Studying the genomes and proteomes of these thermophilic fungi, we found common strategies of thermal adaptation across the different kingdoms of Life, including amino acid biases and a reduced genome size. A phylogenetics-guided comparison of thermophilic proteomes with those of other, mesophilic Sordariomycetes revealed consistent amino acid substitutions associated with thermophily that were also present in an independent lineage of thermophilic fungi. The most consistent pattern is the substitution of lysine by arginine, which we could find in almost all lineages but which has not been extensively used in protein stability engineering. By exploiting mutational paths towards the thermophiles, we could predict particular amino acid residues in individual proteins that contribute to thermostability and validated some of them experimentally. By determining the three-dimensional structure of an exemplar protein from C. thermophilum (Arx1), we could also characterise the molecular consequences of some of these mutations.
Conclusions
The comparative analysis of these three genomes not only enhances our understanding of the evolution of thermophily, but also provides new ways to engineer protein stability.
doi:10.1186/1471-2148-13-7
PMCID: PMC3546890  PMID: 23305080
Thermophily; Comparative genomics; Protein engineering; Eukaryotes; Fungi
22.  The Ecoresponsive Genome of Daphnia pulex 
Science (New York, N.Y.)  2011;331(6017):555-561.
We describe the draft genome of the microcrustacean Daphnia pulex, which is only 200 Mb and contains at least 30,907 genes. The high gene count is a consequence of an elevated rate of gene duplication resulting in tandem gene clusters. More than 1/3 of Daphnia’s genes have no detectable homologs in any other available proteome, and the most amplified gene families are specific to the Daphnia lineage. The co-expansion of gene families interacting within metabolic pathways suggests that the maintenance of duplicated genes is not random, and the analysis of gene expression under different environmental conditions reveals that numerous paralogs acquire divergent expression patterns soon after duplication. Daphnia-specific genes – including many additional loci within sequenced regions that are otherwise devoid of annotations – are the most responsive genes to ecological challenges.
doi:10.1126/science.1197761
PMCID: PMC3529199  PMID: 21292972
23.  STRING v9.1: protein-protein interaction networks, with increased coverage and integration 
Nucleic Acids Research  2012;41(Database issue):D808-D815.
Complete knowledge of all direct and indirect interactions between proteins in a given cell would represent an important milestone towards a comprehensive description of cellular mechanisms and functions. Although this goal is still elusive, considerable progress has been made—particularly for certain model organisms and functional systems. Currently, protein interactions and associations are annotated at various levels of detail in online resources, ranging from raw data repositories to highly formalized pathway databases. For many applications, a global view of all the available interaction data is desirable, including lower-quality data and/or computational predictions. The STRING database (http://string-db.org/) aims to provide such a global perspective for as many organisms as feasible. Known and predicted associations are scored and integrated, resulting in comprehensive protein networks covering >1100 organisms. Here, we describe the update to version 9.1 of STRING, introducing several improvements: (i) we extend the automated mining of scientific texts for interaction information, to now also include full-text articles; (ii) we entirely re-designed the algorithm for transferring interactions from one model organism to the other; and (iii) we provide users with statistical information on any functional enrichment observed in their networks.
doi:10.1093/nar/gks1094
PMCID: PMC3531103  PMID: 23203871
24.  PTMcode: a database of known and predicted functional associations between post-translational modifications in proteins 
Nucleic Acids Research  2012;41(Database issue):D306-D311.
Post-translational modifications (PTMs) are involved in the regulation and structural stabilization of eukaryotic proteins. The combination of individual PTM states is key to modulating cellular functions, as has become evident in a few well-studied proteins. This combinatorial setting, dubbed the PTM code, has been proposed to extend to whole proteomes in eukaryotes. Although we are still far from deciphering such a complex language, thousands of protein PTM sites are being mapped by high-throughput technologies, thus providing sufficient data for comparative analysis. PTMcode (http://ptmcode.embl.de) aims to compile known and predicted PTM associations to provide a framework that would enable hypothesis-driven experimental or computational analysis at various scales. In its first release, PTMcode provides PTM functional associations of 13 different PTM types within proteins in 8 eukaryotes. They are based on five evidence channels: a literature survey, residue co-evolution, structural proximity, PTMs at the same residue and location within PTM highly enriched protein regions (hotspots). PTMcode is presented as a protein-based searchable database with an interactive web interface providing the context of the co-regulation of nearly 75 000 residues in >10 000 proteins.
doi:10.1093/nar/gks1230
PMCID: PMC3531129  PMID: 23193284
25.  DvD: An R/Cytoscape pipeline for drug repurposing using public repositories of gene expression data 
Bioinformatics  2012;29(1):132-134.
Summary: Drug versus Disease (DvD) provides a pipeline, available through R or Cytoscape, for the comparison of drug and disease gene expression profiles from public microarray repositories. Negatively correlated profiles can be used to generate drug repurposing hypotheses, whereas positively correlated profiles may be used to infer side effects of drugs. DvD allows users to compare drug and disease signatures with dynamic access to the databases ArrayExpress and Gene Expression Omnibus and to data from the Connectivity Map.
Availability and implementation: R package (submitted to Bioconductor) under GPL 3 and Cytoscape plug-in freely available for download at www.ebi.ac.uk/saezrodriguez/DVD/.
Contact: saezrodriguez@ebi.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts656
PMCID: PMC3530913  PMID: 23129297
