Complete knowledge of all direct and indirect interactions between proteins in a given cell would represent an important milestone towards a comprehensive description of cellular mechanisms and functions. Although this goal is still elusive, considerable progress has been made—particularly for certain model organisms and functional systems. Currently, protein interactions and associations are annotated at various levels of detail in online resources, ranging from raw data repositories to highly formalized pathway databases. For many applications, a global view of all the available interaction data is desirable, including lower-quality data and/or computational predictions. The STRING database (http://string-db.org/) aims to provide such a global perspective for as many organisms as feasible. Known and predicted associations are scored and integrated, resulting in comprehensive protein networks covering >1100 organisms. Here, we describe the update to version 9.1 of STRING, introducing several improvements: (i) we extend the automated mining of scientific texts for interaction information, to now also include full-text articles; (ii) we entirely re-designed the algorithm for transferring interactions from one model organism to the other; and (iii) we provide users with statistical information on any functional enrichment observed in their networks.
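The functional enrichment statistics mentioned in point (iii) can be illustrated with a one-sided hypergeometric test. This is a minimal sketch, not STRING's implementation: the abstract does not say which statistic or multiple-testing correction STRING uses, so the choice of test, the function name `hypergeom_enrichment_p` and the toy numbers are all assumptions.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int,
                           sample: int, hits: int) -> float:
    """Return P(X >= hits) for a hypergeometric draw.

    population: number of proteins in the background (assumed, e.g. a proteome)
    annotated:  background proteins carrying the functional term
    sample:     proteins in the user's network
    hits:       network proteins carrying the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probability mass from the observed count up to the maximum.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy numbers (hypothetical): 20 background proteins, 5 carry the term,
# and all 5 proteins of the network carry it.
p_value = hypergeom_enrichment_p(population=20, annotated=5, sample=5, hits=5)
```

In practice one such p-value would be computed per functional term and then corrected for multiple testing; that step is omitted here.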
Regional genomic copy number alterations (CNA) are observed in the vast majority of cancers. Besides specifically targeting well-known, canonical oncogenes, CNAs may also play more subtle roles in terms of modulating genetic potential and broad gene expression patterns of developing tumors. Any significant differences in the overall CNA patterns between different cancer types may thus point towards specific biological mechanisms acting in those cancers. In addition, differences among CNA profiles may prove valuable for cancer classifications beyond existing annotation systems.
We have analyzed molecular-cytogenetic data from 25579 tumor samples, which were classified into 160 cancer types according to the International Classification of Disease (ICD) coding system. When correcting for differences in the overall CNA frequencies between cancer types, related cancers were often found to cluster together according to similarities in their CNA profiles. Based on a randomization approach, distance measures from the cluster dendrograms were used to identify those specific genomic regions that contributed significantly to this signal. This approach identified 43 non-neutral genomic regions whose propensity for the occurrence of copy number alterations varied with the type of cancer at hand. Only a subset of these identified loci overlapped with previously implied, highly recurrent (hot-spot) cytogenetic imbalance regions.
Thus, for many genomic regions, a simple null-hypothesis of independence between cancer type and relative copy number alteration frequency can be rejected. Since a subset of these regions display relatively low overall CNA frequencies, they may point towards second-tier genomic targets that are adaptively relevant but not necessarily essential for cancer development.
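The idea of rejecting independence between cancer type and CNA frequency by randomization can be sketched as a label-permutation test for a single genomic region. This is only an illustration under stated assumptions: the study itself derives its statistic from cluster dendrogram distances, whereas the sketch below substitutes a much simpler statistic (the spread of per-type alteration frequencies). The function name and the toy data are hypothetical.

```python
import random


def permutation_p_value(altered_by_type, n_perm=2000, seed=0):
    """Permutation p-value for one region.

    altered_by_type maps a cancer type to a list of 0/1 flags, one per
    sample (1 = region altered). The statistic is max minus min of the
    per-type alteration frequency; alteration flags are shuffled across
    samples to simulate independence from cancer type.
    """
    rng = random.Random(seed)
    labels = [t for t, flags in altered_by_type.items() for _ in flags]
    values = [v for flags in altered_by_type.values() for v in flags]

    def spread(vals):
        sums, counts = {}, {}
        for t, v in zip(labels, vals):
            sums[t] = sums.get(t, 0) + v
            counts[t] = counts.get(t, 0) + 1
        freqs = [sums[t] / counts[t] for t in sums]
        return max(freqs) - min(freqs)

    observed = spread(values)
    shuffled = values[:]
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if spread(shuffled) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (exceed + 1) / (n_perm + 1)


# Hypothetical toy data: the region is altered in 40/50 samples of one
# type and in 5/50 samples of another.
toy = {"type_A": [1] * 40 + [0] * 10, "type_B": [1] * 5 + [0] * 45}
p_region = permutation_p_value(toy)
```

A small p-value for a region means its alteration frequency depends on cancer type more strongly than shuffled data would suggest.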
Essential genes are absolutely required for the survival of an organism. The identification of essential genes, besides being one of the most fundamental questions in biology, is also of interest for the emerging science of synthetic biology and for the development of novel antimicrobials. New antimicrobial therapies are desperately needed to treat multidrug-resistant pathogens, such as members of the Burkholderia cepacia complex.
We hypothesize that essential genes may be highly conserved within a group of evolutionarily closely related organisms. Using a bioinformatics approach, we determined that the core genome of the order Burkholderiales consists of 649 genes. All but two of these identified genes were located on chromosome 1 of Burkholderia cenocepacia. Although many of the 649 core genes of Burkholderiales have been shown to be essential in other bacteria, we were also able to identify a number of novel essential genes present mainly, or exclusively, within this order. The essentiality of some of the core genes, including the known essential genes infB, gyrB, ubiB, and valS, as well as the so far uncharacterized genes BCAL1882, BCAL2769, BCAL3142 and BCAL3369 has been confirmed experimentally in B. cenocepacia.
We report on the identification of essential genes using a novel bioinformatics strategy and provide bioinformatics and experimental evidence that the large majority of the identified genes are indeed essential. The essential genes identified here may represent valuable targets for the development of novel antimicrobials and their detailed study may shed new light on the functions required to support life.
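The notion of a core genome used above can be sketched as the intersection of gene families across genomes. The sketch assumes that genes have already been grouped into shared families (for example by orthology); how the study performed that grouping is not stated in the abstract. The genome names and family identifiers below are hypothetical.

```python
def core_genome(families_by_genome):
    """Return the gene families present in every genome.

    families_by_genome maps a genome name to an iterable of family IDs.
    """
    genomes = iter(families_by_genome.values())
    try:
        core = set(next(genomes))
    except StopIteration:
        return set()  # no genomes supplied
    for families in genomes:
        core &= set(families)
    return core


# Hypothetical toy input: three genomes and made-up family IDs.
toy_genomes = {
    "genome_1": {"fam_infB", "fam_gyrB", "fam_x"},
    "genome_2": {"fam_infB", "fam_gyrB", "fam_y"},
    "genome_3": {"fam_infB", "fam_gyrB"},
}
shared = core_genome(toy_genomes)
```

Membership in the core set is only a hypothesis of essentiality; as the abstract notes, it still needs experimental confirmation.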
Many protein-protein interactions are mediated by domain-motif interaction, where a domain in one protein binds a short linear motif in its interacting partner. Such interactions are often involved in key cellular processes, necessitating their tight regulation. A common strategy of the cell to control protein function and interaction is by post-translational modifications of specific residues, especially phosphorylation. Indeed, there are motifs, such as SH2-binding motifs, in which motif phosphorylation is required for the domain-motif interaction. Conversely, there are other examples where motif phosphorylation prevents the domain-motif interaction. Here we present a large-scale integrative analysis of experimental human data of domain-motif interactions and phosphorylation events, demonstrating an intriguing coupling between the two. We report such coupling for SH3, PDZ, SH2 and WW domains, where residue phosphorylation within or next to the motif is implied to be associated with switching on or off domain binding. For domains that require motif phosphorylation for binding, such as SH2 domains, we found coupled phosphorylation events other than the ones required for domain binding. Furthermore, we show that phosphorylation might function as a double switch, concurrently enabling interaction of the motif with one domain and disabling interaction with another domain. Evolutionary analysis shows that co-evolution of the motif and the proximal residues capable of phosphorylation predominates over other evolutionary scenarios, in which the motif appeared before the potentially phosphorylated residue, or vice versa. Our findings provide strengthening evidence for coupled interaction-regulation units, defined by a domain-binding motif and a phosphorylated residue.
Domain-motif interactions are instrumental for many central cellular processes, and are therefore tightly regulated. Phosphorylation events are known modulators of protein-protein interactions in general, including domain-motif interactions. Here, we addressed the association of phosphorylation and domain-motif interaction taking a motif-centred view. We integrated human domain-motif interaction and phosphorylation data for four representative domains (SH2, WW, SH3 and PDZ), and showed that the adjacency between phosphorylation and domain-motif interactions is extensive, suggesting interesting functional links between them that extend the classical and widely studied phospho-regulation of SH2 or WW domain-motif interactions. Furthermore, we show that such interaction-regulation units may function as double switches, concurrently enabling interaction of the motif with one domain and disabling interaction with another domain. These latter interaction-regulation units are more conserved in evolution than the individual units comprising them. Assuming that the four analyzed domain-motif interaction types are reliable representatives of such interactions, our results support the existence of units comprising motifs and associated phosphorylation sites, in which the regulation of domain-motif interaction is inherent.
Orthologous relationships form the basis of most comparative genomic and metagenomic studies and are essential for proper phylogenetic and functional analyses. The third version of the eggNOG database (http://eggnog.embl.de) contains non-supervised orthologous groups constructed from 1133 organisms, doubling the number of genes with orthology assignment compared to eggNOG v2. The new release is the result of a number of improvements and expansions: (i) the underlying homology searches are now based on the SIMAP database; (ii) the orthologous groups have been extended to 41 levels of selected taxonomic ranges enabling much more fine-grained orthology assignments; and (iii) the newly designed web page is considerably faster with more functionality. In total, eggNOG v3 contains 721 801 orthologous groups, encompassing a total of 4 396 591 genes. Additionally, we updated 4873 and 4850 original COGs and KOGs, respectively, to include all 1133 organisms. At the universal level, covering all three domains of life, 101 208 orthologous groups are available, while the others are applicable at 40 more limited taxonomic ranges. Each group is amended by multiple sequence alignments and maximum-likelihood trees and broad functional descriptions are provided for 450 904 orthologous groups (62.5%).
To facilitate the study of interactions between proteins and chemicals, we have created STITCH, an aggregated database of interactions connecting over 300 000 chemicals and 2.6 million proteins from 1133 organisms. Compared to the previous version, the number of chemicals with interactions and the number of high-confidence interactions both increase 4-fold. The database can be accessed interactively through a web interface, displaying interactions in an integrated network view. It is also available for computational studies through downloadable files and an API. As an extension in the current version, we offer the option to switch between two levels of detail, namely whether stereoisomers of a given compound are shown as a merged entity or as separate entities. Separate display of stereoisomers is necessary, for example, for carbohydrates and chiral drugs. Combining the isomers increases the coverage, as interaction databases and publications found through text mining will often refer to compounds without specifying the stereoisomer. The database is accessible at http://stitch.embl.de/.
The genus Nepenthes, a carnivorous plant, has a pitcher to trap insects and digest them in the contained fluid to gain nutrients. A distinctive character of the pitcher fluid is the digestive enzyme activity that may be derived from plants and dwelling microbes. However, little is known about in situ digestive enzymes in the fluid. Here we examined the pitcher fluid from four species of Nepenthes. High bacterial density was observed within the fluids, ranging from 7×10⁶ to 2.2×10⁸ cells ml⁻¹. We measured the activity of three common enzymes in the fluid: acid phosphatases, β-D-glucosidases, and β-D-glucosaminidases. All the tested enzymes detected in the liquid of all the pitcher species showed activity that considerably exceeded that observed in aquatic environments such as freshwater, seawater, and sediment. Our results indicate that high enzyme activity within a pitcher could assist in the rapid decomposition of prey to maximize efficient nutrient use. In addition, we filtered the fluid to distinguish between dissolved enzyme activity and particle-bound activity. Filtration significantly decreased the activity of all enzymes, while pH value and Nepenthes species did not affect the enzyme activity. This suggests that enzymes bound to bacteria and other organic particles also contribute significantly to the total enzyme activity of the fluid. Since organic particles are themselves usually colonized by attached and highly active bacteria, it is possible that microbe-derived enzymes also play an important role in nutrient recycling within the fluid and affect the metabolism of the Nepenthes pitcher plant.
The phosphorylation and dephosphorylation of proteins by kinases and phosphatases constitute an essential regulatory network in eukaryotic cells. This network supports the flow of information from sensors through signaling systems to effector molecules, and ultimately drives the phenotype and function of cells, tissues, and organisms. Dysregulation of this process has severe consequences and is one of the main factors in the emergence and progression of diseases, including cancer. Thus, major efforts have been invested in developing specific inhibitors that modulate the activity of individual kinases or phosphatases; however, it has been difficult to assess how such pharmacological interventions would affect the cellular signaling network as a whole. Here, we used label-free, quantitative phosphoproteomics in a systematically perturbed model organism (Saccharomyces cerevisiae) to determine the relationships between 97 kinases, 27 phosphatases, and more than 1000 phosphoproteins. We identified 8814 regulated phosphorylation events, describing the first system-wide protein phosphorylation network in vivo. Our results show that, at steady state, inactivation of most kinases and phosphatases affected large parts of the phosphorylation-modulated signal transduction machinery, and not only the immediate downstream targets. The observed cellular growth phenotype was often well maintained despite the perturbations, arguing for considerable robustness in the system. Our results serve to constrain future models of cellular signaling and reinforce the idea that simple linear representations of signaling pathways might be insufficient for drug development and for describing organismal homeostasis.
Non-intermingling, adjacent populations of cells define compartment boundaries;
such boundaries are often essential for the positioning and the maintenance of
tissue-organizers during growth. In the developing wing primordium of
Drosophila melanogaster, signaling by the secreted protein
Hedgehog (Hh) is required for compartment boundary maintenance. However, the
precise mechanism of Hh input remains poorly understood. Here, we combine
experimental observations of perturbed Hh signaling with computer simulations of
cellular behavior, and connect physical properties of cells to their Hh
signaling status. We find that experimental disruption of Hh signaling has
observable effects on cell sorting surprisingly far from the compartment
boundary, which is in contrast to a previous model that confines Hh influence to
the compartment boundary itself. We have recapitulated our experimental
observations by simulations of Hh diffusion and transduction coupled to
mechanical tension along cell-to-cell contact surfaces. Intriguingly, the best
results were obtained under the assumption that Hh signaling cannot alter the
overall tension force of the cell, but will merely re-distribute it locally
inside the cell, relative to the signaling status of neighboring cells. Our
results suggest a scenario in which homotypic interactions of a putative Hh
target molecule at the cell surface are converted into a mechanical force. Such
a scenario could explain why the mechanical output of Hh signaling appears to be
confined to the compartment boundary, despite the longer range of the Hh
molecule itself. Our study is the first to couple a cellular vertex model
describing mechanical properties of cells in a growing tissue, to an explicit
model of an entire signaling pathway, including a freely diffusible component.
We discuss potential applications and challenges of such an approach.
In developing animal tissues, cells can often re-arrange locally and mix
relatively freely. However, in some stereotypic and crucially important
instances during body development, cells will strictly not intermingle, and
instead form sharp boundaries along which they will sort out from each other.
This mechanism helps organisms to establish signaling centers and to maintain
distinct cellular identities. Often, cells at such boundaries will remain in
close physical contact and are morphologically alike. Thus, the boundary itself
can be difficult to observe unless the expression status of specific marker
genes is monitored experimentally. How are these ‘compartment
boundaries’ established? Here we devise a computational model that aims to
describe one such boundary in a well-studied animal tissue: the developing wing
primordium of Drosophila melanogaster. We model the production,
diffusion and local sensing of an essential signaling molecule, the
Hedgehog protein. We reveal one possible mechanism by which
Hedgehog sensing can influence the mechanical properties of cells, and compare
the simulated outcome to observations in experimentally perturbed, actual wing
discs. Our relatively simple model suffices to establish a straight and stable
An essential prerequisite for any systems-level understanding of cellular functions is to correctly uncover and annotate all functional interactions among proteins in the cell. Toward this goal, remarkable progress has been made in recent years, both in terms of experimental measurements and computational prediction techniques. However, public efforts to collect and present protein interaction information have struggled to keep up with the pace of interaction discovery, partly because protein–protein interaction information can be error-prone and require considerable effort to annotate. Here, we present an update on the online database resource Search Tool for the Retrieval of Interacting Genes (STRING); it provides uniquely comprehensive coverage of, and ease of access to, both experimental and predicted interaction information. Interactions in STRING are provided with a confidence score, and accessory information such as protein domains and 3D structures is made available, all within a stable and consistent identifier space. New features in STRING include an interactive network viewer that can cluster networks on demand, updated on-screen previews of structural information including homology models, extensive data updates and strongly improved connectivity and integration with third-party resources. Version 9.0 of STRING covers more than 1100 completely sequenced organisms; the resource can be reached at http://string-db.org.
Shotgun sequencing of environmental DNA is an essential technique for characterizing uncultivated microbes in situ. However, the taxonomic and functional assignment of the obtained sequence fragments remains a pressing problem.
Existing algorithms are largely optimized for speed and coverage; in contrast, we present here a software framework that focuses on a restricted set of informative gene families, using Maximum Likelihood to assign these with the best possible accuracy. This framework ('MLTreeMap'; http://mltreemap.org/) uses raw nucleotide sequences as input, and includes hand-curated, extensible reference information.
We discuss how we validated our pipeline using complete genomes as well as simulated and actual environmental sequences.
Improving the ability to reverse engineer biochemical networks is a major goal of systems biology. Lesions in signaling networks lead to alterations in gene expression, which in principle should allow network reconstruction. However, the information about the activity levels of signaling proteins conveyed in overall gene expression is limited by the complexity of gene expression dynamics and of regulatory network topology. Two observations provide the basis for overcoming this limitation: (a) genes induced without de-novo protein synthesis (early genes) show a linear accumulation of product in the first hour after the change in the cell's state; and (b) the signaling components in the network largely function in the linear range of their stimulus-response curves. Therefore, unlike most genes or most time points, expression profiles of early genes at an early time point provide direct biochemical assays that represent the activity levels of upstream signaling components. Such expression data provide the basis for an efficient algorithm (Plato's Cave algorithm; PLACA) to reverse engineer functional signaling networks. Unlike conventional reverse engineering algorithms that use steady state values, PLACA uses stimulated early gene expression measurements associated with systematic perturbations of signaling components, without measuring the signaling components themselves. Besides the reverse engineered network, PLACA also identifies the genes detecting the functional interaction, thereby facilitating validation of the predicted functional network. Using simulated datasets, the algorithm is shown to be robust to experimental noise. Using experimental data obtained from gonadotropes, PLACA reverse engineered the interaction network of six perturbed signaling components. The network recapitulated many known interactions and identified novel functional interactions that were validated by further experiment.
PLACA uses the results of experiments that are feasible for any signaling network to predict the functional topology of the network and to identify novel relationships.
Elucidating the biochemical interactions in living cells is essential to understanding their behavior under various external conditions. Some of these interactions occur between signaling components with many active states, and their activity levels may be difficult to measure directly. However, most methods to reverse engineer interaction networks rely on measuring gene activity at steady state under various cellular stimuli. Such gene measurements therefore ignore the intermediate effects of signaling components, and cannot reliably convey the interactions between the signaling components themselves. We propose using the changes in activity of early genes shortly after the stimulus to infer the functional interactions between the unmeasured signaling components. The change in expression in such genes at these times is directly and linearly affected by the signaling components, since there is insufficient time for other genes to be transcribed and interfere with the early genes' expression. We present an algorithm that uses such measurements to reverse engineer the functional interaction network between signaling components, and also provides a means for testing these predictions. The algorithm therefore uses feasible experiments to reconstruct functional networks. We applied the algorithm to experimental measurements and uncovered known interactions, as well as novel interactions that were then confirmed experimentally.
Over the last years, the publicly available knowledge on interactions between small molecules and proteins has been steadily increasing. To create a network of interactions, STITCH aims to integrate the data dispersed over the literature and various databases of biological pathways, drug–target relationships and binding affinities. In STITCH 2, the number of relevant interactions is increased by incorporation of BindingDB, PharmGKB and the Comparative Toxicogenomics Database. The resulting network can be explored interactively or used as the basis for large-scale analyses. To facilitate links to other chemical databases, we adopt InChIKeys that allow identification of chemicals with a short, checksum-like string. STITCH 2.0 connects proteins from 630 organisms to over 74 000 different chemicals, including 2200 drugs. STITCH can be accessed at http://stitch.embl.de/.
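The short, checksum-like InChIKey mentioned above has a fixed layout: a 14-letter block derived from the connectivity of the compound, a 10-letter block covering the remaining layers (such as stereochemistry) plus flag characters, and a final single letter, for 27 characters in total. The sketch below checks that layout and groups keys by their first block. Whether STITCH links or merges compounds this way is an assumption, and the example keys are made-up placeholders in the right format, not keys of real compounds.

```python
import re
from collections import defaultdict

# Layout of a standard InChIKey: 14 letters, 10 letters, 1 letter.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_inchikey_shaped(key: str) -> bool:
    """Check only the layout; this does not prove the key is genuine."""
    return INCHIKEY_RE.fullmatch(key) is not None


def group_by_connectivity(keys):
    """Group keys that share the first (connectivity) block."""
    groups = defaultdict(list)
    for key in keys:
        if not is_inchikey_shaped(key):
            raise ValueError(f"not InChIKey-shaped: {key!r}")
        groups[key.split("-")[0]].append(key)
    return dict(groups)


# Placeholder keys (hypothetical, not real compounds): the first two
# share a connectivity block, as stereoisomers of one compound would.
placeholder_keys = [
    "ABCDEFGHIJKLMN-OPQRSTUVSA-N",
    "ABCDEFGHIJKLMN-ZYXWVUTSSA-N",
    "NMLKJIHGFEDCBA-OPQRSTUVSA-N",
]
grouped = group_by_connectivity(placeholder_keys)
```

Because the key is a hash, it supports fast exact lookup across databases but cannot be converted back into a structure.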
The Hedgehog signaling pathway plays a crucial role in development and disease. Its putative origins in an ancient system involved in regulating bacterial lipid transport and homeostasis offer clues about how the pathway might work today.
Although functionally related proteins can be reliably predicted from phylogenetic profiles, many functional modules do not seem to evolve cohesively according to case studies and systematic analyses in prokaryotes. In this study we quantify the extent of evolutionary cohesiveness of functional modules in eukaryotes and probe the biological and methodological factors influencing our estimates. We have collected various datasets of protein complexes and pathways in Saccharomyces cerevisiae. We define orthologous groups on 34 eukaryotic genomes and measure the extent of cohesive evolution of sets of orthologous groups of which members constitute a known complex or pathway. Within this framework it appears that most functional modules evolve flexibly rather than cohesively. Even after correcting for uncertain module definitions and potentially problematic orthologous groups, only 46% of pathways and complexes evolve more cohesively than random modules. This flexibility seems partly coupled to the nature of the functional module because biochemical pathways are generally more cohesively evolving than complexes.
Components of a protein complex or a metabolic pathway strongly cooperate to perform a specific function. Because of this functional interdependence, proteins that form a complex or pathway are expected to be present and absent together in different species. Phylogenetic profiling methods, in which proteins with similar presence and absence patterns are inferred to be functionally linked, are based on this assumption. In this report, we quantify to what extent proteins that together constitute a complex or pathway (a functional module) in yeast are present and absent together (evolve cohesively) in other eukaryotic species. We find that more than half of all complexes and pathways are only partially present in a number of species. It appears that evolution of functional modules is very flexible; components are not indispensable; they can be replaced or reused in a different functional context. This places a limit on how well phylogenetic profiling methods can detect functionally related proteins. Functional modules that evolve cohesively are typically involved in biological processes such as translation and amino acid metabolism.
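The presence and absence patterns described above can be made concrete with a small sketch that scores how cohesively a module is distributed across species. The measure used here, mean pairwise Jaccard similarity of phylogenetic profiles, is a stand-in chosen for illustration; the abstract does not give the study's own cohesiveness measure. Protein and species names are hypothetical.

```python
from itertools import combinations


def jaccard(profile_a: set, profile_b: set) -> float:
    """Similarity of two profiles, each a set of species with the protein."""
    union = profile_a | profile_b
    if not union:
        return 1.0  # both proteins absent everywhere: identical profiles
    return len(profile_a & profile_b) / len(union)


def module_cohesiveness(profiles: dict) -> float:
    """Mean pairwise Jaccard similarity over all proteins of a module."""
    pairs = list(combinations(profiles.values(), 2))
    if not pairs:
        raise ValueError("need at least two proteins in the module")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical module: two components co-occur in three species,
# the third component is found in only one of them.
toy_module = {
    "protein_1": {"species_1", "species_2", "species_3"},
    "protein_2": {"species_1", "species_2", "species_3"},
    "protein_3": {"species_1"},
}
score = module_cohesiveness(toy_module)
```

A module whose components are always present or absent together scores 1.0; to judge a score as in the study, it would be compared against scores of randomly assembled modules.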
Functional partnerships between proteins are at the core of complex cellular phenotypes, and the networks formed by interacting proteins provide researchers with crucial scaffolds for modeling, data reduction and annotation. STRING is a database and web resource dedicated to protein–protein interactions, including both physical and functional interactions. It weights and integrates information from numerous sources, including experimental repositories, computational prediction methods and public text collections, thus acting as a meta-database that maps all interaction evidence onto a common set of genomes and proteins. The most important new developments in STRING 8 over previous releases include a URL-based programming interface, which can be used to query STRING from other resources, improved interaction prediction via genomic neighborhood in prokaryotes, and the inclusion of protein structures. Version 8.0 of STRING covers about 2.5 million proteins from 630 organisms, providing the most comprehensive view on protein–protein interactions currently available. STRING can be reached at http://string-db.org/.
To investigate the extent of genetic stratification in structured microbial communities, we compared the metagenomes of 10 successive layers of a phylogenetically complex hypersaline mat from Guerrero Negro, Mexico. We found pronounced millimeter-scale genetic gradients that were consistent with the physicochemical profile of the mat. Despite these gradients, all layers displayed near-identical and acid-shifted isoelectric point profiles due to a molecular convergence of amino-acid usage, indicating that hypersalinity enforces an overriding selective pressure on the mat community.
metagenomics; hypersalinity; microbial ecology; fine-scale; salt-in
The knowledge about interactions between proteins and small molecules is essential for the understanding of molecular and cellular functions. However, information on such interactions is widely dispersed across numerous databases and the literature. To facilitate access to this data, STITCH (‘search tool for interactions of chemicals’) integrates information about interactions from metabolic pathways, crystal structures, binding experiments and drug–target relationships. Inferred information from phenotypic effects, text mining and chemical structure similarity is used to predict relations between chemicals. STITCH further allows exploring the network of chemical relations, also in the context of associated binding proteins. Each proposed interaction can be traced back to the original data sources. Our database contains interaction information for over 68 000 different chemicals, including 2200 drugs, and connects them to 1.5 million genes across 373 genomes and their interactions contained in the STRING database. STITCH is available at http://stitch.embl.de/
Metagenomic analysis of termite gut flora reveals a diversity of wood-degrading enzymes.
Termites eat and digest wood, but how do they do it? Combining advanced genomics and proteomics techniques, researchers have now shown that microbes found in the termites' hindguts possess just the right tools.
The identification of orthologous genes forms the basis for most comparative genomics studies. Existing approaches either lack functional annotation of the identified orthologous groups, hampering the interpretation of subsequent results, or are manually annotated and thus lag behind the rapid sequencing of new genomes. Here we present the eggNOG database (‘evolutionary genealogy of genes: Non-supervised Orthologous Groups’), which contains orthologous groups constructed from Smith–Waterman alignments through identification of reciprocal best matches and triangular linkage clustering. Applying this procedure to 312 bacterial, 26 archaeal and 35 eukaryotic genomes yielded 43 582 coarse-grained orthologous groups of which 9724 are extended versions of those from the original COG/KOG database. We also constructed more fine-grained groups for selected subsets of organisms, such as the 19 914 mammalian orthologous groups. We automatically annotated our non-supervised orthologous groups with functional descriptions, which were derived by identifying common denominators for the genes based on their individual textual descriptions, annotated functional categories, and predicted protein domains. The orthologous groups in eggNOG contain 1 241 751 genes and provide at least a broad functional description for 77% of them. Users can query the resource for individual genes via a web interface or download the complete set of orthologous groups at http://eggnog.embl.de.
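The two steps named above, reciprocal best matches and triangles of such matches, can be sketched as follows. This is a simplified illustration, not the eggNOG pipeline: it takes precomputed alignment scores as input, ignores in-paralogs and score ties, and stops at listing triangles rather than merging them into groups. All gene names, genome names and scores are hypothetical.

```python
from itertools import combinations


def best_hits(scores, genome_of):
    """Best-scoring target per (query gene, target genome).

    scores maps (query, target) to an alignment score; genome_of maps a
    gene to its genome. On ties the first pair encountered wins.
    """
    best = {}
    for (query, target), score in scores.items():
        if genome_of[query] == genome_of[target]:
            continue  # only cross-genome matches are considered
        key = (query, genome_of[target])
        if key not in best or score > best[key][0]:
            best[key] = (score, target)
    return {key: target for key, (_, target) in best.items()}


def reciprocal_best_matches(scores, genome_of):
    """Gene pairs that are each other's best hit in the other's genome."""
    best = best_hits(scores, genome_of)
    pairs = set()
    for (query, _), target in best.items():
        if best.get((target, genome_of[query])) == query:
            pairs.add(frozenset((query, target)))
    return pairs


def triangles(pairs):
    """Gene triples in which all three pairs are reciprocal best matches."""
    genes = sorted({gene for pair in pairs for gene in pair})
    return [
        trio for trio in combinations(genes, 3)
        if all(frozenset(edge) in pairs for edge in combinations(trio, 2))
    ]


# Hypothetical toy input: b2 is a weaker second copy in genome B.
toy_genome_of = {"a1": "A", "b1": "B", "b2": "B", "c1": "C"}
toy_scores = {
    ("a1", "b1"): 100, ("b1", "a1"): 100,
    ("a1", "b2"): 40, ("b2", "a1"): 40,
    ("a1", "c1"): 90, ("c1", "a1"): 90,
    ("b1", "c1"): 80, ("c1", "b1"): 80,
    ("b2", "c1"): 30, ("c1", "b2"): 30,
}
rbm = reciprocal_best_matches(toy_scores, toy_genome_of)
found = triangles(rbm)
```

In a full clustering step, triangles that share genes would then be joined into larger orthologous groups; that merging is left out of this sketch.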
Most mucosal surfaces of the mammalian body are colonized by microbial communities (“microbiota”). A high density of commensal microbiota inhabits the intestine and shields from infection (“colonization resistance”). The virulence strategies allowing enteropathogenic bacteria to successfully compete with the microbiota and overcome colonization resistance are poorly understood. Here, we investigated manipulation of the intestinal microbiota by the enteropathogenic bacterium Salmonella enterica subspecies 1 serovar Typhimurium (S. Tm) in a mouse colitis model: we found that inflammatory host responses induced by S. Tm changed microbiota composition and suppressed its growth. In contrast to wild-type S. Tm, an avirulent invGsseD mutant failing to trigger colitis was outcompeted by the microbiota. This competitive defect was reverted if inflammation was provided concomitantly by mixed infection with wild-type S. Tm or in mice (IL10−/−, VILLIN-HACL4-CD8) with inflammatory bowel disease. Thus, inflammation is necessary and sufficient for overcoming colonization resistance. This reveals a new concept in infectious disease: in contrast to current thinking, inflammation is not always detrimental for the pathogen. Triggering the host's immune defence can shift the balance between the protective microbiota and the pathogen in favour of the pathogen.
A dense microbial community colonizes the intestinal tract of mammals, contributing to health and nutrition and conferring efficient protection against most pathogenic intruders. Intestinal pathogens can overcome this colonization resistance and cause disease; however, the mechanisms used to do this are still elusive. In this study we analyzed intestinal infection by the model pathogen Salmonella enterica subspecies 1 serovar Typhimurium (S. Tm). We show that the virulent wild-type pathogen overcomes colonization resistance by inducing the host's inflammatory immune response and exploiting it for its purpose. In contrast, an avirulent Salmonella mutant defective in triggering inflammation was unable to overcome colonization resistance by itself. Colonization by this mutant was restored if inflammation was provided concomitantly, in mice with inflammatory bowel disease (genetic and inducible) or by co-infection with wild-type S. Tm. These findings reveal a previously unrecognized strategy by which pathogenic bacteria overcome colonization resistance: abusing the host's inflammatory immune response to gain an edge against the normal microbial community of the gut. This represents a first step towards unravelling the molecular mechanisms underlying this three-way interaction of host, microbiota, and pathogens.
Inducing inflammation is key to the ability of the virulent pathogen Salmonella enterica serovar Typhimurium to outcompete the protective resident microbiota in a race to colonize the gut.
Environmental sequencing, also dubbed metagenomics, is increasingly being used to obtain insights into organismal communities in diverse habitats, and has a variety of foreseeable applications in biotechnology and medicine. The first public large-scale data sets already provide a wealth of information hidden in vast amounts of fragmented pieces of DNA from unknown species residing in these environments. Comparative sequence analysis is essential for the interpretation of such data. However, different layers of complexity that are intrinsic to each sample require the establishment of some baselines for comparison: how to normalize for the differences in phylogenetic and functional diversity, how to avoid biases from incomplete data, and how to deal with differences in species dominance or genome sizes? Here we discuss a few of these items and delineate some simple discriminative sequence properties for four distinct habitats.
comparison; diversity; environments; metagenomics
A novel computational approach shows a link between genome size and habitat from analysis of environmental metagenomic DNA reads.
We introduce a novel computational approach to predict effective genome size (EGS; a measure that includes multiple plasmid copies, inserted sequences, and associated phages and viruses) from short sequencing reads of environmental genomics (or metagenomics) projects. We observe considerable EGS differences between environments and link this with ecologic complexity as well as species composition (for instance, the presence of eukaryotes). For example, we estimate EGS in a complex, organism-dense farm soil sample at about 6.3 megabases (Mb) whereas that of the bacteria therein is only 4.7 Mb; for bacteria in a nutrient-poor, organism-sparse ocean surface water sample, EGS is as low as 1.6 Mb. The method also permits evaluation of completion status and assembly bias in single-genome sequencing projects.
Information on protein–protein interactions is still mostly limited to a small number of model organisms, and originates from a wide variety of experimental and computational techniques. The database and online resource STRING generalizes access to protein interaction data, by integrating known and predicted interactions from a variety of sources. The underlying infrastructure includes a consistent body of completely sequenced genomes and exhaustive orthology classifications, based on which interaction evidence is transferred between organisms. Although primarily developed for protein interaction analysis, the resource has also been successfully applied to comparative genomics, phylogenetics and network studies, which are all facilitated by programmatic access to the database backend and the availability of compact download files. As of release 7, STRING has almost doubled to 373 distinct organisms, and contains more than 1.5 million proteins for which associations have been pre-computed. Novel features include AJAX-based web-navigation, inclusion of additional resources such as BioGRID, and detailed protein domain annotation. STRING is available at
Mitochondria carry out specialized functions; they are compartmentalized, yet integrated into the metabolic and signaling processes of the cell. Although many mitochondrial proteins have been identified, understanding their functional interrelationships has been a challenge. Here we construct a comprehensive network of the mitochondrial system. We integrated genome-wide datasets to generate an accurate and inclusive mitochondrial parts list. Together with benchmarked measures of protein interactions, a network of mitochondria was constructed in their cellular context, including extra-mitochondrial proteins. This network also integrates data from different organisms to expand the known mitochondrial biology beyond the information in the existing databases. Our network brings together annotated and predicted functions into a single framework. This enabled, for the entire system, a survey of mutant phenotypes, gene regulation, evolution, and disease susceptibility. Furthermore, we experimentally validated the localization of several candidate proteins and derived novel functional contexts for hundreds of uncharacterized proteins. Our network thus advances the understanding of the mitochondrial system in yeast and identifies properties of genes underlying human mitochondrial disorders.
Mitochondria are organelles that are best known as the cell's energy powerhouses. They have a special evolutionary origin, derived from bacteria engulfed about 2 billion years ago by eukaryotes. Surprisingly, mitochondrial functions have been retained over evolution, so that unicellular yeast and multicellular organisms like humans share many of the same mitochondrial components. Here the authors complemented previous efforts to identify the “parts” of the mitochondrial system, but as for any system, this is not enough to understand how it works. By integrating information on protein localization, function, and interaction, the authors go a step further and propose a map of the mitochondrial organelle and its surroundings. This map suggests the involvement of hundreds of so far uncharacterized proteins in mitochondrial function. By taking advantage of the high conservation of the organelle between yeast and humans, the authors investigate properties of human genes involved in mitochondrial diseases. They find that the disease genes have an ancient origin and a mild mutant phenotype when their function is abolished in yeast. The approach applied here can be extended to other organelles or organisms and illustrates a growing trend towards understanding biological processes as a whole rather than as isolated parts.