Metagenomics is revolutionizing our understanding of microbial communities, showing that their structure and composition have profound effects on the ecosystem and in a variety of health and disease conditions. Despite the flourishing of new analysis methods, current approaches based on statistical comparisons between high-level taxonomic classes often fail to identify the microbial taxa that are differentially distributed between sets of samples, since in many cases the taxonomic schemas do not allow an adequate description of the structure of the microbiota. This constitutes a severe limitation to the use of metagenomic data in therapeutic and diagnostic applications. To provide a more robust statistical framework, we introduce a class of feature-weighting algorithms that discriminate the taxa responsible for the classification of metagenomic samples. The method unambiguously groups the relevant taxa into clades without relying on pre-defined taxonomic categories, thus also including in the analysis those sequences for which a taxonomic classification is difficult. The phylogenetic clades are weighted and ranked according to their abundance, measuring their contribution to the differentiation of the classes of samples, and a criterion is provided to define a reduced set of most relevant clades. Applying the method to public datasets, we show that the data-driven definition of relevant phylogenetic clades accomplished by our ranking strategy identifies features in the samples that are lost if phylogenetic relationships are not considered, improving our ability to mine metagenomic datasets. Comparison with supervised classification methods currently used in metagenomic data analysis highlights the advantages of using phylogenetic information.
In metagenomics, the composition of complex microbial communities is characterized using Next Generation Sequencing technologies. Thanks to the decreasing cost of sequencing, large amounts of data have been generated for environmental samples and for a variety of health-associated conditions. In parallel there has been a flourishing of statistical methods to analyze metagenomic datasets, concentrating mainly on the problem of assessing the existence of significant differences between microbial communities in different conditions. However, for a large number of therapeutic and diagnostic applications it would be essential to identify and rank the microbial taxa that are most relevant in these comparisons. Here we present PhyloRelief, a novel feature-ranking algorithm that fills this gap by integrating the phylogenetic relationships amongst the taxa into a statistical feature weighting procedure. Without relying on a precompiled taxonomy, PhyloRelief determines the lineages most relevant to the diversification of the samples guided by the data. As such, PhyloRelief can be applied both to cases in which sequences can be classified according to a known taxonomy, and to cases in which this is not feasible, a common occurrence in metagenomic data analysis given the increasing number of new and uncultivable taxa that are discovered using these technologies.
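To make the idea of statistical feature weighting concrete, the following is a minimal sketch of the classic Relief algorithm, the family of methods the name PhyloRelief alludes to. It is not the PhyloRelief implementation: it ignores phylogeny entirely, treats each column as an independent numeric feature (for example, the abundance of one taxon), and the function name `relief_weights` and its parameters are hypothetical choices made for illustration.

```python
import random


def relief_weights(samples, labels, n_iter=None, seed=0):
    """Basic two-class Relief weights for numeric features.

    samples: list of equal-length lists of floats (one row per sample).
    labels:  list of class labels; each class needs at least two samples.
    n_iter:  number of randomly drawn samples, or None to use every sample once.
    """
    rng = random.Random(seed)
    n, m = len(samples), len(samples[0])

    # Per-feature range, used to scale every difference into [0, 1].
    spans = []
    for j in range(m):
        column = [row[j] for row in samples]
        spans.append((max(column) - min(column)) or 1.0)

    def diff(a, b, j):
        return abs(a[j] - b[j]) / spans[j]

    def distance(a, b):
        return sum(diff(a, b, j) for j in range(m))

    def nearest(i, same_class):
        # Nearest other sample with the same class (hit) or the other class (miss).
        candidates = [
            k for k in range(n)
            if k != i and (labels[k] == labels[i]) == same_class
        ]
        return min(candidates, key=lambda k: distance(samples[i], samples[k]))

    if n_iter is None:
        picks = list(range(n))
    else:
        picks = [rng.randrange(n) for _ in range(n_iter)]

    weights = [0.0] * m
    for i in picks:
        hit, miss = nearest(i, True), nearest(i, False)
        for j in range(m):
            # Reward features that differ across classes, penalize those
            # that differ within a class.
            weights[j] += diff(samples[i], samples[miss], j)
            weights[j] -= diff(samples[i], samples[hit], j)
    return [w / len(picks) for w in weights]
```

With a toy input such as `relief_weights([[0.0, 0.5], [0.1, 0.1], [0.9, 0.4], [1.0, 0.2]], ["a", "a", "b", "b"])`, the first feature, which separates the two classes, receives a larger weight than the second. Ranking features by these weights is the general principle; how clades are built from the phylogeny and scored is specific to the published method and is not reproduced here.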
Second generation sequencing has permitted detailed sequence characterisation at the whole genome level of a growing number of non-model organisms, but the data produced have short read-lengths and biased genome coverage leading to fragmented genome assemblies. The PacBio RS long-read sequencing platform offers the promise of increased read length and unbiased genome coverage and thus the potential to produce genome sequence data of a finished quality containing fewer gaps and longer contigs. However, these advantages come at a much greater cost per nucleotide and with a perceived increase in error-rate. In this investigation, we evaluated the performance of the PacBio RS sequencing platform through the sequencing and de novo assembly of the Potentilla micrantha chloroplast genome.
Following error-correction, a total of 28,638 PacBio RS reads were recovered with a mean read length of 1,902 bp totalling 54,492,250 nucleotides and representing an average depth of coverage of 320× the chloroplast genome. The dataset covered the entire 154,959 bp of the chloroplast genome in a single contig (100% coverage) compared to seven contigs (90.59% coverage) recovered from the Illumina data, and revealed no bias in coverage of GC rich regions. Post-assembly the data were largely concordant with the Illumina data generated and allowed 187 ambiguities in the Illumina data to be resolved. The additional read length also permitted small differences in the two inverted repeat regions to be assigned unambiguously.
This is the first report to our knowledge of a chloroplast genome assembled de novo using PacBio sequence data. The PacBio RS data generated here were assembled into a single large contig spanning the P. micrantha chloroplast genome, with a higher degree of accuracy than an Illumina dataset generated at a much greater depth of coverage, due to longer read lengths and lower GC bias in the data. The results we present suggest PacBio data will be of immense utility for the development of genome sequence assemblies containing fewer unresolved gaps and ambiguities and a significantly smaller number of contigs than could be produced using short-read sequence data alone.
Third-generation sequencing; NGen; Genomics; Assembly; Annotation; Oxford nanopore; Pacific BioSciences; Roche 454
Summary: Pathway Processor 2.0 is a web application designed to analyze high-throughput datasets, including but not limited to microarray and next-generation sequencing, using a pathway-centric logic. In addition to well-established methods such as Fisher’s test and impact analysis, Pathway Processor 2.0 offers innovative methods that convert gene expression into pathway expression, leading to the identification of differentially regulated pathways in a dataset of choice.
Availability and implementation: Pathway Processor 2.0 is available as a web service at http://compbiotoolbox.fmach.it/pathwayProcessor/. Sample datasets to test the functionality can be used directly from the application.
Supplementary data are available at Bioinformatics online.
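As an illustration of the Fisher's test mentioned in the summary above, the sketch below computes a one-sided over-representation p-value for a single pathway from the hypergeometric distribution. It is a generic textbook formulation, not code from Pathway Processor 2.0; the function name `fisher_enrichment_p` and its argument names are assumptions made for this example.

```python
from math import comb


def fisher_enrichment_p(hits_in_pathway, pathway_size, total_hits, universe_size):
    """One-sided Fisher's exact test for pathway over-representation.

    Returns P(X >= hits_in_pathway), where X is the number of differentially
    expressed genes falling in the pathway when total_hits genes are drawn
    without replacement from a universe of universe_size genes, of which
    pathway_size belong to the pathway.
    """
    max_k = min(pathway_size, total_hits)
    denominator = comb(universe_size, total_hits)
    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, total_hits - k)
        for k in range(hits_in_pathway, max_k + 1)
    )
    return tail / denominator
```

For example, `fisher_enrichment_p(5, 5, 5, 10)` asks how likely it is that all 5 selected genes fall in a 5-gene pathway drawn from a 10-gene universe. In practice the p-values obtained for many pathways would also need a multiple-testing correction, which this sketch leaves out.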
For over a century microbiologists and immunologists have categorized microorganisms as pathogenic or non-pathogenic species or genera. This definition, clearly relevant at the strain and species level for most bacteria, where differences in virulence between strains of a particular species are well known, has never been probed at the strain level in fungal species. Here, we tested the immune reactivity and the pathogenic potential of a collection of strains from Aspergillus spp., a fungus that is generally considered pathogenic in immuno-compromised hosts. Our results show a wide strain-dependent variation of the immune response elicited, indicating that different isolates possess diverse virulence and infectivity. Thus, the definition of markers of inflammation or pathogenicity cannot be generalized. The profound understanding of the molecular mechanisms subtending the different immune responses will result solely from the comparative study of strains with extremely diverse properties.
Metagenomic approaches are increasingly recognized as a baseline for understanding the
ecology and evolution of microbial ecosystems. The development of methods for pathway
inference from metagenomics data is of paramount importance to link a phenotype to a
cascade of events stemming from a series of connected sets of genes or proteins.
Biochemical and regulatory pathways have until recently been thought of and modelled within
one cell type, one organism, one species. This vision is being dramatically changed by the
advent of whole microbiome sequencing studies, revealing the role of symbiotic microbial
populations in fundamental biochemical functions. The new landscape we face requires a
clear picture of the potentialities of existing tools and development of new tools to
characterize, reconstruct and model biochemical and regulatory pathways as the result of
integration of function in complex symbiotic interactions of ontologically and
evolutionarily distinct cell types.
metagenomics; next-generation sequencing; microbiome; pathway analysis; gene annotation
The type of adaptive immune response following host-fungi interaction is largely determined at the level of the antigen-presenting cells, and in particular by dendritic cells (DCs). The extent to which transcriptional regulatory events determine the decision making process in DCs is still an open question. By applying the highly structured DC-ATLAS pathways to analyze DC responses, we classified the various stimuli by revealing the modular nature of the different transcriptional programs governing the recognition of either pathogenic or commensal fungi. Through comparison of the network parts affected by DC stimulation with fungal cells and purified single agonists, we could determine the contribution of each receptor during the recognition process. We observed that initial recognition of a fungus creates a temporal window during which the simultaneous recruitment of cell surface receptors can intensify, complement and sustain the DC activation process. The breakdown of the response to whole live cells, through the purified components, showed how the response to invading fungi uses a set of specific modules. We find that at the start of fungal recognition, DCs rapidly initiate the activation process. Ligand recognition is further enhanced by over-expression of the receptor genes, with a significant correspondence between gene expression and protein levels and function. Then a marked decrease in the receptor levels follows, suggesting that at this moment the DC commits to a specific fate. Overall our pathway based studies show that the temporal window of the fungal recognition process depends on the availability of ligands and is different for pathogens and commensals. Modular analysis of receptor and signalling-adaptor expression changes, in the early phase of pathogen recognition, is a valuable tool for rapid and efficient dissection of the pathogen derived components that determine the phenotype of the DC and thereby the type of immune response initiated.
► Understanding the complexity of host–fungus interactions during commensalism. ► Genes mediating host colonization or fitness can evolve into infection-associated traits. ► Using bioinformatics to unravel functional genomics in dual-genome datasets. ► Modeling both fungal and host immune responses using network analysis tools. ► Databases and web-based resources for investigating host–pathogen interactions.
Modeling interactions between fungi and their hosts at the systems level requires a molecular understanding both of how the host orchestrates immune surveillance and tolerance, and how this activation, in turn, affects fungal adaptation and survival. The transition from the commensal to pathogenic state, and the co-evolution of fungal strains within their hosts, necessitates the molecular dissection of fungal traits responsible for these interactions. There has been a dramatic increase in publicly available genome-wide resources addressing fungal pathophysiology and host–fungal immunology. The integration of these existing data and emerging large-scale technologies addressing host–pathogen interactions requires novel tools to connect genome-wide data sets and theoretical approaches with experimental validation so as to identify inherent and emerging properties of host–pathogen relationships and to obtain a holistic view of infectious processes. If successful, a better understanding of the immune response in health and microbial diseases will eventually emerge and pave the way for improved therapies.
Trabectedin, a new antitumor compound originally derived from a marine tunicate, is clinically effective in soft tissue sarcoma. The drug has shown a high selectivity for myxoid liposarcoma, characterized by the translocation t(12;16)(q13;p11) leading to the expression of the FUS-CHOP fusion gene. Trabectedin appears to act by interfering with mechanisms of transcription regulation. In particular, the transactivating activity of FUS-CHOP was found to be impaired by trabectedin treatment. Even after a prolonged response, resistance occurs, and it is therefore important to elucidate the mechanisms of resistance to trabectedin. To this end we developed and characterized a myxoid liposarcoma cell line resistant to trabectedin (402-91/ET), obtained by exposing the parental 402-91 cell line to stepwise increases in drug concentration. The aim of this study was to compare the mRNA, miRNA and protein profiles of 402-91 and 402-91/ET cells through a systems biology approach. We identified 3,083 genes, 47 miRNAs and 336 proteins differentially expressed between 402-91 and 402-91/ET cell lines. Interestingly, three miRNAs among those differentially expressed, miR-130a, miR-21 and miR-7, harbored CHOP binding sites in their promoter region. We used computational approaches to integrate the three regulatory layers and to generate a molecular map describing the altered circuits in sensitive and resistant cell lines. By combining transcriptomic and proteomic data, we reconstructed two different networks, i.e. apoptosis and cell cycle regulation, that could play a key role in modulating trabectedin resistance. This approach highlights the central role of genes such as CCND1, RB1, E2F4, TNF, CDKN1C and ABL1 in both pre- and post-transcriptional regulatory networks. The validation of these results in in vivo models might be clinically relevant to stratify myxoid liposarcoma patients with different sensitivity to trabectedin treatment.
Gene set analysis is moving towards considering pathway topology as a crucial feature. Pathway elements are complex entities such as protein complexes, gene family members and chemical compounds. The conversion of pathway topology to a gene/protein network (where each node is a simple element such as a gene or protein) is a critical and challenging task that enables topology-based gene set analyses.
Unfortunately, currently available R/Bioconductor packages provide pathway networks only from single databases. They do not propagate signals through chemical compounds and do not differentiate between complexes and gene families.
Here we present graphite, a Bioconductor package addressing these issues. Pathway information from four different databases is interpreted following specific biologically-driven rules that allow the reconstruction of gene-gene networks taking into account protein complexes, gene families and sensibly removing chemical compounds from the final graphs. The resulting networks represent a uniform resource for pathway analyses. Indeed, graphite provides easy access to three recently proposed topological methods. The graphite package is available as part of the Bioconductor software suite.
graphite is an innovative package able to gather and make easily available the contents of the four major pathway databases. In the field of topological analysis graphite acts as a provider of biological information by reducing the pathway complexity considering the biological meaning of the pathway elements.
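The kind of conversion described in the graphite abstract above can be pictured with a small sketch. The rules used here are a deliberately simplified assumption, not graphite's actual rule set (graphite itself is an R/Bioconductor package): members of a complex are linked to each other, members of a gene family are not, an edge between two pathway nodes is expanded to all pairs of their member genes, and chemical compounds are dropped while the signal is propagated through them. The function `to_gene_network`, the node kinds and the edge labels are all hypothetical names.

```python
from itertools import combinations, product


def to_gene_network(nodes, edges):
    """Flatten a pathway graph into gene-gene edges.

    nodes: dict mapping node name -> (kind, members), where kind is one of
           'gene', 'complex', 'family' or 'compound' and members is a list of
           gene identifiers (empty for compounds, [name] for single genes).
    edges: list of (source, target) pairs between node names.
    Returns a set of (gene_a, gene_b, label) tuples.
    """
    successors = {}
    for source, target in edges:
        successors.setdefault(source, set()).add(target)

    def targets_through_compounds(start):
        # Follow edges from start, skipping over compound nodes so that
        # A -> compound -> B is reported as A -> B.
        found, seen = set(), set()
        stack = list(successors.get(start, ()))
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if nodes[node][0] == "compound":
                stack.extend(successors.get(node, ()))
            else:
                found.add(node)
        return found

    gene_edges = set()
    for name, (kind, members) in nodes.items():
        if kind == "compound":
            continue  # compounds never appear in the final graph
        if kind == "complex":
            # Subunits of one complex are assumed to bind each other.
            for a, b in combinations(sorted(members), 2):
                gene_edges.add((a, b, "binding"))
        for target in targets_through_compounds(name):
            for a, b in product(members, nodes[target][1]):
                if a != b:
                    gene_edges.add((a, b, "process"))
    return gene_edges
```

For a toy pathway in which gene `A` acts on a complex of `B` and `C`, which produces a compound that in turn acts on a family of `D` and `E`, the result links `B` and `C` directly, connects both of them to `D` and to `E`, and contains no compound node and no edge between `D` and `E`. Treating family members as alternatives rather than partners is the design choice that distinguishes them from complexes in this sketch.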
The quest for genes representing genetic relationships of strains or individuals within populations and their evolutionary history is acquiring a novel dimension of complexity with the advancement of next-generation sequencing (NGS) technologies. In fact, sequencing an entire genome uncovers genetic variation in coding and non-coding regions and offers the possibility of studying Saccharomyces cerevisiae populations at the strain level. Nevertheless, the disadvantageous cost-benefit ratio (the amount of detail disclosed by NGS against the time-consuming and expertise-demanding data assembly process) still precludes the application of these techniques to the routine assignment of yeast strains, making the selection of the most reliable molecular markers greatly desirable. In this work we propose an original computational approach to discover genes that can be used as descriptors of the population structure. We found 13 genes whose variability can be used to recapitulate the phylogeny obtained from genome-wide sequences. The same approach that we prove to be successful in yeasts can be generalized to any other population of individuals given the availability of high-quality genomic sequences and of a clear population structure to be targeted.
Biomedical research relies increasingly on large collections of data sets and knowledge whose generation, representation and analysis often require large collaborative and interdisciplinary efforts. This dimension of ‘big data’ research calls for the development of computational tools to manage such a vast amount of data, as well as tools that can improve communication and access to information from collaborating researchers and from the wider community. Whenever research projects have a defined temporal scope, an additional issue of data management arises, namely how the knowledge generated within the project can be made available beyond its boundaries and life-time. DC-THERA is a European ‘Network of Excellence’ (NoE) that spawned a very large collaborative and interdisciplinary research community, focusing on the development of novel immunotherapies derived from fundamental research in dendritic cell immunobiology. In this article we introduce the DC-THERA Directory, which is an information system designed to support knowledge management for this research community and beyond. We present how the use of metadata and Semantic Web technologies can effectively help to organize the knowledge generated by modern collaborative research, how these technologies can enable effective data management solutions during and beyond the project lifecycle, and how resources such as the DC-THERA Directory fit into the larger context of e-science.
semantic web; ontology; immunology; eScience; data integration
Invading C. albicans hyphae are recognized by macrophages, activate the caspase-1/IL-1β pathway, and lead to the activation of the IL-17 pathway to control the C. albicans infection.
In the mucosa, the immune pathways discriminating between colonizing and invasive Candida, thus inducing tolerance or inflammation, are poorly understood. Th17 responses induced by Candida albicans hyphae are central for the activation of mucosal antifungal immunity. An essential step for the discrimination between yeasts and hyphae and induction of Th17 responses is the activation of the inflammasome by C. albicans hyphae and the subsequent release of active IL-1β in macrophages. Inflammasome activation in macrophages results from differences in cell-wall architecture between yeasts and hyphae and is partly mediated by the dectin-1/Syk pathway. These results define the dectin-1/inflammasome pathway as the mechanism that enables the host immune system to mount a protective Th17 response and distinguish between colonization and tissue invasion by C. albicans.
Candida; colonization; invasion; IL-1β; IL-17
Motivation: Many models and analyses of signaling pathways have been proposed. However, none of them takes into account that a biological pathway is not a fixed system, but instead depends on the organism, tissue and cell type as well as on physiological, pathological and experimental conditions.
Results: The Biological Connection Markup Language (BCML) is a format to describe, annotate and visualize pathways. BCML is able to store multiple types of information, permitting a selective view of the pathway as it exists and/or behaves in specific organisms, tissues and cells. Furthermore, BCML can be automatically converted into data formats suitable for analysis and into a fully SBGN-compliant graphical representation, making it an important tool that can be used by both computational biologists and ‘wet lab’ scientists.
Availability and implementation: The XML schema and the BCML software suite are freely available under the LGPL for download at http://bcml.dc-atlas.net. They are implemented in Java and supported on MS Windows, Linux and OS X.
Contact: email@example.com; firstname.lastname@example.org
Supplementary information: Supplementary data are available at Bioinformatics online.
The advent of Systems Biology has been accompanied by the blooming of pathway databases. Currently pathways are defined generically with respect to the organ or cell type where a reaction takes place. The cell type specificity of the reactions is the foundation of immunological research, and capturing this specificity is of paramount importance when using pathway-based analyses to decipher complex immunological datasets. Here, we present DC-ATLAS, a novel and versatile resource for the interpretation of high-throughput data generated by perturbing the signaling network of dendritic cells (DCs).
Pathways are annotated using a novel data model, the Biological Connection Markup Language (BCML), an SBGN-compliant data format developed to store the large amount of information collected. The application of DC-ATLAS to pathway-based analysis of the transcriptional program of DCs stimulated with agonists of the toll-like receptor family allows an integrated description of the flow of information from the cellular sensors to the functional outcome, capturing the temporal series of activation events by grouping sets of reactions that occur at different time points in well-defined functional modules.
The initiative significantly improves our understanding of DC biology and regulatory networks. Developing a systems biology approach for the immune system holds the promise of translating knowledge on the immune system into more successful immunotherapy strategies.
The application of high-throughput genomic tools in nutrition research is a widespread practice. However, it is becoming increasingly clear that the outcome of individual expression studies is insufficient for the comprehensive understanding of such a complex field. Currently, the availability of large amounts of expression data in public repositories has opened up new challenges for microarray data analysis. We have focused on PPARα, a ligand-activated transcription factor functioning as a fatty acid sensor that controls the expression of a large set of genes in various metabolic organs such as liver, small intestine or heart. The function of PPARα is strictly connected to the function of its target genes and, although many of these have already been identified, major elements of its physiological function remain to be uncovered. To further investigate the function of PPARα, we have applied a cross-species meta-analysis approach to integrate sixteen microarray datasets studying high fat diet and PPARα signal perturbations in different organisms.
We identified 164 genes (MDEGs) that were consistently differentially expressed in response to a high fat diet or to perturbations in PPAR signalling. In particular, we found five genes in yeast which were highly conserved and homologous to PPARα targets in mammals, making them potential candidates to be used as models for the equivalent mammalian genes. Moreover, a screening of the MDEGs for all known transcription factor binding sites, and the comparison with a human genome-wide screening of Peroxisome Proliferator Response Elements (PPREs), enabled us to identify 20 new potential candidate genes that show both a binding site and a change in expression in the conditions studied. Lastly, we found a non-random localization of the differentially expressed genes in the genome.
The results presented are potentially of great interest for summarizing the currently available expression data, exploiting the power of in silico analysis filtered by evolutionary conservation. The analysis enabled us to indicate potential gene candidates that could fill in the gaps with regard to the signalling of PPARα and, moreover, the non-random localization of the differentially expressed genes in the genome suggests that epigenetic mechanisms are of importance in the regulation of the transcription operated by PPARα.
The widespread use of microarrays has generated large amounts of data, and interrogating public microarray repositories to identify similarities between microarray experiments is now one of the major challenges. Approaches using defined groups of genes, such as pathways and cellular networks (pathway analysis), have been proposed to improve the interpretation of microarray experiments. We propose a novel method to compare microarray experiments at the pathway level. The method consists of two steps: first, generate pathway signatures, a set of descriptors recapitulating the biologically meaningful pathways related to some clinical/biological variable of interest; second, use these signatures to interrogate microarray databases. We demonstrate that our approach provides more reliable results than gene-based approaches. While gene-based approaches tend to suffer from bias generated by the analytical procedures employed, our pathway-based method successfully groups together similar samples, independently of the experimental design. The results presented are potentially of great interest for improving the ability to query and compare experiments in public repositories of microarray data. Indeed, this method can be used to retrieve data from public microarray databases and perform comparisons at the pathway level.
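The two-step idea above (build a pathway signature, then use it to compare experiments) can be sketched as follows. The abstract does not say how signatures are computed, so the choices here are assumptions made purely for illustration: a pathway's descriptor is the mean log-ratio of its measured member genes, and two signatures are compared by Pearson correlation over the pathways they share. The function names `pathway_signature` and `signature_similarity` are hypothetical.

```python
from math import sqrt


def pathway_signature(expression, pathways):
    """Summarize gene-level values into one score per pathway.

    expression: dict mapping gene -> log-ratio for one experiment.
    pathways:   dict mapping pathway name -> list of member genes.
    Pathways with no measured member gene are left out of the signature.
    """
    signature = {}
    for name, genes in pathways.items():
        values = [expression[g] for g in genes if g in expression]
        if values:
            signature[name] = sum(values) / len(values)
    return signature


def signature_similarity(sig_a, sig_b):
    """Pearson correlation between two signatures over their shared pathways."""
    shared = sorted(set(sig_a) & set(sig_b))
    if len(shared) < 2:
        raise ValueError("need at least two shared pathways")
    xs = [sig_a[p] for p in shared]
    ys = [sig_b[p] for p in shared]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("signature has no variation across shared pathways")
    return cov / sqrt(var_x * var_y)
```

Two experiments whose individual genes differ but whose pathways move in the same direction obtain a high similarity under this scheme, which is one way to picture why a pathway-level comparison can be less sensitive to gene-level noise than a gene-by-gene comparison.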
Sinorhizobium meliloti is a soil bacterium that forms nitrogen-fixing nodules on the roots of leguminous plants such as alfalfa (Medicago sativa). This species occupies different ecological niches, being present as a free-living soil bacterium and as a symbiont of plant root nodules. The genome of the type strain Rm 1021 contains one chromosome and two megaplasmids for a total genome size of 6 Mb. We applied comparative genomic hybridisation (CGH) on oligonucleotide microarrays to estimate genetic variation at the genomic level in four natural strains, two isolated from Italian agricultural soil and two from desert soil in the Aral Sea region.
From 4.6 to 5.7 percent of the genes showed a pattern of hybridisation concordant with deletion, nucleotide divergence or ORF duplication when compared to the type strain Rm 1021. A large number of these polymorphisms were confirmed by sequencing and Southern blot. A statistically significant fraction of these variable genes was found on the pSymA megaplasmid and grouped in clusters. These variable genes were found to be mainly transposases or genes with unknown function.
The results allow us to conclude that the symbiosis-required megaplasmid pSymA can be considered the major hot-spot for intra-specific differentiation in S. meliloti.