With the expanding availability of sequencing technologies, research previously centered on the human genome can now afford to include the study of humans’ internal ecosystem (the human microbiome). Given the scale of the data involved in this metagenomic research (two orders of magnitude larger than the human genome) and their importance to human health, it is crucial to provide, along with appropriate data collection and taxonomy, proper tools for data analysis. We propose to adapt the approaches defined for the analysis of gene-expression microarrays in order to infer information in metagenomics. In particular, we applied SAM, a broadly used tool for the identification of genes differentially expressed among sample classes, to a reported dataset from a research model with mice of two genotypes (a high density lipoprotein knockout mouse and its wild-type counterpart). The mice were fed one of two diets (high-fat or normal-chow), with the high-fat diet ensuring the onset of obesity, a prodrome of metabolic syndromes (MS). By using the 16S rRNA gene as a genomic diversity marker, we illustrate how this approach can identify bacterial populations differentially enriched among different genetic and dietary conditions of the host. This approach faithfully reproduces highly relevant results from phylogenetic and standard statistical analyses, used to explain the role of the gut microbiome in relation to obesity. This represents a promising proof-of-principle for using functional genomic approaches in the fast-growing area of metagenomics, and makes a large body of thoroughly tested and theoretically sound methodologies available to this exciting new field.
The microorganism community in the human gastrointestinal (GI) tract contains more than 1000 species whose combined genomes may have 100 times more genes than the human genome. In this perspective, the gut microbiota can be viewed as an organ that regulates its host’s metabolic and immune systems [1, 2]. GI microorganisms are in fact known to contribute to diverse human processes, such as preventing colonization and attack by pathogens, regulating the immune system through a number of signal molecules and metabolites, aiding the development of intestinal microvilli, breaking down non-digestible polysaccharides, and ensuring anaerobic metabolism of peptides and proteins, which results in recovery of metabolic energy for the host [3, 4]. Although definitive proof remains to be provided, growing evidence indicates that GI microbiota play a crucial role in the progress of human diseases, in particular metabolic syndromes (MS) such as obesity, diabetes and hypertension [5–7].
To enhance the understanding of the mechanisms of MS development, early research devoted considerable effort to the study of host genomic variations. Nowadays, given the indication that the GI microbiota is involved to some extent in such diseases, more attention is being paid to exploring the disruption of gut microbiota. This metagenomic approach to diseases challenges researchers in many ways, from a shift in paradigm that questions the predominance of the genomic etiology of diseases, to more practical issues related to the complexity and vastness of GI microbiota data.
The actual connection between variations in the GI microbiota and the onset of obesity and other more complex MS is still heavily debated among scientists, and is not the object of this paper. In this study, we concentrate on the identification of methodologies able to highlight relationships between variations in the GI microbiota composition and obesity (a well-known consequence of fat feeding, related to MS [8–12]), as measured in terms of impaired glucose tolerance and fat mass development, making use of tools that are both well developed and validated in other areas of research.
Based on the observation that complexity and vastness of the GI microbiota data are traits shared with high-throughput transcriptional data, which are the object of study of functional genomics, we sought to adapt some of the well-tested and widely used tools for gene expression analysis of DNA microarray data to metagenomic studies. Figure 1 indicates schematically how the two types of data can be considered in this perspective. To clarify further, we briefly summarize the main areas of research of functional genomics. First, in functional genomics, we are interested in mining genes significantly related to a biological query of interest (e.g. genes differentially expressed in healthy versus diseased patients): this is achieved with methodologies broadly classified as supervised and unsupervised. Such approaches are able to group genes based on their mutual similarity (for a review see ref. ), or in terms of their resemblance to some external trait (e.g. significance analysis of microarrays, SAM, and gene-set enrichment analysis, GSEA), based on the over/under expression of genes across samples. Second, in functional genomics we are also interested in the identification of interactions among selected genes (gene-network inference approaches, for reviews see ref. [17, 18]). Third, through the use of statistical methods, functional genomics concentrates on the inference of the functionality of such selected genes based on previous knowledge (the definition of a controlled vocabulary of terms describing gene functionalities, Gene Ontology). We will show that, interestingly, some of these problems and their solutions can be advantageously adapted to the investigation of the role and activity of GI microorganisms.
In particular, we adapt and apply these approaches to data from a recent work , which investigated 10 knockout mice (K) serving as a genetic model of insulin resistance (the knockout leads to impaired glucose tolerance, IGT), and their 10 wild-type counterparts (W), both on normal-chow (N) and high-fat diet (F). The aim of that work was to characterize the relative contributions of the host’s genetics and diet-disrupted gut microbiota in relation to obesity. Gut microbiota samples were harvested from fecal matter, high-throughput sequence data of the 16S rRNA gene were obtained from barcoded 454 pyrosequencing, and the original sequences were merged into 516 operational taxonomic units (OTUs), based on phylogenetic distance, from which a final set of 65 OTUs was identified as relevant. Overall, that work led to the conclusion that diet plays a larger role than host genotype in the onset of obesity. According to these results, because diet is more effective in causing variations of the GI microbiota composition, it is statistically more relevant than host genotype in explaining obesity and impaired glucose tolerance in mice.
The aim of the current work is to corroborate the results in ref.  with an independent and systematic approach able to make comparisons between groups, and between combinations of groups (from genotype and diet) and to assess their influence on the variation in the GI composition. In particular, further interpretation of the aforementioned final 65 OTUs and their subgroups represents our gold standard (GS) for comparisons. Given the encouraging results, we believe that this work can represent an interesting proof-of-principle of the possibility to adapt functional genomic approaches to metagenomics.
We processed the data with several instances of SAM , a broadly used tool to identify genes with statistically significant changes in expression across categorized samples. Briefly, samples are explicitly classified according to some criterion (here diet, genotype or phenotype) and a generalization of the t-test is applied to each OTU abundance to verify if the average behavior in one class is statistically significantly different from that in any other class. Before moving further it is worth devoting some words to explain the meaning of the word ‘abundance’ when adapted to metagenomics. In the following, we will use ‘gene/OTU abundance’ interchangeably, and in metagenomics this indicates phylotype abundance as defined by the assortment of 16S rRNA sequences.
In SAM, a parameter named Δ (delta) is central in the setting up of the analysis, as it indicates the minimum average difference that is considered relevant to identify genes/OTUs defined as differentially expressed/abundant. Statistical significance is defined after generation of a distribution of such distances obtained with random permutation of the labels of the samples. Statistical significance is corrected for multiple hypotheses testing using false discovery rate (FDR) based on random permutations. Throughout the analysis we used FDR = 0.2 as the threshold for significance.
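The permutation scheme described above can be illustrated with a minimal Python sketch. It is a simplification and not the SAM implementation: it thresholds the absolute value of a SAM-style relative difference directly, whereas SAM applies Δ to the gap between observed and permutation-expected ordered statistics. The function names, the fudge constant `s0 = 0.1` and the toy abundance table are assumptions introduced only for illustration.

```python
import random
from math import sqrt
from statistics import mean, median, variance

def d_stat(a, b, s0=0.1):
    """SAM-style relative difference between two groups of abundances.
    s0 is a small positive constant (assumed value) that stabilizes the ratio."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    se = sqrt(pooled * (1 / na + 1 / nb))
    return (mean(a) - mean(b)) / (se + s0)

def all_d(table, labels, s0=0.1):
    """d statistic for every OTU; table[otu] lists per-sample abundances."""
    out = {}
    for otu, values in table.items():
        a = [v for v, lab in zip(values, labels) if lab == 0]
        b = [v for v, lab in zip(values, labels) if lab == 1]
        out[otu] = d_stat(a, b, s0)
    return out

def permutation_fdr(table, labels, threshold, n_perm=200, seed=0):
    """Call OTUs with |d| >= threshold and estimate the FDR by label permutation."""
    rng = random.Random(seed)
    observed = all_d(table, labels)
    called = [otu for otu, d in observed.items() if abs(d) >= threshold]
    false_counts = []
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any real link between sample and class
        null = all_d(table, shuffled)
        false_counts.append(sum(abs(d) >= threshold for d in null.values()))
    # Median number of calls under the null, relative to the observed calls.
    fdr = min(1.0, median(false_counts) / len(called)) if called else 0.0
    return called, fdr

# Toy data (hypothetical): 5 samples of class 0 followed by 5 of class 1.
toy_labels = [0] * 5 + [1] * 5
toy_table = {
    "otu_up": [10, 11, 12, 10, 11, 1, 2, 1, 2, 1],
    "otu_flat": [5, 6, 5, 6, 5, 6, 5, 6, 5, 6],
}
called, fdr = permutation_fdr(toy_table, toy_labels, threshold=3.0)
print(called, fdr)
```

In an analysis like the one described here, the threshold would be tuned until the estimated FDR falls at or below the chosen level (0.2 in this study).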
Subsequently, to compare the GS with the results of the proposed method, we used the tool FIT to perform enrichment (η) analysis, a common method adopted to characterize newly found gene sets in terms of previous knowledge, and broadly used to assess the function associated with a set of genes, based on the information of Gene Ontology. Enrichment is defined as the ratio of the relative occurrence of a given category observed in the newly found set (test set) to the relative frequency expected in the whole population.
In the first analysis, each genotype-diet group (WF, WN, KF, KN: five samples each) was considered as a different class, resulting in the selection of 66 OTUs (Table 1), which overlap with the 65 OTUs in the GS with specificity = 0.97 and sensitivity = 0.80 (Figure 2).
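The enrichment ratio and the sensitivity/specificity comparison with the GS can be sketched as follows. This is a minimal illustration of the definitions given in the text, not the FIT tool itself; the function names and the toy OTU identifiers are hypothetical.

```python
def enrichment(test_set, category, population):
    """Frequency of the category in the test set divided by its
    frequency in the whole population."""
    test_set, category, population = set(test_set), set(category), set(population)
    observed = len(test_set & category) / len(test_set)
    expected = len(category & population) / len(population)
    return observed / expected

def sensitivity_specificity(selected, gold, population):
    """Compare a selected OTU set with a gold-standard set."""
    selected, gold, population = set(selected), set(gold), set(population)
    tp = len(selected & gold)               # selected and in the gold standard
    fn = len(gold - selected)               # gold-standard OTUs that were missed
    fp = len(selected - gold)               # selected but not in the gold standard
    tn = len(population - selected - gold)  # correctly left out
    return tp / (tp + fn), tn / (tn + fp)

# Toy data (hypothetical): 20 OTUs, 5 in the gold standard, 5 selected.
population = {f"otu{i}" for i in range(20)}
gold = {f"otu{i}" for i in range(5)}
selected = {"otu0", "otu1", "otu2", "otu3", "otu10"}

sens, spec = sensitivity_specificity(selected, gold, population)
print(sens, spec, enrichment(selected, gold, population))
```

An enrichment above 1 means the category is over-represented in the test set relative to the population; assessing whether that excess is statistically significant requires a separate test, which this sketch does not include.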
Based on the 66 OTUs selected above, a second analysis was performed to identify the sub-populations that behave significantly differently between any two mice groups [four basic classes: KF, KN, WF, WN; three super-classes: diet, genotype, phenotype (see Table 2)]. In detail, this consists of six dichotomous comparisons separating any two of the four mice groups ($C_4^2 = 6$ combinations in total), and three dichotomous comparisons separating the three super-classes: diet (F2N), genotype (W2K) and phenotype, healthy versus sick (H2S), as defined by body weight and glucose tolerance (for more details see Supplementary Data). Significant discrimination is found in all of the comparisons related to diet, indicating that, whichever approach is used, the OTUs found to be associated with diet are more stable. In particular, a large number of OTUs are found to discriminate high-fat diet mice groups from normal-chow groups, especially when the wild-type mice are considered (59 OTUs), indicating the prevalence of the effect of diet on the variations of the GI microbiome composition. These results are fully consistent with the GS, where the effects of diet are also more striking on wild-type mice than on knockout mice. Limited effects appear to be due to genotype: at most 10 OTUs were found able to distinguish the gene-knockout from the wild-type mice on the high-fat diet. In our analysis, no OTU was found able to separate the gene-knockout mice from the wild-type mice independent of diet. In the GS this same group was extremely small (two OTUs), perhaps indicating a mild discriminant power.
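The nine dichotomous comparisons listed above (six pairwise, three super-class) can be enumerated with a short Python sketch. Two things here are assumptions rather than statements from the source: the pairwise names such as `KF2KN`, which simply extend the F2N/W2K/H2S naming, and the phenotype split, which treats only WN as healthy (the actual definition is given in the Supplementary Data).

```python
from itertools import combinations

# Group labels: first letter is genotype (K/W), second is diet (F/N).
GROUPS = ("KF", "KN", "WF", "WN")

# ASSUMPTION: only wild-type mice on normal chow are treated as healthy.
HEALTHY = {"WN"}

def split(groups, key):
    """Partition group labels into classes according to a key function."""
    classes = {}
    for g in groups:
        classes.setdefault(key(g), []).append(g)
    return classes

def dichotomous_comparisons(groups=GROUPS):
    """Return {name: (class A groups, class B groups)} for all nine comparisons."""
    comparisons = {}
    # Six pairwise comparisons between the four basic classes.
    for a, b in combinations(groups, 2):
        comparisons[f"{a}2{b}"] = ([a], [b])  # hypothetical naming
    # Three super-class comparisons.
    diet = split(groups, lambda g: g[1])
    comparisons["F2N"] = (diet["F"], diet["N"])
    genotype = split(groups, lambda g: g[0])
    comparisons["W2K"] = (genotype["W"], genotype["K"])
    phenotype = split(groups, lambda g: "H" if g in HEALTHY else "S")
    comparisons["H2S"] = (phenotype["H"], phenotype["S"])
    return comparisons

for name, (class_a, class_b) in dichotomous_comparisons().items():
    print(name, class_a, "vs", class_b)
```

Each entry defines the two sample classes that a two-class test would then be run on.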
In order to reproduce the complex findings of the GS, a systematic analysis to identify the OTUs that were specifically abundant in any one, any two, or any three of the four groups, was performed with SAM . However, only six combinations were found to be non-null (see Table 3 and Supplementary Data).
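The search space of this systematic analysis can be made explicit with a small sketch: choosing any one, two or three of the four groups as the 'abundant' side gives 4 + 6 + 4 = 14 candidate patterns, of which the text reports six as non-null. The function name and the pattern representation are assumptions for illustration; deciding which patterns are non-null is the role of the statistical test and is not reproduced here.

```python
from itertools import combinations

GROUPS = ("KF", "KN", "WF", "WN")

def candidate_patterns(groups=GROUPS):
    """All ways to pick one, two or three groups as the 'abundant' side;
    the remaining groups form the contrasting side."""
    patterns = []
    for size in (1, 2, 3):
        for high in combinations(groups, size):
            low = tuple(g for g in groups if g not in high)
            patterns.append((high, low))
    return patterns

patterns = candidate_patterns()
for high, low in patterns:
    print("abundant in", "+".join(high), "versus", "+".join(low))
```

For instance, the pattern KN+WF+WN versus KF corresponds to the group labeled KNWFWN in the text.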
The GS also describes six bacteria groups that respond differentially to variations in diet, host genotype and health phenotype. These groups comprise bacteria increased in high-fat diet (HFD), increased in normal-chow diet (NC), reduced in the mice with IGT (N-IGT), abundant in the mice with IGT (IGT), showing a genotype-dependent reaction to diet (MutantHF), or increased in Apoa-I−/− mice (Mutant). According to the notation used in the tool FIT for enrichment computation, and here preserved, these groups represent our GS, also called the ‘reference’. SAM identified bacteria groups N and H that agree very well with the reference, as they are statistically significantly enriched in NC and N-IGT, respectively. Two bacteria groups, WF and F, are significantly enriched in the reference group HFD, with F ‘fitting’ HFD better than WF does. This appears to be reasonable since both HFD and F are used to represent ‘fat-loving’ bacteria. Disagreement comes from KF, WF and KNWFWN, as they are not significantly enriched in any reference. Furthermore, we could not find any bacteria group either having a genotype-dependent response to diet (MutantHF) or increased in Apoa-I−/− mice (Mutant). Results are shown in Table 4.
Furthermore, in ref. , four lineages (M1–M4) in the Erysipelotrichaceae family were found to be specifically interesting in the further discrimination of diet and genotype. In particular, M1 represents the bacteria that are only abundant in healthy mice (H, WN). M2 and M4 are both ‘fat-loving’ bacteria, while M3 is specifically abundant in normal-chow mice groups. Among the six bacteria groups identified with our method, four are distributed in the Erysipelotrichaceae family, i.e. KF, H, F and N. Since the bacteria in F fall into two different classes in terms of the phylogenetic tree, this group is sub-divided into two groups, labeled F1 and F2. Table 5 shows that four clusters, associated with healthy state (H), fat diet (F1/F2) and normal-chow (N), were significantly enriched for M1, M2/M4 and M3, respectively. This confirms the association of M1 with the healthy type, of M2/M4 as fat-loving, and of M3 as abundant in normal-chow mice. In agreement with the GS, we also found two members of the Bifidobacteriaceae family that were only abundant in normal-chow diet. Bifidobacteria have long been found to protect gut barrier integrity; the significant enrichment of this group of bacteria in the normal-chow but not in the high-fat diet group may help explain why the high-fat diet mice developed obesity. We note that our strategy did not identify sulfate-reducing bacteria as an important variable, unlike what was previously shown in ref. .
The study of MS based on variations of the gut microbiota composition is a significant challenge in medical research, and yet it provides potentially important biomarkers for diagnosis and treatment. The study here focuses on the onset of obesity and identifies an approach to the analysis of metagenomic data based on the adaptation of the rich literature and theory developed in recent years in the functional genomics field. The approach is not only able to reproduce the findings obtained in previous research (showing the robustness of the method), but the results also strengthen the association that is emerging between variations of the GI microbiota composition and obesity. These associations include the prevalence of dietary over genomic causes, as well as the identification of specific subgroups, with good performance and statistical significance. Although these findings do not per se support a direct relationship or causality between specific groups and the onset of obesity, a statistically significant correlation, confirmed by two different approaches, represents an important result to trigger research into the underlying biological causal connections. The possibility of applying well-developed and validated methods and tools from the functional genomics field to the rapidly growing area of human metagenomics opens a vast space of possibilities. This work represents an encouraging proof-of-principle for this hypothesis, for which further validation is warranted.
This work is funded by the Sino-Swiss Science and Technology Cooperation Project (Grant no. GJHZ0911).
The authors would like to thank R. Han and Prof. Y. Chen from the Key Laboratory of Nutrition and Metabolism, Institute for Nutritional Sciences, SIBS, CAS, Shanghai, for the remarkable experimental work on which this analysis is based. The authors would also like to acknowledge the anonymous reviewers who helped improve and clarify the article, including the recommendation on Figure 1.
Liu Yuanhua is a member of research staff in the Clinical Genomic Group at the Max Planck—Chinese Academy of Sciences Partner Institute for Computational Biology (MPG-CAS PICB), Shanghai.
Chenhong Zhang is a PhD student at the School of Life Sciences and Biotechnology at Jiao Tong University, Shanghai.
Liping Zhao is Professor of Microbiology, Associate Dean of the School of Life Sciences and Biotechnology, and Associate Director of Shanghai Center for Systems Biomedicine at Jiao Tong University, Shanghai.
Christine Nardini is Principal Investigator at MPG-CAS PICB.