|Home | About | Journals | Submit | Contact Us | Français|
While large-scale efforts have rapidly advanced the understanding and practical impact of human genomic variation, the latter is largely unexplored in the human microbiome. We therefore developed a framework for metagenomic variation analysis and applied it to 252 fecal metagenomes of 207 individuals from Europe and North America. Using 7.4 billion reads aligned to 101 reference species, we detected 10.3 million single nucleotide polymorphisms (SNPs), 107,991 short indels, and 1,051 structural variants. The average ratio of non-synonymous to synonymous polymorphism rates of 0.11 was more variable between gut microbial species than across human hosts. Subjects sampled at varying time intervals exhibited individuality and temporal stability of SNP variation patterns, despite considerable composition changes of their gut microbiota. This implies that individual-specific strains are not easily replaced and that an individual might have a unique metagenomic genotype, which may be exploitable for personalized diet or drug intake.
With the increasing availability of individual human genomes, various theoretical and practical aspects of genomic variation can be deduced for individuals and the human population as a whole1,2. Like sequenced human genomes, the number of human gut metagenomes (currently mostly derived from Illumina shotgun sequencing of stool samples) is increasing exponentially. Given the importance of the gut microbiota in human health3,4 and a growing number of studies reporting associations between gut microbiota and diseases5-8, an understanding of genomic variation in gut microbial populations will likely trigger applications towards human well-being and disease.
For example, in the common gut commensal Escherichia coli, just three point mutations in two genes can confer clinically relevant antibiotic resistance5, and natural variation in a single gene can lead to pathogenic adaptation8. Even within pathogenic species in the gut, closely related coexisting strains can exhibit different pathogenic potentials due to minor genomic variation7. These examples illustrate how genomic variation within gut microbes could confer phenotypes that require personalized care or treatment of the host.
Studies based on 16S ribosomal RNA gene surveys or whole metagenome shotgun sequencing characterized taxonomic and functional compositions of healthy individuals’ and patients’ gut microbiota at the genus or species level6,9-12. Variation in taxonomic abundance as well as functions encoded by these gut microbiota have been described between individuals6,11 and used to stratify individuals according to their gut community compositions into enterotypes13. However, genomic variation within species, which leads to their phenotypic diversity and adaptations to different environments, has only been studied in a few taxa, such as Citrobacter spp7.
An early landmark study on a small dataset described metagenomic variation in an acidic biofilm microbiome of low complexity14. The population structure for one species in that habitat was studied and positive selection was observed in some genes15. Another recent study resolved multiple clinical isolates of methicillin-resistant Staphylococcus aureus and delineated its epidemiology and microevolution based on genomic variation16. With the availability of hundreds of deeply sequenced human gut metagenomes9,11,17, sufficient data are becoming available for quantitative analyses of the genetic structure of complex microbial communities, allowing the study of many species at the same time.
Here, we analyzed 1.56 terabases of sequence data from 252 stool samples from 207 individuals (Supplementary Table 1; Supplementary Notes) obtained from the MetaHIT project (71 Danish, 39 Spanish; all sampled once11), the NIH Human Microbiome Project (94 US-American; 51 individuals sampled once, 41 sampled twice, and two sampled three times9), and Washington University (three US-American samples; all sampled once12). Our goals were to (i) develop a framework for genomic variation analysis using metagenomic shotgun data, (ii) gather basic knowledge on the genomic variation landscape in gut metagenomes, and (iii) gain insights into the individuality, temporal stability, and biogeography of metagenomic variation.
We used 1,497 prokaryotic genomes to generate a set of reference genomes (Supplementary Table 2) for the analysis of genomic variation in gut microbial species in 252 samples (on average 6.2 ± 4.1 Gbp were analyzed). Pairwise comparisons of 40 universal marker genes18,19 identified in these genomes were performed to create a set of 929 clusters based on a 95% DNA identity threshold recommended for identifying species20. The genome recruiting the highest number of reads in a cluster was selected as reference for that species (see Methods and Supplementary Information).
Using the same 95% identity threshold, we mapped 7.4 billion metagenomic reads (42% of the total, 91% thereof uniquely) with an average length of 80 bp to the 929 reference genomes (Supplementary Tables 1 and 3). To avoid mapping artifacts (for example caused by high coverage of prophages), we required ≥40% of each reference genome to be covered by reads (corresponding to the gene content similarity between two strains of E. coli21). The resulting 101 prevalent species with base pair coverage from 12x to 32,400x (Fig. 1 and Supplementary Fig. 1) were subjected to genomic variation analysis.
To enable comparative analyses in multiple metagenomes and to identify low frequency variants not detected when analyzing them individually, we used multi-sample calling22 to identify single nucleotide polymorphisms (SNPs), short indels (1 - 50 bp) and structural variations (SVs, >50 bp) in each genome, although SVs were largely underestimated due to small insert sizes (Supplementary Information). We only called variants with allele frequency ≥1% (the conventional definition of polymorphism2) and supported by ≥4 reads. False positive rates were estimated at 0.71% for SNPs and 1.04% for SVs (Supplementary Information, Supplementary Tables 4 and 5, Supplementary Fig. 2).
We identified 10.3 million SNPs in 101 genomes (3.1% of the total 329 Mb positions) across 252 samples from 207 subjects, almost as many as the 14.4 million SNPs recently identified in 179 human genomes2. Within an individual the rate was lower (on average 1.21%, see Supplementary Table 6), yet SNPs/kb increased with base pair coverage when samples were pooled (Supplementary Fig. 3). We also identified 107,991 indels and 1,051 SVs in these 101 species (Supplementary Information). Their relative ratios to SNPs (10,485 short indels and 102 SVs per million SNPs) were robust across species and individuals (Supplementary Fig. 4). Subsequent analyses were restricted to SNPs due to their orders of magnitude higher count over other variation types.
We annotated the genes of the prevalent genomes using orthologous groups (OGs) from eggNOG23 (Supplementary Information) and found that the OGs with the highest SNP density were enriched in functions related to conjugal transfer of antibiotic resistance (Supplementary Table 7). For example, the OG with the highest average SNP density across samples was the Clindamycin resistance transfer factor BtgA (NOG119724), required for conjugative transmission of plasmids. Mutations commonly accrued from the process of conjugation may account for increased diversity among conjugation-associated functions24. Additionally, CRISPR-associated proteins, responsible for conferring resistance in bacteria, were also found among the OGs with high SNP densities (Supplementary Information and Supplementary Table 7).
The large number of SNPs provided the opportunity to compare, for the first time at such scale, the evolution of different coexisting species across a large cohort of individuals. To evaluate selective constraints in these species in their natural habitat, we estimated the ratio of non-synonymous to synonymous polymorphism rates25,26 (pN/pS) within each species in every sample (Fig. 1; Supplementary Information). pN/pS characterizes selective constraint at the level of a population contrary to the more commonly used dN/dS that characterizes it between individual species26. To validate pN/pS ratios, we estimated genetic variation using the sample size-independent nucleotide diversity π, and found that π is highly correlated with SNPs/kb (Fig. 1 and Supplementary Fig. 5). The derived measures of π(N)/π(S) and π(non-degenerate sites)/π(four-fold degenerate sites), the latter of which is less dependent on specific properties of mutation spectra such as transition and transversion ratios, were coherent with pN/pS (Supplementary Fig. 6).
The pN/pS ratio of a genome within a sample remained stable at coverages higher than 10x (Supplementary Fig. 7) – yet another indication of few false SNP calls – and was on average 0.11, but varied considerably for different species (0.04 to 0.58) in accordance with dN/dS ratios estimated independently in a number of interspecies comparisons between closely related bacteria and archaea27,28.
Since meaningful comparison of genomic variation requires both breadth (across samples) and depth (in number of base pairs) of sequencing, we focused on the 66 most dominant species that attracted >99% of the reads (Fig. 1). Their relatively low pN/pS ratios were constant across different hosts (Fig. 2a and Supplementary Table 8), which may indicate similar selective constraints across individuals. Thus, the evolution of gut species is most likely dominated by long-term purifying selection and drift rather than rapid adaptations to specific host environments. The wide range of these ratios across species may suggest that different gut species face different evolutionary constraints.
To investigate how different gut species respond to the pressure from the gut environment, we compared the pN/pS ratios of individual genes in Roseburia intestinalis and Eubacterium eligens, which differed considerably in their overall mean pN/pS ratios (0.236 vs. 0.131 from 106 and 147 samples, respectively) despite having comparable average base pair coverages (Supplementary Information). While 75% of the genes in R. intestinalis had systematically higher pN/pS ratios compared to their orthologs in E. eligens, few others showed considerable deviations (Fig. 2c; Supplementary Table 9) suggesting differing evolutionary constraints for these genes. For example, galK, the gene encoding galactokinase, an essential enzyme in the Leloir pathway for galactose metabolism in most organisms29, was among the lowest in terms of pN/pS ratio in R. intestinalis, but among the highest in E. eligens (0.03 and 0.48, respectively; Fig 2b,c). Although present in E. eligens, this gene may not exert its main function (see also30), since E. eligens cannot ferment galactose, nor the galactose-containing disaccharides lactose and melibiose31. On the other hand, R. intestinalis is known to ferment melibiose32 implying that its galK gene is functional (Supplementary Information). Thus, the same gene can be under tight negative selection in one species, but under more relaxed negative selection in another.
Our framework allowed us additionally to obtain information on all genes in each sample (Supplementary Information). As expected, we found that housekeeping genes had usually lower pN/pS ratios. For example, the DNA-dependent RNA polymerase beta subunit gene was consistently among the genes exhibiting the lowest pN/pS ratios across samples and species (Supplementary Table 10). Less obvious examples included genes related to type IV secretion systems used to transfer DNA between microbes33 and involved in host interactions of both pathogenic33 and commensal bacteria34, specifically in anti-inflammatory responses and immune modulation35. Their low pN/pS ratio suggests that maintaining genome plasticity and antibiotic resistance through conjugative transposition is essential in the constantly changing environment of the gut and that the interaction with the host immune system is under purifying selection (Supplementary Table 10). We also found a few conserved unknown, but apparently gut microbe-specific proteins, which exhibited low pN/pS ratios, suggesting that they perform important, yet hitherto unexplored functions (Supplementary Table 11).
Among the genes or OGs with consistently the highest pN/pS ratios were many transposases and antimicrobial resistance genes including the gut-specific gene bile salt hydrolase (BSH)36 (Supplementary Table 10). Conjugated bile acids (CBA) secreted by the hosts repress microbial growth and up-regulate the host mucosal defense system. BSHs are involved in the initial reaction in the metabolism of CBAs by gut microbes36. Their high pN/pS ratio may be indicative for the genomic plasticity necessary to metabolize and survive the variety of different bile acids present in the gut37.
Several studies on adult human gut microbial samples from a few individuals have suggested that within-individual differences over time are smaller compared to between-individual differences in microbial species composition and abundance38-40. Within a larger cohort, individuality of host-associated microbiota has been reported based on 16S rRNA gene profiling of fingertip-associated communities41, while other studies on a few samples have investigated the persistence of specific strains over time42,43. However, intra-species variation at nucleotide resolution at whole genome level and accompanying changes in species abundances within the human gut over long time periods (>1 year) have not been studied yet in large cohorts. It is unclear if the concept of resident strains is common to other prevalent species, if host-specific strains are retained over time, and how fast they evolve inside the gut environment.
To explore these questions, we used 88 gut metagenomes from 43 healthy US-American subjects (a subset of our cohort) from whom at least two samples were obtained at different time-points with no antibiotics treatment in between (Supplementary Table 12). To measure how similar the subpopulations (strains) of the dominant species were between two samples, we estimated the fixation indices (FST) between the populations (Supplementary Information). Since this measure depends on allele frequencies, which cannot be determined accurately at low base-pair coverage, we also estimated the fraction of alleles shared between the samples out of all polymorphic sites (only 49 genomes that accrued 40% genome coverage in at least two samples were used and genomes with >10x base pair coverage were downsampled to 10x; Supplementary Information and Supplementary Fig. 3). Since the fraction of shared SNPs depends on the number of variable sites, we developed a heuristic allele sharing similarity score that takes into account both (Supplementary Information).
When we compared all 252 samples, FST was significantly lower and allele sharing significantly stronger between different samples from the same individuals than between samples from different individuals (Mann-Whitney: P<0.001 for both; see Figs. 3a and 3b and Supplementary Information). The same trend was observed, albeit much weaker, based on species compositions (Fig. 3c), in line with previous observations from microbial composition-based results38-40. Intra-individual variation being smaller than inter-individual variation does, however, not require that samples from the same individual are more similar to each other than to any other sample in the tested cohort. Our results showed for both measures of variation similarity that all but one of the 88 multi-time-point samples had the highest similarity to another sample from the same individual, which was not true when comparing species abundance over time (Fig. 3c and Supplementary Information). This implies that species abundance in gut microbiota cannot serve as a fingerprint of an individual whereas variation patterns might.
We also tested if differences in FST and allele sharing decreased over time, which may suggest a divergence of the strains or a horizontal transmission of strains from the environment, but the individual-specific variation patterns remained stable over all the time intervals monitored (Fig. 3). Although this stability should be verified for longer periods as well as when antibiotic treatment or other gut microbiota-challenging events have taken place, our observation suggests that healthy human individuals retain specific strains (see also Supplementary Table 12 and 13) for at least one year.
In contrast to the strong evidence for individuality and temporal stability of SNP patterns, we did not observe a significant geographic separation between European and US samples (Supplementary Fig. 8; Supplementary Table 14). This implies that long-term horizontal transmission of at least some dominant gut microbial strains cause geographic mixing over time. The strongest continental separation, based on FST, was seen in Bacteriodes coprocola (Fig. 4), which was also the only genome with sufficient amounts of data that showed continental separation based on the allele sharing score (Supplementary Fig. 8; Supplementary Tables 14 and 15).
We have established a framework for gut microbial genomic variation analysis using metagenomic data and identified in a single analysis, involving 252 stool samples from 207 human individuals, almost as many SNPs in the human gut microbiome as the 1000 Genomes Project recorded in 179 human individuals over several years2. The stable pN/pS ratios of gut microbial species across individuals suggest that host conditions (such as diet, genetic differences, and immune tolerance) have a minor influence on the evolution of species compared to constraints common to the human population (such as gut physiology, anaerobic conditions, and pH). In the 66 dominant species, the analysis of more than 229,000 genes comprising about 8,000 OGs pinpointed consistently fast or slow evolving genes across individuals (Supplementary Table 10). However, further studies are needed in order to interpret different selection types at the gene level.
The availability of time-point data revealed that individual-specific variation patterns were remarkably stable over time (Fig. 3a,b), which was much less the case for similarities in species abundance – for almost 60% of the samples, a sample from a different individual was the most similar (Fig. 3c). Thus, the metagenomic variation patterns observed here support the hypothesis that a healthy individual retains specific strains for extended periods of time. This suggests that each individual has a metagenomic variation profile that could be unique even in very large cohorts. It should be noted that the maximal sampling period was only one year, and 43 individuals might not be sufficient to trace horizontal transmissions of strains. The likelihood of the latter is supported by the apparent absence of clear continental stratification (despite different sampling and sequencing protocols of the European and the US samples), although for one out of eight species analysed, Bacteroides coprocola, we provide preliminary evidence (Fig. 4; Supplementary Figure 8). Geographical stratification has been described for Helicobacter pylori44,45, and weak, but detectable signals have also been observed in some bacterial pathogens46,47. Thus, we expect more gut microbial stratification patterns to emerge when larger datasets under standardized sampling and sequencing protocols become available, although it remains to be tested which factors (such as geographic separation, diseases, host-genetic and life-style/diet factors) shape the distribution of gut microbial strains and segregating SNPs within the population. The absence of clear geographic stratification implies that stable differences in variation patterns of gut species are not explained by large-scale structures of local microbial populations. They may rather be a result of genetic drift due to population bottlenecks that could occur not only during the colonization of the infant gut but also by processes causing community shifts during adult life stages, followed by a rapid population growth accompanied by purifying selection. This model suggests that the source of genetic variation in human gut microbial populations is less likely to be new mutations within the host than the variation in the initial colonizing populations or transmissions from the environment. This would imply that most allelic variants analyzed in this study segregate at time scales greatly exceeding human generation time.
The introduction of large-scale variation analysis in metagenomic data of complex communities and the discovery of individual metagenomic variation profiles open up several applications. It is now possible to screen in silico for many pathogenic or antibiotics resistance variants in the population. Once a sample has been analyzed, the data can also be used in the future given the temporal stability of SNP profiles. As it took years to identify marker genes and variations for diseases or phenotypes in the human genome, the variation landscape uncovered here can only be seen as the beginning to find molecular biomarkers including particular variants that reveal useful information for human health and well-being.
A reference genome set representing 929 species was derived from a total of 1,511 published prokaryotic genomes, based on a median sequence identity of 95% in 40 universal, single copy marker genes18,19. Metagenomic reads from 252 samples were aligned to these 929 genomes using the same 95% sequence identity cutoff.
For each genome, we calculated the sample-specific base pair coverage and the number of bases of the genome covered by at least one read. For a genome to be considered we required a cumulative depth of coverage of ≥10x across all samples. In order to remove species that are not present in our cohort, yet attract reads due to highly conserved regions, we required at least 40% breadth of the genome coverage (the criterion for the species to be considered present) from at least one sample.
We performed SNP calling on the pooled samples and only considered bases with a quality score ≥15. We required SNPs to be supported by ≥4 reads and to occur with a frequency of ≥1%. Structural variations were detected using Pindel48. False positive rates in SNP and structural variation detection were estimated using nonsense and frameshift mutations in 40 essential single copy marker genes.
We estimated nucleotide diversity (π) and fixation indices (FST) based on allele frequencies.
SNPs occurring in coding regions were classified as synonymous or non-synonymous. Genes from the non-redundant genomes were annotated using eggNOG orthologous groups allowing calculation of pN/pS ratios at the level of genomes, OGs and genes.
Similarity in strain populations between two samples was estimated using (i) a similarity score based on shared SNPs and (ii) FST.
The authors are grateful to Jan Korbel and the members of the Bork group at EMBL for discussions and assistance, especially Sean Powell for performing some of the computations. We thank the EMBL IT core facility and Y. Yuan for managing the high-performance computing resources. We would like to thank Jeffrey I. Gordon for providing three of the samples used. We are also grateful to the European METAHIT consortium and the NIH Common Fund Human Microbiome Project Consortium for generating and making available the data sets used in this study. The research leading to these results has received funding from EMBL, the European Community's Seventh Framework Programme via the MetaHIT (HEALTH-F4-2007-201052) and IHMS (HEALTH-F4-2010-261376) grants as well as from the National Institutes of Health grants U54HG003079 and U54HG004968.
P.B. and G.M.W. conceived the study. P.B., M.A., G.M.W., and Sha.S. designed the analyses. Si.S., Shi.S., M.A., M.M., J.T., A.Z., A.W., D.M., J.M., and K.K. performed the analyses. M.A., Shi.S., Si. S., and P.B. wrote the manuscript. All authors read and approved the manuscript.
The authors declare no competing financial interests.
Supplementary Information is linked to the online version of the paper at www.nature.com/nature.