1.  Statistical Methods for Detecting Differentially Abundant Features in Clinical Metagenomic Samples 
PLoS Computational Biology  2009;5(4):e1000352.
Numerous studies are currently underway to characterize the microbial communities inhabiting our world. These studies aim to dramatically expand our understanding of the microbial biosphere and, more importantly, hope to reveal the secrets of the complex symbiotic relationship between us and our commensal bacterial microflora. An important prerequisite for such discoveries is the availability of computational tools that can rapidly and accurately compare large datasets generated from complex bacterial communities and identify the features that distinguish them.
We present a statistical method for comparing clinical metagenomic samples from two treatment populations on the basis of count data (e.g. as obtained through sequencing) to detect differentially abundant features. Our method, Metastats, employs the false discovery rate to improve specificity in high-complexity environments, and separately handles sparsely-sampled features using Fisher's exact test. Under a variety of simulations, we show that Metastats performs well compared to previously used methods, and significantly outperforms other methods for features with sparse counts. We demonstrate the utility of our method on several datasets including a 16S rRNA survey of obese and lean human gut microbiomes, COG functional profiles of infant and mature gut microbiomes, and bacterial and viral metabolic subsystem data inferred from random sequencing of 85 metagenomes. The application of our method to the obesity dataset reveals differences between obese and lean subjects not reported in the original study. For the COG and subsystem datasets, we provide the first statistically rigorous assessment of the differences between these populations. The methods described in this paper are the first to address clinical metagenomic datasets comprising samples from multiple subjects. Our methods are robust across datasets of varied complexity and sampling level. While designed for metagenomic applications, our software can also be applied to digital gene expression studies (e.g. SAGE). A web server implementation of our methods and freely available source code can be found at
Author Summary
The emerging field of metagenomics aims to understand the structure and function of microbial communities solely through DNA analysis. Current metagenomics studies comparing communities resemble large-scale clinical trials with multiple subjects from two general populations (e.g. sick and healthy). To improve analyses of this type of experimental data, we developed a statistical methodology for detecting differentially abundant features between microbial communities, that is, features that are enriched or depleted in one population versus another. We show our methods are applicable to various metagenomic data ranging from taxonomic information to functional annotations. We also provide an assessment of taxonomic differences in gut microbiota between lean and obese humans, as well as differences between the functional capacities of mature and infant gut microbiomes, and those of microbial and viral metagenomes. Our methods are the first to statistically address differential abundance in comparative metagenomics studies with multiple subjects, and we hope they will give researchers a more complete picture of how exactly two environments differ.
PMCID: PMC2661018  PMID: 19360128
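As a rough illustration of the two ingredients named in the Metastats abstract above, the following sketch computes a two-sided Fisher's exact test for a sparsely sampled feature and then adjusts a set of p-values to control the false discovery rate. The function names, the pooled 2x2 table layout, and the use of the Benjamini-Hochberg procedure are assumptions made for illustration; the abstract does not say which false-discovery-rate procedure or table layout Metastats uses.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Assumed layout (hypothetical, for illustration):
      a, c = pooled counts of the feature in population 1 and 2
      b, d = pooled counts of all other features in population 1 and 2
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of every table no more likely than the observed one.
    # The small relative tolerance guards against floating-point ties.
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A feature would then be called differentially abundant when its adjusted value falls below a chosen threshold, for example `benjamini_hochberg(pvals)[i] < 0.05`; the threshold is likewise an assumption, not a value taken from the abstract.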
2.  Reconstructing the Genomic Content of Microbiome Taxa through Shotgun Metagenomic Deconvolution 
PLoS Computational Biology  2013;9(10):e1003292.
Metagenomics has transformed our understanding of the microbial world, allowing researchers to bypass the need to isolate and culture individual taxa and to directly characterize both the taxonomic and gene compositions of environmental samples. However, associating the genes found in a metagenomic sample with the specific taxa of origin remains a critical challenge. Existing binning methods, based on nucleotide composition or alignment to reference genomes, allow only a coarse-grained classification and rely heavily on the availability of sequenced genomes from closely related taxa. Here, we introduce a novel computational framework, integrating variation in gene abundances across multiple samples with taxonomic abundance data to deconvolve metagenomic samples into taxa-specific gene profiles and to reconstruct the genomic content of community members. This assembly-free method is not bounded by various factors limiting previously described methods of metagenomic binning or metagenomic assembly and represents a fundamentally different approach to metagenomic-based genome reconstruction. An implementation of this framework is available at We first describe the mathematical foundations of our framework and discuss considerations for implementing its various components. We demonstrate the ability of this framework to accurately deconvolve a set of metagenomic samples and to recover the gene content of individual taxa using synthetic metagenomic samples. We specifically characterize determinants of prediction accuracy and examine the impact of annotation errors on the reconstructed genomes. We finally apply metagenomic deconvolution to samples from the Human Microbiome Project, successfully reconstructing genus-level genomic content of various microbial genera, based solely on variation in gene count. These reconstructed genera are shown to correctly capture genus-specific properties.
With the accumulation of metagenomic data, this deconvolution framework provides an essential tool for characterizing microbial taxa never before seen, laying the foundation for addressing fundamental questions concerning the taxa comprising diverse microbial communities.
Author Summary
Most microorganisms inhabit complex, diverse, and largely uncharacterized communities. Metagenomic technologies allow us to determine the taxonomic and gene compositions of these communities and to obtain insights into their function as a whole but usually do not enable the characterization of individual member taxa. Here, we introduce a novel computational framework for decomposing metagenomic community-level gene content data into taxa-specific gene profiles. Specifically, by analyzing the way taxonomic and gene abundances co-vary across a set of metagenomic samples, we are able to associate genes with their taxa of origin. We first demonstrate the ability of this approach to decompose metagenomes and to reconstruct the genomes of member taxa using simulated datasets. We further identify the factors that contribute to the accuracy of our method. We then apply our framework to samples from the human microbiome – the set of microorganisms that inhabit the human body – and show that it can be used to successfully reconstruct the typical genomes of various microbiome genera. Notably, our framework is based solely on variation in gene composition and does not rely on sequence composition signatures, assembly, or available reference genomes. It is therefore especially suited to studying the many microbial habitats yet to be extensively characterized.
PMCID: PMC3798274  PMID: 24146609
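The co-variation idea described in the deconvolution abstract above can be sketched with a simple linear model: if the abundance of a gene in each sample is assumed to be the sum, over taxa, of taxon abundance times that taxon's copy number of the gene, then the copy numbers can be estimated by least squares across samples. The linear model, the function names, and the unregularized normal-equations solver below are assumptions for illustration; the abstract does not give the framework's actual equations.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting.

    Raises ZeroDivisionError if the system is singular (for example when two
    taxa have proportional abundances across all samples).
    """
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution


def deconvolve_gene(taxa_abundance, gene_abundance):
    """Least-squares estimate of per-taxon copy number for one gene.

    taxa_abundance[s][t]: abundance of taxon t in sample s
    gene_abundance[s]:    abundance of the gene in sample s
    """
    n_taxa = len(taxa_abundance[0])
    # Normal equations: (A^T A) x = A^T b
    ata = [
        [sum(row[i] * row[j] for row in taxa_abundance) for j in range(n_taxa)]
        for i in range(n_taxa)
    ]
    atb = [
        sum(row[i] * g for row, g in zip(taxa_abundance, gene_abundance))
        for i in range(n_taxa)
    ]
    return solve_linear_system(ata, atb)
```

Calling `deconvolve_gene` once per gene yields a taxa-by-gene profile. The sketch needs at least as many samples as taxa, with taxon abundances that vary independently across samples; otherwise the system is singular.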
3.  Alignment-free supervised classification of metagenomes by recursive SVM 
BMC Genomics  2013;14:641.
Comparison and classification of metagenome samples is one of the major tasks in the study of microbial communities of natural environments or niches on human bodies. Bioinformatics methods play important roles in this task, including 16S rRNA gene analysis and some alignment-based or alignment-free methods on metagenomic data. Alignment-free methods have the advantage of not depending on known genome annotations and therefore have high potential in studying complicated microbiomes. However, the existing alignment-free methods are all based on unsupervised learning strategies (e.g., PCA or hierarchical clustering). These types of methods are powerful in revealing major similarities and grouping relations between microbiome samples, but cannot be applied to discriminate predefined classes of interest, which might not be the dominant grouping in the data. Supervised classification is needed in the latter scenario, with the goal of classifying samples into predefined classes and finding the features that can discriminate the classes. The effectiveness of supervised classification with alignment-based features on metagenomic data has been shown in some recent studies. The application of alignment-free supervised classification methods to metagenome data has not yet been well explored.
We developed a method for this task using k-tuple frequencies as features counted directly from metagenome short reads and the R-SVM (Recursive SVM) for feature selection and classification. We tested our method on a simulation dataset, a real dataset composed of several known genomes, and a real metagenome NGS short reads dataset. Experiments on simulated data showed that the method can classify the classes almost perfectly and can recover major sequence signatures that distinguish the two classes. On the real human gut metagenome data, the method can discriminate samples of inflammatory bowel disease (IBD) patients from control samples with high accuracy, which cannot be separated when comparing the samples with unsupervised clustering approaches.
The proposed alignment-free supervised classification method performs well in discriminating metagenomic samples of predefined classes and in selecting characteristic sequence features for the discrimination. This study serves as an example of the feasibility of using metagenome sequence features of microbiomes on human bodies to study specific human health conditions with supervised machine learning methods.
PMCID: PMC3849074  PMID: 24053649
Metagenome; Classification; K-tuple; R-SVM; Alignment-free; Sequence signatures
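The feature-extraction step mentioned in the abstract above, k-tuple frequencies counted directly from short reads, can be sketched as follows. The default k, the skipping of k-tuples that contain ambiguous bases, and the normalization to relative frequencies are assumptions for illustration; the abstract does not specify them, and the R-SVM feature selection and classification step is not shown.

```python
from collections import Counter
from itertools import product


def ktuple_frequencies(reads, k=4):
    """Return a fixed-length vector of relative k-tuple frequencies.

    The vector has 4**k entries, ordered lexicographically over "ACGT",
    so vectors from different samples are directly comparable.
    """
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-tuples containing N or other ambiguity codes.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    total = sum(counts.values())
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[kmer] / total if total else 0.0 for kmer in all_kmers]
```

Each sample then becomes one row of a feature matrix (one column per k-tuple) that a supervised classifier can consume.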
4.  MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm 
Microbiome  2014;2:26.
Recovering individual genomes from metagenomic datasets allows access to uncultivated microbial populations that may have important roles in natural and engineered ecosystems. Understanding the roles of these uncultivated populations has broad application in ecology, evolution, biotechnology and medicine. Accurate binning of assembled metagenomic sequences is an essential step in recovering the genomes and understanding microbial functions.
We have developed a binning algorithm, MaxBin, which automates the binning of assembled metagenomic scaffolds using an expectation-maximization algorithm after the assembly of metagenomic sequencing reads. Binning of simulated metagenomic datasets demonstrated that MaxBin had high levels of accuracy in binning microbial genomes. MaxBin was used to recover genomes from metagenomic data obtained through the Human Microbiome Project, which demonstrated its ability to recover genomes from real metagenomic datasets with variable sequencing coverages. Application of MaxBin to metagenomes obtained from microbial consortia adapted to grow on cellulose allowed genomic analysis of new, uncultivated, cellulolytic bacterial populations, including an abundant myxobacterial population distantly related to Sorangium cellulosum that possessed a much smaller genome (5 MB versus 13 to 14 MB) but had a more extensive set of genes for biomass deconstruction. For the cellulolytic consortia, the MaxBin results were compared to binning using emergent self-organizing maps (ESOMs) and differential coverage binning, demonstrating that it performed comparably to these methods but had distinct advantages in automation, resolution of related genomes and sensitivity.
The automatic binning software that we developed successfully classifies assembled sequences in metagenomic datasets into recovered individual genomes. The isolation of dozens of species in cellulolytic microbial consortia, including a novel species of myxobacteria that has the smallest genome among all sequenced aerobic myxobacteria, was easily achieved using the binning software. This work demonstrates that the processes required for recovering genomes from assembled metagenomic datasets can be readily automated, an important advance in understanding the metabolic potential of microbes in natural environments. MaxBin is available at
PMCID: PMC4129434  PMID: 25136443
Binning; Metagenomics; Expectation-maximization algorithm
5.  Comparative (Meta)genomic Analysis and Ecological Profiling of Human Gut-Specific Bacteriophage φB124-14 
PLoS ONE  2012;7(4):e35053.
Bacteriophage associated with the human gut microbiome are likely to have an important impact on community structure and function, and provide a wealth of biotechnological opportunities. Despite this, knowledge of the ecology and composition of bacteriophage in the gut bacterial community remains poor, with few well characterized gut-associated phage genomes currently available. Here we describe the identification and in-depth (meta)genomic, proteomic, and ecological analysis of a human gut-specific bacteriophage (designated φB124-14). In doing so we illuminate a fraction of the biological dark matter extant in this ecosystem and its surrounding eco-genomic landscape, identifying a novel and uncharted bacteriophage gene-space in this community. φB124-14 infects only a subset of closely related gut-associated Bacteroides fragilis strains, and the circular genome encodes functions previously found to be rare in viral genomes and human gut viral metagenome sequences, including those which potentially confer advantages upon phage and/or host bacteria. Comparative genomic analyses revealed φB124-14 is most closely related to φB40-8, the only other publicly available Bacteroides sp. phage genome, whilst comparative metagenomic analysis of both phage failed to identify any homologous sequences in 136 non-human gut metagenomic datasets searched, supporting the human gut-specific nature of this phage. Moreover, a potential geographic variation in the carriage of these and related phage was revealed by analysis of their distribution and prevalence within 151 human gut microbiomes and viromes from Europe, America and Japan. Finally, ecological profiling of φB124-14 and φB40-8, using both gene-centric alignment-driven phylogenetic analyses and alignment-free gene-independent approaches, was undertaken.
This not only verified the human gut-specific nature of both phage, but also indicated that these phage populate a distinct and unexplored ecological landscape within the human gut microbiome.
PMCID: PMC3338817  PMID: 22558115
6.  Metagenomic analysis of the medicinal leech gut microbiota 
There are trillions of microbes found throughout the human body and they exceed the number of eukaryotic cells by 10-fold. Metagenomic studies have revealed that the majority of these microbes are found within the gut, playing an important role in the host's digestion and nutrition. The complexity of the animal digestive tract, unculturable microbes, and the lack of genetic tools for most culturable microbes make it challenging to explore the nature of the microbial interactions within this niche. The medicinal leech, Hirudo verbana, has been shown to be a useful tool in overcoming these challenges, due to the simplicity of its microbiome and the availability of genetic tools for one of the two dominant gut symbionts, Aeromonas veronii. In this study, we utilize 16S rRNA gene pyrosequencing to further explore the microbial composition of the leech digestive tract, confirming the dominance of two taxa, the Rikenella-like bacterium and A. veronii. The deep sequencing approach revealed additional community members, suggesting a moderately complex microbial community with a richness of 36 taxa. The presence of a Proteus strain as a newly identified resident in the leech crop was confirmed using fluorescence in situ hybridization (FISH). The metagenome of this community was also pyrosequenced and the contigs were binned into the following taxonomic groups: Rikenella-like (3.1 MB), Aeromonas (4.5 MB), Proteus (2.9 MB), Clostridium (1.8 MB), Erysipelothrix (0.96 MB), Desulfovibrio (0.14 MB), and Fusobacterium (0.27 MB). Functions of the leech gut symbionts were explored using the metagenomic data and MG-RAST.
A comparison of the COG and KEGG categories of the leech gut metagenome to that of other animal digestive-tract microbiomes revealed that the leech digestive tract had a similar metabolic potential to the human digestive tract, supporting the usefulness of this system as a model for studying digestive-tract microbiomes. This study lays the foundation for more detailed metatranscriptomic studies and the investigation of symbiont population dynamics.
PMCID: PMC4029005  PMID: 24860552
high-throughput sequencing; beneficial microbes; symbiosis; medicinal leech
7.  Exploring variation-aware contig graphs for (comparative) metagenomics using MaryGold 
Bioinformatics  2013;29(22):2826-2834.
Motivation: Although many tools are available to study variation and its impact in single genomes, there is a lack of algorithms for finding such variation in metagenomes. This hampers the interpretation of metagenomics sequencing datasets, which are increasingly acquired in research on the (human) microbiome, in environmental studies and in the study of processes in the production of foods and beverages. Existing algorithms often depend on the use of reference genomes, which pose a problem when a metagenome of a priori unknown strain composition is studied. In this article, we develop a method to perform reference-free detection and visual exploration of genomic variation, both within a single metagenome and between metagenomes.
Results: We present the MaryGold algorithm and its implementation, which efficiently detects bubble structures in contig graphs using graph decomposition. These bubbles represent variable genomic regions in closely related strains in metagenomic samples. The variation found is presented in a condensed Circos-based visualization, which allows for easy exploration and interpretation of the found variation.
We validated the algorithm on two simulated datasets containing three and seven Escherichia coli genomes, respectively, and showed that finding allelic variation in these genomes improves assemblies. Additionally, we applied MaryGold to publicly available real metagenomic datasets, enabling us to find within-sample genomic variation in the metagenomes of a kimchi fermentation process, of the microbiome of a premature infant and of microbial communities living on acid mine drainage. Moreover, we used MaryGold for between-sample variation detection and exploration by comparing sequencing data sampled at different time points for both of these datasets.
Availability: MaryGold has been written in C++ and Python and can be downloaded from
PMCID: PMC3916741  PMID: 24058058
8.  MetaPath: identifying differentially abundant metabolic pathways in metagenomic datasets 
BMC Proceedings  2011;5(Suppl 2):S9.
Enabled by rapid advances in sequencing technology, metagenomic studies aim to characterize entire communities of microbes bypassing the need for culturing individual bacterial members. One major goal of metagenomic studies is to identify specific functional adaptations of microbial communities to their habitats. The functional profile and the abundances for a sample can be estimated by mapping metagenomic sequences to the global metabolic network consisting of thousands of molecular reactions. Here we describe a powerful analytical method (MetaPath) that can identify differentially abundant pathways in metagenomic datasets, relying on a combination of metagenomic sequence data and prior metabolic pathway knowledge.
First, we introduce a scoring function for an arbitrary subnetwork and find the max-weight subnetwork in the global network by a greedy search algorithm. Then we compute two p values (p_abund and p_struct) using nonparametric approaches to answer two different statistical questions: (1) Is this subnetwork differentially abundant? (2) What is the probability of finding such good subnetworks by chance given the data and network structure? Finally, significant metabolic subnetworks are discovered based on these two p values.
In order to validate our methods, we have designed a simulated metabolic pathways dataset and show that MetaPath outperforms other commonly used approaches. We also demonstrate the power of our methods in analyzing two publicly available metagenomic datasets, and show that the subnetworks identified by MetaPath provide valuable insights into the biological activities of the microbiome.
We have introduced a statistical method for finding significant metabolic subnetworks from metagenomic datasets. Compared with previous methods, results from MetaPath are more robust against noise in the data, and have significantly higher sensitivity and specificity (when tested on simulated datasets). When applied to two publicly available metagenomic datasets, the output of MetaPath is consistent with previous observations and also provides several new insights into the metabolic activity of the gut microbiome. The software is freely available at
PMCID: PMC3090767  PMID: 21554767
9.  Selecting age-related functional characteristics in the human gut microbiome 
Microbiome  2013;1:2.
Human gut microbial functions are often associated with various diseases and host physiologies. Aging, a less explored factor, is also suspected to affect or be affected by microbiome alterations. By combining functional feature selection with supervised classification, we aim to facilitate identification of age-related functional characteristics in metagenomes from several human gut microbiome studies (MetaHIT, MicroAge, MicroObes, Kurokawa et al.’s and Gill et al.’s datasets).
We apply two feature selection methods, term frequency-inverse document frequency (TF-iDF) and minimum-redundancy maximum-relevancy (mRMR), to identify functional signatures that differentiate metagenomes by age. After features are reduced, we use a support vector machine (SVM) to predict host age of new metagenomes. Functional features are from protein families (Pfams), Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, KEGG ontologies and the Gene Ontology (GO) database. Initial investigations demonstrate that ordination of the functional principal components shows great overlap between different age groups. However, when feature selection is applied, mRMR tightens the ordination cluster for each age group, and TF-iDF offers better linear separation. Both TF-iDF and mRMR were used in conjunction with an SVM classifier and achieved areas under receiver operating characteristic curves (AUCs) 10 to 15% above chance when classifying individuals above/below mid-ages (about 38 to 43 years old) using Pfams. Better performance around mid-ages is also observed when using other functional categories and an age-balanced dataset. We also identified some age-related Pfams that improved age discrimination at age 65 with another feature selection method called LEfSe, on an age-balanced dataset. The selected functional characteristics identify a broad range of age-relevant metabolisms, such as reduced vitamin B12 synthesis, reduced activity of reductases, increased DNA damage, occurrences of stress responses and immune system compromise, and upregulated glycosyltransferases in the aging population.
Feature selection can yield biologically meaningful results when used in conjunction with classification, and makes age classification of new human gut metagenomes feasible. While we demonstrate the promise of this approach, the data-dependent prediction performance could be further improved. We hypothesize that while the Qin et al. dataset is the most comprehensive to date, even deeper sampling is needed to better characterize and predict the microbiomes’ functional content.
PMCID: PMC3869192  PMID: 24467949
Metagenomics; KEGG; Pfam; SVM; Supervised classification
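To make the TF-iDF step named in the abstract above concrete, the sketch below treats each metagenome as a "document" and each functional feature (for example a Pfam identifier) as a "term". It uses the textbook weighting, relative frequency within the sample times the log of the inverse fraction of samples containing the feature, and ranks features by their highest score in any sample. Both the exact weighting and the ranking rule are assumptions for illustration; the abstract does not state which TF-iDF variant or selection criterion the authors used.

```python
from math import log


def tfidf_scores(samples):
    """Compute TF-iDF per sample.

    samples: list of dicts mapping feature name -> count in that metagenome.
    Returns a list of dicts mapping feature name -> TF-iDF score.
    """
    n_samples = len(samples)
    # Document frequency: in how many samples does each feature occur?
    doc_freq = {}
    for sample in samples:
        for feature, count in sample.items():
            if count > 0:
                doc_freq[feature] = doc_freq.get(feature, 0) + 1
    scores = []
    for sample in samples:
        total = sum(sample.values())
        scores.append({
            feature: (count / total) * log(n_samples / doc_freq[feature])
            for feature, count in sample.items()
            if count > 0
        })
    return scores


def top_features(samples, n_keep):
    """Keep the n_keep features with the highest TF-iDF score in any sample.

    Ties are broken alphabetically so the result is deterministic.
    """
    best = {}
    for sample_scores in tfidf_scores(samples):
        for feature, score in sample_scores.items():
            best[feature] = max(best.get(feature, 0.0), score)
    return sorted(best, key=lambda f: (-best[f], f))[:n_keep]
```

Under this weighting a feature present in every sample scores zero, so the selection favors features that are abundant in some metagenomes and absent from others. The reduced feature set would then be passed to a classifier such as an SVM.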
10.  A metagenomic study of diet-dependent interaction between gut microbiota and host in infants reveals differences in immune response 
Genome Biology  2012;13(4):r32.
Gut microbiota and the host exist in a mutualistic relationship, with the functional composition of the microbiota strongly affecting the health and well-being of the host. Thus, it is important to develop a synthetic approach to study the host transcriptome and the microbiome simultaneously. Early microbial colonization in infants is critically important for directing neonatal intestinal and immune development, and is especially attractive for studying the development of human-commensal interactions. Here we report the results from a simultaneous study of the gut microbiome and host epithelial transcriptome of three-month-old exclusively breast- and formula-fed infants.
Variation in both host mRNA expression and the microbiome phylogenetic and functional profiles was observed between breast- and formula-fed infants. To examine the interdependent relationship between host epithelial cell gene expression and bacterial metagenomic-based profiles, the host transcriptome and functionally profiled microbiome data were subjected to novel multivariate statistical analyses. Gut microbiota metagenome virulence characteristics concurrently varied with immunity-related gene expression in epithelial cells between the formula-fed and the breast-fed infants.
Our data provide insight into the integrated responses of the host transcriptome and microbiome to dietary substrates in the early neonatal period. We demonstrate that differences in diet can affect, via gut colonization, host expression of genes associated with the innate immune system. Furthermore, the methodology presented in this study can be adapted to assess other host-commensal and host-pathogen interactions using genomic and transcriptomic data, providing a synthetic genomics-based picture of host-commensal relationships.
PMCID: PMC3446306  PMID: 22546241
11.  Metabolic Reconstruction for Metagenomic Data and Its Application to the Human Microbiome 
PLoS Computational Biology  2012;8(6):e1002358.
Microbial communities carry out the majority of the biochemical activity on the planet, and they play integral roles in processes including metabolism and immune homeostasis in the human microbiome. Shotgun sequencing of such communities' metagenomes provides information complementary to organismal abundances from taxonomic markers, but the resulting data typically comprise short reads from hundreds of different organisms and are at best challenging to assemble comparably to single-organism genomes. Here, we describe an alternative approach to infer the functional and metabolic potential of a microbial community metagenome. We determined the gene families and pathways present or absent within a community, as well as their relative abundances, directly from short sequence reads. We validated this methodology using a collection of synthetic metagenomes, recovering the presence and abundance both of large pathways and of small functional modules with high accuracy. We subsequently applied this method, HUMAnN, to the microbial communities of 649 metagenomes drawn from seven primary body sites on 102 individuals as part of the Human Microbiome Project (HMP). This provided a means to compare functional diversity and organismal ecology in the human microbiome, and we determined a core of 24 ubiquitously present modules. Core pathways were often implemented by different enzyme families within different body sites, and 168 functional modules and 196 metabolic pathways varied in metagenomic abundance specifically to one or more niches within the microbiome. These included glycosaminoglycan degradation in the gut, as well as phosphate and amino acid transport linked to host phenotype (vaginal pH) in the posterior fornix. 
An implementation of our methodology is available at This provides a means to accurately and efficiently characterize microbial metabolic pathways and functional modules directly from high-throughput sequencing reads, enabling the determination of community roles in the HMP cohort and in future metagenomic studies.
Author Summary
The human body is inhabited by trillions of bacteria and other microbes, which have recently been studied in many different habitats (including gut, mouth, skin, and urogenital) by the Human Microbiome Project (HMP). These microbial communities were assayed using high-throughput DNA sequencing, but it can be challenging to determine their biological functions based solely on the resulting short sequences. To reconstruct the metabolic activities of such communities, we have developed HUMAnN, a method to accurately infer community function directly from short DNA reads. The method's accuracy was validated using a collection of synthetic microbial communities. Applying HUMAnN to data from the HMP, we showed that, unlike individual microbial species, many metabolic processes were present among all body habitats. However, the frequencies of these processes varied dramatically, and some were highly enriched within individual habitats to provide niche specialization (e.g. in the gut, which is abundant in food matter but low in oxygen). Other community functions were linked specifically to properties of the human host, such as biochemical processes only present in vaginal habitats with particularly high or low pH. Studying additional environmental or disease-associated communities using HUMAnN will further improve our understanding of how the microbial organisms in a community are linked to the biological processes they carry out.
PMCID: PMC3374609  PMID: 22719234
12.  Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible 
PLoS Computational Biology  2014;10(4):e1003531.
Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model for instance) are already available in RNA-Seq focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory and use simulations and empirical data to demonstrate substantial improvements provided by a relevant mixture model framework over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. We further compare different Negative Binomial methods with a recently-described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but it nevertheless tends toward a higher false positive rate. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq.
Author Summary
The term microbiome refers to the ecosystem of microbes that live in a defined environment. The decreasing cost and increasing speed of DNA sequencing technology has recently provided scientists with affordable and timely access to the genes and genomes of microbiomes that inhabit our planet and even our own bodies. In these investigations many microbiome samples are sequenced at the same time on the same DNA sequencing machine, but often result in total numbers of sequences per sample that are vastly different. The common procedure for addressing this difference in sequencing effort across samples – different library sizes – is to either (1) base analyses on the proportional abundance of each species in a library, or (2) rarefy, throw away sequences from the larger libraries so that all have the same, smallest size. We show that both of these normalization methods can work when comparing obviously-different whole microbiomes, but that neither method works well when comparing the relative proportions of each bacterial species across microbiome samples. We show that alternative methods based on a statistical mixture model perform much better and can be easily adapted from a separate biological sub-discipline, called RNA-Seq analysis.
PMCID: PMC3974642  PMID: 24699258
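The two library-size normalizations contrasted in the entry above can be illustrated with a minimal sketch. The count table, the function names and the fixed random seed are hypothetical and chosen only for illustration; this is not code from the paper or from phyloseq, and it does not implement the mixture-model alternative the authors recommend.

```python
import random


def proportions(counts):
    """Convert one sample's raw counts to relative abundances."""
    total = sum(counts)
    return [c / total for c in counts]


def rarefy(counts, depth, rng):
    """Subsample `depth` reads without replacement from one sample."""
    # Expand the counts into one taxon label per read, then draw a subset.
    reads = [taxon for taxon, c in enumerate(counts) for _ in range(c)]
    kept = rng.sample(reads, depth)
    out = [0] * len(counts)
    for taxon in kept:
        out[taxon] += 1
    return out


# Hypothetical toy data: rows are samples, columns are taxa.
samples = [
    [120, 30, 0, 50],   # library size 200
    [900, 60, 15, 25],  # library size 1000
]

rng = random.Random(0)  # fixed seed so the subsampling is repeatable
min_depth = min(sum(s) for s in samples)

relative = [proportions(s) for s in samples]
rarefied = [rarefy(s, min_depth, rng) for s in samples]
```

In this toy table, rarefying to the smallest library keeps all 200 reads of the first sample but discards 800 of the 1000 reads in the second, which is the loss of data the abstract argues against.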
13.  Novel Bacterial Taxa in the Human Microbiome 
PLoS ONE  2012;7(6):e35294.
The human gut harbors thousands of bacterial taxa. A profusion of metagenomic sequence data has been generated from human stool samples in the last few years, raising the question of whether more taxa remain to be identified. We assessed metagenomic data generated by the Human Microbiome Project Consortium to determine if novel taxa remain to be discovered in stool samples from healthy individuals. To do this, we established a rigorous bioinformatics pipeline that uses sequence data from multiple platforms (Illumina GAIIX and Roche 454 FLX Titanium) and approaches (whole-genome shotgun and 16S rDNA amplicons) to validate novel taxa. We applied this approach to stool samples from 11 healthy subjects collected as part of the Human Microbiome Project. We discovered several low-abundance, novel bacterial taxa, which span three major phyla in the bacterial tree of life. We determined that these taxa are present in a larger set of Human Microbiome Project subjects and are found in two sampling sites (Houston and St. Louis). We show that the number of false-positive novel sequences (primarily chimeric sequences) would have been two orders of magnitude higher than the true number of novel taxa without validation using multiple datasets, highlighting the importance of establishing rigorous standards for the identification of novel taxa in metagenomic data. The majority of novel sequences are related to the recently discovered genus Barnesiella, further encouraging efforts to characterize the members of this genus and to study their roles in the microbial communities of the gut. A better understanding of the effects of less-abundant bacteria is important as we seek to understand the complex gut microbiome in healthy individuals and link changes in the microbiome to disease.
PMCID: PMC3374617  PMID: 22719826
14.  Estimating the extent of horizontal gene transfer in metagenomic sequences 
BMC Genomics  2008;9:136.
Although the extent of horizontal gene transfer (HGT) in complete genomes has been widely studied, its influence in the evolution of natural communities of prokaryotes remains unknown. The availability of metagenomic sequences allows us to address the study of global patterns of prokaryotic evolution in samples from natural communities. However, the methods that have been commonly used for the study of HGT are not suitable for metagenomic samples. Therefore it is important to develop new methods or to adapt existing ones to be used with metagenomic sequences.
We have created two different methods that are suitable for the study of HGT in metagenomic samples. The methods are based on phylogenetic and DNA compositional approaches, and have allowed us to assess the extent of possible HGT events in metagenomes for the first time. The methods are shown to be compatible and quite precise, although they probably underestimate the number of possible events. Our results show that the phylogenetic method detects HGT in between 0.8% and 1.5% of the sequences, while DNA compositional methods identify putative HGT in between 2% and 8% of the sequences. These ranges are very similar to those found in complete genomes by related approaches. The two methods differ in sensitivity, probably because they target HGT events of different ages: the compositional method mostly identifies recent transfers, while the phylogenetic method is more suitable for the detection of older events. Nevertheless, the study of the number of HGT events in metagenomic sequences from different communities shows a consistent trend for both methods: the lowest amount is found for the sequences of the Sargasso Sea metagenome, while the highest is found in the whale fall metagenome from the bottom of the ocean. The significance of these observations is discussed.
The computational approaches that are used to find possible HGT events in complete genomes can be adapted to work with metagenomic samples, and they perform well across different metagenomic samples. The percentage of possible HGT events that were observed is close to that found for complete genomes, and different microbiomes show diverse ratios of putative HGT events. This is probably related to both environmental factors and the species composition of each particular community.
PMCID: PMC2324111  PMID: 18366724
15.  Comparative metagenomic analysis of plasmid encoded functions in the human gut microbiome 
BMC Genomics  2010;11:46.
Little is known regarding the pool of mobile genetic elements associated with the human gut microbiome. In this study we employed the culture independent TRACA system to isolate novel plasmids from the human gut microbiota, and a comparative metagenomic analysis to investigate the distribution and relative abundance of functions encoded by these plasmids in the human gut microbiome.
Novel plasmids were acquired from the human gut microbiome, and homologous nucleotide sequences with high identity (>90%) to two plasmids (pTRACA10 and pTRACA22) were identified in the multiple human gut microbiomes analysed here. However, no homologous nucleotide sequences to these plasmids were identified in the murine gut or environmental metagenomes. Functions encoded by the plasmids pTRACA10 and pTRACA22 were found to be more prevalent in the human gut microbiome when compared to microbial communities from other environments. Among the most prevalent functions identified was a putative RelBE toxin-antitoxin (TA) addiction module, and subsequent analysis revealed that this was most closely related to putative TA modules from gut associated bacteria belonging to the Firmicutes. A broad phylogenetic distribution of RelE toxin genes was observed in gut associated bacterial species (Firmicutes, Bacteroidetes, Actinobacteria and Proteobacteria), but no RelE homologues were identified in gut associated archaeal species. We also provide indirect evidence for the horizontal transfer of these genes between bacterial species belonging to disparate phylogenetic divisions, namely Gram negative Proteobacteria and Gram positive species from the Firmicutes division.
The application of a culture independent system to capture novel plasmids from the human gut mobile metagenome, coupled with subsequent comparative metagenomic analysis, highlighted the unexpected prevalence of plasmid encoded functions in the gut microbial ecosystem. In particular the increased relative abundance and broad phylogenetic distribution was identified for a putative RelBE toxin/antitoxin addiction module, a putative phosphohydrolase/phosphoesterase, and an ORF of unknown function. Our analysis also indicates that some plasmids or plasmid families are present in the gut microbiomes of geographically isolated human hosts with a broad global distribution (America, Japan and Europe), and are potentially unique to the human gut microbiome. Further investigation of the plasmid population associated with the human gut is likely to provide important insights into the development, functioning and evolution of the human gut microbiota.
PMCID: PMC2822762  PMID: 20085629
16.  Glycan Degradation (GlyDeR) Analysis Predicts Mammalian Gut Microbiota Abundance and Host Diet-Specific Adaptations 
mBio  2014;5(4):e01526-14.
Glycans form the primary nutritional source for microbes in the human gut, and understanding their metabolism is a critical yet understudied aspect of microbiome research. Here, we present a novel computational pipeline for modeling glycan degradation (GlyDeR) which predicts the glycan degradation potency of 10,000 reference glycans based on either genomic or metagenomic data. We first validated GlyDeR by comparing degradation profiles for genomes in the Human Microbiome Project against KEGG reaction annotations. Next, we applied GlyDeR to the analysis of human and mammalian gut microbial communities, which revealed that the glycan degradation potential of a community is strongly linked to host diet and can be used to predict diet with higher accuracy than sequence data alone. Finally, we show that a microbe’s glycan degradation potential is significantly correlated (R = 0.46) with its abundance, with even higher correlations for potential pathogens such as the class Clostridia (R = 0.76). GlyDeR therefore represents an important tool for advancing our understanding of bacterial metabolism in the gut and for the future development of more effective prebiotics for microbial community manipulation.
The increased availability of high-throughput sequencing data has positioned the gut microbiota as a major new focal point for biomedical research. However, despite the expenditure of huge efforts and resources, sequencing-based analysis of the microbiome has uncovered mostly associative relationships between human health and diet, rather than a causal, mechanistic one. In order to utilize the full potential of systems biology approaches, one must first characterize the metabolic requirements of gut bacteria, specifically, the degradation of glycans, which are their primary nutritional source. We developed a computational framework called GlyDeR for integrating expert knowledge along with high-throughput data to uncover important new relationships within glycan metabolism. GlyDeR analyzes particular bacterial (meta)genomes and predicts the potency by which they degrade a variety of different glycans. Based on GlyDeR, we found a clear connection between microbial glycan degradation and human diet, and we suggest a method for the rational design of novel prebiotics.
PMCID: PMC4145686  PMID: 25118239
17.  Revealing the Bacterial Butyrate Synthesis Pathways by Analyzing (Meta)genomic Data 
mBio  2014;5(2):e00889-14.
Butyrate-producing bacteria have recently gained attention, since they are important for a healthy colon and when altered contribute to emerging diseases, such as ulcerative colitis and type II diabetes. This guild is polyphyletic and cannot be accurately detected by 16S rRNA gene sequencing. Consequently, approaches targeting the terminal genes of the main butyrate-producing pathway have been developed. However, since additional pathways exist and alternative, newly recognized enzymes catalyzing the terminal reaction have been described, previous investigations are often incomplete. We undertook a broad analysis of butyrate-producing pathways and individual genes by screening 3,184 sequenced bacterial genomes from the Integrated Microbial Genome database. Genomes of 225 bacteria with a potential to produce butyrate were identified, including many previously unknown candidates. The majority of candidates belong to distinct families within the Firmicutes, but members of nine other phyla, especially from Actinobacteria, Bacteroidetes, Fusobacteria, Proteobacteria, Spirochaetes, and Thermotogae, were also identified as potential butyrate producers. The established gene catalogue (3,055 entries) was used to screen for butyrate synthesis pathways in 15 metagenomes derived from stool samples of healthy individuals provided by the HMP (Human Microbiome Project) consortium. A high percentage of total genomes exhibited a butyrate-producing pathway (mean, 19.1%; range, 3.2% to 39.4%), where the acetyl-coenzyme A (CoA) pathway was the most prevalent (mean, 79.7% of all pathways), followed by the lysine pathway (mean, 11.2%). Diversity analysis for the acetyl-CoA pathway showed that the same few firmicute groups associated with several Lachnospiraceae and Ruminococcaceae were dominating in most individuals, whereas the other pathways were associated primarily with Bacteroidetes.
Microbiome research has revealed new, important roles of our gut microbiota for maintaining health, but an understanding of effects of specific microbial functions on the host is in its infancy, partly because in-depth functional microbial analyses are rare and publicly available databases are often incomplete/misannotated. In this study, we focused on production of butyrate, the main energy source for colonocytes, which plays a critical role in health and disease. We have provided a complete database of genes from major known butyrate-producing pathways, using in-depth genomic analysis of publicly available genomes, filling an important gap to accurately assess the butyrate-producing potential of complex microbial communities from “-omics”-derived data. Furthermore, a reference data set containing the abundance and diversity of butyrate synthesis pathways from the healthy gut microbiota was established through a metagenomics-based assessment. This study will help in understanding the role of butyrate producers in health and disease and may assist the development of treatments for functional dysbiosis.
PMCID: PMC3994512  PMID: 24757212
18.  In Silico Analysis of Antibiotic Resistance Genes in the Gut Microflora of Individuals from Diverse Geographies and Age-Groups 
PLoS ONE  2013;8(12):e83823.
The spread of antibiotic resistance, originating from the rampant and unrestricted use of antibiotics in humans and livestock over the past few decades, has emerged as a global health problem. This problem has been further compounded by recent reports implicating gut microbial communities as reservoirs of antibiotic resistance. We have profiled the presence of probable antibiotic resistance genes in the gut flora of 275 individuals from eight different nationalities. For this purpose, available metagenomic data sets corresponding to 275 gut microbiomes were analyzed. Sequence similarity searches of the genomic fragments constituting each of these metagenomes were performed against genes conferring resistance to around 240 antibiotics. Potential antibiotic resistance genes conferring resistance against 53 different antibiotics were detected in the human gut microflora analysed in this study. In addition to several geography/country-specific patterns, four distinct clusters of gut microbiomes, referred to as ‘Resistotypes’, exhibiting similarities in their antibiotic resistance profiles, were identified. Groups of antibiotics having similarities in their resistance patterns within each of these clusters were also detected. Apart from this, mobile multi-drug resistance gene operons were detected in certain gut microbiomes. The study highlighted an alarmingly high abundance of antibiotic resistance genes in two infant gut microbiomes. The results obtained in the present study present a holistic ‘big picture’ of the spectra of antibiotic resistance within our gut microbiota across different geographies. Such insights may help in the implementation of new regulations and in tightening the existing ones.
PMCID: PMC3877126  PMID: 24391833
19.  Comparison of different assembly and annotation tools on analysis of simulated viral metagenomic communities in the gut 
BMC Genomics  2014;15:37.
The main limitations in the analysis of viral metagenomes are perhaps the high genetic variability and the lack of information in extant databases. To address these issues, several bioinformatic tools have been specifically designed or adapted for metagenomics by improving read assembly and creating more sensitive methods for homology detection. This study compares the performance of different available assemblers and taxonomic annotation software using simulated viral-metagenomic data.
We simulated two 454 viral metagenomes using genomes from NCBI's RefSeq database based on the list of actual viruses found in previously published metagenomes. Three different assembly strategies, spanning six assemblers, were tested for performance: overlap-layout-consensus algorithms Newbler, Celera and Minimo; de Bruijn graphs algorithms Velvet and MetaVelvet; and read probabilistic model Genovo. The performance of the assemblies was measured by the length of resulting contigs (using N50), the percentage of reads assembled and the overall accuracy when comparing against corresponding reference genomes. Additionally, the number of chimeras per contig and the lowest common ancestor were estimated in order to assess the effect of assembling on taxonomic and functional annotation. The functional classification of the reads was evaluated by counting the reads that correctly matched the functional data previously reported for the original genomes and calculating the number of over-represented functional categories in chimeric contigs. The sensitivity and specificity of tBLASTx, PhymmBL and the k-mer frequencies were measured by accurate predictions when comparing simulated reads against the NCBI Virus genomes RefSeq database.
Assembling improves functional annotation by increasing accurate assignations and decreasing ambiguous hits between viruses and bacteria. However, the success is limited by the chimeric contigs occurring at all taxonomic levels. The assembler and its parameters should be selected based on the focus of each study. Minimo's non-chimeric contigs and Genovo's long contigs excelled in taxonomy assignation and functional annotation, respectively.
tBLASTx stood out as the best approach for taxonomic annotation for virus identification. PhymmBL proved useful in datasets in which no related sequences are present, as it uses genomic features that may help identify distant taxa. The k-mer frequencies approach underperformed in all viral datasets.
PMCID: PMC3901335  PMID: 24438450
Viral metagenome; Assembler performance; Taxonomic classification; Chimera identification; Functional annotation
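Two of the assembly metrics named in the entry above, N50 and the percentage of reads assembled, reduce to short calculations. The sketch below uses the standard definition of N50 (the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembled length). The function names and the example numbers are hypothetical and are not taken from the study.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half the total is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def percent_reads_assembled(reads_in_contigs, total_reads):
    """Percentage of input reads that ended up in any contig."""
    return 100.0 * reads_in_contigs / total_reads


# Hypothetical contig lengths: total is 54, so half is 27.
lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(lengths))                         # 8, since 10 + 9 + 8 = 27
print(percent_reads_assembled(750, 1000))   # 75.0
```

A high N50 alone does not indicate a good assembly, which is presumably why the study pairs it with accuracy against reference genomes and with chimera counts.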
20.  Gut microbiome-host interactions in health and disease 
Genome Medicine  2011;3(3):14.
The gut microbiome is the term given to describe the vast collection of symbiotic microorganisms in the human gastrointestinal system and their collective interacting genomes. Recent studies have suggested that the gut microbiome performs numerous important biochemical functions for the host, and disorders of the microbiome are associated with many and diverse human disease processes. Systems biology approaches based on next generation 'omics' technologies are now able to describe the gut microbiome at a detailed genetic and functional (transcriptomic, proteomic and metabolic) level, providing new insights into the importance of the gut microbiome in human health, and they are able to map microbiome variability between species, individuals and populations. This has established the importance of the gut microbiome in the disease pathogenesis for numerous systemic disease states, such as obesity and cardiovascular disease, and in intestinal conditions, such as inflammatory bowel disease. Thus, understanding microbiome activity is essential to the development of future personalized strategies of healthcare, as well as potentially providing new targets for drug development. Here, we review recent metagenomic and metabonomic approaches that have enabled advances in understanding gut microbiome activity in relation to human health, and gut microbial modulation for the treatment of disease. We also describe possible avenues of research in this rapidly growing field with respect to future personalized healthcare strategies.
PMCID: PMC3092099  PMID: 21392406
21.  BiomeNet: A Bayesian Model for Inference of Metabolic Divergence among Microbial Communities 
PLoS Computational Biology  2014;10(11):e1003918.
Metagenomics yields enormous numbers of microbial sequences that can be assigned a metabolic function. Using such data to infer community-level metabolic divergence is hindered by the lack of a suitable statistical framework. Here, we describe a novel hierarchical Bayesian model, called BiomeNet (Bayesian inference of metabolic networks), for inferring differential prevalence of metabolic subnetworks among microbial communities. To infer the structure of community-level metabolic interactions, BiomeNet applies a mixed-membership modelling framework to enzyme abundance information. The basic idea is that the mixture components of the model (metabolic reactions, subnetworks, and networks) are shared across all groups (microbiome samples), but the mixture proportions vary from group to group. Through this framework, the model can capture nested structures within the data. BiomeNet is unique in modeling each metagenome sample as a mixture of complex metabolic systems (metabosystems). The metabosystems are composed of mixtures of tightly connected metabolic subnetworks. BiomeNet differs from other unsupervised methods by allowing researchers to discriminate groups of samples through the metabolic patterns it discovers in the data, and by providing a framework for interpreting them. We describe a collapsed Gibbs sampler for inference of the mixture weights under BiomeNet, and we use simulation to validate the inference algorithm. Application of BiomeNet to human gut metagenomes revealed a metabosystem with greater prevalence among inflammatory bowel disease (IBD) patients. Based on the discriminatory subnetworks for this metabosystem, we inferred that the community is likely to be closely associated with the human gut epithelium, resistant to dietary interventions, and interfere with human uptake of an antioxidant connected to IBD. 
Because this metabosystem has a greater capacity to exploit host-associated glycans, we speculate that IBD-associated communities might arise from opportunist growth of bacteria that can circumvent the host's nutrient-based mechanism for bacterial partner selection.
Author Summary
Metagenomic studies of microbial communities yield enormous numbers of gene sequences that have a known enzymatic function, and thus have potential to contribute to community-level metabolic activities. Ecologically divergent microbial communities are presumed to differ in metabolic repertoire and function, but detecting such differences is challenging because the required analytical methodology is complex. Here, we present a novel Bayesian model suitable for this task. Our model, BiomeNet, does not assume that microbiome samples of a certain type are the same; rather, a sample is modeled as a unique mixture of complex metabolic systems referred to as “metabosystems”. The metabosystems are composed of mixtures of subnetworks, where subnetworks are mixtures of reactions related by function. Application of BiomeNet to human gut metagenomes revealed a metabosystem with greater prevalence among IBD patients. We inferred that this metabosystem is likely to be closely associated with the human gut epithelium, resistant to dietary interventions, and interfere with human uptake of an important antioxidant, possibly contributing to gut inflammation associated with IBD.
PMCID: PMC4238953  PMID: 25412107
22.  A Case Study for Large-Scale Human Microbiome Analysis Using JCVI’s Metagenomics Reports (METAREP) 
PLoS ONE  2012;7(6):e29044.
As metagenomic studies continue to increase in their number, sequence volume and complexity, the scalability of biological analysis frameworks has become a rate-limiting factor to meaningful data interpretation. To address this issue, we have developed JCVI Metagenomics Reports (METAREP) as an open source tool to query, browse, and compare extremely large volumes of metagenomic annotations. Here we present improvements to this software including the implementation of a dynamic weighting of taxonomic and functional annotation, support for distributed searches, advanced clustering routines, and integration of additional annotation input formats. The utility of these improvements to data interpretation is demonstrated through the application of multiple comparative analysis strategies to shotgun metagenomic data produced by the National Institutes of Health Roadmap for Biomedical Research Human Microbiome Project (HMP). Specifically, the scalability of the dynamic weighting feature is evaluated and established by its application to the analysis of over 400 million weighted gene annotations derived from 14 billion short reads as predicted by the HMP Unified Metabolic Analysis Network (HUMAnN) pipeline. Further, the capacity of METAREP to facilitate the identification and simultaneous comparison of taxonomic and functional annotations including biological pathway and individual enzyme abundances from hundreds of community samples is demonstrated by providing scenarios that describe how these data can be mined to answer biological questions related to the human microbiome. These strategies provide users with a reference of how to conduct similar large-scale metagenomic analyses using METAREP with their own sequence data, while in this study they reveal insights into the nature and extent of variation in taxonomic and functional profiles across body habitats and individuals. Over one thousand HMP WGS datasets and the latest open source code are available at
PMCID: PMC3374610  PMID: 22719821
23.  Genome signature-based dissection of human gut metagenomes to extract subliminal viral sequences 
Nature Communications  2013;4:2420.
Bacterial viruses (bacteriophages) have a key role in shaping the development and functional outputs of host microbiomes. Although metagenomic approaches have greatly expanded our understanding of the prokaryotic virosphere, additional tools are required for the phage-oriented dissection of metagenomic data sets, and host-range affiliation of recovered sequences. Here we demonstrate the application of a genome signature-based approach to interrogate conventional whole-community metagenomes and access subliminal, phylogenetically targeted, phage sequences present within. We describe a portion of the biological dark matter extant in the human gut virome, and bring to light a population of potentially gut-specific Bacteroidales-like phage, poorly represented in existing virus like particle-derived viral metagenomes. These predominantly temperate phage were shown to encode functions of direct relevance to human health in the form of antibiotic resistance genes, and provided evidence for the existence of putative ‘viral-enterotypes’ among this fraction of the human gut virome.
Bacteriophages have a significant impact on microbial ecosystems, but additional tools are needed to assess viral communities. Ogilvie et al. present a new strategy to extract viral sequences from metagenomic data sets, and present new insights on their function in the gut ecosystem.
PMCID: PMC3778543  PMID: 24036533
24.  Exploration and retrieval of whole-metagenome sequencing samples 
Bioinformatics  2014;30(17):2471-2479.
Motivation: Over the recent years, the field of whole-metagenome shotgun sequencing has witnessed significant growth owing to the high-throughput sequencing technologies that allow sequencing genomic samples cheaper, faster and with better coverage than before. This technical advancement has initiated the trend of sequencing multiple samples in different conditions or environments to explore the similarities and dissimilarities of the microbial communities. Examples include the human microbiome project and various studies of the human intestinal tract. With the availability of ever larger databases of such measurements, finding samples similar to a given query sample is becoming a central operation.
Results: In this article, we develop a content-based exploration and retrieval method for whole-metagenome sequencing samples. We apply a distributed string mining framework to efficiently extract all informative sequence k-mers from a pool of metagenomic samples and use them to measure the dissimilarity between two samples. We evaluate the performance of the proposed approach on two human gut metagenome datasets as well as human microbiome project metagenomic samples. We observe significant enrichment for diseased gut samples in results of queries with another diseased sample and high accuracy in discriminating between different body sites even though the method is unsupervised.
Availability and implementation: A software implementation of the DSM framework is available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4230234  PMID: 24845653
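The retrieval idea in the entry above (profile each sample by its k-mers, then rank stored samples by dissimilarity to a query) can be sketched in a few lines. This is a deliberately naive stand-in: it counts every k-mer in memory rather than using the distributed string mining framework the abstract describes, it does not select "informative" k-mers, and the Bray-Curtis measure is an assumption, since the abstract does not name the dissimilarity used. All sample names and reads are hypothetical.

```python
from collections import Counter


def kmer_profile(reads, k):
    """Relative frequency of every k-mer across a sample's reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no read is at least k bases long")
    return {kmer: c / total for kmer, c in counts.items()}


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two relative-frequency profiles."""
    # For profiles that each sum to 1 this equals 1 - sum of shared mass.
    shared = sum(min(p.get(m, 0.0), q.get(m, 0.0)) for m in set(p) | set(q))
    return 1.0 - shared


def rank_by_dissimilarity(query_reads, database, k):
    """Return (sample name, dissimilarity) pairs, most similar first."""
    query = kmer_profile(query_reads, k)
    scored = [(name, bray_curtis(query, kmer_profile(reads, k)))
              for name, reads in database.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical toy database of three samples.
database = {
    "gut_1": ["ACGTACGTAC", "CGTACGTACG"],
    "gut_2": ["ACGTTCGTAC", "GGTACGTACG"],
    "soil_1": ["TTTTGGGGTT", "GGGGTTTTGG"],
}
query = ["ACGTACGTAC", "CGTACGTACG"]
ranked = rank_by_dissimilarity(query, database, k=4)
```

Because the ranking uses no labels, it matches the unsupervised setting described in the abstract. A realistic implementation would need the k-mer selection and distributed counting that make the published method scale.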
25.  Comparative analysis of CRISPR cassettes from the human gut metagenomic contigs 
BMC Genomics  2014;15:202.
CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) is a prokaryotic adaptive defence system that provides resistance against alien replicons such as viruses and plasmids. Spacers in a CRISPR cassette confer immunity against viruses and plasmids containing regions complementary to the spacers and hence they retain a footprint of interactions between prokaryotes and their viruses in individual strains and ecosystems. The human gut is a rich habitat populated by numerous microorganisms, but a large fraction of these are unculturable and little is known about them in general and their CRISPR systems in particular.
We used human gut metagenomic data from three open projects in order to characterize the composition and dynamics of CRISPR cassettes in the human-associated microbiota. Applying available CRISPR-identification algorithms and a previously designed filtering procedure to the assembled human gut metagenomic contigs, we found 388 CRISPR cassettes, 373 of which had repeats not observed previously in complete genomes or other datasets. Only 171 of 3,545 identified spacers were coupled with protospacers from the human gut metagenomic contigs. The number of matches to GenBank sequences was negligible, providing protospacers for 26 spacers.
Reconstruction of CRISPR cassettes allowed us to track the dynamics of spacer content. In agreement with other published observations we show that spacers shared by different cassettes (and hence likely older ones) tend to the trailer ends, whereas spacers with matches in the metagenomes are distributed unevenly across cassettes, demonstrating a preference to form clusters closer to the active end of a CRISPR cassette, adjacent to the leader, and hence suggesting dynamical interactions between prokaryotes and viruses in the human gut. Remarkably, spacers match protospacers in the metagenome of the same individual with frequency comparable to a random control, but may match protospacers from metagenomes of other individuals.
The analysis of assembled contigs is complementary to the approach based on the analysis of original reads and hence provides additional data about composition and evolution of CRISPR cassettes, revealing the dynamics of CRISPR-phage interactions in metagenomes.
PMCID: PMC4004331  PMID: 24628983
CRISPR; Human gut; Microbiome

