Microbial communities are responsible for a broad spectrum of biological activities carried out in virtually all natural environments including oceans1 and human-associated habitats3–5. Profiling the taxonomic and phylogenetic compositions of such communities is critical for understanding their biology and characterizing complex disorders like inflammatory bowel diseases4,6 that do not appear to be associated with any individual microbes.
Metagenomic shotgun sequencing provides a uniquely rich profile of microbial communities, each dataset yielding billions of short reads sampled from the DNA in the community. A community's taxonomic composition can be estimated from such data by assigning each read to the most plausible microbial lineage, often with a taxonomic resolution not achievable by profiling the universal 16S rRNA marker gene alone. Both alignment- and composition-based approaches have been developed for this task, and the two approaches have also been integrated in hybrid methods (see Methods). However, none has simultaneously achieved both the efficiency and the species-level accuracy required by current high-complexity datasets, owing to computational limitations, untenable accuracy for short (<400 nt) reads, and the need to normalize read counts into clade-specific relative abundances (Supplementary Note 1).
We thus present MetaPhlAn (Metagenomic Phylogenetic Analysis), a tool that accurately profiles microbial communities and requires only minutes to process millions of metagenomic reads. MetaPhlAn estimates the relative abundance of microbial cells by mapping reads against a reduced set of clade-specific marker sequences that are computationally pre-selected from coding sequences that unequivocally identify specific microbial clades at the species or higher taxonomic levels and cover all main functional categories (Supplementary Fig. 1). Starting from the 2,887 genomes available from the Integrated Microbial Genomes (IMG) system (July 2011)8, we identified more than 2 million potential markers from which we selected a subset of 400,141 genes most representative of each taxonomic unit (Online Methods). The resulting catalog spans 1,221 species with 231 (standard deviation 107) markers per species and >115,000 markers at higher taxonomic levels (available at http://huttenhower.sph.harvard.edu/metaphlan).
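The idea of a clade-specific marker can be illustrated with a toy sketch. It assumes a hypothetical table mapping each gene family to the set of genomes that carry it, plus a table of clade membership. A family is accepted as a marker only if it occurs in every genome of the clade and in no genome outside it. This strict criterion and all names below are assumptions made for illustration; the actual selection procedure is described in the Online Methods and is not reproduced here.

```python
def clade_specific_markers(family_to_genomes, clade_to_genomes):
    """Return {clade: sorted gene families} under a strict toy criterion."""
    markers = {clade: [] for clade in clade_to_genomes}
    for family, genomes in family_to_genomes.items():
        for clade, members in clade_to_genomes.items():
            # Present in all clade genomes and absent everywhere else
            # is equivalent to the two genome sets being identical.
            if genomes == members:
                markers[clade].append(family)
    for clade in markers:
        markers[clade].sort()
    return markers


# Hypothetical toy input: genome and family names are invented.
clade_to_genomes = {
    "species_A": {"gA1", "gA2"},
    "species_B": {"gB1"},
    "genus_AB": {"gA1", "gA2", "gB1"},
}
family_to_genomes = {
    "fam1": {"gA1", "gA2"},         # specific to species_A
    "fam2": {"gB1"},                # specific to species_B
    "fam3": {"gA1", "gA2", "gB1"},  # specific to the genus
    "fam4": {"gA1", "gB1", "gC1"},  # also in an outside genome: rejected
    "fam5": {"gA1"},                # not universal in species_A: rejected
}
markers = clade_specific_markers(family_to_genomes, clade_to_genomes)
for clade, families in markers.items():
    print(clade, families)
```

A practical pipeline would likely need to tolerate incomplete or misannotated genomes rather than require exact set equality; the sketch ignores that.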
The MetaPhlAn classifier compares each metagenomic read from a sample to this marker catalog to identify high-confidence matches. This can be done very efficiently, as the catalog contains only ~4% of sequenced microbial genes, and each read of interest has at most one match due to the markers' uniqueness. Since spurious reads are very unlikely to have significant matches with a marker sequence, no pre-processing of metagenomic DNA (for example error detection, assembly, or gene annotation) is required. The classifier normalizes the total number of reads in each clade by the nucleotide length of its markers and provides the relative abundance of each taxonomic unit, taking into account any markers specific to subclades. Microbial reads belonging to clades with no sequenced genomes available are reported as an “unclassified” subclade of the closest ancestor with available sequence data.
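The normalization step can be sketched as follows. This is a minimal illustration, assuming read-to-marker hits have already been counted, and it only handles sibling clades at a single taxonomic rank; the treatment of subclade-specific markers and of "unclassified" subclades is omitted. The function and variable names are hypothetical and are not taken from the MetaPhlAn code.

```python
from collections import defaultdict


def relative_abundances(marker_hits, marker_info):
    """Estimate clade relative abundances from marker read counts.

    marker_hits: {marker_id: number of reads matching the marker}
    marker_info: {marker_id: (clade, marker length in nucleotides)}
    """
    reads = defaultdict(int)
    length = defaultdict(int)
    for marker, (clade, length_nt) in marker_info.items():
        length[clade] += length_nt
        reads[clade] += marker_hits.get(marker, 0)
    # Reads per marker nucleotide, so clades with more or longer
    # markers are not over-represented.
    coverage = {clade: reads[clade] / length[clade] for clade in length}
    total = sum(coverage.values())
    if total == 0:
        return {clade: 0.0 for clade in coverage}
    return {clade: value / total for clade, value in coverage.items()}


# Invented toy values for illustration only.
marker_info = {
    "m1": ("species_A", 1000),
    "m2": ("species_A", 1000),
    "m3": ("species_B", 500),
}
marker_hits = {"m1": 30, "m2": 10, "m3": 20}
abundances = relative_abundances(marker_hits, marker_info)
print(abundances)
```

In this toy case species_A collects twice as many reads as species_B, yet species_B is estimated to be twice as abundant because its reads fall on a quarter of the marker length.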
We first evaluated MetaPhlAn's performance in estimating microbial community composition using synthetic data. We constructed ten datasets comprising 4 million noisy reads from 300 organisms. MetaPhlAn mapped a number of reads consistent with the fraction of lineage-specific genomic regions (7.7% and 8.4%), correctly identified all 200 organisms in the two high-complexity datasets, and accurately estimated their species relative abundances (RMSE of 0.17 and 0.14), with 75% of species within 10% deviation from the expected value (Supplementary Fig. 2). Similar performance was observed for higher taxonomic ranks (Supplementary Figs. 2 and 3) and for the eight non-evenly distributed datasets that better mimic the abundance distributions of real communities (Pearson r > 0.991; Supplementary Fig. 4).
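The accuracy measures quoted above (RMSE, the fraction of species within 10% of the expected value, and Pearson r) can be computed as in the following sketch. It assumes abundances are expressed as percentages and that deviation is measured relative to the expected value; both are assumptions, and the toy numbers are invented rather than taken from the synthetic datasets.

```python
import math


def rmse(expected, estimated):
    """Root mean squared error between paired abundance values."""
    squared = [(e - o) ** 2 for e, o in zip(expected, estimated)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def fraction_within(expected, estimated, tolerance=0.10):
    """Fraction of entries whose relative deviation is within tolerance."""
    ok = sum(1 for e, o in zip(expected, estimated)
             if abs(o - e) <= tolerance * e)
    return ok / len(expected)


# Invented toy abundances (percent) for four species.
expected = [40.0, 30.0, 20.0, 10.0]
estimated = [41.0, 31.0, 16.0, 12.0]

error = rmse(expected, estimated)
r = pearson_r(expected, estimated)
close = fraction_within(expected, estimated)
print(error, r, close)
```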
Comparison of MetaPhlAn to existing methods
Existing methods are optimized for read-based statistics rather than for estimating the relative abundance of microbial cells, which we considered more biologically informative. Microbial clade abundances were thus estimated by normalizing read-based counts by the average genome size of each clade. MetaPhlAn compares favorably to existing methods on all tested synthetic metagenomes (Supplementary Figs. 2–4), with PhymmBL being the closest (but substantially slower) alternative. This also held true for the more challenging scenario in which metagenomes contained microbes without reference genomes (Supplementary Tables 1 and 2).
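The genome-size normalization mentioned in this comparison can be sketched as below. The sketch assumes each clade has a known average genome size and that the input is the fraction of reads assigned to each clade. The function name and the numbers are hypothetical.

```python
def reads_to_cell_abundance(read_fraction, avg_genome_size_bp):
    """Convert per-clade read fractions into cell-based relative abundances.

    A clade with a larger genome contributes more reads per cell, so each
    read fraction is divided by the clade's average genome size and the
    results are rescaled to sum to 1.
    """
    per_genome = {
        clade: read_fraction[clade] / avg_genome_size_bp[clade]
        for clade in read_fraction
    }
    total = sum(per_genome.values())
    return {clade: value / total for clade, value in per_genome.items()}


# Invented example: equal read shares, threefold difference in genome size.
read_fraction = {"clade_X": 0.5, "clade_Y": 0.5}
avg_genome_size_bp = {"clade_X": 2_000_000, "clade_Y": 6_000_000}
cell_abundance = reads_to_cell_abundance(read_fraction, avg_genome_size_bp)
print(cell_abundance)
```

With equal read shares, the clade with the smaller genome is estimated to account for three quarters of the cells.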
Notably, MetaPhlAn achieved a classification rate of about 450 reads per second on standard single-processor systems, thus greatly outperforming all existing methods (note that PhyloPythiaS provides only genus-level predictions). This allowed us to provide the first practical high-throughput assessments of several real-world metagenomes at the species level, as detailed below.
Composition of healthy vaginal microbiota
We first characterized the vaginal microbiota of asymptomatic pre-menopausal adult women enrolled in the Human Microbiome Project (HMP)3, analyzing 51 metagenomes sampled from the posterior fornix. MetaPhlAn detected 98 clades with abundances >0.5% in at least one sample (32 species from 17 genera). Lactobacillus was consistently the most abundant genus, representing >50% of the bacterial community in 49 of the 51 samples and >98% in half of the samples, confirming its well-established role in healthy vaginal microbiomes5,9. 16S rRNA gene sequencing in an independent cohort5 previously identified five distinct community types, each characterized by a specific Lactobacillus species or by the absence of any of them; in MetaPhlAn's results, these five groups are easily identifiable (Supplementary Fig. 5) and cluster naturally by species. Although in 16S pyrosequencing data the characterization of Lactobacillus species-level operational taxonomic units is sensitive to the region being sequenced, we performed a direct comparison with 16S data from the same HMP specimens. Despite extensive technical differences between 16S pyrosequencing and shotgun sequencing, the estimated relative abundances were remarkably similar in all clusters. Moreover, MetaPhlAn's native coverage of all species with sequenced members further details the structure of these clusters. For example, Lactobacillus is not the only genus with species-specific differences among these microbiome types; Bifidobacterium species are present in cluster II as B. breve and B. dentium, whereas in cluster IV we identified an unclassified member distinct from all sequenced species. Similarly, Prevotella is represented by P. multiformis in cluster II in contrast to P. amnii and P. timonensis in cluster IV.
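Summary counts of the kind reported above (clades exceeding 0.5% in at least one sample, and samples in which one genus exceeds 50%) can be derived from per-sample profiles as in this sketch. It assumes profiles are stored as dictionaries of percentages keyed by clade name. The sample names, clade names and values are invented placeholders, not HMP data.

```python
def clades_above_threshold(profiles, threshold_pct=0.5):
    """Clades whose abundance exceeds threshold_pct in at least one sample."""
    detected = set()
    for abundances in profiles.values():
        for clade, pct in abundances.items():
            if pct > threshold_pct:
                detected.add(clade)
    return sorted(detected)


def samples_dominated_by(profiles, clade, min_pct=50.0):
    """Number of samples in which `clade` exceeds min_pct of the community."""
    return sum(1 for abundances in profiles.values()
               if abundances.get(clade, 0.0) > min_pct)


# Invented placeholder profiles (percent abundance per sample).
profiles = {
    "sample_1": {"genus_1": 98.6, "genus_2": 1.1, "genus_3": 0.3},
    "sample_2": {"genus_1": 61.0, "genus_2": 38.8, "genus_3": 0.2},
    "sample_3": {"genus_1": 12.0, "genus_2": 87.6, "genus_4": 0.4},
}
detected = clades_above_threshold(profiles)
dominated = samples_dominated_by(profiles, "genus_1")
print(detected, dominated)
```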
MetaPhlAn's methodology is restricted neither to bacteria alone nor to human-associated microbiomes, and it allowed us to investigate the microbial flora collected from oxygen minimum zones at intermediate depths in the Eastern Tropical South Pacific10. This marine ecosystem proved to include a substantial fraction of archaea, but Proteobacteria (mainly Alphaproteobacteria) was the most abundant phylum, representing approximately half of the community. Depth-associated shifts were observable for Bacteroidia, Chlamydiae, and Gammaproteobacteria, whereas the Cenarchaea dropped off specifically within the deepest sample. MetaPhlAn's relative abundances (Supplementary Fig. 6) were consistent with BLAST-based approaches and confirmed its applicability for communities with limited coverage of reference genomes and its suitability for environmental metagenomic samples. Moreover, as the number of sequenced organisms continues to increase, MetaPhlAn's species-specificity in such environments will automatically improve without computational drawbacks.
MetaPhlAn's read-based estimation of relative abundances enabled straightforward integration of multiple cohorts sequenced with different technologies and depths; specifically, to comprehensively characterize the asymptomatic human gut microbiota, we combined 224 fecal samples (>17 million reads) from the HMP3 and MetaHIT project4, the two largest gut metagenomic collections available. MetaPhlAn detected 102 species present at least once at >0.5% abundance, with strong consistency among different markers for the same clade (Supplementary Fig. 7). The MetaHIT project has previously characterized gut microbiomes as arising from three distinct and stable microbiome types (enterotypes11), and we investigated this hypothesis by hierarchically clustering the 224 samples separately at the genus and species levels. In some cases, enterotype-like discrete prevalence patterns were readily apparent, the genus Prevotella being the most striking example, with Butyrivibrio showing similar behavior but for fewer samples. Conversely, many samples were characterized by high fractions of Bacteroides resembling Enterotype 1 (ref. 11), but this genus's relative abundance overall formed a continuum across samples, as did those of several other genera including Eubacterium.
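Hierarchical clustering of samples by their abundance profiles can be sketched in a few lines. The text does not state which distance measure or linkage rule was used, so the Euclidean distance and average linkage below are assumptions chosen for simplicity, and the four toy profiles are invented.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length abundance vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(profiles, n_clusters):
    """Agglomerative clustering: merge the two closest clusters until
    n_clusters remain. profiles: {sample: [abundance, ...]}."""
    clusters = [[name] for name in sorted(profiles)]

    def dist(c1, c2):
        # Average pairwise distance between members of the two clusters.
        pairs = [(a, b) for a in c1 for b in c2]
        total = sum(euclidean(profiles[a], profiles[b]) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > n_clusters:
        candidates = [(i, j) for i in range(len(clusters))
                      for j in range(i + 1, len(clusters))]
        i, j = min(candidates,
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


# Invented genus-level profiles (percent) for four samples.
profiles = {
    "s1": [80.0, 10.0, 10.0],
    "s2": [75.0, 15.0, 10.0],
    "s3": [10.0, 80.0, 10.0],
    "s4": [5.0, 85.0, 10.0],
}
groups = average_linkage(profiles, n_clusters=2)
print(groups)
```

A continuum of abundances, as described for Bacteroides, would show up here as clusters whose membership changes noticeably with the chosen number of clusters or distance measure.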
Figure 4. The gut microbiota in asymptomatic Western populations as inferred by MetaPhlAn on 224 samples combining the HMP and MetaHIT cohorts.
MetaPhlAn's estimates of species-level abundance allowed us to refine this investigation (Fig. 4c). While Enterotype 2 remained clearly identifiable, the Bacteroides were diversified in a manner quite similar to lactobacilli in the vaginal microbiota, although with more species and less exclusive dominance. This suggests the existence of more complex community patterns than those captured by the proposed genus-level enterotypes (Supplementary Note 2). The integrated dataset also furnished the opportunity to investigate differences between independent and geographically unrelated healthy Western-diet populations (Fig. 4a and Supplementary Note 3). Overall, this analysis showed that MetaPhlAn is effective in processing very large metagenomic datasets with different short read lengths at high taxonomic resolution, enabling meta-analyses difficult to achieve using other technologies.
Shotgun metagenomic data are rapidly decreasing in cost to a per-sample level comparable to that of 16S gene surveys. Community-wide sequence reads already provide unique insights into gene function, metabolism, and polymorphisms that are unavailable from individual marker genes. By enabling efficient, high-resolution taxonomic profiling in such data, MetaPhlAn provides a further advantage with respect to 16S rRNA-based investigations, which can be difficult to extend past a genus level of resolution. Metagenomic sequencing further provides better statistical support (~10^8 reads/sample) than 16S pyrosequencing approaches (typically <10^4 reads/sample), and the sequencing protocols do not require potentially biased amplification steps. Finally, the MetaPhlAn database of clade-specific markers is constructed by a fully automated computational pipeline, which will allow improved accuracy as additional microbial genomes become available and improve support for gene markers' intra-clade universality and inter-clade uniqueness.