Culture-independent metagenomic sequencing of microbial communities provides a wealth of data regarding their potential biological functions, particularly as studied for the human body by the Human Microbiome Project (HMP). Here, we have described the development of the HUMAnN methodology for high-throughput metagenomic functional reconstruction and its application to 649 communities from 7 body habitats sequenced as part of the HMP. Validation of HUMAnN's accuracy using four additional synthetic communities of increasing complexity demonstrated its ability to quantify both pathway presence and relative abundance, with correlations to true abundances >0.9. In the human microbiome, relatively few modules were specifically present or absent in any one body habitat, but over two-thirds varied in abundance by habitat and 24 were core to all hosts and habitats. Less variation was evident among hosts, although we demonstrated one example in which nutrient transport and mechanisms of central carbon metabolism were strongly associated with vaginal pH. HUMAnN's functional reconstructions include the abundances of large, general pathways, smaller and more specific metabolic modules, and individual orthologous gene families; each data type showed distinct patterns of variation among body sites and provided a different perspective on underlying microbial community function.
Characterization of microbial communities by large-scale shotgun metagenomic sequencing is a relatively recent advance, and computational methods for assessing these data in terms of biological function are under active development. Several previous studies have analyzed individual reads or assembled contigs using direct annotation by BLAST to orthologous gene families
[5],
[11] or to proxy genes
[63]. Other computational pipelines such as MG-RAST
[14] and MEGAN
[15] do include full pathway reconstruction, generally using approaches that rely on the single best BLAST hit for each metagenomic read. Here, HUMAnN scaled easily to provide module and pathway reconstructions for >3.5 Tbp of HMP metagenomic data by avoiding metagenomic assembly and employing an accelerated translated BLAST implementation, requiring a total of approximately 13,000 CPU-hours for sequence search and 175 CPU-hours for metabolic reconstruction (approximately 750 and 55,000 sequences/second, respectively). HUMAnN is not dependent on any particular BLAST implementation, however, and provides default support for NCBI BLAST, MBLASTX as employed here, MAPX (Real Time Genomics, San Francisco, CA), and USEARCH
[27]. In each case, the approximations used to accelerate search against a functionally characterized orthologous sequence database are mitigated by considering all read-to-sequence hits in a weighted manner. This leaves overall gene family abundance recovery essentially unchanged while improving full module/pathway recovery, as ambiguous BLAST hits can be resolved later in the reconstruction process when more information is available (Supplemental
Figure S2). Further, each of HUMAnN's processing modules incorporates one or more types of additional knowledge, e.g. pathway parsimony by means of MinPath
[22] and subsequent automatic taxonomic limitation based on BLAST organismal abundance profiles. These steps are not guaranteed to be optimal in all situations (taxonomic limitation, for example, might degrade performance in environments rich in novel or rapidly evolving organisms), but they are heuristics designed to improve reconstruction in most cases. They generally take advantage of the compositional information that becomes available when multiple gene family sequences are combined into a single module or pathway, decreasing the noise that can arise from examining single best BLAST hits.
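The contrast between single-best-hit and weighted multi-hit assignment can be illustrated with a minimal sketch. It assumes that each read's unit of abundance is split across its hits in proportion to the hit score; this weighting scheme, the function name `family_abundances`, and the family labels are hypothetical choices made for illustration and are not taken from HUMAnN's implementation.

```python
from collections import defaultdict


def family_abundances(hits, weighted=True):
    """Sum per-read contributions into gene family abundances.

    hits: iterable of (read_id, family_id, score) tuples with score > 0.
    weighted=True splits each read across all of its hits in proportion
    to score (an illustrative assumption); weighted=False assigns the
    whole read to its single best-scoring hit.
    """
    by_read = defaultdict(list)
    for read_id, family_id, score in hits:
        by_read[read_id].append((family_id, score))

    abundance = defaultdict(float)
    for read_hits in by_read.values():
        if weighted:
            total = sum(score for _, score in read_hits)
            for family_id, score in read_hits:
                abundance[family_id] += score / total
        else:
            best_family, _ = max(read_hits, key=lambda hit: hit[1])
            abundance[best_family] += 1.0
    return dict(abundance)


# Hypothetical hits: read1 is ambiguous between two families.
hits = [
    ("read1", "famA", 60.0),
    ("read1", "famB", 40.0),
    ("read2", "famB", 50.0),
]
weighted = family_abundances(hits, weighted=True)
best_only = family_abundances(hits, weighted=False)
```

In both modes every read contributes exactly one unit of abundance, so the total is preserved; the weighted mode simply defers the decision about the ambiguous read rather than committing it to one family.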
When analyzing metagenomic data as described here, HUMAnN reconstructs a profile of a microbial community's metabolic potential, not its metabolic activity per se. The abundances of gene families and pathways inferred by the system describe only the enzymes encoded by one or more microbial genomes, and their relationship to realized transcriptional or protein activity may not be straightforward in the absence of additional metatranscriptomic, metaproteomic, or metametabolomic data
[64]. However, these metagenomic gene family and module abundances are appropriate as inputs into more sophisticated metabolic network and systems biology models, which have recently begun to incorporate features such as predicted compartmentalization, small molecule transport, and multi-organism interactions in microbial communities
[65],
[66]. HUMAnN as described here was designed to infer a permissive superset of community function that does not yet include realized transcriptional activity or organismal compartmentalization, and we hope to incorporate these features during future work. It should be emphasized that HUMAnN as currently implemented is appropriate for analysis of metatranscriptomic data from short sequence reads as well, from which it will reconstruct the abundance of actively transcribed gene families or pathways within a microbiome.
The results produced by HUMAnN and analyzed above for the human microbiome include the application of several community diversity measures to microbial function. Such measures are typically applied to organismal abundances instead, where α-diversity summarizes the complexity and types of organisms within a community and β-diversity summarizes the similarities (or differences) between multiple communities' structures
[34]. Such organismal diversity measures have been very successful in describing properties of the human microbiome in large populations, such as the greater similarity of children's and parents' microbiomes
[5] or reduced microbial diversity in conditions such as Crohn's disease
[67]. Conversely, ecological functional diversity has been developed primarily in macroecology, specifically as applied to phenotypic traits
[35],
[68]. To our knowledge, however, this work represents the first application of α-diversity measures to molecular function within microbial communities and specifically to the human microbiome. The HMP consortium has contrasted the functional diversities reported above with comparable organismal diversity measures at the genus, species, and strain levels throughout the human microbiome
[1]. Their results suggest that functional diversity is lower than phylogenetic diversity both within and between communities throughout the human microbiome; that is, the microbes within this human population vary more than do the biological processes carried by their metagenomes. So far, however, this conclusion applies only to the disease-free HMP population and only to the subset of characterized orthologous gene families currently analyzable by HUMAnN. Further variability in the functional potential of the microbiota may well remain to be found in its substantial carriage of uncharacterized gene families (estimated to be as high as 80%
[11]) and during disruptions of host health.
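As a minimal sketch of how diversity measures carry over from organismal to functional profiles, the following computes one α-diversity and one β-diversity index on module abundance tables. The choice of the Shannon index and Bray-Curtis dissimilarity is an assumption made for illustration, since this section does not name the specific indices used, and the module names and abundances are hypothetical.

```python
import math


def relative(profile):
    """Convert raw module abundances to relative abundances, dropping zeros."""
    total = sum(profile.values())
    return {name: value / total for name, value in profile.items() if value > 0}


def shannon(profile):
    """Alpha diversity: Shannon index (natural log) of one functional profile."""
    return -sum(p * math.log(p) for p in relative(profile).values())


def bray_curtis(profile_a, profile_b):
    """Beta diversity: Bray-Curtis dissimilarity between two profiles.

    On relative abundances (each summing to 1) this reduces to
    1 minus the sum of the per-module minima.
    """
    pa, pb = relative(profile_a), relative(profile_b)
    shared = sum(min(pa.get(k, 0.0), pb.get(k, 0.0)) for k in set(pa) | set(pb))
    return 1.0 - shared


# Hypothetical module abundances for two communities.
gut = {"module_1": 50.0, "module_2": 50.0}
skin = {"module_1": 100.0}

alpha_gut = shannon(gut)          # ln(2), two equally abundant modules
alpha_skin = shannon(skin)        # zero, a single module
beta = bray_curtis(gut, skin)     # 0.5, half of the abundance is shared
```

The same functions apply unchanged to organismal abundance tables, which is what makes the direct comparison of functional and phylogenetic diversity possible.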
A key consideration during our development of the HUMAnN pipeline was versatility; the software implementation can easily be extended to assess any functional catalog, characterized sequences, or metagenomic sequences (e.g. 454 reads). In other analyses by the HMP, additional protein databases including MetaCyc
[28], CAZy
[21], virulence-related proteins
[69], and antibiotic resistance genes
[70] were all processed using HUMAnN. MetaCyc, for example, includes both characterized sequences and metabolic modules, for which HUMAnN reconstructed coverages and abundances; other databases included no explicit pathway groupings, and gene families were used directly to examine differential abundance. While these smaller databases are less appropriate for quantitative evaluations or broad metabolic reconstruction, they can be used with HUMAnN to provide focused coverage of specific biological areas. As detailed above, in addition to its primary abundance and coverage outputs, HUMAnN by default calculates a number of basic ecological summary statistics as applied to community functional profiles; it also produces detailed gene-level outputs for each community that can be directly imported into the JCVI Metagenomics Reports (METAREP)
[71] software. All components in the pipeline, including taxonomic limitation, are entirely data-driven; the methodology can therefore be used for functional reconstruction on any genomic data from microbial or eukaryotic organisms, although in a single-organism setting there are no clear benefits over standard genome annotation pipelines. However, individual modules (such as gap filling or the inclusion of multiple BLAST hits) can be manually activated or deactivated by the user for particular datasets. Importantly for microbial communities, HUMAnN can also be used on other data types, including metaproteomic or metatranscriptomic sequences; we anticipate that HUMAnN will be useful in reconstructing pathway activities from transcriptomic sequences of different environmental communities, for example.
In closing, we emphasize that HUMAnN's current approach to microbial community functional reconstruction is explicitly independent of the organismal membership of these communities. It was designed to complement taxonomic classifications of community structure, and integration of community function with membership is an area of ongoing work
[15]. Particularly in the human microbiome, full genome sequences are available for many reference strains isolated from multiple body sites, which has already allowed community membership to be analyzed simultaneously in metagenomic and 16S taxonomic marker sequences
[1],
[24]. By combining membership with functional reconstruction, specialized processes in specific habitats or hosts, for example, can be correlated with the organisms providing or dependent on these aspects of community function. While the general applicability of Beijerinck's 1913 hypothesis
[72] that, “Everything is everywhere, but the environment selects,” is still unclear, we speculate that it may prove to be more broadly accurate for microbial function than for microbial organisms. That is, there may be a moderately stable pool of core microbial pathways, present in all communities but implemented by different organisms and gene families, with relative abundance (and activity) determined by the local selective pressures of each microbial habitat. This appears to be at least somewhat the case in the human microbiome, and further investigation will determine whether this pattern holds for the functional profiles of broader classes of microbial communities.
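The notion of a core pathway pool can be made concrete with a short sketch. It assumes that each sample is summarized by a per-pathway coverage value between 0 and 1 and that a pathway counts as present when its coverage meets a fixed cutoff; the 0.9 cutoff, the function name `core_pathways`, and the sample and pathway names are hypothetical and do not reproduce the criteria used in the analysis above.

```python
def core_pathways(coverage_by_sample, threshold=0.9):
    """Return pathways whose coverage meets the threshold in every sample.

    coverage_by_sample: dict mapping sample name to a dict of
    {pathway: coverage in [0, 1]}; pathways missing from a sample
    are treated as having zero coverage there.
    threshold: presence cutoff (an assumed value for illustration).
    """
    samples = list(coverage_by_sample.values())
    if not samples:
        return set()
    candidates = set().union(*samples)  # every pathway seen in any sample
    return {
        pathway
        for pathway in candidates
        if all(sample.get(pathway, 0.0) >= threshold for sample in samples)
    }


# Hypothetical coverage values for three samples from different habitats.
coverage = {
    "stool_1": {"glycolysis": 1.0, "methanogenesis": 0.95},
    "tongue_1": {"glycolysis": 0.97, "methanogenesis": 0.10},
    "vagina_1": {"glycolysis": 0.92},
}
core = core_pathways(coverage)
```

Under this definition a pathway that is core by coverage can still differ widely in relative abundance among habitats, which is the pattern the hypothesis above would predict.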