HMP Data and Organization
IMG/M-HMP contains 748 metagenome datasets
generated as part of the HMP initiative by sequencing samples collected from various body sites (airways, gastrointestinal, oral, skin, urogenital). This first release of IMG/M-HMP includes only the subset of assembled metagenome sequences on which a total of 80 million
protein coding genes have been predicted 
. Metagenome datasets are integrated with publicly available bacterial, archaeal, eukaryotic, and viral genomes, including reference genomes sequenced as part of the HMP initiative.
HMP genomes and metagenomes in IMG/M-HMP are grouped both by body site category and by taxonomy, as shown in the left upper and lower panes of . Metagenome datasets are also grouped according to the primary body site and human subjects sampled, as shown in and 1(iii), respectively. The names and classification of metagenome datasets in IMG/M-HMP are curated in GOLD 
following a five-tiered classification system 
. This classification scheme underlies the organization of metagenome datasets in IMG/M in general and IMG/M-HMP in particular, as illustrated in . Similar to the phylogenetic classification of isolate genomes, the classification of metagenomes is a critical element for conducting metagenome comparative analysis in a rapidly growing universe of metagenome datasets. Thus, metagenome datasets are organized in three main ecosystem classes
, host associated
, and engineered
classes, then further divided in subclasses characterized by ecosystem categories
(e.g., arthropoda, human, mammals, plants for host associated metagenomes), ecosystem type
(e.g. digestive system, reproductive system, respiratory system, skin), ecosystem subtype
(e.g., oral, intestine), and specific ecosystem
(e.g., hard palate, palatine tonsils, saliva).
Grouping and Classification of metagenome datasets in IMG/M HMP.
Metagenome datasets in IMG/M HMP can be selected using several browse and search tools 
, as well as using predefined groupings and classification discussed above and illustrated in . Selected metagenome datasets are displayed as lists with each dataset associated with medical record number
, host gender
metadata, as illustrated in . Datasets of interest from the lists can be included into a “Genome Cart
” for further analysis.
Individual metagenomes can be explored using the “Metagenome Details” page which provides a variety of tools for browsing, searching genes, or downloading metagenome datasets (). This page also provides information (metadata) on the metagenome together with various statistics of interest, such as the number of genes that are associated with KEGG, COG, Pfam, InterPro or enzyme information.
Comparative Analysis Tools
IMG/M-HMP’s front page provides three comparative analysis “workflows” of HMP metagenome datasets for estimating taxonomic composition of individual samples as well as predefined sample aggregates grouped by sampled body sites, and analysing them in the context of reference genomes grouped according to their taxonomy and isolation source (body site category).
The “Body Sites Distribution” () shows the distribution of best BLASTp hits of the genes in the aggregate metagenome samples (grouped by body site) against the genes of the reference genomes grouped according to the body site category from which they were isolated. For example, there are 90,002 genes across all airways samples (from a total of 476,963 genes) that have their best BLASTp hits to some of the 286,127 genes of reference genomes isolated from human airways, as illustrated in .
Distribution of genes of microbiomes grouped taxonomically and by habitat according to their best BLASTp hit to reference genomes.
As expected, in most cases the aggregate genes of the samples grouped by each of the primary body sites, had the highest percentage of their best BLAST hits against the genes of reference genomes isolated from the same body site (). For example 94.7% of the genes from the gastro-intestinal (GI) tract samples, have their best BLAST hits to genes of reference genomes isolated from GI tract. In a similar manner 73% of the genes from the urinary tract (UT) samples, have their best BLAST hits to genes of reference genomes isolated from UT. The only exception to this observation are the genes from the airway samples, most of which (48.9%) have their best BLAST hits to genes from reference genomes isolated from skin, and only 27% to genes from reference genomes isolated from airways. This may be explained by the fact that the majority of the airway samples have been collected from the anterior nares, whereas the majority of reference genomes classified into “Airways” category have been isolated from lower airways. Anterior nares are characterized by the presence of squamous epithelium which is an environment more similar to the skin than to lower airways covered with ciliated mucosa. Overall, the results of this type of data comparisons are heavily dependent on the quality of the metadata available for reference genomes: if the isolation site of the latter has not been properly documented in the original publication or accurately recorded in the database, then the observed results may be largely inaccurate and/or misleading.
The percentage distribution of best blast hits of aggregate samples from each major body site run against reference isolate genomes grouped by each major body site.
The “Phylogenetic Distribution of Genes” is an IMG/M comparative analysis tool that provides an estimate of the phylogenetic composition of a metagenome sample based on the distribution of the best BLAST hits of the protein-coding genes in the sample. The result of “Phylogenetic Distribution of Genes” can be displayed using the “Radial Phylogenetic Tree” viewer as illustrated in , or in a tabular format consisting of a histogram with counts protein-coding genes in the sample that have best BLASTp hits to proteins of isolate genomes in each phylum or class with more than 90% identity, 60–90% identity and 30–60% identity, respectively. A specialized version of this tool, the “Body Sites Phylogenetic Distribution” is available on the front page of IMG/M-HMP, whereby all the genes of the metagenomic samples are grouped by their primary body site and their best blast hits against the reference genomes are organized taxonomically. The results of this comparison are displayed using the “Radial Phylogenetic Tree” tool with all the reference genomes organized in a color-coded hierarchical circular tree according to the taxonomic level of choice as illustrated in . Using this radial tree, the distribution of the best BLAST hits of a group of genomes or metagenomes against the reference set of genomes, can be projected. In this case, , shows the phylogenetic distribution of the genes associated with all the metagenomic samples aggregated by their primary body site, to all isolate genomes, grouped at the taxonomic level of family.
The “Significant Family Plot” summarizes the relationship between the samples using BLASTp-based estimation of their taxonomic composition. The tool illustrated in is available on the front page of IMG/M-HMP. The number of genes from the sample with best BLAST hit to genes in the specific taxonomic family is used as a proxy for the abundance of microbes from this family in the sample. Only families with at least 1% contribution to the total number of genes are considered, and hierarchical clustering is performed using the gene counts described above, with the results represented as a two-dimensional dendrogram. The same results can be obtained by using the “Genome Clustering” tool and the option “Hierarchical Clustering” which is described below.
Several other comparative analysis tools which allow examining the gene content and functional capabilities of microbial communities are available under the “Compare Genomes” main menu tab of IMG/M, as shown in . For instance, the “Abundance Profile Overview” tool provides a quick estimate of the functional capabilities of metagenomes of interest in terms of the relative abundance of protein families (COGs, Pfams) and functional families (Enzymes), with the result displayed either as a heat map or in matrix format, with each column corresponding to a metagenome and each row corresponding to a family. A counterpart “Abundance Profile Search” tool allows finding protein families (COGs, Pfams) in metagenome datasets based on their relative abundance, with the ability of selecting abundance cut-offs and the way the results are displayed, namely using raw or normalized gene counts.
Metagenome comparison tools in IMG/M-HMP.
We discuss below in more detail how IMG/M HMP’s comparative analysis tools can be used in the context of a HMP specific analysis. A potential starting point for such an analysis is identifying the outliers among samples collected from the same type of human body site, such as all the human gut samples, using “Genome Clustering” tools, as illustrated in . In this example, Principal Component Analysis (PCA) of human gut samples based on COGs identifies several outliers, as illustrated in .
Next, the “Function Comparison” tool can be used to determine which protein families distinguish outlier samples from the rest of the human gut samples. The “Function Comparison” tool takes into account the stochastic nature of metagenome datasets and tests whether the differences in abundance of protein families can be ascribed to chance variation or not. This tool allows comparing a metagenome dataset with other metagenome datasets or reference genomes in terms of the relative abundance of protein families (COGs, Pfams, TIGRfams) and functional families (Enzymes) and assigns statistical significance to the differences in protein family abundance. In our specific example one outlier sample is selected as a query sample, and compared to reference samples that consist of both outliers and non-outliers in terms of relative abundance of COGs, as illustrated in . The result of such comparison is represented as a list of functions or protein families, whereby for each function or protein family F, the number of genes or an estimated gene copy number in the target (query) metagenome associated with F is displayed. Similar counts are displayed for each reference genome/metagenome, and the differences in protein family abundance are assessed for their statistical significance reflected in the associated p-value and D-scores. The latter represents a standard score obtained by subtracting the mean frequency of a protein family in the datasets and divided by the standard deviation of frequency of a protein family in the datasets under an assumption of normal distribution, and p-values are corrected for multiple hypothesis testing using False Discovery Rate. The cells displaying the p-value and D-score of families with statistically significant differences are highlighted in yellow. For instance, in the example shown in , protein families (COGs) that are more abundant in the query sample than in non-outlier samples from are highlighted in yellow; note that the same protein families in outlier samples from are not highlighted in yellow, indicating that they don’t have statistically significant differences with the query sample.
The results of “Function Comparison” tool indicate that the outlier human gut samples shown in may have similar functional composition. Analysis of the protein families that consistently show up as more abundant in these outlier samples, such as COG1629 (Outer membrane receptor proteins, mostly Fe transport), COG4206 (Outer membrane cobalamin receptor protein), and COG1538 (Outer membrane protein), suggests that these functional differences may be due to differences in the taxonomic composition of the samples, namely in the different abundance of Gram-positive and Gram-negative bacteria, since all of the protein families distinguishing two groups of samples are found in Gram-negative, but not in Gram-positive bacteria. Thus, it is possible that the outlier samples are dominated by Gram-negative bacteria, whereas non-outlier samples are dominated by Gram-positive bacteria. These two groups of bacteria have different surface structures, which in turn can be linked to the differences in transport mechanisms and in certain metabolic pathways.
This hypothesis can be directly tested using IMG/M-HMP tools, whereby the genes from one or more distinguishing protein families in “Function Comparison” results can be selected and added to “Gene Cart”. The scaffolds on which these genes are encoded can be added to “Scaffold Cart”, and analyzed in terms of BLASTp hits of all proteins encoded on them. When such analysis has been performed on the distinguishing protein families in the example above, the majority of BLASTp hits were found to be to the genomes from Bacteroidetes phylum, which is indeed a phylum of gram-negative bacteria. This supports the idea that functional differences between outlier and non-outlier samples are due to the differences in taxonomic composition. The “Metagenomes Phylogenetic Distribution” tool can be used to confirm this hypothesis, as illustrated in .
Estimation of metagenome composition using Phylogenetic Distribution tool in IMG/M-HMP.
The “Metagenomes Phylogenetic Distribution” tool is based on “Phylogenetic Distribution of Genes” tool described in the previous section and it provides a comparison of multiple metagenome samples based on the distribution of the best BLASTp. The results of “Metagenomes Phylogenetic Distribution” can be displayed in tabular format, as illustrated in for one outlier and several non-outlier samples, which shows counts of protein-coding genes with best BLASTp hits with more than 60% identity to proteins of isolate genomes grouped by phylum. The same results can be displayed as a bar chart (), which clearly shows that the outlier sample (pink bar) is dominated by Bacteroidetes (a gram-negative phylum), while non-outlier samples are dominated by Firmicutes, which mostly includes gram-positive bacteria.