Several large-scale metagenomic studies have been completed or are underway to investigate the genetic composition of microbes in their natural environments. Prominent efforts include the Global Ocean Sampling
[1]–
[3], surveys of a variety of environments
[4]–
[6] and, more recently, studies of the human microbiome
[7],
[8]. Increasingly, such work is planned and carried out within larger consortia and funding efforts. Examples include MetaHIT
[7], the Earth Microbiome Project
[9],
TerraGenome (http://www.terragenome.org), and the Human Microbiome Project (HMP)
[10]. The HMP is an effort to characterize the microbial communities associated with multiple habitats across the human body, and it is an excellent example of the complexity, scale, and nature of such projects and consortia. With its focus on the resident bacteria of so-called normal donors, this project provides a critical baseline for future metagenomic studies of the human microbiome, including studies of its association with human health and disease. As a multi-faceted community resource, the HMP includes taxonomic marker studies of 16S rRNA gene sequences
[11] as well as a whole-genome shotgun (WGS) data survey
[10],
[12]–
[15]. The WGS survey examined the taxonomy and functional potential of microbial communities in 741 samples taken from up to fifteen body habitats of 108 healthy adult men and women. In total, it generated approximately 38 billion short read sequences (3.5 Tbp), of which over 14 billion were processed and analyzed as part of this study. This information complements organismal identifications based on 16S rRNA genes and other taxonomic marker sequences; however, annotating and characterizing such large data collections is similarly challenging.
To identify taxonomic and functional signatures, WGS metagenomic data are annotated either directly as short reads
[16],
[17] or, as would be done for the sequenced genome of a single organism, after assembly, taking advantage of the longer contigs
[18]. Annotating these data is computationally intensive: it requires extensive BLAST-like homology searches, and both running the searches and storing their results can be difficult. Fortunately, billions of short sequence reads are most usefully analyzed after being condensed into taxonomic, enzymatic, and/or pathway abundances, which can then be studied more efficiently.
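As an illustration of the condensing step described above, the following minimal sketch reduces per-read annotations to abundance tables. The record layout, the example values, and the function names `condense` and `relative` are assumptions made for illustration; they are not taken from METAREP or from the annotation pipelines.

```python
from collections import Counter

# Hypothetical per-read annotations: (read_id, taxon, ec_number, pathway).
# None marks a read that has no assignment for that attribute.
annotations = [
    ("read1", "Streptococcus", "2.7.1.1", "Glycolysis"),
    ("read2", "Streptococcus", None, None),
    ("read3", "Prevotella", "2.7.1.1", "Glycolysis"),
    ("read4", "Prevotella", "1.1.1.27", "Pyruvate metabolism"),
]


def condense(records, field_index):
    """Count how many reads carry each value of one annotation field."""
    return Counter(
        record[field_index]
        for record in records
        if record[field_index] is not None
    )


def relative(counts):
    """Convert raw counts to relative abundances that sum to 1."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


taxon_counts = condense(annotations, 1)
enzyme_counts = condense(annotations, 2)
pathway_counts = condense(annotations, 3)
taxon_abundance = relative(taxon_counts)
```

Once the reads are reduced to such tables, downstream comparisons operate on a handful of counts per sample rather than on the reads themselves.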
To provide a computational framework for such tasks, we have developed JCVI Metagenomics Reports (METAREP), an open-source tool for high-performance comparative metagenomics
[19]. The software uses a scalable data warehouse that supports efficient storage and dynamic querying of annotation data produced by various annotation methods. The data model of METAREP version 1.3.1, presented in this report, has been extended to allow direct import and analysis of results from two annotation pipelines used in the HMP: (1) JCVI’s Prokaryotic Metagenomics Annotation Pipeline (JPMAP)
[18], used to annotate open reading frames from assemblies, and (2) HUMAnN
[16], used to annotate short reads. In addition, frequencies of functional and taxonomic attributes can be adjusted using custom annotation weights. The scalability of such weighted frequency calculations has been improved through distributed searches.
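The idea of weighted frequencies computed over distributed partitions can be sketched as follows. This is only an illustration under stated assumptions: the entry layout, the weight values, and the functions `weighted_frequencies` and `merge_partials` are hypothetical, and the sketch does not reproduce how METAREP implements its searches.

```python
from collections import defaultdict

# Hypothetical annotation entries: (entry_id, attribute, weight).
# The weight could stand for, e.g., the number of reads behind an entry.
entries = [
    ("orf1", "Glycolysis", 12.0),
    ("orf2", "Glycolysis", 3.0),
    ("orf3", "TCA cycle", 5.0),
]


def weighted_frequencies(rows):
    """Sum the weights per attribute instead of counting entries."""
    totals = defaultdict(float)
    for _entry_id, attribute, weight in rows:
        totals[attribute] += weight
    return dict(totals)


def merge_partials(partials):
    """Combine per-partition totals into one result.

    Each partition can be processed independently, so the partial sums
    could be computed on separate machines and merged afterwards.
    """
    merged = defaultdict(float)
    for partial in partials:
        for attribute, total in partial.items():
            merged[attribute] += total
    return dict(merged)


# Simulate two partitions of the same data set.
shards = [entries[:2], entries[2:]]
distributed = merge_partials([weighted_frequencies(s) for s in shards])
```

Because summation is associative, the merged result equals the result of a single pass over all entries, which is what makes this kind of calculation straightforward to distribute.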
In this study, we present advancements to the METAREP software, focusing on an extended data model, improved scalability, and analytical features that have facilitated the comparison and biological interpretation of HMP metagenomic data across multiple samples, body habitats, and individuals. In particular, we introduce several biological scenarios and hypotheses together with analytical strategies designed to investigate them. These scenarios demonstrate important downstream analytical features of METAREP, including how to filter the data for enzymatic markers, visualize marker composition across organisms and human habitats, conduct hierarchical clustering of individual samples, and carry out non-parametric statistical analyses to detect differentially abundant taxa and pathways in oral habitats. The scenarios provide templates of analytical strategies that future METAREP users can apply to similar data. Their results have also revealed new insights into the taxonomic and functional relationships among body habitats and individuals of the human microbiome. Finally, we describe improvements to the software architecture and report tests that benchmark the response time of the software. Overall, this work introduces an important software tool and strategies for the comparative analysis of large-scale metagenomic data generated from complex experimental designs.
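To make the notion of a non-parametric comparison concrete, the sketch below computes the Mann-Whitney U statistic (equivalent to the Wilcoxon rank-sum test) for the relative abundance of one taxon in two groups of samples. The choice of this particular test is an assumption for illustration, since the text above does not name one, and the abundance values are invented. Only the statistic is computed; in practice a statistics library such as SciPy (`scipy.stats.mannwhitneyu`) would also supply a p-value.

```python
def rank_with_ties(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b) for two independent groups of observations."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = rank_with_ties(list(group_a) + list(group_b))
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return u_a, u_b


# Invented relative abundances of one taxon in two oral habitats.
habitat_1 = [0.10, 0.20, 0.30]
habitat_2 = [0.40, 0.50, 0.60]
u_1, u_2 = mann_whitney_u(habitat_1, habitat_2)
```

A U value of zero for one group means that every sample in that group ranks below every sample in the other, the most extreme separation possible for the given sample sizes.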