Microbes inhabit virtually all sites of the human body, yet we know very little about the role they play in our health. In recent years, there has been increasing interest in studying human-associated microbial communities, particularly since microbial dysbioses have now been implicated in a number of human diseases. Dysbiosis, the disruption of normal microbial community structure, is impossible to define, however, without first establishing what “normal microbial community structure” means within the healthy human microbiome. Recent advances in sequencing technologies have made it feasible to perform large-scale studies of microbial communities, providing the tools necessary to begin to address this question. This led to the implementation of the Human Microbiome Project (HMP) in 2007, an initiative funded by the National Institutes of Health Roadmap for Biomedical Research and constructed as a large, genome-scale community research project. Any such project must plan for data analysis, computational methods development, and the public availability of tools and data; here, we provide an overview of the corresponding bioinformatics organization, history, and results from the HMP (Figure 1).
One of the HMP's major goals was the generation of a baseline catalog of the microorganisms found in and on normal human hosts, which includes defining their normal patterns of phylogeny, taxonomy, biogeography, ecology, metabolism, and function. The HMP's study design included extensive sampling of the human microbiome from 300 subjects at five clinically relevant body areas (airways, skin, oral cavity, gastrointestinal tract, and vagina). Several specific body sites were sampled within each of these major areas, often at multiple time points, resulting in a total of 11,700 samples. Advances in sequencing technologies over the course of the HMP allowed subsets of these samples to be explored both using marker gene sequencing and through metagenomic shotgun sequencing of whole-community DNA. While these assays allowed the project's focus to scale from individual organisms to microbial communities as a whole, they presented daunting bioinformatic challenges. To date, the HMP has released over 100 million 16S rRNA gene reads and more than 8 Tbp of shotgun metagenomic sequences.
Before tackling the analysis of such a massive, heterogeneous sequencing data collection, early study design in the HMP planned for two critical and potentially conflicting bioinformatic considerations: subject privacy and rapid, public data release. Protection of human subjects for such a large cohort was handled by the EMMES Corporation, leveraging the resource of dbGaP and emerging sequencing metadata standards to provide quality control, security, and anonymous access to subject information for subsequent analyses. Deposition of nonprotected HMP data, its organization, and subsequently its public release were the mandate of the Data Analysis Coordination Center (DACC; http://hmpdacc.org), which was likewise formed early in the project. These steps were and are familiar aspects of genome sequencing and molecular epidemiology investigations, but once these data were protected and coordinated, the HMP was left with the task of developing appropriate and efficient analysis methodology.
The first bioinformatic challenges arose from the combination of large amounts of data with newly emerging sequencing technologies, particularly for 16S rRNA gene sequencing. HMP data generation began in earnest during the spring of 2010, at which time the largest published microbiome datasets contained approximately 1–2 million 16S rRNA gene reads using the 454 platform. The HMP anticipated at least an order of magnitude more data, and these published datasets were themselves two orders of magnitude larger than previous studies. Identifying microbial membership and abundance using 16S rRNA gene sequencing has a long history, and many analysis tools and platforms were available. However, none were prepared to scale to the amount of data generated by the HMP. Major bioinformatic issues that were immediately apparent included high-throughput solutions for chimera detection in short reads [19], tackling increased sequence error rates, and adapting methods as the 454/Roche chemistry evolved.
Computational analysis of shotgun metagenomic reads raised similar, even more extensive issues. The largest previous human-associated metagenomic data using the Illumina GA platform comprised some 0.5 Tbp, again several orders of magnitude more than commonly found in the literature at that time. Earlier work in both environmental and human-associated communities provided both critical biological insights and some analysis tools, but while the former were vital for the HMP's later interpretation, the latter were not prepared for hundreds of samples comprising multiple terabases of 100 nt paired-end reads from the Illumina GAIIx instrument. Over the course of the project, new analysis tools became available that partly addressed the challenges faced in this project: accelerated high-performance alternatives to BLAST, short-read clustering and mapping approaches, new interfaces to heterogeneous microbial community data, and new de novo assembly software tailored to the Illumina data.
To address these challenges, the HMP reached out to the bioinformatic community as data generation began, with the aim of creating an analysis ecosystem around the anticipated large-scale datasets. The project aimed to bring together the extensive expertise and robust computational infrastructures of the large-scale sequencing centers with the many scientists actively developing new cutting-edge approaches for the analysis of metagenomic data. A Data Analysis Working Group (DAWG) was created, incorporating members of the four sequencing centers, the DACC, and researchers from the computational and microbiological research communities, many of whom volunteered their time out of enthusiasm for the project and its scientific potential. As the first HMP datasets became available in May of 2010, more than a hundred participants were organized into working groups focusing on different aspects of the data analysis process, including sequence quality control, assembly, annotation, metabolic reconstruction, and 16S-based studies. Through a series of conference calls, face-to-face meetings, computational breakthroughs, and hard work, the HMP DAWG developed and validated the series of bioinformatic solutions for human microbiome studies detailed below.